Category Archives: CLI

Zimbra commercial SSL renewal procedure

My quick notes on what I have to do every year to upgrade my Zimbra mail certificate with a new Namecheap SSL certificate:

  1. Request CSR
    • /opt/zimbra/bin/zmcertmgr createcsr comm -new -subject "/C=COUNTRY/ST=STATE/L=LOCATION/O=ORG/OU=OU/CN=CN.EXAMPLE.ORG" -subjectAltNames CN.EXAMPLE.ORG
    • cat /opt/zimbra/ssl/zimbra/commercial/commercial.csr
  2. Upload CSR, verify domain, receive cert bundle
  3. Copy CRT & CA Bundle files to /tmp/cert
  4. Change permissions of files to allow zimbra user to use them:
    sudo chown zimbra /tmp/cert
    sudo chown zimbra /tmp/cert/*
  5. Verify it works against private key
    zmcertmgr verifycrt comm /opt/zimbra/ssl/zimbra/commercial/commercial.key /tmp/cert/ISSUED_CRT.crt /tmp/cert/CA_BUNDLE.ca-bundle
  6. Import new key
    zmcertmgr deploycrt comm /tmp/cert/ISSUED_CRT.crt /tmp/cert/CA_BUNDLE.ca-bundle
  7. Restart zimbra
    zmcontrol restart

Trigger button to run a script in Home Assistant

I configured a button (Runlesswire Click) to log diaper changes for my new baby. The diaper changes are logged in a Google Docs spreadsheet. I set up a simple public facing Google Form that I could run unauthenticated curl requests against. I then configured Home Assistant to run that curl command when the button is pressed. Instant diaper logging by the press of a button.

Lessons learned:

  • Zigbee Home Assistant (ZHA) does not yet support the Zigbee Green protocol, which the RunlessWire Click uses. I had to pair the switches to my Hue hub instead.
    * It looks like they’re getting close to supporting it, though: https://github.com/zigpy/zigpy/pull/1282

Here was my process:

  • Create Google Form
  • Obtain form ID from URL bar
  • Get pre-filled link to get names of fields by clicking the three dots on top right and clicking “Get pre-filled link”. Make note of the names for each entry e.g. entry.1363419348
    Thanks to help from: https://stackoverflow.com/questions/65142364/i-cant-find-name-attribute-while-inspecting-input-elements-of-google-form-ho
  • Curl command is:
    curl https://docs.google.com/forms/<FORM_URL>/formResponse -d ifq -d <ENTRY_NAME>=<ENTITY_VALUE> -d <ADDITIONAL_ENTRY_NAME>=<ADDITIONAL_ENTRY_VALUE> -d submit=Submit
    Thanks to help from: https://eureka.ykyuen.info/2014/07/30/submit-google-forms-by-curl-command/
  • Shell commands go into configuration.yaml
    shell_command:
    log_pee: <CURL_COMMAND>
    log_poo: <CURL_COMMAND>
    Thanks to help from: https://community.home-assistant.io/t/dont-understand-how-to-use-shell-commands/576580/9
  • Restart Home Assistant to pick up your configuration changes.
  • Configure the automation to call Service: shell_command

Success!

Saltstack gitfs ‘Failed to authenticate SSH session: Callback returned error’ fix

I lost several days of productivity with this one. I wanted to connect my Cent 7 salt master’s salt & pillar data to a gitfs backend. I configured /etc/salt/master per the docs but kept getting this error message:

Error occurred fetching gitfs remote 'git@github.com:<owner>/<repo>': Failed to authenticate SSH session: Callback returned error

I eventually discovered this bit of info that pointed me in the right direction that it was likely an issue with the certificate I was using. I followed the steps of generating a new certificate, but this time I received the error message “You’re using an RSA key with SHA-1, which is no longer allowed. Please use a newer client or a different key type.”

The issue stemmed from the fact that github tightened their security for SSH keys. More digging revealed that the pygit2 python module that comes with CentOS 7 is old and does not recognize the new cipher. I eventually found a fix – use pip to install a compatible version of pygit2. The latest version that works on Cent 7 is 1.6.1. Simply installing it wasn’t enough, though – you must also purge the system-installed pygit2 yum package.

Steps to fix

  1. Remove system supplied pygit2 version
    sudo yum remove python3-pygit2
  2. Install version 1.6.1 of pygit2 via pip. Sudo must be used to ensure global paths are updated.
    sudo python3 -m pip install pygit2==1.6.1 -U
  3. Restart the salt master
    sudo systemctl restart salt-master
  4. Review /var/log/salt/master for errors.

Troubleshooting

Monitor /var/log/salt/master for errors. I occasionally ran into errors such as this one:

2024-03-15 13:01:45,957 [salt.utils.gitfs :878 ][WARNING ][31763] gitfs_global_lock is enabled and update lockfile /var/cache/salt/master/gitfs/5b5f257b5dc909390cd0dfab5b6722334c9bc541912da272389f39cf5b80602e/.git/update.lk is present for gitfs remote ‘git@github.com:<owner>/<repo>’. Process 31793 obtained the lock

The solution was to remove the file and restart the salt master.

Configure Zimbra live replication

I’ve recently configured live active replication from my Zimbra e-mail server to a backup server. This is really slick – in the event of primary server failure, I can bring up my secondary in a matter of minutes with no data loss. I used the Zimbra live sync scripts on Gitlab to accomplish this.

These are my notes on things I needed to do in addition to the readme to get things to work properly on my Zimbra 8.8.15 Open Source Edition installs on CentOS 7 boxes.

Install atd (at package):
sudo yum install atd

Make sure the backup server has the same firewall rules as the primary: https://wiki.zimbra.com/wiki/Ports

On the backup server, configure DNS for the mail server to resolve to the Backup server’s IP address. hostname: mail.server.dns -> mirror mail server.

Disable DNS forwarding for primary mail server domain if configured (to ensure mail goes to backup server in the event of switchover.)

Clone over prod mail server, spin up and change network settings:

  • keep hostname (important)
  • change IP, DNS, hosts to use new IP address/network

/etc/sysconfig/network-scripts/ifcfg-eth0
/etc/hosts
/etc/resolv.conf

Ensure proper VLAN settings in backup VM (may be different than primary)

Systemd service:
add Environment=PATH=/opt/zimbra/bin:/opt/zimbra/common/lib/jvm/java/bin:/opt/zimbra/common/bin:/opt/zimbra/common/sbin:/usr/sbin:/sbin:/bin:/usr/sbin:/usr/bin
WorkingDirectory=/opt/zimbra

Remove start argument from ExecStart: ExecStart=/opt/zimbra/live_sync/live_syncd

This is the complete systemd unit for live sync:

[Unit]
Description=Zimbra live sync - to be run on the mirror server
After=network.target

[Service]
ExecStart=/opt/zimbra/live_sync/live_syncd
ExecStop=/opt/zimbra/live_sync/live_syncd kill
User=zimbra
Environment=PATH=/opt/zimbra/bin:/opt/zimbra/common/lib/jvm/java/bin:/opt/zimbra/common/bin:/opt/zimbra/common/sbin:/usr/sbin:/sbin:/bin:/usr/sbin:/usr/bin
WorkingDirectory=/opt/zimbra

[Install]
WantedBy=multi-user.target

Time limit

It looks like there’s a time limit for how long Zimbra keeps redo logs. It means you will get a lost mail situation if you try to bring your primary server back up after it’s been offline for too long (more than a few weeks.) If you’ve been failed over to your secondary mail server for more than two weeks, you’ll want to do the reverse procedure – clone the backup to the primary, edit IP addresses, then run the zimbra live sync. Log into the restored server to ensure mails from greater than 2 weeks ago are all there.

Replace unavail disk in ZFS

I had an issue where I removed a drive in my ZFS array and replaced it with a new drive which the OS gave the same device name (/dev/sdd). I had a hard time getting zfs to replace the drive until I discovered the -g flag for zpool status (thanks to this stackexchange post.)

That did the trick! Simply running zpool status -g showed the GUIDs of each device, which I could then use to properly use zpool replace on:

sudo zpool replace Poolname 12922644002107879117 /dev/sdd

Success!

Fix makemkv not compiling in Arch

I’ve had my Arch Linux desktop system for several years now. Over that time, cruft has built up. It bit me today when I tried to install makemkv. No matter what I tried I could not get it to compile. Configure constantly failed an this step:

checking whether LIBAVCODEC_VERSION_MAJOR is declared... yes
checking LIBAVCODEC_VERSION_MAJOR... 52
...
configure: error: The libavcodec library is too old. Please get a recent one from http://www.ffmpeg.org

I had to systematically delete anything containing ffmpeg, then re-install ffmpeg, in order to finally get it to work.

Get a list of installed packages containing ffmpeg:

yay -Ss ffmpeg | grep Installed

Remove ffmpeg-containing packages:

yay -R chromaprint-fftw grilo-plugins gst-plugins-bad cheese gnome-music gnome-video-effects totem ffmpeg-compat-54 ffmpeg-compat-57 ffmpeg0.10 ffmpeg4.4 vlc libavutil-52 faudio

Install makemkv:

yay -S makemkv

My “nuke all ffmpeg from orbit” approach worked. After I did so, makemkv compiled!

Fix cron output not being sent via e-mail

I had an issue where I had cron jobs that output data to stdout, yet mail of the output was never delivered. Everything showed fine in cron.log :

Aug  3 21:21:01 mail CROND[10426]: (nicholas) CMD (echo “test”)
Aug  3 21:21:01 mail CROND[10424]: (nicholas) CMDOUT (test)

yet no e-mail was sent. I finally found out how to fix this in a roundabout way. I came across this article on cpanel.net on how to silence cron e-mails. I then thought I’d try the reverse of a suggestion and add MAILTO= variable at the top of my cron file. It worked! Example crontab:

MAILTO=”youremail@address.com”
0 * * * * /home/nicholas/queue-check.sh

This came about due to my Zimbra box not sending system e-mails. In addition to the above, I had to configure zimbra as a sendmail alternative per this Zimbra wiki post: https://wiki.zimbra.com/wiki/How_to_%22fix%22_system%27s_sendmail_to_use_that_of_zimbra

Proxmox Ceph storage configuration

These are my notes for migrating my VM storage from NFS mount to Ceph hosted on Proxmox. I ran into a lot of bumps, but after getting proper server-grade SSDs, things have been humming smoothly long enough that it’s time to publish.

A note on SSDs

I had a significant amount of trouble getting ceph to work with consumer-grade SSDs. This is because ceph does a cache writeback call for each transaction – much like NFS. On my ZFS array, I could disable this, but not so for ceph. The result is very slow performance. It wasn’t until I got some Intel DC S3700 drives that ceph became reliable and fast. More details here.

Initial install

I used the Proxmox GUI to install ceph on each node by going to <host> / Ceph. Then I used the GUI to create a monitor, manager, and OSD on each host. Lastly, I used the GUI to create a ceph storage target in Datacenter config.

Small cluster (3 nodes)

My Proxmox cluster is small (3 nodes.) I discovered I didn’t have enough space for 3 replicas (the default ceph configuration), so I had to drop my pool size/min down to 2/1 despite warnings not to do so, since a 3-node cluster is a special case:

https://forum.proxmox.com/threads/ceph-pool-size-is-2-1-really-a-bad-idea.68939/#post-440755

More discussion: https://lists.ceph.io/hyperkitty/list/ceph-users@ceph.io/thread/UB44GH4Z2NJUV52ZTHKO4TGYEX3DZ4CB/

I have not had any problems with this configuration and it provides the space I need.

Ceph pool size

In my early testing, I discovered that if I removed a disk from pool, the size of the pool increased! After doing some reading in redhat documentation, I learned the basics of why this happened.

Size = number of copies of the data in the pool

Minsize = minimum number of copies before pool operation is suspended

I didn’t have enough space for 3 copies of the data. When I removed a disk, the pool it dropped down to the minsize setting (2 copies) – which I did have enough room for. The pool rebalanced to reflect this and it resulted in more space.

Configure Alerting

It turns out that alerting for problems with ceph OSDs and monitors does not come out of the box. You must configure it. Thanks to this thread and the ceph documentation for how to do so. I did this on each proxmox node.

apt install ceph-mgr-dashboard
ceph config set mgr mgr/alerts/smtp_host <MAIL_HOST>'
ceph config set mgr mrg/alerts/smtp_ssl false
ceph config set mgr mgr/alerts/smtp_ssl false
ceph config set mgr mgr/alerts/smtp_port 25
ceph config set mgr mgr/alerts/smtp_destination <DEST_EMAIL>
ceph config set mgr mgr/alerts/smtp_sender <SENDER_EMAIL>
ceph config set mgr mgr/alerts/smtp_from_name 'Proxmox Ceph Cluster'

Test this by telling ceph to send its alerts:

ceph alerts send

Move VM disks to Ceph storage

I ended up writing a simple for loop to move all my existing Proxmox VM disks onto my new ceph cluster. None of my VMs had more than 3 scsi devices. If your VMs have more than that you’ll have to tweak this rudimentary command:

for vm in $(qm list | awk '{print $1}'|grep -v VMID); do qm move-disk $vm scsi0 <CEPH_POOL_NAME>; qm move-disk $vm scsi1 <CEPH_POOL_NAME>; qm move-disk $vm scsi2 <CEPH_POOL_NAME>; done

Rename storage

I tried to edit /etc/pve/storage.cfg to change the name I gave my ceph cluster in Proxmox. That didn’t work (question mark next to the storage after renaming it) so I just removed and re-added instead.

Maintenance

Begin maintenance:

Ceph constantly tries to keep itself in balance. If you take a node down and it stays down for too long, ceph will begin to rebalance the data among the remaining nodes. If you’re doing short term maintenance, you can control this behavior to avoid unnecessary rebalance traffic.

ceph osd set nobackfill
ceph osd set norebalance

Reboot / perform OSD maintenance.

After maintenance is completed:

ceph osd unset nobackfill
ceph osd unset norebalance

Performance benchmark

I did a lot of performance checking when I first started to try and track down why the pool was so slow. In the end it was my consumer-grade SSDs. I’ll keep this section here for future reference.

Redhat article on ceph performance benchmarking

Ceph wiki on benchmarking

rados bench -p SSD 10 write --no-cleanup
rados bench -p SSD 10 seq
rados bench -p SSD 10 seq
rados bench -p SSD 10 rand
rbd create image01 --size 1024 --pool SSD
rbd map image01 --pool SSD --name client.admin
mkfs.ext4 /dev/rbd/SSD/image01  
mkdir /mnt/ceph-block-device
mount /dev/rbd/SSD/image01 /mnt/ceph-block-device/
rbd bench --io-type write image01 --pool=SSD
pveperf /mnt/ceph-block-device/
rados -p SSD cleanup

Undo:

 umount /mnt/ceph-block-device  
 rbd unmap image01 --pool SSD
 rbd rm image01 --pool SSD

MTU 9000 warning

I read that it was recommended to set network MTU to 9000 (jumbo frames. When I did this I experienced weird behavior, connection timeouts – ceph ground to a halt, complaining about slow OSDs, mons. It was too much hassle for me to troubleshoot, so I went back to the standard 1500 MTU.

Datacenter settings

I discovered you can have a host automatically migrate hosts off when you issue the reboot command via the migrate shutdown policy. https://pve.proxmox.com/wiki/High_Availability

Proxmox GUI / Datacenter / Options / HA Settings

Specify SSD or HDD for pools

I have not done this yet but here’s a link I found that explains how to do it: https://stackoverflow.com/questions/58060333/ceph-how-to-place-a-pool-on-specific-osd

Helpful commands

Determine IPs of OSDs:

ceph osd dump - determine IPs of OSDs

Remove monitor from failed node:

ceph mon remove <host>
Also needs to be removed from /etc/ceph/ceph.conf

Configure Backup

I had been using ZFS snapshots and ZFS send to backup my VM disks before the move to ceph. While ceph has snapshot capability, it is slow and takes up extra space in the pool. My solution was to spin up a Proxmox Backup Server and regularly back up to that instead.

Proxmox backup server: can be installed to an existing PVE server if you desire:

https://pbs.proxmox.com/docs/installation.html

Configure the apt repository as follows:

# PBS pbs-no-subscription repository provided by proxmox.com,
# NOT recommended for production use
deb http://download.proxmox.com/debian/pbs bullseye pbs-no-subscription

# security updates
deb http://security.debian.org/debian-security bullseye-security main contrib

# apt-get update
# apt-get install proxmox-backup

I had to add a regular user and give admin permissions on PBS side, then add the host on the proxmox side using those credentials.

Configure automated backup in PVE via Datacenter tab / Backup.

Remember to configure automated verify jobs (scrubs).

Make sure to add an e-mail address for proxmox backup user for alerts.

Edit which account & e-mail is used, and how often notified, at the Datastore level.

Sync jobs

I wanted to synchronize my Proxmox Backup repository to a non-PBS server (simply host the files.) I accomplished this by doing the following:

  • Add 127.0.0.1 as a Remote host (Configuration / Remotes.) Copy the PBS server fingerprint from Certificates / Fingerprint.
  • Create remote datastore in /etc/fstab manually (I used SSHFS to backup to a synology over SSH.)
  • Add datastore in PBS, pointing to manual fstab mount. Then add sync job there

Import PBS datastore (in case of total crash)

I wanted to know how to import the data into a fresh instance of PBS. This is the procedcure:

edit /etc/proxmox-backup/datastore.cfg and add config about the datastore manually. Copy from existing datastore config for syntax.

Space still being taken up after deleting backups

PBS uses access time to determine if something has been touched. It waits 24 hours after the last touch. Garbage collection manually updates atime, but still recommended to keep atime on for the dataset PBS is using. Sources:

https://forum.proxmox.com/threads/zpool-atime-turned-off-effect-on-garbage-collection.76590/

https://pbs.proxmox.com/docs/backup-client.html#garbage-collection

Troubleshooting

Really slow VM IOPS during degrade / rebuild

This also ended up being due to having consumer-grade SSDs in my ceph pools. I’m keeping my notes for what I did to troubleshoot in case they’re useful.

https://forum.proxmox.com/threads/ceph-high-i-o-wait-on-osd-add-remove.20271/

Small cluster. Lower backfill activity so recovery doesn’t cause slowdown:

ceph config set osd osd_max_backfills 1
ceph config set osd osd_recovery_max_active 3

Verify setting was applied: https://www.suse.com/support/kb/doc/?id=000019693

ceph-conf --show-config|egrep "osd_max_backfills|osd_recovery_max_active"
ceph config dump | grep osd

Ramp up backfill performance:

ceph tell osd.* injectargs --osd_max_backfills=2 --osd-recovery_max_active=8 # 2x Increase
ceph tell osd.* injectargs --osd_max_backfills=3 --osd-recovery_max_active=12 # 3x Increase
ceph tell osd.* injectargs --osd_max_backfills=4 --osd_recovery_max_active=16 # 4x Increase
ceph tell osd.* injectargs --osd_max_backfills=1 --osd-recovery_max_active=3 # Back to Defaults

The above didn’t help, turns out consumer SSDs are very bad:

https://yourcmc.ru/wiki/Ceph_performance#General_benchmarking_principles

https://blog.cypressxt.net/hello-ceph-and-samsung-850-evo/

I bought some Intel DC S3700 on ebay for $75 a piece. It fixed all my latency/speed issues.

Dead mon despite being removed from cli

I had a situation where a monitor showed up as dead in proxmox, but I was unable to delete it. I followed this procedure:

rm /etc/systemd/system/ceph-mon.target.wants/ceph-mon@<nodename>.service

Dead pve node procedure

remove from /etc/ceph/ceph.conf, remove /var/lib/ceph/mon/ceph-<node>, remove rm /etc/systemd/system/ceph-mon.target.wants/ceph-mon@pve2.service

https://forum.proxmox.com/threads/ceph-cant-remove-monitor-with-unknown-status.63613/

Adding through GUI brought me back to the same problem.

Bring node back manually

https://docs.ceph.com/en/latest/rados/operations/add-or-rm-mons/

 ceph auth get mon. -o /tmp/key
 ceph mon getmap -o /tmp/map
 ceph-mon -i <node_name> –mkfs –monmap /tmp/map –keyring /tmp/key  
 ceph-mon -i <node_name> –public-addr <node_ip>:6789  
 ceph mon enable-msgr2
 vi /etc/pve/ceph.conf

In the end the most surefire way to fix this problem was to re-image the affected host.

Clear HEALTH_WARNING in GUI

In my testing I had tried pulling disks at random, then putting them back in. This recovered well, but I had this message:

HEALTH_WARN 1 daemons have recently crashed

To clear it I had to drop to the CLI and run this command:

ceph crash archive-all

Thanks to the Proxmox Forums for the fix.

Pool cleanup

I noticed I would get rbd error: rbd: listing images failed: (2) No such file or directory (500) when trying to look at which disks were on my Ceph pool. I fixed this by removing the offending images as per this post.

I then ran another rbd ls -l <POOL_NAME> command to see what was left and noticed several items without anything in the LOCK column. I discovered these were artifacts from failed disk migrations I tried early on – wasted space. I removed them one by one with the following command:

rbd rm <VM_FILE_NAME> -p <POOL_NAME>

Be careful to verify they’re not disks that are in use with VMs with are powered off – they will also show no lock for non-running VMs.

Disk errors

I had a disk fail, but then I pulled out the wrong disk. I kept getting these errors:

Warning: Error fsyncing/closing /dev/mapper/ceph--fc741b6c--499d--482e--9ea4--583652b541cc-osd--block--843cf28a--9be1--4286--a29c--b9c6848d33ba: Input/output error

I was unable to remove it from the GUI. After a while I realized the problem – I was on the wrong node. I needed to be on the node that has the disks when creating an OSD in the Proxmox GUI.

Steps to determine which disk is assigned to an OSD, from ceph docs:

ceph-volume lvm list
====== osd.2 =======

 [block]       /dev/ceph-680265f2-0b3c-4426-b2a8-acf2774d82e0/osd-block-2096f339-0572-4e1d-bf20-52335af9b374

     block device              /dev/ceph-680265f2-0b3c-4426-b2a8-acf2774d82e0/osd-block-2096f339-0572-4e1d-bf20-52335af9b374
     block uuid                tcnwFr-G33o-ybue-n0mP-cDpe-sp9y-d0gvYS
     cephx lockbox secret       
     cluster fsid              65f26da0-fca0-4419-ba15-20269a5a363f
     cluster name              ceph
     crush device class        ssd
     encrypted                 0
     osd fsid                  2096f339-0572-4e1d-bf20-52335af9b374
     osd id                    2
     osdspec affinity           
     type                      block
     vdo                       0
     devices                   /dev/sde

Update 6/20/2024

One year later and Ceph has been running great. So great, in fact, that I migrated my bulk storage to it as well. Here are my notes on that endeavor.

Optimal number of PGs

I discovered that there is an optimal number of PGs you want in a ceph cluster. It depends on how many OSDs you have. Link: https://access.redhat.com/documentation/en-us/red_hat_ceph_storage/3/html/storage_strategies_guide/placement_groups_pgs#pg-count-for-small-clusters

The optimal number of PGs is the following, rounding up to the nearest power of two:

                (OSDs * 100)
   Total PGs =  ------------
                 pool size

In my case (only 3 OSDs – one per node) that made my optimal number of PGs 256.

Slow write speeds for HDDs

Moving OSD DB to SSD – The slow way

I had pretty slow write speeds when adding my 3 HDDs to a new pool (50 MB/s max.) I read the best way to help with this is to offload the db and WAL to an SSD for each OSD. It’s possible to have multiple OSDs using a single SSD for such operations, but since I don’t have enterprise-grade SSDs, I opted to do a 1:1 HDD:SSD mapping. Unfortunately, I had already created the OSDs before I realized I needed to do this. So I had to destroy & re-create each OSD one by one to add the SSD.

https://www.reddit.com/r/ceph/comments/fgvcte/replace_osd_node_without_remapping_pgs

set flag norebalance, norecover, nobackfill, destroy the OSD and join the new OSD as the same ID of the old one.

This worked, but took two days to rebuild. I set out to find a faster option

Moving OSD DB to SSD – The fast way

Migrate DB to SSD without destroying OSD

https://www.reddit.com/r/ceph/comments/1awwoch/yet_another_ceph_poor_performance_post_part_deux

https://github.com/45Drives/scripts/blob/main/add-db-to-osd.sh

Requires jq and bc

I kept getting the error message

WARNING: Device selected (/dev/sdd) has a LVM2_member signature, but no volume group
Wipe disk and run again

despite completely wiping the drive. I dove into the source of the script and found it creates a PV & VG for the drive, and that must be failing, so I did it manually:

pvcreate /dev/sdd

vgcreate ceph-$(uuidgen) /dev/sdd

./add-db-to-osd.sh -b 465G -d /dev/sdd -o 3

This worked beautifully.

Move OSD DB to new device

I discovered that when it comes to DB devices, the same advice about SSDs is still true: Don’t waste your time with consumer SSDs. I ordered some more Intel DC S3700 drives and now needed to replace them. The 45 drives script doesn’t work because the DB had already been migrated to a different SSD. This is the process to move from one dedicated DB device to another:

Thanks to this thread https://www.reddit.com/r/ceph/comments/1bk6e9s/moving_db_and_wal_from_ssd_to_hdd/

and this documentation: https://docs.ceph.com/en/latest/ceph-volume/lvm/list/

https://docs.ceph.com/en/quincy/ceph-volume/lvm/migrate

Plug new drive in alongside existing drive

Obtain OSD fsid with this command: ceph-volume lvm list

pvcreate /dev/<new_device>

vgcreate ceph-$(uuidgen) /dev/<new_device>

lvcreate -l100%FREE -n ceph-osd-db-<OSD FSID> ceph-<UUIDGEN_FROM_ABOVE>

systemctl stop ceph-osd@<OSD_ID>

ceph-volume lvm migrate –osd-id <OSD_ID> –osd-fsid <OSD_FSID> –from db wal –target ceph-<UUIDGEN_FROM_ABOVE>/ceph-osd-db-<OSD FSID>

--> Migrate to new, Source: ['--devs-source', '/var/lib/ceph/osd/ceph-4/block.db'] Target: /dev/ceph-60969103-7d88-4340-a13f-a77f98e1da46/osd-db-800G
Running command: /usr/bin/chown -h ceph:ceph /var/lib/ceph/osd/ceph-4/block.db
Running command: /usr/bin/chown -R ceph:ceph /dev/dm-6
--> Migration successful.

systemctl start ceph-osd@<OSD_ID>

System with no additional HDD slots
Used a USB3 SSD adapter temporarily. Migrated DB,  remove old device, add new device. Reboot node.

Sizing DB device

https://docs.ceph.com/en/latest/rados/configuration/bluestore-config-ref

For RBD workloads, however, block.db usually needs no more than 1% to 2% of the block size.

Move from dedicated DB device back to OSD

https://www.reddit.com/r/ceph/comments/1bwma91/script_to_move_separate_db_lv_back_to_block_device