These are my notes for migrating my VM storage from an NFS mount to Ceph hosted on Proxmox. I ran into a lot of bumps, but after getting proper server-grade SSDs, things have been humming smoothly long enough that it’s time to publish.
A note on SSDs
I had a significant amount of trouble getting ceph to work with consumer-grade SSDs. This is because ceph issues a cache flush for every write transaction, much like NFS. On my ZFS array I could disable this behavior, but not so for ceph. Consumer drives handle these synchronous writes poorly, so the result is very slow performance. It wasn’t until I got some Intel DC S3700 drives that ceph became reliable and fast. More details here.
Initial install
I used the Proxmox GUI to install ceph on each node by going to <host> / Ceph. Then I used the GUI to create a monitor, manager, and OSD on each host. Lastly, I used the GUI to create a ceph storage target in Datacenter config.
Small cluster (3 nodes)
My Proxmox cluster is small (3 nodes.) I discovered I didn’t have enough space for 3 replicas (the default ceph configuration), so I had to drop my pool size/min down to 2/1 despite warnings not to do so, since a 3-node cluster is a special case:
https://forum.proxmox.com/threads/ceph-pool-size-is-2-1-really-a-bad-idea.68939/#post-440755
More discussion: https://lists.ceph.io/hyperkitty/list/ceph-users@ceph.io/thread/UB44GH4Z2NJUV52ZTHKO4TGYEX3DZ4CB/
I have not had any problems with this configuration and it provides the space I need.
Ceph pool size
In my early testing, I discovered that if I removed a disk from the pool, the size of the pool increased! After doing some reading in the Red Hat documentation, I learned the basics of why this happened.
Size = number of copies of the data in the pool
Minsize = minimum number of copies before pool operation is suspended
I didn’t have enough space for 3 copies of the data. When I removed a disk, the pool dropped down to the minsize setting (2 copies), which I did have enough room for. The pool rebalanced to reflect this and it resulted in more space.
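The effect of the size setting on available space can be sketched with simple arithmetic. This is only a rough estimate: it assumes usable capacity is about raw capacity divided by the number of copies, and it ignores ceph's own overhead and its near-full safety margins. The function name and the 3000 GiB figure are hypothetical.

```shell
# Rough usable capacity of a replicated pool: raw capacity / number of copies.
# Assumption: ignores on-disk overhead and the near-full/full ratios.
usable_capacity() {
    raw_gib=$1   # total raw capacity across all OSDs, in GiB
    copies=$2    # the pool's "size" setting
    echo $(( raw_gib / copies ))
}

usable_capacity 3000 3   # prints 1000
usable_capacity 3000 2   # prints 1500
```

With the same disks, dropping from 3 copies to 2 raises the estimate by half, which matches the jump in pool size described above.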
Configure Alerting
It turns out that alerting for problems with ceph OSDs and monitors does not come out of the box. You must configure it. Thanks to this thread and the ceph documentation for how to do so. I did this on each proxmox node.
apt install ceph-mgr-dashboard
ceph config set mgr mgr/alerts/smtp_host <MAIL_HOST>
ceph config set mgr mgr/alerts/smtp_ssl false
ceph config set mgr mgr/alerts/smtp_port 25
ceph config set mgr mgr/alerts/smtp_destination <DEST_EMAIL>
ceph config set mgr mgr/alerts/smtp_sender <SENDER_EMAIL>
ceph config set mgr mgr/alerts/smtp_from_name 'Proxmox Ceph Cluster'
Test this by telling ceph to send its alerts:
ceph alerts send
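The same settings can be wrapped in a small function so they are easy to repeat or preview. This is a sketch, not the exact procedure from these notes: the `ceph mgr module enable alerts` line is an assumption taken from the Ceph alerts module documentation (the mgr/alerts/* keys belong to that module), and the function name and CEPH variable are hypothetical. It assumes a POSIX sh or bash shell.

```shell
# Command used to talk to the cluster; set CEPH='echo ceph' to preview only.
CEPH=${CEPH:-ceph}

# Arguments (all placeholders): SMTP host, destination address, sender address.
setup_ceph_alerts() {
    smtp_host=$1
    destination=$2
    sender=$3
    # Assumption: the alerts module must be enabled before it can send mail.
    $CEPH mgr module enable alerts
    $CEPH config set mgr mgr/alerts/smtp_host "$smtp_host"
    $CEPH config set mgr mgr/alerts/smtp_ssl false
    $CEPH config set mgr mgr/alerts/smtp_port 25
    $CEPH config set mgr mgr/alerts/smtp_destination "$destination"
    $CEPH config set mgr mgr/alerts/smtp_sender "$sender"
}
```

Running it with CEPH='echo ceph' prints the commands instead of executing them, which is a cheap way to check the values first.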
Move VM disks to Ceph storage
I ended up writing a simple for loop to move all my existing Proxmox VM disks onto my new ceph cluster. None of my VMs had more than 3 scsi devices. If your VMs have more than that you’ll have to tweak this rudimentary command:
for vm in $(qm list | awk '{print $1}'|grep -v VMID); do qm move-disk $vm scsi0 <CEPH_POOL_NAME>; qm move-disk $vm scsi1 <CEPH_POOL_NAME>; qm move-disk $vm scsi2 <CEPH_POOL_NAME>; done
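For VMs with a different number of disks, one option is to read the disk names from each VM's config instead of hard-coding scsi0 through scsi2. The sketch below assumes `qm config` prints lines such as `scsi0: <storage>:<volume>,size=32G`, and it skips CD-ROM entries. The function names are hypothetical, and the target argument is the storage name used as <CEPH_POOL_NAME> above.

```shell
# Read `qm config` output on stdin and print the scsiN disk keys,
# skipping CD-ROM drives and unrelated keys such as scsihw.
list_scsi_disks() {
    awk -F': ' '/^scsi[0-9]+: / && !/media=cdrom/ { print $1 }'
}

# Move every scsi disk of every VM to the given storage.
# Defined only; nothing runs until the function is called.
move_all_scsi_disks() {
    target=$1
    for vm in $(qm list | awk 'NR > 1 { print $1 }'); do
        for disk in $(qm config "$vm" | list_scsi_disks); do
            qm move-disk "$vm" "$disk" "$target"
        done
    done
}
```

Piping a single VM's config through `list_scsi_disks` first is a quick way to confirm which disks would be moved.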
Rename storage
I tried to edit /etc/pve/storage.cfg to change the name I gave my ceph cluster in Proxmox. That didn’t work (question mark next to the storage after renaming it) so I just removed and re-added instead.
Maintenance
Begin maintenance:
Ceph constantly tries to keep itself in balance. If you take a node down and it stays down for too long, ceph will begin to rebalance the data among the remaining nodes. If you’re doing short term maintenance, you can control this behavior to avoid unnecessary rebalance traffic.
ceph osd set nobackfill
ceph osd set norebalance
Reboot / perform OSD maintenance.
After maintenance is completed:
ceph osd unset nobackfill
ceph osd unset norebalance
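To avoid forgetting one of the two flags, the set and unset steps can share a single helper. A minimal sketch, assuming a POSIX sh or bash shell; the function name and the CEPH variable are hypothetical conveniences, not part of ceph.

```shell
# Command used to talk to the cluster; set CEPH='echo ceph' to preview only.
CEPH=${CEPH:-ceph}

# Apply "set" or "unset" to both maintenance flags used above.
osd_maintenance_flags() {
    action=$1   # "set" before maintenance, "unset" afterwards
    for flag in nobackfill norebalance; do
        $CEPH osd "$action" "$flag"
    done
}
```

Usage would be `osd_maintenance_flags set` before the reboot and `osd_maintenance_flags unset` once the node is back.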
Performance benchmark
I did a lot of performance checking when I first started to try and track down why the pool was so slow. In the end it was my consumer-grade SSDs. I’ll keep this section here for future reference.
Redhat article on ceph performance benchmarking
Ceph wiki on benchmarking
rados bench -p SSD 10 write --no-cleanup
rados bench -p SSD 10 seq
rados bench -p SSD 10 rand
rbd create image01 --size 1024 --pool SSD
rbd map image01 --pool SSD --name client.admin
mkfs.ext4 /dev/rbd/SSD/image01
mkdir /mnt/ceph-block-device
mount /dev/rbd/SSD/image01 /mnt/ceph-block-device/
rbd bench --io-type write image01 --pool=SSD
pveperf /mnt/ceph-block-device/
rados -p SSD cleanup
Undo:
umount /mnt/ceph-block-device
rbd unmap image01 --pool SSD
rbd rm image01 --pool SSD
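When comparing several rados bench runs (for example before and after swapping SSDs), it can help to pull out just the headline numbers. The sketch below assumes the summary printed by rados bench contains lines labelled "Bandwidth (MB/sec):", "Average IOPS:" and "Average Latency(s):"; if your version labels them differently, the patterns need adjusting. The function name is hypothetical.

```shell
# Read rados bench output on stdin and print a one-line summary.
# Assumption: the summary labels match the ones tested below.
bench_summary() {
    awk '
        $1 == "Bandwidth" && $2 == "(MB/sec):"  { bw = $3 }
        $1 == "Average"   && $2 == "IOPS:"      { iops = $3 }
        $1 == "Average"   && $2 == "Latency(s):" { lat = $3 }
        END { printf "bandwidth_mb_s=%s avg_iops=%s avg_latency_s=%s\n", bw, iops, lat }
    '
}
```

A run could then be summarized with `rados bench -p SSD 10 write --no-cleanup | bench_summary`.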
MTU 9000 warning
I read that it was recommended to set the network MTU to 9000 (jumbo frames). When I did this I experienced weird behavior and connection timeouts: ceph ground to a halt, complaining about slow OSDs and mons. It was too much hassle for me to troubleshoot, so I went back to the standard 1500 MTU.
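I did not get as far as troubleshooting this, but one check that fits the symptoms is whether full-size frames actually pass between nodes without fragmentation. The sketch below assumes IPv4 and the Linux iputils ping flags (`-M do` to forbid fragmentation, `-s` for payload size); the function names are hypothetical.

```shell
# Largest ICMP payload that fits in one unfragmented IPv4 packet:
# MTU minus 20 bytes of IPv4 header and 8 bytes of ICMP header.
icmp_payload_for_mtu() {
    echo $(( $1 - 28 ))
}

# Ping a peer with the don't-fragment bit set, sized for the given MTU.
# Defined only; nothing is sent until the function is called.
check_mtu_path() {
    peer=$1
    mtu=${2:-9000}
    ping -M do -c 3 -s "$(icmp_payload_for_mtu "$mtu")" "$peer"
}
```

If `check_mtu_path <other_node_ip> 9000` fails while the same check at 1500 succeeds, something between the two nodes is probably not passing jumbo frames.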
Datacenter settings
I discovered you can have a host automatically migrate its guests to other nodes when you issue the reboot command, by setting the shutdown policy to migrate. https://pve.proxmox.com/wiki/High_Availability
Proxmox GUI / Datacenter / Options / HA Settings
Specify SSD or HDD for pools
I have not done this yet but here’s a link I found that explains how to do it: https://stackoverflow.com/questions/58060333/ceph-how-to-place-a-pool-on-specific-osd
Helpful commands
Determine IPs of OSDs:
ceph osd dump
Remove monitor from failed node:
ceph mon remove <host>
The monitor also needs to be removed from /etc/ceph/ceph.conf
Configure Backup
I had been using ZFS snapshots and ZFS send to back up my VM disks before the move to ceph. While ceph has snapshot capability, it is slow and takes up extra space in the pool. My solution was to spin up a Proxmox Backup Server and regularly back up to that instead.
Proxmox Backup Server can be installed on an existing PVE server if you desire:
https://pbs.proxmox.com/docs/installation.html
Configure the apt repository as follows:
# PBS pbs-no-subscription repository provided by proxmox.com,
# NOT recommended for production use
deb http://download.proxmox.com/debian/pbs bullseye pbs-no-subscription
# security updates
deb http://security.debian.org/debian-security bullseye-security main contrib
# apt-get update
# apt-get install proxmox-backup
I had to add a regular user and give admin permissions on PBS side, then add the host on the proxmox side using those credentials.
Configure automated backup in PVE via Datacenter tab / Backup.
Remember to configure automated verify jobs (scrubs).
Make sure to add an e-mail address for proxmox backup user for alerts.
Edit which account & e-mail is used, and how often notified, at the Datastore level.
Sync jobs
I wanted to synchronize my Proxmox Backup repository to a non-PBS server (simply host the files.) I accomplished this by doing the following:
- Add 127.0.0.1 as a Remote host (Configuration / Remotes.) Copy the PBS server fingerprint from Certificates / Fingerprint.
- Mount the remote location manually via /etc/fstab (I used SSHFS to back up to a Synology over SSH.)
- Add a datastore in PBS, pointing to the manual fstab mount. Then add the sync job there.
Import PBS datastore (in case of total crash)
I wanted to know how to import the data into a fresh instance of PBS. This is the procedure:
edit /etc/proxmox-backup/datastore.cfg and add config about the datastore manually. Copy from existing datastore config for syntax.
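As a rough illustration of what such an entry looks like, the helper below prints a minimal stanza. It assumes the file uses a `datastore: <name>` header followed by a tab-indented `path` line; compare with an existing datastore.cfg before relying on it. The function name, datastore name and path are hypothetical.

```shell
# Print a minimal datastore stanza for the given name and path.
# Assumption: a header line plus one tab-indented "path" line is enough.
datastore_stanza() {
    name=$1
    path=$2
    printf 'datastore: %s\n\tpath %s\n' "$name" "$path"
}

# Example (hypothetical values), appended to the config by hand:
# datastore_stanza store1 /mnt/datastore/store1 >> /etc/proxmox-backup/datastore.cfg
```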
Space still being taken up after deleting backups
PBS uses access time (atime) to determine whether something has been touched, and it waits 24 hours after the last touch before freeing the space. Garbage collection updates atime explicitly, but it is still recommended to keep atime on for the dataset PBS is using. Sources:
https://forum.proxmox.com/threads/zpool-atime-turned-off-effect-on-garbage-collection.76590/
https://pbs.proxmox.com/docs/backup-client.html#garbage-collection
Troubleshooting
Really slow VM IOPS during degrade / rebuild
This also ended up being due to having consumer-grade SSDs in my ceph pools. I’m keeping my notes for what I did to troubleshoot in case they’re useful.
https://forum.proxmox.com/threads/ceph-high-i-o-wait-on-osd-add-remove.20271/
Small cluster. Lower backfill activity so recovery doesn’t cause slowdown:
ceph config set osd osd_max_backfills 1
ceph config set osd osd_recovery_max_active 3
Verify setting was applied: https://www.suse.com/support/kb/doc/?id=000019693
ceph-conf --show-config|egrep "osd_max_backfills|osd_recovery_max_active"
ceph config dump | grep osd
Ramp up backfill performance:
ceph tell 'osd.*' injectargs '--osd_max_backfills=2 --osd_recovery_max_active=8' # 2x Increase
ceph tell 'osd.*' injectargs '--osd_max_backfills=3 --osd_recovery_max_active=12' # 3x Increase
ceph tell 'osd.*' injectargs '--osd_max_backfills=4 --osd_recovery_max_active=16' # 4x Increase
ceph tell 'osd.*' injectargs '--osd_max_backfills=1 --osd_recovery_max_active=3' # Back to Defaults
The above didn’t help, turns out consumer SSDs are very bad:
https://yourcmc.ru/wiki/Ceph_performance#General_benchmarking_principles
https://blog.cypressxt.net/hello-ceph-and-samsung-850-evo/
I bought some Intel DC S3700 on ebay for $75 a piece. It fixed all my latency/speed issues.
Dead mon despite being removed from cli
I had a situation where a monitor showed up as dead in proxmox, but I was unable to delete it. I followed this procedure:
rm /etc/systemd/system/ceph-mon.target.wants/ceph-mon@<nodename>.service
Dead pve node procedure
Remove the monitor from /etc/ceph/ceph.conf, remove /var/lib/ceph/mon/ceph-<node>, and remove /etc/systemd/system/ceph-mon.target.wants/ceph-mon@pve2.service
https://forum.proxmox.com/threads/ceph-cant-remove-monitor-with-unknown-status.63613/
Adding through GUI brought me back to the same problem.
Bring node back manually
https://docs.ceph.com/en/latest/rados/operations/add-or-rm-mons/
ceph auth get mon. -o /tmp/key
ceph mon getmap -o /tmp/map
ceph-mon -i <node_name> --mkfs --monmap /tmp/map --keyring /tmp/key
ceph-mon -i <node_name> --public-addr <node_ip>:6789
ceph mon enable-msgr2
vi /etc/pve/ceph.conf
In the end the most surefire way to fix this problem was to re-image the affected host.
Clear HEALTH_WARN in GUI
In my testing I had tried pulling disks at random, then putting them back in. This recovered well, but I had this message:
HEALTH_WARN 1 daemons have recently crashed
To clear it I had to drop to the CLI and run this command:
ceph crash archive-all
Thanks to the Proxmox Forums for the fix.
Pool cleanup
I noticed I would get rbd error: rbd: listing images failed: (2) No such file or directory (500) when trying to look at which disks were on my Ceph pool. I fixed this by removing the offending images as per this post.
I then ran another rbd ls -l <POOL_NAME> command to see what was left and noticed several items without anything in the LOCK column. I discovered these were artifacts from failed disk migrations I tried early on – wasted space. I removed them one by one with the following command:
rbd rm <VM_FILE_NAME> -p <POOL_NAME>
Be careful to verify they’re not disks that belong to VMs which are powered off. Disks of non-running VMs will also show no lock.
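One way to double-check before deleting is to search the guest config files for the image name. This sketch assumes the configs live under /etc/pve/nodes/<node>/qemu-server/ (and lxc/ for containers) and that a disk in use appears there by its image name; the function name is hypothetical.

```shell
# Print the config files under a directory that mention an image name.
# Exit status is 0 if at least one reference is found, non-zero otherwise.
image_referenced() {
    image=$1
    confdir=${2:-/etc/pve/nodes}
    # -F: fixed string, -w: whole word (so disk-1 does not match disk-10),
    # -l: list file names only, -r: search recursively.
    grep -rlwF -- "$image" "$confdir"
}
```

For example, `image_referenced vm-100-disk-0` lists any config that still points at that image; no output suggests it is an orphan, though it is still worth a manual look before running rbd rm.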
Disk errors
I had a disk fail, but then I pulled out the wrong disk. I kept getting these errors:
Warning: Error fsyncing/closing /dev/mapper/ceph--fc741b6c--499d--482e--9ea4--583652b541cc-osd--block--843cf28a--9be1--4286--a29c--b9c6848d33ba: Input/output error
I was unable to remove it from the GUI. After a while I realized the problem – I was on the wrong node. I needed to be on the node that has the disks when creating an OSD in the Proxmox GUI.
Steps to determine which disk is assigned to an OSD, from ceph docs:
ceph-volume lvm list
====== osd.2 =======
[block] /dev/ceph-680265f2-0b3c-4426-b2a8-acf2774d82e0/osd-block-2096f339-0572-4e1d-bf20-52335af9b374
block device /dev/ceph-680265f2-0b3c-4426-b2a8-acf2774d82e0/osd-block-2096f339-0572-4e1d-bf20-52335af9b374
block uuid tcnwFr-G33o-ybue-n0mP-cDpe-sp9y-d0gvYS
cephx lockbox secret
cluster fsid 65f26da0-fca0-4419-ba15-20269a5a363f
cluster name ceph
crush device class ssd
encrypted 0
osd fsid 2096f339-0572-4e1d-bf20-52335af9b374
osd id 2
osdspec affinity
type block
vdo 0
devices /dev/sde