Tag Archives: backup

Proxmox Ceph storage configuration

April 22, 2023 nicholas 1 Comment

These are my notes for migrating my VM storage from NFS mount to Ceph hosted on Proxmox. I ran into a lot of bumps, but after getting proper server-grade SSDs, things have been humming smoothly long enough that it’s time to publish.

A note on SSDs

I had a significant amount of trouble getting ceph to work with consumer-grade SSDs. This is because ceph does a cache writeback call for each transaction – much like NFS. On my ZFS array, I could disable this, but not so for ceph. The result is very slow performance. It wasn’t until I got some Intel DC S3700 drives that ceph became reliable and fast. More details here.

Initial install

I used the Proxmox GUI to install ceph on each node by going to <host> / Ceph. Then I used the GUI to create a monitor, manager, and OSD on each host. Lastly, I used the GUI to create a ceph storage target in Datacenter config.

Small cluster (3 nodes)

My Proxmox cluster is small (3 nodes.) I discovered I didn’t have enough space for 3 replicas (the default ceph configuration), so I had to drop my pool size/min down to 2/1 despite warnings not to do so, since a 3-node cluster is a special case:

https://forum.proxmox.com/threads/ceph-pool-size-is-2-1-really-a-bad-idea.68939/#post-440755

More discussion: https://lists.ceph.io/hyperkitty/list/ceph-users@ceph.io/thread/UB44GH4Z2NJUV52ZTHKO4TGYEX3DZ4CB/

I have not had any problems with this configuration and it provides the space I need.

Ceph pool size

In my early testing, I discovered that if I removed a disk from pool, the size of the pool increased! After doing some reading in redhat documentation, I learned the basics of why this happened.

Size = number of copies of the data in the pool

Minsize = minimum number of copies before pool operation is suspended

I didn’t have enough space for 3 copies of the data. When I removed a disk, the pool it dropped down to the minsize setting (2 copies) – which I did have enough room for. The pool rebalanced to reflect this and it resulted in more space.

Configure Alerting

It turns out that alerting for problems with ceph OSDs and monitors does not come out of the box. You must configure it. Thanks to this thread and the ceph documentation for how to do so. I did this on each proxmox node.

apt install ceph-mgr-dashboard

ceph config set mgr mgr/alerts/smtp_host <MAIL_HOST>'
ceph config set mgr mrg/alerts/smtp_ssl false
ceph config set mgr mgr/alerts/smtp_ssl false
ceph config set mgr mgr/alerts/smtp_port 25
ceph config set mgr mgr/alerts/smtp_destination <DEST_EMAIL>
ceph config set mgr mgr/alerts/smtp_sender <SENDER_EMAIL>
ceph config set mgr mgr/alerts/smtp_from_name 'Proxmox Ceph Cluster'

Test this by telling ceph to send its alerts:

ceph alerts send

Move VM disks to Ceph storage

I ended up writing a simple for loop to move all my existing Proxmox VM disks onto my new ceph cluster. None of my VMs had more than 3 scsi devices. If your VMs have more than that you’ll have to tweak this rudimentary command:

for vm in $(qm list | awk '{print $1}'|grep -v VMID); do qm move-disk $vm scsi0 <CEPH_POOL_NAME>; qm move-disk $vm scsi1 <CEPH_POOL_NAME>; qm move-disk $vm scsi2 <CEPH_POOL_NAME>; done

Rename storage

I tried to edit /etc/pve/storage.cfg to change the name I gave my ceph cluster in Proxmox. That didn’t work (question mark next to the storage after renaming it) so I just removed and re-added instead.

Maintenance

Begin maintenance:

Ceph constantly tries to keep itself in balance. If you take a node down and it stays down for too long, ceph will begin to rebalance the data among the remaining nodes. If you’re doing short term maintenance, you can control this behavior to avoid unnecessary rebalance traffic.

ceph osd set nobackfill
ceph osd set norebalance

Reboot / perform OSD maintenance.

After maintenance is completed:

ceph osd unset nobackfill
ceph osd unset norebalance

Performance benchmark

I did a lot of performance checking when I first started to try and track down why the pool was so slow. In the end it was my consumer-grade SSDs. I’ll keep this section here for future reference.

Redhat article on ceph performance benchmarking

Ceph wiki on benchmarking

rados bench -p SSD 10 write --no-cleanup
rados bench -p SSD 10 seq
rados bench -p SSD 10 seq
rados bench -p SSD 10 rand
rbd create image01 --size 1024 --pool SSD
rbd map image01 --pool SSD --name client.admin
mkfs.ext4 /dev/rbd/SSD/image01  
mkdir /mnt/ceph-block-device
mount /dev/rbd/SSD/image01 /mnt/ceph-block-device/
rbd bench --io-type write image01 --pool=SSD
pveperf /mnt/ceph-block-device/
rados -p SSD cleanup

Undo:

 umount /mnt/ceph-block-device  
 rbd unmap image01 --pool SSD
 rbd rm image01 --pool SSD

MTU 9000 warning

I read that it was recommended to set network MTU to 9000 (jumbo frames. When I did this I experienced weird behavior, connection timeouts – ceph ground to a halt, complaining about slow OSDs, mons. It was too much hassle for me to troubleshoot, so I went back to the standard 1500 MTU.

Datacenter settings

I discovered you can have a host automatically migrate hosts off when you issue the reboot command via the migrate shutdown policy. https://pve.proxmox.com/wiki/High_Availability

Proxmox GUI / Datacenter / Options / HA Settings

Specify SSD or HDD for pools

I have not done this yet but here’s a link I found that explains how to do it: https://stackoverflow.com/questions/58060333/ceph-how-to-place-a-pool-on-specific-osd

Helpful commands

Determine IPs of OSDs:

ceph osd dump - determine IPs of OSDs

Remove monitor from failed node:

ceph mon remove <host>
Also needs to be removed from /etc/ceph/ceph.conf

Configure Backup

I had been using ZFS snapshots and ZFS send to backup my VM disks before the move to ceph. While ceph has snapshot capability, it is slow and takes up extra space in the pool. My solution was to spin up a Proxmox Backup Server and regularly back up to that instead.

Proxmox backup server: can be installed to an existing PVE server if you desire:

https://pbs.proxmox.com/docs/installation.html

Configure the apt repository as follows:

# PBS pbs-no-subscription repository provided by proxmox.com,
# NOT recommended for production use
deb http://download.proxmox.com/debian/pbs bullseye pbs-no-subscription

# security updates
deb http://security.debian.org/debian-security bullseye-security main contrib

# apt-get update
# apt-get install proxmox-backup

I had to add a regular user and give admin permissions on PBS side, then add the host on the proxmox side using those credentials.

Configure automated backup in PVE via Datacenter tab / Backup.

Remember to configure automated verify jobs (scrubs).

Make sure to add an e-mail address for proxmox backup user for alerts.

Edit which account & e-mail is used, and how often notified, at the Datastore level.

Sync jobs

I wanted to synchronize my Proxmox Backup repository to a non-PBS server (simply host the files.) I accomplished this by doing the following:

Add 127.0.0.1 as a Remote host (Configuration / Remotes.) Copy the PBS server fingerprint from Certificates / Fingerprint.
Create remote datastore in /etc/fstab manually (I used SSHFS to backup to a synology over SSH.)
Add datastore in PBS, pointing to manual fstab mount. Then add sync job there

Import PBS datastore (in case of total crash)

I wanted to know how to import the data into a fresh instance of PBS. This is the procedcure:

edit /etc/proxmox-backup/datastore.cfg and add config about the datastore manually. Copy from existing datastore config for syntax.

Space still being taken up after deleting backups

PBS uses access time to determine if something has been touched. It waits 24 hours after the last touch. Garbage collection manually updates atime, but still recommended to keep atime on for the dataset PBS is using. Sources:

https://forum.proxmox.com/threads/zpool-atime-turned-off-effect-on-garbage-collection.76590/

https://pbs.proxmox.com/docs/backup-client.html#garbage-collection

Troubleshooting

Really slow VM IOPS during degrade / rebuild

This also ended up being due to having consumer-grade SSDs in my ceph pools. I’m keeping my notes for what I did to troubleshoot in case they’re useful.

https://forum.proxmox.com/threads/ceph-high-i-o-wait-on-osd-add-remove.20271/

Small cluster. Lower backfill activity so recovery doesn’t cause slowdown:

ceph config set osd osd_max_backfills 1
ceph config set osd osd_recovery_max_active 3

Verify setting was applied: https://www.suse.com/support/kb/doc/?id=000019693

ceph-conf --show-config|egrep "osd_max_backfills|osd_recovery_max_active"
ceph config dump | grep osd

Ramp up backfill performance:

ceph tell osd.* injectargs --osd_max_backfills=2 --osd-recovery_max_active=8 # 2x Increase
ceph tell osd.* injectargs --osd_max_backfills=3 --osd-recovery_max_active=12 # 3x Increase
ceph tell osd.* injectargs --osd_max_backfills=4 --osd_recovery_max_active=16 # 4x Increase
ceph tell osd.* injectargs --osd_max_backfills=1 --osd-recovery_max_active=3 # Back to Defaults

The above didn’t help, turns out consumer SSDs are very bad:

https://yourcmc.ru/wiki/Ceph_performance#General_benchmarking_principles

https://blog.cypressxt.net/hello-ceph-and-samsung-850-evo/

I bought some Intel DC S3700 on ebay for $75 a piece. It fixed all my latency/speed issues.

Dead mon despite being removed from cli

I had a situation where a monitor showed up as dead in proxmox, but I was unable to delete it. I followed this procedure:

rm /etc/systemd/system/ceph-mon.target.wants/ceph-mon@<nodename>.service

Dead pve node procedure

remove from /etc/ceph/ceph.conf, remove /var/lib/ceph/mon/ceph-<node>, remove rm /etc/systemd/system/ceph-mon.target.wants/ceph-mon@pve2.service

https://forum.proxmox.com/threads/ceph-cant-remove-monitor-with-unknown-status.63613/

Adding through GUI brought me back to the same problem.

Bring node back manually

https://docs.ceph.com/en/latest/rados/operations/add-or-rm-mons/

ceph auth get mon. -o /tmp/key
ceph mon getmap -o /tmp/map
ceph-mon -i <node_name> –mkfs –monmap /tmp/map –keyring /tmp/key
ceph-mon -i <node_name> –public-addr <node_ip>:6789
ceph mon enable-msgr2
vi /etc/pve/ceph.conf

In the end the most surefire way to fix this problem was to re-image the affected host.

Clear HEALTH_WARNING in GUI

In my testing I had tried pulling disks at random, then putting them back in. This recovered well, but I had this message:

HEALTH_WARN 1 daemons have recently crashed

To clear it I had to drop to the CLI and run this command:

ceph crash archive-all

Thanks to the Proxmox Forums for the fix.

Pool cleanup

I noticed I would get rbd error: rbd: listing images failed: (2) No such file or directory (500) when trying to look at which disks were on my Ceph pool. I fixed this by removing the offending images as per this post.

I then ran another rbd ls -l <POOL_NAME> command to see what was left and noticed several items without anything in the LOCK column. I discovered these were artifacts from failed disk migrations I tried early on – wasted space. I removed them one by one with the following command:

rbd rm <VM_FILE_NAME> -p <POOL_NAME>

Be careful to verify they’re not disks that are in use with VMs with are powered off – they will also show no lock for non-running VMs.

Disk errors

I had a disk fail, but then I pulled out the wrong disk. I kept getting these errors:

Warning: Error fsyncing/closing /dev/mapper/ceph--fc741b6c--499d--482e--9ea4--583652b541cc-osd--block--843cf28a--9be1--4286--a29c--b9c6848d33ba: Input/output error

I was unable to remove it from the GUI. After a while I realized the problem – I was on the wrong node. I needed to be on the node that has the disks when creating an OSD in the Proxmox GUI.

Steps to determine which disk is assigned to an OSD, from ceph docs:

ceph-volume lvm list

====== osd.2 =======

 [block]       /dev/ceph-680265f2-0b3c-4426-b2a8-acf2774d82e0/osd-block-2096f339-0572-4e1d-bf20-52335af9b374

     block device              /dev/ceph-680265f2-0b3c-4426-b2a8-acf2774d82e0/osd-block-2096f339-0572-4e1d-bf20-52335af9b374
     block uuid                tcnwFr-G33o-ybue-n0mP-cDpe-sp9y-d0gvYS
     cephx lockbox secret       
     cluster fsid              65f26da0-fca0-4419-ba15-20269a5a363f
     cluster name              ceph
     crush device class        ssd
     encrypted                 0
     osd fsid                  2096f339-0572-4e1d-bf20-52335af9b374
     osd id                    2
     osdspec affinity           
     type                      block
     vdo                       0
     devices                   /dev/sde

Update 6/20/2024

One year later and Ceph has been running great. So great, in fact, that I migrated my bulk storage to it as well. Here are my notes on that endeavor.

Optimal number of PGs

I discovered that there is an optimal number of PGs you want in a ceph cluster. It depends on how many OSDs you have. Link: https://access.redhat.com/documentation/en-us/red_hat_ceph_storage/3/html/storage_strategies_guide/placement_groups_pgs#pg-count-for-small-clusters

The optimal number of PGs is the following, rounding up to the nearest power of two:

                (OSDs * 100)
   Total PGs =  ------------
                 pool size

In my case (only 3 OSDs – one per node) that made my optimal number of PGs 256.

Slow write speeds for HDDs

Moving OSD DB to SSD – The slow way

I had pretty slow write speeds when adding my 3 HDDs to a new pool (50 MB/s max.) I read the best way to help with this is to offload the db and WAL to an SSD for each OSD. It’s possible to have multiple OSDs using a single SSD for such operations, but since I don’t have enterprise-grade SSDs, I opted to do a 1:1 HDD:SSD mapping. Unfortunately, I had already created the OSDs before I realized I needed to do this. So I had to destroy & re-create each OSD one by one to add the SSD.

https://www.reddit.com/r/ceph/comments/fgvcte/replace_osd_node_without_remapping_pgs

set flag norebalance, norecover, nobackfill, destroy the OSD and join the new OSD as the same ID of the old one.

This worked, but took two days to rebuild. I set out to find a faster option

Moving OSD DB to SSD – The fast way

Migrate DB to SSD without destroying OSD

https://www.reddit.com/r/ceph/comments/1awwoch/yet_another_ceph_poor_performance_post_part_deux

https://github.com/45Drives/scripts/blob/main/add-db-to-osd.sh

Requires jq and bc

I kept getting the error message

WARNING: Device selected (/dev/sdd) has a LVM2_member signature, but no volume group
Wipe disk and run again

despite completely wiping the drive. I dove into the source of the script and found it creates a PV & VG for the drive, and that must be failing, so I did it manually:

pvcreate /dev/sdd

vgcreate ceph-$(uuidgen) /dev/sdd

./add-db-to-osd.sh -b 465G -d /dev/sdd -o 3

This worked beautifully.

Move OSD DB to new device

I discovered that when it comes to DB devices, the same advice about SSDs is still true: Don’t waste your time with consumer SSDs. I ordered some more Intel DC S3700 drives and now needed to replace them. The 45 drives script doesn’t work because the DB had already been migrated to a different SSD. This is the process to move from one dedicated DB device to another:

Thanks to this thread https://www.reddit.com/r/ceph/comments/1bk6e9s/moving_db_and_wal_from_ssd_to_hdd/

and this documentation: https://docs.ceph.com/en/latest/ceph-volume/lvm/list/

https://docs.ceph.com/en/quincy/ceph-volume/lvm/migrate

Plug new drive in alongside existing drive

Obtain OSD fsid with this command: ceph-volume lvm list

pvcreate /dev/<new_device>

vgcreate ceph-$(uuidgen) /dev/<new_device>

lvcreate -l100%FREE -n ceph-osd-db-<OSD FSID> ceph-<UUIDGEN_FROM_ABOVE>

systemctl stop ceph-osd@<OSD_ID>

ceph-volume lvm migrate –osd-id <OSD_ID> –osd-fsid <OSD_FSID> –from db wal –target ceph-<UUIDGEN_FROM_ABOVE>/ceph-osd-db-<OSD FSID>

--> Migrate to new, Source: ['--devs-source', '/var/lib/ceph/osd/ceph-4/block.db'] Target: /dev/ceph-60969103-7d88-4340-a13f-a77f98e1da46/osd-db-800G
Running command: /usr/bin/chown -h ceph:ceph /var/lib/ceph/osd/ceph-4/block.db
Running command: /usr/bin/chown -R ceph:ceph /dev/dm-6
--> Migration successful.

systemctl start ceph-osd@<OSD_ID>

System with no additional HDD slots
Used a USB3 SSD adapter temporarily. Migrated DB,  remove old device, add new device. Reboot node.

Sizing DB device

https://docs.ceph.com/en/latest/rados/configuration/bluestore-config-ref

For RBD workloads, however, block.db usually needs no more than 1% to 2% of the block size.

Move from dedicated DB device back to OSD

https://www.reddit.com/r/ceph/comments/1bwma91/script_to_move_separate_db_lv_back_to_block_device

CLI, OS

Restore files from remote borg repository disk image

May 9, 2021 nicholas Leave a comment

My off-site backup involves sending borgbackup archives of VM images to a remote synology server. I recently needed to restore a single file from one of the VM images stored within this borg backup repository on the remote server. My connection to this server is not very fast so I didn’t want to wait to download the entire image file to mount it locally.

My solution was to mount the remote borgbackup repository on my local machine over SSH so I could poke around for and copy the specific file I wanted. This requires the borgbackup binary to be present on the remote machine. Since it’s a synology, I simply copied the standalone binary over.

The restore process was complicated by the fact that the VM disk image is owned by root, so in order to access the file I needed to mount the remote repository as root.

This is the process:

Set BORG_REMOTE_PATH
1. export BORG_REMOTE_PATH=<PATH_TO_BORG_BINARY_ON_REMOTE_SYSTEM>
(Arch Linux): install python-llfuse
Mount repository over SSH:
1. borg mount <USER>@<REMOTE_SYSTEM>:<PATH_TO_REMOTE_BORGBACKUP_REPOSITORY>::<BACKUP_NAME> <MOUNT_FOLDER>
Follow disk image mounting process
1. losetup -Pr -f <PATH_TO_MOUNTED_BORGBACKUP>/<FILENAME_OF_VM_IMAGE>
2. mount -o ro /dev/loop0p2 /mnt/loop0/
Follow reverse to unmount when done:
1. umount /mnt/loop0
2. losetup -d /dev/loop0
3. borg umount <MOUNT_FOLDER>

Success! I was able to restore an individual file within a raw VM image backup on a remote Borgbackup repository using this method.

CLI, OS

Backup your systems with urBackup

June 5, 2017 nicholas 5 Comments

In addition to my ZFS snapshots I decided to implement a secondary backup system. I decided to land on urbackup for ease of use and, more importantly, it was easier to set up.

Server Install

Assuming a Cent-based system:

cd /etc/yum.repos.d/
sudo wget http://download.opensuse.org/repositories/home:uroni/CentOS_7/home:uroni.repo
sudo yum -y install urbackup-server
sudo systemctl enable urbackup-server
sudo systemctl start urbackup-server

Open up necessary ports for the server:

sudo firewall-cmd --add-port=55413-55415/tcp --permanent
sudo systemctl reload firewalld

By default urbackup listens on port 55414 for connections. You can change this to port 80 and/or 443 for HTTPS by installing nginx and having it proxy the connections for you.

sudo yum -y install nginx
sudo systemctl enable nginx
sudo setsebool -P httpd_can_network_connect 1 #if you're using selinux

Copy the following into /etc/nginx/conf.d/urbackup.conf (make sure to change server_name to suit your needs)

server {
        server_name backup;

        location / {
                proxy_pass http://localhost:55414/;
        }
}

Then start nginx:

sudo systemctl start nginx

You should then be able to access the urbackup console by navigating to the IP / hostname of your backup server in a browser.

Client Install:

Urbackup can use a snapshot system known as dattobd. You should use it if you can in order to get more consistent backups, otherwise urbackup will simply copy files from the host which isn’t always desirable (databases, for example)

Install dattobd (optional):

sudo yum -y update
# reboot if your kernel ends up being updated
sudo yum -y localinstall https://cpkg.datto.com/datto-rpm/repoconfig/datto-el-rpm-release-$(rpm -E %rhel)-latest.noarch.rpm
sudo yum -y install dkms-dattobd dattobd-utils

Install urbackup client:

TF=`mktemp` && wget "https://hndl.urbackup.org/Client/2.1.15/UrBackup%20Client%20Linux%202.1.15.sh" -O $TF && sudo sh $TF; rm $TF
#Select dattobd when prompted if desired

Configure Firewall:

sudo firewall-cmd --add-port=35621-35623/tcp --permanent
sudo systemctl reload firewalld

Once a client is installed, assuming they’re on the same network as the backup server, they will automatically add themselves and begin backing up. If they don’t show up it’s usually a firewall issue.

Restore

Restoration of individual files is easily done through the web console. If you have a windows system, restoring from an image backup is also easy.

Linux hosts

Recovery is trickier if you want to restore a Linux system. Install an empty system of same distribution. Give it the same hostname. Install the client as outlined above, then run:

sudo /usr/local/bin/urbackupclientctl restore-start -b last

Troubleshooting

If for some reason the client not showing up after removing it from the GUI: Uninstall & re-install client software

sudo /usr/local/sbin/uninstall_urbackupclient

TF=`mktemp` && wget "https://hndl.urbackup.org/Client/2.1.15/UrBackup%20Client%20Linux%202.1.15.sh" -O $TF && sudo sh $TF; rm $TF

Mobile, OS

Manually update firmware on Nexus devices

December 9, 2015 nicholas Leave a comment

The release of Android 6.0.1 had me excited because it enables LTE band 12 for my phone, the Nexus 5X, which currently uses T-mobile. Band 12 is in the 700mhz range which should greatly increase speed and coverage. I’m too impatient to wait for the OTA!

This tutorial will walk you through how to manually backup, unlock, flash, re-lock, and restore a Google Nexus 5X, but the procedure is pretty much the same for any Nexus device.

First, obtain two necessary Android development tools: adb and fastboot. Do not use your distribution’s versions of these tools – they are likely out of date. Instead, download the Android SDK directly from Google by going here and scrolling to the bottom of the page. Java is required for the sdk to install – install it if you haven’t already. Thanks to this site for explaining how to only obtain platform-tools.

sudo apt-get install openjdk-7-jre
tar zxvf android-sdk_r24.4.1-linux.tgz
android-sdk-linux/tools/android update sdk --no-ui --filter 1,platform-tools

Once you have platform-tools you need to add them to your PATH to make scripts run adb successfully (thanks to this site for the information). When you run the command below make sure to update /path/to/… to the folder where you extracted the android-sdk.

echo "export PATH=$PATH{}:/path/to/android-sdk/tools:/path/to/android/platform-tools" >> ~/.bashrc
source ~/.bashrc

Next, obtain the latest firmware for your device from the google developers site. Extract it somewhere you will remember for later.

Now, enable USB debugging on your phone if you haven’t already (thanks to this site for the info.) To do this, go to Settings / About phone, scroll to the bottom and press on the build number 7 times. Press back and go to developer options, and enable USB debugging and enable OEM unlocking. Then plug in your phone to the computer with adb installed and run this command (thanks to xda for the information)

adb backup -apk -shared -all -f <backup_filename>

Confirm the on-screen prompt on your phone. Make sure you specify a password for encryption. The above command should backup everything, but in my case it did not backup files in the flash partition (downloads, pictures, etc). Make sure you manually copy any important files from your phone before you proceed.

My first attempt at the above command didn’t work for me. I received the error message:

adb: unable to connect for backup

When I ran adb devices it showed this:

List of devices attached 
00c739918fbf4e2a offline

It turns out I had an old version of adb installed. Make sure you download Google’s official SDK instead of relying on your distribution’s version.

Once the backup is complete, you then need to reboot your phone into fastboot mode:

adb reboot bootloader

Wait for the reboot, then run the following command. Warning: this command will wipe your device. Make sure you have a reliable backup and confirm the message on your phone screen.

fastboot oem unlock

Now, navigate to the directory where you extracted your latest firmware and execute the flash-all script:

cd bullhead-mmb29k
./flash-all.sh

After some time your phone will reboot into your shiny new updated OS. Skip everything setup-wise. Re-enable developer mode and android debugging, and then re-lock your bootloader:

adb reboot bootloader
fastboot oem lock

Lastly, we need to restore everything from the backup we made and re-lock the bootloader for security. Once again skip everything setup-wise on the phone, re-enable developer mode and android debugging, and then restore your stuff:

adb restore <filename>

Don’t forget to manually copy back any flash files you manually backed up earlier.

The very last (optional) step is to go into developer options settings and disable OEM unlocking.

Success!

CLI, Virtualization

Xenserver SSH backup script

September 1, 2015 nicholas 5 Comments

Citrix Xenserver is an amazing hypervisor with pretty much every function released to you for free. One thing they do not handle, however, is automated backups.

I have hacked together a backup script for myself that seems to work fairly well. It is my own mix of this and this script along with some logic for e-mail reporting that I came up with myself. It does not require any modification of the xenserver host at all (no need to mount anything!) with the exception of adding a public key to the xenserver’s authorized_keys file.

This script can be run on anything with BASH and the appropriate UNIX tools (even other xenservers if you want) and uses SSH to initiate and transfer the backups to a location of your choosing.

Place this script on the machine you want to be initiating / saving the backups on. It requires that you generate an SSH public/private key pair, which can be done with this command:

ssh-keygen

Add the resulting .pub file’s contents to your xenserver’s /root/.ssh/authorized_keys file (create it if it doesn’t exist.) Take note of the location of the private key file that was generated with that command and put that path in the script.

You can download the script here or view it below. This script has been working pretty well for me. Note it will not work with any VMs that have spaces in their names. I was too lazy to debug this so I just renamed the problem VMs to remove the spaces. Enjoy!

#!/bin/bash

# Modified August 30, 2015 by Nicholas Jeppson
# Taken from http://discussions.citrix.com/topic/345960-xenserver-automated-snapshots-script/ and modified to allow for ssh backups
# Additional insight taken from https://github.com/cepa/xen-backup

# [Usage & Config]
# This script involves two computers: a xenserver machine and a backup machine.
# Put this script on the backup server and run with any account that has privileges to the desired export directory.
#
# This script assumes you have already created a private and public key pair on the backup server
# as well as adding respective the public key to the xenserver authorized_keys file.
#
# [How it works]
# Step1: Snapshots each VM on the xen server
# Step2: Backs up the snapshots to specified location
# Step3: Deletes temporary snapshot created in step 1
# Step4: Deletes old VM backups as defined later in this file
#
# [Note]
# This script will only work with VMs that don't have spaces in their names
# Please make sure you have enough disk space for BACKUP_PATH, or backup will fail
#
# Tested on xenserver 6.5
#
# Modify the variables in the config section below to suit your particular environment's needs.
#

############### Config section ###############

#Location where you want backups to go
BACKUP_PATH="/mnt/backup/OS/"

#SSH configuration
SSH_CIPHER="arcfour128"

#Number of backups to keep
NUM_BACKUPS=2

#Xenserver ssh configuration
#This dictates the address and location of keyfiles as they reside on the xenserver
XEN_ADDRESS="192.168.1.1"
XEN_KEY_LOCATION="/home/backup/.ssh/backup"
XEN_USER="root"

#E-mail configuration
EMAIL_ADDRESS="youremail@provider.here"
EMAIL_SUBJECT="`hostname -s | awk '{print "["toupper($1)"]"}'` VM Backup Report: `date +"%A %b %d %Y"`"

########## End of Config section ###############

ret_code=0

#Replace any spaces found with backslashes because dd doesn't like them
BACKUP_PATH_ESCAPED="`echo $BACKUP_PATH | sed 's/ /\\\ /g'`"

# SSH command
remote_exec() {
chmod 0600 $XEN_KEY_LOCATION
ssh -i $XEN_KEY_LOCATION -o "StrictHostKeyChecking no" -c $SSH_CIPHER $XEN_USER@$XEN_ADDRESS $1
}

backup() {
echo "======================================================"
echo "VM Backup started: `date`"
begin="$(date +%s)"
echo "Backup location: ${BACKUP_PATH}"
echo

#add a slash to the end of the backup path if it doesn't exist
if [[ "$BACKUP_PATH" != */ ]]; then
BACKUP_PATH="$BACKUP_PATH/"
fi

#Build array of VM names
VMNAMES=$(remote_exec "xe vm-list is-control-domain=false | grep name-label | cut -d ':' -f 2 | tr -d ' '")

for VMNAME in $VMNAMES
do
echo "======================================================"
echo "$VMNAME backup started `date`"
echo
before="$(date +%s)"

# create snapshot
TIMESTAMP=`date '+%Y%m%d-%H%M%S'`
SNAPNAME="$VMNAME-$TIMESTAMP"
SNAPUUID=$(remote_exec "xe vm-snapshot vm=\"$VMNAME\" new-name-label=\"$SNAPNAME\"")

# export snapshot
# remote_exec "xe snapshot-export-to-template snapshot-uuid=$SNAPUUID filename= | gzip" | gunzip | dd of="$BACKUP_PATH/$SNAPNAME.xva"
remote_exec "xe snapshot-export-to-template snapshot-uuid=$SNAPUUID filename=" | dd of="$BACKUP_PATH/$SNAPNAME.xva"

#if export was unsuccessful, return error
if [ $? -ne 0 ]; then
echo "Failed to export snapshot name = $snapshot_name$backup_ext"
ret_code=1

else
#calculate backup time, print results
after="$(date +%s)"
elapsed=`bc -l <<< "$after-$before"`
elapsedMin=`bc -l <<< "$elapsed/60"`
echo "Snapshot of $VMNAME saved to $SNAPNAME.xva"
echo "Backup completed in `echo $(printf %.2f $elapsedMin)` minutes"

# destroy snapshot
remote_exec "xe snapshot-uninstall force=true snapshot-uuid=$SNAPUUID"

#remove old backups (uses num_backups variable from top)
BACKUP_COUNT=$(find $BACKUP_PATH -name "$VMNAME*.xva" | wc -l)

if [[ "$BACKUP_COUNT" -gt "$NUM_BACKUPS" ]]; then

OLDEST_BACKUP=$(find $BACKUP_PATH -name "$VMNAME*" -print0 | xargs -0 ls -tr | head -n 1)
echo
echo "Removing oldest backup: $OLDEST_BACKUP"
rm $OLDEST_BACKUP
if [ $? -ne 0 ]; then
echo "Failed to remove $OLDEST_BACKUP"
fi
fi
echo "======================================================"
fi
done

end=$"$(date +%s)"
total_time=`bc -l <<< "$end-$begin"`
total_time_min=`bc -l <<< "$total_time/60"`
echo "Backup completed: `date`"
echo "VM Backup completed in `echo $(printf %.2f $total_time_min)` minutes"
echo
}

#Run the backup function and save all output to a variable, including stderr
BACKUP_OUTPUT=$(backup 2>&1)

#Clean up the output of the backup function
#Remove records count from dd, do some basic math to make dd's numbers more human readable
BACKUP_OUTPUT_HUMANIZED=$(echo "$BACKUP_OUTPUT" | sed -r '/.*records /d' | tr -d '()' \
| awk '{sub(/.*bytes /, $1/1024/1024/1024" GB "); sub(/in .* secs/, "in "$5/60" mins "); sub(/mins .*/, "mins (" $7/1024/1024" MB/sec)"); print}')

#Send a report e-mail with the backup results
echo "$BACKUP_OUTPUT_HUMANIZED" | mail -s "$EMAIL_SUBJECT" "$EMAIL_ADDRESS"

exit $ret_code

A note on SSDs

Initial install

Small cluster (3 nodes)

Ceph pool size

Configure Alerting

Move VM disks to Ceph storage

Rename storage

Maintenance

Begin maintenance:

After maintenance is completed:

Performance benchmark

MTU 9000 warning

Datacenter settings

Specify SSD or HDD for pools

Helpful commands

Configure Backup

Sync jobs

Import PBS datastore (in case of total crash)

Space still being taken up after deleting backups

Troubleshooting

Really slow VM IOPS during degrade / rebuild

Dead mon despite being removed from cli

Clear HEALTH_WARNING in GUI

Pool cleanup

Disk errors

Update 6/20/2024

Optimal number of PGs

Slow write speeds for HDDs

Moving OSD DB to SSD – The slow way

Moving OSD DB to SSD – The fast way

Move OSD DB to new device

Sizing DB device

Move from dedicated DB device back to OSD

Server Install

Client Install:

Restore

Linux hosts

Troubleshooting

Nick's technical musings