Category Archives: Virtualization

Posts about hypervisors and virtualization

Fix no internet in KVM/QEMU VMs after installing docker

I ran into a frustrating issue where my KVM VMs would lose network connectivity if I installed docker on my Arch Linux system. After some digging I finally discovered the cause (thanks to

It turns out, docker adds a bunch of iptables rules by default which prevent communication. These will interfere with an already existing bridge, and suddenly your VMs will report no network.

There are two ways to fix this. I went with the route of telling docker to NOT mess with iptables on startup. Less secure, but my system is not directly connected to the internet. I created /etc/docker/daemon.json and added this to it:

    "iptables" : false

Then restarted my machine. This did the trick!

Proxmox Ceph storage configuration

These are my notes for migrating my VM storage from NFS mount to Ceph hosted on Proxmox. I ran into a lot of bumps, but after getting proper server-grade SSDs, things have been humming smoothly long enough that it’s time to publish.

A note on SSDs

I had a significant amount of trouble getting ceph to work with consumer-grade SSDs. This is because ceph does a cache writeback call for each transaction – much like NFS. On my ZFS array, I could disable this, but not so for ceph. The result is very slow performance. It wasn’t until I got some Intel DC S3700 drives that ceph became reliable and fast. More details here.

Initial install

I used the Proxmox GUI to install ceph on each node by going to <host> / Ceph. Then I used the GUI to create a monitor, manager, and OSD on each host. Lastly, I used the GUI to create a ceph storage target in Datacenter config.

Small cluster (3 nodes)

My Proxmox cluster is small (3 nodes.) I discovered I didn’t have enough space for 3 replicas (the default ceph configuration), so I had to drop my pool size/min down to 2/1 despite warnings not to do so, since a 3-node cluster is a special case:

More discussion:

I have not had any problems with this configuration and it provides the space I need.

Ceph pool size

In my early testing, I discovered that if I removed a disk from pool, the size of the pool increased! After doing some reading in redhat documentation, I learned the basics of why this happened.

Size = number of copies of the data in the pool

Minsize = minimum number of copies before pool operation is suspended

I didn’t have enough space for 3 copies of the data. When I removed a disk, the pool it dropped down to the minsize setting (2 copies) – which I did have enough room for. The pool rebalanced to reflect this and it resulted in more space.

Configure Alerting

It turns out that alerting for problems with ceph OSDs and monitors does not come out of the box. You must configure it. Thanks to this thread and the ceph documentation for how to do so. I did this on each proxmox node.

apt install ceph-mgr-dashboard
ceph config set mgr mgr/alerts/smtp_host <MAIL_HOST>'
ceph config set mgr mrg/alerts/smtp_ssl false
ceph config set mgr mgr/alerts/smtp_ssl false
ceph config set mgr mgr/alerts/smtp_port 25
ceph config set mgr mgr/alerts/smtp_destination <DEST_EMAIL>
ceph config set mgr mgr/alerts/smtp_sender <SENDER_EMAIL>
ceph config set mgr mgr/alerts/smtp_from_name 'Proxmox Ceph Cluster'

Test this by telling ceph to send its alerts:

ceph alerts send

Move VM disks to Ceph storage

I ended up writing a simple for loop to move all my existing Proxmox VM disks onto my new ceph cluster. None of my VMs had more than 3 scsi devices. If your VMs have more than that you’ll have to tweak this rudimentary command:

for vm in $(qm list | awk '{print $1}'|grep -v VMID); do qm move-disk $vm scsi0 <CEPH_POOL_NAME>; qm move-disk $vm scsi1 <CEPH_POOL_NAME>; qm move-disk $vm scsi2 <CEPH_POOL_NAME>; done

Rename storage

I tried to edit /etc/pve/storage.cfg to change the name I gave my ceph cluster in Proxmox. That didn’t work (question mark next to the storage after renaming it) so I just removed and re-added instead.


Begin maintenance:

Ceph constantly tries to keep itself in balance. If you take a node down and it stays down for too long, ceph will begin to rebalance the data among the remaining nodes. If you’re doing short term maintenance, you can control this behavior to avoid unnecessary rebalance traffic.

ceph osd set nobackfill
ceph osd set norebalance

Reboot / perform OSD maintenance.

After maintenance is completed:

ceph osd unset nobackfill
ceph osd unset norebalance

Performance benchmark

I did a lot of performance checking when I first started to try and track down why the pool was so slow. In the end it was my consumer-grade SSDs. I’ll keep this section here for future reference.

Redhat article on ceph performance benchmarking

Ceph wiki on benchmarking

rados bench -p SSD 10 write --no-cleanup
rados bench -p SSD 10 seq
rados bench -p SSD 10 seq
rados bench -p SSD 10 rand
rbd create image01 --size 1024 --pool SSD
rbd map image01 --pool SSD --name client.admin
mkfs.ext4 /dev/rbd/SSD/image01  
mkdir /mnt/ceph-block-device
mount /dev/rbd/SSD/image01 /mnt/ceph-block-device/
rbd bench --io-type write image01 --pool=SSD
pveperf /mnt/ceph-block-device/
rados -p SSD cleanup


 umount /mnt/ceph-block-device  
 rbd unmap image01 --pool SSD
 rbd rm image01 --pool SSD

MTU 9000 warning

I read that it was recommended to set network MTU to 9000 (jumbo frames. When I did this I experienced weird behavior, connection timeouts – ceph ground to a halt, complaining about slow OSDs, mons. It was too much hassle for me to troubleshoot, so I went back to the standard 1500 MTU.

Datacenter settings

I discovered you can have a host automatically migrate hosts off when you issue the reboot command via the migrate shutdown policy.

Proxmox GUI / Datacenter / Options / HA Settings

Specify SSD or HDD for pools

I have not done this yet but here’s a link I found that explains how to do it:

Helpful commands

Determine IPs of OSDs:

ceph osd dump - determine IPs of OSDs

Remove monitor from failed node:

ceph mon remove <host>
Also needs to be removed from /etc/ceph/ceph.conf

Configure Backup

I had been using ZFS snapshots and ZFS send to backup my VM disks before the move to ceph. While ceph has snapshot capability, it is slow and takes up extra space in the pool. My solution was to spin up a Proxmox Backup Server and regularly back up to that instead.

Proxmox backup server: can be installed to an existing PVE server if you desire:

Configure the apt repository as follows:

# PBS pbs-no-subscription repository provided by,
# NOT recommended for production use
deb bullseye pbs-no-subscription

# security updates
deb bullseye-security main contrib

# apt-get update
# apt-get install proxmox-backup

I had to add a regular user and give admin permissions on PBS side, then add the host on the proxmox side using those credentials.

Configure automated backup in PVE via Datacenter tab / Backup.

Remember to configure automated verify jobs (scrubs).

Make sure to add an e-mail address for proxmox backup user for alerts.

Edit which account & e-mail is used, and how often notified, at the Datastore level.

Sync jobs

I wanted to synchronize my Proxmox Backup repository to a non-PBS server (simply host the files.) I accomplished this by doing the following:

  • Add as a Remote host (Configuration / Remotes.) Copy the PBS server fingerprint from Certificates / Fingerprint.
  • Create remote datastore in /etc/fstab manually (I used SSHFS to backup to a synology over SSH.)
  • Add datastore in PBS, pointing to manual fstab mount. Then add sync job there

Import PBS datastore (in case of total crash)

I wanted to know how to import the data into a fresh instance of PBS. This is the procedcure:

edit /etc/proxmox-backup/datastore.cfg and add config about the datastore manually. Copy from existing datastore config for syntax.

Space still being taken up after deleting backups

PBS uses access time to determine if something has been touched. It waits 24 hours after the last touch. Garbage collection manually updates atime, but still recommended to keep atime on for the dataset PBS is using. Sources:


Really slow VM IOPS during degrade / rebuild

This also ended up being due to having consumer-grade SSDs in my ceph pools. I’m keeping my notes for what I did to troubleshoot in case they’re useful.

Small cluster. Lower backfill activity so recovery doesn’t cause slowdown:

ceph config set osd osd_max_backfills 1
ceph config set osd osd_recovery_max_active 3

Verify setting was applied:

ceph-conf --show-config|egrep "osd_max_backfills|osd_recovery_max_active"
ceph config dump | grep osd

Ramp up backfill performance:

ceph tell osd.* injectargs --osd_max_backfills=2 --osd-recovery_max_active=8 # 2x Increase
ceph tell osd.* injectargs --osd_max_backfills=3 --osd-recovery_max_active=12 # 3x Increase
ceph tell osd.* injectargs --osd_max_backfills=4 --osd_recovery_max_active=16 # 4x Increase
ceph tell osd.* injectargs --osd_max_backfills=1 --osd-recovery_max_active=3 # Back to Defaults

The above didn’t help, turns out consumer SSDs are very bad:

I bought some Intel DC S3700 on ebay for $75 a piece. It fixed all my latency/speed issues.

Dead mon despite being removed from cli

I had a situation where a monitor showed up as dead in proxmox, but I was unable to delete it. I followed this procedure:

rm /etc/systemd/system/<nodename>.service

Dead pve node procedure

remove from /etc/ceph/ceph.conf, remove /var/lib/ceph/mon/ceph-<node>, remove rm /etc/systemd/system/

Adding through GUI brought me back to the same problem.

Bring node back manually

 ceph auth get mon. -o /tmp/key
 ceph mon getmap -o /tmp/map
 ceph-mon -i <node_name> –mkfs –monmap /tmp/map –keyring /tmp/key  
 ceph-mon -i <node_name> –public-addr <node_ip>:6789  
 ceph mon enable-msgr2
 vi /etc/pve/ceph.conf

In the end the most surefire way to fix this problem was to re-image the affected host.


In my testing I had tried pulling disks at random, then putting them back in. This recovered well, but I had this message:

HEALTH_WARN 1 daemons have recently crashed

To clear it I had to drop to the CLI and run this command:

ceph crash archive-all

Thanks to the Proxmox Forums for the fix.

Pool cleanup

I noticed I would get rbd error: rbd: listing images failed: (2) No such file or directory (500) when trying to look at which disks were on my Ceph pool. I fixed this by removing the offending images as per this post.

I then ran another rbd ls -l <POOL_NAME> command to see what was left and noticed several items without anything in the LOCK column. I discovered these were artifacts from failed disk migrations I tried early on – wasted space. I removed them one by one with the following command:

rbd rm <VM_FILE_NAME> -p <POOL_NAME>

Be careful to verify they’re not disks that are in use with VMs with are powered off – they will also show no lock for non-running VMs.

Disk errors

I had a disk fail, but then I pulled out the wrong disk. I kept getting these errors:

Warning: Error fsyncing/closing /dev/mapper/ceph--fc741b6c--499d--482e--9ea4--583652b541cc-osd--block--843cf28a--9be1--4286--a29c--b9c6848d33ba: Input/output error

I was unable to remove it from the GUI. After a while I realized the problem – I was on the wrong node. I needed to be on the node that has the disks when creating an OSD in the Proxmox GUI.

Steps to determine which disk is assigned to an OSD, from ceph docs:

ceph-volume lvm list
====== osd.2 =======

 [block]       /dev/ceph-680265f2-0b3c-4426-b2a8-acf2774d82e0/osd-block-2096f339-0572-4e1d-bf20-52335af9b374

     block device              /dev/ceph-680265f2-0b3c-4426-b2a8-acf2774d82e0/osd-block-2096f339-0572-4e1d-bf20-52335af9b374
     block uuid                tcnwFr-G33o-ybue-n0mP-cDpe-sp9y-d0gvYS
     cephx lockbox secret       
     cluster fsid              65f26da0-fca0-4419-ba15-20269a5a363f
     cluster name              ceph
     crush device class        ssd
     encrypted                 0
     osd fsid                  2096f339-0572-4e1d-bf20-52335af9b374
     osd id                    2
     osdspec affinity           
     type                      block
     vdo                       0
     devices                   /dev/sde

Fix no network after Proxmox 7 upgrade

I upgraded my proxmox server to version 7 and was dismayed to find it had no network connections after a reboot. After much digging I was finally able to find this post which mentioned:

After installing ifupdown2 everything works fine.

Sure enough, ifupdown2 was not installed anymore, and I had configured my networks with it. I had to manually assign an IP address to my node long enough to issue the command
apt install ifupdown2

Once I rebooted, everything came up like it should. Lesson learned: if you use ifupdown2, you must make sure it’s there before you reboot your server!

windows 10 KERNEL_SECURITY_CHECK_FAILED in qemu vm

I upgraded to a shiny new AMD Ryzen 3rd gen processer (Threadripper 3960x.) After doing so I could not boot up my Windows 10 gaming VM (it uses VFIO / PCI Passthrough for the video card.) The message I kept getting as it tried to boot was:


After reading this reddit thread and this one It turns out it’s a culmination of a few things:

  • Running Linux kernel greater than 5.4
  • Running QEMU 5
  • Using 3rd gen AMD Ryzen CPU
  • Using host-passthrough CPU mode

The problem comes with a new speculative execution protection hardware feature in the Ryzen Gen 3 chipsets – stibp. Qemu doesn’t know how to handle it properly, thus the bluescreens.

There are two ways to fix it

  • Change host-model from host-passthrough to epyc
  • Add CPU parameters to your Virtual Machine’s XML file instructing it to not use the stibp CPU feature.

Since I have some software that checks CPU model and refuses to work if it’s not in the desktop class (Geforce Experience) I opted for route #2.

First, check the qemu logs to see which CPU parameters your VM was using (pick a time where it worked.) Replace ‘win10’ with the name of your VM.

sudo cat /var/log/libvirt/qemu/win10.log | grep "\-cpu"

in my case, it was -cpu host,migratable=on,topoext=on,kvmclock=on,hv-time,hv-relaxed,hv-vapic,hv-spinlocks=0x1fff,hv-vendor-id=1234567890ab,kvm=off \

Copy everything after -cpu and before the last backslash. Then edit your VM’s XML file (change last argument to the name of your VM)

sudo virsh edit win10

Scroll down to the bottom qemu:commandline section (if it doesn’t exist, create it right above the last line – </domain>. Paste the following information obtained from the above log (ignoring the qemu:commandline lines if they already exist.) In my case it looked like this:

    <qemu:arg value='-cpu'/>
    <qemu:arg value='host,topoext=on,kvmclock=on,hv-time,hv-relaxed,hv-vapic,hv-

What you’re doing is copying the CPU arguments you found in the log and adding them to the qemu:commandline section, with a twist – adding -amd-stibp which instructs qemu to remove that CPU flag.

This did the trick for me!

Threadripper / Epyc processor core optimization

I had a pet project (folding@home) where I wanted to maximize computing power. I became frustrated with default CPU scheduling of my folding@home threads. Ideal performance would keep similar threads on the same CPU, but the threads were jumping all over the place, which was impacting performance.

Step one was to figure out which threads belonged to which physical cores. I found on this site that you can use cat to find out what your “sibling threads” are:

cat /sys/devices/system/cpu/cpu{0..15}/topology/thread_siblings_list

The above command is for my Threadripper & Epyc systems, which each have 16 cores hyperthreaded to 32 cores. Adjust the {0..15} number to match your number of cores (core 0 being the fist core.) This was my output:

cat /sys/devices/system/cpu/cpu{0..15}/topology/thread_siblings_list


Now that I know the sibling threads are offset by 16, I can use this information to optimize my folding@home VMs. I modified my CPU pinning script to take this into consideration. The script ensures that each VM is pinned to only use sibling threads (ensuring they all stay on the same physical CPU.)

This script should be used with caution. It pins processes to specific CPUs, which limits the kernel scheduler’s ability to move things around if needed. If configured badly this can cause the machine to lock up or VMs to be terminated.

I saw some impressive results spinning up four separate 8 core VMs and pinning them to sibling cores using this script. It almost doubled the rate at which I completed folding@home work units.

And now, the script:

#Properly assign CPU cores to their respective die for EPYC/Threadripper systems
#Based on how hyperthreads are done in these systems
#cat /sys/devices/system/cpu/cpu{0..15}/topology/thread_siblings_list

#The script takes two arguments - the ID of the Proxmox VM to modify, and the core to begin the VM on
#If running this against multiple VMs, make sure to increment this second number by half of the cores of the previous VM
#For example, if I have one 8 core VM and I run this script specifying 0 for the offset, if I spin up a second VM, the second argument would be 4
#this would ensure the second VM starts on core 4 (the 5th core) and assigns sibling cores to match

set -eo pipefail

#take First argument as which VMID to pin CPU cores to, the second argument is which core to start pinning to

#Determine offset for sibling threads
SIBLING_THREAD_OFFSET=$(cat /sys/devices/system/cpu/cpu0/topology/thread_siblings_list| sed 's/,/ /g' | awk '{print $2}')

#Function to determine number of CPU cores a VM has
cpu_tasks() {
	expect <<EOF | sed -n 's/^.* CPU .*thread_id=\(.*\)$/\1/p' | tr -d '\r' || true
spawn qm monitor $VMID
expect ">"
send "info cpus\r"
expect ">"

#Only act if VMID & OFFSET are set
if [[ -z $VMID  || -z $OFFSET ]]
	echo "Usage: <VMID> <OFFSET>"
	exit 1
	#Get PIDs of each CPU core for VM, count number of VM cores, and get even/odd PIDs for assignment
	VCPU_EVEN_THREADS=($(for EVEN_THREAD in "${VCPUS[@]}"; do echo $EVEN_THREAD; done | awk '!(NR%2)'))
	VCPU_ODD_THREADS=($(for ODD_THREAD in "${VCPUS[@]}"; do echo $ODD_THREAD; done | awk '(NR%2)'))

	if [[ $VCPU_COUNT -eq 0 ]]; then
		echo "* No VCPUS for VM$VMID"
		exit 1

	echo "* Detected ${#VCPUS[@]} assigned to VM$VMID..."
	echo "* Resetting cpu shield..."

	#Start at offset CPU number, assign odd numbered PIDs to their own CPU thread, then increment CPU core number
	#0-3 if offset is 0, 4-7 if offset is 4, etc
	for PID in "${VCPU_ODD_THREADS[@]}"
		echo "* Assigning ODD thread $ODD_CPU_INDEX to $PID..."
		taskset -pc "$ODD_CPU_INDEX" "$PID"

	#Start at offset + CPU count, assign even number PIDs to their own CPU thread, then increment CPU core number
	#16-19 if offset is 0,	20-23 if offset is 4, etc
	for PID in "${VCPU_EVEN_THREADS[@]}"
		echo "* Assigning EVEN thread $EVEN_CPU_INDEX to $PID..."
		taskset -pc "$EVEN_CPU_INDEX" "$PID"

create podman services with podman-compose

Podman is a fork of Docker that Redhat is using. I really liked docker-compose functionality; fortunately there is a podman-compose project which is more or less the same thing.

I now have a setup where each podman container is controlled by a systemd service, set to run on startup, with version controlled podman-compose files.

First, I installed podman-compose:

sudo curl -o /usr/local/bin/podman-compose
chmod +x /usr/local/bin/podman-compose

I then created podman-compose files (syntax identical to docker-compose) for each container. Here is one example (jackett.yml)

version: "2"
    image: linuxserver/jackett
    container_name: jackett
      - PUID=1000
      - PGID=1000
      - TZ=America/Boise
      - /mnt/storage/Docker/Jackett/config:/config
      - /mnt/storage/Docker/Jackett/downloads:/downloads
      - 9117:9117
    restart: unless-stopped

I then created a corresponding systemd unit file for each container:



# Compose up
ExecStart=/usr/local/bin/podman-compose -f /home/nicholas/podman/jackett.yml up

# Compose down, remove containers and volumes
ExecStop=/usr/local/bin/podman-compose -f /home/nicholas/podman/jackett.yml down -v


I then do a systemctl daemon-reload, and enable the service for startup:

sudo systemctl daemon-reload
sudo systemctl enable jackett


Why not create a single podman-compose file for all my services, instead of creating individual services for each container? I wanted to be able to clearly see log output for each container with journalctl -f -u <service name.> If you lump all your services in a single compose file, the output from each container gets all jumbled into that single service log. Separating out each container into its own service was more clean.

Podman no internet in container fix

I’ve started experimenting with CentOS 8 & Podman (a fork of Docker.) I ran into an issue where one of my containers needed internet access, but could not connect. After some digging I found this site which explains why:

I had to configure the firewall on the podman host to allow for IP masquerade:

sudo firewall-cmd --zone=public --add-masquerade --permanent

After running the above command, my container had internet access!

JQ select specific value from array

I had some AWS ec2 JSON output that I needed to parse. I wanted to grab a specific value from an array and it proved to be tricky for a JSON noob like me. I finally found this site which was very helpful: In my case I wanted the value of a specific AWS EC2 tag.

The trick is to grab down to the Tags[] array, and then pipe that to a select command. If your tags have dots in them (as mine did) then make sure to quote the tag name. Then add the .Value to the end of the select statement. This is my query:

jq -r '.Reservations[].Instances[].Tags[] | select (.Key == "EC2.Tag.Name").Value' jsonfile.json

The above query grabs all the tags (an array of Key,Value lines), then searches the result for a specific key “EC2.Tag.Name” and returns the Value line associated with it.

migrate between amd & intel in proxmox

I recently acquired an Intel based server and plugged it into my AMD-based Proxmox cluster. I ran into an issue transferring from AMD to Intel boxes (the other direction worked fine.) After a few moments, every VM that moved from AMD to Intel would kernel panic.

Fortunately I found here that the fix is to add a few custom CPU flags to your VMs. Once I did this they could move back and forth freely (assuming they had the kvm64 CPU assigned to them – host obviously won’t work.)

qm set *VMID* --args "-cpu 'kvm64,+ssse3,+sse4.1,+sse4.2,+x2apic'"

Proxmox HA management script

I was a bit frustrated at the lack of certain functions of ProxMox. I wanted an easy way to tag a VM and manage that tag as a group. My solution was to create HA groups for VMs with various functions. I can then manage the group and tell them to migrate storage or turn off & on.

I wrote a script to facilitate this. Right now it only covers powering on, powering off, and migrating the location of the primary disk, but more is to come.

Here’s what I have so far:

#Proxmox HA management script
#Migrates storage, starts, or stops Proxmox HA groups based on the name and function passed to it.
#Usage: <start|stop|migrate> <ha-group-name> [local|remote]

#Change to the name of your local storage (for migrating from remote to local storage)

function get_vm_name() {
    #Determine the name of the VMID passed to this function
    VM_NAME=$(qm config "$1" | grep '^name:' | awk '{print $2}')

function get_group_VMIDs() {
    #Get a list of VMIDs based on the name of the HA group passed to this function
    group_VMIDs=$(ha-manager config | grep -B1 "$1" | grep vm: | sed 's/vm://g')

function group_power_state() {
    #Loop through members of HA group passed to this function
    for group in "$1" 
        get_group_VMIDs "$group"
        for VM in $group_VMIDs
            get_vm_name "$VM"
            echo "$OPERATION $VM_NAME in HA group $group"
            ha-manager set $VM --state $VM_STATE

function group_migrate() {
    #This function migrates the VM's first disk (scsi0) to the specified location (local/remote)
    #TODO String to determine all disk IDs:  qm config 115 | grep '^scsi[0-9]:' | tr -d ':' | awk '{print $1}'

    #Loop through each VM in specified group name (second argument passed on CLI)
    for group in "$2" 
        get_group_VMIDs "$group"
        for VM in $group_VMIDs
            #Determine the names of each VM in the HA group
            get_vm_name "$VM"

            #Set storage location based on argument
            if [[ "$3" == "remote" ]]; then

            #Move primary disk to desired location
            echo "Migrating $VM_NAME to "$3" storage"
            qm move_disk $VM $disk $storage --delete=1


case "$1" in 
        group_power_state "$2" 
        group_power_state "$2"
        case "$3" in
                group_migrate "$@"
                echo "Usage: migrate <ha-group-name> <local|remote>"
        echo "Usage: <start|stop|migrate> <ha-group-name> [local|remote]"
        exit 1