Tag Archives: Debian

Proxmox Ceph storage configuration

These are my notes for migrating my VM storage from NFS mount to Ceph hosted on Proxmox. I ran into a lot of bumps, but after getting proper server-grade SSDs, things have been humming smoothly long enough that it’s time to publish.

A note on SSDs

I had a significant amount of trouble getting ceph to work with consumer-grade SSDs. This is because ceph does a cache writeback call for each transaction – much like NFS. On my ZFS array, I could disable this, but not so for ceph. The result is very slow performance. It wasn’t until I got some Intel DC S3700 drives that ceph became reliable and fast. More details here.

Initial install

I used the Proxmox GUI to install ceph on each node by going to <host> / Ceph. Then I used the GUI to create a monitor, manager, and OSD on each host. Lastly, I used the GUI to create a ceph storage target in Datacenter config.

Small cluster (3 nodes)

My Proxmox cluster is small (3 nodes.) I discovered I didn’t have enough space for 3 replicas (the default ceph configuration), so I had to drop my pool size/min down to 2/1 despite warnings not to do so, since a 3-node cluster is a special case:


More discussion: https://lists.ceph.io/hyperkitty/list/ceph-users@ceph.io/thread/UB44GH4Z2NJUV52ZTHKO4TGYEX3DZ4CB/

I have not had any problems with this configuration and it provides the space I need.
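
For reference, the same size/min change can also be made from the CLI with `ceph osd pool set`. The following is a minimal sketch, not necessarily how it was done here: the function name, the pool name `vm-pool`, and the `CEPH` override variable (used only to preview the commands) are all assumptions for illustration.

```shell
# Sketch: set replica count (size) and minimum replicas (min_size) on a pool.
# CEPH is a hypothetical override so the commands can be previewed first:
#   CEPH="echo ceph" set_pool_replicas vm-pool 2 1
set_pool_replicas() {
    pool=$1 size=$2 min_size=$3
    ${CEPH:-ceph} osd pool set "$pool" size "$size" &&
    ${CEPH:-ceph} osd pool set "$pool" min_size "$min_size"
}
```

Running `set_pool_replicas vm-pool 2 1` without the override would apply the 2/1 setting described above to the (placeholder) pool.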

Ceph pool size

In my early testing, I discovered that if I removed a disk from the pool, the size of the pool increased! After doing some reading in the Red Hat documentation, I learned the basics of why this happened.

Size = number of copies of the data in the pool

Minsize = minimum number of copies before pool operation is suspended

I didn’t have enough space for 3 copies of the data. When I removed a disk, the pool dropped down to the minsize setting (2 copies), which I did have enough room for. The pool rebalanced to reflect this, which resulted in more space.
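
The arithmetic behind this is roughly "raw capacity divided by number of copies". A small sketch with made-up numbers (3000 GB of raw capacity is an assumption, not my actual cluster), ignoring Ceph's own overhead and its nearfull/full safety ratios:

```shell
# Rough usable capacity = total raw capacity / number of replicas (size).
# Ignores Ceph overhead and the nearfull/full ratios.
usable_gb() {
    raw_gb=$1 replicas=$2
    echo $(( raw_gb / replicas ))
}

usable_gb 3000 3   # three copies -> 1000
usable_gb 3000 2   # two copies   -> 1500
```

Dropping from three copies to two frees up half again as much usable space from the same disks, which matches the jump seen in the pool size.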

Configure Alerting

It turns out that alerting for problems with ceph OSDs and monitors does not come out of the box. You must configure it. Thanks to this thread and the ceph documentation for how to do so. I did this on each proxmox node.

apt install ceph-mgr-dashboard
ceph config set mgr mgr/alerts/smtp_host <MAIL_HOST>
ceph config set mgr mgr/alerts/smtp_ssl false
ceph config set mgr mgr/alerts/smtp_port 25
ceph config set mgr mgr/alerts/smtp_destination <DEST_EMAIL>
ceph config set mgr mgr/alerts/smtp_sender <SENDER_EMAIL>
ceph config set mgr mgr/alerts/smtp_from_name 'Proxmox Ceph Cluster'

Test this by telling ceph to send its alerts:

ceph alerts send

Move VM disks to Ceph storage

I ended up writing a simple for loop to move all my existing Proxmox VM disks onto my new ceph cluster. None of my VMs had more than 3 scsi devices. If your VMs have more than that you’ll have to tweak this rudimentary command:

for vm in $(qm list | awk '{print $1}'|grep -v VMID); do qm move-disk $vm scsi0 <CEPH_POOL_NAME>; qm move-disk $vm scsi1 <CEPH_POOL_NAME>; qm move-disk $vm scsi2 <CEPH_POOL_NAME>; done
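
If your VMs have a varying number of SCSI disks, one option is to read the disk keys out of `qm config` instead of hard-coding scsi0 through scsi2. This is a sketch under assumptions: the function names are made up, it only looks at `scsiN` entries, and it skips CD-ROM drives. Disks that already live on the target storage will simply make `qm move-disk` report an error for that disk.

```shell
# Print the scsiN disk keys found in `qm config <vmid>` output (read from stdin),
# skipping CD-ROM drives and non-disk keys such as scsihw.
list_scsi_disks() {
    awk -F: '/^scsi[0-9]+:/ && !/media=cdrom/ { print $1 }'
}

# Move every SCSI disk of every VM on this node to the given storage.
move_all_disks() {
    storage=$1
    for vm in $(qm list | awk 'NR > 1 { print $1 }'); do
        for disk in $(qm config "$vm" | list_scsi_disks); do
            qm move-disk "$vm" "$disk" "$storage"
        done
    done
}
# Usage: move_all_disks <CEPH_POOL_NAME>
```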

Rename storage

I tried to edit /etc/pve/storage.cfg to change the name I gave my ceph cluster in Proxmox. That didn’t work (question mark next to the storage after renaming it) so I just removed and re-added instead.


Begin maintenance:

Ceph constantly tries to keep itself in balance. If you take a node down and it stays down for too long, ceph will begin to rebalance the data among the remaining nodes. If you’re doing short term maintenance, you can control this behavior to avoid unnecessary rebalance traffic.

ceph osd set nobackfill
ceph osd set norebalance

Reboot / perform OSD maintenance.

After maintenance is completed:

ceph osd unset nobackfill
ceph osd unset norebalance
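
To avoid forgetting one of the flags, the set/unset pairs above can be wrapped in a small helper. A sketch only: the function name and the `CEPH` override variable (for previewing the commands) are assumptions.

```shell
# Set or unset the two flags used above in one call.
#   maintenance_flags set     # before maintenance
#   maintenance_flags unset   # after maintenance
# CEPH is a hypothetical override for previewing: CEPH="echo ceph"
maintenance_flags() {
    action=$1
    case $action in
        set|unset) ;;
        *) echo "usage: maintenance_flags set|unset" >&2; return 2 ;;
    esac
    for flag in nobackfill norebalance; do
        ${CEPH:-ceph} osd "$action" "$flag" || return 1
    done
}
```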

Performance benchmark

I did a lot of performance checking when I first started to try and track down why the pool was so slow. In the end it was my consumer-grade SSDs. I’ll keep this section here for future reference.

Redhat article on ceph performance benchmarking

Ceph wiki on benchmarking

rados bench -p SSD 10 write --no-cleanup
rados bench -p SSD 10 seq
rados bench -p SSD 10 rand
rbd create image01 --size 1024 --pool SSD
rbd map image01 --pool SSD --name client.admin
mkfs.ext4 /dev/rbd/SSD/image01  
mkdir /mnt/ceph-block-device
mount /dev/rbd/SSD/image01 /mnt/ceph-block-device/
rbd bench --io-type write image01 --pool=SSD
pveperf /mnt/ceph-block-device/
rados -p SSD cleanup


 umount /mnt/ceph-block-device  
 rbd unmap image01 --pool SSD
 rbd rm image01 --pool SSD

MTU 9000 warning

I read that it was recommended to set network MTU to 9000 (jumbo frames). When I did this I experienced weird behavior and connection timeouts: ceph ground to a halt, complaining about slow OSDs and mons. It was too much hassle for me to troubleshoot, so I went back to the standard 1500 MTU.
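
A common cause of this kind of behavior is an MTU mismatch somewhere on the path (for example a switch port or one NIC still at 1500). That was not verified here; the sketch below just shows one way to test for it with a "don't fragment" ping. The function names are made up, and `-M do` assumes the Linux iputils version of ping.

```shell
# Largest ICMP payload that fits in one unfragmented IPv4 packet:
# MTU minus 20 bytes of IPv4 header and 8 bytes of ICMP header.
icmp_payload() {
    echo $(( $1 - 28 ))
}

# Ping a peer with "don't fragment" set. If this fails while a normal ping
# works, something on the path is not passing frames of that size.
check_mtu() {
    peer=$1 mtu=${2:-9000}
    ping -M do -c 3 -s "$(icmp_payload "$mtu")" "$peer"
}
# Usage: check_mtu <IP of another ceph node> 9000
```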

Datacenter settings

I discovered you can have a host automatically migrate its VMs off when you issue the reboot command, via the migrate shutdown policy. https://pve.proxmox.com/wiki/High_Availability

Proxmox GUI / Datacenter / Options / HA Settings

Specify SSD or HDD for pools

I have not done this yet but here’s a link I found that explains how to do it: https://stackoverflow.com/questions/58060333/ceph-how-to-place-a-pool-on-specific-osd

Helpful commands

Determine IPs of OSDs:

ceph osd dump

Remove monitor from failed node:

ceph mon remove <host>
Also needs to be removed from /etc/ceph/ceph.conf

Configure Backup

I had been using ZFS snapshots and ZFS send to back up my VM disks before the move to ceph. While ceph has snapshot capability, it is slow and takes up extra space in the pool. My solution was to spin up a Proxmox Backup Server and regularly back up to that instead.

Proxmox backup server: can be installed to an existing PVE server if you desire:


Configure the apt repository as follows:

# PBS pbs-no-subscription repository provided by proxmox.com,
# NOT recommended for production use
deb http://download.proxmox.com/debian/pbs bullseye pbs-no-subscription

# security updates
deb http://security.debian.org/debian-security bullseye-security main contrib

Then, as root, update and install:

apt-get update
apt-get install proxmox-backup

I had to add a regular user and give admin permissions on PBS side, then add the host on the proxmox side using those credentials.

Configure automated backup in PVE via Datacenter tab / Backup.

Remember to configure automated verify jobs (scrubs).

Make sure to add an e-mail address for proxmox backup user for alerts.

Edit which account & e-mail is used, and how often notified, at the Datastore level.

Sync jobs

I wanted to synchronize my Proxmox Backup repository to a non-PBS server (simply host the files.) I accomplished this by doing the following:

  • Add as a Remote host (Configuration / Remotes.) Copy the PBS server fingerprint from Certificates / Fingerprint.
  • Create remote datastore in /etc/fstab manually (I used SSHFS to back up to a Synology over SSH.)
  • Add datastore in PBS, pointing to manual fstab mount. Then add sync job there

Import PBS datastore (in case of total crash)

I wanted to know how to import the data into a fresh instance of PBS. This is the procedure:

Edit /etc/proxmox-backup/datastore.cfg and add the config for the datastore manually. Copy from an existing datastore config for syntax.

Space still being taken up after deleting backups

PBS uses access time (atime) to determine if something has been touched. It waits 24 hours after the last touch before freeing the space. Garbage collection updates atime explicitly, but it is still recommended to keep atime on for the dataset PBS is using. Sources:




Really slow VM IOPS during degrade / rebuild

This also ended up being due to having consumer-grade SSDs in my ceph pools. I’m keeping my notes for what I did to troubleshoot in case they’re useful.


Small cluster. Lower backfill activity so recovery doesn’t cause slowdown:

ceph config set osd osd_max_backfills 1
ceph config set osd osd_recovery_max_active 3

Verify setting was applied: https://www.suse.com/support/kb/doc/?id=000019693

ceph-conf --show-config|egrep "osd_max_backfills|osd_recovery_max_active"
ceph config dump | grep osd

Ramp up backfill performance:

ceph tell 'osd.*' injectargs --osd_max_backfills=2 --osd_recovery_max_active=8 # 2x Increase
ceph tell 'osd.*' injectargs --osd_max_backfills=3 --osd_recovery_max_active=12 # 3x Increase
ceph tell 'osd.*' injectargs --osd_max_backfills=4 --osd_recovery_max_active=16 # 4x Increase
ceph tell 'osd.*' injectargs --osd_max_backfills=1 --osd_recovery_max_active=3 # Back to Defaults

The above didn’t help, turns out consumer SSDs are very bad:



I bought some Intel DC S3700 on ebay for $75 a piece. It fixed all my latency/speed issues.

Dead mon despite being removed from cli

I had a situation where a monitor showed up as dead in proxmox, but I was unable to delete it. I followed this procedure:

rm /etc/systemd/system/ceph-mon.target.wants/ceph-mon@<nodename>.service

Dead pve node procedure

Remove the node from /etc/ceph/ceph.conf, remove /var/lib/ceph/mon/ceph-<node>, and remove /etc/systemd/system/ceph-mon.target.wants/ceph-mon@<node>.service (ceph-mon@pve2.service in my case).


Adding through GUI brought me back to the same problem.

Bring node back manually


ceph auth get mon. -o /tmp/key
ceph mon getmap -o /tmp/map
ceph-mon -i <node_name> --mkfs --monmap /tmp/map --keyring /tmp/key
ceph-mon -i <node_name> --public-addr <node_ip>:6789
ceph mon enable-msgr2
vi /etc/pve/ceph.conf

In the end the most surefire way to fix this problem was to re-image the affected host.


In my testing I had tried pulling disks at random, then putting them back in. This recovered well, but I had this message:

HEALTH_WARN 1 daemons have recently crashed

To clear it I had to drop to the CLI and run this command:

ceph crash archive-all

Thanks to the Proxmox Forums for the fix.

Pool cleanup

I noticed I would get rbd error: rbd: listing images failed: (2) No such file or directory (500) when trying to look at which disks were on my Ceph pool. I fixed this by removing the offending images as per this post.

I then ran another rbd ls -l <POOL_NAME> command to see what was left and noticed several items without anything in the LOCK column. I discovered these were artifacts from failed disk migrations I tried early on – wasted space. I removed them one by one with the following command:

rbd rm <VM_FILE_NAME> -p <POOL_NAME>

Be careful to verify they’re not disks that are in use by VMs which are powered off; images belonging to non-running VMs also show no lock.
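
One way to do that check is to search the guest config files for the image name before deleting anything. A sketch under assumptions: the function name is made up, it assumes the per-node guest configs live under /etc/pve/nodes, and it assumes plain image names such as vm-100-disk-0 (the name is used inside a regular expression).

```shell
# Return 0 if any guest config under the given directory references the image.
# The pattern requires the name to be followed by "," or end of line, so that
# vm-100-disk-1 does not match vm-100-disk-10.
image_in_use() {
    image=$1 confdir=${2:-/etc/pve/nodes}
    grep -rqE -- ":${image}(,|\$)" "$confdir"
}
# Usage: only remove the image when no config mentions it
#   image_in_use <VM_FILE_NAME> || rbd rm <VM_FILE_NAME> -p <POOL_NAME>
```

Entries listed as unused disks in a VM config also count as "in use" here, which errs on the side of not deleting.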

Disk errors

I had a disk fail, but then I pulled out the wrong disk. I kept getting these errors:

Warning: Error fsyncing/closing /dev/mapper/ceph--fc741b6c--499d--482e--9ea4--583652b541cc-osd--block--843cf28a--9be1--4286--a29c--b9c6848d33ba: Input/output error

I was unable to remove it from the GUI. After a while I realized the problem – I was on the wrong node. I needed to be on the node that has the disks when creating an OSD in the Proxmox GUI.

Steps to determine which disk is assigned to an OSD, from ceph docs:

ceph-volume lvm list
====== osd.2 =======

 [block]       /dev/ceph-680265f2-0b3c-4426-b2a8-acf2774d82e0/osd-block-2096f339-0572-4e1d-bf20-52335af9b374

     block device              /dev/ceph-680265f2-0b3c-4426-b2a8-acf2774d82e0/osd-block-2096f339-0572-4e1d-bf20-52335af9b374
     block uuid                tcnwFr-G33o-ybue-n0mP-cDpe-sp9y-d0gvYS
     cephx lockbox secret       
     cluster fsid              65f26da0-fca0-4419-ba15-20269a5a363f
     cluster name              ceph
     crush device class        ssd
     encrypted                 0
     osd fsid                  2096f339-0572-4e1d-bf20-52335af9b374
     osd id                    2
     osdspec affinity           
     type                      block
     vdo                       0
     devices                   /dev/sde
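
With several OSDs on a node the listing gets long. A small awk filter can boil it down to one "OSD, device" pair per line. This is a sketch that assumes the output layout shown above; the function name is made up.

```shell
# Read `ceph-volume lvm list` output on stdin and print "osd.N device" pairs.
osd_devices() {
    awk '
        /^ *=+ osd\.[0-9]+ =+ *$/ { osd = $2 }
        $1 == "devices"           { print osd, $2 }
    '
}
# Usage, on the node that holds the disks:
#   ceph-volume lvm list | osd_devices
```

For the example output above this would print `osd.2 /dev/sde`.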

Site to site Wireguard VPN between OPNSense & Debian Linux server

I have a Debian linode box acting as a wireguard server. I wanted to join my opnsense firewall to it to allow devices behind it to access the box through the wireguard tunnel. It was not as straightforward as I had hoped, but thankfully I got it all working.

OPNSense side

Documentation link

Install wireguard via GUI

Install the os-wireguard package. Manually drop to the CLI and install the wireguard package as well:
sudo pkg install wireguard

Configure Local instance

  • Name and listen port can be random. Tunnel address is the subnet you wish to expose to the other end (the subnet you wish to have access to the tunnel.)
  • Leave everything else blank and hit save
  • Edit your new connection and copy the Public key, this will need to be sent to the Debian server

Configure Endpoint

  • Name: hostname of Debian server
  • Public Key: Public key of server (can be obtained by running wg show on the server)
  • Shared Secret: blank (unless you’ve configured it on the server)
  • Allowed IPs: IPs or subnets on the Debian server you wish to expose to the client side (the OPNSense box)
  • Endpoint address: DNS name of Debian server
  • Endpoint port: Port Debian wireguard instance is listening on

Enable the VPN

General tab / Enable wireguard checkbox and hit apply.

Debian side

Take down the tunnel

sudo wg-quick down wg0

Edit wireguard config to add peer

sudo vim /etc/wireguard/wg0.conf

[Peer]
PublicKey = <Public key copied from the OPNSense local instance>
AllowedIPs = <IPs or subnets behind the OPNSense side you wish to be exposed to the Debian side>

Restart wireguard

sudo wg-quick up wg0

Check connections

Example wg show output below with dummy IPs:

sudo wg show
interface: wg0
  public key: f+/J4JO0aL6kwOaudAvZVa1H2mDzR8Nh3Vfeqq+anF8=
  private key: (hidden)
  listening port: 12345

peer: TuUW7diXcWlaV97z3cQ1/92Zal2Pm9Qz/W2OMN+v20g=
  allowed ips:
  latest handshake: 17 seconds ago
  transfer: 5.14 KiB received, 3.81 KiB sent

peer: CZuC/+wxvzj9+TiGeyZtcT/lMGZnXsfSs/h5Jtw2VSE=
  allowed ips:
  latest handshake: 7 minutes, 8 seconds ago
  transfer: 5.89 MiB received, 952.20 MiB sent

The endpoint: line gets populated when a successful VPN connection is made. If it’s missing, the tunnel was not established.
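
With more than a couple of peers, the missing endpoint: line is easy to overlook. The filter below is a sketch (the function name is made up) that reads `wg show` output and prints the public key of every peer that has no endpoint: line.

```shell
# Read `wg show` output on stdin and print peers that have no "endpoint:" line.
peers_without_endpoint() {
    awk '
        $1 == "peer:"     { if (peer != "" && !seen) print peer; peer = $2; seen = 0 }
        $1 == "endpoint:" { seen = 1 }
        END               { if (peer != "" && !seen) print peer }
    '
}
# Usage: sudo wg show wg0 | peers_without_endpoint
```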


OPNSense box

Nothing happens after saving information and enabling tunnel

Make sure latest wireguard package is installed

sudo pkg install wireguard

Get more log output by opening a shell on your OPNSense box and running

sudo /usr/local/etc/rc.d/wireguard start

In my case I was getting this interesting message

[!] Missing WireGuard kernel support (ifconfig: SIOCIFCREATE2: Invalid argument). Falling back to slow userspace implementation.
[#] wireguard-go wg0
│                                                     │
│   Running wireguard-go is not required because this │
│   kernel has first class support for WireGuard. For │
│   information on installing the kernel module,      │
│   please visit                                      │
│         https://www.wireguard.com/install/          │

I fixed this problem by manually installing wireguard with the pkg install command above.

Debian box

Wireguard config not saving

Make sure to stop the tunnel first, otherwise your changes get overwritten by the daemon.

sudo wg-quick down wg0
<make changes>
sudo wg-quick up wg0

Fix no network after Proxmox 7 upgrade

I upgraded my proxmox server to version 7 and was dismayed to find it had no network connections after a reboot. After much digging I was finally able to find this post which mentioned:

After installing ifupdown2 everything works fine.

Sure enough, ifupdown2 was not installed anymore, and I had configured my networks with it. I had to manually assign an IP address to my node long enough to issue the command
apt install ifupdown2

Once I rebooted, everything came up like it should. Lesson learned: if you use ifupdown2, you must make sure it’s there before you reboot your server!

proxmox bond not present fix

I really banged my head on the wall on this one. I recently decided to re-architect my networking setup in proxmox to utilize bonded network configuration. I followed this writeup exactly. The problem is it didn’t work.

I would copy the example exactly, only changing the interface name, and yet every time I tried to start the networking service I would get this lovely error:

rawdevice bond0 not present

I finally found on the Debian Wiki one critical line :

First install the ifenslave package, necessary to enable bonding

For some reason the ProxMox howtos don’t speak of this – I guess because it comes installed by default. I discovered, however, that if you install ifupdown2 it removes ifenslave. I had installed ifupdown2 in the past to reload network configuration without rebooting. Aha!

I re-installed ifenslave (which removed ifupdown2 and re-installed ifupdown) and suddenly, the bond worked!

Bond not falling back to primary interface

I had configured my bond in active-backup mode. I wanted it to prefer the faster link, but after a failure in that link it wouldn’t switch back automatically. Thanks to this site for showing me the command to check:

cat /proc/net/bonding/bond0

I read again in Debian bonding wiki that I needed to add this directive to the bond:

        bond-primary enp2s0

Here is my complete working active-backup configuration, assigning vlan 2 to the host, and making enp2s0 (the 10gig nic) the primary, with a 1gig backup (eno1)

auto bond0
iface bond0 inet manual
        slaves enp2s0 eno1
        bond-primary enp2s0
        bond_miimon 100
        bond_mode active-backup

iface bond0.2 inet manual

auto vmbr0v2
iface vmbr0v2 inet static
        bridge_ports bond0.2
        bridge_stp off
        bridge_fd 0

auto vmbr0
iface vmbr0 inet manual
        bridge_ports bond0
        bridge_stp off
        bridge_fd 0
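
To confirm the bond actually falls back after the primary link recovers, the two relevant lines of /proc/net/bonding/bond0 can be compared. A sketch, assuming the usual "Primary Slave:" and "Currently Active Slave:" lines in that file; the function name is made up.

```shell
# Print the primary and currently active slave of a bond, and return 0 only
# when the active slave is the configured primary.
bond_on_primary() {
    awk -F': ' '
        /^Primary Slave:/          { split($2, a, " "); primary = a[1] }
        /^Currently Active Slave:/ { active = $2 }
        END {
            print "primary=" primary, "active=" active
            exit ((primary != "" && primary == active) ? 0 : 1)
        }
    ' "${1:-/proc/net/bonding/bond0}"
}
# Usage: bond_on_primary || echo "bond0 is running on the backup link"
```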

KeePassRPC incompatible with the current KeePass version

I keep forgetting about this snag so I’ll document it. In Debian / Ubuntu distros, once you’ve added the PPA to have the latest version of KeePass installed, if you try to install the KeePassRPC plugin (Kee) it will tell you the version is incompatible, even though it is.

The following plugin is incompatible with the current KeePass version: /usr/lib/keepass2/Plugins/KeePassRPC.plgx

Have a look at the plugin's website for an appropriate version.

The fix, as found here, is to install the mono-complete package

sudo apt install mono-complete

Restart KeePass after installation. That’s it!

Split flac files with shnsplit

I had a few single FLAC files with cue files I wanted to put into Plex but to my dismay it doesn’t read the CUE files at all. Thus I needed to split the one FLAC file into multiple pieces with shnsplit. Thanks to Stack Exchange for the help.

On my Debian system:

 sudo apt install cuetools shntool flac

With the necessary tools installed you simply have to run the shnsplit command:

 shnsplit -f FILENAME.cue -t "%a - %n %t - %p" FILENAME.flac

The -t parameter formats the output filenames as desired, per the manpage:

-t fmt
Name output files in user‐specified format based on CUE sheet fields. The following formatting strings are recognized:

%p  Performer
%a  Album
%t  Track title
%n  Track number

Raspberry Pi as a dashboard computer

Here are my raw, unpolished notes on how I set up a raspberry pi to serve as a dashboard display:

Use Raspbian OS

Autostart Chrome in kiosk mode

Eliminate Chrome crash bubble thanks to this post

mkdir -p ~/.config/lxsession/LXDE-pi/
nano ~/.config/lxsession/LXDE-pi/autostart

Add this line:
@chromium-browser --kiosk --app=<URL>

Mouse removal

sudo apt-get install unclutter

in ~/.config/lxsession/LXDE-pi/autostart add

@unclutter -idle 5

Disable screen blank:

in /etc/lightdm/lightdm.conf add

xserver-command=X -s 0 -dpms

Open up SSH & VNC

Pi / Preferences / Raspberry Pi Configuration: Interfaces tab

SSH: Enable
VNC: Enable

Increase swap file

sudo nano /etc/dphys-swapfile
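
The setting to change in that file is CONF_SWAPSIZE (in MB). If you would rather script it than edit by hand, here is a sketch. The function name and the 1024 example value are assumptions, and it expects GNU sed plus an existing, uncommented CONF_SWAPSIZE= line.

```shell
# Set CONF_SWAPSIZE (in MB) in a dphys-swapfile config file.
set_swapsize() {
    size_mb=$1 conf=${2:-/etc/dphys-swapfile}
    sed -i "s/^CONF_SWAPSIZE=.*/CONF_SWAPSIZE=${size_mb}/" "$conf"
}
# Usage, as root (1024 is only an example value):
#   set_swapsize 1024
#   dphys-swapfile swapoff && dphys-swapfile setup && dphys-swapfile swapon
```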

Configure NTP

sudo apt-get install openntpd ntpdate
sudo systemctl enable openntpd
sudo ntpdate <IP of NTP server>

edit /etc/openntpd/ntpd.conf and modify servers lines to fit your NTP server

Disable overscan

Pi / Preferences / Raspberry Pi Configuration: System tab
Overscan: Disable

Add Ubuntu PPA key to Debian

Occasionally I want to install packages located at an Ubuntu PPA repository on my Debian stretch machine. There’s a bit of a trick to it, thanks to chrisjean.com for outlining what needs to be done.

Step 1 is the same as in Ubuntu, add the PPA with add-apt-repository (install if it’s not already there)

sudo add-apt-repository ppa:<contributor>/<ppa name>

This will appear to work but when you do an apt update you may get something similar to this

W: GPG error: http://ppa.launchpad.net/jonathonf/gcc-7.1/ubuntu xenial InRelease: The following signatures couldn't be verified because the public key is not available: NO_PUBKEY 8CF63AD3F06FC659
W: The repository 'http://ppa.launchpad.net/jonathonf/gcc-7.1/ubuntu xenial InRelease' is not signed.
N: Data from such a repository can't be authenticated and is therefore potentially dangerous to use.
N: See apt-secure(8) manpage for repository creation and user configuration details.

The last step is to manually import the key with the following command:

sudo apt-key adv --keyserver keyserver.ubuntu.com --recv-keys <KEY_OF_PPA>

The PPA key will be listed on the PPA’s page. Once I ran that second command everything worked swimmingly.

Supermicro fans constantly spinning to 100% fix

My fancy new Supermicro-powered AMD Epyc 7 series server is the bee’s knees. When I first set up I had an annoying problem though – the fans would spin to 100% and back down to normal speeds constantly. Logs kept spamming the same thing over and over:

SENSOR_NAME: FAN5            
EVENT_DESCRIPTION: Lower Critical going low
EVENT SEVERITY:"information"
SENSOR_NAME: FAN5            
EVENT_DESCRIPTION: Lower Critical going low
EVENT SEVERITY:"information"

It was doing this for all 3 fans I had plugged in there. I finally came across this post which explained what the problem was. The fans I had installed defaulted to a low RPM mode, too low for the motherboard’s liking. The BMC would detect the low RPM and assume something was wrong and bring all fans to 100% in order to rescue the system. After doing this it would see the RPM go back to normal range and turn off the “emergency fan mode” only to turn back on when the RPMs of my fans went below the threshold.

The fix, thankfully, is pretty simple provided you have ipmitool installed. One simply has to use the tool to change the fan thresholds. For my Debian-based Proxmox install I did the following to quiet this beast:

apt install ipmitool
ipmitool sensor thresh FAN1 lower 300 300 400
ipmitool sensor thresh FAN2 lower 300 300 400
ipmitool sensor thresh FAN5 lower 300 300 400
#you'll want to modify the fans to correspond with your own server.

I ran the above commands to turn my 3 fans back to a sane speed. I got caught up for a while because I only ran this command on 2 of my 3 fans at first. The deafening noise continued. This is because if any fan in the system drops below its threshold, all fans spin up. Once I changed that third fan’s threshold all was gravy. My ears were ringing for a while, but they’re fine now.
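
Since missing a single fan is enough to keep the noise going, it can help to apply the thresholds in a loop. A sketch: the function name and the `IPMITOOL` override (for previewing the commands) are assumptions, and the 300/300/400 values are simply the ones used above.

```shell
# Apply the same lower thresholds to each listed fan sensor.
# Threshold order is: lower non-recoverable, lower critical, lower non-critical.
# IPMITOOL is a hypothetical override for previewing: IPMITOOL="echo ipmitool"
set_fan_thresholds() {
    for fan in "$@"; do
        ${IPMITOOL:-ipmitool} sensor thresh "$fan" lower 300 300 400 || return 1
    done
}
# Usage: set_fan_thresholds FAN1 FAN2 FAN5
```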

Fix Apache Permission Denied errors

The other day I ran the rsync command to migrate files from an old webserver to a new one. What I didn’t notice right away was that the rsync changed the permissions of the folder I was copying into.

The problem presented itself with a very lovely 403 forbidden error message when trying to access any website that server hosted. Checking the logs (/var/log/apache2/error.log on my Debian system) revealed this curious message:

[error] [client] (13)Permission denied: access to / denied

This made it look like apache was denying access for some reason. I verified apache config and confirmed it shouldn’t be denying anything. After some head scratching I came across this site which explained that Apache throws that error when it encounters filesystem access denied error messages.

I was confused because /var/www, where the websites live, had the appropriate permissions. After some digging I found that the culprit in my case was not /var/www, but rather its parent directory, /var. For some reason the rsync had changed /var to not have any execute permissions (necessary for folder access.)  A simple

chmod o+rx /var/

resolved my problem. Next time you get 403 it could be underlying filesystem issues and not apache at all.
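
To find which directory in a path is the culprit, you can walk up from the web root and flag every directory that lacks the execute bit for "other". A sketch, assuming GNU stat and that the web server user reaches these directories via the "other" permission bits; the function name is made up.

```shell
# Walk from the given path up toward / and print every directory that
# lacks the execute (traverse) bit for "other".
find_blocked_dirs() {
    p=$1
    while [ -n "$p" ] && [ "$p" != "/" ] && [ "$p" != "." ]; do
        mode=$(stat -c '%A' "$p") || return 1
        case $mode in
            d????????[xt]*) ;;                      # traversable by others
            d*) printf '%s %s\n' "$mode" "$p" ;;    # blocked directory
        esac
        p=$(dirname "$p")
    done
}
# Usage: find_blocked_dirs /var/www
```

An empty result means every directory on the path is traversable, so the 403 is more likely coming from the apache config itself.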