Add backup link for Proxmox cluster

I kept running into a frustrating issue: whenever one host in my 3-node mesh network became unresponsive, the 40gig link would go down for all hosts. The healthy nodes would then fence themselves, taking the whole cluster down until I rebooted them. I needed a way to fall back to my 1gig links if the 40gig failed.

I landed on this writeup, which is pretty straightforward. To add secondary links after the cluster has been created, you simply need to modify /etc/pve/corosync.conf: increment config_version, add a backup address for each node (ring1_addr), and add the new link to the totem section (an interface block with linknumber: 1). The knet_link_priority numbers are arbitrary; the link with the higher number is preferred.

root@pve-a:/etc/pve# cat corosync.conf
logging {
  debug: off
  to_syslog: yes
}

nodelist {
  node {
    name: pve-a
    nodeid: 1
    quorum_votes: 1
    ring0_addr: 192.168.255.11
    ring1_addr: 192.168.4.111
  }
  node {
    name: pve-b
    nodeid: 2
    quorum_votes: 1
    ring0_addr: 192.168.255.12
    ring1_addr: 192.168.4.112
  }
  node {
    name: pve-c
    nodeid: 3
    quorum_votes: 1
    ring0_addr: 192.168.255.13
    ring1_addr: 192.168.4.113
  }
}

quorum {
  provider: corosync_votequorum
}

totem {
  cluster_name: pve-cluster-a
  config_version: 4
  interface {
    linknumber: 0
    knet_link_priority: 255
  }
  interface {
    linknumber: 1
    knet_link_priority: 4
  }
  ip_version: ipv4-6
  link_mode: passive
  secauth: on
  version: 2
}
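
A cautious way to make this edit is to work on a copy and only move it into place once it looks right. Here is a minimal sketch, assuming GNU sed as shipped with Proxmox (Debian). The helper name bump_config_version and the .new/.bak file names are made up for illustration and are not part of any Proxmox tooling.

```shell
# Hypothetical helper: increment the config_version value in a corosync.conf copy.
bump_config_version() {
  local file="$1" current
  # Read the current version number from the first config_version line.
  current=$(awk '$1 == "config_version:" {print $2; exit}' "$file")
  [ -n "$current" ] || { echo "config_version not found in $file" >&2; return 1; }
  # Rewrite that line in place with the incremented number (GNU sed -i).
  sed -i "s/^\([[:space:]]*config_version:\)[[:space:]]*[0-9][0-9]*/\1 $((current + 1))/" "$file"
}

# Intended use on a cluster node (commented out because it touches the live config):
#   cp /etc/pve/corosync.conf /etc/pve/corosync.conf.new
#   bump_config_version /etc/pve/corosync.conf.new
#   nano /etc/pve/corosync.conf.new    # add the ring1_addr lines and the second interface block
#   cp /etc/pve/corosync.conf /etc/pve/corosync.conf.bak
#   mv /etc/pve/corosync.conf.new /etc/pve/corosync.conf
#   corosync-cfgtool -s                # inspect link status on this node
```

Working on a copy means a half-finished edit never becomes the live config, and the .bak file gives you something to roll back to.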

If you get an error that the file is read-only, make sure your cluster has quorum: pvecm status | grep Quorate
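
The same check can be wrapped so a script refuses to continue without quorum. This is a rough sketch that assumes pvecm status prints a line of the form "Quorate: Yes"; the helper name is_quorate is hypothetical.

```shell
# Hypothetical helper: succeed only if the status text on stdin reports "Quorate: Yes".
is_quorate() {
  grep -Eq '^Quorate:[[:space:]]+Yes'
}

# Intended use on a cluster node (commented out because it needs pvecm):
#   if pvecm status | is_quorate; then
#     echo "cluster is quorate, /etc/pve should be writable"
#   else
#     echo "no quorum, /etc/pve will be read-only" >&2
#   fi
```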

The failures that triggered this for me were complete system hangs. A sudden power-off didn’t cause it; frr already handles that fine. For whatever reason, a system lockup caused frr to lose its mind and crash. The cluster kept working, but vtysh would simply hang. Once the frozen host was rebooted, the 40gig links would fail completely, and I would have to restart the frr service to bring things back up. As a stopgap, I added Restart=always to the frr systemd service file as per this writeup.

systemctl edit frr
[Service]
Restart=always
RestartSec=5s
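
If you would rather script this than use the interactive editor, the same override can be written as a drop-in file, which is where systemctl edit stores it anyway. A minimal sketch follows; the helper name write_restart_dropin and its optional base-directory argument are hypothetical and exist only so it can be tried against a scratch directory first.

```shell
# Hypothetical helper: write a restart drop-in for a unit under a given systemd directory.
write_restart_dropin() {
  local unit="$1" base="${2:-/etc/systemd/system}"
  local dir="$base/$unit.service.d"
  mkdir -p "$dir"
  # Same contents as the override entered via "systemctl edit frr".
  cat > "$dir/override.conf" <<'EOF'
[Service]
Restart=always
RestartSec=5s
EOF
}

# Intended use (commented out because it changes the running system):
#   write_restart_dropin frr
#   systemctl daemon-reload
#   systemctl show frr --property=Restart    # inspect the effective restart policy
```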

After implementing a backup interface, the next system hang did not take down the cluster!