I came home yesterday to discover that every last one of my VMs was unresponsive. It was most distressing. I couldn’t even SSH into my XenServer host, which was unresponsive too. Its physical console had dropped into an emergency shell. A reboot gave me a working physical console again, but my networking and VMs would not start.
In trying to pick up the pieces and put everything back together I ran
systemctl --failed
which revealed several key services not running, namely openvswitch and xapi (both very important). Manually starting them did nothing: they would fail silently and exit immediately.
After banging my head against a wall for a bit (I really didn’t want to restore from backup) I stumbled across this post. It states, in essence, that xapi won’t start if the disk is full. I checked disk usage and it said I had a few gigs free, but I thought I’d try the steps in the post anyway.
ls /var/log
revealed quite a lot of log files. I then decided to just delete all the .gz archived logs:
rm /var/log/*.gz
After doing this, xapi started. I restarted the hypervisor for good measure and everything came up – all back to normal as if nothing had happened.
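For anyone in the same spot, here is a rough sketch of the checks worth running before deleting anything. Free space on one filesystem says nothing about another, and a filesystem can also run out of inodes while still showing free space, so it helps to look at the filesystem that actually holds /var/log. The function name report_log_usage is made up for illustration, and the sketch assumes GNU-style df, du and find are available on the host.

```shell
#!/bin/sh
# Sketch: summarize disk usage for a log directory (defaults to /var/log).
# report_log_usage is a hypothetical helper name, not part of XenServer.
report_log_usage() {
    dir="${1:-/var/log}"

    echo "== Free space on the filesystem holding $dir =="
    df -h "$dir"

    echo "== Free inodes on the same filesystem =="
    df -i "$dir"

    echo "== Total size of $dir =="
    du -sh "$dir"

    echo "== Ten largest files directly inside $dir (sizes in KiB) =="
    find "$dir" -maxdepth 1 -type f -exec du -k {} + | sort -rn | head -n 10
}
```

Calling report_log_usage with no arguments inspects /var/log; passing another directory as the first argument inspects that one instead.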
It’s incredibly frustrating that XenServer is designed to be a ticking time bomb in its default configuration. If you don’t take care to manually delete old logs, or alternatively send logs to a remote log server, it will crash and burn. This is stupid. That being said, I was impressed that it recovered so gracefully once I freed up some disk space.
If you’re running XenServer, make sure you’re logging somewhere else, or set up a cron job to delete old log files!
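A minimal sketch of what such a cleanup job could look like. It only touches compressed archives (the same *.gz files I deleted by hand) and only those older than a cutoff. The function name prune_old_logs and the 14-day default are my own assumptions, not anything XenServer ships, and it relies on find supporting -maxdepth and -delete (GNU find does).

```shell
#!/bin/sh
# Sketch: delete compressed log archives older than a given number of days.
# prune_old_logs is a hypothetical helper; the 14-day default is an assumption.
prune_old_logs() {
    dir="${1:-/var/log}"
    days="${2:-14}"

    # -mtime +N matches files whose age, counted in whole 24-hour periods,
    # is greater than N. -print lists each file just before it is removed.
    find "$dir" -maxdepth 1 -type f -name '*.gz' -mtime +"$days" -print -delete
}
```

To use it from cron, save the function in an executable script, add a final line that calls prune_old_logs, and schedule that script to run daily. Uncompressed, live log files are left alone, so this will not help if a single active log grows without bound.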