Windows VM with GTX 1070 GPU passthrough in ProxMox 5

I started this blog four years ago to document my highly technical adventures – mainly so I could reproduce them later. One of my first articles dealt with GPU passthrough / virtualization. It was a complicated ordeal with Xen. Now that I’ve switched to KVM (ProxMox) I thought I’d give it another go. It’s still complicated but not nearly as much this time.

To get my Nvidia GTX 1070 GPU properly passed through to a Windows VM hosted by ProxMox 5 I simply followed this excellent guide written by sshaikh. I will summarize what I took from his guide to get my setup to work.

  1. Ensure VT-d is supported and enabled in the BIOS
  2. Enable IOMMU on the host
    1. append the following to the GRUB_CMDLINE_LINUX_DEFAULT line in /etc/default/grub
      intel_iommu=on
    2. Save your changes by running
      update-grub
  3. Blacklist NVIDIA & Nouveau kernel modules so they don’t get loaded at boot
    1. echo "blacklist nouveau" >> /etc/modprobe.d/blacklist.conf
      echo "blacklist nvidia" >> /etc/modprobe.d/blacklist.conf
    2. Save your changes by running
      update-initramfs -u
  4. Add the following lines to /etc/modules
    vfio
    vfio_iommu_type1
    vfio_pci
    vfio_virqfd
  5. Determine the PCI address of your GPU
    1. Run
      lspci -v

      and look for your card. Mine was 01:00.0 & 01:00.1. You can omit the part after the decimal to include them both in one go – so in that case it would be 01:00

    2. Run lspci -n -s <PCI address> to obtain vendor IDs. Example :
      lspci -n -s 01:00
      01:00.0 0300: 10de:1b81 (rev a1)
      01:00.1 0403: 10de:10f0 (rev a1)
  6. Assign your GPU to vfio driver using the IDs obtained above. Example:
    echo "options vfio-pci ids=10de:1b81,10de:10f0" > /etc/modprobe.d/vfio.conf
  7. Reboot the host
  8. Create your Windows VM using the UEFI bios hardware option (not the deafoult seabios) but do not start it yet. Modify /etc/pve/qemu-server/<vmid>.conf and ensure the following are in the file. Create / modify existing entries as necessary.
    bios: ovmf
    machine: q35
    cpu: host,hidden=1
    numa: 1
  9. Install Windows, including VirtIO drivers. Be sure to enable Remote desktop.
  10. Pass through the GPU.
    1. Modify /etc/pve/qemu-server/<vmid>.conf and add
      hostpci0: <device address>,x-vga=on,pcie=1. Example

      hostpci0: 01:00,x-vga=on,pcie=1
  11. Profit.

Troubleshooting

Code 43

I received the dreaded code 43 error after installing CUDA drivers. The workaround was to add hidden=1 to the CPU option of the VM:

cpu: host,hidden=1

Blue screening when launching certain games

Heroes of the Storm and Starcraft II would consistently blue screen on me with the following error:

kmode_exception_not_handled

The fix as outlined here was to create /etc/modprobe.d/kvm.conf and add the parameter “options kvm ignore_msrs=1”

echo "options kvm ignore_msrs=1" > /etc/modprobe.d/kvm.conf

Update 4/9/18: Blue screening happens to Windows 10 1803 as well with the error

System Thread Exception Not Handled

The fix for this is the same – ignore_msrs=1

GPU optimization:

Give as many CPUs as the host (in my case 8) and then enable NUMA for the CPU. This appeared to make my GTX 1070 perform better in the VM – near native performance.

ZFS delete oldest n snapshots

I came across a need to trim old ZFS snapshots. These are my quick and dirty notes on how I accomplished it.

Basic syntax taken from here:

 zfs list -H -t snapshot -o name -S creation -r <dataset name> | tail -10

You can omit the -r <dataset name> if you want to query snapshots over all your datasets. Change the tail number for the desired number of oldest snapshots.

You can pass this over to actually delete snapshots using the xargs command:

zfs list -H -t snapshot -o name -S creation -r <dataset name> | tail -10 | xargs -n 1 zfs  destroy

I came across an odd error message when trying to delete some old snapshots:

Can't delete snapshot: dataset busy

I discovered here that that means the snapshots have a hold on them. I read ZFS documentation to learn how to release the holds:

zfs release -r <tag name> <snapshot name>

After massaging these commands for a bit I was able to free up some needed space by removing ancient snapshots.

Proxmox VM migration failed found stale volume copy

Recently I had a few VMs on shared storage I couldn’t live migrate. The cryptic error messages made it sound like local LVM was required, even though in the GUI all I could see was shared storage for the VM. The errors I kept getting were like this one:

volume pve/vm-103-disk-1 already exists
command 'dd 'if=/dev/pve/vm-103-disk-1' 'bs=64k'' failed: got signal 13
send/receive failed, cleaning up snapshot(s)..
ERROR: Failed to sync data - command 'set -o ...' failed: exit code 255
aborting phase 1 - cleanup resources
ERROR: found stale volume copy 'local-lvm:vm-103-disk-1' on node 'nick-desktop'
ERROR: migration aborted (duration 00:00:01): Failed to sync data - command 'set -o pipefail ...' failed: exit code 255
TASK ERROR: migration aborted

After a ton of digging I found this forum post that had the solution:

Most likely there is some stale disk somewhere. Try to run:
# qm rescan –vmid 101

That indeed was the problem. I ran

qm rescan –vmid 103

on the node in question, then refreshed the management page. After doing that, a ‘phantom’ disk entry showed up for the VM. I deleted it, but then had to run another qm –rescan –vmid103 before it would migrate.

So to recap, run qm rescan –vmid (vmid#) once, then delete the stale disk that shows up, then run that same command again.

Fix wordpress PHP change was reverted error

Since WordPress 4.9 I’ve had a peculiar issue when trying to edit theme files using the web GUI. Whenever I tried to save changes I would get this error message:

Unable to communicate back with site to check for fatal errors, so the PHP change was reverted. You will need to upload your PHP file change by some other means, such as by using SFTP.

After following this long thread I saw the suggestion to install and use the Health Check plugin to get more information into why this is happening. In my case I kept getting this error message:

The loopback request to your site failed, this may prevent WP_Cron from working, along with theme and plugin editors.<br>Error encountered: (0) cURL error 28: Connection timed out after 10001 milliseconds

I researched what a loopback request is in this case. It’s the webserver reaching out to its own site’s url to talk to itself. My webserver was being denied internet access, which included its own URL, so it couldn’t complete the loopback request.

One solution, mentioned here, is to edit the hosts file on your webserver to point to 127.0.0.1 for the URL of your site. My solution was to open up the firewall to allow my server to connect to its URL. I then ran into a different problem:

The loopback request to your site failed, this may prevent WP_Cron from working, along with theme and plugin editors.<br>Error encountered: (0) cURL error 60: Peer's Certificate issuer is not recognized.

After digging for a while I found this site which explains how to edit php.ini to point to an acceptable certificate list. To fix this on my Cent7 machine I edited /etc/php.ini and added this line (you could also add it to /etc/php.d/curl.ini)

curl.cainfo="/etc/pki/tls/cert.pem"

This caused php’s curl module to use the same certificate trust store that the underlying OS uses.

Then restart php-fpm if you’re using it:

sudo systemctl restart php-fpm

Success! Loopback connections now work properly.