Tag Archives: migration

migrate between amd & intel in proxmox

I recently acquired an Intel based server and plugged it into my AMD-based Proxmox cluster. I ran into an issue transferring from AMD to Intel boxes (the other direction worked fine.) After a few moments, every VM that moved from AMD to Intel would kernel panic.

Fortunately I found here that the fix is to add a few custom CPU flags to your VMs. Once I did this they could move back and forth freely (assuming they had the kvm64 CPU assigned to them – host obviously won’t work.)

qm set *VMID* --args "-cpu 'kvm64,+ssse3,+sse4.1,+sse4.2,+x2apic'"

Proxmox HA management script

I was a bit frustrated at the lack of certain functions of ProxMox. I wanted an easy way to tag a VM and manage that tag as a group. My solution was to create HA groups for VMs with various functions. I can then manage the group and tell them to migrate storage or turn off & on.

I wrote a script to facilitate this. Right now it only covers powering on, powering off, and migrating the location of the primary disk, but more is to come.

Here’s what I have so far:

#!/bin/bash
#Proxmox HA management script
#Migrates storage, starts, or stops Proxmox HA groups based on the name and function passed to it.
#Usage: manage-HA-group.sh <start|stop|migrate> <ha-group-name> [local|remote]

#Change to the name of your local storage (for migrating from remote to local storage)
LOCAL_STORAGE_NAME="pve-1TB"

function get_vm_name() {
    #Determine the name of the VMID passed to this function
    VM_NAME=$(qm config "$1" | grep '^name:' | awk '{print $2}')
}

function get_group_VMIDs() {
    #Get a list of VMIDs based on the name of the HA group passed to this function
    group_VMIDs=$(ha-manager config | grep -B1 "$1" | grep vm: | sed 's/vm://g')
}

function group_power_state() {
    #Loop through members of HA group passed to this function
    for group in "$1" 
    do
        get_group_VMIDs "$group"
        for VM in $group_VMIDs
        do
            get_vm_name "$VM"
            echo "$OPERATION $VM_NAME in HA group $group"
            ha-manager set $VM --state $VM_STATE
        done
    done
}

function group_migrate() {
    #This function migrates the VM's first disk (scsi0) to the specified location (local/remote)
    #TODO String to determine all disk IDs:  qm config 115 | grep '^scsi[0-9]:' | tr -d ':' | awk '{print $1}'
    disk="scsi0"    

    #Loop through each VM in specified group name (second argument passed on CLI)
    for group in "$2" 
    do
        get_group_VMIDs "$group"
        for VM in $group_VMIDs
        do
            #Determine the names of each VM in the HA group
            get_vm_name "$VM"

            #Set storage location based on argument
            if [[ "$3" == "remote" ]]; then
                storage="$VM_NAME"
            else
                storage="$LOCAL_STORAGE_NAME"
            fi

            #Move primary disk to desired location
            echo "Migrating $VM_NAME to "$3" storage"
            qm move_disk $VM $disk $storage --delete=1

        done
    done
}

case "$1" in 
    start)
        VM_STATE="started"
        OPERATION="Starting"
        group_power_state "$2" 
        ;;
    stop)
        VM_STATE="stopped"
        OPERATION="Stopping"
        group_power_state "$2"
        ;;
    migrate)
        case "$3" in
            local|remote)
                group_migrate "$@"
                ;;
            *)
                echo "Usage: manage-HA-group.sh migrate <ha-group-name> <local|remote>"
                ;;
        esac        
    ;;
    *)
        echo "Usage: manage-HA-group.sh <start|stop|migrate> <ha-group-name> [local|remote]"
        exit 1
        ;;
esac

Upgrading AWX

AWX is the open source version of Ansible Tower. It’s a powerful tool, but unfortunately AWX has no in place upgrade capability. If you want to upgrade your AWX to the latest version it takes a bit of trickery (the easy way out being just to pay for Ansible Tower.)

Essentially to upgrade AWX you need to spin up a completely new instance and then migrate your data over to it. Fortunately there is a script out there that makes doing this a bit easier.

Below are my notes for how I upgraded my instance of AWX from version 1.0.6 to 2.1.0.

Create temporary AWX migration server

Spin up new server with ansible installed, then clone AWX

git clone https://github.com/ansible/awx.git 
cd awx 
git clone https://github.com/ansible/awx-logos.git

Modify AWX install to expose 5432 externally:

edit installer/roles/local_docker/tasks/standalone.yml and add

    ports:
      - "5432:5432" 

right above the when: pg_hostname is not defined or pg_hostname == '' line. Complete stanza looks like this:

- name: Activate postgres container
  docker_container:
    name: postgres
    state: started
    restart_policy: unless-stopped
    image: "{{ postgresql_image }}"
    volumes:
      - "{{ postgres_data_dir }}:/var/lib/postgresql/data:Z"
    env:
      POSTGRES_USER: "{{ pg_username }}"
      POSTGRES_PASSWORD: "{{ pg_password }}"
      POSTGRES_DB: "{{ pg_database }}"
      PGDATA: "/var/lib/postgresql/data/pgdata"
    ports:
      - "5432:5432"
  when: pg_hostname is not defined or pg_hostname == ''
  register: postgres_container_activate

Make sure you have port 5432 open on your host-based firewall.

Install AWX on the new host. Verify you can log into the empty instance and that it’s the version you want to upgrade to.

Prepare original AWX server to send

Kill the AWX postgres container on the source machine, and re-run awx installer after modifying it to expose its postgres port as described above.

Install tower-cli (this can be on either source or destination servers)

sudo pip install ansible-tower-cli

Configure tower-cli

tower-cli config username SRC_AWX_USERNAME
towercli config password SRC_AWX_PASSWORD
towercli config host SRC_AWX_HOST

Make sure to use full ansible URL as accessed from a browser for both source and destination

Install awx-migrate:

git clone https://github.com/autops/awx-migrate.git

Update awx-migrate/awx-migrate-wrapper with correct source and destination info

Run awx-migrate-wrapper. It will generate json files with your configuration.

Migrate database to temporary server

Modify tower-cli config, set host, username and password to that of the destination AWX instance

tower-cli config username DEST_AWX_USERNAME
towercli config password DEST_AWX_PASSWORD
towercli config host: DEST_AWX_HOST

Send JSON info to destination:

tower-cli send awx-data.json

You will now have a fresh new, updated AWX instance working, with imported database, on the destination host. Confirm you can log into it with the admin account you set it up with.

Prepare original AWX server to receive

Now, on the source, remove  the old AWX docker containers:

sudo docker rm -f postgres awx_task awx_web memcached rabbitmq

Move / delete the database folder the postgres docker container was using (as defined in awx installer inventory) in my case:

/var/lib/awx
/var/db/pgsqldocker

Remove and re-install AWX folder with a fresh git checkout

rm -rf awx
git clone https://github.com/ansible/awx.git
cd awx
git clone https://github.com/ansible/awx-logos.git

Re-run the AWX installer to re-create a blank database on the source host, modify the new awx/installer/inventory as needed. Also modify installer/roles/local_docker/tasks/standalone.yml as outlined above.

cd awx/installer
sudo ansible-playbook -i inventory install.yml

Migrate from temporary AWX server back to source AWX server

Once a new, empty version of awx is running on the source host,  start the awx-migrate process in reverse to migrate the database on the destination instance back to the source. Modify awx-migrate-wrapper and tower-cli to switch src and destination (the destination has become the source and the source has become the destination)

Use awx-migrate-wrapper to generate  new ansible version json files (don’t confuse them with the old json files – best to delete / move all json files before running awx-migrate-wrapper)

Modify tower-cli to point to original AWX URL

Run tower-cli send awx-data.json

Once completed, log in as the admin account. Input LDAP BIND password under settings, then delete any imported LDAP users.

Cleanup

You may want to remove the exposed postgres database ports. Simply undo the changes you made in awx/installer/roles/local_docker/tasks/standalone.yml to remove the Ports part of the first play, then remove your postgres container and re-install AWX with install.yml

Also remember to delete the JSON files generated with awx-migrate as they contain all your credentials in plaintext.

Success.

 

Free up RAM after Proxmox live migration

I ran into an issue where after migrating a bunch of VMs off of one of  my hosts, the remaining VMs on it refused to turn on. Every time I tried the command would hang for a while and eventually error out with this message

TASK ERROR: start failed: command '/usr/bin/kvm -id <truncated>... ' failed: got timeout

I suspected this might be due to RAM use and sure enough the usage was too high for a system that didn’t have any VMs running on it.  I found here that I could run a command to flush the cache:

echo 3 > /proc/sys/vm/drop_caches

That caused the RAM usage to go down but the symptom of the VM not starting remained. I then saw the KSM sharing still had some memory in it. I decided to restart the KSM sharing service:

sudo systemctl restart ksmtuned

After running that the VM started!

Proxmox VM migration failed found stale volume copy

Recently I had a few VMs on shared storage I couldn’t live migrate. The cryptic error messages made it sound like local LVM was required, even though in the GUI all I could see was shared storage for the VM. The errors I kept getting were like this one:

volume pve/vm-103-disk-1 already exists
command 'dd 'if=/dev/pve/vm-103-disk-1' 'bs=64k'' failed: got signal 13
send/receive failed, cleaning up snapshot(s)..
ERROR: Failed to sync data - command 'set -o ...' failed: exit code 255
aborting phase 1 - cleanup resources
ERROR: found stale volume copy 'local-lvm:vm-103-disk-1' on node 'nick-desktop'
ERROR: migration aborted (duration 00:00:01): Failed to sync data - command 'set -o pipefail ...' failed: exit code 255
TASK ERROR: migration aborted

After a ton of digging I found this forum post that had the solution:

Most likely there is some stale disk somewhere. Try to run:
# qm rescan –vmid 101

That indeed was the problem. I ran

qm rescan –vmid 103

on the node in question, then refreshed the management page. After doing that, a ‘phantom’ disk entry showed up for the VM. I deleted it, but then had to run another qm –rescan –vmid103 before it would migrate.

So to recap, run qm rescan –vmid (vmid#) once, then delete the stale disk that shows up, then run that same command again.

Migrate from Xenserver to Proxmox

I was dismayed to see Citrix’s recent announcement about Xenserver 7.3 removing several key features from the free version. Xenserver’s free features are the reason I switched over to them in the first place back in 2014. Xenserver has been rock solid; I haven’t had any complaints until now. Their removal of xenmotion and migration in the free version forced me to look elsewhere for my virtualization needs.

I’ve settled on ProxMox, which is KVM based. Their documentation is excellent and it has all the features I need – for free. I’m also in love with their web based management – no more Windows fat client!

Below are my notes on how I successfully migrated all my Xenserver VMs over to the ProxMox Virtual Environment (PVE).

  • Any changes to network interfaces, such as bringing them up, require a reboot of the host
  • If you have an existing ISO share, you can create a directory called  “template” in your ISO repository folder, then inside symlink “iso” back to your ISO folder. Proxmox looks inside template/iso for ISO images for whatever storage you configure.
  • Do not create your ProxMox host with ZFS unless you have tons of RAM. If you don’t have enough RAM you will run into huge CPU load times making the system unresponsive in cases of high disk load, such as VM copies / backups. More reading here.

Cluster of two:

ProxMox’s clustering is a bit different – better, in my opinion. No more master, slave dynamic – ever node is a master. Important reading: https://pve.proxmox.com/wiki/Proxmox_VE_4.x_Cluster

If you have two node cluster, like I do, it creates some problems, though. If one goes down, the other can’t do anything to the pool (create VM, backup) until it comes back up. In my situation I have one primary host that is up all the time and I bring the secondary host up only when I want to do maintenance on the first.

In that specific situation you can still designate a “master” of sorts by increasing the number of quorum votes it gets from 1 to 2.  That way when the secondary node is down, the primary node can still do cluster operations because the default number of votes to stay quorate is 2. See here for more reading on the subject.

On either host (they must both be up and in the cluster for this to work)

vi /etc/pve/corosync.conf

Find your primary server in the nodelist settings and change

quorum_votes: 2

Also find the quorum section and add expected_votes: 2

Make sure to increment config_version number (bottom of the file.) Now if your secondary is down you can still operate the primary.

Migrating VMs

I migrated my Xen VMs to KVM by creating VMs with identical specs in PVE, copying the VHD files from the Xen host to the new PVE host, running qemu-img to convert them to RAW format, and then using dd to copy the raw information over to corresponding empty VM  disks. Depending on the OS of the VM there was some after-copy tweaking I also had to do.

From shared storage

Grab the VHD file (quiesce any snapshots away first) of each xen VM and convert them to raw format

qemu-img convert <VHD_FILE_NAME>.vhd -O raw <RAW_FILE_NAME>.raw

Create a new VM with identical configuration, especially disk size. Go to the hardware tab and take note of the name of the disk. For example, one of mine was:

local-zfs:vm-100-disk-1,discard=on,size=40G

The interesting part is between local-zfs and discard=on, namely vm-100-disk-1. This is the name of the disk we want to overwrite with data from our Xenserver VM’s disk.

Next figure out the full path of this disk on your proxmox host

find / -name vm-100-disk-1*

The result in my case was /dev/zvol/rpool/data/vm-100-disk-1

Take the name and put it in the following command to complete the process:

dd if=<RAW_FILE_NAME>.raw of=/dev/zvol/rpool/data/vm-100-disk-1 bs=16M

Once that’s done you can delete your .vhd and .raw files.

From local / LVM storage

In case your Xen VMs are stored in LVM device format instead of a VHD file, get UUID of storage by doing xe vdi-list and finding the name of the hard disk from the VM you want. It’s helpful to rename the hard disks to something easy to spot. I chose the word migrate.

xe vdi-list|grep -B3 migrate
uuid ( RO) : a466ae1b-80c7-4ef2-91a3-5c1ba1f6fc2f
 name-label ( RW):  migrate

Once you have the UUID of the drive, you can use lvscan to find the full LVM device path of that disk:

lvscan|grep a466ae1b-80c7-4ef2-91a3-5c1ba1f6fc2f
 inactive '/dev/VG_XenStorage-1ada0a08-7e6d-a5b6-d0b4-515e251c0c75/VHD-a466ae1b-80c7-4ef2-91a3-5c1ba1f6fc2f' [10.03 GiB] inherit

Shut down the corresponding VM and reactivate its logical volume (xen deactivates LVMs if the VM is shut off:

lvchange -ay <full /dev/VG_XenStorage path discovered above>

Now that we have the full LVM path and the volume is active, we can use dd over SSH to transfer the image to our proxmox server:

sudo dd if=<full /dev/VG/Xenstorage path discovered above> | ssh <IP_OF_PROXMOX_SERVER> dd of=<LOCATION_ON_PROXMOX_THAT_HAS_ENOUGH_SPACE>/<NAME_OF_VDI_FILE>.vhd

then follow vhd -> raw -> dd to proxmox drive process described in the From Shared Storage section.

Post-Migration tweaks

For the most part Debian-based systems moved over perfectly without any needed tweaks; Some VMs changed interface names due to network device changes. eth0 turned into ens8. I had to modify /etc/network/interfaces to change eth0 to ens8 to get virtio networking working.

CentOS

All my CentOS VMs failed to boot after migration due to a lack of virtio disk drivers in the initial RAM disk. The fix is to change the disk hardware to IDE mode (they boot fine this way) and then modify the initrd of each affected host:

sudo dracut --add-drivers "virtio_pci virtio_blk virtio_scsi virtio_net virtio_ring virtio" -f -v /boot/initramfs-`uname -r`.img `uname -r`
sudo sh -c "echo 'add_drivers+=\" virtio_pci virtio_blk virtio_scsi virtio_net virtio_ring virtio \"' >> /etc/dracut.conf"
sudo shutdown -h now

Once that’s done you can detach the hard disk and re-attach it back as SCSI (virtio) mode. Don’t forget to modify the options and change the boot order from ide0 to scsi0

Arch Linux

One of my Arch VMs had UUID configured which complicated things. The root device UUID changes in KVM virtio vs IDE mode. The easiest way to fix it is to boot this VM into an Arch install CD. Mount the root partition and then run arch-chroot /mnt/sda1. Once in the chroot runpacman -Sy kernel to reinstall the kernel and generate appropriate kernel modules.

mount /dev/sda1 /mnt
arch-chroot /mnt
pacman -Sy kernel

Also make sure to modify /etc/fstab to reflect appropriate device id or UUID (xen used /dev/xvda1, kvm /dev/sda1)

Windows

Create your Windows VM using non-virtio drivers (default settings in PVE.) Obtain the latest windows virtio drivers here and extract them somewhere memorable. Switch everything but the disk over to Virtio in the VM’s hardware config and reboot the VM. Go into device manager and point to extracted driver location for each unknown device.

To get Virtio disk to work, add a new disk to the VM of any size and SCSI (virtio) type. Boot the Windows VM and install drivers for that drive. Then shut down, remove that second drive, detach the primary drive and change to virtio SCSI. It should then come up with full virtio drivers.

All hosts

KVM has a guest agent like xenserver does called qemu-agent. Turn it on in VM options and install qemu-guest-agent in your guest. This KVM a bit more insight into your host.

Determine which VMs need guest agent installed:

qm agent $id ping

If nothing is returned, it means qemu-agent is working. You can test all your VMs at once with this one-liner (change your starting and finishing VM IDs as appropriate)

for id in {100..114}; do echo $id; qm agent $id ping; done

This little one-liner will output the VM ID it’s trying to ping and will return any errors it finds. No errors means everything is working.

Disable support nag

PVE has a support model and will nag you at each login. If you don’t like this you can change it like so (the line number might be different depending on which version you’re running:

vi +850 /usr/share/pve-manager/js/pvemanagerlib.js

Modify the line if (data.status !== ‘Active’); change it to

if (false)

Troubleshooting

Remove a failed node

See here: https://pve.proxmox.com/wiki/Proxmox_VE_4.x_Cluster#Remove_a_cluster_node

systemctl stop pvestatd.service
systemctl stop pvedaemon.service
systemctl stop pve-cluster.service
rm -r /etc/corosync/*
rm -r /var/lib/pve-cluster/*
reboot

Quorum never establishes / takes forever

I had a really strange issue where I was able to establish quorum with a second node, but after a reboot quorum never happened again. I re-installed that second node and re-joined it several times but I never got past the “waiting for quorum….” stage.

After much research I came across this article which explained what was happening. Corosync uses multicast to establish cluster quorum. Many switches (including mine) have a feature called IGMP snooping, which, without an IGMP querier, essentially means multicast never happens. Sure enough, after logging into my switches and disabling IGMP snooping, quorum was instantly established. The article above says this is not recommended, but in my small home lab it hasn’t produced any ill effects. Your mileage may vary. You can also configure your cluster to use unicast instead.

USB Passthrough not working properly

With Xenserver I was able to pass through the USB controller of my host to the guest (a JMICRON USB to ATAATAPI bridge holding a 4 disk bay.) I ran into issues with PVE, though. Using the GUI to pass the USB device did not work. Manually adding PCI passthrough directives (hostpci0: 00:14.1) didn’t work. I finally found on a little nugget on the PCI Passthrough page about how you can simply pass the entire device and not the function like I had in Xenserver. So instead of doing hostpci0: 00:14.1, I simply did hostpci0: 00:14 . That  helped a little bit, but I was still unable to fully use these drives simultaneously.

My solution was eventually to abandon PCI passthrough altogether in favor of just passing individual disks to the guest as outlined here.

Find the ID of the desired disks by issuing ls -l /dev/disk/by-id. You only need to know the UUIDs of the disks, not the partitions. Then modify the KVM config of your desired host (mine was located at /etc/pve/qemu-server/101.conf) and a new line for each disk, adjusting scsi device numbers and UUIDs to match:

scsi5: /dev/disk/by-id/scsi-SATA_ST5000VN000-1H4_Z111111

With that direct disk access everything is working splendidly in my FreeNAS VM.

Transfer VMs between Xenserver pools

When Xenserver 7 came out I found myself unable to easily upgrade to it thanks to my custom RAID 1 build.  If I wanted Xenserver 7 I would have to blow the whole instance away and start from scratch. This posed a problem because I have a pool of 2 xenserver hosts. You cannot add a server with a higher xenserver version to a lower versioned pool; the pool master must always have the highest version of Xenserver installed. My decision to have an mdadm RAID 1 setup on my pool master ultimately turned into forced VM downtime for an upgrade despite having a pool of other xenserver hosts.

After transferring VMs to my secondary host and promoting it to pool master, I wiped my primary xenserver and installed 7. When it was up and running I essentially had two separate pools running. To transfer my VMs back to my primary server I had to resort to the command line.

Offline VM transfer

The xe vm-export and vm-import commands work with stdin/out and piping. This is how I accomplished transferring my VMs directly between two pools. Simply pipe xe vm-export commands with an ssh xe vm-import command like so:

xe vm-export uuid=<VM_UUID> filename= | ssh <other_server> xe vm-import filename=/dev/stdin

Note the lack of a filename – this instructs xenserver to pipe to standard output instead. Also note that transferring the VM scrambles the MAC addresses of its interfaces. If you want to keep the MAC address you’ll have to manually re-assign it after the copy is complete.

Minimal downtime

For the method above you will have to turn the VM off in order to transfer it.  I had some VMs that I didn’t want to stay down for the entire transfer. A way around this is to take a snapshot of the VM and then copy the snapshot to the other pool. Note that this method does not retain any changes made inside the VM that occurred after you took the snapshot. You will have to manually transfer any file changes that took place during the VM transfer (or be fine with losing them.)

In order to export a snapshot you must first convert it to a VM from a template (thanks to this site for outlining how.) The full procedure is as follows:

  1. Take a snapshot of the VM you want to move
    xe vm-snapshot uuid=<VM_UUID> new-name-label=<snapshotname>
  2. Convert the snashot template to a VM (the command xe snapshot-list is a handy way to obtain UUIDs of your snapshots)
    xe template-param-set is-a-template=false ha-always-run=false uuid=<UUID of snapshot>
  3. Transfer the template to the new pool
    xe vm-export uuid=<UUID of snapshot> filename= | ssh <other_server> xe vm-import filename=/dev/stdin
  4. Rename VM and/or modify interface MAC addresses as needed on the new host. Stop the VM on the old host and start it on the new one.

I used both methods above to successfully move my VMs from my older 6.5 pool to the newer 7 pool. Success.

Fix owncloud client sync “not valid JSON!” error

Recently I migrated my owncloud installation from one webserver to another. I learned that all you have to do is copy the data/ and config/ directories from your owncloud installation to the new machine’s owncloud folder to migrate everything over.

After the migration, I noticed the Windows desktop sync clients stopped working (android worked fine, though.) The main error messasge was not helpful:

Failed to connect to owncloud at https://servername/path/status.php: Unknown error

I found out from here that you can alter the shortcut for the Windows client and append –logwindow to the Target section. Once that was done, I was able to get more information:

03-02 10:01:27:240 0x50f9ec8 networkjobs.cpp:453 status.php from server is not valid JSON!

Manually navigating to status.php in the browser didn’t reveal anything – it appeared to load normally.

After much digging I found a suggestion to check the admin settings within owncloud. This is where I realized I probably didn’t migrate things properly. There was a big warning about invalid .htaccess files. Progress!

The lack of an .htaccess file made me realize that instead of completely moving the entire folder from the old owncloud install to the new, I needed to copy into the existing new owncloud directory. In moving instead of copying I somehow missed a few important files.

I started over, this time copying all files inside data and config into the new owncloud data and config directories. Apparently the Windows sync client requires valid .htaccess configuration before it will work.  Success!