Podman no internet in container fix

I’ve started experimenting with CentOS 8 & Podman (a fork of Docker.) I ran into an issue where one of my containers needed internet access, but could not connect. After some digging I found this site which explains why:

I had to configure the firewall on the podman host to allow for IP masquerade:

sudo firewall-cmd --zone=public --add-masquerade --permanent

After running the above command, my container had internet access!

JQ select specific value from array

I had some AWS ec2 JSON output that I needed to parse. I wanted to grab a specific value from an array and it proved to be tricky for a JSON noob like me. I finally found this site which was very helpful: https://garthkerr.com/search-json-array-jq/. In my case I wanted the value of a specific AWS EC2 tag.

The trick is to grab down to the Tags[] array, and then pipe that to a select command. If your tags have dots in them (as mine did) then make sure to quote the tag name. Then add the .Value to the end of the select statement. This is my query:

jq -r '.Reservations[].Instances[].Tags[] | select (.Key == "EC2.Tag.Name").Value' jsonfile.json

The above query grabs all the tags (an array of Key,Value lines), then searches the result for a specific key “EC2.Tag.Name” and returns the Value line associated with it.

WD*EZRZ NAS array spindown fix

I recently acquired some 5TB Western Digital Blue drives (WD50EZRZ.) These particular drives were shucked from external USB enclosures. When I tried to add them into my ZFS raid array, though, I ran into constant problems. I would continually get errors like this from the kernel:

[155069.298001] sd 0:0:10:0: attempting task abort! scmd(ffff8f0678887100)
[155069.298005] sd 0:0:10:0: [sdk] tag#5 CDB: Write(16) 8a 00 00 00 00 01 a8 1e 77 10 00 00 00 58 00 00
[155069.298008] scsi target0:0:10: handle(0x0014), sas_address(0x5001438023a93296), phy(22)
[155069.298010] scsi target0:0:10: enclosure logical id(0x5001438023a932a5), slot(53) 
[155069.298012] sd 0:0:10:0: task abort: SUCCESS scmd(ffff8f0678887100)
[155069.298016] sd 0:0:10:0: [sdk] tag#5 FAILED Result: hostbyte=DID_TIME_OUT driverbyte=DRIVER_OK
[155069.298018] sd 0:0:10:0: [sdk] tag#5 CDB: Write(16) 8a 00 00 00 00 01 a8 1e 77 10 00 00 00 58 00 00
[155069.298020] blk_update_request: I/O error, dev sdk, sector 7115536144
[155069.298023] zio pool=storage vdev=/dev/disk/by-id/ata-WDC_WD50EZRZ-32RWYB1_WD-WX31XXXXXVA-part1 error=5 type=2 offset=3643153457152 size=45056 flags=180880

After a couple of said errors, the drive would be marked as bad and taken out of the array. A battery of tests on a different system revealed the drives to be fine. It did not matter where I inserted these drives on my NAS, they did the same thing, even on ports I knew had working drives. It wasn’t a cabling or other hardware issue.

The drives would resilver back into the array just fine, and then pop out again at random intervals – sometimes minutes later, other times hours later. After a lot of research I came across this post that got me thinking – this sounds like a drive spindown issue! The random nature of it could simply be the drives not being used and then powering themselves down.

I tried using hdparm to set the spindown timer but was greeted with this lovely error:

sudo hdparm -B /dev/sdk
/dev/sdk:
 APM_level	= not supported

I eventually found this post complaining about their Western Digital drives spinning down aggressively.

idle3 to the rescue

The above post mentions apmtimer which did not help me, however more searches reveled this godsend: idle3-tools

idle3-tools is an open source utility to handle spindown on Western Digital drives themselves (not the OS level.)

Download & compile idle3:

wget https://sourceforge.net/projects/idle3-tools/files/latest/download
cd idle3-tools-0.9.1/
make
sudo make install

Use idle3 to query current spindown status (update drive letters to suit your needs)

for drive in {a..p}; do echo /dev/sd$drive; sudo idle3ctl -g /dev/sd$drive; done

For anything that doesn’t say Idle3 timer is disabled run the following:

sudo idle3ctl -s 0 /dev/sd(DRIVE_LETTER)

No more drive spindown!

proxmox openvswitch bond

Recently I had to switch my Proxmox server which was using Linux bonds to using openvswitch. These are my notes:

Install openvswitch:

apt install openvswitch-switch

Configure openvswitch to bond interfaces and use VLANs using https://pve.proxmox.com/wiki/Open_vSwitch as an example:

allow-vmbr0 bond0
iface bond0 inet manual
	ovs_bonds enp4s0f0 eno1
	ovs_type OVSBond
	ovs_bridge vmbr0
	ovs_options bond_mode=active-backup

auto lo
iface lo inet loopback

iface eno1 inet manual

iface enp4s0f0 inet manual

allow-ovs vmbr0
iface vmbr0 inet manual
	ovs_type OVSBridge
	ovs_ports bond0 vlan50 vlan10

#Proxmox communication
allow-vmbr0 vlan50
iface vlan50 inet static
  ovs_type OVSIntPort
  ovs_bridge vmbr0
  ovs_options tag=50
  ovs_extra set interface ${IFACE} external-ids:iface-id=$(hostname -s)-${IFACE}-vif
  address 10.0.50.2
  netmask 255.255.255.0
  gateway 10.0.50.1

#Storage network
allow-vmbr0 vlan10
iface vlan10 inet static
  ovs_type OVSIntPort
  ovs_bridge vmbr0
  ovs_options tag=10
  ovs_extra set interface ${IFACE} external-ids:iface-id=$(hostname -s)-${IFACE}-vif
  address 192.168.10.2
  netmask 255.255.255.0

List active interface:

ovs-appctl bond/show bond0

Update 3/14/2020

I realized that openvswitch won’t fail back over to the original slave once it comes back online. I couldn’t for the life of me find the equivalent of bond-primary syntax for openvswitch; however I did find this command:

ovs-appctl list-commands

which reveals this command:

bond/set-active-slave port slave

So you can manually fallback using this command:

ovs-appctl bond/set-active-slave bond0 enp4s0f1

chroot into encrypted drive

I foolishly went browsing in my EFI partition on my Ubuntu (Elementary OS) laptop and decided to delete the Ubuntu folder. This made my laptop unbootable. This was my procedure to bring it back to life:

Boot into Ubuntu Live CD / USB environment

Decrypt LUKS encrypted drive (https://blog.sleeplessbeastie.eu/2015/11/16/how-to-mount-encrypted-lvm-logical-volume/)

sudo fdisk -l
#Determine encrypted partition is /dev/nvme0n1p3 because it's the largest
sudo cryptsetup luksOpen /dev/nvme0n1p3 encrypted_device
sudo vgchange -ay

Mount encrypted drive & chroot (https://askubuntu.com/questions/831216/how-can-i-reinstall-grub-to-the-efi-partition)

sudo mount /dev/elementary-vg/root /mnt
sudo mount /dev/nvme0n1p2 /mnt/boot/
sudo mount /dev/nvme0n1p1 /mnt/boot/efi
for i in /dev /dev/pts /proc /sys /run; do sudo mount -B $i /mnt$i; done
sudo chroot /mnt
sudo grub-install
update-grub  

#end chroot & unmount
exit
cd
for i in /mnt/dev/pts /mnt/dev  /mnt/proc /mnt/sys /mnt/run /mnt/boot/efi /mnt/boot /mnt; do sudo umount $i;  done

proxmox bond not present fix

I really banged my head on the wall on this one. I recently decided to re-architect my networking setup in proxmox to utilize bonded network configuration. I followed this writeup exactly. The problem is it didn’t work.

I would copy the example exactly, only changing the interface name, and yet every time I tried to start the networking service I would get this lovely error:

rawdevice bond0 not present

I finally found on the Debian Wiki one critical line :

First install the ifenslave package, necessary to enable bonding

For some reason the ProxMox howtos don’t speak of this – I guess because it comes installed by default. I discovered, however, that if you install ifupdown2 it removes ifenslave. I had installed ifupdown2 in the past to reload network configuration without rebooting. Aha!

I re-installed ifenslave (which removed ifupdown2 and re-installed ifupdown) and suddenly, the bond worked!

Bond not falling back to primary intrerface

I had configured my bond in active – backup mode. I wanted it to prefer the faster link, but if there was a failure in that link it wouldn’t switch back automatically (thanks to this site for showing me the command to check:

cat /proc/net/bonding/bond0

I read again in Debian bonding wiki that I needed to add this directive to the bond:

        bond-primary enp2s0

Here is my complete working active-backup configuration, assigning vlan 2 to the host, and making enp2s0 (the 10gig nic) the primary, with a 1gig backup (eno1)

auto bond0
iface bond0 inet manual
        slaves enp2s0 eno1
        bond-primary enp2s0
        bond_miimon 100
        bond_mode active-backup

iface bond0.2 inet manual

auto vmbr0v2
iface vmbr0v2 inet static
        address 192.168.2.2
        netmask 255.255.255.0
        gateway 192.168.2.1
        bridge_ports bond0.2
        bridge_stp off
        bridge_fd 0

auto vmbr0
iface vmbr0 inet manual
        bridge_ports bond0
        brideg_stp off
        bridge_fd 0

migrate between amd & intel in proxmox

I recently acquired an Intel based server and plugged it into my AMD-based Proxmox cluster. I ran into an issue transferring from AMD to Intel boxes (the other direction worked fine.) After a few moments, every VM that moved from AMD to Intel would kernel panic.

Fortunately I found here that the fix is to add a few custom CPU flags to your VMs. Once I did this they could move back and forth freely (assuming they had the kvm64 CPU assigned to them – host obviously won’t work.)

qm set *VMID* --args "-cpu 'kvm64,+ssse3,+sse4.1,+sse4.2,+x2apic'"

Fix proxmox iommu operation not permitted

In trying to passthrough some LSI SAS cards to a VM I kept receiving this error:

kvm: -device vfio-pci,host=0000:03:00.0,id=hostpci0,bus=ich9-pcie-port-1,addr=0x0,rombar=0: vfio 0000:03:00.0: failed to setup container for group 7: Failed to set iommu for container: Operation not permitted

I found on this post that the fix is to add a line to /etc/modprobe.d/vfio.conf with the following:

options vfio_iommu_type1 allow_unsafe_interrupts=1

then reboot the host. It worked in my case!

use zdb to remove pesky device from zfs pool

I had the following problem with a device in my pool:

config:

        NAME                                            STATE     READ WRITE CKSUM
        storage                                         DEGRADED     0     0     0
          mirror-0                                      ONLINE       0     0     0
            WORKING_DISK_1  ONLINE       0     0     0
            WORKING_DISK_2    ONLINE       0   0     0
          mirror-1                                      DEGRADED     0     0     0
            WORKING_DISK_3  ONLINE       0     0     0
            replacing-1                                 DEGRADED     0     0     0
              PROBLEM_DISK  FAULTED      6     1     0  too many errors

However when I tried to replace the drive I got this message:

no such device in pool

I found here that you can use zdb to obtain the GUID of the problem device and replace it that way:

root@nas:~# zdb -l PROBLEM_DISK
failed to unpack label 0
------------------------------------
LABEL 1
------------------------------------
    version: 5000
    name: 'storage'
    state: 0
    txg: 5675107
    pool_guid: 8785893899843624400
    errata: 0
    hostname: 'nas'
    top_guid: 9425730683443378041
    guid: 3449631978925631053
    vdev_children: 2
    vdev_tree:
        type: 'mirror'
        id: 1
        guid: 9425730683443378041
        metaslab_array: 41
        metaslab_shift: 35
        ashift: 12
        asize: 4000782221312
        is_log: 0
        create_txg: 4
        children[0]:
            type: 'disk'
            id: 0
            guid: 17168510556101954329
            path: 'WORKING_DISK_3'
            devid: 'WORKING_DISK_3_ID'
            phys_path: 'pci-0000:00:1f.2-ata-2'
            whole_disk: 1
            DTL: 14700
            create_txg: 4
        children[1]:
            type: 'disk'
            id: 1
    ----->  guid: 3449631978925631053
            path: 'PROBLEM_DISK'
            devid: 'PROBLEM_DISK_ID'
            phys_path: 'pci-0000:00:1f.2-ata-4'
            whole_disk: 1
            DTL: 14699
            create_txg: 4
    features_for_read:
        com.delphix:hole_birth
        com.delphix:embedded_data
    labels = 1 2 3 

I used the guid of the problem disk, and all was well:

zpool replace storage 3449631978925631053 NEW_WORKING_DISK

worked instead of complaining the device I was trying to replace didn’t exist.

Add static route in CentOS7

I recently began a project of segmenting my LAN into various VLANs. One issue that cropped up had me banging my head against the wall for days. I had a particular VM that would use OpenVPN to a private VPN provider. I had that same system sending things to a file share via transmission-daemon.

Pre-subnet move everything worked, but once I moved my file server to a different subnet suddenly this VM could not access it while on the VPN. Transmission would hang for some time before finally saying

transmission-daemon.service: Failed with result 'timeout'.

The problem was since my file server was on a different subnet, it was trying to route traffic to it via the default gateway, which in this case was the VPN provider. I had to add a specific route to tell the server to use my LAN network instead of the VPN network in order to restore connectivity to the file server (thanks to this site for the primer.)

I had to create a file /etc/sysconfig/network-scripts/route-eth0 and give it the following line:

192.168.2.0/24 via 192.168.1.1 dev eth0

This instructed my VM to get to the 192.168.2 network via the 192.168.1.1 gateway on eth0. Restart the network service (or reboot) and success!