Category Archives: CLI

WD*EZRZ NAS array spindown fix

I recently acquired some 5TB Western Digital Blue drives (WD50EZRZ.) These particular drives were shucked from external USB enclosures. When I tried to add them into my ZFS raid array, though, I ran into constant problems. I would continually get errors like this from the kernel:

[155069.298001] sd 0:0:10:0: attempting task abort! scmd(ffff8f0678887100)
[155069.298005] sd 0:0:10:0: [sdk] tag#5 CDB: Write(16) 8a 00 00 00 00 01 a8 1e 77 10 00 00 00 58 00 00
[155069.298008] scsi target0:0:10: handle(0x0014), sas_address(0x5001438023a93296), phy(22)
[155069.298010] scsi target0:0:10: enclosure logical id(0x5001438023a932a5), slot(53) 
[155069.298012] sd 0:0:10:0: task abort: SUCCESS scmd(ffff8f0678887100)
[155069.298016] sd 0:0:10:0: [sdk] tag#5 FAILED Result: hostbyte=DID_TIME_OUT driverbyte=DRIVER_OK
[155069.298018] sd 0:0:10:0: [sdk] tag#5 CDB: Write(16) 8a 00 00 00 00 01 a8 1e 77 10 00 00 00 58 00 00
[155069.298020] blk_update_request: I/O error, dev sdk, sector 7115536144
[155069.298023] zio pool=storage vdev=/dev/disk/by-id/ata-WDC_WD50EZRZ-32RWYB1_WD-WX31XXXXXVA-part1 error=5 type=2 offset=3643153457152 size=45056 flags=180880

After a couple of these errors, the drive would be marked as bad and taken out of the array. A battery of tests on a different system revealed the drives to be fine. It did not matter where I inserted these drives on my NAS – they did the same thing, even on ports I knew had working drives. It wasn’t a cabling or other hardware issue.

The drives would resilver back into the array just fine, and then pop out again at random intervals – sometimes minutes later, other times hours later. After a lot of research I came across this post that got me thinking – this sounds like a drive spindown issue! The random nature of it could simply be the drives not being used and then powering themselves down.

I tried using hdparm to set the spindown timer but was greeted with this lovely error:

sudo hdparm -B /dev/sdk
/dev/sdk:
 APM_level	= not supported

I eventually found this post complaining about their Western Digital drives spinning down aggressively.

idle3 to the rescue

The above post mentions apmtimer, which did not help me; however, more searching revealed this godsend: idle3-tools

idle3-tools is an open source utility that manages the spindown timer on the Western Digital drives themselves (not at the OS level.)

Download & compile idle3:

wget https://sourceforge.net/projects/idle3-tools/files/latest/download -O idle3-tools-0.9.1.tgz
tar -xf idle3-tools-0.9.1.tgz
cd idle3-tools-0.9.1/
make
sudo make install

Use idle3 to query current spindown status (update drive letters to suit your needs)

for drive in {a..p}; do echo /dev/sd$drive; sudo idle3ctl -g /dev/sd$drive; done

For any drive that doesn’t report "Idle3 timer is disabled", run the following:

sudo idle3ctl -s 0 /dev/sd(DRIVE_LETTER)
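
Note that the drives may need a full power cycle before the new setting actually takes effect. Re-running the query afterwards confirms the stored value changed:

sudo idle3ctl -g /dev/sd(DRIVE_LETTER)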

No more drive spindown!

proxmox openvswitch bond

Recently I had to switch my Proxmox server from Linux bonds to openvswitch bonds. These are my notes:

Install openvswitch:

apt install openvswitch-switch

Configure openvswitch to bond interfaces and use VLANs in /etc/network/interfaces, using https://pve.proxmox.com/wiki/Open_vSwitch as an example:

allow-vmbr0 bond0
iface bond0 inet manual
	ovs_bonds enp4s0f0 eno1
	ovs_type OVSBond
	ovs_bridge vmbr0
	ovs_options bond_mode=active-backup

auto lo
iface lo inet loopback

iface eno1 inet manual

iface enp4s0f0 inet manual

allow-ovs vmbr0
iface vmbr0 inet manual
	ovs_type OVSBridge
	ovs_ports bond0 vlan50 vlan10

#Proxmox communication
allow-vmbr0 vlan50
iface vlan50 inet static
  ovs_type OVSIntPort
  ovs_bridge vmbr0
  ovs_options tag=50
  ovs_extra set interface ${IFACE} external-ids:iface-id=$(hostname -s)-${IFACE}-vif
  address 10.0.50.2
  netmask 255.255.255.0
  gateway 10.0.50.1

#Storage network
allow-vmbr0 vlan10
iface vlan10 inet static
  ovs_type OVSIntPort
  ovs_bridge vmbr0
  ovs_options tag=10
  ovs_extra set interface ${IFACE} external-ids:iface-id=$(hostname -s)-${IFACE}-vif
  address 192.168.10.2
  netmask 255.255.255.0
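
Once the config is in place, apply it and sanity-check the bridge – a quick sketch (restarting networking over the very links you’re reconfiguring can drop your session, so a reboot is the safer option):

systemctl restart networking
#Verify the bridge, bond, and internal ports are present
ovs-vsctl show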

List the active interface:

ovs-appctl bond/show bond0

Update 3/14/2020

I realized that openvswitch won’t fail back over to the original slave once it comes back online. I couldn’t for the life of me find the equivalent of bond-primary syntax for openvswitch; however I did find this command:

ovs-appctl list-commands

which reveals this command:

bond/set-active-slave port slave

So you can manually fail back using this command:

ovs-appctl bond/set-active-slave bond0 enp4s0f0

chroot into encrypted drive

I foolishly went browsing in my EFI partition on my Ubuntu (Elementary OS) laptop and decided to delete the Ubuntu folder. This made my laptop unbootable. This was my procedure to bring it back to life:

Boot into Ubuntu Live CD / USB environment

Decrypt LUKS encrypted drive (https://blog.sleeplessbeastie.eu/2015/11/16/how-to-mount-encrypted-lvm-logical-volume/)

sudo fdisk -l
#Determine encrypted partition is /dev/nvme0n1p3 because it's the largest
sudo cryptsetup luksOpen /dev/nvme0n1p3 encrypted_device
sudo vgchange -ay
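
If the volume group or logical volume names aren’t obvious, they can be confirmed before mounting – this is where the /dev/elementary-vg/root path below comes from:

#List volume groups and logical volumes now that the LUKS container is open
sudo vgs
sudo lvs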

Mount encrypted drive & chroot (https://askubuntu.com/questions/831216/how-can-i-reinstall-grub-to-the-efi-partition)

sudo mount /dev/elementary-vg/root /mnt
sudo mount /dev/nvme0n1p2 /mnt/boot/
sudo mount /dev/nvme0n1p1 /mnt/boot/efi
for i in /dev /dev/pts /proc /sys /run; do sudo mount -B $i /mnt$i; done
sudo chroot /mnt
sudo grub-install
update-grub  

#end chroot & unmount
exit
cd
for i in /mnt/dev/pts /mnt/dev  /mnt/proc /mnt/sys /mnt/run /mnt/boot/efi /mnt/boot /mnt; do sudo umount $i;  done

use zdb to remove pesky device from zfs pool

I had the following problem with a device in my pool:

config:

        NAME                          STATE     READ WRITE CKSUM
        storage                       DEGRADED     0     0     0
          mirror-0                    ONLINE       0     0     0
            WORKING_DISK_1            ONLINE       0     0     0
            WORKING_DISK_2            ONLINE       0     0     0
          mirror-1                    DEGRADED     0     0     0
            WORKING_DISK_3            ONLINE       0     0     0
            replacing-1               DEGRADED     0     0     0
              PROBLEM_DISK            FAULTED      6     1     0  too many errors

However when I tried to replace the drive I got this message:

no such device in pool
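
For reference, the replace attempt that produced that error looked something like this (the device names are placeholders matching the anonymized output above):

zpool replace storage PROBLEM_DISK NEW_WORKING_DISK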

I found here that you can use zdb to obtain the GUID of the problem device and replace it that way:

root@nas:~# zdb -l PROBLEM_DISK
failed to unpack label 0
------------------------------------
LABEL 1
------------------------------------
    version: 5000
    name: 'storage'
    state: 0
    txg: 5675107
    pool_guid: 8785893899843624400
    errata: 0
    hostname: 'nas'
    top_guid: 9425730683443378041
    guid: 3449631978925631053
    vdev_children: 2
    vdev_tree:
        type: 'mirror'
        id: 1
        guid: 9425730683443378041
        metaslab_array: 41
        metaslab_shift: 35
        ashift: 12
        asize: 4000782221312
        is_log: 0
        create_txg: 4
        children[0]:
            type: 'disk'
            id: 0
            guid: 17168510556101954329
            path: 'WORKING_DISK_3'
            devid: 'WORKING_DISK_3_ID'
            phys_path: 'pci-0000:00:1f.2-ata-2'
            whole_disk: 1
            DTL: 14700
            create_txg: 4
        children[1]:
            type: 'disk'
            id: 1
    ----->  guid: 3449631978925631053
            path: 'PROBLEM_DISK'
            devid: 'PROBLEM_DISK_ID'
            phys_path: 'pci-0000:00:1f.2-ata-4'
            whole_disk: 1
            DTL: 14699
            create_txg: 4
    features_for_read:
        com.delphix:hole_birth
        com.delphix:embedded_data
    labels = 1 2 3 

I used the GUID of the problem disk, and all was well:

zpool replace storage 3449631978925631053 NEW_WORKING_DISK

worked instead of complaining that the device I was trying to replace didn’t exist.

Add static route in CentOS 7

I recently began a project of segmenting my LAN into various VLANs. One issue that cropped up had me banging my head against the wall for days. I had a particular VM that used OpenVPN to connect to a private VPN provider. That same system was sending things to a file share via transmission-daemon.

Before the subnet move everything worked, but once I moved my file server to a different subnet, this VM suddenly could not access it while on the VPN. Transmission would hang for some time before finally saying:

transmission-daemon.service: Failed with result 'timeout'.

The problem was that since my file server was now on a different subnet, the VM was trying to route traffic to it via the default gateway, which in this case was the VPN provider. I had to add a specific route to tell the server to use my LAN network instead of the VPN network in order to restore connectivity to the file server (thanks to this site for the primer.)

I had to create a file /etc/sysconfig/network-scripts/route-eth0 and give it the following line:

192.168.2.0/24 via 192.168.1.1 dev eth0

This instructed my VM to reach the 192.168.2.0/24 network via the 192.168.1.1 gateway on eth0. Restart the network service (or reboot) and success!
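
For reference, the concrete commands (assuming the legacy network service is in use rather than NetworkManager):

#Restart networking to pick up the new route-eth0 file (or just reboot)
sudo systemctl restart network
#Verify the new route is present
ip route show | grep 192.168.2.0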

Dell LSI SAS2008 2TB drive fix

I just recently got a $40 external SAS adapter for my new storage server. The plan was to create a DAS device from my old NAS chassis and have it be driven by my new storage server (new to me anyway – a Dell PowerEdge R610.) I ordered what was listed simply as “Dell SAS External Dual Ports PCI-E 6GB/S Host Bus Server Adapter 12DNW 342-0910 Consumer Electronics” from Amazon for $40 to accomplish this goal.

When I plugged everything in, to my dismay none of my disks with greater than 2TB capacity showed up. Well, they sort of showed up – they all reported capacities of exactly 2TB. I was clearly running into some sort of firmware issue.

lspci revealed this card uses the LSI SAS2008 chipset, which from what I’ve read is capable of drives greater than 2TB in size. I later found the model number of my card – Dell PERC H200E – which proved to be quite vital information. After hours of digging around in unholy corners of the internet I finally arrived on this Dell Support page. It had exactly what I was hoping for:

ENHANCEMENTS:
– Added support for SAS HDDs larger than 2TB

To flash this I chose to create a bootable DOS ISO as per the instructions here. First, download the Windows installer, open it with your archive program of choice, and extract it to the folder you’re going to build your ISO from. Then follow the instructions linked above: download a FreeDOS ISO, extract it to the same folder you extracted the firmware to, and run the command to build your ISO (adjust as needed):

mkisofs -o <ISO_OUTPUT_LOCATION> -q -l -N -boot-info-table -iso-level 4 -no-emul-boot -b isolinux/isolinux.bin -publisher "FreeDOS - www.freedos.org" -A "FreeDOS beta9 Distribution" -V FDOS_BETA9 -v .

I got this far and yet tripped at the finish line. If you simply run flash.bat you’ll be greeted with a message saying no compatible adapters were found. Fortunately that’s a LIE. My savior was this writeup on how to flash certain versions of these cards to IT mode. I didn’t care about IT mode (my card is not a RAID card) but it had the information I needed. Here are the magic commands!

sas2flsh -listall

#Use the number in the first column to get the SAS Address for the card.
sas2flsh -c 0 -list
#Write down the SAS Address and continue to the next steps.
sas2flsh -o -f 6GBPSAS.FW
sas2flsh -o -sasadd 5xxxxxxxxxxxxxxx (replace this address with the one you wrote down in the first steps).

Reboot, and finally, after hours of banging my head on the wall… success!!!

These 4 drives were only being reported as 2TB before

I didn’t end up using it, but in my internet travels I came across a neat utility from Broadcom called the LSI pre-boot USB tool: https://www.broadcom.com/support/knowledgebase/1211161499804/lsi-pre-boot-usb-tool-download

Update 3/7/2020

I had some issues with my 4TB+ drives dropping out of my zpools. I found better firmware to flash in order to fix it. It was very frustrating to flash, however. I tried following the instructions as laid out here but was met with this lovely message:

"Cannot Flash IT Firmware over IR Firmware"..

I found this guide on how to use the megarec utility to wipe the firmware in order to flash over properly. I was able to find the megarec utility here.

I very frustratingly found I couldn’t use the megarec utility on my Dell server; megarec would simply hang.

I ended up taking the card out and putting it into my desktop to run megarec commands. Comically, my desktop had a chipset that caused sas2flash not to work! It would fail with the message:

Failed to initialize PAL

Instructions per this page were to boot to EFI and run the flash utilities there, but that desktop didn’t have an EFI shell and I couldn’t get it to boot one from USB.

My final resort: an even older desktop (my Dad’s old PC, circa 2008.) It did the deed!

FINALLY

With both utilities working I was still having trouble with sas2flash erroring out on me. I finally found the wise words of fourlynx in this homelab reddit discussion describing the final song and dance I had to perform to get my Dell H200 card to work with the LSI firmware I wanted:

  1. Flash to Dell 6GBPSAS.FW
    1. I used megarec to wipe the card first before it would let me install that firmware
  2. Erase the card
    1. sas2flsh -o -e 7 -c 0
  3. Flash to 6GBPSAS.FW again
    1. sas2flsh -o -f 6GBPSAS.FW
    2. If it asked me to specify a firmware, I entered 6GBPSAS.FW, waited for it to finish, then re-ran the sas2flsh command (I flashed the same firmware a total of three times.)
  4. Reboot
  5. Finally flash LSI firmware
    1. sas2flsh -o -f 2118it.bin

There’s no need to flash the BIOS (-b flag) if you’re not going to boot from that controller. There’s also no need to set the SAS address if it’s the only card in the server.
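
If you did want to boot from the controller, my understanding is the option ROM can be flashed in the same step – a sketch using the ROM filenames mentioned in fourlynx’s notes below:

#Flash the IT firmware along with the legacy BIOS ROM (x64sas2.rom would be the EFI equivalent)
sas2flsh -o -f 2118it.bin -b mptsas2.rom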

Words of wisdom from fourlynx:

For what concerns your case, I’d try to flash it to the Dell firmware first (any of your choice, for H200I, H200A or with the 6GBPSAS.fw). From there, clear it completely sas2flsh -o -e 7 -c 0 and flash the 6GBPSAS.fw before rebooting. You should now have better luck in crossflashing that to the LSI firmware. Note that you’ll need to use the v5 or v7 version of the flasher to do this step as newer versions will refuse to crossflash. You can then flash the bootloader for EFI (x64sas2.rom) or for BIOS (mptsas2.rom) at your leisure according to what you’re going to use, or flash both, or none if you’re not going to boot from those drives at all but instead use an USB key.

megarec -cleanflash 0 is equivalent to sas2flsh -o -e 7, btw, and the megarec -writesbr sbrempty.bin command that is often found in guides is only relevant when coming from a M1015 afaik, so not being able to use megarec is not a show stopper.

I feel I should add that, contrary to what seems the popular opinion in the various guides, these cards aren’t really easy to brick and I haven’t managed to achieve that despite all the experiments I’ve subjected them to 🙂

Update 3/8/2020

I still had issues with a drive popping out of the array so I found this page with an even better firmware for my card:

https://www.ixsystems.com/community/threads/crossflash-dell-h200e-to-lsi-9200-8e.41307/

Things seem more stable now!

Proxmox HA management script

I was a bit frustrated at the lack of certain functions in Proxmox. I wanted an easy way to tag a VM and manage that tag as a group. My solution was to create HA groups for VMs with various functions. I can then manage the group and tell its members to migrate storage or turn off & on.

I wrote a script to facilitate this. Right now it only covers powering on, powering off, and migrating the location of the primary disk, but more is to come.

Here’s what I have so far:

#!/bin/bash
#Proxmox HA management script
#Migrates storage, starts, or stops Proxmox HA groups based on the name and function passed to it.
#Usage: manage-HA-group.sh <start|stop|migrate> <ha-group-name> [local|remote]

#Change to the name of your local storage (for migrating from remote to local storage)
LOCAL_STORAGE_NAME="pve-1TB"

function get_vm_name() {
    #Determine the name of the VMID passed to this function
    VM_NAME=$(qm config "$1" | grep '^name:' | awk '{print $2}')
}

function get_group_VMIDs() {
    #Get a list of VMIDs based on the name of the HA group passed to this function
    group_VMIDs=$(ha-manager config | grep -B1 "$1" | grep vm: | sed 's/vm://g')
}

function group_power_state() {
    #Loop through members of HA group passed to this function
    for group in "$1" 
    do
        get_group_VMIDs "$group"
        for VM in $group_VMIDs
        do
            get_vm_name "$VM"
            echo "$OPERATION $VM_NAME in HA group $group"
            ha-manager set "$VM" --state "$VM_STATE"
        done
    done
}

function group_migrate() {
    #This function migrates the VM's first disk (scsi0) to the specified location (local/remote)
    #TODO String to determine all disk IDs:  qm config 115 | grep '^scsi[0-9]:' | tr -d ':' | awk '{print $1}'
    disk="scsi0"    

    #Loop through each VM in specified group name (second argument passed on CLI)
    for group in "$2" 
    do
        get_group_VMIDs "$group"
        for VM in $group_VMIDs
        do
            #Determine the names of each VM in the HA group
            get_vm_name "$VM"

            #Set storage location based on argument
            if [[ "$3" == "remote" ]]; then
                storage="$VM_NAME"
            else
                storage="$LOCAL_STORAGE_NAME"
            fi

            #Move primary disk to desired location
            echo "Migrating $VM_NAME to "$3" storage"
            qm move_disk $VM $disk $storage --delete=1

        done
    done
}

case "$1" in 
    start)
        VM_STATE="started"
        OPERATION="Starting"
        group_power_state "$2" 
        ;;
    stop)
        VM_STATE="stopped"
        OPERATION="Stopping"
        group_power_state "$2"
        ;;
    migrate)
        case "$3" in
            local|remote)
                group_migrate "$@"
                ;;
            *)
                echo "Usage: manage-HA-group.sh migrate <ha-group-name> <local|remote>"
                ;;
        esac        
    ;;
    *)
        echo "Usage: manage-HA-group.sh <start|stop|migrate> <ha-group-name> [local|remote]"
        exit 1
        ;;
esac
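
Usage ends up looking like this (media-servers is just a hypothetical HA group name):

#Start or stop every VM in an HA group
./manage-HA-group.sh start media-servers
./manage-HA-group.sh stop media-servers

#Move each VM's scsi0 disk to local storage, or back to remote
./manage-HA-group.sh migrate media-servers local
./manage-HA-group.sh migrate media-servers remote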

Find local accounts with shell login

I spent some time building this query so I thought I’d write it down. To find out which accounts in /etc/passwd with a UID greater than 500 have anything other than nologin or false set as their login shell, combine grep & awk to print them out:

grep -v 'nologin\|false' /etc/passwd | awk -F: '{if($3>500)print}'

grep -v: exclude results
awk -F: use colon as a field separator
{if($3>500)print}: if the third column (using the above colon separator) is greater than 500, print the line

If running this command via salt cmd.run, you must use double quotes and escape the $ to get it to work properly:

salt <hostname> cmd.run "grep -v 'nologin\|false' /etc/passwd | awk -F: '{if(\$3>500)print}'"
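
The same check can also be collapsed into a single awk invocation – a sketch that matches nologin/false against the shell field specifically rather than the whole line:

awk -F: '$3 > 500 && $7 !~ /(nologin|false)/ {print}' /etc/passwd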

apache reverse proxy with basic authentication

I have an old Apache server that’s serving as a reverse proxy for my webcam. I swapped webcams out and unfortunately the new one requires authentication. I had to figure out how to get Apache to reverse proxy with the proper authentication. The best information I found was given by user ThR37 at superuser.com.

Essentially you have to use an Apache module called headers to add an HTTP header to the request. On my Debian system this was not enabled, so I had to enable it (thanks to Andy over at serverfault):

sudo a2enmod headers
#if you're on ubuntu then it's mod_headers

I then needed to generate the basic authentication hash for the header to pass. This was done via a simple python script:

#replace USERNAME:PASSWORD below with your credentials
import base64
hash = base64.b64encode(b'USERNAME:PASSWORD')
#decode so only the bare hash is printed; works under Python 2 and 3
print(hash.decode())

Save the above script into a file hash.py and then run it by typing

python hash.py
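
Alternatively, the same value can be generated straight from the shell, no Python needed (the -n matters so a trailing newline doesn’t end up in the hash):

echo -n 'USERNAME:PASSWORD' | base64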

With headers enabled and hash acquired I just needed to tweak my config by adding a RequestHeader line:

RequestHeader set Authorization "Basic <HASH>"
#Replace <HASH> with hash acquired above

After adding that one line and restarting apache, it worked!

git branch and merge from VS Code

I’ve configured VS Code to follow git best practices when it comes to creating branches and merging via pull requests. Here are my notes:

Install the GitLens plugin

To open the Command Palette: F1 or Command Shift P
CP is short for Command Palette for the following commands.

CP : git clone

  • Enter <repo URL> -> select destination
  • optional: add folder to workspace (bottom right)

CP: git add remote

  • select repository -> label upstream

CP: git fetch from all remotes

CP: git create branch from

  • select repo -> provide name -> select upstream/master

Click on a file in the repo, verify you’re in your new branch (bottom left)

Publish branch to your repo:

CP: git publish branch

  • select repo -> origin

Make your changes

Commit your changes to your branch (git icon on the left – type commit message, optionally compare files) – hit check mark to commit

— your changes are committed to your personal repo branch —

Log into GitLab, go to the repo, and click Compare & pull request

If you ever need to remove upstream URL:

CP: git remove remote
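
For reference, the rough plain-git equivalents of the flow above (the repo URLs and the my-feature branch name are placeholders):

git clone <repo URL>
cd <repo>
git remote add upstream <upstream repo URL>
git fetch --all
git checkout -b my-feature upstream/master
git push -u origin my-feature
#make your changes, then:
git add -A
git commit -m "describe the change"
git push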