Category Archives: CLI

Threadripper / Epyc processor core optimization

I had a pet project (folding@home) where I wanted to maximize computing power. I became frustrated with default CPU scheduling of my folding@home threads. Ideal performance would keep similar threads on the same CPU, but the threads were jumping all over the place, which was impacting performance.

Step one was to figure out which threads belonged to which physical cores. I found on this site that you can use cat to find out what your “sibling threads” are:

cat /sys/devices/system/cpu/cpu{0..15}/topology/thread_siblings_list

The above command is for my Threadripper & Epyc systems, which each have 16 cores hyperthreaded to 32 cores. Adjust the {0..15} number to match your number of cores (core 0 being the fist core.) This was my output:

cat /sys/devices/system/cpu/cpu{0..15}/topology/thread_siblings_list

0,16
1,17
2,18
3,19
4,20
5,21
6,22
7,23
8,24
9,25
10,26
11,27
12,28
13,29
14,30
15,31

Now that I know the sibling threads are offset by 16, I can use this information to optimize my folding@home VMs. I modified my CPU pinning script to take this into consideration. The script ensures that each VM is pinned to only use sibling threads (ensuring they all stay on the same physical CPU.)

This script should be used with caution. It pins processes to specific CPUs, which limits the kernel scheduler’s ability to move things around if needed. If configured badly this can cause the machine to lock up or VMs to be terminated.

I saw some impressive results spinning up four separate 8 core VMs and pinning them to sibling cores using this script. It almost doubled the rate at which I completed folding@home work units.

And now, the script:

#!/bin/bash
#Properly assign CPU cores to their respective die for EPYC/Threadripper systems
#Based on how hyperthreads are done in these systems
#cat /sys/devices/system/cpu/cpu{0..15}/topology/thread_siblings_list

#The script takes two arguments - the ID of the Proxmox VM to modify, and the core to begin the VM on
#If running this against multiple VMs, make sure to increment this second number by half of the cores of the previous VM
#For example, if I have one 8 core VM and I run this script specifying 0 for the offset, if I spin up a second VM, the second argument would be 4
#this would ensure the second VM starts on core 4 (the 5th core) and assigns sibling cores to match

set -eo pipefail

#take First argument as which VMID to pin CPU cores to, the second argument is which core to start pinning to
VMID=$1
OFFSET=$2

#Determine offset for sibling threads
SIBLING_THREAD_OFFSET=$(cat /sys/devices/system/cpu/cpu0/topology/thread_siblings_list| sed 's/,/ /g' | awk '{print $2}')

#Function to determine number of CPU cores a VM has
cpu_tasks() {
	expect <<EOF | sed -n 's/^.* CPU .*thread_id=\(.*\)$/\1/p' | tr -d '\r' || true
spawn qm monitor $VMID
expect ">"
send "info cpus\r"
expect ">"
EOF
}

#Only act if VMID & OFFSET are set
if [[ -z $VMID  || -z $OFFSET ]]
then
	echo "Usage: cpupin.sh <VMID> <OFFSET>"
	exit 1
else
	#Get PIDs of each CPU core for VM, count number of VM cores, and get even/odd PIDs for assignment
	VCPUS=($(cpu_tasks))
	VCPU_COUNT="${#VCPUS[@]}"
	VCPU_EVEN_THREADS=($(for EVEN_THREAD in "${VCPUS[@]}"; do echo $EVEN_THREAD; done | awk '!(NR%2)'))
	VCPU_ODD_THREADS=($(for ODD_THREAD in "${VCPUS[@]}"; do echo $ODD_THREAD; done | awk '(NR%2)'))

	if [[ $VCPU_COUNT -eq 0 ]]; then
		echo "* No VCPUS for VM$VMID"
		exit 1
	fi

	echo "* Detected ${#VCPUS[@]} assigned to VM$VMID..."
	echo "* Resetting cpu shield..."

	#Start at offset CPU number, assign odd numbered PIDs to their own CPU thread, then increment CPU core number
	#0-3 if offset is 0, 4-7 if offset is 4, etc
	ODD_CPU_INDEX=$OFFSET
	for PID in "${VCPU_ODD_THREADS[@]}"
	do
		echo "* Assigning ODD thread $ODD_CPU_INDEX to $PID..."
		taskset -pc "$ODD_CPU_INDEX" "$PID"
		((ODD_CPU_INDEX+=1))
	done

	#Start at offset + CPU count, assign even number PIDs to their own CPU thread, then increment CPU core number
	#16-19 if offset is 0,	20-23 if offset is 4, etc
	EVEN_CPU_INDEX=$(($OFFSET + $SIBLING_THREAD_OFFSET))
	for PID in "${VCPU_EVEN_THREADS[@]}"
	do
		echo "* Assigning EVEN thread $EVEN_CPU_INDEX to $PID..."
		taskset -pc "$EVEN_CPU_INDEX" "$PID"
		((EVEN_CPU_INDEX+=1))
	done
fi

UBUNTU 20.04 cloned VM same DHCP IP fix

I cloned an Ubuntu 20.04 VM and was frustrated to see both boxes kept getting the same DHCP IP address despite having different network MAC addresses. I finally found on this helpful post which states Ubuntu 20.04 uses systemd-networkd for DHCP leases which behaves differently than dhclient. As wickedchicken states,

systemd-networkd uses a different method to generate the DUID than dhclientdhclient by default uses the link-layer address while systemd-networkd uses the contents of /etc/machine-id. Since the VMs were cloned, they have the same machine-id and the DHCP server returns the same IP for both.

To fix, replace the contents of one or both of /etc/machine-id. This can be anything, but deleting the file and running systemd-machine-id-setup will create a random machine-id in the same way done on machine setup.

So my fix was to run the following on the cloned machine:

sudo rm /etc/machine-id
sudo systemd-machine-id-setup
sudo reboot

That did the trick!


For the systems that registered their hostnames under the wrong IPs, I had to take the following action for my Ubuntu 20.04 desktop as well as my Ubiquiti USG-Pro 4

Ubiquiti: Clear DHCP lease

clear dhcp lease ip <ip_address>

Ubuntu desktop: Flush DNS

sudo systemd-resolve --flush-caches

Folding@home opencl error fix

I decided to contribute my GPU on my Ubuntu-based system to the Folding@Home effort for COVID-19. I kept getting this error message for my NVIDIA GeForce GTX 1050 TI when I tried:

ERROR:WU00:FS00:Failed to start core: OpenCL device matching slot 0 not found, make sure the OpenCL driver is installed or try setting 'opencl-index' manually

I had the nvidia opencl packages installed but apparently missed something. I finally found on the folding at home forum what I was missing – ocl-icd-opencl-dev

sudo apt install ocl-icd-opencl-dev

After running the above command and restarting the FAHClient service, the GPU started folding. For science!


EDIT 5/6/2020: After a re-install I had the issue where the GPU wouldn’t show up at all. It addition to ocl-icd-opencl-dev, it looks like you also need nvidia-cuda-dev.

sudo apt install ocl-icd-opencl-dev nvidia-cuda-dev

Sort by middle of a string

I had a list of items I wanted to sort in a non-standard way:

app-function1.site1.jeppson.org
app-function2.site3.jeppson.org
app-function3.site4.jeppson.org
app-function4.site2.jeppson.org
app-function1.site6.jeppson.org
app-function3.site9.jeppson.org
app-function4.site7.jeppson.org

It’s a generalized list for publication but you get the idea. I wanted to sort by site name. Thanks to this post I found it’s relatively easy. You can tell the sort command to use a character as a tab delimiter (-t) and then specify which key “column” to sort by (-k)

In my case I sorted by site by specifying the dot character '.' as the delimiter, and the second “column” as the key '-k2'

The end result was this:

cat apps-by-site-unsorted.txt | sort -t. -k2
app-function1.site1.jeppson.org
app-function4.site2.jeppson.org
app-function2.site3.jeppson.org
app-function3.site4.jeppson.org
app-function1.site6.jeppson.org
app-function4.site7.jeppson.org
app-function3.site9.jeppson.org

Success

git checkout only specific directory from repo

I have a git repo where I just wanted a specific folder, not the entire repo, cloned to one of my virtual machines. Git doesn’t handle this straightforwardly, but thanks to this article I found there is a roundabout way of doing it., by combining a git sparse checkout and a git shallow checkout.

Below are the commands to run (I ran these directly in my home directory.) Replace FOLDER with the folder from within the repository you wish to clone.

git init <repo> 
cd <repo>
git remote add origin <url to remote repo> 
git config core.sparsecheckout true 
echo "FOLDER/*" >> .git/info/sparse-checkout 
git pull --depth=1 origin master 

Success! Now this particular machine only has the folder within the repo I want, not the entire git repository.

JQ select specific value from array

I had some AWS ec2 JSON output that I needed to parse. I wanted to grab a specific value from an array and it proved to be tricky for a JSON noob like me. I finally found this site which was very helpful: https://garthkerr.com/search-json-array-jq/. In my case I wanted the value of a specific AWS EC2 tag.

The trick is to grab down to the Tags[] array, and then pipe that to a select command. If your tags have dots in them (as mine did) then make sure to quote the tag name. Then add the .Value to the end of the select statement. This is my query:

jq -r '.Reservations[].Instances[].Tags[] | select (.Key == "EC2.Tag.Name").Value' jsonfile.json

The above query grabs all the tags (an array of Key,Value lines), then searches the result for a specific key “EC2.Tag.Name” and returns the Value line associated with it.

WD*EZRZ NAS array spindown fix

I recently acquired some 5TB Western Digital Blue drives (WD50EZRZ.) These particular drives were shucked from external USB enclosures. When I tried to add them into my ZFS raid array, though, I ran into constant problems. I would continually get errors like this from the kernel:

[155069.298001] sd 0:0:10:0: attempting task abort! scmd(ffff8f0678887100)
[155069.298005] sd 0:0:10:0: [sdk] tag#5 CDB: Write(16) 8a 00 00 00 00 01 a8 1e 77 10 00 00 00 58 00 00
[155069.298008] scsi target0:0:10: handle(0x0014), sas_address(0x5001438023a93296), phy(22)
[155069.298010] scsi target0:0:10: enclosure logical id(0x5001438023a932a5), slot(53) 
[155069.298012] sd 0:0:10:0: task abort: SUCCESS scmd(ffff8f0678887100)
[155069.298016] sd 0:0:10:0: [sdk] tag#5 FAILED Result: hostbyte=DID_TIME_OUT driverbyte=DRIVER_OK
[155069.298018] sd 0:0:10:0: [sdk] tag#5 CDB: Write(16) 8a 00 00 00 00 01 a8 1e 77 10 00 00 00 58 00 00
[155069.298020] blk_update_request: I/O error, dev sdk, sector 7115536144
[155069.298023] zio pool=storage vdev=/dev/disk/by-id/ata-WDC_WD50EZRZ-32RWYB1_WD-WX31XXXXXVA-part1 error=5 type=2 offset=3643153457152 size=45056 flags=180880

After a couple of said errors, the drive would be marked as bad and taken out of the array. A battery of tests on a different system revealed the drives to be fine. It did not matter where I inserted these drives on my NAS, they did the same thing, even on ports I knew had working drives. It wasn’t a cabling or other hardware issue.

The drives would resilver back into the array just fine, and then pop out again at random intervals – sometimes minutes later, other times hours later. After a lot of research I came across this post that got me thinking – this sounds like a drive spindown issue! The random nature of it could simply be the drives not being used and then powering themselves down.

I tried using hdparm to set the spindown timer but was greeted with this lovely error:

sudo hdparm -B /dev/sdk
/dev/sdk:
 APM_level	= not supported

I eventually found this post complaining about their Western Digital drives spinning down aggressively.

idle3 to the rescue

The above post mentions apmtimer which did not help me, however more searches reveled this godsend: idle3-tools

idle3-tools is an open source utility to handle spindown on Western Digital drives themselves (not the OS level.)

Download & compile idle3:

wget https://sourceforge.net/projects/idle3-tools/files/latest/download
cd idle3-tools-0.9.1/
make
sudo make install

Use idle3 to query current spindown status (update drive letters to suit your needs)

for drive in {a..p}; do echo /dev/sd$drive; sudo idle3ctl -g /dev/sd$drive; done

For anything that doesn’t say Idle3 timer is disabled run the following:

sudo idle3ctl -s 0 /dev/sd(DRIVE_LETTER)

No more drive spindown!

proxmox openvswitch bond

Recently I had to switch my Proxmox server which was using Linux bonds to using openvswitch. These are my notes:

Install openvswitch:

apt install openvswitch-switch

Configure openvswitch to bond interfaces and use VLANs using https://pve.proxmox.com/wiki/Open_vSwitch as an example:

allow-vmbr0 bond0
iface bond0 inet manual
	ovs_bonds enp4s0f0 eno1
	ovs_type OVSBond
	ovs_bridge vmbr0
	ovs_options bond_mode=active-backup

auto lo
iface lo inet loopback

iface eno1 inet manual

iface enp4s0f0 inet manual

allow-ovs vmbr0
iface vmbr0 inet manual
	ovs_type OVSBridge
	ovs_ports bond0 vlan50 vlan10

#Proxmox communication
allow-vmbr0 vlan50
iface vlan50 inet static
  ovs_type OVSIntPort
  ovs_bridge vmbr0
  ovs_options tag=50
  ovs_extra set interface ${IFACE} external-ids:iface-id=$(hostname -s)-${IFACE}-vif
  address 10.0.50.2
  netmask 255.255.255.0
  gateway 10.0.50.1

#Storage network
allow-vmbr0 vlan10
iface vlan10 inet static
  ovs_type OVSIntPort
  ovs_bridge vmbr0
  ovs_options tag=10
  ovs_extra set interface ${IFACE} external-ids:iface-id=$(hostname -s)-${IFACE}-vif
  address 192.168.10.2
  netmask 255.255.255.0

List active interface:

ovs-appctl bond/show bond0

Update 3/14/2020

I realized that openvswitch won’t fail back over to the original slave once it comes back online. I couldn’t for the life of me find the equivalent of bond-primary syntax for openvswitch; however I did find this command:

ovs-appctl list-commands

which reveals this command:

bond/set-active-slave port slave

So you can manually fallback using this command:

ovs-appctl bond/set-active-slave bond0 enp4s0f1

chroot into encrypted drive

I foolishly went browsing in my EFI partition on my Ubuntu (Elementary OS) laptop and decided to delete the Ubuntu folder. This made my laptop unbootable. This was my procedure to bring it back to life:

Boot into Ubuntu Live CD / USB environment

Decrypt LUKS encrypted drive (https://blog.sleeplessbeastie.eu/2015/11/16/how-to-mount-encrypted-lvm-logical-volume/)

sudo fdisk -l
#Determine encrypted partition is /dev/nvme0n1p3 because it's the largest
sudo cryptsetup luksOpen /dev/nvme0n1p3 encrypted_device
sudo vgchange -ay

Mount encrypted drive & chroot (https://askubuntu.com/questions/831216/how-can-i-reinstall-grub-to-the-efi-partition)

sudo mount /dev/elementary-vg/root /mnt
sudo mount /dev/nvme0n1p2 /mnt/boot/
sudo mount /dev/nvme0n1p1 /mnt/boot/efi
for i in /dev /dev/pts /proc /sys /run; do sudo mount -B $i /mnt$i; done
sudo chroot /mnt
sudo grub-install
update-grub  

#end chroot & unmount
exit
cd
for i in /mnt/dev/pts /mnt/dev  /mnt/proc /mnt/sys /mnt/run /mnt/boot/efi /mnt/boot /mnt; do sudo umount $i;  done