Tag Archives: ZFS

Replace unavail disk in ZFS

I had an issue where I removed a drive in my ZFS array and replaced it with a new drive which the OS gave the same device name (/dev/sdd). I had a hard time getting zfs to replace the drive until I discovered the -g flag for zpool status (thanks to this stackexchange post.)

That did the trick! Simply running zpool status -g showed the GUIDs of each device, which I could then use to properly use zpool replace on:

sudo zpool replace Poolname 12922644002107879117 /dev/sdd

Success!

Reset root password on OPNSense with ZFS root

A lot of the guides for resetting the root password on an OPNSense box assume a UFS root partition. The password recovery steps do not work if you installed OPNSense with a ZFS root partition. If you try to follow the steps you get a lovely error about “unrecognized filesystem”

The process for ZFS (thanks to this article) is instead to run the following commands:

zfs set readonly=off zroot
zfs mount -a

Once that is done, you can proceed with the rest of the steps. Password recovery steps in full:

  1. Press the number 2 immediately on boot to go into single user mode
  2. Press enter when prompted for shell
  3. Make ZFS read/write:
    1. zfs set readonly=off zroot
    2. zfs mount -a
  4. Reset password
    1. opnsense-shell password
  5. Reboot

WD*EZRZ NAS array spindown fix

I recently acquired some 5TB Western Digital Blue drives (WD50EZRZ.) These particular drives were shucked from external USB enclosures. When I tried to add them into my ZFS raid array, though, I ran into constant problems. I would continually get errors like this from the kernel:

[155069.298001] sd 0:0:10:0: attempting task abort! scmd(ffff8f0678887100)
[155069.298005] sd 0:0:10:0: [sdk] tag#5 CDB: Write(16) 8a 00 00 00 00 01 a8 1e 77 10 00 00 00 58 00 00
[155069.298008] scsi target0:0:10: handle(0x0014), sas_address(0x5001438023a93296), phy(22)
[155069.298010] scsi target0:0:10: enclosure logical id(0x5001438023a932a5), slot(53) 
[155069.298012] sd 0:0:10:0: task abort: SUCCESS scmd(ffff8f0678887100)
[155069.298016] sd 0:0:10:0: [sdk] tag#5 FAILED Result: hostbyte=DID_TIME_OUT driverbyte=DRIVER_OK
[155069.298018] sd 0:0:10:0: [sdk] tag#5 CDB: Write(16) 8a 00 00 00 00 01 a8 1e 77 10 00 00 00 58 00 00
[155069.298020] blk_update_request: I/O error, dev sdk, sector 7115536144
[155069.298023] zio pool=storage vdev=/dev/disk/by-id/ata-WDC_WD50EZRZ-32RWYB1_WD-WX31XXXXXVA-part1 error=5 type=2 offset=3643153457152 size=45056 flags=180880

After a couple of said errors, the drive would be marked as bad and taken out of the array. A battery of tests on a different system revealed the drives to be fine. It did not matter where I inserted these drives on my NAS, they did the same thing, even on ports I knew had working drives. It wasn’t a cabling or other hardware issue.

The drives would resilver back into the array just fine, and then pop out again at random intervals – sometimes minutes later, other times hours later. After a lot of research I came across this post that got me thinking – this sounds like a drive spindown issue! The random nature of it could simply be the drives not being used and then powering themselves down.

I tried using hdparm to set the spindown timer but was greeted with this lovely error:

sudo hdparm -B /dev/sdk
/dev/sdk:
 APM_level	= not supported

I eventually found this post complaining about their Western Digital drives spinning down aggressively.

idle3 to the rescue

The above post mentions apmtimer which did not help me, however more searches reveled this godsend: idle3-tools

idle3-tools is an open source utility to handle spindown on Western Digital drives themselves (not the OS level.)

Download & compile idle3:

wget https://sourceforge.net/projects/idle3-tools/files/latest/download
cd idle3-tools-0.9.1/
make
sudo make install

Use idle3 to query current spindown status (update drive letters to suit your needs)

for drive in {a..p}; do echo /dev/sd$drive; sudo idle3ctl -g /dev/sd$drive; done

For anything that doesn’t say Idle3 timer is disabled run the following:

sudo idle3ctl -s 0 /dev/sd(DRIVE_LETTER)

No more drive spindown!

use zdb to remove pesky device from zfs pool

I had the following problem with a device in my pool:

config:

        NAME                                            STATE     READ WRITE CKSUM
        storage                                         DEGRADED     0     0     0
          mirror-0                                      ONLINE       0     0     0
            WORKING_DISK_1  ONLINE       0     0     0
            WORKING_DISK_2    ONLINE       0   0     0
          mirror-1                                      DEGRADED     0     0     0
            WORKING_DISK_3  ONLINE       0     0     0
            replacing-1                                 DEGRADED     0     0     0
              PROBLEM_DISK  FAULTED      6     1     0  too many errors

However when I tried to replace the drive I got this message:

no such device in pool

I found here that you can use zdb to obtain the GUID of the problem device and replace it that way:

root@nas:~# zdb -l PROBLEM_DISK
failed to unpack label 0
------------------------------------
LABEL 1
------------------------------------
    version: 5000
    name: 'storage'
    state: 0
    txg: 5675107
    pool_guid: 8785893899843624400
    errata: 0
    hostname: 'nas'
    top_guid: 9425730683443378041
    guid: 3449631978925631053
    vdev_children: 2
    vdev_tree:
        type: 'mirror'
        id: 1
        guid: 9425730683443378041
        metaslab_array: 41
        metaslab_shift: 35
        ashift: 12
        asize: 4000782221312
        is_log: 0
        create_txg: 4
        children[0]:
            type: 'disk'
            id: 0
            guid: 17168510556101954329
            path: 'WORKING_DISK_3'
            devid: 'WORKING_DISK_3_ID'
            phys_path: 'pci-0000:00:1f.2-ata-2'
            whole_disk: 1
            DTL: 14700
            create_txg: 4
        children[1]:
            type: 'disk'
            id: 1
    ----->  guid: 3449631978925631053
            path: 'PROBLEM_DISK'
            devid: 'PROBLEM_DISK_ID'
            phys_path: 'pci-0000:00:1f.2-ata-4'
            whole_disk: 1
            DTL: 14699
            create_txg: 4
    features_for_read:
        com.delphix:hole_birth
        com.delphix:embedded_data
    labels = 1 2 3 

I used the guid of the problem disk, and all was well:

zpool replace storage 3449631978925631053 NEW_WORKING_DISK

worked instead of complaining the device I was trying to replace didn’t exist.

Using ProxMox as a NAS

Lately I’ve been very unhappy with latest FreeBSD causing reboots randomly during disk resilvering. I simply cannot tolerate random reboots of my fileserver. This fact combined with the migration of OpenZFS to the ZFS on Linux code base means it’s time for me to move from a FreeBSD based ZFS NAS to a Linux-based one.

Sadly there aren’t many options in this space yet. I wanted something where basic tasks were taken care of, like what FreeNAS does, but also supports ZFS. The solution I settled on was ProxMox, which is a hypervisor, but it also has ZFS support.

The biggest drawback of ProxMox vs FreeNAS is the GUI. There are some disk-related GUI options in ProxMox, but mostly it’s VM focused. Thus, I had to configure my required services via CLI.

Following are the settings I used when I configured my NAS to run ProxMox.

Repo setup

If you don’t want to pay for a proxmox license, change the PVE enterprise repository to the free version by modifying /etc/apt/sources.list.d/pve-enterprise.list to the following:

deb http://download.proxmox.com/debian/pve buster pve-no-subscription

Then run at apt update & apt upgrade.

Email alerts

Postfix configuration

Edit /etc/postfix/main.cf and tweak your mail server config as needed (relayhost). Restart postfix after editing:

systemctl restart postfix

Forward mail for root to your own email

Edit /etc/aliases and add an alias for root to forward to your desired e-mail address. Add this line:

root: YOUR_EMAIL_ADDRESS

Afterward run:

newaliases

ZFS configuration

Pool Import

Import the pool using the zpool import -f command (-f to force import despite having been active in a different system)

zpool import -f  

By default they’re imported into the main root directory (/). If you want to have them go to /mnt, use the zfs set mountpoint command:

zfs set mountpoint=/mnt/ 

Monitoring

Install and configure zfs-zed

apt install zfs-zed

Modify /etc/zfs/zed.d/zed.rc and uncomment ZED_EMAIL_ADDR, ZED_EMAIL_PROG, and ZED_EMAIL_OPTS. Edit them to suit your needs (default values work fine, they just need to be uncommented.) Optionally uncomment ZED_NOTIFY_VERBOSE and change to 1 if you want more verbose notices like what FreeNAS does (scrub notifications, for example.)

After modifying /etc/zfs/zed.d/zed.rc, restart zed:

systemctl restart zfs-zed

Scrubbing

By default ProxMox scrubs each of your datasets on the second Sunday of every month. This cron job is located in /etc/cron.d/zfsutils-linux. Modify to your liking.

Snapshot & Replication

There are many different snapshot & replication scripts out there. I landed on Sanoid. Thanks to SvennD for helping me grasp how to get it working.

Install sanoid :

#Install necessary packages
apt install debhelper libcapture-tiny-perl libconfig-inifiles-perl pv lzop mbuffer git
# Clone repo, build deb, install
git clone https://github.com/jimsalterjrs/sanoid.git cd sanoid
ln -s packages/debian . 
dpkg-buildpackage -uc -us 
apt install ../sanoid_*_all.deb 

Snapshots

Edit /etc/sanoid/sanoid.conf with a backup and retention schedule for each of your datasets. Example taken from sanoid documentation:

[data/home]
	use_template = production
[data/images]
	use_template = production
	recursive = yes
	process_children_only = yes
[data/images/win7]
	hourly = 4

#############################
# templates below this line #
#############################

[template_production]
        frequently = 0
        hourly = 36
        daily = 30
        monthly = 3
        yearly = 0
        autosnap = yes
        autoprune = yes

Once sanoid.conf is to your liking, create a cron job to launch sanoid every hour (sanoid determines whether any action is needed when executed.)

crontab -e
#Add this line, save and exit
0 * * * * /usr/sbin/sanoid --cron

Replication

syncoid (part of sanoid) easily replicates snapshots. The syntax is pretty straightforward:

syncoid <source> <destination> -r 
#-r means recursive and is optional

For remote locations specify a username@ before the ip/hostname, then a colon and the dataset name, for example:

syncoid root@10.0.0.1:sourceDataset localDataset -r

You can even have a remote source go to a different remote destination, which is pretty neat.

Other syncoid options of interest:

--debug  #for seeing everything happening, useful for logging
--exclude #Regular expression to exclude certain datasets
--src-bwlimit #Set an upload limit so you don't saturate your bandwidth
--quiet #don't output anything unless it's an error

Automate synchronization by placing the same syncoid command into a cronjob:

0 */4 * * * /usr/sbin/syncoid --exclude=bigdataset1 --source-bwlimit=1M --recursive pool/data root@192.168.1.100:pool/data
#if you don't want status emails when the cron job runs, add --quiet

NFS

Install the nfs-kernel-server package and specify your NFS exports in /etc/exports.

apt install nfs-kernel-server portmap

Example /etc/exports :

/mnt/example/DIR1 192.168.0.0/16(rw,sync,all_squash,anonuid=0,anongid=0)

Restart nfs-server after modifying your exports:

systemctl restart nfs-server

Samba

Install samba, configure /etc/samba/smb.conf, and add users.

apt install samba
systemctl enable smbd

/etc/samba/smb.conf syntax is fairly straightforward. See the samba documentation for more information. Example share configuration:

[exampleshare]
comment = Example share
path = /mnt/example
valid users = user1 user2
writable = yes

Add users to the system itself with the adduser command:

adduser user1

Add those same users to samba with the smbpasswd -a command. Example:

smbpasswd -a user1

Restart samba after making changes:

systemctl restart smbd

SMART monitoring

Taken from https://pve.proxmox.com/wiki/Disk_Health_Monitoring:

By default, smartmontools daemon smartd is active and enabled, and scans the disks under /dev/sdX and /dev/hdX every 30 minutes for errors and warnings, and sends an e-mail to root if it detects a problem. 

Edit the file /etc/smartd.conf to suit your needs. You can specify/exclude devices, smart attributes, etc there. See here for more information. Restart the smartd service after modifying.

UPS monitoring

apc-upsd was easiest for me to configure, so I went with it. Thanks to this blog for giving me the information to get started.

First, install apcupsd:

apt install apcupsd apcupsd-doc

As soon as it was installed my console kept getting spammed about IRQ issues. To stop these errors I stopped the apcupsd daemon:

 systemctl stop apcupsd

Now modify /etc/apcupsd/apcupssd.conf to suit your needs. The section I added for my CyberPower OR2200LCDRT2U was simply:

UPSTYPE usb
DEVICE

Then modify /etc/default/apcupsd to specify it’s configured:

#/etc/default/apcupsd
ISCONFIGURED=yes

After configuring, you can restart the apcupsd service

systemctl start apcupsd

To check the status of your UPS, you can run the apcaccess status command:

/sbin/apcaccess status

Log monitoring

Install Logwatch to monitor system events. Here is a good primer on all of Logwatch’s options.

apt install logwatch

Modify /usr/share/logwatch/default.conf/logwatch.conf to suit your needs. By default it runs daily (defined in /etc/cron.daily/00logwatch). I added the following lines for my config to filter out unwanted information:

Service = "-zz-disk_space"
Service = "-postfix"
Service = "vsmartd"
Service = "-zz-lm_sensors"

Manually run logwatch to get a preview of what you’ll see:

logwatch --range today --mailto YOUR_EMAIL_ADDRESS

UPDATE 8/29/2020
I discovered additional tweaking to logwatch to get it exactly how I like it (thanks to this post and this one at serverfault.)

Defaults for monitored services are located in /usr/share/logwatch/default.conf/services/

You can copy this default file to /etc/logwatch/conf/services/<filename.conf> and then modify the service as needed. In my case I wanted to ignore logins for a particular user from a particular machine. This can be done by copying & editing sshd.conf and adding the following:

# Ignore these hosts
*Remove = 192.168.100.1
*Remove = X.Y.123.123
# Ignore these usernames
*Remove = testuser
# Ignore other noise. Note that we need to escape the ()
*Remove = "pam_succeed_if\(sshd:auth\): error retrieving information about user netscan.*

Troubleshooting

ZFS-ZED not sending email

If ZED isn’t sending emails it’s likely due to an error in the config. For some reason default values still need to be uncommented for zed to work, even if left unaltered. Thanks to this post for the info.

Samba share access denied

If you get access denied when trying to write to a SMB share, double check the file permissions on the server level. Execute chmod / chown as appropriate. Example:

chown user1 -R /mnt/example/user1

zfs drive removal ‘part of active pool’ fix

Occasionally I will manually offline a disk in my ZFS pool for one reason or another. Annoyingly I will sometimes get this error when I try to online that same disk back into the pool:

cannot online /dev/sda: cannot relabel '/dev/sda': unable to read disk capacity

The fix, thankfully, is fairly simple. Simply run the following command (make double sure you’re doing it on the correct device!)

sudo wipefs -a <DEVICE>

After I ran that command ZFS automatically picked the disk back and resilvered it into the pool.

Thanks to this superuser.com discussion for the advice!

Recover files from ZFS snapshot of ProxMox VM

I recently needed to restore a specific file from one of my ProxMox VMs that had been deleted. I didn’t want to roll back the entire VM from a previous snapshot – I just wanted a single file from the snapshot. My snapshots are handled via ZFS using FreeNAS.

Since my VM was CentOS 7 it uses XFS, which made things a bit more difficult. I couldn’t find a way to crash-mount a read-only XFS snapshot – it simply resufed to mount, so I had to make everything read/write. Below is the process I used to recover my file:

On the FreeNAS server, find the snapshot you wish to clone:

sudo zfs list -t snapshot -o name -s creation -r DATASET_NAME

Next, clone the snapshot

sudo zfs clone SNAPSHOT_NAME CLONED_SNAPSHOT_NAME

Next, on a Linux box, use SSHFS to mount the snapshot:

mkdir Snapshot
sshfs -o allow_other user@freenas:/mnt/CLONED_SNAPSHOT_NAME Snapshot/

Now create a read/write loopback device following instructions found here:

sudo -i #easy lazy way to get past permissions issues
cd /path/to/Snapshot/folder/created/above
losetup -P -f VM_DISK_FILENAME.raw
losetup 
#Take note of output, it's likely set to /dev/loop0 unless you have other loopbacks

Note if your VM files are not in RAW format, extra steps will need to be taken in order to convert it to RAW format.

Now we have an SSH-mounted loopback device ready for mounting. Things are complicated if your VM uses LVM, which mine does (CentOS 7). Once the loopback device is set, lvscan should see the image’s logical volumes. Make the desired volume active

sudo lvscan
sudo lvchange -ay /dev/VG_NAME/LV_NAME

Now you can mount your volume:

mkdir Restore
mount /dev/VG_NAME/LV_NAME Restore/

Note: for XFS you must have read/write capability on the loopback device for this to work.

When you’re done, do your steps in reverse to unmount the snaspshot:

#Unmount snapshot
umount Restore
#Deactivate LVM
lvchange -an /dev/VG_NAME/LV_NAME
Remove loopback device
losetup -d /dev/loop0 #or whatever the loopback device was
#Unmount SSHfs mount to ZFS server
umount Snapshot

Finally, on the ZFS server, delete the snapshot:

sudo zfs destroy CLONED_SNAPSHOT_NAME

Troubleshooting

When I tried to mount the LVM partition at this point I got this error message:

mount: /dev/mapper/centos_plexlocal-root: can't read superblock

It ended up being because I was accidentally creating a read-only loopback device. I destroy the loopback device and re-created with write support and all was well.

FreeNAS ZFS tuning for SSDs

I wanted to optimize my all SSD storage array on my FreeNAS server but I had a hard time finding information in one place. After a lot of digging I pulled things from several places. This is what I came up with. It boiled down to two main settings

  • ashift
  • recordsize

Checking ashift on existing pools

zdb -U /data/zfs/zpool.cache | grep ashift

I read here a recommended setting of ashift=13, recordsize=8k for VM workloads on SSDs.

How to change recordsize:

This is easily done in the GUI or command line and can be changed on the fly.

zfs set recordize <value> <volume>

How to change ashift:

Backup your data and destroy the pool.

Modify the setting dictating minimum ashift setting as outlined here

sysctl vfs.zfs.min_auto_ashift=13

Re-create the pool.

Note: this lustre wiki actually recommends ashift 12 instead of 13.

UPDATE 2/26/20: You can also set the ashift at pool creation time as dictated here

zpool create POOL_NAME -o ashift=12 ...

Additional reading

http://open-zfs.org/wiki/Performance_tuning#Alignment_shift
https://www.reddit.com/r/zfs/comments/7pfutp/zfs_pool_planning_for_vm_storage/

ZFS delete oldest n snapshots

I came across a need to trim old ZFS snapshots. These are my quick and dirty notes on how I accomplished it.

Basic syntax taken from here:

 zfs list -H -t snapshot -o name -S creation -r <dataset name> | tail -10

You can omit the -r <dataset name> if you want to query snapshots over all your datasets. Change the tail number for the desired number of oldest snapshots.

You can pass this over to actually delete snapshots using the xargs command:

zfs list -H -t snapshot -o name -S creation -r <dataset name> | tail -10 | xargs -n 1 zfs  destroy

I came across an odd error message when trying to delete some old snapshots:

Can't delete snapshot: dataset busy

I discovered here that that means the snapshots have a hold on them. I read ZFS documentation to learn how to release the holds:

zfs release -r <tag name> <snapshot name>

After massaging these commands for a bit I was able to free up some needed space by removing ancient snapshots.

Change ZFS based NFS SR address in Xenserver

I recently acquired a shiny new set of SSDs to host my VMs. The problem is I needed to create a new ZFS array to accommodate them. I needed to figure out a way to migrate my VMs to the new array and then instruct Xenserver to use the new array instead of the old one.

Fortunately with a bit of research I learned this is fairly painless. Thanks to this discussion on citrix forums that got me pointed in the right direction. To change the server / IP address of an existing NFS storage repository in Xenserver you must do the following:

  • Shut down affected VMs
  • Shutdown any VMs using NFS SRs
  • Copy the NFS SRs (the directories containing the .vhd files) to the new NFS server
  • xe pbd-unplug uuid=<uuid of pbd pointing to the NFS SR>
  • xe pbd-destroy uuid=<uuid of pbd pointing to the NFS SR>
  • xe pbd-create host-uuid=<uuid of Xen Host> sr-uuid=<uuid of the NFS SR> device-config-server=<New NFS server name> device-config-serverpath=<NFS Share Name>
  • xe pbd-plug uuid=<uuid of the pbd created above>
  • Reboot the VMs using NFS SRs

In my case since my VMs were on an existing ZFS volume with snapshots I wanted to preserve, I used ZFS send and receive to transfer data from my old array to my SSD array. Bonus: I was able to do this while the VMs were still running to ensure minimal downtime. My ZFS copy procedure was as follows:

  • Create recursive snapshot of my VM dataset
    zfs snapshot -r storage/VMs@migrate
  • Start the initial data transfer (this took quite some time to finish)
    zfs send -R storage/VMs@migrate | zfs recv ssd/VMs
  • Do another incremental snapshot and transfer after initial huge transfer is complete (this took much less time to do)
    zfs snapshot -r storage/VMs@migrate2
    zfs send -R -i storage/VMs@migrate storage/VMs@migrate2 | zfs recv ssd/VMs
  • Shutdown all affected VMs and do one more ZFS snapshot & transfer to ensure consistent data:
    zfs snapshot -r storage/VMs@migrate3
    zfs send -R -i storage/VMs@migrate2 storage/VMs@migrate3 | zfs recv ssd/VMs

In the above examples my source dataset was storage/VMs and the destination dataset was ssd/VMs.

Once the data was all transferred to the new location it was time to tell Xenserver about it. I had enough VMs that it was worth my time to write a little script to do it. It’s quick and dirty but it did the job. Behold:

#!/bin/bash
#Author: Nicholas Jeppson
#A simple script to change a xenserver NFS storage repository address to a new location
#Modify NFS_SERVER, NFS_PATH and/or NFS_VERSION to match your environment. 
#Run this script on each xenserver host in your pool. Empty output means the transfer was successful.
#This script takes one argument - the name of the SR to be transferred.

SR_NAME="$1"

NFS_SERVER=10.0.0.1
NFS_PATH=/mnt/ssd/VMs/$SR_NAME
NFS_VERSION=4

#Use sed and awk to grab necessary UUIDs
HOST_UUID=$(xe host-list|egrep -B3 `hostname`$ | grep uuid | awk '{print $5}')
PBD_UUID=$(xe pbd-list|grep -A4 -B4 $SR_NAME | grep -B2 $HOST_UUID |grep -w '^uuid ( RO)' | awk '{print $5}')
SR_UUID=$(xe pbd-list|grep -A4 -B4 $SR_NAME | grep -A2 $HOST_UUID | grep 'sr-uuid' | awk '{print $4}')

#Unplug & destroy old NFS location, create new NFS location
xe pbd-unplug uuid=$PBD_UUID
xe pbd-destroy uuid=$PBD_UUID
NEW_PBD_UUID=$(xe pbd-create host-uuid=$HOST_UUID sr-uuid=$SR_UUID device-config-server=$NFS_SERVER device-config-serverpath=$NFS_PATH device-config-nfsversion=$NFS_VERSION)
xe pbd-plug uuid=$NEW_PBD_UUID

Download the script here (right click / save as)

You can run this script in a simple for loop with something like this:

for SR in <list of SR names separated by a space>; do bash <name of script saved from above> $SR; done

If you named the above script nfs-migrate.sh, and you had three SRs to change (blog1, blog2, blog3) then it would be:

for SR in blog1 blog2 blog3; do bash nfs-migrate.sh $SR; done

After I migrated the data and ran that script, my VMs booted up using the new SSD array. Success.