Scenario: A drive in your RAIDZ pool has gone bad. You have a replacement drive ready to go. You pull the drive you think is the failed one... only to realize that you just pulled a good drive, taking the array completely offline.
Has this happened to you? It has not happened to me yet, but I wanted to see how ZFS would handle it, and I have to say I am pretty impressed.
I purposely pulled two working drives from my test zpool. The status of the pool became UNAVAIL, as is to be expected with two drives missing from a raidz1. The zpool status command gave a helpful hint: “Replace the drive and run zpool clear.”
I re-inserted the last drive I had previously pulled and ran the command:
zpool clear storage
That was all I had to do! The array came back up (although in a degraded state) and all my files were there.
Output of zpool status at this point:
[root@freenas /data]# zpool status
  pool: storage
 state: DEGRADED
status: One or more devices has been removed by the administrator.
        Sufficient replicas exist for the pool to continue functioning in a
        degraded state.
action: Online the device using 'zpool online' or replace the device with
        'zpool replace'.
  scan: scrub repaired 0 in 0h26m with 0 errors on Sun Sep 7 09:51:19 2014
config:

        NAME                      STATE     READ WRITE CKSUM
        storage                   DEGRADED     0     0     0
          raidz1-0                DEGRADED     0     0     0
            ada2p1                ONLINE       0     0     0
            7167795297630497018   REMOVED      0     0     0  was /dev/ada3p1
            ada4p1                ONLINE       0     0     0
            ada1p1                ONLINE       0     0     0

errors: No known data errors
My next experiment was to bring the pool back to full health. I tried to simply add the other drive I had pulled back into the pool, but it complained that the drive was already part of the pool. The drive in question used to be labeled ada3p1, so I tried “zpool detach storage ada3p1”, but that complained as well: “only applicable to mirror and replacing vdevs.”
After some searching, I found a mention here that you can call out specific devices in your pool to clear. I ran “zpool clear storage ada3p1” and it completed without any issues; however, it still wouldn’t let me add the drive back into the pool, saying it was already there.
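As far as I can tell, the general form is zpool clear <pool> [device]: with no device it clears the error state for every device in the pool, and with a device it targets just that one (pool and device names below are the ones from my pool).

zpool clear storage            # clear errors on every device in the pool
zpool clear storage ada3p1     # clear errors on just this one device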
What allowed me to bring the array back to full health was:
zpool online storage ada3p1
The amazing part: ZFS realized that it only needed to resilver a small amount of data to bring the drive back in sync with the rest of the pool!
  scan: resilvered 24K in 0h0m with 0 errors on Sun Sep 7 12:23:39 2014
config:

        NAME          STATE     READ WRITE CKSUM
        storage       ONLINE       0     0     0
          raidz1-0    ONLINE       0     0     0
            ada2p1    ONLINE       0     0     0
            ada3p1    ONLINE       0     0     0
            ada4p1    ONLINE       0     0     0
            ada1p1    ONLINE       0     0     0
Compared to mdadm, which would rebuild the entire array even if you put back the very same disk you had pulled, this is astounding.
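If you want extra reassurance after an episode like this, a scrub will read back every block in the pool and verify its checksums (the same operation that produced the “scrub repaired 0” line in the earlier status output):

zpool scrub storage     # start a full read-and-verify pass over the pool
zpool status storage    # check on the scrub's progress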
I realized that this quick recovery only works if you put the same drive you just pulled back into the array, so I then tried pulling a drive and putting a different one in its place. After partitioning the new drive (quick sketch of that below), a simple
zpool replace storage 7167795297630497018 ada3p1
did the trick (the long string of numbers is the placeholder ID for the drive that was pulled; zpool status will tell you what that number is). Done.
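In case it helps, the partitioning step I breezed past is just a couple of gpart commands on FreeBSD/FreeNAS. Roughly, assuming the replacement disk also shows up as ada3 (adjust the device name for your system), something like:

gpart create -s gpt ada3        # write a fresh GPT partition table to the new disk
gpart add -t freebsd-zfs ada3   # add one ZFS partition spanning the disk; it becomes ada3p1
zpool replace storage 7167795297630497018 ada3p1

If the disk came out of another machine and still has an old partition table on it, a gpart destroy -F ada3 first will wipe it.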