Recently it became apparent that a large data transfer I did might have had some errors. I wanted an easy way to compare the source and destination to make sure they were identical. My solution: rsync, sed, cat and tee.
I have used rsync quite a bit but did not know about the --checksum flag until recently. Running rsync with --checksum takes much longer, but it effectively does something similar to a ZFS scrub: it computes a checksum of every source file and compares it with the checksum of the corresponding destination file. If there is a mismatch, rsync overwrites the destination file with the source file to correct it.
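To make the idea of a checksum comparison concrete, here is a minimal sketch that compares two files by checksum. It is only an illustration: the scratch files are made up, and cksum stands in for whatever algorithm rsync uses internally. The two files have the same size, which is exactly the kind of difference a size-and-timestamp comparison can miss.

```shell
# Hypothetical demo files in a scratch directory (not real migration data).
workdir=$(mktemp -d)
printf 'hello world\n' > "$workdir/source.txt"
printf 'hellO world\n' > "$workdir/dest.txt"   # same size, one byte differs

# Reading from stdin makes cksum print "<checksum> <size>"; keep the checksum.
src_sum=$(cksum < "$workdir/source.txt" | cut -d ' ' -f 1)
dst_sum=$(cksum < "$workdir/dest.txt" | cut -d ' ' -f 1)

if [ "$src_sum" = "$dst_sum" ]; then
    result="match"
else
    result="mismatch"
fi
echo "$result"   # prints "mismatch" for these two files

rm -r "$workdir"
```

rsync --checksum does this kind of comparison for every file pair, which is why it is so much slower than a normal run.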
In my situation I performed a large data migration from my old mdadm-based RAID array to my brand new ZFS array. During the transfer the disks were acting very strangely, and at one point one of the disks even dropped out of the array. The culprit turned out to be a faulty SATA controller. I had bought a cheap 4-port SATA controller from Amazon for my new ZFS array. Do not do this! Spring for a better controller. The cheap ones, this one at least, only cause headaches. I removed it and used the on-board SATA ports on my motherboard, and the issues went away.
All of those shenanigans made me wonder if there was corrupt data on my new ZFS array. A ZFS scrub repaired 15.5G of data! While I'm sure that fixed a lot of the issues, I realized there was probably still some corruption. This is how I verified it:
rsync -Pahn --checksum /path/to/source /path/to/destination | tee migration.txt
-P shows progress, -a means archive mode, -h prints sizes in human-readable units, and -n means dry run (don't actually copy anything).
Tee is a cool utility that allows you to redirect output of a command both to a file and to standard output. This is useful if you want to see the verification take place in real time but also want to analyze it later.
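A quick way to see tee in action without running a long rsync job is to feed it a couple of lines by hand. This is just a sketch; the scratch file is created with mktemp and is not part of the migration.

```shell
# Hypothetical scratch log file, used only for this demonstration.
log=$(mktemp)

# tee copies its input to the terminal and also writes it to "$log".
printf 'first line\nsecond line\n' | tee "$log"

# Confirm that the same lines also landed in the file.
line_count=$(wc -l < "$log")
echo "lines saved: $line_count"

rm "$log"
```

The two lines appear on screen as they pass through, and the file holds a copy that can be inspected later.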
After the comparison (which took a while!) I wanted to see the discrepancies. The -P flag lists each directory rsync checks as well as the files it flagged. You can use sed in conjunction with cat to weed out the unwanted lines (directory listings) so that only the files with discrepancies are left.
cat migration.txt | sed '/\/$/d' | tee migration-truncated.txt
The sed regex simply looks for any line ending in a / (a directory listing) and deletes that line. What is left are the files in question. You can combine the entire thing into one line like so:
rsync -Pahn --checksum /path/to/source /path/to/destination | sed '/\/$/d' | tee migration.txt
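To see what the sed filter does in isolation, you can run it against a few made-up lines. The paths below are hypothetical and only imitate the shape of a file list, with directories ending in a slash.

```shell
# Made-up sample lines; directory entries end in "/", file entries do not.
filtered=$(printf '%s\n' \
    'photos/' \
    'photos/2019/' \
    'photos/2019/beach.jpg' \
    'photos/notes.txt' | sed '/\/$/d')   # delete every line that ends in "/"

# Only the two file entries remain.
printf '%s\n' "$filtered"
```

Only photos/2019/beach.jpg and photos/notes.txt survive the filter, since they are the only lines that do not end in a slash.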
In my case I wanted to review the discrepancies rsync found and decide whether I actually wanted to fix each one. If you are 100% sure the source is good and it is OK to overwrite the destination completely (including deleting anything on the destination that is not in the source), you can simply run:
rsync -Pah --checksum --delete /path/to/source /path/to/destination