Tag Archives: awk

Change ZFS based NFS SR address in Xenserver

I recently acquired a shiny new set of SSDs to host my VMs. The problem is I needed to create a new ZFS array to accommodate them. I needed to figure out a way to migrate my VMs to the new array and then instruct Xenserver to use the new array instead of the old one.

Fortunately with a bit of research I learned this is fairly painless. Thanks to this discussion on citrix forums that got me pointed in the right direction. To change the server / IP address of an existing NFS storage repository in Xenserver you must do the following:

  • Shut down affected VMs
  • Shutdown any VMs using NFS SRs
  • Copy the NFS SRs (the directories containing the .vhd files) to the new NFS server
  • xe pbd-unplug uuid=<uuid of pbd pointing to the NFS SR>
  • xe pbd-destroy uuid=<uuid of pbd pointing to the NFS SR>
  • xe pbd-create host-uuid=<uuid of Xen Host> sr-uuid=<uuid of the NFS SR> device-config-server=<New NFS server name> device-config-serverpath=<NFS Share Name>
  • xe pbd-plug uuid=<uuid of the pbd created above>
  • Reboot the VMs using NFS SRs

In my case since my VMs were on an existing ZFS volume with snapshots I wanted to preserve, I used ZFS send and receive to transfer data from my old array to my SSD array. Bonus: I was able to do this while the VMs were still running to ensure minimal downtime. My ZFS copy procedure was as follows:

  • Create recursive snapshot of my VM dataset
    zfs snapshot -r storage/VMs@migrate
  • Start the initial data transfer (this took quite some time to finish)
    zfs send -R storage/VMs@migrate | zfs recv ssd/VMs
  • Do another incremental snapshot and transfer after initial huge transfer is complete (this took much less time to do)
    zfs snapshot -r storage/VMs@migrate2
    zfs send -R -i storage/VMs@migrate storage/VMs@migrate2 | zfs recv ssd/VMs
  • Shutdown all affected VMs and do one more ZFS snapshot & transfer to ensure consistent data:
    zfs snapshot -r storage/VMs@migrate3
    zfs send -R -i storage/VMs@migrate2 storage/VMs@migrate3 | zfs recv ssd/VMs

In the above examples my source dataset was storage/VMs and the destination dataset was ssd/VMs.

Once the data was all transferred to the new location it was time to tell Xenserver about it. I had enough VMs that it was worth my time to write a little script to do it. It’s quick and dirty but it did the job. Behold:

#Author: Nicholas Jeppson
#A simple script to change a xenserver NFS storage repository address to a new location
#Modify NFS_SERVER, NFS_PATH and/or NFS_VERSION to match your environment. 
#Run this script on each xenserver host in your pool. Empty output means the transfer was successful.
#This script takes one argument - the name of the SR to be transferred.



#Use sed and awk to grab necessary UUIDs
HOST_UUID=$(xe host-list|egrep -B3 `hostname`$ | grep uuid | awk '{print $5}')
PBD_UUID=$(xe pbd-list|grep -A4 -B4 $SR_NAME | grep -B2 $HOST_UUID |grep -w '^uuid ( RO)' | awk '{print $5}')
SR_UUID=$(xe pbd-list|grep -A4 -B4 $SR_NAME | grep -A2 $HOST_UUID | grep 'sr-uuid' | awk '{print $4}')

#Unplug & destroy old NFS location, create new NFS location
xe pbd-unplug uuid=$PBD_UUID
xe pbd-destroy uuid=$PBD_UUID
NEW_PBD_UUID=$(xe pbd-create host-uuid=$HOST_UUID sr-uuid=$SR_UUID device-config-server=$NFS_SERVER device-config-serverpath=$NFS_PATH device-config-nfsversion=$NFS_VERSION)
xe pbd-plug uuid=$NEW_PBD_UUID

Download the script here (right click / save as)

You can run this script in a simple for loop with something like this:

for SR in <list of SR names separated by a space>; do bash <name of script saved from above> $SR; done

If you named the above script nfs-migrate.sh, and you had three SRs to change (blog1, blog2, blog3) then it would be:

for SR in blog1 blog2 blog3; do bash nfs-migrate.sh $SR; done

After I migrated the data and ran that script, my VMs booted up using the new SSD array. Success.

Find top 10 requests returning 404 errors

I had a website where I was curious what the top 10 URLs that were returning 404s were along with how many hits those URLs got. This was after a huge site redesign so I was curious what old links were still trying to be accessed.

Getting a report on this can be accomplished with nothing more than the Linux command line and the log file you’re interested in. It involves combining grep, sed, awk, sort, uniq, and head commands. I enjoyed how well these tools work together so I thought I’d share. Thanks to this site for giving me the inspiration to do this.

This is the command I used to get the information I wanted:

grep '404' _log_file_ | sed 's/, /,/g' | awk {'print $7'} | sort | uniq -c | sort -n -r | head -10

Here is a rundown of each command and why it was used:

  • grep ‘404’ _log_file_ (replace with filename of your apache, tomcat, or varnish access log.) grep reads a file and returns all instances of what you want, in this case I’m looking for the number 404 (page not found HTTP error)
  • sed ‘s/, /,/g’ Sed will edit a stream of text in any way that you specify. The command I gave it (s/, /,/g) tells sed to look for instances of commas followed by spaces and replace them with just commas (eliminating the space after any comma it sees.) This was necessary in my case because sometimes the source IP address field has multiple IP addresses and it messed up the results. This may be optional if your server isn’t sitting behind any type of reverse proxy.
  • awk {‘print $7’} Awk has a lot of similar functions to sed – it allows you to do all sorts of things to text. In this case we’re telling awk to only display the 7th column of information (the URL requested in apache and varnish logs is the 7th column)
  • sort This command (absent of arguments) sorts our results alphabetically, which is necessary for the next command to work properly.
  • uniq -c This command eliminates any duplicates in the results. The -c argument adds a number indicating how many times that unique string was found.
  • sort -n -r Sorts the results in reverse alphabetical order. The -n argument sorts things numerically so that 2 follows 1 instead of 10. -r Indicates to reverse the order so the highest number is at the top of the results instead of the default which is to put the lowest number first.
  • head -10 outputs the top 10 results. This command is optional if you want to see all the results instead of the top 10. A similar command is tail – if you want to see the last results instead.

This was my output – exactly what I was looking for. Perfect.

2186 http://<sitename>/source/quicken/index.ini
2171 http://<sitename>/img/_sig.png
1947 http://<sitename>/img/email/email1.aspx
1133 http://<sitename>/source/quicken/index.ini
830 http://<sitename>/img/_sig1.png
709 https://<sitename>/img/email/email1.aspx
370 http://<sitename>/apple-touch-icon.png
204 http://<sitename>/apple-touch-icon-precomposed.png
193 http://<sitename>/About-/Plan.aspx
191 http://<sitename>/Contact-Us.aspx

Website load testing with wget and siege

In an effort to see how well my varnish configuration stood up to real world tests I came across a really cool piece of software: siege. You can use siege to benchmark / load test your site by having it hit your site repeatedly in a configurable fashion.

I played around with the  siege configuration until I found one I liked. It involves using wget to spider your target site, varnishncsa and tee (if you use varnish), awk to parse the access log into a usable URL list, and then passing that list to siege to test the results.


For my purposes I wanted to test how well varnish performed. Varnish is a caching server that sits in front of your webserver. By default it doesn’t log anything – this is where varnishncsa comes in. Run the following command to monitor the varnish log while simultaneously outputting the results to a file:

varnishncsa | tee varnish.log

If you don’t use varnish and rather serve pages up directly through nginx, apache, or something else, then you can skip this step.


The next step is to spider your site to get a list of URLs to send to siege; wget fits in nicely for this task. This is the configuration I used (replace aoeu.1234.com with the domain of your web server)

wget -r -l4 --spider -D aoeu1234.com http://www.aoeu1234.com

-r tells wget to be recursive (follow links)
-l is the number of levels you want to go down when following links
— spider tells wget to not actually download anything (no need to save the results, we just want to hit the site to generate log entries)
-D specifies a comma separated list of domains you want wget to crawl (useful if you don’t want wget to follow links outside your site)


Once the spidering is finished, we want to use awk to parse the resulting server logfile to extract the URLs that were accessed and pipe the results into a file named crawl.txt.

awk '{ print $7 }' varnish.log | sort | uniq > crawl.txt

varnish.log is what I named the output from the varnishncsa command above; if you use apache/nginx, then you would use your respective access log file instead.


This is where the fun comes in. You can now use siege to stress test your website to see how well it does.

siege -c 550 -i -t 3M -f crawl.txt -d 10

There are many siege options so you’ll definitely want to read up on the man page. I picked the following based on this site:

-c concurrent connections. I had to tweak siege configuration to go up that high.
-i tells siege to access the URLs in a random fashion
-t specifies how long to run siege for in seconds, minutes, or hours
-f specifies a list of URLs for siege to use
-d specifies the maximum delay between user requests (it will pick a random number between 0 and this specified number to make requests more random)

siege configuration

I had to tweak siege a little bit to allow more than 256 connections at a time. First, run


to generate an initial siege configuration. Then modify ~/.siege/siege.conf to increase the limit of concurrent connections:

sed -i ~/.siege/siege.conf -e 's/limit = 256/limit = 1000/g'

Note: the config says the following:

This was fine in my case because it’s hitting varnish, not apache, so we can really ramp up the simultaneous user count.

sysctl configuration

When playing with siege I ran into a few different error messages. Siege is a complicated enough program that it requires you to tweak system settings for it to work fully. I followed some advice on this site to tweak my sysctl configuration to make it work better. Your mileage may vary.


# Increase size of file handles and inode cache
fs.file-max = 2097152

# Do less swapping
vm.swappiness = 10
vm.dirty_ratio = 60
vm.dirty_background_ratio = 2


# Number of times SYNACKs for passive TCP connection.
net.ipv4.tcp_synack_retries = 2

# Allowed local port range
net.ipv4.ip_local_port_range = 2000 65535

# Protect Against TCP Time-Wait
net.ipv4.tcp_rfc1337 = 1

# Decrease the time default value for tcp_fin_timeout connection
net.ipv4.tcp_fin_timeout = 15

# Decrease the time default value for connections to keep alive
net.ipv4.tcp_keepalive_time = 300
net.ipv4.tcp_keepalive_probes = 5
net.ipv4.tcp_keepalive_intvl = 15


# Default Socket Receive Buffer
net.core.rmem_default = 31457280

# Maximum Socket Receive Buffer
net.core.rmem_max = 12582912

# Default Socket Send Buffer
net.core.wmem_default = 31457280

# Maximum Socket Send Buffer
net.core.wmem_max = 12582912

# Increase number of incoming connections
net.core.somaxconn = 4096

# Increase number of incoming connections backlog
net.core.netdev_max_backlog = 65536

# Increase the maximum amount of option memory buffers
net.core.optmem_max = 25165824

# Increase the maximum total buffer-space allocatable
# This is measured in units of pages (4096 bytes)
net.ipv4.tcp_mem = 65536 131072 262144
net.ipv4.udp_mem = 65536 131072 262144

# Increase the read-buffer space allocatable
net.ipv4.tcp_rmem = 8192 87380 16777216
net.ipv4.udp_rmem_min = 16384

# Increase the write-buffer-space allocatable
net.ipv4.tcp_wmem = 8192 65536 16777216
net.ipv4.udp_wmem_min = 16384

# Increase the tcp-time-wait buckets pool size to prevent simple DOS attacks
net.ipv4.tcp_max_tw_buckets = 1440000
net.ipv4.tcp_tw_recycle = 1
net.ipv4.tcp_tw_reuse = 1

Unzip multiple files into a single directory

Occasionally I have a need to unzip multiple zip files into a single directory, renaming any files with duplicate names so all zip contents end up in the same directory. I accomplish this in a lazy fashion with find and unzip.

First, put all the zip files you need to extract in the same folder. I used find with the -ctime flag to find files created today (as those are the ones I want.) I use the -exec flag to execute a command on the resulting files; in this case, the unzip program with the -B flag, which doesn’t overwrite files with duplicate names, and the -d flag to specify which folder to extract to:

find *.zip -ctime 1 unzip -B {} -d example/ \;

This finds and unzips all my files, but there is a catch: files with the same filename have been renamed with tildes at the end of each file, for example pic1.jpg~. I do another quick find operation to simply tack .jpg to the end of each of these files

find example/ -name "*~*" -exec mv {} {}.jpg \;

The result is a directory full of the files you desire. My case is very simplistic as it assumes that all files in each zip file are of the same extension. You could use something like awk to parse the result of find command and re-add appropriate extensions, but I won’t detail that here (see laziness reference above.)


Manipulate dd output with sed, tr, and awk

While improving a backup script I came across a need to modify the information output by the dd command. FreeNAS, the system the script runs on, does not have a version of dd that has the human readable option. My quest: make the output from dd human readable.

I’ve only partially fulfilled this quest at this point, but the basic functionality is there. The first step is to redirect all output of the dd command (both stdout and stderr) to a variable (this particular syntax is for bash)

DD_OUTPUT=$(dd if=/dev/zero of=test.img bs=1000000 count=1000 2>&1)

The 2>&1 redirects all output to the variable instead of the console.

The next step is to use sed to remove unwanted information (number records in and out)

sed -r '/.*records /d'

Next, remove any parenthesis. This is due to the parenthesis from dd output messing up the math that’s going to be done later.

tr -d '()'

-d ‘()’ simply tells tr to remove any instance of the characters between the single quotes.

And now, for awk. Awk allows you to manipulate certain pieces of a text file. Each section, separated by a space, is known as a record. Awk lets us edit some parts of text while keeping others intact. My expertise with awk is that of a n00b so I’m sure there is a more efficient way of doing this; nevertheless here is my solution:

awk '{print ($1/1024)/1024 " MB" " " $3 " " $4 " " $5*60 " minutes"  " (" ($7/1024)/1024 " MB/sec)" }'

Since dd outputs its information in bytes I’m having it divide by 1024 twice so the resulting number is now in megabytes. I also have it divide the seconds by 60 to return the number of minutes dd took. Additionally I’m re-inserting the parenthesis removed by the tr command now that the math is correctly done.

The last step is to pipe all of this together:

DD_OUTPUT=$(dd if=/dev/zero of=test.img bs=1000000 count=10000 2>&1)
DD_HUMANIZED=$(echo "$DD_OUTPUT" | sed -r '/.*records /d' | tr -d '()' | awk '{print ($1/1024)/1024 " MB" " " $3 " " $4 " " $5/60 " minutes"  " (" ($7/1024)/1024 " MB/sec)" }')

After running the above here are the results:

Original output:

echo "$DD_OUTPUT"
10000+0 records in
10000+0 records out
10000000000 bytes transferred in 103.369538 secs (96740299 bytes/sec)

Humanized output:

9536.74 MB transferred in 1.72283 minutes (92.2587 MB/sec)

The printf function of awk would allow us to round the calculations but I couldn’t quite get the syntax to work and have abandoned the effort for now.

Of course all this is not necessary if you have the correct dd version. The GNU version of dd defaults to human readability; certain BSD versions have the option of passing the msgfmt=human argument per here.

Update: I discovered that the awk method above will only print the one line it finds and ignore all other lines, which is less than ideal for scripting. I updated the awk syntax to do a search and replace (sub) instead so that it will print all other lines as well:

awk '{sub(/.*bytes /, $1/1024/1024" MB "); sub(/in .* secs/, "in "$5/60" mins "); sub(/mins .*/, "mins (" $7/1024/1024" MB/sec)"); printf}')

My new all in one line is:

DD_HUMANIZED=$(echo "$DD_OUTPUT" | sed -r '/.*records /d' | tr -d '()' | awk '{sub(/.*bytes /, $1/1024/1024" MB "); sub(/in .* secs/, "in "$5/60" mins "); sub(/mins .*/, "mins (" $7/1024/1024" MB/sec)"); printf}')