I wanted to know which 10 URLs on my website were returning the most 404s, along with how many hits each of those URLs got. This was right after a huge site redesign, so I was curious which old links people were still trying to reach.
You can get this report with nothing more than the Linux command line and the log file you’re interested in. It involves combining the grep, sed, awk, sort, uniq, and head commands. I enjoyed how well these tools work together so I thought I’d share. Thanks to this site for giving me the inspiration to do this.
This is the command I used to get the information I wanted:
grep '404' _log_file_ | sed 's/, /,/g' | awk {'print $7'} | sort | uniq -c | sort -n -r | head -10
Here is a rundown of each command and why it was used:
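- grep '404' _log_file_ (replace _log_file_ with the filename of your Apache, Tomcat, or Varnish access log.) grep reads a file and returns every line that matches what you want; in this case I’m looking for the number 404 (page not found HTTP error). Note that this matches 404 anywhere in the line, so a URL or response size that happens to contain 404 will be picked up as well.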
- sed 's/, /,/g' Sed will edit a stream of text in any way that you specify. The command I gave it (s/, /,/g) tells sed to look for commas followed by a space and replace them with just commas (eliminating the space after any comma it sees.) This was necessary in my case because the source IP address field sometimes has multiple IP addresses separated by a comma and a space, and the extra spaces shifted the columns that awk counts in the next step. This step may be optional if your server isn’t sitting behind any type of reverse proxy.
- awk {'print $7'} Awk is similar to sed in that it allows you to do all sorts of things to text. In this case we’re telling awk to display only the 7th whitespace-separated column of each line (the URL requested is the 7th column in Apache and Varnish logs). The more common way to write the quoting is awk '{print $7}'.
- sort With no arguments, this command sorts our results alphabetically. That is necessary for the next command to work properly, because uniq only detects duplicates when they are on adjacent lines.
- uniq -c This command collapses each run of identical adjacent lines into a single line. The -c argument prefixes each line with a number indicating how many times that unique string was found.
- sort -n -r Sorts the results by hit count, highest first. The -n argument sorts things numerically, so that 2 comes before 10 instead of after it. The -r argument reverses the order so the highest number is at the top of the results instead of the default, which is to put the lowest number first.
- head -10 Outputs the first 10 lines, which at this point are the top 10 results. Leave this command off if you want to see all the results instead of the top 10. A similar command is tail, if you want to see the last results instead.
This was my output – exactly what I was looking for. Perfect.
2186 http://<sitename>/source/quicken/index.ini
2171 http://<sitename>/img/_sig.png
1947 http://<sitename>/img/email/email1.aspx
1133 http://<sitename>/source/quicken/index.ini
830 http://<sitename>/img/_sig1.png
709 https://<sitename>/img/email/email1.aspx
370 http://<sitename>/apple-touch-icon.png
204 http://<sitename>/apple-touch-icon-precomposed.png
193 http://<sitename>/About-/Plan.aspx
191 http://<sitename>/Contact-Us.aspx
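One variation worth trying, since grep '404' matches that number anywhere in a line: let awk check the status column directly. The following is only a minimal sketch, not the command I ran above. It assumes the Apache combined log format, where the status code is the 9th whitespace-separated column, and the sample log entries and the log_file and top_404s variable names are made up for illustration.

```shell
# Build a tiny sample log in Apache combined format (made-up entries).
log_file=$(mktemp)
cat > "$log_file" <<'EOF'
10.0.0.1 - - [01/Jan/2024:10:00:00 +0000] "GET /old-page.html HTTP/1.1" 404 512 "-" "curl/8.0"
10.0.0.2 - - [01/Jan/2024:10:00:01 +0000] "GET /old-page.html HTTP/1.1" 404 512 "-" "curl/8.0"
10.0.0.3 - - [01/Jan/2024:10:00:02 +0000] "GET /img/logo.png HTTP/1.1" 404 512 "-" "curl/8.0"
10.0.0.4 - - [01/Jan/2024:10:00:03 +0000] "GET /index.html HTTP/1.1" 200 404 "-" "curl/8.0"
EOF

# Same pipeline as before, except awk keeps only lines whose status
# column ($9, an assumption about the log format) is 404, replacing grep.
top_404s=$(sed 's/, /,/g' "$log_file" \
  | awk '$9 == 404 {print $7}' \
  | sort | uniq -c | sort -n -r | head -10)

echo "$top_404s"
rm -f "$log_file"
```

On this sample, /old-page.html is counted twice and /img/logo.png once. The last line is a 200 response whose size happens to be 404 bytes; grep '404' would have included it, while the status-column check skips it. If your log format puts the status code in a different column, adjust $9 accordingly.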