Block annoying bots with Apache .htaccess

Recently, one of my sites has had its database crash repeatedly. Investigation revealed that the crashes always happen while an aggressive bot is crawling the site: because the site is small, the flood of requests causes the database to run out of memory and die.

Frustratingly, the Web Application Firewall this site sits behind has no feature for blocking user agents, so I decided to use Apache on the webserver itself as the gatekeeper. The user agent in question? FlipboardProxy. It also conveniently appears to ignore robots.txt.
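
For reference, a well-behaved crawler would honor a robots.txt rule like the one below (the rule itself is illustrative; FlipboardProxy appears to crawl regardless):

User-agent: FlipboardProxy
Disallow: /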

Thanks to this article I learned how to get Apache to block this particular user agent. Creating an .htaccess file (if one doesn’t already exist) in the root directory of the website causes it to apply to the entire site. Within the .htaccess file, place the following directives:

# Block bad bots with a 403. SetEnvIfNoCase matches the User-Agent
# header case-insensitively and sets the bad_bot environment variable.
#SetEnvIfNoCase User-Agent "facebookexternalhit" bad_bot
SetEnvIfNoCase User-Agent "Twitterbot" bad_bot
SetEnvIfNoCase User-Agent "Baiduspider" bad_bot
SetEnvIfNoCase User-Agent "MetaURI" bad_bot
SetEnvIfNoCase User-Agent "mediawords" bad_bot
SetEnvIfNoCase User-Agent "FlipboardProxy" bad_bot
# Allow everyone except requests flagged as bad_bot (Apache 2.2 syntax)
<Limit GET POST HEAD>
 Order Allow,Deny
 Allow from all
 Deny from env=bad_bot
</Limit>
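
A note for newer servers: on Apache 2.4, the Order/Allow/Deny directives only work if the legacy mod_access_compat module is loaded. A sketch of the equivalent using the 2.4 Require syntax (keeping the same SetEnvIfNoCase lines as above, and dropping the <Limit> wrapper so all HTTP methods are covered) would look something like this:

# Apache 2.4: allow everyone, then subtract requests flagged as bad_bot
<RequireAll>
 Require all granted
 Require not env bad_bot
</RequireAll>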

Save the file in the root of your website and make sure its permissions allow your Apache server to read it. Success! FlipboardProxy (and the other bad bots) no longer crashes the site. Instead, it gets served a 403 Forbidden page for every request it makes.
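
To check the block without waiting for the bot to come back, you can spoof the user agent with curl (the path, domain, and version string below are placeholders for your own setup):

# make sure the Apache user can read the file (0644 is typical)
chmod 644 /var/www/html/.htaccess

# spoof the blocked user agent; expect "HTTP/1.1 403 Forbidden"
curl -I -A "FlipboardProxy/1.2" https://example.com/

# a normal request should still return 200
curl -I https://example.com/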
