Recently, one of my sites has been having its database crash repeatedly. Investigation revealed that the crashes always happen while an aggressive bot is crawling the site. Because the site is small, the crawl was causing the database to run out of memory and die.
The Web Application Firewall this site sits behind frustratingly does not have a feature for blocking user agents, so I decided to fall back on Apache on the webserver itself to serve as a gatekeeper. The user agent in question? FlipboardProxy, which also conveniently appears to ignore robots.txt.
Thanks to this article I learned how to get Apache to block this particular user agent. Creating an .htaccess file in the root directory of the website (if one doesn't already exist) causes its rules to apply to the entire site. Within the .htaccess file, place the following directives:
#block bad bots with a 403
#SetEnvIfNoCase User-Agent "facebookexternalhit" bad_bot
SetEnvIfNoCase User-Agent "Twitterbot" bad_bot
SetEnvIfNoCase User-Agent "Baiduspider" bad_bot
SetEnvIfNoCase User-Agent "MetaURI" bad_bot
SetEnvIfNoCase User-Agent "mediawords" bad_bot
SetEnvIfNoCase User-Agent "FlipboardProxy" bad_bot

<Limit GET POST HEAD>
Order Allow,Deny
Allow from all
Deny from env=bad_bot
</Limit>
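One note for newer servers: Order/Allow/Deny is the Apache 2.2-style access control, and on Apache 2.4 it only works if mod_access_compat is loaded. If your server uses the newer mod_authz_core directives instead, a roughly equivalent block (a sketch I haven't tested on this particular setup) would look like:

<Limit GET POST HEAD>
<RequireAll>
# allow everyone by default
Require all granted
# but refuse any request whose user agent set the bad_bot variable above
Require not env bad_bot
</RequireAll>
</Limit>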
Save the file in the root of your website and make sure its permissions allow your Apache server to read it. Success! FlipboardProxy (and the other bad bots) no longer crashes the site. Instead, it gets served a 403 Forbidden page for every request it makes.
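If you want to confirm the block from the command line, you can spoof the offending user agent with curl. The user agent string below is just an illustration; anything containing "FlipboardProxy" (the match is case-insensitive) should trip the rule, and example.com stands in for your own domain:

curl -I -A "FlipboardProxy/1.2" https://example.com/   # should come back HTTP/1.1 403 Forbidden
curl -I -A "Mozilla/5.0" https://example.com/          # a normal browser agent should still get 200 OK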