Recently I have noticed some odd traffic coming to one of my blogs. This particular blog is set to NOT be indexed by search engines b(robots.txt deny.) Every bot that’s touched that site has honored that file… until now.
Periodically I will get huge spikes of traffic (huge for my small site, anyway.) The culprit is always the same: Apple! Why are they crawling my site? I can’t find a definitive reason. A couple searches reveals articles like this one speculating that Apple is starting a search engine. The problem is the traffic I’m seeing from Apple shows just a safari user agent, nothing about being a bot. A discussion on Reddit talks about Apple crawling sites, but they also list a user agent I’m not seeing.
The user agent reported by the bot that’s been crawling me (ignoring robots.txt file) is:
Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/31.0.1623.0 Safari/537.36
The IPs rotate randomly from Apple’s IP space, with the biggest offender being 22.214.171.124.
I e-mailed Apple at email@example.com requesting they stop this action. I didn’t expect anything from it, and indeed nothing happened. I kept getting crawled.
So, now to the title of this post. I had to tell my Web Application Firewall to block Apple’s address space. Sophos UTM 9.3 makes this easier, although the option is somewhat hidden for some reason. The option is in the “Site Path Routing” tab within the Web Application Firewall context. Once there, edit your site path and check the “Access Control” checkbox.
In my case I decided to block the entire subnet – 126.96.36.199/8. No more Apple crawling.. at least from the 17 network.