However, every so often the module craps out and logs a large number of visits from crawlers, bots and spiders as legitimate hits, since those are not in the official list of crawlers, bot and spiders to look out for. To fix this, I went out to look for the list and to add to it. One quick GitHub code search later, I found that the file class-jetpack-user-agent.php is responsible for hosting the list of non-humans to look out for. What I found inside was actually a pretty comprehensive list of software, but one that definitely needed extending.
If you want to do what I did, find the file in your WP installation at –
Inside the file, look for the following array variable –
You’ll see that the array already contains common bots like alexa, googlebot, baiduspider and so on. However, I deepdived (meaning did a sublime text search) into my access.log files and found some more. To extend the array, simply look for the last element (which should be yammybot) and extend it as follows –
'yammybot', 'ahrefsbot', 'pingdom.com_bot', 'kraken', 'yandexbot', 'twitterbot', 'tweetmemebot', 'openhosebot', 'queryseekerspider', 'linkdexbot', 'grokkit-crawler', 'livelapbot', 'germcrawler', 'domaintunocrawler', 'grapeshotcrawler', 'cloudflare-alwaysonline',
Note that you want to leave in the last comma, and you want all the entries in lower case. This doesn’t actually matter, because the PHP function that does the string compare is case-insensitive, but it just looks neater. You’ll also notice that I’ve added the precise names of the bots, like ‘grokkit-crawler’ and ‘clousflare-alwaysonline’ but you can be less specific and save yourself some pain. This will, however, affect your final stats outcome.
- Some of the bots are pretty interesting. I saw tweetmemebot, which is from a company called datasift, which seems to be in the business of trawling all social networks for interesting links and providing meaningful insights into them. Another was twitterbot. Why the heck does twitter need to send out a bot? We submit our links to it willingly! Also interesting were livelapbot, germcrawler and kraken. I have no idea why they’re looking at my site.
- Although Jetpack does not have a comprehensive list of bots, it still does a pretty good job. I found the main culprit of the stats mess in my case. Turns out, CloudFlare, in an effort to provide their AlwaysOnline service (which is enabled for my site), looks at all our pages frequently and this doesn’t sit well with Jetpack. I hope this tweak will fix this now.
- Although this fix is currently in place, every time the Jetpack plugin gets updated, all these entries will disappear. That’s why this blog post is both a tutorial for you all and a reminder and diary entry for me to make this change every time I run a Jetpack update. However, if someone can tell me a way to permanently extend Jetpack, or if someone can reach out to the Jetpack team (hey Nitin, why don’t you file a GitHub issue against this?) it’ll be awesome and I’ll be super thankful!
Update – I was trying to be hip and did a fork of Jetpack and GitHub, made the changes and then tried to make a pull request. Turns out, I don’t know how to do that, so I opened an issue instead. It sits here.