Recognizing a crawler attack
Apart from comment spam (where you see many comments coming in), your site can also be under heavy load due to a crawler attack.
When you look at b2evolution’s Analytics tab, you may see a huge increase in traffic like this:
If your site did not get mentioned in a prominent source of traffic, this looks suspicious.
Further drilling down into the Web browsers hit summary we can see this:
The majority of the traffic is self-referred: a small number of new users browse a great many pages on the site (or keep reloading them). This is characteristic of a crawler robot pretending to be a human but clicking through your site much faster than a human would.
Note: b2evolution already detects all well-known robots which play nice and identify themselves (see /conf/_stats.php -> $user_agents = array( ... )). Such robots would appear clearly in light orange on the first screenshot above. They would also be easy to control, either by asking them in robots.txt or by blocking them with a Rewrite rule in .htaccess. In this case though, the crawling robot is *not* playing nice. It doesn’t even advertise itself as a robot; it pretends to be a human. That would be almost fine, undetected and problem-free, if only it weren’t "clicking around" so fast.
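For a robot that does identify itself, a Rewrite rule of the kind mentioned in the note could look roughly like the sketch below. It assumes mod_rewrite is enabled and that your host lets .htaccess use it. "ExampleBot" is a made-up placeholder, not a real crawler name; replace it with the user-agent substring you actually see in your hit logs.

```apache
# Sketch only: refuse requests whose User-Agent contains "ExampleBot".
# "ExampleBot" is a hypothetical placeholder, not a real crawler.
<IfModule mod_rewrite.c>
    RewriteEngine On
    # [NC] makes the match case-insensitive
    RewriteCond %{HTTP_USER_AGENT} ExampleBot [NC]
    # "-" means no substitution; [F] answers with 403 Forbidden
    RewriteRule ^ - [F]
</IfModule>
```

This does nothing against a crawler that sends an ordinary browser user agent, which is the case described here.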
If you can isolate this traffic as coming from a single IP (through the "All hits" tab), you may block that IP in .htaccess.
However, modern crawler robots use many different IPs at once. In this case it’s a much more complex problem. You may look at the Performance Optimization page for ideas on optimizing your site so that it better resists such attacks.
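As a minimal sketch of such a block, assuming Apache 2.4 and a host that permits authorization directives in .htaccess, something like the following could be used. The address 203.0.113.42 is a documentation placeholder; substitute the IP you found in the "All hits" tab.

```apache
# Sketch only, for Apache 2.4 (mod_authz_core).
# 203.0.113.42 is a placeholder address, not a real offender.
<RequireAll>
    Require all granted
    Require not ip 203.0.113.42
</RequireAll>

# On Apache 2.2 the older syntax would be used INSTEAD of the block above:
# Order Allow,Deny
# Allow from all
# Deny from 203.0.113.42
```

A blocked client receives a 403 response before any PHP is executed, so each blocked hit costs far less than a full page render.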
Hello!
Thanks sooo much @fplanque for creating this page! It's incredibly useful for b2evolution users. I work with the InMotion Community Support team and we were looking into the case that resulted in you creating this page. Basically, as per the report, a user was getting highly escalated traffic that was resulting in high resource usage by a b2evolution website on one of our shared hosting servers. Unfortunately, in order to keep the site from adversely affecting other accounts on the server, this particular site was suspended.
There are many ways that this can happen, but the main focus of this tutorial was on recognizing a crawler attack. Check your traffic using your available analytics tools (including b2evolution's graph as shown above). If you are a customer of InMotion, a service ticket can also be submitted requesting an analysis of the website traffic. The question is then, how do you stop the crawler, or in this case what we believe to be bots hitting the site?
One of the best ways to help stem the tide is to use your .htaccess file. We have a tutorial that explains how this can be done. The title of the article is Block Unwanted Users on Your Site using .htaccess (http://www.inmotionhosting.com/support/website/security/block-unwanted-users-from-your-site-using-htaccess#block-by-user-agent). We are still in the process of investigating the issue, though the suggestion given to the user was to use caching. I will be posting on the forum about this issue shortly. The good news is that the site is not currently suspended. Taking the steps to block these crawlers from hammering the site will help reduce further problems.
Thanks again for your time and help!
Arnel C.
InMotion Hosting Community Support Team