Protecting Your Site From Aggressive Web Crawler Bots

If your website is consuming an abnormal amount of bandwidth and we have confirmed that this is due to web crawlers (also known as bots), the steps below can help reduce the load.

You don't want to block 'good' web crawlers, or your SEO may be affected. You can, however, block some known bad bots. To do so, edit or create the .htaccess file in your website's root folder (/httpdocs) and add the lines from the following paste at the top of the file: http://pastebin.com/L397kQ9A

Note: Once you open that URL, please click the toggle button.
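The exact rules are in the paste above. As a rough sketch of the technique only (the bot names below are placeholders, not the actual list from the paste), such an .htaccess block matches the User-Agent header of known bad bots and returns a 403 Forbidden response:

# Illustrative only -- replace the placeholder bot names with the list from the paste
<IfModule mod_rewrite.c>
RewriteEngine On
# Match the User-Agent header against known bad bots (case-insensitive)
RewriteCond %{HTTP_USER_AGENT} (BadBotOne|BadBotTwo|BadBotThree) [NC]
# Return 403 Forbidden and stop processing further rules
RewriteRule .* - [F,L]
</IfModule>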

If search engine bots are visiting and indexing your website too often and are overloading your account or server:

You can set the number of seconds web crawlers must wait between successive requests to your website. To do so, edit or create a robots.txt file in the domain's folder on the server (/httpdocs) and add the following lines:


User-agent: * 
Crawl-delay: 5

We would also recommend blocking access to folders that contain sensitive data that should not be reached by web crawlers. For example, if you have a WordPress application installed, you can tell the bots not to access folders and files that shouldn't be indexed by adding the following additional lines to the robots.txt file (the complete resulting file is shown after the list):

Disallow: /feed/
Disallow: /trackback/
Disallow: /wp-admin/
Disallow: /wp-content/
Disallow: /wp-includes/
Disallow: /xmlrpc.php
Disallow: /wp-*
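
Note that Disallow rules apply to the crawler group they follow, so these lines belong directly under the User-agent line added earlier. For a WordPress site, the complete robots.txt would then look like this:

User-agent: *
Crawl-delay: 5
Disallow: /feed/
Disallow: /trackback/
Disallow: /wp-admin/
Disallow: /wp-content/
Disallow: /wp-includes/
Disallow: /xmlrpc.php
Disallow: /wp-*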


Finally, Google's crawler does not obey the Crawl-delay directive; Google has its own crawl rate setting in its webmaster tools system. To access it, you will first need to sign up and verify your site with Google Webmaster Tools:

http://www.google.com/webmasters/

Here is an article on how to change the crawl rate for Google's web crawlers:

http://support.google.com/webmasters/bin/answer.py?hl=en&answer=48620

You will need to set the crawl rate every 90 days, or it will revert to its default value.

