Monitoring Web Indexing Robots

With the rapid growth in the number of search engines, and of the robots they deploy to roam the World Wide Web finding and indexing content for their databases, web site administrators have become increasingly concerned about just how much of their precious server time and bandwidth is spent servicing requests from these engines.

Whilst the majority of web servers keep comprehensive transfer logs, identifying robot activity within those logs can be difficult. It is possible to pick out robots from the log files by hand, but it is a time-consuming process; far better is an automated approach, such as Botwatch, a Perl utility.
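To illustrate the general idea (this is a sketch, not Botwatch's actual method), a simple heuristic is that any client which fetches /robots.txt is probably a robot, so its subsequent requests can be tallied separately. The sketch below parses Combined Log Format lines with a hypothetical regular expression; the sample log entries are invented.

```python
import re
from collections import Counter

# Combined Log Format: host ident user [time] "request" status size ...
LOG_RE = re.compile(r'^(\S+) \S+ \S+ \[[^\]]+\] "(\S+) (\S+)[^"]*" (\d{3}) \S+')

def summarise_robots(lines):
    """Heuristic robot summary: any host that requests /robots.txt is
    treated as a robot, and all requests from such hosts are counted."""
    robot_hosts = set()
    parsed = []
    for line in lines:
        m = LOG_RE.match(line)
        if not m:
            continue  # skip malformed log lines
        host, method, path, status = m.groups()
        parsed.append((host, path))
        if path == "/robots.txt":
            robot_hosts.add(host)
    hits = Counter()
    for host, path in parsed:
        if host in robot_hosts:
            hits[host] += 1
    return hits

# Invented sample entries for demonstration
sample = [
    '1.2.3.4 - - [10/Oct/1997:13:55:36 +0000] "GET /robots.txt HTTP/1.0" 200 68',
    '1.2.3.4 - - [10/Oct/1997:13:55:40 +0000] "GET /index.html HTTP/1.0" 200 2048',
    '5.6.7.8 - - [10/Oct/1997:13:56:00 +0000] "GET /index.html HTTP/1.0" 200 2048',
]
print(summarise_robots(sample))  # only requests from 1.2.3.4 are counted
```

A real tool would of course match known robot user-agent strings and host names as well, since not every robot fetches /robots.txt.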

This approach will detect robots that are well behaved, but some robots now appearing are broken, either deliberately or accidentally. These may access areas of your site that are protected by a robots.txt file, or send requests to your site at a ridiculously rapid rate. There is a mailing list for reports of robots of this kind, and Rob Hartill of the IMDB has written a script to detect them and mail the server operator about them as they happen.
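Detecting the second kind of misbehaviour, an excessive request rate, amounts to counting each host's requests within a sliding time window. The following is a minimal sketch of that idea (not Rob Hartill's script); the threshold of 30 requests per minute and the host names are assumptions.

```python
from collections import defaultdict, deque
from datetime import datetime, timedelta

def rapid_hosts(events, max_hits=30, window=timedelta(seconds=60)):
    """Flag hosts making more than max_hits requests within a sliding window.

    events: iterable of (host, request_time) pairs, chronological per host.
    """
    recent = defaultdict(deque)  # per-host timestamps inside the window
    flagged = set()
    for host, when in events:
        q = recent[host]
        q.append(when)
        # Discard timestamps that have fallen out of the window
        while q and when - q[0] > window:
            q.popleft()
        if len(q) > max_hits:
            flagged.add(host)
    return flagged

base = datetime(1997, 1, 1)
# Hypothetical hosts: one hitting every second, one every ten seconds
fast = [("fast.example.com", base + timedelta(seconds=i)) for i in range(40)]
slow = [("slow.example.com", base + timedelta(seconds=10 * i)) for i in range(10)]
print(rapid_hosts(fast + slow))
```

An alert script built on this could then mail the operator whenever a new host joins the flagged set, which is roughly the behaviour the article describes.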

robots.txt files are the key to stopping robots from indexing parts of your server, yet many people still fail to get the syntax right. For this reason, I've written a robots.txt syntax checker which will verify that your robots.txt file will be obeyed by current robots.
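The core of such a check is small: under the original robots exclusion format, a robots.txt file contains only comments, blank lines, and "User-agent:" and "Disallow:" directives, with each record starting from a User-agent line. The sketch below (an illustration, not the checker the article refers to) reports lines that break those rules.

```python
def check_robots_txt(text):
    """Return (line_number, message) pairs for lines that do not fit the
    original robots exclusion format: comments, blank lines, and
    'User-agent:' / 'Disallow:' directives, in that record order."""
    errors = []
    seen_agent = False
    for n, raw in enumerate(text.splitlines(), 1):
        line = raw.split("#", 1)[0].strip()  # strip comments and whitespace
        if not line:
            continue
        if ":" not in line:
            errors.append((n, "missing ':' separator"))
            continue
        field, value = (s.strip() for s in line.split(":", 1))
        f = field.lower()
        if f == "user-agent":
            if not value:
                errors.append((n, "empty User-agent value"))
            seen_agent = True
        elif f == "disallow":
            # An empty Disallow value is legal: it permits everything.
            if not seen_agent:
                errors.append((n, "Disallow before any User-agent line"))
        else:
            errors.append((n, "unknown field %r" % field))
    return errors

good = "# keep robots out of /tmp\nUser-agent: *\nDisallow: /tmp\n"
bad = "Disallow: /cgi-bin\nDissallow: /tmp\n"  # misordered and misspelled
print(check_robots_txt(good))
print(check_robots_txt(bad))
```

Misspelled field names like "Dissallow" are exactly the sort of mistake that causes a robots.txt file to be silently ignored by robots, which is why an automated check is worthwhile.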