BotWatch - Monitoring Robots Indexing Your Site

BotWatch is a short Perl script that analyses log files (in either the Common or NCSA Extended log file format) and produces an HTML page reporting on the robots seen.

Downloading

This is a first release, which requires some alteration of the files on a site-by-site basis. Please follow the instructions below.

BotWatch is available as a .tar.gz archive from ftp://ftp.tardis.ed.ac.uk/users/sxw/botwatch.tar.gz

When untarred, the archive produces a directory containing two files. The indexers.lst file contains a list of all the robots that BotWatch recognises. The $config_file variable in BotWatch.pl should be set to the location of this file.

Usage

BotWatch takes a log file on stdin and produces an HTML document on stdout.
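BotWatch itself is a Perl script; purely as a sketch of the same stdin-to-stdout filter shape, here is the pattern in Python (the file names in the comment and the report contents are invented for illustration):

```python
# A sketch of a stdin-to-stdout log filter, not BotWatch's actual code.
# It would be invoked the same way BotWatch is, e.g.
#   python botwatch_sketch.py < access_log > robots.html
import sys

def run(infile, outfile):
    # Read every log line from the input, then emit a minimal HTML report.
    lines = [line.rstrip("\n") for line in infile]
    outfile.write("<html><body>\n")
    outfile.write("<p>%d requests read</p>\n" % len(lines))
    outfile.write("</body></html>\n")

if __name__ == "__main__":
    run(sys.stdin, sys.stdout)
```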

You may see the following errors:

Cannot read in the list of robots
The indexers.lst configuration file could not be found, or was not readable. Ensure that the location of this file is correctly set in BotWatch.pl, as detailed above.

Sorry! Your logfile format is not recognised!
The first line of the logfile was in a format that BotWatch doesn't know about. If this line is uncorrupted then please mail me a copy of the line and I'll integrate support for it into the next version.
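The format check can be pictured roughly as follows. This is a hedged sketch (in Python, with assumed regular expressions; the real script's detection logic may differ): a Common log line ends after the status code and size, while an NCSA Extended (combined) line carries two extra quoted fields for the referer and User Agent.

```python
import re

# Assumed shapes of the two formats; the real script's patterns may differ.
COMMON = re.compile(r'^\S+ \S+ \S+ \[[^\]]+\] "[^"]*" \d{3} (\d+|-)$')
EXTENDED = re.compile(r'^\S+ \S+ \S+ \[[^\]]+\] "[^"]*" \d{3} (\d+|-) "[^"]*" "[^"]*"$')

def detect_format(first_line):
    # Check the longer (Extended) form first, since a Common-format pattern
    # without anchoring could otherwise swallow an Extended line.
    if EXTENDED.match(first_line):
        return "extended"
    if COMMON.match(first_line):
        return "common"
    return None  # unrecognised: this is where BotWatch reports its error
```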

How it works

How BotWatch works depends on the format of the logfile that you supply.

Logfiles containing User Agent information

The User Agent sent with an HTTP request gives the name of the program making the request. Some User Agents are known to belong to robots, and these are stored in the configuration file. If BotWatch sees a request from one of these User Agents then the request is flagged as being from that robot.
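The lookup amounts to matching each request's User Agent against the configured list. A minimal sketch in Python, assuming substring matching and an invented robot entry (the real script reads these from indexers.lst and may match differently):

```python
# Hypothetical subset of indexers.lst: user-agent string -> robot name.
KNOWN_ROBOTS = {
    "ExampleBot/1.0": "Example Search",  # invented entry for illustration
}

def classify_by_user_agent(user_agent, known=KNOWN_ROBOTS):
    # Flag the request as coming from a robot if a configured user-agent
    # string appears in the request's User Agent header.
    for ua, name in known.items():
        if ua in user_agent:
            return name
    return None  # not a known robot
```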

Robots can also be identified by requests for the robots.txt file on a server. Robots that are found by this method are flagged as "Unknown" with their User Agent shown in brackets.
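That robots.txt check can be sketched as below, assuming NCSA-style request lines of the form "METHOD path protocol" (the function and its return format are illustrative, not the script's actual code):

```python
def flag_robots_txt(request_line, user_agent):
    # A request for /robots.txt marks the client as a robot even when its
    # User Agent is not in the configuration file; it is reported as
    # "Unknown" with the User Agent shown in brackets.
    method, _, rest = request_line.partition(" ")
    path = rest.split(" ")[0] if rest else ""
    if path == "/robots.txt":
        return "Unknown (%s)" % user_agent
    return None
```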

Logfiles without User Agent information

If no User Agent information is present then other information must be used. As the majority of search spiders access sites from known IP addresses and/or domains, this information is used to classify them, again using information in the configuration file.
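A sketch of that host-based classification, with invented entries in the documentation IP range and a reserved example domain (the real script reads the robot-hostIP and robot-hostName fields from indexers.lst):

```python
# Hypothetical entries for illustration only.
KNOWN_IPS = {
    "192.0.2.10": "Example Search",           # a robot-hostIP entry
}
KNOWN_DOMAINS = {
    "crawler.example.net": "Example Search",  # a robot-hostName entry
}

def classify_by_host(remote_host, ips=KNOWN_IPS, domains=KNOWN_DOMAINS):
    # Exact match on a known robot IP address.
    if remote_host in ips:
        return ips[remote_host]
    # Match on a known robot hostname, or any host under that domain.
    for suffix, name in domains.items():
        if remote_host == suffix or remote_host.endswith("." + suffix):
            return name
    return None  # not a known robot host
```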

Robots.txt file accesses are flagged in the output as being from "Unknown" robots.

The Configuration File

A list of the robots contained in BotWatch's current configuration file is available. If you wish to add new entries to the file, use the following format.

robot-id: A short string, used internally; should not contain spaces (Required)
robot-name: The name of the service provided by the robot (Required)
robot-cover-url: URL of a page giving details of the robot or the service it provides
robot-hostIP: The IP address(es) from which the robot's accesses come
robot-hostName: The name(s) of the host(s) from which the robot's accesses come
robot-useragent: The User Agent string sent by the robot
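Putting the fields together, a complete entry might look like the following. Every value here is invented purely for illustration; real values must describe an actual robot.

```
robot-id: examplebot
robot-name: Example Search
robot-cover-url: http://www.example.com/bot.html
robot-hostIP: 192.0.2.10
robot-hostName: crawler.example.net
robot-useragent: ExampleBot/1.0
```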

If you add entries for new robots, please mail them to me so that they can be added to the list.