This is a first release, and some of the files must be edited for each site. Please follow the instructions below.
BotWatch is available as a .tar.gz archive from ftp://ftp.tardis.ed.ac.uk/users/sxw/botwatch.tar.gz
When untarred, the archive produces a directory containing two files. The indexers.lst file contains a list of all the robots that BotWatch recognizes. Set the $config_file variable in BotWatch.pl to the location of this file.
BotWatch takes a log file on stdin and produces an HTML document on stdout.
You may see the following errors:
- Cannot read in the list of robots
  - The indexers.lst configuration file could not be found, or was not readable. Ensure that the location of this file is correctly set in the Perl file, as described above.
- Sorry! Your logfile format is not recognised!
  - The first line of the logfile was in a format that BotWatch doesn't know about. If this line is not corrupted, please mail me a copy of it and I'll integrate support for it into the next version.
How BotWatch works depends on the format of the logfile that you supply it with.
Logfiles containing User Agent information
The User Agent sent with any HTTP request gives the name of the program that is making the request. Some User Agents are known to be robots, and these are stored in the configuration file. If BotWatch sees a request from one of these User Agents, the request is flagged as being from that robot.
Robots can also be identified by requests for the robots.txt file on a server. Robots that are found by this method are flagged as "Unknown" with their User Agent shown in brackets.
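The two rules above can be sketched in a few lines of Python. This is only an illustration, not BotWatch's own Perl code: it assumes the logfile is in Combined Log Format, the KNOWN_AGENTS table is a hypothetical stand-in for entries loaded from indexers.lst, and the exact way an "Unknown" robot is rendered is a guess.

```python
import re

# Hypothetical stand-in for the robots loaded from indexers.lst:
# maps a User Agent substring to the robot's name.
KNOWN_AGENTS = {"ExampleBot": "Example Search"}

# Assumption: Combined Log Format, where the User Agent is the last quoted field.
LINE_RE = re.compile(
    r'^(?P<host>\S+) \S+ \S+ \[[^\]]*\] "(?P<method>\S+) (?P<path>\S+)[^"]*" '
    r'\d{3} \S+ "[^"]*" "(?P<agent>[^"]*)"'
)

def classify(line):
    """Return a robot label for one log line, or None for ordinary traffic."""
    m = LINE_RE.match(line)
    if m is None:
        return None  # line is not in the assumed format
    agent = m.group("agent")
    # Rule 1: the User Agent matches a known robot.
    for needle, name in KNOWN_AGENTS.items():
        if needle in agent:
            return name
    # Rule 2: anything else that fetches robots.txt is an unknown robot.
    if m.group("path") == "/robots.txt":
        return f"Unknown ({agent})"
    return None
```

Checking the known User Agents first means a listed robot that fetches robots.txt is still reported under its own name rather than as "Unknown".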
Logfiles without User Agent information
If no User Agent information is present, other information must be used. Because the majority of search spiders access sites from known IP addresses and/or domains, BotWatch uses the requesting host to classify them, again from information in the configuration file.
Accesses to the robots.txt file are flagged in the output as being from "Unknown" robots.
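A rough Python sketch of host-based classification follows. It is an illustration under stated assumptions, not BotWatch's own code: the ROBOTS table is hypothetical, and matching a hostname by exact name or by domain suffix is an assumed rule that the text above does not spell out.

```python
# Hypothetical entries standing in for the host information
# that would be read from the configuration file.
ROBOTS = [
    {
        "name": "Example Search",
        "ips": ["192.0.2.10"],
        "hosts": ["crawler.example.com"],
    },
]

def classify_host(host):
    """Return the robot name for a remote host (IP or hostname), or None."""
    host = host.lower()
    for robot in ROBOTS:
        if host in robot["ips"]:
            return robot["name"]
        for name in robot["hosts"]:
            # Assumed rule: match the host itself or any subdomain of it.
            if host == name or host.endswith("." + name):
                return robot["name"]
    return None

def classify(host, path):
    """Label a request by its host, falling back to the robots.txt rule."""
    name = classify_host(host)
    if name is not None:
        return name
    if path == "/robots.txt":
        return "Unknown"
    return None
```

For instance, classify("spider3.crawler.example.com", "/") would be attributed to the hypothetical "Example Search" entry, while a robots.txt request from an unlisted host is labelled "Unknown".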
A list of the robots contained in BotWatch's current configuration file is available. If you wish to add new entries to the file, use the following format:
robot-id:         A short string, used internally; it should not contain spaces (Required)
robot-name:       The name of the service provided by the robot (Required)
robot-cover-url:  URL of a page providing details of the robot or the service it provides
robot-hostIP:     The IP address from which the robot's accesses come
robot-hostName:   The names of the hosts from which the accesses come
robot-useragent:  The User Agent string sent by the robot
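As an illustration of the fields above, here is a small Python sketch that reads entries and checks the two required fields. It rests on assumptions the description does not confirm: that every field is written as "field: value" on its own line, and that entries are separated by blank lines. The sample entry is invented for the example.

```python
# Invented sample entry; none of these values come from the real indexers.lst.
SAMPLE = """\
robot-id: examplebot
robot-name: Example Search
robot-cover-url: http://www.example.com/bot.html
robot-hostIP: 192.0.2.10
robot-hostName: crawler.example.com
robot-useragent: ExampleBot/1.0
"""

REQUIRED = ("robot-id", "robot-name")

def parse_entries(text):
    """Parse 'field: value' lines into one dict per blank-line-separated entry."""
    entries, current = [], {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line:
            # A blank line closes the current entry (assumed separator).
            if current:
                entries.append(current)
                current = {}
            continue
        # Split at the first colon only, so URLs in the value stay intact.
        field, sep, value = line.partition(":")
        if not sep:
            raise ValueError(f"not a 'field: value' line: {line!r}")
        current[field.strip()] = value.strip()
    if current:
        entries.append(current)
    for entry in entries:
        missing = [f for f in REQUIRED if not entry.get(f)]
        if missing:
            raise ValueError(f"entry is missing required fields: {missing}")
    return entries
```

Calling parse_entries(SAMPLE) yields a single dictionary keyed by field name; an entry without robot-id or robot-name is rejected.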
If you add entries for new robots, please mail them to me so they can be added to the list.