JC's ABC search bot


This directory contains the search bot for JC's Tune Finder. The bot looks for web sites with files in ABC music notation. When it finds one, it extracts assorted interesting musical information from each file and adds it to the Tune Finder's database. Meanwhile, people are connecting and asking about tunes, downloading them, or requesting them in any of the several output formats that the Finder supplies.
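The "interesting musical information" lives mostly in ABC header fields such as T: (title), M: (meter), and K: (key). A minimal sketch of pulling them out of a tune (the tune text is a made-up example; the real bot is Perl and does considerably more):

```shell
# Write a tiny example tune (X:, T:, M:, K: are standard ABC headers).
cat > tune.abc <<'EOF'
X:1
T:The Kesh Jig
M:6/8
K:G
GAG GAB|ABA ABd|
EOF

# Grab the first occurrence of each header field.
title=$(sed -n 's/^T: *//p' tune.abc | head -1)
meter=$(sed -n 's/^M: *//p' tune.abc | head -1)
key=$(sed -n 's/^K: *//p' tune.abc | head -1)
echo "$title ($meter, $key)"
```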

Some of the interesting things here:

abcbot
is the search bot itself. It's a Perl program that uses the URLs file as a list of starting points and places to avoid. abcbot expects a list of hosts on the command line and scans only those sites. It keeps data for each host in a file in the hst/ directory.
ABCsearch
queries several big search sites for "ABC music notation", the best set of keywords we've found for locating ABC sites. About 20% of our sites were found this way. It writes info for each site into a file in the add/ directory.
Hosts2html
rebuilds the index files in ndx/. These are the files used by the Tune Finder. The index files are rebuilt after a search run, and may also be rebuilt at any time if there are problems or if we've done a special scan of one or more sites.
Makefile
is a conventional unix makefile. We use it to drive the search process, which is started by hand once a month or so.
NewURL
takes a URL and does a scan for ABC files with the URL as a starting point. If the scan is successful, the URL should usually be added to URLs.
scandata
is a file showing statistics from the most recent scans, one line per host. This usually includes "false positive" hosts from the big search sites, so we can see how successful ABCsearch was.
UpdateHosts
is a script that drives the search run. We normally call abcbot for a single host; UpdateHosts keeps track of which hosts have been scanned and starts abcbot for each one. It uses both the hst/ and add/ directories to decide which hosts to scan.
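The selection idea amounts to taking the union of host names seen in hst/ and add/ (those directory names are from this README; the one-file-per-host layout and the file names below are an illustration, not the actual script's code):

```shell
# Illustrative layout: one file per host in each directory.
mkdir -p hst add
touch hst/oldhost.example.com add/newhost.example.com

# The to-do list is every host named in either directory, once.
for f in hst/* add/*; do
    [ -e "$f" ] || continue      # skip unmatched globs
    basename "$f"
done | sort -u > hosts.todo
cat hosts.todo
```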
findrobotstxt
is a little program to list the URLs of all robots.txt files in our ABC hosts. It may be slow, because it queries every host, and sometimes they don't respond. So run it in the background.
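Running it in the background keeps a hung host from tying up the terminal. A sketch (the stub below stands in for findrobotstxt purely so the snippet is self-contained; in real use you'd run the actual script):

```shell
# Stub standing in for the real findrobotstxt program.
printf '#!/bin/sh\necho http://example.com/robots.txt\n' > findrobotstxt.stub
chmod +x findrobotstxt.stub

# In real use: nohup ./findrobotstxt > robots.list 2>&1 &
nohup ./findrobotstxt.stub > robots.list 2>&1 &
wait $!
cat robots.list
```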
hst/
contains one file per host, listing all the interesting files and ABC tunes that abcbot found on that host.
lck/
contains lockfiles for hosts, so that we don't get two programs trying to scan the same host.
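One portable way to implement such a lock is mkdir, which is atomic on POSIX filesystems, so two scanners can never both succeed (the lck/ directory is real; the host name and the mkdir-as-lock technique are our assumptions — the actual scripts may create plain lockfiles instead):

```shell
host=somehost.example.com
mkdir -p lck

if mkdir "lck/$host" 2>/dev/null; then
    echo "got the lock; scanning $host"
    # ... run abcbot here ...
    rmdir "lck/$host"            # release the lock when done
else
    echo "$host is already being scanned; skipping"
fi
```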
ndx/
is our copy of the index files used by the Tune Finder. The Tune Finder actually works out of ../ndx/, so the files are linked there after we've rebuilt and verified them here.
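The publish step might look like this (ndx/ and ../ndx/ are from this README; the index file name, the work/ sandbox, and the use of hard links are illustrative):

```shell
# Fake the two directories so the sketch is self-contained.
mkdir -p work/bot/ndx work/ndx
cd work/bot

# Rebuild and verify in our own ndx/ first...
echo "index data" > ndx/tunes.ndx

# ...then link the verified file into the live ../ndx/.
ln -f ndx/tunes.ndx ../ndx/tunes.ndx
cd ../..
```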
sh/
contains assorted small scripts that are useful here.
stat/
contains past statistics files of various types. It should probably be purged occasionally.
webcat
is a program that does downloads. It's a separate program because of an intractable problem: TCP connections to web servers sometimes block permanently in the connect() call. This happens on several OSes, and there seems to be no fix. So abcbot runs webcat as a subprocess; if webcat doesn't respond or exit within the timeout period, we kill it and move on. This wastes a bit of CPU time, but it solves the problem.
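The kill-on-timeout loop can be sketched in plain shell (webcat is the real program; the run_with_timeout helper and the sleep stand-in below are ours):

```shell
run_with_timeout() {
    # usage: run_with_timeout SECONDS command [args...]
    secs=$1; shift
    "$@" &                                   # the possibly-hanging child
    child=$!
    ( sleep "$secs"; kill "$child" 2>/dev/null ) &
    watchdog=$!
    status=0
    wait "$child" || status=$?               # returns when child exits or is killed
    kill "$watchdog" 2>/dev/null || true     # cancel the watchdog if still running
    wait "$watchdog" 2>/dev/null || true
    return "$status"
}

# In the bot this would be something like: run_with_timeout 60 webcat "$url"
if run_with_timeout 1 sleep 30; then
    echo "download finished"
else
    echo "timed out; killed it and moving on"
fi
```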