Over the years I've thrown together various bits of code that have crawling functionality built into them. There was termite, used to find backup copies, renames or common temporary locations of your entire web site. There was indexfinder, used to crawl your site and find anything that looked like a directory listing. There was also htcomment, used to ferret out all of the comments found in your html.
These tools were all fairly well tested and worked quite well, but every time I dusted off the code to fix a bug or add functionality, my CS books would scowl at me. The core crawling code was literally cut and pasted between tools. The problem with this is obvious -- a bug or missing bit of functionality in the core crawler code had to be fixed in numerous places. Design at its worst.
Starting maybe a month ago I decided to fix this problem. The result is Hawler, a Ruby gem that encapsulates all of what I deem to be core web crawling functionality into an easy to use package. The result is that I can now focus more on writing the code that is unique to each particular task and not have to worry as much about the crawler bits. Its usage is quite simple, as described in the README.
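To give a concrete flavor of what a crawler core has to do on every page, here is a minimal, self-contained sketch of the link-harvesting step -- the sort of logic that used to be copy-pasted between tools. This is purely illustrative: Hawler's real API is described in its README, and a real crawler also handles fetching, depth limits, robots.txt, and politeness delays.

```ruby
require 'uri'

# Extract absolute link targets from an HTML string, resolving relative
# hrefs against the page's base URL. Fragment-only links ("#top") are
# skipped; this crude regex approach is sketch-quality, not production.
def harvest_links(html, base)
  html.scan(/href\s*=\s*["']([^"'#]+)["']/i).flatten.map do |href|
    URI.join(base, href).to_s rescue nil  # drop unparseable hrefs
  end.compact.uniq
end
```

A crawler then feeds each harvested URL back into its fetch queue until the recursion depth runs out.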
As an example of Hawler's usage, I've put together two tools that I've found quite useful so far. First is htgrep. It is exactly what it sounds like: grep for the web. How many times does the word "shot" occur within one hop of www.latimes.com? Let's find out, sleeping one second between each request (got to play nice) and using HEAD (-p) to only harvest links from pages that are likely to have them in the first place:
$ htgrep shot www.latimes.com -r 1 -s 1 -p | wc -l
43
Only 43? A peaceful day in LA! What about the distribution of HTTP error codes on spoofed.org? Use htcodemap:
$ htcodemap spoofed.org -r
Done -- codemap is spoofed.org-codemap.png
The result? Not too shabby:
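The tallying step behind a tool like htcodemap is simple enough to sketch: collect the status code of every response seen during the crawl, then count occurrences per code. (The names below are illustrative, not htcodemap's actual internals; the real tool also renders the tally as a chart.)

```ruby
# Build a histogram of HTTP status codes observed during a crawl.
# Hash.new(0) gives every unseen code an initial count of zero.
def code_distribution(codes)
  codes.each_with_object(Hash.new(0)) { |code, tally| tally[code] += 1 }
end
```

For example, a crawl that saw two 200s, a 301, and a 404 yields {200 => 2, 301 => 1, 404 => 1}, which is exactly the data a charting backend needs.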
What about drawing ridiculous maps of relationships within a website? Well, assuming you have enough RAM (blame graphviz/dot, not me!), enjoy htmap. As an example, here is a fairly deep crawl and mapping of spoofed.org:
$ htmap spoofed.org -r 2
Done, map is spoofed.org-map.png, spoofed.org-map.dot
The image is here.
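Since htmap emits a .dot file alongside the image, the interesting part is turning crawled link pairs into graphviz's DOT language. Here is a hedged sketch of that translation, assuming the crawler hands us (source, target) pairs; htmap's actual output format and node styling may well differ.

```ruby
# Render a list of [source, target] link pairs as a graphviz DOT digraph.
# Feeding the result to `dot -Tpng` would produce a site map image.
def to_dot(edges)
  lines = edges.map { |src, dst| %(  "#{src}" -> "#{dst}";) }
  "digraph sitemap {\n#{lines.join("\n")}\n}"
end
```

Deep crawls produce a lot of edges, which is why dot's layout pass (not the crawl itself) is what eats all the RAM.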
I expect that a lot of cool tools will be born from Hawler, and I'll be sure to link and post them as they turn up. Until then, enjoy!
Comments and suggestions are very much welcome.