Sunday, January 20, 2008

Hawler, the Ruby crawler

Over the years I've thrown together various bits of code that have crawling functionality built into them. There was termite, used to find backup copies, renames, or common temporary locations of files across your entire web site. There was indexfinder, used to crawl your site and find anything that looked like a directory listing. There was also htcomment, used to ferret out all of the comments found in your HTML.

These tools were all fairly well tested and worked quite well, but every time I dusted off the code to fix a bug or add functionality, my CS books would scowl at me. The core crawling code was literally cut and pasted between tools. The problem with this is obvious -- a bug or missing bit of functionality in the core crawler code had to be fixed in numerous places. Design at its worst.

Starting maybe a month ago I decided to fix this problem. The result is Hawler, a Ruby gem that encapsulates everything I consider core web crawling functionality in an easy-to-use package. Now I can focus on writing the code that is unique to each particular task and worry less about the crawler bits. Its usage is quite simple, as described in the README.
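
To give a flavor of what gets factored out, here is a rough sketch of the kind of crawl loop such a gem wraps up. The crawl method, its arguments, and the callback style below are hypothetical illustrations, not Hawler's actual API; see the README for the real interface.

require 'net/http'
require 'uri'

# Minimal breadth-first crawl sketch: fetch a page, hand it to a block,
# pull out hrefs, and follow them up to max_hops away from the start URL.
def crawl(start, max_hops = 1, delay = 0)
  seen  = {}
  queue = [[URI.parse(start), 0]]
  until queue.empty?
    uri, hops = queue.shift
    next if seen[uri.to_s] || hops > max_hops
    seen[uri.to_s] = true
    body = Net::HTTP.get(uri) rescue next
    yield uri, body
    # naive href extraction; a real crawler parses the HTML properly
    body.scan(/href=["']([^"']+)["']/i).flatten.each do |link|
      child = URI.join(uri.to_s, link) rescue next
      queue << [child, hops + 1] if child.is_a?(URI::HTTP)
    end
    sleep delay
  end
end

# Print every page fetched within one hop of the start URL, pausing 1 second
crawl('http://www.example.com/', 1, 1) do |uri, body|
  puts "#{uri} (#{body.length} bytes)"
end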

As an example of Hawler's usage, I've put together a few tools that I've found quite useful so far. First is htgrep. It is exactly what it sounds like: grep for the web. How many times does the word "shot" occur within 1 hop of www.latimes.com? Let's find out, but sleep 1 second between each request (got to play nice) and use HEAD (-p) to only harvest links from pages that are likely to have them in the first place:

$ htgrep shot www.latimes.com -r 1 -s 1 -p | wc -l
43

Only 43? A peaceful day in LA!
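
Under the hood, the grep step is just a small callback layered on top of a crawl loop. The snippet below reuses the hypothetical crawl sketch from above and is only an illustration, not htgrep's actual source:

# Print each line containing the pattern, prefixed with the page it came from
pattern = /shot/i
crawl('http://www.latimes.com/', 1, 1) do |uri, body|
  body.each_line do |line|
    puts "#{uri}: #{line.strip}" if line =~ pattern
  end
end

What about the distribution of HTTP error codes on spoofed.org? Use htcodemap: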

$ htcodemap spoofed.org -r
Done -- codemap is spoofed.org-codemap.png

The result? Not too shabby:
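
The tallying behind a tool like this is straightforward: request each URL the crawl turns up, count the response codes, and hand the counts to something that can draw a chart. A hypothetical sketch of the counting step (chart rendering omitted; this is not htcodemap's actual source):

require 'net/http'
require 'uri'

# Count HTTP status codes across a set of URLs gathered during a crawl
def code_counts(urls)
  counts = Hash.new(0)
  urls.each do |u|
    resp = Net::HTTP.get_response(URI.parse(u)) rescue next
    counts[resp.code] += 1
  end
  counts
end

p code_counts(['http://spoofed.org/', 'http://spoofed.org/no-such-page'])
# e.g. {"200"=>1, "404"=>1}, depending on what the server actually returns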

What about drawing ridiculous maps of relationships within a website? Well, assuming you have enough RAM (blame graphviz/dot, not me!), enjoy htmap. As an example, here is a fairly deep crawl and mapping of spoofed.org:

$ htmap spoofed.org -r 2
Done, map is spoofed.org-map.png, spoofed.org-map.dot

The image is here.
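
The mapping itself boils down to recording parent-to-child link edges during the crawl and writing them out as a Graphviz dot file for dot to render. A hypothetical sketch of that output step follows; the edges are made-up examples, and this is not htmap's actual source:

# Emit crawled link relationships as a Graphviz digraph
edges = [
  ['spoofed.org/', 'spoofed.org/projects.html'],  # made-up example edges; in
  ['spoofed.org/', 'spoofed.org/about.html']      # practice these come from the crawl
]

File.open('site-map.dot', 'w') do |f|
  f.puts 'digraph sitemap {'
  edges.each { |from, to| f.puts "  \"#{from}\" -> \"#{to}\";" }
  f.puts '}'
end
# Render with Graphviz: dot -Tpng site-map.dot -o site-map.png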

I expect that a lot of cool tools will be born from Hawler, and I'll be sure to link and post them as they turn up. Until then, enjoy!

Comments and suggestions are very much welcome.

5 comments:

Anonymous said...

Help! I can't install this gem due to:
http://spoofed.org/files/hawler/gems/ does not appear to be a repository

Anonymous said...

I have tried to install hawler by calling:
sudo gem install --source http://spoofed.org/files/hawler/ hawler

But I got this error:

WARNING: RubyGems 1.2+ index not found for:
http://spoofed.org/files/hawler/

RubyGems will revert to legacy indexes degrading performance.
ERROR: could not find gem hawler locally or in a repository

Any hints?

Francesco Levorato said...

Hi, and thanks for this contribution to the community! I wonder if you could take the time to update your script to work with the new Gem system. I'm getting this error with the latest RubyGems (1.3.1):
$ sudo gem install --source http://spoofed.org/files/hawler/ hawler
[sudo] password for flevour:
WARNING: RubyGems 1.2+ index not found for:
http://spoofed.org/files/hawler/

Jon Hart said...

All of the breakage with older/newer versions of gem should be resolved. I had been using an older method of gem generation. A quick update to 1.3.x, which includes backwards compatibility, appears to have fixed things.

Please let me know if you find otherwise.

-jon

Francesco Levorato said...

Works like a charm, thank you!