Saturday, January 7, 2012

Improving Nokogiri XPath operations through predictive hashing

A significant portion of my day job involves writing large amounts of code to massage, manipulate and transform various data sources from third parties into things that ultimately result in my "bread and butter", so to speak. But, it's the weekend. What on earth am I doing writing about this now? Well, honestly, it's because I lose sleep over things like ugly, unreliable or poorly performing code. In some situations, my code has one or more of those three attributes for reasons that aren't really under my control -- vendors. Other times, honestly, it's because I am a perfectionist and my code is never good enough.

One bit of code that you could call "old faithful" is code that I use to parse XML files from a particular vendor. The files group data by year, cover 2002 until now, 2012, and are from 2MB to 22MB in size. My weapon of choice in this battle is Ruby 1.9.3 and Nokogiri. The majority of the operations performed against this data are searches for children of the root node that have a particular attribute. So, assume we had a sample document like this:

  <store>
    <food id="ID-2001-0001"/>
    <food id="ID-2001-1234"/>
  </store>

The bulk of the searches would be things like looking for the food entry with ID ID-2011-0123, or, in XPath, '//store/food[@id="ID-2011-0123"]'. In reality, the documents are considerably larger -- ~6000 entries in each document, each entry from a few hundred bytes to several kilobytes in size. And a typical batch run results in an average of 500 lookups across basically all years.

Memory and CPU are not really a concern, but time is. While the documents are numerous, large and complex, it was never acceptable for the batch searches to take minutes or more. For a while, I was using the same "seat of the pants" tuning that shade-tree mechanics use -- it sure feels faster! But, is it really faster?

From its inception, it was clear that opening the correct document, using XPath to locate the entry and then repeating this process for every lookup was not going to work, because the code would quickly get OOM-killed. So, I investigated some different strategies for optimizing these lookups:

  1. Caching the resulting Nokogiri documents, using a hash for each year after they were first opened. Subsequent lookups for that year would not have to incur reprocessing. XPath was still used within individual documents to locate entries.
  2. As a new year was encountered, opening the corresponding document and caching all of the entries for that year using the entry ID as a key in a hash, and the entry element itself as the value.
  3. Similar to the second approach, however *ALL* possible entries were cached. First, all possible entries were hashed using their predicted ID (ID-<year>-<4 digit sequence number>) as the key and nil as their value. Then, all valid IDs were parsed from the document we obtained and hashed using the approach in #2. The result is that all possible XML entries had a place in the hash.
  4. Similar to the first approach, but instead of using XPath, walking the document and selecting nodes based on simple comparison operations. This was a technique I found on Jud's blog.
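A rough sketch of approach #3, the "predictive hashing" of the title. The 6000-entry ceiling, helper name and cache layout here are illustrative assumptions, not the production code:

```ruby
# Upper bound on the 4-digit sequence number (an assumption for this sketch)
MAX_SEQ = 6000

def build_predictive_cache(year, real_entries)
  cache = {}
  # Pre-seed every *possible* ID for the year with nil...
  (1..MAX_SEQ).each do |n|
    cache[format("ID-%d-%04d", year, n)] = nil
  end
  # ...then overwrite the IDs that actually exist in the parsed document.
  real_entries.each { |id, element| cache[id] = element }
  cache
end

cache = build_predictive_cache(2011, "ID-2011-0123" => "<food/>")
cache.key?("ID-2011-5999")   # true: a known miss, answered without XPath
cache["ID-2011-0123"]        # the cached element
```

The win for random lookups is that a nonexistent ID still hits the hash: the key exists with a nil value, so a miss is answered in constant time instead of falling back to an XPath scan of the whole document.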

Now, to answer the burning question of which is fastest, I utilized Ruby's Benchmark. Because the searches are often for IDs that don't exist, for each approach I sampled, 10 times each, the results of searching for 100 random (mostly nonexistent) entries and for 100 known-valid entries. The results were impressive, hopefully correct, and a little curious:

Approach                             User time   System time  Total time  Real time
Random, Caching by year              6.380000    0.040000     6.420000    6.457234
Good, Caching by year                42.310000   0.050000     42.360000   42.502168
Random, Caching only valid entries   74.160000   0.290000     74.450000   74.750306
Good, Caching only valid entries     5.190000    0.050000     5.240000    5.272732
Random, Caching all entries          9.070000    0.080000     9.150000    9.194298
Good, Caching all entries            5.800000    0.030000     5.830000    5.865078
Random, using a walker               14.990000   0.090000     15.080000   15.143777
Good, using a walker                 55.750000   0.060000     55.810000   55.998989
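For reference, the measurement harness was along these lines -- Benchmark is in the standard library, the labels match the table, but the lookup lambdas here are stand-ins for the real cache/XPath searches:

```ruby
require 'benchmark'

# Placeholder lookups; the real ones hit the caches described above
approaches = {
  "Random, Caching all entries" => lambda { |id| nil },
  "Good, Caching all entries"   => lambda { |id| nil },
}

Benchmark.bm(36) do |bm|
  approaches.each do |label, lookup|
    bm.report(label) do
      # 10 samples of 100 lookups each, mirroring the methodology above
      10.times { 100.times { |i| lookup.call(format("ID-2011-%04d", i)) } }
    end
  end
end
```

Benchmark reports user, system, total and real time per labeled block, which is where the four columns in the table come from.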

Tuesday, January 3, 2012

Various Racket Updates

No, I have not fallen off the face of the earth. Let's just say I've been preoccupied.
There have been several updates on the Racket front. In no particular order:
  1. Racket has been yanked from Metasploit and replaced by Todd's Packetfu. As cool as it was to have Racket powering so many interesting bits of metasploit for as long as it did, in the end I don't get paid to maintain Racket and I simply couldn't keep pace with what the Metasploit team needed with my full time job situation the way it is.
  2. Hosting of the Racket gem has moved to, so now you can just 'gem install racket'. Source and the SVN repository are still in the original places if you are so inclined.
  3. bit-struct was added as a dependency for Racket. At one point in time, bit-struct was not available as a gem, so I couldn't require it in the gemspec. Now it is, so now I can.
  4. pcaprub is not currently a required dependency for Racket, though I'm still at a bit of a loss as to how you'd get any value out of Racket without pcaprub. With that said, I have had a request to make it not required, and my decision, at least for Racket 1.0.11, was made easier by gem breakage in pcaprub 0.11 and 0.11.1 that is still getting resolved.

Saturday, November 28, 2009

Racket 1.0.6 Released

Over the Thanksgiving holiday and thanks to the fact that I've been trapped indoors for two weeks, I've made some major improvements to Racket, released in version 1.0.6. For those not in the know, Racket is a Ruby Gem used for reading, writing and handling raw packets in an intuitive manner. Between 1.0.2 and 1.0.6, there have been countless changes, including but not limited to:
  • Full ICMPv6 support
  • Much improved IPv6 support, thanks largely to Daniele Bellucci
  • Revamped, more efficient ICMP support (basically copied all the cool things from ICMPv6)
  • All encoding/decoding classes moved under their respective layer in Racket::L3, etc.
  • Large documentation, test and example improvements
So, as usual, `gem install --source racket` to install it and then take a stroll through the docs and examples. Enjoy!

Saturday, November 14, 2009

Racket version 1.0.2 released, now in Metasploit

Many months back I got word that Metasploit would be including Racket to handle much of its reading and writing of raw packets. Racket was selected for its speed and ease of use and I'm glad to see my work pay off. To celebrate this, I'm releasing 1.0.2, which includes:
  • VRRP
  • SCTP
  • EGP
  • General cleanup so as to not trash namespaces
  • Various bug fixes
  • Numerous documentation and examples cleaned up
Give Racket a whirl, I assure you you'll find it useful. I openly encourage testing, bug reports, suggestions or solicitations for additional functionality.

Monday, June 22, 2009

Ubuntu Server on a Soekris

I've been running the remnants of the non-hosted portions of on an older small form factor computer in a closet for almost two years now. In addition to being a Debian install from ~2006, the box was generally quite a waste and all it really did was make heat, suck power and buzz its fans.

So, this weekend I took a few hours and installed Ubuntu 9.04 (Jaunty Jackalope) Server on one of my trusty Soekris 4801s.

There is plenty of documentation out there that either describes a similar process using significantly older versions of Ubuntu, or involves unnecessarily complicated methods of achieving the same end. Following on an entry I did a few years ago on installing OpenBSD on a Soekris, I once again took the route of using qemu to aid in installation.

OK, let's cut to the chase.

  1. Download the Ubuntu Server ISO
  2. Remove the CF or 2.5" disk from your Soekris and plug it in to the system you'll be doing the install on. Take note of what device it gets assigned -- my 2.5" laptop drive got /dev/sdd.
  3. Fire up qemu. Change your memory (512), hard disk, and cdrom options as necessary. Note that the -no-acpi option is necessary to get the installer to start:
    sudo /usr/bin/qemu  -m 512 -boot d -hda '/dev/sdd' -cdrom  '~warchild/ubuntu-9.04-server-i386.iso' -net nic,vlan=0 -net user,vlan=0 -localtime -no-acpi
  4. Install as you normally would.
  5. After the install has finished, halt qemu and restart, booting directly off your new Ubuntu installation instead of the ISO:
    sudo /usr/bin/qemu  -m 512 -hda '/dev/sdd' -net nic,vlan=0 -net user,vlan=0 -localtime -no-acpi
  6. Optional: if your Soekris does not support PAE -- the Geode processors used in the 48xx and 45xx certainly do not -- you'll need to install a kernel that does not require PAE. The kernel that ships with Jaunty Server -- 2.6.28-11-server -- requires PAE. You can either recompile and remove that requirement, or take the easy/easier route and just install the generic kernel:
    sudo apt-get install linux-image-generic
  7. Reconfigure your system to spawn a login shell on the serial port. Put the following in /etc/event.d/ttyS0:
    start on runlevel 2
    start on runlevel 3 
    start on runlevel 4 
    start on runlevel 5 
    stop on runlevel 0
    stop on runlevel 1 
    stop on runlevel 6
    exec /sbin/getty 115200 ttyS0
  8. Somewhere towards the top of /boot/grub/menu.lst, ensure that the following two lines are present. The first just configures the serial port (change speed if necessary), and the second configures terminal I/O to be on that serial port:
    serial --unit=0 --speed=115200 --word=8 --parity=no --stop=1
    terminal --timeout=5 serial
    Next, find the commented out section of /boot/grub/menu.lst that defines defoptions. Leave it commented out, but append the console directive to tie all this serial goodness together:
    # defoptions=splash console=ttyS0,115200
    Now, run update-grub to regenerate menu.lst :
    $  sudo update-grub        
    Searching for GRUB installation directory ... found: /boot/grub
    Searching for default file ... found: /boot/grub/default
    Testing for an existing GRUB menu.lst file ... found: /boot/grub/menu.lst
    Searching for splash image ... none found, skipping ...
    Found kernel: /vmlinuz-2.6.28-13-generic
    Found kernel: /vmlinuz-2.6.28-11-server
    Found kernel: /memtest86+.bin
    Updating /boot/grub/menu.lst ... done
    $  sudo grep console /boot/grub/menu.lst
    # defoptions=splash console=ttyS0,115200
    # xenkopt=console=tty0
    kernel  /vmlinuz-2.6.28-13-generic root=UUID=f48b39a6-020d-46e6-b25d-9210472ba1fd ro splash console=ttyS0,115200 
    kernel  /vmlinuz-2.6.28-11-server root=UUID=f48b39a6-020d-46e6-b25d-9210472ba1fd ro splash console=ttyS0,115200 
  9. As a last step before you boot your Soekris, it probably wouldn't hurt to update:
    sudo apt-get update && sudo apt-get -u upgrade
  10. Halt your Ubuntu host running in qemu, remove the disk and install it in your Soekris
  11. Now, configure your Soekris so that it'll jive with the serial settings you just configured in Ubuntu. Unless you have already changed it, your Soekris will (likely) come from the factory with its serial port configured at 9600n81. Configure your favorite serial communication program (minicom) to 9600n81, connect your null-modem serial cable to your Soekris and host system, and then power on the Soekris. Press ctrl-p to get to the Soekris prompt. Set ConSpeed to 115200 (or whatever you configured your kernel for above):
    set ConSpeed 115200
    Now your Soekris will be speaking at 115200, so reconfigure your serial communication program as necessary.
  12. Ensure that the boot order is correct (show BootOrder). If it does not begin with 80 81, 81 80 or something similar, use set BootOrder to remedy that. Remember, 80 is the CF, 81 is the first IDE device if present.
  13. Type 'boot'
  14. Enjoy.

Monday, January 26, 2009

Name-based Virtual Hosting and Web Application Security

Over the past several weeks I've had the privilege of beginning evaluations of the best that the commercial security sector has to offer in the realm of web application security auditing. Setting aside arguments about whether simply buying a web application firewall or pursuing this initiative from a code auditing point of view would be better, the offerings from the top names in this space run the gamut from extremely impressive to downright depressing.

For purposes of this post, I am just focusing on the tools out there that take the traditional blackbox approach to auditing. Given an IP address or a URL and some other minimal hand-holding, nearly all of the big names in this arena will do a fairly good job of identifying the bulk of the "low hanging fruit."

Being the studious and meticulous geek that I am, one of my first requirements when evaluating these products was that they properly support the auditing of web applications that live on a host utilizing name-based virtual hosting. Simple, right? RFC 2616 makes this sort of thing very clear, and one would expect that not only would tools currently available be able to employ this knowledge to further leverage their way into a web application, but that when asked about these features a given vendor would understand how they are useful. Needless to say, the HTTP/1.1 "Host" header plays a pretty important role on the majority of modern websites.

Well, I hate to break it to you, folks, but this is sadly not the case. Of all the vendors on the market, not a single one fully supports this functionality. Yes, you read that right -- no vendor in the blackbox web application security auditing space truly supports HTTP/1.1's "Host" header.

Pausing here for a second, I can sense the heated emails and comments aimed in my general direction. Please, hear me out.

So, just a hint of background. Many products utilize readily available HTTP client libraries written in their language of choice; rightfully so -- why reinvent the wheel! At least one even manipulates IE or Firefox directly to do their bidding, which I find particularly interesting.

And this is where the problem starts to manifest itself. When a human being makes a web request using a browser, they typically enter the address either using an IP address or a host name of some manner. Under the hood, as many of us know, the host name (if present) is resolved and the IP address that results is used as the destination address of the network socket. Once this connection is established using a modern browser, the host name (and port!) from the original URL is sent along as part of the request in the "Host" header of an HTTP/1.1 request. Simple, right? This allows a number of 1999-esque features such as being able to host more than one website on a single IP address or do fun SEO things like determining what URL a visitor originally used to arrive on your property.

What's the big problem, right? Of course, all vendors that I've dealt with properly support sending HTTP/1.1 requests with a "Host" header, but they don't exploit all of its beauties. Imagine the following two situations.

One. Imagine two hosts: (PRODUCTION) and (TEST). Both PRODUCTION and TEST webservers contain name-based virtual hosts for and DNS for and point to How do you audit TEST accurately and completely? The obvious and hacky answer that most vendors gave was to diddle DNS before each audit so that the host performing the audit can correctly resolve the target host names to the IP addresses of the particular environment (PRODUCTION or TEST) that you are attempting to audit. If that was your answer, please hold while I drop the >$200k some of these tools cost for a single user into a bag and thwap some sense into you.

Two. Same two hosts as above, but toss a couple dozen other name-based virtual hosts, years of HTTP configurations gone wrong and a little luck into the mix, and you just might hit the proverbial pay-dirt if you grok the Host header. All manner of fun things can be discovered if you play around with it, such as:

  • Other virtual hosts that you might not have known about. Hint: send a 1.1 request with an invalid Host header.
  • Previously "protected" areas of the website now exposed by accessing it using a different Host header
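The core trick behind probes like these boils down to pinning the TCP connection to one IP while varying the Host header. A rough net/http sketch, where the target IP and candidate host names are placeholders (this is an illustration of the idea, not vhinfo itself):

```ruby
require 'net/http'

# Connect to a fixed IP, but vary the HTTP/1.1 Host header and compare
# the responses (status codes, redirects) across candidate names.
def probe_vhost(ip, host, port = 80)
  req = Net::HTTP::Get.new("/")
  req["Host"] = host  # override the Host header Net::HTTP would set
  res = Net::HTTP.start(ip, port) { |http| http.request(req) }
  [res.code, res.message, res["Location"]]
end

# Needs a live server, so shown commented out; IP and names are made up:
# %w[www.example.com localhost bogus.example.com].each do |h|
#   code, msg, loc = probe_vhost("192.0.2.10", h)
#   puts "Host: #{h} on 192.0.2.10 -> #{code} #{msg} #{loc}"
# end
```

Differences in responses across candidate names (a redirect here, a 500 there, a different page body) are what betray the other virtual hosts living on the same address.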

Much like many of my other ideas, which come while sleeping or in other prominent thinking spots, I've decided to whack together some code to explore these things more. The resulting abomination is vhinfo, which, given a URL and/or an IP address, will play around with various HTTP/1.1 Host headers and see what other VHOSTs it can find based on information that the server provides or is otherwise freely available. Like I said, it's gross, partially unfinished and needs cleanup, but the results are fun:

Looks like Facebook really likes to strip the first portion of the hostname out and replace it with a 'www'.

$  vhinfo
HTTP/1.1 Host: on -> 200: OK
HTTP/1.0  on -> HTTP/1.1 200 OK ()
HTTP/1.0 on -> HTTP/1.1 302 Found (
HTTP/1.1 Host: on -> 200: OK
HTTP/1.0  on -> HTTP/1.1 200 OK ()
HTTP/1.0 on -> HTTP/1.1 302 Found (http://www.63.176.140/common/browser.php)
HTTP/1.1 Host: localhost on -> 200: OK
HTTP/1.0  on -> HTTP/1.1 200 OK ()
HTTP/1.0 localhost on -> HTTP/1.1 302 Found (http://www.ocalhost/common/browser.php)
HTTP/1.1 Host: on -> 200: OK
HTTP/1.0  on -> HTTP/1.1 200 OK ()
HTTP/1.0 on -> HTTP/1.1 302 Found (http://www.0.0.1/common/browser.php)
HTTP/1.1 Host: on -> 200: OK
HTTP/1.0  on -> HTTP/1.1 200 OK ()
HTTP/1.0 on -> HTTP/1.1 302 Found (

Sun likes to throw 500's on certain virtual hosts (these should probably actually be 404's):

$  vhinfo                   
HTTP/1.1 Host: on -> 200: OK
HTTP/1.0  on -> HTTP/1.1 200 OK 
HTTP/1.0 on -> HTTP/1.1 200 OK 
HTTP/1.1 Host: on -> 200: OK
HTTP/1.0  on -> HTTP/1.1 200 OK 
HTTP/1.0 on -> HTTP/1.1 200 OK 
HTTP/1.1 Host: localhost on -> 500: Server Error
HTTP/1.0  on -> HTTP/1.1 200 OK 
HTTP/1.0 localhost on -> HTTP/1.1 500 Server Error 
HTTP/1.1 Host: on -> 500: Server Error
HTTP/1.0  on -> HTTP/1.1 200 OK 
HTTP/1.0 on -> HTTP/1.1 500 Server Error 

And Yahoo, not to be outdone, throws standards to the wind and just returns raw HTML text sans HTTP response headers (and breaks my code :().

$  vhinfo
HTTP/1.1 Host: on -> 301: Moved Permanently (
HTTP/1.1 Host: on -> 301: Moved Permanently (
Connection to failed! -- wrong status line: "<!doctype html public \"-//W3C//DTD HTML 4.01//EN\" \"\">"
/home/warchild/bin/vhinfo:128:in `check': undefined method `[]' for nil:NilClass (NoMethodError)

In summary, the HTTP/1.1 "Host" header is an important part of accurately and completely performing web application security audits, and it is definitely worth considering in any vulnerability assessment.

Friday, January 2, 2009

Hawler, the Ruby crawler, 0.3 released

I received an email yesterday from ET LoWNOISE, a Metasploit contributor, regarding adding proxy support to Hawler. Apparently the hope is to be able to utilize Hawler for the crawling duties within WMAP, the new web application scanning framework in Metasploit.

Since it has been several months since I've had to do anything to Hawler, I figured this was a good time to go in and do some much needed cleanup and improvements. Chief among the changes are:

  • Proxy support ("-P [IP:PORT]")
  • Documentation cleanup
  • Support crawling frame and form tags
  • Add a useful default banner to calling scripts if none provided
  • Print out defaults when help is called

Thanks to ET for his proxy contributions.

As usual, the following will get you up and running with Hawler:

gem install --source hawler

Using Hawler? Comments? Complaints? Suggestions? Drop me a line - I'd like to hear it.