Sunday, March 4, 2007

Saving bandwidth by ... removing comments?

I fixed up a number of shortcomings in htcomment the other day. As part of the development of this tool, I regularly run it against sites I frequent. I started to notice what appeared to be an inordinate number of comments on some sites, which got me thinking -- on average, what percentage of a site's web content is comments?

A quick analysis gave some interesting results:

$ for site in; do
    full=`lynx -source http://$site | wc -c`
    comments=`./htcomment -q http://$site | wc -c`
    echo "$site is"`echo "scale=4; ($comments/$full)*100" | bc`"% comments"
  done
is 16.9600 % comments
is 5.5300 % comments
is 1.6200 % comments
is 15.2600 % comments
is 2.6600 % comments
is 5.8500 % comments
is 4.5900 % comments
is .7300 % comments
is 3.7300 % comments
is 0 % comments
is 0 % comments
is 0 % comments
is 2.7500 % comments
is 17.0700 % comments
is .2600 % comments
is 0 % comments
is .3400 % comments

Unfortunately these numbers are not 100% accurate -- htcomment can't differentiate between "kjdflakjfdaf " and just "", so the numbers for sites that do have comments may be somewhat skewed, but they are a good first-order approximation. It is no coincidence, in my opinion, that Google, W3C and Craigslist have 0 comments on their front pages. For sites whose front page alone is more than 5% comments, you can't help but wonder how their site's behavior or their bandwidth expenses would change if those comments were filtered out at the edge, or never put there in the first place.
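For a rough cross-check that doesn't depend on htcomment, the same ratio can be approximated on a saved copy of a page with plain awk. This is only a sketch: the function name comment_pct is made up for illustration, it is not how htcomment works internally, and it naively treats every <!-- ... --> span as a comment, including ones inside scripts or conditional-comment blocks.

```shell
#!/bin/sh
# comment_pct FILE  (hypothetical helper, not part of htcomment)
# Prints the percentage of FILE's bytes that fall inside HTML comments,
# counting the <!-- and --> delimiters themselves.
comment_pct() {
    # LC_ALL=C so that length() counts bytes rather than characters.
    LC_ALL=C awk '
        { doc = doc $0 "\n" }               # slurp the file; awk appends a newline
                                            # even if the last line lacks one
        END {
            total = length(doc)
            if (total == 0) { print "0.00"; exit }
            rest = doc
            while ((start = index(rest, "<!--")) > 0) {
                tail = substr(rest, start + 4)      # text after the opening <!--
                stop = index(tail, "-->")
                if (stop == 0) {                    # unterminated comment:
                    inside += 4 + length(tail)      # count through end of file
                    break
                }
                inside += 4 + (stop - 1) + 3        # opener + body + closer
                rest = substr(tail, stop + 3)       # continue after the -->
            }
            printf "%.2f\n", inside * 100 / total
        }
    ' "$1"
}
```

To try it, save a page with lynx -source http://$site > page.html and run comment_pct page.html. Expect the figure to differ somewhat from the htcomment-based numbers above, since the two count different things.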
