Watching digital paint dry

I have just watched the digital equivalent of paint drying: half an hour's worth of tail -f access.log. I did this to understand the behavior of the referer spammers, and it turned out to be quite useful, not so much for locating patterns in referer-spammer behavior as for other reasons.


h3. msnbot

I simply don’t understand this bot:
* It visits every document twice
* On the second visit, it will still request the original URL, even after my web server has told it, by sending 301 Moved Permanently, that the document has a new permanent address
* It doesn’t care if a document is 404 Not Found, or even 410 Gone. It will still visit it twice.
* It visits in bursts, and will fetch robots.txt at the start of each burst.
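
The observations above can be checked against the log itself. What follows is only a rough sketch: it assumes the server writes Apache's "combined" log format and that the bot's user-agent string contains "msnbot". The names @LOG_LINE@, @bot_hits@ and @bot_token@ are made up for this illustration.

```python
import re
from collections import Counter

# Assumption: the access log uses Apache's "combined" format:
# ip ident user [time] "METHOD path PROTO" status bytes "referer" "agent"
LOG_LINE = re.compile(
    r'(?P<ip>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<path>\S+)[^"]*" (?P<status>\d{3}) \S+ '
    r'"(?P<referer>[^"]*)" "(?P<agent>[^"]*)"'
)


def bot_hits(lines, bot_token="msnbot"):
    """Count requests per (path, status) made by agents containing bot_token."""
    hits = Counter()
    for line in lines:
        m = LOG_LINE.match(line)
        # Lines that do not parse, or come from other agents, are skipped.
        if m and bot_token in m.group("agent").lower():
            hits[(m.group("path"), m.group("status"))] += 1
    return hits
```

Calling @bot_hits(open("access.log"))@ and looking at the entries with a count of two or more, in particular those with status 301, 404 or 410, should show whether the double visits are as systematic as they looked in the scrolling log.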

h3. Internet Explorer

Of all the requests from browsers claiming to be Internet Explorer, only about half actually go on to fetch additional content, such as stylesheets and the images those stylesheets reference. I guess there is a huge number of malicious bots out there, lying about their identity in an effort to avoid being detected. God (and the spammers) only knows what their real purpose is.
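
I arrived at "about half" by eyeballing the log, but the same estimate could be computed. The sketch below is one possible approach, not the method used for the number above. It assumes Apache's "combined" log format, treats each (IP address, user-agent) pair as one visitor, and counts a visitor as a real browser if it requested at least one path ending in .css. The names @LOG_LINE@ and @msie_css_share@ are hypothetical.

```python
import re

# Assumption: the access log uses Apache's "combined" format.
LOG_LINE = re.compile(
    r'(?P<ip>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<path>\S+)[^"]*" (?P<status>\d{3}) \S+ '
    r'"(?P<referer>[^"]*)" "(?P<agent>[^"]*)"'
)


def msie_css_share(lines):
    """Fraction of MSIE-claiming visitors that fetched at least one stylesheet.

    A visitor is approximated by the (ip, agent) pair, which is crude:
    proxies merge visitors and dynamic addresses split them.
    Returns None if no MSIE-claiming visitor was seen.
    """
    fetched_css = {}
    for line in lines:
        m = LOG_LINE.match(line)
        if not m or "MSIE" not in m.group("agent"):
            continue
        client = (m.group("ip"), m.group("agent"))
        path = m.group("path").split("?", 1)[0]  # ignore any query string
        fetched_css[client] = fetched_css.get(client, False) or path.endswith(".css")
    if not fetched_css:
        return None
    return sum(fetched_css.values()) / len(fetched_css)
```

A visitor with a warm cache may legitimately skip the stylesheet, so the result is best read as a lower bound on the share of real browsers.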

h3. Hall of Shame

If you’re attempting to alert people to your (genuine) online search services, doing so by sending referer spam once in a while is _not_ a good idea. The following “services” can consider themselves banned:
* world-of-newave dot info
* dailyorbit dot com

3 Comments

  1. Does this mean that global web stats for Internet Explorer are out by 50%?

  2. No, it means that IE stats may be off by as much as 50% on _this_ site.
    The demographic for this site (very likely) consists of relatively tech-savvy users who are aware of the alternatives and have moved on to greener pastures, be it Opera, Safari or Firefox, and as such the number of IE users is relatively low.
    If this site had been about birdwatching, many of my tech-savvy visitors would have been replaced by ornithologists who don’t care about browsers. For the moment, I don’t think too many of them are aware of Opera, Safari or Firefox. This means that I would have had a higher share of IE users, while the robots would have stayed the same, and as such, that 50% number would have dropped significantly.
    If I look into the magic 8-ball and try to come up with a traffic figure, it predicts around three (perhaps three and a half) million pageviews (take this number with a huge grain of salt; it’s based on the traffic so far this year).
    If I suddenly become more popular, and receive 30 million pageviews instead, with roughly the same demographics, I would have ten times as many IE visitors. For some reason, I doubt the bot population would grow tenfold.
    This would mean that the overreporting caused by bot visitors falls to around 10%. Had I grown to become a regular Slashdot, the overreporting would be 1% with the same bot population.
    (It should be mentioned that the bot population is likely to have some growth as a site becomes more popular, but not nearly as much as human visitors.)
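
The arithmetic behind those percentages can be sketched as follows, assuming that the bot traffic stays completely fixed while the human traffic scales, and reading "overreporting" as the bots' share of all IE-labelled pageviews. The function name @bot_share@ and the unit figures are only illustrative, and the 100x step for "a regular Slashdot" is my own stand-in.

```python
def bot_share(human_pageviews, bot_pageviews):
    """Fraction of IE-labelled pageviews that actually come from bots."""
    return bot_pageviews / (human_pageviews + bot_pageviews)


# Hypothetical units: today, bots and humans each contribute one unit of
# IE-labelled traffic, matching the "about half" observation.
bots = 1
for growth in (1, 10, 100):
    humans = growth  # human traffic scales, bot traffic stays fixed
    print(f"{growth:>3}x traffic: bot share {bot_share(humans, bots):.1%}")
```

With equal parts the share is one half, at tenfold traffic it is 1/11 (about 9%), and at a hundredfold it is 1/101 (about 1%), which is where the rounded figures above come from.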

  3. I should probably also add that there are several factors that might affect my 50% number.
    There might be human bias, meaning: My estimate may have been wrong. Further, the sample size may have been too small. I could have watched the samples at a time when a particularly noisy botnet was running.
    However, I redid my “When is MSIE not MSIE?”:http://virtuelvis.com/archives/2003/02/properly-measuring-msie test of yesteryesteryear, now that the traffic is significantly higher: Almost 15% of the CSS-capable browsers here are now misrepresenting themselves as MSIE. The vast majority of these are Opera.