Can we please stop this statistics nonsense?

Recently, there has been some discussion on statistics, both over at “Asa Dotzler’s blog”:http://weblogs.mozillazine.org/asa/archives/008076.html and in “other places”:http://operawatch.blogspot.com/2005/05/operas-market-share.html.
Could we all please stop this nonsense? Browser statistics are rather uninteresting, since they are fairly likely to be invalid, even within a single domain or site. Applied across domains, they are even less reliable.
I have spent quite some time looking at what is happening in my own server logs, and I have made some observations:


h3. Undercounting
Text-only browsers without scripting capabilities are, on average, undercounted by a factor of around 3.2 when visiting this site: these browsers do not fetch any associated style sheets, scripts or images, and hence will always be undercounted.
As for Opera: Opera has a much more aggressive caching policy than the competing browsers: it mostly doesn’t verify the document with the server when going back and forward, but instead fetches it from the RAM cache. How severely Opera is undercounted due to this aggressive caching is not measurable, unless I want to pay a visit to each and every one of my users and observe how they use their back and forward functions.
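To make the text-only case concrete, here is a minimal sketch of how the same traffic produces different browser shares depending on whether hits or page views are counted. The log records, the browser labels and the @is_page@ rule are all made up for illustration; they are not taken from my logs, and the resulting factor is not the 3.2 quoted above.

```python
from collections import Counter

# Hypothetical access-log records: (browser, requested path).
# The graphical browser fetches the page plus its style sheet, script and
# image; the text-only browser fetches only the page itself.
LOG = [
    ("graphical", "/post.html"),
    ("graphical", "/style.css"),
    ("graphical", "/script.js"),
    ("graphical", "/photo.png"),
    ("text-only", "/post.html"),
]

def is_page(path):
    """Treat only HTML documents as page views (an assumption of this sketch)."""
    return path.endswith(".html") or path.endswith("/")

def share(counter):
    """Convert raw counts into fractions of the total."""
    total = sum(counter.values())
    return {browser: count / total for browser, count in counter.items()}

# Share when every request counts as one hit.
hit_share = share(Counter(browser for browser, _ in LOG))

# Share when only page requests are counted.
page_share = share(Counter(browser for browser, path in LOG if is_page(path)))

print(hit_share)
print(page_share)
```

Both browsers viewed exactly one page, yet the hit-based share gives the text-only browser a fifth of the traffic instead of half.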
h3. Overcounting
First, let’s start with browsers in the Gecko family: They support a feature called “prefetching”:http://www.mozilla.org/projects/netlib/Link_Prefetching_FAQ.html where the browser under certain conditions will fetch documents without user interaction. For this reason, sites that use links with the @rel=”next”@ and @rel=”prev”@ relationships will be subject to some overcounting of browsers in this family.
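The prefetching FAQ linked above notes that Gecko sends an @X-moz: prefetch@ request header with prefetch requests. If request headers are available to whatever gathers the stats (most default log formats do not record them, so that is an assumption), prefetches could be left out of the count roughly like this. The request records and the @is_prefetch@ helper are hypothetical.

```python
def is_prefetch(headers):
    """True if the request carries Gecko's prefetch marker header."""
    # Header names are case-insensitive, so normalise before looking up.
    normalised = {name.lower(): value.strip().lower()
                  for name, value in headers.items()}
    return normalised.get("x-moz") == "prefetch"

# Hypothetical request records, for illustration only.
requests = [
    {"path": "/page-1.html", "headers": {"Accept": "text/html"}},
    {"path": "/page-2.html", "headers": {"Accept": "text/html", "X-moz": "prefetch"}},
    {"path": "/page-3.html", "headers": {"Accept": "text/html"}},
]

# Keep only the requests that the user actually triggered.
counted = [r for r in requests if not is_prefetch(r["headers"])]
print([r["path"] for r in counted])
```

This only removes the overcounting, of course. If the user later does follow the prefetched link, that page view comes from the cache and is never seen by the server at all.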
Internet Explorer is subject to several distinct forms of overcounting:
Spambots or other rogue spiders of all kinds tend to identify themselves as IE. So far in 2005, I have 403’ed around 20000 requests from such bots, and even so, quite a number of rogue spiders have gotten through. This means that IE will always be somewhat overcounted.
Depending on which tool is used to gather stats, IE may be overcounted at the expense of Opera. “I wrote about this”:http://virtuelvis.com/archives/2003/02/properly-measuring-msie in 2003, and “revisited the issue”:http://virtuelvis.com/archives/2003/04/msie-not-msie-revisited a few months later, where I found the following:
bq. 35.2% of browsers that claim to be MSIE aren’t. 84.2% of browsers that use a faked MSIE UA string, are using Opera.
In addition, certain sites that are using “conditional comments”:http://virtuelvis.com/archives/2004/02/css-ie-only to serve MSIE-specific styles, will overcount MSIE, since it will be making at least one extra request. Likewise, if a site uses conditional comments to deny MSIE some styles, MSIE will be undercounted.
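The Opera case can be sketched in a few lines. When Opera identifies as MSIE, its User-Agent string still contains the token @Opera@, so a tool that only looks for @MSIE@ files it under the wrong browser. The sample strings and function names below are illustrative assumptions, and a browser that sends a completely faked MSIE string cannot be detected this way at all.

```python
import re

# Sample User-Agent strings, included for illustration only.
SAMPLES = {
    "ie6": "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1)",
    "opera_as_ie": "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; en) Opera 8.0",
    "firefox": "Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.7.8) "
               "Gecko/20050511 Firefox/1.0.4",
}

def naive_family(user_agent):
    """What a careless stats tool does: anything mentioning MSIE is MSIE."""
    return "MSIE" if "MSIE" in user_agent else "other"

def family(user_agent):
    """Check for the Opera token before trusting the MSIE token."""
    if "Opera" in user_agent:
        return "Opera"
    if re.search(r"MSIE \d", user_agent):
        return "MSIE"
    if "Gecko/" in user_agent:
        return "Gecko"
    return "other"

for name, ua in SAMPLES.items():
    print(name, naive_family(ua), family(ua))
```

The order of the checks is the whole point: the Opera test has to come first, because the spoofed string matches the MSIE test as well.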
h3. Download numbers
This is perhaps the most uninteresting stat you can use to compare the popularity of different browsers:
* One single download may be installed on hundreds or thousands of machines.
* A downloaded piece of software does not mean that it’s ever installed.
* Even if software is downloaded and installed, it doesn’t necessarily mean that it’s ever used. For instance, I have probably downloaded 1.0+ builds of Firefox around 25 times, to different computers, for reinstallation purposes. I hardly ever use it. There are probably Firefox users who do the same with Opera, and there are likely MSIE users who try both browsers, only to return to MSIE or their favourite MSIE skin.
The only context in which download numbers are remotely interesting, is for the software vendors themselves: How popular is Opera 7.54 compared to Opera 8? How many people downloaded Firefox 1.0, 1.0.1, 1.0.2 and 1.0.3, respectively? _These numbers are not a good measure of “popularity” compared to the competition._
h3. Conclusion
Can we please stop this statistics nonsense, and its attached flamewars? The Mozilla Foundation and Opera Software have far more important things to attend to. Such as fighting for web standards. Fighting against an innovation-hostile browser monopoly. Fighting for users.

15 Comments

  1. What can I say – you da man.
    I’m wondering though, whether when they present their statistics, they’re counting unique users (IP addresses?) rather than ‘hits’ – since that would get around what you were saying about text based browsers etc. Though I suppose people behind a huge network would only supply a limited number of IPs.
    Do you have more insight as to how these people get the stats they get? I mean if they’re using AWstats (or whatever), that’s probably a limitation yes/no? Or are they cleverer than that?
    I mean I agree that the stats may not mean what they mean, and that their prominence is a bit annoying, but shouldn’t that mean we want/are interested in more meaningful statistics rather than uninterested in statistics in general?
    Stats can be important, especially if the stats are good and collected with sound methodology – and if nothing else, there had to have been some way to prove that an “innovation-hostile browser” had what amounted to a virtual monopoly.
    I’m sure stats would also show that Opera users would be people more willing to pay money for things, and to do so online – or are more willing to tolerate advertising. And generally smarter and prettier than everyone else. Of course. Similar things are said about Mac users and gay people, which is probably a different subject.

  2. Statistics is the art of lying with documented numbers. The fact that the documentation is flawed has never stopped anyone.
    Statistics may at best be seen as indicators worth further investigation. Statistics themselves rarely ever survive thorough investigations, so there’s no point in following those.
    For web sites it’s even simpler: if one adjusts to statistics then one ends up fulfilling them. It’s a vicious circle where each site can get just the results it wants. One minor flaw becomes a really big one.

  3. This is complete and utter BS. Website statistics surveys and website stats software count VISITORS. That’s what browser percentages are based on. NOT HITS. NOT PAGE VIEWS. VISITORS. So, it doesn’t matter how good Opera’s cache is. Nor does it matter if a Gecko-based browser downloads an entire website. Both are still counted as one visitor and the browser stats are counted based on that.
    In addition, EVERY major website statistics program can accurately detect Opera’s default MSIE-spoof user agent string and correctly detect it as Opera.
    Yes, stats are imperfect. But, please, get the facts right.

  4. John: Before you decide that you want to come off as a rude jerk, wouldn’t you think it was a good idea if you were a bit informed? Analog, Webalizer and AWStats all calculate their visitor numbers based on _hits._ Not pageviews, not visitors. _Hits._
    As for offsite statistics, most of these do not actually disclose what they are measuring. Some might be measuring “unique visitors,” and some are simple pageview counters. Even the “unique visitor” counters are error-prone:
    What is a unique visitor? An IP address? No. You can have thousands of visitors visiting from the same IP. 3rd-party tracking cookies? You simply can’t tell, neither can the people gathering the stats.
    For those that still don’t get it: I’m specifically _not_ trying to bash a particular browser. I am just saying that you can’t take stats gathered in one domain and generalize them: You don’t know what errors you have, and thus, the _statistic itself does not have any merit at all. You should under no circumstance make judgements based on said stats, and *under no circumstance whatsoever* should you design web sites or services based on said stats._

  5. (btw: The definition of ‘domain’ in the last comment is not the computer definition)

  6. Subtitles:
    bq.. I mean I agree that the stats may not mean what they mean, and that their prominence is a bit annoying, but shouldn’t that mean we want/are interested in more meaningful statistics rather than uninterested in statistics in general?
    Stats can be important, especially if the stats are good and collected with sound methodology – and if nothing else, there had to have been some way to prove that an “innovation-hostile browser” had what amounted to a virtual monopoly.
    p. Indeed, properly gathered statistics can be important. But for a statistic to be “properly gathered”, you actually have to know exactly how the stat is gathered, you have to know each and every source of potential errors, and you have to analyze your data, taking these errors into account. Readily digested numbers do not possess these qualities.
    Having said all that: It is possible to make more or less educated guesses by averaging data from a number of sources, but the number you’re getting is by no means accurate, and of no significance. Telling that Gecko core browsers are gaining popularity is possible, but telling whether this amounts to 3% or 13% is impossible. It’s also of no significance whatsoever that another browser, such as Opera rises from 1.8 to 1.9%, or if it goes from 1.9 to 1.8%.
    And, concerning the monopolist browser: whether something has a virtual monopoly can be observed by other means, such as looking at the percentage of web sites that more or less need MSIE to render properly (or that only render properly because other browsers go to great lengths in emulating MSIE’s bugs and quirks).

  7. bq. John: Before you decide that you want to come off as a rude jerk, wouldn’t you think it was a good idea if you were a bit informed? Analog, Webalizer and AWStats all calculate their visitor numbers based on hits. Not pageviews, not visitors. Hits.
    Arve: I’d somehow associated your blog posting with treego, the rude (and rather ignorant) fellow over on Asa’s blog. It seems that carried into my initial post’s tone… and for that I apologize.
    I’m absolutely amazed that there are stats packages out there that are doing browser stats based on hits. The hits myth was debunked back in the bubble days, so it’s rather surprising that anyone still uses those numbers for anything at all.
    bq.. As for offsite statistics, most of these do not actually disclose what they are measuring. Some might be measuring “unique visitors,” and some are simple pageview counters. Even the “unique visitor” counters are error-prone:
    What is a unique visitor? An IP address? No. You can have thousands of visitors visiting from the same IP. 3rd-party tracking cookies? You simply can’t tell, neither can the people gathering the stats.
    p. Quite true. I’ve advised many a client on this same issue. The most accurate method is IP in conjunction with cookies (which is what I use for my sites) but even that is suspect — how many people disable cookies, etc?
    As for the major stats vendors. I’ve heard (admittedly unconfirmed) from at least one that it is a combination of IP and cookie… which is about as accurate as you can get from the server’s point of view. I know that TheCounter’s tracker works this way as well. WebSideStory’s hitbox also uses IP + cookie… but I do not know if they do browser percentages based on hits or visitors (though I lean highly towards visitors).
    bq. For those that still don’t get it: I’m specifically not trying to bash a particular browser. I am just saying that you can’t take stats gathered in one domain and generalize them: You don’t know what errors you have, and thus, the statistic itself does not have any merit at all. You should under no circumstance make judgements based on said stats, and under no circumstance whatsoever should you design web sites or services based on said stats.
    While suspect and not truly accurate, the overall stats can be quite useful. They do show that non-IE browser use is growing and that any site that is still IE-only is basically turning away customers at the door. For the business and numbers types that can’t justify the “extra cost” of coding to standards and then adapting to all browsers, these numbers can be VERY helpful in justifying the “cost”. [Ex: Do you want to turn away every 20th customer?] That’s actually the main reason behind my attempts to combine the stats into something a bit more useful (and hopefully accurate) and breaking things down based on rendering engine. (Browsers Statistics by Rendering Engine) The numbers, of course, have all the usual caveats… but if they can get us to convince even a handful of developers to code more to standards, then it’s worth it.
    Regards,
    John

  8. bq. Arve: I’d somehow associated your blog posting with treego, the rude (and rather ignorant) fellow over on Asa’s blog. It seems that carried into my initial post’s tone… and for that I apologize.
    Apology accepted. Even if I do prefer Opera over Firefox, I just do not want to be dragged into the perpetual flamewar between the zealots on either side.

  9. bq. Even if I do prefer Opera over Firefox, I just do not want to be dragged into the perpetual flamewar between the zealots on either side.
    Yeah, I hear you. It seems that there are always Opera fanboys posting on Mozilla forums/blogs and Firefox fanboys poo-pooing Opera on Opera forums. I don’t really get it. Everyone’s entitled to their opinion and there’s a lot to like about Opera. (Heck, there’s quite a bit I like about Opera even though I haven’t bought it since version 3) If another browser comes along that does even more of what I want than Firefox… I may switch to that. (and I’d imagine you’d do the same)
    In the same respect, I don’t get the defensiveness inspired by valid criticisms. I’ve encountered it from Opera fans and Firefox fans. And it reminds me of the Microsoft vs Apple rhetoric. I don’t encounter it as much from Firefox folks but that could be because they know I’m not trying to put it down (since they see my Portable Firefox, etc work). And there are LOTS of criticisms that can be directed at Firefox, Opera, Safari, etc. (Though there are more than all three that can be directed at Internet Explorer)
    For the fanboys: turning a critical eye towards all software is what helps us make it better. Constructive criticism has helped both Opera and Firefox improve in the past and will (hopefully) continue to do so in the future.
    Oh, and a pseudo-random side note… I’m planning on doing a launcher and possibly a package of Portable Opera (along the lines of my Portable Firefox). I may just do instructions and a launcher, though, as the package would be a license violation (something I haven’t had to worry about with Portable Firefox, Thunderbird, NVU, Sunbird and OpenOffice) so I’m not sure about that yet. If you’re interested in testing, please let me know.
    Regards,
    John

  10. Stats can be useful if parsed in a sensible way. For determining marketshare, most of the undercounting and overcounting problems that you mention can be overcome by looking only at one request per IP per day. If you then throw out all the requests that did not result in an HTTP 200/206 response, you should have a pretty solid number.
    Arriving at useful numbers for other types of statistics is rather more difficult.

  11. bq. Stats can be useful if parsed in a sensible way. For determining marketshare, most of the undercounting and overcounting problems that you mention can be overcome by looking only at one request per IP per day.
    There are two problems with this:
    # Counting once per IP will introduce errors from users connecting from behind a NATed network, since they will all be counted as one IP.
    # Most, if not all, of the referral spamming bots use open proxies, and as such, the UA strings used by these bots will be severely overcounted.
    End result: You still have an unreliable result.

  12. Opera Editorial: Crimes, Misdemeanors, and Browser Statistics

    I had actually planned an editorial (which will still appear eventually) about how well Opera seems to have handled, for instance, security companies such as Secunia; cooperating about the timing of security announcements etc. And then this came along….

  13. Grimm

     /  2005-06-22

    Statistics aren’t nonsense (and no, not browser statistics either). Most people who analyse logs and develop statistics are actually aware of what they’re doing, and they pay close attention to all of the obvious points you’re making. There are even terms for these phenomena in statistics, for example standard deviation and variance.
    Are you possibly confusing statistics with facts? Now remember, statistics is an educated guess. What you want, is the blatant truth. Well good luck finding it.
    Meanwhile, I’ll stick with perfectly good statistics.

  14. Visitor-per-day (even if using cookies to detect being behind a firewall) isn’t accurate because you don’t differentiate between casual surfers of a site and those who are true clients of a website and spend real time on the site. For instance, someone will be counted the same who just logs on to check their Excite horoscope as someone who sits there browsing the site an hour every day.

  15. Jo

     /  2005-10-27

    Can a JavaScript-based stats counter included in a page be considered reliable when it boils down to counting unique visitors by counting unique IP addresses? E.g. Extreme Tracker