11.08.10

Commentary: StatCounter ‘Global’ Statistics

Posted in GNU/Linux at 8:10 pm by Dr. Roy Schestowitz

StatCounter bias

Summary: How StatCounter turns 4-5% of the world’s population into 25% and reduces the world’s largest Internet population (China) to just 2.46%, then claims to be measuring global market share (other surveys do the same thing)

AL submits: “Thank you for all your hard work in bringing us news through Techrights. I am reading it daily and find lots of interesting information.

“I read one of the comments from Mad Hatter in which he was talking about the Wikipedia article on OS market share. I went to check it out and found that they use 1% for Linux (globally), based on the research by StatCounter Global. I was interested to see how this group gathers its statistical data. If you go to their FAQ section, they talk about sample size per country/region, and there is a link to the full list of all countries. As they state themselves, their pool is 16.3 billion hits. Quite large, I would say. But there is something interesting: the biggest group (region) is the United States, with 3,965,972,279 hits. That is almost 25% of the total pool. Now, my days of statistical studies are long gone, but I still remember that in order to get an accurate result you cannot over-represent one group. The result will obviously be skewed. We have one country that contributes almost 25% of the result, compared to the rest of the world. As StatCounter states that they choose randomly, it is very likely that a lot of the hit data is taken from the USA. Do you know, for example, what the share of hits from China is? 2.46%! In fact, looking at the whole list, you can see that from Korea downwards the share is less than 1%! That includes countries like Poland, Greece, Japan, Russia, Switzerland, etc.

“I know some can say that there are many more computers sold in the USA than in other countries (can’t be true). But market share is more complex. If we had 95% (for example) Linux presence on desktops in China, it would hardly have any influence with a representation of only 2.46% in the StatCounter data. Do you see what I mean? There are of course many more problems with that. What kind of websites is StatCounter using to get hits? If we put a hit counter on a website with Silverlight, I don’t think we will get many hits from Linux desktops, right? And even if the websites are getting hits from the same number of Linux desktops and other OS desktops, what will happen? StatCounter will randomly select hits from the global pool, and as data from the USA is more likely to get selected, it will greatly skew the result and Linux will always be under-represented. Let’s say you have two crates, one with 10 pears and one with 250 tomatoes + 150 pears, and you draw five times: 3 times from the first crate and 2 times from the second. You will have selected more pears than tomatoes, even though there are 250 tomatoes and 150+10=160 pears. Is this a reliable representation?”
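
The crate analogy above can be checked with a little arithmetic. The following is only a sketch in Python: it assumes every item in a crate is equally likely to be picked on each draw, and the names `crates`, `draws`, `true_pear_share` and `expected_sampled_pear_share` are made up for illustration (they have nothing to do with StatCounter's own tooling).

```python
from fractions import Fraction

# Crates from the analogy: name -> (pears, tomatoes)
crates = {"first": (10, 0), "second": (150, 250)}
# How many times we draw from each crate
draws = {"first": 3, "second": 2}


def true_pear_share(crates):
    """Share of pears among all fruit in all crates."""
    pears = sum(p for p, _ in crates.values())
    total = sum(p + t for p, t in crates.values())
    return Fraction(pears, total)


def expected_sampled_pear_share(crates, draws):
    """Expected share of pears in the sample, given a fixed number of draws per crate."""
    expected_pears = sum(
        draws[name] * Fraction(p, p + t) for name, (p, t) in crates.items()
    )
    return expected_pears / sum(draws.values())


print("true share of pears:      ", true_pear_share(crates))
print("expected share in sample: ", expected_sampled_pear_share(crates, draws))
```

Under these assumptions pears are 16/41 (about 39%) of all the fruit but 3/4 of the expected sample, because the small all-pear crate supplies most of the draws.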

This post is also available in Gemini over at:

gemini://gemini.techrights.org/2010/11/08/statcounter-bunk-data/

8 Comments

  1. StatCounterGlobalStats said,

    November 9, 2010 at 3:59 am

    Hi there,

    I’ve just stumbled across your article and hope to clear up some confusion.

    Our StatCounter Global Stats measure various market share and other stats for all countries across the globe… hence the name.

    Our methodology is very simple and we’ve purposefully kept it that way. Specifically, our stats are based on more than 15 billion hits per month to our 3 million+ member sites. We’re not aware of any other publicly available service providing market share stats that has a bigger sample size on which they base their information.

    You’re absolutely correct about us NOT weighting our data. We do not impose artificial weightings on our stats and this is a conscious and deliberate decision. Weighting stats means that the stats are only as good as the weighting methodology used. If the weighting data is inaccurate or out of date, then it renders the data completely incorrect. For these reasons again, we choose NOT to weight our data in any way and instead we report it as we record – other commentators can, however, weight the data as they wish. All our work is shared under a Creative Commons Attribution-Share Alike License (http://creativecommons.org/licenses/by-sa/3.0/) for this specific purpose – so please feel free to download our data and apply whatever weights you see fit.

    StatCounter Global Stats came about because we decided to publicly share interesting trends that we were monitoring in-house. We aim to make our stats and methodology as clear as possible and appreciate all comments, queries and suggestions. If you have any questions for us, please don’t hesitate to contact us via our feedback form (http://gs.statcounter.com/feedback) or by direct email.
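
The reweighting that StatCounter invites above can be sketched in a few lines of Python. This is an illustration only, not StatCounter's method. The hit shares (25% and 2.46%) come from the article; the 95% Linux figure for China is AL's deliberately extreme example; the 1% elsewhere is borrowed from the Wikipedia figure mentioned in the article; and the `user_share` values are purely hypothetical placeholders, not real Internet-population statistics. The function name `pooled_share` is also made up.

```python
def pooled_share(regions, weight_key):
    """Weighted average of per-region Linux share, using the chosen weight column."""
    total_weight = sum(r[weight_key] for r in regions.values())
    weighted_sum = sum(r[weight_key] * r["linux_share"] for r in regions.values())
    return weighted_sum / total_weight


regions = {
    # hit_share: share of the hit pool (from the article)
    # user_share: HYPOTHETICAL share of the world's Internet users
    # linux_share: illustrative per-region Linux share
    "United States": {"hit_share": 0.25, "user_share": 0.13, "linux_share": 0.01},
    "China": {"hit_share": 0.0246, "user_share": 0.22, "linux_share": 0.95},
    "Rest of world": {"hit_share": 0.7254, "user_share": 0.65, "linux_share": 0.01},
}

unweighted = pooled_share(regions, "hit_share")   # pooled by raw hits
reweighted = pooled_share(regions, "user_share")  # pooled by assumed user population

print(f"pooled by hits:  {unweighted:.1%}")
print(f"pooled by users: {reweighted:.1%}")
```

With these made-up inputs the pooled figure moves from roughly 3% to roughly 22%. That shows how strongly a 'global' number depends on the weights, and equally how much it then depends on the weighting data being right.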

    Dr. Roy Schestowitz Reply:

    Hi,

    A few quick points:

    1. The size of the sample does not matter at all. Other companies like Net Applications also brag about the number of unique IPs (UIPs), but this number is meaningless unless the sample is distributed correctly (nature of the sites sampled, geography, etc.).
    2. How does the data account for dynamic IPs, proxies/Squid, and the imbalanced use of the Web browser depending on the user (e.g. number of page requests, which can be correlated with operating system and browser: does it support tabs? What is the connection speed?)?
    3. How are zombie PCs and other ‘junk’ traffic removed from the dataset?

    There are many other challenges/deficiencies, but it’s commendable when the data and methods (preferably code) are made publicly available for independent audits, provided of course they don’t violate privacy rules (which is a hard problem when browsers are specified very precisely, with locational information too). That’s why such surveys cannot reach privacy-conscious sites, many of which appeal to civil rights-aware users (many of whom would use GNU/Linux), and that’s just one example. I wrote an article on the subject 3 years ago:

    http://itmanagement.earthweb.com/osrc/article.php/3687616/Can-Linux-Adoption-Ever-be-Accurately-Gauged.htm

  2. Matsi said,

    November 9, 2010 at 1:12 pm

    Here are the StatCounter results for Finland:

    http://gs.statcounter.com/#os-FI-monthly-200910-201010
    (showing about 2.5-2.7% for Linux)
    …and here are the results for my homepage (non-geek, 95% non-computer, mostly politics, social discussion…)

    Windows 2785 83.18%
    Linux 302 9.02%
    Macintosh 185 5.53%
    Unknown 73 2.18%
    FreeBSD 2 0.06%
    CPM 1 0.03%

    (visitors: Finland 94.4%, Sweden 0.9%, USA 1.1%, others 3.4%)

    No big differences between clicks, unique visitors, unique sessions…

    Several Finns tell the same story: Linux has about 7-10% market share on their homepages, blog sites…

    So one thing is sure: Linux has a much bigger market share in Finland than StatCounter reports. My guess is 8% +/- 1%. There is no doubt that the situation is much the same in other regions. Linux is some 3 times bigger than StatCounter claims.
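
The percentages in Matsi's table follow directly from the raw counts, and the gap to StatCounter's figure can be expressed as a ratio. A quick sketch in Python; the variable names are illustrative, and 2.5-2.7% is the StatCounter range Matsi quotes above. Note that this only restates one site's log, so it says nothing about how representative that site is.

```python
# Raw visit counts from the table above
counts = {
    "Windows": 2785,
    "Linux": 302,
    "Macintosh": 185,
    "Unknown": 73,
    "FreeBSD": 2,
    "CPM": 1,
}

total = sum(counts.values())

# Percentage share per operating system, rounded to two decimals as in the table
shares = {name: round(100 * n / total, 2) for name, n in counts.items()}

# StatCounter's range for Linux in Finland, as quoted in the comment (percent)
statcounter_low, statcounter_high = 2.5, 2.7

# How many times larger this site's Linux share is than each end of that range
ratio_vs_high = shares["Linux"] / statcounter_high
ratio_vs_low = shares["Linux"] / statcounter_low

for name, pct in shares.items():
    print(f"{name:<10} {counts[name]:>5} {pct:>6.2f}%")
print(f"Linux share is {ratio_vs_high:.1f}x to {ratio_vs_low:.1f}x the StatCounter figure")
```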

    Dr. Roy Schestowitz Reply:

    StatCounter needs to tell us which actual sites in Finland it is sampling from. These are not randomly selected Finns.

    StatCounterGlobalStats Reply:

    There is still considerable confusion here!

    Our methodology is here:
    http://gs.statcounter.com/faq#methodology

    We are NOT sampling websites from Finland.

    Geo is determined via IP address based on location of visitor NOT location of website. Nothing is randomly selected either – we publish everything we track.

    In other words our stats for Finland are based on all hits we track (approx 60 million per month) from Finland (i.e. IP addresses in Finland) to all our 3 million plus member sites.

    If anyone has further questions, please do submit them to us directly – we’re more than happy to deal with any and all queries.

    Thanks!

    Dr. Roy Schestowitz Reply:

    That does not change my point. What are those 3m+ sites? There are far more sites than that on the Web (IIRC, over 100m domains registered).

    What is the geographical distribution of these 3m+ sites? What proportion of them is Finnish for example? What proportion is Chinese or Brazilian?

    Danielh Reply:

    If I get this right, you are sampling 3 million sites out of the 220 million sites available?

    I really hope those sites are spread widely across target groups etc., because otherwise there seems to be a great margin for error. Do they include any of the bigger sites like Google, Facebook, Slashdot, Youtube, QQ, Baidu, Blogger, Twitter etc., or is it just smaller sites?

    Dr. Roy Schestowitz Reply:

    To be fair to StatCounter, it is an exceptionally hard problem to solve because of its massive scale. I just hope they make all of their experimental data and methods public. In academia we can hardly even publish a paper without meeting this most fundamental requirement, not to mention rigorous scrutiny (no statistician would accept StatCounter’s charts without a challenge).
