Seeing lots of Wikipedia in your Google searches?

In August and September 2006 various bloggers (Nicholas G. Carr, Steve Rubel, Tim Bray, and others) started to notice that Wikipedia often shows up on Google for their searches.

To research this recent phenomenon more thoroughly, I decided to take a simple random sample of the whole of Wikipedia (about 2.7 million titles once redirects are included) and then search for those titles on Google, Yahoo and MSN.
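To make the sampling step concrete, here is a minimal sketch of drawing a simple random sample of titles. It is only an illustration, not the script used for the report: the function name `sample_titles`, the placeholder title list, and the sample size of 500 are all assumptions.

```python
import random

def sample_titles(titles, k, seed=0):
    """Return k titles drawn uniformly at random, without replacement."""
    rng = random.Random(seed)  # fixed seed so the same sample can be redrawn
    return rng.sample(titles, k)

# Hypothetical stand-in for the real list of ~2.7 million titles
# (articles plus redirects); the real list would be loaded from a dump.
all_titles = [f"Title_{i}" for i in range(10_000)]

# Hypothetical sample size, chosen only for illustration.
sample = sample_titles(all_titles, k=500)
```

Every title has the same chance of being picked, which is what makes the sample "simple random"; each sampled title then becomes one search query.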

So, how likely is it? It turns out to be very likely: there is about an 81% chance of getting a Wikipedia link in the top 10 results.
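Since the 81% figure comes from a sample, it carries some sampling uncertainty. The sketch below shows one common way to attach a 95% interval to such a proportion, the Wilson score interval. The counts used here (810 hits out of 1000 queries) are hypothetical and chosen only to match the 81% figure; the real sample size is given in the report, not in this post.

```python
import math

def wilson_interval(hits, n, z=1.96):
    """Wilson score interval for a binomial proportion (z=1.96 is roughly 95%)."""
    if n <= 0:
        raise ValueError("n must be positive")
    p = hits / n
    denom = 1 + z * z / n
    centre = (p + z * z / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return centre - half, centre + half

# Hypothetical counts: 810 of 1000 sampled titles had a Wikipedia
# link somewhere in the top 10 results.
low, high = wilson_interval(810, 1000)
```

With a larger sample the interval narrows, so the same 81% would be a firmer estimate.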

(pictures follow, so if you don’t see them in your RSS feeds go to my blog page)

Here is a nice pie chart of how many Wikipedia results appear in Google’s top 10:

We can of course do the same for other search engines such as Yahoo! or MSN, which gives us nice combined trend lines:

But then comes the question: how high do those Wikipedia articles rank? Well, it turns out that on Yahoo the Wikipedia article is the #1 result in ~47% of cases and in the top 3 in ~76% of cases. On Google it is in the top 3 in only ~66% of cases.
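Figures like "#1 in ~47% of cases" are simple shares over the recorded rank of the first Wikipedia hit per query. Here is a minimal sketch of that tally; the function name `top_k_share` and the ten ranks are made up, and dividing by all queries (rather than only those with a Wikipedia hit) is an assumption, so see the report for the exact definition.

```python
def top_k_share(ranks, k):
    """Fraction of queries whose first Wikipedia result is within the top k.

    `ranks` holds one entry per query: the 1-based position of the first
    Wikipedia result, or None if no Wikipedia result was found.
    """
    if not ranks:
        raise ValueError("ranks must not be empty")
    return sum(1 for r in ranks if r is not None and r <= k) / len(ranks)

# Made-up ranks for ten queries, purely for illustration.
ranks = [1, 1, 2, None, 3, 1, 7, None, 1, 2]

share_first = top_k_share(ranks, 1)  # 4 of the 10 queries
share_top3 = top_k_share(ranks, 3)   # 7 of the 10 queries
```

Running the same tally per search engine gives directly comparable numbers.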

If you want to read more, you can download the full report (PDF, 7 pages) and also check out the appendix, which has more pictures and the statistical output, in case you doubt my interpretations or would just like to see the more detailed numbers behind them.

Special thanks go to Matej for giving me helpful hints about sampling methodology, and to professor Hercules Dalianis, who approved this subject as my assignment and thus forced me to actually finish it.

Please post any suggestions or comments about this research in the comments. If there is enough interest in further analysis, I will extend it over time.

14 thoughts on “Seeing lots of Wikipedia in your Google searches?”

  1. Interesting stuff! It’s nice to see our beloved encyclopedia in the results of a lot of searches, as it generally has nice, concise data about the subject. Also, I don’t expect to see Encyclopedia Britannica equal this feat any time soon 🙂

  2. Pingback: Swiss Metablog
  3. Hi!
    Great study: I’ve only skimmed it so far, but there are three “obvious” remarks to be made.

    1. Have you considered smaller, alternative search engines for comparison purposes? I’m thinking of an open-source one (sic) that might be a good baseline.

    2. Please, please, do it again, to have some dynamic data… I’d love to help if it’s too much work. (You’ve got my e-mail, though it’s not public, right?)

    3. Could you use Zeitgeist info instead of a Wikipedia-biased query file?
    This only has 10 items or so, but I believe you might obtain a list of the top 100, unweighted, sorted alphabetically, from one of the four big search engines; you could even sign an agreement not to publish it.

    I might post another comment once I’ve finished reading it in full detail.

  4. Bertil, thanks for the comments. I’ll email you about the details, but until then here are quick answers to the questions:

    1. I would love to include more search engines, but only the “big three” offer public APIs that I could use to query the data without having to write my own search engine scraper.

    If you know of any other search engines that offer some sort of API or another way to query for data automatically, I would be happy to include them.

    2. What kind of dynamic data? I have also tested another version, whose results I haven’t published yet, where I take queries from WP:RecentChanges in a certain time window so that I only query for pages that are active. Those numbers would probably give me even more pro-Wikipedia results.

    If there is a good source of such data, it would certainly be interesting to run the analysis on it.

    3. Sure, Zeitgeist sounds like a good idea, but it’s probably easier if you just do it manually than for me to feed it into my system.

    If you can email me details on how to get more detailed Zeitgeist information, I would be *very* happy to repeat the study on that dataset.

  5. I think to determine “how much Wikipedia people see on top of Google” you’d have to change your methodology, e.g. use actual AOL query data (even then you’d have the big constraint that AOL searchers may not be typical, but it would be a start, and as a bonus you’d also know what they clicked on).

    The fact that searching for Wikipedia titles often brings up Wikipedia doesn’t, IMO, yield relevant results, unless you want to show that Wikipedia has lots of pages indexed in search engines (over 53 million in Google, according to Google’s “site” operator). But lots of pages indexed does not mean lots of pages will show up in search results. For Wikipedia, we all *know* that’s the case from our searching experience, but to come up with statistically relevant data you’d have to use real sample queries for probing.

  6. Pingback: Undercurrent
  7. Pingback: Micro Persuasion
  8. Philipp Lenssen: I agree with you and I’m working on a better methodology. Still, the methodology is fully disclosed, so at least I’m not pulling results/queries out of the air and claiming something.

  9. I’ve posted this comment over on Nick Carr’s blog and Micropersuasion, but thought I’d add it here as well. I agree with Philipp that the methodology is flawed.

    The starting point needs to be what users search for, not what Wikipedia covers. By sampling from existing Wikipedia entries you are sampling on the dependent variable. By definition, the study is controlling for the fact that a relevant Wikipedia entry exists for that query, since you derived the search terms from existing Wikipedia titles. Queries on those exact terms are going to favor pages that have the term in the title. But who is to say that people search for those topics using those terms?

    You could try using the AOL data for some possibilities (like Philipp suggests), but we don’t really know how representative AOL users are of all Internet users. You could get some ideas from Google’s Zeitgeist (as per Bertil’s suggestion), although that will only give you extremely common topics that may have tons of results and so may well be atypical results not reflecting the likelihood of a Wikipedia result for less common terms and topics.

    I do research on how users look for various types of information online. If interested, we could discuss the possibility of you using some of the terms people in my study – average Internet users – entered on search forms for various types of content. I may not have quite the sample size you’re looking for, but I’d have some queries from real folks. (I also happen to know what they clicked on when using a particular search engine so that could also be interesting additional data.)

  10. Pingback: WikiAngela
