Summary: I used Google and AlltheWeb to estimate the frequency of certain words on the web. When there is a large number of hits (>10,000), AlltheWeb reports a significantly larger number than Google. Otherwise, the two search engines report roughly the same number. The transition between these two behaviors is abrupt and can be seen most clearly by plotting the frequency of variable-length onomatopoeia.
Because onomatopoeia are words that imitate sounds, their spellings are subject to great variation. I am interested in the usage frequency of these variations. For example, "whee", a sound associated with joy, can be spelled with an arbitrarily large number of e's. The easiest way to estimate the usage frequencies of these variants is to search the web: one can do a Google search on "whee", "wheee", and "wheeee" and record the number of hits. When the number of hits is plotted against the number of e's, the data follow a power law. Power-law behavior can also be observed in other onomatopoeia, such as "ahh..." and "hahaha...".

When the same plots are made using data from AlltheWeb.com search results, the curves are different. AlltheWeb reports a significantly larger number of hits than Google when the number of hits is over 10,000, but when there are fewer than 10,000 hits, the two search engines report similar numbers. See Figure 1.

Figure 2 directly compares the number of hits returned by Google and AlltheWeb for these searches. Each point represents the number of hits returned by the two search engines for a single word. The points fall along the diagonal when the number of returned hits is less than 10,000, but they lie above the diagonal when the number is greater. This indicates that the two search engines return similar numbers of hits for infrequent words, but AlltheWeb returns more for frequent words.
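The power-law fit mentioned above can be sketched in a few lines of Python. This is only an illustration of the idea, not the procedure used for the figures: the function names `variants` and `fit_power_law` are made up here, the model hits = c * n^(-alpha) (with n the number of repeated letters) is an assumption, and the numbers fed to it are synthetic values generated from an exact power law rather than real search results.

```python
import math


def variants(stem, letter, max_repeats):
    """Build spelling variants with 2..max_repeats copies of `letter`."""
    return [stem + letter * n for n in range(2, max_repeats + 1)]


def fit_power_law(lengths, hits):
    """Fit hits = c * length**(-alpha) by least squares on log-log axes.

    Returns (alpha, c). A straight line on a log-log plot corresponds
    to a power law, so this is ordinary linear regression of
    log(hits) against log(length).
    """
    xs = [math.log(n) for n in lengths]
    ys = [math.log(h) for h in hits]
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return -slope, math.exp(intercept)


# Words one would type into the search engine by hand.
words = variants("wh", "e", 6)

# SYNTHETIC hit counts (not real data): an exact power law with
# alpha = 3 and c = 1,000,000, used only to check the fit.
lengths = [2, 3, 4, 5, 6]
synthetic_hits = [1_000_000 * n ** -3 for n in lengths]

alpha, c = fit_power_law(lengths, synthetic_hits)
print(words)
print(alpha, c)
```

With real hit counts, one would replace `synthetic_hits` with the numbers recorded for each variant. A fit that works for Google's counts but fails near 10,000 hits for AlltheWeb's would be one way to quantify the jump visible in Figure 1.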
Figure 1. The dropoff in frequency of onomatopoeia as their length increases. The Google data falls along straight lines while the AlltheWeb data has abrupt jumps at around 10,000 hits.

Figure 2. Each point represents the number of hits returned by the two search engines for a single word. The points that lie above the diagonal represent the words for which AlltheWeb returns more hits than Google.
Why would AlltheWeb report more hits than Google for high-frequency words? Google reports that it searches a slightly larger number of pages than AlltheWeb, so one would expect Google to return similar, if not greater, numbers of hits. Unfortunately, it is difficult to verify whether a search engine really finds 10,000 or more hits for a given term. Google will only show links for the top 1,000 results, while AlltheWeb will show 4,000 links.

Is this a general result that applies to all searches, or is it a quirk that applies only to onomatopoeia and nonsense words? If it is a general result, then for most searches that return 10,000 or more hits one would expect AlltheWeb to report more than Google, and for searches that return fewer one would expect both search engines to report about the same number. I searched for a few words to see if this happens, and I summarize the results in the table below. I got tired of doing this pretty fast (the onomatopoeia are easier to use), but the results seem to agree with the onomatopoeia results. I don't understand how there can be over 100,000 web sites that have the word "hermeneutics" in them.
word | Google hits | | AlltheWeb hits
---|---|---|---
dictionary | 12,600,000 | < | 18,501,861
u2 | 2,400,000 | > | 2,152,403
facsimile | 1,620,000 | < | 3,193,565
"George Bush" | 987,000 | < | 1,642,007
flake | 381,000 | < | 769,901
myrrh | 189,000 | < | 364,940
hermeneutics | 129,000 | < | 199,115
frob | 40,400 | > | 24,676
zipf | 35,000 | ~ | 35,085
flummox | 7,370 | < | 17,946
"Stephanie Forrest" | 3,100 | > | 1,944
ho chih minh | 2,000 | > | 1,315
bioflavenoids | 643 | < | 1,036
"epitaph for a peach" | 627 | > | 299
psdoom | 522 | > | 230
"Dennis Chao" | 270 | > | 151
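To make the comparison in the table a little less impressionistic, one could tally how often AlltheWeb reports more hits than Google on each side of the 10,000 mark. The sketch below does this with the numbers copied from the table. The function name `count_alltheweb_larger` is made up for illustration, and splitting the rows by Google's count (rather than AlltheWeb's) is an arbitrary choice.

```python
# (word, Google hits, AlltheWeb hits), copied from the table above.
ROWS = [
    ("dictionary", 12_600_000, 18_501_861),
    ("u2", 2_400_000, 2_152_403),
    ("facsimile", 1_620_000, 3_193_565),
    ('"George Bush"', 987_000, 1_642_007),
    ("flake", 381_000, 769_901),
    ("myrrh", 189_000, 364_940),
    ("hermeneutics", 129_000, 199_115),
    ("frob", 40_400, 24_676),
    ("zipf", 35_000, 35_085),
    ("flummox", 7_370, 17_946),
    ('"Stephanie Forrest"', 3_100, 1_944),
    ("ho chih minh", 2_000, 1_315),
    ("bioflavenoids", 643, 1_036),
    ('"epitaph for a peach"', 627, 299),
    ("psdoom", 522, 230),
    ('"Dennis Chao"', 270, 151),
]


def count_alltheweb_larger(rows, threshold=10_000):
    """Tally rows where AlltheWeb > Google, split at `threshold`.

    Rows are grouped by their Google hit count (an assumption).
    Returns two (alltheweb_larger, total) pairs: one for rows at or
    above the threshold and one for rows below it.
    """
    high = [r for r in rows if r[1] >= threshold]
    low = [r for r in rows if r[1] < threshold]

    def tally(group):
        larger = sum(1 for _, google, alltheweb in group if alltheweb > google)
        return larger, len(group)

    return tally(high), tally(low)


high, low = count_alltheweb_larger(ROWS)
print("Google >= 10,000: AlltheWeb larger in", high[0], "of", high[1])
print("Google <  10,000: AlltheWeb larger in", low[0], "of", low[1])
```

Note that the strict comparison counts "zipf" as an AlltheWeb win even though the table marks the two counts as roughly equal. With only 16 hand-picked words, the tally is suggestive at best.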
Is AlltheWeb overreporting the number of hits when that number is too large to verify? If it is, does it matter? I assume that there is no deliberate deception, but the algorithm AlltheWeb uses to estimate the number of hits when that number is over 10,000 is probably a little optimistic: its estimates neither match the number of Google hits nor line up with AlltheWeb's own results for less frequent search terms.