Google’s results plotted for repeated character strings

by jaymz on 14/02/2010

Don’t ask why, but out of interest I googled the string “AAAAAAAA” earlier. After looking at the millions of pages that came back and thinking “wtf”, I searched again, only with a much longer string. I was expecting the count to just keep going down, but at around 20 characters there was a significant jump in returned results. I scratched my beard and proclaimed this interesting (as you can probably tell, I have no distractions on Valentine’s Day). To skip over further bullshit, this is the graph of 3,328 searches: the number of results for every character (A-Z) repeated 1 to 128 times. Some of the peaks are interesting.

Why 128 and not something higher? Google won’t let you, at least via the query string. The raw data for this graph was generated by a simple Python script. If you aren’t coding in Python already, please do look into it; it’s jolly nice.

#!/usr/bin/python
 
from urllib import FancyURLopener
from BeautifulSoup import BeautifulSoup
import csv
import string 
 
BASE = "http://www.google.com/search?q="
 
class MozOpener(FancyURLopener):
    # Parentheses let the user-agent string continue across two lines.
    version = ('Mozilla/5.0 (Windows; U; Windows NT 5.1;'
               ' it; rv:1.8.1.11) Gecko/20071127 Firefox/2.0.0.11')
ff = MozOpener()
 
out = csv.writer(open('out.csv', 'w'))
 
headers = ['count',]
for l in string.ascii_uppercase:
    headers.append(l)
out.writerow(headers)
 
for x in range(1, 129): # lengths 1..128; 128 is the max length of chars allowed
    results = [x,]
    for l in string.ascii_uppercase:
        qry = l*x
        raw_data = ff.open(BASE+qry).read()
        soup = BeautifulSoup(raw_data)
        result = soup.find(id='resultStats').findAll('b')[2]
        result = int(result.contents[0].replace(',', ''))
        results.append(result)
        print [result, qry]
    out.writerow(results)
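
As a quick sanity check on the 3,328 figure, the full set of queries can be listed without touching the network. This is only a sketch written for Python 3 (the script above is Python 2), and `repeated_queries` and `MAX_LEN` are names made up for illustration:

```python
# Python 3 sketch: list every query without contacting Google.
import string

MAX_LEN = 128  # longest query the post reports Google accepting

def repeated_queries(max_len=MAX_LEN):
    """Yield (letter, length, query) for every letter A-Z and length 1..max_len."""
    for length in range(1, max_len + 1):  # + 1 because range() excludes its end
        for letter in string.ascii_uppercase:
            yield letter, length, letter * length

queries = list(repeated_queries())
print(len(queries))        # 26 letters x 128 lengths
print(queries[-1][2][:8])  # first 8 characters of the final query
```

The `max_len + 1` is the bit that is easy to get wrong, since `range()` stops one short of its end value.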

When that had finished I simply loaded it into OpenOffice Calc (3.2 is out, by the way) and plotted it with the result count on a logarithmic scale, so result #1 doesn’t just skew the thing into one boring L shape. The first thing I noticed was a very visible spike at length 100.
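
To see why the log scale matters, here is a rough Python 3 sketch using only counts quoted in this post and its comments (the two F figures are the approximate ones from the comment thread). The helper name `log_position` is made up for illustration:

```python
import math

# Result counts quoted in the post and its comments (the F values are approximate).
counts = {
    "WWW": 17090000000,
    "A": 12260000000,
    "F x 31": 400000,
    "F x 32": 45000,
}

def log_position(n):
    """Position of a count on a log10 axis."""
    return math.log10(n)

for label, n in counts.items():
    print(label, n, round(log_position(n), 2))

# On a linear axis the largest value dwarfs the smallest by this factor...
linear_ratio = counts["WWW"] / counts["F x 32"]
# ...while on a log10 axis the two sit only this many units apart.
log_span = log_position(counts["WWW"]) - log_position(counts["F x 32"])
print(linear_ratio, log_span)
```

On a linear axis the small counts would be squashed flat against the bottom; on the log axis the whole range fits in a handful of units.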

This isn’t so hard to imagine: 100 is a “nice” number, and it’s easy to picture someone using 100 of a character as a test input or just a “long” string. Every character string of length 100 exhibits this spike. The much bigger blue curve is for the letter X. This is used by children and adults alike to mark kisses, and everyone knows the more X’s the more someone loves you. If you look at many of the results for the 100 or 101 character X searches, they seem to be using it in this context. Could it be that the much bigger spike for the 101 character X string is simply because it’s 100 + 1 kisses? Towards the start of the graph there are a number of interesting spikes; I’ve marked some of them along with the length.

Some of these spikes are easy to explain. The biggest number of results returned for a single repeated-character phrase is a by-product of DNS: yep, it’s “WWW”. This accounts for the slightly higher result count than the simple “A”, with 17,090,000,000 pages returned versus 12,260,000,000. Another easy one is 6 F’s, the hex code for white. I am totally stumped by F’s later behaviour though: there is a spike at F-31 and F-33 but not F-32. There is a big jump for X-12 and X-34. International mobiles have 12 digits, as do UPC codes, but I feel like I’m clutching at straws :) trying to explain that. Down the other end of the graph, between 115 and 128, the characters P, H, A, M and O all have significant spikes for specific counts. For M and O, when you browse the first few pages of Google’s results, many of the pages are using them as part of exaggerated speech. It’s almost like the collective consciousness of the world has decided that 120 characters is just right to describe a particularly tasty dinner.
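
One way to put a number on oddities like F-32 is to compare each length with its two neighbours. The sketch below (Python 3) is just one possible measure, not something used for the graph: `neighbour_ratio` is a made-up helper, and the three F counts are the approximate figures from the comments below (~400k, ~45k, ~400k).

```python
import math

def neighbour_ratio(series, n):
    """Ratio of the count at length n to the geometric mean of its two neighbours."""
    baseline = math.sqrt(series[n - 1] * series[n + 1])
    return series[n] / baseline

# Approximate counts for repeated "F" taken from the comment thread.
f_counts = {31: 400000, 32: 45000, 33: 400000}

ratio = neighbour_ratio(f_counts, 32)
print(ratio)  # well below 1 means a dip, well above 1 means a spike
```

Run over a whole column of the CSV, a ratio far from 1 would flag the lengths worth eyeballing; where to draw the line is a judgment call.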

A spreadsheet with the data I gathered is available via Google Docs if you’d like to investigate yourself. I should note that the Python script above gets its data from .com; if you use the site itself to look up some searches, it will more than likely switch to your local domain. Depending on your cookies, Google may also perform extra filtering (e.g. safe search), so your numbers may not be exactly the same as mine. I’d be interested in any theories as to some of the more prominent spikes. The OpenOffice spreadsheet with the charts done is also available for download here.

If you are going to play around, save your fingers and sanity and use Python to create your test strings: just drop into a Python shell and use "X"*120, or run perl -e 'print "X"x100' from the command line.

jaymzcd@googlemail.com

There are 4 comments in this article:

  1. 15/02/2010 Bumholecheese says:

    purdy graphs

  2. 15/02/2010 Bumholecheese says:

    man that f thing is weird. Checked myself. ~400k results for f31 +f33, but only ~45k for f32.
    I’M going with the theory that people have a common sense of time, and when some tards enter fill in a form by holding down a single key they release at very similar times. I realise this is flawed in many ways, but it’s MY idea.
    Do you only get this result with fs?

  3. 15/02/2010 jaymz says:

    haha! :) i actually thought that yesterday with the M’s – there’s a big spike for that around 120 chars. I tried holding down the key for 2 seconds or so to see how often it would be the same length. My 10 attempts were fairly close but that suggested it should be clustered around there, when something like F jumps hugely in between the others it’s most weird :)

  4. 17/03/2010 lee says:

    I would have expected a spike for “XXX” ??? :-) I guess you had safe search on?