Archive of articles classified as "code"


Using stochastic processes to generate paint type effects in processing.org

15/06/2010
I very much like to play with processing.org when I’ve got some downtime. I wrote a small particle engine in the framework a while back, and this time I thought I’d try to do something a little more arty from the outset. My inspiration for this “sketch” was the many splatter paintings by Jackson Pollock. I can remember going to see some of his work at the Tate Modern in London quite a long time ago. I had always thought it a bit poor as “art” when I was a kid but later on I developed an appreciation for the format. If you have no idea what I’m talking about, this is an example:
Now my sketches don’t look much like this but I did capture some of that splattering chaotic effect that I wanted. The basic process is very simple: an instance of a class called Walker randomly makes its way across the canvas. As it moves, the length of each step vector is calculated. If the step is big enough it causes a burst of circles to be drawn, each with a slightly randomised position & varied alpha to give it some texture. To stop it becoming a little too busy I’ve limited it to a set palette at runtime, which can be regenerated as walkers are added.
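To make the process above a bit more concrete, here is a minimal sketch of it in Python rather than Processing. It only produces a list of circle descriptions instead of drawing anything, and every name and number in it (Walker's max_step, the threshold, the burst size, the palette) is an assumption for illustration, not something lifted from the actual sketch on github.

```python
import math
import random


class Walker:
    """A point that wanders randomly over a width x height canvas."""

    def __init__(self, width, height, max_step=20.0, rng=None):
        self.width = width
        self.height = height
        self.max_step = max_step  # assumed limit on movement per axis per step
        self.rng = rng or random.Random()
        self.x = self.rng.uniform(0, width)
        self.y = self.rng.uniform(0, height)

    def step(self):
        """Take one random step, stay on the canvas, return the distance moved."""
        old_x, old_y = self.x, self.y
        dx = self.rng.uniform(-self.max_step, self.max_step)
        dy = self.rng.uniform(-self.max_step, self.max_step)
        self.x = min(max(old_x + dx, 0.0), self.width)
        self.y = min(max(old_y + dy, 0.0), self.height)
        return math.hypot(self.x - old_x, self.y - old_y)


def burst(walker, colour, count=12, spread=8.0):
    """Describe `count` circles scattered around the walker's position."""
    rng = walker.rng
    circles = []
    for _ in range(count):
        circles.append({
            "x": walker.x + rng.uniform(-spread, spread),
            "y": walker.y + rng.uniform(-spread, spread),
            "radius": rng.uniform(1.0, 6.0),
            "colour": colour,
            "alpha": rng.randint(40, 200),  # varied alpha gives the texture
        })
    return circles


def run(steps=500, threshold=15.0, seed=None):
    """Walk for `steps` steps and collect the circles from every big step."""
    rng = random.Random(seed)
    palette = [(200, 30, 30), (30, 30, 200), (240, 200, 40)]  # fixed for the run
    walker = Walker(640, 480, rng=rng)
    colour = rng.choice(palette)  # this walker keeps one colour from the palette
    circles = []
    for _ in range(steps):
        if walker.step() >= threshold:
            circles.extend(burst(walker, colour))
    return circles
```

Calling run(steps=300, seed=42) gives back a list of dictionaries that a drawing loop could turn into ellipses. Only steps longer than the threshold produce a burst, which is what gives the splatter its uneven spacing.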

I was quite happy with the random scatter effect as different pools of colour made their way through the canvas space. I thought some basic boundary conditions would be a nice addition, so I wrote a quick polygon class and added an array of those. A few functions later and I could click around the screen defining new polygons. Using the well-known Jordan curve theorem we can tell if the walker is currently within an arbitrary polygon (without holes of course). Here’s an example video of it in action. Hearts seem pretty easy to draw compared to anything crazy complex and give a good idea of the effect:

I’ve added quite a few options to the code now, allowing me to switch the boundaries on & off as well as add an arbitrary number of polygons. If the vertices overlap then, because of the way the “point in polygon” algorithm works, each “contained” area is flagged as a solid.
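For anyone who hasn't met the “point in polygon” test before, the usual way to apply the Jordan curve theorem is the even-odd ray casting rule: shoot a ray from the point and count how many polygon edges it crosses. The sketch below is in Python and assumes that is the variant in play; the function name and the vertex format are made up for the example.

```python
def point_in_polygon(x, y, vertices):
    """Even-odd ray casting test.

    vertices is a list of (x, y) tuples in drawing order; the polygon is
    closed automatically from the last vertex back to the first.
    """
    inside = False
    n = len(vertices)
    for i in range(n):
        x1, y1 = vertices[i]
        x2, y2 = vertices[(i + 1) % n]
        # Only edges that straddle the horizontal line through y can be crossed
        if (y1 > y) != (y2 > y):
            # x coordinate where the edge meets that horizontal line
            cross_x = x1 + (y - y1) * (x2 - x1) / (y2 - y1)
            # Count the crossing if it lies to the right of the point
            if x < cross_x:
                inside = not inside
    return inside
```

With a square such as [(0, 0), (10, 0), (10, 10), (0, 10)] the point (5, 5) is inside and (15, 5) is not. For a self-crossing “bow tie” like [(0, 0), (10, 10), (10, 0), (0, 10)] both lobes count as inside, which is one way to read the “each contained area is solid” behaviour.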
Finally here is a nicely captured heart shape. It gives a nice idea of the kind of effect when combined with a boundary. I think playing around with the way the walkers are coded could be fun, for example doing a neighbour check and tweaking the randomised movement accordingly could create a weak flocking style system with Brownian motion driving it.
If you’re interested in checking the code out you can find it over on GitHub. It should run OK out of the box within your Processing environment.

Django Reference in android market now

28/05/2010

Very late last night, as I was enjoying “The Design & Evolution of C++” (an excellent book, even if you’re not a C++ dev), I took a short break, browsed around the Android Market and noticed there was no django app. I’ve recently picked up an HTC Desire and have been really enjoying developing on it. I’m currently working on an application for Crooked Tongues to allow people to post & comment on their sneakers (don’t ask…) and I thought this would be a quick app to do and get out there. Anyone starting out developing apps will be aware that one of the hardest things is just thinking of something that isn’t already crowded out by what’s already available.

Basically, all I wanted was a more dedicated browser for the django docs. The content there is excellent so I quickly knocked together an application that lets you jump to major sections, set a page zoom by default and switch version easily. All in all it only took a few hours of development and thanks to the approval procedure it’s already available to install.

It’s a glorified web view but it’s a starting point and it’s at least out there now. I’m getting a lot more used to the dev process on android. The total code for this app weighs in around 400 lines – around 80 of those being XML to define the layout and strings used. The main activity class then manipulates the XML-defined webview component to react to the (again XML) menu. My main onCreate method looks like this:

    @Override
    public void onCreate(Bundle savedInstanceState) {
        super.onCreate(savedInstanceState);
        setContentView(R.layout.main);

        // default docs version and base URL come from the string resources
        Resources res = getResources();
        version = res.getString(R.string.default_version);
        base_url = res.getString(R.string.url_base);

        // the WebView itself is declared in the XML layout
        webview = (WebView) findViewById(R.id.webview);
        webview.getSettings().setJavaScriptEnabled(true);
        webview.setWebViewClient(new DjangoWebClient());

        // attach the zoom controls to the content view, hidden to begin with
        final View zc = webview.getZoomControls();
        FrameLayout mContentView = (FrameLayout) getWindow().getDecorView().findViewById(android.R.id.content);
        mContentView.addView(zc, ZOOM_PARAMS);
        zc.setVisibility(View.GONE);

        setZoom();
        loadDocs();
    }

The remainder of the methods are basically just reacting to the menu choices and directing the webview url appropriately. All in all quite simple.


Google Chrome extension for TFL data

18/05/2010

I have been coding in Java for about a week now (when I’m not coding in Python of course) and I fancied a break from it. I’m not entirely sure why but I decided to take a look at coding Chrome extensions. These are in fact very, very easy to make. The tube status thing below took me just over an hour to get how I wanted. If you’re running Chrome you can get the installer here. I think I’ll do some more work on it to add customization. That was my initial motivation for doing this: the first one I installed from the current extension gallery just showed the little panel of statuses from the site, and I wanted to pick just a few lines & tweak the output.

In the end it was easier to just code my own. It makes use of the excellent REST API provided by Ben Dodson. Creating such an extension is trivial. You can make your life a bit easier if you just include jQuery from the Google AJAX APIs. Then it’s really just a collection of a few HTML files (in this case just the one, popup.html). All my HTML is doing is making an AJAX request for the JSON data from Ben’s API. That comes back and it simply renders out a few divs with the colors set and, based on the status, a unicode entity to add some fun to it.

Making extensions for Chrome really is a breeze and the documentation is plentiful, so hats off to Google yet again for making me happy to be an open source, linux loving developer. The extension is now in the Chrome gallery and available here.


Crooked Tongues TIA android app – Week 1

16/05/2010

I recently picked up an HTC Desire and have been loving android as a platform. I’m not entirely sure why I didn’t go for android to begin with (apart from doing my friend Spangsberg a favour by buying out his iPhone contract before he left the country). I dabbled a little with Objective-C but not owning a full-on Mac for dev’ing put me off spending significant time at home coding for the device. That, and it seemed like a lot of work to get going (probably because I use Linux almost all day so am not used to the Mac toolchain).

Working with android on linux has been a total pleasure. Mostly. I’m still getting re-used to the verbosity of Java compared to my day-to-day Python but the experience has been a good one so far. I’ve basically been learning by looking at the sample apps supplied, combined with a lot of googling and reading of android dev forums. There is a real wealth of documentation out there and there’s an absolute ton of general Java code to solve most problems. This is where I’m at about a week into it (spending a couple of hours every few days on it).

The HTTP posts are working now, as is uploading a camera shot. You can store a username & password and then it’ll do the form posts with that data. For now I have it just “logging in” to a dummy setup on my own server, and that side is just some PHP. The real backend runs Python (django), but for now it’s just easier to run that data through some PHP on my blog server. The data for the views comes via a JSON array; that data is served up via Python and is essentially just a REST interface to the TIA area. I’m mainly concentrating on learning my way around the SDK but already I’m seeing that there’s a lot of power in combining the android platform with django (and maybe django-piston on top) to create rather powerful connected mobile apps.


Building CSS sprites with Bash & Imagemagick

3/05/2010

I’ve been rebuilding the Crooked Tongues forum and one of the things we’ve been mindful of is trying to stick to some good practices with regard to optimizing the site for fast loading. One of the easy ways to decrease your page load times is to use CSS sprites. This is just stitching together all your common images into one big one and then using background positions & width/heights in CSS to offset and “crop” the big image. This way only one HTTP request is made to fetch them all.
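The offset arithmetic is the only fiddly part of the technique, so here is a small Python sketch of it. It assumes the images are stacked vertically in one column, and the function name, class name and image sizes are all invented for the example.

```python
def sprite_css(class_name, sprite_file, images):
    """Build CSS rules for a vertically stacked sprite.

    images is a list of (name, width, height) tuples, in the same order
    the images appear in the sprite from top to bottom.
    """
    rules = [
        ".%s {\n\tbackground: url('%s') no-repeat top left;\n"
        "\tdisplay: inline-block;\n}" % (class_name, sprite_file)
    ]
    offset = 0  # running total of the heights above the current image
    for name, width, height in images:
        rules.append(
            ".%s#%s {\n\tbackground-position: 0 -%dpx;\n"
            "\twidth: %dpx;\n\theight: %dpx;\n}"
            % (class_name, name, offset, width, height)
        )
        offset += height
    return "\n".join(rules)


if __name__ == "__main__":
    icons = [("bold", 16, 16), ("italic", 16, 20), ("link", 18, 16)]
    print(sprite_css("toolbar-sprite", "toolbar-sprite.gif", icons))
```

Each image is shifted up by the combined height of everything above it, and the width & height on the element crop the view down to just that one image.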

If you’ve got hundreds of small images it can really speed up your page. Prime candidates for sprites are typically your buttons, header images and, in the case of a forum, emoticons. Crooked uses lots of custom designed icons, and writing them out as individual images for the post toolbar really slowed down the page. There was no chance I was going to do it manually so I turned to a staple of my web development toolset – ImageMagick’s convert.

Convert can do a lot. I’ll show you two such ways it made my life easier and cut down on the time I had to spend doing grunt work. First, here’s how the toolbar looks:

I wrote some bash that uses convert’s append option to stitch a folder of images into one big, long image. It then loops over the images, using identify to pull the width & height info, and uses that to write the CSS file. (gist)

#!/bin/bash
 
# uses imagemagick to stitch together all images in a folder and
# then writes a css file with the correct offsets along with a
# test html page for verification that it's all good
 
if [ $# -gt 0 ]
then
 
    if [ $3 ]
    then
        ext="."$3; # the extension to iterate over for input files
    else
        ext=".gif"; # the extension to iterate over for input files
    fi
 
    name=$1; # output will be placed in a folder named this
 
    if [ $2 ]
    then
        classname=$2"-sprite";
    else
        classname=$1"-sprite";
    fi
    css="$name/$classname.css";
    html="$name/test.html";
 
    rm -fr $name;
    mkdir $name;
    touch $css $html;
 
    echo "Generating sprite file...";
    convert *$ext -append $name/$classname$ext;
    echo "Sprite complete! - Creating css & test output...";
 
    echo -e "<html>\n<head>\n\t<link rel=\"stylesheet\" href=\"`basename $css`\" />\n</head>\n<body>\n\t<h1>Sprite test page</h1>\n" >> $html
 
    echo -e ".$classname {\n\tbackground:url('$classname$ext') no-repeat top left; display:inline-block;\n}" >> $css;
    counter=0;
    offset=0;
    for file in *$ext
    do
        width=`identify -format "%[fx:w]" "$file"`;
        height=`identify -format "%[fx:h]" "$file"`;
        idname=`basename "$file" $ext`;
        clean=${idname// /-}
        echo ".$classname#$clean {" >> $css;
        echo -e "\tbackground-position:0 -${offset}px;" >> $css;
        echo -e "\twidth: ${width}px;" >> $css;
        echo -e "\theight: ${height}px;\n}" >> $css;
 
        echo -e "<a href=\"#\" class=\"$classname\" id=\"$clean\"></a>\n" >> $html;
 
        let offset+=$height;
        let counter+=1;
        echo -e "\t#$counter done";
    done
 
    echo -e "<h2>Full sprite:</h2>\n<img src=\"$classname$ext\" />" >> $html;
    echo -e "</body>\n</html>" >> $html;
 
    echo -e "\nComplete! - $counter sprites created, css written & test page output. ~jaymz";
 
else
 
    echo -e "There should be at least 1 argument!\n\tbuildSprite.sh output_folder classname input_extension"
 
fi

The script outputs the names incrementing by one; I’ve gone over them myself and turned them into “nice” names. All in all it’s taken maybe 10 minutes to convert it all into a sprite and a nice CSS file. If you were using a primary key field like 1, 2, 3… etc then you could probably fudge it to not even have to name the ids “nicely”.

ImageMagick’s convert can also do some modification of your files. It can, for example, convert your image to a black & white copy. It’s not a great deal of work to then loop over a set of images, create b&w versions in a temporary folder and then stitch the two versions together to get a color→b/w sprite for hovers. I’ve actually seen people do this manually and spend an age at it. With this method you output them once, then sit back for the 30 seconds or so whilst the computer does the work.

mkdir -p bw sprites
for file in *.jpg; do convert -colorspace Gray "$file" "bw/$file"; done
for file in *.jpg; do convert "$file" "bw/$file" -append "sprites/$file"; done

It’s times like this that I am reminded of just how much time you can save with a little shell script and the right command line tools. Not everything needs point & click.


Image color analysis for your ecom site

22/04/2010

Recently we’ve been working rather hard on a new look and complete re-launch of the underlying code for the whole of crookedtongues.com. We’re a python shop these days so it’s running on django with a host of apps and in the region of 100,000 lines of code.

One of the things that I was tasked with was the data migration and import, and I’ll write about that post launch. What I fancy getting out now is how I’ve worked on adding color codes automagically to the store’s entire product database. More or less, anyway: it’s a little rough round the edges but gives fairly good results on a lot of our products. First off you can find the source code in a raw (read that as ‘I’m still working on it personally so take it as it comes’) way over at github. As a taste of what it does, here’s a couple of existing product shots:

It works quite well for colors that fall near the center of a segment of the color wheel. For my analysis I’ve segmented the wheel into 30° slices, so there are 12 bands of color. I’ve found that yellows & light greens are the trickiest to pull out in a way that matches how the eye perceives the result. By that I mean what looks very much yellow can get labelled as a bright orange instead, because values close to the segment boundary (45°) tip the count towards the orange side.
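To show how sharp that 45° boundary is, here is a short Python 3 sketch of the banding on its own. It uses the same 12 names and the same half-slice shift as the full script further down, so each band is centered on a multiple of 30°. The hue_band helper is just for illustration and isn’t part of the script.

```python
import colorsys

COLOR = ['RED', 'ORANGE', 'YELLOW',
         'LIME', 'GREEN', 'TURQUOISE',
         'CYAN', 'OCEAN', 'BLUE',
         'VIOLET', 'MAGENTA', 'RASPBERRY']
BIN_WIDTH = 30  # degrees per hue band


def hue_band(r, g, b):
    """Return (band name, hue in degrees) for an 8-bit RGB colour."""
    h, _s, _v = colorsys.rgb_to_hsv(r / 255.0, g / 255.0, b / 255.0)
    hue = h * 360
    # Shift by half a band so each band is centred on a multiple of 30 degrees
    index = int((hue + BIN_WIDTH / 2) // BIN_WIDTH) % len(COLOR)
    return COLOR[index], hue


if __name__ == "__main__":
    for rgb in [(255, 255, 0), (255, 192, 0), (255, 190, 0)]:
        name, hue = hue_band(*rgb)
        print(rgb, name, round(hue, 1))
```

Pure yellow sits at 60°, but the two golden yellows in the example differ by only two steps of green and still land either side of the 45° line, one labelled YELLOW and the other ORANGE.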

The basic process is working well for products with one dominating color. For multi-color shots it can pull out the relative levels of each tone’s hue band, but it blindly chooses the largest, however slight the lead, and uses that for its naming output. A better way would be to do some further analysis on the collected results for each image. One way to gain some confidence that the main hue really is the main color would be to compare its count, in standard deviations, against the average across the sample set. A small σ would indicate a more uniform mix of colors, and as we have already sampled out low saturation points this would indicate a strongly multicolored image. Taking the set of hue/tones within the 90th percentile and then comparing the relative deviations of their counts would allow identifying (and quantifying) the number of colors, so it would be possible to say “mainly dark blue with a little light red”.
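As a rough illustration of that confidence check, the Python 3 sketch below measures how many standard deviations the largest count sits above the mean of all the counts. This is only one possible reading of the idea, it isn’t in the script, and the dominance name and the sample numbers are invented.

```python
import statistics


def dominance(counts):
    """Return (index of the largest count, its distance above the mean in sigmas)."""
    mean = statistics.mean(counts)
    sigma = statistics.pstdev(counts)  # population standard deviation
    top = max(range(len(counts)), key=counts.__getitem__)
    if sigma == 0:
        # every band has the same count, so nothing stands out
        return top, 0.0
    return top, (counts[top] - mean) / sigma


if __name__ == "__main__":
    one_colour = [900, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10]
    multicolour = [100, 96, 104, 98, 102, 100, 97, 103, 99, 101, 100, 100]
    print(dominance(one_colour))
    print(dominance(multicolour))
```

A product shot with one strong color gives a large score for its top band, while an evenly mixed image gives a small one. Where to put the cut-off between the two would need tuning against real product shots.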

If two bands had similar counts it is also more likely that the color in question is not two separate, closely aligned colors but in fact a single color at the midpoint (or thereabouts) of them. A scheme to compare these before the final decision would no doubt improve the detection of yellow, which seems to constantly tip into the orange side of the wheel. All in all working on this has been a decidedly pleasant break from data imports, javascript and enough MVC style code to do me a lifetime.
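One way such a comparison might look is sketched below in Python 3. When the two biggest bands are neighbours on the wheel and close in count, it returns a hue between their centers, weighted by count. The resolve_hue name and the 0.8 similarity cut-off are assumptions picked for the example and would need tuning.

```python
def resolve_hue(counts, bin_width=30, similarity=0.8):
    """Estimate the main hue angle in degrees from per-band counts.

    Band i is assumed to be centred on i * bin_width degrees. If the two
    largest bands are adjacent and similar in size, the result is a
    count-weighted point between their centres.
    """
    n = len(counts)
    order = sorted(range(n), key=lambda i: counts[i], reverse=True)
    first, second = order[0], order[1]
    adjacent = (first - second) % n in (1, n - 1)
    similar = counts[first] > 0 and counts[second] >= similarity * counts[first]
    if not (adjacent and similar):
        return float(first * bin_width)
    # fraction of the way from the first band's centre towards the second's
    shift = bin_width * counts[second] / (counts[first] + counts[second])
    if second == (first + 1) % n:
        angle = first * bin_width + shift
    else:
        angle = first * bin_width - shift
    return angle % 360
```

With 100 samples in the orange band and 95 in yellow this gives a hue a little under 45°, which could then be labelled as a golden yellow rather than forced into one band or the other.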

Since I wrote the above I’ve tweaked it somewhat to pull out gray (and black/white) counts also. It works out the rough percentage of the samples each corresponds to and, if that is over a certain limit for black or white, will output a value for that also (a lot of the product shots are on a white background so I need to be more granular with that than with regular colors).

#!/usr/bin/python
# -*- coding: utf-8 -*-
 
# A color analysis script to help you label your store's products with color data
# automagically. It will take either a single file or scour an entire folder for
# folders of images and do each one individually printing a summary of what it
# thinks is the correct color value. A work in progress...
#
# ~jaymz | @jaymzcampbell | jaymz.eu
#
#
# MIT Licensed for what it's worth, copy: http://www.opensource.org/licenses/mit-license.php
 
import Image
import ImageFilter
import os
import glob
import sys
import colorsys
import re
from copy import copy
from operator import itemgetter
from decimal import Decimal
 
output = open('colors.csv', 'w')
 
# Pixels will be first compared to these values before being
# added to the data list of color information on the first pass
LBOUND = 0
UBOUND = 255
 
MIN_SATURATION = 30 # avoid washed out pixels influencing counts
 
# Base folder for the processFolder function, it'll iterate over here on subfolders
FOLDER = '/home/jaymz/documents/crooked-docs/data-export/store-migration/product-images/'
 
# Meh, i need to flip between these two, you can probably tweak this 
SUMMARY_FORMAT, SQL_FORMAT = True, True
 
# Names based off: http://bluelobsterart.com/wordpress/wp-content/uploads/2009/03/rgb-color-wheel-lg.jpg
COLOR = ['RED', 'ORANGE', 'YELLOW',
    'LIME', 'GREEN', 'TURQUOISE',
    'CYAN', 'OCEAN', 'BLUE',
    'VIOLET', 'MAGENTA', 'RASPBERRY',
    ]
TONE = ['DARK', '', 'BRIGHT']
 
# via the createColorSQL.py file , addition added in GRAY/BLACK/WHITE to after this
SQL_IDS = {'DARK YELLOW': 7, 'DARK ORANGE': 4, 'BRIGHT GREEN': 15, 'BRIGHT ORANGE': 6, 'DARK RED': 1, 'BRIGHT OCEAN': 24, 'BRIGHT RED': 3, 'DARK OCEAN': 22, 'YELLOW': 8, 'OCEAN': 23, 'BRIGHT YELLOW': 9, 'RASPBERRY': 35, 'GREEN': 14, 'BRIGHT TURQUOISE': 18, 'CYAN': 20, 'MAGENTA': 32, 'RED': 2, 'ORANGE': 5, 'BLUE': 26, 'TURQUOISE': 17, 'LIME': 11, 'BRIGHT LIME': 12, 'DARK MAGENTA': 31, 'DARK LIME': 10, 'BRIGHT MAGENTA': 33, 'BRIGHT VIOLET': 30, 'DARK VIOLET': 28, 'DARK BLUE': 25, 'BRIGHT BLUE': 27, 'VIOLET': 29, 'BRIGHT RASPBERRY': 36, 'DARK TURQUOISE': 16, 'DARK CYAN': 19, 'BRIGHT CYAN': 21, 'DARK GREEN': 13, 'DARK RASPBERRY': 34}
 
pcnt = 0
 
def trimFloat(val, places=2):
    return float(repr(val)[0:places+2])
 
def withinBounds(allowance, _rgb):
    rgb = copy(_rgb)
    diff = 0
    allowance = Decimal(repr(allowance))
    for c in rgb:
        for d in rgb:
            dec_d = Decimal(repr(d)).quantize(allowance)
            dec_c = Decimal(repr(c)).quantize(allowance)
 
            diff = abs(dec_d-dec_c)
 
            if (d != c) and diff>allowance:
                return False
    return True
 
def processImage(i, name=None):
  """ Scales down the image, blurs it to ease the blending of the color values
and reduce spikes from anomalies. It then samples pixels creating a list of
colors. This list is then looped over to build counts which are placed into
bins of 30° hues separated into three based on their value. Pixels less than
a certain saturation are discarded. """
 
  global pcnt
 
  i = i.resize((200,200))
  i = i.convert("RGB")
  i = i.filter(ImageFilter.BLUR)
  d = i.getdata()
  cnt = 0
 
  h = [] #holds the hsv info
  grays = [] #holds just gray content
  black_count = 0
  white_count = 0
  total_samples = 0
 
  for p in d:
      cnt = cnt + 1
      if cnt == 8: #take every 8th pixel
        if p[0]>LBOUND and p[1]>LBOUND and p[2]>LBOUND and p[0]<UBOUND and p[1]<UBOUND and p[2]<UBOUND:
            r = trimFloat(float(p[0])/255)
            g = trimFloat(float(p[1])/255)
            b = trimFloat(float(p[2])/255)
 
            if not withinBounds(0.02, (r,g,b)):
                h.append(colorsys.rgb_to_hsv(r,g,b))
            else:
                if (r+g+b)/3>0.94:
                    white_count += 1
                elif (r+g+b)/3<0.3:
                    black_count += 1
                else:
                    grays.append(colorsys.rgb_to_hsv(r,g,b))
            total_samples += 1
        cnt = 0 #reset sample counter
 
  h.sort()
  grays.sort()
  bin_width = 30 # size of hue slices (degrees)
  max_bin = 360
 
  darks = [0] * int(max_bin/bin_width)
  mids = [0] * int(max_bin/bin_width)
  lites = [0] * int(max_bin/bin_width)
 
  for p in h[::]:
      hue = p[0]*360
      sat = p[1]*100
      val = p[2]*100
      if sat >= MIN_SATURATION:
        bin_number = ((int(hue)+15)/bin_width)%(max_bin/bin_width)
        if val<33:
            darks[bin_number] += 1
        elif val < 66:
            mids[bin_number] += 1
        else:
            lites[bin_number] += 1
        #print "HUE BIN: %s VALUE : %d" % (int(hue)/bin_width, int(hue))
 
  c = 0
  data = zip(darks, mids, lites)
 
  if SUMMARY_FORMAT:
    for x in data:
        print '%d %s : %s %d°' % (c, COLOR[c], x, c*bin_width)
        c += 1
 
  # the following area needs a rework. the index technique works alright as long
  # as counts and values dont all match up, then it starts picking the first one
  # so this needs re-writing to better order the list data
 
  darks_sort, mids_sort, lites_sort = darks[::], mids[::], lites[::]
  darks_sort.sort()
  mids_sort.sort()
  lites_sort.sort()
 
  sorted_counts = (darks_sort, mids_sort, lites_sort)
 
  primary_idx = (darks.index(sorted_counts[0][-1]), mids.index(sorted_counts[1][-1]), lites.index(sorted_counts[2][-1]))
  primary_cnts = (darks[primary_idx[0]], mids[primary_idx[1]], lites[primary_idx[2]])
  tone = primary_cnts.index(max(primary_cnts))
  max_hbin = primary_idx[tone]
 
  pcnt += 1
 
  if SUMMARY_FORMAT:
    print "\nDominant Hue: %s %s" % (TONE[tone], COLOR[max_hbin])
 
  if SQL_FORMAT and name and max(primary_cnts) > 30:
    output.write('%d, %s, %s\n' % (pcnt, name, SQL_IDS[' '.join([TONE[tone], COLOR[max_hbin]]).strip()]))
 
  sorted_counts[0][-1], sorted_counts[1][-1], sorted_counts[2][-1] = (0, 0, 0) # kind of reset the primary to null
  for l in sorted_counts:
      l.sort()
 
  primary_idx = (darks.index(sorted_counts[0][-1]), mids.index(sorted_counts[1][-1]), lites.index(sorted_counts[2][-1]))
  primary_cnts = (darks[primary_idx[0]], mids[primary_idx[1]], lites[primary_idx[2]])
  tone = primary_cnts.index(max(primary_cnts))
  max_hbin = primary_idx[tone]
 
  if SUMMARY_FORMAT:
    print "Secondary Hue: %s %s" % (TONE[tone], COLOR[max_hbin])
 
  if SQL_FORMAT and name and max(primary_cnts) > 30:
    pcnt += 1
    output.write('%d, %s, %s\n' % (pcnt, name, SQL_IDS[' '.join([TONE[tone], COLOR[max_hbin]]).strip()]))
 
  # area to rewrite ends...
 
  gray_total = [(g[0]+g[1]+g[2])/3 for g in grays]
  # guard against images with no gray samples at all
  gray_average = sum(gray_total)/len(gray_total) if gray_total else 0
 
  black_percent = black_count/float(total_samples)*100
  gray_percent = len(gray_total)/float(total_samples)*100
  white_percent = white_count/float(total_samples)*100
 
  if SUMMARY_FORMAT:
    print "\nAverage Gray: %s (samples: %0.1f%%), White count: %s (%0.1f%%), Black count: %s (%0.1f%%)" % (gray_average, gray_percent, white_count, white_percent, black_count, black_percent)
    print "Total samples taken: %s\n\n" % total_samples
 
  if SQL_FORMAT:
    if black_percent > 10:
        pcnt += 1
        output.write('%d, %s, %d\n' % (pcnt, name, 38))
    if gray_percent > 10:
        pcnt += 1
        output.write('%d, %s, %d\n' % (pcnt, name, 37))
    if white_percent > 30:
        pcnt += 1
        output.write('%d, %s, %d\n' % (pcnt, name, 39))
 
# Helper functions follow along with __main__ def
 
def processFolder(folder):
    for image_folder in glob.glob(folder+'*'):
        try:
            folder_images = []
            for image in os.listdir(image_folder):
                if "jpg" in image and "._" not in image:
                    folder_images.append(image)
            folder_images.sort()
            j = os.path.join(image_folder, folder_images[1])
            if SUMMARY_FORMAT:
                print "working: "+j
            i = Image.open(j)
            processImage(i, image_folder.split('/')[-1])
        except:
            pass
 
def processFile(_file):
    i = Image.open(_file)
    processImage(i)
 
if __name__ == "__main__":
    try:
        if 'product-images' not in sys.argv[1]:
            processFile('product-images/'+sys.argv[1])
        else:
            processFile(sys.argv[1])
    except IndexError:
        processFolder(FOLDER)
    output.close()

Google’s results plotted for repeated character strings

14/02/2010

Don’t ask why but out of interest I googled for the string “AAAAAAAA” earlier and, after looking at the millions of pages that came back and thinking “wtf”, I searched again only making the string much longer. I was expecting the count to just keep going down but at around 20 characters there was a significant jump in returned results. I scratched my beard and proclaimed this interesting (as you can probably tell I have no distractions on Valentine’s Day). To skip over further bullshit, this is the graph of 3,328 searches – that is, the number of results for every character (A-Z) repeated one to 128 times. Some of the peaks are interesting.

Why 128 and not something higher? Google won’t let you, at least via the query string. The raw data for this graph was generated by a simple python script. If you aren’t coding in python already, please do look into it, it’s jolly nice.

#!/usr/bin/python
 
from urllib import FancyURLopener
from BeautifulSoup import BeautifulSoup
import csv
import string 
 
BASE = "http://www.google.com/search?q="
 
class MozOpener(FancyURLopener):
    version = ('Mozilla/5.0 (Windows; U; Windows NT 5.1;'
               ' it; rv:1.8.1.11) Gecko/20071127 Firefox/2.0.0.11')
ff = MozOpener()
 
out = csv.writer(open('out.csv', 'w'))
 
headers = ['count',]
for l in string.ascii_uppercase:
    headers.append(l)
out.writerow(headers)
 
for x in range(1, 129): # lengths 1 to 128 inclusive, 128 is the max length of chars allowed
    results = [x,]
    for l in string.ascii_uppercase:
        qry = l*x
        raw_data = ff.open(BASE+qry).read()
        soup = BeautifulSoup(raw_data)
        result = soup.find(id='resultStats').findAll('b')[2]
        result = int(result.contents[0].replace(',', ''))
        results.append(result)
        print [result, qry]
    out.writerow(results)

When that had finished I simply loaded it up into OpenOffice calc (3.2 is out by the way) and plotted it with the result count set to a logarithmic scale so result #1 doesn’t just skew the thing into one boring L shape. The first thing I noticed was a very visible spiking at length 100.

This isn’t so hard to explain: 100 is a “nice” number. It’s not hard to imagine someone using 100x a character as a test input or just a “long” string. Every character string of length 100 exhibits this spike. The much bigger blue curve is for the letter x. This is used by children and adults alike to mark kisses, and everyone knows the more x’s the more someone loves you. If you look at many of the results for 100 or 101 character X searches it seems to be when people are using it in this context. Could it be that the much bigger spike for the 101 character string X is simply because it’s 100 + 1 kisses?? Towards the start of the graph there are a number of interesting spikes; I’ve marked some of them along with the length.

Some of these spikes are easy to explain. The biggest number of results returned for a single repeated character phrase is a by-product of DNS: yep, it’s “WWW”. This accounts for the slightly higher result count than the simple “A”, with 17,090,000,000 pages returned versus 12,260,000,000. Another easy one is 6 F’s, the hex code for white. I am totally stumped by F’s later behaviour though: there is a spike at F-31 and F-33 but not F-32. There is a big jump for X-12 and X-34. International mobile numbers have 12 digits, as do UPC codes, but I feel like I’m clutching at straws :) trying to explain that. Down the other end of the graph, between 115 and 128, the characters P, H, A, M and O all have significant spikes for specific counts. For M & O, when you browse the first few pages of google’s results many of the pages are using them as part of exaggerated speech. It’s almost like the collective consciousness of the world has decided that 120 characters is just right to describe a particularly tasty dinner.

A spreadsheet with the data I gathered is available via google docs if you’d like to investigate yourself. I should note that the python script above gets its data from .com; if you are using the site to look up some searches it will more than likely switch to your local domain. Depending on your cookies google may also perform extra filtering (e.g. safe search) so your numbers may not be exactly the same as mine. I’d be interested in any theories as to some of the more prominent spikes. The OpenOffice spreadsheet with the charts done is also available for download here.

If you are going to play around then save your fingers & sanity and use python to create your test strings: just drop into a python shell and use "X"*120 etc, or perl -e 'print "X"x100' from the command line.


Importing existing visitor stats from Google Analytics to Piwik

13/02/2010

Recently at work we had to aggregate a lot of google analytics accounts and do some tracking and custom reporting. We found that it wasn’t quite as straightforward as we thought, one of the problems we had was getting multiple tracking codes to work on the same page. It was no real surprise that this wouldn’t work easily because google themselves have this to say:

Installing multiple instances of the Google Analytics Tracking code on a single web page is not a supported implementation. We suggest you remove all but one instance, and make sure you have the code from the correct profile installed on every page you would like to track.

With a lot of searching and reading I did find a number of scattered blog posts that suggested it was possible. But no matter what I tried I couldn’t get it tracking data into multiple accounts from the same page. I got fed up and was tasked with looking at alternative tracking solutions. That’s when I came across the very handy list at Wikipedia and from there Piwik. Piwik is a GPL-licensed PHP application which aims to mimic the functionality of google analytics. You install it on your server and then add sites to it much like GA. It then gives you a tracking code to install on your site(s).

As it was so similar to GA but locally run, I played about with it and decided to go with that, working on the assumption that I could at least get at the db tables and source in the future. I set up a number of sites to track and then installed the codes on each page. One acted as a “master” code which was on all pages whilst others only appeared when the domain matched a certain string. I left it for an hour and came back to find the data all populated in each account as I expected. I was jolly pleased.

When it was shown to the intended user they really loved it. The only thing they wanted to sweeten the whole experience even more was to have the visitor count data for the prior month loaded in from Google Analytics. I wasn’t over the moon with this as exporting your data from GA isn’t very straightforward and certainly not in a simple “just dump and drop it in” way either. I had a quick look at what I could export from GA and said I could import the unique hits per day fairly easily but tying that to specific browsers or page titles etc would be quite a lot of work.

I sort of half expected that would result in an “ah, ok, let’s forget about it” conversation, with the end agreement being to use GA for data prior to the switchover and Piwik for reports since then. But nope, they still wanted to have just the hit counts loaded in, regardless of whether they were attached to page titles or a user’s tech setup. Looking at the piwik developers zone, the post about an API to push data in was initially promising, but it was focusing on apache logs and most recently a user said the timestamps weren’t coming through. So I looked at another way to get it done. What follows is how I personally went about loading data in. You may find it useful if you end up migrating yourself.

To begin with I installed piwik afresh and dumped out the database. Then I set up just one site and let it record a single hit. Then I went through and compared each table with the previous state. This told me there were 2 places I needed to put data in if I wanted it logged and being used in piwik.
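A rough way to automate that before/after comparison, as a sketch only: the function names are made up for illustration, and it assumes both dumps come from mysqldump with each INSERT statement on its own line. It simply reports which tables have different INSERT lines between the two dumps.

```python
import re
from collections import defaultdict

# Matches the table name at the start of a mysqldump INSERT line.
INSERT_RE = re.compile(r"^INSERT INTO `?(\w+)`?")


def inserts_by_table(dump_text):
    """Group the INSERT lines of a dump by table name."""
    tables = defaultdict(set)
    for line in dump_text.splitlines():
        match = INSERT_RE.match(line)
        if match:
            tables[match.group(1)].add(line)
    return tables


def changed_tables(before_text, after_text):
    """Return the sorted names of tables whose INSERT lines differ."""
    before = inserts_by_table(before_text)
    after = inserts_by_table(after_text)
    names = set(before) | set(after)
    return sorted(t for t in names if before.get(t) != after.get(t))
```

Calling `changed_tables(open("before.sql").read(), open("after.sql").read())` (file names hypothetical) gives the list of tables worth inspecting by hand.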

_log_visit & _log_link_visit_action are the two key tables that receive data on each click. The link_visit_action table ties a particular visit to a particular page. I wasn’t going to be doing this per page, so in log_action I added 2 new rows: 1 for the url of my “Google Analytics Dummy page” and another for its title. Then I noted the ids of each of those rows.

Confident that this would be all I required, I went over to GA to export my data. I clicked through to visitors and set the date range to the period required. Then I clicked export and chose CSV. You will note that the actual data is not by time but instead aggregated for each day. This means that at most this is just going to allow someone to see the total hits per day, with no further drilldown by hour etc. I made that clear to the client and they were again (worse luck!) happy with that level of reporting.

The first thing I needed to do was clean up this data. GA exports it out with day & month names along with some other cruft that wasn’t required. There would be a myriad of ways to sort this out and I chose to use the unix stalwart, sed. The code that follows I saved and chmod’d and then ran on my GA csv file.

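#!/bin/sed -nrf
# Strip the leading quote and weekday name ("Tuesday, " etc.)
s/.*day, //
# Swap month names for their two-digit numbers
s/January/01,/
s/February/02,/
s/March/03,/
s/April/04,/
s/May/05,/
s/June/06,/
s/July/07,/
s/August/08,/
s/September/09,/
s/October/10,/
s/November/11,/
s/December/12,/
# Throw away the first 10 lines of the GA export (header cruft)
1,10d
# Drop the first remaining double quote (the one closing the date field)
s/\"//
# "MM, D, YYYY,hits" becomes YYYY-MM-D 00:00:00,"hits"
s/([0-9]{1,2}), ([0-9]{1,2}), ([0-9]{4}),(.+)/\3-\1-\2 00:00:00,"\4"/
# Print only the lines that now start with a digit
/^[0-9]/p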

You can download that here.

That should take

"Tuesday, February 2, 2010",21
"Wednesday, February 3, 2010",28
"Thursday, February 4, 2010",26
"Friday, February 5, 2010",29
"Saturday, February 6, 2010",23
"Sunday, February 7, 2010",27
"Monday, February 8, 2010",16
"Tuesday, February 9, 2010",28
"Wednesday, February 10, 2010",25
"Thursday, February 11, 2010",11

and turn it into:

2010-02-2 00:00:00,"21"
2010-02-3 00:00:00,"28"
2010-02-4 00:00:00,"26"
2010-02-5 00:00:00,"29"
2010-02-6 00:00:00,"23"
2010-02-7 00:00:00,"27"
2010-02-8 00:00:00,"16"
2010-02-9 00:00:00,"28"
2010-02-10 00:00:00,"25"
2010-02-11 00:00:00,"11"
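If sed isn’t to hand, the same clean-up can be sketched in Python. This is an alternative to the sed script rather than what was used here: the function name is made up, and it assumes the dates are in the English “Weekday, Month D, YYYY” form shown above and that the interpreter runs with an English locale. Rows that don’t parse as a date (headers and other cruft) are skipped.

```python
import csv
from datetime import datetime


def clean_ga_rows(lines):
    """Yield (timestamp, hits) pairs from the lines of a GA daily export."""
    for row in csv.reader(lines):
        if len(row) < 2:
            continue  # blank or single-column cruft
        try:
            day = datetime.strptime(row[0], "%A, %B %d, %Y")
        except ValueError:
            continue  # header or summary line, not a daily row
        # GA may write large counts with thousands separators, e.g. "1,234"
        hits = int(row[1].replace(",", ""))
        yield day.strftime("%Y-%m-%d 00:00:00"), hits
```

Unlike the sed output above, this version zero-pads the day (2010-02-02 rather than 2010-02-2).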

Now the sed output is still not the most ISO-formatted CSV in the world (the day isn’t zero-padded) but it will do for what we need. I obviously can’t just load that into Piwik, so I then used it as input to a Python script that simply magics up the rest of the needed info for the two Piwik tables. As the visitor stats from GA are the uniques (again, I made that clear to the client before starting all this), I just create an md5 hash from the current time as I make my way through the counts. That way Piwik records them as unique visitors. If the cookie hashes are the same in the db, Piwik will consider it a returning visitor even if the return flag is 0.

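import csv
import sys
from datetime import datetime
from hashlib import md5


def main(argv):
    # argv[0]: cleaned GA csv file
    # argv[1]: first row id for the log_visit rows (optional, default 1)
    # argv[2]: first row id for the log_link_visit_action rows
    #          (optional, defaults to the log_visit start id)
    if not argv:
        sys.exit("usage: script.py cleaned.csv [visit_start_id] [action_start_id]")
    COL_ID = int(argv[1]) if len(argv) > 1 else 1
    LVA_ID = int(argv[2]) if len(argv) > 2 else COL_ID
    SITE_ID = 1
    LOCAL_TIME = "00:00:00"
    VISITOR_RETURNING = 0
    # ids of the two dummy rows added to log_action
    GA_ACTION_URL = 1
    GA_ACTION_NAME = 1
    TOTAL_ACTIONS = 1
    VISIT_TOTAL_TIME = 10
    GOAL_CONVERT = 0
    REFERER_TYPE = 1
    CONFIG_OS = "GA"
    CONFIG_BROWSER = "GA"
    CONFIG_B_VER = "0.1"
    CONFIG_RES = "1600x1200"
    CONFIG_MD5 = md5(
        (CONFIG_BROWSER + CONFIG_OS + CONFIG_RES).encode("utf-8")).hexdigest()
    LOCATION_IP = "168450866"
    BROWSER_LANG = "en-gb"
    LOCATION_COUNTRY = "gb"
    LOCATION_CONTINENT = "eur"
    LOCATION_PROVIDER = "GA Import"
    with open(argv[0], newline="") as src, \
            open("log_visit-" + argv[0], "w", newline="") as visits, \
            open("log_vaction-" + argv[0], "w", newline="") as actions:
        # Plain "\n" line endings, to match LINES TERMINATED BY '\n' on load
        piwik_output = [csv.writer(visits, lineterminator="\n"),
                        csv.writer(actions, lineterminator="\n")]
        for row in csv.reader(src):
            hits = int(row[1].replace(",", ""))
            # The cleaned csv already carries the time; only add it if missing
            if row[0].endswith(LOCAL_TIME):
                action_time = row[0]
            else:
                action_time = row[0] + " " + LOCAL_TIME
            for _ in range(hits):
                # Hash the current time; the row id is mixed in so two rows
                # written within the same clock tick still get different cookies
                seed = "%s-%d" % (datetime.now(), COL_ID)
                cookie = md5(seed.encode("utf-8")).hexdigest()
                output = [COL_ID, SITE_ID, action_time, cookie,
                    VISITOR_RETURNING, action_time, action_time, row[0],
                    GA_ACTION_URL, GA_ACTION_NAME, TOTAL_ACTIONS,
                    VISIT_TOTAL_TIME, GOAL_CONVERT, REFERER_TYPE, None,
                    None, None, CONFIG_MD5, CONFIG_OS, CONFIG_BROWSER,
                    CONFIG_B_VER, CONFIG_RES, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1,
                    LOCATION_IP, BROWSER_LANG, LOCATION_COUNTRY,
                    LOCATION_CONTINENT, LOCATION_PROVIDER
                ]
                piwik_output[0].writerow(output)
                output = [LVA_ID, COL_ID, GA_ACTION_URL, 0, GA_ACTION_NAME, 0]
                piwik_output[1].writerow(output)
                COL_ID += 1
                LVA_ID += 1


if __name__ == "__main__":
    main(sys.argv[1:])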

You can download the script here.
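Before loading anything, it may be worth checking that the generated visit rows add up to the GA totals. The helper names below are made up for illustration, and the sketch assumes the cleaned file has the timestamp in the first column and the hit count in the second, as in the sample above.

```python
import csv


def expected_visits(cleaned_lines):
    """Sum the hit counts in the cleaned GA csv (timestamp,"hits")."""
    return sum(int(row[1].replace(",", ""))
               for row in csv.reader(cleaned_lines) if row)


def generated_visits(log_visit_lines):
    """Count the non-empty rows written for the log_visit table."""
    return sum(1 for row in csv.reader(log_visit_lines) if row)
```

Both functions accept any iterable of lines, so open file objects can be passed straight in; if the two numbers differ, something went wrong during generation.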

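The script above outputs two CSV files, named log_visit-<input file> and log_vaction-<input file>, which you can then load straight into the Piwik tables. Adjust the file names in the statements below to match yours. You can do that via the MySQL console: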

LOAD DATA LOCAL INFILE 'output-log_visit.csv' INTO
TABLE piwik_log_visit FIELDS terminated BY ','
ENCLOSED BY '"' LINES terminated BY '\n';
 
LOAD DATA LOCAL INFILE 'output-log_vaction.csv' INTO
TABLE piwik_log_link_visit_action FIELDS terminated BY ','
ENCLOSED BY '"' LINES terminated BY '\n';

Now I did think that I was all done, but there are a couple of caveats before you’ll finally see that visitor graph take shape. When you add a site to Piwik it gives it a creation date in the backend database. This is not editable from the front end, and Piwik will only examine row data which is later than that date. So change the ts_created field for your site to the earliest date of the data you have imported. Finally, drop the archive_blob_* tables. These are caches of calculations Piwik has done and, when missing, will be recreated when the dashboard is loaded.
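As a small sketch of the first caveat, the earliest imported date can be pulled from the cleaned CSV and turned into an UPDATE statement. The ts_created field is the one mentioned above; the table name piwik_site and the idsite column are assumptions about the schema, so check them against your own install before running anything. The function names are hypothetical.

```python
from datetime import datetime

STAMP = "%Y-%m-%d %H:%M:%S"


def earliest_timestamp(rows):
    """Return the earliest timestamp found in the first column of rows."""
    return min(datetime.strptime(row[0], STAMP) for row in rows)


def ts_created_update(rows, site_id=1, table="piwik_site"):
    """Build an UPDATE statement that backdates a site's creation date.

    The table and idsite column names are assumed, not confirmed.
    """
    first = earliest_timestamp(rows)
    return "UPDATE {} SET ts_created = '{}' WHERE idsite = {};".format(
        table, first.strftime(STAMP), int(site_id))
```

The rows can come straight from `csv.reader` over the cleaned file; the statement is only printed or returned here, not executed.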

With all that done when you refresh your dashboard you should see your visitor graph with actual data! Huzzah! In the below image I’ve loaded in a csv containing hit data for January into a fresh install of piwik locally.

Google do provide an export API for GA data but I’ve not had the time to become familiar with it. In any case it will only export the same level of data that you get on the website, so even connecting via the API you’ll not get hourly hits. However, that could be a starting point for dumping out a list of page data which could be converted into a table of log_actions, which is where Piwik stores page names and urls for binding to a visit. I’m open to working with someone on that if anyone’s interested. For now this should save me a giant ballache of time on Monday morning.


Trials2 Stats compare script working again

10/02/2010

I still get quite a few hits coming in for this. I’ve got it all up online and working again. I’ve added a redirect from the old link to the new home, so wherever links to it are scattered about, they should magically work again. The new home for the compare script.


Old school demo effects: #2 – the raster bar

9/02/2010

I have been using Processing for a while now, mainly as a sort of relaxing digression from coding web apps & scripts all day. As a child of the 8-bit era I grew up looking at demos which tended to have a few recurring themes. One of these is the classic “plasma”, which I’ve already coded, and another was animated “scrollers”, which typically consisted of a faux-3D bar which would bounce up & down the screen.

I created this rough version last night before bed. A scroller class gives it a random start/end color and then a number of intermediate stages are interpolated between them. The brightness of the interpolated colors varies with time to give it a bit more life. It’s not really finished yet. Code below the video.
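The interpolation step on its own can be illustrated in a few lines of Python. This is not the sketch’s actual Processing code, just a minimal example of linear interpolation between two RGB colors, with made-up function names.

```python
def lerp_color(start, end, t):
    """Blend two RGB tuples; t=0 gives start, t=1 gives end."""
    return tuple(round(s + (e - s) * t) for s, e in zip(start, end))


def gradient(start, end, steps):
    """Return `steps` colors running evenly from start to end inclusive."""
    if steps < 2:
        raise ValueError("need at least two steps")
    return [lerp_color(start, end, i / (steps - 1)) for i in range(steps)]
```

Drawing each color in the returned list as a one-pixel-high line, top to bottom, gives the shaded bar; varying the brightness over time would be a separate step on top of this.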

