Importing existing visitor stats from Google Analytics to Piwik
by jaymz on 13/02/2010Recently at work we had to aggregate a lot of google analytics accounts and do some tracking and custom reporting. We found that it wasn’t quite as straightforward as we thought, one of the problems we had was getting multiple tracking codes to work on the same page. It was no real surprise that this wouldn’t work easily because google themselves have this to say:
Installing multiple instances of the Google Analytics Tracking code on a single web page is not a supported implementation. We suggest you remove all but one instance, and make sure you have the code from the correct profile installed on every page you would like to track.
With a lot of searching and reading I did find a number of scattered blog posts that suggested that it would work and was possible. But no matter what I tried I couldn’t get it tracking data into multiple accounts from the same page. I got fed up and was tasked with looking at alternative tracking solutions. That’s when I came across the very handy list at wikipedia and from there Piwik. Piwik is a GPL license PHP application which aims to mimic the functionality of google analytics. You install it on your server and then add sites to it much like GA. It then gives you a tracking code to install on your site(s).
As it was so similar to GA but locally ran, I played about with it and decided to go with that, working on the assumption that I could at least get at the db tables and source in the future. I setup a number of sites to track and then installed the codes on each page. One acted as a “master” code which was on all pages whilst others only appeared when the domain matched a certain string. I left it for an hour and came back to find the data all populated in each account as I expected. I was jolly pleased.
When it was shown to the intended user they really loved it. The only thing they wanted to sweeten the whole experience even more was to have the visitor count data for the prior month loaded in from Google Analytic’s. I wasn’t over the moon with this as exporting you’re data from GA isn’t very straightfoward and certainly not in a simple “just dump and drop it in” way either. I had a quick look at what I could export from GA and said I could import in the unique hits per day fairly easily but tying that to specific browsers or page titles etc would be quite a lot of work.
I sort of half expected that that would result in a “ah, ok, lets forget about it” conversation with the end agreement being use GA for data prior to the switchover and Piwik for reports since then but nope, they still wanted to have just the hit counts loaded in regardless of if they were not attached to page titles or a users tech setup. Looking at the piwik developers zone the post about an API to push data in was initially promising but it was focusing on apache logs and most recently a user said the timestamps weren’t coming through. So I looked at another way to get it done. What follows is how I personally went about loading data in. You may find it useful if you end up migrating yourself.
To begin with I installed piwik afresh and dumped out the database. Then I set up just one site and let it record a single hit. Then I went through and compared each table with the previous state. This told me there were 2 places I needed to put data in if I wanted it logged and being used in piwik.
_log_visit & _log_link_visit_action are the two key tables that receive data on each click. The link_visit_action table ties a particular visit to a particular page. I wasn’t going to b doing this so in log_action I added 2 new rows – 1 for the url of my “Google Analytics Dummy page” and another for its title. Then I noted the id’s for each of those rows.
Confident that this would be all I required I went over to GA to export out my data. I clicked through to visitors and set the date range for that required. Then clicked export and chose CSV. You will note that the actual data is not by time but instead aggregated for each day. This means that at most this is just going to allow someone to see the total hits per day but no further drilldown by hour etc. I made that clear to the client and they where again (worse luck!) happy with that level of reporting.
The first thing I needed to do was clean up this data. GA exports it out with day & month names along with some other cruft that wasn’t required. There would be a myriad of ways to sort this out and I chose to use the unix stalwart, sed. The code that follows I saved and chmod’d and then ran on my GA csv file.
#!/bin/sed -nrf
s/.*day, //
s/January/01,/
s/February/02,/
s/March/03,/
s/April/04,/
s/May/05,/
s/June/06,/
s/July/07,/
s/August/08,/
s/September/09,/
s/October/10,/
s/November/11,/
s/December/12,/
1,10d
s/\"//
s/([0-9]{1,2}), ([0-9]{1,2}), ([0-9]{4}),(.+)/\3-\1-\2 00:00:00,"\4"/
/^[0-9]/pYou can download that here.
That should take
-
"Tuesday, February 2, 2010",21
-
"Wednesday, February 3, 2010",28
-
"Thursday, February 4, 2010",26
-
"Friday, February 5, 2010",29
-
"Saturday, February 6, 2010",23
-
"Sunday, February 7, 2010",27
-
"Monday, February 8, 2010",16
-
"Tuesday, February 9, 2010",28
-
"Wednesday, February 10, 2010",25
-
"Thursday, February 11, 2010",11
and turn it into:
-
2010-02-2 00:00:00,"21"
-
2010-02-3 00:00:00,"28"
-
2010-02-4 00:00:00,"26"
-
2010-02-5 00:00:00,"29"
-
2010-02-6 00:00:00,"23"
-
2010-02-7 00:00:00,"27"
-
2010-02-8 00:00:00,"16"
-
2010-02-9 00:00:00,"28"
-
2010-02-10 00:00:00,"25"
-
2010-02-11 00:00:00,"11"
Now its still not the most ISO-formatted csv in the world but it will do for what we need. I obviously can’t just load that into piwik so I then used this as input to a python script that simply magics up the rest of the needed info for the two piwik table’s. As the visitor stats from GA are the uniques (again I made that clear before starting all this to the client) I just create an md5 hash via the current time as I make my way through the counts. That way piwik records them as unique visitors. If the cookies hashes are the same in the db piwik will consider it a returning visitor even if the return flag is 0.
import csv
import os
from datetime import datetime
from md5 import md5
import sys
def main(argv):
if argv[1]:
COL_ID = int(argv[1])
else:
COL_ID = 1
if argv[2]:
LVA_ID = int(argv[2])
else:
LVA_ID = COL_ID
SITE_ID = 1
LOCAL_TIME = "00:00:00"
VISITOR_RETURNING = 0
GA_ACTION_URL = 1
GA_ACTION_NAME = 1
TOTAL_ACTIONS = 1
VISIT_TOTAL_TIME = 10
GOAL_CONVERT = 0
REFERER_TYPE = 1
CONFIG_OS = "GA"
CONFIG_BROWSER = "GA"
CONFIG_B_VER = "0.1"
CONFIG_RES = "1600x1200"
CONFIG_MD5 = md5(CONFIG_BROWSER+CONFIG_OS+CONFIG_RES).hexdigest()
LOCATION_IP = "168450866"
BROWSER_LANG = "en-gb"
LOCATION_COUNTRY = "gb"
LOCATION_CONTINENT = "eur"
LOCATION_PROVIDER = "GA Import"
data = csv.reader(open(argv[0]))
piwik_output = [csv.writer(open('log_visit-'+argv[0], 'w')),
csv.writer(open('log_vaction-'+argv[0], 'w'))]
for row in data:
hits = int(row[1].replace(',', ''))
for i in range(0, hits):
cookie = md5(datetime.now().__str__()).hexdigest()
action_time = row[0] + LOCAL_TIME
output = [COL_ID, SITE_ID, action_time, cookie,
VISITOR_RETURNING, action_time, action_time, row[0],
GA_ACTION_URL, GA_ACTION_NAME, TOTAL_ACTIONS,
VISIT_TOTAL_TIME, GOAL_CONVERT, REFERER_TYPE, None,
None, None, CONFIG_MD5, CONFIG_OS, CONFIG_BROWSER,
CONFIG_B_VER, CONFIG_RES, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1,
LOCATION_IP, BROWSER_LANG, LOCATION_COUNTRY,
LOCATION_CONTINENT, LOCATION_PROVIDER
]
(piwik_output[0]).writerow(output)
output = [LVA_ID, COL_ID, GA_ACTION_URL, 0, GA_ACTION_NAME, 0]
(piwik_output[1]).writerow(output)
COL_ID += 1
LVA_ID += 1
if __name__ == "__main__":
main(sys.argv[1:])You can download the script here.
The above script outputs 2 csv files which you can then load straight into the piwik tables. You can do that via the console:
LOAD DATA LOCAL INFILE 'output-log_visit.csv' INTO
TABLE piwik_log_visit FIELDS terminated BY ','
ENCLOSED BY '"' LINES terminated BY '\n';
LOAD DATA LOCAL INFILE 'output-log_vaction.csv' INTO
TABLE piwik_log_link_visit_action FIELDS terminated BY ','
ENCLOSED BY '"' LINES terminated BY '\n';Now I did think that I was all done but there are a couple of caveats before you’ll finally see that visitor graph take shape. When you add a site to piwik it gives it a creation date in the backend database. This is not editable from the front end and piwik will only examine row data which is greater than that date. So change the ts_created field for your site to the earliest date of the data you have imported. Finally, drop the archive_blob_* tables, these are caches of calculations piwik has done and when missing will be recreated when the dashboard is loaded.
With all that done when you refresh your dashboard you should see your visitor graph with actual data! Huzzah! In the below image I’ve loaded in a csv containing hit data for January into a fresh install of piwik locally.
Google do provide an export API for GA data but I’ve not had the time to become familiar with it. In any case it will only export the same level of data that you get on the website, so even connecting via the API you’ll not get hourly hits. However that could be a starting point for dumping out a list of page data which could be converted into a table of log_actions which is where piwik stores page names and urls for binding to a visit. I’m open to work with someone on that if anyone’s interested. For now this should save me a giant ballache of time on Monday morning.




There are 4 comments in this article: