UPDATE March 2012: we have developed a new Server Log Analytics script.
The new import_logs.py is now included in the Piwik core distribution (as of Piwik 1.7.2) at piwik/misc/log-analytics/import_logs.py

Click HERE for the newer version!

Note: Apache2Piwik described below is deprecated: we recommend using import_logs.py (click for more info!)

Apache2Piwik, a Python script released under the GPL license, enables exporting statistics from Apache logs to Piwik!

Log files contain a wealth of information about activity on a website, and are usually analyzed with tools such as AWStats or Webalizer. Being able to transfer this data to Piwik, a powerful web analytics tool, can greatly enhance data mining and presentation. This, in turn, means more control over your web property, better-informed decisions and greater potential for optimization.

Importing visits, pages and Goal conversions from logs is very fast: the script processes thousands of log lines per second and can also read and process your log files in real time. Reports built from imported logs are missing a few data points compared to standard Piwik reports based on the JavaScript tracking code. However, compared to older log analysis tools such as Webalizer or AWStats, Piwik reports are sharp and easy to understand, and they let you focus on your analysis goals!

This page contains the following sections: Apache2Piwik requirements, How to use guide, List of missing reports when using log files, Performance of the script, and Credits.

 

Apache2Piwik Requirements

  • access to a Piwik installation
  • access to Apache logs with read privileges (you can specify log format in settings.py)
  • Python 2.6 with MySQLdb, GeoIP for Python and httpagentparser
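
Before running the script, it can help to confirm that the Python dependencies are importable. The following is a minimal sketch, not part of Apache2Piwik; it assumes the three libraries are imported under the names MySQLdb, GeoIP and httpagentparser, and the helper name missing_modules is made up for this example.

```python
# Import names assumed to correspond to the requirements listed above.
REQUIRED_MODULES = ["MySQLdb", "GeoIP", "httpagentparser"]


def missing_modules(names):
    """Return the subset of `names` that cannot be imported."""
    missing = []
    for name in names:
        try:
            __import__(name)
        except ImportError:
            missing.append(name)
    return missing


if __name__ == "__main__":
    missing = missing_modules(REQUIRED_MODULES)
    if missing:
        print("Missing modules: " + ", ".join(missing))
    else:
        print("All required modules are available.")
```

If any module is reported as missing, install it for the same Python interpreter that will run apache2piwik.py.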

How to import Apache logs in Piwik?

Follow these steps for a test export with Apache2Piwik:

  • Important: create a backup of your Piwik MySQL database.
  • create settings.py as a copy of settings.py.sample and edit the Piwik MySQL database configuration
  • execute apache2piwik.py (see the examples below)

Example 1 – importing log files, with all settings defined in settings.py:

$ python2.6 ./apache2piwik.py
Started processing /path/to/file/logfile1 file...
Finished in 2m16s.
Started processing /path/to/file/logfile2 file...
Finished in 2m59s.

Example 2 – live processing of apache log files.
You can enable live processing of Apache logs in the config file settings.py. When enabled, the script checks your log files for modifications and automatically imports the new requests/visits/pageviews into Piwik.
To start the daemon on a log file, run:

$ python2.6 ./apache2piwik.py start

To stop the daemon from reading and importing your logs in Piwik in real time:

$ python2.6 ./apache2piwik.py stop
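
To illustrate what "checking log files for modifications" can look like, here is a minimal polling sketch. It is not the code used by the Apache2Piwik daemon; the function name read_new_lines and the offset-based approach are assumptions chosen for this example.

```python
import os


def read_new_lines(path, offset):
    """Return (lines, new_offset) for complete lines appended after `offset`."""
    # If the file shrank (for example after log rotation), start over.
    if os.path.getsize(path) < offset:
        offset = 0
    with open(path, "rb") as handle:
        handle.seek(offset)
        data = handle.read()
    # Only consume complete lines; a partial last line is left for next time.
    end = data.rfind(b"\n") + 1
    lines = data[:end].decode("utf-8", "replace").splitlines()
    return lines, offset + end
```

A caller would keep the returned offset and call the function again after a short sleep, handing each batch of new lines to the import step.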

Example 3 – goals processing.
If you added new goals to Piwik after data was imported into the database, you can simply run:

$ python apache2piwik.py -g

in order to reprocess logs and update the goals data in your database.
Important notes about Apache2Piwik:
  • Image files are automatically ignored. You can customize the ignored extensions in settings.py. You can also ignore specific log lines with regular expressions there.
  • Search bots are not excluded at this stage. We might add a feature to exclude bots in a future version.
  • When you import data from the past, or when you want to reprocess your reports from the logs, you can delete the piwik_archive_* tables. See more information in this FAQ.
  • Apache2Piwik imports data into the idsite specified in settings.py. You can override this with the “-i [idsite]” command-line parameter.
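
As an illustration of the first note, extension-based filtering could be sketched as follows. The list IGNORED_EXTENSIONS and the function is_ignored are hypothetical names for this example; the actual setting name and default extensions are defined in settings.py and may differ.

```python
import os

try:
    from urllib.parse import urlsplit  # Python 3
except ImportError:
    from urlparse import urlsplit  # Python 2

# Hypothetical list; the real setting and its defaults live in settings.py.
IGNORED_EXTENSIONS = (".png", ".jpg", ".jpeg", ".gif", ".ico")


def is_ignored(request_url):
    """Return True if the URL path ends with an ignored extension."""
    # Look only at the path, so query strings do not affect the result.
    path = urlsplit(request_url).path
    extension = os.path.splitext(path)[1].lower()
    return extension in IGNORED_EXTENSIONS
```

The check is case-insensitive and ignores the query string, so /img/logo.PNG?v=2 is skipped while /download?file=a.png is not.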

Reporting differences between Server Logs and Javascript code

The server web access log files contain a lot of interesting information (URL, date & time, referrer URL, visitor IP, etc.), but some reports in Piwik require data that only the JavaScript tracking code can collect.
The following information will not be tracked when importing logs in Piwik: Screen resolution & type, Custom variables, Page titles, Outlinks, Campaigns, Providers, Plugins support.
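
To make the available fields concrete, the sketch below parses one line in the Apache "combined" log format. It is an illustration only, not the parser used by Apache2Piwik; the regular expression assumes the default combined format, and the sample values are made up.

```python
import re

# Assumes the Apache "combined" format:
# %h %l %u %t "%r" %>s %b "%{Referer}i" "%{User-Agent}i"
COMBINED_LOG = re.compile(
    r'(?P<ip>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<url>\S+)[^"]*" '
    r'(?P<status>\d{3}) (?P<size>\S+) '
    r'"(?P<referrer>[^"]*)" "(?P<user_agent>[^"]*)"'
)


def parse_line(line):
    """Return a dict of fields for a combined-format line, or None."""
    match = COMBINED_LOG.match(line)
    if match is None:
        return None
    return match.groupdict()


# Made-up sample line for illustration.
SAMPLE = ('192.0.2.10 - - [10/Mar/2012:13:55:36 +0100] '
          '"GET /blog/post?id=5 HTTP/1.1" 200 2326 '
          '"http://example.org/start" "Mozilla/5.0"')

if __name__ == "__main__":
    print(parse_line(SAMPLE))
```

Everything in the result comes from the HTTP request itself, which is why browser-side details such as screen resolution or page titles cannot be recovered from logs.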

Apache log import Performance

Our tests, on an Intel dual-core 2.5 GHz machine, indicated that the script could parse 3,000 lines per second on average (a single 300 MB file with about 500,000 lines was processed in under 3 minutes and generated about 21,000 visits and 71,000 actions). During this test, only 14% of log lines were actual pageview-type requests tracked by Piwik.
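
The figures above can be turned into a rough estimate for your own logs. This is a back-of-the-envelope sketch that assumes the same throughput and the same share of pageview requests as in our test; the function name estimate is made up for this example.

```python
def estimate(lines_total, lines_per_second, pageview_share):
    """Return (minutes, seconds, tracked_requests) for an import run."""
    seconds = int(round(lines_total / float(lines_per_second)))
    minutes, rest = divmod(seconds, 60)
    tracked = int(round(lines_total * pageview_share))
    return minutes, rest, tracked


# Figures from the test above: 500,000 lines, 3,000 lines/s, 14% pageviews.
print(estimate(500000, 3000, 0.14))  # (2, 47, 70000)
```

The result, about 2m47s and roughly 70,000 tracked requests, is in line with the "under 3 minutes" and "about 71,000 actions" observed in the test.
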
Important notes about performance:
  • If your URLs contain a session ID, add a regular expression to the URL_REGEXPR directive in settings.py to cut it out
  • Do you have any monitoring or cron scripts that call some URLs every X minutes?
    If so, add them to the IGNORED_LOGS directive in settings.py
  • The script is designed more for a “single website” use case, or for a few websites. We haven’t tested it under a “web hosting” type of load at this stage, but we hope to in the future.
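
The first two notes can be sketched with plain regular expressions. The patterns and names below (SESSION_PARAM, MONITORING_URLS, the sid and PHPSESSID parameters, the example paths) are hypothetical; the exact format expected by URL_REGEXPR and IGNORED_LOGS in settings.py may differ, so treat this as a way to test your expressions rather than as the script's own code.

```python
import re

# Hypothetical session parameters; adapt to the ones your site uses.
SESSION_PARAM = re.compile(r"([?&])(?:PHPSESSID|sid)=[^&#]*(&?)")

# Hypothetical URLs called by monitoring or cron scripts.
MONITORING_URLS = [re.compile(r"^/server-status"), re.compile(r"^/cron\.php")]


def strip_session_id(url):
    """Remove the session parameter so identical pages share one URL."""
    # Keep the leading "?" or "&" only when another parameter follows.
    return SESSION_PARAM.sub(lambda m: m.group(1) if m.group(2) else "", url)


def is_monitoring_request(url):
    """Return True if the URL matches one of the monitoring patterns."""
    return any(pattern.search(url) for pattern in MONITORING_URLS)
```

Stripping session IDs keeps one page from appearing as thousands of distinct URLs, and skipping monitoring requests avoids importing visits that no human made.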

Credits

The project was initially developed for CLANMO GmbH, an award-winning mobile interactive agency from Köln, Germany.

If you have any suggestions, bug reports, or feedback about Apache2Piwik, please leave a comment directly on this page.


Maciej Zawadziński

Maciej is the CEO of Piwik PRO, the leading professional web analytics services provider. At Piwik PRO, Maciej helps clients deploy and maintain Piwik analytics on their own infrastructure as well as on Piwik Cloud. He also advises on creating customised analytics solutions and provides consulting services on Piwik analytics.