Export statistics from Apache log files to Matomo with Apache2Matomo [DEPRECATED]

UPDATE March 2012: we have developed a new Server Log Analytics script.
The new import_logs.py is now included in the Matomo core distribution (as of Matomo 1.7.2) at piwik/misc/log-analytics/import_logs.py

Note: Apache2Matomo, described below, is deprecated. We recommend using import_logs.py instead.

Log files are known to contain a wealth of information about activity on a website, and are usually analyzed with tools such as AWStats or Webalizer. Transferring this data to Matomo, a powerful web analysis tool, can greatly enhance data mining and presentation. This, in turn, means more control over your web property, better-informed decisions, and greater potential for optimization.

Importing visits, pages, and Goal conversions from logs is very fast: the script processes thousands of log lines per second and can also read and process your log files in real time. Matomo (Piwik) reports generated from imported logs are missing a few data points compared to standard Matomo reports based on the JavaScript tracking code. However, compared to older log analysis software such as Webalizer or AWStats, Matomo reports are sharp and easy to understand, and they let you focus on your analysis goals!

This page contains the following sections: Apache2Matomo requirements, How to use guide, List of missing reports when using log files, Performance of the script, and Credits.

Apache2Matomo Requirements

  • access to Matomo installation
  • access to Apache logs with read privileges (you can specify log format in settings.py)
  • Python 2.6 with MySQLdb, GeoIP for Python and httpagentparser
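
Before running the script, it can help to confirm that the three Python modules are importable. The following is a minimal sketch and is not part of Apache2Matomo: `missing_modules` is a hypothetical helper, and the import names `MySQLdb`, `GeoIP`, and `httpagentparser` are assumed to match the packages listed above.

```python
def missing_modules(names):
    """Return the subset of module names that cannot be imported."""
    missing = []
    for name in names:
        try:
            __import__(name)
        except ImportError:
            missing.append(name)
    return missing


if __name__ == "__main__":
    # Assumed import names for the requirements listed above.
    required = ["MySQLdb", "GeoIP", "httpagentparser"]
    missing = missing_modules(required)
    if missing:
        print("Missing modules: " + ", ".join(missing))
    else:
        print("All required modules are importable.")
```

Run it with the same interpreter you intend to use for apache2piwik.py, since modules installed for one Python version are not visible to another.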

How to import Apache logs in Matomo?

Follow these steps for a test export with Apache2Piwik:

  • Important: create a backup of your Matomo MySQL database.
  • create `settings.py` as a copy of settings.py.sample and edit the Matomo MySQL database configuration
  • execute apache2piwik.py – see examples below

Example 1 – importing log files, with all settings defined in settings.py:

$ python2.6 ./apache2piwik.py
Started processing /path/to/file/logfile1 file...
Finished in 2m16s.
Started processing /path/to/file/logfile2 file...
Finished in 2m59s.

Example 2 – live processing of Apache log files.
You can enable live processing of Apache logs in the config file settings.py. When enabled, the script checks your log files for modifications and automatically imports the new requests, visits, and pageviews into Matomo.
To start the daemon on a log file, run:

$ python2.6 ./apache2piwik.py start

To stop the daemon from reading and importing your logs in Matomo in real time:

$ python2.6 ./apache2piwik.py stop
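
Live processing amounts to repeatedly reading whatever has been appended to the log since the last check. The following is a minimal sketch of that idea, assuming a simple polling approach; the actual daemon may work differently, and `read_new_lines` is a hypothetical helper name.

```python
import os


def read_new_lines(path, offset):
    """Read complete lines appended to `path` after byte `offset`.

    Returns (lines, new_offset). If the file shrank (for example after
    log rotation), reading restarts from the beginning.
    """
    size = os.path.getsize(path)
    if size < offset:
        offset = 0  # file was truncated or rotated
    lines = []
    with open(path, "rb") as handle:
        handle.seek(offset)
        for raw in handle:
            if not raw.endswith(b"\n"):
                break  # partial line: wait until the writer finishes it
            lines.append(raw.decode("utf-8", "replace").rstrip("\r\n"))
            offset += len(raw)
    return lines, offset
```

A polling loop would call this every few seconds, hand each returned line to the import step, and keep the returned offset for the next call.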

Example 3 – goals processing.
If you added new goals to Matomo after data was imported into the database, you can simply run:

$ python apache2piwik.py -g

in order to reprocess logs and update the goals data in your database.

Important notes about Apache2Piwik:
  • Image files are automatically ignored. You can customize ignored extensions in the settings.py file. You can also ignore specific log entries with regular expressions there.
  • Search bots are not excluded at this stage. We might add a feature to exclude bots in a future version.
  • When you import data from the past, or when you want to reprocess your reports from the logs, you can delete the piwik_archive_* tables. See more information in this FAQ.
  • Apache2Matomo imports data into the idsite specified in settings.py. You can override this with the “-i [idsite]” command-line parameter.
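
The extension and regular-expression filters mentioned in these notes can be pictured as a single predicate applied to each requested path. This is an illustrative sketch only: `IGNORED_EXTENSIONS`, `IGNORED_PATTERNS`, and `should_ignore` are hypothetical names and values, not taken from settings.py, whose actual directives may differ.

```python
import re

# Hypothetical settings, for illustration only.
IGNORED_EXTENSIONS = (".png", ".jpg", ".jpeg", ".gif", ".ico")
IGNORED_PATTERNS = [re.compile(r"^/server-status"), re.compile(r"^/cron/")]


def should_ignore(path):
    """Return True if a requested path should be skipped during import."""
    # Drop the query string before checking the file extension.
    bare_path = path.split("?", 1)[0].lower()
    if bare_path.endswith(IGNORED_EXTENSIONS):
        return True
    return any(pattern.search(path) for pattern in IGNORED_PATTERNS)
```

Filtering early like this keeps static assets and monitoring hits from being recorded as pageviews.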

Reporting differences between Server Logs and JavaScript code

The server web access log files contain a lot of interesting information: URL, date and time, referrer URL, visitor IP, etc. However, some reports in Matomo rely on data that only the JavaScript tracking code can collect.
The following information will not be tracked when importing logs in Matomo: Screen resolution & type, Custom variables, Page titles, Outlinks, Campaigns, Providers, Plugins support.
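
To make the available fields concrete, here is a minimal sketch that parses one line in the Apache "combined" log format with a regular expression. It assumes that format; Apache2Matomo's own parser and the log format configured in settings.py may differ, and `parse_combined` is a hypothetical helper.

```python
import re
from datetime import datetime

# Apache "combined" format:
# %h %l %u %t "%r" %>s %b "%{Referer}i" "%{User-agent}i"
COMBINED = re.compile(
    r'(?P<ip>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<url>\S+)[^"]*" '
    r'(?P<status>\d{3}) (?P<size>\S+) '
    r'"(?P<referrer>[^"]*)" "(?P<agent>[^"]*)"'
)


def parse_combined(line):
    """Return a dict of fields for one log line, or None if it does not match."""
    match = COMBINED.match(line)
    if match is None:
        return None
    fields = match.groupdict()
    # Example timestamp: 10/Oct/2011:13:55:36 +0200 (offset kept as text).
    raw_time = fields.pop("time")
    stamp, _, offset = raw_time.partition(" ")
    fields["time"] = datetime.strptime(stamp, "%d/%b/%Y:%H:%M:%S")
    fields["utc_offset"] = offset
    return fields
```

Nothing in such a line carries a screen resolution, page title, or custom variable, which is why those reports stay empty after a log import.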

Apache log import Performance

Our tests, on an Intel Dual Core 2.5 GHz machine, indicated that the script could parse 3,000 lines per second on average (processing a single 300MB file with about 500,000 lines took under 3 minutes and generated about 21,000 visits and 71,000 actions). During this test, only 14% of log lines were actual pageview-type requests tracked by Matomo.
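
As a rough consistency check on these figures, the arithmetic can be written out. The sketch assumes the run took the full 180 seconds and that the file held exactly 500,000 lines; both are approximations of the numbers quoted above.

```python
lines = 500000          # approximate number of lines in the 300MB file
seconds = 3 * 60        # upper bound: "under 3 minutes"
pageview_share = 0.14   # share of lines that were pageview-type requests

lines_per_second = lines / float(seconds)
pageview_lines = lines * pageview_share

print("At least %.0f lines per second" % lines_per_second)
print("About %.0f pageview-type lines" % pageview_lines)
```

The lower bound of roughly 2,778 lines per second fits the reported average, since the run finished in less than three minutes, and about 70,000 pageview-type lines is close to the number of actions reported.
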
Important notes about performance:
  • If your URLs contain a session ID, add a regular expression to the URL_REGEXPR directive in settings.py to cut it out
  • Do you have any monitoring or cron scripts that call some URLs every X minutes?
    If so, add them to the IGNORED_LOGS directive in settings.py
  • The script is designed more for a “single website” use case, or for a few websites. We haven’t tested it under a “web hosting” type of load at this stage, but we hope to in the future.
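
Stripping session IDs keeps the same page from being recorded under many distinct URLs. The following is a minimal sketch of such a substitution, assuming the session ID travels in a query parameter named `PHPSESSID` or `sid`. The pattern, the parameter names, and `strip_session_id` are hypothetical, and the exact syntax expected by URL_REGEXPR in settings.py is not shown here.

```python
import re

# Hypothetical pattern: matches "PHPSESSID=..." or "sid=..." in a query string.
# The leading separator is captured so it can be put back.
SESSION_PARAM = re.compile(r"([?&])(?:PHPSESSID|sid)=[^&#]*&?", re.IGNORECASE)


def strip_session_id(url):
    """Remove the session parameter so equal pages share one URL."""
    cleaned = SESSION_PARAM.sub(r"\1", url)
    # Drop a separator left dangling at the end of the URL.
    return cleaned.rstrip("?&")
```

With this in place, `/page?sid=abc123` and `/page?sid=def456` both collapse to `/page`, so they are counted as one page rather than two.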

Credits

The project has been developed initially for CLANMO GmbH, an award-winning mobile interactive agency from Köln, Germany.

If you have any suggestions, bug reports, or feedback about Apache2Piwik, please leave a comment directly on the page.
