June 22, 2011

in Community, Development

Export statistics from Apache log files to Piwik with Apache2Piwik [DEPRECATED]

UPDATE March 2012: we have developed a new Server Log Analytics script, import_logs.py.
It is included in the Piwik core distribution (as of Piwik 1.7.2) at piwik/misc/log-analytics/import_logs.py

Note: Apache2Piwik, described below, is deprecated. We recommend using import_logs.py instead.

Apache2Piwik, a Python script released under the GPL license, exports statistics from Apache logs to Piwik!

Log files contain a wealth of information about activity on a website, and are usually analyzed with tools such as AWStats or Webalizer. Transferring that information to Piwik, a powerful web analysis tool, can greatly enhance data mining and presentation. This, in turn, means more control over your web property, better informed decisions and greater potential for optimization.

Importing visits, pages and Goal conversions from logs is very fast: the script processes thousands of log lines per second and can also read and process your log files in real time. Piwik reports built from imported logs have a few missing data points compared to the standard Piwik reports produced with the Javascript code. However, compared to older log analysis software such as Webalizer or AWStats, Piwik reports are sharp and easy to understand, and let you focus on your analysis goals!

This page contains the following sections: Apache2Piwik requirements, How to use guide, List of missing reports when using log files, Performance of the script, and Credits.

 

Apache2Piwik Requirements

  • access to Piwik installation
  • access to Apache logs with read privileges (you can specify log format in settings.py)
  • Python 2.6 with MySQLdb, GeoIP for Python and httpagentparser
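
Before running the script, it can help to confirm that the three Python libraries are installed. The sketch below is not part of Apache2Piwik; it assumes the libraries are importable under the names MySQLdb, GeoIP and httpagentparser, and the helper name missing_modules is made up for this illustration.

```python
# Import names assumed for the libraries listed in the requirements.
REQUIRED_MODULES = ["MySQLdb", "GeoIP", "httpagentparser"]


def missing_modules(names):
    """Return the names from `names` that cannot be imported."""
    missing = []
    for name in names:
        try:
            __import__(name)
        except ImportError:
            missing.append(name)
    return missing


if __name__ == "__main__":
    missing = missing_modules(REQUIRED_MODULES)
    if missing:
        print("Missing modules: " + ", ".join(missing))
    else:
        print("All required modules are importable.")
```

Run it with the same interpreter you plan to use for apache2piwik.py, since each Python installation has its own set of installed modules.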

How to import Apache logs in Piwik?

Follow these steps for a test export with Apache2Piwik:

  • Important: create a backup of your Piwik MySQL database.
  • create settings.py as a copy of settings.py.sample and edit the Piwik MySQL database configuration
  • execute apache2piwik.py – see examples below

Example 1 – importing log file, all settings set in settings.py file:

$ python2.6 ./apache2piwik.py
Started processing /path/to/file/logfile1 file...
Finished in 2m16s.
Started processing /path/to/file/logfile2 file...
Finished in 2m59s.

Example 2 – live processing of apache log files.
You can enable live processing of Apache logs in the config file settings.py. When enabled, the script checks your log files for modifications and automatically imports the new requests/visits/pageviews into Piwik.
To start the daemon on a log file, run:

$ python2.6 ./apache2piwik.py start

To stop the daemon from reading and importing your logs in Piwik in real time:

$ python2.6 ./apache2piwik.py stop
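
The general idea behind live processing can be sketched as polling a log file and handing over only the lines appended since the last check. This is an illustration of the technique, not the Apache2Piwik daemon itself; the function names read_new_lines and follow are hypothetical.

```python
import os
import time


def read_new_lines(path, offset):
    """Return (lines, new_offset) for complete lines appended after `offset`."""
    if os.path.getsize(path) < offset:
        # The file shrank, for example after log rotation: start from the top.
        offset = 0
    with open(path, "rb") as handle:
        handle.seek(offset)
        chunk = handle.read()
    # Consume only up to the last newline, so that a half-written line
    # is left in place and picked up on the next call.
    end = chunk.rfind(b"\n") + 1
    lines = chunk[:end].decode("utf-8", "replace").splitlines()
    return lines, offset + end


def follow(path, poll_seconds=1.0):
    """Yield log lines forever as they are appended to `path`."""
    offset = 0
    while True:
        lines, offset = read_new_lines(path, offset)
        for line in lines:
            yield line
        if not lines:
            time.sleep(poll_seconds)
```

A caller would loop over follow("/path/to/file/logfile1") and pass each line to its parser. Tracking a byte offset, rather than re-reading the whole file, keeps each check cheap even for large logs.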

Example 3 – goals processing.
If you added new goals to Piwik after data was imported into the database, you can simply run:

$ python apache2piwik.py -g

in order to reprocess logs and update the goals data in your database.
Important notes about Apache2Piwik:
  • Image files are automatically ignored. You can customize the ignored extensions in the settings.py file. You can also ignore specific log lines with regular expressions there.
  • Search bots are not excluded at this stage. We might add a feature to exclude bots in a future version.
  • When you import data for past dates, or when you want to reprocess your reports from the logs, you can delete the piwik_archive_* tables. See the Piwik FAQ for more information.
  • Apache2Piwik imports data into the idsite specified in settings.py. You can override this with the “-i [idsite]” command line parameter.
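
Ignoring requests by file extension can be pictured roughly as follows. This is only a sketch of the idea: the name IGNORED_EXTENSIONS, the list of extensions and the helper is_ignored are assumptions made for this example and are not taken from settings.py.

```python
import posixpath

# Hypothetical list of extensions to skip; the real list lives in settings.py.
IGNORED_EXTENSIONS = frozenset([".png", ".jpg", ".jpeg", ".gif", ".ico"])


def is_ignored(request_path):
    """Return True if the requested path ends with an ignored extension."""
    # Drop the query string so "/logo.png?v=2" is treated like "/logo.png".
    path = request_path.split("?", 1)[0]
    extension = posixpath.splitext(path)[1].lower()
    return extension in IGNORED_EXTENSIONS
```

With this helper, is_ignored("/img/logo.PNG?v=2") is true, while is_ignored("/download?file=a.png") is false because only the path part is inspected.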

Reporting differences between Server Logs and Javascript code

The server web access log files contain a lot of interesting information: URL, date & time, referrer URL, visitor IP, etc. However, some reports in Piwik require the Javascript Tracking code to run in the visitor's browser.
The following information will not be tracked when importing logs in Piwik: Screen resolution & type, Custom variables, Page titles, Outlinks, Campaigns, Providers, Plugins support.
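
To make the list of available fields concrete, here is a minimal sketch that extracts them from one log line. It assumes the common Apache "combined" log format (your format may differ, which is why it is configurable in settings.py), and both the regular expression and the sample line are written for this example rather than taken from Apache2Piwik.

```python
import re

# Assumed layout: Apache "combined" log format.
COMBINED_LOG = re.compile(
    r'(?P<ip>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<url>\S+) (?P<protocol>[^"]*)" '
    r'(?P<status>\d{3}) (?P<size>\S+) '
    r'"(?P<referrer>[^"]*)" "(?P<user_agent>[^"]*)"'
)


def parse_line(line):
    """Return a dict of fields for one log line, or None if it does not match."""
    match = COMBINED_LOG.match(line)
    if match is None:
        return None
    return match.groupdict()


# Made-up sample line, for illustration only.
SAMPLE = ('203.0.113.7 - - [22/Jun/2011:10:15:32 +0200] '
          '"GET /blog/?p=1 HTTP/1.1" 200 5120 '
          '"http://example.org/" "Mozilla/5.0"')

fields = parse_line(SAMPLE)
print(fields["ip"], fields["time"], fields["url"], fields["referrer"])
```

Everything the parser returns comes from the request itself. Values such as screen resolution or page title never appear in the log line, which is why the corresponding reports stay empty.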

Apache log import Performance

Our tests, on an Intel Dual Core 2.5 GHz machine, indicated that the script could parse 3000 lines per second on average (a single 300MB file with about 500,000 lines was processed in under 3 minutes and generated about 21,000 visits and 71,000 actions). During this test, only 14% of log lines were actual pageview-type requests tracked by Piwik.
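
As a rough sanity check, the figures quoted for this test are consistent with each other. The short calculation below uses only those numbers; the variable names are chosen for this example.

```python
lines_total = 500000      # approximate number of lines in the test file
max_seconds = 3 * 60      # the run finished in under 3 minutes
pageview_share = 0.14     # share of lines that were pageview-type requests

# Processing 500,000 lines in at most 180 seconds gives a lower bound
# on the throughput in lines per second.
min_rate = lines_total / float(max_seconds)

# Expected number of tracked pageview-type requests.
pageview_lines = lines_total * pageview_share

print(round(min_rate), round(pageview_lines))
```

The lower bound comes out at roughly 2,800 lines per second, in line with the quoted average of about 3000, and 14% of 500,000 lines is about 70,000, close to the 71,000 actions reported.
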
Important notes about performance:
  • If your URLs contain a session ID, add a regular expression to the URL_REGEXPR directive in settings.py to cut it out
  • Do you have any monitoring or cron scripts that call some URLs every X minutes?
    If so, add them to the IGNORED_LOGS directive in settings.py
  • The script is designed more for a “single website” use case, or for a few websites. We haven’t tested it under a “web hosting” type of load at this stage, but we hope to in the future.
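
The first two notes can be illustrated with plain regular expressions. The exact syntax expected by URL_REGEXPR and IGNORED_LOGS is not shown here, so the patterns, parameter names and helper functions below are hypothetical and only demonstrate the idea: strip session IDs so the same page is not counted as many different URLs, and skip requests made by monitoring scripts.

```python
import re

# Hypothetical session parameters; adjust to whatever your site uses.
SESSION_PARAM = re.compile(r'([?&])(?:PHPSESSID|sid)=[^&]*&?')

# Hypothetical URLs called by monitoring or cron scripts.
IGNORED_URL_PATTERNS = [
    re.compile(r'^/server-status'),
    re.compile(r'^/cron/'),
]


def strip_session_id(url):
    """Remove the session parameter from the query string of `url`."""
    cleaned = SESSION_PARAM.sub(r'\1', url)
    # Tidy up a dangling "?" or "&" left behind by the removal.
    return cleaned.rstrip('?&')


def is_monitoring_request(url):
    """Return True if `url` matches one of the ignored patterns."""
    return any(pattern.search(url) for pattern in IGNORED_URL_PATTERNS)
```

For example, strip_session_id("/page?sid=abc123&x=1") gives "/page?x=1", so visits that differ only by session ID end up on the same page URL in the reports.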

Credits

The project has been developed initially for CLANMO GmbH, an award-winning mobile interactive agency from Köln, Germany.

Apache2Piwik was specified and implemented by Clearcode, an official Piwik consultant. Clearcode supports further development of Apache2Piwik and Google2Piwik. Our next project is to develop a GUI for Mac OS X/Windows/Linux for these tools.

Download, Feedback & Questions

The script is available at Apache2Piwik official page.

If you have any suggestions, bug reports, or feedback about Apache2Piwik, please leave a comment directly on that page.

About author
piwik team member

Maciej Zawadziński

Maciej is the CEO at Clearcode, a software development company focused on providing custom solutions for the advertising industry. He is also a Core Piwik team member and helps businesses and companies worldwide with Enterprise Piwik Support, New Plugins, Customization and Performance tuning.

  1. truth about six pack abs reviews Says:

    January 14, 2013 6:18 pm

    Hi! Someone in my Facebook group shared this site with us so I came to check
    it out. I’m definitely loving the information. I’m book-marking and will be tweeting
    this to my followers! Wonderful blog and terrific design.

  2. Maquina Musculacion Segunda Mano Says:

    November 26, 2012 1:11 pm

    My boss assigned me a project related to cheap gym machines,
    and your website helped me a lot. Thanks

  3. Elida Says:

    November 24, 2012 6:57 am

    It’s perfect time to make a few plans for the future and it’s
    time to be happy. I’ve read this put up and if I could I want to counsel you some fascinating issues or advice. Perhaps you can write next articles regarding this article. I desire to read more issues approximately it!

  4. September 18, 2011 6:20 pm

    Is Apache 2.6 a requirement for this?

  5. Remote admin Says:

    September 14, 2011 7:05 pm

    I am waiting for this for a long while

  6. Thomas Says:

    July 7, 2011 4:58 pm

    Please make this script usable for those who not have console access to their server!

  7. Tory Burch Says:

    July 6, 2011 4:06 am

    Piwik reports are sharp, easy to understand, and lets you focus on your analysis goals!

  8. Alpay Says:

    June 29, 2011 8:05 am

    The program has some bugs!!!

  9. Piwik team Says:

    June 28, 2011 10:23 pm

    Marc, please post your bug report in the official tool page: http://clearcode.cc/offer/open-source-projects/apache2piwik/
