Parsing Traefik Access Logs with Python & Pandas

Today we’re exploring how to parse Traefik access logs using Python and Pandas.

According to the Traefik access logs documentation, Traefik writes its logs in Common Log Format by default.

A number of other applications write logs in a similar way, so existing examples give us a good starting point that needs only slight modification:
- Read Apache HTTP server access log with Pandas
- Read Nginx access log (multiple quotechars)

import pandas as pd

from datetime import datetime
import pytz

def parse_str(string):
    """
    Returns the string with the surrounding `"` characters removed.

    Example:
        `>>> parse_str('"my string"')`
        `'my string'`
    """
    return string.strip('"')

def parse_datetime(x):
    """
    Parses datetime with timezone formatted as:
        `[day/month/year:hour:minute:second zone]`

    Example:
        `>>> parse_datetime('[13/Nov/2015:11:45:42 +0000]')`
        `datetime.datetime(2015, 11, 13, 11, 45, 42, tzinfo=<UTC>)`

    Due to problems parsing the timezone (`%z`) with `datetime.strptime`, the
    timezone will be obtained using the `pytz` library.
    """
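    dt = datetime.strptime(x[1:-7], '%d/%b/%Y:%H:%M:%S')
    # Offset in minutes; the sign applies to both hours and minutes (e.g. -0530 -> -330).
    sign = -1 if x[-6] == '-' else 1
    dt_tz = sign * (int(x[-5:-3]) * 60 + int(x[-3:-1]))
    return dt.replace(tzinfo=pytz.FixedOffset(dt_tz))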
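
On Python 3, `datetime.strptime` understands `%z` offsets such as `+0000`, so a pytz-free variant may be enough. The sketch below is an alternative under that assumption, not the converter used in the rest of this post; the name `parse_datetime_stdlib` is made up for illustration, and `%b` expects English month names (the default C locale).

```python
from datetime import datetime, timedelta

def parse_datetime_stdlib(x):
    """Parse '[13/Nov/2015:11:45:42 +0000]' using only the standard library."""
    # The brackets are matched literally; %z consumes the +HHMM / -HHMM offset.
    return datetime.strptime(x, '[%d/%b/%Y:%H:%M:%S %z]')

dt = parse_datetime_stdlib('[13/Nov/2015:11:45:42 -0530]')
print(dt.isoformat())  # 2015-11-13T11:45:42-05:30
```

It could be passed to `converters` in place of `parse_datetime` if the pytz dependency is not wanted.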

We need an extra converter to cover the format Traefik uses for the duration, e.g. 321ms.

def parse_duration(duration):
    return int(duration.strip('ms'))
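
Note that `str.strip('ms')` removes any leading or trailing `m` and `s` characters rather than the literal suffix `ms`. That is harmless for values like `321ms`, but a stricter variant might look like the sketch below; the name `parse_duration_strict` is hypothetical and the error handling is just one possible choice.

```python
def parse_duration_strict(duration):
    """Convert a Traefik duration such as '321ms' to an int number of milliseconds."""
    # Reject anything that does not end in the expected unit instead of guessing.
    if not duration.endswith('ms'):
        raise ValueError(f'unexpected duration format: {duration!r}')
    return int(duration[:-2])

print(parse_duration_strict('321ms'))  # 321
```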

Now that we have all the helper functions we need, we can read the data into a Pandas DataFrame.

# <remote_IP_address> - <client_user_name_if_available> [<timestamp>] "<request_method> <request_path> <request_protocol>" <origin_server_HTTP_status> <origin_server_content_size> "<request_referrer>" "<request_user_agent>" <number_of_requests_received_since_Traefik_started> "<Traefik_router_name>" "<Traefik_server_URL>" <request_duration_in_ms>ms
log_file = '/path/to/my/access.log'
df = pd.read_csv(
    log_file,
    sep=r'\s(?=(?:[^"]*"[^"]*")*[^"]*$)(?![^\[]*\])',
    engine='python',
    usecols=[0, 3, 4, 5, 6, 7, 8, 10, 11, 12],
    names=['ip', 'timestamp', 'request', 'status', 'size', 'referer', 'user_agent', 'Traefik_router_name', 'Traefik_server_URL', 'request_duration_in_ms'],
    na_values='-',
    header=None,
    dtype={'status': 'category'},
    converters={
        'request_duration_in_ms': parse_duration,
        'timestamp': parse_datetime,
        'request': parse_str,
        'referer': parse_str,
        'user_agent': parse_str,
        'Traefik_router_name': parse_str,
        'Traefik_server_URL': parse_str,
    },
)
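
To see what the `sep` regex does, it can help to try it on a single line with `re.split`. The sketch below uses an invented log line (the IP, router name and server URL are made up) and only checks how the line is tokenised; it is not part of the parsing pipeline.

```python
import re

# Same pattern as the `sep` argument: split on whitespace that is
# outside double quotes and outside the [timestamp] brackets.
SEP = r'\s(?=(?:[^"]*"[^"]*")*[^"]*$)(?![^\[]*\])'

# Invented sample line following the documented field order.
line = ('10.0.0.1 - - [13/Nov/2015:11:45:42 +0000] "GET /api/health HTTP/1.1" '
        '200 17 "-" "curl/7.68.0" 42 "my-router@docker" "http://172.18.0.5:80" 3ms')

fields = re.split(SEP, line)
for i, field in enumerate(fields):
    print(i, field)
```

The line splits into 13 fields, indexed 0 to 12, which is where the `usecols` indices come from: columns 1 and 2 (the `-` placeholder and the user name) and column 9 (the request counter) are skipped.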

From here we’d like to unpack the request section "<request_method> <request_path> <request_protocol>". We can do this by splitting each request into a list and unpacking the lists into Pandas columns.

df[['request_method', 'request_path', 'request_protocol']] = pd.DataFrame(df['request'].str.split().to_list(), index=df.index) 
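
The line above assumes every request splits into exactly three parts. If the log may contain malformed requests, one possible alternative is `str.split` with `expand=True`, which pads short rows with missing values. This is a sketch on invented sample values, not a drop-in replacement tested against real Traefik logs.

```python
import pandas as pd

# Invented sample values; the last row mimics a malformed request.
requests = pd.Series(['GET / HTTP/1.1', 'POST /login HTTP/2.0', '-'])

# n=2 caps the split at three parts; expand=True returns one column per part.
parts = requests.str.split(n=2, expand=True)
parts.columns = ['request_method', 'request_path', 'request_protocol']
print(parts)
```

The malformed row keeps its single token in `request_method` and gets missing values in the other two columns.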

Now that we have the data in a DataFrame, we’re on familiar ground and can use lots of common tools to do some exploratory data analysis (EDA).

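# pandas-profiling has been renamed to ydata-profiling (pip install ydata-profiling)
from ydata_profiling import ProfileReport
profile = ProfileReport(df, title='Pandas Profiling Report')
profile.to_file(output_file="access_log_pandas_profiling.html")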

We can also plot charts to show how quickly we’re responding to requests.

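# The converter returns plain datetime objects, so the column may have object
# dtype; convert it to datetime64 (in UTC) before using the .dt accessor.
hourly = pd.to_datetime(df['timestamp'], utc=True).dt.floor('h')
df.groupby(hourly)['request_duration_in_ms'] \
    .quantile([0.25, 0.5, 0.75, 0.95, 1]) \
    .unstack(level=-1) \
    .plot()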
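
To make the shape of that result clearer, here is a small sketch of the same group-by on synthetic data (the timestamps and durations are invented). Each row is one hour and each column one quantile, which is why `.plot()` draws one line per quantile.

```python
import pandas as pd

# Synthetic data: six requests spread over two hours.
demo = pd.DataFrame({
    'timestamp': pd.date_range('2015-11-13 11:00', periods=6, freq='20min', tz='UTC'),
    'request_duration_in_ms': [10, 20, 30, 40, 50, 60],
})

# Group by the hour each request falls into, then compute the quantiles per hour.
table = (
    demo.groupby(demo['timestamp'].dt.floor('h'))['request_duration_in_ms']
    .quantile([0.5, 1])
    .unstack(level=-1)
)
print(table)
```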