[PYTHON] Analyze Apache access logs with Pandas and Matplotlib

Overview

When you try to analyze access logs from servers such as Apache or Tomcat, you find there are surprisingly few good tools. I would like a tool that supports various output formats, filters out just the necessary information, and makes the results easy to understand and visualize, but I haven't been able to find one.

So I tried Pandas and Matplotlib, the de facto standards in the world of data analysis, to see how easily Apache access logs can be analyzed and visualized in a Jupyter Notebook.

Read access log

Import required libraries

First, import the minimum required libraries, Pandas and Matplotlib.

import pandas as pd
from pandas import DataFrame, Series
import matplotlib.pyplot as plt

Configure the environment

Set the environment according to your preference.

# Draw graphs inline in the notebook
%matplotlib inline

# Raise the maximum displayed string length of DataFrame columns from the default of 50 to 150
pd.set_option("display.max_colwidth", 150)

Load the access log

I referred to this blog entry for how to load the access log. Before loading it, define the parsing functions needed for type conversion.

from datetime import datetime
import pytz

def parse_str(x):
    # Strip the surrounding quotes or brackets from a field
    return x[1:-1]

def parse_datetime(x):
    # Parse a timestamp like [24/Mar/2020:18:20:00 +0900]
    dt = datetime.strptime(x[1:-7], '%d/%b/%Y:%H:%M:%S')
    # Convert the timezone offset (e.g. +0900) into minutes
    dt_tz = int(x[-6:-3])*60+int(x[-3:-1])
    return dt.replace(tzinfo=pytz.FixedOffset(dt_tz))
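
As a quick sanity check, parse_datetime can be applied to a sample timestamp (the value below is made up, but in the same format):

# Expected: a timezone-aware datetime in the +09:00 offset
parse_datetime('[24/Mar/2020:18:20:00 +0900]')
# -> 2020-03-24 18:20:00+09:00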

Then load the access log with pd.read_csv(). The output format of my access log was slightly different from the one in the blog entry, so I adjusted the code as follows. For simplicity, only some of the columns are extracted this time.

df = pd.read_csv(
    '/var/log/httpd/access_log',
    # Split on whitespace, except inside double quotes or square brackets
    sep=r'\s(?=(?:[^"]*"[^"]*")*[^"]*$)(?![^\[]*\])',
    engine='python',
    na_values='-',
    header=None,
    usecols=[0, 4, 6, 7, 8, 9, 10],
    names=['ip', 'time', 'response_time', 'request', 'status', 'size', 'user_agent'],
    converters={'time': parse_datetime,
                'response_time': int,
                'request': parse_str,
                'status': int,
                'size': int,
                'user_agent': parse_str})

Analyze access log

After loading the access log, let's analyze it.

Check the raw data

First, look at the first and last five rows of the DataFrame.

df

(Screenshot: the first and last five rows of the DataFrame)

You can see that there were 114,004 requests in about 1 hour and 20 minutes from 18:20 on March 24th.
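
Rather than reading these numbers off the display, they can also be computed directly; for example:

# Total number of requests and the time span they cover
print(len(df))
print(df['time'].min(), '-', df['time'].max())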

Check the response time aggregates

Let's display the minimum, average, median, and maximum response times.

print('Response time aggregation results (microseconds)\n')
print('Minimum value: {}'.format(str(df['response_time'].min()).rjust(10)))
print('Average value: {}'.format(str(round(df['response_time'].mean())).rjust(10)))
print('Median value:  {}'.format(str(round(df['response_time'].median())).rjust(10)))
print('Maximum value: {}'.format(str(df['response_time'].max()).rjust(10)))

print('\nWorst 15 response times')
df.sort_values('response_time', ascending=False).head(15)

(Screenshot: the worst 15 response times)

The 15 worst response times were all requests to the OpenAM authentication service, each taking about 15 seconds to return a response with HTTP status 406. Clearly something was wrong in an application that works with OpenAM.
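
To dig further than the top 15, one simple approach is to filter on a threshold; a minimal sketch (the 10-second cutoff is arbitrary):

# All requests that took longer than 10 seconds (10,000,000 microseconds)
slow_df = df[df['response_time'] > 10000000]
print(len(slow_df))
slow_df['request'].value_counts().head()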

Check for missing values

Looking at the worst 15 response times, `size` (response size) and `user_agent` (user agent) are `NaN`. Let's check how many missing values there are.

df.isnull().sum()

(Screenshot: missing-value counts per column)

If `size` is `NaN`, the response was a redirect. Also, about 30% of the `user_agent` values are missing. Since it's not just end users' browsers that access OpenAM, it's likely that many clients simply don't send a "User-Agent" header. Let's find out what user agents do appear.
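
Before moving on, the redirect explanation can be sanity-checked by looking at the status codes of the rows whose `size` is missing:

# Status codes of responses with no size (expected to be mostly 3xx redirects)
df[df['size'].isnull()]['status'].value_counts()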

Check the user agents

# Count the requests per user-agent string
ua_df = df.groupby('user_agent').size().reset_index(name='count')
ua_df

(Screenshot: request counts per user agent)

There were 490 distinct user agents. Presumably there are so many because it isn't just end users accessing the service but also many cooperating applications.
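
Before visualizing them, the most common ones can also be listed directly; for example:

# The ten most frequent user-agent strings
ua_df.sort_values('count', ascending=False).head(10)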

Visualize access logs

Nothing has been visualized so far, which is not very interesting. Let's draw some pie charts. First, plot the share of each response status code class.

plt.figure(figsize = (5, 5))
labels = ['2xx', '3xx', '4xx', '5xx']
# Group the status codes by their hundreds digit (2xx, 3xx, 4xx, 5xx)
plt.pie(df.groupby([df['status'] // 100]).size(), autopct = '%1.1f%%', labels = labels, shadow = True, startangle = 90)
plt.axis('equal')
plt.show()

(Screenshot: pie chart of status code classes)

Hmm, the labels for the low-percentage status codes overlap and are hard to read...

Define utility functions for visualization

So let's define functions that group minor elements (those at 1% or less) into an "others" bucket. Matplotlib may have a built-in way to do this, but I couldn't find one, so I wrote simple ones myself.

# For DataFrames: fold rows whose share is 1% or less into an "others" row
def replace_df_minors_with_others(df_before, column_name):
    total = df_before[column_name].sum()
    # Keep the rows that account for more than 1% of the total
    df_after = df_before[df_before[column_name] / total > 0.01]
    df_after = df_after.sort_values([column_name], ascending=False).reset_index(drop=True)
    # Sum everything else into a single "others" row
    df_after.loc[len(df_after)] = ['others', total - df_after[column_name].sum()]
    return df_after

# For dictionaries: fold values whose share is 1% or less into an "others" key
def replace_dict_minors_with_others(dict_before):
    total = sum(dict_before.values())
    dict_after = {}
    others = 0
    for key, value in dict_before.items():
        if value / total > 0.01:
            dict_after[key] = value
        else:
            others += value
    # Sort by count in descending order, with "others" last
    dict_after = dict(sorted(dict_after.items(), key=lambda item: item[1], reverse=True))
    dict_after['others'] = others
    return dict_after
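
As a quick check, here is what the DataFrame version does to some toy data (the values are made up):

# 'c' is at 0.5% of the total, so it is folded into "others"
demo = DataFrame({'user_agent': ['a', 'b', 'c'], 'count': [980, 15, 5]})
replace_df_minors_with_others(demo, 'count')
# -> rows for a (980) and b (15), plus others (5)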

Visualize the user agents

Now let's use this function to display the user agent types in a pie chart.

plt.figure(figsize = (15, 10))
ua_df_with_others = replace_df_minors_with_others(ua_df, 'count')
plt.pie(ua_df_with_others['count'], labels = ua_df_with_others['user_agent'], autopct = '%1.1f%%', shadow = True, startangle = 90)
plt.title('User Agent')
plt.show()

(Screenshot: pie chart of user agents)

Displaying the raw "User-Agent" header values is still hard to interpret. What on earth is "Mozilla/5.0 (Windows NT 6.1; WOW64; Trident/7.0; rv:11.0) like Gecko"?

So let's convert them to a human-readable form using a library called Woothee. Install it with the following command.

$ pip install woothee
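
woothee.parse() returns a dictionary with keys such as name, version, os, and os_version. Applied to the cryptic string above, it should identify Internet Explorer 11 on Windows 7, roughly like this:

import woothee

woothee.parse('Mozilla/5.0 (Windows NT 6.1; WOW64; Trident/7.0; rv:11.0) like Gecko')
# -> {'name': 'Internet Explorer', 'version': '11.0', 'os': 'Windows 7', ...}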

Let's use this to display the client OS and user agent in a pie chart.

import woothee

ua_counter = {}
os_counter = {}

# Tally browser and OS names for each distinct user-agent string
for index, row in ua_df.sort_values(['count'], ascending=False).iterrows():
    ua = woothee.parse(row['user_agent'])
    ua_key = '{} ({})'.format(ua.get('name'), ua.get('version'))
    ua_counter[ua_key] = ua_counter.get(ua_key, 0) + 1
    os_key = '{} ({})'.format(ua.get('os'), ua.get('os_version'))
    os_counter[os_key] = os_counter.get(os_key, 0) + 1

plt.figure(figsize = (15, 10))
plt.subplot(1,2,1)
plt.title('Client OS')
os_counter_with_others = replace_dict_minors_with_others(os_counter)
plt.pie(os_counter_with_others.values(), labels = os_counter_with_others.keys(), autopct = '%1.1f%%', shadow = True, startangle = 90)

plt.subplot(1,2,2)
plt.title('User Agent')
ua_counter_with_others = replace_dict_minors_with_others(ua_counter)
plt.pie(ua_counter_with_others.values(), labels = ua_counter_with_others.keys(), autopct = '%1.1f%%', shadow = True, startangle = 90)
plt.show()

(Screenshot: pie charts of client OS and user agent)

"UNKNOWN" has increased, but has it become easier to understand? Even so, many people still use IE on Windows ...

Visualize the error status codes

Next, let's look at the breakdown of the error response status codes (400 and above).

error_df = df[df['status'] >= 400]
plt.figure(figsize = (10, 10))

# Outer ring: 4xx vs. 5xx
labels = ['4xx', '5xx']
plt.pie(error_df.groupby([error_df['status'] // 100]).count().time, labels=labels, counterclock=False, startangle=90)

# Inner ring: the individual status codes present in this log
labels2 = ['400', '401', '403', '404', '406', '500', '501', '502']
plt.pie(error_df.groupby(['status']).count().time, labels=labels2, counterclock=False, startangle=90, radius=0.7)

# Hollow out the center to turn the chart into a donut
centre_circle = plt.Circle((0,0),0.4, fc='white')
fig = plt.gcf()
fig.gca().add_artist(centre_circle)
plt.title('Error Status Code')
plt.show()

(Screenshot: donut chart of error status codes)

A distinctive feature of this access log is the large number of error responses with status code 406 (Not Acceptable), a code you rarely see elsewhere.
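
To see where these come from, the 406 responses can be grouped by request line; for example:

# The requests that most often returned 406
df[df['status'] == 406]['request'].value_counts().head(10)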

Visualize the load

Now let's draw a different kind of graph. First, check the load over time.

plt.figure(figsize = (15, 5))
access = df['request']
access.index = df['time']
# Count the number of requests per second
access = access.resample('S').count()
access.index.name = 'Time'
access.plot()
plt.title('Total Access')
plt.ylabel('Access')
plt.show()

(Screenshot: requests per second over time)

There is constant traffic, sometimes exceeding 100 requests per second.
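
If the per-second counts look too noisy, the same data can be resampled per minute instead; a minimal variation of the code above:

# Requests per minute give a smoother curve
per_min = df['request'].copy()
per_min.index = df['time']
per_min.resample('T').count().plot(figsize=(15, 5), title='Requests per Minute')
plt.show()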

Visualize the relationship between response size and time

Let's see if there is any relationship between the size of the response and the response time.

plt.figure(figsize = (15, 5))
plt.title('size v.s. response_time')
# Convert size to KB and response time from microseconds to seconds
plt.scatter(df['size']/1000, df['response_time']/1000000, marker='.')
plt.xlabel('Size (KB)')
plt.ylabel('Response Time (s)')
plt.grid()

(Screenshot: scatter plot of response size vs. response time)

While there is no clear relationship, for responses with a non-zero size there does seem to be a slight tendency for larger responses to take longer.
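
One way to put a number on this is the correlation coefficient; for example:

# Correlation between response size and response time (missing sizes are ignored)
df[['size', 'response_time']].corr()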

Finally

Pandas and Matplotlib turn out to be quite capable of analyzing access logs. If you master these libraries, you should be able to put them to full use for troubleshooting.

It's still at the prototype stage, but I have also committed the notebook to GitHub.
