S3 has a Logging feature that writes request logs to a bucket you specify; when Static Website Hosting is enabled, these double as access logs. The idea here is to put an event notification on that target bucket, have it trigger a Lambda function, and let the function fetch, parse, and load the logs into BigQuery for analysis.
- Runtime: Python 2.7
- Required modules (versions actually used in parentheses)
  - boto3 (bundled with Lambda)
  - pytz (2015.7)
  - gcloud (0.8.0)
The log format is documented here: https://docs.aws.amazon.com/AmazonS3/latest/dev/LogFormat.html
Written to follow that format, the regular expression looks like this; each named group corresponds directly to a column name in the schema below.
^(?P<owner>[^ ]+) (?P<bucket>[^ ]+) \[(?P<datetime>.+)\] (?P<remote_ip>[^ ]+) (?P<requester>[^ ]+) (?P<request_id>[^ ]+) (?P<operation>[^ ]+) (?P<key>[^ ]+) "(?P<method>[^ ]+) (?P<uri>[^ ]+) (?P<proto>.+)" (?P<status>[^ ]+) (?P<error>[^ ]+) (?P<bytes>[^ ]+) (?P<size>[^ ]+) (?P<total_time>[^ ]+) (?P<ta_time>[^ ]+) "(?P<referrer>.+)" "(?P<user_agent>.+)" (?P<version>.+)$
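To sanity-check the pattern, you can match it against a made-up line that merely follows the documented format (the values below are illustrative, not from a real log):

```python
import re

# Hypothetical log line shaped like the documented S3 access log format
sample = ('EXAMPLEOWNERID examplebucket [06/Feb/2016:00:00:38 +0000] 192.0.2.3 - '
          '3E57427F3EXAMPLE REST.GET.OBJECT index.html "GET /index.html HTTP/1.1" '
          '200 - 1024 2048 7 4 "-" "Mozilla/5.0" -')

# Same pattern as above, written out in one piece
log_pattern = re.compile(
    '^(?P<owner>[^ ]+) (?P<bucket>[^ ]+) \[(?P<datetime>.+)\] (?P<remote_ip>[^ ]+) '
    '(?P<requester>[^ ]+) (?P<request_id>[^ ]+) (?P<operation>[^ ]+) (?P<key>[^ ]+) '
    '"(?P<method>[^ ]+) (?P<uri>[^ ]+) (?P<proto>.+)" (?P<status>[^ ]+) (?P<error>[^ ]+) '
    '(?P<bytes>[^ ]+) (?P<size>[^ ]+) (?P<total_time>[^ ]+) (?P<ta_time>[^ ]+) '
    '"(?P<referrer>.+)" "(?P<user_agent>.+)" (?P<version>.+)$')

m = log_pattern.match(sample)
print(m.group('datetime'))    # 06/Feb/2016:00:00:38 +0000
print(m.group('status'))      # 200
print(m.group('user_agent'))  # Mozilla/5.0
```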
The `datetime` column would arguably be better as TIMESTAMP, but I went with STRING this time for reasons explained in detail later.
Column name | Type |
---|---|
owner | STRING |
bucket | STRING |
datetime | STRING |
remote_ip | STRING |
requester | STRING |
request_id | STRING |
operation | STRING |
key | STRING |
method | STRING |
uri | STRING |
proto | STRING |
status | STRING |
error | STRING |
bytes | INTEGER |
size | INTEGER |
total_time | INTEGER |
ta_time | INTEGER |
referrer | STRING |
user_agent | STRING |
version | STRING |
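For reference, here is a minimal sketch of creating a table with this schema up front, using the same gcloud 0.8.0 client as the function below. It assumes that version's `SchemaField`, `exists()` and `create()` helpers (later client versions have a different API); the project, dataset, and table names and the `bq.json` key file are placeholders.

```python
import os
from gcloud import bigquery
from gcloud.bigquery import SchemaField

# (column name, BigQuery type) pairs, matching the table above
COLUMNS = [
    ('owner', 'STRING'), ('bucket', 'STRING'), ('datetime', 'STRING'),
    ('remote_ip', 'STRING'), ('requester', 'STRING'), ('request_id', 'STRING'),
    ('operation', 'STRING'), ('key', 'STRING'), ('method', 'STRING'),
    ('uri', 'STRING'), ('proto', 'STRING'), ('status', 'STRING'),
    ('error', 'STRING'), ('bytes', 'INTEGER'), ('size', 'INTEGER'),
    ('total_time', 'INTEGER'), ('ta_time', 'INTEGER'), ('referrer', 'STRING'),
    ('user_agent', 'STRING'), ('version', 'STRING'),
]

# Placeholder identifiers; replace with your own values
bq = bigquery.Client.from_service_account_json(
    os.path.join(os.path.dirname(__file__), 'bq.json'),
    project='<your-project-id>')
dataset = bq.dataset('<your-dataset-name>')
schema = [SchemaField(name, col_type) for name, col_type in COLUMNS]
table = dataset.table(name='<your-table-name>', schema=schema)

if not dataset.exists():
    dataset.create()
if not table.exists():
    table.create()
```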
Replace the `<your-*>` parts as appropriate.
import os
import json
import urllib
import boto3
import re
import datetime
import pytz
from gcloud import bigquery

BQ_PROJECT = '<your-project-id>'
BQ_DATASET = '<your-dataset-name>'
BQ_TABLE = '<your-table-name>'

s3 = boto3.client('s3')

# Authenticate to BigQuery with the service account key bundled as bq.json
bq = bigquery.Client.from_service_account_json(
    os.path.join(os.path.dirname(__file__), 'bq.json'),
    project=BQ_PROJECT)
dataset = bq.dataset(BQ_DATASET)
table = dataset.table(name=BQ_TABLE)
table.reload()  # fetch the existing table definition (schema) from BigQuery

# One named group per field of the S3 access log format
pattern = ' '.join([
    '^(?P<owner>[^ ]+)',
    '(?P<bucket>[^ ]+)',
    '\[(?P<datetime>.+)\]',
    '(?P<remote_ip>[^ ]+)',
    '(?P<requester>[^ ]+)',
    '(?P<request_id>[^ ]+)',
    '(?P<operation>[^ ]+)',
    '(?P<key>[^ ]+)',
    '"(?P<method>[^ ]+) (?P<uri>[^ ]+) (?P<proto>.+)"',
    '(?P<status>[^ ]+)',
    '(?P<error>[^ ]+)',
    '(?P<bytes>[^ ]+)',
    '(?P<size>[^ ]+)',
    '(?P<total_time>[^ ]+)',
    '(?P<ta_time>[^ ]+)',
    '"(?P<referrer>.+)"',
    '"(?P<user_agent>.+)"',
    '(?P<version>.+)$'])
log_pattern = re.compile(pattern)


def to_int(val):
    # Empty log fields are written as '-', which int() cannot parse
    try:
        ret = int(val)
    except ValueError:
        ret = None
    return ret


def lambda_handler(event, context):
    # The S3 event notification tells us which log object was just created
    bucket = event['Records'][0]['s3']['bucket']['name']
    key = urllib.unquote_plus(event['Records'][0]['s3']['object']['key']).decode('utf8')
    res = s3.get_object(Bucket=bucket, Key=key)
    body = res['Body'].read()
    rows = []
    for line in body.splitlines():
        matches = log_pattern.match(line)
        # Python < 3.2 cannot parse '%z', so drop the offset and attach UTC via pytz
        dt_str = matches.group('datetime').split(' ')[0]
        timestamp = datetime.datetime.strptime(
            dt_str, '%d/%b/%Y:%H:%M:%S').replace(tzinfo=pytz.utc)
        rows.append((
            matches.group('owner'),
            matches.group('bucket'),
            timestamp.strftime('%Y-%m-%d %H:%M:%S'),
            matches.group('remote_ip'),
            matches.group('requester'),
            matches.group('request_id'),
            matches.group('operation'),
            matches.group('key'),
            matches.group('method'),
            matches.group('uri'),
            matches.group('proto'),
            matches.group('status'),
            matches.group('error'),
            to_int(matches.group('bytes')),
            to_int(matches.group('size')),
            to_int(matches.group('total_time')),
            to_int(matches.group('ta_time')),
            matches.group('referrer'),
            matches.group('user_agent'),
            matches.group('version'),))
    # Stream the parsed rows into BigQuery; insert_data() returns any per-row errors
    print(table.insert_data(rows))
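Before wiring up the actual event notification, you can smoke-test the handler locally by calling it with a hand-built event that mimics the structure the function reads (`Records[0].s3.bucket.name` and `Records[0].s3.object.key`). This is just a sketch: it assumes AWS and GCP credentials are available locally, and the bucket and key below are placeholders for a log object that actually exists.

```python
# Hypothetical local invocation of the handler above; bucket and key are placeholders
if __name__ == '__main__':
    fake_event = {
        'Records': [{
            's3': {
                'bucket': {'name': '<your-log-bucket>'},
                'object': {'key': 'logs/2015-11-26-00-00-00-EXAMPLE'},
            },
        }],
    }
    lambda_handler(fake_event, None)
```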
- Get JSON-formatted credentials (a service account key) from GCP's API Manager and include them in your Lambda deployment package under the name `bq.json` [^1]
- In the handler, the `datetime` field should in principle be parseable with `%d/%b/%Y:%H:%M:%S %z`, but `strptime` in Python < 3.2 does not support `%z` (it raises an error), so I fall back on the slightly irregular workaround using `pytz` (as shown in the sketch after this list) [^2]
- Strictly speaking, a TIMESTAMP-type column should be given a datetime object, but, possibly due to a bug in gcloud (0.8.0), the value ends up stored in milliseconds where BigQuery expects seconds, producing an impossible date. That is why I chose STRING for the `datetime` column.
- Any log field with no value to output is written as `-`. Be careful with the fields you want to convert to int: calling `int('-')` as is raises an exception, hence the `to_int()` helper.
- The official documentation for the Google Cloud Client Library for Python (gcloud) is out of date in places, so I had to read the source (https://github.com/GoogleCloudPlatform/gcloud-python) while implementing this (as of November 26, 2015)
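To make the `%z` and `-` points above concrete, here is a small Python 2.7 sketch (the datetime value is illustrative):

```python
import datetime
import pytz

raw = '26/Nov/2015:09:00:00 +0000'  # illustrative datetime field from a log line

# strptime(raw, '%d/%b/%Y:%H:%M:%S %z') raises ValueError on Python < 3.2,
# so split off the offset and attach UTC with pytz instead.
dt = datetime.datetime.strptime(raw.split(' ')[0], '%d/%b/%Y:%H:%M:%S')
dt = dt.replace(tzinfo=pytz.utc)
print(dt.strftime('%Y-%m-%d %H:%M:%S'))  # 2015-11-26 09:00:00

# Empty log fields come through as '-'; int('-') raises ValueError, hence to_int()
def to_int(val):
    try:
        return int(val)
    except ValueError:
        return None

print(to_int('1024'))  # 1024
print(to_int('-'))     # None
```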
By the way, for deploying Python Lambda functions I use the following tool of my own. It is still under development, but it is already quite convenient for everyday use, so please give it a try if you like. https://github.com/marcy-terui/lamvery
[^1]: I am still looking for a smart way to pass confidential information to Lambda.
[^2]: I really want Python 3 support as soon as possible.