S3 has a Logging feature that writes request logs to a bucket you specify; when Static Website Hosting is enabled, these double as access logs. The idea here is to put an event notification on that target bucket, have it trigger a Lambda function, and let the function fetch, parse, and load the logs into BigQuery for analysis.
- Runtime: Python 2.7
- Required modules (versions actually used in parentheses)
  - boto3 (bundled with Lambda)
  - pytz (2015.7)
  - gcloud (0.8.0)
The log format is documented here: https://docs.aws.amazon.com/AmazonS3/latest/dev/LogFormat.html
Written to follow that format, the regular expression looks like this; each named group corresponds directly to a column name in the schema below.
^(?P<owner>[^ ]+) (?P<bucket>[^ ]+) \[(?P<datetime>.+)\] (?P<remote_ip>[^ ]+) (?P<requester>[^ ]+) (?P<request_id>[^ ]+) (?P<operation>[^ ]+) (?P<key>[^ ]+) "(?P<method>[^ ]+) (?P<uri>[^ ]+) (?P<proto>.+)" (?P<status>[^ ]+) (?P<error>[^ ]+) (?P<bytes>[^ ]+) (?P<size>[^ ]+) (?P<total_time>[^ ]+) (?P<ta_time>[^ ]+) "(?P<referrer>.+)" "(?P<user_agent>.+)" (?P<version>.+)$
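To sanity-check the pattern, you can match it against a made-up line that merely follows the documented format (the values below are illustrative, not from a real log):

```python
import re

# Hypothetical log line shaped like the documented S3 access log format
sample = ('EXAMPLEOWNERID examplebucket [06/Feb/2016:00:00:38 +0000] 192.0.2.3 - '
          '3E57427F3EXAMPLE REST.GET.OBJECT index.html "GET /index.html HTTP/1.1" '
          '200 - 1024 2048 7 4 "-" "Mozilla/5.0" -')

# Same pattern as above, written out in one piece
log_pattern = re.compile(
    '^(?P<owner>[^ ]+) (?P<bucket>[^ ]+) \[(?P<datetime>.+)\] (?P<remote_ip>[^ ]+) '
    '(?P<requester>[^ ]+) (?P<request_id>[^ ]+) (?P<operation>[^ ]+) (?P<key>[^ ]+) '
    '"(?P<method>[^ ]+) (?P<uri>[^ ]+) (?P<proto>.+)" (?P<status>[^ ]+) (?P<error>[^ ]+) '
    '(?P<bytes>[^ ]+) (?P<size>[^ ]+) (?P<total_time>[^ ]+) (?P<ta_time>[^ ]+) '
    '"(?P<referrer>.+)" "(?P<user_agent>.+)" (?P<version>.+)$')

m = log_pattern.match(sample)
print(m.group('datetime'))    # 06/Feb/2016:00:00:38 +0000
print(m.group('status'))      # 200
print(m.group('user_agent'))  # Mozilla/5.0
```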
The `datetime` column would arguably be better as TIMESTAMP, but I went with STRING this time for reasons explained in detail later.
Column name | Type |
---|---|
owner | STRING |
bucket | STRING |
datetime | STRING |
remote_ip | STRING |
requester | STRING |
request_id | STRING |
operation | STRING |
key | STRING |
method | STRING |
uri | STRING |
proto | STRING |
status | STRING |
error | STRING |
bytes | INTEGER |
size | INTEGER |
total_time | INTEGER |
ta_time | INTEGER |
referrer | STRING |
user_agent | STRING |
version | STRING |
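For reference, here is a minimal sketch of creating a table with this schema up front, using the same gcloud 0.8.0 client as the function below. It assumes that version's `SchemaField`, `exists()` and `create()` helpers (later client versions have a different API); the project, dataset, and table names and the `bq.json` key file are placeholders.

```python
import os
from gcloud import bigquery
from gcloud.bigquery import SchemaField

# (column name, BigQuery type) pairs, matching the table above
COLUMNS = [
    ('owner', 'STRING'), ('bucket', 'STRING'), ('datetime', 'STRING'),
    ('remote_ip', 'STRING'), ('requester', 'STRING'), ('request_id', 'STRING'),
    ('operation', 'STRING'), ('key', 'STRING'), ('method', 'STRING'),
    ('uri', 'STRING'), ('proto', 'STRING'), ('status', 'STRING'),
    ('error', 'STRING'), ('bytes', 'INTEGER'), ('size', 'INTEGER'),
    ('total_time', 'INTEGER'), ('ta_time', 'INTEGER'), ('referrer', 'STRING'),
    ('user_agent', 'STRING'), ('version', 'STRING'),
]

# Placeholder identifiers; replace with your own values
bq = bigquery.Client.from_service_account_json(
    os.path.join(os.path.dirname(__file__), 'bq.json'),
    project='<your-project-id>')
dataset = bq.dataset('<your-dataset-name>')
schema = [SchemaField(name, col_type) for name, col_type in COLUMNS]
table = dataset.table(name='<your-table-name>', schema=schema)

if not dataset.exists():
    dataset.create()
if not table.exists():
    table.create()
```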
Replace the `<your-*>` parts as appropriate.
import os
import json
import urllib
import boto3
import re
import datetime
import pytz
from gcloud import bigquery

BQ_PROJECT = '<your-project-id>'
BQ_DATASET = '<your-dataset-name>'
BQ_TABLE = '<your-table-name>'

s3 = boto3.client('s3')

# Authenticate to BigQuery with the service account key bundled as bq.json
bq = bigquery.Client.from_service_account_json(
    os.path.join(os.path.dirname(__file__), 'bq.json'),
    project=BQ_PROJECT)
dataset = bq.dataset(BQ_DATASET)
table = dataset.table(name=BQ_TABLE)
table.reload()  # fetch the existing table definition (schema) from BigQuery

# One named group per field of the S3 access log format
pattern = ' '.join([
    '^(?P<owner>[^ ]+)',
    '(?P<bucket>[^ ]+)',
    '\[(?P<datetime>.+)\]',
    '(?P<remote_ip>[^ ]+)',
    '(?P<requester>[^ ]+)',
    '(?P<request_id>[^ ]+)',
    '(?P<operation>[^ ]+)',
    '(?P<key>[^ ]+)',
    '"(?P<method>[^ ]+) (?P<uri>[^ ]+) (?P<proto>.+)"',
    '(?P<status>[^ ]+)',
    '(?P<error>[^ ]+)',
    '(?P<bytes>[^ ]+)',
    '(?P<size>[^ ]+)',
    '(?P<total_time>[^ ]+)',
    '(?P<ta_time>[^ ]+)',
    '"(?P<referrer>.+)"',
    '"(?P<user_agent>.+)"',
    '(?P<version>.+)$'])
log_pattern = re.compile(pattern)


def to_int(val):
    # Empty log fields are written as '-', which int() cannot parse
    try:
        ret = int(val)
    except ValueError:
        ret = None
    return ret


def lambda_handler(event, context):
    # The S3 event notification tells us which log object was just created
    bucket = event['Records'][0]['s3']['bucket']['name']
    key = urllib.unquote_plus(event['Records'][0]['s3']['object']['key']).decode('utf8')
    res = s3.get_object(Bucket=bucket, Key=key)
    body = res['Body'].read()
    rows = []
    for line in body.splitlines():
        matches = log_pattern.match(line)
        # Python < 3.2 cannot parse '%z', so drop the offset and attach UTC via pytz
        dt_str = matches.group('datetime').split(' ')[0]
        timestamp = datetime.datetime.strptime(
            dt_str, '%d/%b/%Y:%H:%M:%S').replace(tzinfo=pytz.utc)
        rows.append((
            matches.group('owner'),
            matches.group('bucket'),
            timestamp.strftime('%Y-%m-%d %H:%M:%S'),
            matches.group('remote_ip'),
            matches.group('requester'),
            matches.group('request_id'),
            matches.group('operation'),
            matches.group('key'),
            matches.group('method'),
            matches.group('uri'),
            matches.group('proto'),
            matches.group('status'),
            matches.group('error'),
            to_int(matches.group('bytes')),
            to_int(matches.group('size')),
            to_int(matches.group('total_time')),
            to_int(matches.group('ta_time')),
            matches.group('referrer'),
            matches.group('user_agent'),
            matches.group('version'),))
    # Stream the parsed rows into BigQuery; insert_data() returns any per-row errors
    print(table.insert_data(rows))
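Before wiring up the actual event notification, you can smoke-test the handler locally by calling it with a hand-built event that mimics the structure the function reads (`Records[0].s3.bucket.name` and `Records[0].s3.object.key`). This is just a sketch: it assumes AWS and GCP credentials are available locally, and the bucket and key below are placeholders for a log object that actually exists.

```python
# Hypothetical local invocation of the handler above; bucket and key are placeholders
if __name__ == '__main__':
    fake_event = {
        'Records': [{
            's3': {
                'bucket': {'name': '<your-log-bucket>'},
                'object': {'key': 'logs/2015-11-26-00-00-00-EXAMPLE'},
            },
        }],
    }
    lambda_handler(fake_event, None)
```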
- Get JSON-formatted credentials (a service account key) from GCP's API Manager and include them in your Lambda deployment package under the name `bq.json` [^1]
- In the handler, the `datetime` field should in principle be parseable with `%d/%b/%Y:%H:%M:%S %z`, but `strptime` in Python < 3.2 does not support `%z` (it raises an error), so I fall back on the slightly irregular workaround using `pytz` (as shown in the sketch after this list) [^2]
- Strictly speaking, a TIMESTAMP-type column should be given a datetime object, but, possibly due to a bug in gcloud (0.8.0), the value ends up stored in milliseconds where BigQuery expects seconds, producing an impossible date. That is why I chose STRING for the `datetime` column.
- Any log field with no value to output is written as `-`. Be careful with the fields you want to convert to int: calling `int('-')` as is raises an exception, hence the `to_int()` helper.
- The official documentation for the Google Cloud Client Library for Python (gcloud) is out of date in places, so I had to read the source (https://github.com/GoogleCloudPlatform/gcloud-python) while implementing this (as of November 26, 2015)
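To make the `%z` and `-` points above concrete, here is a small Python 2.7 sketch (the datetime value is illustrative):

```python
import datetime
import pytz

raw = '26/Nov/2015:09:00:00 +0000'  # illustrative datetime field from a log line

# strptime(raw, '%d/%b/%Y:%H:%M:%S %z') raises ValueError on Python < 3.2,
# so split off the offset and attach UTC with pytz instead.
dt = datetime.datetime.strptime(raw.split(' ')[0], '%d/%b/%Y:%H:%M:%S')
dt = dt.replace(tzinfo=pytz.utc)
print(dt.strftime('%Y-%m-%d %H:%M:%S'))  # 2015-11-26 09:00:00

# Empty log fields come through as '-'; int('-') raises ValueError, hence to_int()
def to_int(val):
    try:
        return int(val)
    except ValueError:
        return None

print(to_int('1024'))  # 1024
print(to_int('-'))     # None
```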
By the way, for deploying Python Lambda functions I use the following tool of my own. It is still under development, but it is already quite convenient for everyday use, so please give it a try if you like. https://github.com/marcy-terui/lamvery
[^1]: I am still looking for a smart way to pass confidential information to Lambda.
[^2]: I really want Python 3 support as soon as possible.