[PYTHON] Aggregate AWS S3 data

Simply put, S3 is AWS's cloud storage service, which can store large amounts of data.

・ What is Amazon S3? http://docs.aws.amazon.com/ja_jp/AmazonS3/latest/dev/Welcome.html

This time I needed to aggregate data stored on S3 and display it as a graph, so this post explains the script I used for that.

Work environment: macOS Sierra 10.12.5, Python 2.7.10

boto

First, you need boto to access AWS from Python. The latest version is boto3, so install boto3.

$ pip install boto3

After installing boto3, import it in Python and access S3 as follows. Specify the name of the bucket whose data you want to fetch.

import boto3

s3_resource = boto3.resource('s3')
bucket = s3_resource.Bucket('bucket_name')

A bucket usually holds many objects, so if you only want some of them, narrow them down with filter. Prefix matches the beginning of the object key. If you need all of them, get them with all.

# Only the objects whose key starts with the given prefix
obj = bucket.objects.filter(Prefix='filter_word')
# Or every object in the bucket
obj = bucket.objects.all()

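Download the object data. filter and all return a collection of objects rather than a single object, so loop over the collection and download each object by its key. Filename is the name used when saving locally, and its extension should be the same as that of the file located in S3. 'file_name' is a placeholder: use a different name for each object, otherwise each download overwrites the previous one.

for obj in bucket.objects.filter(Prefix='filter_word'):
    bucket.download_file(Key=obj.key, Filename='file_name')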
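Since filter returns many objects, downloading everything under a prefix needs a distinct local name per object. The following is a minimal sketch of one way to do that, written for Python 3. The helper name download_prefix and the dest_dir argument are hypothetical; only objects.filter(Prefix=...), obj.key and download_file are boto3 calls. It assumes the base names of the keys do not collide.

```python
import os


def download_prefix(bucket, prefix, dest_dir):
    """Download every object whose key starts with prefix into dest_dir.

    bucket is expected to be a boto3 Bucket resource, for example
    boto3.resource('s3').Bucket('bucket_name').
    """
    os.makedirs(dest_dir, exist_ok=True)
    saved = []
    for obj in bucket.objects.filter(Prefix=prefix):
        # Keys ending in '/' are folder placeholders with no file content
        if obj.key.endswith('/'):
            continue
        # Keep the original base name so the extension matches the S3 object
        local_path = os.path.join(dest_dir, os.path.basename(obj.key))
        bucket.download_file(Key=obj.key, Filename=local_path)
        saved.append(local_path)
    return saved
```

It could be called as download_prefix(bucket, 'filter_word', 'downloads'), which returns the list of local paths that were written.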

Now you can download the S3 data. After that, open the file and aggregate using a list or dictionary according to the data format.
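What the aggregation looks like depends entirely on the contents of the files, which differ from case to case. Purely as an illustration, the sketch below assumes a hypothetical format in which every line is a record such as '2017-06-01 13:45:10,Name 1', and counts how many times each name appears in each hour using nested dictionaries. The function name count_by_hour is also hypothetical.

```python
from collections import defaultdict


def count_by_hour(lines):
    """Count how often each name appears in each hour.

    Assumes (hypothetically) that every line looks like
    '2017-06-01 13:45:10,Name 1'.
    """
    counts = defaultdict(lambda: defaultdict(int))
    for line in lines:
        line = line.strip()
        if not line:
            continue
        timestamp, name = line.split(',', 1)
        # The first 13 characters are 'YYYY-MM-DD HH', i.e. the hour bucket
        hour = timestamp[:13]
        counts[hour][name] += 1
    # Convert to plain dicts so the result prints and compares cleanly
    return {hour: dict(names) for hour, names in counts.items()}
```

Because a file object yields its lines when iterated, the downloaded file can be passed in directly: open it with open('file_name') and call count_by_hour on the file object.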

Graphing

Next, graph the aggregated data. The aggregated data is saved as a text file in the following format, with the values for Name 1 to Name 6 recorded for each hour.

Time 1
Name 1:Number 1-1
Name 2:Number 1-2
Name 3:Number 1-3
Name 4:Number 1-4
Name 5:Number 1-5
Name 6:Number 1-6
Time 2
Name 1:Number 2-1
Name 2:Number 2-2
Name 3:Number 2-3
Name 4:Number 2-4
Name 5:Number 2-5
Name 6:Number 2-6
Time 3
・・・
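For reference, a file in the layout shown above can be produced from a nested dictionary. This is a minimal sketch that assumes the aggregated data is held as {time: {name: number}}; that structure and the function name format_aggregates are assumptions made for illustration.

```python
def format_aggregates(aggregates):
    """Render {time: {name: number}} as 'Time' lines followed by 'Name:Number' lines."""
    lines = []
    for time in sorted(aggregates):
        # One header line per time block
        lines.append(str(time))
        names = aggregates[time]
        for name in sorted(names):
            lines.append('{}:{}'.format(name, names[name]))
    return '\n'.join(lines) + '\n'
```

Note that the time label itself must not contain a colon (so '13:00' would not work as a label), because a reader that tells the two kinds of line apart by looking for ':' would mistake the header for a data line.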

Use pandas and matplotlib to graph this. pandas is a library that makes it easier to work with data in Python; once the data is in the data frame format, you can easily create a graph from it. matplotlib is used to draw graphs. Both can be installed with pip.

A data frame can be built by passing the values in directly, but it can also be created from a dictionary, so create a dictionary first and then convert it to the data frame format. Open the text file of aggregated data and fill a dictionary whose keys are the names and whose values are the lists of numbers for each name. Once the dictionary is filled, convert it to a data frame and create the graph.

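from collections import defaultdict

import pandas as pd

# Map each name to the list of its numbers, one entry per time block
results_dict = defaultdict(list)

with open('data.txt') as f:
    for line in f:
        results = line.rstrip()
        # Lines without ':' are the time headers, so only data lines are used
        if ':' in results:
            data = results.split(':')
            results_dict[data[0]].append(int(data[1]))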

# Convert the dictionary to a data frame (one column per name)
my_df = pd.DataFrame.from_dict(results_dict)
# Create the graph
my_df.plot(title='graph_title')

The graph is now created. Finally, use matplotlib to display it.

import matplotlib.pyplot as plt

plt.show()
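The reading loop above drops the time lines, so the x axis of the graph is just the row number. If the time labels should appear on the axis, one option is to collect them while parsing and use them as the index of the data frame. The sketch below assumes every time block lists every name, so that all columns have the same length; the function names parse_aggregates and to_frame are hypothetical.

```python
def parse_aggregates(lines):
    """Return (times, columns) from the 'Time / Name:Number' layout.

    times is the list of header lines; columns maps each name to its
    list of numbers, in the same order as times.
    """
    times = []
    columns = {}
    for line in lines:
        text = line.rstrip()
        if not text:
            continue
        if ':' in text:
            name, number = text.split(':', 1)
            columns.setdefault(name, []).append(int(number))
        else:
            # A line without ':' starts a new time block
            times.append(text)
    return times, columns


def to_frame(times, columns):
    """Build a DataFrame whose index is the list of time labels."""
    import pandas as pd  # imported here so the parser works without pandas
    return pd.DataFrame(columns, index=times)
```

With the file open as f, call times, columns = parse_aggregates(f), then to_frame(times, columns).plot(title='graph_title') followed by plt.show() as before.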

Other

You can use a library called seaborn to create more polished graphs. The example gallery on the official site shows the available kinds of graphs and how to use them. https://seaborn.pydata.org/examples/index.html
