[PYTHON] [AWS] Search and acquire necessary data from S3 files with S3 Select

Thing you want to do

I want to use S3 Select to search S3 files with SQL queries to get the records I need. Qiita logo

A happy event

Convenient when you want to quickly check the target data from the S3 file.
I had to put it in the database to search the query, but I don't need it anymore.
By combining with API Gateway, Lambda, etc., you can build serverless applications for S3 data.

Points of S3 Select

Targets one file on S3.
The file format supports CSV, JSON, and Apache Parquet formats.
Supports GZIP and BZIP2 compression formats.
It also supports SQL queries from the AWS SDK and AWS CLI.
Since the charge is based on "the amount of scanned data + the amount of returned data", it is better to compress the target file.

Try to operate from the console screen of S3

This time, we will place a sample file in the following format on S3 and verify it.

`sample.tsv`


male	14	student
female	26	employee
male	32	selfemployed
male	45	unemployed
female	11	student
male	24	employee
male	33	selfemployed
male	49	unemployed
female	57	unemployed

Select the target file from the S3 console screen and select the S3 Select tab.

File format: CSV Delimiter: tab Compression: None "next"

Enter the SQL you want to execute. I want to get a record whose first column is "female".

`SQL`


select * from s3object s where s._1 = 'female'

I was able to get it properly. You can download it as a CSV file by clicking "Download".

There are some sample SQLs, so give them a try.

Run SQL queries in Python from an EC2 instance

I'll also try the pattern of executing SQL queries in Python from an EC2 instance. First, install the required Python.

#Install Python3
$ sudo yum update -y
$ sudo yum install python3 -y

#Enable virtual environment
$ python3 -m venv ~/s3select_example/env
$ source ~/s3select_example/env/bin/activate

#install boto3
$ pip install pip --upgrade
$ pip install boto3

Next, create a Python file.

`S3Select.py`


import boto3

s3 = boto3.client('s3')

resp = s3.select_object_content(
    Bucket='my-bucket.01',
    Key='sample/sample.tsv',
    ExpressionType='SQL',
    Expression="SELECT * FROM s3object s where s._1 = 'female'",
    InputSerialization = {'CSV': {"FileHeaderInfo": "NONE", 'FieldDelimiter': '\t'}, 'CompressionType': 'NONE'},
    OutputSerialization = {'CSV': {}},
)

for event in resp['Payload']:
    if 'Records' in event:
        records = event['Records']['Payload'].decode('utf-8')
        print(records)

Match Bucket and Key to the file to be read by S3. Also, since the target file this time is tab-delimited, "'\ t'" is specified for the Field Delimiter of Input Serialization.

Let's run it.

$ python S3Select.py

female,26,employee
female,11,student
female,57,unemployed

The same record as the result of executing from the console screen was output!

This time, I searched the S3 file directly with SQL query and fetched the target record. I sometimes investigate S3 files, so I would like to use it if I have the opportunity.

Reference: https://aws.amazon.com/jp/blogs/news/querying-data-without-servers-or-databases-using-amazon-s3-select/