I want to use S3 Select to search S3 files with SQL queries to get the records I need.
This time, we will place a sample file in the following format on S3 and verify it.
sample.tsv
male 14 student
female 26 employee
male 32 selfemployed
male 45 unemployed
female 11 student
male 24 employee
male 33 selfemployed
male 49 unemployed
female 57 unemployed
Select the target file from the S3 console screen and select the S3 Select tab.
File format: CSV Delimiter: tab Compression: None "next"
Enter the SQL you want to execute. I want to get a record whose first column is "female".
SQL
select * from s3object s where s._1 = 'female'
I was able to get it properly. You can download it as a CSV file by clicking "Download".
There are some sample SQLs, so give them a try.
I'll also try the pattern of executing SQL queries in Python from an EC2 instance. First, install the required Python.
#Install Python3
$ sudo yum update -y
$ sudo yum install python3 -y
#Enable virtual environment
$ python3 -m venv ~/s3select_example/env
$ source ~/s3select_example/env/bin/activate
#install boto3
$ pip install pip --upgrade
$ pip install boto3
Next, create a Python file.
S3Select.py
import boto3
s3 = boto3.client('s3')
resp = s3.select_object_content(
Bucket='my-bucket.01',
Key='sample/sample.tsv',
ExpressionType='SQL',
Expression="SELECT * FROM s3object s where s._1 = 'female'",
InputSerialization = {'CSV': {"FileHeaderInfo": "NONE", 'FieldDelimiter': '\t'}, 'CompressionType': 'NONE'},
OutputSerialization = {'CSV': {}},
)
for event in resp['Payload']:
if 'Records' in event:
records = event['Records']['Payload'].decode('utf-8')
print(records)
Match Bucket and Key to the file to be read by S3. Also, since the target file this time is tab-delimited, "'\ t'" is specified for the Field Delimiter of Input Serialization.
Let's run it.
$ python S3Select.py
female,26,employee
female,11,student
female,57,unemployed
The same record as the result of executing from the console screen was output!
This time, I searched the S3 file directly with SQL query and fetched the target record. I sometimes investigate S3 files, so I would like to use it if I have the opportunity.
Reference: https://aws.amazon.com/jp/blogs/news/querying-data-without-servers-or-databases-using-amazon-s3-select/
Recommended Posts