Process the gzip file UNLOADed with Redshift with Python of Lambda, gzip it again and upload it to S3

Thing you want to do

As the title suggests, Redshift has a data warehouse, which is usually processed by ELT, but there are cases where data processing by programming is required.

By using Redshift's UNLOAD, you can create a gzip file from Redshift to S3 with the SQL result, so it is said that it will be processed by Lambda with the put event to S3 as a trigger and uploaded to S3 in the state of gzipping again. I tried to do that.

UNLOAD Lambda currently has a maximum of 3008MB. Processing like this will inevitably increase the amount of memory used as the file size increases. Therefore, adjust the file size to be passed to Lambda by setting the MAXFILESIZE parameter. It's a complete case by case, but this time I set it at 50MB.

Code on Lambda

Trigger settings are omitted.

import json
import boto3
import urllib.parse
import os
import sys
import csv
import re
import traceback
import gzip
import subprocess

s3client = boto3.client('s3')
s3resource = boto3.resource('s3')
SEP = '\t'
L_SEP = '\n'
S3OUTBACKET='XXXXXXXX'
S3OUTBASE='athena/preprocessing/XXXXXXtmp/'

def lambda_handler(event, context):
    
    bucket = event['Records'][0]['s3']['bucket']['name']
    key = urllib.parse.unquote_plus(event['Records'][0]['s3']['object']['key'], encoding='utf-8')
    taragetfile=os.path.split(key)[1]
    outputprefixA=os.path.split(key)[0].split("/")[-1]
    outputprefixB=os.path.split(key)[0].split("/")[-2]

    outputdata = "";
    try:
        dlfilename ='/tmp/'+key.replace("/","")
        s3client.download_file(bucket, key, dlfilename)
        gzipfile = gzip.open(dlfilename, 'rt') 
        csvreader = csv.reader(gzipfile, delimiter=SEP, lineterminator=L_SEP, quoting=csv.QUOTE_NONE)
        for line in csvreader:
            #Various processing is performed line by line and stored in output data.
            #Among the omitted processes, there is an import that we are using.
            #Please note
    except Exception as e:
        print(e)
        raise e
        
    print("memory size at outputdata:"+str(sys.getsizeof(outputdata)))
    os.remove(dlfilename)
    uploadbinary = gzip.compress(bytes(outputdata , 'utf-8'))
    print("memory size at uploadbinary:"+str(sys.getsizeof(uploadbinary)))
    uploadfilename='processed_'+taragetfile

    try:
        bucket = S3OUTBACKET
        key = S3OUTBASE+outputprefixA+"/"+outputprefixB+"/"+uploadfilename
        obj = s3resource.Object(bucket,key)
        obj.put( Body=uploadbinary ) 
    except Exception as e:
        print(e)
        raise e
 
    return 0

Tune

When I tested it with an actual file, I got a memory error. The str (sys.getsizeof (outputdata)) in the middle of the code is for confirmation, and I grasped the situation by looking at the memory size. Although it is not written in the code, it is good to see the compression ratio of gzip itself to the target data. The data I handled this time was 50MB after gzip compression, but the processed data + compressed data required 1000MB of memory. After all it is something that you can not understand unless you actually try it. You may want to investigate Python's memory situation a little more.

If you increase the memory size of Lambda, CPU resources etc. will also increase, so it depends on the processing content and file size, but it is good to check how fast the processing will be once the maximum is 3008MB. Again, there were cases where doubling the memory would halve the processing time.

If the process is performed on a regular basis, tuning here is directly linked to running costs, so it is very important.

Impressions without vocabulary

Lambda very convenient

Recommended Posts

Process the gzip file UNLOADed with Redshift with Python of Lambda, gzip it again and upload it to S3
[AWS lambda] Deploy including various libraries with lambda (generate a zip with a password and upload it to s3) @ Python
Upload data to s3 of aws with a command and update it, and delete the used data (on the way)
GAE --With Python, rotate the image based on the rotation information of EXIF and upload it to Cloud Storage.
The process of making Python code object-oriented and improving it
Return the image data with Flask of Python and draw it to the canvas element of HTML
[Python] Convert CSV file uploaded to S3 to JSON file with AWS Lambda
Convert the result of python optparse to dict and utilize it
Find the white Christmas rate by prefecture with Python and map it to a map of Japan
Check the existence of the file with python
Connect to s3 with AWS Lambda Python
Convert the spreadsheet to CSV and upload it to Cloud Storage with Cloud Functions
Read CSV file with Python and convert it to DataFrame as it is
Upload what you got in request to S3 with AWS Lambda Python
Read the data of the NFC reader connected to Raspberry Pi 3 with Python and send it to openFrameworks with OSC
[Python] Summary of S3 file operations with boto3
Export RDS snapshot to S3 with Lambda (Python)
Upload files to Google Drive with Lambda (Python)
Convert the character code of the file with Python3
Fourier transform the wav file read by Python, reverse transform it, and write it again.
Template of python script to read the contents of the file
Automatically determine and process the encoding of the text file
The process of installing Atom and getting Python running
I tried to touch the CSV file with Python
Extract images and tables from pdf with python to reduce the burden of reporting
I tried to automate the article update of Livedoor blog with Python and selenium.
Visualize the range of interpolation and extrapolation with python
I want to replace the variables in the python template file and mass-produce it in another file.
Read json file with Python, format it, and output json
Read the csv file with jupyter notebook and write the graph on top of it
I tried to compare the processing speed with dplyr of R and pandas of Python
It is easy to execute SQL with Python and output the result in Excel
[Python environment maintenance] De-NeoBundle. Prepare the environment of the super convenient complementary plug-in jedi-vim with dein and set it to be comfortable
POST the image selected on the website with multipart / form-data and save it to Amazon S3! !!
I tried to get the number of days of the month holidays (Saturdays, Sundays, and holidays) with python
Recursively get the Excel list in a specific folder with python and write it to Excel.
It was great to edit the Python file in the Raspberry Pi with Atom's remote function
How to insert a specific process at the start and end of spider with scrapy
How to get the information of organizations, Cost Explorer of another AWS account with Lambda (python)
I tried to find the entropy of the image with python
[Python] Read the csv file and display the figure with matplotlib
[pyqtgraph] Add region to the graph and link it with the graph region
Try to automate the operation of network devices with Python
Debug by attaching to the Python process of the SSH destination
I want to know the features of Python and pip
[Python] Regularly export from CloudWatch Logs to S3 with Lambda
I tried to divide the file into folders with Python
Try to decipher the garbled attachment file name with Python
Play with the password mechanism of GitHub Webhook and Python
Overview of Python virtual environment and how to create it
Get the source of the page to load infinitely with python.
I wrote AWS Lambda, and I was a little addicted to the default value of Python arguments
[Python] What is pip? Explain the command list and how to use it with actual examples
Read the old Gakushin DC application Word file (.doc) from Python and try to operate it.
[Python] The role of the asterisk in front of the variable. Divide the input value and assign it to a variable
[Python scraping] Output the URL and title of the site containing a specific keyword to a text file
I compared the speed of Hash with Topaz, Ruby and Python
Process the contents of the file in order with a shell script
Made it possible to convert PNG to JPG with Pillow of Python
[Introduction to Python] I compared the naming conventions of C # and Python.
[Python] How to get the first and last days of the month