AWS Lambda supports Python, so I gave it a try. This time I used it to copy files between S3 buckets, and since there were various points worth noting, I would like to share them.
I tried it mainly for the third reason.
I created a Lambda function that fetches objects from an S3 bucket and implemented a script that copies them in parallel.
Create a Lambda Function.
- Click Create a Lambda Function
Select blueprint
Select the template you want to use.
- Select hello-world-python
Configure function
Make basic settings for the Lambda function.
- Name: Name of the Lambda function
  - Example: s3LogBucketCopy
- Description: Description of the Lambda function
  - Example: copy logs between buckets
- Runtime: Execution environment
  - Python 2.7
Lambda function code
Provide the program code to be executed.
You can choose from three methods: edit the code inline, upload a ZIP file, or upload a file from S3.
If you need to import libraries other than the Python standard library and boto3, you need to choose method 2 or 3.
Details are summarized here, so please refer to it if you are interested.
This time, since only the standard library and boto3 are used, it is implemented with method 1 (inline editing).
The code will be implemented later, so leave it unchanged for now.
Lambda function handler and role
- Handler: The name of the handler to execute (module name.function name)
  - Example: lambda_function.s3_log_copy_handler
- Role: The execution role of the Lambda function (access permissions to resources such as S3)
  - Example: S3 execution role
Advanced settings
Set the available memory and the timeout.
- Memory (MB): Available memory
  - Example: 128 MB
- Timeout: Timeout duration
  - Example: 5 min
Review
Check the settings.
If there is no problem, select Create Function
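If you prefer to script the creation instead of clicking through the console, roughly the same settings can be passed to the boto3 Lambda API. This is only a sketch, not part of the original walkthrough: the role ARN and the deployment zip name are placeholders.

```python
import boto3

lambda_client = boto3.client('lambda')

# Placeholder deployment package and role ARN; replace with your own.
with open('lambda_function.zip', 'rb') as f:
    zipped_code = f.read()

lambda_client.create_function(
    FunctionName='s3LogBucketCopy',
    Runtime='python2.7',
    Role='arn:aws:iam::123456789012:role/S3ExecutionRole',  # placeholder ARN
    Handler='lambda_function.s3_log_copy_handler',
    Code={'ZipFile': zipped_code},
    Description='copy logs between buckets',
    Timeout=300,      # seconds (5 min, the maximum at the time of writing)
    MemorySize=128    # MB
)
```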
Implement the script that copies in parallel using multiprocessing.
Below is a simple sample.
#!/usr/local/bin/python
# -*- coding: utf-8 -*-
import boto3
from multiprocessing import Process


def parallel_copy_bucket(s3client, source_bucket, dest_bucket, prefix):
    '''
    Copy the objects under a prefix of an S3 bucket in parallel.
    '''
    # Copy a single object
    def copy_bucket(s3client, dest_bucket, copy_source, key):
        s3client.copy_object(Bucket=dest_bucket, CopySource=copy_source, Key=key)

    # Note that list_objects returns at most 1000 objects per call.
    result = s3client.list_objects(
        Bucket=source_bucket,
        Prefix=prefix
    )

    # Get the list of source keys and copy each one in its own process
    if 'Contents' in result:
        keys = [content['Key'] for content in result['Contents']]
        processes = []
        for key in keys:
            copy_source = '{}/{}'.format(source_bucket, key)
            p = Process(target=copy_bucket, args=(s3client, dest_bucket, copy_source, key))
            p.start()
            processes.append(p)
        # Wait for all copies to finish
        for p in processes:
            p.join()


# Handler called at runtime
def s3_log_copy_handler(event, context):
    source_bucket = event["source_bucket"]  # Source bucket
    dest_bucket = event["dest_bucket"]      # Destination bucket
    prefixes = event["prefixes"]            # Key prefixes of the files to copy

    s3client = boto3.client('s3')
    for prefix in prefixes:
        print("Start loading {}".format(prefix))
        parallel_copy_bucket(s3client, source_bucket, dest_bucket, prefix)
    print("Complete loading")
Select `Configure Sample Event` from the `Actions` button and set the parameters to pass to the handler.
For example, suppose the S3 buckets are organized as follows:
- samplelogs.source  # Source bucket
  - /key1
    - hogehoge.dat
  - /key2
    - fugafuga.dat
- samplelogs.dest  # Destination bucket
Set the JSON as follows.
{
  "source_bucket": "samplelogs.source",
  "dest_bucket": "samplelogs.dest",
  "prefixes": [
    "key1",
    "key2"
  ]
}
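Besides testing from the console, the same JSON can be sent to the deployed function with the boto3 Lambda client. A minimal sketch, assuming the function was created with the name s3LogBucketCopy used earlier:

```python
import json

import boto3

lambda_client = boto3.client('lambda')

event = {
    "source_bucket": "samplelogs.source",
    "dest_bucket": "samplelogs.dest",
    "prefixes": ["key1", "key2"]
}

response = lambda_client.invoke(
    FunctionName='s3LogBucketCopy',     # name chosen when creating the function
    InvocationType='RequestResponse',   # wait synchronously for the result
    Payload=json.dumps(event)
)
print(response['StatusCode'])
```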
The default S3 execution role only defines `s3:GetObject` and `s3:PutObject`. With only these, calling `s3client.list_objects()` fails with the error `A client error (AccessDenied) occurred: Access Denied`: this call cannot be made with `s3:GetObject` alone and needs the separate `s3:ListBucket` permission. Therefore, you need to add `s3:ListBucket` to your policy.
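For reference, a policy along the following lines covers the three actions. This is only a sketch: the role name, policy name, and resource ARNs are placeholders, and it is applied here with boto3's `put_role_policy` rather than through the IAM console.

```python
import json

import boto3

iam = boto3.client('iam')

# Placeholder policy: adjust the resources to your own buckets.
policy = {
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Action": ["s3:GetObject", "s3:PutObject"],
            "Resource": [
                "arn:aws:s3:::samplelogs.source/*",
                "arn:aws:s3:::samplelogs.dest/*"
            ]
        },
        {
            "Effect": "Allow",
            "Action": ["s3:ListBucket"],
            "Resource": ["arn:aws:s3:::samplelogs.source"]
        }
    ]
}

iam.put_role_policy(
    RoleName='S3ExecutionRole',   # placeholder role name
    PolicyName='s3-log-copy',     # placeholder policy name
    PolicyDocument=json.dumps(policy)
)
```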
multiprocessing.Pool
When running multiple processes, specifying a Pool raises the error `OSError: [Errno 38] Function not implemented`. This happens because the Lambda execution environment does not provide the OS facilities (shared-memory semaphores) that Pool needs. You have to drop the Pool and spawn the processes directly.
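To make the difference concrete, here is a small sketch (not from the original post) of the pattern that fails on Lambda versus the Process-based pattern used in the script above:

```python
from multiprocessing import Pool, Process

def work(key):
    print(key)

keys = ['key1/hogehoge.dat', 'key2/fugafuga.dat']

# Fails on Lambda with OSError: [Errno 38] Function not implemented,
# because Pool relies on shared-memory semaphores that are unavailable there.
# pool = Pool(processes=2)
# pool.map(work, keys)

# Works on Lambda: spawn Process objects directly, as in parallel_copy_bucket.
processes = [Process(target=work, args=(key,)) for key in keys]
for p in processes:
    p.start()
for p in processes:
    p.join()
```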
Lambda forcibly times out when the execution time exceeds the configured value. Since the maximum timeout is 300 seconds (5 minutes), anything that takes longer cannot finish in a single run. So if a bucket contains reasonably large files, you will need to run the Lambda function several times.
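One way to work within that limit (not covered in the original post, just a sketch) is to check the remaining execution time that Lambda exposes on the context object and stop handing out new prefixes before the cutoff, so a follow-up invocation can pick up the rest. The 30-second margin below is arbitrary.

```python
import boto3

def s3_log_copy_handler(event, context):
    # Variant of the handler; parallel_copy_bucket is the function defined earlier.
    s3client = boto3.client('s3')
    remaining_prefixes = list(event["prefixes"])

    while remaining_prefixes:
        # get_remaining_time_in_millis() is provided by the Lambda context object.
        if context.get_remaining_time_in_millis() < 30 * 1000:
            print("Approaching the timeout; {} prefixes left".format(len(remaining_prefixes)))
            break
        prefix = remaining_prefixes.pop(0)
        parallel_copy_bucket(s3client, event["source_bucket"],
                             event["dest_bucket"], prefix)
```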
As for where it fits best, I think it is suited to light processing such as alerts, push notifications, and small file transfers. Conversely, it does not seem suitable for heavy processing. Also, now that API endpoints are available, it may be a good fit for ultra-lightweight APIs. I will try that next time.
Here is a summary of the happy and unfortunate points of using the Lambda Function.
- Happy points
  - No need to set up an instance just to write a simple batch job
  - Batch processing can be implemented easily with minimal settings
  - Easy access to AWS resources, since permissions are managed with an IAM role
  - boto3 is available out of the box
- Unfortunate points
  - Not compatible with the Python 3 series
  - Importing packages other than the standard library and boto3 is troublesome
  - Managing the created code is difficult
  - The maximum timeout is short
- https://boto3.readthedocs.org/en/latest/
- http://qiita.com/m-sakano/items/c53ba194a8574f44e78a
- http://www.perrygeo.com/running-python-with-compiled-code-on-aws-lambda.html