AWS Lambda supports Python, so I gave it a try. This time I used it to copy files between S3 buckets, and since there were various points worth noting, I would like to share them.
I tried it mainly for the third reason.
I created a Lambda function that fetches objects from an S3 bucket and implemented a script that copies them in parallel.
Create a Lambda Function.
- Click Create a Lambda Function
Select blueprint
Select the template you want to use.
- Select hello-world-python
Configure function
Make basic settings for the Lambda function.
- Name: Name of the Lambda function
  - Example: s3LogBucketCopy
- Description: Description of the Lambda function
  - Example: copy logs between buckets
- Runtime: Execution environment
  - Python 2.7
Lambda function code
Provide the program code to be executed.
You can choose from three methods: edit the code inline, upload a ZIP file, or upload a file from S3.
If you need to import libraries other than the Python standard library and boto3, you need to choose method 2 or 3.
Details are summarized here, so please refer to it if you are interested.
This time, since only the standard library and boto3 are used, it is implemented with method 1 (inline editing).
The code will be implemented later, so leave it unchanged for now.
Lambda function handler and role
- Handler: The name of the handler to execute (module name.function name)
  - Example: lambda_function.s3_log_copy_handler
- Role: The execution role of the Lambda function (access permissions to resources such as S3)
  - Example: S3 execution role
Advanced settings
Set the available memory and the timeout.
- Memory (MB): Available memory
  - Example: 128 MB
- Timeout: Timeout duration
  - Example: 5 min
Review
Check the settings.
If there is no problem, select Create Function
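If you prefer to script the creation instead of clicking through the console, roughly the same settings can be passed to the boto3 Lambda API. This is only a sketch, not part of the original walkthrough: the role ARN and the deployment zip name are placeholders.

```python
import boto3

lambda_client = boto3.client('lambda')

# Placeholder deployment package and role ARN; replace with your own.
with open('lambda_function.zip', 'rb') as f:
    zipped_code = f.read()

lambda_client.create_function(
    FunctionName='s3LogBucketCopy',
    Runtime='python2.7',
    Role='arn:aws:iam::123456789012:role/S3ExecutionRole',  # placeholder ARN
    Handler='lambda_function.s3_log_copy_handler',
    Code={'ZipFile': zipped_code},
    Description='copy logs between buckets',
    Timeout=300,      # seconds (5 min, the maximum at the time of writing)
    MemorySize=128    # MB
)
```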
Implement the script that copies in parallel using multiprocessing.
Below is a simple sample.
#!/usr/local/bin/python
# -*- coding: utf-8 -*-
import boto3
from multiprocessing import Process


def parallel_copy_bucket(s3client, source_bucket, dest_bucket, prefix):
    '''
    Copy the objects under a prefix of an S3 bucket in parallel.
    '''
    # Copy a single object
    def copy_bucket(s3client, dest_bucket, copy_source, key):
        s3client.copy_object(Bucket=dest_bucket, CopySource=copy_source, Key=key)

    # Note that list_objects returns at most 1000 objects per call.
    result = s3client.list_objects(
        Bucket=source_bucket,
        Prefix=prefix
    )

    # Get the list of source keys and copy each one in its own process
    if 'Contents' in result:
        keys = [content['Key'] for content in result['Contents']]
        processes = []
        for key in keys:
            copy_source = '{}/{}'.format(source_bucket, key)
            p = Process(target=copy_bucket, args=(s3client, dest_bucket, copy_source, key))
            p.start()
            processes.append(p)
        # Wait for all copies to finish
        for p in processes:
            p.join()


# Handler called at runtime
def s3_log_copy_handler(event, context):
    source_bucket = event["source_bucket"]  # Source bucket
    dest_bucket = event["dest_bucket"]      # Destination bucket
    prefixes = event["prefixes"]            # Key prefixes of the files to copy

    s3client = boto3.client('s3')
    for prefix in prefixes:
        print("Start loading {}".format(prefix))
        parallel_copy_bucket(s3client, source_bucket, dest_bucket, prefix)
    print("Complete loading")
Select `Configure Sample Event` from the `Actions` button and set the parameters to pass to the handler.
For example, suppose the S3 buckets are organized as follows:
- samplelogs.source  # Source bucket
  - /key1
    - hogehoge.dat
  - /key2
    - fugafuga.dat
- samplelogs.dest  # Destination bucket
Set the JSON as follows.
{
  "source_bucket": "samplelogs.source",
  "dest_bucket": "samplelogs.dest",
  "prefixes": [
    "key1",
    "key2"
  ]
}
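Besides testing from the console, the same JSON can be sent to the deployed function with the boto3 Lambda client. A minimal sketch, assuming the function was created with the name s3LogBucketCopy used earlier:

```python
import json

import boto3

lambda_client = boto3.client('lambda')

event = {
    "source_bucket": "samplelogs.source",
    "dest_bucket": "samplelogs.dest",
    "prefixes": ["key1", "key2"]
}

response = lambda_client.invoke(
    FunctionName='s3LogBucketCopy',     # name chosen when creating the function
    InvocationType='RequestResponse',   # wait synchronously for the result
    Payload=json.dumps(event)
)
print(response['StatusCode'])
```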
The default S3 execution role only defines `s3:GetObject` and `s3:PutObject`. With only these, calling `s3client.list_objects()` fails with the error `A client error (AccessDenied) occurred: Access Denied`: this call cannot be made with `s3:GetObject` alone and needs the separate `s3:ListBucket` permission. Therefore, you need to add `s3:ListBucket` to your policy.
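For reference, a policy along the following lines covers the three actions. This is only a sketch: the role name, policy name, and resource ARNs are placeholders, and it is applied here with boto3's `put_role_policy` rather than through the IAM console.

```python
import json

import boto3

iam = boto3.client('iam')

# Placeholder policy: adjust the resources to your own buckets.
policy = {
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Action": ["s3:GetObject", "s3:PutObject"],
            "Resource": [
                "arn:aws:s3:::samplelogs.source/*",
                "arn:aws:s3:::samplelogs.dest/*"
            ]
        },
        {
            "Effect": "Allow",
            "Action": ["s3:ListBucket"],
            "Resource": ["arn:aws:s3:::samplelogs.source"]
        }
    ]
}

iam.put_role_policy(
    RoleName='S3ExecutionRole',   # placeholder role name
    PolicyName='s3-log-copy',     # placeholder policy name
    PolicyDocument=json.dumps(policy)
)
```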
multiprocessing.Pool
When running multiple processes, specifying a Pool raises the error `OSError: [Errno 38] Function not implemented`. This happens because the Lambda execution environment does not provide the OS facilities (shared-memory semaphores) that Pool needs. You have to drop the Pool and spawn the processes directly.
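To make the difference concrete, here is a small sketch (not from the original post) of the pattern that fails on Lambda versus the Process-based pattern used in the script above:

```python
from multiprocessing import Pool, Process

def work(key):
    print(key)

keys = ['key1/hogehoge.dat', 'key2/fugafuga.dat']

# Fails on Lambda with OSError: [Errno 38] Function not implemented,
# because Pool relies on shared-memory semaphores that are unavailable there.
# pool = Pool(processes=2)
# pool.map(work, keys)

# Works on Lambda: spawn Process objects directly, as in parallel_copy_bucket.
processes = [Process(target=work, args=(key,)) for key in keys]
for p in processes:
    p.start()
for p in processes:
    p.join()
```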
Lambda forcibly times out when the execution time exceeds the configured value. Since the maximum timeout is 300 seconds (5 minutes), anything that takes longer cannot finish in a single run. So if a bucket contains reasonably large files, you will need to run the Lambda function several times.
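One way to work within that limit (not covered in the original post, just a sketch) is to check the remaining execution time that Lambda exposes on the context object and stop handing out new prefixes before the cutoff, so a follow-up invocation can pick up the rest. The 30-second margin below is arbitrary.

```python
import boto3

def s3_log_copy_handler(event, context):
    # Variant of the handler; parallel_copy_bucket is the function defined earlier.
    s3client = boto3.client('s3')
    remaining_prefixes = list(event["prefixes"])

    while remaining_prefixes:
        # get_remaining_time_in_millis() is provided by the Lambda context object.
        if context.get_remaining_time_in_millis() < 30 * 1000:
            print("Approaching the timeout; {} prefixes left".format(len(remaining_prefixes)))
            break
        prefix = remaining_prefixes.pop(0)
        parallel_copy_bucket(s3client, event["source_bucket"],
                             event["dest_bucket"], prefix)
```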
As for where it fits best, I think it is suited to light processing such as alerts, push notifications, and small file transfers. Conversely, it does not seem suitable for heavy processing. Also, now that API endpoints are available, it may be a good fit for ultra-lightweight APIs. I will try that next time.
Here is a summary of the happy and unfortunate points of using the Lambda Function.
- Happy points
  - No need to set up an instance just to write a simple batch job
  - Batch processing can be implemented easily with minimal settings
  - Easy access to AWS resources, since permissions are managed with an IAM role
  - boto3 is available out of the box
- Unfortunate points
  - Not compatible with the Python 3 series
  - Importing packages other than the standard library and boto3 is troublesome
  - Managing the created code is difficult
  - The maximum timeout is short
- https://boto3.readthedocs.org/en/latest/
- http://qiita.com/m-sakano/items/c53ba194a8574f44e78a
- http://www.perrygeo.com/running-python-with-compiled-code-on-aws-lambda.html