[Solved] Is Go's S3 ListObjects slower than Python?

Introduction

When downloading a large number of objects from S3, throughput was poor no matter the object size. In Python I had been getting by with **concurrent.futures** and the like, but I figured goroutines might do better, so I made my debut in Go.

What I wanted to do

- Use ListObjectsV2 to get all keys under a specific prefix in S3
- Download the retrieved keys with goroutines

What actually happened

- I wrote the ListObjects part, but frankly, it's slow.
- Honestly, isn't it faster to write this in Python?

Huh? They should be the same speed, since both just hit the API. I could accept that, but **Go being slower than a scripting language** bothered me. I rearranged my schedule and dug into it.

Source code

Go version

So I wrote it, referring to [the ListObjectsV2 documentation](https://docs.aws.amazon.com/sdk-for-go/api/service/s3/#S3.ListObjectsV2).

main.go


package main

import (
	"fmt"
	"os"

	"github.com/aws/aws-sdk-go/aws"
	"github.com/aws/aws-sdk-go/aws/session"
	"github.com/aws/aws-sdk-go/service/s3"
)

func main() {
	bucket := os.Getenv("BUCKET")
	prefix := os.Getenv("PREFIX")
	region := os.Getenv("REGION")

	sess := session.Must(session.NewSession())
	svc := s3.New(sess, &aws.Config{
		Region: &region,
	})
	params := &s3.ListObjectsV2Input{
		Bucket: &bucket,
		Prefix: &prefix,
	}
	fmt.Println("Start:")
	// Page through all keys under the prefix, printing each one.
	err := svc.ListObjectsV2Pages(params,
		func(p *s3.ListObjectsV2Output, last bool) (shouldContinue bool) {
			for _, obj := range p.Contents {
				fmt.Println(*obj.Key)
			}
			return true
		})
	fmt.Println("End:")
	if err != nil {
		fmt.Println(err.Error())
		return
	}
}

Python version

I'll write this one too, using the low-level client to match conditions with the Go version.

main.py


#!/usr/bin/env python
# -*- coding: utf-8 -*-
import os
import boto3
bucket = os.environ["BUCKET"]
prefix = os.environ["PREFIX"]
region = os.environ["REGION"]

# r = boto3.resource('s3').Bucket(bucket).objects.filter(Prefix=prefix)
# [print(r.key) for r in r]
# I would normally fetch it as above, but to match the Go code
# I measure with the code below

s3_client = boto3.client('s3', region_name=region)

contents = []
next_token = ''
while True:
    if next_token == '':
        response = s3_client.list_objects_v2(Bucket=bucket, Prefix=prefix)
    else:
        response = s3_client.list_objects_v2(Bucket=bucket, Prefix=prefix, ContinuationToken=next_token)

    contents.extend(response['Contents'])
    if 'NextContinuationToken' in response:
        next_token = response['NextContinuationToken']
    else:
        break

[print(r["Key"]) for r in contents]

Environment

Server, etc.

- Everything runs on Cloud9 on EC2 (t2.micro).

Build / deploy, etc.

- I didn't want to pollute the environment (and it's a hassle anyway), so I built everything with Docker.

$ docker-compose up -d --build

- See below for the build materials.

Dockerfile


FROM golang:1.13.5-stretch as build
RUN go get \
  github.com/aws/aws-sdk-go/aws \
  github.com/aws/aws-sdk-go/aws/session \
  github.com/aws/aws-sdk-go/service/s3 
COPY . /work
WORKDIR /work
RUN CGO_ENABLED=0 GOOS=linux GOARCH=amd64 go build -o main main.go

FROM python:3.7.6-stretch as release
RUN pip install boto3
COPY --from=build /work/main /usr/local/bin/main
COPY --from=build /work/main.py /usr/local/bin/main.py
WORKDIR /usr/local/bin/

docker-compose.yml


version: '3'
services:
  app:
    build:
      context: .
    container_name: "app"
    tty: true
    environment:
      BUCKET: <Bucket>
      PREFIX: test/
      REGION: ap-northeast-1

S3 bucket

Create one bucket in the Tokyo region and put roughly 1,000 objects in it with the following script.

#!/bin/bash

Bucket=<Bucket>
Prefix="test"

# Create the test file
dd if=/dev/zero of=testobj bs=1 count=30
# Upload the master file
aws s3 cp testobj s3://${Bucket}/${Prefix}/testobj
# Duplicate the master file
for i in $(seq 0 9); do
    for k in $(seq 0 99); do
        aws s3 cp s3://${Bucket}/${Prefix}/testobj s3://${Bucket}/${Prefix}/${i}/${k}/${i}_${k}.obj
    done
done

Measurement

Measurement result (1,000 objects)

$ time docker-compose exec app ./main

~ Abbreviation ~

real    0m21.888s
user    0m0.580s
sys     0m0.107s
$ time docker-compose exec app ./main.py

~ Abbreviation ~

real    0m2.671s
user    0m0.577s
sys     0m0.104s

Go is ten times slower than Python. Why?!

Try increasing the number of Objects

- Let's increase the number of objects a bit more; say, around 10,000.

# Changed part only
for i in $(seq 0 99); do
    for k in $(seq 0 99); do

- Incidentally, the upload took 3-4 hours to finish. I should have written a proper tool...

Remeasurement result (10,000 objects)

$ time docker-compose exec app ./main

~ Abbreviation ~

real    0m23.276s
user    0m0.617s
sys     0m0.128s
$ time docker-compose exec app ./main.py

~ Abbreviation ~

real    0m5.973s
user    0m0.576s
sys     0m0.114s

This time the difference is only about 4x. In fact, Go seems to carry a fixed overhead of roughly 18 seconds regardless of the number of objects. Hmm.

At the end

- It may be a library setting rather than the language itself; I don't understand it well enough yet, so I'd like to gather more information.
- If the **parallel download** with goroutines, my original goal, performs well, a roughly 20-second constant overhead is tolerable, so I'll implement the rest.

Where to worry

- Looking closely, user and sys are about the same for both, so I/O to S3 is suspect.
- Rough print debugging of the Go code ("Start:", "End:") shows that listing the objects accounts for most of the elapsed time. Are the SDK's defaults different from boto3's?
- Since both run in the same container, CPU credits on T-type instances and differences in network bandwidth shouldn't be a factor...
- Switching to an m5.large didn't fix it either, so CPU credits really aren't the issue.

Follow-up (2020/01/18)

As advised in the comment section, I debugged the SDK and found that it was spending a long time resolving the IAM credentials. Maybe a default in the "-stretch" base image was at fault? I tried several more times afterwards but couldn't reproduce it in this environment, so I'll call it solved. Not entirely satisfying, but...

@nabeken Thank you!
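For anyone hitting the same symptom: aws-sdk-go v1 can log each HTTP request (including the credential lookups) if you raise the log level when creating the session. This is only a configuration sketch under that assumption, not the code I actually ran; no test output is shown since it needs real AWS access.

```go
package main

import (
	"github.com/aws/aws-sdk-go/aws"
	"github.com/aws/aws-sdk-go/aws/session"
	"github.com/aws/aws-sdk-go/service/s3"
)

func main() {
	// Debug logging prints each request and retry, which makes a
	// slow credential resolution step visible in the output.
	sess := session.Must(session.NewSession(&aws.Config{
		Region:   aws.String("ap-northeast-1"),
		LogLevel: aws.LogLevel(aws.LogDebugWithRequestRetries | aws.LogDebugWithRequestErrors),
	}))
	_ = s3.New(sess)
}
```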
