[PYTHON] How to get keys on Amazon S3 with Boto 3: implementation examples and notes

In Boto 3, use list_objects() to get the keys in an S3 bucket. You can also specify a prefix to narrow down the results. I think it is a commonly used way to get keys from S3.

Basic usage

For example, to get all the keys with the prefix xx/yy/ in the bucket named hoge-bucket, do the following.

sample1.py


from boto3 import Session
s3client = Session().client('s3')

response = s3client.list_objects(
    Bucket='hoge-bucket',
    Prefix='xx/yy/'
)

# If no key matches, the response has no 'Contents' entry, so fall back to an empty list
keys = [content['Key'] for content in response.get('Contents', [])]

Now, if any keys match the condition specified by Prefix, keys holds a list of key strings like this:

>>> keys
['xx/yy/a1', 'xx/yy/a2', 'xx/yy/a3', 'xx/yy/b1']

By the way, Prefix does not have to end at a slash; you can also specify 'xx/yy/a'. In that case, the following result is returned.

>>> keys
['xx/yy/a1', 'xx/yy/a2', 'xx/yy/a3']
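
To make the matching rule concrete: Prefix is compared as a plain string prefix of the key, not as a directory path. The following sketch imitates that rule locally with str.startswith. The key list and the helper name filter_by_prefix are made up for illustration, and no S3 call is made.

```python
# Local illustration only: imitates how Prefix narrows keys (no S3 access).
sample_keys = ['xx/yy/a1', 'xx/yy/a2', 'xx/yy/a3', 'xx/yy/b1', 'xx/zz/a1']


def filter_by_prefix(keys, prefix):
    """Return the keys that start with prefix, in sorted order."""
    return sorted(key for key in keys if key.startswith(prefix))


print(filter_by_prefix(sample_keys, 'xx/yy/'))   # all four keys under xx/yy/
print(filter_by_prefix(sample_keys, 'xx/yy/a'))  # only the keys starting with xx/yy/a
```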

Important note

list_objects() (to be exact, the underlying Amazon S3 API) returns at most 1000 keys per call. If you simply want every key in the bucket, even when there are more than 1000, the following is enough:

sample2.py


from boto3 import Session
s3res = Session().resource('s3')

bucket = s3res.Bucket('hoge-bucket')
keys = [obj.key for obj in bucket.objects.all()]

This works because the resource collection pages through the results for you.

Getting more than 1000 keys with a Prefix condition

But what if you want to specify a Prefix? With sample1.py above, keys receives at most 1000 entries. Even if you pass 1000000 for MaxKeys as an argument, the API never returns more than 1000 keys per call; that is how it is specified.

Method 1: Get all records once and then filter

If hoge-bucket contains only a few keys, you could do the following:

sample3.py


from boto3 import Session
s3res = Session().resource('s3')

bucket = s3res.Bucket('hoge-bucket')
keys = [obj.key for obj in bucket.objects.all() if obj.key.startswith("xx/yy/")]

However, if you have tens of thousands or hundreds of thousands of keys, it will take a considerable amount of time to get results, because every key in the bucket is fetched before filtering. I would rather leave to S3 as much as possible of what S3 can do.
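
One way to let S3 do the filtering while staying with the resource API is the collection's filter() method, which sends Prefix to the service and pages through the results itself. This is a minimal sketch, assuming a bucket object obtained from Session().resource('s3').Bucket(...); the helper name keys_with_prefix is hypothetical.

```python
def keys_with_prefix(bucket, prefix):
    """
    Return all keys under prefix.

    bucket is expected to be a boto3 Bucket resource, for example
    Session().resource('s3').Bucket('hoge-bucket').
    """
    # objects.filter(Prefix=...) passes the prefix to S3, so only matching
    # keys are transferred; the collection handles paging internally.
    return [obj.key for obj in bucket.objects.filter(Prefix=prefix)]
```

Called as keys_with_prefix(s3res.Bucket('hoge-bucket'), 'xx/yy/'), it should return the same keys as sample3.py without listing the rest of the bucket.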

Method 2: Use Marker with list_objects()

The return value of list_objects() is a dict of the following form:

{
    'IsTruncated': True|False,  # Was the result truncated? True if so
    'Marker': 'string',
    'NextMarker': 'string',
    'Contents': [
        {
            'Key': 'string',
            'LastModified': datetime(2015, 1, 1),
            'ETag': 'string',
            'Size': 123,
            'StorageClass': 'STANDARD'|'REDUCED_REDUNDANCY'|'GLACIER',
            'Owner': {
                'DisplayName': 'string',
                'ID': 'string'
            }
        },
    ],
    'Name': 'string',
    'Prefix': 'string',
    'Delimiter': 'string',
    'MaxKeys': 123,
    'CommonPrefixes': [
        {
            'Prefix': 'string'
        },
    ],
    'EncodingType': 'url'
}

The important item here is IsTruncated. If more than 1000 keys match but only 1000 are returned, IsTruncated is True. Note also that the 'Contents' array is always sorted in ascending lexicographic order (UTF-8 byte order) of 'Key'.

In addition, list_objects() takes an argument called Marker; the listing then starts from the key that follows the specified key. Now all the pieces are in place. The following function wraps list_objects() to get all the matching keys regardless of how many there are.
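
Before the real function, the interplay of IsTruncated and Marker can be tried out locally. The sketch below is a toy stand-in, not the S3 API: fake_list_objects and its max_keys=3 page size are assumptions for illustration (the real limit is 1000), and it follows the S3 documentation's description that listing starts after the Marker key.

```python
def fake_list_objects(all_keys, prefix='', marker='', max_keys=3):
    """
    Toy stand-in for one list_objects() call (not the real API).

    Returns at most max_keys keys that start with prefix and sort
    strictly after marker, plus an 'IsTruncated' flag.
    """
    matching = sorted(k for k in all_keys if k.startswith(prefix) and k > marker)
    page = matching[:max_keys]
    response = {'IsTruncated': len(matching) > max_keys}
    if page:  # like S3, omit 'Contents' when nothing matches
        response['Contents'] = [{'Key': k} for k in page]
    return response


all_keys = ['xx/yy/a1', 'xx/yy/a2', 'xx/yy/a3', 'xx/yy/b1', 'xx/zz/c1']

# First call: the page is full and more keys remain, so IsTruncated is True.
first = fake_list_objects(all_keys, prefix='xx/yy/')
last_key = first['Contents'][-1]['Key']

# Second call: pass the last key as marker to continue from the next key.
second = fake_list_objects(all_keys, prefix='xx/yy/', marker=last_key)

print(first)
print(second)
```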

sample4.py


from boto3 import Session
s3client = Session().client('s3')

def get_all_keys(bucket: str = '', prefix: str = '', keys: list = None, marker: str = '') -> list:
    """
    Returns a list of all keys with the specified prefix
    """
    if keys is None:  # avoid sharing one mutable default list between calls
        keys = []
    response = s3client.list_objects(Bucket=bucket, Prefix=prefix, Marker=marker)
    if 'Contents' in response:  # If no key matches, the response does not include 'Contents'
        keys.extend([content['Key'] for content in response['Contents']])
        if response.get('IsTruncated'):  # the field is always present, so test its value
            return get_all_keys(bucket=bucket, prefix=prefix, keys=keys, marker=keys[-1])
    return keys

The check on IsTruncated is the key part: while the result is still truncated, the function calls itself with the last key received (keys[-1]) as the Marker. If the response has no Contents, or is no longer truncated, the accumulated keys are returned.
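
Because each page adds one level of recursion, and Python's default recursion limit is around 1000 frames, a very large listing (roughly a million keys) could exceed it. The same idea can be written as a loop. This is a sketch, not a replacement for sample4.py: the name get_all_keys_iter is hypothetical, and the client is passed in as a parameter (assumed to be a boto3 S3 client) rather than taken from a module-level variable.

```python
def get_all_keys_iter(client, bucket, prefix=''):
    """
    Iterative variant: return all keys under prefix.

    client is expected to be a boto3 S3 client, e.g. Session().client('s3').
    """
    keys = []
    marker = ''
    while True:
        response = client.list_objects(Bucket=bucket, Prefix=prefix, Marker=marker)
        contents = response.get('Contents', [])  # absent when nothing matches
        keys.extend(content['Key'] for content in contents)
        # Stop when S3 reports the listing is complete (or returned nothing).
        if not contents or not response.get('IsTruncated'):
            return keys
        marker = keys[-1]  # resume after the last key received
```

Usage would look like get_all_keys_iter(Session().client('s3'), 'hoge-bucket', 'xx/yy/').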

Now you can get keys from S3 without worrying about how many there are!
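
As a point of comparison, boto3 clients also provide paginators that do the Marker bookkeeping for you. The following is a minimal sketch assuming a boto3 S3 client is passed in; the function name get_all_keys_paginated is hypothetical.

```python
def get_all_keys_paginated(client, bucket, prefix=''):
    """
    Return all keys under prefix using boto3's built-in paginator.

    client is expected to be a boto3 S3 client, e.g. Session().client('s3').
    """
    paginator = client.get_paginator('list_objects')
    keys = []
    # paginate() yields one response dict per underlying list_objects call.
    for page in paginator.paginate(Bucket=bucket, Prefix=prefix):
        # A page without matches has no 'Contents' entry.
        keys.extend(content['Key'] for content in page.get('Contents', []))
    return keys
```

The hand-written wrapper is still useful for seeing how IsTruncated and Marker work together; the paginator simply hides that loop.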
