In Boto 3, you use list_objects() to get the keys in an S3 bucket. You can also specify a prefix to narrow down the results. I think it is one of the most commonly used ways to get keys from S3.
For example, to get all the keys with the prefix xx/yy/ in the bucket named hoge-bucket, do the following.
sample1.py
from boto3 import Session
s3client = Session().client('s3')
response = s3client.list_objects(
    Bucket='hoge-bucket',
    Prefix='xx/yy/'
)
if 'Contents' in response:  # 'Contents' is absent from the response when no key matches
    keys = [content['Key'] for content in response['Contents']]
Now, if any keys match the condition given by Prefix, keys holds a list of key strings like this:
>>> keys
['xx/yy/a1', 'xx/yy/a2', 'xx/yy/a3', 'xx/yy/b1']
By the way, Prefix does not have to end at a slash; it can also be given as 'xx/yy/a'. In that case, the following result is returned.
>>> keys
['xx/yy/a1', 'xx/yy/a2', 'xx/yy/a3']
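The two results above are what you would expect if Prefix is compared against the beginning of each key as a plain string, not as a directory path. The sketch below reproduces that matching locally, without calling S3. Note that filter_by_prefix and the extra key 'xx/zz/c1' are made up for illustration and are not part of boto3 or the original example.

```python
def filter_by_prefix(keys, prefix):
    """Hypothetical local stand-in for the Prefix filter: keep keys that start with prefix."""
    return [key for key in keys if key.startswith(prefix)]


# Assumed bucket contents; 'xx/zz/c1' is added only to show a non-matching key.
bucket_keys = ['xx/yy/a1', 'xx/yy/a2', 'xx/yy/a3', 'xx/yy/b1', 'xx/zz/c1']

print(filter_by_prefix(bucket_keys, 'xx/yy/'))   # ['xx/yy/a1', 'xx/yy/a2', 'xx/yy/a3', 'xx/yy/b1']
print(filter_by_prefix(bucket_keys, 'xx/yy/a'))  # ['xx/yy/a1', 'xx/yy/a2', 'xx/yy/a3']
```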
list_objects() (to be exact, the Amazon S3 API behind it) can return at most 1000 keys per call. If you simply want every key in a bucket that holds more than 1000, the following is enough:
sample2.py
from boto3 import Session
s3res = Session().resource('s3')
bucket = s3res.Bucket('hoge-bucket')
keys = [obj.key for obj in bucket.objects.all()]
But what if you want to specify a Prefix? With sample1.py above, keys receives at most 1000 entries. Even if you pass 1000000 for the MaxKeys argument, the API never returns more than 1000 keys in one call; that is how it is specified.
If there are only a few keys under hoge-bucket, you could do the following:
sample3.py
from boto3 import Session
s3res = Session().resource('s3')
bucket = s3res.Bucket('hoge-bucket')
keys = [obj.key for obj in bucket.objects.all() if obj.key.startswith("xx/yy/")]
However, if the bucket holds tens or hundreds of thousands of keys, this takes a considerable amount of time, because every key is fetched first and only then filtered on the client side. I want to leave to the API whatever the API can do.
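One way to hand the prefix filtering over to S3 while staying with the resource API is the filter() method of the objects collection, which accepts the same Prefix parameter and pages through the results for you. Here is a minimal sketch; get_keys_with_prefix is a name I made up, and the function takes the Bucket resource as an argument so that it can be tried without AWS credentials.

```python
def get_keys_with_prefix(bucket, prefix):
    """Return every key under prefix.

    bucket is assumed to be a boto3 S3 Bucket resource; Prefix is passed
    through to S3, so only matching keys are transferred.
    """
    return [obj.key for obj in bucket.objects.filter(Prefix=prefix)]


# Usage against a real bucket (requires boto3 and AWS credentials):
# from boto3 import Session
# bucket = Session().resource('s3').Bucket('hoge-bucket')
# keys = get_keys_with_prefix(bucket, 'xx/yy/')
```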
list_objects() with Marker
The return value of list_objects() is a dict of the following form:
{
    'IsTruncated': True|False,  # Was the result truncated? True if so
    'Marker': 'string',
    'NextMarker': 'string',
    'Contents': [
        {
            'Key': 'string',
            'LastModified': datetime(2015, 1, 1),
            'ETag': 'string',
            'Size': 123,
            'StorageClass': 'STANDARD'|'REDUCED_REDUNDANCY'|'GLACIER',
            'Owner': {
                'DisplayName': 'string',
                'ID': 'string'
            }
        },
    ],
    'Name': 'string',
    'Prefix': 'string',
    'Delimiter': 'string',
    'MaxKeys': 123,
    'CommonPrefixes': [
        {
            'Prefix': 'string'
        },
    ],
    'EncodingType': 'url'
}
The important thing here is IsTruncated. When more than 1000 keys match but only 1000 are returned, IsTruncated becomes True. By the way, the 'Contents' array is always sorted in ascending lexicographic (UTF-8 byte) order of 'Key'.
In addition, list_objects() has an argument called Marker: the listing starts with the first key after the specified one, so the marker key itself is not returned again. We now have all the pieces. The following is a function that wraps list_objects() to get all the matching keys regardless of how many there are.
sample4.py
from boto3 import Session
s3client = Session().client('s3')
def get_all_keys(bucket: str='', prefix: str='', keys=None, marker: str='') -> list:
    """
    Returns a list of all keys with the specified prefix
    """
    if keys is None:  # a mutable default ([]) would be shared between calls
        keys = []
    response = s3client.list_objects(Bucket=bucket, Prefix=prefix, Marker=marker)
    if 'Contents' in response:  # 'Contents' is absent from the response when no key matches
        keys.extend([content['Key'] for content in response['Contents']])
        if response.get('IsTruncated'):  # True only while more keys remain
            return get_all_keys(bucket=bucket, prefix=prefix, keys=keys, marker=keys[-1])
    return keys
The function checks IsTruncated, and while it is True it calls itself with the last key collected so far (keys[-1]) as the marker. If the response has no Contents, or IsTruncated is no longer True, the accumulated keys are returned all at once.
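Since each call handles at most 1000 keys, a recursive function like the one above adds one stack frame per 1000 keys and could run into Python's default recursion limit somewhere around a million keys. The same Marker logic can be written as a loop. The following is only a sketch: get_all_keys_iter and make_fake_list_objects are names I made up, and the fake is an in-memory stand-in (not part of boto3) that imitates just the Prefix, Marker, Contents and IsTruncated behavior so the loop can be tried without AWS.

```python
from typing import Callable, Dict, List


def get_all_keys_iter(list_objects: Callable[..., Dict], bucket: str, prefix: str = '') -> List[str]:
    """Collect every key under prefix by following Marker in a loop."""
    keys: List[str] = []
    marker = ''
    while True:
        response = list_objects(Bucket=bucket, Prefix=prefix, Marker=marker)
        contents = response.get('Contents', [])
        keys.extend(content['Key'] for content in contents)
        # Stop when nothing came back or S3 reports that the listing is complete.
        if not contents or not response.get('IsTruncated'):
            return keys
        marker = keys[-1]  # the next page starts after the last key seen


def make_fake_list_objects(all_keys: List[str], page_size: int = 2):
    """Hypothetical in-memory stand-in for s3client.list_objects."""
    ordered = sorted(all_keys)

    def fake(Bucket: str, Prefix: str = '', Marker: str = '') -> Dict:
        matching = [k for k in ordered if k.startswith(Prefix) and k > Marker]
        page = matching[:page_size]
        response: Dict = {'IsTruncated': len(matching) > page_size}
        if page:
            response['Contents'] = [{'Key': k} for k in page]
        return response

    return fake


fake = make_fake_list_objects(['xx/yy/a1', 'xx/yy/a2', 'xx/yy/a3', 'xx/yy/b1', 'zz/c1'])
all_keys = get_all_keys_iter(fake, bucket='hoge-bucket', prefix='xx/yy/')
print(all_keys)  # ['xx/yy/a1', 'xx/yy/a2', 'xx/yy/a3', 'xx/yy/b1']
```

With a real client, the call would be get_all_keys_iter(s3client.list_objects, 'hoge-bucket', 'xx/yy/').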
Now you can get the keys on S3 without worrying about how many there are!
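For reference, the boto3 client also comes with paginators that do this marker bookkeeping internally. The following is a minimal sketch using get_paginator('list_objects'); get_all_keys_paginated is a name I made up, and the client is passed in as an argument so the function itself does not need AWS credentials to be defined.

```python
from typing import List


def get_all_keys_paginated(s3client, bucket: str, prefix: str = '') -> List[str]:
    """Collect every key under prefix using the client's built-in paginator.

    s3client is assumed to be a boto3 S3 client, e.g. Session().client('s3').
    """
    paginator = s3client.get_paginator('list_objects')
    keys: List[str] = []
    for page in paginator.paginate(Bucket=bucket, Prefix=prefix):
        # A page without matching keys has no 'Contents' entry.
        keys.extend(content['Key'] for content in page.get('Contents', []))
    return keys


# Usage against a real bucket (requires boto3 and AWS credentials):
# from boto3 import Session
# keys = get_all_keys_paginated(Session().client('s3'), 'hoge-bucket', 'xx/yy/')
```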