[PYTHON] Experimenting with Unicode normalization in Amazon S3 (August 2016)

Origin

I uploaded a file from my service's test environment to Amazon S3 and then specified its key to look at the metadata, but no matching object was found. After a lot of searching, I noticed that the file name contained a dakuten (voiced sound mark), and I realized that I had fallen into the trap of Unicode normalization.

That left me a little worried, so I decided to briefly investigate how S3 treats keys in other situations.

That was around August 2016. I don't know how things behave now, but I am publishing the article as it stands. If I can verify the current behavior, or if the behavior on Windows becomes clear, I will add to or edit it.

Review

For the handling of file names on Mac (HFS+) and Windows (NTFS), refer to the pages below.

- HFS+ Encoding and Unicode Normalization, 3rd Edition - Monoka
- UTF-8 on Mac OS X and NFC / NFD on Windows: numa's diary

To summarize, it seems that:

- On Mac, a unique normalization method based on NFD is used.
- On Windows, nothing in particular is done (each name is stored as it was created).
- At the file system level, NFKC / NFKD are never applied.
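As a minimal sketch of what these forms do to a string, the following Python 3 snippet prints the code points that result from each normalization. The helper name `code_points` is made up for illustration, and the test characters are taken from the experiment later in this article.

```python
import unicodedata


def code_points(text):
    """Return the code points of text as 'U+XXXX' strings."""
    return ["U+{:04X}".format(ord(ch)) for ch in text]


# KATAKANA LETTER ZA, written as an escape so its form is unambiguous.
za = "\u30b6"
nfc = unicodedata.normalize("NFC", za)
nfd = unicodedata.normalize("NFD", za)

print(code_points(nfc))  # ['U+30B6']
print(code_points(nfd))  # ['U+30B5', 'U+3099']

# Half-width katakana HI + half-width handakuten:
# left alone by NFC, rewritten only by the compatibility forms.
hi = "\uff8b\uff9f"
hi_nfc = unicodedata.normalize("NFC", hi)
hi_nfkc = unicodedata.normalize("NFKC", hi)

print(code_points(hi_nfc))   # ['U+FF8B', 'U+FF9F']
print(code_points(hi_nfkc))  # ['U+30D4']
```

The strings look the same on screen in both canonical forms, which is why a mismatch is easy to miss until the code points are printed.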

Summary and conclusion

Summary

- S3 does not prescribe a Unicode normalization form.
  - Keys in multiple forms can coexist.
- The language-specific SDKs seem to let you use whichever form you like (I have only tried Python).
- With other upload methods, the behavior depends on the client.
  - Most of them inherit the form used on the OS side, but be careful because there are exceptions such as `aws s3 cp`.
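Because several forms can coexist in one bucket, it can help to check a key listing for names that differ only in normalization. The following is a rough Python 3 sketch under that assumption; `group_by_normalized` is a hypothetical helper, and the sample keys are illustrative rather than taken from a real bucket.

```python
import unicodedata
from collections import defaultdict


def group_by_normalized(keys, form="NFC"):
    """Group keys that become identical after normalization to `form`.

    Only groups with more than one member are returned.
    """
    groups = defaultdict(list)
    for key in keys:
        groups[unicodedata.normalize(form, key)].append(key)
    return {norm: members for norm, members in groups.items() if len(members) > 1}


# Illustrative listing: the first two keys render identically on screen.
keys = [
    "sample/1_\u3055\u3099\u3053\u306f\u3099",  # decomposed (NFD)
    "sample/1_\u3056\u3053\u3070",              # precomposed (NFC)
    "sample/4_\u3320",
]

collisions = group_by_normalized(keys)
for members in collisions.values():
    # Print escapes so the difference between the keys is visible.
    print([m.encode("unicode_escape").decode("ascii") for m in members])
```

In practice the `keys` list could come from any listing of the bucket, for example the `bucket.objects.all()` loop used in the experiment below.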

The results of the experiments described below are summarized here.

| No. | Upload method / file system | Normalization form | Remarks |
|---|---|---|---|
| 1 | HFS+ | NFD(?) | Created with the touch command |
| 2 | NTFS | Not investigated | Created in Explorer |
| 3 | Python AWS SDK (boto3) | NFC / NFD / NFKC / NFKD (the one you choose is used) | Old character forms are replaced with new character forms |
| 4 | bash (Mac) + aws s3 sync | NFD | Files from No. 1 |
| 5 | bash (Mac) + aws s3 cp | NFC | Files from No. 1 |
| 6 | Chrome (Mac, AWS console) | NFD | Files from No. 1 |
| 7 | Chrome (Win, AWS console) | NFC | Files from No. 2 |

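I think the old character form turns into the new character form with boto3 because of the library on the Python side, not because of S3: the code point U+FA19 has a canonical decomposition to U+795E, so `unicodedata.normalize` replaces it in every one of the four forms.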
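The replacement of the old character form can be reproduced without S3 at all. The sketch below (Python 3, standard library only) looks at U+FA19, the code point used as test string 6 in the experiment, and applies each normalization form to it.

```python
import unicodedata

old_form = "\ufa19"  # the old character form used in test string 6

# The decomposition field has no <compat> tag, i.e. it is a canonical mapping.
decomposition = unicodedata.decomposition(old_form)
print(unicodedata.name(old_form))
print(decomposition)

# Apply every normalization form and record the resulting code point.
results = {
    form: unicodedata.normalize(form, old_form)
    for form in ("NFC", "NFD", "NFKC", "NFKD")
}
for form, result in results.items():
    print(form, "U+{:04X}".format(ord(result)))
```

Every form yields U+795E, which matches the boto3 results below where key 6 is stored as `\u795e` in all four prefixes.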

Conclusion

- I think it is better to avoid Japanese file names when uploading files to S3.
- If you do have to use them, note that keys may fail to match because of differences in normalization form.
- In particular, aws-cli behaves differently depending on the situation, so handle it with care.
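If Japanese names cannot be avoided, one possible mitigation is to pass every key through a single normalization form before it reaches S3. The following Python 3 sketch assumes NFC as that form; `to_s3_key` and the `uploads` prefix are hypothetical names chosen for illustration. Note the trade-off shown earlier: normalizing also rewrites old character forms such as U+FA19.

```python
import unicodedata


def to_s3_key(prefix, filename, form="NFC"):
    """Build an object key whose text is always in one normalization form."""
    key = "{}/{}".format(prefix.strip("/"), filename)
    return unicodedata.normalize(form, key)


# The same visible name as it might arrive from a Mac (decomposed)
# and from Windows (precomposed).
from_mac = "1_\u3055\u3099\u3053\u306f\u3099"
from_windows = "1_\u3056\u3053\u3070"

key_a = to_s3_key("uploads", from_mac)
key_b = to_s3_key("uploads", from_windows)
print(key_a == key_b)  # True
```

The resulting string can then be used wherever a key is needed, for example as the second argument of `s3.Object(BUCKET_NAME, key)` in the program below, as long as both the upload path and the lookup path use the same helper.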

Experiment

For convenience, every bucket name is replaced with a placeholder ("Appropriate name").

File upload with AWS SDK for Python (Boto3)

I prepared the following program and investigated how S3 handles keys whose strings are normalized with NFC, NFD, NFKC, and NFKD.

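#!/usr/bin/env python
# -*- encoding: utf-8 -*-
# Written for Python 2.7 (unichr, print statement), matching the output shown below.

import unicodedata

import boto3


FORMS = ['NFC', 'NFKC', 'NFD', 'NFKD']
# Test names, written as escapes so the editor cannot silently change their form.
STRINGS = [
    u"1_\u3056\u3053\u3070",              # ざこば: hiragana with dakuten
    u"2_\u30b6\u30ea\u30ac\u30cb",        # ザリガニ: katakana with dakuten
    u"3_\uff8b\uff9f\uff6f\uff8b\uff9f",  # ﾋﾟｯﾋﾟ: half-width katakana
    u"4_\u3320",                          # ㌠: a single-code-point unit symbol
    u"5_\u795e",                          # 神: new character form
    u"6_{0}".format(unichr(0xfa19)),      # old character form of 神
]
BUCKET_NAME = 'Appropriate name'


def make_files():
    # Upload one object per (normalization form, test name) pair.
    session = boto3.session.Session()
    s3 = session.resource("s3")

    for form in FORMS:
        for string in STRINGS:
            key = u"{0:>7}/{1}".format(form, string)
            key = unicodedata.normalize(form, key)
            obj = s3.Object(BUCKET_NAME, key)
            obj.put(Body='test')


def list_files():
    # Print each key twice: as a repr (code points visible) and as UTF-8 text.
    session = boto3.session.Session()
    s3 = session.resource("s3")

    bucket = s3.Bucket(BUCKET_NAME)
    for i in bucket.objects.all():
        print i
        print i.key.encode("utf-8")


if __name__ == "__main__":
    make_files()
    list_files()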

The results are as follows.

s3.ObjectSummary(bucket_name='Appropriate name', key=u'    NFC/1_\u3056\u3053\u3070')
    NFC/1_ざこば
s3.ObjectSummary(bucket_name='Appropriate name', key=u'    NFC/2_\u30b6\u30ea\u30ac\u30cb')
    NFC/2_ザリガニ
s3.ObjectSummary(bucket_name='Appropriate name', key=u'    NFC/3_\uff8b\uff9f\uff6f\uff8b\uff9f')
    NFC/3_ﾋﾟｯﾋﾟ
s3.ObjectSummary(bucket_name='Appropriate name', key=u'    NFC/4_\u3320')
    NFC/4_㌠
s3.ObjectSummary(bucket_name='Appropriate name', key=u'    NFC/5_\u795e')
    NFC/5_神
s3.ObjectSummary(bucket_name='Appropriate name', key=u'    NFC/6_\u795e')
    NFC/6_神
s3.ObjectSummary(bucket_name='Appropriate name', key=u'    NFD/1_\u3055\u3099\u3053\u306f\u3099')
    NFD/1_ざこば
s3.ObjectSummary(bucket_name='Appropriate name', key=u'    NFD/2_\u30b5\u3099\u30ea\u30ab\u3099\u30cb')
    NFD/2_ザリガニ
s3.ObjectSummary(bucket_name='Appropriate name', key=u'    NFD/3_\uff8b\uff9f\uff6f\uff8b\uff9f')
    NFD/3_ﾋﾟｯﾋﾟ
s3.ObjectSummary(bucket_name='Appropriate name', key=u'    NFD/4_\u3320')
    NFD/4_㌠
s3.ObjectSummary(bucket_name='Appropriate name', key=u'    NFD/5_\u795e')
    NFD/5_神
s3.ObjectSummary(bucket_name='Appropriate name', key=u'    NFD/6_\u795e')
    NFD/6_神
s3.ObjectSummary(bucket_name='Appropriate name', key=u'   NFKC/1_\u3056\u3053\u3070')
   NFKC/1_ざこば
s3.ObjectSummary(bucket_name='Appropriate name', key=u'   NFKC/2_\u30b6\u30ea\u30ac\u30cb')
   NFKC/2_ザリガニ
s3.ObjectSummary(bucket_name='Appropriate name', key=u'   NFKC/3_\u30d4\u30c3\u30d4')
   NFKC/3_ピッピ
s3.ObjectSummary(bucket_name='Appropriate name', key=u'   NFKC/4_\u30b5\u30f3\u30c1\u30fc\u30e0')
   NFKC/4_サンチーム
s3.ObjectSummary(bucket_name='Appropriate name', key=u'   NFKC/5_\u795e')
   NFKC/5_神
s3.ObjectSummary(bucket_name='Appropriate name', key=u'   NFKC/6_\u795e')
   NFKC/6_神
s3.ObjectSummary(bucket_name='Appropriate name', key=u'   NFKD/1_\u3055\u3099\u3053\u306f\u3099')
   NFKD/1_ざこば
s3.ObjectSummary(bucket_name='Appropriate name', key=u'   NFKD/2_\u30b5\u3099\u30ea\u30ab\u3099\u30cb')
   NFKD/2_ザリガニ
s3.ObjectSummary(bucket_name='Appropriate name', key=u'   NFKD/3_\u30d2\u309a\u30c3\u30d2\u309a')
   NFKD/3_ピッピ
s3.ObjectSummary(bucket_name='Appropriate name', key=u'   NFKD/4_\u30b5\u30f3\u30c1\u30fc\u30e0')
   NFKD/4_サンチーム
s3.ObjectSummary(bucket_name='Appropriate name', key=u'   NFKD/5_\u795e')
   NFKD/5_神
s3.ObjectSummary(bucket_name='Appropriate name', key=u'   NFKD/6_\u795e')
   NFKD/6_神

Looking at this, you can see that S3 itself does nothing: every key is stored exactly as it was sent.

S3 resembles a file system, but in S3 the name (key) is just another kind of metadata, so its stance is probably to leave normalization to the uploading side.
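The consequence of storing keys as-is is that a lookup only succeeds when the requested key has exactly the same code points as the stored one. The following Python 3 sketch imitates this with a plain dictionary standing in for a bucket; it is a simplified model, not S3 itself, and the variable names are made up for illustration.

```python
import unicodedata

# Stand-in for a bucket: an exact-match mapping from key to body.
store = {}

# The uploader sends the name in decomposed (NFD) form.
uploaded = unicodedata.normalize("NFD", "1_\u3056\u3053\u3070")
store[uploaded] = b"test"

# The reader asks for the same visible name in precomposed (NFC) form.
requested = "1_\u3056\u3053\u3070"
found_as_typed = requested in store
found_after_nfd = unicodedata.normalize("NFD", requested) in store

print(found_as_typed)   # False
print(found_after_nfd)  # True

# The UTF-8 byte sequences differ even though the text renders the same.
print(uploaded.encode("utf-8"))
print(requested.encode("utf-8"))
```

This is the same kind of mismatch that caused the failed metadata lookup described at the start of this article.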

When uploading from Bash on Mac using aws-cli Part 1

First, here is the state of the files.

$ ls -l
-rw-r--r--  1 npoi  staff     0  8  2 21:43 1_ざこば
-rw-r--r--  1 npoi  staff     0  8  2 21:43 2_ザリガニ
-rw-r--r--  1 npoi  staff     0  8  2 23:42 3_ﾋﾟｯﾋﾟ
-rw-r--r--  1 npoi  staff     0  8  2 21:43 4_㌠
-rw-r--r--  1 npoi  staff     0  8  2 23:42 5_神
-rw-r--r--  1 npoi  staff     0  8  2 23:08 6_神

Also, the confirmation from Python's interactive mode looks like this.

$ python
Python 2.7.11 (default, Dec  5 2015, 14:44:53)
[GCC 4.2.1 Compatible Apple LLVM 7.0.0 (clang-700.1.76)] on darwin
Type "help", "copyright", "credits" or "license" for more information.
>>> import os
>>> for i in os.listdir("."):
...     print [i.decode('utf-8')],i
...
[u'.DS_Store'] .DS_Store
[u'1_\u3055\u3099\u3053\u306f\u3099'] 1_ざこば
[u'2_\u30b5\u3099\u30ea\u30ab\u3099\u30cb'] 2_ザリガニ
[u'3_\uff8b\uff9f\uff6f\uff8b\uff9f'] 3_ﾋﾟｯﾋﾟ
[u'4_\u3320'] 4_㌠
[u'5_\u795e'] 5_神
[u'6_\ufa19'] 6_神
>>>

It looks like NFD, but entries 5 and 6 are a surprise when you actually see them: the old character form in 6 is kept as U+FA19 and is not converted to U+795E.

Let's sync this with `aws s3 sync`.
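To check which forms a given name already satisfies, `unicodedata.is_normalized` can be used. This is a Python 3.8+ sketch (the function does not exist in the Python 2.7 used for the experiment), and `matching_forms` is a hypothetical helper. The inputs are the names returned by `os.listdir` above.

```python
import unicodedata

FORMS = ("NFC", "NFD", "NFKC", "NFKD")


def matching_forms(text):
    """Return the normalization forms that `text` already conforms to."""
    return [form for form in FORMS if unicodedata.is_normalized(form, text)]


# Names as returned by os.listdir on HFS+ in the listing above.
print(matching_forms("1_\u3055\u3099\u3053\u306f\u3099"))  # ['NFD', 'NFKD']
print(matching_forms("3_\uff8b\uff9f\uff6f\uff8b\uff9f"))  # ['NFC', 'NFD']
print(matching_forms("6_\ufa19"))                          # []
```

The empty result for name 6 shows that the HFS+ name is not strictly NFD, which is consistent with the "NFD(?)" entry in the summary table.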

$ aws s3 sync ./sample/ s3://Appropriate name/MAC_CLI
upload: sample/4_㌠ to s3://Appropriate name/MAC_CLI/4_㌠
upload: sample/2_ザリガニ to s3://Appropriate name/MAC_CLI/2_ザリガニ
upload: sample/6_神 to s3://Appropriate name/MAC_CLI/6_神
upload: sample/5_神 to s3://Appropriate name/MAC_CLI/5_神
upload: sample/1_ざこば to s3://Appropriate name/MAC_CLI/1_ざこば
upload: sample/.DS_Store to s3://Appropriate name/MAC_CLI/.DS_Store
upload: sample/3_ﾋﾟｯﾋﾟ to s3://Appropriate name/MAC_CLI/3_ﾋﾟｯﾋﾟ

In this state, I checked the bucket with the same listing function that I used after uploading from the Python SDK, and got the following.

s3.ObjectSummary(bucket_name='Appropriate name', key=u'MAC_CLI/1_\u3055\u3099\u3053\u306f\u3099')
MAC_CLI/1_ざこば
s3.ObjectSummary(bucket_name='Appropriate name', key=u'MAC_CLI/2_\u30b5\u3099\u30ea\u30ab\u3099\u30cb')
MAC_CLI/2_ザリガニ
s3.ObjectSummary(bucket_name='Appropriate name', key=u'MAC_CLI/3_\uff8b\uff9f\uff6f\uff8b\uff9f')
MAC_CLI/3_ﾋﾟｯﾋﾟ
s3.ObjectSummary(bucket_name='Appropriate name', key=u'MAC_CLI/4_\u3320')
MAC_CLI/4_㌠
s3.ObjectSummary(bucket_name='Appropriate name', key=u'MAC_CLI/5_\u795e')
MAC_CLI/5_神
s3.ObjectSummary(bucket_name='Appropriate name', key=u'MAC_CLI/6_\ufa19')
MAC_CLI/6_神

The old character form was uploaded intact, so the keys appear to be uploaded in the HFS+ NFD-based form as is.

When uploading from Bash on Mac using aws-cli Part 2

I used `aws s3 sync` earlier, but what about `aws s3 cp`?

$ aws s3 cp 1_ざこば s3://Appropriate name/MAC_CLI2/
upload: 1_ざこば to s3://Appropriate name/MAC_CLI2/1_ざこば
$ aws s3 cp 2_ザリガニ s3://Appropriate name/MAC_CLI2/
upload: ./2_ザリガニ to s3://Appropriate name/MAC_CLI2/2_ザリガニ
$ aws s3 cp 3_ﾋﾟｯﾋﾟ s3://Appropriate name/MAC_CLI2/
upload: ./3_ﾋﾟｯﾋﾟ to s3://Appropriate name/MAC_CLI2/3_ﾋﾟｯﾋﾟ
$ aws s3 cp 4_㌠ s3://Appropriate name/MAC_CLI2
upload: ./4_㌠ to s3://Appropriate name/MAC_CLI2
$ aws s3 cp 4_㌠ s3://Appropriate name/MAC_CLI2/
upload: ./4_㌠ to s3://Appropriate name/MAC_CLI2/4_㌠
$ aws s3 cp 5_神 s3://Appropriate name/MAC_CLI2/
upload: ./5_神 to s3://Appropriate name/MAC_CLI2/5_神
$ aws s3 cp 6_神 s3://Appropriate name/MAC_CLI2/
upload: ./6_神 to s3://Appropriate name/MAC_CLI2/6_神

Check with a Python program in the same way.

s3.ObjectSummary(bucket_name='Appropriate name', key=u'MAC_CLI2/1_\u3056\u3053\u3070')
MAC_CLI2/1_ざこば
s3.ObjectSummary(bucket_name='Appropriate name', key=u'MAC_CLI2/2_\u30b6\u30ea\u30ac\u30cb')
MAC_CLI2/2_ザリガニ
s3.ObjectSummary(bucket_name='Appropriate name', key=u'MAC_CLI2/3_\uff8b\uff9f\uff6f\uff8b\uff9f')
MAC_CLI2/3_ﾋﾟｯﾋﾟ
s3.ObjectSummary(bucket_name='Appropriate name', key=u'MAC_CLI2/4_\u3320')
MAC_CLI2/4_㌠
s3.ObjectSummary(bucket_name='Appropriate name', key=u'MAC_CLI2/5_\u795e')
MAC_CLI2/5_神
s3.ObjectSummary(bucket_name='Appropriate name', key=u'MAC_CLI2/6_\ufa19')
MAC_CLI2/6_神

Surprisingly, this time the keys are in NFC.

When uploading from the console using Chrome on Mac

s3.ObjectSummary(bucket_name='Appropriate name', key=u'MAC_GUI/1_\u3055\u3099\u3053\u306f\u3099')
MAC_GUI/1_ざこば
s3.ObjectSummary(bucket_name='Appropriate name', key=u'MAC_GUI/2_\u30b5\u3099\u30ea\u30ab\u3099\u30cb')
MAC_GUI/2_ザリガニ
s3.ObjectSummary(bucket_name='Appropriate name', key=u'MAC_GUI/3_\uff8b\uff9f\uff6f\uff8b\uff9f')
MAC_GUI/3_ﾋﾟｯﾋﾟ
s3.ObjectSummary(bucket_name='Appropriate name', key=u'MAC_GUI/4_\u3320')
MAC_GUI/4_㌠
s3.ObjectSummary(bucket_name='Appropriate name', key=u'MAC_GUI/5_\u795e')
MAC_GUI/5_神
s3.ObjectSummary(bucket_name='Appropriate name', key=u'MAC_GUI/6_\ufa19')
MAC_GUI/6_神

It looks like NFD, the same as when using `aws s3 sync`.

When uploading from the console using Chrome on Windows

s3.ObjectSummary(bucket_name='Appropriate name', key=u'WIN_GUI/1_\u3056\u3053\u3070')
WIN_GUI/1_ざこば
s3.ObjectSummary(bucket_name='Appropriate name', key=u'WIN_GUI/2_\u30b6\u30ea\u30ac\u30cb')
WIN_GUI/2_ザリガニ
s3.ObjectSummary(bucket_name='Appropriate name', key=u'WIN_GUI/3_\uff8b\uff9f\uff6f\uff8b\uff9f')
WIN_GUI/3_ﾋﾟｯﾋﾟ
s3.ObjectSummary(bucket_name='Appropriate name', key=u'WIN_GUI/4_\u3320')
WIN_GUI/4_㌠
s3.ObjectSummary(bucket_name='Appropriate name', key=u'WIN_GUI/5_\u795e')
WIN_GUI/5_神
s3.ObjectSummary(bucket_name='Appropriate name', key=u'WIN_GUI/6_\ufa19')
WIN_GUI/6_神

This one is NFC.
