I uploaded a file to Amazon S3 from my service's test environment and then tried to read its metadata by specifying the key, but no object matched. After some digging, I found that the file name contained a dakuten (voiced sound mark) and realized I had fallen into the Unicode normalization trap.
That worried me a little, so I decided to take a quick look at how S3 handles such names in other situations.
Note that this investigation dates from around August 2016. I don't know how things behave now, but I'm publishing the article as it is. I'll add to it or edit it if I can reflect the current behavior or if the Windows side becomes clearer.
For how file names are handled on Mac (HFS+) and Windows (NTFS), see the pages below.
- HFS+ Encoding and Unicode Normalization, 3rd Edition (Monoka)
- UTF-8 on Mac OS X and NFC / NFD on Windows (numa's diary)
To summarize, it seems that:
- Mac uses its own normalization method based on NFD
- Windows does nothing in particular (names are stored as they were created)
- Neither applies NFKC / NFKD at the file system level
As for S3:
- S3 does not prescribe a Unicode normalization form
- Keys in different forms can coexist
- The language-specific SDKs seem to let you use whichever form you like (I have only tried Python)
- With other upload methods, the behavior depends on the client
- Most clients inherit the form used by the OS, but beware of exceptions such as `aws s3 cp`
The table below shows the results of the experiments described in the rest of this article.
| No. | Upload method / File system | Normalization form | Remarks |
|---|---|---|---|
| 1 | HFS+ | NFD(?) | Created with the touch command |
| 2 | NTFS | Not investigated | Created in Explorer |
| 3 | Python AWS SDK (boto3) | NFC / NFD / NFKC / NFKD (whichever you choose is used) | Old-form kanji are replaced with new-form kanji |
| 4 | bash (Mac) + aws s3 sync | NFD | Files from No. 1 |
| 5 | bash (Mac) + aws s3 cp | NFC | Files from No. 1 |
| 6 | Chrome (Mac, AWS console) | NFD | Files from No. 1 |
| 7 | Chrome (Win, AWS console) | NFC | Files from No. 2 |
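I think the old-form kanji becomes the new-form kanji with boto3 because of the normalization done on the Python side: U+FA19 is a CJK compatibility ideograph whose canonical decomposition is U+795E, so every normalization form replaces it.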
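The effect can be reproduced without S3 at all. The following is a small sketch in Python 3 (the article's own script is Python 2); the helper name `normalized_forms` is made up for illustration, and only the standard `unicodedata` module is used.

```python
import unicodedata

OLD_FORM = "\ufa19"  # old-form kanji (CJK compatibility ideograph)
NEW_FORM = "\u795e"  # new-form kanji (CJK unified ideograph)


def normalized_forms(text):
    """Return a dict mapping each normalization form to the normalized text."""
    return {
        form: unicodedata.normalize(form, text)
        for form in ("NFC", "NFD", "NFKC", "NFKD")
    }


results = normalized_forms(OLD_FORM)
for form, value in results.items():
    # Every form prints U+795E: the old form does not survive normalization.
    print(form, "U+%04X" % ord(value))
```

So as soon as a key passes through `unicodedata.normalize`, in any form, the distinction between entries 5 and 6 is lost.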
- It is probably best to avoid Japanese file names when uploading files to S3
- If you do have to use them, keep in mind that keys may fail to match because of differences in normalization
- aws-cli in particular behaves differently depending on the situation (for example `sync` versus `cp`), so handle it with care
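If Japanese keys cannot be avoided, one possible mitigation is to compare keys only after bringing both sides to the same form. The sketch below (Python 3) assumes you already have a list of keys, for example from a bucket listing; `nfc` and `find_matching_keys` are hypothetical helper names, not part of boto3 or aws-cli.

```python
import unicodedata


def nfc(text):
    """Normalize text to NFC so that composed and decomposed spellings compare equal."""
    return unicodedata.normalize("NFC", text)


def find_matching_keys(keys, wanted):
    """Return every key that equals `wanted` once both are NFC-normalized."""
    target = nfc(wanted)
    return [key for key in keys if nfc(key) == target]


# Keys as they might come back from a listing: the first is decomposed (NFD).
listed = [
    "MAC_CLI/1_\u3055\u3099\u3053\u306f\u3099",
    "MAC_CLI2/1_\u3056\u3053\u3070",
]
wanted = "MAC_CLI/1_\u3056\u3053\u3070"  # composed spelling typed by a user

print(wanted in listed)  # a plain comparison finds nothing
matches = find_matching_keys(listed, wanted)
print(matches)           # the decomposed key is found
```

Note that this only helps with lookup; the object still has to be fetched with the key exactly as it is stored.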
For convenience, every bucket name is replaced with the placeholder "Appropriate name".
I prepared the following program to check how S3 handles strings normalized with NFC, NFD, NFKC, and NFKD.
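#!/usr/bin/env python
# -*- encoding: utf-8 -*-
# Python 2 script: uploads one object per (normalization form, test string) pair.
import unicodedata
import boto3

FORMS = ['NFC', 'NFKC', 'NFD', 'NFKD']
STRINGS = [
    u"1_ざこば",    # hiragana with dakuten ("zakoba")
    u"2_ザリガニ",  # katakana with dakuten ("zarigani", crayfish)
    u"3_ﾋﾟｯﾋﾟ",     # halfwidth katakana with handakuten ("pippi")
    u"4_㌠",        # U+3320, a compatibility character
    u"5_神",        # new-form kanji U+795E ("kami", god)
    u"6_{0}".format(unichr(0xfa19)),  # old-form kanji U+FA19
]
BUCKET_NAME = 'Appropriate name'


def make_files():
    session = boto3.session.Session()
    s3 = session.resource("s3")
    for form in FORMS:
        for string in STRINGS:
            # Prefix the key with the form name, then normalize the whole key.
            key = u"{0:>7}/{1}".format(form, string)
            key = unicodedata.normalize(form, key)
            obj = s3.Object(BUCKET_NAME, u"{0}".format(key))
            obj.put(Body='test')


def list_files():
    session = boto3.session.Session()
    s3 = session.resource("s3")
    bucket = s3.Bucket(BUCKET_NAME)
    for i in bucket.objects.all():
        print i                      # repr, with \u escapes
        print i.key.encode("utf-8")  # the key as displayed


if __name__ == "__main__":
    make_files()
    list_files()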
The results are as follows.
s3.ObjectSummary(bucket_name='Appropriate name', key=u' NFC/1_\u3056\u3053\u3070')
NFC/1_Zakoba
s3.ObjectSummary(bucket_name='Appropriate name', key=u' NFC/2_\u30b6\u30ea\u30ac\u30cb')
NFC/2_crayfish
s3.ObjectSummary(bucket_name='Appropriate name', key=u' NFC/3_\uff8b\uff9f\uff6f\uff8b\uff9f')
NFC/3_Pippi
s3.ObjectSummary(bucket_name='Appropriate name', key=u' NFC/4_\u3320')
NFC/4_㌠
s3.ObjectSummary(bucket_name='Appropriate name', key=u' NFC/5_\u795e')
NFC/5_God
s3.ObjectSummary(bucket_name='Appropriate name', key=u' NFC/6_\u795e')
NFC/6_God
s3.ObjectSummary(bucket_name='Appropriate name', key=u' NFD/1_\u3055\u3099\u3053\u306f\u3099')
NFD/1_Zakoba
s3.ObjectSummary(bucket_name='Appropriate name', key=u' NFD/2_\u30b5\u3099\u30ea\u30ab\u3099\u30cb')
NFD/2_crayfish
s3.ObjectSummary(bucket_name='Appropriate name', key=u' NFD/3_\uff8b\uff9f\uff6f\uff8b\uff9f')
NFD/3_Pippi
s3.ObjectSummary(bucket_name='Appropriate name', key=u' NFD/4_\u3320')
NFD/4_㌠
s3.ObjectSummary(bucket_name='Appropriate name', key=u' NFD/5_\u795e')
NFD/5_God
s3.ObjectSummary(bucket_name='Appropriate name', key=u' NFD/6_\u795e')
NFD/6_God
s3.ObjectSummary(bucket_name='Appropriate name', key=u' NFKC/1_\u3056\u3053\u3070')
NFKC/1_Zakoba
s3.ObjectSummary(bucket_name='Appropriate name', key=u' NFKC/2_\u30b6\u30ea\u30ac\u30cb')
NFKC/2_crayfish
s3.ObjectSummary(bucket_name='Appropriate name', key=u' NFKC/3_\u30d4\u30c3\u30d4')
NFKC/3_Pippi
s3.ObjectSummary(bucket_name='Appropriate name', key=u' NFKC/4_\u30b5\u30f3\u30c1\u30fc\u30e0')
NFKC/4_サンチーム
s3.ObjectSummary(bucket_name='Appropriate name', key=u' NFKC/5_\u795e')
NFKC/5_God
s3.ObjectSummary(bucket_name='Appropriate name', key=u' NFKC/6_\u795e')
NFKC/6_God
s3.ObjectSummary(bucket_name='Appropriate name', key=u' NFKD/1_\u3055\u3099\u3053\u306f\u3099')
NFKD/1_Zakoba
s3.ObjectSummary(bucket_name='Appropriate name', key=u' NFKD/2_\u30b5\u3099\u30ea\u30ab\u3099\u30cb')
NFKD/2_crayfish
s3.ObjectSummary(bucket_name='Appropriate name', key=u' NFKD/3_\u30d2\u309a\u30c3\u30d2\u309a')
NFKD/3_Pippi
s3.ObjectSummary(bucket_name='Appropriate name', key=u' NFKD/4_\u30b5\u30f3\u30c1\u30fc\u30e0')
NFKD/4_サンチーム
s3.ObjectSummary(bucket_name='Appropriate name', key=u' NFKD/5_\u795e')
NFKD/5_God
s3.ObjectSummary(bucket_name='Appropriate name', key=u' NFKD/6_\u795e')
NFKD/6_God
Looking at these results, S3 itself does nothing to the keys: each one is stored exactly as it was normalized on the upload side.
Although S3 is often treated like a file system, a name (key) in S3 is just another piece of metadata, so its stance is probably to leave normalization to the uploader.
Next, the Mac. First, here is the state of the files on the Mac.
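Why two keys that look identical on screen are different objects becomes clearer when the strings are compared as data. A minimal Python 3 sketch using only the standard library (the variable names are illustrative):

```python
import unicodedata

name = "\u3056\u3053\u3070"  # the test string from entry 1, in composed form
nfc_key = unicodedata.normalize("NFC", name)
nfd_key = unicodedata.normalize("NFD", name)

print(nfc_key == nfd_key)                   # False
print(len(nfc_key), len(nfd_key))           # 3 5 (code points)
print(len(nfc_key.encode("utf-8")),
      len(nfd_key.encode("utf-8")))         # 9 15 (bytes)
```

The NFD version carries each dakuten as a separate combining character (U+3099), so the two strings differ both in code points and in their UTF-8 bytes, and a store that keeps keys as given will treat them as two distinct names.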
$ ls -l
-rw-r--r-- 1 npoi staff 0 8 2 21:43 1_Zakoba
-rw-r--r-- 1 npoi staff 0 8 2 21:43 2_crayfish
-rw-r--r-- 1 npoi staff 0 8 2 23:42 3_Pippi
-rw-r--r-- 1 npoi staff 0 8 2 21:43 4_㌠
-rw-r--r-- 1 npoi staff 0 8 2 23:42 5_God
-rw-r--r-- 1 npoi staff 0 8 2 23:08 6_God
Checking the same directory from Python's interactive mode looks like this.
$ python
Python 2.7.11 (default, Dec 5 2015, 14:44:53)
[GCC 4.2.1 Compatible Apple LLVM 7.0.0 (clang-700.1.76)] on darwin
Type "help", "copyright", "credits" or "license" for more information.
>>> import os
>>> for i in os.listdir("."):
...     print [i.decode('utf-8')],i
...
[u'.DS_Store'] .DS_Store
[u'1_\u3055\u3099\u3053\u306f\u3099'] 1_Zakoba
[u'2_\u30b5\u3099\u30ea\u30ab\u3099\u30cb'] 2_crayfish
[u'3_\uff8b\uff9f\uff6f\uff8b\uff9f'] 3_Pippi
[u'4_\u3320'] 4_㌠
[u'5_\u795e'] 5_God
[u'6_\ufa19'] 6_God
>>>
This looks like NFD, but entries 5 and 6 are worth a closer look: under strict NFD, U+FA19 would be decomposed to U+795E, yet here the old-form kanji in 6 is kept as is.
Let's sync this with `aws s3 sync`.
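One way to check this kind of observation mechanically is to ask, for each name, which normalization forms would leave it unchanged. The following Python 3 sketch does that for the names seen in the listing above; `satisfied_forms` is a hypothetical helper, and the strings are copied from the `\u` escapes in the output.

```python
import unicodedata

FORMS = ("NFC", "NFD", "NFKC", "NFKD")


def satisfied_forms(text):
    """Return the normalization forms under which `text` is already normalized."""
    return [form for form in FORMS if unicodedata.normalize(form, text) == text]


names = [
    "1_\u3055\u3099\u3053\u306f\u3099",   # decomposed hiragana
    "3_\uff8b\uff9f\uff6f\uff8b\uff9f",   # halfwidth katakana
    "5_\u795e",                           # new-form kanji
    "6_\ufa19",                           # old-form kanji
]
report = {name: satisfied_forms(name) for name in names}
for name, forms in report.items():
    print(ascii(name), forms)
```

Entry 1 satisfies NFD and NFKD, while entry 6 satisfies no form at all, which is consistent with HFS+ using an NFD-based method of its own rather than strict NFD.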
$ aws s3 sync ./sample/ s3://Appropriate name/MAC_CLI
upload: sample/4_㌠ to s3://Appropriate name/MAC_CLI/4_㌠
upload: sample/2_crayfish to s3://Appropriate name/MAC_CLI/2_crayfish
upload: sample/6_God to s3://Appropriate name/MAC_CLI/6_God
upload: sample/5_God to s3://Appropriate name/MAC_CLI/5_God
upload: sample/1_Zakoba to s3://Appropriate name/MAC_CLI/1_Zakoba
upload: sample/.DS_Store to s3://Appropriate name/MAC_CLI/.DS_Store
upload: sample/3_Pippi to s3://Appropriate name/MAC_CLI/3_Pippi
With the files in this state, I listed the bucket with the same function that was used to check the Python SDK upload. The result was as follows.
s3.ObjectSummary(bucket_name='Appropriate name', key=u'MAC_CLI/1_\u3055\u3099\u3053\u306f\u3099')
MAC_CLI/1_Zakoba
s3.ObjectSummary(bucket_name='Appropriate name', key=u'MAC_CLI/2_\u30b5\u3099\u30ea\u30ab\u3099\u30cb')
MAC_CLI/2_crayfish
s3.ObjectSummary(bucket_name='Appropriate name', key=u'MAC_CLI/3_\uff8b\uff9f\uff6f\uff8b\uff9f')
MAC_CLI/3_Pippi
s3.ObjectSummary(bucket_name='Appropriate name', key=u'MAC_CLI/4_\u3320')
MAC_CLI/4_㌠
s3.ObjectSummary(bucket_name='Appropriate name', key=u'MAC_CLI/5_\u795e')
MAC_CLI/5_God
s3.ObjectSummary(bucket_name='Appropriate name', key=u'MAC_CLI/6_\ufa19')
MAC_CLI/6_God
The old-form kanji was uploaded intact, so the keys appear to be stored in the HFS+ flavor of NFD as is.
I used `aws s3 sync` above, but what happens with `aws s3 cp`?
$ aws s3 cp 1_Zakoba s3://Appropriate name/MAC_CLI2/
upload: 1_Zakoba to s3://Appropriate name/MAC_CLI2/1_Zakoba
$ aws s3 cp 2_crayfish s3://Appropriate name/MAC_CLI2/
upload: ./2_crayfish to s3://Appropriate name/MAC_CLI2/2_crayfish
$ aws s3 cp 3_Pippi s3://Appropriate name/MAC_CLI2/
upload: ./3_Pippi to s3://Appropriate name/MAC_CLI2/3_Pippi
$ aws s3 cp 4_㌠ s3://Appropriate name/MAC_CLI2
upload: ./4_㌠ to s3://Appropriate name/MAC_CLI2
$ aws s3 cp 4_㌠ s3://Appropriate name/MAC_CLI2/
upload: ./4_㌠ to s3://Appropriate name/MAC_CLI2/4_㌠
$ aws s3 cp 5_God s3://Appropriate name/MAC_CLI2/
upload: ./5_God to s3://Appropriate name/MAC_CLI2/5_God
$ aws s3 cp 6_God s3://Appropriate name/MAC_CLI2/
upload: ./6_God to s3://Appropriate name/MAC_CLI2/6_God
Check with a Python program in the same way.
s3.ObjectSummary(bucket_name='Appropriate name', key=u'MAC_CLI2/1_\u3056\u3053\u3070')
MAC_CLI2/1_Zakoba
s3.ObjectSummary(bucket_name='Appropriate name', key=u'MAC_CLI2/2_\u30b6\u30ea\u30ac\u30cb')
MAC_CLI2/2_crayfish
s3.ObjectSummary(bucket_name='Appropriate name', key=u'MAC_CLI2/3_\uff8b\uff9f\uff6f\uff8b\uff9f')
MAC_CLI2/3_Pippi
s3.ObjectSummary(bucket_name='Appropriate name', key=u'MAC_CLI2/4_\u3320')
MAC_CLI2/4_㌠
s3.ObjectSummary(bucket_name='Appropriate name', key=u'MAC_CLI2/5_\u795e')
MAC_CLI2/5_God
s3.ObjectSummary(bucket_name='Appropriate name', key=u'MAC_CLI2/6_\ufa19')
MAC_CLI2/6_God
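Surprisingly, the keys are in NFC (composed) form this time.
Next are the same files uploaded from Chrome on Mac through the AWS console.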
s3.ObjectSummary(bucket_name='Appropriate name', key=u'MAC_GUI/1_\u3055\u3099\u3053\u306f\u3099')
MAC_GUI/1_Zakoba
s3.ObjectSummary(bucket_name='Appropriate name', key=u'MAC_GUI/2_\u30b5\u3099\u30ea\u30ab\u3099\u30cb')
MAC_GUI/2_crayfish
s3.ObjectSummary(bucket_name='Appropriate name', key=u'MAC_GUI/3_\uff8b\uff9f\uff6f\uff8b\uff9f')
MAC_GUI/3_Pippi
s3.ObjectSummary(bucket_name='Appropriate name', key=u'MAC_GUI/4_\u3320')
MAC_GUI/4_㌠
s3.ObjectSummary(bucket_name='Appropriate name', key=u'MAC_GUI/5_\u795e')
MAC_GUI/5_God
s3.ObjectSummary(bucket_name='Appropriate name', key=u'MAC_GUI/6_\ufa19')
MAC_GUI/6_God
This looks like NFD, the same as with `aws s3 sync`.
Finally, here are the files created on Windows, uploaded from Chrome on Windows through the AWS console.
s3.ObjectSummary(bucket_name='Appropriate name', key=u'WIN_GUI/1_\u3056\u3053\u3070')
WIN_GUI/1_Zakoba
s3.ObjectSummary(bucket_name='Appropriate name', key=u'WIN_GUI/2_\u30b6\u30ea\u30ac\u30cb')
WIN_GUI/2_crayfish
s3.ObjectSummary(bucket_name='Appropriate name', key=u'WIN_GUI/3_\uff8b\uff9f\uff6f\uff8b\uff9f')
WIN_GUI/3_Pippi
s3.ObjectSummary(bucket_name='Appropriate name', key=u'WIN_GUI/4_\u3320')
WIN_GUI/4_㌠
s3.ObjectSummary(bucket_name='Appropriate name', key=u'WIN_GUI/5_\u795e')
WIN_GUI/5_God
s3.ObjectSummary(bucket_name='Appropriate name', key=u'WIN_GUI/6_\ufa19')
WIN_GUI/6_God
These keys are in NFC (composed) form, although U+FA19 in entry 6 is left as is.
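Since different clients can leave differently normalized keys in the same bucket, it may be worth auditing a key listing for names that collapse to the same string after normalization. Below is a rough Python 3 sketch; `group_by_nfc` is a hypothetical helper, the `docs/` prefix is invented for the example, and the keys would in practice come from a bucket listing.

```python
import unicodedata
from collections import defaultdict


def group_by_nfc(keys):
    """Group keys by their NFC form and keep only groups with more than one key."""
    groups = defaultdict(list)
    for key in keys:
        groups[unicodedata.normalize("NFC", key)].append(key)
    return {nfc_key: members for nfc_key, members in groups.items() if len(members) > 1}


keys = [
    "docs/1_\u3056\u3053\u3070",              # composed, e.g. from `aws s3 cp`
    "docs/1_\u3055\u3099\u3053\u306f\u3099",  # decomposed, e.g. from `aws s3 sync`
    "docs/5_\u795e",
    "docs/6_\ufa19",
]
duplicates = group_by_nfc(keys)
for nfc_key, members in duplicates.items():
    print(ascii(nfc_key), [ascii(m) for m in members])
```

Here only the two spellings of entry 1 are reported. Choosing NFC as the comparison form is an assumption made for the example; any single form would work as long as it is applied consistently.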