Character code processing compared to "Are to put on a wound" ~ Garbled file name operation with python3 ~

Introduction

I want to deal with a case where a file name bug occurs.

図3.png 図1.png

Even if the file name is the same in Japanese, it is completely different depending on whether the character code is utf-8 or shift-jis. Thanks to the smart interpretation of the UI that displays the file name, you can usually work without using your head, but if you need to do that with python, the character code is mixed (the character code differs depending on the file to be handled) Situation) becomes very troublesome. So, this time, while understanding the display and conversion of each character code, I aim to finally convert the Japanese file name of shift-jis to utf-8 which is the default of python3.

The assumption is 3 series. ** 3 series ** (important because it is related to character code). Most of the knowledge required for the character code of this article is ** official site ** (including citation), and some It was obtained from ** this site **.

Character code that is likened to the name "Are to be attached to a wound"

I downloaded the image of "Are to put on the wound" from Irasutoya. This name seems to be different depending on the region (?). In this article, we will treat the relationship between these names as follows.

図8.png

The name "Band-Aid" is [Daijirin](https://www.weblio.jp/content/%E3%81%B0%E3%82%93%E3%81%9D%E3%81%86% It is an appellative published in E3% 81% 93% E3% 81% 86), while "Cut Bang (Yutoku Pharmaceutical Ind. 1.html)) ”and“ Band-Aid (Johnson & Johnson Co., Ltd.) ”are product names. I know that there are differences in performance and materials depending on the sales company (I have not compared them), but for the sake of simplicity, in this article, all of these names are the same, "Are to put on the wound". Assuming that it is a pointing word __. In other words, "the one to be attached to the wound" is represented by a unique <name (product name) or notation> according to the four types of verbalization rules (** character code **) in the figure. Applying this verbalization rule to translate is called ** encode **. In python3, the <object (are)> to be expressed is defined as ** str type **, and the <name (product name) and notation> that differs depending on the verbalization rules is defined as ** bytes type **.

A Unicode string is a string of code points, which is a number from 0 to 0x10FFFF (1,114,111 in decimal notation). This code point sequence is represented in memory as a code unit column, which maps to an 8-bit byte string. The rules for translating a Unicode string as a byte string are called character encoding or simply encoding. (Omitted) The default encoding of Python source code is UTF-8, so you can include Unicode characters as-is in string literals.

In the python code, the str type "Are to be attached to the wound" is expressed using ** unicode character string ** (and corresponding ** code point **). Then, when encoding is required in the character string processing, it is encoded as "bansoko" using the default character code (hiragana in the example, UTF-8 in python).

The default character code can be confirmed with the following code.

Check the default character code


import sys
print(sys.getdefaultencoding()) # → utf-8

On the contrary, displaying using is called ** decode **.

You can also create a string using the decode () method of the bytes class. This method takes a UTF-8-like value as the encoding argument and optionally an errors argument.

Since the correspondence between and may not work well, both <str> .encode () and <bytes> .decode () can take ʻerrors` arguments. Exception handling is performed according to the rules specified in this variable.

Encoding and decoding of "Akasatana"

Immediately, the character string "Akasatana" is encoded using four types of character codes.

utf-8 and shift-jis comparison


str_src = 'Akasa Tana'
enc_utf = str_src.encode() # default: encoding="utf-8"
enc_u16 = str_src.encode('utf-16')
enc_euc = str_src.encode('euc_jp')
enc_jis = str_src.encode('shift-jis')

print('--- ENCODE ---')
print('encoded by utf-8:', enc_utf, '(length={})'.format(len(enc_utf)))
print('encoded by utf-16:', enc_u16, '(length={})'.format(len(enc_u16)))
print('encoded by euc_jp:', enc_euc, '(length={})'.format(len(enc_euc)))
print('encoded by shift-jis:', enc_jis, '(length={})'.format(len(enc_jis)))
print('--- DECODE ---')
print(enc_utf.decode(), enc_u16.decode('utf-16'), enc_euc.decode('euc_jp'), enc_jis.decode('shift-jis'))
'''
--- ENCODE ---
encoded by utf-8: b'\xe3\x81\x82\xe3\x81\x8b\xe3\x81\x95\xe3\x81\x9f\xe3\x81\xaa' (length=15)
encoded by utf-16: b'\xff\xfeB0K0U0_0j0' (length=12)
encoded by euc_jp: b'\xa4\xa2\xa4\xab\xa4\xb5\xa4\xbf\xa4\xca' (length=10)
encoded by shift-jis: b'\x82\xa0\x82\xa9\x82\xb3\x82\xbd\x82\xc8' (length=10)
--- DECODE ---
Akasata Akasata Akasata Akasata
'''

The output is plotted as follows (utf-16 is also displayed in hexadecimal).

図9.png

The 2 bytes that appear at the beginning of the encoding using utf-16 is a code called BOM.

The Unicode character U + FEFF is used as a byte-order mark (BOM) and is written as the first character in a file to help automatically determine the byte order of the file. Some encodings, like UTF-16, require a BOM at the beginning of the file; when such an encoding is used, the BOM is automatically written as the first character and when reading the file Is implicitly removed.

Creating a file that can be compared to "Attach to a wound"

Now consider creating a file by specifying the file name in the python code. As usual, we consider the principle by "sticking on the wound". Again, ** Here, we will focus on the "file name" and proceed with the discussion without worrying too much about the contents of the created file **.

図10.png

When manufacturing "Are to be attached to a wound", the name "name" is recorded on the disc. Then, when "that" is needed, it will be translated (decoded) every time from the "calling" to "the one to be attached to the wound" in the brain of the expected verbalization rule. Here, consider the case where the expected verbalization rules are only those of Yutoku Pharmaceutical. If the "bansoukou" is passed to this rule, the translation will be incomprehensible, saying "" bansoko "? Only" cut bang "" (** = characters). It will be garbled **). Therefore, it is necessary to select an appropriate when manufacturing the "Are to be attached to the wound". What should I do to manufacture with "Cut Bang", which is defined by Yutoku Pharmaceutical's verbalization rules? First, translate "the one that sticks to the wound" into Yutoku Pharmaceutical's "calling". Next, by making a manufacturing request using this , the of the finished product can also be translated according to the rules of Yutoku Pharmaceutical. There is nothing wrong with this, but ** if you are accustomed to the str type that is translated (decoded) in hiragana **, you will want to request manufacturing using the str type. In this case, the character string used for the request is ** str (= "'\ u or \ u \ u and \ uba \ un'") **, which is a translation of Yutoku Pharmaceutical's in hiragana. There must be.

If you work without thinking about anything, the encoding character code specified by default in python3 will be applied to the file name when the file is generated. Furthermore, the default of the ʻerrors` argument is also determined. These can also be confirmed by the following code.

python


import sys
print(sys.getfilesystemencoding())
print(sys.getfilesystemencodeerrors())
'''
utf-8
surrogateescape
'''

"Surrogate" itself is newly introduced as a code to further expand the expression area of unicode. In Python3 bytes type decoding, ʻerrors = surrogate escape replaces the corresponding part of bytes where decoding does not work with the unicode character string used in surrogate. As a result, although the decoding itself is incomplete, errors can be avoided without losing the original bytes type information. When you want to encode the same code again, the unicode character string inserted by specifying ʻerrors = surrogate escape is deleted so that the original form can be restored.

The surrogateescape error handler decodes any non-ASCII bytes as code points for Unicode private use areas from U + DC80 to U + DCFF. These private code points are returned to the same bytes when using surrogateescape, an error handler when encoding and writing back the data.

Creating a Japanese directory file

Now, let's consider saving a file with a Japanese name in utf-8 and shift-jis, paying attention to the above.

Preparation

Import the module to be used and create a working directory. This article does not use character code processing modules such as codecs (I'm not sure).

Preparation: Import modules used, create working directory


import os
import glob

testdir = './testdir'
if not os.path.exists(testdir):
    os.mkdir(testdir)

#Search and display the path from the current directory only for directories
print([p for p in glob.glob('./*') if os.path.isdir(p)])
'''
[..., './testdir', ...]
''' 

Create file with utf-8

As mentioned above, when the character code is utf-8, it is specified as the default of python3, so there is nothing to be careful about.

utf-8 directory creation

Create the utf-8 directory "aiueo" in the working directory using ʻos.mkdir () `.

Directory creation (utf-8)


utf_dirname = os.path.join(testdir, 'AIUEO')
os.mkdir(utf_dirname)

Save txt file with utf-8

In addition, create "Kakikukeko.txt" in "Aiueo". Although not used in this article, the character string in utf-8 is output in the file.

File creation (utf-8)


utf_filename = os.path.join(utf_dirname, 'Kakikukeko.txt')
with open(utf_filename, 'w') as f:
    f.write('SA Shi Su Se So\n')

Create file with shift-jis

Finally, create a new file whose file name is defined by shift-jis.

図11.png

Create shift-jis directory (by bytes type)

Create the shift-jis directory "Akasatana" in the working directory. Until now, str type was given to ʻos.mkdir () `, but it works normally even if ** bytes type is inserted **.

python


jis_dirname = os.path.join(testdir, 'Akasa Tana').encode('shift-jis')
print(jis_dirname)
os.mkdir(jis_dirname)
'''
b'./testdir/\x82\xa0\x82\xa9\x82\xb3\x82\xbd\x82\xc8'
'''

Create shift-jis directory (by str type)

Earlier we showed that a new directory can be created using a bytes type path name, but this time we will create a shift-jis directory "Akasatana" using a str type. As mentioned above, it is known that ʻencoding ='utf-8' and ʻerrors ='surrogate escape' are used when encoding the file name by default, so use these options to change the bytes type to str type. Convert (decode) to. By specifying ʻerrors,'dc' is added to the beginning of each output unicode character string. When the generated str type unicode character string is put in print () by this special decoding, an error (ʻUnicodeDecodeError) is thrown. Therefore, by enclosing it in repr (), the unicode character string is output as it is.

Directory creation (shift-jis)


jis_dirname_bytes = os.path.join(testdir, 'Akasa Tana'.encode('shift-jis')
jis_dirname_str = jis_dirname_bytes.decode(errors='surrogateescape'))
print(repr(jis_dirname_str))
os.mkdir(jis_dirname_str)
'''
'./testdir/\udc82\udca0\udc82\udca9\udc82\udcb3\udc82\udcbd\udc82\udcc8'
'''

Create txt file name with shift-jis

Create "Hamayarawa.txt" in the directory "Akasatana" with shift-jis. The character code of the text in the file is also set to shift-jis by ʻencoding ='shift-jis'`.

File creation (shift-jis)


jis_filename = os.path.join(jis_dirname, 'Hamayarawa.txt'.encode('shift-jis'))
# encode()After.decode(errors='surrogateescape')May be attached
with open(jis_filename, 'w', encoding='shift-jis') as f:
    f.write('Suddenly\n Himiri ゐ')

Check the saved file once

So far, the file name has been defined with two types of character codes, but let's display it with the UI and glob.glob (). In the figure below, the blue square in the middle is the saved file name (bytes). The lower side shows the result of the UI translating the file name by specifying the character code, and the upper side shows the result of getting the file name list in str type or bytes type in python.

図12.png

UI display by utf-8

I used the top page of Jupyter Notebook. Although the "aiueo" directory created with utf-8 is displayed normally. The "Akasatana" directory created by shift-jis is garbled and cannot be selected.

図1.png

"Kakikukeko.txt" in the "Aiueo" directory is also displayed without any problem.

図2.png

UI display by shift-jis

I used WinSCP. Although the "Akasatana" directory created by shift-jis is displayed normally. The "aiueo" directory created with utf-8 is garbled, and if you select it, an error will occur.

図3.png

"Hamayarawa.txt" in the "Akasatana" directory is also displayed without any problem.

図4.png

List display by glob.glob ()

If you try to get the list of files in the working directory normally, the file name will be decoded by default utf-8, so ** "Akasatana" directory will be a str type with surrogate escape **.

utf-8 to display the path name


print(glob.glob(os.path.join(testdir, '*')))
print(glob.glob(os.path.join(testdir, 'AIUEO', '*')))
'''
['./testdir/AIUEO',
 './testdir/\udc82\udca0\udc82\udca9\udc82\udcb3\udc82\udcbd\udc82\udcc8']
['./testdir/AIUEO/Kakikukeko.txt']
'''

Here, the bytes type is assigned to ** glob.glob (), and the file name is acquired by the ** method of decoding the obtained output with an arbitrary code. If you decode with shift-jis, the name of the "Akasatana" directory will be displayed normally, and the "Aiueo" directory will be garbled.

List of path names using bytes


#Get file name with bytes type
bytes_paths = glob.glob(os.path.join(testdir, '*').encode())
print(bytes_paths)
# utf-Decode at 8
print([f.decode(errors='surrogateescape') for f in bytes_paths])
# shift-Decode with jis
print([f.decode('shift-jis', errors='surrogateescape') for f in bytes_paths])
'''
[b'./testdir/\xe3\x81\x82\xe3\x81\x84\xe3\x81\x86\xe3\x81\x88\xe3\x81\x8a',
 b'./testdir/\x82\xa0\x82\xa9\x82\xb3\x82\xbd\x82\xc8']
['./testdir/AIUEO',
 './testdir/\udc82\udca0\udc82\udca9\udc82\udcb3\udc82\udcbd\udc82\udcc8']
['./testdir/縺 > 縺\udc86 Doctor ♀',
 './testdir/Akasa Tana']
'''

Character code conversion of file name (shift-jis → utf-8)

Finally, change the file name saved by shift-jis to the file name defined in utf-8.

Directory name (shift-jis) conversion

Get the path of the directory defined by shift-jis directly and replace it with the string of utf-8 by ʻos.rename () `.

Single directory character code conversion


target = glob.glob(os.path.join(testdir, '*').encode())[1]
print(repr(target))
print(target.decode('shift-jis'))
os.rename(target, target.decode('shift-jis'))
'''
b'./testdir/\x82\xa0\x82\xa9\x82\xb3\x82\xbd\x82\xc8'
./testdir/Akasa Tana
'''

The result is shown below. The character code was successfully converted and "Akasatana" can be seen.

図5.png

Batch conversion of file names (shift-jis)

Enter the "Akasatana" directory.

図6.png

In this article, there is only one file, but assuming that character code conversion of multiple files is required, the code with loop processing is used. Focusing on the Japanese part, the "Akasatana" part is the utf-8 character string, and "Hamayarawa.txt" is the shift-jis character string. ** Bytes type decoding that mixes these Returns an error ** (Since the "./testdir" part is included in the ascii string, both utf-8 and shift-jis work with common bytes, so you don't have to think about it). It is necessary to separate these two before processing.

Character code conversion of all files in the directory


for f in glob.glob(os.path.join(testdir, 'Akasa Tana', '*').encode()):
    #Directory name (utf-Decode with 8)
    d_str = os.path.dirname(f).decode()
    #File name (shift-Decode with jis)
    b_str = os.path.basename(f).decode('shift-jis')
    #Character code conversion
    os.rename(f, os.path.join(d_str, b_str))

With this, "Hamayarawa.txt" is also safely defined with the character code of utf-8.

図7.png

At the end

So far, I think I finally got the knack for string manipulation. In particular, I found that the work becomes much easier if I make good use of the bytes type, which I was used to and didn't care about at all. In addition to this, if you can choose the handling of ʻerrors` at the time of encoding / decoding, most of the work will be possible.

After all, I feel that the official document is the easiest to understand.

Recommended Posts