Python string
Python uses a mechanism called codecs to convert between strings and various character encodings, including multibyte ones. Besides Japanese, the standard codecs also cover Korean and Chinese encodings.
Multibyte characters are characters that cannot be expressed in 1 byte and are therefore expressed as data of 2 bytes or more.
[a] Can be expressed in 1 byte
[あ] Cannot be expressed in 1 byte
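A quick way to see the difference is to encode a character and count the resulting bytes. The sketch below assumes UTF-8 as the target encoding and uses the hiragana あ as the multibyte example; the helper name `byte_length` is made up for illustration.

```python
def byte_length(ch: str, encoding: str = "utf-8") -> int:
    """Return how many bytes `ch` occupies in the given encoding."""
    return len(ch.encode(encoding))


single = byte_length("a")   # ASCII letter
multi = byte_length("あ")    # hiragana, outside the ASCII range

print(single, multi)
```

The count depends on the encoding chosen, so passing a different codec name as the second argument may give a different result for the same character.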
I looked into the representative character codes and summarized them below.
ASCII
A character code covering the Latin alphabet, digits, symbols, and control characters. It is widely used worldwide as the most basic character code, and many other character codes are designed as extensions of ASCII. Characters are represented by 7-bit values (0 to 127), so 128 characters are defined. "A" is 0x41 in ASCII (0x indicates hexadecimal).
Since this can be hard to picture, the lowercase letters from the ASCII table are extracted below.
Hexadecimal | letter |
---|---|
0x61 | a |
0x62 | b |
0x63 | c |
0x64 | d |
0x65 | e |
0x66 | f |
0x67 | g |
0x68 | h |
0x69 | i |
0x6a | j |
0x6b | k |
0x6c | l |
0x6d | m |
0x6e | n |
0x6f | o |
0x70 | p |
0x71 | q |
0x72 | r |
0x73 | s |
0x74 | t |
0x75 | u |
0x76 | v |
0x77 | w |
0x78 | x |
0x79 | y |
0x7a | z |
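The table above can be reproduced from Python itself, since `ord()` returns the code of a character. This is only a small sketch; the variable name `rows` is arbitrary.

```python
import string

# Build (hex code, letter) pairs for a-z from each letter's code value.
rows = [(f"0x{ord(ch):02x}", ch) for ch in string.ascii_lowercase]

for code, letter in rows:
    print(f"{code} | {letter} |")
```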
Shift_JIS
A character code often used to represent Japanese. It encodes various characters, including Japanese, standardized by the Japanese Industrial Standards Committee. Single-byte characters (the ASCII range and half-width katakana) take 1 byte, and other characters such as hiragana and kanji take 2 bytes. "あ" is 0x82A0 in Shift_JIS.
UTF-8
The most widely used character code today. It is a variable-length encoding: the part shared with ASCII is represented by 1 byte, and all other characters by 2 to 4 bytes. Because it can handle characters from all over the world and is highly compatible with ASCII, it is used as the standard by a great deal of software. "あ" is 0xe38182 in UTF-8.
In Python 2.x, the default source encoding was ASCII. In Python 3.x, it is UTF-8, so you can handle Japanese without declaring the character code.
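The byte values for Shift_JIS and UTF-8 can be checked directly by encoding the hiragana あ with each codec. This is a minimal sketch that relies only on the `shift_jis` and `utf-8` codecs from the standard library.

```python
text = "あ"

sjis_bytes = text.encode("shift_jis")
utf8_bytes = text.encode("utf-8")

print(sjis_bytes.hex())  # hex digits of the Shift_JIS bytes
print(utf8_bytes.hex())  # hex digits of the UTF-8 bytes

# The letter "A" keeps its single-byte ASCII value in both encodings.
ascii_same = "A".encode("shift_jis") == "A".encode("utf-8") == b"\x41"
print(ascii_same)
```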
Unicode
A character code standard created with the aim of providing an encoding that can be used in common in all countries. It is maintained by the Unicode Consortium and kept in sync with the international standard ISO/IEC 10646.
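In Python 3, a `str` is a sequence of Unicode code points, and an encoding such as UTF-8 or Shift_JIS is one way of writing those code points out as bytes. The sketch below illustrates this with あ; the choice of the three codecs is just an example.

```python
ch = "あ"

code_point = ord(ch)            # Unicode code point as an integer
print(f"U+{code_point:04X}")

# The same code point serialized by three different encodings.
encoded = {name: ch.encode(name) for name in ("utf-8", "utf-16-be", "shift_jis")}
for name, data in encoded.items():
    print(name, data.hex())
```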
To convert between a string and the bytes type
encode() converts a string to bytes. Syntax: 'character string'.encode('character code name'), where the character code name is "utf-8" etc.
decode() converts bytes back to a string. Syntax: b'byte string'.decode('character code name')
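A round trip through encode() and decode() might look like the following. The sample text is arbitrary, and the `errors="replace"` part is an optional extra showing one way to handle bytes that the chosen codec cannot decode.

```python
original = "Python 文字列"

encoded = original.encode("utf-8")      # str -> bytes
decoded = encoded.decode("utf-8")       # bytes -> str

print(type(encoded).__name__, type(decoded).__name__)

# Decoding with a codec that does not match the data raises an error.
try:
    encoded.decode("ascii")
    ascii_failed = False
except UnicodeDecodeError:
    ascii_failed = True

# errors="replace" substitutes U+FFFD for undecodable bytes instead of raising.
lenient = encoded.decode("ascii", errors="replace")
print(ascii_failed, lenient)
```

The codec passed to decode() should be the same one used for encode(); otherwise the result is either an exception or garbled text.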