Strings in Python 2 are confusing.
To begin with, the story around character codes is complicated, and not only in Python. This is probably because different people use different terms for the same things. This article follows the definitions in [Yukihiro Matsumoto, *The World of Code*](http://www.amazon.co.jp/dp/4822234312):
| Term | Meaning |
|---|---|
| Character | A symbol used in a system that represents language visually |
| Glyph | The concrete visual form of an individual character |
| Character set | A collection of characters to which character codes are assigned |
| Character code | The number assigned to each individual character |
| Character encoding | The way character codes are represented on a computer |
There are two kinds of strings in Python 2. Here I will call them **str strings** and **unicode strings**, and refer to both collectively as **strings**. The official documentation does not use these terms consistently, so I adopt this naming for convenience.

The conclusion first: you should basically use unicode strings.
A str string is the object produced by a literal such as `'あい'`. In an interactive session its repr shows the individual bytes, escaped with `\x`:

```
>>> 'あいう'
'\xe3\x81\x82\xe3\x81\x84\xe3\x81\x86'
```

`[]` returns bytes, not characters:

```
>>> 'あ'[0]
'\xe3'
>>> 'あ'[1]
'\x81'
>>> 'あ'[2]
'\x82'
>>> 'あ'[3]
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
IndexError: string index out of range
```

`len()` returns the number of bytes:

```
>>> len('あいう')
9
```
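For comparison, Python 3's `bytes` type plays the role that str strings play in Python 2. This is a minimal Python 3 sketch (not part of the original article) reproducing the same byte-level behavior:

```python
# -*- coding: utf-8 -*-
# In Python 3, `bytes` behaves like Python 2's str string.
b = 'あいう'.encode('utf-8')  # the same 9 bytes as the Python 2 literal 'あいう'

print(b)        # b'\xe3\x81\x82\xe3\x81\x84\xe3\x81\x86'
print(b[0])     # indexing yields a byte value (an int in Python 3): 227 == 0xe3
print(len(b))   # 9 -- the number of bytes, not characters
```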
A unicode string is the object produced by a literal such as `u'あい'`. In an interactive session its repr shows the code point of each character, escaped with `\u`:

```
>>> u'あいう'
u'\u3042\u3044\u3046'
```

`[]` returns a character:

```
>>> u'あ'[0]
u'\u3042'
>>> u'あ'[1]
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
IndexError: string index out of range
```

`len()` returns the number of characters:

```
>>> len(u'あいう')
3
```
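Likewise, Python 3's `str` behaves like Python 2's unicode string. A small Python 3 comparison sketch (an addition, not from the original transcripts):

```python
# -*- coding: utf-8 -*-
# In Python 3, `str` behaves like Python 2's unicode string.
s = 'あいう'          # a sequence of characters, like Python 2's u'あいう'

print(s[0])            # あ  -- indexing yields a character
print(len(s))          # 3  -- the number of characters
print(hex(ord(s[0])))  # 0x3042, the code point shown as \u3042 in Python 2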
As shown above, str strings have two drawbacks: `[]` returns bytes rather than characters, and `len()` returns the number of bytes rather than the number of characters. With unicode strings you have neither problem, so you should use unicode strings instead of str strings.
Use `isinstance(object, str)` and `isinstance(object, unicode)` to find out whether an object is a str string or a unicode string. Looking up the type with `type(object)` is not recommended.

```
>>> isinstance(u'あ', unicode)
True
>>> isinstance('あ', str)
True
>>> isinstance('あ', unicode)
False
>>> isinstance(u'あ', str)
False
```
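If the same type checks are ported to Python 3 (an assumption of this sketch, since the article targets Python 2), `str` takes over the role of `unicode` and `bytes` the role of the old `str`:

```python
# -*- coding: utf-8 -*-
s = 'あ'                # text (Python 3 str, Python 2's unicode)
b = s.encode('utf-8')   # bytes (Python 3 bytes, Python 2's str)

print(isinstance(s, str))    # True
print(isinstance(b, bytes))  # True
print(isinstance(b, str))    # False
print(isinstance(s, bytes))  # False
```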
There are two ways to convert between str strings and unicode strings: the built-in functions `str()`/`unicode()`, and the `encode()`/`decode()` methods. Again the conclusion first: you should basically use the `encode()`/`decode()` methods.
## str/unicode

Python 2 has the built-in functions `str()` and `unicode()`.

The reference for `str()` says:

> str([object]) Returns a string containing a nice printable representation of the object. ...

The reference for `unicode()` says:

> unicode([object[, encoding[, errors]]]) ... If no optional parameters are given, unicode() mimics the behavior of str() except that it returns Unicode strings instead of 8-bit strings. ...

In short, `str()` and `unicode()` are functions that return a str string or a unicode string *representing* an object; they are not designed for converting between str strings and unicode strings.

Also, these functions call the special methods `__str__()` and `__unicode__()` defined on the object, so their behavior differs from object to object.
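To illustrate that these built-ins delegate to special methods, here is a small Python 3 sketch (Python 3 keeps `__str__()` but has no `__unicode__()`; the class name is made up for the example):

```python
class Greeting(object):
    # str() on an instance calls this special method.
    def __str__(self):
        return 'hello'

# str() returns a printable representation of the object,
# not a type conversion between string kinds.
print(str(Greeting()))  # hello
```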
## encode/decode

String objects have the methods `encode()` and `decode()`. They are often explained like this:

* calling `encode()` on a unicode string gives you a str string
* calling `decode()` on a str string gives you a unicode string

This is not wrong, but in fact str strings also have an `encode()` method, and unicode strings also have a `decode()` method. As a result, strange things can happen, such as `'あいう'.encode()` raising a `UnicodeDecodeError`.
To investigate the behavior of `encode()` and `decode()`, I called them on various combinations of strings and encodings. The experiment was done in an interactive session; the terminal's input/output encoding is UTF-8. For example, the cell at the intersection of `'abc'` and `.encode('ascii')` shows the output when `'abc'.encode('ascii')` is entered into the interpreter.
| Method \ string | `'abc'` | `u'abc'` | `'あいう'` | `u'あいう'` |
|---|---|---|---|---|
| `.encode('ascii')` | `'abc'` | `'abc'` | error(1) | error(3) |
| `.encode('utf-8')` | `'abc'` | `'abc'` | error(1) | `'\xe3\x81\x82\xe3\x81\x84\xe3\x81\x86'` |
| `.encode('shift-jis')` | `'abc'` | `'abc'` | error(1) | `'\x82\xa0\x82\xa2\x82\xa4'` |
| `.decode('ascii')` | `u'abc'` | `u'abc'` | error(1) | error(3) |
| `.decode('utf-8')` | `u'abc'` | `u'abc'` | `u'\u3042\u3044\u3046'` | error(3) |
| `.decode('shift-jis')` | `u'abc'` | `u'abc'` | error(2) | error(3) |

(In this table, error(1) and error(2) were `UnicodeDecodeError`s and error(3) was a `UnicodeEncodeError`.)
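The non-error cells of the table can be reproduced in Python 3, where only the correct pairing of text and encoding round-trips (a comparison sketch, not part of the original experiment):

```python
# -*- coding: utf-8 -*-
u = 'あいう'

# Encoding text yields the byte sequences from the table above.
assert u.encode('utf-8') == b'\xe3\x81\x82\xe3\x81\x84\xe3\x81\x86'
assert u.encode('shift-jis') == b'\x82\xa0\x82\xa2\x82\xa4'

# Decoding with the matching codec restores the text.
assert b'\xe3\x81\x82\xe3\x81\x84\xe3\x81\x86'.decode('utf-8') == u
assert b'\x82\xa0\x82\xa2\x82\xa4'.decode('shift-jis') == u

# ASCII cannot represent these characters at all.
try:
    u.encode('ascii')
except UnicodeEncodeError:
    print('UnicodeEncodeError, as in the error(3) cells')
```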
As this result shows, a `UnicodeDecodeError` can appear even though `encode()` was called, and a `UnicodeEncodeError` can appear even though `decode()` was called.

The reference does not spell out the specification, so this is a guess: calling `encode()` on a str string probably first decodes it with the ASCII encoding and then encodes the result with the specified encoding. Calling `decode()` on a unicode string presumably does the opposite (encode with ASCII, then decode with the specified encoding).

(Please let me know if this is mistaken.)
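The hypothesized implicit ASCII step can be imitated explicitly in Python 3, where the two stages must be spelled out (a sketch of the mechanism, not the Python 2 internals themselves):

```python
# -*- coding: utf-8 -*-
data = 'あいう'.encode('utf-8')  # the bytes behind Python 2's 'あいう'

# Python 2's 'あいう'.encode('shift-jis') would implicitly do step 1 first:
try:
    text = data.decode('ascii')       # step 1: decode with ASCII -> fails here
    again = text.encode('shift-jis')  # step 2: encode with the requested codec
except UnicodeDecodeError as e:
    print('step 1 already fails:', e)
```

This matches the error(1) cells in the table: the failure comes from the implicit ASCII decode, not from the encoding you asked for.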
For converting between str strings and unicode strings, use the `encode()`/`decode()` methods rather than the `str()`/`unicode()` built-in functions. A practical way to think of them: as long as the encoding is correct, `encode()` returns a str string and `decode()` returns a unicode string, regardless of whether the receiver was a str string or a unicode string.
Let's use Python 3.