Strings in Python 2 are confusing.
To begin with, the story around character codes is complicated, and not only in Python. This is probably because different people use different terms for the same things. This article follows the definitions in [Yukihiro Matsumoto, *The World of Code*](http://www.amazon.co.jp/dp/4822234312):
| Term | Meaning |
|---|---|
| Character | A symbol used in a system that represents language visually |
| Glyph | The concrete visual form of an individual character |
| Character set | A collection of characters to which character codes are assigned |
| Character code | The number assigned to each individual character |
| Character encoding | The way character codes are represented on a computer |
There are two kinds of strings in Python 2. Here I will call them **str strings** and **unicode strings**, and refer to both collectively as **strings**. The official documentation does not use these terms consistently, so I adopt this naming for convenience.

The conclusion first: you should basically use unicode strings.
A str string is the object produced by a literal such as `'あい'`. In an interactive session its repr shows the individual bytes, escaped with `\x`:

```
>>> 'あいう'
'\xe3\x81\x82\xe3\x81\x84\xe3\x81\x86'
```

`[]` returns bytes, not characters:

```
>>> 'あ'[0]
'\xe3'
>>> 'あ'[1]
'\x81'
>>> 'あ'[2]
'\x82'
>>> 'あ'[3]
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
IndexError: string index out of range
```

`len()` returns the number of bytes:

```
>>> len('あいう')
9
```
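For comparison, Python 3's `bytes` type plays the role that str strings play in Python 2. This is a minimal Python 3 sketch (not part of the original article) reproducing the same byte-level behavior:

```python
# -*- coding: utf-8 -*-
# In Python 3, `bytes` behaves like Python 2's str string.
b = 'あいう'.encode('utf-8')  # the same 9 bytes as the Python 2 literal 'あいう'

print(b)        # b'\xe3\x81\x82\xe3\x81\x84\xe3\x81\x86'
print(b[0])     # indexing yields a byte value (an int in Python 3): 227 == 0xe3
print(len(b))   # 9 -- the number of bytes, not characters
```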
A unicode string is the object produced by a literal such as `u'あい'`. In an interactive session its repr shows the code point of each character, escaped with `\u`:

```
>>> u'あいう'
u'\u3042\u3044\u3046'
```

`[]` returns a character:

```
>>> u'あ'[0]
u'\u3042'
>>> u'あ'[1]
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
IndexError: string index out of range
```

`len()` returns the number of characters:

```
>>> len(u'あいう')
3
```
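Likewise, Python 3's `str` behaves like Python 2's unicode string. A small Python 3 comparison sketch (an addition, not from the original transcripts):

```python
# -*- coding: utf-8 -*-
# In Python 3, `str` behaves like Python 2's unicode string.
s = 'あいう'          # a sequence of characters, like Python 2's u'あいう'

print(s[0])            # あ  -- indexing yields a character
print(len(s))          # 3  -- the number of characters
print(hex(ord(s[0])))  # 0x3042, the code point shown as \u3042 in Python 2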
As shown above, str strings have two drawbacks: `[]` returns bytes rather than characters, and `len()` returns the number of bytes rather than the number of characters. With unicode strings you have neither problem, so you should use unicode strings instead of str strings.
Use `isinstance(object, str)` and `isinstance(object, unicode)` to find out whether an object is a str string or a unicode string. Looking up the type with `type(object)` is not recommended.

```
>>> isinstance(u'あ', unicode)
True
>>> isinstance('あ', str)
True
>>> isinstance('あ', unicode)
False
>>> isinstance(u'あ', str)
False
```
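If the same type checks are ported to Python 3 (an assumption of this sketch, since the article targets Python 2), `str` takes over the role of `unicode` and `bytes` the role of the old `str`:

```python
# -*- coding: utf-8 -*-
s = 'あ'                # text (Python 3 str, Python 2's unicode)
b = s.encode('utf-8')   # bytes (Python 3 bytes, Python 2's str)

print(isinstance(s, str))    # True
print(isinstance(b, bytes))  # True
print(isinstance(b, str))    # False
print(isinstance(s, bytes))  # False
```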
There are two ways to convert between str strings and unicode strings: the built-in functions `str()`/`unicode()`, and the `encode()`/`decode()` methods. Again the conclusion first: you should basically use the `encode()`/`decode()` methods.
## str/unicode

Python 2 has the built-in functions `str()` and `unicode()`.

The reference for `str()` says:

> str([object]) Returns a string containing a nice printable representation of the object. ...

The reference for `unicode()` says:

> unicode([object[, encoding[, errors]]]) ... If no optional parameters are given, unicode() mimics the behavior of str() except that it returns Unicode strings instead of 8-bit strings. ...

In short, `str()` and `unicode()` are functions that return a str string or a unicode string *representing* an object; they are not designed for converting between str strings and unicode strings.

Also, these functions call the special methods `__str__()` and `__unicode__()` defined on the object, so their behavior differs from object to object.
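To illustrate that these built-ins delegate to special methods, here is a small Python 3 sketch (Python 3 keeps `__str__()` but has no `__unicode__()`; the class name is made up for the example):

```python
class Greeting(object):
    # str() on an instance calls this special method.
    def __str__(self):
        return 'hello'

# str() returns a printable representation of the object,
# not a type conversion between string kinds.
print(str(Greeting()))  # hello
```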
## encode/decode

String objects have the methods `encode()` and `decode()`. They are often explained like this:

* calling `encode()` on a unicode string gives you a str string
* calling `decode()` on a str string gives you a unicode string

This is not wrong, but in fact str strings also have an `encode()` method, and unicode strings also have a `decode()` method. As a result, strange things can happen, such as `'あいう'.encode()` raising a `UnicodeDecodeError`.
To investigate the behavior of `encode()` and `decode()`, I called them on various combinations of strings and encodings. The experiment was done in an interactive session; the terminal's input/output encoding is UTF-8. For example, the cell at the intersection of `'abc'` and `.encode('ascii')` shows the output when `'abc'.encode('ascii')` is entered into the interpreter.
| Method \ string | `'abc'` | `u'abc'` | `'あいう'` | `u'あいう'` |
|---|---|---|---|---|
| `.encode('ascii')` | `'abc'` | `'abc'` | error(1) | error(3) |
| `.encode('utf-8')` | `'abc'` | `'abc'` | error(1) | `'\xe3\x81\x82\xe3\x81\x84\xe3\x81\x86'` |
| `.encode('shift-jis')` | `'abc'` | `'abc'` | error(1) | `'\x82\xa0\x82\xa2\x82\xa4'` |
| `.decode('ascii')` | `u'abc'` | `u'abc'` | error(1) | error(3) |
| `.decode('utf-8')` | `u'abc'` | `u'abc'` | `u'\u3042\u3044\u3046'` | error(3) |
| `.decode('shift-jis')` | `u'abc'` | `u'abc'` | error(2) | error(3) |

(In this table, error(1) and error(2) were `UnicodeDecodeError`s and error(3) was a `UnicodeEncodeError`.)
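The non-error cells of the table can be reproduced in Python 3, where only the correct pairing of text and encoding round-trips (a comparison sketch, not part of the original experiment):

```python
# -*- coding: utf-8 -*-
u = 'あいう'

# Encoding text yields the byte sequences from the table above.
assert u.encode('utf-8') == b'\xe3\x81\x82\xe3\x81\x84\xe3\x81\x86'
assert u.encode('shift-jis') == b'\x82\xa0\x82\xa2\x82\xa4'

# Decoding with the matching codec restores the text.
assert b'\xe3\x81\x82\xe3\x81\x84\xe3\x81\x86'.decode('utf-8') == u
assert b'\x82\xa0\x82\xa2\x82\xa4'.decode('shift-jis') == u

# ASCII cannot represent these characters at all.
try:
    u.encode('ascii')
except UnicodeEncodeError:
    print('UnicodeEncodeError, as in the error(3) cells')
```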
As this result shows, a `UnicodeDecodeError` can appear even though `encode()` was called, and a `UnicodeEncodeError` can appear even though `decode()` was called.

The reference does not spell out the specification, so this is a guess: calling `encode()` on a str string probably first decodes it with the ASCII encoding and then encodes the result with the specified encoding. Calling `decode()` on a unicode string presumably does the opposite (encode with ASCII, then decode with the specified encoding).

(Please let me know if this is mistaken.)
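The hypothesized implicit ASCII step can be imitated explicitly in Python 3, where the two stages must be spelled out (a sketch of the mechanism, not the Python 2 internals themselves):

```python
# -*- coding: utf-8 -*-
data = 'あいう'.encode('utf-8')  # the bytes behind Python 2's 'あいう'

# Python 2's 'あいう'.encode('shift-jis') would implicitly do step 1 first:
try:
    text = data.decode('ascii')       # step 1: decode with ASCII -> fails here
    again = text.encode('shift-jis')  # step 2: encode with the requested codec
except UnicodeDecodeError as e:
    print('step 1 already fails:', e)
```

This matches the error(1) cells in the table: the failure comes from the implicit ASCII decode, not from the encoding you asked for.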
For converting between str strings and unicode strings, use the `encode()`/`decode()` methods rather than the `str()`/`unicode()` built-in functions. A practical way to think of them: as long as the encoding is correct, `encode()` returns a str string and `decode()` returns a unicode string, regardless of whether the receiver was a str string or a unicode string.
Let's use Python 3.