Python2 str / unicode and encode / decode

Introduction

Python2 strings are confusing.

Terminology

Character encoding is a complicated topic in general, not just in Python, probably because different people use different terms for the same things. This article follows the definitions in [Yukihiro Matsumoto Code World](http://www.amazon.co.jp/%E3%81%BE%E3%81%A4%E3%82%82%E3%81%A8%E3%82%86%E3%81%8D%E3%81%B2%E3%82%8D-%E3%82%B3%E3%83%BC%E3%83%89%E3%81%AE%E4%B8%96%E7%95%8C%E2%80%BE%E3%82%B9%E3%83%BC%E3%83%91%E3%83%BC%E3%83%BB%E3%83%97%E3%83%AD%E3%82%B0%E3%83%A9%E3%83%9E%E3%81%AB%E3%81%AA%E3%82%8B14%E3%81%AE%E6%80%9D%E8%80%83%E6%B3%95/dp/4822234312), listed below.

| Term | Meaning |
| --- | --- |
| Character | A symbol used in a system that visually represents language |
| Glyph | The shape of an individual character |
| Character set | A collection of characters that are subject to character code assignment |
| Character code | The number assigned to an individual character |
| Character encoding method | How a character code is represented on a computer |
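
To make the difference between a character code and a character encoding method concrete, here is a minimal sketch. It uses the character あ, written as the escape `\u3042` so that the snippet does not depend on the encoding of the source file; the variable names are only illustrative.

```python
# One character, one character code, several byte representations.
ch = u'\u3042'  # the hiragana あ

# Character code: the number assigned to the character in Unicode.
code_point = ord(ch)

# Character encoding method: how that number is laid out as bytes.
utf8_bytes = ch.encode('utf-8')
sjis_bytes = ch.encode('shift_jis')

print(hex(code_point))   # 0x3042
print(len(utf8_bytes))   # 3
print(len(sjis_bytes))   # 2
```

The same character code ends up as three bytes in UTF-8 and two bytes in Shift-JIS, which is why the two notions need separate names.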

Two types of character strings

There are two types of strings in Python2. Here I call them the **str string** and the **unicode string**, and refer to both collectively as **strings**. The official documentation does not use these terms consistently, so I will use these names for now.

As a rule, you should use unicode strings.

str string

>>> 'あいう'
'\xe3\x81\x82\xe3\x81\x84\xe3\x81\x86'
>>> 'あ'[0]
'\xe3'
>>> 'あ'[1]
'\x81'
>>> 'あ'[2]
'\x82'
>>> 'あ'[3]
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
IndexError: string index out of range
>>> len('あいう')
9

unicode string

>>> u'あいう'
u'\u3042\u3044\u3046'
>>> u'あ'[0]
u'\u3042'
>>> u'あ'[1]
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
IndexError: string index out of range
>>> len(u'あいう')
3

Summary

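As shown above, the str string has two drawbacks: indexing returns individual bytes rather than characters, and `len()` returns the number of bytes rather than the number of characters.

With unicode strings, on the other hand, you do not have to worry about either problem. So you should use unicode strings instead of str strings.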
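
The practical consequence of these drawbacks can be sketched as follows. The example spells the UTF-8 bytes of あいう with an explicit `b'...'` literal (in Python2 this is an ordinary str string), so it is an illustration rather than a transcript of the sessions above.

```python
# UTF-8 bytes of the three characters あいう; in Python2 this is a str string.
data = b'\xe3\x81\x82\xe3\x81\x84\xe3\x81\x86'
text = data.decode('utf-8')  # the corresponding unicode string

byte_count = len(data)   # counts bytes
char_count = len(text)   # counts characters

# Slicing the byte string can cut a character in half ...
broken = data[:4]
try:
    broken.decode('utf-8')
    cut_is_valid = True
except UnicodeDecodeError:
    cut_is_valid = False

# ... while slicing the unicode string always yields whole characters.
first_two = text[:2]

print('%d bytes, %d characters' % (byte_count, char_count))
print('4-byte slice is valid UTF-8: %s' % cut_is_valid)
```

Any code that truncates, pads, or indexes text therefore gives the intended result only on the unicode string.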

Examine the type of a string object

Use `isinstance(object, str)` and `isinstance(object, unicode)` to check whether an object is a str string or a unicode string. Checking the type with `type(object)` is not recommended.

>>> isinstance(u'あいう', unicode)
True
>>> isinstance('あいう', str)
True
>>> isinstance('あいう', unicode)
False
>>> isinstance(u'あいう', str)
False
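
One reason to prefer `isinstance()` over `type()` is that it also accepts subclasses. The sketch below illustrates this under some assumptions: `Label` is a hypothetical subclass invented for the example, and `text_type` is obtained as `type(u'')` so that the snippet runs on Python2 (where it is `unicode`) as well as Python3 (where it is `str`).

```python
# type(u'') is `unicode` on Python2 and `str` on Python3.
text_type = type(u'')


class Label(text_type):
    """Hypothetical subclass of the text type, for illustration only."""


label = Label(u'abc')

# isinstance() accepts subclasses of the given type ...
by_isinstance = isinstance(label, text_type)

# ... whereas an exact type comparison rejects them.
by_type = type(label) is text_type

print(by_isinstance)  # True
print(by_type)        # False
```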

Convert the type of a string object

There are two ways to convert between str strings and unicode strings: the built-in functions str / unicode, and the encode / decode methods. As a rule, you should use the encode / decode methods.

str/unicode

Python2 has two built-in functions, `str()` and `unicode()`.

The reference for `str()` says:

str([object]) Returns a string containing a nice printable representation of the object. ...

The reference for `unicode()` says:

unicode([object[, encoding[, errors]]]) ... If no optional parameters are given, unicode() mimics the behavior of str(), except that it returns a Unicode string instead of an 8-bit string. ...

In short, `str()` and `unicode()` are functions that return a str string or a unicode string representing an object. They are not meant for converting between str strings and unicode strings.

Also, because these functions call the `__str__()` and `__unicode__()` special methods when an object defines them, their behavior differs from object to object.
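
As a rough illustration of the point that `str()` produces a representation of an object rather than a converted string, consider the sketch below. `Point` is a hypothetical class invented for this example. On Python2, `unicode(p)` would call `__unicode__()` in the same way; Python3 has no `unicode()` built-in, so that method is simply never invoked there.

```python
class Point(object):
    """Hypothetical class used only to show what str() calls."""

    def __init__(self, x, y):
        self.x = x
        self.y = y

    def __str__(self):
        # str(point) returns whatever this method returns.
        return 'Point(%d, %d)' % (self.x, self.y)

    def __unicode__(self):
        # Called by unicode(point) on Python2; unused on Python3.
        return u'Point(%d, %d)' % (self.x, self.y)


p = Point(1, 2)
as_str = str(p)          # goes through Point.__str__
number_as_str = str(42)  # works for non-string objects too

print(as_str)         # Point(1, 2)
print(number_as_str)  # 42
```

Since the result is decided by the object's own special method, these functions are a poor fit for a task that is really about choosing an encoding.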

encode/decode

String objects have the methods `encode()` and `decode()`. They are often explained as follows: `encode()` converts a unicode string into a str string, and `decode()` converts a str string into a unicode string.

This is not wrong, but a str string also has an `encode()` method, and a unicode string also has a `decode()` method. As a result, calling something like `'あい'.encode()` raises a puzzling `UnicodeDecodeError`.

To investigate the behavior of `encode()` and `decode()`, I called them with various combinations of strings and encoding methods. The experiment was run in the interactive interpreter, with the terminal's input/output encoding set to UTF-8.

For example, the cell where the `'abc'` column meets the `.encode('ascii')` row shows the output when `'abc'.encode('ascii')` is entered in the interpreter.

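| Method \ string | `'abc'` | `u'abc'` | `'あいう'` | `u'あいう'` |
| --- | --- | --- | --- | --- |
| `.encode('ascii')` | `'abc'` | `'abc'` | error(1) | error(3) |
| `.encode('utf-8')` | `'abc'` | `'abc'` | error(1) | `'\xe3\x81\x82\xe3\x81\x84\xe3\x81\x86'` |
| `.encode('shift-jis')` | `'abc'` | `'abc'` | error(1) | `'\x82\xa0\x82\xa2\x82\xa4'` |
| `.decode('ascii')` | `u'abc'` | `u'abc'` | error(1) | error(3) |
| `.decode('utf-8')` | `u'abc'` | `u'abc'` | `u'\u3042\u3044\u3046'` | error(3) |
| `.decode('shift-jis')` | `u'abc'` | `u'abc'` | error(2) | error(3) |

Here error(1) is a `UnicodeDecodeError` raised by the ASCII codec, error(2) is a `UnicodeDecodeError` raised by the Shift-JIS codec, and error(3) is a `UnicodeEncodeError` raised by the ASCII codec.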
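
The cells in which the unicode string is encoded and the non-ASCII str string is decoded can be reproduced with the sketch below. It spells the strings with escapes and explicit `b'...'` / `u'...'` prefixes, which is an assumption of this example (the experiment itself typed the characters directly into a UTF-8 terminal), and `try_call` is a hypothetical helper.

```python
def try_call(func, *args):
    """Hypothetical helper: return the result, or the exception class name."""
    try:
        return func(*args)
    except UnicodeError as exc:
        return type(exc).__name__


text = u'\u3042\u3044\u3046'                     # あいう as a unicode string
data = b'\xe3\x81\x82\xe3\x81\x84\xe3\x81\x86'   # あいう encoded as UTF-8

# unicode string -> bytes
encoded = dict(
    (name, try_call(text.encode, name))
    for name in ('ascii', 'utf-8', 'shift_jis')
)

# bytes -> unicode string
decoded = dict(
    (name, try_call(data.decode, name))
    for name in ('ascii', 'utf-8', 'shift_jis')
)

for name in ('ascii', 'utf-8', 'shift_jis'):
    print('%-9s encode: %r  decode: %r' % (name, encoded[name], decoded[name]))
```

In this direction the errors are the unsurprising ones: encoding fails with `UnicodeEncodeError` when the target encoding cannot represent the characters, and decoding fails with `UnicodeDecodeError` when the bytes are not valid in the chosen encoding.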

As the results show, a `UnicodeDecodeError` can be raised even though `encode()` was called, and a `UnicodeEncodeError` can be raised even though `decode()` was called.

I could not find this behavior specified in the reference, so the following is a guess: calling `encode()` on a str string probably decodes it with the ASCII encoding first and then encodes the result with the specified encoding. Calling `decode()` on a unicode string is thought to do the opposite (encode with the ASCII encoding, then decode with the specified encoding). (Please let me know if I am mistaken.)
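
The guess above can be written down as a small emulation. This is only a sketch of the presumed behavior, not CPython's actual implementation: the function names and the `default` parameter are hypothetical, and `'ascii'` stands in for the interpreter's default encoding. The sketch uses explicit byte and text values so that it also runs on Python3, where these implicit conversions no longer exist.

```python
def str_encode_emulated(data, encoding, default='ascii'):
    """Presumed behavior of Python2 `str.encode(encoding)`.

    Step 1 decodes the bytes with the default encoding (the hidden step),
    step 2 encodes the result with the requested encoding.
    """
    return data.decode(default).encode(encoding)


def unicode_decode_emulated(text, encoding, default='ascii'):
    """Presumed behavior of Python2 `unicode.decode(encoding)`.

    Step 1 encodes the text with the default encoding (the hidden step),
    step 2 decodes the result with the requested encoding.
    """
    return text.encode(default).decode(encoding)


def error_name(func, *args):
    """Return the name of the Unicode error raised by func, or None."""
    try:
        func(*args)
    except UnicodeError as exc:
        return type(exc).__name__
    return None


utf8_bytes = b'\xe3\x81\x82\xe3\x81\x84\xe3\x81\x86'  # あいう in UTF-8
text = u'\u3042\u3044\u3046'                          # あいう

# ASCII-only input survives the hidden step, so the call succeeds.
print(str_encode_emulated(b'abc', 'utf-8') == b'abc')

# Non-ASCII input fails in the hidden step, which would explain why
# "encode" reports a decode error and "decode" reports an encode error.
print(error_name(str_encode_emulated, utf8_bytes, 'utf-8'))  # UnicodeDecodeError
print(error_name(unicode_decode_emulated, text, 'utf-8'))    # UnicodeEncodeError
```

Under this model the requested encoding never gets a chance to run for non-ASCII input, which matches the observation that the error does not depend on the encoding passed in.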

Summary

For type conversion between str strings and unicode strings, use the encode / decode methods rather than the str / unicode built-in functions.

As for the encode / decode methods, the way to think about them is this: whether called on a str string or a unicode string, `encode()` returns a str string and `decode()` returns a unicode string, provided the encoding method is appropriate.
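
In practice, this advice is often wrapped in a pair of small helpers so that conversion happens explicitly and only when it is needed. The following is a minimal sketch under assumptions: `to_text` and `to_bytes` are hypothetical names, UTF-8 is an assumed default, and the types are looked up as `type(u'')` and `type(b'')` so that the same code runs on Python2 (`unicode` / `str`) and Python3 (`str` / `bytes`).

```python
text_type = type(u'')    # unicode on Python2, str on Python3
binary_type = type(b'')  # str on Python2, bytes on Python3


def to_text(value, encoding='utf-8'):
    """Return value as a unicode string, decoding byte strings if needed."""
    if isinstance(value, text_type):
        return value
    if isinstance(value, binary_type):
        return value.decode(encoding)
    raise TypeError('expected a string, got %r' % type(value))


def to_bytes(value, encoding='utf-8'):
    """Return value as a byte string, encoding unicode strings if needed."""
    if isinstance(value, binary_type):
        return value
    if isinstance(value, text_type):
        return value.encode(encoding)
    raise TypeError('expected a string, got %r' % type(value))


raw = b'\xe3\x81\x82\xe3\x81\x84\xe3\x81\x86'  # あいう in UTF-8
text = to_text(raw)

print(len(text))               # 3
print(to_bytes(text) == raw)   # True
print(to_text(text) is text)   # True: already unicode, returned unchanged
```

Because each helper checks the type first, `decode()` is never called on a unicode string and `encode()` is never called on a str string, so the hidden ASCII step never comes into play.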

Conclusion

Let's use Python3.
