Things to keep in mind when processing strings in Python2

UnicodeEncodeError is (with some exaggeration) the greatest natural enemy of Python 2 programmers who handle Japanese. Yesterday the person sitting next to me fell prey to it, and while helping to fix it I managed to sort out, at least a little, how string processing in Python 2 should be approached. (I'd like to put together a Python 3 version soon.)

Personal conclusion

- Always be aware of whether you are dealing with a byte string or a Unicode string.
- (As a rule) handle Unicode strings inside the program, and convert them to byte strings only when exchanging data with standard I/O (e.g. print).
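The rule above could look roughly like the following. This is only a minimal sketch: the function name shout and the choice of UTF-8 are assumptions made for illustration, and the code is written so that it runs on both Python 2.7 and Python 3.

```python
# Hypothetical helper: decode at the input boundary, work on Unicode,
# encode again at the output boundary.

def shout(raw_bytes, encoding='utf-8'):
    text = raw_bytes.decode(encoding)   # byte string -> Unicode string
    text = text + u'!'                  # all processing happens on Unicode
    return text.encode(encoding)        # Unicode string -> byte string

incoming = b'\xe3\x81\x82'              # UTF-8 bytes for U+3042
outgoing = shout(incoming)
print(repr(outgoing))
```

The point of the design is that the encoding is named exactly twice, once on the way in and once on the way out, so nothing in between ever has to guess it.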

Byte string and unicode string

A byte string is text encoded with a specific encoding (e.g. UTF-8), and is written as 'あ' in a literal. A Unicode string, on the other hand, is a sequence of Unicode code points, and its literal is prefixed with u, as in u'あ'.

python


(py2.7)~ » ipython
   (snip)
>>> 'あ'  # byte string
Out[1]: '\xe3\x81\x82'

>>> u'あ'  # Unicode string
Out[2]: u'\u3042'

>>> 'あ'.decode('utf-8')  # byte string -> Unicode string (= decode); unicode('あ', 'utf-8') also works
Out[3]: u'\u3042'

>>> u'あ'.encode('utf-8')  # Unicode string -> byte string (= encode)
Out[4]: '\xe3\x81\x82'

If you check with the type function, you can see that a byte string is of type str and a Unicode string is of type unicode.

python


>>> type('a')
Out[5]: str

>>> type(u'a')
Out[6]: unicode

Furthermore, in Python2, both byte strings and Unicode strings are strings and can be concatenated.

python


>>> u'a' + 'a'
Out[7]: u'aa'

Oh, so there's no problem then.

True, as long as you never have to handle Japanese (more precisely, any non-ASCII characters)! As the output of the example above shows, concatenating a Unicode string and a byte string produces a Unicode string. To do that, Python has to decode the byte string into a Unicode string, and the problem is that a Python byte string carries no information about its own encoding.

"If I don't know the encoding, I'll just decode it as ASCII," says Python, and hello, UnicodeDecodeError. You rarely make this mistake with literals, but it is easy to make if you are careless with strings that come from outside your own program (standard input/output included).

python


>>> u'a' + 'あ'  # concatenate a Unicode string and a (non-ASCII) byte string
---------------------------------------------------------------------------
UnicodeDecodeError                        Traceback (most recent call last)
<ipython-input-8-084e015bd795> in <module>()
----> 1 u'a' + 'あ'

UnicodeDecodeError: 'ascii' codec can't decode byte 0xe3 in position 0: ordinal not in range(128)

>>> u'a' + 'あ'.decode('utf-8')  # byte string -> Unicode string
Out[9]: u'a\u3042'

>>> print(u'a' + 'あ'.decode('utf-8'))
aあ
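For strings that arrive from outside the program, one way to avoid the implicit ASCII decode is to normalize them explicitly before concatenating. The helper below is a hedged sketch: to_unicode is a hypothetical name, and UTF-8 as the default is an assumption about the input. It runs on Python 2.7 and Python 3.

```python
def to_unicode(value, encoding='utf-8'):
    """Return value as a Unicode string, decoding only byte strings."""
    if isinstance(value, bytes):        # bytes is an alias of str on Python 2
        return value.decode(encoding)
    return value                        # already a Unicode string

joined = u'a' + to_unicode(b'\xe3\x81\x82')
print(len(joined))  # 2
```

Values that are already Unicode pass through unchanged, so the helper can be applied to every incoming string without checking its type first.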

The reason for converting to Unicode strings rather than staying with byte strings is that it is usually more convenient to work with strings at the code point level than at the byte level. For example, if you want to count characters, you can call len on a Unicode string. Calling len on a byte string returns the number of bytes, so it cannot be used for that purpose.

python


>>> len(u'あいう')
Out[11]: 3

>>> len('あいう')
Out[12]: 9
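Slicing shows the same difference. The sketch below (variable names are illustrative only) uses escape sequences so that the file needs no coding declaration, and runs on Python 2.7 and Python 3.

```python
text = u'\u3042\u3044\u3046'            # three hiragana characters
data = text.encode('utf-8')             # nine bytes, three per character

first_char = text[:1]                   # one whole character
first_byte = data[:1]                   # only a third of a character

try:
    first_byte.decode('utf-8')
    truncated = False
except UnicodeDecodeError:
    truncated = True                    # a lone lead byte is not valid UTF-8

print(len(text), len(data), truncated)
```

Cutting a byte string at an arbitrary index can leave a fragment that no longer decodes, which is one more reason to do the cutting on the Unicode side.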

Unicode strings are the best! Who needs byte strings!

Not so fast. As an example, consider the following simple program.

test.py


#!/usr/bin/env python
# -*- coding: utf-8 -*-

print(u'あ' + u'いう')

Try running it in a terminal. For most people it will probably run without problems.

python


(py2.7)~ » python test.py
あいう

Now, what happens if you redirect the output to a file? In many environments a UnicodeEncodeError is raised, as shown below.

python


(py2.7)~ » python test.py > test.txt
Traceback (most recent call last):
  File "test.py", line 4, in <module>
    print(u'あ' + u'いう')
UnicodeEncodeError: 'ascii' codec can't encode characters in position 0-2: ordinal not in range(128)

Locale and encoding

In the example above, print is handed a Unicode string, and at that point a Unicode string -> byte string conversion (encoding) takes place. If standard output is connected to a terminal, Python automatically picks an appropriate encoding from the locale (e.g. the environment variable LANG). When it is connected to something other than a terminal, for example through a redirect, Python has no information for choosing an encoding, falls back to ASCII, and in most cases (that is, whenever non-ASCII characters are present) fails.

(ref.) http://blog.livedoor.jp/dankogai/archives/51816624.html
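On Python 2 the difference is visible in sys.stdout.encoding, which is None when standard output is redirected to a file or pipe. The sketch below shows one way to fall back to a fixed encoding in that case; output_encoding and FakePipe are hypothetical names, and UTF-8 as the fallback is an assumption.

```python
import sys

def output_encoding(stream, default='utf-8'):
    """Return the stream's encoding, or default when none is set."""
    # getattr also covers stream-like objects that lack the attribute.
    return getattr(stream, 'encoding', None) or default

class FakePipe(object):
    """Hypothetical stand-in for a redirected Python 2 stdout."""
    encoding = None

print(output_encoding(FakePipe()))      # no encoding set, so the default is used
print(output_encoding(sys.stdout))      # depends on the environment
```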

Encoding the Unicode string before passing it to standard output can solve this problem.

test.py (Unicode string -> byte string)


#!/usr/bin/env python
# -*- coding: utf-8 -*-

print((u'あ' + u'いう').encode('utf-8'))
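If calling encode at every print becomes tedious, another option is to wrap the output stream once with an encoding writer from the standard codecs module. The sketch below writes to an in-memory io.BytesIO (named sink here, purely for illustration) instead of the real standard output so that the result can be inspected. It runs on Python 2.7 and Python 3.

```python
import codecs
import io

sink = io.BytesIO()                          # stands in for a byte-oriented stdout
writer = codecs.getwriter('utf-8')(sink)     # encodes every Unicode string written
writer.write(u'\u3042' + u'\u3044\u3046')
writer.write(u'\n')

encoded = sink.getvalue()
print(len(encoded))  # 10: nine bytes of text plus the newline
```

On Python 2 the same wrapper is commonly applied to the real stream as sys.stdout = codecs.getwriter('utf-8')(sys.stdout); whether that is appropriate depends on what else writes to standard output in the program.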

I've always thought

By setting the environment variable PYTHONIOENCODING, you can fix the encoding used for standard I/O regardless of the locale. With it set, you don't have to call encode every time.

python


(py2.7)~ » cat test.py
#!/usr/bin/env python
# -*- coding: utf-8 -*-

print(u'あ' + u'いう')
(py2.7)~ » PYTHONIOENCODING=utf-8 python test.py > test.txt
(py2.7)~ » cat test.txt
あいう

(ref.) http://methane.hatenablog.jp/entry/20120806/1344269400
