Things to keep in mind when processing strings in Python 3

Continuing from yesterday's article, this time I will summarize my own policy for dealing with strings in Python 3.

Personal conclusion

- In most cases, work with character strings, and exchange data with standard input / output as character strings.
- However, it is sometimes necessary to handle byte strings, for example when a byte string is passed in from an external program. Put the other way around, do not handle bytes except in such cases.

This is not a matter of good or bad; there simply aren't many cases where you have to deal with bytes in your own code.
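As a rough illustration of this policy, the sketch below decodes bytes once, at the point where they arrive from an external program, and works with str from then on. The child command and the choice of UTF-8 are assumptions made up for the example.

```python
import subprocess
import sys

# Hypothetical external program: a child Python process that emits raw UTF-8 bytes.
child_code = "import sys; sys.stdout.buffer.write(b'\\xe3\\x81\\x82\\n')"
raw = subprocess.check_output([sys.executable, "-c", child_code])

# The external program hands us bytes, so this is the one place we touch them.
print(type(raw).__name__)

# Decode once at the boundary (the encoding is assumed to be UTF-8 here) ...
text = raw.decode("utf-8").strip()

# ... and from here on the code only deals with str.
print(type(text).__name__, len(text))
```

Keeping the decode step at the edge means the rest of the program never has to ask which of the two types it is holding.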

Bytes and strings

A byte string is a sequence of bytes produced with a specific encoding, and is written as b'a' in literal form. A character string, on the other hand, is a sequence of Unicode code points, and is written as 'a' in literal form.

That was brief, but the difference from how Python 2 handles them is already visible at this point.

- A "Python 3 byte string" is treated much like a "Python 2 byte string". However, while a "Python 2 byte string" is a kind of "character string", a "Python 3 byte string" is not a "character string" but a completely different type.
- A "Python 3 string" and a "Python 2 Unicode string" can be considered equivalent. The literal notation differs: a "Python 2 Unicode string" had to be prefixed with `u`, whereas a "Python 3 string" needs no prefix.

python


(py3.4)~ » ipython
   (output omitted)
>>> b'a' #Byte sequence
Out[1]: b'a'

# A bytes literal cannot contain non-ASCII characters;
# encode the string with a specific encoding instead
>>> b'あ'
  File "<ipython-input-2-c12eb8e58bcd>", line 1
    b'あ'
        ^
SyntaxError: bytes can only contain ASCII literal characters.

>>> 'あ'.encode('utf-8')  # string -> bytes (encode)
Out[3]: b'\xe3\x81\x82'


>>> 'あ'  # string
Out[4]: 'あ'

>>> b'\xe3\x81\x82'.decode('utf-8')  # bytes -> string (decode)
Out[5]: 'あ'


# Python 2 (repeated from the previous article)
(py2.7)~ » ipython
   (output omitted)
>>> 'あ'  # byte string
Out[1]: '\xe3\x81\x82'

>>> u'あ'  # Unicode string
Out[2]: u'\u3042'

>>> 'あ'.decode('utf-8')  # byte string -> Unicode string (decode); unicode('あ', 'utf-8') is equivalent
Out[3]: u'\u3042'

>>> u'あ'.encode('utf-8')  # Unicode string -> byte string (encode)
Out[4]: '\xe3\x81\x82'

If you check with the type function, you can see that a byte string is of type bytes and a string is of type str.

python


>>> type(b'a')
Out[6]: bytes #≒ Python2 str type

>>> type('a')
Out[7]: str #≒ Python2 unicode type
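The relationship between the two types can also be checked in a small script. This is only a sketch; the sample character (U+3042) and UTF-8 are chosen to match the transcript above.

```python
text = 'あ'                      # str: a sequence of Unicode code points
data = text.encode('utf-8')      # bytes: the UTF-8 encoding of that text

print(type(text).__name__, len(text))   # str 1   (one code point)
print(type(data).__name__, len(data))   # bytes 3 (three bytes in UTF-8)

# Decoding with the same encoding restores the original string.
restored = data.decode('utf-8')
print(restored == text)          # True

# Indexing differs too: a str yields a 1-character str, bytes yield an int.
print(text[0], data[0])          # あ 227
```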

Also, as mentioned above, Python 3 byte strings are not "strings". Therefore, they cannot be concatenated with a character string, and the supported methods differ; even a method with the same name, such as find, expects an argument of the matching type. This point is relatively important. In Python 2, both were kinds of character string, so processing would carry on somehow and only fail at the very end with UnicodeEncodeError. In Python 3, it fails with a type error instead, so the error message and the place where it occurs are relatively easy to understand.

python


>>> s = 'str' #String

>>> b = b'byte' #Byte sequence

>>> s + b #String+Byte string is an error
---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
<ipython-input-20-5fe2240a1b50> in <module>()
----> 1 s + b

TypeError: Can't convert 'bytes' object to str implicitly

>>> s.find('t') #The string supports the find method
Out[11]: 1

>>> b.find('y') #bytes.find expects a bytes argument, so passing a str is an error
---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
<ipython-input-24-e1b070a5aaba> in <module>()
----> 1 b.find('y')

TypeError: Type str doesn't support the buffer API
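For comparison, here is a minimal sketch of how the two failing operations above can be made to work by converting explicitly. The ASCII codec is an assumption that happens to fit these sample values.

```python
s = 'str'      # string
b = b'byte'    # byte string

# bytes does have find(); it just needs a bytes argument.
print(b.find(b'y'))              # 1

# To combine the two, convert explicitly in one direction or the other.
print(s + b.decode('ascii'))     # strbyte
print(s.encode('ascii') + b)     # b'strbyte'

# Mixing the types directly still fails with TypeError.
try:
    s + b
except TypeError as exc:
    print('TypeError:', exc)
```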

Also, since Python 3.2, it appears that an appropriate encoding is selected from the locale even when standard output is connected to something other than a terminal. As a result, the following case, which ended in UnicodeEncodeError in Python 2, also works normally.

(ref.) http://methane.hatenablog.jp/entry/20120806/1344269400 Addendum

python


(py3.4)~ » cat test.py
#!/usr/bin/env python
# -*- coding: utf-8 -*-

print('あ' + 'いう')


# Run in a terminal (standard I/O is connected to the terminal)
(py3.4)~ » python test.py
あいう

# Redirect to a file (standard output is connected to something other than the terminal)
(py3.4)~ » python test.py > test.txt
(py3.4)~ » cat test.txt
あいう
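To see which encoding was actually picked in a given environment, something like the following can be run. It is a small sketch, and the values it prints depend entirely on the machine and on how the script is launched.

```python
import locale
import sys

# The encoding Python chose for standard output
# (None if stdout has been replaced by an object without an encoding).
stdout_encoding = getattr(sys.stdout, 'encoding', None)
print('stdout:', stdout_encoding)

# The locale's preferred encoding, which the standard streams normally follow.
locale_encoding = locale.getpreferredencoding(False)
print('locale:', locale_encoding)
```

Running it once in a terminal and once with output redirected to a file shows whether the two cases agree.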

UnicodeEncodeError is no longer scary

Even if you avoid tempting fate with a heading like that, you can still run into UnicodeEncodeError. For example, when a script is run from cron, the encoding cannot be determined from the locale, so encoding / decoding is attempted with ASCII, and you usually end up with UnicodeEncodeError.

(ref.) http://www.python.jp/pipermail/python-ml-jp/2014-November/011311.html (an extremely timely post)
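The failure can be reproduced without cron by building a text stream whose encoding is ASCII. This is only a sketch: the in-memory streams stand in for a standard output that fell back to ASCII, and the variable names are made up for the example.

```python
import io

# Stand-in for a stdout whose encoding fell back to ASCII.
ascii_stdout = io.TextIOWrapper(io.BytesIO(), encoding='ascii', newline='\n')

try:
    ascii_stdout.write('あ\n')
    ascii_stdout.flush()
    failed = False
except UnicodeEncodeError as exc:
    failed = True
    print('UnicodeEncodeError:', exc.reason)

# The same text goes through once the stream encoding can represent it.
utf8_buffer = io.BytesIO()
utf8_stdout = io.TextIOWrapper(utf8_buffer, encoding='utf-8', newline='\n')
utf8_stdout.write('あ\n')
utf8_stdout.flush()
print(utf8_buffer.getvalue())    # b'\xe3\x81\x82\n'
```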

Considering this, it may be better to always specify the encoding explicitly with the environment variable PYTHONIOENCODING rather than relying on the locale.

(ref.) http://methane.hatenablog.jp/entry/20120806/1344269400
(ref.) http://www.python.jp/pipermail/python-ml-jp/2014-November/011314.html
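The effect of PYTHONIOENCODING can be tried out by launching a child interpreter with a deliberately bare environment. This is a sketch under assumptions: the C locale settings are meant to imitate a cron-like environment, and the child script is hypothetical.

```python
import os
import subprocess
import sys

# The child script uses an escape so the command line itself stays ASCII.
child_code = "print('\\u3042')"

# Imitate a bare environment (C locale), but pin the stream encoding
# to UTF-8 with PYTHONIOENCODING.
env = dict(os.environ, LC_ALL='C', LANG='C', PYTHONIOENCODING='utf-8')

out = subprocess.check_output([sys.executable, '-c', child_code], env=env)
print(out)   # the raw bytes the child wrote to its standard output
```

Because the variable is read when the interpreter starts, it has to be set in the environment of the process being launched, for example in the crontab entry itself.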

So what to do when dealing with bytes

You can use sys.stdin.buffer (standard input) / sys.stdout.buffer (standard output) to work with bytes instead of strings.

python


(py3.4)~ » cat test.py
#!/usr/bin/env python
# -*- coding: utf-8 -*-

import sys

# print('あ' + 'いう')  # print writes to sys.stdout
sys.stdout.write('あ' + 'いう' + '\n')  # write a string to sys.stdout
sys.stdout.buffer.write(('あ' + 'いう' + '\n').encode('utf-8'))  # write bytes to sys.stdout.buffer

# Run in a terminal
(py3.4)~ » python test.py
あいう
あいう

# Redirect to a file
(py3.4)~ » python test.py > test.txt
(py3.4)~ » cat test.txt
あいう
あいう

Again, in Python 3, bytes and strings are completely different types. Therefore, a byte string cannot be written to sys.stdout, which accepts character strings, and a character string cannot be written to sys.stdout.buffer, which accepts byte strings.

python


>>> import sys

#Text stream(ref. https://docs.python.org/3/library/io.html#io.TextIOWrapper)
>>> type(sys.stdout) 
Out[2]: _io.TextIOWrapper

#Byte stream(ref. https://docs.python.org/3/library/io.html#io.BufferedWriter)
>>> type(sys.stdout.buffer)
Out[3]: _io.BufferedWriter 

#Cannot write bytes to text stream
>>> sys.stdout.write('a'.encode('utf-8')) 
---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
<ipython-input-4-581ae8b6af82> in <module>()
----> 1 sys.stdout.write('a'.encode('utf-8'))

TypeError: must be str, not bytes

#Strings cannot be written to byte stream
>>> sys.stdout.buffer.write('a') 
---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
<ipython-input-5-42da1d141b96> in <module>()
----> 1 sys.stdout.buffer.write('a')

TypeError: 'str' does not support the buffer interface
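The layering of a text stream on top of a byte stream can be imitated with in-memory objects. The sketch below uses io.BytesIO in place of the real standard output, so the variable names and the UTF-8 encoding are assumptions for illustration.

```python
import io

raw = io.BytesIO()                                             # byte stream
text = io.TextIOWrapper(raw, encoding='utf-8', newline='\n')   # text stream on top

text.write('あ\n')                   # str goes to the text layer
text.flush()                         # push the encoded bytes down to the byte layer
text.buffer.write(b'raw bytes\n')    # bytes go to the underlying layer

print(raw.getvalue())                # b'\xe3\x81\x82\nraw bytes\n'
```

Flushing the text layer before writing to the underlying buffer keeps the output in the order it was written, which is worth remembering when mixing sys.stdout and sys.stdout.buffer in one script.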
