(Windows) Causes and workarounds for UnicodeEncodeError on Python 3

Background

With Python 3, str became Unicode. So what happens with CP932, the Windows code page for Japanese that is commonly referred to as Shift-JIS?

I ran into what looked like an ASCII conversion error when printing to standard output on Windows, so I sorted out what was actually going on.

Environment

Windows, Python 3 (Anaconda3)

Windows and Python and encoding

Python string encoding

Python 3 has two string types.

-- str type (Unicode only)
-- bytes type (arbitrary encoding)

A str holds Unicode text only; it cannot hold a string in some other encoding. A bytes object, on the other hand, can hold a string in any encoding, UTF-8 included. Convert from str to bytes with encode(), and back with decode(). If you are unsure which type has which method, check with dir(str). Unlike Python 2, each type has only one of the two methods.

Python 2 had a str type and a unicode type.

Python3 internal                Windows standard output (input)
================                ===============================

  Unicode  ---------------------->  CP932
 (str type)   str.encode('CP932')   (bytes type)
           <----------------------
             bytes.decode('CP932')
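The round trip in the diagram can be sketched as follows. This is a minimal illustration; the sample text and variable names are my own, not from the original article.

```python
# Round trip between str (Unicode text) and bytes (encoded data).
text = "日本語"                      # str: a sequence of Unicode code points

as_cp932 = text.encode("cp932")      # str -> bytes, encoded as CP932
as_utf8 = text.encode("utf-8")       # str -> bytes, encoded as UTF-8

# Only type names and lengths are printed, so this is safe on any console.
print(type(text).__name__, type(as_cp932).__name__)
print(len(as_cp932), len(as_utf8))   # same text, different byte lengths

# bytes -> str: decode with the same encoding that produced the bytes
assert as_cp932.decode("cp932") == text
assert as_utf8.decode("utf-8") == text
```

The same str yields different byte sequences depending on the encoding, which is why bytes are only meaningful together with the encoding that produced them.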

Windows encoding

On Japanese Windows, standard output uses an encoding called CP932. Therefore, when a str is printed to standard output or written to a file opened in text mode, it is converted to CP932 automatically by default.
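To see which encoding applies on a given machine, something like the following sketch may help. The helper name describe_encodings is hypothetical. The values it reports depend on the OS, locale, Python version, and whether output goes to a console or is redirected, so they will not necessarily be cp932.

```python
import locale
import sys

def describe_encodings():
    """Return the encodings Python will use for text I/O on this machine."""
    return {
        # Encoding and error handler used by print()
        "stdout": getattr(sys.stdout, "encoding", None),
        "stdout_errors": getattr(sys.stdout, "errors", None),
        # Locale encoding, used by open() in text mode when none is given
        "locale_preferred": locale.getpreferredencoding(False),
    }

for name, value in describe_encodings().items():
    print(f"{name}: {value}")
```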

Why does print fail?

You never ask for a conversion explicitly, but when Python writes to standard output it automatically encodes the string to the system encoding first.

On Windows it tries to convert to CP932, so a character that has no CP932 representation raises a UnicodeEncodeError exception.

>>> s = '\xa0'
>>> print(s)

>>> s.encode('utf-8')
b'\xc2\xa0'
>>> s.encode('cp932')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
UnicodeEncodeError: 'cp932' codec can't encode character '\xa0' in position 0: illegal multibyte sequence
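When the string is longer, it helps to know which characters are the problem. The helper below is a hypothetical sketch (find_unencodable is not a standard function) that tries to encode each character on its own and collects the ones that fail.

```python
def find_unencodable(text, encoding="cp932"):
    """Return (index, character) pairs that cannot be encoded."""
    bad = []
    for index, ch in enumerate(text):
        try:
            ch.encode(encoding)
        except UnicodeEncodeError:
            bad.append((index, ch))
    return bad

sample = "abc\xa0def\U0001F600"   # NO-BREAK SPACE and an emoji
for index, ch in find_unencodable(sample):
    # ascii() keeps the report printable even on a CP932 console
    print(index, ascii(ch))
```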

Remove the offending character

UnicodeEncodeError occurs because the string contains a character that cannot be converted to CP932, so removing the offending character may solve the problem.

Here \xa0 (NO-BREAK SPACE) is the culprit, so removing it with the replace function makes the exception go away.

>>> s
'\xa0'
>>> s.encode('cp932')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
UnicodeEncodeError: 'cp932' codec can't encode character '\xa0' in position 0: illegal multibyte sequence
>>> s2 = s.replace('\xa0', '')
>>> s2.encode('cp932')
b''
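Deleting the character also deletes the gap it represented. One alternative, shown here as a sketch rather than the article's method, is to map \xa0 to an ordinary space with str.translate. The table name and helper are hypothetical, and more entries can be added to the table as other unconvertible characters turn up.

```python
# Hypothetical substitution table: NO-BREAK SPACE -> ordinary space
NBSP_TO_SPACE = str.maketrans({"\xa0": " "})

def make_cp932_safe(text):
    """Replace characters listed in the table with CP932-friendly ones."""
    return text.translate(NBSP_TO_SPACE)

s = "10\xa0km"
safe = make_cp932_safe(s)
print(safe.encode("cp932"))   # b'10 km'
```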

Ignore the offending character

Removing every unconvertible character by hand is tedious, and it is easy to miss one. I suspected the encode function might have an option to skip characters it cannot convert, and a quick search confirmed that it has an ignore option.

[Reference] Conversion to byte string https://docs.python.jp/3/howto/unicode.html (In addition to ignore, there are replace, namereplace, etc.)

Here is an example of suppressing the exception with the ignore option.

>>> s.encode('cp932')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
UnicodeEncodeError: 'cp932' codec can't encode character '\xa0' in position 0: illegal multibyte sequence
>>> s.encode('cp932', "ignore")
b''
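For comparison, the loop below shows what the standard error handlers of str.encode produce for the same input. The sample string is my own; the handler names are the ones built into Python.

```python
s = "a\xa0b"

handlers = ("ignore", "replace", "xmlcharrefreplace",
            "backslashreplace", "namereplace")
results = {h: s.encode("cp932", h) for h in handlers}

for handler, encoded in results.items():
    print(f"{handler:18} {encoded!r}")

# ignore             b'ab'
# replace            b'a?b'
# xmlcharrefreplace  b'a&#160;b'
# backslashreplace   b'a\\xa0b'
# namereplace        b'a\\N{NO-BREAK SPACE}b'
```

ignore silently loses data, so the handlers that leave a visible marker may be preferable when the output needs to be debugged later.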

Summary

Strings containing characters such as \xa0 are held internally as Unicode in Python 3, so Python itself processes them without problems. However, when they must be converted to CP932 in a Windows environment, for example when printing to standard output or writing to a file, a Unicode --> CP932 conversion runs, and that is where UnicodeEncodeError occurs.

If you encode the string once with the ignore option to get bytes and then decode it back to str, the resulting string no longer contains unconvertible characters, so it will not raise UnicodeEncodeError when output.

When writing to a file, bytes can only be written in binary mode, so open the file with 'wb' or 'ab' instead of 'w' or 'a'. If you open the file with codecs, you can specify the encoding and the ignore option at open time and write str directly.

Example of standard output:

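s = '\xa0'
# Drop characters CP932 cannot represent, then turn the bytes back into str
b = s.encode('cp932', 'ignore')
s_after = b.decode('cp932')
print(s_after)  # no longer contains unconvertible characters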
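Another option, not covered in the original article, is to set the error handler on the output stream itself so that each print call does not need an encode/decode step. The sketch below uses an in-memory io.TextIOWrapper as a stand-in for a CP932 console so that it runs on any OS. On Python 3.7 and later the same policy can be applied to the real stream with sys.stdout.reconfigure, shown as a comment.

```python
import io

# In-memory stand-in for a CP932 console (an assumption for illustration)
raw = io.BytesIO()
console = io.TextIOWrapper(raw, encoding="cp932", errors="replace",
                           newline="\n")

print("a\xa0b", file=console)   # no exception: NBSP is written as '?'
console.flush()
print(raw.getvalue())           # b'a?b\n'

# On a real console (Python 3.7+):
# import sys
# sys.stdout.reconfigure(errors="replace")
```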

Example of file output:

s = '\xa0'
b = s.encode('cp932', 'ignore')
# bytes can only be written in binary mode ('ab' appends)
with open('test', 'ab') as f:
    f.write(b)

Example of outputting a file using codecs:

import codecs
s = '\xa0'
# codecs.open takes the encoding and error handler, so str can be written directly
with codecs.open('test', 'ab', 'cp932', 'ignore') as f:
    f.write(s)
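The built-in open also accepts encoding and errors arguments in text mode, so the same effect should be achievable without codecs. This is a sketch under that assumption; the temporary path is hypothetical and only there to keep the example self-contained.

```python
import os
import tempfile

# Hypothetical scratch file so the example does not touch real data
path = os.path.join(tempfile.mkdtemp(), "test.txt")

# Text mode ('a'), with the encoding and error handler given explicitly.
# newline="" turns off newline translation so the bytes are predictable.
with open(path, "a", encoding="cp932", errors="ignore", newline="") as f:
    f.write("a\xa0b\n")   # str is accepted; NBSP is dropped on write

with open(path, "rb") as f:
    data = f.read()
print(data)   # b'ab\n'
```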

References

Python3 Unicode HOWTO https://docs.python.jp/3/howto/unicode.html

CP932 and UTF-8 https://android.googlesource.com/toolchain/benchmark/+/master/python/src/Modules/cjkcodecs/README
