Character encoding when dealing with files in Python 3

Overview

- In Python 3, the default character encoding used when handling files with `open` and similar functions depends on the OS.
- On Unix (Linux), it depends on the locale (`LC_CTYPE`).
- If you read or write a file without thinking about this, you may hit `UnicodeDecodeError` and similar errors depending on the environment.
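
One way to see which encoding a given environment would pick is to open a file without an `encoding` argument and inspect the resulting file object. This is a minimal sketch; the scratch file and the helper name `default_text_encoding` are illustrative assumptions, not part of the original post, and the printed values vary by platform.

```python
import locale
import os
import tempfile


def default_text_encoding():
    """Return the encoding open() picks when no encoding argument is given."""
    # Create an empty scratch file so there is something to open.
    fd, path = tempfile.mkstemp(suffix=".txt")
    os.close(fd)
    try:
        with open(path, mode="r") as fp:
            # Text-mode file objects expose the encoding they were opened with.
            return fp.encoding
    finally:
        os.remove(path)


print("open() default:", default_text_encoding())
print("locale preferred:", locale.getpreferredencoding(False))
```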

Verification

- Check the behavior on macOS.
- For example, suppose you have a UTF-8 text file containing Japanese text. Open this file and read its contents:

with open('utf-8.txt', mode='r') as fp:
    text = fp.read()

- The file opens without any error and its contents are read.
- This is because the default character encoding on macOS is UTF-8.
- You can check the encoding actually used with `locale.getpreferredencoding`.

>>> import locale
>>> locale.getpreferredencoding()
'UTF-8'

- Because `getpreferredencoding` returns `UTF-8`, the UTF-8 text can be read without error.
- Next, change `LC_CTYPE` and confirm that an error occurs.
- Use `setlocale` to temporarily change `LC_CTYPE`:

import locale

locale.setlocale(locale.LC_CTYPE, 'C')
print(locale.getpreferredencoding(False))  # => US-ASCII

with open('hoge.txt') as fp:  # UTF-8 text file; no encoding given
    text = fp.read()

Result

US-ASCII
Traceback (most recent call last):
  File "test.py", line 7, in <module>
    text = fp.read()
  File "/path/to/encodings/ascii.py", line 26, in decode
    return codecs.ascii_decode(input, self.errors)[0]
UnicodeDecodeError: 'ascii' codec can't decode byte 0xe3 in position 0: ordinal not in range(128)

- Setting `LC_CTYPE` to `C` changes the character encoding to US-ASCII.
- As a result, reading the UTF-8 text raised `UnicodeDecodeError`.
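
The same error can be reproduced without touching the locale by decoding UTF-8 bytes with the ASCII codec directly. This is only a sketch to isolate the cause; the sample character is an arbitrary choice and not taken from the original file.

```python
# UTF-8 bytes for a Japanese character; every byte is above 0x7f.
data = "あ".encode("utf-8")  # b'\xe3\x81\x82'

try:
    data.decode("ascii")
except UnicodeDecodeError as exc:
    error = exc
    # 'ascii' codec can't decode byte 0xe3 in position 0: ordinal not in range(128)
    print(exc)

# Decoding with the matching codec succeeds.
text = data.decode("utf-8")
```

The first byte of most Japanese characters in UTF-8 is `0xe3`, which is why the traceback above reports that byte at position 0.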

Note
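- The same behavior can be confirmed without `setlocale` by changing the `LC_CTYPE` environment variable directly.
- Unless you call `getpreferredencoding(do_setlocale=False)`, you cannot get the encoding that was temporarily changed with `setlocale`. With the default `do_setlocale=True`, the function consults the user's environment settings instead of the locale currently active in the process.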
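
The difference between the two call styles can be sketched as follows. The variable names are illustrative, and the actual values depend on the platform and Python version; on Windows, or when Python's UTF-8 mode is enabled, both calls may return the same value.

```python
import locale

# Remember the current LC_CTYPE so it can be restored afterwards.
saved = locale.setlocale(locale.LC_CTYPE)
try:
    locale.setlocale(locale.LC_CTYPE, "C")
    # do_setlocale=False: report the locale that is active right now.
    active = locale.getpreferredencoding(False)
    # Default (do_setlocale=True): consult the user's environment settings.
    from_env = locale.getpreferredencoding()
finally:
    locale.setlocale(locale.LC_CTYPE, saved)

print("active locale:", active)
print("user environment:", from_env)
```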

Solution

- As a rule, specify the character encoding explicitly when dealing with files.
- In Python 3, `open` accepts an `encoding` argument, so use it; files can then be handled regardless of `LC_CTYPE`.

with open('utf-8.txt', encoding='utf-8') as fp:
    text = fp.read()
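
The same applies to writing. A minimal round-trip sketch, assuming a temporary directory as scratch space (the sample text is arbitrary, and the file name simply mirrors the example above):

```python
import os
import tempfile

original = "こんにちは"

with tempfile.TemporaryDirectory() as tmpdir:
    path = os.path.join(tmpdir, "utf-8.txt")

    # Write with an explicit encoding instead of the locale default.
    with open(path, mode="w", encoding="utf-8") as fp:
        fp.write(original)

    # Read back with the same explicit encoding.
    with open(path, mode="r", encoding="utf-8") as fp:
        text = fp.read()

assert text == original
```

Because both calls name the encoding, the result should not change when `LC_CTYPE` does.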

- If you are writing a library that works with both Python 2 and Python 3, open the file in binary mode and decode the bytes as UTF-8 yourself, or use the `codecs` module.

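# -*- coding: utf-8 -*-
import locale
import codecs
import six  # third-party Python 2/3 compatibility library

# Force the C locale to show that neither approach depends on LC_CTYPE.
locale.setlocale(locale.LC_CTYPE, 'C')

# Approach 1: read raw bytes, then decode them explicitly.
with open('utf-8.txt', 'rb') as fp:
    text1 = fp.read()
    text1 = six.text_type(text1, 'utf-8')

# Approach 2: let codecs.open decode while reading.
with codecs.open('utf-8.txt', 'r', encoding='utf-8') as fp:
    text2 = fp.read()

assert text1 == text2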
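
Another option, not mentioned in the original post, is `io.open`. It accepts an `encoding` argument on Python 2.6 and later, and on Python 3 it is the same function as the built-in `open`. A minimal sketch using a scratch file (the file and sample text are illustrative):

```python
# -*- coding: utf-8 -*-
import io
import os
import tempfile

original = u"こんにちは"

# Scratch file for the round trip.
fd, path = tempfile.mkstemp(suffix=".txt")
os.close(fd)
try:
    # io.open takes encoding= on both Python 2 and Python 3.
    with io.open(path, "w", encoding="utf-8") as fp:
        fp.write(original)
    with io.open(path, "r", encoding="utf-8") as fp:
        text = fp.read()
finally:
    os.remove(path)

assert text == original
```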

Summary

- Python 3 determines the default character encoding for files from the OS and the locale (`LC_CTYPE`).
- As a rule, specify the character encoding when handling files. Otherwise, you will run into unexpected problems.
- I made exactly this mistake recently and regret it.
