[PYTHON] Be careful when working with gzip-compressed text files

I ran into a few pitfalls (for me, at least) when reading gzip-compressed text files, so I've summarized them here.

Binary reading

gzip.open reads in binary mode by default ('r' is treated as 'rb'), so the code below reads each line as a bytes object.

import gzip

# "r" is treated as "rb" by gzip.open, so each line is a bytes object
with gzip.open("test.txt.gz", "r") as fi:
    for line in fi:
        print(line)

To read the file as text, open it in 'rt' mode.

import gzip

# "rt" opens the file in text mode, so each line is a str
with gzip.open("test.txt.gz", "rt") as fi:
    for line in fi:
        print(line)
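
To make the difference between the two modes visible, here is a minimal sketch that writes a small gzip file into a temporary directory and reads its first line back in each mode. The file name sample.txt.gz and its contents are made up for illustration.

```python
import gzip
import os
import tempfile

# Hypothetical sample file, created only for this demonstration.
tmp_dir = tempfile.mkdtemp()
sample_path = os.path.join(tmp_dir, "sample.txt.gz")
with gzip.open(sample_path, "wb") as fo:
    fo.write(b"hello\nworld\n")

# "r" is treated as "rb": readline() returns bytes.
with gzip.open(sample_path, "r") as fi:
    binary_line = fi.readline()

# "rt" wraps the stream in a text layer: readline() returns str.
with gzip.open(sample_path, "rt", encoding="utf-8") as fi:
    text_line = fi.readline()

print(type(binary_line).__name__, repr(binary_line))
print(type(text_line).__name__, repr(text_line))
```

The binary line still has to be decoded (for example with line.decode("utf-8")) before it can be used as a string.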

The encoding declaration at the top of the source file is ignored

Even if you declare an encoding at the top of your script, that declaration only applies to the source code itself and is ignored when reading a file, so you need to specify the encoding again when opening the file. In the end, the following code reads the file as text.

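import gzip

# Pass the encoding by keyword: the third positional argument of gzip.open is compresslevel
with gzip.open("test.txt.gz", "rt", encoding="utf_8") as fi:
    for line in fi:
        print(line)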
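
As a rough check of why the explicit encoding matters, the sketch below stores some non-ASCII text as UTF-8 in a temporary gzip file, then reads it back once with the matching encoding and once with a deliberately wrong one. The file name and the sample text are assumptions made for this example. Note that the encoding is passed by keyword, because the third positional parameter of gzip.open is compresslevel.

```python
import gzip
import os
import tempfile

text = "日本語のテキスト\n"

# Hypothetical sample file holding UTF-8 encoded text.
tmp_dir = tempfile.mkdtemp()
path = os.path.join(tmp_dir, "utf8_sample.txt.gz")
with gzip.open(path, "wb") as fo:
    fo.write(text.encode("utf-8"))

# Reading with the matching encoding returns the original string.
with gzip.open(path, "rt", encoding="utf-8") as fi:
    decoded = fi.read()

# Reading with a codec that cannot represent the bytes fails while decoding.
try:
    with gzip.open(path, "rt", encoding="ascii") as fi:
        fi.read()
    ascii_failed = False
except UnicodeDecodeError:
    ascii_failed = True

print(decoded == text, ascii_failed)
```

When no encoding is given, text mode falls back to a platform-dependent default, which is why the same script can work on one machine and fail on another.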

Aside

The same probably applies to other compressed file formats, but I haven't tried them.
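
For what it's worth, the standard library's bz2 and lzma modules provide an open() function with the same mode and encoding conventions, so the guess can be tried with a short sketch like the one below. The payload and file names are made up for illustration, and this only covers those two modules, not every compression format.

```python
import bz2
import gzip
import lzma
import os
import tempfile

payload = "圧縮テキスト\n"
tmp_dir = tempfile.mkdtemp()
results = {}

for suffix, module in (("gz", gzip), ("bz2", bz2), ("xz", lzma)):
    path = os.path.join(tmp_dir, "sample.txt." + suffix)
    # Write the same UTF-8 bytes with each compressor.
    with module.open(path, "wb") as fo:
        fo.write(payload.encode("utf-8"))
    # Mode "r" is binary for all three modules.
    with module.open(path, "r") as fi:
        raw = fi.read()
    # Text mode needs "rt", with the encoding passed by keyword.
    with module.open(path, "rt", encoding="utf-8") as fi:
        decoded = fi.read()
    results[suffix] = (type(raw).__name__, decoded == payload)

print(results)
```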
