Handling ZIP files that contain Japanese file names when upgrading from Python 2 to Python 3

Introduction

The other day I worked on upgrading a Django application from Python 2.7 to Python 3. After the Python 3 code was released, we ran into a bug: files with Japanese names that had been uploaded under Python 2 could no longer be downloaded.

In this application, uploaded files are stored in ZIP format, and at download time the file name is passed to zipfile.extract() to extract the file. That extract() call was failing with an error like KeyError: "There is no item named 'xxx.yyy' in the archive".

The following article was very helpful for troubleshooting, but it did not solve everything on its own, so I am writing this article as a supplement.

Zip extraction with Python (Japanese file name support) - Qiita

Cause 1: Specification change of zipfile

This is covered in the article mentioned above. Python 2's zipfile handled file names as byte strings, so you did not have to think about character encodings. Python 3, however, looks at the header information of the ZIP file and, if the UTF-8 flag is not set, decodes every file name as CP437.

That said, for people like me who are not familiar with the ZIP specification or with how Python handles strings, this explanation alone is hard to picture, so let's look at it in a little more detail.

zipfile.extract() Overview

First, Python's zipfile.ZipFile object holds meta information (a ZipInfo object) for each stored file. extract() uses the path held in this ZipInfo to locate the target file and extract its data.

    def extract(self, member, path=None, pwd=None):
...
        if not isinstance(member, ZipInfo):
            member = self.getinfo(member)
...
        return self._extract_member(member, path, pwd)

If a file name is passed to extract(), ZipFile.getinfo() is called. getinfo() looks up the NameToInfo attribute of the ZipFile object to get the ZipInfo object of the target file. NameToInfo is a dictionary of the form {filename: ZipInfo}.

class ZipFile(object):
...
    def __init__(self, file, mode="r", compression=ZIP_STORED, allowZip64=False):
...
        self.NameToInfo = {}    # Find file info given name
...
    def getinfo(self, name):
        """Return the instance of ZipInfo given 'name'."""
        info = self.NameToInfo.get(name)
        if info is None:
            raise KeyError(
                'There is no item named %r in the archive' % name)

        return info

This flow is common to Python2 and 3.
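The lookup described above can be reproduced with a small in-memory archive. This is only a sketch for Python 3: the archive contents and the names hello.txt and missing.txt are made up for illustration, and NameToInfo is an internal attribute of ZipFile rather than a documented API.

```python
import io
import zipfile

# Build a small archive in memory so the example needs no files on disk.
buf = io.BytesIO()
with zipfile.ZipFile(buf, "w") as zf:
    zf.writestr("hello.txt", "hi")

with zipfile.ZipFile(buf, "r") as zf:
    # NameToInfo maps each stored name to its ZipInfo object.
    info = zf.NameToInfo["hello.txt"]
    same_object = zf.getinfo("hello.txt") is info

    # A name that is not a key of NameToInfo raises KeyError,
    # which is the error that extract() surfaced in this article.
    try:
        zf.getinfo("missing.txt")
        lookup_failed = False
    except KeyError:
        lookup_failed = True

print(same_object, lookup_failed)
```

In other words, whether extract() by name succeeds depends only on whether the string you pass is a key of NameToInfo.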

Changes in Python 3

What changed in Python 3 is, as explained at the beginning, that file names are automatically decoded when they are stored in ZipInfo and NameToInfo. More specifically, the change is in zipfile.ZipFile._RealGetContents().

In Python 2, the name is decoded as UTF-8 only when the UTF-8 flag is set, as shown below; otherwise the byte string is stored as-is.

class ZipFile(object):
...
    def __init__(self, file, mode="r", compression=ZIP_STORED, allowZip64=False):
...
        try:
            if key == 'r':
                self._RealGetContents()
...
    def _RealGetContents(self):
        """Read in the table of contents for the ZIP file."""
...
            filename = fp.read(centdir[_CD_FILENAME_LENGTH])
            # Create ZipInfo instance to store file information
            x = ZipInfo(filename)
...
            x.filename = x._decodeFilename()
            self.filelist.append(x)
            self.NameToInfo[x.filename] = x
...
    def _decodeFilename(self):
        if self.flag_bits & 0x800:
            return self.filename.decode('utf-8')
        else:
            return self.filename

Python 3's _RealGetContents(), on the other hand, always decodes the file name, as either UTF-8 or CP437, as shown below.

    def _RealGetContents(self):
        """Read in the table of contents for the ZIP file."""
...
            filename = fp.read(centdir[_CD_FILENAME_LENGTH])
            flags = centdir[5]
            if flags & 0x800:
                # UTF-8 file names extension
                filename = filename.decode('utf-8')
            else:
                # Historical ZIP filename encoding
                filename = filename.decode('cp437')

Because of this change, names in encodings that the ZIP specification does not cover, such as Shift_JIS, and also names that are encoded in UTF-8 but do not have the UTF-8 flag set, are all forcibly decoded as CP437. As a result, the names of extracted files are garbled or an error occurs.
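To make the garbling concrete, here is a minimal sketch of what happens to the bytes of a Shift_JIS file name. It assumes the example name テスト用です.txt that is used later in this article; the variable names are only for illustration.

```python
name = "テスト用です.txt"

# Bytes as a Shift_JIS based tool would have stored them in the archive.
raw = name.encode("shift_jis")

# What Python 3's zipfile produces when the UTF-8 flag is not set.
garbled = raw.decode("cp437")

# CP437 assigns a character to every byte value, so the wrong decoding
# loses nothing and can be undone by encoding back to CP437.
restored = garbled.encode("cp437").decode("shift_jis")

print(restored == name)
```

Because the CP437 step is reversible, the original bytes can always be recovered from the garbled string, which is what the workarounds below rely on.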

Cause 2: Passing the file name to extract ()

To specify which file to extract from a ZIP file, you can actually pass either the file name or a ZipInfo object as the first argument of zipfile.extract().

If you pass a ZipInfo object, the workaround from the article mentioned in the introduction (re-decode ZipInfo.filename and pass the modified ZipInfo object to extract()) works. If you pass the file name, however, the error remains. Let's check with a concrete example.

In a directory named python_zip, create a file named テスト用です.txt ("for testing") whose file name is encoded in Shift_JIS, and compress it as test.zip in the same directory (the name looks garbled in ls because LANG is UTF-8). Also create a directory called extracted as the destination for extracted files.

~/python_zip$ ls
''$'\203''e'$'\203''X'$'\203''g'$'\227''p'$'\202''ł'$'\267''.txt'   extracted   test.zip

Try to extract the txt file from this test.zip with the code that rewrites ZipInfo.filename as follows.

import zipfile

zf = zipfile.ZipFile("test.zip", 'r')
for info in zf.infolist():
    bad_filename = info.filename
    # Undo the CP437 decoding and decode the raw bytes as Shift_JIS instead
    info.filename = info.filename.encode('cp437').decode('shift_jis')
zf.extract("テスト用です.txt", "./extracted")
zf.close()

I get a KeyError as shown below.

~/python_zip$ python extract_zip_py3.py
Traceback (most recent call last):
  File "extract_zip_py3.py", line 24, in <module>
    zf.extract("テスト用です.txt", "./extracted")
  (omitted)
KeyError: "There is no item named 'テスト用です.txt' in the archive"

As described in "zipfile.extract() Overview", the NameToInfo attribute is consulted when a file is specified by name. Printing NameToInfo of test.zip after rewriting ZipInfo.filename shows the following.

{'âeâXâgùpé┼é╖.txt': <ZipInfo filename='テスト用です.txt' filemode='-rw-r--r--' file_size=19>}

The filename has indeed been fixed, but the key is still the garbled name, so self.NameToInfo.get(name) in getinfo() finds no match and the error occurs.

Remedy 1

This means that, in this case, it works if you rewrite NameToInfo as well. Modify the previous extract_zip_py3.py as follows.

import zipfile

zf = zipfile.ZipFile("test.zip", 'r')
for info in zf.infolist():
    bad_filename = info.filename
    info.filename = info.filename.encode('cp437').decode('shift_jis')
    # Remove the garbled key first so that an unchanged (ASCII) name is not lost
    del zf.NameToInfo[bad_filename]
    zf.NameToInfo[info.filename] = info
print(zf.NameToInfo)  # For debugging
zf.extract("テスト用です.txt", "./extracted")
zf.close()

Running it gives the following.

~/python_zip$ python extract_zip_py3.py
{'テスト用です.txt': <ZipInfo filename='テスト用です.txt' filemode='-rw-r--r--' file_size=19>}
~/python_zip$ ls extracted
テスト用です.txt

By rewriting NameToInfo as well, the ZipInfo of the target file is found correctly, and the file is extracted without a garbled name.
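The same idea can be packaged as a reusable function and tried without any files on disk. The following is a sketch under several assumptions: redecode_names and build_legacy_zip are hypothetical helpers, not part of zipfile; build_legacy_zip fakes a legacy archive by writing an ASCII placeholder name and then patching the stored bytes, which is only a shortcut for this demonstration; and entries that already have the UTF-8 flag (0x800) are skipped, because their names were decoded correctly and may not be encodable as CP437.

```python
import io
import zipfile


def build_legacy_zip(name, data, encoding="shift_jis"):
    """Build an archive whose entry name is raw bytes without the UTF-8 flag."""
    raw = name.encode(encoding)
    placeholder = "N" * len(raw)  # ASCII stand-in with the same byte length
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w", zipfile.ZIP_STORED) as zf:
        zf.writestr(placeholder, data)
    # Patch the name in both the local header and the central directory.
    return buf.getvalue().replace(placeholder.encode("ascii"), raw)


def redecode_names(zf, encoding="shift_jis"):
    """Re-decode CP437-decoded names and keep NameToInfo in sync."""
    for info in zf.infolist():
        if info.flag_bits & 0x800:
            continue  # UTF-8 flag set: zipfile already decoded this name
        bad_name = info.filename
        good_name = bad_name.encode("cp437").decode(encoding)
        del zf.NameToInfo[bad_name]
        info.filename = good_name
        zf.NameToInfo[good_name] = info


archive = build_legacy_zip("テスト用です.txt", b"Write to file\n")
with zipfile.ZipFile(io.BytesIO(archive)) as zf:
    names_before = zf.namelist()
    redecode_names(zf)
    names_after = zf.namelist()
    content = zf.read("テスト用です.txt")

print(names_after == ["テスト用です.txt"], content)
```

The example uses read() instead of extract() only to avoid touching the file system; both go through the same getinfo() lookup.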

Remedy 2 (+ encoding detection)

I think Remedy 1 is basically enough for Python 2 to 3 support of a Django application, but it would be safer if the encoding used for re-decoding could be determined automatically instead of being fixed.

So, in addition to the Shift_JIS テスト用です.txt, I created a second file with a UTF-8 name, Test 2 .txt, and compressed it into test2.zip.

~/python_zip$ ls
''$'\203''e'$'\203''X'$'\203''g'$'\227''p'$'\202''ł'$'\267''.txt'   extracted    test2.zip
 extract_zip_py3.py test.zip Test 2 .txt

And I modified extract_zip_py3.py as follows.

import sys
import zipfile

import chardet

args = sys.argv
zname = args[1]  # ZIP file name
fname = args[2]  # Name of the file to extract

zf = zipfile.ZipFile(zname, 'r')
for info in zf.infolist():
    bad_filename = info.filename
    raw_name = info.filename.encode('cp437')  # Back to the bytes stored in the ZIP
    code = chardet.detect(raw_name)  # Detect the character encoding automatically
    print(code)
    info.filename = raw_name.decode(code['encoding'])
    # Remove the garbled key first so that an unchanged (ASCII) name is not lost
    del zf.NameToInfo[bad_filename]
    zf.NameToInfo[info.filename] = info
print(zf.NameToInfo)
zf.extract(fname, "./extracted")
zf.close()

Let's try it.

~/python_zip$ python extract_zip_py3.py test.zip テスト用です.txt
{'encoding': 'SHIFT_JIS', 'confidence': 0.99, 'language': 'Japanese'}
{'テスト用です.txt': <ZipInfo filename='テスト用です.txt' filemode='-rw-r--r--' file_size=19>}

~/python_zip$ python extract_zip_py3.py test2.zip Test 2 .txt
{'encoding': 'utf-8', 'confidence': 0.99, 'language': ''}
{'Test 2 .txt': <ZipInfo filename='Test 2 .txt' filemode='-rw-r--r--' file_size=28>}

~/python_zip$ ls extracted
テスト用です.txt  Test 2 .txt

As intended, the file names were re-decoded with the automatically detected encoding, and the files were extracted without garbled names.
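Automatic detection is a statistical guess, and for very short names the detector may not return a usable encoding. As a complement, here is a sketch of a different, standard-library-only technique: try a fixed list of candidate encodings in order and take the first one that decodes without error. The function name guess_decode and the candidate order are assumptions; UTF-8 is tried first because it rejects most Shift_JIS byte sequences, while the reverse is not true.

```python
def guess_decode(raw, candidates=("utf-8", "cp932")):
    """Return raw decoded with the first candidate encoding that accepts it."""
    for encoding in candidates:
        try:
            return raw.decode(encoding)
        except UnicodeDecodeError:
            continue
    # No candidate fits: fall back to zipfile's own default.
    return raw.decode("cp437")


sjis_raw = "テスト用です.txt".encode("shift_jis")
utf8_raw = "テスト用です.txt".encode("utf-8")

print(guess_decode(sjis_raw) == guess_decode(utf8_raw))
```

In the script above, such a function could be used as a fallback when the detected encoding is missing or decoding with it fails.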

Summary so far

This time, the change in how Python handles strings overlapped with the change in the zipfile module, and it took me a fairly long time before I could explain it in my own words. I would like to write more about Python 2 to 3 topics.

Thank you for reading this far. From here on, I will write about some details that caught my interest while writing, so please stay with me a little longer if you are interested.

Question (1): Where is the path of the file to be extracted?

This workaround modifies the meta information of the ZipFile object, and I found it strange that the target file in the ZIP archive can still be found after filename and NameToInfo have been rewritten arbitrarily.

Following the source a little further, I found a separate attribute, ZipInfo.orig_filename. It is compared with the file name (as a decoded string) written in the local header of the ZIP archive (the per-file metadata stored in the ZIP, see the figure below), and the file is opened if the two match.

When I added hard-coded debug output and ran extract_zip_py3.py, orig_filename was still garbled after NameToInfo had been rewritten, and it matched the string obtained by decoding the file name from the local header (fname) as CP437.

~/python_zip$ python extract_zip_py3.py test.zip テスト用です.txt
{'テスト用です.txt': <ZipInfo filename='テスト用です.txt' filemode='-rw-r--r--' file_size=19>}
orig_filename: âeâXâgùpé┼é╖.txt
fname: b'\x83e\x83X\x83g\x97p\x82\xc5\x82\xb7.txt'

~/python_zip$ python
Python 3.7.5 (default, Nov 22 2020, 16:16:44)
[GCC 7.5.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> s = b'\x83e\x83X\x83g\x97p\x82\xc5\x82\xb7.txt'
>>> s.decode('cp437')
'âeâXâgùpé┼é╖.txt'

The information used to find the file lives in orig_filename and in the local header of the ZIP, so it is not affected by the rewrite, while filename is used for the name of the extracted file. That is why everything works out.
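This can be checked without patching zipfile itself. The sketch below assumes Python 3 and builds a stand-in for the legacy archive in memory by writing an ASCII placeholder name and swapping in the Shift_JIS bytes afterwards; that construction is a demonstration shortcut, not something the application does.

```python
import io
import zipfile

# Fake a legacy archive: Shift_JIS name bytes, UTF-8 flag not set.
raw = "テスト用です.txt".encode("shift_jis")
placeholder = "N" * len(raw)
buf = io.BytesIO()
with zipfile.ZipFile(buf, "w") as zf:
    zf.writestr(placeholder, b"data")
legacy = buf.getvalue().replace(placeholder.encode("ascii"), raw)

with zipfile.ZipFile(io.BytesIO(legacy)) as zf:
    info = zf.infolist()[0]
    garbled = info.filename
    info.filename = garbled.encode("cp437").decode("shift_jis")
    del zf.NameToInfo[garbled]
    zf.NameToInfo[info.filename] = info

    # orig_filename is untouched by the rename and still equals the
    # CP437-decoded name in the local header, so the entry can be opened.
    orig_unchanged = info.orig_filename == garbled
    data = zf.read(info.filename)

print(orig_unchanged, data)
```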

Figure: layout of the ZIP format (source: ZIP (ファイルフォーマット) - Wikipedia, Japanese edition)

Question (2): Why is the flag not set even though it is encoded in UTF-8?

The bug that triggered this article was originally caused by files in the ZIP archive whose names were encoded in UTF-8 under Python 2 but were not flagged as UTF-8. Why did that happen?

Looking at the source of the zipfile module in Python 2, I found that it sets the UTF-8 flag only when the file name is of type unicode and cannot be encoded as ASCII. A byte-string (str) file name is written as-is and the flag is left untouched.

class ZipInfo(object):
...
    def FileHeader(self, zip64=None):
...
        filename, flag_bits = self._encodeFilenameFlags()
...
    def _encodeFilenameFlags(self):
        if isinstance(self.filename, unicode):
            try:
                return self.filename.encode('ascii'), self.flag_bits
            except UnicodeEncodeError:
                return self.filename.encode('utf-8'), self.flag_bits | 0x800
        else:
            return self.filename, self.flag_bits
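To see the three cases side by side, the logic above can be modelled in Python 3. This is an illustrative re-implementation, not the library code: encode_filename_flags is a hypothetical function in which bytes stands in for Python 2's str and str stands in for Python 2's unicode.

```python
UTF8_FLAG = 0x800


def encode_filename_flags(filename, flag_bits=0):
    """Model of the Python 2 logic: return (encoded name, flag bits)."""
    if isinstance(filename, str):  # Python 2 unicode
        try:
            return filename.encode("ascii"), flag_bits
        except UnicodeEncodeError:
            return filename.encode("utf-8"), flag_bits | UTF8_FLAG
    # Byte-string names are written unchanged and the flag is never set.
    return filename, flag_bits


unicode_name = "テスト用です.txt"
byte_name = unicode_name.encode("utf-8")

_, flags_unicode = encode_filename_flags(unicode_name)
_, flags_bytes = encode_filename_flags(byte_name)
_, flags_ascii = encode_filename_flags("test.txt")

print(flags_unicode, flags_bytes, flags_ascii)
```

The second case is the one that matters here: a name that is already a UTF-8 byte string ends up in the archive as UTF-8 bytes without the UTF-8 flag.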

So I wrote a Python 2 program that creates a file with a unicode file name as shown below, compresses it into a ZIP, and extracts it, added debug output, and ran it.

# -*- coding: utf-8 -*-

import zipfile

# Create a file, store it in a ZIP, and extract it to another directory
with open(u"テスト用です.txt", 'w') as f:
    f.write("Write to file\n")

zf = zipfile.ZipFile("test.zip", 'w')
zf.write(u"テスト用です.txt")
zf.close()

zf = zipfile.ZipFile("test.zip", 'r')
print zf.NameToInfo
print zf.infolist()[0].flag_bits  # 0x800 (2048) means the UTF-8 flag is set
zf.extract(u"テスト用です.txt", "./extracted")
zf.close()

The result is shown below. 0x800 is 2048 in decimal, so the UTF-8 flag is indeed set.

~/python_zip$ python --version
Python 2.7.17

~/python_zip$ python zip_py2.py
{u'\u30c6\u30b9\u30c8\u7528\u3067\u3059.txt': <zipfile.ZipInfo object at 0x7f68ec012940>}
2048

~/python_zip$ ls extracted
テスト用です.txt
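For comparison, the same check can be run in Python 3, where every file name is a str. The following sketch uses an in-memory archive instead of files on disk; the variable names are only for illustration.

```python
import io
import zipfile

buf = io.BytesIO()
with zipfile.ZipFile(buf, "w") as zf:
    zf.writestr("テスト用です.txt", "Write to file\n")

# Re-open the archive so that flag_bits is read back from the ZIP itself.
with zipfile.ZipFile(buf, "r") as zf:
    info = zf.infolist()[0]
    utf8_flag_set = bool(info.flag_bits & 0x800)
    name_read_back = info.filename

print(utf8_flag_set)
```

Archives written by Python 3 with non-ASCII names therefore carry the UTF-8 flag and do not need the re-decoding workaround.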

All's well that ends well.

References

cpython/zipfile.py at 2.7 · python/cpython · GitHub
cpython/zipfile.py at 3.7 · python/cpython · GitHub
.ZIP File Format Specification (English)
ZIP specifications summarized in Japanese · GitHub (a Japanese translation of the above; thank you)
ZIP (File Format) - Wikipedia: https://ja.wikipedia.org/wiki/ZIP_(%E3%83%95%E3%82%A1%E3%82%A4%E3%83%AB%E3%83%95%E3%82%A9%E3%83%BC%E3%83%9E%E3%83%83%E3%83%88)#%E3%83%95%E3%82%A1%E3%82%A4%E3%83%AB%E3%83%98%E3%83%83%E3%83%80
It seems that the handling of file names in zipfile has become decent - Tschinoko, this one. (beta)
