[PYTHON] What?! The unzipped file names were in SJIS?!

SJIS fuck

On Linux, unzipping a zip file downloaded from a Japanese site often leaves the Japanese file names garbled. I usually just use the files as they are without worrying about it, and throw them away when I no longer need them. Even when there are files I need to keep, there are never many, so I rename them by hand. This time, though, I had a fair number of files that were useless unless I fixed them, so I did a little research.

Cause

Basically, the cause is that Windows applications running in multibyte mode store file names in the zip as raw cp932 bytes. If those bytes were written to the file system unchanged when the archive is extracted on Linux, fixing them would be easy: just rename each file by passing its name through iconv -f shift-jis -t utf-8 in the shell. The only byte that would be a problem in a raw file name is 0x2F (the path separator), but it never appears as the second byte of a cp932 character, so that should be fine. However, some conversion seems to be applied to the non-ASCII part during extraction, and the names cannot be restored properly afterwards.
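To make the cause concrete, here is a small Python 3 sketch. The function names are my own, purely for illustration. It checks the claim that 0x2F never appears as the second byte of a two-byte cp932 character, and shows what a cp932 name looks like when its bytes are decoded with the wrong encoding. I use cp437 as a stand-in for the wrong encoding; the conversion that actually happened during extraction may well have been a different one.

```python
def cp932_trail_bytes():
    """Collect every second byte used by a two-byte cp932 character."""
    trail = set()
    for code_point in range(0x10000):  # cp932 only covers the BMP
        try:
            encoded = chr(code_point).encode('cp932')
        except UnicodeEncodeError:
            continue  # not representable in cp932 (includes surrogates)
        if len(encoded) == 2:
            trail.add(encoded[1])
    return trail


def misread(name, wrong_encoding='cp437'):
    """Show how a cp932 file name looks when its bytes are decoded wrongly.

    cp437 is only an assumed example of a wrong encoding.
    """
    return name.encode('cp932').decode(wrong_encoding)


if __name__ == '__main__':
    trail = cp932_trail_bytes()
    print('0x2F used as a second byte:', 0x2F in trail)
    garbled = misread('日本語.txt')
    print(repr(garbled))
    # Reversing the wrong decoding recovers the original name,
    # but only if the wrong decoding was lossless.
    print(garbled.encode('cp437').decode('cp932'))
```

The round trip works here because cp437 maps every byte value to some character. If the tool that garbled the name replaced or dropped bytes instead, the original name is gone, which would match what I saw.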

Giving up and extracting with Python

A quick search turned up a Stack Overflow thread where people were doing this in Python, so it was easy to write. I'm not someone who usually writes handy little tools in a glue language.

unzip.py


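#!/usr/bin/env python3

import sys
import zipfile

def main(filename):
    with zipfile.ZipFile(filename) as zf:
        for info in zf.infolist():
            # Bit 11 of flag_bits means the name is already UTF-8; leave it alone.
            if not info.flag_bits & 0x800:
                # Python 3 decodes unflagged names as cp437.
                # Undo that, then decode the raw bytes as SJIS (cp932).
                info.filename = info.filename.encode('cp437').decode('cp932')
            zf.extract(info)

if __name__ == '__main__':
    sys.exit(main(sys.argv[1]))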

A script that simply extracts the zip file given as the first argument, assuming its file names are in SJIS, with no error handling whatsoever.
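If a little error handling is wanted, one possible approach is to isolate the name conversion in a helper that falls back to the name as read when the bytes turn out not to be valid cp932. This is only a sketch under assumptions: the names recover_name and extract_all are hypothetical, and the cp437 round trip relies on Python 3's zipfile decoding names without the UTF-8 flag as cp437.

```python
import zipfile

UTF8_FLAG = 0x800  # general purpose bit 11: the name is stored as UTF-8


def recover_name(name, flag_bits=0, encoding='cp932'):
    """Return the intended file name for a zip entry read by Python 3.

    `name` is ZipInfo.filename. Names without the UTF-8 flag were decoded
    as cp437 by zipfile, so undo that and decode with `encoding` instead.
    """
    if flag_bits & UTF8_FLAG:
        return name  # already correct
    try:
        return name.encode('cp437').decode(encoding)
    except (UnicodeEncodeError, UnicodeDecodeError):
        return name  # not cp932 after all; keep the name as zipfile read it


def extract_all(zip_path, dest='.'):
    """Extract every entry into `dest`, fixing names where possible."""
    with zipfile.ZipFile(zip_path) as zf:
        for info in zf.infolist():
            info.filename = recover_name(info.filename, info.flag_bits)
            zf.extract(info, dest)
```

Keeping the conversion in a pure function makes it easy to try on a few names before touching the file system. On Python 3.11 and later, zipfile.ZipFile also accepts a metadata_encoding argument when reading, which may make the manual round trip unnecessary.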

It was actually easier

So, while digging further into the cause of the garbling so that I could write it up on Qiita, I found that, as the proverb goes, it is darkest at the foot of the lighthouse. unzip already has an option to convert the character encoding properly. orz

$ unzip -O sjis foo.zip

It seems that this is all you need. The names -O and -I are counterintuitive (they do not mean output and input): according to unzip's help, -O specifies the character encoding for archives created on DOS, Windows, and OS/2, and -I specifies it for archives created on UNIX and other systems. Also, the strange conversion mentioned earlier apparently happened because automatic detection of the encoding failed.

Note to self: read the help before reading the source. What's more, the answer was already on Qiita.

Summary

I finally got off my backside and did something I don't normally do, and it turned out to be completely wasted effort. Still, why is it that I like writing big programs but hate writing short ones? Maybe because the ratio of boilerplate is higher.
