Enumerating files with Japanese names in Python 2 on Windows (working around the 5C problem)

Python 2 on Windows still handles multi-byte characters poorly. If you write a file enumeration routine in the straightforward way, it can fail when certain characters, such as 表 ("table") or ソ ("so"), appear in the file path. This is the so-called 5C problem: in cp932 (Shift_JIS), the second byte of these characters is 0x5C, the same value as the backslash path separator.
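As a quick illustration of where the name comes from, the sketch below checks whether a character's last cp932 byte is 0x5C. The helper name `trail_byte_is_5c` and the sample list are made up for this example and are not part of the original script.

```python
# -*- coding: utf-8 -*-

def trail_byte_is_5c(char, encoding="cp932"):
    """Return True if the last encoded byte of `char` is 0x5C (backslash)."""
    return char.encode(encoding)[-1:] == b"\x5c"

# In cp932, 表 encodes to 0x95 0x5C and ソ to 0x83 0x5C,
# while あ encodes to 0x82 0xA0 and is therefore unaffected.
samples = [u"表", u"ソ", u"あ"]
affected = [c for c in samples if trail_byte_is_5c(c)]
```

Here `affected` ends up holding 表 and ソ but not あ.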

List of files used for this test

C:/test
  filelist.py
  Tesuto/
    a1.txt
    a2.txt
  table/
    hyo1.txt
    hyo2.txt
    中のtable/
      hyo10.txt
      hyo11.txt

A script that enumerates files in the straightforward way

filelist.py


# -*- coding: utf-8 -*-
import os
SEP = os.sep
def filelist(dir_path):
    for item in os.listdir(dir_path):
        file_path = dir_path + SEP + item
        print(file_path)
        if os.path.isdir(file_path):
            filelist(file_path)
# test
script_dir = os.path.dirname(os.path.abspath(__file__))
filelist(script_dir)

Execution result

C:\test\filelist.py
C:\test\Tesuto
C:\test\Tesuto\a1.txt
C:\test\Tesuto\a2.txt
C:\test\table
C:\test\table\table

Enumeration inside the path containing 表 ("table") does not work: instead of the directory's contents, only a bogus entry is printed.
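One possible reading of that odd last line, offered as an assumption rather than a confirmed diagnosis: if a byte-oriented routine decides whether to append a separator by looking only at the last byte of the path, a path ending in 表 already appears to end with a backslash. The hypothetical helper `listdir_pattern` below is just a simplified model of that idea, not Python's actual code.

```python
# -*- coding: utf-8 -*-

def listdir_pattern(path_bytes):
    """Build a wildcard search pattern the way a byte-oriented check might.

    Assumption: a separator is appended only when the last BYTE of the
    path is not already a separator or a colon.
    """
    if path_bytes[-1:] not in (b"\\", b"/", b":"):
        path_bytes += b"\\"
    return path_bytes + b"*.*"

# An ASCII directory name gets the separator it needs.
ok = listdir_pattern(u"C:\\test\\Tesuto".encode("cp932"))

# A name ending in 表 (0x95 0x5C) is mistaken for a path that already
# ends with a backslash, so no separator is added.
bad = listdir_pattern(u"C:\\test\\表".encode("cp932"))
```

Under this model the second pattern reads as "C:\test\表*.*" when decoded, which would match the 表 directory itself rather than its contents, and that would be consistent with the output above.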

Countermeasure (using decode)

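@wonderful_panda pointed out the fix to me. If you call .decode('cp932') on the file path right after obtaining it, so that it becomes a unicode string, the files are enumerated without any problem. Every path built from it, including the names returned by os.listdir, is then unicode as well.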

# -*- coding: utf-8 -*-
import os
SEP = os.sep
def filelist(dir_path):
    for item in os.listdir(dir_path):
        file_path = dir_path + SEP + item
        print(file_path)
        if os.path.isdir(file_path):
            filelist(file_path)
# test
script_dir = os.path.dirname(os.path.abspath(__file__.decode('cp932')))
filelist(script_dir)

Execution result

C:\test\filelist.py
C:\test\Tesuto
C:\test\Tesuto\a1.txt
C:\test\Tesuto\a2.txt
C:\test\table
C:\test\table\hyo1.txt
C:\test\table\hyo2.txt
C:\test\table\中のtable
C:\test\table\中のtable\hyo10.txt
C:\test\table\中のtable\hyo11.txt

This time the files under the path containing 表 ("table") were enumerated as well.
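The same idea can be wrapped in a small helper so that callers do not have to remember to decode. This is only a sketch under assumptions: the names `to_text` and `iter_paths` are invented here, and falling back to sys.getfilesystemencoding() instead of hard-coding 'cp932' is my own choice, not something the original script does.

```python
import os
import sys

def to_text(path, encoding=None):
    """Decode a byte-string path to text; text paths pass through unchanged."""
    if isinstance(path, bytes):  # `bytes` is an alias of `str` on Python 2
        return path.decode(encoding or sys.getfilesystemencoding())
    return path

def iter_paths(dir_path, encoding=None):
    """Yield every path below dir_path, depth first, as text strings."""
    dir_path = to_text(dir_path, encoding)
    for item in sorted(os.listdir(dir_path)):
        file_path = os.path.join(dir_path, item)
        yield file_path
        if os.path.isdir(file_path):
            # Plain loop instead of `yield from` so Python 2 can run it too.
            for sub_path in iter_paths(file_path, encoding):
                yield sub_path
```

A call such as `for p in iter_paths(script_dir, 'cp932'): print(p)` would then mirror the script above, with the decoding done in one place.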

Countermeasure (old: changing the current directory)

The decode approach is far smarter, but I will leave here the workaround I wrote before I was told about it.

filelist.py


# -*- coding: utf-8 -*-
import os
SEP = os.sep
def filelist2(dir_path):
    old_dir = os.getcwd()
    os.chdir(dir_path)  # Change the current directory.
    for item in os.listdir("."):
        file_path = dir_path + SEP + item
        print(file_path)
        if os.path.isdir(item):
            filelist2(file_path)
    os.chdir(old_dir)  # Restore the current directory.
# test
script_dir = os.path.dirname(os.path.abspath(__file__))
filelist2(script_dir)

Passing a path that contains a 5C character to os.listdir causes the problem. So instead of passing the path directly, change the current directory to the target path first, then pass "." (which denotes the current directory) to os.listdir. This lets the enumeration run normally.

Execution result

C:\test\filelist.py
C:\test\Tesuto
C:\test\Tesuto\a1.txt
C:\test\Tesuto\a2.txt
C:\test\table
C:\test\table\hyo1.txt
C:\test\table\hyo2.txt
C:\test\table\中のtable
C:\test\table\中のtable\hyo10.txt
C:\test\table\中のtable\hyo11.txt

With this approach, too, the files under the path containing 表 ("table") were enumerated.
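One weakness of changing the current directory by hand is that an exception in the loop would leave the process in the wrong directory. A possible refinement, sketched here with the made-up names `working_directory` and `filelist_cwd`, is to restore the directory in a `finally` clause. I have not run this on Python 2 on Windows, so treat it as an illustration of the pattern rather than a verified fix.

```python
import contextlib
import os

@contextlib.contextmanager
def working_directory(path):
    """Temporarily change the current directory, restoring it on exit."""
    old_dir = os.getcwd()
    os.chdir(path)
    try:
        yield
    finally:
        os.chdir(old_dir)  # Runs even if the body raises.

def filelist_cwd(dir_path):
    """Return the paths below dir_path, listing each directory as "."."""
    # Resolve before changing directory so recursive calls stay absolute.
    dir_path = os.path.abspath(dir_path)
    paths = []
    with working_directory(dir_path):
        for item in sorted(os.listdir(".")):
            file_path = dir_path + os.sep + item
            paths.append(file_path)
            if os.path.isdir(item):
                paths.extend(filelist_cwd(file_path))
    return paths
```

Returning a list instead of printing keeps the helper easy to check; printing each entry of `filelist_cwd(script_dir)` would give output in the same form as above.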

About Python 3

Even on Windows, Python 3.5 and similar versions enumerated the files correctly without any countermeasure like these. For a new program that is known to handle multi-byte strings, it is safer to start with Python 3 rather than Python 2.
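For comparison, a Python 3 version needs no decoding step because paths are text by default. The following is a minimal sketch using pathlib; the function name `filelist3` is invented here, and returning a sorted list is my own choice for the example.

```python
from pathlib import Path

def filelist3(dir_path):
    """Return every path below dir_path as text, sorted (Python 3 only)."""
    return sorted(str(path) for path in Path(dir_path).rglob("*"))
```

Calling `filelist3(Path(__file__).resolve().parent)` from the script would list the same tree as the earlier examples.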

Bonus: the 5C problem in PHP

I tried to enumerate files with the same method (changing the current directory) in PHP on Windows, but it did not work. I searched around quite a bit, and the conclusion seems to be that there is no solution in PHP:

This is not possible. It's a limitation of PHP. PHP uses the multibyte versions of Windows APIs; you're limited to the characters your codepage can represent.

If you set a breakpoint at readdir_r() in win32\readdir.c, you'll see that FindNextFile already returns a filename with question marks in place of the characters you want, so there's nothing you can do about it, apart from patching PHP itself.
