[PYTHON] "Stop committing Japanese files to git on Mac >_<": for the time being, I wrote a script that finds Japanese file names that are incompatible between Mac and Linux.

In my recent PHP project, I write the programs in Japanese. Not only class names and variable names but also file names are in Japanese. (I plan to explain why I decided to write in Japanese, and my motivation, in a separate article.)

In this project the development environment is Mac and the production environment is Linux, and we ran into problems such as PHP files with Japanese file names not being autoloaded. When I looked into it, the cause was that file names end up in different Unicode normalization forms on the Mac file system and on the Linux file system. For more information, the article "Notes on File Names on Mac OS X (NFC, NFD, etc.)" from "Introduction Mania Dorafuto Edition" is helpful.

To briefly explain the difference between the two file systems:

Mac: File names are stored in a form called NFD. The voiced sound mark (dakuten) and the semi-voiced sound mark (handakuten) are separated from the base character (decomposed). "Da" (だ) becomes 6 bytes in UTF-8: "ta" (た) followed by a combining voiced sound mark.

Linux: File names are normally in a form called NFC. The marks are not separated from the base character (composed). "Da" (だ) is 3 bytes in UTF-8.
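The byte counts can be checked with Python's standard `unicodedata` module. This is only a small sketch in Python 3, separate from the script in this article, and the variable names are my own:

```python
from unicodedata import normalize

da_nfc = normalize("NFC", "\u3060")  # "da" as a single code point (U+3060)
da_nfd = normalize("NFD", da_nfc)    # "ta" (U+305F) + combining voiced sound mark (U+3099)

nfc_bytes = len(da_nfc.encode("utf-8"))
nfd_bytes = len(da_nfd.encode("utf-8"))

print(nfc_bytes, nfd_bytes)  # 3 6
print(da_nfc == da_nfd)      # False: the two forms look the same but compare unequal
```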

If you commit a file created on Mac with an NFD name to git, the name is staged as it is, still decomposed. It would be nice if the name were converted from NFD to NFC when you git pull on Linux, but the file is created with the NFD name. Since the names written in the PHP source code are NFC, any code that refers to a file by a fixed name runs into the "it worked on Mac, but it stopped working on Linux" phenomenon.
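The mismatch can be illustrated without touching a real file system. In the sketch below (Python 3), the file name `データ.php` and the set standing in for a directory listing are hypothetical; it only shows that an NFC name does not match its NFD twin until both sides are normalized the same way:

```python
from unicodedata import normalize

# Hypothetical file name as it would be written in the source code (NFC).
name_in_source = normalize("NFC", "データ.php")
# The same name as created on the Mac and later checked out on Linux (NFD).
name_on_disk = normalize("NFD", name_in_source)

# Stand-in for a directory listing on the Linux side.
files_on_disk = {name_on_disk}

found_directly = name_in_source in files_on_disk
found_after_normalizing = name_in_source in {normalize("NFC", f) for f in files_on_disk}

print(found_directly)           # False
print(found_after_normalizing)  # True
```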

The Japanese-named files have already been committed and that cannot be undone, so to identify the problematic files for the time being, I wrote a Python script that finds files with NFD names.

How to use

$ find-nfd -h
usage: find-nfd [-h] [path]

Find NFD files

positional arguments:
  path        path to find(Default: current working directory)

optional arguments:
  -h, --help  show this help message and exit

Source code

`find-nfd.py`


#!/usr/bin/env python3
"""Find files and directories whose paths are not NFC (for example NFD names created on Mac)."""
import argparse
import os
from unicodedata import normalize


def find_all_files(directory):
    # Yield every directory path and every file path under `directory`.
    for root, dirs, files in os.walk(directory):
        yield root
        for file in files:
            yield os.path.join(root, file)


def to_nfc(string):
    # Compose base characters and combining marks into single code points.
    return normalize("NFC", string)


def is_nfd(string):
    # A path that changes under NFC normalization is treated as NFD.
    return to_nfc(string) != string


def find_nfd_files(directory):
    for file in find_all_files(directory):
        if is_nfd(file):
            yield file


def main():
    parser = argparse.ArgumentParser(description="Find NFD files")
    parser.add_argument("path", type=str, help="path to find(Default: current working directory)", nargs='?', default=os.getcwd())
    args = parser.parse_args()

    count = 0

    for file in find_nfd_files(args.path):
        print(file)
        count += 1

    print("")
    print("%d files found" % count)


if __name__ == "__main__":
    main()

Trying it out

These are files created on a Mac:

$ php -r 'var_dump(glob("/tmp/test/1/*"));'
array(7) {
  [0] =>
  string(13) "/tmp/test/1/a"
  [1] =>
  string(13) "/tmp/test/1/b"
  [2] =>
  string(17) "/tmp/test/1/schon"
  [3] =>
  string(19) "/tmp/test/1/schön"
  [4] =>
  string(30) "/tmp/test/1/한글"
  [5] =>
  string(27) "/tmp/test/1/Hahifuheho"
  [6] =>
  string(42) "/tmp/test/1/Papipupepo"
}

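On screen it is impossible to tell NFD from NFC, but the byte counts give it away. "Hahifuheho" and "Papipupepo" stand for はひふへほ and ぱぴぷぺぽ (the kana are romanized in this listing). Both names are five characters long, yet one path is 27 bytes and the other 42, because each semi-voiced character is stored as a base character plus a combining mark. The German umlaut and the Korean Hangul names are NFD as well: 19 and 30 bytes, where the NFC forms would be 18 bytes each.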
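The byte counts in the listing can be reproduced with a short calculation. This is a sketch in Python 3 that assumes the two romanized names stand for はひふへほ and ぱぴぷぺぽ, which is what the byte counts suggest; the helper `byte_len` is my own name:

```python
from unicodedata import normalize

prefix = "/tmp/test/1/"
names = ["schön", "한글", "はひふへほ", "ぱぴぷぺぽ"]


def byte_len(path, form):
    # UTF-8 length of `path` after normalizing it to the given form.
    return len(normalize(form, path).encode("utf-8"))


# One (NFC bytes, NFD bytes) pair per name, in the same order as `names`.
results = [(byte_len(prefix + n, "NFC"), byte_len(prefix + n, "NFD")) for n in names]

for name, (nfc, nfd) in zip(names, results):
    print(name, "NFC:", nfc, "NFD:", nfd)
```

The NFD column gives 19, 30, 27 and 42 bytes, matching the PHP output. Only はひふへほ has the same length in both forms, since it contains no characters that decompose.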

Search this directory for NFD files:

$ find-nfd.py /tmp/test/1
/tmp/test/1/schön
/tmp/test/1/한글
/tmp/test/1/Papipupepo

3 files found

I found three.

If you find such files, you have to rename them in a Linux or Windows environment and commit them to git again.
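The renaming step could be automated along these lines. This is a minimal sketch in Python 3 and not part of the original script: `rename_to_nfc` is a hypothetical helper, and it is meant to be run on Linux. On a Mac the file system may treat both spellings as the same name, in which case the existence check simply skips the entry.

```python
import os
from unicodedata import normalize


def rename_to_nfc(directory):
    """Rename entries under `directory` whose names are not NFC.

    Returns a list of (old_path, new_path) pairs that were renamed.
    """
    renamed = []
    # Walk bottom-up so children are renamed before their parent directory.
    for root, dirs, files in os.walk(directory, topdown=False):
        for name in dirs + files:
            nfc_name = normalize("NFC", name)
            if nfc_name == name:
                continue
            old_path = os.path.join(root, name)
            new_path = os.path.join(root, nfc_name)
            if os.path.exists(new_path):
                # Never overwrite an entry that already has the NFC name.
                continue
            os.rename(old_path, new_path)
            renamed.append((old_path, new_path))
    return renamed
```

The renames still have to be committed to git afterwards, as described above.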

You might ask, "Do you really go through this hassle every time?" I am changing the development setup itself so that a Debian environment can be built with Vagrant in just 5 minutes :)

A Mac is not the production environment, so to avoid getting stuck on needless problems, it is important to make the development environment exactly the same as the production environment.