Since it is tedious to check and set the character encoding every time a file is read, I wrote a module that detects it automatically. It is especially useful when importing CSV files containing Japanese that were created in Excel, and it also supports files on the web. Passing the return value as the `encoding` argument when opening a file has worked without problems so far.
from chardet.universaldetector import UniversalDetector
import requests

def check_encoding(file_path):
    '''Get the character encoding of the file.'''
    detector = UniversalDetector()
    if file_path.startswith('http'):
        r = requests.get(file_path)
        # Stream the response body and feed it to the detector chunk by chunk
        for binary in r.iter_content(chunk_size=4096):
            detector.feed(binary)
            if detector.done:
                break
        detector.close()
    else:
        with open(file_path, mode='rb') as f:
            for binary in f:
                detector.feed(binary)
                if detector.done:
                    break
        detector.close()
    print("  ", detector.result, end=' => ')
    print(detector.result['encoding'])
    return detector.result['encoding']
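As a usage sketch, the return value can be passed straight to `open()` (or to `pandas.read_csv`). The sample data and path below are hypothetical, and `'cp932'` is hard-coded as a stand-in for the value `check_encoding` would return, so the sketch runs even without chardet installed:

```python
import csv
import os
import tempfile

# Create a sample CSV in cp932, the encoding Excel on Japanese Windows
# typically produces (hypothetical data, for illustration only).
path = os.path.join(tempfile.mkdtemp(), 'sample.csv')
with open(path, mode='w', encoding='cp932', newline='') as f:
    csv.writer(f).writerow(['名前', '値'])

# In real use: encoding = check_encoding(path)
encoding = 'cp932'  # stand-in for the detected result

with open(path, encoding=encoding, newline='') as f:
    rows = list(csv.reader(f))
print(rows[0])  # the Japanese header decodes correctly
```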
It seems that CSV files containing Japanese are often Shift_JIS, so the next function converts the name to cp932 (Microsoft's superset of Shift_JIS, also known as Windows-31J), which decodes Excel output more reliably. Passing the return value from the first function as the argument yields the most suitable encoding name.
def change_encoding(encoding):
    '''Convert Shift_JIS-related names to cp932.'''
    if encoding in ['Shift_JIS', 'SHIFT_JIS', 'shift_jis', 'sjis', 's_jis']:
        encoding = 'cp932'
    return encoding
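For example, chardet typically reports `'SHIFT_JIS'`, which the function maps to cp932, while other names pass through unchanged. The function is repeated here so the sketch runs on its own:

```python
def change_encoding(encoding):
    '''Convert Shift_JIS-related names to cp932 (copied from above).'''
    if encoding in ['Shift_JIS', 'SHIFT_JIS', 'shift_jis', 'sjis', 's_jis']:
        encoding = 'cp932'
    return encoding

print(change_encoding('SHIFT_JIS'))  # cp932
print(change_encoding('utf-8'))     # utf-8
```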
Thank you for reading; corrections and suggestions are welcome.