[PYTHON] Precautions when opening files in encodings other than CP932 (Shift-JIS) on Windows

Introduction

I am currently a beginner learning machine learning. While studying by copying code that others have published on Kaggle, I ran into an error, so I am recording the error and its solution here as a reminder.

In short, Python on Windows (with a Japanese locale) decodes text files as CP932 by default. If the file contents cannot be decoded as CP932, a UnicodeDecodeError exception is raised, so I summarize the countermeasure here.

Reference URLs

https://qiita.com/Yuu94/items/9ffdfcb2c26d6b33792e

https://qiita.com/butada/items/33db39ced989c2ebf644

Environment

Windows 10 Home, Python 3.7.4

Problems and solutions


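import numpy as np

# Map each word to its embedding vector (one "word v1 v2 ..." entry per line)
embedding_dict = {}
# No encoding is given, so the file is decoded with the platform default (cp932 here)
with open('xxxxx.txt', 'r') as f:
    for line in f:
        values = line.split()
        word = values[0]
        vectors = np.asarray(values[1:], 'float32')
        embedding_dict[word] = vectors
# The with statement closes the file, so no explicit f.close() is needed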

UnicodeDecodeError: 'cp932' codec can't decode byte 0x93 in position 5456: illegal multibyte sequence

This error was raised when running the code. In a Windows environment like mine, a file opened without an explicit encoding is decoded as cp932, so the read fails as soon as the file contains a byte sequence that is not valid cp932. I did not identify the specific character this time, but the likely cause is that the code and data came from Kaggle, written by someone overseas, and the file is not encoded in cp932.

Therefore, specify the encoding in the program so that the file is decoded as UTF-8.
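To see which encoding is in play, the following minimal sketch prints the default that open() uses in Python 3.7 when no encoding is given, and checks whether some bytes decode under a given codec. The helper names default_text_encoding and can_decode, and the sample bytes, are made up for illustration and do not come from the original code.

```python
import locale

def default_text_encoding():
    """Return the encoding open() falls back to when encoding= is omitted."""
    # Passing False queries the locale setting without modifying it
    return locale.getpreferredencoding(False)

def can_decode(data, encoding):
    """Return True if the bytes decode cleanly with the given encoding."""
    try:
        data.decode(encoding)
        return True
    except UnicodeDecodeError:
        return False

# Hypothetical sample: an en dash followed by a space, stored as UTF-8 bytes
sample = "\u2013 ".encode("utf-8")

print("default encoding:", default_text_encoding())
print("decodes as utf-8:", can_decode(sample, "utf-8"))
print("decodes as cp932:", can_decode(sample, "cp932"))
```

On a Japanese Windows machine the first line is expected to report cp932. On other systems it will typically report a different encoding, which is why the same script can run without error elsewhere.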


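import numpy as np

# Map each word to its embedding vector (one "word v1 v2 ..." entry per line)
embedding_dict = {}
# Explicitly decode the file as UTF-8 instead of the platform default (cp932)
with open('xxxxx.txt', 'r', encoding="utf-8") as f:
    for line in f:
        values = line.split()
        word = values[0]
        vectors = np.asarray(values[1:], 'float32')
        embedding_dict[word] = vectors
# The with statement closes the file, so no explicit f.close() is needed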

Adding encoding="utf-8" to the open() call is all it takes.

I learned that this error is specific to the Windows environment, and how Python chooses the encoding used to decode a file.
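The fix can be tried end to end without the real data file. The sketch below writes a tiny UTF-8 sample to a temporary directory and reads it back with an explicit encoding. The function name load_embeddings, the file name sample_vectors.txt, and the sample contents are assumptions for illustration, and plain Python lists stand in for the NumPy arrays so that the example needs only the standard library.

```python
import os
import tempfile

def load_embeddings(path, encoding="utf-8"):
    """Read 'word v1 v2 ...' lines into a dict of word -> list of floats."""
    embeddings = {}
    # The explicit encoding overrides the platform default (cp932 on Japanese Windows)
    with open(path, "r", encoding=encoding) as f:
        for line in f:
            values = line.split()
            if not values:
                continue  # skip blank lines
            embeddings[values[0]] = [float(v) for v in values[1:]]
    return embeddings

# Hypothetical sample data: one ASCII token and one non-ASCII token (an en dash)
sample_text = "hello 0.1 0.2\n\u2013 0.3 0.4\n"

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "sample_vectors.txt")
    # Write the file as UTF-8, as a file downloaded from Kaggle is assumed to be
    with open(path, "w", encoding="utf-8") as f:
        f.write(sample_text)
    result = load_embeddings(path)

print(sorted(result.keys()))
```

Making the encoding a parameter with a UTF-8 default keeps the behavior the same on Windows, macOS, and Linux, instead of depending on each machine's locale.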
