[PYTHON] UnicodeDecodeError: What to do when the 'shift_jis' codec can't decode a byte

Environment:

- Windows 10 Pro version 1909
- Python 3.8.5
- Pandas 1.0.5

Event: Pandas raised an error when I read a CSV file

Traceback (most recent call last):
  File "C:/path/to/my_code.py", line 258, in <module>
    csv = read_files(target_dir)
  File "C:/path/to/my_code.py", line 74, in read_files
    data = pd.read_csv(file, encoding="shift_jis")
  File "C:\path\to\venv\lib\site-packages\pandas\io\parsers.py", line 676, in parser_f
    return _read(filepath_or_buffer, kwds)
  File "C:\path\to\venv\lib\site-packages\pandas\io\parsers.py", line 448, in _read
    parser = TextFileReader(fp_or_buf, **kwds)
  File "C:\path\to\venv\lib\site-packages\pandas\io\parsers.py", line 880, in __init__
    self._make_engine(self.engine)
  File "C:\path\to\venv\lib\site-packages\pandas\io\parsers.py", line 1114, in _make_engine
    self._engine = CParserWrapper(self.f, **self.options)
  File "C:\path\to\venv\lib\site-packages\pandas\io\parsers.py", line 1891, in __init__
    self._reader = parsers.TextReader(src, **kwds)
  File "pandas\_libs\parsers.pyx", line 529, in pandas._libs.parsers.TextReader.__cinit__
  File "pandas\_libs\parsers.pyx", line 720, in pandas._libs.parsers.TextReader._get_header
  File "pandas\_libs\parsers.pyx", line 916, in pandas._libs.parsers.TextReader._tokenize_rows
  File "pandas\_libs\parsers.pyx", line 2063, in pandas._libs.parsers.raise_parser_error
UnicodeDecodeError: 'shift_jis' codec can't decode byte 0xee in position 4225: illegal multibyte sequence

my_code.py


data = pd.read_csv(file, encoding="shift_jis")
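The last line of the traceback names the rejected byte and its offset. A minimal sketch for pulling out the bytes around such an offset, assuming the file fits in memory; `find_undecodable` and the sample bytes are hypothetical and not part of the original code:

```python
def find_undecodable(raw: bytes, encoding: str):
    """Return (offset, offending_bytes, context) for the first decode failure.

    Returns None when the whole buffer decodes cleanly.
    """
    try:
        raw.decode(encoding)
    except UnicodeDecodeError as exc:
        # exc.start / exc.end mark the byte range the codec rejected.
        context = raw[max(0, exc.start - 8):exc.end + 8]
        return exc.start, raw[exc.start:exc.end], context
    return None


# Hypothetical sample: an ASCII header and row prefix followed by two
# characters encoded with cp932 (bytes FB FC and FA B1).
sample = b"id,name\r\n1," + b"\xfb\xfc\xfa\xb1" + b"\r\n"

result = find_undecodable(sample, "shift_jis")
if result is not None:
    offset, bad, context = result
    print(f"offset={offset} bytes={bad.hex()} context={context.hex()}")

print("cp932:", find_undecodable(sample, "cp932"))
```

Against a real file, something like `find_undecodable(open(file, "rb").read(), "shift_jis")` should point at the first byte the codec rejects, which can then be inspected in a hex viewer.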

Cause: The file contains CP932 extended characters that do not exist in SJIS.

The article "Points to note when letting pandas read csv of excel output" (Qiita) explains it this way: the error occurs because test2.csv contains Windows extension characters such as hashigo-daka (髙, the "taka" variant) and tachi-saki (﨑, the "saki" variant).
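The difference between the two codecs can be checked directly from Python. The following is a small sketch, not the author's code: `can_encode` is a hypothetical helper, and the characters are written as escapes so the snippet does not depend on the source file's own encoding.

```python
PLAIN = "\u9ad8\u5d0e"      # the ordinary "taka" and "saki" kanji (in JIS X 0208)
EXTENDED = "\u9ad9\ufa11"   # hashigo-daka and tachi-saki (CP932 extension characters)


def can_encode(text: str, encoding: str) -> bool:
    """Return True if every character of `text` exists in `encoding`."""
    try:
        text.encode(encoding)
    except UnicodeEncodeError:
        return False
    return True


report = {
    (encoding, label): can_encode(text, encoding)
    for encoding in ("shift_jis", "cp932")
    for label, text in (("plain", PLAIN), ("extended", EXTENDED))
}

for key in sorted(report):
    print(key, report[key])
```

Only the ("shift_jis", "extended") pair comes back False: Python's `shift_jis` codec has no mapping for these two characters, while `cp932` does.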

Out of curiosity, I tried every combination. * I used 髙﨑 ("Takasaki") as the extended characters.

read_csv encoding | File encoding | File with extended characters | File without extended characters
shift_jis         | shift_jis     | error                          | OK
shift_jis         | cp932         | error                          | OK
cp932             | shift_jis     | error (*)                      | OK
cp932             | cp932         | OK                             | OK

(*) In the editor (Sublime Text), you can add extended characters to a shift_jis file and save it, but a warning appears. That warning should tell you something is wrong...
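The rows of the table where the file itself is saved as cp932 can be reproduced without pandas. This is a rough sketch under a few assumptions: the built-in `open()` plus the `csv` module stand in for `pd.read_csv`, the file names and the `try_read` / `build_matrix` helpers are hypothetical, and the rows for a shift_jis file with extended characters are left out because Python cannot write such a file (the encode step itself fails).

```python
import csv
import tempfile
from pathlib import Path

PLAIN = "\u9ad8\u5d0e"      # ordinary kanji, no extended characters
EXTENDED = "\u9ad9\ufa11"   # hashigo-daka and tachi-saki


def try_read(path: Path, encoding: str) -> str:
    """Return "OK" if the CSV decodes with `encoding`, otherwise "error"."""
    try:
        with open(path, encoding=encoding, newline="") as f:
            list(csv.reader(f))
    except UnicodeDecodeError:
        return "error"
    return "OK"


def build_matrix() -> dict:
    """Write cp932 CSV files with and without extended characters, then read them."""
    results = {}
    with tempfile.TemporaryDirectory() as tmp:
        for label, name in (("extended", EXTENDED), ("plain", PLAIN)):
            path = Path(tmp) / f"{label}.csv"
            path.write_bytes(f"id,name\r\n1,{name}\r\n".encode("cp932"))
            for encoding in ("shift_jis", "cp932"):
                results[(encoding, label)] = try_read(path, encoding)
    return results


matrix = build_matrix()
for key in sorted(matrix):
    print(key, matrix[key])
```

Reading with `shift_jis` fails only for the file that contains the extended characters, and reading with `cp932` succeeds for both, which matches the corresponding rows of the table.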

Error when reading a shift_jis file with extended characters using cp932:


# ...(omitted)...
UnicodeDecodeError: 'cp932' codec can't decode byte 0x86 in position 5: illegal multibyte sequence

If you work with files containing Japanese on Windows, it is safer to specify cp932. I also read up on CP932 and SJIS.

Action: Read the file with CP932

my_code.py


data = pd.read_csv(file, encoding="cp932")
