[PYTHON] Memorandum (Countermeasures against UnicodeDecodeError when reading CSV files)

** "You can automate it with Python, right? Thanks!" ... so I'll do my best. Record, part 1 **

** ~~ Conclusion first ~~ ** When calling open(CSV file, encoding=character code), pass the character code that the CSV file was actually saved in. This time the file contained Japanese text saved as Shift-JIS, so I had to specify Shift-JIS.
** ~~ A diary nobody asked for ~~ ** Output data to CSV with in-house software → the person in charge shapes it nicely in an Excel file → the Excel file is sent to the other party by e-mail. I was told to automate this work with Python, so I decided to do my best, starting from zero. I am a complete amateur at Python, but apparently some knowledge of VBA is enough to manage. Is that true? ** ~~ End of diary ~~ **

** ~~ Main topic ~~ ** ** ・ I got a UnicodeDecodeError when reading a CSV file, so I want to fix it **

** ~~ The situation ~~ ** I started by checking the basic functions of Python using Google Colaboratory. I placed a suitable CSV file directly under Google Drive, created a new notebook in Google Colaboratory, and tried a sample program that reads a CSV file and prints it. Ref: https://note.com/092i034i/n/n76f2c2de197

test


import csv  # Standard-library module for reading and writing CSV files

csvfile = open('/content/drive/My Drive/test.csv')  # Open the CSV file; no encoding is specified here
reader = csv.DictReader(csvfile)  # Wrap the file so each row comes back as a dict keyed by the header row

for row in reader:  # Read the file one row at a time
 print(row)  # Print the contents of row

However, it does not run. I get an error.

error-message


UnicodeDecodeError                        Traceback (most recent call last)
<ipython-input-30-e6400dcd8fdb> in <module>()
      4 reader = csv.DictReader(csvfile)
      5 
----> 6 for row in reader:
      7  print(row)

As someone whose English grades were overwhelmingly, fatally bad, and for whom even the announcer's "Break the Targets!" in Smash Bros. sounded like a language from another world, I was already feeling sick at this point. Still, there is no choice but to work through it one piece at a time.

Apparently, decoding fails at the point where a row is taken out of the variable reader and assigned to row. No error occurred at the csvfile = open(test.csv) line.

`The processing flow I guessed from this result

  1. The CSV file on the drive is encoded (encrypted) and imported into the Python variable csvfile.
  2. The contents of csvfile are passed, still encoded, into the variable reader according to the rules of csv.DictReader.
  3. The contents of reader are decoded (decrypted) and transferred to the variable row (decoding failed here and the error occurred).
  4. The contents of row are output from the first row to the last. `

So I thought something was wrong with the decoding step when moving the contents of reader into row. But after looking up the error message on the net, that does not seem to be the case.

`Something like the correct processing flow

  1. The CSV file on the drive contains Japanese and was saved as Shift-JIS.
  2. Unless told otherwise, open() uses the platform's default encoding (UTF-8 in this environment). At this point the file is only opened; nothing is read or decoded, so no error occurs.
  3. csv.DictReader only wraps the file object csvfile. Still nothing is read, so no error occurs here either.
  4. When the for loop asks reader for the first row, the bytes are finally read from the file and decoded as UTF-8.
  5. Shift-JIS bytes are not valid UTF-8, so decoding fails and the UnicodeDecodeError is raised on this line.
  6. With the correct encoding specified, each row is decoded and output from the first row to the last. `
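
The point at which the error appears can be checked with a small sketch. It does not use the real file: it writes a throwaway Shift-JIS file (the name sample_sjis.csv and its contents are made up for illustration) and opens it with encoding="utf-8" to imitate the default in this environment.

```python
import csv
import os
import tempfile

# Hypothetical sample data; the file name and contents are made up for illustration.
path = os.path.join(tempfile.mkdtemp(), "sample_sjis.csv")
with open(path, "w", encoding="shift_jis", newline="") as f:
    f.write("名前,部署\r\n山田,営業\r\n")

steps = []

# Opening the file does not read or decode anything yet.
csvfile = open(path, encoding="utf-8", newline="")
steps.append("open: ok")

# DictReader only wraps the file object; still nothing is read.
reader = csv.DictReader(csvfile)
steps.append("DictReader: ok")

# Asking for the first row is what reads and decodes the bytes.
try:
    next(reader)
    steps.append("first row: ok")
except UnicodeDecodeError:
    steps.append("first row: UnicodeDecodeError")
finally:
    csvfile.close()

print(steps)
```

The first two steps succeed and only the request for the first row fails, which matches the traceback pointing at the for line rather than at open().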

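At first I could not see why csv.DictReader itself raised no error. It seems to be because nothing is actually read from the file until the loop asks for the first row. After the fix the code ran, so for the time being I will leave it at that.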

** Even if the wrong encoding is chosen at open(), the problem is only discovered later, when Python actually reads and decodes the data. ** This is probably something I do not fully understand because of my lack of knowledge about encoding and decoding ... but it is not the main topic (automation), so I will not dig into it any further this time.
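
For reference, here is a minimal sketch of what encoding and decoding mean in Python, separate from any file. Encoding turns a str into bytes and decoding turns bytes back into a str; neither is encryption. The sample string is made up for illustration.

```python
text = "日本語"  # hypothetical sample string

# Encoding: str -> bytes, using the Shift-JIS rules.
data = text.encode("shift_jis")

# Decoding with the same rules gives back the original string.
round_trip = data.decode("shift_jis")

# Decoding the same bytes with different rules (UTF-8) fails.
try:
    data.decode("utf-8")
    result = "decoded"
except UnicodeDecodeError:
    result = "UnicodeDecodeError"

print(round_trip, result)
```

A file saved as Shift-JIS is just such bytes on disk, so reading it as UTF-8 fails in the same way.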

Still, after the whole first day, all I can do is read a CSV file. Is this really okay? If I were Olimar, I would never make it back to Hocotate and would end up returning to the soil. Is it really true that some knowledge of VBA is enough? I am anxious.

Supplement: The code that ran after the fix


import csv

csvfile = open('/content/drive/My Drive/test.csv',encoding="shift-jis")  # Open the Japanese CSV file as Shift-JIS
reader = csv.DictReader(csvfile)

for row in reader:
 print(row) 
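
As a possible next step, the same read can be wrapped in a with block so the file is closed automatically. This is only a sketch: the helper name read_rows and the sample file are hypothetical, and newline="" is added because the csv module documentation recommends it for file objects.

```python
import csv
import os
import tempfile

def read_rows(path, encoding="shift_jis"):
    """Read a CSV file into a list of dicts, closing the file afterwards."""
    with open(path, encoding=encoding, newline="") as f:
        return list(csv.DictReader(f))

# Hypothetical sample file standing in for the real CSV on Google Drive.
path = os.path.join(tempfile.mkdtemp(), "sample.csv")
with open(path, "w", encoding="shift_jis", newline="") as f:
    f.write("品名,数量\r\nりんご,3\r\n")

rows = read_rows(path)
print(rows)
```

If a file exported from Windows software still fails with shift_jis, passing encoding="cp932" (Microsoft's extended variant) may be worth trying; whether that applies to the in-house software here is an assumption.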
