Handle Excel CSV files with Python

Python's csv package is very useful. It does the troublesome escape processing properly. Especially, it is useful for processing files sent as Excel files. After all, you can read Excel CSV properly with dialect ='excel'.

However, when Japanese is involved, the problem quickly becomes troublesome. What is the essence of the problem? First, the following is known regarding the handling of TSV in Excel.

--Excel tsv export is utf-16 (with BOM) --Excel can read directly tsv of utf-16 with BOM (comma separated by wizard required) --csv (comma separated values) output and reading are possible, but not recommended (see below)

What you can see from this is that it's best to work with utf-16 files with BOM. The best way to handle this is to use the ʻio` package.

with io.open(path, encoding='utf-16') as f:
  ...

I wrote it briefly, but it's really important to use the ʻiopackage. utf-16 is a fairly unwieldy format. First, there is the BOM. Furthermore, ifcr + lf is performed in this state, \ 0 will be inserted between bytes. So, if you normally use ʻopen (), it behaves like a mystery.

Now the problem arises here. When using python2 series, the csv package does not support ʻunicode. Therefore, it cannot be read if opened by the above method. Here, if you use ʻopen in good weather, it will be difficult to read for the above reasons. I used to export with csv of shift-jis, but Excel seems to be Unicode internally, and there are times when the dakuten is in a different character and it is not normalized, and at this time shift-jis When I output it, all the dakuten disappeared. So I don't recommend this method either.

Actually, how to read Unicode with csv is described in End of manual. The correct answer is to encode the ʻunicodestream to utf-8 once, read it in thecsv package in this state, split it, and decode utf-8 again. The same is true when writing, just change ʻunicode to utf-8, write to StringIO, and convert it to ʻunicodeagain just before spitting it out to a file. It was overwhelmingly troublesome twice, but I struggled with Excel for many years and finally got the most stable output. By the way, in Python3, thecsv package supports str (ʻunicode in 2), so you can open it with ʻio` and process it normally.

It would be dark if I wrote these processes every time, so I made it a library.

https://github.com/unnonouno/excelcsv

It avoids all of the above issues, maybe.

What you shouldn't do

That's all, but let's review what you shouldn't do and why.

Encode with utf-16 for each element

.encode ('utf-16') produces a utf-16 string with a BOM. This is the decisive difference from other encodings.

>>> u'a'.encode('utf-16')
'\xff\xfea\x00'

If you write a code like [x.encode ('utf-16') for x in row], a mysterious character string with a BOM will be generated here. This approach is useless. You have to make sure that the BOM is only at the beginning of the file. The best way to avoid this is to handle it with ʻunicode in principle and control it in the file output part. It's the ʻio package that does this, and you shouldn't encode it yourself using the regular ʻopen. Another workaround is to output to StringIO with utf-8, convert the whole to ʻunicode before writing to the file, and then convert it back to utf-16. However, it is cleaner to use ʻio`.

Open with open, parse and then decode

utf-16 means that the newline character and tab character you want to parse are also written in 2 bytes. If you let the Python2 csv package, which processes in 1-byte units, process it, unnecessary \ 0 characters will remain. This approach can only be used with utf-8 and shift-jis, where Ascii characters remain intact.

Export and then load with shift-jis csv

Excel seems to have the data in Unicode internally. Worst of all, the dakuten may be saved as a separate character, and if you export with shift-jis in this state, all the dakuten will disappear. I wish I could normalize it. Therefore, exporting from excel is utf-16 text alternative. It's awkward to handle, but it can't be helped.

Recommended Posts

Handle Excel CSV files with Python
Handle Excel with python
Excel with Python
[Easy Python] Reading Excel files with openpyxl
[Easy Python] Reading Excel files with pandas
Handle rabbimq with python
Csv tinkering with python
[Python] Reading CSV files
Operate Excel with Python (1)
Operate Excel with Python (2)
Operate Excel with Python openpyxl
Read csv with python pandas
Let's run Excel with Python
Sorting image files with Python (2)
Sort huge files with python
Sorting image files with Python (3)
[Tips] Handle Athena with Python
Sorting image files with Python
Integrate PDF files with Python
Reading .txt files with Python
Write to csv with Python
Handle JSON files with Matlab
Download csv file with python
Manipulate excel files from python with xlrd (personal notes)
Remove headings from multiple format CSV files with python
Transpose CSV files in Python Part 1
Handle Base91 keys with python + redis.
solver> Link> Solve Excel Solver with python
Make apache log csv with python
Recursively unzip zip files with python
Improve your productivity by processing huge Excel files with Python
Manipulating EAGLE .brd files with Python
Easily handle lists with python + sqlite3
Output to csv file with Python
[Python] POST wav files with requests [POST]
Create an Excel file with Python3
Easily handle databases with Python (SQLite3)
Stylish technique for pasting CSV data into Excel with Python
Create Heroku, Flask, Python, Nyanko bulletin boards with "csv files"
Read files in parallel with Python
Reading and writing CSV with Python
Excel, csv import, export with Django
Excel aggregation with Python pandas Part 1
How to import CSV and TSV files into SQLite with Python
[For beginners] Read Excel / CSV files into DataFrame with Google Colaboratory
[AWS] Using ini files with Lambda [Python]
Play audio files from Python with interrupts
Read CSV file with python (Download & parse CSV file)
Draw Nozomi Sasaki in Excel with python
Quickly create an excel file with Python #python
Convert Excel data to JSON with python
Read and write csv files with numpy
Excel aggregation with Python pandas Part 2 Variadic
Reading and writing JSON files with Python
Download files on the web with Python
Trying to handle SQLite3 with Python [Note]
Create Excel file with Python + similarity matrix
[Part1] Scraping with Python → Organize to csv!
Convert HEIC files to PNG files with Python
Excel table creation with Python [Progress management table]
FizzBuzz with Python3