Python's `csv` package is very useful. It handles the troublesome escaping properly, and it is especially handy for processing files sent from Excel: with `dialect='excel'` you can read Excel-style CSV correctly.
However, as soon as Japanese is involved, things quickly get troublesome. What is the essence of the problem? First, here is what is known about how Excel handles TSV:
- Excel's TSV export is UTF-16 (with BOM)
- Excel can directly open a UTF-16 TSV with BOM (comma-separated files require the import wizard)
- CSV (comma-separated values) export and import are possible, but not recommended (see below)
What this tells us is that it's best to work with UTF-16 files with BOM. And the best way to handle those is the `io` package.
```python
import io

with io.open(path, encoding='utf-16') as f:
    ...
```
I wrote that briefly, but using the `io` package really matters. UTF-16 is a fairly unwieldy format. First, there is the BOM. Furthermore, every character, including CR and LF, is written as two bytes, so `\0` bytes end up sitting between the bytes you expect, and byte-level CR+LF handling mangles them. If you open such a file with the ordinary `open()`, it behaves mysteriously.
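To see what this looks like, here is a minimal sketch (Python 2 style; `sample.tsv` is just a throwaway file name) comparing the raw bytes with what `io.open` gives back:

```python
import io

# Write a tiny UTF-16 TSV (throwaway file name).
with io.open('sample.tsv', 'w', encoding='utf-16') as f:
    f.write(u'a\tb\n')

# The raw bytes: a BOM, then a \x00 after every ASCII byte.
print(repr(open('sample.tsv', 'rb').read()))
# -> '\xff\xfea\x00\t\x00b\x00\n\x00'

# io.open decodes it back to clean unicode.
with io.open('sample.tsv', encoding='utf-16') as f:
    print(repr(f.read()))  # -> u'a\tb\n'
```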
Now the problem arises. In Python 2, the `csv` package does not support `unicode`, so it cannot read a file opened as above. And if you carelessly use the plain `open` instead, the file is hard to read for the reasons above. I used to export from Excel as Shift_JIS CSV, but Excel seems to hold its data as Unicode internally, and sometimes a dakuten is stored as a separate, unnormalized character; when such data is written out as Shift_JIS, all the dakuten disappear. So I don't recommend that method either.
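As an illustration of that dakuten problem (using assumed decomposed-form sample data; Excel's actual export path may of course differ), this is what happens when an unnormalized character meets Shift_JIS:

```python
import unicodedata

# "ガ" stored in decomposed form: KA (U+30AB) + combining voiced sound mark (U+3099).
decomposed = u'\u30ab\u3099'

# Shift_JIS has no combining voiced sound mark, so it is silently dropped here;
# only the bare KA survives.
print(repr(decomposed.encode('shift_jis', 'ignore')))

# Normalizing to NFC first composes the pair into GA (U+30AC), which encodes fine.
print(repr(unicodedata.normalize('NFC', decomposed).encode('shift_jis')))
```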
Actually, how to read Unicode with the `csv` package is described at the end of the manual, in the examples section. The correct answer is to encode the `unicode` stream to UTF-8 once, let the `csv` package read and split it in that state, and then decode the fields from UTF-8 again. The same goes for writing: convert the `unicode` to UTF-8, write it into a `StringIO`, and convert it back to `unicode` just before writing it out to the file. It's a double round-trip and overwhelmingly tedious, but after struggling with Excel for many years this finally gave me the most stable output. By the way, in Python 3 the `csv` package supports `str` (what was `unicode` in 2), so you can open the file with `io` and process it normally.
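For the reading side, a minimal sketch of that recode trick in Python 2 might look like this (`path` is assumed to point at an Excel UTF-16 TSV; the generator mirrors the recipe from the manual):

```python
import csv
import io

def unicode_csv_reader(unicode_lines, **kwargs):
    # Re-encode each unicode line to UTF-8, let csv split the bytes,
    # then decode every field back to unicode.
    utf8_lines = (line.encode('utf-8') for line in unicode_lines)
    for row in csv.reader(utf8_lines, **kwargs):
        yield [cell.decode('utf-8') for cell in row]

with io.open(path, encoding='utf-16') as f:
    for row in unicode_csv_reader(f, dialect='excel-tab'):
        print(row)
```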
Writing out these steps by hand every time would take forever, so I turned them into a library.
https://github.com/unnonouno/excelcsv
Hopefully it avoids all of the issues above.
That's all, but let's review what you shouldn't do and why.
`.encode('utf-16')` produces a UTF-16 byte string that starts with a BOM. This is the decisive difference from other encodings.
```python
>>> u'a'.encode('utf-16')
'\xff\xfea\x00'
```
So if you write code like `[x.encode('utf-16') for x in row]`, you get mysterious strings with a BOM stuck to every field. This approach is useless: the BOM must appear only once, at the very beginning of the file.
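For instance, encoding field by field gives every field its own BOM (dummy values):

```python
>>> [x.encode('utf-16') for x in [u'a', u'b']]
['\xff\xfea\x00', '\xff\xfeb\x00']
```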
The best way to avoid this is to handle everything as `unicode` in principle and leave the encoding to the file-output layer. That is exactly what the `io` package does, so you should not encode by hand on top of the regular `open`. Another workaround is to write UTF-8 into a `StringIO`, decode the whole buffer to `unicode` just before writing it to the file, and then encode that to UTF-16. Still, using `io` is cleaner.
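The writer side of that round-trip could look roughly like this (a sketch under my own assumptions: `out.tsv` and the sample rows are placeholders, and `newline=''` keeps `io` from touching the `\r\n` that `csv` already writes):

```python
import csv
import io
from cStringIO import StringIO

rows = [[u'name', u'value'], [u'\u30ac', u'1']]  # placeholder data

# Let csv do the quoting and joining on UTF-8 bytes in memory.
buf = StringIO()
writer = csv.writer(buf, dialect='excel-tab')
for row in rows:
    writer.writerow([cell.encode('utf-8') for cell in row])

# Decode the whole buffer once, then let io write it out as UTF-16 (BOM included).
with io.open('out.tsv', 'w', encoding='utf-16', newline='') as f:
    f.write(buf.getvalue().decode('utf-8'))
```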
In UTF-16, the newline and tab characters you want to parse are themselves written as two bytes. If you hand such data to the Python 2 `csv` package, which works byte by byte, stray `\0` characters are left behind in the fields. Feeding raw bytes to `csv` only works with encodings like UTF-8 and Shift_JIS, where ASCII characters keep their one-byte form.
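You can see the difference directly with the same kind of dummy values:

```python
>>> u'a\tb\n'.encode('utf-16')
'\xff\xfea\x00\t\x00b\x00\n\x00'
>>> u'a\tb\n'.encode('utf-8')
'a\tb\n'
```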
Excel seems to hold its data as Unicode internally. Worst of all, a dakuten may be saved as a separate, unnormalized character, and if you export as Shift_JIS in that state, all the dakuten disappear. I wish Excel would normalize them. So when exporting from Excel, UTF-16 text is the only real option. It's awkward to handle, but it can't be helped.