[PYTHON] Precautions when using codecs and pandas

Environment

Version

$ python --version
Python 2.7.12 :: Continuum Analytics, Inc.
$ pip freeze | grep pandas
pandas==0.19.1

Sample file

$ file --mime sample.tsv
sample.tsv: text/plain; charset=utf-8
$ cat sample.tsv
ID	言語
1	日本語
2	英語

codecs

First, read the file with the built-in open and then with codecs.

>>> open("sample.tsv", "r").read()
'ID\t\xe8\xa8\x80\xe8\xaa\x9e\n1\t\xe6\x97\xa5\xe6\x9c\xac\xe8\xaa\x9e\n2\t\xe8\x8b\xb1\xe8\xaa\x9e\n'
>>> import codecs
>>> codecs.open("sample.tsv", "r", "utf-8").read()
u'ID\t\u8a00\u8a9e\n1\t\u65e5\u672c\u8a9e\n2\t\u82f1\u8a9e\n'

Reading the file with codecs returns `unicode` instead of a byte `str`.
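As a cross-check, the same difference can be reproduced with io.open, which also accepts an encoding argument. This is a minimal sketch and not part of the original session: it writes its own temporary copy of the sample data (the temporary path is an assumption), and it uses \u escapes so that it should run unchanged on Python 2.7 and Python 3.

```python
from __future__ import print_function

import io
import os
import tempfile

# Same content as sample.tsv, written with \u escapes to keep the source ASCII-only.
SAMPLE = u"ID\t\u8a00\u8a9e\n1\t\u65e5\u672c\u8a9e\n2\t\u82f1\u8a9e\n"

# Hypothetical temporary file standing in for sample.tsv.
fd, path = tempfile.mkstemp(suffix=".tsv")
os.close(fd)
try:
    with io.open(path, "w", encoding="utf-8", newline="") as f:
        f.write(SAMPLE)

    # Binary mode returns the raw UTF-8 bytes (what open(..., "r") gives on Python 2).
    with io.open(path, "rb") as f:
        raw = f.read()

    # Passing an encoding returns decoded text, as codecs.open does.
    with io.open(path, "r", encoding="utf-8", newline="") as f:
        text = f.read()
finally:
    os.remove(path)

print(type(raw).__name__, type(text).__name__)
print(raw.decode("utf-8") == text)
```

The type names printed differ between Python 2 (str, unicode) and Python 3 (bytes, str), but in both cases the first value holds encoded bytes and the second holds decoded text.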

pandas

Next, pandas. The read_table function is handy for reading TSV files.

>>> import pandas as pd
>>> df = pd.read_table(open("sample.tsv", "r"))
>>> df
   ID   言語
0   1  日本語
1   2   英語
>>> df.columns
Index([u'ID', u'言語'], dtype='object')
>>> df[u"言語"]
Traceback (most recent call last):
  ...
KeyError: u'\u8a00\u8a9e'
>>> list(df.columns)
['ID', '\xe8\xa8\x80\xe8\xaa\x9e']
>>> type(list(df.columns)[1])
<type 'str'>
>>> df["言語"]
0    日本語
1     英語
Name: 言語, dtype: object

I am not sure why a `u` prefix appears in the display of `df.columns`, but it makes sense that the column labels themselves are `str`, since the file was opened without decoding.

codecs & pandas

with read_table

Next, open the file with codecs and pass the file object to read_table:

>>> df = pd.read_table(codecs.open("sample.tsv", "r", "utf-8"))
>>> df
   ID   言語
0   1  日本語
1   2   英語
>>> df[u"言語"]
Traceback (most recent call last):
  ...
KeyError: u'\u8a00\u8a9e'
>>> df["言語"]
0    日本語
1     英語
Name: 言語, dtype: object

For some reason the column labels are still `str`, even though the file was opened with codecs.
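If `unicode` labels are wanted regardless of what the parser returned, one option is to decode the labels after loading. The helper below is a hypothetical sketch (to_text is not a pandas function). It is demonstrated on a plain list holding the same byte strings that list(df.columns) printed earlier, so it runs without pandas; the closing comment shows how it could be applied to a DataFrame.

```python
def to_text(label, encoding="utf-8"):
    """Return label as text, decoding it first if it is a byte string."""
    # On Python 2, bytes is an alias of str, so this also catches str labels.
    if isinstance(label, bytes):
        return label.decode(encoding)
    return label


# The labels as byte strings, matching the list(df.columns) output in the pandas section.
labels = [b"ID", b"\xe8\xa8\x80\xe8\xaa\x9e"]
decoded = [to_text(label) for label in labels]

print(decoded == [u"ID", u"\u8a00\u8a9e"])  # True

# With a DataFrame this could be applied as (assumption, not run here):
#   df.columns = [to_text(c) for c in df.columns]
```

pandas readers also accept an encoding argument. Whether passing it yields `unicode` labels on this pandas version was not checked here.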

without read_table

>>> from collections import defaultdict
>>> data = defaultdict(list)
>>> f = codecs.open("sample.tsv", "r", "utf-8")
>>> labels = f.readline()[:-1].split("\t")  # header row: drop the trailing newline, then split on tabs
>>> values = f.readline()[:-1].split("\t")  # first data row only, split the same way
>>> for label, value in zip(labels, values):
...     data[label].append(value)
... 
>>> df = pd.DataFrame(data)
>>> df
  ID   言語
0  1  日本語
>>> df["言語"]
Traceback (most recent call last):
  ...
KeyError: '\xe8\xa8\x80\xe8\xaa\x9e'
>>> df[u"言語"]
0    日本語
Name: 言語, dtype: object
>>> list(df.columns)
[u'ID', u'\u8a00\u8a9e']
>>> type(list(df.columns)[1])
<type 'unicode'>

When the file is read with codecs and the DataFrame is built without read_table, the column labels are `unicode`, as expected.
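The manual approach in the session reads only the header and the first data row. A minimal sketch of extending it to every row follows. It uses an in-memory io.StringIO as a stand-in for the codecs file object (an assumption made so the example is self-contained), and the resulting dict could be passed to pd.DataFrame in the same way as before.

```python
from __future__ import print_function

import io
from collections import defaultdict

# Stand-in for codecs.open("sample.tsv", "r", "utf-8"): yields decoded text lines.
f = io.StringIO(u"ID\t\u8a00\u8a9e\n1\t\u65e5\u672c\u8a9e\n2\t\u82f1\u8a9e\n")

data = defaultdict(list)
labels = f.readline().rstrip("\n").split("\t")  # header row
for line in f:  # every remaining data row
    values = line.rstrip("\n").split("\t")
    for label, value in zip(labels, values):
        data[label].append(value)

print(sorted(data.keys()) == [u"ID", u"\u8a00\u8a9e"])  # True
print(len(data[u"ID"]))  # 2
```

Because every label and value comes from decoded text, the keys stay `unicode` on Python 2, which matches the behaviour seen in the session.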
