[PYTHON] Header shifts in read_csv() and read_table() of pandas

Introduction

When I tried to load a file I had received in CSV format with pandas and process it, the header and the data ended up misaligned. Unexpectedly, I could not find the answer right away, so I wrote it up as an article. [Jump to solution](#solution)

Operating environment

I ran it in the following environment.

module  version
python  3.8.3
pandas  1.0.5

Problem

Import the following CSV-format file as a DataFrame.

example.csv


Time  x   y   z
   0  1   2  10
   1  2   2  10
   2  3   2  10
..

Import using read_csv().

read_csv.py


import pandas as pd
path = 'csv file path'
df = pd.read_csv(path)
print(df)

The output result in the terminal is as follows.

#    Time\tx\ty\tz
#  0    1\t1\t2\t10
#  1    2\t2\t2\t10
#  2    3\t3\t2\t10
..

There are extra \t characters in the output: the whole header has been read as a single column name. It seems that the file was tab-separated (TSV format) rather than comma-separated.

Try running it with read_table().
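
Before guessing at the separator, it can help to look at the raw characters in the file. The following is a minimal sketch, not part of the original investigation: the helper name `peek_raw_lines` and the sample content are hypothetical, and the demo writes its own temporary file rather than using the real example.csv.

```python
import os
import tempfile


def peek_raw_lines(path, n=3):
    """Return repr() of the first n lines so tabs and spaces become visible."""
    lines = []
    # newline='' keeps the original line endings untouched
    with open(path, newline='') as f:
        for i, line in enumerate(f):
            if i >= n:
                break
            lines.append(repr(line))
    return lines


# Hypothetical sample: tab-separated header, data row mixing tabs and spaces
sample = "Time\tx\ty\tz\n0\t1  2\t10\n"

with tempfile.TemporaryDirectory() as tmp:
    demo_path = os.path.join(tmp, "example.csv")
    with open(demo_path, "w", newline='') as f:
        f.write(sample)
    raw = peek_raw_lines(demo_path)

for r in raw:
    print(r)
```

In the printed lines, every tab appears as \t and every space as a literal space, so a mix of the two is easy to spot.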

read_tsv.py


import pandas as pd
path = 'csv file path'
df = pd.read_table(path)
print(df)

The output result is as follows. The \t is gone, but now the header and the data are misaligned: every value has shifted one column to the left, and all the z data is NaN.

#    Time  x   y    z
#  0    1  2  10  NaN
#  1    2  2  10  NaN
#  2    3  2  10  NaN
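
A shift like this can be reproduced with a small in-memory file. The sketch below is an assumption about the cause, not a confirmed diagnosis of the original file: the hypothetical sample puts a trailing tab on every data row, so each data row has one more field than the header. In that situation pandas treats the first column as the index, which pushes the remaining values one column to the left.

```python
import io

import pandas as pd

# Hypothetical sample (assumption): each data row ends with an extra tab,
# so data rows have 5 fields while the header has only 4.
sample = (
    "Time\tx\ty\tz\n"
    "0\t1\t2\t10\t\n"
    "1\t2\t2\t10\t\n"
    "2\t3\t2\t10\t\n"
)

# read_table() splits on tabs by default
shifted = pd.read_table(io.StringIO(sample))
print(shifted)

# The Time values were consumed as the index, and z received the empty field
print(shifted.index.tolist())
print(shifted["z"].isna().all())
```

If the real file behaves the same way, counting the fields per line (header versus data rows) is a quick way to confirm it.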

Solution

Give the sep argument to read_csv() as follows.

read_csv_2.py


import pandas as pd
path = 'csv file path'
# Raw string avoids an invalid-escape warning; \s+ means one or more whitespace characters
df = pd.read_csv(path, sep=r'\s+')
print(df)

According to the pandas documentation (https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.read_csv.html), sep='\s+' seems to be the argument to give for files whose fields are separated by one or more whitespace characters. It seems that the cause was that the original data was separated by a mix of tabs and spaces ... Please forgive me ... lol.
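
To see the effect without the original file, here is a minimal sketch using a hypothetical in-memory sample (the content is an assumption, chosen to mix tabs, spaces, and trailing whitespace). With sep=r'\s+', any run of whitespace counts as a single separator, so all four columns line up with the header.

```python
import io

import pandas as pd

# Hypothetical sample (assumption): fields separated by a mix of tabs and
# spaces, with trailing whitespace on the data rows.
sample = (
    "Time\tx\ty\tz\n"
    "0\t1  2\t10\t\n"
    "1 \t2\t2  10\t\n"
    "2\t3\t 2\t10 \n"
)

# Any run of whitespace is treated as one separator
fixed = pd.read_csv(io.StringIO(sample), sep=r'\s+')
print(fixed)
```

The same call with the default comma separator, or with read_table(), would not split these rows into four clean columns.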

Summary

I was able to correctly load the header and data into a DataFrame by giving the argument sep='\s+' to read_csv() for data separated by a mix of tabs and spaces.
