[Python] Master the reading of csv files. List of main options for pandas.read_csv.

[Python] Master the reading of csv files. List of main options for pandas.read_csv.

Advanced version of reading csv files with pandas Note that there are many things you can do more than you think, such as specifying the rows and columns to read.

Basically it's OK if you hold down this much [List of main options for read_csv method](# 1-List of main options for read_csv method)


> ・ Click here for the basics of reading csv files with python (https://qiita.com/yuta-38/items/8f7a332651cd5a02e986)

・ The official page is here


**table of contents**
  1. [List of main options for read_csv method](# 1-List of main options for read_csv method)
  2. [Data read by default](# 2-Data read by default)
  3. [Blank rows / columns / cells of the original file](# 3-Blank matrix cells of the original file)
  4. Header (# 4-Header)
  5. Read a file without a header (#Read a file without a header)
  6. [Specify the header line](#Specify the header line)
  7. [Specify header name](#Specify header name to read)
  8. [Specify a common prefix for the header name](#Specify a common prefix for the header name)
  9. [Specify Heading Column (Index)](# 5-Specify Heading Column Index)
  10. Read Columns (# 6-Read Columns)
  11. [Specify by column number](#Specify by column number)
  12. [Specify by column name](#Specify by column name)
  13. Read Line (# 7-Read Line)
  14. [Specify the number of lines to read from the beginning](#Specify the number of lines to read from the beginning)
  15. [Specify the number of lines to be excluded from the beginning](#Specify the number of lines to be excluded from the beginning)
  16. [Exclude specified line](#Exclude specified line)
  17. [Specify the number of lines to be excluded from the end](#Specify the number of lines to be excluded from the end)
  18. [Read by specifying type](# 8-Read by specifying type)
  19. [Read files on the web](# 9-Read files on the web)
  20. [Read compressed file](# 10-Read compressed file)
  21. [Read by specifying the delimiter](# 11-Read by specifying the delimiter)

## 1. List of main options for read_csv method
option Example of use Contents
sep sep=';' Separate
delimiter delimiter=';' Split(Same as sep)
header header=1 Specify header line (default is guess, if not header=None * "N" is uppercase)
names ①names=['AA','BB','CC',,]  ②names='1234567' Give column title (header if there is header)=In combination with "0")
index_col index_col=0 Line heading(index)Specify the column to be
usecols usecols=[1,2,5] Specify the line to read. Specify only one line in list format "usecols=[0].. Can also be specified in the column title "
prefix prefix="line number", header=None Specify the prefix of the line title. Example "prefix='line number'」ならline number0、line number1、、、となる。 hedar=Valid only when None is specified.
dtype dtype=str Read by specifying the type. If not applicable, an error (such as reading str with float)
skiprows ①skiprows=5  ②skiprows=[1,3,6] Specify the line number not to be read at the beginning. For integers, from 0 to the specified integer.
skipfooter skipfooter=2, engine='python', encoding='utf_8' Specify the number of lines to exclude from the bottom. Need to describe what to use in python. If the characters are garbled, specify the character code.
nrows nrows=5 Specify the number of lines to read.
encoding encoding='shift_jis' Character code specification when reading a file
(compression) compression='zip' Open the compressed file. At present, it opens by analogy without description. (Conversely, compression in the zip file='gzip'Error if you specify
(skipinitialspace) skipinitialspace=True delimiter(Character delimiter)Later, remove the leading whitespace. Currently, it seems to be a specification that is deleted by default

## 2. Data read by default

■ Original file

When the following csv file is read

image.png

** ▼ Column ** ・ Column A is index (heading) ・ Column F is empty -Column G is a character and a blank cell

** ▼ line ** ・ The first row is the title of the column ・ The 9th line is empty ・ There is a formula error (#NUM!) On the 10th line.


#### ■ Reading result

image.png

** ▼ Point ** ・ ** Heading column added to the first column ** (index number from 0) ・ ** Title line added to the first line ** --Blank cells are ** filled with "Unnamed: column number" ** -** Blank cells become NaN **. -Formula error is displayed as #NUM !.


### ■ Attributes of each column

Column attributes


Unnamed: 0     object
Column 1 object
Column 2 float64
Column 3 object
Column 4 float64
Unnamed: 5     float64
Column 5 object

-Date: object type ・ Numerical value: float64 type └ Both integers and decimals └ NaN is ignored -Column with function error: object type -Empty column: float64 type -Text: object type └ If there is one text cell, it will be an object type


### ■ Output reading result

When output as a csv file with utf8.

image.png

-Headings automatically inserted in the 1st row and 1st column remain. ・ NaN becomes a blank line


## 3. Blank rows / columns / cells of the original file

Blanks are treated as "NaN" (empty data). The following are also treated as NaN.


## 4. Header

The default when reading is "analog".

▼ Original file image.png

▼ Reading result image.png

read csv file


import pandas as pd

df = pd.read_csv('~/desktop/test.csv')
df

└ Read and display the test.csv file on the desktop.


### ① Read a file without a header Optionally specify that there is no header. `header=None`

** ▼ Original file ** ("Desktop test2.csv") image.png

** ▼ Read file ** pd.read_csv('~/desktop/test2.csv' ,header=None) image.png

python


import pandas as pd
df = pd.read_csv('~/desktop/test2.csv' ,header=None)
df

** ▼ If not specified ** `df2 = pd.read_csv('~/desktop/test2.csv')` ![image.png](https://qiita-image-store.s3.ap-northeast-1.amazonaws.com/0/563526/fb3d7fa6-e1cc-8afc-cca6-d1a49ab21179.png)
### ② Specify the line that will be the header * Above the specified line is not read.

** ▼ When a line to be the header is specified **

Optional header = integer

import pandas as pd
df = pd.read_csv('~/desktop/test.csv' ,header=6)
df

image.png


** ▼ If not specified ** ![image.png](https://qiita-image-store.s3.ap-northeast-1.amazonaws.com/0/563526/73ce0f60-b776-d8f6-9b71-7f38cd2f7874.png)
### ③ Read by specifying the header name Describe `names =` as an option. There are two ways to write.

(1) Consecutive character strings (2) list format

** ▼ Point ** -If you already have a header, overwrite it with header = 0. -When the number of specified characters is less than the number of columns to be read: The column title of the other party is blank -If the number of specified characters is larger: The title of the last column is NaN -Different columns cannot be given the same name (error)


### ■ Example (try the execution result with 7 columns of data)

(1) Specify as a continuous character string

** ▼ Example 1: When names = '123345' **

import pandas as pd
df = pd.read_csv('~/desktop/test.csv' ,names='12345')
df

image.png

The first two missing columns are blank.


** ▼ Example 2: When `names ='abcdefghi'` **
import pandas as pd
df = pd.read_csv('~/desktop/test.csv' ,names='abcdefghi')
df

image.png

Many column titles are empty (NaN) columns.


** ▼ Example 3: `names ='aaabbbccc'` Error if duplicated **

python


import pandas as pd
df = pd.read_csv('~/desktop/test.csv' ,names='aaabbbccc')
df

#output
# ValueError: Duplicate names are not allowed.

#### (2) Specify in list format

** ▼ Example 1: When names = ['aaa','bbb','ccc','ddd','eee','fff'] **

python


import pandas as pd
df = pd.read_csv('~/desktop/test.csv' ,names=['aaa','bbb','ccc','ddd','eee','fff'])
df

image.png


** ▼ Example 2: `names = ['aaa','bbb','aaa','ddd']` Duplicate is an error **

python


import pandas as pd
df = pd.read_csv('~/desktop/test.csv' ,names=['aaa','bbb','aaa','ddd'])
df

### Specify a common prefix for header names `prefix ='string', header = None` └ Valid only when header = None (ignored if not) └ A column number is added to the specified character string.  

python


import pandas as pd
df = pd.read_csv('~/desktop/test.csv', prefix="XXX", header=None)
df

image.png


## 5. Specifying the heading column (index) Describe ʻindex_col = integer` in the option. By default, columns with index numbers are added automatically.
```python import pandas as pd df = pd.read_csv('~/desktop/test.csv' ,index_col=0) df ```

image.png


For default (not specified) ![image.png](https://qiita-image-store.s3.ap-northeast-1.amazonaws.com/0/563526/c24e0186-6c71-e4b8-9b26-81fbeaa279f7.png)
## 6. Read columns Can be specified by column number or column name.

① Specify by column number ② Specify by column name


▼ Use the following for the original file ![image.png](https://qiita-image-store.s3.ap-northeast-1.amazonaws.com/0/563526/3046793e-146a-441a-8c88-938a8f3d9f14.png)
import pandas as pd
df = pd.read_csv('~/desktop/test.csv')
df

### ① Specify by column number Describe ʻusecols = []` in the option └ List type └ Described in [] even if the specification is one column

Specify multiple columns


import pandas as pd
df = pd.read_csv('~/desktop/test.csv', usecols=[0,3,6])
df

image.png

** ▼ For 1 column (example 0th column only) ** usecols=[0]

Specify only one column (Example: 0th column only)


import pandas as pd
df = pd.read_csv('~/desktop/test.csv', usecols=[0])
df

image.png

** ▼ Error if not list type **

error


import pandas as pd
df = pd.read_csv('~/desktop/test.csv', usecols=0)
df

#output
# ValueError: 'usecols' must either be list-like of all strings, all unicode, all integers or a callable.

### ② Specify by column name

It is also possible to extract only the specified column name.

▼ Example: ʻusecols = ['column 1','column 4'] ` └ Specify column 1 and column 4.

python


import pandas as pd
df = pd.read_csv('~/desktop/test.csv', usecols=['Column 1','Column 4'])
df

image.png


▼ It is also possible to give a column name when reading and extract by that name.

Example: ・ Header = 0Names ='ABCDEFG' ・ ʻUscols = ['A','C']`

import pandas as pd
df = pd.read_csv('~/desktop/test.csv', header=0, names='ABCDEFG' ,usecols=['A','C'])
df

image.png


## 7. Read line

① Specify the number of lines to read from the beginning (2) Specify the number of lines to be excluded from the beginning ③ Exclude the specified line ④ Specify the number of lines to exclude from the end


### ① Specify the number of lines to read from the beginning Describe ʻusecols = integer` in the option. Useful when checking the contents when the number of lines is huge.
▼ Example: `nrows = 3` Read up to the third line from the top.

python


import pandas as pd
df = pd.read_csv('~/desktop/test.csv', nrows=3)
df

image.png


### (2) Specify the number of lines to be excluded from the beginning Describe `skiprows = integer` in the option.

▼ Example: skiprows = 6 Skip to the 6th line from the top. If no header is specified, the 6th line will be the header.

python


import pandas as pd
df = pd.read_csv('~/desktop/test.csv', skiprows=6)
df

image.png


### ③ Exclude the specified line Describe `skiprows = [integer]` in the option.

▼ Example: skiprows = [2,3,6,7,8] Skip the 2nd, 3rd, 5th, 7th and 8th lines from the top.

python


import pandas as pd
df = pd.read_csv('~/desktop/test.csv', skiprows=[2,3,6,7,8])
df

image.png


### ④ Specify the number of lines to exclude from the end Describe `skipfooter = integer, engine ='python'` in the option.

▼ Example: skipfooter = 6, engine ='python', encoding ='utf_8' Skip the 6th line from the bottom.

python


import pandas as pd
df = pd.read_csv('~/desktop/test.csv', skipfooter=6, engine='python', encoding='utf_8')
df

image.png


▼ When no character code is specified `skipfooter=6, engine='python'` 

python


import pandas as pd
df = pd.read_csv('~/desktop/test.csv', skipfooter=6, engine='python')
df

image.png

Japanese characters are garbled.


▼ When python is not specified `skipfooter=6` 

python


import pandas as pd
df = pd.read_csv('~/desktop/test.csv', skipfooter=6)
df

#output
# <ipython-input-81-77b6fdc5c66e>:2: ParserWarning: Falling back to the 'python' engine 
#because the 'c' engine does not support skipfooter; 
#you can avoid this warning by specifying engine='python'.

An error is displayed. Instructions to write "engine ='python'".


## 8. Specify the type and read Describe `dtype = type` as an option. If it cannot be changed, an error will occur.

▼ Convert to a character string with dtype = str and check the type with .dtypes (dtypes method).

Convert to string


import pandas as pd
df = pd.read_csv('~/desktop/test.csv', dtype=str)

df.dtypes

#output
Unnamed: 0    object
Column 1 object
Column 2 object
Column 3 object
Column 4 object
Unnamed: 5    object
Column 5 object
dtype: object

▼ Default

python


import pandas as pd
df = pd.read_csv('~/desktop/test.csv')

df.dtypes

#output
Unnamed: 0     object
Column 1 object
Column 2 float64
Column 3 object
Column 4 float64
Unnamed: 5     float64
Column 5 object
dtype: object

▼ Convert character string to float (error)

python


import pandas as pd
df = pd.read_csv('~/desktop/test.csv' ,dtype=float)
df.dtypes

#output
# ValueError: could not convert string to float

## 9. Read files on the WEB Files on the WEB can also be read.

pd.read_csv ('URL', encoding ='character code')

** ▼ Read the statistical data of the government's population by prefecture and gender ** ・ Reference page: e-Start

python


import pandas as pd

dfurl = pd.read_csv('https://www.e-stat.go.jp/stat-search/file-download?statInfId=000031524010&fileKind=1', encoding='shift_jis')
dfurl

image.png


### ▼ If no character code is specified (an error will occur)

error


import pandas as pd

dfurl = pd.read_csv('https://www.e-stat.go.jp/stat-search/file-download?statInfId=000031524010&fileKind=1')
dfurl

#output
# UnicodeDecodeError: 'utf-8' codec can't decode byte 0x93 in position 0: invalid start byte

## 10. Read the compressed file Compressed files such as zip can be read without specifying anything. Readable compression formats: ‘gzip’, ‘bz2’, ‘zip’, ‘xz’

Because it reads the compressed format by analogy. └ Default: compression = infer


** ▼ Read the zip file test.zip on the desktop **

python


import pandas as pd
df = pd.read_csv('~/desktop/test.zip')
df

image.png

■ The above is the same as compression ='zip'.

import pandas as pd
df = pd.read_csv('~/desktop/test.zip', compression='zip')
df

▼ An error will occur if the compression format is specified incorrectly.

error


import pandas as pd
df = pd.read_csv('~/desktop/test.zip', compression='gzip')
df

#output
# BadGzipFile: Not a gzipped file (b'PK')

▼ Error even if two or more files are compressed

error


import pandas as pd
df = pd.read_csv('~/desktop/2files.zip')
df

#output
# ValueError: Multiple files found in compressed zip file ['test.csv', 'space.csv']

▼ Error even if two or more files are compressed

error


import pandas as pd
df = pd.read_csv('~/desktop/2files.zip')
df

#output
# ValueError: Multiple files found in compressed zip file ['test.csv', 'space.csv']

▼ Error even if PW is set

error


import pandas as pd
df = pd.read_csv('~/desktop/test.zip')
df

#output
# RuntimeError: File 'test.csv' is encrypted, password required for extraction

## 11. Read by specifying the delimiter `sep ='delimiter'` └ The same applies to "delimiter ='delimiter'".

** ▼ Example: File to read ** There are multiple data in one cell. └ Data separated by "@" └ Data separated by ";" image.png


** ▼ Default loading **

python


import pandas as pd

df = pd.read_csv('~/desktop/test2.csv')
df

image.png


** ▼ `sep ='@'` ** Separated by "@"

「@Separated by "(sep)


import pandas as pd

df = pd.read_csv('~/desktop/test2.csv', sep='@')
df

image.png

** ▼ delimita ='@' ** Separated by "@"

「@Delimiter


import pandas as pd

df = pd.read_csv('~/desktop/test2.csv', delimiter='@')
df

image.png


** ▼ `sep =';'` ** Separated by ";"

「;Separated by "(sep)


import pandas as pd

df = pd.read_csv('~/desktop/test2.csv', sep=';')
df

image.png


** ▼ Options cannot be repeated. ** **

error


import pandas as pd

df = pd.read_csv('~/desktop/test2.csv', sep=';', sep='@')
df

#output
# SyntaxError: keyword argument repeated

** ▼ 2 characters cannot be specified (list cannot be used) **

error


import pandas as pd

df = pd.read_csv('~/desktop/test2.csv', sep=[';','@'])
df

#output
# TypeError: unhashable type: 'list'

** ▼ 2 delimiter and sep cannot be used together. ** ** └ Priority is given to the delimiter.
image.png

[Return to top](List of main options of pandas read_csv to master reading #pythoncsv file)

Recommended Posts

[Python] Master the reading of csv files. List of main options for pandas.read_csv.
[Python] Reading CSV files
[Introduction for beginners] Reading and writing Python CSV files
[Python: UnicodeDecodeError] One of the error solutions when reading CSV
About the basics list of Python basics
python note: map -do the same for each element of the list
Python Master RTA for the time being
Pandas of the beginner, by the beginner, for the beginner [Python]
Python> sys.path> List of strings indicating the path to search for modules
Search for synonyms from the word list (csv) using Python Japanese WordNet
I measured the speed of list comprehension, for and while with python2.7.
The fastest way for beginners to master Python
[python] Check the elements of the list all, any
The story of low learning costs for Python
[Python] Sort the list of pathlib.Path in natural sort
[Python] Get the main topics of Yahoo News
Make a copy of the list in Python
Reading and writing CSV and JSON files in Python
Character code for reading and writing csv files with python ~ windows environment ver ~
Image processing? The story of starting Python for
[Python] Scan the inside of the folder including subfolders → Export the file list to CSV
The story of reading HSPICE data in Python
Python script to get a list of input examples for the AtCoder contest
Code for checking the operation of Python Matplotlib
Example of reading and writing CSV with Python
How to get a list of files in the same directory with python
List of Python libraries for data scientists and data engineers
Find out the location of Python class definition files.
List of sample program distribution sites for python books
[python] Get the list of classes defined in the module
[Python] Get the list of ExifTags names of Pillow library
[Python] Outputs all combinations of elements in the list
Python: Get a list of methods for an object
[Maya Python] Crush the contents of the script 2 ~ list Notes
List of python modules
the zen of Python
Try to get the function list of Python> os package
Check the operation of Python for .NET in each environment
[Maya Python] Crush the contents of the script 3 ~ List unknown Plugins
[Note] List of basic commands for building python / conda environment
Consideration for Python decorators of the type that passes variables
Get the number of occurrences for each element in the list
[Python] The biggest weakness / disadvantage of Google Colaboratory [For beginners]
Google search for the last line of the file in Python
2015-11-26 python> Display the function list of the module> import math> dir (math)
[Python] Easy reading of serial number image files with OpenCV
python memo: enumerate () -get index and element of list at the same time and turn for statement
Python list, for statement, dictionary
Summary of Python3 list operations
About the ease of Python
Reading and writing csv files
Reading .txt files with Python
[Python] Copy of multidimensional list
[Python / PyQ] 4. list, for statement
Python #list for super beginners
About the features of Python
The Power of Pandas: Python
A memorandum of understanding for the Python package management tool ez_setup
How to change the log level of Azure SDK for Python
Wrap (part of) the AtCoder Library in Cython for use in Python
[Python] Organize the basic structure of Flask apps (Aim for de-copying)