Introduction

I have had several occasions to use pandas for in-house development tasks, but not often enough to remember how everything works, so I frequently end up wondering how to do something in the first place. Also, because this work often involves several GB of data, I will add techniques for handling such cases from time to time. The article is therefore organized as "what you want to do ⇒ how to do it".

Note that this is not about doing hardcore machine learning, so that area is not covered here.

Loading CSV

Nothing can be done until the CSV has been read, so let's start with how to read it. The basic form is as follows.


import pandas as pd
df = pd.read_csv("file name")

Reading large files

However, once the file size reaches the GB range, there is a good chance that the data will not fit in memory. In that case, add the chunksize option and load the file in pieces.

If you specify chunksize, read_csv returns a TextFileReader instance instead of a DataFrame. Iterating over the TextFileReader in a loop yields one DataFrame per chunk. The following example reads 50 rows at a time and prints each chunk.


reader = pd.read_csv("test.csv", chunksize=50)
# Each iteration yields a DataFrame of at most 50 rows
for chunk in reader:
    print(chunk)
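
Printing every chunk is rarely the goal with a large file. The following is a minimal sketch, under the assumption that you only need an aggregate, of how each chunk can be reduced so that only one chunk is in memory at a time. The in-memory CSV and the column names `id` and `value` are hypothetical stand-ins for a real file.

```python
import io

import pandas as pd

# Hypothetical in-memory CSV standing in for a multi-GB file on disk.
csv_text = "id,value\n" + "\n".join(f"{i},{i * 2}" for i in range(120))

total = 0
row_count = 0
# With chunksize set, read_csv yields DataFrames of at most 50 rows each.
for chunk in pd.read_csv(io.StringIO(csv_text), chunksize=50):
    total += chunk["value"].sum()  # reduce the chunk, then let it go
    row_count += len(chunk)

print(row_count, total)
```

Only the running totals survive between iterations, so peak memory depends on the chunk size rather than on the file size.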

Options

Option name	Meaning	Example
encoding	Character encoding of the file	encoding='UTF-8'
skiprows	Number of leading lines to skip (or a list of line numbers to skip)	skiprows=2
chunksize	Read the file in chunks of the given number of rows	chunksize=50
usecols	Read only the specified columns	usecols=[1, 3]
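
As a rough illustration of how these options combine, the sketch below reads a small, made-up CSV from memory. The contents, the two junk lines at the top, and the column names are all assumptions for the example, not data from a real file.

```python
import io

import pandas as pd

# Hypothetical file contents: two junk lines, then a header and two data rows.
raw = (
    "junk line 1\n"
    "junk line 2\n"
    "name,age,city,price\n"
    "Aoki,31,Tokyo,100\n"
    "Baba,45,Osaka,250\n"
)

df = pd.read_csv(
    io.BytesIO(raw.encode("utf-8")),  # bytes, so that encoding is actually applied
    encoding="UTF-8",
    skiprows=2,       # skip the two junk lines before the header
    usecols=[1, 3],   # keep only the 2nd and 4th columns (age, price)
)

print(df)
```

Reading only the columns you need with usecols is also an easy way to cut memory use on large files.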

[Other options](https://own-search-and-study.xyz/2015/09/03/pandas%E3%81%AEread_csv%E3%81%AE%E5%85%A8%E5%BC%95%E6%95%B0%E3%82%92%E4%BD%BF%E3%81%84%E3%81%93%E3%81%AA%E3%81%99/)

Data concatenation

How to read separate files and concatenate them into one DataFrame.


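import numpy as np
import pandas as pd

# Read file 1 (file1 and file2 are CSV file paths)
Data1 = pd.read_csv(file1, dtype=np.float32)
# Read file 2
Data2 = pd.read_csv(file2, dtype=np.float32)
# Concatenate the two DataFrames, renumbering the index from 0
rawData = pd.concat([Data1, Data2], ignore_index=True)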
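
To see what ignore_index=True does, here is a small self-contained sketch. The two in-memory "files" and their columns `x` and `y` are hypothetical, and it assumes both files share the same columns.

```python
import io

import numpy as np
import pandas as pd

# Hypothetical stand-ins for two CSV files with the same columns.
file1 = io.StringIO("x,y\n1.0,2.0\n3.0,4.0\n")
file2 = io.StringIO("x,y\n5.0,6.0\n")

data1 = pd.read_csv(file1, dtype=np.float32)
data2 = pd.read_csv(file2, dtype=np.float32)

# Without ignore_index=True the row labels would be 0, 1, 0.
raw_data = pd.concat([data1, data2], ignore_index=True)

print(raw_data.index.tolist())
print(raw_data.dtypes)
```

Reading as float32 instead of the default float64 halves the memory used by numeric columns, at the cost of precision.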

Get a specific column


df['A']

Extract rows for a specific section


df[1:3]

Note that row positions start from 0, so in this case the first row is not included. Also, the end position 3 is exclusive, so df[1:3] returns the rows at positions 1 and 2.
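
The selection rules above can be checked with a small sketch. The DataFrame and its columns `A` and `B` are made up for illustration.

```python
import pandas as pd

df = pd.DataFrame({"A": [10, 20, 30, 40], "B": ["w", "x", "y", "z"]})

col = df["A"]          # a single label returns a Series
sub = df[["A", "B"]]   # a list of labels returns a DataFrame
rows = df[1:3]         # positions 1 and 2; position 3 is excluded

print(rows)
```

If you prefer to make the positional intent explicit, df.iloc[1:3] selects the same rows.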

Delete unnecessary columns

Specifying inplace=True modifies the original DataFrame. In the following example, the City and Price columns disappear from df.


df.drop(columns=['City', 'Price'], inplace=True)

Columns can also be specified by position, by converting the positions to labels through df.columns:


df.drop(df.columns[[1, 2]], axis=1, inplace=True)
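
The sketch below compares the two ways of dropping columns on a made-up DataFrame (the `Name` and `Qty` columns are hypothetical additions). It deliberately omits inplace=True so that the original df stays available for comparison.

```python
import pandas as pd

df = pd.DataFrame({
    "Name": ["a", "b"],
    "City": ["Tokyo", "Osaka"],
    "Price": [100, 250],
    "Qty": [1, 2],
})

# Without inplace=True, drop returns a new DataFrame and leaves df untouched.
by_name = df.drop(columns=["City", "Price"])

# df.columns[[1, 2]] turns positions 1 and 2 into the labels City and Price.
by_position = df.drop(columns=df.columns[[1, 2]])

print(list(by_name.columns), list(by_position.columns))
```

Returning a new DataFrame and assigning it is often easier to follow than inplace=True, although which one to use is a matter of taste.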

Delete row


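df = df.drop(df.index[[1, 3, 5]])

By default the row index is a sequence of numbers starting from 0, but note that it is no longer sequential after sorting and similar operations. Going through df.index[...] as above selects the rows by position, whatever their labels are.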
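
Here is a minimal sketch of that situation, assuming a made-up `score` column: after sorting, the index labels are out of order, and df.index[...] is used to translate positions into labels before dropping.

```python
import pandas as pd

df = pd.DataFrame({"score": [30, 10, 50, 20, 40, 60]})

# After sorting, the index labels are no longer 0, 1, 2, ...
sorted_df = df.sort_values("score")
print(sorted_df.index.tolist())

# Drop the rows at positions 1, 3 and 5 of the sorted frame.
trimmed = sorted_df.drop(sorted_df.index[[1, 3, 5]])
print(trimmed)
```

This relies on the index labels being unique; with duplicate labels, drop removes every row that carries a matching label.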

Replacing values in a specific column


df = df.replace({'column_name': {original_value: new_value}})
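
As a concrete instance of the nested-dictionary form, the sketch below replaces a value in one column only. The `city` and `branch` columns and the value "TYO" are hypothetical.

```python
import pandas as pd

df = pd.DataFrame({
    "city": ["Tokyo", "Osaka", "Tokyo"],
    "branch": ["Tokyo", "Kyoto", "Osaka"],
})

# Outer key: column name. Inner dict: original value -> value after replacement.
replaced = df.replace({"city": {"Tokyo": "TYO"}})

print(replaced)
```

Only the city column changes; the "Tokyo" in the branch column is left alone because that column is not named in the outer dictionary.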

Splitting data by the value of a column

Use this when the processing should differ depending on the value of a column. groupby returns a GroupBy object, which can be iterated over directly in a for statement.


for value, group_df in df.groupby('column_name'):
    ...  # processing for each group

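In addition, each group keeps the scattered index values of the original DataFrame, so use reset_index() to renumber the index within each group.

Note that if drop=True is not specified, the old index is moved into a data column. reset_index() also returns a new DataFrame, so assign the result back.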

for city, sdf in df.groupby('city'):
    sdf = sdf.reset_index(drop=True)
    if city == 'Tokyo':
        flags = 1
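
Putting the pieces together, here is a small runnable sketch of the loop. The data, the `price` column, and the per-group processing (a simple sum) are assumptions chosen only to make the example self-contained.

```python
import pandas as pd

df = pd.DataFrame({
    "city": ["Tokyo", "Osaka", "Tokyo", "Osaka", "Nagoya"],
    "price": [100, 250, 300, 50, 80],
})

totals = {}
indexes = {}
for city, sdf in df.groupby("city"):
    # Each group arrives with its original row labels (e.g. 0 and 2 for Tokyo).
    sdf = sdf.reset_index(drop=True)
    indexes[city] = sdf.index.tolist()
    totals[city] = int(sdf["price"].sum())

print(totals)
```

For a plain aggregate such as this sum, df.groupby("city")["price"].sum() does the same job without a loop; the explicit loop is mainly useful when each group needs different processing.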

Count the number of records

shape is a (number of rows, number of columns) tuple, so shape[0] returns the number of rows, that is, the number of records. Without [0], both the row count and the column count are returned.


df.shape[0]
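
A quick check of what shape contains, using a made-up DataFrame:

```python
import pandas as pd

df = pd.DataFrame({"a": [1, 2, 3], "b": [4, 5, 6]})

n_rows, n_cols = df.shape  # shape is a (rows, columns) tuple

print(n_rows, n_cols, len(df))
```

len(df) gives the same row count as df.shape[0].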

Average


means = df.mean(axis = 0)

Standard deviation


std = df.std(axis = 0)
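
The sketch below computes both statistics on a made-up DataFrame. The columns are hypothetical, and numeric_only=True is added here as an assumption that the frame may contain non-numeric columns, which should be left out of the calculation.

```python
import pandas as pd

df = pd.DataFrame({
    "height": [1.0, 2.0, 3.0, 4.0],
    "weight": [10.0, 10.0, 20.0, 20.0],
    "name": ["a", "b", "c", "d"],
})

# axis=0 aggregates down the rows, giving one value per column.
means = df.mean(axis=0, numeric_only=True)
std = df.std(axis=0, numeric_only=True)             # sample std (ddof=1) by default
pop_std = df.std(axis=0, ddof=0, numeric_only=True)  # population std

print(means)
print(std)
```

Note that pandas uses the sample standard deviation (ddof=1) by default, whereas NumPy's np.std defaults to ddof=0, so the two can give different numbers for the same data.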

Save csv

To save as csv after processing the dataframe:


df.to_csv("file name")

Option name	Meaning	Example
columns	Export only the specified columns	columns=['age']
header	Whether to write the header row	header=False
index	Whether to write the index	index=False
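
To see the effect of these options without touching the disk, the following sketch calls to_csv with no path, in which case it returns the CSV text as a string. The DataFrame is made up for illustration.

```python
import io

import pandas as pd

df = pd.DataFrame({"name": ["a", "b"], "age": [31, 45]})

# With no path given, to_csv returns the CSV text instead of writing a file.
with_index = df.to_csv()
without_index = df.to_csv(index=False)
only_age = df.to_csv(columns=["age"], header=False, index=False)

print(with_index)
print(without_index)
print(only_age)

# Reading the index-free output back gives the same columns as the original.
restored = pd.read_csv(io.StringIO(without_index))
```

Writing with index=False is usually what you want when the file will be read back with read_csv, since otherwise the old index comes back as an extra unnamed column.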

Export / append to a CSV file with pandas (to_csv)


[PYTHON] Data manipulation with Pandas!