[PYTHON] Grammar summary often used in pandas

We have summarized the grammars that are often used in Python's data analysis library "pandas".

2019-02-18 Updated the display enlargement method
2018-05-06 Reflect comments (pd.set_option('display.width', 100))
2018-02-14 Link correction
2017-11-01 df.fillna(method='ffill')Corrected the description of
2017-06-09 Correction of broken links, etc.
2016-10-10 Editing examples
2016-06-21 df.rolling, pd.date_range, pd.datetime, df.Add pivot, add other examples

** Handling CSV files **

** pd.read_csv ** (read)

There are many options available, so you can load anything other than csv.

import pandas as pd

df = pd.read_csv('some.csv')

Example: When reading multiple columns (date and hour) together as a datetime type index (date_hour)

df = pd.read_csv('some.csv', parse_dates={'date_hour':['date', 'hour']}, index_col='date_hour')

Example: When a file containing Japanese and -- are treated as missing data

df = pd.read_csv('some.csv', encoding='Shift_JIS', na_values='--')

--List of options

option meaning
index_col Column name to be index
parse_dates Column name (list or dictionary) to be read as datetime type
date_parser parse_Self-made function to read the column specified by dates
na_values Character string (list) to be a missing value
encoding 'Shift_JIS'Such
sep Delimiter(' ': In the case of space)

--Reference: http://pandas.pydata.org/pandas-docs/stable/generated/pandas.io.parsers.read_csv.html

** df.to_csv ** (write)

It's terribly easy to write to a csv file. There are many options.

df.to_csv('some2.csv')

Example: If you don't need index

df.to_csv('some2.csv', index=None)

Example: When naming the index

df.to_csv('some2.csv', index_label='date')

--Reference: http://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.to_csv.html

** Handling of missing values **

** df.dropna ** (deleted)

Delete row (index) with missing value (nan)

df = df.dropna()

Example: If a particular column (temp or depth) has a missing value, delete that row (Ignore missing values other than temp and depth)

df = df.dropna(subset=['temp','depth'])

** df.fillna ** (fill in)

Example: Fill in the blanks with a constant (0)

df = df.fillna(0)

Example: Fill in the front (rear) holes (2017/11/01 Corrected to reflect @ hadacchi's comment)

df = df.fillna(method='ffill')  #forward fill in the blank nan 1.0 nan -> nan 1.0 1.0 (forward  =Index increase direction = DataFrame downward direction)
df = df.fillna(method='bfill')  #backward fill-in-the-blank nan 1.0 nan -> 1.0 1.0 nan (backward =Index decrease direction = DataFrame upward direction)

--Reference: https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.fillna.html

** df.interpolate ** (interpolation)

Many interpolation methods are available.

df = df.interpolate(method='index')

--Reference: https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.interpolate.html

** df.resample ** (resample)

Change the resolution (frequency) of time series data. When I use it, I often reduce the number of data, so I recognize it as a function for compression.

Example: Convert hourly data to daily average (ignore missing values and average)

#daily = hourly.resample('D', how='mean') <-Old way of writing
daily = hourly.resample('D').mean()

Example: If the value is placed at 00:00, but you want it to be 12:00

daily = hourly.resample('D', loffset='12H').mean()

** Time format **

Charactor meaning Remarks
M Month 0.5M で半Monthとはならない
D Day I'm doing something with 15D
H time 12H は12time
T or min Minutes 30min で30Minutes

--Reference: http://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.resample.html

** df.rolling ** (moving average)

Take the n-term moving average, maximum value, etc.

Example: 3-term moving average (center = True and place the value in the center (2nd term in this case))

ma3 = hourly.rolling(3, center=True).mean()

** Handling of time **

pd.date_range

Easily create consecutive datetime variables

date = pd.date_range('2012-1-1', '2012-1-2', freq='D')

Example: Since it is a pandas Timestamp as it is above, when returning to the python datetime variable

date = pd.date_range('2012-1-1', '2012-1-2', freq='D').to_pydatetime()

pd.datetime

ʻImport datetime` not needed

date = pd.datetime(2012, 1, 1, 0, 0, 0)

** Other frequently used functions **

** df.describe ** (summary statistics)

Take a quick look at the number of data, mean, variance, and other statistics for each column.

print df.describe()

** df.groupby ** (aggregate)

Data can be handled easily by grouping by column (converting the value of column so that it can be handled as index). It is easier to understand than stack and unstack.

Example: Group by multiple columns (type and time) and take the average of each group.

grouped_mean = df.groupby(['type','time']).mean()

--Reference: http://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.groupby.html

df.pivot(1D→2D)

Image of turning a one-dimensional array into a two-dimensional array

temp2d = df.pivot(index='y', columns='x', values='temp')

Other tips

DataFrame creation (dictionary type)

If you want to create a DataFrame from something other than a file, create a dictionary and load it.

data = {'a':[0, 1], 'b':[2, 3]}

df = pd.DataFrame(data)

Example: When indexing the above example

date = pd.date_range('2012-1-1', '2012-1-2', freq='D')

df = pd.DataFrame(data, index=date)

Expansion of wrapping width when printing

If you don't want the display to wrap when printing a data frame with a large number of columns, you can change the display width.

pd.set_option('display.width', 100)
or
pd.set_option('display.max_columns', 100)  #Can be controlled by the number of columns

#pd.set_option('line_width', 100)  # line_width is deprecated or abolished (2018/05/06 Thanks to @dhwty)

print df  #Does not wrap up to 100 characters (or column)

Reference link

Recommended Posts

Grammar summary often used in pandas
Summary of methods often used in pandas
Processing memos often used in pandas (beginners)
Summary of what was used in 100 Pandas knocks (# 1 ~ # 32)
Settings often used in Jupyter
Processing summary 2 often done in Pandas (data reference, editing operation)
I tried to summarize the code often used in Pandas
Installation summary often used for AI projects
Summary of frequently used commands in matplotlib
Disk-related commands often used in Ubuntu (memories)
blockdiag grammar summary
Summary of processes often performed in Pandas 1 (CSV, Excel file related operations)
Summary of tools used in Command Line vol.8
Summary of tools used in Command Line vol.5
Summary of evaluation functions used in machine learning
Code often used in Python / Django apps [prefectures]
A collection of Numpy, Pandas Tips that are often used in the field
Pandas Personal Notes Summary
Learn Pandas in 10 minutes
A collection of code often used in personal Python
UnicodeDecodeError in pandas read_csv
Summary of Pandas methods used when extracting data [Python]
A collection of Excel operations often used in Python
I tried to summarize the commands often used in business
Summary of how to write .proto files used in gRPC
pandas Matplotlib Summary by usage
What's new in Python 3.10 (Summary)
Timer used in while loop
python pandas study recent summary
Used in machine learning EDA
Face detection summary in Python
Frequently used commands in virtualenv
Basic usage of Pandas Summary
What's new in Python 3.9 (Summary)
Swap columns in pandas dataframes
Python standard input summary that can be used in competition pro