This post summarizes the syntax that is often used with pandas, Python's data analysis library.
2019-02-18 Updated the display enlargement method
2018-05-06 Reflect comments (pd.set_option('display.width', 100))
2018-02-14 Link correction
2017-11-01 Corrected the description of df.fillna(method='ffill')
2017-06-09 Correction of broken links, etc.
2016-10-10 Editing examples
2016-06-21 Added df.rolling, pd.date_range, pd.datetime, df.pivot, and other examples
Many options are available, so you can also load files other than plain csv (for example, space-delimited text).
import pandas as pd
df = pd.read_csv('some.csv')
Example: Reading multiple columns (date and hour) together as a datetime index (date_hour)
df = pd.read_csv('some.csv', parse_dates={'date_hour':['date', 'hour']}, index_col='date_hour')
Example: Reading a file that contains Japanese text and treating -- as missing data
df = pd.read_csv('some.csv', encoding='Shift_JIS', na_values='--')
--List of options
Option | Meaning |
---|---|
index_col | Column name to use as the index |
parse_dates | Column name(s) (list or dictionary) to be read as datetime type |
date_parser | Custom function for parsing the columns specified by parse_dates |
na_values | Character string (or list of strings) to treat as missing values |
encoding | File encoding, e.g. 'Shift_JIS' |
sep | Delimiter (' ' for space-delimited files) |
--Reference: http://pandas.pydata.org/pandas-docs/stable/generated/pandas.io.parsers.read_csv.html
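The options above can be tried without a file on disk. The following is a minimal sketch, assuming a small made-up csv with date, hour, and temp columns (the data and the StringIO stand-in for a file are illustrative, not from the original post). Instead of the dictionary form of parse_dates, it combines the two columns with pd.to_datetime after loading, which is one alternative that does not depend on how a given pandas version handles combined date columns.

```python
import io

import pandas as pd

# Hypothetical csv content standing in for 'some.csv'.
csv_text = """date,hour,temp
2012-01-01,00:00,1.5
2012-01-01,01:00,--
2012-01-01,02:00,2.5
"""

# na_values='--' turns the '--' marker into NaN while reading.
df = pd.read_csv(io.StringIO(csv_text), na_values='--')

# Combine the date and hour columns into one datetime index named date_hour.
df['date_hour'] = pd.to_datetime(df['date'] + ' ' + df['hour'])
df = df.drop(columns=['date', 'hour']).set_index('date_hour')

print(df)
```

The temp column ends up as a float column with one missing value, and the index is a DatetimeIndex.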
Writing to a csv file is very easy, and many options are available.
df.to_csv('some2.csv')
Example: If you don't need the index
df.to_csv('some2.csv', index=False)
Example: When naming the index
df.to_csv('some2.csv', index_label='date')
--Reference: http://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.to_csv.html
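To see what these options change, it can help to look at the text that to_csv produces. This is a small sketch with a made-up two-column frame; calling to_csv without a path returns the csv as a string, so nothing is written to disk.

```python
import pandas as pd

# Illustrative data, not from the original post.
df = pd.DataFrame({'a': [0, 1], 'b': [2, 3]})

with_index = df.to_csv()                  # index written, header cell left empty
without_index = df.to_csv(index=False)    # index column omitted
labeled = df.to_csv(index_label='date')   # index column gets the header 'date'

print(with_index.splitlines()[0])     # ,a,b
print(without_index.splitlines()[0])  # a,b
print(labeled.splitlines()[0])        # date,a,b
```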
Delete rows (index entries) that contain a missing value (NaN)
df = df.dropna()
Example: Delete a row only if a particular column (temp or depth) has a missing value
(Missing values in columns other than temp and depth are ignored)
df = df.dropna(subset=['temp','depth'])
Example: Fill in the blanks with a constant (0)
df = df.fillna(0)
Example: Fill gaps forward (or backward) (corrected on 2017/11/01 to reflect @hadacchi's comment)
df = df.fillna(method='ffill') # forward fill: nan 1.0 nan -> nan 1.0 1.0 (forward = direction of increasing index = downward in the DataFrame)
df = df.fillna(method='bfill') # backward fill: nan 1.0 nan -> 1.0 1.0 nan (backward = direction of decreasing index = upward in the DataFrame)
# In recent pandas versions the method argument of fillna is deprecated; df.ffill() and df.bfill() do the same thing.
--Reference: https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.fillna.html
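The difference between these calls is easiest to see on a tiny example. The sketch below uses made-up temp, depth, and note columns, and uses ffill() and bfill(), which behave like the method='ffill' and method='bfill' forms.

```python
import numpy as np
import pandas as pd

# Illustrative data: every row has a missing value somewhere.
df = pd.DataFrame({
    'temp': [np.nan, 1.0, 2.0],
    'depth': [10.0, np.nan, 30.0],
    'note': [None, 'a', None],
})

all_dropped = df.dropna()                            # no row survives
subset_dropped = df.dropna(subset=['temp', 'depth']) # only row 2 survives
zero_filled = df['temp'].fillna(0)                   # 0.0, 1.0, 2.0

s = pd.Series([np.nan, 1.0, np.nan])
forward = s.ffill()   # nan, 1.0, 1.0 (values are carried downward)
backward = s.bfill()  # 1.0, 1.0, nan (values are carried upward)
```

Note that forward fill cannot fill a gap at the very start, and backward fill cannot fill one at the very end.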
Many interpolation methods are available.
df = df.interpolate(method='index')
The method argument accepts linear (the default), time, index, values, nearest, zero, slinear, quadratic, cubic, barycentric, krogh, polynomial, spline, piecewise_polynomial, and pchip.
--Reference: https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.interpolate.html
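As a rough illustration of why the choice of method matters, the sketch below compares linear and index on a made-up series whose index is unevenly spaced. With linear, the points are treated as equally spaced; with index, the index values are used as the x coordinates.

```python
import numpy as np
import pandas as pd

# Illustrative series: the gap sits at index 1, between index 0 and index 10.
s = pd.Series([0.0, np.nan, 10.0], index=[0, 1, 10])

by_position = s.interpolate(method='linear')  # ignores the index spacing
by_index = s.interpolate(method='index')      # weights by the index values

print(by_position.loc[1])  # 5.0
print(by_index.loc[1])     # 1.0
```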
Change the resolution (frequency) of time series data. I mostly use it to reduce the number of data points, so I think of it as a downsampling function.
Example: Convert hourly data to daily average (ignore missing values and average)
# daily = hourly.resample('D', how='mean')  <- old syntax (the how argument has since been removed)
daily = hourly.resample('D').mean()
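Example: If the value is labeled at 00:00, but you want it labeled at 12:00
# daily = hourly.resample('D', loffset='12H').mean()  <- older pandas only (loffset was removed in pandas 2.0)
daily = hourly.resample('D').mean()
daily.index = daily.index + pd.Timedelta(hours=12)  # shift the labels by 12 hours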
Character | Meaning | Remarks |
---|---|---|
M | Month | 0.5M does not mean half a month |
D | Day | I make do with something like 15D instead |
H | Hour | 12H means 12 hours |
T or min | Minute | 30min means 30 minutes |
--Reference: http://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.resample.html
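A small end-to-end sketch of the daily-mean conversion follows, assuming two days of made-up hourly values. The frequency is written as '60min' only to sidestep differences in how pandas versions spell the hourly alias, and the 12-hour label shift is done by adding a Timedelta to the resulting index.

```python
import pandas as pd

# Illustrative hourly series: values 0..47 over two days.
index = pd.date_range('2012-01-01', periods=48, freq='60min')
hourly = pd.Series(range(48), index=index, dtype=float)

# Daily mean; each result is labeled at 00:00 of its day.
daily = hourly.resample('D').mean()

# Move the labels from 00:00 to 12:00.
daily.index = daily.index + pd.Timedelta(hours=12)

print(daily)
```

The two daily values are 11.5 and 35.5, labeled 2012-01-01 12:00 and 2012-01-02 12:00.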
Compute an n-term moving average, moving maximum, and so on.
Example: 3-term moving average (center=True places the result at the center of the window, i.e. at the 2nd term here)
ma3 = hourly.rolling(3, center=True).mean()
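The effect of center=True is clearer when compared with the default, which places each result at the last term of its window. A minimal sketch on a made-up five-element series:

```python
import pandas as pd

s = pd.Series([1.0, 2.0, 3.0, 4.0, 5.0])

trailing = s.rolling(3).mean()               # nan, nan, 2.0, 3.0, 4.0
centered = s.rolling(3, center=True).mean()  # nan, 2.0, 3.0, 4.0, nan
rolling_max = s.rolling(3).max()             # nan, nan, 3.0, 4.0, 5.0
```

Positions where the window is incomplete are NaN, so the centered version loses one value at each end instead of two at the start.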
pd.date_range
Easily create a sequence of consecutive datetime values
date = pd.date_range('2012-1-1', '2012-1-2', freq='D')
Example: The result above holds pandas Timestamps; to convert it back to Python datetime objects
date = pd.date_range('2012-1-1', '2012-1-2', freq='D').to_pydatetime()
pd.datetime
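import datetime is not needed (older pandas only: pd.datetime was removed in pandas 2.0, where pd.Timestamp(2012, 1, 1, 0, 0, 0) can be used in a similar way)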
date = pd.datetime(2012, 1, 1, 0, 0, 0)
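The following sketch puts the date helpers side by side. It uses pd.Timestamp rather than pd.datetime so that it also runs on recent pandas versions; that substitution is an assumption about what the reader has installed, not part of the original example.

```python
import datetime

import pandas as pd

# Two consecutive days as a DatetimeIndex of pandas Timestamps.
dates = pd.date_range('2012-1-1', '2012-1-2', freq='D')

# The same values as plain Python datetime objects (a NumPy object array).
py_dates = dates.to_pydatetime()

# A single point in time without importing datetime.
stamp = pd.Timestamp(2012, 1, 1, 0, 0, 0)

print(len(dates))          # 2
print(type(py_dates[0]))   # <class 'datetime.datetime'>
print(stamp == dates[0])   # True
```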
Take a quick look at the count, mean, standard deviation, and other statistics for each column.
print(df.describe())
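The result of describe is itself a DataFrame, so individual statistics can be picked out by label. A minimal sketch on a made-up numeric frame:

```python
import pandas as pd

df = pd.DataFrame({'a': [0, 1], 'b': [2, 3]})

stats = df.describe()

# Rows are the statistics, columns are the original columns.
print(list(stats.index))       # ['count', 'mean', 'std', 'min', '25%', '50%', '75%', 'max']
print(stats.loc['mean', 'a'])  # 0.5
```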
Grouping by column makes data easy to handle (the values of the grouping columns become the index of the result). It is easier to understand than stack and unstack.
Example: Group by multiple columns (type and time) and take the average of each group.
grouped_mean = df.groupby(['type','time']).mean()
--Reference: http://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.groupby.html
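Here is a small sketch of what the grouped result looks like, using made-up type, time, and value columns. Grouping by two columns yields a two-level index, which can be addressed with a tuple or flattened back into ordinary columns with reset_index.

```python
import pandas as pd

# Illustrative data: two rows share the group ('a', 1).
df = pd.DataFrame({
    'type': ['a', 'a', 'b', 'b'],
    'time': [1, 1, 1, 2],
    'value': [1.0, 3.0, 5.0, 7.0],
})

grouped_mean = df.groupby(['type', 'time']).mean()

print(grouped_mean.loc[('a', 1), 'value'])  # 2.0

# Turn the (type, time) index back into regular columns.
flat = grouped_mean.reset_index()
```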
df.pivot(1D→2D)
Think of it as reshaping a one-dimensional (long) table into a two-dimensional grid
temp2d = df.pivot(index='y', columns='x', values='temp')
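A minimal sketch of the reshaping, assuming a made-up long table with one row per (x, y) point. After the pivot, the rows are the y values, the columns are the x values, and each cell holds the matching temp.

```python
import pandas as pd

# Illustrative long-format data: one temperature per (x, y) pair.
df = pd.DataFrame({
    'x': [0, 1, 0, 1],
    'y': [0, 0, 1, 1],
    'temp': [10.0, 11.0, 12.0, 13.0],
})

temp2d = df.pivot(index='y', columns='x', values='temp')

print(temp2d.shape)      # (2, 2)
print(temp2d.loc[1, 0])  # 12.0, the temp at y=1, x=0
```

Each (index, columns) pair has to be unique for pivot to work; duplicated pairs raise an error.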
If you want to create a DataFrame from something other than a file, create a dictionary and load it.
data = {'a':[0, 1], 'b':[2, 3]}
df = pd.DataFrame(data)
Example: Adding an index to the example above
date = pd.date_range('2012-1-1', '2012-1-2', freq='D')
df = pd.DataFrame(data, index=date)
If you don't want the display to wrap when printing a data frame with a large number of columns, you can change the display width.
pd.set_option('display.width', 100)
or
pd.set_option('display.max_columns', 100)  # controls the limit by number of columns instead
# pd.set_option('line_width', 100)  # line_width is deprecated or removed (2018/05/06, thanks to @dhwty)
print(df)  # does not wrap up to 100 characters (or columns)
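If the wider display is only needed for one print, the same options can be applied temporarily with pd.option_context, which restores the previous settings when the block ends. A minimal sketch with a made-up 30-column frame:

```python
import pandas as pd

# Illustrative wide frame: 30 columns named c0..c29.
wide = pd.DataFrame({f'c{i}': [i] for i in range(30)})

# Settings apply only inside the with block.
with pd.option_context('display.width', 100, 'display.max_columns', 100):
    width_inside = pd.get_option('display.width')
    max_columns_inside = pd.get_option('display.max_columns')
    print(wide)

print(width_inside)  # 100
```

This avoids changing the display settings for the rest of the session, which pd.set_option does.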