[Python] [pandas_flavor] Add a method to a pandas DataFrame

TL;DR

BEFORE


dataframe_ = dataframe.loc[(dataframe.time == 'pre') &
                           (dataframe.group == 'exp') &
                           (dataframe.cond == 'a'), :]
sns.regplot(x='mood', y='score', data=dataframe_)

↓↓↓

AFTER


dataframe.by(time='pre', group='exp', cond='a').regplot(x='mood', y='score')

You can add your favorite methods to pandas DataFrame (and Series) by using pandas_flavor.

Motivation

**Extracting the rows that meet a set of conditions from long-format data is tedious!**

For example, suppose you have this data.

[Image: Screenshot 2020-11-20 10.46.29.png]

The setup is as follows: 50 subjects were divided into two groups (group: exp, ctrl), and each group received some intervention. The task was performed before and after the intervention (time: pre, post), and the score was measured under two conditions (cond: a, b) during the task. The mood while doing the task was also measured for each condition (cond: a, b). [^1]

If the measurement data is summarized in long format as shown in the image above, subsequent analysis will be easier.
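To follow along without the original data, a table with the same shape can be generated. This is only a sketch: the column names time, group, cond, score, and mood come from the description above, while the subject column, the helper name make_fake_data, the even 25/25 group split, the seed, and the Gaussian parameters are assumptions made for illustration.

```python
import random

import pandas as pd


def make_fake_data(n_subjects=50, seed=0):
    """Build a long-format table: one row per subject x time x cond."""
    rng = random.Random(seed)  # seeded so the sketch is reproducible
    rows = []
    for subject in range(n_subjects):
        # Assumption: the first half is the exp group, the rest is ctrl.
        group = 'exp' if subject < n_subjects // 2 else 'ctrl'
        for time in ('pre', 'post'):
            for cond in ('a', 'b'):
                rows.append({
                    'subject': subject,
                    'group': group,
                    'time': time,
                    'cond': cond,
                    'score': rng.gauss(50, 10),  # made-up distribution
                    'mood': rng.gauss(5, 2),     # made-up distribution
                })
    return pd.DataFrame(rows)


dataframe = make_fake_data()
print(dataframe.head())
```

Each subject contributes four rows (two time points times two conditions), which is what makes the format "long".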

Before doing any real analysis, **let's plot the correlation between score and mood for task condition a of the exp group at pre**.

This means extracting the rows that meet the above conditions, so the code looks like this.

import seaborn as sns

dataframe_ = dataframe.loc[(dataframe.time == 'pre') &
                           (dataframe.group == 'exp') &
                           (dataframe.cond == 'a'), :]
sns.regplot(x='mood', y='score', data=dataframe_)

This builds a boolean Series that expresses the conditions and passes it to .loc. It works, but it is rather messy.

With the .query() method, you can write it like this.

dataframe_ = dataframe.query('time == "pre" & group == "exp" & cond == "a"')
sns.regplot(x='mood', y='score', data=dataframe_)

This is a lot cleaner, but it could still be nicer. Also, .query() seems to be slower than the boolean-Series approach. Either way, **extracting the rows that meet a set of conditions from long-format data is tedious!**
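The speed remark is easy to check on your own data. The sketch below uses a small made-up table (its contents, the helper names with_mask and with_query, and the repetition count are all assumptions), confirms that both approaches select the same rows, and times them with the standard timeit module. No timings are quoted here because they depend on the data size and the pandas version.

```python
import timeit

import pandas as pd

# Made-up long-format data, just large enough to time.
dataframe = pd.DataFrame({
    'time': ['pre', 'post'] * 2000,
    'group': ['exp', 'exp', 'ctrl', 'ctrl'] * 1000,
    'cond': ['a'] * 2000 + ['b'] * 2000,
    'score': range(4000),
})


def with_mask(df):
    """Select rows with a boolean Series passed to .loc."""
    mask = (df.time == 'pre') & (df.group == 'exp') & (df.cond == 'a')
    return df.loc[mask, :]


def with_query(df):
    """Select the same rows with .query()."""
    return df.query('time == "pre" & group == "exp" & cond == "a"')


# Both approaches must select exactly the same rows.
assert with_mask(dataframe).equals(with_query(dataframe))

mask_seconds = timeit.timeit(lambda: with_mask(dataframe), number=20)
query_seconds = timeit.timeit(lambda: with_query(dataframe), number=20)
print(f'boolean mask: {mask_seconds:.4f} s, query: {query_seconds:.4f} s')
```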

Add a Pandas DataFrame method

**Then just create a method**

So let's create a **new method** that extracts the rows meeting the conditions from a DataFrame. The goal is to add a .by() method to DataFrame that can be used like this:

dataframe.by(time='pre', group='exp', cond='a')

You can easily achieve this with a package called pandas_flavor.

Installation method

Install it with pip:

pip install pandas_flavor

or with conda:

conda install -c conda-forge pandas_flavor

Example of use

import pandas_flavor as pf


@pf.register_dataframe_method
def by(df, **args):
    # Keep only the rows where each named column equals the given value.
    for key, value in args.items():
        df = df.loc[df[key] == value, :]
    return df

Just write a function and add @pf.register_dataframe_method as a decorator. In this example, the keyword arguments are received as a dictionary via **args, and the loop keeps only the rows that match each argument in turn.

It is also handy to turn various seaborn functions into methods.

import seaborn as sns


@pf.register_dataframe_method
def regplot(self, **args):
    return sns.regplot(data=self, **args)

[Image: Screenshot 2020-11-20 13.01.55.png]

The result looks like this. To add a method to pandas.Series, do the same with @pf.register_series_method.

For this particular example, .query() would probably do, but the same approach can be applied in many other ways.

[^1]: Needless to say, this is an entirely fictional psychology experiment. The numbers were generated with the random module.
