TL;DR

`BEFORE`


dataframe_ = dataframe.loc[(dataframe.time == 'pre') & \
                           (dataframe.group == 'exp') & \
                           (dataframe.cond == 'a'), :]
sns.regplot(x='mood', y='score', data=dataframe_)

↓↓↓

`AFTER`


dataframe.by(time='pre', group='exp', cond='a').regplot(x='mood', y='score')

You can add your favorite methods to pandas DataFrame (and Series) by using pandas_flavor.

Motivation

**Extracting the rows that meet certain conditions from long-format data is a hassle!**

For example, suppose you have this data.

[Image: screenshot of the example data in long format]

The setup is as follows: 50 subjects were divided into two groups (group: exp, ctrl), and each group received some intervention. The task was performed before and after the intervention (time: pre, post), and the score was measured under two conditions (cond: a, b) during the task. The mood during the task was also measured for each condition (cond: a, b). [^1]

If the measurement data is organized in long format as shown in the image above, subsequent analysis becomes easier.
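If you want to follow along, here is a minimal sketch of how such fake data could be generated. The column names group, time, cond, score and mood come from the description above; the subject column, the even 25/25 split between groups, and the normal distributions are my own assumptions for illustration.

```python
import random

import pandas as pd

random.seed(0)  # fixed seed so the fake data is reproducible

rows = []
for subject in range(50):
    # Assumption: the first 25 subjects are "exp", the remaining 25 are "ctrl"
    group = "exp" if subject < 25 else "ctrl"
    for time in ("pre", "post"):
        for cond in ("a", "b"):
            rows.append({
                "subject": subject,
                "group": group,
                "time": time,
                "cond": cond,
                # Assumption: arbitrary normal distributions for the fake values
                "score": random.gauss(50, 10),
                "mood": random.gauss(5, 2),
            })

# One row per subject x time x cond, i.e. long format
dataframe = pd.DataFrame(rows)
print(dataframe.head())
```

Each subject contributes one row per combination of time and cond, which is what makes this a long-format table.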

Before doing any analysis, **let's first plot the correlation between score and mood for the exp group, at pre, under task condition a**.

We need to extract the rows that meet these conditions, so the code looks like this.

dataframe_ = dataframe.loc[(dataframe.time == 'pre') & \
                           (dataframe.group == 'exp') & \
                           (dataframe.cond == 'a'), :]
sns.regplot(x='mood', y='score', data=dataframe_)

This builds a boolean Series that expresses the conditions and passes it to .loc. It works, but it's kind of messy.

If you use the .query() method, you can write it like this.

dataframe_ = dataframe.query('time == "pre" & group == "exp" & cond == "a"')
sns.regplot(x='mood', y='score', data=dataframe_)

This is a lot cleaner, but I still wish it could be a little nicer. Also, .query() is reportedly slower than indexing with a boolean Series. In the end, **extracting the rows that meet certain conditions from long-format data is still a hassle!**
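If you want to check the speed difference on your own machine, a rough sketch like the following could be used. The data here is random dummy data of an arbitrary size (10,000 rows, my assumption), and the timings will depend on your pandas version and on whether numexpr is installed, so treat the numbers as a hint rather than a benchmark.

```python
import timeit

import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n = 10_000  # arbitrary size chosen for illustration
dataframe = pd.DataFrame({
    "time": rng.choice(["pre", "post"], n),
    "group": rng.choice(["exp", "ctrl"], n),
    "cond": rng.choice(["a", "b"], n),
    "score": rng.normal(size=n),
})


def with_mask():
    # Boolean-Series indexing via .loc
    mask = ((dataframe["time"] == "pre")
            & (dataframe["group"] == "exp")
            & (dataframe["cond"] == "a"))
    return dataframe.loc[mask, :]


def with_query():
    # The same selection expressed as a query string
    return dataframe.query('time == "pre" & group == "exp" & cond == "a"')


# Both approaches should select exactly the same rows
assert with_mask().equals(with_query())

print("mask :", timeit.timeit(with_mask, number=100))
print("query:", timeit.timeit(with_query, number=100))
```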

Add a Pandas DataFrame method

**Then just create a method**

So let's create a **new method** that extracts the rows meeting the given conditions from a DataFrame. We will add a new .by() method to DataFrame that can be used like this: ↓

dataframe.by(time='pre', group='exp', cond='a')

You can easily achieve this with a package called pandas_flavor.

Installation method

Install it with pip,

pip install pandas_flavor

or with conda. Either way it's a one-liner.

conda install -c conda-forge pandas_flavor

Example of use

import pandas_flavor as pf


@pf.register_dataframe_method
def by(df, **args):
    # Narrow df one condition at a time: keep rows where column `key` equals `value`
    for key, value in args.items():
        df = df.loc[df[key] == value, :]
    return df

Just write a function and add @pf.register_dataframe_method as a decorator. In this example, the keyword arguments are received as a dictionary via **args, and the rows matching each argument are extracted in turn.

Furthermore, it is handy to turn various seaborn functions into methods as well.
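Since the decorated function is still an ordinary function, the filtering logic can be tried out on its own first. Below is a minimal sketch of a variant that combines all conditions into a single boolean mask instead of slicing repeatedly; this is my own rewrite for illustration, not the code above, and the tiny DataFrame is made up.

```python
import pandas as pd


def by(df, **conditions):
    """Return the rows of df where every named column equals the given value."""
    # Start with "keep everything" and AND in one condition per keyword argument
    mask = pd.Series(True, index=df.index)
    for column, value in conditions.items():
        mask &= df[column] == value
    return df.loc[mask, :]


# Made-up toy data, just to exercise the function
dataframe = pd.DataFrame({
    "time": ["pre", "pre", "post", "post"],
    "group": ["exp", "ctrl", "exp", "ctrl"],
    "score": [1, 2, 3, 4],
})

subset = by(dataframe, time="pre", group="exp")
print(subset)
```

Once it behaves as expected, putting the pandas_flavor decorator on top of the same function should make it available as a DataFrame method.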

import seaborn as sns

@pf.register_dataframe_method
def regplot(df, **args):
    # Pass the DataFrame itself as seaborn's `data` argument
    return sns.regplot(data=df, **args)

[Image: screenshot of the resulting regression plot]

And this is the result. If you want to add a method to pandas.Series, you can do the same with @pf.register_series_method.

For this particular example, .query() would probably do, but the technique can be applied in many other ways.

[^1]: Needless to say, this is an entirely made-up psychology experiment. The numbers were generated with the random module.

