[PYTHON] Split dataframe ⇒ Apply function

This post looks at how to split a dataframe partway through an analysis and apply a function to each of the resulting pieces. I am writing it down because the pattern seems to come up often.

1. Split a data.frame, then apply a function to each piece with purrr (R)

The example follows the one here. Group mtcars by cyl ⇒ fit a regression to each resulting data.frame ⇒ take the summary of each fit ⇒ extract each R². That is the flow.

library(purrr)
mtcars %>%
  split(.$cyl) %>% # from base R
  map(~ lm(mpg ~ wt, data = .)) %>%
  map(summary) %>%
  map_dbl("r.squared")

#>         4         6         8 
#> 0.5086326 0.4645102 0.4229655

Concise!

2. Split a dataframe, then apply a function to each piece (Python)

Now the same thing in Python, with reference to the answer here.

import pandas as pd
data = pd.read_csv('https://gist.githubusercontent.com/ZeccaLehn/4e06d2575eb9589dbe8c365d61cb056c/raw/64f1660f38ef523b2a1a13be77b002b98665cdfe/mtcars.csv')
data.rename(columns={'Unnamed: 0':'brand'}, inplace=True)   

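# Group by a scalar key (not ["cyl"]) so the dictionary keys are plain cyl values;
# recent pandas versions return 1-tuples such as (4,) when the key is a list.
d = dict(tuple(data.groupby("cyl")))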
print(d)

        brand   mpg  cyl   disp   hp  ...   qsec  vs  am  gear  carb
2       Datsun 710  22.8    4  108.0   93  ...  18.61   1   1     4     1
7        Merc 240D  24.4    4  146.7   62  ...  20.00   1   0     4     2
8         Merc 230  22.8    4  140.8   95  ...  22.90   1   0     4     2
17        Fiat 128  32.4    4   78.7   66  ...  19.47   1   1     4     1
18     Honda Civic  30.4    4   75.7   52  ...  18.52   1   1     4     2
19  Toyota Corolla  33.9    4   71.1   65  ...  19.90   1   1     4     1
20   Toyota Corona  21.5    4  120.1   97  ...  20.01   1   0     3     1
25       Fiat X1-9  27.3    4   79.0   66  ...  18.90   1   1     4     1
26   Porsche 914-2  26.0    4  120.3   91  ...  16.70   0   1     5     2
27    Lotus Europa  30.4    4   95.1  113  ...  16.90   1   1     5     2
31      Volvo 142E  21.4    4  121.0  109  ...  18.60   1   1     4     2
[11 rows x 12 columns]

         brand   mpg  cyl   disp   hp  ...   qsec  vs  am  gear  carb
0        Mazda RX4  21.0    6  160.0  110  ...  16.46   0   1     4     4
1    Mazda RX4 Wag  21.0    6  160.0  110  ...  17.02   0   1     4     4
3   Hornet 4 Drive  21.4    6  258.0  110  ...  19.44   1   0     3     1
5          Valiant  18.1    6  225.0  105  ...  20.22   1   0     3     1
9         Merc 280  19.2    6  167.6  123  ...  18.30   1   0     4     4
10       Merc 280C  17.8    6  167.6  123  ...  18.90   1   0     4     4
29    Ferrari Dino  19.7    6  145.0  175  ...  15.50   0   1     5     6
[7 rows x 12 columns]

              brand   mpg  cyl   disp   hp  ...   qsec  vs  am  gear  carb
4     Hornet Sportabout  18.7    8  360.0  175  ...  17.02   0   0     3     2
6            Duster 360  14.3    8  360.0  245  ...  15.84   0   0     3     4
11           Merc 450SE  16.4    8  275.8  180  ...  17.40   0   0     3     3
12           Merc 450SL  17.3    8  275.8  180  ...  17.60   0   0     3     3
13          Merc 450SLC  15.2    8  275.8  180  ...  18.00   0   0     3     3
14   Cadillac Fleetwood  10.4    8  472.0  205  ...  17.98   0   0     3     4
15  Lincoln Continental  10.4    8  460.0  215  ...  17.82   0   0     3     4
16    Chrysler Imperial  14.7    8  440.0  230  ...  17.42   0   0     3     4
21     Dodge Challenger  15.5    8  318.0  150  ...  16.87   0   0     3     2
22          AMC Javelin  15.2    8  304.0  150  ...  17.30   0   0     3     2
23           Camaro Z28  13.3    8  350.0  245  ...  15.41   0   0     3     4
24     Pontiac Firebird  19.2    8  400.0  175  ...  17.05   0   0     3     2
28        Ford Pantera L  15.8    8  351.0  264  ...  14.50   0   1     5     4
30        Maserati Bora  15.0    8  301.0  335  ...  14.60   0   1     5     8
[14 rows x 12 columns]
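
Printing the whole dictionary gets long. As a lighter check, here is a minimal sketch that lists only the keys and the shape of each sub-dataframe. The toy frame and the names toy, parts, and shapes are made up for illustration (the values are not the real mtcars data), so it runs without downloading the CSV.

```python
import pandas as pd

# Toy stand-in for mtcars (illustrative values, not the real dataset).
toy = pd.DataFrame({
    "cyl": [4, 4, 6, 6, 8, 8],
    "mpg": [30.0, 26.0, 21.0, 19.0, 15.0, 13.0],
    "wt":  [1.8, 2.2, 2.8, 3.2, 3.6, 4.0],
})

# Same split as above: one sub-dataframe per unique value of cyl.
parts = dict(tuple(toy.groupby("cyl")))

# Summarize the split: key -> (rows, columns) of each sub-dataframe.
shapes = {key: frame.shape for key, frame in parts.items()}
print(shapes)
```

Each key should be one cyl value, and the row counts should add up to the row count of the original frame.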

Printing d confirms that each key is a unique value of cyl from the original dataframe, that each value is the corresponding sub-dataframe, and that the split was stored in the dictionary as intended. Next, loop over the keys, apply the function (lm) to each dataframe, and store the results in a dictionary (summary). That is the flow.

import statsmodels.api as sm
def lm(y_train, X_train):
    # Regression model creation (add_constant adds the intercept term)
    model = sm.OLS(y_train, sm.add_constant(X_train))
    result = model.fit()
    return result

d = dict(tuple(data.groupby("cyl")))  # scalar key keeps the dict keys as plain cyl values
summary = {}
for key in d:
    y_train = d[key]["mpg"]
    X_train = d[key]["wt"]
    summary[key] = lm(y_train,X_train)
    print("#cyl{}:{}".format(key,summary[key].rsquared))

#cyl4:0.5086325963231395
#cyl6:0.4645101505505491
#cyl8:0.42296553649611224

Both are fairly easy to do. This time I felt the R code came out better, because purrr's map keeps even the step of fitting a regression to each group very concise. purrr has a lot of depth.
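
For comparison, the Python side can be tightened a little too. The following is only a sketch under some assumptions: it uses a made-up toy frame instead of the mtcars CSV, and it swaps statsmodels for np.polyfit so that it needs nothing beyond pandas and NumPy. For a straight-line fit with an intercept, the R² computed this way is the same quantity that OLS reports. The helper name r_squared is hypothetical.

```python
import numpy as np
import pandas as pd

# Toy stand-in for mtcars (illustrative values, not the real dataset).
toy = pd.DataFrame({
    "cyl": [4, 4, 4, 4, 6, 6, 6, 6],
    "wt":  [1.6, 1.9, 2.3, 2.8, 2.6, 2.9, 3.2, 3.5],
    "mpg": [33.0, 30.5, 26.0, 24.0, 21.5, 21.0, 19.0, 18.5],
})

def r_squared(frame):
    """R^2 of the least-squares line mpg ~ wt for one sub-dataframe (hypothetical helper)."""
    x = frame["wt"].to_numpy()
    y = frame["mpg"].to_numpy()
    slope, intercept = np.polyfit(x, y, 1)  # straight-line fit, highest power first
    residuals = y - (slope * x + intercept)
    return 1.0 - residuals.dot(residuals) / ((y - y.mean()) ** 2).sum()

# Split by cyl and apply the function to each piece in one expression,
# roughly what split() followed by map_dbl() does in the R version.
r2_by_cyl = pd.Series({key: r_squared(frame) for key, frame in toy.groupby("cyl")})
print(r2_by_cyl)
```

Iterating over the groupby object directly yields (key, sub-dataframe) pairs, so the intermediate dictionary of dataframes is not needed when only the per-group result matters.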
