[PYTHON] [For beginners] For those who get stuck applying a regression model to their own data (statsmodels, part 2)


This is mostly a memo for myself.

**Introduction to statsmodels (OLS) for those who get stuck after the sample data**

This is an introductory article for people who have worked through the regression samples for statsmodels but get stuck when they try the same thing on their own data.

The scenario assumed by this script is a fictitious restaurant (think of a bar or sky lounge) with fictitious daily sales data that records sales, the number of visitors, the average spend per customer, and the number of orders in each major product category. The question is: is there any tendency on days when sales are high?

【Environment】
Linux: Debian 10.3
Python: 3.7.3
pandas: 1.0.3
statsmodels: 0.11.1
jupyter-lab: 2.1.0

Assume you have a CSV file like the one below (header row and one sample row):

Date,earnings,customer,earnings_customer,fortified_sweet,rum,brown_spirits,mojito_rebjito,cocktail,bar_food,cigar
2020-03-01,30000,5,6000,2,2,2,3,2,5,1
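For reference, here is a minimal sketch of how such a file might be loaded and split into the `y` and `X` used below. The inline text stands in for the real file (its name is not given in this article), and the list of explanatory columns is an assumption based on the variables that appear in the summary further down.

```python
import io

import pandas as pd

# Stand-in for the real CSV file: the header and the single sample row shown above.
# With a real file, pass its path to pd.read_csv instead of the StringIO object.
csv_text = """Date,earnings,customer,earnings_customer,fortified_sweet,rum,brown_spirits,mojito_rebjito,cocktail,bar_food,cigar
2020-03-01,30000,5,6000,2,2,2,3,2,5,1
"""

df = pd.read_csv(io.StringIO(csv_text), parse_dates=["Date"])

# Response variable: daily sales.
y = df["earnings"]

# Explanatory variables (assumed: the seven that appear in the summary below).
feature_cols = ["customer", "fortified_sweet", "rum", "brown_spirits",
                "cocktail", "bar_food", "cigar"]
X = df[feature_cols]

print(X.shape, y.shape)
```

Keeping the column names in one list makes it easy to drop variables later when the model is refitted with fewer of them.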

[1] Fitting the model and displaying the summary (continued from the previous article)

`statsmodels`



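import statsmodels.api as sm

# Build the regression model (y and X are assumed to be prepared as in the previous article);
# add_constant adds the intercept column "const"
model = sm.OLS(y, sm.add_constant(X))

# Fit the model
results = model.fit()

# Show the detailed results
print(results.summary())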


                            OLS Regression Results                            
==============================================================================
Dep. Variable:               earnings   R-squared:                       0.930
Model:                            OLS   Adj. R-squared:                  0.921
Method:                 Least Squares   F-statistic:                     100.8
Date:                Sat, 09 May 2020   Prob (F-statistic):           2.50e-28
Time:                        01:09:38   Log-Likelihood:                -618.49
No. Observations:                  61   AIC:                             1253.
Df Residuals:                      53   BIC:                             1270.
Df Model:                           7                                         
Covariance Type:            nonrobust                                         
===================================================================================
                      coef    std err          t      P>|t|      [0.025      0.975]
-----------------------------------------------------------------------------------
const            -435.1552   2434.515     -0.179      0.859   -5318.173    4447.863
customer         5103.4245    617.184      8.269      0.000    3865.511    6341.338
fortified_sweet   844.1247    543.874      1.552      0.127    -246.747    1934.997
rum              -389.6465    440.184     -0.885      0.380   -1272.545     493.252
brown_spirits    1267.2019    581.664      2.179      0.034     100.532    2433.872
cocktail        -1766.9369    568.908     -3.106      0.003   -2908.022    -625.852
bar_food           74.3759    514.091      0.145      0.886    -956.760    1105.512
cigar            4420.0626    599.323      7.375      0.000    3217.972    5622.153
==============================================================================
Omnibus:                       16.459   Durbin-Watson:                   1.864
Prob(Omnibus):                  0.000   Jarque-Bera (JB):               24.107
Skew:                           0.971   Prob(JB):                     5.83e-06
Kurtosis:                       5.390   Cond. No.                         37.4
==============================================================================

Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.

**Looking at part of the summary**

`statsmodels`



# Show the detailed results
print(results.summary())

Part of the summary

R-squared:                       0.930
Adj. R-squared:                  0.921
AIC:                             1253.

                      coef    std err          t      P>|t|      [0.025      0.975]
-----------------------------------------------------------------------------------

customer         5103.4245    617.184      8.269      0.000    3865.511    6341.338
fortified_sweet   844.1247    543.874      1.552      0.127    -246.747    1934.997
rum              -389.6465    440.184     -0.885      0.380   -1272.545     493.252
brown_spirits    1267.2019    581.664      2.179      0.034     100.532    2433.872
cocktail        -1766.9369    568.908     -3.106      0.003   -2908.022    -625.852
bar_food           74.3759    514.091      0.145      0.886    -956.760    1105.512
cigar            4420.0626    599.323      7.375      0.000    3217.972    5622.153

Once the summary is displayed, there are a few values to look at first.

"R-squared": the coefficient of determination (the closer it is to 1, the better the model fits the data).

"Adj. R-squared": the coefficient of determination adjusted for degrees of freedom (the one to use when there are many explanatory variables). ・ In this case there are many explanatory variables, so I will judge by this value.
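As a check on what "adjusted for degrees of freedom" means, the adjusted value can be recomputed by hand from numbers printed in the summary above, using the standard formula 1 - (1 - R²)(n - 1)/(n - p - 1). The variable names below are just for illustration, and because the printed R-squared is rounded, the result only agrees to the printed precision.

```python
# Values read from the summary above.
r_squared = 0.930   # R-squared (rounded in the printout)
n_obs = 61          # No. Observations
df_model = 7        # Df Model (explanatory variables, not counting const)

# Adjusted R-squared penalizes each additional explanatory variable.
adj_r_squared = 1 - (1 - r_squared) * (n_obs - 1) / (n_obs - df_model - 1)

print(round(adj_r_squared, 3))
```

The result agrees with the "Adj. R-squared" of 0.921 shown in the summary.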

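"AIC": an index of how well the model fits relative to its number of parameters (when comparing models fitted to the same data, the smaller the value, the better). ・ A single value such as 1253 cannot be judged as large or small by itself; it becomes useful once it is compared with the AIC of another model.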
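To see where the number comes from, AIC and BIC can be recomputed from the "Log-Likelihood" in the summary above: AIC = 2k - 2 ln L and BIC = k ln(n) - 2 ln L, where k counts the estimated coefficients including the constant. The variable names are illustrative only.

```python
import math

# Values read from the summary above.
log_likelihood = -618.49  # Log-Likelihood
n_obs = 61                # No. Observations
n_params = 8              # 7 explanatory variables + const

# Both criteria reward a high likelihood and penalize extra parameters.
aic = 2 * n_params - 2 * log_likelihood
bic = n_params * math.log(n_obs) - 2 * log_likelihood

print(round(aic), round(bic))
```

These agree with the "AIC" (1253) and "BIC" (1270) printed in the summary. The formula also shows why the absolute size says little: it grows with the amount of data through the log-likelihood.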

"coef": the regression coefficient (the estimated change in earnings when that variable increases by one while the others stay fixed; the larger the absolute value, the greater the effect). ・ Looking at this, "customer" and "cigar" appear to have a large influence.
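As a small worked example of how the coefficients combine, the sketch below applies the values from the coefficient table to the sample CSV row shown at the top of this article (2020-03-01). It is only arithmetic on the printed numbers, not a call into statsmodels.

```python
# Coefficients copied from the summary above.
coef = {
    "const": -435.1552,
    "customer": 5103.4245,
    "fortified_sweet": 844.1247,
    "rum": -389.6465,
    "brown_spirits": 1267.2019,
    "cocktail": -1766.9369,
    "bar_food": 74.3759,
    "cigar": 4420.0626,
}

# The sample row from the CSV (2020-03-01); const is always 1.
row = {
    "const": 1,
    "customer": 5,
    "fortified_sweet": 2,
    "rum": 2,
    "brown_spirits": 2,
    "cocktail": 2,
    "bar_food": 5,
    "cigar": 1,
}

# Predicted earnings = sum of coefficient * value over all terms.
predicted = sum(coef[name] * row[name] for name in coef)

print(round(predicted, 1))
```

The prediction comes out at roughly 29,800, against the recorded earnings of 30,000 for that day, and most of it comes from the "customer" term.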

"P>|t|": the p-value (the smaller it is (as a guide, less than 0.05), the less likely it is that the variable's effect is a coincidence). ・ Looking at this, in addition to "customer" and "cigar", the value for "cocktail" is also small, so its effect does not seem to be a coincidence either. ("brown_spirits", at 0.034, is also below 0.05.)
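The columns of the coefficient table are related to each other: "t" is "coef" divided by "std err", and a "P>|t|" below 0.05 goes together with a 95% interval ("[0.025" to "0.975]") that does not contain zero. A quick check on the numbers printed above (the dictionary is typed in by hand from the summary):

```python
# (coef, std err, P>|t|, lower 2.5%, upper 97.5%) copied from the summary above.
rows = {
    "customer":        (5103.4245, 617.184, 0.000, 3865.511, 6341.338),
    "fortified_sweet": (844.1247, 543.874, 0.127, -246.747, 1934.997),
    "rum":             (-389.6465, 440.184, 0.380, -1272.545, 493.252),
    "brown_spirits":   (1267.2019, 581.664, 0.034, 100.532, 2433.872),
    "cocktail":        (-1766.9369, 568.908, 0.003, -2908.022, -625.852),
    "bar_food":        (74.3759, 514.091, 0.886, -956.760, 1105.512),
    "cigar":           (4420.0626, 599.323, 0.000, 3217.972, 5622.153),
}

for name, (coef, std_err, p_value, low, high) in rows.items():
    t_value = coef / std_err                 # reproduces the "t" column
    excludes_zero = low > 0 or high < 0      # interval entirely on one side of 0
    print(f"{name:16s} t={t_value:7.3f}  p<0.05: {p_value < 0.05}  "
          f"CI excludes 0: {excludes_zero}")

# Variables whose effect is unlikely to be a coincidence at the 0.05 level.
significant = [name for name, r in rows.items() if r[2] < 0.05]
print(significant)
```

Note that a small p-value says nothing about the direction of the effect: "cocktail" is significant with a negative coefficient.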

**Looking at the summary so far, it seems that there are too many explanatory variables.**

It can be said that "customer" and "cigar" have a great influence on sales.

Looking at the regression coefficients, the result is roughly what I expected.

**Several explanatory variables have large "P>|t|" values, so the model probably has more explanatory variables than it needs.** I will reduce the explanatory variables and try the analysis again, using "AIC" to compare the two models (the smaller value is better).

I will post the results of the reanalysis another time.

For now I am posting only the graph. It has no legend: the bluish line is the movement of "customer" (number of visitors) and the reddish line is the movement of "cigar" (number of cigars served).

**That concludes "[For beginners] For those who get stuck applying a regression model to their own data (statsmodels, part 2)".**