Almost a memorandum
**Introduction to statsmodels (OLS) for those who ran the samples and got stuck on their own data**
This is an introductory article for people who have tried regression samples with statsmodels and are stuck with their own data.
The scenario assumed by this script is a fictitious restaurant (think of a bar or sky lounge) with fictitious daily sales data that records sales, the number of visitors, sales per customer, and the number of items sold in each major product category. The question is whether there is any pattern on days with high sales.
[Environment] Linux: Debian 10.3, Python: 3.7.3, pandas: 1.0.3, statsmodels: 0.11.1, jupyter-lab: 2.1.0
Assume you have a CSV file like the one below:
Date,earnings,customer,earnings_customer,fortified_sweet,rum,brown_spirits,mojito_rebjito,cocktail,bar_food,cigar
2020-03-01,30000,5,6000,2,2,2,3,2,5,1
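A minimal sketch of how such a file might be loaded into the `y` and `X` used in the next section. The file contents are inlined with `io.StringIO` so that the snippet runs on its own; with a real file you would pass its path to `pd.read_csv` instead. The choice of explanatory columns is an assumption, inferred from the variables that appear in the summary output further down.

```python
import io
import pandas as pd

# Inlined stand-in for the CSV file (header plus the one sample row shown above)
csv_text = """Date,earnings,customer,earnings_customer,fortified_sweet,rum,brown_spirits,mojito_rebjito,cocktail,bar_food,cigar
2020-03-01,30000,5,6000,2,2,2,3,2,5,1
"""

df = pd.read_csv(io.StringIO(csv_text), parse_dates=["Date"])

# Assumed column choice: the explanatory variables listed in the summary output
feature_cols = ["customer", "fortified_sweet", "rum", "brown_spirits",
                "cocktail", "bar_food", "cigar"]
y = df["earnings"]      # objective variable
X = df[feature_cols]    # explanatory variables

print(X.shape, y.iloc[0])
```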
[1] Applying the model and displaying the summary (continued from the previous article)
statsmodels
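import statsmodels.api as sm

# Define the regression model; X (explanatory variables) and y (earnings)
# are the data prepared in the previous article
model = sm.OLS(y, sm.add_constant(X))
# Fit the model to the data
results = model.fit()
# Show the detailed results
print(results.summary())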
OLS Regression Results
==============================================================================
Dep. Variable: earnings R-squared: 0.930
Model: OLS Adj. R-squared: 0.921
Method: Least Squares F-statistic: 100.8
Date: Sat, 09 May 2020 Prob (F-statistic): 2.50e-28
Time: 01:09:38 Log-Likelihood: -618.49
No. Observations: 61 AIC: 1253.
Df Residuals: 53 BIC: 1270.
Df Model: 7
Covariance Type: nonrobust
===================================================================================
coef std err t P>|t| [0.025 0.975]
-----------------------------------------------------------------------------------
const -435.1552 2434.515 -0.179 0.859 -5318.173 4447.863
customer 5103.4245 617.184 8.269 0.000 3865.511 6341.338
fortified_sweet 844.1247 543.874 1.552 0.127 -246.747 1934.997
rum -389.6465 440.184 -0.885 0.380 -1272.545 493.252
brown_spirits 1267.2019 581.664 2.179 0.034 100.532 2433.872
cocktail -1766.9369 568.908 -3.106 0.003 -2908.022 -625.852
bar_food 74.3759 514.091 0.145 0.886 -956.760 1105.512
cigar 4420.0626 599.323 7.375 0.000 3217.972 5622.153
==============================================================================
Omnibus: 16.459 Durbin-Watson: 1.864
Prob(Omnibus): 0.000 Jarque-Bera (JB): 24.107
Skew: 0.971 Prob(JB): 5.83e-06
Kurtosis: 5.390 Cond. No. 37.4
==============================================================================
Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
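Two of the numbers in this summary can be checked by hand, which helps when reading it. The sketch below uses only values printed above (rounded as displayed) and the standard formulas: adjusted R-squared = 1 - (1 - R²)(n - 1)/(n - p - 1), and t = coef / std err.

```python
# Values copied from the summary above (rounded as displayed)
r_squared = 0.930
n_obs = 61          # No. Observations
df_model = 7        # Df Model (explanatory variables, excluding the constant)

# Adjusted R-squared penalizes R-squared for the number of explanatory variables
adj_r_squared = 1 - (1 - r_squared) * (n_obs - 1) / (n_obs - df_model - 1)
print(round(adj_r_squared, 3))

# The t value of a coefficient is the coefficient divided by its standard error
customer_coef = 5103.4245
customer_std_err = 617.184
t_customer = customer_coef / customer_std_err
print(round(t_customer, 3))
```

Both results agree with the summary: 0.921 for Adj. R-squared and 8.269 for the t value of "customer".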
**Looking at part of the summary**
statsmodels
#View result details
print(results.summary())
Part of the summary
R-squared: 0.930
Adj. R-squared: 0.921
AIC: 1253.
coef std err t P>|t| [0.025 0.975]
-----------------------------------------------------------------------------------
customer 5103.4245 617.184 8.269 0.000 3865.511 6341.338
fortified_sweet 844.1247 543.874 1.552 0.127 -246.747 1934.997
rum -389.6465 440.184 -0.885 0.380 -1272.545 493.252
brown_spirits 1267.2019 581.664 2.179 0.034 100.532 2433.872
cocktail -1766.9369 568.908 -3.106 0.003 -2908.022 -625.852
bar_food 74.3759 514.091 0.145 0.886 -956.760 1105.512
cigar 4420.0626 599.323 7.375 0.000 3217.972 5622.153
When the summary is displayed, these are the items to look at first.
"R-squared": the coefficient of determination (the closer it is to 1, the more of the variation in earnings the model explains). "Adj. R-squared": the coefficient of determination adjusted for degrees of freedom (the one to use when there are many explanatory variables). ・In this case there are many explanatory variables, so I will judge by this value.
"AIC": an index of how well the model fits, penalized for the number of explanatory variables (the smaller the value, the better, but only when comparing models fitted to the same data). ・The value of 1253 cannot be judged as large or small on its own, so I will use it to compare against a model with fewer variables.
"coef": the regression coefficient (the estimated change in earnings for a one-unit increase in that variable, with the other variables held fixed). ・Looking at this, "customer" and "cigar" seem to have a big influence.
"P>|t|": the p-value (the smaller it is, with less than 0.05 as a guideline, the less likely the variable's effect is accidental). ・Looking at this, the value is small for "cocktail" (0.003) in addition to "customer" and "cigar", so its effect does not seem to be a coincidence either ("brown_spirits", at 0.034, is also below 0.05).
**Looking at the summary so far, it seems that there are many explanatory variables.**
It can be said that "customer" and "cigar" have a great influence on sales.
Looking at the output of the regression coefficient, it can be said that the result is almost as expected.
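**The AIC value alone does not show whether the accuracy is low, but several explanatory variables ("rum", "bar_food", "fortified_sweet") have large p-values, so there seem to be too many explanatory variables.** I will reduce the explanatory variables, run the analysis again, and compare the AIC values.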
The results of the reanalysis will be updated at another time.
For now I am posting only the graph. It has no legend, but the bluish line is the movement of "customer" (number of visitors) and the reddish line is the movement of "cigar" (cigars served).
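A legend can be added by giving each line a label. A minimal sketch with matplotlib on made-up daily values (not the restaurant data); the colors are chosen only to mirror the description above.

```python
import matplotlib
matplotlib.use("Agg")  # render without a display; remove this line inside Jupyter
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

# Made-up daily values for illustration only
rng = np.random.default_rng(2)
dates = pd.date_range("2020-03-01", periods=30, freq="D")
df = pd.DataFrame({
    "customer": rng.integers(1, 10, size=30),
    "cigar": rng.integers(0, 5, size=30),
}, index=dates)

fig, ax = plt.subplots(figsize=(8, 3))
ax.plot(df.index, df["customer"], color="tab:blue", label="customer (number of visitors)")
ax.plot(df.index, df["cigar"], color="tab:red", label="cigar (cigars served)")
ax.set_xlabel("Date")
ax.set_ylabel("Count")
ax.legend()  # builds the legend from the label= given to each line
fig.autofmt_xdate()
```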
**That concludes this article [for beginners] for people who got stuck on their own data with regression models (statsmodels, part 2).**