[PYTHON] [For beginners] For those who got stuck applying a regression model to their own data (statsmodels, part 1)

This is mostly a memo to myself.

** Introduction to statsmodels (OLS) for those who get stuck once they leave the sample data **

This is an introductory article for people who have run the statsmodels regression samples and then got stuck on their own data.

Recently I tried Facebook's Prophet for time series analysis and forecasting. I had little data prepared and the accuracy is still unknown, but it ran. Then I turned to regression with statsmodels and sklearn, which seemed more basic, and as soon as I used data I had created myself, errors appeared and I got stuck.

Working from the official statsmodels site and various other sites, I made progress little by little, and in the end it worked almost as expected.

What stopped me was the data passed in when applying the model. I am posting this because, with the data I made, the results came out close to what I expected.

The case assumed by this script is a fictitious restaurant (think of a bar or sky lounge) with fictitious sales data that records the major product categories, the spend per customer, the number of visitors, and so on. The question is whether days with high sales show any tendency.

The store is assumed to serve Western liquors and cocktails, light meals, and cigars.


【Environment】 Linux: Debian 10.3, Python: 3.7.3, pandas: 1.0.3, statsmodels: 0.11.1, jupyter-lab: 2.1.0

Assume you have a csv file like the one below:

Date,earnings,customer,earnings_customer,fortified_sweet,rum,brown_spirits,mojito_rebjito,cocktail,bar_food,cigar
2020-03-01,30000,5,6000,2,2,2,3,2,5,1

** 1. Analysis of fictitious sales data **

The script was run in Jupyter.

jupyter



#!/usr/bin/env python
# coding: utf-8

# Fictitious sales data
import pandas as pd
#import numpy as np
import statsmodels.api as sm
import matplotlib.pyplot as plt
import seaborn as sns

# csv file
infile = './sales_item_tf.csv'

# Read only the columns used in the analysis (usecols as a list)
df = pd.read_csv(infile, usecols=['earnings', 'customer',
                                  'fortified_sweet', 'brown_spirits', 'rum',
                                  'cocktail', 'bar_food', 'cigar'])

df.columns

# Correlation coefficients
df_corr = df.corr()
sns.heatmap(df_corr, vmax=1, vmin=-1, center=0)

# Assign the explanatory variables to X
X = df.loc[:, 'customer':'cigar']
X = X.astype(float)

# Assign earnings to the objective variable y
y = df['earnings']
y = y.astype(float)

# Set up the regression model (add_constant adds the intercept term)
model = sm.OLS(y, sm.add_constant(X))

# Fit the model
results = model.fit()

# View the result details
print(results.summary())

** 2. Try splitting the script into steps **

[1] Importing the libraries and reading the csv file

jupyter



# Fictitious sales data
import pandas as pd
#import numpy as np
import statsmodels.api as sm
import matplotlib.pyplot as plt
import seaborn as sns

# csv file
infile = './sales_item_tf.csv'

# Read only the columns used in the analysis (usecols as a list)
df = pd.read_csv(infile, usecols=['earnings', 'customer',
                                  'fortified_sweet', 'brown_spirits', 'rum',
                                  'cocktail', 'bar_food', 'cigar'])

df.columns

Index(['earnings', 'customer', 'fortified_sweet', 'rum', 'brown_spirits',
       'cocktail', 'bar_food', 'cigar'],
      dtype='object')
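Note that the columns come back in the order they have in the file, not in the order given to `usecols`. That is why a label slice such as `'customer':'cigar'` can pick up all the explanatory columns in one go. The following is a minimal sketch that reproduces this with the header and the single sample row shown earlier; `csv_text` and `wanted` are names made up for the illustration, and the in-memory string stands in for the real file.

```python
import io

import pandas as pd

# In-memory stand-in for the csv file (header and sample row from this article)
csv_text = (
    "Date,earnings,customer,earnings_customer,fortified_sweet,rum,"
    "brown_spirits,mojito_rebjito,cocktail,bar_food,cigar\n"
    "2020-03-01,30000,5,6000,2,2,2,3,2,5,1\n"
)

# Deliberately list the wanted columns in a scrambled order
wanted = ["cigar", "bar_food", "cocktail", "rum", "brown_spirits",
          "fortified_sweet", "customer", "earnings"]

df = pd.read_csv(io.StringIO(csv_text), usecols=wanted)
print(list(df.columns))  # file order, regardless of the order in `wanted`

# Label-based slice: every column from 'customer' through 'cigar', inclusive
X = df.loc[:, "customer":"cigar"]
print(list(X.columns))
```

If the column order in your own file differs, it is probably safer to list the explanatory columns explicitly, for example `df[['customer', 'cigar']]`, instead of relying on a slice.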

[2] Assign data to the explanatory variables (X) and the objective variable (y)

jupyter



# Assign the explanatory variables to X
X = df.loc[:, 'customer':'cigar']
# To use only some of the columns instead:
# X = df[['customer', 'cigar']]
X = X.astype(float)

X.head()

	customer 	fortified_sweet 	rum 	brown_spirits 	cocktail 	bar_food 	cigar
0 	5.0 	2.0 	2.0 	2.0 	2.0 	5.0 	1.0
1 	10.0 	8.0 	5.0 	1.0 	2.0 	2.0 	1.0
2 	10.0 	5.0 	2.0 	2.0 	2.0 	2.0 	0.0
3 	6.0 	5.0 	5.0 	3.0 	2.0 	0.0 	0.0
4 	10.0 	5.0 	6.0 	2.0 	5.0 	5.0 	

# Assign earnings to the objective variable y
y = df['earnings']
y = y.astype(float)

y

0     30000.0
1     60000.0
2     50000.0
3     30000.0
4     40000.0
       ...    
56    50000.0
57    48000.0
58    40000.0
59    20000.0
60    20000.0
Name: earnings, Length: 61, dtype: float64

Both the explanatory variables (X) and the objective variable (y) are converted to float type. This is where I got stuck. ** It seems that an error occurs when applying the model unless the data is float type. **
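One way to see whether your own data needs this conversion is to inspect the column dtypes before building the model. The sketch below rests on an assumption the article does not confirm: that the trouble comes from columns that pandas did not read as numbers (for example because of a stray text entry). The tiny `raw` table and its values are invented for the illustration.

```python
import pandas as pd

# Hypothetical data: numbers stored as text, plus one entry that is not a number
raw = pd.DataFrame({
    "earnings": ["30000", "60000", "50000"],
    "customer": ["5", "10", "n/a"],
})
print(raw.dtypes)  # neither column is numeric yet

# Convert every column to numbers; entries that cannot be parsed become NaN
numeric = raw.apply(pd.to_numeric, errors="coerce")
print(numeric.isna().sum())  # how many entries per column failed to convert

# Drop the incomplete rows, then make everything float as in the article
clean = numeric.dropna().astype(float)
print(clean.dtypes)
```

Whether to drop or fill the rows that failed to convert depends on the data; dropping is only the simplest choice for a sketch.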

[3] Applying the model and displaying the summary

statsmodels



# Set up the regression model (add_constant adds the intercept term)
model = sm.OLS(y, sm.add_constant(X))

# Fit the model
results = model.fit()

# View the result details
print(results.summary())


                            OLS Regression Results                            
==============================================================================
Dep. Variable:               earnings   R-squared:                       0.930
Model:                            OLS   Adj. R-squared:                  0.921
Method:                 Least Squares   F-statistic:                     100.8
Date:                Sat, 09 May 2020   Prob (F-statistic):           2.50e-28
Time:                        01:09:38   Log-Likelihood:                -618.49
No. Observations:                  61   AIC:                             1253.
Df Residuals:                      53   BIC:                             1270.
Df Model:                           7                                         
Covariance Type:            nonrobust                                         
===================================================================================
                      coef    std err          t      P>|t|      [0.025      0.975]
-----------------------------------------------------------------------------------
const            -435.1552   2434.515     -0.179      0.859   -5318.173    4447.863
customer         5103.4245    617.184      8.269      0.000    3865.511    6341.338
fortified_sweet   844.1247    543.874      1.552      0.127    -246.747    1934.997
rum              -389.6465    440.184     -0.885      0.380   -1272.545     493.252
brown_spirits    1267.2019    581.664      2.179      0.034     100.532    2433.872
cocktail        -1766.9369    568.908     -3.106      0.003   -2908.022    -625.852
bar_food           74.3759    514.091      0.145      0.886    -956.760    1105.512
cigar            4420.0626    599.323      7.375      0.000    3217.972    5622.153
==============================================================================
Omnibus:                       16.459   Durbin-Watson:                   1.864
Prob(Omnibus):                  0.000   Jarque-Bera (JB):               24.107
Skew:                           0.971   Prob(JB):                     5.83e-06
Kurtosis:                       5.390   Cond. No.                         37.4
==============================================================================

Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.

For the time being, it seems that the model has been fitted.

This has become a little long, so I will split it into two parts. In the next part, let's start by looking at the results of the summary.

** That concludes "[For beginners] For those who got stuck applying a regression model to their own data (statsmodels, part 1)". **
