Almost a memorandum
** Introduction to statsmodels (OLS) for those who are stuck after the sample data **
This is an introductory article for people who have tried the regression samples in statsmodels and got stuck when using their own data.
Recently, I tried Facebook's Prophet for time series analysis, including forecasting. I had little data prepared and the accuracy is still unknown, but it worked for the time being. Then I started on regression with statsmodels and sklearn, which seem to be the basics, and as soon as I created the data myself I hit errors and got stuck.
By reading the official statsmodels website and referring to various other sites, I made progress little by little, and it now seems to work almost as expected.
The point where I got stuck when applying the model was the data being passed to it. I am posting this because the result with the data I made was close to what I expected.
The case assumed by this script is a fictitious restaurant (assumed to be a bar or sky lounge) with fictitious sales data that records the major product categories, the unit price per customer, the number of visitors, and so on. The question is whether there is any tendency on days with high sales.
It is assumed that the store serves Western liquors and cocktails, light meals, and cigars.
【Environment】 Linux: Debian 10.3, Python: 3.7.3, pandas: 1.0.3, statsmodels: 0.11.1, jupyter-lab: 2.1.0
Assume you have a csv file like the one below:
Date,earnings,customer,earnings_customer,fortified_sweet,rum,brown_spirits,mojito_rebjito,cocktail,bar_food,cigar
2020-03-01,30000,5,6000,2,2,2,3,2,5,1
** 1. Analysis of fictitious sales data **
The script was run in Jupyter.
jupyter
#!/usr/bin/env python
# coding: utf-8
#Fictitious sales data
# infile = './sales_item_tf.csv'
import pandas as pd
#import numpy as np
import statsmodels.api as sm
import matplotlib.pyplot as plt
import seaborn as sns
#csv file
infile = './sales_item_tf.csv'
#usecols is given as a list; the columns come back in file order
df = pd.read_csv(infile, usecols=['earnings', 'customer',
    'fortified_sweet', 'brown_spirits', 'rum', 'cocktail', 'bar_food', 'cigar'])
df.columns
#Correlation coefficient
df_corr = df.corr()
sns.heatmap(df_corr, vmax=1, vmin=-1, center=0)
#Assign to explanatory variable X
X = df.loc[: ,'customer':'cigar']
X = X.astype(float)
#X.info()
df.columns
#Assign earnings to objective variable y
df
y = df['earnings']
y = y.astype(float)
#y = y.values
y
#Regression model call
model = sm.OLS(y, sm.add_constant(X))
#Creating a model
results = model.fit()
#View result details
print(results.summary())
# plot
#plt.plot(X)
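As a rough sketch of how the correlation step in the script above can be read without a heatmap, the snippet below ranks the explanatory columns by their correlation with earnings. The rows are the first four shown in this article, and the names `df_demo` and `ranking` are illustrative assumptions. With so few rows the numbers only demonstrate the mechanics, not a real tendency.

```python
import pandas as pd

# First four rows shown in this article (subset of columns).
# df_demo and ranking are hypothetical names used only for this sketch.
df_demo = pd.DataFrame({
    'earnings': [30000, 60000, 50000, 30000],
    'customer': [5, 10, 10, 6],
    'fortified_sweet': [2, 8, 5, 5],
    'cigar': [1, 1, 0, 0],
})

# Pearson correlation of every column with earnings, strongest first
ranking = (
    df_demo.corr()['earnings']
    .drop('earnings')
    .sort_values(ascending=False)
)
print(ranking)
```

The same expression should work on the full `df` from the script, since `df.corr()` is already computed there for the heatmap.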
** 2. Try splitting the configuration **
[1] Importing the library and reading the csv file
jupyter
#Fictitious sales data
# infile = './sales_item_tf.csv'
import pandas as pd
#import numpy as np
import statsmodels.api as sm
import matplotlib.pyplot as plt
import seaborn as sns
#csv file
infile = './sales_item_tf.csv'
#usecols is given as a list; the columns come back in file order
df = pd.read_csv(infile, usecols=['earnings', 'customer',
    'fortified_sweet', 'brown_spirits', 'rum', 'cocktail', 'bar_food', 'cigar'])
df.columns
Index(['earnings', 'customer', 'fortified_sweet', 'rum', 'brown_spirits',
'cocktail', 'bar_food', 'cigar'],
dtype='object')
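The column order in the output above follows the csv file, not the order written in `usecols`, and the later label slice `'customer':'cigar'` depends on that. The following is a minimal sketch of this behavior, using the header and the single data row shown earlier; `csv_text`, `wanted`, `df_demo` and `X_demo` are names made up for the illustration.

```python
import io
import pandas as pd

# Header and first row as shown in this article; csv_text is a stand-in
# for the real file, which is not available here.
csv_text = (
    "Date,earnings,customer,earnings_customer,fortified_sweet,rum,"
    "brown_spirits,mojito_rebjito,cocktail,bar_food,cigar\n"
    "2020-03-01,30000,5,6000,2,2,2,3,2,5,1\n"
)

# Deliberately listed in reverse order
wanted = ['cigar', 'bar_food', 'cocktail', 'brown_spirits',
          'rum', 'fortified_sweet', 'customer', 'earnings']

df_demo = pd.read_csv(io.StringIO(csv_text), usecols=wanted)
print(list(df_demo.columns))   # file order, regardless of `wanted`

# Label slicing is inclusive on both ends and follows the column order
X_demo = df_demo.loc[:, 'customer':'cigar']
print(list(X_demo.columns))
```

If a specific order is needed, reindexing after the read, for example `df_demo[wanted]`, is one option.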
[2] Assign data to the explanatory variable (X) and the objective variable (y)
jupyter
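#Assign to explanatory variable X
#Slice every column from 'customer' through 'cigar' (file order)
X = df.loc[: ,'customer':'cigar']
#X = df[['customer' ,'cigar']]  #would keep only two columns, so it is disabled
X = X.astype(float)
X.head()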
customer fortified_sweet rum brown_spirits cocktail bar_food cigar
0 5.0 2.0 2.0 2.0 2.0 5.0 1.0
1 10.0 8.0 5.0 1.0 2.0 2.0 1.0
2 10.0 5.0 2.0 2.0 2.0 2.0 0.0
3 6.0 5.0 5.0 3.0 2.0 0.0 0.0
4 10.0 5.0 6.0 2.0 5.0 5.0
#Assign earnings to objective variable y
y = df['earnings']
y = y.astype(float)
y
0 30000.0
1 60000.0
2 50000.0
3 30000.0
4 40000.0
...
56 50000.0
57 48000.0
58 40000.0
59 20000.0
60 20000.0
Name: earnings, Length: 61, dtype: float64
Both the explanatory variable (X) and the objective variable (y) are converted to float type. This is where I got stuck. ** It seems that an error occurs when applying the model if they are not converted to float type. **
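One common reason for this kind of failure, though not necessarily the one hit here, is a column that pandas has read as `object` (text) rather than as a number. The sketch below shows one way to check the dtypes and convert everything to float before building the model. The frame `df_demo` and its values are invented for the illustration.

```python
import pandas as pd

# Hypothetical frame: 'earnings' arrived as strings, 'customer' as integers
df_demo = pd.DataFrame({
    'earnings': ['30000', '60000', '50000'],
    'customer': [5, 10, 10],
})
print(df_demo.dtypes)          # check what pandas actually read

# Convert every column to a number; values that cannot be parsed become NaN
numeric = df_demo.apply(pd.to_numeric, errors='coerce').astype(float)
print(numeric.dtypes)

# Count the values that failed to convert, column by column
print(numeric.isna().sum())
```

Using `errors='coerce'` is a design choice for the sketch: it keeps the conversion from raising and lets the NaN count reveal which columns contained stray text.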
[3] Applying the model and displaying the summary
statsmodels
#Regression model call
model = sm.OLS(y, sm.add_constant(X))
#Creating a model
results = model.fit()
#View result details
print(results.summary())
                            OLS Regression Results
==============================================================================
Dep. Variable:               earnings   R-squared:                       0.930
Model:                            OLS   Adj. R-squared:                  0.921
Method:                 Least Squares   F-statistic:                     100.8
Date:                Sat, 09 May 2020   Prob (F-statistic):           2.50e-28
Time:                        01:09:38   Log-Likelihood:                -618.49
No. Observations:                  61   AIC:                             1253.
Df Residuals:                      53   BIC:                             1270.
Df Model:                           7
Covariance Type:            nonrobust
===================================================================================
                       coef    std err          t      P>|t|      [0.025      0.975]
-----------------------------------------------------------------------------------
const            -435.1552   2434.515     -0.179      0.859   -5318.173    4447.863
customer         5103.4245    617.184      8.269      0.000    3865.511    6341.338
fortified_sweet   844.1247    543.874      1.552      0.127    -246.747    1934.997
rum              -389.6465    440.184     -0.885      0.380   -1272.545     493.252
brown_spirits    1267.2019    581.664      2.179      0.034     100.532    2433.872
cocktail        -1766.9369    568.908     -3.106      0.003   -2908.022    -625.852
bar_food           74.3759    514.091      0.145      0.886    -956.760    1105.512
cigar            4420.0626    599.323      7.375      0.000    3217.972    5622.153
==============================================================================
Omnibus:                       16.459   Durbin-Watson:                   1.864
Prob(Omnibus):                  0.000   Jarque-Bera (JB):               24.107
Skew:                           0.971   Prob(JB):                     5.83e-06
Kurtosis:                       5.390   Cond. No.                         37.4
==============================================================================
Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
For the time being, it seems that the model has been applied successfully.
This has gotten a little long, so I will split it into two parts. The next part starts by looking at the results of the summary.
** That concludes "[For beginners] For people who got stuck on their own data with regression models (statsmodels, part 1)". **