When running various trials on a web service, it is important to verify their effect, and basic statistics are perfectly usable as a method for doing so. This article summarizes statistical data analysis with Python + Jupyter Lab (Docker), from basic statistics through data analysis techniques that can be used in practice to verify the effect of a trial.
The notebook used in this article is also available here: https://github.com/hikarut/Data-Science/tree/master/notebooks/statisticsSample
It is assumed that Jupyter Lab is available via Docker, as set up in the following article: Use vim keybindings in JupyterLab started with Docker
| - | Qualitative explanatory variable (2 categories) | Quantitative explanatory variables (multiple, both quantitative and qualitative) |
|---|---|---|
| Quantitative outcome (numerical) | t-test of the difference between means (or Wilcoxon rank-sum test) | Multiple regression analysis |
| Qualitative outcome (categorical) | Z-test of the difference in proportions (equivalent to the chi-square test) | Logistic regression analysis |
#Basic statistics
import numpy as np
x = np.array([1, 2, 3, 4, 5])  #Sample data
print('Mean:', np.mean(x))
print('Median:', np.median(x))
print('Variance:', np.var(x))  #Population variance (ddof=0 by default)
print('Standard deviation:', np.std(x))
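Note that np.var and np.std use ddof=0 (the population formulas) by default; pass ddof=1 for the unbiased sample versions:
import numpy as np
x = np.array([1, 2, 3, 4, 5])
print('Sample variance:', np.var(x, ddof=1))  #Divides by n - 1 instead of n
print('Sample standard deviation:', np.std(x, ddof=1))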
#1-sample t-test
import numpy as np
from scipy import stats
coffee = np.array([
    210.9, 195.4, 202.1, 211.3, 195.5,
    212.9, 210.9, 198.3, 202.1, 215.6,
    204.7, 212.2, 200.7, 206.1, 195.8
])
#Test the null hypothesis that the population mean is 200
t, p = stats.ttest_1samp(coffee, popmean=200)
print('Sample mean:', np.mean(coffee))
print('t-value (null hypothesis: population mean = 200):', t)
print('p-value (null hypothesis: population mean = 200):', p)
Execution result
Sample mean: 204.96666666666664
t-value (null hypothesis: population mean = 200): 2.751076959309973
p-value (null hypothesis: population mean = 200): 0.015611934395473872
Since the p-value is below 0.05, the hypothesis that the population mean is 200 is rejected at the 5% significance level.
#Paired t-test
import numpy as np
from scipy import stats
A = np.array([0.7, -1.6, -0.2, -1.2, -0.1, 3.4, 3.7, 0.8, 0.0, 2.0])
B = np.array([1.9, 0.8, 1.1, 0.1, -0.1, 4.4, 5.5, 1.6, 4.6, 3.4])
print('A average:', np.mean(A))
print('B average:', np.mean(B))
stats.ttest_rel(A, B)
Execution result
A average: 0.75
B average: 2.3299999999999996
Ttest_relResult(statistic=-4.062127683382037, pvalue=0.00283289019738427)
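As a sanity check, the paired t-test is equivalent to a one-sample t-test on the per-pair differences, so the following reuses A and B from above and gives the same statistic and p-value:
from scipy import stats
#Equivalent to stats.ttest_rel(A, B)
print(stats.ttest_1samp(A - B, popmean=0))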
#Student's t-test
import numpy as np
from scipy import stats
A = np.array([6.3, 8.1, 9.4, 10.4, 8.6, 10.5, 10.2, 10.5, 10.0, 8.8])
B = np.array([4.8, 2.1, 5.1, 2.0, 4.0, 1.0, 3.4, 2.7, 5.1, 1.4, 1.6])
print('A average:', np.mean(A))
print('B average:', np.mean(B))
stats.ttest_ind(A, B)
Execution result
A average: 9.28
B average: 3.0181818181818185
Ttest_indResult(statistic=9.851086859836649, pvalue=6.698194360479442e-09)
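Student's t-test assumes equal variances; before choosing between Student's and Welch's versions, one way to check that assumption is Levene's test (a sketch reusing A and B from above):
from scipy import stats
#A small p-value suggests unequal variances, favoring Welch's t-test
print(stats.levene(A, B))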
#Welch's t-test
import numpy as np
from scipy import stats
A = np.array([13.8, 10.2, 4.6, 10.0, 4.2, 16.1, 14.4, 4.9, 7.7, 11.4])
B = np.array([3.3, 2.6, 4.0, 4.7, 1.9, 2.9, 4.7, 5.3, 4.3, 3.0, 2.0])
print('A average:', np.mean(A))
print('B average:', np.mean(B))
stats.ttest_ind(A, B, equal_var=False)
Execution result
A average: 9.73
B average: 3.5181818181818176
Ttest_indResult(statistic=4.426442804187721, pvalue=0.0012285738375064346)
#Wilcoxon rank-sum test (Mann-Whitney U test)
import numpy as np
from scipy import stats
A = np.array([1.83, 1.50, 1.62, 2.48, 1.68, 1.88, 1.55, 3.06, 1.30, 2.01, 3.11])
B = np.array([0.88, 0.65, 0.60, 1.05, 1.06, 1.29, 1.06, 2.14, 1.29])
print('A average:', np.mean(A))
print('B average:', np.mean(B))
stats.mannwhitneyu(A, B, alternative='two-sided')
Execution result
A average: 2.0018181818181815
B average: 1.1133333333333333
MannwhitneyuResult(statistic=91.0, pvalue=0.0018253610099931035)
mannwhitneyu is used here because the Wilcoxon rank-sum test is equivalent to the Mann-Whitney U test.
#Wilcoxon signed-rank test
import numpy as np
from scipy import stats
A = np.array([1.83, 1.50, 1.62, 2.48, 1.68, 1.88, 1.55, 3.06, 1.30])
B = np.array([0.88, 0.65, 0.60, 1.05, 1.06, 1.29, 1.06, 2.14, 1.29])
print('A average:', np.mean(A))
print('B average:', np.mean(B))
stats.wilcoxon(A, B)
Execution result
A average: 1.8777777777777775
B average: 1.1133333333333333
WilcoxonResult(statistic=0.0, pvalue=0.007685794055213263)
wilcoxon performs the Wilcoxon signed-rank test, the nonparametric test for paired samples.
#Chi-square test
import numpy as np
import pandas as pd
from scipy import stats
#Sample data (randomly generated, so results differ between runs)
sex = np.random.choice(['male', 'female'], size=20)
vote = np.random.choice(['agree', 'against'], size=20)
cross = pd.crosstab(index=sex, columns=vote)
print(cross)
x2, p, dof, expected = stats.chi2_contingency(cross, correction=False)
print("Chi-square value:", x2)
print("p-value:", p)
print("The degree of freedom is", dof)
print(expected)
Execution result
vote against agree
sex
female 5 5
male 6 4
Chi-square value: 0.20202020202020202
p-value: 0.653095114932182
Degrees of freedom: 1
[[5.5 4.5]
[5.5 4.5]]
correction=False specifies that Yates' continuity correction is not applied.
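For comparison, the default correction=True applies Yates' continuity correction, which is slightly more conservative for 2×2 tables (reusing cross from above):
from scipy import stats
#Default call with Yates' continuity correction
x2c, pc, dofc, expectedc = stats.chi2_contingency(cross)
print('Chi-square value (corrected):', x2c)
print('p-value (corrected):', pc)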
#Registration of test data
import pandas as pd
import numpy as np
data = pd.DataFrame({'output': [10, 20, 30, 40, 50, 60, 70, 80, 90, 100],
                     'input01': [1, 3, 5, 6, 10, 14, 8, 17, 15, 20],
                     'input02': [5, 10, 20, 30, 35, 35, 50, 70, 85, 100]})
data.head()
#Simple regression analysis with input01 and output
from sklearn import linear_model
model = linear_model.LinearRegression()
#Use input01 as explanatory variable
X = data.loc[:, ['input01']].values
#Use output as objective variable
Y = data['output'].values
#Create a predictive model
model.fit(X, Y)
print('Model parameters:', model.get_params())
print('Regression coefficient:', model.coef_)
print('Intercept:', model.intercept_)
print('Coefficient of determination (R^2):', model.score(X, Y))
print('Regression equation: [output] = %s × [input01] + %s' % (model.coef_[0], model.intercept_))
Execution result
Model parameters: {'copy_X': True, 'fit_intercept': True, 'n_jobs': None, 'normalize': False}
Regression coefficient: [4.45327487]
Intercept: 10.912578788709233
Coefficient of determination (R^2): 0.8771602016326598
Regression equation: [output] = 4.45327486982735 × [input01] + 10.912578788709233
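The fitted model can then be used for prediction; a minimal sketch with a hypothetical input01 value of 12:
#Predicted output for input01 = 12 (hypothetical value)
print(model.predict([[12]]))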
#Regression analysis using statsmodels
import statsmodels.api as sm
model = sm.OLS(Y, sm.add_constant(X))
result = model.fit()
print(result.summary())
Execution result
OLS Regression Results
==============================================================================
Dep. Variable: y R-squared: 0.877
Model: OLS Adj. R-squared: 0.862
Method: Least Squares F-statistic: 57.13
Date: Fri, 20 Mar 2020 Prob (F-statistic): 6.56e-05
Time: 23:33:37 Log-Likelihood: -37.282
No. Observations: 10 AIC: 78.56
Df Residuals: 8 BIC: 79.17
Df Model: 1
Covariance Type: nonrobust
==============================================================================
coef std err t P>|t| [0.025 0.975]
------------------------------------------------------------------------------
const 10.9126 6.833 1.597 0.149 -4.845 26.670
x1 4.4533 0.589 7.558 0.000 3.095 5.812
==============================================================================
Omnibus: 5.725 Durbin-Watson: 2.878
Prob(Omnibus): 0.057 Jarque-Bera (JB): 2.315
Skew: 1.150 Prob(JB): 0.314
Kurtosis: 3.513 Cond. No. 22.4
==============================================================================
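Instead of reading values off the summary table, individual quantities can be pulled directly from the fitted result object:
#Key quantities from the statsmodels result
print('Coefficients:', result.params)
print('p-values:', result.pvalues)
print('R-squared:', result.rsquared)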
#Graphing
import matplotlib.pyplot as plt
#Scatter plot
plt.scatter(X, Y)
#Regression line based on the fitted statsmodels result
plt.plot(X, result.predict(sm.add_constant(X)), color='black')
plt.show()
#Normalized and multiple regression analysis
from sklearn import linear_model
model = linear_model.LinearRegression()
#Normalize each column (subtract the mean, then divide by the range)
data2 = data.apply(lambda x: (x - np.mean(x)) / (np.max(x) - np.min(x)))
data2.head()
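As a quick check of the transform, each normalized column should have mean roughly 0 and a range of exactly 1:
#Mean ~0 and max - min == 1 for every column
print(data2.mean())
print(data2.max() - data2.min())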
#Use other than output as explanatory variable
data2_except_output = data2.drop("output", axis=1)
X = data2_except_output
#Use output as objective variable
Y = data2['output'].values
#Create a predictive model
model.fit(X, Y)
#Partial regression coefficient
print(pd.DataFrame({"name":data2_except_output.columns,
"result":np.abs(model.coef_)}).sort_values(by='result') )
print('Intercept(error):', model.intercept_)
Execution result
name result
0 input01 0.295143
1 input02 0.707205
Intercept: 1.6679414843100476e-17
#Multiple regression analysis using statsmodels
import statsmodels.api as sm
#Create a predictive model
model = sm.OLS(Y, sm.add_constant(X))
result = model.fit()
print(result.summary())
Execution result
OLS Regression Results
==============================================================================
Dep. Variable: y R-squared: 0.962
Model: OLS Adj. R-squared: 0.951
Method: Least Squares F-statistic: 87.64
Date: Fri, 20 Mar 2020 Prob (F-statistic): 1.11e-05
Time: 23:36:48 Log-Likelihood: 13.530
No. Observations: 10 AIC: -21.06
Df Residuals: 7 BIC: -20.15
Df Model: 2
Covariance Type: nonrobust
==============================================================================
coef std err t P>|t| [0.025 0.975]
------------------------------------------------------------------------------
const -1.388e-17 0.024 -5.87e-16 1.000 -0.056 0.056
input01 0.2951 0.180 1.636 0.146 -0.132 0.722
input02 0.7072 0.180 3.923 0.006 0.281 1.133
==============================================================================
Omnibus: 6.793 Durbin-Watson: 1.330
Prob(Omnibus): 0.033 Jarque-Bera (JB): 2.620
Skew: 1.171 Prob(JB): 0.270
Kurtosis: 3.896 Cond. No. 10.5
==============================================================================
scikit-learn is aimed at machine learning and statsmodels at statistics: with scikit-learn you have to calculate p-values yourself, whereas statsmodels reports them automatically.
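For illustration, here is a minimal sketch of what that manual calculation would look like for the normalized multiple regression above (refitting with scikit-learn and reusing X and Y; statsmodels does all of this internally):
import numpy as np
from scipy import stats
from sklearn import linear_model
sk_model = linear_model.LinearRegression().fit(X, Y)
n, k = X.shape
#Design matrix with an intercept column, matching sm.add_constant
X_design = np.column_stack([np.ones(n), X])
beta = np.concatenate([[sk_model.intercept_], sk_model.coef_])
residuals = Y - X_design @ beta
sigma2 = residuals @ residuals / (n - k - 1)  #Residual variance
se = np.sqrt(np.diag(sigma2 * np.linalg.inv(X_design.T @ X_design)))
t_values = beta / se
p_values = 2 * (1 - stats.t.cdf(np.abs(t_values), df=n - k - 1))
print('t-values:', t_values)  #Should match the statsmodels summary above
print('p-values:', p_values)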
#Registration of test data
import pandas as pd
import numpy as np
data = pd.DataFrame({'sex': [1, 1, 0, 1, 0, 0, 1, 1, 0, 0],
                     'student': [0, 0, 0, 0, 1, 0, 0, 1, 0, 1],
                     'Staying time(Seconds)': [34, 28, 98, 70, 67, 23, 67, 56, 41, 90],
                     'user registration': [0, 0, 1, 0, 1, 0, 1, 1, 0, 1]})
data
#Logistic regression analysis using statsmodels
import numpy as np
import statsmodels.api as sm
#Use other than user registration for explanatory variables
X = data[['sex', 'student', 'Staying time(Seconds)']]
#Use user registration for objective variable
Y = data['user registration'].values
model = sm.Logit(Y, sm.add_constant(X))
result = model.fit(disp=0)
print('---summary---')
print(result.summary())
print('---Log odds---')
print(result.params)
print('---p-value---')
print(result.pvalues)
print('---Probability of the event occurring when the variable increases by one unit (quantitative evaluation)---')
#Two equivalent forms of the logistic (sigmoid) transform of the coefficient
print('Staying time(Seconds):', 1 / (1 + np.exp(-result.params['Staying time(Seconds)'])))
print('Staying time(Seconds):', np.exp(result.params['Staying time(Seconds)']) / (1 + np.exp(result.params['Staying time(Seconds)'])))
print('---Odds ratio: how many times more likely the event is to occur than not when the variable equals 1 (qualitative evaluation)---')
print('sex:', np.exp(result.params['sex']))
print('student:', np.exp(result.params['student']))
Execution result
---summary---
Logit Regression Results
==============================================================================
Dep. Variable: y No. Observations: 10
Model: Logit Df Residuals: 6
Method: MLE Df Model: 3
Date: Fri, 20 Mar 2020 Pseudo R-squ.: 0.7610
Time: 10:12:42 Log-Likelihood: -1.6565
converged: False LL-Null: -6.9315
Covariance Type: nonrobust LLR p-value: 0.01443
==============================================================================
coef std err z P>|z| [0.025 0.975]
------------------------------------------------------------------------------
const -9.5614 10.745 -0.890 0.374 -30.622 11.499
sex 0.1543 5.170 0.030 0.976 -9.979 10.287
student 22.7574 3.26e+04 0.001 0.999 -6.38e+04 6.38e+04
Staying time(Seconds) 0.1370 0.139 0.988 0.323 -0.135 0.409
==============================================================================
Possibly complete quasi-separation: A fraction 0.30 of observations can be
perfectly predicted. This might indicate that there is complete
quasi-separation. In this case some parameters will not be identified.
---Log odds---
const -9.561361
sex 0.154283
student 22.757416
Staying time(Seconds) 0.136965
dtype: float64
---p-value---
const 0.373568
sex 0.976193
student 0.999442
Staying time(Seconds) 0.323037
dtype: float64
---Probability of the event occurring when the variable increases by one unit (quantitative evaluation)---
Staying time(Seconds): 0.5341877701226888
Staying time(Seconds): 0.5341877701226888
---Odds ratio: how many times more likely the event is to occur than not when the variable equals 1 (qualitative evaluation)---
sex: 1.1668207000698392
student: 7645749443.830123
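The fitted result can also output predicted probabilities directly; a minimal sketch for a hypothetical new user (column order follows sm.add_constant(X): const, sex, student, staying time):
#Hypothetical user: sex=0, student=1, staying time 60 seconds
new_user = [[1, 0, 1, 60]]
print('Predicted registration probability:', result.predict(new_user))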