When running various trials on a web service, it is important to verify their effect, and basic statistics are perfectly usable as a method for doing so. This article summarizes statistical data analysis with Python + Jupyter Lab (Docker), from basic statistics through data analysis techniques that can be used in practice to verify the effect of a trial.
The notebook used in this article is also available here: https://github.com/hikarut/Data-Science/tree/master/notebooks/statisticsSample
It is assumed that Jupyter Lab is available via Docker, as set up in the following article: Use vim keybindings in JupyterLab started with Docker
| - | Qualitative explanatory variable (2 categories) | Quantitative explanatory variables (multiple, both quantitative and qualitative) |
|---|---|---|
| Quantitative outcome (numerical) | t-test of the difference between means (or Wilcoxon rank-sum test) | Multiple regression analysis |
| Qualitative outcome (categorical) | Z-test of the difference in proportions (equivalent to the chi-square test) | Logistic regression analysis |
#Basic statistics
import numpy as np
x = np.array([1, 2, 3, 4, 5])  #Sample data
print('Mean:', np.mean(x))
print('Median:', np.median(x))
print('Variance:', np.var(x))  #Population variance (ddof=0 by default)
print('Standard deviation:', np.std(x))
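Note that np.var and np.std use ddof=0 (the population formulas) by default; pass ddof=1 for the unbiased sample versions:
import numpy as np
x = np.array([1, 2, 3, 4, 5])
print('Sample variance:', np.var(x, ddof=1))  #Divides by n - 1 instead of n
print('Sample standard deviation:', np.std(x, ddof=1))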
#1-sample t-test
import numpy as np
from scipy import stats
coffee = np.array([
    210.9, 195.4, 202.1, 211.3, 195.5,
    212.9, 210.9, 198.3, 202.1, 215.6,
    204.7, 212.2, 200.7, 206.1, 195.8
])
#Test the null hypothesis that the population mean is 200
t, p = stats.ttest_1samp(coffee, popmean=200)
print('Sample mean:', np.mean(coffee))
print('t-value (null hypothesis: population mean = 200):', t)
print('p-value (null hypothesis: population mean = 200):', p)
Execution result
Sample mean: 204.96666666666664
t-value (null hypothesis: population mean = 200): 2.751076959309973
p-value (null hypothesis: population mean = 200): 0.015611934395473872
Since the p-value is below 0.05, the hypothesis that the population mean is 200 is rejected at the 5% significance level.
#Paired t-test
import numpy as np
from scipy import stats
A = np.array([0.7, -1.6, -0.2, -1.2, -0.1, 3.4, 3.7, 0.8, 0.0, 2.0])
B = np.array([1.9, 0.8, 1.1, 0.1, -0.1, 4.4, 5.5, 1.6, 4.6, 3.4])
print('A average:', np.mean(A))
print('B average:', np.mean(B))
stats.ttest_rel(A, B)
Execution result
A average: 0.75
B average: 2.3299999999999996
Ttest_relResult(statistic=-4.062127683382037, pvalue=0.00283289019738427)
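As a sanity check, the paired t-test is equivalent to a one-sample t-test on the per-pair differences, so the following reuses A and B from above and gives the same statistic and p-value:
from scipy import stats
#Equivalent to stats.ttest_rel(A, B)
print(stats.ttest_1samp(A - B, popmean=0))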
#Student's t-test
import numpy as np
from scipy import stats
A = np.array([6.3, 8.1, 9.4, 10.4, 8.6, 10.5, 10.2, 10.5, 10.0, 8.8])
B = np.array([4.8, 2.1, 5.1, 2.0, 4.0, 1.0, 3.4, 2.7, 5.1, 1.4, 1.6])
print('A average:', np.mean(A))
print('B average:', np.mean(B))
stats.ttest_ind(A, B)
Execution result
A average: 9.28
B average: 3.0181818181818185
Ttest_indResult(statistic=9.851086859836649, pvalue=6.698194360479442e-09)
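Student's t-test assumes equal variances; before choosing between Student's and Welch's versions, one way to check that assumption is Levene's test (a sketch reusing A and B from above):
from scipy import stats
#A small p-value suggests unequal variances, favoring Welch's t-test
print(stats.levene(A, B))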
#Welch's t-test
import numpy as np
from scipy import stats
A = np.array([13.8, 10.2, 4.6, 10.0, 4.2, 16.1, 14.4, 4.9, 7.7, 11.4])
B = np.array([3.3, 2.6, 4.0, 4.7, 1.9, 2.9, 4.7, 5.3, 4.3, 3.0, 2.0])
print('A average:', np.mean(A))
print('B average:', np.mean(B))
stats.ttest_ind(A, B, equal_var=False)
Execution result
A average: 9.73
B average: 3.5181818181818176
Ttest_indResult(statistic=4.426442804187721, pvalue=0.0012285738375064346)
#Wilcoxon rank-sum test (Mann-Whitney U test)
import numpy as np
from scipy import stats
A = np.array([1.83, 1.50, 1.62, 2.48, 1.68, 1.88, 1.55, 3.06, 1.30, 2.01, 3.11])
B = np.array([0.88, 0.65, 0.60, 1.05, 1.06, 1.29, 1.06, 2.14, 1.29])
print('A average:', np.mean(A))
print('B average:', np.mean(B))
stats.mannwhitneyu(A, B, alternative='two-sided')
Execution result
A average: 2.0018181818181815
B average: 1.1133333333333333
MannwhitneyuResult(statistic=91.0, pvalue=0.0018253610099931035)
mannwhitneyu is used here because the Wilcoxon rank-sum test is equivalent to the Mann-Whitney U test.
#Wilcoxon signed-rank test
import numpy as np
from scipy import stats
A = np.array([1.83, 1.50, 1.62, 2.48, 1.68, 1.88, 1.55, 3.06, 1.30])
B = np.array([0.88, 0.65, 0.60, 1.05, 1.06, 1.29, 1.06, 2.14, 1.29])
print('A average:', np.mean(A))
print('B average:', np.mean(B))
stats.wilcoxon(A, B)
Execution result
A average: 1.8777777777777775
B average: 1.1133333333333333
WilcoxonResult(statistic=0.0, pvalue=0.007685794055213263)
wilcoxon performs the Wilcoxon signed-rank test, the nonparametric test for paired samples.
#Chi-square test
import numpy as np
import pandas as pd
from scipy import stats
#Sample data (randomly generated, so results differ between runs)
sex = np.random.choice(['male', 'female'], size=20)
vote = np.random.choice(['agree', 'against'], size=20)
cross = pd.crosstab(index=sex, columns=vote)
print(cross)
x2, p, dof, expected = stats.chi2_contingency(cross, correction=False)
print("Chi-square value:", x2)
print("p-value:", p)
print("The degree of freedom is", dof)
print(expected)
Execution result
vote against agree
sex
female 5 5
male 6 4
Chi-square value: 0.20202020202020202
p-value: 0.653095114932182
Degrees of freedom: 1
[[5.5 4.5]
[5.5 4.5]]
correction=False specifies that Yates' continuity correction is not applied.
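For comparison, the default correction=True applies Yates' continuity correction, which is slightly more conservative for 2×2 tables (reusing cross from above):
from scipy import stats
#Default call with Yates' continuity correction
x2c, pc, dofc, expectedc = stats.chi2_contingency(cross)
print('Chi-square value (corrected):', x2c)
print('p-value (corrected):', pc)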
#Registration of test data
import pandas as pd
import numpy as np
data = pd.DataFrame({'output': [10, 20, 30, 40, 50, 60, 70, 80, 90, 100],
                     'input01': [1, 3, 5, 6, 10, 14, 8, 17, 15, 20],
                     'input02': [5, 10, 20, 30, 35, 35, 50, 70, 85, 100]})
data.head()
#Simple regression analysis with input01 and output
from sklearn import linear_model
model = linear_model.LinearRegression()
#Use input01 as explanatory variable
X = data.loc[:, ['input01']].values
#Use output as objective variable
Y = data['output'].values
#Create a predictive model
model.fit(X, Y)
print('Model parameters:', model.get_params())
print('Regression coefficient:', model.coef_)
print('Intercept:', model.intercept_)
print('Coefficient of determination (R^2):', model.score(X, Y))
print('Regression equation: [output] = %s × [input01] + %s' % (model.coef_[0], model.intercept_))
Execution result
Model parameters: {'copy_X': True, 'fit_intercept': True, 'n_jobs': None, 'normalize': False}
Regression coefficient: [4.45327487]
Intercept: 10.912578788709233
Coefficient of determination (R^2): 0.8771602016326598
Regression equation: [output] = 4.45327486982735 × [input01] + 10.912578788709233
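The fitted model can then be used for prediction; a minimal sketch with a hypothetical input01 value of 12:
#Predicted output for input01 = 12 (hypothetical value)
print(model.predict([[12]]))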
#Regression analysis using statsmodels
import statsmodels.api as sm
model = sm.OLS(Y, sm.add_constant(X))
result = model.fit()
print(result.summary())
Execution result
OLS Regression Results
==============================================================================
Dep. Variable: y R-squared: 0.877
Model: OLS Adj. R-squared: 0.862
Method: Least Squares F-statistic: 57.13
Date: Fri, 20 Mar 2020 Prob (F-statistic): 6.56e-05
Time: 23:33:37 Log-Likelihood: -37.282
No. Observations: 10 AIC: 78.56
Df Residuals: 8 BIC: 79.17
Df Model: 1
Covariance Type: nonrobust
==============================================================================
coef std err t P>|t| [0.025 0.975]
------------------------------------------------------------------------------
const 10.9126 6.833 1.597 0.149 -4.845 26.670
x1 4.4533 0.589 7.558 0.000 3.095 5.812
==============================================================================
Omnibus: 5.725 Durbin-Watson: 2.878
Prob(Omnibus): 0.057 Jarque-Bera (JB): 2.315
Skew: 1.150 Prob(JB): 0.314
Kurtosis: 3.513 Cond. No. 22.4
==============================================================================
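Instead of reading values off the summary table, individual quantities can be pulled directly from the fitted result object:
#Key quantities from the statsmodels result
print('Coefficients:', result.params)
print('p-values:', result.pvalues)
print('R-squared:', result.rsquared)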
#Graphing
import matplotlib.pyplot as plt
#Scatter plot
plt.scatter(X, Y)
#Regression line based on the fitted statsmodels result
plt.plot(X, result.predict(sm.add_constant(X)), color='black')
plt.show()
#Normalized and multiple regression analysis
from sklearn import linear_model
model = linear_model.LinearRegression()
#Normalize each column (subtract the mean, then divide by the range)
data2 = data.apply(lambda x: (x - np.mean(x)) / (np.max(x) - np.min(x)))
data2.head()
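As a quick check of the transform, each normalized column should have mean roughly 0 and a range of exactly 1:
#Mean ~0 and max - min == 1 for every column
print(data2.mean())
print(data2.max() - data2.min())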
#Use other than output as explanatory variable
data2_except_output = data2.drop("output", axis=1)
X = data2_except_output
#Use output as objective variable
Y = data2['output'].values
#Create a predictive model
model.fit(X, Y)
#Partial regression coefficient
print(pd.DataFrame({"name":data2_except_output.columns,
"result":np.abs(model.coef_)}).sort_values(by='result') )
print('Intercept(error):', model.intercept_)
Execution result
name result
0 input01 0.295143
1 input02 0.707205
Intercept: 1.6679414843100476e-17
#Multiple regression analysis using statsmodels
import statsmodels.api as sm
#Create a predictive model
model = sm.OLS(Y, sm.add_constant(X))
result = model.fit()
print(result.summary())
Execution result
OLS Regression Results
==============================================================================
Dep. Variable: y R-squared: 0.962
Model: OLS Adj. R-squared: 0.951
Method: Least Squares F-statistic: 87.64
Date: Fri, 20 Mar 2020 Prob (F-statistic): 1.11e-05
Time: 23:36:48 Log-Likelihood: 13.530
No. Observations: 10 AIC: -21.06
Df Residuals: 7 BIC: -20.15
Df Model: 2
Covariance Type: nonrobust
==============================================================================
coef std err t P>|t| [0.025 0.975]
------------------------------------------------------------------------------
const -1.388e-17 0.024 -5.87e-16 1.000 -0.056 0.056
input01 0.2951 0.180 1.636 0.146 -0.132 0.722
input02 0.7072 0.180 3.923 0.006 0.281 1.133
==============================================================================
Omnibus: 6.793 Durbin-Watson: 1.330
Prob(Omnibus): 0.033 Jarque-Bera (JB): 2.620
Skew: 1.171 Prob(JB): 0.270
Kurtosis: 3.896 Cond. No. 10.5
==============================================================================
scikit-learn is aimed at machine learning and statsmodels at statistics: with scikit-learn you have to calculate p-values yourself, whereas statsmodels reports them automatically.
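For illustration, here is a minimal sketch of what that manual calculation would look like for the normalized multiple regression above (refitting with scikit-learn and reusing X and Y; statsmodels does all of this internally):
import numpy as np
from scipy import stats
from sklearn import linear_model
sk_model = linear_model.LinearRegression().fit(X, Y)
n, k = X.shape
#Design matrix with an intercept column, matching sm.add_constant
X_design = np.column_stack([np.ones(n), X])
beta = np.concatenate([[sk_model.intercept_], sk_model.coef_])
residuals = Y - X_design @ beta
sigma2 = residuals @ residuals / (n - k - 1)  #Residual variance
se = np.sqrt(np.diag(sigma2 * np.linalg.inv(X_design.T @ X_design)))
t_values = beta / se
p_values = 2 * (1 - stats.t.cdf(np.abs(t_values), df=n - k - 1))
print('t-values:', t_values)  #Should match the statsmodels summary above
print('p-values:', p_values)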
#Registration of test data
import pandas as pd
import numpy as np
data = pd.DataFrame({'sex': [1, 1, 0, 1, 0, 0, 1, 1, 0, 0],
                     'student': [0, 0, 0, 0, 1, 0, 0, 1, 0, 1],
                     'Staying time(Seconds)': [34, 28, 98, 70, 67, 23, 67, 56, 41, 90],
                     'user registration': [0, 0, 1, 0, 1, 0, 1, 1, 0, 1]})
data
#Logistic regression analysis using statsmodels
import numpy as np
import statsmodels.api as sm
#Use other than user registration for explanatory variables
X = data[['sex', 'student', 'Staying time(Seconds)']]
#Use user registration for objective variable
Y = data['user registration'].values
model = sm.Logit(Y, sm.add_constant(X))
result = model.fit(disp=0)
print('---summary---')
print(result.summary())
print('---Log odds---')
print(result.params)
print('---p-value---')
print(result.pvalues)
print('---Probability of the event occurring when the variable increases by one unit (quantitative evaluation)---')
#Two equivalent forms of the logistic (sigmoid) transform of the coefficient
print('Staying time(Seconds):', 1 / (1 + np.exp(-result.params['Staying time(Seconds)'])))
print('Staying time(Seconds):', np.exp(result.params['Staying time(Seconds)']) / (1 + np.exp(result.params['Staying time(Seconds)'])))
print('---Odds ratio: how many times more likely the event is to occur than not when the variable equals 1 (qualitative evaluation)---')
print('sex:', np.exp(result.params['sex']))
print('student:', np.exp(result.params['student']))
Execution result
---summary---
Logit Regression Results
==============================================================================
Dep. Variable: y No. Observations: 10
Model: Logit Df Residuals: 6
Method: MLE Df Model: 3
Date: Fri, 20 Mar 2020 Pseudo R-squ.: 0.7610
Time: 10:12:42 Log-Likelihood: -1.6565
converged: False LL-Null: -6.9315
Covariance Type: nonrobust LLR p-value: 0.01443
==============================================================================
coef std err z P>|z| [0.025 0.975]
------------------------------------------------------------------------------
const -9.5614 10.745 -0.890 0.374 -30.622 11.499
sex 0.1543 5.170 0.030 0.976 -9.979 10.287
student 22.7574 3.26e+04 0.001 0.999 -6.38e+04 6.38e+04
Staying time(Seconds) 0.1370 0.139 0.988 0.323 -0.135 0.409
==============================================================================
Possibly complete quasi-separation: A fraction 0.30 of observations can be
perfectly predicted. This might indicate that there is complete
quasi-separation. In this case some parameters will not be identified.
---Log odds---
const -9.561361
sex 0.154283
student 22.757416
Staying time(Seconds) 0.136965
dtype: float64
---p-value---
const 0.373568
sex 0.976193
student 0.999442
Staying time(Seconds) 0.323037
dtype: float64
---Probability of the event occurring when the variable increases by one unit (quantitative evaluation)---
Staying time(Seconds): 0.5341877701226888
Staying time(Seconds): 0.5341877701226888
---Odds ratio: how many times more likely the event is to occur than not when the variable equals 1 (qualitative evaluation)---
sex: 1.1668207000698392
student: 7645749443.830123
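The fitted result can also output predicted probabilities directly; a minimal sketch for a hypothetical new user (column order follows sm.add_constant(X): const, sex, student, staying time):
#Hypothetical user: sex=0, student=1, staying time 60 seconds
new_user = [[1, 0, 1, 60]]
print('Predicted registration probability:', result.predict(new_user))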