Summary of statistical data analysis methods using Python that can be used in business

Overview

I think it is very important to verify the effect after conducting various trials with WEB services. I think that basic statistics can be fully used as a method. This time, we have summarized the method of statistical data analysis using Python + Jupyter Lab (Docker) for the purpose of verifying the effect of trial production that can be used in business from basic statistics and data analysis.

Also, please refer to the notebook used this time as well. https://github.com/hikarut/Data-Science/tree/master/notebooks/statisticsSample

environment

It is assumed that Jupyter Lab can be used with Docker by referring to the following. Use vim keybindings in JupyterLab started with Docker

Analysis method summary

- Qualitative explanatory variables
(2 classifications)
Quantitative explanatory variables
(Multiple cases including quantity and quality)
Quantitative outcome(Numerical) T-test the difference between the mean values
(or Wilcoxon rank sum test)
Multiple regression analysis
Qualitative outcome(Classification type) Z-test the difference in proportion
(Same as chi-square test)
Logistic regression analysis

the term

Quantitative

qualitative

Outcome

Explanatory variable

Average value

import numpy as np

np.mean()

Median

import numpy as np

np.median()

deviation

Variance

import numpy as np

np.var()

Standard Deviation

import numpy as np

np.std()

Null hypothesis

Alternative hypothesis

p-value

Analysis method details

t-test

1 sample t-test (1 group t-test)

#1-sample t-test

import numpy as np
import scipy as sp
import matplotlib.pyplot as plt

coffee = np.array([
    210.9, 195.4, 202.1 , 211.3, 195.5, 
    212.9, 210.9, 198.3, 202.1, 215.6, 
    204.7, 212.2, 200.7, 206.1, 195.8
])

t, p = sp.stats.ttest_1samp(coffee, popmean=200)
print('Mother mean:', np.mean(coffee))
print('T-value with a population mean of 200:', t)
print('Probability that the population mean is 200(p-value):', p)

Execution result


Population mean: 204.96666666666664
T-value with population mean of 200: 2.751076959309973
Probability that the population mean is 200(p-value): 0.015611934395473872

Paired t-test

#Paired t-test

import numpy as np
from scipy import stats

A = np.array([0.7, -1.6, -0.2, -1.2, -0.1, 3.4, 3.7, 0.8, 0.0, 2.0])
B = np.array([1.9, 0.8, 1.1, 0.1, -0.1, 4.4, 5.5, 1.6, 4.6, 3.4])

print('A average:', np.mean(A))
print('B average:', np.mean(B))
stats.ttest_rel(A, B)

Execution result


A average: 0.75
B average: 2.3299999999999996
Ttest_relResult(statistic=-4.062127683382037, pvalue=0.00283289019738427)

Student's t-test

#Student's t-test

import numpy as np
from scipy import stats

A = np.array([6.3, 8.1, 9.4, 10.4, 8.6, 10.5, 10.2, 10.5, 10.0, 8.8])
B = np.array([4.8, 2.1, 5.1, 2.0, 4.0, 1.0, 3.4, 2.7, 5.1, 1.4, 1.6])

print('A average:', np.mean(A))
print('B average:', np.mean(B))
stats.ttest_ind(A, B)

Execution result


A average: 9.28
B average: 3.0181818181818185
Ttest_indResult(statistic=9.851086859836649, pvalue=6.698194360479442e-09)

Welch's t-test

#Welch's t-test

import numpy as np
from scipy import stats

A = np.array([13.8, 10.2, 4.6, 10.0, 4.2, 16.1, 14.4, 4.9, 7.7, 11.4])
B = np.array([3.3, 2.6, 4.0, 4.7, 1.9, 2.9, 4.7, 5.3, 4.3, 3.0, 2.0])

print('A average:', np.mean(A))
print('B average:', np.mean(B))
stats.ttest_ind(A, B, equal_var=False)

Execution result


A average: 9.73
B average: 3.5181818181818176
Ttest_indResult(statistic=4.426442804187721, pvalue=0.0012285738375064346)

Wilcoxon rank sum test

import numpy as np
from scipy import stats

A = np.array([1.83, 1.50, 1.62, 2.48, 1.68, 1.88, 1.55, 3.06, 1.30, 2.01, 3.11])
B = np.array([0.88, 0.65, 0.60, 1.05, 1.06, 1.29, 1.06, 2.14, 1.29])

print('A average:', np.mean(A))
print('B average:', np.mean(B))
stats.mannwhitneyu(A, B, alternative='two-sided')

Execution result


A average: 2.0018181818181815
B average: 1.1133333333333333
MannwhitneyuResult(statistic=91.0, pvalue=0.0018253610099931035)

Wilcoxon signed rank sum test

import numpy as np
from scipy import stats

A = np.array([1.83, 1.50, 1.62, 2.48, 1.68, 1.88, 1.55, 3.06, 1.30])
B = np.array([0.88, 0.65, 0.60, 1.05, 1.06, 1.29, 1.06, 2.14, 1.29])

print('A average:', np.mean(A))
print('B average:', np.mean(B))
stats.wilcoxon(A, B)

Execution result


A average: 1.8777777777777775
B average: 1.1133333333333333
WilcoxonResult(statistic=0.0, pvalue=0.007685794055213263)

Chi-square test

#Chi-square test

import numpy as np
import pandas as pd
from scipy import stats

#Sample data
sex = np.random.choice(['male', 'female'], size=20)
vote = np.random.choice(['agree', 'against'], size=20)
cross = pd.crosstab(index=sex, columns=vote)
print(cross)

x2, p, dof, expected = stats.chi2_contingency(cross, correction=False)
print("Chi-square value:", x2)
print("p-value:", p)
print("The degree of freedom is", dof)
print(expected)

Execution result


vote    against  agree
sex                   
female        5      5
male          6      4
Chi-square value: 0.20202020202020202
p-value: 0.653095114932182
1 degree of freedom
[[5.5 4.5]
 [5.5 4.5]]

Multiple regression analysis

#Registration of test data
import pandas as pd
import numpy as np

data = pd.DataFrame({'output': [10, 20, 30, 40, 50, 60, 70, 80, 90, 100],
                     'input01': [1, 3, 5, 6, 10, 14, 8, 17, 15, 20],
                     'input02': [5, 10, 20, 30, 35, 35, 50, 70, 85, 100]},)
data.head()
#Simple regression analysis with input01 and output
from sklearn import linear_model
model = linear_model.LinearRegression()

#Use input01 as explanatory variable
X = data.loc[:, ['input01']].values
#Use output as objective variable
Y = data['output'].values

#Create a predictive model
model.fit(X, Y)
print('Model parameters:', model.get_params())
print('Regression coefficient:', model.coef_)
print('Intercept(error):', model.intercept_)
print('Coefficient of determination(X,Correlation of Y):', model.score(X, Y))
print('Regression equation:[alcohol] = %s × [density] + %s' % (model.coef_[0], model.intercept_))

Execution result


Model parameters: {'copy_X': True, 'fit_intercept': True, 'n_jobs': None, 'normalize': False}
Regression coefficient: [4.45327487]
Intercept(error): 10.912578788709233
Coefficient of determination(X,Correlation of Y): 0.8771602016326598
Regression equation:[alcohol] = 4.45327486982735 × [density] + 10.912578788709233
#Regression analysis using stats models
import statsmodels.api as sm

model = sm.OLS(Y, sm.add_constant(X))
result = model.fit(disp=0)
print(result.summary())

Execution result


                            OLS Regression Results                            
==============================================================================
Dep. Variable:                      y   R-squared:                       0.877
Model:                            OLS   Adj. R-squared:                  0.862
Method:                 Least Squares   F-statistic:                     57.13
Date:                Fri, 20 Mar 2020   Prob (F-statistic):           6.56e-05
Time:                        23:33:37   Log-Likelihood:                -37.282
No. Observations:                  10   AIC:                             78.56
Df Residuals:                       8   BIC:                             79.17
Df Model:                           1                                         
Covariance Type:            nonrobust                                         
==============================================================================
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
const         10.9126      6.833      1.597      0.149      -4.845      26.670
x1             4.4533      0.589      7.558      0.000       3.095       5.812
==============================================================================
Omnibus:                        5.725   Durbin-Watson:                   2.878
Prob(Omnibus):                  0.057   Jarque-Bera (JB):                2.315
Skew:                           1.150   Prob(JB):                        0.314
Kurtosis:                       3.513   Cond. No.                         22.4
==============================================================================
#Graphing
import matplotlib.pyplot as plt
 
#Scatter plot
plt.scatter(X, Y)
 
#Regression line
plt.plot(X, model.predict(X), color='black')
#Normalized and multiple regression analysis
from sklearn import linear_model
model = linear_model.LinearRegression()

#Normalize each column in the data frame
data2 = data.apply(lambda x: (x - np.mean(x)) / (np.max(x) - np.min(x)))
data2.head()
#Use other than output as explanatory variable
data2_except_output = data2.drop("output", axis=1)
X = data2_except_output
#Use output as objective variable
Y = data2['output'].values
 
#Create a predictive model
model.fit(X, Y)
 
#Partial regression coefficient
print(pd.DataFrame({"name":data2_except_output.columns,
                    "result":np.abs(model.coef_)}).sort_values(by='result') )
 
print('Intercept(error):', model.intercept_)

Execution result


      name    result
0  input01  0.295143
1  input02  0.707205
Intercept(error): 1.6679414843100476e-17
#Multiple regression analysis using stats models
import statsmodels.api as sm

#Create a predictive model
model = sm.OLS(Y, sm.add_constant(X))
result = model.fit(disp=0)
print(result.summary())

Execution result


                            OLS Regression Results                            
==============================================================================
Dep. Variable:                      y   R-squared:                       0.962
Model:                            OLS   Adj. R-squared:                  0.951
Method:                 Least Squares   F-statistic:                     87.64
Date:                Fri, 20 Mar 2020   Prob (F-statistic):           1.11e-05
Time:                        23:36:48   Log-Likelihood:                 13.530
No. Observations:                  10   AIC:                            -21.06
Df Residuals:                       7   BIC:                            -20.15
Df Model:                           2                                         
Covariance Type:            nonrobust                                         
==============================================================================
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
const      -1.388e-17      0.024  -5.87e-16      1.000      -0.056       0.056
input01        0.2951      0.180      1.636      0.146      -0.132       0.722
input02        0.7072      0.180      3.923      0.006       0.281       1.133
==============================================================================
Omnibus:                        6.793   Durbin-Watson:                   1.330
Prob(Omnibus):                  0.033   Jarque-Bera (JB):                2.620
Skew:                           1.171   Prob(JB):                        0.270
Kurtosis:                       3.896   Cond. No.                         10.5
==============================================================================

Logistic regression analysis

#Registration of test data
import pandas as pd
import numpy as np

data = pd.DataFrame({'sex': [1, 1, 0, 1, 0, 0, 1, 1, 0, 0],
                     'student': [0, 0, 0, 0, 1, 0, 0, 1, 0, 1],
                     'Staying time(Seconds)': [34, 28, 98, 70, 67, 23, 67, 56, 41, 90],
                     'user registration': [0, 0, 1, 0, 1, 0, 1, 1, 0, 1]},)
data
#Logistic regression analysis using stats models
from sklearn.linear_model import LogisticRegression
import statsmodels.api as sm

#Use other than user registration for explanatory variables
X = data[['sex', 'student', 'Staying time(Seconds)']]
#Use user registration for objective variable
Y = data['user registration'].values

model = sm.Logit(Y, sm.add_constant(X))
result = model.fit(disp=0)
print('---summary---')
print(result.summary())

print('---Logarithmic odds---')
print(result.params)

print('---p-value---')
print(result.pvalues)

print('---What is the probability of an event occurring when a variable is incremented by 1 unit?%Will increase to(Quantitative evaluation)---')
print('Staying time(Seconds):', 1 / (1 + np.exp(-result.params['Staying time(Seconds)'])))
print('Staying time(Seconds):', np.exp(result.params['Staying time(Seconds)']) / (1 + np.exp(result.params['Staying time(Seconds)'])))

print('---How many times the probability that an event will occur when the variable becomes 1 is the probability that an event will not occur(Qualitative evaluation)---')
print('sex:', np.exp(result.params['sex']))
print('student:', np.exp(result.params['student']))

Execution result


---summary---
                           Logit Regression Results                           
==============================================================================
Dep. Variable:                      y   No. Observations:                   10
Model:                          Logit   Df Residuals:                        6
Method:                           MLE   Df Model:                            3
Date:                Fri, 20 Mar 2020   Pseudo R-squ.:                  0.7610
Time:                        10:12:42   Log-Likelihood:                -1.6565
converged:                      False   LL-Null:                       -6.9315
Covariance Type:            nonrobust   LLR p-value:                   0.01443
==============================================================================
                 coef    std err          z      P>|z|      [0.025      0.975]
------------------------------------------------------------------------------
const         -9.5614     10.745     -0.890      0.374     -30.622      11.499
Gender 0.1543      5.170      0.030      0.976      -9.979      10.287
Student 22.7574   3.26e+04      0.001      0.999   -6.38e+04    6.38e+04
Staying time(Seconds)        0.1370      0.139      0.988      0.323      -0.135       0.409
==============================================================================

Possibly complete quasi-separation: A fraction 0.30 of observations can be
perfectly predicted. This might indicate that there is complete
quasi-separation. In this case some parameters will not be identified.
---Logarithmic odds---
const      -9.561361
Gender 0.154283
Student 22.757416
Staying time(Seconds)     0.136965
dtype: float64
---p-value---
const      0.373568
Gender 0.976193
Student 0.999442
Staying time(Seconds)    0.323037
dtype: float64
---What is the probability of an event occurring when a variable is incremented by 1 unit?%Will increase to(Quantitative evaluation)---
Staying time(Seconds): 0.5341877701226888
Staying time(Seconds): 0.5341877701226888
---How many times the probability that an event will occur when the variable becomes 1 is the probability that an event will not occur(Qualitative evaluation)---
sex: 1.1668207000698392
student: 7645749443.830123

References

Recommended Posts

Summary of statistical data analysis methods using Python that can be used in business
Summary of scikit-learn data sources that can be used when writing analysis articles
Scripts that can be used when using bottle in Python
[Python] Introduction to web scraping | Summary of methods that can be used with webdriver
Python standard input summary that can be used in competition pro
Easy padding of data that can be used in natural language processing
Summary of Pandas methods used when extracting data [Python]
List of Python code used in big data analysis
List of tools that can be used to easily try sentiment analysis of Japanese sentences in Python (try with google colab)
Summary of methods often used in pandas
Summary of built-in methods in Python list
A summary of Python e-books that are useful for free-to-read data analysis
Summary of probability distributions that often appear in statistics and data analysis
Functions that can be used in for statements
A well-prepared record of data analysis in Python
Summary of Excel operations using OpenPyXL in Python
Python practice data analysis Summary of learning that I hit about 10 with 100 knocks
Data analysis in Python Summary of sources to look at first for beginners
Data analysis using Python 0
A personal memo of Pandas related operations that can be used in practice
I made a familiar function that can be used in statistics with Python
Summary of tools needed to analyze data in Python
Basic algorithms that can be used in competition pros
[Python] [Word] [python-docx] Simple analysis of diff data using python
Japanese can be used with Python in Docker environment
Python knowledge notes that can be used with AtCoder
ANTs image registration that can be used in 5 minutes
Can be used in competition pros! Python standard library
Geographic information visualization of R and Python that can be expressed in Power BI
[Introduction to Python] Summary of functions and methods that frequently appear in Python [Problem format]
Data analysis using python pandas
Format summary of formats that can be serialized with gensim
Basic summary of data manipulation in Python Pandas-Second half: Data aggregation
Processing of python3 that seems to be usable in paiza
Goroutine (parallel control) that can be used in the field
Text analysis that can be done in 5 minutes [Word Cloud]
Goroutine that can be used in the field (errgroup.Group edition)
Evaluation index that can be specified in GridSearchCV of sklearn
Can be used with AtCoder! A collection of techniques for drawing short code in Python!
[Django] Field names, user registration, and login methods that can be used in the User model
[Python3] Code that can be used when you want to resize images in folder units
A simple data analysis of Bitcoin provided by CoinMetrics in Python
Investigation of DC power supplies that can be controlled by Python
I made it because I want JSON data that can be used freely in demos and prototypes
33 strings that should not be used as variable names in python
Here's a summary of things that might be useful when dealing with complex numbers in Python
A timer (ticker) that can be used in the field (can be used anywhere)
New features in Python 3.9 (1)-Union operators can be used in dictionary types
Recommendation of data analysis using MessagePack
A class that summarizes frequently used methods in twitter api (python)
Python standard module that can be used on the command line
I wrote a tri-tree that can be used for high-speed dictionary implementation in D language and Python.
I created a template for a Python project that can be used universally
AtCoder C problem summary that can be solved in high school mathematics
Basic summary of scraping with Requests that beginners can absolutely understand [Python]
Power of forecasting methods in time series data analysis Semi-optimization (SARIMA) [Memo]
Mathematical optimization that can be used for free work with Python + PuLP
Maximum number of function parameters that can be defined in each language
Code reading of faker, a library that generates test data in Python
(Bad) practice of using this in Python
Summary of frequently used commands in matplotlib