Residual analysis in Python (Supplement: Cochran's rule)

Introduction

The chi-square test is a test for cross tabulation tables: it lets you test statistically whether there is a relationship between categories (for example, whether men prefer tea while women prefer water). For more about the chi-square test itself, see the link above.
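For illustration, here is a minimal sketch of such a test on a hypothetical gender-by-drink-preference table (the counts are made up for the example):

import numpy as np
from scipy import stats

# Hypothetical cross tabulation: rows are men/women, columns are tea/water (made-up counts)
cross = np.array([[45, 15],
                  [20, 40]])
x2, p, dof, expected = stats.chi2_contingency(cross)
print(p)  # a p-value below the significance level suggests a bias in the table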

However, even when the p-value falls below the significance level, there are two points to keep in mind when performing the chi-square test:

  1. Is a sufficient number of samples secured in every category?
  2. Even if the result is significant, which categories are actually related?

After all, what the chi-square test examines is whether there is bias in the cross tabulation table as a whole, and even if the test result is significant, it does not mean that all the categories are related. The first point is addressed by Cochran's rule and the second by residual analysis, both described below.

Cochran's rule

Originally, this is a criterion that should be checked before performing the chi-square test. The condition for applying the chi-square test is that **cells with an expected frequency of less than 5 must not exceed 20% of all cells in the cross tabulation table**; this is known as **Cochran's rule**. The exact threshold varies between sources, and wordings such as "25%" or "more than 20%" can also be found. Many people presumably use scipy.stats.chi2_contingency for the chi-square test in Python, so Cochran's rule can be checked with the expected frequency table returned by that function.

import numpy as np
from scipy import stats

# Chi-square test: cross is a NumPy two-dimensional array (the cross tabulation table)
x2, p, dof, expected = stats.chi2_contingency(cross)
expected = np.array(expected)  # chi2_contingency already returns an ndarray; kept for clarity
# Cochran's rule: check which cells have an expected frequency below 5
expected < 5

If the number of True cells is less than 20% of the total, Cochran's rule is satisfied. If your data does not meet this rule, it is better to switch to Fisher's exact test (https://en.wikipedia.org/wiki/Fisher%27s_exact_test).
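As a rough sketch, assuming cross is a 2x2 table (scipy.stats.fisher_exact only handles 2x2 tables; larger tables need other tools), the check and the fallback could look like this:

low_ratio = (expected < 5).mean()  # share of cells with expected frequency below 5
if low_ratio >= 0.2:
    # Cochran's rule is violated; fall back to Fisher's exact test
    odds_ratio, p_fisher = stats.fisher_exact(cross)
    print(p_fisher)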

Residual analysis

If you perform a chi-square test in R, the adjusted standardized residuals of each cell are returned along with the test result, so there is no problem, but in Python it seems you need to compute them yourself. The residual is defined simply as

Residual = Observed value - Expected value

but to calculate the adjusted standardized residuals, the **residual variance** also needs to be defined.

Residual variance = (1 - \frac{Row marginal total}{Grand total})(1 - \frac{Column marginal total}{Grand total})

See the reference site below for the details of the derivation. Based on this, the adjusted standardized residual can be calculated as

Adjusted standardized residual = \frac{Residual}{\sqrt{Expected value \times Residual variance}}

Writing this flow in Python code looks like the following.

# Residuals (observed value minus expected value)
res = cross - expected
# Residual variance of each cell:
# (1 - column marginal total / grand total) * (1 - row marginal total / grand total)
res_var = np.zeros(res.shape)
it = np.nditer(cross, flags=['multi_index'])
while not it.finished:
    i, j = it.multi_index
    res_var[i, j] = (1 - cross[:, j].sum() / cross.sum()) * (1 - cross[i, :].sum() / cross.sum())
    it.iternext()
# Adjusted standardized residuals
stdres = res / np.sqrt(expected * res_var)
# A cell deviates significantly at the 5% level if the absolute value of its
# adjusted standardized residual is 1.96 or greater. Here the values are also
# converted to two-sided p-values using the standard normal distribution.
p_values = stats.norm.sf(np.abs(stdres)) * 2
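Putting it all together, a usage sketch on the hypothetical table from the introduction (made-up counts, with the residual-variance loop above written in vectorized form) might look like this:

import numpy as np
from scipy import stats

cross = np.array([[45, 15],
                  [20, 40]])
x2, p, dof, expected = stats.chi2_contingency(cross)
res = cross - expected
n = cross.sum()
# Residual variance from the row and column marginal totals (vectorized)
res_var = np.outer(1 - cross.sum(axis=1) / n, 1 - cross.sum(axis=0) / n)
stdres = res / np.sqrt(expected * res_var)
p_values = stats.norm.sf(np.abs(stdres)) * 2
print(stdres)    # |value| >= 1.96 marks a cell that deviates significantly at the 5% level
print(p_values)  # corresponding two-sided p-values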

Did you find it useful?

Reference site

https://note.chiebukuro.yahoo.co.jp/detail/n71838
