Statistical testing (multiple comparison tests) in Python: scikit_posthocs

** What you can do with this article **

  1. Selecting a test method based on the normality and homoscedasticity of the data
  2. Running parametric and nonparametric multiple comparison tests
  3. Showing the results as a heat map that is easy to take in at a glance

Introduction

I am a graduate student majoring in biology. Python is very useful for analyzing and plotting experimental data, but I struggled because few libraries offer a wide range of tests (especially tests for significant differences between groups). Possible solutions include:

  1. Do it in R.
  2. Call the R functions from Python.

(For 2, I got it working by installing Rpy2.)

However, I really wanted to run nonparametric tests in Python itself! So, as a third option, I decided to run nonparametric tests with a library called ** scikit_posthocs **.

If you are only interested in how to use scikit_posthocs, feel free to skip ahead to the section "Implementation of scikit-posthocs".

Test Basics: How to select a test method.

Before running any tests: the following content is based on the J-STAGE article series "For those who do not understand statistical tests": I, [II](https://www.jstage.jst.go.jp/article/kagakutoseibutsu/51/6/51_408/_pdf), [Ⅲ](https://www.jstage.jst.go.jp/article/kagakutoseibutsu/51/7/51_483/_pdf)

What follows is my own interpretation of these articles.

A full explanation would be long, so I will add details when I have time. In short, I follow the flow below.

** Unpaired (independent) data in 3 or more groups **
    ↓ Normality test (Shapiro-Wilk test, Q-Q plot, ...) → otherwise go to the nonparametric flow
    ↓ Homoscedasticity test (Bartlett's test) → otherwise go to the nonparametric flow
    ↓ One-way analysis of variance (ANOVA) → otherwise go to the nonparametric flow
    ↓ Tukey HSD test, Scheffe test, Tukey test (same n in each group), Dunnett test (comparison with a control group)

The article I referred to recommends nonparametric tests in the following cases:

  1. When the null hypothesis of normality or of homoscedasticity is rejected
  2. When ANOVA shows no significant difference
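The parametric checks in this flow can be sketched with scipy.stats. This is only a minimal sketch under assumptions: the three groups are synthetic data made up for illustration, and the 0.05 threshold is a conventional choice rather than something the article prescribes.

```python
import numpy as np
from scipy import stats

# Synthetic data for illustration only (not the data used in this article).
rng = np.random.default_rng(0)
groups = {
    "A": rng.normal(10.0, 2.0, size=30),
    "B": rng.normal(12.0, 2.0, size=30),
    "C": rng.normal(15.0, 2.0, size=30),
}
samples = list(groups.values())
alpha = 0.05  # assumed significance level

# Normality: Shapiro-Wilk test per group (H0: the sample comes from a normal distribution).
shapiro_p = {}
for name, x in groups.items():
    _, p = stats.shapiro(x)
    shapiro_p[name] = p
looks_normal = all(p >= alpha for p in shapiro_p.values())

# Homoscedasticity: Bartlett's test (H0: all groups have equal variance).
_, bartlett_p = stats.bartlett(*samples)
looks_equal_var = bartlett_p >= alpha

# One-way ANOVA (H0: all group means are equal).
_, anova_p = stats.f_oneway(*samples)

print("Shapiro-Wilk p-values:", shapiro_p)
print("Bartlett p-value:", bartlett_p)
print("ANOVA p-value:", anova_p)
print("Stay in the parametric flow:", looks_normal and looks_equal_var and anova_p < alpha)
```

Small p-values in the first two tests speak against normality or equal variance, which is the point at which the flow switches to the nonparametric branch.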

** If you go to the nonparametric flow in the flowchart above **
    ↓ Homoscedasticity test (Levene test, Fligner test)
    ↓ Nonparametric one-way analysis of variance (Kruskal-Wallis test)
    ↓ Steel-Dwass (DSCF) test, Conover test

These are also post hoc tests, so I wondered whether the one-way analysis of variance has to show a significant difference first. According to the article above, however, it is not necessary to run the analysis of variance beforehand.
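The nonparametric branch can be sketched with scipy.stats in the same way. Again this is a rough sketch, not a definitive recipe: the skewed samples are synthetic and exist only to show the calls.

```python
import numpy as np
from scipy import stats

# Skewed synthetic data for illustration only (exponential, so clearly non-normal).
rng = np.random.default_rng(1)
samples = [
    rng.exponential(scale=1.0, size=40),
    rng.exponential(scale=2.0, size=40),
    rng.exponential(scale=8.0, size=40),
]

# Homoscedasticity tests that are less sensitive to non-normality than Bartlett's test.
_, levene_p = stats.levene(*samples, center="median")
_, fligner_p = stats.fligner(*samples)

# Kruskal-Wallis test: the rank-based counterpart of one-way ANOVA.
_, kruskal_p = stats.kruskal(*samples)

print("Levene p-value:", levene_p)
print("Fligner p-value:", fligner_p)
print("Kruskal-Wallis p-value:", kruskal_p)
```

The pairwise comparisons themselves (Steel-Dwass, Conover) are not in scipy.stats, which is where scikit_posthocs comes in.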

Implementation of scikit-posthocs

scikit_posthocs is a library that covers many tests not available in scipy or statsmodels, and it is very easy to use. The official documentation is well organized, so please have a look at it and at the GitHub repository.

It depends on NumPy, pandas, scipy, statsmodels, matplotlib, and seaborn. scikit_posthocs can be installed with pip.

!pip install scikit_posthocs

The tests I tried this time are the following.

  1. Tukey HSD test
  2. Tukey test
  3. Scheffe test
  4. Pairwise t-test (probably not a good choice, because it simply repeats the t-test many times)
  5. Steel-Dwass test
  6. Conover test

Any of these tests (other than Tukey HSD) can be run as follows.

import scikit_posthocs as sp
import seaborn as sns

# Load the Titanic data
df = sns.load_dataset("titanic")

# Steel-Dwass test
# val_col is the column that holds the values
# group_col is the column that holds the groups you want to compare
sp.posthoc_dscf(df, val_col="fare", group_col="class")

The result is returned as a data frame whose rows and columns are the groups and whose cells are the p-values.

[Screenshot: data frame of pairwise p-values]

Implemented collectively from selection of test method to illustration

I put the above together on GitHub: rola-bio/stats_test. Download stats_test.py from the repository into your working directory and import it. When you run stats_test(), the normality and homoscedasticity of the data are tested and the analysis of variance is performed automatically, following the flow above. The data are then analyzed with a suitable test, and the significant difference results are drawn together with a bar graph of the data. By default, one of the Tukey HSD, Steel-Dwass, and Conover tests is selected.

Now let's use this function to analyze how fares differ between types of passenger, using the Titanic passenger data that ships with seaborn as a sample dataset.

titanic.ipynb



import stats_test as st
import seaborn as sns

#Load Titanic data
df = sns.load_dataset("titanic")
df.head()
[Screenshot: first rows of the Titanic data frame]

Next, call stats_test() with the data frame, the column of values you want to test, and the column to group by. This time, I grouped the passengers by port of embarkation (embark_town).

titanic.ipynb


st.stats_test(df,val_col="fare",group_col="embark_town")

Oops? I got an error when I ran this.

TypeError: '<' not supported between instances of 'float' and 'str'

Apparently there are missing values (NaN) in fare or embark_town. You may get this error when group_col contains a mix of types, such as ints or null values alongside strings. If ints are the cause, you can deal with it with

df["column name"] = df["column name"].astype(str)

This time, I removed the NaN rows with dropna, as shown below.

titanic.ipynb


st.stats_test(df.dropna(subset=["embark_town"]),val_col="fare",group_col="embark_town")
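To see why mixed labels cause trouble, here is a small sketch on a made-up data frame (`df_demo` and its values are assumptions for illustration). NaN is a float, so ordering it against strings fails with the same kind of TypeError, and both fixes described above remove the mix of types.

```python
import numpy as np
import pandas as pd

# Made-up miniature of the Titanic columns, for illustration only.
df_demo = pd.DataFrame({
    "fare": [7.25, 71.28, 8.05, 80.0, 13.0],
    "embark_town": ["Southampton", "Cherbourg", "Southampton", np.nan, "Queenstown"],
    "pclass": [3, 1, 3, 1, 2],
})

# Sorting labels that mix strings with a missing value raises TypeError.
try:
    sorted(df_demo["embark_town"].tolist())
    mixed_sort_failed = False
except TypeError as err:
    mixed_sort_failed = True
    print("TypeError:", err)

# Fix 1: drop the rows whose group label is missing.
clean = df_demo.dropna(subset=["embark_town"])
print(sorted(clean["embark_town"].unique()))

# Fix 2: if the group column holds ints, convert it to strings.
as_text = df_demo["pclass"].astype(str)
print(as_text.tolist())
```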

Result

[Screenshot: output of stats_test(), with a plot of the data on the left and a heat map of the test results on the right]

** How to read the screen ** The annotation at the top of the image shows that the data were judged to be nonparametric with unequal variances. The Kruskal-Wallis test also indicated a significant difference, so the Conover test was selected automatically. The left side of the figure plots the data, and the right side shows the test results as a heat map.
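A heat map like this essentially bins each pairwise p-value into a significance level. As a rough sketch of that idea in plain pandas (the p-values below are made up, and the cut-offs 0.001, 0.01, and 0.05 are assumed conventional levels, not read from stats_test.py):

```python
import pandas as pd

towns = ["Cherbourg", "Queenstown", "Southampton"]

# Made-up pairwise p-values for illustration only (not the article's results).
p_values = pd.DataFrame(
    [[1.0, 0.0004, 0.03],
     [0.0004, 1.0, 0.2],
     [0.03, 0.2, 1.0]],
    index=towns,
    columns=towns,
)

def significance_label(p):
    """Map a p-value to the level a heat map cell would be colored by."""
    if p < 0.001:
        return "p < 0.001"
    if p < 0.01:
        return "p < 0.01"
    if p < 0.05:
        return "p < 0.05"
    return "NS"

# Label every cell, then blank out the diagonal (a group compared with itself).
labels = p_values.apply(lambda col: col.map(significance_label))
for town in towns:
    labels.loc[town, town] = "-"

print(labels)
```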

Apparently every pair of groups differs significantly, with p-values below 0.001. People who boarded at Cherbourg are significantly richer ...

Further divide the group into men and women

Oops! That was a hasty remark on my part.

People who boarded at Cherbourg are significantly richer

That result did not distinguish between men and women, so next let's run the significant difference test separately for each sex.

titanic.ipynb


for sex in df["sex"].unique():
    print("""
This result is from {} 
""".format(sex))
    df_query = df.query("sex =='{}'".format(sex))
    st.stats_test(df_query.dropna(subset=["embark_town"]),
                  val_col="fare",group_col="embark_town")
[Screenshots: stats_test() output for each sex]

These are the results. They are a little hard to compare because the colors assigned to the ports of embarkation differ from the first result. You can adjust this by modifying sign_barplot() in the package.

In any case, Cherbourg passengers seem to be significantly richer among both men and women. (Grr...) However, for men, the p-value for the fare difference between Southampton and Cherbourg has risen to about 0.01. Is it because of Maya Yoshida?

That's it.

By the way, if you pass result=True to stats_test(), the results of the intermediate tests are also displayed. You can choose the test yourself by passing test="test name". (Alternatively, you can change it by editing the function one_way_ANOVA() in stats_test.py.)

When I have time, I may also write up how the individual tests are implemented. For details, please refer to the code itself.
