[PYTHON] Practice typical methods of statistics (1)

"Statistics is the strongest study" and "[Statistics is the strongest study (Practice edition)](http://www.amazon.co.jp/dp/4478028230)" by Hiromu Nishiuchi have become exceptional bestsellers, with cumulative sales of over 370,000 copies across the series. I expect many people have read them.

Across these two books, the various methods that appear in statistics textbooks are summarized in a single table from the viewpoint of the "[generalized linear model](http://ja.wikipedia.org/wiki/%E4%B8%80%E8%88%AC%E5%8C%96%E7%B7%9A%E5%BD%A2%E3%83%A2%E3%83%87%E3%83%AB)".

I will quote the table here.

[Table image: summary of the generalized linear model, from "Statistics is the strongest study", p. 170]

[Table image: expanded version of the "one table that dramatically advances the understanding of statistics", from "Statistics is the strongest study (Practice edition)", p. 344]

These two books explain the statistical methods often used in business: what they mean, what ideas they grew out of, and how to use them.

In addition, p. 357 of the Practice edition lists three kinds of knowledge that cannot be obtained from the book itself:

  1. Practice using tools and real data
  2. Deep understanding of the mathematical methods
  3. More advanced methods developed in recent years

Starting with this post, I would like to focus on item 1 above and work through examples on simple data, using the analysis language I have been using so far.

That said, some of these methods have already been covered in earlier posts, so let's treat those as a review.

Case-control study and chi-square test

This is the story of Doll and Hill's "case-control study", which is said to be the first epidemiological estimate.

To investigate the link between lung cancer and smoking, 1465 lung cancer inpatients at hospitals across the UK were surveyed between 1948 and 1952. The results were as follows:

| Group | Number of people | Smokers | Non-smokers |
|---|---|---|---|
| Male lung cancer patients | 1357 | 1350 (99.5%) | 7 (0.5%) |
| Male non-lung cancer patients | 1357 | 1296 (95.5%) | 61 (4.5%) |
| Female lung cancer patients | 108 | 68 (63.0%) | 40 (37.0%) |
| Female non-lung cancer patients | 108 | 49 (45.4%) | 59 (54.6%) |
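As a quick sanity check on the table, the percentages can be recomputed from the raw counts. The following is only a small sketch: the dictionary `groups` and the helper `smoking_shares` are names made up for this illustration, and the group labels assume that the second row for each sex is the control (non-lung cancer) group.

```python
# Counts copied from the table above as (smokers, non-smokers) per group.
# The labels and the helper name are illustrative assumptions.
groups = {
    "Male lung cancer patients": (1350, 7),
    "Male non-lung cancer patients": (1296, 61),
    "Female lung cancer patients": (68, 40),
    "Female non-lung cancer patients": (49, 59),
}

def smoking_shares(smokers, non_smokers):
    """Return (group size, smoker %, non-smoker %) for one group."""
    total = smokers + non_smokers
    return total, 100.0 * smokers / total, 100.0 * non_smokers / total

for label, (smokers, non_smokers) in groups.items():
    total, pct_smokers, pct_non_smokers = smoking_shares(smokers, non_smokers)
    print(f"{label}: n={total}, "
          f"smokers {pct_smokers:.1f}%, non-smokers {pct_non_smokers:.1f}%")
```

The two case groups add up to the 1465 lung cancer inpatients mentioned above (1357 men and 108 women).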

In epidemiology, a case is a patient who has developed the disease, and a control is a comparison subject who has not.

When a chi-square test is performed on this data, the result is as follows.

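import numpy as np
from scipy import stats

# Men's data (rows: lung cancer patients, non-lung cancer patients;
# columns: smokers, non-smokers)
male = np.array([[1350, 7], [1296, 61]])
# Women's data (same layout)
female = np.array([[68, 40], [49, 59]])

def chi_squared_test(data):
    """Run a chi-square test of independence on a contingency table."""
    # Chi-square value, p-value, degrees of freedom, expected frequencies.
    # For a 2x2 table, chi2_contingency applies Yates' continuity
    # correction by default.
    x2, p, dof, expected = stats.chi2_contingency(data)
    return x2, p, dof, expected

# Store each result under its own name so the second call
# does not overwrite the first
results_male = chi_squared_test(male)
results_female = chi_squared_test(female)
print(results_male)
print(results_female)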

For men, the chi-square value is 42.3704259482 and the p-value is 7.5523446617e-11 with 1 degree of freedom, so the difference is statistically significant. Similarly, for women the chi-square value is 6.04195804196 and the p-value is 0.0139697819212 with 1 degree of freedom, which is also significant. Both p-values are below 0.05.
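To see where these numbers come from, the statistic can also be computed by hand. The sketch below assumes the reported values come from SciPy's default behavior for 2x2 tables, which applies Yates' continuity correction. The function name `yates_chi_square` is made up for this illustration and is not part of SciPy.

```python
import numpy as np
from scipy import stats

def yates_chi_square(table):
    """Chi-square statistic for a 2x2 table with Yates' correction (illustrative helper)."""
    observed = np.asarray(table, dtype=float)
    row_totals = observed.sum(axis=1, keepdims=True)
    col_totals = observed.sum(axis=0, keepdims=True)
    # Expected counts if smoking and lung cancer were independent
    expected = row_totals * col_totals / observed.sum()
    # Yates' correction: shrink each |observed - expected| by 0.5, never below zero
    diff = np.maximum(np.abs(observed - expected) - 0.5, 0.0)
    x2 = (diff ** 2 / expected).sum()
    # A 2x2 table has (2 - 1) * (2 - 1) = 1 degree of freedom
    p = stats.chi2.sf(x2, df=1)
    return x2, p, expected

men = [[1350, 7], [1296, 61]]
women = [[68, 40], [49, 59]]

for name, table in (("men", men), ("women", women)):
    x2, p, expected = yates_chi_square(table)
    print(name, x2, p)
    print(expected)
```

For men, the expected counts are 1323 smokers and 34 non-smokers in each row, so every cell deviates by 27 before the correction and by 26.5 after it.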

In other words, we cannot say that smoking is unrelated to lung cancer.

To be continued in the next post.
