[PYTHON] Hypothesis test and probability distribution

Yesterday's post explained summary statistics and interval estimation as prerequisites for hypothesis testing. Before going further, let's review the NumPy and SciPy functions that we often use to compute statistics.

Frequently used basic statistic calculation function

Let's assume you have the numeric vectors X and Y. Note that `import numpy as np` and `from scipy import stats` are prerequisites.

| Function | Description |
|---|---|
| `np.max(X)` | Find the maximum value of X |
| `np.min(X)` | Find the minimum value of X |
| `np.mean(X)` | Find the mean of X |
| `np.median(X)` | Find the median of X |
| `np.var(X)` | Find the variance of X (population variance; pass `ddof=1` for the unbiased variance) |
| `np.std(X)` | Find the standard deviation of X (same `ddof` convention as `np.var`) |
| `stats.scoreatpercentile(X, 25)` | Find the first quartile of X |
| `stats.scoreatpercentile(X, 75)` | Find the third quartile of X |
| `np.dot(X, Y)` | Find the inner product of the vectors X and Y |
| `np.outer(X, Y)` | Find the outer product of X and Y |
| `np.corrcoef(X, Y)[0,1]` | Find the correlation coefficient between X and Y |
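As a quick check of the functions in the table, here is a minimal sketch on two small vectors. The values of X and Y are hypothetical and were chosen only so that the results are easy to verify by hand.

```python
import numpy as np
from scipy import stats

# Hypothetical sample vectors, chosen only to make the results easy to check.
X = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
Y = np.array([2.0, 4.0, 6.0, 8.0, 10.0])

summary = {
    "max": np.max(X),
    "min": np.min(X),
    "mean": np.mean(X),
    "median": np.median(X),
    "population variance (ddof=0)": np.var(X),
    "unbiased variance (ddof=1)": np.var(X, ddof=1),
    "standard deviation (ddof=0)": np.std(X),
    "first quartile": stats.scoreatpercentile(X, 25),
    "third quartile": stats.scoreatpercentile(X, 75),
    "inner product": np.dot(X, Y),
    "correlation coefficient": np.corrcoef(X, Y)[0, 1],
}
for name, value in summary.items():
    print(f"{name}: {value}")

# The outer product of two length-5 vectors is a 5x5 matrix.
outer = np.outer(X, Y)
print(outer.shape)
```

Note that `np.var` and `np.std` divide by n by default, so the result differs from the unbiased estimate unless `ddof=1` is passed.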

Hypothesis test and probability distribution

A hypothesis test, more precisely a statistical hypothesis test or significance test, starts from a hypothesis, so you first have to state one. Roughly speaking, hypotheses like the following can be considered.

An example of a hypothesis case

  1. Each face of a die should come up with probability 1/6, but 6 seems to come up more often. Is the die biased?
  2. After a month of dieting, my weight decreased from 75 kg to 70 kg. However, measured body weight fluctuates because of measurement error and daily variation. If the measurements follow a normal distribution with a standard deviation of 1 kg, including these errors and fluctuations, can I say that I really lost weight?
  3. The number of cases of an epidemic, as confirmed by medical diagnosis, was investigated in two areas, A and B. Over 10 surveys, 52 patients were found in A and 28 in B. At first glance, area A seems to have a higher prevalence, but the populations of the two areas differ. If we know that the number of patients follows a Poisson distribution, can we really say so?
  4. Suppose we believe that women in their 30s buy twice as many products as women in their 20s. A survey of 100 women to confirm this found 52 women in their 30s, 30 women in their 20s, and 15 in other age groups. However, when the number of people surveyed was increased to 150, 300, and 500, the ratio changed. How should we make a decision in such a case?
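Case 2 can be turned into a small calculation. The sketch below makes a simplifying assumption that is not stated in the text: the null hypothesis is that the true weight is still 75 kg, and the 70 kg reading is a single measurement with a standard deviation of 1 kg. Under that assumption it computes a one-sided p value.

```python
from scipy import stats

weight_before = 75.0  # kg, from case 2
weight_after = 70.0   # kg, from case 2
sigma = 1.0           # kg, standard deviation of errors and daily fluctuations

# Assumption: under the null hypothesis the true weight is still 75 kg,
# and the new reading is one draw from N(75, sigma^2).
z = (weight_after - weight_before) / sigma

# One-sided p value: probability of a reading this low or lower by chance.
p_value = stats.norm.cdf(z)

print(f"z = {z}")
print(f"one-sided p value = {p_value:.3g}")
```

A reading five standard deviations below the mean is extremely unlikely under this null hypothesis, so the weight loss would be judged real. If both readings were treated as noisy measurements, the standard deviation of the difference would be larger and the calculation would change accordingly.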

Sampling distribution of statistics

The following table summarizes how the sum of a sample X1, ..., Xn is distributed for typical population distributions.

| Population | Description |
|---|---|
| Binomial population | If the population distribution is the Bernoulli distribution Bi(1, p) with parameter p, then X1 + ... + Xn follows the binomial distribution Bi(n, p). |
| Poisson population | If the population distribution is the Poisson distribution Po(λ) with parameter λ, then X1 + ... + Xn follows the Poisson distribution Po(nλ). |
| Normal population | If the population distribution is the normal distribution N(μ, σ^2) with parameters μ and σ, then X1 + ... + Xn follows the normal distribution N(nμ, nσ^2). |
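These three statements can be checked roughly by simulation. The following is a sketch only: the parameter values, the seed, and the number of trials are arbitrary choices, and it compares just the mean and variance of the simulated sums with the theoretical values.

```python
import numpy as np

rng = np.random.default_rng(0)   # fixed seed so the run is repeatable
trials = 100_000                 # number of simulated samples (arbitrary)
n = 5                            # sample size (arbitrary)
p, lam, mu, sigma = 0.3, 2.0, 1.0, 2.0  # hypothetical population parameters

# Each row is one sample of size n; summing a row gives X1 + ... + Xn.
bernoulli_sums = rng.binomial(1, p, size=(trials, n)).sum(axis=1)
poisson_sums = rng.poisson(lam, size=(trials, n)).sum(axis=1)
normal_sums = rng.normal(mu, sigma, size=(trials, n)).sum(axis=1)

# Theoretical mean and variance of Bi(n, p), Po(n*lam), N(n*mu, n*sigma^2).
expected = {
    "binomial": (n * p, n * p * (1 - p)),
    "poisson": (n * lam, n * lam),
    "normal": (n * mu, n * sigma ** 2),
}
simulated = {
    "binomial": (bernoulli_sums.mean(), bernoulli_sums.var()),
    "poisson": (poisson_sums.mean(), poisson_sums.var()),
    "normal": (normal_sums.mean(), normal_sums.var()),
}
for name in expected:
    print(name, "theory:", expected[name], "simulation:", simulated[name])
```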

Main continuous probability distributions

Normal distribution

The **normal distribution** appears very often. Reading the [Wikipedia description](http://en.wikipedia.org/wiki/%E6%AD%A3%E8%A6%8F%E5%88%86%E5%B8%83) may be faster, but the definition is as follows.

f(x) = \frac{1}{\sqrt{2\pi}\sigma} \exp \left\{ -\frac{(x-\mu)^2}{2\sigma^2} \right\}, \quad -\infty \lt x \lt \infty

When the random variable X follows a normal distribution, the expected value is:

E(X) = \int_{-\infty}^{\infty} x \frac{1}{\sqrt{2\pi}\sigma} \exp \left\{ -\frac{(x-\mu)^2}{2\sigma^2} \right\} dx = \mu

Similarly, the variance is given by

V(X) = \int_{-\infty}^{\infty} (x-\mu)^2 \frac{1}{\sqrt{2\pi}\sigma} \exp \left\{ -\frac{(x-\mu)^2}{2\sigma^2} \right\} dx = \sigma^2

For this reason, the normal distribution with mean μ and variance σ^2 is written as follows.

N(\mu, \sigma^2)
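The density and the two integrals can be checked numerically. The sketch below uses hypothetical values for μ and σ, compares a hand-written density with `scipy.stats.norm.pdf`, and evaluates the mean and variance integrals with `scipy.integrate.quad`.

```python
import numpy as np
from scipy import integrate, stats

mu, sigma = 1.0, 2.0  # hypothetical parameters


def normal_pdf(x, mu, sigma):
    """Density of N(mu, sigma^2), written out from the definition."""
    return np.exp(-(x - mu) ** 2 / (2 * sigma ** 2)) / (np.sqrt(2 * np.pi) * sigma)


# Compare with SciPy's implementation on a grid of points.
x = np.linspace(-5.0, 7.0, 25)
matches_scipy = np.allclose(normal_pdf(x, mu, sigma),
                            stats.norm.pdf(x, loc=mu, scale=sigma))

# Numerically evaluate E(X) and V(X) over the whole real line.
mean, _ = integrate.quad(lambda t: t * normal_pdf(t, mu, sigma), -np.inf, np.inf)
variance, _ = integrate.quad(lambda t: (t - mu) ** 2 * normal_pdf(t, mu, sigma),
                             -np.inf, np.inf)

print(matches_scipy)
print(mean, variance)  # should be close to mu and sigma^2
```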

Exponential distribution

The **exponential distribution** is a continuous distribution defined by the following probability density function.

f(x) = \begin{cases} {\lambda}e^{-{\lambda}x} & (x \ge 0) \\ 0 & (x \lt 0) \end{cases}

This distribution describes continuous waiting times. Examples include the waiting time until an event, the lifespan or service life of a system with a constant failure rate, and the number of years until a disaster.

The expected value and variance of the random variable X that follows this distribution can be calculated by the following equations.

E(X) = 1/{\lambda} \\
V(X) = 1/{\lambda^2}

A rare event whose time until occurrence follows an exponential distribution can perfectly well occur in the near future, even if its probability is small. A large earthquake is a familiar and easy-to-understand example.
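To make this concrete, the sketch below uses a hypothetical rate of one event per 100 years on average. It checks the mean and variance formulas with `scipy.stats.expon` and illustrates the memoryless property: the probability of waiting another t years does not depend on how long one has already waited.

```python
import numpy as np
from scipy import stats

lam = 0.01  # hypothetical rate: on average one event per 100 years
dist = stats.expon(scale=1 / lam)  # SciPy parameterizes by scale = 1 / lambda

mean, variance = dist.mean(), dist.var()  # 1 / lam and 1 / lam^2
print(mean, variance)

# Probability that the event occurs within the next 30 years.
p_within_30 = dist.cdf(30.0)
print(f"P(X <= 30) = {p_within_30:.3f}")

# Memoryless property: P(X > s + t | X > s) equals P(X > t).
s, t = 50.0, 30.0
p_conditional = dist.sf(s + t) / dist.sf(s)
p_unconditional = dist.sf(t)
print(p_conditional, p_unconditional)
```

Even for an event with a 100-year mean, the chance of it occurring within 30 years is roughly one in four, which is why such events are not surprising when they occur soon.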

Main discrete probability distributions

Poisson distribution

Consider a binomial distribution, such as repeated coin tosses. When n is large and p is small (a rare event among a large number of observations), Poisson's law of small numbers holds: the binomial distribution Bi(n, p) is well approximated by a Poisson distribution with λ = np. Easy examples are a lottery in which only 3 out of 1000 tickets win and the rest lose, or the success rate of selling very expensive products that have a very low probability of reaching a contract. The probability function is as follows.

P(X = k) = \frac {{\lambda}^k e^{-\lambda}} {k!}, \quad k = 0, 1, 2, \ldots, \quad \lambda \gt 0

If the random variable X follows a Poisson distribution, both the expected value and the variance are equal to λ, which is a distinguishing feature of the Poisson distribution.

E(X) = \lambda \\
V(X) = \lambda
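The lottery example (3 winners in 1000 tickets) can be used to compare the binomial and Poisson probabilities directly. This is a minimal sketch; the range of k values shown is an arbitrary choice.

```python
import numpy as np
from scipy import stats

n, p = 1000, 0.003  # lottery example: 3 winning tickets per 1000
lam = n * p         # Poisson parameter used for the approximation

k = np.arange(0, 11)  # number of wins to compare (arbitrary range)
binom_pmf = stats.binom.pmf(k, n, p)
poisson_pmf = stats.poisson.pmf(k, lam)

for ki, b, q in zip(k, binom_pmf, poisson_pmf):
    print(f"k={ki:2d}  binomial={b:.5f}  poisson={q:.5f}")

# Largest pointwise difference between the two probability functions.
max_gap = np.max(np.abs(binom_pmf - poisson_pmf))
print(f"max gap = {max_gap:.2e}")

# Mean and variance of the Poisson distribution are both lambda.
print(stats.poisson.mean(lam), stats.poisson.var(lam))
```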

Various hypothesis tests

Chi-square test

The **chi-square test**, which also appeared the other day, tests whether variances agree. It is a test in which the test statistic follows a [chi-square distribution](http://en.wikipedia.org/wiki/%E3%82%AB%E3%82%A4%E4%BA%8C%E4%B9%97%E5%88%86%E5%B8%83) when the null hypothesis is true.

When n random samples X_1, ..., X_n are drawn from the normal distribution N(μ, σ^2), define

Z = \sum_{i=1}^n \frac {(X_i - \mu)^2} {\sigma^2}

Then Z follows a chi-square distribution with n degrees of freedom.
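A rough simulation can illustrate this statement. The parameters, seed, and number of trials below are hypothetical choices, and the sketch compares only the mean and variance of the simulated Z with those of the chi-square distribution (n and 2n).

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)   # fixed seed so the run is repeatable
n, mu, sigma = 5, 10.0, 2.0      # hypothetical sample size and parameters
trials = 100_000                 # number of simulated samples (arbitrary)

# Each row is one sample of size n from N(mu, sigma^2).
samples = rng.normal(mu, sigma, size=(trials, n))

# Z = sum of squared standardized deviations, one value per sample.
Z = (((samples - mu) / sigma) ** 2).sum(axis=1)

print("mean:", Z.mean(), "theory:", stats.chi2.mean(n))
print("variance:", Z.var(), "theory:", stats.chi2.var(n))
```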

For example, suppose you observe a shopping street and count 45 women and 55 men. These 100 people look unbalanced, but suppose we want to test the hypothesis that the male-female ratio is actually fifty-fifty.

\chi^2 = \frac {(45-50)^2} {50} + \frac {(55-50)^2} {50} = 1

Here the number of degrees of freedom is 1. For a chi-square distribution with one degree of freedom, the probability of a value of 1 or more is about 0.32 under the assumption that men and women are equally likely, so the null hypothesis is not rejected. In other words, a split like this can easily happen by chance.
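The same calculation can be reproduced with SciPy. The sketch below uses `stats.chisquare` for the goodness-of-fit test and, as a cross-check, the survival function of the chi-square distribution with one degree of freedom.

```python
from scipy import stats

observed = [45, 55]  # women, men counted on the shopping street
expected = [50, 50]  # counts expected under a fifty-fifty ratio

# Goodness-of-fit test; degrees of freedom = number of categories - 1 = 1.
statistic, p_value = stats.chisquare(observed, f_exp=expected)
print(f"chi-square statistic = {statistic}")
print(f"p value = {p_value:.4f}")

# Cross-check: upper-tail probability of chi-square(1) at the statistic.
p_from_sf = stats.chi2.sf(statistic, df=1)
print(f"p value from chi2.sf = {p_from_sf:.4f}")
```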

t-test

The **t-test (Student's t-test)** tests the mean for small samples. For a sample of size n drawn from a normally distributed population, let μ be the population mean, \bar{X} the sample mean, and s the sample standard deviation (computed with divisor n). T is then defined by the following equation.

T = \frac {\sqrt{n-1} (\bar{X} - \mu)} s

Then T follows a t distribution with n-1 degrees of freedom.
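The formula can be compared with SciPy's one-sample t-test. In this sketch the sample values and the hypothesized mean `mu0` are illustrative assumptions; s is computed with `np.std`, which divides by n, matching the √(n-1) factor in the formula.

```python
import numpy as np
from scipy import stats

scores = np.array([68, 75, 80, 71, 73, 79, 69, 65])  # small illustrative sample
mu0 = 70.0  # hypothetical population mean under the null hypothesis

n = len(scores)
s = np.std(scores)  # standard deviation with divisor n (ddof=0)

# T = sqrt(n - 1) * (sample mean - mu0) / s
T = np.sqrt(n - 1) * (scores.mean() - mu0) / s

# SciPy's one-sample t-test should give the same statistic.
t_scipy, p_value = stats.ttest_1samp(scores, popmean=mu0)

# Two-sided p value from the t distribution with n - 1 degrees of freedom.
p_manual = 2 * stats.t.sf(abs(T), df=n - 1)

print(T, t_scipy)
print(p_manual, p_value)
```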

Practice of hypothesis testing

Let's use examples to see how the chi-square test and the t-test differ and what the implementation code looks like.

Chi-square test

The chi-square test examines aggregated data such as the following to see whether the store and the product sold are related.

| Store | Product A | Product B | Total |
|---|---|---|---|
| Store X | 435 | 165 | 600 |
| Store Y | 265 | 135 | 400 |
| Total | 700 | 300 | 1000 |

The chi-square test was performed previously, so it will be omitted.
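For reference, a test of independence on this table could be sketched with `scipy.stats.chi2_contingency` as follows. This is not necessarily the procedure used in the earlier post; in particular, turning off the Yates continuity correction with `correction=False` is a choice made here so that the statistic matches the plain chi-square formula.

```python
import numpy as np
from scipy import stats

# Rows: Store X, Store Y. Columns: Product A, Product B.
observed = np.array([[435, 165],
                     [265, 135]])

# correction=False disables the Yates continuity correction (a choice made here).
chi2, p_value, dof, expected = stats.chi2_contingency(observed, correction=False)

print(f"chi-square statistic = {chi2:.4f}")
print(f"degrees of freedom = {dof}")
print(f"p value = {p_value:.4f}")
print("expected counts under independence:")
print(expected)
```

With the default `correction=True`, SciPy applies the continuity correction to 2x2 tables, which gives a somewhat smaller statistic and a larger p value.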

t-test

The t-test examines, for example, whether there is a significant difference between the Japanese and math scores in the following data. (* Pseudo data)

| Student number | Japanese | Math |
|---|---|---|
| 1 | 68 | 86 |
| 2 | 75 | 83 |
| 3 | 80 | 76 |
| 4 | 71 | 81 |
| 5 | 73 | 75 |
| 6 | 79 | 82 |
| 7 | 69 | 87 |
| 8 | 65 | 75 |

The following code runs a paired t-test on this data.

from scipy import stats

# Scores of the same eight students (pseudo data)
X = [68, 75, 80, 71, 73, 79, 69, 65]  # Japanese
Y = [86, 83, 76, 81, 75, 82, 87, 75]  # Math

print(X)
print(Y)

# Paired t-test, because both scores belong to the same students
t, p = stats.ttest_rel(X, Y)

print(f"t value is {t:.4f}")
print(f"p value is {p:.4f}")

# Compare the p value with the 5% significance level
if p < 0.05:
    print("There is a significant difference")
else:
    print("There is no significant difference")

# Output:
# [68, 75, 80, 71, 73, 79, 69, 65]
# [86, 83, 76, 81, 75, 82, 87, 75]
# t value is -2.9923
# p value is 0.0202
# There is a significant difference

We found that there was a significant difference between Japanese and math grades.
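A paired t-test is equivalent to a one-sample t-test on the per-student differences against a mean of zero. The sketch below checks that equivalence on the same Japanese and math scores; the variable names are chosen here for illustration.

```python
import numpy as np
from scipy import stats

japanese = np.array([68, 75, 80, 71, 73, 79, 69, 65])
math_scores = np.array([86, 83, 76, 81, 75, 82, 87, 75])

# Per-student difference between the two subjects.
diff = japanese - math_scores

# Paired t-test on the two score vectors.
t_rel, p_rel = stats.ttest_rel(japanese, math_scores)

# One-sample t-test on the differences, null hypothesis: mean difference is 0.
t_one, p_one = stats.ttest_1samp(diff, popmean=0.0)

print(f"mean difference = {diff.mean()}")
print(f"paired:     t = {t_rel:.4f}, p = {p_rel:.4f}")
print(f"one-sample: t = {t_one:.4f}, p = {p_one:.4f}")
```

This is also why the paired test is appropriate here: it removes the student-to-student variation that an independent two-sample test would treat as noise.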

So what about the following science and social studies grades?

| Student number | Science | Social studies |
|---|---|---|
| 1 | 85 | 80 |
| 2 | 69 | 76 |
| 3 | 77 | 84 |
| 4 | 77 | 93 |
| 5 | 75 | 76 |
| 6 | 74 | 80 |
| 7 | 87 | 79 |
| 8 | 69 | 84 |

Let's try with the same code.

# [85, 69, 77, 77, 75, 74, 87, 69]
# [80, 76, 84, 93, 76, 80, 79, 84]
# t value is -1.6077
# p value is 0.1519
# There is no significant difference

This time it turned out that there was no significant difference.
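To avoid editing the script for each pair of subjects, the test can be wrapped in a small function. The helper name `paired_t_report` and its `alpha` parameter are hypothetical, introduced here only for illustration.

```python
from scipy import stats


def paired_t_report(x, y, alpha=0.05):
    """Run a paired t-test and report whether p is below alpha.

    Hypothetical helper for this article, not part of SciPy.
    """
    t, p = stats.ttest_rel(x, y)
    return t, p, bool(p < alpha)


science = [85, 69, 77, 77, 75, 74, 87, 69]
social = [80, 76, 84, 93, 76, 80, 79, 84]

t, p, significant = paired_t_report(science, social)
print(f"t value is {t:.4f}")
print(f"p value is {p:.4f}")
print("There is a significant difference" if significant
      else "There is no significant difference")
```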
