[PYTHON] Distribution and test

Distribution

Random numbers and uniform distribution

First, let's generate uniform random numbers and plot their distribution.

#Import the library for handling random numbers.
import random
sample_size = 10 #Number of random numbers generated

#Store uniform random numbers in dist (short for distribution)
dist = [random.random() for i in range(sample_size)]
#Check the contents of dist.
dist
#Import a library to illustrate diagrams and graphs.
import matplotlib.pyplot as plt
%matplotlib inline
#Draw a histogram.
plt.hist(dist)
plt.grid()
plt.show()

Try increasing the number of random numbers generated

As the number of random numbers generated increases, the histogram approaches the shape of the "ideal" distribution.

sample_size = 100 #Number of random numbers generated

#Store uniform random numbers in dist
dist = [random.random() for i in range(sample_size)]

#Draw a histogram.
plt.hist(dist)
plt.grid()
plt.show()
sample_size = 1000 #Number of random numbers generated

#Store uniform random numbers in dist
dist = [random.random() for i in range(sample_size)]

#Draw a histogram.
plt.hist(dist)
plt.grid()
plt.show()
sample_size = 10000 #Number of random numbers generated

#Store uniform random numbers in dist
dist = [random.random() for i in range(sample_size)]

#Draw a histogram.
plt.hist(dist)
plt.grid()
plt.show()
sample_size = 100000 #Number of random numbers generated

#Store uniform random numbers in dist
dist = [random.random() for i in range(sample_size)]

#Draw a histogram.
plt.hist(dist)
plt.grid()
plt.show()
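
To put a number on this convergence, the following is a minimal sketch (not part of the original notebook) that measures how far the fraction of samples in each of 10 equal-width bins strays from the ideal value of 0.1. The helper name `max_bin_deviation` and the fixed seed are assumptions made for illustration.

```python
import random

def max_bin_deviation(sample_size, n_bins=10, seed=0):
    """Largest gap between an observed bin fraction and the ideal 1/n_bins."""
    rng = random.Random(seed)  # fixed seed (an assumption) so runs are repeatable
    counts = [0] * n_bins
    for _ in range(sample_size):
        # random() returns a value in [0, 1), so the index is always 0..n_bins-1
        counts[int(rng.random() * n_bins)] += 1
    ideal = 1 / n_bins
    return max(abs(c / sample_size - ideal) for c in counts)

for size in [10, 100, 1000, 10000, 100000]:
    print(size, max_bin_deviation(size))
```

The printed deviation should generally shrink as the sample size grows, which is the numerical counterpart of the histograms flattening out.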

Try increasing the number of bins

A "bin" is literally a box used for sorting things, such as a garbage bin. When drawing a histogram, the picture changes depending on how many bins the data are sorted into. If you increase the number of bins, you can see the finer shape of the distribution, but the number of data points that fall into each bin naturally decreases.

sample_size = 100000 #Number of random numbers generated

#Store uniform random numbers in dist
dist = [random.random() for i in range(sample_size)]

#Draw a histogram.
plt.hist(dist, bins=100) #Increase the number of bins
plt.grid()
plt.show()
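
The trade-off between the number of bins and the count per bin can also be checked numerically. The sketch below is an illustration under stated assumptions: it uses `np.histogram` to get the bin counts without drawing, and a seeded NumPy generator in place of the `random` module used above.

```python
import numpy as np

rng = np.random.default_rng(0)  # seeded generator (an assumption) for repeatability
samples = rng.random(100000)    # uniform random numbers in [0, 1)

for n_bins in [10, 100, 1000]:
    counts, edges = np.histogram(samples, bins=n_bins, range=(0.0, 1.0))
    # Relative spread: standard deviation of the bin counts divided by their mean
    print(n_bins, counts.mean(), counts.std() / counts.mean())
```

With more bins the mean count per bin drops, and the relative spread between bins tends to grow, so the histogram looks noisier.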

Binomial distribution

__np.random.binomial(n, p)__ returns the number of odd results when a roulette wheel that produces an odd number with probability p (and an even number with probability 1-p) is spun n times. The distribution of this count is called a binomial distribution.

Binomial distribution with equal probability

Spin a roulette wheel on which odd and even numbers are equally likely 10 times and count the number of odd results. Repeat this 10,000 times. What is the probability that odd and even numbers appear the same number of times (5 times each)?

#Import the library of numerical calculations.
import numpy as np
sample_size = 10000 #Number of random numbers generated

#Spin a roulette wheel n times; each spin is odd with probability p (even with probability 1-p)
#Distribution of the number of odd results
dist = [np.random.binomial(n=10, p=0.5) for i in range(sample_size)]

#Draw a histogram.
plt.hist(dist, bins=100)
plt.grid()
plt.show()

As you can see from the figure above, if you spin a roulette wheel with equal odds of odd and even numbers 10 times, the probability that odd and even numbers appear the same number of times (5 times each) is about 25% (about 2,500 of the 10,000 trials). You may find this surprisingly small.
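
The simulated figure can be compared with the exact binomial probability. The following is a small sketch using only the standard library; the variable names are chosen here for illustration.

```python
from math import comb

n = 10       # number of spins
k = 5        # number of odd results
p_odd = 0.5  # probability of an odd number on each spin

# Binomial probability: C(n, k) * p^k * (1 - p)^(n - k)
prob = comb(n, k) * p_odd**k * (1 - p_odd)**(n - k)
print(prob)
```

The exact value is 252/1024, or roughly 0.246, which agrees with the "about 25%" read off the histogram.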

Is that roulette rigged?

You were watching other guests play roulette at a casino. Odd numbers seemed to come up extremely often, so you began to suspect that the wheel was rigged. If it is not rigged, odd and even numbers should appear with equal probability. However, this wheel produced an odd number 60 times out of 100. Is this roulette rigged?

When a roulette wheel with equal odds of odd and even numbers is spun 100 times, what is the probability that odd numbers appear 60 times or more? First, let's draw the distribution.

sample_size = 10000 #Number of random numbers generated

#Spin a roulette wheel n times; each spin is odd with probability p (even with probability 1-p)
#Distribution of the number of odd results
dist = [np.random.binomial(n=100, p=0.5) for i in range(sample_size)]

#Draw a histogram.
plt.hist(dist, bins=100)
plt.grid()
plt.show()

Using the same simulation, let's estimate the probability of getting an odd number 60 times or more in 100 spins.

sample_size = 10000 #Number of random numbers generated

#Spin a roulette wheel n times; each spin is odd with probability p (even with probability 1-p)
#Distribution of the number of odd results
dist = [np.random.binomial(n=100, p=0.5) for i in range(sample_size)]

p = sum([1 for n in dist if n >= 60]) / sample_size
print("p-value: %(p)s " %locals())

When a roulette wheel with equal odds of odd and even numbers is spun 100 times, the probability that odd numbers appear 60 times or more purely by chance turns out to be less than 5%. In other words, when a wheel gives odd numbers 60 or more times out of 100, it is reasonable to suspect that it is rigged.

The probability p computed here is called the p-value (significance probability).
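
Because the simulated p-value changes slightly from run to run, it can be useful to compare it with the exact tail probability. The sketch below assumes SciPy is installed and uses `scipy.stats.binom.sf`, the survival function of the binomial distribution; the name `p_exact` is chosen here for illustration.

```python
from scipy import stats

# P(X >= 60) for X ~ Binomial(n=100, p=0.5).
# sf(k) is the survival function P(X > k), so pass 59 to include 60 itself.
p_exact = stats.binom.sf(59, 100, 0.5)
print("exact p-value: %s" % p_exact)
```

The exact value is about 0.028, so the simulated estimate should land close to it.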

Exercise 1

We saw that a roulette wheel producing odd numbers 60 or more times out of 100 can reasonably be suspected of being rigged. Now suppose odd numbers appear 6 or more times out of 10. The proportion of odd numbers is the same 60%, but can you still say the wheel is rigged? Calculate the p-value and answer.

#Exercise 1

Binomial distribution with unequal probabilities

It is estimated that 5% of the population has a certain infectious disease. If 20 people are randomly selected from the population, how many of them will be affected? This count also follows a binomial distribution. Let's draw the distribution.

sample_size = 10000 #Number of random numbers generated

#Sample n people, each affected with probability p (unaffected with probability 1-p)
#Distribution of the number of affected people
dist = [np.random.binomial(n=20, p=0.05) for i in range(sample_size)]

#Draw a histogram.
plt.hist(dist, bins=100)
plt.grid()
plt.show()
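
For reference, the exact probabilities behind this histogram can be listed with the binomial probability mass function. This is a sketch assuming SciPy is available; it is not part of the original notebook.

```python
from scipy import stats

n, p = 20, 0.05  # 20 people sampled, 5% prevalence (values from the text)
for k in range(5):
    # pmf(k) is the exact probability that exactly k of the n people are affected
    print(k, stats.binom.pmf(k, n, p))
```

Multiplying each probability by the 10,000 repetitions gives the approximate bar heights to expect in the histogram.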

Exercise 2

It is estimated that 5% of the population has a certain infectious disease. When 100 people were randomly selected from the population, more than 10 of them were affected.

(1) Estimate the probability that it will happen by chance.

(2) How should the result be interpreted?

#Exercise 2

Normal distribution

__random.normalvariate(mu, sigma)__ is a function that generates random numbers following a normal distribution (mu is the mean, sigma is the standard deviation).

Standard normal distribution

A normal distribution with a mean of 0 and a standard deviation of 1 is called a "standard normal distribution". Let's draw a standard normal distribution.

sample_size = 10000 #Number of random numbers generated

dist = [random.normalvariate(mu=0, sigma=1) for i in range(sample_size)]

#Draw a histogram.
plt.hist(dist, bins=100)
plt.grid()
plt.show()

What is the probability that a random number following the standard normal distribution takes a value of 2 or more? Let's estimate it.

sample_size = 10000 #Number of random numbers generated

dist = [random.normalvariate(mu=0, sigma=1) for i in range(sample_size)]

p = sum([1 for n in dist if n >= 2]) / sample_size
print("p-value: %(p)s " %locals())
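
The simulated value can be checked against the exact tail probability of the standard normal distribution. The sketch below uses `statistics.NormalDist` from the standard library (Python 3.8 or later); the variable names are illustrative.

```python
from statistics import NormalDist

standard_normal = NormalDist(mu=0, sigma=1)
# P(X >= 2) = 1 - P(X <= 2), using the cumulative distribution function
p_exact = 1 - standard_normal.cdf(2)
print("exact tail probability: %s" % p_exact)
```

The exact value is about 0.0228, so roughly 2.3% of draws should be 2 or more.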

Deviation value

The "deviation value", a standardized score often used for university entrance exams, is assumed to follow a normal distribution with a mean of 50 and a standard deviation of 10. Let's draw the distribution. Here, think of the vertical axis as the number of students.

sample_size = 10000 #Number of random numbers generated

#Normal distribution with mean 50 and standard deviation 10
dist = [random.normalvariate(mu=50, sigma=10) for i in range(sample_size)]

#Draw a histogram.
plt.hist(dist, bins=100)
plt.grid()
plt.show()
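
The text does not say how a deviation value is obtained from a raw exam score. A common definition, assumed here, is 50 + 10 × (score − mean) / standard deviation. The sketch below applies it to a made-up list of scores (`raw_scores` is hypothetical data, not from the original).

```python
import statistics

raw_scores = [62, 71, 55, 88, 94, 67, 73, 80]  # hypothetical exam scores

mean = statistics.mean(raw_scores)
sd = statistics.pstdev(raw_scores)  # population standard deviation

# Deviation value = 50 + 10 * (score - mean) / standard deviation (assumed definition)
deviation_values = [50 + 10 * (x - mean) / sd for x in raw_scores]
for x, d in zip(raw_scores, deviation_values):
    print(x, round(d, 1))
```

By construction the converted values have a mean of 50 and a standard deviation of 10, which is why the distribution above is drawn with mu=50 and sigma=10.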

Exercise 3

How many out of 10,000 students have a deviation value of 70 or more?

#Exercise 3

Test

import numpy as np #Library for numerical calculation
import scipy as sp #Scientific calculation library
from scipy import stats #Statistical calculation library

Chi-square test

The chi-square test is a method used to test whether two distributions are the same.

A die was rolled 60 times and the number of times each face came up was counted, with the following result.

Dice roll 1 2 3 4 5 6
Number of occurrences 17 10 6 7 15 5

Let's test whether these counts follow the theoretical (uniform) distribution.

significance = 0.05
o = [17, 10, 6, 7, 15, 5] #Measured value
e = [10, 10, 10, 10, 10, 10] #Theoretical value

chi2, p = stats.chisquare(o, f_exp = e)

print('chi2 value is %(chi2)s' %locals())
print('The probability is %(p)s' %locals())

if p < significance:
    print('At a significance level of %(significance)s, there is a significant difference' %locals())
else:
    print('At a significance level of %(significance)s, there is no significant difference' %locals())

chi2 value is 12.4
The probability is 0.029699459203520212
At a significance level of 0.05, there is a significant difference
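
To see where these numbers come from, the statistic can be recomputed by hand. The following sketch sums (observed − expected)² / expected over the six faces and looks up the upper tail of the chi-square distribution with 5 degrees of freedom; the names `chi2_manual` and `p_manual` are chosen here for illustration.

```python
from scipy import stats

o = [17, 10, 6, 7, 15, 5]      # observed counts (from the text)
e = [10, 10, 10, 10, 10, 10]   # expected counts under a fair die

# Test statistic: sum of (observed - expected)^2 / expected
chi2_manual = sum((oi - ei) ** 2 / ei for oi, ei in zip(o, e))

# Degrees of freedom = number of categories - 1
df = len(o) - 1
# Upper-tail probability of the chi-square distribution
p_manual = stats.chi2.sf(chi2_manual, df)

print(chi2_manual, p_manual)
```

The result should match the chi2 value of 12.4 and the probability of about 0.0297 reported above.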

Exercise 4

When vegetables grown by method A and method B received the shipping grades shown in the table below, should we conclude that the growing method is related to the product grade?

Method Excellent Good Fair Total
A method 12 30 58 100
B method 14 90 96 200
Total 26 120 154 300
#Exercise 4

Unpaired t-test

#Unpaired t-test
significance = 0.05
X = [68, 75, 80, 71, 73, 79, 69, 65]
Y = [86, 83, 76, 81, 75, 82, 87, 75]

t, p = stats.ttest_ind(X, Y)

print('The t value is %(t)s' %locals())
print('The probability is %(p)s' %locals())

if p < significance:
    print('At a significance level of %(significance)s, there is a significant difference' %locals())
else:
    print('At a significance level of %(significance)s, there is no significant difference' %locals())

The t value is -3.214043146821967
The probability is 0.006243695014300228
At a significance level of 0.05, there is a significant difference
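
As a cross-check, the t value can be rebuilt from the sample means and variances. This sketch assumes the equal-variance (pooled) form of the unpaired t-test, which is what `stats.ttest_ind` computes by default; the names `pooled_var`, `t_manual` and `p_manual` are illustrative.

```python
import math
import statistics
from scipy import stats

X = [68, 75, 80, 71, 73, 79, 69, 65]
Y = [86, 83, 76, 81, 75, 82, 87, 75]
nx, ny = len(X), len(Y)

# Pooled variance: weighted average of the two sample variances
pooled_var = ((nx - 1) * statistics.variance(X)
              + (ny - 1) * statistics.variance(Y)) / (nx + ny - 2)
t_manual = (statistics.mean(X) - statistics.mean(Y)) / math.sqrt(
    pooled_var * (1 / nx + 1 / ny))

# Two-sided p-value from the t distribution with nx + ny - 2 degrees of freedom
p_manual = 2 * stats.t.sf(abs(t_manual), nx + ny - 2)
print(t_manual, p_manual)
```

If the two groups cannot be assumed to have equal variances, passing `equal_var=False` to `stats.ttest_ind` performs Welch's t-test instead.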

Exercise 5

The same math test was given to two classes, 6th grade class 1 and 6th grade class 2. Test whether the scores differ between the two classes.

6th grade class 1 Score 6th grade class 2 Score
1 70 1 85
2 75 2 80
3 70 3 95
4 85 4 70
5 90 5 80
6 70 6 75
7 80 7 80
8 75 8 90
class_one = [70, 75, 70, 85, 90, 70, 80, 75]
class_two = [85, 80, 95, 70, 80, 75, 80, 90] 
#Exercise 5

Paired t-test

#Paired t-test
significance = 0.05
X = [68, 75, 80, 71, 73, 79, 69, 65]
Y = [86, 83, 76, 81, 75, 82, 87, 75]

t, p = stats.ttest_rel(X, Y)

print('The t value is %(t)s' %locals())
print('The probability is %(p)s' %locals())

if p < significance:
    print('At a significance level of %(significance)s, there is a significant difference' %locals())
else:
    print('At a significance level of %(significance)s, there is no significant difference' %locals())

The t value is -2.9923203754253302
The probability is 0.02016001617368161
At a significance level of 0.05, there is a significant difference
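
A paired t-test is equivalent to a one-sample t-test on the per-pair differences, testing whether their mean is zero. The sketch below illustrates this with `stats.ttest_1samp`; the names `diff`, `t_diff` and `p_diff` are chosen here for illustration.

```python
from scipy import stats

X = [68, 75, 80, 71, 73, 79, 69, 65]
Y = [86, 83, 76, 81, 75, 82, 87, 75]

# Per-pair differences; the paired test asks whether their mean is zero
diff = [x - y for x, y in zip(X, Y)]
t_diff, p_diff = stats.ttest_1samp(diff, 0)
print(t_diff, p_diff)
```

The t value and probability should agree with the `stats.ttest_rel` output above. Note that the same data gave a different result in the unpaired test, because the pairing changes how the variability is estimated.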

Exercise 6

Test whether there is a difference between the Japanese and arithmetic scores.

6th grade class 1 Japanese Arithmetic
1 90 95
2 75 80
3 75 80
4 75 80
5 80 75
6 65 75
7 75 80
8 80 85
kokugo =   [90, 75, 75, 75, 80, 65, 75, 80]
sansuu = [95, 80, 80, 80, 75, 75, 80, 85]
#Exercise 6

Analysis of variance

#One-factor analysis of variance
significance = 0.05
a = [34, 39, 50, 72, 54, 50, 58, 64, 55, 62]
b = [63, 75, 50, 54, 66, 31, 39, 45, 48, 60]
c = [49, 36, 46, 56, 52, 46, 52, 68, 49, 62]
f, p = stats.f_oneway(a, b, c)

print('The f value is %(f)s' %locals())
print('The probability is %(p)s' %locals())

if p < significance:
    print('At a significance level of %(significance)s, there is a significant difference' %locals())
else:
    print('At a significance level of %(significance)s, there is no significant difference' %locals())

The f value is 0.09861516667148518
The probability is 0.9064161716556407
At a significance level of 0.05, there is no significant difference
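
The f value is the ratio of the between-group variance to the within-group variance. The following sketch recomputes it by hand for the same three groups, as an illustration of what `stats.f_oneway` does; the variable names are chosen here and are not from the original.

```python
import statistics
from scipy import stats

groups = [
    [34, 39, 50, 72, 54, 50, 58, 64, 55, 62],
    [63, 75, 50, 54, 66, 31, 39, 45, 48, 60],
    [49, 36, 46, 56, 52, 46, 52, 68, 49, 62],
]
k = len(groups)                         # number of groups
n_total = sum(len(g) for g in groups)   # total number of observations
grand_mean = statistics.mean(x for g in groups for x in g)

# Between-group and within-group sums of squares
ss_between = sum(len(g) * (statistics.mean(g) - grand_mean) ** 2 for g in groups)
ss_within = sum((x - statistics.mean(g)) ** 2 for g in groups for x in g)

# F = mean square between / mean square within
f_manual = (ss_between / (k - 1)) / (ss_within / (n_total - k))
p_manual = stats.f.sf(f_manual, k - 1, n_total - k)
print(f_manual, p_manual)
```

An f value far below 1, as here, means the group means differ less than the scatter within each group would suggest, hence the large probability.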

Exercise 7

Perform an analysis of variance using the data below.

group1 = [80, 75, 80, 90, 95, 80, 80, 85, 85, 80, 90, 80, 75, 90, 85, 85, 90, 90, 85, 80]
group2 = [75, 70, 80, 85, 90, 75, 85, 80, 80, 75, 80, 75, 70, 85, 80, 75, 80, 80, 90, 80]
group3 = [80, 80, 80, 90, 95, 85, 95, 90, 85, 90, 95, 85, 98, 95, 85, 85, 90, 90, 85, 85]
#Exercise 7

Exercise 8

Choose one of the following Twitter survey results and perform a statistical test on it. Then discuss the result from a statistical point of view.
