Statistics with Python

Total and average


numpy.sum(data) #total
numpy.mean(data) #average

Maximum, minimum, and median

numpy.amax(data)
numpy.amin(data)
numpy.median(data)

Variance

An index that indicates how far the data are spread from the mean value.

\sigma^2=\frac{1}{N}\sum_{i=1}^{N} (x_i-\mu)^2
numpy.var(data, ddof = 0)

Unbiased variance

The sample variance above is calculated using the sample mean, which biases it toward underestimating the population variance. The bias-corrected version is called the unbiased variance.

\sigma^2=\frac{1}{N-1}\sum_{i=1}^{N} (x_i-\mu)^2
numpy.var(data, ddof = 1)

Hereafter, unbiased variance will be used.
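A quick check of the difference, using made-up data: with ddof = 0 numpy divides by N, and with ddof = 1 it divides by N − 1, so the unbiased estimate comes out slightly larger.

```python
import numpy as np

# Made-up sample for illustration
data = np.array([10.0, 12.0, 9.0, 11.0, 13.0])

var_sample = np.var(data, ddof=0)    # divides by N
var_unbiased = np.var(data, ddof=1)  # divides by N - 1

print(var_sample, var_unbiased)  # 2.0 2.5
```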

Standard deviation

The square root of the variance

\begin{align}
\sigma&=\sqrt{\sigma^2}\\
&=\sqrt{\frac{1}{N-1}\sum_{i=1}^{N} (x_i-\mu)^2}
\end{align}
numpy.std(data, ddof=1)

Covariance

- When the covariance is greater than 0:
→ if one variable takes a large value, the other also tends to be large
→ there is a positive correlation.
- When the covariance is less than 0:
→ if one variable takes a large value, the other tends to be small
→ there is a negative correlation.

Cov(x,y)=\frac{1}{N}\sum_{i=1}^{N} (x_i-\mu_x)(y_i-\mu_y)

print(cov_data) # displays the DataFrame of x and y used below (the original screenshot is omitted)

import numpy as np
#Data retrieval
x = cov_data["x"]
y = cov_data["y"]
#Sample size
N = len(cov_data)
#Calculation of mean values (np.mean; scipy's sp.mean has been removed)
mu_x = np.mean(x)
mu_y = np.mean(y)
#Covariance (unbiased, dividing by N - 1)
cov = sum((x - mu_x) * (y - mu_y)) / (N - 1)

Covariance matrix

\Sigma=
\begin{bmatrix}
\sigma_x^2 & Cov(x,y) \\
Cov(x,y) & \sigma_y^2 
\end{bmatrix}
np.cov(x, y, ddof = 1)

When retrieving a value from a matrix

hoge = np.cov(x, y, ddof = 1)
cov = hoge[1,0]

Pearson's product moment correlation coefficient

The covariance standardized so that its maximum value is 1 and its minimum value is −1.

\rho_{xy}=\frac{Cov_{(x,y)}}{\sqrt{\sigma_x^2\sigma_y^2}}

#Variance calculation (use the same ddof as the covariance; here ddof = 0)
sigma_2_x_sample = np.var(x, ddof = 0)
sigma_2_y_sample = np.var(y, ddof = 0)
#Covariance with the same ddof
cov_sample = sum((x - np.mean(x)) * (y - np.mean(y))) / len(x)
#Correlation coefficient
cov_sample / np.sqrt(sigma_2_x_sample * sigma_2_y_sample)
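A self-contained sketch with made-up data, showing that the manual formula agrees with numpy.corrcoef (any ddof gives the same result, as long as the covariance and the variances use the same one, because the factor cancels in the ratio).

```python
import numpy as np

# Made-up data for illustration
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.0, 4.0, 5.0, 4.0, 6.0])

# Covariance and variances with a matching ddof
cov = np.cov(x, y, ddof=0)[0, 1]
rho = cov / np.sqrt(np.var(x, ddof=0) * np.var(y, ddof=0))

print(rho)
print(np.corrcoef(x, y)[0, 1])  # same value
```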

Correlation matrix

\rho=
\begin{bmatrix}
1 & \rho_{xy} \\
\rho_{xy} & 1
\end{bmatrix}

numpy.corrcoef(x,y)

Standardization

A transformation that makes the mean of the data 0 and the standard deviation 1: subtract the mean from each data point, then divide by the standard deviation.

standard = (data - numpy.mean(data)) / numpy.std(data, ddof=1)
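A minimal check with made-up data that the transformed values really have mean 0 and standard deviation 1:

```python
import numpy as np

data = np.array([2.0, 4.0, 6.0, 8.0])  # made-up data
standard = (data - np.mean(data)) / np.std(data, ddof=1)

print(np.mean(standard))         # ~0 (up to floating-point error)
print(np.std(standard, ddof=1))  # 1.0
```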

Probability density

Probability for a continuous variable [^1]. For a continuous variable, the probability of any single exact value is always 0, because values can have infinitely many decimal places; for example, no one is exactly 160 centimeters tall. However, the "probability that a person's height is between 159 cm and 160 cm" can be calculated from the probability density. Integrated over the whole range of the variable, the probability density gives 1.

cf. The probability of a discrete variable [^2] is the probability most people learn at school (e.g. P(x) = 1/4).

More precisely, for a real-valued variable X, consider the probability that x ≤ X ≤ x + Δx; dividing by Δx and letting Δx → 0 gives P(x), the probability density at x.
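The height example can be sketched with scipy: the probability of an interval is a difference of two CDF values. The mean of 160 cm and standard deviation of 10 cm below are made-up parameters.

```python
from scipy import stats

# Hypothetical height distribution: mean 160 cm, standard deviation 10 cm
p = stats.norm.cdf(160, loc=160, scale=10) - stats.norm.cdf(159, loc=160, scale=10)
print(p)  # P(159 cm <= height <= 160 cm), roughly 0.04
```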

Random variable

When calculating a probability, the variable whose values the probabilities are assigned to is called a random variable. Suppose the probability that x = 2 is 1/3; here x is the random variable and 2 is one of its realized values.

Normal distribution probability density function

N(x|\mu, \sigma^2)=\frac{1}{\sqrt{2\pi\sigma^2}}e^{-{\frac{(x-\mu)^2}{2\sigma^2}}}

Example: When the random variable x = 3, mean = 4, standard deviation = 0.8

>>>x = 3
>>>mu = 4
>>>sigma = 0.8
>>>1 / (numpy.sqrt(2 * numpy.pi * sigma**2)) * numpy.exp(- ((x - mu)**2) / (2 * sigma**2))
0.228

You can easily do it with the function below.

>>>stats.norm.pdf(loc = 4, scale = 0.8, x = 3)
0.228

Cumulative distribution function and lower probability, percentage point

F(x)=P(X\leq x)

The cumulative distribution function is the function expressed above, that is, "a function that calculates the probability that the variable takes a value less than or equal to a given value". The value obtained is called the lower probability, and the x at that point is called the percentage point. For a normal distribution it can be obtained by the integral below; in code, use the cdf function of the corresponding scipy.stats distribution.

P(X\leq x)=\int_{-\infty}^{x}\frac{1}{\sqrt{2\pi\sigma^2}}e^{-{\frac{(x-\mu)^2}{2\sigma^2}}}dx
>>>import scipy as sp
>>>from scipy import stats
>>>stats.norm.cdf(loc = 4, scale = 0.8, x = 3) #loc is mean, scale is standard deviation
0.106

Function to find percentage points: the ppf function

Percentage point where the lower probability is 2.5%

>>>stats.norm.ppf(loc = 4, scale = 0.8, q = 0.025)
2.432
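ppf is the inverse of cdf: feeding the percentage point back into cdf recovers the lower probability.

```python
from scipy import stats

x = stats.norm.ppf(q=0.025, loc=4, scale=0.8)  # percentage point, about 2.432
p = stats.norm.cdf(x=x, loc=4, scale=0.8)      # back to the lower probability
print(x, p)
```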

t-value and its sampling distribution

t=\frac{\hat{\mu}-\mu}{\frac{\hat{\sigma}}{\sqrt{N}}}

That is,

t\text{-value}=\frac{\text{sample mean}-\text{population mean}}{\text{standard error}}

The distribution of t-values obtained by repeating the sampling many times is the sampling distribution of the t-value.
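A sketch of the calculation, with a made-up sample and a hypothesized population mean of 50:

```python
import numpy as np

sample = np.array([52.0, 49.0, 55.0, 51.0, 48.0, 53.0])  # made-up data
mu0 = 50.0  # hypothesized population mean

N = len(sample)
mu_hat = np.mean(sample)            # sample mean
sigma_hat = np.std(sample, ddof=1)  # unbiased standard deviation
se = sigma_hat / np.sqrt(N)         # standard error

t_value = (mu_hat - mu0) / se
print(t_value)  # about 1.265
```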

t distribution

The sampling distribution of the t-value when the population distribution is normal is called the t-distribution.

t-test

Used to check whether the mean of the data differs from a specific value. However, the specific t-test procedure depends on whether the samples are paired. See the following page for details: Functions of stats module
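For the simplest case (one sample, no pairing), the test can be sketched with scipy's stats.ttest_1samp; the sample and the hypothesized mean of 50 below are made up.

```python
import numpy as np
from scipy import stats

sample = np.array([52.0, 49.0, 55.0, 51.0, 48.0, 53.0])  # made-up data

# One-sample t-test: does the mean differ from 50?
t_stat, p_value = stats.ttest_1samp(sample, popmean=50.0)
print(t_stat, p_value)
```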

Pearson residual

It is interpreted as "the ordinary residual divided by the standard deviation of the distribution". Example with a binomial distribution:
- When p = 0.5, the outcome is 0 or 1 with equal probability, so a wrong prediction is not surprising; such a deviation is treated as a "small deviation" by the Pearson residual.
- When p = 0.9, the prediction should be correct with high probability; a wrong prediction here is treated as a "large deviation" by the Pearson residual.

\begin{align}
\text{Pearson residuals} &= \frac{y-N\hat{p}}{\sqrt{N\hat{p}(1-\hat{p})}}\\
&=\frac{y-\hat{p}}{\sqrt{\hat{p}(1-\hat{p})}}
\end{align}

\hat{p} denotes the estimated probability of success (the second line is the N = 1, Bernoulli case).

The sum of squares of the Pearson residuals is the Pearson chi-square statistic.
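A minimal sketch of the Bernoulli (N = 1) case, with made-up outcomes y and estimated probabilities p̂:

```python
import numpy as np

# Made-up binary outcomes and estimated success probabilities
y = np.array([1, 0, 1, 1])
p_hat = np.array([0.5, 0.5, 0.9, 0.9])

# Pearson residual for the N = 1 (Bernoulli) case
residuals = (y - p_hat) / np.sqrt(p_hat * (1 - p_hat))

# Sum of squares = Pearson chi-square statistic
chi2 = np.sum(residuals ** 2)
print(residuals)  # a miss at p = 0.5 gives -1.0; a miss at p = 0.9 would give -3.0
print(chi2)
```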

[^1]: A variable that takes values with decimal parts and varies continuously.
Example: length in cm (3 cm, 4.5 cm). [^2]: A variable that takes only integer values.
Example: a count of people.
