The Jupyter notebook is available below.
https://gist.github.com/hnishi/544c77e35b98b737bbd004a1a9ac8924
- A summary statistic is a statistical value that summarizes the characteristics of a sample distribution; it is a type of statistic.
- It mainly represents the center and spread of the data distribution.
- Summary statistics are also called basic statistics, descriptive statistics, or representative values.
Examples of summary statistics are given below.
The mean $\mu$ is the first-order moment about the origin: the sum of the data divided by the number of data points.
Statistics obtained from the second-order central moment represent the spread of the distribution.
Variance: $\sigma^2 = \mu_2$, standard deviation: $\sigma = \sqrt{\mu_2}$
Skewness is a statistic obtained from the third-order central moment. It represents the degree of left-right asymmetry of the distribution.
Kurtosis is a statistic obtained from the fourth-order central moment. It represents the sharpness of the peak of the distribution (the spread of its tails).
However, some definitions do not subtract 3.
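For reference, with the simple $1/N$ definitions assumed here, the mean, the $m$-th central moment, and the two shape statistics above can be written as:

$$ \mu = \frac{1}{N} \sum_{i=1}^{N} x_i, \qquad \mu_m = \frac{1}{N} \sum_{i=1}^{N} (x_i - \mu)^m $$

$$ \text{skewness} = \frac{\mu_3}{\sigma^3}, \qquad \text{kurtosis} = \frac{\mu_4}{\sigma^4} - 3 $$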
[1]: The term "$m$-th central moment" follows Kei Takeuchi (ed.), *Statistics Dictionary*, Toyo Keizai Inc., 1989.
Below, consider statistics defined on $N$ data points sorted in ascending order, $x_1 \le x_2 \le \dots \le x_N$ (order statistics).
The median is the data value located exactly at the center of the data when sorted by size.
The trimmed mean is the average computed after excluding the maximum and minimum values. If the number of excluded values is increased until only the central value remains, the result is the median; the median is therefore one kind of trimmed mean [^1].
When the data are divided into four equal parts by value, the boundary values are the quartiles. $x_{(N+3)/4}$ is called the first quartile and $x_{(3N+1)/4}$ the third quartile. $x_{(2N+2)/4}$, i.e., the second quartile, is the median.
The smallest value $x_1$ and the largest value $x_N$ in the population.
A box plot is used to visualize these statistics.
The value obtained by adding the maximum and minimum values and dividing by 2 is called the midpoint value (mid-range) and is sometimes used as a representative value.
The difference between the maximum and minimum values is called the range and is also sometimes used as a representative value. The symbol $R$ is used for it.
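As a small illustration, these order statistics can be computed directly with NumPy (a minimal sketch on toy data; note that `np.percentile` interpolates, so for small samples its quartiles may differ slightly from the index formulas above):

```python
import numpy as np
import matplotlib.pyplot as plt

data = np.array([2, 4, 4, 5, 7, 9, 11])   # toy data (already sorted)

median = np.median(data)                   # second quartile
q1, q3 = np.percentile(data, [25, 75])     # first and third quartiles
minimum, maximum = data.min(), data.max()

mid_range = (maximum + minimum) / 2        # midpoint value
data_range = maximum - minimum             # range R

print(median, q1, q3, minimum, maximum, mid_range, data_range)

plt.boxplot(data)                          # box plot visualizing these statistics
plt.show()
```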
[^1]: Yasuo Nishioka, *Mathematics Tutorial: Easy-Talking Probability and Statistics*, Ohmsha, p.5, 2013, ISBN 9784274214073.
The mode is the value that appears with the highest frequency in the frequency distribution, i.e., the most frequently occurring data value.
Unbiased variance $u^2$
When the data constitute the entire population, the ordinary variance is used; the unbiased variance is used when inferring the population variance from a sample. The Excel function VAR() computes the unbiased variance.
In the field of machine learning, the ordinary variance described above is often used instead of the unbiased variance. (Whichever you use, the results are similar and the interpretation is almost the same.)
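A minimal sketch of the difference between the two variances; note that `np.var` defaults to the ordinary (population) variance (`ddof=0`), whereas pandas' `var()` defaults to the unbiased variance (`ddof=1`), like Excel's VAR():

```python
import numpy as np
import pandas as pd

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])

print(np.var(x))              # 2.0 -> ordinary variance, divides by N
print(np.var(x, ddof=1))      # 2.5 -> unbiased variance, divides by N - 1
print(pd.Series(x).var())     # 2.5 -> pandas defaults to ddof=1
```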
Reference: https://www.heisei-u.ac.jp/ba/fukui/pdf/stattext05.pdf
What is the iris dataset?
A dataset famous in machine learning. "Iris" is the name of the flower (ayame in Japanese), and the data are distributed by UCI (University of California, Irvine) as data for studying machine learning and data mining.
The types of irises are as follows.
- Setosa
- Versicolor
- Virginica
Each sample in the dataset has the following four features.
- Sepal Length
- Sepal Width
- Petal Length
- Petal Width
The unit is cm.
https://carp.cc.it-hiroshima.ac.jp/~tateyama/Lecture/AppEx/LoadCSV.html
import pandas as pd
import numpy as np
from matplotlib import pyplot as plt
import seaborn as sns
from sklearn import datasets
iris = datasets.load_iris()
iris_df = pd.DataFrame(data=iris.data, columns=iris.feature_names)
iris_df['name'] = iris.target_names[iris.target]
# The main summary statistics can be output easily with pandas.
# Each statistic could of course be computed individually, but that is omitted here.
iris_df.describe()
| | sepal length (cm) | sepal width (cm) | petal length (cm) | petal width (cm) |
|---|---|---|---|---|
| count | 150.000000 | 150.000000 | 150.000000 | 150.000000 |
| mean | 5.843333 | 3.057333 | 3.758000 | 1.199333 |
| std | 0.828066 | 0.435866 | 1.765298 | 0.762238 |
| min | 4.300000 | 2.000000 | 1.000000 | 0.100000 |
| 25% | 5.100000 | 2.800000 | 1.600000 | 0.300000 |
| 50% | 5.800000 | 3.000000 | 4.350000 | 1.300000 |
| 75% | 6.400000 | 3.300000 | 5.100000 | 1.800000 |
| max | 7.900000 | 4.400000 | 6.900000 | 2.500000 |
# Check the data with a pair plot
sns.pairplot(data=iris_df, hue='name', vars=iris.feature_names, diag_kind='hist', palette='spring')
plt.show()
When each sample has two or more features, the covariance between feature 1, $x^{(1)}$, and feature 2, $x^{(2)}$, is expressed as follows.
$$ \mathrm{cov}(x^{(1)}, x^{(2)}) = \frac{1}{N} \sum_{i = 1}^N (x^{(1)}_{i} - \mu^{(1)}) (x^{(2)}_{i} - \mu^{(2)}) $$
If there is a positive correlation between the two features, the covariance is positive; if there is a negative correlation, it is negative. The magnitude of the value indicates the strength of the relationship, but only when the two features have the same units (scales).
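To make the formula concrete, here is a minimal sketch on toy data; `bias=True` makes `np.cov` divide by $N$ as in the definition above (its default divides by $N - 1$):

```python
import numpy as np

x1 = np.array([1.0, 2.0, 3.0, 4.0])
x2 = np.array([2.0, 4.0, 6.0, 8.0])

# Covariance straight from the definition: mean of the products of deviations
cov_manual = np.mean((x1 - x1.mean()) * (x2 - x2.mean()))

# Same value from np.cov (bias=True divides by N instead of N - 1)
cov_numpy = np.cov(x1, x2, bias=True)[0, 1]

print(cov_manual, cov_numpy)  # 2.5 2.5
```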
The covariance (matrix) will be used in principal component analysis (PCA), which will be introduced next time (?).
https://ja.wikipedia.org/wiki/%E5%85%B1%E5%88%86%E6%95%A3
Consider a column vector whose elements $X_1, X_2, \dots, X_m$ represent $m$ different features.
When the elements of this vector are random variables with finite variance, the matrix $\Sigma$ whose $(i, j)$ element is the covariance $\mathrm{cov}(X_i, X_j)$ is called the variance-covariance matrix.
$N$ is the number of samples. In other words, a matrix whose diagonal elements are the variances and whose off-diagonal elements are the covariances is called a variance-covariance matrix.
You can see the covariance of all pairs of features.
Below, the variance-covariance matrix of the features of the iris dataset is shown as a heat map. The diagonal elements are the variances and the off-diagonal elements are the covariances. For example, it can be seen that there is a positive relationship between petal length and sepal length.
import numpy as np
# Create the variance-covariance matrix (np.cov treats each row as a variable, hence the transpose)
cov_mat = np.cov(iris.data.T)
df = pd.DataFrame(cov_mat, index=iris.feature_names, columns=iris.feature_names)
ax = sns.heatmap(df, annot=True, center=0, vmin=-3, vmax=3)
Covariance is difficult to interpret when comparing variables with different units, because its numerical value depends on the scale of the original values. For example, even if the covariance between each town's population and its ramen-shop sales is computed for every municipality, the meaning of the resulting number is hard to grasp.
Therefore, when looking at the relationship, it is common to use the correlation coefficient.
The correlation coefficient is the covariance divided by the product of the standard deviations of the two variables. It takes values from -1 to 1; a value of 1 means the two variables move in perfect synchrony.
With $\rho$ denoting the correlation coefficient and $X$ and $Y$ two different features:

$$ \rho = \frac{\mathrm{cov}(X, Y)}{\sigma_X \sigma_Y} $$
The correlation coefficient can therefore be regarded as a standardized covariance: it indicates how related the data are without being influenced by their units.
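Continuing the iris example from above (this sketch assumes the earlier imports and `iris_df`), the correlation matrix can be computed with pandas and drawn as a heat map on the fixed [-1, 1] scale:

```python
# Pearson correlation coefficients between all pairs of features
corr_df = iris_df.drop(columns='name').corr()

# Unlike the covariance heat map, the scale is always -1 to 1
ax = sns.heatmap(corr_df, annot=True, center=0, vmin=-1, vmax=1)
```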
- The Gaussian (normal) distribution is the most commonly used probability density function (a function whose integral gives a probability).
- The mean $\mu$ represents the center of the distribution and the standard deviation $\sigma$ represents its width.
- If a random sample $x$ is drawn from the normal distribution $N(\mu, \sigma^2)$, the probability that $x$ falls within ±1σ of the mean $\mu$ is 68.27%, within ±2σ is 95.45%, and within ±3σ is 99.73%.
- The normal distribution is not only the basis for other distributions such as the t distribution and the F distribution, but is also used throughout actual statistical inference, for example in hypothesis testing and interval estimation.
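For reference, the probability density function of the normal distribution $N(\mu, \sigma^2)$ (the same expression plotted in the code below) is:

$$ f(x) = \frac{1}{\sigma \sqrt{2 \pi}} \exp\left( - \frac{(x - \mu)^2}{2 \sigma^2} \right) $$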
reference:
Below, a histogram of data randomly generated from a Gaussian distribution is shown with the Gaussian density overlaid.
`mu` is the mean (the center of the distribution) and `sigma` is the standard deviation (the width of the distribution).
import numpy as np
# Generate random numbers following a Gaussian distribution
mu, sigma = 0, 1  # mean and standard deviation
np.random.seed(1)
s = np.random.normal(mu, sigma, 1000)
import matplotlib.pyplot as plt
# Create a normalized histogram (density=True) of the samples
count, bins, ignored = plt.hist(s, 30, density=True)
plt.plot(bins, 1/(sigma * np.sqrt(2 * np.pi)) *
np.exp( - (bins - mu)**2 / (2 * sigma**2) ),
linewidth=2, color='r')
plt.show()
Try changing the values of mu and sigma and you can see how the distribution changes. Also, since this is a probability density function, the values on the vertical axis suggest that integrating over the whole horizontal axis gives 1.
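As a quick numerical check of that statement (a sketch using `scipy.integrate.quad`; it assumes SciPy is available):

```python
import numpy as np
from scipy.integrate import quad

mu, sigma = 0, 1

def gaussian_pdf(x):
    # Same density as plotted above
    return 1 / (sigma * np.sqrt(2 * np.pi)) * np.exp(-(x - mu) ** 2 / (2 * sigma ** 2))

area, _ = quad(gaussian_pdf, -np.inf, np.inf)
print(area)  # approximately 1.0
```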
Incidentally, the standardization performed to align the scales of features applies the following transformation:

$$ x_{i_{std}} = \frac{x_i - \mu}{\sigma} $$

$x_{i_{std}}$: standardized feature $x_i$, $\mu$: mean, $\sigma$: standard deviation
This transformation makes the mean 0 and the standard deviation 1. In other words, assuming each feature is normally distributed, its distribution is converted to a Gaussian centered at 0 with a distribution width on a common scale.
Compared with min-max scaling (often called normalization), which squeezes the data into a limited range of values, this method is less affected by outliers and can therefore be said to be more practical.
(Words such as normalization and standardization are often used rather loosely in some fields, and their meaning has to be inferred from context. Also, the operation $x_i - \mu$ is called mean normalization, and multiplying by $1/\sigma$ is called feature scaling.)
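A minimal sketch of standardization on the iris features (assuming `iris_df` from above); scikit-learn's `StandardScaler` does essentially the same thing (it uses the population standard deviation, `ddof=0`, so its values differ very slightly from pandas' default `ddof=1`):

```python
# Standardize each feature: subtract the mean and divide by the standard deviation
features = iris_df.drop(columns='name')
iris_std = (features - features.mean()) / features.std()

print(iris_std.mean().round(6))  # ~0 for every feature
print(iris_std.std().round(6))   # 1 for every feature

# Roughly equivalent with scikit-learn
from sklearn.preprocessing import StandardScaler
iris_std_sk = StandardScaler().fit_transform(features)
```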
Since the data above were artificially generated from a Gaussian distribution, it is natural that the histogram matches the distribution well.
Does the Gaussian function fit natural data this nicely? Let's check using the iris dataset.
# Fit each feature of the iris dataset with a Gaussian function
import matplotlib.pyplot as plt

for i_column in iris_df.columns:
    if i_column == 'name':
        continue
    print(i_column)
    mu = iris_df[i_column].mean()
    sigma = iris_df[i_column].std()
    # Create a normalized histogram and overlay the Gaussian density
    count, bins, ignored = plt.hist(iris_df[i_column], 30, density=True)
    plt.plot(bins, 1/(sigma * np.sqrt(2 * np.pi)) *
             np.exp( - (bins - mu)**2 / (2 * sigma**2) ),
             linewidth=2, color='r')
    plt.show()
(Output: for each of sepal length (cm), sepal width (cm), petal length (cm), and petal width (cm), a histogram with the fitted Gaussian curve is displayed.)
Only the sepal width appears roughly Gaussian; the other features clearly do not follow a Gaussian distribution. From these results, looking at the iris dataset as a whole, a Gaussian distribution does not appear suitable for representing its distribution.
This is presumably because data from multiple groups (iris species) are mixed in the dataset, so let's draw a Gaussian distribution for each label (setosa, versicolor, virginica).
# Look at the distribution for each species
for i_name in iris_df['name'].unique():
    print(i_name)
    df_tmp = iris_df[iris_df['name'] == i_name]
    print(df_tmp.shape)
    for i_column in df_tmp.columns:
        if i_column == 'name':
            continue
        print(i_column)
        mu = df_tmp[i_column].mean()
        sigma = df_tmp[i_column].std()
        # Create a normalized histogram and overlay the Gaussian density
        count, bins, ignored = plt.hist(df_tmp[i_column], 10, density=True)
        plt.plot(bins, 1/(sigma * np.sqrt(2 * np.pi)) *
                 np.exp( - (bins - mu)**2 / (2 * sigma**2) ),
                 linewidth=2, color='r')
        plt.show()
(Output: for each species — setosa, versicolor, and virginica, 50 samples each — histograms with fitted Gaussian curves are displayed for sepal length (cm), sepal width (cm), petal length (cm), and petal width (cm).)
Because the amount of data for each species is small (50 samples), the number of histogram bins (the intervals used for aggregation) is reduced to 10.
When viewed per species, the distribution of every feature can be roughly represented by a Gaussian distribution.
The Gaussian distribution for each species and feature is determined by its mean and standard deviation. From the Gaussian distributions obtained in this way, a probability (density) can be computed for an unknown set of feature values, so they can also be used as a classifier (I would like to explain this when talking about anomaly detection).
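As a rough sketch of that idea (not part of the original notebook): if the features are treated as independent, the per-species Gaussian densities estimated from the means and standard deviations above can be multiplied to score a hypothetical unknown sample, for example with `scipy.stats.norm`:

```python
import numpy as np
from scipy.stats import norm

unknown = [5.0, 3.4, 1.5, 0.2]  # hypothetical unknown measurements (cm)

scores = {}
for i_name in iris_df['name'].unique():
    df_grp = iris_df[iris_df['name'] == i_name]
    # Product of the per-feature Gaussian densities (naive independence assumption)
    densities = [norm.pdf(x, loc=df_grp[col].mean(), scale=df_grp[col].std())
                 for x, col in zip(unknown, iris.feature_names)]
    scores[i_name] = np.prod(densities)

print(scores)
print('predicted:', max(scores, key=scores.get))  # most likely species under this simple model
```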
- Basic statistics, Wikipedia
- [2nd Edition] Python Machine Learning Programming: Theory and Practice by Expert Data Scientists
- Lectures by Dr. Andrew Ng