The Jupyter notebook is available below.
https://gist.github.com/hnishi/544c77e35b98b737bbd004a1a9ac8924
- A summary statistic is a statistical value that summarizes the characteristics of a sample distribution; it is a type of statistic.
- It mainly represents the center and spread of the data distribution.
- Summary statistics are also called basic statistics, descriptive statistics, or representative values.
Examples of summary statistics are given below.
The mean $\mu$ is the first-order moment about the origin: the sum of the data divided by the number of data points.
Statistics obtained from the second-order central moment represent the spread of the distribution.
Variance: $\sigma^2 = \mu_2$, standard deviation: $\sigma = \sqrt{\mu_2}$
Skewness is a statistic obtained from the third-order central moment. It represents the degree of left-right asymmetry of the distribution.
Kurtosis is a statistic obtained from the fourth-order central moment. It represents the sharpness of the peak of the distribution (the spread of its tails).
However, some definitions do not subtract 3.
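For reference, with the simple $1/N$ definitions assumed here, the mean, the $m$-th central moment, and the two shape statistics above can be written as:

$$ \mu = \frac{1}{N} \sum_{i=1}^{N} x_i, \qquad \mu_m = \frac{1}{N} \sum_{i=1}^{N} (x_i - \mu)^m $$

$$ \text{skewness} = \frac{\mu_3}{\sigma^3}, \qquad \text{kurtosis} = \frac{\mu_4}{\sigma^4} - 3 $$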
[1]: The term "$m$-th central moment" follows Kei Takeuchi (ed.), *Statistics Dictionary*, Toyo Keizai Inc., 1989.
Below, consider statistics defined on $N$ data points sorted in ascending order, $x_1 \le x_2 \le \dots \le x_N$ (order statistics).
The median is the data value located exactly at the center of the data when sorted by size.
The trimmed mean is the average computed after excluding the maximum and minimum values. If the number of excluded values is increased until only the central value remains, the result is the median; the median is therefore one kind of trimmed mean [^1].
When the data are divided into four equal parts by value, the boundary values are the quartiles. $x_{(N+3)/4}$ is called the first quartile and $x_{(3N+1)/4}$ the third quartile. $x_{(2N+2)/4}$, i.e., the second quartile, is the median.
The smallest value $x_1$ and the largest value $x_N$ in the population.
A box plot is used to visualize these statistics.
The value obtained by adding the maximum and minimum values and dividing by 2 is called the midpoint value (mid-range) and is sometimes used as a representative value.
The difference between the maximum and minimum values is called the range and is also sometimes used as a representative value. The symbol $R$ is used for it.
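As a small illustration, these order statistics can be computed directly with NumPy (a minimal sketch on toy data; note that `np.percentile` interpolates, so for small samples its quartiles may differ slightly from the index formulas above):

```python
import numpy as np
import matplotlib.pyplot as plt

data = np.array([2, 4, 4, 5, 7, 9, 11])   # toy data (already sorted)

median = np.median(data)                   # second quartile
q1, q3 = np.percentile(data, [25, 75])     # first and third quartiles
minimum, maximum = data.min(), data.max()

mid_range = (maximum + minimum) / 2        # midpoint value
data_range = maximum - minimum             # range R

print(median, q1, q3, minimum, maximum, mid_range, data_range)

plt.boxplot(data)                          # box plot visualizing these statistics
plt.show()
```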
[^1]: Yasuo Nishioka, *Mathematics Tutorial: Easy-Talking Probability and Statistics*, Ohmsha, p.5, 2013, ISBN 9784274214073.
The mode is the value that appears with the highest frequency in the frequency distribution, i.e., the most frequently occurring data value.
Unbiased variance $u^2$
When the data constitute the entire population, the ordinary variance is used; the unbiased variance is used when inferring the population variance from a sample. The Excel function VAR() computes the unbiased variance.
In the field of machine learning, the ordinary variance described above is often used instead of the unbiased variance. (Whichever you use, the results are similar and the interpretation is almost the same.)
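A minimal sketch of the difference between the two variances; note that `np.var` defaults to the ordinary (population) variance (`ddof=0`), whereas pandas' `var()` defaults to the unbiased variance (`ddof=1`), like Excel's VAR():

```python
import numpy as np
import pandas as pd

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])

print(np.var(x))              # 2.0 -> ordinary variance, divides by N
print(np.var(x, ddof=1))      # 2.5 -> unbiased variance, divides by N - 1
print(pd.Series(x).var())     # 2.5 -> pandas defaults to ddof=1
```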
Reference: https://www.heisei-u.ac.jp/ba/fukui/pdf/stattext05.pdf
What is the iris dataset?
A dataset famous in machine learning. "Iris" is the name of the flower (ayame in Japanese), and the data are distributed by UCI (University of California, Irvine) as data for studying machine learning and data mining.
The types of irises are as follows.
- Setosa
- Versicolor
- Virginica
Each sample in the dataset has the following four features.
- Sepal Length
- Sepal Width
- Petal Length
- Petal Width
The unit is cm.
https://carp.cc.it-hiroshima.ac.jp/~tateyama/Lecture/AppEx/LoadCSV.html
import pandas as pd
import numpy as np
from matplotlib import pyplot as plt
import seaborn as sns
from sklearn import datasets
iris = datasets.load_iris()
iris_df = pd.DataFrame(data=iris.data, columns=iris.feature_names)
iris_df['name'] = iris.target_names[iris.target]
# The main summary statistics can be output easily with pandas.
# Each statistic could of course be computed individually, but that is omitted here.
iris_df.describe()
| | sepal length (cm) | sepal width (cm) | petal length (cm) | petal width (cm) |
|---|---|---|---|---|
| count | 150.000000 | 150.000000 | 150.000000 | 150.000000 |
| mean | 5.843333 | 3.057333 | 3.758000 | 1.199333 |
| std | 0.828066 | 0.435866 | 1.765298 | 0.762238 |
| min | 4.300000 | 2.000000 | 1.000000 | 0.100000 |
| 25% | 5.100000 | 2.800000 | 1.600000 | 0.300000 |
| 50% | 5.800000 | 3.000000 | 4.350000 | 1.300000 |
| 75% | 6.400000 | 3.300000 | 5.100000 | 1.800000 |
| max | 7.900000 | 4.400000 | 6.900000 | 2.500000 |
# Check the data with a pair plot
sns.pairplot(data=iris_df, hue='name', vars=iris.feature_names, diag_kind='hist', palette='spring')
plt.show()
When each sample has two or more features, the covariance between feature 1, $x^{(1)}$, and feature 2, $x^{(2)}$, is expressed as follows.
$$ \mathrm{cov}(x^{(1)}, x^{(2)}) = \frac{1}{N} \sum_{i = 1}^N (x^{(1)}_{i} - \mu^{(1)}) (x^{(2)}_{i} - \mu^{(2)}) $$
If there is a positive correlation between the two features, the covariance is positive; if there is a negative correlation, it is negative. The magnitude of the value indicates the strength of the relationship, but only when the two features have the same units (scales).
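To make the formula concrete, here is a minimal sketch on toy data; `bias=True` makes `np.cov` divide by $N$ as in the definition above (its default divides by $N - 1$):

```python
import numpy as np

x1 = np.array([1.0, 2.0, 3.0, 4.0])
x2 = np.array([2.0, 4.0, 6.0, 8.0])

# Covariance straight from the definition: mean of the products of deviations
cov_manual = np.mean((x1 - x1.mean()) * (x2 - x2.mean()))

# Same value from np.cov (bias=True divides by N instead of N - 1)
cov_numpy = np.cov(x1, x2, bias=True)[0, 1]

print(cov_manual, cov_numpy)  # 2.5 2.5
```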
The covariance (matrix) will be used in principal component analysis (PCA), which will be introduced next time (?).
https://ja.wikipedia.org/wiki/%E5%85%B1%E5%88%86%E6%95%A3
Consider a column vector whose elements $X_1, X_2, \dots, X_m$ represent $m$ different features.
When the elements of this vector are random variables with finite variance, the matrix $\Sigma$ whose $(i, j)$ element is the covariance $\mathrm{cov}(X_i, X_j)$ is called the variance-covariance matrix.
$N$ is the number of samples. In other words, a matrix whose diagonal elements are the variances and whose off-diagonal elements are the covariances is called a variance-covariance matrix.
You can see the covariance of all pairs of features.
Below, the variance-covariance matrix of the features of the iris dataset is shown as a heat map. The diagonal elements are the variances and the off-diagonal elements are the covariances. For example, it can be seen that there is a positive relationship between petal length and sepal length.
import numpy as np
# Create the variance-covariance matrix (np.cov treats each row as a variable, hence the transpose)
cov_mat = np.cov(iris.data.T)
df = pd.DataFrame(cov_mat, index=iris.feature_names, columns=iris.feature_names)
ax = sns.heatmap(df, annot=True, center=0, vmin=-3, vmax=3)
Covariance is difficult to interpret when comparing variables with different units, because its numerical value depends on the scale of the original values. For example, even if the covariance between each town's population and its ramen-shop sales is computed for every municipality, the meaning of the resulting number is hard to grasp.
Therefore, when looking at the relationship, it is common to use the correlation coefficient.
The correlation coefficient is the covariance divided by the product of the standard deviations of the two variables. It takes values from -1 to 1; a value of 1 means the two variables move in perfect synchrony.
With $\rho$ denoting the correlation coefficient and $X$ and $Y$ two different features:

$$ \rho = \frac{\mathrm{cov}(X, Y)}{\sigma_X \sigma_Y} $$
The correlation coefficient can therefore be regarded as a standardized covariance: it indicates how related the data are without being influenced by their units.
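Continuing the iris example from above (this sketch assumes the earlier imports and `iris_df`), the correlation matrix can be computed with pandas and drawn as a heat map on the fixed [-1, 1] scale:

```python
# Pearson correlation coefficients between all pairs of features
corr_df = iris_df.drop(columns='name').corr()

# Unlike the covariance heat map, the scale is always -1 to 1
ax = sns.heatmap(corr_df, annot=True, center=0, vmin=-1, vmax=1)
```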
- The Gaussian (normal) distribution is the most commonly used probability density function (a function whose integral gives a probability).
- The mean $\mu$ represents the center of the distribution and the standard deviation $\sigma$ represents its width.
- If a random sample $x$ is drawn from the normal distribution $N(\mu, \sigma^2)$, the probability that $x$ falls within ±1σ of the mean $\mu$ is 68.27%, within ±2σ is 95.45%, and within ±3σ is 99.73%.
- The normal distribution is not only the basis for other distributions such as the t distribution and the F distribution, but is also used throughout actual statistical inference, for example in hypothesis testing and interval estimation.
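For reference, the probability density function of the normal distribution $N(\mu, \sigma^2)$ (the same expression plotted in the code below) is:

$$ f(x) = \frac{1}{\sigma \sqrt{2 \pi}} \exp\left( - \frac{(x - \mu)^2}{2 \sigma^2} \right) $$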
reference:
Below, a histogram of data randomly generated from a Gaussian distribution is shown with the Gaussian density overlaid.
`mu` is the mean (the center of the distribution) and `sigma` is the standard deviation (the width of the distribution).
import numpy as np
# Generate random numbers following a Gaussian distribution
mu, sigma = 0, 1  # mean and standard deviation
np.random.seed(1)
s = np.random.normal(mu, sigma, 1000)
import matplotlib.pyplot as plt
# Create a normalized histogram (density=True) of the samples
count, bins, ignored = plt.hist(s, 30, density=True)
plt.plot(bins, 1/(sigma * np.sqrt(2 * np.pi)) *
np.exp( - (bins - mu)**2 / (2 * sigma**2) ),
linewidth=2, color='r')
plt.show()
Try changing the values of mu and sigma and you can see how the distribution changes. Also, since this is a probability density function, the values on the vertical axis suggest that integrating over the whole horizontal axis gives 1.
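As a quick numerical check of that statement (a sketch using `scipy.integrate.quad`; it assumes SciPy is available):

```python
import numpy as np
from scipy.integrate import quad

mu, sigma = 0, 1

def gaussian_pdf(x):
    # Same density as plotted above
    return 1 / (sigma * np.sqrt(2 * np.pi)) * np.exp(-(x - mu) ** 2 / (2 * sigma ** 2))

area, _ = quad(gaussian_pdf, -np.inf, np.inf)
print(area)  # approximately 1.0
```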
Incidentally, the standardization performed to align the scales of features applies the following transformation:

$$ x_{i_{std}} = \frac{x_i - \mu}{\sigma} $$

$x_{i_{std}}$: standardized feature $x_i$, $\mu$: mean, $\sigma$: standard deviation
This transformation makes the mean 0 and the standard deviation 1. In other words, assuming each feature is normally distributed, its distribution is converted to a Gaussian centered at 0 with a distribution width on a common scale.
Compared with min-max scaling (often called normalization), which squeezes the data into a limited range of values, this method is less affected by outliers and can therefore be said to be more practical.
(Words such as normalization and standardization are often used rather loosely in some fields, and their meaning has to be inferred from context. Also, the operation $x_i - \mu$ is called mean normalization, and multiplying by $1/\sigma$ is called feature scaling.)
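A minimal sketch of standardization on the iris features (assuming `iris_df` from above); scikit-learn's `StandardScaler` does essentially the same thing (it uses the population standard deviation, `ddof=0`, so its values differ very slightly from pandas' default `ddof=1`):

```python
# Standardize each feature: subtract the mean and divide by the standard deviation
features = iris_df.drop(columns='name')
iris_std = (features - features.mean()) / features.std()

print(iris_std.mean().round(6))  # ~0 for every feature
print(iris_std.std().round(6))   # 1 for every feature

# Roughly equivalent with scikit-learn
from sklearn.preprocessing import StandardScaler
iris_std_sk = StandardScaler().fit_transform(features)
```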
Since the data above were artificially generated from a Gaussian distribution, it is natural that the histogram matches the distribution well.
Does the Gaussian function fit natural data this nicely? Let's check using the iris dataset.
# Fit each feature of the iris dataset with a Gaussian function
import matplotlib.pyplot as plt

for i_column in iris_df.columns:
    if i_column == 'name':
        continue
    print(i_column)
    mu = iris_df[i_column].mean()
    sigma = iris_df[i_column].std()
    # Create a normalized histogram and overlay the Gaussian density
    count, bins, ignored = plt.hist(iris_df[i_column], 30, density=True)
    plt.plot(bins, 1/(sigma * np.sqrt(2 * np.pi)) *
             np.exp( - (bins - mu)**2 / (2 * sigma**2) ),
             linewidth=2, color='r')
    plt.show()
(Output: for each of sepal length (cm), sepal width (cm), petal length (cm), and petal width (cm), a histogram with the fitted Gaussian curve is displayed.)
Only the sepal width appears roughly Gaussian; the other features clearly do not follow a Gaussian distribution. From these results, looking at the iris dataset as a whole, a Gaussian distribution does not appear suitable for representing its distribution.
This is presumably because data from multiple groups (iris species) are mixed in the dataset, so let's draw a Gaussian distribution for each label (setosa, versicolor, virginica).
# Look at the distribution for each species
for i_name in iris_df['name'].unique():
    print(i_name)
    df_tmp = iris_df[iris_df['name'] == i_name]
    print(df_tmp.shape)
    for i_column in df_tmp.columns:
        if i_column == 'name':
            continue
        print(i_column)
        mu = df_tmp[i_column].mean()
        sigma = df_tmp[i_column].std()
        # Create a normalized histogram and overlay the Gaussian density
        count, bins, ignored = plt.hist(df_tmp[i_column], 10, density=True)
        plt.plot(bins, 1/(sigma * np.sqrt(2 * np.pi)) *
                 np.exp( - (bins - mu)**2 / (2 * sigma**2) ),
                 linewidth=2, color='r')
        plt.show()
(Output: for each species — setosa, versicolor, and virginica, 50 samples each — histograms with fitted Gaussian curves are displayed for sepal length (cm), sepal width (cm), petal length (cm), and petal width (cm).)
Because the amount of data for each species is small (50 samples), the number of histogram bins (the intervals used for aggregation) is reduced to 10.
When viewed per species, the distribution of every feature can be roughly represented by a Gaussian distribution.
The Gaussian distribution for each species and feature is determined by its mean and standard deviation. From the Gaussian distributions obtained in this way, a probability (density) can be computed for an unknown set of feature values, so they can also be used as a classifier (I would like to explain this when talking about anomaly detection).
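As a rough sketch of that idea (not part of the original notebook): if the features are treated as independent, the per-species Gaussian densities estimated from the means and standard deviations above can be multiplied to score a hypothetical unknown sample, for example with `scipy.stats.norm`:

```python
import numpy as np
from scipy.stats import norm

unknown = [5.0, 3.4, 1.5, 0.2]  # hypothetical unknown measurements (cm)

scores = {}
for i_name in iris_df['name'].unique():
    df_grp = iris_df[iris_df['name'] == i_name]
    # Product of the per-feature Gaussian densities (naive independence assumption)
    densities = [norm.pdf(x, loc=df_grp[col].mean(), scale=df_grp[col].std())
                 for x, col in zip(unknown, iris.feature_names)]
    scores[i_name] = np.prod(densities)

print(scores)
print('predicted:', max(scores, key=scores.get))  # most likely species under this simple model
```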
- Basic statistics, Wikipedia
- [2nd Edition] Python Machine Learning Programming: Theory and Practice by Expert Data Scientists
- Lectures by Dr. Andrew Ng