[PYTHON] Machine learning and statistical prediction: the paradigm of modern statistics you should know first

**Warnings and disclaimers**: This is a study notebook summarizing what I learned while reading references on statistics, and much of it reproduces the content of those copyrighted works as-is; I claim no originality. It is not written with the intention of spreading any particular position, and since it will be updated from time to time, I cannot guarantee the validity of its content.

Introduction

Anyone can do "machine learning" with the powerful packages and pretrained models available in languages such as Python, R, and Julia.

However, without a grasp of the basics of statistics that lie behind machine learning and statistical processing, you risk being swamped by the model-improvement metrics and numbers on screen without understanding what you are doing.

I think it is very dangerous to study only the superficial use of tools.

Therefore, I started by studying the framework of modern statistics as a first step. This article covers the two aspects of modern statistics, **"descriptive statistics"** and **"inferential statistics"**, and aims at understanding what a **probability model** and a **statistical model** are.

1. Descriptive statistics

**Descriptive statistics** is, in short, a set of techniques for summarizing the data at hand in a form we can understand.

1.1 Typical statistics

The various indicators that summarize data (such as means and measures of variability) are called **statistics**.

1.1.1. One-variable statistics

As an example, let's say you want to know the height characteristics of students in the classroom.

Sample mean

When a student is 155 cm tall, that height is recorded as $x_i = 155$. The values obtained by measuring heights in this way are written $x_1, x_2, \ldots, x_n$. Data collected like this is called a **sample**. Dividing the sum of the sample by the number of observations gives the **sample mean**:

$$\overline{X} = \frac{x_1 + x_2 + \cdots + x_n}{n}$$

Sample variance

There is also the **sample variance**, an index of how spread out the data is: square each observation's deviation from the mean and average the results.

$$Var(X) = \frac{1}{n}\sum_{i=1}^n (x_i - \overline{X})^2$$

The reason for squaring is so that negative and positive deviations alike count as distances from the mean.

Standard deviation

Because the sample variance squares the deviations, it is expressed in squared units rather than the original units.

If you want to measure variability in the original units, use the **standard deviation**, the square root of the variance:

$$sd(X) = \sqrt{Var(X)} = \sqrt{\frac{1}{n}\sum_{i=1}^n (x_i - \overline{X})^2}$$
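As a minimal sketch, the three statistics above can be computed in plain Python; the height values below are made up purely for illustration:

```python
# Hypothetical height data (cm), for illustration only
heights = [155, 160, 162, 158, 170, 165]

n = len(heights)
sample_mean = sum(heights) / n                                   # x-bar
sample_var = sum((x - sample_mean) ** 2 for x in heights) / n    # Var(X): mean squared deviation
sample_sd = sample_var ** 0.5                                    # sd(X): back to the original units (cm)
```

Note that dividing by $n$ matches the definition above; be aware that some libraries divide by $n-1$ (the unbiased variance) by default.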

1.1.2. Multivariable statistics

Sample covariance

When you have more than one variable, you may want to know the relationship between them.

For example, suppose you want to find out how height $X$ changes with age $Y$. This can be measured with the **sample covariance**: for each observation, multiply the deviation of $X$ from its mean by the deviation of $Y$ from its mean, and average the products.

If $x_i$ and $y_i$ are both above or both below their respective means, the product is positive; if one is below and the other above, it is negative. In short, if $X$ and $Y$ change together (*covary*), the covariance is positive, and if they change in opposite directions, it is negative.

$$Cov(X, Y) = \frac{1}{n}\sum_{i=1}^n (x_i - \overline{X})(y_i - \overline{Y})$$

Correlation coefficient

The covariance divided by the standard deviation of each variable is called the **correlation coefficient**:

$$corr(X, Y) = \frac{Cov(X, Y)}{sd(X)\,sd(Y)}$$

When we say "see the correlation", we basically mean this index.

The correlation coefficient always falls within the range $-1 \le corr(X, Y) \le 1$, which makes it useful for comparing the strength of relationships across multiple pairs of variables. A negative value indicates a **negative correlation**, a positive value a **positive correlation**.
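A sketch of the two definitions above, with made-up age/height pairs (all values are illustrative):

```python
# Hypothetical paired data: height X (cm) and age Y (years)
heights = [138, 143, 150, 156, 161]   # X
ages    = [10, 11, 12, 13, 14]        # Y

n = len(ages)
mean_x = sum(heights) / n
mean_y = sum(ages) / n

# Sample covariance: average product of deviations
cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(heights, ages)) / n

# Correlation coefficient: covariance divided by both standard deviations
sd_x = (sum((x - mean_x) ** 2 for x in heights) / n) ** 0.5
sd_y = (sum((y - mean_y) ** 2 for y in ages) / n) ** 0.5
corr = cov / (sd_x * sd_y)   # always in [-1, 1]; close to 1 here, since height rises with age
```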

Regression coefficient

If the covariance or correlation coefficient is far from zero, we know that when one variable (for example, age) changes, the other (height) changes too; but **how much** it changes cannot be read off from the covariance or the correlation coefficient alone.

How much does average height go up or down with each year of age? The answer is given by the **regression coefficient**:

$$b_{x,y} = \frac{Cov(X, Y)}{Var(Y)}$$

This is called the regression coefficient of $X$ on $Y$ and represents the change in $X$ per unit of $Y$: in this data, each one-unit increase in age $Y$ raises height $X$ by $b_{x,y}$ on average.
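A small sketch of the regression coefficient, reusing the same illustrative data: dividing the covariance by $Var(Y)$ gives the average change in height per year of age.

```python
# Hypothetical data: height X (cm) against age Y (years)
heights = [138, 143, 150, 156, 161]   # X
ages    = [10, 11, 12, 13, 14]        # Y

n = len(ages)
mean_x = sum(heights) / n
mean_y = sum(ages) / n

cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(heights, ages)) / n
var_y = sum((y - mean_y) ** 2 for y in ages) / n

b_xy = cov / var_y   # change in X per unit of Y; about 5.9 cm per year with this data
```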

2. Statistical inference

If descriptive statistics is a set of methods for summarizing given data, inferential statistics is a set of techniques for predicting and estimating unobserved events based on that data.

2.1. The concept of inferential statistics

(Figure: data-model.png)

Statistical inference is often talked about in the framework shown above.

In this framework, the data we handle is reinterpreted as a **sample** extracted from a **probability model** behind it. Because the extraction is random, the contents of each sample vary, but the assumption is that the underlying probability model itself does not change.

However, since this probability model cannot itself be observed (only an omniscient and omnipotent being could see it), we must infer it from the data at hand. The idea of inferential statistics is to predict future data through the probability model estimated in this way.

This "source of the data" is called the **population** or the **sample space**.

2.3. Random variables and probability distributions

Events are subsets of the sample space (= population), but individual events do not come with "names."

The "probability of rolling an even number with one throw of a die" can be written $P(\{2, 4, 6\})$, but if you want to pick out "only voters aged 18 or over from all Japanese nationals," it is tedious to list $\{20, 21, 18, \ldots\}$.

Therefore, as a way of picking out properties of the population (e.g. "voters aged 18 or over"), we have the idea of a **random variable**.

By introducing a random variable $Y$ (here, age), we can express the subset "nationals aged 18 or over" as $Y \ge 18$. We can of course assign a probability to this subset, so the probability that a randomly chosen person is 18 or over can be written $P(Y \ge 18)$.

In general, for a random variable $X$, the probability that it takes the value $x$ is written $P(X = x)$. In the example above, $P(Y \ge 18) = 0.83$ would mean there is an 83% chance that a randomly chosen person is a voter aged 18 or over.

**What do we gain by introducing random variables?**

We are usually interested in some attribute or property, such as height or age. By expressing such attributes as variables, we can express probability as a function of the variable's value.

This function, which gives the probability $P(x)$ for any value $x$ of the random variable $X$, is called the **probability distribution** of $X$, written $P(X)$.
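The die example can be made concrete. Below, the sample space is the six faces, an event is any subset picked out by a condition, and probabilities follow by counting (a sketch assuming a fair die):

```python
from fractions import Fraction

# Sample space of one die roll; the random variable X is the face value itself
sample_space = [1, 2, 3, 4, 5, 6]

def P(event):
    """Probability of an event (a condition picking out a subset) under a fair die."""
    favorable = [w for w in sample_space if event(w)]
    return Fraction(len(favorable), len(sample_space))

p_even = P(lambda x: x % 2 == 0)   # P({2, 4, 6}) = 1/2
```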

2.4. Continuous random variables and probability densities

A random variable may take discrete or continuous values, depending on the nature of the attribute. Continuously varying features such as height are naturally expressed by a continuous random variable, but there is a caveat.

For example, if $X$ is a continuous random variable, what is the probability $P(X = x)$ that it takes a specific real value $x$?

No matter how large the population, no one's height is exactly "170.00000.... cm". For a continuous random variable, the probability of any single exact value is zero.

But the probability of, say, a height between 169 cm and 170 cm can be greater than zero, so we can consider what happens as this interval is narrowed down toward a single point.

This limit is called the **probability density** at that point, and the function that assigns a probability density to each point $x$ is called the **probability density function**. The probability of an interval is obtained by integrating this density function over it.

For example, if the probability density function of height is $f$, the probability of a height from 169 cm to 170 cm is

$$P(169 \le X \le 170) = \int_{169}^{170} f(x)\,dx$$
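This integral can be sketched numerically. Here $f$ is a normal density with assumed parameters $\mu = 170$, $\sigma = 6$ (both made up for illustration), integrated with a simple midpoint rule:

```python
import math

MU, SIGMA = 170.0, 6.0   # assumed parameters, for illustration only

def f(x):
    """Normal probability density with the assumed mean and standard deviation."""
    return math.exp(-(x - MU) ** 2 / (2 * SIGMA ** 2)) / math.sqrt(2 * math.pi * SIGMA ** 2)

def integrate(g, a, b, steps=10_000):
    """Midpoint-rule approximation of the integral of g from a to b."""
    h = (b - a) / steps
    return sum(g(a + (i + 0.5) * h) for i in range(steps)) * h

p = integrate(f, 169, 170)   # P(169 <= X <= 170): a small but positive probability
```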

2.5. Expected value

A probability distribution is a function of the values of a random variable, and we do not know this function in full (the sample space (= population) contains not only what we observe but almost every possibility that could occur).

However, we can consider values that summarize the distribution. Such values characterizing the distribution of a random variable are called its **expected values**.

Population mean

The **population mean** $\mu$ is given as the "center" of the probability distribution:

$$\mu = \sum_{x} x \cdot P(X=x)$$

Population variance

The spread of the distribution is given by the **population variance** $\sigma^2$:

$$\sigma^2 = \sum_{x}(x - \mu)^2 \cdot P(X=x)$$
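For a discrete distribution, these two sums can be written out directly. A sketch using the fair-die distribution, where every face has probability $1/6$:

```python
# Fair-die distribution as an explicit {value: probability} table
dist = {x: 1 / 6 for x in range(1, 7)}

mu = sum(x * p for x, p in dist.items())                    # population mean: 3.5
sigma2 = sum((x - mu) ** 2 * p for x, p in dist.items())    # population variance: 35/12
```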

2.6 Law of large numbers and central limit theorem

Even if the original distribution is completely unknown, the sample mean $\bar{X}_n$ converges to the population mean $\mu$ if sampling is continued indefinitely (the **law of large numbers**). *Details will be added.* Furthermore, if $X_1, X_2, \ldots, X_n$ independently follow the same distribution with population variance $\sigma^2$, then as $n$ approaches infinity, the probability distribution of the sample mean $\bar{X}_n$ approaches the normal distribution with mean $\mu$ and variance $\sigma^2/n$. This is the **central limit theorem**.
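Both theorems can be sketched with a simulation. Drawing repeated samples from a fair die ($\mu = 3.5$, $\sigma^2 = 35/12$), each sample mean lands near $\mu$, and the variance of the sample means shrinks roughly like $\sigma^2 / n$ (a rough illustration, not a proof):

```python
import random
import statistics

random.seed(0)  # fixed seed so the illustration is reproducible

N = 10_000  # size of each sample

def sample_mean(n):
    """Mean of n independent fair-die rolls."""
    return sum(random.randint(1, 6) for _ in range(n)) / n

# Law of large numbers: each sample mean lands close to mu = 3.5
means = [sample_mean(N) for _ in range(200)]

# Central limit theorem: the sample means themselves have variance near sigma^2 / N
var_of_means = statistics.pvariance(means)   # should be close to (35 / 12) / N
```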

2.7 Statistical model

To summarize the story so far, we confirmed that:

  1. To make inferences beyond the data at hand (e.g. estimating the average height of all people from survey data), we posited a probability model behind the data as its source.

  2. We introduced random variables as a means of representing arbitrary events, and each random variable has a fixed distribution.

  3. That distribution is unknown to us, but we can consider expected values that summarize it, and we can approach those expected values by collecting many samples.

We have the theoretical guarantee that if we keep collecting data indefinitely, we will eventually reach the true distribution; but in practice we can only obtain a finite amount of data.

Moreover, the true probability distribution may be very complex and may not be representable with a finite number of parameters.

Therefore, inferential statistics adds further assumptions to the probability model, enabling effective inference even with finite data by assuming a distribution that can be represented by a particular functional form.

2.8 Parametric and nonparametric statistics

There are two main ways to create a statistical model.

Nonparametric statistics

In nonparametric statistics, we do not fix a specific functional form for the target distribution, making only very loose assumptions such as differentiability or continuity.

Parametric statistics

In parametric statistics, the shape of the target distribution is specified by a concrete functional form. Such a class of distributions is called a **family of distributions**.

Because parametric statistics specifies the functional form itself, there is a higher risk of distorting reality, but if the distribution family is chosen appropriately, more detailed and effective inference becomes possible.

2.9 Typical distribution family

Uniform distribution

A distribution that assigns the same probability to every possible value $x_1, x_2, x_3, \ldots$ of a random variable $X$ is called a **uniform distribution**.

Example: the probability of each face of a fair die follows the uniform distribution $P(X = x) = \frac{1}{6}$. If $X$ instead takes continuous values from $\alpha$ to $\beta$, its uniform distribution has the density

$$f(x) = \frac{1}{\beta - \alpha}$$

Bernoulli distribution

Represent the result of one coin toss by a random variable $X$, with tails $X = 0$ and heads $X = 1$, and let the probability of heads be $P(X = 1) = \theta$. Then the distribution of $X$ is

$$P(X = x) = \theta^x (1-\theta)^{1-x} \hspace{20px} (x \in \{0, 1\})$$

and $X$ is said to follow the **Bernoulli distribution**.
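A sketch of drawing from a Bernoulli distribution by simulation; $\theta = 0.6$ is an arbitrary choice for illustration, and the relative frequency of heads approaches $\theta$:

```python
import random

random.seed(1)  # fixed seed for reproducibility
theta = 0.6     # assumed probability of heads, chosen arbitrarily

def bernoulli():
    """One draw from Bernoulli(theta): 1 for heads, 0 for tails."""
    return 1 if random.random() < theta else 0

draws = [bernoulli() for _ in range(100_000)]
freq = sum(draws) / len(draws)   # relative frequency of heads, close to theta
```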

Binomial distribution

Consider an experiment in which a coin is tossed several times in a row rather than once, and the number of heads is recorded.

When $n$ coins are tossed, the probability of getting heads exactly $x$ times is given by the formula below; the resulting distribution is called the **binomial distribution**.

$$P(X=x) = {}_n C_x \, \theta^x (1 - \theta)^{n-x}$$

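The binomial formula above translates directly into code; `math.comb` (Python 3.8+) computes ${}_n C_x$. A sketch for a fair coin:

```python
import math

def binom_pmf(x, n, theta):
    """P(X = x): probability of exactly x heads in n tosses with heads probability theta."""
    return math.comb(n, x) * theta ** x * (1 - theta) ** (n - x)

p5 = binom_pmf(5, 10, 0.5)   # probability of exactly 5 heads in 10 fair tosses
```

Summing the pmf over $x = 0, \ldots, n$ gives 1, a quick sanity check for any distribution family.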

Normal distribution

The normal distribution represents data that accumulates near the mean; its probability density function has two parameters, the mean $\mu$ and the variance $\sigma^2$:

$$f(x) = \frac{1}{\sqrt{2\pi \sigma^2}} \exp \left(-\frac{(x - \mu)^2}{2\sigma^2} \right) \hspace{20px} (-\infty < x < \infty)$$

The normal distribution is used throughout statistical inference. For a random sample $x$ from the normal distribution $N(\mu, \sigma^2)$, the probability that $x$ falls within $\pm 1\sigma$ of the mean $\mu$ is 68.27%, within $\pm 2\sigma$ is 95.45%, and within $\pm 3\sigma$ is 99.73%.

If $\mu = 0$ and $\sigma^2 = 1$, the distribution is called the **standard normal distribution**, with probability density function

$$f(x) = \frac{1}{\sqrt{2\pi}} \exp \left(-\frac{x^2}{2} \right) \hspace{20px} (-\infty < x < \infty)$$

Given a random variable $X$ that follows the normal distribution $N(\mu, \sigma^2)$, the standardized variable

$$Z = \frac{X - \mu}{\sigma}$$

follows the standard normal distribution. By computing the $Z$ value, you can find the probability of an event under any normal distribution without a computer, using the standard normal distribution table, a list of probabilities corresponding to values of the variate.
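Standardization and the table lookup can be sketched in code; `math.erf` plays the role of the printed table here, and $\mu = 170$, $\sigma = 6$ are assumed values for illustration:

```python
import math

def standard_normal_cdf(z):
    """P(Z <= z) for the standard normal distribution, via the error function."""
    return 0.5 * (1 + math.erf(z / math.sqrt(2)))

# Standardize X ~ N(mu, sigma^2); parameters are assumed for illustration
mu, sigma = 170.0, 6.0
x = 182.0
z = (x - mu) / sigma   # z = 2.0: x is two standard deviations above the mean

# The +/- 2 sigma probability quoted above
p_within_2sd = standard_normal_cdf(2) - standard_normal_cdf(-2)   # about 0.9545
```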

References

- Atsushi Otsuka (2020), *Philosophy of Statistics*, The University of Nagoya Press.
- Tomokazu Haebara (2002), *Basics of Psychological Statistics: For an Integrated Understanding*, Yuhikaku Alma.
