**Warnings and Disclaimers**: This is a notebook summarizing what I have studied while reading references on statistics, and much of it reproduces the content of those copyrighted works as-is; I make no claims of originality or legitimacy. It is not written with the intention of spreading any particular information, and since it will be updated from time to time, I cannot guarantee the validity of its content.
Anyone can do "machine learning" with powerful packages and trained models in languages such as Python, R, and Julia.
However, without a grasp of the basic statistics behind machine learning and statistical processing, you can easily be overwhelmed by the model-improvement indicators and numbers on the screen without knowing what you are actually doing.
I think it is very dangerous to study only the superficial use of tools.
Therefore, I started by studying the framework of modern statistics as an introduction to the introduction. First, I would like to discuss the two aspects of modern statistics, **descriptive statistics** and **inferential statistics**, aiming to understand what a **probability model** is and what a **statistical model** is.
**Descriptive statistics** is, in short, a set of techniques for summarizing the data at hand in a form we can understand.
The various indicators that summarize the data (such as the mean and measures of variability) are called **statistics**.
As an example, let's say you want to know the height characteristics of students in the classroom.
When a student is 155 cm tall, that height datum is written $x_i = 155$. The values obtained by measuring heights in this way are written $x_1, x_2, \ldots, x_n$, and their average, the **sample mean** $\bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i$, summarizes where the data are centered.
In addition, there is the **sample variance** as an index of how much the data vary: the deviation of each data point from the mean is squared, and these squared deviations are averaged, $s^2 = \frac{1}{n}\sum_{i=1}^{n}(x_i - \bar{x})^2$.
Because the sample variance squares the deviations, its value is in the square of the original unit (cm² rather than cm) and is hard to compare with the data directly.
Therefore, if you want to express the variability in the original unit, the **standard deviation**, the square root of the variance, is used.
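As a minimal sketch, the statistics above can be computed directly with NumPy (the height values below are made up purely for illustration):

```python
import numpy as np

# Hypothetical height data (cm) for students in a classroom.
x = np.array([155.0, 162.0, 170.0, 158.0, 167.0, 173.0, 160.0, 165.0])

mean = x.mean()                     # sample mean
var = np.mean((x - mean) ** 2)      # sample variance: average of squared deviations (cm^2)
std = np.sqrt(var)                  # standard deviation: back in the original unit (cm)

print(mean, var, std)
# np.var(x) and np.std(x) give the same results (they divide by n by default).
```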
When you have more than one variable, you may want to know the relationship between them.
For example, suppose you want to know how much height $X$ changes with age $Y$. This can be captured by the **sample covariance**: for each data point, multiply the deviation of $X$ from its mean by the deviation of $Y$ from its mean, and average the products.
If $X$ and $Y$ are both above or both below their means, the product is positive; if one is below its mean while the other is above, the product is negative. In short, when $X$ and $Y$ tend to change together (*covary*), the covariance is positive, and when they change in opposite directions, it is negative.
The covariance divided by the product of the standard deviations of the two variables is called the **correlation coefficient**.
When we say "see the correlation", we basically mean this index.
The correlation coefficient always falls within $-1 \le \mathrm{corr}(X, Y) \le 1$, which makes it useful for comparing the strength of relationships between multiple pairs of variables. A negative correlation coefficient is called a **negative correlation**, and a positive one a **positive correlation**.
If the covariance or correlation coefficient is far from zero, we know that when one variable (for example, age) changes, the other (height) changes with it, but **how much** it changes cannot be read off from the covariance or the correlation coefficient alone.
How much does average height rise or fall with each additional year of age? The answer is given by the **regression coefficient**.
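A minimal sketch of these three quantities, again with made-up numbers; the regression coefficient of height on age is the covariance divided by the variance of age:

```python
import numpy as np

# Hypothetical paired data: age (years) and height (cm).
age = np.array([10, 11, 12, 13, 14, 15], dtype=float)
height = np.array([138, 145, 152, 158, 163, 166], dtype=float)

cov = np.mean((age - age.mean()) * (height - height.mean()))  # sample covariance
corr = cov / (age.std() * height.std())                       # correlation coefficient (in [-1, 1])
slope = cov / np.var(age)                                      # regression coefficient of height on age

print(cov, corr, slope)  # slope: roughly how many cm mean height changes per year of age
```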
If descriptive statistics are methods for summarizing given data, inferential statistics are techniques for predicting and estimating unobserved events based on those data.
Statistical inference is often discussed within the following framework.
In this framework, the data we handle are reinterpreted as a **sample** drawn from a **probability model** that lies behind them. Because the draws are random, the contents of each sample do in fact vary, but the assumption is that the underlying probability model itself does not change.
However, since this probability model itself cannot be observed (only an omniscient being could see it directly), we must infer it from the data at hand; the idea of inferential statistics is to predict future data through the probability model estimated in this way.
This "source of data" is called ** population ** (* population ) or ** sample space ** ( sample space *).
Events are subsets of the sample space (= population), but each individual event does not come with a "name".
The "probability of throwing a dice once and getting an even number" can be expressed as $ P \ {2, 4, 6 } $, but "only voters over the age of 18 out of all Japanese nationals" are extracted. If you want, it's tedious to list {20, 21, 18 ...}.
Therefore, as a way of picking out properties of the population (e.g., "voters aged 18 or over"), we have the idea of a **random variable**.
By introducing a random variable, we can express the subset "nationals aged 18 or over" as $Y \ge 18$. Of course, we can also assign a probability to this subset, so the probability that a person chosen at random from the whole population is 18 or older can be written $P(Y \ge 18)$.
In general, for a random variable $X$, the probability that it takes the value $x$ is written $P(X = x)$. In the example above, $P(Y \ge 18) = 0.83$ means that there is an 83% chance that the chosen person is a voter aged 18 or over.
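As a small illustrative simulation (the age data are randomly generated, not real), the probability of the event $Y \ge 18$ is just the proportion of the population that satisfies it:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical population of ages between 0 and 99.
ages = rng.integers(0, 100, size=100_000)

# P(Y >= 18): probability that a randomly chosen person is 18 or older,
# computed as the proportion of the population in that subset.
print(np.mean(ages >= 18))
```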
**What do we gain by introducing random variables?**
We are usually interested in some attribute or property, such as height or age. By expressing such an attribute as a variable, we can express probability as a function of the variable's value.
This function, which gives the probability $P(x)$ for any value $x$ of the random variable $X$, is called the **probability distribution** of $X$ and is written $P(X)$.
Depending on the attribute, a random variable may take either discrete or continuous values. It may seem natural to express a continuously varying feature such as height with a continuous random variable, but there is a caveat.
For example, if $X$ is a continuous random variable, what is the probability $P(X = x)$ that it takes a specific real value $x$?
No matter how large the population, no one's height is exactly "170.00000... cm". For a continuous random variable, the probability of any single exact value is zero.
However, the probability that height lies between, say, 169 cm and 170 cm can be greater than zero, so we can consider what happens as this interval is narrowed down toward a single point.
The resulting quantity is called the **probability density** at that point, and the function that assigns a probability density to every point $x$ is called the **probability density function**. The probability of an interval is obtained by integrating the probability density function over that interval.
For example, if the probability density function of height is $f(x)$, the probability that height falls between 169 cm and 170 cm is $P(169 \le X \le 170) = \int_{169}^{170} f(x)\,dx$.
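A minimal sketch with SciPy, assuming (purely for illustration) that height is normally distributed with mean 165 cm and standard deviation 6 cm:

```python
from scipy.stats import norm

height_dist = norm(loc=165, scale=6)  # assumed distribution of height (cm)

# The probability of any exact value is zero, but the density at 170 cm is not:
print(height_dist.pdf(170))

# The probability of the interval [169, 170] is the integral of the density,
# obtained here as a difference of the cumulative distribution function:
print(height_dist.cdf(170) - height_dist.cdf(169))
```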
The probability distribution is a function of the values of the random variable, and we do not know this function in full (because the sample space (= population) contains not only what we happen to observe but nearly every possibility that could occur).
However, we can still think about values that summarize the distribution. Such values, which characterize the distribution of a random variable, are called its **expected values**.
The **population mean** $\mu = E[X]$ gives the "center" of the probability distribution.
The spread of the distribution is given by the **population variance** $\sigma^2 = E[(X - \mu)^2]$.
Even if the original distribution is completely unknown, if we keep sampling indefinitely, the sample mean $\bar{X}_n$ converges to the population mean $\mu$ (the **law of large numbers**; details to be added). Moreover, if $X_1, X_2, \ldots, X_n$ independently follow the same distribution with population variance $\sigma^2$, then as $n$ grows large, the probability distribution of the sample mean, $P(\bar{X}_n)$, approaches the normal distribution with mean $\mu$ and variance $\sigma^2 / n$. This is the content of the **central limit theorem**.
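Both theorems can be checked with a quick simulation; here is a sketch using an exponential distribution (mean 1, variance 1), which is clearly not normal:

```python
import numpy as np

rng = np.random.default_rng(0)

n = 1_000        # size of each sample
trials = 10_000  # number of sample means to compute

samples = rng.exponential(scale=1.0, size=(trials, n))
sample_means = samples.mean(axis=1)

# Law of large numbers: the sample means cluster around the population mean (1.0).
print(sample_means.mean())

# Central limit theorem: their variance is close to sigma^2 / n = 1 / 1000,
# and a histogram of sample_means would look approximately normal.
print(sample_means.var(), 1 / n)
```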
To summarize the story so far, we have confirmed that:

- in order to make inferences beyond the data at hand (e.g., estimating the average height of all people from survey data), we posited a probability model behind the data as their source;
- we introduced random variables as a means of representing arbitrary events, and saw that a random variable has a fixed distribution;
- this distribution is unknown to us, but we can consider expected values that summarize it, and we can approach these expected values by collecting a large number of samples.
We thus have a theoretical assurance that if we kept collecting data forever we would eventually reach the true distribution, but in practice we can only ever obtain a finite amount of data.
Also, the true probability distribution is very complex and may not be represented by a finite number of parameters.
Therefore, inferential statistics places additional assumptions on the probability model, assuming a distribution that can be represented by a particular functional form so that effective inference becomes possible even with finite data. A probability model restricted in this way is what we call a **statistical model**.
There are two main ways to create a statistical model.
In nonparametric statistics, we do not fix a specific functional form for the target distribution; we make only very loose assumptions such as differentiability or continuity.
In parametric statistics, the shape of the target distribution is specified by a concrete functional form. The type of such a distribution is called a **family of distributions**.
Because parametric statistics pins down a concrete functional form, there is a greater risk of distorting reality, but if the distribution family is chosen appropriately, more detailed and effective inference becomes possible.
A distribution that assigns the same probability to every possible value $x_1, x_2, x_3, \ldots$ of a random variable $X$ is called a **uniform distribution**.
Example: the probability of each face of a fair die is a uniform distribution with $P(X = x) = \frac{1}{6}$.
Likewise, if $X$ takes continuous values from $\alpha$ to $\beta$, the uniform distribution has the constant density $f(x) = \frac{1}{\beta - \alpha}$ over that interval.
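A short sketch of both cases with SciPy (the interval endpoints are arbitrary example values):

```python
from scipy.stats import randint, uniform

# Discrete uniform: a fair die assigns probability 1/6 to each face 1..6.
die = randint(low=1, high=7)   # high is exclusive
print(die.pmf(3))              # 0.1666...

# Continuous uniform on [alpha, beta]: constant density 1 / (beta - alpha).
alpha, beta = 2.0, 5.0
u = uniform(loc=alpha, scale=beta - alpha)
print(u.pdf(3.0))              # 1/3
```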
The result of tossing a coin once can be represented by a random variable $X$, with tails as $X = 0$ and heads as $X = 1$. If the probability of heads is $P(X = 1) = \theta$, the distribution of $X$ is $P(X = x) = \theta^x (1 - \theta)^{1 - x}$, and $X$ is said to follow the **Bernoulli distribution**.
Now consider an experiment in which the coin is tossed not once but several times in a row, and the number of heads is recorded.
The probability of getting heads $x$ times in $n$ tosses is $P(X = x) = \binom{n}{x}\theta^x (1 - \theta)^{n - x}$, and the distribution obtained in this way is called the **binomial distribution**.
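A minimal sketch of the two coin-toss distributions with SciPy, using an assumed fair coin ($\theta = 0.5$):

```python
from scipy.stats import bernoulli, binom

theta = 0.5  # assumed probability of heads

# Bernoulli: a single toss (X = 1 for heads, X = 0 for tails).
print(bernoulli(theta).pmf(1))  # P(X = 1) = theta

# Binomial: number of heads in n = 10 tosses.
n = 10
print(binom(n, theta).pmf(3))   # P(exactly 3 heads in 10 tosses)
```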
The **normal distribution** describes data that pile up near the mean; its probability density function is $f(x) = \frac{1}{\sqrt{2\pi\sigma^2}} \exp\left(-\frac{(x - \mu)^2}{2\sigma^2}\right)$, with two parameters, the mean $\mu$ and the variance $\sigma^2$.
In particular, when $\mu = 0$ and $\sigma^2 = 1$, it is called the **standard normal distribution**.
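As a sanity check, the density formula above matches SciPy's implementation; the mean and standard deviation here are arbitrary example values:

```python
import numpy as np
from scipy.stats import norm

mu, sigma = 165.0, 6.0  # example parameters
x = 170.0

# Density evaluated from the formula and via scipy.stats.norm:
by_formula = np.exp(-(x - mu) ** 2 / (2 * sigma ** 2)) / np.sqrt(2 * np.pi * sigma ** 2)
by_scipy = norm(loc=mu, scale=sigma).pdf(x)
print(by_formula, by_scipy)  # identical values

# Standard normal distribution: mu = 0, sigma^2 = 1.
print(norm(0, 1).pdf(0.0))   # about 0.3989
```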
- Atsushi Otsuka (2020) *Philosophy of Statistics*, The University of Nagoya Press.
- Tomokazu Haebara (2002) *Basics of Psychological Statistics: For Integrated Understanding*, Yuhikaku Alma.