[PYTHON] After all, what is statistical modeling?

The other day I published an article on how to study time series analysis, "Study method for statistical beginners to learn time series analysis", and I was surprised by how big the response was.

To be honest, I thought that time series analysis was a very niche field (that only about 3 people would read it). I wonder whether everyone who is staying at home has so much free time that they are thinking about making a profit in Forex with time series analysis.

The state space model, a topic in time series analysis, is also a kind of **statistical modeling**, but today I would like to think again about "what is statistical modeling".

Speaking of statistical modeling, there is the book everyone loves, the "green book": "Introduction to Statistical Modeling for Data Analysis: Generalized Linear Models, Hierarchical Bayes Models, MCMC (Science of Probability and Information)".

The green book is certainly a wonderful book, but reading it alone did not clear up my long-standing questions: "what is statistical modeling and what is not statistical modeling?" and "what is a model in statistics?"

I also read a lot of wonderful articles such as the following, but they are written in a difficult way and feel like explanations for professionals.

- Let's reorganize "what is statistical modeling"
- How to discuss the difference between statistics and machine learning
- After all, what is the difference between machine learning and statistics?

Let's start with something a little simpler and approach the deep question, "What is statistical modeling after all?"

Who this article is for

This article is for everyone who has wasted part of an already trivial life on trivial questions such as "what is statistics", "what is a model in statistics", and "what is the difference between statistics and machine learning".

The content is understandable even with little knowledge of statistics.

Let's get started!

What is statistical modeling?

The basic part of statistical modeling is the probability distribution

Probability distributions are indispensable when talking about statistical modeling. When you hear "probability distribution", you probably think of the normal distribution, the binomial distribution, the Poisson distribution, the gamma distribution, and so on.

Figure: normal distribution PDF (650px-Normal_Distribution_PDF.svg.png). (The divine figure of the [normal distribution](https://ja.wikipedia.org/wiki/%E6%AD%A3%E8%A6%8F%E5%88%86%E5%B8%83), which reigns as the emperor of the probability distribution world ……)

Many people think that learning about the mathematical properties of these probability distributions is statistics.

This is probably because statistics curricula at universities and elsewhere begin with the mathematical theory of these typical probability distributions. (That is probably why many people think statistics is crap.)

However, the mathematical theory of probability distributions is not the essence of statistics.

The important thing is, "In statistics, how do you model using a probability distribution?"

In this article, by clarifying "what is statistical modeling and what is not statistical modeling", we approach the question "In statistics, how do you model using a probability distribution?", in other words, "What is statistical modeling?"

Now, let's clarify through a few questions, "What is statistical modeling and what is not statistical modeling?"

What is statistical modeling and what is not statistical modeling

Q1. I have data on the heights of all junior high school boys in Japan. Is it statistical modeling to find their mean and variance?

**A1. In my opinion, just calculating the mean or variance is not statistical modeling.**

This is because the mean and variance can be calculated **as is** from the data you have. If you have height data for all junior high school boys, you can add them all up and divide by the number of people to get the mean. Once you have the mean, you can also calculate the variance.
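To make "calculated as is" concrete, here is a minimal sketch. The height data is simulated (a hypothetical stand-in, not real measurements), and the variable names are my own. The point is only that nothing is assumed about where the data came from; it is pure arithmetic on the values at hand.

```python
import random
import statistics

# Hypothetical stand-in for "the heights of all junior high school boys":
# simulated values in cm, not real measurements.
rng = random.Random(0)
heights = [rng.gauss(160.0, 8.0) for _ in range(10_000)]

# Mean: add everything up and divide by the number of people.
mean = sum(heights) / len(heights)

# Variance of the complete data: average squared deviation from the mean.
variance = sum((h - mean) ** 2 for h in heights) / len(heights)

# The standard library gives the same numbers.
print(f"mean = {mean:.2f} cm (statistics.fmean: {statistics.fmean(heights):.2f})")
print(f"variance = {variance:.2f} cm^2 "
      f"(statistics.pvariance: {statistics.pvariance(heights):.2f})")
```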

Certainly, the mean and variance are called statistics. The concept of the average is itself an important index in the statistical world view, and it may be fair to say that finding the average is a statistical activity.

But I don't think it's "modeling".

Then

Q2. What kind of operation should be performed on the height data of all junior high school boys in Japan to call it statistical "modeling"?

**A2. If you look at the histogram of the observed data, think "isn't this shaped like a normal distribution?", and superimpose a normal distribution on the histogram, you are stepping into the world of statistical modeling.**

(Figure: unnamed.png, the histogram of the height data with a normal distribution superimposed)

It looks like the height distribution of all junior high school boys can be approximated by a normal distribution. The moment you think that, you have started statistical modeling.

Normal distribution

f(x) = \frac{1}{\sqrt{2\pi\sigma^2}} \exp\left(-\frac{(x-\mu)^2}{2\sigma^2}\right)\\
\\
\mu: \text{mean}\\
\sigma^2: \text{variance}

If you plug the mean and variance of the observed data into \mu and \sigma^2 of the normal distribution, you can superimpose the normal distribution on the histogram as shown in the figure above.
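The "superimposing" step can be sketched without a plotting library. The sketch below assumes simulated height data (hypothetical, in cm) and a helper `normal_pdf` that I named myself. It plugs the sample mean and variance into the normal density and compares, bin by bin, the share of the data in each histogram bin with the share the normal curve predicts (density at the bin center times the bin width, which is only an approximation).

```python
import math
import random
import statistics

rng = random.Random(1)
heights = [rng.gauss(160.0, 8.0) for _ in range(5_000)]  # simulated, in cm

# Plug the mean and variance of the observed data into the normal density.
mu = statistics.fmean(heights)
sigma2 = statistics.pvariance(heights, mu)


def normal_pdf(x, mu, sigma2):
    """Density of the normal distribution with mean mu and variance sigma2."""
    return math.exp(-((x - mu) ** 2) / (2 * sigma2)) / math.sqrt(2 * math.pi * sigma2)


bin_width = 5.0
for left in range(135, 185, 5):
    # Observed share of the data that falls in [left, left + bin_width).
    observed = sum(left <= h < left + bin_width for h in heights) / len(heights)
    # Share suggested by the normal curve: density at the bin center times width.
    expected = normal_pdf(left + bin_width / 2, mu, sigma2) * bin_width
    print(f"{left}-{left + 5} cm  observed {observed:.3f}  normal {expected:.3f}")
```

If the two columns are close, the normal distribution is a reasonable description of the histogram, which is exactly the judgment being made by eye in the figure.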

You think about the probability distribution behind the data you obtained, and you suppose that these observations were generated from a normal distribution. That is already perfectly good "statistical modeling".

However, even though we call it "statistical modeling", it does not feel like we have done anything very meaningful.

Q3. Then, in what situations does statistical modeling make sense?

**A3. When height data is available for only some junior high school boys, not for all junior high school boys in Japan.**

So far, we have assumed that we have the height data of all junior high school boys, but in real data analysis we are rarely that lucky.

From the data (a sample) for only a **small part** of the whole, we have to reason about the mean and variance of the whole (the population). In Bayesian statistics these are treated as distributions rather than single values. This calls for a sophisticated intellectual activity, and that is what statistics sets out to do.

In order to think about the population from the data at hand, we must first assume the distribution of the population. For example, suppose that the height distribution of all junior high school boys in Japan is a normal distribution. Here, we need to assume the distribution of the population, with only the data at hand as hints. You will mobilize all your experience and knowledge and choose the probability distribution that you find most appropriate.

This is statistical modeling. And this is also where statistics gets difficult. The reason why you chose that particular statistical model is where your subjectivity comes in. And whether or not the statistical model is persuasive depends on the subjectivity of the person listening to your claim.

We won't go any further here, but the world of "modeling" often leaves room that cannot be settled on a completely objective basis.

Once the statistical model is in place, we regard the height data of the 100 people at hand as having been generated from the normally distributed population (= the heights of all junior high school boys), and we try to infer the shape of the population's normal distribution from the data at hand.

This is a "statistical" estimate based on statistical modeling.

If the population distribution is normal, then once you know the mean and variance of the population, you can draw the shape of the distribution.

I won't go into detail here, but using the average height of the 100 people at hand as the estimate of the average height of all junior high school boys in the population is, statistically, the most reasonable estimate. It is intuitively convincing, too.

Estimating the variance is a bit more confusing, so if you're interested, study it for yourself.
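As a rough illustration of estimating from a sample, the sketch below draws 100 values from a simulated population (hypothetical numbers; in practice the population cannot be observed) and estimates the population mean and variance. For the variance, one commonly cited point is that dividing by n - 1 instead of n gives the unbiased estimator, which is what `statistics.variance` computes. This is only one part of the story about variance estimation.

```python
import random
import statistics

rng = random.Random(2)

# Hypothetical population (simulated heights in cm). In real analysis
# we would never get to see this list, only the sample below.
population = [rng.gauss(160.0, 8.0) for _ in range(100_000)]
sample = rng.sample(population, 100)

# The sample mean is used as the estimate of the population mean.
mean_hat = statistics.fmean(sample)

# Two variance estimates from the same sample:
var_n = statistics.pvariance(sample)        # divides by n
var_unbiased = statistics.variance(sample)  # divides by n - 1

print(f"estimated mean:            {mean_hat:.2f} cm")
print(f"variance (divide by n):    {var_n:.2f}")
print(f"variance (divide by n-1):  {var_unbiased:.2f}")
print(f"population mean, variance: {statistics.fmean(population):.2f}, "
      f"{statistics.pvariance(population):.2f}")
```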

If you know the shape of the population's normal distribution, you can see how likely (or how rare) it is that this particular data for 100 people would be produced. That means we have a way to explain the data at hand probabilistically.
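One way to put a number on "how likely this data is" is the log-likelihood: the sum of the log densities of each observation under the assumed normal distribution. The sketch below uses simulated data and a helper `log_likelihood` of my own naming. It compares the normal distribution fitted to the sample with a deliberately shifted one (a hypothetical alternative whose mean is 10 cm higher).

```python
import math
import random
import statistics

rng = random.Random(3)
sample = [rng.gauss(160.0, 8.0) for _ in range(100)]  # simulated heights, in cm


def log_likelihood(data, mu, sigma):
    """Sum of log densities of the data under Normal(mu, sigma)."""
    dist = statistics.NormalDist(mu, sigma)
    return sum(math.log(dist.pdf(x)) for x in data)


mu_hat = statistics.fmean(sample)
sigma_hat = statistics.pstdev(sample)

# The same data evaluated under two candidate population distributions.
ll_fitted = log_likelihood(sample, mu_hat, sigma_hat)
ll_shifted = log_likelihood(sample, mu_hat + 10.0, sigma_hat)

print(f"log-likelihood under the fitted normal:  {ll_fitted:.1f}")
print(f"log-likelihood under the shifted normal: {ll_shifted:.1f}")
```

The fitted distribution gives the larger log-likelihood, which is another way of saying that the data at hand would be rarer if the population mean were 10 cm higher.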

Statistics and probability theory are inextricably linked. This is because statistics is a discipline that, through statistical modeling, deals with the probability that the obtained data would occur.

If you were able to do statistical modeling in this way, what kind of problems would you be interested in next? For example, you might want to compare it with data from other groups (Japanese junior high school girls and high school boys). An analytical method that plays an active role in comparison with other populations is called a "test." The test is also a statistical method made possible by assuming the probability distribution of the population.
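A comparison between two groups can be sketched as follows. This is not a full implementation of any particular textbook test. It computes the Welch-type statistic (difference in sample means divided by its standard error) and, as a simplifying assumption of mine, uses the standard normal distribution for the p-value, which is a reasonable approximation only for fairly large samples. Both groups are simulated with hypothetical means, and `compare_means` is a name I made up.

```python
import math
import random
import statistics


def compare_means(group_a, group_b):
    """Return (z, p) for the difference in means of two independent samples.

    Large-sample normal approximation; for small samples a t distribution
    would be used instead.
    """
    mean_a, mean_b = statistics.fmean(group_a), statistics.fmean(group_b)
    var_a, var_b = statistics.variance(group_a), statistics.variance(group_b)
    standard_error = math.sqrt(var_a / len(group_a) + var_b / len(group_b))
    z = (mean_a - mean_b) / standard_error
    # Two-sided p-value: probability of a statistic at least this extreme
    # if both populations had the same mean.
    p = 2.0 * (1.0 - statistics.NormalDist().cdf(abs(z)))
    return z, p


rng = random.Random(4)
boys = [rng.gauss(160.0, 8.0) for _ in range(100)]   # simulated, in cm
girls = [rng.gauss(152.0, 7.0) for _ in range(100)]  # simulated, in cm

z, p_value = compare_means(boys, girls)
print(f"z = {z:.2f}, two-sided p-value = {p_value:.3g}")
```

Note that the p-value only has meaning because a probability distribution was assumed for each population in the first place.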

Once you do "statistical modeling" in this way, a rich world opens up beyond it. That world is called statistics.

Relationship between regression models such as generalized linear models and the statistical modeling described here

To state the conclusion first: the statistical models that many people think of when they hear "statistical modeling", such as the linear model (LM), the generalized linear model (GLM), and the generalized linear mixed model (GLMM), are merely extensions of the statistical modeling described here to the world of regression.

The foundation of these more advanced models is also the probability distribution.

For example, the green book I mentioned at the beginning and the article Let's reorganize "what is statistical modeling" mainly deal with regression models.

However, the essence of statistical modeling is the same: assume a true distribution, and infer the values of its parameters (the mean and variance; in Bayesian statistics these are themselves distributions rather than single values) using the data at hand.
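To show how the same idea carries over to regression, here is a minimal sketch of a linear model: each observation is assumed to come from a normal distribution whose mean depends linearly on an explanatory variable. The data (height against age), the true coefficients, and the name `fit_line` are all hypothetical choices for illustration. Under normal noise with constant variance, ordinary least squares coincides with the maximum likelihood estimate of the intercept and slope, so that is what the sketch uses.

```python
import random
import statistics


def fit_line(xs, ys):
    """Least-squares estimates (intercept, slope) for y = intercept + slope * x."""
    x_mean, y_mean = statistics.fmean(xs), statistics.fmean(ys)
    sxx = sum((x - x_mean) ** 2 for x in xs)
    sxy = sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = y_mean - slope * x_mean
    return intercept, slope


rng = random.Random(5)

# Simulated data: height ~ Normal(85 + 5 * age, 6^2). The mean of the
# normal distribution now moves with age instead of being one fixed value.
ages = [rng.uniform(12.0, 15.0) for _ in range(200)]
heights = [85.0 + 5.0 * age + rng.gauss(0.0, 6.0) for age in ages]

intercept, slope = fit_line(ages, heights)

# The remaining parameter of the assumed normal distribution: the noise variance.
residuals = [h - (intercept + slope * a) for a, h in zip(ages, heights)]
noise_variance = sum(r ** 2 for r in residuals) / (len(residuals) - 2)

print(f"intercept = {intercept:.1f}, slope = {slope:.2f}, "
      f"noise variance = {noise_variance:.1f}")
```

Compared with the single normal distribution used earlier, the only change is that the mean has become a function of age. The activity itself, assuming a distribution and estimating its parameters from the data at hand, is unchanged.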

Summary

Today, I thought about the question, "What is statistical modeling after all?"

I've read a lot of statistics books, but not many of them explained the meaning of a model in statistics. In particular, I never really understood the point discussed here, "what is statistical modeling and what is not statistical modeling".

So, I decided to summarize my thoughts, which was the reason I wrote this article.

We hope that this article will serve as a reference for everyone who challenges the deep academic discipline of "statistics."
