[PYTHON] Bayesian modeling: estimating the difference between two groups

Estimating the difference between the two groups

Introduction

Bayesian modeling practice, part 1

For various reasons I had not had much time, but I finally got to practice Bayesian modeling.

Part 1 covers estimating the difference between two groups with Bayesian statistics.

Reference book:

"First Statistical Data Analysis" (https://www.asakura.co.jp/books/isbn/978-4-254-12214-5/)

The data used is the famous "iris" dataset.

From it, I used "petal.length" (the length of the petal).

For the Bayesian modeling software, I use Stan with Python (PyStan).

About the difference between the two groups using Bayesian statistics

The goal this time is to do with Bayesian modeling what the t-test, a well-known frequentist method, does.

Let the observed data be x1 and x2. We assume they are generated as

x1 ~ normal(mu1, sigma1)
x2 ~ normal(mu2, sigma2)

By comparing the posterior distributions of the population means mu1 and mu2, estimated from the observations (x1, x2), we can discuss the probability that the two groups differ.
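As a rough illustration of this idea, here is a minimal sketch that does not use Stan. It assumes the noninformative prior p(mu, sigma^2) proportional to 1/sigma^2, for which the posterior can be sampled directly with the standard library. This prior is my assumption and not necessarily the one used in the article's Stan model. The data are synthetic stand-ins, not the iris measurements, and the function name posterior_mu_samples is hypothetical.

```python
import math
import random
import statistics


def posterior_mu_samples(x, n_draws, rng):
    """Draw posterior samples of mu for normal data.

    Assumes the prior p(mu, sigma^2) proportional to 1/sigma^2, so that
    sigma^2 | x  is (n-1) * s^2 / chi2(n-1)  and  mu | sigma^2, x  is
    normal(xbar, sigma^2 / n).
    """
    n = len(x)
    xbar = statistics.fmean(x)
    s2 = statistics.variance(x)  # sample variance with n-1 denominator
    draws = []
    for _ in range(n_draws):
        # chi-square with n-1 degrees of freedom is gamma(shape=(n-1)/2, scale=2)
        chi2 = rng.gammavariate((n - 1) / 2.0, 2.0)
        sigma2 = (n - 1) * s2 / chi2
        draws.append(rng.gauss(xbar, math.sqrt(sigma2 / n)))
    return draws


rng = random.Random(0)

# Synthetic stand-ins for the two groups (made-up values, NOT the iris data)
x1 = [rng.gauss(4.3, 0.5) for _ in range(30)]
x2 = [rng.gauss(5.5, 0.5) for _ in range(30)]

mu1 = posterior_mu_samples(x1, 4000, rng)
mu2 = posterior_mu_samples(x2, 4000, rng)

# The groups are modeled independently, so pairing draws gives samples of mu2 - mu1
diff = [b - a for a, b in zip(mu1, mu2)]
prob_mu2_greater = sum(d > 0 for d in diff) / len(diff)
print(f"P(mu2 > mu1 | data) = {prob_mu2_greater:.3f}")
```

With Stan the same quantities would come from the MCMC draws of mu1 and mu2 instead of from this closed-form sampler.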

Experiment

First, the plot below shows histograms of the original data together with the normal distributions given by the estimated mu and sigma.

image1.png

There are 50 observations each for the versicolor and virginica species, and the plot alone already suggests a difference between them.

Next is a plot of the estimated mu_versicolor and mu_virginica.

image2.png

Each distribution appears in a light and a dark shade because we estimated from 10 samples (lighter) and from 30 samples (darker).

The lighter distributions are more spread out, which shows that the more observations there are, the more precise the estimate becomes.
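The effect of sample size on the spread can be sketched with a short calculation. It assumes the prior p(mu, sigma^2) proportional to 1/sigma^2, under which the marginal posterior of mu is a Student-t distribution with n-1 degrees of freedom and scale s/sqrt(n). The prior and the sample standard deviation of 0.5 are assumptions for illustration, not values from the article.

```python
import math


def posterior_sd_of_mu(sample_sd, n):
    """Posterior standard deviation of mu for normal data.

    Assumes the prior p(mu, sigma^2) proportional to 1/sigma^2. The marginal
    posterior of mu is then Student-t with n-1 degrees of freedom and scale
    sample_sd / sqrt(n), whose sd is scale * sqrt((n-1)/(n-3)).
    """
    if n <= 3:
        raise ValueError("n must be greater than 3 for a finite posterior variance")
    scale = sample_sd / math.sqrt(n)
    return scale * math.sqrt((n - 1) / (n - 3))


assumed_sd = 0.5  # hypothetical sample sd, not taken from the iris data
sd_10 = posterior_sd_of_mu(assumed_sd, 10)
sd_30 = posterior_sd_of_mu(assumed_sd, 30)
print(f"n=10: {sd_10:.3f}  n=30: {sd_30:.3f}  ratio: {sd_10 / sd_30:.2f}")
```

For the same sample standard deviation, the 10-sample posterior is nearly twice as wide as the 30-sample one, which matches the wider, lighter distributions in the plot.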

Finally, a plot of the difference between mu_versicolor and mu_virginica, for the 10-sample version and the 30-sample version.

image3.png

Blue is the 10-sample version and green is the 30-sample version.

By summarizing this histogram (for example, by taking the EAP, the posterior mean), we can evaluate the probability that the difference in petal length is larger than a given number of cm.

As expected, the larger the number of samples, the narrower and more sharply peaked the histogram.

With 30 samples, most of the probability puts the difference between 1.0 and 1.5 cm.
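A minimal sketch of such a summary, given a list of posterior draws of the difference, might look like the following. The draws here are placeholder values generated from a normal distribution so that the example runs on its own. They are not the article's MCMC output, and the function name summarize_difference and the threshold of 1.0 are hypothetical.

```python
import random
import statistics


def summarize_difference(diff_samples, threshold):
    """Summarize posterior draws of a difference between two means."""
    # EAP (expected a posteriori) estimate is the mean of the posterior draws
    eap = statistics.fmean(diff_samples)
    # Fraction of draws above the threshold approximates P(diff > threshold | data)
    prob_above = sum(d > threshold for d in diff_samples) / len(diff_samples)
    # With n=40 the first and last cut points are the 2.5% and 97.5% quantiles
    cuts = statistics.quantiles(diff_samples, n=40)
    return {"eap": eap, "prob_above": prob_above, "interval_95": (cuts[0], cuts[-1])}


rng = random.Random(1)
# Placeholder draws (made-up), standing in for MCMC samples of the difference
diff_samples = [rng.gauss(1.25, 0.13) for _ in range(4000)]

summary = summarize_difference(diff_samples, threshold=1.0)
print(summary)
```

With real output, diff_samples would be the element-wise difference of the sampled mu values for the two species.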

Conclusion

This time I evaluated the difference between two groups using Bayesian statistics, and it seems quite usable.

It is nice that you get not only something like a p-value but also the probability, as a percentage, that the groups differ by a given amount.

It is also nice that the low reliability of an estimate from a small amount of data can be evaluated.

I would like to use this in my own master's thesis.
