Estimating the difference between the two groups

Introduction

Bayesian modeling practice, part 1

I didn't have much time due to various reasons, but I was finally able to practice Bayesian modeling.

Part 1 covers estimating the difference between two groups with Bayesian statistics.

The reference book is "First Statistical Data Analysis" (https://www.asakura.co.jp/books/isbn/978-4-254-12214-5/).

The data is the well-known "iris" dataset.

From it, I used the "petal.length" column (the length of the petal).

For the Bayesian modeling software, I use Stan with Python (PyStan).

About the difference between the two groups using Bayesian statistics

The goal this time is to do with Bayesian modeling what the t-test, a well-known frequentist method, does.

Let the observed data be x1 and x2, and assume they are generated as follows:

x1 ~ normal(mu1, sigma1)
x2 ~ normal(mu2, sigma2)

By comparing the population means mu1 and mu2, estimated from the observed values (x1, x2), it becomes possible to discuss the probability that there is a difference between them.
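As a rough sketch (not necessarily the exact model behind the plots in this post), the generative model above could be written as a Stan program like the one below. The names N1, N2, delta, make_stan_data and fit_with_pystan2 are my own illustrative choices, no priors are specified, and the fitting function assumes the PyStan 2.x interface.

```python
# Sketch of the two-group model as a Stan program held in a Python string.
# All names here (N1, N2, delta, helper functions) are illustrative assumptions.
STAN_PROGRAM = """
data {
  int<lower=1> N1;
  int<lower=1> N2;
  vector[N1] x1;
  vector[N2] x2;
}
parameters {
  real mu1;
  real mu2;
  real<lower=0> sigma1;
  real<lower=0> sigma2;
}
model {
  // No explicit priors: each parameter gets a flat prior over its declared support.
  x1 ~ normal(mu1, sigma1);
  x2 ~ normal(mu2, sigma2);
}
generated quantities {
  real delta;
  delta = mu2 - mu1;  // difference between the two population means
}
"""


def make_stan_data(x1, x2):
    """Pack two sequences of observations into the dict the data block expects."""
    return {"N1": len(x1), "N2": len(x2), "x1": list(x1), "x2": list(x2)}


def fit_with_pystan2(x1, x2, iterations=2000, chains=4):
    """Compile and sample, assuming PyStan 2.x is installed."""
    import pystan  # imported lazily so the rest of the sketch runs without it

    model = pystan.StanModel(model_code=STAN_PROGRAM)
    return model.sampling(data=make_stan_data(x1, x2), iter=iterations, chains=chains)
```

Putting delta in the generated quantities block means every MCMC draw carries its own value of the difference, so the posterior of the difference can be read off directly from the samples.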

Experiment

First, the plot below shows the original data as histograms, together with the normal distributions given by the estimated mu and sigma.

There are 50 observations each for the versicolor and virginica species, and just from the plot you can see that there seems to be a difference between them.

Next is a plot of the estimated mu_versicolor and mu_virginica.

Each appears in two shades, light and dark, because I estimated once from 10 samples (lighter) and once from 30 samples (darker).

The lighter ones are spread more widely, which shows that the more observations there are, the more precise the estimate becomes.
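The effect of sample size on the width of the posterior of mu can be checked with a small standard-library sketch. Note that this is not the Stan fit: it swaps MCMC for exact sampling from the closed-form posterior under the noninformative prior p(mu, sigma^2) proportional to 1/sigma^2, and it uses made-up synthetic data (mean 4.3, sd 0.5 are arbitrary) rather than the iris measurements.

```python
import random
import statistics


def sample_mu_posterior(x, draws=4000, rng=None):
    """Draw from p(mu | x) for normal data, prior p(mu, sigma^2) proportional to 1/sigma^2."""
    rng = rng or random.Random(0)
    n = len(x)
    xbar = statistics.mean(x)
    ss = sum((v - xbar) ** 2 for v in x)  # sum of squared deviations
    out = []
    for _ in range(draws):
        # sigma^2 | x  ~  ss / chi2(n - 1); chi2(k) is Gamma(shape=k/2, scale=2)
        sigma2 = ss / rng.gammavariate((n - 1) / 2, 2.0)
        # mu | sigma^2, x  ~  Normal(xbar, sigma^2 / n)
        out.append(rng.gauss(xbar, (sigma2 / n) ** 0.5))
    return out


def synthetic_sample(n, mean=4.3, sd=0.5):
    """Evenly spaced normal quantiles: a deterministic, made-up stand-in for real data."""
    dist = statistics.NormalDist(mean, sd)
    return [dist.inv_cdf((i + 0.5) / n) for i in range(n)]


# Posterior standard deviation of mu for a small and a larger sample.
sd_10 = statistics.stdev(sample_mu_posterior(synthetic_sample(10), rng=random.Random(1)))
sd_30 = statistics.stdev(sample_mu_posterior(synthetic_sample(30), rng=random.Random(2)))
print(f"posterior sd of mu, n=10: {sd_10:.3f}")
print(f"posterior sd of mu, n=30: {sd_30:.3f}")
```

The posterior standard deviation for n=30 should come out clearly smaller than for n=10, which is the same narrowing seen between the lighter and darker histograms.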

Finally, a plot of the difference between mu_versicolor and mu_virginica, for the 10-sample version and the 30-sample version.

Blue is the 10-sample version and green is the 30-sample version.

By summarizing this histogram (for example, by obtaining the EAP), it is possible to evaluate things like the probability that the difference in length is larger than some number of cm.

Here too, the larger the number of samples, the sharper and narrower the peak of the histogram.

With 30 samples, most of the probability falls on a difference of 1.0 to 1.5.
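The kind of summary described above can be sketched in a few lines once posterior draws of the difference are available. The function name summarize_difference and the threshold argument are hypothetical, and the draws used in the example are made-up random numbers standing in for MCMC output, not results from the iris fit.

```python
import random
import statistics


def summarize_difference(delta_draws, threshold):
    """Summarize posterior draws of the difference between two population means."""
    draws = sorted(delta_draws)
    n = len(draws)
    eap = statistics.mean(draws)  # EAP (expected a posteriori) = posterior mean
    lower = draws[int(0.025 * n)]  # crude 95% interval taken from the sorted draws
    upper = draws[int(0.975 * n) - 1]
    prob = sum(d > threshold for d in draws) / n  # P(difference > threshold | data)
    return {"eap": eap, "ci95": (lower, upper), "prob_greater": prob}


# Made-up stand-in for MCMC output (NOT results from the iris data).
rng = random.Random(0)
fake_draws = [rng.gauss(2.0, 0.5) for _ in range(4000)]
summary = summarize_difference(fake_draws, threshold=1.5)
print(summary)
```

With real output, delta_draws would be the sampled values of the difference taken from the fitted model, and prob_greater is the "probability that the difference is larger than some value" read directly off the histogram.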

Conclusion

This time I evaluated the difference between two groups using Bayesian statistics, and it seems quite usable.

It is nice that you get not just a p-value but also the probability, as a percentage, that the groups differ by a given amount.

It is also nice that the low reliability of a small amount of data can be evaluated.

I'd like to use it for my own master's thesis.
