[PYTHON] Organizing basic procedures for data analysis and statistical processing (4)

Following on from last time, this is the second of the three key topics in social statistics: inferring the population from a sample. I have written about this part many times, so let's review it.

Sampling

The entire set of subjects you want to analyze and learn about is called the **population**.

I have already written about sampling from a population and about sampling methods.

In statistics, the mean and variance of the population are rarely known in advance; they must be estimated from data. By examining a sample drawn from the population, the nature of the population can be investigated with a stated degree of confidence.
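
As a small sketch of this idea (the population values and seed here are invented for illustration, not from the article): draw a simple random sample from a finite population and compare the sample statistics with the population's.

```python
import numpy as np

rng = np.random.default_rng(0)

# A hypothetical finite population of 100,000 values
population = rng.normal(loc=100, scale=25, size=100_000)

# A simple random sample of 500, drawn without replacement
sample = rng.choice(population, size=500, replace=False)

# The sample statistics approximate the population's
print(population.mean(), sample.mean())
print(population.var(), sample.var(ddof=1))
```

The sample mean and variance land close to, but not exactly on, the population values; quantifying that gap is what the rest of this article is about.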

There are several reasons why it is difficult to know everything about a population; for example, surveying every member is often costly or outright impossible.

Estimation

To use data numerically for real-world economic analysis, policy evaluation, customer surveys, and so on, you need to know its mean and variance. In real problems the population parameters are unknown and must be **estimated** from the sample at hand.

**Interval estimation** estimates a range of values likely to contain the parameter. The main pieces of information required for this are the degrees of freedom and an unbiased estimator, described below.

In statistics, the degrees of freedom are the number of values that are free to vary. I have previously explained the definition of degrees of freedom and its application to hypothesis tests.
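
As an illustration of why the degrees of freedom matter (an example of mine, not from the article): the critical value of the t distribution depends on the degrees of freedom and approaches the standard normal value as they grow.

```python
from scipy.stats import norm, t

# 95th-percentile critical values of the t distribution
# for increasing degrees of freedom
for df in (5, 30, 500):
    print(df, t.ppf(0.95, df))

# For comparison, the standard normal critical value (about 1.645)
print(norm.ppf(0.95))
```

With few degrees of freedom the t critical value is noticeably larger, which widens confidence intervals built from small samples.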

An estimator is unbiased when its expected value equals the true parameter; in other words, it neither overestimates nor underestimates on average. An estimator that satisfies this property is called an **unbiased estimator**.

The unbiasedness of the sample mean and the sample variance is especially important. The sample mean is always an unbiased estimator of the population mean, while the sample variance is unbiased only when computed with the n − 1 divisor.
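
A quick simulation (a sketch of mine, with an invented seed and sample size) shows why the n − 1 divisor matters: averaged over many samples, the ddof=1 variance centers on the population variance, while the n divisor systematically underestimates it.

```python
import numpy as np

rng = np.random.default_rng(42)
pop_var = 25 ** 2  # population variance (sigma = 25)

biased, unbiased = [], []
for _ in range(2000):
    x = rng.normal(loc=100, scale=25, size=10)
    biased.append(np.var(x))            # divides by n
    unbiased.append(np.var(x, ddof=1))  # divides by n - 1

print(np.mean(unbiased))  # centers near 625
print(np.mean(biased))    # systematically below 625
```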

import numpy as np

# Draw a sample of 500 values from a normal distribution
# (population mean 100, population standard deviation 25)
data = np.random.normal(loc=100, scale=25, size=500)

# Sample mean
mu = np.mean(data)
#=> 99.416556898424659

# Unbiased sample variance (ddof=1 divides by n - 1)
s2 = np.var(data, ddof=1)
#=> 685.08664455245321

# 90% confidence interval: z is the 0.95 quantile of the standard
# normal distribution, leaving 5% in each tail
from scipy.stats import norm
z = norm.ppf(0.95)

# 100(1 - alpha)% confidence interval; the standard error of the
# sample mean is sigma / sqrt(n), here 25 / sqrt(500)
r = np.array([-z, z]) * 25 / np.sqrt(500)
mu + r
#=> roughly array([ 97.58, 101.26]) for the sample above -- interval estimate

In the above example N = 500; as N increases, the sample mean converges to the population mean by the law of large numbers, and the distribution of the sample mean approaches a normal distribution by the central limit theorem.
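
To sketch the convergence (sample sizes and seed chosen by me for illustration): the sample mean drifts toward the population mean of 100 as N grows.

```python
import numpy as np

rng = np.random.default_rng(1)

# Sample means for increasing sample sizes
for n in (50, 500, 5000, 50000):
    data = rng.normal(loc=100, scale=25, size=n)
    print(n, data.mean())
```

The fluctuation around 100 shrinks on the order of 1 / sqrt(N), which is also why the confidence interval above narrows with larger samples.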

Testing

If you have made an assumption about the form of the distribution, check it with a goodness-of-fit test. To test whether the population means differ across levels, use analysis of variance.
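
A hedged sketch of both with scipy.stats (the particular functions and the made-up data are my choices, not the article's): a Shapiro-Wilk test as a goodness-of-fit check for normality, and a one-way analysis of variance across three levels.

```python
import numpy as np
from scipy.stats import f_oneway, shapiro

rng = np.random.default_rng(0)

# Goodness of fit: does the sample look normally distributed?
x = rng.normal(loc=100, scale=25, size=200)
stat, p = shapiro(x)
print(p)  # p-value of the normality test

# One-way ANOVA: is there a difference in means across three levels?
a = rng.normal(100, 25, 50)
b = rng.normal(100, 25, 50)
c = rng.normal(120, 25, 50)
f_stat, p_anova = f_oneway(a, b, c)
print(p_anova)  # a small p-value suggests the level means differ
```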

As for the hypothesis of equal variances, you should [use Welch's test in the t-test regardless of whether the population variances are equal](http://qiita.com/ynakayama/items/b9ec31a296de48e62863).

As a matter of fact, recent versions of R perform Welch's test by default in `t.test`. You should do the same in Python with SciPy by passing the `equal_var=False` option. However, keep in mind whether the population variances are known, unknown but equal, or unequal.
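
A minimal sketch of that option in SciPy (the two groups here are invented to have deliberately unequal variances):

```python
import numpy as np
from scipy.stats import ttest_ind

rng = np.random.default_rng(0)

# Two groups with clearly unequal variances
g1 = rng.normal(loc=100, scale=10, size=40)
g2 = rng.normal(loc=110, scale=30, size=40)

# equal_var=False selects Welch's t-test, which does not
# assume equal population variances
t_stat, p = ttest_ind(g1, g2, equal_var=False)
print(t_stat, p)
```

With `equal_var=True` (the default), the same call would run Student's pooled-variance t-test instead.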

Next time, I will continue this series by investigating the relationship between variables.
