[PYTHON] Organizing basic procedures for data analysis and statistical processing (4)

Following on from last time, this is the second of the three key topics in social statistics: inferring the population from a sample. I have written about this part many times, so let's review it.

Sampling

The entire set of subjects you want to analyze and learn about is called the **population**.

I have already written about sampling from a population and about sampling methods.

In statistics, the mean and variance of the population are rarely known in advance; they must be estimated from data. By examining a sample drawn from the population, the nature of the population can be investigated with a stated degree of confidence.
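
As a small sketch of this idea (the population values and seed here are invented for illustration, not from the article): draw a simple random sample from a finite population and compare the sample statistics with the population's.

```python
import numpy as np

rng = np.random.default_rng(0)

# A hypothetical finite population of 100,000 values
population = rng.normal(loc=100, scale=25, size=100_000)

# A simple random sample of 500, drawn without replacement
sample = rng.choice(population, size=500, replace=False)

# The sample statistics approximate the population's
print(population.mean(), sample.mean())
print(population.var(), sample.var(ddof=1))
```

The sample mean and variance land close to, but not exactly on, the population values; quantifying that gap is what the rest of this article is about.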

There are several reasons why it is difficult to know everything about a population; for example, surveying every member is often costly or outright impossible.

Estimation

To use data numerically for real-world economic analysis, policy evaluation, customer surveys, and so on, you need to know its mean and variance. In real problems the population parameters are unknown and must be **estimated** from the sample at hand.

**Interval estimation** estimates a range of values likely to contain the parameter. The main pieces of information required for this are the degrees of freedom and an unbiased estimator, described below.

In statistics, the degrees of freedom are the number of values that are free to vary. I have previously explained the definition of degrees of freedom and its application to hypothesis tests.
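
As an illustration of why the degrees of freedom matter (an example of mine, not from the article): the critical value of the t distribution depends on the degrees of freedom and approaches the standard normal value as they grow.

```python
from scipy.stats import norm, t

# 95th-percentile critical values of the t distribution
# for increasing degrees of freedom
for df in (5, 30, 500):
    print(df, t.ppf(0.95, df))

# For comparison, the standard normal critical value (about 1.645)
print(norm.ppf(0.95))
```

With few degrees of freedom the t critical value is noticeably larger, which widens confidence intervals built from small samples.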

An estimator is unbiased when its expected value equals the true parameter; in other words, it neither overestimates nor underestimates on average. An estimator that satisfies this property is called an **unbiased estimator**.

The unbiasedness of the sample mean and the sample variance is especially important. The sample mean is always an unbiased estimator of the population mean, while the sample variance is unbiased only when computed with the n − 1 divisor.
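
A quick simulation (a sketch of mine, with an invented seed and sample size) shows why the n − 1 divisor matters: averaged over many samples, the ddof=1 variance centers on the population variance, while the n divisor systematically underestimates it.

```python
import numpy as np

rng = np.random.default_rng(42)
pop_var = 25 ** 2  # population variance (sigma = 25)

biased, unbiased = [], []
for _ in range(2000):
    x = rng.normal(loc=100, scale=25, size=10)
    biased.append(np.var(x))            # divides by n
    unbiased.append(np.var(x, ddof=1))  # divides by n - 1

print(np.mean(unbiased))  # centers near 625
print(np.mean(biased))    # systematically below 625
```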

import numpy as np

# Draw a sample of 500 values from a normal distribution
# (population mean 100, population standard deviation 25)
data = np.random.normal(loc=100, scale=25, size=500)

# Sample mean
mu = np.mean(data)
#=> 99.416556898424659

# Unbiased sample variance (ddof=1 divides by n - 1)
s2 = np.var(data, ddof=1)
#=> 685.08664455245321

# 90% confidence interval: z is the 0.95 quantile of the standard
# normal distribution, leaving 5% in each tail
from scipy.stats import norm
z = norm.ppf(0.95)

# 100(1 - alpha)% confidence interval; the standard error of the
# sample mean is sigma / sqrt(n), here 25 / sqrt(500)
r = np.array([-z, z]) * 25 / np.sqrt(500)
mu + r
#=> roughly array([ 97.58, 101.26]) for the sample above -- interval estimate

In the above example N = 500; as N increases, the sample mean converges to the population mean by the law of large numbers, and the distribution of the sample mean approaches a normal distribution by the central limit theorem.
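
To sketch the convergence (sample sizes and seed chosen by me for illustration): the sample mean drifts toward the population mean of 100 as N grows.

```python
import numpy as np

rng = np.random.default_rng(1)

# Sample means for increasing sample sizes
for n in (50, 500, 5000, 50000):
    data = rng.normal(loc=100, scale=25, size=n)
    print(n, data.mean())
```

The fluctuation around 100 shrinks on the order of 1 / sqrt(N), which is also why the confidence interval above narrows with larger samples.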

Testing

If you have made an assumption about the form of the distribution, check it with a goodness-of-fit test. To test whether the population means differ across levels, use analysis of variance.
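
A hedged sketch of both with scipy.stats (the particular functions and the made-up data are my choices, not the article's): a Shapiro-Wilk test as a goodness-of-fit check for normality, and a one-way analysis of variance across three levels.

```python
import numpy as np
from scipy.stats import f_oneway, shapiro

rng = np.random.default_rng(0)

# Goodness of fit: does the sample look normally distributed?
x = rng.normal(loc=100, scale=25, size=200)
stat, p = shapiro(x)
print(p)  # p-value of the normality test

# One-way ANOVA: is there a difference in means across three levels?
a = rng.normal(100, 25, 50)
b = rng.normal(100, 25, 50)
c = rng.normal(120, 25, 50)
f_stat, p_anova = f_oneway(a, b, c)
print(p_anova)  # a small p-value suggests the level means differ
```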

As for the hypothesis of equal variances, you should [use Welch's test in the t-test regardless of whether the population variances are equal](http://qiita.com/ynakayama/items/b9ec31a296de48e62863).

As a matter of fact, recent versions of R perform Welch's test by default in `t.test`. You should do the same in Python with SciPy by passing the `equal_var=False` option. However, keep in mind whether the population variances are known, unknown but equal, or unequal.
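
A minimal sketch of that option in SciPy (the two groups here are invented to have deliberately unequal variances):

```python
import numpy as np
from scipy.stats import ttest_ind

rng = np.random.default_rng(0)

# Two groups with clearly unequal variances
g1 = rng.normal(loc=100, scale=10, size=40)
g2 = rng.normal(loc=110, scale=30, size=40)

# equal_var=False selects Welch's t-test, which does not
# assume equal population variances
t_stat, p = ttest_ind(g1, g2, equal_var=False)
print(t_stat, p)
```

With `equal_var=True` (the default), the same call would run Student's pooled-variance t-test instead.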

Next time, I will continue this series by investigating the relationship between variables.
