[Python] About that time I got curious about sampling error

This article is the 21st-day entry of the Furukawa Lab Advent Calendar. It was written by a student at Furukawa Lab as part of their studies, so the content may be imprecise in places and some expressions may be a little off.

Introduction

I'm not familiar with statistics at all, but while I was happily scrolling Twitter I saw a tweet with a ranking image along the lines of "Future dream ranking: we asked 200 junior high school students!" (it may not have been exactly that, lol), and I wondered: can you really tell anything from just 200 people? Apparently someone else thought the same, because one reply said "Isn't 200 people too small a survey?", to which someone else answered "At a 95% confidence level, the sampling error is within 10%." I'd heard of that, and I remembered covering it in a university class, but sadly I couldn't recall the details, so I decided to ask Google-sensei...

Sampling error

Sampling error is the error that arises when you estimate a quantity of the population from a sample. Hmmmm... I do feel like I learned this. I remember there being confidence-interval problems on the exam. This site explains it clearly, and if you actually do the calculation with the earlier example...

\begin{align}
\bar{p} &= 0.5\\
n &= 200\\
1.96\sqrt{\cfrac{\bar{p}(1-\bar{p})}{n}} &= 1.96\sqrt{\cfrac{0.5(1-0.5)}{200}}\\
&= 0.069296\dots
\end{align}

So the sampling error is about 7%. (By the way, with 100 samples it comes out to about 10%. No way!) I thought it's good to keep the idea of sampling error in mind.
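Just to double-check the numbers, here is a minimal sketch of the formula above in Python (the helper name margin_of_error is mine, not from any library):

    import numpy as np

    # 95% margin of error for an estimated proportion, at the worst case p = 0.5
    def margin_of_error(n, p=0.5, z=1.96):
        return z * np.sqrt(p * (1 - p) / n)

    for n in [50, 100, 200, 400, 900]:
        print(n, f"{margin_of_error(n):.1%}")
    # 50 13.9%, 100 9.8%, 200 6.9%, 400 4.9%, 900 3.3%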

The bonus experiment starts here (it's short, lol)

I've heard that a completely randomly drawn sample is fine... So let's prepare saddle-shaped data as the population and draw random samples from it. How many samples do we need to estimate the original saddle shape?

(figure: saddle-shaped data used as the population)

    import numpy as np

    map_size = 60

    def create_data(nb_samples, input_dim=3, retdim=False):
        # two latent coordinates, uniform on [-map_size/2, map_size/2]
        latent_dim = 2
        z1 = np.random.rand(nb_samples) * map_size - map_size / 2
        z2 = np.random.rand(nb_samples) * map_size - map_size / 2

        # embed in 3-D: the third axis is the saddle surface z1^2 - z2^2
        x = np.zeros((nb_samples, input_dim))
        x[:, 0] = z1
        x[:, 1] = z2
        x[:, 2] = z1 ** 2 - z2 ** 2

        if retdim:
            return x, latent_dim
        else:
            return x

    # parameters to make data
    x_sigma = 0.1
    nb_samples = 900
    seed = 1

    # make data (add isotropic Gaussian observation noise)
    np.random.seed(seed)
    X = create_data(nb_samples)
    X += np.random.normal(0, x_sigma, X.shape)
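One note: the experiment below refits with 50, 100, 200, and 400 points. The article doesn't show how those subsets were drawn, but a simple way (my assumption) is to pick rows of X at random without replacement, as sketched here:

    # assumed subsampling step (not shown in the article):
    # draw n of the 900 rows uniformly at random, without replacement
    n = 200
    idx = np.random.choice(len(X), size=n, replace=False)
    X_sub = X[idx]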

This creates a saddle shape like the one above! Starting with 900 samples, let's estimate 30 × 30 grid points with Gaussian process regression: estimate the Z axis from the X and Y coordinates! I plotted only the predicted mean with blue dots, since drawing the variance as well would be hard to see!

    from matplotlib import pyplot as plt
    from sklearn.gaussian_process import GaussianProcessRegressor
    from sklearn.gaussian_process import kernels as sk_kern

    # regress the Z axis on the (X, Y) coordinates
    x_input = X[:, :2]
    y_input = X[:, 2]

    kernel = sk_kern.RBF(1.0, (1e-3, 1e3))  # + sk_kern.ConstantKernel(1.0, (1e-3, 1e3)) + sk_kern.WhiteKernel()
    clf = GaussianProcessRegressor(
        kernel=kernel,
        alpha=1e-10,
        optimizer="fmin_l_bfgs_b",
        n_restarts_optimizer=20,
        normalize_y=True)

    clf.fit(x_input, y_input)
    clf.kernel_  # fitted kernel (hyperparameters after optimization)

    # 30 x 30 grid of node points to predict on
    test_x = np.linspace(-map_size / 2, map_size / 2, 30)
    test = np.meshgrid(test_x, test_x)
    test = np.dstack(test).reshape(-1, 2)

    pred_mean, pred_std = clf.predict(test, return_std=True)

    fig = plt.figure()
    # note: aspect='equal' raises an error on 3-D axes in many matplotlib versions
    ax = fig.add_subplot(1, 1, 1, projection='3d')
    ax.scatter(X[:, 0], X[:, 1], X[:, 2], color='r')
    ax.scatter(test[:, 0], test[:, 1], pred_mean)
    plt.show()

(figure: red = the 900 data points, blue = GP predicted mean on the 30 × 30 grid)
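The variance that was "hard to see" doesn't have to be dropped entirely; one alternative (my addition, not in the article) is to color the predicted grid points by the standard deviation that predict already returned:

    # color the predicted grid by the predictive standard deviation
    fig = plt.figure()
    ax = fig.add_subplot(1, 1, 1, projection='3d')
    sc = ax.scatter(test[:, 0], test[:, 1], pred_mean, c=pred_std, cmap='viridis')
    fig.colorbar(sc, label='predictive std')
    plt.show()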

Well, with 900 points it clearly works, and by the formula above the sampling error would be about 3%. So now let's go up from the smallest sample size.

First, 50!

(figure: GP estimate from 50 samples)

Hmmmm... For reference, the sampling error from the earlier formula is about 14% here.

Next, 100!

(figure: GP estimate from 100 samples)

Now it's shaped like a saddle!! As expected: about 10%.

Let's try 200.

(figure: GP estimate from 200 samples)

A little scuffed in places, maybe? But you could ride it just fine! The error is about 7%.

Last one, 400.

(figure: GP estimate from 400 samples)

Eh, doesn't this look worse than 900? (The formula would say about 5%.)
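Instead of eyeballing the plots, one could also score each sample size numerically. Here is a sketch under my own assumptions (RMSE against the noiseless surface z1² − z2² as the metric, random subsampling as above; neither appears in the article):

    # assumed evaluation loop: refit the GP per sample size and measure
    # RMSE of the predicted grid against the true surface z = x^2 - y^2
    true_z = test[:, 0] ** 2 - test[:, 1] ** 2
    for n in [50, 100, 200, 400, 900]:
        idx = np.random.choice(len(X), size=n, replace=False)
        gp = GaussianProcessRegressor(kernel=sk_kern.RBF(1.0, (1e-3, 1e3)),
                                      normalize_y=True, n_restarts_optimizer=5)
        gp.fit(X[idx, :2], X[idx, 2])
        rmse = np.sqrt(np.mean((gp.predict(test) - true_z) ** 2))
        print(n, rmse)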

Summary

I thought the idea of sampling error was interesting! (I'm a little scared someone will get mad at me for using a statistics tag like this.) Things got busy, so the bonus chapter has no rigorous experimental design at all; please forgive me for basically just pasting up figures m(_ _)m
