[Python] About that time I got curious about sampling error

This article is the 21st-day entry of the Furukawa Lab Advent Calendar. It was written by a student at Furukawa Lab as part of their studies, so the content may be imprecise in places and some expressions may be a little off.

Introduction

I'm not familiar with statistics at all, but while I was happily scrolling Twitter I saw a tweet with a ranking image along the lines of "Future dream ranking: we asked 200 junior high school students!" (it may not have been exactly that, lol), and I wondered: can you really tell anything from just 200 people? Apparently someone else thought the same, because one reply said "Isn't 200 people too small a survey?", to which someone else answered "At a 95% confidence level, the sampling error is within 10%." I'd heard of that, and I remembered covering it in a university class, but sadly I couldn't recall the details, so I decided to ask Google-sensei...

Sampling error

Sampling error is the error that arises when you estimate a quantity of the population from a sample. Hmmmm... I do feel like I learned this. I remember there being confidence-interval problems on the exam. This site explains it clearly, and if you actually do the calculation with the earlier example...

\begin{align}
\bar{p} &= 0.5\\
n &= 200\\
1.96\sqrt{\cfrac{\bar{p}(1-\bar{p})}{n}} &= 1.96\sqrt{\cfrac{0.5(1-0.5)}{200}}\\
&= 0.069296\dots
\end{align}

So the sampling error is about 7%. (By the way, with 100 samples it comes out to about 10%. No way!) I thought it's good to keep the idea of sampling error in mind.
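Just to double-check the numbers, here is a minimal sketch of the formula above in Python (the helper name margin_of_error is mine, not from any library):

    import numpy as np

    # 95% margin of error for an estimated proportion, at the worst case p = 0.5
    def margin_of_error(n, p=0.5, z=1.96):
        return z * np.sqrt(p * (1 - p) / n)

    for n in [50, 100, 200, 400, 900]:
        print(n, f"{margin_of_error(n):.1%}")
    # 50 13.9%, 100 9.8%, 200 6.9%, 400 4.9%, 900 3.3%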

The bonus experiment starts here (it's short, lol)

I've heard that a completely randomly drawn sample is fine... So let's prepare saddle-shaped data as the population and draw random samples from it. How many samples do we need to estimate the original saddle shape?

(figure: saddle-shaped data used as the population)

    import numpy as np

    map_size = 60

    def create_data(nb_samples, input_dim=3, retdim=False):
        # two latent coordinates, uniform on [-map_size/2, map_size/2]
        latent_dim = 2
        z1 = np.random.rand(nb_samples) * map_size - map_size / 2
        z2 = np.random.rand(nb_samples) * map_size - map_size / 2

        # embed in 3-D: the third axis is the saddle surface z1^2 - z2^2
        x = np.zeros((nb_samples, input_dim))
        x[:, 0] = z1
        x[:, 1] = z2
        x[:, 2] = z1 ** 2 - z2 ** 2

        if retdim:
            return x, latent_dim
        else:
            return x

    # parameters to make data
    x_sigma = 0.1
    nb_samples = 900
    seed = 1

    # make data (add isotropic Gaussian observation noise)
    np.random.seed(seed)
    X = create_data(nb_samples)
    X += np.random.normal(0, x_sigma, X.shape)
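One note: the experiment below refits with 50, 100, 200, and 400 points. The article doesn't show how those subsets were drawn, but a simple way (my assumption) is to pick rows of X at random without replacement, as sketched here:

    # assumed subsampling step (not shown in the article):
    # draw n of the 900 rows uniformly at random, without replacement
    n = 200
    idx = np.random.choice(len(X), size=n, replace=False)
    X_sub = X[idx]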

This creates a saddle shape like the one above! Starting with 900 samples, let's estimate 30 × 30 grid points with Gaussian process regression: estimate the Z axis from the X and Y coordinates! I plotted only the predicted mean with blue dots, since drawing the variance as well would be hard to see!

    from matplotlib import pyplot as plt
    from sklearn.gaussian_process import GaussianProcessRegressor
    from sklearn.gaussian_process import kernels as sk_kern

    # regress the Z axis on the (X, Y) coordinates
    x_input = X[:, :2]
    y_input = X[:, 2]

    kernel = sk_kern.RBF(1.0, (1e-3, 1e3))  # + sk_kern.ConstantKernel(1.0, (1e-3, 1e3)) + sk_kern.WhiteKernel()
    clf = GaussianProcessRegressor(
        kernel=kernel,
        alpha=1e-10,
        optimizer="fmin_l_bfgs_b",
        n_restarts_optimizer=20,
        normalize_y=True)

    clf.fit(x_input, y_input)
    clf.kernel_  # fitted kernel (hyperparameters after optimization)

    # 30 x 30 grid of node points to predict on
    test_x = np.linspace(-map_size / 2, map_size / 2, 30)
    test = np.meshgrid(test_x, test_x)
    test = np.dstack(test).reshape(-1, 2)

    pred_mean, pred_std = clf.predict(test, return_std=True)

    fig = plt.figure()
    # note: aspect='equal' raises an error on 3-D axes in many matplotlib versions
    ax = fig.add_subplot(1, 1, 1, projection='3d')
    ax.scatter(X[:, 0], X[:, 1], X[:, 2], color='r')
    ax.scatter(test[:, 0], test[:, 1], pred_mean)
    plt.show()

(figure: red = the 900 data points, blue = GP predicted mean on the 30 × 30 grid)
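The variance that was "hard to see" doesn't have to be dropped entirely; one alternative (my addition, not in the article) is to color the predicted grid points by the standard deviation that predict already returned:

    # color the predicted grid by the predictive standard deviation
    fig = plt.figure()
    ax = fig.add_subplot(1, 1, 1, projection='3d')
    sc = ax.scatter(test[:, 0], test[:, 1], pred_mean, c=pred_std, cmap='viridis')
    fig.colorbar(sc, label='predictive std')
    plt.show()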

Well, with 900 points it clearly works, and by the formula above the sampling error would be about 3%. So now let's go up from the smallest sample size.

First, 50!

(figure: GP estimate from 50 samples)

Hmmmm... For reference, the sampling error from the earlier formula is about 14% here.

Next, 100!

(figure: GP estimate from 100 samples)

Now it's shaped like a saddle!! As expected: about 10%.

Let's try 200.

(figure: GP estimate from 200 samples)

A little scuffed in places, maybe? But you could ride it just fine! The error is about 7%.

Last one, 400.

(figure: GP estimate from 400 samples)

Eh, doesn't this look worse than 900? (The formula would say about 5%.)
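Instead of eyeballing the plots, one could also score each sample size numerically. Here is a sketch under my own assumptions (RMSE against the noiseless surface z1² − z2² as the metric, random subsampling as above; neither appears in the article):

    # assumed evaluation loop: refit the GP per sample size and measure
    # RMSE of the predicted grid against the true surface z = x^2 - y^2
    true_z = test[:, 0] ** 2 - test[:, 1] ** 2
    for n in [50, 100, 200, 400, 900]:
        idx = np.random.choice(len(X), size=n, replace=False)
        gp = GaussianProcessRegressor(kernel=sk_kern.RBF(1.0, (1e-3, 1e3)),
                                      normalize_y=True, n_restarts_optimizer=5)
        gp.fit(X[idx, :2], X[idx, 2])
        rmse = np.sqrt(np.mean((gp.predict(test) - true_z) ** 2))
        print(n, rmse)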

Summary

I thought the idea of sampling error was interesting! (I'm a little scared someone will get mad at me for using a statistics tag like this.) Things got busy, so the bonus chapter has no rigorous experimental design at all; please forgive me for basically just pasting up figures m(_ _)m
