This is the day-21 entry in the Furukawa Lab Advent_calendar. It was written by a student at Furukawa Lab as part of his studies, so some of the content may be ambiguous and some of the wording imprecise.
I'm not familiar with statistics at all, but while I was happily browsing Twitter I saw a tweet with a ranking image along the lines of "We asked 200 junior high school students: future dream ranking!" (it may not have been exactly that, lol). Can you really learn anything from 200 people? That was my question. Did anyone else wonder the same thing? One reply said "Isn't a survey of only 200 people too small?", but the reply to that was "At a 95% confidence level, the sampling error comes out within 10%." I had heard of this and remembered covering it in a university class, but unfortunately I couldn't recall the details, so I decided to ask Professor Google ...
Sampling error is the error that comes with estimating a value for the population from a sample. Hmm ... I do feel like I learned that. I also remembered that there was a confidence interval question on the exam. This site is easy to understand, and when I actually ran the calculation for the example above ...
It seems that the sampling error is about 7%. (By the way, 100 samples gives about 10%. Really?!) I came away thinking that the idea of sampling error is worth keeping in mind.
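The figures quoted here (about 7% for 200 people, about 10% for 100) are what the usual 95% margin of error for a proportion gives, z * sqrt(p * (1 - p) / n), with z = 1.96 and the worst case p = 0.5. Assuming that is the formula being used, a minimal sketch looks like this. The function names `margin_of_error` and `required_sample_size` are made up for illustration.

```python
import math

Z_95 = 1.96  # normal quantile for a 95% confidence level


def margin_of_error(n, p=0.5, z=Z_95):
    """Half-width of the confidence interval for a proportion.

    p = 0.5 is an assumption: it is the worst case (largest error).
    """
    return z * math.sqrt(p * (1.0 - p) / n)


def required_sample_size(target_error, p=0.5, z=Z_95):
    """Smallest n whose margin of error is at most target_error."""
    return math.ceil(z ** 2 * p * (1.0 - p) / target_error ** 2)


if __name__ == "__main__":
    # The sample sizes that appear in this article
    for n in (50, 100, 200, 400, 900):
        print(f"n = {n:3d}: sampling error = {margin_of_error(n):.1%}")
    print("n needed for a 10% error:", required_sample_size(0.10))
```

Rounded to whole percentages, this gives 14%, 10%, 7%, 5% and 3% for 50, 100, 200, 400 and 900 samples. Note that the error shrinks with the square root of n, so quadrupling the sample only halves it.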
I have heard that a sample is fine as long as it is selected completely at random ... So take saddle-shaped data as the population and draw a random sample from it. How many points do you need to estimate the original saddle shape?
import numpy as np

map_size = 60

def create_data(nb_samples, input_dim=3, retdim=False):
    # two latent coordinates, uniform on [-map_size/2, map_size/2)
    latent_dim = 2
    z1 = np.random.rand(nb_samples) * map_size - map_size/2
    z2 = np.random.rand(nb_samples) * map_size - map_size/2
    x = np.zeros((nb_samples, input_dim))
    x[:, 0] = z1
    x[:, 1] = z2
    x[:, 2] = z1 ** 2 - z2 ** 2  # saddle surface
    if retdim:
        return x, latent_dim
    return x

# parameters to make data
x_sigma = 0.1
nb_samples = 900
seed = 1
np.random.seed(seed)  # fix the random state so the sample is reproducible

# make data
X = create_data(nb_samples)
X += np.random.normal(0, x_sigma, X.shape)  # Gaussian noise on every coordinate
This creates a saddle shape like this! Start with 900 samples and try to estimate 30 * 30 node points using Gaussian process regression, predicting the Z axis from the X and Y coordinates. I plotted only the predictive mean, as blue dots, because the figure would be hard to read with the variance drawn in as well.
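from matplotlib import pyplot as plt
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process import kernels as sk_kern

# inputs are the (x, y) coordinates, the target is the z coordinate
x_input = X[:,:2]
y_input = X[:,2]

kernel = sk_kern.RBF(1.0, (1e-3, 1e3)) # + sk_kern.ConstantKernel(1.0,(1e-3,1e3)) + sk_kern.WhiteKernel()
clf = GaussianProcessRegressor(
    kernel=kernel,
    normalize_y=True)
clf.fit(x_input, y_input)

# 30 * 30 grid of node points, flattened to shape (900, 2) for predict()
test_x = np.linspace(-map_size/2,map_size/2,30)
xx, yy = np.meshgrid(test_x,test_x)
test = np.column_stack([xx.ravel(), yy.ravel()])
pred_mean, pred_std = clf.predict(test, return_std=True)

fig = plt.figure()
ax = fig.add_subplot(1, 1, 1, projection='3d')
ax.scatter(test[:, 0], test[:, 1], pred_mean, c='b')  # predictive mean as blue dots
plt.show()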
Well, with 900 points it works, of course. Calculated with the formula above, the error for 900 samples is about 3%.
From here on, I go through the sample sizes, starting with the smallest.
First 50!
Hmmmm ...
By the way, the sampling error in the previous formula is about 14%.
Next is 100!
It's shaped like a saddle!! That's 10% for you.
Let's try 200
It looks like it might chafe a little, but it's more than good enough to ride!
The error is 7%
Last one: 400
Eh, isn't this inferior to 900?
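To compare the fits by more than eye, one option is to measure how far the predicted mean is from the noise-free saddle on the 30 * 30 grid. The following is only a rough sketch under my own assumptions, not the experiment behind the figures: `grid_rmse` and `saddle` are hypothetical helpers, the random seed is arbitrary, and the kernel adds the `WhiteKernel` term (which appears only in a comment in the code above) so that the noise level is learned.

```python
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process import kernels as sk_kern

map_size = 60
x_sigma = 0.1


def saddle(z1, z2):
    """Noise-free population surface."""
    return z1 ** 2 - z2 ** 2


def grid_rmse(nb_samples, rng, grid_points=30):
    """Fit a GP on nb_samples noisy points and return the RMSE of the
    predicted mean against the true saddle on a grid_points^2 grid."""
    z = rng.uniform(-map_size / 2, map_size / 2, size=(nb_samples, 2))
    X = np.column_stack([z[:, 0], z[:, 1], saddle(z[:, 0], z[:, 1])])
    X += rng.normal(0, x_sigma, X.shape)  # noise on every coordinate

    # Assumption: RBF plus a learned noise term, not the article's exact kernel
    kernel = sk_kern.RBF(1.0, (1e-3, 1e3)) + sk_kern.WhiteKernel()
    gpr = GaussianProcessRegressor(kernel=kernel, normalize_y=True)
    gpr.fit(X[:, :2], X[:, 2])

    axis = np.linspace(-map_size / 2, map_size / 2, grid_points)
    xx, yy = np.meshgrid(axis, axis)
    grid = np.column_stack([xx.ravel(), yy.ravel()])
    pred_mean = gpr.predict(grid)
    truth = saddle(grid[:, 0], grid[:, 1])
    return float(np.sqrt(np.mean((pred_mean - truth) ** 2)))


if __name__ == "__main__":
    rng = np.random.default_rng(1)  # arbitrary seed
    for n in (50, 100, 200, 400, 900):
        print(f"n = {n:3d}: grid RMSE = {grid_rmse(n, rng):.3f}")
```

A single run like this depends on the particular random draw, so repeating it with several seeds would give a fairer picture. Keep in mind that the sampling error percentages above are for survey proportions and do not translate directly into regression error.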
I thought the idea of sampling error was interesting! (I'm a little scared that someone will get angry at me for putting the statistics tag on a post like this.) I have been a bit busy, and this side experiment has no rigorous experimental design at all, so please take it with good humor that it amounts to little more than posting some figures m(_ _)m