For kernel density estimation itself, please refer to [Wikipedia](http://en.wikipedia.org/wiki/%E3%82%AB%E3%83%BC%E3%83%8D%E3%83%AB%E5%AF%86%E5%BA%A6%E6%8E%A8%E5%AE%9A). In some situations (a large number of data points, data that follow a smooth distribution function, etc.), it can give you a better overview of your data than a histogram.
First, load the required packages and create five bimodal datasets, each made by superposing two normal distributions.
import numpy as np
from scipy.stats import gaussian_kde
import matplotlib.pyplot as plt
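N = 5
# Per-dataset parameters of the two normal components (shape: N x 2)
means = np.random.randn(N, 2) * 10 + np.array([100, 200])
stdev = np.random.randn(N, 2) * 10 + 30
count = (np.random.randn(N, 2) * 10000 + 50000).astype(np.int64)
# Each dataset concatenates the samples drawn from its two components
a = [
    np.hstack([
        np.random.randn(count[i, j]) * stdev[i, j] + means[i, j]
        for j in range(2)])
    for i in range(N)]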
Extreme outliers would distort the plot, so the data are trimmed to the range between the 0.1% and 99.9% quantiles.
(As an aside, numpy's percentile(array, x) takes x in the range 0..100, whereas pandas' Series.quantile(x) takes x in the range 0..1, which is confusing.)
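The difference in scale can be checked with a minimal sketch like the following. The random sample and the variable names are assumptions made for illustration, not part of the original data; with the default interpolation settings, both libraries should return the same cut-off values.

```python
import numpy as np
import pandas as pd

# Hypothetical sample, used only to compare the two APIs
rng = np.random.default_rng(0)
x = rng.normal(size=10_000)

# numpy.percentile: q is given in percent (0..100)
lo_np = np.percentile(x, 0.1)
hi_np = np.percentile(x, 99.9)

# numpy.quantile: the same computation, with q given as a fraction (0..1)
lo_q = np.quantile(x, 0.001)

# pandas Series.quantile: q is given as a fraction (0..1)
s = pd.Series(x)
lo_pd = s.quantile(0.001)
hi_pd = s.quantile(0.999)

print(np.isclose(lo_np, lo_pd), np.isclose(hi_np, hi_pd))
```

If you prefer fractions throughout, numpy.quantile avoids the mix-up on the numpy side.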
Then pass the data to scipy.stats.gaussian_kde(), which returns the density function estimated with a Gaussian kernel. Evaluating that function on a grid created with numpy.linspace lets you plot it quickly.
# Common plotting range: the widest 0.1%..99.9% span over all datasets
limmin = min(np.percentile(x, 0.1) for x in a)
limmax = max(np.percentile(x, 99.9) for x in a)
ls = np.linspace(limmin, limmax, 100)
for n in range(N):
    x = a[n]
    x = x[(x > limmin) & (x < limmax)]  # drop values outside the range
    kde = gaussian_kde(x)
    plt.plot(ls, kde(ls), label='data %d' % n)
plt.xlim([limmin, limmax])
plt.legend()
plt.title('data distributions')
plt.show()
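As a rough sanity check that the object returned by gaussian_kde behaves like a density function, the sketch below evaluates it on a grid and confirms that the area under the curve is close to 1. The sample is a hypothetical bimodal dataset, similar in spirit to the ones generated above, and the grid margins are arbitrary choices.

```python
import numpy as np
from scipy.stats import gaussian_kde

# Hypothetical bimodal sample (not the article's data)
rng = np.random.default_rng(0)
sample = np.hstack([rng.normal(100, 30, 5000), rng.normal(200, 30, 5000)])

kde = gaussian_kde(sample)

# The kde object is callable: it returns the estimated density at each point
grid = np.linspace(sample.min() - 100, sample.max() + 100, 2000)
density = kde(grid)

# Riemann-sum approximation of the area under the estimated density
dx = grid[1] - grid[0]
area = density.sum() * dx

# The same integral computed by the kde object itself
box = kde.integrate_box_1d(grid[0], grid[-1])

print(area, box)
```

Both values should come out very close to 1, since the grid extends well beyond the data on both sides.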
If you use the resample() method of the kde object, you can generate simulated data that follow a distribution close to that of the measured data. For more information, see the official documentation.
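A minimal sketch of resample() might look like this. The "measured" data here are a hypothetical stand-in generated from two normal distributions, and the sample size of 1000 is an arbitrary choice.

```python
import numpy as np
from scipy.stats import gaussian_kde

# Hypothetical stand-in for measured data
rng = np.random.default_rng(0)
measured = np.hstack([rng.normal(100, 30, 5000), rng.normal(200, 30, 5000)])

kde = gaussian_kde(measured)

# resample(size) returns an array of shape (d, size);
# for one-dimensional data d is 1, so take the first row
simulated = kde.resample(1000)[0]

# The simulated data should have summary statistics close to the original
print(measured.mean(), simulated.mean())
print(measured.std(), simulated.std())
```

Because the samples are drawn from the smoothed estimate rather than from the data points themselves, their spread tends to be slightly wider than that of the original data.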