What is Kernel Density Estimation (KDE)?

See [Wikipedia](http://en.wikipedia.org/wiki/%E3%82%AB%E3%83%BC%E3%83%8D%E3%83%AB%E5%AF%86%E5%BA%A6%E6%8E%A8%E5%AE%9A) for the details. In some situations (a large number of data points, data that follow a smooth distribution function, and so on), KDE gives a clearer overview of the data than a histogram does.
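As a rough illustration of the definition, a Gaussian KDE places a normal bump on every data point and averages the bumps. The sketch below is not part of the original article: the sample data and the names `manual_kde`, `grid`, and `h` are made up for illustration, and the bandwidth is borrowed from scipy's default (Scott's rule) so that the hand-written estimate can be compared with `gaussian_kde`.

```python
import numpy as np
from scipy.stats import gaussian_kde

rng = np.random.default_rng(0)
# Hypothetical sample: a mixture of two normal distributions.
data = np.concatenate([rng.normal(100, 30, 500), rng.normal(200, 30, 500)])

kde = gaussian_kde(data)
# For 1-D data, scipy's bandwidth is factor * sample standard deviation (ddof=1).
h = kde.factor * data.std(ddof=1)


def manual_kde(x, samples, h):
    """Average of Gaussian kernels of width h centred on each sample."""
    z = (x[:, None] - samples[None, :]) / h
    kernels = np.exp(-0.5 * z**2) / np.sqrt(2 * np.pi)
    return kernels.sum(axis=1) / (len(samples) * h)


grid = np.linspace(data.min(), data.max(), 50)
manual = manual_kde(grid, data, h)

# Check whether the hand-written estimate agrees with scipy's.
print(np.allclose(manual, kde(grid)))
```

The explicit double loop is hidden in the broadcasting step, so this version uses memory proportional to the grid size times the sample size; it is meant for reading, not for large datasets.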

Create some sample data

First, load the required packages and create five bimodal datasets, each made by superimposing two normal distributions.

import numpy as np
from scipy.stats import gaussian_kde
import matplotlib.pyplot as plt

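N = 5  # number of datasets

# Random mean, standard deviation, and sample count for the two components of each dataset
means = np.random.randn(N, 2) * 10 + np.array([100, 200])
stdev = np.random.randn(N, 2) * 10 + 30
count = (np.random.randn(N, 2) * 10000 + 50000).astype(np.int64)

# Concatenate the two normal components of each dataset into one 1-D sample
a = [
    np.hstack([
        np.random.randn(count[i, j]) * stdev[i, j] + means[i, j]
        for j in range(2)])
    for i in range(N)]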

Estimate the distribution from the data and draw a graph.

Extreme outliers get in the way, so the data are clipped to the range between the 0.1% and 99.9% quantiles. (As an aside, numpy's percentile(array, x) takes x in the range 0..100, whereas pandas' Series.quantile(x) takes x in the range 0..1, which is confusing.)
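To make the difference in conventions concrete, here is a small sketch that computes the same upper cut-off in each style. The sample `x` is hypothetical, and `np.quantile` is included only as a third point of comparison; it is not used in the original article.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
x = rng.normal(size=10_000)  # hypothetical sample

upper_np = np.percentile(x, 99.9)        # numpy: percent, 0..100
upper_pd = pd.Series(x).quantile(0.999)  # pandas: fraction, 0..1
upper_q = np.quantile(x, 0.999)          # numpy's fraction-based variant, 0..1

print(upper_np, upper_pd, upper_q)
```

All three calls use linear interpolation by default, so they should agree up to floating-point rounding.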

Then pass the data to scipy.stats.gaussian_kde(), which returns the density function estimated with a Gaussian kernel. Evaluating that function on a grid made with numpy.linspace gives a quick plot.

limmin = min(np.percentile(x, 0.1) for x in a)
limmax = max(np.percentile(x, 99.9) for x in a)
ls = np.linspace(limmin, limmax, 100)

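for n in range(N):
    x = a[n]
    # Drop the points outside the shared 0.1%..99.9% range
    x = x[(x > limmin) & (x < limmax)]
    # Fit the KDE and evaluate the estimated density on the grid
    kde = gaussian_kde(x)
    plt.plot(ls, kde(ls), label='data %d' % n)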

plt.xlim([limmin, limmax])
plt.legend()
plt.title('data distributions')
plt.show()
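How smooth the plotted curves look depends on the kernel bandwidth, which gaussian_kde chooses automatically (Scott's rule by default). As a hedged sketch that goes slightly beyond the original article, the `bw_method` argument can override that choice. The sample data and the factors 0.05 and 0.5 below are arbitrary values picked for illustration.

```python
import numpy as np
from scipy.stats import gaussian_kde

rng = np.random.default_rng(0)
# Hypothetical bimodal sample
x = np.concatenate([rng.normal(100, 30, 2000), rng.normal(200, 30, 2000)])

kde_default = gaussian_kde(x)                 # bandwidth factor from Scott's rule
kde_narrow = gaussian_kde(x, bw_method=0.05)  # small factor: wigglier curve
kde_wide = gaussian_kde(x, bw_method=0.5)     # large factor: smoother curve

low, high = x.min() - 500, x.max() + 500
areas = {}
for name, k in [("default", kde_default), ("narrow", kde_narrow), ("wide", kde_wide)]:
    # Whatever the bandwidth, the estimate is a density, so its integral
    # over a wide enough interval should be close to 1.
    areas[name] = k.integrate_box_1d(low, high)
    print(name, k.factor, areas[name])
```

Plotting `kde_narrow(ls)` and `kde_wide(ls)` on the same axes as the default is a quick way to judge whether the automatic bandwidth suits the data.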

The resample() method of the kde object generates simulated data from a distribution close to that of the measured data. For more information, see the official documentation.
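A minimal sketch of resampling, assuming a made-up "measured" sample rather than the datasets above; the names `measured` and `simulated` are illustrative only.

```python
import numpy as np
from scipy.stats import gaussian_kde

np.random.seed(0)
# Hypothetical "measured" data: a mixture of two normal distributions
measured = np.hstack([np.random.randn(3000) * 30 + 100,
                      np.random.randn(3000) * 30 + 200])

kde = gaussian_kde(measured)
# resample() returns an array of shape (dimensions, size); row 0 is the 1-D sample
simulated = kde.resample(5000)[0]

print(measured.mean(), simulated.mean())
print(measured.std(), simulated.std())
```

The simulated standard deviation tends to come out slightly larger than the measured one, because each resampled point is an original point plus noise drawn from the kernel.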
