Kernel density estimation in Python

What is Kernel Density Estimation (KDE)?

Please refer to [Wikipedia](http://en.wikipedia.org/wiki/%E3%82%AB%E3%83%BC%E3%83%8D%E3%83%AB%E5%AF%86%E5%BA%A6%E6%8E%A8%E5%AE%9A). In some situations (a large number of data points, data that follow a smooth distribution function, etc.), KDE can give you a better overview of your data than a histogram.
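To make the idea concrete, here is a minimal sketch of what a Gaussian KDE computes in one dimension: the average of Gaussian bumps centred on each data point. The sample data, the helper name manual_gaussian_kde, and the bandwidth formula (Scott's rule, which I believe is the default of scipy.stats.gaussian_kde) are assumptions for illustration, not something taken from the text above.

```python
import numpy as np
from scipy.stats import gaussian_kde

rng = np.random.default_rng(0)
# Hypothetical sample: a bimodal mixture of two normal distributions.
x = np.hstack([rng.normal(100, 30, 500), rng.normal(200, 30, 500)])


def manual_gaussian_kde(data, points):
    """Hypothetical helper: average of Gaussian bumps centred on each data point."""
    n = data.size
    # Assumed bandwidth: Scott's rule in 1-D, sample std (ddof=1) times n**(-1/5).
    h = data.std(ddof=1) * n ** (-1 / 5)
    # Standardised distance between every grid point and every data point.
    z = (points[:, None] - data[None, :]) / h
    return np.exp(-0.5 * z ** 2).sum(axis=1) / (n * h * np.sqrt(2 * np.pi))


grid = np.linspace(x.min(), x.max(), 200)
manual = manual_gaussian_kde(x, grid)
scipy_kde = gaussian_kde(x)(grid)

# Largest pointwise difference between the hand-written and the SciPy estimate.
max_abs_diff = np.max(np.abs(manual - scipy_kde))
print(max_abs_diff)
```

If the bandwidth assumption holds, the printed difference should be at floating-point noise level, which shows that gaussian_kde is doing nothing more mysterious than this sum.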

Make appropriate data

First, load the required packages and create five bimodal datasets, each a superposition of two normal distributions.

import numpy as np
from scipy.stats import gaussian_kde
import matplotlib.pyplot as plt

N = 5

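# Parameters per dataset (rows) and per mode (columns): centres near 100 and 200
means = np.random.randn(N, 2) * 10 + np.array([100, 200])
# Standard deviations around 30; abs() guards against a rare negative draw
stdev = np.abs(np.random.randn(N, 2) * 10 + 30)
# Number of samples per mode, around 50000
count = (np.random.randn(N, 2) * 10000 + 50000).astype(np.int64)

# Each dataset concatenates the samples drawn for its two modes
a = [
    np.hstack([
        np.random.randn(count[i, j]) * stdev[i, j] + means[i, j]
        for j in range(2)])
    for i in range(N)]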

Estimate the distribution from the data and draw a graph.

Extreme outliers make the plot hard to read, so the data are cut to the range between the 0.1% and 99.9% quantiles. (As an aside, numpy's percentile(array, x) takes x in the range 0..100, whereas pandas' Series.quantile(x) takes x in the range 0..1, which is confusing.)
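The difference between the two scales can be checked with a small sketch. To keep it numpy-only, it uses np.quantile, which takes the same 0..1 scale as pandas' Series.quantile; the random sample is just an assumption for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=10_000)  # hypothetical sample

# np.percentile: the second argument is a percentage on the 0..100 scale.
p_low = np.percentile(x, 0.1)
p_high = np.percentile(x, 99.9)

# np.quantile: the second argument is a fraction on the 0..1 scale.
q_low = np.quantile(x, 0.001)
q_high = np.quantile(x, 0.999)

# Both calls describe the same cut points, up to floating-point rounding.
same = bool(np.isclose(p_low, q_low) and np.isclose(p_high, q_high))
print(same)
```

Passing 0.1 to a 0..1 function by mistake would cut at the 10% quantile rather than at 0.1%, so it is worth double-checking which scale a function expects.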

Then pass the data to scipy.stats.gaussian_kde() and it returns the density function estimated with a Gaussian kernel, so you can plot it quickly by evaluating it on a grid made with numpy.linspace.

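# Common plotting range: the widest 0.1%..99.9% percentile span over all datasets
limmin = min(np.percentile(x, 0.1) for x in a)
limmax = max(np.percentile(x, 99.9) for x in a)
ls = np.linspace(limmin, limmax, 100)

for n in range(N):
    x = a[n]
    # Drop outliers outside the common range before fitting
    x = x[(x > limmin) & (x < limmax)]
    kde = gaussian_kde(x)  # callable density estimate
    plt.plot(ls, kde(ls), label='data %d' % n)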

plt.xlim([limmin, limmax])
plt.legend()
plt.title('data distributions')
plt.show()

(Figure kde.png: the estimated density curves of the five datasets)
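As a sanity check on the estimated curve, the following sketch confirms that the object returned by gaussian_kde behaves like a probability density, i.e. that it integrates to about 1. The sample data and the 100-unit padding of the integration range are assumptions chosen for illustration.

```python
import numpy as np
from scipy.stats import gaussian_kde

rng = np.random.default_rng(0)
# Hypothetical bimodal sample, similar in spirit to the datasets above.
x = np.hstack([rng.normal(100, 30, 2000), rng.normal(200, 30, 2000)])
kde = gaussian_kde(x)

# Integration range padded well beyond the data so that the tails are covered.
lo, hi = x.min() - 100, x.max() + 100

# Integral of the estimated density over [lo, hi], computed by SciPy.
total = kde.integrate_box_1d(lo, hi)

# The same integral approximated by a simple Riemann sum on a grid.
ls = np.linspace(lo, hi, 1000)
dx = ls[1] - ls[0]
grid_area = kde(ls).sum() * dx

print(total, grid_area)
```

Both numbers should be very close to 1. If the grid were restricted to the clipped range used for plotting, the sum would come out slightly below 1 because part of the tails is left out.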

If you use the resample() method of the kde object, you can generate simulated data whose distribution is close to that of the measured data. For more information, see the official documentation.
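A minimal sketch of resample() might look like the following. The "measured" data here are synthetic and the sample size of 5000 is an arbitrary choice. Note that resample() returns a 2-D array of shape (dimensions, size), so for one-dimensional data the first row holds the samples.

```python
import numpy as np
from scipy.stats import gaussian_kde

rng = np.random.default_rng(0)
# Hypothetical "measured" data: a bimodal mixture of two normal distributions.
measured = np.hstack([rng.normal(100, 30, 2000), rng.normal(200, 30, 2000)])

kde = gaussian_kde(measured)

# Draw new points from the estimated density; the result has shape (1, 5000).
simulated = kde.resample(size=5000)[0]

# The simulated data should roughly reproduce the mean of the measured data.
print(measured.mean(), simulated.mean())
```

Because each resampled point is a measured point plus kernel noise, the simulated data are slightly more spread out than the original, which is usually acceptable for simulation purposes.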
