[PYTHON] [Machine learning] Supervised learning using kernel density estimation

Supervised learning using kernel density estimation

This article was written by a beginner in machine learning. Please keep that in mind.

An example actually used is here. The specific background of the idea and the revised content are here.

What is kernel density estimation?

Frankly, it is faster to look at [Wikipedia](https://ja.wikipedia.org/wiki/%E3%82%AB%E3%83%BC%E3%83%8D%E3%83%AB%E5%AF%86%E5%BA%A6%E6%8E%A8%E5%AE%9A).

Imagine a simple histogram. Where the histogram is high, values are ***relatively likely to occur***, and where it is low, values are ***relatively unlikely to occur***. Have you heard a similar story somewhere before?

This is the same idea as a probability density function. A histogram is, in a sense, an estimate of ***the true probability density function*** made from ***measured values***. ***Kernel density estimation*** is a method that uses kernel functions to produce a smoother, continuous estimate.
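To make the comparison concrete, here is a minimal sketch that builds both a normalised histogram and a kernel density estimate from the same samples. The data (500 draws from a standard normal distribution), the bin count, and the evaluation grid are assumptions chosen only for illustration.

```python
import numpy as np
from scipy.stats import gaussian_kde

# Hypothetical sample: 500 draws from a standard normal distribution
rng = np.random.default_rng(0)
samples = rng.normal(loc=0.0, scale=1.0, size=500)

# Histogram normalised so the bar areas sum to 1
# (a piecewise-constant density estimate)
hist_density, bin_edges = np.histogram(samples, bins=20, density=True)

# Kernel density estimate: a smooth curve that can be evaluated anywhere
kde = gaussian_kde(samples)
grid = np.linspace(-4.0, 4.0, 201)
kde_density = kde(grid)

# Both estimates integrate to roughly 1, as a density should
hist_area = np.sum(hist_density * np.diff(bin_edges))
kde_area = np.sum(kde_density) * (grid[1] - grid[0])
print(hist_area, kde_area)
```

The histogram only gives one value per bin, while the KDE can be evaluated at any point on the grid.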

What is supervised learning?

See [Wikipedia](https://ja.wikipedia.org/wiki/%E6%95%99%E5%B8%AB%E3%81%82%E3%82%8A%E5%AD%A6%E7%BF%92) or read someone else's Qiita article.

Kernel density estimation and supervised learning

A "teacher" in supervised learning is a set of "data" and "correct labels".

Consider a dataset whose correct labels are 0, 1, and 2. Split it into the label-0 data, the label-1 data, and the label-2 data. Running kernel density estimation on the training data whose correct label is 0 gives an estimate of the probability density function of the data given that the label is 0.

Estimate a probability density function for every label from the training data, then evaluate each one at the test data. Each test point is classified by which label gives the largest value. That is what this article attempts.

Strictly speaking, we should also take into account the proportion of each label in the population (the prior probability) ... I would like to cover that more difficult topic another time.
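As a rough sketch of what accounting for label proportions could look like, the snippet below applies Bayes' rule to made-up numbers. The density values and the label proportions are purely hypothetical; in practice the densities would come from the per-label kernel density estimates.

```python
import numpy as np

# Hypothetical class-conditional densities p(x | label):
# rows are 3 test points, columns are labels 0, 1, 2
densities = np.array([
    [0.30, 0.05, 0.01],
    [0.02, 0.20, 0.15],
    [0.01, 0.10, 0.12],
])

# Hypothetical label proportions P(label); they must sum to 1
priors = np.array([0.2, 0.7, 0.1])

# Bayes' rule: P(label | x) is proportional to p(x | label) * P(label)
joint = densities * priors
posteriors = joint / joint.sum(axis=1, keepdims=True)

pred_without_prior = densities.argmax(axis=1)
pred_with_prior = posteriors.argmax(axis=1)
print(pred_without_prior)  # [0 1 2]
print(pred_with_prior)     # [0 1 1]
```

With these numbers the third point switches from label 2 to label 1 once the proportions are included, because label 1 is assumed to be much more common. When the labels are evenly balanced, as in a stratified split of iris, comparing the raw densities gives the same ranking.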

Let's implement it for the time being

The world is a wonderful place, because kernel density estimation with a Gaussian kernel is already implemented in SciPy.

How to use Gaussian KDE

Here is a brief summary of how to use SciPy's gaussian_kde.

Kernel density estimation

kernel = gaussian_kde(X, bw_method=None, weights=None)

- X: Data set used for kernel density estimation.
- bw_method: Method used to determine the kernel bandwidth. If not specified, Scott's rule ("scott") is used.
- weights: Weights of the data points. If not specified, all points are weighted equally.
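The options for bw_method can be tried out as in the sketch below. SciPy accepts "scott", "silverman", a scalar, or a callable here; the data set is a hypothetical 1-D sample made up for this example.

```python
import numpy as np
from scipy.stats import gaussian_kde

rng = np.random.default_rng(0)
X = rng.normal(size=200)  # hypothetical 1-D data set

kde_default = gaussian_kde(X)                        # bw_method=None uses Scott's rule
kde_scott = gaussian_kde(X, bw_method="scott")
kde_silverman = gaussian_kde(X, bw_method="silverman")
kde_scalar = gaussian_kde(X, bw_method=0.5)          # a scalar is used directly as the factor

# kde.factor is the bandwidth factor; the kernel covariance is the
# covariance of the data scaled by factor**2
print(kde_default.factor, kde_scott.factor,
      kde_silverman.factor, kde_scalar.factor)
```

A larger factor gives a smoother estimate, and a smaller one follows the individual data points more closely.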

Calculate the probability density

Enter new data into the estimated probability density function to calculate the probability density at those points.

pd = kernel.evaluate(Z)

- Z: Data point(s) at which you want to calculate the probability density.

The result is returned as a NumPy array containing the probability density at each point in Z.
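A small sketch of the call and its return value, using hypothetical 1-D data. It also shows that the returned values are densities rather than probabilities: a probability is obtained by integrating the density over an interval, for which gaussian_kde provides integrate_box_1d.

```python
import numpy as np
from scipy.stats import gaussian_kde

rng = np.random.default_rng(0)
X = rng.normal(size=300)          # hypothetical 1-D training data
kernel = gaussian_kde(X)

Z = np.array([-3.0, 0.0, 3.0])    # points at which to evaluate the density
density = kernel.evaluate(Z)      # equivalent to kernel(Z)
print(type(density), density.shape)

# Probability of falling in [-1, 1] under the estimated density
prob_within_one = kernel.integrate_box_1d(-1.0, 1.0)
print(prob_within_one)
```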

Try supervised learning

Let's try it with scikit-learn's iris dataset!

The flow is as follows: load the iris dataset → split it into training and test data with train_test_split → standardize the training and test data → run kernel density estimation for each label using the training data → calculate the probability density of the test data under each label → output the label with the largest value

↓ Script ↓

from sklearn import datasets
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from scipy.stats import gaussian_kde

# Loading iris dataset
iris = datasets.load_iris()
X = iris.data
y = iris.target

# Division of training data and test data
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=1, stratify=y)

# Standardization
sc = StandardScaler()
sc = sc.fit(X_train)
X_train_std = sc.transform(X_train)
X_test_std = sc.transform(X_test)

# Kernel density estimation
kernel0 = gaussian_kde(X_train_std[y_train==0].T)
kernel1 = gaussian_kde(X_train_std[y_train==1].T)
kernel2 = gaussian_kde(X_train_std[y_train==2].T)

# Calculate the probability density of test data
p0s = kernel0.evaluate(X_test_std.T)
p1s = kernel1.evaluate(X_test_std.T)
p2s = kernel2.evaluate(X_test_std.T)

# Prediction label output
y_pred = []
for p0, p1, p2 in zip(p0s, p1s, p2s):
    if max(p0, p1, p2) == p0:
        y_pred.append(0)
    elif max(p0, p1, p2) == p1:
        y_pred.append(1)
    else:
        y_pred.append(2)

Precautions for standardization

The test data is standardized with the mean and standard deviation of the training data. If the two sets were standardized separately, they would be shifted and scaled differently, and the same raw value would no longer map to the same standardized value.
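The difference is easy to see on a tiny made-up example. The one-feature arrays below are hypothetical and chosen so that the numbers can be checked by hand.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical one-feature data
X_train = np.array([[1.0], [2.0], [3.0], [4.0]])
X_test = np.array([[2.0], [6.0]])

# Recommended: reuse the scaler fitted on the training data
sc = StandardScaler().fit(X_train)
X_test_shared = sc.transform(X_test)

# Not recommended: fit a separate scaler on the test data
X_test_separate = StandardScaler().fit_transform(X_test)

print(X_test_shared.ravel())
print(X_test_separate.ravel())  # [-1.  1.]
```

With the shared scaler, the value 6.0 ends up about three standard deviations above the training mean, which is what the density estimate should see. With a separate scaler it becomes 1.0 and looks like an ordinary point.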

Precautions for kernel density estimation

gaussian_kde expects the dataset in the shape (number of dimensions, number of data points), so ***each column vector is treated as one data point***. In the iris dataset, however, ***each row vector is one data point***, so the data is transposed before being passed in. The same applies when calculating the probability density of the test data.
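The shapes can be checked with the d and n attributes of the fitted object. The data here is a hypothetical random array laid out like iris, with one sample per row.

```python
import numpy as np
from scipy.stats import gaussian_kde

rng = np.random.default_rng(0)
samples = rng.normal(size=(50, 4))    # hypothetical data: 50 samples (rows), 4 features (columns)

kde = gaussian_kde(samples.T)         # transpose to shape (4, 50)
print(kde.d, kde.n)                   # 4 50 (dimensions, data points)

density = kde.evaluate(samples[:5].T) # evaluation points also need shape (4, m)
print(density.shape)                  # (5,)
```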

Prediction label output

y_pred = []
for p0, p1, p2 in zip(p0s, p1s, p2s):
    if max(p0, p1, p2) == p0:
        y_pred.append(0)
    elif max(p0, p1, p2) == p1:
        y_pred.append(1)
    else:
        y_pred.append(2)

The probability density of the test data under each label is stored in p0s, p1s, and p2s. Take out one value from each at a time and output:

- 0 if the value for label 0 is the maximum
- 1 if the value for label 1 is the maximum
- 2 otherwise

The results are stored in the list y_pred in the order of the test data.
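The same selection can also be written with NumPy's argmax instead of a loop. This is just an alternative sketch; the density values below are hypothetical stand-ins for p0s, p1s, and p2s.

```python
import numpy as np

# Hypothetical density values for 4 test points under each label's KDE
p0s = np.array([0.9, 0.1, 0.2, 0.0])
p1s = np.array([0.1, 0.5, 0.3, 0.1])
p2s = np.array([0.0, 0.4, 0.6, 0.7])

# Stack into shape (3, n_test) and take the row index of the
# largest density in each column
y_pred = np.argmax(np.vstack([p0s, p1s, p2s]), axis=0)
print(y_pred)  # [0 1 2 2]
```

On ties, argmax returns the first maximum, so label 0 wins over label 1 and label 1 over label 2, the same as in the if/elif version.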

Results

Let's check the accuracy of the predicted labels with scikit-learn's accuracy_score. My heart is pounding.

from sklearn.metrics import accuracy_score
print(accuracy_score(y_test, y_pred))
 1.0

Hooray.
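A single split can be lucky, so one way to check the result is to repeat the same procedure with several values of random_state. The sketch below wraps the steps above in a function; the function name kde_accuracy and the choice of seeds are my own and only for illustration.

```python
import numpy as np
from scipy.stats import gaussian_kde
from sklearn import datasets
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler


def kde_accuracy(random_state):
    """Run the KDE classifier from this article on one train/test split."""
    iris = datasets.load_iris()
    X_train, X_test, y_train, y_test = train_test_split(
        iris.data, iris.target, test_size=0.3,
        random_state=random_state, stratify=iris.target)

    # Standardize with the statistics of the training data only
    sc = StandardScaler().fit(X_train)
    X_train_std = sc.transform(X_train)
    X_test_std = sc.transform(X_test)

    labels = np.unique(y_train)
    # One KDE per label, each evaluated at every test point
    densities = np.vstack([
        gaussian_kde(X_train_std[y_train == label].T).evaluate(X_test_std.T)
        for label in labels
    ])
    y_pred = labels[np.argmax(densities, axis=0)]
    return accuracy_score(y_test, y_pred)


# Hypothetical choice of seeds; any integers would do
accuracies = [kde_accuracy(seed) for seed in range(5)]
print(accuracies)
```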

Finally

I tried using the result of kernel density estimation as a classifier for supervised learning. In practice, this technique is rarely used. I think its drawbacks are that the amount of computation is large and that accuracy can drop significantly depending on the data. Still, as this trial shows, some data can be classified fairly quickly and cleanly.

Continue to Part 2
