[PYTHON] [Machine learning] Supervised learning using kernel density estimation Part 3

Supervised learning with kernel density estimation

This article was written by a machine learning beginner; please keep that in mind.

The first article is here. The second article is here.

In this article, I will explain the idea behind the method using mathematical formulas.

Probability density and probability

Kernel density estimation, as we saw, estimates a probability density function using kernel functions. So what exactly is a probability density function?

A probability density function represents ***how likely values are to appear***. ***For an event A, "the value x has a high probability density" means that when event A occurs, the value observed at that time is relatively likely to be x.*** Let's replace "event A" with "label 0". "The probability density of label 0 is high at the value x" means "when some data has label 0, that data is relatively likely to take the value x".

Note here that probability density ≠ probability. A probability density expresses the "relative likelihood of appearance for a specific event", and there is no guarantee that it can be compared directly with the probability density of another event. In the first place, for continuous data the probability of observing any one specific value is zero; a probability can only be computed over a range.

 P(X=x)=0 \\
 P(X \leqq x)=p

By integrating the probability density function over an interval (or a region, in two or more dimensions), you can obtain the probability for that interval. So the probability of landing "near" a particular value can be calculated: the higher the probability density at a value x, the higher the probability that a value "near" x appears, and the lower the density, the lower that probability.
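As a quick numerical check of this "integrate the density to get a probability" idea, here is a small sketch using a standard normal distribution as a stand-in density (a hypothetical example; any density works the same way):

from scipy.integrate import quad
from scipy.stats import norm

x = 1.0
h = 0.1  # half-width of the interval "near" x

# Probability of a value in [x - h, x + h]: integrate the density.
prob, _ = quad(norm.pdf, x - h, x + h)

# The same probability from the CDF, for comparison.
print(prob, norm.cdf(x + h) - norm.cdf(x - h))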

Let me give an example. Suppose that at the value x, the probability density of label 1 is higher than the probability density of label 0. At this time, which of the following is higher?

- the probability that a value "near" x appears when the label is 0
- the probability that a value "near" x appears when the label is 1

This is not rigorous, but intuitively the latter seems higher. We will proceed on the assumption that this intuition is correct.

Let's rewrite "label 0" as "y = 0" and "label 1" as "y = 1". Then the discussion so far is summarized as

 \text{(probability density of label 0 at } x) \leqq \text{(probability density of label 1 at } x) \\
 \Rightarrow \text{(probability of a value near } x \text{ when } y=0) \leqq \text{(probability of a value near } x \text{ when } y=1) \\
 \Rightarrow P(x|y=0) \leqq P(x|y=1)

Strictly speaking, however, each comparison concerns values "near" x rather than exactly x.

Conditional probabilities and Bayes' theorem

For a value x, if

 P(y=0|x) \leqq P(y=1|x)

holds, then it makes sense to assign label 1 rather than label 0. In other words, if we can find the conditional probability of each label ("label 0" or "label 1") under the condition that the value x was drawn, we win.

Let's rewrite this inequality using Bayes' theorem.

 P(y=0|x) \leqq P(y=1|x) \\ \Leftrightarrow
 \frac{P(x|y=0)P(y=0)}{P(x)} \leqq \frac{P(x|y=1)P(y=1)}{P(x)}

The denominator P(x) is common to both sides and can be taken to be positive at any observed x, so it cancels. In the end, if we can show

 P(x|y=0)P(y=0) \leqq P(x|y=1)P(y=1)

then

 P(y=0|x) \leqq P(y=1|x)

follows. Now, from the kernel density estimation we already know (for our example point x) that

 P(x|y=0) \leqq P(x|y=1)

Therefore, if we can also find the values of P(y=0) and P(y=1), the matter is settled.
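As a minimal sketch of this decision rule, here is what the comparison P(x|y=0)P(y=0) versus P(x|y=1)P(y=1) looks like with made-up one-dimensional samples and placeholder priors (the next section explains how the priors are estimated from data):

import numpy as np
from scipy.stats import gaussian_kde

rng = np.random.default_rng(0)
kde0 = gaussian_kde(rng.normal(0.0, 1.0, 100))  # estimate of P(x|y=0)
kde1 = gaussian_kde(rng.normal(2.0, 1.0, 100))  # estimate of P(x|y=1)
p0, p1 = 0.5, 0.5                               # placeholder priors

x = 1.2
score0 = kde0.evaluate([x])[0] * p0
score1 = kde1.evaluate([x])[0] * p1
print("assign label 1" if score0 <= score1 else "assign label 0")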

Label ratio estimation

P(y=0) can be interpreted as "the probability that a randomly selected piece of data has label 0".

Finding the population values of P(y=0) and P(y=1) is not easy, so we use the label proportions in the training data as estimates of P(y=0) and P(y=1). For example, if 40 out of 100 training samples have label 0 and 60 have label 1, we estimate

 P(y=0)=0.4 \\
 P(y=1)=0.6

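In code, this estimate is a single np.unique call; a small sketch matching the 40/60 example above:

import numpy as np

y = np.array([0] * 40 + [1] * 60)
labels, counts = np.unique(y, return_counts=True)
print(counts / counts.sum())  # -> [0.4 0.6]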

In fact, the classifier I implemented in the previous articles ignored the effect of P(y=0) and P(y=1). It made the strong implicit assumption that every label appears in equal proportion; under that assumption the priors cancel, and comparing P(x|y=0)P(y=0) with P(x|y=1)P(y=1) reduces to comparing P(x|y=0) with P(x|y=1).

Reimplementation

Based on the explanation so far, let's reimplement the classifier as a class.

import numpy as np
from scipy.stats import gaussian_kde

class GKDEClassifier(object):

    def __init__(self, bw_method="scotts_factor", weights=None):
        # kernel bandwidth
        self.bw_method = bw_method
        # kernel weights
        self.weights = weights

    def fit(self, X, y):
        # number of distinct labels in y
        self.y_num = len(np.unique(y))
        # label values and their ratios in the training data
        self.label, y_count = np.unique(y, return_counts=True)
        self.y_rate = y_count / y_count.sum()
        # list holding the estimated probability density functions
        self.kernel_ = []
        # estimate and store one density per label
        for i in range(self.y_num):
            kernel = gaussian_kde(X[y == self.label[i]].T)
            self.kernel_.append(kernel)
        return self

    def predict(self, X):
        # list that stores the predicted labels
        pred = []
        # ndarray holding the per-label scores of the test data
        self.p_ = np.empty([self.y_num, len(X)])
        # evaluate each label's estimated density at the test points
        for i in range(self.y_num):
            self.p_[i] = self.kernel_[i].evaluate(X.T)
        # multiply in the label ratios (the estimated priors)
        for j in range(self.y_num):
            self.p_[j] = self.p_[j] * self.y_rate[j]
        # assign each sample the label with the highest score
        for k in range(len(X)):
            pred.append(self.label[np.argmax(self.p_.T[k])])
        return pred
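As a quick sanity check, here is a minimal usage sketch on made-up two-dimensional data (the data and labels are hypothetical; string labels are used deliberately, anticipating the changes explained below):

X = np.vstack([np.random.normal(0.0, 1.0, (40, 2)),
               np.random.normal(2.0, 1.0, (60, 2))])
y = np.array(["cat"] * 40 + ["dog"] * 60)  # string labels also work

clf = GKDEClassifier().fit(X, y)
print(clf.predict(X[:5]))  # predicted labels for the first five samples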

The added and modified parts are explained below.

Label ratio calculation

self.label, y_count = np.unique(y, return_counts=True)
self.y_rate = y_count/y_count.sum()

I added the calculation of the training data's label ratios to the fit method. Dividing the per-label counts y_count by their total makes y_rate sum to 1. In fact, using y_count as-is without dividing would not change the predictions: multiplying every score by the same positive constant does not change which one is largest, as the small check below shows.
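A tiny check of that claim (with made-up scores):

import numpy as np

scores = np.array([0.2, 0.7, 0.1])
print(np.argmax(scores), np.argmax(scores * 100))  # same index both times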

Also, note that ***the distinct label values themselves (0, 1, a string, and so on) are returned in the array label*** (this is important).

Calculation of probability density function

Here is the new code, fixed so that it handles arbitrary label values.

for i in range(self.y_num):
    kernel = gaussian_kde(X[y==self.label[i]].T)
    self.kernel_.append(kernel)

And this was the corresponding code up to the previous article:

kernel = gaussian_kde(X[y==i].T)

Originally, "data with label i" was specified, but it has been changed to specify "data with label i". By specifying from the output label, it corresponds to the label (character string etc.) that is not a non-negative integer.

Reflect label ratio

for j in range(self.y_num):
    self.p_[j] = self.p_[j] * self.y_rate[j]

I added a step to the predict method that multiplies each probability density by the corresponding label ratio.

Predicted label assignment

Here is the new code.

for k in range(len(X)):
    pred.append(self.label[np.argmax(self.p_.T[k])])

By looking the assigned label up in the array label, labels other than non-negative integers are supported here as well.

Other

I also rewrote the messy parts of the predict method using NumPy. Creating the ndarray first and then filling in the results improves both readability and speed. The three loops could be condensed even further, as sketched below.
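For reference, here is a more condensed variant of predict (a sketch under the same attribute names, not part of the article's implementation; it should behave the same as the loop version above):

def predict(self, X):
    # evaluate every label's density at the test points: shape (y_num, len(X))
    p = np.vstack([kernel.evaluate(X.T) for kernel in self.kernel_])
    # weight each row by its label ratio via broadcasting
    p *= self.y_rate[:, np.newaxis]
    # per-sample argmax over labels, mapped back to the label values
    return list(self.label[np.argmax(p, axis=0)])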

Finally

I somehow managed to get everything in my head written down to the end. Thank you for reading this far. I hope this series has made the topic even a little more interesting for you.

(Corrected on August 5, 2020)
