PRML Chapter 4 Bayesian Logistic Regression Python Implementation

In curve fitting, the posterior predictive distribution is often computed by treating the problem in a Bayesian way, but my impression is that this is done much less often in classification. This time I would like to treat logistic regression, which is commonly used for classification, in a Bayesian way and implement code that computes the posterior predictive distribution.

Bayesian logistic regression

As mentioned above, I have rarely, if ever, seen code that computes the predictive distribution for a classification problem. To compute the predictive distribution in a Bayesian way, the weight parameters must be integrated out using their posterior distribution. However, because logistic regression involves the logistic sigmoid function, this integral cannot be carried out analytically. The predictive distribution is therefore obtained approximately using the Laplace approximation.

Logistic regression

First, consider logistic regression for two-class classification. Let $\phi$ be the feature vector of a point $x$. The probability that the point belongs to class $C_1$ is expressed as

p(C_1|\phi) = y(\phi) = \sigma({\bf w}^\intercal\phi)

where $\sigma(\cdot)$ is the logistic sigmoid function $\sigma(a) = {1\over1+\exp(-a)}$. Since we are considering two-class classification, once the probability of belonging to class $C_1$ is known, the probability of belonging to the other class $C_2$ is given by $p(C_2|\phi(x)) = 1 - p(C_1|\phi(x))$. From here on we therefore only consider the probability of belonging to class $C_1$.

Given a training dataset $\{\phi_n, t_n\}_{n=1}^N$ with $\phi_n = \phi(x_n)$ and $t_n \in \{0,1\}$, the likelihood function is modeled with the Bernoulli distribution:

\begin{align}
p({\bf t}|{\bf\Phi},{\bf w}) &= \prod_{n=1}^N{\rm Bern}(t_n|y_n)\\
&= \prod_{n=1}^N y_n^{t_n}(1 - y_n)^{1 - t_n}
\end{align}

Here ${\bf\Phi}$ is the design matrix whose $n$-th row is the vector $\phi_n^{\rm T}$. As usual, defining the cost function as the negative log-likelihood gives the cross-entropy form:

E({\bf w}) = -\ln p({\bf t}|{\bf\Phi},{\bf w}) = -\sum_{n=1}^N\{t_n\ln y_n + (1 - t_n)\ln(1 - y_n)\}
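As a quick sanity check, this cost can be written in a few lines of NumPy (a minimal sketch, separate from the implementation below; `Phi`, `t`, and `w` are placeholder names for the design matrix, target vector, and weights):

import numpy as np

def sigmoid(a):
    return 1 / (1 + np.exp(-a))

def cross_entropy(w, Phi, t):
    # E(w) = -sum_n { t_n ln y_n + (1 - t_n) ln(1 - y_n) },  with y = sigma(Phi w)
    y = sigmoid(Phi.dot(w))
    return -np.sum(t * np.log(y) + (1 - t) * np.log(1 - y))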

Differentiating this cost function once and twice with respect to the parameter ${\bf w}$ gives

\begin{align}
\nabla E({\bf w}) &= \sum_{n=1}^N (y_n - t_n) {\bf\phi}_n = {\bf\Phi}^\intercal({\bf y} - {\bf t})\\
{\bf H} = \nabla\nabla E({\bf w}) &= \sum_{n=1}^N y_n(1 - y_n)\phi_n\phi_n^\intercal = {\bf\Phi}^\intercal{\bf R}{\bf\Phi}
\end{align}

Here ${\bf R}$ is a diagonal matrix with elements $R_{nn} = y_n(1 - y_n)$. Using these results, the Newton-Raphson update

{\bf w}^{new} = {\bf w}^{old} - {\bf H}^{-1}\nabla E({\bf w})

is repeated until the parameter ${\bf w}$ converges.
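One iteration of this Newton-Raphson (IRLS) update might look like the following sketch (reusing the sigmoid helper above; the regularization term coming from the prior is omitted here and appears in the actual implementation below):

def irls_step(w, Phi, t):
    y = sigmoid(Phi.dot(w))                # y_n = sigma(w^T phi_n)
    grad = Phi.T.dot(y - t)                # gradient Phi^T (y - t)
    R = np.diag(y * (1 - y))               # R_nn = y_n (1 - y_n)
    H = Phi.T.dot(R).dot(Phi)              # Hessian Phi^T R Phi
    return w - np.linalg.solve(H, grad)    # w_new = w_old - H^{-1} grad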

Laplace approximation

We introduce a prior distribution $p({\bf w}) = \mathcal{N}({\bf w}|{\bf 0}, \alpha^{-1}{\bf I})$ over the parameter ${\bf w}$ and use the Laplace approximation to obtain an approximate posterior distribution in Gaussian form. From Bayes' theorem,

p({\bf w}|{\bf t},{\bf\Phi}) \propto p({\bf t}|{\bf\Phi},{\bf w})p({\bf w})

so taking the logarithm gives

\ln p({\bf w}|{\bf t},{\bf\Phi}) = -{\alpha\over2}{\bf w}^\intercal{\bf w} + \sum_{n=1}^N\{t_n\ln y_n + (1 - t_n)\ln(1 - y_n)\}

up to an additive constant. To obtain the Gaussian approximation, we first maximize this expression with respect to ${\bf w}$ to obtain the mode ${\bf w}_{MAP}$, and then evaluate the second-derivative matrix $S_N^{-1} = -\nabla\nabla\ln p({\bf w}|{\bf t},{\bf\Phi}) = \alpha{\bf I} + {\bf\Phi}^\intercal{\bf R}{\bf\Phi}$. The approximate posterior distribution is then obtained as

q({\bf w}) = \mathcal{N}({\bf w}|{\bf w}_{MAP}, S_N)
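In code, once ${\bf w}_{MAP}$ has been found (for example with the IRLS update above), the covariance $S_N$ of the Gaussian approximation is just the inverse of $S_N^{-1}$ evaluated at that point. A minimal sketch, with alpha denoting the prior precision:

def laplace_posterior(w_map, Phi, alpha):
    y = sigmoid(Phi.dot(w_map))
    # S_N^{-1} = alpha I + Phi^T R Phi, evaluated at w_MAP
    S_N_inv = alpha * np.eye(len(w_map)) + Phi.T.dot(np.diag(y * (1 - y))).dot(Phi)
    return np.linalg.inv(S_N_inv)   # covariance S_N of q(w)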

Posterior predictive distribution

The posterior predictive distribution for a new point $\phi$ is obtained by marginalizing over the posterior distribution of the parameters:

\begin{align}
p(C_1|\phi,{\bf t},{\bf\Phi}) &= \int p(C_1|\phi,{\bf w})p({\bf w}|{\bf t},{\bf\Phi}){\rm d}{\bf w}\\
&\approx \int \sigma({\bf w}^\intercal\phi)\mathcal{N}({\bf w}|{\bf w}_{MAP},S_N){\rm d}{\bf w}\\
&\approx \sigma\left({\mu\over\sqrt{1 + \pi\sigma^2/8}}\right)
\end{align}

where

\begin{align}
\mu &= {\bf w}_{MAP}^\intercal\phi\\
\sigma^2 &= \phi^\intercal S_N\phi
\end{align}

Two approximations are used here: the Laplace approximation of the posterior distribution of ${\bf w}$, and the approximation of the logistic sigmoid function by the probit function. See PRML for the detailed derivation.
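Putting the two approximations together, the predictive probability for a new feature vector $\phi$ can be computed as in this sketch (assuming w_map and S_N obtained as above):

def predict_dist(phi, w_map, S_N):
    mu = w_map.dot(phi)              # mu = w_MAP^T phi
    var = phi.dot(S_N).dot(phi)      # sigma^2 = phi^T S_N phi
    # probit-based approximation of the sigmoid-Gaussian convolution
    return sigmoid(mu / np.sqrt(1 + np.pi * var / 8))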

Implementation

Packages

In addition to the usual matplotlib and numpy, the code uses the standard-library modules itertools and functools (reduce is imported from functools, since it is no longer a builtin in Python 3).

import itertools
from functools import reduce
import matplotlib.pyplot as plt
import numpy as np

Design matrix: Polynomial features

A two-dimensional point $(x_1, x_2)$ is converted into the quadratic polynomial feature vector

\phi(x_1,x_2) =
\begin{bmatrix}
1\\
x_1\\
x_2\\
x_1^2\\
x_1x_2\\
x_2^2
\end{bmatrix}

Below is a class that transforms a set of two-dimensional points into a design matrix. For example, the design matrix of quadratic polynomial features for the two points $(2, 5)$ and $(3, -1)$ is

{\bf\Phi}=
\begin{bmatrix}
1 & 2 & 5 & 2^2 & 2 \times 5 & 5^2\\
1 & 3 & -1 & 3^2 & 3 \times (-1) & (-1)^2
\end{bmatrix}


class PolynomialFeatures(object):

    def __init__(self, degree=2):
        self.degree = degree

    def transform(self, x):
        # x: (N, 2) array of 2D points -> (N, number of features) design matrix
        x_t = x.transpose()
        features = [np.ones(len(x))]
        for degree in range(1, self.degree + 1):
            for items in itertools.combinations_with_replacement(x_t, degree):
                features.append(reduce(lambda a, b: a * b, items))
        return np.array(features).transpose()
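As a quick check (not in the original article), applying this class to the two example points from above should reproduce the design matrix shown earlier:

features = PolynomialFeatures(degree=2)
print(features.transform(np.array([[2., 5.], [3., -1.]])))
# [[ 1.  2.  5.  4. 10. 25.]
#  [ 1.  3. -1.  9. -3.  1.]]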

Logistic regression

It is not strictly necessary to use separate classes for logistic regression and Bayesian logistic regression, but they are split here to make the difference between the two easier to see.

class LogisticRegression(object):

    def __init__(self, iter_max, alpha=0):
        self.iter_max = iter_max
        self.alpha = alpha

    def _sigmoid(self, a):
        return np.divide(1, 1 + np.exp(-a))

    def fit(self, X, t):
        self.w = np.zeros(np.size(X, 1))
        for i in range(self.iter_max):
            w = np.copy(self.w)
            y = self.predict_proba(X)
            grad = X.T.dot(y - t) + self.alpha * w
            hessian = (X.T.dot(np.diag(y * (1 - y))).dot(X)
                       + self.alpha * np.identity(len(w)))
            try:
                self.w -= np.linalg.solve(hessian, grad)
            except np.linalg.LinAlgError:
                break
            if np.allclose(w, self.w):
                break
            if i == self.iter_max - 1:
                print("weight parameter w may not have converged")

    def predict(self, X):
        return (self.predict_proba(X) > 0.5).astype(int)

    def predict_proba(self, X):
        return self._sigmoid(X.dot(self.w))

Methods of LogisticRegression:

__init__: set the maximum number of parameter updates and the precision of the prior distribution over the parameters
_sigmoid: logistic sigmoid function
fit: estimate the parameters
predict: predict the label of the input
predict_proba: compute the probability that the input belongs to label 1
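For reference, a minimal usage sketch of this class on a small synthetic dataset (the data here is made up purely for illustration):

# two Gaussian blobs: class 0 around the origin, class 1 around (2, 2)
x = np.concatenate([np.random.normal(size=(25, 2)),
                    np.random.normal(size=(25, 2)) + 2.])
t = np.concatenate([np.zeros(25), np.ones(25)])

features = PolynomialFeatures(degree=2)
classifier = LogisticRegression(iter_max=100)
classifier.fit(features.transform(x), t)
print(classifier.predict(features.transform(x))[:5])   # predicted labels of the first five points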

Bayesian logistic regression

In the maximum a posteriori estimation of the parameters, not only the mode but also an approximate covariance is obtained. The logistic regression class above also computes the Hessian matrix (the precision matrix of the approximate posterior over the parameters), but it does not store it in any attribute because it is only needed during estimation. Bayesian logistic regression, on the other hand, also needs the parameter covariance to compute the predictive distribution, so it is saved. The predict_dist method then computes the predictive distribution using this covariance matrix.

class BayesianLogisticRegression(LogisticRegression):

    def __init__(self, iter_max, alpha=0.1):
        super(BayesianLogisticRegression, self).__init__(iter_max, alpha)

    def fit(self, X, t):
        super(BayesianLogisticRegression, self).fit(X, t)
        y = self.predict_proba(X)
        hessian = (X.T.dot(np.diag(y * (1 - y))).dot(X)
                   + self.alpha * np.identity(len(self.w)))
        self.w_var = np.linalg.inv(hessian)

    def predict_dist(self, X):
        mu_a = X.dot(self.w)
        var_a = np.sum(X.dot(self.w_var) * X, axis=1)
        return self._sigmoid(mu_a / np.sqrt(1 + np.pi * var_a / 8))

Methods of BayesianLogisticRegression:

__init__: set the maximum number of parameter updates and the precision of the prior distribution over the parameters
fit: estimate the parameters (mode and approximate covariance)
predict_dist: compute the posterior predictive distribution for the input

Whole code

import itertools
from functools import reduce
import matplotlib.pyplot as plt
import numpy as np


class PolynomialFeatures(object):

    def __init__(self, degree=2):
        self.degree = degree

    def transform(self, x):
        # x: (N, 2) array of 2D points -> (N, number of features) design matrix
        x_t = x.transpose()
        features = [np.ones(len(x))]
        for degree in range(1, self.degree + 1):
            for items in itertools.combinations_with_replacement(x_t, degree):
                features.append(reduce(lambda a, b: a * b, items))
        return np.array(features).transpose()


class LogisticRegression(object):

    def __init__(self, iter_max, alpha=0):
        self.iter_max = iter_max
        self.alpha = alpha

    def _sigmoid(self, a):
        return np.divide(1, 1 + np.exp(-a))

    def fit(self, X, t):
        self.w = np.zeros(np.size(X, 1))
        for i in range(self.iter_max):
            w = np.copy(self.w)
            y = self.predict_proba(X)
            grad = X.T.dot(y - t) + self.alpha * w
            hessian = (X.T.dot(np.diag(y * (1 - y))).dot(X)
                       + self.alpha * np.identity(len(w)))
            try:
                self.w -= np.linalg.solve(hessian, grad)
            except np.linalg.LinAlgError:
                break
            if np.allclose(w, self.w):
                break
            if i == self.iter_max - 1:
                print("weight parameter w may not have converged")

    def predict(self, X):
        return (self.predict_proba(X) > 0.5).astype(int)

    def predict_proba(self, X):
        return self._sigmoid(X.dot(self.w))


class BayesianLogisticRegression(LogisticRegression):

    def __init__(self, iter_max, alpha=0.1):
        super(BayesianLogisticRegression, self).__init__(iter_max, alpha)

    def fit(self, X, t):
        super(BayesianLogisticRegression, self).fit(X, t)
        y = self.predict_proba(X)
        hessian = (X.T.dot(np.diag(y * (1 - y))).dot(X)
                   + self.alpha * np.identity(len(self.w)))
        self.w_var = np.linalg.inv(hessian)

    def predict_dist(self, X):
        mu_a = X.dot(self.w)
        var_a = np.sum(X.dot(self.w_var) * X, axis=1)
        return self._sigmoid(mu_a / np.sqrt(1 + np.pi * var_a / 8))


def create_data_set():
    x = np.random.normal(size=50).reshape(-1, 2)
    y = np.random.normal(size=50).reshape(-1, 2)
    y += np.array([2., 2.])
    return (np.concatenate([x, y]), np.concatenate([np.zeros(25), np.ones(25)]))


def main():
    x, labels = create_data_set()
    colors = ['blue', 'red']
    plt.scatter(x[:, 0], x[:, 1], c=[colors[int(label)] for label in labels])

    features = PolynomialFeatures(degree=3)

    classifier = BayesianLogisticRegression(iter_max=100, alpha=0.1)
    classifier.fit(features.transform(x), labels)

    X_test, Y_test = np.meshgrid(np.linspace(-2, 4, 100), np.linspace(-2, 4, 100))
    x_test = np.array([X_test, Y_test]).transpose(1, 2, 0).reshape(-1, 2)
    probs = classifier.predict_proba(features.transform(x_test))
    Probs = probs.reshape(100, 100)
    dists = classifier.predict_dist(features.transform(x_test))
    Dists = dists.reshape(100, 100)
    levels = np.linspace(0, 1, 5)
    cp = plt.contour(X_test, Y_test, Probs, levels, colors='k', linestyles='dashed')
    plt.clabel(cp, inline=True, fontsize=10)
    plt.contourf(X_test, Y_test, Dists, levels, alpha=0.5)
    plt.colorbar()
    plt.xlim(-2, 4)
    plt.ylim(-2, 4)
    plt.show()


if __name__ == '__main__':
    main()

Result

- Training data: blue points (class 0) and red points (class 1)
- Black dashed lines: output of plain logistic regression; from the top, the contours where the probability of belonging to class 1 is 0.75, 0.5, and 0.25 (somewhat hard to see)
- Color map: output of Bayesian logistic regression, showing the probability of belonging to class 1
  - Red: 0.75 and above
  - Yellow: 0.5 to 0.75
  - Light blue: 0.25 to 0.5
  - Blue: below 0.25

predictive_distribution.png

Looking at the figure, the contour where the probability is 0.5 (the middle dashed line, which coincides with the boundary between light blue and yellow) is the same for both methods, whereas the 0.25 and 0.75 contours are farther from the 0.5 contour in the Bayesian treatment, indicating greater uncertainty about which class a point belongs to. Moreover, in the upper left of the figure, where there are no data points, the predictive distribution is more ambiguous than in other regions, while around the center, where there are many data points, the band between 0.25 and 0.75 is narrow. With either method, after computing the probability of belonging to class 1, a point is assigned to a class by some decision criterion; if that criterion is minimizing the misclassification rate, it makes no difference which method is used, because the 0.5 contour is the same for both. With more elaborate decision criteria, however, the Bayesian treatment may be preferable.

In closing

In this post, we implemented Bayesian logistic regression using the Laplace approximation. The advantage of the Bayesian treatment is that the model behaves differently in regions with many data points and in regions with few. Besides the Laplace approximation, the predictive distribution can also be derived using variational Bayes; Chapter 10 of PRML uses variational Bayes to estimate the hyperparameter $\alpha$ as well. I may implement that too.
