PRML Chapter 4 Bayesian Logistic Regression Python Implementation

In curve fitting, the posterior predictive distribution is often computed by treating the problem in a Bayesian way, but my impression is that this is done much less often in classification. This time I would like to treat logistic regression, which is commonly used for classification, in a Bayesian way and implement code that computes the posterior predictive distribution.

Bayesian logistic regression

As mentioned above, I have rarely, if ever, seen code that computes the predictive distribution for a classification problem. To compute the predictive distribution in a Bayesian way, the weight parameters must be integrated out using their posterior distribution. However, because logistic regression involves the logistic sigmoid function, this integral cannot be carried out analytically. The predictive distribution is therefore obtained approximately using the Laplace approximation.

Logistic regression

First, consider logistic regression for two-class classification. Let $\phi$ be the feature vector of a point $x$. The probability that the point belongs to class $C_1$ is expressed as

p(C_1|\phi) = y(\phi) = \sigma({\bf w}^\intercal\phi)

where $\sigma(\cdot)$ is the logistic sigmoid function $\sigma(a) = {1\over1+\exp(-a)}$. Since we are considering two-class classification, once the probability of belonging to class $C_1$ is known, the probability of belonging to the other class $C_2$ is given by $p(C_2|\phi(x)) = 1 - p(C_1|\phi(x))$. From here on we therefore only consider the probability of belonging to class $C_1$.

Given a training dataset $\{\phi_n, t_n\}_{n=1}^N$ with $\phi_n = \phi(x_n)$ and $t_n \in \{0,1\}$, the likelihood function is modeled with the Bernoulli distribution:

\begin{align}
p({\bf t}|{\bf\Phi},{\bf w}) &= \prod_{n=1}^N{\rm Bern}(t_n|y_n)\\
&= \prod_{n=1}^N y_n^{t_n}(1 - y_n)^{1 - t_n}
\end{align}

Here ${\bf\Phi}$ is the design matrix whose $n$-th row is the vector $\phi_n^{\rm T}$. As usual, defining the cost function as the negative log-likelihood gives the cross-entropy form:

E({\bf w}) = -\ln p({\bf t}|{\bf\Phi},{\bf w}) = -\sum_{n=1}^N\{t_n\ln y_n + (1 - t_n)\ln(1 - y_n)\}
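As a quick sanity check, this cost can be written in a few lines of NumPy (a minimal sketch, separate from the implementation below; `Phi`, `t`, and `w` are placeholder names for the design matrix, target vector, and weights):

import numpy as np

def sigmoid(a):
    return 1 / (1 + np.exp(-a))

def cross_entropy(w, Phi, t):
    # E(w) = -sum_n { t_n ln y_n + (1 - t_n) ln(1 - y_n) },  with y = sigma(Phi w)
    y = sigmoid(Phi.dot(w))
    return -np.sum(t * np.log(y) + (1 - t) * np.log(1 - y))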

Differentiating this cost function once and twice with respect to the parameter ${\bf w}$ gives

\begin{align}
\nabla E({\bf w}) &= \sum_{n=1}^N (y_n - t_n) {\bf\phi}_n = {\bf\Phi}^\intercal({\bf y} - {\bf t})\\
{\bf H} = \nabla\nabla E({\bf w}) &= \sum_{n=1}^N y_n(1 - y_n)\phi_n\phi_n^\intercal = {\bf\Phi}^\intercal{\bf R}{\bf\Phi}
\end{align}

Here ${\bf R}$ is a diagonal matrix with elements $R_{nn} = y_n(1 - y_n)$. Using these results, the Newton-Raphson update

{\bf w}^{new} = {\bf w}^{old} - {\bf H}^{-1}\nabla E({\bf w})

is repeated until the parameter ${\bf w}$ converges.
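One iteration of this Newton-Raphson (IRLS) update might look like the following sketch (reusing the sigmoid helper above; the regularization term coming from the prior is omitted here and appears in the actual implementation below):

def irls_step(w, Phi, t):
    y = sigmoid(Phi.dot(w))                # y_n = sigma(w^T phi_n)
    grad = Phi.T.dot(y - t)                # gradient Phi^T (y - t)
    R = np.diag(y * (1 - y))               # R_nn = y_n (1 - y_n)
    H = Phi.T.dot(R).dot(Phi)              # Hessian Phi^T R Phi
    return w - np.linalg.solve(H, grad)    # w_new = w_old - H^{-1} grad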

Laplace approximation

We introduce a prior distribution $p({\bf w}) = \mathcal{N}({\bf w}|{\bf 0}, \alpha^{-1}{\bf I})$ over the parameter ${\bf w}$ and use the Laplace approximation to obtain an approximate posterior distribution in Gaussian form. From Bayes' theorem,

p({\bf w}|{\bf t},{\bf\Phi}) \propto p({\bf t}|{\bf\Phi},{\bf w})p({\bf w})

so taking the logarithm gives

\ln p({\bf w}|{\bf t},{\bf\Phi}) = -{\alpha\over2}{\bf w}^\intercal{\bf w} + \sum_{n=1}^N\{t_n\ln y_n + (1 - t_n)\ln(1 - y_n)\}

up to an additive constant. To obtain the Gaussian approximation, we first maximize this expression with respect to ${\bf w}$ to obtain the mode ${\bf w}_{MAP}$, and then evaluate the second-derivative matrix $S_N^{-1} = -\nabla\nabla\ln p({\bf w}|{\bf t},{\bf\Phi}) = \alpha{\bf I} + {\bf\Phi}^\intercal{\bf R}{\bf\Phi}$. The approximate posterior distribution is then obtained as

q({\bf w}) = \mathcal{N}({\bf w}|{\bf w}_{MAP}, S_N)
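In code, once ${\bf w}_{MAP}$ has been found (for example with the IRLS update above), the covariance $S_N$ of the Gaussian approximation is just the inverse of $S_N^{-1}$ evaluated at that point. A minimal sketch, with alpha denoting the prior precision:

def laplace_posterior(w_map, Phi, alpha):
    y = sigmoid(Phi.dot(w_map))
    # S_N^{-1} = alpha I + Phi^T R Phi, evaluated at w_MAP
    S_N_inv = alpha * np.eye(len(w_map)) + Phi.T.dot(np.diag(y * (1 - y))).dot(Phi)
    return np.linalg.inv(S_N_inv)   # covariance S_N of q(w)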

Posterior predictive distribution

The posterior predictive distribution for a new point $\phi$ is obtained by marginalizing over the posterior distribution of the parameters:

\begin{align}
p(C_1|\phi,{\bf t},{\bf\Phi}) &= \int p(C_1|\phi,{\bf w})p({\bf w}|{\bf t},{\bf\Phi}){\rm d}{\bf w}\\
&\approx \int \sigma({\bf w}^\intercal\phi)\mathcal{N}({\bf w}|{\bf w}_{MAP},S_N){\rm d}{\bf w}\\
&\approx \sigma\left({\mu\over\sqrt{1 + \pi\sigma^2/8}}\right)
\end{align}

where

\begin{align}
\mu &= {\bf w}_{MAP}^\intercal\phi\\
\sigma^2 &= \phi^\intercal S_N\phi
\end{align}

Two approximations are used here: the Laplace approximation of the posterior distribution of ${\bf w}$, and the approximation of the logistic sigmoid function by the probit function. See PRML for the detailed derivation.
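Putting the two approximations together, the predictive probability for a new feature vector $\phi$ can be computed as in this sketch (assuming w_map and S_N obtained as above):

def predict_dist(phi, w_map, S_N):
    mu = w_map.dot(phi)              # mu = w_MAP^T phi
    var = phi.dot(S_N).dot(phi)      # sigma^2 = phi^T S_N phi
    # probit-based approximation of the sigmoid-Gaussian convolution
    return sigmoid(mu / np.sqrt(1 + np.pi * var / 8))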

Implementation

Packages

In addition to the usual matplotlib and numpy, the code uses the standard-library modules itertools and functools (reduce is imported from functools, since it is no longer a builtin in Python 3).

import itertools
from functools import reduce
import matplotlib.pyplot as plt
import numpy as np

Design matrix: Polynomial features

A two-dimensional point $(x_1, x_2)$ is converted into the quadratic polynomial feature vector

\phi(x_1,x_2) =
\begin{bmatrix}
1\\
x_1\\
x_2\\
x_1^2\\
x_1x_2\\
x_2^2
\end{bmatrix}

Below is a class that transforms a set of two-dimensional points into a design matrix. For example, the design matrix of quadratic polynomial features for the two points $(2, 5)$ and $(3, -1)$ is

{\bf\Phi}=
\begin{bmatrix}
1 & 2 & 5 & 2^2 & 2 \times 5 & 5^2\\
1 & 3 & -1 & 3^2 & 3 \times (-1) & (-1)^2
\end{bmatrix}


class PolynomialFeatures(object):

    def __init__(self, degree=2):
        self.degree = degree

    def transform(self, x):
        # x: (N, 2) array of 2D points -> (N, number of features) design matrix
        x_t = x.transpose()
        features = [np.ones(len(x))]
        for degree in range(1, self.degree + 1):
            for items in itertools.combinations_with_replacement(x_t, degree):
                features.append(reduce(lambda a, b: a * b, items))
        return np.array(features).transpose()
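As a quick check (not in the original article), applying this class to the two example points from above should reproduce the design matrix shown earlier:

features = PolynomialFeatures(degree=2)
print(features.transform(np.array([[2., 5.], [3., -1.]])))
# [[ 1.  2.  5.  4. 10. 25.]
#  [ 1.  3. -1.  9. -3.  1.]]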

Logistic regression

It is not strictly necessary to use separate classes for logistic regression and Bayesian logistic regression, but they are split here to make the difference between the two easier to see.

class LogisticRegression(object):

    def __init__(self, iter_max, alpha=0):
        self.iter_max = iter_max
        self.alpha = alpha

    def _sigmoid(self, a):
        return np.divide(1, 1 + np.exp(-a))

    def fit(self, X, t):
        self.w = np.zeros(np.size(X, 1))
        for i in range(self.iter_max):
            w = np.copy(self.w)
            y = self.predict_proba(X)
            grad = X.T.dot(y - t) + self.alpha * w
            hessian = (X.T.dot(np.diag(y * (1 - y))).dot(X)
                       + self.alpha * np.identity(len(w)))
            try:
                self.w -= np.linalg.solve(hessian, grad)
            except np.linalg.LinAlgError:
                break
            if np.allclose(w, self.w):
                break
            if i == self.iter_max - 1:
                print("weight parameter w may not have converged")

    def predict(self, X):
        return (self.predict_proba(X) > 0.5).astype(int)

    def predict_proba(self, X):
        return self._sigmoid(X.dot(self.w))

Methods of LogisticRegression:

__init__: set the maximum number of parameter updates and the precision of the prior distribution over the parameters
_sigmoid: logistic sigmoid function
fit: estimate the parameters
predict: predict the label of the input
predict_proba: compute the probability that the input belongs to label 1
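For reference, a minimal usage sketch of this class on a small synthetic dataset (the data here is made up purely for illustration):

# two Gaussian blobs: class 0 around the origin, class 1 around (2, 2)
x = np.concatenate([np.random.normal(size=(25, 2)),
                    np.random.normal(size=(25, 2)) + 2.])
t = np.concatenate([np.zeros(25), np.ones(25)])

features = PolynomialFeatures(degree=2)
classifier = LogisticRegression(iter_max=100)
classifier.fit(features.transform(x), t)
print(classifier.predict(features.transform(x))[:5])   # predicted labels of the first five points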

Bayesian logistic regression

In the maximum a posteriori estimation of the parameters, not only the mode but also an approximate covariance is obtained. The logistic regression class above also computes the Hessian matrix (the precision matrix of the approximate posterior over the parameters), but it does not store it in any attribute because it is only needed during estimation. Bayesian logistic regression, on the other hand, also needs the parameter covariance to compute the predictive distribution, so it is saved. The predict_dist method then computes the predictive distribution using this covariance matrix.

class BayesianLogisticRegression(LogisticRegression):

    def __init__(self, iter_max, alpha=0.1):
        super(BayesianLogisticRegression, self).__init__(iter_max, alpha)

    def fit(self, X, t):
        super(BayesianLogisticRegression, self).fit(X, t)
        y = self.predict_proba(X)
        hessian = (X.T.dot(np.diag(y * (1 - y))).dot(X)
                   + self.alpha * np.identity(len(self.w)))
        self.w_var = np.linalg.inv(hessian)

    def predict_dist(self, X):
        mu_a = X.dot(self.w)
        var_a = np.sum(X.dot(self.w_var) * X, axis=1)
        return self._sigmoid(mu_a / np.sqrt(1 + np.pi * var_a / 8))

Methods of BayesianLogisticRegression:

__init__: set the maximum number of parameter updates and the precision of the prior distribution over the parameters
fit: estimate the parameters (mode and approximate covariance)
predict_dist: compute the posterior predictive distribution for the input

Whole code

import itertools
from functools import reduce
import matplotlib.pyplot as plt
import numpy as np


class PolynomialFeatures(object):

    def __init__(self, degree=2):
        self.degree = degree

    def transform(self, x):
        # x: (N, 2) array of 2D points -> (N, number of features) design matrix
        x_t = x.transpose()
        features = [np.ones(len(x))]
        for degree in range(1, self.degree + 1):
            for items in itertools.combinations_with_replacement(x_t, degree):
                features.append(reduce(lambda a, b: a * b, items))
        return np.array(features).transpose()


class LogisticRegression(object):

    def __init__(self, iter_max, alpha=0):
        self.iter_max = iter_max
        self.alpha = alpha

    def _sigmoid(self, a):
        return np.divide(1, 1 + np.exp(-a))

    def fit(self, X, t):
        self.w = np.zeros(np.size(X, 1))
        for i in range(self.iter_max):
            w = np.copy(self.w)
            y = self.predict_proba(X)
            grad = X.T.dot(y - t) + self.alpha * w
            hessian = (X.T.dot(np.diag(y * (1 - y))).dot(X)
                       + self.alpha * np.identity(len(w)))
            try:
                self.w -= np.linalg.solve(hessian, grad)
            except np.linalg.LinAlgError:
                break
            if np.allclose(w, self.w):
                break
            if i == self.iter_max - 1:
                print("weight parameter w may not have converged")

    def predict(self, X):
        return (self.predict_proba(X) > 0.5).astype(int)

    def predict_proba(self, X):
        return self._sigmoid(X.dot(self.w))


class BayesianLogisticRegression(LogisticRegression):

    def __init__(self, iter_max, alpha=0.1):
        super(BayesianLogisticRegression, self).__init__(iter_max, alpha)

    def fit(self, X, t):
        super(BayesianLogisticRegression, self).fit(X, t)
        y = self.predict_proba(X)
        hessian = (X.T.dot(np.diag(y * (1 - y))).dot(X)
                   + self.alpha * np.identity(len(self.w)))
        self.w_var = np.linalg.inv(hessian)

    def predict_dist(self, X):
        mu_a = X.dot(self.w)
        var_a = np.sum(X.dot(self.w_var) * X, axis=1)
        return self._sigmoid(mu_a / np.sqrt(1 + np.pi * var_a / 8))


def create_data_set():
    x = np.random.normal(size=50).reshape(-1, 2)
    y = np.random.normal(size=50).reshape(-1, 2)
    y += np.array([2., 2.])
    return (np.concatenate([x, y]), np.concatenate([np.zeros(25), np.ones(25)]))


def main():
    x, labels = create_data_set()
    colors = ['blue', 'red']
    plt.scatter(x[:, 0], x[:, 1], c=[colors[int(label)] for label in labels])

    features = PolynomialFeatures(degree=3)

    classifier = BayesianLogisticRegression(iter_max=100, alpha=0.1)
    classifier.fit(features.transform(x), labels)

    X_test, Y_test = np.meshgrid(np.linspace(-2, 4, 100), np.linspace(-2, 4, 100))
    x_test = np.array([X_test, Y_test]).transpose(1, 2, 0).reshape(-1, 2)
    probs = classifier.predict_proba(features.transform(x_test))
    Probs = probs.reshape(100, 100)
    dists = classifier.predict_dist(features.transform(x_test))
    Dists = dists.reshape(100, 100)
    levels = np.linspace(0, 1, 5)
    cp = plt.contour(X_test, Y_test, Probs, levels, colors='k', linestyles='dashed')
    plt.clabel(cp, inline=True, fontsize=10)
    plt.contourf(X_test, Y_test, Dists, levels, alpha=0.5)
    plt.colorbar()
    plt.xlim(-2, 4)
    plt.ylim(-2, 4)
    plt.show()


if __name__ == '__main__':
    main()

Result

- Training data: blue points (class 0) and red points (class 1)
- Black dashed lines: output of plain logistic regression; from the top, the contours where the probability of belonging to class 1 is 0.75, 0.5, and 0.25 (somewhat hard to see)
- Color map: output of Bayesian logistic regression, showing the probability of belonging to class 1
  - Red: 0.75 and above
  - Yellow: 0.5 to 0.75
  - Light blue: 0.25 to 0.5
  - Blue: below 0.25

predictive_distribution.png

Looking at the figure, the contour where the probability is 0.5 (the middle dashed line, which coincides with the boundary between light blue and yellow) is the same for both methods, whereas the 0.25 and 0.75 contours are farther from the 0.5 contour in the Bayesian treatment, indicating greater uncertainty about which class a point belongs to. Moreover, in the upper left of the figure, where there are no data points, the predictive distribution is more ambiguous than in other regions, while around the center, where there are many data points, the band between 0.25 and 0.75 is narrow. With either method, after computing the probability of belonging to class 1, a point is assigned to a class by some decision criterion; if that criterion is minimizing the misclassification rate, it makes no difference which method is used, because the 0.5 contour is the same for both. With more elaborate decision criteria, however, the Bayesian treatment may be preferable.

In closing

In this post, we implemented Bayesian logistic regression using the Laplace approximation. The advantage of the Bayesian treatment is that the model behaves differently in regions with many data points and in regions with few. Besides the Laplace approximation, the predictive distribution can also be derived using variational Bayes; Chapter 10 of PRML uses variational Bayes to estimate the hyperparameter $\alpha$ as well. I may implement that too.
