[PYTHON] Rank learning with a neural network (an implementation of RankNet in Chainer)

What is this article?

Within the framework of machine learning, there is rank learning (learning to rank). RankNet, which uses a neural network, is one of the rank learning models. In this article, I describe an implementation of RankNet using Chainer.

Judging from the prediction results, the implementation seems to be correct, but if there are any mistakes, such as in how I use Chainer, I would be grateful if you could point them out.

What is rank learning in the first place

In rank learning, we learn from a set of pairs $(\mathbf{x}, y)$ = (feature vector, goodness of fit). The goal is to learn a function $f$ that returns a large value for items with a high goodness of fit and a small value for items with a low goodness of fit.

Now suppose two items $A$ and $B$ are represented as follows.

A=(\mathbf{x}_A, y_A) \\
B=(\mathbf{x}_B, y_B)

Also assume that $A$ has a higher goodness of fit than $B$, i.e., that $y_A > y_B$ holds. Then the function $f$ is learned so that the following relationship holds.

f(\mathbf{x}_A)>f(\mathbf{x}_B)

This may still feel a little abstract, so I will explain using Love Live! characters. First, let's assign a goodness of fit (☆, ☆☆, ☆☆☆) to each Love Live! character. The more ☆ a character has, the higher the goodness of fit (the more of a favorite they are).

cat_1.png

If we learn from pairs like the following, we should obtain a function $f$ that behaves like this.

learning.PNG

When $f(\mathbf{x}) = \mathbf{w}^T\mathbf{x}$, the features are orthogonally projected onto the weight vector $\mathbf{w}$ as shown in the figure below, and the projected points are expected to line up in order of goodness of fit. In a real problem there would be no $\mathbf{w}$ that orders the items perfectly by goodness of fit, though... (⇒ linear separability)

wpredict_1.png
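As a small illustration of this point (a sketch I have added, not part of the original article; the vectors and values below are made up for the example), projecting a few artificial feature vectors onto a suitable $\mathbf{w}$ and sorting by the projected value does line the items up by goodness of fit:

import numpy as np

np.random.seed(0)
w = np.random.randn(3)                  # a hypothetical "true" relevance direction
fits = np.array([1, 2, 3, 1, 3])        # goodness of fit (number of stars) per item
# build features so that items with a higher fit lie further along w
X = 0.1 * np.random.randn(5, 3) + fits[:, None] * w

scores = X @ w                          # f(x) = w^T x, i.e. projection onto w
print(fits[np.argsort(-scores)])        # fits from highest to lowest score; should print [3 3 2 1 1]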

The function $f$ obtained as a result of learning can then be used to predict the ranking of characters whose goodness of fit has not been assigned yet.

calc_f1.png

Above, I wrote about "learning pairs"; in fact, there are roughly three approaches to rank learning, and this corresponds to the one called pairwise. There are other approaches, such as pointwise and listwise; please refer to the references for details on these.

About RankNet

RankNet is a rank learning model based on a neural network and belongs to the pairwise methods. The prediction model itself is an ordinary neural network, but the design of the cost function is different. That said, the idea is the same as the cross-entropy function that is often used as a cost function.

As above, suppose the items $A$ and $B$ are represented as follows, and that $A$ has a higher goodness of fit than $B$.

A=(\mathbf{x}_A, y_A) \\
B=(\mathbf{x}_B, y_B)

The scores of $A$ and $B$ under the function $f$ are calculated as follows.

s_A=f(\mathbf{x}_A) \\
s_B=f(\mathbf{x}_B)

The probability, output by RankNet, that $A$ is ranked higher than $B$ is expressed using $s_A$ and $s_B$ as follows.

P_{AB}=\frac{1}{1+e^{-\sigma(s_A-s_B)}}

Intuitively, we can see that as $s_A$ increases, $P_{AB}$ approaches 1, and as $s_B$ increases, $P_{AB}$ approaches 0.
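As a quick numerical illustration (this small snippet is mine, not part of the article's implementation), $P_{AB}$ can be computed directly:

import numpy as np

def p_ab(s_a, s_b, sigma=1.0):
    # probability that A is ranked higher than B, given their scores
    return 1.0 / (1.0 + np.exp(-sigma * (s_a - s_b)))

print(p_ab(2.0, 0.5))   # s_A well above s_B -> close to 1
print(p_ab(0.5, 2.0))   # s_A well below s_B -> close to 0
print(p_ab(1.0, 1.0))   # equal scores -> exactly 0.5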

Using $P_{AB}$, the cost function is expressed as follows.

C_{AB}=-\bar{P}_{AB}\log{P_{AB}}-(1-\bar{P}_{AB})\log(1-P_{AB})

Here, $\bar{P}_{AB}$ is the known probability that $A$ is ranked higher than $B$. In the references, it was set as follows.

\bar{P}_{AB}=\begin{cases}
1 & (A \succ B) \\
0 & (A \prec B) \\
\frac{1}{2} & (otherwise)
\end{cases}

$C_{AB}$ is just the cross-entropy function, and its value grows the farther $P_{AB}$ is from $\bar{P}_{AB}$. RankNet learns by minimizing this cost function.
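Incidentally, if we write $S_{AB} = 1$ when $A \succ B$, $S_{AB} = -1$ when $A \prec B$, and $S_{AB} = 0$ otherwise, then $\bar{P}_{AB} = \frac{1}{2}(1 + S_{AB})$, and substituting this together with the sigmoid form of $P_{AB}$ into $C_{AB}$ simplifies the cost to

C_{AB} = \frac{1}{2}(1 - S_{AB})\sigma(s_A - s_B) + \log\left(1 + e^{-\sigma(s_A - s_B)}\right)

This is the form that the implementation below actually computes (with $\sigma = 1$).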

Implementation of RankNet in Chainer

Now, let's implement RankNet with Chainer. Chainer provides the loss function softmax_cross_entropy(), but it cannot be used here because it only accepts int32 values as the target variable ($\bar{P}_{AB}$ can be $\frac{1}{2}$).

Therefore, I had to implement the cost function $C_{AB}$ myself, but training works even without implementing backward(), which the other loss functions implement... I wonder whether the loss is differentiated automatically (numerically?) when backward() is not implemented; I would be grateful for comments on this point.

The following is the implementation of RankNet. As the neural network model, a simple multilayer perceptron of [input layer] -> [hidden layer] -> [hidden layer] -> [output layer] is used.

net.py


from chainer import Chain
import chainer.functions as F
import chainer.links as L


class MLP(Chain):
    """A simple multilayer perceptron that maps a feature vector to a scalar score."""

    def __init__(self, n_in, n_hidden):
        super(MLP, self).__init__(
            l1=L.Linear(n_in, n_hidden),
            l2=L.Linear(n_hidden, n_hidden),
            l3=L.Linear(n_hidden, 1)
        )

    def __call__(self, x):
        h1 = F.relu(self.l1(x))
        h2 = F.relu(self.l2(h1))
        return self.l3(h2)


class RankNet(Chain):
    """Pairwise RankNet loss wrapping a score predictor."""

    def __init__(self, predictor):
        super(RankNet, self).__init__(predictor=predictor)

    def __call__(self, x_i, x_j, t_i, t_j):
        s_i = self.predictor(x_i)
        s_j = self.predictor(x_j)
        s_diff = s_i - s_j
        # S_ij = 1 if item i has the higher goodness of fit, -1 if lower, 0 if equal
        if t_i.data > t_j.data:
            S_ij = 1
        elif t_i.data < t_j.data:
            S_ij = -1
        else:
            S_ij = 0
        # C = (1 - S_ij) * (s_i - s_j) / 2 + log(1 + exp(-(s_i - s_j))), with sigma = 1
        self.loss = (1 - S_ij) * s_diff / 2. + \
            F.log(1 + F.exp(-s_diff))
        return self.loss
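As a rough sanity check (my own addition, not part of the original article; check_loss.py is just a made-up name for a throwaway script), the loss returned by the RankNet class can be compared against the simplified analytic formula above:

check_loss.py

import numpy as np
from chainer import Variable
import net

np.random.seed(0)
model = net.RankNet(net.MLP(n_in=5, n_hidden=8))
x_i = Variable(np.random.randn(1, 5).astype(np.float32))
x_j = Variable(np.random.randn(1, 5).astype(np.float32))
t_i = Variable(np.array([2], dtype=np.int32))  # item i has the higher goodness of fit
t_j = Variable(np.array([1], dtype=np.int32))

loss = model(x_i, x_j, t_i, t_j)
d = (model.predictor(x_i) - model.predictor(x_j)).data
expected = (1 - 1) * d / 2. + np.log(1 + np.exp(-d))  # S_ij = 1 for this pair
print(loss.data, expected)  # the two values should agree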

Let's apply the implemented RankNet to artificial data. The artificial data was generated as follows.

X=(\mathbf{x}_1, \mathbf{x}_2, \dots, \mathbf{x}_{1000})^{\rm T} \\
\mathbf{y}=(y_1, y_2, \dots, y_{1000}) \\
\mathbf{w} \sim \mathcal{N}(\mathbf{0}, \mathbf{I}) \\
y_i \sim {\rm Uniform}\{1, 2, \dots, 5\} \\
\mathbf{x}_i = \sigma\,\boldsymbol{\epsilon}_i + y_i\mathbf{w}, \quad \boldsymbol{\epsilon}_i \sim \mathcal{N}(\mathbf{0}, \mathbf{I}), \ \sigma = 5

There are 1000 samples of 50-dimensional data, and the goodness of fit takes one of the values {1, 2, 3, 4, 5}. Plotted in PCA space, the data look like the figure below.

pca_fig.png

Roughly speaking, the goodness of fit seems to be higher toward the upper left of the figure and lower toward the lower right.

Below is the training and evaluation code. At each step a pair of training examples is sampled at random and the model is updated by stochastic gradient descent (Adam is used as the optimizer). NDCG@100 was used to evaluate prediction accuracy.
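For reference, the NDCG@k computed by the ndcg() function below uses the exponential gain $2^{rel} - 1$; this is the standard definition, not something specific to this article:

{\rm DCG}@k = \sum_{i=1}^{k}\frac{2^{rel_i} - 1}{\log_2(i + 1)} \\
{\rm NDCG}@k = \frac{{\rm DCG}@k}{{\rm IDCG}@k}

Here $rel_i$ is the goodness of fit of the item placed at rank $i$ by the predicted scores, and ${\rm IDCG}@k$ is the ${\rm DCG}@k$ of the ideal ordering (items sorted in descending order of true goodness of fit).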

train_toy.py


import numpy as np
from chainer import Variable, optimizers
from sklearn.model_selection import train_test_split  # sklearn.cross_validation in older scikit-learn
import net
import matplotlib.pyplot as plt
import seaborn as sns


def make_dataset(n_dim, n_rank, n_sample, sigma):
    # goodness of fit: an integer in {1, ..., n_rank} for each sample
    ys = np.random.randint(1, n_rank + 1, size=n_sample)
    # a common "relevance direction" w; each sample is x_i = sigma * noise + y_i * w
    w = np.random.randn(n_dim)
    X = [sigma * np.random.randn(n_dim) + w * y for y in ys]
    X = np.array(X).astype(np.float32)
    ys = np.reshape(np.array(ys), (-1, 1))
    return X, ys


def ndcg(y_true, y_score, k=100):
    """NDCG@k with exponential gain (2^rel - 1)."""
    y_true = y_true.ravel()
    y_score = y_score.ravel()
    y_true_sorted = sorted(y_true, reverse=True)
    # DCG of the ideal ordering (items sorted by true goodness of fit)
    ideal_dcg = 0
    for i in range(k):
        ideal_dcg += (2 ** y_true_sorted[i] - 1.) / np.log2(i + 2)
    # DCG of the ordering induced by the predicted scores
    dcg = 0
    argsort_indices = np.argsort(y_score)[::-1]
    for i in range(k):
        dcg += (2 ** y_true[argsort_indices[i]] - 1.) / np.log2(i + 2)
    ndcg = dcg / ideal_dcg
    return ndcg

if __name__ == '__main__':
    np.random.seed(0)
    n_dim = 50
    n_rank = 5
    n_sample = 1000
    sigma = 5.
    X, ys = make_dataset(n_dim, n_rank, n_sample, sigma)
    X_train, X_test, y_train, y_test = train_test_split(X, ys, test_size=0.33)

    n_iter = 2000
    n_hidden = 20
    loss_step = 50
    N_train = np.shape(X_train)[0]

    model = net.RankNet(net.MLP(n_dim, n_hidden))
    optimizer = optimizers.Adam()
    optimizer.setup(model)

    train_ndcgs = []
    test_ndcgs = []
    for step in range(n_iter):
        i, j = np.random.randint(N_train, size=2)
        x_i = Variable(X_train[i].reshape(1, -1))
        x_j = Variable(X_train[j].reshape(1, -1))
        y_i = Variable(y_train[i])
        y_j = Variable(y_train[j])
        model.zerograds()
        loss = model(x_i, x_j, y_i, y_j)
        loss.backward()
        optimizer.update()
        if (step + 1) % loss_step == 0:
            train_score = model.predictor(Variable(X_train))
            test_score = model.predictor(Variable(X_test))
            train_ndcg = ndcg(y_train, train_score.data)
            test_ndcg = ndcg(y_test, test_score.data)
            train_ndcgs.append(train_ndcg)
            test_ndcgs.append(test_ndcg)
            print("step: {0}".format(step + 1))
            print("NDCG@100 | train: {0}, test: {1}".format(train_ndcg, test_ndcg))

    plt.plot(train_ndcgs, label="Train")
    plt.plot(test_ndcgs, label="Test")
    xx = np.arange(len(train_ndcgs))
    labels = np.arange(loss_step, n_iter + 1, loss_step)
    plt.xticks(xx, labels, rotation=45)
    plt.legend(loc="best")
    plt.xlabel("step")
    plt.ylabel("NDCG@10")
    plt.ylim(0, 1.1)
    plt.tight_layout()
    plt.savefig("result.pdf")

result.PNG

Since the training data and the test data are generated from the same distribution, this is only to be expected, but you can see that the evaluation values for both increase together.

Other

I want to make something interesting
