Collaborative filtering, used in recommender systems, is an application of machine learning, but it differs slightly from the plain "supervised learning" pattern. I looked through some literature and information on the Internet, but found it hard to grasp the concept and system configuration right away. Then I remembered that it is covered in Coursera Machine Learning (week 9) by Prof. Andrew Ng, and by reviewing those course materials I was able to deepen my understanding of collaborative filtering.
As those who have taken the course will know, MATLAB / Octave is used in its programming exercises. This time, I implemented collaborative filtering (as a prototype) in Python: first I verified the approach with NumPy + SciPy (optimize), and then implemented it with TensorFlow.
The following is quoted from the Data Scientist Training Reader, Introduction to Machine Learning:
Collaborative filtering is a technique for realizing presentations such as the following on e-commerce (EC) sites:
-- "Customers who bought this product also bought that product."
-- "Customers who buy the same products as you also buy these products."
The implementation can come once the concept is understood, so let me first review the explanation from Coursera Machine Learning, which develops the topic using a movie-rating example.
Fig.1 Movie Rating Data
The table shows each user's ratings of the movies; the ratings vary according to users' preferences. For each movie we prepare data [X]: parameters measuring different aspects of its content (called "features" in the Coursera course), such as "strongly romantic" or "heavy on action". On the other hand, a parameter [Theta] representing each user's preferences is also needed. As shown in the figure, the rating data form a two-dimensional matrix [Y] of (number of movies) x (number of users), but many of its entries are undefined, since users do not rate every movie. The main point of the problem is to fill in these missing entries by regression.
The idea is that if this incomplete rating matrix can be completed accurately, the downstream task of recommending similar movies can be done well.
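To make the setup concrete, here is a toy example of my own (not from the course materials) showing the shapes of Y and R for 4 movies and 3 users:
import numpy as np

# Toy rating matrix: 4 movies x 3 users; 0 means "not rated yet"
Y = np.array([[5, 4, 0],
              [0, 3, 0],
              [4, 0, 5],
              [0, 2, 4]])
# R marks which entries of Y have actually been rated
R = (Y != 0).astype(int)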
As with the other exercises in the Coursera course, we start from the cost function. Since there are two sets of parameters, it is a fairly "tough" expression.
\begin{eqnarray}
J(x^{(1)},\dots,x^{(n_m)},\theta^{(1)},\dots,\theta^{(n_u)}) &=& \frac{1}{2} \sum_{(i,j):r(i,j)=1}
\left( ({\theta}^{(j)})^T x^{(i)} - y^{(i,j)} \right)^2 \\
&& +\ \frac{\lambda}{2} \sum_{j=1}^{n_u} \sum_{k=1}^{n} ({\theta}_k^{(j)})^2
+ \frac{\lambda}{2} \sum_{i=1}^{n_m} \sum_{k=1}^{n} ({x}_k^{(i)})^2
\end{eqnarray}
The cost function of the parameters [X] and [Theta] consists of the squared error between the predicted ratings, given by the matrix product of the parameters, and the actual rating data (the labels), plus regularization terms for the parameters. Converted into Python code, it looks like this:
import numpy as np

def cofi_costfunc(Y, R, shapes, lambda_c, params):
    # Returns the cost for the collaborative filtering problem
    # (ported from COFICOSTFUNC in the Coursera exercise).
    num_users = shapes[0]
    num_movies = shapes[1]
    num_features = shapes[2]
    # params holds X and Theta flattened into a single vector
    boundary = num_movies * num_features
    params_c1 = np.copy(params[:boundary])
    params_c2 = np.copy(params[boundary:])
    X = params_c1.reshape((num_movies, num_features))
    Theta = params_c2.reshape((num_users, num_features))
    # squared error over rated entries (R == 1) plus L2 regularization
    Jcost = 0.5 * (((np.dot(X, Theta.T) - Y) * R) ** 2).sum() \
        + 0.5 * lambda_c * ((X.ravel() ** 2).sum() + (Theta.ravel() ** 2).sum())
    return Jcost
The parameters X (movie content parameters) and Theta (user preference parameters) are passed in as a single flattened vector params. Y is the rating data in the form (number of movies) x (number of users). R is binary data of the same size as Y, set to 1 where a rating exists and 0 where there is no rating yet (stored as an integer type in the program). After this preprocessing, the final statement computes the cost corresponding to the cost function formula above.
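To make the calling convention concrete, here is a minimal usage sketch (the toy shapes are my own choice; it reuses the toy Y and R from above):
num_movies, num_users, num_features = 4, 3, 2
shapes = (num_users, num_movies, num_features)   # order expected by cofi_costfunc
X = np.random.rand(num_movies, num_features)     # movie content parameters
Theta = np.random.rand(num_users, num_features)  # user preference parameters
params = np.concatenate((X.ravel(), Theta.ravel()))   # one flattened vector
print(cofi_costfunc(Y, R, shapes, lambda_c=1.5, params=params))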
Next are the equations for the partial derivatives (gradients) of this cost function with respect to the parameters (x, theta).
\begin{eqnarray}
\frac{\partial J}{\partial x_k^{(i)}} &=& \sum_{j:r(i,j)=1} \left( ({\theta}^{(j)})^T x^{(i)}
- y^{(i,j)} \right) {\theta}_k^{(j)} + \lambda x_k^{(i)} \\
\frac{\partial J}{\partial {\theta}_k^{(j)}} &=& \sum_{i:r(i,j)=1} \left( ({\theta}^{(j)})^T x^{(i)}
- y^{(i,j)} \right) {x}_k^{(i)} + \lambda {\theta}_k^{(j)}
\end{eqnarray}
These are also "tough" formulas, but they become easier to understand once converted into code.
def cofi_costfunc_grad(Y, R, shapes, lambda_c, params):
    # Returns the gradient of the cost for the collaborative
    # filtering problem (ported from COFICOSTFUNC in the Coursera exercise).
    num_users = shapes[0]
    num_movies = shapes[1]
    num_features = shapes[2]
    # params holds X and Theta flattened into a single vector
    boundary = num_movies * num_features
    params_c1 = np.copy(params[:boundary])
    params_c2 = np.copy(params[boundary:])
    X = params_c1.reshape((num_movies, num_features))
    Theta = params_c2.reshape((num_users, num_features))
    # prediction error on rated entries only (R == 1)
    Tgsub = (np.dot(X, Theta.T) - Y) * R
    # gradients w.r.t. X and Theta, with regularization terms
    X_grad = np.dot(Tgsub, Theta) + lambda_c * X
    Theta_grad = np.dot(Tgsub.T, X) + lambda_c * Theta
    Jgrad = np.concatenate((X_grad.ravel(), Theta_grad.ravel()))
    return Jgrad
The gradient of the cost is computed in the last four statements.
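As in the Coursera exercise, it is worth sanity-checking the analytic gradient against a numerical one before optimizing. A minimal sketch (the helper check_gradient is my own, not part of the original code):
def check_gradient(Y, R, shapes, lambda_c, params, eps=1.e-4, n_checks=5):
    # compare the analytic gradient with a central finite difference
    # at a few randomly chosen parameter indices
    grad = cofi_costfunc_grad(Y, R, shapes, lambda_c, params)
    for idx in np.random.choice(len(params), n_checks, replace=False):
        p_plus = params.copy(); p_plus[idx] += eps
        p_minus = params.copy(); p_minus[idx] -= eps
        num_grad = (cofi_costfunc(Y, R, shapes, lambda_c, p_plus)
                    - cofi_costfunc(Y, R, shapes, lambda_c, p_minus)) / (2 * eps)
        print('idx {:5d}: analytic {: .6f}, numerical {: .6f}'.format(
              idx, grad[idx], num_grad))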
In Coursera's exercise, the cost-minimization optimization is performed with the function fmincg.m provided with the course materials. In Python, the same can be done with scipy.optimize.minimize().
import scipy.optimize as spopt

# Set the initial values of the parameters
theta_ini = np.concatenate((X.flatten(), Theta.flatten()))

# Prepare wrapper functions
def compute_cost_sp(theta):
    global Y, R, shapes, lambda_c
    j = cofi_costfunc(Y, R, shapes, lambda_c, params=theta)
    return j

def compute_grad_sp(theta):
    global Y, R, shapes, lambda_c
    j_grad = cofi_costfunc_grad(Y, R, shapes, lambda_c, params=theta)
    return j_grad

# Options for scipy.optimize.minimize
options = {'gtol': 1.e-6, 'disp': True}
# Regularization coefficient
lambda_c = 1.5
res1 = spopt.minimize(compute_cost_sp, theta_ini, method='CG',
                      jac=compute_grad_sp, options=options)
theta = res1.x

# Decompose the parameter vector into X and Theta
X = theta[:num_movies*num_features].reshape((num_movies, num_features))
Theta = theta[num_movies*num_features:].reshape((num_users, num_features))
For details on scipy.optimize.minimize, see its documentation; various optimization methods can be selected. This time I set method='CG' (the conjugate gradient method), following the original material (fmincg.m).
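The returned OptimizeResult can be inspected to check how the optimization went, for example:
print(res1.success)   # True if the optimizer terminated successfully
print(res1.fun)       # final value of the cost function
print(res1.nit)       # number of iterations performed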
Extracting the necessary information from the parameters [X] and [Theta] obtained by this cost minimization (e.g. "people who rated this movie highly also rated that movie highly") is the job of the recommender system. (There seem to be various ways to measure similarity, but I will not go into them here.)
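For reference only, a minimal sketch of one simple similarity measure, the one suggested in the Coursera exercise: ranking movies by the Euclidean distance between their learned feature vectors in X (the helper similar_movies is my own naming):
def similar_movies(X, i, k=5):
    # movies whose feature vectors are closest to that of movie i
    dists = np.sqrt(((X - X[i]) ** 2).sum(axis=1))
    return np.argsort(dists)[1:k+1]   # index 0 is movie i itself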
It has been about a year since TensorFlow was released (in 2015). I remember excitedly writing code and hurriedly posting it to Qiita back then. This time, I decided to use TensorFlow to run the collaborative filtering training at higher speed.
In ordinary "supervised learning" the input-output relationship is clear, so there is little confusion when mapping a problem onto TensorFlow's graph representation, but this time it took a little thought:
-- What corresponds to the weights w and biases b in an ordinary problem: here, the movie content parameters x and the user preference parameters Theta.
-- What corresponds to the teacher data in an ordinary problem: here, the rating information given by users in advance. Specifically, the data Y (a (number of movies) x (number of users) matrix containing the ratings users gave to movies) and R (binary data indicating whether a user has rated a movie or not).
Therefore, the code for variable preparation is as follows.
import numpy as np
import tensorflow as tf

# Parameter definition
if WARM_START:
    data_param = load_data(fn='./data/ex8_movieParams.mat')
    X_np = data_param['X'].astype(np.float32)
    Theta_np = data_param['Theta'].astype(np.float32)
    x = tf.Variable(X_np)
    theta = tf.Variable(Theta_np)
else:   # COLD START
    x = tf.Variable(tf.random_normal([num_movies, num_features],
                                     mean=0.0, stddev=0.05))
    theta = tf.Variable(tf.random_normal([num_users, num_features],
                                         mean=0.0, stddev=0.05))

# Placeholders
Y_ph = tf.placeholder(tf.float32, shape=[None, num_users])
R_ph = tf.placeholder(tf.float32, shape=[None, num_users])
Since parameter data that had been pre-trained to some extent was provided with Coursera's programming exercises, I prepared a 'WARM_START' option that uses it, alongside an option to start from scratch. With 'WARM_START', the numpy.array data is passed to tf.Variable() as is; with a cold start, small random values are generated and assigned to x and theta. The placeholders for the teacher data are given the required shape. (shape=[None, num_users] could equally be shape=[num_movies, num_users], but following TensorFlow convention the first dimension is set to None.)
Next, we define the calculation of the cost function.
def Ypred(X, Theta):
    '''
      calculate rating with parameter X and Theta
        args.:
          X:     movie contents parameter
          Theta: user characteristic parameter
    '''
    feat_dim1 = tf.shape(X)[1]
    feat_dim2 = tf.shape(Theta)[1]
    tf.assert_equal(feat_dim1, feat_dim2)   # feature dimensions must match
    rating = tf.matmul(X, tf.transpose(Theta))
    return rating

# Cost Function, etc.
cost = tf.reduce_sum(((Ypred(x, theta) - Y_ph) * R_ph) ** 2)
L2_sqr = tf.nn.l2_loss(x) + tf.nn.l2_loss(theta)
lambda_c = 1.0   # L2 norm coefficient
loss = cost + lambda_c * L2_sqr
A convenient aspect of Deep Learning frameworks, TensorFlow included, is that they compute gradients automatically. It feels "great" to only have to define the cost function, as above. After this, as usual, we set up the optimizer and write the training loop.
import scipy.io as sio

lr = 1.e-5
train_step = tf.train.GradientDescentOptimizer(lr).minimize(loss)
init_op = tf.initialize_all_variables()

# Train
with tf.Session() as sess:
    sess.run(init_op)
    for i in range(10001):
        sess.run(train_step, feed_dict={Y_ph: Y_np, R_ph: R_np})
        if i % 1000 == 0:
            loss_mon = loss.eval({Y_ph: Y_np, R_ph: R_np})
            print(' step, loss = {:6d}: {:10.1f}'.format(i, loss_mon))
    # evaluate ranking with final parameters
    ymat = Ypred(x, theta).eval()
    sio.savemat('./ymat_tf.mat', {'Y': ymat})
    print('ymat is saved to "ymat_tf.mat".')
You could use your favorite optimizer instead of tf.train.GradientDescentOptimizer(), but since this is a Coursera problem I tried the basic one. Adjusting the learning rate takes some time, but the calculation completed without problems (and ran fast with GPU computation).
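For example, swapping in tf.train.AdamOptimizer is a one-line change (a sketch; the learning rate here is my guess and would need retuning):
lr = 1.e-3   # assumption: Adam usually tolerates a larger learning rate
train_step = tf.train.AdamOptimizer(lr).minimize(loss)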
Since we are solving with the same algorithm, the result from scipy.optimize.minimize and the result from TensorFlow should be similar. Here, I computed the rating matrix Y from the obtained parameters (X, Theta) and rendered it as the heatmaps below. (Blue: low rating ~ red: high rating)
By scipy.optimize.minimize
By TensorFlow
Differences can be seen in the details, such as shades of color, but the large-scale patterns are similar. There appears to be no problem with the calculation process.
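For reference, a minimal matplotlib sketch of how such a heatmap can be drawn from the .mat file saved above (the axis labels are my own):
import scipy.io as sio
import matplotlib.pyplot as plt

ymat = sio.loadmat('./ymat_tf.mat')['Y']
plt.imshow(ymat, aspect='auto', cmap='jet')   # blue: low rating, red: high rating
plt.xlabel('user')
plt.ylabel('movie')
plt.colorbar()
plt.show()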
To close the gap between the two results, one would have to fine-tune details such as the regularization, the learning rate, and the convergence criteria. The content above is fine for the purpose of "learning collaborative filtering", but to compete in contests such as Kaggle, many more details would have to be examined.
(The programming environment used this time: Python 3.5.2, NumPy 1.11.2, SciPy 0.18.1, TensorFlow 0.11.0.) (The code has been uploaded to Gist: https://gist.github.com/tomokishii/2e9f83d391df82f25942c1310b41e0b5)
-- Coursera, Machine Learning, by Prof. Andrew Ng (a manual/PDF for the programming assignments is distributed each week; the lecture videos and the files for the programming exercises remain accessible on the site even after the course ends)
-- Data Scientist Training Reader, Introduction to Machine Learning (Part 2, Special Feature 3), "Introduction to Recommender Systems". http://gihyo.jp/book/2015/978-4-7741-7631-4
-- "Implementing item-based collaborative filtering in Python -- using MovieLens as an example". Qiita. http://qiita.com/kotaroito/items/6acb58bb16b68a460af9
-- "Practical Machine Learning Systems", O'Reilly Japan, Chapter 8, "Regression: Improving Recommendations". http://www.oreilly.co.jp/books/9784873116983/