[PYTHON] Let's analyze the sentiment of tweets using Chainer (Part 1)

Correction history

**2016/7/18: Corrected an error in the formula for the size after filtering.**

Overview

In recent years, AI and deep learning have been drawing attention everywhere. The number of libraries has grown, and it has become easier to try these techniques out. So I will try to analyze the sentiment of tweets using Chainer, doing something similar to an article on Theano while using it as a reference.

However, Chainer is a little hard to use if you are accustomed to machine learning libraries such as sklearn, so I will try to address that as well.

Much of this is my own understanding, so there may be mistakes. I would appreciate it if you pointed them out.

Contents of Part 1

Difference between sklearn and Chainer
Sample MLP implementation on Chainer
Understand CNN in Chainer

Environment

Mac OSX Yosemite 10.10.15
Python 2.7
CPU Intel Core i5 2.6GHz
Memory 8GB

(Is this hardware good enough? I don't know.)

Preparation

Install Chainer

pip install chainer

sklearn and Chainer

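In sklearn, you could simply write

from sklearn.svm import SVC  # or: from sklearn.ensemble import RandomForestClassifier

model = SVC()  # or RandomForestClassifier()
model.fit(x_train, y_train)
y_p = model.predict(x_test)

and you were done.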

Here x_train is a matrix of size $N \times M$ and y_train is a label vector of length $N$ (with values such as 0, 1). $N$ is the sample size and $M$ is the number of features. x_test is the test data, with the same number of columns (that is, the same number of features) as x_train.

Chainer, on the other hand, has no methods like "fit" and "predict"; you have to write them yourself.

For example, a Multilayer Perceptron (MLP) could be implemented as follows.

First, the base class:

# -*- coding: utf-8 -*-

from chainer import FunctionSet, Variable, optimizers
from chainer import functions as F
from sklearn import base
from abc import ABCMeta, abstractmethod
import numpy as np
import six


class BaseChainerEstimator(base.BaseEstimator):
    __metaclass__= ABCMeta  # python 2.x
    def __init__(self, optimizer=optimizers.SGD(), n_iter=10000, eps=1e-5, report=100,
                 **params):
        self.network = self._setup_network(**params)
        self.optimizer = optimizer
        self.optimizer.setup(self.network.collect_parameters())
        self.n_iter = n_iter
        self.eps = eps
        self.report = report

    @abstractmethod
    def _setup_network(self, **params):
        return FunctionSet(l1=F.Linear(1, 1))

    @abstractmethod
    def forward(self, x, train=True):
        y = self.network.l1(x)
        return y

    @abstractmethod
    def loss_func(self, y, t):
        return F.mean_squared_error(y, t)

    @abstractmethod
    def output_func(self, h):
        return F.identity(h)

    def fit(self, x_data, y_data):
        batchsize = 100
        N = len(y_data)
        for loop in range(self.n_iter):
            perm = np.random.permutation(N)
            sum_accuracy = 0
            sum_loss = 0
            for i in six.moves.range(0, N, batchsize):
                x_batch = x_data[perm[i:i + batchsize]]
                y_batch = y_data[perm[i:i + batchsize]]
                x = Variable(x_batch)
                y = Variable(y_batch)
                self.optimizer.zero_grads()
                yp = self.forward(x)
                loss = self.loss_func(yp,y)
                loss.backward()
                self.optimizer.update()
                sum_loss += loss.data * len(y_batch)
                sum_accuracy += F.accuracy(yp,y).data * len(y_batch)
            if self.report > 0 and loop % self.report == 0:
                print('loop={}, train mean loss={} , train mean accuracy={}'.format(loop, sum_loss / N,sum_accuracy / N))

        return self

    def predict(self, x_data):
        x = Variable(x_data)
        y = self.forward(x,train=False)
        return self.output_func(y).data

class ChainerClassifier(BaseChainerEstimator, base.ClassifierMixin):
    def predict(self, x_data):
        return BaseChainerEstimator.predict(self, x_data).argmax(1)  # argmax returns the index of the largest value in each row, so class labels must be 0, 1, 2, ...

    def predict_proba(self,x_data):
        return BaseChainerEstimator.predict(self, x_data)

On top of that, the MLP class inherits from ChainerClassifier:

class MLP3L(ChainerClassifier):
    """
    3-Layer Perceptron
    """
    def _setup_network(self, **params):
        network = FunctionSet(
            l1=F.Linear(params["input_dim"], params["hidden_dim"]),
            l2=F.Linear(params["hidden_dim"], params["hidden_dim"]),
            l3=F.Linear(params["hidden_dim"], params["n_classes"]),
        )
        return network

    def forward(self, x, train=True):
        h1 = F.dropout(F.relu(self.network.l1(x)),train=train)
        h2 = F.dropout(F.relu(self.network.l2(h1)),train=train)
        y = self.network.l3(h2)
        return y

    def loss_func(self, y, t):
        return F.softmax_cross_entropy(y, t)

    def output_func(self, h):
        return F.softmax(h)

That completes the implementation.

Now you can use "fit" and "predict" (or "predict_proba") just like in sklearn.

It seems that x_data must be of type numpy.float32 and y_data of type numpy.int32. (Both are wrapped in Chainer's Variable inside fit.)
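As a minimal sketch of that dtype requirement, the arrays can be cast with NumPy before being passed to fit. The toy arrays x_raw and y_raw below are hypothetical and only serve to show the cast.

```python
import numpy as np

# Hypothetical toy data: 6 samples, 3 features, labels in {0, 1}.
x_raw = np.arange(18).reshape(6, 3)    # integer dtype by default
y_raw = np.array([0, 1, 0, 1, 1, 0])   # platform-dependent integer dtype

# Cast to the dtypes the estimator above seems to expect.
x_data = x_raw.astype(np.float32)
y_data = y_raw.astype(np.int32)

print(x_data.dtype, y_data.dtype)
```

After the cast, x_data and y_data keep their shapes and values; only the element type changes.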

In the case of the MLP above, x_data can be a matrix of size $N \times M$, just like in sklearn. However, problems arise as soon as you try to extend this to, for example, a convolutional neural network (CNN).

Since CNNs are mainly used in image processing, each input is two-dimensional, and once you add the batch (sample) dimension, x_data has to be three-dimensional. (There is also the concept of channels, so it is actually a 4D tensor.)

Reading the CNN sample in Chainer

I used the code here as a sample.

The MNIST images it uses are $28 \times 28$.

import chainer
import chainer.functions as F

model = chainer.FunctionSet(conv1=F.Convolution2D(1, 20, 5),
                            conv2=F.Convolution2D(20, 50, 5),
                            l1=F.Linear(800, 500),
                            l2=F.Linear(500, 10))
                            
def forward(x_data, y_data, train=True):
    x, t = chainer.Variable(x_data), chainer.Variable(y_data)
    h = F.max_pooling_2d(F.relu(model.conv1(x)), 2)
    h = F.max_pooling_2d(F.relu(model.conv2(h)), 2)
    h = F.dropout(F.relu(model.l1(h)), train=train)
    y = model.l2(h)
    if train:
        return F.softmax_cross_entropy(y, t)
    else:
        return F.accuracy(y, t)

Here is the reference for F.Convolution2D.

[Image: screenshot of the F.Convolution2D reference]

The first argument is in_channels, the second is out_channels, and the third is ksize (the filter size). in_channels would be 3 for RGB images, but here it is 1. out_channels is the number of output channels; my own interpretation is that 20 different images are produced by 20 different filters. Since ksize is 5, the filter is $5 \times 5$.
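To make the role of in_channels, out_channels, and ksize concrete, here is a rough NumPy sketch of a convolution without padding and with stride 1. The function conv2d_naive is a hypothetical helper written for illustration, not Chainer's implementation; it only shows how the shapes relate.

```python
import numpy as np

def conv2d_naive(x, w):
    """Naive cross-correlation, no padding, stride 1.

    x: (n, c_in, h, w_in), w: (c_out, c_in, k, k)
    returns: (n, c_out, h - k + 1, w_in - k + 1)
    """
    n, c_in, h, w_in = x.shape
    c_out, c_in_w, k, _ = w.shape
    assert c_in == c_in_w, "input channels of x and w must match"
    out_h, out_w = h - k + 1, w_in - k + 1
    y = np.zeros((n, c_out, out_h, out_w), dtype=x.dtype)
    for i in range(out_h):
        for j in range(out_w):
            patch = x[:, :, i:i + k, j:j + k]  # (n, c_in, k, k)
            # Contract channel and both filter axes: result is (n, c_out)
            y[:, :, i, j] = np.tensordot(patch, w, axes=([1, 2, 3], [1, 2, 3]))
    return y

# 2 grayscale 28x28 images, 20 filters of size 5x5 (in_channels=1, out_channels=20, ksize=5)
x = np.ones((2, 1, 28, 28), dtype=np.float32)
w = np.ones((20, 1, 5, 5), dtype=np.float32)
y = conv2d_naive(x, w)
print(y.shape)
```

Each of the 20 filters yields its own output map per image, which is consistent with the "20 different images" reading above.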

Feature size after convolution and pooling

(Start of the 2016/7/18 correction)

~~In the convolution, if the filter size is $F$ and the image size is $S \times S$, the image size after filtering is $S_f \times S_f$ when no padding is used. According to [the article](http://aidiary.hatenablog.com/entry/20151108/1446952402),~~

S_f = S - 2 × [F/2]

~~where $[\,]$ means truncating the fractional part.~~

**When I actually tried it, the result was different; in fact, the correct formula was written in Chainer's documentation.**

S_f = S - F + 1

This is the correct formula. It is the same as for a moving average. The previous formula works for odd filter sizes, but not for even ones.
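The difference between the two formulas can be checked with a few lines of Python. The function names size_old and size_valid are hypothetical labels for the struck-out formula and the corrected one.

```python
def size_old(s, f):
    """Struck-out formula: S - 2 * [F / 2], with [] truncating."""
    return s - 2 * (f // 2)

def size_valid(s, f):
    """Corrected formula: S - F + 1 (no padding, stride 1)."""
    return s - f + 1

# Compare both formulas for odd and even filter sizes on a 28x28 image.
for f in (3, 4, 5, 6):
    print(f, size_old(28, f), size_valid(28, f))
```

The two agree for the odd sizes 3 and 5 and differ by one for the even sizes 4 and 6.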

Also, in pooling, the handling of edges differs between max pooling and average pooling. When I tried it, average pooling could not be computed when the target size is not divisible by the pooling size, whereas max pooling could. You have to be careful about this.
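One way to picture the remainder issue is to compare two rounding conventions for the pooled size. This is only an illustrative sketch: the helper names are hypothetical, and which convention each Chainer pooling function actually follows is an assumption that should be checked against the documentation.

```python
def pool_size_floor(s, k):
    """Pooled size when a leftover edge is dropped (window and stride both k)."""
    return (s - k) // k + 1

def pool_size_ceil(s, k):
    """Pooled size when a leftover edge still gets its own (partial) window."""
    return (s - 1) // k + 1

for s in (24, 25):
    print(s, pool_size_floor(s, 2), pool_size_ceil(s, 2))
```

For a size of 24 both conventions give 12; for 25 they give 12 and 13, so the two only differ when there is a remainder.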

(End of the 2016/7/18 correction)

In this example, the size after the first convolution is

S_{f1} = 28 - 5 + 1 = 24

The forward function then applies max pooling, so the size after pooling, $S_{p1} \times S_{p1}$, is given by

S_{p1} = 24 / 2 = 12

After the second convolution,

S_{f2} = 12 - 5 + 1 = 8

and max pooling is applied again in the forward function, so the size after pooling, $S_{p2} \times S_{p2}$, is given by

S_{p2} = 8 / 2 = 4

Since the number of output channels is 50, the dimension of the features that become the final input is

M = 50 × 4 × 4 = 800

which matches the first argument of the first linear layer:

l1=F.Linear(800, 500)

(If you get this wrong, Chainer seems to tell you the correct value.)
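The whole size calculation can be traced in a short script. The helpers conv_size and pool_size are hypothetical names; pool_size assumes the size divides evenly by the pooling size, as it does here.

```python
def conv_size(s, f):
    """Size after convolution with filter size f (no padding, stride 1)."""
    return s - f + 1

def pool_size(s, k):
    """Size after k x k pooling, assuming s is divisible by k."""
    assert s % k == 0
    return s // k

s = 28                # MNIST image size
s = conv_size(s, 5)   # conv1: 24
s = pool_size(s, 2)   # max pooling: 12
s = conv_size(s, 5)   # conv2: 8
s = pool_size(s, 2)   # max pooling: 4
m = 50 * s * s        # 50 output channels of conv2
print(s, m)
```

If the filter size or the number of channels changes, the first argument of l1 has to be recomputed the same way.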

Preparation before calling forward

After defining the model, we pass x_data to the forward function, but there is one more issue: according to the reference below, the input to a convolution must be a 4D tensor. (See x under Parameters.)

[Image: screenshot of the reference describing the input x]

$n$ is the batch size (sample size), $c_I$ is the number of channels, and $h$ and $w$ are the height and width of the image.

In the sample code above, the data is converted to a 4D tensor using reshape, as shown below.

X_train = X_train.reshape((len(X_train), 1, 28, 28))
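As a small check of what this reshape does, here is a NumPy sketch. It assumes each row of the flat array holds one 28 x 28 image as 784 values in row-major order; the array X_flat is hypothetical toy data.

```python
import numpy as np

n = 3
# Hypothetical flat data: n images, each stored as 784 values.
X_flat = np.arange(n * 784, dtype=np.float32).reshape(n, 784)

# Same call as in the sample: (n, 784) -> (n, channels=1, height=28, width=28)
X_4d = X_flat.reshape((len(X_flat), 1, 28, 28))
print(X_4d.shape)

# The pixel at row r, column c of image i comes from flat index r * 28 + c.
i, r, c = 2, 5, 7
print(X_4d[i, 0, r, c] == X_flat[i, r * 28 + c])
```

The reshape does not copy or reorder the pixel values; it only changes how the same data is indexed.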

This time, I wanted to reshape data that was already a Variable, and when I looked it up, I found that the same operation is defined as a Chainer function.

[Image: screenshot of the Chainer reference for that function]

I will use this.

Next time

Examine the characteristics of EmbedID
Convolutional neural network in natural language processing