Convolutional Neural Networks (CNNs) are widely used in image recognition, and recently they have also been applied to natural language processing [Kim, EMNLP 2014]. In this post, I built a simple CNN with Chainer and applied it to a document classification task.
- Source code: ichiroex@github
- Data used in the experiment: the "sentence polarity dataset v1.0" (available here; a direct download link is also provided)
- Environment: Chainer, scikit-learn, and gensim installed; the trained word2vec model (GoogleNews-vectors-negative300.bin.gz) downloaded
- Python 2.7
The input documents are vectorized with word2vec, and the convolution is applied to the resulting document vectors.
Document vectorization is done by `load_data(fname)`, defined in `util.py`.
Intuitively, given an input word sequence (document) $x_1, x_2, x_3, \ldots, x_n$, each word $x_i$ is converted into a fixed $N$-dimensional vector, and these word vectors are lined up to form a two-dimensional document vector.
[Example]
Since the sentence length differs from document to document, padding is applied to match the maximum sentence length $maxlen$ among the input documents. The generated two-dimensional document vector therefore has $N \times maxlen$ dimensions.
In the word2vec model published by Google that is used here, each word vector has 300 dimensions.
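As a rough illustration of this shape, here is a minimal sketch with made-up tiny dimensions rather than the real $N = 300$, and a hypothetical `<pad>` token: per-word vectors are stacked row by row and a short document is padded up to $maxlen$.

```python
import numpy as np

# Toy sketch: N = 4 dimensions per word, maxlen = 5 words (the real values are 300 and 59).
N, maxlen = 4, 5
vocab = ['this', 'movie', 'was', 'great', '<pad>']
lookup = {w: np.random.rand(N) for w in vocab}    # stand-in for model[word]

doc = ['this', 'movie', 'was', 'great']           # a 4-word document
doc = doc + ['<pad>'] * (maxlen - len(doc))       # pad up to maxlen
doc_matrix = np.vstack([lookup[w] for w in doc])  # one row per word
print doc_matrix.shape                            # (5, 4), i.e. (maxlen, N)
```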
util.py
```python
from gensim.models import word2vec
import numpy as np


def load_data(fname):
    # Load the trained word2vec model
    model = word2vec.Word2Vec.load_word2vec_format('GoogleNews-vectors-negative300.bin', binary=True)

    target = []  # labels
    source = []  # document vectors

    # Build the document list
    document_list = []
    for l in open(fname, 'r').readlines():
        sample = l.strip().split(' ', 1)
        label = sample[0]
        target.append(label)                     # label
        document_list.append(sample[1].split())  # word list for each document

    max_len = 0
    rev_document_list = []  # document list after unknown-word handling
    for doc in document_list:
        rev_doc = []
        for word in doc:
            try:
                word_vec = np.array(model[word])  # raises KeyError for unknown words
                rev_doc.append(word)
            except KeyError:
                rev_doc.append('<unk>')  # unknown word
        rev_document_list.append(rev_doc)
        # Track the maximum document length (used for padding)
        if len(rev_doc) > max_len:
            max_len = len(rev_doc)

    # Pad every document to the same length
    rev_document_list = padding(rev_document_list, max_len)

    width = 0  # dimension of each word vector
    # Turn each document into a feature vector
    for doc in rev_document_list:
        doc_vec = []
        for word in doc:
            try:
                vec = model[word.decode('utf-8')]
            except KeyError:
                vec = model.seeded_vector(word)
            doc_vec.extend(vec)
            width = len(vec)
        source.append(doc_vec)

    dataset = {}
    dataset['target'] = np.array(target)
    dataset['source'] = np.array(source)

    return dataset, max_len, width
```
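The `padding` helper called above is not part of this excerpt (it lives in the full `util.py` on GitHub). A minimal sketch of what it might look like, assuming a simple `<pad>` token, is:

```python
def padding(document_list, max_len, pad_token='<pad>'):
    # Append pad_token to each document until every document has max_len words.
    return [doc + [pad_token] * (max_len - len(doc)) for doc in document_list]
```

Any out-of-vocabulary token works here, because the feature-vectorization loop above falls back to `model.seeded_vector(word)` when a word is not in the word2vec vocabulary.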
The model consists of a convolution layer -> pooling layer -> fully connected layer, with dropout applied along the way.
net.py
```python
from chainer import Chain
import chainer.functions as F
import chainer.links as L


class SimpleCNN(Chain):

    def __init__(self, input_channel, output_channel, filter_height, filter_width, mid_units, n_units, n_label):
        super(SimpleCNN, self).__init__(
            conv1=L.Convolution2D(input_channel, output_channel, (filter_height, filter_width)),
            l1=L.Linear(mid_units, n_units),
            l2=L.Linear(n_units, n_label),
        )

    # Called via L.Classifier
    def __call__(self, x):
        h1 = F.max_pooling_2d(F.relu(self.conv1(x)), 3)
        h2 = F.dropout(F.relu(self.l1(h1)))
        y = self.l2(h2)
        return y
```
During training, the accuracy and loss on both the training data and the test data are computed and displayed at each epoch. The code is almost the same as in the earlier post on document classification using a feedforward neural network.
The convolution filter size is 3 x 300 (300 being the dimension of each word vector).
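This also explains the `mid_units` value of 950 passed to `SimpleCNN` below: with an input of height 59 (the `maxlen` of this run) and width 300, a 3 x 300 filter yields a 57 x 1 feature map per output channel, and 3 x 3 max pooling with Chainer's defaults (stride equal to the pooling size, `cover_all=True`) shrinks it to 19 x 1, so the fully connected layer sees 50 * 19 * 1 = 950 units. A quick sanity check (assuming height 59, as in the log below):

```python
# Derive the input size of the first fully connected layer (mid_units).
height, width = 59, 300                   # maxlen x word-vector dimension
filter_height, filter_width = 3, 300
output_channel, pool = 50, 3

conv_h = height - filter_height + 1       # 57
conv_w = width - filter_width + 1         # 1
pool_h = -(-(conv_h - pool) // pool) + 1  # ceil((57 - 3) / 3) + 1 = 19 (cover_all=True)
pool_w = -(-(conv_w - pool) // pool) + 1  # 1
print output_channel * pool_h * pool_w    # 950
```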
train_cnn.py
```python
# Excerpt from train_cnn.py: the argparse setup that produces `args`
# (args.data, args.gpu, args.nunits, args.batchsize, args.epoch) is omitted here.
import sys

import numpy as np
import six
from sklearn.cross_validation import train_test_split

import chainer
import chainer.links as L
from chainer import cuda, optimizers

import util
from net import SimpleCNN

# Prepare dataset
dataset, height, width = util.load_data(args.data)

print 'height:', height
print 'width:', width

dataset['source'] = dataset['source'].astype(np.float32)  # features
dataset['target'] = dataset['target'].astype(np.int32)    # labels

x_train, x_test, y_train, y_test = train_test_split(dataset['source'], dataset['target'], test_size=0.15)
N_test = y_test.size         # test data size
N = len(x_train)             # train data size
in_units = x_train.shape[1]  # number of units in the input layer

# Reshape into a 4-dimensional tensor: (nsample, channel, height, width)
input_channel = 1
x_train = x_train.reshape(len(x_train), input_channel, height, width)
x_test = x_test.reshape(len(x_test), input_channel, height, width)

# Number of hidden-layer units
n_units = args.nunits
n_label = 2
filter_height = 3
output_channel = 50

# Model definition
model = L.Classifier(SimpleCNN(input_channel, output_channel, filter_height, width, 950, n_units, n_label))

# Whether to use the GPU
if args.gpu > 0:
    cuda.check_cuda_available()
    cuda.get_device(args.gpu).use()
    model.to_gpu()
xp = np if args.gpu <= 0 else cuda.cupy  # args.gpu <= 0: use CPU, otherwise: use GPU

batchsize = args.batchsize
n_epoch = args.epoch

# Setup optimizer
optimizer = optimizers.AdaGrad()
optimizer.setup(model)

# Learning loop
for epoch in six.moves.range(1, n_epoch + 1):
    print 'epoch', epoch, '/', n_epoch

    # training
    perm = np.random.permutation(N)  # random permutation of the training indices
    sum_train_loss = 0.0
    sum_train_accuracy = 0.0
    for i in six.moves.range(0, N, batchsize):
        # Select a minibatch from x_train and y_train via perm (different data every epoch)
        x = chainer.Variable(xp.asarray(x_train[perm[i:i + batchsize]]))  # source
        t = chainer.Variable(xp.asarray(y_train[perm[i:i + batchsize]]))  # target

        optimizer.update(model, x, t)

        sum_train_loss += float(model.loss.data) * len(t.data)          # for the mean loss
        sum_train_accuracy += float(model.accuracy.data) * len(t.data)  # for the mean accuracy

    print('train mean loss={}, accuracy={}'.format(sum_train_loss / N, sum_train_accuracy / N))

    # evaluation
    sum_test_loss = 0.0
    sum_test_accuracy = 0.0
    for i in six.moves.range(0, N_test, batchsize):
        # all test data
        x = chainer.Variable(xp.asarray(x_test[i:i + batchsize]))
        t = chainer.Variable(xp.asarray(y_test[i:i + batchsize]))

        loss = model(x, t)
        sum_test_loss += float(loss.data) * len(t.data)
        sum_test_accuracy += float(model.accuracy.data) * len(t.data)

    print(' test mean loss={}, accuracy={}'.format(sum_test_loss / N_test, sum_test_accuracy / N_test))

    if epoch > 10:
        optimizer.lr *= 0.97
        print 'learning rate: ', optimizer.lr

    sys.stdout.flush()
```
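The training log ends with "save the model" / "save the optimizer", but the corresponding lines are not part of the excerpt above. A minimal sketch of that step with Chainer's serializers (the file names are assumptions) would be:

```python
from chainer import serializers

# Save the trained model and the optimizer state (file names are assumptions).
print 'save the model'
serializers.save_npz('cnn_model.npz', model)
print 'save the optimizer'
serializers.save_npz('cnn_optimizer.npz', optimizer)
```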
The final accuracy was `accuracy = 0.775624996424`. [When classified with a feedforward neural network](http://qiita.com/ichiroex/items/9aa0bcada0b5bf6f9e1c) it was `accuracy = 0.716875001788`, so the accuracy improved considerably.
(Note that the feedforward experiment did not use word2vec; its document vectors were built from one-hot word vectors, so the experimental conditions differ.)
```
height: 59
width: 300
epoch 1 / 100
train mean loss=0.68654858897, accuracy=0.584814038988
test mean loss=0.673290403187, accuracy=0.674374999106
epoch 2 / 100
train mean loss=0.653146019086, accuracy=0.678733030628
test mean loss=0.626838338375, accuracy=0.695624998212
epoch 3 / 100
train mean loss=0.604344114544, accuracy=0.717580840894
test mean loss=0.582373640686, accuracy=0.713124997914
...
epoch 98 / 100
train mean loss=0.399981137426, accuracy=0.826288489978
test mean loss=0.460177404433, accuracy=0.775625003874
learning rate: 6.85350312961e-05
epoch 99 / 100
train mean loss=0.400466494895, accuracy=0.822536144887
test mean loss=0.464013618231, accuracy=0.773749999702
learning rate: 6.64789803572e-05
epoch 100 / 100
train mean loss=0.399539747416, accuracy=0.824081227461
test mean loss=0.466326575726, accuracy=0.775624996424
learning rate: 6.44846109465e-05
save the model
save the optimizer
```
I tried document classification (positive/negative classification) using a convolutional neural network. Even with this simple model, the accuracy is reasonable.
I have also implemented Yoon Kim's model with Chainer, so I will post an article about it next.
References
- Yoon Kim. Convolutional Neural Networks for Sentence Classification. EMNLP 2014.