Convolutional Neural Networks (CNNs) are widely used in image recognition, and recently they have also been applied to natural language processing [Kim, EMNLP 2014]. In this post, I built a simple CNN with Chainer and applied it to a document classification task.
- Source code: ichiroex@github
- Data used in the experiment: the "sentence polarity dataset v1.0" (available here; a direct download link is also provided)
- Environment: Chainer, scikit-learn, and gensim installed; the trained word2vec model (GoogleNews-vectors-negative300.bin.gz) downloaded
- Python 2.7
The input documents are vectorized with word2vec, and the convolution is applied to the resulting document vectors.
Document vectorization is done by `load_data(fname)`, defined in `util.py`.
Intuitively, given an input word sequence (document) $x_1, x_2, x_3, \ldots, x_n$, each word $x_i$ is converted into a fixed $N$-dimensional vector, and these word vectors are lined up to form a two-dimensional document vector.
[Example]
Since the sentence length differs from document to document, padding is applied to match the maximum sentence length $maxlen$ among the input documents. The generated two-dimensional document vector therefore has $N \times maxlen$ dimensions.
In the word2vec model published by Google that is used here, each word vector has 300 dimensions.
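As a rough illustration of this shape, here is a minimal sketch with made-up tiny dimensions rather than the real $N = 300$, and a hypothetical `<pad>` token: per-word vectors are stacked row by row and a short document is padded up to $maxlen$.

```python
import numpy as np

# Toy sketch: N = 4 dimensions per word, maxlen = 5 words (the real values are 300 and 59).
N, maxlen = 4, 5
vocab = ['this', 'movie', 'was', 'great', '<pad>']
lookup = {w: np.random.rand(N) for w in vocab}    # stand-in for model[word]

doc = ['this', 'movie', 'was', 'great']           # a 4-word document
doc = doc + ['<pad>'] * (maxlen - len(doc))       # pad up to maxlen
doc_matrix = np.vstack([lookup[w] for w in doc])  # one row per word
print doc_matrix.shape                            # (5, 4), i.e. (maxlen, N)
```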
util.py
```python
from gensim.models import word2vec
import numpy as np


def load_data(fname):
    # Load the trained word2vec model
    model = word2vec.Word2Vec.load_word2vec_format('GoogleNews-vectors-negative300.bin', binary=True)

    target = []  # labels
    source = []  # document vectors

    # Build the document list
    document_list = []
    for l in open(fname, 'r').readlines():
        sample = l.strip().split(' ', 1)
        label = sample[0]
        target.append(label)                     # label
        document_list.append(sample[1].split())  # word list for each document

    max_len = 0
    rev_document_list = []  # document list after unknown-word handling
    for doc in document_list:
        rev_doc = []
        for word in doc:
            try:
                word_vec = np.array(model[word])  # raises KeyError for unknown words
                rev_doc.append(word)
            except KeyError:
                rev_doc.append('<unk>')  # unknown word
        rev_document_list.append(rev_doc)
        # Track the maximum document length (used for padding)
        if len(rev_doc) > max_len:
            max_len = len(rev_doc)

    # Pad every document to the same length
    rev_document_list = padding(rev_document_list, max_len)

    width = 0  # dimension of each word vector
    # Turn each document into a feature vector
    for doc in rev_document_list:
        doc_vec = []
        for word in doc:
            try:
                vec = model[word.decode('utf-8')]
            except KeyError:
                vec = model.seeded_vector(word)
            doc_vec.extend(vec)
            width = len(vec)
        source.append(doc_vec)

    dataset = {}
    dataset['target'] = np.array(target)
    dataset['source'] = np.array(source)

    return dataset, max_len, width
```
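The `padding` helper called above is not part of this excerpt (it lives in the full `util.py` on GitHub). A minimal sketch of what it might look like, assuming a simple `<pad>` token, is:

```python
def padding(document_list, max_len, pad_token='<pad>'):
    # Append pad_token to each document until every document has max_len words.
    return [doc + [pad_token] * (max_len - len(doc)) for doc in document_list]
```

Any out-of-vocabulary token works here, because the feature-vectorization loop above falls back to `model.seeded_vector(word)` when a word is not in the word2vec vocabulary.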
The model consists of a convolution layer -> pooling layer -> fully connected layer, with dropout applied along the way.
net.py
```python
from chainer import Chain
import chainer.functions as F
import chainer.links as L


class SimpleCNN(Chain):

    def __init__(self, input_channel, output_channel, filter_height, filter_width, mid_units, n_units, n_label):
        super(SimpleCNN, self).__init__(
            conv1=L.Convolution2D(input_channel, output_channel, (filter_height, filter_width)),
            l1=L.Linear(mid_units, n_units),
            l2=L.Linear(n_units, n_label),
        )

    # Called via L.Classifier
    def __call__(self, x):
        h1 = F.max_pooling_2d(F.relu(self.conv1(x)), 3)
        h2 = F.dropout(F.relu(self.l1(h1)))
        y = self.l2(h2)
        return y
```
During training, the accuracy and loss on both the training data and the test data are computed and displayed at each epoch. The code is almost the same as in the earlier post on document classification using a feedforward neural network.
The convolution filter size is 3 x 300 (300 being the dimension of each word vector).
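This also explains the `mid_units` value of 950 passed to `SimpleCNN` below: with an input of height 59 (the `maxlen` of this run) and width 300, a 3 x 300 filter yields a 57 x 1 feature map per output channel, and 3 x 3 max pooling with Chainer's defaults (stride equal to the pooling size, `cover_all=True`) shrinks it to 19 x 1, so the fully connected layer sees 50 * 19 * 1 = 950 units. A quick sanity check (assuming height 59, as in the log below):

```python
# Derive the input size of the first fully connected layer (mid_units).
height, width = 59, 300                   # maxlen x word-vector dimension
filter_height, filter_width = 3, 300
output_channel, pool = 50, 3

conv_h = height - filter_height + 1       # 57
conv_w = width - filter_width + 1         # 1
pool_h = -(-(conv_h - pool) // pool) + 1  # ceil((57 - 3) / 3) + 1 = 19 (cover_all=True)
pool_w = -(-(conv_w - pool) // pool) + 1  # 1
print output_channel * pool_h * pool_w    # 950
```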
train_cnn.py
```python
# Excerpt from train_cnn.py: the argparse setup that produces `args`
# (args.data, args.gpu, args.nunits, args.batchsize, args.epoch) is omitted here.
import sys

import numpy as np
import six
from sklearn.cross_validation import train_test_split

import chainer
import chainer.links as L
from chainer import cuda, optimizers

import util
from net import SimpleCNN

# Prepare dataset
dataset, height, width = util.load_data(args.data)

print 'height:', height
print 'width:', width

dataset['source'] = dataset['source'].astype(np.float32)  # features
dataset['target'] = dataset['target'].astype(np.int32)    # labels

x_train, x_test, y_train, y_test = train_test_split(dataset['source'], dataset['target'], test_size=0.15)
N_test = y_test.size         # test data size
N = len(x_train)             # train data size
in_units = x_train.shape[1]  # number of units in the input layer

# Reshape into a 4-dimensional tensor: (nsample, channel, height, width)
input_channel = 1
x_train = x_train.reshape(len(x_train), input_channel, height, width)
x_test = x_test.reshape(len(x_test), input_channel, height, width)

# Number of hidden-layer units
n_units = args.nunits
n_label = 2
filter_height = 3
output_channel = 50

# Model definition
model = L.Classifier(SimpleCNN(input_channel, output_channel, filter_height, width, 950, n_units, n_label))

# Whether to use the GPU
if args.gpu > 0:
    cuda.check_cuda_available()
    cuda.get_device(args.gpu).use()
    model.to_gpu()
xp = np if args.gpu <= 0 else cuda.cupy  # args.gpu <= 0: use CPU, otherwise: use GPU

batchsize = args.batchsize
n_epoch = args.epoch

# Setup optimizer
optimizer = optimizers.AdaGrad()
optimizer.setup(model)

# Learning loop
for epoch in six.moves.range(1, n_epoch + 1):
    print 'epoch', epoch, '/', n_epoch

    # training
    perm = np.random.permutation(N)  # random permutation of the training indices
    sum_train_loss = 0.0
    sum_train_accuracy = 0.0
    for i in six.moves.range(0, N, batchsize):
        # Select a minibatch from x_train and y_train via perm (different data every epoch)
        x = chainer.Variable(xp.asarray(x_train[perm[i:i + batchsize]]))  # source
        t = chainer.Variable(xp.asarray(y_train[perm[i:i + batchsize]]))  # target

        optimizer.update(model, x, t)

        sum_train_loss += float(model.loss.data) * len(t.data)          # for the mean loss
        sum_train_accuracy += float(model.accuracy.data) * len(t.data)  # for the mean accuracy

    print('train mean loss={}, accuracy={}'.format(sum_train_loss / N, sum_train_accuracy / N))

    # evaluation
    sum_test_loss = 0.0
    sum_test_accuracy = 0.0
    for i in six.moves.range(0, N_test, batchsize):
        # all test data
        x = chainer.Variable(xp.asarray(x_test[i:i + batchsize]))
        t = chainer.Variable(xp.asarray(y_test[i:i + batchsize]))

        loss = model(x, t)
        sum_test_loss += float(loss.data) * len(t.data)
        sum_test_accuracy += float(model.accuracy.data) * len(t.data)

    print(' test mean loss={}, accuracy={}'.format(sum_test_loss / N_test, sum_test_accuracy / N_test))

    if epoch > 10:
        optimizer.lr *= 0.97
        print 'learning rate: ', optimizer.lr

    sys.stdout.flush()
```
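The training log ends with "save the model" / "save the optimizer", but the corresponding lines are not part of the excerpt above. A minimal sketch of that step with Chainer's serializers (the file names are assumptions) would be:

```python
from chainer import serializers

# Save the trained model and the optimizer state (file names are assumptions).
print 'save the model'
serializers.save_npz('cnn_model.npz', model)
print 'save the optimizer'
serializers.save_npz('cnn_optimizer.npz', optimizer)
```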
The final accuracy was `accuracy = 0.775624996424`. [When classified with a feedforward neural network](http://qiita.com/ichiroex/items/9aa0bcada0b5bf6f9e1c) it was `accuracy = 0.716875001788`, so the accuracy improved considerably.
(Note that the feedforward experiment did not use word2vec; its document vectors were built from one-hot word vectors, so the experimental conditions differ.)
```
height: 59
width: 300
epoch 1 / 100
train mean loss=0.68654858897, accuracy=0.584814038988
test mean loss=0.673290403187, accuracy=0.674374999106
epoch 2 / 100
train mean loss=0.653146019086, accuracy=0.678733030628
test mean loss=0.626838338375, accuracy=0.695624998212
epoch 3 / 100
train mean loss=0.604344114544, accuracy=0.717580840894
test mean loss=0.582373640686, accuracy=0.713124997914
...
epoch 98 / 100
train mean loss=0.399981137426, accuracy=0.826288489978
test mean loss=0.460177404433, accuracy=0.775625003874
learning rate: 6.85350312961e-05
epoch 99 / 100
train mean loss=0.400466494895, accuracy=0.822536144887
test mean loss=0.464013618231, accuracy=0.773749999702
learning rate: 6.64789803572e-05
epoch 100 / 100
train mean loss=0.399539747416, accuracy=0.824081227461
test mean loss=0.466326575726, accuracy=0.775624996424
learning rate: 6.44846109465e-05
save the model
save the optimizer
```
I tried document classification (positive/negative classification) using a convolutional neural network. Even with this simple model, the accuracy is reasonable.
I have also implemented Yoon Kim's model with Chainer, so I will post an article about it next.
References
- Yoon Kim. Convolutional Neural Networks for Sentence Classification. EMNLP 2014.