I recently used Chainer, a framework that has been attracting a lot of attention, to implement a binary classifier that labels documents as positive or negative. This was my first time using Chainer, so the model is deliberately simple and meant as practice. This article is for readers who want to implement a deep neural network with Chainer and try something similar.
It would be helpful if you could point out any mistakes in the comments section.
Please refer to here for the full code.
- Install chainer, gensim, and scikit-learn
- Python 2.7 series
The data consists of reviews written in English. Each line corresponds to one document, and the words in a document are separated by single-byte spaces. The number at the beginning of each line (e.g. 1, 0) is the label.
0 each scene drags , underscoring the obvious , and sentiment is slathered on top .
0 afraid to pitch into farce , yet only half-hearted in its spy mechanics , all the queen's men is finally just one long drag .
1 clooney directs this film always keeping the balance between the fantastic and the believable . . .
1 just about the best straight-up , old-school horror film of the last 15 years .
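Parsing this format takes only a few lines. The following is a minimal sketch of the idea; the helper name `parse_line` is hypothetical and is not part of the original code.

```python
def parse_line(line):
    """Split one line of the data file into (label, list of words)."""
    # Everything before the first space is the label; the rest is the document.
    label_str, text = line.strip().split(" ", 1)
    return int(label_str), text.split()


# One of the sample lines shown above.
label, words = parse_line(
    "1 just about the best straight-up , old-school horror film of the last 15 years .\n"
)
```

Punctuation marks such as `,` and `.` are already separated by spaces in the data, so they end up as ordinary tokens in the word list.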
Each document is vectorized as a bag-of-words so that it can be used as input to the neural network. I used gensim for the vectorization. For details, please refer to this article. → Classify news articles by scikit-learn and gensim
The function `load_data` reads the input data and separates the label from the word string of each document with `l.strip().split(" ", 1)`.
The document labels are stored in `target` and the document vectors in `source`; both are put together in `dataset`, which is the return value.
`corpora.Dictionary(document_list)` builds a word dictionary from `document_list`, a list of documents in which each document is itself a list of words.
Strictly speaking, the word dictionary should be built from the training data only, but I wanted to skip unknown-word handling, so I built it from all documents.
Here, vocab_size (`len(dictionary)` in the code) is the vocabulary size over all documents and equals the number of dimensions of the document vector.
Therefore, the number of units in the input layer of the neural network implemented here is equal to vocab_size.
import numpy as np
from gensim import corpora, matutils

def load_data(fname):
    source = []
    target = []
    document_list = []  # One document per line; each document is a list of words
    with open(fname, "r") as f:  # The file is closed automatically
        for l in f:
            sample = l.strip().split(" ", 1)         # Separate the label from the word string
            label = int(sample[0])                   # Label
            target.append(label)
            document_list.append(sample[1].split())  # Split into words and add to the document list
    # Create the word dictionary
    dictionary = corpora.Dictionary(document_list)
    dictionary.filter_extremes(no_below=5, no_above=0.8)
    # no_below: ignore words that appear in fewer than no_below documents
    # no_above: ignore words that appear in more than the fraction no_above of all documents
    # Vectorize the documents
    for document in document_list:
        tmp = dictionary.doc2bow(document)  # BoW representation of the document
        vec = list(matutils.corpus2dense([tmp], num_terms=len(dictionary)).T[0])
        source.append(vec)
    dataset = {}
    dataset['target'] = np.array(target)
    dataset['source'] = np.array(source)
    print("vocab size: {}".format(len(dictionary)))  # Vocabulary size = number of input layer units
    return dataset, dictionary
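To make the bag-of-words step concrete, here is a small pure-Python sketch that does not use gensim. The functions `build_vocab` and `to_bow_vector` are hypothetical stand-ins for `corpora.Dictionary` and `doc2bow` plus `corpus2dense`; they are not gensim's API, and gensim assigns word ids differently. The sketch also shows the stricter setup mentioned above, where the vocabulary is built from training documents only and words outside it are simply skipped.

```python
def build_vocab(documents):
    """Map each distinct word to an integer id, in order of first appearance."""
    vocab = {}
    for words in documents:
        for w in words:
            if w not in vocab:
                vocab[w] = len(vocab)
    return vocab


def to_bow_vector(words, vocab):
    """Dense count vector of length len(vocab); unknown words are skipped."""
    vec = [0.0] * len(vocab)
    for w in words:
        idx = vocab.get(w)
        if idx is not None:
            vec[idx] += 1.0
    return vec


# Toy training documents (illustrative data, not from the review corpus).
train_docs = [["good", "film"], ["bad", "bad", "film"]]
vocab = build_vocab(train_docs)  # ids: good=0, film=1, bad=2
vec_train = to_bow_vector(train_docs[1], vocab)

# A document containing a word ("boring") that is not in the vocabulary.
vec_unseen = to_bow_vector(["good", "boring", "film"], vocab)
```

The length of every vector equals `len(vocab)`, which is why the vocabulary size determines the number of input units.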
Since this was for practice, I implemented a simple model.
(The dataset returned by the function `load_data` above is split into training data and test data with the `train_test_split` function from scikit-learn.)
The number of units in the input layer, `in_units`, is the number of dimensions of the document vector (`x_train.shape[1]`).
The number of units in the hidden (intermediate) layers can be chosen freely. Here it is passed as an argument, with a default of 500.
Because the output layer feeds a softmax function, its number of units is 2, which is the number of label types.
x_train, x_test, y_train, y_test = train_test_split(dataset['source'], dataset['target'], test_size=0.15)
N_test = y_test.size         # Test data size
N = len(x_train)             # Training data size
in_units = x_train.shape[1]  # Number of units in the input layer (vocabulary size)
n_units = args.units         # Number of hidden layer units
n_label = 2                  # Number of units in the output layer
# Model definition
model = chainer.Chain(l1=L.Linear(in_units, n_units),
                      l2=L.Linear(n_units, n_units),
                      l3=L.Linear(n_units,  n_label))
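The value `args.units` comes from the command line. The argument parsing itself is not shown in this article, so the following is only a sketch of what it might look like. The flag names `--gpu`, `--data`, and `--units` are taken from the command shown with the training log, and the default of 500 units from the text; the defaults for `--gpu` and `--data` and the help strings are assumptions.

```python
import argparse


def build_parser():
    """Command-line options for the training script (a hypothetical reconstruction)."""
    parser = argparse.ArgumentParser(description="Positive/negative document classifier")
    # Assumption: a negative id means "run on the CPU".
    parser.add_argument("--gpu", type=int, default=-1, help="GPU id")
    # Assumption: default file name chosen for illustration only.
    parser.add_argument("--data", type=str, default="input.dat", help="labeled data file")
    parser.add_argument("--units", type=int, default=500, help="number of hidden layer units")
    return parser


# Equivalent to: python train.py --gpu 1 --data input.dat --units 1000
args = build_parser().parse_args(["--gpu", "1", "--data", "input.dat", "--units", "1000"])
```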
The function `forward` performs forward propagation and returns the loss and the accuracy for a mini-batch.
The sigmoid function is used as the activation function for input layer → hidden layer and hidden layer → hidden layer.
def forward(x, t, train=True):
    h1 = F.sigmoid(model.l1(x))
    h2 = F.sigmoid(model.l2(h1))
    y = model.l3(h2)
    return F.softmax_cross_entropy(y, t), F.accuracy(y, t)
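To show what the two returned values mean, here is a pure-Python sketch of softmax cross entropy (averaged over the mini-batch) and accuracy (the fraction of samples whose highest-scoring output unit matches the label). It is an illustration of the calculation, not Chainer's implementation, and the scores in the example are made up.

```python
import math


def softmax(scores):
    """Turn raw output-layer scores into probabilities that sum to 1."""
    m = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def softmax_cross_entropy(batch_scores, labels):
    """Mean negative log-probability of the correct label over a mini-batch."""
    losses = [-math.log(softmax(s)[t]) for s, t in zip(batch_scores, labels)]
    return sum(losses) / len(losses)


def accuracy(batch_scores, labels):
    """Fraction of samples whose highest-scoring unit matches the label."""
    hits = sum(1 for s, t in zip(batch_scores, labels) if s.index(max(s)) == t)
    return hits / float(len(labels))


y = [[2.0, 0.0], [0.0, 0.0], [1.0, 3.0]]  # made-up output-layer scores for 3 documents
t = [0, 1, 0]                              # correct labels
loss = softmax_cross_entropy(y, t)
acc = accuracy(y, t)
```

Only the first document is classified correctly here, and the third one is confidently wrong, which is what drives the loss up.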
The overall flow is that each epoch computes the loss on the training data and the loss on the test data. Since this is a classification problem, the classification accuracy (`accuracy`) is computed as well.
# Setup optimizer
optimizer = optimizers.Adam()
optimizer.setup(model)
# Learning loop
for epoch in six.moves.range(1, n_epoch + 1):
    print('epoch {}'.format(epoch))
    # training
    perm = np.random.permutation(N)  # Random permutation of the training indices
    sum_train_loss     = 0.0
    sum_train_accuracy = 0.0
    for i in six.moves.range(0, N, batchsize):
        # Use perm to pick a mini-batch from x_train and y_train (a different batch every epoch)
        x = chainer.Variable(xp.asarray(x_train[perm[i:i + batchsize]]))  # source
        t = chainer.Variable(xp.asarray(y_train[perm[i:i + batchsize]]))  # target
        model.zerograds()            # Reset the gradients to zero
        loss, acc = forward(x, t)    # Forward propagation
        sum_train_loss      += float(cuda.to_cpu(loss.data)) * len(t)   # Accumulate for the mean loss
        sum_train_accuracy  += float(cuda.to_cpu(acc.data )) * len(t)   # Accumulate for the mean accuracy
        loss.backward()              # Backpropagation
        optimizer.update()           # Update the parameters
    print('train mean loss={}, accuracy={}'.format(
        sum_train_loss / N, sum_train_accuracy / N)) #Mean error
    # evaluation
    sum_test_loss     = 0.0
    sum_test_accuracy = 0.0
    for i in six.moves.range(0, N_test, batchsize):
        # all test data
        x = chainer.Variable(xp.asarray(x_test[i:i + batchsize]))
        t = chainer.Variable(xp.asarray(y_test[i:i + batchsize]))
        loss, acc = forward(x, t, train=False)
        sum_test_loss     += float(cuda.to_cpu(loss.data)) * len(t)
        sum_test_accuracy += float(cuda.to_cpu(acc.data))  * len(t)
    print(' test mean loss={}, accuracy={}'.format(
        sum_test_loss / N_test, sum_test_accuracy / N_test)) #Mean error
# Save the model and the optimizer
print('save the model')
serializers.save_npz('pn_classifier_ffnn.model', model)
print('save the optimizer')
serializers.save_npz('pn_classifier_ffnn.state', optimizer)
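The loop multiplies each mini-batch loss by `len(t)` before summing because the last mini-batch is usually smaller than `batchsize`. The following pure-Python sketch shows the index shuffling and the size-weighted average in isolation. The helper names `iterate_minibatches` and `epoch_mean` are hypothetical, and `random.Random(seed).shuffle` only plays the role that `np.random.permutation` has in the real code.

```python
import random


def iterate_minibatches(n_samples, batchsize, seed=None):
    """Yield lists of shuffled sample indices covering 0..n_samples-1 exactly once."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    for i in range(0, n_samples, batchsize):
        yield indices[i:i + batchsize]


def epoch_mean(batch_values, batch_sizes):
    """Average per-batch means, weighting each batch by its size."""
    total = sum(v * n for v, n in zip(batch_values, batch_sizes))
    return total / float(sum(batch_sizes))


batches = list(iterate_minibatches(10, 4, seed=0))  # batch sizes 4, 4, 2
sizes = [len(b) for b in batches]
# Made-up per-batch losses: (0.5*4 + 0.7*4 + 1.0*2) / 10
mean_loss = epoch_mean([0.5, 0.7, 1.0], sizes)
```

A plain average of the three batch losses would overweight the short final batch, which is why the loop divides the weighted sum by `N` (or `N_test`).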
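The classification accuracy on the test data at the final epoch was `accuracy = 0.716875001788`. However, while the training loss keeps falling, the test loss rises as training progresses (from about 0.50 at epoch 2 to about 5.24 at epoch 100), so the model is overfitting.
This is probably because I put the model together without much care.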
>python train.py --gpu 1 --data input.dat --units 1000
vocab size: 4442
epoch 1
train mean loss=0.746377664579, accuracy=0.554684912523
 test mean loss=0.622971419245, accuracy=0.706875003874
epoch 2
train mean loss=0.50845754933, accuracy=0.759408453399
 test mean loss=0.503996372223, accuracy=0.761249992996
epoch 3
train mean loss=0.386604680468, accuracy=0.826067760105
 test mean loss=0.506066314876, accuracy=0.769374992698
epoch 4
train mean loss=0.301527346359, accuracy=0.870433726909
 test mean loss=0.553729468957, accuracy=0.774999994785
epoch 5
train mean loss=0.264981631757, accuracy=0.889085094432
 test mean loss=0.599407823756, accuracy=0.766874998808
epoch 6
train mean loss=0.231274759588, accuracy=0.901114668847
 test mean loss=0.68350501731, accuracy=0.755625002086
...
epoch 95
train mean loss=0.0158744356008, accuracy=0.993598945303
 test mean loss=5.08019682765, accuracy=0.717499997467
epoch 96
train mean loss=0.0149783944279, accuracy=0.994261124581
 test mean loss=5.30629962683, accuracy=0.723749995232
epoch 97
train mean loss=0.00772037562047, accuracy=0.997351288256
 test mean loss=5.49559159577, accuracy=0.720624998212
epoch 98
train mean loss=0.00569957431572, accuracy=0.99834455516
 test mean loss=5.67661693692, accuracy=0.716875001788
epoch 99
train mean loss=0.00772406136085, accuracy=0.997240925267
 test mean loss=5.63734056056, accuracy=0.720000002533
epoch 100
train mean loss=0.0125463016702, accuracy=0.995916569395
 test mean loss=5.23713605106, accuracy=0.716875001788
save the model
save the optimizer
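One way to see where the overfitting sets in is to look for the epoch with the lowest test loss. The sketch below uses only values printed in the log above; the epochs hidden behind `...` are not included, so the result only holds for the printed epochs. Stopping or saving the model around that point would be a form of early stopping, which the training script in this article does not do.

```python
# (epoch, test mean loss, test accuracy), copied from the log above
log = [
    (1, 0.622971419245, 0.706875003874),
    (2, 0.503996372223, 0.761249992996),
    (3, 0.506066314876, 0.769374992698),
    (4, 0.553729468957, 0.774999994785),
    (5, 0.599407823756, 0.766874998808),
    (6, 0.68350501731, 0.755625002086),
    (100, 5.23713605106, 0.716875001788),
]

best_by_loss = min(log, key=lambda row: row[1])      # lowest test loss
best_by_accuracy = max(log, key=lambda row: row[2])  # highest test accuracy
```

Among the printed epochs, the test loss is lowest at epoch 2 and the test accuracy is highest at epoch 4, long before the final epoch.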
I used Chainer to implement a feedforward neural network for document classification. I would like to improve the model so that overfitting does not occur.
If you would like to see more code for studying Chainer, please refer to the following articles.
- Classify anime faces by deep learning
- Image classification using Deep Learning framework Chainer 4
- I had a regression problem with Chainer