Survive Christmas with a character-level CNN

This article is day 18 of the Retty Advent Calendar. Yesterday's entry was @YutaSakata's "I want Kotlin 1.1 for Christmas".

By the way, Christmas is almost here. Do you have someone to spend it with? Me? I do, of course. This little one. mycat.jpg

If you're on your own, you'll probably want to go out for a drink, right? A quiet, relaxed drink at a good restaurant is great too. But what if the place you walk into with that in mind turns out to be a den of happy couples? Your precious solo gourmet time would be ruined.

Let's use the power of deep learning to avoid such dangerous restaurants in advance.

What to prepare

Keras is a library for deep learning that runs on top of either TensorFlow or Theano as a backend. Doing complicated things with it can be a pain, but for most models it lets you get away with writing very little code. That's what I'll use this time.

(Postscript, 2017/3/1) I use TensorFlow as the backend. With Theano, the code below will not work as-is because of differences in how CNN channels are ordered; a small fix takes care of it. See the comment section for details.
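
(For reference, a minimal sketch of what I understand that fix to be, not the exact patch from the comments: Keras 1.x on Theano defaults to channels-first ordering, so the channel axis added before the convolution goes in front, and the branches are concatenated on the channel axis.)

# Sketch of the Theano-backend ('th' channels-first ordering) variant:
emb_ex = Reshape((1, max_length, embed_size))(emb)  # channel axis first
# ...
convs_merged = merge(convs, mode='concat', concat_axis=1)  # concat on channels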

For restaurant reviews, we use Retty reviews. Not having to crawl for data is one of the perks of being on the inside.

What to do

We want to sort restaurants into couples' dens and everything else, so let's build a classifier with deep learning. The flow:

  1. Treat Retty reviews of restaurants tagged as date spots as "date spot" reviews.
  2. Treat reviews of all other restaurants as "non date spot" reviews.
  3. Train a review classifier on those two sets.
  4. Feed all of a restaurant's reviews into the classifier; if a high ratio of them are classified as date-spot reviews, flag the place as a couples' den.

That's the idea.

There are various ways to build a classifier, but this time we'll use a **character-level CNN**.

character-level CNN

When people talk about using deep learning for natural language processing, LSTMs come up a lot, but I won't use one this time; I'll use a CNN instead. A character-level CNN has a really nice property: **no word segmentation required**. It operates character by character rather than word by word, so you never have to split sentences into words. The method in outline:

  1. Break the sentence down into an array of characters
  2. Convert each character to its Unicode code point
  3. Make the array a fixed length (truncate if too long, zero-pad if too short)
  4. Turn the code point array into an array of vectors with keras.layers.embeddings.Embedding
  5. Run the vector array through the CNN
  6. Output the classification result through fully connected layers
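
Concretely, steps 1 through 3 come down to just a couple of lines (a minimal sketch; the real preprocessing appears in load_data later):

max_length = 300
text = u"Great for dates!"
# 1-2. Break into characters and convert each one to its Unicode code point
codes = [ord(ch) for ch in text]
# 3. Fixed length: truncate if too long, zero-pad if too short
codes = codes[:max_length]
codes += [0] * (max_length - len(codes))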

Implementation

From here, let me walk through a concrete implementation.

Building the model

First, the character-level CNN model itself. It's super easy.

  1. Receive an Input whose shape is (batch size, maximum string length)
  2. Embedding converts each character into an embed_size-dimensional vector; emb then has shape (batch size, maximum string length, embed_size)
  3. Convolution2D expects shape (batch size, maximum string length, embed_size, number of channels); one channel is enough, so we add that axis with Reshape
  4. **(The important part)** Convolve the same input with several kernel sizes and concatenate the results
  5. Flatten the concatenated result and pass it through fully connected layers down to a single value (0 = review of a non-date-spot restaurant, 1 = review of a date-spot restaurant)
from keras.layers import Input, Dense, Dropout, Reshape, merge
from keras.layers.convolutional import Convolution2D
from keras.layers.pooling import MaxPooling2D
from keras.layers.embeddings import Embedding
from keras.layers.normalization import BatchNormalization
from keras.models import Model


def create_model(embed_size=128, max_length=300, filter_sizes=(2, 3, 4, 5), filter_num=64):
    inp = Input(shape=(max_length,))
    emb = Embedding(0xffff, embed_size)(inp)
    emb_ex = Reshape((max_length, embed_size, 1))(emb)
    convs = []
    # Apply Convolution2D with several kernel sizes to the same input
    for filter_size in filter_sizes:
        conv = Convolution2D(filter_num, filter_size, embed_size, activation="relu")(emb_ex)
        pool = MaxPooling2D(pool_size=(max_length - filter_size + 1, 1))(conv)
        convs.append(pool)
    convs_merged = merge(convs, mode='concat')
    reshape = Reshape((filter_num * len(filter_sizes),))(convs_merged)
    fc1 = Dense(64, activation="relu")(reshape)
    bn1 = BatchNormalization()(fc1)
    do1 = Dropout(0.5)(bn1)
    fc2 = Dense(1, activation='sigmoid')(do1)
    model = Model(input=inp, output=fc2)
    return model

Let me say a bit more about step 4. The argument spec of Convolution2D (Keras 1.x) is as follows.

keras.layers.convolutional.Convolution2D(nb_filter, nb_row, nb_col, init='glorot_uniform', activation='linear', weights=None, border_mode='valid', subsample=(1, 1), dim_ordering='default', W_regularizer=None, b_regularizer=None, activity_regularizer=None, W_constraint=None, b_constraint=None, bias=True)

Here we pass 2, 3, 4, and 5 as nb_row, and embed_size as nb_col. In other words, we're applying kernels that span 2, 3, 4, and 5 characters at a time, which feels like imitating 2-grams, 3-grams, 4-grams, and 5-grams. Concatenating the results into one lets the model use several n-gram sizes at once.
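
To make the shapes concrete, here's a quick walk-through of one branch with the defaults above (valid convolution, stride 1):

max_length, embed_size, filter_num = 300, 128, 64
for filter_size in (2, 3, 4, 5):
    # A (filter_size x embed_size) kernel leaves width 1 on the embedding
    # axis and max_length - filter_size + 1 positions along the string axis
    conv_shape = (max_length - filter_size + 1, 1, filter_num)
    # Max pooling over the whole remaining string axis collapses each
    # branch to a single filter_num-dimensional vector
    pool_shape = (1, 1, filter_num)
    print filter_size, conv_shape, "->", pool_shape

Concatenating the four pooled branches and flattening them gives filter_num * len(filter_sizes) = 256 features going into the Dense(64) layer.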

Loading the data

The data-loading part could be made memory-friendly with a generator, but this isn't image data and it doesn't eat that much memory, so let's just load everything into memory (a generator sketch follows the code below, for reference).

import random


def load_data(filepath, targets, max_length=300, min_length=10):
    comments = []
    tmp_comments = []
    with open(filepath) as f:
        for l in f:
            # Each line is assumed to be "restaurant ID<TAB>review text"
            restaurant_id, comment = l.split("\t", 1)
            restaurant_id = int(restaurant_id)
            # Convert each character to its Unicode code point
            comment = [ord(x) for x in comment.strip().decode("utf-8")]
            # Truncate overly long reviews
            comment = comment[:max_length]
            comment_len = len(comment)
            if comment_len < min_length:
                # Skip reviews that are too short
                continue
            if comment_len < max_length:
                # Zero-pad the rest so every review has a fixed length
                comment += ([0] * (max_length - comment_len))
            if restaurant_id not in targets:
                tmp_comments.append((0, comment))
            else:
                comments.append((1, comment))
    # For training, it's better to have the same number of date-spot
    # reviews and other reviews
    random.shuffle(tmp_comments)
    comments.extend(tmp_comments[:len(comments)])
    random.shuffle(comments)
    return comments
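
(For reference: if the data did outgrow memory, a batch generator along these lines could be fed to Keras 1's fit_generator instead. This is only a sketch built on the (label, codes) tuples that load_data returns; it isn't used in this article.)

import numpy as np

def batch_generator(comments, batch_size=100):
    # comments: list of (label, fixed-length code point list) tuples from load_data
    while True:
        for i in xrange(0, len(comments), batch_size):
            batch = comments[i:i + batch_size]
            inputs = np.array([c[1] for c in batch])
            labels = np.array([c[0] for c in batch])
            yield (inputs, labels)

# Usage (Keras 1 API):
# model.fit_generator(batch_generator(comments),
#                     samples_per_epoch=len(comments), nb_epoch=50)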

Training

Let's train it.

import numpy as np
from keras.callbacks import LearningRateScheduler
from keras.optimizers import Adam


def train(inputs, targets, batch_size=100, epoch_count=100, max_length=300, model_filepath="model.h5", learning_rate=0.001):

    # Decay the learning rate linearly, from learning_rate down to 1% of it
    start = learning_rate
    stop = learning_rate * 0.01
    learning_rates = np.linspace(start, stop, epoch_count)

    # Build the model
    model = create_model(max_length=max_length)
    optimizer = Adam(lr=learning_rate)
    model.compile(loss='binary_crossentropy',
                  optimizer=optimizer,
                  metrics=['accuracy'])

    # Train
    model.fit(inputs, targets,
              nb_epoch=epoch_count,
              batch_size=batch_size,
              verbose=1,
              validation_split=0.1,
              shuffle=True,
              callbacks=[
                  LearningRateScheduler(lambda epoch: learning_rates[epoch]),
              ])

    # Save the trained model
    model.save(model_filepath)


if __name__ == "__main__":
    comments = load_data(..., ...)

    input_values = []
    target_values = []
    for target_value, input_value in comments:
        input_values.append(input_value)
        target_values.append(target_value)
    input_values = np.array(input_values)
    target_values = np.array(target_values)
    train(input_values, target_values, epoch_count=50)

When I tried it, accuracy was over 99% on the training data but just under 80% on the test data, which suggests a fair amount of overfitting.

Prediction

Once you've gotten this far, you can classify reviews.

# -*- coding:utf-8 -*-

import numpy as np
from keras.models import load_model


def predict(comments, model_filepath="model.h5"):
    model = load_model(model_filepath)
    ret = model.predict(comments)
    return ret


if __name__ == "__main__":
    raw_comment = "Great for dates!"
    # Same preprocessing as at training time: code points, truncate, zero-pad
    comment = [ord(x) for x in raw_comment.strip().decode("utf-8")]
    comment = comment[:300]
    if len(comment) < 10:
        exit("too short!!")
    if len(comment) < 300:
        comment += ([0] * (300 - len(comment)))
    ret = predict(np.array([comment]))
    predict_result = ret[0][0]
    print "Date-spot score: {}%".format(predict_result * 100)

A yakitori and wine bar in Musashi-Koyama. Everything was delicious! The prices aren't cheap, but all the wines are bio wines. I was told the yakitori course is the one to get, so go for that. The staff's service is also the best, so please go.

Running the review above gave 99.9996066093%. Even at a yakitori place, you can't hide that couple-y aura. Incidentally, this review is by Retty's founder, Takeda. There's no way he went to such a sparkling place alone. Who did you go with?

I chose it because it's directly connected to the station!! I ordered yakitori, mizutaki and more, and the chicken was plump and delicious!! It has the feel of a place full of people just off work, but the prices were reasonable and it was quite good ♫

This one came out at 2.91604362879e-07%. Same yakitori, but when a review gives off that casual after-work feeling, the date-spot score drops this far. Your heart can rest easy. This review is also by a Retty employee, but let's not ask who wrote it.

Once you have this, you can tell whether a restaurant is a couples' den by running all of its reviews through the classifier and averaging the date-spot scores.
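
(A sketch of that last step. load_restaurant_comments here is a hypothetical helper that would return all of a restaurant's reviews already preprocessed into fixed-length code point arrays, exactly as load_data does; predict is the function defined above.)

import numpy as np

def restaurant_score(restaurant_id, model_filepath="model.h5"):
    # Hypothetical helper: every review of the restaurant, preprocessed
    # into fixed-length code point arrays
    comments = load_restaurant_comments(restaurant_id)
    scores = predict(np.array(comments), model_filepath)
    # The mean date-spot score over all reviews: the higher it is, the
    # more of a couples' den the restaurant is
    return scores.mean()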

Summary

Deep learning turns out to be an excellent technology for protecting peace of mind, too. Character-level CNNs appeared around the beginning of this year, and more recently things like [QRNN](http://metamind.io/research/new-neural-network-building-block-allows-faster-and-more-accurate-text-understanding/) have come out, so I'd like to try those as well.

Have a nice Christmas.
