This is a memo for myself as I read Introduction to Natural Language Processing Applications in 15 Steps. This time I note my own takeaways from Chapter 3, Step 12. I have already studied CNNs themselves, so these notes are rough.
- Personal Mac: macOS Mojave version 10.14.6
- Docker: version 19.03.2 for both Client and Server
In the previous chapter, introducing word embeddings made it possible to use distributed representations of words as features. However, to turn them into sentence-level features, the distributed representations had to be summed or averaged, and the resulting predictions were worse than with BoW-based features. In this chapter, we build a convolutional neural network (CNN) that takes as input the sequence of word vectors arranged in sentence order. Note that the CNNs familiar from image analysis are two-dimensional, whereas the CNN used for natural language processing (text classification, etc.) is one-dimensional.
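One thing that summing or averaging certainly discards is word order, which may be part of why the sentence-level feature underperforms (this is my own reading, not a claim from the book). The following is a minimal sketch using made-up 3-dimensional vectors; `toy_vectors`, `average_feature`, and `sequence_feature` are hypothetical names for illustration only.

```python
import numpy as np

# Hypothetical 3-dimensional word vectors, made up for illustration
# (not taken from the pretrained Word2Vec model).
toy_vectors = {
    "dog":   np.array([1.0, 0.0, 0.0]),
    "bites": np.array([0.0, 1.0, 0.0]),
    "man":   np.array([0.0, 0.0, 1.0]),
}


def average_feature(words):
    """Average the word vectors of a sentence into one fixed-size vector."""
    return np.mean([toy_vectors[w] for w in words], axis=0)


def sequence_feature(words):
    """Keep the word vectors in sentence order as a (length, dim) array."""
    return np.stack([toy_vectors[w] for w in words])


a = ["dog", "bites", "man"]
b = ["man", "bites", "dog"]

# The averaged features are identical even though the sentences differ.
same_average = np.allclose(average_feature(a), average_feature(b))
# The sequences differ, so a model that reads them in order can tell them apart.
same_sequence = np.array_equal(sequence_feature(a), sequence_feature(b))
print(same_average, same_sequence)  # True False
```

Keeping the vectors as a sequence is what lets a CNN look at neighbouring words together.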
12.1 ~ 12.4
| CNN layer | Contents |
|---|---|
| Convolutional layer | Input: the sequence of distributed representations of words obtained from word embeddings (when CNN layers are stacked, the output sequence of the previous layer's pooling layer). The sequences of word vectors that make up each sentence are aligned to a common length: anything beyond that length is ignored, and missing positions are filled with zero vectors. Along the sentence direction, each window of kernel_size vectors is multiplied by the weights and a bias is added to produce one output. The same operation is repeated at every stride along the sentence direction, reusing the same weights at each position (weight sharing). |
| Pooling layer | Input: the output sequence of the convolutional layer. There are max pooling and average pooling; max pooling, a non-linear operation, is said to perform better. The same operation can be repeated at every stride along the sentence direction, but the whole sequence can also be processed at once without a stride (global max pooling, global average pooling). |
| Fully-connected layer (densely-connected layer) | The pooled output is fed to a multi-layer perceptron for multi-class classification. Since the output of the pooling layer is a two-dimensional array, it is converted to a one-dimensional array that the multi-layer perceptron can accept. |
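The operations in the table can be sketched in plain NumPy for a single sentence. This is only an illustration under assumptions: the helper names (`pad_or_truncate`, `conv1d`, `global_max_pooling`) and all sizes are hypothetical, and the zero vectors are appended at the end here, whereas Keras `pad_sequences` pads at the front by default.

```python
import numpy as np


def pad_or_truncate(vectors, max_len):
    """Cut sequences longer than max_len; zero-pad shorter ones at the end."""
    vectors = vectors[:max_len]
    padding = np.zeros((max_len - len(vectors), vectors.shape[1]))
    return np.vstack([vectors, padding])


def conv1d(x, weights, bias, stride=1):
    """x: (length, dim); weights: (kernel_size, dim, filters); bias: (filters,)."""
    kernel_size = weights.shape[0]
    outputs = []
    for start in range(0, x.shape[0] - kernel_size + 1, stride):
        window = x[start:start + kernel_size]  # (kernel_size, dim)
        # The same weights and bias are reused at every position (weight sharing).
        outputs.append(np.tensordot(window, weights, axes=([0, 1], [0, 1])) + bias)
    return np.maximum(np.array(outputs), 0.0)  # ReLU activation


def global_max_pooling(x):
    """Take the maximum over the whole sequence axis for each filter."""
    return x.max(axis=0)


rng = np.random.default_rng(0)
sentence = rng.normal(size=(3, 4))        # 3 words, 4-dimensional vectors
x = pad_or_truncate(sentence, max_len=6)  # shape (6, 4)
weights = rng.normal(size=(2, 4, 5))      # kernel_size=2, 5 filters
bias = np.zeros(5)

conv_out = conv1d(x, weights, bias)       # 6 - 2 + 1 = 5 positions, 5 filters
pooled = global_max_pooling(conv_out)     # one value per filter
print(x.shape, conv_out.shape, pooled.shape)  # (6, 4) (5, 5) (5,)
```

With global pooling the result is already one value per filter, so it can go straight into the fully-connected part.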
In the previous chapter the distributed representations were summed, so here I tried averaging them instead.
import numpy as np
from gensim.models import Word2Vec
from sklearn.svm import SVC
from tokenizer import tokenize
from sklearn.pipeline import Pipeline
class DialogueAgent:
    def __init__(self):
        self.model = Word2Vec.load(
            './latest-ja-word2vec-gensim-model/word2vec.gensim.model')  # <1>
    def train(self, texts, labels):
        pipeline = Pipeline([
            ('classifier', SVC()),
        ])
        pipeline.fit(texts, labels)
        self.pipeline = pipeline
    def predict(self, texts):
        return self.pipeline.predict(texts)
    # Almost the same as in Step 11; only the final aggregation differs
    def calc_text_feature(self, text):
~~
        # Step 11 summed the word vectors:
        # return np.sum(word_vectors, axis=0)
        return np.average(word_vectors, axis=0)
evaluate_dialogue_agent.py
from os.path import dirname, join, normpath
import pandas as pd
import numpy as np
from sklearn.metrics import accuracy_score
from <Implemented module name> import DialogueAgent
if __name__ == '__main__':
    BASE_DIR = normpath(dirname(__file__))
    # Training
    training_data = pd.read_csv(join(BASE_DIR, './training_data.csv'))
    dialogue_agent = DialogueAgent()
    X_train = np.array([dialogue_agent.calc_text_feature(text) for text in training_data['text']])
    y_train = np.array(training_data['label'])
    dialogue_agent.train(X_train, y_train)
    # Evaluation
    test_data = pd.read_csv(join(BASE_DIR, './test_data.csv'))
    X_test = np.array([dialogue_agent.calc_text_feature(text) for text in test_data['text']])
    y_test = np.array(test_data['label'])
    y_pred = dialogue_agent.predict(X_test)
    print(accuracy_score(y_test, y_pred))
The classifier above was an SVC, so next I replaced it with a neural network (NN).
Since the distributed representations of the words are averaged into one vector per sentence, the dimension of that vector, texts.shape[1], is passed to KerasClassifier as input_dim.
~~
    def _build_mlp(self, input_dim, hidden_units, output_dim):
        mlp = Sequential()
        mlp.add(Dense(units=hidden_units,
                      input_dim=input_dim,
                      activation='relu'))
        mlp.add(Dense(units=output_dim, activation='softmax'))
        mlp.compile(loss='categorical_crossentropy',
                    optimizer='adam')
        return mlp
    def train(self, texts, labels, hidden_units=32, classifier__epochs=100):
        feature_dim = texts.shape[1]
        print(feature_dim)
        n_labels = max(labels) + 1
        classifier = KerasClassifier(build_fn=self._build_mlp,
                                     input_dim=feature_dim,
                                     hidden_units=hidden_units,
                                     output_dim=n_labels)
        pipeline = Pipeline([
            ('classifier', classifier),
        ])
        pipeline.fit(texts, labels, classifier__epochs=classifier__epochs)
        self.pipeline = pipeline
    def predict(self, texts):
        return self.pipeline.predict(texts)
~~
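To make the shapes in the code above concrete, here is a rough NumPy sketch of what the two Dense layers compute. The data, the sizes, and the randomly initialised weights are all hypothetical stand-ins; nothing here is trained, and it is not the Keras implementation itself.

```python
import numpy as np


def softmax(z):
    """Row-wise softmax, shifted by the row maximum for numerical stability."""
    e = np.exp(z - z.max(axis=1, keepdims=True))
    return e / e.sum(axis=1, keepdims=True)


rng = np.random.default_rng(0)

# Hypothetical data: 4 sentences, each already averaged into a 5-dim feature.
texts = rng.normal(size=(4, 5))
labels = np.array([0, 2, 1, 2])

feature_dim = texts.shape[1]  # becomes input_dim of the first Dense layer
n_labels = max(labels) + 1    # valid only if labels are integers 0..K-1
hidden_units = 3

# Randomly initialised weights standing in for the trained Dense layers.
w1, b1 = rng.normal(size=(feature_dim, hidden_units)), np.zeros(hidden_units)
w2, b2 = rng.normal(size=(hidden_units, n_labels)), np.zeros(n_labels)

hidden = np.maximum(texts @ w1 + b1, 0.0)  # Dense + ReLU
probs = softmax(hidden @ w2 + b2)          # Dense + softmax, one row per sentence

# categorical_crossentropy compares probs against one-hot encoded labels.
one_hot = np.eye(n_labels)[labels]
loss = -np.mean(np.sum(one_hot * np.log(probs + 1e-12), axis=1))
print(probs.shape, one_hot.shape)  # (4, 3) (4, 3)
```

Note that `max(labels) + 1` only gives the number of classes when the label IDs start at 0 and have no gaps at the top.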
Since the output of the Embedding layer for one sentence is a two-dimensional array (sequence length x embedding dimension), a Flatten layer is inserted before it is fed to the Dense layer.
    # Model building
    model = Sequential()
    model.add(get_keras_embedding(we_model.wv,
                                  input_shape=(MAX_SEQUENCE_LENGTH, ),
                                  trainable=False))
    model.add(Flatten())
    model.add(Dense(units=256, activation='relu'))
    model.add(Dense(units=128, activation='relu'))
    model.add(Dense(units=n_classes, activation='softmax'))
    model.compile(loss='categorical_crossentropy',
                  optimizer='rmsprop',
                  metrics=['accuracy'])
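The effect of the Flatten layer in the model above can be checked with a small NumPy sketch. The sizes are hypothetical; the real ones come from the pretrained Word2Vec model and MAX_SEQUENCE_LENGTH.

```python
import numpy as np

# Hypothetical sizes for illustration only.
batch_size, max_sequence_length, embedding_dim = 2, 4, 3

# Stand-in for the Embedding layer output: one vector per word position.
embedded = np.arange(
    batch_size * max_sequence_length * embedding_dim, dtype=float
).reshape(batch_size, max_sequence_length, embedding_dim)

# Flatten keeps the batch axis and concatenates the word vectors in order.
flattened = embedded.reshape(batch_size, -1)

# A Dense layer with 256 units on this input needs one weight per
# (input value, unit) pair plus one bias per unit.
dense_params = flattened.shape[1] * 256 + 256

print(embedded.shape)   # (2, 4, 3)
print(flattened.shape)  # (2, 12)
print(dense_params)     # 3328
```

Because every word position gets its own block of weights, the first Dense layer grows with the sequence length.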
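Here the distributed representations from the word embeddings are fed into the CNN's convolutional layer, but kernel_size is set to x_train.shape[1] so that the configuration matches the Dense-layer model above. Assuming x_train holds the padded index sequences, x_train.shape[1] is the padded sequence length (it has to equal MAX_SEQUENCE_LENGTH to match input_shape) rather than the dimension of the distributed representation, so a single kernel spans the whole sentence.
    # Model building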
    model = Sequential()
    model.add(get_keras_embedding(we_model.wv,
                                  input_shape=(MAX_SEQUENCE_LENGTH, ),
                                  trainable=False))  # <6>
    # 1D Convolution
    model.add(Conv1D(filters=256, kernel_size=x_train.shape[1], strides=1, activation='relu'))
    # Global max pooling
    model.add(MaxPooling1D(pool_size=int(model.output.shape[1])))
    model.add(Flatten())
    model.add(Dense(units=128, activation='relu'))
    model.add(Dense(units=n_classes, activation='softmax'))
    model.compile(loss='categorical_crossentropy',
                  optimizer='rmsprop',
                  metrics=['accuracy'])
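Why this configuration behaves like a Dense layer can be verified numerically. The sketch below uses hypothetical small sizes and random weights: a 1D convolution whose kernel covers the whole sequence has only one valid position, and its output equals a Dense layer applied to the flattened input with the same numbers.

```python
import numpy as np

rng = np.random.default_rng(0)
seq_len, embedding_dim, filters = 4, 3, 5  # hypothetical small sizes

x = rng.normal(size=(seq_len, embedding_dim))                # one embedded sentence
kernel = rng.normal(size=(seq_len, embedding_dim, filters))  # kernel spans the sentence
bias = rng.normal(size=filters)

# Conv1D with kernel_size == seq_len and stride 1 has a single valid position,
# so its output has length 1 along the sequence axis.
conv_out = np.tensordot(x, kernel, axes=([0, 1], [0, 1])) + bias  # (filters,)

# Dense on the flattened input, using the same numbers reshaped into a matrix.
dense_out = x.reshape(-1) @ kernel.reshape(-1, filters) + bias    # (filters,)

print(np.allclose(conv_out, dense_out))  # True
```

With an output of length 1, the max pooling that follows has nothing left to choose from, so this model does not yet benefit from the convolution.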
Here the distributed representations from the word embeddings are fed into a CNN convolutional layer. Details are omitted because the implementation follows the book.
Embedding layer
    Embedding(input_dim=word_num + 1,
             output_dim=embedding_dim,
             weights=[weights_with_zero],
             *args, **kwargs)
↓
# With *args and **kwargs expanded, the call actually looks like this
    Embedding(input_dim=word_num + 1,
             output_dim=embedding_dim,
             weights=[weights_with_zero],
             input_shape=(MAX_SEQUENCE_LENGTH, ),
             trainable=False)
- trainable=False: the weights are not updated during training (the embedding uses pretrained weights, so they are kept fixed)
- input_shape: specified on the first layer added to a Keras model with add, to declare the shape of the input
- input_dim / output_dim: the input and output dimensions of the Embedding layer's weight matrix. The output dimension of the Embedding layer is equal to output_dim
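Conceptually, the Embedding layer is a row lookup in the weight matrix. The NumPy sketch below illustrates this with made-up numbers. It assumes that index 0 is reserved for padding and mapped to a zero vector, which is my reading of `weights_with_zero` and `input_dim=word_num + 1`; the actual helper in the book may arrange this differently.

```python
import numpy as np

# Hypothetical pretrained vectors: 3 words, 2 dimensions each.
weights = np.array([[0.1, 0.2],
                    [0.3, 0.4],
                    [0.5, 0.6]])
word_num, embedding_dim = weights.shape

# Assumption: index 0 is reserved for padding, so a zero vector goes in row 0
# and real words use indices 1..word_num. Hence input_dim = word_num + 1.
weights_with_zero = np.vstack([np.zeros((1, embedding_dim)), weights])

# A padded sentence: two padding positions followed by word indices 3, 1, 2.
indices = np.array([0, 0, 3, 1, 2])

# The lookup returns one row per position: (sequence length, output_dim).
embedded = weights_with_zero[indices]
print(weights_with_zero.shape, embedded.shape)  # (4, 2) (5, 2)
```

Padding positions come out as zero vectors, which matches the "fill the missing parts with zero vectors" step of the convolutional layer.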
| Distributed representation of word embeddings | Classifier | Accuracy |
|---|---|---|
| Sum | SVC | 0.40425531914893614 |
| Average | SVC | 0.425531914893617 |
| Sum / average | NN | 0.5638297872340425 / 0.5531914893617021 |
| Kept as a sequence | Embedding -> Flatten -> Dense | 0.5319148936170213 |
| Kept as a sequence | Embedding -> CNN(Dense) | 0.5 |
| Kept as a sequence | Embedding -> CNN | 0.6808510638297872 |
- Basic implementation (Step01): 37.2%
- Preprocessing added (Step02): 43.6%
- Preprocessing + feature extraction change (Step04): 58.5%
- Preprocessing + feature extraction change + classifier change to RandomForest (Step06): 61.7%
- Preprocessing + feature extraction change + classifier change to NN (Step09): 66.0%
- Preprocessing + feature extraction change (Step11): 40.4%
- Preprocessing + feature extraction change + classifier change to CNN (Step12): 68.1%