Reference URL: https://www.tensorflow.org/tutorials/keras/text_classification?hl=ja
Do the following:
--Classify movie reviews as positive or negative based on their text
--Build a neural network for binary classification (positive or negative)
--Train the neural network
--Evaluate the performance of the model
import tensorflow as tf
from tensorflow import keras
import numpy as np
#Check the TensorFlow version
print(tf.__version__)
2.3.0
Use the IMDB dataset.
num_words=10000 keeps the 10,000 most frequently occurring words; rarely appearing words are discarded to keep the data size manageable.
imdb = keras.datasets.imdb
(train_data, train_labels), (test_data, test_labels) = imdb.load_data(num_words=10000)
Downloading data from https://storage.googleapis.com/tensorflow/tf-keras-datasets/imdb.npz
17465344/17464789 [==============================] - 0s 0us/step
print("Training entries: {}, labels: {}".format(len(train_data), len(train_labels)))
Training entries: 25000, labels: 25000
The data consists of arrays of integers, each representing the words in a movie review, labeled as positive or negative: 0 indicates a negative review, 1 indicates a positive review.
print(train_data[0])
[1, 14, 22, 16, 43, 530, 973, 1622, 1385, 65, 458, 4468, 66, 3941, 4, 173, 36, 256, 5, 25, 100, 43, 838, 112, 50, 670, 2, 9, 35, 480, 284, 5, 150, 4, 172, 112, 167, 2, 336, 385, 39, 4, 172, 4536, 1111, 17, 546, 38, 13, 447, 4, 192, 50, 16, 6, 147, 2025, 19, 14, 22, 4, 1920, 4613, 469, 4, 22, 71, 87, 12, 16, 43, 530, 38, 76, 15, 13, 1247, 4, 22, 17, 515, 17, 12, 16, 626, 18, 2, 5, 62, 386, 12, 8, 316, 8, 106, 5, 4, 2223, 5244, 16, 480, 66, 3785, 33, 4, 130, 12, 16, 38, 619, 5, 25, 124, 51, 36, 135, 48, 25, 1415, 33, 6, 22, 12, 215, 28, 77, 52, 5, 14, 407, 16, 82, 2, 8, 4, 107, 117, 5952, 15, 256, 4, 2, 7, 3766, 5, 723, 36, 71, 43, 530, 476, 26, 400, 317, 46, 7, 4, 2, 1029, 13, 104, 88, 4, 381, 15, 297, 98, 32, 2071, 56, 26, 141, 6, 194, 7486, 18, 4, 226, 22, 21, 134, 476, 26, 480, 5, 144, 30, 5535, 18, 51, 36, 28, 224, 92, 25, 104, 4, 226, 65, 16, 38, 1334, 88, 12, 16, 283, 5, 16, 4472, 113, 103, 32, 15, 16, 5345, 19, 178, 32]
print(train_labels[0])
1
Movie reviews vary in length, but inputs to the neural network must all be the same length, so this needs to be resolved.
len(train_data[0]), len(train_data[1])
(218, 189)
#A dictionary that maps words to integers
word_index = imdb.get_word_index()
#The first part of the index is reserved
word_index = {k:(v+3) for k,v in word_index.items()} 
word_index["<PAD>"] = 0
word_index["<START>"] = 1
word_index["<UNK>"] = 2  # unknown
word_index["<UNUSED>"] = 3
reverse_word_index = dict([(value, key) for (key, value) in word_index.items()])
def decode_review(text):
    return ' '.join([reverse_word_index.get(i, '?') for i in text])
Downloading data from https://storage.googleapis.com/tensorflow/tf-keras-datasets/imdb_word_index.json
1646592/1641221 [==============================] - 0s 0us/step
decode_review(train_data[0])
"<START> this film was just brilliant casting location scenery story direction everyone's really suited the part they played and you could just imagine being there robert <UNK> is an amazing actor and now the same being director <UNK> father came from the same scottish island as myself so i loved the fact there was a real connection with this film the witty remarks throughout the film were great it was just brilliant so much that i bought the film as soon as it was released for <UNK> and would recommend it to everyone to watch and the fly fishing was amazing really cried at the end it was so sad and you know what they say if you cry at a film it must have been good and this definitely was also <UNK> to the two little boy's that played the <UNK> of norman and paul they were just brilliant children are often left out of the <UNK> list i think because the stars that play them all grown up are such a big profile for the whole film but these children are amazing and should be praised for what they have done don't you think the whole story was so lovely because it was true and was someone's life after all that was shared with us all"
There are two main ways to shape the data input to the neural network.
--Convert each array into a vector of 0s and 1s indicating word occurrence, similar to one-hot encoding. For example, the array [3, 5] becomes a 10,000-dimensional vector that is all zeros except at indexes 3 and 5. This vector is then fed to the first layer of the network, a Dense layer that can handle floating-point vector data. However, this approach is memory intensive, requiring a matrix of size number of words x number of reviews.
--Pad the arrays to the same length and create an integer tensor of shape number of samples x maximum length, then make an Embedding layer, which can handle this format, the first layer of the network.
The latter approach is adopted in this tutorial (a sketch of the former is included below for comparison).
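For comparison, a minimal sketch of the first (multi-hot) approach might look like the following. multi_hot_encode is a hypothetical helper, not part of this tutorial.
def multi_hot_encode(sequences, dimension=10000):
    # Matrix of shape (number of reviews, vocabulary size), initialized to zeros
    results = np.zeros((len(sequences), dimension))
    for i, word_indices in enumerate(sequences):
        # Set the positions of the words that appear in the review to 1.0
        results[i, word_indices] = 1.0
    return results
# e.g. multi_hot_encode([[3, 5]])[0] is all zeros except at indexes 3 and 5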
Use the pad_sequences function to standardize lengths
Reference: Sequence preprocessing (Keras Documentation)
--value: the value to pad with
--padding: where to pad ('pre' or 'post')
--maxlen: the maximum length of the arrays; if not specified, the maximum length in the given arrays is used
train_data = keras.preprocessing.sequence.pad_sequences(train_data,
                                                        value=word_index["<PAD>"],
                                                        padding='post',
                                                        maxlen=256)
test_data = keras.preprocessing.sequence.pad_sequences(test_data,
                                                       value=word_index["<PAD>"],
                                                       padding='post',
                                                       maxlen=256)
#Make sure the lengths are the same
len(train_data[0]), len(train_data[1])
(256, 256)
#Check the first padded data
print(train_data[0])
[   1   14   22   16   43  530  973 1622 1385   65  458 4468   66 3941
    4  173   36  256    5   25  100   43  838  112   50  670    2    9
   35  480  284    5  150    4  172  112  167    2  336  385   39    4
  172 4536 1111   17  546   38   13  447    4  192   50   16    6  147
 2025   19   14   22    4 1920 4613  469    4   22   71   87   12   16
   43  530   38   76   15   13 1247    4   22   17  515   17   12   16
  626   18    2    5   62  386   12    8  316    8  106    5    4 2223
 5244   16  480   66 3785   33    4  130   12   16   38  619    5   25
  124   51   36  135   48   25 1415   33    6   22   12  215   28   77
   52    5   14  407   16   82    2    8    4  107  117 5952   15  256
    4    2    7 3766    5  723   36   71   43  530  476   26  400  317
   46    7    4    2 1029   13  104   88    4  381   15  297   98   32
 2071   56   26  141    6  194 7486   18    4  226   22   21  134  476
   26  480    5  144   30 5535   18   51   36   28  224   92   25  104
    4  226   65   16   38 1334   88   12   16  283    5   16 4472  113
  103   32   15   16 5345   19  178   32    0    0    0    0    0    0
    0    0    0    0    0    0    0    0    0    0    0    0    0    0
    0    0    0    0    0    0    0    0    0    0    0    0    0    0
    0    0    0    0]
keras.layers.Embedding
Takes an integer-encoded vocabulary and looks up the embedding vector for each word index. The embedding vectors are learned during model training. One dimension is added to the output, so the output shape is (batch, sequence, embedding). Simply put, it returns a vector representation of each word in the input.
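As an illustration (with made-up word indexes), the output shape of an Embedding layer can be checked like this:
embedding_layer = keras.layers.Embedding(input_dim=10000, output_dim=16)
sample_batch = tf.constant([[1, 14, 22, 16], [2, 5, 0, 0]])  # (batch=2, sequence=4) of word indexes
embedded = embedding_layer(sample_batch)
print(embedded.shape)  # (2, 4, 16) -> (batch, sequence, embedding)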
keras.layers.GlobalAveragePooling1D
The GlobalAveragePooling1D (1D global average pooling) layer. For each sample, it takes the mean along the sequence dimension and returns a fixed-length vector. Simply put, it averages each dimension of the word vectors (compressing the amount of information).
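A minimal sketch of what this pooling does, with a random tensor standing in for the embedding output; it is equivalent to taking the mean over the sequence axis:
x = tf.random.uniform((2, 256, 16))                  # (batch, sequence, embedding)
pooled = keras.layers.GlobalAveragePooling1D()(x)    # shape: (2, 16)
manual = tf.reduce_mean(x, axis=1)                   # mean over the sequence axis
print(pooled.shape)                                  # (2, 16)
print(np.allclose(pooled.numpy(), manual.numpy()))   # True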
A fully connected (Dense) layer with 16 hidden units; activation='relu' specifies the ReLU activation function.
A fully connected layer with a single output node; using a sigmoid activation function, the output is a floating-point number between 0 and 1 representing a probability or confidence.
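To illustrate why the sigmoid output stays between 0 and 1, here is a quick check with arbitrary input values:
print(tf.keras.activations.sigmoid(tf.constant([-2.0, 0.0, 2.0])).numpy())
# approximately [0.119 0.5 0.881]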
#The input shape is the vocabulary size used for the movie reviews (10,000 words)
vocab_size = 10000
model = keras.Sequential()
model.add(keras.layers.Embedding(vocab_size, 16))
model.add(keras.layers.GlobalAveragePooling1D())
model.add(keras.layers.Dense(16, activation='relu'))
model.add(keras.layers.Dense(1, activation='sigmoid'))
model.summary()
Model: "sequential"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
=================================================================
embedding (Embedding)        (None, None, 16)          160000    
_________________________________________________________________
global_average_pooling1d (Gl (None, 16)                0         
_________________________________________________________________
dense (Dense)                (None, 16)                272       
_________________________________________________________________
dense_1 (Dense)              (None, 1)                 17        
=================================================================
Total params: 160,289
Trainable params: 160,289
Non-trainable params: 0
_________________________________________________________________
The above model has two intermediate, or "hidden", layers between the input and the output. The number of outputs (units, nodes, or neurons) is the dimensionality of that layer's internal representation, in other words, the degrees of freedom the network has when learning internal representations.
If the model has more hidden units (a higher-dimensional internal representation space), or more layers, or both, the network can learn more complex internal representations. However, the computational cost of the network increases, and it may learn patterns you do not want it to learn: patterns that improve performance on the training data but not on the test data. This problem is called overfitting.
Configure the model for training:
--optimizer: the optimization algorithm; here Adam is used (other optimizers: https://www.tensorflow.org/api_docs/python/tf/keras/optimizers)
--loss: the loss function; here binary cross entropy is used
--metrics: the metrics monitored during training and testing; here accuracy is used
model.compile(optimizer='adam',
              loss='binary_crossentropy',
              metrics=['accuracy'])
During training, check accuracy on data the model has not seen. Create a validation set by setting aside 10,000 samples from the original training data. The validation data is used for tuning hyperparameters while building the model.
x_val = train_data[:10000]
partial_x_train = train_data[10000:]
y_val = train_labels[:10000]
partial_y_train = train_labels[10000:]
Train the model for 40 epochs using mini-batches of 512 samples. As a result, every sample in x_train and y_train is iterated over 40 times. During training, the 10,000 validation samples are used to monitor the model's loss and accuracy.
history = model.fit(partial_x_train,
                    partial_y_train,
                    epochs=40,
                    batch_size=512,
                    validation_data=(x_val, y_val),
                    verbose=1)
Epoch 1/40
30/30 [==============================] - 1s 21ms/step - loss: 0.6916 - accuracy: 0.5446 - val_loss: 0.6894 - val_accuracy: 0.6478
Epoch 2/40
30/30 [==============================] - 0s 16ms/step - loss: 0.6855 - accuracy: 0.7160 - val_loss: 0.6815 - val_accuracy: 0.6852
Epoch 3/40
30/30 [==============================] - 1s 17ms/step - loss: 0.6726 - accuracy: 0.7333 - val_loss: 0.6651 - val_accuracy: 0.7548
Epoch 4/40
30/30 [==============================] - 1s 17ms/step - loss: 0.6499 - accuracy: 0.7683 - val_loss: 0.6393 - val_accuracy: 0.7636
Epoch 5/40
30/30 [==============================] - 1s 17ms/step - loss: 0.6166 - accuracy: 0.7867 - val_loss: 0.6045 - val_accuracy: 0.7791
Epoch 6/40
30/30 [==============================] - 1s 17ms/step - loss: 0.5747 - accuracy: 0.8083 - val_loss: 0.5650 - val_accuracy: 0.7975
Epoch 7/40
30/30 [==============================] - 0s 16ms/step - loss: 0.5286 - accuracy: 0.8265 - val_loss: 0.5212 - val_accuracy: 0.8165
Epoch 8/40
30/30 [==============================] - 0s 16ms/step - loss: 0.4819 - accuracy: 0.8442 - val_loss: 0.4807 - val_accuracy: 0.8285
Epoch 9/40
30/30 [==============================] - 1s 17ms/step - loss: 0.4382 - accuracy: 0.8589 - val_loss: 0.4438 - val_accuracy: 0.8415
Epoch 10/40
30/30 [==============================] - 0s 16ms/step - loss: 0.3996 - accuracy: 0.8721 - val_loss: 0.4133 - val_accuracy: 0.8496
Epoch 11/40
30/30 [==============================] - 1s 17ms/step - loss: 0.3673 - accuracy: 0.8774 - val_loss: 0.3887 - val_accuracy: 0.8570
Epoch 12/40
30/30 [==============================] - 0s 16ms/step - loss: 0.3398 - accuracy: 0.8895 - val_loss: 0.3682 - val_accuracy: 0.8625
Epoch 13/40
30/30 [==============================] - 0s 16ms/step - loss: 0.3159 - accuracy: 0.8935 - val_loss: 0.3524 - val_accuracy: 0.8652
Epoch 14/40
30/30 [==============================] - 0s 16ms/step - loss: 0.2961 - accuracy: 0.8995 - val_loss: 0.3388 - val_accuracy: 0.8710
Epoch 15/40
30/30 [==============================] - 0s 16ms/step - loss: 0.2785 - accuracy: 0.9049 - val_loss: 0.3282 - val_accuracy: 0.8741
Epoch 16/40
30/30 [==============================] - 0s 16ms/step - loss: 0.2631 - accuracy: 0.9095 - val_loss: 0.3192 - val_accuracy: 0.8768
Epoch 17/40
30/30 [==============================] - 1s 17ms/step - loss: 0.2500 - accuracy: 0.9136 - val_loss: 0.3129 - val_accuracy: 0.8769
Epoch 18/40
30/30 [==============================] - 0s 16ms/step - loss: 0.2371 - accuracy: 0.9185 - val_loss: 0.3064 - val_accuracy: 0.8809
Epoch 19/40
30/30 [==============================] - 0s 16ms/step - loss: 0.2255 - accuracy: 0.9233 - val_loss: 0.3010 - val_accuracy: 0.8815
Epoch 20/40
30/30 [==============================] - 1s 17ms/step - loss: 0.2155 - accuracy: 0.9265 - val_loss: 0.2976 - val_accuracy: 0.8829
Epoch 21/40
30/30 [==============================] - 1s 17ms/step - loss: 0.2060 - accuracy: 0.9290 - val_loss: 0.2942 - val_accuracy: 0.8823
Epoch 22/40
30/30 [==============================] - 0s 17ms/step - loss: 0.1962 - accuracy: 0.9331 - val_loss: 0.2914 - val_accuracy: 0.8844
Epoch 23/40
30/30 [==============================] - 1s 17ms/step - loss: 0.1877 - accuracy: 0.9374 - val_loss: 0.2894 - val_accuracy: 0.8838
Epoch 24/40
30/30 [==============================] - 1s 17ms/step - loss: 0.1794 - accuracy: 0.9421 - val_loss: 0.2877 - val_accuracy: 0.8850
Epoch 25/40
30/30 [==============================] - 1s 17ms/step - loss: 0.1724 - accuracy: 0.9442 - val_loss: 0.2867 - val_accuracy: 0.8854
Epoch 26/40
30/30 [==============================] - 0s 16ms/step - loss: 0.1653 - accuracy: 0.9479 - val_loss: 0.2862 - val_accuracy: 0.8854
Epoch 27/40
30/30 [==============================] - 1s 17ms/step - loss: 0.1583 - accuracy: 0.9515 - val_loss: 0.2866 - val_accuracy: 0.8842
Epoch 28/40
30/30 [==============================] - 1s 17ms/step - loss: 0.1520 - accuracy: 0.9536 - val_loss: 0.2867 - val_accuracy: 0.8859
Epoch 29/40
30/30 [==============================] - 1s 17ms/step - loss: 0.1459 - accuracy: 0.9563 - val_loss: 0.2868 - val_accuracy: 0.8861
Epoch 30/40
30/30 [==============================] - 0s 16ms/step - loss: 0.1402 - accuracy: 0.9582 - val_loss: 0.2882 - val_accuracy: 0.8860
Epoch 31/40
30/30 [==============================] - 1s 17ms/step - loss: 0.1347 - accuracy: 0.9597 - val_loss: 0.2887 - val_accuracy: 0.8863
Epoch 32/40
30/30 [==============================] - 1s 17ms/step - loss: 0.1295 - accuracy: 0.9625 - val_loss: 0.2899 - val_accuracy: 0.8862
Epoch 33/40
30/30 [==============================] - 1s 17ms/step - loss: 0.1249 - accuracy: 0.9634 - val_loss: 0.2919 - val_accuracy: 0.8856
Epoch 34/40
30/30 [==============================] - 1s 18ms/step - loss: 0.1198 - accuracy: 0.9657 - val_loss: 0.2939 - val_accuracy: 0.8858
Epoch 35/40
30/30 [==============================] - 1s 17ms/step - loss: 0.1155 - accuracy: 0.9677 - val_loss: 0.2957 - val_accuracy: 0.8850
Epoch 36/40
30/30 [==============================] - 1s 18ms/step - loss: 0.1109 - accuracy: 0.9691 - val_loss: 0.2988 - val_accuracy: 0.8850
Epoch 37/40
30/30 [==============================] - 1s 17ms/step - loss: 0.1068 - accuracy: 0.9709 - val_loss: 0.3005 - val_accuracy: 0.8837
Epoch 38/40
30/30 [==============================] - 1s 19ms/step - loss: 0.1030 - accuracy: 0.9718 - val_loss: 0.3045 - val_accuracy: 0.8829
Epoch 39/40
30/30 [==============================] - 0s 17ms/step - loss: 0.0997 - accuracy: 0.9733 - val_loss: 0.3077 - val_accuracy: 0.8816
Epoch 40/40
30/30 [==============================] - 1s 17ms/step - loss: 0.0952 - accuracy: 0.9751 - val_loss: 0.3088 - val_accuracy: 0.8828
Two values are returned:
--Loss (a number representing the error; smaller is better)
--Accuracy
results = model.evaluate(test_data,  test_labels, verbose=2)
print(results)
782/782 - 1s - loss: 0.3297 - accuracy: 0.8723
[0.32968297600746155, 0.8723199963569641]
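As a sketch of how the trained model could be used (not part of the original tutorial), the sigmoid outputs for a few test reviews can be inspected directly:
predictions = model.predict(test_data[:3])
for prob, label in zip(predictions, test_labels[:3]):
    # prob[0] is the predicted probability that the review is positive (label 1)
    print("predicted probability: {:.3f}, actual label: {}".format(prob[0], label))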
Visualize the progress of training
Visualizing how training progresses makes it easier to see whether overfitting is occurring.
model.fit() returns a History object containing a dictionary that records everything that happened during training.
history_dict = history.history
history_dict.keys()
dict_keys(['loss', 'accuracy', 'val_loss', 'val_accuracy'])
import matplotlib.pyplot as plt
acc = history_dict['accuracy']
val_acc = history_dict['val_accuracy']
loss = history_dict['loss']
val_loss = history_dict['val_loss']
epochs = range(1, len(acc) + 1)
# "bo" is for "blue dot"
plt.plot(epochs, loss, 'bo', label='Training loss')
# b is for "solid blue line"
plt.plot(epochs, val_loss, 'b', label='Validation loss')
plt.title('Training and validation loss')
plt.xlabel('Epochs')
plt.ylabel('Loss')
plt.legend()
plt.show()

plt.clf()   #Clear the figure
plt.plot(epochs, acc, 'bo', label='Training acc')
plt.plot(epochs, val_acc, 'b', label='Validation acc')
plt.title('Training and validation accuracy')
plt.xlabel('Epochs')
plt.ylabel('Accuracy')
plt.legend()
plt.show()

In the graphs above, the dots are the training loss and accuracy, and the solid lines are the validation loss and accuracy.
The validation metrics level off after around 20 epochs. This is an example of overfitting: the model's performance is high on the training data but lower on data it has never seen. Beyond this point, the model is over-optimized and is learning internal representations that are specific to the training data and do not generalize to the test data.
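As a quick check, the epoch with the lowest validation loss can be found from the recorded history (a minimal sketch using the history_dict defined above):
best_epoch = int(np.argmin(history_dict['val_loss'])) + 1
print("Lowest validation loss at epoch {}".format(best_epoch))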
In this case, overfitting can be avoided by stopping training after about 20 epochs. To prevent overfitting, use callbacks or regularization (covered in other tutorials); a sketch of an early-stopping callback is shown below.
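A minimal sketch of early stopping with a Keras callback, assuming the model is rebuilt and recompiled before retraining (this is not part of this tutorial):
# Stop training when the validation loss has not improved for 3 epochs,
# and restore the weights from the best epoch
early_stop = keras.callbacks.EarlyStopping(monitor='val_loss',
                                           patience=3,
                                           restore_best_weights=True)
history = model.fit(partial_x_train,
                    partial_y_train,
                    epochs=40,
                    batch_size=512,
                    validation_data=(x_val, y_val),
                    callbacks=[early_stop],
                    verbose=1)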