Aim to improve prediction accuracy with Kaggle / MNIST (2. Change filter size)

Summary

- Starting from the CNN built in Aiming to improve prediction accuracy with Kaggle / MNIST (1. Create CNN according to the tutorial), we try to improve accuracy by increasing the filter size of the first convolution layer.
- The prediction accuracy was 0.99035, higher than the previous 0.98792.

Introduction

In the previous article (Aiming to improve prediction accuracy with Kaggle / MNIST (1. Create CNN according to the tutorial)), I built a convolutional neural network (CNN) following the TensorFlow tutorial and obtained a prediction accuracy of 0.98792.

This time, we take another look at the MNIST images and consider measures to improve accuracy.

Check the MNIST image

In data processing, it is important to look at the data in a form as close to raw as possible, so let's take a look at the MNIST images here. (The script used to display them is linked at the end of this article.)

[Figure: MNIST images]
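
A minimal sketch of how such a grid of digits can be displayed (the actual show_digit.py is linked at the end of this article; the 4 × 8 grid size here is an arbitrary choice for illustration):

# Minimal sketch of a digit viewer (the actual show_digit.py is linked at the end of this article)
import pandas as pd
import matplotlib.pyplot as plt

train = pd.read_csv("/kaggle/input/digit-recognizer/train.csv")

fig, axes = plt.subplots(4, 8, figsize=(8, 4))
for i, ax in enumerate(axes.flat):
    # each row of train.csv is a label followed by 784 pixel values (28 x 28)
    ax.imshow(train.iloc[i, 1:].values.reshape(28, 28), cmap="gray_r")
    ax.set_title(train.iloc[i, 0])
    ax.axis("off")
plt.tight_layout()
plt.show()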

What struck me when looking at these images is that **there are many thick, bold digits**. The previous CNN used a (3, 3) filter in tensorflow.keras.layers.Conv2D, and I suspected this might be too small to capture the features of such thick strokes.

Therefore, we decided to change the filter of the first layer from (3, 3) to (5, 5).

Creating the CNN and searching for the optimal number of epochs

Data preparation

Read the data and convert it to the range 0 to 1. This part is the same as last time.

digit-recognition_CNN1e_1.py


# This Python 3 environment comes with many helpful analytics libraries installed
# It is defined by the kaggle/python Docker image: https://github.com/kaggle/docker-python
# For example, here's several helpful packages to load

import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)

# Input data files are available in the read-only "../input/" directory
# For example, running this (by clicking run or pressing Shift+Enter) will list all files under the input directory

import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

# You can write up to 20GB to the current directory (/kaggle/working/) that gets preserved as output when you create a version using "Save & Run All"
# You can also write temporary files to /kaggle/temp/, but they won't be saved outside of the current session

# Load data
train_data = pd.read_csv("/kaggle/input/digit-recognizer/train.csv")
test_data = pd.read_csv("/kaggle/input/digit-recognizer/test.csv")

train_data_len = len(train_data)
test_data_len = len(test_data)
print("Length of train_data ; {}".format(train_data_len))
print("Length of test_data ; {}".format(test_data_len))

# Length of train_data ; 42000
# Length of test_data ; 28000

train_data_y = train_data["label"]
train_data_x = train_data.drop(columns="label")

train_data_x = train_data_x.astype('float64').values.reshape((train_data_len, 28, 28, 1))
test_data = test_data.astype('float64').values.reshape((test_data_len, 28, 28, 1))
train_data_x /= 255.0
test_data /= 255.0

Splitting into training and validation data

This time, we want to check the accuracy at each epoch, so we split the data into training data and validation data using sklearn.model_selection.train_test_split.

- Training data: X, y
- Validation data: X_cv, y_cv

digit-recognition_CNN1e_1.py


from sklearn.model_selection import train_test_split
X, X_cv, y, y_cv = train_test_split(train_data_x, train_data_y, test_size=0.2, random_state=0)

With test_size=0.2, the data is split into training : validation = 8 : 2. The resulting numbers are:

- Training data: 42000 * 0.8 = 33600
- Validation data: 42000 * 0.2 = 8400

Incidentally, the default batch size for the later training is 32, and 32 * 1050 = 33600, so 33600 happens to be a convenient number (exactly 1050 batches per epoch).
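
As a quick sanity check of these numbers (a small sketch using the variables from the split above):

# Sanity check of the split sizes and the number of batches per epoch (default batch size is 32)
print(X.shape, X_cv.shape)  # expected: (33600, 28, 28, 1) (8400, 28, 28, 1)
print(len(X) // 32)         # 1050, which matches the "1050/1050" steps shown in the training log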

Creating the CNN

Based on the previous CNN, change the filter of the first layer from (3, 3) to (5, 5).

digit-recognition_CNN1e_1.py


# Create CNN model
import tensorflow as tf
from tensorflow.keras import layers, models

model = models.Sequential()
# model.add(layers.Conv2D(32, (3, 3), activation='relu', input_shape=(28, 28, 1)))  # last time
model.add(layers.Conv2D(32, (5, 5), activation='relu', input_shape=(28, 28, 1)))  # this time
model.add(layers.MaxPooling2D((2, 2)))
model.add(layers.Conv2D(64, (3, 3), activation='relu'))
model.add(layers.MaxPooling2D((2, 2)))
model.add(layers.Conv2D(64, (3, 3), activation='relu'))
model.add(layers.Flatten())
model.add(layers.Dense(64, activation='relu'))
model.add(layers.Dense(10, activation='softmax'))
model.summary()
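
As a side note on what the change does: Conv2D uses padding='valid' by default, so each convolution shrinks the feature map by (kernel size - 1), and a larger filter also has more weights. A small back-of-the-envelope check (the formulas are standard; the print statements are just for illustration):

# Output size of a 'valid' convolution: input_size - kernel_size + 1
print(28 - 3 + 1)  # 26 -> first feature map was 26 x 26 with the (3, 3) filter
print(28 - 5 + 1)  # 24 -> first feature map is 24 x 24 with the (5, 5) filter

# Parameter count of the first layer: filters * (kernel_h * kernel_w * in_channels + 1)
print(32 * (3 * 3 * 1 + 1))  # 320 parameters last time
print(32 * (5 * 5 * 1 + 1))  # 832 parameters this time (also visible in model.summary())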

Compile and train

This time, we train for 20 epochs and search for the optimal number of epochs. The validation data is passed via the validation_data option, and the return value of model.fit is stored in a variable called history so that the accuracy can be inspected later.

digit-recognition_CNN1e_1.py


# Compile
model.compile(optimizer='adam',
              loss='sparse_categorical_crossentropy',
              metrics=['accuracy'])

# Fit
history = model.fit(X, y, validation_data=(X_cv, y_cv), epochs=20)

The training log is shown below.

Epoch 1/20
1050/1050 [==============================] - 3s 3ms/step - loss: 0.1967 - accuracy: 0.9399 - val_loss: 0.0809 - val_accuracy: 0.9752
Epoch 2/20
1050/1050 [==============================] - 3s 3ms/step - loss: 0.0601 - accuracy: 0.9815 - val_loss: 0.0607 - val_accuracy: 0.9810
Epoch 3/20
1050/1050 [==============================] - 3s 3ms/step - loss: 0.0429 - accuracy: 0.9867 - val_loss: 0.0503 - val_accuracy: 0.9843
Epoch 4/20
1050/1050 [==============================] - 3s 3ms/step - loss: 0.0333 - accuracy: 0.9893 - val_loss: 0.0479 - val_accuracy: 0.9852
Epoch 5/20
1050/1050 [==============================] - 3s 3ms/step - loss: 0.0265 - accuracy: 0.9913 - val_loss: 0.0396 - val_accuracy: 0.9873
Epoch 6/20
1050/1050 [==============================] - 3s 3ms/step - loss: 0.0221 - accuracy: 0.9924 - val_loss: 0.0464 - val_accuracy: 0.9875
Epoch 7/20
1050/1050 [==============================] - 3s 3ms/step - loss: 0.0181 - accuracy: 0.9941 - val_loss: 0.0514 - val_accuracy: 0.9862
Epoch 8/20
1050/1050 [==============================] - 3s 3ms/step - loss: 0.0162 - accuracy: 0.9946 - val_loss: 0.0524 - val_accuracy: 0.9850
Epoch 9/20
1050/1050 [==============================] - 3s 3ms/step - loss: 0.0134 - accuracy: 0.9958 - val_loss: 0.0379 - val_accuracy: 0.9888
Epoch 10/20
1050/1050 [==============================] - 4s 4ms/step - loss: 0.0120 - accuracy: 0.9958 - val_loss: 0.0458 - val_accuracy: 0.9890
Epoch 11/20
1050/1050 [==============================] - 3s 3ms/step - loss: 0.0100 - accuracy: 0.9968 - val_loss: 0.0378 - val_accuracy: 0.9899
Epoch 12/20
1050/1050 [==============================] - 3s 3ms/step - loss: 0.0084 - accuracy: 0.9971 - val_loss: 0.0568 - val_accuracy: 0.9873
Epoch 13/20
1050/1050 [==============================] - 3s 3ms/step - loss: 0.0102 - accuracy: 0.9968 - val_loss: 0.0495 - val_accuracy: 0.9892
Epoch 14/20
1050/1050 [==============================] - 3s 3ms/step - loss: 0.0083 - accuracy: 0.9971 - val_loss: 0.0399 - val_accuracy: 0.9913
Epoch 15/20
1050/1050 [==============================] - 3s 3ms/step - loss: 0.0072 - accuracy: 0.9977 - val_loss: 0.0404 - val_accuracy: 0.9901
Epoch 16/20
1050/1050 [==============================] - 3s 3ms/step - loss: 0.0058 - accuracy: 0.9980 - val_loss: 0.0504 - val_accuracy: 0.9906
Epoch 17/20
1050/1050 [==============================] - 3s 3ms/step - loss: 0.0088 - accuracy: 0.9972 - val_loss: 0.0658 - val_accuracy: 0.9885
Epoch 18/20
1050/1050 [==============================] - 3s 3ms/step - loss: 0.0042 - accuracy: 0.9986 - val_loss: 0.0609 - val_accuracy: 0.9883
Epoch 19/20
1050/1050 [==============================] - 3s 3ms/step - loss: 0.0082 - accuracy: 0.9976 - val_loss: 0.0636 - val_accuracy: 0.9876
Epoch 20/20
1050/1050 [==============================] - 3s 3ms/step - loss: 0.0065 - accuracy: 0.9981 - val_loss: 0.0520 - val_accuracy: 0.9895

Searching for the optimal number of epochs

We check the following two metrics, plotting them against the number of epochs on the horizontal axis:

- accuracy: prediction accuracy on the training data
- val_accuracy: prediction accuracy on the validation data

digit-recognition_CNN1e_1.py


# draw graph of accuracy and val_accuracy
import matplotlib.pyplot as plt

plt.plot(history.history['accuracy'], label='accuracy')
plt.plot(history.history['val_accuracy'], label = 'val_accuracy')
plt.xlabel('Epoch')
plt.ylabel('Accuracy')
plt.ylim([0.9, 1])
plt.legend(loc='lower right')
plt.show()

[Figure: accuracy and val_accuracy vs. epoch]

Note that the epoch numbers in the log start at 1, while the x-axis of the graph starts at 0. (I was careless here.)

From the log and graph above, val_accuracy (accuracy on the validation data) is highest at epoch 14, so we judge epochs = 14 to be optimal.
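
Instead of eyeballing the graph, the best epoch can also be read directly from the history object; a small sketch (the +1 converts the 0-based index back to the 1-based epoch number used in the log):

import numpy as np

best_index = np.argmax(history.history['val_accuracy'])
print("best epoch:", best_index + 1)                                      # 14 in this run
print("best val_accuracy:", history.history['val_accuracy'][best_index])  # 0.9913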

Note

Strictly speaking, I should also have checked loss (loss on the training data) and val_loss (loss on the validation data) to see whether the model was overfitting, but I did not think of it at the time :-p
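
For reference, that check only needs the same kind of plot with the loss keys; a sketch reusing the history and matplotlib import from above:

# If val_loss rises while loss keeps falling, the model is overfitting
plt.plot(history.history['loss'], label='loss')
plt.plot(history.history['val_loss'], label='val_loss')
plt.xlabel('Epoch')
plt.ylabel('Loss')
plt.legend(loc='upper right')
plt.show()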

Predict with the optimal number of epochs

Compile and train

The steps up to reading the data and creating the CNN are the same as above, so they are omitted. This time we train on the full training data with epochs = 14, as shown below.

digit-recognition_CNN1e_2.py


# Compile
model.compile(optimizer='adam',
              loss='sparse_categorical_crossentropy',
              metrics=['accuracy'])

# Maximum of val_accuracy ; epochs = 14
history = model.fit(train_data_x, train_data_y, epochs=14)

Predict and save the results

As before, we predict on the test data and save the results to a CSV file.

digit-recognition_CNN1e_2.py


# Prediction
prediction = model.predict_classes(test_data, verbose=0)
output = pd.DataFrame({"ImageId" : np.arange(1, 28000+1), "Label":prediction})

output.to_csv('digit_recognizer_CNN1e_epochs14.csv', index=False)
print("Your submission was successfully saved!")

Saving prediction results of training data

We also save the prediction results on the training data for use in future script improvements. Two kinds of output are stored:

- Sequential.predict_classes: the predicted labels
- Sequential.predict_proba: the probability of each label

digit-recognition_CNN1e_2.py


# Save probability for further study
pred = model.predict_classes(train_data_x, verbose=0)
pred_proba = model.predict_proba(train_data_x, verbose=0)

pred_df = pd.DataFrame(pred, index=np.arange(train_data_len), columns=["Prediction"])
pred_proba_df = pd.DataFrame(pred_proba, index=np.arange(train_data_len), columns=["p{}".format(i) for i in range(10)])

output = pd.concat([pred_df, pred_proba_df], axis=1)

output.to_csv("prediction_CNN1e_epochs14.csv", index=False)
print("Your prediction was successfully saved!")

Results

| No | Explanation | Score |
|:---|:---|:---|
| Ref | SVM | 0.98375 |
| 01 | As per the tutorial | 0.98792 |
| 02 | (3, 3) → (5, 5) | 0.99035 |

This exceeds the previous score obtained by following the tutorial. I cannot say for certain that the hypothesis of "increasing the filter size so that the features of thick digits can be captured" is what made the difference, but the score did improve.

Next, I plan to use the saved predictions to investigate the incorrectly predicted samples and consider improvements based on them.

Also, this time the work was split into two scripts, one to search for the optimal number of epochs and one to predict with it. This approach is time-consuming, so I will revisit that part as well; one possible direction is sketched below.
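
For example (a sketch, not something used in this article), a ModelCheckpoint callback could keep the weights of the best epoch during the first run, so that a second full training run would not be needed:

# Possible alternative (not used in this article): save the best-epoch weights during the search run
from tensorflow.keras.callbacks import ModelCheckpoint

checkpoint = ModelCheckpoint("best_model.h5", monitor="val_accuracy", save_best_only=True)
history = model.fit(X, y, validation_data=(X_cv, y_cv), epochs=20, callbacks=[checkpoint])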

References

Web site

- Aiming to improve prediction accuracy with Kaggle / MNIST (1. Create CNN according to the tutorial)

Sample scripts

- show_digit.py: script to display the MNIST data
- digit-recognition_CNN1e_1.py: script to search for the optimal number of epochs
- digit-recognition_CNN1e_2.py: script that trains with the optimal number of epochs and makes the predictions
