Following on from Aiming to improve prediction accuracy with Kaggle / MNIST (1. Create a CNN according to the tutorial), this time we aim to improve accuracy by increasing the filter size of the CNN we created. The prediction accuracy was 0.99035, higher than the previous 0.98792.
In the previous article (Aiming to improve prediction accuracy with Kaggle / MNIST (1. Create a CNN according to the tutorial)), the convolutional neural network (CNN) from the TensorFlow tutorial achieved a prediction accuracy of 0.98792.
This time, we take another look at the MNIST images themselves and consider measures to improve accuracy.
In data processing, it is important to inspect the data in a form close to the raw input, so let's take a look at the MNIST images here. (The full script to display them is linked at the end of this article; a minimal sketch follows below.)
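As a taste of what show_digit.py does, here is a minimal sketch, assuming the Kaggle train.csv layout with a "label" column plus 784 pixel columns (this is my illustration, not the original script):
import matplotlib.pyplot as plt
import pandas as pd

train_data = pd.read_csv("/kaggle/input/digit-recognizer/train.csv")
pixels = train_data.drop(columns="label").values

# Show the first 40 digits in a 4 x 10 grid, titled with their ground-truth labels
fig, axes = plt.subplots(4, 10, figsize=(10, 4))
for i, ax in enumerate(axes.flat):
    ax.imshow(pixels[i].reshape(28, 28), cmap="gray")
    ax.set_title(train_data["label"][i])
    ax.axis("off")
plt.show()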
What struck me when looking at these images is that **there are many bold digits**. The previous CNN used a (3, 3) filter in tensorflow.keras.layers.Conv2D, and I suspected this might be too small to capture the features of thick strokes.
Therefore, I decided to change the filter of the first layer from (3, 3) to (5, 5).
Read the data and scale the pixel values to the range 0 to 1. This part is the same as last time.
digit-recognition_CNN1e_1.py
# This Python 3 environment comes with many helpful analytics libraries installed
# It is defined by the kaggle/python Docker image: https://github.com/kaggle/docker-python
# For example, here's several helpful packages to load
import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
# Input data files are available in the read-only "../input/" directory
# For example, running this (by clicking run or pressing Shift+Enter) will list all files under the input directory
import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))
# You can write up to 20GB to the current directory (/kaggle/working/) that gets preserved as output when you create a version using "Save & Run All"
# You can also write temporary files to /kaggle/temp/, but they won't be saved outside of the current session
# Load data
train_data = pd.read_csv("/kaggle/input/digit-recognizer/train.csv")
test_data = pd.read_csv("/kaggle/input/digit-recognizer/test.csv")
train_data_len = len(train_data)
test_data_len = len(test_data)
print("Length of train_data ; {}".format(train_data_len))
print("Length of test_data ; {}".format(test_data_len))
# Length of train_data ; 42000
# Length of test_data ; 28000
train_data_y = train_data["label"]
train_data_x = train_data.drop(columns="label")
train_data_x = train_data_x.astype('float64').values.reshape((train_data_len, 28, 28, 1))
test_data = test_data.astype('float64').values.reshape((test_data_len, 28, 28, 1))
train_data_x /= 255.0
test_data /= 255.0
This time, we want to check the accuracy at each epoch, so we split the data into training data and validation data using sklearn.model_selection.train_test_split:

- Training data: X, y
- Validation data: X_cv, y_cv
digit-recognition_CNN1e_1.py
from sklearn.model_selection import train_test_split
X, X_cv, y, y_cv = train_test_split(train_data_x, train_data_y, test_size=0.2, random_state=0)
With test_size=0.2, the data was split into training : validation = 8 : 2. The specific numbers of samples are:

- Training data: 42000 × 0.8 = 33600
- Validation data: 42000 × 0.2 = 8400

Incidentally, the default batch size used in training later is 32, and 32 × 1050 = 33600, so 33600 happens to be a convenient number (verified in the quick check below).
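As a quick sanity check (not part of the original scripts), the steps-per-epoch arithmetic can be confirmed directly:
import math
batch_size = 32                          # Keras default batch size for model.fit
n_train = int(42000 * 0.8)
print(n_train)                           # 33600
print(math.ceil(n_train / batch_size))   # 1050 steps per epoch, matching "1050/1050" in the log below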
Based on the previous CNN, change the filter of the first layer from (3, 3) to (5, 5).
digit-recognition_CNN1e_1.py
# Create CNN model
import tensorflow as tf
from tensorflow.keras import layers, models
model = models.Sequential()
# model.add(layers.Conv2D(32, (3, 3), activation='relu', input_shape=(28, 28, 1))) #Last time
model.add(layers.Conv2D(32, (5, 5), activation='relu', input_shape=(28, 28, 1))) #this time
model.add(layers.MaxPooling2D((2, 2)))
model.add(layers.Conv2D(64, (3, 3), activation='relu'))
model.add(layers.MaxPooling2D((2, 2)))
model.add(layers.Conv2D(64, (3, 3), activation='relu'))
model.add(layers.Flatten())
model.add(layers.Dense(64, activation='relu'))
model.add(layers.Dense(10, activation='softmax'))
model.summary()
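As a rough check of what the filter change costs, the parameter count of a Conv2D layer is out_channels × (kH × kW × in_channels + 1 bias), so the first layer grows as follows (a small aside, not part of the original script):
def conv2d_params(out_ch, k, in_ch):
    # kernel weights plus one bias per output channel
    return out_ch * (k * k * in_ch + 1)

print(conv2d_params(32, 3, 1))  # 320 parameters with the (3, 3) filter
print(conv2d_params(32, 5, 1))  # 832 parameters with the (5, 5) filter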
This time, we train for 20 epochs and search for the optimal number of epochs. The validation data is passed via the validation_data option, and the return value of model.fit is stored in a variable called history so that we can examine the accuracy later.
digit-recognition_CNN1e_1.py
# Compile
model.compile(optimizer='adam',
loss='sparse_categorical_crossentropy',
metrics=['accuracy'])
# Fit
history = model.fit(X, y, validation_data=(X_cv, y_cv), epochs=20)
The training log is shown below.
Epoch 1/20
1050/1050 [==============================] - 3s 3ms/step - loss: 0.1967 - accuracy: 0.9399 - val_loss: 0.0809 - val_accuracy: 0.9752
Epoch 2/20
1050/1050 [==============================] - 3s 3ms/step - loss: 0.0601 - accuracy: 0.9815 - val_loss: 0.0607 - val_accuracy: 0.9810
Epoch 3/20
1050/1050 [==============================] - 3s 3ms/step - loss: 0.0429 - accuracy: 0.9867 - val_loss: 0.0503 - val_accuracy: 0.9843
Epoch 4/20
1050/1050 [==============================] - 3s 3ms/step - loss: 0.0333 - accuracy: 0.9893 - val_loss: 0.0479 - val_accuracy: 0.9852
Epoch 5/20
1050/1050 [==============================] - 3s 3ms/step - loss: 0.0265 - accuracy: 0.9913 - val_loss: 0.0396 - val_accuracy: 0.9873
Epoch 6/20
1050/1050 [==============================] - 3s 3ms/step - loss: 0.0221 - accuracy: 0.9924 - val_loss: 0.0464 - val_accuracy: 0.9875
Epoch 7/20
1050/1050 [==============================] - 3s 3ms/step - loss: 0.0181 - accuracy: 0.9941 - val_loss: 0.0514 - val_accuracy: 0.9862
Epoch 8/20
1050/1050 [==============================] - 3s 3ms/step - loss: 0.0162 - accuracy: 0.9946 - val_loss: 0.0524 - val_accuracy: 0.9850
Epoch 9/20
1050/1050 [==============================] - 3s 3ms/step - loss: 0.0134 - accuracy: 0.9958 - val_loss: 0.0379 - val_accuracy: 0.9888
Epoch 10/20
1050/1050 [==============================] - 4s 4ms/step - loss: 0.0120 - accuracy: 0.9958 - val_loss: 0.0458 - val_accuracy: 0.9890
Epoch 11/20
1050/1050 [==============================] - 3s 3ms/step - loss: 0.0100 - accuracy: 0.9968 - val_loss: 0.0378 - val_accuracy: 0.9899
Epoch 12/20
1050/1050 [==============================] - 3s 3ms/step - loss: 0.0084 - accuracy: 0.9971 - val_loss: 0.0568 - val_accuracy: 0.9873
Epoch 13/20
1050/1050 [==============================] - 3s 3ms/step - loss: 0.0102 - accuracy: 0.9968 - val_loss: 0.0495 - val_accuracy: 0.9892
Epoch 14/20
1050/1050 [==============================] - 3s 3ms/step - loss: 0.0083 - accuracy: 0.9971 - val_loss: 0.0399 - val_accuracy: 0.9913
Epoch 15/20
1050/1050 [==============================] - 3s 3ms/step - loss: 0.0072 - accuracy: 0.9977 - val_loss: 0.0404 - val_accuracy: 0.9901
Epoch 16/20
1050/1050 [==============================] - 3s 3ms/step - loss: 0.0058 - accuracy: 0.9980 - val_loss: 0.0504 - val_accuracy: 0.9906
Epoch 17/20
1050/1050 [==============================] - 3s 3ms/step - loss: 0.0088 - accuracy: 0.9972 - val_loss: 0.0658 - val_accuracy: 0.9885
Epoch 18/20
1050/1050 [==============================] - 3s 3ms/step - loss: 0.0042 - accuracy: 0.9986 - val_loss: 0.0609 - val_accuracy: 0.9883
Epoch 19/20
1050/1050 [==============================] - 3s 3ms/step - loss: 0.0082 - accuracy: 0.9976 - val_loss: 0.0636 - val_accuracy: 0.9876
Epoch 20/20
1050/1050 [==============================] - 3s 3ms/step - loss: 0.0065 - accuracy: 0.9981 - val_loss: 0.0520 - val_accuracy: 0.9895
Check the following two series, plotting them against the number of epochs on the horizontal axis:

- accuracy: prediction accuracy on the training data
- val_accuracy: prediction accuracy on the validation data
digit-recognition_CNN1e_1.py
# draw graph of accuracy and val_accuracy
import matplotlib.pyplot as plt
plt.plot(history.history['accuracy'], label='accuracy')
plt.plot(history.history['val_accuracy'], label = 'val_accuracy')
plt.xlabel('Epoch')
plt.ylabel('Accuracy')
plt.ylim([0.9, 1])
plt.legend(loc='lower right')
plt.show()
Note that the epoch numbers in the log start at 1, while the x-axis of the graph starts at 0. (I was careless.)
From the log and graph above, val_accuracy (accuracy on the validation data) is highest at epoch 14, so we judge epochs = 14 to be optimal.
Strictly speaking, I should also have checked loss (training loss) and val_loss (validation loss) for signs of overfitting, but I did not notice this at the time :-p
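Rather than reading the log by eye, the best epoch can also be picked programmatically; a minimal sketch (note the +1 to convert the 0-based index back to the 1-based epoch numbers in the log):
import numpy as np
best_epoch = np.argmax(history.history['val_accuracy']) + 1  # +1: the log counts epochs from 1
print(best_epoch)  # 14 for the run above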
The steps up to reading the data and creating the CNN are the same as before, so they are omitted. Train with epochs = 14 as shown below.
digit-recognition_CNN1e_2.py
# Compile
model.compile(optimizer='adam',
loss='sparse_categorical_crossentropy',
metrics=['accuracy'])
# Maximum of val_accuracy ; epochs = 14
history = model.fit(train_data_x, train_data_y, epochs=14)
As before, predict on the test data and save the results to a CSV file.
digit-recognition_CNN1e_2.py
# Prediction
prediction = model.predict_classes(test_data, verbose=0)
output = pd.DataFrame({"ImageId" : np.arange(1, 28000+1), "Label":prediction})
output.to_csv('digit_recognizer_CNN1e_epochs14.csv', index=False)
print("Your submission was successfully saved!")
For future script improvements, we also save the model's predictions on the training data. Two kinds of output are stored here (see the note after the code regarding newer TensorFlow versions):

- model.predict_classes: the predicted labels
- model.predict_proba: the probability for each label
digit-recognition_CNN1e_2.py
# Save probability for further study
pred = model.predict_classes(train_data_x, verbose=0)
pred_proba = model.predict_proba(train_data_x, verbose=0)
pred_df = pd.DataFrame(pred, index=np.arange(train_data_len), columns=["Prediction"])
pred_proba_df = pd.DataFrame(pred_proba, index=np.arange(train_data_len), columns=["p{}".format(i) for i in range(10)])
output = pd.concat([pred_df, pred_proba_df], axis=1)
output.to_csv("prediction_CNN1e_epochs14.csv", index=False)
print("Your prediction was successfully saved!")
| No | Explanation | Score |
|---|---|---|
| Ref | SVM | 0.98375 |
| 01 | As per the tutorial | 0.98792 |
| 02 | (3, 3) → (5, 5) | 0.99035 |
This exceeds the previous score obtained by following the tutorial. I cannot say for certain that the hypothesis "a larger filter is needed to capture the features of thick digits" was correct, but the score did improve.
Going forward, I will examine the misclassified samples in the saved data and consider improvements that address them.
Also, this time "searching for the optimal number of epochs" and "predicting with the optimal number of epochs" were split into two scripts. This approach is time-consuming, so I will revisit that part as well.
Web site

- Aiming to improve prediction accuracy with Kaggle / MNIST (1. Create a CNN according to the tutorial)
- show_digit.py: script to display MNIST data
- digit-recognition_CNN1e_1.py: script to find the optimal number of epochs
- digit-recognition_CNN1e_2.py: script that predicts using the optimal number of epochs