Coursera's Machine Learning course has become the world's leading introduction to machine learning. This is the third post in a series where I reimplement the Matlab/Octave programming assignments in Python.
This time, the first half of ex3: recognizing handwritten digits using logistic regression. The dataset is a subset of MNIST: 5000 20x20-pixel grayscale images, provided in Matlab/Octave's .mat data format.
In fact, scikit-learn has a function called fetch_mldata() that can also download the full MNIST data (28x28 pixels, 70,000 images; see this article: [Handwritten digit recognition of MNIST with a Multilayer Perceptron](http://aidiary.hatenablog.com/entry/20140205/1391601418)), but this time I will use the .mat data above so the results are comparable.
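For reference, here is a minimal sketch of fetching the full MNIST set. Note that fetch_mldata() has been removed from recent scikit-learn versions, so this sketch uses its replacement, fetch_openml(), and assumes a reasonably new scikit-learn:

```python
# Minimal sketch: download the full 28x28 MNIST set (70,000 images).
# fetch_mldata() is gone from recent scikit-learn; fetch_openml() replaces it.
from sklearn.datasets import fetch_openml

mnist = fetch_openml('mnist_784', as_frame=False)
print(mnist.data.shape)    # (70000, 784)
print(mnist.target.shape)  # (70000,)
```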
The second half of ex3 only builds the forward-propagation part of a neural network and feels a bit unfinished, so I will skip it.
The data is already well-formed, so the code is simple: it just uses scikit-learn's LogisticRegression class. Matlab's .mat data can be read with SciPy's scipy.io.loadmat() function.
ex3.py
import numpy as np
import matplotlib.pyplot as plt
import scipy.io as scio
from sklearn import linear_model

# Load the Matlab-format data using scipy.io.loadmat()
data = scio.loadmat('ex3data1.mat')
X = data['X']          # X is a 5000x400 matrix (one 20x20 image per row)
y = data['y'].ravel()  # y is a 5000x1 matrix; ravel() flattens it to a 5000-dimensional vector

model = linear_model.LogisticRegression(penalty='l2', C=10.0)  # define the model
model.fit(X, y)            # fit on the training data
print(model.score(X, y))   # accuracy on the training data
When executed, the character-recognition accuracy on the training data was displayed as 0.96499999999999997, i.e. 96.5%.
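Incidentally, for a classifier score() is just plain accuracy, so the same number can be checked by hand; a small sketch reusing the variables above:

```python
# score() for a classifier is the fraction of samples whose
# predicted label matches the true label.
acc = np.mean(model.predict(X) == y)
print(acc)  # same value as model.score(X, y), about 0.965
```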
In Coursera, the parameter indicating the strength of regularization was set to $\lambda = 0.1$. As introduced in a previous article, the sklearn.linear_model.LogisticRegression class specifies regularization via $C$, which corresponds to the reciprocal of $\lambda$, so this time the model is defined with C=10.0.
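Concretely, the conversion is just

$$C = \frac{1}{\lambda} = \frac{1}{0.1} = 10$$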
As a result, the training accuracy was 96.5%, as mentioned above. The Matlab/Octave result was 94.9%, so perhaps this model is overfitting a little? I'm not sure where the difference comes from.
This alone is too easy, so I also wrote code to display the misclassified samples. With the model above, 175 of the 5000 training samples are misclassified. For 25 of them chosen at random, I display the (wrong) predicted label along with the image.
ex3-wrong.py
# Boolean mask of misclassifications -> flat vector of wrongly predicted indices
wrong_index = np.array(np.nonzero(np.array([model.predict(X) != y]).ravel())).ravel()
# Randomly pick 25 of the (here 175) misclassified indices, with replacement
wrong_sample_index = np.random.randint(0, len(wrong_index), 25)

fig = plt.figure()
plt.subplots_adjust(wspace=0.5, hspace=0.5)
for i in range(25):
    ax = fig.add_subplot(5, 5, i + 1)
    ax.axis('off')
    # The .mat images are stored column-major, hence the transpose (.T)
    ax.imshow(X[wrong_index[wrong_sample_index[i]]].reshape(20, 20).T,
              cmap=plt.get_cmap('gray'))
    # reshape(1, -1) because recent scikit-learn expects 2D input to predict()
    ax.set_title(str(model.predict(
        X[wrong_index[wrong_sample_index[i]]].reshape(1, -1))[0]))
plt.show()
model.predict(X) != y produces a boolean array that is True where the model misclassified a sample and False where it answered correctly. Feeding it to the np.nonzero() function yields the indices of the True entries. I call .ravel() twice because I want to end up with a flat vector of indices (a small demo of this extraction follows below). wrong_sample_index is an index array created to randomly pick 25 entries out of the vector of wrong indices (175 of them here) extracted in the first line.
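To illustrate the nonzero() trick in isolation, here is a standalone toy example (the arrays are made up purely for demonstration):

```python
import numpy as np

pred = np.array([3, 1, 4, 1, 5])
true = np.array([3, 1, 4, 2, 6])
mask = pred != true        # array([False, False, False,  True,  True])
print(np.nonzero(mask))    # (array([3, 4]),) -- a tuple of index arrays
print(np.nonzero(mask)[0]) # array([3, 4])    -- the flat index vector
```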
Using pyplot's subplot, I draw the 25 images in a 5x5 grid, and with set_title I display the (wrong) label the model assigned as the title of each image. The label is obtained with model.predict(X[wrong_index[wrong_sample_index[i]]].reshape(1, -1)), but since predict() returns a length-1 array rather than a scalar, it is extracted with [0].
The result is as follows.
There are some understandable mistakes, such as a 4 taken for a 9 and vice versa, but others are less convincing. Well, perhaps that is to be expected, since I simply ran logistic regression on the raw pixel data without extracting any features. It may also be overfitting a little, so the proper next step would be to select the regularization parameter $C$ with cross-validation, which appears in a later module of the Coursera course.
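For reference, a minimal sketch of what that selection might look like with scikit-learn's GridSearchCV; the candidate grid below is an arbitrary illustration, not Coursera's choice:

```python
# Hypothetical example: pick C by 5-fold cross-validation.
# The candidate values are made-up illustrations only.
from sklearn.model_selection import GridSearchCV
from sklearn.linear_model import LogisticRegression

param_grid = {'C': [0.1, 1.0, 10.0, 100.0]}
search = GridSearchCV(LogisticRegression(penalty='l2'), param_grid, cv=5)
search.fit(X, y)
print(search.best_params_, search.best_score_)
```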