Coursera's Machine Learning course has become the world's leading introduction to machine learning. This is the third post in a series where I reimplement the Matlab/Octave programming assignments in Python.
This time, the first half of ex3: recognizing handwritten digits using logistic regression. The dataset is a subset of MNIST: 5000 20x20-pixel grayscale images, provided in Matlab/Octave's .mat data format.
In fact, scikit-learn has a function called fetch_mldata() that can also download the full MNIST data (28x28 pixels, 70,000 images; see this article: [Handwritten digit recognition of MNIST with a Multilayer Perceptron](http://aidiary.hatenablog.com/entry/20140205/1391601418)), but this time I will use the .mat data above so the results are comparable.
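For reference, here is a minimal sketch of fetching the full MNIST set. Note that fetch_mldata() has been removed from recent scikit-learn versions, so this sketch uses its replacement, fetch_openml(), and assumes a reasonably new scikit-learn:

```python
# Minimal sketch: download the full 28x28 MNIST set (70,000 images).
# fetch_mldata() is gone from recent scikit-learn; fetch_openml() replaces it.
from sklearn.datasets import fetch_openml

mnist = fetch_openml('mnist_784', as_frame=False)
print(mnist.data.shape)    # (70000, 784)
print(mnist.target.shape)  # (70000,)
```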
The second half of ex3 only builds the forward-propagation part of a neural network and feels a bit unfinished, so I will skip it.
The data is already well-formed, so the code is simple: it just uses scikit-learn's LogisticRegression class. Matlab's .mat data can be read with SciPy's scipy.io.loadmat() function.
ex3.py
import numpy as np
import matplotlib.pyplot as plt
import scipy.io as scio
from sklearn import linear_model

# Load the Matlab-format data using scipy.io.loadmat()
data = scio.loadmat('ex3data1.mat')
X = data['X']          # X is a 5000x400 matrix (one 20x20 image per row)
y = data['y'].ravel()  # y is a 5000x1 matrix; ravel() flattens it to a 5000-dimensional vector

model = linear_model.LogisticRegression(penalty='l2', C=10.0)  # define the model
model.fit(X, y)            # fit on the training data
print(model.score(X, y))   # accuracy on the training data
When executed, the character-recognition accuracy on the training data was displayed as 0.96499999999999997, i.e. 96.5%.
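Incidentally, for a classifier score() is just plain accuracy, so the same number can be checked by hand; a small sketch reusing the variables above:

```python
# score() for a classifier is the fraction of samples whose
# predicted label matches the true label.
acc = np.mean(model.predict(X) == y)
print(acc)  # same value as model.score(X, y), about 0.965
```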
In Coursera, the parameter indicating the strength of regularization was set to $\lambda = 0.1$. As introduced in a previous article, the sklearn.linear_model.LogisticRegression class specifies regularization via $C$, which corresponds to the reciprocal of $\lambda$, so this time the model is defined with C=10.0.
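Concretely, the conversion is just

$$C = \frac{1}{\lambda} = \frac{1}{0.1} = 10$$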
As a result, the training accuracy was 96.5%, as mentioned above. The Matlab/Octave result was 94.9%, so perhaps this model is overfitting a little? I'm not sure where the difference comes from.
This alone is too easy, so I also wrote code to display the misclassified samples. With the model above, 175 of the 5000 training samples are misclassified. For 25 of them chosen at random, I display the (wrong) predicted label along with the image.
ex3-wrong.py
# Boolean mask of misclassifications -> flat vector of wrongly predicted indices
wrong_index = np.array(np.nonzero(np.array([model.predict(X) != y]).ravel())).ravel()
# Randomly pick 25 of the (here 175) misclassified indices, with replacement
wrong_sample_index = np.random.randint(0, len(wrong_index), 25)

fig = plt.figure()
plt.subplots_adjust(wspace=0.5, hspace=0.5)
for i in range(25):
    ax = fig.add_subplot(5, 5, i + 1)
    ax.axis('off')
    # The .mat images are stored column-major, hence the transpose (.T)
    ax.imshow(X[wrong_index[wrong_sample_index[i]]].reshape(20, 20).T,
              cmap=plt.get_cmap('gray'))
    # reshape(1, -1) because recent scikit-learn expects 2D input to predict()
    ax.set_title(str(model.predict(
        X[wrong_index[wrong_sample_index[i]]].reshape(1, -1))[0]))
plt.show()
model.predict(X) != y produces a boolean array that is True where the model misclassified a sample and False where it answered correctly. Feeding it to the np.nonzero() function yields the indices of the True entries. I call .ravel() twice because I want to end up with a flat vector of indices (a small demo of this extraction follows below). wrong_sample_index is an index array created to randomly pick 25 entries out of the vector of wrong indices (175 of them here) extracted in the first line.
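To illustrate the nonzero() trick in isolation, here is a standalone toy example (the arrays are made up purely for demonstration):

```python
import numpy as np

pred = np.array([3, 1, 4, 1, 5])
true = np.array([3, 1, 4, 2, 6])
mask = pred != true        # array([False, False, False,  True,  True])
print(np.nonzero(mask))    # (array([3, 4]),) -- a tuple of index arrays
print(np.nonzero(mask)[0]) # array([3, 4])    -- the flat index vector
```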
Using pyplot's subplot, I draw the 25 images in a 5x5 grid, and with set_title I display the (wrong) label the model assigned as the title of each image. The label is obtained with model.predict(X[wrong_index[wrong_sample_index[i]]].reshape(1, -1)), but since predict() returns a length-1 array rather than a scalar, it is extracted with [0].
The result is as follows.
There are some understandable mistakes, such as a 4 taken for a 9 and vice versa, but others are less convincing. Well, perhaps that is to be expected, since I simply ran logistic regression on the raw pixel data without extracting any features. It may also be overfitting a little, so the proper next step would be to select the regularization parameter $C$ with cross-validation, which appears in a later module of the Coursera course.
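For reference, a minimal sketch of what that selection might look like with scikit-learn's GridSearchCV; the candidate grid below is an arbitrary illustration, not Coursera's choice:

```python
# Hypothetical example: pick C by 5-fold cross-validation.
# The candidate values are made-up illustrations only.
from sklearn.model_selection import GridSearchCV
from sklearn.linear_model import LogisticRegression

param_grid = {'C': [0.1, 1.0, 10.0, 100.0]}
search = GridSearchCV(LogisticRegression(penalty='l2'), param_grid, cv=5)
search.fit(X, y)
print(search.best_params_, search.best_score_)
```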