[PYTHON] Deep Learning! A story about the data itself, to read when you hit a wall after handwritten digit recognition

Deep Learning! It's artificial intelligence! Machine learning!

Bursting with enthusiasm as an engineer, you feel like you can do something amazing! With your boss's blessing, you skim deep-learning books and Qiita articles, install TensorFlow and Chainer, accumulate virtue by chanting the tutorial sutras, and the GPU you bought for games runs at full throttle in the name of learning ...

Accuracy 99.23%

This time: for the engineer who is satisfied with having implemented an engine that identifies MNIST and IRIS at better than 99%, **a story about getting to know MNIST in order to get out of MNIST**.

So, what kind of data is the familiar "MNIST", anyway?

Images of 28 rows and 28 columns → correct, but that's not the point right now. As a dataset it is 70,000 rows by 784 columns: 70,000 is the number of samples, and 784 is the number of dimensions when each 28×28 image is flattened into a vector. If you think of it as "each row is one data point to analyze, each column is one feature", you will have no problem until the next step (RNNs?), and support vector machines and k-means can be applied in exactly the same way.
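A minimal sketch of that 70,000 × 784 view, using a dummy array since the actual loading comes later in this article:

import numpy as np

images = np.zeros((70000, 28, 28))   # dummy stand-in for the 70,000 MNIST images
X = images.reshape(70000, 784)       # one row per sample, one column per pixel
print(X.shape)                       # (70000, 784)
print(X[0].reshape(28, 28).shape)    # back to (28, 28) when you want to display one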

Further observation of MNIST ...

The handwritten numbers appear to be arranged randomly.

This is also important: when collecting training data in practice, it is easy to end up with a large batch of 1s followed by a large batch of 2s. You need to shuffle them properly before you start learning. Try training on only the digits 0 to 8 and see whether the model can identify a 9. The predictions will probably be full of 0s and 4s.
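A minimal shuffling sketch, assuming x_train / y_train NumPy arrays as prepared later in this article; the point is to apply one shared permutation so data and labels stay in sync:

import numpy as np

perm = np.random.permutation(len(x_train))  # one shared permutation
x_train = x_train[perm]                     # shuffle the data ...
y_train = y_train[perm]                     # ... and the labels identically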

It contains nothing but digits.

How ideal. When collecting data in practice, an "A" or a "-" sometimes slips into what should be a purely numeric dataset because of some mishap. Data that has been properly sorted and labeled in advance is extremely valuable.

You can also have data labeled via crowdsourcing, so use it as needed. Collecting characters with CAPTCHAs was also a hot topic for a while: Luis von Ahn, "Massive-scale online collaboration" https://www.youtube.com/watch?v=-Ht4qiDRZE8

A data volume of 70,000 samples

Around 200 MB. Just right if you want to hold it in the memory of a modern PC. Even if you write the code while reading Qiita in Firefox and then run it on a whim, you can observe its behavior with realistic memory usage. On top of that, astonishingly good identification performance can be achieved even with an ordinary 3-layer perceptron.
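As a back-of-the-envelope check of that figure: 70,000 samples × 784 dimensions × 4 bytes (float32) is consistent with the "around 200 MB" above.

n_samples, n_dims = 70000, 784
print(n_samples * n_dims * 4 / 1024**2)  # ≈ 209, i.e. on the order of 200 MB as float32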

Black and white, 255 gray levels

It is not a huge, sparse matrix like you get in natural language processing; it is not variable-length and hard to survey at a glance like audio; and there is no need to juggle three color channels because it is not in color. It is wonderful that you can judge results at a glance even with several of them lined up side by side, and you can see whether the features that emerge in the hidden layers are good ones.
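For example, a minimal sketch for lining up several samples at a glance (it assumes mnist loaded as in the code further below; the indices are arbitrary):

import matplotlib.pyplot as plt
import matplotlib.cm as cm

fig, axes = plt.subplots(1, 10, figsize=(10, 1.5))
for i, ax in enumerate(axes):
    ax.imshow(mnist[0][0][i].reshape(28, 28), cmap=cm.Greys_r)  # i-th training image
    ax.set_title(int(mnist[0][1][i]))                           # its label
    ax.axis('off')
plt.show()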

What kind of data do you want to output?

Now that MNIST has been praised enough, think about what you actually want to do with it. This time, the goal is to understand the shape of the output data and the shape of the labels attached to the training data.

In the MNIST example

1,2,3,4,5,6,7,8,9,0

We output one of these. What kind of representation would be convenient?

Example 1. Output in one dimension: 6 => between 5.5 and 6.4; 7 => between 6.5 and 7.4

Example 2. Output in 10 dimensions: 6 => [0,0,0,0,0,1,0,0,0,0], 7 => [0,0,0,0,0,0,1,0,0,0], 8 => [0,0,0,0,0,0,0,1,0,0]

For MNIST, a multi-class classification problem, most tutorials use Example 2, and that is the appropriate choice. Example 1 forces the classes onto a numeric scale, implying that 6 and 7 are "close", which is not what classification means here. If you get this wrong, you may end up choosing the wrong loss function. Some loss functions will handle it gracefully, but if yours is not one of the common ones you may have to rework the labels of the training data... If you don't pin down the output format here, you can end up confusing multi-class classification with two-class classification, regression, and so on.
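A minimal one-hot sketch with np.eye, using the standard 0-9 index order (which is also how the MNIST integer labels below are encoded). Note that Chainer's softmax_cross_entropy, used later in this article, takes the integer labels directly, so this conversion is for illustration or for other setups:

import numpy as np

labels = np.array([6, 7, 8])    # integer class labels
one_hot = np.eye(10)[labels]    # one 10-dimensional row per label
print(one_hot[0])               # [0. 0. 0. 0. 0. 0. 1. 0. 0. 0.]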

In the end, the things to care about when handling data

When applied to MNIST

Let's actually check each of them.

WhatMnist.py


import pandas as pd
import numpy as np
# Chainer
import chainer
import chainer.functions as F
import chainer.links as L
from chainer import optimizers
from chainer import serializers, Variable
# Visualization (assumes Jupyter)
%matplotlib inline
import matplotlib.pyplot as plt
import matplotlib.cm as cm

For PKL

WhatMnist.py


# http://deeplearning.net/data/mnist/mnist.pkl.gz
# A tuple of (train, valid, test)
# train -> (data, label)
# valid -> (data, label)  very considerate: even the validation data is already split out
# test -> (data, label)
# Read the data with pandas
mnist = pd.read_pickle('mnist.pkl')

The dataset is available here: http://deeplearning.net/tutorial/gettingstarted.html

For CSV

WhatMnist.py


if False:  # disabled: alternative loading path for CSV
    # For pandas + CSV
    mnist = pd.read_csv('mnist.csv')
    # For NumPy + CSV
    mnist = np.loadtxt('mnist.csv')
    # Separate the label column (assumes the first column is the label)
    mnist_data, mnist_label = np.split(mnist, [1], axis=1)
    # Split the rows into training and test sets
    x_train, x_test = np.split(mnist_data, [50000])
    y_train, y_test = np.split(mnist_label, [50000])

Check data format

WhatMnist.py


print('## Dimensions and quantity')
print("train.data:{0}, train.label:{1}".format(mnist[0][0].shape, mnist[0][1].shape))
print("valid.data:{0}, valid.label:{1}".format(mnist[1][0].shape, mnist[1][1].shape))
print("test.data:{0}, test.label:{1}".format(mnist[2][0].shape, mnist[2][1].shape))

print('## Range and units')
print("train.data.max:{0}, train.data.min:{1}".format(np.max(mnist[0][0]), np.min(mnist[0][0])))
print("train.label.max:{0}, train.label.min:{1}".format(np.max(mnist[0][1]), np.min(mnist[0][1])))

print('## Ordering and output method')
print("head -n 30 label: {0}".format(mnist[0][1][:30]))

print('## Input method (read all at once and stuffed into an np.array)')
fig = plt.figure()
ax1 = fig.add_subplot(2,1,1)
ax2 = fig.add_subplot(2,1,2)
print('## How to check')
print('Display one arbitrary sample as a representative.')
ax1.imshow(mnist[0][0][40].reshape((28,28)), cmap = cm.Greys_r)

print('## Type and nature')
print('Visualize the frequency of each class with a histogram.')
ax2.hist(mnist[0][1], bins=range(11), alpha=0.9, color='b', density=True)  # density=True (formerly normed=True)

Dimensions and quantity

train.data:(50000, 784), train.label:(50000,)
valid.data:(10000, 784), valid.label:(10000,)
test.data:(10000, 784), test.label:(10000,)

Range and units

train.data.max:0.99609375, train.data.min:0.0
train.label.max:9, train.label.min:0

Ordering and output method

head -n 30 label: [5 0 4 1 9 2 1 3 1 4 3 5 3 6 1 7 2 8 6 9 4 0 9 1 1 2 4 3 2 7]

Input method (read all at once and stuffed into an np.array)

How to check

Display one arbitrary sample as a representative.

Type and nature

Visualize the frequency of each class with a histogram.

(array([ 0.09864, 0.11356, 0.09936, 0.10202, 0.09718, 0.09012, 0.09902, 0.1035 , 0.09684, 0.09976]), array([ 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10]), ...)

[Figure: mnisthist.png — histogram of class frequencies; each digit accounts for roughly 10% of the training set]

Give things names that are easy to call later

In Chainer, data is handled as float32 / int32, and as plain NumPy arrays (on the CPU).

WhatMnist.py


x_train = np.array(mnist[0][0], dtype=np.float32)
y_train = np.array(mnist[0][1], dtype=np.int32)
x_test = np.array(mnist[2][0], dtype=np.float32)
y_test = np.array(mnist[2][1], dtype=np.int32)
print('x_train:' + str(x_train.shape))
print('y_train:' + str(y_train.shape))
print('x_test:' + str(x_test.shape))
print('y_test:' + str(y_test.shape))

x_train:(50000, 784)
y_train:(50000,)
x_test:(10000, 784)
y_test:(10000,)

The rest is a familiar tutorial

WhatMnist.py


# Predictor: a simple 3-layer perceptron
class MLP(chainer.Chain):
    def __init__(self):
        super(MLP, self).__init__(
            l1=L.Linear(784, 100),
            l2=L.Linear(100, 100),
            l3=L.Linear(100, 10),
        )
    
    def __call__(self, x):
        h1 = F.relu(self.l1(x))
        h2 = F.relu(self.l2(h1))
        y = self.l3(h2)
        return y
    
# Computes loss and accuracy on top of the predictor
class Classifier(chainer.Chain):
    def __init__(self, predictor):
        super(Classifier, self).__init__(predictor=predictor)
        
    def __call__(self, x, t):
        y = self.predictor(x)
        self.loss = F.softmax_cross_entropy(y, t)
        self.accuracy = F.accuracy(y, t)
        return self.loss

model = Classifier(MLP())
optimizer = optimizers.SGD()
optimizer.setup(model)

batchsize = 100
datasize = 50000
for epoch in range(20):
    print('epoch %d' % epoch)
    indexes = np.random.permutation(datasize)
    for i in range(0, datasize, batchsize):
        x = Variable(x_train[indexes[i : i + batchsize]])
        t = Variable(y_train[indexes[i : i + batchsize]])
        optimizer.update(model, x, t)
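Before using the model, here is a minimal evaluation sketch (assuming model, x_test and y_test as above): calling the Classifier on the test set computes the loss and, as a side effect, the accuracy.

x = Variable(x_test)
t = Variable(y_test)
loss = model(x, t)   # the forward pass also sets model.accuracy
print('test loss: {0}, test accuracy: {1}'.format(loss.data, model.accuracy.data))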

Using the trained model

(Personally, I think this step is important.)

WhatMnist.py


n = 10
x = Variable(x_test[n:n+1])          # one test sample
v = model.predictor(x)               # raw 10-dimensional output
plt.imshow(x_test[n:n+1].reshape((28,28)), cmap = cm.Greys_r)
print(np.argmax(v.data))             # predicted class

0

[Figure: show.png — the displayed test image]

In conclusion

This goes back to when the wave of deep learning first arrived. When I was a small-fry graduate student grinding away at machine learning, I tended to focus on algorithms and coding, and often lost sight of the nature of the data itself. As the amount of data increases, you have to observe things that are hard to observe, and understanding their properties greatly affects performance. **If you get lost, look at the data.**

** Now, beyond the tutorial **
