[PYTHON] TensorFlow MNIST For ML Beginners Translation

It's been about a month since TensorFlow was released, so I'm a bit late to the party, but as a first step toward trying it out I translated the beginner's tutorial "MNIST For ML Beginners". Please keep in mind that the translation may well contain mistakes; I'd be grateful if you could point any out.

Translation source: MNIST For ML Beginners


MNIST For ML Beginners (MNIST for machine learning beginners)

This tutorial is intended for readers who are new to both machine learning and TensorFlow. If you already know what MNIST is and what softmax (multinomial logistic) regression is, you may prefer the more advanced tutorial.

When you learn how to program, the first thing you typically do is print "Hello World". Just as programming has Hello World, machine learning has MNIST.

MNIST is a simple computer vision dataset. It consists of images of handwritten digits like these:

(Figure)

Each image also comes with a label telling us which digit it is. For example, the labels for the images above are 5, 0, 4, and 1.

In this tutorial, we train a model that looks at images and predicts what digits they are. Our goal isn't really to train a complex model that achieves cutting-edge performance (although we'll give you code for that later!), but rather to get a feel for TensorFlow. So we start with a very simple model, called softmax regression.

The actual code in this tutorial is very short, and all the interesting things happen in just three lines. However, it is very important to understand the ideas behind them: both how TensorFlow works and the core concepts of machine learning. For this reason, we work through the code very carefully.

The MNIST Data

The MNIST data is hosted on Yann LeCun's website. For your convenience, we have included some Python code that automatically downloads and installs the data. You can download this code and import it as shown below, or simply copy and paste it in. (Note: "copy and paste" means you can paste the code into the same file and use it there, instead of downloading input_data.py linked above and importing it as a separate file.)

import input_data
mnist = input_data.read_data_sets("MNIST_data/", one_hot=True)

The downloaded data is divided into two parts: 60,000 training examples (mnist.train) and 10,000 test examples (mnist.test). This split is very important. It's essential in machine learning to keep aside data that we don't train on, so we can check that what we've learned actually generalizes! (Note: the translation is suspicious around here. The point is probably that, in addition to the training data used for training, test data not used for training is indispensable.)

As mentioned above, every MNIST data point has two parts: an image of a handwritten digit and a corresponding label. We will call the images "xs" and the labels "ys". Both the training set and the test set (note: i.e. mnist.train and mnist.test) contain xs and ys. For example, the training images are mnist.train.images and the training labels are mnist.train.labels.

Each image is 28x28 pixels. We can interpret this as a big array of numbers.

(Figure)

We can flatten this array into a vector of 28x28 = 784 numbers. It doesn't matter how we flatten the array, as long as we are consistent between images. From this point of view, the MNIST images are just a bunch of points in a 784-dimensional vector space, with a very rich structure (warning: computationally intensive visualizations).

Flattening the data throws away information about the two-dimensional structure of the image. Isn't that bad? Well, the best computer vision methods do exploit this structure, and we will in later tutorials. (Note: the translation is suspicious here.) But the simple method we use here, softmax regression, does not.

mnist.train.images is a tensor (an n-dimensional array) with shape [60000, 784]. The first dimension (60000) indexes the images and the second dimension (784) indexes the pixels in each image. Each entry in the tensor is the intensity, between 0 and 1, of a particular pixel in a particular image.
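(Note: the following is my own addition, not part of the original tutorial. Once the data is loaded, you can print the shape yourself; the exact number of rows depends on how input_data.py splits the dataset, but the second dimension should be 784.)

print(mnist.train.images.shape)                             # e.g. (60000, 784): one flattened 28x28 image per row
print(mnist.train.images.min(), mnist.train.images.max())   # pixel intensities lie between 0.0 and 1.0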

(Figure)

Each label in MNIST is a number from 0 to 9, indicating which digit is drawn in the image. For the purposes of this tutorial, we want our labels as "one-hot vectors". A one-hot vector is a vector that is 0 in most dimensions and 1 in a single dimension. (Note: there is exactly one 1 in the vector.) In this case, the nth digit is represented as a vector that is 1 in the nth dimension. For example, 3 is [0, 0, 0, 1, 0, 0, 0, 0, 0, 0]. Therefore, mnist.train.labels is a [60000, 10] array of floats. (Note: there are 60000 training images of 28x28 = 784 pixels, which is why mnist.train.images has shape [60000, 784], and since those images are the digits 0-9, mnist.train.labels has shape [60000, 10].)
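(Note: the following illustration is my own addition. A one-hot vector can be built by hand with NumPy, and the shape of the label array can be checked the same way as above.)

import numpy as np

def one_hot(digit, num_classes=10):
    # e.g. one_hot(3) -> [0. 0. 0. 1. 0. 0. 0. 0. 0. 0.]
    vec = np.zeros(num_classes, dtype=np.float32)
    vec[digit] = 1.0
    return vec

print(one_hot(3))
print(mnist.train.labels.shape)  # e.g. (60000, 10): one one-hot row per image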

(Figure)

We are now ready to actually make our model!

Softmax Regressions

We know that every image in MNIST is a digit from 0 to 9. We want our model to look at an image and give a probability for each digit it might be. For example, our model might look at an image of a 9 and be 80% sure it's a 9, but give a 5% chance of it being an 8 (because of the top loop), and a small probability to the other digits because it isn't completely sure.

This is a classic case where softmax regression is a natural, simple model. If you want to assign probabilities to an object being one of several different things, softmax is the thing to do. Even later, when we train more sophisticated models, the final step will be a softmax layer.

Softmax regression has two steps: first we add up the evidence of our input being in certain classes, and then we convert that evidence into probabilities.

To tally up the evidence that a given image is in a particular class (Note: one of the digits 0-9), we take a weighted sum of the pixel intensities. The weight is negative if a pixel having high intensity is evidence against the image being in that class, and positive if it is evidence in favor.

The figure below shows the weights one model learned for each class. Red represents negative weights and blue represents positive weights.

(Figure)

We also add some extra evidence called a bias. Basically, we want to be able to say that some things are more likely independently of the input. The result is that the evidence for class $ i $, given an input $ x $, is:

evidence_i = \sum_{j} W_{i,j} x_j + b_i

where $ W_i $ are the weights and $ b_i $ is the bias for class $ i $, and $ j $ is an index over the pixels of the input image $ x $. We then convert the evidence tallies into our predicted probabilities $ y $ using the "softmax" function:

y = softmax(evidence)

Here softmax serves as an "activation function" or "link function", shaping the output of our linear function into the form we want, in this case a probability distribution over 10 classes. You can think of it as converting tallies of evidence into the probability of our input being in each class. It is defined as follows.

softmax(x) = normalize(exp(x))

If you expand this equation, you get:

softmax(x)_i = \frac {exp(x_i)}  {\sum_{j} exp(x_j)}

But it is often more helpful to think of softmax the first way: exponentiating its inputs and then normalizing them. The exponentiation means that one more unit of evidence multiplies the weight given to a hypothesis, and conversely, one less unit of evidence means a hypothesis gets only a fraction of its earlier weight. No hypothesis ever has zero or negative weight. Softmax then normalizes these weights so that they add up to one, forming a valid probability distribution. (To get more intuition about the softmax function, check out the section on it in Michael Nielsen's book, complete with an interactive visualization.)
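(Note: the next snippet is my own addition, not part of the original tutorial. It is a direct NumPy translation of the formula above; subtracting the maximum before exponentiating is a standard trick to avoid overflow and does not change the result.)

import numpy as np

def softmax(x):
    # Exponentiate the evidence, then normalize it into a probability distribution.
    e = np.exp(x - np.max(x))  # shift by the max for numerical stability
    return e / e.sum()

print(softmax(np.array([2.0, 1.0, 0.1])))  # roughly [0.66 0.24 0.10]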

You can picture our softmax regression as looking something like the following, although with many more $ x $s. For each output, we compute a weighted sum of the $ x $s, add a bias, and then apply softmax.

(Figure)

If we write it as a mathematical formula, we get:

\begin{bmatrix}
y_1 \\
y_2 \\
y_3 \\
\end{bmatrix} = softmax 
\begin{pmatrix}
W_{1,1} x_1 + W_{1,2} x_2 + W_{1,3} x_3 + b_1 \\
W_{2,1} x_1 + W_{2,2} x_2 + W_{2,3} x_3 + b_2\\
W_{3,1} x_1 + W_{3,2} x_2 + W_{3,3} x_3 + b_3 
\end{pmatrix}

We can "show this procedure as a vector" and turn it into matrix multiplication and vector addition. This is useful for computational efficiency. (It's also a useful way to think)

\begin{bmatrix}
y_1 \\
y_2 \\
y_3 \\
\end{bmatrix} = softmax 
\begin{pmatrix}
\begin{bmatrix}
W_{1,1} & W_{1,2} & W_{1,3} \\
W_{2,1} & W_{2,2} & W_{2,3} \\
W_{3,1} & W_{3,2} & W_{3,3}
\end{bmatrix}
\cdot
\begin{bmatrix}
x_1 \\
x_2 \\
x_3
\end{bmatrix}
+
\begin{bmatrix}
b_1 \\
b_2\\
b_3 
\end{bmatrix}
\end{pmatrix}

To make it more compact, we can simply write:

y = softmax(Wx + b)
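(Note: the following sketch is my own addition. It spells out the compact form above in plain NumPy for a single flattened image; with all-zero weights and biases every class simply gets probability 0.1.)

import numpy as np

W = np.zeros((784, 10), dtype=np.float32)   # weights: rows index pixels, columns index the 10 classes
b = np.zeros(10, dtype=np.float32)          # one bias per class
x = np.random.rand(784).astype(np.float32)  # a stand-in for one flattened 28x28 image

evidence = x.dot(W) + b                # weighted pixel sums plus biases, one per class
e = np.exp(evidence - evidence.max())  # softmax: exponentiate ...
y = e / e.sum()                        # ... and normalize
print(y)                               # all-zero W and b give 0.1 for every class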

Implementing the Regression

To do efficient numerical computation in Python, we typically use libraries like NumPy, which do expensive operations such as matrix multiplication outside Python, using highly efficient code implemented in another language. Unfortunately, there can still be a lot of overhead from switching back to Python for every operation. This overhead is especially bad if you want to run computations on GPUs or in a distributed way, where the cost of transferring data is high.

TensorFlow also does its heavy lifting outside Python, but it goes a step further to avoid this overhead. Instead of running a single expensive operation independently of Python, TensorFlow lets us describe a graph of interacting operations that run entirely outside Python. (A similar approach can be found in a few other machine learning libraries.)

To use TensorFlow, we first need to import it.

import tensorflow as tf

We describe these interacting operations by manipulating symbolic variables. Let's create one.

x = tf.placeholder(tf.float32, [None, 784])

$ x $ isn't a specific value. It's a placeholder, a value we'll supply when we ask TensorFlow to run a computation. We want to be able to feed in any number of MNIST images, each flattened into a 784-dimensional vector. We represent this as a 2-D tensor of floating-point numbers with shape [None, 784]. (Here None means the dimension can have any length.)

We also need weights and biases for our model. We could imagine treating these like additional inputs, but TensorFlow has an even better way to handle them: Variable. A Variable is a modifiable tensor that lives in TensorFlow's graph of interacting operations. It can be used and even modified by the computation. For machine learning applications, the model parameters are generally Variables.

W = tf.Variable(tf.zeros([784, 10]))
b = tf.Variable(tf.zeros([10]))

We create these Variables by giving tf.Variable the initial value of each Variable: in this case, we initialize both $ W $ and $ b $ as tensors full of zeros. Since we are going to learn $ W $ and $ b $, it doesn't matter very much what their initial values are.

Notice that $ W $ has shape [784, 10], because we want to multiply the 784-dimensional image vectors by it to produce 10-dimensional vectors of evidence for the different classes. $ b $ has shape [10], so we can add it to the output.

We can now implement our model. It's only one line!

y = tf.nn.softmax(tf.matmul(x, W) + b)

First, we multiply $ x $ by $ W $ with the expression tf.matmul(x, W). This is flipped relative to our equation, where we had $ Wx $, as a small trick to deal with $ x $ being a 2-D tensor holding multiple inputs. We then add $ b $, and finally apply tf.nn.softmax.
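(Note: the shape check below is my own addition, using NumPy only for illustration. Each row of $ x $ is one image, so multiplying on the right by $ W $ gives one row of class evidence per image, which is why the order is flipped.)

import numpy as np

x_batch = np.zeros((5, 784))       # 5 flattened images, one per row
W = np.zeros((784, 10))
b = np.zeros(10)

print((x_batch.dot(W) + b).shape)  # (5, 10): 10 class scores for each of the 5 images
# W.dot(x_batch) would fail: shapes (784, 10) and (5, 784) are not aligned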

That's it. It only took one line to define our model, after a couple of short setup lines. That isn't because TensorFlow is designed to make softmax regression particularly easy: it's just a very flexible way to describe many kinds of numerical computation, from machine learning models to physics simulations. And once defined, our model can run on different devices: your computer's CPU, GPUs, or even a phone!

Training

In order to train our model, we need to define what it means for the model to be good. Actually, in machine learning we usually define what it means for a model to be bad, called the cost or loss, and then try to make that badness as small as possible. But the two are equivalent.

One very common, very nice cost function is "cross-entropy". Surprisingly, cross-entropy arises from thinking about information-compressing codes in information theory, but it ends up being an important idea in many areas, from gambling to machine learning. It is defined as follows.

H_{y^{'}}(y) = -\sum_i y^{'}_i \log(y_i)

where $ y $ is our predicted probability distribution and $ y^{'} $ is the true distribution (the one-hot vector we feed in). In some rough sense, the cross-entropy measures how inefficient our predictions are at describing the truth. Going into more detail about cross-entropy is beyond the scope of this tutorial, but it's well worth understanding.

To implement cross-entropy we first need to add a new placeholder to feed in the correct answers.

y_ = tf.placeholder(tf.float32, [None, 10])

Then we can implement the cross-entropy, $-\sum_i y^{'}_i \log(y_i)$.

cross_entropy = -tf.reduce_sum(y_*tf.log(y))

First, tf.log computes the logarithm of each element of y. Then we multiply each element of y_ by the corresponding element of tf.log(y). Finally, tf.reduce_sum adds up all the elements of the tensor. (Note that this isn't just the cross-entropy of a single prediction, but the sum of the cross-entropies of all 100 images in the batch. How well we do on 100 data points describes how good our model is much better than a single data point would.)
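(Note: the following is my own addition. The same computation for a single example in plain NumPy looks like this; a confident, correct prediction gives a small cross-entropy, while a confident wrong one would give a large value.)

import numpy as np

y_true = np.array([0, 0, 0, 1, 0, 0, 0, 0, 0, 0], dtype=np.float32)  # one-hot label for "3"
y_pred = np.array([0.01, 0.01, 0.01, 0.9, 0.01, 0.01, 0.01, 0.01, 0.01, 0.02], dtype=np.float32)

print(-np.sum(y_true * np.log(y_pred)))  # about 0.105: low because the prediction is confident and correct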

Now that we know what we want our model to do, it's very easy to have TensorFlow train it to do it. Because TensorFlow knows the entire graph of your computation, it can automatically use the [backpropagation algorithm](http://colah.github.io/posts/2015-08-Backprop/) to efficiently determine how your variables affect the cost you want to minimize. It can then apply the optimization algorithm of your choice to modify the variables and reduce the cost.

train_step = tf.train.GradientDescentOptimizer(0.01).minimize(cross_entropy)

In this case, we ask TensorFlow to minimize cross_entropy using the gradient descent algorithm with a learning rate of 0.01. Gradient descent is a simple procedure: TensorFlow just shifts each variable a little bit in the direction that reduces the cost. But TensorFlow also provides many other optimization algorithms; using one is as simple as adjusting a single line.
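(Note: the line below is my own example, assuming this optimizer is available in your TensorFlow version; the learning rate is just an illustrative value.)

# Swap gradient descent for Adagrad by changing a single line (assumed available; illustrative learning rate).
train_step = tf.train.AdagradOptimizer(0.05).minimize(cross_entropy)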

What TensorFlow actually does here, behind the scenes, is add new operations to your graph that implement backpropagation and gradient descent. It then gives you back a single operation which, when run, takes one step of gradient descent training, slightly adjusting the variables to reduce the cost.

Now we have our model set up for training. The last thing we need before launching it is an operation to initialize the variables we created.

init = tf.initialize_all_variables()

We can now launch the model in a Session, and run the operation that initializes the variables.

sess = tf.Session()
sess.run(init)

Let's train. We perform 1000 training steps!

for i in range(1000):
  batch_xs, batch_ys = mnist.train.next_batch(100)
  sess.run(train_step, feed_dict={x: batch_xs, y_:batch_ys})

At each step of the loop, we grab a "batch" of 100 random data points from the training set. We then run train_step, feeding in the batch data to replace the placeholders.

Using small batches of random data is called stochastic training, in this case stochastic gradient descent. Ideally we'd like to use all of our data at every step of training, because that would give us a better sense of what we should be doing, but that's expensive. So instead we use a different small subset each time. Doing this is cheap and has much of the same benefit.

Evaluating Our Model

How good is our model?

Well, first let's figure out where we predicted the correct label. tf.argmax is an extremely useful function that gives you the index of the highest entry in a tensor along some axis. For example, tf.argmax(y, 1) is the label our model thinks is most likely for each input, while tf.argmax(y_, 1) is the correct label. We can use tf.equal to check whether our prediction matches the truth.

correct_prediction = tf.equal(tf.argmax(y,1), tf.argmax(y_,1))

This gives us a list of booleans. To determine what fraction is correct, we cast them to floating-point numbers and then take the mean. For example, [True, False, True, True] becomes [1, 0, 1, 1], and the mean is 0.75.
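(Note: the snippet below is my own addition, showing the same casting step with plain NumPy.)

import numpy as np

correct = np.array([True, False, True, True])
print(correct.astype(np.float32))         # [1. 0. 1. 1.]
print(correct.astype(np.float32).mean())  # 0.75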

accuracy = tf.reduce_mean(tf.cast(correct_prediction, "float"))

Finally, we ask for the accuracy on the test data.

print(sess.run(accuracy, feed_dict={x: mnist.test.images, y_: mnist.test.labels}))

This should be about 91%.

Is that good? Not really. In fact, it's pretty bad. This is because we're using a very simple model. With some small changes, we can get to 97%. The best models can get over 99.7% accuracy! (See this list of results for more information.)

What matters is that we learned from this model. Still, if you're not satisfied with these results, check out the [next tutorial](https://www.tensorflow.org/versions/master/tutorials/mnist/pros/index.html), where you'll learn how to build a more sophisticated model with TensorFlow!


The above is the translation of MNIST For ML Beginners.

In closing

The translation is shaky in places, but I think I now mostly understand what the tutorial is trying to do. There are a few formulas here and there, but all they do is multiply and add matrices; nothing too difficult is going on. TensorFlow takes care of the hard parts, like softmax regression and gradient descent, for us. I think that is the main thing to understand here.

Briefly summarized, the steps are:

- Read the training images and their correct labels
- Perform softmax regression
- Compute the cross-entropy
- Use error backpropagation to minimize the cross-entropy
- Repeat this 1000 times
- Evaluate and check the result on the test images
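(Note: the script below is my own consolidation of the snippets from the tutorial above, following these steps in order. It assumes the same old TensorFlow API and the input_data.py helper, and I haven't verified it against the real data yet.)

import tensorflow as tf
import input_data  # the helper script linked in the tutorial

# Read the training and test data (images and one-hot labels).
mnist = input_data.read_data_sets("MNIST_data/", one_hot=True)

# Softmax regression model: y = softmax(xW + b).
x = tf.placeholder(tf.float32, [None, 784])
W = tf.Variable(tf.zeros([784, 10]))
b = tf.Variable(tf.zeros([10]))
y = tf.nn.softmax(tf.matmul(x, W) + b)

# Cross-entropy between the predictions y and the true labels y_.
y_ = tf.placeholder(tf.float32, [None, 10])
cross_entropy = -tf.reduce_sum(y_ * tf.log(y))

# Minimize the cross-entropy with gradient descent (backpropagation).
train_step = tf.train.GradientDescentOptimizer(0.01).minimize(cross_entropy)

sess = tf.Session()
sess.run(tf.initialize_all_variables())

# Repeat the training step 1000 times on random mini-batches of 100 images.
for i in range(1000):
    batch_xs, batch_ys = mnist.train.next_batch(100)
    sess.run(train_step, feed_dict={x: batch_xs, y_: batch_ys})

# Evaluate the accuracy on the test images.
correct_prediction = tf.equal(tf.argmax(y, 1), tf.argmax(y_, 1))
accuracy = tf.reduce_mean(tf.cast(correct_prediction, "float"))
print(sess.run(accuracy, feed_dict={x: mnist.test.images, y_: mnist.test.labels}))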

Next, I'd like to actually put the code together myself, run it, and check whether it really can recognize handwritten digits.
