[PYTHON] TensorFlow Tutorial-Convolutional Neural Network (Translation)

This is a translation of the TensorFlow tutorial on convolutional neural networks: http://www.tensorflow.org/tutorials/deep_cnn/index.html#convolutional-neural-networks. We would appreciate having any translation errors pointed out.


Note: This tutorial is intended for advanced TensorFlow users and assumes machine learning expertise and experience.

Overview

CIFAR-10 classification is a common benchmark problem in machine learning. The problem is to classify RGB 32x32 pixel images into 10 categories: airplanes, automobiles, birds, cats, deer, dogs, frogs, horses, ships, and trucks.

[Figure: sample CIFAR-10 images]

For more information, see the CIFAR-10 page and the Technical Report (http://www.cs.toronto.edu/~kriz/learning-features-2009-TR.pdf) by Alex Krizhevsky.

Goals

The goal of this tutorial is to build a relatively small convolutional neural network (CNN) for image recognition. In the process, this tutorial:

  1. Highlights a canonical organization for network architecture, training, and evaluation.
  2. Provides a template for building larger and more sophisticated models.

CIFAR-10 was chosen because it is complex enough to exercise much of TensorFlow's ability to scale to large models. At the same time, the model is small enough to train quickly, which makes it ideal for trying out new ideas and experimenting with new techniques.

Tutorial highlights

This CIFAR-10 tutorial demonstrates several important constructs for designing larger and more sophisticated models in TensorFlow.

It also provides a multi-GPU version of the model.

We hope this tutorial will be a starting point for building larger CNNs for image tasks in TensorFlow.

Model architecture

The model in this CIFAR-10 tutorial is a multi-layer architecture consisting of alternating convolutions and nonlinearities. These layers are followed by fully connected layers leading into a softmax classifier. The model follows the architecture described by Alex Krizhevsky, with a few differences in the top layers.

This model achieves a peak performance of approximately 86% accuracy within a few hours of training on a GPU. For details, see Model evaluation below and the code. It consists of 1,068,298 learnable parameters and requires approximately 19.5 million multiply-add operations to compute inference on a single image.

Code structure

The code for this tutorial can be found at tensorflow/models/image/cifar10/.

| File | Purpose |
| --- | --- |
| cifar10_input.py | Reads the native CIFAR-10 binary file format. |
| cifar10.py | Builds the CIFAR-10 model. |
| cifar10_train.py | Trains a CIFAR-10 model on a CPU or GPU. |
| cifar10_multi_gpu_train.py | Trains a CIFAR-10 model on multiple GPUs. |
| cifar10_eval.py | Evaluates the predictive performance of a CIFAR-10 model. |

CIFAR-10 model

The CIFAR-10 network is contained mostly in cifar10.py. The complete training graph contains approximately 765 operations. To maximize code reuse, the graph is built from the following modules (a short sketch of how they fit together follows the list):

  1. Model inputs: inputs() and distorted_inputs() add operations that read and preprocess CIFAR images for evaluation and training, respectively.
  2. Model prediction: inference() adds operations that perform inference, i.e. classification, on the supplied images.
  3. Model training: loss() and train() add operations that compute the loss, gradients, variable updates, and visualization summaries.
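
As a rough orientation, the sketch below shows how these modules might be composed into a training loop. The variable names and the small step loop are illustrative; see cifar10_train.py for the actual training script.

```python
import tensorflow as tf
import cifar10  # the tutorial's model module

with tf.Graph().as_default():
    global_step = tf.Variable(0, trainable=False)

    images, labels = cifar10.distorted_inputs()          # model inputs
    logits = cifar10.inference(images)                   # model prediction
    total_loss = cifar10.loss(logits, labels)            # model training: loss
    train_op = cifar10.train(total_loss, global_step)    # model training: updates

    with tf.Session() as sess:
        sess.run(tf.initialize_all_variables())
        tf.train.start_queue_runners(sess=sess)          # start the input threads
        for step in range(10):
            _, loss_value = sess.run([train_op, total_loss])
            print('step %d, loss = %.2f' % (step, loss_value))
```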

Model input

The input part of the model is built with the functions inputs() and distorted_inputs(), which read images from the CIFAR-10 binary data files. These files contain fixed-length byte records, so we use tf.FixedLengthRecordReader. For more information on the Reader class, see Reading Data (https://www.tensorflow.org/how_tos/reading_data/index.html#reading-from-files).

The images are preprocessed before being fed to the model.

For training, we additionally apply a series of random distortions to artificially increase the size of the dataset.

See the Images (https://www.tensorflow.org/api_docs/python/image.html) page for the list of available distortions. We also attach an image_summary to the images so that they can be visualized in TensorBoard. This is good practice for verifying that the inputs are built correctly.
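
As an illustration only, a training-time distortion pipeline of the kind distorted_inputs() applies might look like the sketch below. The crop size, distortion parameters, and op names (e.g. per_image_whitening, later renamed per_image_standardization) vary between TensorFlow versions, so treat this as a sketch rather than the exact code in cifar10_input.py.

```python
import tensorflow as tf

def distort_image(image, height=24, width=24):
    # image: a [32, 32, 3] float32 tensor decoded from the CIFAR-10 binaries.
    distorted = tf.random_crop(image, [height, width, 3])       # random crop
    distorted = tf.image.random_flip_left_right(distorted)      # random horizontal flip
    distorted = tf.image.random_brightness(distorted, max_delta=63)
    distorted = tf.image.random_contrast(distorted, lower=0.2, upper=1.8)
    # Approximately whiten the image so the model is insensitive to dynamic range.
    return tf.image.per_image_whitening(distorted)
```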

[Figure: distorted CIFAR-10 input images in TensorBoard]

Reading images from disk and distorting them can take a non-trivial amount of processing time. To prevent these operations from slowing down training, they are executed in 16 separate threads that continuously fill a TensorFlow queue (https://www.tensorflow.org/api_docs/python/io_ops.html#shuffle_batch).
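
A minimal sketch of this batching step is shown below, assuming a single preprocessed image/label pair produced by the reading pipeline. The capacity and min_after_dequeue values are illustrative, not the tutorial's exact settings.

```python
import tensorflow as tf

def batch_examples(image, label, batch_size=128):
    images, labels = tf.train.shuffle_batch(
        [image, label],
        batch_size=batch_size,
        num_threads=16,                     # 16 preprocessing threads, as described above
        capacity=20000 + 3 * batch_size,
        min_after_dequeue=20000)            # keep a large buffer for good shuffling
    return images, tf.reshape(labels, [batch_size])
```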

Model prediction

The prediction part of the model consists of the inference() function, which adds operations that compute the logits of the predictions. This part of the model is structured as follows (a condensed code sketch follows the table):

| Layer name | Description |
| --- | --- |
| conv1 | Convolution and ReLU activation |
| pool1 | Max pooling |
| norm1 | Local response normalization |
| conv2 | Convolution and ReLU activation |
| norm2 | Local response normalization |
| pool2 | Max pooling |
| local3 | Fully connected layer with ReLU activation |
| local4 | Fully connected layer with ReLU activation |
| softmax_linear | Linear transformation producing the logits |
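
The sketch below condenses this layer stack into code. The filter sizes, depths, and the weights() helper are illustrative (biases are omitted for brevity); see inference() in cifar10.py for the actual implementation.

```python
import tensorflow as tf

def inference_sketch(images, batch_size, num_classes=10):
    def weights(shape):
        return tf.Variable(tf.truncated_normal(shape, stddev=0.05))

    # conv1 -> pool1 -> norm1
    conv1 = tf.nn.relu(tf.nn.conv2d(images, weights([5, 5, 3, 64]),
                                    strides=[1, 1, 1, 1], padding='SAME'))
    pool1 = tf.nn.max_pool(conv1, ksize=[1, 3, 3, 1], strides=[1, 2, 2, 1],
                           padding='SAME')
    norm1 = tf.nn.lrn(pool1, 4, bias=1.0, alpha=0.001 / 9.0, beta=0.75)

    # conv2 -> norm2 -> pool2
    conv2 = tf.nn.relu(tf.nn.conv2d(norm1, weights([5, 5, 64, 64]),
                                    strides=[1, 1, 1, 1], padding='SAME'))
    norm2 = tf.nn.lrn(conv2, 4, bias=1.0, alpha=0.001 / 9.0, beta=0.75)
    pool2 = tf.nn.max_pool(norm2, ksize=[1, 3, 3, 1], strides=[1, 2, 2, 1],
                           padding='SAME')

    # local3 and local4: fully connected layers with ReLU activation
    flat = tf.reshape(pool2, [batch_size, -1])
    dim = flat.get_shape()[1].value
    local3 = tf.nn.relu(tf.matmul(flat, weights([dim, 384])))
    local4 = tf.nn.relu(tf.matmul(local3, weights([384, 192])))

    # softmax_linear: a linear transformation producing the (unnormalized) logits
    return tf.matmul(local4, weights([192, num_classes]))
```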

The graph of inference operations generated by TensorBoard is as follows:

[Figure: TensorBoard graph of the inference operations]

Exercise: The output of inference is non-normalized logits. Try editing the network architecture to return normalized predictions using tf.nn.softmax().
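
As a hint, the toy example below shows what softmax normalization does to a row of logits; the constant values are made up purely for illustration.

```python
import tensorflow as tf

logits = tf.constant([[2.0, 1.0, 0.1]])        # stand-in for the output of inference()
predictions = tf.nn.softmax(logits)            # normalized class probabilities

with tf.Session() as sess:
    print(sess.run(predictions))               # approximately [[0.659, 0.242, 0.099]]
```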

The inputs () and inference () functions provide all the components needed to evaluate a model. Now let's shift our focus to building operations that train the model.

Exercise: The model architecture in inference() differs slightly from the CIFAR-10 model specified in cuda-convnet. In particular, the top layers of Alex's original model are locally connected rather than fully connected. Try editing the architecture so that the top layer is locally connected.

Model training

The usual way to train a network for N-way classification is multinomial logistic regression (https://en.wikipedia.org/wiki/Multinomial_logistic_regression), also known as softmax regression. Softmax regression applies a softmax nonlinearity to the output of the network and computes the cross entropy (https://www.tensorflow.org/api_docs/python/nn.html#softmax_cross_entropy_with_logits) between the normalized predictions and the 1-hot encoded labels. For regularization, we also apply the usual weight decay (https://www.tensorflow.org/api_docs/python/nn.html#l2_loss) losses to all learned variables. The objective function of the model, as returned by the loss() function, is the sum of the cross entropy loss and all of these weight decay terms.
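
A minimal sketch of such an objective appears below. It uses the sparse variant of the cross entropy op with integer labels (equivalent to the 1-hot form described above), and the way the weight-decay terms are collected is illustrative; see loss() in cifar10.py for the actual implementation.

```python
import tensorflow as tf

def total_loss_sketch(logits, labels, weights_to_decay, weight_decay=0.004):
    # Cross entropy between the softmax-normalized logits and the integer labels.
    cross_entropy = tf.nn.sparse_softmax_cross_entropy_with_logits(
        logits=logits, labels=labels)
    cross_entropy_mean = tf.reduce_mean(cross_entropy)

    # The usual L2 weight decay applied to the learned weight variables.
    weight_decay_losses = [weight_decay * tf.nn.l2_loss(w) for w in weights_to_decay]

    # The objective is the sum of the cross entropy loss and all weight decay terms.
    return tf.add_n([cross_entropy_mean] + weight_decay_losses)
```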

Visualize this with TensorBoard using scalar_summary:

[Figure: total loss in TensorBoard]

We train the model using the standard gradient descent (https://en.wikipedia.org/wiki/Gradient_descent) algorithm (see Training (https://www.tensorflow.org/api_docs/python/train.html) for other methods) with a learning rate that exponentially decays over time.

[Figure: learning rate decay in TensorBoard]

The train() function adds the operations needed to minimize the objective by computing the gradients and updating the learned variables (see GradientDescentOptimizer for details). It returns an operation that executes all the calculations needed to train and update the model for one batch of images.
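
The sketch below shows one way such a training step can be assembled from an exponentially decaying learning rate and standard gradient descent. The initial learning rate, decay schedule, and function name are illustrative rather than the tutorial's exact values.

```python
import tensorflow as tf

def train_sketch(total_loss, global_step, decay_steps=100000):
    lr = tf.train.exponential_decay(0.1,           # initial learning rate
                                    global_step,
                                    decay_steps,   # how often to decay
                                    0.1,           # decay factor
                                    staircase=True)
    optimizer = tf.train.GradientDescentOptimizer(lr)
    grads = optimizer.compute_gradients(total_loss)
    # Returns an op that applies the gradients and increments global_step,
    # i.e. everything needed to train and update the model on one batch of images.
    return optimizer.apply_gradients(grads, global_step=global_step)
```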

Model launch and training

Building the model is now complete. Let's start the model and perform the training operation with the script cifar10_train.py.

python cifar10_train.py

Note: The CIFAR-10 dataset is automatically downloaded the first time you run any target of the CIFAR-10 tutorial. The dataset is about 160MB, so you may want to grab a cup of coffee the first time you run it.

You should see something like this:

Filling queue with 20000 CIFAR images before starting to train. This will take a few minutes.
2015-11-04 11:45:45.927302: step 0, loss = 4.68 (2.0 examples/sec; 64.221 sec/batch)
2015-11-04 11:45:49.133065: step 10, loss = 4.66 (533.8 examples/sec; 0.240 sec/batch)
2015-11-04 11:45:51.397710: step 20, loss = 4.64 (597.4 examples/sec; 0.214 sec/batch)
2015-11-04 11:45:54.446850: step 30, loss = 4.62 (391.0 examples/sec; 0.327 sec/batch)
2015-11-04 11:45:57.152676: step 40, loss = 4.61 (430.2 examples/sec; 0.298 sec/batch)
2015-11-04 11:46:00.437717: step 50, loss = 4.59 (406.4 examples/sec; 0.315 sec/batch)
...

The script reports not only the total loss every 10 steps, but also the speed at which the last batch of data was processed.

Exercise: When experimenting, you may find it annoying that the first training step takes so long. Try reducing the number of images that initially fill the queue. Search for NUM_EXAMPLES_PER_EPOCH_FOR_TRAIN in cifar10.py.

cifar10_train.py periodically saves (https://www.tensorflow.org/api_docs/python/state_ops.html#Saver) all model parameters to checkpoint files (https://www.tensorflow.org/how_tos/variables/index.html#saving-and-restoring), but it does not evaluate the model. The checkpoint files are used by cifar10_eval.py to measure predictive performance (see Model evaluation below).
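
For orientation, periodic checkpointing of this kind might look like the sketch below; the stand-in variable, checkpoint path, and save interval are all illustrative.

```python
import os
import tensorflow as tf

w = tf.Variable(tf.zeros([10]), name='weights')   # stand-in for the model parameters
saver = tf.train.Saver()                          # saves all variables by default

if not os.path.exists('/tmp/cifar10_train'):
    os.makedirs('/tmp/cifar10_train')

with tf.Session() as sess:
    sess.run(tf.initialize_all_variables())
    for step in range(10001):
        # ... run one training step here ...
        if step % 1000 == 0:
            saver.save(sess, '/tmp/cifar10_train/model.ckpt', global_step=step)
```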

Performing the above steps will start training the CIFAR-10 model. Congrats!

The terminal output from cifar10_train.py provides minimal insight into how the model is training. We would like more insight into the model while it trains.

TensorBoard provides this functionality by displaying the data that cifar10_train.py periodically exports via a SummaryWriter (https://www.tensorflow.org/api_docs/python/train.html#SummaryWriter).
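
A minimal sketch of that export path is shown below. It uses the pre-1.0 names referenced by this tutorial (tf.scalar_summary, tf.merge_all_summaries, tf.train.SummaryWriter); later releases renamed them to tf.summary.scalar, tf.summary.merge_all, and tf.summary.FileWriter. The stand-in loss variable and log directory are illustrative.

```python
import tensorflow as tf

loss = tf.Variable(4.68, name='total_loss')        # stand-in for the real loss tensor
tf.scalar_summary('total_loss', loss)
summary_op = tf.merge_all_summaries()

with tf.Session() as sess:
    sess.run(tf.initialize_all_variables())
    writer = tf.train.SummaryWriter('/tmp/cifar10_train', sess.graph)
    for step in range(100):
        if step % 10 == 0:
            writer.add_summary(sess.run(summary_op), step)
```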

For example, you can watch how the distribution of activations and the degree of sparsity in the local3 features evolve during training:

[Figures: local3 activations and sparsity in TensorBoard]

Individual loss functions, as well as the total loss, are especially interesting to track over time. However, the loss exhibits a considerable amount of noise due to the small batch size used for training. In practice, it is extremely useful to visualize the moving average in addition to the raw values. See how the scripts use ExponentialMovingAverage (https://www.tensorflow.org/api_docs/python/train.html#ExponentialMovingAverage) for this purpose.
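
Smoothing a noisy loss with an exponential moving average might look like the sketch below; the decay value and the stand-in loss variable are illustrative, not the scripts' exact settings.

```python
import tensorflow as tf

raw_loss = tf.Variable(4.68, name='raw_loss')          # stand-in for the real loss
ema = tf.train.ExponentialMovingAverage(decay=0.9)     # maintains shadow averages
maintain_averages_op = ema.apply([raw_loss])           # run this op after each step

smoothed_loss = ema.average(raw_loss)                  # the moving-average tensor
# Attaching summaries to both raw_loss and smoothed_loss lets you compare the
# raw and smoothed curves in TensorBoard.
```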

Model evaluation

Now let's evaluate how well the trained model performs on a held-out dataset. The model is evaluated by the script cifar10_eval.py. It builds the model with the inference() function and uses all 10,000 images in the CIFAR-10 evaluation set. It computes the precision at 1: how often the top prediction matches the true label of the image.
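
The toy example below illustrates the precision @ 1 computation with tf.nn.in_top_k; the logits and labels are made-up values, not CIFAR-10 data.

```python
import tensorflow as tf

logits = tf.constant([[0.1, 2.0, 0.3],     # top prediction: class 1
                      [1.5, 0.2, 0.1]])    # top prediction: class 0
labels = tf.constant([1, 2])               # true labels: 1 (correct), 2 (wrong)

top_k_op = tf.nn.in_top_k(logits, labels, 1)                   # [True, False]
precision_at_1 = tf.reduce_mean(tf.cast(top_k_op, tf.float32))

with tf.Session() as sess:
    print(sess.run(precision_at_1))        # 0.5 for this toy example
```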

An evaluation script is periodically run against the latest checkpoint file created by cifar10_train.py to monitor the improvement of the model during training.

python cifar10_eval.py

Be careful not to run the evaluation and training binaries on the same GPU, or you may run out of memory. Consider running the evaluation on a separate GPU if one is available, or suspending the training binary while running the evaluation on the same GPU.

You should see something like this:

2015-11-06 08:30:44.391206: precision @ 1 = 0.860
...

The script merely returns the precision @ 1 periodically, in this case 86%. cifar10_eval.py also exports summaries that can be visualized in TensorBoard. These summaries provide additional insight into the model during evaluation.

The training script calculates the Moving Average (https://www.tensorflow.org/api_docs/python/train.html#ExponentialMovingAverage) version of all trained variables. The evaluation script replaces the trained model parameters with the moving average version. This replacement improves the performance of the model during evaluation.
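
As a sketch of how an evaluation script can load the moving-average versions of the trained variables, consider the snippet below; the decay value, stand-in variable, and checkpoint directory are illustrative.

```python
import tensorflow as tf

w = tf.Variable(tf.zeros([10]), name='weights')    # stand-in for a trained variable

MOVING_AVERAGE_DECAY = 0.9999
variable_averages = tf.train.ExponentialMovingAverage(MOVING_AVERAGE_DECAY)
variables_to_restore = variable_averages.variables_to_restore()
saver = tf.train.Saver(variables_to_restore)       # maps shadow names -> variables

with tf.Session() as sess:
    ckpt = tf.train.get_checkpoint_state('/tmp/cifar10_train')
    if ckpt and ckpt.model_checkpoint_path:
        saver.restore(sess, ckpt.model_checkpoint_path)
```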

Exercise: You can improve the predictive performance, as measured by precision @ 1, by about 3% by using the averaged parameters for the model. Edit cifar10_eval.py so that it does not use the averaged parameters and verify that the predictive performance drops.

Model training with multiple GPU cards

Current workstations may contain multiple GPUs for scientific computing. TensorFlow can take advantage of this environment by performing training operations across multiple cards at the same time.

Training a model in a parallel, distributed fashion requires coordinating the training processes. In what follows, a copy of the model trained on a subset of the data is called a model replica.

Naively adopting asynchronous updates of model parameters leads to suboptimal training performance, because an individual model replica may be trained on a stale copy of the model parameters. Conversely, adopting fully synchronous updates will be as slow as the slowest model replica.

On a workstation with multiple GPU cards, each GPU has similar speed and contains enough memory to run an entire CIFAR-10 model. Therefore, we design the training system so that each GPU holds an individual model replica, and the model parameters are updated synchronously after all GPUs have finished processing a batch of data.

The diagram for this model is below:

[Figure: multi-GPU training architecture]

Note that each GPU computes inference as well as the gradients for a unique batch of data. This setup allows a larger batch of data to be split efficiently across the GPUs.

This setup requires all GPUs to share the model parameters. As is well known, transferring data to and from GPUs is quite slow. For this reason, we decided to store and update all model parameters on the CPU (see the green box). A fresh set of model parameters is transferred to the GPUs after a new batch of data has been processed by all of them.

The GPUs are synchronized in operation. All gradients are accumulated from the GPUs and averaged (see the green box). The model parameters are updated with the gradients averaged across all model replicas.
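
The helper below sketches that averaging step: gradients computed on each GPU ("tower") are combined before the shared parameters are updated. The function name and the structure of tower_grads (one list of (gradient, variable) pairs per GPU) are illustrative, not the tutorial's exact code.

```python
import tensorflow as tf

def average_tower_gradients(tower_grads):
    averaged = []
    # Each element of zip(*tower_grads) groups the (grad, var) pairs that refer
    # to the same variable across all towers.
    for grads_and_vars in zip(*tower_grads):
        grads = [g for g, _ in grads_and_vars]
        var = grads_and_vars[0][1]                       # the shared variable
        avg_grad = tf.add_n(grads) / float(len(grads))   # average over towers
        averaged.append((avg_grad, var))
    return averaged
```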

Placement of variables and operations on the device

Placing operations and variables on the device requires some special abstraction.

The first abstraction we need is a function that computes inference and the gradients for a single model replica. In the code, this abstraction is called a "tower". Two attributes must be set for each tower: a unique name for all of the operations within the tower, and a preferred hardware device on which to run the tower's operations.

All variables are pinned to the CPU and accessed via tf.get_variable() so that they can be shared across the multi-GPU versions. See the how-to on Sharing Variables (https://www.tensorflow.org/how_tos/variable_scope/index.html).
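
A sketch of this placement pattern is shown below: shared variables live on the CPU while each tower's operations run on its own GPU. The helper name, scope names, and the tiny linear model are illustrative.

```python
import tensorflow as tf

def variable_on_cpu(name, shape):
    # Create (or reuse) a shared variable pinned to the CPU.
    with tf.device('/cpu:0'):
        return tf.get_variable(
            name, shape,
            initializer=tf.truncated_normal_initializer(stddev=0.05))

num_gpus = 2
tower_logits = []
with tf.variable_scope(tf.get_variable_scope()):
    for i in range(num_gpus):
        with tf.device('/gpu:%d' % i), tf.name_scope('tower_%d' % i):
            w = variable_on_cpu('weights', [3072, 10])            # shared across towers
            x = tf.placeholder(tf.float32, [None, 3072], name='images')
            tower_logits.append(tf.matmul(x, w))
            # Reuse the same variables when building the next tower.
            tf.get_variable_scope().reuse_variables()
```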

Model launch and training on multiple GPU cards

If you have several GPU cards installed on your machine, you can use them to train the model faster with the cifar10_multi_gpu_train.py script. This version of the training script parallelizes the model across multiple GPU cards.

python cifar10_multi_gpu_train.py --num_gpus=2

The output of the training script should look like this:

Filling queue with 20000 CIFAR images before starting to train. This will take a few minutes.
2015-11-04 11:45:45.927302: step 0, loss = 4.68 (2.0 examples/sec; 64.221 sec/batch)
2015-11-04 11:45:49.133065: step 10, loss = 4.66 (533.8 examples/sec; 0.240 sec/batch)
2015-11-04 11:45:51.397710: step 20, loss = 4.64 (597.4 examples/sec; 0.214 sec/batch)
2015-11-04 11:45:54.446850: step 30, loss = 4.62 (391.0 examples/sec; 0.327 sec/batch)
2015-11-04 11:45:57.152676: step 40, loss = 4.61 (430.2 examples/sec; 0.298 sec/batch)
2015-11-04 11:46:00.437717: step 50, loss = 4.59 (406.4 examples/sec; 0.315 sec/batch)
...

Note that the number of GPU cards is 1 by default. In addition, if your machine has only one GPU available, all calculations will be placed on that single GPU, even if you request more.

Exercise: By default, cifar10_train.py runs with a batch size of 128. Run cifar10_multi_gpu_train.py on two GPUs with batch size 64 and compare training speeds.

Next step

Congrats! You have completed the CIFAR-10 tutorial.

If you are interested in developing and training your own image classification system, we recommend that you fork this tutorial and replace the components to address that image classification problem.

Exercise: Download the Street View House Numbers (SVHN) (http://ufldl.stanford.edu/housenumbers/) dataset. Fork the CIFAR-10 tutorial and exchange the input data for SVHN. Try modifying your network architecture to improve predictive performance.
