PyTorch C++ vs Python (2019 Edition)

There are many deep learning frameworks, such as PyTorch, TensorFlow, and Keras. This time, I will focus on **PyTorch**, which I use often!

Did you know that a **C++ version** of PyTorch has been released in addition to the Python version? It makes it easy to incorporate deep learning as part of the processing of a C++ program!

When I saw the C++ version of PyTorch, I wondered: **"C++ is a compiled language, so maybe it's faster than the Python version?"**

So this time, I actually investigated **"how much does the speed differ between C++ and Python?"**! I was also concerned about accuracy, so I checked that as well.

What to use for comparative experiments

1. Framework

This time, as the title suggests, we will use the C++ version of PyTorch. You can download it from the following site, so please try it!

PyTorch Official: https://pytorch.org/

(screenshot: libtorch download settings on the official page)

I downloaded it with the settings shown above. The "Preview (Nightly)" build always has the latest files; however, it is still under development, so if you want the stable version, select "Stable (1.4)".

Also, "Run this Command" at the bottom is quite important, and if you have a build version of CXX of 11 or above, we recommend that you select the bottom one. Currently it is almost CXX17, so I think it's okay below. If you select the above, link errors of other libraries will occur and it will be a lot of trouble.

2. Model

This time, a **convolutional autoencoder** is used. It is available from my GitHub → https://github.com/koba-jon/pytorch_cpp

This model maps the **input image (high-dimensional)** to a **latent space (low-dimensional)**, and then generates an **image (high-dimensional)** from the **latent variable (low-dimensional)**, with the goal of minimizing the error between the generated image and the input image. After training, the model can regenerate a high-dimensional image from a high-dimensional image through the low-dimensional space, so a **latent space that better characterizes the training images** is obtained. In other words, it performs dimensionality reduction and can be called a kind of non-linear principal component analysis. This is very convenient because it has various uses, such as **mitigating the curse of dimensionality**, **transfer learning**, and **anomaly detection**.
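Written as a formula (the notation $E_\theta$, $D_\phi$ is mine; the L1 reconstruction error is the one the experiments below actually minimize), the training objective is

$$\min_{\theta,\phi}\; \mathbb{E}_{x}\left[\,\left\| x - D_{\phi}(E_{\theta}(x)) \right\|_{1}\,\right]$$

where $E_\theta$ is the encoder that maps an image $x$ to the latent variable, and $D_\phi$ is the decoder that maps the latent variable back to image space.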

Now, I will explain the structure of the model to be used.

  • Image size is halved by each convolution and doubled by each deconvolution
  • Learning is stabilized and convergence accelerated (by batch normalization)
  • The possible range of the latent variables is (-∞, +∞) (no activation on the latent layer)
  • The possible range of the pixel values is [-1, +1] (tanh on the output layer)

Expecting these effects, we built the following network.

| # | Operation | Kernel Size | Stride | Padding | Bias | Input Maps | Output Maps | BN | Activation |
|---|---|---|---|---|---|---|---|---|---|
| 1 | Convolution | 4 | 2 | 1 | False | 3 | 64 | - | ReLU |
| 2 | 〃 | 〃 | 〃 | 〃 | 〃 | 64 | 128 | True | ReLU |
| 3 | 〃 | 〃 | 〃 | 〃 | 〃 | 128 | 256 | True | ReLU |
| 4 | 〃 | 〃 | 〃 | 〃 | 〃 | 256 | 512 | True | ReLU |
| 5 | 〃 | 〃 | 〃 | 〃 | 〃 | 512 | 512 | True | ReLU |
| 6 | 〃 | 〃 | 〃 | 〃 | 〃 | 512 | 512 | - | - |
| 7 | Transposed Convolution | 〃 | 〃 | 〃 | 〃 | 512 | 512 | True | ReLU |
| 8 | 〃 | 〃 | 〃 | 〃 | 〃 | 512 | 512 | True | ReLU |
| 9 | 〃 | 〃 | 〃 | 〃 | 〃 | 512 | 256 | True | ReLU |
| 10 | 〃 | 〃 | 〃 | 〃 | 〃 | 256 | 128 | True | ReLU |
| 11 | 〃 | 〃 | 〃 | 〃 | 〃 | 128 | 64 | True | ReLU |
| 12 | 〃 | 〃 | 〃 | 〃 | 〃 | 64 | 3 | - | tanh |

(〃 = same as the row above.)
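As a reading aid, here is a minimal sketch (my own helper function, not an excerpt from the repository) of how encoder rows 1-6 of this table translate into libtorch's nn::Sequential; decoder rows 7-12 are analogous, with nn::ConvTranspose2d in place of nn::Conv2d:

#include <vector>
#include <torch/torch.h>
namespace nn = torch::nn;

// Encoder rows 1-6 of the table: kernel=4, stride=2, padding=1, bias=false throughout.
// Per the table, row 1 has no batch normalization, and row 6 (the latent layer)
// has neither batch normalization nor an activation.
nn::Sequential make_encoder(){
    nn::Sequential sq;
    std::vector<int64_t> ch = {3, 64, 128, 256, 512, 512, 512};
    for (size_t i = 0; i < 6; i++){
        sq->push_back(nn::Conv2d(nn::Conv2dOptions(ch[i], ch[i+1], 4).stride(2).padding(1).bias(false)));
        if (i < 5){
            if (i > 0) sq->push_back(nn::BatchNorm2d(ch[i+1]));
            sq->push_back(nn::ReLU(nn::ReLUOptions().inplace(true)));
        }
    }
    return sq;
}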

3. Dataset

  • CelebA (Large-scale CelebFaces Attributes) dataset
    http://mmlab.ie.cuhk.edu.hk/projects/CelebA.html

This time, we will use the CelebA dataset, which contains 202,599 celebrity face images (color). The image size is 178 x 218 [pixel], which is inconvenient for the repeated halving and doubling by convolution and deconvolution, so I resized the images to **64 x 64 [pixel]**. Of these, **90% (182,340 images) were used for training** and **10% (20,259 images) for testing**.

When a 64 x 64 image is input to the above model, the latent space becomes (C, H, W) = (512, 1, 1), since the six stride-2 convolutions shrink the spatial size 64 → 32 → 16 → 8 → 4 → 2 → 1. If you input an image of 128 x 128 [pixel] or larger, the intermediate layer becomes a latent space that retains spatial extent.

Comparison

This time, the main question I investigate is **"how much does the speed differ between C++ and Python?"**, and I compare speed and performance in the following five environments.

  • CPU-based operation
    • Python
    • C++
  • GPU-based operation
    • Python
      • Non-deterministic
      • Deterministic
    • C++

1. Difference in the main processing unit (CPU or GPU)

  • CPU
    Good at handling "serial" and "complex" instructions
  • GPU
    Good at processing "parallel" and "simple" instructions

As these characteristics suggest, in deep learning on images the GPU has an overwhelming advantage in computation speed.

(1) Implementation in Python

  • When using CPU

CPU.py


device = torch.device('cpu')  #Use CPU

model.to(device)  #Move model to CPU
image = image.to(device)  #Move data to CPU
  • When using GPU

GPU.py


device = torch.device('cuda')    #Use default GPU
device = torch.device('cuda:0')  #Use the first GPU
device = torch.device('cuda:1')  #Use second GPU

model.to(device)  #Move model to GPU
image = image.to(device)  #Move data to GPU

(2) Implementation in C++

  • When using CPU

CPU.cpp


torch::Device device(torch::kCPU);  //Use CPU

model->to(device);  //Move model to CPU
image = image.to(device);  //Move data to CPU
  • When using GPU

GPU.cpp


torch::Device device(torch::kCUDA);     //Use default GPU
torch::Device device(torch::kCUDA, 0);  //Use the first GPU
torch::Device device(torch::kCUDA, 1);  //Use second GPU

model->to(device);  //Move model to GPU
image = image.to(device);  //Move data to GPU
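As an aside, if you want to fall back to the CPU automatically when no GPU is present, the C++ API provides torch::cuda::is_available() (a common pattern, not part of the article's code):

#include <torch/torch.h>

// Pick the GPU if one is available, otherwise fall back to the CPU
torch::Device device(torch::cuda::is_available() ? torch::kCUDA : torch::kCPU);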

2. Difference between deterministic and non-deterministic (GPU-based operation, Python only)

In the Python version of PyTorch, when training on a GPU, cuDNN is used to **improve the training speed**.

However, unlike the C++ version, the faster training comes with a caveat: running the training again will not necessarily reproduce exactly the same results.

Therefore, the official PyTorch documentation states that, to ensure reproducibility, the behavior of cuDNN must be made deterministic, and that speed decreases at the same time:

https://pytorch.org/docs/stable/notes/randomness.html

Deterministic mode can have a performance impact, depending on your model. This means that due to the deterministic nature of the model, the processing speed (i.e. processed batch items per second) can be lower than when the model is non-deterministic.

From an engineer's point of view, reproducibility can matter, and the speed changes depending on whether it is guaranteed, so I included both cases in this speed comparison.

Unlike the "rand" function in C ++, if you do not set any initial value of the random number, it will be random, so in order to ensure reproducibility in Python, ** explicitly ** the initial value of the random number should be set. You need to set it. (The setting of the initial value of the random number does not affect the speed.)

The implementation is as follows.

  • Deterministic case

deterministic.py


import random
import numpy as np
import torch

seed = 0
torch.manual_seed(seed)
torch.cuda.manual_seed(seed)
np.random.seed(seed)
random.seed(seed)
torch.backends.cudnn.deterministic = True  #deterministic, at the cost of speed
torch.backends.cudnn.benchmark = False     #deterministic, at the cost of speed
  • Non-deterministic case

non_deterministic.py


import torch

torch.backends.cudnn.deterministic = False  #non-deterministic, but faster
torch.backends.cudnn.benchmark = True       #speeds up training when the input size does not change
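For reference, the C++ API exposes similar switches; to the best of my knowledge the calls below exist in libtorch/ATen, but please verify them against your version:

#include <torch/torch.h>

torch::manual_seed(0);                            // seed the random number generator
at::globalContext().setDeterministicCuDNN(true);  // force deterministic cuDNN kernels
at::globalContext().setBenchmarkCuDNN(false);     // disable the cuDNN autotuner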

3. Implementation differences between programming languages

Even if what you want to implement is the same, changing the programming language changes the **notation and rules**, and the **required libraries** may change too. Both Python and C++ are object-oriented languages, so the concepts themselves are similar, but Python is interpreted and C++ is compiled, so you must implement with the fact that C++ has no dynamic typing in mind. Also, since PyTorch's C++ API is still under development, keep in mind that some functions are not yet available.

Based on these points, I will introduce the implementation differences between Python and C++, along with the programs I implemented.

(1) Library usage

The table below lists, side by side, the libraries commonly used in Python, the libraries I recommend for a C++ implementation, and the libraries actually used in the program I wrote.

| Purpose | Python (recommended) | C++ (recommended) | C++ (my implementation) |
|---|---|---|---|
| Command-line arguments | argparse | boost::program_options | boost::program_options |
| Model design | torch.nn | torch::nn | torch::nn |
| Preprocessing (transform) | torchvision.transforms | torch::data::transforms (when preprocessing is fixed before execution) or self-made (when preprocessing varies at run time) | self-made (using OpenCV) |
| Datasets | torchvision.datasets (using Pillow) | self-made (using OpenCV) | self-made (using OpenCV) |
| Dataloader | torch.utils.data.DataLoader | torch::data::make_data_loader (for classification) or self-made (other than classification) | self-made (using OpenMP) |
| Loss function (loss) | torch.nn | torch::nn | torch::nn |
| Optimizer | torch.optim | torch::optim | torch::optim |
| Backpropagation (backward) | torch.Tensor.backward() | torch::Tensor::backward() | torch::Tensor::backward() |
| Progress bar | tqdm | boost | self-made |

**As of this writing (2020/03/24)**, the situation looks like the above.

When using the PyTorch library from C++, the class and function names are almost the same as in Python. This seems to be a deliberate consideration for users on the developers' side. I am very grateful!
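Outside of PyTorch itself, parsing command-line arguments with boost::program_options (recommended in the table above) looks roughly like this (a minimal self-contained sketch; the option names "epochs" and "batch_size" are my own, not the repository's):

#include <iostream>
#include <boost/program_options.hpp>
namespace po = boost::program_options;

int main(int argc, char *argv[]){

    // Declare the options the program accepts
    po::options_description desc("Options");
    desc.add_options()
        ("help", "show this help message")
        ("epochs", po::value<size_t>()->default_value(1), "number of training epochs")
        ("batch_size", po::value<size_t>()->default_value(16), "mini-batch size");

    // Parse the command line into a variables_map
    po::variables_map vm;
    po::store(po::parse_command_line(argc, argv, desc), vm);
    po::notify(vm);

    if (vm.count("help")){
        std::cout << desc << std::endl;
        return 0;
    }

    std::cout << "epochs=" << vm["epochs"].as<size_t>()
              << ", batch_size=" << vm["batch_size"].as<size_t>() << std::endl;

    return 0;
}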

Next, I will describe the points you should be especially careful about when writing PyTorch programs in C++.

(2) Model design

The following is an excerpt of a part of the program I wrote.

networks.hpp (partial excerpt)


#include <torch/torch.h>
#include <boost/program_options.hpp>

using namespace torch;
namespace po = boost::program_options;

struct ConvolutionalAutoEncoderImpl : nn::Module{
private:
    nn::Sequential encoder, decoder;
public:
    ConvolutionalAutoEncoderImpl(po::variables_map &vm);
    torch::Tensor forward(torch::Tensor x);
};

TORCH_MODULE(ConvolutionalAutoEncoder);

When designing a model, use the "torch::nn" classes, as in Python. Also, when creating a model, use a struct. (There is also a class-based way to write it, but it seems a bit more complicated.) Note that, as in Python, the model must **inherit from nn::Module**. This part is the same as the Python way of writing.

The next important point is to name the struct **"[model name]Impl"** and to add **"TORCH_MODULE([model name])"** below the struct. If you do not do this, you will not be able to save or load the model. Writing "TORCH_MODULE([model name])" declares the ordinary struct "ConvolutionalAutoEncoderImpl" as the model type "ConvolutionalAutoEncoder"; internally, this probably wraps the struct through further inheritance (my guess). Because of this, note that you must use the **"->" (arrow operator)** to access members, as in "model->to(device)".

Next, some notes on using the nn-class modules in this setup. You can use "nn::Sequential" just as in Python. To add a module to "nn::Sequential" in C++, use **"push_back"**, as with the vector type. Note that here, too, you must use the **"->" (arrow operator)** to call the "push_back" function. An implementation example looks like the following.

networks.cpp (partial excerpt / modification)


nn::Sequential sq;
sq->push_back(nn::Conv2d(nn::Conv2dOptions(3, 64, /*kernel_size=*/4).stride(2).padding(1).bias(false)));
sq->push_back(nn::BatchNorm2d(64));
sq->push_back(nn::ReLU(nn::ReLUOptions().inplace(true)));
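Connecting this to the struct above, the constructor and forward function might look like the following (my own guess at the overall shape, not a repository excerpt; note that submodules must be registered with register_module so that parameters(), save, and load can see them):

// Constructor: build the layers, then register the submodules
ConvolutionalAutoEncoderImpl::ConvolutionalAutoEncoderImpl(po::variables_map &vm){
    // ... push_back the convolution blocks into this->encoder and this->decoder ...
    register_module("encoder", encoder);
    register_module("decoder", decoder);
}

// Forward pass: image -> latent variable -> reconstructed image
torch::Tensor ConvolutionalAutoEncoderImpl::forward(torch::Tensor x){
    torch::Tensor z = encoder->forward(x);
    return decoder->forward(z);
}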

(3) Self-made transform / datasets / dataloader

When creating transform, datasets, and dataloader classes yourself, use **".clone()"** when passing tensor-type data to other variables. I got stuck here: probably because the tensor type is tied to the computation graph (my guess), the values inside a tensor may change later if you do not do this.

transforms.cpp (partial excerpt)


void transforms::Normalize::forward(torch::Tensor &data_in, torch::Tensor &data_out){
    torch::Tensor data_out_src = (data_in - this->mean) / this->std;
    data_out = data_out_src.clone();
    return;
}
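A minimal illustration (my own example, not from the repository) of the storage sharing behind this:

#include <torch/torch.h>

torch::Tensor a = torch::zeros({2});
torch::Tensor b = a;          // b shares the same underlying storage as a
torch::Tensor c = a.clone();  // c is an independent copy of the data
a += 1.0;                     // in-place update of a
// b is now {1, 1} as well, while c is still {0, 0}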

(4) Other programs

The other parts of the program are almost the same as the Python version, and there is nothing particularly tricky about them. I also wrote my own classes to replace pieces that differ from the Python version and that I found a little awkward to use. For the concrete program, please see the GitHub repository: https://github.com/koba-jon/pytorch_cpp/tree/master/ConvAE

I may write a commentary article on the source code at some point. If you think "this part is strange", comments are welcome.

Items unified among programming languages

Basically, except for the unavoidable parts, such as libraries that exist in Python but not in C++, you can assume the two versions are almost identical. You can also assume they are unchanged from the GitHub programs.

Specifically, the following items have been unified between the Python version and the C++ version.

  • Image size (64 x 64 x 3)
  • Image type (the image set for training method A = the image set for training method B)
  • Batch size (16)
  • Latent space size (1 x 1 x 512)
  • Optimization method (Adam, learning rate = 0.0001, β1 = 0.5, β2 = 0.999; see the sketch after this list)
  • Model structure
  • Model initialization
    • Convolution and deconvolution layers: mean 0.0, standard deviation 0.02
    • Batch normalization: mean 1.0, standard deviation 0.02
  • How data is loaded
    • When the "datasets" class is initialized, only the file paths are acquired; an image is first read from its path when it is actually accessed.
    • While the "datasets" class runs, one set of data (one image and one path) is read at a time.
    • "transform" is executed while the "datasets" class runs.
    • While the "DataLoader" class runs, mini-batch data is read in parallel from the "datasets" class.
  • How the dataset is shuffled
    • Shuffled during training, not shuffled during inference.
    • For each epoch, the data is shuffled at the very beginning.
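As a concrete illustration, one training step under these unified settings would look roughly like this in C++ (a sketch; the variable names "model", "mini_batch", and "device" are hypothetical, and the L1 loss is the one minimized in the experiments below):

#include <torch/torch.h>

// Adam with the unified settings: lr=0.0001, beta1=0.5, beta2=0.999.
// Note: recent libtorch uses .betas(); around v1.4 the setters were .beta1()/.beta2().
torch::optim::Adam optimizer(
    model->parameters(),
    torch::optim::AdamOptions(1e-4).betas(std::make_tuple(0.5, 0.999)));

torch::Tensor image = mini_batch.to(device);         // move the mini-batch to the CPU/GPU
torch::Tensor output = model->forward(image);        // reconstruct through the autoencoder
torch::Tensor loss = torch::l1_loss(output, image);  // L1 error between input and output
optimizer.zero_grad();
loss.backward();
optimizer.step();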

Experimental result

For each comparison target, the 182,340 64 x 64 CelebA images were used to train the convolutional autoencoder model with mini-batches for 1 epoch, minimizing the L1 error. I measured the **"time per epoch"** and the **"GPU memory usage"**.

Here, "time per [epoch]" includes the processing time of tqdm and the function you created. I included this because it had little effect on the total processing time, and because it is more convenient to have visualization when actually using PyTorch, many people use it.

In addition, using the trained model, the 20,259 test images were input to the model one by one for testing. I also measured the **"average speed of forward propagation"** and the **"L1 error between the input image and the output image"**.

Training and testing were run with nothing else launched besides the executable and "nvidia-smi" (plus whatever had been running since Ubuntu started).

| Metric | Python (CPU: Core i7-8700) | C++ (CPU) | Python, non-deterministic (GPU: GeForce GTX 1070) | Python, deterministic (GPU) | C++ (GPU) |
|---|---|---|---|---|---|
| Training time [per epoch] | 1 h 04 min 49 s | 1 h 03 min 00 s | 5 min 53 s | 7 min 42 s | 17 min 36 s |
| GPU memory [MiB] | 2 | 9 | 933 | 913 | 2941 |
| Test speed [s/image] | 0.01189 | 0.01477 | 0.00102 | 0.00101 | 0.00101 |
| L1 error (MAE) | 0.12621 | 0.12958 | 0.12325 | 0.12104 | 0.13158 |

C++ is a compiled language, so I expected it to beat the interpreted Python, but on CPU **the two turned out to be evenly matched**.

In terms of training time, the CPU results are almost identical, while on GPU the C++ version is more than twice as slow as the Python version (why?). Given that the results are about the same on CPU and differ greatly only on GPU, the likely causes are:

  • The GPU processing in the C++ version of PyTorch may not be fully mature yet, so forward and back propagation on the GPU may not be optimized.
  • Transferring the mini-batch data from the CPU to the GPU may take extra time.

As the following article reports the same outcome, the result that **Python is faster** for GPU-based training appears to be correct. https://www.noconote.work/entry/2019/01/11/151624

Also, the inference (test) speed and performance are almost the same as Python's, so **Python may be the better choice at present**.

The GPU memory usage of the C++ version is also large for some reason (even though the ReLU layers have inplace set to true...).

As for the deterministic vs. non-deterministic Python (GPU) results: just as the official documentation states, the deterministic run is slower. Reproducibility does cost training time.

Conclusion

  • Training speed
    • 1st place: Python version (non-deterministic, GPU-based)
    • 2nd place: Python version (deterministic, GPU-based)
    • 3rd place: C++ version (GPU-based)
    • 4th place: CPU-based (Python and C++ versions roughly equal)
  • Inference speed
    • 1st place: GPU-based (Python and C++ versions roughly equal)
    • 2nd place: CPU-based (Python and C++ versions roughly equal)
  • Performance
    • All about the same

In closing

This time, I compared the speed and performance of the Python and C++ versions of PyTorch.

As a result, Python and C++ perform about the same, so I concluded that there is no problem with using the **C++** version of PyTorch as far as accuracy goes. However, **at this stage, adopting the C++ version of PyTorch for the sake of speed may not be advisable**.

The C++ API is still under development, and it may improve significantly from here. That is my hope going forward!

Reference URL

  • https://pytorch.org/cppdocs/
  • https://orizuru.io/blog/deep-learning/pytorch-cpp_01/
  • https://www.noconote.work/entry/2019/01/08/200120
  • https://www.noconote.work/entry/2019/01/11/151624
  • https://github.com/pytorch/examples/tree/master/cpp
