[PYTHON] [PyTorch Tutorial ②] Autograd: Automatic differentiation

Introduction

This is the second installment of the PyTorch official tutorial series, continuing from Last time. This time, I would like to work through Autograd: Automatic Differentiation.

table of contents

1. Autograd
2. Tensor
3. Gradient
4. You can do many crazy things with autograd!
5. Finally
History

1. Autograd

PyTorch implements the Autograd feature. Gradient information is stored in the Tensor, and the gradient for a defined calculation graph (expression) is computed with the backward() method. Let's look at Autograd with concrete examples below.
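Before going step by step, here is a minimal sketch of that workflow (my own example, not from the tutorial): define a tensor with requires_grad=True, build an expression, call backward(), and read the gradient from grad.

import torch

# A tensor that records operations for gradient computation
x = torch.ones(3, requires_grad=True)

# Build a calculation graph (expression) and reduce it to a scalar
loss = (x * 2).sum()

# Compute d(loss)/dx and store it in x.grad
loss.backward()
print(x.grad)  # tensor([2., 2., 2.])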

2. Tensor

A PyTorch Tensor records the operations needed for gradient computation when its requires_grad attribute is set to True. When you calculate the gradient with backward(), the gradient is stored in the Tensor's grad attribute.

The following code defines a Tensor. Specify requires_grad=True so that gradients are recorded.

import torch
x = torch.ones(2, 2, requires_grad=True)
print(x)
tensor([[1., 1.],
        [1., 1.]], requires_grad=True)

Create a calculation graph (formula) y.

y = x + 2
print(y)
tensor([[3., 3.],
        [3., 3.]], grad_fn=<AddBackward0>)

When y is printed, grad_fn appears in the output. This shows that a computation graph has been built to calculate the gradient.

print(y.grad_fn)
<AddBackward0 object at 0x7f8cc977e5c0>

Use y to create further calculation graphs (formulas) z and out.

z = y * y * 3
out = z.mean()

print(z, out)
tensor([[27., 27.],
        [27., 27.]], grad_fn=<MulBackward0>) tensor(27., grad_fn=<MeanBackward0>)

[Reference information] You can change the requires_grad attribute in place with tensor.requires_grad_().

a = torch.randn(2, 2)
a = ((a * 3) / (a - 1))
print(a.requires_grad)
a.requires_grad_(True)
print(a.requires_grad)
b = (a * a).sum()
print(b.grad_fn)
False
True
<SumBackward0 object at 0x7fcb2ba0a3c8>

3. Gradient

Calculate the gradient with out.backward().

out.backward()

Output the partial derivative of out with respect to x, d(out)/dx.

print(x.grad)
tensor([[4.5000, 4.5000],
        [4.5000, 4.5000]])

Since out = z.mean() and z = y * y * 3, the formulas are as follows.

out = \frac{1}{4}\sum_{i=1}^{4}z_i,\qquad
z_i = 3(x_i + 2)^2,\qquad
z_i\bigr\rvert_{x_i=1} = 27

Therefore, partially differentiating out with respect to x_i and evaluating at x_i = 1 gives

\begin{align}
\frac{\partial out}{\partial x_i} &= \frac{1}{4}\cdot 6(x_i+2)\\
&= \frac{3}{2}(x_i+2)\\
&= \frac{3}{2}(1+2) = \frac{9}{2} = 4.5
\end{align}

You can see that the derivative was computed automatically and matches the analytic result.
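As a quick numerical check (my own addition, not part of the tutorial), the analytic result (3/2)(x_i + 2) can be compared directly with the gradient that autograd computed:

# Analytic gradient d(out)/dx_i = (3/2)(x_i + 2), evaluated at the x defined above
analytic = 1.5 * (x + 2)   # builds a small graph, but we only compare the values
print(torch.allclose(x.grad, analytic))  # True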

4. You can do many crazy things with autograd!

I'm not sure what the code below means, but let's take a look.

x = torch.randn(3, requires_grad=True)

y = x * 2
while y.data.norm() < 1000:
    y = y * 2

print(y)
tensor([ -492.4446, -1700.8485,  -339.7951], grad_fn=<MulBackward0>)

x is a tensor of standard normal (mean 0, standard deviation 1) random values. y.data.norm() is, as described on Wikipedia, the distance in a vector space, namely the following Euclidean norm.

\text{Euclidean norm} = \sqrt{|x_1|^2+\cdots+|x_n|^2}

In two dimensions, this formula is the same as the distance between two points, so it really is a distance. In fact, if you print the values of x and x.data.norm(), the norm matches the result of the formula above.

\begin{eqnarray}
\text{Euclidean norm} &=& \sqrt{|-0.9618|^2+|-3.3220|^2+|-0.6637|^2}\\
&=& 3.5215
\end{eqnarray}
print(x)
print(x.data.norm())
tensor([-0.9618, -3.3220, -0.6637], requires_grad=True)
tensor(3.5215)
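As a small check (my own sketch), the same norm can also be computed by hand from the definition above:

# Euclidean norm from the definition: square root of the sum of squares
manual_norm = torch.sqrt((x.data ** 2).sum())
print(manual_norm)                                # tensor(3.5215)
print(torch.isclose(manual_norm, x.data.norm()))  # tensor(True)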

So the code above keeps doubling y until its norm reaches 1,000 or more.

I want to calculate the gradient of y with y.backward(), but since y is not a scalar, it cannot be computed as is. In fact, running y.backward() raises an error.

y.backward()
RuntimeError: grad can be implicitly created only for scalar outputs

The gradient is calculated by passing an appropriate vector to backward().

gradients = torch.tensor([0.1, 1.0, 0.0001], dtype=torch.float)
y.backward(gradients)

print(x.grad)
tensor([5.1200e+01, 5.1200e+02, 5.1200e-02])

Let's consider what this means. As mentioned in the tutorial, the gradient can be represented by the Jacobian matrix.

\begin{split}J=\left(\begin{array}{ccc}
 \frac{\partial y_{1}}{\partial x_{1}} & \cdots & \frac{\partial y_{1}}{\partial x_{n}}\\
 \vdots & \ddots & \vdots\\
 \frac{\partial y_{m}}{\partial x_{1}} & \cdots & \frac{\partial y_{m}}{\partial x_{n}}
 \end{array}\right)\end{split}

There is also a statement that autograd is an engine for computing vector-Jacobian products, that is, the product of the transposed Jacobian matrix and a given vector.

\begin{split}J^{T}\cdot v=\left(\begin{array}{ccc}
 \frac{\partial y_{1}}{\partial x_{1}} & \cdots & \frac{\partial y_{m}}{\partial x_{1}}\\
 \vdots & \ddots & \vdots\\
 \frac{\partial y_{1}}{\partial x_{n}} & \cdots & \frac{\partial y_{m}}{\partial x_{n}}
 \end{array}\right)\left(\begin{array}{c}
 \frac{\partial l}{\partial y_{1}}\\
 \vdots\\
 \frac{\partial l}{\partial y_{m}}
 \end{array}\right)=\left(\begin{array}{c}
 \frac{\partial l}{\partial x_{1}}\\
 \vdots\\
 \frac{\partial l}{\partial x_{n}}
 \end{array}\right)\end{split}
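To make this concrete, here is a small sketch of my own (it assumes a PyTorch version that provides torch.autograd.functional.jacobian) that builds the Jacobian matrix J explicitly for y = x * 2 and checks that backward(v) produces the vector-Jacobian product Jᵀ·v. The names f, x2, y2, and v are only used for this check.

# Requires torch.autograd.functional.jacobian (available in recent PyTorch versions)
from torch.autograd.functional import jacobian

def f(t):
    return t * 2

x2 = torch.randn(3, requires_grad=True)
v = torch.tensor([0.1, 1.0, 0.0001])

# J^T . v as computed by autograd (the vector-Jacobian product)
y2 = f(x2)
y2.backward(v)

# The same product from an explicitly constructed Jacobian matrix
J = jacobian(f, x2)                        # 3 x 3 matrix, here 2 * I
print(torch.allclose(x2.grad, J.t() @ v))  # True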

Based on this information, let's apply it to this case. From here on, this is partly my own interpretation, so it may not be entirely correct.

x = torch.randn(3, requires_grad=True)

Since x consists of 3 random values, the number of inputs n is 3.

x_1, x_2, x_3

Consider y. The definition of y is as follows.

y = x * 2
while y.data.norm() < 1000:
    y = y * 2

First, let's look at the initial y = x * 2. Focusing on the number of variables: y is simply x doubled elementwise, so the number of variables does not change. After the transformation there are still three values, so m is also 3 and the Jacobian matrix is 3 × 3.

{\begin{split}J=\left(\begin{array}{ccc}
 \frac{\partial y_{1}}{\partial x_{1}} & \frac{\partial y_{1}}{\partial x_{2}} & \frac{\partial y_{1}}{\partial x_{3}}\\
 \frac{\partial y_{2}}{\partial x_{1}} & \frac{\partial y_{2}}{\partial x_{2}} & \frac{\partial y_{2}}{\partial x_{3}}\\
 \frac{\partial y_{3}}{\partial x_{1}} & \frac{\partial y_{3}}{\partial x_{2}} & \frac{\partial y_{3}}{\partial x_{3}}\\
 \end{array}\right)\end{split}
}

The print(x) values above, [-0.9618, -3.3220, -0.6637], correspond to [x1, x2, x3]. Applying y = x * 2 to these values gives [-1.9236, -6.6440, -1.3274], which corresponds to [y1, y2, y3]. The numerical values are not actually needed; the relation between x and y is as follows.

y_1 = 2x_1\\
y_2 = 2x_2\\
y_3 = 2x_3\\

The partial derivatives of these equations with respect to x1, x2, and x3 are as follows.

\frac{\partial y_{1}}{\partial x_{1}} = 2 , 
\frac{\partial y_{1}}{\partial x_{2}} = 0 , 
\frac{\partial y_{1}}{\partial x_{3}} = 0\\
\frac{\partial y_{2}}{\partial x_{1}} = 0 , 
\frac{\partial y_{2}}{\partial x_{2}} = 2 , 
\frac{\partial y_{2}}{\partial x_{3}} = 0\\
\frac{\partial y_{3}}{\partial x_{1}} = 0 , 
\frac{\partial y_{3}}{\partial x_{2}} = 0 , 
\frac{\partial y_{3}}{\partial x_{3}} = 2\\

Therefore, the Jacobian matrix is as follows. We will call it J1, representing the first transformation (y = x * 2).

{\begin{split}J_1=\left(\begin{array}{ccc}
 2 & 0 & 0\\
 0 & 2 & 0\\
 0 & 0 & 2\\
 \end{array}\right)\end{split}
}

Now consider the second y, which is the y = y * 2 inside the while loop. Since the formula is the same as the first time, its Jacobian matrix is the same as before, as follows. Let's call it J2.

{\begin{split}J_2=\left(\begin{array}{ccc}
 2 & 0 & 0\\
 0 & 2 & 0\\
 0 & 0 & 2\\
 \end{array}\right)\end{split}
}

This repeats. Since the initial norm x.data.norm() is 3.5215 and the loop continues while y.data.norm() < 1000, the loop body runs 8 times and y is defined 9 times in total (a small counting sketch follows the table). As a whole, it looks like this:

Formula                        Value of x1   Value of x2   Value of x3
(initial values)               x1            x2            x3
1st conversion (y = x * 2)     2 * x1        2 * x2        2 * x3
2nd conversion                 4 * x1        4 * x2        4 * x3
3rd conversion                 8 * x1        8 * x2        8 * x3
4th conversion                 16 * x1       16 * x2       16 * x3
5th conversion                 32 * x1       32 * x2       32 * x3
6th conversion                 64 * x1       64 * x2       64 * x3
7th conversion                 128 * x1      128 * x2      128 * x3
8th conversion                 256 * x1      256 * x2      256 * x3
9th conversion                 512 * x1      512 * x2      512 * x3
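The number of doublings mentioned above can be confirmed with a small counting sketch (my own, using the x printed earlier):

# Count how many times the values are doubled before the norm reaches 1000
norm = x.data.norm()   # about 3.5215 for the x above
count = 1              # the first y = x * 2
norm = norm * 2
while norm < 1000:
    norm = norm * 2
    count += 1
print(count)           # 9, so the overall scaling factor is 2 ** 9 = 512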

In the end, y is the composition of these nine transformations. As described on This math site, the Jacobian matrix of a composite function is the product of the Jacobian matrices of the individual transformations, so the overall Jacobian matrix J of y is:

\begin{eqnarray}
J &=& J_9 \times J_8 \times J_7 \times J_6 \times J_5 \times J_4 \times J_3 \times J_2 \times J_1\\\\
&=&
\left(
    \begin{array}{ccc}
      2 & 0 & 0 \\
      0 & 2 & 0 \\
      0 & 0 & 2
    \end{array}
\right) 
\left(
    \begin{array}{ccc}
      2 & 0 & 0 \\
      0 & 2 & 0 \\
      0 & 0 & 2
    \end{array}
\right) 
\cdots
\left(
    \begin{array}{ccc}
      2 & 0 & 0 \\
      0 & 2 & 0 \\
      0 & 0 & 2
    \end{array}
\right) 
\left(
    \begin{array}{ccc}
      2 & 0 & 0 \\
      0 & 2 & 0 \\
      0 & 0 & 2
    \end{array}
\right) \\\\
&=&
\left(
    \begin{array}{ccc}
      512 & 0 & 0 \\
      0 & 512 & 0 \\
      0 & 0 & 512
    \end{array}
\right) 

\end{eqnarray}


Let's apply this to the following calculation.

gradients = torch.tensor([0.1, 1.0, 0.0001], dtype=torch.float)
y.backward(gradients)

print(x.grad)

gradients is the vector v that multiplies the (transposed) Jacobian matrix. Applying it to the Jacobian matrix calculated above gives the following.

{\begin{split}x.grad=\left(\begin{array}{ccc}
 512 & 0 & 0\\
 0 & 512 & 0\\
 0 & 0 & 512\\
 \end{array}\right)\left(\begin{array}{c}
 0.1\\
 1.0\\
 0.0001
 \end{array}\right)=\left(\begin{array}{c}
 51.2\\
 512\\
 0.0512
 \end{array}\right)\end{split}
}
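This matches the print(x.grad) output shown earlier, which can also be checked directly (my own sketch):

# x.grad should equal J^T . v = 512 * gradients
print(torch.allclose(x.grad, 512 * gradients))  # True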

In summary, my image of Autograd is roughly the following:

- It keeps the Jacobian matrix every time a function (expression) is defined.
- The backward() method then computes the "derivative" from the stored Jacobian matrices.

Changing topics: by writing code inside a torch.no_grad() block as follows, operations are not tracked for gradient computation. Here, (x ** 2) is not tracked.

print(x.requires_grad)
print((x ** 2).requires_grad)

with torch.no_grad():
    print((x ** 2).requires_grad)
True
True
False

Also, detach() returns a tensor with the same values but detached from the computation graph, so gradient tracking is not inherited.

print(x.requires_grad)
y = x.detach()
print(y.requires_grad)
print(x.eq(y).all())
True
False
tensor(True)

5. Finally

That's it for PyTorch's second tutorial, Autograd: Automatic Differentiation. The content was quite different from the first tutorial. The second half includes some of my own interpretation, so it may contain mistakes. I would appreciate it if you could point them out.

Next time, I would like to proceed with the third tutorial, "NEURAL NETWORKS".

History

2020/02/28 First edition released
2020/04/22 Next link added
