These are the notes for Course 2, Week 2 (C2W2) of the Deep Learning Specialization.
(C2W2L01) Mini-batch gradient descent
- When $m = 5{,}000{,}000$, divide the training set into mini-batches and compute forward propagation and back propagation for each mini-batch (with a mini-batch size of 1000 this gives 5000 mini-batches; a NumPy sketch of the split follows the equations below).
- Mini-batch gradient descent converges faster than processing the whole training set at once.
X^{\{1\}} = \left[ X^{(1)} \, X^{(2)} \, \cdots \, X^{(1000)}\right] \\
Y^{\{1\}} = \left[ Y^{(1)} \, Y^{(2)} \, \cdots \, Y^{(1000)}\right] \\
X^{\{2\}} = \left[ X^{(1001)} \, X^{(1002)} \, \cdots \, X^{(2000)}\right] \\
Y^{\{2\}} = \left[ Y^{(1001)} \, Y^{(1002)} \, \cdots \, Y^{(2000)}\right]
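A minimal NumPy sketch of this partitioning (the function name `make_mini_batches` and the toy shapes are my own; only the column-wise split into blocks of 1000 follows the equations above):

```python
import numpy as np

def make_mini_batches(X, Y, mini_batch_size=1000):
    """Split (X, Y) column-wise into mini-batches X^{t}, Y^{t}."""
    m = X.shape[1]                        # number of training examples
    mini_batches = []
    for start in range(0, m, mini_batch_size):
        X_t = X[:, start:start + mini_batch_size]
        Y_t = Y[:, start:start + mini_batch_size]
        mini_batches.append((X_t, Y_t))
    return mini_batches

# Toy usage: with m = 5,000,000 and size 1000 this would give 5000 mini-batches;
# a smaller m is used here just to keep the demo light.
X = np.random.randn(10, 10000)
Y = np.random.randn(1, 10000)
batches = make_mini_batches(X, Y)
print(len(batches), batches[0][0].shape)  # -> 10 (10, 1000)
```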
(C2W2L02) Understanding Mini-batch Gradient Descent
- For mini-batch gradient descent, the cost function $J^{\{t\}}$ oscillates but trends downward as the mini-batch iterations proceed.
- The mini-batch size should be large enough to benefit from the efficiency of vectorized computation.
- Typical sizes are $2^6$, $2^7$, $2^8$, $2^9$, etc. (powers of 2 to use memory efficiently).
- Training is inefficient if a mini-batch does not fit in CPU/GPU memory.
- Try several powers of 2 to find a size that computes efficiently.
(C2W2L03) Exponentially Weighted Average
- Exponentially weighted (moving) average
- Transform the original data ($\theta_0$, $\theta_1$, $\cdots$) as follows:
V_0 = 0 \\
V_t = \beta V_{t-1} + \left( 1-\beta \right) \theta_t
- $V_t$ can be regarded as an average over roughly the last $\frac{1}{1-\beta}$ data points.
- If $\beta$ is large, the curve is smooth because more data points are averaged.
- If $\beta$ is small, the curve is noisy and sensitive to outliers.
(C2W2L04) Understanding exponentially weighted average
- How to implement the exponentially weighted average
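A minimal sketch of one way to implement it, assuming a 1-D series of data points (the function name and the toy sine data are just for illustration):

```python
import numpy as np

def exp_weighted_average(theta, beta=0.9):
    """Return V where V_t = beta * V_{t-1} + (1 - beta) * theta_t, with V_0 = 0."""
    v = 0.0
    V = np.empty(len(theta))
    for t, th in enumerate(theta):
        v = beta * v + (1 - beta) * th    # only one scalar of state is kept
        V[t] = v
    return V

# Toy data: a noisy sine curve; beta = 0.9 averages over roughly 1/(1-0.9) = 10 points.
theta = np.sin(np.linspace(0, 3, 100)) + 0.1 * np.random.randn(100)
V = exp_weighted_average(theta, beta=0.9)
print(V[:5])
```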
(C2W2L05) Bias correction in exponentially weighted average
- In the exponentially weighted average, $V_t$ is very small in the initial phase (e.g. with $\beta = 0.98$):
V_0 = 0 \\
V_1 = 0.98 V_0 + 0.02 \theta_1 = 0.02 \theta_1 \\
V_2 = 0.98 V_1 + 0.02 \theta_2 = 0.0196\theta_1 + 0.02\theta_2
- Therefore, correct the estimate by using $\frac{V_t}{1-\beta^t}$ instead of $V_t$. When $t$ becomes large, $\beta^t \approx 0$ and the correction has almost no effect.
- In many cases, bias correction is not implemented (the estimates after the initial warm-up phase are simply used as-is).
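A small numerical sketch of the correction, assuming $\beta = 0.98$ as in the example above (the synthetic data is invented for illustration):

```python
import numpy as np

beta = 0.98
theta = 5.0 + np.random.randn(100)        # data whose true level is about 5

v = 0.0
for t, th in enumerate(theta, start=1):
    v = beta * v + (1 - beta) * th
    v_corrected = v / (1 - beta ** t)     # bias correction
    if t in (1, 2, 50):
        print(t, round(v, 3), round(v_corrected, 3))
# At t = 1, 2 the raw v is far too small; by t = 50 the gap has shrunk,
# and it keeps shrinking as beta^t goes to 0.
```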
(C2W2L06) Gradient descent with momentum
- On iteration $t$: compute $dW$, $db$ on the current mini-batch, then update
V_{dW} = \beta V_{dW} + \left( 1-\beta \right) dW \\
V_{db} = \beta V_{db} + \left( 1-\beta \right) db \\
W := W - \alpha V_{dW} \\
b := b - \alpha V_{db}
- Momentum smooths out the oscillations of plain gradient descent.
- $\beta$ and $\alpha$ are hyperparameters; $\beta = 0.9$ works well.
- Bias correction is rarely used: after about 10 iterations the moving average has warmed up and the bias from $\beta^t$ no longer matters.
- Some references use $V_{dW} = \beta V_{dW} + dW$; in that case $\alpha$ can be thought of as being rescaled by a factor of $\frac{1}{1-\beta}$.
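A minimal sketch of this update rule (the quadratic toy objective and its gradient are invented for illustration; the lecture only gives the update equations):

```python
import numpy as np

def momentum_update(W, b, dW, db, vW, vb, beta=0.9, alpha=0.01):
    """One gradient-descent-with-momentum step."""
    vW = beta * vW + (1 - beta) * dW
    vb = beta * vb + (1 - beta) * db
    W = W - alpha * vW
    b = b - alpha * vb
    return W, b, vW, vb

# Toy usage: minimize J(W, b) = ||W||^2 + b^2, whose gradients are 2W and 2b.
W, b = np.array([1.0, -2.0]), 3.0
vW, vb = np.zeros_like(W), 0.0
for _ in range(100):
    dW, db = 2 * W, 2 * b                 # gradients of the toy objective
    W, b, vW, vb = momentum_update(W, b, dW, db, vW, vb)
print(W, b)                               # both move toward 0
```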
(C2W2L07) RMSProp
S_{dW} = \beta S_{dW} + \left( 1-\beta \right) dW^2 \ (\textrm{Element by element}) \\
S_{db} = \beta S_{db} + \left( 1-\beta \right) db^2 \ (\textrm{Element by element}) \\
W := W -\alpha \frac{dW}{\sqrt{S_{dW}} + \epsilon} \\
b := b -\alpha \frac{db}{\sqrt{S_{db}} + \epsilon} \\
- Add $\epsilon = 10^{-8}$ to the denominator so that it never becomes 0.
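A minimal sketch of the RMSProp update on the same kind of toy gradients (the value `beta = 0.999` and the quadratic objective are illustrative choices, not values fixed by the lecture):

```python
import numpy as np

def rmsprop_update(W, b, dW, db, sW, sb, beta=0.999, alpha=0.01, eps=1e-8):
    """One RMSProp step: divide the gradient by a running RMS of its size."""
    sW = beta * sW + (1 - beta) * dW ** 2    # element-wise square
    sb = beta * sb + (1 - beta) * db ** 2
    W = W - alpha * dW / (np.sqrt(sW) + eps)
    b = b - alpha * db / (np.sqrt(sb) + eps)
    return W, b, sW, sb

# Toy usage on J(W, b) = ||W||^2 + b^2 (gradients 2W and 2b).
W, b = np.array([1.0, -2.0]), 3.0
sW, sb = np.zeros_like(W), 0.0
for _ in range(200):
    dW, db = 2 * W, 2 * b
    W, b, sW, sb = rmsprop_update(W, b, dW, db, sW, sb)
print(W, b)
```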
(C2W2L08) Adam optimization algorithm
V_{dW} = \beta_1 V_{dW} + \left( 1-\beta_1 \right) dW \\
V_{db} = \beta_1 V_{db} + \left( 1-\beta_1 \right) db \\
S_{dW} = \beta_2 S_{dW} + \left( 1-\beta_2 \right) dW^2 \\
S_{db} = \beta_2 S_{db} + \left( 1-\beta_2 \right) db^2 \\
V^{corrected}_{dW} = \frac{V_{dW}}{1-\beta_1^t} \\
V^{corrected}_{db} = \frac{V_{db}}{1-\beta_1^t} \\
S^{corrected}_{dW} = \frac{S_{dW}}{1-\beta_2^t} \\
S^{corrected}_{db} = \frac{S_{db}}{1-\beta_2^t} \\
W := W -\alpha \frac{V^{corrected}_{dW}}{\sqrt{S^{corrected}_{dW}}+\epsilon} \\
b := b -\alpha \frac{V^{corrected}_{db}}{\sqrt{S^{corrected}_{db}}+\epsilon} \\
- Hyperparameters
  - $\alpha$: needs to be tuned
  - $\beta_1 = 0.9$
  - $\beta_2 = 0.999$
  - $\epsilon = 10^{-8}$
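A compact sketch that strings the update equations together with the hyperparameter defaults listed above (the toy objective and `alpha = 0.01` are illustrative):

```python
import numpy as np

def adam_update(W, dW, vW, sW, t, alpha=0.01, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam step for a parameter tensor W (the update for b is identical)."""
    vW = beta1 * vW + (1 - beta1) * dW           # momentum-style first moment
    sW = beta2 * sW + (1 - beta2) * dW ** 2      # RMSProp-style second moment
    vW_corr = vW / (1 - beta1 ** t)              # bias correction
    sW_corr = sW / (1 - beta2 ** t)
    W = W - alpha * vW_corr / (np.sqrt(sW_corr) + eps)
    return W, vW, sW

# Toy usage on J(W) = ||W||^2, whose gradient is 2W.
W = np.array([1.0, -2.0, 3.0])
vW, sW = np.zeros_like(W), np.zeros_like(W)
for t in range(1, 2001):                         # t starts at 1 for the correction
    dW = 2 * W
    W, vW, sW = adam_update(W, dW, vW, sW, t)
print(W)                                         # all entries end up near 0
```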
(C2W2L09) Learning rate decay
- With mini-batches, if $\alpha$ is held constant, the parameters keep wandering around the minimum and never quite converge; if $\alpha$ is gradually reduced, they settle into a tighter region around the minimum.
- Epoch: one pass through the data (when the training set is divided into mini-batches, processing all of the mini-batches once is called one epoch).
\alpha = \frac{1}{1 + \textrm{decay\_rate} \ast \textrm{epoch\_num}} \alpha_0
- Other methods include the following (a code sketch of these schedules follows the formulas):
\alpha = 0.95^{\textrm{epoch\_num}} \alpha_0 \\
\alpha = \frac{k}{\sqrt{\textrm{epoch\_num}}} \alpha_0
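A small sketch of the three schedules (the constants `alpha0 = 0.2`, `decay_rate = 1.0`, and `k = 1.0` are just illustrative values):

```python
import numpy as np

alpha0 = 0.2                                   # initial learning rate (illustrative)

def decay_inverse(epoch_num, decay_rate=1.0):
    return alpha0 / (1 + decay_rate * epoch_num)

def decay_exponential(epoch_num):
    return 0.95 ** epoch_num * alpha0

def decay_sqrt(epoch_num, k=1.0):
    return k / np.sqrt(epoch_num) * alpha0     # requires epoch_num >= 1

for epoch in range(1, 6):
    print(epoch,
          round(decay_inverse(epoch), 4),
          round(decay_exponential(epoch), 4),
          round(decay_sqrt(epoch), 4))
```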
- Deep Learning Specialization (Coursera) self-study record (table of contents)