These are the notes for Course 2, Week 2 (C2W2) of the Deep Learning Specialization.
(C2W2L01) Mini-batch gradient descent
- When $m = 5{,}000{,}000$, divide the training set into mini-batches and compute forward propagation and back propagation for each mini-batch (with a mini-batch size of 1000 this gives 5000 mini-batches; a NumPy sketch of the split follows the equations below).
- Mini-batch gradient descent converges faster than processing the whole training set at once.
X^{\{1\}} = \left[ X^{(1)} \, X^{(2)} \, \cdots \, X^{(1000)}\right] \\
Y^{\{1\}} = \left[ Y^{(1)} \, Y^{(2)} \, \cdots \, Y^{(1000)}\right] \\
X^{\{2\}} = \left[ X^{(1001)} \, X^{(1002)} \, \cdots \, X^{(2000)}\right] \\
Y^{\{2\}} = \left[ Y^{(1001)} \, Y^{(1002)} \, \cdots \, Y^{(2000)}\right]
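A minimal NumPy sketch of this partitioning (the function name `make_mini_batches` and the toy shapes are my own; only the column-wise split into blocks of 1000 follows the equations above):

```python
import numpy as np

def make_mini_batches(X, Y, mini_batch_size=1000):
    """Split (X, Y) column-wise into mini-batches X^{t}, Y^{t}."""
    m = X.shape[1]                        # number of training examples
    mini_batches = []
    for start in range(0, m, mini_batch_size):
        X_t = X[:, start:start + mini_batch_size]
        Y_t = Y[:, start:start + mini_batch_size]
        mini_batches.append((X_t, Y_t))
    return mini_batches

# Toy usage: with m = 5,000,000 and size 1000 this would give 5000 mini-batches;
# a smaller m is used here just to keep the demo light.
X = np.random.randn(10, 10000)
Y = np.random.randn(1, 10000)
batches = make_mini_batches(X, Y)
print(len(batches), batches[0][0].shape)  # -> 10 (10, 1000)
```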
(C2W2L02) Understanding Mini-batch Gradient Descent
- For mini-batch gradient descent, the cost function $J^{\{t\}}$ oscillates but trends downward as the mini-batch iterations proceed.
- The mini-batch size should be large enough to benefit from the efficiency of vectorized computation.
- Typical sizes are $2^6$, $2^7$, $2^8$, $2^9$, etc. (powers of 2 to use memory efficiently).
- Training is inefficient if a mini-batch does not fit in CPU/GPU memory.
- Try several powers of 2 to find a size that computes efficiently.
(C2W2L03) Exponentially Weighted Average
- Exponentially weighted (moving) average
- Transform the original data ($\theta_0$, $\theta_1$, $\cdots$) as follows:
V_0 = 0 \\
V_t = \beta V_{t-1} + \left( 1-\beta \right) \theta_t
- $V_t$ can be regarded as an average over roughly the last $\frac{1}{1-\beta}$ data points.
- If $\beta$ is large, the curve is smooth because more data points are averaged.
- If $\beta$ is small, the curve is noisy and sensitive to outliers.
(C2W2L04) Understanding exponentially weighted average
- How to implement the exponentially weighted average
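A minimal sketch of one way to implement it, assuming a 1-D series of data points (the function name and the toy sine data are just for illustration):

```python
import numpy as np

def exp_weighted_average(theta, beta=0.9):
    """Return V where V_t = beta * V_{t-1} + (1 - beta) * theta_t, with V_0 = 0."""
    v = 0.0
    V = np.empty(len(theta))
    for t, th in enumerate(theta):
        v = beta * v + (1 - beta) * th    # only one scalar of state is kept
        V[t] = v
    return V

# Toy data: a noisy sine curve; beta = 0.9 averages over roughly 1/(1-0.9) = 10 points.
theta = np.sin(np.linspace(0, 3, 100)) + 0.1 * np.random.randn(100)
V = exp_weighted_average(theta, beta=0.9)
print(V[:5])
```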
(C2W2L05) Bias correction in exponentially weighted average
- In the exponentially weighted average, $V_t$ is very small in the initial phase (e.g. with $\beta = 0.98$):
V_0 = 0 \\
V_1 = 0.98 V_0 + 0.02 \theta_1 = 0.02 \theta_1 \\
V_2 = 0.98 V_1 + 0.02 \theta_2 = 0.0196\theta_1 + 0.02\theta_2
- Therefore, correct the estimate by using $\frac{V_t}{1-\beta^t}$ instead of $V_t$. When $t$ becomes large, $\beta^t \approx 0$ and the correction has almost no effect.
- In many cases, bias correction is not implemented (the estimates after the initial warm-up phase are simply used as-is).
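A small numerical sketch of the correction, assuming $\beta = 0.98$ as in the example above (the synthetic data is invented for illustration):

```python
import numpy as np

beta = 0.98
theta = 5.0 + np.random.randn(100)        # data whose true level is about 5

v = 0.0
for t, th in enumerate(theta, start=1):
    v = beta * v + (1 - beta) * th
    v_corrected = v / (1 - beta ** t)     # bias correction
    if t in (1, 2, 50):
        print(t, round(v, 3), round(v_corrected, 3))
# At t = 1, 2 the raw v is far too small; by t = 50 the gap has shrunk,
# and it keeps shrinking as beta^t goes to 0.
```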
(C2W2L06) Gradient descent with momentum
- On iteration $t$: compute $dW$, $db$ on the current mini-batch, then update
V_{dW} = \beta V_{dW} + \left( 1-\beta \right) dW \\
V_{db} = \beta V_{db} + \left( 1-\beta \right) db \\
W := W - \alpha V_{dW} \\
b := b - \alpha V_{db}
- Momentum smooths out the oscillations of plain gradient descent.
- $\beta$ and $\alpha$ are hyperparameters; $\beta = 0.9$ works well.
- Bias correction is rarely used: after about 10 iterations the moving average has warmed up and the bias from $\beta^t$ no longer matters.
- Some references use $V_{dW} = \beta V_{dW} + dW$; in that case $\alpha$ can be thought of as being rescaled by a factor of $\frac{1}{1-\beta}$.
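A minimal sketch of this update rule (the quadratic toy objective and its gradient are invented for illustration; the lecture only gives the update equations):

```python
import numpy as np

def momentum_update(W, b, dW, db, vW, vb, beta=0.9, alpha=0.01):
    """One gradient-descent-with-momentum step."""
    vW = beta * vW + (1 - beta) * dW
    vb = beta * vb + (1 - beta) * db
    W = W - alpha * vW
    b = b - alpha * vb
    return W, b, vW, vb

# Toy usage: minimize J(W, b) = ||W||^2 + b^2, whose gradients are 2W and 2b.
W, b = np.array([1.0, -2.0]), 3.0
vW, vb = np.zeros_like(W), 0.0
for _ in range(100):
    dW, db = 2 * W, 2 * b                 # gradients of the toy objective
    W, b, vW, vb = momentum_update(W, b, dW, db, vW, vb)
print(W, b)                               # both move toward 0
```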
(C2W2L07) RMSProp
S_{dW} = \beta S_{dW} + \left( 1-\beta \right) dW^2 \ (\textrm{Element by element}) \\
S_{db} = \beta S_{db} + \left( 1-\beta \right) db^2 \ (\textrm{Element by element}) \\
W := W -\alpha \frac{dW}{\sqrt{S_{dW}} + \epsilon} \\
b := b -\alpha \frac{db}{\sqrt{S_{db}} + \epsilon} \\
- Add $\epsilon = 10^{-8}$ to the denominator so that it never becomes 0.
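A minimal sketch of the RMSProp update on the same kind of toy gradients (the value `beta = 0.999` and the quadratic objective are illustrative choices, not values fixed by the lecture):

```python
import numpy as np

def rmsprop_update(W, b, dW, db, sW, sb, beta=0.999, alpha=0.01, eps=1e-8):
    """One RMSProp step: divide the gradient by a running RMS of its size."""
    sW = beta * sW + (1 - beta) * dW ** 2    # element-wise square
    sb = beta * sb + (1 - beta) * db ** 2
    W = W - alpha * dW / (np.sqrt(sW) + eps)
    b = b - alpha * db / (np.sqrt(sb) + eps)
    return W, b, sW, sb

# Toy usage on J(W, b) = ||W||^2 + b^2 (gradients 2W and 2b).
W, b = np.array([1.0, -2.0]), 3.0
sW, sb = np.zeros_like(W), 0.0
for _ in range(200):
    dW, db = 2 * W, 2 * b
    W, b, sW, sb = rmsprop_update(W, b, dW, db, sW, sb)
print(W, b)
```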
(C2W2L08) Adam optimization algorithm
V_{dW} = \beta_1 V_{dW} + \left( 1-\beta_1 \right) dW \\
V_{db} = \beta_1 V_{db} + \left( 1-\beta_1 \right) db \\
S_{dW} = \beta_2 S_{dW} + \left( 1-\beta_2 \right) dW^2 \\
S_{db} = \beta_2 S_{db} + \left( 1-\beta_2 \right) db^2 \\
V^{corrected}_{dW} = \frac{V_{dW}}{1-\beta_1^t} \\
V^{corrected}_{db} = \frac{V_{db}}{1-\beta_1^t} \\
S^{corrected}_{dW} = \frac{S_{dW}}{1-\beta_2^t} \\
S^{corrected}_{db} = \frac{S_{db}}{1-\beta_2^t} \\
W := W -\alpha \frac{V^{corrected}_{dW}}{\sqrt{S^{corrected}_{dW}}+\epsilon} \\
b := b -\alpha \frac{V^{corrected}_{db}}{\sqrt{S^{corrected}_{db}}+\epsilon} \\
- Hyperparameters
  - $\alpha$: needs to be tuned
  - $\beta_1 = 0.9$
  - $\beta_2 = 0.999$
  - $\epsilon = 10^{-8}$
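A compact sketch that strings the update equations together with the hyperparameter defaults listed above (the toy objective and `alpha = 0.01` are illustrative):

```python
import numpy as np

def adam_update(W, dW, vW, sW, t, alpha=0.01, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam step for a parameter tensor W (the update for b is identical)."""
    vW = beta1 * vW + (1 - beta1) * dW           # momentum-style first moment
    sW = beta2 * sW + (1 - beta2) * dW ** 2      # RMSProp-style second moment
    vW_corr = vW / (1 - beta1 ** t)              # bias correction
    sW_corr = sW / (1 - beta2 ** t)
    W = W - alpha * vW_corr / (np.sqrt(sW_corr) + eps)
    return W, vW, sW

# Toy usage on J(W) = ||W||^2, whose gradient is 2W.
W = np.array([1.0, -2.0, 3.0])
vW, sW = np.zeros_like(W), np.zeros_like(W)
for t in range(1, 2001):                         # t starts at 1 for the correction
    dW = 2 * W
    W, vW, sW = adam_update(W, dW, vW, sW, t)
print(W)                                         # all entries end up near 0
```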
(C2W2L09) Learning rate decay
- With mini-batches, if $\alpha$ is held constant, the parameters keep wandering around the minimum and never quite converge; if $\alpha$ is gradually reduced, they settle into a tighter region around the minimum.
- Epoch: one pass through the data (when the training set is divided into mini-batches, processing all of the mini-batches once is called one epoch).
\alpha = \frac{1}{1 + \textrm{decay\_rate} \ast \textrm{epoch\_num}} \alpha_0
- Other methods include the following (a code sketch of these schedules follows the formulas):
\alpha = 0.95^{\textrm{epoch\_num}} \alpha_0 \\
\alpha = \frac{k}{\sqrt{\textrm{epoch\_num}}} \alpha_0
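A small sketch of the three schedules (the constants `alpha0 = 0.2`, `decay_rate = 1.0`, and `k = 1.0` are just illustrative values):

```python
import numpy as np

alpha0 = 0.2                                   # initial learning rate (illustrative)

def decay_inverse(epoch_num, decay_rate=1.0):
    return alpha0 / (1 + decay_rate * epoch_num)

def decay_exponential(epoch_num):
    return 0.95 ** epoch_num * alpha0

def decay_sqrt(epoch_num, k=1.0):
    return k / np.sqrt(epoch_num) * alpha0     # requires epoch_num >= 1

for epoch in range(1, 6):
    print(epoch,
          round(decay_inverse(epoch), 4),
          round(decay_exponential(epoch), 4),
          round(decay_sqrt(epoch), 4))
```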
- Deep Learning Specialization (Coursera) self-study record (table of contents)