This is a summary of the content of Course 2, Week 3 (C2W3) of the Deep Learning Specialization.
(C2W3L01) Tuning process
- Explanation of how to tune hyperparameters
- The relative importance of the hyperparameters is roughly as follows
  - Most important: learning rate $\alpha$
  - Next: momentum $\beta$, number of hidden units, mini-batch size
  - Then: number of layers, learning rate decay (the Adam parameters $\beta_1$, $\beta_2$, $\epsilon$ are rarely tuned)
- When trying hyperparameter values, sample them at random rather than on a grid
- Coarse to fine: if you find a value that looks good, search more finely in its vicinity
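A minimal sketch of random search followed by a coarse-to-fine zoom, assuming two illustrative hyperparameters (learning rate and number of hidden units); the ranges and the evaluate function are placeholders of my own, not from the lecture.

```python
import numpy as np

def evaluate(alpha, n_hidden):
    # Placeholder for "train the model and return a validation metric"
    return -(np.log10(alpha) + 3) ** 2 - (n_hidden - 64) ** 2 / 1e4

# Coarse stage: sample random combinations instead of a grid
trials = [(10 ** (-4 * np.random.rand()), np.random.randint(16, 257))
          for _ in range(25)]
best_alpha, best_h = max(trials, key=lambda t: evaluate(*t))

# Fine stage: zoom into the neighbourhood of the best coarse point
fine = [(best_alpha * 10 ** np.random.uniform(-0.3, 0.3),
         int(np.clip(best_h + np.random.randint(-16, 17), 16, 256)))
        for _ in range(25)]
best_alpha, best_h = max(fine + [(best_alpha, best_h)], key=lambda t: evaluate(*t))
print(best_alpha, best_h)
```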
(C2W3L02) Using an appropriate scale to pick hyperparameters
- Some hyperparameters can be sampled uniformly on a linear scale, but the learning rate $\alpha$ and the momentum parameter $\beta$ should be sampled on a log scale
- For example, to sample $\alpha$ in $[10^{-4}, 1]$:

r = -4 \ast \textrm{np.random.rand()} \\
\alpha = 10^r

- To sample $\beta$ in $[0.9, 0.999]$, sample $1-\beta$ on a log scale:

r = -2 \ast \textrm{np.random.rand()} - 1 \\
1-\beta = 10^r \\
\beta = 1-10^r
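A minimal sketch of the log-scale sampling above, assuming the same ranges as the formulas ($\alpha \in [10^{-4}, 1]$, $\beta \in [0.9, 0.999]$):

```python
import numpy as np

# Learning rate: uniform in the exponent, so each decade is equally likely
r = -4 * np.random.rand()          # r in [-4, 0]
alpha = 10 ** r                    # alpha in [1e-4, 1]

# Momentum: sample 1 - beta on a log scale instead of beta itself
r = -2 * np.random.rand() - 1      # r in [-3, -1]
beta = 1 - 10 ** r                 # beta in [0.9, 0.999]
print(alpha, beta)
```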
(C2W3L03) Hyperparameter Tuning in Practice: Panda vs. Caviar
- Panda: focus on one model and babysit it, tuning it as it trains
- Caviar: train many models with different hyperparameters in parallel
- With enough computational resources: Caviar
- With a large amount of data and a large model (so you can only afford to train one at a time): Panda
(C2W3L04) Normalizing Activations in a Network
- Applying the same kind of normalization used on the input data to the hidden layers speeds up the learning of $W$ and $b$
- For the $z^{(i)}$ of hidden layer $l$:
\mu = \frac{1}{m}\sum_{i} z^{(i)} \\
\sigma^2 = \frac{1}{m} \sum_{i} \left( z^{(i)} - \mu \right)^2 \\
z^{(i)}_{norm} = \frac{z^{(i)} - \mu}{\sqrt{\sigma^2 + \epsilon}} \\
\tilde{z}^{(i)} = \gamma z^{(i)}_{norm} + \beta
- $\gamma$ and $\beta$ are learnable parameters
- With $\gamma$ and $\beta$, the mean and variance of $\tilde{z}^{(i)}$ can be set to whatever values work best (not necessarily mean 0 and variance 1)
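A minimal NumPy sketch of the forward computation above for one mini-batch; the variable names are my own, not from the lecture.

```python
import numpy as np

def batch_norm_forward(Z, gamma, beta, eps=1e-8):
    """Z: (n_units, m) pre-activations of one layer for a mini-batch."""
    mu = Z.mean(axis=1, keepdims=True)            # per-unit mean over the batch
    sigma2 = Z.var(axis=1, keepdims=True)         # per-unit variance over the batch
    Z_norm = (Z - mu) / np.sqrt(sigma2 + eps)     # zero mean, unit variance
    Z_tilde = gamma * Z_norm + beta               # learnable rescale and shift
    return Z_tilde

Z = np.random.randn(4, 32) * 3 + 1                # toy mini-batch (4 units, 32 examples)
gamma = np.ones((4, 1)); beta = np.zeros((4, 1))
print(batch_norm_forward(Z, gamma, beta).mean(axis=1))   # ≈ 0 for each unit
```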
(C2W3L05) Fitting Batch Norm into a Neural Network
- Explanation of how the computation changes when Batch Norm is added to a network
- Without batch norm: $z^{[l]} \rightarrow a^{[l]}$; with batch norm: $z^{[l]} \rightarrow \tilde{z}^{[l]} \rightarrow a^{[l]}$
- In TensorFlow this is provided as tf.nn.batch_normalization
- Since several different computation steps are introduced, honestly I could not digest it all (the content is not difficult, but there is a lot of it)
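A minimal sketch of how tf.nn.batch_normalization could be applied to a layer's pre-activations; the tensor shapes and variable names are my own assumptions, not the lecture's code.

```python
import tensorflow as tf

z = tf.random.normal((64, 100))          # mini-batch of pre-activations (64 examples, 100 units)
gamma = tf.Variable(tf.ones((100,)))     # learnable scale
beta = tf.Variable(tf.zeros((100,)))     # learnable shift
mean, variance = tf.nn.moments(z, axes=[0])               # batch statistics per unit
z_tilde = tf.nn.batch_normalization(z, mean, variance, beta, gamma, 1e-8)
```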
(C2W3L06) Why Does Batch Norm Work?
- Intuitive explanation of why batch norm works
- Because the mean and variance are estimated on each mini-batch, batch norm adds a little noise to $\tilde{z}^{[l]}$, which gives a slight regularization effect (similar in spirit to dropout)
- Honestly, I did not fully understand this part
(C2W3L07) Batch Norm at Test Time
- During training, $\mu$ and $\sigma^2$ are computed for each mini-batch
- At test time there may be too few examples to compute them, so use an exponentially weighted average of the $\mu$ and $\sigma^2$ values seen across mini-batches during training
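A minimal sketch of keeping such running estimates during training, assuming a decay rate of 0.9 (my choice, not specified in the lecture):

```python
import numpy as np

running_mu, running_sigma2 = 0.0, 1.0     # running estimates used at test time
momentum = 0.9                            # assumed decay rate for the weighted average

for _ in range(100):                      # one iteration per mini-batch
    z = np.random.randn(64) * 2 + 3       # toy pre-activations of one unit
    mu, sigma2 = z.mean(), z.var()        # mini-batch statistics (used during training)
    running_mu = momentum * running_mu + (1 - momentum) * mu
    running_sigma2 = momentum * running_sigma2 + (1 - momentum) * sigma2

print(running_mu, running_sigma2)         # ≈ 3 and ≈ 4; used in place of mu, sigma2 at test time
```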
(C2W3L08) Softmax regression
- Softmax is used for multi-class classification problems (with $C$ classes)
z^{[L]} = W^{[L]} a^{[L-1]} + b^{[L]} \\
t = e^{z^{[L]}} \ \textrm{(element-wise)}\\
a^{[L]}_i = \frac{t_i}{\sum^C_{j=1}t_j}
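A minimal, numerically stable NumPy sketch of this softmax activation (the shift by the maximum is a standard trick, not mentioned in the lecture):

```python
import numpy as np

def softmax(z):
    """z: (C,) or (C, m) pre-activations of the output layer."""
    t = np.exp(z - z.max(axis=0, keepdims=True))   # element-wise exp, shifted for stability
    return t / t.sum(axis=0, keepdims=True)        # normalize so each column sums to 1

z = np.array([5.0, 2.0, -1.0, 3.0])
print(softmax(z))          # ≈ [0.842, 0.042, 0.002, 0.114]
print(softmax(z).sum())    # 1.0
```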
- With no hidden layer, softmax divides the input space into regions separated by straight (linear) boundaries
- With a deeper neural network, the decision boundaries can become much more complicated
(C2W3L09) Training a Softmax Classifier
L\left( \hat{y}, y \right) = - \sum^{C}_{j=1} y_j \log \hat{y}_j\\
J = \frac{1}{m} \sum^m_{i=1} L\left( \hat{y}^{(i)}, y^{(i)} \right)
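With one-hot labels, the loss above reduces to $-\log \hat{y}_{j}$ for the true class $j$; a small NumPy check (the numbers reuse the softmax sketch above):

```python
import numpy as np

y_hat = np.array([0.842, 0.042, 0.002, 0.114])   # softmax output from the previous sketch
y = np.array([1.0, 0.0, 0.0, 0.0])               # one-hot label: true class is class 0

loss = -np.sum(y * np.log(y_hat))                # only the true-class term survives
print(loss, -np.log(y_hat[0]))                   # both ≈ 0.172
```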
(C2W3L10) Deep Learning Frameworks
(C2W3L11) TensorFlow
- Explanation of how to use TensorFlow, using the example of finding the $w$ that minimizes $J(w) = w^2 - 10w + 25$
- Once you define the cost function (forward propagation), back propagation is implemented automatically
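A minimal sketch of this example in TF2 style (the lecture uses the older TF1 session API); the minimum is at $w = 5$:

```python
import tensorflow as tf

w = tf.Variable(0.0, dtype=tf.float32)
optimizer = tf.keras.optimizers.SGD(learning_rate=0.1)

for _ in range(1000):
    with tf.GradientTape() as tape:
        J = w ** 2 - 10 * w + 25      # only the cost (forward pass) is written by hand
    grads = tape.gradient(J, [w])     # back prop is derived automatically
    optimizer.apply_gradients(zip(grads, [w]))

print(w.numpy())                      # ≈ 5.0
```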
- Deep Learning Specialization (Coursera) self-study record (table of contents)