[PYTHON] Deep Learning Specialization (Coursera) Self-study record (C2W1)


These are my notes on Course 2, Week 1 (C2W1) of the Deep Learning Specialization.

(C2W1L01) Train / Dev / Test sets


--Applied ML is a highly iterative process. It is important to go around the Idea → Code → Experiment → Idea … cycle efficiently.

(C2W1L02) Bias / Variance


--High bias and high variance can be visualized in 2D, but not in high dimensions. In that case, compare the train set error with the dev set error, as in the following examples.

| train set error | dev set error | diagnosis |
|---|---|---|
| 1% | 11% | high variance |
| 15% | 16% | high bias |
| 15% | 30% | high bias & high variance |
| 0.5% | 1% | low bias & low variance |

--The diagnoses above assume that the error a person would make (roughly the optimal error, or Bayes error) is about 0%.

(C2W1L03) Basic "recipe" for machine learning


--For high bias (check performance on the training data):
  - bigger network
  - train longer
  - (NN architecture search; may or may not help)
--Repeat until the high bias is resolved.
--For high variance (check performance on the dev set):
  - more data
  - regularization
  - (NN architecture search; may or may not help)
--The bias-variance trade-off was a problem for early neural networks.
--Now that more data is available, variance can be reduced without making bias much worse.

(C2W1L04) Regularization


J\left(w, b\right) = \frac{1}{m} \sum^{m}_{i=1}L\left(\hat{y}^{(i)}, y^{(i)}\right) + \frac{\lambda}{2m}\|w\|^2_2

--$\lambda$: regularization parameter (one of the hyperparameters)

J\left(w^{[1]}, b^{[1]}, \cdots , w^{[L]}, b^{[L]}\right) = \frac{1}{m} \sum^{m}_{i=1} L\left(\hat{y}^{(i)}, y^{(i)}\right) + \frac{\lambda}{2m} \sum^{L}_{l=1} \|w^{[l]}\|^2_F \\
\|w^{[l]}\|^2_F = \sum^{n^{[l]}}_{i=1}\sum^{n^{[l-1]}}_{j=1}\left(w_{ij}^{[l]}\right)^2

--The dimension of $w^{[l]}$ is $(n^{[l]}, n^{[l-1]})$, and $\|\cdot\|_F$ is the Frobenius norm.

dw^{[l]} = \left( \textrm{from backprop} \right) + \frac{\lambda}{m}w^{[l]} \\
w^{[l]} = w^{[l]} - \alpha dw^{[l]} = \left(1 - \alpha \frac{\lambda}{m} \right)w^{[l]} - \alpha \left( \textrm{from backprop} \right)

--This is called weight decay because, with regularization, each update first multiplies $w^{[l]}$ by $\left(1 - \alpha \frac{\lambda}{m}\right)$, a factor slightly less than 1.
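
The two forms of the update can be checked numerically. The following is a minimal sketch, assuming NumPy; the function names `l2_penalty` and `update_with_weight_decay` and all the numeric values are made up for illustration and do not come from the course.

```python
import numpy as np


def l2_penalty(weights, lam, m):
    """Regularization term: (lambda / 2m) * sum of squared Frobenius norms."""
    return lam / (2 * m) * sum(np.sum(W ** 2) for W in weights)


def update_with_weight_decay(W, dW_backprop, lam, m, alpha):
    """One gradient descent step with the L2 term added to the gradient."""
    dW = dW_backprop + (lam / m) * W
    return W - alpha * dW


# Hypothetical values, chosen only for the demonstration.
rng = np.random.default_rng(0)
W = rng.standard_normal((3, 2))
dW_backprop = rng.standard_normal((3, 2))
lam, m, alpha = 0.7, 100, 0.1

stepped = update_with_weight_decay(W, dW_backprop, lam, m, alpha)
# The same update written in the "weight decay" form.
decayed = (1 - alpha * lam / m) * W - alpha * dW_backprop
print(np.allclose(stepped, decayed))
```

Both expressions give the same new weights, which is why the L2 update can be read as shrinking the weights before the usual gradient step.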

(C2W1L05) Why Regularization Reduces Overfitting


--If $\lambda$ is large, $w^{[l]} \approx 0$. The influence of the hidden units shrinks, so the network effectively becomes simpler and moves toward high bias.
--With a very large $\lambda$, the network behaves almost like logistic regression.
--With $g(z) = \tanh(z)$: a large $\lambda$ makes $w$ small, so $z$ is small and only the nearly linear region of $g(z)$ is used.
--When the activation functions can be regarded as linear, the network cannot represent complex functions, so it moves toward high bias.
--When checking that $J$ decreases on every iteration of gradient descent, compute $J$ including the second (regularization) term.

(C2W1L06) Dropout Regularization


--Drop each unit with a certain probability.
--Train with the resulting reduced neural network.
--Example for layer 3 ($l = 3$): let keep_prob (= 0.8) be the probability that a unit is kept, so a unit is dropped with probability 1 - keep_prob. With $d3$ as the dropout vector:

d3 = \mathrm{np.random.rand(} a3 \mathrm{.shape[0], }\, a3 \mathrm{.shape[1])} < \mathrm{keep\_prob} \\
a3 = \mathrm{np.multiply(} a3, d3 \mathrm{)} \\
a3\  /= \mathrm{keep\_prob} \\
z^{[4]} = W^{[4]} a^{[3]} + b^{[4]}

--The final division by keep_prob keeps the expected value of $a^{[3]}$ unchanged (inverted dropout).
--Draw a new dropout vector $d3$ on every iteration of gradient descent.
--Do not apply dropout when evaluating on the test set (dropout at test time would only make the predictions noisy).
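
The layer-3 steps above can be put into a small runnable sketch. It assumes NumPy; the helper name `inverted_dropout` and the all-ones activations are illustrative choices, not part of the course code.

```python
import numpy as np


def inverted_dropout(a, keep_prob, rng):
    """Apply inverted dropout to activations `a` (training time only).

    Returns the scaled activations and the boolean mask that was used.
    """
    d = rng.random(a.shape) < keep_prob  # True where the unit is kept
    return a * d / keep_prob, d          # rescale so the expected value is unchanged


rng = np.random.default_rng(1)
a3 = np.ones((50, 2000))                 # hypothetical activations of layer 3
keep_prob = 0.8

a3_dropped, d3 = inverted_dropout(a3, keep_prob, rng)
print(d3.mean())          # fraction of units kept, close to keep_prob
print(a3_dropped.mean())  # close to the original mean of a3
```

At test time the function is simply not called, so no extra scaling is needed there.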

(C2W1L07) Understanding dropout


(C2W1L08) Other Regularization Methods


(C2W1L09) Normalizing inputs


--Normalize the input features when their scales differ significantly. This lets gradient descent converge faster.

\mu = \frac{1}{m} \sum^{m}_{i=1} x^{(i)} \\
x := x - \mu \\
\sigma^2 = \frac{1}{m} \sum^{m}_{i=1} \left(x^{(i)}\right)^2 \quad \text{(element-wise)} \\
x := x / \sigma

--Use the train set $\mu$ and $\sigma$ when normalizing the dev set (and the test set), rather than recomputing them.
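
As a minimal sketch of this procedure, assuming NumPy and data stored with shape (n_features, m) as in the course: the helper names `fit_normalizer` and `normalize` and the synthetic data are assumptions made for illustration.

```python
import numpy as np


def fit_normalizer(X_train):
    """Compute per-feature mean and standard deviation from the train set."""
    mu = X_train.mean(axis=1, keepdims=True)
    sigma = np.sqrt(((X_train - mu) ** 2).mean(axis=1, keepdims=True))
    return mu, sigma


def normalize(X, mu, sigma):
    """Subtract the mean and divide by the standard deviation."""
    return (X - mu) / sigma


# Synthetic data: two features with very different scales.
rng = np.random.default_rng(2)
loc = np.array([[5.0], [-300.0]])
scale = np.array([[1.0], [50.0]])
X_train = rng.normal(loc=loc, scale=scale, size=(2, 1000))
X_dev = rng.normal(loc=loc, scale=scale, size=(2, 200))

mu, sigma = fit_normalizer(X_train)
X_train_n = normalize(X_train, mu, sigma)
X_dev_n = normalize(X_dev, mu, sigma)  # reuse the train-set statistics
```

After this step each training feature has mean 0 and variance 1, and the dev set goes through exactly the same transformation.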

(C2W1L10) Vanishing / exploding gradients


--When training a very deep neural network, the derivatives can become very small (vanishing) or very large (exploding). When they become very small in particular, gradient descent takes a long time.
--Recent neural networks can be very deep, on the order of 150 layers.
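
A toy illustration of why depth causes this, under strong simplifying assumptions: linear activations, no bias, and the same weight matrix (a scaled identity) in every layer. The function name and the scale values 1.5 and 0.5 are chosen for this sketch. The activations then scale like scale^L, and the same argument applies to the gradients.

```python
import numpy as np


def forward_linear(x, scale, num_layers):
    """Propagate x through num_layers linear layers with W = scale * I."""
    W = scale * np.eye(x.shape[0])
    a = x
    for _ in range(num_layers):
        a = W @ a
    return a


x = np.ones((2, 1))
exploding = forward_linear(x, 1.5, 150)  # weights slightly larger than identity
vanishing = forward_linear(x, 0.5, 150)  # weights slightly smaller than identity
print(exploding[0, 0], vanishing[0, 0])
```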

(C2W1L11) Weight initialization for deep networks


--The more input features there are, the larger $z = wx + b$ tends to be, because it sums more terms. Therefore, when there are many input features, initialize $w$ with smaller values.

W^{[l]} = \mathrm{np.random.randn} \left( \cdots \right) \ast \mathrm{np.sqrt} \left( \frac{2}{n^{[l-1]}} \right)

--For ReLU, $\sqrt{\frac{2}{n^{[l-1]}}}$ works well.
--For $\tanh$, use $\sqrt{\frac{1}{n^{[l-1]}}}$ (Xavier initialization).
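
A minimal sketch of this initialization, assuming NumPy. The function name `init_weights`, the `layer_dims` list, and the dictionary layout are illustrative assumptions; only the scaling factors come from the notes above.

```python
import numpy as np


def init_weights(layer_dims, activation="relu", seed=0):
    """Initialize W and b for every layer, scaling W by sqrt(k / n_prev).

    k = 2 for ReLU, k = 1 for tanh (Xavier initialization).
    """
    rng = np.random.default_rng(seed)
    k = 2.0 if activation == "relu" else 1.0
    params = {}
    for l in range(1, len(layer_dims)):
        n_prev, n_curr = layer_dims[l - 1], layer_dims[l]
        params[f"W{l}"] = rng.standard_normal((n_curr, n_prev)) * np.sqrt(k / n_prev)
        params[f"b{l}"] = np.zeros((n_curr, 1))
    return params


params = init_weights([1000, 500, 1])  # hypothetical layer sizes
print(params["W1"].std())              # close to sqrt(2 / 1000)
```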

(C2W1L12) Numerical Approximation of Gradients


--With $\epsilon$ a small number, the derivative is approximated by $\frac{f(\theta + \epsilon) - f(\theta - \epsilon)}{2\epsilon}$.
--The order of the error is $O(\epsilon^2)$.
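
A small worked example, in plain Python, comparing the two-sided difference with a one-sided difference. The test function $f(\theta) = \theta^3$ and the values $\theta = 1$, $\epsilon = 0.01$ are chosen here for illustration.

```python
def f(theta):
    return theta ** 3


def two_sided(func, theta, eps):
    """Two-sided difference: error of order eps^2."""
    return (func(theta + eps) - func(theta - eps)) / (2 * eps)


def one_sided(func, theta, eps):
    """One-sided difference: error of order eps."""
    return (func(theta + eps) - func(theta)) / eps


theta, eps = 1.0, 0.01
exact = 3 * theta ** 2                                 # analytic derivative of theta^3
err_two = abs(two_sided(f, theta, eps) - exact)
err_one = abs(one_sided(f, theta, eps) - exact)
print(err_two, err_one)
```

For this function the two-sided error is $\epsilon^2$, while the one-sided error is $3\epsilon + \epsilon^2$, so the two-sided form is far more accurate for the same $\epsilon$.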

(C2W1L13) Gradient checking


d\theta_{approx}^{[i]} = \frac{J(\theta_1, \cdots, \theta_i+\epsilon, \cdots) - J(\theta_1, \cdots, \theta_i-\epsilon, \cdots)}{2\epsilon} \sim d\theta^{[i]}

--Check the following ratio, using $\epsilon = 10^{-7}$:

\frac{\|d\theta_{approx} - d\theta\|_2}{\|d\theta_{approx}\|_2 + \|d\theta\|_2}

| value | judgement |
|---|---|
| $10^{-7}$ | great! |
| $10^{-5}$ | It may be OK, but check |
| $10^{-3}$ | Possible bug |

--If it looks like a bug, check for which specific $i$ the difference between $d\theta_{approx}$ and $d\theta$ is large.
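
The check can be sketched as follows, assuming NumPy. The cost function `cost`, its analytic gradient `grad`, and the artificially injected bug are all hypothetical stand-ins; in a real network $\theta$ would be all the parameters reshaped into one vector.

```python
import numpy as np


def cost(theta):
    """Hypothetical cost J(theta), used only to exercise the check."""
    return np.sum(theta ** 2) + np.sum(np.sin(theta))


def grad(theta):
    """Analytic gradient of the hypothetical cost above."""
    return 2 * theta + np.cos(theta)


def gradient_check(J, dtheta, theta, eps=1e-7):
    """Return the relative difference and the numerical gradient."""
    dtheta_approx = np.zeros_like(theta)
    for i in range(theta.size):
        plus, minus = theta.copy(), theta.copy()
        plus[i] += eps
        minus[i] -= eps
        dtheta_approx[i] = (J(plus) - J(minus)) / (2 * eps)
    num = np.linalg.norm(dtheta_approx - dtheta)
    den = np.linalg.norm(dtheta_approx) + np.linalg.norm(dtheta)
    return num / den, dtheta_approx


theta = np.array([0.3, -1.2, 2.0, 0.7])
ratio_ok, _ = gradient_check(cost, grad(theta), theta)

buggy = grad(theta)
buggy[2] += 0.1  # simulate a bug in one component of the gradient
ratio_bug, approx = gradient_check(cost, buggy, theta)
worst = int(np.argmax(np.abs(approx - buggy)))  # index i with the largest gap
print(ratio_ok, ratio_bug, worst)
```

The correct gradient gives a very small ratio, while the corrupted one lands in the "possible bug" range, and the per-component difference points at the index that was changed.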

(C2W1L14) Gradient Checking Implementation Notes


--Notes on how to use gradient checking, and on what to do when $d\theta_{approx}$ and $d\theta$ differ.


-Deep Learning Specialization (Coursera) Self-study record (table of contents)
