This is the content of Course 1, Week 2 (C1W2) of the Deep Learning Specialization.
(C1W2L01) Binary Classification
- Explanation of binary classification, using the example of judging "whether it is a cat" from image data
- Explanation of the notation (meaning of the symbols), e.g. $X$
- The meaning of the rows and columns of $X$ has changed compared to the Machine Learning lecture.
(C1W2L02) Logistic Regression
- Predicted value $\hat{y} = P(y = 1 | x)$ (the probability that $y = 1$)
- Parameters: $w \in \mathbb{R}^{n_x}$, $b \in \mathbb{R}$
- $\hat{y} = \sigma(w^T x + b)$, where $\sigma$ is the sigmoid function
- The notation again differs from the Machine Learning course: $x_0^{(i)} = 1$ is not used, and the constant term $b$ is not folded into $w$.
(C1W2L03) Logistic Regression Cost Function
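For reference, the per-example loss and the cost over the training set introduced in this lesson (the loss also reappears in C1W2L18 below):

```math
L(\hat{y}, y) = -y \log \hat{y} - (1 - y) \log (1 - \hat{y}) \\
J(w, b) = \frac{1}{m} \sum_{i=1}^{m} L(\hat{y}^{(i)}, y^{(i)})
```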
(C1W2L04) Gradient Descent
- Intuitive explanation of gradient descent
- $\frac{\partial J(w, b)}{\partial w}$ is often written as `dw` in programs.
- $\frac{\partial J(w, b)}{\partial b}$ is often written as `db` in programs.
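A minimal sketch of the update step described here, assuming an illustrative learning rate `alpha` and placeholder gradients (`dw` and `db` below are stand-in values, not computed from data):

```python
import numpy as np

alpha = 0.01                    # learning rate (illustrative value)
w = np.zeros((2, 1))            # parameters (illustrative shapes)
b = 0.0
dw = np.array([[0.1], [-0.2]])  # stand-in for dJ/dw
db = 0.05                       # stand-in for dJ/db

# one gradient-descent step: move the parameters against the gradient
w = w - alpha * dw
b = b - alpha * db
```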
(C1W2L05) Derivatives
- A brief explanation of derivatives
- Since it is basic content, the video can be watched at 1.75x speed.
(C1W2L06) More Derivative Examples
- A brief explanation of derivatives
- Since it is basic content, the video can be watched at 1.75x speed.
(C1W2L07) Computation Graph
- Illustrates how to compute $J(a, b, c) = 3(a + bc)$ by decomposing it into $u = bc$, $v = a + u$, $J = 3v$.
(C1W2L08) Derivatives With Computation Graph
- Explanation of how to compute derivatives using the computation graph ($\frac{dJ}{da} = \frac{dJ}{dv} \frac{dv}{da}$)
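A minimal sketch of this computation graph as straight-line Python, forward pass first and then the backward (chain-rule) pass; the concrete values are illustrative:

```python
# forward pass through J(a, b, c) = 3(a + b*c), decomposed as u = b*c, v = a + u, J = 3*v
a, b, c = 5.0, 3.0, 2.0   # illustrative inputs
u = b * c                 # u = 6
v = a + u                 # v = 11
J = 3 * v                 # J = 33

# backward pass via the chain rule, e.g. dJ/da = dJ/dv * dv/da
dJ_dv = 3.0
dJ_da = dJ_dv * 1.0       # dv/da = 1
dJ_du = dJ_dv * 1.0       # dv/du = 1
dJ_db = dJ_du * c         # du/db = c
dJ_dc = dJ_du * b         # du/dc = b

print(J, dJ_da, dJ_db, dJ_dc)   # 33.0 3.0 6.0 9.0
```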
(C1W2L09) Logistic Regression Gradient Descent
- Explanation of the derivatives of the logistic regression loss $L(a, y)$
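For reference, the derivatives worked out in this lesson, with $z = w^T x + b$ and $a = \sigma(z)$:

```math
\frac{\partial L}{\partial a} = -\frac{y}{a} + \frac{1 - y}{1 - a} \\
\frac{\partial L}{\partial z} = a - y \\
\frac{\partial L}{\partial w_i} = x_i (a - y), \quad \frac{\partial L}{\partial b} = a - y
```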
(C1W2L10) Gradient Descent on m Examples
- Explanation of how to differentiate the cost function $J(w, b)$ and apply gradient descent when there are $m$ training examples
- The computation is first written with explicit for loops; since for loops are inefficient, vectorization is important (a for-loop sketch follows below).
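A minimal sketch of the for-loop version, assuming a hand-rolled `sigmoid` helper and toy data (all names and values are illustrative, not the assignment code):

```python
import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

# toy data: n_x features, m examples (illustrative)
n_x, m = 2, 4
X = np.random.rand(n_x, m)
Y = np.array([0, 1, 0, 1])
w = np.zeros(n_x)
b = 0.0

# one pass over the m examples with explicit for loops
J, dw, db = 0.0, np.zeros(n_x), 0.0
for i in range(m):
    z_i = np.dot(w, X[:, i]) + b
    a_i = sigmoid(z_i)
    J += -(Y[i] * np.log(a_i) + (1 - Y[i]) * np.log(1 - a_i))
    dz_i = a_i - Y[i]
    for j in range(n_x):          # a second loop, this time over the features
        dw[j] += X[j, i] * dz_i
    db += dz_i
J, dw, db = J / m, dw / m, db / m   # averages over the m examples
```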
(C1W2L11) Vectorization
- Explanation of the concept of vectorization, using the $w^T x$ part of $z = w^T x + b$ as an example
- The computation time of a for loop versus the vectorized version (`z = np.dot(w, x) + b`) was demonstrated in a Jupyter notebook; the difference was about 300x.
- I also compared the for-loop and vectorized timings myself (code and results below).
`vectorization.py`

```python
import numpy as np
import time

a = np.random.rand(1000000)
b = np.random.rand(1000000)

# vectorized dot product
tic = time.time()
c = np.dot(a, b)
toc = time.time()
print(c)
print("Vectorization version:" + str(1000*(toc-tic)) + "ms")

# explicit for loop
c = 0
tic = time.time()
for i in range(1000000):
    c += a[i] * b[i]
toc = time.time()
print(c)
print("for loop:" + str(1000*(toc-tic)) + "ms")
```
The result: a difference of nearly 70x, about 12 ms for the vectorized version versus about 821 ms for the for loop.

```
249840.57440415953
Vectorization version:12.021541595458984ms
249840.57440415237
for loop:821.0625648498535ms
```
(C1W2L12) More Vectorization Examples
- Neural network programming guideline: **Whenever possible, avoid explicit for-loops.**
`example.py`

```python
import numpy as np

A = np.random.rand(3, 3)   # example matrix
v = np.random.rand(3)      # example vector

u = np.dot(A, v)       # matrix-vector product
u = np.exp(v)          # element-wise exp
u = np.log(v)          # element-wise log
u = np.abs(v)          # element-wise absolute value
u = np.maximum(v, 0)   # replace elements below 0 with 0
u = v ** 2             # element-wise square
u = 1 / v              # element-wise reciprocal
```
(C1W2L13) Vectorizing Logistic Regression
- Vectorizing the logistic regression computation
```math
X = \left[x^{(1)} \ x^{(2)} \ \cdots \ x^{(m)}\right] \quad (X \in \mathbb{R}^{n_x \times m}) \\
Z = \left[z^{(1)} \ z^{(2)} \ \cdots \ z^{(m)}\right] \quad (Z \in \mathbb{R}^{m}) \\
A = \left[a^{(1)} \ a^{(2)} \ \cdots \ a^{(m)}\right] \quad (A \in \mathbb{R}^{m}) \\
Z = w^T X + \left[b \ b \ \cdots \ b\right] \\
A = \mathrm{sigmoid}(Z)
```

- The $\mathrm{sigmoid}$ function has to be implemented appropriately.
- In Python: `Z = np.dot(w.T, X) + b` (the scalar `b` is automatically broadcast into a $1 \times m$ row vector).
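A minimal sketch of this vectorized forward pass, assuming a hand-rolled `sigmoid` and toy shapes (illustrative, not the assignment code):

```python
import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

n_x, m = 3, 5               # toy sizes (illustrative)
X = np.random.rand(n_x, m)  # one column per example
w = np.random.rand(n_x, 1)
b = 0.1

Z = np.dot(w.T, X) + b      # the scalar b is broadcast across the m columns
A = sigmoid(Z)              # shape (1, m): one prediction per example
```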
(C1W2L14) Vectorizing Logistic Regression's Gradient Computation
- Explanation of the vectorized gradient computation for logistic regression
```math
dZ = A - Y \\
db = \frac{1}{m} \, \mathrm{np.sum}(dZ) \\
dw = \frac{1}{m} \, X \, dZ^T
```
- The notation mixes ordinary mathematical expressions with Python code; it makes sense while following the lecture, but can be hard to parse when reviewing later (see the sketch below).
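A minimal sketch of the vectorized gradients plus one update step, under the same toy setup as above (`Y` and the learning rate `alpha` are illustrative additions):

```python
import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

n_x, m = 3, 5
X = np.random.rand(n_x, m)
Y = np.random.randint(0, 2, size=(1, m))   # toy labels (illustrative)
w = np.zeros((n_x, 1))
b = 0.0
alpha = 0.01                               # learning rate (illustrative)

# vectorized forward pass and gradients -- no loop over the m examples
Z = np.dot(w.T, X) + b
A = sigmoid(Z)
dZ = A - Y
dw = np.dot(X, dZ.T) / m
db = np.sum(dZ) / m

# one gradient-descent update
w = w - alpha * dw
b = b - alpha * db
```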
(C1W2L15) Broadcasting in Python
- Explanation of broadcasting in Python (NumPy)
- When you add an (m, n) matrix and a (1, n) matrix, the (1, n) matrix is automatically expanded to an (m, n) matrix.
- When you add an (m, n) matrix and an (m, 1) matrix, the (m, 1) matrix is automatically expanded to an (m, n) matrix.
- See the NumPy broadcasting documentation for details.
- The `bsxfun` function in MATLAB/Octave behaves a little differently (?)
`example.py`

```python
>>> import numpy as np
>>> a = np.array([[1, 2, 3, 4], [5, 6, 7, 8]])
>>> b = np.array([100, 200, 300, 400])
>>> a + b
array([[101, 202, 303, 404],
       [105, 206, 307, 408]])
```
(C1W2L16) A Note on Python/numpy vectors
- The flexibility of Python/NumPy is both an advantage and a disadvantage.
- Adding a row vector and a column vector raises no error and still produces some result, so mistakes are hard to track down.
- **Do not use arrays of shape `(n,)` (rank-1 arrays).**
`example.py`

```python
>>> import numpy as np
>>> a = np.random.rand(5)     # rank-1 array
>>> print(a)
[0.4721318 0.73582028 0.78261299 0.25030022 0.69326545]
>>> print(a.T)                # the display does not change after transposing
[0.4721318 0.73582028 0.78261299 0.25030022 0.69326545]
>>> print(np.dot(a, a.T))     # computes the inner product, but the code alone does not show whether an inner or outer product is intended
1.9200902050946715
>>>
>>> a = np.random.rand(5, 1)  # (5, 1) matrix
>>> print(a)                  # a column vector
[[0.78323543]
 [0.18639053]
 [0.45103025]
 [0.48060903]
 [0.93265189]]
>>> print(a.T)                # becomes a row vector after transposing
[[0.78323543 0.18639053 0.45103025 0.48060903 0.93265189]]
>>> print(np.dot(a, a.T))     # correctly computes the product of a column vector and a row vector: a (5, 5) matrix
[[0.61345774 0.14598767 0.35326287 0.37643002 0.73048601]
 [0.14598767 0.03474143 0.08406777 0.08958097 0.17383748]
 [0.35326287 0.08406777 0.20342829 0.21676921 0.42065422]
 [0.37643002 0.08958097 0.21676921 0.23098504 0.44824092]
 [0.73048601 0.17383748 0.42065422 0.44824092 0.86983955]]
```
- If you are unsure of an array's dimensions, add a check such as `assert(a.shape == (5, 1))`.
- Explicitly reshape rank-1 arrays, e.g. `a = a.reshape((5, 1))`.
- This matters to me because I often lost track of matrix sizes when taking the Machine Learning course.
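A minimal sketch of those two habits (illustrative):

```python
import numpy as np

a = np.random.rand(5)      # rank-1 array of shape (5,)
a = a.reshape((5, 1))      # explicitly make it a column vector
assert a.shape == (5, 1)   # fail fast if the shape is not what we expect
```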
(C1W2L17) Quick tour of Jupyter/ipython notebooks
- Explanation of how to use Jupyter/IPython notebooks when taking the Coursera course
(C1W2L18) Explanation of Logistic Regression Cost Function (Optional)
- (Re-)explanation of the cost function of logistic regression
- Honestly, I didn't understand it that well :-p
- For me it was easier to understand when starting from $L(\hat{y}, y) = -y \log \hat{y} - (1 - y) \log(1 - \hat{y})$.
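For reference, the lecture's argument starts from interpreting $\hat{y}$ as a probability:

```math
P(y | x) = \hat{y}^{\,y} (1 - \hat{y})^{1 - y} \\
\log P(y | x) = y \log \hat{y} + (1 - y) \log (1 - \hat{y}) = -L(\hat{y}, y) \\
\log \prod_{i=1}^{m} P(y^{(i)} | x^{(i)}) = -\sum_{i=1}^{m} L(\hat{y}^{(i)}, y^{(i)})
```

Maximizing this log-likelihood over $m$ i.i.d. training examples is therefore equivalent to minimizing the cost $J(w, b) = \frac{1}{m} \sum_{i=1}^{m} L(\hat{y}^{(i)}, y^{(i)})$ (the $\frac{1}{m}$ factor is just a rescaling).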
- Deep Learning Specialization (Coursera) self-study record (table of contents)