「Note」Neural Networks and Deep Learning

Coursera Deep Learning Specialization -- Course 1

Posted by Yuwei on July 20, 2020

Hi,

Just a quick reminder: this is quite a personalized lecture note, and it is not meant to record all the details.

Still, hope it can be of help.

Comments and corrections are welcomed : )

Week 2

Logistic regression

A binary classification method in supervised learning

General idea

  • input: $\large x \in \mathbb{R}^{n_x \times m}$
  • hidden layer: logistic regression neurons with parameters $\large w \in \mathbb{R}^{n_x}$ and $\large b \in \mathbb{R}$
    • $\large z=w^Tx+b$
    • binary classification: $\large \hat{y} = a = \sigma(z)$, where $\sigma$ is the sigmoid function $\large \sigma(z)={1\over{1+e^{-z}}}$
  • output: $\hat{y}$ estimates $P(y=1 \mid x)$; thresholding it (e.g. at 0.5) gives the prediction 0 or 1

Now, use gradient descent to train the parameters $w$ and $b$ :

  • define loss function: $\large L(\hat{y}, y) = -(y\log{\hat{y}}+(1-y)\log{(1-\hat{y})})$
    • applies for one case
    • $y$ can only be 0 or 1
  • define cost function: $\large J(w, b) = {1\over{m}}\sum\limits_{i=1}^{m}L(a^{(i)}, y^{(i)})$
    • applies for all ($m$) the cases
  • repeatedly update the parameters with:
    • $\large w:= w-\alpha \frac{\mathrm{d}J(w)}{\mathrm{d} w}$
    • $\large b:= b-\alpha \frac{\mathrm{d}J(b)}{\mathrm{d} b}$
    • where $\alpha$ is the learning rate
    • when coding, the derivative of the final output (e.g. $J$ here) with respect to a variable (e.g. $w$), $\large \frac{\mathrm{d}J}{\mathrm{d} w}$, is written as “$\mathrm{d} w$”
    • the derivatives are calculated by “backpropagation”: start from the output value and work backwards using the chain rule
    • minimizing the cost function $J$ is equivalent to maximizing the likelihood of the observed labels, under the assumption that the training examples are i.i.d. (independent and identically distributed)

More details about backpropagation:

  • basically, compute the derivatives starting from the output and working backwards with the chain rule (a code sketch follows this list)
  • for a single training example: $\large \mathrm{d} z = a - y$, $\large \mathrm{d}w = x\,\mathrm{d} z$, $\large \mathrm{d}b = \mathrm{d} z$
  • averaged over all $m$ examples (vectorized): $\large \mathrm{d}Z = A - Y$, $\large \mathrm{d}w = \frac{1}{m}X\,\mathrm{d}Z^{T}$, $\large \mathrm{d}b = \frac{1}{m}\sum\limits_{i=1}^{m}\mathrm{d} z^{(i)}$
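A minimal NumPy sketch of these vectorized formulas together with the gradient-descent update (variable and function names here are my own, not taken from the course assignment):

```python
import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

def propagate(w, b, X, Y):
    """One forward/backward pass of logistic regression.
    Shapes: X (n_x, m), Y (1, m), w (n_x, 1), b scalar."""
    m = X.shape[1]
    A = sigmoid(w.T @ X + b)                                  # (1, m) predictions
    cost = -np.mean(Y * np.log(A) + (1 - Y) * np.log(1 - A))  # cross-entropy cost J
    dw = (X @ (A - Y).T) / m                                  # dJ/dw, shape (n_x, 1)
    db = np.sum(A - Y) / m                                    # dJ/db, scalar
    return dw, db, cost

def gradient_descent(w, b, X, Y, alpha=0.01, num_iters=1000):
    """Repeatedly apply w := w - alpha * dw, b := b - alpha * db."""
    for _ in range(num_iters):
        dw, db, _ = propagate(w, b, X, Y)
        w = w - alpha * dw
        b = b - alpha * db
    return w, b
```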

Vectorizing with Python

  • Avoid using “for-loop” as much as possible
  • Make sure that the data are 2-D arrays (proper matrices / column vectors) rather than rank-1 arrays, or $\texttt{broadcasting}$ in Python may cause bugs (see the example below)
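For example, the difference between a rank-1 array and a proper column vector in NumPy (a common source of silent broadcasting bugs):

```python
import numpy as np

a = np.random.randn(5)        # rank-1 array, shape (5,): avoid
print(a.shape, a.T.shape)     # (5,) (5,)  -- the transpose does nothing
print(np.dot(a, a.T))         # a scalar, not the expected (5, 5) outer product

b = np.random.randn(5, 1)     # proper column vector, shape (5, 1)
print(b.shape, b.T.shape)     # (5, 1) (1, 5)
print(np.dot(b, b.T).shape)   # (5, 5), as expected

a = a.reshape(5, 1)           # reshape (and assert) to enforce the intended shape
assert a.shape == (5, 1)
```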

Week 3

An example of a 4-layer NN:

  • Layer 0: the input layer (not counted), $x$
  • Layer 1 to 3: the hidden layers
  • Layer 4: the output layer, $\hat{y}$

[Figure: a 4-layer neural network]

Two-Layer Neural Network

Structure of an $l+1$ layer NN

  1. Input layer: features $X$, or $a^{[0]}$, NOT counted as a layer
    • $X \in \mathbb{R}^{n_x \times m}$
  2. Hidden layer. Each neuron consists of:
    • A linear function with parameters $W^{[l]}$ and $b^{[l]}$: $Z^{[l]} = W^{[l]} A^{[l-1]} + b^{[l]}$ (for the first hidden layer, $A^{[0]} = X$)
    • A non-linear activation function: $A^{[l]} = g(Z^{[l]})$
  3. Output layer: $\hat{Y}$, counted as layer $l+1$, i.e. $A^{[l+1]}$

Notations

In the case of $m$ training examples, each of which has $n_x$ features, and the 1st layer has $I$ nodes:

  • The $j$-th training example, $x^{(j)}$ is a column vector, and $x^{(j)} \in \mathbb{R}^{n_x \times 1}$

    • Stacking the $m$ examples, get $X \in \mathbb{R}^{n_x \times m}$
  • $\large a^{[l](j)}_i$: the $i$-th node in the $l$-th layer for the $j$-th training example

    • Taking the $i$-th node in the 1st layer for example:
      • $\large a^{[1]}_i=g(z^{[1]}_i)$
      • $\large z^{[1]}_i = w^{[1]T}_i X + b^{[1]}_i$, where $\large w^{[1]}_i \in \mathbb{R}^{n_x \times 1}$ is the same for every training example $j$, and $\large b^{[1]}_i \in \mathbb{R}$ ($\texttt{broadcasting}$!)
    • Stacking all the nodes:
      • $\large W^{[1]}=\begin{bmatrix} w^{[1]T}_1 \\ \vdots \\ w^{[1]T}_I \end{bmatrix}_{I \times n_x}$
      • $\large b^{[1]} \in \mathbb{R}^{I \times 1}$
      • $\large Z^{[1]} = W^{[1]}X+b^{[1]}$, $Z^{[1]} \in \mathbb{R}^{I \times m}$
      • $\large A^{[1]} = g(Z^{[1]})$
  • Then the 2nd layer:
    • $\large Z^{[2]} = W^{[2]}A^{[1]}+b^{[2]}$
    • $\large A^{[2]} = g(Z^{[2]})$
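A short sketch of this stacked computation for the first hidden layer, assuming $I$ hidden units, a tanh activation, and randomly generated data:

```python
import numpy as np

n_x, m, I = 3, 10, 4                 # number of features, examples, hidden units
X = np.random.randn(n_x, m)          # each column is one training example

W1 = np.random.randn(I, n_x) * 0.01  # row i of W1 is w_i^{[1]T}
b1 = np.zeros((I, 1))                # broadcast across the m columns

Z1 = W1 @ X + b1                     # Z^{[1]}, shape (I, m)
A1 = np.tanh(Z1)                     # A^{[1]} = g(Z^{[1]}), shape (I, m)
```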

Activation Function

Functions usually used as activation:

| Name | Expression |
| --- | --- |
| Sigmoid function | $\sigma(z) = \frac{1}{1+e^{-z}}$ |
| Hyperbolic tangent (tanh) function | $\tanh(z) = \frac{e^{z}-e^{-z}}{e^{z}+e^{-z}}$ |
| Rectified linear unit (ReLU) function | $g(z) = \mathrm{max}\{0,z\}$ |
| Leaky ReLU function | e.g., $g(z) = \mathrm{max}\{0.01z,z\}$ |
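Straightforward NumPy versions of these four activations (a sketch; the function names are my own):

```python
import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

def tanh(z):
    return np.tanh(z)                    # (e^z - e^-z) / (e^z + e^-z)

def relu(z):
    return np.maximum(0, z)

def leaky_relu(z, slope=0.01):
    return np.maximum(slope * z, z)
```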

Suggestions on choosing the activation function

  1. The sigmoid function is rarely used in hidden layers;
  2. The sigmoid function may still be used in the output layer for binary classification;
  3. The tanh function almost always performs better and trains faster than the sigmoid in hidden layers;
  4. The ReLU function is the most commonly used activation, and it also trains faster;
  5. A leaky ReLU may work slightly better than a plain ReLU, but it is not commonly used, and ReLU is usually good enough;
  6. So just use ReLU : )

Why there has to be a non-linear activation

  • A composition of linear functions is still a linear function, so without non-linear activations the whole network collapses into a single linear model (see the check below)
  • Only for regression problems might the output layer use a linear activation function
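A one-line check of the first point: two stacked linear layers collapse into a single linear map with $W' = W^{[2]}W^{[1]}$ and $b' = W^{[2]}b^{[1]} + b^{[2]}$:

$$W^{[2]}\left(W^{[1]}x + b^{[1]}\right) + b^{[2]} = \left(W^{[2]}W^{[1]}\right)x + \left(W^{[2]}b^{[1]} + b^{[2]}\right) = W'x + b'$$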

Implement the Gradient Descent

Forward propagation

Just calculate from the input layer through the hidden layers to the output layer using matrix operations (no explicit loop over the training examples)!

Backpropagation

A good tutorial here

  • Use the chain rule

  • Keep the dimensions matched

    [Figure: the backpropagation computation graph]
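As an illustration (not the assignment code), one gradient-descent iteration for a 2-layer network with a tanh hidden layer and a sigmoid output, using the cross-entropy cost from Week 2:

```python
import numpy as np

def one_iteration(X, Y, W1, b1, W2, b2, alpha=0.01):
    """X: (n_x, m), Y: (1, m); returns the updated parameters."""
    m = X.shape[1]

    # forward propagation
    Z1 = W1 @ X + b1
    A1 = np.tanh(Z1)
    Z2 = W2 @ A1 + b2
    A2 = 1 / (1 + np.exp(-Z2))            # sigmoid output

    # backpropagation (chain rule, dimensions kept matched)
    dZ2 = A2 - Y                          # (1, m)
    dW2 = (dZ2 @ A1.T) / m                # (1, n_h)
    db2 = np.sum(dZ2, axis=1, keepdims=True) / m
    dZ1 = (W2.T @ dZ2) * (1 - A1 ** 2)    # tanh'(z) = 1 - tanh(z)^2
    dW1 = (dZ1 @ X.T) / m                 # (n_h, n_x)
    db1 = np.sum(dZ1, axis=1, keepdims=True) / m

    # gradient-descent update
    W1 = W1 - alpha * dW1
    b1 = b1 - alpha * db1
    W2 = W2 - alpha * dW2
    b2 = b2 - alpha * db2
    return W1, b1, W2, b2
```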

Random Initialization

To break symmetry (otherwise all the neurons in a layer do exactly the same calculation and stay identical)

Always start small: most activation functions (e.g. sigmoid, tanh) have a nearly flat slope at large $|z|$, which leads to slow learning
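A minimal sketch of such an initialization for a 2-layer network (the 0.01 factor keeps $z$ in the steep, fast-learning region of tanh/sigmoid):

```python
import numpy as np

def initialize_parameters(n_x, n_h, n_y):
    """Small random weights break symmetry; zero biases are fine."""
    W1 = np.random.randn(n_h, n_x) * 0.01
    b1 = np.zeros((n_h, 1))
    W2 = np.random.randn(n_y, n_h) * 0.01
    b2 = np.zeros((n_y, 1))
    return {"W1": W1, "b1": b1, "W2": W2, "b2": b2}
```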

Week 4

Forward Propagation wrap-up

Forward Propagation

  • for a single training example:

    • $\large z^{[l]} = w^{[l]}\times a^{[l-1]}+b^{[l]}$

    • $\large a^{[l]} = g^{[l]}(z^{[l]})$

  • for the whole training set:

    • $\large Z^{[l]} = W^{[l]}\times A^{[l-1]}+b^{[l]}$
      • where $\large Z^{[l]}=\begin{bmatrix} \vdots & & \vdots \\ z^{[l](1)} & \cdots & z^{[l](m)} \\ \vdots & & \vdots \end{bmatrix}_{n^{[l]} \times m}$ stacks the per-example columns
    • $\large A^{[l]} = g^{[l]}(Z^{[l]})$
    • $\large A^{[0]} = X$, and $\large A^{[L]} = \hat{Y}$
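A sketch of this wrap-up as a loop over the layers, assuming a `parameters` dict holding `W1, b1, ..., WL, bL`, ReLU activations in the hidden layers, and a sigmoid output:

```python
import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

def L_model_forward(X, parameters):
    """Forward pass for an L-layer network; returns Y_hat and the cached values."""
    L = len(parameters) // 2              # each layer contributes a W and a b
    caches = []
    A = X                                 # A^[0] = X
    for l in range(1, L + 1):
        A_prev = A
        W, b = parameters["W" + str(l)], parameters["b" + str(l)]
        Z = W @ A_prev + b                # Z^[l] = W^[l] A^[l-1] + b^[l]
        A = sigmoid(Z) if l == L else np.maximum(0, Z)   # g^[l]
        caches.append((A_prev, W, b, Z))  # kept for backpropagation
    return A, caches                      # A = A^[L] = Y_hat
```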

Backpropagation

$\large \mathrm{d}Z^{[L]}=A^{[L]}-Y$

$\large \mathrm{d}W^{[L]} = \frac{1}{m} \mathrm{d} Z^{[L]} A^{[L-1]T}$

$\large \mathrm{d}b^{[L]} = \frac{1}{m}\mathrm{np.sum}(\mathrm{d}Z^{[L]}, \mathrm{axis} = 1, \mathrm{keepdims} = True)$

$\large \mathrm{d}Z^{[L-1]} = W^{[L]T} \mathrm{d}Z^{[L]} * g'^{[L-1]}(Z^{[L-1]})$

$\large \vdots$

$\large \mathrm{d}Z^{[1]} = W^{[2]T} \mathrm{d}Z^{[2]} * g'^{[1]}(Z^{[1]})$

$\large \mathrm{d}W^{[1]} = \frac{1}{m} \mathrm{d} Z^{[1]} A^{[0]T}$

$\large \mathrm{d}b^{[1]} = \frac{1}{m}\mathrm{np.sum}(\mathrm{d}Z^{[1]}, \mathrm{axis} = 1, \mathrm{keepdims} = True)$

where “$*$” denotes element-wise multiplication

During the forward & backward computations, many quantities are used repeatedly. Thus it is more efficient to “cache” these quantities for later use, such as $w^{[l]}$, $b^{[l]}$, $a^{[l]}$, $\mathrm{d}w^{[l]}$, $\mathrm{d}b^{[l]}$, and $\mathrm{d}a^{[l]}$ (a sketch follows)
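For example, one backward step can reuse the values cached during the forward pass instead of recomputing them (a sketch with my own names, mirroring the cache stored in the forward loop above):

```python
import numpy as np

def relu_layer_backward(dA, cache):
    """Given dA^[l] and the cache (A_prev, W, b, Z) saved during the forward
    pass, return dA^[l-1], dW^[l], db^[l] without recomputing anything."""
    A_prev, W, b, Z = cache
    m = A_prev.shape[1]
    dZ = dA * (Z > 0)                     # dZ = dA * g'(Z) for ReLU
    dW = (dZ @ A_prev.T) / m
    db = np.sum(dZ, axis=1, keepdims=True) / m
    dA_prev = W.T @ dZ                    # passed down to layer l-1
    return dA_prev, dW, db
```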

Get the Matrix Dimension Right

A great tool for debugging!

  • $w^{[l]}$, $\mathrm{d}{w}^{[l]}$: $(n^{[l]}, n^{[l-1]})$
  • $b^{[l]}$, $\mathrm{d}{b}^{[l]}$: $(n^{[l]}, 1)$
  • $z^{[l]}$, $a^{[l]}$, $\mathrm{d}z^{[l]}$, $\mathrm{d}a^{[l]}$: $(n^{[l]}, 1)$
  • $Z^{[l]}$, $A^{[l]}$, $\mathrm{d}Z^{[l]}$, $\mathrm{d}A^{[l]}$: $(n^{[l]},m)$
  • because of $\mathtt{broadcasting}$, the bias $b^{[l]}$ of shape $(n^{[l]}, 1)$ acts like a stacked $B^{[l]}$ of shape $(n^{[l]}, m)$; no explicit copy is needed
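In code, these dimension checks translate directly into assertions (a small debugging sketch with made-up layer sizes):

```python
import numpy as np

n = {0: 5, 1: 4, 2: 1}                    # units per layer (example values)
m = 10                                    # number of examples

W1 = np.random.randn(n[1], n[0]) * 0.01
b1 = np.zeros((n[1], 1))
A0 = np.random.randn(n[0], m)             # A^[0] = X
Z1 = W1 @ A0 + b1                         # b1 is broadcast across the m columns
A1 = np.maximum(0, Z1)

assert W1.shape == (n[1], n[0])           # (n^[l], n^[l-1])
assert b1.shape == (n[1], 1)              # (n^[l], 1)
assert Z1.shape == (n[1], m)              # (n^[l], m)
assert A1.shape == Z1.shape
```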

Why Use a Deep NN?

  • Earlier layers can learn simple features, and deeper layers compose them into more complicated structures
  • A function that can be computed with a “small” L-layer deep NN may require exponentially more hidden units in a shallow NN

Practice

Week 2. Logistic Regression with a Neural Network Mindset

1. Prepare the data

Flatten, reshape, and standardize the data.

For RGB image data, just divide by 255 (the maximum pixel value), as in the sketch below.
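For instance, with a (hypothetical) batch of $m$ RGB images of shape $(m, 64, 64, 3)$, the flattening and standardization step might look like:

```python
import numpy as np

# hypothetical batch: m = 209 RGB images of size 64 x 64
train_x_orig = np.random.randint(0, 256, size=(209, 64, 64, 3), dtype=np.uint8)

# flatten each image into one column: shape (64*64*3, m)
train_x_flat = train_x_orig.reshape(train_x_orig.shape[0], -1).T

# standardize the pixel values into [0, 1]
train_x = train_x_flat / 255.0
print(train_x.shape)   # (12288, 209)
```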

2. Main steps of building a NN

  1. Define the model structure (such as number of input features)
  2. Initialize the model’s parameters
  3. Loop:
    1. Calculate current loss (forward propagation)
    2. Calculate current gradient (backward propagation)
    3. Update parameters (gradient descent)

The above steps are often built separately and integrated into one function, the $\mathtt{model()}$

The functions for the model may include:

  • activation function
  • initialization function: initialize the weights and bias with zeros
  • propagate function: forward and backward propagation, with cost
  • optimize function
  • prediction function: use the optimized results to predict

And put the above functions together in the model (see the skeleton after this list):

  1. Initialization
  2. Gradient descent
  3. Retrieve parameters w and b
  4. Predict the training and test set
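A skeleton of how such a `model()` might wire the pieces together, assuming helper functions for the steps listed above (the names `initialize_with_zeros`, `optimize`, and `predict` are my assumption, not necessarily the exact assignment API):

```python
def model(X_train, Y_train, X_test, Y_test, num_iterations=2000, learning_rate=0.005):
    """Logistic-regression 'NN': initialize, run gradient descent, then predict.
    Assumes initialize_with_zeros(), optimize(), predict() are defined as above."""
    # 1. initialization
    w, b = initialize_with_zeros(X_train.shape[0])

    # 2. gradient descent (forward + backward propagation in a loop)
    params, grads, costs = optimize(w, b, X_train, Y_train,
                                    num_iterations, learning_rate)

    # 3. retrieve the learned parameters w and b
    w, b = params["w"], params["b"]

    # 4. predict on the training and test sets
    Y_prediction_train = predict(w, b, X_train)
    Y_prediction_test = predict(w, b, X_test)

    return {"costs": costs, "w": w, "b": b,
            "Y_prediction_train": Y_prediction_train,
            "Y_prediction_test": Y_prediction_test}
```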

Further analysis

  1. Choose the learning rate carefully; plotting the learning curve (cost vs. number of iterations) for several learning rates is a useful reference (see the sketch below)
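A sketch of such a comparison, assuming the data and the `model()` skeleton from the previous sketches (which records the costs during training):

```python
import matplotlib.pyplot as plt

for lr in [0.01, 0.001, 0.0001]:
    result = model(train_x, train_y, test_x, test_y,
                   num_iterations=1500, learning_rate=lr)
    plt.plot(result["costs"], label=f"learning rate = {lr}")

plt.xlabel("iterations (in recorded intervals)")
plt.ylabel("cost")
plt.legend()
plt.show()
```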