Hi,
Just a quick reminder: this is a rather personal lecture note and is not meant to record all the details.
Still, hope it can be of help.
Comments and corrections are welcomed : )
Week 2
Logistic regression
A binary classification method in supervised learning
General idea
- input: $\large x \in \mathbb{R}^{n_x \times m}$
- hidden layer: logistic regression neurons with parameters $\large w \in \mathbb{R}^{n_x}$ and $\large b \in \mathbb{R}$
- $\large z=w^Tx+b$
- binary classification: $\large \hat{y} = a = \sigma(z)$, where $\sigma$ is the sigmoid function, $\large \sigma(z) = \frac{1}{1+e^{-z}}$
- output: prediction $\hat{y}=$ 0 or 1
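A minimal NumPy sketch of this forward pass (the variable names and the 0.5 threshold are my own choices, not fixed by the course):

```python
import numpy as np

def sigmoid(z):
    """sigma(z) = 1 / (1 + e^(-z))"""
    return 1 / (1 + np.exp(-z))

def predict(w, b, x):
    """w: (n_x, 1) weights, b: scalar bias, x: (n_x, m) inputs."""
    z = np.dot(w.T, x) + b            # (1, m)
    a = sigmoid(z)                    # a = y_hat, the estimated P(y = 1 | x)
    return (a > 0.5).astype(int)      # threshold at 0.5 to output 0 or 1
```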
Now, use gradient descent to train the parameters $w$ and $b$ :
- define loss function: $\large L(\hat{y}, y) = -\big(y\log{\hat{y}}+(1-y)\log{(1-\hat{y})}\big)$
- applies to a single training example
- $y$ can only be 0 or 1
- define cost function: $\large J(w, b) = {1\over{m}}\sum\limits_{i=1}^{m}L(a^{(i)}, y^{(i)})$
- applies to all $m$ training examples
- repeatedly update the parameters with:
- $\large w:= w-\alpha \frac{\partial J(w, b)}{\partial w}$
- $\large b:= b-\alpha \frac{\partial J(w, b)}{\partial b}$
- where $\alpha$ is the learning rate
- when coding, the derivative of the final output (e.g. $J$ here) with respect to a variable (e.g. $w$), $\large \frac{\partial J}{\partial w}$, is written as "$\mathrm{d} w$"
- the derivatives are calculated by "backpropagation": starting from the output value, apply the chain rule backwards
- minimizing the cost function $J$ is equivalent to maximizing the likelihood of the training labels, under the IID (independent and identically distributed) assumption
More details about backpropagation:
- basically, obtain the derivatives starting from the output value, based on the chain rule
- $\large \mathrm{d} z^{(i)} = a^{(i)} - y^{(i)}$
- $\large \mathrm{d}w = \frac{1}{m}\sum\limits_{i=1}^{m} x^{(i)}\,\mathrm{d} z^{(i)}$
- $\large \mathrm{d}b = \frac{1}{m}\sum\limits_{i=1}^{m} \mathrm{d} z^{(i)}$
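As a quick chain-rule check of $\mathrm{d}z$ for a single example (with $a = \sigma(z)$, so $\frac{\partial a}{\partial z} = a(1-a)$):

$\large \frac{\partial L}{\partial z} = \frac{\partial L}{\partial a}\frac{\partial a}{\partial z} = \left(-\frac{y}{a} + \frac{1-y}{1-a}\right) a(1-a) = a - y$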
Vectorizing with Python
- Avoid using “for-loop” as much as possible
- Make sure the data are 2-D matrices rather than rank-1 arrays, or $\texttt{broadcasting}$ in Python may cause bugs
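A hedged NumPy sketch of what the vectorized version can look like (variable names are mine; shapes are noted in the comments):

```python
import numpy as np

def propagate(w, b, X, Y):
    """One vectorized forward + backward pass of logistic regression.
    w: (n_x, 1), b: scalar, X: (n_x, m), Y: (1, m) with labels 0/1."""
    m = X.shape[1]
    A = 1 / (1 + np.exp(-(np.dot(w.T, X) + b)))    # (1, m) activations
    cost = -np.sum(Y * np.log(A) + (1 - Y) * np.log(1 - A)) / m
    dZ = A - Y                                     # (1, m)
    dw = np.dot(X, dZ.T) / m                       # (n_x, 1), same shape as w
    db = np.sum(dZ) / m                            # scalar, same shape as b
    return dw, db, cost

# Keep vectors 2-D: a rank-1 array such as np.zeros(n_x) broadcasts in
# surprising ways, so use an explicit column shape (n_x, 1) instead.
n_x, m = 4, 10
w, b = np.zeros((n_x, 1)), 0.0
X = np.random.randn(n_x, m)
Y = (np.random.rand(1, m) > 0.5).astype(float)
dw, db, cost = propagate(w, b, X, Y)
```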
Week 3
An example of a 4-layer NN:
- Layer 0: the input layer (not counted), $x$
- Layer 1 to 3: the hidden layers
- Layer 4: the output layer, $\hat{y}$
Two-Layer Neural Network
Structure of an $(l+1)$-layer NN
- Input layer: features $X$, or $a^{[0]}$, NOT counted as a layer
- $X \in \mathbb{R}^{n_x \times m}$
- Hidden layer(s). Each layer is composed of:
- A linear function with parameters $W^{[l]}$ and $b^{[l]}$: $Z^{[l]} = W^{[l]} A^{[l-1]} + b^{[l]}$
- A non-linear activation function $g$: $A^{[l]} = g(Z^{[l]})$
- Output layer: $\hat{Y}$, counted as layer $l+1$, or $a^{[l+1]}$
Notations
In the case of $m$ training examples, each of which has $n_x$ features, and the 1st layer has $I$ nodes:
- The $j$-th training example, $x^{(j)}$, is a column vector: $x^{(j)} \in \mathbb{R}^{n_x \times 1}$
- Stacking the $m$ examples gives $X \in \mathbb{R}^{n_x \times m}$
- $\large a^{[l](j)}_i$: the $i$-th node of the $l$-th layer for the $j$-th training example
- Taking the $i$-th node in the 1st layer for example:
- $\large a^{[1]}_i=g(z^{[1]}_i)$
- $\large z^{[1]}_i = w^{[1]T}_i X + b^{[1]}_i$, where $\large w^{[1]}_i \in \mathbb{R}^{n_x \times 1}$ is shared across all $m$ examples, and $\large b^{[1]}_i \in \mathbb{R}$ ($\texttt{broadcasting}$!)
- Stacking all the nodes:
- $\large W^{[1]}=\begin{bmatrix} \dots & w^{[1]T}_1 & \dots \\ & \vdots & \\ \dots & w^{[1]T}_I & \dots \end{bmatrix}_{I \times n_x}$
- Then, for the whole 1st layer:
- $b^{[1]} \in \mathbb{R}^{I \times 1}$
- $\large Z^{[1]} = W^{[1]}X+b^{[1]}$, $Z^{[1]} \in \mathbb{R}^{I \times m}$
- $\large A^{[1]} = g(Z^{[1]})$
- Then the 2nd layer:
- $\large Z^{[2]} = W^{[2]}A^{[1]}+b^{[2]}$
- $\large A^{[2]} = g(Z^{[2]})$
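A small NumPy sketch of this vectorized forward pass for a 2-layer NN (tanh hidden layer, sigmoid output; the sizes are made up for illustration):

```python
import numpy as np

n_x, n_1, m = 3, 4, 5              # input features, hidden units, examples (arbitrary sizes)
X = np.random.randn(n_x, m)        # A^[0] = X

W1 = np.random.randn(n_1, n_x) * 0.01
b1 = np.zeros((n_1, 1))
W2 = np.random.randn(1, n_1) * 0.01
b2 = np.zeros((1, 1))

Z1 = np.dot(W1, X) + b1            # (n_1, m); b1 broadcasts across the m columns
A1 = np.tanh(Z1)                   # (n_1, m)
Z2 = np.dot(W2, A1) + b2           # (1, m)
A2 = 1 / (1 + np.exp(-Z2))         # (1, m), i.e. Y_hat
```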
Activation Function
Functions usually used as activation:
Name | Sigmoid function | Hyperbolic tangent (tanh) function |
---|---|---|
Expression | $\sigma(z) = \frac{1}{1+e^{-z}}$ | $\tanh(z) = \frac{e^{z}-e^{-z}}{e^{z}+e^{-z}}$ |

Name | Rectified linear unit (ReLU) function | Leaky ReLU function |
---|---|---|
Expression | $g(z) = \mathrm{max}\{0,z\}$ | e.g., $g(z) = \mathrm{max}\{0.01z,z\}$ |
Suggestions on choosing the activation function
- The sigmoid function is rarely used in hidden layers
- The sigmoid function may be used in the output layer for binary classification
- The tanh function almost always performs better and trains faster than the sigmoid function in hidden layers
- The ReLU function is the most commonly used; it also trains faster
- A leaky ReLU function can work better than ReLU, but it is not commonly used, and ReLU is usually good enough
- So just use ReLU : )
Why there has to be a non-linear activation
- A composition of linear functions is still a linear function, so without non-linearity the whole network collapses to a single linear model
- Only for regression problems might the output layer use a linear activation function
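A one-line check (for two layers with a linear "activation"):

$\large W^{[2]}\big(W^{[1]}x + b^{[1]}\big) + b^{[2]} = \big(W^{[2]}W^{[1]}\big)x + \big(W^{[2]}b^{[1]} + b^{[2]}\big) = W'x + b'$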
Implement the Gradient Descent
Forward propagation
Just calculate from the input layer through the hidden layers to the output layer using MATRICES !!!
Backpropagation
A good tutorial here
- Use the chain rule
- Keep the dimensions matched
Random Initialization
To avoid symmetry (all neurons doing exactly the same calculation)
Always start small: most activation functions have a flat slope at large $|z|$, leading to slow learning
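A minimal sketch of this kind of initialization for a 1-hidden-layer NN (the 0.01 scale follows the "start small" advice; the exact factor is a choice, not a rule):

```python
import numpy as np

def initialize_parameters(n_x, n_h, n_y):
    """Small random weights break the symmetry; biases can safely start at zero."""
    W1 = np.random.randn(n_h, n_x) * 0.01   # (n_h, n_x)
    b1 = np.zeros((n_h, 1))                 # (n_h, 1)
    W2 = np.random.randn(n_y, n_h) * 0.01   # (n_y, n_h)
    b2 = np.zeros((n_y, 1))                 # (n_y, 1)
    return {"W1": W1, "b1": b1, "W2": W2, "b2": b2}
```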
Week 4
Forward Propagation wrap-up
Forward Propagation
- for one training example:
- $\large z^{[l]} = W^{[l]} a^{[l-1]}+b^{[l]}$
- $\large a^{[l]} = g^{[l]}(z^{[l]})$
- for the whole training set:
- $\large Z^{[l]} = W^{[l]}\times A^{[l-1]}+b^{[l]}$
- $\large A^{[l]} = g^{[l]}(Z^{[l]})$
- where $\large Z^{[l]}=\begin{bmatrix} \vdots & & \vdots \\ z^{[l](1)} & \cdots & z^{[l](m)} \\ \vdots & & \vdots \end{bmatrix}_{n^{[l]} \times m}$
- $\large A^{[0]} = X$, and $\large A^{[L]} = \hat{Y}$
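A compact NumPy sketch of this loop over all $L$ layers (parameters are kept in a plain list of $(W^{[l]}, b^{[l]})$ pairs; the names and the tanh/sigmoid choice are assumptions for illustration):

```python
import numpy as np

def l_model_forward(X, parameters, hidden_activation=np.tanh):
    """Forward propagation through all L layers.
    parameters: list of (W_l, b_l) tuples ordered from layer 1 to layer L."""
    A = X                                      # A^[0] = X
    L = len(parameters)
    for l, (W, b) in enumerate(parameters, start=1):
        Z = np.dot(W, A) + b                   # Z^[l] = W^[l] A^[l-1] + b^[l]
        if l < L:
            A = hidden_activation(Z)           # hidden layers
        else:
            A = 1 / (1 + np.exp(-Z))           # sigmoid output layer, A^[L] = Y_hat
    return A
```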
Backpropagation
$\large \mathrm{d}Z^{[L]}=A^{[L]}-Y$
$\large \mathrm{d}W^{[L]} = \frac{1}{m} \mathrm{d} Z^{[L]} A^{[L-1]T}$
$\large \mathrm{d}b^{[L]} = \frac{1}{m}\mathrm{np.sum}(\mathrm{d}Z^{[L]}, \mathrm{axis} = 1, \mathrm{keepdims} = \mathrm{True})$
$\large \mathrm{d}Z^{[L-1]} = W^{[L]T} \mathrm{d}Z^{[L]} *\operatorname{g}'^{[L-1]}(Z^{[L-1]})$
…
$\large \mathrm{d}Z^{[1]} = W^{[2]T} \mathrm{d}Z^{[2]} *\operatorname{g}'^{[1]}(Z^{[1]})$
$\large \mathrm{d}W^{[1]} = \frac{1}{m} \mathrm{d} Z^{[1]} A^{[0]T}$
$\large \mathrm{d}b^{[1]} = \frac{1}{m}\mathrm{np.sum}(\mathrm{d}Z^{[1]}, \mathrm{axis} = 1, \mathrm{keepdims} = \mathrm{True})$
"$*$" denotes element-wise multiplication
During the forward & backward computations, many quantities are used repeatedly. Thus it is more efficient to cache these quantities for later use, such as $w^{[l]}$, $b^{[l]}$, $z^{[l]}$, $a^{[l]}$, $\mathrm{d}w^{[l]}$, $\mathrm{d}b^{[l]}$, and $\mathrm{d}a^{[l]}$
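A hedged sketch of one layer's backward step using such cached quantities (here the "cache" is just a dict holding $W^{[l]}$, $Z^{[l]}$, and $A^{[l-1]}$ from the forward pass; the names are mine):

```python
import numpy as np

def linear_activation_backward(dA, cache, activation_grad):
    """One backward step for layer l.
    dA: dA^[l], shape (n_l, m)
    cache: dict with W (n_l, n_{l-1}), Z (n_l, m), A_prev = A^[l-1] (n_{l-1}, m)
    activation_grad: function that returns g'^[l](Z)."""
    W, Z, A_prev = cache["W"], cache["Z"], cache["A_prev"]
    m = A_prev.shape[1]
    dZ = dA * activation_grad(Z)                   # element-wise product
    dW = np.dot(dZ, A_prev.T) / m                  # (n_l, n_{l-1})
    db = np.sum(dZ, axis=1, keepdims=True) / m     # (n_l, 1)
    dA_prev = np.dot(W.T, dZ)                      # (n_{l-1}, m), passed on to layer l-1
    return dA_prev, dW, db

# e.g. for a tanh hidden layer: g'(z) = 1 - tanh(z)^2
tanh_grad = lambda Z: 1 - np.tanh(Z) ** 2
```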
Get the Matrix Dimensions Right
A great tool for debugging!
- $w^{[l]}$, $\mathrm{d}{w}^{[l]}$: $(n^{[l]}, n^{[l-1]})$
- $b^{[l]}$, $\mathrm{d}{b}^{[l]}$: $(n^{[l]}, 1)$
- $z^{[l]}$, $a^{[l]}$, $dz^{[l]}$, $da^{[l]}$: $(n^{[l]}, 1)$
- $Z^{[l]}$, $A^{[l]}$, $dZ^{[l]}$, $dA^{[l]}$: $(n^{[l]},m)$
- because of $\mathtt{broadcasting}$, $b^{[l]}$ of shape $(n^{[l]}, 1)$ behaves like a matrix of shape $(n^{[l]}, m)$
Why Use a Deep NN?
- Can handle simple to complicated structures
- A function that can be computed with a "small" $L$-layer deep NN may require exponentially more hidden units in a shallow NN
Practice
Week 2. Logistic Regression with a Neural Network Mindset
1. Prepare the data
Flatten, reshape, and standardize the data.
For RGB image data, just divide by 255.
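A short NumPy sketch of the flatten + standardize step (the (m, px, px, 3) image layout and the sizes are assumptions for illustration):

```python
import numpy as np

# Suppose train_x_orig holds m RGB images of size px x px: shape (m, px, px, 3)
m, px = 209, 64                                  # example sizes, not fixed
train_x_orig = np.random.randint(0, 256, (m, px, px, 3))

# Flatten each image into one column: result has shape (px * px * 3, m)
train_x_flat = train_x_orig.reshape(m, -1).T

# Standardize: RGB values lie in [0, 255], so dividing by 255 maps them into [0, 1]
train_x = train_x_flat / 255.0
```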
2. Main steps of building a NN
- Define the model structure (such as number of input features)
- Initialize the model’s parameters
- Loop:
- Calculate current loss (forward propagation)
- Calculate current gradient (backward propagation)
- Update parameters (gradient descent)
The above steps are often built separately and integrated into one function, the $\mathtt{model()}$
The functions for the model may include:
- activation function
- initializing function: initialize the weights and bias with zeros
- propagate function: forward and backward propagation, with cost
- optimize function
- prediction function: use the optimized results to predict
And put the above functions in the model:
- Initialization
- Gradient descent
- Retrieve parameters w and b
- Predict the training and test set
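A rough skeleton of how those pieces might be wired together (it reuses the $\mathtt{propagate()}$ and $\mathtt{predict()}$ sketches from the Week 2 notes above; everything else is a hypothetical placeholder, not the official assignment code):

```python
import numpy as np

def initialize_with_zeros(n_x):
    """Zero initialization is fine for plain logistic regression."""
    return np.zeros((n_x, 1)), 0.0

def model(X_train, Y_train, X_test, Y_test, num_iterations=2000, learning_rate=0.005):
    """Initialize, run gradient descent, then predict on the training and test sets."""
    w, b = initialize_with_zeros(X_train.shape[0])

    costs = []
    for i in range(num_iterations):
        dw, db, cost = propagate(w, b, X_train, Y_train)   # forward + backward pass
        w = w - learning_rate * dw                         # gradient descent update
        b = b - learning_rate * db
        if i % 100 == 0:
            costs.append(cost)                             # record cost for the learning curve

    return {"w": w, "b": b, "costs": costs,
            "Y_pred_train": predict(w, b, X_train),
            "Y_pred_test": predict(w, b, X_test)}
```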
Further analysis
- Choose the learning rate; plotting the learning curve (cost vs. iterations) helps