Hi,
Just a quick reminder: this is a rather personal lecture note and is not meant to record all the details.
Still, hope it can be of help.
Comments and corrections are welcomed : )
Week 2
Logistic regression
A binary classification method in supervised learning
General idea
- input: $\large x \in \mathbb{R}^{n_x \times m}$
- hidden layer: logistic regression neurons with parameters $\large w \in \mathbb{R}^{n_x}$ and $\large b \in \mathbb{R}$
- $\large z=w^Tx+b$
- binary classification: $\large \hat{y} = a = \sigma(z)$, where $\sigma$ is the sigmoid function, $\large \sigma(z) = \frac{1}{1+e^{-z}}$
- output: prediction $\hat{y}=$ 0 or 1
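A minimal NumPy sketch of this forward pass (the variable names and the 0.5 threshold are my own choices, not fixed by the course):

```python
import numpy as np

def sigmoid(z):
    """sigma(z) = 1 / (1 + e^(-z))"""
    return 1 / (1 + np.exp(-z))

def predict(w, b, x):
    """w: (n_x, 1) weights, b: scalar bias, x: (n_x, m) inputs."""
    z = np.dot(w.T, x) + b            # (1, m)
    a = sigmoid(z)                    # a = y_hat, the estimated P(y = 1 | x)
    return (a > 0.5).astype(int)      # threshold at 0.5 to output 0 or 1
```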
Now, use gradient descent to train the parameters $w$ and $b$ :
- define loss function: $\large L(\hat{y}, y) = -\big(y\log{\hat{y}}+(1-y)\log{(1-\hat{y})}\big)$
- applies to a single training example
- $y$ can only be 0 or 1
- define cost function: $\large J(w, b) = {1\over{m}}\sum\limits_{i=1}^{m}L(a^{(i)}, y^{(i)})$
- applies to all $m$ training examples
- repeatedly update the parameters with:
- $\large w:= w-\alpha \frac{\partial J(w, b)}{\partial w}$
- $\large b:= b-\alpha \frac{\partial J(w, b)}{\partial b}$
- where $\alpha$ is the learning rate
- when coding, the derivative of the final output (e.g. $J$ here) with respect to a variable (e.g. $w$), $\large \frac{\partial J}{\partial w}$, is written as "$\mathrm{d} w$"
- the derivatives are calculated by "backpropagation": starting from the output value, apply the chain rule backwards
- minimizing the cost function $J$ is equivalent to maximizing the likelihood of the training labels, under the IID (independent and identically distributed) assumption
More details about backpropagation:
- basically, obtain the derivatives starting from the output value, based on the chain rule
- $\large \mathrm{d} z^{(i)} = a^{(i)} - y^{(i)}$
- $\large \mathrm{d}w = \frac{1}{m}\sum\limits_{i=1}^{m} x^{(i)}\,\mathrm{d} z^{(i)}$
- $\large \mathrm{d}b = \frac{1}{m}\sum\limits_{i=1}^{m} \mathrm{d} z^{(i)}$
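As a quick chain-rule check of $\mathrm{d}z$ for a single example (with $a = \sigma(z)$, so $\frac{\partial a}{\partial z} = a(1-a)$):

$\large \frac{\partial L}{\partial z} = \frac{\partial L}{\partial a}\frac{\partial a}{\partial z} = \left(-\frac{y}{a} + \frac{1-y}{1-a}\right) a(1-a) = a - y$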
Vectorizing with Python
- Avoid using “for-loop” as much as possible
- Make sure the data are 2-D matrices rather than rank-1 arrays, or $\texttt{broadcasting}$ in Python may cause bugs
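A hedged NumPy sketch of what the vectorized version can look like (variable names are mine; shapes are noted in the comments):

```python
import numpy as np

def propagate(w, b, X, Y):
    """One vectorized forward + backward pass of logistic regression.
    w: (n_x, 1), b: scalar, X: (n_x, m), Y: (1, m) with labels 0/1."""
    m = X.shape[1]
    A = 1 / (1 + np.exp(-(np.dot(w.T, X) + b)))    # (1, m) activations
    cost = -np.sum(Y * np.log(A) + (1 - Y) * np.log(1 - A)) / m
    dZ = A - Y                                     # (1, m)
    dw = np.dot(X, dZ.T) / m                       # (n_x, 1), same shape as w
    db = np.sum(dZ) / m                            # scalar, same shape as b
    return dw, db, cost

# Keep vectors 2-D: a rank-1 array such as np.zeros(n_x) broadcasts in
# surprising ways, so use an explicit column shape (n_x, 1) instead.
n_x, m = 4, 10
w, b = np.zeros((n_x, 1)), 0.0
X = np.random.randn(n_x, m)
Y = (np.random.rand(1, m) > 0.5).astype(float)
dw, db, cost = propagate(w, b, X, Y)
```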
Week 3
An example of a 4-layer NN:
- Layer 0: the input layer (not counted), $x$
- Layer 1 to 3: the hidden layers
- Layer 4: the output layer, $\hat{y}$
Two-Layer Neural Network
Structure of an $(l+1)$-layer NN
- Input layer: features $X$, or $a^{[0]}$, NOT counted as a layer
- $X \in \mathbb{R}^{n_x \times m}$
- Hidden layer(s). Each layer is composed of:
- A linear function with parameters $W^{[l]}$ and $b^{[l]}$: $Z^{[l]} = W^{[l]} A^{[l-1]} + b^{[l]}$
- A non-linear activation function $g$: $A^{[l]} = g(Z^{[l]})$
- Output layer: $\hat{Y}$, counted as layer $l+1$, or $a^{[l+1]}$
Notations
In the case of $m$ training examples, each of which has $n_x$ features, and the 1st layer has $I$ nodes:
- The $j$-th training example, $x^{(j)}$, is a column vector: $x^{(j)} \in \mathbb{R}^{n_x \times 1}$
- Stacking the $m$ examples gives $X \in \mathbb{R}^{n_x \times m}$
- $\large a^{[l](j)}_i$: the $i$-th node of the $l$-th layer for the $j$-th training example
- Taking the $i$-th node in the 1st layer for example:
- $\large a^{[1]}_i=g(z^{[1]}_i)$
- $\large z^{[1]}_i = w^{[1]T}_i X + b^{[1]}_i$, where $\large w^{[1]}_i \in \mathbb{R}^{n_x \times 1}$ is shared across all $m$ examples, and $\large b^{[1]}_i \in \mathbb{R}$ ($\texttt{broadcasting}$!)
- Stacking all the nodes:
- $\large W^{[1]}=\begin{bmatrix} \dots & w^{[1]T}_1 & \dots \\ & \vdots & \\ \dots & w^{[1]T}_I & \dots \end{bmatrix}_{I \times n_x}$
- Then, for the whole 1st layer:
- $b^{[1]} \in \mathbb{R}^{I \times 1}$
- $\large Z^{[1]} = W^{[1]}X+b^{[1]}$, $Z^{[1]} \in \mathbb{R}^{I \times m}$
- $\large A^{[1]} = g(Z^{[1]})$
- Then the 2nd layer:
- $\large Z^{[2]} = W^{[2]}A^{[1]}+b^{[2]}$
- $\large A^{[2]} = g(Z^{[2]})$
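A small NumPy sketch of this vectorized forward pass for a 2-layer NN (tanh hidden layer, sigmoid output; the sizes are made up for illustration):

```python
import numpy as np

n_x, n_1, m = 3, 4, 5              # input features, hidden units, examples (arbitrary sizes)
X = np.random.randn(n_x, m)        # A^[0] = X

W1 = np.random.randn(n_1, n_x) * 0.01
b1 = np.zeros((n_1, 1))
W2 = np.random.randn(1, n_1) * 0.01
b2 = np.zeros((1, 1))

Z1 = np.dot(W1, X) + b1            # (n_1, m); b1 broadcasts across the m columns
A1 = np.tanh(Z1)                   # (n_1, m)
Z2 = np.dot(W2, A1) + b2           # (1, m)
A2 = 1 / (1 + np.exp(-Z2))         # (1, m), i.e. Y_hat
```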
Activation Function
Functions usually used as activation:
Name | Sigmoid function | Hyperbolic tangent (tanh) function |
---|---|---|
Expression | $\sigma(z) = \frac{1}{1+e^{-z}}$ | $\tanh(z) = \frac{e^{z}-e^{-z}}{e^{z}+e^{-z}}$ |

Name | Rectified linear unit (ReLU) function | Leaky ReLU function |
---|---|---|
Expression | $g(z) = \mathrm{max}\{0,z\}$ | e.g., $g(z) = \mathrm{max}\{0.01z,z\}$ |
Suggestions on choosing the activation function
- The sigmoid function is rarely used in hidden layers
- The sigmoid function may be used in the output layer for binary classification
- The tanh function almost always performs better and trains faster than the sigmoid function in hidden layers
- The ReLU function is the most commonly used; it also trains faster
- A leaky ReLU function can work better than ReLU, but it is not commonly used, and ReLU is usually good enough
- So just use ReLU : )
Why there has to be a non-linear activation
- A composition of linear functions is still a linear function, so without non-linearity the whole network collapses to a single linear model
- Only for regression problems might the output layer use a linear activation function
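A one-line check (for two layers with a linear "activation"):

$\large W^{[2]}\big(W^{[1]}x + b^{[1]}\big) + b^{[2]} = \big(W^{[2]}W^{[1]}\big)x + \big(W^{[2]}b^{[1]} + b^{[2]}\big) = W'x + b'$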
Implement the Gradient Descent
Forward propagation
Just calculate from the input layer through the hidden layers to the output layer using MATRICES !!!
Backpropagation
A good tutorial here
- Use the chain rule
- Keep the dimensions matched
Random Initialization
To avoid symmetry (all neurons doing exactly the same calculation)
Always start small: most activation functions have a flat slope at large $|z|$, leading to slow learning
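A minimal sketch of this kind of initialization for a 1-hidden-layer NN (the 0.01 scale follows the "start small" advice; the exact factor is a choice, not a rule):

```python
import numpy as np

def initialize_parameters(n_x, n_h, n_y):
    """Small random weights break the symmetry; biases can safely start at zero."""
    W1 = np.random.randn(n_h, n_x) * 0.01   # (n_h, n_x)
    b1 = np.zeros((n_h, 1))                 # (n_h, 1)
    W2 = np.random.randn(n_y, n_h) * 0.01   # (n_y, n_h)
    b2 = np.zeros((n_y, 1))                 # (n_y, 1)
    return {"W1": W1, "b1": b1, "W2": W2, "b2": b2}
```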
Week 4
Forward Propagation wrap-up
Forward Propagation
- for one training example:
- $\large z^{[l]} = W^{[l]} a^{[l-1]}+b^{[l]}$
- $\large a^{[l]} = g^{[l]}(z^{[l]})$
- for the whole training set:
- $\large Z^{[l]} = W^{[l]}\times A^{[l-1]}+b^{[l]}$
- $\large A^{[l]} = g^{[l]}(Z^{[l]})$
- where $\large Z^{[l]}=\begin{bmatrix} \vdots & & \vdots \\ z^{[l](1)} & \cdots & z^{[l](m)} \\ \vdots & & \vdots \end{bmatrix}_{n^{[l]} \times m}$
- $\large A^{[0]} = X$, and $\large A^{[L]} = \hat{Y}$
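A compact NumPy sketch of this loop over all $L$ layers (parameters are kept in a plain list of $(W^{[l]}, b^{[l]})$ pairs; the names and the tanh/sigmoid choice are assumptions for illustration):

```python
import numpy as np

def l_model_forward(X, parameters, hidden_activation=np.tanh):
    """Forward propagation through all L layers.
    parameters: list of (W_l, b_l) tuples ordered from layer 1 to layer L."""
    A = X                                      # A^[0] = X
    L = len(parameters)
    for l, (W, b) in enumerate(parameters, start=1):
        Z = np.dot(W, A) + b                   # Z^[l] = W^[l] A^[l-1] + b^[l]
        if l < L:
            A = hidden_activation(Z)           # hidden layers
        else:
            A = 1 / (1 + np.exp(-Z))           # sigmoid output layer, A^[L] = Y_hat
    return A
```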
Backpropagation
$\large \mathrm{d}Z^{[L]}=A^{[L]}-Y$
$\large \mathrm{d}W^{[L]} = \frac{1}{m} \mathrm{d} Z^{[L]} A^{[L-1]T}$
$\large \mathrm{d}b^{[L]} = \frac{1}{m}\mathrm{np.sum}(\mathrm{d}Z^{[L]}, \mathrm{axis} = 1, \mathrm{keepdims} = \mathrm{True})$
$\large \mathrm{d}Z^{[L-1]} = W^{[L]T} \mathrm{d}Z^{[L]} *\operatorname{g}'^{[L-1]}(Z^{[L-1]})$
…
$\large \mathrm{d}Z^{[1]} = W^{[2]T} \mathrm{d}Z^{[2]} *\operatorname{g}'^{[1]}(Z^{[1]})$
$\large \mathrm{d}W^{[1]} = \frac{1}{m} \mathrm{d} Z^{[1]} A^{[0]T}$
$\large \mathrm{d}b^{[1]} = \frac{1}{m}\mathrm{np.sum}(\mathrm{d}Z^{[1]}, \mathrm{axis} = 1, \mathrm{keepdims} = \mathrm{True})$
"$*$" denotes element-wise multiplication
During the forward & backward computations, many quantities are used repeatedly. Thus it is more efficient to cache these quantities for later use, such as $w^{[l]}$, $b^{[l]}$, $z^{[l]}$, $a^{[l]}$, $\mathrm{d}w^{[l]}$, $\mathrm{d}b^{[l]}$, and $\mathrm{d}a^{[l]}$
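A hedged sketch of one layer's backward step using such cached quantities (here the "cache" is just a dict holding $W^{[l]}$, $Z^{[l]}$, and $A^{[l-1]}$ from the forward pass; the names are mine):

```python
import numpy as np

def linear_activation_backward(dA, cache, activation_grad):
    """One backward step for layer l.
    dA: dA^[l], shape (n_l, m)
    cache: dict with W (n_l, n_{l-1}), Z (n_l, m), A_prev = A^[l-1] (n_{l-1}, m)
    activation_grad: function that returns g'^[l](Z)."""
    W, Z, A_prev = cache["W"], cache["Z"], cache["A_prev"]
    m = A_prev.shape[1]
    dZ = dA * activation_grad(Z)                   # element-wise product
    dW = np.dot(dZ, A_prev.T) / m                  # (n_l, n_{l-1})
    db = np.sum(dZ, axis=1, keepdims=True) / m     # (n_l, 1)
    dA_prev = np.dot(W.T, dZ)                      # (n_{l-1}, m), passed on to layer l-1
    return dA_prev, dW, db

# e.g. for a tanh hidden layer: g'(z) = 1 - tanh(z)^2
tanh_grad = lambda Z: 1 - np.tanh(Z) ** 2
```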
Get the Matrix Dimensions Right
A great tool for debugging!
- $w^{[l]}$, $\mathrm{d}{w}^{[l]}$: $(n^{[l]}, n^{[l-1]})$
- $b^{[l]}$, $\mathrm{d}{b}^{[l]}$: $(n^{[l]}, 1)$
- $z^{[l]}$, $a^{[l]}$, $dz^{[l]}$, $da^{[l]}$: $(n^{[l]}, 1)$
- $Z^{[l]}$, $A^{[l]}$, $dZ^{[l]}$, $dA^{[l]}$: $(n^{[l]},m)$
- because of $\mathtt{broadcasting}$, $b^{[l]}$ of shape $(n^{[l]}, 1)$ behaves like a matrix of shape $(n^{[l]}, m)$
Why Use a Deep NN?
- Can handle simple to complicated structures
- A function that can be computed with a "small" $L$-layer deep NN may require exponentially more hidden units in a shallow NN
Practice
Week 2. Logistic Regression with a Neural Network Mindset
1. Prepare the data
Flatten, reshape, and standardize the data.
For RGB image data, just divide by 255.
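A short NumPy sketch of the flatten + standardize step (the (m, px, px, 3) image layout and the sizes are assumptions for illustration):

```python
import numpy as np

# Suppose train_x_orig holds m RGB images of size px x px: shape (m, px, px, 3)
m, px = 209, 64                                  # example sizes, not fixed
train_x_orig = np.random.randint(0, 256, (m, px, px, 3))

# Flatten each image into one column: result has shape (px * px * 3, m)
train_x_flat = train_x_orig.reshape(m, -1).T

# Standardize: RGB values lie in [0, 255], so dividing by 255 maps them into [0, 1]
train_x = train_x_flat / 255.0
```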
2. Main steps of building a NN
- Define the model structure (such as number of input features)
- Initialize the model’s parameters
- Loop:
- Calculate current loss (forward propagation)
- Calculate current gradient (backward propagation)
- Update parameters (gradient descent)
The above steps are often built separately and integrated into one function, the $\mathtt{model()}$
The functions for the model may include:
- activation function
- initializing function: initialize the weights and bias with zeros
- propagate function: forward and backward propagation, with cost
- optimize function
- prediction function: use the optimized results to predict
And put the above functions in the model:
- Initialization
- Gradient descent
- Retrieve parameters w and b
- Predict the training and test set
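A rough skeleton of how those pieces might be wired together (it reuses the $\mathtt{propagate()}$ and $\mathtt{predict()}$ sketches from the Week 2 notes above; everything else is a hypothetical placeholder, not the official assignment code):

```python
import numpy as np

def initialize_with_zeros(n_x):
    """Zero initialization is fine for plain logistic regression."""
    return np.zeros((n_x, 1)), 0.0

def model(X_train, Y_train, X_test, Y_test, num_iterations=2000, learning_rate=0.005):
    """Initialize, run gradient descent, then predict on the training and test sets."""
    w, b = initialize_with_zeros(X_train.shape[0])

    costs = []
    for i in range(num_iterations):
        dw, db, cost = propagate(w, b, X_train, Y_train)   # forward + backward pass
        w = w - learning_rate * dw                         # gradient descent update
        b = b - learning_rate * db
        if i % 100 == 0:
            costs.append(cost)                             # record cost for the learning curve

    return {"w": w, "b": b, "costs": costs,
            "Y_pred_train": predict(w, b, X_train),
            "Y_pred_test": predict(w, b, X_test)}
```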
Further analysis
- Choose the learning rate; plotting the learning curve (cost vs. iterations) helps