Week 1. Strategies to Hit the Goal
1. Orthogonalization
Definition: Orthogonalization is a system design property that ensures that modification of an instruction or an algorithm component does not create or propagate side effects to other system components. Orthogonalization makes it easier to independently verify the algorithms, thus reducing the time required for testing and development.
2. Chain of assumptions in ML
- Fit training set well on cost function (human–level performance)
- Fit dev set well on cost function
- Fit test set well on cost function
- Performs well in real world
3. Single (real) number evaluation metric
1) Common evaluation metrics
Basic idea: find ONE value that can evaluate the performance of the model(s). It can be a defined number or the result from the combination of some values.
The metric helps to analyze and tune the model. It should follow the goal.
Some common terms:
Positive: predicted to be True; Negative: predicted to be False.
True: the prediction agrees with reality; False: it does not.
- Precision (Positive predictive value): the fraction of true positives among all the predicted positives.
$\large \mathrm{Precision} = \frac{\mathrm{True\ Positives}}{\mathrm{Number\ of\ Predicted\ Positives}} = \frac{\mathrm{True\ Positives}}{\mathrm{True\ Positives + False\ Positives}}$
- Recall (Sensitivity): the fraction of true positives among all the actual positives.
$\large \mathrm{Recall} = \frac{\mathrm{True\ Positives}}{\mathrm{Number\ of\ Actual\ Positives}} = \frac{\mathrm{True\ Positives}}{\mathrm{True\ Positives + False\ Negatives}}$
- F1 score: a tradeoff between precision ($P$) and recall ($R$):
$\large \mathrm{F1\ score} = 2 \times \frac{PR}{P+R}$
- Accuracy: the fraction of correctly predicted cases among the entire sample.
$\large \mathrm{Accuracy} = \frac{\mathrm{True\ Positives + True\ Negatives}}{\mathrm{Total\ Population}}$
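As a concrete check on the definitions above, a minimal Python sketch computing all four metrics from raw confusion-matrix counts (the function name and numbers are illustrative):

```python
def evaluation_metrics(tp, fp, tn, fn):
    """Four common metrics from confusion-matrix counts."""
    precision = tp / (tp + fp)                  # among predicted positives
    recall = tp / (tp + fn)                     # among actual positives
    f1 = 2 * precision * recall / (precision + recall)
    accuracy = (tp + tn) / (tp + fp + tn + fn)
    return precision, recall, f1, accuracy

# 40 TP, 10 FP, 45 TN, 5 FN out of 100 examples:
print(evaluation_metrics(40, 10, 45, 5))        # (0.8, ~0.889, ~0.842, 0.85)
```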
2) Satisficing and Optimizing Metrics
- Satisficing metric: requirements that have to be met (but there may be little value in doing better beyond the threshold)
- Optimizing metric: the aspect you want to do as well as possible
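A toy sketch of how the two kinds of metrics combine when picking a model: filter by the satisficing constraint, then maximize the optimizing metric (all names, numbers, and the 100 ms threshold are made up):

```python
# Hypothetical candidates: (name, accuracy, runtime in ms).
models = [("A", 0.90, 80), ("B", 0.92, 95), ("C", 0.95, 1500)]

MAX_RUNTIME_MS = 100  # satisficing metric: must run within 100 ms
feasible = [m for m in models if m[2] <= MAX_RUNTIME_MS]
best = max(feasible, key=lambda m: m[1])  # optimizing metric: accuracy
print(best)  # ('B', 0.92, 95): C is more accurate but violates the constraint
```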
4. Train, Dev, Test set
1) Distributions
- Dev and test sets should always have the same distribution.
- Dev and test sets should reflect the data that you expect to get/apply to in the future and consider important to do well on.
2) Sizes
Rule of thumb for selecting the sizes of the dev and test sets:
- Dev set: big enough to evaluate different ideas.
- Test set: big enough to give high confidence in the overall performance of the system and to evaluate the final cost; a small test set is likely to show high fluctuations.
As the total data size gets bigger, the training set usually takes a much larger share than the other two (e.g., 98%/1%/1% with millions of examples); the dev set and test set usually have the same size.
5. When to change the metrics and dev/test sets?
Always focus on the goal/application.
- the original metric cannot correctly rank-order preferences between algorithms, e.g., a classifier has higher accuracy but may let through pornographic images $\rightarrow$ adjust by adding weights to certain examples in the cost function (see the weighted-metric sketch after this list).
- the algorithm works well on the metric and dev/test sets, but not in the application, i.e., they do not reflect the data distribution of the application $\rightarrow$ change them.
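For the pornographic-image example, a sketch of such a re-weighted error (the weight value 100 is illustrative, not from the notes):
$\large \mathrm{Error} = \frac{1}{\sum_i w^{(i)}} \sum\limits_i w^{(i)} \, \mathbb{1}\{\hat{y}^{(i)} \neq y^{(i)}\}, \quad w^{(i)} = \begin{cases} 1 & \text{if } x^{(i)} \text{ is normal} \\ 100 & \text{if } x^{(i)} \text{ is pornographic} \end{cases}$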
⚠️Note:
- Even if you cannot define the perfect evaluation metric and dev/test set, just set something up quickly, implement a quick-and-dirty algorithm, and iterate. Optimize later.
- But DON’T go too long without an evaluation metric and dev set; it will slow down the iteration on improving the algorithm.
The orthogonal mindset behind the above: first define the data sets and metrics, then optimize.
6. Human–Level Performance
1) Definitions
- Human–level performance: usually the best performance humans (e.g., a group of professionals) can achieve, especially when used as a proxy for the Bayes error. But it also depends on the goal.
- Bayes (optimal) error: the best possible error; no algorithm can do better.
2) Why HL?
- Good enough for many tasks
- Usually not much room for improvement between HL error and Bayes error.
- As long as the algorithm is worse than HL performance, there are many tools to improve it, e.g.
- get labeled data from humans
- gain insight from manual error analysis
- better analysis of bias/variance
7. Avoidable bias
$\large \mathrm{Bayes\ error} \leftrightsquigarrow \underbrace{ \mathrm{HL\ error} \longleftrightarrow}_\text{Avoidable bias} \mathrm{\ Training\ error}\underbrace{\ \longleftrightarrow \mathrm{Dev\ error}}_\text{Variance}$
Compare avoidable bias and variance; usually, dealing with the larger one is more efficient.
⚠️Note: Even though it is better to prioritize larger sources of error, all else being equal, that is not the only consideration. Another important one is how difficult and costly each fix would be.
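A minimal sketch of this diagnosis (the error numbers in the example are made up):

```python
def prioritize(hl_error, train_error, dev_error):
    """Estimate avoidable bias and variance; suggest what to tackle first."""
    avoidable_bias = train_error - hl_error  # HL error as a proxy for Bayes error
    variance = dev_error - train_error
    focus = "avoidable bias" if avoidable_bias > variance else "variance"
    return avoidable_bias, variance, focus

# E.g., HL 1%, training 8%, dev 10%: bias 7% >> variance 2%, so reduce bias first.
print(prioritize(0.01, 0.08, 0.10))
```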
8. Surpassing HL performance
After surpassing HL performance, it is hard to tell whether the algorithm is overfitting the training set or genuinely getting closer to the Bayes error.
Algorithms that can surpass HL performance usually:
- learn from structured data
- tackle problems that are not natural perception problems
- learn from huge amounts of data
- (exceptions: some speech recognition, image recognition, and radiology tasks have also surpassed HL performance)
Summary
Two fundamental assumptions of supervised learning
- One can fit the training set pretty well. (avoidable bias)
- The training set performance generalizes pretty well to the dev/test set. (variance)
To reduce avoidable bias:
- Train bigger model
- Train longer/better optimization algorithms
- NN architecture/hyper–parameter search
To reduce variance:
- More data
- Regularization
- NN architecture/hyper–parameter search
Week 2. Error Analysis, Mismatched Data Sets, and Multiple Tasks
1. Error analysis procedures
- Manually iterate over and inspect some (e.g., hundreds of) misclassified dev set examples.
- Create a table to track and evaluate multiple ideas in parallel, e.g., for a cat image classifier:
| Image ID | Dog | Great Cats | Blurry | Comments |
| --- | --- | --- | --- | --- |
| 1 | $\surd$ | | | Pitbull |
| 2 | | $\surd$ | | |
| 3 | | $\surd$ | $\surd$ | Rainy |
| $\dots$ | | | | |
| % of total | 8% | 43% | 61% | |

- Deal with the most significant problem first.
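A small sketch of the tallying step, assuming each inspected example has been hand-tagged with its error categories (the tags here are made up):

```python
from collections import Counter

# Hypothetical tags from manually inspecting misclassified dev examples;
# one example can fall into several categories at once.
tags = [{"dog"}, {"great cat"}, {"great cat", "blurry"}, {"blurry"}]

counts = Counter(cat for example in tags for cat in example)
for cat, n in counts.most_common():  # biggest category first
    print(f"{cat}: {100 * n / len(tags):.0f}% of inspected errors")
```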
2. Cleaning up wrongly labeled data
Training set: so long as the error is near random, it is OK to just leave it there. DL algorithms are quite robust to random errors in the training set, but are less robust to systematic errors.
Dev/test set: add an “Incorrectly labeled” column to the above table during error analysis. Then see if it makes a significant difference to the results. If not, no need to bother; if it does:
- apply the same process to the dev and test sets to make sure they continue to come from the same distribution
- consider examining examples the algorithm got right as well as ones it got wrong, e.g., a prediction that is right just by luck may become wrong once the label is corrected
⚠️Note: now the train and dev/test set may come from slightly different distributions.
3. Build the first system quickly, then iterate
- Quickly set up a dev/test set and metric.
- Build initial system quickly.
- Use bias/variance analysis & error analysis to prioritize next steps.
4. Mismatched training and dev/test sets
1) How to divide the data
Especially when there is a large amount of data with distribution A while the target data have distribution B:
- Option 1: shuffle all the data, then divide.
- Option 2: Training set: distribution A (+ part of B); Dev/test set: distribution B only.
The algorithm should learn/be tuned to meet the goal, so it does not make sense to put data with distribution A in the dev/test sets (ruling out option 1). For the training set, however, data with distribution A can still help the learning.
2) Bias & variance analysis
Construct a “training–dev set”: same distribution as training set, but not used for training (can be randomly picked from the training set).
$\large \underbrace{\mathrm{HL\ performance} \longleftrightarrow}_\text{Avoidable bias} \underbrace{\mathrm{Training\ error} \longleftrightarrow}_\text{Variance} \mathrm{Training\text{-}dev\ error} \underbrace{\longleftrightarrow \mathrm{Dev\ error}}_\text{Data mismatch} \underbrace{\longleftrightarrow \mathrm{Test\ error}}_\text{Overfitting to the dev set}$
Sometimes the dev/test error can be better than the training/training–dev error. This may result from the dev/test distribution being easier for the model.
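A minimal sketch of carving out a training–dev set, assuming the training data sits in NumPy arrays (the shapes and the 10% fraction are arbitrary):

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Stand-in training data from distribution A (shapes are illustrative).
X_train = np.random.randn(10_000, 20)
y_train = np.random.randint(0, 2, size=10_000)

# Hold out 10% that the model will NOT be trained on: its error isolates
# variance (training vs. training-dev) from data mismatch (training-dev vs. dev).
X_fit, X_train_dev, y_fit, y_train_dev = train_test_split(
    X_train, y_train, test_size=0.1, random_state=0)
```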
A general formulation for the analysis: construct a table of the different errors vs. the different data distributions, e.g., for a rearview-mirror speech recognition application:
| | General speech data | Rearview mirror speech data |
| --- | --- | --- |
| HL performance | (HL error) 4% | (HL error) 6% |
| Error on examples trained on | (Training error) 7% | (Training error) 7%$^\dagger$ |
| Error on examples NOT trained on | (Training–dev error) 10% | (Dev/test error) 6% |
$\dagger$: if part of the data with distribution B is added to the training set
3) Addressing data mismatch
- Carry out manual error analysis to try to understand the differences between the training and dev/test sets.
- Make the training data more similar, or collect more data similar to the dev/test sets:
  - Artificial data synthesis (see the sketch below), e.g., adding car noise to speech to simulate the driving environment. However, if the car-noise recording is much shorter than the speech and has to be reused repeatedly, the algorithm may overfit to the noise.
  - Drawback: one can only synthesize a subset of all the situations, which may lead to overfitting.
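A sketch of the car-noise example, assuming speech and noise are NumPy arrays sampled at the same rate (the function name and mixing level are illustrative). Random offsets help avoid reusing the identical noise slice every time:

```python
import numpy as np

def add_car_noise(speech: np.ndarray, noise: np.ndarray,
                  level: float = 0.3) -> np.ndarray:
    """Mix a randomly offset slice of car noise into a speech clip.

    Assumes the noise recording is longer than the speech; otherwise the
    noise must be looped, which risks the overfitting described above.
    """
    rng = np.random.default_rng()
    start = rng.integers(0, len(noise) - len(speech) + 1)
    return speech + level * noise[start:start + len(speech)]
```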
5. Transfer learning
Definition: Train an NN for task A, and then transfer it to a similar task B, e.g. from a cat image classifier to radiology applications.
Method: Train the parameters on task A as usual, then replace the output layer (or the last few layers) and retrain the NN on the new dataset $(X, Y)$ for task B. If the new dataset is large enough, the entire NN can be retrained; the initial training is then called “pre–training”, and the later procedure “fine–tuning”.
Basic idea: some low-level features of the two datasets are the same, so pre-training helps with training the parameters and accelerates learning on task B.
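A minimal PyTorch sketch of the procedure (layer sizes and class counts are hypothetical):

```python
import torch
import torch.nn as nn

# A small network assumed already trained on task A ("pre-training").
model = nn.Sequential(
    nn.Linear(256, 128), nn.ReLU(),
    nn.Linear(128, 64), nn.ReLU(),
    nn.Linear(64, 10),  # task-A output head
)

# Transfer to task B: swap the output head (say, 3 classes).
model[-1] = nn.Linear(64, 3)

# Little task-B data: freeze the transferred layers, train only the new head.
for p in model[:-1].parameters():
    p.requires_grad = False
optimizer = torch.optim.Adam(
    (p for p in model.parameters() if p.requires_grad), lr=1e-3)
# Plenty of task-B data: leave everything trainable and retrain the whole
# network instead ("fine-tuning").
```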
When to use:
- Task A and B have the same input type (e.g., both are images)
- There is a lot more data for task A than for task B
- Low-level features from A could be helpful for learning B
6. Multi–task learning
Definition: learn multiple tasks at the same time; one example can have multiple labels.
$\large Y=\begin{bmatrix} | & | & \dots & | \\ y^{(1)} & y^{(2)} & \dots & y^{(m)}\\ | & | & \dots & | \end{bmatrix}_{N_{Tasks} \times m}$
Each example’s prediction $\hat{y}^{(i)}$ has one entry per task; the cost function sums a per-task loss $\mathrm{L}$ over all tasks:
$\large \mathrm{J} = \frac{1}{m} \sum \limits^m_{i=1} \sum \limits^{N_{Tasks}}_{j=1} \mathrm{L}(\hat{y}_j^{(i)}, y_j^{(i)})$
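A minimal PyTorch sketch of this cost, treating each task as an independent sigmoid output (shapes are made up; note PyTorch lays tensors out as $m \times N_{Tasks}$ rather than the $N_{Tasks} \times m$ layout of $Y$ above):

```python
import torch
import torch.nn as nn

m, n_tasks = 8, 4
logits = torch.randn(m, n_tasks)               # one network output per task
y = torch.randint(0, 2, (m, n_tasks)).float()  # one binary label per task

# reduction="sum" gives sum_i sum_j L(y_hat_j, y_j); dividing by m matches J.
loss_fn = nn.BCEWithLogitsLoss(reduction="sum")
J = loss_fn(logits, y) / m
print(J)
```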
When to use:
- Training on a set of tasks that could benefit from having shared lower–level features.
- Usually: amount of data you have for each task is quite similar
- Can train a big enough NN to do well on all the tasks (if the NN is not big enough, multi–task learning hurts performance compared with separate tasks)
Multi–task learning is used much less often than transfer learning.
7. End–to–end deep learning
Definition: use a single NN to replace multiple stages of processing for a task.
- Pros:
- Let the data speak
- Less hand–designing of components needed
- Cons:
- May need large amount of data
- Excludes potentially useful hand–designed components
Applying ETE learning:
Key question: is there enough data to learn a function of the complexity needed to map x to y?
ETE learning requires a huge amount of data. With a small dataset, it is better to learn with separate stages, e.g., speech recognition as audio $\rightarrow$ features $\rightarrow$ phonemes $\rightarrow$ words $\rightarrow$ transcript rather than audio $\rightarrow$ transcript directly.