Week 1. Strategies to Hit the Goal
1. Orthogonalization
Definition: Orthogonalization is a system design property that ensures that modification of an instruction or an algorithm component does not create or propagate side effects to other system components. Orthogonalization makes it easier to independently verify the algorithms, thus reducing the time required for testing and development.
2. Chain of assumptions in ML
- Fit training set well on cost function (human–level performance)
- Fit dev set well on cost function
- Fit test set well on cost function
- Performs well in real world
3. Single (real) number evaluation metric
1) Common evaluation metrics
Basic idea: find ONE value that can evaluate the performance of the model(s). It can be a defined number or the result from the combination of some values.
The metric helps to analyze and tune the model. It should follow the goal.
Some common terms:
Positive: predicted to be True; Negative: predicted to be False.
True: the prediction agrees with reality; False: it does not.
- Precision (Positive predictive value): the fraction of true positives among all the predicted positives.
$\large \mathrm{Precision} = \frac{\mathrm{True\ Positives}}{\mathrm{Number\ of\ Predicted\ Positives}} = \frac{\mathrm{True\ Positives}}{\mathrm{True\ Positives + False\ Positives}}$
- Recall (Sensitivity): the fraction of true positives among all the actual positives.
$\large \mathrm{Recall} = \frac{\mathrm{True\ Positives}}{\mathrm{Number\ of\ Actual\ Positives}} = \frac{\mathrm{True\ Positives}}{\mathrm{True\ Positives + False\ Negatives}}$
- F1 score: a tradeoff between precision ($P$) and recall ($R$):
$\large \mathrm{F1\ score} = 2 \times \frac{PR}{P+R}$
- Accuracy: the fraction of correctly predicted cases among the entire sample.
$\large \mathrm{Accuracy} = \frac{\mathrm{True\ Positives + True\ Negatives}}{\mathrm{Total\ Population}}$
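As a concrete check on the definitions above, a minimal Python sketch computing all four metrics from raw confusion-matrix counts (the function name and numbers are illustrative):

```python
def evaluation_metrics(tp, fp, tn, fn):
    """Four common metrics from confusion-matrix counts."""
    precision = tp / (tp + fp)                  # among predicted positives
    recall = tp / (tp + fn)                     # among actual positives
    f1 = 2 * precision * recall / (precision + recall)
    accuracy = (tp + tn) / (tp + fp + tn + fn)
    return precision, recall, f1, accuracy

# 40 TP, 10 FP, 45 TN, 5 FN out of 100 examples:
print(evaluation_metrics(40, 10, 45, 5))        # (0.8, ~0.889, ~0.842, 0.85)
```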
2) Satisficing and Optimizing Metrics
- Satisficing metric: requirements that have to be met (but there may be little value in doing better beyond the threshold)
- Optimizing metric: the aspect you want to do as well as possible
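A toy sketch of how the two kinds of metrics combine when picking a model: filter by the satisficing constraint, then maximize the optimizing metric (all names, numbers, and the 100 ms threshold are made up):

```python
# Hypothetical candidates: (name, accuracy, runtime in ms).
models = [("A", 0.90, 80), ("B", 0.92, 95), ("C", 0.95, 1500)]

MAX_RUNTIME_MS = 100  # satisficing metric: must run within 100 ms
feasible = [m for m in models if m[2] <= MAX_RUNTIME_MS]
best = max(feasible, key=lambda m: m[1])  # optimizing metric: accuracy
print(best)  # ('B', 0.92, 95): C is more accurate but violates the constraint
```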
4. Train, Dev, Test set
1) Distributions
- Dev and test sets should always have the same distribution.
- Dev and test sets should reflect the data that you expect to get/apply to in the future and consider important to do well on.
2) Sizes
Rule of thumb for selecting the sizes of the dev and test sets:
- Dev set: big enough to evaluate different ideas.
- Test set: big enough to give high confidence in the overall performance of the system and to evaluate the final cost; a small test set is likely to show high fluctuations.
As the total data size gets bigger, the training set usually takes a much larger share than the other two (e.g., 98%/1%/1% with millions of examples); the dev set and test set usually have the same size.
5. When to change the metrics and dev/test sets?
Always focus on the goal/application.
- the original metric cannot correctly rank-order preferences between algorithms, e.g., a classifier has higher accuracy but may let through pornographic images $\rightarrow$ adjust by adding weights to certain examples in the cost function (see the weighted-metric sketch after this list).
- the algorithm works well on the metric and dev/test sets, but not in the application, i.e., they do not reflect the data distribution of the application $\rightarrow$ change them.
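For the pornographic-image example, a sketch of such a re-weighted error (the weight value 100 is illustrative, not from the notes):
$\large \mathrm{Error} = \frac{1}{\sum_i w^{(i)}} \sum\limits_i w^{(i)} \, \mathbb{1}\{\hat{y}^{(i)} \neq y^{(i)}\}, \quad w^{(i)} = \begin{cases} 1 & \text{if } x^{(i)} \text{ is normal} \\ 100 & \text{if } x^{(i)} \text{ is pornographic} \end{cases}$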
⚠️Note:
- Even if you cannot define the perfect evaluation metric and dev/test set, just set something up quickly, implement a quick-and-dirty algorithm, and iterate. Optimize later.
- But DON’T go too long without an evaluation metric and dev set; it will slow down the iteration on improving the algorithm.
The orthogonal mindset behind the above: first define the data sets and metrics, then optimize.
6. Human–Level Performance
1) Definitions
- Human–level performance: usually the best performance humans (e.g., a group of professionals) can achieve, especially when used as a proxy for the Bayes error. But it also depends on the goal.
- Bayes (optimal) error: the best possible error; no algorithm can do better.
2) Why HL?
- Good enough for many tasks
- Usually not much room for improvement between HL error and Bayes error.
- As long as the algorithm is worse than HL performance, there are many tools to improve it, e.g.
- get labeled data from humans
- gain insight from manual error analysis
- better analysis of bias/variance
7. Avoidable bias
$\large \mathrm{Bayes\ error} \leftrightsquigarrow \underbrace{ \mathrm{HL\ error} \longleftrightarrow}_\text{Avoidable bias} \mathrm{\ Training\ error}\underbrace{\ \longleftrightarrow \mathrm{Dev\ error}}_\text{Variance}$
Compare avoidable bias and variance; usually, dealing with the larger one is more efficient.
⚠️Note: Even though it is better to prioritize larger sources of error, all else being equal, that is not the only consideration. Another important one is how difficult and costly each fix would be.
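A minimal sketch of this diagnosis (the error numbers in the example are made up):

```python
def prioritize(hl_error, train_error, dev_error):
    """Estimate avoidable bias and variance; suggest what to tackle first."""
    avoidable_bias = train_error - hl_error  # HL error as a proxy for Bayes error
    variance = dev_error - train_error
    focus = "avoidable bias" if avoidable_bias > variance else "variance"
    return avoidable_bias, variance, focus

# E.g., HL 1%, training 8%, dev 10%: bias 7% >> variance 2%, so reduce bias first.
print(prioritize(0.01, 0.08, 0.10))
```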
8. Surpassing HL performance
After surpassing HL performance, it is hard to tell whether the algorithm is overfitting the training set or genuinely getting closer to the Bayes error.
Algorithms that can surpass HL performance usually:
- learn from structured data
- tackle problems that are not natural perception problems
- learn from huge amounts of data
- (exceptions: some speech recognition, image recognition, and radiology tasks have also surpassed HL performance)
Summary
Two fundamental assumptions of supervised learning
- One can fit the training set pretty well. (avoidable bias)
- The training set performance generalizes pretty well to the dev/test set. (variance)
To reduce avoidable bias:
- Train bigger model
- Train longer/better optimization algorithms
- NN architecture/hyper–parameter search
To reduce variance:
- More data
- Regularization
- NN architecture/hyper–parameter search
Week 2. Error Analysis, Mismatched Data Sets, and Multiple Tasks
1. Error analysis procedures
- Manually iterate over and inspect some (e.g., hundreds of) misclassified dev set examples.
- Create a table to track and evaluate multiple ideas in parallel, e.g., for a cat image classifier:
| Image ID | Dog | Great Cats | Blurry | Comments |
| --- | --- | --- | --- | --- |
| 1 | $\surd$ | | | Pitbull |
| 2 | | $\surd$ | | |
| 3 | | $\surd$ | $\surd$ | Rainy |
| $\dots$ | | | | |
| % of total | 8% | 43% | 61% | |

- Deal with the most significant problem first.
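A small sketch of the tallying step, assuming each inspected example has been hand-tagged with its error categories (the tags here are made up):

```python
from collections import Counter

# Hypothetical tags from manually inspecting misclassified dev examples;
# one example can fall into several categories at once.
tags = [{"dog"}, {"great cat"}, {"great cat", "blurry"}, {"blurry"}]

counts = Counter(cat for example in tags for cat in example)
for cat, n in counts.most_common():  # biggest category first
    print(f"{cat}: {100 * n / len(tags):.0f}% of inspected errors")
```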
2. Cleaning up wrongly labeled data
Training set: so long as the error is near random, it is OK to just leave it there. DL algorithms are quite robust to random errors in the training set, but are less robust to systematic errors.
Dev/test set: add an “Incorrectly labeled” column to the above table during error analysis. Then see if it makes a significant difference to the results. If not, no need to bother; if it does:
- apply the same process to the dev and test sets to make sure they continue to come from the same distribution
- consider examining examples the algorithm got right as well as ones it got wrong, e.g., a prediction that is right just by luck may become wrong once the label is corrected
⚠️Note: now the train and dev/test set may come from slightly different distributions.
3. Build the first system quickly, then iterate
- Quickly set up a dev/test set and metric.
- Build initial system quickly.
- Use bias/variance analysis & error analysis to prioritize next steps.
4. Mismatched training and dev/test sets
1) How to divide the data
Especially when there is a large amount of data with distribution A while the target data have distribution B:
- Option 1: shuffle all the data, then divide.
- Option 2: Training set: distribution A (+ part of B); Dev/test set: distribution B only.
The algorithm should learn/be tuned to meet the goal, so it does not make sense to put data with distribution A in the dev/test sets (ruling out option 1). For the training set, however, data with distribution A can still help the learning.
2) Bias & variance analysis
Construct a “training–dev set”: same distribution as training set, but not used for training (can be randomly picked from the training set).
$\large \underbrace{\mathrm{HL\ performance} \longleftrightarrow}_\text{Avoidable bias} \underbrace{\mathrm{Training\ error} \longleftrightarrow}_\text{Variance} \mathrm{Training\text{-}dev\ error} \underbrace{\longleftrightarrow \mathrm{Dev\ error}}_\text{Data mismatch} \underbrace{\longleftrightarrow \mathrm{Test\ error}}_\text{Overfitting to the dev set}$
Sometimes the dev/test error can be better than the training/training–dev error. This may result from the dev/test distribution being easier for the model.
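A minimal sketch of carving out a training–dev set, assuming the training data sits in NumPy arrays (the shapes and the 10% fraction are arbitrary):

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Stand-in training data from distribution A (shapes are illustrative).
X_train = np.random.randn(10_000, 20)
y_train = np.random.randint(0, 2, size=10_000)

# Hold out 10% that the model will NOT be trained on: its error isolates
# variance (training vs. training-dev) from data mismatch (training-dev vs. dev).
X_fit, X_train_dev, y_fit, y_train_dev = train_test_split(
    X_train, y_train, test_size=0.1, random_state=0)
```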
A general formulation for the analysis: construct a table of the different errors vs. the different data distributions, e.g., for a rearview-mirror speech recognition application:
| | General speech data | Rearview mirror speech data |
| --- | --- | --- |
| HL performance | (HL error) 4% | (HL error) 6% |
| Error on examples trained on | (Training error) 7% | (Training error) 7%$^\dagger$ |
| Error on examples NOT trained on | (Training–dev error) 10% | (Dev/test error) 6% |
$\dagger$: if part of the data with distribution B is added to the training set
3) Addressing data mismatch
- Carry out manual error analysis to try to understand the differences between the training and dev/test sets.
- Make the training data more similar, or collect more data similar to the dev/test sets:
  - Artificial data synthesis (see the sketch below), e.g., adding car noise to speech to simulate the driving environment. However, if the car-noise recording is much shorter than the speech and has to be reused repeatedly, the algorithm may overfit to the noise.
  - Drawback: one can only synthesize a subset of all the situations, which may lead to overfitting.
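A sketch of the car-noise example, assuming speech and noise are NumPy arrays sampled at the same rate (the function name and mixing level are illustrative). Random offsets help avoid reusing the identical noise slice every time:

```python
import numpy as np

def add_car_noise(speech: np.ndarray, noise: np.ndarray,
                  level: float = 0.3) -> np.ndarray:
    """Mix a randomly offset slice of car noise into a speech clip.

    Assumes the noise recording is longer than the speech; otherwise the
    noise must be looped, which risks the overfitting described above.
    """
    rng = np.random.default_rng()
    start = rng.integers(0, len(noise) - len(speech) + 1)
    return speech + level * noise[start:start + len(speech)]
```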
5. Transfer learning
Definition: Train an NN for task A, and then transfer it to a similar task B, e.g. from a cat image classifier to radiology applications.
Method: Train the parameters on task A as usual, then replace the output layer (or the last few layers) and retrain the NN on the new dataset $(X, Y)$ for task B. If the new dataset is large enough, the entire NN can be retrained; the initial training is then called “pre–training”, and the later procedure “fine–tuning”.
Basic idea: some low-level features of the two datasets are the same, so pre-training helps with training the parameters and accelerates learning on task B.
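A minimal PyTorch sketch of the procedure (layer sizes and class counts are hypothetical):

```python
import torch
import torch.nn as nn

# A small network assumed already trained on task A ("pre-training").
model = nn.Sequential(
    nn.Linear(256, 128), nn.ReLU(),
    nn.Linear(128, 64), nn.ReLU(),
    nn.Linear(64, 10),  # task-A output head
)

# Transfer to task B: swap the output head (say, 3 classes).
model[-1] = nn.Linear(64, 3)

# Little task-B data: freeze the transferred layers, train only the new head.
for p in model[:-1].parameters():
    p.requires_grad = False
optimizer = torch.optim.Adam(
    (p for p in model.parameters() if p.requires_grad), lr=1e-3)
# Plenty of task-B data: leave everything trainable and retrain the whole
# network instead ("fine-tuning").
```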
When to use:
- Task A and B have the same input type (e.g., both are images)
- There is a lot more data for task A than for task B
- Low-level features from A could be helpful for learning B
6. Multi–task learning
Definition: learn multiple tasks at the same time; one example can have multiple labels.
$\large Y=\begin{bmatrix} | & | & \dots & | \\ y^{(1)} & y^{(2)} & \dots & y^{(m)}\\ | & | & \dots & | \end{bmatrix}_{N_{Tasks} \times m}$
Each example’s prediction $\hat{y}^{(i)}$ has one entry per task; the cost function sums a per-task loss $\mathrm{L}$ over all tasks:
$\large \mathrm{J} = \frac{1}{m} \sum \limits^m_{i=1} \sum \limits^{N_{Tasks}}_{j=1} \mathrm{L}(\hat{y}_j^{(i)}, y_j^{(i)})$
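A minimal PyTorch sketch of this cost, treating each task as an independent sigmoid output (shapes are made up; note PyTorch lays tensors out as $m \times N_{Tasks}$ rather than the $N_{Tasks} \times m$ layout of $Y$ above):

```python
import torch
import torch.nn as nn

m, n_tasks = 8, 4
logits = torch.randn(m, n_tasks)               # one network output per task
y = torch.randint(0, 2, (m, n_tasks)).float()  # one binary label per task

# reduction="sum" gives sum_i sum_j L(y_hat_j, y_j); dividing by m matches J.
loss_fn = nn.BCEWithLogitsLoss(reduction="sum")
J = loss_fn(logits, y) / m
print(J)
```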
When to use:
- Training on a set of tasks that could benefit from having shared lower–level features.
- Usually: amount of data you have for each task is quite similar
- Can train a big enough NN to do well on all the tasks (if the NN is not big enough, multi–task learning hurts performance compared with separate tasks)
Multi–task learning is used much less often than transfer learning.
7. End–to–end deep learning
Definition: use a single NN to replace multiple stages of processing for a task.
- Pros:
- Let the data speak
- Less hand–designing of components needed
- Cons:
- May need large amount of data
- Excludes potentially useful hand–designed components
Applying ETE learning:
Key question: is there enough data to learn a function of the complexity needed to map x to y?
ETE learning requires a huge amount of data. With a small dataset, it is better to learn with separate stages, e.g., speech recognition as audio $\rightarrow$ features $\rightarrow$ phonemes $\rightarrow$ words $\rightarrow$ transcript rather than audio $\rightarrow$ transcript directly.