Error Analysis
General
Look at misclassified examples and check:
- Which class they actually belong to
- Which features would have helped to recognize that class
Identify difficult examples (a minimal inspection loop is sketched below).
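A minimal sketch of such an inspection; the synthetic data and logistic regression are stand-ins for whatever classifier is actually being debugged:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Synthetic stand-in data (assumption: any fitted classifier works the same way).
X, y = make_classification(n_samples=1000, n_features=5, random_state=0)
X_train, X_val, y_train, y_val = train_test_split(X, y, random_state=0)

clf = LogisticRegression(max_iter=1000).fit(X_train, y_train)
y_pred = clf.predict(X_val)

wrong = np.where(y_pred != y_val)[0]                # indices of misclassified examples
for i in wrong[:10]:                                # inspect a manageable subset
    print(f"val example {i}: true={y_val[i]}, predicted={y_pred[i]}")
    print("  features:", np.round(X_val[i], 2))     # which features might have helped?
```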
Skewed classes
Problem:
99% accuracy, but only 0.5% of the cases are actually true. Always predicting false would yield even better accuracy.
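A tiny sketch (with made-up data at roughly the 0.5% positive rate mentioned above) showing how a useless always-false classifier still reaches about 99.5% accuracy:

```python
import numpy as np

rng = np.random.default_rng(0)
y_true = (rng.random(10_000) < 0.005).astype(int)     # ~0.5% of the labels are 1 ("true")

y_always_false = np.zeros_like(y_true)                # classifier that always predicts 0
accuracy = (y_always_false == y_true).mean()
print(f"accuracy of 'always false': {accuracy:.3%}")  # ~99.5%, despite being useless
```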
Alternative evaluation metric:
Precision/Recall
y = 1 if true
|             | Actual class 1 | Actual class 0 |
|-------------|----------------|----------------|
| Predicted 1 | True positive  | False positive |
| Predicted 0 | False negative | True negative  |
Precision: of all examples predicted as positive, what fraction is actually positive?
$\frac{TP}{TP + FP}$
Recall: of all actually positive examples (e.g. all patients who have the disease), what fraction was correctly identified as true?
$\frac{TP}{TP + FN}$
Always predicting y = 0 ⇒ recall would be 0.
Precision and Recall Tradeoff
Suppose we predict 1 if $h_\theta(x) \geq 0.7$ (instead of 0.5) and 0 if $h_\theta(x) < 0.7$: higher precision, lower recall.
And vice versa if we already predict 1 when $h_\theta(x) \geq 0.3$: lower precision, higher recall.
$F_1$ Score
A single number for comparing precision/recall trade-offs.
The plain average $(P+R)/2$ is not a good metric, since a classifier that always predicts 1 or always predicts 0 can still reach a decent average.
$F_1$ score: $2 \cdot \frac{P \cdot R}{P + R}$
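A minimal sketch (with made-up probabilities and labels) computing precision, recall, and $F_1$, and showing how moving the threshold trades one off against the other:

```python
import numpy as np

def precision_recall_f1(y_true, y_pred):
    tp = np.sum((y_pred == 1) & (y_true == 1))
    fp = np.sum((y_pred == 1) & (y_true == 0))
    fn = np.sum((y_pred == 0) & (y_true == 1))
    precision = tp / (tp + fp) if tp + fp > 0 else 0.0
    recall = tp / (tp + fn) if tp + fn > 0 else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall > 0 else 0.0
    return precision, recall, f1

# Hypothetical predicted probabilities h_theta(x) and true labels.
probs  = np.array([0.95, 0.80, 0.65, 0.40, 0.35, 0.20, 0.10, 0.05])
y_true = np.array([1,    1,    0,    1,    0,    0,    1,    0   ])

for threshold in (0.3, 0.5, 0.7):                 # higher threshold: precision up, recall down
    y_pred = (probs >= threshold).astype(int)
    p, r, f1 = precision_recall_f1(y_true, y_pred)
    print(f"threshold {threshold}: precision={p:.2f} recall={r:.2f} F1={f1:.2f}")
```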
Bias / Variance
- High Bias (underfit): High train and validation error (similar level, e.g. error of train: 15% | val: 16%)
- High Variance (overfit): Low train, high validation error (e.g. error of train: 1% | val: 11%)
- High Bias and High Variance: High train error and significantly higher validation error (e.g. error of train: 15% | val: 30%)
Plot: error vs. degree of polynomial (with training and cross-validation error)
Regularization
- High $\lambda$: underfit
- Low $\lambda$: overfit
Strategy: increase the regularization parameter stepwise (e.g. doubling it each time) and check which value leads to the lowest CV error; then evaluate that choice on the test set (a sketch follows below).
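A sketch of this doubling strategy; Ridge regression and the synthetic data are illustrative assumptions, not part of the original notes:

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

X, y = make_regression(n_samples=200, n_features=20, noise=10.0, random_state=0)
X_train, X_rest, y_train, y_rest = train_test_split(X, y, test_size=0.4, random_state=0)
X_cv, X_test, y_cv, y_test = train_test_split(X_rest, y_rest, test_size=0.5, random_state=0)

lambdas = [0.01 * 2**k for k in range(12)]        # 0.01, 0.02, 0.04, ... (stepwise doubling)
cv_errors = []
for lam in lambdas:
    model = Ridge(alpha=lam).fit(X_train, y_train)
    cv_errors.append(mean_squared_error(y_cv, model.predict(X_cv)))

best_lambda = lambdas[int(np.argmin(cv_errors))]  # pick lambda with lowest CV error ...
best_model = Ridge(alpha=best_lambda).fit(X_train, y_train)
print("best lambda (by CV error):", best_lambda)
print("test error:", mean_squared_error(y_test, best_model.predict(X_test)))  # ... then check on test set
```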
Learning Curve
Plot: error vs. m (training set size)
- Training error increases as the number of training examples grows.
- Cross-validation error decreases as the number of training examples grows.
High bias:
- Training error increases with more training examples and ends up very close to the validation error.
- Validation error decreases with more training examples but plateaus quickly.
- Overall higher error level.
If the model suffers from high bias, more training data usually does not help.
High variance:
- Training error increases with more training examples, but stays rather low.
- Validation error decreases with more training examples.
- Gap between training and cross-validation error.
If the model suffers from high variance, more training data usually does help (see the learning curve sketch below).
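A sketch of plotting such a learning curve with scikit-learn's `learning_curve`; the model and data are placeholders:

```python
import matplotlib.pyplot as plt
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import learning_curve

X, y = make_classification(n_samples=2000, n_features=20, random_state=0)

# Training/CV scores for increasing training set sizes m.
sizes, train_scores, cv_scores = learning_curve(
    LogisticRegression(max_iter=1000), X, y,
    train_sizes=np.linspace(0.1, 1.0, 8), cv=5)

train_err = 1 - train_scores.mean(axis=1)          # error = 1 - accuracy
cv_err = 1 - cv_scores.mean(axis=1)

plt.plot(sizes, train_err, label="training error")
plt.plot(sizes, cv_err, label="cross-validation error")
plt.xlabel("m (training set size)")
plt.ylabel("error")
plt.legend()
plt.show()   # small gap at a high error level -> high bias; large gap -> high variance
```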
Basic recipe for ML
- High Bias:
- Additional features
- Additional polynomial features
- Decrease Lambda (regularization parameter)
- High Variance:
- More data
- Smaller number of features
- Increase Lambda (regularization parameter)
Basic recipe for training NNs
Recommended order:
- High bias (look at train set performance):
- Bigger network (more hidden layers / units)
- Train longer
- Advanced optimization algorithms
- Better NN architecture
- High variance (look at dev set performance):
- More data (won't help for high bias problems)
- Regularization
- Better NN architecture
A bigger network almost always reduces bias and more data reduces variance, so there is not necessarily a tradeoff between the two.
Working on the most promising problems
What would the best-case performance be if a given category of errors were completely eliminated?
E.g. take 100 misclassified dev set examples and count how many are dog images (when training a cat classifier). If it is 50%, working on the dog problem could be worthwhile (an error currently at 10% could drop to 5% at best).
Evaluate multiple ideas in parallel:
- Fix false positives
- Fix false negatives
- Improve performance on blurry images
Create a spreadsheet: image / problem category.
Result: calculate the percentage per problem category (the potential improvement “ceiling”); a small sketch follows below.
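A minimal pandas sketch of such a spreadsheet and the resulting per-category ceilings (image names, categories, and counts are made up; the “incorrectly labeled” column used in the next section fits into the same table):

```python
import pandas as pd

# Made-up error analysis table: one row per misclassified dev set example.
errors = pd.DataFrame({
    "image": ["img_017", "img_042", "img_108", "img_230", "img_311"],
    "dog": [True, False, True, False, False],
    "blurry": [False, True, True, False, True],
    "incorrectly_labeled": [False, False, False, True, False],
})

# Percentage of errors per problem category = potential improvement "ceiling".
ceilings = errors[["dog", "blurry", "incorrectly_labeled"]].mean() * 100
print(ceilings.round(1))
```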
General rule: build your first system quickly, then iterate (set up dev/test sets and a metric, build the system, then do bias/variance and error analysis).
Mislabeled data
DL algorithms: if the percentage of label errors is low and the errors are random, they are fairly robust to mislabeled data.
Add another column “incorrectly labeled” to the error analysis spreadsheet.
Principles when fixing labels:
- Apply same process to dev and test set (same distribution)
- Also look at examples the algorithm got right (not only the ones it got wrong)
- Training and dev/test data may end up coming from slightly different distributions (not a problem if the difference is small)
Mismatched train and dev/test set
- 200,000 high-quality pictures
- 10,000 low-quality, blurry pictures
- Option 1: Combine all images and randomly shuffle them into train/dev/test sets
- Advantage: same distribution everywhere
- Disadvantage: most images come from the high-quality set, so most of the time is spent optimizing for high-quality pictures
- Option 2:
- Train set: 205,000 images (all 200,000 high-quality plus 5,000 low-quality); dev & test: 2,500 low-quality images each
- Advantage: optimizing on the right data
- Disadvantage: the training distribution differs from the dev and test distribution (see the split sketch below)
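A sketch of how the Option 2 split could be constructed (the arrays hold placeholder IDs instead of real images):

```python
import numpy as np

rng = np.random.default_rng(0)
high_qual = np.arange(200_000)                   # placeholder IDs for 200,000 high-quality images
low_qual = np.arange(200_000, 210_000)           # placeholder IDs for 10,000 low-quality images

low_shuffled = rng.permutation(low_qual)
train = np.concatenate([high_qual, low_shuffled[:5_000]])   # 205,000: all high-qual + 5,000 low-qual
dev = low_shuffled[5_000:7_500]                             # 2,500 low-quality only
test = low_shuffled[7_500:]                                 # 2,500 low-quality only
print(len(train), len(dev), len(test))           # 205000 2500 2500
```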
Problems with different train and dev/test distributions
It is not always a good idea to use different distributions for train and dev:
- Human error: ~0%
- Train error: 1%
- Dev error: 10%
Training-dev set: same distribution as training set, but not used for training
- Train: 1%
- Train-dev: 9%
- Dev: 10%
A large gap between train and train-dev error ⇒ variance problem.
If train and train-dev error were close to each other (with dev error still much higher) ⇒ data mismatch problem.
Summary (each label describes the gap between the adjacent error levels; a worked computation follows below):
- Human level: 4%
- ⇒ avoidable bias
- Train: 7%
- ⇒ variance
- Train-dev: 10%
- ⇒ data mismatch
- Dev: 12%
- ⇒ degree of overfitting to the dev set (if too high ⇒ use a bigger dev set)
- Test: 12%
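The same summary as a small worked computation of the gaps (numbers taken from the list above):

```python
errors = {"human": 0.04, "train": 0.07, "train_dev": 0.10, "dev": 0.12, "test": 0.12}

avoidable_bias = errors["train"] - errors["human"]      # 3%
variance = errors["train_dev"] - errors["train"]        # 3%
data_mismatch = errors["dev"] - errors["train_dev"]     # 2%
dev_overfitting = errors["test"] - errors["dev"]        # 0% -> dev set size seems fine

print(f"avoidable bias:  {avoidable_bias:.0%}")
print(f"variance:        {variance:.0%}")
print(f"data mismatch:   {data_mismatch:.0%}")
print(f"dev overfitting: {dev_overfitting:.0%}")
```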
Data mismatch problems
- Error analysis to understand the differences between the training and dev/test sets
- Make the training data more similar to the dev/test data / collect more data similar to dev/test (e.g. simulate the audio environment)
- Artificial data synthesis (see the sketch below)
- Problem: it is possible to synthesize from too little source data (to a human it might seem fine, but the model can overfit to the small synthesized subset)
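A minimal sketch of artificial data synthesis for the audio example (purely synthetic signals; in practice, recorded speech and noise clips are assumed):

```python
import numpy as np

rng = np.random.default_rng(0)
clean_speech = rng.standard_normal(16_000)       # placeholder for 1 s of recorded speech at 16 kHz
car_noise = 0.3 * rng.standard_normal(16_000)    # placeholder for recorded car noise

synthesized = clean_speech + car_noise           # artificial "speech in a car" training example

# Risk mentioned above: if only a small amount of noise is reused over and over,
# the model may overfit to that specific clip even though it sounds fine to a human.
```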