====== Error Analysis ======

===== General =====

Look at misclassified examples and check:
  * which class they actually belong to
  * which features would have helped to recognize that class
Identify the difficult examples.

====== Skewed classes ======

Problem: 99% accuracy, but only 0.5% of the cases are true. Always predicting false would give even better accuracy.

Use a different evaluation metric: precision/recall. Convention: y = 1 for the rare ("true") class.

^              ^ Actual 1        ^ Actual 0        ^
^ Predicted 1  | True positive   | False positive  |
^ Predicted 0  | False negative  | True negative   |

Precision: of all examples predicted as true, which fraction is actually a true positive? $\frac{TP}{TP + FP}$

Recall: of all actually positive examples (e.g. all patients who really have the disease), which fraction was correctly recognized as true? $\frac{TP}{TP + FN}$

A classifier that always predicts y = 0 would have a recall of 0, so the metric exposes it.

===== Precision and Recall Tradeoff =====

Suppose we predict 1 if $h_\theta(x) \geq 0.7$ (instead of 0.5) and 0 if $h_\theta(x) < 0.7$: higher precision, lower recall.

The opposite happens with a lower threshold, e.g. predict 1 if $h_\theta(x) \geq 0.3$: higher recall, lower precision.

==== $F_1$ Score ====

Used to compare precision/recall pairs with a single number.

The plain average $(P+R)/2$ is not a good measure, because a degenerate classifier that always predicts 1 (or always 0) can still reach a decent average.

$F_1$ score: $2 \cdot \frac{P \cdot R}{P + R}$
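As a concrete illustration of these definitions, here is a minimal Python sketch (the sample data and the function name ''precision_recall_f1'' are made up for illustration) that computes precision, recall, and the $F_1$ score for different decision thresholds:

<code python>
def precision_recall_f1(y_true, y_prob, threshold=0.5):
    """Compute precision, recall and F1 for a given decision threshold."""
    y_pred = [1 if p >= threshold else 0 for p in y_prob]
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    precision = tp / (tp + fp) if tp + fp > 0 else 0.0
    recall = tp / (tp + fn) if tp + fn > 0 else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall > 0 else 0.0
    return precision, recall, f1

# Tiny made-up example: 3 positives out of 10 examples.
y_true = [1, 0, 0, 0, 1, 0, 0, 1, 0, 0]
y_prob = [0.9, 0.2, 0.4, 0.1, 0.6, 0.3, 0.55, 0.7, 0.2, 0.1]

# Raising the threshold trades recall for precision (and vice versa).
for threshold in (0.3, 0.5, 0.7):
    p, r, f1 = precision_recall_f1(y_true, y_prob, threshold)
    print(f"threshold={threshold}: precision={p:.2f}  recall={r:.2f}  F1={f1:.2f}")
</code>

With this sample data, raising the threshold from 0.3 to 0.7 increases precision and decreases recall, matching the tradeoff described above.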
====== Bias / Variance ======

  * High bias (underfit): high train and validation error at a similar level (e.g. train error 15%, validation error 16%)
  * High variance (overfit): low train error, high validation error (e.g. train error 1%, validation error 11%)
  * High bias and high variance: high train error and a significantly higher validation error (e.g. train error 15%, validation error 30%)

Plot: error vs. degree of the polynomial (with training and cross-validation error).

==== Regularization ====

  * High $\lambda$: underfit
  * Low $\lambda$: overfit

Strategy: increase the regularization parameter stepwise (doubling it each time) and check which value leads to the lowest CV error. Then evaluate on the test set.

===== Learning Curve =====

Plot: error vs. m (training set size).

  * Training set error increases with the number of training examples.
  * Cross-validation error decreases with the number of training examples.

**High bias:**
  * Training set error increases with the number of training examples and ends up very close to the cross-validation error.
  * Cross-validation error decreases with the number of training examples but quickly flattens out.
  * Generally a higher error level overall.
If you suffer from high bias, more training data usually does **not** help.

**High variance:**
  * Training set error increases with the number of training examples, but stays rather low.
  * Cross-validation error decreases with the number of training examples.
  * Gap between training and cross-validation error.
If you suffer from high variance, more training data usually does help.

===== Basic recipe for ML =====

  - High bias:
    * Additional features
    * Additional polynomial features
    * Decrease $\lambda$ (regularization parameter)
  - High variance:
    * More data
    * Smaller number of features
    * Increase $\lambda$ (regularization parameter)

===== Basic recipe for training NNs =====

Recommended **order**:
  - High **bias** (look at train set performance):
    * Bigger network (more hidden layers / units)
    * Train longer
    * Advanced optimization algorithms
    * Better NN //architecture//
  - High **variance** (look at dev set performance):
    * More data (won't help for high bias problems)
    * Regularization
    * Better NN //architecture//

A bigger network almost always improves bias and more data improves variance (//not necessarily a tradeoff// between the two).

====== Working on most promising problems ======

Estimate the best-case improvement before investing work: what would performance be if a given error category were fixed completely? E.g. out of 100 misclassified dev set examples, how many are dog images (when training a cat classifier)? If it is 50%, working on the problem could be worthwhile (error could drop from 10% to 5%).

Evaluate multiple ideas in parallel:
  - Fix false positives
  - Fix false negatives
  - Improve performance on blurry images

Create a spreadsheet: one row per image, one column per problem category. Result: the percentage of each problem category, i.e. the potential improvement ("ceiling").

General rule: build your first system quickly, then iterate (set up dev/test sets, build the system, do bias/variance and error analysis).

====== Mislabeled data ======

DL algorithms are robust to label errors as long as the percentage of errors is //low// and the errors are //random//.

Add another column "incorrectly labeled" to the error analysis spreadsheet.

Principles when fixing labels:
  * Apply the same process to the dev and test set (keep them from the same distribution)
  * Also look at examples the algorithm got right (not only the wrong ones)
  * Train and dev/test data may end up coming from slightly different distributions (not a problem if the difference is small)

====== Mismatched train and dev/test set ======

Example:
  * 200,000 high-quality pictures
  * 10,000 low-quality, blurry pictures (the kind the system actually needs to handle)

  * Option 1: combine all images and randomly shuffle them into train/dev/test sets
    * Advantage: same distribution everywhere
    * Disadvantage: most dev/test images come from the high-quality pictures, so most time is spent optimizing for high-quality pictures
  * Option 2: train set: 205,000 images with high and low quality; dev & test set: 2,500 low-quality images each
    * Advantage: optimizing for the right data
    * Disadvantage: train distribution is different from the dev and test set

====== Problems with different train and dev/test set distributions ======

Using a different distribution for train and dev is not always a good idea, because bias/variance analysis becomes ambiguous:
  * Human error: ~0%
  * Train error: 1%
  * Dev error: 10%
Is the gap variance or just the different distribution? Introduce a training-dev set: same distribution as the training set, but not used for training.
  * Train: 1%
  * Train-dev: 9%
  * Dev: 10%
Still a large gap between train and train-dev error => variance problem. If train and train-dev errors were close to each other (and the dev error much higher) => data-mismatch problem.

Summary of the gaps (see the numeric sketch at the end of this page):
  * Human level: 4%
    * gap = avoidable bias
  * Train: 7%
    * gap = variance
  * Train-dev: 10%
    * gap = data mismatch
  * Dev: 12%
    * gap = degree of overfitting to the dev set (if too high => use a bigger dev set)
  * Test: 12%

====== Data mismatch problems ======

  * Error analysis to understand the difference between the training and dev/test sets
  * Make the training data more similar to the dev/test set / collect more data similar to the dev/test set (e.g. simulate the audio environment)
  * Artificial data synthesis (sketch below)
    * Problem: you may be sampling from too small a subset of all possible examples, even if the result seems fine to a human
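Referring back to the summary of error gaps above (human 4%, train 7%, train-dev 10%, dev 12%, test 12%), here is a minimal Python sketch (the function name ''error_gaps'' is made up) that turns such a list of error levels into the named gaps:

<code python>
def error_gaps(human, train, train_dev, dev, test):
    """Break down error levels (in %) into the gaps described above."""
    return {
        "avoidable bias": train - human,
        "variance": train_dev - train,
        "data mismatch": dev - train_dev,
        "overfitting to dev set": test - dev,
    }

# Numbers from the summary: human 4%, train 7%, train-dev 10%, dev 12%, test 12%
for name, gap in error_gaps(4, 7, 10, 12, 12).items():
    print(f"{name}: {gap}%")
</code>

The biggest gap points at the most promising problem to work on: with these numbers, avoidable bias and variance (3% each) dominate over data mismatch (2%) and overfitting to the dev set (0%).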
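For the artificial data synthesis bullet above, a minimal sketch assuming the audio example from the notes: mixing a clean speech clip with background noise at a chosen signal-to-noise ratio. The function and variable names (''synthesize_noisy_speech'', ''car_noise'') are illustrative, not from the original material.

<code python>
import numpy as np

def synthesize_noisy_speech(clean_speech, noise, snr_db=10.0, rng=None):
    """Mix a clean speech clip with a random segment of background noise
    at the requested signal-to-noise ratio (in dB)."""
    rng = np.random.default_rng() if rng is None else rng
    # Pick a random noise segment of the same length as the speech clip
    # (assumes the noise recording is at least as long as the speech clip).
    start = rng.integers(0, len(noise) - len(clean_speech) + 1)
    noise_clip = noise[start:start + len(clean_speech)]
    # Scale the noise so that the mixture has the requested SNR.
    speech_power = np.mean(clean_speech ** 2)
    noise_power = np.mean(noise_clip ** 2) + 1e-12
    scale = np.sqrt(speech_power / (noise_power * 10 ** (snr_db / 10)))
    return clean_speech + scale * noise_clip

# Usage sketch with synthetic stand-in signals (real data would be recorded audio).
t = np.linspace(0, 1, 16000)
clean_speech = np.sin(2 * np.pi * 440 * t)                 # stand-in for a speech clip
car_noise = np.random.default_rng(0).normal(size=160_000)  # stand-in for recorded noise
noisy = synthesize_noisy_speech(clean_speech, car_noise, snr_db=10.0)
</code>

The caveat from the notes applies here: if the same short noise clip is reused for many hours of speech, the synthesized set covers only a tiny part of the space of possible noises, and the model may overfit to that clip even though the audio sounds fine to a human.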