====== Evaluation metrics and train/dev/test set ======

===== Using a single evaluation metric =====

Precision: of the examples classified as class 1, what fraction actually are class 1.
Recall: of the actual class-1 examples, what fraction were correctly identified.

  * Classifier A: Precision: 95%, Recall: 90%
  * Classifier B: Classifier B: Precision: 98%, Recall: 85%

Problem: not clear which classifier is better (precision/recall tradeoff).

Solution: a new measure that combines both, the F1 score: the harmonic mean $F_1 = \frac{2}{\frac{1}{p}+\frac{1}{r}}$, or more generally some (weighted) average of the metrics (see the code sketch at the end of these notes).

Use a **dev set** + a **single-number evaluation metric** to speed up iterative improvement.

===== Metric tradeoffs =====

Example: maximize accuracy, subject to running time <= 100 ms.

With N metrics: 1 optimizing metric, N-1 satisficing metrics (each only has to reach some threshold); see the sketch at the end of these notes.

===== Train/Dev/Test set =====

Dev set / holdout set: try out ideas on the dev set.

Goal: the train set, and especially the dev and test sets, should come from the **same distribution**.

Solution: random shuffle (or stratified sampling) before splitting.

==== Sizes ====

  * For 100 - 10,000 samples: 70% train / 30% test, or 60% train / 20% dev / 20% test
  * For ~1,000,000 samples (typical for NNs): 98% train / 1% dev / 1% test

===== Change dev/test set and metric =====

Change the metric if its rank ordering of classifiers isn't "right", i.e. does not reflect what you actually care about.

One solution: put higher weights on certain kinds of errors.

Two steps:
  - Place the target (define the evaluation metric)
  - Aim and shoot at the target (figure out how to optimize that metric)

Example: the dev/test set contains high-quality images, but users upload low-quality images => change the metric and/or the dev/test set.

====== Human level performance ======

Bayes optimal error: the best possible error.

Human-level error can be used as an estimate of Bayes error (e.g. in computer vision).

  * Human: 1%, Train: 8%, Dev: 10% => focus on bias reduction
  * Human: 7.5%, Train: 8%, Dev: 10% => focus on variance reduction (more data, regularization)

What is human-level error? Either the best performance achievable by a human (or a team of experts), used as a proxy for Bayes error, or simply a level that is useful for the application.

Compare human-level error, training error, and dev error:

  * Avoidable bias: gap between human-level error and training error
    * Train a bigger model
    * Train longer / use better optimization algorithms
    * NN architecture / hyperparameter search
  * Variance: gap between training error and dev error
    * More data
    * Regularization
    * NN architecture / hyperparameter search
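
The comparison of human-level, training, and dev error above can be turned into a tiny decision helper. A minimal Python sketch, assuming human-level error is used as the Bayes-error proxy; the ''diagnose'' function and its simple "larger gap wins" rule are illustrative assumptions, not a prescribed recipe:

<code python>
def diagnose(human_error: float, train_error: float, dev_error: float) -> str:
    """Compare avoidable bias (human vs. train) with variance (train vs. dev)."""
    avoidable_bias = train_error - human_error  # human-level error as Bayes-error proxy
    variance = dev_error - train_error
    if avoidable_bias > variance:
        return (f"avoidable bias {avoidable_bias:.1%} > variance {variance:.1%}: "
                "try a bigger model, longer training, better optimization")
    return (f"variance {variance:.1%} >= avoidable bias {avoidable_bias:.1%}: "
            "try more data, regularization")

# The two example scenarios from the notes:
print(diagnose(human_error=0.01, train_error=0.08, dev_error=0.10))   # -> bias reduction
print(diagnose(human_error=0.075, train_error=0.08, dev_error=0.10))  # -> variance reduction
</code>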
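
Relating back to the single evaluation metric section above, a minimal sketch of the F1 computation for the two classifiers quoted there (only the precision/recall numbers come from the notes; the function name is just illustrative):

<code python>
def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall: 2 / (1/p + 1/r)."""
    return 2 / (1 / precision + 1 / recall)

classifiers = {"A": (0.95, 0.90), "B": (0.98, 0.85)}
for name, (p, r) in classifiers.items():
    print(f"Classifier {name}: F1 = {f1_score(p, r):.4f}")
# A: F1 ~ 0.9243, B: F1 ~ 0.9104 -> A is better on the single combined metric
</code>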
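
Similarly, for the optimizing/satisficing setup: a sketch where accuracy is the optimizing metric and running time is a satisficing constraint. The candidate models and their numbers are made up for illustration:

<code python>
# Hypothetical candidate models: accuracy is optimizing, runtime is satisficing.
models = [
    {"name": "model_1", "accuracy": 0.92, "runtime_ms": 80},
    {"name": "model_2", "accuracy": 0.95, "runtime_ms": 150},
    {"name": "model_3", "accuracy": 0.93, "runtime_ms": 95},
]
feasible = [m for m in models if m["runtime_ms"] <= 100]  # satisficing: must meet threshold
best = max(feasible, key=lambda m: m["accuracy"])         # optimizing: maximize accuracy
print(best["name"])  # -> model_3 (model_2 is more accurate but too slow)
</code>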