Precision: % of examples recognized as class 1 that actually were class 1
Recall: % of actual class 1 examples that were correctly identified
Problem: With two metrics it is unclear which classifier is better (precision/recall tradeoff)
Solution: Combine both into a single measure (F1 score): harmonic mean of precision $p$ and recall $r$, $F_1 = 2/((1/p)+(1/r))$; more generally, some form of average
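A minimal sketch of the three measures, assuming binary labels stored in plain Python lists. The function name `precision_recall_f1` and the toy labels are made up for illustration and are not from the notes.

```python
def precision_recall_f1(y_true, y_pred, positive=1):
    """Return (precision, recall, F1) for the given positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if p == positive and t == positive)
    fp = sum(1 for t, p in pairs if p == positive and t != positive)
    fn = sum(1 for t, p in pairs if p != positive and t == positive)

    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    # Harmonic mean 2 / (1/p + 1/r), written as 2pr / (p + r)
    # so that p = 0 or r = 0 does not cause a division by zero.
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


# Toy labels (hypothetical): 3 true positives, 2 false positives, 1 false negative.
y_true = [1, 1, 1, 0, 0, 0, 1, 0]
y_pred = [1, 1, 0, 1, 0, 1, 1, 0]
p, r, f1 = precision_recall_f1(y_true, y_pred)
print(f"precision={p:.2f} recall={r:.2f} F1={f1:.2f}")
```

Because the harmonic mean is pulled toward the smaller of the two values, a classifier cannot get a high F1 by doing well on only one of precision or recall.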
Use dev set + single-number evaluation metric to speed up iterative improvement
Maximize accuracy, subject to runningTime ≤ 100 ms
N metrics: 1 optimizing, N-1 satisficing (reaching some threshold)
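One way to read the optimizing/satisficing split is as a filter followed by a maximization. The sketch below assumes that reading; the candidate models, their accuracies and their running times are invented for illustration.

```python
# Hypothetical candidates: (name, accuracy, running time in ms).
candidates = [
    ("A", 0.90, 80),
    ("B", 0.92, 95),
    ("C", 0.95, 1500),
]

MAX_RUNNING_TIME_MS = 100  # satisficing threshold: runningTime <= 100 ms


def pick_best(models, max_ms):
    """Drop models that miss the satisficing threshold, then maximize accuracy."""
    feasible = [m for m in models if m[2] <= max_ms]
    if not feasible:
        return None  # no model satisfies the constraint
    return max(feasible, key=lambda m: m[1])


best = pick_best(candidates, MAX_RUNNING_TIME_MS)
print(best)
```

Model C has the highest accuracy but is excluded by the running-time threshold, so the choice falls to the most accurate model among those that are fast enough.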
Dev set / holdout set: Try ideas on dev set
Goal: Train set and especially dev and test sets should come from the same distribution
Solution: Random shuffle (or stratified sample)
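A minimal sketch of the random-shuffle approach, assuming all available examples are first pooled and then split, so that dev and test are drawn from the same mix. The function name `shuffle_split`, the 50/50 split and the two-region toy data are assumptions for illustration.

```python
import random


def shuffle_split(examples, dev_fraction=0.5, seed=0):
    """Shuffle a pooled dataset and split it into (dev, test)."""
    pool = list(examples)               # copy, so the caller's data is untouched
    random.Random(seed).shuffle(pool)   # seeded for reproducibility
    cut = int(len(pool) * dev_fraction)
    return pool[:cut], pool[cut:]


# Hypothetical pooled data from two sources.
data = [("region_a", i) for i in range(50)] + [("region_b", i) for i in range(50)]
dev, test = shuffle_split(data)
print(len(dev), len(test))
```

A plain shuffle only matches the source proportions on average. A stratified sample would split each source separately to fix the proportions exactly.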
Change metric, if rank ordering isn't “right”
One solution: Use weights for certain errors
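A sketch of a weighted error metric, assuming the form $\sum_i w_i \, [\hat{y}_i \neq y_i] / \sum_i w_i$. The weight of 10 for the costly example is an arbitrary illustrative value, not one given in the notes.

```python
def weighted_error(y_true, y_pred, weights):
    """Weighted misclassification rate: sum of weights on errors / sum of all weights."""
    total = sum(weights)
    if total == 0:
        raise ValueError("weights must not sum to zero")
    wrong = sum(w for t, p, w in zip(y_true, y_pred, weights) if t != p)
    return wrong / total


y_true = [1, 0, 1, 0]
y_pred = [1, 1, 1, 0]           # one mistake, on the second example
uniform = [1, 1, 1, 1]
costly = [1, 10, 1, 1]          # hypothetical: the second example is 10x as important

print(weighted_error(y_true, y_pred, uniform))
print(weighted_error(y_true, y_pred, costly))
```

With uniform weights this reduces to the ordinary error rate. Raising the weight on an unacceptable kind of mistake makes classifiers that commit it rank lower.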
Two steps:
E.g. dev/test set contains high-quality images, but users upload low-quality images ⇒ change metric and/or dev/test set
Bayes optimal error (best possible error)
Human level error could be used as an estimate for Bayes error (e.g. in Computer Vision)
What's human-level error? Best performance possible as a human / usefulness
Compare the gaps between human-level error, train error and dev error (human vs. train, train vs. dev)
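A small sketch of that comparison, assuming human-level error stands in for Bayes error. The labels "avoidable bias" and "variance" for the two gaps are commonly used terms and are not taken from these notes. The error values are invented.

```python
def error_gaps(human_error, train_error, dev_error):
    """Return (train - human, dev - train): the two gaps to compare."""
    avoidable_bias = train_error - human_error   # room left on the training set
    variance = dev_error - train_error           # generalization gap
    return avoidable_bias, variance


# Hypothetical error rates.
bias_gap, variance_gap = error_gaps(human_error=0.01, train_error=0.08, dev_error=0.10)

if bias_gap > variance_gap:
    focus = "fit the training set better"
else:
    focus = "generalize better from train to dev"
print(f"bias gap={bias_gap:.2f}, variance gap={variance_gap:.2f} -> {focus}")
```

The larger gap suggests where effort is likely to pay off more. If the human-level estimate changes, the same train and dev errors can point to a different priority.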