Using a single number evaluation metric
Precision: % of examples recognized as class 1 that were actually class 1
Recall: % of actual class 1 examples that were correctly identified
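A minimal sketch of both definitions in Python, using hypothetical confusion-matrix counts (the numbers are illustrative, not from the notes):

```python
# Hypothetical counts for a binary classifier (class 1 = positive)
true_pos = 90    # predicted class 1, actually class 1
false_pos = 5    # predicted class 1, actually class 0
false_neg = 10   # predicted class 0, actually class 1

precision = true_pos / (true_pos + false_pos)  # of those recognized as class 1, how many were class 1
recall = true_pos / (true_pos + false_neg)     # of actual class-1 examples, how many were identified

print(f"Precision: {precision:.1%}, Recall: {recall:.1%}")  # Precision: 94.7%, Recall: 90.0%
```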
- Classifier A: Precision: 95%, Recall: 90%
- Classifier B: Precision: 98%, Recall: 85%
Problem: not clear which classifier is better (precision/recall tradeoff).
Solution: a new measure that combines both, the F1 score, the harmonic mean of precision $P$ and recall $R$: $F_1 = 2/((1/P)+(1/R))$
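A short sketch applying the F1 formula to classifiers A and B above (a Python illustration, not part of the original notes):

```python
def f1_score(p, r):
    """F1 score: harmonic mean of precision p and recall r."""
    return 2 / ((1 / p) + (1 / r))

# Precision/recall of classifiers A and B from above
print(f"Classifier A: F1 = {f1_score(0.95, 0.90):.2%}")  # 92.43%
print(f"Classifier B: F1 = {f1_score(0.98, 0.85):.2%}")  # 91.04%
# A single number resolves the tradeoff: A is the better classifier.
```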
Use a dev set + a single number evaluation metric to speed up iterative improvement