# Evaluation metrics and train/dev/test set

Precision: % of examples recognized as class 1 that actually are class 1

Recall: % of actual class 1 examples that were correctly identified

• Classifier A: Precision: 95%, Recall: 90%
• Classifier B: Precision: 98%, Recall: 85%

Problem: Unclear which classifier is better (precision/recall tradeoff)

Solution: A single measure that combines both (F1 score): the harmonic mean of precision $p$ and recall $r$, $F_1 = 2/((1/p)+(1/r))$. More generally: average several metrics into one number
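A minimal sketch of the F1 comparison for the two classifiers listed above. The function name `f1_score` and the dictionary layout are illustrative choices, not part of the original notes.

```python
def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall: 2 / (1/p + 1/r)."""
    if precision == 0 or recall == 0:
        # The harmonic mean is taken to be 0 when either input is 0.
        return 0.0
    return 2 / (1 / precision + 1 / recall)


# Precision and recall of classifiers A and B from the notes.
classifiers = {"A": (0.95, 0.90), "B": (0.98, 0.85)}

scores = {name: f1_score(p, r) for name, (p, r) in classifiers.items()}
best = max(scores, key=scores.get)

for name, score in scores.items():
    print(f"Classifier {name}: F1 = {score:.3f}")
print("Best by F1:", best)
```

With these numbers A scores about 0.924 and B about 0.910, so the single number picks A even though B has the higher precision.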

Use a dev set plus a single-number evaluation metric to speed up iterative improvement

Maximize accuracy, subject to runningTime ≤ 100 ms

N metrics: 1 optimizing, N-1 satisficing (reaching some threshold)
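One way to sketch the optimizing/satisficing idea: filter by the satisficing threshold first, then maximize the optimizing metric. The candidate models and their numbers below are made up for illustration, and `pick_model` is a hypothetical helper.

```python
# Hypothetical candidates: (name, accuracy, running time in ms).
candidates = [
    ("model_a", 0.90, 80),
    ("model_b", 0.92, 95),
    ("model_c", 0.95, 1500),
]

MAX_RUNNING_TIME_MS = 100  # satisficing threshold from the notes


def pick_model(models, max_ms):
    """Return the most accurate model whose running time meets the threshold."""
    feasible = [m for m in models if m[2] <= max_ms]
    if not feasible:
        return None  # no model satisfies the constraint
    # Accuracy is the optimizing metric.
    return max(feasible, key=lambda m: m[1])


chosen = pick_model(candidates, MAX_RUNNING_TIME_MS)
print(chosen)
```

Here `model_c` is the most accurate but fails the running-time threshold, so `model_b` is chosen.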

Dev set / holdout set: Try ideas on dev set

Goal: The train set, and especially the dev and test sets, should come from the same distribution

Solution: Random shuffle (or stratified sample)

• For 100 to 10,000 samples: 70% Train, 30% Test, or 60% Train, 20% Dev, 20% Test
• For 1,000,000 samples (NNs): 98% Train, 1% Dev, 1% Test
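A rough sketch of a shuffled train/dev/test split using only the standard library. `split_dataset`, its parameters, and the fixed seed are assumptions made for this example; it does a plain random shuffle, not a stratified sample.

```python
import random


def split_dataset(examples, train_frac, dev_frac, seed=0):
    """Shuffle the examples, then cut them into train/dev/test parts."""
    items = list(examples)
    random.Random(seed).shuffle(items)  # seeded for reproducibility
    n_train = round(len(items) * train_frac)
    n_dev = round(len(items) * dev_frac)
    train = items[:n_train]
    dev = items[n_train:n_train + n_dev]
    test = items[n_train + n_dev:]  # whatever remains goes to test
    return train, dev, test


# 60% / 20% / 20% split for a small dataset of 1,000 samples.
train, dev, test = split_dataset(range(1000), train_frac=0.6, dev_frac=0.2)
print(len(train), len(dev), len(test))
```

For a very large dataset the same call with `train_frac=0.98, dev_frac=0.01` gives the 98/1/1 split.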

Change the metric if its rank ordering of the models isn't “right”

One solution: Use weights for certain errors
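A possible form of such a weighted metric, as a sketch: each example gets a weight, and the error is the weighted share of misclassified examples. The weights and labels below are hypothetical.

```python
def weighted_error(y_true, y_pred, weights):
    """Weighted misclassification rate: sum of weights of wrong predictions / sum of all weights."""
    total = sum(weights)
    wrong = sum(w for t, p, w in zip(y_true, y_pred, weights) if t != p)
    return wrong / total


y_true = [1, 0, 1, 0]
y_pred = [1, 1, 0, 0]
# Hypothetical weights: a mistake on the second example counts 10x as much.
weights = [1, 10, 1, 1]

err = weighted_error(y_true, y_pred, weights)
plain = weighted_error(y_true, y_pred, [1] * len(y_true))
print(f"weighted: {err:.3f}, unweighted: {plain:.3f}")
```

With uniform weights this reduces to the usual classification error, so the weighting only changes the ranking when models differ on the heavily weighted examples.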

Two steps:

1. Place the target (eval metric)
2. How to shoot at target (how to optimize metric)

E.g. the dev/test set contains high-quality images, but users upload low-quality images ⇒ change metric and/or dev/test set

# Human level performance

Bayes optimal error: the lowest error any classifier could achieve (best possible error)

Human level error could be used as an estimate for Bayes error (e.g. in Computer Vision)

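• H: 1%, Train: 8%, Dev: 10% ⇒ focus on bias reduction (gap to human level is 7%, train/dev gap only 2%)
• H: 7.5%, Train: 8%, Dev: 10% ⇒ focus on variance reduction (more data, regularization); gap to human level is only 0.5%, train/dev gap is 2%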

What's human-level error? The best performance achievable by a human, or the level that is useful for the application

Compare the gaps between human-level error, training error and dev error:

• Avoidable bias: gap between human-level error and training error
  • Train a bigger model
  • Train longer / use better optimization algorithms
  • NN architecture/hyperparameter search
• Variance: gap between training error and dev error
  • More data
  • Regularization
  • NN architecture/hyperparameter search
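The comparison of the two gaps can be sketched as follows, using human-level error as a stand-in for Bayes error. `diagnose` is a hypothetical helper, and picking the larger gap as the focus is a simple rule of thumb assumed for this example.

```python
def diagnose(human_err, train_err, dev_err):
    """Return (avoidable_bias, variance, focus) from three error rates in %."""
    avoidable_bias = train_err - human_err  # human level vs. training error
    variance = dev_err - train_err          # training error vs. dev error
    # Focus on the larger gap; on a tie this sketch defaults to variance.
    focus = "bias" if avoidable_bias > variance else "variance"
    return avoidable_bias, variance, focus


# The two examples from the notes (errors in %).
case_1 = diagnose(human_err=1.0, train_err=8.0, dev_err=10.0)
case_2 = diagnose(human_err=7.5, train_err=8.0, dev_err=10.0)
print(case_1)
print(case_2)
```

The same training and dev errors lead to different priorities depending on the human-level estimate: bias reduction in the first case, variance reduction in the second.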

Last modified: 2018/05/21 18:50 by phreazer