data_mining:error_analysis

data_mining:error_analysis [2018/05/21 19:55] – [Misslabeled data] phreazer
data_mining:error_analysis [2018/05/21 22:11] – [Problems with different train and dev/test set dist] phreazer
Line 154:
     * Train set: 205,000 examples of high and low quality; Dev & Test: 2,500 low-quality examples
     * Advantage: Optimizing for the right data
-    * Disadvantage: The train distribution differs from the dev and test distribution
+    * Disadvantage: The train distribution differs from the dev and test distribution
 + 
 +====== Problems with different train and dev/test set distributions ======
 + 
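 +It is not always a good idea to use different distributions for the train and dev sets.
 +
 +  * Human error: ~0%
 +  * Train error: 1%
 +  * Dev error: 10%
 +
 +With different distributions, these numbers alone cannot tell whether the gap between train and dev error comes from variance or from the change in distribution.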
 + 
 +Training-dev set: drawn from the same distribution as the training set, but not used for training.
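One way to carve out such a training-dev set is to shuffle the training data and hold out a small slice before training. The following is a minimal sketch under assumptions: the function name `split_train_dev`, the 5% default fraction, and the fixed seed are illustrative choices, not something the notes prescribe.

```python
import random


def split_train_dev(examples, train_dev_fraction=0.05, seed=0):
    """Hold out a random slice of the training data as the train-dev set.

    Both returned lists come from the same distribution as `examples`;
    only the first one should be used for fitting the model.
    """
    if not 0.0 < train_dev_fraction < 1.0:
        raise ValueError("train_dev_fraction must be strictly between 0 and 1")
    shuffled = list(examples)
    # A seeded generator keeps the split reproducible between runs.
    random.Random(seed).shuffle(shuffled)
    n_train_dev = max(1, round(len(shuffled) * train_dev_fraction))
    train_dev = shuffled[:n_train_dev]
    train = shuffled[n_train_dev:]
    return train, train_dev


# Hypothetical usage with 1000 placeholder examples.
train, train_dev = split_train_dev(range(1000), train_dev_fraction=0.05)
print(len(train), len(train_dev))  # 950 50
```

Because the held-out slice is never trained on, its error can be compared directly with the training error without the distribution shift getting in the way.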
 + 
 +  * Train: 1%
 +  * Train-dev: 9%
 +  * Dev: 10%
 +
 +The gap between train and train-dev error is still large (1% vs. 9%) => variance problem.
 +
 +If train and train-dev error were close but dev error stayed much higher => data-mismatch problem.
 + 
 +Summary (each indented item names the gap between the error levels directly above and below it):
 +  * Human level: 4%
 +    * Avoidable bias
 +  * Train: 7%
 +    * Variance
 +  * Train-dev: 10%
 +    * Data mismatch
 +  * Dev: 12%
 +    * Degree of overfitting to the dev set (if too high => use a bigger dev set)
 +  * Test: 12%
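
The breakdown in the summary can be sketched as a subtraction of adjacent error levels. This is an illustrative helper only; the function name `error_gaps` and the dictionary keys are assumptions, and the inputs are the percentages from the summary.

```python
def error_gaps(human, train, train_dev, dev, test):
    """Return the gap between adjacent error levels, in percentage points."""
    return {
        "avoidable_bias": train - human,      # human level -> train
        "variance": train_dev - train,        # train -> train-dev
        "data_mismatch": dev - train_dev,     # train-dev -> dev
        "dev_overfitting": test - dev,        # dev -> test
    }


# Error rates in percent, taken from the summary list.
gaps = error_gaps(human=4, train=7, train_dev=10, dev=12, test=12)
print(gaps)
# {'avoidable_bias': 3, 'variance': 3, 'data_mismatch': 2, 'dev_overfitting': 0}
```

The largest gap suggests where to spend effort first. With these numbers avoidable bias and variance tie at 3 points, so the breakdown narrows the choice down without settling it.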
  • data_mining/error_analysis.txt
  • Last modified: 2018/05/21 22:24
  • by phreazer