Differences

This shows you the differences between two versions of the page.

--- data_mining:error_analysis [2018/05/21 19:45] – [Working on most promising problems] phreazer
+++ data_mining:error_analysis [2018/05/21 19:55] – [Misslabeled data] phreazer
@@ Line 139: / Line 139: @@
 Principles when fixing labels:
-- Apply same process to dev and test set (same distribution)
+  * Apply same process to dev and test set (same distribution)
-- Also see what examples algo got right (not only wrong)
+  * Also see what examples algo got right (not only wrong)
-- Train and dev/test data may come from different distribution (no problem if slightly different)
+  * Train and dev/test data may come from different distribution (no problem if slightly different)
+====== Mismatched train and dev/test set ======
+  * 200,000 high-quality pictures
+  * 10,000 low-quality, blurry pictures
+  * Option 1: Combine all images and shuffle them randomly into train/dev/test sets
+    * Advantage: Train, dev, and test sets share the same distribution
+    * Disadvantage: Most dev/test images are high-quality pictures, so most time is spent optimizing for high-quality pictures
+  * Option 2:
+    * Train set: 205,000 images (high and low quality); dev and test sets: 2,500 low-quality images each
+    * Advantage: Optimizing for the right data
+    * Disadvantage: Train distribution differs from the dev/test distribution
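
The two options above can be compared with a small sketch. This is only an illustration under stated assumptions: the images are stand-in tuples such as `("high", 17)` rather than real files, the helper names `split_option_1`, `split_option_2`, and `low_share` are hypothetical, and dev and test are assumed to hold 2,500 examples each in both options so that the results are comparable.

```python
import random


def split_option_1(high, low, n_dev, n_test, seed=0):
    """Option 1: pool everything, shuffle, then cut dev/test/train."""
    rng = random.Random(seed)
    pool = list(high) + list(low)
    rng.shuffle(pool)
    dev = pool[:n_dev]
    test = pool[n_dev:n_dev + n_test]
    train = pool[n_dev + n_test:]
    return train, dev, test


def split_option_2(high, low, n_dev, n_test, seed=0):
    """Option 2: dev/test come only from the low-quality pictures;
    the remaining low-quality pictures join the high-quality ones in train."""
    rng = random.Random(seed)
    low = list(low)
    rng.shuffle(low)
    dev = low[:n_dev]
    test = low[n_dev:n_dev + n_test]
    train = list(high) + low[n_dev + n_test:]
    rng.shuffle(train)
    return train, dev, test


def low_share(examples):
    """Fraction of examples that are low-quality pictures."""
    return sum(1 for kind, _ in examples if kind == "low") / len(examples)


# Placeholder data standing in for the 200,000 + 10,000 pictures.
high = [("high", i) for i in range(200_000)]
low = [("low", i) for i in range(10_000)]

train1, dev1, test1 = split_option_1(high, low, 2_500, 2_500)
train2, dev2, test2 = split_option_2(high, low, 2_500, 2_500)

print(f"Option 1: low-quality share in dev   = {low_share(dev1):.3f}")
print(f"Option 2: low-quality share in dev   = {low_share(dev2):.3f}")
print(f"Option 2: low-quality share in train = {low_share(train2):.3f}")
```

With Option 1 the low-quality share of the dev set is expected to sit near 10,000 / 210,000 (about 5%), which is why most tuning effort goes to high-quality pictures. With Option 2 the dev and test sets are entirely low quality, while the 205,000-image train set contains only 5,000 low-quality pictures, which is the distribution mismatch listed as the disadvantage.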