Differences

--- data_mining:error_analysis [2018/05/21 19:55] – [Misslabeled data] phreazer
+++ data_mining:error_analysis [2018/05/21 22:11] – [Problems with different train and dev/test set dist] phreazer
@@ Line 154: / Line 154: @@
     * Train set: 205.000 with high and low qual; Dev & Test: 2500 low quality
     * Advantage: Optimizing right data
     * Disadvantage: Train distr. is different than dev and test set
+====== Problems with different train and dev/test set dist ======
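+It is not always a good idea to use different distributions for the train and dev sets: a large gap between train and dev error can then come either from variance or from the change in distribution, and the two cannot be told apart.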
+  * Human error ~ 0
+  * Train 1%
+  * Dev 10%
+Training-dev set: a held-out subset with the same distribution as the training set, but not used for training
+  * Train 1%
+  * Train-dev: 9%
+  * Dev: 10%
+Still a large gap between train and train-dev => variance problem
+If train and train-dev errors were close and the gap appeared between train-dev and dev => data-mismatch problem.
+Summary (each sub-item names the gap between the level above it and the level below it):
+  * Human level 4%
+    * Avoidable bias
+  * Train 7%
+    * Variance
+  * Train-dev: 10%
+    * Data mismatch
+  * Dev: 12%
+    * Degree of overfitting to dev set (if too high => bigger dev set)
+  * Test: 12%
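
The gap analysis in the summary can be sketched as a few subtractions. This is a minimal illustration and not part of the page itself: the function names `error_gaps` and `largest_gap` are hypothetical, errors are given in percent, and the second call uses made-up numbers for the "train and train-dev are close" case.

```python
def error_gaps(human, train, train_dev, dev, test=None):
    """Return the gap (in percentage points) between consecutive error levels."""
    gaps = {
        "avoidable bias": train - human,
        "variance": train_dev - train,
        "data mismatch": dev - train_dev,
    }
    # The test error is optional; without it the last gap cannot be computed.
    if test is not None:
        gaps["overfitting to dev set"] = test - dev
    return gaps


def largest_gap(gaps):
    """Name of the largest gap; on ties the first one listed wins."""
    return max(gaps, key=gaps.get)


# First example from the notes: human ~0%, train 1%, train-dev 9%, dev 10%.
print(largest_gap(error_gaps(human=0, train=1, train_dev=9, dev=10)))

# Hypothetical variant: train and train-dev are close, dev is far off.
print(largest_gap(error_gaps(human=0, train=1, train_dev=1.5, dev=10)))

# Summary numbers: human 4%, train 7%, train-dev 10%, dev 12%, test 12%.
print(error_gaps(human=4, train=7, train_dev=10, dev=12, test=12))
```

The largest gap only suggests where to look first; when several gaps are of similar size, as with the summary numbers, the ranking alone does not decide what to work on.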