--- data_mining:regression [2014/07/13 02:54] – [Mean normalization] phreazer
+++ data_mining:regression [2019/02/10 17:14] (current) – [Gradient descent] phreazer
@@ Line 17: / Line 17: @@
 ==== Cost function ====
+$\displaystyle\min_{\theta_0,\theta_1} \sum_{i=1}^m (h_\theta(x^{(i)})-y^{(i)})^2$
-$\text{minimize}_{\theta_0,\theta_1} \sum_{i=1}^m (h_\theta(x^{(i)})-y^{(i)})^2$
 Simplified problem:
-$\text{minimize}_{\theta_0,\theta_1} \frac{1}{2*m} \sum_{i=1}^m (h_\theta(x^{(i)})-y^{(i)})^2$
+$\displaystyle\min_{\theta_0,\theta_1} \frac{1}{2*m} \sum_{i=1}^m (h_\theta(x^{(i)})-y^{(i)})^2$
 $h_\theta(x^{(i)}) = \theta_0 +\theta_1x^{(i)}$
-Cost function (Squared error cost function):
+Cost function (Squared error cost function) $J$:
 $J(\theta_0,\theta_1) = \frac{1}{2*m} \sum_{i=1}^m (h_\theta(x^{(i)})-y^{(i)})^2$
-Goal: $\text{minimize}_{\theta_0,\theta_1} J(\theta_0,\theta_1)$
+Goal: $\displaystyle\min_{\theta_0,\theta_1} J(\theta_0,\theta_1)$
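The squared error cost function $J$ above can be evaluated directly. The following is a minimal sketch, assuming NumPy and a tiny made-up data set; the function names `hypothesis` and `cost` are illustrative and do not come from the wiki page.

```python
import numpy as np

def hypothesis(theta0, theta1, x):
    # h_theta(x) = theta_0 + theta_1 * x
    return theta0 + theta1 * x

def cost(theta0, theta1, x, y):
    # J(theta_0, theta_1) = 1/(2m) * sum((h_theta(x_i) - y_i)^2)
    m = len(x)
    residuals = hypothesis(theta0, theta1, x) - y
    return np.sum(residuals ** 2) / (2 * m)

# Made-up data lying exactly on the line y = x (an assumption for illustration).
x = np.array([1.0, 2.0, 3.0])
y = np.array([1.0, 2.0, 3.0])

print(cost(0.0, 1.0, x, y))  # perfect fit
print(cost(0.0, 0.0, x, y))  # worse fit, larger cost
```

For the perfect fit the cost is zero; any other parameter choice on this data gives a strictly positive value, which is what the minimization exploits.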
 === Functions (example with only $\theta_1$): ===
@@ Line 54: / Line 53: @@
 Repeat until convergence:
-$\theta_j := \theta_j - alpha \frac{\partial}{\partial\theta_j} J(\theta_0, \theta_1)$
+$\theta_j := \theta_j - \alpha \frac{\partial}{\partial\theta_j} J(\theta_0, \theta_1)$
 Simultaneous update (!):
 $$
-tmp0 := \theta_0 - alpha \frac{\partial}{\partial\theta_0} J(\theta_0, \theta_1)\\
+tmp0 := \theta_0 - \alpha \frac{\partial}{\partial\theta_0} J(\theta_0, \theta_1)\\
-tmp1 := \theta_1 - alpha \frac{\partial}{\partial\theta_1} J(\theta_0, \theta_1)\\
+tmp1 := \theta_1 - \alpha \frac{\partial}{\partial\theta_1} J(\theta_0, \theta_1)\\
 \theta_0 := tmp0\\
 \theta_1 := tmp1
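The simultaneous update can be sketched in code as follows. This is an illustration under assumptions, not the page's own implementation: it uses NumPy, the partial derivatives obtained by differentiating the squared error cost $J$ for $h_\theta(x) = \theta_0 + \theta_1 x$, a fixed iteration count instead of a convergence test, and made-up data generated from $y = 1 + 2x$.

```python
import numpy as np

def gradient_descent(x, y, alpha=0.1, iterations=2000):
    # alpha and iterations are illustrative defaults, not values from the notes.
    m = len(x)
    theta0, theta1 = 0.0, 0.0
    for _ in range(iterations):
        error = theta0 + theta1 * x - y
        # Partial derivatives of J with respect to theta_0 and theta_1.
        grad0 = np.sum(error) / m
        grad1 = np.sum(error * x) / m
        # Compute both new values from the OLD parameters first ...
        tmp0 = theta0 - alpha * grad0
        tmp1 = theta1 - alpha * grad1
        # ... and only then overwrite them (simultaneous update).
        theta0, theta1 = tmp0, tmp1
    return theta0, theta1

# Made-up data on the line y = 1 + 2x.
x = np.array([0.0, 1.0, 2.0, 3.0])
y = 1.0 + 2.0 * x
theta0, theta1 = gradient_descent(x, y)
print(theta0, theta1)
```

Assigning `theta0` before computing `tmp1` would feed the already updated $\theta_0$ into the gradient for $\theta_1$, which is a different algorithm from the one written above.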
@@ Line 105: / Line 104: @@
 $\theta_j := \theta_j - \alpha \frac{\partial}{\partial\theta_j} J(\theta)$
-====== Gradient Descent Improvements ======
-===== Feature Scaling =====
+==== Normal equations ====
+  * Feature/design matrix $X$ (dim: $m \times (n+1)$)
+  * Vector $y$ (dim: $m$)
+$\theta = (X^TX)^{-1}X^Ty$
+  * Feature scaling is not necessary.
+What if $X^TX$ is singular (not invertible)?
+(Use pinv in Octave, which computes the pseudoinverse.)
+**Causes of singularity:**
+  * Redundant features (linear dependence)
+  * Too many features (e.g. $m \le n$)
+    * Fix: drop features or regularize
+**When to use which?**
+  * $m$ training examples, $n$ features
+  * GD works well even for large $n$ (> 1000); the normal equation has to invert an $(n \times n)$ matrix, which costs roughly $O(n^3)$.
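As a rough sketch of the closed-form solution, the snippet below evaluates $\theta = (X^TX)^{-1}X^Ty$ with NumPy's `np.linalg.pinv`, used here as a stand-in for Octave's pinv. The data, the helper name `normal_equation`, and the explicit `rcond` cutoff are assumptions made for illustration.

```python
import numpy as np

def normal_equation(X, y):
    # theta = pinv(X^T X) X^T y; the pseudoinverse also copes with a singular X^T X.
    # rcond=1e-10 is an illustrative cutoff for treating singular values as zero.
    return np.linalg.pinv(X.T @ X, rcond=1e-10) @ X.T @ y

# Made-up data on the line y = 1 + 2x; the first column of X is the bias term.
x = np.array([0.0, 1.0, 2.0, 3.0])
y = 1.0 + 2.0 * x
X = np.column_stack([np.ones_like(x), x])
theta = normal_equation(X, y)
print(theta)

# Redundant feature: a third column that is exactly 2 * x makes X^T X singular.
X_dup = np.column_stack([X, 2.0 * x])
theta_dup = normal_equation(X_dup, y)
print(X_dup @ theta_dup)  # predictions still reproduce y
```

With the redundant column the individual parameters are no longer unique, but the pseudoinverse still picks one solution whose predictions fit the data.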
+===== Gradient Descent Improvements =====
+==== Feature Scaling ====
   * Bringing the features onto a similar scale leads to faster convergence
      * E.g. when the contour plots become elongated ($x_1 \in [0,2000]$, $x_2 \in [0,5]$).
@@ Line 112: / Line 135: @@
      * Rule of thumb: -3 to 3 (not too large, not too small)
-===== Mean normalization =====
+==== Mean normalization ====
 $x_i - \mu_i$
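A possible sketch of mean normalization combined with scaling is shown below. Dividing by the range (max minus min) is an assumption made here; the visible part of the notes only gives the centering step $x_i - \mu_i$, and dividing by the standard deviation is another common choice. The data and the name `mean_normalize` are illustrative.

```python
import numpy as np

def mean_normalize(X):
    # Center each feature column at zero, then divide by its range.
    # Note: a constant column has range 0 and would need special handling.
    mu = X.mean(axis=0)
    spread = X.max(axis=0) - X.min(axis=0)
    return (X - mu) / spread, mu, spread

# Made-up features on very different scales, e.g. x_1 in [0, 2000], x_2 in [0, 5].
X = np.array([[2000.0, 5.0],
              [1000.0, 3.0],
              [0.0, 1.0]])
X_scaled, mu, spread = mean_normalize(X)
print(X_scaled)
```

The returned `mu` and `spread` would have to be reused to transform any new input before prediction, since the learned parameters refer to the scaled features.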
@@ Line 118: / Line 141: @@
 ==== Learning rate $\alpha$ ====
+  * $J(\theta)$ should decrease after every iteration (plot $J$ against the number of iterations).
+    * Alternatively: declare convergence once the change per iteration is smaller than $\epsilon$.
+  * If $J(\theta)$ increases, GD is probably overshooting the minimum, so use a smaller $\alpha$.
+  * If $\alpha$ is too small: slow convergence
+  * If $\alpha$ is too large: $J(\theta)$ does not decrease on every iteration and may not converge.
+  * Scheme: 0.001 -> 0.003 -> 0.01 -> 0.03 -> 0.1 -> ...
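The checks in the list above can be tried out with a small experiment. The sketch below records $J$ per iteration for the learning rates named in the scheme, plus one deliberately large value (1.0, an assumption added here to provoke divergence). The data, the iteration count, and the name `cost_history` are likewise illustrative.

```python
import numpy as np

def cost_history(x, y, alpha, iterations=50):
    # Run gradient descent for h(x) = theta0 + theta1 * x and record J each iteration.
    m = len(x)
    theta0, theta1 = 0.0, 0.0
    history = []
    for _ in range(iterations):
        error = theta0 + theta1 * x - y
        history.append(np.sum(error ** 2) / (2 * m))
        # Tuple assignment evaluates both right-hand sides first (simultaneous update).
        theta0, theta1 = (theta0 - alpha * np.sum(error) / m,
                          theta1 - alpha * np.sum(error * x) / m)
    return history

# Made-up data on the line y = 1 + 2x.
x = np.array([0.0, 1.0, 2.0, 3.0])
y = 1.0 + 2.0 * x

for alpha in (0.001, 0.003, 0.01, 0.03, 0.1, 1.0):
    J = cost_history(x, y, alpha)
    decreasing = all(b <= a for a, b in zip(J, J[1:]))
    print(f"alpha={alpha}: final J={J[-1]:.4g}, decreasing every iteration={decreasing}")
```

On this data the small rates decrease $J$ on every iteration but slowly, while the largest rate makes $J$ grow, which is the signal to pick a smaller $\alpha$.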
+===== Polynomial regression =====
+$\theta_0 + \theta_1 x + \theta_2 x^2 + \theta_3 x^3$
+This can be done by choosing suitable features:
+$x_3 = (size)^3$
+Feature scaling then becomes important (since the values are large and differ widely)
+Or a square-root function:
+$\theta_0 + \theta_1 x + \theta_2 \sqrt{x}$
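Building such features is a plain preprocessing step. The sketch below, assuming NumPy and made-up `size` values, constructs the cubic and the square-root feature sets; the function names are hypothetical.

```python
import numpy as np

def polynomial_features(size):
    # Columns: x_1 = size, x_2 = size^2, x_3 = size^3
    return np.column_stack([size, size ** 2, size ** 3])

def sqrt_features(size):
    # Columns: x_1 = size, x_2 = sqrt(size)
    return np.column_stack([size, np.sqrt(size)])

# Made-up sizes, chosen only for illustration.
size = np.array([100.0, 400.0, 900.0])
X_poly = polynomial_features(size)
X_sqrt = sqrt_features(size)

print(X_poly.max(axis=0))  # column ranges differ by orders of magnitude
print(X_sqrt)
```

The column maxima of `X_poly` show why scaling matters here: the cubed feature is several orders of magnitude larger than the raw size. After this step the model is still linear in $\theta$, so the same gradient descent or normal-equation machinery applies.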