Differences

This shows you the differences between two versions of the page.

--- data_mining:logistic_regression [2014/07/20 02:16] – [Gradient Descent] phreazer
+++ data_mining:logistic_regression [2018/05/10 17:46] – [Regularization] phreazer
@@ Line 41: / Line 41: @@
 Alternatives to gradient descent:
-- Conjugate gradient
+  * Conjugate gradient
-- BFGS
+  * BFGS
-- L-BFGS
+  * L-BFGS
 Advantages: the learning rate $\alpha$ does not have to be chosen manually; often faster than gradient descent.
 Disadvantage: more complex.
+Octave: define a cost function that computes both the cost and the gradient, and pass it to the corresponding optimization method.
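The "cost function that returns cost and gradient" pattern can be sketched in Python as follows. This is only a minimal illustration for unregularized logistic regression; the name `cost_and_gradient` and the list-based data layout (each row of `X` starts with a constant 1.0) are assumptions made for this sketch, not part of the original notes.

```python
import math

def sigmoid(z):
    """Logistic function."""
    return 1.0 / (1.0 + math.exp(-z))

def cost_and_gradient(theta, X, y):
    """Return (cost, gradient) for unregularized logistic regression.

    theta: list of parameters (theta[0] is the intercept).
    X: list of rows; each row starts with the constant 1.0 (assumed layout).
    y: list of 0/1 labels.
    """
    m = len(y)
    cost = 0.0
    grad = [0.0] * len(theta)
    for x_i, y_i in zip(X, y):
        h = sigmoid(sum(t * x for t, x in zip(theta, x_i)))
        # Cross-entropy contribution of this example.
        cost += -(y_i * math.log(h) + (1 - y_i) * math.log(1 - h)) / m
        # Gradient contribution: (h - y) * x_j for every parameter j.
        for j, x_ij in enumerate(x_i):
            grad[j] += (h - y_i) * x_ij / m
    return cost, grad
```

An optimizer such as conjugate gradient or L-BFGS would call a function of this shape repeatedly and use both return values; no learning rate appears in the function itself.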
+===== Multiclass classification =====
+One-vs-all (one-vs-rest) classification
+E.g., turn a 3-class problem into 3 binary problems, one classifier per class: $h_\theta^{(i)}(x) = P(y=i|x;\theta), \quad i=1,2,3$
+Then choose the class $i$ with the largest predicted probability: $\arg\max_i h_\theta^{(i)}(x)$
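The one-vs-all decision rule above can be sketched as follows. The function name `predict_one_vs_all` and the dictionary of per-class parameter vectors are hypothetical choices for this illustration; training the individual binary classifiers is assumed to have happened elsewhere.

```python
import math

def sigmoid(z):
    """Logistic function."""
    return 1.0 / (1.0 + math.exp(-z))

def predict_one_vs_all(thetas, x):
    """Pick the class whose binary classifier reports the highest probability.

    thetas: dict mapping class label -> parameter vector (assumed already trained).
    x: feature vector starting with the constant 1.0 (assumed layout).
    Returns (best_label, scores), where scores maps label -> h_theta^(i)(x).
    """
    scores = {
        label: sigmoid(sum(t * v for t, v in zip(theta, x)))
        for label, theta in thetas.items()
    }
    best_label = max(scores, key=scores.get)
    return best_label, scores
```

Note that the per-class scores come from independent binary models and therefore need not sum to 1.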
+===== Addressing Overfitting =====
+  - Feature reduction
+    * Manual selection
+    * Model selection algorithm
+  - Regularization
+    * Keep all features, but reduce the magnitude of the parameters $\theta_j$.
+       * Works well when there are many features that each contribute a little to predicting $y$.
+==== Regularization ====
+$\min_\theta \dots + 1000 \theta_3^2 + 1000 \theta_4^2$
+Small parameter values lead to a "simpler" hypothesis.
+$\min_\theta \dots + \lambda \sum_{j=1}^n \theta_j^2$
+=== L2 Regularization ===
+For large $\lambda$, $W^{[l]} \to 0$
+$J(W^{[l]},b^{[l]})= \frac{1}{m} \sum_{i=1}^m \mathcal{L}(\hat{y}^{(i)}, y^{(i)}) + \frac{\lambda}{2m} \sum_{l=1}^L \| W^{[l]} \|_F^2$
+This results in a **simpler** network: each hidden unit has a **smaller effect**.
+Another effect: when $W$ is small, $z$ has a smaller range, so the resulting activation (e.g. tanh) operates in its nearly linear region.
+=== Dropout ===
+=== Gradient descent (Linear Regression) ===
+$\dots + \frac{\lambda}{m} \theta_j$
+$\theta_j := \theta_j (1-\alpha \frac{\lambda}{m}) - \alpha \frac{1}{m} \sum^m_{i=1} (h_\theta(x^{(i)}) - y^{(i)}) x_j^{(i)}$
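The update rule above can be sketched as a single gradient descent step. The function name `regularized_step` and the list-based data layout are assumptions for this illustration, as is the choice to leave $\theta_0$ unshrunk (a common convention, consistent with the zero in the top-left entry of the normal equation matrix).

```python
def regularized_step(theta, X, y, alpha, lam):
    """One regularized gradient descent step for linear regression.

    theta: parameter list (theta[0] is the intercept).
    X: list of rows, each starting with the constant 1.0 (assumed layout).
    y: list of targets; alpha: learning rate; lam: regularization strength.
    """
    m = len(y)
    # Prediction errors h_theta(x^(i)) - y^(i) for all examples.
    errors = [
        sum(t * x for t, x in zip(theta, x_i)) - y_i
        for x_i, y_i in zip(X, y)
    ]
    new_theta = []
    for j in range(len(theta)):
        grad_j = sum(e * x_i[j] for e, x_i in zip(errors, X)) / m
        # Assumption: the intercept theta_0 is not shrunk.
        shrink = 1.0 if j == 0 else 1.0 - alpha * lam / m
        new_theta.append(theta[j] * shrink - alpha * grad_j)
    return new_theta
```

Even when the data are fit perfectly (all errors zero), every $\theta_j$ with $j \geq 1$ still shrinks by the factor $1-\alpha \frac{\lambda}{m}$ on each step.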
+=== Normal equation (Linear Regression) ===
+$$\theta = (X^T X + \lambda
+\begin{bmatrix}
+0 & \dots & \dots & 0 \\
+\vdots & 1 & 0 & \vdots \\
+\vdots & 0 & \ddots & 0 \\
+0 & \dots & 0 & 1
+\end{bmatrix})^{-1} X^T y$$
+=== Gradient descent (Logistic Regression) ===
+Distinguish between $\theta_0$ and $\theta_j$ ($j \geq 1$): the intercept $\theta_0$ is not regularized.
+For $\theta_j$: $\dots + \frac{\lambda}{m} \theta_j$
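The distinction between $\theta_0$ and $\theta_j$ can be sketched as follows. The function name `regularized_logistic_gradient` and the data layout are assumptions for this illustration.

```python
import math

def regularized_logistic_gradient(theta, X, y, lam):
    """Gradient of the L2-regularized logistic regression cost.

    theta: parameter list (theta[0] is the intercept).
    X: list of rows, each starting with the constant 1.0 (assumed layout).
    y: list of 0/1 labels; lam: regularization strength.
    """
    m = len(y)
    grad = [0.0] * len(theta)
    for x_i, y_i in zip(X, y):
        z = sum(t * x for t, x in zip(theta, x_i))
        h = 1.0 / (1.0 + math.exp(-z))
        for j, x_ij in enumerate(x_i):
            grad[j] += (h - y_i) * x_ij / m
    # The penalty term (lambda / m) * theta_j applies only for j >= 1.
    for j in range(1, len(theta)):
        grad[j] += lam / m * theta[j]
    return grad
```

Comparing the result for `lam = 0` with a positive `lam` shows that only the entries for $j \geq 1$ change.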