--- data_mining:neural_network:short_overview [2018/04/21 23:50] – [Cost function] phreazer
+++ data_mining:neural_network:short_overview [2018/05/10 17:32] (current) – [Cost function] phreazer
@@ Line 73: / Line 73: @@
 ===== Cost function =====
-$s_l$: Anzahl der Einheiten (ohne Bias Unit) pro Layer l
+Notation
-$L$: Gesamtzahl an Layer
+  * $s_l$: Number of units (excluding the bias unit) in layer $l$
+  * $L$: Total number of layers in the network
+  * For binary classification: $s_L = 1; K=1$
+  * For multi-class classification with $K$ classes: $s_L = K$
-Für binäre Klassifikation: $S_L = 1; K=1$
+The cost function of a neural net generalizes the cost function of logistic regression.
-Für m Klassifikation: $S_L = K;$
-Verallgemeinerung der Kostenfunktion für logistische Regression.
+Cost function of logistic regression:
-**Kostenfunktion eines Neuronalen Netzes:**
+$J(\theta) = - \frac{1}{m} \sum_{i=1}^m \left[ y^{(i)} \log(h_\theta(x^{(i)})) + (1-y^{(i)}) \log(1 - h_\theta(x^{(i)})) \right]$
-$J(\theta) = - \frac{1}{m} [\sum_{i=1}^m \sum_{k=1}^K y_k^{(i)} log(h_\theta(x^{(i)}))_k + (1-y_k^{(i)}) log(1-(h_\theta(x^{(i)}))_k)] + \frac{\lambda}{2m} \sum_{l=1}^{L-1} \sum_{i=1}^{s_l} \sum_{j=1}^{s_l+1} (\theta_{ji}^{(l)})^2$
+$+ \frac{\lambda}{2m} \sum_{j=1}^{n} (\theta_{j})^2$
-Erklärung:
+**Cost function of a neural net:**
-Summe über k: Anzahl Outputs.
-Summe über alle $\theta_{ji}^{(l)}$ ohne Bias Units.
-==== Backpropagation Algorithmus ====
+$J(\theta) = - \frac{1}{m} [\sum_{i=1}^m \sum_{k=1}^K y_k^{(i)} log(h_\theta(x^{(i)}))_k + (1-y_k^{(i)}) log(1-(h_\theta(x^{(i)}))_k)]$
-Wir wollen $\min_\theta J(\theta)$ und benötigen $J(\theta)$ und zugehörige partielle Ableitungen.
+$+ \frac{\lambda}{2m} \sum_{l=1}^{L-1} \sum_{i=1}^{s_l} \sum_{j=1}^{s_{l+1}} (\theta_{ji}^{(l)})^2$
+Explanation:
+  * Sum over $k$: runs over the $K$ output units
+  * Sum over all $\theta_{ji}^{(l)}$, excluding the weights of the bias units
+  * Regularization term: squared Frobenius norm of the weight matrices, also called //weight decay//
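The regularized cost can be sketched in NumPy as follows. This is only an illustration under assumptions that are not part of the page: the function name `nn_cost`, one-hot labels, and a weight layout in which column 0 of each $\theta^{(l)}$ holds the bias weights.

```python
import numpy as np

def nn_cost(h, y, thetas, lam):
    """Regularized cross-entropy cost of a neural net (illustrative sketch).

    h:      (m, K) array of network outputs h_theta(x^(i))_k, each in (0, 1)
    y:      (m, K) array of one-hot labels y_k^(i)
    thetas: list of weight matrices theta^(l) with shape (s_{l+1}, s_l + 1);
            column 0 is assumed to hold the bias weights
    lam:    regularization strength lambda
    """
    m = y.shape[0]
    # Double sum over examples i and output units k.
    data_term = -np.sum(y * np.log(h) + (1 - y) * np.log(1 - h)) / m
    # Squared Frobenius norm of every weight matrix, skipping the bias column.
    reg_term = lam / (2 * m) * sum(np.sum(t[:, 1:] ** 2) for t in thetas)
    return data_term + reg_term
```

Setting `lam = 0` leaves only the data term, which for $K=1$ reduces to the logistic regression cost.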
+===== Backpropagation Algorithm =====
+Goal: $\min_\theta J(\theta)$. This requires:
+  * $J(\theta)$
+  * The partial derivatives $\frac{\partial}{\partial \theta_{ji}^{(l)}} J(\theta)$
 Forward propagation:
@@ Line 98: / Line 109: @@
 a^{(1)} = x \\
 z^{(2)} = \theta^{(1)} a^{(1)} \\
-a^{(2)} = g(z^{(2)}) \text{ füge } a_0^{(2)} \text{hinzu} \\
+a^{(2)} = g(z^{(2)}) \quad \text{(add } a_0^{(2)} \text{)} \\
 \dots
 $$
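A minimal NumPy sketch of forward propagation for a single example. It assumes a sigmoid activation $g$, a bias unit fixed to 1 that is prepended to every layer except the output, and weight matrices of shape $(s_{l+1}, s_l + 1)$; the names `sigmoid` and `forward` are illustrative.

```python
import numpy as np

def sigmoid(z):
    """Assumed activation function g."""
    return 1.0 / (1.0 + np.exp(-z))

def forward(x, thetas):
    """Return the activations a^(l) and pre-activations z^(l) for one example x."""
    a = np.concatenate(([1.0], x))            # a^(1) = x, with bias unit a_0^(1) = 1
    activations, zs = [a], []
    for idx, theta in enumerate(thetas):
        z = theta @ a                          # z^(l+1) = theta^(l) a^(l)
        zs.append(z)
        a = sigmoid(z)                         # a^(l+1) = g(z^(l+1))
        if idx < len(thetas) - 1:
            a = np.concatenate(([1.0], a))     # add a_0 to hidden layers only
        activations.append(a)
    return activations, zs
```

The last entry of `activations` is the network output $h_\theta(x)$.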
-$\delta_j^{(l)}$: Fehler von Knoten j in Layer l.
+==== Calculation of prediction errors ====
-Für jede Outputeinheit (Layer L=4)
+$\delta_j^{(l)}$: Error of unit $j$ in layer $l$.
+For each output unit (layer $L=4$):
 $\delta_j^{(4)} = a_j^{(4)} - y_j$
-Vektorisiert: $\delta^{(4)} = a^{(4)} - y$
+Vectorized: $\delta^{(4)} = a^{(4)} - y$
+$.*$ is element-wise multiplication
 $$\delta^{(3)} = (\theta^{(3)})^T\delta^{(4)}.*g'(z^{(3)}) \\
@@ Line 118: / Line 133: @@
 Algorithm
-$$\Delta_{ij}^{(l)} = 0 \text{für alle i,j,l} \\
+$$\text{Set } \Delta_{ij}^{(l)} = 0 \text{ for all i,j,l} \\
 \text{For i=1 to m:} \\
-\text{Set} a^{(1)} = x^{(i)}$$
+\text{Set } a^{(1)} = x^{(i)}$$
 Forward propagation to compute $a^{(l)}$ for $l=2,3,\dots,L$
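The loop above can be sketched in NumPy as follows. The steps after forward propagation are not visible in this excerpt, so the accumulation $\Delta^{(l)} := \Delta^{(l)} + \delta^{(l+1)} (a^{(l)})^T$ and the final division by $m$ are taken from the standard backpropagation algorithm and should be read as assumptions. The sketch also assumes a sigmoid activation, for which $g'(z) = a .* (1 - a)$, and omits the regularization term. The name `backprop_gradients` is illustrative.

```python
import numpy as np

def sigmoid(z):
    """Assumed activation function g."""
    return 1.0 / (1.0 + np.exp(-z))

def backprop_gradients(X, Y, thetas):
    """Accumulate Delta^(l) over all m examples and return Delta^(l) / m."""
    m = X.shape[0]
    Deltas = [np.zeros_like(t) for t in thetas]     # Set Delta = 0 for all i, j, l
    for i in range(m):                               # For i = 1 to m
        a = np.concatenate(([1.0], X[i]))            # a^(1) = x^(i), plus bias unit
        activations = [a]
        for idx, theta in enumerate(thetas):         # forward propagation
            a = sigmoid(theta @ a)
            if idx < len(thetas) - 1:
                a = np.concatenate(([1.0], a))       # bias unit for hidden layers
            activations.append(a)
        delta = activations[-1] - Y[i]               # delta^(L) = a^(L) - y^(i)
        for l in range(len(thetas) - 1, -1, -1):     # walk backwards through layers
            Deltas[l] += np.outer(delta, activations[l])
            if l > 0:
                a_l = activations[l]
                # (theta^(l))^T delta .* g'(z), with g'(z) = a .* (1 - a)
                delta = (thetas[l].T @ delta) * a_l * (1 - a_l)
                delta = delta[1:]                    # drop the bias component
    return [D / m for D in Deltas]
```

A common sanity check is to compare a few entries of the result against a finite-difference approximation of $J(\theta)$.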