
Neural Network

Motivation

Non-linear Classification:

Problem when using logistic regression: including polynomial (e.g. quadratic) feature terms leads to far too many features.

Basic structure

Inputs:

Activation functions:

Layers:

Neuron activation unit:

Weight matrix between layers:

Example activation of unit 1 in layer 2:

$a_1^{(2)} = g(\theta_{10}^{(1)} x_0 + \theta_{11}^{(1)} x_1 + \theta_{12}^{(1)} x_2 + \theta_{13}^{(1)} x_3)$, and analogously for $a_2^{(2)}$, $a_3^{(2)}$, etc.

Hypothesis (output of activation $a_1^{(3)}$ in a NN with 3 layers):

If the network has $s_j$ units in layer j, then $\theta^{(j)}$ is of dimension $s_{j+1} \times (s_j+1)$
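
As a quick sanity check on this dimension rule, here is a minimal NumPy sketch; the layer sizes 3 and 5 are assumed purely for illustration:

```python
import numpy as np

# Assumed layer sizes for illustration: s_1 = 3 input units, s_2 = 5 hidden units.
s1, s2 = 3, 5

# Theta^(1) maps layer 1 (plus its bias unit) to layer 2, so it is s_2 x (s_1 + 1).
theta1 = np.zeros((s2, s1 + 1))
print(theta1.shape)  # (5, 4)
```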

Vectorized version of forward propagation

$$z^{(2)} = \theta^{(1)} a^{(1)}\\ a^{(2)} = g(z^{(2)})\\ a_0^{(2)} = 1 \dots $$
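
A minimal NumPy sketch of this vectorized forward pass; the helper names `sigmoid` and `forward` are my own, not from the text:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def forward(x, thetas):
    """Forward propagation; thetas = [Theta^(1), ..., Theta^(L-1)]."""
    a = x                               # a^(1) = x (bias not yet added)
    for theta in thetas:
        a = np.concatenate(([1.0], a))  # add bias unit a_0 = 1
        z = theta @ a                   # z^(l+1) = Theta^(l) a^(l)
        a = sigmoid(z)                  # a^(l+1) = g(z^(l+1))
    return a                            # h_theta(x) = a^(L)
```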

Functions

Examples of how NNs can learn the AND, OR, XOR, and XNOR functions.

$y = x_1 \text{ AND } x_2$

With parameters: $h_\theta(x) = g(-30 + 20 x_1 + 20 x_2)$; note that $g(z) \approx 0$ for $z \ll 0$ and $g(z) \approx 1$ for $z \gg 0$.
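
A small sketch of this AND unit with the weights $(-30, 20, 20)$ given above; the function name `h_and` is hypothetical:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def h_and(x1, x2):
    # Weights from the text: theta = (-30, 20, 20).
    return sigmoid(-30 + 20 * x1 + 20 * x2)

for x1 in (0, 1):
    for x2 in (0, 1):
        print(x1, x2, int(round(h_and(x1, x2))))  # outputs 1 only for x1 = x2 = 1
```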

$$y = \text{NOT } x_1: \; g(10 - 20 x_1) \\ y = (\text{NOT } x_1) \text{ AND } (\text{NOT } x_2): \; g(10 - 20 x_1 - 20 x_2)$$

$y = x_1 \text{ XNOR } x_2$ can be built from $x_1 \text{ AND } x_2$, $(\text{NOT } x_1) \text{ AND } (\text{NOT } x_2)$, and $x_1 \text{ OR } x_2$.
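
A sketch of this XNOR composition. The AND and $(\text{NOT } x_1) \text{ AND } (\text{NOT } x_2)$ weights are the ones given above; the OR weights $(-10, 20, 20)$ are an assumption, since the text only names the OR unit:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def h_xnor(x1, x2):
    a1 = np.array([1.0, x1, x2])                       # input layer with bias unit
    and_unit = sigmoid(np.array([-30, 20, 20]) @ a1)   # x1 AND x2
    nor_unit = sigmoid(np.array([10, -20, -20]) @ a1)  # (NOT x1) AND (NOT x2)
    a2 = np.array([1.0, and_unit, nor_unit])           # hidden layer with bias unit
    return sigmoid(np.array([-10, 20, 20]) @ a2)       # OR unit (assumed weights)

for x1 in (0, 1):
    for x2 in (0, 1):
        print(x1, x2, int(round(h_xnor(x1, x2))))      # 1 exactly when x1 == x2
```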

Multi-Class classification

For 4 classes, use 4 output units, one per class:

$$h_\theta(x) \approx [1,0,0,0]^T$$ $$h_\theta(x) \approx [0,1,0,0]^T$$ $$h_\theta(x) \approx [0,0,1,0]^T$$ $$h_\theta(x) \approx [0,0,0,1]^T$$

$y^{(i)}$ is now a 4-dimensional vector.
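
For example, the labels can be turned into such vectors by one-hot encoding; a minimal sketch, where the label values 0–3 are an assumed encoding:

```python
import numpy as np

y = np.array([2, 0, 3, 1])  # assumed example labels for K = 4 classes
K = 4

# Each y^(i) becomes a K-dimensional vector with a single 1.
Y = np.eye(K)[y]
print(Y[0])  # [0. 0. 1. 0.]
```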

Cost function

Notation

Generalization of the logistic regression cost function.

Cost function of logistic regression:

$J(\theta) = - \frac{1}{m} \sum_{i=1}^m \left[ y^{(i)} \log(h_\theta(x^{(i)})) + (1-y^{(i)}) \log(1 - h_\theta(x^{(i)})) \right] + \frac{\lambda}{2m} \sum_{j=1}^{n} \theta_{j}^2$

Cost function of a neural net:

$J(\theta) = - \frac{1}{m} \left[ \sum_{i=1}^m \sum_{k=1}^K y_k^{(i)} \log(h_\theta(x^{(i)}))_k + (1-y_k^{(i)}) \log(1-(h_\theta(x^{(i)}))_k) \right] + \frac{\lambda}{2m} \sum_{l=1}^{L-1} \sum_{i=1}^{s_l} \sum_{j=1}^{s_{l+1}} (\theta_{ji}^{(l)})^2$
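
A minimal NumPy sketch of this cost, assuming the network outputs and one-hot labels are collected in $m \times K$ matrices `H` and `Y`; the function name `nn_cost` and the argument layout are my own:

```python
import numpy as np

def nn_cost(thetas, H, Y, lam):
    """H: m x K outputs (h_theta(x^(i)))_k, Y: m x K one-hot labels y_k^(i),
    thetas: list of Theta^(l) whose first column is the bias column."""
    m = Y.shape[0]
    # Unregularized part: double sum over examples i and classes k.
    J = -np.sum(Y * np.log(H) + (1 - Y) * np.log(1 - H)) / m
    # Regularization: all weights except the bias column of each Theta^(l).
    J += lam / (2 * m) * sum(np.sum(theta[:, 1:] ** 2) for theta in thetas)
    return J
```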

Explanation:

Backpropagation Algorithm

The goal is $\min_\theta J(\theta)$; the required parts:

Forward propagation:

$$ a^{(1)} = x \\ z^{(2)} = \theta^{(1)} a^{(1)} \\ a^{(2)} = g(z^{(2)}) \text{ add } a_0^{(2)} \\ \dots $$

Calculation of prediction errors

$\delta_j^{(l)}$: Error of unit $j$ in layer $l$.

For each output unit (layer $L = 4$):

$\delta_j^{(4)} = a_j^{(4)} - y_j$

Vectorized: $\delta^{(4)} = a^{(4)} - y$

$.*$ is element-wise multiplication

$$\delta^{(3)} = (\theta^{(3)})^T\delta^{(4)} .* g'(z^{(3)}) \\ g'(z^{(3)}) = a^{(3)} .* (1-a^{(3)})$$

$$\delta^{(2)} = (\theta^{(2)})^T\delta^{(3)} .* g'(z^{(2)}) \\ g'(z^{(2)}) = a^{(2)} .* (1-a^{(2)})$$

Algorithm

$$\text{Set } \Delta_{ij}^{(l)} = 0 \text{ for all } i, j, l \\ \text{For } i = 1 \text{ to } m: \\ \quad \text{Set } a^{(1)} = x^{(i)}$$

Perform forward propagation to compute $a^{(l)}$ for $l = 2, 3, \dots, L$; use $y^{(i)}$ to compute $\delta^{(L)} = a^{(L)} - y^{(i)}$, then backpropagate the errors down to $\delta^{(2)}$ and accumulate $\Delta_{ij}^{(l)} := \Delta_{ij}^{(l)} + a_j^{(l)} \delta_i^{(l+1)}$.

Vectorized: $\Delta^{(l)} := \Delta^{(l)} + \delta^{(l+1)} (a^{(l)})^T$

Derivatives $D$…
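
A sketch of the accumulation loop above in NumPy, assuming sigmoid activations throughout; the function name `backprop` and the data layout (X as an $m \times n$ matrix, Y one-hot) are assumptions:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def backprop(thetas, X, Y):
    """Accumulate Delta^(l) over all m examples; returns the list of Delta
    matrices (not yet averaged over m or regularized)."""
    m = X.shape[0]
    Delta = [np.zeros_like(t) for t in thetas]
    for i in range(m):
        # Forward propagation, storing every activation a^(l) (with bias unit).
        a = [np.concatenate(([1.0], X[i]))]
        for theta in thetas:
            a.append(np.concatenate(([1.0], sigmoid(theta @ a[-1]))))
        a[-1] = a[-1][1:]                  # the output layer has no bias unit
        # delta^(L) = a^(L) - y^(i)
        d = a[-1] - Y[i]
        Delta[-1] += np.outer(d, a[-2])
        # Backpropagate down to delta^(2); g'(z) = a .* (1 - a).
        for l in range(len(thetas) - 2, -1, -1):
            d = (thetas[l + 1].T @ d)[1:] * a[l + 1][1:] * (1 - a[l + 1][1:])
            Delta[l] += np.outer(d, a[l])
    return Delta
```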

Gradient checking

Errors in backprop are hard to trace, so approximate the gradients numerically and compare.

$\frac{\partial}{\partial \theta} J(\theta) \approx \frac{J(\theta + \epsilon) - J(\theta - \epsilon)}{2 \epsilon}$
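
A minimal sketch of gradient checking with this two-sided difference; `J` is assumed to be a cost function of the unrolled parameter vector:

```python
import numpy as np

def numerical_gradient(J, theta, eps=1e-4):
    """Two-sided approximation of dJ/dtheta_p for every parameter p."""
    grad = np.zeros_like(theta)
    for p in range(theta.size):
        e = np.zeros_like(theta)
        e[p] = eps
        grad[p] = (J(theta + e) - J(theta - e)) / (2 * eps)
    return grad

# Compare against the backprop gradient, e.g.
# np.allclose(numerical_gradient(J, theta), backprop_grad, atol=1e-6)
```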

Random Initialization

Setting initialTheta.

If all initialTheta = 0, then all weights are identical and $a_1^{(2)} = a_2^{(2)}$ (the symmetry problem of identical weights).

Initialize each $\theta^{(l)}_{ij}$ with a random number in $[-\epsilon, \epsilon]$.
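
A minimal sketch of this initialization; the value $\epsilon = 0.12$ is an assumed choice, the text only requires a small symmetric interval:

```python
import numpy as np

def random_init(s_in, s_out, eps=0.12):
    """Theta^(l) has shape s_out x (s_in + 1); draw every entry from [-eps, eps]."""
    return np.random.uniform(-eps, eps, size=(s_out, s_in + 1))

theta1 = random_init(3, 5)  # e.g. 3 input units, 5 hidden units -> 5 x 4 matrix
```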

Architecture decisions