Neural Network
Motivation
Non-linear Classification:
Problem when using logistic regression: a polynomial or quadratic feature expansion quickly leads to too many features.
Basic structure
Inputs:
- $x_0 = 1$: Bias unit
- $x_1, \dots, x_n$: Input units
Activation functions:
- Sigmoid activation function $g(z) = \frac{1}{1+e^{-z}}$
Layers:
- Input Layer → Hidden Layer → Output Layer
Neuron activation unit:
- $a_i^{(j)}$: Activation of unit $i$ in layer $j$
Weight matrix between layers:
- $\theta^{(j)}$: Weight matrix controlling the mapping from layer $j$ to layer $j+1$
Example activation of unit 1 in layer 2:
$a_1^{(2)} = g(\theta_{10}^{(1)} x_0 + \theta_{11}^{(1)} x_1 + \theta_{12}^{(1)} x_2 + \theta_{13}^{(1)} x_3)$, etc.
Hypothesis (output activation $a_1^{(3)}$ of a NN with 3 layers):
- $h_\theta(x)=a_1^{(3)} = g(\theta_{10}^{(2)} a_0^{(2)} + \theta_{11}^{(2)} a_1^{(2)} + \theta_{12}^{(2)} a_2^{(2)} + \theta_{13}^{(2)} a_3^{(2)})$
If the network has $s_j$ units in layer j, then $\theta^{(j)}$ is of dimension $s_{j+1} \times (s_j+1)$
Vectorized version of forward propagation
$$z^{(2)} = \theta^{(1)} a^{(1)}\\ a^{(2)} = g(z^{(2)})\\ a_0^{(2)} = 1 \dots $$
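A minimal NumPy sketch of this vectorized forward pass for a 3-layer network; the layer sizes, the input, and the random weights are illustrative assumptions, not values from the notes.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Assumed layer sizes: 3 inputs, 5 hidden units, 1 output
rng = np.random.default_rng(0)
theta1 = rng.uniform(-0.12, 0.12, size=(5, 3 + 1))  # maps layer 1 -> 2, shape s2 x (s1+1)
theta2 = rng.uniform(-0.12, 0.12, size=(1, 5 + 1))  # maps layer 2 -> 3, shape s3 x (s2+1)

x = np.array([0.5, -1.0, 2.0])              # one training example (assumed)
a1 = np.concatenate(([1.0], x))             # add bias unit a0^(1) = 1
z2 = theta1 @ a1
a2 = np.concatenate(([1.0], sigmoid(z2)))   # add bias unit a0^(2) = 1
z3 = theta2 @ a2
h = sigmoid(z3)                             # h_theta(x) = a^(3)
print(h)
```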
Functions
Examples of how NNs can learn the AND, OR, XOR, and XNOR functions.
$y = x_1 \text{ AND } x_2$
With parameters: $h_\theta(x) = g(-30 + 20 x_1 + 20 x_2)$; note that $g(z) \approx 0$ for $z \ll 0$ and $g(z) \approx 1$ for $z \gg 0$.
$$y = \text{NOT } x_1: \quad g(10 - 20 x_1) \\ y = (\text{NOT } x_1) \text{ AND } (\text{NOT } x_2): \quad g(10 - 20 x_1 - 20 x_2)$$
$$y = x_1 \text{ XNOR } x_2$$ built from $x_1 \text{ AND } x_2$ and $(\text{NOT } x_1) \text{ AND } (\text{NOT } x_2)$ in the hidden layer, combined with $x_1 \text{ OR } x_2$ in the output layer.
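A small Python sketch of the XNOR construction from these gates. The AND and NOR weights come from the notes; the OR weights $g(-10 + 20 x_1 + 20 x_2)$ are not given above and are assumed here.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def and_gate(x1, x2):
    return sigmoid(-30 + 20 * x1 + 20 * x2)    # weights from the notes

def nor_gate(x1, x2):                          # (NOT x1) AND (NOT x2)
    return sigmoid(10 - 20 * x1 - 20 * x2)     # weights from the notes

def or_gate(x1, x2):
    return sigmoid(-10 + 20 * x1 + 20 * x2)    # assumed standard OR weights

def xnor_gate(x1, x2):
    # Hidden layer: AND and NOR units; output layer: OR of the two activations
    return or_gate(and_gate(x1, x2), nor_gate(x1, x2))

for x1 in (0, 1):
    for x2 in (0, 1):
        print(x1, x2, round(xnor_gate(x1, x2)))   # 1 when x1 == x2, else 0
```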
Multi-Class classification
One output unit per class, here 4 output units:
$$h_\theta(x) \approx [1,0,0,0]^T, \quad [0,1,0,0]^T, \quad [0,0,1,0]^T, \quad [0,0,0,1]^T$$
$y^{(i)}$ is now a 4-dimensional (one-hot) vector.
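A tiny NumPy sketch of building such one-hot target vectors; the example labels are assumed.

```python
import numpy as np

K = 4
y = np.array([2, 0, 3, 1])   # assumed example labels, classes 0..3
Y = np.eye(K)[y]             # each row is the one-hot vector y^(i)
print(Y)
```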
Cost function
Notation
- $s_l$: Number of units in layer $l$, not counting the bias unit
- $L$: Total number of layers in the network
- For binary classification: $s_L = 1$, $K = 1$
- For multi-class classification with $K$ classes: $s_L = K$
Generalization of cost function of logistic regression.
Cost function of logistic regression:
$J(\theta) = - \frac{1}{m} \sum_{i=1}^m \left[ y^{(i)} \log(h_\theta(x^{(i)})) + (1-y^{(i)}) \log(1 - h_\theta(x^{(i)})) \right]$
$+ \frac{\lambda}{2m} \sum_{j=1}^{n} \theta_j^2$
Cost function of a neural net:
$J(\theta) = - \frac{1}{m} \left[\sum_{i=1}^m \sum_{k=1}^K y_k^{(i)} \log(h_\theta(x^{(i)}))_k + (1-y_k^{(i)}) \log(1-(h_\theta(x^{(i)}))_k)\right]$
$+ \frac{\lambda}{2m} \sum_{l=1}^{L-1} \sum_{i=1}^{s_l} \sum_{j=1}^{s_{l+1}} (\theta_{ji}^{(l)})^2$
Explanation:
- Sum over $k$: runs over the $K$ output units
- Sum over all $\theta_{ji}^{(l)}$, excluding the bias-unit weights
- Squared Frobenius norm of the weight matrices as regularization, also called weight decay (see the sketch below)
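A NumPy sketch of this cost, assuming the outputs have already been computed by forward propagation; the names `H`, `Y`, `thetas`, and `lam` are illustrative.

```python
import numpy as np

def nn_cost(H, Y, thetas, lam):
    """Cross-entropy cost with L2 regularization.

    H      : m x K matrix of network outputs h_theta(x^(i))_k
    Y      : m x K matrix of one-hot targets y^(i)
    thetas : list of weight matrices theta^(l), each of shape s_{l+1} x (s_l + 1)
    lam    : regularization strength lambda
    """
    m = Y.shape[0]
    cost = -np.sum(Y * np.log(H) + (1 - Y) * np.log(1 - H)) / m
    # Regularization: sum of squared weights, skipping the bias column (j = 0)
    reg = sum(np.sum(theta[:, 1:] ** 2) for theta in thetas)
    return cost + lam / (2 * m) * reg
```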
Backpropagation Algorithm
Goal is $\min_\theta J(\theta)$, needed parts:
- $J(\theta)$
- Partial derivatives
Forward propagation:
$$ a^{(1)} = x \\ z^{(2)} = \theta^{(1)} a^{(1)} \\ a^{(2)} = g(z^{(2)}) \text{ add } a_0^{(2)} \\ \dots $$
Calculation of the prediction errors
$\delta_j^{(l)}$: Error of unit $j$ in layer $l$.
For each output unit (here layer $L=4$):
$\delta_j^{(4)} = a_j^{(4)} - y_j$
Vectorized: $\delta^{(4)} = a^{(4)} - y$
$.*$ denotes element-wise multiplication
$$\delta^{(3)} = (\theta^{(3)})^T\delta^{(4)} .* g'(z^{(3)}), \qquad g'(z^{(3)}) = a^{(3)} .* (1-a^{(3)})$$
$$\delta^{(2)} = (\theta^{(2)})^T\delta^{(3)} .* g'(z^{(2)}), \qquad g'(z^{(2)}) = a^{(2)} .* (1-a^{(2)})$$
Algorithm
$$\text{Set } \Delta_{ij}^{(l)} = 0 \text{ for all i,j,l} \\ \text{For i=1 to m:} \\ \text{Set } a^{(1)} = x^{(i)}$$
Run forward propagation to compute $a^{(l)}$ for $l=2,3,\dots,L$; use $y^{(i)}$ to compute $\delta^{(L)} = a^{(L)}-y^{(i)}$ and propagate the errors back down to $\delta^{(2)}$; accumulate $\Delta_{ij}^{(l)} := \Delta_{ij}^{(l)} + a_j^{(l)} \delta_i^{(l+1)}$
Vectorized: $\Delta^{(l)} := \Delta^{(l)} + \delta^{(l+1)} (a^{(l)})^T$
Derivatives $D$ …
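A sketch of the full accumulation loop for a 3-layer network (the $\delta$ formulas above use a 4-layer example, but the recursion is the same). Function and variable names are illustrative; the final step forming $D$ from $\Delta$ (averaging plus $\frac{\lambda}{m}\theta$ on the non-bias weights) is the standard formula that the notes only hint at with "Derivatives $D$ …".

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def backprop(theta1, theta2, X, Y, lam):
    """Gradients of the regularized cost for a 3-layer network (input, one hidden, output).

    X : m x n inputs, Y : m x K one-hot targets; theta shapes follow s_{l+1} x (s_l + 1).
    """
    m = X.shape[0]
    Delta1 = np.zeros_like(theta1)
    Delta2 = np.zeros_like(theta2)
    for i in range(m):
        # Forward propagation for example i
        a1 = np.concatenate(([1.0], X[i]))
        z2 = theta1 @ a1
        a2 = np.concatenate(([1.0], sigmoid(z2)))
        a3 = sigmoid(theta2 @ a2)
        # Errors: output layer, then hidden layer (no delta for the input layer)
        d3 = a3 - Y[i]
        d2 = (theta2.T @ d3)[1:] * a2[1:] * (1 - a2[1:])   # drop the bias component
        # Accumulate: Delta^(l) += delta^(l+1) (a^(l))^T
        Delta2 += np.outer(d3, a2)
        Delta1 += np.outer(d2, a1)
    # Derivatives D: average, regularize all weights except the bias column
    D1, D2 = Delta1 / m, Delta2 / m
    D1[:, 1:] += lam / m * theta1[:, 1:]
    D2[:, 1:] += lam / m * theta2[:, 1:]
    return D1, D2
```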
Gradient checking
Errors in backprop are hard to trace, so approximate the gradients numerically as a check.
$\frac{\partial}{\partial \theta} J(\theta) \approx \frac{J(\theta + \epsilon) - J(\theta - \epsilon)}{2 \epsilon}$
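A sketch of this two-sided approximation, perturbing one parameter at a time; `J` stands for any function that computes the cost for a given parameter array (name assumed).

```python
import numpy as np

def numerical_gradient(J, theta, eps=1e-4):
    """Two-sided finite-difference approximation of dJ/dtheta, one component at a time."""
    theta = theta.astype(float)
    grad = np.zeros_like(theta)
    for idx in np.ndindex(theta.shape):
        t_plus = theta.copy();  t_plus[idx] += eps
        t_minus = theta.copy(); t_minus[idx] -= eps
        grad[idx] = (J(t_plus) - J(t_minus)) / (2 * eps)
    return grad

# Usage sketch: compare against the backprop gradient, then switch the check off for training
# assert np.allclose(numerical_gradient(cost, theta), backprop_gradient, atol=1e-7)
```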
Random Initialization
Setting initialTheta.
If all entries of initialTheta are 0, then all weights stay identical and $a_1^{(2)} = a_2^{(2)}$ (symmetry problem).
Initialize each $\theta^{(l)}_{ij}$ with a random number in $[-\epsilon, \epsilon]$.
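A short sketch of such an initialization; the layer sizes and $\epsilon = 0.12$ are assumed example values.

```python
import numpy as np

def random_init(s_in, s_out, eps=0.12):
    """theta^(l) of shape s_out x (s_in + 1), entries uniform in [-eps, eps],
    breaking the symmetry that all-zero weights would cause."""
    return np.random.uniform(-eps, eps, size=(s_out, s_in + 1))

theta1 = random_init(3, 5)   # assumed layer sizes: 3 inputs -> 5 hidden units
theta2 = random_init(5, 1)   # 5 hidden units -> 1 output
```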
Architecture decisions
- 1 hidden layer
- Fewer parameters, more prone to underfitting, but cheaper to compute
- More than 1 hidden layer: each hidden layer has the same number of hidden units
- More parameters, more prone to overfitting (but controllable with regularization), and more expensive to compute
- Number of hidden layers: compare the cross-validation error for different numbers of hidden layers (see the sketch after this list).
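One possible way to run this comparison, sketched with scikit-learn (not used in the notes); the toy data, the 25 units per layer, and the regularization strength are assumed.

```python
from sklearn.datasets import make_classification
from sklearn.neural_network import MLPClassifier
from sklearn.model_selection import cross_val_score

# Assumed toy data; in practice use the actual training set
X, y = make_classification(n_samples=500, n_features=20, random_state=0)

# Same number of hidden units per layer, varying the number of hidden layers
for n_layers in (1, 2, 3):
    clf = MLPClassifier(hidden_layer_sizes=(25,) * n_layers, alpha=1e-2,
                        max_iter=2000, random_state=0)
    scores = cross_val_score(clf, X, y, cv=5)   # cross-validation accuracy
    print(n_layers, "hidden layer(s):", scores.mean())
```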