# Neural Network

Non-linear Classification:

Problem with logistic regression: adding polynomial (e.g. quadratic) terms to model non-linear decision boundaries leads to too many features.

Inputs:

• $x_0 = 1$: Bias unit
• $x_1, \dots, x_n$: Input units

Activation functions:

• Sigmoid activation function $g(z) = \frac{1}{1+e^{-z}}$
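As a small illustration, the sigmoid can be sketched in Python as follows. The function name `sigmoid` and the branch on the sign of `z` (used only to avoid floating-point overflow for very negative inputs) are choices made for this sketch and are not part of the notes.

```python
import math


def sigmoid(z: float) -> float:
    """Logistic sigmoid g(z) = 1 / (1 + e^(-z))."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    # Equivalent form e^z / (1 + e^z); avoids overflow of exp(-z) for z << 0.
    e = math.exp(z)
    return e / (1.0 + e)


print(sigmoid(0.0))  # 0.5
print(sigmoid(10.0) > 0.99, sigmoid(-10.0) < 0.01)
```

The output saturates towards 1 for large positive $z$ and towards 0 for large negative $z$, which is what the logic-gate examples later on this page rely on.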

Layers:

• Input Layer → Hidden Layer → Output Layer

Neuron activation unit:

• $a_i^{(j)}$: Activation of unit $i$ in layer $j$

Weight matrix between layers:

• $\theta^{(j)}$: Weight matrix that controls the mapping from layer $j$ to layer $j+1$

Example activation of unit 1 in layer 2:

$a_1^{(2)} = g(\theta_{10}^{(1)} x_0 + \theta_{11}^{(1)} x_1 + \theta_{12}^{(1)} x_2 + \theta_{13}^{(1)} x_3)$ etc.

Hypothesis (output of activation $a_1^{(3)}$ in a NN with 3 layers):

• $h_\theta(x)=a_1^{(3)} = g(\theta_{10}^{(2)} a_0^{(2)} + \theta_{11}^{(2)} a_1^{(2)} + \theta_{12}^{(2)} a_2^{(2)} + \theta_{13}^{(2)} a_3^{(2)})$

If the network has $s_j$ units in layer j, then $\theta^{(j)}$ is of dimension $s_{j+1} \times (s_j+1)$

$$z^{(2)} = \theta^{(1)} a^{(1)}\\ a^{(2)} = g(z^{(2)})\\ a_0^{(2)} = 1 \dots$$
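The vectorized step above could look roughly like this in plain Python. This is only a sketch: the helper name `layer_forward`, the 3-2-1 layer sizes, and all weight values are hypothetical and chosen purely to show the $s_{j+1} \times (s_j+1)$ shape of each weight matrix.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def layer_forward(theta, a_prev):
    """Compute a^(j+1) = g(theta^(j) * [1; a^(j)]) for one layer.

    theta has s_{j+1} rows and s_j + 1 columns; column 0 multiplies the bias unit.
    """
    a_with_bias = [1.0] + list(a_prev)  # prepend the bias unit a_0 = 1
    return [sigmoid(sum(w * a for w, a in zip(row, a_with_bias))) for row in theta]


# Hypothetical network with 3 inputs, 2 hidden units and 1 output unit:
# theta1 is 2 x 4 and theta2 is 1 x 3 (made-up values).
theta1 = [[0.1, 0.2, -0.3, 0.4],
          [-0.5, 0.6, 0.7, -0.8]]
theta2 = [[0.3, -0.2, 0.5]]

x = [1.0, 0.0, 2.0]
a2 = layer_forward(theta1, x)   # hidden layer activations
h = layer_forward(theta2, a2)   # hypothesis h_theta(x) = a^(3)
print(len(a2), len(h))  # 2 1
```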

Examples of how NNs can compute the AND, OR, XOR, and XNOR functions.

$y = x_1 \text{ AND } x_2$

With parameters: $h_\theta(x) = g(-30 + 20 x_1 + 20 x_2)$, since $g(z) \approx 0$ for $z \ll 0$ and $g(z) \approx 1$ for $z \gg 0$.

$$y = \text{NOT}(x_1): \quad g(10-20x_1)\\ y = \text{NOT}(x_1) \text{ AND } \text{NOT}(x_2): \quad g(10-20x_1-20x_2)$$

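$$y = x_1 \text{ XNOR } x_2$$ is built from $x_1 \text{ AND } x_2$, $\text{NOT}(x_1) \text{ AND } \text{NOT}(x_2)$ and OR: the two AND terms are computed in the hidden layer, and the output unit combines them with OR.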
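The gate constructions can be checked with a short sketch. The AND and NOT/NOT weights are the ones given in the notes; the OR weights $(-10, 20, 20)$ are not stated here and are an assumption of this example, as are all function names.

```python
import math


def g(z):
    return 1.0 / (1.0 + math.exp(-z))


def and_unit(x1, x2):
    return g(-30 + 20 * x1 + 20 * x2)


def nor_unit(x1, x2):
    # NOT(x1) AND NOT(x2)
    return g(10 - 20 * x1 - 20 * x2)


def or_unit(x1, x2):
    # Assumed weights for OR; not given in the notes.
    return g(-10 + 20 * x1 + 20 * x2)


def xnor(x1, x2):
    # Hidden layer: the two AND-type units; output layer: OR of both.
    a1 = and_unit(x1, x2)
    a2 = nor_unit(x1, x2)
    return or_unit(a1, a2)


for x1 in (0, 1):
    for x2 in (0, 1):
        print(x1, x2, round(xnor(x1, x2)))
```

Rounding the output gives 1 when both inputs agree and 0 otherwise, which is the XNOR truth table.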

Multi-class classification with 4 classes: 4 output units, one for each class.

$$h_\theta(x) \approx [1,0,0,0]^T$$ $$h_\theta(x) \approx [0,1,0,0]^T$$ $$h_\theta(x) \approx [0,0,1,0]^T$$ $$h_\theta(x) \approx [0,0,0,1]^T$$

$y^{(i)}$ is now a 4-dimensional vector
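A minimal sketch of this encoding, assuming the classes are labelled $1, \dots, K$; the helper names `one_hot` and `predict_class` are hypothetical.

```python
def one_hot(label, num_classes):
    """Encode a class label in 1..num_classes as a 0/1 target vector y."""
    if not 1 <= label <= num_classes:
        raise ValueError("label must be in 1..num_classes")
    return [1 if k == label else 0 for k in range(1, num_classes + 1)]


def predict_class(h):
    """Pick the class whose output unit has the largest activation."""
    return max(range(len(h)), key=lambda k: h[k]) + 1


print(one_hot(3, 4))                          # [0, 0, 1, 0]
print(predict_class([0.1, 0.7, 0.1, 0.1]))    # 2
```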

Notation

• $s_l$: Number of units in layer $l$, not counting the bias unit
• $L$: Total number of layers in network
• For binary classification: $s_L = 1$, $K = 1$
• For K-class classification: $s_L = K$

The cost function of a neural net generalizes the cost function of logistic regression.

Cost function of logistic regression:

$J(\theta) = - \frac{1}{m} \sum_{i=1}^m \left[ y^{(i)} \log(h_\theta(x^{(i)})) + (1-y^{(i)}) \log(1 - h_\theta(x^{(i)})) \right]$

$+ \frac{\lambda}{2m} \sum_{j=1}^{n} (\theta_{j})^2$

Cost function of a neural net:

$J(\theta) = - \frac{1}{m} \left[ \sum_{i=1}^m \sum_{k=1}^K y_k^{(i)} \log\left((h_\theta(x^{(i)}))_k\right) + (1-y_k^{(i)}) \log\left(1-(h_\theta(x^{(i)}))_k\right) \right]$

$+ \frac{\lambda}{2m} \sum_{l=1}^{L-1} \sum_{i=1}^{s_l} \sum_{j=1}^{s_{l+1}} (\theta_{ji}^{(l)})^2$

Explanation:

• Sum over $k$: runs over the $K$ output units
• The regularization term sums over all $\theta_{ji}^{(l)}$ except the weights of the bias units ($i = 0$).
• Squared Frobenius norm of the weights for regularization, also called weight decay
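The cost function can be sketched as follows. The sketch assumes the outputs `H` (one row of $K$ values per training example, each strictly between 0 and 1) have already been computed by forward propagation; the function name `nn_cost` and the argument names are hypothetical.

```python
import math


def nn_cost(H, Y, thetas, lam):
    """Regularized cross-entropy cost J(theta).

    H, Y   : m x K lists with network outputs and 0/1 targets
    thetas : list of weight matrices; column 0 of each holds the bias weights
    lam    : regularization strength lambda
    """
    m = len(Y)
    data_term = 0.0
    for h_row, y_row in zip(H, Y):
        for h, y in zip(h_row, y_row):          # sum over the K output units
            data_term += y * math.log(h) + (1 - y) * math.log(1 - h)
    # Sum of squared weights, skipping the bias column (index 0) of every matrix.
    reg_term = sum(w ** 2 for theta in thetas for row in theta for w in row[1:])
    return -data_term / m + lam / (2 * m) * reg_term


# One training example, two outputs, a single 1 x 3 weight matrix (made-up values).
cost = nn_cost(H=[[0.5, 0.5]], Y=[[1, 0]], thetas=[[[9.0, 1.0, 2.0]]], lam=2.0)
print(cost)
```

In this toy call the bias weight 9.0 does not contribute to the penalty; only $1^2 + 2^2 = 5$ is regularized.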

The goal is $\min_\theta J(\theta)$; this requires:

• $J(\theta)$
• The partial derivatives $\frac{\partial}{\partial \theta_{ij}^{(l)}} J(\theta)$

Forward propagation:

$$a^{(1)} = x \\ z^{(2)} = \theta^{(1)} a^{(1)} \\ a^{(2)} = g(z^{(2)}) \text{ add } a_0^{(2)} \\ \dots$$

$\delta_j^{(l)}$: Error of unit $j$ in layer $l$.

For each output unit (layer L=4)

$\delta_j^{(4)} = a_j^{(4)} - y_j$

Vectorized: $\delta^{(4)} = a^{(4)} - y$

$.*$ is element-wise multiplication

$$\delta^{(3)} = (\theta^{(3)})^T\delta^{(4)}.*g'(z^{(3)}) \\ g'(z^{(3)}) = a^{(3)}.*(1-a^{(3)})$$

$$\delta^{(2)} = (\theta^{(2)})^T\delta^{(3)}.*g'(z^{(2)}) \\ g'(z^{(2)}) = a^{(2)}.*(1-a^{(2)})$$

Algorithm

Set $\Delta_{ij}^{(l)} = 0$ for all $i, j, l$.

For $i = 1$ to $m$:

• Set $a^{(1)} = x^{(i)}$
• Forward propagation to compute $a^{(l)}$ for $l = 2, 3, \dots, L$
• Use $y^{(i)}$ to compute $\delta^{(L)} = a^{(L)} - y^{(i)}$, then backpropagate down to $\delta^{(2)}$
• $\Delta_{ij}^{(l)} := \Delta_{ij}^{(l)} + a_j^{(l)} \delta_i^{(l+1)}$

Vectorized: $\Delta^{(l)} := \Delta^{(l)} + \delta^{(l+1)} (a^{(l)})^T$
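The accumulation loop can be sketched in plain Python as below. This is an illustration under assumptions: the function names `forward` and `accumulate_deltas` are hypothetical, sigmoid units are used in every layer, and the sketch stops at the accumulators $\Delta^{(l)}$ without deriving the final derivatives from them.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def forward(thetas, x):
    """Return [a^(1), ..., a^(L)]; all but the last include the bias unit at index 0."""
    activations = []
    a = list(x)
    for theta in thetas:
        a = [1.0] + a                      # add bias unit a_0 = 1
        activations.append(a)
        a = [sigmoid(sum(w * v for w, v in zip(row, a))) for row in theta]
    activations.append(a)                  # output layer, no bias unit
    return activations


def accumulate_deltas(thetas, X, Y):
    """Accumulate Delta^(l) += delta^(l+1) (a^(l))^T over all training examples."""
    Delta = [[[0.0] * len(row) for row in theta] for theta in thetas]
    for x, y in zip(X, Y):
        acts = forward(thetas, x)
        delta = [a - t for a, t in zip(acts[-1], y)]       # delta^(L) = a^(L) - y
        for l in range(len(thetas) - 1, -1, -1):
            a = acts[l]                                    # input to thetas[l], with bias
            for i, d in enumerate(delta):
                for j, v in enumerate(a):
                    Delta[l][i][j] += d * v
            if l > 0:
                # (theta^T delta) .* a .* (1 - a), then drop the bias component.
                back = [sum(thetas[l][i][j] * delta[i] for i in range(len(delta)))
                        for j in range(len(a))]
                delta = [b * v * (1.0 - v) for b, v in zip(back, a)][1:]
    return Delta


# Tiny made-up network: 1 input, 1 hidden unit, 1 output, all weights zero.
Delta = accumulate_deltas([[[0.0, 0.0]], [[0.0, 0.0]]], X=[[1.0]], Y=[[1.0]])
print(Delta)
```

With all weights at zero, every activation is 0.5, so only the last layer's accumulator receives a non-zero update in this toy call.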

Derivatives $D$…

Bugs in backpropagation are hard to trace, so the gradients are checked against a numerical approximation (gradient checking).

$\frac{\partial}{\partial \theta} J(\theta) \approx \frac{J(\theta + \epsilon) - J(\theta - \epsilon)}{2 \epsilon}$
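For a parameter vector, the approximation is applied to one component at a time. The sketch below assumes the parameters are unrolled into a flat list; the function name `numerical_gradient`, the default `eps=1e-4`, and the toy cost function are all hypothetical.

```python
def numerical_gradient(J, theta, eps=1e-4):
    """Approximate dJ/dtheta_i by central differences, one component at a time."""
    grad = []
    for i in range(len(theta)):
        plus = list(theta)
        minus = list(theta)
        plus[i] += eps
        minus[i] -= eps
        grad.append((J(plus) - J(minus)) / (2 * eps))
    return grad


# Toy cost J(theta) = theta_0^2 + 3 * theta_1 with exact gradient [2 * theta_0, 3].
def toy_cost(t):
    return t[0] ** 2 + 3 * t[1]


approx = numerical_gradient(toy_cost, [1.0, 2.0])
print(approx)
```

In practice the approximation would be compared against the gradient from backpropagation on a small network and then switched off, since it needs two cost evaluations per parameter.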

Choosing initialTheta.

If all initialTheta = 0, then all weights are equal and $a_1^{(2)} = a_2^{(2)}$ (the problem of symmetric weights).

Initialize each $\theta^{(l)}_{ij}$ with a random number in $[-\epsilon, \epsilon]$.
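A minimal sketch of such an initialization using Python's standard `random` module. The function name `random_init`, the default `epsilon=0.12`, and the `seed` argument are assumptions of this example; the notes do not prescribe a value for $\epsilon$.

```python
import random


def random_init(rows, cols, epsilon=0.12, seed=None):
    """Return a rows x cols weight matrix with entries drawn uniformly from [-epsilon, epsilon]."""
    rng = random.Random(seed)
    return [[rng.uniform(-epsilon, epsilon) for _ in range(cols)] for _ in range(rows)]


# Weight matrix from a layer with 3 units (plus bias) to a layer with 2 units.
theta1 = random_init(2, 4, seed=0)
for row in theta1:
    print(row)
```

Because the rows differ, the hidden units start with different weights and no longer compute identical activations.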

• 1 hidden layer
• Fewer parameters, more prone to underfitting, but cheaper to compute
• >1 hidden layer: each hidden layer has the same number of hidden units
• More parameters, more prone to overfitting (which can be controlled with regularization), but more expensive to compute
• Number of hidden layers: compare the cross-validation error for different numbers of hidden layers.