
Neural Network

Non-linear Classification:

Problem when using logistic regression: polynomial or quadratic feature combinations lead to too many features.

Inputs:

  • $x_0 = 1$: Bias unit
  • $x_1, \dots, x_n$: Input units

Activation functions:

  • Sigmoid activation function $g(z) = \frac{1}{1+e^{-z}}$
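A minimal NumPy sketch of the sigmoid (the sample values are only for illustration):

```python
import numpy as np

def sigmoid(z):
    """Sigmoid activation g(z) = 1 / (1 + e^{-z}), applied element-wise."""
    return 1.0 / (1.0 + np.exp(-z))

print(sigmoid(np.array([-10.0, 0.0, 10.0])))  # ~[0.000045, 0.5, 0.999955]
```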

Layers:

  • Input Layer → Hidden Layer → Output Layer

Neuron activation unit:

  • $a_i^{(j)}$: Activation of unit $i$ in layer $j$

Weight matrix between layers:

  • $\theta^{(j)}$: Weight matrix that controls the mapping from layer $j$ to layer $j+1$

Example activation of unit 1 in layer 2:

$a_1^{(2)} = g(\theta_{10}^{(1)} x_0 + \theta_{11}^{(1)} x_1 + \theta_{12}^{(1)} x_2 + \theta_{13}^{(1)} x_3)$ etc.

Hypothesis (output of activation $a_1^{(3)}$ in an NN with 3 layers):

  • $h_\theta(x)=a_1^{(3)} = g(\theta_{10}^{(2)} a_0^{(2)} + \theta_{11}^{(2)} a_1^{(2)} + \theta_{12}^{(2)} a_2^{(2)} + \theta_{13}^{(2)} a_3^{(2)})$

If the network has $s_j$ units in layer j, then $\theta^{(j)}$ is of dimension $s_{j+1} \times (s_j+1)$

$$z^{(2)} = \theta^{(1)} a^{(1)}\\ a^{(2)} = g(z^{(2)})\\ a_0^{(2)} = 1 \dots $$
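A minimal NumPy sketch of this vectorized step; the layer sizes ($s_1 = 3$, $s_2 = 4$) and the input values are made up for illustration:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

s1, s2 = 3, 4                          # example layer sizes (assumed)
theta1 = np.random.randn(s2, s1 + 1)   # theta^(1): s_2 x (s_1 + 1) = 4 x 4

x = np.array([0.5, -1.2, 3.0])         # one example with s_1 = 3 features
a1 = np.concatenate(([1.0], x))        # a^(1) = x with bias unit a_0^(1) = 1

z2 = theta1 @ a1                       # z^(2) = theta^(1) a^(1)
a2 = np.concatenate(([1.0], sigmoid(z2)))  # a^(2) = g(z^(2)) with bias a_0^(2) = 1
print(a2.shape)                        # (5,)
```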

Examples of how NNs can learn the AND, OR, XOR, and XNOR functions.

$y = x_1 \text{ AND } x_2$

With parameters: $h_\theta(x) = g(-30 + 20 x_1 + 20 x_2)$; since $g(z) \approx 0$ for $z \ll 0$ and $g(z) \approx 1$ for $z \gg 0$, the output is $\approx 1$ only when $x_1 = x_2 = 1$.

$$y = \text{NOT } x_1: \quad g(10 - 20x_1) \\ y = (\text{NOT } x_1) \text{ AND } (\text{NOT } x_2): \quad g(10 - 20x_1 - 20x_2)$$

$y = x_1 \text{ XNOR } x_2$ can be built from $x_1 \text{ AND } x_2$ and $(\text{NOT } x_1) \text{ AND } (\text{NOT } x_2)$ in the hidden layer, combined with $x_1 \text{ OR } x_2$ in the output layer.
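These hand-picked weights can be checked directly. A small sketch; the OR weights $g(-10 + 20x_1 + 20x_2)$ are the usual companion choice and an assumption here, since the notes do not state them:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def and_gate(x1, x2):
    return sigmoid(-30 + 20 * x1 + 20 * x2)    # x1 AND x2

def nor_gate(x1, x2):
    return sigmoid(10 - 20 * x1 - 20 * x2)     # (NOT x1) AND (NOT x2)

def or_gate(x1, x2):
    return sigmoid(-10 + 20 * x1 + 20 * x2)    # x1 OR x2 (assumed weights)

def xnor(x1, x2):
    # hidden layer: AND and (NOT x1) AND (NOT x2); output layer: OR
    return or_gate(and_gate(x1, x2), nor_gate(x1, x2))

for x1 in (0, 1):
    for x2 in (0, 1):
        print(x1, x2, round(xnor(x1, x2)))     # prints 1, 0, 0, 1
```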

4 output units, one for each class:

$$h_\theta(x) \approx 1,0,0,0$$ $$h_\theta(x) \approx 0,1,0,0$$ $$h_\theta(x) \approx 0,0,1,0$$ $$h_\theta(x) \approx 0,0,0,1$$

$y^{(i)}$ is now a 4-dimensional vector.
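A quick sketch of this one-hot target encoding (0-based class indices are an assumption):

```python
import numpy as np

K = 4
labels = np.array([0, 2, 3, 1])   # example class indices (0-based)
Y = np.eye(K)[labels]             # each y^(i) becomes a K-dimensional one-hot vector
print(Y)
# [[1. 0. 0. 0.]
#  [0. 0. 1. 0.]
#  [0. 0. 0. 1.]
#  [0. 1. 0. 0.]]
```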

Notation

  • $s_l$: Number of units in layer $l$, not counting the bias unit
  • $L$: Total number of layers in the network
  • For binary classification: $s_L = 1$, $K = 1$
  • For $K$-class classification: $s_L = K$

The NN cost function is a generalization of the cost function of logistic regression.

Cost function of logistic regression:

$J(\theta) = - \frac{1}{m} \sum_{i=1}^m \left[ y^{(i)} \log(h_\theta(x^{(i)})) + (1-y^{(i)}) \log(1 - h_\theta(x^{(i)})) \right] + \frac{\lambda}{2m} \sum_{j=1}^{n} \theta_{j}^2$

Cost function of a neural net:

$J(\theta) = - \frac{1}{m} \left[ \sum_{i=1}^m \sum_{k=1}^K y_k^{(i)} \log(h_\theta(x^{(i)}))_k + (1-y_k^{(i)}) \log(1-(h_\theta(x^{(i)}))_k) \right] + \frac{\lambda}{2m} \sum_{l=1}^{L-1} \sum_{i=1}^{s_l} \sum_{j=1}^{s_{l+1}} (\theta_{ji}^{(l)})^2$

Explanation:

  • Sum over $k$: over the $K$ output units
  • Sum over all $\theta_{ji}^{(l)}$, excluding the bias units
  • Frobenius-norm regularization, also called weight decay
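A minimal sketch of this cost for a 3-layer network with sigmoid outputs and one-hot labels; the shapes and variable names (theta1, theta2, X, Y, lam) are assumptions, not from the notes:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def nn_cost(theta1, theta2, X, Y, lam):
    """Regularized cross-entropy cost for a 3-layer network.

    X: m x n inputs, Y: m x K one-hot labels,
    theta1: s2 x (n+1), theta2: K x (s2+1).
    """
    m = X.shape[0]
    A1 = np.hstack([np.ones((m, 1)), X])                      # add bias units
    A2 = np.hstack([np.ones((m, 1)), sigmoid(A1 @ theta1.T)])
    H = sigmoid(A2 @ theta2.T)                                # m x K hypothesis

    cost = -np.sum(Y * np.log(H) + (1 - Y) * np.log(1 - H)) / m
    # regularization over all theta_{ji}^{(l)}, excluding the bias column of each matrix
    reg = (lam / (2 * m)) * (np.sum(theta1[:, 1:] ** 2) + np.sum(theta2[:, 1:] ** 2))
    return cost + reg
```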

Goal is $\min_\theta J(\theta)$; needed parts:

  • $J(\theta)$
  • Partial derivatives $\frac{\partial}{\partial \theta_{ij}^{(l)}} J(\theta)$

Forward propagation:

$$ a^{(1)} = x \\ z^{(2)} = \theta^{(1)} a^{(1)} \\ a^{(2)} = g(z^{(2)}), \text{ then add } a_0^{(2)} = 1 \\ \dots $$

$\delta_j^{(l)}$: Error of unit $j$ in layer $l$.

For each output unit (in a network with $L = 4$ layers):

$\delta_j^{(4)} = a_j^{(4)} - y_j$

Vectorized: $\delta^{(4)} = a^{(4)} - y$

$.*$ is element-wise multiplication

$$\delta^{(3)} = (\theta^{(3)})^T\delta^{(4)} .* g'(z^{(3)}), \quad g'(z^{(3)}) = a^{(3)} .* (1-a^{(3)})$$

$$\delta^{(2)} = (\theta^{(2)})^T\delta^{(3)} .* g'(z^{(2)}), \quad g'(z^{(2)}) = a^{(2)} .* (1-a^{(2)})$$

Algorithm

Set $\Delta_{ij}^{(l)} = 0$ for all $i, j, l$.

For $i = 1$ to $m$:

  • Set $a^{(1)} = x^{(i)}$
  • Forward propagation to compute $a^{(l)}$ for $l = 2, 3, \dots, L$
  • Use $y^{(i)}$ to compute $\delta^{(L)} = a^{(L)} - y^{(i)}$, then back-propagate down to $\delta^{(2)}$
  • Accumulate $\Delta_{ij}^{(l)} := \Delta_{ij}^{(l)} + a_j^{(l)} \delta_i^{(l+1)}$

Vectorized: $\Delta^{(l)} := \Delta^{(l)} + \delta^{(l+1)} (a^{(l)})^T$

Derivatives $D$ …
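Below is a sketch of the full accumulation loop for the 3-layer network used in the cost sketch above. The final step from $\Delta$ to the derivatives $D$ is left open in the notes; the $1/m$ and $\lambda/m$ scaling used here matches that regularized cost and is an assumption of this sketch.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def backprop(theta1, theta2, X, Y, lam):
    """One pass of the Delta-accumulation algorithm for a 3-layer network.

    Returns gradients D1, D2 with the same shapes as theta1, theta2.
    The 1/m and lambda/m scaling is an assumption chosen to match the
    regularized cost sketched earlier.
    """
    m = X.shape[0]
    Delta1 = np.zeros_like(theta1)
    Delta2 = np.zeros_like(theta2)

    for i in range(m):
        # forward propagation
        a1 = np.concatenate(([1.0], X[i]))
        a2 = np.concatenate(([1.0], sigmoid(theta1 @ a1)))
        a3 = sigmoid(theta2 @ a2)

        # backward pass: delta^(3) = a^(3) - y, then delta^(2) via the recursion above
        d3 = a3 - Y[i]
        d2 = (theta2.T @ d3)[1:] * a2[1:] * (1 - a2[1:])   # drop the bias component

        # accumulate: Delta^(l) += delta^(l+1) (a^(l))^T
        Delta2 += np.outer(d3, a2)
        Delta1 += np.outer(d2, a1)

    D1 = Delta1 / m
    D2 = Delta2 / m
    D1[:, 1:] += (lam / m) * theta1[:, 1:]   # no regularization on the bias column
    D2[:, 1:] += (lam / m) * theta2[:, 1:]
    return D1, D2
```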

Errors in backpropagation are hard to trace, hence gradient checking: approximate the gradients numerically.

$\frac{\partial}{\partial \theta} J(\theta) \approx \frac{J(\theta + \epsilon) - J(\theta - \epsilon)}{2 \epsilon}$
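A sketch of this two-sided approximation, perturbing one parameter at a time of a generic cost function; the example cost and $\epsilon = 10^{-4}$ are assumptions:

```python
import numpy as np

def numerical_gradient(J, theta, eps=1e-4):
    """Approximate dJ/dtheta_j by (J(theta + e) - J(theta - e)) / (2 * eps) per component."""
    grad = np.zeros_like(theta)
    for j in range(theta.size):
        e = np.zeros_like(theta)
        e.flat[j] = eps
        grad.flat[j] = (J(theta + e) - J(theta - e)) / (2 * eps)
    return grad

# example: J(theta) = sum(theta^2) has gradient 2 * theta
theta = np.array([1.0, -2.0, 3.0])
print(numerical_gradient(lambda t: np.sum(t ** 2), theta))  # ~[2, -4, 6]
```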

Setting initialTheta.

If all initialTheta = 0, then all weights are equal and $a_1^{(2)} = a_2^{(2)}$ (problem of identical weights).

Initialize each $\theta^{(l)}_{ij}$ with a random number in $[-\epsilon, \epsilon]$.
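A minimal sketch of such symmetry-breaking initialization; the value $\epsilon = 0.12$ and the layer sizes are assumptions:

```python
import numpy as np

def random_init(s_in, s_out, eps=0.12):
    """Weights uniform in [-eps, eps] for a layer mapping s_in units to s_out units (plus bias column)."""
    return np.random.uniform(-eps, eps, size=(s_out, s_in + 1))

theta1 = random_init(3, 4)   # 4 x 4 matrix; breaks the symmetry an all-zero init would cause
theta2 = random_init(4, 1)   # output layer
```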

Choosing the network architecture:

  • 1 hidden layer
    • Fewer parameters, more prone to underfitting, but cheaper to compute
  • >1 hidden layer: each hidden layer has the same number of hidden units
    • More parameters, more prone to overfitting (but this can be controlled with regularization), and more expensive to compute
    • Number of hidden layers: test the cross-validation error with different numbers of hidden layers (see the sketch after this list)
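A sketch of that cross-validation comparison using scikit-learn's MLPClassifier; the dataset, the width of 25 units per hidden layer, and the other settings are assumptions for illustration, not from the notes:

```python
from sklearn.datasets import load_digits
from sklearn.model_selection import cross_val_score
from sklearn.neural_network import MLPClassifier

X, y = load_digits(return_X_y=True)

# same number of units per hidden layer; vary only the number of hidden layers
for n_layers in (1, 2, 3):
    clf = MLPClassifier(hidden_layer_sizes=(25,) * n_layers,
                        max_iter=1000, random_state=0)
    scores = cross_val_score(clf, X, y, cv=3)
    print("%d hidden layer(s): CV accuracy %.3f" % (n_layers, scores.mean()))
```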