====== Neural Network ======

===== Motivation =====

Non-linear classification: with logistic regression, non-linear decision boundaries require polynomial (e.g. quadratic) feature terms, which quickly leads to far too many features.

===== Basic structure =====

Inputs:
  * $x_0 = 1$: bias unit
  * $x_1, \dots, x_n$: input units

Activation function:
  * Sigmoid activation function $g(z) = \frac{1}{1+e^{-z}}$

Layers:
  * Input layer -> hidden layer(s) -> output layer

Neuron activation:
  * $a_i^{(j)}$: activation of unit $i$ in layer $j$

Weight matrix between layers:
  * $\theta^{(j)}$: weight matrix controlling the mapping from layer $j$ to layer $j+1$

Example, activation of unit 1 in layer 2:
$a_1^{(2)} = g(\theta_{10}^{(1)} x_0 + \theta_{11}^{(1)} x_1 + \theta_{12}^{(1)} x_2 + \theta_{13}^{(1)} x_3)$ etc.

Hypothesis (output activation $a_1^{(3)}$ in a NN with 3 layers):
  * $h_\theta(x) = a_1^{(3)} = g(\theta_{10}^{(2)} a_0^{(2)} + \theta_{11}^{(2)} a_1^{(2)} + \theta_{12}^{(2)} a_2^{(2)} + \theta_{13}^{(2)} a_3^{(2)})$

If the network has $s_j$ units in layer $j$, then $\theta^{(j)}$ has dimension $s_{j+1} \times (s_j+1)$.

===== Vectorized version of forward propagation =====

$$a^{(1)} = x \\ z^{(2)} = \theta^{(1)} a^{(1)} \\ a^{(2)} = g(z^{(2)}) \\ a_0^{(2)} = 1 \\ \dots$$

===== Logical functions =====

Examples of how a NN can learn the AND, OR, XOR and XNOR functions.

$y = x_1 \text{ AND } x_2$ with parameters: $h_\theta(x) = g(-30 + 20 x_1 + 20 x_2)$

Since $g(z) \approx 0$ for large negative $z$ and $g(z) \approx 1$ for large positive $z$, the output is $\approx 1$ only if both inputs are 1.

$$y = \text{NOT}(x_1): \quad g(10 - 20 x_1)$$
$$y = \text{NOT}(x_1) \text{ AND } \text{NOT}(x_2): \quad g(10 - 20 x_1 - 20 x_2)$$

$$y = x_1 \text{ XNOR } x_2$$
is built from $x_1 \text{ AND } x_2$, $\text{NOT}(x_1) \text{ AND } \text{NOT}(x_2)$ and $x_1 \text{ OR } x_2$ (see the sketch below).
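As a concrete illustration, here is a minimal NumPy sketch of that XNOR network: the hidden layer computes $x_1 \text{ AND } x_2$ and $\text{NOT}(x_1) \text{ AND } \text{NOT}(x_2)$ with the weights above, and the output layer combines them with OR. The OR weights $g(-10 + 20 a_1 + 20 a_2)$ are not listed above and are assumed here; the names (''sigmoid'', ''forward'', ''theta1'', ...) are illustrative.

<code python>
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# theta^(1): maps the input layer (bias, x1, x2) to the two hidden units.
# Row 1 computes x1 AND x2, row 2 computes NOT(x1) AND NOT(x2).
theta1 = np.array([[-30.0,  20.0,  20.0],
                   [ 10.0, -20.0, -20.0]])

# theta^(2): maps (bias, a1, a2) to the single output unit, an OR of the two
# hidden units (assumed weights, not from the notes above).
theta2 = np.array([[-10.0, 20.0, 20.0]])

def forward(x):
    """Vectorized forward propagation for a single example x = (x1, x2)."""
    a1 = np.concatenate(([1.0], x))             # add bias unit a0^(1) = 1
    z2 = theta1 @ a1
    a2 = np.concatenate(([1.0], sigmoid(z2)))   # add bias unit a0^(2) = 1
    z3 = theta2 @ a2
    return sigmoid(z3)[0]                       # h_theta(x) = a^(3)

for x in ([0., 0.], [0., 1.], [1., 0.], [1., 1.]):
    print(x, round(forward(np.array(x)), 3))    # -> ~1, ~0, ~0, ~1 (XNOR)
</code>

The loop prints approximately 1, 0, 0, 1 for the four input combinations, i.e. exactly the XNOR truth table.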
===== Multi-class classification =====

One output unit per class, e.g. with 4 classes:

$$h_\theta(x) \approx (1,0,0,0)$$
$$h_\theta(x) \approx (0,1,0,0)$$
$$h_\theta(x) \approx (0,0,1,0)$$
$$h_\theta(x) \approx (0,0,0,1)$$

$y^{(i)}$ is now a 4-dimensional vector.

===== Cost function =====

Notation:
  * $s_l$: number of units in layer $l$, not counting the bias unit
  * $L$: total number of layers in the network
  * $K$: number of output units; binary classification: $s_L = 1$, $K = 1$; $K$-class classification: $s_L = K$

The cost function is a generalization of the cost function of logistic regression:

$J(\theta) = - \frac{1}{m} \sum_{i=1}^m \left[ y^{(i)} \log(h_\theta(x^{(i)})) + (1-y^{(i)}) \log(1 - h_\theta(x^{(i)})) \right] + \frac{\lambda}{2m} \sum_{j=1}^{n} \theta_{j}^2$

**Cost function of a neural net:**

$J(\theta) = - \frac{1}{m} \left[ \sum_{i=1}^m \sum_{k=1}^K y_k^{(i)} \log(h_\theta(x^{(i)}))_k + (1-y_k^{(i)}) \log(1-(h_\theta(x^{(i)}))_k) \right] + \frac{\lambda}{2m} \sum_{l=1}^{L-1} \sum_{i=1}^{s_l} \sum_{j=1}^{s_{l+1}} (\theta_{ji}^{(l)})^2$

Explanation:
  * The sum over $k$ runs over the $K$ output units.
  * The regularization sums over all $\theta_{ji}^{(l)}$ except the weights of the bias units.
  * The regularization term is the squared Frobenius norm of the weight matrices; this kind of regularization is also called //weight decay//.

===== Backpropagation Algorithm =====

Goal is $\min_\theta J(\theta)$; needed parts:
  * $J(\theta)$
  * the partial derivatives $\frac{\partial}{\partial \theta_{ij}^{(l)}} J(\theta)$

Forward propagation:
$$ a^{(1)} = x \\ z^{(2)} = \theta^{(1)} a^{(1)} \\ a^{(2)} = g(z^{(2)}) \text{, add } a_0^{(2)} = 1 \\ \dots $$

==== Calculation of the prediction errors ====

$\delta_j^{(l)}$: error of unit $j$ in layer $l$.

For each output unit (layer $L = 4$): $\delta_j^{(4)} = a_j^{(4)} - y_j$. Vectorized: $\delta^{(4)} = a^{(4)} - y$.

$.*$ denotes element-wise multiplication.

$$\delta^{(3)} = (\theta^{(3)})^T \delta^{(4)} .* g'(z^{(3)}), \qquad g'(z^{(3)}) = a^{(3)} .* (1-a^{(3)})$$
$$\delta^{(2)} = (\theta^{(2)})^T \delta^{(3)} .* g'(z^{(2)}), \qquad g'(z^{(2)}) = a^{(2)} .* (1-a^{(2)})$$

Algorithm:

Set $\Delta_{ij}^{(l)} = 0$ for all $i, j, l$.

For $i = 1$ to $m$:
  * Set $a^{(1)} = x^{(i)}$
  * Forward propagation to compute $a^{(l)}$ for $l = 2, 3, \dots, L$
  * Use $y^{(i)}$ to compute $\delta^{(L)} = a^{(L)} - y^{(i)}$, then propagate the errors back down to $\delta^{(2)}$
  * $\Delta_{ij}^{(l)} := \Delta_{ij}^{(l)} + a_j^{(l)} \delta_i^{(l+1)}$; vectorized: $\Delta^{(l)} := \Delta^{(l)} + \delta^{(l+1)} (a^{(l)})^T$

The derivatives $D$ are then obtained from the accumulated $\Delta$:

$$\frac{\partial}{\partial \theta_{ij}^{(l)}} J(\theta) = D_{ij}^{(l)} = \frac{1}{m} \Delta_{ij}^{(l)} + \frac{\lambda}{m} \theta_{ij}^{(l)} \text{ for } j \neq 0, \qquad D_{ij}^{(l)} = \frac{1}{m} \Delta_{ij}^{(l)} \text{ for } j = 0$$

==== Gradient checking ====

Errors in backpropagation are hard to track down, so the gradients are approximated numerically and compared with the backprop result (a combined sketch of backpropagation, gradient checking and random initialization follows at the end of this page):

$$\frac{\partial}{\partial \theta} J(\theta) \approx \frac{J(\theta + \epsilon) - J(\theta - \epsilon)}{2 \epsilon}$$

==== Random Initialization ====

Choosing initialTheta: if all initialTheta = 0, then all weights stay identical and $a_1^{(2)} = a_2^{(2)}$ (symmetry problem of identical weights). Therefore initialize each $\theta_{ij}^{(l)}$ with a random number in $[-\epsilon, \epsilon]$.

==== Architecture decisions ====

  * 1 hidden layer
    * Fewer parameters, more prone to underfitting, but cheaper to compute
  * >1 hidden layer: each hidden layer usually gets the same number of hidden units
    * More parameters, more prone to overfitting (which can be controlled with regularization), but more expensive to compute
  * Number of hidden layers: compare the cross-validation error for different numbers of hidden layers.
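To tie the pieces together, here is a minimal NumPy sketch for a 3-layer network ($L = 3$) that combines forward propagation, the regularized cost function, backpropagation and the numerical gradient check. It processes all $m$ examples at once (one example per row) instead of the explicit for-loop above; all names (''cost_and_grad'', ''init_weights'', ...) and the initialization range $\epsilon = 0.12$ are illustrative choices, not from the original notes.

<code python>
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def init_weights(rows, cols, eps=0.12):
    # random initialization in [-eps, eps] to break the symmetry
    return np.random.uniform(-eps, eps, size=(rows, cols))

def unroll(theta1, theta2):
    return np.concatenate([theta1.ravel(), theta2.ravel()])

def cost_and_grad(params, X, Y, s1, s2, s3, lam=1.0):
    m = X.shape[0]
    theta1 = params[:s2 * (s1 + 1)].reshape(s2, s1 + 1)   # maps layer 1 -> 2
    theta2 = params[s2 * (s1 + 1):].reshape(s3, s2 + 1)   # maps layer 2 -> 3

    # forward propagation, all m examples at once (one example per row)
    A1 = np.hstack([np.ones((m, 1)), X])                  # add bias column
    Z2 = A1 @ theta1.T
    A2 = np.hstack([np.ones((m, 1)), sigmoid(Z2)])
    H = sigmoid(A2 @ theta2.T)                            # h_theta(x), m x K

    # regularized cross-entropy cost J(theta); bias weights are not penalized
    J = -np.sum(Y * np.log(H) + (1 - Y) * np.log(1 - H)) / m
    J += lam / (2 * m) * (np.sum(theta1[:, 1:] ** 2) + np.sum(theta2[:, 1:] ** 2))

    # backpropagation of the errors delta^(3) and delta^(2)
    D3 = H - Y
    D2 = (D3 @ theta2)[:, 1:] * sigmoid(Z2) * (1 - sigmoid(Z2))
    grad2 = D3.T @ A2 / m                                 # Delta^(2) / m
    grad1 = D2.T @ A1 / m                                 # Delta^(1) / m
    grad2[:, 1:] += lam / m * theta2[:, 1:]               # -> D^(2), no reg. on bias
    grad1[:, 1:] += lam / m * theta1[:, 1:]               # -> D^(1)
    return J, unroll(grad1, grad2)

def numeric_gradient(params, X, Y, s1, s2, s3, eps=1e-4):
    # (J(theta + eps) - J(theta - eps)) / (2 * eps), one component at a time
    grad = np.zeros_like(params)
    for i in range(len(params)):
        e = np.zeros_like(params)
        e[i] = eps
        grad[i] = (cost_and_grad(params + e, X, Y, s1, s2, s3)[0]
                   - cost_and_grad(params - e, X, Y, s1, s2, s3)[0]) / (2 * eps)
    return grad

# tiny example: 2 inputs, 2 hidden units, 1 output (the XNOR data from above)
s1, s2, s3 = 2, 2, 1
X = np.array([[0., 0.], [0., 1.], [1., 0.], [1., 1.]])
Y = np.array([[1.], [0.], [0.], [1.]])
params = unroll(init_weights(s2, s1 + 1), init_weights(s3, s2 + 1))

J, grad = cost_and_grad(params, X, Y, s1, s2, s3)
diff = np.max(np.abs(grad - numeric_gradient(params, X, Y, s1, s2, s3)))
print("J =", J, " max |backprop - numeric| =", diff)
</code>

If backpropagation is implemented correctly, the printed difference between the backprop gradient and the numerical approximation should be tiny (on the order of $10^{-9}$).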