====== Backpropagation ======

We can compute //how fast the error changes// as a hidden activity is changed: first get the error derivatives w.r.t. the hidden activities, then convert them into error derivatives w.r.t. the weights.

  - Convert the discrepancy between the output and the target into an error derivative (differentiate $E$ w.r.t. the output): $\frac{\partial E}{\partial y_j} = -(t_j - y_j)$
  - Compute the error derivatives in each hidden layer from the error derivatives in the layer above.
  - Then use the **error derivatives** w.r.t. the activities to get the error derivatives w.r.t. the incoming **weights**.

===== Sample =====

Input values: $x_1 = 2,\; x_2 = 5,\; x_3 = 3$. The true weights $w_1 = 150,\; w_2 = 50,\; w_3 = 100$ produce the target $t = 850$.

Delta rule for learning: $\Delta w_i = \epsilon\, x_i (t-y)$

Starting from an initial guess of $w_1 = w_2 = w_3 = 50$ (which gives $y = 500$) and using $\epsilon = 1/35$, one update yields $w^*_1 = 70,\; w^*_2 = 100,\; w^*_3 = 80$. (A runnable sketch of this update is at the end of this page.)

===== Deriving the delta rule =====

The error $E$ is the squared residual summed over all training cases:

$E = \frac{1}{2} \sum\limits_{n \in \text{training}} (t^n-y^n)^2$

Differentiate $E$ w.r.t. the weights to get the error derivatives for the weights (chain rule, with $y^n = \sum_i w_i x_i^n$):

$\frac{\partial{E}}{\partial{w_i}} = \sum\limits_{n} \frac{\partial y^n}{\partial w_i} \frac{\partial E}{\partial y^n} = - \sum\limits_{n} x_i^n(t^n-y^n)$

The **batch delta rule** changes the weights in proportion to the error derivatives summed over all training cases:

$\Delta w_i = - \epsilon \frac{\partial{E}}{\partial{w_i}}$

===== More deriving =====

Consider a small network with two hidden units and a logistic non-linearity:

\begin{align}
z_1 &= w_1 x_1 + w_2 x_2\\
z_2 &= w_3 x_2 + w_4 x_3\\
h_1 &= \sigma(z_1)\\
h_2 &= \sigma(z_2)\\
y &= u_1 h_1 + u_2 h_2\\
E &= \frac{1}{2} (t-y)^2\\
\sigma(x) &= \frac{1}{1+e^{-x}}
\end{align}

Suppose the weights are tied so that $w_2 = w_3$. Then

$\frac{\partial E}{\partial w_{tied}} = \frac{\partial E}{\partial w_{2}} + \frac{\partial E}{\partial w_{3}}$

What is $\frac{\partial E}{\partial w_{2}}$? Repeated application of the chain rule:

$\frac{\partial E}{\partial w_{2}} = \frac{\partial E}{\partial y} \frac{\partial y}{\partial h_1} \frac{\partial h_1}{\partial z_1} \frac{\partial z_1}{\partial w_2}$

with

$\frac{\partial E}{\partial y} = -(t-y)$, $\quad \frac{\partial y}{\partial h_1} = u_1$, $\quad \frac{\partial h_1}{\partial z_1} = h_1(1-h_1)$, $\quad \frac{\partial z_1}{\partial w_2} = x_2$

so

$\frac{\partial E}{\partial w_{2}} = -(t-y)\, u_1\, h_1(1-h_1)\, x_2$

(A runnable sketch that checks this derivative numerically is at the end of this page.)

===== Additional issues =====

==== Optimization issues ====

  * How often should the weights be updated?
    * Online: after each training case.
    * Full batch: after a full sweep through the training data (bigger step, minimizes the overall training error).
    * Mini-batch: after a small sample of training cases (a little bit of zig-zag).
  * How much should be updated?
    * Use a fixed learning rate? Adapt the global learning rate? Adapt the learning rate on each connection separately? Don't use steepest descent at all?

==== Overfitting (How well does the network generalize?) ====

See [[data_mining:neural_network:overfitting|Overfitting & Parameter tuning]]

==== History of backpropagation ====

Popular explanation for why it fell out of favour: it could not make use of multiple hidden layers, it did not work well in recurrent networks or deep auto-encoders, and SVMs worked better, required less expertise and had a fancier theory.

Real reasons: computers were too slow, labeled data sets were too small, and deep networks were too small.

Continuum between statistics and AI:
  * Low-dimensional data (< 100 dim) vs. high-dimensional data (> 100 dim).
  * Lots of noise vs. noise is not the main problem.
  * Not much structure in the data vs. a huge amount of structure, but too complicated.
  * Main problem: separate the structure from the noise vs. find a way of representing the complicated structure.
  * SVM view 1: just a clever reincarnation of perceptrons. Expand the input into a layer of non-linear, non-adaptive features; use only one layer of adaptive weights, with a very efficient way of fitting the weights that controls overfitting.
  * SVM view 2: each input vector defines a non-adaptive feature. A clever way of simultaneously doing feature selection and finding weights on the remaining features.
  * Either way, you can't learn multiple layers of features with SVMs.
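===== Code sketches =====

A minimal sketch of the one-step delta-rule update from the //Sample// section above, assuming the initial weight guess $w_i = 50$ for all three weights and $\epsilon = 1/35$; all variable names are illustrative.

<code python>
# One delta-rule update for a linear neuron y = sum_i w_i * x_i.
# Assumes the initial guess w_i = 50 (implied by the result in the notes)
# and epsilon = 1/35 from the Sample section.

x = [2.0, 5.0, 3.0]        # input values x_1..x_3
w = [50.0, 50.0, 50.0]     # assumed initial weight guess
t = 850.0                  # target output
eps = 1.0 / 35.0           # learning rate epsilon

y = sum(wi * xi for wi, xi in zip(w, x))                   # y = 500
w_new = [wi + eps * xi * (t - y) for wi, xi in zip(w, x)]  # delta rule: w_i += eps * x_i * (t - y)

print(y)       # 500.0
print(w_new)   # [70.0, 100.0, 80.0]  -> matches w*_1 = 70, w*_2 = 100, w*_3 = 80
</code>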
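A second sketch for the //More deriving// section: it computes $\frac{\partial E}{\partial w_2}$, $\frac{\partial E}{\partial w_3}$ and the tied-weight gradient with the chain-rule formulas above, and checks $\frac{\partial E}{\partial w_2}$ against a finite difference. The concrete input, weight and target values are made up for illustration.

<code python>
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def forward(w1, w2, w3, w4, u1, u2, x1, x2, x3, t):
    """Forward pass of the two-hidden-unit network; returns (E, y, h1, h2)."""
    z1 = w1 * x1 + w2 * x2
    z2 = w3 * x2 + w4 * x3
    h1 = sigmoid(z1)
    h2 = sigmoid(z2)
    y = u1 * h1 + u2 * h2
    E = 0.5 * (t - y) ** 2
    return E, y, h1, h2

# Illustrative values (not taken from the notes); w2 == w3 because the weights are tied.
w1, w2, w3, w4 = 0.1, 0.2, 0.2, -0.3
u1, u2 = 0.5, -0.4
x1, x2, x3 = 1.0, 2.0, -1.0
t = 0.3

E, y, h1, h2 = forward(w1, w2, w3, w4, u1, u2, x1, x2, x3, t)

# Chain rule: dE/dw2 = -(t - y) * u1 * h1 * (1 - h1) * x2
dE_dw2 = -(t - y) * u1 * h1 * (1.0 - h1) * x2
# Same pattern through the other hidden unit: dE/dw3 = -(t - y) * u2 * h2 * (1 - h2) * x2
dE_dw3 = -(t - y) * u2 * h2 * (1.0 - h2) * x2
dE_dw_tied = dE_dw2 + dE_dw3    # gradient for the tied weight w2 = w3

# Finite-difference check of dE/dw2 (perturb only w2).
step = 1e-6
E_plus, _, _, _ = forward(w1, w2 + step, w3, w4, u1, u2, x1, x2, x3, t)
E_minus, _, _, _ = forward(w1, w2 - step, w3, w4, u1, u2, x1, x2, x3, t)
dE_dw2_numeric = (E_plus - E_minus) / (2.0 * step)

print(dE_dw2, dE_dw2_numeric)   # should agree to several decimal places
print(dE_dw_tied)
</code>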