Backpropagation
We can compute how fast the error changes as each hidden activity is changed: first get the error derivatives w.r.t. the hidden activities, then convert these into error derivatives w.r.t. the weights.
- Convert the discrepancy between output and target into an error derivative (differentiate $E$ w.r.t. the output).
- Compute error derivatives in each hidden layer from error derivatives in the layer above.
- Then use the error derivatives w.r.t. activities to get error derivatives w.r.t. the incoming weights.
Starting point: $\frac{\partial E}{\partial y_j}$, the error derivative w.r.t. the output of unit $j$.
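For a logistic unit $j$ receiving input from units $i$ in the layer below, the three steps above give the following standard recursion (a sketch; the index notation $y_i$, $w_{ij}$, $z_j$ is assumed here rather than defined elsewhere in these notes):

\begin{align} z_j &= \sum_i y_i w_{ij}, \quad y_j = \sigma(z_j)\\ \frac{\partial E}{\partial z_j} &= \frac{dy_j}{dz_j} \frac{\partial E}{\partial y_j} = y_j(1-y_j) \frac{\partial E}{\partial y_j}\\ \frac{\partial E}{\partial y_i} &= \sum_j \frac{\partial z_j}{\partial y_i} \frac{\partial E}{\partial z_j} = \sum_j w_{ij} \frac{\partial E}{\partial z_j}\\ \frac{\partial E}{\partial w_{ij}} &= \frac{\partial z_j}{\partial w_{ij}} \frac{\partial E}{\partial z_j} = y_i \frac{\partial E}{\partial z_j} \end{align}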
Example
Input values: $x_1 = 2, x_2 = 5, x_3 = 3$
True weights (the ones that generate the target): $w_1 = 150, w_2 = 50, w_3 = 100$
Target: $t = 850$.
Delta rule for learning: $\Delta w_i = \epsilon x_i(t-y)$.
Starting from an initial guess of $50$ for every weight, the prediction is $y = 500$ and the residual is $t - y = 350$. With $\epsilon = 1/35$ each weight changes by $\Delta w_i = 10 x_i$, giving $w^*_1 = 70, w^*_2 = 100, w^*_3 = 80$.
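A minimal Python sketch of this single update step (the initial guess of 50 per weight is inferred from the numbers above):

```python
# One delta-rule step for a linear neuron: Delta w_i = eps * x_i * (t - y)
x = [2, 5, 3]            # input values
w = [50.0, 50.0, 50.0]   # initial guess for the weights
t = 850                  # target
eps = 1 / 35             # learning rate

y = sum(wi * xi for wi, xi in zip(w, x))                   # prediction: 500
w_new = [wi + eps * xi * (t - y) for wi, xi in zip(w, x)]  # residual: 350
print(w_new)  # [70.0, 100.0, 80.0]
```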
Deriving the delta rule
Error $E$ is squared residuals summed up over all training cases. $E = \frac{1}{2} \sum\limits_{n \in \text{training}} (t^n-y^n)^2$
Differentiate $E$ w.r.t. the weights to get the error derivatives for the weights: $\frac{\partial{E}}{\partial{w_i}} = \sum\limits_{n} \frac{\partial y^n}{\partial w_i} \frac{dE^n}{dy^n} = - \sum\limits_{n} x_i^n(t^n-y^n)$
Batch delta rule: change the weights in proportion to the error derivatives summed over all training cases: $\Delta w_i = - \epsilon \frac{\partial{E}}{\partial{w_i}}$
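A minimal numpy sketch of the batch rule for a linear neuron (the function name and array shapes are illustrative choices, not from the notes):

```python
import numpy as np

def batch_delta_rule_step(w, X, t, eps):
    """One full-batch update: Delta w = -eps * dE/dw = eps * sum_n x^n (t^n - y^n)."""
    y = X @ w                # predictions for all N training cases, X is (N, d)
    grad = -X.T @ (t - y)    # dE/dw summed over the whole training set
    return w - eps * grad

# The single-case example above: one step from the initial guess recovers 70, 100, 80
w = batch_delta_rule_step(np.full(3, 50.0), np.array([[2.0, 5.0, 3.0]]),
                          np.array([850.0]), eps=1 / 35)
print(w)  # [ 70. 100.  80.]
```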
Deriving backpropagation for a small network
\begin{align} z_1 &= w_1 x_1 + w_2 x_2\\ z_2 &= w_3 x_2 + w_4 x_3\\ h_1 &= \sigma(z_1)\\ h_2 &= \sigma(z_2)\\ y &= u_1 h_1 + u_2 h_2\\ E &= \frac{1}{2} (t-y)^2\\ \sigma(x) &= \frac{1}{1+e^{-x}} \end{align}
Suppose two weights are tied, so that $w_2 = w_3 = w_{tied}$.
Then $\frac{\partial E}{\partial w_{tied}} = \frac{\partial E}{\partial w_{2}} + \frac{\partial E}{\partial w_{3}}$.
What is $\frac{\partial E}{\partial w_{2}}$?
Repeated application of the chain rule:
$\frac{\partial E}{\partial w_{2}} = \frac{\partial E}{\partial y} \frac{\partial y}{\partial h_1} \frac{\partial h_1}{\partial z_1} \frac{\partial z_1}{\partial w_2}$
$\frac{\partial E}{\partial y} = -(t-y)$
$\frac{\partial y}{\partial h_1} = u_1$
$\frac{\partial h_1}{\partial z_1} = h_1(1-h_1)$
$\frac{\partial z_1}{\partial w_2} = x_2$
$\frac{\partial E}{\partial w_{2}} = -(t-y) u_1 h_1(1-h_1) x_2$
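A small sketch that checks this chain-rule result against a finite-difference estimate, using the network defined above (the numeric values are arbitrary, chosen only for illustration):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def forward(w, u, x, t):
    # The network from the derivation above
    z1 = w[0] * x[0] + w[1] * x[1]
    z2 = w[2] * x[1] + w[3] * x[2]
    h1, h2 = sigmoid(z1), sigmoid(z2)
    y = u[0] * h1 + u[1] * h2
    E = 0.5 * (t - y) ** 2
    return E, y, h1

w = np.array([0.1, -0.2, 0.3, 0.4])   # w1..w4
u = np.array([0.5, -0.6])             # u1, u2
x = np.array([1.0, 2.0, -1.0])        # x1, x2, x3
t = 1.0

E, y, h1 = forward(w, u, x, t)
grad_w2 = -(t - y) * u[0] * h1 * (1 - h1) * x[1]   # the chain-rule expression

# Finite-difference check on w2
eps = 1e-6
wp, wm = w.copy(), w.copy()
wp[1] += eps
wm[1] -= eps
numeric = (forward(wp, u, x, t)[0] - forward(wm, u, x, t)[0]) / (2 * eps)
print(grad_w2, numeric)  # the two values should agree closely
```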
Additional issues
Optimization issues
- How often to update the weights?
- Online: After each training case.
- Full batch: After a full sweep through the training data (bigger step, minimizes overall training error).
- Mini-batch: Small sample of training cases (a little bit of zig-zag; see the sketch after this list).
- How much to update?
- Fixed learning rate? Adapt the global learning rate? Adapt the learning rate on each connection separately? Don't use steepest descent at all?
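A small sketch of the mini-batch variant for a linear neuron (all names and defaults are illustrative, not from the notes); batch_size=1 gives online learning and batch_size=len(X) gives the full-batch rule:

```python
import numpy as np

def minibatch_delta_rule(X, t, eps=0.01, batch_size=10, epochs=100, seed=0):
    """Mini-batch delta-rule training for a linear neuron."""
    rng = np.random.default_rng(seed)
    w = np.zeros(X.shape[1])
    for _ in range(epochs):
        order = rng.permutation(len(X))
        for start in range(0, len(X), batch_size):
            idx = order[start:start + batch_size]
            y = X[idx] @ w
            w += eps * X[idx].T @ (t[idx] - y)   # Delta w = eps * sum x (t - y)
        # a fixed eps is the simplest choice; adapting it is one of the options above
    return w
```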
Overfitting (How well does the network generalize?)
- Target values may be unreliable.
- Sampling errors (accidental regularities of the particular training cases).
Regularization Methods:
- Weight decay (small weights, simpler model)
- Weight-sharing (same weights)
- Early stopping (hold out a validation set; stop training when performance on it gets worse)
- Model averaging
- Bayesian fitting (like model averaging)
- Dropout (randomly omit hidden units; see the sketch after this list)
- Generative pre-training
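A minimal sketch of (inverted) dropout on a vector of hidden activities; the function name, defaults, and the rescaling convention are my own choices, not from the notes:

```python
import numpy as np

def dropout(h, p_drop=0.5, rng=None):
    """Randomly omit hidden units: zero each activity in h (a numpy array) with probability p_drop."""
    rng = rng or np.random.default_rng()
    mask = rng.random(h.shape) >= p_drop   # keep each unit with probability 1 - p_drop
    return h * mask / (1.0 - p_drop)       # rescale so the expected activity is unchanged
```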
Weight decay
$L_2$ regularization:
E.g. for logistic regression, add to the cost function $J$: $\dots + \frac{\lambda}{2m} ||w||^2_2$, where $||w||^2_2 = \sum_{j=1}^{n_x} w_j^2 = w^T w$
$L_1$ regularization:
$\frac{\lambda}{2m} ||w||_1$
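A small sketch of the corresponding penalty terms and the $L_2$ gradient contribution, following the $\frac{\lambda}{2m}$ convention above (the logistic-regression cost $J$, $m$ training examples, and the function names are assumptions for illustration):

```python
import numpy as np

def l2_penalty(w, lam, m):
    """Weight decay: (lambda / 2m) * ||w||_2^2, added to the cost J."""
    return lam / (2 * m) * np.dot(w, w)

def l2_penalty_grad(w, lam, m):
    """Contribution to dJ/dw: (lambda / m) * w, which shrinks the weights each step."""
    return lam / m * w

def l1_penalty(w, lam, m):
    """(lambda / 2m) * ||w||_1, which pushes many weights to exactly zero."""
    return lam / (2 * m) * np.sum(np.abs(w))
```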
History of backpropagation
Popular explanation: it could not make use of multiple hidden layers; it did not work well in recurrent networks or deep auto-encoders; SVMs worked better, required less expertise, and had a fancier theory.
Real reasons: computers were too slow, labeled data sets were too small, and deep networks were too small.
Continuum between statistics and AI:
- Low-dimensional data (< 100 dimensions) vs. high-dimensional data
- Lots of noise vs. noise is not the problem
- Not much structure in the data vs. a huge amount of structure, but too complicated
- Main problem: separate structure from noise vs. find a way of representing the complicated structure
- SVM view 1: just a clever reincarnation of perceptrons: expand the input into a layer of non-linear, non-adaptive features; use only one layer of adaptive weights; a very efficient way of fitting the weights that controls overfitting.
- SVM view 2: each input vector defines a non-adaptive feature; a clever way of simultaneously doing feature selection and finding weights on the remaining features.
- Either way, multiple layers of features can't be learned with SVMs.