Backpropagation
We can compute how fast the error changes as each hidden activity is changed: first get the error derivatives w.r.t. the hidden activities, then convert these into error derivatives w.r.t. the weights.
- Convert the discrepancy between output and target into an error derivative (differentiate $E$ w.r.t. the output).
- Compute error derivatives in each hidden layer from error derivatives in the layer above.
- Then use the error derivatives w.r.t. activities to get error derivatives w.r.t. the incoming weights.
Starting point: $\frac{\partial E}{\partial y_j}$, the error derivative w.r.t. the output of unit $j$.
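For a logistic unit $j$ receiving input from units $i$ in the layer below, the three steps above give the following standard recursion (a sketch; the index notation $y_i$, $w_{ij}$, $z_j$ is assumed here rather than defined elsewhere in these notes):

\begin{align} z_j &= \sum_i y_i w_{ij}, \quad y_j = \sigma(z_j)\\ \frac{\partial E}{\partial z_j} &= \frac{dy_j}{dz_j} \frac{\partial E}{\partial y_j} = y_j(1-y_j) \frac{\partial E}{\partial y_j}\\ \frac{\partial E}{\partial y_i} &= \sum_j \frac{\partial z_j}{\partial y_i} \frac{\partial E}{\partial z_j} = \sum_j w_{ij} \frac{\partial E}{\partial z_j}\\ \frac{\partial E}{\partial w_{ij}} &= \frac{\partial z_j}{\partial w_{ij}} \frac{\partial E}{\partial z_j} = y_i \frac{\partial E}{\partial z_j} \end{align}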
Example
Input values: $x_1 = 2, x_2 = 5, x_3 = 3$
True weights (the ones that generate the target): $w_1 = 150, w_2 = 50, w_3 = 100$
Target: $t = 850$.
Delta rule for learning: $\Delta w_i = \epsilon x_i(t-y)$.
Starting from an initial guess of $50$ for every weight, the prediction is $y = 500$ and the residual is $t - y = 350$. With $\epsilon = 1/35$ each weight changes by $\Delta w_i = 10 x_i$, giving $w^*_1 = 70, w^*_2 = 100, w^*_3 = 80$.
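A minimal Python sketch of this single update step (the initial guess of 50 per weight is inferred from the numbers above):

```python
# One delta-rule step for a linear neuron: Delta w_i = eps * x_i * (t - y)
x = [2, 5, 3]            # input values
w = [50.0, 50.0, 50.0]   # initial guess for the weights
t = 850                  # target
eps = 1 / 35             # learning rate

y = sum(wi * xi for wi, xi in zip(w, x))                   # prediction: 500
w_new = [wi + eps * xi * (t - y) for wi, xi in zip(w, x)]  # residual: 350
print(w_new)  # [70.0, 100.0, 80.0]
```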
Deriving the delta rule
Error $E$ is squared residuals summed up over all training cases. $E = \frac{1}{2} \sum\limits_{n \in \text{training}} (t^n-y^n)^2$
Differentiate $E$ w.r.t. the weights to get the error derivatives for the weights: $\frac{\partial{E}}{\partial{w_i}} = \sum\limits_{n} \frac{\partial y^n}{\partial w_i} \frac{dE^n}{dy^n} = - \sum\limits_{n} x_i^n(t^n-y^n)$
Batch delta rule: change the weights in proportion to the error derivatives summed over all training cases: $\Delta w_i = - \epsilon \frac{\partial{E}}{\partial{w_i}}$
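A minimal numpy sketch of the batch rule for a linear neuron (the function name and array shapes are illustrative choices, not from the notes):

```python
import numpy as np

def batch_delta_rule_step(w, X, t, eps):
    """One full-batch update: Delta w = -eps * dE/dw = eps * sum_n x^n (t^n - y^n)."""
    y = X @ w                # predictions for all N training cases, X is (N, d)
    grad = -X.T @ (t - y)    # dE/dw summed over the whole training set
    return w - eps * grad

# The single-case example above: one step from the initial guess recovers 70, 100, 80
w = batch_delta_rule_step(np.full(3, 50.0), np.array([[2.0, 5.0, 3.0]]),
                          np.array([850.0]), eps=1 / 35)
print(w)  # [ 70. 100.  80.]
```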
Deriving backpropagation for a small network
\begin{align} z_1 &= w_1 x_1 + w_2 x_2\\ z_2 &= w_3 x_2 + w_4 x_3\\ h_1 &= \sigma(z_1)\\ h_2 &= \sigma(z_2)\\ y &= u_1 h_1 + u_2 h_2\\ E &= \frac{1}{2} (t-y)^2\\ \sigma(x) &= \frac{1}{1+e^{-x}} \end{align}
Suppose two weights are tied, so that $w_2 = w_3 = w_{tied}$.
Then $\frac{\partial E}{\partial w_{tied}} = \frac{\partial E}{\partial w_{2}} + \frac{\partial E}{\partial w_{3}}$.
What is $\frac{\partial E}{\partial w_{2}}$?
Repeated application of the chain rule:
$\frac{\partial E}{\partial w_{2}} = \frac{\partial E}{\partial y} \frac{\partial y}{\partial h_1} \frac{\partial h_1}{\partial z_1} \frac{\partial z_1}{\partial w_2}$
$\frac{\partial E}{\partial y} = -(t-y)$
$\frac{\partial y}{\partial h_1} = u_1$
$\frac{\partial h_1}{\partial z_1} = h_1(1-h_1)$
$\frac{\partial z_1}{\partial w_2} = x_2$
$\frac{\partial E}{\partial w_{2}} = -(t-y) u_1 h_1(1-h_1) x_2$
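A small sketch that checks this chain-rule result against a finite-difference estimate, using the network defined above (the numeric values are arbitrary, chosen only for illustration):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def forward(w, u, x, t):
    # The network from the derivation above
    z1 = w[0] * x[0] + w[1] * x[1]
    z2 = w[2] * x[1] + w[3] * x[2]
    h1, h2 = sigmoid(z1), sigmoid(z2)
    y = u[0] * h1 + u[1] * h2
    E = 0.5 * (t - y) ** 2
    return E, y, h1

w = np.array([0.1, -0.2, 0.3, 0.4])   # w1..w4
u = np.array([0.5, -0.6])             # u1, u2
x = np.array([1.0, 2.0, -1.0])        # x1, x2, x3
t = 1.0

E, y, h1 = forward(w, u, x, t)
grad_w2 = -(t - y) * u[0] * h1 * (1 - h1) * x[1]   # the chain-rule expression

# Finite-difference check on w2
eps = 1e-6
wp, wm = w.copy(), w.copy()
wp[1] += eps
wm[1] -= eps
numeric = (forward(wp, u, x, t)[0] - forward(wm, u, x, t)[0]) / (2 * eps)
print(grad_w2, numeric)  # the two values should agree closely
```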
Additional issues
Optimization issues
- How often to update the weights?
- Online: After each training case.
- Full batch: After a full sweep through the training data (bigger step, minimizes overall training error).
- Mini-batch: Small sample of training cases (a little bit of zig-zag; see the sketch after this list).
- How much to update?
- Fixed learning rate? Adapt the global learning rate? Adapt the learning rate on each connection separately? Don't use steepest descent at all?
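A small sketch of the mini-batch variant for a linear neuron (all names and defaults are illustrative, not from the notes); batch_size=1 gives online learning and batch_size=len(X) gives the full-batch rule:

```python
import numpy as np

def minibatch_delta_rule(X, t, eps=0.01, batch_size=10, epochs=100, seed=0):
    """Mini-batch delta-rule training for a linear neuron."""
    rng = np.random.default_rng(seed)
    w = np.zeros(X.shape[1])
    for _ in range(epochs):
        order = rng.permutation(len(X))
        for start in range(0, len(X), batch_size):
            idx = order[start:start + batch_size]
            y = X[idx] @ w
            w += eps * X[idx].T @ (t[idx] - y)   # Delta w = eps * sum x (t - y)
        # a fixed eps is the simplest choice; adapting it is one of the options above
    return w
```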
Overfitting (How well does the network generalize?)
- Target values may be unreliable.
- Sampling errors (accidental regularities of the particular training cases).
Regularization Methods:
- Weight decay (small weights, simpler model)
- Weight-sharing (same weights)
- Early stopping (hold out a validation set; stop training when performance on it gets worse)
- Model averaging
- Bayesian fitting (like model averaging)
- Dropout (randomly omit hidden units; see the sketch after this list)
- Generative pre-training
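A minimal sketch of (inverted) dropout on a vector of hidden activities; the function name, defaults, and the rescaling convention are my own choices, not from the notes:

```python
import numpy as np

def dropout(h, p_drop=0.5, rng=None):
    """Randomly omit hidden units: zero each activity in h (a numpy array) with probability p_drop."""
    rng = rng or np.random.default_rng()
    mask = rng.random(h.shape) >= p_drop   # keep each unit with probability 1 - p_drop
    return h * mask / (1.0 - p_drop)       # rescale so the expected activity is unchanged
```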
Weight decay
$L_2$ regularization:
E.g. for logistic regression, add to the cost function $J$: $\dots + \frac{\lambda}{2m} ||w||^2_2$, where $||w||^2_2 = \sum_{j=1}^{n_x} w_j^2 = w^T w$
$L_1$ regularization:
$\frac{\lambda}{2m} ||w||_1$
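A small sketch of the corresponding penalty terms and the $L_2$ gradient contribution, following the $\frac{\lambda}{2m}$ convention above (the logistic-regression cost $J$, $m$ training examples, and the function names are assumptions for illustration):

```python
import numpy as np

def l2_penalty(w, lam, m):
    """Weight decay: (lambda / 2m) * ||w||_2^2, added to the cost J."""
    return lam / (2 * m) * np.dot(w, w)

def l2_penalty_grad(w, lam, m):
    """Contribution to dJ/dw: (lambda / m) * w, which shrinks the weights each step."""
    return lam / m * w

def l1_penalty(w, lam, m):
    """(lambda / 2m) * ||w||_1, which pushes many weights to exactly zero."""
    return lam / (2 * m) * np.sum(np.abs(w))
```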
History of backpropagation
Popular explanation: it could not make use of multiple hidden layers; it did not work well in recurrent networks or deep auto-encoders; SVMs worked better, required less expertise, and had a fancier theory.
Real reasons: computers were too slow, labeled data sets were too small, and deep networks were too small.
Continuum between statistics and AI:
- Low-dimensional data (< 100 dimensions) vs. high-dimensional data
- Lots of noise vs. noise is not the problem
- Not much structure in the data vs. a huge amount of structure, but too complicated
- Main problem: separate structure from noise vs. find a way of representing the complicated structure
- SVM view 1: just a clever reincarnation of perceptrons: expand the input into a layer of non-linear, non-adaptive features; use only one layer of adaptive weights; a very efficient way of fitting the weights that controls overfitting.
- SVM view 2: each input vector defines a non-adaptive feature; a clever way of simultaneously doing feature selection and finding weights on the remaining features.
- Either way, multiple layers of features can't be learned with SVMs.