# Backpropagation

We can compute *how fast the error changes* as each hidden activity changes, using error derivatives w.r.t. hidden activities, and then convert those into error derivatives for the weights:

- Convert the discrepancy between output and target into an error derivative (differentiate $E$ by the output).
- Compute error derivatives in each hidden layer from the error derivatives in the layer above.
- Then use the **error derivatives** w.r.t. activities to get error derivatives w.r.t. the incoming **weights**.

The key quantity is $\frac{\partial E}{\partial y_j}$, the error derivative w.r.t. the activity $y_j$ of a unit.
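These three steps can be sketched for a single hidden layer of logistic units. This is a minimal sketch: all shapes and values, and the choice of a logistic output unit, are assumptions, not from the notes above.

```python
# Sketch of the three backprop steps for one hidden layer of logistic units.
# All values are made up; W maps inputs to hidden units, U maps hidden to output.
import numpy as np

sigma = lambda z: 1 / (1 + np.exp(-z))

x = np.array([0.5, -1.0])                              # input (assumed)
W = np.array([[0.1, 0.2], [0.3, -0.4], [-0.5, 0.6]])   # 3 hidden units
U = np.array([[0.7, -0.8, 0.9]])                       # 1 output unit
t = np.array([1.0])                                    # target

# forward pass
h = sigma(W @ x)
y = sigma(U @ h)

# 1. discrepancy between output and target -> error derivative (E = 1/2 (t-y)^2)
dE_dy = -(t - y)
# 2. error derivatives in the hidden layer from the layer above
dE_dz_out = dE_dy * y * (1 - y)     # back through the output logistic
dE_dh = U.T @ dE_dz_out
# 3. derivatives w.r.t. activities -> derivatives w.r.t. incoming weights
dE_dz_hid = dE_dh * h * (1 - h)
dE_dW = np.outer(dE_dz_hid, x)
dE_dU = np.outer(dE_dz_out, h)
```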

## Sample

Input values: $x_1 = 2, x_2 = 5, x_3 = 3$

True weights: $w_1 = 150, w_2 = 50, w_3 = 100$, so the target is $t = 850$. Start from an initial guess of $50$ for each weight, which gives $y = 500$.

Delta rule for learning: $\Delta w_i = \epsilon x_i(t-y)$. With $\epsilon = 1/35$, one update gives:

$w^*_1 = 70, w^*_2 = 100, w^*_3 = 80$
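A minimal sketch of this update in Python, assuming initial weights of 50 each (the value that reproduces $w^*_1 = 70, w^*_2 = 100, w^*_3 = 80$):

```python
# Delta-rule update for the worked example.
# Assumption: initial weights of 50 each, which reproduce the
# updated weights 70, 100, 80 exactly.
x = [2, 5, 3]            # input values x_1, x_2, x_3
w = [50.0, 50.0, 50.0]   # initial weight guess
t = 850                  # target
eps = 1 / 35             # learning rate

y = sum(wi * xi for wi, xi in zip(w, x))               # predicted output: 500
w = [wi + eps * xi * (t - y) for wi, xi in zip(w, x)]  # delta rule
print(w)  # ≈ [70, 100, 80]
```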

## Deriving the delta rule

Error $E$ is squared residuals summed up over all training cases. $E = \frac{1}{2} \sum\limits_{n \in \text{training}} (t^n-y^n)^2$

Differentiate $E$ by the weights to get error derivatives for the weights (the $\frac{1}{2}$ cancels the 2 from differentiating the square): $\frac{\partial{E}}{\partial{w_i}} = \sum\limits_{n} \frac{\partial y^n}{\partial w_i} \frac{dE^n}{dy^n} = - \sum\limits_{n} x_i^n(t^n-y^n)$

Batch delta rule: change the weights in proportion to the error derivatives summed over all training cases: $\Delta w_i = - \epsilon \frac{\partial{E}}{\partial{w_i}}$
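A sketch of the batch rule for a linear neuron. The second training case and its target below are made up, chosen to be consistent with the true weights (150, 50, 100):

```python
import numpy as np

# Batch delta rule for a linear neuron.
# The second training case is an assumed extra example.
X = np.array([[2.0, 5.0, 3.0],
              [1.0, 2.0, 1.0]])   # training cases (one per row)
t = np.array([850.0, 350.0])     # targets
w = np.zeros(3)
eps = 0.02

for _ in range(10000):
    y = X @ w                    # outputs for all training cases
    dE_dw = -(X.T @ (t - y))     # dE/dw_i = -sum_n x_i^n (t^n - y^n)
    w -= eps * dE_dw             # batch delta rule: Δw = -eps dE/dw
```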

## More deriving

\begin{align} z_1 &= w_1 x_1 + w_2 x_2\\ z_2 &= w_3 x_2 + w_4 x_3\\ h_1 &= \sigma(z_1)\\ h_2 &= \sigma(z_2)\\ y &= u_1 h_1 + u_2 h_2\\ E &= \frac{1}{2} (t-y)^2\\ \sigma(x) &= \frac{1}{1+e^{-x}} \end{align}

The weights are tied so that $w_2 = w_3$ (both connections share one weight $w_{tied}$).

What is $\frac{\partial E}{\partial w_{tied}}$? For a tied weight, the derivatives add up: $\frac{\partial E}{\partial w_{tied}} = \frac{\partial E}{\partial w_{2}} + \frac{\partial E}{\partial w_{3}}$

What is $\frac{\partial E}{\partial w_{2}}$?

By repeated application of the chain rule:

$\frac{\partial E}{\partial w_{2}} = \frac{\partial E}{\partial y} \frac{\partial y}{\partial h_1} \frac{\partial h_1}{\partial z_1} \frac{\partial z_1}{\partial w_2}$

$\frac{\partial E}{\partial y} = -(t-y)$

$\frac{\partial y}{\partial h_1} = u_1$

$\frac{\partial h_1}{\partial z_1} = h_1(1-h_1)$

$\frac{\partial z_1}{\partial w_2} = x_2$

$\frac{\partial E}{\partial w_{2}} = -(t-y) u_1 h_1(1-h_1) x_2$
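The derivation can be checked numerically by comparing the chain-rule expression against a finite-difference estimate. All parameter and input values below are made up:

```python
# Check dE/dw2 = -(t - y) * u1 * h1 * (1 - h1) * x2 for the small
# network defined above (hypothetical parameter values).
import math

def forward(w1, w2, w3, w4, u1, u2, x1, x2, x3, t):
    sig = lambda z: 1 / (1 + math.exp(-z))
    h1 = sig(w1 * x1 + w2 * x2)
    h2 = sig(w3 * x2 + w4 * x3)
    y = u1 * h1 + u2 * h2
    return 0.5 * (t - y) ** 2, y, h1

params = dict(w1=0.1, w2=0.2, w3=-0.3, w4=0.4, u1=0.5, u2=-0.6,
              x1=1.0, x2=2.0, x3=-1.0, t=1.0)

E, y, h1 = forward(**params)
analytic = -(params['t'] - y) * params['u1'] * h1 * (1 - h1) * params['x2']

# finite-difference estimate of dE/dw2
d = 1e-6
p = dict(params); p['w2'] += d
numeric = (forward(**p)[0] - E) / d
print(abs(analytic - numeric) < 1e-4)  # True: the gradients agree
```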

## Additional issues

### Optimization issues

- How often should the weights be updated?
  - Online: after each training case.
  - Full batch: after a full sweep through the training data (bigger step, minimizes the overall training error).
  - Mini-batch: after a small sample of training cases (a little bit of zig-zag).
- How much should each update change the weights?
  - Use a fixed learning rate, adapt a global learning rate, adapt the learning rate on each connection separately, or don't use steepest descent at all.
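The three update schedules can be sketched for a linear neuron. Everything below is toy data; the mean gradient is used (rather than the sum) so one learning rate fits all three schedules:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))              # 100 toy training cases (assumed)
t = X @ np.array([150.0, 50.0, 100.0])     # targets from made-up "true" weights
eps = 0.1

def grad(w, Xb, tb):
    # mean of dE/dw over the given cases (sum / batch size),
    # so the same eps works for every schedule
    return -(Xb.T @ (tb - Xb @ w)) / len(tb)

w = np.zeros(3)

# online: update after each training case
for xi, ti in zip(X, t):
    w -= eps * grad(w, xi[None, :], np.array([ti]))

# full batch: one bigger step after a sweep through all cases
w -= eps * grad(w, X, t)

# mini-batch: small random samples of cases (a little zig-zag)
for _ in range(200):
    idx = rng.choice(len(X), size=10, replace=False)
    w -= eps * grad(w, X[idx], t[idx])
```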

### Overfitting (How well does the network generalize?)

### History of backpropagation

Popular explanation for why it fell out of favor: it could not make use of multiple hidden layers, and it did not work well in RNNs or deep auto-encoders. SVMs worked better, required less expertise, and had fancier theory.

Real reasons: computers were too slow, labeled data sets were too small, and deep networks were too small.

Continuum - Statistics vs AI:

- Low-dimensional data (< 100 dimensions) vs. high-dimensional data.
- lots of noise vs. noise is not the problem
- Not much structure in data vs. huge amount of structure, but too complicated.
- Main problem: separate structure from noise vs. find a way of representing the complicated structure.

- SVM view 1: just a clever reincarnation of perceptrons. Expand the input into a layer of non-linear, non-adaptive features, keep only one layer of adaptive weights, and fit those weights very efficiently in a way that controls overfitting.
- SVM view 2: each input vector defines a non-adaptive feature. A clever way of simultaneously doing feature selection and finding weights on the remaining features.
- Either way, SVMs cannot learn multiple layers of features.