
Backpropagation

We can compute how fast the error changes as each hidden activity is changed.

Backpropagation works with error derivatives w.r.t. the hidden activities, and then converts these into error derivatives w.r.t. the weights.

  1. Convert the discrepancy between the output and the target into an error derivative (differentiate $E$ with respect to the output activity).
  2. Compute error derivatives in each hidden layer from the error derivatives in the layer above.
  3. Then use the error derivatives w.r.t. the activities to get error derivatives w.r.t. the incoming weights.

The key quantity is the error derivative w.r.t. the activity of a unit, $\frac{\partial E}{\partial y_j}$; a sketch of the three steps follows.
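Below is a minimal NumPy sketch of these three steps for a network with one layer of logistic hidden units feeding a linear output unit. The weight names (`W1`, `u`), shapes, and values are illustrative assumptions, not something specified in these notes.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Forward pass (illustrative values and shapes)
x = np.array([2.0, 5.0, 3.0])        # input activities
W1 = np.random.randn(2, 3) * 0.1     # input -> hidden weights (assumed names)
u = np.random.randn(2) * 0.1         # hidden -> output weights (assumed names)
t = 850.0                            # target

h = sigmoid(W1 @ x)                  # hidden activities
y = u @ h                            # linear output
E = 0.5 * (t - y) ** 2               # squared error

# Step 1: error derivative at the output
dE_dy = -(t - y)

# Step 2: error derivatives w.r.t. hidden activities, from the layer above
dE_dh = dE_dy * u

# Step 3: error derivatives w.r.t. the incoming weights
dE_dz = dE_dh * h * (1 - h)          # back through the logistic nonlinearity
dE_dW1 = np.outer(dE_dz, x)          # gradients for the input -> hidden weights
dE_du = dE_dy * h                    # gradients for the hidden -> output weights
```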

Sample

Input values: $x_1 = 2, x_2 = 5, x_3 = 3$

True weights (the ones that generate the target): $w_1 = 150, w_2 = 50, w_3 = 100$, so the target is $t = 2 \cdot 150 + 5 \cdot 50 + 3 \cdot 100 = 850$.

Starting from an initial guess of $50$ for every weight, the output is $y = 2 \cdot 50 + 5 \cdot 50 + 3 \cdot 50 = 500$, so the residual is $t - y = 350$.

Delta rule for learning: $\Delta w_i = \epsilon x_i(t-y)$. With $\epsilon = 1/35$ the updated weights are:

$w^*_1 = 70, w^*_2 = 100, w^*_3 = 80$
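The update can be checked in a few lines of NumPy; the initial weights of 50 are the ones implied above, everything else is taken from the example.

```python
import numpy as np

x = np.array([2.0, 5.0, 3.0])      # input values
w = np.array([50.0, 50.0, 50.0])   # initial weight guesses
t = 850.0                          # target
eps = 1.0 / 35.0                   # learning rate

y = w @ x                          # predicted output: 500.0
w_new = w + eps * x * (t - y)      # delta rule: w_i + eps * x_i * (t - y)
print(w_new)                       # [ 70. 100.  80.]
```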

Deriving the delta rule

The error $E$ is the sum of squared residuals over all training cases (with a convenient factor of $\frac{1}{2}$): $E = \frac{1}{2} \sum\limits_{n \in \text{training}} (t^n-y^n)^2$

Differentiate $E$ with respect to the weights to get the error derivatives for the weights (where $E^n = \frac{1}{2}(t^n-y^n)^2$ is the error on case $n$): $\frac{\partial{E}}{\partial{w_i}} = \sum\limits_{n} \frac{\partial y^n}{\partial w_i} \frac{dE^n}{dy^n} = - \sum\limits_{n} x_i^n(t^n-y^n)$

Batch delta rule: change the weights in proportion to the error derivatives summed over all training cases: $\Delta w_i = - \epsilon \frac{\partial{E}}{\partial{w_i}}$
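A short NumPy sketch of the batch update using the gradient formula above; the second training case (its inputs and target) is made up for illustration, chosen to be consistent with the same true prices of 150, 50, and 100.

```python
import numpy as np

X = np.array([[2.0, 5.0, 3.0],     # training case 1 (from the example)
              [1.0, 4.0, 2.0]])    # training case 2 (made up)
t = np.array([850.0, 550.0])       # targets: X @ [150, 50, 100]
w = np.array([50.0, 50.0, 50.0])   # initial weights
eps = 1.0 / 35.0                   # learning rate

y = X @ w                          # outputs for all training cases
dE_dw = -X.T @ (t - y)             # dE/dw_i = -sum_n x_i^n (t^n - y^n)
w = w - eps * dE_dw                # batch delta rule
```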

More deriving

Consider a small network with two logistic hidden units and a linear output:

\begin{align}
z_1 &= w_1 x_1 + w_2 x_2\\
z_2 &= w_3 x_2 + w_4 x_3\\
h_1 &= \sigma(z_1)\\
h_2 &= \sigma(z_2)\\
y &= u_1 h_1 + u_2 h_2\\
E &= \frac{1}{2} (t-y)^2\\
\sigma(x) &= \frac{1}{1+e^{-x}}
\end{align}

Suppose the weights are tied, so that $w_2 = w_3$.

Then $\frac{\partial E}{\partial w_{\text{tied}}} = \frac{\partial E}{\partial w_{2}} + \frac{\partial E}{\partial w_{3}}$.

What is $\frac{\partial E}{\partial w_{2}}$?

Apply the chain rule repeatedly:

$\frac{\partial E}{\partial w_{2}} = \frac{\partial E}{\partial y} \frac{\partial y}{\partial h_1} \frac{\partial h_1}{\partial z_1} \frac{\partial z_1}{\partial w_2}$

$\frac{\partial E}{\partial y} = -(t-y)$

$\frac{\partial y}{\partial h_1} = u_1$

$\frac{\partial h_1}{\partial z_1} = h_1(1-h_1)$
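This is the derivative of the logistic function, spelled out here (a standard identity not written in the original notes): $\frac{d\sigma}{dz} = \frac{d}{dz}\frac{1}{1+e^{-z}} = \frac{e^{-z}}{(1+e^{-z})^2} = \sigma(z)\bigl(1-\sigma(z)\bigr)$, and since $h_1 = \sigma(z_1)$ this equals $h_1(1-h_1)$.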

$\frac{\partial z_1}{\partial w_2} = x_2$

$\frac{\partial E}{\partial w_{2}} = -(t-y) u_1 h_1(1-h_1) x_2$
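A quick numerical sanity check of this expression: the analytic gradient from the chain-rule product should match a finite-difference estimate of $\partial E / \partial w_2$. The concrete input values, weights, and target below are arbitrary illustrative choices, not taken from the notes.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def forward(w, u, x, t):
    # Forward pass of the small network defined above.
    z1 = w[0] * x[0] + w[1] * x[1]
    z2 = w[2] * x[1] + w[3] * x[2]
    h1, h2 = sigmoid(z1), sigmoid(z2)
    y = u[0] * h1 + u[1] * h2
    return 0.5 * (t - y) ** 2, h1, y

# Arbitrary example values; w[1] and w[2] are set equal (the tied weights w_2 = w_3).
x = np.array([0.5, -1.0, 2.0])
w = np.array([0.1, 0.3, 0.3, -0.2])
u = np.array([0.7, -0.4])
t = 1.0

E, h1, y = forward(w, u, x, t)

# Analytic gradient from the chain rule: dE/dw_2 = -(t - y) * u_1 * h_1 * (1 - h_1) * x_2
dE_dw2 = -(t - y) * u[0] * h1 * (1 - h1) * x[1]

# Finite-difference estimate: perturb w_2 slightly and re-run the forward pass.
eps = 1e-6
w_plus = w.copy()
w_plus[1] += eps
E_plus, _, _ = forward(w_plus, u, x, t)
print(dE_dw2, (E_plus - E) / eps)   # the two numbers should agree closely
```

The same check works for $\frac{\partial E}{\partial w_{3}}$ (with $\frac{\partial z_2}{\partial w_3} = x_2$ as well), and the sum of the two gradients gives the tied-weight gradient from above.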

Additional issues

Optimization issues: how to use the error derivatives to update the weights, e.g. how often and by how much.

Overfitting (How well does the network generalize?)

See Overfitting & Parameter tuning

History of backpropagation

Popular explanation: it could not make use of multiple hidden layers; it did not work well in recurrent nets or deep auto-encoders; SVMs worked better, required less expertise, and had a fancier theory.

Real reasons: Computers too slow; labeled data sets too small; deep networks too small.

Continuum - Statistics vs AI: