We can compute how fast the error changes as we change a hidden activity: first compute the error derivatives w.r.t. the hidden activities, $\frac{\partial E}{\partial y_j}$, and then convert these into error derivatives w.r.t. the weights.
Input values: $x_1 = 2, x_2 = 5, x_3 = 3$
True weights (unknown to the learner): $w_1 = 150, w_2 = 50, w_3 = 100$
Target: $t = 850$ (since $2 \cdot 150 + 5 \cdot 50 + 3 \cdot 100 = 850$).
Delta rule for learning: $\Delta w_i = \epsilon x_i(t-y)$. With $\epsilon = 1/35$ and initial guesses of $w_i = 50$ for every weight, the prediction is $y = 500$, the residual is $t - y = 350$, and the updated weights are:
$w^*_1 = 70, w^*_2 = 100, w^*_3 = 80$
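A minimal sketch of this single update in Python, using only the numbers above (variable names are illustrative):

```python
# One delta-rule update, reproducing the numbers above.
x = [2, 5, 3]             # input values
w = [50.0, 50.0, 50.0]    # initial weight guesses
t = 850                   # target
eps = 1 / 35              # learning rate

y = sum(wi * xi for wi, xi in zip(w, x))               # prediction: 500.0
w = [wi + eps * xi * (t - y) for wi, xi in zip(w, x)]  # delta rule
print(w)                                               # ≈ [70.0, 100.0, 80.0]
```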
The error $E$ is half the sum of squared residuals over all training cases: $E = \frac{1}{2} \sum\limits_{n \in \text{training}} (t^n-y^n)^2$
Differentiate $E$ w.r.t. the weights to get the error derivatives for the weights: $\frac{\partial{E}}{\partial{w_i}} = \sum\limits_{n} \frac{\partial y^n}{\partial w_i} \frac{dE^n}{dy^n} = - \sum\limits_{n} x_i^n(t^n-y^n)$
Batch delta rule: change the weights in proportion to the error derivatives summed over all training cases: $\Delta w_i = - \epsilon \frac{\partial{E}}{\partial{w_i}}$
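A sketch of the batch rule for a single linear neuron (the helper function and data layout are assumptions, not from the notes):

```python
def batch_delta_rule(X, T, w, eps, epochs):
    """Train one linear neuron with the batch delta rule.
    X: list of input vectors, T: list of targets, w: initial weights."""
    for _ in range(epochs):
        grad = [0.0] * len(w)                 # accumulate dE/dw_i over all cases
        for x, t in zip(X, T):
            y = sum(wi * xi for wi, xi in zip(w, x))
            for i, xi in enumerate(x):
                grad[i] += -xi * (t - y)      # dE/dw_i = -sum_n x_i^n (t^n - y^n)
        w = [wi - eps * g for wi, g in zip(w, grad)]  # Δw_i = -ε dE/dw_i
    return w
```

With a single training case the batch rule reduces to the online update above: `batch_delta_rule([[2, 5, 3]], [850], [50.0, 50.0, 50.0], 1/35, 1)` returns the same `[70.0, 100.0, 80.0]`.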
A small example network with two hidden units:
\begin{align} z_1 &= w_1 x_1 + w_2 x_2\\ z_2 &= w_3 x_2 + w_4 x_3\\ h_1 &= \sigma(z_1)\\ h_2 &= \sigma(z_2)\\ y &= u_1 h_1 + u_2 h_2\\ E &= \frac{1}{2} (t-y)^2\\ \sigma(x) &= \frac{1}{1+e^{-x}} \end{align}
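These equations transcribe directly into a forward pass (a sketch; all parameter values are supplied by the caller):

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def forward(x, w, u, t):
    """x = (x1, x2, x3), w = (w1, w2, w3, w4), u = (u1, u2)."""
    z1 = w[0] * x[0] + w[1] * x[1]
    z2 = w[2] * x[1] + w[3] * x[2]
    h1, h2 = sigmoid(z1), sigmoid(z2)
    y = u[0] * h1 + u[1] * h2
    E = 0.5 * (t - y) ** 2
    return y, E
```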
Tie the weights so that $w_2 = w_3$ (both multiply $x_2$). The derivative w.r.t. the tied weight is the sum of the two individual derivatives: $\frac{\partial E}{\partial w_{\text{tied}}} = \frac{\partial E}{\partial w_{2}} + \frac{\partial E}{\partial w_{3}}$
What is $\frac{\partial E}{\partial w_{2}}$? Apply the chain rule repeatedly:
$\frac{\partial E}{\partial w_{2}} = \frac{\partial E}{\partial y} \frac{\partial y}{\partial h_1} \frac{\partial h_1}{\partial z_1} \frac{\partial z_1}{\partial w_2}$
$\frac{\partial E}{\partial y} = -(t-y)$
$\frac{\partial y}{\partial h_1} = u_1$
$\frac{\partial h_1}{\partial z_1} = h_1(1-h_1)$
$\frac{\partial z_1}{\partial w_2} = x_2$
$\frac{\partial E}{\partial w_{2}} = -(t-y) u_1 h_1(1-h_1) x_2$
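A sketch that checks this chain-rule product against a finite difference (the input and weight values are made up for illustration):

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def network(w2, x=(1.0, 0.5, -1.0), w1=0.1, w3=0.3, w4=-0.2,
            u1=0.7, u2=-0.4, t=1.0):
    """Return (E, y, h1) for the network above, as a function of w2."""
    z1 = w1 * x[0] + w2 * x[1]
    z2 = w3 * x[1] + w4 * x[2]
    h1, h2 = sigmoid(z1), sigmoid(z2)
    y = u1 * h1 + u2 * h2
    return 0.5 * (t - y) ** 2, y, h1

w2, x2, u1, t = 0.2, 0.5, 0.7, 1.0
E, y, h1 = network(w2)
analytic = -(t - y) * u1 * h1 * (1 - h1) * x2           # chain-rule product
numeric = (network(w2 + 1e-6)[0] - network(w2 - 1e-6)[0]) / 2e-6
print(analytic, numeric)                                # the two agree closely
```

The tied-weight derivative above could be checked the same way by perturbing $w_2$ and $w_3$ together.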
Popular explanation for why backpropagation fell out of favor: it could not make use of multiple hidden layers; it did not work well in recurrent networks or deep auto-encoders; SVMs worked better, required less expertise, and came with fancier theory.
Real reasons: computers were too slow, labeled data sets were too small, and deep networks were too small.
Continuum - Statistics vs AI: