====== Backpropagation ======

We can compute //how fast the error changes// as a hidden activity is changed: first get the error derivatives w.r.t. the hidden activities, then convert them into error derivatives w.r.t. the weights.

  - Convert the discrepancy between the output and the target into an error derivative (differentiate $E$ w.r.t. the output): $\frac{\partial E}{\partial y_j} = -(t_j - y_j)$
  - Compute the error derivatives in each hidden layer from the error derivatives in the layer above.
  - Then use the **error derivatives** w.r.t. the activities to get the error derivatives w.r.t. the incoming **weights**.

===== Sample =====

Input values: $x_1 = 2,\; x_2 = 5,\; x_3 = 3$. The true weights $w_1 = 150,\; w_2 = 50,\; w_3 = 100$ produce the target $t = 850$.

Delta rule for learning: $\Delta w_i = \epsilon\, x_i (t-y)$

Starting from an initial guess of $w_1 = w_2 = w_3 = 50$ (which gives $y = 500$) and using $\epsilon = 1/35$, one update yields $w^*_1 = 70,\; w^*_2 = 100,\; w^*_3 = 80$. (A runnable sketch of this update is at the end of this page.)

===== Deriving the delta rule =====

The error $E$ is the squared residual summed over all training cases:

$E = \frac{1}{2} \sum\limits_{n \in \text{training}} (t^n-y^n)^2$

Differentiate $E$ w.r.t. the weights to get the error derivatives for the weights (chain rule, with $y^n = \sum_i w_i x_i^n$):

$\frac{\partial{E}}{\partial{w_i}} = \sum\limits_{n} \frac{\partial y^n}{\partial w_i} \frac{\partial E}{\partial y^n} = - \sum\limits_{n} x_i^n(t^n-y^n)$

The **batch delta rule** changes the weights in proportion to the error derivatives summed over all training cases:

$\Delta w_i = - \epsilon \frac{\partial{E}}{\partial{w_i}}$

===== More deriving =====

Consider a small network with two hidden units and a logistic non-linearity:

\begin{align}
z_1 &= w_1 x_1 + w_2 x_2\\
z_2 &= w_3 x_2 + w_4 x_3\\
h_1 &= \sigma(z_1)\\
h_2 &= \sigma(z_2)\\
y &= u_1 h_1 + u_2 h_2\\
E &= \frac{1}{2} (t-y)^2\\
\sigma(x) &= \frac{1}{1+e^{-x}}
\end{align}

Suppose the weights are tied so that $w_2 = w_3$. Then

$\frac{\partial E}{\partial w_{tied}} = \frac{\partial E}{\partial w_{2}} + \frac{\partial E}{\partial w_{3}}$

What is $\frac{\partial E}{\partial w_{2}}$? Repeated application of the chain rule:

$\frac{\partial E}{\partial w_{2}} = \frac{\partial E}{\partial y} \frac{\partial y}{\partial h_1} \frac{\partial h_1}{\partial z_1} \frac{\partial z_1}{\partial w_2}$

with

$\frac{\partial E}{\partial y} = -(t-y)$, $\quad \frac{\partial y}{\partial h_1} = u_1$, $\quad \frac{\partial h_1}{\partial z_1} = h_1(1-h_1)$, $\quad \frac{\partial z_1}{\partial w_2} = x_2$

so

$\frac{\partial E}{\partial w_{2}} = -(t-y)\, u_1\, h_1(1-h_1)\, x_2$

(A runnable sketch that checks this derivative numerically is at the end of this page.)

===== Additional issues =====

==== Optimization issues ====

  * How often should the weights be updated?
    * Online: after each training case.
    * Full batch: after a full sweep through the training data (bigger step, minimizes the overall training error).
    * Mini-batch: after a small sample of training cases (a little bit of zig-zag).
  * How much should be updated?
    * Use a fixed learning rate? Adapt the global learning rate? Adapt the learning rate on each connection separately? Don't use steepest descent at all?

==== Overfitting (How well does the network generalize?) ====

See [[data_mining:neural_network:overfitting|Overfitting & Parameter tuning]]

==== History of backpropagation ====

Popular explanation for why it fell out of favour: it could not make use of multiple hidden layers, it did not work well in recurrent networks or deep auto-encoders, and SVMs worked better, required less expertise and had a fancier theory.

Real reasons: computers were too slow, labeled data sets were too small, and deep networks were too small.

Continuum between statistics and AI:
  * Low-dimensional data (< 100 dim) vs. high-dimensional data (> 100 dim).
  * Lots of noise vs. noise is not the main problem.
  * Not much structure in the data vs. a huge amount of structure, but too complicated.
  * Main problem: separate the structure from the noise vs. find a way of representing the complicated structure.
  * SVM view 1: just a clever reincarnation of perceptrons. Expand the input into a layer of non-linear, non-adaptive features; use only one layer of adaptive weights, with a very efficient way of fitting the weights that controls overfitting.
  * SVM view 2: each input vector defines a non-adaptive feature. A clever way of simultaneously doing feature selection and finding weights on the remaining features.
  * Either way, you can't learn multiple layers of features with SVMs.
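===== Code sketches =====

A minimal sketch of the one-step delta-rule update from the //Sample// section above, assuming the initial weight guess $w_i = 50$ for all three weights and $\epsilon = 1/35$; all variable names are illustrative.

<code python>
# One delta-rule update for a linear neuron y = sum_i w_i * x_i.
# Assumes the initial guess w_i = 50 (implied by the result in the notes)
# and epsilon = 1/35 from the Sample section.

x = [2.0, 5.0, 3.0]        # input values x_1..x_3
w = [50.0, 50.0, 50.0]     # assumed initial weight guess
t = 850.0                  # target output
eps = 1.0 / 35.0           # learning rate epsilon

y = sum(wi * xi for wi, xi in zip(w, x))                   # y = 500
w_new = [wi + eps * xi * (t - y) for wi, xi in zip(w, x)]  # delta rule: w_i += eps * x_i * (t - y)

print(y)       # 500.0
print(w_new)   # [70.0, 100.0, 80.0]  -> matches w*_1 = 70, w*_2 = 100, w*_3 = 80
</code>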
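A second sketch for the //More deriving// section: it computes $\frac{\partial E}{\partial w_2}$, $\frac{\partial E}{\partial w_3}$ and the tied-weight gradient with the chain-rule formulas above, and checks $\frac{\partial E}{\partial w_2}$ against a finite difference. The concrete input, weight and target values are made up for illustration.

<code python>
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def forward(w1, w2, w3, w4, u1, u2, x1, x2, x3, t):
    """Forward pass of the two-hidden-unit network; returns (E, y, h1, h2)."""
    z1 = w1 * x1 + w2 * x2
    z2 = w3 * x2 + w4 * x3
    h1 = sigmoid(z1)
    h2 = sigmoid(z2)
    y = u1 * h1 + u2 * h2
    E = 0.5 * (t - y) ** 2
    return E, y, h1, h2

# Illustrative values (not taken from the notes); w2 == w3 because the weights are tied.
w1, w2, w3, w4 = 0.1, 0.2, 0.2, -0.3
u1, u2 = 0.5, -0.4
x1, x2, x3 = 1.0, 2.0, -1.0
t = 0.3

E, y, h1, h2 = forward(w1, w2, w3, w4, u1, u2, x1, x2, x3, t)

# Chain rule: dE/dw2 = -(t - y) * u1 * h1 * (1 - h1) * x2
dE_dw2 = -(t - y) * u1 * h1 * (1.0 - h1) * x2
# Same pattern through the other hidden unit: dE/dw3 = -(t - y) * u2 * h2 * (1 - h2) * x2
dE_dw3 = -(t - y) * u2 * h2 * (1.0 - h2) * x2
dE_dw_tied = dE_dw2 + dE_dw3    # gradient for the tied weight w2 = w3

# Finite-difference check of dE/dw2 (perturb only w2).
step = 1e-6
E_plus, _, _, _ = forward(w1, w2 + step, w3, w4, u1, u2, x1, x2, x3, t)
E_minus, _, _, _ = forward(w1, w2 - step, w3, w4, u1, u2, x1, x2, x3, t)
dE_dw2_numeric = (E_plus - E_minus) / (2.0 * step)

print(dE_dw2, dE_dw2_numeric)   # should agree to several decimal places
print(dE_dw_tied)
</code>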