====== Backpropagation ======

We can compute how fast the error changes as a hidden activity is changed.
We use the error derivatives w.r.t. the hidden activities and then convert them into error derivatives w.r.t. the weights.
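
A minimal sketch of this step for a single sigmoid layer (the function, shapes, and variable names below are illustrative assumptions, not taken from this page): error derivatives w.r.t. the layer's activities come in, and derivatives w.r.t. the weights and the previous layer's activities come out.

<code python>
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def backprop_layer(a_prev, W, b, dE_da):
    """a_prev: (m, n_in) inputs to the layer, W: (n_in, n_out), b: (n_out,),
    dE_da: (m, n_out) error derivatives w.r.t. this layer's activities."""
    z = a_prev @ W + b                 # forward pass for this layer
    a = sigmoid(z)
    dE_dz = dE_da * a * (1.0 - a)      # through the sigmoid: da/dz = a(1-a)
    dE_dW = a_prev.T @ dE_dz           # error derivatives w.r.t. the weights
    dE_db = dE_dz.sum(axis=0)
    dE_da_prev = dE_dz @ W.T           # error derivatives for the layer below
    return dE_dW, dE_db, dE_da_prev
</code>
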
==== Overfitting (How well does the network generalize?) ====

See also [[data_mining:neural_network:overfitting|Overfitting & Parameter tuning]].

Possible causes of overfitting:
  * Target values may be unreliable.
  * Sampling errors (accidental regularities of particular training cases).

Regularization methods:
  * Weight decay (small weights, simpler model)
  * Weight sharing (identical weights for several connections)
  * Early stopping (hold out a validation set; stop training when performance on it gets worse)
  * Model averaging
  * Bayesian fitting (similar effect to model averaging)
  * Dropout (randomly omit hidden units; see the sketch after this list)
  * Generative pre-training
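
A minimal sketch of dropout (inverted dropout; the keep probability, function name, and shapes are illustrative assumptions, not taken from this page):

<code python>
import numpy as np

def dropout(a, keep_prob=0.8, training=True):
    """Randomly omit hidden activities a during training (inverted dropout)."""
    if not training:
        return a                                  # keep every unit at test time
    mask = np.random.rand(*a.shape) < keep_prob   # which units survive this pass
    return a * mask / keep_prob                   # rescale so the expected activity is unchanged
</code>
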

=== $L_1$ and $L_2$ regularization ===

== Example for logistic regression ==

$L_2$ regularization: add a penalty term to the cost function $J$, e.g. for logistic regression:

$\dots + \frac{\lambda}{2m} ||w||_2^2$

$L_1$ regularization:

$\dots + \frac{\lambda}{2m} ||w||_1$

With $L_1$ regularization, $w$ will be sparse.

Use a hold-out (validation) set to set the hyperparameter $\lambda$.
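
A minimal sketch of these two penalties added to a logistic-regression cost (all function and variable names are illustrative assumptions, not code from this page):

<code python>
import numpy as np

def regularized_cost(w, b, X, y, lam, penalty="l2"):
    """Cross-entropy cost of logistic regression plus an L1 or L2 penalty on w.

    X: (m, n) features, y: (m,) labels in {0, 1}, lam: regularization strength lambda."""
    m = X.shape[0]
    y_hat = 1.0 / (1.0 + np.exp(-(X @ w + b)))                     # predictions
    cross_entropy = -np.mean(y * np.log(y_hat) + (1 - y) * np.log(1 - y_hat))
    if penalty == "l2":
        reg = lam / (2 * m) * np.sum(w ** 2)                       # (lambda/2m) * ||w||_2^2
    else:
        reg = lam / (2 * m) * np.sum(np.abs(w))                    # (lambda/2m) * ||w||_1
    return cross_entropy + reg
</code>
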

== Neural Network ==

Cost function with a Frobenius-norm penalty on the weight matrices:

$J(\dots) = \frac{1}{m} \sum_{i=1}^{m} L(\hat{y}^{(i)}, y^{(i)}) + \frac{\lambda}{2m} \sum_{l=1}^{L} ||W^{[l]}||_F^2$

Frobenius norm: $||W^{[l]}||_F^2 = \sum_i \sum_j (w_{ij}^{[l]})^2$

For gradient descent, the penalty adds a term to each gradient:

$dW^{[l]} = \dots + \frac{\lambda}{m} W^{[l]}$

This is called **weight decay**: in the update $W^{[l]} := W^{[l]} - \alpha \, dW^{[l]}$ (learning rate $\alpha$), the penalty term amounts to multiplying $W^{[l]}$ by the factor $(1 - \frac{\alpha \lambda}{m})$ at every step.

Large $\lambda$: the weights stay small, so $z$ covers only a small range of values and (with a tanh activation function) every layer behaves approximately linearly.
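
A minimal sketch of this update for one weight matrix (the learning rate, variable names, and placeholder values are illustrative assumptions):

<code python>
import numpy as np

def weight_decay_step(W, dW_backprop, lam, m, lr):
    """One gradient-descent step with the Frobenius-norm penalty included."""
    dW = dW_backprop + (lam / m) * W   # penalty adds (lambda/m) * W to the gradient
    return W - lr * dW                 # equals (1 - lr*lam/m) * W - lr * dW_backprop

# Example usage with placeholder values:
W = np.random.randn(4, 3) * 0.01
dW = np.zeros_like(W)                  # stand-in for the gradient from backprop
W = weight_decay_step(W, dW, lam=0.7, m=1000, lr=0.1)
</code>
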
==== History of backpropagation ====