==== Cost function ====

Simplified problem:

$\displaystyle\min_{\theta_0,\theta_1} \frac{1}{2m}\sum_{i=1}^{m}\left(h_\theta(x^{(i)}) - y^{(i)}\right)^2$

$h_\theta(x^{(i)}) = \theta_0 + \theta_1 x^{(i)}$

Cost function (squared error cost function):

$J(\theta_0,\theta_1) = \frac{1}{2m}\sum_{i=1}^{m}\left(h_\theta(x^{(i)}) - y^{(i)}\right)^2$

Goal: $\displaystyle\min_{\theta_0,\theta_1} J(\theta_0,\theta_1)$
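A minimal sketch of this cost function in NumPy, assuming plain arrays x and y of length m (the data and the function name are illustrative, not from these notes):

<code python>
import numpy as np

def cost(theta0, theta1, x, y):
    """Squared error cost J(theta0, theta1) = 1/(2m) * sum((h(x) - y)^2)."""
    m = len(y)
    h = theta0 + theta1 * x            # hypothesis h_theta(x^(i)) for every example
    return np.sum((h - y) ** 2) / (2 * m)

# Three training points that lie exactly on the line y = 1 + 2x
x = np.array([1.0, 2.0, 3.0])
y = np.array([3.0, 5.0, 7.0])
print(cost(1.0, 2.0, x, y))   # 0.0, a perfect fit
print(cost(0.0, 0.0, x, y))   # larger cost for a bad fit
</code>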
=== Functions (example with only $\theta_1$) ===
Repeat until convergence:

$\theta_j := \theta_j - \alpha \frac{\partial}{\partial\theta_j} J(\theta_0, \theta_1)$

Simultaneous update (!):

$$
tmp0 := \theta_0 - \alpha \frac{\partial}{\partial\theta_0} J(\theta_0, \theta_1)\\
tmp1 := \theta_1 - \alpha \frac{\partial}{\partial\theta_1} J(\theta_0, \theta_1)\\
\theta_0 := tmp0\\
\theta_1 := tmp1
$$
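A minimal sketch of this loop in NumPy, with the temporaries making the simultaneous update explicit (alpha, n_iters and the data are illustrative choices, not values from these notes):

<code python>
import numpy as np

def gradient_descent(x, y, alpha=0.1, n_iters=1000):
    """Gradient descent for h(x) = theta0 + theta1 * x with simultaneous updates."""
    m = len(y)
    theta0, theta1 = 0.0, 0.0
    for _ in range(n_iters):
        h = theta0 + theta1 * x
        grad0 = np.sum(h - y) / m          # dJ/dtheta0
        grad1 = np.sum((h - y) * x) / m    # dJ/dtheta1
        tmp0 = theta0 - alpha * grad0      # compute both temporaries first ...
        tmp1 = theta1 - alpha * grad1
        theta0, theta1 = tmp0, tmp1        # ... then assign: simultaneous update
    return theta0, theta1

x = np.array([1.0, 2.0, 3.0])
y = np.array([3.0, 5.0, 7.0])
print(gradient_descent(x, y))   # approaches (1.0, 2.0)
</code>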
For multiple features:

$\theta_j := \theta_j - \alpha \frac{\partial}{\partial\theta_j} J(\theta)$
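The same rule vectorized over all parameters at once, assuming a design matrix X whose first column is all ones (a sketch with illustrative names and data):

<code python>
import numpy as np

def gradient_step(theta, X, y, alpha):
    """One simultaneous update step for J(theta) = 1/(2m) * ||X @ theta - y||^2."""
    m = len(y)
    grad = X.T @ (X @ theta - y) / m   # all partial derivatives dJ/dtheta_j at once
    return theta - alpha * grad

# Tiny example: intercept column plus one feature
X = np.array([[1.0, 2.0],
              [1.0, 3.0],
              [1.0, 4.0]])
y = np.array([5.0, 7.0, 9.0])
theta = gradient_step(np.zeros(2), X, y, alpha=0.1)
print(theta)   # one step away from the zero vector
</code>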
==== Normal equations ====

  * Feature/design matrix $X$ (dim: $m \times (n+1)$)
  * Vector $y$ (dim: $m$)

$\theta = (X^TX)^{-1}X^Ty$

  * Feature scaling is not necessary.

What if $X^TX$ is singular (not invertible)?

(pinv in Octave; see the sketch below)

**Reasons for singularity:**
  * Redundant features (linear dependence)
  * Too many features (e.g. $m \leq n$)
  * Solution: drop features or regularize

**When to use which?**

  * $m$ training examples, $n$ features
  * Gradient descent works well for large $n$ (> 1000); the normal equation has to invert an $(n \times n)$ matrix, which becomes expensive (roughly $O(n^3)$) as $n$ grows.
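A minimal sketch of the normal equation in NumPy, using the pseudo-inverse (the NumPy analogue of Octave's pinv) so it also works when $X^TX$ is singular; the data is made up for illustration:

<code python>
import numpy as np

# Design matrix X: first column of ones for theta_0, then the raw features
X = np.array([[1.0, 2104.0, 3.0],
              [1.0, 1416.0, 2.0],
              [1.0, 1534.0, 3.0],
              [1.0,  852.0, 2.0]])
y = np.array([460.0, 232.0, 315.0, 178.0])

# theta = (X^T X)^{-1} X^T y, with pinv instead of inv for robustness
theta = np.linalg.pinv(X.T @ X) @ X.T @ y
print(theta)   # closed-form fit: no learning rate, no feature scaling
</code>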
===== Gradient Descent Improvements =====
==== Learning rate $\alpha$ ====
  * If $\alpha$ is too large: $J(\theta)$ does not decrease on every iteration and may not converge.
  * Scheme for trying values: 0.001 -> 0.003 -> 0.01 -> 0.03 -> 0.1 -> ... (see the sketch below)
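A small sketch of that scheme: run gradient descent for each candidate $\alpha$ and report whether $J(\theta)$ decreases monotonically (all names and data are illustrative; an oversized $\alpha = 1.0$ is included to show the failure case):

<code python>
import numpy as np

def cost(theta, X, y):
    """Squared error cost J(theta) for a design matrix X."""
    return np.sum((X @ theta - y) ** 2) / (2 * len(y))

def run_gd(X, y, alpha, n_iters=50):
    """Gradient descent that records J(theta) after every iteration."""
    theta = np.zeros(X.shape[1])
    history = []
    for _ in range(n_iters):
        theta = theta - alpha * (X.T @ (X @ theta - y)) / len(y)
        history.append(cost(theta, X, y))
    return theta, history

X = np.array([[1.0, 1.0], [1.0, 2.0], [1.0, 3.0]])   # intercept column + one feature
y = np.array([3.0, 5.0, 7.0])

for alpha in [0.001, 0.003, 0.01, 0.03, 0.1, 0.3, 1.0]:
    theta, hist = run_gd(X, y, alpha)
    decreasing = all(a >= b for a, b in zip(hist, hist[1:]))
    print(f"alpha={alpha}: final J={hist[-1]:.3g}, monotonically decreasing={decreasing}")
</code>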
===== Polynomial regression =====

$\theta_0 + \theta_1 x + \theta_2 x^2 + \theta_3 x^3$

Possible by defining corresponding features:

$x_3 = (size)^3$

Feature scaling then becomes important (since the values get large and differ widely in magnitude).

Or with a square root function:

$\theta_0 + \theta_1 x + \theta_2 \sqrt{x}$
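A minimal sketch of building such polynomial features by hand and scaling them before running gradient descent (the sizes and target values are made up; the scaling shown is mean normalization):

<code python>
import numpy as np

size = np.array([50.0, 80.0, 120.0, 200.0])   # made-up sizes
y = np.array([150.0, 220.0, 310.0, 480.0])    # made-up target values

# Polynomial features x, x^2, x^3 have very different magnitudes
features = np.column_stack([size, size ** 2, size ** 3])

# Mean normalization per column so gradient descent behaves well
mu = features.mean(axis=0)
sigma = features.std(axis=0)
X_poly = np.column_stack([np.ones(len(y)), (features - mu) / sigma])
print(X_poly.round(2))

# Alternative feature set: x and sqrt(x)
X_sqrt = np.column_stack([np.ones(len(y)), size, np.sqrt(size)])
print(X_sqrt.round(2))
</code>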