====== Logistic regression ======

===== Hypothesis =====

The output of the hypothesis should lie in the range $0 \leq h_\theta(x) \leq 1$.

$h_\theta(x) = g(\theta^Tx)$

Logistic/Sigmoid function $g(z) = \frac{1}{1+e^{-z}}$

$h_\theta(x)$: estimated probability that $y = 1$ for input $x$.

$h_\theta(x) = P(y=1 | x; \theta)$

$1 = P(y=1 | x; \theta) + P(y=0 | x; \theta)$
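
The definitions above can be sketched in a few lines of Python. This is only an illustration: the parameter values are made up, and the input vector is assumed to carry the bias entry $x_0 = 1$.

```python
import math

def sigmoid(z):
    """Logistic function g(z) = 1 / (1 + e^(-z)), split by sign to avoid overflow."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)

def hypothesis(theta, x):
    """h_theta(x) = g(theta^T x); x is assumed to include x_0 = 1."""
    z = sum(t * xi for t, xi in zip(theta, x))
    return sigmoid(z)

theta = [-1.0, 2.0]          # illustrative parameters, not fitted to any data
x = [1.0, 0.5]               # x_0 = 1 (bias), x_1 = 0.5
p1 = hypothesis(theta, x)    # estimated P(y=1 | x; theta)
p0 = 1.0 - p1                # P(y=0 | x; theta), since both must sum to 1
```

Here $\theta^T x = 0$, so both probabilities come out as 0.5.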

===== Cost function =====

How to choose the parameters $\theta$?

$J(\theta) = \frac{1}{m} \sum_{i=1}^m Cost(h_\theta(x^{(i)}),y^{(i)})$

$Cost(h_\theta(x),y) = \frac{1}{2} (h_\theta(x)-y)^2$ makes $J(\theta)$ non-convex for logistic regression!

$Cost(h_\theta(x),y) = \begin{cases} -\log(h_\theta(x)) & \text{if } y = 1\\ -\log(1 - h_\theta(x)) & \text{if } y = 0\end{cases}$

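For $y=1$: if $h_\theta(x) = 1$, the cost is 0; otherwise it grows the further $h_\theta(x)$ is from 1, and goes to infinity as $h_\theta(x) \to 0$.
For $y=0$: analogous, with cost 0 at $h_\theta(x) = 0$.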

Simplified into a single expression:

$Cost(h_\theta(x),y) = -y \cdot \log(h_\theta(x)) - (1-y) \cdot \log(1 - h_\theta(x))$

This can be derived via maximum likelihood estimation.
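
As a rough sketch, the per-example cost and $J(\theta)$ could be computed as follows. The toy data set is invented for illustration, and each input is assumed to include the bias entry $x_0 = 1$.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def cost(h, y):
    """Per-example cost -y*log(h) - (1-y)*log(1-h), for h strictly between 0 and 1."""
    return -y * math.log(h) - (1 - y) * math.log(1 - h)

def J(theta, X, Y):
    """Average cost over all m training examples."""
    m = len(X)
    total = 0.0
    for x, y in zip(X, Y):
        h = sigmoid(sum(t * xi for t, xi in zip(theta, x)))
        total += cost(h, y)
    return total / m

# Toy data (made up): first column is the bias x_0 = 1.
X = [[1.0, 0.0], [1.0, 1.0], [1.0, 2.0], [1.0, 3.0]]
Y = [0, 0, 1, 1]

J_zero = J([0.0, 0.0], X, Y)   # with theta = 0 every h is 0.5, so J = log(2)
```

For $y=1$, `cost(0.99, 1)` is close to 0 while `cost(0.01, 1)` is large, which matches the behaviour described above.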

===== Gradient Descent =====

The update rule has the same form as in linear regression, but with a different hypothesis: $h_\theta(x) = g(\theta^Tx)$ instead of $\theta^Tx$.
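
A minimal sketch of batch gradient descent for logistic regression, assuming the usual update $\theta_j := \theta_j - \alpha \frac{1}{m} \sum_{i=1}^m (h_\theta(x^{(i)}) - y^{(i)}) x_j^{(i)}$. The learning rate, iteration count and toy data are arbitrary choices for illustration.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def h(theta, x):
    return sigmoid(sum(t * xi for t, xi in zip(theta, x)))

def gradient_descent(X, Y, alpha=0.1, iterations=5000):
    """Batch gradient descent; alpha and iterations are illustrative defaults."""
    m, n = len(X), len(X[0])
    theta = [0.0] * n
    for _ in range(iterations):
        errors = [h(theta, x) - y for x, y in zip(X, Y)]
        grad = [sum(e * x[j] for e, x in zip(errors, X)) / m for j in range(n)]
        # Simultaneous update of all theta_j.
        theta = [t - alpha * g for t, g in zip(theta, grad)]
    return theta

# Toy data (made up): first column is the bias x_0 = 1.
X = [[1.0, 0.0], [1.0, 1.0], [1.0, 2.0], [1.0, 3.0]]
Y = [0, 0, 1, 1]

theta = gradient_descent(X, Y)
predictions = [1 if h(theta, x) >= 0.5 else 0 for x in X]
```

The only difference to the linear regression loop is the call to `sigmoid` inside `h`.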

===== Optimization =====

Alternatives to gradient descent:
  * Conjugate gradient
  * BFGS
  * L-BFGS

Advantages: the learning rate $\alpha$ does not have to be chosen manually; often faster than gradient descent.
Disadvantage: more complex.

Octave: define a cost function that computes both the cost and the gradient, and pass it to the corresponding optimization routine.
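
The same pattern can be sketched in Python: one function returns the cost and the gradient together, and an optimizer receives that function as an argument. The optimizer below, `descend_with_line_search`, is a hypothetical stand-in written for this example (steepest descent with a backtracking line search). It is not BFGS or conjugate gradient; it only illustrates that the step size can be picked automatically instead of fixing $\alpha$ by hand.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def cost_and_gradient(theta, X, Y):
    """Return (J(theta), gradient of J) in a single call."""
    m, n = len(X), len(theta)
    J, grad = 0.0, [0.0] * n
    for x, y in zip(X, Y):
        h = sigmoid(sum(t * xi for t, xi in zip(theta, x)))
        h_safe = min(max(h, 1e-12), 1.0 - 1e-12)  # keep log() away from 0
        J += -y * math.log(h_safe) - (1 - y) * math.log(1.0 - h_safe)
        for j in range(n):
            grad[j] += (h - y) * x[j]
    return J / m, [g / m for g in grad]

def descend_with_line_search(fun, theta0, args=(), iterations=200):
    """Hypothetical stand-in optimizer: halve the step until the cost drops enough."""
    theta = list(theta0)
    for _ in range(iterations):
        J, grad = fun(theta, *args)
        sq_norm = sum(g * g for g in grad)
        step = 1.0
        while step > 1e-12:
            candidate = [t - step * g for t, g in zip(theta, grad)]
            if fun(candidate, *args)[0] <= J - 1e-4 * step * sq_norm:  # Armijo condition
                theta = candidate
                break
            step *= 0.5
    return theta

# Toy data (made up, not linearly separable): first column is the bias x_0 = 1.
X = [[1.0, 0.0], [1.0, 1.0], [1.0, 2.0], [1.0, 3.0], [1.0, 4.0], [1.0, 5.0]]
Y = [0, 0, 1, 0, 1, 1]

theta_opt = descend_with_line_search(cost_and_gradient, [0.0, 0.0], args=(X, Y))
J_start, _ = cost_and_gradient([0.0, 0.0], X, Y)
J_opt, grad_opt = cost_and_gradient(theta_opt, X, Y)
```

Returning cost and gradient from one function avoids computing $h_\theta(x)$ twice per example, which is presumably why optimization routines tend to expect this shape.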

===== Multiclass classification =====

One-vs-all (one-vs-rest) classification

E.g. turn a 3-class problem into 3 binary problems, each with its own classifier: $h_\theta^{(i)}(x) = P(y=i|x;\theta),\ i=1,2,3$

Then, for a new input $x$, pick the class $i$ that maximizes $h_\theta^{(i)}(x)$, i.e. $\arg\max_i h_\theta^{(i)}(x)$.
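
One-vs-rest can be sketched as follows: train one binary classifier per class and predict with the most confident one. The training routine is plain gradient descent with arbitrary settings, and the 2-feature data set is invented so that each class is linearly separable from the rest.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def h(theta, x):
    return sigmoid(sum(t * xi for t, xi in zip(theta, x)))

def train_binary(X, Y, alpha=0.3, iterations=5000):
    """Batch gradient descent for one binary problem (illustrative settings)."""
    m, n = len(X), len(X[0])
    theta = [0.0] * n
    for _ in range(iterations):
        errors = [h(theta, x) - y for x, y in zip(X, Y)]
        grad = [sum(e * x[j] for e, x in zip(errors, X)) / m for j in range(n)]
        theta = [t - alpha * g for t, g in zip(theta, grad)]
    return theta

def train_one_vs_rest(X, labels, classes):
    """For each class i, relabel the data as 'i' (1) vs 'not i' (0) and train."""
    return {c: train_binary(X, [1 if l == c else 0 for l in labels]) for c in classes}

def predict(models, x):
    """Pick the class whose classifier reports the highest probability."""
    return max(models, key=lambda c: h(models[c], x))

# Toy data (made up): columns are x_0 = 1 (bias), x_1, x_2.
X = [[1, 0, 0], [1, 1, 0], [1, 0, 1],     # class 1
     [1, 4, 0], [1, 5, 1], [1, 4, 1],     # class 2
     [1, 0, 4], [1, 1, 5], [1, 1, 4]]     # class 3
labels = [1, 1, 1, 2, 2, 2, 3, 3, 3]

models = train_one_vs_rest(X, labels, classes=[1, 2, 3])
predicted = [predict(models, x) for x in X]
```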

===== Addressing Overfitting =====
  - Feature reduction
    * Manual selection
    * Model selection algorithm
  - Regularization
    * Keep all features, but shrink the magnitude of the parameters $\theta_j$.
      * Works well when there are many features that each contribute a little to predicting $y$.

==== Regularization ====

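$\min_\theta \dots + 1000\,\theta_3^2 + 1000\,\theta_4^2$

Penalizing $\theta_3$ and $\theta_4$ this heavily forces them close to 0. In general, small parameter values lead to a "simpler" hypothesis.

$\min_\theta \dots + \lambda \sum_{j=1}^n \theta_j^2$

The sum starts at $j=1$, so $\theta_0$ is not penalized.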

=== Gradient descent (Linear Regression) ===

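For $j \geq 1$ the gradient gains the term $\dots + \frac{\lambda}{m} \theta_j$, so the update becomes:

$\theta_j := \theta_j (1-\alpha \frac{\lambda}{m}) - \alpha \frac{1}{m} \sum^m_{i=1} (h_\theta(x^{(i)}) - y^{(i)}) x_j^{(i)}$

Since $1-\alpha \frac{\lambda}{m} < 1$, every step first shrinks $\theta_j$ slightly and then applies the usual gradient step.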
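
A small sketch of this update for linear regression, assuming $h_\theta(x) = \theta^T x$ and that $\theta_0$ is left unregularized. The data, $\alpha$ and $\lambda$ are illustrative values only.

```python
def regularized_step(theta, X, Y, alpha, lam):
    """One regularized gradient descent step for linear regression."""
    m = len(X)
    errors = [sum(t * xi for t, xi in zip(theta, x)) - y for x, y in zip(X, Y)]
    new_theta = []
    for j in range(len(theta)):
        grad_j = sum(e * x[j] for e, x in zip(errors, X)) / m
        # theta_0 is not shrunk; every other theta_j is scaled by (1 - alpha*lambda/m).
        shrink = 1.0 if j == 0 else 1.0 - alpha * lam / m
        new_theta.append(theta[j] * shrink - alpha * grad_j)
    return new_theta

def fit(X, Y, alpha=0.1, lam=0.0, iterations=2000):
    theta = [0.0] * len(X[0])
    for _ in range(iterations):
        theta = regularized_step(theta, X, Y, alpha, lam)
    return theta

# Toy data (made up): y = 1 + 2*x exactly; first column is the bias x_0 = 1.
X = [[1.0, 0.0], [1.0, 1.0], [1.0, 2.0], [1.0, 3.0]]
Y = [1.0, 3.0, 5.0, 7.0]

theta_plain = fit(X, Y, lam=0.0)    # recovers roughly [1, 2]
theta_reg = fit(X, Y, lam=10.0)     # slope is pulled towards 0
```

With $\lambda = 10$ the fitted slope is visibly smaller than 2, while the intercept moves up to compensate.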


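=== Normal equation (Linear Regression) ===
$$\theta = (X^T X + \lambda 
\begin{bmatrix}
0 & \dots & \dots & 0 \\
\vdots & 1 & 0 & \vdots \\
\vdots & 0 & \ddots & 0 \\
0 & \dots & 0 & 1
\end{bmatrix})^{-1} X^T y$$

The matrix is the $(n+1) \times (n+1)$ identity with the top-left entry set to 0, so $\theta_0$ is not regularized.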
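
As an illustration, the regularized normal equation can be evaluated without any libraries by building $X^T X + \lambda L$ and solving the linear system directly. The helper `solve` (Gaussian elimination with partial pivoting) and the toy data are assumptions of this sketch; in practice a linear algebra library would normally do this step.

```python
def solve(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting."""
    n = len(A)
    M = [row[:] + [bi] for row, bi in zip(A, b)]   # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            f = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= f * M[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        x[i] = (M[i][n] - sum(M[i][c] * x[c] for c in range(i + 1, n))) / M[i][i]
    return x

def normal_equation(X, Y, lam):
    """theta = (X^T X + lam * L)^(-1) X^T y, where L is the identity with L[0][0] = 0."""
    n = len(X[0])
    A = [[sum(x[j] * x[k] for x in X) for k in range(n)] for j in range(n)]
    for j in range(1, n):          # skip j = 0: theta_0 is not regularized
        A[j][j] += lam
    b = [sum(x[j] * y for x, y in zip(X, Y)) for j in range(n)]
    return solve(A, b)

# Toy data (made up): y = 1 + 2*x exactly; first column is the bias x_0 = 1.
X = [[1.0, 0.0], [1.0, 1.0], [1.0, 2.0], [1.0, 3.0]]
Y = [1.0, 3.0, 5.0, 7.0]

theta_plain = normal_equation(X, Y, lam=0.0)
theta_reg = normal_equation(X, Y, lam=10.0)
```

Solving the system instead of forming the inverse explicitly is a design choice of this sketch; both give the same $\theta$ when the matrix is invertible.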

=== Gradient descent (Logistic Regression) ===

Treat $\theta_0$ and $\theta_j$ ($j \geq 1$) differently!

For $\theta_j$ with $j \geq 1$, the gradient gains the term $\dots + \frac{\lambda}{m} \theta_j$; $\theta_0$ is updated without it.
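
A minimal sketch of the regularized update for logistic regression, assuming the same gradient as in the unregularized case plus $\frac{\lambda}{m} \theta_j$ for $j \geq 1$. The data and the values of $\alpha$ and $\lambda$ are made up for illustration.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def fit_logistic(X, Y, alpha=0.1, lam=0.0, iterations=5000):
    """Batch gradient descent with an L2 penalty on theta_1..theta_n (not theta_0)."""
    m, n = len(X), len(X[0])
    theta = [0.0] * n
    for _ in range(iterations):
        errors = [sigmoid(sum(t * xi for t, xi in zip(theta, x))) - y
                  for x, y in zip(X, Y)]
        grad = [sum(e * x[j] for e, x in zip(errors, X)) / m for j in range(n)]
        for j in range(1, n):              # theta_0 gets no penalty term
            grad[j] += lam / m * theta[j]
        theta = [t - alpha * g for t, g in zip(theta, grad)]
    return theta

# Toy data (made up, linearly separable): first column is the bias x_0 = 1.
X = [[1.0, 0.0], [1.0, 1.0], [1.0, 2.0], [1.0, 3.0]]
Y = [0, 0, 1, 1]

theta_plain = fit_logistic(X, Y, lam=0.0)   # weights keep growing on separable data
theta_reg = fit_logistic(X, Y, lam=1.0)     # penalty keeps theta_1 small
```

On separable data the unregularized weights grow with the number of iterations, while the penalty keeps $\theta_1$ bounded.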