
# Loss functions

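Cross-entropy loss measures the difference between two probability distributions. For a single sample with binary label $y$ and predicted probability $\hat{y}$:

$-(y \log(\hat{y}) + (1-y) \log(1-\hat{y}))$

or, averaged over $N$ samples:

$-\frac{1}{N} \sum_{i=1}^N (y_i \log(\hat{y}_i) + (1-y_i) \log(1-\hat{y}_i))$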
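
A minimal sketch of the averaged formula in plain Python. The function name `binary_cross_entropy`, the clipping constant `eps`, and the sample values are assumptions made for illustration; clipping is only there to avoid `log(0)`.

```python
import math

def binary_cross_entropy(y_true, y_pred, eps=1e-12):
    """Average binary cross-entropy over N samples (eps is an assumed clipping value)."""
    total = 0.0
    for y, y_hat in zip(y_true, y_pred):
        # Clip the prediction so that log() never receives exactly 0.
        y_hat = min(max(y_hat, eps), 1.0 - eps)
        total += y * math.log(y_hat) + (1 - y) * math.log(1 - y_hat)
    return -total / len(y_true)

# Hypothetical labels and predicted probabilities.
labels = [1, 0, 1, 1]
predictions = [0.9, 0.1, 0.8, 0.3]
loss = binary_cross_entropy(labels, predictions)
print(loss)
```

The confident, correct predictions (0.9, 0.1, 0.8) contribute little to the loss; the poor prediction of 0.3 for a positive label contributes most of it.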

Recap entropy:

In binary classification, the entropy of a distribution $q(y)$ with a 50:50 class split is $H(q) = \log(2)$.

For other distributions (and in general) with $C$ classes, the entropy is $H(q) = - \sum_{c=1}^C q(y_c) \cdot \log(q(y_c))$

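A small value of $q(y_c)$ gives a negative log of large magnitude (many times larger than $q$ itself): math.log(0.01) ≈ -4.6

A large value of $q(y_c)$ gives a negative log of small magnitude: math.log(0.99) ≈ -0.01

More classes allow higher entropy: a uniform distribution over $C$ classes has entropy $\log(C)$, the maximum for $C$ classes.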
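
The entropy recap can be checked numerically. This is a small sketch; the helper name `entropy` and the example distributions are assumptions, and terms with zero probability are skipped by the usual convention that they contribute nothing.

```python
import math

def entropy(q):
    """Entropy of a discrete distribution q, given as a list of probabilities."""
    # Zero-probability terms are skipped (convention: 0 * log(0) = 0).
    return -sum(qc * math.log(qc) for qc in q if qc > 0)

h_balanced = entropy([0.5, 0.5])      # 50:50 binary split, approximately log(2)
h_skewed = entropy([0.99, 0.01])      # almost certain outcome, low entropy
h_four = entropy([0.25] * 4)          # uniform over 4 classes, approximately log(4)

print(h_balanced, h_skewed, h_four)
```

The skewed distribution has much lower entropy than the balanced one, and the uniform four-class distribution has higher entropy than the uniform two-class one.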

Recap cross-entropy:

$H_p(q) = - \sum_{c=1}^C q(y_c) \cdot \log(p(y_c))$ where $p$ is another distribution

When $p = q$, $H_p(q) = H(q)$; for any other $p$ the cross-entropy is larger, so $H_p(q) \geq H(q)$.
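
A short sketch comparing cross-entropy with entropy on one example. The helper names and the two distributions `q` and `p` are hypothetical values chosen for illustration.

```python
import math

def entropy(q):
    """Entropy H(q) of a discrete distribution."""
    return -sum(qc * math.log(qc) for qc in q if qc > 0)

def cross_entropy(q, p):
    """Cross-entropy H_p(q): expectation under q of -log p."""
    return -sum(qc * math.log(pc) for qc, pc in zip(q, p) if qc > 0)

# Hypothetical true distribution q and approximating distribution p.
q = [0.7, 0.2, 0.1]
p = [0.5, 0.3, 0.2]

h_q = entropy(q)
h_pq = cross_entropy(q, p)   # larger than h_q because p differs from q
h_qq = cross_entropy(q, q)   # equals h_q because the distributions match

print(h_q, h_pq, h_qq)
```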

Recap KL-Divergence:

KL divergence is the difference between cross-entropy and entropy.

$D_{KL}(q||p) = H_p(q) - H(q) = \sum_{c=1}^C q(y_c) (\log(q(y_c)) - \log(p(y_c)))$
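
Both sides of the KL identity can be computed and compared. This is a sketch under assumed example distributions; the name `kl_divergence` is hypothetical.

```python
import math

def kl_divergence(q, p):
    """D_KL(q || p), computed directly from the sum form."""
    return sum(qc * (math.log(qc) - math.log(pc))
               for qc, pc in zip(q, p) if qc > 0)

# Hypothetical distributions for illustration.
q = [0.7, 0.2, 0.1]
p = [0.5, 0.3, 0.2]

kl = kl_divergence(q, p)

# The same quantity as cross-entropy minus entropy.
cross_entropy_value = -sum(qc * math.log(pc) for qc, pc in zip(q, p))
entropy_value = -sum(qc * math.log(qc) for qc in q)
difference = cross_entropy_value - entropy_value

print(kl, difference)
```

The two printed values agree up to floating-point rounding, and the divergence drops to zero when `p` is replaced by `q`.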

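A learning algorithm needs to find the $p$ closest to $q$, which it does by minimizing cross-entropy. Because $H(q)$ does not depend on $p$, this also minimizes the KL divergence.

In training we have $N$ samples. For each sample the true distribution is known: all probability mass lies on the class the sample belongs to. The loss to minimize is the average cross-entropy over the $N$ samples.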

For the binary cross-entropy at the top of this page, the network outcome is a scalar in $[0,1]$, using sigmoid.

$-\sum_{i=1}^C y_i \log(\hat{y}_i)$, where $C$ is the number of classes

Outcome: vector with entries in $[0,1]$, using softmax
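
A sketch of the softmax case for one sample with a one-hot target. The function names, the `eps` guard, and the logits are assumptions for illustration; subtracting the maximum logit is a common stability trick and does not change the result.

```python
import math

def softmax(logits):
    """Turn raw scores into probabilities that sum to 1."""
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def categorical_cross_entropy(y_true, y_pred, eps=1e-12):
    """-sum_i y_i * log(y_hat_i) for one sample (eps is an assumed guard against log(0))."""
    return -sum(y * math.log(max(p, eps)) for y, p in zip(y_true, y_pred))

# Hypothetical raw network outputs for 3 classes; the true class is index 0.
logits = [2.0, 1.0, 0.1]
target = [1, 0, 0]

probs = softmax(logits)
loss = categorical_cross_entropy(target, probs)
print(probs, loss)
```

With a one-hot target only the term of the true class survives, so the loss reduces to the negative log of the probability assigned to that class.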

$-\sum_{i=1}^C (y_i \log(\hat{y}_i) + (1-y_i) \log(1-\hat{y}_i))$

Outcome: vector with entries in $[0,1]$, using sigmoid
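
A sketch of the sigmoid-vector case, where each of the $C$ outputs is treated as its own binary decision. The names `sigmoid` and `multilabel_bce`, the `eps` clipping, and the example values are assumptions; this simple `sigmoid` is adequate for moderate inputs only.

```python
import math

def sigmoid(z):
    """Logistic function, maps a real score to (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

def multilabel_bce(y_true, y_pred, eps=1e-12):
    """Sum of binary cross-entropies over the C outputs of one sample."""
    total = 0.0
    for y, p in zip(y_true, y_pred):
        p = min(max(p, eps), 1.0 - eps)  # assumed clipping to avoid log(0)
        total += y * math.log(p) + (1 - y) * math.log(1 - p)
    return -total

# Hypothetical raw scores; the sample carries labels 0 and 2 at the same time.
logits = [2.0, -1.0, 0.5]
target = [1, 0, 1]

probs = [sigmoid(z) for z in logits]
loss = multilabel_bce(target, probs)
print(probs, loss)
```

Unlike softmax, the sigmoid outputs are independent and need not sum to 1, which is why the $(1-y_i)$ term is required for each output.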
