====== Convolutional neural network ======

Author: Yann LeCun

  * Multiple copies of the same neuron, sharing the same activation function, weights, and biases
  * Each neuron is connected only to a local patch of the input, instead of being fully connected
  * Does feature engineering automatically
  * Kernel: e.g. element-wise multiplication followed by summation
  * Architecture (see the sketch below):
    * Input -> Convolutional layer -> ReLU -> Pooling (dimensionality reduction) -> Fully connected layer
  * Drawback: a large data set is needed

Applications:
  * Photo tagging
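As a rough illustration of the architecture bullet above, here is a minimal sketch of the Input -> Conv -> ReLU -> Pooling -> Fully connected pipeline using tf.keras. The sizes (28x28 input, 8 filters, 10 classes) are arbitrary choices for the example, not from the original note.

<code python>
import tensorflow as tf

# Minimal Input -> Conv -> ReLU -> Pooling -> Fully connected pipeline.
# All sizes here are arbitrary example values.
model = tf.keras.Sequential([
    tf.keras.layers.Conv2D(8, kernel_size=3, activation='relu',
                           input_shape=(28, 28, 1)),      # conv + ReLU
    tf.keras.layers.MaxPooling2D(pool_size=2),            # pooling (dim. reduction)
    tf.keras.layers.Flatten(),
    tf.keras.layers.Dense(10, activation='softmax'),      # fully connected layer
])
model.summary()
</code>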
===== Convolution operation =====

Example with zero padding. Given

  x[i] = [6, 2]
  h[i] = [1, 2, 5, 4]

with zero padding, and the kernel x flipped (otherwise the operation would be cross-correlation).

First step:

<code>
2  6
|  |
v  v
0 [1 2 5 4]      2*0 + 6*1 = 6
</code>

Second step:

<code>
   2 6
   | |
   v v
0 [1 2 5 4]      2*1 + 6*2 = 14
</code>

(the arrows represent the connection between the kernel and the input)

Third step:

<code>
     2 6
     | |
     v v
0 [1 2 5 4]      2*2 + 6*5 = 34
</code>

Fourth step:

<code>
       2 6
       | |
       v v
0 [1 2 5 4]      2*5 + 6*4 = 34
</code>

Fifth step:

<code>
         2  6
         |  |
         v  v
0 [1 2 5 4] 0    2*4 + 6*0 = 8
</code>

The result of the convolution, listing all the steps, is then:

  Y = [6 14 34 34 8]
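The flip-and-slide steps above can be reproduced in a few lines of NumPy. This is a minimal sketch; the name conv1d_full is ours, and the function simply mirrors the manual computation, not any library internals.

<code python>
import numpy as np

def conv1d_full(x, h):
    """Full 1-D convolution: flip the kernel x, zero-pad h, slide and sum."""
    x = np.asarray(x)[::-1]                 # flip the kernel (else: cross-correlation)
    n = len(x)
    h_padded = np.pad(h, (n - 1, n - 1))    # zero padding on both ends
    return np.array([np.dot(x, h_padded[i:i + n])
                     for i in range(len(h_padded) - n + 1)])

print(conv1d_full([6, 2], [1, 2, 5, 4]))   # [ 6 14 34 34  8]
</code>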
==== Result size ====

$n \times n$ image, $f \times f$ kernel: $(n-f+1) \times (n-f+1)$ result ("valid" padding = no padding)

With padding $p$: $(n+2p-f+1) \times (n+2p-f+1)$

"Same" padding (output size equals input size): $p = (f-1)/2$

=== With strides ===

For stride $s$: $\left( \left\lfloor \frac{n+2p-f}{s} \right\rfloor + 1 \right) \times \left( \left\lfloor \frac{n+2p-f}{s} \right\rfloor + 1 \right)$

=== With volumes ===

A $6 \times 6 \times 3$ image convolved with a $3 \times 3 \times 3$ filter gives a $4 \times 4$ result: the filter spans all input channels, so each filter produces a 2-D output.

In general: $(n-f+1) \times (n-f+1) \times n_c'$, where $n_c'$ is the number of filters.

Use many filters to detect multiple features (the helper below evaluates these formulas).
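A small helper, as a sketch, that evaluates the output-size formula above; the floor handles strides that do not divide evenly, and conv_output_size is our own name for it.

<code python>
def conv_output_size(n, f, p=0, s=1):
    """Output side length for an n x n input, f x f kernel, padding p, stride s."""
    return (n + 2 * p - f) // s + 1

print(conv_output_size(6, 3))            # 4  -> the 6x6x3 * 3x3x3 = 4x4 example
print(conv_output_size(10, 3))           # 8  -> 'VALID' in the TensorFlow example
print(conv_output_size(10, 3, p=1))      # 10 -> 'SAME' with p = (f-1)/2
print(conv_output_size(7, 3, s=2))       # 3  -> strided case
</code>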
==== In Python with NumPy ====

<code python>
import numpy as np

x = [6, 2]
h = [1, 2, 5, 4]

# "full" method: zero padding on both ends, all five steps from above
np.convolve(x, h, "full")    # array([ 6, 14, 34, 34,  8])

# "same" method: output trimmed to the length of the longer input
# (here the last element is dropped)
np.convolve(x, h, "same")    # array([ 6, 14, 34, 34])

# "valid" method: no zero padding, only positions with full overlap
np.convolve(x, h, "valid")   # array([14, 34, 34])
</code>
==== In TensorFlow ====

  * 3x3 filter (4-D tensor [3, 3, 1, 1] = [filter height, filter width, input channels, number of filters])
  * 10x10 image (4-D tensor [1, 10, 10, 1] = [batch size, height, width, number of channels])
  * With zero padding ('SAME' mode) the output size is the same as the input: 10x10
  * Without zero padding ('VALID' mode): input size - kernel size + 1 = 10 - 3 + 1 = 8, i.e. 8x8

<code python>
import tensorflow as tf

# Building the graph (TensorFlow 1.x API)
input = tf.Variable(tf.random_normal([1, 10, 10, 1]))
filter = tf.Variable(tf.random_normal([3, 3, 1, 1]))
op = tf.nn.conv2d(input, filter, strides=[1, 1, 1, 1], padding='VALID')
op2 = tf.nn.conv2d(input, filter, strides=[1, 1, 1, 1], padding='SAME')

# Initialization and session
init = tf.global_variables_initializer()
with tf.Session() as sess:
    sess.run(init)

    print("Input \n")
    print('{0} \n'.format(input.eval()))
    print("Filter/Kernel \n")
    print('{0} \n'.format(filter.eval()))

    print("Result/Feature Map with valid positions \n")
    result = sess.run(op)
    print(result)
    print('\n')

    print("Result/Feature Map with padding \n")
    result2 = sess.run(op2)
    print(result2)
</code>
===== Max Pooling =====

Fixed hyperparameters:
  * Filter size $f$
  * Stride $s$

Typical values: $f = 2$, $s = 2$

Usually no padding is used. The number of channels (depth) stays the same. A minimal sketch follows.
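A minimal NumPy sketch of 2x2 max pooling with stride 2 on a single channel; the reshape trick assumes the side lengths are divisible by $f$, and max_pool2d is our own name.

<code python>
import numpy as np

def max_pool2d(x, f=2, s=2):
    """Max pooling for the typical f == s case, single channel.
    Assumes the side lengths are divisible by f."""
    assert f == s, "this sketch only covers the common f == s case"
    h, w = x.shape
    return x.reshape(h // f, f, w // f, f).max(axis=(1, 3))

x = np.array([[1, 3, 2, 1],
              [4, 6, 5, 0],
              [7, 2, 9, 8],
              [1, 0, 3, 4]])
print(max_pool2d(x))
# [[6 5]
#  [7 9]]
</code>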
===== Average Pooling =====

Used in deep networks to collapse the spatial dimensions, e.g. 7x7x1000 => 1x1x1000 (global average pooling); a one-line sketch follows.
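The 7x7x1000 => 1x1x1000 collapse above is just a mean over the two spatial axes; a minimal NumPy sketch (the array is random, for shape-checking only):

<code python>
import numpy as np

x = np.random.rand(7, 7, 1000)                  # H x W x channels
pooled = x.mean(axis=(0, 1), keepdims=True)     # average over the 7x7 window
print(pooled.shape)                             # (1, 1, 1000)
</code>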
===== Winning competitions =====

  * Ensembling of outputs: train several networks independently and average their predictions
  * Multi-crop at test time: run the classifier on multiple versions of the test images (cropped, mirrored, ...) and average the results (see the sketch below)
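A schematic sketch of both tricks, assuming some predict(model, image) function that returns class probabilities; the helper names and the choice of views are ours, not from the original note. A real multi-crop pipeline would also take several crops and resize them to the network's input size.

<code python>
import numpy as np

def augmented_views(image):
    """Test-time versions of the image: original plus horizontal mirror."""
    return [image, np.fliplr(image)]

def test_time_prediction(models, image, predict):
    """Average class probabilities over an ensemble and over multiple views."""
    preds = [predict(m, view)
             for m in models
             for view in augmented_views(image)]
    return np.mean(preds, axis=0)
</code>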