====== Perceptron ======

  * Popularized by Frank Rosenblatt (1960s)
  * Used for tasks with very big vectors of features

Decision unit: binary threshold neuron. The bias can be learned like the weights by treating it as the weight on an extra input that always has value 1.

Perceptron convergence procedure (a minimal code sketch is given at the end of this section):
  * If the output is correct => no weight changes.
  * If the output unit incorrectly outputs 0 => add the input vector to the weight vector.
  * If the output unit incorrectly outputs 1 => subtract the input vector from the weight vector.

This is guaranteed to find a set of weights that gets the right answer for all training cases, if such a set exists.
=> Deciding what features to use is the important part.

====== Geometrical Interpretation ======

===== Weight-Space view =====

  * 1 dimension for each weight.
  * A point represents a setting of all the weights.
  * Leaving the threshold out, each **training case** can be represented as a **hyperplane** through the **origin**: inputs represent planes (or constraints).
  * For a particular training case, the weights must lie on one side of this hyperplane to get the answer correct.

The plane goes through the **origin** and is perpendicular to the input vector. For a training case with correct answer 1, a good weight vector must lie on the same side of the hyperplane as the input vector, i.e. the scalar product of weight vector and input vector is positive (angle < 90°); for correct answer 0 it must lie on the other side.

**Cone of feasible solutions**: we need to find a point on the right side of all the planes (training cases); such a point might not exist. If there are weight vectors that get the right answer for all cases, they lie in a hyper-cone with its apex at the origin. The average of two good weight vectors is a good weight vector => the problem is convex.

==== Why learning works ====

Proof idea: show that the squared distance between a feasible weight vector and the current weight vector gets smaller with every mistake.

Hopeful claim: every time the perceptron makes a mistake, the current weight vector gets closer to **all** feasible weight vectors. This is not true.

Instead, consider **"generously feasible"** weight vectors: those that lie within the feasible region by a **margin** at least as great as the **length of the input vector** that defines **each constraint plane**. Every time the perceptron makes a mistake, the squared distance to all of the **generously feasible** weight vectors decreases by at least the **squared length** of the **update vector** (the input vector). Hence, after a finite number of mistakes, the weight vector must lie in the **feasible** region, if that region exists.

==== What perceptrons can't do ====

With a separate feature unit for each of the many possible binary input vectors, any possible discrimination can be made, but exponentially many features are necessary, and this type of table look-up won't generalize.

A binary threshold output unit can't tell whether two single-bit features are the same: the two positive cases require w1 + w2 >= θ and 0 >= θ, while the two negative cases require w1 < θ and w2 < θ, so the constraints contradict each other.

=== Data-Space ===

  - Each input vector is a point in this space.
  - The weight vector defines a plane.
  - The weight plane is perpendicular to the weight vector and misses the origin by a distance equal to the threshold.

The positive and negative cases cannot be separated by a plane => not linearly separable.

=== Discriminating patterns under translation with wrap-around ===

A binary threshold neuron cannot discriminate between different patterns that have the same number of on pixels, if the patterns are allowed to translate with wrap-around.

Proof: take pattern A (4 on pixels) in all possible translations; summed over all translations, the total input is 4x the sum of all the weights. Take pattern B (4 on pixels) in all possible translations; summed over all translations, the total input is again 4x the sum of all the weights. But to discriminate, every single case of pattern A must provide more total input than every single case of pattern B, which is impossible when the sums over all translations are equal.
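A quick numerical check of this counting argument, as a sketch only (the function name, the pattern length of 8, and the specific patterns are made up for illustration; NumPy is assumed):

<code python>
import numpy as np

def total_input_over_translations(pattern, weights):
    """Sum of the binary threshold unit's total input over all cyclic shifts."""
    return sum(np.dot(np.roll(pattern, k), weights) for k in range(len(pattern)))

rng = np.random.default_rng(0)
weights = rng.normal(size=8)                  # arbitrary weight vector

# Two different 1-D patterns, both with 4 on pixels.
pattern_a = np.array([1, 1, 1, 1, 0, 0, 0, 0])
pattern_b = np.array([1, 0, 1, 0, 1, 0, 1, 0])

# Summed over all translations, both give 4 * sum(weights), so no threshold
# can put every translation of A above every translation of B.
print(total_input_over_translations(pattern_a, weights))
print(total_input_over_translations(pattern_b, weights))
print(4 * weights.sum())
</code>

Both sums come out equal to 4 times the sum of the weights, which is exactly why no single threshold can separate the two sets of translated patterns.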
Case: 3 pattern types in 2 classes. The first class contains patterns with 4 pixels on; the second class contains patterns with either 1 or 3 pixels on. This case **is** solvable: set the weight from each pixel to 1 and the bias to -3.5. Then:
  * any example with 3 pixels on: activation = -0.5
  * any example with 1 pixel on: activation = -2.5
  * any example with 4 pixels on: activation = 0.5
=> all cases are correctly classified.

The whole point of pattern recognition is to recognize patterns despite transformations like translation. Minsky and Papert ("Group Invariance Theorem"): the part of a perceptron that **learns** cannot learn to do this if the transformations form a **group**:
  - Translations with **wrap-around** form a group.

To deal with such transformations, the tricky part of recognition must be solved by hand-coded **feature detectors**, not by the learning procedure.

Conclusion (reached after about 20 years): a neural network has to learn the feature detectors, not only the weights.
  * More layers of linear units do not help (the result is still linear).
  * Fixed output non-linearities are not enough.

=> We need multiple layers of **adaptive**, **non-linear** hidden units, and an efficient way of adapting all the weights.
  * Learning the weights going into the hidden units is equivalent to learning features.
  * No one is telling us what the feature vectors should be.
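For reference, here is the minimal sketch of the perceptron convergence procedure promised at the top of this section (NumPy is assumed; the function name, the epoch cap, and the toy OR dataset are made up for illustration; the bias is folded in as a weight on a constant input of 1):

<code python>
import numpy as np

def train_perceptron(inputs, targets, max_epochs=100):
    """Binary threshold perceptron trained with the convergence procedure:
    correct -> no change; wrongly outputs 0 -> add the input vector;
    wrongly outputs 1 -> subtract the input vector."""
    # Fold the bias in as a weight on an extra input that is always 1.
    X = np.hstack([np.asarray(inputs, dtype=float), np.ones((len(inputs), 1))])
    w = np.zeros(X.shape[1])
    for _ in range(max_epochs):
        mistakes = 0
        for x, t in zip(X, targets):
            y = 1 if np.dot(w, x) >= 0 else 0   # binary threshold unit
            if y == t:
                continue                        # correct output: no weight change
            w += x if t == 1 else -x            # add or subtract the input vector
            mistakes += 1
        if mistakes == 0:                       # every training case is now correct
            return w
    return w

# Toy linearly separable problem: output 1 iff at least one input bit is on (OR).
X_train = np.array([[0, 0], [0, 1], [1, 0], [1, 1]])
t_train = np.array([0, 1, 1, 1])
w = train_perceptron(X_train, t_train)
print(w)   # the last component is the learned bias
</code>

The loop stops as soon as an epoch passes with no mistakes, which matches the convergence guarantee for linearly separable training sets; the features themselves are still hand-chosen, which is exactly the limitation discussed above.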