====== Geometrical Interpretation ======

===== Weight-Space view =====

* 1 dimension per weight
* A point in the space represents a particular setting of all the weights.
* Leaving the threshold out, each **training case** can be represented as a **hyperplane** through the **origin**.
* For a particular training case: the weights must lie on one side of this hyperplane to get the answer correct.

The plane goes through the **origin** and is perpendicular to the input vector. A good weight vector needs to lie on the same side of the hyperplane as the direction in which the input vector points: for a training case whose correct answer is 1, the scalar product of weight vector and input vector must be positive (angle < 90°).
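
A minimal sketch of this check in Python (the weight vector, training cases and the helper name ''on_correct_side'' are made-up for illustration, not from the lecture):

<code python>
import numpy as np

def on_correct_side(w, x, target):
    """Weight-space view: the training case x defines a hyperplane through
    the origin; w must lie on the side determined by the target (threshold left out)."""
    s = np.dot(w, x)                  # scalar product <w, x>
    return s > 0 if target == 1 else s < 0

# Made-up example: one weight vector, two training cases with target 1
w = np.array([0.5, -1.0])
print(on_correct_side(w, np.array([1.0, -1.0]), 1))   # True:  <w,x> = 1.5 > 0, angle < 90 deg
print(on_correct_side(w, np.array([1.0,  1.0]), 1))   # False: <w,x> = -0.5, wrong side
</code>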

**Cone of feasible solutions**

Need to find a point that is on the right side of all the planes (training cases): such a point might not exist.
If there are weight vectors that are on the right side for all cases, they lie in a hyper-cone with its apex at the origin.
The average of two good weight vectors is a good weight vector => the problem is convex.
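
The convexity claim can be checked numerically. A small sketch (the training cases and the two weight vectors are made-up, but both classify every case correctly):

<code python>
import numpy as np

def all_correct(w, cases):
    """cases: list of (input vector, target in {0, 1}); threshold left out (i.e. 0)."""
    return all((np.dot(w, x) > 0) == (t == 1) for x, t in cases)

# Made-up, linearly separable training cases
cases = [(np.array([1.0, 0.2]), 1),
         (np.array([0.3, 1.0]), 0),
         (np.array([1.0, -0.5]), 1)]

w1 = np.array([2.0, -1.5])   # one feasible weight vector
w2 = np.array([1.0, -0.8])   # another feasible weight vector

print(all_correct(w1, cases), all_correct(w2, cases))   # True True
print(all_correct((w1 + w2) / 2, cases))                # True: the average is feasible too
</code>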

==== Why learning works ====

Goal: a proof that the squared distance between a feasible weight vector and the current weight vector gets smaller. Hopeful claim: every time the perceptron makes a mistake, the current weight vector gets closer to **all** feasible weight vectors. This is not true: a feasible weight vector that lies very close to the current case's plane can end up further away after the update.

"Generously feasible" weight vectors: weight vectors that lie within the feasible region by a margin at least as great as the length of the input vector.

Every time the perceptron makes a mistake, the squared distance to all of the **generously feasible** weight vectors is decreased by at least the **squared length** of the **update vector** (the input vector).

After a finite number of mistakes, the weight vector must lie in the **feasible** region, if that region exists.
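
A small sketch of this argument in Python (the training data and the candidate generously feasible vector are made-up; the update is the standard perceptron rule with the threshold folded into a bias weight): on every mistake, the squared distance to the chosen feasible vector drops.

<code python>
import numpy as np

def perceptron_step(w, x, target):
    """Standard perceptron rule: add x on a false negative, subtract x on a false positive."""
    pred = 1 if np.dot(w, x) > 0 else 0
    if pred == target:
        return w                      # no mistake, no update
    return w + x if target == 1 else w - x

# Made-up separable data; last component of each x is the bias input (always 1)
cases = [(np.array([1.0, 0.0, 1.0]), 1),
         (np.array([0.0, 1.0, 1.0]), 0),
         (np.array([1.0, 1.0, 1.0]), 1)]

w_feasible = np.array([10.0, -4.0, -2.0])   # assumed generously feasible for this data
w = np.zeros(3)

for epoch in range(10):
    for x, t in cases:
        d_before = np.sum((w - w_feasible) ** 2)
        w_new = perceptron_step(w, x, t)
        if not np.array_equal(w_new, w):     # a mistake was made
            d_after = np.sum((w_new - w_feasible) ** 2)
            # squared distance shrinks by at least |x|^2 on every mistake
            print(f"mistake: {d_before:.1f} -> {d_after:.1f}")
        w = w_new
</code>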

==== What perceptrons can't do ====

With a separate feature unit for each of the many possible binary input vectors, any possible discrimination can be made. But exponentially many feature units are necessary.

This type of table look-up does not generalize (each feature unit is tied to one specific binary vector, so nothing carries over to unseen cases).

A binary threshold output unit cannot tell whether two single-bit features are the same:
=> the constraints it would have to satisfy are contradictory (written out below).
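
A short derivation of the contradiction, assuming output 1 means "same", with weights w_1, w_2 and threshold theta (symbols chosen here for illustration):

<code latex>
% "same" cases must switch the unit on, "different" cases must keep it off
(1,1):\; w_1 + w_2 \ge \theta \qquad
(0,0):\; 0 \ge \theta \qquad
(1,0):\; w_1 < \theta \qquad
(0,1):\; w_2 < \theta
</code>

Adding the first two constraints gives w_1 + w_2 >= 2*theta, while adding the last two gives w_1 + w_2 < 2*theta, so no single binary threshold unit can satisfy all four cases.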

=== Data-Space ===
* Each input vector is a point in this space.
* The weight vector defines a plane (the decision boundary).
* The weight plane is perpendicular to the weight vector and misses the origin by a distance equal to the threshold.

If the positive and negative cases cannot be separated by such a plane => not linearly separable.
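
A minimal data-space sketch in Python (weight vector, threshold and points are made-up): a point is classified as positive when its scalar product with the weight vector reaches the threshold, and the decision plane sits at distance threshold / ||w|| from the origin (equal to the threshold for a unit-length weight vector, as assumed here).

<code python>
import numpy as np

w = np.array([0.6, 0.8])      # made-up weight vector (unit length)
theta = 1.0                   # made-up threshold

def classify(x):
    """Data-space view: output 1 iff the point x lies on the positive side of the plane w.x = theta."""
    return 1 if np.dot(w, x) >= theta else 0

print(classify(np.array([2.0, 1.0])))      # 1: w.x = 2.0 >= theta
print(classify(np.array([0.5, 0.5])))      # 0: w.x = 0.7 <  theta
print(theta / np.linalg.norm(w))           # 1.0: distance of the plane from the origin
</code>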

=== Discriminate patterns under translation with wrap-around ===

A binary threshold neuron cannot discriminate between different patterns that have the same number of on pixels, if the patterns are allowed to translate with wrap-around.

Proof sketch:

Pattern A in all possible translations (4 on pixels):
summed over all translations, the total input is 4x the sum of all the weights.

Pattern B in all possible translations (4 on pixels):
summed over all translations, the total input is also 4x the sum of all the weights.

But to discriminate them, every single case of pattern A would have to provide more total input than every single case of pattern B, which is impossible when the sums over all translations are equal (see the check below).
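
A quick check of the counting argument in Python (the two 16-pixel patterns are made-up examples with 4 on pixels each): summed over all cyclic translations, both patterns feed the unit exactly 4x the sum of the weights, whatever the weights are.

<code python>
import numpy as np

rng = np.random.default_rng(0)
weights = rng.normal(size=16)            # arbitrary weights, one per pixel

# Two made-up 1-D "retinas" of 16 pixels, each with 4 pixels on
pattern_a = np.array([1, 1, 0, 0, 1, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0])
pattern_b = np.array([1, 0, 1, 0, 1, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0])

def total_inputs(pattern):
    """Total input to the unit for every cyclic translation of the pattern."""
    return [np.dot(weights, np.roll(pattern, shift)) for shift in range(len(pattern))]

print(np.isclose(sum(total_inputs(pattern_a)), 4 * weights.sum()))   # True
print(np.isclose(sum(total_inputs(pattern_b)), 4 * weights.sum()))   # True
# Equal totals mean equal average input, so A's cases cannot all exceed B's cases.
</code>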

Example case that does work:

3 patterns in 2 classes.
The first class contains patterns with 4 pixels on.
The second class contains patterns with either 1 or 3 pixels on.

Weight from each pixel = 1.
Bias = -3.5

Any example with 3 pixels on: activation = -0.5

Any example with 1 pixel on: activation = -2.5

Any example with 4 pixels on: activation = 0.5

=> correctly classified.
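
The arithmetic above, checked in a few lines of Python (using the example's unit weights and bias of -3.5; only the number of on pixels matters):

<code python>
# Unit weight per pixel and the bias from the example
def activation(num_on_pixels, weight=1.0, bias=-3.5):
    return weight * num_on_pixels + bias

for n in (1, 3, 4):
    a = activation(n)
    print(n, "on pixels:", a, "->", "class 1" if a >= 0 else "class 2")
# 1 on pixels: -2.5 -> class 2
# 3 on pixels: -0.5 -> class 2
# 4 on pixels: 0.5 -> class 1
</code>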

The whole point of pattern recognition is to recognize patterns despite transformations such as translation. Minsky and Papert ("Group Invariance Theorem"): the part of a perceptron that learns cannot learn to do this if the transformations form a group.
* Translations with **wrap-around** form a group.
To deal with such transformations, the tricky part of recognition must be solved by hand-coded **feature detectors**, not by the learning procedure.

Conclusion (reached after about 20 years): a neural network has to learn the feature detectors, not only the weights.
* More layers of linear units do not help (the network is still linear).
* Fixed output non-linearities are not enough.

=> Need multiple layers of **adaptive**, non-linear hidden units. This requires:

* An efficient way of adapting all the weights, not just the last layer.
* Learning the weights going into hidden units is equivalent to learning features.
* No one is telling us what the feature detectors should be.
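
To illustrate why hidden units change the picture, here is a small hand-wired two-layer threshold network for the "two single-bit features are the same" task that a single unit cannot solve (the hidden units and weights are chosen by hand for this sketch; a learning procedure would have to discover equivalent feature detectors itself):

<code python>
def threshold(z):
    return 1 if z >= 0 else 0

def same_bits(x1, x2):
    """Two hidden feature detectors (AND and NOR) followed by an OR output unit."""
    h_and = threshold(1.0 * x1 + 1.0 * x2 - 1.5)        # fires only for (1, 1)
    h_nor = threshold(-1.0 * x1 - 1.0 * x2 + 0.5)       # fires only for (0, 0)
    return threshold(1.0 * h_and + 1.0 * h_nor - 0.5)   # OR of the two detectors

for x1 in (0, 1):
    for x2 in (0, 1):
        print(x1, x2, "->", same_bits(x1, x2))
# 0 0 -> 1, 0 1 -> 0, 1 0 -> 0, 1 1 -> 1
</code>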