
Perceptron

  • Popularized by Frank Rosenblatt (1960s)
  • Used for tasks with very large feature vectors

Decision unit: binary threshold neuron.

The bias can be learned like a weight: add an extra input that is always 1; the bias is then the weight on this extra input.
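
As a one-line restatement (notation mine, not from the notes): with an extra input component that is always 1,

  w \cdot x + b = (w_1, \dots, w_n, b) \cdot (x_1, \dots, x_n, 1)

so learning the bias is just learning one more weight.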

Perceptron convergence

  • If the output is correct ⇒ no weight changes
  • If the output unit incorrectly outputs 0 ⇒ add the input vector to the weight vector.
  • If the output unit incorrectly outputs 1 ⇒ subtract the input vector from the weight vector.

This procedure is guaranteed to find a set of weights that gets the right answer for all training cases, if such a set exists. ⇒ Deciding which features to use is the crucial part (the update rule is sketched in code below).
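
A minimal Python sketch of this learning procedure, assuming the bias has been folded in as an extra input that is always 1 (function and variable names are illustrative, not from the original notes):

  import numpy as np

  def predict(w, x):
      # Binary threshold neuron; bias is folded into w, so x ends with a constant 1.
      return 1 if np.dot(w, x) >= 0 else 0

  def train_perceptron(inputs, targets, epochs=100):
      # inputs: array of feature vectors, each augmented with a trailing 1 (bias input)
      # targets: array of 0/1 labels
      w = np.zeros(len(inputs[0]))
      for _ in range(epochs):
          mistakes = 0
          for x, t in zip(inputs, targets):
              y = predict(w, x)
              if y == t:
                  continue          # output correct: no weight change
              elif t == 1:          # incorrectly output 0: add input vector
                  w += x
              else:                 # incorrectly output 1: subtract input vector
                  w -= x
              mistakes += 1
          if mistakes == 0:         # all training cases correct: stop
              break
      return w

For a linearly separable task such as AND, e.g. train_perceptron(np.array([[0,0,1],[0,1,1],[1,0,1],[1,1,1]]), np.array([0,0,0,1])), the loop terminates with a consistent weight vector.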

Geometrical Interpretation

  • 1 dimension for each weight
  • Point represents a setting of all weights
  • Leaving the threshold out (bias trick), each training case can be represented as a hyperplane through the origin: inputs correspond to planes (i.e., constraints).
    • For a particular training case, the weights must lie on one side of this hyperplane to get the answer correct.

The plane goes through the origin and is perpendicular to the input vector. A good weight vector must lie on the correct side of this hyperplane: for a training case whose correct answer is 1, the scalar product of the weight vector and the input vector must be positive (angle < 90°); for a correct answer of 0 it must be negative.
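
Written out (notation mine; threshold folded into the bias): for a training case with input vector x and target t, the constraint on the weight vector w is

  w \cdot x > 0 \quad \text{if } t = 1, \qquad w \cdot x < 0 \quad \text{if } t = 0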

Cone of feasible solutions

We need to find a point that is on the right side of all the planes (training cases); such a point might not exist. If there are weight vectors that get the right answer for all cases, they lie in a hyper-cone with its apex at the origin. The average of two good weight vectors is itself a good weight vector ⇒ the problem is convex.
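
The convexity claim in one line (notation mine): if w_1 and w_2 both satisfy a constraint, so does their average, since

  \tfrac{1}{2}(w_1 + w_2) \cdot x = \tfrac{1}{2}\,(w_1 \cdot x) + \tfrac{1}{2}\,(w_2 \cdot x) > 0

whenever w_1 \cdot x > 0 and w_2 \cdot x > 0 (the t = 0 case is analogous).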

Proof idea: show that the squared distance between a feasible weight vector and the current weight vector gets smaller. Hopeful claim: every time the perceptron makes a mistake, the current weight vector gets closer to all feasible weight vectors. Not true: a feasible weight vector that lies just on the correct side of a constraint plane can end up farther away after the update.

Fix: consider “generously feasible” weight vectors, which lie within the feasible region by a margin at least as great as the length of the input vector that defines each constraint plane.

Every time the perceptron makes a mistake, the squared distance to all of the generously feasible weight vectors is decreased by at least the squared length of the update vector (which is the input vector).
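
A sketch of why (notation mine; take the case where the unit incorrectly output 0, so the update is w_{new} = w_{old} + x): generous feasibility gives w^* \cdot x \ge \|x\|^2 and the mistake gives w_{old} \cdot x \le 0, hence

  \|w^* - w_{new}\|^2 = \|w^* - w_{old}\|^2 - 2\,(w^* - w_{old}) \cdot x + \|x\|^2
                      \le \|w^* - w_{old}\|^2 - 2\|x\|^2 + \|x\|^2
                      = \|w^* - w_{old}\|^2 - \|x\|^2

The case where the unit incorrectly outputs 1 is symmetric.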

After a finite number of mistakes, the weight vector must lie in the feasible region, if that region exists.

A separate feature unit for each of the exponentially many binary input vectors ⇒ any possible discrimination can be made. But exponentially many features are necessary.

This type of table look-up won't generalize (a guess: because of the binary encoding).

A binary threshold output unit can't tell whether two single-bit features are the same ⇒ the training cases give contradictory constraints on the weights.
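
The standard argument (notation mine; weights w_1, w_2, threshold \theta): the positive cases (1,1), (0,0) and the negative cases (1,0), (0,1) require

  w_1 + w_2 \ge \theta, \quad 0 \ge \theta, \quad w_1 < \theta, \quad w_2 < \theta

Adding the first two gives w_1 + w_2 \ge 2\theta, adding the last two gives w_1 + w_2 < 2\theta; a contradiction, so no such weights exist.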

Data-Space

  • Each input vector is a point in the space.
  • The weight vector defines a plane.
  • The weight plane is perpendicular to the weight vector and misses the origin by a distance equal to the threshold.
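
In symbols (notation mine): the unit outputs 1 exactly for the input points with w \cdot x \ge \theta, so the decision boundary in data-space is the plane

  \{\, x : w \cdot x = \theta \,\}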

If the positive and negative cases cannot be separated by such a plane ⇒ they are not linearly separable.

Discriminate patterns under translation with wrap-around

A binary threshold neuron can't discriminate between different patterns that have the same number of on-pixels (if the patterns can translate with wrap-around).

Proof:

Take pattern A in all of its possible translations (4 pixels on). Summed over all these cases, each pixel is turned on by 4 different translations, so the total input is 4× the sum of all the weights.

Take pattern B in all of its possible translations (also 4 pixels on). Summed over all these cases, the total input is again 4× the sum of all the weights.

But to discriminate correctly, every single case of pattern A must provide more input to the decision unit than every single case of pattern B. This is impossible when the totals over all cases are equal.
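
In symbols (notation mine; x_A^{(k)} is the k-th translation of pattern A and there are N translations of each pattern): since each pixel is on in exactly 4 translations,

  \sum_k w \cdot x_A^{(k)} = 4 \sum_i w_i = \sum_k w \cdot x_B^{(k)}

but correct discrimination would need w \cdot x_A^{(k)} > \theta > w \cdot x_B^{(j)} for all k, j, which would force the left-hand sum above N\theta and the right-hand sum below N\theta. Contradiction.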

Case:

3 patterns, 2 classes. The first class contains the pattern with 4 pixels on. The second class contains the patterns with either 1 or 3 pixels on.

Weight from each pixel = 1. Bias = -3.5

Any example with 3 pixels on: activation = -0.5

Any example with 1 pixel on: activation = -2.5

Any example with 4 pixels on: activation = 0.5

⇒ correctly classified.
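
A tiny numerical check of this case (the weights, bias and pixel counts are from the notes; the retina size and pixel positions are arbitrary):

  import numpy as np

  n_pixels = 8                  # arbitrary retina size; only the number of on-pixels matters
  w = np.ones(n_pixels)         # weight from each pixel = 1
  bias = -3.5

  def activation(on_pixels):
      # Input vector with the given number of pixels switched on (positions are irrelevant here).
      x = np.zeros(n_pixels)
      x[:on_pixels] = 1
      return np.dot(w, x) + bias

  for k in (1, 3, 4):
      print(k, "pixels on -> activation", activation(k), "-> class", int(activation(k) >= 0))
  # 1 -> -2.5 (class 0), 3 -> -0.5 (class 0), 4 -> 0.5 (class 1)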

The whole point of pattern recognition is to recognize patterns despite transformations like translation. Minsky and Papert (“Group Invariance Theorem”): the part of a perceptron that learns cannot learn to do this if the transformations form a group. Translations with wrap-around form a group. To deal with such transformations, the tricky part of recognition must be solved by hand-coded feature detectors, not by the learning procedure.

Conclusion (reached after ~20 years): the neural network has to learn the feature detectors, not only the weights.

  • More layers of linear units do not help (still linear).
  • Fixed output non-linearities are not enough.

⇒ Need multiple layers of adaptive, non-linear hidden units.

  • Efficient way of adapting all the weights.
  • Learning weights going into hidden units is equivalent to learning features.
  • No one is telling us what the feature vectors should be.