The RPN is a fully convolutional network (FCN), so its convolutional layers can be shared with the Fast R-CNN object detection network.
Slide the network over the convolutional feature map (obtained from the last shared convolutional layer).
n x n window as input (e.g. n = 3), mapped to a lower-dimensional feature (e.g. 256-d)
Fed into two sibling fully connected layers: box classification (cls) and box regression (reg)
At each sliding-window location: predict up to k region proposals
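To make the sliding-window head concrete, here is a minimal sketch in PyTorch; the channel sizes and the RPNHead name are illustrative assumptions, not the original implementation.

```python
import torch
import torch.nn as nn

class RPNHead(nn.Module):
    def __init__(self, in_channels=512, mid_channels=256, k=9):
        super().__init__()
        # n x n sliding window (n = 3) implemented as a 3x3 convolution
        self.conv = nn.Conv2d(in_channels, mid_channels, kernel_size=3, padding=1)
        # sibling 1x1 layers: 2 scores (object / not object) and 4 box offsets per anchor
        self.cls = nn.Conv2d(mid_channels, 2 * k, kernel_size=1)
        self.reg = nn.Conv2d(mid_channels, 4 * k, kernel_size=1)

    def forward(self, feature_map):
        x = torch.relu(self.conv(feature_map))
        return self.cls(x), self.reg(x)

# toy usage: a 1 x 512 x 38 x 50 feature map yields 2k scores and 4k offsets per position
scores, offsets = RPNHead()(torch.randn(1, 512, 38, 50))
print(scores.shape, offsets.shape)  # torch.Size([1, 18, 38, 50]) torch.Size([1, 36, 38, 50])
```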
Output of RPN
Proposals are placed relative to k reference boxes = anchors
Each anchor is centered at the sliding window and associated with a scale and an aspect ratio; with 3 scales and 3 aspect ratios ⇒ k = 9 anchors at each sliding position (W·H·k anchors in total for a W x H feature map)
Pyramid of anchors
The features used for regression have the same spatial size (3 x 3) on the feature map. k bounding-box regressors are learned, one for each scale and aspect ratio; they do not share weights.
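A sketch of anchor generation in NumPy, assuming 3 scales, 3 aspect ratios and a feature stride of 16; the specific scale and ratio values are illustrative, not taken from the notes.

```python
import numpy as np

def generate_anchors(feat_h, feat_w, stride=16,
                     scales=(128, 256, 512), ratios=(0.5, 1.0, 2.0)):
    # base anchors centered at (0, 0): one width/height per scale/ratio pair
    base = []
    for s in scales:
        for r in ratios:
            w = s * np.sqrt(1.0 / r)
            h = s * np.sqrt(r)
            base.append([-w / 2, -h / 2, w / 2, h / 2])
    base = np.array(base)                      # (k, 4), k = 9

    # centers of all sliding positions in image coordinates
    xs = (np.arange(feat_w) + 0.5) * stride
    ys = (np.arange(feat_h) + 0.5) * stride
    cx, cy = np.meshgrid(xs, ys)
    centers = np.stack([cx, cy, cx, cy], axis=-1).reshape(-1, 1, 4)  # (W*H, 1, 4)

    return (centers + base).reshape(-1, 4)     # (W*H*k, 4) boxes as x1, y1, x2, y2

anchors = generate_anchors(feat_h=38, feat_w=50)
print(anchors.shape)  # (17100, 4) = 38 * 50 * 9
```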
For training: anchors that overlap a ground-truth object with IoU > 0.5 are labeled foreground
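A sketch of how such foreground labels could be assigned, assuming boxes are stored as (x1, y1, x2, y2) NumPy arrays; the 0.5 threshold follows the note above.

```python
import numpy as np

def iou(anchors, gt_boxes):
    """Pairwise IoU between (N, 4) anchors and (M, 4) ground-truth boxes."""
    a = anchors[:, None, :]                       # (N, 1, 4)
    g = gt_boxes[None, :, :]                      # (1, M, 4)
    ix1 = np.maximum(a[..., 0], g[..., 0])
    iy1 = np.maximum(a[..., 1], g[..., 1])
    ix2 = np.minimum(a[..., 2], g[..., 2])
    iy2 = np.minimum(a[..., 3], g[..., 3])
    inter = np.clip(ix2 - ix1, 0, None) * np.clip(iy2 - iy1, 0, None)
    area_a = (a[..., 2] - a[..., 0]) * (a[..., 3] - a[..., 1])
    area_g = (g[..., 2] - g[..., 0]) * (g[..., 3] - g[..., 1])
    return inter / (area_a + area_g - inter)      # (N, M)

def label_foreground(anchors, gt_boxes, threshold=0.5):
    # an anchor is foreground if its best overlap with any ground-truth box exceeds the threshold
    return iou(anchors, gt_boxes).max(axis=1) > threshold
```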
Non-Maximum Suppression (NMS):
Anchors overlap ⇒ proposals overlap. NMS sorts proposals by score and discards any proposal whose IoU with a higher-scoring proposal exceeds a threshold.
One could already stop here for binary (object vs. non-object) detection.
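A minimal NMS sketch in NumPy, following the sort-and-discard description above; the 0.7 IoU threshold is an illustrative default.

```python
import numpy as np

def nms(boxes, scores, iou_threshold=0.7):
    """boxes: (N, 4) as x1, y1, x2, y2; scores: (N,). Returns indices of kept proposals."""
    order = scores.argsort()[::-1]            # sort proposals by score, descending
    keep = []
    while order.size > 0:
        best = order[0]
        keep.append(best)
        # IoU of the best box with all remaining boxes
        x1 = np.maximum(boxes[best, 0], boxes[order[1:], 0])
        y1 = np.maximum(boxes[best, 1], boxes[order[1:], 1])
        x2 = np.minimum(boxes[best, 2], boxes[order[1:], 2])
        y2 = np.minimum(boxes[best, 3], boxes[order[1:], 3])
        inter = np.clip(x2 - x1, 0, None) * np.clip(y2 - y1, 0, None)
        area_best = (boxes[best, 2] - boxes[best, 0]) * (boxes[best, 3] - boxes[best, 1])
        area_rest = (boxes[order[1:], 2] - boxes[order[1:], 0]) * (boxes[order[1:], 3] - boxes[order[1:], 1])
        overlap = inter / (area_best + area_rest - inter)
        # keep only proposals that do not overlap the best box too much
        order = order[1:][overlap <= iou_threshold]
    return keep
```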
The RPN proposes RoIs of different sizes ⇒ differently sized CNN feature maps.
Region of Interest (RoI) pooling simplifies the problem by reducing these feature maps to the same size.
It splits the input feature map into a fixed number of roughly equal regions, then applies max-pooling to each region, so the output size is always fixed.
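A naive RoI pooling sketch in NumPy that mirrors this description: split the RoI into a fixed output grid of roughly equal regions and max-pool each one. Real implementations (e.g. torchvision.ops.roi_pool) additionally handle batching and a spatial scale factor.

```python
import numpy as np

def roi_pool(feature_map, roi, output_size=(7, 7)):
    """feature_map: (C, H, W); roi: (x1, y1, x2, y2) in feature-map coordinates."""
    x1, y1, x2, y2 = roi
    region = feature_map[:, y1:y2, x1:x2]
    c, h, w = region.shape
    out_h, out_w = output_size
    # boundaries of the roughly equal sub-regions
    ys = np.linspace(0, h, out_h + 1).astype(int)
    xs = np.linspace(0, w, out_w + 1).astype(int)
    pooled = np.zeros((c, out_h, out_w), dtype=feature_map.dtype)
    for i in range(out_h):
        for j in range(out_w):
            # guard against empty cells when the RoI is smaller than the output grid
            cell = region[:, ys[i]:max(ys[i + 1], ys[i] + 1),
                             xs[j]:max(xs[j + 1], xs[j] + 1)]
            pooled[:, i, j] = cell.max(axis=(1, 2))
    return pooled

# RoIs of different sizes map to the same fixed 7 x 7 output
fmap = np.random.rand(256, 38, 50)
print(roi_pool(fmap, (3, 5, 20, 30)).shape)   # (256, 7, 7)
print(roi_pool(fmap, (10, 10, 45, 18)).shape) # (256, 7, 7)
```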
Now those features can be used for classification.
Two tasks: classify each RoI into object classes (plus background) and regress a refined bounding box.
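A sketch of the two sibling output layers on top of the pooled RoI features, assuming PyTorch; the hidden size and the number of classes (20 + background) are illustrative assumptions.

```python
import torch
import torch.nn as nn

class DetectionHead(nn.Module):
    def __init__(self, in_features=256 * 7 * 7, hidden=1024, num_classes=21):
        super().__init__()
        self.fc = nn.Sequential(
            nn.Flatten(), nn.Linear(in_features, hidden), nn.ReLU(),
        )
        # task 1: classify each RoI (object classes + background)
        self.cls = nn.Linear(hidden, num_classes)
        # task 2: regress a refined box per class
        self.reg = nn.Linear(hidden, 4 * num_classes)

    def forward(self, pooled_rois):            # (num_rois, 256, 7, 7)
        x = self.fc(pooled_rois)
        return self.cls(x), self.reg(x)

head = DetectionHead()
cls_scores, box_deltas = head(torch.randn(8, 256, 7, 7))
print(cls_scores.shape, box_deltas.shape)      # (8, 21) and (8, 84)
```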
Fast R-CNN and the RPN would have different convolutional-layer weights if trained independently.
Alternating training:
First train the RPN, then use its proposals to train Fast R-CNN. Fast R-CNN is then used to initialize the RPN, and the steps are repeated.
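A schematic outline of this alternating scheme; train_rpn, train_detector and share_conv_weights are hypothetical callables supplied by the caller, named here only to make the order of the steps explicit.

```python
def alternating_training(rpn, fast_rcnn, dataset,
                         train_rpn, train_detector, share_conv_weights,
                         rounds=2):
    for _ in range(rounds):
        # 1. train the RPN (hypothetical training routine)
        train_rpn(rpn, dataset)
        # 2. generate proposals with the trained RPN and train Fast R-CNN on them
        proposals = [rpn(image) for image, _ in dataset]
        train_detector(fast_rcnn, dataset, proposals)
        # 3. use Fast R-CNN's shared convolutional layers to (re)initialize the RPN
        share_conv_weights(src=fast_rcnn, dst=rpn)
    return rpn, fast_rcnn
```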