====== Faster R-CNN ======

  * First: use a pretrained CNN to create a feature map.
  * Region Proposal Network (RPN): fully convolutional network (FCN) that proposes regions
    * Outputs a set of rectangular object proposals, each with an objectness score
  * RoI pooling of the proposals
  * Fast R-CNN detector
    * Classifies the content of each bounding box
    * Adjusts the bounding box coordinates (better fit for the object)

The RPN is fully convolutional so that its convolutional layers can be shared with the Fast R-CNN object detection network.

===== Region Proposal Network =====

  * Input: image of any size
  * Output: rectangular object proposals, each with an objectness score

==== Architecture ====

  * Fully convolutional network
  * Shares convolutional layers with Fast R-CNN

A small network is slid over the convolutional feature map (the output of the last shared convolutional layer).

  * Input: an n x n window of the feature map (e.g. n = 3)
  * Each window is mapped to a lower-dimensional feature (e.g. 256-d)

This feature is fed into two sibling fully connected layers, each implemented as a 1 x 1 convolution:

  - box-regression layer (reg), 1 x 1
  - box-classification layer (cls), 1 x 1

==== Anchors ====

At each sliding-window position, at most k region proposals are predicted.

Output of the RPN:

  * reg layer: 4k outputs (box coordinates)
  * cls layer: 2k outputs (probability of foreground and of background for each proposal)

Proposals are parameterized relative to k reference boxes, the anchors. Each anchor is centered at the sliding window and associated with a scale and an aspect ratio. 3 scales and 3 aspect ratios => k = 9 anchors at each sliding position (W x H x k anchors in total for a feature map of size W x H).

=== Translation-invariant anchors ===

Pyramid of anchors:

  * Classifies and regresses bounding boxes with reference to anchor boxes of multiple **scales** and **aspect ratios**
  * Only needs images and feature maps of a single size

The features used for regression all have the same spatial size (3 x 3) on the feature map. To cover the different box sizes, k bounding-box regressors are learned, one per scale and aspect ratio; they do not share weights.

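
As a rough illustration of the anchor scheme described above, the following sketch enumerates the k = 9 anchors at every sliding position of a feature map. The concrete scales (128, 256, 512 pixels), aspect ratios (0.5, 1, 2) and the feature stride of 16 are assumptions chosen for the example and are not stated in these notes; the function names ''base_anchors'' and ''all_anchors'' are hypothetical.

```python
import math


def base_anchors(scales=(128, 256, 512), ratios=(0.5, 1.0, 2.0)):
    """Return k = len(scales) * len(ratios) boxes (x1, y1, x2, y2) centered at (0, 0)."""
    boxes = []
    for s in scales:
        for r in ratios:
            # Keep the area at s * s while setting height / width = r.
            w = s / math.sqrt(r)
            h = s * math.sqrt(r)
            boxes.append((-w / 2, -h / 2, w / 2, h / 2))
    return boxes


def all_anchors(feat_h, feat_w, stride=16):
    """Place the k base anchors at every position of an H x W feature map."""
    anchors = []
    for row in range(feat_h):
        for col in range(feat_w):
            # Center of this sliding position in image coordinates (assumed stride).
            cx = (col + 0.5) * stride
            cy = (row + 0.5) * stride
            for x1, y1, x2, y2 in base_anchors():
                anchors.append((x1 + cx, y1 + cy, x2 + cx, y2 + cy))
    return anchors


anchors = all_anchors(feat_h=38, feat_w=50)
print(len(anchors))  # 38 * 50 * 9 = 17100
```

The reg layer then predicts 4 offsets and the cls layer 2 scores for each of these boxes, which is where the 4k and 2k output sizes come from.
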
==== Training ====

Anchors that overlap a ground-truth object with IoU > 0.5 are labeled foreground.

  * Each mini-batch arises from a single image and contains positive and negative anchors
  * Far more negative than positive anchors are present, so using all of them would bias training towards negatives
  * Instead, randomly sample 256 anchors with a positive/negative ratio of 1:1
  * New layers are initialized from a zero-mean Gaussian distribution with $\sigma=0.01$
  * Shared convolutional layers are initialized with weights pretrained for ImageNet classification

===== Postprocessing =====

Non-Maximum Suppression (NMS): anchors overlap => proposals overlap. NMS sorts the proposals by score and discards every proposal whose IoU with a higher-scoring proposal exceeds a threshold.

For binary object detection (object vs. no object), the pipeline could already stop here.

===== ROI Pooling =====

The RPN proposes RoIs of different sizes => differently sized CNN feature maps. Region of Interest pooling simplifies the problem by reducing these feature maps to the same size: it splits the input feature map into a fixed number of roughly equal regions, then applies max-pooling to every region. The output size is therefore always fixed, and the resulting features can be used for classification.

===== Region-based Convolutional NN =====

Two tasks:

  * Classify proposals into m classes (plus a background class, to remove bad proposals)
  * Adjust the bounding box of each proposal according to the predicted class

===== Sharing Features =====

Fast R-CNN and the RPN would have different convolutional layer weights if trained independently.

Alternating training: first train the RPN, then use its proposals to train Fast R-CNN. The network tuned by Fast R-CNN is then used to initialize the RPN.
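
Returning to the Postprocessing step, the greedy NMS procedure (sort by score, drop proposals that overlap a higher-scoring one too much) can be sketched in plain Python as follows. This is a minimal illustration, not an optimized implementation: boxes are assumed to be ''(x1, y1, x2, y2)'' tuples, the default threshold of 0.7 is an assumption, and the helper names ''iou'' and ''nms'' are hypothetical.

```python
def iou(a, b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def nms(boxes, scores, threshold=0.7):
    """Return the indices of the kept boxes, highest score first."""
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep = []
    for i in order:
        # Keep a box only if its IoU with every already kept
        # (higher-scoring) box stays at or below the threshold.
        if all(iou(boxes[i], boxes[j]) <= threshold for j in keep):
            keep.append(i)
    return keep


boxes = [(0, 0, 10, 10), (1, 1, 11, 11), (20, 20, 30, 30)]
scores = [0.9, 0.8, 0.7]
print(nms(boxes, scores, threshold=0.5))  # [0, 2]
```

In the example, the second box has an IoU of 81/119 (about 0.68) with the first and is discarded, while the third box does not overlap either and is kept. The same ''iou'' helper also expresses the foreground criterion used when labeling anchors for training.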