Object detection performance, as measured on the canonical PASCAL VOC dataset, has plateaued in the last few years. The best-performing methods are complex ensemble systems that typically combine multiple low-level image features with high-level context. In this paper, we propose a simple and scalable detection algorithm that improves mean average precision (mAP) by more than 30% relative to the previous best result on VOC 2012, achieving a mAP of 53.3%. Our approach combines two key insights: (1) one can apply high-capacity convolutional neural networks (CNNs) to bottom-up region proposals in order to localize and segment objects, and (2) when labeled training data is scarce, supervised pre-training for an auxiliary task, followed by domain-specific fine-tuning, yields a significant performance boost. Since we combine region proposals with CNNs, we call our method R-CNN: Regions with CNN features. We also compare R-CNN to OverFeat, a recently proposed sliding-window detector based on a similar CNN architecture. We find that R-CNN outperforms OverFeat by a large margin on the 200-class ILSVRC2013 detection dataset. Source code for the complete system is available at this http URL.
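The first insight, scoring bottom-up region proposals with features computed per region, can be illustrated with a toy sketch. This is not the released system: the names `crop`, `warp`, `detect`, and `toy_features` are hypothetical, the feature extractor is a trivial stand-in for a CNN, the proposals are hard-coded rather than produced by a proposal method, and the per-class linear scoring is an assumption made for illustration.

```python
from typing import Callable, Dict, List, Sequence, Tuple

Box = Tuple[int, int, int, int]   # (x1, y1, x2, y2), with x2 and y2 exclusive
Image = List[List[float]]         # grayscale image stored as rows of pixels


def crop(image: Image, box: Box) -> Image:
    """Cut the proposed region out of the image."""
    x1, y1, x2, y2 = box
    return [row[x1:x2] for row in image[y1:y2]]


def warp(region: Image, size: int) -> Image:
    """Nearest-neighbour resize to a fixed size x size input."""
    h, w = len(region), len(region[0])
    return [[region[r * h // size][c * w // size] for c in range(size)]
            for r in range(size)]


def detect(image: Image,
           proposals: Sequence[Box],
           extract_features: Callable[[Image], List[float]],
           class_weights: Dict[str, List[float]],
           size: int = 4) -> List[Tuple[Box, str, float]]:
    """Score every proposal for every class with a linear model (assumed)."""
    detections = []
    for box in proposals:
        features = extract_features(warp(crop(image, box), size))
        for label, weights in class_weights.items():
            score = sum(w * f for w, f in zip(weights, features))
            detections.append((box, label, score))
    return detections


def toy_features(patch: Image) -> List[float]:
    """Hypothetical stand-in for CNN features: mean intensity plus a bias term."""
    flat = [p for row in patch for p in row]
    return [sum(flat) / len(flat), 1.0]


# An 8x8 image with a bright 4x4 square as the "object".
image = [[0.0] * 8 for _ in range(8)]
for r in range(2, 6):
    for c in range(2, 6):
        image[r][c] = 1.0

proposals = [(2, 2, 6, 6), (0, 0, 2, 2)]     # hard-coded, illustrative only
class_weights = {"object": [1.0, -0.5]}      # made-up weights
detections = detect(image, proposals, toy_features, class_weights)
best = max(detections, key=lambda d: d[2])
```

In this sketch the proposal that covers the bright square receives the highest score. A real detector would replace `toy_features` with a pre-trained and fine-tuned network, which is where the second insight applies.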
Rich feature hierarchies for accurate object detection and semantic segmentation