Object detection performance, as measured on the canonical PASCAL VOC dataset, has plateaued in the last few years. The best-performing methods are complex ensemble systems that typically combine multiple low-level image features with high-level context. In this paper, we propose a simple and scalable detection algorithm that improves mean average precision (mAP) by more than 30% relative to the previous best result on VOC 2012, achieving a mAP of 53.3%. Our approach combines two key insights: (1) one can apply high-capacity convolutional neural networks (CNNs) to bottom-up region proposals in order to localize and segment objects, and (2) when labeled training data is scarce, supervised pre-training for an auxiliary task, followed by domain-specific fine-tuning, yields a significant performance boost. Since we combine region proposals with CNNs, we call our method R-CNN: Regions with CNN features. We also compare R-CNN to OverFeat, a recently proposed sliding-window detector based on a similar CNN architecture. We find that R-CNN outperforms OverFeat by a large margin on the 200-class ILSVRC2013 detection dataset. Source code for the complete system is available at this http URL.
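
To make the relative-improvement claim concrete, here is a short worked bound. It is an illustration derived only from the two numbers stated above, not a figure reported in the paper. The symbol p, introduced here as an assumption, stands for the previous best mAP on VOC 2012, in percent.

```latex
\[
\frac{53.3 - p}{p} > 0.30
\quad\Longrightarrow\quad
p < \frac{53.3}{1.30} = 41.0,
\qquad
53.3 - p > 53.3 - 41.0 = 12.3 .
\]
```

Under this reading, the previous best result would lie below 41.0% mAP, and the absolute gain would exceed 12.3 percentage points. The 30% figure is a ratio of mAP values and should not be read as a 30-point increase.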