PASCAL VOC Challenge performance evaluation and download server |
Detection results: average precision (%) per class on the VOC test set.

method | mean AP | aeroplane | bicycle | bird | boat | bottle | bus | car | cat | chair | cow | dining table | dog | horse | motorbike | person | potted plant | sheep | sofa | train | tv/monitor | submission date
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---
Fast R-CNN + YOLO | 70.8 | 82.7 | 77.7 | 74.3 | 59.1 | 47.1 | 78.0 | 73.1 | 89.2 | 49.6 | 74.3 | 55.9 | 87.4 | 79.8 | 82.2 | 75.3 | 43.1 | 71.4 | 67.8 | 81.9 | 65.6 | 05-Jun-2015
Fast R-CNN VGG16 extra data | 68.8 | 82.0 | 77.8 | 71.6 | 55.3 | 42.4 | 77.3 | 71.7 | 89.3 | 44.5 | 72.1 | 53.7 | 87.7 | 80.0 | 82.5 | 72.7 | 36.6 | 68.7 | 65.4 | 81.1 | 62.7 | 18-Apr-2015
segDeepM | 67.2 | 82.3 | 75.2 | 67.1 | 50.7 | 49.8 | 71.1 | 69.6 | 88.2 | 42.5 | 71.2 | 50.0 | 85.7 | 76.6 | 81.8 | 69.3 | 41.5 | 71.9 | 62.2 | 73.2 | 64.6 | 29-Jan-2015
BabyLearning | 63.8 | 77.7 | 73.8 | 62.3 | 48.8 | 45.4 | 67.3 | 67.0 | 80.3 | 41.3 | 70.8 | 49.7 | 79.5 | 74.7 | 78.6 | 64.5 | 36.0 | 69.9 | 55.7 | 70.4 | 61.7 | 12-Nov-2014
R-CNN (bbox reg) | 62.9 | 79.3 | 72.4 | 63.1 | 44.0 | 44.4 | 64.6 | 66.3 | 84.9 | 38.8 | 67.3 | 48.4 | 82.3 | 75.0 | 76.7 | 65.7 | 35.8 | 66.2 | 54.8 | 69.1 | 58.8 | 27-Oct-2014
R-CNN | 59.8 | 76.5 | 70.4 | 58.0 | 40.2 | 39.6 | 61.8 | 63.7 | 81.0 | 36.2 | 64.5 | 45.7 | 80.5 | 71.9 | 74.3 | 60.6 | 31.5 | 64.7 | 52.5 | 64.6 | 57.2 | 27-Oct-2014
YOLO | 58.8 | 78.0 | 67.3 | 59.4 | 42.0 | 25.7 | 68.6 | 56.7 | 81.7 | 37.4 | 62.8 | 48.0 | 77.8 | 72.9 | 72.2 | 63.9 | 29.9 | 53.4 | 53.4 | 74.8 | 50.8 | 06-Nov-2015
Feature Edit | 56.4 | 74.8 | 69.2 | 55.7 | 41.9 | 36.1 | 64.7 | 62.3 | 69.5 | 31.3 | 53.3 | 43.7 | 69.9 | 64.0 | 71.8 | 60.5 | 32.7 | 63.0 | 44.1 | 63.6 | 56.6 | 04-Sep-2014
R-CNN (bbox reg) | 53.7 | 71.8 | 65.8 | 53.0 | 36.8 | 35.9 | 59.7 | 60.0 | 69.9 | 27.9 | 50.6 | 41.4 | 70.0 | 62.0 | 69.0 | 58.1 | 29.5 | 59.4 | 39.3 | 61.2 | 52.4 | 13-Mar-2014
R-CNN | 50.2 | 67.1 | 64.1 | 46.7 | 32.0 | 30.5 | 56.4 | 57.2 | 65.9 | 27.0 | 47.3 | 40.9 | 66.6 | 57.8 | 65.9 | 53.6 | 26.7 | 56.5 | 38.1 | 52.8 | 50.2 | 30-Jan-2014
poselets | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | 59.3 | - | - | - | - | - | 06-Jun-2014
Head-Detect-Segment | - | - | - | - | - | - | - | - | 41.7 | - | - | - | - | - | - | - | - | - | - | - | - | 30-Aug-2010
BERKELEY POSELETS | - | 33.2 | 51.9 | 8.5 | 8.2 | 34.8 | 39.0 | 48.8 | 22.2 | - | 20.6 | - | 18.5 | 48.2 | 44.1 | 48.5 | 9.1 | 28.0 | 13.0 | 22.5 | 33.0 | 29-Aug-2010
UCI_LSVM-MDPM-10X | - | - | 48.1 | - | - | - | 54.7 | - | - | - | 25.1 | 6.0 | - | 46.7 | 41.1 | - | - | 31.2 | 17.7 | - | 32.3 | 30-Aug-2010
Title | Method | Affiliation | Contributors | Description | Date
---|---|---|---|---|---
Multiclass poselets | BERKELEY POSELETS | UC Berkeley / Adobe | Subhransu Maji, Thomas Brox, Jitendra Malik | Poselets based on Bourdev et al ECCV 2010, extended for multiple categories. | 2010-08-29 21:20:30 |
Computational Baby Learning | BabyLearning | National University of Singapore | Xiaodan Liang, Si Liu, Yunchao Wei, Luoqi Liu, Liang Lin, Shuicheng Yan | This entry is an implementation of the framework described in "Computational Baby Learning" (http://arxiv.org/abs/1411.2861). We build a computational model to interpret and mimic the baby learning process, based on prior knowledge modelling, exemplar learning, and learning with video contexts. Training data: (1) We used only two positive instances along with ~20,000 unlabelled videos to train the detector for each object category. (2) We used data from ILSVRC 2012 to pre-train the Network in Network [1] and fine-tuned the network with our newly mined instances. [1] Min Lin, Qiang Chen, Shuicheng Yan. Network In Network. In ICLR 2014. | 2014-11-12 03:35:57 |
Fast R-CNN with YOLO Rescoring | Fast R-CNN + YOLO | University of Washington | Joseph Redmon, Santosh Divvala, Ross Girshick, Ali Farhadi | We use the YOLO detection method to rescore the bounding boxes from Fast R-CNN. This helps mitigate false background detections and improve overall performance. For more information and example code see: http://pjreddie.com/darknet/yolo/ | 2015-06-05 04:05:10 |
Fast R-CNN VGG16 extra data | Fast R-CNN VGG16 extra data | Microsoft Research | Ross Girshick | Fast R-CNN is a new algorithm for training R-CNNs. The training process is a single fine-tuning run that jointly trains for softmax classification and bounding-box regression. Training took ~22 hours on a single GPU and testing takes ~330 ms per image. A tech report describing the method is forthcoming. Open-source code will be released. This entry was trained on the union of VOC 2012 train+val and VOC 2007 train+val+test. | 2015-04-18 19:42:04 |
Feature Edit with CNN features | Feature Edit | Fudan University | Zhiqiang Shen, Xiangyang Xue et al. | We edit the fifth-layer CNN features from the network defined by Krizhevsky et al. (2012), then add the new features to the original feature set. Two stages are used to find the variables to inhibit: the first finds the subset with the largest within-class variance, and the second finds those with the smallest inter-class variance. | 2014-09-04 03:46:10 |
Cat-Cut | Head-Detect-Segment | CVIT, IIIT Hyderabad; VGG, University of Oxford | Omkar M Parkhi, Andrea Vedaldi, C. V. Jawahar, Andrew Zisserman | A detector is trained to detect cat heads. The returned detections are used to initialize seeds for GrabCut, which segments the cat. The bounding box is then inferred from these segmentations. | 2010-08-30 21:56:06 |
Region-based CNN | R-CNN | UC Berkeley | Ross Girshick, Jeff Donahue, Trevor Darrell, Jitendra Malik | This entry is an implementation of the system described in "Rich feature hierarchies for accurate object detection and semantic segmentation" (http://arxiv.org/abs/1311.2524 version 5). Code is available at http://www.cs.berkeley.edu/~rbg/. Training data: (1) We used ILSVRC 2012 to pre-train the ConvNet (using caffe) (2) We fine-tuned the resulting ConvNet using 2012 trainval (3) We trained object detector SVMs using 2012 trainval. The same detection SVMs were used for the 2012 and 2010 results. For this submission, we used the 16-layer ConvNet from Simonyan & Zisserman instead of Krizhevsky et al.'s ConvNet. | 2014-10-27 15:56:23 |
Regions with Convolutional Neural Network Features | R-CNN | UC Berkeley | Ross Girshick, Jeff Donahue, Trevor Darrell, Jitendra Malik | This entry is an implementation of the system described in "Rich feature hierarchies for accurate object detection and semantic segmentation" (http://arxiv.org/abs/1311.2524). We made two small changes relative to the arXiv tech report that are responsible for improved performance: (1) we added a small amount of context around each region proposal (16px at the warped size) and (2) we used a higher learning rate while fine-tuning (starting at 0.001). Aside from non-maximum suppression, no additional post-processing (e.g., detector or image classification context) was applied. Code will be made available soon at http://www.cs.berkeley.edu/~rbg/. Training data: (1) we used ILSVRC 2012 to pre-train the ConvNet (using caffe); (2) we fine-tuned the resulting ConvNet using 2012 train; (3) we trained object detector SVMs using 2012 train+val. The same detection SVMs were used for the 2012 and 2010 results. | 2014-01-30 02:13:56 |
Regions with Convolutional Neural Network Features | R-CNN (bbox reg) | UC Berkeley | Ross Girshick, Jeff Donahue, Trevor Darrell, Jitendra Malik | This entry is an implementation of the system described in "Rich feature hierarchies for accurate object detection and semantic segmentation" (http://arxiv.org/abs/1311.2524). We made two small changes relative to the arXiv tech report that are responsible for improved performance: (1) we added a small amount of context around each region proposal (16px at the warped size) and (2) we used a higher learning rate while fine-tuning (starting at 0.001). Aside from non-maximum suppression, no additional post-processing (e.g., detector or image classification context) was applied. Code will be made available soon at http://www.cs.berkeley.edu/~rbg/. Training data: (1) we used ILSVRC 2012 to pre-train the ConvNet (using caffe); (2) we fine-tuned the resulting ConvNet using 2012 train; (3) we trained object detector SVMs using 2012 train+val. The same detection SVMs were used for the 2012 and 2010 results. This submission includes a simple regression from pool5 features to bounding box coordinates. | 2014-03-13 18:50:34 |
Region-based CNN | R-CNN (bbox reg) | UC Berkeley | Ross Girshick, Jeff Donahue, Trevor Darrell, Jitendra Malik | This entry is an implementation of the system described in "Rich feature hierarchies for accurate object detection and semantic segmentation" (http://arxiv.org/abs/1311.2524 version 5). Code is available at http://www.cs.berkeley.edu/~rbg/. Training data: (1) We used ILSVRC 2012 to pre-train the ConvNet (using caffe) (2) We fine-tuned the resulting ConvNet using 2012 trainval (3) We trained object detector SVMs using 2012 trainval. The same detection SVMs were used for the 2012 and 2010 results. For this submission, we used the 16-layer ConvNet from Simonyan & Zisserman instead of Krizhevsky et al.'s ConvNet. | 2014-10-27 15:48:35 |
10x train set for LSVM, mixtures, deformable parts | UCI_LSVM-MDPM-10X | University of California, Irvine | Xiangxin Zhu, Carl Vondrick, Deva Ramanan, Charless Fowlkes | We downloaded additional images from Flickr that match the distribution of the test set. We used Amazon's Mechanical Turk to annotate training sets that are 10 times larger than the standard trainval set. We used this larger training set to train models with the detector of Felzenszwalb et al. | 2010-08-30 04:33:45 |
You Only Look Once: Unified, Real-Time Detection | YOLO | University of Washington | Joseph Redmon, Santosh Divvala, Ross Girshick, Ali Farhadi | We train a convolutional neural network to perform end-to-end object detection. Our network processes the full image and outputs multiple bounding boxes and class probabilities. At test time we process images in real time at 45 fps. For more information and example code see: http://pjreddie.com/darknet/yolo/ | 2015-11-06 07:24:21 |
Deep poselets | poselets | - | Fei Yang, Rob Fergus | Poselets trained with a CNN. We ran the original poselets on a large set of images, collected weakly labelled training data, trained a convolutional neural network, and applied it to the test data. This approach allows training deep poselets without requiring large numbers of manual keypoint annotations. | 2014-06-06 16:42:35 |
CNN with Segmentation and Context Cues | segDeepM | University of Toronto | Yukun Zhu, Ruslan Salakhutdinov, Raquel Urtasun, Sanja Fidler | Our method exploits object segmentation in order to improve the accuracy of object detection. We frame the object detection problem as inference in a Markov Random Field, in which each detection hypothesis scores object appearance as well as contextual information using Convolutional Neural Networks, and allows the hypothesis to choose and score a segment out of a large pool of accurate object segmentation proposals. This implementation adopts a 16-layer CNN instead of Krizhevsky et al.'s network for the appearance model, and CPMC for generating segmentation proposals. | 2015-01-29 02:09:13 |
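The Fast R-CNN + YOLO entry rescores Fast R-CNN boxes using YOLO's detections of the same image. The exact combination rule is not given in the description, so the sketch below is a minimal illustration under an assumed rule: a Fast R-CNN detection gets a score boost when YOLO predicts a sufficiently overlapping box. The `rescore` function, its additive boost, and the 0.5 IoU threshold are all illustrative assumptions, not the authors' method.

```python
def iou(a, b):
    """Intersection-over-union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / float(area_a + area_b - inter)

def rescore(frcnn_dets, yolo_dets, iou_thresh=0.5):
    """Hypothetical rescoring rule: add the best-matching YOLO score to a
    Fast R-CNN detection's score when the boxes overlap enough.
    Detections are (box, score) pairs."""
    out = []
    for box, score in frcnn_dets:
        match = max((ys for yb, ys in yolo_dets if iou(box, yb) >= iou_thresh),
                    default=0.0)
        out.append((box, score + match))  # agreement with YOLO raises the score
    return out
```

In the real system this would be applied per class; the point is only that detections corroborated by a second detector rise in the ranking, which is how the combined entry suppresses false background detections.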
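The R-CNN entries note that non-maximum suppression is the only post-processing applied to the detector output. A standard greedy NMS can be sketched as follows (simplified illustration; production implementations run per class over arrays, not Python lists):

```python
def box_iou(a, b):
    """Intersection-over-union of two (x1, y1, x2, y2) boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    union = ((a[2] - a[0]) * (a[3] - a[1])
             + (b[2] - b[0]) * (b[3] - b[1]) - inter)
    return inter / float(union)

def nms(dets, iou_thresh=0.3):
    """Greedy non-maximum suppression: repeatedly keep the highest-scoring
    box and drop all remaining boxes that overlap it above the threshold.
    dets: list of ((x1, y1, x2, y2), score) pairs."""
    dets = sorted(dets, key=lambda d: d[1], reverse=True)
    keep = []
    while dets:
        best = dets.pop(0)
        keep.append(best)
        dets = [d for d in dets if box_iou(best[0], d[0]) < iou_thresh]
    return keep
```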
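The R-CNN (bbox reg) entries add a regression from pool5 features to bounding-box coordinates. Applying the predicted offsets typically uses the parameterization from the R-CNN paper (center shifts scaled by the proposal size, log-space width/height scaling); the sketch below assumes that standard form rather than the entry's exact regressor.

```python
import math

def apply_bbox_deltas(box, deltas):
    """Apply R-CNN-style regression deltas (tx, ty, tw, th) to a proposal
    box (x1, y1, x2, y2). Standard parameterization: tx, ty shift the
    center in units of the proposal size; tw, th scale width and height
    in log space."""
    x1, y1, x2, y2 = box
    tx, ty, tw, th = deltas
    pw, ph = x2 - x1, y2 - y1            # proposal width/height
    px, py = x1 + 0.5 * pw, y1 + 0.5 * ph  # proposal center
    gx, gy = pw * tx + px, ph * ty + py    # regressed center
    gw, gh = pw * math.exp(tw), ph * math.exp(th)  # regressed size
    return (gx - 0.5 * gw, gy - 0.5 * gh, gx + 0.5 * gw, gy + 0.5 * gh)
```

With all-zero deltas the proposal is returned unchanged, which is why the regressor is trained to predict small corrections on top of already-reasonable proposals.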
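The Feature Edit entry describes a two-stage search for feature dimensions to inhibit: ones with large within-class variance, then ones with small inter-class variance. A simplified single-score version of that idea can be sketched as below; `dims_to_inhibit` and its scoring rule are illustrative assumptions, not the authors' exact procedure.

```python
from statistics import mean, pvariance

def dims_to_inhibit(features_by_class, k):
    """Rank feature dimensions by (mean within-class variance) minus
    (variance of per-class means) and return the k highest: dimensions
    that are noisy inside each class and uninformative across classes.
    features_by_class: {class_label: [feature_vector, ...]}."""
    n_dims = len(next(iter(features_by_class.values()))[0])
    scores = []
    for d in range(n_dims):
        within = mean(pvariance(v[d] for v in vecs)
                      for vecs in features_by_class.values())
        class_means = [mean(v[d] for v in vecs)
                       for vecs in features_by_class.values()]
        between = pvariance(class_means)
        scores.append(within - between)
    ranked = sorted(range(n_dims), key=lambda d: scores[d], reverse=True)
    return ranked[:k]
```

The selected dimensions would then be zeroed ("inhibited") to produce the edited feature vectors that the entry appends to the original feature set.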