Detection Results: VOC2010 BETA

Competition "comp4" (train on own data)

This leaderboard shows only those submissions that have been marked as public, so the displayed rankings should not be considered definitive. Entries equivalent to a selected submission are determined by bootstrapping the performance measure and assessing whether the differences between the selected submission and the others are not statistically significant (see section 3.5 of the VOC 2014 retrospective paper).
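The bootstrap equivalence test described above can be illustrated as follows. This is a hypothetical reconstruction of the idea, not the organisers' actual code: per-item scores for two methods are resampled jointly with replacement, and the two submissions are treated as equivalent when the bootstrap interval of the mean difference contains zero.

```python
import numpy as np

def paired_bootstrap(scores_a, scores_b, n_rounds=1000, alpha=0.05, seed=0):
    """Sketch of a paired bootstrap equivalence test: resample the same
    items for both methods and check whether the (1 - alpha) interval of
    mean-score differences includes zero."""
    rng = np.random.default_rng(seed)
    a = np.asarray(scores_a, float)
    b = np.asarray(scores_b, float)
    n = len(a)
    diffs = []
    for _ in range(n_rounds):
        idx = rng.integers(0, n, size=n)  # same resampled items for both methods
        diffs.append(a[idx].mean() - b[idx].mean())
    lo, hi = np.percentile(diffs, [100 * alpha / 2, 100 * (1 - alpha / 2)])
    return lo <= 0.0 <= hi  # True -> difference not statistically significant
```

The pairing matters: both methods are evaluated on the same resampled items, so the test measures the variability of the *difference* rather than of each score in isolation.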

Average Precision (AP %)

| Method | mean | aeroplane | bicycle | bird | boat | bottle | bus | car | cat | chair | cow | dining table | dog | horse | motorbike | person | potted plant | sheep | sofa | train | tv/monitor | Submission date |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Fast R-CNN + YOLO | 70.8 | 82.7 | 77.7 | 74.3 | 59.1 | 47.1 | 78.0 | 73.1 | 89.2 | 49.6 | 74.3 | 55.9 | 87.4 | 79.8 | 82.2 | 75.3 | 43.1 | 71.4 | 67.8 | 81.9 | 65.6 | 05-Jun-2015 |
| Fast R-CNN VGG16 extra data | 68.8 | 82.0 | 77.8 | 71.6 | 55.3 | 42.4 | 77.3 | 71.7 | 89.3 | 44.5 | 72.1 | 53.7 | 87.7 | 80.0 | 82.5 | 72.7 | 36.6 | 68.7 | 65.4 | 81.1 | 62.7 | 18-Apr-2015 |
| segDeepM | 67.2 | 82.3 | 75.2 | 67.1 | 50.7 | 49.8 | 71.1 | 69.6 | 88.2 | 42.5 | 71.2 | 50.0 | 85.7 | 76.6 | 81.8 | 69.3 | 41.5 | 71.9 | 62.2 | 73.2 | 64.6 | 29-Jan-2015 |
| BabyLearning | 63.8 | 77.7 | 73.8 | 62.3 | 48.8 | 45.4 | 67.3 | 67.0 | 80.3 | 41.3 | 70.8 | 49.7 | 79.5 | 74.7 | 78.6 | 64.5 | 36.0 | 69.9 | 55.7 | 70.4 | 61.7 | 12-Nov-2014 |
| R-CNN (bbox reg) | 62.9 | 79.3 | 72.4 | 63.1 | 44.0 | 44.4 | 64.6 | 66.3 | 84.9 | 38.8 | 67.3 | 48.4 | 82.3 | 75.0 | 76.7 | 65.7 | 35.8 | 66.2 | 54.8 | 69.1 | 58.8 | 27-Oct-2014 |
| R-CNN | 59.8 | 76.5 | 70.4 | 58.0 | 40.2 | 39.6 | 61.8 | 63.7 | 81.0 | 36.2 | 64.5 | 45.7 | 80.5 | 71.9 | 74.3 | 60.6 | 31.5 | 64.7 | 52.5 | 64.6 | 57.2 | 27-Oct-2014 |
| YOLO | 58.8 | 78.0 | 67.3 | 59.4 | 42.0 | 25.7 | 68.6 | 56.7 | 81.7 | 37.4 | 62.8 | 48.0 | 77.8 | 72.9 | 72.2 | 63.9 | 29.9 | 53.4 | 53.4 | 74.8 | 50.8 | 06-Nov-2015 |
| Feature Edit | 56.4 | 74.8 | 69.2 | 55.7 | 41.9 | 36.1 | 64.7 | 62.3 | 69.5 | 31.3 | 53.3 | 43.7 | 69.9 | 64.0 | 71.8 | 60.5 | 32.7 | 63.0 | 44.1 | 63.6 | 56.6 | 04-Sep-2014 |
| R-CNN (bbox reg) | 53.7 | 71.8 | 65.8 | 53.0 | 36.8 | 35.9 | 59.7 | 60.0 | 69.9 | 27.9 | 50.6 | 41.4 | 70.0 | 62.0 | 69.0 | 58.1 | 29.5 | 59.4 | 39.3 | 61.2 | 52.4 | 13-Mar-2014 |
| R-CNN | 50.2 | 67.1 | 64.1 | 46.7 | 32.0 | 30.5 | 56.4 | 57.2 | 65.9 | 27.0 | 47.3 | 40.9 | 66.6 | 57.8 | 65.9 | 53.6 | 26.7 | 56.5 | 38.1 | 52.8 | 50.2 | 30-Jan-2014 |
| poselets | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | 59.3 | - | - | - | - | - | 06-Jun-2014 |
| Head-Detect-Segment | - | - | - | - | - | - | - | - | 41.7 | - | - | - | - | - | - | - | - | - | - | - | - | 30-Aug-2010 |
| BERKELEY POSELETS | - | 33.2 | 51.9 | 8.5 | 8.2 | 34.8 | 39.0 | 48.8 | 22.2 | - | 20.6 | - | 18.5 | 48.2 | 44.1 | 48.5 | 9.1 | 28.0 | 13.0 | 22.5 | 33.0 | 29-Aug-2010 |
| UCI_LSVM-MDPM-10X | - | - | 48.1 | - | - | - | 54.7 | - | - | - | 25.1 | 6.0 | - | 46.7 | 41.1 | - | - | 31.2 | 17.7 | - | 32.3 | 30-Aug-2010 |
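All numbers above are PASCAL AP values. From VOC 2010 onward, AP is the exact area under the monotonically non-increasing precision/recall envelope (replacing the older 11-point interpolation). A minimal sketch, assuming detections have already been matched to ground truth (the usual IoU > 0.5 matching is omitted) so each detection is flagged as a true or false positive:

```python
import numpy as np

def voc_ap(scores, is_tp, n_gt):
    """VOC2010-style AP: rank detections by score, build the precision/recall
    curve, make precision monotonically non-increasing, and integrate the
    area under the resulting envelope."""
    order = np.argsort(-np.asarray(scores, float))
    tp = np.asarray(is_tp, float)[order]
    tp_cum = np.cumsum(tp)
    fp_cum = np.cumsum(1.0 - tp)
    recall = tp_cum / n_gt
    precision = tp_cum / (tp_cum + fp_cum)
    # envelope: precision at recall r is the max precision at any recall >= r
    mrec = np.concatenate(([0.0], recall, [1.0]))
    mpre = np.concatenate(([0.0], precision, [0.0]))
    for i in range(len(mpre) - 2, -1, -1):
        mpre[i] = max(mpre[i], mpre[i + 1])
    idx = np.where(mrec[1:] != mrec[:-1])[0]
    return float(np.sum((mrec[idx + 1] - mrec[idx]) * mpre[idx + 1]))
```

For example, two correct detections covering both ground-truth objects yield AP = 1.0, while a false positive ranked above the only correct detection halves the AP to 0.5.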

Abbreviations

Each entry below lists the submission's title, method name, affiliation, contributors, description, and date.

Title: Multiclass poselets
Method: BERKELEY POSELETS
Affiliation: UC Berkeley / Adobe
Contributors: Subhransu Maji, Thomas Brox, Jitendra Malik
Description: Poselets based on Bourdev et al. (ECCV 2010), extended for multiple categories.
Date: 2010-08-29 21:20:30

Title: Computational Baby Learning
Method: BabyLearning
Affiliation: National University of Singapore
Contributors: Xiaodan Liang, Si Liu, Yunchao Wei, Luoqi Liu, Liang Lin, Shuicheng Yan
Description: This entry is an implementation of the framework described in "Computational Baby Learning" (http://arxiv.org/abs/1411.2861). We build a computational model to interpret and mimic the baby-learning process, based on prior-knowledge modelling, exemplar learning, and learning with video contexts. Training data: (1) we used only two positive instances along with ~20,000 unlabelled videos to train the detector for each object category; (2) we used data from ILSVRC 2012 to pre-train the Network in Network [1] and fine-tuned the network with our newly mined instances. [1] Min Lin, Qiang Chen, Shuicheng Yan. Network In Network. ICLR 2014.
Date: 2014-11-12 03:35:57

Title: Fast R-CNN with YOLO Rescoring
Method: Fast R-CNN + YOLO
Affiliation: University of Washington
Contributors: Joseph Redmon, Santosh Divvala, Ross Girshick, Ali Farhadi
Description: We use the YOLO detection method to rescore the bounding boxes from Fast R-CNN. This helps mitigate false background detections and improves overall performance. For more information and example code see http://pjreddie.com/darknet/yolo/
Date: 2015-06-05 04:05:10

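The entry only summarises the combination rule. As a hypothetical sketch of the rescoring idea (the `rescore` function, its additive boost, and the overlap weighting are illustrative assumptions, not the submission's exact rule): each Fast R-CNN box gets a confidence boost when YOLO independently predicts an overlapping box.

```python
def iou(a, b):
    """Intersection-over-union of two [x1, y1, x2, y2] boxes."""
    x1, y1 = max(a[0], b[0]), max(a[1], b[1])
    x2, y2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    area = lambda r: (r[2] - r[0]) * (r[3] - r[1])
    union = area(a) + area(b) - inter
    return inter / union if union > 0 else 0.0

def rescore(frcnn_dets, yolo_dets):
    """Illustrative rescoring: for each Fast R-CNN detection (box, score),
    add the best overlap-weighted YOLO confidence to its score. Detections
    YOLO agrees with rise; background false positives are left unboosted."""
    out = []
    for box, score in frcnn_dets:
        best = max((iou(box, yb) * ys for yb, ys in yolo_dets), default=0.0)
        out.append((box, score + best))
    return out
```

Since the boost is additive, boxes that only Fast R-CNN predicts keep their original score, so the scheme can only reorder detections upward where the two detectors agree.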
Title: Fast R-CNN VGG16 extra data
Method: Fast R-CNN VGG16 extra data
Affiliation: Microsoft Research
Contributors: Ross Girshick
Description: Fast R-CNN is a new algorithm for training R-CNNs. The training process is a single fine-tuning run that jointly trains for softmax classification and bounding-box regression. Training took ~22 hours on a single GPU and testing takes ~330 ms per image. A tech report describing the method is forthcoming, and open-source code will be released. This entry was trained on the union of VOC 2012 train+val and VOC 2007 train+val+test.
Date: 2015-04-18 19:42:04

Title: Feature Edit with CNN features
Method: Feature Edit
Affiliation: Fudan University
Contributors: Zhiqiang Shen, Xiangyang Xue, et al.
Description: We edit the 5th-layer CNN features from the network defined by Krizhevsky et al. (2012), then add the edited features to the original feature set. Finding the feature variables to suppress takes two stages: the first finds the subset with the largest within-class variance, and the second finds the variables with the smallest inter-class variance.
Date: 2014-09-04 03:46:10

Title: Cat-Cut
Method: Head-Detect-Segment
Affiliation: CVIT, IIIT Hyderabad; VGG, University of Oxford
Contributors: Omkar M Parkhi, Andrea Vedaldi, C. V. Jawahar, Andrew Zisserman
Description: A detector is trained to detect cat heads. The returned detections are used to initialize seeds for GrabCut, which segments the cat. The bounding box is then inferred from these segmentations.
Date: 2010-08-30 21:56:06

Title: Region-based CNN
Method: R-CNN
Affiliation: UC Berkeley
Contributors: Ross Girshick, Jeff Donahue, Trevor Darrell, Jitendra Malik
Description: This entry is an implementation of the system described in "Rich feature hierarchies for accurate object detection and semantic segmentation" (http://arxiv.org/abs/1311.2524, version 5). Code is available at http://www.cs.berkeley.edu/~rbg/. Training data: (1) we used ILSVRC 2012 to pre-train the ConvNet (using Caffe); (2) we fine-tuned the resulting ConvNet on 2012 trainval; (3) we trained object-detector SVMs on 2012 trainval. The same detection SVMs were used for the 2012 and 2010 results. For this submission, we used the 16-layer ConvNet of Simonyan & Zisserman instead of Krizhevsky et al.'s ConvNet.
Date: 2014-10-27 15:56:23

Title: Regions with Convolutional Neural Network Features
Method: R-CNN
Affiliation: UC Berkeley
Contributors: Ross Girshick, Jeff Donahue, Trevor Darrell, Jitendra Malik
Description: This entry is an implementation of the system described in "Rich feature hierarchies for accurate object detection and semantic segmentation" (http://arxiv.org/abs/1311.2524). We made two small changes relative to the arXiv tech report that are responsible for improved performance: (1) we added a small amount of context around each region proposal (16 px at the warped size) and (2) we used a higher learning rate while fine-tuning (starting at 0.001). Aside from non-maximum suppression, no additional post-processing (e.g., detector or image-classification context) was applied. Code will be made available soon at http://www.cs.berkeley.edu/~rbg/. Training data: (1) we used ILSVRC 2012 to pre-train the ConvNet (using Caffe); (2) we fine-tuned the resulting ConvNet on 2012 train; (3) we trained object-detector SVMs on 2012 train+val. The same detection SVMs were used for the 2012 and 2010 results.
Date: 2014-01-30 02:13:56

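The non-maximum suppression step that the R-CNN entries name as their only post-processing is the standard greedy algorithm; a minimal sketch:

```python
import numpy as np

def nms(boxes, scores, iou_thresh=0.3):
    """Greedy non-maximum suppression: keep the highest-scoring box, discard
    remaining boxes whose IoU with it exceeds `iou_thresh`, and repeat on
    what is left. Returns indices of the kept boxes. The threshold value
    here is illustrative, not a submission's actual setting."""
    boxes = np.asarray(boxes, float)
    areas = (boxes[:, 2] - boxes[:, 0]) * (boxes[:, 3] - boxes[:, 1])
    order = np.argsort(-np.asarray(scores, float))
    keep = []
    while order.size > 0:
        i = order[0]
        keep.append(int(i))
        rest = order[1:]
        # IoU of the kept box with all remaining candidates
        x1 = np.maximum(boxes[i, 0], boxes[rest, 0])
        y1 = np.maximum(boxes[i, 1], boxes[rest, 1])
        x2 = np.minimum(boxes[i, 2], boxes[rest, 2])
        y2 = np.minimum(boxes[i, 3], boxes[rest, 3])
        inter = np.maximum(0.0, x2 - x1) * np.maximum(0.0, y2 - y1)
        iou = inter / (areas[i] + areas[rest] - inter)
        order = rest[iou <= iou_thresh]
    return keep
```

NMS runs per class after scoring, so two heavily overlapping detections of different classes are both kept.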
Title: Regions with Convolutional Neural Network Features
Method: R-CNN (bbox reg)
Affiliation: UC Berkeley
Contributors: Ross Girshick, Jeff Donahue, Trevor Darrell, Jitendra Malik
Description: This entry is an implementation of the system described in "Rich feature hierarchies for accurate object detection and semantic segmentation" (http://arxiv.org/abs/1311.2524). We made two small changes relative to the arXiv tech report that are responsible for improved performance: (1) we added a small amount of context around each region proposal (16 px at the warped size) and (2) we used a higher learning rate while fine-tuning (starting at 0.001). Aside from non-maximum suppression, no additional post-processing (e.g., detector or image-classification context) was applied. Code will be made available soon at http://www.cs.berkeley.edu/~rbg/. Training data: (1) we used ILSVRC 2012 to pre-train the ConvNet (using Caffe); (2) we fine-tuned the resulting ConvNet on 2012 train; (3) we trained object-detector SVMs on 2012 train+val. The same detection SVMs were used for the 2012 and 2010 results. This submission includes a simple regression from pool5 features to bounding-box coordinates.
Date: 2014-03-13 18:50:34

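The pool5 bounding-box regression mentioned in this entry is, in the R-CNN paper, a ridge regression onto scale-invariant box offsets. The sketch below follows that parameterisation; the regularisation value and function names are assumptions, not the submission's actual settings.

```python
import numpy as np

def fit_bbox_regressor(feats, proposals, gt_boxes, lam=1000.0):
    """Learn a ridge regression from features to the offsets (dx, dy, dw, dh)
    between each proposal and its matched ground-truth box (both given as
    [x1, y1, x2, y2] rows). `lam` is an assumed regularisation strength."""
    P = np.asarray(proposals, float)
    G = np.asarray(gt_boxes, float)
    def to_cwh(b):  # corners -> center x/y, width, height
        w, h = b[:, 2] - b[:, 0], b[:, 3] - b[:, 1]
        return b[:, 0] + w / 2, b[:, 1] + h / 2, w, h
    px, py, pw, ph = to_cwh(P)
    gx, gy, gw, gh = to_cwh(G)
    targets = np.stack([(gx - px) / pw, (gy - py) / ph,
                        np.log(gw / pw), np.log(gh / ph)], axis=1)
    X = np.asarray(feats, float)
    # closed-form ridge solution: W = (X^T X + lam I)^-1 X^T T
    return np.linalg.solve(X.T @ X + lam * np.eye(X.shape[1]), X.T @ targets)

def apply_bbox_regressor(W, feats, proposals):
    """Apply the learned offsets to move each proposal toward the object."""
    P = np.asarray(proposals, float)
    pw, ph = P[:, 2] - P[:, 0], P[:, 3] - P[:, 1]
    px, py = P[:, 0] + pw / 2, P[:, 1] + ph / 2
    d = np.asarray(feats, float) @ W
    gx, gy = px + d[:, 0] * pw, py + d[:, 1] * ph
    gw, gh = pw * np.exp(d[:, 2]), ph * np.exp(d[:, 3])
    return np.stack([gx - gw / 2, gy - gh / 2,
                     gx + gw / 2, gy + gh / 2], axis=1)
```

The log-space width/height targets keep the regression scale-invariant: the same learned weights correct a small proposal and a large one proportionally.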
Title: Region-based CNN
Method: R-CNN (bbox reg)
Affiliation: UC Berkeley
Contributors: Ross Girshick, Jeff Donahue, Trevor Darrell, Jitendra Malik
Description: This entry is an implementation of the system described in "Rich feature hierarchies for accurate object detection and semantic segmentation" (http://arxiv.org/abs/1311.2524, version 5). Code is available at http://www.cs.berkeley.edu/~rbg/. Training data: (1) we used ILSVRC 2012 to pre-train the ConvNet (using Caffe); (2) we fine-tuned the resulting ConvNet on 2012 trainval; (3) we trained object-detector SVMs on 2012 trainval. The same detection SVMs were used for the 2012 and 2010 results. For this submission, we used the 16-layer ConvNet of Simonyan & Zisserman instead of Krizhevsky et al.'s ConvNet.
Date: 2014-10-27 15:48:35

Title: 10x train set for LSVM, mixtures, deformable parts
Method: UCI_LSVM-MDPM-10X
Affiliation: University of California, Irvine
Contributors: Xiangxin Zhu, Carl Vondrick, Deva Ramanan, Charless Fowlkes
Description: We downloaded additional images from Flickr that match the distribution of the test set, and used Amazon Mechanical Turk to annotate training sets ten times larger than the standard trainval set. We used this larger training set to train models with the detector of Felzenszwalb et al.
Date: 2010-08-30 04:33:45

Title: You Only Look Once: Unified, Real-Time Detection
Method: YOLO
Affiliation: University of Washington
Contributors: Joseph Redmon, Santosh Divvala, Ross Girshick, Ali Farhadi
Description: We train a convolutional neural network to perform end-to-end object detection. Our network processes the full image and outputs multiple bounding boxes and class probabilities. At test time we process images in real time at 45 fps. For more information and example code see http://pjreddie.com/darknet/yolo/
Date: 2015-11-06 07:24:21

Title: Deep poselets
Method: poselets
Affiliation: Facebook
Contributors: Fei Yang, Rob Fergus
Description: Poselets trained with a CNN. We ran the original poselets on a large set of images, collected weakly labelled training data, trained a convolutional neural network, and applied it to the test data. This method allows training deep poselets without the need for large amounts of manual keypoint annotation.
Date: 2014-06-06 16:42:35

Title: CNN with Segmentation and Context Cues
Method: segDeepM
Affiliation: University of Toronto
Contributors: Yukun Zhu, Ruslan Salakhutdinov, Raquel Urtasun, Sanja Fidler
Description: Our method exploits object segmentation in order to improve the accuracy of object detection. We frame the object detection problem as inference in a Markov random field, in which each detection hypothesis scores object appearance as well as contextual information using convolutional neural networks, and is allowed to choose and score a segment out of a large pool of accurate object-segmentation proposals. This implementation adopts a 16-layer CNN instead of Krizhevsky et al.'s network for the appearance model, and uses CPMC to generate segmentation proposals.
Date: 2015-01-29 02:09:13