Detection Results: VOC2012 BETA

Competition "comp4" (train on own data)

This leaderboard shows only those submissions that have been marked as public, so the displayed rankings should not be considered definitive. Entries equivalent to a selected submission are determined by bootstrapping the performance measure and assessing whether the differences between the selected submission and the others are statistically insignificant (see sec. 3.5 in the VOC 2014 paper).
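The bootstrap equivalence test described above can be sketched as follows; this is a minimal illustration over hypothetical per-image score lists for two submissions, not the organisers' evaluation code:

```python
import random

def bootstrap_diff_significant(scores_a, scores_b, n_boot=2000, alpha=0.05):
    """Paired bootstrap over images: resample image indices with
    replacement, recompute the mean score difference between the two
    submissions each time, and report whether the (1 - alpha) interval
    of the resampled differences excludes zero."""
    n = len(scores_a)
    diffs = []
    for _ in range(n_boot):
        idx = [random.randrange(n) for _ in range(n)]
        diffs.append(sum(scores_a[i] - scores_b[i] for i in idx) / n)
    diffs.sort()
    lo = diffs[int((alpha / 2) * n_boot)]
    hi = diffs[int((1 - alpha / 2) * n_boot) - 1]
    return not (lo <= 0.0 <= hi)  # True: difference is significant
```

Two submissions are then "equivalent" exactly when this function returns False for their paired per-image scores.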

Average Precision (AP %)

Method | mean | aeroplane | bicycle | bird | boat | bottle | bus | car | cat | chair | cow | diningtable | dog | horse | motorbike | person | pottedplant | sheep | sofa | train | tv/monitor | submission date
R-FCN, ResNet Ensemble(VOC+COCO) | 88.4 | 94.8 | 92.9 | 90.6 | 82.4 | 81.8 | 89.9 | 91.7 | 97.1 | 76.0 | 93.4 | 71.9 | 96.6 | 94.3 | 93.9 | 92.8 | 75.7 | 91.9 | 80.8 | 93.6 | 86.4 | 09-Oct-2016
HIK_FRCN | 87.9 | 95.0 | 93.2 | 91.3 | 80.3 | 77.7 | 90.6 | 89.9 | 97.8 | 72.8 | 93.7 | 70.7 | 97.2 | 95.4 | 94.0 | 91.8 | 72.7 | 92.8 | 81.1 | 94.1 | 86.2 | 19-Sep-2016
** Deformable R-FCN, ResNet-101 (VOC+COCO) ** | 87.1 | 94.0 | 91.7 | 88.5 | 79.4 | 78.0 | 89.7 | 90.8 | 96.9 | 74.2 | 93.1 | 71.3 | 95.9 | 94.8 | 93.2 | 92.5 | 71.7 | 91.8 | 78.3 | 93.2 | 83.3 | 23-Mar-2017
FasterRcnn-ResNeXt101(COCO+07++12, single model) | 86.8 | 93.9 | 93.4 | 88.3 | 80.2 | 72.6 | 89.4 | 89.3 | 96.8 | 73.0 | 91.5 | 72.3 | 95.4 | 94.5 | 93.8 | 91.7 | 70.7 | 90.6 | 81.2 | 92.6 | 83.9 | 04-May-2017
R-FCN, ResNet (VOC+COCO) | 85.0 | 92.3 | 89.9 | 86.7 | 74.7 | 75.2 | 86.7 | 89.0 | 95.8 | 70.2 | 90.4 | 66.5 | 95.0 | 93.2 | 92.1 | 91.1 | 71.0 | 89.7 | 76.0 | 92.0 | 83.4 | 09-Oct-2016
PVANet+ | 84.2 | 93.5 | 89.8 | 84.1 | 75.6 | 69.7 | 88.2 | 87.9 | 93.4 | 70.0 | 87.7 | 75.3 | 92.9 | 90.5 | 90.9 | 90.2 | 67.3 | 86.4 | 80.3 | 92.0 | 78.8 | 26-Oct-2016
BlitzNet512 | 83.8 | 93.1 | 89.4 | 84.7 | 75.5 | 65.0 | 86.6 | 87.4 | 94.5 | 69.9 | 88.8 | 71.7 | 92.5 | 91.6 | 91.1 | 88.9 | 61.2 | 90.4 | 79.2 | 91.8 | 83.0 | 19-Jul-2017
Faster RCNN, ResNet (VOC+COCO) | 83.8 | 92.1 | 88.4 | 84.8 | 75.9 | 71.4 | 86.3 | 87.8 | 94.2 | 66.8 | 89.4 | 69.2 | 93.9 | 91.9 | 90.9 | 89.6 | 67.9 | 88.2 | 76.8 | 90.3 | 80.0 | 10-Dec-2015
PVANet+ (compressed) | 83.7 | 92.8 | 88.9 | 83.4 | 74.7 | 68.7 | 88.2 | 87.8 | 93.5 | 69.5 | 87.3 | 74.3 | 93.1 | 89.5 | 89.9 | 90.2 | 66.8 | 86.4 | 79.8 | 91.9 | 78.2 | 18-Nov-2016
ICT_360_ISD | 82.6 | 90.7 | 89.4 | 87.0 | 75.8 | 70.1 | 86.0 | 86.5 | 96.2 | 65.3 | 86.8 | 62.1 | 94.6 | 90.6 | 90.5 | 89.7 | 63.5 | 87.3 | 72.7 | 90.7 | 77.1 | 18-Nov-2016
SSD512 VGG16 07++12+COCO | 82.2 | 91.4 | 88.6 | 82.6 | 71.4 | 63.1 | 87.4 | 88.1 | 93.9 | 66.9 | 86.6 | 66.3 | 92.0 | 91.7 | 90.8 | 88.5 | 60.9 | 87.0 | 75.4 | 90.2 | 80.4 | 10-Oct-2016
BlitzNet300 | 80.2 | 91.0 | 86.5 | 80.0 | 70.1 | 54.7 | 84.4 | 84.1 | 92.5 | 65.1 | 83.5 | 69.2 | 91.2 | 88.1 | 88.5 | 85.7 | 55.8 | 85.4 | 79.3 | 89.8 | 78.2 | 19-Jul-2017
OHEM+FRCN, VGG16, VOC+COCO | 80.1 | 90.1 | 87.4 | 79.9 | 65.8 | 66.3 | 86.1 | 85.0 | 92.9 | 62.4 | 83.4 | 69.5 | 90.6 | 88.9 | 88.9 | 83.6 | 59.0 | 82.0 | 74.7 | 88.2 | 77.3 | 18-Apr-2016
DSSD513_ResNet101_07++12 | 80.0 | 92.1 | 86.6 | 80.3 | 68.7 | 58.2 | 84.3 | 85.0 | 94.6 | 63.3 | 85.9 | 65.6 | 93.0 | 88.5 | 87.8 | 86.4 | 57.4 | 85.2 | 73.4 | 87.8 | 76.8 | 15-Feb-2017
SSD300 VGG16 07++12+COCO | 79.3 | 91.0 | 86.0 | 78.1 | 65.0 | 55.4 | 84.9 | 84.0 | 93.4 | 62.1 | 83.6 | 67.3 | 91.3 | 88.9 | 88.6 | 85.6 | 54.7 | 83.8 | 77.3 | 88.3 | 76.5 | 03-Oct-2016
DSOD300+ | 79.3 | 90.5 | 87.4 | 77.5 | 67.4 | 57.7 | 84.7 | 83.6 | 92.6 | 64.8 | 81.3 | 66.4 | 90.1 | 87.8 | 88.1 | 87.3 | 57.9 | 80.3 | 75.6 | 88.1 | 76.7 | 16-Mar-2017
BlitzNet | 79.0 | 90.0 | 85.3 | 80.4 | 67.2 | 53.6 | 82.9 | 83.6 | 93.8 | 62.6 | 84.0 | 65.9 | 91.6 | 86.6 | 87.7 | 84.6 | 56.8 | 84.7 | 74.0 | 88.0 | 75.8 | 17-Mar-2017
Res101+hyper+FasterRCNN(COCO+0712trainval) | 78.9 | 88.9 | 85.3 | 79.9 | 68.4 | 63.8 | 84.1 | 83.9 | 91.0 | 62.0 | 83.2 | 64.3 | 88.8 | 87.6 | 85.9 | 87.1 | 60.8 | 80.7 | 70.5 | 88.0 | 73.0 | 10-Feb-2017
SSD512 VGG16 07++12 | 78.5 | 90.0 | 85.3 | 77.7 | 64.3 | 58.5 | 85.1 | 84.3 | 92.6 | 61.3 | 83.4 | 65.1 | 89.9 | 88.5 | 88.2 | 85.5 | 54.4 | 82.4 | 70.7 | 87.1 | 75.6 | 13-Oct-2016
YOLOv2 (VOC + COCO) | 78.2 | 88.8 | 87.0 | 77.8 | 64.9 | 51.8 | 85.2 | 79.3 | 93.1 | 64.4 | 81.4 | 70.2 | 91.3 | 88.1 | 87.2 | 81.0 | 57.7 | 78.1 | 71.0 | 88.5 | 76.8 | 12-Mar-2017
HFM_VGG16 | 77.5 | 88.8 | 85.1 | 76.8 | 64.8 | 61.4 | 85.0 | 84.1 | 90.0 | 59.9 | 82.6 | 61.9 | 88.5 | 85.2 | 85.6 | 86.9 | 56.7 | 79.5 | 67.5 | 85.4 | 73.4 | 21-Mar-2016
Res101+FasterRCNN(COCO+0712trainval) | 77.3 | 86.9 | 83.7 | 76.5 | 65.9 | 59.5 | 81.9 | 82.6 | 90.9 | 60.1 | 81.0 | 64.2 | 88.0 | 84.9 | 86.2 | 85.2 | 58.7 | 79.5 | 72.6 | 86.4 | 71.3 | 05-Feb-2017
RUN300 3WAY, VGG16, 07++12 | 77.0 | 89.3 | 84.2 | 75.1 | 63.6 | 51.0 | 83.8 | 80.6 | 91.6 | 59.5 | 82.0 | 64.2 | 90.0 | 86.4 | 86.2 | 82.9 | 52.2 | 82.0 | 73.4 | 87.7 | 74.6 | 18-Jul-2017
FasterRCNN | 76.8 | 84.4 | 85.5 | 81.4 | 65.4 | 60.3 | 84.9 | 83.8 | 93.4 | 62.0 | 85.7 | 55.5 | 90.8 | 88.4 | 81.4 | 85.7 | 50.5 | 82.7 | 65.2 | 89.0 | 60.0 | 23-Jul-2017
IFRN_07+12 | 76.6 | 87.8 | 83.9 | 79.0 | 64.5 | 58.9 | 82.2 | 82.0 | 91.4 | 56.5 | 82.3 | 62.4 | 90.4 | 85.6 | 86.4 | 86.4 | 55.1 | 80.5 | 62.7 | 85.4 | 69.2 | 07-Jun-2016
ION | 76.4 | 87.5 | 84.7 | 76.8 | 63.8 | 58.3 | 82.6 | 79.0 | 90.9 | 57.8 | 82.0 | 64.7 | 88.9 | 86.5 | 84.7 | 82.3 | 51.4 | 78.2 | 69.2 | 85.2 | 73.5 | 23-Nov-2015
** DSOD300 ** | 76.3 | 89.4 | 85.3 | 72.9 | 62.7 | 49.5 | 83.6 | 80.6 | 92.1 | 60.8 | 77.9 | 65.6 | 88.9 | 85.5 | 86.8 | 84.6 | 51.1 | 77.7 | 72.3 | 86.0 | 72.2 | 17-Mar-2017
PLN | 76.0 | 88.3 | 84.7 | 77.4 | 65.9 | 55.8 | 82.0 | 79.4 | 91.9 | 58.2 | 77.3 | 58.8 | 89.5 | 85.3 | 85.3 | 82.9 | 55.8 | 79.6 | 64.6 | 86.5 | 69.9 | 27-Mar-2017
MNC baseline | 75.9 | 86.4 | 81.1 | 76.4 | 64.3 | 57.8 | 81.1 | 80.3 | 92.0 | 55.2 | 82.6 | 61.0 | 89.9 | 86.4 | 84.6 | 85.4 | 53.1 | 79.8 | 66.1 | 84.7 | 69.9 | 15-Dec-2015
Faster RCNN baseline (VOC+COCO) | 75.9 | 87.4 | 83.6 | 76.8 | 62.9 | 59.6 | 81.9 | 82.0 | 91.3 | 54.9 | 82.6 | 59.0 | 89.0 | 85.5 | 84.7 | 84.1 | 52.2 | 78.9 | 65.5 | 85.4 | 70.2 | 24-Nov-2015
SSD300 VGG16 07++12 | 75.8 | 88.1 | 82.9 | 74.4 | 61.9 | 47.6 | 82.7 | 78.8 | 91.5 | 58.1 | 80.0 | 64.1 | 89.4 | 85.7 | 85.5 | 82.6 | 50.2 | 79.8 | 73.6 | 86.6 | 72.1 | 18-Oct-2016
RFCN_DCN | 75.7 | 85.7 | 83.0 | 76.9 | 63.6 | 57.8 | 79.4 | 79.5 | 92.9 | 58.2 | 79.6 | 60.9 | 90.3 | 85.3 | 85.1 | 83.5 | 55.7 | 79.6 | 64.5 | 84.6 | 68.1 | 27-Jun-2017
MCC_FRCN, ResNet101, 07++12 | 75.4 | 86.0 | 83.5 | 78.3 | 62.2 | 59.5 | 80.4 | 79.1 | 91.2 | 55.9 | 80.1 | 56.3 | 90.2 | 86.6 | 84.1 | 82.8 | 53.0 | 78.2 | 65.5 | 85.4 | 69.9 | 21-Nov-2016
YOLOv2 | 75.4 | 86.6 | 85.0 | 76.8 | 61.1 | 55.5 | 81.2 | 78.2 | 91.8 | 56.8 | 79.6 | 61.7 | 89.7 | 86.0 | 85.0 | 84.2 | 51.2 | 79.4 | 62.9 | 84.9 | 71.0 | 23-Feb-2017
BlitzNet | 75.4 | 87.5 | 82.2 | 74.6 | 61.6 | 46.0 | 81.5 | 78.4 | 91.4 | 58.2 | 80.3 | 64.9 | 89.1 | 83.6 | 85.8 | 81.5 | 50.6 | 79.9 | 74.8 | 84.9 | 71.2 | 17-Mar-2017
LocNet | 74.8 | 86.3 | 83.0 | 76.1 | 60.8 | 54.6 | 79.9 | 79.0 | 90.6 | 54.3 | 81.6 | 62.0 | 89.0 | 85.7 | 85.5 | 82.8 | 49.7 | 76.6 | 67.5 | 83.2 | 67.4 | 06-Nov-2015
DDT augmentation based on web images | 74.4 | 86.5 | 81.9 | 76.2 | 63.4 | 55.4 | 80.8 | 80.1 | 89.7 | 51.6 | 78.6 | 56.2 | 88.8 | 84.8 | 85.5 | 82.6 | 50.6 | 78.1 | 64.1 | 85.6 | 68.1 | 26-Jul-2017
MR_CNN_S_CNN_MORE_DATA | 73.9 | 85.5 | 82.9 | 76.6 | 57.8 | 62.7 | 79.4 | 77.2 | 86.6 | 55.0 | 79.1 | 62.2 | 87.0 | 83.4 | 84.7 | 78.9 | 45.3 | 73.4 | 65.8 | 80.3 | 74.0 | 06-Jun-2015
HyperNet_VGG | 71.4 | 84.2 | 78.5 | 73.6 | 55.6 | 53.7 | 78.7 | 79.8 | 87.7 | 49.6 | 74.9 | 52.1 | 86.0 | 81.7 | 83.3 | 81.8 | 48.6 | 73.5 | 59.4 | 79.9 | 65.7 | 12-Oct-2015
HyperNet_SP | 71.3 | 84.1 | 78.3 | 73.3 | 55.5 | 53.6 | 78.6 | 79.6 | 87.5 | 49.5 | 74.9 | 52.1 | 85.6 | 81.6 | 83.2 | 81.6 | 48.4 | 73.2 | 59.3 | 79.7 | 65.6 | 28-Oct-2015
Fast R-CNN + YOLO | 70.7 | 83.4 | 78.5 | 73.5 | 55.8 | 43.4 | 79.1 | 73.1 | 89.4 | 49.4 | 75.5 | 57.0 | 87.5 | 80.9 | 81.0 | 74.7 | 41.8 | 71.5 | 68.5 | 82.1 | 67.2 | 06-Nov-2015
MR_CNN_S_CNN | 70.7 | 85.0 | 79.6 | 71.5 | 55.3 | 57.7 | 76.0 | 73.9 | 84.6 | 50.5 | 74.3 | 61.7 | 85.5 | 79.9 | 81.7 | 76.4 | 41.0 | 69.0 | 61.2 | 77.7 | 72.1 | 09-May-2015
RPN | 70.4 | 84.9 | 79.8 | 74.3 | 53.9 | 49.8 | 77.5 | 75.9 | 88.5 | 45.6 | 77.1 | 55.3 | 86.9 | 81.7 | 80.9 | 79.6 | 40.1 | 72.6 | 60.9 | 81.2 | 61.5 | 01-Jun-2015
FasterRCNN | 70.4 | 82.1 | 78.6 | 72.6 | 54.3 | 52.1 | 77.3 | 76.6 | 87.4 | 49.8 | 76.1 | 50.5 | 86.5 | 80.1 | 82.0 | 80.8 | 46.7 | 70.6 | 58.8 | 80.5 | 65.3 | 23-Jul-2017
DEEP_ENSEMBLE_COCO | 70.1 | 84.0 | 79.4 | 71.6 | 51.9 | 51.1 | 74.1 | 72.1 | 88.6 | 48.3 | 73.4 | 57.8 | 86.1 | 80.0 | 80.7 | 70.4 | 46.6 | 69.6 | 68.8 | 75.9 | 71.4 | 03-May-2015
OHEM+FRCN, VGG16 | 69.8 | 81.5 | 78.9 | 69.6 | 52.3 | 46.5 | 77.4 | 72.1 | 88.2 | 48.8 | 73.8 | 58.3 | 86.9 | 79.7 | 81.4 | 75.0 | 43.0 | 69.5 | 64.8 | 78.5 | 68.9 | 18-Apr-2016
Networks on Convolutional Feature Maps | 68.8 | 82.8 | 79.0 | 71.6 | 52.3 | 53.7 | 74.1 | 69.0 | 84.9 | 46.9 | 74.3 | 53.1 | 85.0 | 81.3 | 79.5 | 72.2 | 38.9 | 72.4 | 59.5 | 76.7 | 68.1 | 17-Apr-2015
Fast R-CNN VGG16 extra data | 68.4 | 82.3 | 78.4 | 70.8 | 52.3 | 38.7 | 77.8 | 71.6 | 89.3 | 44.2 | 73.0 | 55.0 | 87.5 | 80.5 | 80.8 | 72.0 | 35.1 | 68.3 | 65.7 | 80.4 | 64.2 | 17-Apr-2015
UMICH_FGS_STRUCT | 66.4 | 82.9 | 76.1 | 64.1 | 44.6 | 49.4 | 70.3 | 71.2 | 84.6 | 42.7 | 68.6 | 55.8 | 82.7 | 77.1 | 79.9 | 68.7 | 41.4 | 69.0 | 60.0 | 72.0 | 66.2 | 20-Jun-2015
segDeepM | 66.4 | 81.1 | 75.6 | 65.7 | 47.7 | 46.1 | 72.1 | 69.1 | 86.8 | 43.0 | 71.0 | 53.0 | 84.9 | 76.3 | 78.8 | 68.8 | 40.0 | 70.0 | 61.8 | 71.4 | 64.1 | 04-Mar-2016
NUS_NIN_c2000 | 63.8 | 80.2 | 73.8 | 61.9 | 43.7 | 43.0 | 70.3 | 67.6 | 80.7 | 41.9 | 69.7 | 51.7 | 78.2 | 75.2 | 76.9 | 65.1 | 38.6 | 68.3 | 58.0 | 68.7 | 63.3 | 30-Oct-2014
BabyLearning | 63.2 | 78.0 | 74.2 | 61.3 | 45.7 | 42.7 | 68.2 | 66.8 | 80.2 | 40.6 | 70.0 | 49.8 | 79.0 | 74.5 | 77.9 | 64.0 | 35.3 | 67.9 | 55.7 | 68.7 | 62.6 | 12-Nov-2014
R-CNN (bbox reg) | 62.4 | 79.6 | 72.7 | 61.9 | 41.2 | 41.9 | 65.9 | 66.4 | 84.6 | 38.5 | 67.2 | 46.7 | 82.0 | 74.8 | 76.0 | 65.2 | 35.6 | 65.4 | 54.2 | 67.4 | 60.3 | 26-Oct-2014
NUS_NIN | 62.4 | 77.9 | 73.1 | 62.6 | 39.5 | 43.3 | 69.1 | 66.4 | 78.9 | 39.1 | 68.1 | 50.0 | 77.2 | 71.3 | 76.1 | 64.7 | 38.4 | 66.9 | 56.2 | 66.9 | 62.7 | 30-Oct-2014
R-CNN | 59.2 | 76.8 | 70.9 | 56.6 | 37.5 | 36.9 | 62.9 | 63.6 | 81.1 | 35.7 | 64.3 | 43.9 | 80.4 | 71.6 | 74.0 | 60.0 | 30.8 | 63.4 | 52.0 | 63.5 | 58.7 | 25-Oct-2014
YOLO | 57.9 | 77.0 | 67.2 | 57.7 | 38.3 | 22.7 | 68.3 | 55.9 | 81.4 | 36.2 | 60.8 | 48.5 | 77.2 | 72.3 | 71.3 | 63.5 | 28.9 | 52.2 | 54.8 | 73.9 | 50.8 | 06-Nov-2015
Feature Edit | 56.3 | 74.6 | 69.1 | 54.4 | 39.1 | 33.1 | 65.2 | 62.7 | 69.7 | 30.8 | 56.0 | 44.6 | 70.0 | 64.4 | 71.1 | 60.2 | 33.3 | 61.3 | 46.4 | 61.7 | 57.8 | 06-Sep-2014
R-CNN (bbox reg) | 53.3 | 71.8 | 65.8 | 52.0 | 34.1 | 32.6 | 59.6 | 60.0 | 69.8 | 27.6 | 52.0 | 41.7 | 69.6 | 61.3 | 68.3 | 57.8 | 29.6 | 57.8 | 40.9 | 59.3 | 54.1 | 13-Mar-2014
SDS | 50.7 | 69.7 | 58.4 | 48.5 | 28.3 | 28.8 | 61.3 | 57.5 | 70.8 | 24.1 | 50.7 | 35.9 | 64.9 | 59.1 | 65.8 | 57.1 | 26.0 | 58.8 | 38.6 | 58.9 | 50.7 | 21-Jul-2014
R-CNN | 49.6 | 68.1 | 63.8 | 46.1 | 29.4 | 27.9 | 56.6 | 57.0 | 65.9 | 26.5 | 48.7 | 39.5 | 66.2 | 57.3 | 65.4 | 53.2 | 26.2 | 54.5 | 38.1 | 50.6 | 51.6 | 30-Jan-2014
Poselets2 | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | 58.7 | - | - | - | - | - | 06-Jun-2014
Geometric shape | - | - | 3.8 | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | 19-Jun-2016
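The per-class AP values above come from ranking a method's detections by confidence and integrating precision over recall. A minimal sketch of this computation (a schematic of the later VOC-style area-under-PR-curve AP, not the organisers' exact evaluation code):

```python
def average_precision(scores, is_tp, num_gt):
    """AP for one class: rank detections by confidence, sweep the
    ranked list accumulating precision/recall, make the precision
    envelope monotone, and integrate it over recall.
    is_tp[i]: whether detection i matched an unclaimed ground-truth
    box (IoU > 0.5); num_gt: ground-truth objects of this class."""
    order = sorted(range(len(scores)), key=lambda i: -scores[i])
    tp = fp = 0
    prec, rec = [], []
    for i in order:
        if is_tp[i]:
            tp += 1
        else:
            fp += 1
        prec.append(tp / (tp + fp))
        rec.append(tp / num_gt)
    # Interpolation: precision at recall r = max precision at any recall >= r.
    for i in range(len(prec) - 2, -1, -1):
        prec[i] = max(prec[i], prec[i + 1])
    ap, prev_r = 0.0, 0.0
    for p, r in zip(prec, rec):
        ap += p * (r - prev_r)
        prev_r = r
    return ap
```

The "mean" column is simply this AP averaged over the 20 classes.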

Abbreviations

Title | Method | Affiliation | Contributors | Description | Date
Computational Baby Learning | BabyLearning | National University of Singapore | Xiaodan Liang, Si Liu, Yunchao Wei, Luoqi Liu, Liang Lin, Shuicheng Yan | This entry is an implementation of the framework described in "Computational Baby Learning" (http://arxiv.org/abs/1411.2861). We build a computational model to interpret and mimic the baby learning process, based on prior knowledge modelling, exemplar learning, and learning with video contexts. Training data: (1) We used only two positive instances along with ~20,000 unlabelled videos to train the detector for each object category. (2) We used data from ILSVRC 2012 to pre-train the Network in Network [1] and fine-tuned the network with our newly mined instances. [1] Min Lin, Qiang Chen, Shuicheng Yan. Network In Network. In ICLR 2014. | 2014-11-12 03:50:50
Fully conv net for segmentation and detection | BlitzNet | Inria | Nikita Dvornik, Konstantin Shmelkov, Julien Mairal, Cordelia Schmid | CNN for joint segmentation and detection (based on SSD). Input resolution 512. Trained on VOC07 trainval + VOC12 trainval. | 2017-03-17 18:22:43
Fully conv net for segmentation and detection | BlitzNet | Inria | Nikita Dvornik, Konstantin Shmelkov, Julien Mairal, Cordelia Schmid | CNN for joint segmentation and detection (based on SSD). Input resolution 300. Trained on VOC07 trainval + VOC12 trainval. | 2017-03-17 18:24:29
FCN | BlitzNet300 | INRIA | Nikita Dvornik, Konstantin Shmelkov, Julien Mairal, Cordelia Schmid | CNN for joint segmentation and detection (based on SSD). Input resolution 300. Runs at 24 FPS. Trained on VOC07 trainval + VOC12 trainval, pretrained on COCO. | 2017-07-19 13:57:45
FCN | BlitzNet512 | INRIA | Nikita Dvornik, Konstantin Shmelkov, Julien Mairal, Cordelia Schmid | CNN for joint segmentation and detection (based on SSD). Input resolution 512. Runs at 19 FPS. Trained on VOC07 trainval + VOC12 trainval, pretrained on COCO. | 2017-07-19 13:38:53
DDT augmentation | DDT augmentation based on web images | Nanjing University, The University of Adelaide | Xiu-Shen Wei, Chen-Lin Zhang, Jianxin Wu, Chunhua Shen, Zhi-Hua Zhou | This entry is based on Faster RCNN and our web-based object detection dataset (WebVOC [R1]) as an external dataset. For WebVOC, we first collect web images from the Internet via Google using the categories of PASCAL VOC. In total, we collect 12,776 noisy web images, a scale similar to the original PASCAL VOC dataset. We then employ our Deep Descriptor Transforming (DDT) method [R1] to remove the noisy images and automatically annotate object bounding boxes; 10,081 images with their automatically generated boxes remain as valid images. For training detection models, we first fine-tune VGG-16 on WebVOC; the WebVOC fine-tuned model is then used for the VOC task. The VOC training data is VOC 2007 trainval, test and VOC 2012 trainval. [R1] Xiu-Shen Wei, Chen-Lin Zhang, Jianxin Wu, Chunhua Shen, Zhi-Hua Zhou. Unsupervised Object Discovery and Co-Localization by Deep Descriptor Transforming, arXiv:1707.06397, 2017 | 2017-07-26 10:55:14
An Ensemble of CNNs with COCO Augmentation | DEEP_ENSEMBLE_COCO | Australian National University (ANU) | Jian (Edison) Guo, Stephen Gould | We mainly follow the RCNN pipeline, with the following innovations. 1) We trained an ensemble of CNNs for feature extraction; the ensemble consists of GoogleNet and VGG-16 networks trained on different subsets of PASCAL VOC 2007/2012 and COCO. 2) We trained an ensemble of one-vs-all SVMs and bounding-box regressors corresponding to each model of the CNN ensemble. 3) We averaged the SVM scores across the ensemble and sent the averaged scores through the post-processing pipeline to obtain the indices of the selective search boxes retained after post-processing. 4) With the box indices, we ran box regression for each box for each model in the ensemble and then averaged the boxes across the ensemble to obtain the final results. (Please see http://arxiv.org/abs/1506.07224) | 2015-05-03 15:40:02
Learning DSOD from Scratch | DSOD300 | Intel Labs China | Zhiqiang Shen, Jianguo Li, Zhuang Liu, Yurong Chen, Yu-Gang Jiang, Xiangyang Xue, Thomas Huang | We train DSOD for object detection. The training data is VOC 2007 trainval, test and VOC 2012 trainval, without ImageNet pre-trained models. The input image size is 300x300. | 2017-03-17 00:42:36
Learning DSOD from Scratch | DSOD300+ | Intel Labs China | Zhiqiang Shen, Jianguo Li, Zhuang Liu, Yurong Chen, Yu-Gang Jiang, Xiangyang Xue, Thomas Huang | We train DSOD for object detection. The training data is VOC 2007 trainval, test, VOC 2012 trainval and MS COCO, without ImageNet pre-trained models. The input image size is 300x300. | 2017-03-16 23:06:59
DSSD513 ResNet-101 07++12 | DSSD513_ResNet101_07++12 | UNC Chapel Hill, Amazon | Cheng-Yang Fu*, Wei Liu*, Ananth Ranga, Ambrish Tyagi, Alexander C. Berg (* equal contribution) | We first train an SSD513 model using ResNet-101 on VOC07 trainval + test and VOC12 trainval for the 20 PASCAL classes. We then use that SSD513 as the pre-trained model to train DSSD513 on the same training data. We test only a single model on a single image scale (513x513), with no post-processing steps. Details can be found at: https://arxiv.org/abs/1701.06659 | 2017-02-15 18:02:47
Deformable R-FCN, ResNet-101 (VOC+COCO) | Deformable R-FCN, ResNet-101 (VOC+COCO) | Microsoft Research Asia | Jifeng Dai, Haozhi Qi, Yuwen Xiong, Yi Li, Guodong Zhang, Han Hu, Yichen Wei | This entry is based on Deformable Convolutional Networks [a], R-FCN [b] and ResNet-101 [c]. The model is pre-trained on the 1000-class ImageNet classification training set, fine-tuned on the MS COCO trainval set, and then fine-tuned on the VOC 2007 trainval+test and VOC 2012 trainval sets. OHEM and multi-scale training are applied to our model. Multi-scale testing and horizontal flipping are applied during inference. [a] "Deformable Convolutional Networks", Jifeng Dai, Haozhi Qi, Yuwen Xiong, Yi Li, Guodong Zhang, Han Hu, Yichen Wei (https://arxiv.org/abs/1703.06211) [b] "R-FCN: Object Detection via Region-based Fully Convolutional Networks", Jifeng Dai, Yi Li, Kaiming He, Jian Sun (http://arxiv.org/abs/1605.06409) [c] "Deep Residual Learning for Image Recognition", Kaiming He, Xiangyu Zhang, Shaoqing Ren, Jian Sun (https://arxiv.org/abs/1512.03385) | 2017-03-23 03:46:36
Fast R-CNN with YOLO Rescoring | Fast R-CNN + YOLO | University of Washington | Joseph Redmon, Santosh Divvala, Ross Girshick, Ali Farhadi | We use the YOLO detection method to rescore the bounding boxes from Fast R-CNN. This helps mitigate false background detections and improves overall performance. For more information and example code see: http://pjreddie.com/darknet/yolo/ | 2015-11-06 08:03:59
Fast R-CNN VGG16 extra data | Fast R-CNN VGG16 extra data | Microsoft Research | Ross Girshick | Fast R-CNN is a new algorithm for training R-CNNs. The training process is a single fine-tuning run that jointly trains for softmax classification and bounding-box regression. Training took ~22 hours on a single GPU and testing takes ~330 ms/image. A tech report describing the method is forthcoming; open source code will be released. This entry was trained on VOC 2012 train+val union with VOC 2007 train+val+test. | 2015-04-17 17:32:25
Faster RCNN baseline (VOC+COCO) | Faster RCNN baseline (VOC+COCO) | Microsoft Research | Shaoqing Ren, Kaiming He, Ross Girshick, Jian Sun | This entry is a baseline implementation of the system described in "Faster R-CNN: Towards Real-Time Object Detection with Region Proposal Networks" (arXiv 2015). We use an ImageNet-pre-trained model (VGG-16) and fine-tune it on the COCO trainval detection task. The COCO fine-tuned model is then used for the VOC task. The VOC training data is VOC 2007 trainval, test and VOC 2012 trainval. The entire system takes <200 ms per image, including proposal and detection. | 2015-11-24 03:56:56
Faster RCNN, ResNet (VOC+COCO) | Faster RCNN, ResNet (VOC+COCO) | Microsoft Research | Shaoqing Ren, Xiangyu Zhang, Kaiming He, Jian Sun | This entry is based on an improved Faster R-CNN system [a] and an extremely deep Residual Net [b] with a depth of over 100 layers. The model is pre-trained on the 1000-class ImageNet classification training set, fine-tuned on the MS COCO trainval set, and fine-tuned on the VOC 2007 trainval+test and VOC 2012 trainval sets. [a] "Faster R-CNN: Towards Real-Time Object Detection with Region Proposal Networks", Shaoqing Ren, Kaiming He, Ross Girshick, Jian Sun. NIPS 2015. [b] "Deep Residual Learning for Image Recognition", Kaiming He, Xiangyu Zhang, Shaoqing Ren, Jian Sun. Tech Report 2015. | 2015-12-10 14:47:49
FasterRCNN | FasterRCNN | FasterRCNN | FasterRCNN | FasterRCNN | 2017-07-23 13:38:24
FasterRCNN | FasterRCNN | FasterRCNN | FasterRCNN | FasterRCNN | 2017-07-23 13:42:44
FasterRcnn-ResNeXt101(COCO+07++12, single model) | FasterRcnn-ResNeXt101(COCO+07++12, single model) | Beijing University of Posts and Telecommunications (BUPT-PRIV) | Lu Yang, Qing Song, Zhihui Wang, Min Yang | Our network is based on ResNeXt101-32x4d and Faster R-CNN; multi-scale training, multi-scale testing and image flipping are applied in this submission. We first train our network on the COCO and VOC0712trainval sets, then fine-tune on the VOC07trainvaltest and VOC12trainval sets. | 2017-05-04 10:57:08
Diamond Frame Bicycle Recognition | Geometric shape | National Cheng Kung University | Chung-Ping Young, Yen-Bor Lin, Kuan-Yu Chen | A detector for diamond-frame bicycles in side-view images is proposed, based on the observation that a bicycle consists of two wheels in the form of ellipses and a frame in the form of two triangles. Through the design of geometric constraints on the relationship between the triangles and ellipses, the computation is fast compared with feature-based classifiers. Moreover, no training process is necessary, and only a single image is required for our algorithm. Experimental results are given to show the practicability and performance of the proposed bicycle model and detection algorithm. | 2016-06-19 10:06:33
Hierarchical Feature Model | HFM_VGG16 | Inha University | Byungjae Lee, Enkhbayar Erdenee, Sungyul Kim, Phill Kyu Rhee | We are motivated by the observations that many object detectors degrade in performance due to inter-class ambiguities and intra-class appearance variations, and that deep features extracted from visual objects show a strong hierarchical clustering property. We partition the deep features into unsupervised super-categories at the inter-class level and augmented categories at the object level to discover deep-feature-driven knowledge. We build a Hierarchical Feature Model (HFM) using the Latent Topic Model (LTM) algorithm, ensemble one-versus-all SVMs at each node, and constitute a hierarchical classification ensemble (HCE). In the detection phase, object categorization and localization are processed based on the hypotheses of the HCE with a hierarchical mechanism. | 2016-03-21 10:59:33
Faster R-CNN with cascade RPN and global context | HIK_FRCN | Hikvision Research Institute | Qiaoyong Zhong, Chao Li, Yingying Zhang, Di Xie, Shiliang Pu | Our work on object detection is based on Faster R-CNN. We design and validate the following improvements: * Better network: we find that the identity-mapping variant of ResNet-101 is superior for object detection over the original version. * Better RPN proposals: a novel cascade RPN is proposed to refine proposals' scores and locations; a constrained neg/pos anchor ratio further increases proposal recall dramatically. * Pretraining matters: we find that a pretrained global context branch increases mAP by over 3 points. * Training strategies: to attack the imbalance problem, we design a balanced sampling strategy over different classes; other training strategies, like multi-scale training and online hard example mining, are also applied. * Testing strategies: during inference, multi-scale testing, horizontal flipping and weighted box voting are applied. Based on an ImageNet DET pretrained model, we first finetune on the COCO+VOC dataset, then finetune on the VOC dataset only. | 2016-09-19 05:50:00
HyperNet_SP | HyperNet_SP | Intel Labs China | Tao Kong, Anbang Yao, Yurong Chen, Fuchun Sun | We train HyperNet for object detection. An ImageNet-pre-trained model (VGG-16) is used for training HyperNet, both for proposal and detection. The training data is VOC 2007 trainval, test and VOC 2012 trainval. The proposal number is 100 per image. This is a sped-up version of the basic HyperNet: we move the 3x3x4 convolutional layer to the front of the ROI pooling layer. This slight change has two advantages: (a) the channel number of the Hyper Feature maps is significantly reduced (from 126 to 4); (b) the sliding-window classifier is simpler (from Conv-FC to FC). Both characteristics speed up the proposal generation process. The speed is 5 fps using VGG16. | 2015-10-28 07:36:14
HyperNet_VGG16 | HyperNet_VGG | Intel Labs China | Tao Kong, Anbang Yao, Yurong Chen, Fuchun Sun | We train HyperNet for object detection. An ImageNet-pre-trained model (VGG-16) is used for training HyperNet, both for proposal and detection. The training data is VOC 2007 trainval, test and VOC 2012 trainval. The proposal number is 100 per image. | 2015-10-12 02:52:03
Implicit+Sink+Dilation | ICT_360_ISD | Institute of Computing Technology, Chinese Academy of Science | Yu Li, Min Lin, Sheng Tang, Shuicheng Yan | We update the method before. | 2016-11-18 03:34:32
Improved Feature RCNN | IFRN_07+12 | Tsinghua MIG | Haofeng Zou, Guiguang Ding | Adds improved global & local features to RCNN and uses an iterative detection method. | 2016-06-07 07:47:00
Inside-Outside Net | ION | Cornell University | Sean Bell, Larry Zitnick, Kavita Bala, Ross Girshick | Our "Inside-Outside Net" (ION) detector will be described soon in an arXiv submission. The method is based on Fast R-CNN with VGG16 and was trained on VOC 2012 train+val union VOC 2007 train+val (not VOC 2007 test), as well as the segmentations from SDS (Simultaneous Detection and Segmentation) on the training set images. We use the selective search boxes published with Fast R-CNN. Runtime: ~1.15 s/image on a Titan X GPU (excluding proposal generation). | 2015-11-23 04:37:20
Improving Localization Accuracy for Object Detection | LocNet | ENPC | Spyros Gidaris, Nikos Komodakis | We propose a novel object localization methodology with the purpose of boosting the localization accuracy of state-of-the-art object detection systems. Our model, given a search region, aims at returning the bounding box of an object of interest inside this region. To accomplish its goal, it relies on assigning conditional probabilities to each row and column of this region, where these probabilities provide useful information regarding the location of the boundaries of the object inside the search region and allow the accurate inference of the object bounding box under a simple probabilistic framework. For implementing our localization model, we make use of a convolutional neural network architecture that is properly adapted for this task, called LocNet. We show experimentally that LocNet achieves a very significant improvement in mAP for high IoU thresholds on the PASCAL VOC2007 test set and that it can very easily be coupled with recent state-of-the-art object detection systems, helping them boost their performance. | 2015-11-06 22:59:43
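Several entries (LocNet above, the evaluation protocol itself) lean on intersection-over-union between boxes; a minimal sketch of that measure, with boxes given as hypothetical (x1, y1, x2, y2) corner tuples:

```python
def iou(box_a, box_b):
    """Intersection-over-union of two axis-aligned boxes (x1, y1, x2, y2)."""
    ix1, iy1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    ix2, iy2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    return inter / (area_a + area_b - inter)
```

The VOC detection criterion counts a detection as correct when its IoU with a ground-truth box exceeds 0.5; "high IoU thresholds" in the LocNet entry refers to tightening that threshold.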
FRCN with multi-level feature and global context | MCC_FRCN, ResNet101, 07++12 | Harbin Institute of Technology Shenzhen Graduate School | Wang Yuan, You Lei | Our work is based on Faster R-CNN and ResNet101. (1) The low-level features are down-sampled using a convolution layer (stride 2), adjusted to the same size as the high-level features, and then merged for proposal and detection. (2) The context features are extracted from the entire image's feature maps using an ROI pooling layer, and then merged with the region's feature maps. (3) Weighted box voting is applied. The model is pre-trained on the 1000-class ImageNet classification training set and fine-tuned on the VOC 2007 trainval+test and VOC 2012 trainval sets only. | 2016-11-21 03:34:12
Multi-task Network Cascades | MNC baseline | Microsoft Research Asia | Jifeng Dai, Kaiming He, Jian Sun | Our Multi-task Network Cascades (MNCs) are described in the arXiv paper "Multi-task Network Cascades for Instance-aware Semantic Segmentation" (http://arxiv.org/abs/1512.04412). The entry is based on MNCs and the VGG-16 net. The training data is VOC 2007 trainval, test, and VOC 2012 trainval, augmented with the segmentation annotations from SBD ("Semantic contours from inverse detectors"). The overall runtime is 0.36 sec/image on a K40 GPU. | 2015-12-15 14:06:18
Multi-Region & Semantic Segmentation-Aware CNN | MR_CNN_S_CNN | Universite Paris Est, Ecole des Ponts ParisTech | Spyros Gidaris, Nikos Komodakis | This entry is an implementation of the system described in "Object detection via a multi-region & semantic segmentation-aware CNN model" (http://arxiv.org/abs/1505.01749). The training data used for this entry are: 1) ImageNet for pre-training (of the 16-layer VGG-Net), 2) the VOC2012 train set for fine-tuning the deep models, and 3) VOC2012 train+val for training the detection SVMs. Abstract: "We propose an object detection system that relies on a multi-region deep convolutional neural network (CNN) that also encodes semantic segmentation-aware features. The resulting CNN-based representation aims at capturing a diverse set of discriminative appearance factors and exhibits localization sensitivity that is essential for accurate object localization. We exploit the above properties of our recognition module by integrating it on an iterative localization mechanism that alternates between scoring a box proposal and refining its location with a deep CNN regression model." | 2015-05-09 23:15:56
Multi-Region & Semantic Segmentation-Aware CNN | MR_CNN_S_CNN_MORE_DATA | Universite Paris Est, Ecole des Ponts ParisTech | Spyros Gidaris, Nikos Komodakis | This entry is an implementation of the system described in "Object detection via a multi-region & semantic segmentation-aware CNN model" (http://arxiv.org/abs/1505.01749). The training data used for this entry are: 1) ImageNet for pre-training (of the 16-layer VGG-Net), 2) the VOC2007 train+val and VOC2012 train+val sets for fine-tuning the deep models and training the detection SVMs. Abstract: "We propose an object detection system that relies on a multi-region deep convolutional neural network (CNN) that also encodes semantic segmentation-aware features. The resulting CNN-based representation aims at capturing a diverse set of discriminative appearance factors and exhibits localization sensitivity that is essential for accurate object localization. We exploit the above properties of our recognition module by integrating it on an iterative localization mechanism that alternates between scoring a box proposal and refining its location with a deep CNN regression model." | 2015-06-06 15:49:11
The NIN extension of RCNN | NUS_NIN | NUS | Jian Dong, Qiang Chen, Min Lin, Shuicheng Yan | The entry is based on Ross Girshick's RCNN framework. We employ a single Network in Network [1] as the feature extractor to improve the model's discriminative capability. We follow Girshick's RCNN protocol for training: (1) we used data from ILSVRC 2012 to pre-train the ConvNet (using caffe); (2) we fine-tuned the resulting ConvNet using 2012 trainval; (3) we trained object detector SVMs using 2012 trainval. This entry is used as the baseline for the journal version of [2]. [1] Min Lin, Qiang Chen, Shuicheng Yan. Network In Network. In ICLR 2014. [2] Jian Dong, Qiang Chen, Min Lin, Shuicheng Yan, Alan Yuille: Towards Unified Object Detection and Semantic Segmentation. | 2014-10-30 15:47:28
The NIN extension of RCNN | NUS_NIN_c2000 | NUS | Jian Dong, Qiang Chen, Min Lin, Shuicheng Yan | The entry is based on Ross Girshick's RCNN framework. We employ a single Network in Network [1] as the feature extractor to improve the model's discriminative capability. We follow Girshick's RCNN protocol for training: (1) we used data from ILSVRC 2012 + 1000 extra categories of ImageNet to pre-train the ConvNet (using caffe); (2) we fine-tuned the resulting ConvNet using 2012 trainval; (3) we trained object detector SVMs using 2012 trainval. This entry is used as the baseline for the journal version of [2]. [1] Min Lin, Qiang Chen, Shuicheng Yan. Network In Network. In ICLR 2014. [2] Jian Dong, Qiang Chen, Min Lin, Shuicheng Yan, Alan Yuille: Towards Unified Object Detection and Semantic Segmentation. | 2014-10-30 15:45:29
NoC | Networks on Convolutional Feature Maps | Microsoft Research | Shaoqing Ren, Kaiming He, Ross Girshick, Xiangyu Zhang, Jian Sun | This entry is an implementation of the system described in "Object Detection Networks on Convolutional Feature Maps" (http://arxiv.org/abs/1504.06066). We train a "Network on Convolutional feature maps" (NoC) for fast and accurate object detection. Training data for this entry include: (i) ImageNet data for pre-training (VGG-16); (ii) VOC 2007 trainval and 2012 trainval for training the NoC on pooled region features. Selective Search and EdgeBoxes are used for proposals. | 2015-04-17 17:21:10
Online Hard Example Mining for Fast R-CNN (VGG16) | OHEM+FRCN, VGG16 | Carnegie Mellon University, Facebook AI Research | Abhinav Shrivastava, Abhinav Gupta, Ross Girshick | We propose an online hard example mining (OHEM) algorithm to train region-based ConvNet detectors. This entry uses OHEM to train the Fast R-CNN (FRCN) object detection system. We use an ImageNet pre-trained VGG16 model and fine-tune it on the VOC 2012 trainval dataset. For more details, please refer to 'Training Region-based Object Detectors with Online Hard Example Mining', CVPR 2016 (http://arxiv.org/abs/1604.03540). | 2016-04-18 05:16:35
Online Hard Example Mining for Fast R-CNN (VGG16) | OHEM+FRCN, VGG16, VOC+COCO | Carnegie Mellon University, Facebook AI Research | Abhinav Shrivastava, Abhinav Gupta, Ross Girshick | We propose an online hard example mining (OHEM) algorithm to train region-based ConvNet detectors. This entry uses OHEM to train the Fast R-CNN (FRCN) object detection system. We use an ImageNet pre-trained VGG16 model, use OHEM to fine-tune on the COCO trainval set, and further fine-tune on the VOC 2012 trainval, VOC 2007 trainval and VOC 2007 test datasets. For more details, please refer to 'Training Region-based Object Detectors with Online Hard Example Mining', CVPR 2016 (http://arxiv.org/abs/1604.03540). | 2016-04-18 05:18:28
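The core selection step of OHEM described in these entries can be sketched as follows; this is a schematic with hypothetical per-RoI loss values standing in for a real forward pass, not the authors' code:

```python
def select_hard_examples(roi_losses, batch_size):
    """OHEM selection: given the current loss of every region proposal
    in an image, keep the batch_size proposals with the highest loss;
    only those are used in the backward pass."""
    order = sorted(range(len(roi_losses)), key=lambda i: -roi_losses[i])
    return order[:batch_size]
```

In the full algorithm this selection is repeated online at every SGD iteration, so the mini-batch always consists of the currently hardest examples.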
PLN | PLN | XXXX | Kaibing Chen, Xinggang Wang, Zilong Huang | Point Linking Network, trained only on the PASCAL VOC 07++12 dataset. | 2017-03-27 07:53:57
Faster R-CNN with PVANet (VOC+COCO) | PVANet+ | Intel Imaging and Camera Technology | Sanghoon Hong, Byungseok Roh, Kye-Hyeon Kim, Yeongjae Cheon, Minje Park | Based on Faster R-CNN with a network designed from scratch. The network is designed for efficiency and takes less than 50 ms including proposal generation and detection (tested with 200 proposals on a Titan X). The network is pre-trained with the ImageNet classification training set and fine-tuned with the VOC2007/2012/MSCOCO trainval sets and the VOC2007 test set. Only single-scale images are used for testing. Please refer to "PVANet: Lightweight Deep Neural Networks for Real-time Object Detection" (https://arxiv.org/abs/1611.08588) and https://github.com/sanghoon/pva-faster-rcnn for more details. | 2016-10-26 09:25:07
Faster R-CNN with PVANet (VOC+COCO) | PVANet+ (compressed) | Intel Imaging and Camera Technology | Sanghoon Hong, Byungseok Roh, Kye-Hyeon Kim, Yeongjae Cheon, Minje Park | Based on Faster R-CNN with a network designed from scratch. The network is designed for efficiency and takes only 32 ms (30 fps) including proposal generation and detection (tested with 200 proposals on a Titan X). The network is pre-trained with the ImageNet classification training set and fine-tuned with the VOC2007/2012/MSCOCO trainval sets and the VOC2007 test set. Only single-scale images are used for testing. Please refer to "PVANet: Lightweight Deep Neural Networks for Real-time Object Detection" (https://arxiv.org/abs/1611.08588) and https://github.com/sanghoon/pva-faster-rcnn for more details. | 2016-11-18 07:05:29
Region-based CNN | R-CNN | UC Berkeley | Ross Girshick, Jeff Donahue, Trevor Darrell, Jitendra Malik | This entry is an implementation of the system described in "Rich feature hierarchies for accurate object detection and semantic segmentation" (http://arxiv.org/abs/1311.2524, version 5). Code is available at http://www.cs.berkeley.edu/~rbg/. Training data: (1) we used ILSVRC 2012 to pre-train the ConvNet (using Caffe); (2) we fine-tuned the resulting ConvNet on 2012 trainval; (3) we trained object-detector SVMs on 2012 trainval. The same detection SVMs were used for the 2012 and 2010 results. For this submission, we used the 16-layer ConvNet of Simonyan & Zisserman instead of Krizhevsky et al.'s ConvNet. | 2014-10-25 21:09:52
Regions with Convolutional Neural Network Features | R-CNN | UC Berkeley | Ross Girshick, Jeff Donahue, Trevor Darrell, Jitendra Malik | This entry is an implementation of the system described in "Rich feature hierarchies for accurate object detection and semantic segmentation" (http://arxiv.org/abs/1311.2524). We made two small changes relative to the arXiv tech report that are responsible for the improved performance: (1) we added a small amount of context around each region proposal (16 px at the warped size), and (2) we used a higher learning rate while fine-tuning (starting at 0.001). Aside from non-maximum suppression, no additional post-processing (e.g., detector or image-classification context) was applied. Code will be made available soon at http://www.cs.berkeley.edu/~rbg/. Training data: (1) we used ILSVRC 2012 to pre-train the ConvNet (using Caffe); (2) we fine-tuned the resulting ConvNet on 2012 train; (3) we trained object-detector SVMs on 2012 train+val. The same detection SVMs were used for the 2012 and 2010 results. | 2014-01-30 01:46:58
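The entry above notes that non-maximum suppression is the only post-processing applied. Greedy NMS, as typically used in R-CNN-style detectors, can be sketched as follows (a generic implementation, not the authors' code; boxes are [x1, y1, x2, y2]):

```python
import numpy as np

def nms(boxes, scores, iou_thresh=0.3):
    """Greedy non-maximum suppression: repeatedly keep the highest-scoring
    box and discard remaining boxes that overlap it above `iou_thresh`."""
    order = np.argsort(scores)[::-1]  # indices by descending score
    keep = []
    while order.size > 0:
        i = order[0]
        keep.append(i)
        # Intersection of the kept box with the remaining candidates.
        xx1 = np.maximum(boxes[i, 0], boxes[order[1:], 0])
        yy1 = np.maximum(boxes[i, 1], boxes[order[1:], 1])
        xx2 = np.minimum(boxes[i, 2], boxes[order[1:], 2])
        yy2 = np.minimum(boxes[i, 3], boxes[order[1:], 3])
        inter = np.maximum(0, xx2 - xx1) * np.maximum(0, yy2 - yy1)
        area_i = (boxes[i, 2] - boxes[i, 0]) * (boxes[i, 3] - boxes[i, 1])
        areas = (boxes[order[1:], 2] - boxes[order[1:], 0]) * \
                (boxes[order[1:], 3] - boxes[order[1:], 1])
        iou = inter / (area_i + areas - inter)
        order = order[1:][iou <= iou_thresh]  # survivors
    return keep
```

With two heavily overlapping boxes and one disjoint box, the lower-scored duplicate is suppressed and the disjoint box survives.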
Region-based CNN | R-CNN (bbox reg) | UC Berkeley | Ross Girshick, Jeff Donahue, Trevor Darrell, Jitendra Malik | This entry is an implementation of the system described in "Rich feature hierarchies for accurate object detection and semantic segmentation" (http://arxiv.org/abs/1311.2524, version 5). Code is available at http://www.cs.berkeley.edu/~rbg/. Training data: (1) we used ILSVRC 2012 to pre-train the ConvNet (using Caffe); (2) we fine-tuned the resulting ConvNet on 2012 trainval; (3) we trained object-detector SVMs on 2012 trainval. The same detection SVMs were used for the 2012 and 2010 results. For this submission, we used the 16-layer ConvNet of Simonyan & Zisserman instead of Krizhevsky et al.'s ConvNet. | 2014-10-26 03:29:27
Regions with Convolutional Neural Network Features | R-CNN (bbox reg) | UC Berkeley | Ross Girshick, Jeff Donahue, Trevor Darrell, Jitendra Malik | This entry is an implementation of the system described in "Rich feature hierarchies for accurate object detection and semantic segmentation" (http://arxiv.org/abs/1311.2524). We made two small changes relative to the arXiv tech report that are responsible for the improved performance: (1) we added a small amount of context around each region proposal (16 px at the warped size), and (2) we used a higher learning rate while fine-tuning (starting at 0.001). Aside from non-maximum suppression, no additional post-processing (e.g., detector or image-classification context) was applied. Code will be made available soon at http://www.cs.berkeley.edu/~rbg/. Training data: (1) we used ILSVRC 2012 to pre-train the ConvNet (using Caffe); (2) we fine-tuned the resulting ConvNet on 2012 train; (3) we trained object-detector SVMs on 2012 train+val. The same detection SVMs were used for the 2012 and 2010 results. This submission includes a simple regression from pool5 features to bounding-box coordinates. | 2014-03-13 18:08:18
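The pool5-to-box regression mentioned above predicts offsets in the standard R-CNN parameterization: centre translations normalized by the proposal size, and log-space width/height scalings. A sketch of the regression-target computation (variable names are illustrative):

```python
import math

def bbox_targets(proposal, gt):
    """R-CNN-style bounding-box regression targets (tx, ty, tw, th)
    mapping a proposal box onto a ground-truth box. Both boxes are
    (cx, cy, w, h) with (cx, cy) the box centre."""
    px, py, pw, ph = proposal
    gx, gy, gw, gh = gt
    tx = (gx - px) / pw        # centre shift, in proposal widths
    ty = (gy - py) / ph        # centre shift, in proposal heights
    tw = math.log(gw / pw)     # log-space width scaling
    th = math.log(gh / ph)     # log-space height scaling
    return tx, ty, tw, th
```

A regressor trained on these targets can then refine each detection at test time by inverting the transform.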
R-FCN, ResNet (VOC+COCO) | R-FCN, ResNet (VOC+COCO) | Microsoft Research | Haozhi Qi*, Yi Li*, Jifeng Dai* (* equal contribution) | This entry is based on R-FCN [a] and ResNet-101. The model is pre-trained on the 1000-class ImageNet classification training set, fine-tuned on the MS COCO trainval set, and then fine-tuned on the VOC 2007 trainval+test and VOC 2012 trainval sets. OHEM and multi-scale training are applied to our model. Multi-scale testing and horizontal flipping are applied during inference. [a] "R-FCN: Object Detection via Region-based Fully Convolutional Networks", Jifeng Dai, Yi Li, Kaiming He, Jian Sun (http://arxiv.org/abs/1605.06409). | 2016-10-09 08:33:08
R-FCN, ResNet Ensemble (VOC+COCO) | R-FCN, ResNet Ensemble (VOC+COCO) | Microsoft Research | Haozhi Qi*, Yi Li*, Jifeng Dai* (* equal contribution) | This entry is based on R-FCN [a] and ResNet models. We use an ensemble of R-FCN models pre-trained on the 1000-class ImageNet classification training set, fine-tuned on the MS COCO trainval set, and then fine-tuned on the VOC 2007 trainval+test and VOC 2012 trainval sets. OHEM and multi-scale training are applied to our models. Multi-scale testing and horizontal flipping are applied during inference. [a] "R-FCN: Object Detection via Region-based Fully Convolutional Networks", Jifeng Dai, Yi Li, Kaiming He, Jian Sun (http://arxiv.org/abs/1605.06409). | 2016-10-09 08:45:02
RFCN_DCN | RFCN_DCN | XXX | tester | RFCN_DCN | 2017-06-27 12:55:51
Region Proposal Network | RPN | Microsoft Research | Shaoqing Ren, Kaiming He, Ross Girshick, Jian Sun | This entry is an implementation of the system described in "Faster R-CNN: Towards Real-Time Object Detection with Region Proposal Networks" (arXiv 2015). An ImageNet-pre-trained model (VGG-16) is used to train a Region Proposal Network (RPN) and a Fast R-CNN detector. The training data is VOC 2007 trainval and test, and VOC 2012 trainval. The entire system takes <200 ms per image, including proposal generation and detection. | 2015-06-01 10:29:23
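An RPN predicts objectness scores and box refinements relative to a fixed set of anchors at every feature-map position; the Faster R-CNN paper uses 3 scales x 3 aspect ratios, i.e. 9 anchors per position. A sketch of anchor generation for a single cell under those defaults (our code, not the authors'):

```python
import numpy as np

def make_anchors(base_size=16, ratios=(0.5, 1.0, 2.0), scales=(8, 16, 32)):
    """Generate anchors centred on one feature-map cell as [x1, y1, x2, y2].
    `ratio` is height/width; each anchor preserves the area base_size*scale
    squared. Defaults follow Faster R-CNN's 3 ratios x 3 scales."""
    cx = cy = base_size / 2.0
    anchors = []
    for ratio in ratios:
        for scale in scales:
            area = (base_size * scale) ** 2
            w = np.sqrt(area / ratio)
            h = w * ratio
            anchors.append([cx - w / 2, cy - h / 2, cx + w / 2, cy + h / 2])
    return np.array(anchors)
```

The full anchor grid is obtained by shifting this set by the feature-map stride across all positions.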
RUN300 3WAY, VGG16, 07++12 | RUN300 3WAY, VGG16, 07++12 | Seoul National University | Kyoungmin Lee, Jaeseok Choi, Jisoo Jeong, Nojun Kwak | We focus on resolving a structural contradiction in, and enhancing the contextual information of, the multi-scale feature maps. We propose a network, based on SSD, that uses ResBlocks and deconvolution layers to enrich the representational power of the feature maps. In addition, a unified prediction module is applied to generalize the output results. Inference takes 15.6 ms on a Titan X Pascal GPU, so the method retains the fast computation of a single-stage detector. (https://arxiv.org/abs/1707.05031) | 2017-07-18 02:02:47
Res101+FasterRCNN | Res101+FasterRCNN (COCO+0712trainval) | Meitu | Kang Yang | ResNet-101 + Faster R-CNN, trained on COCO, fine-tuned on VOC 2007 trainval + VOC 2012 trainval, and tested on VOC 2012 test. | 2017-02-05 03:16:39
Res101+hyper+FasterRCNN (COCO+0712trainval) | Res101+hyper+FasterRCNN (COCO+0712trainval) | Meitu | Kang Yang | ResNet-101 + hyper features + Faster R-CNN (COCO + 0712 trainval). | 2017-02-10 03:03:50
SDS | SDS | UC Berkeley | Bharath Hariharan, Pablo Arbelaez, Ross Girshick, Jitendra Malik | We aim to detect all instances of a category in an image and, for each instance, mark the pixels that belong to it. We call this task Simultaneous Detection and Segmentation (SDS). Unlike classical bounding-box detection, SDS requires a segmentation and not just a box. Unlike classical semantic segmentation, we require individual object instances. We build on recent work that uses convolutional neural networks to classify category-independent region proposals (R-CNN [1]), introducing a novel architecture tailored for SDS. We then use category-specific, top-down figure-ground predictions to refine our bottom-up proposals. We show a 7-point boost (16% relative) over our baselines on SDS, a 4-point boost (8% relative) over the state of the art on semantic segmentation, and state-of-the-art performance in object detection. Finally, we provide diagnostic tools that unpack performance and suggest directions for future work. | 2014-07-21 22:46:22
SSD300 | SSD300 VGG16 07++12 | Google, UNC Chapel Hill, Zoox | Wei Liu, Dragomir Anguelov, Dumitru Erhan, Christian Szegedy, Scott Reed, Cheng-Yang Fu, Alexander C. Berg | We train an SSD model using VGG16 on 300 x 300 input images. The training data is VOC07 trainval + test and VOC12 trainval. The inference speed is 59 FPS on a Titan X with batch size 8, or 46 FPS with batch size 1. We test a single model on a single image scale (300x300) with no post-processing steps. Code and more details: https://github.com/weiliu89/caffe/tree/ssd | 2016-10-18 17:53:04
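SSD places default boxes on several feature maps of decreasing resolution, and the paper chooses the per-map box scale with a simple linear rule, s_k = s_min + (s_max - s_min)(k - 1)/(m - 1) for k = 1..m, with s_min = 0.2 and s_max = 0.9. A sketch of that rule (parameter names are ours):

```python
def ssd_scales(num_maps=6, s_min=0.2, s_max=0.9):
    """Default-box scales for each of `num_maps` feature maps, spaced
    linearly from s_min to s_max (fractions of the input image size)."""
    m = num_maps
    # k runs 0..m-1 here, equivalent to the paper's 1..m indexing.
    return [s_min + (s_max - s_min) * k / (m - 1) for k in range(m)]
```

Each feature map then tiles default boxes of that scale at several aspect ratios, so coarser maps handle larger objects.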
SSD300ft | SSD300 VGG16 07++12+COCO | Google, UNC Chapel Hill, Zoox | Wei Liu, Dragomir Anguelov, Dumitru Erhan, Christian Szegedy, Scott Reed, Cheng-Yang Fu, Alexander C. Berg | We first train an SSD300 model using VGG16 on MS COCO trainval35k, then fine-tune it on VOC07 trainval + test and VOC12 trainval for the 20 PASCAL classes. | 2016-10-03 07:08:37
SSD512 | SSD512 VGG16 07++12 | Google, UNC Chapel Hill, Zoox | Wei Liu, Dragomir Anguelov, Dumitru Erhan, Christian Szegedy, Scott Reed, Cheng-Yang Fu, Alexander C. Berg | We train an SSD model using VGG16 on 512 x 512 input images. The training data is VOC07 trainval + test and VOC12 trainval. The inference speed is 22 FPS on a Titan X with batch size 8, or 19 FPS with batch size 1. We test a single model on a single image scale (512x512) with no post-processing steps. Code and more details: https://github.com/weiliu89/caffe/tree/ssd | 2016-10-13 17:28:35
SSD512ft | SSD512 VGG16 07++12+COCO | Google, UNC Chapel Hill, Zoox | Wei Liu, Dragomir Anguelov, Dumitru Erhan, Christian Szegedy, Scott Reed, Cheng-Yang Fu, Alexander C. Berg | We first train an SSD512 model using VGG16 on MS COCO trainval35k, then fine-tune it on VOC07 trainval + test and VOC12 trainval for the 20 PASCAL classes. We test a single model on a single image scale (512x512) with no post-processing steps. | 2016-10-10 19:35:42
Fine-grained search using R-CNN with StructObj | UMICH_FGS_STRUCT | University of Michigan & Zhejiang University | Yuting Zhang, Kihyuk Sohn, Ruben Villegas, Gang Pan, Honglak Lee | We performed Bayesian-optimization-based fine-grained search (FGS) using an R-CNN detector trained with a structured objective: (1) we used the 16-layer network pre-trained by the VGG group; (2) we fine-tuned the network with a softmax classifier on the VOC2012 detection trainval set; (3) structured SVMs were trained on VOC2012 trainval as the object detector; (4) FGS is applied on top of the R-CNN initial solutions; (5) bounding-box regression is adopted. Please refer to this paper for details: Yuting Zhang, Kihyuk Sohn, Ruben Villegas, Gang Pan, Honglak Lee, "Improving Object Detection with Deep Convolutional Networks via Bayesian Optimization and Structured Prediction", CVPR 2015. | 2015-06-20 21:39:43
You Only Look Once: Unified, Real-Time Detection | YOLO | University of Washington | Joseph Redmon, Santosh Divvala, Ross Girshick, Ali Farhadi | We train a convolutional neural network to perform end-to-end object detection. Our network processes the full image and outputs multiple bounding boxes and class probabilities. At test time we process images in real time at 45 fps. For more information and example code, see http://pjreddie.com/darknet/yolo/ | 2015-11-06 07:36:38
YOLOv2 | YOLOv2 | University of Washington | Joe Redmon, Ali Farhadi | We use a variety of tricks to increase the performance of YOLO, including dimension cluster priors and multi-scale training. Details at https://pjreddie.com/yolo/ | 2017-02-23 16:37:58
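The dimension cluster priors mentioned above come from running k-means over ground-truth box widths and heights with d(box, centroid) = 1 - IoU as the distance, so the learned anchor shapes overlap well with real boxes regardless of box size. A sketch under that definition (our code, not the YOLOv2 implementation):

```python
import numpy as np

def iou_wh(box, clusters):
    """IoU between one (w, h) box and each (w, h) cluster, with all boxes
    anchored at the origin so only shape matters."""
    inter = np.minimum(box[0], clusters[:, 0]) * np.minimum(box[1], clusters[:, 1])
    return inter / (box[0] * box[1] + clusters[:, 0] * clusters[:, 1] - inter)

def dimension_clusters(boxes, k, iters=100, seed=0):
    """k-means on (w, h) pairs using 1 - IoU as the distance."""
    rng = np.random.default_rng(seed)
    clusters = boxes[rng.choice(len(boxes), k, replace=False)]
    for _ in range(iters):
        # Assign each box to the cluster with the highest IoU (lowest 1 - IoU).
        assign = np.array([np.argmax(iou_wh(b, clusters)) for b in boxes])
        for c in range(k):
            if np.any(assign == c):
                clusters[c] = boxes[assign == c].mean(axis=0)
    return clusters
```

On box data with two distinct shape groups, the two resulting centroids land near the mean width/height of each group.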
YOLOv2 (VOC + COCO) | YOLOv2 (VOC + COCO) | University of Washington | Joseph Redmon, Ali Farhadi | We use a variety of tricks to increase the performance of YOLO, including dimension cluster priors and multi-scale training. Details at https://pjreddie.com/yolo/ | 2017-03-12 04:11:29
CNN with Segmentation and Context Cues | segDeepM | University of Toronto | Yukun Zhu, Ruslan Salakhutdinov, Raquel Urtasun, Sanja Fidler | segDeepM on PASCAL 2012, with bounding-box regression. | 2016-03-04 19:28:43
Feature Edit with CNN features | Feature Edit | Fudan University | Zhiqiang Shen, Xiangyang Xue et al. | We edit the 5th-layer CNN features of the network defined by Krizhevsky et al. (2012), then add the new features to the original feature set. Two stages are used to find the variables to suppress: step one finds the largest variance of subsets within a class, and step two finds the variables with the smallest inter-class variance. This editing operation handles the separation of different properties. A linear SVM is boosted to classify the proposal regions, and bounding-box regression is also employed to reduce localization errors. | 2014-09-06 15:58:29
Deep poselets | Poselets2 | Facebook | Fei Yang, Rob Fergus | Poselets trained with a CNN. We ran the original poselets on a large set of images, collected weakly labelled training data, trained a convolutional neural net, and applied it to the test data. This method allows training deep poselets without the need for large numbers of manual keypoint annotations. | 2014-06-06 14:02:45