Detection Results: VOC2012 BETA

Competition "comp3" (train on VOC2012 data)

This leaderboard shows only those submissions that have been marked as public, and so the displayed rankings should not be considered as definitive.

Average Precision (AP %)

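Method | mean | aeroplane | bicycle | bird | boat | bottle | bus | car | cat | chair | cow | dining table | dog | horse | motorbike | person | potted plant | sheep | sofa | train | tv/monitor | submission date
NAS Yolo | 86.5 | 92.9 | 92.7 | 88.4 | 78.0 | 78.1 | 90.8 | 89.7 | 94.5 | 74.3 | 92.8 | 71.9 | 93.2 | 94.5 | 92.9 | 92.3 | 67.0 | 92.1 | 77.7 | 92.4 | 84.9 | 09-May-2020
BOE_IOT_AIBD_method_improved | 83.8 | 90.4 | 90.0 | 82.8 | 77.4 | 76.8 | 89.5 | 85.9 | 93.3 | 73.0 | 86.7 | 68.4 | 92.7 | 92.5 | 90.6 | 90.3 | 69.1 | 84.1 | 73.3 | 90.3 | 78.9 | 27-Nov-2019
Improved yolo-v3 | 83.7 | 91.8 | 89.3 | 86.3 | 73.9 | 71.
Stronger-yolo | 83.3 | 91.9 | 89.1 | 82.5 | 75.2 | 72.9 | 87.3 | 87.8 | 91.0 | 71.3 | 85.
FCASA-detection | 82.4 | 90.9 | 87.2 | 83.8 | 72.3 | 72.0 | 86.3 | 87.7 | 90.2 | 69.8 | 85.1 | 71.2 | 89.7 | 90.0 | 89.3 | 90.6 | 61.1 | 85.3 | 75.1 | 89.5 | 80.1 | 05-Aug-2019
DOLO | 81.3 | 91.7 | 87.3 | 83.
ASSD513 | 81.3 | 92.1 | 89.2 | 82.5 | 71.5 | 60.4 | 85.5 | 84.8 | 93.9 | 63.7 | 88.6 | 67.4 | 92.6 | 90.2 | 89.0 | 86.5 | 60.4 | 88.2 | 73.4 | 88.6 | 77.0 | 18-Aug-2018
COS-DET | 81.3 | 91.7 | 87.2 | 82.1 | 71.6 | 68.6 | 86.9 | 85.3 | 93.1 | 63.8 | 86.8 | 66.0 | 92.0 | 90.4 | 88.4 | 88.8 | 61.2 | 86.8 | 73.8 | 88.1 | 73.7 | 26-Apr-2019
FastX-RCNN | 81.1 | 89.7 | 86.4 | 84.1 | 70.9 | 73.1 | 84.6 | 85.5 | 94.3 | 64.7 | 85.3 | 62.2 | 93.4 | 90.2 | 88.8 | 89.9 | 62.1 | 83.8 | 71.2 | 88.7 | 73.9 | 06-Jul-2018
DFL-Net | 80.2 | 91.4 | 88.5 | 80.6 | 67.3 | 58.
RockDetector-1 | 79.9 | 88.6 | 85.6 | 81.7 | 69.3 | 64.0 | 82.0 | 80.9 | 94.2 | 64.1 | 84.3 | 65.9 | 93.7 | 88.8 | 86.6 | 87.2 | 61.6 | 83.5 | 72.9 | 88.1 | 75.4 | 08-Nov-2019
refine_denseSSD | 77.5 | 89.8 | 85.8 | 77.0 | 64.4 | 56.7 | 83.7 | 81.8 | 92.1 | 60.9 | 83.8 | 63.2 | 89.6 | 85.9 | 88.1 | 85.3 | 54.7 | 82.3 | 64.6 | 88.2 | 72.4 | 14-May-2018
FPNSSD | 77.0 | 90.3 | 78.8 | 81.7 | 67.1 | 53.4 | 79.5 | 80.5 | 93.8 | 59.9 | 85.8 | 61.8 | 92.5 | 81.7 | 84.1 | 80.8 | 56.1 | 84.8 | 69.2 | 87.4 | 71.2 | 29-Mar-2018
TCnet | 76.6 | 86.6 | 83.1 | 78.5 | 65.6 | 61.1 | 80.8 | 80.3 | 91.7 | 56.3 | 80.1 | 61.8 | 90.5 | 86.1 | 84.0 | 83.4 | 56.6 | 79.7 | 70.0 | 84.5 | 71.9 | 02-May-2018
TCnet | 76.5 | 86.8 | 82.7 | 78.5 | 65.3 | 60.2 | 79.6 | 80.0 | 91.0 | 56.9 | 80.9 | 61.3 | 90.2 | 86.8 | 84.2 | 83.1 | 55.4 | 80.3 | 70.0 | 84.7 | 71.7 | 29-Mar-2018
ASSD321 | 76.4 | 89.6 | 84.3 | 76.7 | 64.5 | 49.3 | 81.7 | 77.0 | 92.2 | 57.8 | 81.3 | 64.0 | 91.6 | 86.5 | 85.8 | 82.
ATLSSD | 74.8 | 87.6 | 82.7 | 72.0 | 62.2 | 57.5 | 83.1 | 83.8 | 86.9 | 56.2 | 76.3 | 60.6 | 84.4 | 80.4 | 84.9 | 85.9 | 50.1 | 81.1 | 65.5 | 84.9 | 70.1 | 26-Mar-2018
DSD | 74.5 | 87.9 | 82.0 | 74.8 | 61.9 | 51.5 | 82.1 | 81.1 | 89.8 | 55.8 | 78.5 | 58.3 | 86.8 | 82.3 | 82.7 | 83.4 | 49.2 | 79.5 | 69.1 | 85.0 | 69.2 | 19-Jul-2018
dsa_1050 | 73.9 | 87.4 | 82.0 | 72.9 | 60.7 | 51.8 | 80.7 | 76.8 | 90.1 | 54.0 | 78.7 | 60.0 | 89.1 | 83.5 | 83.3 | 81.4 | 49.7 | 75.7 | 64.2 | 85.2 | 70.5 | 18-Nov-2017
DSOD v2 | 72.9 | 86.8 | 82.5 | 69.0 | 57.4 | 47.1 | 81.2 | 77.8 | 88.7 | 54.8 | 75.5 | 60.4 | 85.2 | 82.0 | 85.4 | 82.4 | 45.0 | 75.3 | 68.2 | 84.3 | 69.2 | 24-Jun-2018
MA-SSD | 72.9 | 87.0 | 81.4 | 71.2 | 59.4 | 49.0 | 81.3 | 74.4 | 88.2 | 55.5 | 78.2 | 61.2 | 85.9 | 82.7 | 82.7 | 80.3 | 46.5 | 76.6 | 66.8 | 83.7 | 66.2 | 01-Aug-2018
GRP-DSOD320 | 72.5 | 87.
ssd | 72.2 | 86.9 | 80.1 | 68.9 | 57.2 | 47.4 | 81.
DSOD (single model) | 70.8 | 86.4 | 80.2 | 65.5 | 55.7 | 42.4 | 80.3 | 75.3 | 86.6 | 51.1 | 72.3 | 60.5 | 83.9 | 80.5 | 83.6 | 80.4 | 42.7 | 72.4 | 67.3 | 83.1 | 66.2 | 21-Jan-2018
Attention-SSD-vgg
SSD | 64.0 | 78.9 | 72.3 | 61.8 | 42.8 | 27.9 | 73.1 | 69.4 | 84.9 | 42.5 | 68.4 | 52.2 | 80.9 | 76.5 | 77.2 | 68.2 | 31.6 | 67.0 | 66.6 | 77.3 | 60.9 | 10-Jun-2017
DCONV_SSD_FCN | 62.8 | 77.9 | 70.6 | 62.9 | 46.5 | 28.6 | 69.7 | 63.1 | 83.6 | 42.1 | 66.6 | 52.3 | 79.6 | 72.8 | 77.2 | 67.7 | 33.
THU_ML_class | 62.4 | 78.0 | 71.0 | 64.5 | 47.4 | 45.3 | 70.1 | 70.6 | 82.0 | 37.9 | 65.4 | 44.2 | 77.4 | 69.6 | 74.4 | 75.5 | 37.9 | 62.0 | 45.5 | 73.8 | 56.3 | 03-Jun-2017
yolo | 62.1 | 79.8 | 72.1 | 55.3 | 44.9 | 43.1 | 71.5 | 72.3 | 75.1 | 42.1 | 61.3 | 45.8 | 73.4 | 70.9 | 76.2 | 79.3 | 35.2 | 67.4 | 49.1 | 71.5 | 56.1 | 28-Sep-2019
yolo | 59.4 | 76.0 | 68.1 | 51.3 | 40.0 | 39.1 | 69.8 | 66.7 | 74.0 | 39.8 | 56.2 | 47.8 | 70.5 | 70.4 | 75.1 | 75.7 | 31.9 | 61.6 | 52.4 | 68.0 | 54.4 | 28-Sep-2019
YOLOv2 | 48.8 | 69.5 | 61.6 | 37.6 | 28.2 | 18.8 | 63.2 | 53.2 | 65.6 | 27.5 | 44.4 | 35.9 | 61.4 | 57.9 | 66.9 | 63.8 | 16.8 | 52.8 | 39.5 | 65.4 | 46.2 | 01-Dec-2016
DENSE_BOX | 45.9 | 64.7 | 64.1 | 28.8 | 26.7 | 30.7 | 60.6 | 54.9 | 47.4 | 29.3 | 41.8 | 34.6 | 42.6 | 59.3 | 64.2 | 62.5 | 24.3 | 53.7 | 27.1 | 50.9 | 50.7 | 07-Jul-2015
NoC | 42.2 | 62.8 | 60.4 | 26.7 | 22.3 | 25.7 | 56.9 | 55.2 | 52.1 | 21.5 | 38.3 | 34.2 | 43.9 | 51.2 | 58.8 | 40.7 | 20.4 | 42.0 | 37.4 | 52.6 | 41.6 | 26-Apr-2015
HybridCodingApe | 40.9 | 61.8 | 52.0 | 24.6 | 24.8 | 20.2 | 57.1 | 44.5 | 53.6 | 17.4 | 33.0 | 38.3 | 42.8 | 48.8 | 59.4 | 35.7 | 22.8 | 40.3 | 39.5 | 51.1 | 49.5 | 23-Sep-2012
Data Decomposition and Distinctive Context | 40.9 | 55.0 | 58.1 | 22.5 | 18.8 | 33.9 | 57.6 | 54.5 | 42.6 | 20.2 | 40.3 | 29.3 | 37.1 | 54.6 | 58.3 | 51.6 | 14.7 | 44.8 | 32.1 | 51.7 | 41.0 | 13-Oct-2011
segDPM | 40.7 | 59.1 | 54.3 | 28.2 | 24.4 | 34.5 | 53.4 | 48.1 | 51.3 | 18.1 | 37.8 | 29.9 | 40.4 | 48.9 | 52.9 | 46.4 | 16.1 | 39.5 | 35.4 | 50.8 | 44.9 | 24-Feb-2014
NYU-UCLA_Hierarchy | 40.6 | 56.3 | 55.9 | 23.4 | 20.3 | 27.2 | 56.6 | 48.1 | 53.8 | 23.3 | 32.9 | 33.4 | 39.2 | 53.0 | 56.9 | 43.6 | 14.3 | 37.9 | 39.4 | 52.6 | 43.7 | 13-Oct-2011
Fisher with FLAIR | 40.6 | 61.7 | 52.0 | 27.9 | 24.0 | 18.9 | 56.5 | 45.3 | 53.4 | 15.5 | 34.6 | 36.3 | 42.3 | 48.4 | 57.9 | 36.6 | 24.3 | 40.6 | 38.0 | 49.8 | 49.0 | 17-Jun-2014
DenseYolo | 39.4 | 60.2 | 48.7 | 26.
DPM-MKL | 39.1 | 59.6 | 54.5 | 21.9 | 21.6 | 32.1 | 52.5 | 49.3 | 40.8 | 19.1 | 35.2 | 28.9 | 37.2 | 50.9 | 49.9 | 46.1 | 15.6 | 39.3 | 35.6 | 48.9 | 42.8 | 23-Sep-2012
DPM-MK | 38.3 | 56.0 | 53.3 | 19.2 | 17.3 | 25.8 | 53.1 | 45.4 | 44.5 | 20.
NEC_STANFORD_OCP | 36.7 | 65.1 | 46.8 | 25.0 | 24.6 | 16.0 | 51.0 | 44.9 | 51.5 | 13.0 | 26.6 | 31.0 | 40.2 | 39.7 | 51.5 | 32.8 | 12.6 | 35.7 | 33.5 | 48.0 | 44.8 | 23-Sep-2012
Detector-Merging | 36.5 | 47.2 | 50.2 | 18.3 | 21.4 | 25.2 | 53.3 | 46.3 | 46.3 | 17.5 | 27.8 | 30.3 | 35.0 | 41.6 | 52.
MISSOURI_HOGLBP_MDPM_CONTEXT | 36.4 | 51.4 | 53.7 | 18.3 | 15.6 | 31.6 | 56.5 | 47.1 | 38.6 | 19.5 | 32.
NUS_Context_SVM | 36.2 | 51.4 | 52.9 | 20.1 | 15.8 | 26.9 | 53.0 | 45.6 | 37.6 | 15.3 | 36.0 | 25.1 | 32.6 | 50.4 | 55.8 | 36.8 | 12.3 | 37.6 | 30.5 | 48.1 | 41.0 | 05-Oct-2011
SelectiveSearchMonkey | 35.5 | 56.9 | 43.4 | 16.6 | 15.8 | 18.0 | 52.3 | 38.3 | 49.0 | 12.2 | 29.7 | 32.8 | 36.7 | 45.7 | 54.4 | 30.4 | 16.2 | 37.2 | 34.7 | 45.9 | 44.2 | 13-Oct-2011
CVC_DET | 34.1 | 45.4 | 49.8 | 15.7 | 16.0 | 26.3 | 54.6 | 44.8 | 35.1 | 16.8 | 31.3 | 23.6 | 26.0 | 45.6 | 49.6 | 42.2 | 14.5 | 30.5 | 28.5 | 45.7 | 40.0 | 23-Sep-2012
UOCTTI_LSVM_MDPM | 33.6 | 53.2 | 53.9 | 13.1 | 13.5 | 30.5 | 55.5 | 51.2 | 31.7 | 14.5 | 29.
TREE--MAX-POOLING | 32.9 | 43.8 | 51.7 | 13.7 | 12.7 | 27.3 | 51.5 | 43.7 | 32.9 | 18.3 | 27.3 | 18.5 | 23.1 | 45.2 | 48.6 | 42.9 | 11.6 | 32.4 | 27.5 | 47.0 | 39.3 | 13-Oct-2011
LCC-TREE-CODING | 32.4 | 41.1 | 51.7 | 13.7 | 11.9 | 27.3 | 52.1 | 41.7 | 32.9 | 17.6 | 27.3 | 18.5 | 23.1 | 45.2 | 48.6 | 41.9 | 11.6 | 32.4 | 27.5 | 44.2 | 38.3 | 13-Oct-2011
SVM-HOG | 31.5 | 47.5 | 51.7 | 14.2 | 12.6 | 27.3 | 51.8 | 44.2 | 25.3 | 17.8 | 30.2 | 18.1 | 16.9 | 46.9 | 50.9 | 43.0 | 9.5 | 31.2 | 23.6 | 44.3 | 22.1 | 22-Sep-2012
Configurable And-Or Tree Model | 29.5 | 50.
lSVM-Viewpoint | 20.9 | 42.5 | 43.7 | 5.4 | 4.8 | 18.1 | 28.6 | 36.6 | 24.2 | 12.6 | 20.6 | 4.5 | 17.5 | 15.
UOCTTI_WL-SSVM_GRAMMAR | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | 49.2 | - | - | - | - | - | 12-Oct-2011
CMIC-Synthetic-DPM | - | 40.4 | 47.8 | - | 11.4 | 23.7 | 48.9 | 40.9 | 23.5 | 11.9 | 25.5 | - | 10.9 | 42.0 | 38.7 | 40.7 | 7.5 | 30.4 | - | 38.4 | 34.8 | 13-Oct-2011
CMIC-GS-DPM | - | - | - | - | 13.3 | 26.4 | - | 41.5 | - | - | - | 12.2 | - | - | 41.6 | - | 8.3 | 31.4 | - | - | - | 13-Oct-2011
Geometric shape | - | - | 3.8 | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | 19-Jun-2016
Struct_Det_CRF | -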


Title | Method | Affiliation | Contributors | Description | Date
ASSD321 | ASSD321 | Rutgers | Jingru Yi, Pengxiang Wu | input resolution: 321x321 | 2018-08-20 02:34:00
ASSD513 | ASSD513 | Rutgers | Jingru Yi, Pengxiang Wu | input resolution: 513x513 | 2018-08-18 12:26:28
ATLSSD | ATLSSD | ATL (Alibaba Turing Labs) | Xuan Jin | SSD-based method trained on VOC2012 | 2018-03-26 07:48:08
softmax with Attention on vgg for detection | Attention-SSD-vgg | CSUST | Jia | We select the boxes which are >0.5. We added attention to the SSD model. | 2018-05-20 11:12:40
BOE_IOT_AIBD_method_improved | BOE_IOT_AIBD_method_improved | BOE_IOT_AIBD | Xu Jingtao | BOE_IOT_AIBD_method_improved | 2019-11-27 03:29:33
Single-stage detector trained by step-SGDR. | COS-DET | ZUIYOU Inc | Tabsun, Ma Baoyuan, Li Yong, Li Xiaosong | I designed a new step-SGDR method, which is the most important innovation; it boosts the mAP by almost 0.6 compared with the step-decay strategy. An important point is how to judge the overfitting point. For the backbone I used darknet-53, with common data augmentation methods such as distortion, random crop, random flip, and mix-up. Multi-scale testing and horizontal-flip testing also really help. Some common methods like softNMS did not help in my experiments. On a single 1080Ti the model runs at almost 15 fps. | 2019-04-26 12:04:21
Color_HOG based detector with BOW classifier | CVC_DET | Computer Vision Center Barcelona | Fahad Khan, Camp Davesa, Joost van de Weijer, Rao Muhammad Anwer, Albert Gordo, Pep Gonfaus, Ramon Baldrich, Antonio Lopez | We use our Color-HOG based part detector [1]. The detection results are combined with our CVC_CLS submission. References: [1] Fahad Shahbaz Khan, Rao Muhammad Anwer, Joost van de Weijer, Andrew D. Bagdanov, Maria Vanrell, Antonio M. Lopez. Color Attributes for Object Detection. In CVPR 2012. | 2012-09-23 18:53:20
Dynamic And-Or Tree Learning For Object Detection | Configurable And-Or Tree Model | Sun Yat-Sen University | Xiaolong Wang, Liang Lin, Lichao Huang, Xinhui Zhang, Zechao Yang | We propose a novel hierarchical model for object detection, namely the "And-Or tree", which is made configurable by introducing “switch” variables (i.e. the or-nodes) accounting for intra-class object variance. This model comprises three layers: a batch of leaf-nodes at the bottom for localizing object parts; the or-nodes for activating several leaf-nodes to specify a composition of parts; and a root-node verifying object holistic distortion. For model training, a novel discriminative learning algorithm is proposed to explicitly determine the structural configuration (e.g., the production of leaf-nodes associated with the or-nodes) along with the optimization of multi-layer parameters. The response of the model integrates the bottom-up tests via the leaf-nodes and or-nodes with the global verification via the root-node. In the implementation, we apply histograms of gradients (HOG) as the image feature. Object detection is achieved by scanning sub-windows over different scales and locations of the image. The final decisions are further rescored by a context model encoding the inter-object spatial interactions. | 2012-09-23 16:02:13
dssd style arch | DCONV_SSD_FCN | Shanghai University | Li Junhao | object detection and semantic segmentation in one forward pass | 2018-03-17 02:58:20
DenseBoxCNN | DENSE_BOX | Baidu IDL | Lichao Huang | I train a VGG16-like convolutional neural network to perform end-to-end object detection. This network processes the full image and outputs multiple bounding boxes and class confidence scores simultaneously. The training data used in this entry is VOC2012 trainval only. | 2015-07-07 05:39:05
A Distinguishable Features Learning Network for {wansh, jpq} One-Stage Anchor-Based Object Detection via Distinguishable Feature Learning | 2020-06-22 08:29:46
YOLO V3 with dynamic constraint for objectness | DOLO | Tencent MIG YYB & USTC BDAA LAB | Chen Joya, Bin Luo, XueZheng Peng, Tong Xu | We present DOLO, which is based on the state-of-the-art object detection method YOLO V3. We have improved it with our dynamic constraint strategy. Furthermore, we use a simple SNIP (Scale Normalization for Image Pyramids) strategy in training. During inference, our square weaken method is adopted for multi-scale and flip testing. | 2018-09-21 10:34:36
The DPM-MKL baseline | DPM-MKL | Oxford | Ross Girshick, Andrea Vedaldi, Karen Simonyan | This method is similar to last year's DPM-MKL entry. We updated several aspects of the implementation (e.g. the type of features). | 2012-09-23 23:05:18
DSD | DSD | Cainiao | Duliang Haiwa | DSD | 2018-07-19 14:43:34
DSOD | DSOD (single model) | Intel | Zhiqiang Shen, Zhuang Liu, Jianguo Li, Yu-Gang Jiang, Yurong Chen, Xiangyang Xue | The training data is VOC 2012 trainval set without ImageNet pre-trained models or any other additional dataset. The input image size is 300x300. More details can be referred to our paper: "DSOD: Learning Deeply Supervised Object Detectors from Scratch". | 2018-01-21 06:13:56
DSOD v2 | DSOD v2 | UIUC | Zhiqiang Shen, Zhuang Liu, Jianguo Li, Yu-Gang Jiang, Yurong Chen and Xiangyang Xue | Training from scratch without pre-trained models. The input size is 300x300. | 2018-06-24 05:56:41
Yolo with dense grid and high level features | DenseYolo | University Politehnica Bucharest | Paul Urziceanu | N/A | 2017-05-15 10:54:16
Detector_Weighting | Detector-Merging | University of Amsterdam | Sezer Karaoglu, Fahad Shahbaz Khan, Koen van de Sande, Jan van Gemert, Rao Muhammad Anwer, Jasper Uijlings, Camp Davesa, Joost van de Weijer, Theo Gevers, Cees Snoek | We use a bounding box merging scheme that exploits the results from different independent detectors. Each detector produces a ranked list of bounding boxes, which is not directly comparable with those of other detectors. We merge the detectors with a weighting scheme based on hold-out performance. For input, we use the standard Felzenszwalb gray HOG detector [1]; the color-HOG detector of CVC [2], which introduces color information within the part-based detection framework; and a slightly improved version of the SelectiveSearch detector [3] by the UvA submitted to VOC 2011. [1] P. Felzenszwalb, R. Girshick, D. McAllester, D. Ramanan. Object Detection with Discriminatively Trained Part Based Models. In TPAMI, Vol. 32, No. 9, Sep. 2010. [2] Fahad Shahbaz Khan, Rao Muhammad Anwer, Joost van de Weijer, Andrew D. Bagdanov, Maria Vanrell, Antonio M. Lopez. Color Attributes for Object Detection. In CVPR 2012. [3] Koen E. A. van de Sande, Jasper R. R. Uijlings, Theo Gevers, Arnold W. M. Smeulders. Segmentation As Selective Search for Object Recognition. In ICCV, 2011. | 2012-09-23 22:51:19
Full convolution attention selectiv | FCASA-detection | DHAI | xiangming.zhou, kai.fu, guodong.wu | We propose a novel architecture for object detection. We use fully convolutional networks as multistep RPN networks. This kind of architecture proposes RoIs based on the previous step, so it avoids the imbalance between positive and negative samples. Meanwhile, this kind of architecture can improve detection recall: because the RoIs are filtered by the multistep RPN networks, the remaining RoIs are more reliable. We also use soft-NMS for scoring our objects and GIoU loss for the localization loss. Our architecture can be applied to any single-stage detector. Using the same backbone networks, we trained yolov3 and ssd with and without our architecture; using our architecture boosts mAP by almost 5% on the PASCAL VOC data set. | 2019-08-05 10:43:51
Detection Network Based on Function Maintenance | FMFPD | University of Chinese Academy of Sciences | Chengqi Xu | This module maintains the high-level strong semantic information more effectively, so that the lower level feature maps also have strong semantic features and the presentation ability of small object is also greatly enhanced. At the same time, the accuracy of detection is improved by using the two-stage features of the network to describe the objects. | 2020-05-19 14:33:57
FPNSSD | FPNSSD | sogou.com | Kuang Liu | FPNSSD trained on VOC12 | 2018-03-29 10:38:04
Faster RCNN with ResNext | FastX-RCNN | Yi+AI Lab | Hang Zhang, Boyuan Sun, Zhaonan Wang, Hao Zhao, ZiXuan Guan, Wei Miao | Faster RCNN + RoIAlign + ResNeXt152 + SoftNMS + Multi-Scale Training + Multi-Scale Testing | 2018-07-06 04:04:00
Fisher with FLAIR | Fisher with FLAIR | University of Amsterdam | Koen van de Sande, Cees Snoek, Arnold Smeulders | Run for our CVPR2014 paper "Fisher and VLAD with FLAIR", see | 11:47:29
Gated Recurrent Feature Pyramids | GRP-DSOD320 | UIUC | Zhiqiang Shen, Honghui Shi, Rogerio Feris, Liangliang Cao, Shuicheng Yan, Ding Liu, Xinchao Wang, Xiangyang Xue, Thomas S. Huang | We train GRP-DSOD for object detection. The training data is VOC 2012 trainval set without ImageNet pre-trained models or any other additional dataset. The input image size is 320x320. More details can be referred to our paper: "Learning Object Detection from Scratch with Gated Recurrent Feature Pyramids". | 2017-11-19 22:13:59
Diamond Frame Bicycle Recognition | Geometric shape | National Cheng Kung University | Chung-Ping Young, Yen-Bor Lin, Kuan-Yu Chen | A diamond-frame bicycle detector for side-view images is proposed, based on the observation that a bicycle consists of two wheels in the form of ellipses and a frame in the form of two triangles. Through the design of geometric constraints on the relationship between the triangles and ellipses, the computation is fast compared to feature-based classifiers. Besides, no training process is necessary and only a single image is required for our algorithm. Experimental results are also given in this paper to show the practicability and performance of the proposed bicycle model and bicycle detection algorithm. | 2016-06-19 10:06:33
Hybrid Coding for Selective Search | HybridCodingApe | ksande@uva.nl | Koen E. A. van de Sande, Jasper R. R. Uijlings, Cees G. M. Snoek, Arnold W. M. Smeulders | We have improved significantly over last year's method from [1] with a hybrid bag-of-words using average and difference coding, a first in object detection. Briefly, the method of [1], instead of exhaustive search, which was dominant in the Pascal VOC 2010 and 2011 detection challenge, uses segmentation as a sampling strategy for selective search (cf. the ICCV paper). We use a small set of data-driven, class-independent, high quality object locations (coverage of 96-99% of all objects in the VOC2007 test set). Because we have only a limited number of locations to evaluate, this enables the use of more computationally expensive features, such as bag-of-words using average and difference coding strategies. While difference coding is an order of magnitude more expensive than average, we are still able to efficiently train a detection system for it due to several optimizations in the descriptor coding and the kernel classification runtime. As low-level features, we use new complementary color descriptors. Finally, the detection system is fused with classification scores found using most telling example selection from [2]. [1] "Segmentation as Selective Search for Object Recognition"; Koen E. A. van de Sande, Jasper R. R. Uijlings, Theo Gevers, Arnold W. M. Smeulders; 13th International Conference on Computer Vision, 2011. [2] "The Most Telling Window for Image Classification"; Jasper R. R. Uijlings, Koen E. A. van de Sande, Arnold W. M. Smeulders, Theo Gevers, Nicu Sebe, Cees G. M. Snoek; PASCAL VOC Challenge Workshop 2011 at ICCV, 2011. | 2012-09-23 21:01:35
Improved yolo-v3 | Improved yolo-v3 | horizon | xianfeng tan | Improved yolo-v3 | 2019-11-15 10:30:19
HOGLBP with Mixture DPM and Context | MISSOURI_HOGLBP_MDPM_CONTEXT | The University of Missouri-Columbia | Guang Chen, Miao Sun, Xutao Lv, Yan Li, Tony X. Han | HOG-LBP features [1] are incorporated in the deformable part model [2]. The deformable model is further improved by using learned multiple anchor positions, so that the possible locations for each part are modeled as a mixture of Gaussian distributions. For part and root filters, PCA is adopted to denoise and to accelerate detection. We propose a permutation matrix method to add model symmetry constraints during feature selection, which effectively takes advantage of the symmetry property existing in most object categories and avoids overfitting. Contextual information, including image class label estimation, segmentation estimation, color histogram of the ROI, object location priors, and correlations between the object detectors, is used to improve the final detection results to a very large extent: there is a lot of contextual and correlational information among objects that can be used to boost detection performance. For example, trains and buses are objects bearing some visual similarities, but none of the large objects can coexist in the same location. So detection scores are correlated, and we use inference on Bayesian networks to further improve the detection results. [1] Xiaoyu Wang, Tony X. Han and Shuicheng Yan, “An HOG-LBP Human Detector with Partial Occlusion Handling,” IEEE International Conference on Computer Vision (ICCV 2009), Kyoto, 2009. [2] Girshick, R. B. and Felzenszwalb, P. F. and McAllester, D.: Discriminatively Trained Deformable Part Models, Release 5 | 2012-09-23 21:27:16
Using NAS Enhance Yolo | NAS Yolo | PA-Occam-Platform | Jian Yang, Zhenhou Hong, Xiaoyang Qu, Jianzong Wang, Jing Xiao | NAS-YoLo is an object detection model that introduces automatic data augmentation and neural architecture search (NAS) into a state-of-the-art YoLo model. The automatic data augmentation uses a reinforcement learning-based controller to find the best augmentation policies for the target dataset. The neural architecture search algorithm is developed from a one-shot NAS method with a parallel divide-and-conquer based evolutionary algorithm. Besides, an SMBO-based auto-tuning algorithm is used to yield better hyper-parameter combinations for NAS-YoLo. | 2020-05-09 08:00:13
Object-centric pooling | NEC_STANFORD_OCP | NEC Laboratories America and Stanford University | Olga Russakovsky, Xiaoyu Wang, Shenghuo Zhu, Li Fei-Fei, Yuanqing Lin | Object-centric pooling (OCP) is a method which represents a bounding box by pooling the coded low-level descriptors on the foreground and background separately and then concatenating them (Russakovsky et al. ECCV 2012). This method exploits powerful classification features that have been developed in the past years. In this system, we used DHOG and LBP as low-level descriptors. We developed a discriminative LCC coding scheme in addition to traditional LCC coding. We make use of candidate bounding boxes (van de Sande et al. ICCV 2011). | 2012-09-23 22:47:43
Networks on Convolutional Feature Maps | NoC | Microsoft Research | Shaoqing Ren, Kaiming He, Ross Girshick, Xiangyu Zhang, Jian Sun | This entry is an implementation of the system described in “Object Detection Networks on Convolutional Feature Maps” ( This model is trained on HoG feature only. Training data for this entry is voc 2012 trainval set. Selective Search is used for proposal. | 2015-04-26 09:47:29
Weakly supervised detection using inception-v2 | PITT_WSOD_INC2 | University of Pittsburgh | Keren Ye, Mingda Zhang, Wei Li, Danfeng Qin, Adriana Kovashka, Jesse Berent | Weakly supervised detection using inception-v2 | 2019-03-14 05:19:35
RockDetector-1 | RockDetector-1 | RocKontrol | Chen Li, Hui Wan | RockDetector-1-based method trained on VOC2012 | 2019-11-08 14:56:51
SSD | SSD | THU | SSD | SSD | 2017-06-10 04:47:11
SVM classifier using HOG (V2) | SVM-HOG | Orange Labs Beijing, France Telecom | Zhao Feng | Our object detection system is based on the Discriminatively Trained Deformable Part Models, Release 5. It is our first attempt at the VOC challenge. We did not make many modifications to the baseline system provided in The submitted results are obtained by applying post-processing of both bounding box prediction and contextual rescoring. | 2012-09-22 20:06:39
Stronger-yolo | Stronger-yolo | Central South University | Zhihong Xiao | Improve yolov3 with focal loss, KL loss, mix up, anchor-free and so on. | 2019-06-12 07:08:06
resnet101+softmax | TCnet | Tsinghua University | Yulin Liu | This is a model based on mask rcnn | 2018-03-29 12:02:15
TCnet | TCnet | Tsinghua University | Liu Yulin | TCnet | 2018-05-02 08:02:45
faster rcnn | THU_ML_class | Tsinghua University | training | faster rcnn | 2017-06-03 10:55:37
YOLOv2 | YOLOv2 | University of Washington | Joe Redmon, Ali Farhadi | YOLOv2 runs a single detection network once on an image to detect objects. It predicts bounding boxes and objectness as well as class probabilities across a convolutional feature map. For more information see: | 21:15:21
dsa_tes | dsa_1050 | Nanjing University | AD | add cs | 2017-11-18 11:34:21
refine_denseSSD | refine_denseSSD | BUPT | Yongqiang Yao | refine_denseSSD | 2018-05-14 02:23:40
ssd | ssd | ssd | ssd | ssd | 2018-08-01 09:31:10
yolo-all | yolo | shouhfqyolo3-608 | 2019-09-28 05:08:50
yolo-all | yolo | shouhfq0219yolo3 | 2019-09-28 04:14:52
Synthetic Training for deformable parts model | CMIC-GS-DPM | Cairo Microsoft Innovation Center | Dr. Motaz El-Saban, Osama Khalil, Mostafa Izz, Mohamed Fathi | We introduce dataset augmentation using synthetic examples as a method for introducing novel variations not present in the original set. We make use of the deformable parts-based model (Felzenszwalb et al 2010). We augment the training set with examples obtained by applying global scaling of the dataset examples. Global scaling includes no, up and down scaling, with varying performance across different object classes. Technique selection is based upon performance on the validation set. The augmented dataset is then used to train parts-based detectors using HOG features (Dalal & Triggs 2006) and latent SVM. The resulting class models are applied on test images in a “sliding-window” fashion. | 2011-10-13 22:01:23
Synthetic Training for deformable parts model | CMIC-Synthetic-DPM | Cairo Microsoft Innovation Center | Dr. Motaz El-Saban, Osama Khalil, Mostafa Izz, Mohamed Fathi | We introduce dataset augmentation using synthetic examples as a method for introducing novel variations not present in the original set. We make use of the deformable parts-based model (Felzenszwalb et al 2010). We augment the training set with examples obtained by relocating objects (having segmentation masks) to new backgrounds. New backgrounds used for relocation are selected using a set of techniques (no relocation, same image, “different” image or image with co-occurring objects). Performance of those techniques varies across classes according to the object class properties. For every class, we select the technique that achieves the highest AP on the validation set. The augmented dataset is then used to train parts-based detectors using HOG features (Dalal & Triggs 2006) and latent SVM. The resulting class models are applied on test images in a “sliding-window” fashion. | 2011-10-13 21:54:09
DPM with basic rescoring | DPM-MK | Oxford VGG | Andrea Vedaldi and Andrew Zisserman | This method uses a Deformable Part Model (our own implementation) to generate an initial (and very good) list of 100 candidate bounding boxes per image. These are then rescored by a multiple-features model combining DPM scores with dense SP-BOW, geometry, and context. The SP-BOW model uses dense SIFT features (vl_phow in VLFeat) quantized into 1200 visual words, a 6x6 spatial layout, and cell-by-cell l2 normalization after raising the entries to the 1/4 power (1/4-homogeneous Hellinger's kernel). The geometric model is a second-order polynomial kernel on the bounding box coordinates. The context model is a second-order polynomial kernel mixing the candidate DPM score with twenty scores obtained as the maximum response of the DPMs for the 20 classes in that image (like Felzenszwalb). A second context model is also added, using 20 scores from a state-of-the-art Fisher kernel image classifier (also on dense SIFT features), as described in Chatfield et al. 2010. The SVM scores are passed through a sigmoid for standardization in the 0-1 interval; the sigmoid model is fitted to the training data. The model is trained by means of a large-scale linear SVM using the one-slack bundle formulation (aka SVM^perf). The solver hence uses retraining implicitly, and we make sure it reaches full convergence. | 2011-10-13 10:20:29
NLPR-Detection | Data Decomposition and Distinctive Context | Institute of Automation, Chinese Academy of Sciences | Junge Zhang, Yinan Yu, Yongzhen Huang, Chong Wang, Weiqiang Ren, Jinchen Wu, Kaiqi Huang and Tieniu Tan | Part-based models have achieved great success in recent years. To our understanding, the original deformable part-based model has several limits: 1) the computational complexity is very large, especially when it is extended to enhanced models via multiple features, more mixtures or flexible part models. 2) The original part-based model is not “deformable” enough. To tackle these problems, 1) we propose a data decomposition based feature representation scheme for the part-based model in an unsupervised manner. The submitted method takes about 1~2 seconds per image from PASCAL VOC datasets on average while keeping high performance. We learn the basis from samples without any label information. The specific label-independent rule followed in the submitted methods can be adapted into other variants of the part-based model such as hierarchical models or flexible mixture models. 2) We found that each part corresponds to multiple possible locations, which is not reflected in the original part-based model. Accordingly, we propose that the locations of parts should obey the multiple Gaussian distribution. Thus, for each part we learn its optimal locations by clustering, and these are used to update the original anchors of the part-based model. The proposed method above can more effectively describe the deformation (pose and location variety) of objects’ parts. 3) We rescored the initial results with our distinctive context model, including global, local and intra-class context information. Besides, segmentation provides a strong indication of an object’s presence; therefore, the proposed segmentation-aware semantic attribute is applied in the final reasoning, which indeed shows promising performance. | 2011-10-13 16:20:59
SVM classifier with LCC and tree coding | LCC-TREE-CODING | University of Missouri | Xiaoyu Wang, Miao Sun, Xutao Lv, Shuai Tang, Guang Chen, Yan Li, Tony X. Han | A two-layer cascade structure for object detection. The first layer employs a deformable model to select possible candidates for the second layer. The latter layer takes location and global context augmented with LBP features to improve the accuracy. A bag of words model enhanced with spatial pyramid and local coordinate coding is used to model the global context information. A hierarchical tree structure coding is used to take care of the intra-class variation for each detection window. Linear SVM is used for classification. | 2011-10-13 17:13:43
Context-SVM based submission for 3 tasks | NUS_Context_SVM | National University of Singapore | Zheng Song, Qiang Chen, Shuicheng Yan | Classification uses the BoW framework. Dense-SIFT, HOG^2, LBP and color moment features are extracted. We then use VQ and Fisher vector for feature coding and SPM and Generalized Pyramid Matching (GPM) to generate image representations. Context-aware features are also extracted based on [1]. The classification models are learnt via kernel SVM. Then final classification scores are refined with kernel mapping [2]. Detection and segmentation results use the baseline of [3] using HOG and LBP feature. And then based on [1], we further learn context model and refine the detection results. The final segmentation result uses the learnt average masks for each detection component learnt using segmentation training set to substitute the rectangle detection boxes. [1] Zheng Song*, Qiang Chen*, Zhongyang Huang, Yang Hua, and Shuicheng Yan. Contextualizing Object Detection and Classification. [2] [3] | 2011-10-05 09:01:23
Latent Hierarchical Learning | NYU-UCLA_Hierarchy | NYU and UCLA | Yuanhao Chen, Li Wan, Long Zhu, Rob Fergus, Alan Yuille | Based on two recent publications: "Latent Hierarchical Structural Learning for Object Detection". Long Zhu, Yuanhao Chen, Alan Yuille, William Freeman. CVPR 2010. "Active Mask Hierarchies for Object Detection". Yuanhao Chen, Long Zhu, Alan Yuille. ECCV 2010. We present a latent hierarchical structural learning method for object detection. An object is represented by a mixture of hierarchical tree models where the nodes represent object parts. The nodes can move spatially to allow both local and global shape deformations. The image features are histograms of words (HOWs) and oriented gradients (HOGs), which enable rich appearance representation of both structured (e.g., cat face) and textured (e.g., cat body) image regions. Learning the hierarchical model is a latent SVM problem which can be solved by the incremental concave-convex procedure (iCCCP). Object detection is performed by scanning sub-windows using dynamic programming. The detections are rescored by a context model which encodes the correlations of 20 object classes by using both object detection and image classification. | 2011-10-13 22:21:11
Selective Search Detection System | SelectiveSearchMonkey | University of Amsterdam and University of Trento | Jasper R. R. Uijlings, Koen E. A. van de Sande, Arnold W. M. Smeulders, Theo Gevers, Nicu Sebe, Cees Snoek | Based on "Segmentation as Selective Search for Object Recognition"; Koen E. A. van de Sande, Jasper R. R. Uijlings, Theo Gevers, Arnold W. M. Smeulders; 13th International Conference on Computer Vision, 2011. Instead of exhaustive search, which was dominant in the Pascal VOC 2010 detection challenge, we use segmentation as a sampling strategy for selective search (cf. our ICCV paper). Like segmentation, we use the image structure to guide our sampling process. However, unlike segmentation, we propose to generate many approximate locations over few and precise object delineations, as the goal is to cover all object locations. Our sampling is diversified to deal with as many image conditions as possible. Specifically, we use a variety of hierarchical region grouping strategies by varying colour spaces and grouping criteria. This results in a small set of data-driven, class-independent, high quality object locations (coverage of 96-99% of all objects in the VOC2007 test set). Because we have only a limited number of locations to evaluate, this enables the use of the more computationally expensive bag-of-words framework for classification. Our bag-of-words implementation uses densely sampled SIFT and ColorSIFT descriptors. | 2011-10-13 20:45:25
Structured Detection and Segmentation CRF | Struct_Det_CRF | Oxford Brookes University | Jonathan Warrell, Vibhav Vineet, Paul Sturgess, Philip Torr | We form a hierarchical CRF which jointly models a pool of candidate detections and the multiclass pixel segmentation of an image. Attractive and repulsive pairwise terms are allowed between detection nodes (cf Desai et al, ICCV 2009), which are integrated into a Pn-Potts based hierarchical segmentation energy (cf Ladicky et al, ECCV 2010). A cutting-plane algorithm is used to train the model, using approximate MAP inference. We form a joint loss which combines segmentation and detection components (i.e. paying a penalty both for each pixel incorrectly labelled, and each false detection node which is active in a solution), and use different weightings of this loss to train the model to perform detection and segmentation. The segmentation results thus make use of the bounding box annotations. The candidate detections are generated using the Felzenszwalb et al. CVPR 2008/2010 detector, and as features for segmentation we use textons, SIFT, LBPs and the detection response surfaces themselves. | 2011-10-13 03:27:02
SVM classifier with tree max-pooling | TREE--MAX-POOLING | University of Missouri | Xiaoyu Wang, Miao Sun, Xutao Lv, Shuai Tang, Guang Chen, Yan Li, Tony X. Han | A two-layer cascade structure for object detection. The first layer employs a deformable model to select possible candidates for the second layer. The second layer uses location and global context, augmented with LBP features, to improve accuracy. A bag-of-words model enhanced with a spatial pyramid and local coordinate coding is used to model the global context information. A hierarchical tree-structured coding is used to handle the intra-class variation for each detection window. Max-pooling is used for tree node assignment. A linear SVM is used for classification. | 2011-10-13 20:50:30
LSVM trained mixtures of deformable part models | UOCTTI_LSVM_MDPM | University of Chicago | Ross Girshick (University of Chicago), Pedro Felzenszwalb (Brown), David McAllester (TTI-Chicago) | Based on [1] and [2] "Object Detection with Discriminatively Trained Part Based Models"; Pedro F. Felzenszwalb, Ross B. Girshick, David McAllester and Deva Ramanan; IEEE Transactions on Pattern Analysis and Machine Intelligence, Vol. 32, No. 9, September 2010. This entry is a minor modification of our publicly available "voc-release4" object detection system [1]. The system uses latent SVM to train mixtures of deformable part models using HOG features [2]. Final detections are refined using a context rescoring mechanism [2]. We extended [1] to detect smaller objects by adding an extra high-resolution octave to the HOG feature pyramid. The HOG features in this extra octave are computed using 2x2 pixel cells. Additional bias parameters are learned to help calibrate scores from detections in the extra octave with the scores of detections above this octave. This entry is the same as UOCTTI_LSVM_MDPM from the 2010 competition. Detection results are reported for all 20 object classes to provide a baseline for the 2011 competition. | 2011-10-12 16:09:55
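The context rescoring mechanism referred to in the UOCTTI_LSVM_MDPM entry builds, for each detection, a feature vector from its own score, its normalised box coordinates and the best detection score of every class in the same image. The sketch below follows that construction as described in [2], to the best of our reading. The function names are hypothetical, and the class-specific classifier that finally scores the vector is left out.

```python
import math


def squash(score):
    """Map a raw detector score into (0, 1) with a logistic of slope 2."""
    return 1.0 / (1.0 + math.exp(-2.0 * score))


def context_features(box, score, image_size, best_scores_per_class):
    """Feature vector for rescoring one detection.

    box: (x1, y1, x2, y2) in pixels.
    image_size: (width, height) in pixels.
    best_scores_per_class: highest detection score of each class in this
        image (20 values for the VOC classes).
    """
    width, height = image_size
    x1, y1, x2, y2 = box
    context = [squash(s) for s in best_scores_per_class]
    return [squash(score),
            x1 / width, y1 / height,
            x2 / width, y2 / height] + context
```

With 20 classes the vector has 25 entries: one for the detection score, four for the box and twenty for the image-level context.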
Person grammar model trained with WL-SSVM | UOCTTI_WL-SSVM_GRAMMAR | University of Chicago | Ross Girshick (University of Chicago), Pedro Felzenszwalb (Brown), David McAllester (TTI-Chicago) | This entry is described in [1] "Object Detection with Grammar Models"; Ross B. Girshick, Pedro F. Felzenszwalb, David McAllester. Neural Information Processing Systems 2011 (to appear). We define a grammar model for detecting people and train the model’s parameters from bounding box annotations using a formalism that we call weak-label structural SVM (WL-SSVM). The person grammar uses a set of productions that represent varying degrees of visibility/occlusion. Object parts, such as the head and shoulder, are shared across all interpretations of object visibility. Each part is represented by a deformable mixture model that includes deformable subparts. An "occluder" part (itself a deformable mixture of parts) is used to capture the nontrivial appearance of the stuff that typically occludes people from below. We further refine detections using the context rescoring mechanism from the UOCTTI_LSVM_MDPM entry, using the results of that entry for the 19 non-person classes. | 2011-10-12 16:13:33
Using viewpoint cues to improve object recognition | lSVM-Viewpoint | Cornell | Joshua Schwartz, Noah Snavely, Daniel Huttenlocher | Our system is based on the Latent SVM framework of [1], including their context rescoring method. We train 6 component models with 8 parts. However, unlike [1], components are trained using a clustering based on an unsupervised estimation of 3D object viewpoint. In this sense, our approach is similar to the unsupervised approach in [2], which also seeks to estimate viewpoint, but our clustering is based on explicit reasoning about 3D geometry. Additionally, we add features based on estimated 3D scene geometry for context rescoring. Of note is the fact that a detection with our method gives rise to an explicit estimation of object viewpoint within a scene, rather than just a bounding box. [1] P. Felzenszwalb, R. Girshick, D. McAllester, and D. Ramanan. Object Detection with Discriminatively Trained Part Based Models. PAMI 2010. [2] C. Gu and X. Ren. Discriminative Mixture-of-Templates for Viewpoint Classification. ECCV 2010. | 2011-10-13 02:33:13
DPM that uses region segmentation features | segDPM | UofT, TTI-C, UCLA | Sanja Fidler, Roozbeh Mottaghi, Allan Yuille, Raquel Urtasun | DPM-style model that exploits bottom-up segmentation. We use CPMC to extract regions and CPMC-o2p to classify them. The output of CPMC-o2p is then used as the segmentation in our model. We propose a new model that blends DPM (HOG appearance model) with segmentation. The model encourages each detection to fit tightly around a region. If there is no region, the detector falls back on the usual HOG score. In addition, we use context re-scoring based on object presence classifiers provided by NUS. Project page: 20:22:19