VOC2012 RESULTS

Key to abbreviations

Classification Results: VOC2012 data

Competition "comp1" (train on VOC2012 data)

Average Precision (AP %)

Method | aeroplane | bicycle | bird | boat | bottle | bus | car | cat | chair | cow | diningtable | dog | horse | motorbike | person | pottedplant | sheep | sofa | train | tvmonitor
CVC_BOW_FK_COLORHOG | 89.3 | 70.9 | 69.8 | 73.9 | 51.3 | 84.8 | 79.6 | 72.9 | 63.8 | 59.4 | 64.1 | 64.7 | 75.5 | 79.2 | 91.4 | 42.7 | 63.2 | 61.9 | 86.7 | 73.8
CVC_UVA_UNITN_BOW_FK_COLORDET_SP | 92.0 | 74.2 | 73.0 | 77.5 | 54.3 | 85.2 | 81.9 | 76.4 | 65.2 | 63.2 | 68.5 | 68.9 | 78.2 | 81.0 | 91.6 | 55.9 | 69.4 | 65.4 | 86.7 | 77.4
IMPERIAL_COMPLEX_LOGNORMAL | 73.2 | 33.4 | 31.0 | 44.7 | 17.0 | 57.7 | 34.4 | 45.9 | 41.2 | 18.1 | 30.2 | 34.3 | 23.1 | 39.3 | 57.3 | 11.9 | 23.1 | 25.3 | 51.2 | 36.2
ITI_FK_BS_GRAYSIFT | 89.1 | 62.3 | 60.0 | 68.1 | 33.4 | 79.8 | 66.9 | 70.3 | 57.4 | 51.0 | 55.0 | 59.3 | 68.6 | 74.5 | 83.1 | 25.6 | 57.2 | 53.8 | 83.4 | 64.9
ITI_FK_FUSED_SIFT | 90.4 | 65.4 | 65.8 | 72.3 | 37.7 | 80.6 | 70.5 | 72.4 | 60.3 | 55.1 | 61.4 | 63.6 | 72.4 | 77.4 | 86.8 | 37.7 | 61.1 | 57.2 | 85.9 | 68.7
NUSPSL_CTX_GPM_SCM | 97.3 | 84.2 | 80.8 | 85.3 | 60.8 | 89.9 | 86.8 | 89.3 | 75.4 | 77.8 | 75.1 | 83.0 | 87.5 | 90.1 | 95.0 | 57.8 | 79.2 | 73.4 | 94.5 | 80.7
UP_FEATURE_ENSEMBLE | - | - | - | - | - | - | - | - | - | - | - | - | - | - | 88.7 | - | - | - | - | -

Precision/Recall Curves
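
The classification, detection, action, and layout results on this page are reported as per-class average precision (AP): the area under the monotone precision/recall curve for that class. The official VOC development kit is the reference implementation; purely as an illustration, a minimal sketch of the computation from a ranked list of per-image scores might look like this:

```python
import numpy as np

def average_precision(scores, labels):
    """AP of one class from confidence scores and 0/1 ground-truth labels.

    Sketch of the PR-curve measure: rank by score, build the
    precision/recall curve, make precision monotonically non-increasing,
    then integrate it over recall.
    """
    order = np.argsort(-scores)               # descending confidence
    positives = labels[order] == 1
    tp = np.cumsum(positives)
    fp = np.cumsum(~positives)
    recall = tp / max(labels.sum(), 1)
    precision = tp / np.maximum(tp + fp, 1)
    # monotone envelope: precision at recall r is the max precision at >= r
    for i in range(len(precision) - 2, -1, -1):
        precision[i] = max(precision[i], precision[i + 1])
    # integrate the envelope over the recall steps
    prev_recall, ap = 0.0, 0.0
    for p, r in zip(precision, recall):
        ap += p * (r - prev_recall)
        prev_recall = r
    return ap

# e.g. average_precision(np.array([0.9, 0.7, 0.3]), np.array([1, 0, 1]))
```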

Classification Results: VOC2012 data

Competition "comp2" (train on own data)

Average Precision (AP %)

Method | aeroplane | bicycle | bird | boat | bottle | bus | car | cat | chair | cow | diningtable | dog | horse | motorbike | person | pottedplant | sheep | sofa | train | tvmonitor
ITI_FK_FLICKR_GRAYSIFT_ENTROPY | 88.1 | 63.0 | 61.9 | 68.6 | 34.9 | 79.6 | 67.4 | 70.5 | 57.5 | 52.0 | 55.3 | 60.1 | 68.7 | 74.3 | 83.2 | 26.4 | 57.6 | 53.4 | 83.0 | 64.0

Precision/Recall Curves

Detection Results: VOC2012 data

Competition "comp3" (train on VOC2012 data)

Average Precision (AP %)

Method | aeroplane | bicycle | bird | boat | bottle | bus | car | cat | chair | cow | diningtable | dog | horse | motorbike | person | pottedplant | sheep | sofa | train | tvmonitor
CVC_BOW_COLOR_HOG | 45.4 | 49.8 | 15.7 | 16.0 | 26.3 | 54.6 | 44.8 | 35.1 | 16.8 | 31.3 | 23.6 | 26.0 | 45.6 | 49.6 | 42.2 | 14.5 | 30.5 | 28.5 | 45.7 | 40.0
MISSOURI_HOGLBP_MDPM_CONTEXT | 51.4 | 53.7 | 18.3 | 15.6 | 31.6 | 56.5 | 47.1 | 38.6 | 19.5 | 32.0 | 22.1 | 25.0 | 50.3 | 51.9 | 44.9 | 11.9 | 37.7 | 30.6 | 50.9 | 39.3
NEC_STANFORD_OCP | 65.1 | 46.8 | 25.0 | 24.6 | 16.0 | 51.0 | 44.9 | 51.5 | 13.0 | 26.6 | 31.0 | 40.2 | 39.7 | 51.5 | 32.8 | 12.6 | 35.7 | 33.5 | 48.0 | 44.8
OLB_FT_DPM_R5 | 47.5 | 51.7 | 14.2 | 12.6 | 27.3 | 51.8 | 44.2 | 25.3 | 17.8 | 30.2 | 18.1 | 16.9 | 46.9 | 50.9 | 43.0 | 9.5 | 31.2 | 23.6 | 44.3 | 22.1
SYSU_DYNAMIC_AND_OR_TREE | 50.2 | 47.0 | 7.9 | 3.8 | 24.8 | 47.2 | 42.8 | 31.2 | 17.5 | 24.2 | 10.0 | 21.3 | 43.5 | 46.4 | 37.5 | 7.9 | 26.4 | 21.5 | 43.1 | 36.7
UOC_OXFORD_DPM_MKL | 59.6 | 54.5 | 21.9 | 21.6 | 32.1 | 52.5 | 49.3 | 40.8 | 19.1 | 35.2 | 28.9 | 37.2 | 50.9 | 49.9 | 46.1 | 15.6 | 39.3 | 35.6 | 48.9 | 42.8
UVA_DETECTOR_MERGING | 47.2 | 50.2 | 18.3 | 21.4 | 25.2 | 53.3 | 46.3 | 46.3 | 17.5 | 27.8 | 30.3 | 35.0 | 41.6 | 52.1 | 43.2 | 18.0 | 35.2 | 31.1 | 45.4 | 44.4
UVA_HYBRID_CODING_APE | 61.8 | 52.0 | 24.6 | 24.8 | 20.2 | 57.1 | 44.5 | 53.6 | 17.4 | 33.0 | 38.3 | 42.8 | 48.8 | 59.4 | 35.7 | 22.8 | 40.3 | 39.5 | 51.1 | 49.5

Precision/Recall Curves
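
In the detection competitions, a detection counts as correct when its bounding box overlaps a ground-truth box of the same class with intersection-over-union above 0.5; duplicate detections of the same object count as false positives, and AP is then computed over the ranked detections as above. A minimal sketch of the overlap test, in continuous coordinates (the official devkit works in inclusive pixel coordinates):

```python
def iou(a, b):
    """Intersection over union of two boxes given as (xmin, ymin, xmax, ymax)."""
    iw = min(a[2], b[2]) - max(a[0], b[0])
    ih = min(a[3], b[3]) - max(a[1], b[1])
    if iw <= 0 or ih <= 0:
        return 0.0
    inter = iw * ih
    union = ((a[2] - a[0]) * (a[3] - a[1])
             + (b[2] - b[0]) * (b[3] - b[1]) - inter)
    return inter / union

# a detection matches a ground-truth object when iou(det, gt) > 0.5
```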

Detection Results: VOC2012 data

Competition "comp4" (train on own data)

Average Precision (AP %)

Method | aeroplane | bicycle | bird | boat | bottle | bus | car | cat | chair | cow | diningtable | dog | horse | motorbike | person | pottedplant | sheep | sofa | train | tvmonitor
(no entries)

Precision/Recall Curves

Segmentation Results: VOC2012 data

Competition "comp5" (train on VOC2012 data)

Accuracy (%)

Method | mean | background | aeroplane | bicycle | bird | boat | bottle | bus | car | cat | chair | cow | diningtable | dog | horse | motorbike | person | pottedplant | sheep | sofa | train | tvmonitor
BONNGC_O2P_CPMC_CSI | 45.4 | 85.0 | 59.3 | 27.9 | 43.9 | 39.8 | 41.4 | 52.2 | 61.5 | 56.4 | 13.6 | 44.5 | 26.1 | 42.8 | 51.7 | 57.9 | 51.3 | 29.8 | 45.7 | 28.8 | 49.9 | 43.3
BONN_CMBR_O2P_CPMC_LIN | 44.8 | 83.9 | 60.0 | 27.3 | 46.4 | 40.0 | 41.7 | 57.6 | 59.0 | 50.4 | 10.0 | 41.6 | 22.3 | 43.0 | 51.7 | 56.8 | 50.1 | 33.7 | 43.7 | 29.5 | 47.5 | 44.7
BONN_O2PCPMC_FGT_SEGM | 47.0 | 85.1 | 65.4 | 29.3 | 51.3 | 33.4 | 44.2 | 59.8 | 60.3 | 52.5 | 13.6 | 53.6 | 32.6 | 40.3 | 57.6 | 57.3 | 49.0 | 33.5 | 53.5 | 29.2 | 47.6 | 37.6
NUS_DET_SPR_GC_SP | 47.3 | 82.8 | 52.9 | 31.0 | 39.8 | 44.5 | 58.9 | 60.8 | 52.5 | 49.0 | 22.6 | 38.1 | 27.5 | 47.4 | 52.4 | 46.8 | 51.9 | 35.7 | 55.3 | 40.8 | 54.2 | 47.8
UVA_OPT_NBNN_CRF | 11.3 | 63.2 | 10.5 | 2.3 | 3.0 | 3.0 | 1.0 | 30.2 | 14.9 | 15.0 | 0.2 | 6.1 | 2.3 | 5.1 | 12.1 | 15.3 | 23.4 | 0.5 | 8.9 | 3.5 | 10.7 | 5.3
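
Each segmentation class is scored as intersection/union over all test pixels: true positives divided by the sum of true positives, false positives, and false negatives, with the mean column averaging the 21 classes including background. A small sketch of that computation from a pixel-level confusion matrix:

```python
import numpy as np

def segmentation_accuracies(confusion):
    """Per-class accuracy = TP / (TP + FP + FN) from a 21x21 pixel
    confusion matrix (rows: ground truth, columns: prediction)."""
    tp = np.diag(confusion).astype(float)
    fn = confusion.sum(axis=1) - tp   # ground-truth pixels labelled otherwise
    fp = confusion.sum(axis=0) - tp   # pixels wrongly assigned to the class
    return tp / np.maximum(tp + fp + fn, 1.0)

# the mean column corresponds to segmentation_accuracies(confusion).mean()
```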

Segmentation Results: VOC2012 data

Competition "comp6" (train on own data)

Accuracy (%)

Method | mean | background | aeroplane | bicycle | bird | boat | bottle | bus | car | cat | chair | cow | diningtable | dog | horse | motorbike | person | pottedplant | sheep | sofa | train | tvmonitor
BONNGC_O2P_CPMC_CSI | 46.8 | 85.0 | 63.6 | 26.8 | 45.6 | 41.7 | 47.1 | 54.3 | 58.6 | 55.1 | 14.5 | 49.0 | 30.9 | 46.1 | 52.6 | 58.2 | 53.4 | 32.0 | 44.5 | 34.6 | 45.3 | 43.1
BONN_CMBR_O2P_CPMC_LIN | 46.7 | 84.7 | 63.9 | 23.8 | 44.6 | 40.3 | 45.5 | 59.6 | 58.7 | 57.1 | 11.7 | 45.9 | 34.9 | 43.0 | 54.9 | 58.0 | 51.5 | 34.6 | 44.1 | 29.9 | 50.5 | 44.5
BONN_O2PCPMC_FGT_SEGM | 47.5 | 85.2 | 63.4 | 27.3 | 56.1 | 37.7 | 47.2 | 57.9 | 59.3 | 55.0 | 11.5 | 50.8 | 30.5 | 45.0 | 58.4 | 57.4 | 48.6 | 34.6 | 53.3 | 32.4 | 47.6 | 39.2

Person Layout Results: VOC2012 data

Competition "comp7" (train on VOC2012 data)

Average Precision (AP %)

Method | Head | Hand | Foot
(no entries)

Precision/Recall Curves

Person Layout Results: VOC2012 data

Competition "comp8" (train on own data)

Average Precision (AP %)

Method | Head | Hand | Foot
CANON_HEAD_TORSO | 80.9 | - | -

Precision/Recall Curves

Action Classification Results: VOC2012 data

Competition "comp9" (train on VOC2012 data)

Average Precision (AP %)

Method | jumping | phoning | playing instrument | reading | riding bike | riding horse | running | taking photo | using computer | walking
STANFORD_RF_MULTFEAT_SVM | 75.7 | 44.8 | 66.6 | 44.4 | 93.2 | 94.2 | 87.6 | 38.4 | 70.6 | 75.6
SZU_DPM_RF_SVM | 73.8 | 45.0 | 62.8 | 41.4 | 93.0 | 93.4 | 87.8 | 35.0 | 64.7 | 73.5

Precision/Recall Curves

Action Classification Results: VOC2012 data

Competition "comp10" (train on own data)

Average Precision (AP %)

Method | jumping | phoning | playing instrument | reading | riding bike | riding horse | running | taking photo | using computer | walking
HU_BU_MIL_MULTI_CUE | 59.4 | 39.6 | 56.5 | 34.4 | 75.7 | 80.2 | 74.3 | 27.6 | 55.2 | 56.6
OXFORD_ALIGNED_BODYPARTS | 77.0 | 50.4 | 65.3 | 39.5 | 94.1 | 95.9 | 87.7 | 42.7 | 68.6 | 74.5

Precision/Recall Curves

Boxless Action Classification Results: VOC2012 data

Competition "comp11" (train on VOC2012 data)

Average Precision (AP %)

Method | jumping | phoning | playing instrument | reading | riding bike | riding horse | running | taking photo | using computer | walking
(no entries)

Precision/Recall Curves

Boxless Action Classification Results: VOC2012 data

Competition "comp12" (train on own data)

Average Precision (AP %)

Method | jumping | phoning | playing instrument | reading | riding bike | riding horse | running | taking photo | using computer | walking
(no entries)

Precision/Recall Curves

Key to Abbreviations

Each entry below lists the submission's abbreviation, title, method name, affiliation, contributors, and description.

BONNGC_O2P_CPMC_CSI
Title: O2P Regressor + Composite Statistical Inference
Method: BONNGC_O2P_CPMC_CSI
Affiliation: (1) University of Bonn, (2) Georgia Institute of Technology, (3) University of Coimbra
Contributors: Joao Carreira (1,3), Fuxin Li (2), Guy Lebanon (2), Cristian Sminchisescu (1)
Description: We utilize a novel probabilistic inference procedure (as yet unpublished), Composite Statistical Inference (CSI), for semantic segmentation using predictions on overlapping figure-ground hypotheses. Regressor predictions of segment overlaps with the ground-truth object are modelled as generated by the true overlap with the ground-truth segment plus noise. A model of ground-truth overlap is defined by parametrizing on the unknown percentage of each superpixel that belongs to the unknown ground truth. A joint optimization over all superpixels and all categories is then performed to maximize the likelihood of the SVR predictions. The optimization has a tight convex relaxation, so solutions can be expected to be close to the global optimum. A fast and optimal search algorithm is then applied to retrieve each object. CSI takes from the SVRSEGM inference algorithm the intuition that multiple predictions on similar segments can be combined to better consolidate the segment mask, but fully develops the idea by constructing a probabilistic framework and performing composite MLE jointly over all segments and categories. It is therefore able to consolidate object boundaries better and to handle hard cases in which objects interact closely and occlude each other heavily. For each image, we use 150 overlapping figure-ground hypotheses generated by the CPMC algorithm (Carreira and Sminchisescu, PAMI 2012), and linear SVR predictions on them with the novel second-order O2P features (Carreira, Caseiro, Batista, Sminchisescu, ECCV 2012; see VOC12 entry BONN_CMBR_O2P_CPMC_LIN) as the input to the inference algorithm.

BONN_CMBR_O2P_CPMC_LIN
Title: Linear SVR with second-order pooling
Method: BONN_CMBR_O2P_CPMC_LIN
Affiliation: (1) University of Bonn, (2) University of Coimbra
Contributors: Joao Carreira (2,1), Rui Caseiro (2), Jorge Batista (2), Cristian Sminchisescu (1)
Description: We present a novel, effective local feature aggregation method that we use in conjunction with an existing figure-ground segmentation sampling mechanism. This submission is described in detail in [1]. We sample multiple figure-ground segmentation candidates per image using the Constrained Parametric Min-Cuts (CPMC) algorithm. SIFT, masked SIFT, and LBP features are extracted on the whole image, then pooled over each object segmentation candidate to generate global region descriptors. We employ a novel second-order pooling procedure, O2P, with two non-linearities: a tangent space mapping and power normalization. The global region descriptors are passed through linear regressors for each category, and labeled segments in each image with scores above a threshold are pasted onto the image in the order of these scores. Learning is performed using an epsilon-insensitive loss function on overlap with ground truth, similar to [2], but within a linear formulation (using LIBLINEAR). comp6: learning uses all images in the segmentation+detection trainval sets, and external ground-truth annotations provided by courtesy of the Berkeley vision group. comp5: one model is trained for each category using the available ground-truth segmentations from the 2012 trainval set. Then, on each image having no associated ground-truth segmentations, the learned models are used together with bounding box constraints, low-level cues, and region competition to generate predicted object segmentations inside all bounding boxes. Afterwards, learning proceeds as in the fully annotated case. 1. "Semantic Segmentation with Second-Order Pooling", Carreira, Caseiro, Batista, Sminchisescu. ECCV 2012. 2. "Object Recognition by Ranking Figure-Ground Hypotheses", Li, Carreira, Sminchisescu. CVPR 2010.

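As an illustration of the second-order pooling step this entry describes, here is a rough numpy sketch under stated assumptions: local descriptors in a region are pooled as an averaged outer product, mapped to the tangent space by the matrix logarithm, and power-normalized. The regularizer and exponent below are illustrative guesses, not the authors' settings; see the ECCV 2012 paper for the actual recipe.

```python
import numpy as np

def o2p(X, power=0.75, eps=1e-6):
    """Second-order pooling sketch. X: (n_descriptors, d) array of local
    features (e.g. SIFT) falling inside one segmentation candidate."""
    G = X.T @ X / X.shape[0] + eps * np.eye(X.shape[1])  # averaged outer products
    w, V = np.linalg.eigh(G)                # G is symmetric positive definite
    T = (V * np.log(w)) @ V.T               # tangent-space (matrix log) mapping
    v = T[np.triu_indices_from(T)]          # T is symmetric: keep upper triangle
    v = np.sign(v) * np.abs(v) ** power     # power normalization
    return v / (np.linalg.norm(v) + 1e-12)  # l2 normalization for the linear SVR
```
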
BONN_O2PCPMC_FGT_SEGM
Title: BONN_O2PCPMC_FGT_SEGM
Method: BONN_O2PCPMC_FGT_SEGM
Affiliation: (1) University of Bonn, (2) University of Coimbra, (3) Georgia Institute of Technology, (4) Vienna University of Technology
Contributors: Joao Carreira (1,2), Adrian Ion (4), Fuxin Li (3), Cristian Sminchisescu (1)
Description: We present a joint image segmentation and labeling model which, given a bag of figure-ground segment hypotheses extracted at multiple image locations and scales using CPMC (Carreira and Sminchisescu, PAMI 2012), constructs a joint probability distribution over both the compatible image interpretations (tilings or image segmentations) composed from those segments and over their labeling into categories. The process of drawing samples from the joint distribution can be interpreted as first sampling tilings, modeled as maximal cliques, from a graph connecting spatially non-overlapping segments in the bag (Ion, Carreira, Sminchisescu, ICCV 2011), followed by sampling labels for those segments, conditioned on the choice of a particular tiling. We learn the segmentation and labeling parameters jointly, based on Maximum Likelihood with a novel Incremental Saddle Point estimation procedure (Ion, Carreira, Sminchisescu, NIPS 2011). As meta-features we combine outputs from linear SVRs using novel second-order O2P features to predict the overlap between segments and ground-truth objects of each class (Carreira, Caseiro, Batista, Sminchisescu, ECCV 2012; see VOC12 entry BONN_CMBR_O2P_CPMC_LIN), bounding box object detectors, and kernel SVR outputs trained to predict the overlap between segments and ground-truth objects of each class (Carreira, Li, Sminchisescu, IJCV 2012). comp6: the O2P SVR learning uses all images in the segmentation+detection trainval sets, and external ground-truth annotations provided by courtesy of the Berkeley vision group.

CANON_HEAD_TORSO
Title: Head and Torso Detection with Persistent Use of He
Method: HEAD-TORSO-ESTIMATOR
Affiliation: Canon Inc.
Contributors: Kan Torii, Atsushi Nogami, Kaname Tomite, Kenji Tsukamoto, Masakazu Matsugu
Description: We build a head detector by integrating two types of detectors based on the parts-based model by Felzenszwalb et al. (PAMI 2010). One is a robust view-based head detector trained on newly annotated images in the VOC 2006-2010 trainval datasets. The other is the whole-body detector of Felzenszwalb et al., although it can be configured to learn a wider variety of poses. Both detectors estimate the bounding box of the head instead of the whole body. The two detectors are integrated by merging their detections based on the score and overlap of the estimated bounding boxes. This works as a verification process of one of the detectors by the other. We can show that the total detector is also capable of estimating the inclination of the upper body.

CVC_BOW_COLOR_HOG
Title: Color-HOG based detector with BOW classifier
Method: CVC_DET
Affiliation: Computer Vision Center Barcelona
Contributors: Fahad Khan, Camp Davesa, Joost van de Weijer, Rao Muhammad Anwer, Albert Gordo, Pep Gonfaus, Ramon Baldrich, Antonio Lopez
Description: We use our color-HOG based part detector [1]. The detection results are combined with our CVC_CLS submission. References: 1. Fahad Shahbaz Khan, Rao Muhammad Anwer, Joost van de Weijer, Andrew D. Bagdanov, Maria Vanrell, Antonio M. Lopez. Color Attributes for Object Detection. In CVPR 2012.

CVC_BOW_FK_COLORHOG
Title: BOW with Fisher and color-HOG detection
Method: CVC_CLS
Affiliation: Computer Vision Barcelona
Contributors: Albert Gordo, Camp Davesa, Fahad Khan, Pep Gonfaus, Joost van de Weijer, Rao Muhammad Anwer, Ramon Baldrich, Jordi Gonzalez, Ernest Valveny
Description: Our submission is a combination of a standard bag-of-words pipeline, Fisher vectors, and color-HOG based part detection models. For bag-of-words, we use SIFT and ColorNames. To combine multiple cues, we use late fusion and color attention [1]. The Fisher representation is based on SIFT features. Finally, we use our color-HOG detector [2], which introduces color information within the part-based detection framework [3]. References: 1. Fahad Shahbaz Khan, Joost van de Weijer, Maria Vanrell. Modulating shape features by color attention for object recognition. IJCV, 98(1):49-64, 2012. 2. Fahad Shahbaz Khan, Rao Muhammad Anwer, Joost van de Weijer, Andrew D. Bagdanov, Maria Vanrell, Antonio M. Lopez. Color Attributes for Object Detection. In CVPR 2012. 3. P. F. Felzenszwalb, R. B. Girshick, D. A. McAllester, D. Ramanan. Object detection with discriminatively trained part-based models. PAMI, 32(9):1627-1645, 2010.

CVC_UVA_UNITN_BOW_FK_COLORDET_SP
Title: Combination of BOW, Fisher, color detection, SP
Method: CVC_UVA_UNITN
Affiliation: Computer Vision Barcelona, University of Amsterdam, University of Trento
Contributors: Fahad Khan, Jan van Gemert, Camp Davesa, Jasper Uijlings, Albert Gordo, Sezer Karaoglu, Koen van de Sande, Pep Gonfaus, Rao Muhammad Anwer, Joost van de Weijer, Cees Snoek, Ramon Baldrich, Nicu Sebe, Theo Gevers
Description: For bag-of-words, we use SIFT and ColorNames. To combine multiple cues, we use late fusion and color attention [1]. The Fisher representation is based on SIFT features. We use our color-HOG detector [2], which introduces color information within the part-based detection framework [3]. We extend spatial pyramid pooling with a generic functional pooling scheme. Pooling can be seen as a crude pre-matching technique which may be based on geometry (SPM) but can be any other grouping function [4]. This technique has been shown to help when pooling is based on saliency [5]. Here we also include pools based on signal-to-noise, interest points, and a pyramid over scale. References: 1. Fahad Shahbaz Khan, Joost van de Weijer, Maria Vanrell. Modulating shape features by color attention for object recognition. IJCV, 98(1):49-64, 2012. 2. Fahad Shahbaz Khan, Rao Muhammad Anwer, Joost van de Weijer, Andrew D. Bagdanov, Maria Vanrell, Antonio M. Lopez. Color Attributes for Object Detection. In CVPR 2012. 3. P. F. Felzenszwalb, R. B. Girshick, D. A. McAllester, D. Ramanan. Object detection with discriminatively trained part-based models. PAMI, 32(9):1627-1645, 2010. 4. J. C. van Gemert. Exploiting Photographic Style for Category-Level Image Classification by Generalizing the Spatial Pyramid. In ICMR, 2011. 5. S. Karaoglu, J. C. van Gemert and Th. Gevers. Object Reading: Text Recognition for Object Recognition. In ECCV-IFCVCR 2012, Oct 2012.

HU_BU_MIL_MULTI_CUE
Title: On Recognizing Actions in Still Images via MIL
Method: SVM-HOG, BOW, miles-objectness
Affiliation: Hacettepe University, Bilkent University
Contributors: Cagdas Bas, Fadime Sener, Nazli Ikizler-Cinbis
Description: We propose a multi-cue based approach for recognizing human actions in still images, where candidate object regions are discovered and utilized in a weakly supervised manner. Our approach is weakly supervised in the sense that it does not require any explicitly trained object detector or part/attribute annotation. Instead, a multiple instance learning approach is used over sets of object hypotheses in order to represent objects relevant to the actions. Our results show that using multiple object hypotheses within multiple instance learning is effective for human action recognition in still images, and that such an object representation is suitable for use in conjunction with other visual features.

IMPERIAL_COMPLEX_LOGNORMAL
Title: Complex LogNormal Scale Space
Method: ComplexLogNormal_LogFoveal_PhaseInvariance
Affiliation: Imperial College London
Contributors: Ioannis Alexiou, Anil A. Bharath
Description: We design and optimize a scale space based on sparse-coding optimization in the frequency domain. This scale space is obtained using complex filters with lognormal envelopes which span angular and radial frequencies. Two basic features are harvested from these filters: the oriented magnitudes and the projected phase of the filter outputs, which are used for sampling keypoints and grid points specifically designed for such filter outputs. We design descriptors composed of pooling functions that accumulate these outputs. The descriptors are produced by foveally arranged poolers which sample the basic features using (136 & 544) inner products per sampling point. These poolers are obtained using lognormal distributions, this time in the spatial domain. Two basic descriptors of 136 and 544 dimensions are produced for keypoint- and grid-based sampling. These are fed to a k-means module to generate 4000 visual words. Histograms of these words are computed at fixed regions with hard assignment of the words. Another class of histograms is introduced by pairing up words and computing histograms of word pairs, as proposed by Alexiou and Bharath (BMVC 2012). The fixed regions compose a spatial pyramid where each region is independently learnt by an SVM classifier. A final simple learning step merges all SVM predictions into a class prediction.

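For intuition only, a hypothetical sketch of a radial log-normal frequency envelope of the general kind this entry builds on (one scale band; the submission combines angular and radial components with complex phase, and the parameter values here are made up):

```python
import numpy as np

def lognormal_envelope(shape, f0=0.1, sigma_ratio=0.55):
    """Radial log-normal (log-Gabor style) envelope in the frequency domain.
    shape: image size; f0: peak frequency; sigma_ratio: bandwidth parameter."""
    fy = np.fft.fftfreq(shape[0])[:, None]
    fx = np.fft.fftfreq(shape[1])[None, :]
    r = np.hypot(fx, fy)
    r[0, 0] = 1.0                               # dodge log(0) at the DC bin
    env = np.exp(-np.log(r / f0) ** 2 / (2 * np.log(sigma_ratio) ** 2))
    env[0, 0] = 0.0                             # zero response at DC
    return env

# filtering: np.fft.ifft2(np.fft.fft2(image) * lognormal_envelope(image.shape))
```
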
ITI_FK_BS_GRAYSIFT
Title: Fisher Encoding Baseline using gray SIFT features
Method: ITI_FK_BS_GRAYSIFT
Affiliation: ITI-CERTH & Surrey University
Contributors: E. Chatzilari, S. Nikolopoulos, Y. Kompatsiaris, J. Kittler
Description: Based on the implementation of "K. Chatfield, V. Lempitsky, A. Vedaldi, A. Zisserman. The devil is in the details: an evaluation of recent feature encoding methods. British Machine Vision Conference, 2011", and specifically following the approach described in "F. Perronnin, J. Sanchez, and T. Mensink. 2010. Improving the Fisher kernel for large-scale image classification. In Proc. ECCV'10, Springer-Verlag, Berlin, Heidelberg, 143-156" for feature encoding based on the Fisher kernel. We use gray-SIFT descriptors reduced with PCA to 80 dimensions and a GMM with 256 components for estimating the probabilistic visual vocabulary. A spatial pyramid tree with 3 levels (1st: 1x1, 2nd: 2x2, 3rd: 3x1 horizontal) and dense sampling every 3 pixels is employed to define the keypoints. The descriptors are aggregated using Fisher encoding, which produces a 40960-dimensional vector for each of the 8 regions of the spatial pyramid. These vectors are subsequently concatenated to produce the final 327680-dimensional representation vector for each image. SVM classifiers are trained using the Hellinger kernel, which amounts to square-rooting the features and then normalizing the results using the l2 norm.

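The Hellinger-kernel training mentioned at the end of this entry amounts to an explicit feature map, signed square root followed by l2 normalization, after which a plain linear SVM behaves like a Hellinger-kernel SVM. A minimal sketch:

```python
import numpy as np

def hellinger_map(x):
    """Signed square root plus l2 normalization of a feature vector, so the
    inner product of mapped vectors matches the Hellinger kernel (up to sign)."""
    y = np.sign(x) * np.sqrt(np.abs(x))
    return y / (np.linalg.norm(y) + 1e-12)
```
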
ITI_FK_FLICKR_GRAYSIFT_ENTROPY
Title: Multimodal bootstrapping using MIRFLICKR1m
Method: ITI_FK_FLICKR_GRAYSIFT_ENTROPY
Affiliation: ITI-CERTH & Surrey University
Contributors: E. Chatzilari, S. Nikolopoulos, Y. Kompatsiaris, J. Kittler
Description: Based on the implementation of "K. Chatfield, V. Lempitsky, A. Vedaldi, A. Zisserman. The devil is in the details: an evaluation of recent feature encoding methods. British Machine Vision Conference, 2011", and specifically following the approach described in "F. Perronnin, J. Sanchez, and T. Mensink. 2010. Improving the Fisher kernel for large-scale image classification. In Proc. ECCV'10, Springer-Verlag, Berlin, Heidelberg, 143-156" for feature encoding based on the Fisher kernel. We use gray-SIFT descriptors reduced with PCA to 80 dimensions and a GMM with 256 components for estimating the probabilistic visual vocabulary. A spatial pyramid tree with 3 levels (1st: 1x1, 2nd: 2x2, 3rd: 3x1 horizontal) and dense sampling every 3 pixels is employed to define the keypoints. The descriptors are aggregated using Fisher encoding, which produces a 40960-dimensional vector for each of the 8 regions of the spatial pyramid. These vectors are subsequently concatenated to produce the final 327680-dimensional representation vector for each image. SVM classifiers are trained using the Hellinger kernel, which amounts to square-rooting the features and then normalizing the results using the l2 norm. In addition to the train+validation dataset, the set of examples used for training the visual recognition models is further enriched by collecting the first 500 images per concept from the MIRFLICKR dataset (1 million images in total). The images are ranked in ascending order based on the geometric mean of the image visual score (distance from the SVM hyperplane), the complement of the image tag-based similarity (between the image tags and the concept of interest), and the entropy of tag-based similarities among all concepts in the dataset.

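The bootstrapping rank in the last sentence can be read as the following sketch (hypothetical: the function name and the assumption that all three quantities are rescaled to [0, 1] are ours):

```python
import numpy as np

def bootstrap_rank(visual_score, tag_similarity, tag_entropy):
    """Order candidate Flickr images by the ascending geometric mean of the
    visual score, the complement of tag similarity, and the tag-similarity
    entropy, as described above. Inputs: arrays rescaled to [0, 1]."""
    g = (visual_score * (1.0 - tag_similarity) * tag_entropy) ** (1.0 / 3.0)
    return np.argsort(g)   # image indices in ascending order, as described
```
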
ITI_FK_FUSED_SIFT
Title: Late fusion of Gray, RGB, HSV and Op SIFT by avg
Method: ITI_FK_FUSED_GRAY-RGB-HSV-OP-SIFT
Affiliation: ITI-CERTH & Surrey University
Contributors: E. Chatzilari, S. Nikolopoulos, Y. Kompatsiaris, J. Kittler
Description: Based on the implementation of "K. Chatfield, V. Lempitsky, A. Vedaldi, A. Zisserman. The devil is in the details: an evaluation of recent feature encoding methods. British Machine Vision Conference, 2011", and specifically following the approach described in "F. Perronnin, J. Sanchez, and T. Mensink. 2010. Improving the Fisher kernel for large-scale image classification. In Proc. ECCV'10, Springer-Verlag, Berlin, Heidelberg, 143-156" for feature encoding based on the Fisher kernel. We use SIFT descriptors reduced with PCA to 80 dimensions and a GMM with 256 components for estimating the probabilistic visual vocabulary. A spatial pyramid tree with 3 levels (1st: 1x1, 2nd: 2x2, 3rd: 3x1 horizontal) and dense sampling every 3 pixels is employed to define the keypoints. The descriptors are aggregated using Fisher encoding, which produces a 40960-dimensional vector for each of the 8 regions of the spatial pyramid. These vectors are subsequently concatenated to produce the final 327680-dimensional representation vector for each image. SVM classifiers are trained using the Hellinger kernel, which amounts to square-rooting the features and then normalizing the results using the l2 norm. Four different feature spaces, namely Gray-SIFT, RGB-SIFT, HSV-SIFT and OP-SIFT, are computed, and the final prediction is generated by averaging the predictions of all four feature spaces (late fusion).

MISSOURI_HOGLBP_MDPM_CONTEXT
Title: HOG-LBP with Mixture DPM and Context
Method: MISSOURI_HOGLBP_MDPM_CONTEXT
Affiliation: The University of Missouri-Columbia
Contributors: Guang Chen, Miao Sun, Xutao Lv, Yan Li, Tony X. Han
Description: HOG-LBP features [1] are incorporated into the deformable part model [2]. The deformable model is further improved by using learned multiple anchor positions, so that the possible locations of each part are modeled as a mixture of Gaussian distributions. For part and root filters, PCA is adopted to denoise and to accelerate detection. We propose a permutation-matrix method to add model symmetry constraints during feature selection, which effectively exploits the symmetry present in most object categories and avoids overfitting. Contextual information, including image class label estimation, segmentation estimation, color histograms of ROIs, object location priors, and correlations between the object detectors, is used to leverage the final detection results to a very large extent: there is a great deal of contextual and correlational information among objects that can be used to boost detection performance. For example, trains and buses bear some visual similarities, but no two such large objects can coexist in the same location, so their detection scores are correlated, and we use inference on Bayesian networks to further improve the detection results. [1] Xiaoyu Wang, Tony X. Han and Shuicheng Yan, "An HOG-LBP Human Detector with Partial Occlusion Handling," IEEE International Conference on Computer Vision (ICCV 2009), Kyoto, 2009. [2] Girshick, R. B., Felzenszwalb, P. F., and McAllester, D.: Discriminatively Trained Deformable Part Models, Release 5.

NEC_STANFORD_OCP
Title: Object-centric pooling
Method: NEC_STANFORD_OCP
Affiliation: NEC Laboratories America and Stanford University
Contributors: Olga Russakovsky, Xiaoyu Wang, Shenghuo Zhu, Li Fei-Fei, Yuanqing Lin
Description: Object-centric pooling (OCP) represents a bounding box by pooling the coded low-level descriptors on the foreground and background separately and then concatenating them (Russakovsky et al., ECCV 2012). This method exploits powerful classification features that have been developed in the past years. In this system, we used DHOG and LBP as low-level descriptors. We developed a discriminative LCC coding scheme in addition to traditional LCC coding. We make use of candidate bounding boxes (van de Sande et al., ICCV 2011).

NUSPSL_CTX_GPM_SCM
Title: "Sub-class"-aware Object Classification
Method: NUSPSL_CTX_GPM_SCM
Affiliation: National University of Singapore; Panasonic Singapore Laboratories; Sun Yat-sen University
Contributors: NUS: Dong Jian, Chen Qiang, Song Zheng, Pan Yan, Xia Wei, Yan Shuicheng; PSL: Hua Yang, Huang Zhongyang, Shen Shengmei
Description: The new solution is motivated by the observation of considerable within-class diversity in the PASCAL VOC dataset. For example, the chair category includes two obvious sub-classes, namely sofa-like chairs and rigid-material chairs. In feature space these two sub-classes are essentially far apart, and it is intuitively beneficial to model them independently. The proposed solution contributes in the following aspects: 1) inhomogeneous-similarity-aware sub-class mining (SCM); 2) sub-class aware object detection and classification; and 3) sub-class aware kernel mapping for late fusion. The whole solution is also founded on several valuable components from the NUS-PSL team in VOC 2011: 1) traditional SPM and the novel Generalized Hierarchical Matching (GPM) [2] schemes are performed to generate image representations; 2) contextualized object detection and classification [1]. Considerable improvement over our solution for PASCAL VOC 2011 has been achieved, as shown in our offline train-vs-validation experiments. [1] Zheng Song, Qiang Chen, Zhongyang Huang, Yang Hua, and Shuicheng Yan. Contextualizing Object Detection and Classification. CVPR 2011. [2] Qiang Chen, Zheng Song, Yang Hua, Zhongyang Huang, Shuicheng Yan. Generalized Hierarchical Matching for Image Classification. CVPR 2012.

NUS_DET_SPR_GC_SP
Title: DM2: Detection, Mask transfer, MRF pruning
Method: NUS_DET_SPR_GC_SP
Affiliation: National University of Singapore (NUS), Panasonic Singapore Laboratories (PSL)
Contributors: (NUS) Wei Xia, Csaba Domokos, Jian Dong, Loong Fah Cheong, Shuicheng Yan; (PSL) Zhongyang Huang, Shengmei Shen
Description: We propose a three-step coarse-to-fine framework for general object segmentation. Given a test image, object bounding boxes are first predicted by object detectors, and coarse masks within the corresponding bounding boxes are then transferred from the training data based on the optimization framework of coupled global and local sparse representations in [1]. Based on the coarse masks as well as the original detection information (bounding boxes and confidence maps), we build a superpixel-based MRF model for each bounding box and perform foreground-background inference. Both the L*a*b* color histogram and the detection confidence map are used to characterize the unary terms, while the Pb edge contrast is used as the smoothness term. Finally, the segmentation results are further refined by post-processing with multi-scale superpixel segmentation. [1] Wei Xia, Zheng Song, Jiashi Feng, Loong Fah Cheong and Shuicheng Yan. Segmentation over Detection by Coupled Global and Local Sparse Representations. ECCV 2012.

OLB_FT_DPM_R5
Title: SVM classifier using HOG (V2)
Method: SVM-HOG
Affiliation: Orange Labs Beijing, France Telecom
Contributors: Zhao Feng
Description: Our object detection system is based on the Discriminatively Trained Deformable Part Models, Release 5. This is our first attempt at the VOC challenge. We do not make many modifications to the baseline system provided at http://people.cs.uchicago.edu/~rbg/latent/. The submitted results are obtained by applying post-processing with both bounding-box prediction and contextual rescoring.

OXFORD_ALIGNED_BODYPARTS
Title: Human action recognition from aligned body parts
Method: Oxford
Affiliation: University of Oxford
Contributors: Minh Hoai, Lubor Ladicky, Andrew Zisserman
Description: We propose a method for human action recognition that uses body-part detectors to localize the human and align feature descriptors. We first automatically detect the upper body, the hands, and the silhouette. We compute appearance features (SIFT + HOG + color) and location features (aspect ratio + relative position) of the detected body parts. We use the silhouette as an indicator of the pose and as a mask for discarding irrelevant features computed outside the body. We further utilize the detection scores obtained from several object detectors trained on publicly available datasets.

STANFORD_RF_MULTFEAT_SVM
Title: Random forest with SVM on multiple features
Method: RF_MULTFEAT_SVM
Affiliation: Stanford University; MIT
Contributors: Aditya Khosla, Rui Zhang, Bangpeng Yao, Li Fei-Fei
Description: We use a random forest (RF) approach for action classification. Our method is adapted from our CVPR 2011 paper (Khosla*, Yao*, Fei-Fei, 2011). We explore two key properties that determine the performance of RF classifiers: discrimination and randomization. (1) Discrimination: we obtain strong decision trees by using discriminative SVM classifiers at each tree node. (2) Randomization: we consider a very dense feature space, where we sample image regions that can have any size and location in the image. Compared to VOC 2011, we use multiple features, including SIFT, HOG, Color Naming, Object Bank and LBP. We modify some of the existing features to better address our needs. Further, we perform tree selection (similar to feature selection) to identify more discriminative regions in a class-specific manner.

SYSU_DYNAMIC_AND_OR_TREE
Title: Dynamic And-Or Tree Learning For Object Detection
Method: Configurable And-Or Tree Model
Affiliation: Sun Yat-Sen University
Contributors: Xiaolong Wang, Liang Lin, Lichao Huang, Xinhui Zhang, Zechao Yang
Description: We propose a novel hierarchical model for object detection, an "And-Or tree", which is made configurable by introducing "switch" variables (the or-nodes) that account for intra-class object variance. The model comprises three layers: a batch of leaf-nodes at the bottom for localizing object parts; or-nodes that activate several leaf-nodes to specify a composition of parts; and a root-node verifying holistic object distortion. For model training, a novel discriminative learning algorithm is proposed to explicitly determine the structural configuration (e.g., the production of leaf-nodes associated with the or-nodes) along with the optimization of the multi-layer parameters. The model's response integrates the bottom-up tests via the leaf-nodes and or-nodes with the global verification via the root-node. In the implementation, we use histograms of oriented gradients (HOG) as the image feature. Object detection is achieved by scanning sub-windows over different scales and locations of the image. The final decisions are further rescored by a context model encoding inter-object spatial interactions.

SZU_DPM_RF_SVM
Title: Part based models and object detection
Method: SZU_DPM_RF_SVM
Affiliation: Shenzhen University
Contributors: Shiqi Yu, Shengyin Wu, Wensheng Chen
Description: Based on "Object Detection with Discriminatively Trained Part Based Models", P. Felzenszwalb, R. Girshick, D. McAllester, D. Ramanan, IEEE TPAMI, 2010, and "Combining Randomization and Discrimination for Fine-Grained Image Categorization", B. Yao, A. Khosla, and L. Fei-Fei, CVPR 2011. We combine human part-based models with object detectors in the proposed action classification method. Based on the simple principle that similar poses are presented when subjects perform the same action, the deformable part-based model [Felzenszwalb et al., TPAMI 2010] is employed to describe the pose of a human body. In detail, the positions and textures of the parts can be extracted as features for action classification. For the detection part of the proposed method, we use the random forest (RF) described in [Yao et al., CVPR 2011]; RF can detect human body parts and the objects interacting with the human. Finally, we fuse the features and scores from the two models to achieve a stronger classifier.

UOC_OXFORD_DPM_MKL
Title: The DPM-MKL baseline
Method: DPM-MKL
Affiliation: University of Oxford, University of Chicago
Contributors: Ross Girshick, Andrea Vedaldi, Karen Simonyan, Pedro Felzenszwalb, Andrew Zisserman
Description: This method is similar to last year's DPM-MKL entry. We updated several aspects of the implementation (e.g., the type of features).

UP_FEATURE_ENSEMBLE
Title: SVM with different descriptors
Method: Ensemble of ensemble
Affiliation: University of Padova
Contributors: Loris Nanni
Description: We propose a system that incorporates several perturbation approaches and descriptors for a generic computer vision system. The variations we investigate include different global and bag-of-features descriptors, different clusterings for codebook creation, and different subspace projections for reducing the dimensionality of the descriptors extracted from each region. The basic classifier used in our ensembles is the Support Vector Machine, and the ensemble decisions are combined by the sum rule.

UVA_DETECTOR_MERGING
Title: Detector_Weighting
Method: Detector-Merging
Affiliation: University of Amsterdam
Contributors: Sezer Karaoglu, Fahad Shahbaz Khan, Koen van de Sande, Jan van Gemert, Rao Muhammad Anwer, Jasper Uijlings, Camp Davesa, Joost van de Weijer, Theo Gevers, Cees Snoek
Description: We use a bounding-box merging scheme that exploits the results from different independent detectors. Each detector produces a ranked list of bounding boxes whose scores are not directly comparable with those of other detectors, so we merge the detectors with a weighting scheme based on hold-out performance. As input, we use the standard Felzenszwalb gray HOG detector [1]; the color-HOG detector of CVC [2], which introduces color information within the part-based detection framework; and a slightly improved version of the SelectiveSearch detector [3] submitted by the UvA to VOC 2011. [1] P. Felzenszwalb, R. Girshick, D. McAllester, D. Ramanan. Object Detection with Discriminatively Trained Part Based Models. In TPAMI, Vol. 32, No. 9, Sep. 2010. [2] Fahad Shahbaz Khan, Rao Muhammad Anwer, Joost van de Weijer, Andrew D. Bagdanov, Maria Vanrell, Antonio M. Lopez. Color Attributes for Object Detection. In CVPR 2012. [3] Koen E. A. van de Sande, Jasper R. R. Uijlings, Theo Gevers, Arnold W. M. Smeulders. Segmentation As Selective Search for Object Recognition. In ICCV, 2011.

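A toy sketch of the kind of weighted merging this entry describes, under our own simplification: each detector's scores are rescaled by a hold-out weight, pooled, and reduced by greedy non-maximum suppression. The submission's actual scheme may differ.

```python
def merge_detections(runs, weights, iou_thresh=0.5):
    """runs: one list of (box, score) per detector, boxes as
    (xmin, ymin, xmax, ymax); weights: per-detector hold-out weights."""
    def iou(a, b):
        iw = min(a[2], b[2]) - max(a[0], b[0])
        ih = min(a[3], b[3]) - max(a[1], b[1])
        if iw <= 0 or ih <= 0:
            return 0.0
        inter = iw * ih
        union = ((a[2] - a[0]) * (a[3] - a[1])
                 + (b[2] - b[0]) * (b[3] - b[1]) - inter)
        return inter / union

    pooled = [(box, w * s) for run, w in zip(runs, weights) for box, s in run]
    pooled.sort(key=lambda d: -d[1])          # strongest detections first
    kept = []
    for box, score in pooled:                 # greedy non-maximum suppression
        if all(iou(box, k) < iou_thresh for k, _ in kept):
            kept.append((box, score))
    return kept
```
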
UVA_HYBRID_CODING_APE
Title: Hybrid Coding for Selective Search
Method: HybridCodingApe
Affiliation: ksande@uva.nl
Contributors: Koen E. A. van de Sande, Jasper R. R. Uijlings, Cees G. M. Snoek, Arnold W. M. Smeulders
Description: We have improved significantly over last year's method from [1] with a hybrid bag-of-words using average and difference coding, a first in object detection. Briefly, instead of the exhaustive search that was dominant in the PASCAL VOC 2010 and 2011 detection challenges, the method of [1] uses segmentation as a sampling strategy for selective search (cf. the ICCV paper). We use a small set of data-driven, class-independent, high-quality object locations (coverage of 96-99% of all objects in the VOC2007 test set). Because we have only a limited number of locations to evaluate, this enables the use of more computationally expensive features, such as bag-of-words with average and difference coding strategies. While difference coding is an order of magnitude more expensive than average coding, we are still able to train a detection system for it efficiently thanks to several optimizations in the descriptor coding and the kernel classification runtime. As low-level features, we use new complementary color descriptors. Finally, the detection system is fused with classification scores found using most-telling-example selection from [2]. [1] "Segmentation as Selective Search for Object Recognition"; Koen E. A. van de Sande, Jasper R. R. Uijlings, Theo Gevers, Arnold W. M. Smeulders; 13th International Conference on Computer Vision, 2011. [2] "The Most Telling Window for Image Classification"; Jasper R. R. Uijlings, Koen E. A. van de Sande, Arnold W. M. Smeulders, Theo Gevers, Nicu Sebe, Cees G. M. Snoek; PASCAL VOC Challenge Workshop at ICCV, 2011.

UVA_OPT_NBNN_CRF
Title: CRF with NBNN features and simple smoothing
Method: OptNBNN-CRF
Affiliation: University of Amsterdam (UvA)
Contributors: Carsten van Weelden, Maarten van der Velden, Jan van Gemert
Description: Naive Bayes nearest neighbor (NBNN) [Boiman et al., CVPR 2008] performs well in image classification because it avoids quantization of image features and estimates image-to-class distance. In the context of an MSc thesis, we applied the NBNN method to segmentation by estimating image-to-class distances for superpixels, which we use as unary potentials in a simple conditional random field (CRF). To get the NBNN estimates, we extract dense SIFT features from the training set and store them in a FLANN index [Muja and Lowe, VISAPP'09] for efficient nearest neighbor search. To deal with the unbalanced class frequencies, we learn a linear correction for each class as in [Behmo et al., ECCV 2010]. We segment each test image into 500 SLIC superpixels [Achanta et al., TPAMI 2012] and take each superpixel as a vertex in the CRF. We use the corrected NBNN estimates as unary potentials and a Potts potential as the pairwise potential, and infer the MAP labeling using alpha-expansion [Boykov et al., TPAMI 2001]. We tune the weighting between the unary and pairwise potentials by exhaustive search.
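
For reference, a brute-force sketch of the NBNN image-to-class distance behind the unary potentials described above (the entry accelerates the nearest-neighbor search with a FLANN index and adds per-class linear corrections):

```python
import numpy as np

def nbnn_distance(descriptors, class_descriptors):
    """Sum, over the query's local descriptors, of the squared distance to
    the nearest training descriptor of one class (Boiman et al., CVPR 2008).
    descriptors: (n, d); class_descriptors: (m, d)."""
    d2 = ((descriptors[:, None, :] - class_descriptors[None, :, :]) ** 2).sum(-1)
    return d2.min(axis=1).sum()

# label a superpixel by the class minimizing nbnn_distance (before CRF smoothing)
```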