VOC2011 RESULTS

Key to Abbreviations

Classification Results: VOC2011 data

Competition "comp1" (train on VOC2011 data)

Average Precision (AP %)

aeroplane bicycle bird boat bottle bus car cat chair cow diningtable dog horse motorbike person pottedplant sheep sofa train tvmonitor
BPACAD_COMB_LF_AK_WK_NOBOXES 86.5 58.3 59.7 67.4 33.2 74.2 64.0 65.5 58.5 44.8 53.5 57.0 60.7 70.8 84.6 39.4 55.4 50.5 80.7 63.1
BPACAD_CS_FISH256_1024_SVM_AVGKER_NOBOXES 85.0 57.0 57.7 65.9 30.7 75.0 62.4 64.4 56.9 42.2 50.9 55.3 59.1 69.1 84.2 39.3 52.3 46.7 78.9 61.8
BUPT_ALL 61.5 11.9 12.4 29.7 8.7 30.6 18.4 23.6 21.6 5.8 14.8 18.5 7.1 12.3 47.7 7.2 15.0 9.8 18.8 19.2
BUPT_NOPATCH 65.1 23.8 17.3 36.0 12.6 40.5 31.1 35.4 27.2 10.4 20.8 31.3 13.6 29.5 54.9 10.7 19.1 19.2 42.1 30.8
JDL_K17_AVG_CLS 84.2 52.0 54.5 63.2 25.3 71.2 58.0 61.1 50.2 33.3 44.3 49.7 57.9 65.1 79.9 20.9 47.4 43.0 77.7 56.7
LIRIS_CLS 88.3 56.2 59.3 68.6 33.2 76.6 62.2 64.5 55.3 42.6 55.1 56.2 61.9 70.0 82.5 37.3 56.4 48.3 79.6 64.7
LIRIS_CLSDET 90.0 66.2 63.3 70.9 47.0 80.9 73.9 63.9 61.1 52.7 57.9 56.9 69.6 73.8 88.4 46.3 65.3 54.2 81.3 72.7
MSRAUSTC_HIGH_ORDER_SVM 92.8 74.8 69.6 76.1 47.3 83.5 76.4 76.9 59.8 54.5 63.5 67.0 75.1 78.8 90.4 43.1 63.1 60.4 85.6 71.1
MSRAUSTC_PATCH 92.7 74.5 69.4 75.4 45.7 83.4 76.5 76.6 59.6 54.5 63.4 67.4 74.8 78.6 90.3 43.0 63.1 58.6 85.2 71.3
NANJING_DMC_HIK_SVM_SIFT 55.6 25.5 31.0 36.5 15.8 41.4 40.0 40.6 30.0 17.8 21.1 34.0 27.0 31.0 57.9 11.9 20.7 22.6 48.4 35.7
NLPR_KF_SVM 10.5 9.1 10.7 6.0 6.5 7.2 13.3 12.2 11.5 9.5 5.6 16.7 8.6 6.6 38.9 5.3 15.0 5.0 8.3 5.4
NLPR_SS_VW_PLS 94.5 82.6 79.4 80.7 57.8 87.8 85.5 83.9 66.6 74.2 69.4 75.2 83.0 88.1 93.5 56.2 75.5 64.1 90.0 76.6
NLPR_SVM_BOWDET 82.9 69.4 45.4 60.1 46.0 80.0 75.1 59.9 54.9 50.7 43.3 49.9 63.4 72.2 88.1 36.1 57.1 37.7 75.2 58.5
NLPR_SVM_BOWDET_CONV 83.8 69.8 47.8 60.5 45.4 80.5 74.6 60.4 54.0 51.3 45.3 51.5 64.5 72.6 87.7 35.9 57.7 39.8 75.8 62.7
NUSPSL_CTX_GPM 95.5 81.1 79.4 82.5 58.2 87.7 84.1 83.1 68.5 72.8 68.5 76.4 83.3 87.5 92.8 56.5 77.7 67.0 91.2 77.5
NUSPSL_CTX_GPM_SVM 94.3 78.5 76.4 80.0 57.0 86.3 82.1 81.5 65.6 74.7 66.5 73.4 81.9 85.3 91.9 53.2 73.9 65.1 89.5 76.0
SJT_SIFT_LLC_PCAPOOL_DET_SVM 85.6 66.5 51.9 60.3 45.4 76.8 70.3 65.1 56.4 34.3 49.6 52.4 63.1 71.5 86.8 26.1 56.9 47.9 75.5 65.6
SJT_SIFT_LLC_PCAPOOL_SVM 83.2 52.5 49.3 59.6 26.0 73.5 58.2 64.4 52.1 36.6 44.9 52.1 57.8 63.8 78.1 19.1 52.8 44.1 72.0 57.4
UVA_MOSTTELLING 90.1 74.1 66.5 76.0 57.0 85.6 81.2 74.5 63.5 62.7 64.5 66.6 76.5 81.2 90.8 58.7 69.3 66.3 84.7 77.2
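
For reference, the AP figures above are computed from ranked classifier confidences. A minimal sketch, assuming the all-points interpolated average precision that VOC adopted from 2010 onwards (the toy scores and labels are illustrative):

```python
import numpy as np

def average_precision(scores, labels):
    """All-points interpolated AP: sort by confidence, sweep the
    precision/recall curve, and integrate precision over recall."""
    order = np.argsort(-np.asarray(scores))
    labels = np.asarray(labels)[order]          # 1 = positive, 0 = negative
    tp = np.cumsum(labels)
    fp = np.cumsum(1 - labels)
    recall = tp / labels.sum()
    precision = tp / (tp + fp)
    # make precision monotonically non-increasing, then sum the area
    for i in range(len(precision) - 2, -1, -1):
        precision[i] = max(precision[i], precision[i + 1])
    ap, prev_r = 0.0, 0.0
    for r, p in zip(recall, precision):
        ap += (r - prev_r) * p
        prev_r = r
    return ap

# toy example: five test images, two positives -> AP = 0.8333
print(average_precision([0.9, 0.8, 0.6, 0.4, 0.2], [1, 0, 1, 0, 0]))
```
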

Precision/Recall Curves

Classification Results: VOC2011 data

Competition "comp2" (train on own data)

Average Precision (AP %)

aeroplane bicycle bird boat bottle bus car cat chair cow diningtable dog horse motorbike person pottedplant sheep sofa train tvmonitor
LIRIS_CLSTEXT 88.3 66.1 60.8 68.5 46.7 77.3 69.2 63.7 55.9 52.6 56.6 55.5 69.6 73.7 87.1 46.3 65.2 54.0 81.2 72.7

Precision/Recall Curves

Detection Results: VOC2011 data

Competition "comp3" (train on VOC2011 data)

Average Precision (AP %)

aeroplane bicycle bird boat bottle bus car cat chair cow diningtable dog horse motorbike person pottedplant sheep sofa train tvmonitor
BROOKES_STRUCT_DET_CRF 37.1 42.6 2.0 0.0 16.0 43.8 38.6 17.0 10.3 7.7 2.4 1.5 34.3 41.1 38.4 1.5 14.7 5.3 35.4 27.1
CMIC_GS_DPM - - - 13.3 26.4 - 41.5 - - - 12.2 - - 41.6 - 8.3 31.4 - - -
CMIC_SYNTHDPM 40.4 47.8 - 11.4 23.7 48.9 40.9 23.5 11.9 25.5 - 10.9 42.0 38.6 40.7 7.5 30.4 - 38.4 34.8
CORNELL_ISVM_VIEWPOINT 42.5 43.7 5.4 4.8 18.1 28.6 36.6 24.2 12.6 20.5 4.4 17.5 15.2 38.2 7.9 1.7 23.2 7.1 41.0 25.7
MISSOURI_LCC_TREE_CODING 41.1 51.7 13.7 11.9 27.3 52.1 41.7 32.9 17.6 27.3 18.5 23.1 45.2 48.6 41.9 11.6 32.4 27.5 44.2 38.3
MISSOURI_TREE_MAX_POOLING 43.8 51.7 13.7 12.7 27.3 51.5 43.7 32.9 18.3 27.3 18.5 23.1 45.2 48.6 42.9 11.6 32.4 27.5 47.0 39.3
NLPR_DD_DC 55.0 58.1 22.5 18.8 33.9 57.6 54.5 42.6 20.2 40.3 29.3 37.1 54.6 58.3 51.6 14.7 44.8 32.1 51.7 41.0
NUS_CONTEXT_SVM 51.4 52.9 20.1 15.7 26.9 53.0 45.6 37.6 15.2 36.0 25.1 32.6 50.4 55.8 36.8 12.3 37.6 30.5 48.1 41.0
NYUUCLA_HIERARCHY 56.3 55.9 23.4 20.3 27.2 56.6 48.1 53.8 23.2 32.9 33.3 39.2 53.0 56.9 43.6 14.3 37.9 39.4 52.6 43.7
OXFORD_DPM_MK 56.0 53.3 19.2 17.2 25.8 53.1 45.4 44.5 20.1 32.1 28.1 37.2 52.3 56.6 43.3 12.1 34.3 37.6 51.8 45.2
UOCTTI_LSVM_MDPM 53.2 53.9 13.1 13.5 30.5 55.5 51.2 31.7 14.5 29.0 16.0 22.1 43.1 50.3 46.3 8.8 33.0 22.9 45.8 38.2
UOCTTI_WL-SSVM_GRAMMAR - - - - - - - - - - - - - - 49.2 - - - - -
UVA_SELSEARCH 56.9 43.4 16.6 15.8 18.0 52.3 38.3 48.9 12.2 29.7 32.8 36.7 45.7 54.4 30.4 16.2 37.2 34.7 45.9 44.2
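
For reference, a detection in the table above counts as correct only if its bounding box overlaps a ground-truth box with intersection-over-union above 0.5, with duplicate detections of the same object counted as false positives. A minimal sketch of the overlap test, assuming (xmin, ymin, xmax, ymax) box coordinates:

```python
def iou(a, b):
    """Intersection-over-union of two boxes given as (xmin, ymin, xmax, ymax)."""
    ix = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    iy = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = ix * iy
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter)

# a detection is a true positive if it is the first to cover a
# ground-truth object with IoU > 0.5
print(iou((0, 0, 10, 10), (5, 0, 15, 10)) > 0.5)  # False: IoU = 1/3
```
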

Precision/Recall Curves

Detection Results: VOC2011 data

Competition "comp4" (train on own data)

Average Precision (AP %)

aeroplane bicycle bird boat bottle bus car cat chair cow diningtable dog horse motorbike person pottedplant sheep sofa train tvmonitor

No entries were submitted for this competition.

Precision/Recall Curves

Segmentation Results (VOC2011 data)

Competition "comp5" (train on VOC2011 data)

Accuracy (%)

- Entries in parentheses are synthesized from detection results.

[mean] background aeroplane bicycle bird boat bottle bus car cat chair cow diningtable dog horse motorbike person pottedplant sheep sofa train tvmonitor
BONN_FGT_SEGM 41.4 83.4 51.7 23.7 46.0 33.9 49.4 66.2 56.2 41.7 10.4 41.9 29.6 24.4 49.1 50.5 39.6 19.9 44.9 26.1 40.0 41.6
BONN_SVR_SEGM 43.3 84.9 54.3 23.9 39.5 35.3 42.6 65.4 53.5 46.1 15.0 47.4 30.1 33.9 48.8 54.4 46.4 28.8 51.3 26.2 44.9 37.2
BROOKES_STRUCT_DET_CRF 31.3 79.4 36.6 18.6 9.2 11.0 29.8 59.0 50.3 25.5 11.8 29.0 24.8 16.0 29.1 47.9 41.9 16.1 34.0 11.6 43.3 31.7
NUS_CONTEXT_SVM 35.1 77.2 40.5 19.0 28.4 27.8 40.7 56.4 45.0 33.1 7.2 37.4 17.4 26.8 33.7 46.6 40.6 23.3 33.4 23.9 41.2 38.6
NUS_SEG_DET_MASK_CLS_CRF 37.7 79.8 41.5 20.2 30.4 29.1 47.4 61.2 47.7 35.0 8.5 38.3 14.5 28.6 36.5 47.8 42.5 28.5 37.8 26.4 43.5 45.8
(CORNELL_ISVM_VIEWPOINT) 11.8 1.4 7.4 10.5 5.5 1.6 22.9 25.7 27.9 10.9 4.7 16.4 5.2 5.6 10.3 21.4 11.1 4.8 6.7 3.0 21.3 24.2
(MISSOURI_LCC_TREE_CODING) 13.1 0.5 9.2 9.4 8.1 2.2 25.7 32.6 18.6 13.2 4.1 9.5 13.8 9.5 13.5 17.4 26.7 10.0 9.5 14.5 15.9 11.2
(MISSOURI_TREE_MAX_POOLING) 13.1 0.6 10.0 7.8 7.4 2.3 27.1 30.2 38.8 12.3 3.9 8.3 10.7 7.8 11.4 14.4 26.9 6.3 8.6 10.3 16.9 13.2
(NLPR_DD_DC) 19.4 0.8 21.6 2.9 10.1 7.9 38.0 27.2 26.0 7.4 7.3 30.4 17.8 26.3 24.9 41.6 29.2 2.4 27.8 20.7 31.0 6.9
(NYUUCLA_HIERARCHY) 15.3 1.2 11.9 7.6 12.9 6.7 12.4 24.3 28.4 26.2 2.9 21.3 9.3 19.8 18.6 27.7 27.6 6.3 23.1 5.9 18.1 9.1
(OXFORD_DPM_MK) 15.2 0.4 16.3 7.4 8.7 4.7 27.0 29.8 18.9 23.0 3.2 15.3 11.6 13.9 19.6 19.1 23.3 4.3 22.7 7.7 19.5 22.5
(UOCTTI_LSVM_MDPM) 13.1 4.0 9.2 7.8 9.2 6.2 20.4 38.4 24.9 11.2 3.3 12.8 5.9 10.4 15.4 19.5 20.4 5.7 13.4 5.0 15.9 16.3
(UVA_SELSEARCH) 16.2 2.9 13.9 8.2 5.4 7.2 18.8 52.0 29.2 21.9 3.9 17.5 10.7 13.7 12.2 27.7 14.7 7.8 21.3 12.9 17.2 20.5
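
The accuracies above use the VOC segmentation measure: for each class, true positives divided by the total of true positives, false positives and false negatives, i.e. pixel-level intersection-over-union accumulated over the test set. A minimal sketch (the real evaluation also excludes "void" pixels, which this sketch omits):

```python
import numpy as np

def per_class_accuracy(pred, gt, num_classes=21):
    """VOC-style segmentation accuracy per class:
    tp / (tp + fp + fn), i.e. pixel intersection-over-union.
    Classes absent from both pred and gt would need guarding."""
    accs = []
    for c in range(num_classes):                 # class 0 is background
        tp = np.sum((pred == c) & (gt == c))
        fp = np.sum((pred == c) & (gt != c))
        fn = np.sum((pred != c) & (gt == c))
        accs.append(tp / float(tp + fp + fn))
    return accs  # the mean of this list is the "[mean]" column

pred = np.array([[0, 1], [1, 1]])
gt   = np.array([[0, 1], [0, 1]])
print(per_class_accuracy(pred, gt, num_classes=2))  # [0.5, 0.666...]
```
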

Segmentation Results (VOC2011 data)

Competition "comp6" (train on own data)

Accuracy (%)

- Entries in parentheses are synthesized from detection results.

[mean] background aeroplane bicycle bird boat bottle bus car cat chair cow diningtable dog horse motorbike person pottedplant sheep sofa train tvmonitor
BERKELEY_REGION_CLASSIFY 39.1 83.3 48.9 20.0 32.8 28.2 41.1 53.9 48.3 48.0 6.0 34.9 27.5 35.0 47.2 47.3 48.4 20.6 52.7 25.0 36.6 35.4

Person Layout Results: VOC2011 data

Competition "comp7" (train on VOC2011 data)

Average Precision (AP %)

Head Hand Foot

No entries were submitted for this competition.

Precision/Recall Curves

Person Layout Results: VOC2011 data

Competition "comp8" (train on own data)

Average Precision (AP %)

  Head Hand Foot
OXFORD_RANK_SLACK_RBF 72.9 26.9 4.1

Precision/Recall Curves

Action Classification Results: VOC2011 data

Competition "comp9" (train on VOC2011 data)

Average Precision (AP %)

jumping phoning playinginstrument reading ridingbike ridinghorse running takingphoto usingcomputer walking
CAENLEAR_DSAL 62.1 39.7 60.5 33.6 80.8 83.6 80.3 23.2 53.4 50.2
CAENLEAR_HOBJ_DSAL 71.6 50.7 77.5 37.8 86.5 89.5 83.8 25.1 58.9 59.2
MISSOURI_SSLMF 58.8 36.8 48.5 30.6 81.5 83.0 78.5 21.3 50.7 53.8
NUDT_CONTEXT 65.9 41.5 57.4 34.7 88.8 90.2 87.9 25.7 54.5 59.5
NUDT_LL_SEMANTIC 66.3 41.3 53.9 35.2 88.8 90.0 87.6 25.5 53.7 58.2
STANFORD_RF_DENSEFTR_SVM 66.0 41.0 60.0 41.5 90.0 92.1 86.6 28.8 62.0 65.9
WVU_SVM-PHOW 42.5 29.5 32.1 26.7 48.5 46.3 59.2 13.5 24.3 35.6

Precision/Recall Curves

Action Classification Results: VOC2011 data

Competition "comp10" (train on own data)

Average Precision (AP %)

jumping phoning playinginstrument reading ridingbike ridinghorse running takingphoto usingcomputer walking
BERKELEY_ACTION_POSELETS 59.5 31.3 45.6 27.8 84.4 88.3 77.6 31.0 47.4 57.6
STANFORD_COMBINE_ATTR_PART 66.7 41.1 60.8 42.2 90.5 92.2 86.2 28.8 63.5 64.2
STANFORD_MAPSVM_POSELET 27.0 29.3 28.3 23.8 71.9 82.4 67.3 20.1 26.0 46.4

Precision/Recall Curves

Key to Abbreviations

Each entry lists: abbreviation, title, method name, affiliation, contributors, and description.
BERKELEY_ACTION_POSELETS
Title: Poselets trained on action categories
Method: BERKELEY_ACTION_POSELETS
Affiliation: University of California, Berkeley
Contributors: Subhransu Maji, Lubomir Bourdev, Jitendra Malik
Description: This is based on our CVPR 2011 paper "Action recognition using a distributed representation of pose and appearance" (Subhransu Maji, Lubomir Bourdev and Jitendra Malik). For this submission we train 200 poselets for each action category. In addition we train poselets based on subcategory labels for playinginstrument and ridingbike. Linear SVMs are trained on the "poselet activation vector" along with features from object detectors for four categories: motorbike, bicycle, horse and tvmonitor. Context models re-rank the objects at the image level, as described in the CVPR'11 paper.

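As an illustration of the poselet activation vector mentioned above, a minimal sketch of building such a vector from per-instance poselet detections and training a linear SVM on it; the detection tuples, dimensions and data are hypothetical, not the submission's actual pipeline:

```python
import numpy as np
from sklearn.svm import LinearSVC

NUM_POSELETS = 200  # 200 poselets per action category, as described above

def activation_vector(detections, num_poselets=NUM_POSELETS):
    """Max-score poselet activation vector for one person instance.
    `detections` is a hypothetical list of (poselet_id, score) pairs."""
    v = np.zeros(num_poselets)
    for poselet_id, score in detections:
        v[poselet_id] = max(v[poselet_id], score)
    return v

# toy data: two training instances with made-up activations and labels
X = np.stack([activation_vector([(3, 0.9), (17, 0.4)]),
              activation_vector([(8, 0.7)])])
y = np.array([1, 0])  # 1 = action present
clf = LinearSVC().fit(X, y)
```
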
BERKELEY_REGION_CLASSIFY
Title: Classification of low-level regions
Method: Berkeley_Region_Classify
Affiliation: UC Berkeley
Contributors: Pablo Arbelaez, Bharath Hariharan, Saurabh Gupta, Chunhui Gu, Lubomir Bourdev and Jitendra Malik
Description: We propose a semantic segmentation approach that represents and classifies generic regions from low-level segmentation. We extract object candidates using ultrametric contour maps (Arbelaez et al., TPAMI 2011) at several image resolutions. We represent each region using mid- and high-level features that capture its appearance (color, shape, texture) and also its compatibility with the activations of a part detector (we use the poselets from Bourdev et al., ECCV 2010). A category label is assigned to each region using a hierarchy of IKSVM classifiers (Maji et al., CVPR 2008).

BONN_FGT_SEGM
Title: BONN_FGT_SEGM
Method: BONN_FGT_SEGM
Affiliation: ¹University of Bonn, ²Vienna University of Technology, ³Georgia Institute of Technology
Contributors: Joao Carreira¹, Adrian Ion², Fuxin Li³, Cristian Sminchisescu¹
Description: We present a joint image segmentation and labeling model which, given a bag of figure-ground segment hypotheses extracted at multiple image locations and scales using CPMC (Carreira and Sminchisescu, CVPR 2010), constructs a joint probability distribution over both the compatible image interpretations (tilings or image segmentations) composed from those segments, and over their labeling into categories. The process of drawing samples from the joint distribution can be interpreted as first sampling tilings, modeled as maximal cliques, from a graph connecting spatially non-overlapping segments in the bag (Ion, Carreira, Sminchisescu, ICCV 2011), followed by sampling labels for those segments, conditioned on the choice of a particular tiling. We learn the segmentation and labeling parameters jointly, based on maximum likelihood with a novel incremental saddle-point estimation procedure (Ion, Carreira, Sminchisescu, NIPS 2011).

BONN_SVR_SEGM
Title: SVR on CPMC-generated figure-ground segmentations
Method: BONN_SVRSEGM
Affiliation: University of Bonn
Contributors: Joao Carreira, Fuxin Li, Cristian Sminchisescu
Description: We present a recognition system based on sequential figure-ground ranking. We extract a bag of figure-ground segments using CPMC (Carreira and Sminchisescu, CVPR 2010). The bag is then filtered down to 100 segments using a class-independent ranker. Using these features, we learn one nonlinear Support Vector Regressor (SVR) for each category that predicts the overlap between each segment and an object from that category. A complete image interpretation is obtained by sequentially selecting segments using combination and non-maximum suppression schemes. Details can be found in (F. Li, J. Carreira, C. Sminchisescu, CVPR 2010; IJCV 2011). Additionally, the system is trained with both object segmentation layouts and weak annotations from bounding boxes.

BPACAD_COMB_LF_AK_WK_NOBOXES
Title: Combination of the late fusion, avgker and weker
Method: BPACAD_COMB_LF_AK_WK_NOBOXES
Affiliation: Data Mining and Web Search Research Group (DMWS), MTA SZTAKI, Hungary
Contributors: Bálint Daróczy, László Nikházy
Description: This is the average of the confidence outputs of a late fusion method, an aggregated kernel method and an averaged kernel method (BPACAD_CS_FISH256-1024_SVM_AVGKER). We computed RGB color moments and SIFT descriptors (Lowe 1999) on a dense grid and on regions detected by Harris-Laplacian (Mikolajczyk et al, 2005). All three methods are based on non-hierarchical Gaussian Mixture Models (GMM) with 256 Gaussians (two of them also using GMMs with 1024 Gaussians) and non-sparse Fisher vectors (Perronnin et al, 2007). We used our open-source GMM/Fisher toolkit for GPGPUs to compute the Fisher vectors and train the GMMs (http://datamining.sztaki.hu/?q=en/GPU-GMM). All three use Fisher-vector-based pre-computed kernels (basic kernels) for learning linear SVM classifiers (Daróczy et al, ImageCLEF 2011). The late fusion method is based on a combination of SVM predictions (18 SVM classifiers per class), while the aggregated and averaged kernels are computed before classification (only one SVM classifier per class). While the averaged kernel method needs no parameter tuning, for the late fusion and aggregated kernel methods we learned optimal weights per class on the validation set.

BPACAD_CS_FISH256_1024_SVM_AVGKER_NOBOXES
Title: SVM on averaged Fisher kernels
Method: BPACAD_CS_FISH256-1024_SVM_AVGKER_NOBOXES
Affiliation: Data Mining and Web Search Research Group (DMWS), MTA SZTAKI, Hungary
Contributors: Bálint Daróczy, László Nikházy
Description: We computed RGB color moments and SIFT descriptors (Lowe 1999) on a dense grid and on regions detected by Harris-Laplacian (Mikolajczyk et al, 2005). We trained non-hierarchical Gaussian Mixture Models (with 256 and 1024 Gaussians) on a subset (1 million) of the low-level features extracted from the training images. We extracted non-sparse Fisher vectors on nine different poolings with the 256-Gaussian GMM (dense grid, Harris-Laplacian, 3x1 and 2x2 spatial pyramids (Lazebnik et al, 2006)) and four with the 1024-Gaussian GMM (dense, 3x1). We used our open-source GMM/Fisher toolkit for GPGPUs to compute the Fisher vectors and train the GMMs (http://datamining.sztaki.hu/?q=en/GPU-GMM). We calculated pre-computed kernels (Daróczy et al, ImageCLEF 2011) and averaged them. We trained only one binary SVM classifier per class.

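Both BPACAD entries train SVMs on averaged precomputed kernels. A minimal sketch of that step, assuming the per-channel Gram matrices have already been computed from Fisher vectors (variable names are illustrative):

```python
import numpy as np
from sklearn.svm import SVC

def averaged_kernel_svm(kernels, y):
    """Average several basic n x n Gram matrices (one per feature
    channel / pooling) and train one binary SVM per class on the result."""
    K = np.mean(np.stack(kernels), axis=0)
    clf = SVC(kernel="precomputed")
    clf.fit(K, y)
    return clf

# at test time the classifier needs the (n_test x n_train) averaged kernel:
# scores = clf.decision_function(K_test_train)
```
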
BROOKES_STRUCT_DET_CRF
Title: Structured Detection and Segmentation CRF
Method: Struct_Det_CRF
Affiliation: Oxford Brookes University
Contributors: Jonathan Warrell, Vibhav Vineet, Paul Sturgess, Philip Torr
Description: We form a hierarchical CRF which jointly models a pool of candidate detections and the multiclass pixel segmentation of an image. Attractive and repulsive pairwise terms are allowed between detection nodes (cf. Desai et al, ICCV 2009), which are integrated into a Pn-Potts based hierarchical segmentation energy (cf. Ladicky et al, ECCV 2010). A cutting-plane algorithm is used to train the model, using approximate MAP inference. We form a joint loss which combines segmentation and detection components (i.e. paying a penalty both for each pixel incorrectly labelled and for each false detection node which is active in a solution), and use different weightings of this loss to train the model for detection and for segmentation. The segmentation results thus make use of the bounding box annotations. The candidate detections are generated using the Felzenszwalb et al. CVPR 2008/2010 detector, and as features for segmentation we use textons, SIFT, LBPs and the detection response surfaces themselves.

BUPT_ALL
Title: combining methods
Method: BUPT_MCPR_all
Affiliation: Beijing University of Posts and Telecommunications-MCPRL
Contributors: Zhicheng Zhao, Tao Liu, Xin Guo, Anni Cai
Description: A region-based method is used, in which all the features mentioned in the BUPT_NOPATCH entry are extracted on regions rather than keypoints. A region is a group of pixels with similar appearance, obtained with the mean-shift method. Finally, we combine the results of the two methods with a linear fusion algorithm, without patch features.

BUPT_NOPATCH
Title: nopatch method
Method: BUPT_MCPR_nopatch
Affiliation: Beijing University of Posts and Telecommunications-MCPRL
Contributors: Zhicheng Zhao, Tao Liu, Xin Guo, Anni Cai
Description: A bag-of-words method with SIFT, SURF and HOG features; both dense sampling and keypoint detection are used to obtain keypoints.

CAENLEAR_DSAL
Title: Discriminative spatial saliency
Method: DSAL
Affiliation: Univ Caen / INRIA LEAR
Contributors: Gaurav Sharma, Frederic Jurie, Cordelia Schmid
Description: We propose to learn discriminative saliency maps for images which highlight the regions that are most discriminative for the current classification task. We use the saliency maps to weight the visual words, improving the discriminative capacity of bag-of-words features. The approach is motivated by the observation that for many human actions and attributes, local regions are highly discriminative; e.g., for running, the bent arms and legs are highly discriminative. In addition, we combine features based on SIFT, HOG, color and texture.

CAENLEAR_HOBJ_DSAL
Title: Human-object interaction and discriminative saliency
Method: HOBJ+DSAL
Affiliation: Univ Caen / INRIA LEAR
Contributors: Gaurav Sharma, Alessandro Prest, Frederic Jurie, Vittorio Ferrari, Cordelia Schmid
Description: We use the weakly supervised approach of (Prest et al., PAMI 2010) for learning human actions modeled as interactions between humans and objects. The human bounding box is taken as reference, and the object relevant to the action and its spatial relation with the human are learnt automatically. This is combined with a method that learns discriminative spatial saliency, highlighting the regions that are most discriminative for the current classification task. We use the saliency maps to weight the visual words, improving the discriminative capacity of bag-of-words features. In addition, we combine features based on SIFT, HOG, color and texture.

CMIC_GS_DPM
Title: Synthetic training for deformable parts model
Method: CMIC-GS-DPM
Affiliation: Cairo Microsoft Innovation Center
Contributors: Dr. Motaz El-Saban, Osama Khalil, Mostafa Izz, Mohamed Fathi
Description: We introduce dataset augmentation using synthetic examples as a method for introducing novel variations not present in the original set. We make use of the deformable parts-based model (Felzenszwalb et al 2010). We augment the training set with examples obtained by applying global scaling to the dataset examples. Global scaling includes no scaling, up-scaling and down-scaling, with varying performance across different object classes. The technique is selected based on performance on the validation set. The augmented dataset is then used to train parts-based detectors using HOG features (Dalal & Triggs 2006) and latent SVM. The resulting class models are applied to test images in a "sliding-window" fashion.

CMIC_SYNTHDPM
Title: Synthetic training for deformable parts model
Method: CMIC-Synthetic-DPM
Affiliation: Cairo Microsoft Innovation Center
Contributors: Dr. Motaz El-Saban, Osama Khalil, Mostafa Izz, Mohamed Fathi
Description: We introduce dataset augmentation using synthetic examples as a method for introducing novel variations not present in the original set. We make use of the deformable parts-based model (Felzenszwalb et al 2010). We augment the training set with examples obtained by relocating objects (having segmentation masks) to new backgrounds. New backgrounds are selected using a set of techniques (no relocation, same image, "different" image, or image with co-occurring objects). The performance of these techniques varies across classes according to the object class properties. For every class, we select the technique that achieves the highest AP on the validation set. The augmented dataset is then used to train parts-based detectors using HOG features (Dalal & Triggs 2006) and latent SVM. The resulting class models are applied to test images in a "sliding-window" fashion.

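A minimal sketch of the kind of global-scaling augmentation the two CMIC entries describe; the scale factors and file handling are illustrative assumptions, not the submission's actual choices:

```python
from PIL import Image

def scaled_copies(path, factors=(0.75, 1.0, 1.25)):
    """Augment a training image with globally down/up-scaled copies
    (the scale factors here are illustrative)."""
    img = Image.open(path)
    w, h = img.size
    return [img.resize((int(w * f), int(h * f)), Image.BILINEAR)
            for f in factors]

# bounding-box annotations must be scaled by the same factor before the
# augmented set is fed to latent-SVM training of the parts-based model
```
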
CORNELL_ISVM_VIEWPOINT
Title: Using viewpoint cues to improve object recognition
Method: lSVM-Viewpoint
Affiliation: Cornell
Contributors: Joshua Schwartz, Noah Snavely, Daniel Huttenlocher
Description: Our system is based on the latent SVM framework of [1], including their context rescoring method. We train 6-component models with 8 parts. However, unlike [1], components are trained using a clustering based on an unsupervised estimation of 3D object viewpoint. In this sense our approach is similar to the unsupervised approach in [2], which also seeks to estimate viewpoint, but our clustering is based on explicit reasoning about 3D geometry. Additionally, we add features based on estimated 3D scene geometry for context rescoring. Of note is the fact that a detection with our method gives rise to an explicit estimate of object viewpoint within a scene, rather than just a bounding box. [1] P. Felzenszwalb, R. Girshick, D. McAllester, and D. Ramanan. Object Detection with Discriminatively Trained Part Based Models. PAMI 2010. [2] C. Gu and X. Ren. Discriminative Mixture-of-Templates for Viewpoint Classification. ECCV 2010.

JDL_K17_AVG_CLS
Title: SVM with average kernel using 17 kernels
Method: JDL_K17_AVG_CLS
Affiliation: JDL, Institute of Computing Technology, Chinese Academy of Sciences
Contributors: Shuhui Wang, Li Shen, Shuqiang Jiang, Qi Tian, Qingming Huang
Description: We calculate six types of commonly used BoW features (including dense and sparse SIFT, dense color SIFT, HOG, LBP and self-similarity) and three global features (color moment, GIST and block GIST), where the visual vocabulary size is typically around 1000. We calculate 3-level spatial pyramid features on these BoW representations respectively. Then 17 base kernels are calculated by applying histogram intersection, RBF and chi-square kernels to these features, with kernel parameters tuned on the validation data. We calculate an average kernel from these 17 base kernels. One-against-all SVM classifiers are used to train the final classifiers for each category.

LIRIS_CLS
Title: MKL classifier with multiple features
Method: LIRIS_CLS
Affiliation: LIRIS, Ecole Centrale de Lyon, CNRS, UMR5205, France
Contributors: Chao ZHU, Yuxing TANG, Ningning LIU, Charles-Edmond BICHOT, Emmanuel Dellandrea, Liming CHEN
Description: In this submission, we mainly make use of local descriptors and the popular bag-of-visual-words approach for classification. Regions of interest in each image are detected using both the Harris-Laplace detector and a dense sampling strategy. SIFT and color SIFT descriptors are then computed for each region as a baseline. In addition, we also extract DAISY and extended LBP descriptors based on our work [1][2] for computational efficiency and complementary information to SIFT. For each kind of descriptor, 1,000,000 randomly selected descriptors from the train + val set are quantized using the k-means algorithm into 4000 visual words. Each image is then represented by a histogram using hard assignment. The spatial pyramid technique is applied for coarse spatial information. Chi-square kernels for the different levels of the pyramid are computed and fused by linear combination. The final outputs are obtained by using a multiple kernel learning algorithm to fuse the different descriptors. [1] C. Zhu, C.E. Bichot, L. Chen: 'Visual object recognition using DAISY descriptor', in Proc. of IEEE International Conference on Multimedia and Expo (ICME), to appear, 2011. [2] C. Zhu, C.E. Bichot, L. Chen: 'Multi-scale color local binary patterns for visual object classes recognition', in Proc. of 20th International Conference on Pattern Recognition (ICPR), pp. 3065-3068, 2010.

LIRIS_CLSDET
Title: Classification combined with detection
Method: LIRIS_CLSDET
Affiliation: LIRIS, Ecole Centrale de Lyon, CNRS, UMR5205, France
Contributors: Chao ZHU, Yuxing TANG, Ningning LIU, Charles-Edmond BICHOT, Emmanuel Dellandrea, Liming CHEN
Description: In this submission, we improve classification performance by combining it with object detection results. For classification, we mainly make use of local descriptors and the popular bag-of-visual-words approach. Regions of interest in each image are detected using both the Harris-Laplace detector and a dense sampling strategy. SIFT and color SIFT descriptors are then computed for each region as a baseline. In addition, we also extract DAISY and extended LBP descriptors based on our work [1][2] for computational efficiency and complementary information to SIFT. For each kind of descriptor, 1,000,000 randomly selected descriptors from the train + val set are quantized using the k-means algorithm into 4000 visual words. Each image is then represented by a histogram using hard assignment. The spatial pyramid technique is applied for coarse spatial information. Chi-square kernels for the different levels of the pyramid are computed and fused by linear combination. The final outputs are obtained by using a multiple kernel learning algorithm to fuse the different descriptors. For object detection, we apply the HOG feature to train deformable part models, and use the models together with a sliding-window approach to detect objects. Finally, we combine the outputs of classification and detection by late fusion. [1] C. Zhu, C.E. Bichot, L. Chen: 'Visual object recognition using DAISY descriptor', in Proc. of IEEE International Conference on Multimedia and Expo (ICME), to appear, 2011. [2] C. Zhu, C.E. Bichot, L. Chen: 'Multi-scale color local binary patterns for visual object classes recognition', in Proc. of 20th International Conference on Pattern Recognition (ICPR), pp. 3065-3068, 2010.

LIRIS_CLSTEXT
Title: Classification with additional text feature
Method: LIRIS_CLSTEXT
Affiliation: LIRIS, Ecole Centrale de Lyon, CNRS, UMR5205, France
Contributors: Chao ZHU, Yuxing TANG, Ningning LIU, Charles-Edmond BICHOT, Emmanuel Dellandrea, Liming CHEN
Description: In this submission, we use additional text information to help with object classification. We propose novel text features [1] based on semantic distance using WordNet. The basic idea is to calculate the semantic distance between the text associated with an image and an emotional dictionary based on path similarity, denoting how similar two word senses are, based on the shortest path that connects the senses in a taxonomy. As there are no tags included in the PASCAL VOC 2011 dataset, we downloaded 1 million Flickr images (including their tags) as an additional textual source. First, for each PASCAL image, we find its most similar images (top 20) in the database using a KNN method based on visual features (LBP and color HSV histogram), and then use these tags to extract the text feature. We use an SVM with RBF kernel to train the classifier and predict the outputs. For classification based on visual features, we follow the same method described in our other submission. The outputs of the visual-feature-based and text-feature-based methods are then linearly combined as the final results. [1] N. Liu, Y. Zhang, E. Dellandréa, B. Tellez, L. Chen: 'Associating text features with visual ones to improve affective image classification', International Conference on Affective Computing (ACII), Memphis, USA, 2011.

MISSOURI_LCC_TREE_CODING
Title: SVM classifier with LCC and tree coding
Method: LCC-TREE-CODING
Affiliation: University of Missouri
Contributors: Xiaoyu Wang, Miao Sun, Xutao Lv, Shuai Tang, Guang Chen, Yan Li, Tony X. Han
Description: A two-layer cascade structure for object detection. The first layer employs a deformable model to select possible candidates for the second layer. The latter layer uses location and global context, augmented with LBP features, to improve accuracy. A bag-of-words model enhanced with spatial pyramids and local coordinate coding is used to model the global context information. A hierarchical tree-structure coding is used to handle the intra-class variation within each detection window. A linear SVM is used for classification.

MISSOURI_SSLMF
Title: Supervised Learning with Multiple Features
Method: Supervised learning with multiple feature
Affiliation: University of Missouri - Columbia
Contributors: Xutao Lv, Xiaoyu Wang, Guang Chen, Shuai Tang, Yan Li, Miao Sun, Tony X. Han
Description: Multiple available features are combined and fed into a newly developed supervised learning algorithm. The features include those extracted within the bounding box and those from the whole image; the whole-image features serve as context information. We mainly use two feature descriptors in our submission: dense SIFT and HOG. The LCC coding method and spatial pyramids are adopted to generate a histogram for each action image, and the histogram then serves as the feature vector for training and testing with the supervised learning algorithm.

MISSOURI_TREE_MAX_POOLING
Title: SVM classifier with tree max-pooling
Method: TREE--MAX-POOLING
Affiliation: University of Missouri
Contributors: Xiaoyu Wang, Miao Sun, Xutao Lv, Shuai Tang, Guang Chen, Yan Li, Tony X. Han
Description: A two-layer cascade structure for object detection. The first layer employs a deformable model to select possible candidates for the second layer. The latter layer uses location and global context, augmented with LBP features, to improve accuracy. A bag-of-words model enhanced with spatial pyramids and local coordinate coding is used to model the global context information. A hierarchical tree-structure coding is used to handle the intra-class variation within each detection window. Max-pooling is used for tree node assignment. A linear SVM is used for classification.

MSRAUSTC_HIGH_ORDER_SVM
Title: SVM with mined high order features
Method: MSRA_USTC_HIGH_ORDER_SVM
Affiliation: Microsoft Research Asia & University of Science and Technology of China
Contributors: Kuiyuan Yang, Lei Zhang, Hong-Jiang Zhang
Description: We introduce a discriminatively-trained parts-based model with different level templates for image classification. The model consists of templates of HOG features (Dalal and Triggs, 2006) at three different levels. The responses of different level templates are combined by a latent-SVM, where the latent variables are the positions of the templates. We develop a novel mining algorithm to define the parts and an iterative training procedure to learn the parts. The model is applied to all 20 PASCAL VOC objects.

MSRAUSTC_PATCH
Title: SVM with multi-channel cell-structured patch features
Method: MSRA_USTC_PATCH
Affiliation: Microsoft Research Asia & University of Science and Technology of China
Contributors: Kuiyuan Yang, Lei Zhang, Hong-Jiang Zhang
Description: We introduce a discriminatively-trained patch-based model with cell-structured templates for image classification. Densely sampled patches are represented by cell-structured templates of HOG, LBP, HSV, SIFT, CSIFT and SSIM. These templates are then fed to super-vector coding (Xi Zhou, 2010) and the Fisher kernel (Florent Perronnin, 2010) to form the image feature. A linear SVM is then trained for each category in a one-vs-rest manner. The object detector from PASCAL VOC 2007 is used to extract object-level features; classifiers are trained on these features and then fused with the former.

NANJING_DMC_HIK_SVM_SIFT
Title: HIK based SVM classifier with dense SIFT features
Method: DMC-HIK-SVM-SIFT
Affiliation: The University of Nanjing
Contributors: Yubin Yang, Ye Tang, Lingyan Pan
Description: We adopt a bag-of-visual-words method (cf. Csurka et al 2004). A single descriptor type is used: SIFT descriptors (Lowe 2004) are extracted from 16x16-pixel patches densely sampled from each image on a grid with a step size of 12 pixels. We partition the original training and validation data into categories according to their labels, then randomly select 200 images per category (2000 images in total) as the training set. We use a novel difference-maximize coding approach to quantize these descriptors into 200 "visual words". Each image is then represented by a histogram of visual words. Spatial pyramid matching (Lazebnik et al, CVPR 2006) is also used in our method. Finally, we train a HIK-kernel SVM classifier (Jianxin Wu et al, ICCV 2009) on the concatenated pyramid feature vector of each image in the training set.

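A minimal sketch of a histogram intersection kernel (HIK) SVM of the kind this entry trains, here via a custom kernel callback; the data is random placeholder input, not the entry's features:

```python
import numpy as np
from sklearn.svm import SVC

def hik(X, Y):
    """Histogram intersection kernel between rows of X and rows of Y."""
    return np.array([[np.minimum(x, y).sum() for y in Y] for x in X])

# X_train would hold the concatenated spatial-pyramid BoW histograms
# described above (values here are illustrative placeholders)
X_train = np.random.rand(10, 200)
y_train = np.random.randint(0, 2, 10)
clf = SVC(kernel=hik).fit(X_train, y_train)
```
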
NLPR_DD_DC
Title: NLPR-Detection
Method: Data Decomposition and Distinctive Context
Affiliation: Institute of Automation, Chinese Academy of Sciences
Contributors: Junge Zhang, Yinan Yu, Yongzhen Huang, Chong Wang, Weiqiang Ren, Jinchen Wu, Kaiqi Huang and Tieniu Tan
Description: The part-based model has achieved great success in recent years. In our understanding, the original deformable part-based model has several limitations: 1) its computational complexity is very large, especially when it is extended to enhanced models via multiple features, more mixtures or flexible part models; 2) the original part-based model is not "deformable" enough. To tackle these problems: 1) We propose a data decomposition based feature representation scheme for the part-based model, learned in an unsupervised manner. The submitted method takes about 1-2 seconds per image from the PASCAL VOC datasets on average while keeping high performance. We learn the basis from samples without any label information; the label-independent rule followed in the submitted method can be adapted to other variants of the part-based model, such as hierarchical models or flexible mixture models. 2) We found that each part corresponds to multiple possible locations, which is not reflected in the original part-based model. Accordingly, we propose that the locations of parts should obey a mixture of Gaussian distributions. Thus, for each part we learn its optimal locations by clustering, and these are used to update the original anchors of the part-based model. The proposed method can more effectively describe the deformation (pose and location variation) of objects' parts. 3) We rescore the initial results with our distinctive context model, which includes global, local and intra-class context information. Besides, segmentation provides a strong indication of an object's presence; therefore the proposed segmentation-aware semantic attribute is applied in the final reasoning, which indeed shows promising performance.

NLPR_KF_SVM
Title: SVM classifier with five kernels
Method: NLPR_KF_SVM
Affiliation: National Laboratory of Pattern Recognition, Institute of Automation, Chinese Academy of Sciences
Contributors: Ruiguang Hu, Weiming Hu
Description: Features: GrayPHOG, HuePHOG, PLBP (R=1 and R=2), GraySIFT_HL and HueSIFT_HL (the latter two with LLC coding and max pooling). Codebooks: k-means clustering. Normalization: L1 for GrayPHOG, HuePHOG and PLBP; L2 for GraySIFT_HL and HueSIFT_HL. Kernels: chi-squared for GrayPHOG, HuePHOG and PLBP; linear for GraySIFT_HL and HueSIFT_HL. Kernel fusion: averaging. Training features are extracted from sub-images cropped according to the annotation bounding boxes; test features are extracted from whole test images.

NLPR_SS_VW_PLS
Title: NLPR_CLS
Method: Semi-Semantic Visual Words & Partial Least Squares
Affiliation: National Laboratory of Pattern Recognition, Institute of Automation, Chinese Academy of Sciences
Contributors: Yinan Yu, Junge Zhang, Yongzhen Huang, Weiqiang Ren, Chong Wang, Jinchen Wu, Kaiqi Huang, Tieniu Tan
Description: The framework is based on the classical bag-of-words model. The system consists of: 1) at the feature level, semi-semantic and non-semantic visual-word learning, with fast feature extraction (salient coding, super-vector coding and visual co-occurrence) and multiple features; 2) to learn the class model, an alternative multiple-linear-kernel learning for intra-class feature combination, after using Partial Least Squares analysis to project the extremely high-dimensional features into a low-dimensional space; 3) the combination of the 20 category scores and the detection scores generates a high-level semantic representation of the image, and we use non-linear-kernel learning to extract inter-class contextual information, which further improves performance. All parameters are decided by cross-validation and prior knowledge on the VOC2007 and VOC2010 trainval sets. The motivation and novelty of our algorithm: the traditional codebook describes the distribution of the feature space and contains little semantic information about the objects of interest, so a semantic codebook may benefit performance. We observe that the deformable part-based model [Felzenszwalb, TPAMI 2010] describes an object by "object parts", which can be seen as semi-semantic visual words. Based on this idea, we propose a bag-of-words model over both semi-semantic and non-semantic visual words for image classification. Analyzing recent image classification algorithms, we find that feature "distribution", "reconstruction" and "saliency" are three fundamental issues in coding and image description. However, these methods usually lead to an extremely high-dimensional description, especially with multiple features. In order to learn these features by MKL, we find Partial Least Squares to be a reliable method for dimensionality reduction: its compression ratio here is over 10000, while discrimination is preserved.

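A minimal sketch of the Partial Least Squares projection step this entry uses for dimensionality reduction, with scikit-learn's PLSRegression standing in for whatever solver the authors used; all dimensions and data are illustrative:

```python
import numpy as np
from sklearn.cross_decomposition import PLSRegression

# Hypothetical very high-dimensional image descriptors; PLS projects them
# to a low-dimensional space while preserving discrimination, as the
# entry above describes.
X = np.random.rand(100, 5000)          # 100 images, 5000-D features
y = np.random.randint(0, 2, 100)       # one-vs-rest labels for one class
pls = PLSRegression(n_components=50).fit(X, y.reshape(-1, 1))
X_low = pls.transform(X)               # 100 x 50 projection, fed to MKL/SVM
```
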
NLPR_SVM_BOWDET
Title: SVM with multiple features and detection results
Method: NLPR_IVA_SVM_BOWDect
Affiliation: NLPR, CASIA
Contributors: Jing Liu, Jianlong Fu, Bingyuan Liu, Hanqing Lu
Description: Two types of features are considered. First, typical BoW features (OpponentSIFT, C-SIFT and rgSIFT) with dense and Harris sampling respectively; spatial pyramid kernels on these features are calculated for classification. Second, object detection based on the deformable part model (P. Felzenszwalb et al., PAMI 2009) is employed. Combining these features, we attempt to learn a hierarchical classifier with an SVM according to the hierarchical structure of all the 20-class data.

NLPR_SVM_BOWDET_CONV
Title: SVM with multiple features and detection results
Method: NLPR_IVA_SVM_BOWDect_Convolution
Affiliation: NLPR, CASIA
Contributors: Jing Liu, Jianlong Fu, Bingyuan Liu, Hanqing Lu
Description: Three types of features are considered. First, typical BoW features (OpponentSIFT, C-SIFT and rgSIFT) with dense and Harris sampling respectively; spatial pyramid kernels are calculated. Second, an improved image representation via convolutional sparse coding and a max-pooling operation is employed, motivated by M. Zeiler's work in ICCV 2011. Third, object detection based on the deformable part model. Combining these multiple features, we attempt to learn a hierarchical classifier with an SVM according to the hierarchical structure of all the 20-class data.

NUDT_CONTEXT
Title: SVM classifier with contextual information
Method: NUDT_Context
Affiliation: National University of Defense Technology
Contributors: Li Zhou, Zongtan Zhou, Dewen Hu
Description: Action classification using contextual information. We present a new model for action classification context based on the distribution of objects and the semantic category of the scene within images. Scene classification works by creating multiple-resolution images and partitioning them into sub-regions at different scales. The visual descriptors of all sub-regions in the same resolution image are directly concatenated for SVM classifiers. Finally, regarding each resolution image as a feature channel, we combine all the feature channels to reach a final decision. Object recognition works by incorporating a multi-resolution representation into the bag-of-features model.

NUDT_LL_SEMANTIC
Title: SVM classifier with low-level and semantic modeling
Method: NUDT_Low-level_Semantic
Affiliation: National University of Defense Technology
Contributors: Li Zhou, Dewen Hu, Zongtan Zhou
Description: Action classification based on combining low-level and semantic modeling strategies.

NUSPSL_CTX_GPM
Title: Classification using context SVM and GPM
Method: NUSPSL_CTX_GPM
Affiliation: National University of Singapore; Panasonic Singapore Laboratories
Contributors: NUS: Chen Qiang, Song Zheng, Yan Shuicheng; PSL: Hua Yang, Huang Zhongyang, Shen Shengmei
Description: The whole solution for object classification is based on the BoW framework. At the image level, dense SIFT, HOG^2, LBP and color moment features are extracted. VQ and Fisher vectors are used for feature coding. Traditional SPM and a novel spatial-free pyramid matching scheme are then applied to generate image representations. Context-aware features are also extracted based on the detection results [1]. The classification models are learnt via kernel SVM. The final classification scores are refined with kernel mapping [2]. The key novelty of the new solution is the pooling strategy (which handles spatial mismatching as well as noisy features well); considerable improvement has been achieved in other offline experiments. [1] Zheng Song, Qiang Chen, Zhongyang Huang, Yang Hua, and Shuicheng Yan. Contextualizing Object Detection and Classification, CVPR 2011. [2] http://pascallin.ecs.soton.ac.uk/challenges/VOC/voc2010/workshop/nuspsl.pdf

NUSPSL_CTX_GPM_SVM
Title: Classification using context SVM and GPM
Method: NUSPSL_CTX_GPM_SVM
Affiliation: National University of Singapore; Panasonic Singapore Laboratories
Contributors: NUS: Chen Qiang, Song Zheng, Yan Shuicheng; PSL: Hua Yang, Huang Zhongyang, Shen Shengmei
Description: The whole solution for object classification is based on the BoW framework [1]. At the image level, dense SIFT, HOG^2, LBP and color moment features are extracted. VQ and Fisher vectors are used for feature coding. Traditional SPM and a novel spatial-free pyramid matching scheme are then applied to generate image representations. Context-aware features are also extracted based on the detection results [2]. The classification models are learnt via kernel SVM. The key novelty of the new solution is the pooling strategy (which handles spatial mismatching as well as noisy features well); considerable improvement has been achieved in other offline experiments. [1] http://pascallin.ecs.soton.ac.uk/challenges/VOC/voc2010/workshop/nuspsl.pdf [2] Zheng Song, Qiang Chen, Zhongyang Huang, Yang Hua, and Shuicheng Yan. Contextualizing Object Detection and Classification, CVPR 2011.

NUS_CONTEXT_SVM
Title: Context-SVM based submission for 3 tasks
Method: NUS_Context_SVM
Affiliation: National University of Singapore
Contributors: Zheng Song, Qiang Chen, Shuicheng Yan
Description: Classification uses the BoW framework. Dense SIFT, HOG^2, LBP and color moment features are extracted. We then use VQ and Fisher vectors for feature coding, and SPM and Generalized Pyramid Matching (GPM) to generate image representations. Context-aware features are also extracted based on [1]. The classification models are learnt via kernel SVM, and the final classification scores are refined with kernel mapping [2]. Detection and segmentation results use the baseline of [3] with HOG and LBP features; based on [1], we further learn a context model and refine the detection results. The final segmentation result substitutes learnt average masks for each detection component (learnt on the segmentation training set) for the rectangular detection boxes. [1] Zheng Song*, Qiang Chen*, Zhongyang Huang, Yang Hua, and Shuicheng Yan. Contextualizing Object Detection and Classification. [2] http://pascallin.ecs.soton.ac.uk/challenges/VOC/voc2010/workshop/nuspsl.pdf [3] http://people.cs.uchicago.edu/~pff/latent/

NUS_SEG_DET_MASK_CLS_CRF
Title: Segmentation Using CRF with Detection Mask
Method: NUS_SEG_DET_MASK_CLS_CRF
Affiliation: National University of Singapore
Contributors: Wei XIA, Zheng SONG, Qiang CHEN, Shuicheng YAN, Loong Fah CHEONG
Description: The solution is based on a CRF model, and the key contribution is the use of various types of binary regularization terms. Object detection also plays a very significant role in guiding semantic object segmentation. In this solution, the CRF model is built to integrate the global classification score with local unary and binary information to perform semantic segmentation. Moreover, detection masks, obtained by hard-thresholding the detection confidence maps, are applied as extra unary and smoothness terms in the CRF model. Some of the masks with high confidence are also used in the post-processing stage to refine the mask boundaries.

NYUUCLA_HIERARCHY
Title: Latent Hierarchical Learning
Method: NYU-UCLA_Hierarchy
Affiliation: NYU and UCLA
Contributors: Yuanhao Chen, Li Wan, Long Zhu, Rob Fergus, Alan Yuille
Description: Based on two recent publications: "Latent Hierarchical Structural Learning for Object Detection", Long Zhu, Yuanhao Chen, Alan Yuille, William Freeman, CVPR 2010; and "Active Mask Hierarchies for Object Detection", Yuanhao Chen, Long Zhu, Alan Yuille, ECCV 2010. We present a latent hierarchical structural learning method for object detection. An object is represented by a mixture of hierarchical tree models where the nodes represent object parts. The nodes can move spatially to allow both local and global shape deformations. The image features are histograms of words (HOWs) and oriented gradients (HOGs), which enable rich appearance representation of both structured (e.g., cat face) and textured (e.g., cat body) image regions. Learning the hierarchical model is a latent SVM problem which can be solved by the incremental concave-convex procedure (iCCCP). Object detection is performed by scanning sub-windows using dynamic programming. The detections are rescored by a context model which encodes the correlations of the 20 object classes using both object detection and image classification.

OXFORD_DPM_MK
Title: DPM with basic rescoring
Method: DPM-MK
Affiliation: Oxford VGG
Contributors: Andrea Vedaldi and Andrew Zisserman
Description: This method uses a Deformable Part Model (our own implementation) to generate an initial (and very good) list of 100 candidate bounding boxes per image. These are then rescored by a multiple-features model combining DPM scores with dense SP-BOW, geometry, and context. The SP-BOW model uses dense SIFT features (vl_phow in VLFeat) quantized into 1200 visual words, a 6x6 spatial layout, and cell-by-cell l2 normalization after raising the entries to the 1/4 power (1/4-homogeneous Hellinger kernel). The geometric model is a second-order polynomial kernel on the bounding box coordinates. The context model is a second-order polynomial kernel mixing the candidate DPM score with twenty scores obtained as the maximum response of the DPMs for the 20 classes in that image (like Felzenszwalb). A second context model is also added, using 20 scores from a state-of-the-art Fisher-kernel image classifier (also on dense SIFT features), as described in Chatfield et al. 2010. The SVM scores are passed through a sigmoid for standardization in the 0-1 interval; the sigmoid model is fitted to the training data. The model is trained by means of a large-scale linear SVM using the one-slack bundle formulation (aka SVM^perf). The solver hence uses retraining implicitly, and we make sure it reaches full convergence.

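A minimal sketch of the sigmoid score calibration step mentioned above (Platt-style scaling fitted by maximum likelihood); the scores and labels are toy values, and this is not the authors' exact fitting procedure:

```python
import numpy as np
from scipy.optimize import minimize

def fit_sigmoid(scores, labels):
    """Platt-style calibration: fit p(y=1|s) = 1 / (1 + exp(a*s + b))
    to SVM scores by minimizing the negative log-likelihood."""
    def nll(params):
        a, b = params
        p = 1.0 / (1.0 + np.exp(a * scores + b))
        p = np.clip(p, 1e-12, 1 - 1e-12)
        return -np.sum(labels * np.log(p) + (1 - labels) * np.log(1 - p))
    a, b = minimize(nll, x0=[-1.0, 0.0]).x
    return lambda s: 1.0 / (1.0 + np.exp(a * s + b))

scores = np.array([-2.0, -0.5, 0.3, 1.5])
labels = np.array([0, 0, 1, 1])
to_prob = fit_sigmoid(scores, labels)
print(to_prob(scores))  # calibrated scores in the 0-1 interval
```
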
OXFORD_RANK_SLACK_RBF
Title: Structured ranking for Layout Detection
Method: SVM-rank-slack-RBF
Affiliation: University of Oxford
Contributors: Arpit Mittal, Matthew Blaschko, Andrew Zisserman, Manuel J Marin, Phil Torr
Description: We make use of an SVM structured ranking algorithm to combine and rank the outputs of different part detectors. Individual parts are detected using separate detectors; the outputs are then adapted to the local image using positional and scale cues. The different part detections are finally combined using a ranking function to give a single confidence value for the human layout detection. The ranking is performed such that detections having more true-positive parts (i.e., higher precision) are returned earlier. For detection of the human head, we use the parts-based model of Felzenszwalb et al. (PAMI 2010); the hand is localized using the hand detector developed by Mittal et al. (BMVC 2011). The feet are detected using the foot part of Felzenszwalb et al.'s human detector, and are also returned as the bounding box around super-pixels resembling a human foot in the lower part of the human ROI. We use the slack-rescaled variant of the SVM structured ranking algorithm and an RBF kernel map.

SJT_SIFT_LLC_PCAPOOL_DET_SVM
Title: SVM using LLC features with detection results
Method: SIFT-LLC-PCAPOOL-DET-SVM
Affiliation: Shanghai Jiao Tong University
Contributors: Jun Zhu, Xiaokang Yang, Yukun Zhu, Rui Zhang, Xiaolin Chen
Description: We adopt a framework based on locality-constrained linear coding (LLC) (J. Wang et al, CVPR 2010), fused with the detection results given by discriminatively trained deformable part-based object detectors (P. Felzenszwalb et al, CVPR 2008). First, SIFT descriptors (Lowe, 2004) are extracted from image patches densely sampled every 8 pixels at three different scales (16x16, 24x24 and 32x32). Then, a codebook with 1024 bases is constructed by k-means clustering on 100,000 randomly selected descriptors from the training set. Each 128-dimensional SIFT descriptor is encoded by approximated LLC (the number of neighbors is set to 5, with the shift-invariant constraint), giving a 1024-dimensional code vector. We use max pooling on the patch-level codes from hundreds of overlapping regions at various spatial scales and positions, followed by dimension reduction using PCA. After that, the pooled features are concatenated into a vector with l2 normalization to form the image-level representation. In addition, the results of object detectors are also considered. We use Felzenszwalb's deformable part-based models to detect the bounding boxes for each object class. The detection scores are max-pooled in each cell of a spatial pyramid (i.e., 1x1+2x2+3x1) to construct an image-level representation with l2 normalization. We obtain the final image-level representation by weighted concatenation of the two feature vectors from the LLC codes and the object detectors. Then, a linear SVM classifier is trained to perform classification. The regularization parameters as well as the fusion weight are tuned for each class using the training and validation sets. We use the 'liblinear' software package, released by the machine learning group at National Taiwan University, for the SVM implementation.

SJT_SIFT_LLC_PCAPOOL_SVM
Title: Linear SVM using LLC features with PCA pooling
Method: SIFT-LLC-PCAPOOL-SVM
Affiliation: Shanghai Jiao Tong University
Contributors: Jun Zhu, Xiaokang Yang, Yukun Zhu, Rui Zhang, Xiaoling Chen
Description: We adopt a framework based on locality-constrained linear coding (LLC) (J. Wang et al, CVPR 2010). First, SIFT descriptors (Lowe, 2004) are extracted from image patches densely sampled every 8 pixels at three different scales (16x16, 24x24 and 32x32). Then, a codebook with 1024 bases is constructed by k-means clustering on 100,000 randomly selected descriptors from the training set. After that, each 128-dimensional SIFT descriptor is encoded by approximated LLC (the number of neighbors is set to 5, with the shift-invariant constraint), giving a 1024-dimensional code vector. We use max pooling on the patch-level codes from hundreds of overlapping regions at various spatial scales and positions, followed by dimension reduction using PCA. After that, the pooled features are concatenated into a vector with l2 normalization to form the image-level representation. Finally, we train linear SVM classifiers on this feature representation to perform classification. The regularization parameters are tuned for each class using the training and validation sets. We use the 'liblinear' software package, released by the machine learning group at National Taiwan University, for the SVM implementation.

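A minimal sketch of the approximated LLC encoding both SJT entries build on (cf. Wang et al., CVPR 2010): each descriptor is reconstructed from its k nearest codebook bases under a sum-to-one constraint; the codebook here is random placeholder data:

```python
import numpy as np

def llc_code(x, codebook, k=5, beta=1e-4):
    """Approximated locality-constrained linear coding: reconstruct
    descriptor x from its k nearest codebook bases under the
    shift-invariant constraint sum(codes) = 1."""
    d = np.sum((codebook - x) ** 2, axis=1)
    idx = np.argsort(d)[:k]                  # k nearest bases
    z = codebook[idx] - x                    # shift bases to the origin
    C = z @ z.T + beta * np.eye(k)           # regularized local covariance
    w = np.linalg.solve(C, np.ones(k))
    w /= w.sum()                             # enforce sum-to-one constraint
    code = np.zeros(len(codebook))
    code[idx] = w
    return code                              # sparse 1024-D code vector

codebook = np.random.rand(1024, 128)         # 1024 bases from k-means
print(llc_code(np.random.rand(128), codebook).shape)  # (1024,)
```
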
STANFORD_COMBINE_ATTR_PART
Title: Combine attribute classifiers and object detectors
Method: COMBINE_ATTR_PART
Affiliation: Stanford University
Contributors: Bangpeng Yao, Aditya Khosla, Li Fei-Fei
Description: Our approach combines attribute classifiers and part detectors for action classification. The method is adapted from our ICCV 2011 paper (Yao et al, 2011). The "attributes" are trained by using our random forest classifier (Yao et al, 2011), which are strong classifiers that consider global properties of action classes. As for "parts", we consider the objects that interact with the humans, such as horses, books, etc. Specifically, we take the object bank (Li et al, 2010) detectors that are trained on the ImageNet dataset for part representation. The confidence scores obtained from attribute classifiers and part detectors are combined to form the final score for each image.

STANFORD_MAPSVM_POSELET
Title: MAP-based SVM classifier with poselet features
Method: MAPSVM-Poselet
Affiliation: Stanford University
Contributors: Tim Tang, Pawan Kumar, Ben Packer, Daphne Koller
Description: We build on the poselet-based feature vector for action classification (Maji et al., 2010) in four ways: (i) we use a 2-level spatial pyramid (Lazebnik et al., CVPR 2006); (ii) we obtain a segmentation of the person bounding box into foreground and background using an efficient GrabCut-like scheme (Rother et al., SIGGRAPH 2004), and use it to divide the feature vector into two parts, one corresponding to the foreground and one to the background; (iii) we learn a mixture model to deal with the different visual aspects of people performing the same action; and (iv) we optimize mean average precision (Yue et al., SIGIR 2007) instead of the 0/1 loss used in the standard binary SVM. All action classifiers are trained on only the VOC 2011 data, with the additional annotations required to compute the poselets. All hyperparameters are set using 5-fold cross-validation.

STANFORD_RF_DENSEFTR_SVM
Title: Random forest with SVM node classifiers
Method: RF_DENSEFTR_SVM
Affiliation: Stanford University
Contributors: Bangpeng Yao, Aditya Khosla, Li Fei-Fei
Description: We use a random forest (RF) approach for action classification. Our method is adapted from our CVPR 2011 paper (Yao et al, 2011). We explore two key properties that determine the performance of RF classifiers: discrimination and randomization. (1) Discrimination: in order to obtain strong decision trees, instead of randomly generating feature weights as in conventional RF approaches, we use discriminative SVM classifiers to train the split for each tree node. (2) Randomization: the correlation between different decision trees needs to be small, such that the combination of all the trees can form an effective RF classifier. We consider a very dense feature space, where we sample image regions that can have any size and location in the image. For each sampled region, we use an SPM feature representation. Since each decision tree samples a specific set of image regions, the correlation between the trees can be reduced.

UOCTTI_LSVM_MDPM
Title: LSVM trained mixtures of deformable part models
Method: UOCTTI_LSVM_MDPM
Affiliation: University of Chicago
Contributors: Ross Girshick (University of Chicago), Pedro Felzenszwalb (Brown), David McAllester (TTI-Chicago)
Description: Based on [1] http://people.cs.uchicago.edu/~pff/latent-release4 and [2] "Object Detection with Discriminatively Trained Part Based Models"; Pedro F. Felzenszwalb, Ross B. Girshick, David McAllester and Deva Ramanan; IEEE Transactions on Pattern Analysis and Machine Intelligence, Vol. 32, No. 9, September 2010. This entry is a minor modification of our publicly available "voc-release4" object detection system [1]. The system uses latent SVM to train mixtures of deformable part models using HOG features [2]. Final detections are refined using a context rescoring mechanism [2]. We extended [1] to detect smaller objects by adding an extra high-resolution octave to the HOG feature pyramid. The HOG features in this extra octave are computed using 2x2 pixel cells. Additional bias parameters are learned to help calibrate scores from detections in the extra octave with the scores of detections above this octave. This entry is the same as UOCTTI_LSVM_MDPM from the 2010 competition. Detection results are reported for all 20 object classes to provide a baseline for the 2011 competition.

UOCTTI_WL-SSVM_GRAMMAR
Title: Person grammar model trained with WL-SSVM
Method: UOCTTI_WL-SSVM_GRAMMAR
Affiliation: University of Chicago
Contributors: Ross Girshick (University of Chicago), Pedro Felzenszwalb (Brown), David McAllester (TTI-Chicago)
Description: This entry is described in [1] "Object Detection with Grammar Models"; Ross B. Girshick, Pedro F. Felzenszwalb, David McAllester; Neural Information Processing Systems 2011 (to appear). We define a grammar model for detecting people and train the model's parameters from bounding box annotations using a formalism that we call weak-label structural SVM (WL-SSVM). The person grammar uses a set of productions that represent varying degrees of visibility/occlusion. Object parts, such as the head and shoulder, are shared across all interpretations of object visibility. Each part is represented by a deformable mixture model that includes deformable subparts. An "occluder" part (itself a deformable mixture of parts) is used to capture the nontrivial appearance of the stuff that typically occludes people from below. We further refine detections using the context rescoring mechanism from the UOCTTI_LSVM_MDPM entry, using the results of that entry for the 19 non-person classes.

UVA_MOSTTELLING
Title: Most Telling Window
Method: UvA_UNITN_MostTellingMonkey
Affiliation: University of Amsterdam, University of Trento
Contributors: Jasper Uijlings, Koen van de Sande, Arnold Smeulders, Theo Gevers, Nicu Sebe, Cees Snoek
Description: The main component of this classification entry is the "Most Telling Window" method, which uses Segmentation as Selective Search [1] combined with bag-of-words. The "Most Telling Window" method is also used in our detection entry. However, instead of focusing on finding complete objects, training is adjusted so that the most discriminative part of an object can be used for its identification instead of the whole object. The Most Telling Window method is currently under review. While the "Most Telling Window" method yields the greatest contribution, we improve accuracy further by combining it with a normal bag-of-words framework based on SIFT and ColourSIFT, and with the detection scores of the part-based model of Felzenszwalb et al. [1] "Segmentation as Selective Search for Object Recognition"; Koen E. A. van de Sande, Jasper R. R. Uijlings, Theo Gevers, Arnold W. M. Smeulders; 13th International Conference on Computer Vision, 2011.

UVA_SELSEARCH
Title: Selective Search Detection System
Method: SelectiveSearchMonkey
Affiliation: University of Amsterdam and University of Trento
Contributors: Jasper R. R. Uijlings, Koen E. A. van de Sande, Arnold W. M. Smeulders, Theo Gevers, Nicu Sebe, Cees Snoek
Description: Based on "Segmentation as Selective Search for Object Recognition"; Koen E. A. van de Sande, Jasper R. R. Uijlings, Theo Gevers, Arnold W. M. Smeulders; 13th International Conference on Computer Vision, 2011. Instead of the exhaustive search that dominated the PASCAL VOC 2010 detection challenge, we use segmentation as a sampling strategy for selective search (cf. our ICCV paper). Like segmentation, we use the image structure to guide our sampling process. However, unlike segmentation, we propose to generate many approximate locations over few and precise object delineations, as the goal is to cover all object locations. Our sampling is diversified to deal with as many image conditions as possible. Specifically, we use a variety of hierarchical region grouping strategies, varying the colour spaces and grouping criteria. This results in a small set of data-driven, class-independent, high-quality object locations (covering 96-99% of all objects in the VOC2007 test set). Because we have only a limited number of locations to evaluate, this enables the use of the more computationally expensive bag-of-words framework for classification. Our bag-of-words implementation uses densely sampled SIFT and ColorSIFT descriptors.

WVU_SVM-PHOW
Title: SVM classifier with PHOW features
Method: SVM-PHOW
Affiliation: West Virginia University
Contributors: Biyun Lai, Yu Zhu, Qin Wu, Guodong Guo
Description: We develop a method for still-image based action recognition. There are 10 action classes plus the "other" action class provided by PASCAL VOC 2011. We extract PHOW features to represent the images, a kind of multi-scale dense SIFT implementation. The kernel SVM method is used for training action classifiers, with several different kernels. We also use a learning technique to map the original features into a different space to improve the feature representation. A confidence measure is used to combine the results from the different kernels to form the final decision for action classification. Training is performed on the provided training set and tuned using the validation set; the learned classifiers are then applied to the test data.