VOC2011 RESULTS

Key to Abbreviations

Classification Results: VOC2011 data

Competition "comp1" (train on VOC2011 data)

Average Precision (AP %)

aeroplane bicycle bird boat bottle bus car cat chair cow diningtable dog horse motorbike person pottedplant sheep sofa train tvmonitor
BPACAD_COMB_LF_AK_WK_NOBOXES 86.5 58.3 59.7 67.4 33.2 74.2 64.0 65.5 58.5 44.8 53.5 57.0 60.7 70.8 84.6 39.4 55.4 50.5 80.7 63.1
BPACAD_CS_FISH256_1024_SVM_AVGKER_NOBOXES 85.0 57.0 57.7 65.9 30.7 75.0 62.4 64.4 56.9 42.2 50.9 55.3 59.1 69.1 84.2 39.3 52.3 46.7 78.9 61.8
BUPT_ALL 61.5 11.9 12.4 29.7 8.7 30.6 18.4 23.6 21.6 5.8 14.8 18.5 7.1 12.3 47.7 7.2 15.0 9.8 18.8 19.2
BUPT_NOPATCH 65.1 23.8 17.3 36.0 12.6 40.5 31.1 35.4 27.2 10.4 20.8 31.3 13.6 29.5 54.9 10.7 19.1 19.2 42.1 30.8
JDL_K17_AVG_CLS 84.2 52.0 54.5 63.2 25.3 71.2 58.0 61.1 50.2 33.3 44.3 49.7 57.9 65.1 79.9 20.9 47.4 43.0 77.7 56.7
LIRIS_CLS 88.3 56.2 59.3 68.6 33.2 76.6 62.2 64.5 55.3 42.6 55.1 56.2 61.9 70.0 82.5 37.3 56.4 48.3 79.6 64.7
LIRIS_CLSDET 90.0 66.2 63.3 70.9 47.0 80.9 73.9 63.9 61.1 52.7 57.9 56.9 69.6 73.8 88.4 46.3 65.3 54.2 81.3 72.7
MSRAUSTC_HIGH_ORDER_SVM 92.8 74.8 69.6 76.1 47.3 83.5 76.4 76.9 59.8 54.5 63.5 67.0 75.1 78.8 90.4 43.1 63.1 60.4 85.6 71.1
MSRAUSTC_PATCH 92.7 74.5 69.4 75.4 45.7 83.4 76.5 76.6 59.6 54.5 63.4 67.4 74.8 78.6 90.3 43.0 63.1 58.6 85.2 71.3
NANJING_DMC_HIK_SVM_SIFT 55.6 25.5 31.0 36.5 15.8 41.4 40.0 40.6 30.0 17.8 21.1 34.0 27.0 31.0 57.9 11.9 20.7 22.6 48.4 35.7
NLPR_KF_SVM 10.5 9.1 10.7 6.0 6.5 7.2 13.3 12.2 11.5 9.5 5.6 16.7 8.6 6.6 38.9 5.3 15.0 5.0 8.3 5.4
NLPR_SS_VW_PLS 94.5 82.6 79.4 80.7 57.8 87.8 85.5 83.9 66.6 74.2 69.4 75.2 83.0 88.1 93.5 56.2 75.5 64.1 90.0 76.6
NLPR_SVM_BOWDET 82.9 69.4 45.4 60.1 46.0 80.0 75.1 59.9 54.9 50.7 43.3 49.9 63.4 72.2 88.1 36.1 57.1 37.7 75.2 58.5
NLPR_SVM_BOWDET_CONV 83.8 69.8 47.8 60.5 45.4 80.5 74.6 60.4 54.0 51.3 45.3 51.5 64.5 72.6 87.7 35.9 57.7 39.8 75.8 62.7
NUSPSL_CTX_GPM 95.5 81.1 79.4 82.5 58.2 87.7 84.1 83.1 68.5 72.8 68.5 76.4 83.3 87.5 92.8 56.5 77.7 67.0 91.2 77.5
NUSPSL_CTX_GPM_SVM 94.3 78.5 76.4 80.0 57.0 86.3 82.1 81.5 65.6 74.7 66.5 73.4 81.9 85.3 91.9 53.2 73.9 65.1 89.5 76.0
SJT_SIFT_LLC_PCAPOOL_DET_SVM 85.6 66.5 51.9 60.3 45.4 76.8 70.3 65.1 56.4 34.3 49.6 52.4 63.1 71.5 86.8 26.1 56.9 47.9 75.5 65.6
SJT_SIFT_LLC_PCAPOOL_SVM 83.2 52.5 49.3 59.6 26.0 73.5 58.2 64.4 52.1 36.6 44.9 52.1 57.8 63.8 78.1 19.1 52.8 44.1 72.0 57.4
UVA_MOSTTELLING 90.1 74.1 66.5 76.0 57.0 85.6 81.2 74.5 63.5 62.7 64.5 66.6 76.5 81.2 90.8 58.7 69.3 66.3 84.7 77.2
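
For reference, the AP figures above are computed from ranked classifier confidences. A minimal sketch, assuming the all-points interpolated average precision that VOC adopted from 2010 onwards (the toy scores and labels are illustrative):

```python
import numpy as np

def average_precision(scores, labels):
    """All-points interpolated AP: sort by confidence, sweep the
    precision/recall curve, and integrate precision over recall."""
    order = np.argsort(-np.asarray(scores))
    labels = np.asarray(labels)[order]          # 1 = positive, 0 = negative
    tp = np.cumsum(labels)
    fp = np.cumsum(1 - labels)
    recall = tp / labels.sum()
    precision = tp / (tp + fp)
    # make precision monotonically non-increasing, then sum the area
    for i in range(len(precision) - 2, -1, -1):
        precision[i] = max(precision[i], precision[i + 1])
    ap, prev_r = 0.0, 0.0
    for r, p in zip(recall, precision):
        ap += (r - prev_r) * p
        prev_r = r
    return ap

# toy example: five test images, two positives -> AP = 0.8333
print(average_precision([0.9, 0.8, 0.6, 0.4, 0.2], [1, 0, 1, 0, 0]))
```
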

Precision/Recall Curves

Classification Results: VOC2011 data

Competition "comp2" (train on own data)

Average Precision (AP %)

aeroplane bicycle bird boat bottle bus car cat chair cow diningtable dog horse motorbike person pottedplant sheep sofa train tvmonitor
LIRIS_CLSTEXT 88.3 66.1 60.8 68.5 46.7 77.3 69.2 63.7 55.9 52.6 56.6 55.5 69.6 73.7 87.1 46.3 65.2 54.0 81.2 72.7

Precision/Recall Curves

Detection Results: VOC2011 data

Competition "comp3" (train on VOC2011 data)

Average Precision (AP %)

aeroplane bicycle bird boat bottle bus car cat chair cow diningtable dog horse motorbike person pottedplant sheep sofa train tvmonitor
BROOKES_STRUCT_DET_CRF 37.1 42.6 2.0 0.0 16.0 43.8 38.6 17.0 10.3 7.7 2.4 1.5 34.3 41.1 38.4 1.5 14.7 5.3 35.4 27.1
CMIC_GS_DPM - - - 13.3 26.4 - 41.5 - - - 12.2 - - 41.6 - 8.3 31.4 - - -
CMIC_SYNTHDPM 40.4 47.8 - 11.4 23.7 48.9 40.9 23.5 11.9 25.5 - 10.9 42.0 38.6 40.7 7.5 30.4 - 38.4 34.8
CORNELL_ISVM_VIEWPOINT 42.5 43.7 5.4 4.8 18.1 28.6 36.6 24.2 12.6 20.5 4.4 17.5 15.2 38.2 7.9 1.7 23.2 7.1 41.0 25.7
MISSOURI_LCC_TREE_CODING 41.1 51.7 13.7 11.9 27.3 52.1 41.7 32.9 17.6 27.3 18.5 23.1 45.2 48.6 41.9 11.6 32.4 27.5 44.2 38.3
MISSOURI_TREE_MAX_POOLING 43.8 51.7 13.7 12.7 27.3 51.5 43.7 32.9 18.3 27.3 18.5 23.1 45.2 48.6 42.9 11.6 32.4 27.5 47.0 39.3
NLPR_DD_DC 55.0 58.1 22.5 18.8 33.9 57.6 54.5 42.6 20.2 40.3 29.3 37.1 54.6 58.3 51.6 14.7 44.8 32.1 51.7 41.0
NUS_CONTEXT_SVM 51.4 52.9 20.1 15.7 26.9 53.0 45.6 37.6 15.2 36.0 25.1 32.6 50.4 55.8 36.8 12.3 37.6 30.5 48.1 41.0
NYUUCLA_HIERARCHY 56.3 55.9 23.4 20.3 27.2 56.6 48.1 53.8 23.2 32.9 33.3 39.2 53.0 56.9 43.6 14.3 37.9 39.4 52.6 43.7
OXFORD_DPM_MK 56.0 53.3 19.2 17.2 25.8 53.1 45.4 44.5 20.1 32.1 28.1 37.2 52.3 56.6 43.3 12.1 34.3 37.6 51.8 45.2
UOCTTI_LSVM_MDPM 53.2 53.9 13.1 13.5 30.5 55.5 51.2 31.7 14.5 29.0 16.0 22.1 43.1 50.3 46.3 8.8 33.0 22.9 45.8 38.2
UOCTTI_WL-SSVM_GRAMMAR - - - - - - - - - - - - - - 49.2 - - - - -
UVA_SELSEARCH 56.9 43.4 16.6 15.8 18.0 52.3 38.3 48.9 12.2 29.7 32.8 36.7 45.7 54.4 30.4 16.2 37.2 34.7 45.9 44.2
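
For reference, a detection in the table above counts as correct only if its bounding box overlaps a ground-truth box with intersection-over-union above 0.5, with duplicate detections of the same object counted as false positives. A minimal sketch of the overlap test, assuming (xmin, ymin, xmax, ymax) box coordinates:

```python
def iou(a, b):
    """Intersection-over-union of two boxes given as (xmin, ymin, xmax, ymax)."""
    ix = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    iy = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = ix * iy
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter)

# a detection is a true positive if it is the first to cover a
# ground-truth object with IoU > 0.5
print(iou((0, 0, 10, 10), (5, 0, 15, 10)) > 0.5)  # False: IoU = 1/3
```
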

Precision/Recall Curves

Detection Results: VOC2011 data

Competition "comp4" (train on own data)

Average Precision (AP %)

aeroplane bicycle bird boat bottle bus car cat chair cow diningtable dog horse motorbike person pottedplant sheep sofa train tvmonitor

No entries were submitted for this competition.

Precision/Recall Curves

Segmentation Results (VOC2011 data)

Competition "comp5" (train on VOC2011 data)

Accuracy (%)

- Entries in parentheses are synthesized from detection results.

[mean] background aeroplane bicycle bird boat bottle bus car cat chair cow diningtable dog horse motorbike person pottedplant sheep sofa train tvmonitor
BONN_FGT_SEGM 41.4 83.4 51.7 23.7 46.0 33.9 49.4 66.2 56.2 41.7 10.4 41.9 29.6 24.4 49.1 50.5 39.6 19.9 44.9 26.1 40.0 41.6
BONN_SVR_SEGM 43.3 84.9 54.3 23.9 39.5 35.3 42.6 65.4 53.5 46.1 15.0 47.4 30.1 33.9 48.8 54.4 46.4 28.8 51.3 26.2 44.9 37.2
BROOKES_STRUCT_DET_CRF 31.3 79.4 36.6 18.6 9.2 11.0 29.8 59.0 50.3 25.5 11.8 29.0 24.8 16.0 29.1 47.9 41.9 16.1 34.0 11.6 43.3 31.7
NUS_CONTEXT_SVM 35.1 77.2 40.5 19.0 28.4 27.8 40.7 56.4 45.0 33.1 7.2 37.4 17.4 26.8 33.7 46.6 40.6 23.3 33.4 23.9 41.2 38.6
NUS_SEG_DET_MASK_CLS_CRF 37.7 79.8 41.5 20.2 30.4 29.1 47.4 61.2 47.7 35.0 8.5 38.3 14.5 28.6 36.5 47.8 42.5 28.5 37.8 26.4 43.5 45.8
(CORNELL_ISVM_VIEWPOINT) 11.8 1.4 7.4 10.5 5.5 1.6 22.9 25.7 27.9 10.9 4.7 16.4 5.2 5.6 10.3 21.4 11.1 4.8 6.7 3.0 21.3 24.2
(MISSOURI_LCC_TREE_CODING) 13.1 0.5 9.2 9.4 8.1 2.2 25.7 32.6 18.6 13.2 4.1 9.5 13.8 9.5 13.5 17.4 26.7 10.0 9.5 14.5 15.9 11.2
(MISSOURI_TREE_MAX_POOLING) 13.1 0.6 10.0 7.8 7.4 2.3 27.1 30.2 38.8 12.3 3.9 8.3 10.7 7.8 11.4 14.4 26.9 6.3 8.6 10.3 16.9 13.2
(NLPR_DD_DC) 19.4 0.8 21.6 2.9 10.1 7.9 38.0 27.2 26.0 7.4 7.3 30.4 17.8 26.3 24.9 41.6 29.2 2.4 27.8 20.7 31.0 6.9
(NYUUCLA_HIERARCHY) 15.3 1.2 11.9 7.6 12.9 6.7 12.4 24.3 28.4 26.2 2.9 21.3 9.3 19.8 18.6 27.7 27.6 6.3 23.1 5.9 18.1 9.1
(OXFORD_DPM_MK) 15.2 0.4 16.3 7.4 8.7 4.7 27.0 29.8 18.9 23.0 3.2 15.3 11.6 13.9 19.6 19.1 23.3 4.3 22.7 7.7 19.5 22.5
(UOCTTI_LSVM_MDPM) 13.1 4.0 9.2 7.8 9.2 6.2 20.4 38.4 24.9 11.2 3.3 12.8 5.9 10.4 15.4 19.5 20.4 5.7 13.4 5.0 15.9 16.3
(UVA_SELSEARCH) 16.2 2.9 13.9 8.2 5.4 7.2 18.8 52.0 29.2 21.9 3.9 17.5 10.7 13.7 12.2 27.7 14.7 7.8 21.3 12.9 17.2 20.5
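
The accuracies above use the VOC segmentation measure: for each class, true positives divided by the total of true positives, false positives and false negatives, i.e. pixel-level intersection-over-union accumulated over the test set. A minimal sketch (the real evaluation also excludes "void" pixels, which this sketch omits):

```python
import numpy as np

def per_class_accuracy(pred, gt, num_classes=21):
    """VOC-style segmentation accuracy per class:
    tp / (tp + fp + fn), i.e. pixel intersection-over-union.
    Classes absent from both pred and gt would need guarding."""
    accs = []
    for c in range(num_classes):                 # class 0 is background
        tp = np.sum((pred == c) & (gt == c))
        fp = np.sum((pred == c) & (gt != c))
        fn = np.sum((pred != c) & (gt == c))
        accs.append(tp / float(tp + fp + fn))
    return accs  # the mean of this list is the "[mean]" column

pred = np.array([[0, 1], [1, 1]])
gt   = np.array([[0, 1], [0, 1]])
print(per_class_accuracy(pred, gt, num_classes=2))  # [0.5, 0.666...]
```
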

Segmentation Results (VOC2011 data)

Competition "comp6" (train on own data)

Accuracy (%)

- Entries in parentheses are synthesized from detection results.

[mean] background aeroplane bicycle bird boat bottle bus car cat chair cow diningtable dog horse motorbike person pottedplant sheep sofa train tvmonitor
BERKELEY_REGION_CLASSIFY 39.1 83.3 48.9 20.0 32.8 28.2 41.1 53.9 48.3 48.0 6.0 34.9 27.5 35.0 47.2 47.3 48.4 20.6 52.7 25.0 36.6 35.4

Person Layout Results: VOC2011 data

Competition "comp7" (train on VOC2011 data)

Average Precision (AP %)

Head Hand Foot

No entries were submitted for this competition.

Precision/Recall Curves

Person Layout Results: VOC2011 data

Competition "comp8" (train on own data)

Average Precision (AP %)

  Head Hand Foot
OXFORD_RANK_SLACK_RBF 72.9 26.9 4.1

Precision/Recall Curves

Action Classification Results: VOC2011 data

Competition "comp9" (train on VOC2011 data)

Average Precision (AP %)

jumping phoning playinginstrument reading ridingbike ridinghorse running takingphoto usingcomputer walking
CAENLEAR_DSAL 62.1 39.7 60.5 33.6 80.8 83.6 80.3 23.2 53.4 50.2
CAENLEAR_HOBJ_DSAL 71.6 50.7 77.5 37.8 86.5 89.5 83.8 25.1 58.9 59.2
MISSOURI_SSLMF 58.8 36.8 48.5 30.6 81.5 83.0 78.5 21.3 50.7 53.8
NUDT_CONTEXT 65.9 41.5 57.4 34.7 88.8 90.2 87.9 25.7 54.5 59.5
NUDT_LL_SEMANTIC 66.3 41.3 53.9 35.2 88.8 90.0 87.6 25.5 53.7 58.2
STANFORD_RF_DENSEFTR_SVM 66.0 41.0 60.0 41.5 90.0 92.1 86.6 28.8 62.0 65.9
WVU_SVM-PHOW 42.5 29.5 32.1 26.7 48.5 46.3 59.2 13.5 24.3 35.6

Precision/Recall Curves

Action Classification Results: VOC2011 data

Competition "comp10" (train on own data)

Average Precision (AP %)

jumping phoning playinginstrument reading ridingbike ridinghorse running takingphoto usingcomputer walking
BERKELEY_ACTION_POSELETS 59.5 31.3 45.6 27.8 84.4 88.3 77.6 31.0 47.4 57.6
STANFORD_COMBINE_ATTR_PART 66.7 41.1 60.8 42.2 90.5 92.2 86.2 28.8 63.5 64.2
STANFORD_MAPSVM_POSELET 27.0 29.3 28.3 23.8 71.9 82.4 67.3 20.1 26.0 46.4

Precision/Recall Curves

Key to Abbreviations

Each entry lists: abbreviation, title, method name, affiliation, contributors, and description.
BERKELEY_ACTION_POSELETS
Title: Poselets trained on action categories
Method: BERKELEY_ACTION_POSELETS
Affiliation: University of California, Berkeley
Contributors: Subhransu Maji, Lubomir Bourdev, Jitendra Malik
Description: This is based on our CVPR 2011 paper "Action recognition using a distributed representation of pose and appearance" (Subhransu Maji, Lubomir Bourdev and Jitendra Malik). For this submission we train 200 poselets for each action category. In addition we train poselets based on subcategory labels for playinginstrument and ridingbike. Linear SVMs are trained on the "poselet activation vector" along with features from object detectors for four categories: motorbike, bicycle, horse and tvmonitor. Context models re-rank the objects at the image level, as described in the CVPR'11 paper.

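As an illustration of the poselet activation vector mentioned above, a minimal sketch of building such a vector from per-instance poselet detections and training a linear SVM on it; the detection tuples, dimensions and data are hypothetical, not the submission's actual pipeline:

```python
import numpy as np
from sklearn.svm import LinearSVC

NUM_POSELETS = 200  # 200 poselets per action category, as described above

def activation_vector(detections, num_poselets=NUM_POSELETS):
    """Max-score poselet activation vector for one person instance.
    `detections` is a hypothetical list of (poselet_id, score) pairs."""
    v = np.zeros(num_poselets)
    for poselet_id, score in detections:
        v[poselet_id] = max(v[poselet_id], score)
    return v

# toy data: two training instances with made-up activations and labels
X = np.stack([activation_vector([(3, 0.9), (17, 0.4)]),
              activation_vector([(8, 0.7)])])
y = np.array([1, 0])  # 1 = action present
clf = LinearSVC().fit(X, y)
```
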
BERKELEY_REGION_CLASSIFY
Title: Classification of low-level regions
Method: Berkeley_Region_Classify
Affiliation: UC Berkeley
Contributors: Pablo Arbelaez, Bharath Hariharan, Saurabh Gupta, Chunhui Gu, Lubomir Bourdev and Jitendra Malik
Description: We propose a semantic segmentation approach that represents and classifies generic regions from low-level segmentation. We extract object candidates using ultrametric contour maps (Arbelaez et al., TPAMI 2011) at several image resolutions. We represent each region using mid- and high-level features that capture its appearance (color, shape, texture) and also its compatibility with the activations of a part detector (we use the poselets from Bourdev et al., ECCV 2010). A category label is assigned to each region using a hierarchy of IKSVM classifiers (Maji et al., CVPR 2008).

BONN_FGT_SEGM
Title: BONN_FGT_SEGM
Method: BONN_FGT_SEGM
Affiliation: ¹University of Bonn, ²Vienna University of Technology, ³Georgia Institute of Technology
Contributors: Joao Carreira¹, Adrian Ion², Fuxin Li³, Cristian Sminchisescu¹
Description: We present a joint image segmentation and labeling model which, given a bag of figure-ground segment hypotheses extracted at multiple image locations and scales using CPMC (Carreira and Sminchisescu, CVPR 2010), constructs a joint probability distribution over both the compatible image interpretations (tilings or image segmentations) composed from those segments, and over their labeling into categories. The process of drawing samples from the joint distribution can be interpreted as first sampling tilings, modeled as maximal cliques, from a graph connecting spatially non-overlapping segments in the bag (Ion, Carreira, Sminchisescu, ICCV 2011), followed by sampling labels for those segments, conditioned on the choice of a particular tiling. We learn the segmentation and labeling parameters jointly, based on maximum likelihood with a novel incremental saddle-point estimation procedure (Ion, Carreira, Sminchisescu, NIPS 2011).

BONN_SVR_SEGM
Title: SVR on CPMC-generated figure-ground segmentations
Method: BONN_SVRSEGM
Affiliation: University of Bonn
Contributors: Joao Carreira, Fuxin Li, Cristian Sminchisescu
Description: We present a recognition system based on sequential figure-ground ranking. We extract a bag of figure-ground segments using CPMC (Carreira and Sminchisescu, CVPR 2010). The bag is then filtered down to 100 segments using a class-independent ranker. Using these features, we learn one nonlinear Support Vector Regressor (SVR) for each category that predicts the overlap between each segment and an object from that category. A complete image interpretation is obtained by sequentially selecting segments using combination and non-maximum suppression schemes. Details can be found in (F. Li, J. Carreira, C. Sminchisescu, CVPR 2010; IJCV 2011). Additionally, the system is trained with both object segmentation layouts and weak annotations from bounding boxes.

BPACAD_COMB_LF_AK_WK_NOBOXES
Title: Combination of the late fusion, avgker and weker
Method: BPACAD_COMB_LF_AK_WK_NOBOXES
Affiliation: Data Mining and Web Search Research Group (DMWS), MTA SZTAKI, Hungary
Contributors: Bálint Daróczy, László Nikházy
Description: This is the average of the confidence outputs of a late fusion method, an aggregated kernel method and an averaged kernel method (BPACAD_CS_FISH256-1024_SVM_AVGKER). We computed RGB color moments and SIFT descriptors (Lowe 1999) on a dense grid and on regions detected by Harris-Laplacian (Mikolajczyk et al, 2005). All three methods are based on non-hierarchical Gaussian Mixture Models (GMM) with 256 Gaussians (two of them also using GMMs with 1024 Gaussians) and non-sparse Fisher vectors (Perronnin et al, 2007). We used our open-source GMM/Fisher toolkit for GPGPUs to compute the Fisher vectors and train the GMMs (http://datamining.sztaki.hu/?q=en/GPU-GMM). All three use Fisher-vector-based pre-computed kernels (basic kernels) for learning linear SVM classifiers (Daróczy et al, ImageCLEF 2011). The late fusion method is based on a combination of SVM predictions (18 SVM classifiers per class), while the aggregated and averaged kernels are computed before classification (only one SVM classifier per class). While the averaged kernel method needs no parameter tuning, for the late fusion and aggregated kernel methods we learned optimal weights per class on the validation set.

BPACAD_CS_FISH256_1024_SVM_AVGKER_NOBOXES
Title: SVM on averaged Fisher kernels
Method: BPACAD_CS_FISH256-1024_SVM_AVGKER_NOBOXES
Affiliation: Data Mining and Web Search Research Group (DMWS), MTA SZTAKI, Hungary
Contributors: Bálint Daróczy, László Nikházy
Description: We computed RGB color moments and SIFT descriptors (Lowe 1999) on a dense grid and on regions detected by Harris-Laplacian (Mikolajczyk et al, 2005). We trained non-hierarchical Gaussian Mixture Models (with 256 and 1024 Gaussians) on a subset (1 million) of the low-level features extracted from the training images. We extracted non-sparse Fisher vectors on nine different poolings with the 256-Gaussian GMM (dense grid, Harris-Laplacian, 3x1 and 2x2 spatial pyramids (Lazebnik et al, 2006)) and four with the 1024-Gaussian GMM (dense, 3x1). We used our open-source GMM/Fisher toolkit for GPGPUs to compute the Fisher vectors and train the GMMs (http://datamining.sztaki.hu/?q=en/GPU-GMM). We calculated pre-computed kernels (Daróczy et al, ImageCLEF 2011) and averaged them. We trained only one binary SVM classifier per class.

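Both BPACAD entries train SVMs on averaged precomputed kernels. A minimal sketch of that step, assuming the per-channel Gram matrices have already been computed from Fisher vectors (variable names are illustrative):

```python
import numpy as np
from sklearn.svm import SVC

def averaged_kernel_svm(kernels, y):
    """Average several basic n x n Gram matrices (one per feature
    channel / pooling) and train one binary SVM per class on the result."""
    K = np.mean(np.stack(kernels), axis=0)
    clf = SVC(kernel="precomputed")
    clf.fit(K, y)
    return clf

# at test time the classifier needs the (n_test x n_train) averaged kernel:
# scores = clf.decision_function(K_test_train)
```
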
BROOKES_STRUCT_DET_CRF
Title: Structured Detection and Segmentation CRF
Method: Struct_Det_CRF
Affiliation: Oxford Brookes University
Contributors: Jonathan Warrell, Vibhav Vineet, Paul Sturgess, Philip Torr
Description: We form a hierarchical CRF which jointly models a pool of candidate detections and the multiclass pixel segmentation of an image. Attractive and repulsive pairwise terms are allowed between detection nodes (cf. Desai et al, ICCV 2009), which are integrated into a Pn-Potts based hierarchical segmentation energy (cf. Ladicky et al, ECCV 2010). A cutting-plane algorithm is used to train the model, using approximate MAP inference. We form a joint loss which combines segmentation and detection components (i.e. paying a penalty both for each pixel incorrectly labelled and for each false detection node which is active in a solution), and use different weightings of this loss to train the model for detection and for segmentation. The segmentation results thus make use of the bounding box annotations. The candidate detections are generated using the Felzenszwalb et al. CVPR 2008/2010 detector, and as features for segmentation we use textons, SIFT, LBPs and the detection response surfaces themselves.

BUPT_ALL
Title: combining methods
Method: BUPT_MCPR_all
Affiliation: Beijing University of Posts and Telecommunications-MCPRL
Contributors: Zhicheng Zhao, Tao Liu, Xin Guo, Anni Cai
Description: A region-based method is used, in which all the features mentioned in the BUPT_NOPATCH entry are extracted on regions rather than keypoints. A region is a group of pixels with similar appearance, obtained with the mean-shift method. Finally, we combine the results of the two methods with a linear fusion algorithm, without patch features.

BUPT_NOPATCH
Title: nopatch method
Method: BUPT_MCPR_nopatch
Affiliation: Beijing University of Posts and Telecommunications-MCPRL
Contributors: Zhicheng Zhao, Tao Liu, Xin Guo, Anni Cai
Description: A bag-of-words method with SIFT, SURF and HOG features; both dense sampling and keypoint detection are used to obtain keypoints.

CAENLEAR_DSAL
Title: Discriminative spatial saliency
Method: DSAL
Affiliation: Univ Caen / INRIA LEAR
Contributors: Gaurav Sharma, Frederic Jurie, Cordelia Schmid
Description: We propose to learn discriminative saliency maps for images which highlight the regions that are most discriminative for the current classification task. We use the saliency maps to weight the visual words, improving the discriminative capacity of bag-of-words features. The approach is motivated by the observation that for many human actions and attributes, local regions are highly discriminative; e.g., for running, the bent arms and legs are highly discriminative. In addition, we combine features based on SIFT, HOG, color and texture.

CAENLEAR_HOBJ_DSAL
Title: Human-object interaction and discriminative saliency
Method: HOBJ+DSAL
Affiliation: Univ Caen / INRIA LEAR
Contributors: Gaurav Sharma, Alessandro Prest, Frederic Jurie, Vittorio Ferrari, Cordelia Schmid
Description: We use the weakly supervised approach of (Prest et al., PAMI 2010) for learning human actions modeled as interactions between humans and objects. The human bounding box is taken as reference, and the object relevant to the action and its spatial relation with the human are learnt automatically. This is combined with a method that learns discriminative spatial saliency, highlighting the regions that are most discriminative for the current classification task. We use the saliency maps to weight the visual words, improving the discriminative capacity of bag-of-words features. In addition, we combine features based on SIFT, HOG, color and texture.

CMIC_GS_DPM
Title: Synthetic training for deformable parts model
Method: CMIC-GS-DPM
Affiliation: Cairo Microsoft Innovation Center
Contributors: Dr. Motaz El-Saban, Osama Khalil, Mostafa Izz, Mohamed Fathi
Description: We introduce dataset augmentation using synthetic examples as a method for introducing novel variations not present in the original set. We make use of the deformable parts-based model (Felzenszwalb et al 2010). We augment the training set with examples obtained by applying global scaling to the dataset examples. Global scaling includes no scaling, up-scaling and down-scaling, with varying performance across different object classes. The technique is selected based on performance on the validation set. The augmented dataset is then used to train parts-based detectors using HOG features (Dalal & Triggs 2006) and latent SVM. The resulting class models are applied to test images in a "sliding-window" fashion.

CMIC_SYNTHDPM
Title: Synthetic training for deformable parts model
Method: CMIC-Synthetic-DPM
Affiliation: Cairo Microsoft Innovation Center
Contributors: Dr. Motaz El-Saban, Osama Khalil, Mostafa Izz, Mohamed Fathi
Description: We introduce dataset augmentation using synthetic examples as a method for introducing novel variations not present in the original set. We make use of the deformable parts-based model (Felzenszwalb et al 2010). We augment the training set with examples obtained by relocating objects (having segmentation masks) to new backgrounds. New backgrounds are selected using a set of techniques (no relocation, same image, "different" image, or image with co-occurring objects). The performance of these techniques varies across classes according to the object class properties. For every class, we select the technique that achieves the highest AP on the validation set. The augmented dataset is then used to train parts-based detectors using HOG features (Dalal & Triggs 2006) and latent SVM. The resulting class models are applied to test images in a "sliding-window" fashion.

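A minimal sketch of the kind of global-scaling augmentation the two CMIC entries describe; the scale factors and file handling are illustrative assumptions, not the submission's actual choices:

```python
from PIL import Image

def scaled_copies(path, factors=(0.75, 1.0, 1.25)):
    """Augment a training image with globally down/up-scaled copies
    (the scale factors here are illustrative)."""
    img = Image.open(path)
    w, h = img.size
    return [img.resize((int(w * f), int(h * f)), Image.BILINEAR)
            for f in factors]

# bounding-box annotations must be scaled by the same factor before the
# augmented set is fed to latent-SVM training of the parts-based model
```
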
CORNELL_ISVM_VIEWPOINT
Title: Using viewpoint cues to improve object recognition
Method: lSVM-Viewpoint
Affiliation: Cornell
Contributors: Joshua Schwartz, Noah Snavely, Daniel Huttenlocher
Description: Our system is based on the latent SVM framework of [1], including their context rescoring method. We train 6-component models with 8 parts. However, unlike [1], components are trained using a clustering based on an unsupervised estimation of 3D object viewpoint. In this sense our approach is similar to the unsupervised approach in [2], which also seeks to estimate viewpoint, but our clustering is based on explicit reasoning about 3D geometry. Additionally, we add features based on estimated 3D scene geometry for context rescoring. Of note is the fact that a detection with our method gives rise to an explicit estimate of object viewpoint within a scene, rather than just a bounding box. [1] P. Felzenszwalb, R. Girshick, D. McAllester, and D. Ramanan. Object Detection with Discriminatively Trained Part Based Models. PAMI 2010. [2] C. Gu and X. Ren. Discriminative Mixture-of-Templates for Viewpoint Classification. ECCV 2010.

JDL_K17_AVG_CLS
Title: SVM with average kernel using 17 kernels
Method: JDL_K17_AVG_CLS
Affiliation: JDL, Institute of Computing Technology, Chinese Academy of Sciences
Contributors: Shuhui Wang, Li Shen, Shuqiang Jiang, Qi Tian, Qingming Huang
Description: We calculate six types of commonly used BoW features (including dense and sparse SIFT, dense color SIFT, HOG, LBP and self-similarity) and three global features (color moment, GIST and block GIST), where the visual vocabulary size is typically around 1000. We calculate 3-level spatial pyramid features on these BoW representations respectively. Then 17 base kernels are calculated by applying histogram intersection, RBF and chi-square kernels to these features, with kernel parameters tuned on the validation data. We calculate an average kernel from these 17 base kernels. One-against-all SVM classifiers are used to train the final classifiers for each category.

LIRIS_CLS
Title: MKL classifier with multiple features
Method: LIRIS_CLS
Affiliation: LIRIS, Ecole Centrale de Lyon, CNRS, UMR5205, France
Contributors: Chao ZHU, Yuxing TANG, Ningning LIU, Charles-Edmond BICHOT, Emmanuel Dellandrea, Liming CHEN
Description: In this submission, we mainly make use of local descriptors and the popular bag-of-visual-words approach for classification. Regions of interest in each image are detected using both the Harris-Laplace detector and a dense sampling strategy. SIFT and color SIFT descriptors are then computed for each region as a baseline. In addition, we also extract DAISY and extended LBP descriptors based on our work [1][2] for computational efficiency and complementary information to SIFT. For each kind of descriptor, 1,000,000 randomly selected descriptors from the train + val set are quantized using the k-means algorithm into 4000 visual words. Each image is then represented by a histogram using hard assignment. The spatial pyramid technique is applied for coarse spatial information. Chi-square kernels for the different levels of the pyramid are computed and fused by linear combination. The final outputs are obtained by using a multiple kernel learning algorithm to fuse the different descriptors. [1] C. Zhu, C.E. Bichot, L. Chen: 'Visual object recognition using DAISY descriptor', in Proc. of IEEE International Conference on Multimedia and Expo (ICME), to appear, 2011. [2] C. Zhu, C.E. Bichot, L. Chen: 'Multi-scale color local binary patterns for visual object classes recognition', in Proc. of 20th International Conference on Pattern Recognition (ICPR), pp. 3065-3068, 2010.

LIRIS_CLSDET
Title: Classification combined with detection
Method: LIRIS_CLSDET
Affiliation: LIRIS, Ecole Centrale de Lyon, CNRS, UMR5205, France
Contributors: Chao ZHU, Yuxing TANG, Ningning LIU, Charles-Edmond BICHOT, Emmanuel Dellandrea, Liming CHEN
Description: In this submission, we improve classification performance by combining it with object detection results. For classification, we mainly make use of local descriptors and the popular bag-of-visual-words approach. Regions of interest in each image are detected using both the Harris-Laplace detector and a dense sampling strategy. SIFT and color SIFT descriptors are then computed for each region as a baseline. In addition, we also extract DAISY and extended LBP descriptors based on our work [1][2] for computational efficiency and complementary information to SIFT. For each kind of descriptor, 1,000,000 randomly selected descriptors from the train + val set are quantized using the k-means algorithm into 4000 visual words. Each image is then represented by a histogram using hard assignment. The spatial pyramid technique is applied for coarse spatial information. Chi-square kernels for the different levels of the pyramid are computed and fused by linear combination. The final outputs are obtained by using a multiple kernel learning algorithm to fuse the different descriptors. For object detection, we apply the HOG feature to train deformable part models, and use the models together with a sliding-window approach to detect objects. Finally, we combine the outputs of classification and detection by late fusion. [1] C. Zhu, C.E. Bichot, L. Chen: 'Visual object recognition using DAISY descriptor', in Proc. of IEEE International Conference on Multimedia and Expo (ICME), to appear, 2011. [2] C. Zhu, C.E. Bichot, L. Chen: 'Multi-scale color local binary patterns for visual object classes recognition', in Proc. of 20th International Conference on Pattern Recognition (ICPR), pp. 3065-3068, 2010.

LIRIS_CLSTEXT
Title: Classification with additional text feature
Method: LIRIS_CLSTEXT
Affiliation: LIRIS, Ecole Centrale de Lyon, CNRS, UMR5205, France
Contributors: Chao ZHU, Yuxing TANG, Ningning LIU, Charles-Edmond BICHOT, Emmanuel Dellandrea, Liming CHEN
Description: In this submission, we use additional text information to help with object classification. We propose novel text features [1] based on semantic distance using WordNet. The basic idea is to calculate the semantic distance between the text associated with an image and an emotional dictionary based on path similarity, denoting how similar two word senses are, based on the shortest path that connects the senses in a taxonomy. As there are no tags included in the PASCAL VOC 2011 dataset, we downloaded 1 million Flickr images (including their tags) as an additional textual source. First, for each PASCAL image, we find its most similar images (top 20) in the database using a KNN method based on visual features (LBP and color HSV histogram), and then use these tags to extract the text feature. We use an SVM with RBF kernel to train the classifier and predict the outputs. For classification based on visual features, we follow the same method described in our other submission. The outputs of the visual-feature-based and text-feature-based methods are then linearly combined as the final results. [1] N. Liu, Y. Zhang, E. Dellandréa, B. Tellez, L. Chen: 'Associating text features with visual ones to improve affective image classification', International Conference on Affective Computing (ACII), Memphis, USA, 2011.

MISSOURI_LCC_TREE_CODING
Title: SVM classifier with LCC and tree coding
Method: LCC-TREE-CODING
Affiliation: University of Missouri
Contributors: Xiaoyu Wang, Miao Sun, Xutao Lv, Shuai Tang, Guang Chen, Yan Li, Tony X. Han
Description: A two-layer cascade structure for object detection. The first layer employs a deformable model to select possible candidates for the second layer. The latter layer uses location and global context, augmented with LBP features, to improve accuracy. A bag-of-words model enhanced with spatial pyramids and local coordinate coding is used to model the global context information. A hierarchical tree-structure coding is used to handle the intra-class variation within each detection window. A linear SVM is used for classification.

MISSOURI_SSLMF
Title: Supervised Learning with Multiple Features
Method: Supervised learning with multiple feature
Affiliation: University of Missouri - Columbia
Contributors: Xutao Lv, Xiaoyu Wang, Guang Chen, Shuai Tang, Yan Li, Miao Sun, Tony X. Han
Description: Multiple available features are combined and fed into a newly developed supervised learning algorithm. The features include those extracted within the bounding box and those from the whole image; the whole-image features serve as context information. We mainly use two feature descriptors in our submission: dense SIFT and HOG. The LCC coding method and spatial pyramids are adopted to generate a histogram for each action image, and the histogram then serves as the feature vector for training and testing with the supervised learning algorithm.

MISSOURI_TREE_MAX_POOLING
Title: SVM classifier with tree max-pooling
Method: TREE--MAX-POOLING
Affiliation: University of Missouri
Contributors: Xiaoyu Wang, Miao Sun, Xutao Lv, Shuai Tang, Guang Chen, Yan Li, Tony X. Han
Description: A two-layer cascade structure for object detection. The first layer employs a deformable model to select possible candidates for the second layer. The latter layer uses location and global context, augmented with LBP features, to improve accuracy. A bag-of-words model enhanced with spatial pyramids and local coordinate coding is used to model the global context information. A hierarchical tree-structure coding is used to handle the intra-class variation within each detection window. Max-pooling is used for tree node assignment. A linear SVM is used for classification.

MSRAUSTC_HIGH_ORDER_SVM
Title: SVM with mined high order features
Method: MSRA_USTC_HIGH_ORDER_SVM
Affiliation: Microsoft Research Asia & University of Science and Technology of China
Contributors: Kuiyuan Yang, Lei Zhang, Hong-Jiang Zhang
Description: We introduce a discriminatively-trained parts-based model with different level templates for image classification. The model consists of templates of HOG features (Dalal and Triggs, 2006) at three different levels. The responses of different level templates are combined by a latent-SVM, where the latent variables are the positions of the templates. We develop a novel mining algorithm to define the parts and an iterative training procedure to learn the parts. The model is applied to all 20 PASCAL VOC objects.

MSRAUSTC_PATCH
Title: SVM with multi-channel cell-structured patch features
Method: MSRA_USTC_PATCH
Affiliation: Microsoft Research Asia & University of Science and Technology of China
Contributors: Kuiyuan Yang, Lei Zhang, Hong-Jiang Zhang
Description: We introduce a discriminatively-trained patch-based model with cell-structured templates for image classification. Densely sampled patches are represented by cell-structured templates of HOG, LBP, HSV, SIFT, CSIFT and SSIM. These templates are then fed to super-vector coding (Xi Zhou, 2010) and the Fisher kernel (Florent Perronnin, 2010) to form the image feature. A linear SVM is then trained for each category in a one-vs-rest manner. The object detector from PASCAL VOC 2007 is used to extract object-level features; classifiers are trained on these features and then fused with the former.

NANJING_DMC_HIK_SVM_SIFT
Title: HIK based SVM classifier with dense SIFT features
Method: DMC-HIK-SVM-SIFT
Affiliation: The University of Nanjing
Contributors: Yubin Yang, Ye Tang, Lingyan Pan
Description: We adopt a bag-of-visual-words method (cf. Csurka et al 2004). A single descriptor type is used: SIFT descriptors (Lowe 2004) are extracted from 16x16-pixel patches densely sampled from each image on a grid with a step size of 12 pixels. We partition the original training and validation data into categories according to their labels, then randomly select 200 images per category (2000 images in total) as the training set. We use a novel difference-maximize coding approach to quantize these descriptors into 200 "visual words". Each image is then represented by a histogram of visual words. Spatial pyramid matching (Lazebnik et al, CVPR 2006) is also used in our method. Finally, we train a HIK-kernel SVM classifier (Jianxin Wu et al, ICCV 2009) on the concatenated pyramid feature vector of each image in the training set.

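A minimal sketch of a histogram intersection kernel (HIK) SVM of the kind this entry trains, here via a custom kernel callback; the data is random placeholder input, not the entry's features:

```python
import numpy as np
from sklearn.svm import SVC

def hik(X, Y):
    """Histogram intersection kernel between rows of X and rows of Y."""
    return np.array([[np.minimum(x, y).sum() for y in Y] for x in X])

# X_train would hold the concatenated spatial-pyramid BoW histograms
# described above (values here are illustrative placeholders)
X_train = np.random.rand(10, 200)
y_train = np.random.randint(0, 2, 10)
clf = SVC(kernel=hik).fit(X_train, y_train)
```
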
NLPR_DD_DC
Title: NLPR-Detection
Method: Data Decomposition and Distinctive Context
Affiliation: Institute of Automation, Chinese Academy of Sciences
Contributors: Junge Zhang, Yinan Yu, Yongzhen Huang, Chong Wang, Weiqiang Ren, Jinchen Wu, Kaiqi Huang and Tieniu Tan
Description: The part-based model has achieved great success in recent years. In our understanding, the original deformable part-based model has several limitations: 1) its computational complexity is very large, especially when it is extended to enhanced models via multiple features, more mixtures or flexible part models; 2) the original part-based model is not "deformable" enough. To tackle these problems: 1) We propose a data decomposition based feature representation scheme for the part-based model, learned in an unsupervised manner. The submitted method takes about 1-2 seconds per image from the PASCAL VOC datasets on average while keeping high performance. We learn the basis from samples without any label information; the label-independent rule followed in the submitted method can be adapted to other variants of the part-based model, such as hierarchical models or flexible mixture models. 2) We found that each part corresponds to multiple possible locations, which is not reflected in the original part-based model. Accordingly, we propose that the locations of parts should obey a mixture of Gaussian distributions. Thus, for each part we learn its optimal locations by clustering, and these are used to update the original anchors of the part-based model. The proposed method can more effectively describe the deformation (pose and location variation) of objects' parts. 3) We rescore the initial results with our distinctive context model, which includes global, local and intra-class context information. Besides, segmentation provides a strong indication of an object's presence; therefore the proposed segmentation-aware semantic attribute is applied in the final reasoning, which indeed shows promising performance.

NLPR_KF_SVM
Title: SVM classifier with five kernels
Method: NLPR_KF_SVM
Affiliation: National Laboratory of Pattern Recognition, Institute of Automation, Chinese Academy of Sciences
Contributors: Ruiguang Hu, Weiming Hu
Description: Features: GrayPHOG, HuePHOG, PLBP (R=1 and R=2), GraySIFT_HL and HueSIFT_HL (the latter two with LLC coding and max pooling). Codebooks: k-means clustering. Normalization: L1 for GrayPHOG, HuePHOG and PLBP; L2 for GraySIFT_HL and HueSIFT_HL. Kernels: chi-squared for GrayPHOG, HuePHOG and PLBP; linear for GraySIFT_HL and HueSIFT_HL. Kernel fusion: averaging. Training features are extracted from sub-images cropped according to the annotation bounding boxes; test features are extracted from whole test images.

NLPR_SS_VW_PLS
Title: NLPR_CLS
Method: Semi-Semantic Visual Words & Partial Least Squares
Affiliation: National Laboratory of Pattern Recognition, Institute of Automation, Chinese Academy of Sciences
Contributors: Yinan Yu, Junge Zhang, Yongzhen Huang, Weiqiang Ren, Chong Wang, Jinchen Wu, Kaiqi Huang, Tieniu Tan
Description: The framework is based on the classical bag-of-words model. The system consists of: 1) at the feature level, semi-semantic and non-semantic visual-word learning, with fast feature extraction (salient coding, super-vector coding and visual co-occurrence) and multiple features; 2) to learn the class model, an alternative multiple-linear-kernel learning for intra-class feature combination, after using Partial Least Squares analysis to project the extremely high-dimensional features into a low-dimensional space; 3) the combination of the 20 category scores and the detection scores generates a high-level semantic representation of the image, and we use non-linear-kernel learning to extract inter-class contextual information, which further improves performance. All parameters are decided by cross-validation and prior knowledge on the VOC2007 and VOC2010 trainval sets. The motivation and novelty of our algorithm: the traditional codebook describes the distribution of the feature space and contains little semantic information about the objects of interest, so a semantic codebook may benefit performance. We observe that the deformable part-based model [Felzenszwalb, TPAMI 2010] describes an object by "object parts", which can be seen as semi-semantic visual words. Based on this idea, we propose a bag-of-words model over both semi-semantic and non-semantic visual words for image classification. Analyzing recent image classification algorithms, we find that feature "distribution", "reconstruction" and "saliency" are three fundamental issues in coding and image description. However, these methods usually lead to an extremely high-dimensional description, especially with multiple features. In order to learn these features by MKL, we find Partial Least Squares to be a reliable method for dimensionality reduction: its compression ratio here is over 10000, while discrimination is preserved.

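A minimal sketch of the Partial Least Squares projection step this entry uses for dimensionality reduction, with scikit-learn's PLSRegression standing in for whatever solver the authors used; all dimensions and data are illustrative:

```python
import numpy as np
from sklearn.cross_decomposition import PLSRegression

# Hypothetical very high-dimensional image descriptors; PLS projects them
# to a low-dimensional space while preserving discrimination, as the
# entry above describes.
X = np.random.rand(100, 5000)          # 100 images, 5000-D features
y = np.random.randint(0, 2, 100)       # one-vs-rest labels for one class
pls = PLSRegression(n_components=50).fit(X, y.reshape(-1, 1))
X_low = pls.transform(X)               # 100 x 50 projection, fed to MKL/SVM
```
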
NLPR_SVM_BOWDET
Title: SVM with multiple features and detection results
Method: NLPR_IVA_SVM_BOWDect
Affiliation: NLPR, CASIA
Contributors: Jing Liu, Jianlong Fu, Bingyuan Liu, Hanqing Lu
Description: Two types of features are considered. First, typical BoW features (OpponentSIFT, C-SIFT and rgSIFT) with dense and Harris sampling respectively; spatial pyramid kernels on these features are calculated for classification. Second, object detection based on the deformable part model (P. Felzenszwalb et al., PAMI 2009) is employed. Combining these features, we attempt to learn a hierarchical classifier with an SVM according to the hierarchical structure of all the 20-class data.

NLPR_SVM_BOWDET_CONV
Title: SVM with multiple features and detection results
Method: NLPR_IVA_SVM_BOWDect_Convolution
Affiliation: NLPR, CASIA
Contributors: Jing Liu, Jianlong Fu, Bingyuan Liu, Hanqing Lu
Description: Three types of features are considered. First, typical BoW features (OpponentSIFT, C-SIFT and rgSIFT) with dense and Harris sampling respectively; spatial pyramid kernels are calculated. Second, an improved image representation via convolutional sparse coding and a max-pooling operation is employed, motivated by M. Zeiler's work in ICCV 2011. Third, object detection based on the deformable part model. Combining these multiple features, we attempt to learn a hierarchical classifier with an SVM according to the hierarchical structure of all the 20-class data.

NUDT_CONTEXT
Title: SVM classifier with contextual information
Method: NUDT_Context
Affiliation: National University of Defense Technology
Contributors: Li Zhou, Zongtan Zhou, Dewen Hu
Description: Action classification using contextual information. We present a new model for action classification context based on the distribution of objects and the semantic category of the scene within images. Scene classification works by creating multiple-resolution images and partitioning them into sub-regions at different scales. The visual descriptors of all sub-regions in the same resolution image are directly concatenated for SVM classifiers. Finally, regarding each resolution image as a feature channel, we combine all the feature channels to reach a final decision. Object recognition works by incorporating a multi-resolution representation into the bag-of-features model.

NUDT_LL_SEMANTIC
Title: SVM classifier with low-level and semantic modeling
Method: NUDT_Low-level_Semantic
Affiliation: National University of Defense Technology
Contributors: Li Zhou, Dewen Hu, Zongtan Zhou
Description: Action classification based on combining low-level and semantic modeling strategies.

NUSPSL_CTX_GPM
Title: Classification using context SVM and GPM
Method: NUSPSL_CTX_GPM
Affiliation: National University of Singapore; Panasonic Singapore Laboratories
Contributors: NUS: Chen Qiang, Song Zheng, Yan Shuicheng; PSL: Hua Yang, Huang Zhongyang, Shen Shengmei
Description: The whole solution for object classification is based on the BoW framework. At the image level, dense SIFT, HOG^2, LBP and color moment features are extracted. VQ and Fisher vectors are used for feature coding. Traditional SPM and a novel spatial-free pyramid matching scheme are then applied to generate image representations. Context-aware features are also extracted based on the detection results [1]. The classification models are learnt via kernel SVM. The final classification scores are refined with kernel mapping [2]. The key novelty of the new solution is the pooling strategy (which handles spatial mismatching as well as noisy features well); considerable improvement has been achieved in other offline experiments. [1] Zheng Song, Qiang Chen, Zhongyang Huang, Yang Hua, and Shuicheng Yan. Contextualizing Object Detection and Classification, CVPR 2011. [2] http://pascallin.ecs.soton.ac.uk/challenges/VOC/voc2010/workshop/nuspsl.pdf

NUSPSL_CTX_GPM_SVM
Title: Classification using context SVM and GPM
Method: NUSPSL_CTX_GPM_SVM
Affiliation: National University of Singapore; Panasonic Singapore Laboratories
Contributors: NUS: Chen Qiang, Song Zheng, Yan Shuicheng; PSL: Hua Yang, Huang Zhongyang, Shen Shengmei
Description: The whole solution for object classification is based on the BoW framework [1]. At the image level, dense SIFT, HOG^2, LBP and color moment features are extracted. VQ and Fisher vectors are used for feature coding. Traditional SPM and a novel spatial-free pyramid matching scheme are then applied to generate image representations. Context-aware features are also extracted based on the detection results [2]. The classification models are learnt via kernel SVM. The key novelty of the new solution is the pooling strategy (which handles spatial mismatching as well as noisy features well); considerable improvement has been achieved in other offline experiments. [1] http://pascallin.ecs.soton.ac.uk/challenges/VOC/voc2010/workshop/nuspsl.pdf [2] Zheng Song, Qiang Chen, Zhongyang Huang, Yang Hua, and Shuicheng Yan. Contextualizing Object Detection and Classification, CVPR 2011.

NUS_CONTEXT_SVM
Title: Context-SVM based submission for 3 tasks
Method: NUS_Context_SVM
Affiliation: National University of Singapore
Contributors: Zheng Song, Qiang Chen, Shuicheng Yan
Description: Classification uses the BoW framework. Dense SIFT, HOG^2, LBP and color moment features are extracted. We then use VQ and Fisher vectors for feature coding, and SPM and Generalized Pyramid Matching (GPM) to generate image representations. Context-aware features are also extracted based on [1]. The classification models are learnt via kernel SVM, and the final classification scores are refined with kernel mapping [2]. Detection and segmentation results use the baseline of [3] with HOG and LBP features; based on [1], we further learn a context model and refine the detection results. The final segmentation result substitutes learnt average masks for each detection component (learnt on the segmentation training set) for the rectangular detection boxes. [1] Zheng Song*, Qiang Chen*, Zhongyang Huang, Yang Hua, and Shuicheng Yan. Contextualizing Object Detection and Classification. [2] http://pascallin.ecs.soton.ac.uk/challenges/VOC/voc2010/workshop/nuspsl.pdf [3] http://people.cs.uchicago.edu/~pff/latent/

NUS_SEG_DET_MASK_CLS_CRF
Title: Segmentation Using CRF with Detection Mask
Method: NUS_SEG_DET_MASK_CLS_CRF
Affiliation: National University of Singapore
Contributors: Wei XIA, Zheng SONG, Qiang CHEN, Shuicheng YAN, Loong Fah CHEONG
Description: The solution is based on a CRF model, and the key contribution is the use of various types of binary regularization terms. Object detection also plays a very significant role in guiding semantic object segmentation. In this solution, the CRF model is built to integrate the global classification score with local unary and binary information to perform semantic segmentation. Moreover, detection masks, obtained by hard-thresholding the detection confidence maps, are applied as extra unary and smoothness terms in the CRF model. Some of the masks with high confidence are also used in the post-processing stage to refine the mask boundaries.

NYUUCLA_HIERARCHY
Title: Latent Hierarchical Learning
Method: NYU-UCLA_Hierarchy
Affiliation: NYU and UCLA
Contributors: Yuanhao Chen, Li Wan, Long Zhu, Rob Fergus, Alan Yuille
Description: Based on two recent publications: "Latent Hierarchical Structural Learning for Object Detection", Long Zhu, Yuanhao Chen, Alan Yuille, William Freeman, CVPR 2010; and "Active Mask Hierarchies for Object Detection", Yuanhao Chen, Long Zhu, Alan Yuille, ECCV 2010. We present a latent hierarchical structural learning method for object detection. An object is represented by a mixture of hierarchical tree models where the nodes represent object parts. The nodes can move spatially to allow both local and global shape deformations. The image features are histograms of words (HOWs) and oriented gradients (HOGs), which enable rich appearance representation of both structured (e.g., cat face) and textured (e.g., cat body) image regions. Learning the hierarchical model is a latent SVM problem which can be solved by the incremental concave-convex procedure (iCCCP). Object detection is performed by scanning sub-windows using dynamic programming. The detections are rescored by a context model which encodes the correlations of the 20 object classes using both object detection and image classification.

OXFORD_DPM_MK
Title: DPM with basic rescoring
Method: DPM-MK
Affiliation: Oxford VGG
Contributors: Andrea Vedaldi and Andrew Zisserman
Description: This method uses a Deformable Part Model (our own implementation) to generate an initial (and very good) list of 100 candidate bounding boxes per image. These are then rescored by a multiple-features model combining DPM scores with dense SP-BOW, geometry, and context. The SP-BOW model uses dense SIFT features (vl_phow in VLFeat) quantized into 1200 visual words, a 6x6 spatial layout, and cell-by-cell l2 normalization after raising the entries to the 1/4 power (1/4-homogeneous Hellinger kernel). The geometric model is a second-order polynomial kernel on the bounding box coordinates. The context model is a second-order polynomial kernel mixing the candidate DPM score with twenty scores obtained as the maximum response of the DPMs for the 20 classes in that image (like Felzenszwalb). A second context model is also added, using 20 scores from a state-of-the-art Fisher-kernel image classifier (also on dense SIFT features), as described in Chatfield et al. 2010. The SVM scores are passed through a sigmoid for standardization in the 0-1 interval; the sigmoid model is fitted to the training data. The model is trained by means of a large-scale linear SVM using the one-slack bundle formulation (aka SVM^perf). The solver hence uses retraining implicitly, and we make sure it reaches full convergence.

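A minimal sketch of the sigmoid score calibration step mentioned above (Platt-style scaling fitted by maximum likelihood); the scores and labels are toy values, and this is not the authors' exact fitting procedure:

```python
import numpy as np
from scipy.optimize import minimize

def fit_sigmoid(scores, labels):
    """Platt-style calibration: fit p(y=1|s) = 1 / (1 + exp(a*s + b))
    to SVM scores by minimizing the negative log-likelihood."""
    def nll(params):
        a, b = params
        p = 1.0 / (1.0 + np.exp(a * scores + b))
        p = np.clip(p, 1e-12, 1 - 1e-12)
        return -np.sum(labels * np.log(p) + (1 - labels) * np.log(1 - p))
    a, b = minimize(nll, x0=[-1.0, 0.0]).x
    return lambda s: 1.0 / (1.0 + np.exp(a * s + b))

scores = np.array([-2.0, -0.5, 0.3, 1.5])
labels = np.array([0, 0, 1, 1])
to_prob = fit_sigmoid(scores, labels)
print(to_prob(scores))  # calibrated scores in the 0-1 interval
```
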
OXFORD_RANK_SLACK_RBF
Title: Structured ranking for Layout Detection
Method: SVM-rank-slack-RBF
Affiliation: University of Oxford
Contributors: Arpit Mittal, Matthew Blaschko, Andrew Zisserman, Manuel J Marin, Phil Torr
Description: We make use of an SVM structured ranking algorithm to combine and rank the outputs of different part detectors. Individual parts are detected using separate detectors; the outputs are then adapted to the local image using positional and scale cues. The different part detections are finally combined using a ranking function to give a single confidence value for the human layout detection. The ranking is performed such that detections having more true-positive parts (i.e., higher precision) are returned earlier. For detection of the human head, we use the parts-based model of Felzenszwalb et al. (PAMI 2010); the hand is localized using the hand detector developed by Mittal et al. (BMVC 2011). The feet are detected using the foot part of Felzenszwalb et al.'s human detector, and are also returned as the bounding box around super-pixels resembling a human foot in the lower part of the human ROI. We use the slack-rescaled variant of the SVM structured ranking algorithm and an RBF kernel map.

SJT_SIFT_LLC_PCAPOOL_DET_SVM
Title: SVM using LLC features with detection results
Method: SIFT-LLC-PCAPOOL-DET-SVM
Affiliation: Shanghai Jiao Tong University
Contributors: Jun Zhu, Xiaokang Yang, Yukun Zhu, Rui Zhang, Xiaolin Chen
Description: We adopt a framework based on locality-constrained linear coding (LLC) (J. Wang et al, CVPR 2010), fused with the detection results given by discriminatively trained deformable part-based object detectors (P. Felzenszwalb et al, CVPR 2008). First, SIFT descriptors (Lowe, 2004) are extracted from image patches densely sampled every 8 pixels at three different scales (16x16, 24x24 and 32x32). Then, a codebook with 1024 bases is constructed by k-means clustering on 100,000 randomly selected descriptors from the training set. Each 128-dimensional SIFT descriptor is encoded by approximated LLC (the number of neighbors is set to 5, with the shift-invariant constraint), giving a 1024-dimensional code vector. We use max pooling on the patch-level codes from hundreds of overlapping regions at various spatial scales and positions, followed by dimension reduction using PCA. After that, the pooled features are concatenated into a vector with l2 normalization to form the image-level representation. In addition, the results of object detectors are also considered. We use Felzenszwalb's deformable part-based models to detect the bounding boxes for each object class. The detection scores are max-pooled in each cell of a spatial pyramid (i.e., 1x1+2x2+3x1) to construct an image-level representation with l2 normalization. We obtain the final image-level representation by weighted concatenation of the two feature vectors from the LLC codes and the object detectors. Then, a linear SVM classifier is trained to perform classification. The regularization parameters as well as the fusion weight are tuned for each class using the training and validation sets. We use the 'liblinear' software package, released by the machine learning group at National Taiwan University, for the SVM implementation.

SJT_SIFT_LLC_PCAPOOL_SVM
Title: Linear SVM using LLC features with PCA pooling
Method: SIFT-LLC-PCAPOOL-SVM
Affiliation: Shanghai Jiao Tong University
Contributors: Jun Zhu, Xiaokang Yang, Yukun Zhu, Rui Zhang, Xiaoling Chen
Description: We adopt a framework based on locality-constrained linear coding (LLC) (J. Wang et al, CVPR 2010). First, SIFT descriptors (Lowe, 2004) are extracted from image patches densely sampled every 8 pixels at three different scales (16x16, 24x24 and 32x32). Then, a codebook with 1024 bases is constructed by k-means clustering on 100,000 randomly selected descriptors from the training set. After that, each 128-dimensional SIFT descriptor is encoded by approximated LLC (the number of neighbors is set to 5, with the shift-invariant constraint), giving a 1024-dimensional code vector. We use max pooling on the patch-level codes from hundreds of overlapping regions at various spatial scales and positions, followed by dimension reduction using PCA. After that, the pooled features are concatenated into a vector with l2 normalization to form the image-level representation. Finally, we train linear SVM classifiers on this feature representation to perform classification. The regularization parameters are tuned for each class using the training and validation sets. We use the 'liblinear' software package, released by the machine learning group at National Taiwan University, for the SVM implementation.

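A minimal sketch of the approximated LLC encoding both SJT entries build on (cf. Wang et al., CVPR 2010): each descriptor is reconstructed from its k nearest codebook bases under a sum-to-one constraint; the codebook here is random placeholder data:

```python
import numpy as np

def llc_code(x, codebook, k=5, beta=1e-4):
    """Approximated locality-constrained linear coding: reconstruct
    descriptor x from its k nearest codebook bases under the
    shift-invariant constraint sum(codes) = 1."""
    d = np.sum((codebook - x) ** 2, axis=1)
    idx = np.argsort(d)[:k]                  # k nearest bases
    z = codebook[idx] - x                    # shift bases to the origin
    C = z @ z.T + beta * np.eye(k)           # regularized local covariance
    w = np.linalg.solve(C, np.ones(k))
    w /= w.sum()                             # enforce sum-to-one constraint
    code = np.zeros(len(codebook))
    code[idx] = w
    return code                              # sparse 1024-D code vector

codebook = np.random.rand(1024, 128)         # 1024 bases from k-means
print(llc_code(np.random.rand(128), codebook).shape)  # (1024,)
```
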
STANFORD_COMBINE_ATTR_PART
Title: Combine attribute classifiers and object detectors
Method: COMBINE_ATTR_PART
Affiliation: Stanford University
Contributors: Bangpeng Yao, Aditya Khosla, Li Fei-Fei
Description: Our approach combines attribute classifiers and part detectors for action classification. The method is adapted from our ICCV 2011 paper (Yao et al, 2011). The "attributes" are trained by using our random forest classifier (Yao et al, 2011), which are strong classifiers that consider global properties of action classes. As for "parts", we consider the objects that interact with the humans, such as horses, books, etc. Specifically, we take the object bank (Li et al, 2010) detectors that are trained on the ImageNet dataset for part representation. The confidence scores obtained from attribute classifiers and part detectors are combined to form the final score for each image.

STANFORD_MAPSVM_POSELET
Title: MAP-based SVM classifier with poselet features
Method: MAPSVM-Poselet
Affiliation: Stanford University
Contributors: Tim Tang, Pawan Kumar, Ben Packer, Daphne Koller
Description: We build on the poselet-based feature vector for action classification (Maji et al., 2010) in four ways: (i) we use a 2-level spatial pyramid (Lazebnik et al., CVPR 2006); (ii) we obtain a segmentation of the person bounding box into foreground and background using an efficient GrabCut-like scheme (Rother et al., SIGGRAPH 2004), and use it to divide the feature vector into two parts, one corresponding to the foreground and one to the background; (iii) we learn a mixture model to deal with the different visual aspects of people performing the same action; and (iv) we optimize mean average precision (Yue et al., SIGIR 2007) instead of the 0/1 loss used in the standard binary SVM. All action classifiers are trained on only the VOC 2011 data, with the additional annotations required to compute the poselets. All hyperparameters are set using 5-fold cross-validation.

STANFORD_RF_DENSEFTR_SVM
Title: Random forest with SVM node classifiers
Method: RF_DENSEFTR_SVM
Affiliation: Stanford University
Contributors: Bangpeng Yao, Aditya Khosla, Li Fei-Fei
Description: We use a random forest (RF) approach for action classification. Our method is adapted from our CVPR 2011 paper (Yao et al, 2011). We explore two key properties that determine the performance of RF classifiers: discrimination and randomization. (1) Discrimination: in order to obtain strong decision trees, instead of randomly generating feature weights as in conventional RF approaches, we use discriminative SVM classifiers to train the split for each tree node. (2) Randomization: the correlation between different decision trees needs to be small, such that the combination of all the trees can form an effective RF classifier. We consider a very dense feature space, where we sample image regions that can have any size and location in the image. For each sampled region, we use an SPM feature representation. Since each decision tree samples a specific set of image regions, the correlation between the trees can be reduced.

UOCTTI_LSVM_MDPM
Title: LSVM trained mixtures of deformable part models
Method: UOCTTI_LSVM_MDPM
Affiliation: University of Chicago
Contributors: Ross Girshick (University of Chicago), Pedro Felzenszwalb (Brown), David McAllester (TTI-Chicago)
Description: Based on [1] http://people.cs.uchicago.edu/~pff/latent-release4 and [2] "Object Detection with Discriminatively Trained Part Based Models"; Pedro F. Felzenszwalb, Ross B. Girshick, David McAllester and Deva Ramanan; IEEE Transactions on Pattern Analysis and Machine Intelligence, Vol. 32, No. 9, September 2010. This entry is a minor modification of our publicly available "voc-release4" object detection system [1]. The system uses latent SVM to train mixtures of deformable part models using HOG features [2]. Final detections are refined using a context rescoring mechanism [2]. We extended [1] to detect smaller objects by adding an extra high-resolution octave to the HOG feature pyramid. The HOG features in this extra octave are computed using 2x2 pixel cells. Additional bias parameters are learned to help calibrate scores from detections in the extra octave with the scores of detections above this octave. This entry is the same as UOCTTI_LSVM_MDPM from the 2010 competition. Detection results are reported for all 20 object classes to provide a baseline for the 2011 competition.

UOCTTI_WL-SSVM_GRAMMAR
Title: Person grammar model trained with WL-SSVM
Method: UOCTTI_WL-SSVM_GRAMMAR
Affiliation: University of Chicago
Contributors: Ross Girshick (University of Chicago), Pedro Felzenszwalb (Brown), David McAllester (TTI-Chicago)
Description: This entry is described in [1] "Object Detection with Grammar Models"; Ross B. Girshick, Pedro F. Felzenszwalb, David McAllester; Neural Information Processing Systems 2011 (to appear). We define a grammar model for detecting people and train the model's parameters from bounding box annotations using a formalism that we call weak-label structural SVM (WL-SSVM). The person grammar uses a set of productions that represent varying degrees of visibility/occlusion. Object parts, such as the head and shoulder, are shared across all interpretations of object visibility. Each part is represented by a deformable mixture model that includes deformable subparts. An "occluder" part (itself a deformable mixture of parts) is used to capture the nontrivial appearance of the stuff that typically occludes people from below. We further refine detections using the context rescoring mechanism from the UOCTTI_LSVM_MDPM entry, using the results of that entry for the 19 non-person classes.

UVA_MOSTTELLING
Title: Most Telling Window
Method: UvA_UNITN_MostTellingMonkey
Affiliation: University of Amsterdam, University of Trento
Contributors: Jasper Uijlings, Koen van de Sande, Arnold Smeulders, Theo Gevers, Nicu Sebe, Cees Snoek
Description: The main component of this classification entry is the "Most Telling Window" method, which uses Segmentation as Selective Search [1] combined with bag-of-words. The "Most Telling Window" method is also used in our detection entry. However, instead of focusing on finding complete objects, training is adjusted so that the most discriminative part of an object can be used for its identification instead of the whole object. The Most Telling Window method is currently under review. While the "Most Telling Window" method yields the greatest contribution, we improve accuracy further by combining it with a normal bag-of-words framework based on SIFT and ColourSIFT, and with the detection scores of the part-based model of Felzenszwalb et al. [1] "Segmentation as Selective Search for Object Recognition"; Koen E. A. van de Sande, Jasper R. R. Uijlings, Theo Gevers, Arnold W. M. Smeulders; 13th International Conference on Computer Vision, 2011.

UVA_SELSEARCH
Title: Selective Search Detection System
Method: SelectiveSearchMonkey
Affiliation: University of Amsterdam and University of Trento
Contributors: Jasper R. R. Uijlings, Koen E. A. van de Sande, Arnold W. M. Smeulders, Theo Gevers, Nicu Sebe, Cees Snoek
Description: Based on "Segmentation as Selective Search for Object Recognition"; Koen E. A. van de Sande, Jasper R. R. Uijlings, Theo Gevers, Arnold W. M. Smeulders; 13th International Conference on Computer Vision, 2011. Instead of the exhaustive search that dominated the PASCAL VOC 2010 detection challenge, we use segmentation as a sampling strategy for selective search (cf. our ICCV paper). Like segmentation, we use the image structure to guide our sampling process. However, unlike segmentation, we propose to generate many approximate locations over few and precise object delineations, as the goal is to cover all object locations. Our sampling is diversified to deal with as many image conditions as possible. Specifically, we use a variety of hierarchical region grouping strategies, varying the colour spaces and grouping criteria. This results in a small set of data-driven, class-independent, high-quality object locations (covering 96-99% of all objects in the VOC2007 test set). Because we have only a limited number of locations to evaluate, this enables the use of the more computationally expensive bag-of-words framework for classification. Our bag-of-words implementation uses densely sampled SIFT and ColorSIFT descriptors.

WVU_SVM-PHOW
Title: SVM classifier with PHOW features
Method: SVM-PHOW
Affiliation: West Virginia University
Contributors: Biyun Lai, Yu Zhu, Qin Wu, Guodong Guo
Description: We develop a method for still-image based action recognition. There are 10 action classes plus the "other" action class provided by PASCAL VOC 2011. We extract PHOW features to represent the images, a kind of multi-scale dense SIFT implementation. The kernel SVM method is used for training action classifiers, with several different kernels. We also use a learning technique to map the original features into a different space to improve the feature representation. A confidence measure is used to combine the results from the different kernels to form the final decision for action classification. Training is performed on the provided training set and tuned using the validation set; the learned classifiers are then applied to the test data.