PASCAL VOC Challenge performance evaluation server

Action Classification Results: VOC2012 (BETA)

Competition "comp9" (train on VOC2012 data)

This leaderboard shows only those submissions that have been marked as public, and so the displayed rankings should not be considered as definitive.

The highest scoring entry in each column is shown in bold.
Clicking the blue arrow symbol at the top of a column sorts the submissions from high to low by performance on that column.

Average Precision (AP %)

	method	mean	jumping	phoning	playing instrument	reading	riding bike	riding horse	running	taking photo	using computer	walking	submission date
	PoseReuse_ClsStreamPoseStream	90.0	90.6	85.8	92.5	79.7	97.5	98.4	94.8	87.5	91.0	82.6	25-Sep-2017
	PoseReuse_ClsStream	88.4	89.6	84.2	91.8	77.1	97.1	98.2	93.4	85.4	89.4	78.3	25-Sep-2017
	STANFORD_RF_MULTFEAT_SVM	69.1	75.7	44.8	66.6	44.4	93.2	94.2	87.6	38.4	70.6	75.6	23-Sep-2012
	SZU_DPM_RF_SVM	67.1	73.8	45.0	62.8	41.4	93.0	93.4	87.8	35.0	64.7	73.5	23-Sep-2012
	RF_DENSEFTR_SVM	63.6	65.8	42.7	59.8	41.3	90.0	92.1	86.4	29.1	62.4	66.1	13-Oct-2011
	NUDT_Low-level_Semantic	60.1	66.1	42.8	53.7	34.9	88.9	89.9	87.2	25.3	53.9	58.5	30-Sep-2011
	NUDT_Context	60.7	65.6	42.9	57.2	34.4	88.9	90.0	87.6	25.4	54.8	59.9	12-Oct-2011
	HOBJ+DSAL	57.0	71.6	51.6	77.3	37.5	86.5	89.4	83.7	25.2	59.1	59.7	13-Oct-2011
	M4AP	53.7	47.8	35.4	46.7	28.7	83.4	85.2	84.2	28.5	42.4	54.0	27-Jan-2014
	Supervised learning with multiple feature	54.5	58.6	38.3	48.3	30.2	81.7	83.0	78.0	21.2	51.4	54.0	13-Oct-2011
	DSAL	50.6	62.1	40.9	60.3	32.8	80.9	83.6	80.0	23.2	54.0	50.6	13-Oct-2011
	SVM-PHOW	35.9	42.3	31.0	32.0	26.4	48.6	46.2	58.9	13.6	24.2	35.9	14-Oct-2011
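
The mean column can be cross-checked against the per-class columns. The following is a minimal sketch in Python, assuming the mean is the unweighted average of the ten class APs. The page does not state this definition; it is an assumption that agrees with the PoseReuse_ClsStreamPoseStream row to within rounding. The names `CLASS_AP`, `LISTED_MEAN` and `mean_ap` are illustrative only.

```python
# Per-class AP values (%) copied from the PoseReuse_ClsStreamPoseStream row.
CLASS_AP = {
    "jumping": 90.6,
    "phoning": 85.8,
    "playing instrument": 92.5,
    "reading": 79.7,
    "riding bike": 97.5,
    "riding horse": 98.4,
    "running": 94.8,
    "taking photo": 87.5,
    "using computer": 91.0,
    "walking": 82.6,
}
LISTED_MEAN = 90.0  # value shown in the "mean" column for that row


def mean_ap(class_ap):
    """Unweighted average of the per-class AP values (assumed definition)."""
    return sum(class_ap.values()) / len(class_ap)


if __name__ == "__main__":
    computed = mean_ap(CLASS_AP)
    print(f"computed mean AP: {computed:.2f}, listed: {LISTED_MEAN:.1f}")
```

The same function can be applied to any other row by replacing the dictionary values. Because the displayed APs are rounded to one decimal place, a recomputed mean may differ from the listed one in the last digit.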

Abbreviations

Title	Method	Affiliation	Contributors	Description	Date
CNN classifier with semantic region from pose	PoseReuse_ClsStream	Southeast University of China	Jian Dong, Changyin Sun, Wankou Yang	The bounding box region and the semantic regions obtained based on pose estimation are fed into an end-to-end CNN.	2017-09-25 17:33:54
CNN classifier with two models	PoseReuse_ClsStreamPoseStream	Southeast University of China	Jian Dong, Changyin Sun, Wankou Yang	We weight CNN models initialized with different parameters: one is a general image classification model, the other a pose estimation model.	2017-09-25 17:40:55
Random forest with SVM on multiple features	STANFORD_RF_MULTFEAT_SVM	Stanford University; MIT	Aditya Khosla, Rui Zhang, Bangpeng Yao, and Li Fei-Fei	We use a random forest (RF) approach for action classification. Our method is adapted from our CVPR2011 paper (Khosla, Yao, Fei-Fei, 2011). We explore two key properties that determine the performance of RF classifiers: discrimination and randomization. (1) Discrimination: We obtain strong decision trees, using discriminative SVM classifiers at each tree node. (2) Randomization: We consider a very dense feature space, where we sample image regions that can have any size and location in the image. Compared to VOC2011, we use multiple features including SIFT, HOG, Color Naming, Object Bank and LBP. We modify some of the existing features to better address our need. Further, we perform tree selection (similar to feature selection) to identify more discriminative regions in a class-specific manner.	2012-09-23 14:08:50
Part based models and object detection	SZU_DPM_RF_SVM	Shenzhen University	Shiqi Yu, Shengyin Wu, Wensheng Chen	Based on "Object Detection with Discriminatively Trained Part Based Models", P. Felzenszwalb, R. Girshick, D. McAllester, D. Ramanan; IEEE TPAMI, 2010; and "Combining Randomization and Discrimination for Fine-Grained Image Categorization"; B. Yao, A. Khosla, and L. Fei-Fei. CVPR2011. We combine human part based models with object detectors for the proposed action classification method. Based on the simple principle that subjects performing the same action present similar poses, the deformable part based model [Felzenszwalb et al. TPAMI 2010] is employed to describe the pose of a human body. In detail, the positions and textures of the parts are extracted as features for action classification. For the detection part of the proposed method, we use the random forest (RF) described in [Yao et al. CVPR2011]. The RF can detect human body parts and the objects that interact with the human. Finally, we fuse the features and scores from the two models to obtain a stronger classifier.	2012-09-23 06:11:57
Discriminative spatial saliency	DSAL	Univ Caen/ INRIA LEAR	Gaurav Sharma, Frederic Jurie, Cordelia Schmid	We propose to learn discriminative saliency maps for images which highlight the regions which are more discriminant for the current classification task. We use the saliency maps to weight the visual words for improving discriminative capacity of bag of words features. The approach is motivated by the observation that for many human actions and attributes, local regions are highly discriminative e.g. for running the bent arms and legs are highly discriminant. Along with that we combine features based on SIFT, HOG, Color and texture.	2011-10-13 20:42:10
Human obj interaction and discriminative saliency	HOBJ+DSAL	Univ Caen/ INRIA LEAR	Gaurav Sharma, Alessandro Prest, Frederic Jurie, Vittorio Ferrari, Cordelia Schmid	We use the weakly supervised approach (Prest et al. PAMI2010) for learning human actions modeled as interactions between humans and objects. The human bounding box is taken as reference and the object relevant to the action and its spatial relation with the human is automatically learnt. The method is combined with a method to learn discriminative spatial saliency which highlights the regions which are more discriminant for the current classification task. We use the saliency maps to weight the visual words for improving discriminative capacity of bag of words features. Along with that we combine features based on SIFT, HOG, Color and texture.	2011-10-13 20:45:19
Max margin	M4AP	INRIA and Ecole Centrale de Paris	Puneet Kumar and M Pawan Kumar	We use different features to capture the action class and the contextual information contained within the classes. The methodology uses the contextual information in order to improve the results. We use structured prediction (structured support vector machine) for learning the parameters. The idea is to incorporate contextual information such that the actions in similar images should have the same class. In our experiments using the PASCAL trainval dataset we have noticed significant improvements.	2014-01-27 18:40:53
Svm classifier with contextual information	NUDT_Context	National University of Defense Technology	Li Zhou, Zongtan Zhou, Dewen Hu	Action classification using contextual information. We present a new model of action classification context based on the distribution of objects and the semantic category of the scene within images. The scene classification works by creating multiple resolution images and partitioning them into sub-regions with different scales. The visual descriptors of all sub-regions in the same resolution image are directly concatenated for SVM classifiers. Finally, regarding each resolution image as a feature channel, we combine all the feature channels to reach a final decision. The object recognition works by incorporating a multi-resolution representation into the bag-of-features model.	2011-10-12 17:25:12
Svm classifier with low-level and semantic modelin	NUDT_Low-level_Semantic	National University of Defense Technology	Li Zhou, Dewen Hu, Zongtan Zhou	Action classification based on combining low-level and semantic modeling strategies	2011-09-30 16:10:58
Random forest with SVM node classifiers	RF_DENSEFTR_SVM	Stanford University	Bangpeng Yao, Aditya Khosla, Li Fei-Fei	We use a random forest (RF) approach for action classification. Our method is adapted from our CVPR2011 paper (Yao et al, 2011). We explore two key properties that determine the performance of RF classifiers: discrimination and randomization. (1) Discrimination: In order to obtain strong decision trees, instead of randomly generating feature weights as in the conventional RF approaches, we use discriminative SVM classifiers to train the split for each tree node. (2) Randomization: The correlation between different decision trees needs to be small, such that the combination of all the trees can form an effective RF classifier. We consider a very dense feature space, where we sample image regions that can have any size and location in the image. For each sampled region, we use an SPM feature representation. Since each decision tree samples a specific set of image regions, the correlation between the trees can be reduced.	2011-10-13 07:37:36
Svm classifier with PHOW features.	SVM-PHOW	West Virginia University	Biyun Lai, Yu Zhu, Qin Wu, Guodong Guo	We develop a method for still-image based action recognition. There are 10 action classes plus the "other" action class provided by PASCAL VOC 2011. We extracted the PHOW features to represent the images, which is a kind of multi-scale dense SIFT implementation. The kernel SVM method is used for training action classifiers. Different kernels are used for the SVM. We also used a learning technique to map the original features into a different space to improve the feature representation. A confidence measure is used to combine the results from different kernels to form the final decision for action classification. The training is performed on the provided training set, and tuned by using the validation set, and then the learned classifiers are applied to the test data.	2011-10-14 00:06:25
Supervised Learning with Multiple Features	Supervised learning with multiple feature	University of Missouri - Columbia	Xutao Lv, Xiaoyu Wang, Guang Chen, Shuai Tang, Yan Li, Miao Sun, Tony X. Han	Multiple available features are combined and fed into a newly developed supervised learning algorithm. The features include those extracted within the bounding box and those from the whole image. The whole-image features serve as context information. We mainly use two feature descriptors in our submission, dense SIFT and HOG. The LCC coding method and a spatial pyramid are adopted to generate a histogram for each action image, and the histogram then serves as the feature vector to train and test the supervised learning algorithm.	2011-10-13 21:50:30
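
The DSAL and HOBJ+DSAL descriptions above mention using saliency maps to weight visual words in a bag-of-words feature. The following Python sketch illustrates that general idea only and is not the contributors' implementation. It assumes each local descriptor has already been assigned a visual word index and a non-negative saliency weight; the function name `weighted_bow_histogram` and the L1 normalization are choices made for this example.

```python
def weighted_bow_histogram(word_ids, saliency, vocab_size):
    """Build a bag-of-words histogram where each local feature votes
    with its saliency weight instead of a unit count (illustrative only).

    word_ids   -- visual word index of each local feature (0 <= id < vocab_size)
    saliency   -- non-negative saliency weight of each local feature
    vocab_size -- number of visual words in the vocabulary
    """
    if len(word_ids) != len(saliency):
        raise ValueError("word_ids and saliency must have the same length")

    hist = [0.0] * vocab_size
    for word, weight in zip(word_ids, saliency):
        hist[word] += weight

    # L1-normalize so that images with different numbers of features are
    # comparable; an all-zero histogram is returned unchanged.
    total = sum(hist)
    if total > 0:
        hist = [h / total for h in hist]
    return hist


if __name__ == "__main__":
    # Four local features over a vocabulary of four visual words.
    print(weighted_bow_histogram([0, 1, 1, 2], [1.0, 0.5, 0.5, 2.0], 4))
```

With uniform saliency weights this reduces to an ordinary normalized bag-of-words histogram, so the weighting only changes the result where the saliency map distinguishes between regions.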