Action Classification Results: VOC2012 BETA

Competition "comp9" (train on VOC2012 data)

This leaderboard shows only those submissions that have been marked as public, and so the displayed rankings should not be considered as definitive.

Average Precision (AP %)

Method                                     mean  jumping  phoning  playing     reading  riding  riding  running  taking  using     walking  submission
                                                                   instrument           bike    horse            photo   computer           date
PoseReuse_ClsStreamPoseStream              90.0  90.6     85.8     92.5        79.7     97.5    98.4    94.8     87.5    91.0      82.6     25-Sep-2017
PoseReuse_ClsStream                        88.4  89.6     84.2     91.8        77.1     97.1    98.2    93.4     85.4    89.4      78.3     25-Sep-2017
STANFORD_RF_MULTFEAT_SVM                   69.1  75.7     44.8     66.6        44.4     93.2    94.2    87.6     38.4    70.6      75.6     23-Sep-2012
SZU_DPM_RF_SVM                             67.1  73.8     45.0     62.8        41.4     93.0    93.4    87.8     35.0    64.7      73.5     23-Sep-2012
RF_DENSEFTR_SVM                            63.6  65.8     42.7     59.8        41.3     90.0    92.1    86.4     29.1    62.4      66.1     13-Oct-2011
NUDT_Low-level_Semantic                    60.1  66.1     42.8     53.7        34.9     88.9    89.9    87.2     25.3    53.9      58.5     30-Sep-2011
NUDT_Context                               60.7  65.6     42.9     57.2        34.4     88.9    90.0    87.6     25.4    54.8      59.9     12-Oct-2011
HOBJ+DSAL                                  57.0  71.6     51.6     77.3        37.5     86.5    89.4    83.7     25.2    59.1      59.7     13-Oct-2011
M4AP                                       53.7  47.8     35.4     46.7        28.7     83.4    85.2    84.2     28.5    42.4      54.0     27-Jan-2014
Supervised learning with multiple feature  54.5  58.6     38.3     48.3        30.2     81.7    83.0    78.0     21.2    51.4      54.0     13-Oct-2011
DSAL                                       50.6  62.1     40.9     60.3        32.8     80.9    83.6    80.0     23.2    54.0      50.6     13-Oct-2011
SVM-PHOW                                   35.9  42.3     31.0     32.0        26.4     48.6    46.2    58.9     13.6    24.2      35.9     14-Oct-2011
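
The AP figures above are areas under a precision-recall curve. As a minimal sketch of that computation, the code below assumes the all-points interpolation used by the VOC development kit from 2010 onward; the page itself does not state the formula, and the function names are illustrative. The `mean_ap` helper further assumes that the mean column is the unweighted average of the ten class columns. That assumption reproduces the top row but is not stated on the page.

```python
def average_precision(scores, labels):
    """Area under the precision-recall curve with a monotone precision envelope.

    scores: classifier confidences, one per test example.
    labels: 1 for a positive example of the action class, 0 otherwise.
    """
    n_pos = sum(labels)
    if n_pos == 0:
        raise ValueError("need at least one positive example")

    # Rank examples by decreasing confidence (stable for ties).
    order = sorted(range(len(scores)), key=lambda i: -scores[i])

    true_pos = 0
    recall, precision = [], []
    for rank, i in enumerate(order, start=1):
        true_pos += labels[i]
        recall.append(true_pos / n_pos)
        precision.append(true_pos / rank)

    # Pad the curve, then make precision non-increasing as recall grows.
    mrec = [0.0] + recall + [1.0]
    mpre = [0.0] + precision + [0.0]
    for k in range(len(mpre) - 2, -1, -1):
        mpre[k] = max(mpre[k], mpre[k + 1])

    # Sum rectangle areas at every point where recall changes.
    return sum(
        (mrec[k] - mrec[k - 1]) * mpre[k]
        for k in range(1, len(mrec))
        if mrec[k] != mrec[k - 1]
    )


def mean_ap(class_aps):
    """Unweighted mean over per-class AP values (assumed definition)."""
    return sum(class_aps) / len(class_aps)


# Per-class AP (%) of the top row, PoseReuse_ClsStreamPoseStream, in column order.
TOP_ROW_APS = [90.6, 85.8, 92.5, 79.7, 97.5, 98.4, 94.8, 87.5, 91.0, 82.6]
```

Calling `round(mean_ap(TOP_ROW_APS), 1)` gives 90.0, the mean reported for that row. Because the per-class values are themselves rounded, recomputed means for other rows can differ in the last digit.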

Abbreviations

Title: CNN classifier with semantic region from pose
Method: PoseReuse_ClsStream
Affiliation: Southeast University of China
Contributors: Jian Dong, Changyin Sun, Wankou Yang
Description: The bounding box region and the semantic regions obtained from pose estimation are fed into an end-to-end CNN.
Date: 2017-09-25 17:33:54

Title: CNN classifier with two models
Method: PoseReuse_ClsStreamPoseStream
Affiliation: Southeast University of China
Contributors: Jian Dong, Changyin Sun, Wankou Yang
Description: A weighted combination of CNN models initialized with different parameters. One is a general image classification model; the other is a pose estimation model.
Date: 2017-09-25 17:40:55

Title: Random forest with SVM on multiple features
Method: STANFORD_RF_MULTFEAT_SVM
Affiliation: Stanford University; MIT
Contributors: Aditya Khosla, Rui Zhang, Bangpeng Yao, and Li Fei-Fei
Description: We use a random forest (RF) approach for action classification. Our method is adapted from our CVPR2011 paper (Khosla*, Yao*, Fei-Fei, 2011). We explore two key properties that determine the performance of RF classifiers: discrimination and randomization. (1) Discrimination: We obtain strong decision trees by using discriminative SVM classifiers at each tree node. (2) Randomization: We consider a very dense feature space, where we sample image regions that can have any size and location in the image. Compared to VOC2011, we use multiple features including SIFT, HOG, Color Naming, Object Bank and LBP. We modify some of the existing features to better address our needs. Further, we perform tree selection (similar to feature selection) to identify more discriminative regions in a class-specific manner.
Date: 2012-09-23 14:08:50

Title: Part based models and object detection
Method: SZU_DPM_RF_SVM
Affiliation: Shenzhen University
Contributors: Shiqi Yu, Shengyin Wu, Wensheng Chen
Description: Based on "Object Detection with Discriminatively Trained Part Based Models", P. Felzenszwalb, R. Girshick, D. McAllester, D. Ramanan; IEEE TPAMI, 2010; and "Combining Randomization and Discrimination for Fine-Grained Image Categorization"; B. Yao, A. Khosla, and L. Fei-Fei. CVPR2011. We combine human part based models with object detectors for the proposed action classification method. Based on the simple principle that subjects performing the same action present similar poses, the deformable part based model [Felzenszwalb et al. TPAMI 2010] is employed to describe the pose of a human body. In detail, the positions and textures of the parts are extracted as features for action classification. For the detection part of the proposed method, we use the random forest (RF) described in [Yao et al. CVPR2011]. The RF can detect human body parts and the objects that interact with the human. Finally, we fuse the features and scores from the two models to obtain a stronger classifier.
Date: 2012-09-23 06:11:57

Title: Discriminative spatial saliency
Method: DSAL
Affiliation: Univ Caen/ INRIA LEAR
Contributors: Gaurav Sharma, Frederic Jurie, Cordelia Schmid
Description: We propose to learn discriminative saliency maps for images, which highlight the regions that are more discriminant for the current classification task. We use the saliency maps to weight the visual words to improve the discriminative capacity of bag of words features. The approach is motivated by the observation that for many human actions and attributes, local regions are highly discriminative, e.g. for running the bent arms and legs are highly discriminant. Along with that we combine features based on SIFT, HOG, Color and texture.
Date: 2011-10-13 20:42:10

Title: Human obj interaction and discriminative saliency
Method: HOBJ+DSAL
Affiliation: Univ Caen/ INRIA LEAR
Contributors: Gaurav Sharma, Alessandro Prest, Frederic Jurie, Vittorio Ferrari, Cordelia Schmid
Description: We use the weakly supervised approach (Prest et al. PAMI2010) for learning human actions modeled as interactions between humans and objects. The human bounding box is taken as reference, and the object relevant to the action and its spatial relation with the human are learnt automatically. The method is combined with a method to learn discriminative spatial saliency, which highlights the regions that are more discriminant for the current classification task. We use the saliency maps to weight the visual words to improve the discriminative capacity of bag of words features. Along with that we combine features based on SIFT, HOG, Color and texture.
Date: 2011-10-13 20:45:19

Title: Max margin
Method: M4AP
Affiliation: INRIA and Ecole Centrale de Paris
Contributors: Puneet Kumar and M Pawan Kumar
Description: I use different features to capture the action class and the contextual information contained within the classes. The methodology uses the contextual information in order to improve the results. We use structured prediction (structured support vector machine) for learning the parameters. The idea is to incorporate contextual information so that actions in similar images receive the same class. In our experiments using the PASCAL trainval dataset we have noticed significant improvements.
Date: 2014-01-27 18:40:53

Title: Svm classifier with contextual information
Method: NUDT_Context
Affiliation: National University of Defense Technology
Contributors: Li Zhou, Zongtan Zhou, Dewen Hu
Description: Action classification using contextual information. We present a new model of context for action classification, based on the distribution of objects and the semantic category of the scene within images. The scene classification works by creating multiple resolution images and partitioning them into sub-regions with different scales. The visual descriptors of all sub-regions in the same resolution image are directly concatenated for SVM classifiers. Finally, regarding each resolution image as a feature channel, we combine all the feature channels to reach a final decision. The object recognition works by incorporating a multi-resolution representation into the bag-of-features model.
Date: 2011-10-12 17:25:12

Title: Svm classifier with low-level and semantic modelin
Method: NUDT_Low-level_Semantic
Affiliation: National University of Defense Technology
Contributors: Li Zhou, Dewen Hu, Zongtan Zhou
Description: Action classification based on combining low-level and semantic modeling strategies.
Date: 2011-09-30 16:10:58

Title: Random forest with SVM node classifiers
Method: RF_DENSEFTR_SVM
Affiliation: Stanford University
Contributors: Bangpeng Yao, Aditya Khosla, Li Fei-Fei
Description: We use a random forest (RF) approach for action classification. Our method is adapted from our CVPR2011 paper (Yao et al, 2011). We explore two key properties that determine the performance of RF classifiers: discrimination and randomization. (1) Discrimination: In order to obtain strong decision trees, instead of randomly generating feature weights as in the conventional RF approaches, we use discriminative SVM classifiers to train the split for each tree node. (2) Randomization: The correlation between different decision trees needs to be small, such that the combination of all the trees can form an effective RF classifier. We consider a very dense feature space, where we sample image regions that can have any size and location in the image. For each sampled region, we use an SPM feature representation. Since each decision tree samples a specific set of image regions, the correlation between the trees can be reduced.
Date: 2011-10-13 07:37:36

Title: Svm classifier with PHOW features.
Method: SVM-PHOW
Affiliation: West Virginia University
Contributors: Biyun Lai, Yu Zhu, Qin Wu, Guodong Guo
Description: We develop a method for still-image based action recognition. There are 10 action classes plus the “other” action class provided by PASCAL VOC 2011. We extracted PHOW features, a kind of multi-scale dense SIFT implementation, to represent the images. The kernel SVM method is used for training action classifiers. Different kernels are used for the SVM. We also used a learning technique to map the original features into a different space to improve the feature representation. A confidence measure is used to combine the results from different kernels to form the final decision for action classification. The training is performed on the provided training set and tuned using the validation set, and the learned classifiers are then applied to the test data.
Date: 2011-10-14 00:06:25

Title: Supervised Learning with Multiple Features
Method: Supervised learning with multiple feature
Affiliation: University of Missouri - Columbia
Contributors: Xutao Lv, Xiaoyu Wang, Guang Chen, Shuai Tang, Yan Li, Miao Sun, Tony X. Han
Description: Multiple available features are combined and fed into a newly developed supervised learning algorithm. The features include those extracted within the bounding box and those from the whole image. The features from the whole image serve as context information. We mainly use two feature descriptors in our submission, dense SIFT and HOG. The LCC coding method and a spatial pyramid are adopted to generate a histogram for each action image, and the histogram then serves as the feature vector to train and test with the supervised learning algorithm.
Date: 2011-10-13 21:50:30
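
The DSAL and HOBJ+DSAL descriptions above mention using saliency maps to weight visual words in a bag of words feature. The following is only a rough sketch of that idea, not the contributors' implementation: it assumes each local descriptor has already been assigned a visual word index and a saliency value read from the map at its location. The function name, the inputs, and the L1 normalization are all hypothetical choices made for illustration, and learning the saliency map itself is not shown.

```python
def saliency_weighted_histogram(word_ids, saliency, vocab_size):
    """Bag-of-words histogram where each local feature votes with its saliency.

    word_ids:   visual word index for each local descriptor (assumed given).
    saliency:   non-negative saliency value at each descriptor's location.
    vocab_size: number of visual words in the codebook.
    """
    if len(word_ids) != len(saliency):
        raise ValueError("word_ids and saliency must have the same length")

    hist = [0.0] * vocab_size
    for word, weight in zip(word_ids, saliency):
        # A plain bag of words would add 1.0 here; the saliency replaces it.
        hist[word] += weight

    # L1-normalize so images with different feature counts are comparable
    # (an assumption; the description does not specify a normalization).
    total = sum(hist)
    if total > 0:
        hist = [h / total for h in hist]
    return hist
```

With uniform saliency this reduces to an ordinary normalized histogram, so the weighting only changes the feature where the map distinguishes regions.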