PASCAL VOC Challenge performance evaluation and download server
Method | mean AP | jumping | phoning | playing instrument | reading | riding bike | riding horse | running | taking photo | using computer | walking | submission date |
---|---|---|---|---|---|---|---|---|---|---|---|---|
PoseReuse_ClsStreamPoseStream | 90.0 | 90.6 | 85.8 | 92.5 | 79.7 | 97.5 | 98.4 | 94.8 | 87.5 | 91.0 | 82.6 | 25-Sep-2017 |
PoseReuse_ClsStream | 88.4 | 89.6 | 84.2 | 91.8 | 77.1 | 97.1 | 98.2 | 93.4 | 85.4 | 89.4 | 78.3 | 25-Sep-2017 |
STANFORD_RF_MULTFEAT_SVM | 69.1 | 75.7 | 44.8 | 66.6 | 44.4 | 93.2 | 94.2 | 87.6 | 38.4 | 70.6 | 75.6 | 23-Sep-2012 |
SZU_DPM_RF_SVM | 67.1 | 73.8 | 45.0 | 62.8 | 41.4 | 93.0 | 93.4 | 87.8 | 35.0 | 64.7 | 73.5 | 23-Sep-2012 |
RF_DENSEFTR_SVM | 63.6 | 65.8 | 42.7 | 59.8 | 41.3 | 90.0 | 92.1 | 86.4 | 29.1 | 62.4 | 66.1 | 13-Oct-2011 |
NUDT_Context | 60.7 | 65.6 | 42.9 | 57.2 | 34.4 | 88.9 | 90.0 | 87.6 | 25.4 | 54.8 | 59.9 | 12-Oct-2011 |
NUDT_Low-level_Semantic | 60.1 | 66.1 | 42.8 | 53.7 | 34.9 | 88.9 | 89.9 | 87.2 | 25.3 | 53.9 | 58.5 | 30-Sep-2011 |
HOBJ+DSAL | 57.0 | 71.6 | 51.6 | 77.3 | 37.5 | 86.5 | 89.4 | 83.7 | 25.2 | 59.1 | 59.7 | 13-Oct-2011 |
Supervised learning with multiple feature | 54.5 | 58.6 | 38.3 | 48.3 | 30.2 | 81.7 | 83.0 | 78.0 | 21.2 | 51.4 | 54.0 | 13-Oct-2011 |
M4AP | 53.7 | 47.8 | 35.4 | 46.7 | 28.7 | 83.4 | 85.2 | 84.2 | 28.5 | 42.4 | 54.0 | 27-Jan-2014 |
DSAL | 50.6 | 62.1 | 40.9 | 60.3 | 32.8 | 80.9 | 83.6 | 80.0 | 23.2 | 54.0 | 50.6 | 13-Oct-2011 |
SVM-PHOW | 35.9 | 42.3 | 31.0 | 32.0 | 26.4 | 48.6 | 46.2 | 58.9 | 13.6 | 24.2 | 35.9 | 14-Oct-2011 |
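The mean column can be recomputed from the ten per-class values in each row. The following is a minimal Python sketch that assumes the figures are average precision percentages and that the mean is a plain unweighted average rounded to one decimal. The helper name `mean_ap` is hypothetical. For most rows the recomputed value agrees with the reported mean to within the last digit, which is expected if the server averages unrounded values. The DSAL and HOBJ+DSAL rows are exceptions: they recompute to about 56.8 and 64.2, against the reported 50.6 and 57.0. Those two means may therefore have been computed differently or transcribed incorrectly.

```python
# Column order of the leaderboard (ten action classes).
CLASSES = [
    "jumping", "phoning", "playing instrument", "reading", "riding bike",
    "riding horse", "running", "taking photo", "using computer", "walking",
]

def mean_ap(per_class_ap):
    """Unweighted mean of the per-class AP values, rounded to one decimal."""
    if len(per_class_ap) != len(CLASSES):
        raise ValueError("expected one AP value per action class")
    return round(sum(per_class_ap) / len(per_class_ap), 1)

# Per-class values copied from the PoseReuse_ClsStreamPoseStream row.
pose_stream = [90.6, 85.8, 92.5, 79.7, 97.5, 98.4, 94.8, 87.5, 91.0, 82.6]
# Per-class values copied from the DSAL row.
dsal = [62.1, 40.9, 60.3, 32.8, 80.9, 83.6, 80.0, 23.2, 54.0, 50.6]

print(mean_ap(pose_stream))  # 90.0, matching the reported mean
print(mean_ap(dsal))         # 56.8, while the table reports 50.6
```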
Title | Method | Affiliation | Contributors | Description | Date |
---|---|---|---|---|---|
CNN classifier with semantic region from pose | PoseReuse_ClsStream | Southeast University of China | Jian Dong, Changyin Sun, Wankou Yang | The bounding box region and the semantic regions obtained from pose estimation are fed into an end-to-end CNN. | 2017-09-25 17:33:54 |
CNN classifier with two models | PoseReuse_ClsStreamPoseStream | Southeast University of China | Jian Dong, Changyin Sun, Wankou Yang | A weighted combination of two CNN models initialized with different parameters: one from a general image classification model, the other from a pose estimation model. | 2017-09-25 17:40:55 |
Random forest with SVM on multiple features | STANFORD_RF_MULTFEAT_SVM | Stanford University; MIT | Aditya Khosla, Rui Zhang, Bangpeng Yao, and Li Fei-Fei | We use a random forest (RF) approach for action classification. Our method is adapted from our CVPR2011 paper (Khosla*, Yao*, Fei-Fei, 2011). We explore two key properties that determine the performance of RF classifiers: discrimination and randomization. (1) Discrimination: we obtain strong decision trees by using discriminative SVM classifiers at each tree node. (2) Randomization: we consider a very dense feature space, in which we sample image regions of any size and location in the image. Compared with VOC2011, we use multiple features, including SIFT, HOG, Color Naming, Object Bank and LBP, and we modify some of the existing features to better suit our needs. Further, we perform tree selection (similar to feature selection) to identify more discriminative regions in a class-specific manner. | 2012-09-23 14:08:50 |
Part based models and object detection | SZU_DPM_RF_SVM | Shenzhen University | Shiqi Yu, Shengyin Wu, Wensheng Chen | Based on "Object Detection with Discriminatively Trained Part Based Models", P. Felzenszwalb, R. Girshick, D. McAllester, D. Ramanan, IEEE TPAMI, 2010; and "Combining Randomization and Discrimination for Fine-Grained Image Categorization", B. Yao, A. Khosla, and L. Fei-Fei, CVPR 2011. We combine human part-based models with object detectors for action classification. The method rests on the simple principle that subjects performing the same action exhibit similar poses. Accordingly, the deformable part-based model [Felzenszwalb et al. TPAMI 2010] is employed to describe the pose of the human body, and the positions and textures of the parts are extracted as features for action classification. For the detection part of the method, we use the random forest (RF) described in [Yao et al. CVPR 2011], which can detect human body parts and the objects that interact with the human. Finally, we fuse the features and scores from the two models to obtain a stronger classifier. | 2012-09-23 06:11:57 |
Discriminative spatial saliency | DSAL | Univ Caen / INRIA LEAR | Gaurav Sharma, Frederic Jurie, Cordelia Schmid | We propose to learn discriminative saliency maps for images, which highlight the regions that are more discriminative for the current classification task. We use the saliency maps to weight the visual words, improving the discriminative capacity of bag-of-words features. The approach is motivated by the observation that, for many human actions and attributes, local regions are highly discriminative; e.g., for running, the bent arms and legs are highly discriminative. In addition, we combine features based on SIFT, HOG, color and texture. | 2011-10-13 20:42:10 |
Human obj interaction and discriminative saliency | HOBJ+DSAL | Univ Caen / INRIA LEAR | Gaurav Sharma, Alessandro Prest, Frederic Jurie, Vittorio Ferrari, Cordelia Schmid | We use the weakly supervised approach of Prest et al. (PAMI 2010) to learn human actions modeled as interactions between humans and objects. The human bounding box is taken as the reference, and the object relevant to the action, together with its spatial relation to the human, is learnt automatically. This is combined with a method that learns discriminative spatial saliency, highlighting the regions that are more discriminative for the current classification task. We use the saliency maps to weight the visual words, improving the discriminative capacity of bag-of-words features. In addition, we combine features based on SIFT, HOG, color and texture. | 2011-10-13 20:45:19 |
Max margin | M4AP | INRIA and Ecole Centrale de Paris | Puneet Kumar and M Pawan Kumar | We use different features to capture the action class and the contextual information contained within the classes, and we use this contextual information to improve the results. The parameters are learned with structured prediction (a structured support vector machine). The idea is to incorporate contextual information so that similar images are assigned the same action class. In our experiments on the PASCAL trainval set, we observed significant improvements. | 2014-01-27 18:40:53 |
SVM classifier with contextual information | NUDT_Context | National University of Defense Technology | Li Zhou, Zongtan Zhou, Dewen Hu | Action classification using contextual information. We present a new context model for action classification based on the distribution of objects and the semantic category of the scene within images. Scene classification works by creating images at multiple resolutions and partitioning each into sub-regions at different scales. The visual descriptors of all sub-regions at the same resolution are directly concatenated and fed to SVM classifiers. Finally, treating each resolution as a feature channel, we combine all feature channels to reach a final decision. Object recognition works by incorporating a multi-resolution representation into the bag-of-features model. | 2011-10-12 17:25:12 |
SVM classifier with low-level and semantic modeling | NUDT_Low-level_Semantic | National University of Defense Technology | Li Zhou, Dewen Hu, Zongtan Zhou | Action classification based on combining low-level and semantic modeling strategies. | 2011-09-30 16:10:58 |
Random forest with SVM node classifiers | RF_DENSEFTR_SVM | Stanford University | Bangpeng Yao, Aditya Khosla, Li Fei-Fei | We use a random forest (RF) approach for action classification. Our method is adapted from our CVPR2011 paper (Yao et al., 2011). We explore two key properties that determine the performance of RF classifiers: discrimination and randomization. (1) Discrimination: to obtain strong decision trees, instead of randomly generating feature weights as in conventional RF approaches, we use discriminative SVM classifiers to train the split at each tree node. (2) Randomization: the correlation between different decision trees needs to be small so that the combination of all the trees forms an effective RF classifier. We consider a very dense feature space, in which we sample image regions of any size and location in the image. For each sampled region, we use an SPM feature representation. Since each decision tree samples a specific set of image regions, the correlation between the trees is reduced. | 2011-10-13 07:37:36 |
SVM classifier with PHOW features | SVM-PHOW | West Virginia University | Biyun Lai, Yu Zhu, Qin Wu, Guodong Guo | We develop a method for still-image action recognition. There are 10 action classes, plus the “other” class, provided by PASCAL VOC 2011. We extract PHOW features, a multi-scale dense SIFT implementation, to represent the images. Kernel SVMs with different kernels are trained as action classifiers. We also use a learning technique to map the original features into a different space to improve the feature representation. A confidence measure is used to combine the results from the different kernels into the final action classification decision. Training is performed on the provided training set and tuned using the validation set, and the learned classifiers are then applied to the test data. | 2011-10-14 00:06:25 |
Supervised Learning with Multiple Features | Supervised learning with multiple feature | University of Missouri - Columbia | Xutao Lv, Xiaoyu Wang, Guang Chen, Shuai Tang, Yan Li, Miao Sun, Tony X. Han | Multiple available features are combined and fed into a newly developed supervised learning algorithm. The features include those extracted within the bounding box and those from the whole image; the whole-image features serve as context information. We mainly use two feature descriptors in our submission: dense SIFT and HOG. LCC coding and a spatial pyramid are used to generate a histogram for each action image, and this histogram serves as the feature vector for training and testing with the supervised learning algorithm. | 2011-10-13 21:50:30 |
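The PoseReuse_ClsStreamPoseStream entry weights two CNN models: one initialized from a general image classification model, the other from a pose estimation model. Its mean AP is 1.6 points above that of the single-stream PoseReuse_ClsStream. The entry does not say how the weighting is done. The sketch below assumes late fusion of per-class scores with a single scalar weight. The function name, the default weight of 0.5 and the example scores are all hypothetical.

```python
from typing import Dict

def fuse_scores(
    cls_stream: Dict[str, float],
    pose_stream: Dict[str, float],
    weight: float = 0.5,  # hypothetical; the entry does not report its weight
) -> Dict[str, float]:
    """Late fusion: per-class weighted average of two models' scores."""
    if not 0.0 <= weight <= 1.0:
        raise ValueError("weight must lie in [0, 1]")
    if cls_stream.keys() != pose_stream.keys():
        raise ValueError("both streams must score the same action classes")
    return {
        action: weight * cls_stream[action] + (1.0 - weight) * pose_stream[action]
        for action in cls_stream
    }

# Hypothetical scores for one person box from each stream.
cls_scores = {"jumping": 0.8, "running": 0.2}
pose_scores = {"jumping": 0.4, "running": 0.6}
print(fuse_scores(cls_scores, pose_scores, weight=0.5))
```

Each fused per-class score would then be ranked across all test boxes to compute that class's AP. The weight itself would plausibly be tuned on validation data.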
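The DSAL and HOBJ+DSAL entries use learned saliency maps to weight the visual words of a bag-of-words feature. The sketch below illustrates only that weighting step, under these assumptions: each local feature has one hard-assigned visual word, the saliency map is a 2-D grid indexed by the feature's (row, column) location, and the histogram is L1-normalized. How the saliency maps are learned is not shown, and the function name is hypothetical.

```python
from typing import List, Sequence, Tuple

def saliency_weighted_bow(
    word_ids: Sequence[int],
    positions: Sequence[Tuple[int, int]],  # (row, col) of each local feature
    saliency: List[List[float]],           # saliency value per (row, col)
    vocab_size: int,
) -> List[float]:
    """Bag-of-words histogram in which each local feature votes with the
    saliency value at its location instead of a count of 1."""
    if len(word_ids) != len(positions):
        raise ValueError("need one position per visual word")
    hist = [0.0] * vocab_size
    for word, (row, col) in zip(word_ids, positions):
        hist[word] += saliency[row][col]
    total = sum(hist)
    # L1-normalize so images with different numbers of features are comparable.
    return [h / total for h in hist] if total > 0 else hist

# Toy 2x2 saliency map: the top-left location is the most discriminative.
saliency_map = [[0.9, 0.1], [0.5, 0.5]]
hist = saliency_weighted_bow([0, 1, 1, 2], [(0, 0), (0, 1), (1, 0), (1, 1)],
                             saliency_map, vocab_size=3)
print(hist)  # roughly [0.45, 0.3, 0.25]
```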
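Several entries describe partitioning an image into sub-regions at several scales and concatenating the per-region descriptors. NUDT_Context does this per resolution, and the Missouri entry and RF_DENSEFTR_SVM use a spatial pyramid. The following is a minimal sketch of that grid-partitioning step. It assumes hard-assigned visual words, feature coordinates scaled to [0, 1], and unweighted pyramid levels. Spatial pyramid matching as usually formulated also weights the levels, and LCC coding would replace the hard counts with coding coefficients; neither refinement is shown. The function name is hypothetical.

```python
from typing import List, Sequence, Tuple

def spatial_pyramid(
    word_ids: Sequence[int],
    points: Sequence[Tuple[float, float]],  # (x, y) of each feature, scaled to [0, 1]
    vocab_size: int,
    levels: int = 2,
) -> List[float]:
    """Concatenate visual-word histograms of a 1x1, 2x2, 4x4, ... grid of sub-regions."""
    if len(word_ids) != len(points):
        raise ValueError("need one point per visual word")
    feature: List[float] = []
    for level in range(levels + 1):
        cells = 2 ** level  # sub-regions per side at this scale
        grid = [[0.0] * vocab_size for _ in range(cells * cells)]
        for word, (x, y) in zip(word_ids, points):
            # Clamp so features lying exactly on the right/bottom edge stay in range.
            cx = min(int(x * cells), cells - 1)
            cy = min(int(y * cells), cells - 1)
            grid[cy * cells + cx][word] += 1.0
        for hist in grid:
            feature.extend(hist)  # sub-region histograms are simply concatenated
    return feature

# Two features with hypothetical word ids, one near each corner of the image.
vec = spatial_pyramid([0, 1], [(0.1, 0.1), (0.9, 0.9)], vocab_size=2, levels=1)
print(len(vec))  # 10 = 2 words x (1 + 4) sub-regions
```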