PASCAL VOC Challenge performance evaluation and download server
All figures are average precision (AP, %); the mean column averages the ten action classes.

Method | mean | jumping | phoning | playing instrument | reading | riding bike | riding horse | running | taking photo | using computer | walking | submission date
---|---|---|---|---|---|---|---|---|---|---|---|---
Human-object Relation | 92.8 | 91.1 | 89.8 | 95.4 | 87.7 | 98.6 | 98.8 | 95.4 | 91.4 | 95.8 | 84.3 | 01-Dec-2019
merge | 91.5 | 92.7 | 89.8 | 96.9 | 87.9 | 96.7 | 99.1 | 93.6 | 86.9 | 95.2 | 75.9 | 21-Feb-2023
Attention | 90.2 | 92.7 | 86.0 | 93.2 | 83.7 | 96.6 | 98.8 | 93.5 | 85.3 | 91.8 | 80.1 | 07-Jun-2017
R*CNN | 90.2 | 91.5 | 84.4 | 93.6 | 83.2 | 96.9 | 98.4 | 93.8 | 85.9 | 92.6 | 81.8 | 13-Aug-2015
VERY_DEEP_CONVNET_16_19_SVM | 84.0 | 89.3 | 71.3 | 94.7 | 71.3 | 97.1 | 98.2 | 90.2 | 73.3 | 88.5 | 66.4 | 03-Dec-2014
Oxford_RMP | 76.3 | 82.3 | 52.9 | 84.3 | 53.6 | 95.6 | 96.1 | 89.7 | 60.4 | 76.0 | 72.9 | 03-Dec-2014
OXFORD_ALIGNED_BODYPARTS | 69.6 | 77.0 | 50.4 | 65.3 | 39.5 | 94.1 | 95.9 | 87.7 | 42.7 | 68.6 | 74.5 | 23-Sep-2012
COMBINE_ATTR_PART | 63.8 | 66.6 | 42.9 | 60.6 | 42.0 | 90.5 | 92.2 | 85.9 | 28.9 | 64.0 | 64.5 | 14-Oct-2011
HU_BU_MIL_MULTI_CUE | 56.0 | 59.4 | 39.6 | 56.5 | 34.4 | 75.7 | 80.2 | 74.3 | 27.6 | 55.2 | 56.6 | 24-Sep-2012
BERKELEY_ACTION_POSELETS | 55.1 | 59.3 | 32.4 | 45.4 | 27.5 | 84.5 | 88.3 | 77.2 | 31.2 | 47.4 | 58.2 | 12-Oct-2011
MAPSVM-Poselet | 42.4 | 26.8 | 30.3 | 28.1 | 23.6 | 72.1 | 82.4 | 67.0 | 19.8 | 26.1 | 47.3 | 13-Oct-2011
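
The mean column is simply the arithmetic mean of the ten per-class APs. A minimal sketch of that aggregation, using scikit-learn's `average_precision_score` as a stand-in for the official VOC development-kit AP routine (which uses its own interpolation, so values differ slightly), with hypothetical label and score arrays:

```python
# Sketch of the leaderboard aggregation: per-class AP, then their mean.
# average_precision_score is a stand-in for the VOC devkit's AP computation.
import numpy as np
from sklearn.metrics import average_precision_score

CLASSES = ["jumping", "phoning", "playinginstrument", "reading",
           "ridingbike", "ridinghorse", "running", "takingphoto",
           "usingcomputer", "walking"]

def leaderboard_row(y_true, y_score):
    """y_true: (n, 10) binary labels per person box; y_score: (n, 10) confidences."""
    aps = [100.0 * average_precision_score(y_true[:, k], y_score[:, k])
           for k in range(len(CLASSES))]
    row = dict(zip(CLASSES, aps))
    row["mean"] = float(np.mean(aps))   # the leaderboard's "mean" column
    return row
```

Details of each submission follow in the table below.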
Title | Method | Affiliation | Contributors | Description | Date |
---|---|---|---|---|---|
Multi-branch attention networks | Attention | University of Liverpool | Shiyang Yan, Jeremy S. Smith, Bailing Zhang | Uses multiple contextual cues to facilitate recognition. | 2017-06-07 07:40:31 |
On Recognizing Actions in Still Images via MIL | HU_BU_MIL_MULTI_CUE | Hacettepe University, Bilkent University | Cagdas Bas, Fadime Sener, Nazli Ikizler-Cinbis | We propose a multi-cue based approach for recognizing human actions in still images, where candidate object regions are discovered and utilized in a weakly supervised manner. Our approach is weakly supervised in the sense that it does not require any explicitly trained object detector or part/attribute annotation. Instead, a multiple instance learning approach is used over sets of object hypotheses in order to represent objects relevant to the actions. Our results show that using multiple object hypotheses within multiple instance learning is effective for human action recognition in still images, and that such an object representation is suitable for use in conjunction with other visual features. | 2012-09-24 00:26:48 |
Human-object Relation Network | Human-object Relation | Tongji University | Wentao Ma, Shuang Liang | We propose the human-object relation network for action recognition in still images. It computes pair-wise relation information from actor and object appearances as well as spatial locations, and enhances both features for action classification. | 2019-12-01 13:47:22 |
Human action recognition from aligned body parts | OXFORD_ALIGNED_BODYPARTS | University of Oxford | Minh Hoai, Lubor Ladicky, Andrew Zisserman | We propose a method for human action recognition that uses body-part detectors to localize the human and align feature descriptors. We first automatically detect the upper-body, the hands, and the silhouette. We compute appearance features (SIFT + HOG + color) and location features (aspect ratio + relative position) of the detected body parts. We use the silhouette as an indicator of the pose and as a mask for discarding irrelevant features computed outside the body. We further utilize the detection scores obtained from several object detectors trained on publicly available datasets. | 2012-09-23 23:08:28 |
Regularized Max Pooling | Oxford_RMP | University of Oxford | Minh Hoai | We used Regularized Max Pooling (RMP) for human action classification. RMP classifies an image (or an image region) by extracting feature vectors from multiple subwindows at different locations and scales. Unlike Spatial Pyramid Matching, where the subwindows are defined purely by geometric correspondence, RMP accounts for the deformation of discriminative parts. The amount of deformation and the discriminative ability of multiple parts are jointly learned during training. For more information, please refer to: Minh Hoai (2014). Regularized Max Pooling for Image Classification. British Machine Vision Conference. | 2014-12-03 01:43:39 |
R*CNN classifier | R*CNN | UC Berkeley | Georgia Gkioxari, Ross Girshick, Jitendra Malik | We use an R*CNN classifier that uses action-specific regions for action classification. The primary region of interest, containing the actor, is re-scored using an auxiliary region. The R*CNN network trains both models (primary and auxiliary) end-to-end. | 2015-08-13 19:38:09 |
Very deep ConvNet features and SVM classifier | VERY_DEEP_CONVNET_16_19_SVM | Visual Geometry Group, University of Oxford | Karen Simonyan, Andrew Zisserman | The results were obtained using multi-scale convolutional features and an SVM classifier. The features were computed using two very deep convolutional networks (16 and 19 weight layers), pre-trained on ILSVRC-2012 (1000-class subset of ImageNet). Fine-tuning on VOC-2012 was not performed. The algorithm is similar to the one used for the VOC-2012 classification task (comp2), which is described in our paper: "Very Deep Convolutional Networks for Large-Scale Image Recognition" (http://arxiv.org/pdf/1409.1556). The main difference is that for the action classification task we computed separate descriptors for the whole image and for the provided bounding box, and stacked them to obtain the final representation. (A rough sketch of this recipe follows the table.) | 2014-12-03 19:32:35 |
merge | merge | merge | merge | merge | 2023-02-21 16:08:32 |
Poselets trained on action categories. | BERKELEY_ACTION_POSELETS | University of California, Berkeley | Subhransu Maji, Lubomir Bourdev, Jitendra Malik | This is based on our CVPR 2011 paper: "Action recognition using a distributed representation of pose and appearance", Subhransu Maji, Lubomir Bourdev and Jitendra Malik. For this submission we train 200 poselets for each action category. In addition, we train poselets based on subcategory labels for playinginstrument and ridingbike. Linear SVMs are trained on the "poselet activation vector", along with features from object detectors for four categories: motorbike, bicycle, horse and tvmonitor. Context models re-rank the objects at the image level, as described in the CVPR'11 paper. | 2011-10-12 06:24:46 |
Combine attribute classifiers and object detectors | COMBINE_ATTR_PART | Stanford University | Bangpeng Yao, Aditya Khosla, Li Fei-Fei | Our approach combines attribute classifiers and part detectors for action classification. The method is adapted from our ICCV 2011 paper (Yao et al., 2011). The "attributes" are trained using our random forest classifier (Yao et al., 2011); they are strong classifiers that consider global properties of action classes. As for "parts", we consider the objects that interact with the humans, such as horses, books, etc. Specifically, we take the object bank detectors (Li et al., 2010), trained on the ImageNet dataset, for the part representation. The confidence scores obtained from the attribute classifiers and part detectors are combined to form the final score for each image. | 2011-10-14 00:08:58 |
MAP-based SVM classifier with poselet features | MAPSVM-Poselet | Stanford University | Tim Tang, Pawan Kumar, Ben Packer, Daphne Koller | We build on the Poselet-based feature vector for action classification (Maji et al., 2010) in four ways: (i) we use a 2-level spatial pyramid (Lazebnik et al., CVPR 2006); (ii) we obtain a segmentation of the person bounding box into foreground and background using an efficient GrabCut-like scheme (Rother et al., SIGGRAPH 2004), and use it to divide the feature vector into two parts, one corresponding to the foreground and one to the background; (iii) we learn a mixture model to deal with the different visual aspects of people performing the same action; and (iv) we optimize mean average precision (Yue et al., SIGIR 2007) instead of the 0/1 loss used in the standard binary SVM. All action classifiers are trained on only the VOC 2011 data, with additional annotations required to compute the Poselets. All hyperparameters are set using 5-fold cross-validation. | 2011-10-13 23:07:42 |
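
To make the strongest hand-crafted-free recipe above concrete, here is a rough sketch (not the authors' code) of the deep-features + linear-SVM pipeline described in the VERY_DEEP_CONVNET_16_19_SVM entry. Assumptions: torchvision's VGG-16 stands in for the original 16- and 19-layer networks, a single 224x224 scale replaces the multi-scale features, and `stacked_descriptor` is a hypothetical helper name.

```python
# Sketch of "deep ConvNet features + SVM": extract one descriptor for the
# whole image and one for the provided person box, stack them, then train
# a one-vs-rest linear SVM per action class. No fine-tuning, as in the entry.
import torch
import torchvision.models as models
import torchvision.transforms.functional as TF
from sklearn.svm import LinearSVC

cnn = models.vgg16(weights=models.VGG16_Weights.IMAGENET1K_V1)
cnn.classifier = cnn.classifier[:-1]  # drop the 1000-way layer; keep 4096-d features
cnn.eval()

@torch.no_grad()
def stacked_descriptor(img, person_box):
    """img: PIL image; person_box: (left, top, right, bottom) from the annotation."""
    feats = []
    for crop in (img, img.crop(person_box)):
        x = TF.to_tensor(TF.resize(crop, (224, 224)))
        x = TF.normalize(x, mean=[0.485, 0.456, 0.406],
                         std=[0.229, 0.224, 0.225])
        feats.append(cnn(x.unsqueeze(0)).squeeze(0))
    return torch.cat(feats).numpy()  # 8192-d image + box descriptor

# One binary linear SVM per action; X is (n_persons, 8192), y_binary in {0, 1}.
# clf = LinearSVC(C=1.0).fit(X, y_binary)
# scores = clf.decision_function(X_test)  # ranked to compute the per-class AP
```

Truncating the network at the penultimate layer turns the ImageNet classifier into a generic feature extractor; stacking the whole-image and person-box descriptors keeps both the scene context and the actor appearance, which suits this task since the person bounding box is given at test time.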