Action Classification Results: VOC2012 BETA

Competition "comp10" (train on own data)

This leaderboard shows only those submissions that have been marked public, so the displayed rankings should not be considered definitive.

Average Precision (AP %). The mAP column is the mean over the ten action classes; the per-class columns are jump = jumping, phone = phoning, instr = playinginstrument, read = reading, bike = ridingbike, horse = ridinghorse, run = running, photo = takingphoto, comp = usingcomputer, walk = walking.

Method                       |  mAP | jump | phone | instr | read | bike | horse |  run | photo | comp | walk | Submitted
Human-object Relation        | 92.8 | 91.1 |  89.8 |  95.4 | 87.7 | 98.6 |  98.8 | 95.4 |  91.4 | 95.8 | 84.3 | 01-Dec-2019
merge                        | 91.5 | 92.7 |  89.8 |  96.9 | 87.9 | 96.7 |  99.1 | 93.6 |  86.9 | 95.2 | 75.9 | 21-Feb-2023
Attention                    | 90.2 | 92.7 |  86.0 |  93.2 | 83.7 | 96.6 |  98.8 | 93.5 |  85.3 | 91.8 | 80.1 | 07-Jun-2017
R*CNN                        | 90.2 | 91.5 |  84.4 |  93.6 | 83.2 | 96.9 |  98.4 | 93.8 |  85.9 | 92.6 | 81.8 | 13-Aug-2015
VERY_DEEP_CONVNET_16_19_SVM  | 84.0 | 89.3 |  71.3 |  94.7 | 71.3 |    … |     … |    … |     … |    … |    … | 03-Dec-2014
Oxford_RMP                   | 76.3 | 82.3 |  52.9 |  84.3 | 53.6 | 95.6 |  96.1 | 89.7 |  60.4 | 76.0 | 72.9 | 03-Dec-2014
OXFORD_ALIGNED_BODYPARTS     | 69.6 | 77.0 |  50.4 |  65.3 | 39.5 | 94.1 |  95.9 | 87.7 |  42.7 | 68.6 | 74.5 | 23-Sep-2012
COMBINE_ATTR_PART            | 63.8 | 66.6 |  42.9 |  60.6 | 42.0 | 90.5 |  92.2 | 85.9 |  28.9 | 64.0 | 64.5 | 14-Oct-2011
HU_BU_MIL_MULTI_CUE          | 56.0 | 59.4 |  39.6 |  56.5 | 34.4 | 75.7 |  80.2 | 74.3 |  27.6 | 55.2 | 56.6 | 24-Sep-2012
BERKELEY_ACTION_POSELETS     | 55.1 | 59.3 |  32.4 |  45.4 | 27.5 | 84.5 |  88.3 | 77.2 |  31.2 | 47.4 | 58.2 | 12-Oct-2011
MAPSVM-Poselet               | 42.4 | 26.8 |  30.3 |  28.1 | 23.6 | 72.1 |  82.4 | 67.0 |  19.8 | 26.1 | 47.3 | 13-Oct-2011
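The metric behind these numbers is VOC-style average precision: rank the test images by classifier confidence for each action class and take the area under the precision/recall curve after replacing precision with its monotone envelope. A minimal sketch, assuming binary relevance labels per class (an illustration, not the official PASCAL evaluation code):

```python
import numpy as np

def voc_ap(scores, labels):
    """VOC2012-style average precision: area under the precision/recall
    curve after making precision monotonically non-increasing."""
    order = np.argsort(-np.asarray(scores, dtype=float))
    tp = np.asarray(labels, dtype=float)[order]     # 1 for positives
    fp = 1.0 - tp
    tp_cum, fp_cum = np.cumsum(tp), np.cumsum(fp)
    recall = tp_cum / tp.sum()
    precision = tp_cum / (tp_cum + fp_cum)
    # Sentinels, then take the monotone envelope of precision.
    mrec = np.concatenate(([0.0], recall, [1.0]))
    mpre = np.concatenate(([0.0], precision, [0.0]))
    for i in range(len(mpre) - 2, -1, -1):
        mpre[i] = max(mpre[i], mpre[i + 1])
    # Sum precision over the points where recall increases.
    idx = np.where(mrec[1:] != mrec[:-1])[0]
    return float(np.sum((mrec[idx + 1] - mrec[idx]) * mpre[idx + 1]))
```

A perfect ranking (all positives scored above all negatives) gives an AP of 1.0; the leaderboard reports this quantity as a percentage, averaged over the ten classes for the mAP column.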


Title: Multi-branch attention networks
Method: Attention
Affiliation: University of Liverpool
Contributors: Shiyang Yan, Jeremy S. Smith, Bailing Zhang
Description: Use multiple contextual cues to facilitate the recognition.
Submitted: 2017-06-07 07:40:31

Title: On Recognizing Actions in Still Images via MIL
Method: HU_BU_MIL_MULTI_CUE
Affiliations: Hacettepe University, Bilkent University
Contributors: Cagdas Bas, Fadime Sener, Nazli Ikizler-Cinbis
Description: We propose a multi-cue based approach for recognizing human actions in still images, where candidate object regions are discovered and utilized in a weakly supervised manner. Our approach is weakly supervised in the sense that it does not require any explicitly trained object detector or part/attribute annotation. Instead, a multiple instance learning approach is used over sets of object hypotheses in order to represent objects relevant to the actions. Our results show that using multiple object hypotheses within multiple instance learning is effective for human action recognition in still images, and that such an object representation is suitable for use in conjunction with other visual features.
Submitted: 2012-09-24 00:26:48

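The multiple-instance idea above can be sketched as max-pooling over candidate-region scores: each image is a bag of object hypotheses, and the bag takes the score of its best region. The helper and the linear scorer below are hypothetical illustrations, not the authors' code:

```python
# Hypothetical MIL-style aggregation: an image is a bag of candidate
# object regions; the action score for the bag is the maximum of the
# per-region classifier scores.

def bag_score(region_features, score_region):
    """Max-pool the per-region scores over the bag."""
    return max(score_region(f) for f in region_features)

# Example with a dummy linear scorer (weights are made up):
w = [0.5, -0.2]
score = lambda f: sum(wi * fi for wi, fi in zip(w, f))
regions = [[1.0, 0.0], [0.0, 1.0], [2.0, 0.5]]
best = bag_score(regions, score)
```

During weakly supervised training, only this pooled bag score is compared against the image-level action label, so no region-level annotation is needed.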
Title: Human-object Relation Network
Method: Human-object Relation
Affiliation: Tongji University
Contributors: Wentao Ma, Shuang Liang
Description: We propose the human-object relation network for action recognition in still images. It computes pair-wise relation information from actor and object appearances as well as spatial locations, and enhances both features for action classification.
Submitted: 2019-12-01 13:47:22

Title: Human action recognition from aligned body parts
Method: OXFORD_ALIGNED_BODYPARTS
Affiliation: University of Oxford
Contributors: Minh Hoai, Lubor Ladicky, Andrew Zisserman
Description: We propose a method for human action recognition that uses body-part detectors to localize the human and align feature descriptors. We first automatically detect the upper-body, the hands, and the silhouette. We compute appearance features (SIFT + HOG + color) and location features (aspect ratio + relative position) of the detected body parts. We use the silhouette as an indicator of the pose and as a mask for discarding irrelevant features computed outside the body. We further utilize the detection scores obtained from several object detectors trained on publicly available datasets.
Submitted: 2012-09-23 23:08:28

Title: Regularized Max Pooling
Method: Oxford_RMP
Affiliation: University of Oxford
Contributors: Minh Hoai
Description: We used Regularized Max Pooling (RMP) for human action classification. RMP classifies an image (or an image region) by extracting feature vectors at multiple subwindows at multiple locations and scales. Unlike Spatial Pyramid Matching, where the subwindows are defined purely based on geometric correspondence, RMP accounts for the deformation of discriminative parts. The amount of deformation and the discriminative ability for multiple parts are jointly learned during training. For more information, please refer to: Minh Hoai (2014). Regularized Max Pooling for Image Classification. British Machine Vision Conference.
Submitted: 2014-12-03 01:43:39

Title: R*CNN classifier
Method: R*CNN
Affiliation: UC Berkeley
Contributors: Georgia Gkioxari, Ross Girshick, Jitendra Malik
Description: We use an RCNN classifier that uses action-specific regions for action classification. The primary region of interest containing the actor is re-scored using the auxiliary region. We use an R*CNN network which trains both models (primary and auxiliary) end-to-end.
Submitted: 2015-08-13 19:38:09

Title: Very deep ConvNet features and SVM classifier
Method: VERY_DEEP_CONVNET_16_19_SVM
Affiliation: Visual Geometry Group, University of Oxford
Contributors: Karen Simonyan, Andrew Zisserman
Description: The results were obtained using multi-scale convolutional features and an SVM classifier. The features were computed using two very deep convolutional networks (16 and 19 weight layers), pre-trained on ILSVRC-2012 (1000-class subset of ImageNet). Fine-tuning on VOC-2012 was not performed. The algorithm is similar to the one used for the VOC-2012 classification task (comp2), which is described in our paper "Very Deep Convolutional Networks for Large-Scale Image Recognition". The main difference is that for the action classification task we computed separate descriptors for the whole image and for the provided bounding box, and stacked them to obtain the final representation.
Submitted: 2014-12-03 19:32:35

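The descriptor-stacking step in that entry can be sketched as follows; the L2 normalisation and the feature shapes are my assumptions for illustration, not details taken from the submission:

```python
import numpy as np

def stacked_descriptor(image_feat, box_feat):
    """Concatenate a whole-image descriptor with a bounding-box
    descriptor to form the input to the per-action SVM.

    The per-descriptor L2 normalisation is an assumption here."""
    img = np.asarray(image_feat, dtype=float)
    box = np.asarray(box_feat, dtype=float)
    img = img / np.linalg.norm(img)
    box = box / np.linalg.norm(box)
    return np.concatenate([img, box])
```

Stacking lets the classifier see both the scene context (whole image) and the actor (provided person box) in a single fixed-length vector.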
Title: merge
Method: merge
Affiliation: merge
Contributors: merge
Description: merge
Submitted: 2023-02-21 16:08:32

Title: Poselets trained on action categories
Method: BERKELEY_ACTION_POSELETS
Affiliation: University of California, Berkeley
Contributors: Subhransu Maji, Lubomir Bourdev, Jitendra Malik
Description: This is based on our CVPR 2011 paper: "Action recognition using a distributed representation of pose and appearance", Subhransu Maji, Lubomir Bourdev and Jitendra Malik. For this submission we train 200 poselets for each action category. In addition we train poselets based on subcategory labels for playinginstrument and ridingbike. Linear SVMs are trained on the "poselet activation vector" along with features from object detectors for four categories: motorbike, bicycle, horse and tvmonitor. Context models re-rank the objects at the image level, as described in the CVPR'11 paper.
Submitted: 2011-10-12 06:24:46

Title: Combine attribute classifiers and object detectors
Method: COMBINE_ATTR_PART
Affiliation: Stanford University
Contributors: Bangpeng Yao, Aditya Khosla, Li Fei-Fei
Description: Our approach combines attribute classifiers and part detectors for action classification. The method is adapted from our ICCV2011 paper (Yao et al, 2011). The "attributes" are trained by using our random forest classifier (Yao et al, 2011), which are strong classifiers that consider global properties of action classes. As for "parts", we consider the objects that interact with the humans, such as horses, books, etc. Specifically, we take the object bank (Li et al, 2010) detectors that are trained on the ImageNet dataset for part representation. The confidence scores obtained from attribute classifiers and part detectors are combined to form the final score for each image.
Submitted: 2011-10-14 00:08:58

Title: MAP-based SVM classifier with poselet features
Method: MAPSVM-Poselet
Affiliation: Stanford University
Contributors: Tim Tang, Pawan Kumar, Ben Packer, Daphne Koller
Description: We build on the Poselet-based feature vector for action classification (Maji et al., 2010) in four ways: (i) we use a 2-level spatial pyramid (Lazebnik et al., CVPR 2006); (ii) we obtain a segmentation of the person bounding box into foreground and background using an efficient GrabCut-like scheme (Rother et al., SIGGRAPH 2004), and use it to divide the feature vector into two parts, one corresponding to the foreground and one corresponding to the background; (iii) we learn a mixture model to deal with the different visual aspects of people performing the same action; and (iv) we optimize mean average precision (Yue et al., SIGIR 2007) instead of the 0/1 loss used in the standard binary SVM. All action classifiers are trained on only the VOC 2011 data, with additional annotations required to compute the Poselets. All hyperparameters are set using 5-fold cross-validation.
Submitted: 2011-10-13 23:07:42