PASCAL VOC Challenge performance evaluation server

Action Classification Results: VOC2012 (BETA)

Competition "comp10" (train on own data)

This leaderboard shows only those submissions that have been marked as public, and so the displayed rankings should not be considered as definitive.

The highest scoring entry in each column is shown in bold.
Clicking on the blue arrow symbol at the top of a column orders the submissions from high to low with respect to performance on that column.

Average Precision (AP %)

	method	mean	jumping	phoning	playing instrument	reading	riding bike	riding horse	running	taking photo	using computer	walking	submission date
	Human-object Relation	92.8	91.1	89.8	95.4	87.7	98.6	98.8	95.4	91.4	95.8	84.3	01-Dec-2019
	merge	91.5	92.7	89.8	96.9	87.9	96.7	99.1	93.6	86.9	95.2	75.9	21-Feb-2023
	Attention	90.2	92.7	86.0	93.2	83.7	96.6	98.8	93.5	85.3	91.8	80.1	07-Jun-2017
	R*CNN	90.2	91.5	84.4	93.6	83.2	96.9	98.4	93.8	85.9	92.6	81.8	13-Aug-2015
	VERY_DEEP_CONVNET_16_19_SVM	84.0	89.3	71.3	94.7	71.3	97.1	98.2	90.2	73.3	88.5	66.4	03-Dec-2014
	Oxford_RMP	76.3	82.3	52.9	84.3	53.6	95.6	96.1	89.7	60.4	76.0	72.9	03-Dec-2014
	OXFORD_ALIGNED_BODYPARTS	69.6	77.0	50.4	65.3	39.5	94.1	95.9	87.7	42.7	68.6	74.5	23-Sep-2012
	COMBINE_ATTR_PART	63.8	66.6	42.9	60.6	42.0	90.5	92.2	85.9	28.9	64.0	64.5	14-Oct-2011
	HU_BU_MIL_MULTI_CUE	56.0	59.4	39.6	56.5	34.4	75.7	80.2	74.3	27.6	55.2	56.6	24-Sep-2012
	BERKELEY_ACTION_POSELETS	55.1	59.3	32.4	45.4	27.5	84.5	88.3	77.2	31.2	47.4	58.2	12-Oct-2011
	MAPSVM-Poselet	42.4	26.8	30.3	28.1	23.6	72.1	82.4	67.0	19.8	26.1	47.3	13-Oct-2011
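
The "mean" column appears to be the arithmetic mean of the ten per-class AP values, rounded to one decimal place. The sketch below checks that reading against three rows copied from the table and re-ranks those rows by a single class column, as the column arrows do on the live page. The names `CLASSES`, `ROWS`, `mean_ap` and `rank_by` are illustrative only and are not part of the evaluation server.

```python
# Class columns in the order they appear in the table header.
CLASSES = [
    "jumping", "phoning", "playing instrument", "reading", "riding bike",
    "riding horse", "running", "taking photo", "using computer", "walking",
]

# method -> (reported mean AP %, per-class AP % in CLASSES order),
# copied from three rows of the table above.
ROWS = {
    "Human-object Relation": (
        92.8, [91.1, 89.8, 95.4, 87.7, 98.6, 98.8, 95.4, 91.4, 95.8, 84.3]),
    "merge": (
        91.5, [92.7, 89.8, 96.9, 87.9, 96.7, 99.1, 93.6, 86.9, 95.2, 75.9]),
    "R*CNN": (
        90.2, [91.5, 84.4, 93.6, 83.2, 96.9, 98.4, 93.8, 85.9, 92.6, 81.8]),
}


def mean_ap(per_class):
    """Arithmetic mean of the per-class AP values (assumed definition)."""
    return sum(per_class) / len(per_class)


def rank_by(column):
    """Return method names ordered from high to low AP on one class column."""
    idx = CLASSES.index(column)
    return sorted(ROWS, key=lambda name: ROWS[name][1][idx], reverse=True)


if __name__ == "__main__":
    for name, (reported, per_class) in ROWS.items():
        print(f"{name}: reported {reported}, recomputed {mean_ap(per_class):.2f}")
    print(rank_by("walking"))
```

For these three rows the recomputed means agree with the reported values to within rounding, and the ranking changes with the column chosen: "merge" leads on riding horse but is last of the three on walking.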

Abbreviations

Title	Method	Affiliation	Contributors	Description	Date
Multi-branch attention networks	Attention	University of Liverpool	Shiyang Yan, Jeremy S. Smith, Bailing Zhang	Use multiple contextual cues to facilitate the recognition.	2017-06-07 07:40:31
On Recognizing Actions in Still Images via MIL	HU_BU_MIL_MULTI_CUE	Hacettepe University, Bilkent University	Cagdas Bas, Fadime Sener, Nazli Ikizler-Cinbis	We propose a multi-cue based approach for recognizing human actions in still images, where candidate object regions are discovered and utilized in a weakly supervised manner. Our approach is weakly supervised in the sense that it does not require any explicitly trained object detector or part/attribute annotation. Instead, a multiple instance learning approach is used over sets of object hypotheses in order to represent objects relevant to the actions. Our results show that using multiple object hypotheses within multiple instance learning is effective for human action recognition in still images and such an object representation is suitable for use in conjunction with other visual features.	2012-09-24 00:26:48
Human-object Relation Network	Human-object Relation	Tongji University	Wentao Ma, Shuang Liang	We propose the human-object relation network for action recognition in still images. It computes pair-wise relation information from actor and object appearances as well as spatial locations, and enhances both features for action classification.	2019-12-01 13:47:22
Human action recognition from aligned body parts	OXFORD_ALIGNED_BODYPARTS	University of Oxford	Minh Hoai, Lubor Ladicky, Andrew Zisserman	We propose a method for human action recognition that uses body-part detectors to localize the human and align feature descriptors. We first automatically detect the upper-body, the hands, and the silhouette. We compute appearance features (SIFT + HOG + color) and location features (aspect ratio + relative position) of the detected body parts. We use the silhouette as an indicator of the pose and as a mask for discarding irrelevant features computed outside the body. We further utilize the detection scores obtained from several object detectors trained on publicly available datasets.	2012-09-23 23:08:28
Regularized Max Pooling	Oxford_RMP	University of Oxford	Minh Hoai	We used Regularized Max Pooling (RMP) for human action classification. RMP classifies an image (or an image region) by extracting feature vectors at multiple subwindows at multiple locations and scales. Unlike Spatial Pyramid Matching where the subwindows are defined purely based on geometric correspondence, RMP accounts for the deformation of discriminative parts. The amount of deformation and the discriminative ability for multiple parts are jointly learned during training. For more information, please refer to: Minh Hoai (2014). Regularized Max Pooling for Image Classification. British Machine Vision Conference.	2014-12-03 01:43:39
R*CNN classifier	R*CNN	UC Berkeley	Georgia Gkioxari, Ross Girshick, Jitendra Malik	We use an RCNN classifier that uses action-specific regions for action classification. The primary region of interest containing the actor is re-scored using the auxiliary region. We use an R*CNN network which trains both models (primary and auxiliary) end-to-end.	2015-08-13 19:38:09
Very deep ConvNet features and SVM classifier	VERY_DEEP_CONVNET_16_19_SVM	Visual Geometry Group, University of Oxford	Karen Simonyan, Andrew Zisserman	The results were obtained using multi-scale convolutional features and an SVM classifier. The features were computed using two very deep convolutional networks (16 and 19 weight layers), pre-trained on ILSVRC-2012 (1000-class subset of ImageNet). Fine-tuning on VOC-2012 was not performed. The algorithm is similar to the one used for the VOC-2012 classification task (comp2), which is described in our paper: "Very Deep Convolutional Networks for Large-Scale Image Recognition" (http://arxiv.org/pdf/1409.1556). The main difference is that for the action classification task we computed separate descriptors for the whole image and for the provided bounding box, and stacked them to obtain the final representation.	2014-12-03 19:32:35
merge	merge	merge	merge	merge	2023-02-21 16:08:32
Poselets trained on action categories.	BERKELEY_ACTION_POSELETS	University of California, Berkeley	Subhransu Maji, Lubomir Bourdev, Jitendra Malik	This is based on our CVPR 2011 paper: "Action recognition using a distributed representation of pose and appearance", Subhransu Maji, Lubomir Bourdev and Jitendra Malik. For this submission we train 200 poselets for each action category. In addition we train poselets based on subcategory labels for playinginstrument and ridingbike. Linear SVMs are trained on the "poselet activation vector" along with features from object detectors for four categories: motorbike, bicycle, horse and tvmonitor. Context models re-rank the objects at the image level, as described in the CVPR'11 paper.	2011-10-12 06:24:46
Combine attribute classifiers and object detectors	COMBINE_ATTR_PART	Stanford University	Bangpeng Yao, Aditya Khosla, Li Fei-Fei	Our approach combines attribute classifiers and part detectors for action classification. The method is adapted from our ICCV2011 paper (Yao et al, 2011). The "attributes" are trained by using our random forest classifier (Yao et al, 2011), which are strong classifiers that consider global properties of action classes. As for "parts", we consider the objects that interact with the humans, such as horses, books, etc. Specifically, we take the object bank (Li et al, 2010) detectors that are trained on the ImageNet dataset for part representation. The confidence scores obtained from attribute classifiers and part detectors are combined to form the final score for each image.	2011-10-14 00:08:58
MAP-based SVM classifier with poselet features	MAPSVM-Poselet	Stanford University	Tim Tang, Pawan Kumar, Ben Packer, Daphne Koller	We build on the Poselet-based feature vector for action classification (Maji et al., 2010) in four ways: (i) we use a 2-level spatial pyramid (Lazebnik et al., CVPR 2006); (ii) we obtain a segmentation of the person bounding box into foreground and background using an efficient GrabCut-like scheme (Rother et al., SIGGRAPH 2004), and use it to divide the feature vector into two parts---one corresponding to the foreground and one corresponding to the background; (iii) we learn a mixture model to deal with the different visual aspects of people performing the same action; and (iv) we optimize mean average precision (Yue et al., SIGIR 2007) instead of the 0/1 loss used in the standard binary SVM. All action classifiers are trained on only the VOC 2011 data, with additional annotations required to compute the Poselets. All hyperparameters are set using 5-fold cross-validation.	2011-10-13 23:07:42