PASCAL VOC Challenge performance evaluation and download server
submission | mean AP | aeroplane | bicycle | bird | boat | bottle | bus | car | cat | chair | cow | diningtable | dog | horse | motorbike | person | pottedplant | sheep | sofa | train | tvmonitor | submission date
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---
RSTN [?] | 96.7 | 99.9 | 98.1 | 99.2 | 98.5 | 89.3 | 97.8 | 96.3 | 99.2 | 92.6 | 99.3 | 91.2 | 99.3 | 99.6 | 97.7 | 99.2 | 89.3 | 99.6 | 92.0 | 99.8 | 96.6 | 23-Apr-2021 | |
SYSU_ KESD [?] | 95.4 | 99.9 | 96.6 | 98.4 | 97.0 | 88.6 | 96.4 | 95.9 | 99.2 | 89.0 | 97.9 | 88.6 | 99.4 | 99.3 | 97.9 | 99.2 | 85.8 | 98.6 | 86.7 | 99.4 | 95.1 | 16-Oct-2018 | |
From_MOD_To_MLT [?] | 95.3 | 99.2 | 96.7 | 97.6 | 96.3 | 87.0 | 97.1 | 96.9 | 98.8 | 90.9 | 96.7 | 88.5 | 98.5 | 98.6 | 98.0 | 99.2 | 87.5 | 96.1 | 88.7 | 98.5 | 94.3 | 21-Apr-2017 | |
Random_Crop_Pooling_AGS [?] | 94.3 | 99.8 | 94.5 | 98.1 | 96.1 | 85.5 | 96.1 | 95.5 | 99.0 | 90.2 | 95.0 | 87.8 | 98.7 | 98.4 | 97.5 | 99.0 | 80.1 | 95.9 | 86.5 | 98.8 | 94.6 | 09-May-2016 | |
SDE_CNN_AGS [?] | 94.0 | 99.8 | 94.7 | 97.6 | 96.4 | 83.6 | 95.9 | 94.8 | 99.0 | 90.4 | 94.3 | 88.1 | 98.9 | 98.5 | 97.2 | 98.8 | 76.8 | 95.0 | 86.8 | 98.7 | 94.2 | 16-Nov-2015 | |
NUS-HCP++ [?] | 93.0 | 99.8 | 94.9 | 97.7 | 95.5 | 80.9 | 95.7 | 93.9 | 98.9 | 88.5 | 95.0 | 85.7 | 98.1 | 98.5 | 97.1 | 96.9 | 74.2 | 93.8 | 84.1 | 98.4 | 92.5 | 22-Apr-2015 | |
inceptionv4_svm [?] | 92.4 | 99.4 | 95.6 | 96.6 | 95.0 | 80.2 | 93.8 | 91.6 | 98.4 | 84.7 | 94.2 | 86.0 | 98.2 | 98.3 | 96.7 | 98.3 | 74.5 | 96.2 | 79.1 | 98.4 | 92.1 | 21-Dec-2017 | |
Random_Crop_Pooling [?] | 92.2 | 99.3 | 92.2 | 97.5 | 94.9 | 82.6 | 94.1 | 92.4 | 98.5 | 83.8 | 93.5 | 83.1 | 98.1 | 97.3 | 96.0 | 98.8 | 77.7 | 95.1 | 79.4 | 97.7 | 92.4 | 09-May-2016 | |
SDE_CNN [?] | 91.7 | 99.1 | 92.2 | 96.9 | 95.3 | 80.0 | 93.0 | 90.3 | 98.5 | 83.2 | 93.2 | 84.2 | 98.1 | 97.6 | 95.6 | 98.7 | 75.0 | 94.3 | 79.7 | 97.8 | 91.2 | 16-Nov-2015 | |
FisherNet-VGG16 [?] | 91.5 | 99.2 | 92.5 | 96.8 | 94.4 | 81.0 | 93.2 | 92.3 | 98.2 | 82.9 | 94.3 | 82.2 | 97.4 | 97.3 | 95.9 | 98.7 | 72.9 | 95.1 | 77.7 | 97.5 | 90.8 | 16-Aug-2016 | |
MSDA+FC [?] | 91.4 | 99.2 | 93.8 | 96.1 | 95.2 | 81.7 | 94.3 | 91.6 | 98.1 | 81.9 | 91.7 | 83.5 | 96.3 | 95.6 | 96.0 | 98.2 | 77.9 | 93.6 | 74.7 | 97.6 | 91.9 | 07-Sep-2015 | |
new_label_2 [?] | 91.3 | 99.2 | 95.4 | 96.1 | 94.8 | 79.7 | 93.9 | 90.9 | 98.4 | 84.2 | 94.0 | 85.4 | 98.0 | 98.0 | 96.7 | 98.0 | 59.5 | 95.6 | 78.9 | 98.4 | 91.4 | 13-Jan-2018 | |
MVMI-DSP [?] | 90.7 | 98.9 | 93.1 | 96.0 | 94.1 | 76.4 | 93.5 | 90.8 | 97.9 | 80.2 | 92.1 | 82.4 | 97.2 | 96.8 | 95.7 | 98.1 | 73.9 | 93.6 | 76.8 | 97.5 | 89.0 | 19-Apr-2015 | |
Tencent-BestImage&CASIA_FCFOF [?] | 90.4 | 98.8 | 92.5 | 96.1 | 94.0 | 74.3 | 92.6 | 90.9 | 97.8 | 85.0 | 92.2 | 83.1 | 97.1 | 95.8 | 93.0 | 97.8 | 67.6 | 92.5 | 82.2 | 97.0 | 88.5 | 09-Apr-2015 | |
NUS-HCP-AGS [?] | 90.3 | 99.0 | 91.8 | 94.8 | 92.4 | 72.6 | 95.0 | 91.8 | 97.4 | 85.2 | 92.9 | 83.1 | 96.0 | 96.6 | 96.1 | 94.9 | 68.4 | 92.0 | 79.6 | 97.3 | 88.5 | 09-Jun-2014 | |
new_label_4 [?] | 89.4 | 99.5 | 95.2 | 96.2 | 94.5 | 77.2 | 93.9 | 90.9 | 98.3 | 79.9 | 94.0 | 79.4 | 98.1 | 98.1 | 96.6 | 96.9 | 61.1 | 95.7 | 76.7 | 98.3 | 68.7 | 13-Jan-2018 | |
VERY_DEEP_CONVNET_16_19_SVM [?] | 89.3 | 99.1 | 89.1 | 96.0 | 94.1 | 74.1 | 92.2 | 85.3 | 97.9 | 79.9 | 92.0 | 83.7 | 97.5 | 96.5 | 94.7 | 97.1 | 63.7 | 93.6 | 75.2 | 97.4 | 87.8 | 16-Nov-2014 | |
VERY_DEEP_CONVNET_19_SVM [?] | 89.0 | 99.1 | 88.7 | 95.7 | 93.9 | 73.1 | 92.1 | 84.8 | 97.7 | 79.1 | 90.7 | 83.2 | 97.3 | 96.2 | 94.3 | 96.9 | 63.4 | 93.2 | 74.6 | 97.3 | 87.9 | 17-Nov-2014 | |
VERY_DEEP_CONVNET_16_SVM [?] | 89.0 | 99.0 | 88.8 | 95.9 | 93.8 | 73.1 | 92.1 | 85.1 | 97.8 | 79.5 | 91.1 | 83.3 | 97.2 | 96.3 | 94.5 | 96.9 | 63.1 | 93.4 | 75.0 | 97.1 | 87.1 | 17-Nov-2014 | |
new_label_6 [?] | 88.5 | 99.4 | 94.5 | 95.1 | 94.3 | 78.8 | 93.8 | 90.0 | 97.5 | 83.4 | 89.1 | 85.8 | 98.1 | 95.3 | 74.6 | 96.0 | 55.4 | 95.5 | 76.9 | 98.3 | 77.8 | 13-Jan-2018 | |
BCE loss with transfer learning [?] | 84.4 | 97.8 | 84.0 | 93.1 | 88.1 | 63.0 | 88.7 | 80.8 | 95.8 | 72.4 | 87.2 | 77.1 | 94.4 | 93.0 | 91.0 | 95.4 | 54.6 | 87.8 | 69.3 | 94.3 | 80.8 | 06-Mar-2019 | |
NUS-HCP [?] | 84.2 | 97.5 | 84.3 | 93.0 | 89.4 | 62.5 | 90.2 | 84.6 | 94.8 | 69.7 | 90.2 | 74.1 | 93.4 | 93.7 | 88.8 | 93.2 | 59.7 | 90.3 | 61.8 | 94.4 | 78.0 | 09-Jun-2014 | |
CNN-S-TUNE-RNK [?] | 83.2 | 96.8 | 82.5 | 91.5 | 88.1 | 62.1 | 88.3 | 81.9 | 94.8 | 70.3 | 80.2 | 76.2 | 92.9 | 90.3 | 89.3 | 95.2 | 57.4 | 83.6 | 66.4 | 93.5 | 81.9 | 28-Jul-2014 | |
NN-ImageNet-Pretrain-1512classes [?] | 83.0 | 95.0 | 83.2 | 88.4 | 84.4 | 61.0 | 89.1 | 84.7 | 90.8 | 72.9 | 87.2 | 69.0 | 91.8 | 93.2 | 88.4 | 96.1 | 64.9 | 87.3 | 62.7 | 91.0 | 80.0 | 14-Apr-2014 | |
new_label_8 [?] | 73.6 | 99.4 | 94.5 | 95.5 | 94.1 | 56.3 | 93.0 | 85.4 | 97.8 | 80.9 | 3.8 | 78.3 | 97.1 | 93.7 | 35.3 | 94.6 | 58.5 | 25.7 | 74.5 | 95.8 | 18.5 | 13-Jan-2018 | |
CW_DEEP_FCN [?] | 67.5 | 91.2 | 67.3 | 83.5 | 75.3 | 33.3 | 77.6 | 69.7 | 87.0 | 55.1 | 62.1 | 33.4 | 83.5 | 70.2 | 71.5 | 90.1 | 44.6 | 69.6 | 37.5 | 85.6 | 61.4 | 11-Aug-2016 | |
LIRIS_CLSTEXT [?] | 65.6 | 88.3 | 66.1 | 60.8 | 68.5 | 46.7 | 77.3 | 69.3 | 63.7 | 55.9 | 52.6 | 56.6 | 55.5 | 69.7 | 73.8 | 87.1 | 46.3 | 65.4 | 54.0 | 81.2 | 72.8 | 13-Oct-2011 | |
ITI_FK_FLICKR_GRAYSIFT_ENTROPY [?] | 63.5 | 88.1 | 63.0 | 61.9 | 68.6 | 34.9 | 79.6 | 67.4 | 70.5 | 57.5 | 52.0 | 55.3 | 60.1 | 68.7 | 74.3 | 83.2 | 26.4 | 57.6 | 53.4 | 83.0 | 64.0 | 23-Sep-2012 |
Title | Method | Affiliation | Contributors | Description | Date |
---|---|---|---|---|---|
BCE loss with transfer learning | BCE loss with transfer learning | Singapore University of Technology and Design | Teo Kai Xiang, Woong Wen Tat | We use a BCE loss for each of the 20 classes with a pretrained ResNet model and three-phase training (a minimal sketch of this setup appears after this table). | 2019-03-06 08:02:58 |
Convolutional network pre-trained on ILSVRC-2012 | CNN-S-TUNE-RNK | Visual Geometry Group, University of Oxford | Ken Chatfield, Karen Simonyan, Andrea Vedaldi, Andrew Zisserman | A convolutional network, pre-trained on ILSVRC-2012 (1000-class subset of ImageNet) and fine-tuned on VOC-2012 using the ranking hinge loss (a common form of which is sketched after this table). The details can be found in our BMVC 2014 paper, "Return of the Devil in the Details: Delving Deep into Convolutional Nets" (Table 3, row (g)), and on the project website: http://www.robots.ox.ac.uk/~vgg/research/deep_eval/ | 2014-07-28 12:23:33 |
softmax | CW_DEEP_FCN | UESTC | HD DR | ultra deep network | 2016-08-11 11:10:49 |
Deep FisherNet for Object Classification | FisherNet-VGG16 | HUST & UCSD | Peng Tang, Xinggang Wang, Baoguang Shi, Xiang Bai, Wenyu Liu, Zhuowen Tu | We propose a neural network structure with a Fisher Vector (FV) layer as part of an end-to-end trainable, differentiable system; we name our network FisherNet, and it is learnable using back-propagation. Our proposed FisherNet combines convolutional neural network training and Fisher Vector encoding in a single end-to-end structure. The details can be found in our paper "Deep FisherNet for Object Classification". | 2016-08-16 12:32:03 |
JueunGot | From_MOD_To_MLT | SWRDC, Device Solutions, Samsung Electronics | Hayoung Joo, Donghyuk Kwon, Yong-Deok Kim | The multi-object detection result is converted to a multi-label classification. For each box, we only use the classification score. If there are multiple boxes for some class, we simply take the maximum classification score among them (a minimal sketch of this conversion appears after this table). | 2017-04-21 08:26:58 |
Multimodal bootstrapping using MIRFLICKR1m | ITI_FK_FLICKR_GRAYSIFT_ENTROPY | ITI-CERTH & Surrey University | E. Chatzilari, S. Nikolopoulos, Y. Kompatsiaris, J. Kittler | Based on the implementation of “K. Chatfield, V. Lempitsky, A. Vedaldi, A. Zisserman, The devil is in the details: an evaluation of recent feature encoding methods, British Machine Vision Conference, 2011” and specifically following the approach described in “F. Perronnin, J. Sanchez, and T. Mensink. 2010. Improving the fisher kernel for large-scale image classification. In Proc. (ECCV'10), Springer-Verlag, Berlin, Heidelberg, 143-156.” for feature encoding based on the Fisher Kernel. We use gray-SIFT descriptors reduced with PCA to 80 dimensions and a GMM with 256 components for estimating the probabilistic visual vocabulary. A spatial pyramid tree with 3 levels (1st: 1x1, 2nd: 2x2 and 3rd: 3x1 horizontal) and dense sampling every 3 pixels has been employed to define the key-points. The descriptors are aggregated using Fisher encoding, which produces a 40960-dimensional vector for each of the 8 regions of the spatial pyramid. These vectors are subsequently concatenated to produce the final 327680-dimensional representation vector for each image. SVM classifiers are trained using the Hellinger kernel, which amounts to square-rooting the features and then normalizing them using the l2 norm (a minimal sketch of this feature map appears after this table). In addition to the train+validation dataset, the set of examples used for training the visual recognition models is further enriched by collecting the first 500 images per concept from the MIRFLICKR dataset (1 million images in total). The images are ranked in ascending order based on the geometric mean of the image visual score (distance from the SVM hyperplane), the complement of the image tag-based similarity (between the image tags and the concept of interest) and the entropy of tag-based similarities among all concepts in the dataset. | 2012-09-23 16:30:53 |
global_MSDA_local_FC | MSDA+FC | Beihang University & Intel Labs China | Jianwei Luo, Zhiguo Jiang, Jianguo Li, Jun Wan | We use the output of the 1000-way softmax layer of VGG's CNN trained on the ILSVRC classification task as the feature, namely Deep Attribute. Given an image, it is represented by the aggregation of the 1000-d features from all the regions extracted on the image by objectness detection techniques such as EdgeBoxes. We perform feature aggregation on five scales according to the size of the region; the resulting representation is thus 5000-d and is named MSDA (a minimal sketch of this aggregation appears after this table). Initial SVM classifiers are trained on the MSDA feature. Then we apply the previously trained classifiers to the regions to select a few correlated regions for each image, and perform feature aggregation using only the features from these regions; the feature we use in this step is the first fully-connected (FC) feature. A new set of classifiers is trained on these aggregated FC features. The final prediction for the image is the fusion of the results from both steps. Note that we do not perform any data augmentation such as flips or crops, and we do not fine-tune the VGG CNN on the PASCAL dataset. For this evaluation, we use all of the VOC07 dataset and VOC12 trainval as the training set. | 2015-09-07 02:56:29 |
NTU & NJU _MVMI_DSP | MVMI-DSP | NTU, NJU | Hao Yang, Joey Tianyi Zhou, Yu Zhang, Bin-bin Gao, Jianxin Wu, Jianfei Cai | We combine the features generated from the whole image with the features from a proposal based multi-view multi-instance framework to form the final representation of the image. | 2015-04-19 06:18:54 |
CNN pre-trained on ImageNet | NN-ImageNet-Pretrain-1512classes | INRIA | Maxime Oquab, Léon Bottou, Ivan Laptev, Josef Sivic | We use features extracted using a Convolutional Neural Network to perform classification on the VOC dataset. The Convolutional Neural Network features are trained on a 1512-class subset of the ImageNet database. A 2-layer neural network is then trained on the Pascal VOC 2012 dataset, on top of the pre-trained layers. Details on the method can be found at: http://www.di.ens.fr/willow/research/cnn/ | 2014-04-14 15:04:01 |
HCP: Hypothesis CNN Pooling | NUS-HCP | National University of Singapore, Beijing Jiaotong University | Yunchao Wei*, Wei Xia*, Jian Dong, Junshi Huang, Bingbing Ni, Yao Zhao, Shuicheng Yan | Convolutional Neural Network (CNN) has demonstrated promising performance in single-label image classification tasks. However, how CNN best copes with multi-label images still remains an open problem, mainly due to the underlying complex object layouts and insufficient multi-label training images. In this work, we propose a flexible deep CNN infrastructure, called Hypotheses-CNN-Pooling (HCP), which takes an arbitrary number of object segment hypotheses as the inputs; a shared CNN is connected with each hypothesis, and the CNN outputs from the different hypotheses are aggregated with max pooling for the ultimate multi-label predictions (a minimal sketch of this pooling step appears after this table). Some unique characteristics of this flexible deep CNN infrastructure include: 1) no ground-truth bounding box information is required for training, 2) the whole HCP infrastructure is robust to possibly noisy and/or redundant hypotheses, 3) no explicit hypothesis label is required, and 4) it naturally outputs multi-label prediction results. Experimental results on the Pascal VOC2007 and VOC2012 multi-label image datasets demonstrate the superiority of the proposed HCP infrastructure over other state-of-the-art methods; in particular, the mAP reaches 0.842 by HCP alone on the VOC2012 dataset. | 2014-06-09 10:54:22 |
HCP: Hypothesis CNN Pooling | NUS-HCP++ | National University of Singapore, Beijing Jiaotong University | Yunchao Wei*, Wei Xia*, Jian Dong, Min Lin, Junshi Huang, Bingbing Ni, Yao Zhao, Shuicheng Yan. | In this submission, we utilize the VGG-16 pre-trained model on ILSVRC-2012 (1000-class subset of ImageNet) as the shared CNN. The single model performance can reach 90.1%. The final result is the combination of the NUS-HCP with the approach proposed in [1]. [1]Jian Dong, Wei Xia, Qiang Chen, Jiashi Feng, Zhongyang Huang, Shuicheng Yan. Subcategory-aware Object Classification. In CVPR, Portland, Oregon, USA, Jun 23-28, 2013. | 2015-04-22 02:45:39 |
HCP: Hypothesis CNN Pooling with Subcategory Mining | NUS-HCP-AGS | National University of Singapore, Beijing Jiaotong University | Yunchao Wei*, Wei Xia*, Jian Dong, Junshi Huang, Bingbing Ni, Yao Zhao, Shuicheng Yan | Convolutional Neural Network (CNN) has demonstrated promising performance in single-label image classification tasks. However, how CNN best copes with multi-label images still remains an open problem, mainly due to the underlying complex object layouts and insufficient multi-label training images. In this work, we propose a flexible deep CNN infrastructure, called Hypotheses-CNN-Pooling (HCP), which takes an arbitrary number of object segment hypotheses as the inputs; a shared CNN is connected with each hypothesis, and the CNN outputs from the different hypotheses are aggregated with max pooling for the ultimate multi-label predictions. Some unique characteristics of this flexible deep CNN infrastructure include: 1) no ground-truth bounding box information is required for training, 2) the whole HCP infrastructure is robust to possibly noisy and/or redundant hypotheses, 3) no explicit hypothesis label is required, and 4) it naturally outputs multi-label prediction results. Experimental results on the Pascal VOC2007 and VOC2012 multi-label image datasets demonstrate the superiority of the proposed HCP infrastructure over other state-of-the-art methods; in particular, the mAP reaches 0.842 by HCP alone and 0.903 after the combination with [1] on the VOC2012 dataset. [1] Jian Dong, Wei Xia, Qiang Chen, Jiashi Feng, Zhongyang Huang, Shuicheng Yan. Subcategory-aware Object Classification. In CVPR, Portland, Oregon, USA, Jun 23-28, 2013. | 2014-06-09 10:58:29 |
Multi-Label Image Classification with RSTN | RSTN | # | Anonymous (ACM MM submission) | We propose a region selection transformer network (RSTN), a tailored vision transformer architecture, for tackling the MLIC task. Specifically, RSTN consists of a transformer encoder, a region selection module (RSM), and a region refinement module (RRM). The transformer encoder takes as input a sequence of flattened image patches for discovering global long-range information across the whole network. Next, based on the intermediate attention outputs, the RSM utilizes a ranking mechanism to select the semantically related discriminative regions. Further, the RRM is proposed to aggregate the local context information among the selected regions. | 2021-04-23 12:38:49 |
HFUT_Random_Crop_Pooling | Random_Crop_Pooling | Hefei University of Technology | Changzhi Luo, Meng Wang, Richang Hong, Jiashi Feng | We first fine-tune the 16-layer VGG-Net with a random crop pooling approach, and then use the fine-tuned model to extract features for each image. The final results are obtained using a linear SVM classifier. | 2016-05-09 03:14:23 |
HFUT_Random_Crop_Pooling_AGS | Random_Crop_Pooling_AGS | Hefei University of Technology | Changzhi Luo, Meng Wang, Richang Hong, Jiashi Feng | We fuse the random crop pooling approach with the approach proposed in [1]. [1] Jian Dong, Wei Xia, Qiang Chen, Jiashi Feng, Zhongyang Huang, Shuicheng Yan. Subcategory-aware Object Classification. In CVPR, Portland, Oregon, USA, Jun 23-28, 2013. | 2016-05-09 03:12:36 |
SDE embedded CNN | SDE_CNN | NUS & NLPR | Guo-Sen Xie, Xu-Yao Zhang, Shuicheng Yan, and Cheng-Lin Liu | The Bag of Words (BoW) model and the Convolutional Neural Network (CNN) are two milestones in visual recognition. Both BoW and CNN require a feature pooling operation when constructing the framework. In particular, max-pooling has been validated as an efficient and effective pooling method compared with other methods such as average pooling and stochastic pooling. In this paper, we first evaluate different pooling methods, and then propose a new feature pooling method termed Selective, Discriminative and Equalizing pooling (SDE). The SDE representation is a feature learning mechanism that jointly optimizes the pooled representations with the target of learning more selective, discriminative and equalizing features. We use bilevel optimization to solve the joint optimization problem. Experiments on multiple benchmark databases (including both single-label and multi-label ones) validate the effectiveness of our framework. In particular, we achieve state-of-the-art results (mAP) of 93.2% and 94.0% on the PASCAL VOC2007 and VOC2012 databases, respectively. | 2015-11-16 15:18:08 |
SDE embedded CNN | SDE_CNN_AGS | NUS & NLPR | Guo-Sen Xie, Xu-Yao Zhang, Shuicheng Yan, and Cheng-Lin Liu | The Bag of Words (BoW) model and the Convolutional Neural Network (CNN) are two milestones in visual recognition. Both BoW and CNN require a feature pooling operation when constructing the framework. In particular, max-pooling has been validated as an efficient and effective pooling method compared with other methods such as average pooling and stochastic pooling. In this paper, we first evaluate different pooling methods, and then propose a new feature pooling method termed Selective, Discriminative and Equalizing pooling (SDE). The SDE representation is a feature learning mechanism that jointly optimizes the pooled representations with the target of learning more selective, discriminative and equalizing features. We use bilevel optimization to solve the joint optimization problem. Experiments on multiple benchmark databases (including both single-label and multi-label ones) validate the effectiveness of our framework. In particular, we achieve state-of-the-art results (mAP) of 93.2% and 94.0% on the PASCAL VOC2007 and VOC2012 databases, respectively. | 2015-11-16 15:06:45 |
Knowledge embedded semantic decomposition | SYSU_ KESD | Sun Yat-Sen University | Tianshui Chen, Muxin Xu, Xiaolu Hui, Riquan Chen, Liang Lin | We present a novel approach that incorporates statistical prior knowledge to extract semantic-aware features and simultaneously capture co-occurrence of objects in an image. | 2018-10-16 05:00:08 |
FCFOF: Fusion of Context Feature and Object Feature | Tencent-BestImage&CASIA_FCFOF | Tencent BestImage Team; Institute of Automation, Chinese Academy of Sciences | Yan Kong, ScorpioGuo, Fuzhang Wu, Fan Tang, GaryHuang, Weiming Dong | In this submission, we make use of features at both the context level and the object level. We extract context CNN features from the whole image to represent context information, and extract local CNN features via the selective search method to represent exact object information. These two kinds of features are used to train SVM classifiers. The final result is the combination of the two models. | 2015-04-09 12:18:17 |
Very deep ConvNet features and SVM classifier | VERY_DEEP_CONVNET_16_19_SVM | Visual Geometry Group, University of Oxford | Karen Simonyan, Andrew Zisserman | The results were obtained using multi-scale convolutional features and an SVM classifier. The features were computed using two very deep convolutional networks (16 and 19 weight layers), pre-trained on ILSVRC-2012 (1000-class subset of ImageNet). Fine-tuning on VOC-2012 was not performed. The details can be found in our paper: "Very Deep Convolutional Networks for Large-Scale Image Recognition" (http://arxiv.org/pdf/1409.1556). | 2014-11-16 15:51:49 |
Very deep ConvNet features and SVM classifier | VERY_DEEP_CONVNET_16_SVM | Visual Geometry Group, University of Oxford | Karen Simonyan, Andrew Zisserman | The results were obtained using multi-scale convolutional features and an SVM classifier. The features were computed using a very deep convolutional network (16 weight layers), pre-trained on ILSVRC-2012 (1000-class subset of ImageNet). Fine-tuning on VOC-2012 was not performed. The details can be found in our paper: "Very Deep Convolutional Networks for Large-Scale Image Recognition" (http://arxiv.org/pdf/1409.1556). | 2014-11-17 16:30:06 |
Very deep ConvNet features and SVM classifier | VERY_DEEP_CONVNET_19_SVM | Visual Geometry Group, University of Oxford | Karen Simonyan, Andrew Zisserman | The results were obtained using multi-scale convolutional features and an SVM classifier. The features were computed using a very deep convolutional network (19 weight layers), pre-trained on ILSVRC-2012 (1000-class subset of ImageNet). Fine-tuning on VOC-2012 was not performed. The details can be found in our paper: "Very Deep Convolutional Networks for Large-Scale Image Recognition" (http://arxiv.org/pdf/1409.1556). | 2014-11-17 16:15:32 |
finetune | inceptionv4_svm | seu | wangyin4 | no | 2017-12-21 07:28:57 |
svm with v4 | new_label_2 | seu | wangyin | some new label | 2018-01-13 10:16:24 |
svm with v4 finetune | new_label_4 | seu | wangyin | 4 new label | 2018-01-13 11:18:14 |
svm with v4 finetune part | new_label_6 | seu | wangying | same as above | 2018-01-13 11:43:50 |
svm with v4 finetune part small | new_label_8 | seu | wangyin, zhangyu | same as above | 2018-01-13 13:20:12 |
Classification with additional text feature | LIRIS_CLSTEXT | LIRIS, Ecole Centrale de Lyon, CNRS, UMR5205, France | Chao Zhu, Yuxing Tang, Ningning Liu, Charles-Edmond Bichot, Emmanuel Dellandrea, Liming Chen | In this submission, we try to use additional text information to help with object classification. We propose novel text features [1] based on semantic distance using WordNet. The basic idea is to calculate the semantic distance between the text associated with an image and an emotional dictionary using path similarity, which denotes how similar two word senses are based on the shortest path that connects the senses in a taxonomy (a minimal sketch of this measure appears after this table). As there are no tags included in the Pascal 2011 dataset, we downloaded 1 million Flickr images (including their tags) as the additional textual source. First, for each Pascal image, we find its most similar images (top 20) in the database using a KNN method based on visual features (LBP and color HSV histograms), and then use these tags to extract the text feature. We use an SVM with an RBF kernel to train the classifier and predict the outputs. For classification based on visual features, we follow the same method described in our other submission. The outputs of the visual-feature-based method and the text-feature-based method are then linearly combined as the final results. [1] N. Liu, Y. Zhang, E. Dellandréa, B. Tellez, L. Chen: 'Associating text features with visual ones to improve affective image classification', International Conference on Affective Computing (ACII), Memphis, USA, 2011. | 2011-10-13 21:20:50 |
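
The "BCE loss with transfer learning" entry trains one independent binary cross-entropy objective per VOC class on top of a pretrained backbone. A minimal PyTorch sketch, assuming a recent torchvision and a hypothetical ResNet-50 backbone (the entry does not specify the ResNet variant or the details of its three training phases):

```python
import torch
import torch.nn as nn
from torchvision import models

NUM_CLASSES = 20  # the PASCAL VOC object classes

# Pretrained backbone with the classifier head replaced by 20 logits.
model = models.resnet50(weights="IMAGENET1K_V1")
model.fc = nn.Linear(model.fc.in_features, NUM_CLASSES)

criterion = nn.BCEWithLogitsLoss()  # independent binary loss per class
optimizer = torch.optim.SGD(model.parameters(), lr=1e-3, momentum=0.9)

def train_step(images, targets):
    """images: (B, 3, H, W); targets: float tensor (B, 20) of 0/1 labels."""
    logits = model(images)
    loss = criterion(logits, targets)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```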
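
The CNN-S-TUNE-RNK entry fine-tunes with a ranking hinge loss; the exact formulation is in the cited BMVC 2014 paper. One common pairwise form, sketched here as an assumption, requires every positive class of an image to outscore every negative class by a margin:

```python
import torch

def ranking_hinge_loss(scores, labels, margin=1.0):
    """scores: (B, C) class scores; labels: float (B, C) of 0/1.
    Penalizes any negative class scored within `margin` of a positive class."""
    pos = scores.unsqueeze(2)  # (B, C, 1): candidate positive-class scores
    neg = scores.unsqueeze(1)  # (B, 1, C): candidate negative-class scores
    # pair_mask[b, i, j] = 1 iff class i is positive and class j is negative.
    pair_mask = labels.unsqueeze(2) * (1 - labels).unsqueeze(1)
    hinge = torch.clamp(margin - pos + neg, min=0)
    return (hinge * pair_mask).sum() / pair_mask.sum().clamp(min=1)
```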
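
The From_MOD_To_MLT entry turns detector output into image-level labels by keeping, for each class, the maximum classification score over all predicted boxes. A minimal sketch; the (class_id, score) input format is an illustrative assumption:

```python
import numpy as np

def boxes_to_multilabel(box_scores, num_classes=20):
    """box_scores: iterable of (class_id, score) pairs from a detector.
    Returns per-class image scores: the max detection score per class."""
    image_scores = np.zeros(num_classes)
    for cls, score in box_scores:
        image_scores[cls] = max(image_scores[cls], score)
    return image_scores
```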
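
The ITI_FK_FLICKR_GRAYSIFT_ENTROPY entry trains SVMs with the Hellinger kernel by square-rooting and l2-normalizing the Fisher vectors. A minimal sketch of that explicit feature map: a linear SVM on the mapped features behaves like a Hellinger-kernel SVM on the originals, and the signed square root handles the negative components of Fisher vectors.

```python
import numpy as np

def hellinger_map(x, eps=1e-12):
    """Signed square root followed by row-wise l2 normalization.
    x: (num_samples, dim) feature matrix."""
    x = np.sign(x) * np.sqrt(np.abs(x))
    norms = np.linalg.norm(x, axis=1, keepdims=True)
    return x / np.maximum(norms, eps)
```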
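
The MSDA+FC entry aggregates 1000-d softmax outputs from object proposals into five size-dependent groups, yielding a 5000-d image descriptor. A minimal sketch; the equal-width binning of regions by relative area and the max pooling within each bin are assumptions, since the entry does not spell out the grouping rule:

```python
import numpy as np

def msda_feature(region_probs, region_areas, num_scales=5):
    """region_probs: (num_regions, 1000) softmax outputs for one image's
    proposals; region_areas: relative area of each region, in (0, 1].
    Regions are binned into num_scales size groups and pooled per bin."""
    bins = np.minimum((region_areas * num_scales).astype(int), num_scales - 1)
    parts = []
    for s in range(num_scales):
        mask = bins == s
        if mask.any():
            parts.append(region_probs[mask].max(axis=0))  # max pool within bin
        else:
            parts.append(np.zeros(region_probs.shape[1]))
    return np.concatenate(parts)  # (num_scales * 1000,) descriptor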
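
The NUS-HCP entries feed each object-segment hypothesis through a shared CNN and max-pool the per-class outputs into one multi-label prediction. A minimal PyTorch sketch of that pooling step, assuming the shared CNN already maps a batch of cropped hypotheses to per-class scores:

```python
import torch

def hcp_pool(shared_cnn, hypotheses):
    """hypotheses: (num_hypotheses, 3, H, W) crops from one image.
    Returns one score per class, max-pooled across hypotheses."""
    scores = shared_cnn(hypotheses)      # (num_hypotheses, num_classes)
    image_scores, _ = scores.max(dim=0)  # max pooling across hypotheses
    return image_scores
```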
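
The LIRIS_CLSTEXT entry measures tag-to-concept semantic distance with WordNet path similarity. A minimal NLTK sketch; comparing only the first sense of each word is a simplifying assumption, and the wordnet corpus must be downloaded beforehand via nltk.download('wordnet'):

```python
from nltk.corpus import wordnet as wn

def tag_concept_similarity(tag, concept):
    """Path similarity between the first WordNet senses of two words;
    returns 0.0 when either word is missing from WordNet."""
    syns_tag, syns_concept = wn.synsets(tag), wn.synsets(concept)
    if not syns_tag or not syns_concept:
        return 0.0
    sim = syns_tag[0].path_similarity(syns_concept[0])
    return sim if sim is not None else 0.0
```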