Classification Results: VOC2012 BETA

Competition "comp2" (train on own data)

This leaderboard shows only those submissions that have been marked as public, so the displayed rankings should not be considered definitive.

Average Precision (AP %)

| submission | mean | aeroplane | bicycle | bird | boat | bottle | bus | car | cat | chair | cow | diningtable | dog | horse | motorbike | person | pottedplant | sheep | sofa | train | tvmonitor | date |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| SYSU_ KESD | 95.4 | 99.9 | 96.6 | 98.4 | 97.0 | 88.6 | 96.4 | 95.9 | 99.2 | 89.0 | 97.9 | 88.6 | 99.4 | 99.3 | 97.9 | 99.2 | 85.8 | 98.6 | 86.7 | 99.4 | 95.1 | 16-Oct-2018 |
| RSTN | 96.7 | 99.9 | 98.1 | 99.2 | 98.5 | 89.3 | 97.8 | 96.3 | 99.2 | 92.6 | 99.3 | 91.2 | 99.3 | 99.6 | 97.7 | 99.2 | 89.3 | 99.6 | 92.0 | 99.8 | 96.6 | 23-Apr-2021 |
| NUS-HCP++ | 93.0 | 99.8 | 94.9 | 97.7 | 95.5 | 80.9 | 95.7 | 93.9 | 98.9 | 88.5 | 95.0 | 85.7 | 98.1 | 98.5 | 97.1 | 96.9 | 74.2 | 93.8 | 84.1 | 98.4 | 92.5 | 22-Apr-2015 |
| Random_Crop_Pooling_AGS | 94.3 | 99.8 | 94.5 | 98.1 | 96.1 | 85.5 | 96.1 | 95.5 | 99.0 | 90.2 | 95.0 | 87.8 | 98.7 | 98.4 | 97.5 | 99.0 | 80.1 | 95.9 | 86.5 | 98.8 | 94.6 | 09-May-2016 |
| SDE_CNN_AGS | 94.0 | 99.8 | 94.7 | 97.6 | 96.4 | 83.6 | 95.9 | 94.8 | 99.0 | 90.4 | 94.3 | 88.1 | 98.9 | 98.5 | 97.2 | 98.8 | 76.8 | 95.0 | 86.8 | 98.7 | 94.2 | 16-Nov-2015 |
| new_label_4 | 89.4 | 99.5 | 95.2 | 96.2 | 94.5 | 77.2 | 93.9 | 90.9 | 98.3 | 79.9 | 94.0 | 79.4 | 98.1 | 98.1 | 96.6 | 96.9 | 61.1 | 95.7 | 76.7 | 98.3 | 68.7 | 13-Jan-2018 |
| inceptionv4_svm | 92.4 | 99.4 | 95.6 | 96.6 | 95.0 | 80.2 | 93.8 | 91.6 | 98.4 | 84.7 | 94.2 | 86.0 | 98.2 | 98.3 | 96.7 | 98.3 | 74.5 | 96.2 | 79.1 | 98.4 | 92.1 | 21-Dec-2017 |
| new_label_8 | 73.6 | 99.4 | 94.5 | 95.5 | 94.1 | 56.3 | 93.0 | 85.4 | 97.8 | 80.9 | 3.8 | 78.3 | 97.1 | 93.7 | 35.3 | 94.6 | 58.5 | 25.7 | 74.5 | 95.8 | 18.5 | 13-Jan-2018 |
| new_label_6 | 88.5 | 99.4 | 94.5 | 95.1 | 94.3 | 78.8 | 93.8 | 90.0 | 97.5 | 83.4 | 89.1 | 85.8 | 98.1 | 95.3 | 74.6 | 96.0 | 55.4 | 95.5 | 76.9 | 98.3 | 77.8 | 13-Jan-2018 |
| Random_Crop_Pooling | 92.2 | 99.3 | 92.2 | 97.5 | 94.9 | 82.6 | 94.1 | 92.4 | 98.5 | 83.8 | 93.5 | 83.1 | 98.1 | 97.3 | 96.0 | 98.8 | 77.7 | 95.1 | 79.4 | 97.7 | 92.4 | 09-May-2016 |
| MSDA+FC | 91.4 | 99.2 | 93.8 | 96.1 | 95.2 | 81.7 | 94.3 | 91.6 | 98.1 | 81.9 | 91.7 | 83.5 | 96.3 | 95.6 | 96.0 | 98.2 | 77.9 | 93.6 | 74.7 | 97.6 | 91.9 | 07-Sep-2015 |
| From_MOD_To_MLT | 95.3 | 99.2 | 96.7 | 97.6 | 96.3 | 87.0 | 97.1 | 96.9 | 98.8 | 90.9 | 96.7 | 88.5 | 98.5 | 98.6 | 98.0 | 99.2 | 87.5 | 96.1 | 88.7 | 98.5 | 94.3 | 21-Apr-2017 |
| new_label_2 | 91.3 | 99.2 | 95.4 | 96.1 | 94.8 | 79.7 | 93.9 | 90.9 | 98.4 | 84.2 | 94.0 | 85.4 | 98.0 | 98.0 | 96.7 | 98.0 | 59.5 | 95.6 | 78.9 | 98.4 | 91.4 | 13-Jan-2018 |
| FisherNet-VGG16 | 91.5 | 99.2 | 92.5 | 96.8 | 94.4 | 81.0 | 93.2 | 92.3 | 98.2 | 82.9 | 94.3 | 82.2 | 97.4 | 97.3 | 95.9 | 98.7 | 72.9 | 95.1 | 77.7 | 97.5 | 90.8 | 16-Aug-2016 |
| VERY_DEEP_CONVNET_19_SVM | 89.0 | 99.1 | 88.7 | 95.7 | 93.9 | 73.1 | 92.1 | 84.8 | 97.7 | 79.1 | 90.7 | 83.2 | 97.3 | 96.2 | 94.3 | 96.9 | 63.4 | 93.2 | 74.6 | 97.3 | 87.9 | 17-Nov-2014 |
| VERY_DEEP_CONVNET_16_19_SVM | 89.3 | 99.1 | 89.1 | 96.0 | 94.1 | 74.1 | 92.2 | 85.3 | 97.9 | 79.9 | 92.0 | 83.7 | 97.5 | 96.5 | 94.7 | 97.1 | 63.7 | 93.6 | 75.2 | 97.4 | 87.8 | 16-Nov-2014 |
| SDE_CNN | 91.7 | 99.1 | 92.2 | 96.9 | 95.3 | 80.0 | 93.0 | 90.3 | 98.5 | 83.2 | 93.2 | 84.2 | 98.1 | 97.6 | 95.6 | 98.7 | 75.0 | 94.3 | 79.7 | 97.8 | 91.2 | 16-Nov-2015 |
| VERY_DEEP_CONVNET_16_SVM | 89.0 | 99.0 | 88.8 | 95.9 | 93.8 | 73.1 | 92.1 | 85.1 | 97.8 | 79.5 | 91.1 | 83.3 | 97.2 | 96.3 | 94.5 | 96.9 | 63.1 | 93.4 | 75.0 | 97.1 | 87.1 | 17-Nov-2014 |
| NUS-HCP-AGS | 90.3 | 99.0 | 91.8 | 94.8 | 92.4 | 72.6 | 95.0 | 91.8 | 97.4 | 85.2 | 92.9 | 83.1 | 96.0 | 96.6 | 96.1 | 94.9 | 68.4 | 92.0 | 79.6 | 97.3 | 88.5 | 09-Jun-2014 |
| MVMI-DSP | 90.7 | 98.9 | 93.1 | 96.0 | 94.1 | 76.4 | 93.5 | 90.8 | 97.9 | 80.2 | 92.1 | 82.4 | 97.2 | 96.8 | 95.7 | 98.1 | 73.9 | 93.6 | 76.8 | 97.5 | 89.0 | 19-Apr-2015 |
| Tencent-BestImage&CASIA_FCFOF | 90.4 | 98.8 | 92.5 | 96.1 | 94.0 | 74.3 | 92.6 | 90.9 | 97.8 | 85.0 | 92.2 | 83.1 | 97.1 | 95.8 | 93.0 | 97.8 | 67.6 | 92.5 | 82.2 | 97.0 | 88.5 | 09-Apr-2015 |
| BCE loss with transfer learning | 84.4 | 97.8 | 84.0 | 93.1 | 88.1 | 63.0 | 88.7 | 80.8 | 95.8 | 72.4 | 87.2 | 77.1 | 94.4 | 93.0 | 91.0 | 95.4 | 54.6 | 87.8 | 69.3 | 94.3 | 80.8 | 06-Mar-2019 |
| NUS-HCP | 84.2 | 97.5 | 84.3 | 93.0 | 89.4 | 62.5 | 90.2 | 84.6 | 94.8 | 69.7 | 90.2 | 74.1 | 93.4 | 93.7 | 88.8 | 93.2 | 59.7 | 90.3 | 61.8 | 94.4 | 78.0 | 09-Jun-2014 |
| CNN-S-TUNE-RNK | 83.2 | 96.8 | 82.5 | 91.5 | 88.1 | 62.1 | 88.3 | 81.9 | 94.8 | 70.3 | 80.2 | 76.2 | 92.9 | 90.3 | 89.3 | 95.2 | 57.4 | 83.6 | 66.4 | 93.5 | 81.9 | 28-Jul-2014 |
| NN-ImageNet-Pretrain-1512classes | 83.0 | 95.0 | 83.2 | 88.4 | 84.4 | 61.0 | 89.1 | 84.7 | 90.8 | 72.9 | 87.2 | 69.0 | 91.8 | 93.2 | 88.4 | 96.1 | 64.9 | 87.3 | 62.7 | 91.0 | 80.0 | 14-Apr-2014 |
| CW_DEEP_FCN | 67.5 | 91.2 | 67.3 | 83.5 | 75.3 | 33.3 | 77.6 | 69.7 | 87.0 | 55.1 | 62.1 | 33.4 | 83.5 | 70.2 | 71.5 | 90.1 | 44.6 | 69.6 | 37.5 | 85.6 | 61.4 | 11-Aug-2016 |
| LIRIS_CLSTEXT | 65.6 | 88.3 | 66.1 | 60.8 | 68.5 | 46.7 | 77.3 | 69.3 | 63.7 | 55.9 | 52.6 | 56.6 | 55.5 | 69.7 | 73.8 | 87.1 | 46.3 | 65.4 | 54.0 | 81.2 | 72.8 | 13-Oct-2011 |
| ITI_FK_FLICKR_GRAYSIFT_ENTROPY | 63.5 | 88.1 | 63.0 | 61.9 | 68.6 | 34.9 | 79.6 | 67.4 | 70.5 | 57.5 | 52.0 | 55.3 | 60.1 | 68.7 | 74.3 | 83.2 | 26.4 | 57.6 | 53.4 | 83.0 | 64.0 | 23-Sep-2012 |

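For reference, each per-class figure above is an average precision (AP) score, and the "mean" column is the mean of the 20 class APs. A minimal sketch of the usual AP computation from ranked classifier scores (illustrative only; the official VOC development kit implements its own precision/recall integration):

```python
import numpy as np

def average_precision(scores, labels):
    """AP for one class: average of precision at each rank where a
    positive image is retrieved (a common non-interpolated form; the
    VOC devkit uses a closely related precision/recall integration)."""
    order = np.argsort(-np.asarray(scores, dtype=float))
    labels = np.asarray(labels)[order]
    num_pos = max(int(labels.sum()), 1)
    precision_at_rank = np.cumsum(labels) / np.arange(1, len(labels) + 1)
    return float((precision_at_rank * labels).sum() / num_pos)

scores = [0.9, 0.8, 0.6, 0.4]   # classifier confidences, 4 test images
labels = [1, 0, 1, 0]           # ground-truth presence of the class
print(average_precision(scores, labels))  # 0.833...
# The leaderboard's "mean" column is the mean of the 20 per-class APs.
```
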

Abbreviations

Each entry below lists the submission's Title, Method, Affiliation, Contributors, Description, and submission Date.

Title: BCE loss with transfer learning
Method: BCE loss with transfer learning
Affiliation: Singapore University of Technology and Design
Contributors: Teo Kai Xiang, Woong Wen Tat
Description: We use a binary cross-entropy (BCE) loss for each of the 20 classes with a pretrained ResNet model and three-phase training.
Date: 2019-03-06 08:02:58

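A minimal sketch of the kind of setup this entry describes, assuming PyTorch/torchvision; the specific ResNet variant (resnet50 below) and the three-phase training schedule are not given in the entry and are assumptions:

```python
import torch
import torch.nn as nn
from torchvision import models

NUM_CLASSES = 20  # the 20 VOC object classes

# Pretrained ResNet backbone with the final layer replaced by a
# 20-way output: one independent binary decision per class.
model = models.resnet50(weights=models.ResNet50_Weights.IMAGENET1K_V1)
model.fc = nn.Linear(model.fc.in_features, NUM_CLASSES)

# BCEWithLogitsLoss applies a sigmoid per output, i.e. one binary
# cross-entropy term for each of the 20 labels.
criterion = nn.BCEWithLogitsLoss()

images = torch.randn(4, 3, 224, 224)                    # dummy batch
targets = torch.randint(0, 2, (4, NUM_CLASSES)).float() # multi-hot labels
loss = criterion(model(images), targets)
loss.backward()
```
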
Title: Convolutional network pre-trained on ILSVRC-2012
Method: CNN-S-TUNE-RNK
Affiliation: Visual Geometry Group, University of Oxford
Contributors: Ken Chatfield, Karen Simonyan, Andrea Vedaldi, Andrew Zisserman
Description: A convolutional network, pre-trained on ILSVRC-2012 (1000-class subset of ImageNet), and fine-tuned on VOC-2012 using the ranking hinge loss. The details can be found in our BMVC 2014 paper: "Return of the Devil in the Details: Delving Deep into Convolutional Nets" (Table 3, row (g)) and on the project website: http://www.robots.ox.ac.uk/~vgg/research/deep_eval/
Date: 2014-07-28 12:23:33

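A minimal sketch of a pairwise ranking hinge loss of the kind named here, for one class; the paper's exact pairing and weighting may differ:

```python
import torch

def ranking_hinge_loss(pos_scores, neg_scores, margin=1.0):
    """Pairwise ranking hinge loss for one class: every positive image
    should score at least `margin` higher than every negative image.

    pos_scores: (P,) scores of images containing the class.
    neg_scores: (N,) scores of images not containing it.
    """
    diffs = pos_scores[:, None] - neg_scores[None, :]   # all (P, N) pairs
    return torch.clamp(margin - diffs, min=0).mean()

pos = torch.tensor([2.1, 0.3], requires_grad=True)
neg = torch.tensor([-0.5, 0.8, 0.1])
loss = ranking_hinge_loss(pos, neg)
loss.backward()
```
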
Title: softmax
Method: CW_DEEP_FCN
Affiliation: UESTC
Contributors: HD DR
Description: Ultra deep network.
Date: 2016-08-11 11:10:49

Title: Deep FisherNet for Object Classification
Method: FisherNet-VGG16
Affiliation: HUST & UCSD
Contributors: Peng Tang, Xinggang Wang, Baoguang Shi, Xiang Bai, Wenyu Liu, Zhuowen Tu
Description: We propose a neural network structure in which a Fisher Vector (FV) layer is part of an end-to-end trainable, differentiable system; we name our network FisherNet, and it is learnable using back-propagation. Our proposed FisherNet combines convolutional neural network training and Fisher Vector encoding in a single end-to-end structure. The details can be found in our paper "Deep FisherNet for Object Classification".
Date: 2016-08-16 12:32:03

Title: JueunGot
Method: From_MOD_To_MLT
Affiliation: SWRDC, Device Solutions, Samsung Electronics
Contributors: Hayoung Joo, Donghyuk Kwon, Yong-Deok Kim
Description: The multi-object detection result is converted to multi-label classification. For each box, we only use the classification score. If there exist multiple boxes for some class, we simply take the maximum classification score among them.
Date: 2017-04-21 08:26:58

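A sketch of the max-over-boxes rule the description gives; array shapes are illustrative:

```python
import numpy as np

def detections_to_multilabel(box_scores, num_classes=20):
    """Convert per-box classification scores from a detector into
    image-level multi-label scores by taking, for each class, the
    maximum score over all detected boxes (as the entry describes).

    box_scores: (num_boxes, num_classes) array of class scores.
    """
    if len(box_scores) == 0:
        return np.zeros(num_classes)
    return np.asarray(box_scores).max(axis=0)

# Example: 3 detected boxes, 4 classes (toy sizes).
scores = np.array([[0.1, 0.9,  0.2, 0.0],
                   [0.3, 0.4,  0.8, 0.1],
                   [0.2, 0.95, 0.1, 0.05]])
print(detections_to_multilabel(scores, num_classes=4))  # [0.3 0.95 0.8 0.1]
```
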
Title: Multimodal bootstrapping using MIRFLICKR1m
Method: ITI_FK_FLICKR_GRAYSIFT_ENTROPY
Affiliation: ITI-CERTH & Surrey University
Contributors: E. Chatzilari, S. Nikolopoulos, Y. Kompatsiaris, J. Kittler
Description: Based on the implementation of "K. Chatfield, V. Lempitsky, A. Vedaldi, A. Zisserman, The devil is in the details: an evaluation of recent feature encoding methods, British Machine Vision Conference, 2011" and specifically following the approach described in "F. Perronnin, J. Sanchez, and T. Mensink. 2010. Improving the fisher kernel for large-scale image classification. In Proc. (ECCV'10), Springer-Verlag, Berlin, Heidelberg, 143-156." for feature encoding based on the Fisher kernel. We use gray-SIFT descriptors reduced with PCA to 80 dimensions and a GMM with 256 components for estimating the probabilistic visual vocabulary. A spatial pyramid tree with 3 levels (1st: 1x1, 2nd: 2x2, 3rd: 3x1 horizontal) and dense sampling every 3 pixels has been employed to define the key-points. The descriptors are aggregated using Fisher encoding, which produces a 40960-dimensional vector for each of the 8 regions of the spatial pyramid. These vectors are subsequently concatenated to produce the final 327680-dimensional representation vector for each image. SVM classifiers are trained using the Hellinger kernel, which amounts to square-rooting the features and then normalizing the results using the l2 norm. In addition to the train+validation dataset, the set of examples used for training the visual recognition models is further enriched by collecting the first 500 images per concept from the MIRFLICKR dataset (1 million images in total). The images are ranked in ascending order based on the geometric mean of the image visual score (distance from the SVM hyperplane), the complement of the image tag-based similarity (between the image tags and the concept of interest) and the entropy of tag-based similarities among all concepts in the dataset.
Date: 2012-09-23 16:30:53

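The stated dimensionalities are consistent with the standard Fisher vector size of 2 x D x K per region; a quick check:

```python
# Dimensionality check for the Fisher encoding described above.
pca_dims = 80          # gray-SIFT reduced to 80-D by PCA
gmm_components = 256   # GMM vocabulary size
regions = 1 + 4 + 3    # spatial pyramid: 1x1, 2x2, 3x1 horizontal

# A Fisher vector stacks mean and variance gradients per component.
fv_per_region = 2 * pca_dims * gmm_components
assert fv_per_region == 40960

final_dim = fv_per_region * regions
assert final_dim == 327680
```
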
Title: global_MSDA_local_FC
Method: MSDA+FC
Affiliation: Beihang University & Intel Labs China
Contributors: Jianwei Luo, Zhiguo Jiang, Jianguo Li, Jun Wan
Description: We use the output of the 1000-way softmax layer of VGG's CNN trained on the ILSVRC classification task as the feature, namely Deep Attribute. Given an image, it is represented by the aggregation of the 1000-d features from all the regions extracted from the image by objectness detection techniques like EdgeBoxes. We perform feature aggregation at five scales according to the size of the region. The ultimate representation is thus 5000-d, and named MSDA. Initial SVM classifiers are trained on the MSDA feature. Then, we apply the previously trained classifiers to regions to select a few correlated regions for each image, and perform feature aggregation only using features from these regions. The feature we use in this step is the first fully-connected feature. A new set of classifiers is trained on these aggregated FC features. The final prediction for the image is the fusion of the results from both steps. Note that we do not perform any data augmentation such as flips or crops, and do not fine-tune VGG's CNN on the PASCAL dataset. For this evaluation, we use all of the VOC07 dataset and VOC12 trainval as the training set.
Date: 2015-09-07 02:56:29

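A toy sketch of the multi-scale aggregation idea; the equal-quantile size binning and the max-pooling used below are assumptions, since the entry does not specify the exact aggregation rule:

```python
import numpy as np

def msda_aggregate(region_probs, region_areas, num_scales=5):
    """Group region-level 1000-D softmax ("deep attribute") vectors
    into num_scales bins by region size and pool per bin, giving a
    5 x 1000 = 5000-D image representation as described above."""
    probs = np.asarray(region_probs)                       # (R, 1000)
    edges = np.quantile(region_areas,
                        np.linspace(0, 1, num_scales + 1)[1:-1])
    bins = np.digitize(region_areas, edges)                # scale per region
    return np.concatenate([
        probs[bins == s].max(axis=0) if (bins == s).any()
        else np.zeros(probs.shape[1])
        for s in range(num_scales)])

areas = np.array([500, 900, 4000, 10000, 2500, 40000])    # toy region sizes
probs = np.random.default_rng(0).random((6, 1000))
print(msda_aggregate(probs, areas).shape)                  # (5000,)
```
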
Title: NTU & NJU _MVMI_DSP
Method: MVMI-DSP
Affiliation: NTU, NJU
Contributors: Hao Yang, Joey Tianyi Zhou, Yu Zhang, Bin-bin Gao, Jianxin Wu, Jianfei Cai
Description: We combine the features generated from the whole image with the features from a proposal-based multi-view multi-instance framework to form the final representation of the image.
Date: 2015-04-19 06:18:54

Title: CNN pre-trained on ImageNet
Method: NN-ImageNet-Pretrain-1512classes
Affiliation: INRIA
Contributors: Maxime Oquab, Léon Bottou, Ivan Laptev, Josef Sivic
Description: We use features extracted using a Convolutional Neural Network to perform classification on the VOC dataset. The Convolutional Neural Network features are trained on a 1512-class subset of the ImageNet database. A 2-layer neural network is then trained on the Pascal VOC 2012 dataset, on top of the pre-trained layers. Details on the method can be found at: http://www.di.ens.fr/willow/research/cnn/
Date: 2014-04-14 15:04:01

Title: HCP: Hypothesis CNN Pooling
Method: NUS-HCP
Affiliation: National University of Singapore, Beijing Jiaotong University
Contributors: Yunchao Wei*, Wei Xia*, Jian Dong, Junshi Huang, Bingbing Ni, Yao Zhao, Shuicheng Yan
Description: The Convolutional Neural Network (CNN) has demonstrated promising performance in single-label image classification tasks. However, how a CNN best copes with multi-label images remains an open problem, mainly due to the underlying complex object layouts and insufficient multi-label training images. In this work, we propose a flexible deep CNN infrastructure, called Hypotheses-CNN-Pooling (HCP), which takes an arbitrary number of object segment hypotheses as inputs; a shared CNN is connected with each hypothesis, and finally the CNN outputs from the different hypotheses are aggregated with max pooling for the ultimate multi-label predictions. Some unique characteristics of this flexible deep CNN infrastructure include: 1) no ground-truth bounding box information is required for training; 2) the whole HCP infrastructure is robust to possibly noisy and/or redundant hypotheses; 3) no explicit hypothesis label is required; and 4) it may naturally output multi-label prediction results. Experimental results on the Pascal VOC2007 and VOC2012 multi-label image datasets demonstrate the superiority of the proposed HCP infrastructure over other state-of-the-art methods; in particular, the mAP reaches 0.842 by HCP alone on the VOC2012 dataset.
Date: 2014-06-09 10:54:22

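A minimal sketch of the cross-hypothesis max pooling this description defines; the toy stand-in network below is not the shared CNN from the paper:

```python
import torch

def hcp_pooling(shared_cnn, hypotheses):
    """Hypotheses-CNN-Pooling aggregation as described: a shared CNN
    scores each object-segment hypothesis, and the per-class outputs
    are max-pooled across hypotheses for the image-level prediction.

    hypotheses: (H, 3, 224, 224) tensor of hypothesis crops.
    Returns: (num_classes,) image-level scores.
    """
    per_hypothesis = shared_cnn(hypotheses)      # (H, num_classes)
    return per_hypothesis.max(dim=0).values      # max over hypotheses

# Toy stand-in for the shared CNN (any classifier mapping an image
# crop to 20 class scores would do here).
shared_cnn = torch.nn.Sequential(
    torch.nn.Flatten(), torch.nn.Linear(3 * 224 * 224, 20))
crops = torch.randn(10, 3, 224, 224)             # 10 hypotheses
print(hcp_pooling(shared_cnn, crops).shape)      # torch.Size([20])
```
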
Title: HCP: Hypothesis CNN Pooling
Method: NUS-HCP++
Affiliation: National University of Singapore, Beijing Jiaotong University
Contributors: Yunchao Wei*, Wei Xia*, Jian Dong, Min Lin, Junshi Huang, Bingbing Ni, Yao Zhao, Shuicheng Yan
Description: In this submission, we utilize the VGG-16 model pre-trained on ILSVRC-2012 (1000-class subset of ImageNet) as the shared CNN. The single-model performance reaches 90.1%. The final result is the combination of NUS-HCP with the approach proposed in [1]. [1] Jian Dong, Wei Xia, Qiang Chen, Jiashi Feng, Zhongyang Huang, Shuicheng Yan. Subcategory-aware Object Classification. In CVPR, Portland, Oregon, USA, Jun 23-28, 2013.
Date: 2015-04-22 02:45:39

Title: HCP: Hypothesis CNN Pooling with Subcategory Mining
Method: NUS-HCP-AGS
Affiliation: National University of Singapore, Beijing Jiaotong University
Contributors: Yunchao Wei*, Wei Xia*, Jian Dong, Junshi Huang, Bingbing Ni, Yao Zhao, Shuicheng Yan
Description: As in the NUS-HCP entry above: a shared CNN scores an arbitrary number of object segment hypotheses, and the outputs from the different hypotheses are aggregated with max pooling for the ultimate multi-label predictions, with no ground-truth bounding boxes or explicit hypothesis labels required for training. Experimental results on the Pascal VOC2007 and VOC2012 multi-label image datasets demonstrate the superiority of the proposed HCP infrastructure over other state-of-the-art methods; in particular, the mAP reaches 0.842 by HCP alone and 0.903 after combination with [1] on the VOC2012 dataset. [1] Jian Dong, Wei Xia, Qiang Chen, Jiashi Feng, Zhongyang Huang, Shuicheng Yan. Subcategory-aware Object Classification. In CVPR, Portland, Oregon, USA, Jun 23-28, 2013.
Date: 2014-06-09 10:58:29

Title: Multi-Label Image Classification with RSTN
Method: RSTN
Affiliation: #Anonymity for ACM MM submission
Contributors: (not provided)
Description: We propose a region selection transformer network (RSTN), a tailored vision transformer architecture, for tackling the multi-label image classification (MLIC) task. Specifically, RSTN consists of a transformer encoder, a region selection module (RSM), and a region refinement module (RRM). The transformer encoder takes as input a sequence of flattened image patches to discover global long-range information across the whole network. Next, based on the intermediate attention outputs, the RSM utilizes a ranking mechanism to select semantically related discriminative regions. Further, the RRM is proposed to aggregate the local context information among the selected regions.
Date: 2021-04-23 12:38:49

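A toy sketch of the ranking-based region selection idea, assuming attention weights from a class token to patch tokens; the real RSM sits inside a tailored ViT and is more involved:

```python
import torch

def select_regions(patch_tokens, attn_to_cls, k=8):
    """Rank patch tokens by their attention weight to the class token
    and keep the top-k as candidate discriminative regions. This only
    illustrates the ranking step the description mentions.

    patch_tokens: (N, D) patch embeddings from a transformer encoder.
    attn_to_cls:  (N,) attention weights from the class token to patches.
    """
    topk = torch.topk(attn_to_cls, k=min(k, attn_to_cls.numel()))
    return patch_tokens[topk.indices], topk.indices

tokens = torch.randn(196, 768)        # 14x14 patches, ViT-Base width
attn = torch.rand(196)
selected, idx = select_regions(tokens, attn)
print(selected.shape)                 # torch.Size([8, 768])
```
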
Title: HFUT_Random_Crop_Pooling
Method: Random_Crop_Pooling
Affiliation: Hefei University of Technology
Contributors: Changzhi Luo, Meng Wang, Richang Hong, Jiashi Feng
Description: We first fine-tune the 16-layer VGG-Net with a random crop pooling approach, and then use the fine-tuned model to extract features for each image. The final results are obtained using a linear SVM classifier.
Date: 2016-05-09 03:14:23

Title: HFUT_Random_Crop_Pooling_AGS
Method: Random_Crop_Pooling_AGS
Affiliation: Hefei University of Technology
Contributors: Changzhi Luo, Meng Wang, Richang Hong, Jiashi Feng
Description: We fuse the random crop pooling approach with the approach proposed in [1]. [1] Jian Dong, Wei Xia, Qiang Chen, Jiashi Feng, Zhongyang Huang, Shuicheng Yan. Subcategory-aware Object Classification. In CVPR, Portland, Oregon, USA, Jun 23-28, 2013.
Date: 2016-05-09 03:12:36

Title: SDE embedded CNN
Method: SDE_CNN
Affiliation: NUS & NLPR
Contributors: Guo-Sen Xie, Xu-Yao Zhang, Shuicheng Yan, and Cheng-Lin Liu
Description: The Bag of Words (BoW) model and the Convolutional Neural Network (CNN) are two milestones in visual recognition. Both BoW and CNN require a feature pooling operation in their frameworks. In particular, max-pooling has been validated as an efficient and effective pooling method compared with other methods such as average pooling and stochastic pooling. In this paper, we first evaluate different pooling methods, and then propose a new feature pooling method termed Selective, Discriminative and Equalizing pooling (SDE). The SDE representation is a feature learning mechanism that jointly optimizes the pooled representations with the target of learning more selective, discriminative and equalizing features. We use bilevel optimization to solve the joint optimization problem. Experiments on multiple benchmark databases (including both single-label and multi-label ones) validate the effectiveness of our framework. In particular, we achieve state-of-the-art results (mAP) of 93.2% and 94.0% on the PASCAL VOC2007 and VOC2012 databases, respectively.
Date: 2015-11-16 15:18:08

Title: SDE embedded CNN
Method: SDE_CNN_AGS
Affiliation: NUS & NLPR
Contributors: Guo-Sen Xie, Xu-Yao Zhang, Shuicheng Yan, and Cheng-Lin Liu
Description: Same as the SDE_CNN entry above.
Date: 2015-11-16 15:06:45

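The SDE entries compare pooling operators; below are the three baselines they name, applied to one activation map. The SDE pooling itself involves a bilevel optimization and is not reproduced here:

```python
import torch
import torch.nn.functional as F

x = torch.rand(1, 1, 4, 4)                    # one 4x4 activation map

max_pooled = F.max_pool2d(x, kernel_size=4)   # max pooling
avg_pooled = F.avg_pool2d(x, kernel_size=4)   # average pooling

# Stochastic pooling: sample one activation with probability
# proportional to its magnitude (Zeiler & Fergus, 2013).
probs = x.flatten() / x.sum()
stochastic_pooled = x.flatten()[torch.multinomial(probs, 1)]
```
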
Title: Knowledge embedded semantic decomposition
Method: SYSU_ KESD
Affiliation: Sun Yat-Sen University
Contributors: Tianshui Chen, Muxin Xu, Xiaolu Hui, Riquan Chen, Liang Lin
Description: We present a novel approach that incorporates statistical prior knowledge to extract semantic-aware features and simultaneously capture the co-occurrence of objects in an image.
Date: 2018-10-16 05:00:08

Title: FCFOF: Fusion of Context Feature and Object Feature
Method: Tencent-BestImage&CASIA_FCFOF
Affiliation: Tencent BestImage Team; Institute of Automation, Chinese Academy of Sciences
Contributors: Yan Kong, ScorpioGuo, Fuzhang Wu, Fan Tang, GaryHuang, Weiming Dong
Description: In this submission, we make use of features at both the context level and the object level. We extract context CNN features from the whole image to represent context information, and extract local CNN features via the selective search method to represent exact object information. These two kinds of features are used to train SVM classifiers. The final result is the combination of the two models.
Date: 2015-04-09 12:18:17

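A minimal sketch of the two-stream fusion described here, assuming precomputed CNN features; the equal-weight score combination below is an assumption, as the entry does not give the fusion weights:

```python
import numpy as np
from sklearn.svm import LinearSVC

# Stand-ins for precomputed features: whole-image (context) CNN
# features and aggregated selective-search region (object) features.
rng = np.random.default_rng(0)
X_ctx = rng.normal(size=(100, 4096))
X_obj = rng.normal(size=(100, 4096))
y = rng.integers(0, 2, size=100)        # one-vs-rest labels for one class

svm_ctx = LinearSVC(dual=False).fit(X_ctx, y)   # context-level SVM
svm_obj = LinearSVC(dual=False).fit(X_obj, y)   # object-level SVM

# Equal-weight combination of the two decision scores (assumed).
fused = 0.5 * svm_ctx.decision_function(X_ctx) \
      + 0.5 * svm_obj.decision_function(X_obj)
```
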
Title: Very deep ConvNet features and SVM classifier
Method: VERY_DEEP_CONVNET_16_19_SVM
Affiliation: Visual Geometry Group, University of Oxford
Contributors: Karen Simonyan, Andrew Zisserman
Description: The results were obtained using multi-scale convolutional features and an SVM classifier. The features were computed using two very deep convolutional networks (16 and 19 weight layers), pre-trained on ILSVRC-2012 (1000-class subset of ImageNet). Fine-tuning on VOC-2012 was not performed. The details can be found in our paper: "Very Deep Convolutional Networks for Large-Scale Image Recognition" (http://arxiv.org/pdf/1409.1556).
Date: 2014-11-16 15:51:49

Title: Very deep ConvNet features and SVM classifier
Method: VERY_DEEP_CONVNET_16_SVM
Affiliation: Visual Geometry Group, University of Oxford
Contributors: Karen Simonyan, Andrew Zisserman
Description: The results were obtained using multi-scale convolutional features and an SVM classifier. The features were computed using a very deep convolutional network (16 weight layers), pre-trained on ILSVRC-2012 (1000-class subset of ImageNet). Fine-tuning on VOC-2012 was not performed. The details can be found in our paper: "Very Deep Convolutional Networks for Large-Scale Image Recognition" (http://arxiv.org/pdf/1409.1556).
Date: 2014-11-17 16:30:06

Title: Very deep ConvNet features and SVM classifier
Method: VERY_DEEP_CONVNET_19_SVM
Affiliation: Visual Geometry Group, University of Oxford
Contributors: Karen Simonyan, Andrew Zisserman
Description: The results were obtained using multi-scale convolutional features and an SVM classifier. The features were computed using a very deep convolutional network (19 weight layers), pre-trained on ILSVRC-2012 (1000-class subset of ImageNet). Fine-tuning on VOC-2012 was not performed. The details can be found in our paper: "Very Deep Convolutional Networks for Large-Scale Image Recognition" (http://arxiv.org/pdf/1409.1556).
Date: 2014-11-17 16:15:32

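A minimal sketch of the multi-scale feature-plus-SVM pipeline shared by the three VERY_DEEP_CONVNET entries; the scale set, the averaging across scales, and the dummy extractor below are placeholders for the procedure in the cited paper:

```python
import numpy as np
from sklearn.svm import LinearSVC

def multiscale_descriptor(extract, image_id, scales=(256, 384, 512)):
    """Compute a ConvNet descriptor at several image scales and average
    them into one vector (aggregation choice assumed here)."""
    return np.mean([extract(image_id, s) for s in scales], axis=0)

# Dummy extractor standing in for the 16/19-layer ConvNet; a real one
# would return e.g. a 4096-D fully-connected-layer descriptor.
def extract(image_id, scale):
    return np.random.default_rng(image_id * 1000 + scale).normal(size=4096)

X = np.stack([multiscale_descriptor(extract, i) for i in range(40)])
y = np.arange(40) % 2                    # toy one-vs-rest labels
clf = LinearSVC(dual=False).fit(X, y)    # per-class SVM on the descriptors
```
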
Title: finetune
Method: inceptionv4_svm
Affiliation: seu
Contributors: wangyin4
Description: no
Date: 2017-12-21 07:28:57

Title: svm with v4
Method: new_label_2
Affiliation: seu
Contributors: wangyin
Description: some new label
Date: 2018-01-13 10:16:24

Title: svm with v4 finetune
Method: new_label_4
Affiliation: seu
Contributors: wangyin
Description: 4 new label
Date: 2018-01-13 11:18:14

Title: svm with v4 finetune part
Method: new_label_6
Affiliation: seu
Contributors: wangying
Description: as up
Date: 2018-01-13 11:43:50

Title: svm with v4 finetune part small
Method: new_label_8
Affiliation: seu
Contributors: wangyin, zhangyu
Description: as up
Date: 2018-01-13 13:20:12

Title: Classification with additional text feature
Method: LIRIS_CLSTEXT
Affiliation: LIRIS, Ecole Centrale de Lyon, CNRS, UMR5205, France
Contributors: Chao ZHU, Yuxing TANG, Ningning LIU, Charles-Edmond BICHOT, Emmanuel Dellandrea, Liming CHEN
Description: In this submission, we try to use additional text information to help with object classification. We propose novel text features [1] based on semantic distance using WordNet. The basic idea is to calculate the semantic distance between the text associated with an image and an emotional dictionary based on path similarity, which denotes how similar two word senses are, based on the shortest path that connects the senses in a taxonomy. As there are no tags included in the Pascal 2011 dataset, we downloaded 1 million Flickr images (including their tags) as the additional textual source. First, for each Pascal image, we find its most similar images (top 20) from the database using a KNN method based on visual features (LBP and color HSV histograms), and then use these images' tags to extract the text feature. We use an SVM with RBF kernel to train the classifier and predict the outputs. For classification based on visual features, we follow the same method described in our other submission. The outputs of the visual-feature-based method and the text-feature-based method are then linearly combined as the final results. [1] N. Liu, Y. Zhang, E. Dellandréa, B. Tellez, L. Chen: 'Associating text features with visual ones to improve affective image classification', International Conference on Affective Computing (ACII), Memphis, USA, 2011.
Date: 2011-10-13 21:20:50

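A minimal sketch of the WordNet path-similarity measure the text feature is built on, using NLTK, which exposes this shortest-path measure as path_similarity:

```python
# Requires: pip install nltk, then nltk.download('wordnet') once.
from nltk.corpus import wordnet as wn

def best_path_similarity(word_a, word_b):
    """Highest path similarity over all sense pairs of two words;
    path_similarity scores a pair of senses by the shortest path
    connecting them in the WordNet taxonomy (1.0 = same sense)."""
    best = 0.0
    for s1 in wn.synsets(word_a):
        for s2 in wn.synsets(word_b):
            sim = s1.path_similarity(s2)
            if sim is not None and sim > best:
                best = sim
    return best

print(best_path_similarity('kitten', 'cat'))  # a value in (0, 1]
```
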