PASCAL VOC Challenge performance evaluation and download server
submission | mean AP | aeroplane | bicycle | bird | boat | bottle | bus | car | cat | chair | cow | diningtable | dog | horse | motorbike | person | pottedplant | sheep | sofa | train | tvmonitor | submission date
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---
RSTN [?] | 96.7 | 99.9 | 98.1 | 99.2 | 98.5 | 89.3 | 97.8 | 96.3 | 99.2 | 92.6 | 99.3 | 91.2 | 99.3 | 99.6 | 97.7 | 99.2 | 89.3 | 99.6 | 92.0 | 99.8 | 96.6 | 23-Apr-2021 | |
SYSU_ KESD [?] | 95.4 | 99.9 | 96.6 | 98.4 | 97.0 | 88.6 | 96.4 | 95.9 | 99.2 | 89.0 | 97.9 | 88.6 | 99.4 | 99.3 | 97.9 | 99.2 | 85.8 | 98.6 | 86.7 | 99.4 | 95.1 | 16-Oct-2018 | |
From_MOD_To_MLT [?] | 95.3 | 99.2 | 96.7 | 97.6 | 96.3 | 87.0 | 97.1 | 96.9 | 98.8 | 90.9 | 96.7 | 88.5 | 98.5 | 98.6 | 98.0 | 99.2 | 87.5 | 96.1 | 88.7 | 98.5 | 94.3 | 21-Apr-2017 | |
Random_Crop_Pooling_AGS [?] | 94.3 | 99.8 | 94.5 | 98.1 | 96.1 | 85.5 | 96.1 | 95.5 | 99.0 | 90.2 | 95.0 | 87.8 | 98.7 | 98.4 | 97.5 | 99.0 | 80.1 | 95.9 | 86.5 | 98.8 | 94.6 | 09-May-2016 | |
SDE_CNN_AGS [?] | 94.0 | 99.8 | 94.7 | 97.6 | 96.4 | 83.6 | 95.9 | 94.8 | 99.0 | 90.4 | 94.3 | 88.1 | 98.9 | 98.5 | 97.2 | 98.8 | 76.8 | 95.0 | 86.8 | 98.7 | 94.2 | 16-Nov-2015 | |
NUS-HCP++ [?] | 93.0 | 99.8 | 94.9 | 97.7 | 95.5 | 80.9 | 95.7 | 93.9 | 98.9 | 88.5 | 95.0 | 85.7 | 98.1 | 98.5 | 97.1 | 96.9 | 74.2 | 93.8 | 84.1 | 98.4 | 92.5 | 22-Apr-2015 | |
inceptionv4_svm [?] | 92.4 | 99.4 | 95.6 | 96.6 | 95.0 | 80.2 | 93.8 | 91.6 | 98.4 | 84.7 | 94.2 | 86.0 | 98.2 | 98.3 | 96.7 | 98.3 | 74.5 | 96.2 | 79.1 | 98.4 | 92.1 | 21-Dec-2017 | |
Random_Crop_Pooling [?] | 92.2 | 99.3 | 92.2 | 97.5 | 94.9 | 82.6 | 94.1 | 92.4 | 98.5 | 83.8 | 93.5 | 83.1 | 98.1 | 97.3 | 96.0 | 98.8 | 77.7 | 95.1 | 79.4 | 97.7 | 92.4 | 09-May-2016 | |
SDE_CNN [?] | 91.7 | 99.1 | 92.2 | 96.9 | 95.3 | 80.0 | 93.0 | 90.3 | 98.5 | 83.2 | 93.2 | 84.2 | 98.1 | 97.6 | 95.6 | 98.7 | 75.0 | 94.3 | 79.7 | 97.8 | 91.2 | 16-Nov-2015 | |
FisherNet-VGG16 [?] | 91.5 | 99.2 | 92.5 | 96.8 | 94.4 | 81.0 | 93.2 | 92.3 | 98.2 | 82.9 | 94.3 | 82.2 | 97.4 | 97.3 | 95.9 | 98.7 | 72.9 | 95.1 | 77.7 | 97.5 | 90.8 | 16-Aug-2016 | |
MSDA+FC [?] | 91.4 | 99.2 | 93.8 | 96.1 | 95.2 | 81.7 | 94.3 | 91.6 | 98.1 | 81.9 | 91.7 | 83.5 | 96.3 | 95.6 | 96.0 | 98.2 | 77.9 | 93.6 | 74.7 | 97.6 | 91.9 | 07-Sep-2015 | |
new_label_2 [?] | 91.3 | 99.2 | 95.4 | 96.1 | 94.8 | 79.7 | 93.9 | 90.9 | 98.4 | 84.2 | 94.0 | 85.4 | 98.0 | 98.0 | 96.7 | 98.0 | 59.5 | 95.6 | 78.9 | 98.4 | 91.4 | 13-Jan-2018 | |
MVMI-DSP [?] | 90.7 | 98.9 | 93.1 | 96.0 | 94.1 | 76.4 | 93.5 | 90.8 | 97.9 | 80.2 | 92.1 | 82.4 | 97.2 | 96.8 | 95.7 | 98.1 | 73.9 | 93.6 | 76.8 | 97.5 | 89.0 | 19-Apr-2015 | |
Tencent-BestImage&CASIA_FCFOF [?] | 90.4 | 98.8 | 92.5 | 96.1 | 94.0 | 74.3 | 92.6 | 90.9 | 97.8 | 85.0 | 92.2 | 83.1 | 97.1 | 95.8 | 93.0 | 97.8 | 67.6 | 92.5 | 82.2 | 97.0 | 88.5 | 09-Apr-2015 | |
NUS-HCP-AGS [?] | 90.3 | 99.0 | 91.8 | 94.8 | 92.4 | 72.6 | 95.0 | 91.8 | 97.4 | 85.2 | 92.9 | 83.1 | 96.0 | 96.6 | 96.1 | 94.9 | 68.4 | 92.0 | 79.6 | 97.3 | 88.5 | 09-Jun-2014 | |
new_label_4 [?] | 89.4 | 99.5 | 95.2 | 96.2 | 94.5 | 77.2 | 93.9 | 90.9 | 98.3 | 79.9 | 94.0 | 79.4 | 98.1 | 98.1 | 96.6 | 96.9 | 61.1 | 95.7 | 76.7 | 98.3 | 68.7 | 13-Jan-2018 | |
VERY_DEEP_CONVNET_16_19_SVM [?] | 89.3 | 99.1 | 89.1 | 96.0 | 94.1 | 74.1 | 92.2 | 85.3 | 97.9 | 79.9 | 92.0 | 83.7 | 97.5 | 96.5 | 94.7 | 97.1 | 63.7 | 93.6 | 75.2 | 97.4 | 87.8 | 16-Nov-2014 | |
VERY_DEEP_CONVNET_19_SVM [?] | 89.0 | 99.1 | 88.7 | 95.7 | 93.9 | 73.1 | 92.1 | 84.8 | 97.7 | 79.1 | 90.7 | 83.2 | 97.3 | 96.2 | 94.3 | 96.9 | 63.4 | 93.2 | 74.6 | 97.3 | 87.9 | 17-Nov-2014 | |
VERY_DEEP_CONVNET_16_SVM [?] | 89.0 | 99.0 | 88.8 | 95.9 | 93.8 | 73.1 | 92.1 | 85.1 | 97.8 | 79.5 | 91.1 | 83.3 | 97.2 | 96.3 | 94.5 | 96.9 | 63.1 | 93.4 | 75.0 | 97.1 | 87.1 | 17-Nov-2014 | |
new_label_6 [?] | 88.5 | 99.4 | 94.5 | 95.1 | 94.3 | 78.8 | 93.8 | 90.0 | 97.5 | 83.4 | 89.1 | 85.8 | 98.1 | 95.3 | 74.6 | 96.0 | 55.4 | 95.5 | 76.9 | 98.3 | 77.8 | 13-Jan-2018 | |
BCE loss with transfer learning [?] | 84.4 | 97.8 | 84.0 | 93.1 | 88.1 | 63.0 | 88.7 | 80.8 | 95.8 | 72.4 | 87.2 | 77.1 | 94.4 | 93.0 | 91.0 | 95.4 | 54.6 | 87.8 | 69.3 | 94.3 | 80.8 | 06-Mar-2019 | |
NUS-HCP [?] | 84.2 | 97.5 | 84.3 | 93.0 | 89.4 | 62.5 | 90.2 | 84.6 | 94.8 | 69.7 | 90.2 | 74.1 | 93.4 | 93.7 | 88.8 | 93.2 | 59.7 | 90.3 | 61.8 | 94.4 | 78.0 | 09-Jun-2014 | |
CNN-S-TUNE-RNK [?] | 83.2 | 96.8 | 82.5 | 91.5 | 88.1 | 62.1 | 88.3 | 81.9 | 94.8 | 70.3 | 80.2 | 76.2 | 92.9 | 90.3 | 89.3 | 95.2 | 57.4 | 83.6 | 66.4 | 93.5 | 81.9 | 28-Jul-2014 | |
NN-ImageNet-Pretrain-1512classes [?] | 83.0 | 95.0 | 83.2 | 88.4 | 84.4 | 61.0 | 89.1 | 84.7 | 90.8 | 72.9 | 87.2 | 69.0 | 91.8 | 93.2 | 88.4 | 96.1 | 64.9 | 87.3 | 62.7 | 91.0 | 80.0 | 14-Apr-2014 | |
new_label_8 [?] | 73.6 | 99.4 | 94.5 | 95.5 | 94.1 | 56.3 | 93.0 | 85.4 | 97.8 | 80.9 | 3.8 | 78.3 | 97.1 | 93.7 | 35.3 | 94.6 | 58.5 | 25.7 | 74.5 | 95.8 | 18.5 | 13-Jan-2018 | |
CW_DEEP_FCN [?] | 67.5 | 91.2 | 67.3 | 83.5 | 75.3 | 33.3 | 77.6 | 69.7 | 87.0 | 55.1 | 62.1 | 33.4 | 83.5 | 70.2 | 71.5 | 90.1 | 44.6 | 69.6 | 37.5 | 85.6 | 61.4 | 11-Aug-2016 | |
LIRIS_CLSTEXT [?] | 65.6 | 88.3 | 66.1 | 60.8 | 68.5 | 46.7 | 77.3 | 69.3 | 63.7 | 55.9 | 52.6 | 56.6 | 55.5 | 69.7 | 73.8 | 87.1 | 46.3 | 65.4 | 54.0 | 81.2 | 72.8 | 13-Oct-2011 | |
ITI_FK_FLICKR_GRAYSIFT_ENTROPY [?] | 63.5 | 88.1 | 63.0 | 61.9 | 68.6 | 34.9 | 79.6 | 67.4 | 70.5 | 57.5 | 52.0 | 55.3 | 60.1 | 68.7 | 74.3 | 83.2 | 26.4 | 57.6 | 53.4 | 83.0 | 64.0 | 23-Sep-2012 |
Title | Method | Affiliation | Contributors | Description | Date |
---|---|---|---|---|---|
BCE loss with transfer learning | BCE loss with transfer learning | Singapore University of Technology and Design | Teo Kai Xiang, Woong Wen Tat | We use a BCE loss for each of the 20 classes with a pretrained ResNet model and three-phase training (a minimal sketch of this setup appears after this table). | 2019-03-06 08:02:58 |
Convolutional network pre-trained on ILSVRC-2012 | CNN-S-TUNE-RNK | Visual Geometry Group, University of Oxford | Ken Chatfield, Karen Simonyan, Andrea Vedaldi, Andrew Zisserman | A convolutional network, pre-trained on ILSVRC-2012 (1000-class subset of ImageNet) and fine-tuned on VOC-2012 using the ranking hinge loss (a common form of which is sketched after this table). The details can be found in our BMVC 2014 paper, "Return of the Devil in the Details: Delving Deep into Convolutional Nets" (Table 3, row (g)), and on the project website: http://www.robots.ox.ac.uk/~vgg/research/deep_eval/ | 2014-07-28 12:23:33 |
softmax | CW_DEEP_FCN | UESTC | HD DR | ultra deep network | 2016-08-11 11:10:49 |
Deep FisherNet for Object Classification | FisherNet-VGG16 | HUST & UCSD | Peng Tang, Xinggang Wang, Baoguang Shi, Xiang Bai, Wenyu Liu, Zhuowen Tu | We propose a neural network structure with a Fisher Vector (FV) layer as part of an end-to-end trainable, differentiable system; we name our network FisherNet, and it is learnable using back-propagation. Our proposed FisherNet combines convolutional neural network training and Fisher Vector encoding in a single end-to-end structure. The details can be found in our paper "Deep FisherNet for Object Classification". | 2016-08-16 12:32:03 |
JueunGot | From_MOD_To_MLT | SWRDC, Device Solutions, Samsung Electronics | Hayoung Joo, Donghyuk Kwon, Yong-Deok Kim | The multi-object detection result is converted to a multi-label classification. For each box, we only use the classification score. If there are multiple boxes for some class, we simply take the maximum classification score among them (a minimal sketch of this conversion appears after this table). | 2017-04-21 08:26:58 |
Multimodal bootstrapping using MIRFLICKR1m | ITI_FK_FLICKR_GRAYSIFT_ENTROPY | ITI-CERTH & Surrey University | E. Chatzilari, S. Nikolopoulos, Y. Kompatsiaris, J. Kittler | Based on the implementation of “K. Chatfield, V. Lempitsky, A. Vedaldi, A. Zisserman, The devil is in the details: an evaluation of recent feature encoding methods, British Machine Vision Conference, 2011” and specifically following the approach described in “F. Perronnin, J. Sanchez, and T. Mensink. 2010. Improving the fisher kernel for large-scale image classification. In Proc. (ECCV'10), Springer-Verlag, Berlin, Heidelberg, 143-156.” for feature encoding based on the Fisher Kernel. We use gray-SIFT descriptors reduced with PCA to 80 dimensions and a GMM with 256 components for estimating the probabilistic visual vocabulary. A spatial pyramid tree with 3 levels (1st: 1x1, 2nd: 2x2 and 3rd: 3x1 horizontal) and dense sampling every 3 pixels has been employed to define the key-points. The descriptors are aggregated using Fisher encoding, which produces a 40960-dimensional vector for each of the 8 regions of the spatial pyramid. These vectors are subsequently concatenated to produce the final 327680-dimensional representation vector for each image. SVM classifiers are trained using the Hellinger kernel, which amounts to square-rooting the features and then normalizing them using the l2 norm (a minimal sketch of this feature map appears after this table). In addition to the train+validation dataset, the set of examples used for training the visual recognition models is further enriched by collecting the first 500 images per concept from the MIRFLICKR dataset (1 million images in total). The images are ranked in ascending order based on the geometric mean of the image visual score (distance from the SVM hyperplane), the complement of the image tag-based similarity (between the image tags and the concept of interest) and the entropy of tag-based similarities among all concepts in the dataset. | 2012-09-23 16:30:53 |
global_MSDA_local_FC | MSDA+FC | Beihang University & Intel Labs China | Jianwei Luo, Zhiguo Jiang, Jianguo Li, Jun Wan | We use the output of the 1000-way softmax layer of VGG's CNN trained on the ILSVRC classification task as the feature, namely Deep Attribute. Given an image, it is represented by the aggregation of the 1000-d features from all the regions extracted on the image by objectness detection techniques such as EdgeBoxes. We perform feature aggregation on five scales according to the size of the region; the resulting representation is thus 5000-d and is named MSDA (a minimal sketch of this aggregation appears after this table). Initial SVM classifiers are trained on the MSDA feature. Then we apply the previously trained classifiers to the regions to select a few correlated regions for each image, and perform feature aggregation using only the features from these regions; the feature we use in this step is the first fully-connected (FC) feature. A new set of classifiers is trained on these aggregated FC features. The final prediction for the image is the fusion of the results from both steps. Note that we do not perform any data augmentation such as flips or crops, and we do not fine-tune the VGG CNN on the PASCAL dataset. For this evaluation, we use all of the VOC07 dataset and VOC12 trainval as the training set. | 2015-09-07 02:56:29 |
NTU & NJU _MVMI_DSP | MVMI-DSP | NTU, NJU | Hao Yang, Joey Tianyi Zhou, Yu Zhang, Bin-bin Gao, Jianxin Wu, Jianfei Cai | We combine the features generated from the whole image with the features from a proposal based multi-view multi-instance framework to form the final representation of the image. | 2015-04-19 06:18:54 |
CNN pre-trained on ImageNet | NN-ImageNet-Pretrain-1512classes | INRIA | Maxime Oquab, Léon Bottou, Ivan Laptev, Josef Sivic | We use features extracted using a Convolutional Neural Network to perform classification on the VOC dataset. The Convolutional Neural Network features are trained on a 1512-class subset of the ImageNet database. A 2-layer neural network is then trained on the Pascal VOC 2012 dataset, on top of the pre-trained layers. Details on the method can be found at: http://www.di.ens.fr/willow/research/cnn/ | 2014-04-14 15:04:01 |
HCP: Hypothesis CNN Pooling | NUS-HCP | National University of Singapore, Beijing Jiaotong University | Yunchao Wei*, Wei Xia*, Jian Dong, Junshi Huang, Bingbing Ni, Yao Zhao, Shuicheng Yan | Convolutional Neural Network (CNN) has demonstrated promising performance in single-label image classification tasks. However, how CNN best copes with multi-label images still remains an open problem, mainly due to the underlying complex object layouts and insufficient multi-label training images. In this work, we propose a flexible deep CNN infrastructure, called Hypotheses-CNN-Pooling (HCP), which takes an arbitrary number of object segment hypotheses as the inputs; a shared CNN is connected with each hypothesis, and the CNN outputs from the different hypotheses are aggregated with max pooling for the ultimate multi-label predictions (a minimal sketch of this pooling step appears after this table). Some unique characteristics of this flexible deep CNN infrastructure include: 1) no ground-truth bounding box information is required for training, 2) the whole HCP infrastructure is robust to possibly noisy and/or redundant hypotheses, 3) no explicit hypothesis label is required, and 4) it naturally outputs multi-label prediction results. Experimental results on the Pascal VOC2007 and VOC2012 multi-label image datasets demonstrate the superiority of the proposed HCP infrastructure over other state-of-the-art methods; in particular, the mAP reaches 0.842 by HCP alone on the VOC2012 dataset. | 2014-06-09 10:54:22 |
HCP: Hypothesis CNN Pooling | NUS-HCP++ | National University of Singapore, Beijing Jiaotong University | Yunchao Wei*, Wei Xia*, Jian Dong, Min Lin, Junshi Huang, Bingbing Ni, Yao Zhao, Shuicheng Yan. | In this submission, we utilize the VGG-16 pre-trained model on ILSVRC-2012 (1000-class subset of ImageNet) as the shared CNN. The single model performance can reach 90.1%. The final result is the combination of the NUS-HCP with the approach proposed in [1]. [1]Jian Dong, Wei Xia, Qiang Chen, Jiashi Feng, Zhongyang Huang, Shuicheng Yan. Subcategory-aware Object Classification. In CVPR, Portland, Oregon, USA, Jun 23-28, 2013. | 2015-04-22 02:45:39 |
HCP: Hypothesis CNN Pooling with Subcategory Mining | NUS-HCP-AGS | National University of Singapore, Beijing Jiaotong University | Yunchao Wei*, Wei Xia*, Jian Dong, Junshi Huang, Bingbing Ni, Yao Zhao, Shuicheng Yan | Convolutional Neural Network (CNN) has demonstrated promising performance in single-label image classification tasks. However, how CNN best copes with multi-label images still remains an open problem, mainly due to the underlying complex object layouts and insufficient multi-label training images. In this work, we propose a flexible deep CNN infrastructure, called Hypotheses-CNN-Pooling (HCP), which takes an arbitrary number of object segment hypotheses as the inputs; a shared CNN is connected with each hypothesis, and the CNN outputs from the different hypotheses are aggregated with max pooling for the ultimate multi-label predictions. Some unique characteristics of this flexible deep CNN infrastructure include: 1) no ground-truth bounding box information is required for training, 2) the whole HCP infrastructure is robust to possibly noisy and/or redundant hypotheses, 3) no explicit hypothesis label is required, and 4) it naturally outputs multi-label prediction results. Experimental results on the Pascal VOC2007 and VOC2012 multi-label image datasets demonstrate the superiority of the proposed HCP infrastructure over other state-of-the-art methods; in particular, the mAP reaches 0.842 by HCP alone and 0.903 after the combination with [1] on the VOC2012 dataset. [1] Jian Dong, Wei Xia, Qiang Chen, Jiashi Feng, Zhongyang Huang, Shuicheng Yan. Subcategory-aware Object Classification. In CVPR, Portland, Oregon, USA, Jun 23-28, 2013. | 2014-06-09 10:58:29 |
Multi-Label Image Classification with RSTN | RSTN | # | Anonymous (ACM MM submission) | We propose a region selection transformer network (RSTN), a tailored vision transformer architecture, for tackling the MLIC task. Specifically, RSTN consists of a transformer encoder, a region selection module (RSM), and a region refinement module (RRM). The transformer encoder takes as input a sequence of flattened image patches for discovering global long-range information across the whole network. Next, based on the intermediate attention outputs, the RSM utilizes a ranking mechanism to select the semantically related discriminative regions. Further, the RRM is proposed to aggregate the local context information among the selected regions. | 2021-04-23 12:38:49 |
HFUT_Random_Crop_Pooling | Random_Crop_Pooling | Hefei University of Technology | Changzhi Luo, Meng Wang, Richang Hong, Jiashi Feng | We first fine-tune the 16-layer VGG-Net with a random crop pooling approach, and then use the fine-tuned model to extract features for each image. The final results are obtained using a linear SVM classifier. | 2016-05-09 03:14:23 |
HFUT_Random_Crop_Pooling_AGS | Random_Crop_Pooling_AGS | Hefei University of Technology | Changzhi Luo, Meng Wang, Richang Hong, Jiashi Feng | We fuse the random crop pooling approach with the approach proposed in [1]. [1] Jian Dong, Wei Xia, Qiang Chen, Jiashi Feng, Zhongyang Huang, Shuicheng Yan. Subcategory-aware Object Classification. In CVPR, Portland, Oregon, USA, Jun 23-28, 2013. | 2016-05-09 03:12:36 |
SDE embedded CNN | SDE_CNN | NUS & NLPR | Guo-Sen Xie, Xu-Yao Zhang, Shuicheng Yan, and Cheng-Lin Liu | The Bag of Words (BoW) model and the Convolutional Neural Network (CNN) are two milestones in visual recognition. Both BoW and CNN require a feature pooling operation when constructing the framework. In particular, max-pooling has been validated as an efficient and effective pooling method compared with other methods such as average pooling and stochastic pooling. In this paper, we first evaluate different pooling methods, and then propose a new feature pooling method termed Selective, Discriminative and Equalizing pooling (SDE). The SDE representation is a feature learning mechanism that jointly optimizes the pooled representations with the target of learning more selective, discriminative and equalizing features. We use bilevel optimization to solve the joint optimization problem. Experiments on multiple benchmark databases (including both single-label and multi-label ones) validate the effectiveness of our framework. In particular, we achieve state-of-the-art results (mAP) of 93.2% and 94.0% on the PASCAL VOC2007 and VOC2012 databases, respectively. | 2015-11-16 15:18:08 |
SDE embedded CNN | SDE_CNN_AGS | NUS & NLPR | Guo-Sen Xie, Xu-Yao Zhang, Shuicheng Yan, and Cheng-Lin Liu | The Bag of Words (BoW) model and the Convolutional Neural Network (CNN) are two milestones in visual recognition. Both BoW and CNN require a feature pooling operation when constructing the framework. In particular, max-pooling has been validated as an efficient and effective pooling method compared with other methods such as average pooling and stochastic pooling. In this paper, we first evaluate different pooling methods, and then propose a new feature pooling method termed Selective, Discriminative and Equalizing pooling (SDE). The SDE representation is a feature learning mechanism that jointly optimizes the pooled representations with the target of learning more selective, discriminative and equalizing features. We use bilevel optimization to solve the joint optimization problem. Experiments on multiple benchmark databases (including both single-label and multi-label ones) validate the effectiveness of our framework. In particular, we achieve state-of-the-art results (mAP) of 93.2% and 94.0% on the PASCAL VOC2007 and VOC2012 databases, respectively. | 2015-11-16 15:06:45 |
Knowledge embedded semantic decomposition | SYSU_ KESD | Sun Yat-Sen University | Tianshui Chen, Muxin Xu, Xiaolu Hui, Riquan Chen, Liang Lin | We present a novel approach that incorporates statistical prior knowledge to extract semantic-aware features and simultaneously capture co-occurrence of objects in an image. | 2018-10-16 05:00:08 |
FCFOF: Fusion of Context Feature and Object Feature | Tencent-BestImage&CASIA_FCFOF | Tencent BestImage Team; Institute of Automation, Chinese Academy of Sciences | Yan Kong, ScorpioGuo, Fuzhang Wu, Fan Tang, GaryHuang, Weiming Dong | In this submission, we make use of features at both the context level and the object level. We extract context CNN features from the whole image to represent context information, and extract local CNN features via the selective search method to represent exact object information. These two kinds of features are used to train SVM classifiers. The final result is the combination of the two models. | 2015-04-09 12:18:17 |
Very deep ConvNet features and SVM classifier | VERY_DEEP_CONVNET_16_19_SVM | Visual Geometry Group, University of Oxford | Karen Simonyan, Andrew Zisserman | The results were obtained using multi-scale convolutional features and an SVM classifier. The features were computed using two very deep convolutional networks (16 and 19 weight layers), pre-trained on ILSVRC-2012 (1000-class subset of ImageNet). Fine-tuning on VOC-2012 was not performed. The details can be found in our paper: "Very Deep Convolutional Networks for Large-Scale Image Recognition" (http://arxiv.org/pdf/1409.1556). | 2014-11-16 15:51:49 |
Very deep ConvNet features and SVM classifier | VERY_DEEP_CONVNET_16_SVM | Visual Geometry Group, University of Oxford | Karen Simonyan, Andrew Zisserman | The results were obtained using multi-scale convolutional features and an SVM classifier. The features were computed using a very deep convolutional network (16 weight layers), pre-trained on ILSVRC-2012 (1000-class subset of ImageNet). Fine-tuning on VOC-2012 was not performed. The details can be found in our paper: "Very Deep Convolutional Networks for Large-Scale Image Recognition" (http://arxiv.org/pdf/1409.1556). | 2014-11-17 16:30:06 |
Very deep ConvNet features and SVM classifier | VERY_DEEP_CONVNET_19_SVM | Visual Geometry Group, University of Oxford | Karen Simonyan, Andrew Zisserman | The results were obtained using multi-scale convolutional features and an SVM classifier. The features were computed using a very deep convolutional network (19 weight layers), pre-trained on ILSVRC-2012 (1000-class subset of ImageNet). Fine-tuning on VOC-2012 was not performed. The details can be found in our paper: "Very Deep Convolutional Networks for Large-Scale Image Recognition" (http://arxiv.org/pdf/1409.1556). | 2014-11-17 16:15:32 |
finetune | inceptionv4_svm | seu | wangyin4 | no | 2017-12-21 07:28:57 |
svm with v4 | new_label_2 | seu | wangyin | some new label | 2018-01-13 10:16:24 |
svm with v4 finetune | new_label_4 | seu | wangyin | 4 new label | 2018-01-13 11:18:14 |
svm with v4 finetune part | new_label_6 | seu | wangying | same as above | 2018-01-13 11:43:50 |
svm with v4 finetune part small | new_label_8 | seu | wangyin, zhangyu | same as above | 2018-01-13 13:20:12 |
Classification with additional text feature | LIRIS_CLSTEXT | LIRIS, Ecole Centrale de Lyon, CNRS, UMR5205, France | Chao Zhu, Yuxing Tang, Ningning Liu, Charles-Edmond Bichot, Emmanuel Dellandrea, Liming Chen | In this submission, we try to use additional text information to help with object classification. We propose novel text features [1] based on semantic distance using WordNet. The basic idea is to calculate the semantic distance between the text associated with an image and an emotional dictionary using path similarity, which denotes how similar two word senses are based on the shortest path that connects the senses in a taxonomy (a minimal sketch of this measure appears after this table). As there are no tags included in the Pascal 2011 dataset, we downloaded 1 million Flickr images (including their tags) as the additional textual source. First, for each Pascal image, we find its most similar images (top 20) in the database using a KNN method based on visual features (LBP and color HSV histograms), and then use these tags to extract the text feature. We use an SVM with an RBF kernel to train the classifier and predict the outputs. For classification based on visual features, we follow the same method described in our other submission. The outputs of the visual-feature-based method and the text-feature-based method are then linearly combined as the final results. [1] N. Liu, Y. Zhang, E. Dellandréa, B. Tellez, L. Chen: 'Associating text features with visual ones to improve affective image classification', International Conference on Affective Computing (ACII), Memphis, USA, 2011. | 2011-10-13 21:20:50 |
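
The "BCE loss with transfer learning" entry trains one independent binary cross-entropy objective per VOC class on top of a pretrained backbone. A minimal PyTorch sketch, assuming a recent torchvision and a hypothetical ResNet-50 backbone (the entry does not specify the ResNet variant or the details of its three training phases):

```python
import torch
import torch.nn as nn
from torchvision import models

NUM_CLASSES = 20  # the PASCAL VOC object classes

# Pretrained backbone with the classifier head replaced by 20 logits.
model = models.resnet50(weights="IMAGENET1K_V1")
model.fc = nn.Linear(model.fc.in_features, NUM_CLASSES)

criterion = nn.BCEWithLogitsLoss()  # independent binary loss per class
optimizer = torch.optim.SGD(model.parameters(), lr=1e-3, momentum=0.9)

def train_step(images, targets):
    """images: (B, 3, H, W); targets: float tensor (B, 20) of 0/1 labels."""
    logits = model(images)
    loss = criterion(logits, targets)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```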
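
The CNN-S-TUNE-RNK entry fine-tunes with a ranking hinge loss; the exact formulation is in the cited BMVC 2014 paper. One common pairwise form, sketched here as an assumption, requires every positive class of an image to outscore every negative class by a margin:

```python
import torch

def ranking_hinge_loss(scores, labels, margin=1.0):
    """scores: (B, C) class scores; labels: float (B, C) of 0/1.
    Penalizes any negative class scored within `margin` of a positive class."""
    pos = scores.unsqueeze(2)  # (B, C, 1): candidate positive-class scores
    neg = scores.unsqueeze(1)  # (B, 1, C): candidate negative-class scores
    # pair_mask[b, i, j] = 1 iff class i is positive and class j is negative.
    pair_mask = labels.unsqueeze(2) * (1 - labels).unsqueeze(1)
    hinge = torch.clamp(margin - pos + neg, min=0)
    return (hinge * pair_mask).sum() / pair_mask.sum().clamp(min=1)
```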
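
The From_MOD_To_MLT entry turns detector output into image-level labels by keeping, for each class, the maximum classification score over all predicted boxes. A minimal sketch; the (class_id, score) input format is an illustrative assumption:

```python
import numpy as np

def boxes_to_multilabel(box_scores, num_classes=20):
    """box_scores: iterable of (class_id, score) pairs from a detector.
    Returns per-class image scores: the max detection score per class."""
    image_scores = np.zeros(num_classes)
    for cls, score in box_scores:
        image_scores[cls] = max(image_scores[cls], score)
    return image_scores
```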
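
The ITI_FK_FLICKR_GRAYSIFT_ENTROPY entry trains SVMs with the Hellinger kernel by square-rooting and l2-normalizing the Fisher vectors. A minimal sketch of that explicit feature map: a linear SVM on the mapped features behaves like a Hellinger-kernel SVM on the originals, and the signed square root handles the negative components of Fisher vectors.

```python
import numpy as np

def hellinger_map(x, eps=1e-12):
    """Signed square root followed by row-wise l2 normalization.
    x: (num_samples, dim) feature matrix."""
    x = np.sign(x) * np.sqrt(np.abs(x))
    norms = np.linalg.norm(x, axis=1, keepdims=True)
    return x / np.maximum(norms, eps)
```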
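
The MSDA+FC entry aggregates 1000-d softmax outputs from object proposals into five size-dependent groups, yielding a 5000-d image descriptor. A minimal sketch; the equal-width binning of regions by relative area and the max pooling within each bin are assumptions, since the entry does not spell out the grouping rule:

```python
import numpy as np

def msda_feature(region_probs, region_areas, num_scales=5):
    """region_probs: (num_regions, 1000) softmax outputs for one image's
    proposals; region_areas: relative area of each region, in (0, 1].
    Regions are binned into num_scales size groups and pooled per bin."""
    bins = np.minimum((region_areas * num_scales).astype(int), num_scales - 1)
    parts = []
    for s in range(num_scales):
        mask = bins == s
        if mask.any():
            parts.append(region_probs[mask].max(axis=0))  # max pool within bin
        else:
            parts.append(np.zeros(region_probs.shape[1]))
    return np.concatenate(parts)  # (num_scales * 1000,) descriptor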
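
The NUS-HCP entries feed each object-segment hypothesis through a shared CNN and max-pool the per-class outputs into one multi-label prediction. A minimal PyTorch sketch of that pooling step, assuming the shared CNN already maps a batch of cropped hypotheses to per-class scores:

```python
import torch

def hcp_pool(shared_cnn, hypotheses):
    """hypotheses: (num_hypotheses, 3, H, W) crops from one image.
    Returns one score per class, max-pooled across hypotheses."""
    scores = shared_cnn(hypotheses)      # (num_hypotheses, num_classes)
    image_scores, _ = scores.max(dim=0)  # max pooling across hypotheses
    return image_scores
```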
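
The LIRIS_CLSTEXT entry measures tag-to-concept semantic distance with WordNet path similarity. A minimal NLTK sketch; comparing only the first sense of each word is a simplifying assumption, and the wordnet corpus must be downloaded beforehand via nltk.download('wordnet'):

```python
from nltk.corpus import wordnet as wn

def tag_concept_similarity(tag, concept):
    """Path similarity between the first WordNet senses of two words;
    returns 0.0 when either word is missing from WordNet."""
    syns_tag, syns_concept = wn.synsets(tag), wn.synsets(concept)
    if not syns_tag or not syns_concept:
        return 0.0
    sim = syns_tag[0].path_similarity(syns_concept[0])
    return sim if sim is not None else 0.0
```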