Classification Results: VOC2012 BETA

Competition "comp1" (train on VOC2012 data)

This leaderboard shows only those submissions that have been marked as public, so the displayed rankings should not be considered definitive.

Average Precision (AP %)

submission | mean | aeroplane | bicycle | bird | boat | bottle | bus | car | cat | chair | cow | diningtable | dog | horse | motorbike | person | pottedplant | sheep | sofa | train | tv/monitor | submission date
DID | 96.8 | 99.9 | 98.2 | 99.0 | 98.8 | 89.2 | 98.1 | 96.9 | 99.3 | 92.3 | 99.3 | 90.5 | 99.3 | 99.7 | 98.8 | 99.2 | 89.3 | 99.7 | 92.1 | 99.7 | 96.7 | 07-Sep-2021
CSAC-Net V1 | 89.4 | 96.4 | 92.0 | 94.0 | 92.2 | 77.1 | 90.5 | 87.3 | 95.2 | 84.0 | 91.3 | 78.1 | 94.1 | 94.6 | 93.4 | 95.2 | 74.9 | 91.9 | 84.2 | 94.3 | 86.4 | 09-Feb-2020
SRN+ | 88.8 | 98.2 | 89.6 | 92.9 | 92.3 | 69.3 | 93.0 | 89.8 | 95.9 | 80.2 | 87.8 | 81.2 | 94.1 | 95.2 | 94.0 | 97.0 | 71.8 | 90.4 | 80.3 | 96.5 | 87.6 | 02-Jul-2018
SFA_NET | 87.5 | 95.2 | 89.7 | 92.1 | 90.1 | 75.0 | 88.9 | 84.7 | 93.8 | 83.4 | 90.9 | 78.3 | 93.0 | 93.4 | 90.3 | 92.7 | 72.1 | 89.8 | 83.0 | 92.1 | 82.3 | 22-May-2018
SE | 86.5 | 98.3 | 86.4 | 92.7 | 92.0 | 67.1 | 90.9 | 84.6 | 95.6 | 75.9 | 84.5 | 82.1 | 94.3 | 93.1 | 92.6 | 96.1 | 62.5 | 88.0 | 71.9 | 96.3 | 85.1 | 19-Oct-2016
LIG_DCNN_FEAT_ALL | 85.4 | 98.6 | 86.0 | 93.4 | 92.2 | 65.4 | 91.0 | 83.6 | 95.5 | 73.4 | 82.1 | 79.6 | 94.7 | 92.9 | 92.1 | 95.0 | 59.4 | 87.4 | 67.8 | 96.0 | 82.7 | 08-Sep-2015
S&P_OverFeast_Fast_Bayes | 82.8 | 97.1 | 82.3 | 91.2 | 89.4 | 61.2 | 87.8 | 80.4 | 94.0 | 70.7 | 77.9 | 75.7 | 92.5 | 89.1 | 89.6 | 95.0 | 56.0 | 83.2 | 67.4 | 93.9 | 82.1 | 20-Nov-2014
NUSPSL_CTX_GPM_SCM | 82.2 | 97.3 | 84.2 | 80.8 | 85.3 | 60.8 | 89.9 | 86.8 | 89.3 | 75.4 | 77.8 | 75.1 | 83.0 | 87.5 | 90.1 | 95.0 | 57.8 | 79.2 | 73.4 | 94.5 | 80.7 | 30-Oct-2014
BCE_loss | 82.1 | 97.1 | 81.7 | 94.8 | 85.8 | 67.8 | 86.7 | 83.9 | 95.4 | 64.2 | 82.1 | 64.8 | 92.7 | 88.3 | 86.7 | 90.0 | 55.5 | 88.2 | 62.3 | 92.6 | 82.3 | 20-Jan-2018
Resnet | 80.7 | 98.4 | 81.1 | 92.9 | 88.7 | 57.1 | 87.5 | 73.3 | 96.7 | 63.4 | 90.1 | 64.0 | 94.4 | 95.1 | 93.0 | 76.8 | 43.8 | 93.0 | 67.3 | 93.1 | 65.2 | 25-Apr-2017
CNN_SIGMOID | 79.7 | 96.3 | 83.0 | 88.5 | 84.6 | 56.5 | 88.3 | 82.1 | 91.9 | 69.4 | 68.8 | 71.3 | 88.1 | 83.2 | 88.6 | 93.9 | 50.1 | 72.6 | 63.1 | 93.1 | 79.6 | 14-Jun-2017
NUSPSL_CTX_GPM | 78.6 | 95.5 | 81.1 | 79.4 | 82.5 | 58.2 | 87.7 | 84.1 | 83.1 | 68.5 | 72.8 | 68.5 | 76.4 | 83.3 | 87.5 | 92.8 | 56.5 | 77.8 | 67.0 | 91.2 | 77.6 | 13-Oct-2011
Semi-Semantic Visual Words & Partial Least Squares | 78.3 | 94.5 | 82.6 | 79.4 | 80.7 | 57.8 | 87.8 | 85.5 | 83.9 | 66.6 | 74.2 | 69.4 | 75.2 | 83.0 | 88.2 | 93.6 | 56.2 | 75.6 | 64.1 | 90.0 | 76.6 | 13-Oct-2011
NLPR_PLS_SSVW | 78.3 | 94.5 | 82.6 | 79.4 | 80.7 | 57.8 | 87.8 | 85.5 | 83.9 | 66.6 | 74.2 | 69.4 | 75.2 | 83.0 | 88.2 | 93.6 | 56.2 | 75.6 | 64.1 | 90.0 | 76.6 | 13-Oct-2011
NUS_Context_SVM | 78.3 | 95.3 | 81.5 | 78.9 | 81.8 | 57.5 | 87.3 | 83.7 | 82.3 | 68.4 | 75.0 | 68.5 | 75.8 | 82.9 | 86.7 | 92.7 | 56.8 | 77.7 | 66.1 | 90.7 | 77.1 | 05-Oct-2011
Bayes_Ridge_CNN | 77.0 | 95.0 | 75.6 | 87.9 | 84.1 | 53.1 | 83.6 | 76.1 | 87.9 | 66.1 | 64.8 | 71.5 | 86.8 | 81.1 | 85.1 | 93.4 | 50.9 | 73.7 | 56.4 | 89.7 | 76.4 | 18-Nov-2014
NUSPSL_CTX_GPM_SVM | 76.7 | 94.3 | 78.5 | 76.4 | 80.0 | 57.0 | 86.3 | 82.1 | 81.5 | 65.6 | 74.7 | 66.5 | 73.4 | 81.9 | 85.4 | 91.9 | 53.2 | 74.0 | 65.1 | 89.5 | 76.1 | 13-Oct-2011
Bayes_Ridge_Deep | 74.7 | 93.4 | 73.5 | 85.6 | 80.8 | 48.9 | 82.5 | 73.8 | 86.3 | 64.1 | 62.4 | 68.7 | 85.0 | 78.4 | 83.1 | 92.8 | 48.4 | 70.7 | 54.9 | 87.7 | 74.0 | 22-Sep-2014
CVC_UVA_UNITN | 74.3 | 92.0 | 74.2 | 73.0 | 77.5 | 54.3 | 85.2 | 81.9 | 76.4 | 65.2 | 63.2 | 68.5 | 68.9 | 78.2 | 81.0 | 91.6 | 55.9 | 69.4 | 65.4 | 86.7 | 77.4 | 23-Sep-2012
UvA_UNITN_MostTellingMonkey | 73.4 | 90.1 | 74.1 | 66.6 | 76.0 | 57.0 | 85.6 | 81.2 | 74.5 | 63.5 | 62.7 | 64.5 | 66.6 | 76.5 | 81.3 | 90.8 | 58.7 | 69.5 | 66.3 | 84.7 | 77.3 | 13-Oct-2011
CNNsSVM | 72.2 | 94.6 | 72.4 | 86.0 | 82.8 | 41.7 | 82.6 | 68.8 | 86.6 | 53.4 | 64.6 | 60.1 | 82.3 | 80.5 | 81.6 | 87.4 | 35.4 | 72.5 | 49.7 | 90.1 | 70.9 | 27-Jan-2015
CVC_CLS | 71.0 | 89.3 | 70.9 | 69.8 | 73.9 | 51.3 | 84.8 | 79.6 | 72.9 | 63.8 | 59.4 | 64.1 | 64.7 | 75.5 | 79.2 | 91.4 | 42.7 | 63.2 | 61.9 | 86.7 | 73.8 | 23-Sep-2012
MSRA_USTC_HIGH_ORDER_SVM | 70.5 | 92.8 | 74.8 | 69.6 | 76.1 | 47.3 | 83.5 | 76.4 | 76.9 | 59.8 | 54.5 | 63.5 | 67.0 | 75.1 | 78.8 | 90.4 | 43.2 | 63.3 | 60.4 | 85.6 | 71.2 | 13-Oct-2011
MSRA_USTC_PATCH | 70.2 | 92.7 | 74.5 | 69.4 | 75.4 | 45.7 | 83.4 | 76.5 | 76.6 | 59.6 | 54.5 | 63.4 | 67.4 | 74.8 | 78.6 | 90.3 | 43.0 | 63.3 | 58.6 | 85.2 | 71.4 | 12-Oct-2011
ITI_FK_FUSED_GRAY-RGB-HSV-OP-SIFT | 67.1 | 90.4 | 65.4 | 65.8 | 72.3 | 37.7 | 80.6 | 70.5 | 72.4 | 60.3 | 55.1 | 61.4 | 63.6 | 72.4 | 77.4 | 86.8 | 37.7 | 61.1 | 57.2 | 85.9 | 68.7 | 22-Sep-2012
LIRIS_CLSDET | 66.8 | 90.0 | 66.2 | 63.3 | 70.9 | 47.0 | 80.9 | 73.9 | 63.9 | 61.2 | 52.7 | 57.9 | 56.9 | 69.6 | 73.9 | 88.4 | 46.3 | 65.5 | 54.2 | 81.3 | 72.8 | 13-Oct-2011
ITI_FK_BS_GRAYSIFT | 63.2 | 89.1 | 62.3 | 60.0 | 68.1 | 33.4 | 79.8 | 66.9 | 70.3 | 57.4 | 51.0 | 55.0 | 59.3 | 68.6 | 74.5 | 83.1 | 25.6 | 57.2 | 53.8 | 83.4 | 64.9 | 22-Sep-2012
BPACAD_COMB_LF_AK_WK | 61.4 | 86.5 | 58.3 | 59.7 | 67.4 | 33.2 | 74.2 | 64.0 | 65.5 | 58.5 | 44.8 | 53.5 | 57.0 | 60.7 | 70.9 | 84.6 | 39.4 | 55.7 | 50.5 | 80.7 | 63.2 | 13-Oct-2011
NLPR_IVA_SVM_BOWDect_Convolution | 61.1 | 83.8 | 69.8 | 47.8 | 60.6 | 45.4 | 80.5 | 74.6 | 60.4 | 54.0 | 51.3 | 45.3 | 51.5 | 64.5 | 72.7 | 87.7 | 35.9 | 57.9 | 39.8 | 75.8 | 62.7 | 13-Oct-2011
LIRIS_CLS | 61.0 | 88.3 | 56.3 | 59.3 | 68.6 | 33.2 | 76.6 | 62.2 | 64.5 | 55.3 | 42.6 | 55.1 | 56.2 | 62.0 | 70.1 | 82.5 | 37.3 | 56.7 | 48.3 | 79.6 | 64.8 | 13-Oct-2011
BPACAD_CS_FISH256_LF | 60.5 | 87.1 | 58.0 | 60.0 | 66.5 | 31.5 | 75.7 | 62.1 | 63.4 | 57.1 | 45.4 | 50.6 | 55.8 | 58.4 | 71.1 | 84.0 | 36.6 | 54.0 | 50.8 | 79.3 | 61.8 | 13-Oct-2011
SIFT-LLC-PCAPOOL-DET-SVM | 60.4 | 85.6 | 66.5 | 51.9 | 60.3 | 45.4 | 76.9 | 70.3 | 65.1 | 56.4 | 34.3 | 49.6 | 52.5 | 63.1 | 71.6 | 86.8 | 26.1 | 57.2 | 47.9 | 75.5 | 65.6 | 13-Oct-2011
NLPR_IVA_SVM_BOWDect | 60.3 | 82.9 | 69.4 | 45.4 | 60.1 | 46.0 | 80.0 | 75.1 | 59.9 | 54.9 | 50.7 | 43.3 | 50.0 | 63.4 | 72.3 | 88.1 | 36.1 | 57.3 | 37.7 | 75.2 | 58.5 | 13-Oct-2011
BPACAD_CS_FISH256-1024_SVM_AVGKER | 59.8 | 85.0 | 57.0 | 57.7 | 65.9 | 30.7 | 75.0 | 62.4 | 64.4 | 56.9 | 42.2 | 50.9 | 55.3 | 59.1 | 69.2 | 84.2 | 39.3 | 52.6 | 46.7 | 78.9 | 61.9 | 13-Oct-2011
BPACAD_CS_FISH256-1024_SVM_WEKER | 58.2 | 85.2 | 55.9 | 58.6 | 65.3 | 33.0 | 33.8 | 62.6 | 65.0 | 57.9 | 41.7 | 51.6 | 56.0 | 60.3 | 69.6 | 84.2 | 38.1 | 55.5 | 47.9 | 79.9 | 62.0 | 13-Oct-2011
SIFT-LLC-PCAPOOL-SVM | 54.9 | 83.2 | 52.5 | 49.3 | 59.7 | 26.0 | 73.5 | 58.2 | 64.4 | 52.1 | 36.6 | 44.9 | 52.1 | 57.8 | 63.8 | 78.1 | 19.2 | 53.1 | 44.1 | 72.0 | 57.4 | 13-Oct-2011
JDL_K17_AVG_CLS | 54.8 | 84.2 | 52.0 | 54.5 | 63.2 | 25.3 | 71.2 | 58.0 | 61.1 | 50.2 | 33.3 | 44.3 | 49.7 | 57.9 | 65.2 | 79.9 | 20.9 | 47.8 | 43.0 | 77.7 | 56.9 | 13-Oct-2011
FastScSPM-KDES | 45.2 | 78.1 | 48.6 | 38.8 | 45.7 | 15.8 | 70.7 | 48.5 | 51.0 | 43.3 | 25.3 | 35.6 | 38.0 | 45.5 | 55.1 | 68.4 | 12.9 | 35.3 | 29.4 | 68.4 | 49.0 | 13-Oct-2011
ComplexLogNormal_LogFoveal_PhaseInvariance | 36.4 | 73.2 | 33.4 | 31.0 | 44.7 | 17.0 | 57.7 | 34.4 | 45.9 | 41.2 | 18.1 | 30.2 | 34.3 | 23.1 | 39.3 | 57.3 | 11.9 | 23.1 | 25.3 | 51.2 | 36.2 | 23-Sep-2012
JDL_K17_HOK2_CLS | 34.7 | 83.5 | 31.1 | 18.9 | 46.2 | 11.7 | 70.7 | 26.4 | 31.7 | 24.7 | 8.7 | 28.7 | 22.8 | 34.3 | 40.4 | 51.8 | 8.7 | 19.2 | 15.1 | 63.4 | 55.0 | 13-Oct-2011
DMC-HIK-SVM-SIFT | 32.2 | 55.6 | 25.5 | 31.1 | 36.5 | 15.8 | 41.4 | 40.0 | 40.6 | 30.0 | 17.8 | 21.1 | 34.0 | 27.0 | 31.1 | 57.9 | 11.9 | 20.8 | 22.5 | 48.4 | 35.7 | 13-Oct-2011
nopatch mthod | 28.6 | 65.1 | 23.9 | 17.3 | 36.0 | 12.6 | 40.5 | 31.1 | 35.4 | 27.2 | 10.4 | 20.8 | 31.3 | 13.6 | 29.9 | 55.0 | 10.7 | 19.2 | 19.2 | 42.1 | 30.9 | 02-Oct-2011
max 4 method | 25.0 | 65.1 | 23.9 | 12.9 | 36.0 | 8.5 | 33.4 | 20.5 | 35.4 | 21.2 | 5.8 | 20.8 | 31.3 | 6.9 | 30.3 | 55.0 | 10.7 | 19.2 | 10.8 | 21.7 | 30.7 | 02-Oct-2011
combining methods | 19.8 | 61.5 | 11.9 | 12.4 | 29.7 | 8.7 | 30.6 | 18.4 | 23.6 | 21.6 | 5.8 | 14.8 | 18.5 | 7.1 | 12.6 | 47.8 | 7.2 | 15.0 | 9.8 | 18.8 | 19.4 | 02-Oct-2011
NLPR_KF_SVM | 10.6 | 10.5 | 9.1 | 10.7 | 6.0 | 6.5 | 7.2 | 13.3 | 12.2 | 11.5 | 9.5 | 5.6 | 16.7 | 8.6 | 6.6 | 38.9 | 5.3 | 15.0 | 5.0 | 8.3 | 5.4 | 07-Sep-2011
MAVEN_SCENE_NEW | - | 81.8 | 42.1 | 45.7 | 55.7 | - | 70.8 | 53.6 | 55.9 | 43.0 | 34.6 | 39.2 | 44.0 | 51.4 | 56.7 | 64.7 | 16.5 | 40.6 | 37.6 | 68.8 | 46.7 | 11-Oct-2014
Ensemble of ensemble | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | 88.7 | - | - | - | - | - | 19-Sep-2012
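
For readers who want to reproduce the numbers, the sketch below shows how a VOC-style per-class AP (and hence the mean column) can be computed from classifier confidences. It is an illustrative reconstruction, not the official development-kit code, which differs in minor details:

```python
import numpy as np

def voc_ap(scores, labels):
    """Per-class average precision, VOC-style (illustrative sketch only).

    scores: (N,) classifier confidences for one class over all test images.
    labels: (N,) binary ground truth (1 = the class is present in the image).
    """
    order = np.argsort(-scores)
    tp = (labels[order] == 1).astype(float)
    fp = 1.0 - tp
    recall = np.cumsum(tp) / max(labels.sum(), 1)
    precision = np.cumsum(tp) / (np.cumsum(tp) + np.cumsum(fp))
    # monotone non-increasing precision envelope, then integrate over recall
    for i in range(len(precision) - 2, -1, -1):
        precision[i] = max(precision[i], precision[i + 1])
    deltas = np.diff(np.concatenate(([0.0], recall)))
    return float((precision * deltas).sum())

# The "mean" column is the mean of voc_ap over the 20 class columns.
```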

Abbreviations

Each entry lists the submission Title, the Method identifier used in the results table, the Affiliation, the Contributors, a Description, and the submission Date.

Title: CNN classifier with BCE loss
Method: BCE_loss
Affiliation: Shanghai Jiao Tong University
Contributors: Lutein
Description: A VGG-19 architecture and BCE loss are used to train a classifier.
Date: 2018-01-20 13:55:13

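A minimal PyTorch sketch of this kind of setup (multi-label classification over the 20 VOC classes with a VGG-19 backbone and binary cross-entropy); the head replacement and training loop are illustrative assumptions, not the submission's code:

```python
import torch
import torch.nn as nn
from torchvision import models

# VGG-19 backbone with a 20-way output head; one sigmoid per VOC class.
model = models.vgg19(weights="IMAGENET1K_V1")
model.classifier[6] = nn.Linear(4096, 20)   # replace the 1000-way ImageNet head
criterion = nn.BCEWithLogitsLoss()          # sigmoid + BCE, numerically stable

def train_step(images, targets, optimizer):
    # targets: (B, 20) multi-hot vectors, one bit per class present in the image
    logits = model(images)
    loss = criterion(logits, targets.float())
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```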
Title: Linear classifier with CNN features
Method: Bayes_Ridge_CNN
Affiliation: Pohang University of Science and Technology
Contributors: Yong-Deok Kim
Description: We trained a linear classifier with squared-error loss, using features from a CNN pretrained on ImageNet data.
Date: 2014-11-18 00:44:47

Title: Bayesian Linear Regression with Deep feature
Method: Bayes_Ridge_Deep
Affiliation: Pohang University of Science and Technology
Contributors: Yong-Deok Kim, Tae-Woong Jang, Seungjin Choi
Description: We used CNN features computed with the Caffe library. Instead of the final-layer output, which is already a prediction over 1000 classes, we used the 7th-layer output (4096 dimensions). We chose a linear regression model because its optimal regularization parameter can be selected from the perspective of empirical Bayes. We developed a computationally very efficient method for maximizing the marginalized log-likelihood of the linear regression model.
Date: 2014-09-22 03:46:26

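The empirical Bayes procedure described here matches what scikit-learn's BayesianRidge does (regularization chosen by maximizing the marginal likelihood). A hedged sketch with placeholder features standing in for the 4096-d layer-7 activations:

```python
import numpy as np
from sklearn.linear_model import BayesianRidge

# Placeholder features; the entry uses 4096-d Caffe layer-7 activations.
rng = np.random.default_rng(0)
X = rng.standard_normal((1000, 512))
y = rng.integers(0, 2, 1000).astype(float)   # one-vs-rest targets for one class

reg = BayesianRidge()    # alpha/lambda chosen by maximizing the marginal likelihood
reg.fit(X, y)
scores = reg.predict(X)  # per-image confidences; rank these to compute AP
```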
Title: CNN Features with Sigmoid Classifier
Method: CNN_SIGMOID
Affiliation: Yonsei University
Contributors: Sunhee Hwang
Description: CNN features (VGG network [1]) from an ImageNet-pretrained model, with a sigmoid classifier trained using sigmoid cross-entropy loss. [1] Very Deep Convolutional Networks for Large-Scale Image Recognition, Simonyan et al., ICLR 2015.
Date: 2017-06-14 07:03:58

Title: SVM on CNN feats
Method: CNNsSVM
Affiliation: University of Verona
Contributors: Giorgio Roffo
Description: test this please in mAP
Date: 2015-01-27 07:40:43

Title: CSAC-Net V1
Method: CSAC-Net V1
Affiliation: Zhejiang University / Cancer Hospital of the University of Chinese Academy of Sciences
Contributors: Jincao Yao
Description: Cross-layer sparse atrous convolution network.
Date: 2020-02-09 13:48:07

Title: BOW with Fisher and color-HOG detection
Method: CVC_CLS
Affiliation: Computer Vision Barcelona
Contributors: Albert Gordo, Camp Davesa, Fahad Khan, Pep Gonfaus, Joost van de Weijer, Rao Muhammad Anwer, Ramon Baldrich, Jordi Gonzalez, Ernest Valveny
Description: Our submission is a combination of a standard bag-of-words pipeline, Fisher vectors, and color-HOG based part-detection models. For bag-of-words, we use SIFT and ColorNames. To combine multiple cues, we use late fusion and color attention [1]. The Fisher representation is based on SIFT features. Finally, we use our color-HOG detector [2], which introduces color information within the part-based detection framework [3]. References: 1. Fahad Shahbaz Khan, Joost van de Weijer, Maria Vanrell. Modulating shape features by color attention for object recognition. IJCV, 98(1):49-64, 2012. 2. Fahad Shahbaz Khan, Rao Muhammad Anwer, Joost van de Weijer, Andrew D. Bagdanov, Maria Vanrell, Antonio M. Lopez. Color Attributes for Object Detection. In CVPR 2012. 3. P. F. Felzenszwalb, R. B. Girshick, D. A. McAllester, D. Ramanan. Object detection with discriminatively trained part-based models. PAMI, 32(9):1627-1645, 2010.
Date: 2012-09-23 17:00:58

Title: Combination of BOW, Fisher, Color detection, SP
Method: CVC_UVA_UNITN
Affiliation: Computer Vision Barcelona, University of Amsterdam, University of Trento
Contributors: Fahad Khan, Jan van Gemert, Camp Davesa, Jasper Uijlings, Albert Gordo, Sezer Karaoglu, Koen van de Sande, Pep Gonfaus, Rao Muhammad Anwer, Joost van de Weijer, Cees Snoek, Ramon Baldrich, Nicu Sebe, Theo Gevers
Description: For bag-of-words, we use SIFT and ColorNames. To combine multiple cues, we use late fusion and color attention [1]. The Fisher representation is based on SIFT features. We use our color-HOG detector [2], which introduces color information within the part-based detection framework [3]. We extend spatial pyramid pooling with a generic functional pooling scheme: pooling can be seen as a crude pre-matching technique which may be based on geometry (SPM) but can be any other grouping function [4]. This technique has been shown to help when pooling is based on saliency [5]. Here we also include pools based on signal-to-noise ratio, interest points, and a pyramid over scale. References: 1. Fahad Shahbaz Khan, Joost van de Weijer, Maria Vanrell. Modulating shape features by color attention for object recognition. IJCV, 98(1):49-64, 2012. 2. Fahad Shahbaz Khan, Rao Muhammad Anwer, Joost van de Weijer, Andrew D. Bagdanov, Maria Vanrell, Antonio M. Lopez. Color Attributes for Object Detection. In CVPR 2012. 3. P. F. Felzenszwalb, R. B. Girshick, D. A. McAllester, D. Ranaman. Object detection with discriminatively trained part-based models. PAMI, 32(9):1627-1645, 2010. 4. J. C. van Gemert. Exploiting Photographic Style for Category-Level Image Classification by Generalizing the Spatial Pyramid. In ICMR, 2011. 5. S. Karaoglu, J. C. van Gemert and Th. Gevers. Object Reading: Text Recognition for Object Recognition. In ECCV-IFCVCR 2012, Oct 2012.
Date: 2012-09-23 17:41:23

Title: Complex LogNormal Scale Space
Method: ComplexLogNormal_LogFoveal_PhaseInvariance
Affiliation: Imperial College London
Contributors: Ioannis Alexiou, Anil A. Bharath
Description: We design and optimize a scale space based on sparse coding optimization in the frequency domain. The scale space is obtained using complex filters with lognormal envelopes spanning angular and radial frequencies. Two basic features are harvested from these filters: the oriented magnitudes and the projected phase of the filter outputs, which are used for sampling keypoints and grid points specifically designed for such filter outputs. We design descriptors comprised of pooling functions that accumulate these outputs. The descriptors are produced by foveally arranged poolers which sample the basic features using (136 and 544) inner products per sampling point; these poolers are obtained using lognormal distributions, this time in the spatial domain. Two basic descriptors of 136 and 544 dimensions are produced for keypoint- and grid-based sampling. These are fed to a k-means module to generate 4000 visual words. Histograms of these words are computed over fixed regions with hard assignment. Another class of histograms is introduced by pairing up words and computing histograms of word pairs, as proposed by Alexiou and Bharath (BMVC 2012). The fixed regions compose a spatial pyramid, where each region is independently learnt by an SVM classifier. A final simple learning stage merges all SVM predictions into a class prediction.
Date: 2012-09-23 20:23:12

Title: DiD
Method: DID
Affiliation: Zhejiang University
Contributors: No
Description: No
Date: 2021-09-07 12:14:30

Title: SVM with different descriptors
Method: Ensemble of ensemble
Affiliation: University of Padova
Contributors: Loris Nanni
Description: We propose a system that incorporates several perturbation approaches and descriptors for a generic computer vision system. The variations we investigate include different global and bag-of-feature descriptors, different clusterings for codebook creation, and different subspace projections for reducing the dimensionality of the descriptors extracted from each region. The base classifier used in our ensembles is the Support Vector Machine, and ensemble decisions are combined by the sum rule.
Date: 2012-09-19 15:09:25

Title: Fisher Encoding Baseline using gray SIFT features
Method: ITI_FK_BS_GRAYSIFT
Affiliation: ITI-CERTH
Contributors: E. Chatzilari, S. Nikolopoulos, Y. Kompatsiaris, J. Kittler
Description: Based on the implementation of "K. Chatfield, V. Lempitsky, A. Vedaldi, A. Zisserman. The devil is in the details: an evaluation of recent feature encoding methods. BMVC 2011", and specifically following the approach of "F. Perronnin, J. Sanchez, and T. Mensink. Improving the Fisher kernel for large-scale image classification. ECCV 2010" for feature encoding based on the Fisher kernel. We use gray-SIFT descriptors reduced with PCA to 80 dimensions and a GMM with 256 components for estimating the probabilistic visual vocabulary. A spatial pyramid with 3 levels (1st: 1x1, 2nd: 2x2, 3rd: 3x1 horizontal) and dense sampling every 3 pixels is employed to define the keypoints. The descriptors are aggregated using Fisher encoding, which produces a 40960-dimensional vector for each of the 8 regions of the spatial pyramid. These vectors are concatenated to produce the final 327680-dimensional representation vector for each image. SVM classifiers are trained using the Hellinger kernel, which amounts to square-rooting the features and then l2-normalizing the result.
Date: 2012-09-22 12:50:01

Title: Late fusion of Gray, RGB, HSV and Op SIFT by avg
Method: ITI_FK_FUSED_GRAY-RGB-HSV-OP-SIFT
Affiliation: ITI-CERTH
Contributors: E. Chatzilari, S. Nikolopoulos, Y. Kompatsiaris, J. Kittler
Description: Based on the implementation of "K. Chatfield, V. Lempitsky, A. Vedaldi, A. Zisserman. The devil is in the details: an evaluation of recent feature encoding methods. BMVC 2011", and specifically following the approach of "F. Perronnin, J. Sanchez, and T. Mensink. Improving the Fisher kernel for large-scale image classification. ECCV 2010" for feature encoding based on the Fisher kernel. We use SIFT descriptors reduced with PCA to 80 dimensions and a GMM with 256 components for estimating the probabilistic visual vocabulary. A spatial pyramid with 3 levels (1st: 1x1, 2nd: 2x2, 3rd: 3x1 horizontal) and dense sampling every 3 pixels is employed to define the keypoints. The descriptors are aggregated using Fisher encoding, which produces a 40960-dimensional vector for each of the 8 regions of the spatial pyramid. These vectors are concatenated to produce the final 327680-dimensional representation vector for each image. SVM classifiers are trained using the Hellinger kernel, which amounts to square-rooting the features and then l2-normalizing the result. Four feature spaces, namely Gray-SIFT, RGB-SIFT, HSV-SIFT and OP-SIFT, are computed, and the final prediction is generated by averaging the predictions of all four (late fusion).
Date: 2012-09-22 12:56:18

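Both ITI-CERTH entries rest on Fisher encoding. Below is a compressed NumPy sketch of the mean-gradient half of a Fisher vector; the entries' full encoding also includes variance gradients plus power- and l2-normalization, and uses K = 256 components on 80-d PCA-SIFT (tiny sizes here, for illustration only):

```python
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)
descr = rng.standard_normal((500, 16))   # stand-in for PCA-reduced SIFT descriptors
gmm = GaussianMixture(n_components=8, covariance_type="diag").fit(descr)

def fisher_means(descriptors, gmm):
    q = gmm.predict_proba(descriptors)                 # (N, K) soft assignments
    diff = (descriptors[:, None, :] - gmm.means_) / np.sqrt(gmm.covariances_)
    fv = (q[:, :, None] * diff).sum(axis=0)            # (K, D) accumulated gradients
    fv /= descriptors.shape[0] * np.sqrt(gmm.weights_)[:, None]
    return fv.ravel()                                  # K*D dims; 2*K*D with variances

fv = fisher_means(descr, gmm)   # encoding for one spatial-pyramid region
```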
Title: KNN/MSVM classifier with several DCNN features
Method: LIG_DCNN_FEAT_ALL
Affiliation: LIG-CNRS
Contributors: Bahjat Safadi, Mateusz Budnik, Georges Quénot
Description: This submission is computed using a combination of KNN and MSVM classifiers applied to an early fusion of three DCNN-based descriptors. The DCNN features were extracted from the fc6 layer (4096 components) of caffe/bvlc_alexnet [1], the pool5 layer (1024 components) of caffe/bvlc_googlenet [2], and the fc8 layer (1000 components) of the 19-layer VGG network [3]. Descriptor optimization was performed separately on each of them using non-linear (power) transformations combined with PCA [4]. Early fusion (concatenation) was then performed after per-descriptor scale normalization, and a final descriptor optimization was applied to the result, yielding a 294-component global descriptor. Classification by KNN and MSVM [5] is finally computed separately and fused by score averaging. [1] ImageNet classification with deep convolutional neural networks, Krizhevsky et al., NIPS 2012. [2] Going Deeper with Convolutions, Szegedy et al., CVPR 2015. [3] Very Deep Convolutional Networks for Large-Scale Image Recognition, Simonyan et al., ICLR 2015. [4] Descriptor optimization for multimedia indexing and retrieval, Safadi et al., MTAP 2015. [5] Evaluations of multi-learners approaches for concepts indexing in video document, Safadi et al., RIAO 2010.
Date: 2015-09-08 09:39:07

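A hedged sketch of the fusion pipeline this entry describes: per-descriptor power transformation and PCA, scale normalization, then concatenation. The component counts and variable names below are illustrative guesses, not the values from [4]:

```python
import numpy as np
from sklearn.decomposition import PCA

def optimize_descriptor(desc, n_components, alpha=0.5):
    desc = np.sign(desc) * np.abs(desc) ** alpha               # power-law transform
    desc = PCA(n_components=n_components).fit_transform(desc)
    return desc / np.linalg.norm(desc, axis=1, keepdims=True)  # scale normalization

rng = np.random.default_rng(0)
n = 500
fc6_alexnet = rng.standard_normal((n, 4096))      # placeholder DCNN features
pool5_googlenet = rng.standard_normal((n, 1024))
fc8_vgg19 = rng.standard_normal((n, 1000))

fused = np.hstack([
    optimize_descriptor(fc6_alexnet, 128),
    optimize_descriptor(pool5_googlenet, 96),
    optimize_descriptor(fc8_vgg19, 70),
])   # early fusion; the entry applies one more optimization pass afterwards
```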
Title: SVM with HOG features
Method: MAVEN_SCENE_NEW
Affiliation: University of Cagliari
Contributors: Roberto Tronci, Luca Piras, Mauro Mereu, Davide Ariu
Description: We used HOG to extract the interest points, then built a vocabulary of 300 visual words by clustering.
Date: 2014-10-11 17:32:46

Title: Subcategory-aware Object Classification
Method: NUSPSL_CTX_GPM_SCM
Affiliation: National University of Singapore; Panasonic Singapore Laboratories; Sun Yat-sen University
Contributors: NUS: Dong Jian, Chen Qiang, Song Zheng, Pan Yan, Xia Wei, Yan Shuicheng; PSL: Hua Yang, Huang Zhongyang, Shen Shengmei
Description: The new solution is motivated by the observation of considerable within-class diversity in the PASCAL VOC dataset. For example, the chair category includes two obvious sub-classes, namely sofa-like chairs and rigid-material chairs. In feature space these two sub-classes are essentially far apart, and it is intuitively beneficial to model them independently. The proposed solution contributes in the following aspects: 1) inhomogeneous-similarity-aware sub-class mining (SCM), 2) sub-class-aware object detection and classification, and 3) sub-class-aware kernel mapping for late fusion. The whole solution is also founded on several valuable components from the NUS-PSL team's VOC 2011 entry: 1) traditional SPM and the novel Generalized Hierarchical Matching (GPM) scheme are used to generate image representations; 2) contextualized object detection and classification. Considerable improvement over our PASCAL VOC 2011 solution has been achieved, as shown in our offline train-vs-validation experiments [1]. [1] Jian Dong, Wei Xia, Qiang Chen, Jianshi Feng, Zhongyang Huang, Shuicheng Yan. Subcategory-aware Object Classification, CVPR 2013.
Date: 2014-10-30 15:06:19

Title: Resnet 34
Method: Resnet
Affiliation: Shanghai Jiao Tong University
Contributors: Jiangchao Yao
Description: We use ResNet-34 as a baseline method on PASCAL VOC 2012.
Date: 2017-04-25 02:33:58

Title: Linear classifier with least square loss
Method: S&P_OverFeast_Fast_Bayes
Affiliation: Pohang University of Science and Technology
Contributors: Kim Yong-Deok, Jang Tae-Woong
Description: Feature: fc7 from the OverFeat CNN trained on ILSVRC 2012. Classifier: linear classifier with least-squares loss.
Date: 2014-11-20 05:37:22

Title: SE
Method: SE
Affiliation: Middle East Technical University
Contributors: Sadegh Eskandari, Emre Akbas
Description: We apply our own feature selection algorithm on CNN features from LeNet, ResNet and VeryDeepCNN. Then we run linear SVM (liblinear) and simply add the SVM scores coming from each feature set. The feature selection algorithm is not published yet.
Date: 2016-10-19 13:24:36

Title: Sparse-feature-aware net
Method: SFA_NET
Affiliation: Zhejiang University
Contributors: Jincao Yao
Description: We combined our sparse-feature-aware (SFA) method with a CNN. Instead of classifying the image using only the global image feature, we also select a batch of ROIs as local anchors to supervise and double-check the classification. The model then performs the classification task according to both the sparse local features found by the SFA and the overall output of the network.
Date: 2018-05-22 10:56:56

Title: SRN+
Method: SRN+
Affiliation: Nankai University
Contributors: Yuhuan Wu, Yun Liu
Description: SRN+
Date: 2018-07-02 11:12:34

Title: Combination of the late fusion, avgker and weker
Method: BPACAD_COMB_LF_AK_WK
Affiliation: Data Mining and Web Search Research Group (DMWS), MTA SZTAKI, Hungary
Contributors: Bálint Daróczy, László Nikházy
Description: This is the average of the confidence outputs of a late fusion method, an aggregated kernel method, and an averaged kernel method (BPACAD_CS_FISH256-1024_SVM_AVGKER). We computed RGB color moments and SIFT descriptors (Lowe 1999) on a dense grid and on regions detected by a Harris-Laplacian detector (Mikolajczyk et al., 2005). All three methods are based on non-hierarchical Gaussian Mixture Models (GMMs) with 256 Gaussians (two of them also using GMMs with 1024 Gaussians) and non-sparse Fisher vectors (Perronnin et al., 2007). We used our open-source GMM/Fisher toolkit for GPGPUs to compute the Fisher vectors and train the GMMs (http://datamining.sztaki.hu/?q=en/GPU-GMM). All of them use Fisher-vector-based precomputed kernels (basic kernels) for learning linear SVM classifiers (Daróczy et al., ImageCLEF 2011). The late fusion method is based on a combination of SVM predictions (18 SVM classifiers per class), while the aggregated and averaged kernels are computed before classification (only one SVM classifier per class). The averaged kernel method needs no parameter tuning; for the late fusion and aggregated kernel methods we learned optimal weights per class on the validation set.
Date: 2011-10-13 21:55:55

Title: SVM on averaged Fisher Kernels
Method: BPACAD_CS_FISH256-1024_SVM_AVGKER
Affiliation: Data Mining and Web Search Research Group (DMWS), MTA SZTAKI, Hungary
Contributors: Bálint Daróczy, László Nikházy
Description: We computed RGB color moments and SIFT descriptors (Lowe 1999) on a dense grid and on regions detected by a Harris-Laplacian detector (Mikolajczyk et al., 2005). We trained non-hierarchical Gaussian Mixture Models (with 256 and 1024 Gaussians) on a subset (1 million) of the low-level features extracted from the training images. We extracted non-sparse Fisher vectors on nine different poolings with the 256-Gaussian GMM (dense grid, Harris-Laplacian, 3x1 and 2x2 spatial pyramids (Lazebnik et al., 2006)) and four with the 1024-Gaussian GMM (dense, 3x1). We used our open-source GMM/Fisher toolkit for GPGPUs to compute the Fisher vectors and train the GMMs (http://datamining.sztaki.hu/?q=en/GPU-GMM). We calculated precomputed kernels (Daróczy et al., ImageCLEF 2011) and averaged them. We trained only one binary SVM classifier per class.
Date: 2011-10-13 20:58:15

Title: SVM on weighted Fisher kernels
Method: BPACAD_CS_FISH256-1024_SVM_WEKER
Affiliation: Data Mining and Web Search Research Group (DMWS), MTA SZTAKI, Hungary
Contributors: Bálint Daróczy, László Nikházy
Description: We computed RGB color moments and SIFT descriptors (Lowe 1999) on a dense grid and on regions detected by a Harris-Laplacian detector (Mikolajczyk et al., 2005). We trained non-hierarchical Gaussian Mixture Models (with 256 and 1024 Gaussians) on a subset (1 million) of the low-level features extracted from the training images. We extracted non-sparse Fisher vectors (Perronnin et al., 2007) on nine different poolings with the 256-Gaussian GMM (dense grid, Harris-Laplacian, 3x1 and 2x2 spatial pyramids (Lazebnik et al., 2006)) and four with the 1024-Gaussian GMM (dense, 3x1). We used precomputed kernels (Daróczy et al., ImageCLEF 2011) and aggregated them per class with different weights, learned per class on the validation set. We trained only one binary SVM classifier per class.
Date: 2011-10-13 21:33:27

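The three BPACAD entries share the same precomputed-kernel machinery. Below is a sketch with random stand-in Gram matrices (the real ones come from Fisher vectors): the averaged kernel needs no tuning, while the weighted variant uses per-class weights learned on validation data:

```python
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)
feats = [rng.standard_normal((100, 32)) for _ in range(13)]   # 9 + 4 poolings
kernels = [f @ f.T for f in feats]   # stand-ins for Fisher-vector Gram matrices
y = rng.integers(0, 2, 100)          # one-vs-rest labels for one class

K_avg = np.mean(kernels, axis=0)               # averaged kernel: no tuning needed
w = rng.dirichlet(np.ones(len(kernels)))       # per-class weights (learned on val)
K_wtd = np.tensordot(w, kernels, axes=1)       # weighted/aggregated kernel

clf = SVC(kernel="precomputed").fit(K_avg, y)  # one binary SVM per class
```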
Title: SVM classifier late fusion with Fisher vectors
Method: BPACAD_CS_FISH256_LF
Affiliation: Data Mining and Web Search Research Group (DMWS), MTA SZTAKI, Hungary
Contributors: Bálint Daróczy, László Nikházy
Description: We computed RGB color moments and SIFT descriptors (Lowe 1999) on a dense grid and on regions detected by a Harris-Laplacian detector (Mikolajczyk et al., 2005). We trained non-hierarchical Gaussian Mixture Models (with 256 Gaussians) on a subset of the low-level features extracted from the training images. We calculated non-sparse Fisher vectors (Perronnin et al., 2007) on nine different poolings per low-level descriptor (dense grid, Harris-Laplacian, 3x1 and 2x2 spatial pyramids (Lazebnik et al., 2006)). We used precomputed kernels (Daróczy et al., ImageCLEF 2011). We trained one-vs-all SVM classifiers for each basic kernel (2x9 in total per class) and combined the test predictions linearly, with optimal weights per class learned on the validation set.
Date: 2011-10-13 20:13:05

Title: HIK based SVM classifier with dense SIFT features
Method: DMC-HIK-SVM-SIFT
Affiliation: The University of Nanjing
Contributors: Yubin Yang, Ye Tang, Lingyan Pan
Description: We adopt a bag-of-visual-words method (cf. Csurka et al., 2004). A single descriptor type, SIFT (Lowe 2004), is extracted from 16x16-pixel patches densely sampled from each image on a grid with a step size of 12 pixels. We partition the original training and validation data into categories according to label, then randomly select 200 images per category (2000 images in total) as the training set. We use a novel difference-maximize coding approach to quantize these descriptors into 200 "visual words"; each image is then represented by a histogram of visual words. Spatial pyramid matching (Lazebnik et al., CVPR 2006) is also used. Finally, we train a HIK-kernel SVM classifier (Jianxin Wu et al., ICCV 2009) on the concatenated pyramid feature vector of each training image.
Date: 2011-10-13 15:36:46

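The histogram intersection kernel (HIK) at the core of this entry is easy to state in code. A small sketch with placeholder bag-of-words histograms, using scikit-learn's precomputed-kernel SVM:

```python
import numpy as np
from sklearn.svm import SVC

def histogram_intersection_kernel(X, Y):
    # K(x, y) = sum_i min(x_i, y_i) over visual-word histograms
    return np.minimum(X[:, None, :], Y[None, :, :]).sum(axis=2)

rng = np.random.default_rng(0)
X_train = rng.random((100, 200))   # 200-word BoW histograms, as in the entry
y_train = rng.integers(0, 2, 100)

K_train = histogram_intersection_kernel(X_train, X_train)
clf = SVC(kernel="precomputed").fit(K_train, y_train)
# test time: clf.decision_function(histogram_intersection_kernel(X_test, X_train))
```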
Title: Fast ScSPM+LSVM with dense KDES features
Method: FastScSPM-KDES
Affiliation: K.U. Leuven, VISICS / IBBT
Contributors: Radu Timofte, K.U. Leuven ESAT-PSI-VISICS / IBBT
Description: We use the pipeline of Linear Spatial Pyramid Matching Using Sparse Coding for Image Classification (Yang et al., 2009). Gradient kernel descriptors (Bo et al., 2010) are densely extracted from each image. 500k randomly selected descriptors from the trainval set are used to quickly optimize a small dictionary of 1500 "visual words", using an approach similar to Yang et al. 2009 but with our fast sparse coding method. For each image the descriptors are sparse-coded over the dictionary and multi-scale max-pooled. Linear one-vs-all SVM classifiers are trained on trainval. It is a one-shot system: one day in total, no tuning.
Date: 2011-10-13 18:52:10

Title: SVM with average kernel using 17 kernels
Method: JDL_K17_AVG_CLS
Affiliation: JDL, Institute of Computing Technology, Chinese Academy of Sciences
Contributors: Shuhui Wang, Li Shen, Shuqiang Jiang, Qi Tian, Qingming Huang
Description: We calculate six types of commonly used BOW features (dense and sparse SIFT, dense color SIFT, HOG, LBP and self-similarity) and three global features (color moment, GIST and block GIST), with visual vocabulary sizes typically around 1000. We calculate 3-level spatial pyramid features on the BOW representations. Then 17 base kernels are computed by applying histogram intersection, RBF and chi-square kernels to these features, with kernel parameters tuned on the validation data. We calculate an average kernel from these 17 base kernels, and one-against-all SVM classifiers are trained for each category.
Date: 2011-10-13 20:50:40

Title: SVM with average kernel of second order kernels
Method: JDL_K17_HOK2_CLS
Affiliation: JDL, Institute of Computing Technology, Chinese Academy of Sciences
Contributors: Shuhui Wang, Li Shen, Shuqiang Jiang, Qi Tian, Qingming Huang
Description: We calculate six types of commonly used BOW features (dense and sparse SIFT, dense color SIFT, HOG, LBP and self-similarity) and three global features (color moment, GIST and block GIST), with visual vocabulary sizes typically around 1000. We calculate 3-level spatial pyramid features on the BOW representations. Then 17 base kernels are computed by applying histogram intersection, RBF and chi-square kernels to these features, with kernel parameters tuned on the validation data. We calculate a set of second-order kernels, each from a pair of the 17 base kernels, with the formulation K_pq(x,y) = sqrt(K_p(x,y) * K_q(x,y)). An average kernel is computed from all 136 second-order kernels. Finally, one-against-all SVM classifiers are trained for each category.
Date: 2011-10-13 21:25:02

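The second-order kernel formula quoted above can be transcribed directly. The base Gram matrices below are random non-negative stand-ins (histogram intersection and exponentiated chi-square kernels are non-negative, so the square root is safe):

```python
import numpy as np
from itertools import combinations

# K_pq(x, y) = sqrt(K_p(x, y) * K_q(x, y)) over all pairs of 17 base kernels.
rng = np.random.default_rng(0)
base = [(f := rng.random((50, 8))) @ f.T for _ in range(17)]  # non-negative Grams

second_order = [np.sqrt(Kp * Kq) for Kp, Kq in combinations(base, 2)]
assert len(second_order) == 136              # C(17, 2) pairs, as in the entry
K_final = np.mean(second_order, axis=0)      # averaged, then fed to 1-vs-all SVMs
```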
Title: MKL classifier with multiple features
Method: LIRIS_CLS
Affiliation: LIRIS, Ecole Centrale de Lyon, CNRS, UMR 5205, France
Contributors: Chao Zhu, Yuxing Tang, Ningning Liu, Charles-Edmond Bichot, Emmanuel Dellandrea, Liming Chen
Description: In this submission we mainly make use of local descriptors and the popular bag-of-visual-words approach for classification. Regions of interest in each image are detected using both a Harris-Laplace detector and dense sampling. SIFT and color SIFT descriptors are then computed for each region as a baseline. In addition, we extract DAISY and extended LBP descriptors based on our work [1][2], for computational efficiency and information complementary to SIFT. For each kind of descriptor, 1,000,000 randomly selected descriptors from the train+val set are quantized using the k-means algorithm into 4000 visual words. Each image is then represented by a histogram with hard assignment. The spatial pyramid technique is applied for coarse spatial information. Chi-square kernels at the different pyramid levels are computed and fused by linear combination. The final outputs are obtained by using a multiple kernel learning algorithm to fuse the different descriptors. [1] C. Zhu, C.E. Bichot, L. Chen: 'Visual object recognition using DAISY descriptor', in Proc. of IEEE International Conference on Multimedia and Expo (ICME), to appear, 2011. [2] C. Zhu, C.E. Bichot, L. Chen: 'Multi-scale color local binary patterns for visual object classes recognition', in Proc. of 20th International Conference on Pattern Recognition (ICPR), pp. 3065-3068, 2010.
Date: 2011-10-13 20:12:26

Title: Classification combined with detection
Method: LIRIS_CLSDET
Affiliation: LIRIS, Ecole Centrale de Lyon, CNRS, UMR 5205, France
Contributors: Chao Zhu, Yuxing Tang, Ningning Liu, Charles-Edmond Bichot, Emmanuel Dellandrea, Liming Chen
Description: In this submission we improve the classification performance by combining it with object detection results. For classification, we mainly make use of local descriptors and the popular bag-of-visual-words approach. Regions of interest in each image are detected using both a Harris-Laplace detector and dense sampling. SIFT and color SIFT descriptors are then computed for each region as a baseline. In addition, we extract DAISY and extended LBP descriptors based on our work [1][2], for computational efficiency and information complementary to SIFT. For each kind of descriptor, 1,000,000 randomly selected descriptors from the train+val set are quantized using the k-means algorithm into 4000 visual words. Each image is then represented by a histogram with hard assignment. The spatial pyramid technique is applied for coarse spatial information. Chi-square kernels at the different pyramid levels are computed and fused by linear combination. The final outputs are obtained by using a multiple kernel learning algorithm to fuse the different descriptors. For object detection, we use HOG features to train deformable part models and apply them with a sliding-window approach. Finally, we combine the outputs of classification and detection by late fusion. [1] C. Zhu, C.E. Bichot, L. Chen: 'Visual object recognition using DAISY descriptor', in Proc. of IEEE International Conference on Multimedia and Expo (ICME), to appear, 2011. [2] C. Zhu, C.E. Bichot, L. Chen: 'Multi-scale color local binary patterns for visual object classes recognition', in Proc. of 20th International Conference on Pattern Recognition (ICPR), pp. 3065-3068, 2010.
Date: 2011-10-13 20:18:08

Title: SVM with mined high order features
Method: MSRA_USTC_HIGH_ORDER_SVM
Affiliation: Microsoft Research Asia & University of Science and Technology of China
Contributors: Kuiyuan Yang, Lei Zhang, Hong-Jiang Zhang
Description: We introduce a discriminatively trained parts-based model with templates at different levels for image classification. The model consists of templates of HOG features (Dalal and Triggs, 2006) at three different levels. The responses of the different-level templates are combined by a latent SVM, where the latent variables are the positions of the templates. We develop a novel mining algorithm to define the parts and an iterative training procedure to learn them. The model is applied to all 20 PASCAL VOC classes.
Date: 2011-10-13 18:02:16

Title: SVM with multi-channel cell-structured patch features
Method: MSRA_USTC_PATCH
Affiliation: Microsoft Research Asia & University of Science and Technology of China
Contributors: Kuiyuan Yang, Lei Zhang, Hong-Jiang Zhang
Description: We introduce a discriminatively trained patch-based model with cell-structured templates for image classification. Densely sampled patches are represented by cell-structured templates of HOG, LBP, HSV, SIFT, CSIFT and SSIM. These templates are then fed to super-vector coding (Xi Zhou, 2010) and the Fisher kernel (Florent Perronnin, 2010) to form the image feature. A linear SVM is then trained for each category in a one-vs-rest manner. The object detector from PASCAL VOC 2007 is used to extract object-level features, classifiers are trained on these features, and the results are fused with the former.
Date: 2011-10-12 04:12:56

Title: SVM with multiple features and detection results
Method: NLPR_IVA_SVM_BOWDect
Affiliation: NLPR, CASIA
Contributors: Jing Liu, Jianlong Fu, Bingyuan Liu, Hanqing Lu
Description: Two types of features are considered. First, typical BOW features (OpponentSIFT, C-SIFT and rgSIFT) with dense and Harris sampling, with spatial pyramid kernels calculated on these features for classification. Second, object detection based on the deformable part model (P. Felzenszwalb et al., PAMI 2009). Combining these features, we learn a hierarchical SVM classifier according to the hierarchical structure of the 20-class data.
Date: 2011-10-13 19:31:12

Title: SVM with multiple features and detection results
Method: NLPR_IVA_SVM_BOWDect_Convolution
Affiliation: NLPR, CASIA
Contributors: Jing Liu, Jianlong Fu, Bingyuan Liu, Hanqing Lu
Description: Three types of features are considered. First, typical BOW features (OpponentSIFT, C-SIFT and rgSIFT) with dense and Harris sampling, with spatial pyramid kernels. Second, an improved image representation via convolutional sparse coding and max pooling, motivated by M. Zeiler's work in ICCV 2011. Third, object detection based on the deformable part model. Combining these features, we learn a hierarchical SVM classifier according to the hierarchical structure of the 20-class data.
Date: 2011-10-13 20:39:39

Title: SVM classifier with five kernels
Method: NLPR_KF_SVM
Affiliation: National Laboratory of Pattern Recognition, Institute of Automation, Chinese Academy of Sciences
Contributors: Ruiguang Hu, Weiming Hu
Description: Features: GrayPHOG, HuePHOG, PLBP (R=1 and R=2), plus GraySIFT_HL and HueSIFT_HL with LLC coding and max pooling. Codebooks: k-means clustering. Normalization: l1 for GrayPHOG, HuePHOG and PLBP; l2 for GraySIFT_HL and HueSIFT_HL. Kernels: chi-squared for GrayPHOG, HuePHOG and PLBP; linear for GraySIFT_HL and HueSIFT_HL. Kernel fusion: averaging. Training features are extracted from sub-images cropped according to the annotation bounding boxes; test features are extracted from whole test images.
Date: 2011-09-07 04:28:25

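A short sketch of the "average strategy" kernel fusion this entry describes, with placeholder features: chi-squared kernels for the l1-normalized histogram features and linear kernels for the l2-normalized LLC-coded features:

```python
import numpy as np
from sklearn.metrics.pairwise import chi2_kernel, linear_kernel

rng = np.random.default_rng(0)
phog = rng.random((100, 680))                # placeholder l1-normalized PHOG histograms
sift_llc = rng.standard_normal((100, 1024))  # placeholder l2-normalized LLC codes

# Average-strategy fusion: one chi-squared kernel, one linear kernel.
K = (chi2_kernel(phog) + linear_kernel(sift_llc)) / 2.0
```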
Title: PLS-multi-feature and Semi-Semantic Visual Words
Method: NLPR_PLS_SSVW
Affiliation: National Laboratory of Pattern Recognition, Institute of Automation, Chinese Academy of Sciences
Contributors: Yinan Yu, Junge Zhang, Yongzhen Huang, Weiqiang Ren, Chong Wang, Jinchen Wu, Kaiqi Huang, Tieniu Tan
Description: THIS SUBMISSION IS A BACKUP OF OUR FINAL FILES AND SHOULD BE INACTIVE FOR FINAL EVALUATION; WE ARE WRITING THE DESCRIPTION OF OUR ALGORITHM AND WILL RESUBMIT A NEW ONE FOR THE COMPETITION. The algorithm is based on the traditional bag-of-words model, with fast feature extraction for visual representation, partial least squares analysis for dimensionality reduction, novel semi-semantic visual words, an alternative (simple, effective, efficient) multiple linear-kernel learning for feature combination, re-scoring by a non-linear classifier, and finally detection fusion.
Date: 2011-10-13 04:45:23

Title: Classification using context SVM and GPM
Method: NUSPSL_CTX_GPM
Affiliation: National University of Singapore; Panasonic Singapore Laboratories
Contributors: NUS: Chen Qiang, Song Zheng, Yan Shuicheng; PSL: Hua Yang, Huang Zhongyang, Shen Shengmei
Description: The whole solution for object classification is based on the BoW framework. At the image level, dense SIFT, HOG^2, LBP and color-moment features are extracted. VQ and Fisher vectors are used for feature coding. Traditional SPM and a novel spatial-free pyramid matching scheme are then used to generate image representations. Context-aware features are also extracted based on detection results [1]. The classification models are learnt via kernel SVM, and the final classification scores are refined with kernel mapping [2]. The key novelty of the new solution is the pooling strategy (which handles spatial mismatching as well as noisy features), and considerable improvement has been achieved, as shown in other offline experiments. [1] Zheng Song, Qiang Chen, Zhongyang Huang, Yang Hua, and Shuicheng Yan. Contextualizing Object Detection and Classification, CVPR 2011. [2] http://pascallin.ecs.soton.ac.uk/challenges/VOC/voc2010/workshop/nuspsl.pdf
Date: 2011-10-13 20:36:35

Title: Classification using context SVM and GPM
Method: NUSPSL_CTX_GPM_SVM
Affiliation: National University of Singapore; Panasonic Singapore Laboratories
Contributors: NUS: Chen Qiang, Song Zheng, Yan Shuicheng; PSL: Hua Yang, Huang Zhongyang, Shen Shengmei
Description: The whole solution for object classification is based on the BoW framework [1]. At the image level, dense SIFT, HOG^2, LBP and color-moment features are extracted. VQ and Fisher vectors are used for feature coding. Traditional SPM and a novel spatial-free pyramid matching scheme are then used to generate image representations. Context-aware features are also extracted based on detection results [2]. The classification models are learnt via kernel SVM. The key novelty of the new solution is the pooling strategy (which handles spatial mismatching as well as noisy features), and considerable improvement has been achieved, as shown in other offline experiments. [1] http://pascallin.ecs.soton.ac.uk/challenges/VOC/voc2010/workshop/nuspsl.pdf [2] Zheng Song, Qiang Chen, Zhongyang Huang, Yang Hua, and Shuicheng Yan. Contextualizing Object Detection and Classification, CVPR 2011.
Date: 2011-10-13 20:58:30

Title: Context-SVM based submission for 3 tasks
Method: NUS_Context_SVM
Affiliation: National University of Singapore
Contributors: Zheng Song, Qiang Chen, Shuicheng Yan
Description: Classification uses the BoW framework. Dense SIFT, HOG^2, LBP and color-moment features are extracted. We use VQ and Fisher vectors for feature coding, and SPM and Generalized Pyramid Matching (GPM) to generate image representations. Context-aware features are also extracted based on [1]. The classification models are learnt via kernel SVM, and the final classification scores are refined with kernel mapping [2]. Detection and segmentation results use the baseline of [3] with HOG and LBP features; based on [1], we further learn a context model and refine the detection results. The final segmentation result substitutes the rectangular detection boxes with average masks per detection component, learnt on the segmentation training set. [1] Zheng Song*, Qiang Chen*, Zhongyang Huang, Yang Hua, and Shuicheng Yan. Contextualizing Object Detection and Classification. [2] http://pascallin.ecs.soton.ac.uk/challenges/VOC/voc2010/workshop/nuspsl.pdf [3] http://people.cs.uchicago.edu/~pff/latent/
Date: 2011-10-05 09:01:23

Title: SVM using LLC features with detection results
Method: SIFT-LLC-PCAPOOL-DET-SVM
Affiliation: Shanghai Jiao Tong University
Contributors: Jun Zhu, Xiaokang Yang, Yukun Zhu, Rui Zhang, Xiaolin Chen
Description: We adopt the locality-constrained linear coding (LLC) framework (J. Wang et al., CVPR 2010), fused with the detection results of discriminatively trained deformable part-based object detectors (P. Felzenszwalb et al., CVPR 2008). First, SIFT descriptors (Lowe 2004) are extracted from image patches densely sampled every 8 pixels at three scales (16x16, 24x24 and 32x32). A codebook with 1024 bases is constructed by k-means clustering on 100,000 randomly selected descriptors from the training set. Each 128-dimensional SIFT descriptor is encoded by approximated LLC (5 neighbors, with the shift-invariant constraint), giving a 1024-dimensional code vector. We max-pool the patch-level codes over hundreds of overlapping regions at various spatial scales and positions, followed by dimensionality reduction with PCA; the pooled features are concatenated and l2-normalized to form the image-level representation. In addition, we use Felzenszwalb's deformable part-based models to detect bounding boxes for each object class; the detection scores are max-pooled in each cell of a spatial pyramid (1x1 + 2x2 + 3x1) to construct an l2-normalized image-level representation. The final image-level representation is a weighted concatenation of the two feature vectors (LLC codes and object detectors), on which a linear SVM classifier is trained. The regularization parameters and the fusion weight are tuned per class using the training and validation sets. We use the 'liblinear' package, released by the machine learning group at National Taiwan University, for the SVM implementation.
Date: 2011-10-13 22:43:14

Title: Linear SVM using LLC features with PCA pooling
Method: SIFT-LLC-PCAPOOL-SVM
Affiliation: Shanghai Jiao Tong University
Contributors: Jun Zhu, Xiaokang Yang, Yukun Zhu, Rui Zhang, Xiaoling Chen
Description: We adopt the locality-constrained linear coding (LLC) framework (J. Wang et al., CVPR 2010). First, SIFT descriptors (Lowe 2004) are extracted from image patches densely sampled every 8 pixels at three scales (16x16, 24x24 and 32x32). A codebook with 1024 bases is constructed by k-means clustering on 100,000 randomly selected descriptors from the training set. Each 128-dimensional SIFT descriptor is then encoded by approximated LLC (5 neighbors, with the shift-invariant constraint), giving a 1024-dimensional code vector. We max-pool the patch-level codes over hundreds of overlapping regions at various spatial scales and positions, followed by dimensionality reduction with PCA; the pooled features are concatenated and l2-normalized to form the image-level representation. Finally, we train linear SVM classifiers on this representation. The regularization parameters are tuned per class using the training and validation sets. We use the 'liblinear' package, released by the machine learning group at National Taiwan University, for the SVM implementation.
Date: 2011-10-13 19:07:49

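Both SJTU entries use approximated LLC coding, which has a closed-form solution over each descriptor's k nearest codebook bases (Wang et al., CVPR 2010). A self-contained sketch for a single descriptor:

```python
import numpy as np

def llc_code(x, codebook, k=5, beta=1e-4):
    """Approximated LLC coding (Wang et al., CVPR 2010) for one descriptor.

    x: (128,) SIFT descriptor; codebook: (M, 128) k-means bases (M = 1024
    in the entries). The code is nonzero only on the k nearest bases and
    satisfies the shift-invariant constraint sum(code) = 1.
    """
    nn = np.argsort(np.linalg.norm(codebook - x, axis=1))[:k]
    z = codebook[nn] - x                  # shift descriptor to the local frame
    C = z @ z.T + beta * np.eye(k)        # regularized local covariance
    w = np.linalg.solve(C, np.ones(k))
    code = np.zeros(len(codebook))
    code[nn] = w / w.sum()                # enforce the sum-to-one constraint
    return code
```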
Title: NLPR_CLS
Method: Semi-Semantic Visual Words & Partial Least Squares
Affiliation: National Laboratory of Pattern Recognition, Institute of Automation, Chinese Academy of Sciences
Contributors: Yinan Yu, Junge Zhang, Yongzhen Huang, Weiqiang Ren, Chong Wang, Jinchen Wu, Kaiqi Huang, Tieniu Tan
Description: The framework is based on the classical bag-of-words model. The system consists of: 1) at the feature level, semi-semantic and non-semantic visual word learning, with fast feature extraction (salient coding, super-vector coding and visual co-occurrence) and multiple features; 2) for learning the class model, an alternative multiple-linear-kernel learning for intra-class feature combination, after a Partial Least Squares analysis which projects the extremely high-dimensional features into a low-dimensional space; 3) the combination of the 20 category scores and the detection scores generates a high-level semantic representation of the image, on which we use non-linear kernel learning to extract inter-class contextual information, further improving performance. All parameters are decided by cross-validation and prior knowledge from the VOC2007 and VOC2010 trainval sets. Motivation and novelty: the traditional codebook describes the distribution of the feature space and contains little semantic information about the objects of interest, so a semantic codebook may benefit performance. We observe that the deformable part-based model [Felzenszwalb, TPAMI 2010] describes an object by "object parts", which can be seen as semi-semantic visual words. Based on this idea, we propose a bag-of-words model over semi-semantic and non-semantic visual words for image classification. Analyzing recent image classification algorithms, we find that feature "distribution", "reconstruction" and "saliency" are three fundamental issues in coding and image description. However, these methods usually lead to an extremely high-dimensional description, especially with multiple features. To learn these features by MKL, we find Partial Least Squares to be a reliable method for dimensionality reduction: its compression ratio is over 10000, while discrimination is preserved.
Date: 2011-10-13 14:55:53

Title: Most Telling Window
Method: UvA_UNITN_MostTellingMonkey
Affiliation: University of Amsterdam, University of Trento
Contributors: Jasper Uijlings, Koen van de Sande, Arnold Smeulders, Theo Gevers, Nicu Sebe, Cees Snoek
Description: The main component of this entry is the "Most Telling Window" method, which uses Segmentation as Selective Search [1] combined with bag-of-words. The method is also used in our detection entry; however, instead of focusing on finding complete objects, training is adjusted so that the most discriminative part of an object can be used for identification instead of the whole object. The Most Telling Window method is currently under review. While it provides the largest contribution, we further improve accuracy by combining it with a standard bag-of-words framework based on SIFT and ColorSIFT and with the detection scores of the part-based model of Felzenszwalb et al. [1] "Segmentation as Selective Search for Object Recognition"; Koen E. A. van de Sande, Jasper R. R. Uijlings, Theo Gevers, Arnold W. M. Smeulders; 13th International Conference on Computer Vision, 2011.
Date: 2011-10-13 22:58:00

Title: BUPT_MCPR_all
Method: combining methods
Affiliation: Beijing University of Posts and Telecommunications - MCPRL
Contributors: Zhicheng Zhao, Tao Liu, Xin Guo, Anni Cai
Description: Two methods are combined: one with patches and one without.
Date: 2011-10-02 05:30:10

Title: BUPT_MCPR_max4
Method: max 4 method
Affiliation: Beijing University of Posts and Telecommunications - MCPRL
Contributors: Zhicheng Zhao, Tao Liu, Xin Guo, Anni Cai
Description: The four features with the highest AP values are involved.
Date: 2011-10-02 05:35:31

Title: BUPT_MCPR_nopatch
Method: nopatch mthod
Affiliation: Beijing University of Posts and Telecommunications - MCPRL
Contributors: Zhicheng Zhao, Tao Liu, Xin Guo, Anni Cai
Description: SIFT, SURF and HOG features with dense sampling and keypoint detection are used.
Date: 2011-10-02 05:40:55