Segmentation Results: VOC2012 BETA

Competition "comp5" (train on VOC2012 data)

This leaderboard shows only those submissions that have been marked as public, so the displayed rankings should not be considered definitive.

Average Precision (AP %)

Method | mean | aeroplane | bicycle | bird | boat | bottle | bus | car | cat | chair | cow | diningtable | dog | horse | motorbike | person | pottedplant | sheep | sofa | train | tv/monitor | submission date
XC-FLATTENET | 84.3 | 94.0 | 73.2 | 91.5 | 74.4 | 83.2 | 95.5 | 90.5 | 96.7 | 38.1 | 94.5 | 76.2 | 92.9 | 95.6 | 88.9 | 90.4 | 76.0 | 93.8 | 63.6 | 86.8 | 78.6 | 12-Jan-2020
FDNet_16s | 84.0 | 95.4 | 77.9 | 95.9 | 69.1 | 80.6 | 96.4 | 92.6 | 95.5 | 40.5 | 92.6 | 70.6 | 93.8 | 93.1 | 90.4 | 89.9 | 71.2 | 92.7 | 63.1 | 88.5 | 77.7 | 22-Mar-2018
FLATTENET-001 | 83.1 | 95.4 | 69.8 | 86.4 | 73.3 | 78.7 | 94.6 | 92.1 | 95.5 | 41.0 | 92.4 | 75.8 | 93.1 | 95.2 | 89.8 | 90.0 | 65.8 | 94.0 | 60.0 | 86.6 | 79.6 | 29-Dec-2019
WASPNet+CRF | 79.6 | 90.7 | 60.9 | 85.3 | 68.9 | 80.7 | 93.8 | 84.7 | 94.5 | 38.5 | 86.0 | 69.7 | 90.7 | 86.9 | 85.4 | 86.8 | 67.6 | 88.8 | 57.4 | 85.4 | 74.2 | 19-Nov-2019
WASPNet | 79.4 | 89.5 | 63.8 | 87.6 | 68.8 | 79.1 | 93.5 | 84.7 | 93.7 | 37.6 | 84.8 | 69.4 | 90.8 | 87.9 | 85.7 | 86.5 | 67.4 | 87.0 | 57.2 | 85.2 | 72.7 | 25-Jul-2019
refinenet_HPM | 74.2 | 87.9 | 62.2 | 76.0 | 55.3 | 76.0 | 86.7 | 82.6 | 85.4 | 28.9 | 79.6 | 64.2 | 81.2 | 79.4 | 85.1 | 85.0 | 65.6 | 83.2 | 51.3 | 79.2 | 68.6 | 01-Mar-2019
DCONV_SSD_FCN | 72.9 | 88.6 | 38.1 | 85.2 | 57.8 | 71.4 | 90.8 | 84.2 | 86.0 | 32.1 | 83.4 | 53.7 | 80.4 | 80.8 | 81.6 | 81.2 | 61.4 | 84.1 | 51.9 | 77.5 | 67.0 | 17-Mar-2018
deeplabv3_plus_reproduction | 69.5 | 81.7 | 38.1 | 83.1 | 60.1 | 62.7 | 89.8 | 81.4 | 87.6 | 30.0 | 72.8 | 60.9 | 78.1 | 77.7 | 78.6 | 78.1 | 44.9 | 76.7 | 50.8 | 75.4 | 58.6 | 11-May-2019
TCnet | 68.4 | 72.6 | 32.6 | 74.2 | 59.5 | 68.9 | 86.7 | 77.1 | 78.6 | 34.4 | 68.6 | 63.0 | 74.4 | 75.8 | 76.3 | 77.0 | 54.9 | 76.6 | 55.5 | 76.9 | 61.7 | 02-May-2018
UMICH_EG-ConvCRF_Iter_Res101 | 67.9 | 76.4 | 32.1 | 82.0 | 52.9 | 69.5 | 89.5 | 81.1 | 87.5 | 29.6 | 74.7 | 54.2 | 83.4 | 79.2 | 78.7 | 76.1 | 48.1 | 81.0 | 59.4 | 51.3 | 49.3 | 13-Dec-2019
UMICH_TCS_101 | 66.7 | 74.6 | 31.5 | 81.5 | 50.3 | 67.2 | 88.4 | 80.6 | 82.3 | 29.9 | 72.3 | 53.4 | 76.1 | 76.0 | 78.4 | 73.6 | 47.5 | 79.2 | 52.0 | 57.9 | 56.6 | 01-Dec-2019
UMICH_EG-ConvCRF_Iter_Res50 | 66.4 | 75.8 | 32.2 | 85.5 | 50.9 | 69.3 | 86.4 | 79.6 | 85.1 | 29.1 | 73.7 | 55.7 | 79.6 | 74.7 | 76.3 | 75.8 | 44.8 | 79.6 | 51.0 | 50.5 | 48.5 | 10-Dec-2019
UMICH_TCS | 65.5 | 73.6 | 32.4 | 81.6 | 50.4 | 68.5 | 86.2 | 79.4 | 81.8 | 28.2 | 75.5 | 55.6 | 79.0 | 75.5 | 77.7 | 74.3 | 48.2 | 79.0 | 52.3 | 44.4 | 42.7 | 28-Nov-2019
bothweight th0.4 | 65.3 | 79.1 | 33.4 | 88.2 | 20.1 | 65.3 | 88.0 | 76.2 | 90.0 | 24.7 | 80.7 | 43.7 | 85.1 | 85.8 | 82.3 | 69.8 | 47.7 | 84.9 | 43.8 | 41.4 | 54.5 | 15-Apr-2019
weight+RS | 64.5 | 85.0 | 31.9 | 85.4 | 19.1 | 65.3 | 88.6 | 72.9 | 88.5 | 24.8 | 75.6 | 50.1 | 83.3 | 82.0 | 81.6 | 66.8 | 56.6 | 80.3 | 45.8 | 44.6 | 39.7 | 24-Mar-2019
AttnBN | 63.0 | 75.7 | 32.9 | 73.5 | 49.9 | 60.4 | 78.1 | 76.5 | 77.4 | 19.9 | 72.0 | 27.4 | 73.8 | 72.7 | 77.2 | 72.3 | 51.2 | 77.3 | 37.9 | 73.5 | 53.6 | 14-Aug-2019
Extended | 59.3 | 77.9 | 28.9 | 75.1 | 42.6 | 55.2 | 70.4 | 58.9 | 53.0 | 24.3 | 66.7 | 51.9 | 73.1 | 71.3 | 72.5 | 63.9 | 45.2 | 59.2 | 43.9 | 65.2 | 58.6 | 29-Aug-2018
weakly_seg_validation_test | 57.7 | 67.6 | 31.1 | 66.4 | 41.9 | 60.1 | 70.6 | 65.4 | 71.8 | 25.3 | 63.6 | 24.7 | 72.2 | 68.7 | 68.3 | 68.8 | 41.6 | 67.5 | 33.6 | 65.0 | 49.2 | 08-Sep-2019
O2P_SVRSEGM_CPMC_CSI | 47.5 | 64.0 | 32.2 | 45.9 | 34.7 | 46.3 | 59.5 | 61.7 | 49.4 | 14.8 | 47.9 | 31.2 | 42.5 | 51.3 | 58.8 | 54.6 | 34.9 | 54.6 | 34.7 | 50.6 | 42.2 | 15-Nov-2012
NUS_DET_SPR_GC_SP | 47.3 | 52.9 | 31.0 | 39.8 | 44.5 | 58.9 | 60.8 | 52.5 | 49.0 | 22.6 | 38.1 | 27.5 | 47.4 | 52.4 | 46.8 | 51.9 | 35.7 | 55.3 | 40.8 | 54.2 | 47.8 | 23-Sep-2012
BONN_O2PCPMC_FGT_SEGM | 47.0 | 65.4 | 29.3 | 51.3 | 33.4 | 44.2 | 59.8 | 60.3 | 52.5 | 13.6 | 53.6 | 32.6 | 40.3 | 57.6 | 57.3 | 49.0 | 33.5 | 53.5 | 29.2 | 47.6 | 37.6 | 23-Sep-2012
BONNGC_O2P_CPMC_CSI | 45.4 | 59.3 | 27.9 | 43.9 | 39.8 | 41.4 | 52.2 | 61.5 | 56.4 | 13.6 | 44.5 | 26.1 | 42.8 | 51.7 | 57.9 | 51.3 | 29.8 | 45.7 | 28.8 | 49.9 | 43.3 | 23-Sep-2012
BONN_CMBR_O2P_CPMC_LIN | 44.8 | 60.0 | 27.3 | 46.4 | 40.0 | 41.7 | 57.6 | 59.0 | 50.4 | 10.0 | 41.6 | 22.3 | 43.0 | 51.7 | 56.8 | 50.1 | 33.7 | 43.7 | 29.5 | 47.5 | 44.7 | 23-Sep-2012
comp6_test_cls | 37.7 | 36.6 | 10.8 | 38.9 | 25.9 | 30.8 | 56.0 | 53.8 | 57.8 | 4.9 | 24.6 | 22.1 | 48.1 | 33.1 | 32.6 | 56.1 | 23.5 | 29.7 | 31.8 | 42.7 | 45.6 | 10-May-2018
OptNBNN-CRF | 11.3 | 10.5 | 2.3 | 3.0 | 3.0 | 1.0 | 30.2 | 14.9 | 15.0 | 0.2 | 6.1 | 2.3 | 5.1 | 12.1 | 15.3 | 23.4 | 0.5 | 8.9 | 3.5 | 10.7 | 5.3 | 23-Sep-2012
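The VOC development kit scores segmentation per class as intersection-over-union, TP / (TP + FP + FN), accumulated over the whole test set. A minimal sketch of that computation (handling of void pixels, which the devkit excludes, is omitted here):

```python
import numpy as np

def voc_iou_scores(preds, gts, num_classes):
    """Per-class segmentation score as intersection-over-union,
    TP / (TP + FP + FN), accumulated over all images; returned in %."""
    # Confusion matrix accumulated over the dataset.
    conf = np.zeros((num_classes, num_classes), dtype=np.int64)
    for pred, gt in zip(preds, gts):
        idx = gt.ravel() * num_classes + pred.ravel()
        conf += np.bincount(idx, minlength=num_classes ** 2).reshape(
            num_classes, num_classes)
    tp = np.diag(conf).astype(np.float64)
    fp = conf.sum(axis=0) - tp   # predicted as c but labeled otherwise
    fn = conf.sum(axis=1) - tp   # labeled c but predicted otherwise
    return 100.0 * tp / np.maximum(tp + fp + fn, 1)

# Toy 2x2 "images": background = class 0, one foreground class = 1.
pred = [np.array([[0, 1], [1, 1]])]
gt   = [np.array([[0, 1], [0, 1]])]
scores = voc_iou_scores(pred, gt, num_classes=2)
```

The "mean" column above is then the average of the 20 object-class scores together with background.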

Abbreviations

Each entry below lists the submission's title, method identifier, affiliation, contributors, description, and submission date.
Title: AttnBN
Method: AttnBN
Affiliation: AttnBN
Contributors: AttnBN
Description: AttnBN
Date: 2019-08-14 23:23:24

Title: O2P Regressor + Composite Statistical Inference
Method: BONNGC_O2P_CPMC_CSI
Affiliation: (1) University of Bonn, (2) Georgia Institute of Technology, (3) University of Coimbra
Contributors: Joao Carreira (1,3), Fuxin Li (2), Guy Lebanon (2), Cristian Sminchisescu (1)
Description: We utilize a novel probabilistic inference procedure (unpublished yet), Composite Statistical Inference (CSI), on semantic segmentation using predictions on overlapping figure-ground hypotheses. Regressor predictions on segment overlaps to the ground truth object are modelled as generated by the true overlap with the ground truth segment plus noise. A model of ground truth overlap is defined by parametrizing on the unknown percentage of each superpixel that belongs to the unknown ground truth. A joint optimization over all the superpixels and all the categories is then performed in order to maximize the likelihood of the SVR predictions. The optimization has a tight convex relaxation, so solutions can be expected to be close to the global optimum. A fast and optimal search algorithm is then applied to retrieve each object. CSI takes from the SVRSEGM inference algorithm the intuition that multiple predictions on similar segments can be combined to better consolidate the segment mask, but fully develops the idea by constructing a probabilistic framework and performing composite MLE jointly on all segments and categories. Therefore it is able to consolidate object boundaries better and handle hard cases where objects interact closely and heavily occlude each other. For each image, we use 150 overlapping figure-ground hypotheses generated by the CPMC algorithm (Carreira and Sminchisescu, PAMI 2012), and linear SVR predictions on them with the novel second-order O2P features (Carreira, Caseiro, Batista, Sminchisescu, ECCV 2012; see VOC12 entry BONN_CMBR_O2P_CPMC_LIN) as the input to the inference algorithm.
Date: 2012-09-23 23:49:02

Title: Linear SVR with second-order pooling
Method: BONN_CMBR_O2P_CPMC_LIN
Affiliation: (1) University of Bonn, (2) University of Coimbra
Contributors: Joao Carreira (2,1), Rui Caseiro (2), Jorge Batista (2), Cristian Sminchisescu (1)
Description: We present a novel, effective local feature aggregation method that we use in conjunction with an existing figure-ground segmentation sampling mechanism; this submission is described in detail in [1]. We sample multiple figure-ground segmentation candidates per image using the Constrained Parametric Min-Cuts (CPMC) algorithm. SIFT, masked SIFT and LBP features are extracted on the whole image, then pooled over each object segmentation candidate to generate global region descriptors. We employ a novel second-order pooling procedure, O2P, with two non-linearities: a tangent space mapping and power normalization. The global region descriptors are passed through linear regressors for each category; labeled segments in each image having scores above some threshold are then pasted onto the image in the order of these scores. Learning is performed using an epsilon-insensitive loss function on overlap with ground truth, similar to [2], but within a linear formulation (using LIBLINEAR). comp6: learning uses all images in the segmentation+detection trainval sets, and external ground truth annotations provided by courtesy of the Berkeley vision group. comp5: one model is trained for each category using the available ground truth segmentations from the 2012 trainval set. Then, on each image having no associated ground truth segmentations, the learned models are used together with bounding box constraints, low-level cues and region competition to generate predicted object segmentations inside all bounding boxes. Afterwards, learning proceeds as in the fully annotated case. [1] "Semantic Segmentation with Second-Order Pooling", Carreira, Caseiro, Batista, Sminchisescu. ECCV 2012. [2] "Object Recognition by Ranking Figure-Ground Hypotheses", Li, Carreira, Sminchisescu. CVPR 2010.
Date: 2012-09-23 19:11:47

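The second-order pooling step described above can be sketched in a few lines: average the outer products of the local descriptors pooled over a region, then apply the two non-linearities mentioned (a tangent-space/log-Euclidean mapping and power normalization). The regularizer and power exponent below are illustrative choices, not the paper's exact values:

```python
import numpy as np

def o2p_descriptor(features, eps=1e-6, power=0.75):
    """Second-order average pooling (O2P) over local descriptors.

    features: (n, d) array of local descriptors pooled over one region.
    Returns a flattened, power-normalized log-Euclidean descriptor.
    """
    n, d = features.shape
    # Second-order average pooling: mean of outer products.
    G = features.T @ features / n
    # Tangent-space (log-Euclidean) mapping via eigendecomposition;
    # eps regularizes zero eigenvalues so the matrix log is defined.
    w, V = np.linalg.eigh(G + eps * np.eye(d))
    log_G = (V * np.log(w)) @ V.T
    # Power normalization: sign(x) * |x|^power, elementwise.
    x = log_G.ravel()
    return np.sign(x) * np.abs(x) ** power

rng = np.random.default_rng(0)
desc = o2p_descriptor(rng.standard_normal((100, 8)))
```

The resulting d*d vector is what gets fed to the per-category linear regressors.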
Title: BONN_O2PCPMC_FGT_SEGM
Method: BONN_O2PCPMC_FGT_SEGM
Affiliation: (1) University of Bonn, (2) University of Coimbra, (3) Georgia Institute of Technology, (4) Vienna University of Technology
Contributors: Joao Carreira (1,2), Adrian Ion (4), Fuxin Li (3), Cristian Sminchisescu (1)
Description: We present a joint image segmentation and labeling model which, given a bag of figure-ground segment hypotheses extracted at multiple image locations and scales using CPMC (Carreira and Sminchisescu, PAMI 2012), constructs a joint probability distribution over both the compatible image interpretations (tilings or image segmentations) composed from those segments, and over their labeling into categories. The process of drawing samples from the joint distribution can be interpreted as first sampling tilings, modeled as maximal cliques, from a graph connecting spatially non-overlapping segments in the bag (Ion, Carreira, Sminchisescu, ICCV 2011), followed by sampling labels for those segments, conditioned on the choice of a particular tiling. We learn the segmentation and labeling parameters jointly, based on Maximum Likelihood with a novel Incremental Saddle Point estimation procedure (Ion, Carreira, Sminchisescu, NIPS 2011). As meta-features we combine outputs from linear SVRs using novel second-order O2P features to predict the overlap between segments and ground-truth objects of each class (Carreira, Caseiro, Batista, Sminchisescu, ECCV 2012; see VOC12 entry BONN_CMBR_O2P_CPMC_LIN), bounding box object detectors, and kernel SVR outputs trained to predict the overlap between segments and ground-truth objects of each class (Carreira, Li, Sminchisescu, IJCV 2012). comp6: the O2P SVR learning uses all images in the segmentation+detection trainval sets, and external ground truth annotations provided by courtesy of the Berkeley vision group.
Date: 2012-09-23 21:39:35

Title: dssd style arch
Method: DCONV_SSD_FCN
Affiliation: Shanghai University
Contributors: Li Junhao (jxlijunhao@163.com)
Description: Combine object detection and semantic segmentation in one forward pass.
Date: 2018-03-17 02:58:20

Title: Weakly-supervised model
Method: Extended
Affiliation: SYSU
Contributors: Wenfeng Luo
Description: DCNN trained under image labels.
Date: 2018-08-29 12:54:29

Title: FDNet_16s
Method: FDNet_16s
Affiliation: Hong Kong University of Science and Technology, altizure.com
Contributors: Mingmin Zhen, Jinglu Wang, Siyu Zhu, Runze Zhang, Shiwei Li, Tian Fang, Long Quan
Description: A fully dense neural network with an encoder-decoder structure is proposed, which we abbreviate as FDNet. For each stage in the decoder module, feature maps of all the previous blocks are adaptively aggregated and fed forward as input.
Date: 2018-03-22 08:52:44

Title: Fully Convolutional Network
Method: FLATTENET-001
Affiliation: Sichuan University, China
Contributors: Xin Cai
Description: In contrast to the commonly-used strategies, such as dilated convolution and encoder-decoder structure, we introduce the Flattening Module to produce high-resolution predictions without either removing any subsampling operations or building a complicated decoder module. https://arxiv.org/abs/1909.09961
Date: 2019-12-29 07:29:05

Title: DM2: Detection, Mask transfer, MRF pruning
Method: NUS_DET_SPR_GC_SP
Affiliation: National University of Singapore (NUS), Panasonic Singapore Laboratories (PSL)
Contributors: (NUS) Wei Xia, Csaba Domokos, Jian Dong, Shuicheng Yan, Loong Fah Cheong; (PSL) Zhongyang Huang, Shengmei Shen
Description: We propose a three-step coarse-to-fine framework for general object segmentation. Given a test image, the object bounding boxes are first predicted by object detectors, and the coarse masks within the corresponding bounding boxes are transferred from the training data based on the optimization framework of coupled global and local sparse representations in [1]. Based on the coarse masks as well as the original detection information (bounding boxes and confidence maps), we build a superpixel-based MRF model for each bounding box and perform foreground-background inference. Both the L-a-b color histogram and the detection confidence map are used for the unary terms, while the Pb edge contrast is used as the smoothness term. Finally, the segmentation results are refined by post-processing with multi-scale superpixel segmentation. [1] Wei Xia, Zheng Song, Jiashi Feng, Loong Fah Cheong and Shuicheng Yan. Segmentation over Detection by Coupled Global and Local Sparse Representations, ECCV 2012.
Date: 2012-09-23 20:01:56

Title: O2P+SVRSEGM Regressor + Composite Statistical Inference
Method: O2P_SVRSEGM_CPMC_CSI
Affiliation: (1) Georgia Institute of Technology, (2) University of California - Berkeley, (3) Amazon Inc., (4) Lund University
Contributors: Fuxin Li (1), Joao Carreira (2), Guy Lebanon (3), Cristian Sminchisescu (4)
Description: We utilize a novel probabilistic inference procedure, Composite Statistical Inference (CSI) [1], on semantic segmentation using predictions on overlapping figure-ground hypotheses. Regressor predictions on segment overlaps to the ground truth object are modelled as generated by the true overlap with the ground truth segment plus noise, parametrized on the unknown percentage of each superpixel that belongs to the unknown ground truth. A joint optimization over all the superpixels and all the categories is then performed in order to maximize the likelihood of the SVR predictions. The optimization has a tight convex relaxation, so solutions can be expected to be close to the global optimum. A fast and optimal search algorithm is then applied to retrieve each object. CSI takes from the SVRSEGM inference algorithm the intuition that multiple predictions on similar segments can be combined to better consolidate the segment mask, but fully develops the idea by constructing a probabilistic framework and performing maximum composite likelihood jointly on all segments and categories. Therefore it is able to consolidate object boundaries better and handle hard cases where objects interact closely and heavily occlude each other. For each image, we use 150 overlapping figure-ground hypotheses generated by the CPMC algorithm (Carreira and Sminchisescu, PAMI 2012), SVRSEGM results, and linear SVR predictions on them with the novel second-order O2P features (Carreira, Caseiro, Batista, Sminchisescu, ECCV 2012; see VOC12 entry BONN_CMBR_O2P_CPMC_LIN) as the input to the inference algorithm. [1] Fuxin Li, Joao Carreira, Guy Lebanon, Cristian Sminchisescu. Composite Statistical Inference for Semantic Segmentation. CVPR 2013.
Date: 2012-11-15 22:50:41

Title: CRF with NBNN features and simple smoothing
Method: OptNBNN-CRF
Affiliation: University of Amsterdam (UvA)
Contributors: Carsten van Weelden, Maarten van der Velden, Jan van Gemert
Description: Naive Bayes nearest neighbor (NBNN) [Boiman et al., CVPR 2008] performs well in image classification because it avoids quantization of image features and estimates image-to-class distance. In the context of my MSc thesis, we applied the NBNN method to segmentation by estimating image-to-class distances for superpixels, which we use as unary potentials in a simple conditional random field (CRF). To get the NBNN estimates, we extract dense SIFT features from the training set and store these in a FLANN index [Muja and Lowe, VISSAPP'09] for efficient nearest neighbor search. To deal with the unbalanced class frequencies, we learn a linear correction for each class as in [Behmo et al., ECCV 2010]. We segment each test image into 500 SLIC superpixels [Achanta et al., TPAMI 2012] and take each superpixel as a vertex in the CRF. We use the corrected NBNN estimates as unary potentials and a Potts potential as the pairwise potential, and infer the MAP labeling using alpha-expansion [Boykov et al., TPAMI 2001]. We tune the weighting between the unary and pairwise potentials by exhaustive search.
Date: 2012-09-23 12:48:10

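The superpixel CRF in this entry (NBNN distances as unary costs, Potts smoothing, MAP inference) can be illustrated with a toy sketch. Simple iterated conditional modes (ICM) stands in for the alpha-expansion solver the entry actually uses, and the unary costs and adjacency graph below are made up:

```python
import numpy as np

def icm_potts(unary, edges, lam, iters=10):
    """MAP labeling of a superpixel CRF by iterated conditional modes.

    unary: (n, k) cost of assigning each of n superpixels to each of
           k labels (e.g. corrected NBNN image-to-class distances).
    edges: list of (i, j) pairs of adjacent superpixels.
    lam:   Potts weight; each disagreeing edge costs lam.
    """
    n, k = unary.shape
    labels = unary.argmin(axis=1)          # unary-only initialization
    nbrs = [[] for _ in range(n)]
    for i, j in edges:
        nbrs[i].append(j)
        nbrs[j].append(i)
    for _ in range(iters):
        changed = False
        for i in range(n):
            cost = unary[i].copy()
            for j in nbrs[i]:              # Potts penalty for disagreeing
                cost += lam * (np.arange(k) != labels[j])
            best = cost.argmin()
            if best != labels[i]:
                labels[i], changed = best, True
        if not changed:
            break
    return labels

# Three superpixels in a chain; the middle one has a noisy unary term
# that slightly prefers label 1, but smoothing flips it to label 0.
unary = np.array([[0.0, 2.0], [1.0, 0.9], [0.0, 2.0]])
labels = icm_potts(unary, edges=[(0, 1), (1, 2)], lam=0.5)
```

Unlike alpha-expansion, ICM only reaches a local minimum, but it shows how the pairwise term overrides a weak unary preference.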
Title: TCnet
Method: TCnet
Affiliation: Tsinghua University
Contributors: Liu Yulin
Description: TCnet
Date: 2018-05-02 08:02:45

Title: Iterative method with Entropy-gated ConvCRF
Method: UMICH_EG-ConvCRF_Iter_Res101
Affiliation: University of Michigan Deep Learning Research Group
Contributors: Chuan Cen; supervisor: Prof. Honglak Lee
Description: Train the segmentation network and relation models iteratively. Pseudo-labels for the segmentation network are inferred with the novel Entropy-gated ConvCRF, which is shown to be superior to random walk under the same conditions. Seg net: Deeplabv2; seg net backbone: Res101; relation model backbone: Res101.
Date: 2019-12-13 00:37:25

Title: Iterative method with Entropy-gated ConvCRF
Method: UMICH_EG-ConvCRF_Iter_Res50
Affiliation: University of Michigan Deep Learning Research Group
Contributors: Chuan Cen; supervisor: Prof. Honglak Lee
Description: Train the segmentation network and relation models iteratively. Pseudo-labels for the segmentation network are inferred with the novel Entropy-gated ConvCRF, which is shown to be superior to random walk under the same conditions. Seg net: Deeplabv2; seg net backbone: Res50; relation model backbone: Res50.
Date: 2019-12-10 22:02:49

Title: Transductive semi-sup, co-train, self-train
Method: UMICH_TCS
Affiliation: University of Michigan Deep Learning Research Group
Contributors: Chuan Cen
Description: A method for the weakly supervised semantic segmentation problem with image-level labels only. The problem is viewed as a semi-supervised learning task; graph-based semi-supervised learning, co-training and self-training are then applied together, achieving state-of-the-art performance.
Date: 2019-11-28 19:09:22

Title: Transductive semi-sup, co-train, self-train
Method: UMICH_TCS_101
Affiliation: University of Michigan Deep Learning Research Group
Contributors: Chuan Cen
Description: A method for the weakly supervised semantic segmentation problem with image-level labels only. The problem is viewed as a semi-supervised learning task; graph-based semi-supervised learning, co-training and self-training are then applied together, achieving state-of-the-art performance.
Date: 2019-12-01 02:40:47

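The iterative train/pseudo-label/retrain scheme common to these UMICH entries follows the generic self-training pattern: fit on the labeled set, add confidently pseudo-labeled unlabeled points, and refit. A toy sketch with a nearest-centroid classifier standing in for the segmentation network (the model, confidence threshold, and data are purely illustrative):

```python
import numpy as np

def self_train(X_lab, y_lab, X_unlab, rounds=5, thresh=0.8):
    """Generic self-training loop: fit a nearest-centroid classifier,
    absorb confidently pseudo-labeled unlabeled points, refit."""
    X, y = X_lab.copy(), y_lab.copy()
    for _ in range(rounds):
        classes = np.unique(y)
        centroids = np.stack([X[y == c].mean(axis=0) for c in classes])
        # Distance to each centroid; softmax over negative distance
        # serves as a stand-in confidence score.
        d = np.linalg.norm(X_unlab[:, None, :] - centroids[None], axis=2)
        p = np.exp(-d) / np.exp(-d).sum(axis=1, keepdims=True)
        conf, pseudo = p.max(axis=1), classes[p.argmax(axis=1)]
        keep = conf > thresh
        if not keep.any():
            break
        X = np.vstack([X, X_unlab[keep]])
        y = np.concatenate([y, pseudo[keep]])
        X_unlab = X_unlab[~keep]   # ambiguous points stay unlabeled
    return X, y

X_lab = np.array([[0.0], [10.0]])
y_lab = np.array([0, 1])
X_unlab = np.array([[0.5], [9.5], [5.2]])   # last point is ambiguous
X_new, y_new = self_train(X_lab, y_lab, X_unlab)
```

The two points near the centroids are absorbed with their pseudo-labels, while the ambiguous mid-point never clears the confidence threshold.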
Title: WASP for Effective Semantic Segmentation
Method: WASPNet
Affiliation: Rochester Institute of Technology
Contributors: Bruno Artacho and Andreas Savakis, Rochester Institute of Technology
Description: We propose an efficient architecture for semantic segmentation based on an improvement of Atrous Spatial Pyramid Pooling that achieves a considerable accuracy increase while decreasing the number of parameters and the amount of memory required. Current semantic segmentation methods rely either on deconvolutional stages, which inherently require a large number of parameters, or on cascade methods, which give up the larger fields-of-view obtained by parallelization. The proposed Waterfall architecture leverages the progressive information abstraction of cascade architectures while obtaining multi-scale fields-of-view from spatial pyramid configurations. We demonstrate that the Waterfall approach is a robust and efficient architecture for semantic segmentation using ResNet-type networks, obtaining state-of-the-art results with over 20% reduction in the number of parameters and improved performance.
Date: 2019-07-25 20:28:04

Title: Waterfall Atrous Spatial Pooling Arch. for Sem Seg
Method: WASPNet+CRF
Affiliation: Rochester Institute of Technology
Contributors: Bruno Artacho, Andreas Savakis
Description: We propose a new efficient architecture for semantic segmentation based on a "Waterfall" Atrous Spatial Pooling architecture that achieves a considerable accuracy increase while decreasing the number of network parameters and the memory footprint. The proposed Waterfall architecture leverages the efficiency of progressive filtering in the cascade architecture, while maintaining multi-scale fields-of-view comparable to spatial pyramid configurations. Additionally, our method does not rely on a post-processing stage with Conditional Random Fields, which further reduces complexity and required training time. We demonstrate that the Waterfall approach with a ResNet backbone is a robust and efficient architecture for semantic segmentation, obtaining state-of-the-art results with a significant reduction in the number of parameters on the Pascal VOC and Cityscapes datasets.
Date: 2019-11-19 15:19:18

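The cascaded ("waterfall") use of atrous convolutions described by the WASPNet entries can be illustrated in one dimension: each atrous branch filters the previous branch's output, so fields-of-view compound, unlike a parallel spatial pyramid where every branch sees the same input. This sketch sums branch outputs instead of concatenating feature channels, and the kernel and dilation rates are arbitrary:

```python
import numpy as np

def dilated_conv1d(x, w, rate):
    """'Same'-padded 1-D atrous (dilated) convolution."""
    k = len(w)
    span = rate * (k - 1)          # kernel extent minus one
    xp = np.pad(x, (span // 2, span - span // 2))
    return np.array([sum(w[t] * xp[i + t * rate] for t in range(k))
                     for i in range(len(x))])

def waterfall_block(x, w, rates=(1, 2, 4)):
    """Waterfall-style atrous pooling: each branch filters the previous
    branch's output, compounding the receptive field; branch outputs
    are summed here in place of channel concatenation."""
    branches = []
    h = x
    for r in rates:
        h = dilated_conv1d(h, w, rate=r)
        branches.append(h)
    return np.sum(branches, axis=0)

x = np.zeros(16)
x[8] = 1.0                         # unit impulse probes the field-of-view
y = waterfall_block(x, w=np.array([1.0, 1.0, 1.0]))
```

Probing with an impulse shows the response spreading from 3 samples after the first branch to 15 samples after the third, even though each branch uses the same 3-tap filter.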
Title: FLATTENET
Method: XC-FLATTENET
Affiliation: Sichuan University, China
Contributors: Xin Cai
Description: It is well known that the reduced feature resolution due to repeated subsampling operations poses a serious challenge to Fully Convolutional Network (FCN) based models. In contrast to the commonly-used strategies, such as dilated convolution and encoder-decoder structure, we introduce a novel Flattening Module to produce high-resolution predictions without either removing any subsampling operations or building a complicated decoder module. https://ieeexplore.ieee.org/document/8932465/metrics#metrics
Date: 2020-01-12 02:43:17

Title: bothweight th0.4
Method: bothweight th0.4
Affiliation: Northwestern Polytechnical University
Contributors: Peng Wang, Chunhua Shen
Description: bothweight th0.4 27082
Date: 2019-04-15 10:46:51

Title: comp6_test_cls
Method: comp6_test_cls
Affiliation: comp6_test_cls
Contributors: comp6_test_cls
Description: comp6_test_cls
Date: 2018-05-10 15:54:47

Title: pretrained resnet_101 and ASPP module
Method: deeplabv3_plus_reproduction
Affiliation: Institute of Computing Technology
Contributors: Zhu Lifa
Description: Reproduction of deeplabv3plus with Tensorflow.
Date: 2019-05-11 09:45:53

Title: refinenet_HPM
Method: refinenet_HPM
Affiliation: SJTU
Contributors: gzx
Description: refinenet_HPM
Date: 2019-03-01 09:22:09

Title: weakly_seg_validation_test
Method: weakly_seg_validation_test
Affiliation: NEU
Contributors: Smile Lab
Description: weakly_seg_validation_test
Date: 2019-09-08 01:18:19

Title: weight+RS
Method: weight+RS
Affiliation: Northwestern Polytechnical University
Contributors: Peng Wang, Shunhua Shen
Description: weight+RS
Date: 2019-03-24 08:13:07