Method | aeroplane | bicycle | bird | boat | bottle | bus | car | cat | chair | cow | dining table | dog | horse | motorbike | person | potted plant | sheep | sofa | train | tv/monitor
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---
BPACAD_COMB_LF_AK_WK_NOBOXES | 86.5 | 58.3 | 59.7 | 67.4 | 33.2 | 74.2 | 64.0 | 65.5 | 58.5 | 44.8 | 53.5 | 57.0 | 60.7 | 70.8 | 84.6 | 39.4 | 55.4 | 50.5 | 80.7 | 63.1 |
BPACAD_CS_FISH256_1024_SVM_AVGKER_NOBOXES | 85.0 | 57.0 | 57.7 | 65.9 | 30.7 | 75.0 | 62.4 | 64.4 | 56.9 | 42.2 | 50.9 | 55.3 | 59.1 | 69.1 | 84.2 | 39.3 | 52.3 | 46.7 | 78.9 | 61.8 |
BUPT_ALL | 61.5 | 11.9 | 12.4 | 29.7 | 8.7 | 30.6 | 18.4 | 23.6 | 21.6 | 5.8 | 14.8 | 18.5 | 7.1 | 12.3 | 47.7 | 7.2 | 15.0 | 9.8 | 18.8 | 19.2 |
BUPT_NOPATCH | 65.1 | 23.8 | 17.3 | 36.0 | 12.6 | 40.5 | 31.1 | 35.4 | 27.2 | 10.4 | 20.8 | 31.3 | 13.6 | 29.5 | 54.9 | 10.7 | 19.1 | 19.2 | 42.1 | 30.8 |
JDL_K17_AVG_CLS | 84.2 | 52.0 | 54.5 | 63.2 | 25.3 | 71.2 | 58.0 | 61.1 | 50.2 | 33.3 | 44.3 | 49.7 | 57.9 | 65.1 | 79.9 | 20.9 | 47.4 | 43.0 | 77.7 | 56.7 |
LIRIS_CLS | 88.3 | 56.2 | 59.3 | 68.6 | 33.2 | 76.6 | 62.2 | 64.5 | 55.3 | 42.6 | 55.1 | 56.2 | 61.9 | 70.0 | 82.5 | 37.3 | 56.4 | 48.3 | 79.6 | 64.7 |
LIRIS_CLSDET | 90.0 | 66.2 | 63.3 | 70.9 | 47.0 | 80.9 | 73.9 | 63.9 | 61.1 | 52.7 | 57.9 | 56.9 | 69.6 | 73.8 | 88.4 | 46.3 | 65.3 | 54.2 | 81.3 | 72.7 |
MSRAUSTC_HIGH_ORDER_SVM | 92.8 | 74.8 | 69.6 | 76.1 | 47.3 | 83.5 | 76.4 | 76.9 | 59.8 | 54.5 | 63.5 | 67.0 | 75.1 | 78.8 | 90.4 | 43.1 | 63.1 | 60.4 | 85.6 | 71.1 |
MSRAUSTC_PATCH | 92.7 | 74.5 | 69.4 | 75.4 | 45.7 | 83.4 | 76.5 | 76.6 | 59.6 | 54.5 | 63.4 | 67.4 | 74.8 | 78.6 | 90.3 | 43.0 | 63.1 | 58.6 | 85.2 | 71.3 |
NANJING_DMC_HIK_SVM_SIFT | 55.6 | 25.5 | 31.0 | 36.5 | 15.8 | 41.4 | 40.0 | 40.6 | 30.0 | 17.8 | 21.1 | 34.0 | 27.0 | 31.0 | 57.9 | 11.9 | 20.7 | 22.6 | 48.4 | 35.7 |
NLPR_KF_SVM | 10.5 | 9.1 | 10.7 | 6.0 | 6.5 | 7.2 | 13.3 | 12.2 | 11.5 | 9.5 | 5.6 | 16.7 | 8.6 | 6.6 | 38.9 | 5.3 | 15.0 | 5.0 | 8.3 | 5.4 |
NLPR_SS_VW_PLS | 94.5 | 82.6 | 79.4 | 80.7 | 57.8 | 87.8 | 85.5 | 83.9 | 66.6 | 74.2 | 69.4 | 75.2 | 83.0 | 88.1 | 93.5 | 56.2 | 75.5 | 64.1 | 90.0 | 76.6 |
NLPR_SVM_BOWDET | 82.9 | 69.4 | 45.4 | 60.1 | 46.0 | 80.0 | 75.1 | 59.9 | 54.9 | 50.7 | 43.3 | 49.9 | 63.4 | 72.2 | 88.1 | 36.1 | 57.1 | 37.7 | 75.2 | 58.5 |
NLPR_SVM_BOWDET_CONV | 83.8 | 69.8 | 47.8 | 60.5 | 45.4 | 80.5 | 74.6 | 60.4 | 54.0 | 51.3 | 45.3 | 51.5 | 64.5 | 72.6 | 87.7 | 35.9 | 57.7 | 39.8 | 75.8 | 62.7 |
NUSPSL_CTX_GPM | 95.5 | 81.1 | 79.4 | 82.5 | 58.2 | 87.7 | 84.1 | 83.1 | 68.5 | 72.8 | 68.5 | 76.4 | 83.3 | 87.5 | 92.8 | 56.5 | 77.7 | 67.0 | 91.2 | 77.5 |
NUSPSL_CTX_GPM_SVM | 94.3 | 78.5 | 76.4 | 80.0 | 57.0 | 86.3 | 82.1 | 81.5 | 65.6 | 74.7 | 66.5 | 73.4 | 81.9 | 85.3 | 91.9 | 53.2 | 73.9 | 65.1 | 89.5 | 76.0 |
SJT_SIFT_LLC_PCAPOOL_DET_SVM | 85.6 | 66.5 | 51.9 | 60.3 | 45.4 | 76.8 | 70.3 | 65.1 | 56.4 | 34.3 | 49.6 | 52.4 | 63.1 | 71.5 | 86.8 | 26.1 | 56.9 | 47.9 | 75.5 | 65.6 |
SJT_SIFT_LLC_PCAPOOL_SVM | 83.2 | 52.5 | 49.3 | 59.6 | 26.0 | 73.5 | 58.2 | 64.4 | 52.1 | 36.6 | 44.9 | 52.1 | 57.8 | 63.8 | 78.1 | 19.1 | 52.8 | 44.1 | 72.0 | 57.4 |
UVA_MOSTTELLING | 90.1 | 74.1 | 66.5 | 76.0 | 57.0 | 85.6 | 81.2 | 74.5 | 63.5 | 62.7 | 64.5 | 66.6 | 76.5 | 81.2 | 90.8 | 58.7 | 69.3 | 66.3 | 84.7 | 77.2 |
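The per-class average-precision numbers in a row are commonly summarised by a mean AP when ranking methods. A minimal sketch of that summary (the values are copied from the NUSPSL_CTX_GPM row above; the class list and helper name are ours, not part of the challenge toolkit):

```python
# Illustrative only: mean AP across the 20 VOC classes for one table row.
VOC_CLASSES = [
    "aeroplane", "bicycle", "bird", "boat", "bottle", "bus", "car", "cat",
    "chair", "cow", "diningtable", "dog", "horse", "motorbike", "person",
    "pottedplant", "sheep", "sofa", "train", "tvmonitor",
]

def mean_ap(ap_per_class):
    """Mean of the per-class average-precision scores (in percent)."""
    return sum(ap_per_class) / len(ap_per_class)

# Per-class AP copied from the NUSPSL_CTX_GPM row, in table order.
nuspsl_ctx_gpm = [95.5, 81.1, 79.4, 82.5, 58.2, 87.7, 84.1, 83.1, 68.5,
                  72.8, 68.5, 76.4, 83.3, 87.5, 92.8, 56.5, 77.7, 67.0,
                  91.2, 77.5]
print(f"mean AP: {mean_ap(nuspsl_ctx_gpm):.1f}")
```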
Method | aeroplane | bicycle | bird | boat | bottle | bus | car | cat | chair | cow | dining table | dog | horse | motorbike | person | potted plant | sheep | sofa | train | tv/monitor
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---
LIRIS_CLSTEXT | 88.3 | 66.1 | 60.8 | 68.5 | 46.7 | 77.3 | 69.2 | 63.7 | 55.9 | 52.6 | 56.6 | 55.5 | 69.6 | 73.7 | 87.1 | 46.3 | 65.2 | 54.0 | 81.2 | 72.7 |
Method | aeroplane | bicycle | bird | boat | bottle | bus | car | cat | chair | cow | dining table | dog | horse | motorbike | person | potted plant | sheep | sofa | train | tv/monitor
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---
BROOKES_STRUCT_DET_CRF | 37.1 | 42.6 | 2.0 | 0.0 | 16.0 | 43.8 | 38.6 | 17.0 | 10.3 | 7.7 | 2.4 | 1.5 | 34.3 | 41.1 | 38.4 | 1.5 | 14.7 | 5.3 | 35.4 | 27.1 |
CMIC_GS_DPM | - | - | - | 13.3 | 26.4 | - | 41.5 | - | - | - | 12.2 | - | - | 41.6 | - | 8.3 | 31.4 | - | - | - |
CMIC_SYNTHDPM | 40.4 | 47.8 | - | 11.4 | 23.7 | 48.9 | 40.9 | 23.5 | 11.9 | 25.5 | - | 10.9 | 42.0 | 38.6 | 40.7 | 7.5 | 30.4 | - | 38.4 | 34.8 |
CORNELL_ISVM_VIEWPOINT | 42.5 | 43.7 | 5.4 | 4.8 | 18.1 | 28.6 | 36.6 | 24.2 | 12.6 | 20.5 | 4.4 | 17.5 | 15.2 | 38.2 | 7.9 | 1.7 | 23.2 | 7.1 | 41.0 | 25.7 |
MISSOURI_LCC_TREE_CODING | 41.1 | 51.7 | 13.7 | 11.9 | 27.3 | 52.1 | 41.7 | 32.9 | 17.6 | 27.3 | 18.5 | 23.1 | 45.2 | 48.6 | 41.9 | 11.6 | 32.4 | 27.5 | 44.2 | 38.3 |
MISSOURI_TREE_MAX_POOLING | 43.8 | 51.7 | 13.7 | 12.7 | 27.3 | 51.5 | 43.7 | 32.9 | 18.3 | 27.3 | 18.5 | 23.1 | 45.2 | 48.6 | 42.9 | 11.6 | 32.4 | 27.5 | 47.0 | 39.3 |
NLPR_DD_DC | 55.0 | 58.1 | 22.5 | 18.8 | 33.9 | 57.6 | 54.5 | 42.6 | 20.2 | 40.3 | 29.3 | 37.1 | 54.6 | 58.3 | 51.6 | 14.7 | 44.8 | 32.1 | 51.7 | 41.0 |
NUS_CONTEXT_SVM | 51.4 | 52.9 | 20.1 | 15.7 | 26.9 | 53.0 | 45.6 | 37.6 | 15.2 | 36.0 | 25.1 | 32.6 | 50.4 | 55.8 | 36.8 | 12.3 | 37.6 | 30.5 | 48.1 | 41.0 |
NYUUCLA_HIERARCHY | 56.3 | 55.9 | 23.4 | 20.3 | 27.2 | 56.6 | 48.1 | 53.8 | 23.2 | 32.9 | 33.3 | 39.2 | 53.0 | 56.9 | 43.6 | 14.3 | 37.9 | 39.4 | 52.6 | 43.7 |
OXFORD_DPM_MK | 56.0 | 53.3 | 19.2 | 17.2 | 25.8 | 53.1 | 45.4 | 44.5 | 20.1 | 32.1 | 28.1 | 37.2 | 52.3 | 56.6 | 43.3 | 12.1 | 34.3 | 37.6 | 51.8 | 45.2 |
UOCTTI_LSVM_MDPM | 53.2 | 53.9 | 13.1 | 13.5 | 30.5 | 55.5 | 51.2 | 31.7 | 14.5 | 29.0 | 16.0 | 22.1 | 43.1 | 50.3 | 46.3 | 8.8 | 33.0 | 22.9 | 45.8 | 38.2 |
UOCTTI_WL-SSVM_GRAMMAR | - | - | - | - | - | - | - | - | - | - | - | - | - | - | 49.2 | - | - | - | - | - |
UVA_SELSEARCH | 56.9 | 43.4 | 16.6 | 15.8 | 18.0 | 52.3 | 38.3 | 48.9 | 12.2 | 29.7 | 32.8 | 36.7 | 45.7 | 54.4 | 30.4 | 16.2 | 37.2 | 34.7 | 45.9 | 44.2 |
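Most of the detectors in the table above score a large pool of candidate windows and then apply non-maximum suppression (mentioned explicitly by several entries) to remove duplicate detections of the same object. A minimal sketch of greedy NMS; the box format and the 0.5 overlap threshold are conventional illustrative choices, not taken from any particular entry:

```python
# Greedy non-maximum suppression over scored boxes (x1, y1, x2, y2, score).
def iou(a, b):
    """Intersection-over-union of two boxes given as (x1, y1, x2, y2)."""
    x1, y1 = max(a[0], b[0]), max(a[1], b[1])
    x2, y2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    area = lambda r: (r[2] - r[0]) * (r[3] - r[1])
    return inter / (area(a) + area(b) - inter)

def nms(detections, thresh=0.5):
    """Keep the highest-scoring box, drop boxes overlapping it, repeat."""
    kept = []
    for box in sorted(detections, key=lambda d: d[4], reverse=True):
        if all(iou(box[:4], k[:4]) <= thresh for k in kept):
            kept.append(box)
    return kept

# Toy detections: the second box heavily overlaps the first and is suppressed.
dets = [(10, 10, 50, 50, 0.9), (12, 12, 52, 52, 0.8), (100, 100, 140, 140, 0.7)]
print(nms(dets))
```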
Method | [mean] | background | aeroplane | bicycle | bird | boat | bottle | bus | car | cat | chair | cow | dining table | dog | horse | motorbike | person | potted plant | sheep | sofa | train | tv/monitor
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---
BONN_FGT_SEGM | 41.4 | 83.4 | 51.7 | 23.7 | 46.0 | 33.9 | 49.4 | 66.2 | 56.2 | 41.7 | 10.4 | 41.9 | 29.6 | 24.4 | 49.1 | 50.5 | 39.6 | 19.9 | 44.9 | 26.1 | 40.0 | 41.6 |
BONN_SVR_SEGM | 43.3 | 84.9 | 54.3 | 23.9 | 39.5 | 35.3 | 42.6 | 65.4 | 53.5 | 46.1 | 15.0 | 47.4 | 30.1 | 33.9 | 48.8 | 54.4 | 46.4 | 28.8 | 51.3 | 26.2 | 44.9 | 37.2 |
BROOKES_STRUCT_DET_CRF | 31.3 | 79.4 | 36.6 | 18.6 | 9.2 | 11.0 | 29.8 | 59.0 | 50.3 | 25.5 | 11.8 | 29.0 | 24.8 | 16.0 | 29.1 | 47.9 | 41.9 | 16.1 | 34.0 | 11.6 | 43.3 | 31.7 |
NUS_CONTEXT_SVM | 35.1 | 77.2 | 40.5 | 19.0 | 28.4 | 27.8 | 40.7 | 56.4 | 45.0 | 33.1 | 7.2 | 37.4 | 17.4 | 26.8 | 33.7 | 46.6 | 40.6 | 23.3 | 33.4 | 23.9 | 41.2 | 38.6 |
NUS_SEG_DET_MASK_CLS_CRF | 37.7 | 79.8 | 41.5 | 20.2 | 30.4 | 29.1 | 47.4 | 61.2 | 47.7 | 35.0 | 8.5 | 38.3 | 14.5 | 28.6 | 36.5 | 47.8 | 42.5 | 28.5 | 37.8 | 26.4 | 43.5 | 45.8 |
(CORNELL_ISVM_VIEWPOINT) | 11.8 | 1.4 | 7.4 | 10.5 | 5.5 | 1.6 | 22.9 | 25.7 | 27.9 | 10.9 | 4.7 | 16.4 | 5.2 | 5.6 | 10.3 | 21.4 | 11.1 | 4.8 | 6.7 | 3.0 | 21.3 | 24.2 |
(MISSOURI_LCC_TREE_CODING) | 13.1 | 0.5 | 9.2 | 9.4 | 8.1 | 2.2 | 25.7 | 32.6 | 18.6 | 13.2 | 4.1 | 9.5 | 13.8 | 9.5 | 13.5 | 17.4 | 26.7 | 10.0 | 9.5 | 14.5 | 15.9 | 11.2 |
(MISSOURI_TREE_MAX_POOLING) | 13.1 | 0.6 | 10.0 | 7.8 | 7.4 | 2.3 | 27.1 | 30.2 | 38.8 | 12.3 | 3.9 | 8.3 | 10.7 | 7.8 | 11.4 | 14.4 | 26.9 | 6.3 | 8.6 | 10.3 | 16.9 | 13.2 |
(NLPR_DD_DC) | 19.4 | 0.8 | 21.6 | 2.9 | 10.1 | 7.9 | 38.0 | 27.2 | 26.0 | 7.4 | 7.3 | 30.4 | 17.8 | 26.3 | 24.9 | 41.6 | 29.2 | 2.4 | 27.8 | 20.7 | 31.0 | 6.9 |
(NYUUCLA_HIERARCHY) | 15.3 | 1.2 | 11.9 | 7.6 | 12.9 | 6.7 | 12.4 | 24.3 | 28.4 | 26.2 | 2.9 | 21.3 | 9.3 | 19.8 | 18.6 | 27.7 | 27.6 | 6.3 | 23.1 | 5.9 | 18.1 | 9.1 |
(OXFORD_DPM_MK) | 15.2 | 0.4 | 16.3 | 7.4 | 8.7 | 4.7 | 27.0 | 29.8 | 18.9 | 23.0 | 3.2 | 15.3 | 11.6 | 13.9 | 19.6 | 19.1 | 23.3 | 4.3 | 22.7 | 7.7 | 19.5 | 22.5 |
(UOCTTI_LSVM_MDPM) | 13.1 | 4.0 | 9.2 | 7.8 | 9.2 | 6.2 | 20.4 | 38.4 | 24.9 | 11.2 | 3.3 | 12.8 | 5.9 | 10.4 | 15.4 | 19.5 | 20.4 | 5.7 | 13.4 | 5.0 | 15.9 | 16.3 |
(UVA_SELSEARCH) | 16.2 | 2.9 | 13.9 | 8.2 | 5.4 | 7.2 | 18.8 | 52.0 | 29.2 | 21.9 | 3.9 | 17.5 | 10.7 | 13.7 | 12.2 | 27.7 | 14.7 | 7.8 | 21.3 | 12.9 | 17.2 | 20.5 |
- Entries in parentheses are synthesized from detection results.
Method | [mean] | background | aeroplane | bicycle | bird | boat | bottle | bus | car | cat | chair | cow | dining table | dog | horse | motorbike | person | potted plant | sheep | sofa | train | tv/monitor
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---
BERKELEY_REGION_CLASSIFY | 39.1 | 83.3 | 48.9 | 20.0 | 32.8 | 28.2 | 41.1 | 53.9 | 48.3 | 48.0 | 6.0 | 34.9 | 27.5 | 35.0 | 47.2 | 47.3 | 48.4 | 20.6 | 52.7 | 25.0 | 36.6 | 35.4 |
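The segmentation accuracies above are per-class overlaps between the predicted and ground-truth label maps (intersection over union, with background as its own class). A minimal sketch of that measure on tiny made-up label arrays; the function name and toy data are ours:

```python
# Illustrative per-class intersection-over-union on small label maps.
import numpy as np

def per_class_iou(pred, gt, n_classes):
    """IoU per class: shared pixels / union of predicted and true pixels."""
    ious = []
    for c in range(n_classes):
        inter = np.sum((pred == c) & (gt == c))
        union = np.sum((pred == c) | (gt == c))
        ious.append(inter / union if union else float("nan"))
    return ious

# Toy 3x3 label maps with classes 0 (background), 1 and 2.
gt   = np.array([[0, 0, 1], [0, 1, 1], [2, 2, 2]])
pred = np.array([[0, 0, 1], [0, 0, 1], [2, 2, 2]])
print(per_class_iou(pred, gt, n_classes=3))
```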
Method | Head | Hand | Foot
---|---|---|---
OXFORD_RANK_SLACK_RBF | 72.9 | 26.9 | 4.1 |
Method | jumping | phoning | playing instrument | reading | riding bike | riding horse | running | taking photo | using computer | walking
---|---|---|---|---|---|---|---|---|---|---
CAENLEAR_DSAL | 62.1 | 39.7 | 60.5 | 33.6 | 80.8 | 83.6 | 80.3 | 23.2 | 53.4 | 50.2 |
CAENLEAR_HOBJ_DSAL | 71.6 | 50.7 | 77.5 | 37.8 | 86.5 | 89.5 | 83.8 | 25.1 | 58.9 | 59.2 |
MISSOURI_SSLMF | 58.8 | 36.8 | 48.5 | 30.6 | 81.5 | 83.0 | 78.5 | 21.3 | 50.7 | 53.8 |
NUDT_CONTEXT | 65.9 | 41.5 | 57.4 | 34.7 | 88.8 | 90.2 | 87.9 | 25.7 | 54.5 | 59.5 |
NUDT_LL_SEMANTIC | 66.3 | 41.3 | 53.9 | 35.2 | 88.8 | 90.0 | 87.6 | 25.5 | 53.7 | 58.2 |
STANFORD_RF_DENSEFTR_SVM | 66.0 | 41.0 | 60.0 | 41.5 | 90.0 | 92.1 | 86.6 | 28.8 | 62.0 | 65.9 |
WVU_SVM-PHOW | 42.5 | 29.5 | 32.1 | 26.7 | 48.5 | 46.3 | 59.2 | 13.5 | 24.3 | 35.6 |
Method | jumping | phoning | playing instrument | reading | riding bike | riding horse | running | taking photo | using computer | walking
---|---|---|---|---|---|---|---|---|---|---
BERKELEY_ACTION_POSELETS | 59.5 | 31.3 | 45.6 | 27.8 | 84.4 | 88.3 | 77.6 | 31.0 | 47.4 | 57.6 |
STANFORD_COMBINE_ATTR_PART | 66.7 | 41.1 | 60.8 | 42.2 | 90.5 | 92.2 | 86.2 | 28.8 | 63.5 | 64.2 |
STANFORD_MAPSVM_POSELET | 27.0 | 29.3 | 28.3 | 23.8 | 71.9 | 82.4 | 67.3 | 20.1 | 26.0 | 46.4 |
Abbreviation | Title | Method | Affiliation | Contributors | Description |
---|---|---|---|---|---|
BERKELEY_ACTION_POSELETS | Poselets trained on action categories | BERKELEY_ACTION_POSELETS | University of California, Berkeley | Subhransu Maji, Lubomir Bourdev, Jitendra Malik | This is based on our CVPR 2011 paper: "Action recognition using a distributed representation of pose and appearance", Subhransu Maji, Lubomir Bourdev and Jitendra Malik. For this submission we train 200 poselets for each action category. In addition we train poselets based on subcategory labels for playinginstrument and ridingbike. Linear SVMs are trained on the "poselet activation vector" along with features from object detectors for four categories: motorbike, bicycle, horse and tvmonitor. Context models re-rank the objects at the image level, as described in the CVPR'11 paper. |
BERKELEY_REGION_CLASSIFY | Classification of low-level regions | Berkeley_Region_Classify | UC Berkeley | Pablo Arbelaez, Bharath Hariharan, Saurabh Gupta, Chunhui Gu, Lubomir Bourdev and Jitendra Malik | We propose a semantic segmentation approach that represents and classifies generic regions from low-level segmentation. We extract object candidates using ultrametric contour maps (Arbelaez et al., TPAMI 2011) at several image resolutions. We represent each region using mid- and high-level features that capture its appearance (color, shape, texture) and also its compatibility with the activations of a part detector (we use the poselets from Bourdev et al., ECCV 2010). A category label is assigned to each region using a hierarchy of IKSVM classifiers (Maji et al., CVPR 2008). |
BONN_FGT_SEGM | BONN_FGT_SEGM | BONN_FGT_SEGM | ¹University of Bonn, ²Vienna University of Technology, ³Georgia Institute of Technology | Joao Carreira¹, Adrian Ion², Fuxin Li³, Cristian Sminchisescu¹ | We present a joint image segmentation and labeling model which, given a bag of figure-ground segment hypotheses extracted at multiple image locations and scales using CPMC (Carreira and Sminchisescu, CVPR 2010), constructs a joint probability distribution over both the compatible image interpretations (tilings or image segmentations) composed from those segments, and over their labeling into categories. The process of drawing samples from the joint distribution can be interpreted as first sampling tilings, modeled as maximal cliques, from a graph connecting spatially non-overlapping segments in the bag (Ion, Carreira, Sminchisescu, ICCV 2011), followed by sampling labels for those segments, conditioned on the choice of a particular tiling. We learn the segmentation and labeling parameters jointly, based on Maximum Likelihood with a novel Incremental Saddle Point estimation procedure (Ion, Carreira, Sminchisescu, NIPS 2011). |
BONN_SVR_SEGM | SVR on CPMC-generated Figure-ground segmentations | BONN_SVRSEGM | University of Bonn | Joao Carreira, Fuxin Li, Cristian Sminchisescu | We present a recognition system based on sequential figure-ground ranking. We extract a bag of figure-ground segments using CPMC (Carreira and Sminchisescu, CVPR 2010). The bag is then filtered down to 100 segments using a class-independent ranker. Using these features we learn one nonlinear Support Vector Regressor (SVR) for each category that predicts the overlap between each segment and an object from that category. A complete image interpretation is obtained by sequentially selecting segments using combination and non-maxima suppression schemes. Details can be found in (F. Li, J. Carreira, C. Sminchisescu, CVPR 2010; IJCV 2011). Additionally, the system is trained with both object segmentation layouts and weak annotations from bounding boxes. |
BPACAD_COMB_LF_AK_WK_NOBOXES | Combination of the late fusion, avgker and weker | BPACAD_COMB_LF_AK_WK_NOBOXES | Data Mining and Web Search Research Group (DMWS), MTA SZTAKI, Hungary | Bálint Daróczy, László Nikházy | This is the average of the confidence outputs of a late fusion, an aggregated kernel and an averaged kernel (BPACAD_CS_FISH256-1024_SVM_AVGKER) method. We computed RGB color moments and SIFT descriptors (Lowe 1999) on a dense grid and on regions detected by Harris-Laplacian (Mikolajczyk et al., 2005). All three methods are based on non-hierarchical Gaussian Mixture Models (GMM) with 256 Gaussians (two of them also using GMMs with 1024 Gaussians) and non-sparse Fisher vectors (Perronnin et al., 2007). We used our open-source GMM/Fisher toolkit for GPGPUs to compute the Fisher vectors and train the GMMs (http://datamining.sztaki.hu/?q=en/GPU-GMM). All of them use Fisher-vector-based pre-computed kernels (basic kernels) for learning linear SVM classifiers (Daróczy et al., ImageCLEF 2011). The late fusion method is based on a combination of SVM predictions (18 SVM classifiers per class), while the aggregated and averaged kernels are computed before classification (only one SVM classifier per class). While the averaged kernel method needs no parameter tuning, for the late fusion and aggregated kernel methods we learned optimal weights per class on the validation set. |
BPACAD_CS_FISH256_1024_SVM_AVGKER_NOBOXES | SVM on averaged Fisher Kernels | BPACAD_CS_FISH256-1024_SVM_AVGKER_NOBOXES | Data Mining and Web Search Research Group (DMWS), MTA SZTAKI, Hungary | Bálint Daróczy, László Nikházy | We computed RGB color moments and SIFT descriptors (Lowe 1999) on a dense grid and on regions detected by Harris-Laplacian (Mikolajczyk et al., 2005). We trained non-hierarchical Gaussian Mixture Models (with 256 and 1024 Gaussians) on a subset (1 million) of the low-level features extracted from the training images. We extracted non-sparse Fisher vectors on nine different poolings with the 256-Gaussian GMM (dense grid, Harris-Laplacian, 3x1 and 2x2 spatial pyramids (Lazebnik et al., 2006)) and four with the 1024-Gaussian GMM (dense, 3x1). We used our open-source GMM/Fisher toolkit for GPGPUs to compute the Fisher vectors and train the GMMs (http://datamining.sztaki.hu/?q=en/GPU-GMM). We calculated pre-computed kernels (Daróczy et al., ImageCLEF 2011) and averaged them. We trained only one binary SVM classifier per class. |
BROOKES_STRUCT_DET_CRF | Structured Detection and Segmentation CRF | Struct_Det_CRF | Oxford Brookes University | Jonathan Warrell, Vibhav Vineet, Paul Sturgess, Philip Torr | We form a hierarchical CRF which jointly models a pool of candidate detections and the multiclass pixel segmentation of an image. Attractive and repulsive pairwise terms are allowed between detection nodes (cf. Desai et al., ICCV 2009), which are integrated into a Pn-Potts based hierarchical segmentation energy (cf. Ladicky et al., ECCV 2010). A cutting-plane algorithm is used to train the model, using approximate MAP inference. We form a joint loss which combines segmentation and detection components (i.e. paying a penalty both for each pixel incorrectly labelled, and for each false detection node which is active in a solution), and use different weightings of this loss to train the model to perform detection and segmentation. The segmentation results thus make use of the bounding box annotations. The candidate detections are generated using the Felzenszwalb et al. CVPR 2008/2010 detector, and as features for segmentation we use textons, SIFT, LBPs and the detection response surfaces themselves. |
BUPT_ALL | BUPT_MCPR_all | combining methods | Beijing University of Posts and Telecommunications-MCPRL | Zhicheng Zhao, Tao Liu, Xin Guo, Anni Cai | A region-based method is used, in which all features mentioned above are extracted on regions rather than keypoints. A region is a group of pixels with similar appearance, and the mean-shift method is employed to obtain the regions. Finally, we combine the results of the two methods with a linear fusion algorithm. |
BUPT_NOPATCH | BUPT_MCPR_nopatch | nopatch method | Beijing University of Posts and Telecommunications-MCPRL | Zhicheng Zhao, Tao Liu, Xin Guo, Anni Cai | A bag-of-words method with SIFT, SURF and HOG features; both dense sampling and keypoint detection are used to obtain keypoints. |
CAENLEAR_DSAL | Discriminative spatial saliency | DSAL | Univ Caen/ INRIA LEAR | Gaurav Sharma, Frederic Jurie, Cordelia Schmid | We propose to learn discriminative saliency maps for images which highlight the regions that are most discriminative for the current classification task. We use the saliency maps to weight the visual words, improving the discriminative capacity of bag-of-words features. The approach is motivated by the observation that for many human actions and attributes, local regions are highly discriminative, e.g. for running the bent arms and legs. In addition, we combine features based on SIFT, HOG, color and texture. |
CAENLEAR_HOBJ_DSAL | Human obj interaction and discriminative saliency | HOBJ+DSAL | Univ Caen/ INRIA LEAR | Gaurav Sharma, Alessandro Prest, Frederic Jurie, Vittorio Ferrari, Cordelia Schmid | We use the weakly supervised approach (Prest et al., PAMI 2010) for learning human actions modeled as interactions between humans and objects. The human bounding box is taken as reference, and the object relevant to the action and its spatial relation with the human are automatically learnt. The method is combined with a method to learn discriminative spatial saliency, which highlights the regions that are most discriminative for the current classification task. We use the saliency maps to weight the visual words, improving the discriminative capacity of bag-of-words features. In addition, we combine features based on SIFT, HOG, color and texture. |
CMIC_GS_DPM | Synthetic Training for deformable parts model | CMIC-GS-DPM | Cairo Microsoft Innovation Center | Dr. Motaz El-Saban, Osama Khalil, Mostafa Izz, Mohamed Fathi | We introduce dataset augmentation using synthetic examples as a method for introducing novel variations not present in the original set. We make use of the deformable parts-based model (Felzenszwalb et al., 2010). We augment the training set with examples obtained by applying global scaling to the dataset examples. Global scaling includes no, up- and down-scaling, with varying performance across different object classes. Technique selection is based upon performance on the validation set. The augmented dataset is then used to train parts-based detectors using HOG features (Dalal & Triggs 2006) and latent SVM. The resulting class models are applied to test images in a "sliding-window" fashion. |
CMIC_SYNTHDPM | Synthetic Training for deformable parts model | CMIC-Synthetic-DPM | Cairo Microsoft Innovation Center | Dr. Motaz El-Saban, Osama Khalil, Mostafa Izz, Mohamed Fathi | We introduce dataset augmentation using synthetic examples as a method for introducing novel variations not present in the original set. We make use of the deformable parts-based model (Felzenszwalb et al., 2010). We augment the training set with examples obtained by relocating objects (having segmentation masks) to new backgrounds. New backgrounds used for relocation are selected using a set of techniques (no relocation, same image, "different" image, or image with co-occurring objects). Performance of these techniques varies across classes according to the object class properties. For every class, we select the technique that achieves the highest AP on the validation set. The augmented dataset is then used to train parts-based detectors using HOG features (Dalal & Triggs 2006) and latent SVM. The resulting class models are applied to test images in a "sliding-window" fashion. |
CORNELL_ISVM_VIEWPOINT | Using viewpoint cues to improve object recognition | lSVM-Viewpoint | Cornell | Joshua Schwartz, Noah Snavely, Daniel Huttenlocher | Our system is based on the Latent SVM framework of [1], including their context rescoring method. We train 6-component models with 8 parts. However, unlike [1], components are trained using a clustering based on an unsupervised estimation of 3D object viewpoint. In this sense, our approach is similar to the unsupervised approach in [2], which also seeks to estimate viewpoint, but our clustering is based on explicit reasoning about 3D geometry. Additionally, we add features based on estimated 3D scene geometry for context rescoring. Of note is the fact that a detection with our method gives rise to an explicit estimation of object viewpoint within a scene, rather than just a bounding box. [1] P. Felzenszwalb, R. Girshick, D. McAllester, and D. Ramanan. Object Detection with Discriminatively Trained Part Based Models. PAMI 2010. [2] C. Gu and X. Ren. Discriminative Mixture-of-Templates for Viewpoint Classification. ECCV 2010. |
JDL_K17_AVG_CLS | SVM with average kernel using 17 kernels | JDL_K17_AVG_CLS | JDL, Institute of Computing Technology, Chinese Academy of Sciences | Shuhui Wang, Li Shen, Shuqiang Jiang, Qi Tian, Qingming Huang | We calculate six types of commonly used BOW features (including dense and sparse SIFT, dense color SIFT, HOG, LBP and self-similarity) and 3 global features (color moment, GIST and block GIST), where the visual vocabulary size is typically around 1000. We calculate 3-level spatial pyramid features on those BOW representations respectively. Then 17 base kernels are calculated by using histogram intersection, RBF and chi-square kernels on these features, with kernel parameters tuned using the validation data. We calculate an average kernel from these 17 base kernels. One-against-all SVM classifiers are used to train the final classifiers for each category. |
LIRIS_CLS | MKL classifier with multiple features | LIRIS_CLS | LIRIS, Ecole Centrale de Lyon, CNRS, UMR5205, France | Chao ZHU, Yuxing TANG, Ningning LIU, Charles-Edmond BICHOT, Emmanuel Dellandrea, Liming CHEN | In this submission, we mainly make use of local descriptors and the popular bag-of-visual-words approach for classification. Regions of interest in each image are detected using both Harris-Laplace detector and dense sampling strategy. SIFT and color SIFT descriptors are then computed for each region as baseline. In addition, we also extract DAISY and extended LBP descriptors based on our work [1][2] for computational efficiency and complementary information to SIFT. For each kind of descriptor, 1,000,000 randomly selected descriptors from the train + val set are quantized using k-means algorithm into 4000 visual words. Each image is then represented by the histogram using hard assignment. Spatial pyramid technique is applied for coarse spatial information. The chi-square kernels of different levels in the pyramid are computed, and then fused by linear combination. The final outputs are obtained by using multiple kernel learning algorithm to fuse different descriptors. [1] C. Zhu, C.E. Bichot, L. Chen: 'Visual object recognition using DAISY descriptor', in Proc. of IEEE International Conference on Multimedia and Expo (ICME), to appear, 2011. [2] C. Zhu, C.E. Bichot, L. Chen: 'Multi-scale color local binary patterns for visual object classes recognition', in Proc. of 20th International Conference on Pattern Recognition (ICPR), pp.3065-3068, 2010. |
LIRIS_CLSDET | Classification combined with detection | LIRIS_CLSDET | LIRIS, Ecole Centrale de Lyon, CNRS, UMR5205, France | Chao ZHU, Yuxing TANG, Ningning LIU, Charles-Edmond BICHOT, Emmanuel Dellandrea, Liming CHEN | In this submission, we improve the classification performances by combining it with object detection results. For classification, we mainly make use of local descriptors and the popular bag-of-visual-words approach. Regions of interest in each image are detected using both Harris-Laplace detector and dense sampling strategy. SIFT and color SIFT descriptors are then computed for each region as baseline. In addition, we also extract DAISY and extended LBP descriptors based on our work [1][2] for computational efficiency and complementary information to SIFT. For each kind of descriptor, 1,000,000 randomly selected descriptors from the train + val set are quantized using k-means algorithm into 4000 visual words. Each image is then represented by the histogram using hard assignment. Spatial pyramid technique is applied for coarse spatial information. The chi-square kernels of different levels in the pyramid are computed, and then fused by linear combination. The final outputs are obtained by using multiple kernel learning algorithm to fuse different descriptors. For object detection, we apply the HOG feature to train deformable part models, and use the models together with sliding window approach to detect objects. Finally, we combine the outputs of classification and detection by late fusion. [1] C. Zhu, C.E. Bichot, L. Chen: 'Visual object recognition using DAISY descriptor', in Proc. of IEEE International Conference on Multimedia and Expo (ICME), to appear, 2011. [2] C. Zhu, C.E. Bichot, L. Chen: 'Multi-scale color local binary patterns for visual object classes recognition', in Proc. of 20th International Conference on Pattern Recognition (ICPR), pp.3065-3068, 2010. |
LIRIS_CLSTEXT | Classification with additional text feature | LIRIS_CLSTEXT | LIRIS, Ecole Centrale de Lyon, CNRS, UMR5205, France | Chao ZHU, Yuxing TANG, Ningning LIU, Charles-Edmond BICHOT, Emmanuel Dellandrea, Liming CHEN | In this submission, we try to use additional text information to help with object classification. We propose novel text features [1] based on semantic distance using WordNet. The basic idea is to calculate the semantic distance between the text associated with an image and an emotional dictionary based on path similarity, denoting how similar two word senses are, based on the shortest path that connects the senses in a taxonomy. As there are no tags included in Pascal2011 dataset, we downloaded 1 million Flickr images (including their tags) as the additional textual source. Firstly, for each Pascal image, we find its similar images (top 20) from the database using KNN method based on visual features (LBP and color HSV histogram), and then use these tags to extract the text feature. We use SVM with RBF kernel to train the classifier and predict the outputs. For classification based on visual features, we follow the same method described in our other submission. The outputs of visual feature based method and text feature based method are then linearly combined as final results. [1] N. Liu, Y. Zhang, E. Dellandréa, B. Tellez, L. Chen: ‘Associating text features with visual ones to improve affective image classification’, International Conference Affective Computing (ACII), Memphis, USA, 2011. |
MISSOURI_LCC_TREE_CODING | SVM classifier with LCC and tree coding | LCC-TREE-CODING | University of Missouri | Xiaoyu Wang, Miao Sun, Xutao Lv, Shuai Tang, Guang Chen, Yan Li, Tony X. Han | A two-layer cascade structure for object detection. The first layer employs a deformable model to select possible candidates for the second layer. The latter layer takes location and global context, augmented with the LBP feature, to improve accuracy. A bag-of-words model enhanced with spatial pyramid and local coordinate coding is used to model the global context information. A hierarchical tree-structure coding is used to handle the intra-class variation for each detection window. A linear SVM is used for classification. |
MISSOURI_SSLMF | Supervised Learning with Multiple Features | Supervised learning with multiple features | University of Missouri - Columbia | Xutao Lv, Xiaoyu Wang, Guang Chen, Shuai Tang, Yan Li, Miao Sun, Tony X. Han | Multiple available features are combined and fed into a newly developed supervised learning algorithm. The features include those extracted within the bounding box and those from the whole image; the whole-image features serve as context information. We mainly use two feature descriptors in our submission, dense SIFT and HOG. The LCC coding method and spatial pyramid are adopted to generate a histogram for each action image, and the histogram then serves as the feature vector to train and test with the supervised learning algorithm. |
MISSOURI_TREE_MAX_POOLING | SVM classifier with tree max-pooling | TREE--MAX-POOLING | University of Missouri | Xiaoyu Wang, Miao Sun, Xutao Lv, Shuai Tang, Guang Chen, Yan Li, Tony X. Han | A two-layer cascade structure for object detection. The first layer employs a deformable model to select possible candidates for the second layer. The latter layer takes location and global context, augmented with the LBP feature, to improve accuracy. A bag-of-words model enhanced with spatial pyramid and local coordinate coding is used to model the global context information. A hierarchical tree-structure coding is used to handle the intra-class variation for each detection window. Max-pooling is used for tree node assignment. A linear SVM is used for classification. |
MSRAUSTC_HIGH_ORDER_SVM | SVM with mined high order features | MSRA_USTC_HIGH_ORDER_SVM | Microsoft Research Asia & University of Science and Technology of China | Kuiyuan Yang, Lei Zhang, Hong-Jiang Zhang | We introduce a discriminatively-trained parts-based model with different level templates for image classification. The model consists of templates of HOG features (Dalal and Triggs, 2006) at three different levels. The responses of different level templates are combined by a latent-SVM, where the latent variables are the positions of the templates. We develop a novel mining algorithm to define the parts and an iterative training procedure to learn the parts. The model is applied to all 20 PASCAL VOC objects. |
MSRAUSTC_PATCH | SVM with multi-channel cell-structured patch features | MSRA_USTC_PATCH | Microsoft Research Asia & University of Science and Technology of China | Kuiyuan Yang, Lei Zhang, Hong-Jiang Zhang | We introduce a discriminatively-trained patch-based model with cell-structured templates for image classification. Densely sampled patches are represented by cell-structured templates of HOG, LBP, HSV, SIFT, CSIFT and SSIM. These templates are then fed to Super-Vector Coding (Xi Zhou, 2010) and the Fisher Kernel (Florent Perronnin, 2010) to form the image feature. A linear SVM is then trained for each category in a one-vs-the-rest manner. The object detector from PASCAL VOC 2007 is used to extract object-level features; classifiers are trained on these features and then fused with the former ones. |
NANJING_DMC_HIK_SVM_SIFT | HIK-based SVM classifier with dense SIFT features. | DMC-HIK-SVM-SIFT | The University of Nanjing | Yubin Yang, Ye Tang, Lingyan Pan | We adopt a bag-of-visual-words method (cf. Csurka et al., 2004). A single descriptor type, the SIFT descriptor (Lowe, 2004), is extracted from 16x16-pixel patches densely sampled from each image on a grid with a step size of 12 pixels. We partition the original training and validation data into categories according to their labels, then randomly select 200 images per category (2000 images in total) as the training set. We use a novel difference-maximize coding approach to quantize these descriptors into 200 “visual words”. Each image is then represented by a histogram of visual words. Spatial pyramid matching (Lazebnik et al., CVPR 2006) is also used in our method. Finally, we train a HIK-kernel (Jianxin Wu et al., ICCV 2009) based SVM classifier using the concatenated pyramid feature vector for each image in the training set. |
NLPR_DD_DC | NLPR-Detection | Data Decomposition and Distinctive Context | Institute of Automation, Chinese Academy of Sciences | Junge Zhang, Yinan Yu, Yongzhen Huang, Chong Wang, Weiqiang Ren, Jinchen Wu, Kaiqi Huang and Tieniu Tan | The part-based model has achieved great success in recent years. To our understanding, the original deformable part-based model has several limitations: 1) its computational cost is very high, especially when it is extended to enhanced models via multiple features, more mixtures or flexible part models; 2) it is not “deformable” enough. To tackle these problems: 1) We propose a data-decomposition-based feature representation scheme for the part-based model, learned in an unsupervised manner. The submitted method takes about 1~2 seconds per PASCAL VOC image on average while keeping high performance. We learn the basis from samples without any label information; this label-independent rule can be adapted to other variants of the part-based model, such as hierarchical models or flexible mixture models. 2) We found that each part corresponds to multiple possible locations, which is not reflected in the original part-based model. Accordingly, we propose that the locations of parts should follow a mixture of Gaussian distributions. Thus, for each part we learn its optimal locations by clustering, and these are used to update the original anchors of the part-based model. This more effectively describes the deformation (pose and location variation) of objects’ parts. 3) We rescore the initial results with our distinctive context model, which includes global, local and intra-class context information. Besides, segmentation provides a strong indication of an object’s presence; therefore, the proposed segmentation-aware semantic attribute is applied in the final reasoning, which indeed shows promising performance. |
NLPR_KF_SVM | SVM classifier with five kernels. | NLPR_KF_SVM | National Laboratory of Pattern Recognition, Institute of Automation, Chinese Academy of Sciences. | Ruiguang Hu, Weiming Hu | Features: GrayPHOG, HuePHOG, PLBP (R=1 and R=2), and GraySIFT_HL, HueSIFT_HL with LLC coding and max pooling. Codebooks: k-means clustering. Normalization: L1 for GrayPHOG, HuePHOG and PLBP; L2 for GraySIFT_HL and HueSIFT_HL. Kernels: chi-squared for GrayPHOG, HuePHOG and PLBP; linear for GraySIFT_HL and HueSIFT_HL. Kernel fusion: averaging. Training features are extracted from sub-images cropped according to the annotation bounding boxes; test features are extracted from the whole test images. |
NLPR_SS_VW_PLS | NLPR_CLS | Semi-Semantic Visual Words & Partial Least Squares | National Laboratory of Pattern Recognition, Institute of Automation, Chinese Academy of Sciences | Yinan Yu, Junge Zhang, Yongzhen Huang, Weiqiang Ren, Chong Wang, Jinchen Wu, Kaiqi Huang, Tieniu Tan | The framework is based on the classical bag-of-words model. The system consists of: 1) at the feature level, semi-semantic and non-semantic visual-word learning, with fast feature extraction (salient coding, super-vector coding and visual co-occurrence) and multiple features; 2) to learn the class models, an alternative multiple-linear-kernel learning scheme for intra-class feature combination, after Partial Least Squares analysis projects the extremely high-dimensional features into a low-dimensional space; 3) the combination of the 20 category scores and the detection scores, which generates a high-level semantic representation of the image; we use non-linear-kernel learning to extract inter-class contextual information, which further improves performance. All parameters are decided by cross-validation and prior knowledge on the VOC2007 and VOC2010 trainval sets. Motivation and novelty of our algorithm: the traditional codebook describes the distribution of the feature space and contains little semantic information about the objects of interest, so a semantic codebook may benefit performance. We observe that the Deformable-Part-Based Model [Felz TPAMI 2010] describes an object by “object parts”, which can be seen as semi-semantic visual words. Based on this idea, we propose a bag-of-words model built on both semi-semantic and non-semantic visual words for image classification. Analyzing recent image classification algorithms, we find that feature “distribution”, “reconstruction” and “saliency” are three fundamental issues in coding and image description. 
However, these methods usually lead to an extremely high-dimensional description, especially with multiple features. In order to learn these features by MKL, we find Partial Least Squares to be a reliable method for dimensionality reduction: the compression ratio of PLS is over 10000, while the discriminative power is preserved. |
NLPR_SVM_BOWDET | SVM with multiple features and detection results | NLPR_IVA_SVM_BOWDect | NLPR, CASIA | Jing Liu, Jianlong Fu, Bingyuan Liu, Hanqing Lu | Two types of features are considered. First, typical BoW features (OpponentSIFT, CSIFT and rgSIFT) with dense and Harris sampling respectively; spatial pyramid kernels over these features are calculated for classification. Second, object detection based on the deformable part model (P. Felzenszwalb et al., PAMI 2010) is employed. Combining these features, we attempt to learn a hierarchical SVM classifier according to the hierarchical structure of all the 20-class data. |
NLPR_SVM_BOWDET_CONV | SVM with multiple features and detection results | NLPR_IVA_SVM_BOWDect_Convolution | NLPR, CASIA | Jing Liu, Jianlong Fu, Bingyuan Liu, Hanqing Lu | Three types of features are considered. First, typical BoW features (OpponentSIFT, CSIFT and rgSIFT) with dense and Harris sampling respectively; spatial pyramid kernels are calculated. Second, an improved image representation via convolutional sparse coding and a max-pooling operation, motivated by M. Zeiler’s work in ICCV 2011. Third, object detection based on the deformable part model. Combining these features, we attempt to learn a hierarchical SVM classifier according to the hierarchical structure of all the 20-class data. |
NUDT_CONTEXT | SVM classifier with contextual information | NUDT_Context | National University of Defense Technology | Li Zhou, Zongtan Zhou, Dewen Hu | Action classification using contextual information. We present a new context model for action classification based on the distribution of objects and the semantic category of the scene within images. Scene classification works by creating multiple-resolution images and partitioning them into sub-regions at different scales. The visual descriptors of all sub-regions in the same resolution image are directly concatenated for SVM classifiers. Finally, regarding each resolution image as a feature channel, we combine all the feature channels to reach a final decision. Object recognition works by incorporating a multi-resolution representation into the bag-of-features model. |
NUDT_LL_SEMANTIC | SVM classifier with low-level and semantic modeling | NUDT_Low-level_Semantic | National University of Defense Technology | Li Zhou, Dewen Hu, Zongtan Zhou | Action classification based on combining low-level and semantic modeling strategies. |
NUSPSL_CTX_GPM | Classification using context SVM and GPM | NUSPSL_CTX_GPM | National University of Singapore; Panasonic Singapore Laboratories | NUS: Chen Qiang, Song Zheng, Yan Shuicheng; PSL: Hua Yang, Huang Zhongyang, Shen Shengmei | The whole solution for object classification is based on the BoW framework. At the image level, dense SIFT, HOG^2, LBP and color-moment features are extracted. VQ and Fisher vectors are utilized for feature coding. Traditional SPM and a novel spatial-free pyramid matching scheme are then performed to generate image representations. Context-aware features are also extracted based on the detection results [1]. The classification models are learnt via kernel SVM. The final classification scores are refined with kernel mapping [2]. The key novelty of the new solution is the pooling strategy (which handles spatial mismatching as well as noisy features), and considerable improvement has been achieved, as shown in other offline experiments. [1] Zheng Song, Qiang Chen, Zhongyang Huang, Yang Hua, and Shuicheng Yan. Contextualizing Object Detection and Classification, CVPR 2011. [2] http://pascallin.ecs.soton.ac.uk/challenges/VOC/voc2010/workshop/nuspsl.pdf |
NUSPSL_CTX_GPM_SVM | Classification using context SVM and GPM | NUSPSL_CTX_GPM_SVM | National University of Singapore; Panasonic Singapore Laboratories | NUS: Chen Qiang, Song Zheng, Yan Shuicheng; PSL: Hua Yang, Huang Zhongyang, Shen Shengmei | The whole solution for object classification is based on the BoW framework [1]. At the image level, dense SIFT, HOG^2, LBP and color-moment features are extracted. VQ and Fisher vectors are utilized for feature coding. Traditional SPM and a novel spatial-free pyramid matching scheme are then performed to generate image representations. Context-aware features are also extracted based on the detection results [2]. The classification models are learnt via kernel SVM. The key novelty of the new solution is the pooling strategy (which handles spatial mismatching as well as noisy features), and considerable improvement has been achieved, as shown in other offline experiments. [1] http://pascallin.ecs.soton.ac.uk/challenges/VOC/voc2010/workshop/nuspsl.pdf [2] Zheng Song, Qiang Chen, Zhongyang Huang, Yang Hua, and Shuicheng Yan. Contextualizing Object Detection and Classification, CVPR 2011. |
NUS_CONTEXT_SVM | Context-SVM based submission for 3 tasks | NUS_Context_SVM | National University of Singapore | Zheng Song, Qiang Chen, Shuicheng Yan | Classification uses the BoW framework. Dense SIFT, HOG^2, LBP and color-moment features are extracted. We then use VQ and Fisher vectors for feature coding, and SPM and Generalized Pyramid Matching (GPM) to generate image representations. Context-aware features are also extracted based on [1]. The classification models are learnt via kernel SVM, and the final classification scores are refined with kernel mapping [2]. Detection and segmentation use the baseline of [3] with HOG and LBP features; based on [1], we further learn a context model and refine the detection results. The final segmentation result substitutes the rectangular detection boxes with average masks for each detection component, learnt from the segmentation training set. [1] Zheng Song*, Qiang Chen*, Zhongyang Huang, Yang Hua, and Shuicheng Yan. Contextualizing Object Detection and Classification. [2] http://pascallin.ecs.soton.ac.uk/challenges/VOC/voc2010/workshop/nuspsl.pdf [3] http://people.cs.uchicago.edu/~pff/latent/ |
NUS_SEG_DET_MASK_CLS_CRF | Segmentation Using CRF with Detection Mask | NUS_SEG_DET_MASK_CLS_CRF | National University of Singapore | Wei XIA, Zheng SONG, Qiang CHEN, Shuicheng YAN, Loong Fah CHEONG | The solution is based on a CRF model, and the key contribution is the use of various types of binary regularization terms. Object detection also plays a significant role in guiding semantic object segmentation. In this solution, the CRF model integrates the global classification score with local unary and binary information to perform semantic segmentation. Moreover, detection masks, obtained by hard-thresholding the detection confidence maps, are applied as extra unary and smoothness terms in the CRF model. Masks with high confidence are also used in the post-processing stage to refine the mask boundaries. |
NYUUCLA_HIERARCHY | Latent Hierarchical Learning | NYU-UCLA_Hierarchy | NYU and UCLA | Yuanhao Chen, Li Wan, Long Zhu, Rob Fergus, Alan Yuille | Based on two recent publications: "Latent Hierarchical Structural Learning for Object Detection", Long Zhu, Yuanhao Chen, Alan Yuille, William Freeman, CVPR 2010; and "Active Mask Hierarchies for Object Detection", Yuanhao Chen, Long Zhu, Alan Yuille, ECCV 2010. We present a latent hierarchical structural learning method for object detection. An object is represented by a mixture of hierarchical tree models whose nodes represent object parts. The nodes can move spatially to allow both local and global shape deformations. The image features are histograms of words (HOWs) and histograms of oriented gradients (HOGs), which enable rich appearance representation of both structured (e.g., cat face) and textured (e.g., cat body) image regions. Learning the hierarchical model is a latent SVM problem, which we solve by the incremental concave-convex procedure (iCCCP). Object detection is performed by scanning sub-windows using dynamic programming. The detections are rescored by a context model that encodes the correlations between the 20 object classes using both object detection and image classification. |
OXFORD_DPM_MK | DPM with basic rescoring | DPM-MK | Oxford VGG | Andrea Vedaldi and Andrew Zisserman | This method uses a Deformable Part Model (our own implementation) to generate an initial (and very good) list of 100 candidate bounding boxes per image. These are then rescored by a multiple-feature model combining DPM scores with dense SP-BOW, geometry, and context. The SP-BOW model uses dense SIFT features (vl_phow in VLFeat) quantized into 1200 visual words, a 6x6 spatial layout, and cell-by-cell l2 normalization after raising the entries to the 1/4 power (1/4-homogeneous Hellinger's kernel). The geometric model is a second-order polynomial kernel on the bounding-box coordinates. The context model is a second-order polynomial kernel mixing the candidate DPM score with twenty scores obtained as the maximum response of the DPMs for the 20 classes in that image (as in Felzenszwalb et al.). A second context model is also added, using 20 scores from a state-of-the-art Fisher-kernel image classifier (also on dense SIFT features), as described in Chatfield et al. 2010. The SVM scores are passed through a sigmoid for standardization in the 0-1 interval; the sigmoid model is fitted to the training data. The model is trained by means of a large-scale linear SVM using the one-slack bundle formulation (aka SVM^perf). The solver hence uses retraining implicitly, and we make sure it reaches full convergence. |
OXFORD_RANK_SLACK_RBF | Structured ranking for Layout Detection | SVM-rank-slack-RBF | University of Oxford | Arpit Mittal, Matthew Blaschko, Andrew Zisserman, Manuel J Marin, Phil Torr | We make use of a structured SVM ranking algorithm to combine and rank the outputs of different part detectors. Individual parts are detected using separate detectors; the outputs are then customized to the local image using positional and scale cues. The different part detections are finally combined using a ranking function to give a single confidence value for human layout detection. The ranking is performed such that detections having more true-positive parts (i.e., higher precision) are returned earlier. To detect the human head, we use the parts-based model of Felzenszwalb et al. (PAMI 2010); the hand is localized using the hand detector developed by Mittal et al. (BMVC 2011). The feet are detected using the foot part of Felzenszwalb et al.'s human detector, and also returned as the bounding box around super-pixels resembling a human foot in the lower portion of the human ROI. We use the slack-rescaled variant of the structured SVM ranking algorithm with an RBF kernel map. |
SJT_SIFT_LLC_PCAPOOL_DET_SVM | SVM using LLC features with detection results | SIFT-LLC-PCAPOOL-DET-SVM | Shanghai Jiao Tong University | Jun Zhu, Xiaokang Yang, Yukun Zhu, Rui Zhang, Xiaolin Chen | We adopt a framework based on locality-constrained linear coding (LLC) (J. Wang et al., CVPR 2010), fused with the detection results given by discriminatively trained deformable part-based object detectors (P. Felzenszwalb et al., CVPR 2008). First, SIFT descriptors (Lowe, 2004) are extracted from image patches densely sampled every 8 pixels at three different scales (16x16, 24x24 and 32x32). Then, a codebook with 1024 bases is constructed by K-means clustering on 100,000 randomly selected descriptors from the training set. Each 128-dimensional SIFT descriptor is encoded by approximated LLC (the number of neighbors is set to 5, with the shift-invariant constraint), yielding a 1024-dimensional code vector. We apply max pooling to the patch-level codes from hundreds of overlapping regions with various spatial scales and positions, followed by dimension reduction using PCA. After that, the pooled features are concatenated into a vector with l2 normalization to form the image-level representation. In addition, the results of object detectors are also considered. We use Felzenszwalb's deformable part-based models to detect the bounding boxes for each object class. The detection scores are max-pooled in each cell of a spatial pyramid (i.e., 1x1+2x2+3x1) to construct an image-level representation with l2 normalization. We obtain the final image-level representation through weighted concatenation of the two feature vectors from LLC codes and object detectors. Then, a linear SVM classifier is trained to perform classification. The regularization parameters as well as the fusion weight are tuned for each class using the training and validation sets. 
For the implementation of the SVM we adopt the 'liblinear' software package, released by the machine learning group at National Taiwan University. |
SJT_SIFT_LLC_PCAPOOL_SVM | Linear SVM using LLC features with PCA pooling | SIFT-LLC-PCAPOOL-SVM | Shanghai Jiao Tong University | Jun Zhu, Xiaokang Yang, Yukun Zhu, Rui Zhang, Xiaoling Chen | We adopt a framework based on locality-constrained linear coding (LLC) (J. Wang et al., CVPR 2010). First, SIFT descriptors (Lowe, 2004) are extracted from image patches densely sampled every 8 pixels at three different scales (16x16, 24x24 and 32x32). Then, a codebook with 1024 bases is constructed by K-means clustering on 100,000 randomly selected descriptors from the training set. After that, each 128-dimensional SIFT descriptor is encoded by approximated LLC (the number of neighbors is set to 5, with the shift-invariant constraint), yielding a 1024-dimensional code vector. We apply max pooling to the patch-level codes from hundreds of overlapping regions with various spatial scales and positions, followed by dimension reduction using PCA. The pooled features are then concatenated into a vector with l2 normalization to form the image-level representation. Finally, we train linear SVM classifiers on this feature representation to perform classification. The regularization parameters are tuned for each class using the training and validation sets. For the implementation of the SVM we adopt the 'liblinear' software package, released by the machine learning group at National Taiwan University. |
STANFORD_COMBINE_ATTR_PART | Combine attribute classifiers and object detectors | COMBINE_ATTR_PART | Stanford University | Bangpeng Yao, Aditya Khosla, Li Fei-Fei | Our approach combines attribute classifiers and part detectors for action classification. The method is adapted from our ICCV 2011 paper (Yao et al., 2011). The "attributes" are trained using our random forest classifier (Yao et al., 2011); these are strong classifiers that consider global properties of action classes. As "parts", we consider the objects that interact with the humans, such as horses, books, etc. Specifically, we take the Object Bank (Li et al., 2010) detectors trained on the ImageNet dataset for the part representation. The confidence scores obtained from attribute classifiers and part detectors are combined to form the final score for each image. |
STANFORD_MAPSVM_POSELET | MAP-based SVM classifier with poselet features | MAPSVM-Poselet | Stanford University | Tim Tang, Pawan Kumar, Ben Packer, Daphne Koller | We build on the poselet-based feature vector for action classification (Maji et al., 2010) in four ways: (i) we use a 2-level spatial pyramid (Lazebnik et al., CVPR 2006); (ii) we obtain a segmentation of the person bounding box into foreground and background using an efficient GrabCut-like scheme (Rother et al., SIGGRAPH 2004), and use it to divide the feature vector into two parts, one corresponding to the foreground and one to the background; (iii) we learn a mixture model to deal with the different visual aspects of people performing the same action; and (iv) we optimize mean average precision (Yue et al., SIGIR 2007) instead of the 0/1 loss used in the standard binary SVM. All action classifiers are trained on only the VOC 2011 data, with the additional annotations required to compute the poselets. All hyperparameters are set using 5-fold cross-validation. |
STANFORD_RF_DENSEFTR_SVM | Random forest with SVM node classifiers | RF_DENSEFTR_SVM | Stanford University | Bangpeng Yao, Aditya Khosla, Li Fei-Fei | We use a random forest (RF) approach for action classification. Our method is adapted from our CVPR 2011 paper (Yao et al., 2011). We explore two key properties that determine the performance of RF classifiers: discrimination and randomization. (1) Discrimination: in order to obtain strong decision trees, instead of randomly generating feature weights as in conventional RF approaches, we use discriminative SVM classifiers to train the split at each tree node. (2) Randomization: the correlation between different decision trees needs to be small, so that the combination of all the trees forms an effective RF classifier. We consider a very dense feature space, sampling image regions of any size and location in the image. For each sampled region, we use an SPM feature representation. Since each decision tree samples a specific set of image regions, the correlation between the trees is reduced. |
UOCTTI_LSVM_MDPM | LSVM trained mixtures of deformable part models | UOCTTI_LSVM_MDPM | University of Chicago | Ross Girshick (University of Chicago), Pedro Felzenszwalb (Brown), David McAllester (TTI-Chicago) | Based on [1] http://people.cs.uchicago.edu/~pff/latent-release4 and [2] "Object Detection with Discriminatively Trained Part Based Models"; Pedro F. Felzenszwalb, Ross B. Girshick, David McAllester and Deva Ramanan; IEEE Transactions on Pattern Analysis and Machine Intelligence, Vol. 32, No. 9, September 2010. This entry is a minor modification of our publicly available "voc-release4" object detection system [1]. The system uses latent SVM to train mixtures of deformable part models using HOG features [2]. Final detections are refined using a context rescoring mechanism [2]. We extended [1] to detect smaller objects by adding an extra high-resolution octave to the HOG feature pyramid. The HOG features in this extra octave are computed using 2x2 pixel cells. Additional bias parameters are learned to help calibrate scores from detections in the extra octave with the scores of detections above this octave. This entry is the same as UOCTTI_LSVM_MDPM from the 2010 competition. Detection results are reported for all 20 object classes to provide a baseline for the 2011 competition. |
UOCTTI_WL-SSVM_GRAMMAR | Person grammar model trained with WL-SSVM | UOCTTI_WL-SSVM_GRAMMAR | University of Chicago | Ross Girshick (University of Chicago), Pedro Felzenszwalb (Brown), David McAllester (TTI-Chicago) | This entry is described in [1] "Object Detection with Grammar Models"; Ross B. Girshick, Pedro F. Felzenszwalb, David McAllester. Neural Information Processing Systems 2011 (to appear). We define a grammar model for detecting people and train the model’s parameters from bounding box annotations using a formalism that we call weak-label structural SVM (WL-SSVM). The person grammar uses a set of productions that represent varying degrees of visibility/occlusion. Object parts, such as the head and shoulder, are shared across all interpretations of object visibility. Each part is represented by a deformable mixture model that includes deformable subparts. An "occluder" part (itself a deformable mixture of parts) is used to capture the nontrivial appearance of the stuff that typically occludes people from below. We further refine detections using the context rescoring mechanism from the UOCTTI_LSVM_MDPM entry, using the results of that entry for the 19 non-person classes. |
UVA_MOSTTELLING | Most Telling Window | UvA_UNITN_MostTellingMonkey | University of Amsterdam, University of Trento | Jasper Uijlings, Koen van de Sande, Arnold Smeulders, Theo Gevers, Nicu Sebe, Cees Snoek | Classification task. The main component of this entry is the "Most Telling Window" method, which uses Segmentation as Selective Search [1] combined with bag-of-words. The "Most Telling Window" method is also used in our detection entry. However, here, instead of focusing on finding complete objects, training is adjusted such that we can use the most discriminative part of an object for its identification instead of the whole object. The Most Telling Window method is currently under review. While the "Most Telling Window" method yields the greatest contribution, we improve accuracy further by combining it with a standard bag-of-words framework based on SIFT and ColourSIFT, and with the detection scores of the part-based model of Felzenszwalb et al. [1] "Segmentation as Selective Search for Object Recognition"; Koen E. A. van de Sande, Jasper R. R. Uijlings, Theo Gevers, Arnold W. M. Smeulders; 13th International Conference on Computer Vision, 2011. |
UVA_SELSEARCH | Selective Search Detection System | SelectiveSearchMonkey | University of Amsterdam and University of Trento | Jasper R. R. Uijlings, Koen E. A. van de Sande, Arnold W. M. Smeulders, Theo Gevers, Nicu Sebe, Cees Snoek | Based on "Segmentation as Selective Search for Object Recognition"; Koen E. A. van de Sande, Jasper R. R. Uijlings, Theo Gevers, Arnold W. M. Smeulders; 13th International Conference on Computer Vision, 2011. Instead of the exhaustive search that dominated the PASCAL VOC 2010 detection challenge, we use segmentation as a sampling strategy for selective search (cf. our ICCV paper). Like segmentation, we use the image structure to guide our sampling process. However, unlike segmentation, we propose to generate many approximate locations rather than few precise object delineations, as the goal is to cover all object locations. Our sampling is diversified to deal with as many image conditions as possible. Specifically, we use a variety of hierarchical region grouping strategies, varying colour spaces and grouping criteria. This results in a small set of data-driven, class-independent, high-quality object locations (covering 96-99% of all objects in the VOC2007 test set). Because we have only a limited number of locations to evaluate, this enables the use of the more computationally expensive bag-of-words framework for classification. Our bag-of-words implementation uses densely sampled SIFT and ColorSIFT descriptors. |
WVU_SVM-PHOW | SVM classifier with PHOW features. | SVM-PHOW | West Virginia University | Biyun Lai, Yu Zhu, Qin Wu, Guodong Guo | We develop a method for still-image-based action recognition. There are 10 action classes plus the “other” action class provided by PASCAL VOC 2011. We extract PHOW features to represent the images, a multi-scale dense SIFT implementation. The kernel SVM method is used for training the action classifiers, with several different kernels. We also use a learning technique to map the original features into a different space to improve the feature representation. A confidence measure is used to combine the results from the different kernels to form the final decision for action classification. Training is performed on the provided training set and tuned using the validation set, and the learned classifiers are then applied to the test data. |
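
Many of the classification entries above (e.g. MISSOURI_SSLMF, NANJING_DMC_HIK_SVM_SIFT, the SJT entries) share the same backbone: quantize densely sampled local descriptors against a codebook and concatenate per-cell word histograms over a spatial pyramid (Lazebnik et al., CVPR 2006). The following is a minimal numpy sketch of that pooling step only, not any team's code; the function names, grid levels and hard-assignment coding are illustrative simplifications:

```python
import numpy as np

def quantize(descriptors, codebook):
    """Hard-assign each local descriptor to its nearest visual word."""
    d2 = ((descriptors[:, None, :] - codebook[None, :, :]) ** 2).sum(-1)
    return d2.argmin(axis=1)

def spatial_pyramid_histogram(words, xy, num_words, levels=(1, 2, 4)):
    """Concatenate per-cell word histograms over a pyramid of grids.

    words : (N,) visual-word index per descriptor
    xy    : (N, 2) descriptor positions, normalized to [0, 1)
    """
    parts = []
    for g in levels:
        cell = (xy * g).astype(int).clip(0, g - 1)  # grid cell per point
        for i in range(g):
            for j in range(g):
                mask = (cell[:, 0] == i) & (cell[:, 1] == j)
                h = np.bincount(words[mask], minlength=num_words).astype(float)
                parts.append(h)
    v = np.concatenate(parts)
    return v / max(v.sum(), 1e-12)  # L1-normalize the pyramid vector
```

The real systems replace the hard assignment with LLC, super-vector or Fisher coding, and feed the resulting vector to a kernel or linear SVM.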
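
The SJT entries encode each SIFT descriptor with approximated LLC (J. Wang et al., CVPR 2010): solve a small least-squares system over the k nearest codebook bases under a sum-to-one constraint. A sketch of the per-descriptor step, assuming a generic `beta` regularization constant (not a value taken from the entries):

```python
import numpy as np

def llc_encode(x, codebook, k=5, beta=1e-4):
    """Approximated locality-constrained linear coding for one descriptor."""
    d2 = ((codebook - x) ** 2).sum(axis=1)
    idx = np.argsort(d2)[:k]              # k nearest bases
    B = codebook[idx] - x                 # shift-invariant constraint
    C = B @ B.T                           # local covariance
    C += beta * np.trace(C) * np.eye(k)   # regularize for stability
    w = np.linalg.solve(C, np.ones(k))
    w /= w.sum()                          # enforce sum(w) == 1
    code = np.zeros(len(codebook))
    code[idx] = w
    return code
```

With k=5 and a 1024-word codebook, each descriptor yields a 1024-dimensional code with at most five non-zero entries, which is what makes the subsequent max pooling cheap.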
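
NANJING_DMC_HIK_SVM_SIFT classifies bag-of-words histograms with a histogram intersection kernel (HIK) SVM. Wu et al. (ICCV 2009) contribute a fast evaluation scheme; the kernel itself is just the sum of bin-wise minima, sketched here (function names are our own) in a form suitable for building a precomputed Gram matrix:

```python
import numpy as np

def hik(h1, h2):
    """Histogram intersection kernel: sum of bin-wise minima."""
    return float(np.minimum(h1, h2).sum())

def hik_gram(H1, H2):
    """Pairwise HIK Gram matrix between two stacks of row histograms."""
    return np.array([[hik(a, b) for b in H2] for a in H1])
```

With L1-normalized histograms, hik(h, h) == 1 and the Gram matrix is symmetric and positive semi-definite, so it can be handed directly to any SVM solver that accepts precomputed kernels.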