Shiai Zhu1, Chong-Wah Ngo1, Yu-Gang Jiang2

1Video Retrieval Group (VIREO), City University of Hong Kong

2Dept of Electrical Engineering, Columbia University
shiaizhu2 AT     cwngo AT     yjiang AT



Visual concept detection is essentially a classification task that determines whether a multimedia unit (e.g., an image) is relevant to a given target concept. Classifiers (e.g., SVMs) are trained with various features extracted from training samples, and the learnt classifiers are then employed for concept detection. With a large set of robust concept detectors, significant improvement can be achieved in many challenging applications, such as image search and summarization.

To train the concept detectors, a critical step is to acquire a sufficiently large amount of training data, which is not a trivial process. Fortunately, with the popularity of social media, more and more digital images are available on the Web. Several datasets containing thousands of images collected from websites such as Flickr have recently been released for research. Among them, NUS-WIDE [1] is a popular web image dataset collected by researchers at the National University of Singapore, which includes approximately 260k images with manual annotations for 81 concept categories. To avoid repetitive effort in building baseline concept detectors (classifiers) over this dataset, we are now releasing our concept detectors for the 81 concepts. The detector set is named VIREO-WEB81.

Implementation Details and Results:

We adopt a similar setting to VIREO-374, a detector set trained on broadcast news videos that has been used by many researchers in the field. For each concept, three SVM classifiers are trained based on bag-of-visual-words (BoW), grid-based color moments, and wavelet texture, respectively. For BoW, we use the same 500-word visual codebook as VIREO-374, and apply soft-weighting [2] for vector quantization. For color moments, each image is partitioned into 5-by-5 grids, and the first three moments are computed in Lab color space over each grid. Similarly, for wavelet texture, each image is divided into 3-by-3 grids, and each grid is represented by the variances in 9 Haar wavelet sub-bands. The LibSVM package [3] is used for model training and testing. Following the implementation of VIREO-374, a Chi-square kernel is used for the BoW feature, and RBF kernels are used for color moments and wavelet texture. The classifiers of the three features are combined by average (late) fusion.
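For illustration, the Chi-square kernel and average (late) fusion described above can be sketched as follows. This is a minimal NumPy sketch; the function names and the gamma parameter are our own illustrative choices, not taken from the released models.

```python
import numpy as np

def chi_square_kernel(X, Y, gamma=1.0):
    """Chi-square kernel K(x, y) = exp(-gamma * sum_i (x_i - y_i)^2 / (x_i + y_i)),
    commonly used with histogram features such as bag-of-visual-words."""
    X = np.asarray(X, dtype=float)[:, None, :]   # shape (n, 1, d)
    Y = np.asarray(Y, dtype=float)[None, :, :]   # shape (1, m, d)
    num = (X - Y) ** 2
    den = X + Y
    # Treat 0/0 (two empty histogram bins) as contributing zero distance.
    dist = np.where(den > 0, num / np.where(den > 0, den, 1.0), 0.0).sum(axis=-1)
    return np.exp(-gamma * dist)

def late_fusion(score_lists):
    """Average (late) fusion: the mean of the per-feature classifier scores."""
    return np.mean(score_lists, axis=0)
```

In practice the per-feature SVM scores (e.g., LibSVM probability outputs) would be passed to `late_fusion`; identical histograms yield a kernel value of 1.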

The 81 concept detectors were trained on the NUS-WIDE training set (~160k images). The figure below shows the per-concept average precision (AP) of our detectors on the NUS-WIDE test set (~100k images).
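Average precision, the measure reported in the figure, can be computed per concept as below. This is a minimal NumPy sketch of non-interpolated AP; the exact evaluation script used for the figure may differ in details.

```python
import numpy as np

def average_precision(scores, labels):
    """Non-interpolated average precision: the mean of precision@k over
    the ranks k at which a relevant (positive) item appears."""
    labels = np.asarray(labels)
    if labels.sum() == 0:
        return 0.0
    order = np.argsort(-np.asarray(scores))      # rank by descending score
    labels = labels[order]
    hits = np.cumsum(labels)                     # number of positives up to each rank
    ranks = np.arange(1, len(labels) + 1)
    return (hits[labels == 1] / ranks[labels == 1]).mean()
```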

Figure: Average precision (AP) performance of the 81 concept detectors on the NUS-WIDE dataset.


  SVM models, features of NUS-WIDE dataset, and detection scores on the test set:

To download the above items, please send an email specifying the items you need, your name, and your affiliation. We will reply with download instructions.

  Feature Extraction Code:

The feature extraction code used for computing the soft-weighted bag-of-visual-words feature is the same as that used in VIREO-374, which can be downloaded from here.
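For reference, the soft-weighting scheme of [2] can be sketched as follows: each local descriptor votes for its N nearest visual words, with the i-th nearest word receiving a weight of sim / 2^(i-1). This is a minimal NumPy sketch; the similarity measure (1 / (1 + distance)) and N = 4 are illustrative assumptions, so please consult the released code for the exact implementation.

```python
import numpy as np

def soft_weight_bow(descriptors, codebook, n_neighbors=4):
    """Soft-weighting assignment in the spirit of [2]: each descriptor
    votes for its N nearest visual words, the i-th nearest with weight
    sim / 2**(i-1), where sim here is 1 / (1 + Euclidean distance)."""
    descriptors = np.asarray(descriptors, dtype=float)
    codebook = np.asarray(codebook, dtype=float)
    hist = np.zeros(len(codebook))
    for d in descriptors:
        dist = np.linalg.norm(codebook - d, axis=1)
        nearest = np.argsort(dist)[:n_neighbors]  # indices of N nearest words
        for i, w in enumerate(nearest):
            hist[w] += (1.0 / (1.0 + dist[w])) / (2 ** i)
    return hist
```

Compared with hard assignment (each descriptor votes only for its single nearest word), this spreads each vote over several nearby words, which [2] found more robust to quantization error.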


Please cite the following paper for VIREO-WEB81:


[1] T.-S. Chua et al. NUS-WIDE: A Real-World Web Image Database from National University of Singapore. In ACM CIVR, 2009.
[2] Y.-G. Jiang, C.-W. Ngo, and J. Yang. Towards Optimal Bag-of-Features for Object Categorization and Semantic Video Retrieval. In ACM CIVR, 2007.
[3] C.-C. Chang and C.-J. Lin. LIBSVM: A Library for Support Vector Machines, 2001. Software available at .