Overview  |  Implementation Details  |  Example Results  |  Downloads  |  Citation

Recent Updates:

+ feature extraction code added to the download section of this page [08/15/2010]
+ detection scores on TRECVID 2010 test set available from CU-VIREO374 download site [08/10/2010]


Video concept detection aims to rank video shots according to the presence of semantic concepts (e.g., "sports", "charts", "people marching"). These concepts can act as semantic filters for online video search. For example, the query "find military vehicles" can be answered by returning the video shots most likely to contain the concepts "military" and "vehicle". Recently, a number of concept detection systems have been developed and tested on the NIST TREC video retrieval (TRECVID) benchmarks. In addition, the LSCOM effort has defined 1000+ semantic concepts and annotated 400+ of them over a set of broadcast news videos [2], providing a good resource for researchers to develop large-scale concept detectors. However, even though the labeled training samples are publicly available, developing detectors for such a large number of concepts is still difficult and time-consuming.

A concept detection system generally contains three components: feature extraction, uni-modality learning (e.g., using SVMs), and multi-modality fusion. To reduce the effort of replicating similar baseline systems, the MediaMill team at the University of Amsterdam released 101 concept detectors [3], and the DVMM lab at Columbia University released 374 LSCOM concept detectors (Columbia374) [4]. Since their aim is to reduce baseline replication effort, the visual features used in Columbia374 and MediaMill-101 are simple global ones such as color moments and Gabor textures. In our recent work [5], we showed that local keypoint features, with careful selection of representation choices, are very effective for concept detection. With the goal of stimulating new concept detection techniques and providing better concept detectors for video search, we extended our system to detect the 374 LSCOM concepts [2, 4], namely VIREO-374. We release the detectors, as well as features and detection scores on several recent data sets, to the multimedia community.

Implementation Details:

VIREO-374 detectors were trained on the TRECVID-2005 development data using the LSCOM annotations [2]. Below are the implementation details of these detectors. Note that for the newly released detection scores on the TV10 data sets, the detectors are different (see the CU-VIREO374 download site).

Local Feature: we used the DoG detector and the SIFT descriptor for keypoint detection and description. A visual vocabulary of 500 visual words was generated by clustering a set of ~500k SIFT features. With this vocabulary, a keyframe can then be represented by a 500-d feature vector, analogous to the bag-of-words representation of text documents. We used a soft-weighting scheme to weight the significance of each visual word in the keyframe, which we have demonstrated to be more effective than the traditional TF/TF-IDF weighting schemes in our previous work [5]. For more details of our keypoint-based video frame representation, please refer to [5].
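The soft-weighting idea can be sketched as follows: instead of assigning each keypoint to its single nearest visual word, each keypoint contributes to its top-k nearest words with a geometrically decaying weight. This is an illustrative sketch only (the function name, the choice of k=4, and the use of plain L2 nearest-neighbor search are our assumptions here; see [5] and the released feature extraction code for the exact scheme):

```python
import numpy as np

def soft_weighted_bow(descriptors, vocabulary, k=4):
    """Sketch of a soft-weighted bag-of-visual-words histogram.

    descriptors: (n, 128) array of SIFT descriptors from one keyframe.
    vocabulary:  (500, 128) array of visual-word centroids.
    Each descriptor contributes to its k nearest words, the i-th
    nearest receiving weight 1 / 2^i (i = 0 .. k-1).
    """
    hist = np.zeros(len(vocabulary))
    for d in descriptors:
        dists = np.linalg.norm(vocabulary - d, axis=1)   # L2 to all words
        nearest = np.argsort(dists)[:k]                  # top-k words
        for i, w in enumerate(nearest):
            hist[w] += 1.0 / (2 ** i)                    # decaying weight
    return hist
```

Compared with hard assignment, this lets a keypoint that falls near a word-cell boundary inform several nearby words instead of just one.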

For comparison, and also to evaluate the fusion of the local feature with traditional color/texture features, we implemented two global features: grid-based (5 by 5) color moments in Lab color space (225-d), and grid-based (3 by 3) wavelet texture (81-d).
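The 225-d color moment feature follows directly from the grid layout: 5x5 cells, 3 Lab channels, and 3 moments (mean, standard deviation, skewness) per channel gives 25 x 3 x 3 = 225 dimensions. A minimal sketch, assuming the input is already an H x W x 3 Lab image as a float array (the function name and cell-slicing details are ours):

```python
import numpy as np

def grid_color_moments(lab_image, grid=5):
    """225-d grid color moment feature: mean, std, and skewness of
    each Lab channel over a grid x grid partition of the frame."""
    h, w, _ = lab_image.shape
    feats = []
    for gy in range(grid):
        for gx in range(grid):
            cell = lab_image[gy * h // grid:(gy + 1) * h // grid,
                             gx * w // grid:(gx + 1) * w // grid].reshape(-1, 3)
            mean = cell.mean(axis=0)
            std = cell.std(axis=0)
            # cube root of the third central moment (handles negative values)
            skew = np.cbrt(((cell - mean) ** 3).mean(axis=0))
            feats.extend(np.concatenate([mean, std, skew]))
    return np.asarray(feats)
```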

Classifier: the LibSVM package [6] was used for model training and prediction on test data. We used the Chi-square kernel for the local feature and the RBF kernel for the two global features. The Chi-square kernel is an extension of the Gaussian RBF kernel in which the L2 distance is replaced with the Chi-square distance. For more details on kernel choice, please also refer to [5].
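Concretely, the Chi-square kernel is K(x, y) = exp(-g * sum_i (x_i - y_i)^2 / (x_i + y_i)), i.e., the RBF form with the Chi-square distance substituted for the squared L2 distance. A sketch of computing such a kernel matrix (which could then be passed to LibSVM as a precomputed kernel; the function name is ours):

```python
import numpy as np

def chi_square_kernel(X, Y, gamma):
    """Chi-square kernel matrix: K[i, j] = exp(-gamma * chi2(X[i], Y[j])),
    where chi2(x, y) = sum((x - y)^2 / (x + y)) over dimensions with x + y > 0.
    Assumes non-negative features (e.g., bag-of-visual-words histograms)."""
    K = np.zeros((len(X), len(Y)))
    for i, x in enumerate(X):
        for j, y in enumerate(Y):
            denom = x + y
            mask = denom > 0                    # skip empty dimensions
            chi2 = np.sum((x[mask] - y[mask]) ** 2 / denom[mask])
            K[i, j] = np.exp(-gamma * chi2)
    return K
```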

It is well known that the choice of parameters in SVMs affects performance. There are two important parameters: C (the cost parameter of the soft-margin SVM) and g (the width of the Gaussian kernel). Grid search is a common yet time-consuming way to find (near-)optimal parameters. For this work, we only slightly adjusted g around 1/d, where d is the average distance among a set of training samples, and then fixed it for all concepts. Although this may not be the optimal choice, we observed that the performance is very close to that of the optimal parameters determined by grid search, while the training time is significantly reduced. The parameters used for the three features are: 1) color moments: -c 8, -g 0.2; 2) wavelet texture: -c 8, -g 0.44; 3) local feature: -c 8, -g 0.0038.
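The g ~ 1/d heuristic above can be sketched as follows: estimate the average pairwise distance over a random sample of training vectors and take its reciprocal. This is an illustrative sketch (the function name, the use of squared L2 distance as appropriate for the RBF kernel, and random-pair sampling are our assumptions; for the Chi-square kernel, d would be the average Chi-square distance instead):

```python
import numpy as np

def gamma_from_mean_distance(features, n_pairs=1000, seed=0):
    """Estimate g ~ 1/d, where d is the mean squared L2 distance
    among randomly sampled pairs of training feature vectors."""
    rng = np.random.default_rng(seed)
    n = len(features)
    i = rng.integers(0, n, n_pairs)
    j = rng.integers(0, n, n_pairs)
    d = np.mean(np.sum((features[i] - features[j]) ** 2, axis=1))
    return 1.0 / d
```

A fixed g derived this way replaces the inner loop of a per-concept grid search, which is what saves most of the training time.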

In addition, note that each dimension of the two global features was scaled to [-1, 1] using the svm-scale tool of LibSVM. The keypoint feature was not scaled.

Output File Format: we adopted the probability outputs of LibSVM (the "-b" option during training and prediction). For each concept, there are three score files containing the probability outputs of the three SVM classifiers, each trained on one of the three features. In each score file, the probabilities are listed in a single column; a separate file named "shotlist.txt" lists the shot name corresponding to each row of the score files.

Detection Results and Comparison:

On TRECVID-2006 Benchmark:

Figure: performance of our results (red) compared with all official TRECVID-2006 concept detection submissions (purple & yellow). Each team could submit up to 6 runs (30 teams in total). For our best result (the red bar on the far right), we used average fusion to combine the different feature modalities.
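Average fusion here simply averages the per-feature SVM probability outputs for each shot, with no learned weights. As a one-line sketch (the function name is ours):

```python
import numpy as np

def average_fusion(score_columns):
    """Average fusion: given several per-feature probability columns
    (one list of scores per feature, aligned by shot), return the
    unweighted mean score per shot."""
    return np.mean(np.asarray(score_columns), axis=0)
```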

Top 200 Ranked Keyframes (Shots) on TRECVID 2006 Test Data:

  • Local feature only (MAP on TRECVID '06 Benchmark: 0.119):

  • Average Fusion of color/texture and local feature (MAP on TRECVID '06 Benchmark: 0.154):


  374 SVM models, Features of TRECVID 05&06, and Detection Scores on TRECVID 06 Test Data:

  To download the above items, please click here (User-ID: vireodetector, Password: vireo1234)

  Feature Extraction Code:

 Feature extraction code for computing the soft-weighted bag-of-visual-words feature (described in [5]) is available here.
 This feature can be directly used for concept detection with the VIREO-374 SVM models.

  Features and Detection Scores on Recent Datasets:

    - Aug. 2007: Features and Detection Scores on TRECVID-07 Dataset
    - Aug. 2008: Features and Detection Scores on TRECVID-08 Dataset
    - Aug. 2009: Features and Detection Scores on TRECVID-09 Test Set
    - Aug. 2010: Detection Scores on TRECVID-10 Test Set are Available at CU-VIREO374 Download Site


Please cite one of the following papers when using our VIREO-374 detectors, features, and/or detection scores:


[1] A. F. Smeaton, P. Over, and W. Kraaij, "Evaluation Campaigns and TRECVid", in Proceedings of the 8th ACM International Workshop on Multimedia Information Retrieval (MIR'06), Santa Barbara, USA, October 26-27, 2006, pp. 321-330.
[2] "LSCOM Lexicon Definitions and Annotations", in DTO Challenge Workshop on Large Scale Concept Ontology for Multimedia, Columbia University ADVENT Technical Report #217-2006-3, 2006.
[3] Cees G.M. Snoek, Marcel Worring, Jan C. van Gemert, Jan-Mark Geusebroek, and Arnold W.M. Smeulders. "The Challenge Problem for Automated Detection of 101 Semantic Concepts in Multimedia". ACM Multimedia, pp. 421-430, Santa Barbara, USA, October 2006.
[4] A. Yanagawa, S.-F. Chang, L. Kennedy, and W. Hsu, "Columbia University's Baseline Detectors for 374 LSCOM Semantic Visual Concepts", Columbia University ADVENT Technical Report #222-2006-8, March 2007.
[5] Yu-Gang Jiang, Chong-Wah Ngo, Jun Yang, "Towards Optimal Bag-of-Features for Object Categorization and Semantic Video Retrieval", ACM International Conference on Image and Video Retrieval (CIVR'07), Amsterdam, The Netherlands, 2007.
[6] C.-C. Chang, C.-J. Lin, "LIBSVM: a Library for Support Vector Machines", software available at http://www.csie.ntu.edu.tw/~cjlin/libsvm, 2001.