Yu-Gang Jiang1, Chong-Wah Ngo1, and Jun Yang2
1 Video Retrieval Group (VIREO), City University of Hong Kong, Hong Kong
2 School of Computer Science, Carnegie Mellon University, USA
yjiang AT ee.columbia.edu, cwngo AT cs.cityu.edu.hk
+ Feature extraction code added to the download section of this page [08/15/2010]
+ Detection scores on the TRECVID 2010 test set available from the CU-VIREO374 download site [08/10/2010]
Overview:
Video concept detection aims to rank video shots according to the presence of semantic concepts (e.g., "sports", "charts", "people marching"). These concepts can act as semantic filters for online video search. For example, the query "find military vehicle" can be answered by returning the video shots most likely to contain the concepts "military" and "vehicle". Recently, a number of concept detection systems have been developed and tested on the NIST TREC Video Retrieval Evaluation (TRECVID) benchmarks [1]. In addition, the LSCOM effort has defined 1000+ semantic concepts and annotated 400+ of them over a set of broadcast news videos [2], providing a good resource for researchers in the field to develop large-scale concept detectors. However, even though the labeled training samples are publicly available, developing detectors for such a large number of concepts remains difficult and time-consuming.
A concept detection system generally contains three components: feature extraction, uni-modality learning (e.g., using SVMs), and multi-modality fusion. To reduce the effort of replicating similar baseline systems, the MediaMill team at the University of Amsterdam released 101 concept detectors [3], and the DVMM lab at Columbia University released 374 LSCOM concept detectors (Columbia374) [4]. Since their aim is to reduce the baseline replication effort, the visual features used in Columbia374 and MediaMill-101 are simple global ones such as color moments and Gabor texture. In our recent work [5], we showed that local keypoint features, with careful selection of representation choices, are very effective for concept detection. With the goal of stimulating new concept detection techniques and providing better concept detectors for video search, we extended our system to detect the 374 LSCOM concepts [2, 4], producing VIREO-374. We release the detectors, as well as features and detection scores on several recent data sets, to the multimedia community.
Implementation Details:
The VIREO-374 detectors were trained on the TRECVID-2005 development data using the LSCOM annotations [2]. Below are the implementation details of these detectors. Note that the newly released detection scores on the TRECVID 2010 (TV10) data were produced by different detectors (see the CU-VIREO374 download site).
Feature:
Local feature: We used the DoG detector and the SIFT descriptor for keypoint detection and description. A visual vocabulary of 500 visual words was generated by clustering a set of ~500k SIFT features. With this vocabulary, a keyframe can then be represented by a 500-d feature vector, analogous to the bag-of-words representation of text documents. We used a soft-weighting scheme to weight the significance of each visual word in the keyframe, which we have demonstrated to be more effective than the traditional TF/TF-IDF weighting schemes [5]. For more details of our keypoint-based video frame representation, please refer to [5]. A sketch of the soft-weighting idea is given below.
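To make the scheme concrete, the following is a minimal sketch of soft-weighting for one keyframe, assuming each keypoint contributes to its top-4 nearest visual words with a rank-discounted similarity; the similarity function below is only a placeholder, and the exact measure is defined in [5]:

    import numpy as np

    def soft_weight_bow(descriptors, vocab, n_neighbors=4):
        # descriptors: (num_keypoints, 128) SIFT descriptors of one keyframe
        # vocab:       (500, 128) visual-word centers from clustering
        hist = np.zeros(len(vocab))
        for d in descriptors:
            dist = np.linalg.norm(vocab - d, axis=1)   # L2 distance to all words
            nearest = np.argsort(dist)[:n_neighbors]   # top-N nearest visual words
            for i, w in enumerate(nearest):
                sim = 1.0 / (1.0 + dist[w])            # placeholder similarity; [5] defines its own
                hist[w] += sim / (2.0 ** i)            # contribution discounted by rank
        return hist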
As a comparison, and also to evaluate the fusion performance of the local feature with traditional color/texture features, we also implemented two global features: grid-based (5 by 5) color moments in Lab color space (225-d) and grid-based (3 by 3) wavelet texture (81-d).
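Since 225 = 5 x 5 cells x 3 Lab channels x 3 moments, the color moments are presumably the first three (mean, standard deviation, skewness) per channel per cell; under that assumption, a minimal sketch:

    import cv2
    import numpy as np

    def grid_color_moments(image_bgr, grid=5):
        # 5x5 grid-based color moments in Lab space -> 5*5*3*3 = 225-d
        lab = cv2.cvtColor(image_bgr, cv2.COLOR_BGR2LAB).astype(np.float64)
        h, w = lab.shape[:2]
        feat = []
        for gy in range(grid):
            for gx in range(grid):
                cell = lab[gy*h//grid:(gy+1)*h//grid, gx*w//grid:(gx+1)*w//grid]
                for c in range(3):
                    ch = cell[:, :, c].ravel()
                    mean = ch.mean()
                    std = ch.std()
                    skew = np.cbrt(((ch - mean) ** 3).mean())  # cube root of third central moment
                    feat.extend([mean, std, skew])
        return np.array(feat)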
Classifier:
The LibSVM package [6] was used for model training and prediction on test data. We used the Chi-square kernel for the local feature and the RBF kernel for the two baseline features. The Chi-square kernel is an extension of the Gaussian RBF kernel in which the L2 distance is replaced with the Chi-square distance. For more details on kernel choice, please refer to [5].
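For concreteness, a sketch of the Chi-square kernel using the RBF-style convention K(x, y) = exp(-g * chi2(x, y)); whether this matches the exact normalization used should be checked against [5]. A Gram matrix computed this way can be passed to LibSVM as a precomputed kernel (-t 4):

    import numpy as np

    def chi_square_kernel(X, Y, gamma):
        # X: (n, d), Y: (m, d) non-negative histograms (e.g., BoW vectors)
        # K(x, y) = exp(-gamma * sum_i (x_i - y_i)^2 / (x_i + y_i))
        K = np.zeros((len(X), len(Y)))
        for i, x in enumerate(X):
            num = (x - Y) ** 2                  # (m, d)
            den = np.maximum(x + Y, 1e-10)      # guard against division by zero
            K[i] = np.exp(-gamma * (num / den).sum(axis=1))
        return K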
It is well known that the selection of parameters in SVMs affects performance. There are two important parameters: C (the cost parameter in soft-margin SVMs) and g (the width of the Gaussian kernel). Grid search is a common yet time-consuming way to find (near-)optimal parameters. In this work, we only slightly adjusted g around 1/d, where d is the average distance among a set of training samples, and then fixed it for all the concepts. Although this may not be the optimal choice, we have observed that the performance is quite similar to that of the optimal parameters determined by grid search, while the training time is significantly reduced. The parameters we used for the three features are: 1) color moments: -c 8, -g 0.2; 2) wavelet texture: -c 8, -g 0.44; 3) local feature: -c 8, -g 0.0038.
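A minimal sketch of the 1/d heuristic; sampling random pairs instead of all pairs is our shortcut for illustration:

    import numpy as np

    def estimate_gamma(features, dist_fn, n_pairs=1000, seed=0):
        # g ~= 1/d, with d the average distance among training samples
        rng = np.random.default_rng(seed)
        n = len(features)
        dists = []
        while len(dists) < n_pairs:
            i, j = rng.integers(0, n, size=2)
            if i != j:
                dists.append(dist_fn(features[i], features[j]))
        return 1.0 / np.mean(dists)

    # e.g., with the chi-square distance used for the local feature:
    # chi2 = lambda x, y: (((x - y) ** 2) / np.maximum(x + y, 1e-10)).sum()
    # g = estimate_gamma(train_bow, chi2)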
In addition, note that in our feature representation, each dimension of the two global features was scaled to [-1, 1] using the svm-scale tool of LibSVM. No scaling was applied to the keypoint feature.
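If svm-scale is not at hand, the same per-dimension scaling can be reproduced as follows (a sketch; the ranges must be fit on the training set and reused on test data):

    import numpy as np

    def fit_minmax_scaler(train):
        # per-dimension linear mapping of [min, max] to [-1, 1],
        # mirroring what LibSVM's svm-scale tool does
        lo, hi = train.min(axis=0), train.max(axis=0)
        span = np.where(hi > lo, hi - lo, 1.0)   # guard constant dimensions
        return lambda X: 2.0 * (X - lo) / span - 1.0

    # scale = fit_minmax_scaler(train_features)
    # train_scaled, test_scaled = scale(train_features), scale(test_features)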
Output File Format:
We adopted the probability outputs of LibSVM (use the "-b" option in LibSVM for training and prediction). For each concept, there are three score files containing the probability outputs of the three SVM classifiers, trained on the three features respectively. In each score file, the probabilities are listed in a single column, and a separate file named "shotlist.txt" contains the shot name corresponding to each row of the score files.
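A small sketch of how the score files can be consumed; the file names in the usage comment are hypothetical, and the average fusion at the end mirrors the fusion used for our best run:

    import numpy as np

    def load_scores(score_file, shotlist_file="shotlist.txt"):
        # pair each single-column probability with its shot name
        with open(shotlist_file) as f:
            shots = [line.strip() for line in f]
        scores = np.loadtxt(score_file)
        assert len(shots) == len(scores)
        return dict(zip(shots, scores))

    # cm, wt, bow = (load_scores(f) for f in ("cm.txt", "wt.txt", "bow.txt"))  # hypothetical names
    # fused = {s: (cm[s] + wt[s] + bow[s]) / 3.0 for s in cm}
    # ranking = sorted(fused, key=fused.get, reverse=True)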
Detection Results and Comparison:
On the TRECVID-2006 Benchmark:
Figure: Performance of our results (red) and all official TRECVID-2006 concept detection systems (purple & yellow). Each team may submit up to 6 runs (30 teams in total). For our best result (the red bar on the far right), we used average fusion to combine the different feature modalities.
Top 200 Ranked Keyframes (Shots) on TRECVID 2006 Test Data:
Local feature only (MAP on the TRECVID '06 benchmark: 0.119):
Average fusion of color/texture and local features (MAP on the TRECVID '06 benchmark: 0.154):
Downloads:
374 SVM models, features of TRECVID 05&06, and detection scores on TRECVID 06 test data:
To download the above items, please click here (User-ID: vireodetector, Password: vireo1234).
Feature Extraction Code:
Feature extraction code for computing the soft-weighted bag-of-visual-words feature (described in [5]) is available here. This feature can be directly used for concept detection with the VIREO-374 SVM models.
References:
[1] A. F. Smeaton, P. Over, and W. Kraaij, "Evaluation campaigns and TRECVid", in Proceedings of the 8th ACM International Workshop on Multimedia Information Retrieval (MIR'06), Santa Barbara, USA, October 2006, pp. 321-330.
[2] "LSCOM Lexicon Definitions and Annotations", in DTO Challenge Workshop on Large Scale Concept Ontology for Multimedia, Columbia University ADVENT Technical Report #217-2006-3, 2006.
[3] C. G. M. Snoek, M. Worring, J. C. van Gemert, J.-M. Geusebroek, and A. W. M. Smeulders, "The Challenge Problem for Automated Detection of 101 Semantic Concepts in Multimedia", in ACM Multimedia, Santa Barbara, USA, October 2006, pp. 421-430.
[4] A. Yanagawa, S.-F. Chang, L. Kennedy, and W. Hsu, "Columbia University's Baseline Detectors for 374 LSCOM Semantic Visual Concepts", Columbia University ADVENT Technical Report #222-2006-8, March 2007.
[5] Y.-G. Jiang, C.-W. Ngo, and J. Yang, "Towards Optimal Bag-of-Features for Object Categorization and Semantic Video Retrieval", in ACM International Conference on Image and Video Retrieval (CIVR'07), Amsterdam, The Netherlands, 2007.
[6] C.-C. Chang and C.-J. Lin, "LIBSVM: a Library for Support Vector Machines", software available at http://www.csie.ntu.edu.tw/~cjlin/libsvm, 2001.