Near-duplicate keyframes (NDKs) play a unique role in large-scale video search and in news topic detection and tracking. In this paper, we propose a novel NDK retrieval approach that explores both visual and textual cues, drawn from a visual vocabulary and from semantic context respectively. The vocabulary, which provides entries for visual keywords, is formed by clustering local keypoints. The semantic context is inferred from the speech transcript surrounding a keyframe. We examine the usefulness of visual keywords and semantic context, separately and jointly, using cosine similarity and language models. By linearly fusing both modalities, we report performance improvements over keypoint-matching techniques. Whereas matching suffers from expensive computation due to the need for online nearest-neighbor search, our approach is effective and efficient enough for online video search.
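The bag-of-visual-words comparison described above can be sketched as follows. This is a minimal illustration, not the paper's implementation: each keyframe is assumed to be represented by the list of visual-word IDs assigned to its local keypoints (the IDs in the example are hypothetical), and similarity is the standard cosine between the resulting term-frequency histograms.

```python
from collections import Counter
from math import sqrt

def cosine_similarity(words_a, words_b):
    """Cosine similarity between two bag-of-visual-words histograms.

    words_a / words_b: lists of visual-word IDs assigned to the
    keypoints of two keyframes (hypothetical vocabulary entries).
    """
    a, b = Counter(words_a), Counter(words_b)
    # Dot product over the visual words the two keyframes share.
    dot = sum(a[w] * b[w] for w in a.keys() & b.keys())
    norm = sqrt(sum(v * v for v in a.values())) * sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

# Two keyframes sharing most of their visual words score close to 1.
print(cosine_similarity([3, 3, 7, 9, 12], [3, 7, 9, 12, 12]))
```

Because the histograms are precomputed offline, scoring a query against the database needs no online nearest-neighbor search over raw keypoint descriptors.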


Figure 1: Near-duplicate keyframe retrieval with visual and semantic similarity

Figure 2: Proposed framework for near-duplicate keyframe retrieval

Figure 3: Performance comparison of different measures (T – context similarity, V – visual similarity, C – cosine, D – Dirichlet smoothing, S – shrinkage smoothing, M – mixture model). Curves: measures on context similarity, measures on visual similarity, OOS, and the combination of visual and contextual similarity.

Table 1: Speed efficiency

















Figure 4: Non-NDK pairs with high visual similarity on visual keywords but different semantic contexts


  • Bag-of-words representation is both effective and efficient for NDK retrieval. Both cosine similarity and language models show reasonably good performance on visual keywords.

  • There is no obvious winner between cosine similarity and language models on visual keywords. Cosine similarity appears robust and involves no parameter setting. As in text retrieval [18], we find that language models are sensitive to smoothing techniques and their parameters.

  • The mixture model estimates the probability of visual keywords accurately and demonstrates the best performance among all measures. Its retrieval precision indeed approaches that of keypoint-matching techniques, while its speed is even faster than the baseline retrieval with color moments.

  • Semantic context is a useful cue for NDK retrieval. By complementing visual keywords with semantic context, performance can exceed that of keypoint-matching techniques while retaining the merit of speed efficiency.

  • Using both visual keywords and semantic context, online and accurate retrieval of NDK pairs becomes feasible. This also suggests efficient ways of mining NDKs in large video databases for online large-scale video search.
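The language-model scoring and the linear fusion discussed above can be sketched as follows. This is an assumption-laden illustration, not the paper's code: the query-likelihood model uses the standard Dirichlet smoothing formula p(w|d) = (c(w,d) + μ·p(w|C)) / (|d| + μ) with a conventional default μ = 2000 (not the paper's tuned value), and the fusion weight `lam` is likewise a hypothetical parameter that would be set empirically.

```python
from collections import Counter
from math import log

def dirichlet_lm_score(query_words, doc_words, collection_counts, mu=2000.0):
    """Query-likelihood score of a document (keyframe) for a query,
    with Dirichlet smoothing over collection statistics.

    collection_counts: word -> total count over the whole collection.
    mu: smoothing parameter (mu=2000 is a common default, assumed here).
    """
    doc = Counter(doc_words)
    doc_len = sum(doc.values())
    coll_len = sum(collection_counts.values())
    score = 0.0
    for w in query_words:
        p_coll = collection_counts.get(w, 0) / coll_len  # p(w|C)
        p = (doc.get(w, 0) + mu * p_coll) / (doc_len + mu)  # smoothed p(w|d)
        if p > 0:
            score += log(p)
    return score

def fuse(visual_sim, context_sim, lam=0.5):
    """Linear fusion of visual and context similarities.
    lam is a hypothetical weight; in practice it is tuned empirically."""
    return lam * visual_sim + (1 - lam) * context_sim
```

A document that shares the query's visual words scores strictly higher than one that does not, even under heavy smoothing, and the fused score lets semantic context break ties between visually similar but semantically unrelated keyframes.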


  • Xiao Wu, Wan-Lei Zhao and Chong-Wah Ngo
    Near-Duplicate Keyframe Retrieval with Visual Keywords and Semantic Context
    ACM International Conference on Image and Video Retrieval (ACM CIVR’07), Amsterdam, July 2007.
    Full Text: [PDF, 305K]
  • Xiao Wu, Wan-Lei Zhao and Chong-Wah Ngo
    Efficient Near-Duplicate Keyframe Retrieval with Visual Language Models
    IEEE International Conference on Multimedia & Expo (IEEE ICME’07), Beijing, July 2007.