Abstract:

An overwhelming volume of news video from different channels and languages is available today, which demands automatic management of this abundant information. To effectively search, retrieve, browse, and track cross-lingual news stories, a news story similarity measure plays a critical role in assessing the novelty and redundancy among them. In this paper, we explore novelty and redundancy detection with visual duplicates and speech transcripts for cross-lingual news stories. News stories are represented by a sequence of keyframes in the visual track and a set of words extracted from the speech transcript in the audio track. A major difference from pure text documents is that the number of keyframes in one story is relatively small compared to the number of words, and a large number of keyframes are not near-duplicates. These properties make similarity measures behave differently than they do on traditional textual collections. Furthermore, the textual and visual features of news stories complement each other and can be combined to boost performance. Experiments on the TRECVID-2005 cross-lingual news video corpus show that approaches based on textual features and visual features demonstrate different performance, and that measures on visual features are quite effective. Overall, cosine distance on keyframes remains a robust measure, language models built on visual features demonstrate promising performance, and the fusion of textual and visual features improves overall performance.
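As a concrete illustration of the cosine measure discussed above, a story can be treated as a bag of tokens, where a token is either a transcript word or the ID of a near-duplicate-keyframe group. The sketch below is a minimal, generic version of that idea; the function names and the 0.5 redundancy threshold are illustrative assumptions, not the paper's exact settings:

```python
from collections import Counter
from math import sqrt

def cosine_similarity(story_a, story_b):
    """Cosine similarity between two stories, each given as a list of
    tokens (transcript words, or near-duplicate-keyframe group IDs)."""
    ta, tb = Counter(story_a), Counter(story_b)
    dot = sum(ta[t] * tb[t] for t in ta)
    norm_a = sqrt(sum(v * v for v in ta.values()))
    norm_b = sqrt(sum(v * v for v in tb.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def is_redundant(new_story, history, threshold=0.5):
    """Flag a story as redundant if it is too similar to any earlier
    story; the threshold value here is an arbitrary placeholder."""
    return any(cosine_similarity(new_story, old) >= threshold
               for old in history)
```

In a novelty-detection setting, `history` would hold the token bags of all previously seen stories, and a story failing the redundancy test would be reported as novel.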

Figures:

 Figure 1: Novelty and redundancy detection in multi-lingual multimedia environment (in which NDK means near-duplicate keyframes)

Figure 2: Near-duplicate keyframes appearing in three news stories from different channels (tf: term frequency of keyframe, df: document frequency of keyframe).


Figure 3: Performance comparison of uni-modal and multi-modal approaches (cosine distance, language models on text, language models on visual features, and fusion of textual and visual features). The first eight methods are uni-modal methods on textual (T) and visual (V) features, respectively, while the last eight are fusions of textual and visual features denoted by MeasureT+MeasureV pairs, where the measures are C – Cosine, D – Dirichlet smoothing, S – Shrinkage smoothing, M – Mixture model.

Conclusion

•  Due to the special properties of keyframes (the small number of keyframes in each story and the large number of non-near-duplicate keyframes), approaches based on visual information behave differently from traditional text-based methods.

•  Cosine distance on visual features performs better than the other uni-modal measures (on either text or visual features).

•  Language models on visual features are effective, but their performance depends on the accuracy of near-duplicate keyframe detection.

•  Visual language models are less sensitive to smoothing techniques, and they outperform text-based measures when both completely and somewhat redundant stories are counted in the redundancy metric.

•  Combining textual and visual information improves performance over either modality alone.
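The conclusions above rest on two ingredients that can be sketched minimally: a Dirichlet-smoothed unigram language model scored over visual "terms" (near-duplicate-keyframe IDs), and a late linear fusion of text-based and visual-based similarity scores. The parameter values below (`mu`, the fusion weight `w`) are illustrative assumptions, not the paper's tuned settings:

```python
from collections import Counter
from math import log

def dirichlet_lm_score(query_tokens, doc_tokens,
                       collection_tf, collection_len, mu=1000):
    """Log-likelihood of a query story under a Dirichlet-smoothed
    unigram model of a document story; tokens may be transcript words
    or near-duplicate-keyframe IDs. mu is a placeholder value."""
    tf = Counter(doc_tokens)
    doc_len = len(doc_tokens)
    score = 0.0
    for t in query_tokens:
        p_coll = collection_tf.get(t, 0) / collection_len  # background prob.
        p = (tf[t] + mu * p_coll) / (doc_len + mu)          # smoothed prob.
        if p > 0:
            score += log(p)
    return score

def fuse(text_sim, visual_sim, w=0.5):
    """Late fusion: weighted linear combination of the text-based and
    visual-based similarity scores; w is a free parameter to tune."""
    return w * text_sim + (1 - w) * visual_sim
```

Here `collection_tf`/`collection_len` summarize the background collection used for smoothing, and `fuse` corresponds to the MeasureT+MeasureV pairs in Figure 3, with each uni-modal score computed separately and then linearly combined.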

Reference:

  • Xiao Wu, Alexander G. Hauptmann and Chong-Wah Ngo
    Novelty Detection for Cross-Lingual News Stories with Visual Duplicates and Speech Transcripts
    ACM International Conference on Multimedia (ACM MM’07), Augsburg, Germany, Sep. 2007 (oral).