CC_WEB_VIDEO: Near-Duplicate Web Video Dataset
Xiao Wu+, Chong-Wah Ngo+ and Alexander G. Hauptmann#
+Department of Computer Science, City University of Hong Kong
#School of Computer Science, Carnegie Mellon University
Introduction: With the exponential growth of social media in Web 2.0, the volume of videos transmitted and searched on the Internet has increased tremendously. Users can capture videos with mobile phones or video camcorders, or obtain videos directly from the web, and then redistribute them with some modifications. For example, users upload 65,000 new videos each day to the video sharing website YouTube, and daily video views exceeded 100 million in July 2006. Among these huge volumes of videos, there exist large numbers of duplicate and near-duplicate videos.
Based on a sample of 24 popular queries from YouTube, Google Video, and Yahoo! Video, on average 27% of the videos in the search results are redundant, that is, duplicates or near-duplicates of the most popular version of a video. For certain queries, the redundancy can be as high as 93% (see Table 1). As a consequence, users are often frustrated when they must spend a significant amount of time finding the videos of interest, going through different versions of duplicate or near-duplicate videos streamed over the Internet before arriving at an interesting video. An ideal solution would return a list that maximizes not only precision with respect to the query but also novelty (or diversity) within the query topic. To avoid overwhelming users with a large number of repeated copies of the same video in a search, efficient near-duplicate video detection and elimination is essential for effective search, retrieval, and browsing.
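The idea of a result list that favors novelty can be illustrated with a minimal sketch. It assumes that each search result already carries a near-duplicate group label; the tuple layout, the function name drop_near_duplicates, and the sample identifiers are hypothetical and are not part of the dataset or of any published method.

```python
def drop_near_duplicates(ranked_results):
    """Keep only the highest-ranked video from each near-duplicate group.

    ranked_results: list of (video_id, group_id) tuples in ranked order.
    Both fields are hypothetical; the group label is assumed to come from
    some near-duplicate detector that is not shown here.
    """
    seen_groups = set()
    novel = []
    for video_id, group_id in ranked_results:
        if group_id in seen_groups:
            continue  # a copy of this video already appears higher in the list
        seen_groups.add(group_id)
        novel.append(video_id)
    return novel


# Toy ranked list: three copies of group "A", one each of "B" and "C".
ranked = [("v1", "A"), ("v2", "A"), ("v3", "B"), ("v4", "A"), ("v5", "C")]
novel_list = drop_near_duplicates(ranked)
print(novel_list)
```

The sketch preserves the original ranking order, so precision-oriented ordering is kept while repeated copies are suppressed.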
This work is a collaboration between the VIREO group at City University of Hong Kong and the Informedia group at Carnegie Mellon University. The dataset is called CC_WEB_VIDEO, after the initials of City University of Hong Kong and Carnegie Mellon University. It was collected from the video sharing website YouTube and the video search engines Google Video and Yahoo! Video.
Furthermore, the social web provides much more than a platform for users to interact and exchange information. This has resulted in rich sets of context information associated with web videos. These context resources provide information complementary to the video content itself. In this dataset, in addition to the videos themselves, contextual information such as thumbnail images, tags, titles, and time durations is also provided.
Near-Duplicate Web Videos
Definition: Near-duplicate web videos are identical or approximately identical videos that are close to being exact duplicates of each other but differ in file format, encoding parameters, photometric variations (color, lighting changes), editing operations (caption, logo, and border insertion), length, or certain modifications (frames added or removed). A user would clearly identify the videos as "essentially the same".
A video is a duplicate of another if it looks the same, corresponds to approximately the same scene, and does not contain new and important information. Two videos do not have to be pixel-identical to be considered duplicates. A user searching for entertaining video content on the web might care about the overall content and subjective impression when filtering near-duplicate videos for more effective search. Exact duplicate videos are a special case of near-duplicate videos. Examples of near-duplicate web videos are shown in Figures 1 and 2.
Near-duplicate web videos can be mainly categorized into two classes:
1. Formatting differences
2. Content differences
Figure 2. Two videos of complex scene query "White and Nerdy" with complex transformations (only the first ten keyframes are displayed): logo insertion, geometric and photometric variations (lighting change, black border), and keyframes added/removed
We selected 24 queries designed to retrieve the most viewed and top favorite videos from YouTube. Each text query was issued to YouTube, Google Video, and Yahoo! Video. The videos were collected in November 2006. Videos longer than 10 minutes were removed, and the final dataset consists of 12,790 videos. The query information and the number of near-duplicates of the dominant version (the video appearing most frequently in the results) are listed in Table 1. For example, there are 1,771 videos in query 15 “White and Nerdy”, and among them there are 696 near-duplicates of the most common version in the result lists. Shot boundaries were detected and each shot was represented by a keyframe. In total there are 398,015 keyframes in the set.
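The per-query redundancy and the duration filter described above amount to simple arithmetic, sketched below. The video and near-duplicate counts for query 15 and the 10-minute limit come from the text; the function names, the record layout, and the sample durations are hypothetical and do not reflect the dataset's actual file format.

```python
MAX_DURATION_SECONDS = 10 * 60  # videos longer than 10 minutes were removed


def redundancy(near_duplicates, total_videos):
    """Fraction of a query's videos that are near-duplicates of the dominant version."""
    return near_duplicates / total_videos


def filter_by_duration(videos):
    """Keep only videos whose duration does not exceed the 10-minute limit."""
    return [v for v in videos if v["duration"] <= MAX_DURATION_SECONDS]


# Query 15, "White and Nerdy": 696 near-duplicates among 1,771 videos.
ratio = redundancy(696, 1771)
print(f"Query 15 redundancy: {ratio:.1%}")

# Hypothetical records; durations are given in seconds.
sample = [
    {"id": "a", "duration": 172},
    {"id": "b", "duration": 600},
    {"id": "c", "duration": 845},
]
kept_ids = [v["id"] for v in filter_by_duration(sample)]
print(kept_ids)
```

For query 15 this works out to roughly 39% redundancy, above the 27% average reported across the 24 queries.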
Table 1. 24 Video Queries Collected from YouTube, Google Video and Yahoo! Video (#: number of videos)
The VIREO_WEB_VIDEO dataset includes the following files:
Note: This dataset is for non-commercial research and/or educational purposes only. To obtain this dataset, you must fully agree to the following terms and conditions with complete understanding:
The video dataset can be obtained by sending us a request email. Specifically, researchers interested in the dataset should download, fill out, sign, and scan the Agreement and Disclaimer Form, and send it back to us (mail to: email@example.com). We will send you instructions via email to download the dataset at our discretion.
Download VIREO_WEB_VIDEO Web Video Dataset