Multimedia Copy Detection Terminology
The task of multimedia copy detection is to automatically determine whether a given multimedia object A (image, audio or video) is similar to another object B that is part of a large collection of multimedia objects. In the literature, object A is often called a copy or duplicate of the original content (i.e. object B). A major difficulty of the copy detection problem is that the copied segment is usually not an exact reproduction of the original content, but rather a transformed copy. This is why the term near-duplicate is also used to refer to the copied segment.
Another term commonly used to describe the task of detecting a multimedia copy is Content-Based Copy Detection (CBCD). This term describes a specific copy detection approach that uses the content of the multimedia document itself to detect a copy, in contrast to the watermarking approach, which inserts a watermark into the original digital document to allow its subsequent detection.
The CBCD approach is also widely known as multimedia fingerprinting since specific features called fingerprints are extracted from the multimedia content and used thereafter to establish the similarity of two digital documents. Although CBCD and multimedia fingerprinting are the most common terms used to refer to this approach, the research literature includes other names such as robust hashing, perceptual hashing, robust matching, and passive watermarking.
Structure of a Multimedia Fingerprinting System
A common design of a multimedia fingerprinting system includes two principal components: a method to extract fingerprints that describe specific properties of an audio/video signal, and a method to search for the fingerprints of an unknown audio/video signal in a large dataset of fingerprints.
During the first stage, the system extracts fingerprints from each audio/video reference file and stores them in a reference fingerprint database. In the second stage, the system extracts fingerprints from the audio/video query, using the same extraction algorithm that was applied to the reference files, and scans the reference fingerprint database for potentially similar fingerprints. Unlike the first stage, which can be processed in advance (off-line), the second stage is performed on-line and therefore directly affects the efficiency of the system.
It is worth noting that the architecture described above is only the overall structure of a multimedia fingerprinting system. A more realistic architecture includes additional components depending on the kind of multimedia objects to be detected (audio or video), the nature and complexity of the transformations applied to the query, the feature extraction approach, the fingerprint matching algorithm, etc.
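The two stages can be illustrated with a minimal Python sketch. Everything in it is an assumption made for illustration: the toy extractor (one bit per sample, set when the sample exceeds its frame mean), the function names, and the exact-match voting are hypothetical and are not the extraction or matching methods of any system discussed here.

```python
from collections import defaultdict


def extract_fingerprints(signal, frame_len=4):
    """Toy extractor (hypothetical): one bit per sample in each frame,
    set when the sample is above the frame mean."""
    fingerprints = []
    for start in range(0, len(signal) - frame_len + 1, frame_len):
        frame = signal[start:start + frame_len]
        mean = sum(frame) / frame_len
        fingerprints.append(tuple(int(x > mean) for x in frame))
    return fingerprints


def build_reference_db(references):
    """Stage 1 (off-line): index every fingerprint of every reference."""
    db = defaultdict(list)
    for ref_id, signal in references.items():
        for pos, fp in enumerate(extract_fingerprints(signal)):
            db[fp].append((ref_id, pos))
    return db


def search(db, query_signal):
    """Stage 2 (on-line): same extractor, then vote for matching references."""
    votes = defaultdict(int)
    for fp in extract_fingerprints(query_signal):
        for ref_id, _pos in db.get(fp, []):
            votes[ref_id] += 1
    return max(votes, key=votes.get) if votes else None


references = {
    "ref_a": [1, 5, 2, 8, 9, 3, 7, 1],
    "ref_b": [4, 4, 9, 1, 2, 2, 3, 9],
}
db = build_reference_db(references)

# A louder copy of ref_a (every sample doubled) keeps the same bit pattern.
query = [2 * x for x in references["ref_a"]]
best = search(db, query)
no_match = search(db, [0, 0, 0, 0])
```

The reference database is built once, while `search` runs for every query, which is why the cost of the second stage dominates the efficiency of the system.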
Properties of a Multimedia Fingerprint
A multimedia fingerprint is a unique identifier representing a “signature” extracted from the content of an audio/video signal. In general, it represents the basic element of a multimedia fingerprinting system, on which the rest of the system is built.
The robustness and efficiency of the system are considerably influenced by the design of the fingerprints. Thus, the feature extraction approach should be designed in a way to guarantee the generation of fingerprints that hold specific characteristics. Regardless of the feature extraction approach, a fingerprint should ideally have the following properties:
Robustness: The robustness of a fingerprint represents its resistance to signal degradation. Thus, a fingerprint generated from the original audio/video content should be similar to the fingerprint generated from a distorted copy of the same content.
Uniqueness: The uniqueness of a fingerprint reflects its discrimination capability over a large number of fingerprints. This property ensures that fingerprints generated from two different signals are distinct. The robustness and uniqueness of a fingerprint are two conflicting properties, and usually a trade-off is made between them. In fact, increasing the invariance of the fingerprint to signal degradation decreases its sensitivity to signal change (i.e. the fingerprint becomes less discriminative). On the other hand, increasing the discrimination ability of a fingerprint decreases its ability to survive in the presence of noise.
Compactness: The size of a fingerprint can have a significant impact on the memory/storage requirements and on the system efficiency. Thus, the fingerprint representation should be small so as to decrease the complexity of the system and reduce the memory space required to store a large number of fingerprints. At the same time, however, the fingerprint representation should retain the most relevant signal information, so that the robustness and the discrimination power of the fingerprint are maintained.
Easy to compute: The fingerprint extraction algorithm should have low computational complexity. This fingerprint property is very important, especially for on-line applications that require real-time detection.
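The trade-off between robustness and uniqueness can be made concrete with a small Python sketch. It assumes binary fingerprints compared by Hamming distance against a decision threshold; the bit strings, the helper names and the thresholds are hypothetical values chosen only for illustration.

```python
def hamming(a, b):
    """Number of differing bits between two equal-length bit strings."""
    return sum(x != y for x, y in zip(a, b))


def is_match(a, b, threshold):
    """Declare a copy when at most `threshold` bits differ."""
    return hamming(a, b) <= threshold


original = "1011001110001011"
distorted = "1011011110001001"  # same content, 2 bits flipped by distortion
unrelated = "1010011010101011"  # different content, 4 bits away

for threshold in (0, 2, 4):
    print(
        threshold,
        is_match(original, distorted, threshold),
        is_match(original, unrelated, threshold),
    )
```

With a threshold of 0 the distorted copy is missed (no robustness); with a threshold of 4 the unrelated fingerprint is accepted as well (no uniqueness). Only the intermediate threshold separates the two cases in this toy example.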
Applications of Multimedia Copy Detection
The field of multimedia copy detection has seen growing scientific interest in the last decade, resulting in significant performance improvements. Research in this area has increased industrial interest in exploiting this technology to create practical applications. In fact, multimedia fingerprinting technology has allowed the development of several real-world applications, some of which have been successfully commercialized. Below we list three application scenarios where this technology can be applied.
Copyright protection: Automatically detecting illegal copies of protected digital content is among the applications where the multimedia fingerprinting technology offers an excellent solution. Nowadays, most of the content providers adopt the multimedia fingerprinting technology to detect and filter illegal copies. Examples of these content providers include YouTube (Youtube, 2015), Vimeo (Vimeo, 2015), Yahoo! (Yahoo, 2015) and Dailymotion (Dailymotion, 2015). In addition, a large number of companies provide their services to automatically monitor a large number of media sharing platforms in order to detect illegal copies of protected content (e.g. music, films, TV shows, etc.). Audible Magic (Audiblemagic, 2015) is an example of such companies; their multimedia copy detection product is used by a large number of companies like Facebook (Facebook, 2015), Disney (Disney, 2015) and SoundCloud (Soundcloud, 2015).
Broadcast monitoring: Advertisers are interested in monitoring radio, TV and web broadcasts to track their advertisements and verify that they are being broadcast as agreed. Multimedia fingerprinting technology also allows companies to follow the advertising campaigns of their competitors for business intelligence purposes.
Vobile (Vobile, 2015) is among the companies that provide this kind of service. Their products use multimedia fingerprinting technology to automatically identify audiovisual content.
Music identification: The popularity of smartphones, together with the maturity of multimedia fingerprinting technology, has made this kind of application a huge success with music lovers. Such an application recognizes a song in real time using a smartphone that records a short segment of the music being played. It provides the user with all the information related to the recognized song and enables the user to buy the song directly. Shazam (Shazam) is the best-known company that offers this service, with more than 500 million users.
Although a large number of the proposed methods focus on either audio or video fingerprints, some are based on multimodal features, in which both audio and visual information are used to detect video copies.
A good multimodal feature representation using complementary audio features, local visual features and global visual features is described in (Mou et al., 2013). The audio part of this system is based on the Weighted Audio Spectrum Flatness (WASF) features introduced in (Chen and Huang, 2008). These features extend the MPEG-7 descriptor by introducing the Human Auditory System (HAS) functions to weight the audio signal. Mou et al. extract 14-dimensional WASF features from each audio frame of 60 ms. They then combine the WASF features generated from 198 audio frames and reduce them to a 72-dimensional vector using a technique specified in the MPEG standard (Carpentier, 2005). The resulting 72-dimensional vector represents the signature of a four-second audio clip.
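The figures quoted above can be checked with a short calculation. The text does not state the hop between consecutive frames; the sketch below assumes a 20 ms hop, which is simply one value that makes 198 frames of 60 ms span four seconds, and should not be read as the setting actually used by Mou et al.

```python
FRAME_MS = 60       # frame size stated in the text
NUM_FRAMES = 198    # frames combined into one signature
WASF_DIM = 14       # WASF values per frame
SIGNATURE_DIM = 72  # signature size after reduction


def clip_duration_ms(num_frames, frame_ms, hop_ms):
    """Time span covered by overlapping frames taken every hop_ms."""
    return (num_frames - 1) * hop_ms + frame_ms


# Without overlap, 198 frames would cover far more than four seconds.
no_overlap_ms = clip_duration_ms(NUM_FRAMES, FRAME_MS, hop_ms=FRAME_MS)

# Assumption (not from the source): a 20 ms hop between frames.
ASSUMED_HOP_MS = 20
overlap_ms = clip_duration_ms(NUM_FRAMES, FRAME_MS, hop_ms=ASSUMED_HOP_MS)

# Size of the raw feature block and how much the reduction compresses it.
raw_values = NUM_FRAMES * WASF_DIM
compression_ratio = raw_values / SIGNATURE_DIM

print(no_overlap_ms, overlap_ms, raw_values, compression_ratio)
```

Under this assumption the frames overlap by two thirds, and the reduction step compresses 2772 raw values into a 72-dimensional signature.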
For the video part, dense color SIFT (DC-SIFT) (Bosch, Zisserman and Munoz, 2008) is used as the local visual feature, whereas the global visual feature is based on a DCT feature (similar to the DCT introduced in (Ching-Yung and Shih-Fu, 2001)). The similarity search is performed using a temporal pyramid-matching algorithm, and several techniques are employed to speed up the search: Locality Sensitive Hashing (LSH) (Indyk and Motwani, 1998) is used to index the DCT and WASF features, and a bag-of-words model converts each DC-SIFT vector into a visual word that is stored in an inverted index.
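The bag-of-words step with an inverted index can be sketched in Python as follows. This is only an illustration of the general idea: the two-dimensional descriptors, the three-entry codebook and the function names are hypothetical, a real codebook would be learned from training descriptors, and the temporal pyramid matching and LSH parts of the system are not modeled.

```python
from collections import Counter, defaultdict


def nearest_word(descriptor, codebook):
    """Index of the codebook entry closest to the descriptor (squared L2)."""
    def sq_dist(centroid):
        return sum((d - c) ** 2 for d, c in zip(descriptor, centroid))
    return min(range(len(codebook)), key=lambda i: sq_dist(codebook[i]))


def build_inverted_index(videos, codebook):
    """Map each visual word to the set of videos that contain it."""
    index = defaultdict(set)
    for video_id, descriptors in videos.items():
        for descriptor in descriptors:
            index[nearest_word(descriptor, codebook)].add(video_id)
    return index


def query(index, descriptors, codebook):
    """Rank videos by the number of query descriptors sharing a visual word."""
    votes = Counter()
    for descriptor in descriptors:
        for video_id in index.get(nearest_word(descriptor, codebook), ()):
            votes[video_id] += 1
    return votes.most_common()


# Hypothetical codebook of three visual words.
codebook = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0)]
videos = {
    "v1": [(0.1, 0.0), (0.9, 0.1)],
    "v2": [(0.0, 0.9), (0.1, 1.1)],
}
index = build_inverted_index(videos, codebook)
ranking = query(index, [(0.2, 0.1), (1.1, -0.1)], codebook)
```

Only the videos listed under the visual words of the query are visited, which is what makes the inverted index faster than comparing the query against every reference descriptor.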
Table of Contents
CHAPTER 1 BACKGROUND
1.1 Multimedia Copy Detection Concepts
1.1.1 Multimedia Copy Detection Terminology
1.1.2 Structure of a Multimedia Fingerprinting System
1.1.3 Properties of a Multimedia Fingerprint
1.1.4 System Requirements
1.1.5 Applications of Multimedia Copy Detection
1.2 Datasets and Evaluation Metrics
1.2.2 Evaluation metrics
1.3 Related Work
1.3.1 Audio Fingerprinting
1.3.2 Video Fingerprinting
1.3.3 Audio+Video Systems
1.3.4 Accelerating Fingerprints Search
CHAPTER 2 AUDIO FINGERPRINTING
2.1 System Overview
2.2 Spectrogram Generation
2.3 Fingerprint Generation
2.3.1 Global Mean Fingerprint
2.3.2 Local Mean Fingerprint
2.3.3 Salient-Regions Fingerprint
2.4 Query Fingerprint Generation
2.5.1 Similarity Search
2.5.1.1 Similarity Measure for Global Mean and Local Mean Fingerprints
2.5.1.2 Similarity Measure for Salient-Regions Fingerprint
2.6 Results and Analysis
2.6.1 Results for Global Mean Fingerprint
2.6.2 Results for Local Mean Fingerprint
2.6.3 Combined Results from Global Mean and Local Mean Fingerprints
2.6.4 Results for Salient-Regions Fingerprint
2.6.5 Comparative Audio Copy Detection Systems
CHAPTER 3 VIDEO FINGERPRINTING
3.1 System Overview
3.1.1 Letterbox Detection
3.1.2 PiP Detection
3.1.3 Video Fingerprint Extraction
3.1.4 Matching Algorithm
3.2 Results and Analysis
3.2.1 Video Only Results
3.2.2 Audio+Video Results
CHAPTER 4 ACCELERATING THE AUDIO FINGERPRINT SEARCH USING A GPU AND A CLUSTERING-BASED TECHNIQUE
4.1 GPU Implementation of the Similarity Search
4.1.1 GPU Architecture
4.1.2 Similarity Search on GPU
4.1.3 Similarity Algorithms
4.2 Clustering-Based Technique
4.2.2 Lookup Table Construction
4.2.3 Matching Algorithm
4.2.4 Two-step Search
4.3 Results and Analysis
4.3.1 Run Times of GPU Implementations
4.3.2 Clustering-based Technique Performance
4.3.3 Validation Results on TRECVID 2009
4.3.5 Shazam versus CSR-44
LIST OF REFERENCES