
Fingerprint Algorithm

In the following paragraphs we present our program, which we have called Video Shazam in honor of the application that motivated this project. As we explained before, our program receives a noisy film scene and deduces which film the scene corresponds to, giving us the title, director, actors, etc. of the movie. The algorithm is divided into two parts: fingerprint creation and minimum least squared error film selection.

1) Fingerprint creation

The key to success in this project is a very high quality fingerprint of the video. But what is a fingerprint? In a few words, a fingerprint is something unique to a movie that enables us to identify the film quickly. The best example is the human fingerprint: governments keep huge databases of their citizens' fingerprints, so if a fingerprint is found at a crime scene, the authorities can match it against the database and discover who was there. In this project, we need something similar for each film in the database.

 

Fingerprint creation is necessary because we want our program to be quick and efficient. It would be very inefficient to compare, pixel by pixel, the frames of the noisy input video against the frames of each video in the database. Our database has 4 film scenes of 30 seconds each, and each frame has at least 240x320 pixels, so imagine how much time that pixel-by-pixel comparison would take.


After some time searching the internet for algorithms to create video fingerprints, we found a very interesting paper, Three-Dimensional Wavelet Based Video Fingerprint, which describes an efficient way of creating a fingerprint for a video. Our algorithm, which we explain below, is based on this paper, but we have implemented it in a different way. For those interested in reading the original paper, we have attached it below.

Three-Dimensional Wavelet Based Video Fingerprint

Our fingerprint creation algorithm has four principal parts: normalization, 2-D Haar wavelet transform, quantization, and vectorization. You can see the flow-chart in Figure 1.

Figure 1: Algorithm flow-chart

For anyone who wants to analyze our Matlab code, it is also attached here:

Fingerprint database creation

Projection matrices creation

Shazam GUI - includes the following functions:

AWGN noise function

Uniform noise function

Luminance change function

Cellphone recorded video function

NORMALIZATION

We have to take into account that the videos the program receives are not all the same size. They all need to be .mp4, but they can have different frame sizes and numbers of frames. For this reason, we normalize the video first, because we want all the fingerprints (vectors) to have the same structure and length. The normalization phase has two steps: resizing the video frames and converting them from RGB to grayscale.

We load the MP4 video into Matlab with the VideoReader function and read its frames, which gives us back a four-dimensional array: two dimensions for the frame pixels (X and Y axes), one for the RGB color channels, and the last one for time.
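As a minimal sketch of this step (the file name and variable names are ours, not taken from our attached code):

% Load an MP4 and read every frame into a 4-D uint8 array
% of size height x width x 3 (RGB) x numberOfFrames.
v = VideoReader('scene.mp4');
frames = read(v);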

 

The first thing we do is select a specific number of frames to create the fingerprint. This part has been an issue for us: our algorithm needs a recording of the whole scene with the mobile phone if the program is to guess the film, but even when we record the whole scene and the videos have the same duration, VideoReader does not always return the same number of frames. We decided to set the number of frames to the minimum over our sample videos; this is something we will need to improve in the future if we want to generalize the program. We keep the first 450 frames of each video, because our shortest video has 453 frames.

 

The second part of the normalization phase consists of converting each frame from RGB to grayscale; this way we obtain a 3-D array instead of the 4-D array we had at the beginning. We use the rgb2gray() function for this purpose.

 

Finally, we use the Matlab function imresize() so that all the video frames have the same size. We decided to resize all the videos to 240x320 pixels. Now the video is normalized and ready for the Haar wavelet transform.
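Putting the three steps together, a sketch of the normalization could look like this (variable names are ours):

% Truncate to the first 450 frames, convert each frame to
% grayscale, and resize it to 240x320 pixels.
NUM_FRAMES = 450;
normalized = zeros(240, 320, NUM_FRAMES);
for k = 1:NUM_FRAMES
    gray = rgb2gray(frames(:, :, :, k));             % drop the RGB dimension
    normalized(:, :, k) = imresize(gray, [240 320]); % uniform frame size
end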

2-D HAAR WAVELET TRANSFORM

In this module we receive a three-dimensional array with the normalized video. Now we want to apply a wavelet transform to each frame. The paper recommends applying a 3-D Haar wavelet transform to segments of the video, but we were having a lot of problems with the 3-D Haar wavelet transform, so we changed strategy and apply a 2-D Haar wavelet transform to each frame instead. This transform uses the Haar wavelet basis (Figure 2), which consists of a sequence of rescaled square-shaped functions. The wavelet transform helps overcome the obstacles that noise creates in image processing: the FFT describes the data only in the frequency domain, without considering time, while the wavelet transform describes it as a function of both scale and time, so we have more useful information for our video processing.

 

In Matlab there is a function for the two-dimensional wavelet transform: dwt2(x,'haar'), where x is a normalized frame. It returns four different outputs: cA (lowpass approximation), cH (horizontal details), cV (vertical details) and cD (diagonal details). For our fingerprint we use cV and ignore the rest. We use the vertical details because, after several tries with all of them, this is the one that has given us the best results.
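A sketch of this step (it assumes the Wavelet Toolbox is available; variable names are ours):

% Apply the 2-D Haar transform to each normalized frame and
% keep only the vertical-detail coefficients cV.
[~, ~, cV, ~] = dwt2(normalized(:, :, 1), 'haar');   % probe the output size
transformed = zeros([size(cV), NUM_FRAMES]);
for k = 1:NUM_FRAMES
    [~, ~, cV, ~] = dwt2(normalized(:, :, k), 'haar');
    transformed(:, :, k) = cV;                       % vertical details only
end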

QUANTIZATION

Video of Full Metal Jacket scene transformed

The purpose of the quantization module is to reduce the information to only what is vital for identifying the video; we extract the most important information from the transform. The paper doesn't specify how to do this, so after reading about several quantization methods on the internet we developed one that suits our objective. We explain it below.

 

Remember that what we receive from the previous module is a 3-D array containing the transformed frames.

 

First, we add up all the frames with the sum function in Matlab. The result is a matrix whose elements are the sums, over all transformed frames, of the pixels at the same position (Figure 3). We then compute the mean and the standard deviation of all the values in this matrix. Finally, we quantize the matrix: if a value is more than two standard deviations above the mean, we set it to one; otherwise we set it to zero.

 

 

Figure 3: Sum of frames

Now we have a matrix whose entries are only ones and zeros. Moreover, the matrix is very sparse.
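A sketch of the quantization (variable names are ours; we read "bigger than two standard deviations" as a one-sided threshold above the mean):

% Sum the transformed frames pixel-wise, then threshold at two
% standard deviations above the mean to get a sparse binary matrix.
S = sum(transformed, 3);
mu = mean(S(:));
sigma = std(S(:));
Q = double(S > mu + 2 * sigma);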

VECTORIZATION

The final step is to encode this matrix into a vector. We want the fingerprint to be a vector because this makes the comparison with other fingerprints faster. We sum along each column and end up with a vector whose entries are the column sums. As in the previous module, we use the sum function in Matlab.
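As a one-line sketch (variable names are ours):

% Collapse the binary matrix into a row vector of column sums;
% this vector is the final fingerprint.
fingerprint = sum(Q, 1);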

2) Film selection using Least Squared Error

This part of the program has to deduce which of the database films is the one we give it. We have to take into account that the input video is noisy, so its fingerprint will not be equal to any of those stored in the database. We need an algorithm that correctly predicts which film it is. Our algorithm chooses the database fingerprint whose subspace gives the minimum least squared error.

Figure 4: Minimum LSE algorithm flow-chart

We consider each fingerprint of the database to be the basis of a vector subspace. The fingerprint of the noisy video we receive will not lie in any of those subspaces; it will fall somewhere else in the N-dimensional space (N is the length of the fingerprint). But we can project our vector onto the subspaces generated by the database fingerprints and see which projection gives the least squared error with respect to the noisy fingerprint. Equation 1 shows how we compute the error, where X is the fingerprint of our noisy video and P is a projection matrix from the database.
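Equation 1 appears as an image on the page; from the description, a plausible form of the error is

e = ||X − P·X||²

that is, the squared norm of the residual left after projecting the noisy fingerprint X with the projection matrix P.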

Figure 5: Projection to the subspace

The lowest error means the fingerprint is closest to that subspace, so we consider the film to be the one corresponding to that fingerprint.


First, we create the projection matrix of each fingerprint using the Singular Value Decomposition (SVD). F is a fingerprint from the database, and Equation 2 shows its SVD. The fingerprint is a vector, so its rank is equal to one. The projection matrix is obtained by multiplying the first k left singular vectors (those with the largest singular values) by their transpose; in our case k = 1 (Equation 3).
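A sketch of this step for a single database fingerprint F (variable names are ours):

% Build the projection matrix onto the 1-D subspace spanned by F.
[U, S, V] = svd(F(:));    % F(:) forces a column vector; its rank is 1
u1 = U(:, 1);             % first (and only significant) left singular vector
P = u1 * u1';             % projection matrix, Equation 3 with k = 1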

We now have the projection matrix of each fingerprint in the database. Finally, we multiply each projection matrix by our noisy fingerprint, obtaining another vector. By applying Equation 1 we see which projection is closest to the noisy fingerprint, and that is our matching criterion.
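A sketch of the selection loop (variable names are ours; projections is assumed to be a cell array holding the matrices P of the four database films):

% Project the noisy fingerprint onto each database subspace and
% pick the film with the minimum squared error (Equation 1).
X = fingerprint(:);
errors = zeros(1, numel(projections));
for i = 1:numel(projections)
    e = X - projections{i} * X;    % residual of the projection
    errors(i) = e' * e;            % squared error
end
[~, bestFilm] = min(errors);       % index of the matching film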


This is the algorithm we have developed for the creation of a video fingerprint. We are aware that we haven't implemented it exactly as described in the paper and that it can be improved. We discuss the conclusions and the improvements that can be made in the conclusions section.
