Paper Reading for aMMAI: [Summary] Efficient visual search of videos cast as text retrieval

Topic: Efficient visual search of videos cast as text retrieval

Authors: J. Sivic and A. Zisserman

Summary:

The aim of this paper is to retrieve the key frames of a video that contain a particular object with the same ease, speed, and accuracy with which Google retrieves text documents containing specific words. The paper investigates whether a text-retrieval-like approach can be successfully employed for this task.

The approach is as follows:

Detect viewpoint-invariant regions in each key frame and describe them with SIFT descriptors.
Cluster the extracted SIFT descriptors with K-means. The resulting clusters form a "visual vocabulary", and each cluster represents one "visual word". The visual words of an image play the role that real words play in an article.
Weight each "visual word" by TF-IDF (term frequency-inverse document frequency), the weighting commonly used in text retrieval systems.
Build a stop list and check spatial consistency. The stop list filters out words that occur very frequently but carry little real meaning, like "a" and "the" in text. Spatial consistency checks that matched "visual words" also have a similar spatial arrangement in the query and in the retrieved frame.
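The steps above can be sketched in a few lines of Python. This is only a minimal illustration under several assumptions, not the authors' implementation: 2-D toy points stand in for 128-D SIFT descriptors, the K-means seeds are hand-picked for the toy data, the stop list is simply the single most frequent visual word, and the IDF uses the usual document-frequency definition from text retrieval. All function names (nearest, kmeans, idf_table, tfidf, cosine) are hypothetical.

```python
import math
from collections import Counter

def nearest(point, centers):
    """Index of the center closest to point (squared Euclidean distance)."""
    return min(range(len(centers)),
               key=lambda k: sum((p - c) ** 2 for p, c in zip(point, centers[k])))

def kmeans(points, seeds, iters=10):
    """Plain Lloyd's k-means starting from the given seed centers."""
    centers = [list(s) for s in seeds]
    for _ in range(iters):
        groups = [[] for _ in centers]
        for p in points:
            groups[nearest(p, centers)].append(p)
        for j, g in enumerate(groups):
            if g:  # an empty cluster keeps its previous center
                centers[j] = [sum(col) / len(g) for col in zip(*g)]
    return centers

def idf_table(frames_words):
    """log(N / df) for every visual word seen in the database of frames."""
    n = len(frames_words)
    df = Counter(w for words in frames_words for w in set(words))
    return {w: math.log(n / df[w]) for w in df}

def tfidf(words, idf, stop_words):
    """Sparse TF-IDF vector {word: weight}, ignoring stop-listed words."""
    kept = [w for w in words if w in idf and w not in stop_words]
    return {w: (c / len(kept)) * idf[w] for w, c in Counter(kept).items()}

def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(u[w] * v[w] for w in u if w in v)
    nu = math.sqrt(sum(x * x for x in u.values()))
    nv = math.sqrt(sum(x * x for x in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0

# Toy "descriptors" per frame (assumption: 2-D points instead of SIFT vectors).
frames = [
    [(0.0, 0.1), (0.1, 0.0), (0.0, 9.0)],
    [(5.0, 5.1), (5.1, 5.0), (0.1, 9.1)],
    [(9.0, 0.1), (0.1, 0.1), (0.0, 8.9), (0.1, 8.9)],
]
points = [p for frame in frames for p in frame]

# Step 2: build the visual vocabulary (seeds hand-picked for this toy data).
seeds = [points[0], points[2], points[3], points[6]]
centers = kmeans(points, seeds)
frame_words = [[nearest(p, centers) for p in frame] for frame in frames]

# Step 4 (first half): stop list = the single most frequent visual word.
totals = Counter(w for words in frame_words for w in words)
stop_words = {w for w, _ in totals.most_common(1)}

# Step 3: TF-IDF vector for every frame, then rank frames against a query.
idf = idf_table(frame_words)
index = [tfidf(words, idf, stop_words) for words in frame_words]
query = [(9.0, 0.0), (0.0, 0.0)]
q_vec = tfidf([nearest(p, centers) for p in query], idf, stop_words)
scores = [cosine(q_vec, vec) for vec in index]
ranking = sorted(range(len(frames)), key=lambda i: -scores[i])
print(frame_words, stop_words, ranking)
```

On this toy data the third frame ranks first because it shares both of the query's visual words. A real system would use a much larger vocabulary and an inverted file rather than scoring every frame in turn.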

There are some differences between document retrieval with bag-of-words and frame retrieval with bag-of-visual-words:

A bag-of-words has no spatial information, while the visual words in a frame come with spatial information (their positions in the image).
An image query typically contains more visual words than a text query contains words.
Web page retrieval can use link-structure indicators to improve its results; frame retrieval has no such link structure.
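The first difference is what makes the spatial consistency check possible. A simplified sketch, assuming each feature is an (x, y, visual word) triple: a word match between query and frame collects one vote for every visual word that its nearest spatial neighbours also have in common. The function names, the neighbourhood size k, and the toy coordinates are all hypothetical, and the voting rule is a simplification rather than the exact scheme in the paper.

```python
import math

def k_nearest(idx, feats, k):
    """Indices of the k features spatially closest to feats[idx]."""
    x, y, _ = feats[idx]
    others = [j for j in range(len(feats)) if j != idx]
    others.sort(key=lambda j: math.hypot(feats[j][0] - x, feats[j][1] - y))
    return others[:k]

def spatial_votes(query, frame, k=2):
    """Sum of neighbour-support votes over all matching visual words."""
    votes = 0
    for qi, (_, _, q_word) in enumerate(query):
        for fi, (_, _, f_word) in enumerate(frame):
            if q_word != f_word:
                continue  # only matched visual words can collect votes
            q_near = {query[j][2] for j in k_nearest(qi, query, k)}
            f_near = {frame[j][2] for j in k_nearest(fi, frame, k)}
            votes += len(q_near & f_near)  # neighbours that also match
    return votes

# Query object: visual words 0, 1, 2 close together.
query = [(0, 0, 0), (1, 0, 1), (0, 1, 2)]

# Frame A: the same three words in the same layout, shifted in the image.
frame_a = [(10, 10, 0), (11, 10, 1), (10, 11, 2), (50, 50, 7)]

# Frame B: the same three words, but far apart and surrounded by other words.
frame_b = [(0, 0, 0), (1, 0, 8), (0, 1, 9),
           (40, 0, 1), (41, 0, 8), (40, 1, 9),
           (0, 40, 2), (1, 40, 8), (0, 41, 9)]

votes_a = spatial_votes(query, frame_a)
votes_b = spatial_votes(query, frame_b)
print(votes_a, votes_b)
```

Both frames contain all three query words, so a plain bag-of-visual-words match cannot tell which one shows the words in the query's arrangement. The vote count can, so it could be used to re-rank the frames returned by the TF-IDF step.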

Comments:
Is there a clustering algorithm that performs better than K-means for this task, or is the choice of clustering algorithm not a main factor in performance? The approach here borrows techniques from text retrieval. Can other text retrieval techniques also be adapted to image retrieval?

Thursday, March 7, 2013
