Authors: J. Sivic and A. Zisserman
Summary:
The aim of this paper is to retrieve the key frames of a video that contain a particular object with the same ease, speed, and accuracy with which Google retrieves text documents containing specific words. The paper investigates whether a text-retrieval approach can be successfully employed for this task.
The approach is as follows:
- Detect viewpoint-invariant regions in each frame and describe each one with a SIFT descriptor.
- Cluster the extracted SIFT descriptors with K-means. The clustering result can be seen as a "visual vocabulary", and each cluster represents a "visual word". The visual words of an image play the same role as the real words of an article.
- Weight each "visual word" by "TF-IDF", the standard weighting scheme in text retrieval systems.
- Build a stop list and check spatial consistency. The stop list filters out words that occur very frequently but carry little real meaning, like "a" and "the" in text. Spatial consistency checks whether matched "visual words" have a similar spatial arrangement in the query and in the retrieved frame.
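The steps above can be sketched in a few lines of Python. This is only a minimal illustration under stated assumptions, not the authors' implementation: 2-D points stand in for 128-D SIFT descriptors, the K-means initialization is a deterministic farthest-point heuristic picked just to keep the toy example reproducible, and all function names (`kmeans`, `stop_list`, `tfidf`, `cosine`) are made up for this sketch. The TF-IDF weight used is (n_id / n_d) * log(N / n_i), where n_id is the count of word i in frame d, n_d is the number of words in frame d, n_i is the number of frames containing word i, and N is the number of frames.

```python
import math
from collections import Counter

def sq_dist(a, b):
    """Squared Euclidean distance between two descriptors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def nearest(centers, d):
    """Index of the cluster center (the visual word) closest to descriptor d."""
    return min(range(len(centers)), key=lambda j: sq_dist(centers[j], d))

def kmeans(descriptors, k, iters=10):
    """Lloyd's k-means; the k centers act as the visual vocabulary."""
    # Assumption: farthest-point initialization, chosen for reproducibility.
    centers = [descriptors[0]]
    while len(centers) < k:
        centers.append(max(descriptors,
                           key=lambda d: min(sq_dist(d, c) for c in centers)))
    for _ in range(iters):
        groups = [[] for _ in range(k)]
        for d in descriptors:
            groups[nearest(centers, d)].append(d)
        # Move each center to the mean of its group (keep it if the group is empty).
        centers = [tuple(sum(col) / len(g) for col in zip(*g)) if g else c
                   for g, c in zip(groups, centers)]
    return centers

def stop_list(frames_words, n_stop):
    """The n_stop most frequent visual words over all frames."""
    totals = Counter(w for words in frames_words for w in words)
    return {w for w, _ in totals.most_common(n_stop)}

def tfidf(frames_words, stop=frozenset()):
    """One sparse vector per frame, weight = (n_id / n_d) * log(N / n_i)."""
    kept = [[w for w in words if w not in stop] for words in frames_words]
    doc_freq = Counter(w for words in kept for w in set(words))
    return [{w: (c / len(words)) * math.log(len(kept) / doc_freq[w])
             for w, c in Counter(words).items()}
            for words in kept]

def cosine(u, v):
    """Cosine similarity of two sparse vectors stored as dicts."""
    dot = sum(x * v.get(w, 0.0) for w, x in u.items())
    norm = (math.sqrt(sum(x * x for x in u.values()))
            * math.sqrt(sum(x * x for x in v.values())))
    return dot / norm if norm else 0.0

# Toy data: four frames, four "descriptors" each.
frames = [
    [(0, 0), (0, 1), (10, 10), (11, 11)],
    [(1, 0), (1, 1), (10, 11), (11, 10)],
    [(20, 0), (20, 1), (10, 10), (11, 10)],
    [(21, 0), (21, 1), (10, 11), (11, 11)],
]
vocabulary = kmeans([d for f in frames for d in f], k=3)
frames_words = [[nearest(vocabulary, d) for d in f] for f in frames]
stop = stop_list(frames_words, n_stop=1)
vectors = tfidf(frames_words, stop)

# Query with frame 0 and rank all frames by similarity to it.
scores = [cosine(vectors[0], v) for v in vectors]
ranking = sorted(range(len(frames)), key=lambda i: -scores[i])
print(frames_words, stop, ranking)
```

In this toy data the visual word shared by every frame is the most frequent one, so it ends up on the stop list, and frames 0 and 1 (which share the remaining word) rank above frames 2 and 3. Even without the stop list, that word would get zero weight, because log(N / n_i) is 0 when a word occurs in every frame.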
There are some differences between document retrieval with bag-of-words and frame retrieval with bag-of-visual-words:
- Bag-of-words carries no spatial information, while visual words keep their positions in the image, so spatial information is available.
- An image query typically contains more visual words than a text query.
- Web page retrieval can use link-structure indicators to improve retrieval.
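To make the role of spatial information more concrete, here is a rough sketch of a neighbour-voting consistency check. It is an assumption-laden simplification rather than the paper's exact procedure: each matched visual word gets one vote for every other match that lies among its `k` nearest spatial neighbours in both the query and the candidate frame. The function names, the value of `k`, and the toy coordinates are all hypothetical.

```python
def k_nearest(points, i, k):
    """Indices of the k points closest to points[i], excluding i itself."""
    others = [j for j in range(len(points)) if j != i]
    others.sort(key=lambda j: (points[j][0] - points[i][0]) ** 2
                              + (points[j][1] - points[i][1]) ** 2)
    return set(others[:k])

def spatial_votes(query_pts, frame_pts, matches, k=2):
    """Total votes for a frame.

    matches is a list of (query_index, frame_index) pairs whose regions
    were assigned the same visual word.
    """
    votes = 0
    for qi, fi in matches:
        q_near = k_nearest(query_pts, qi, k)
        f_near = k_nearest(frame_pts, fi, k)
        # A neighbouring match supports this one only if it is close in BOTH images.
        votes += sum(1 for qj, fj in matches if qj in q_near and fj in f_near)
    return votes

# Toy positions of four regions in the query image.
query_pts = [(0, 0), (1, 0), (0, 1), (50, 50)]
matches = [(0, 0), (1, 1), (2, 2)]

# Frame A keeps the same layout (just shifted); frame B scatters the regions.
frame_a = [(10, 10), (11, 10), (10, 11), (60, 60)]
frame_b = [(10, 10), (90, 10), (10, 90), (50, 50)]

votes_a = spatial_votes(query_pts, frame_a, matches)
votes_b = spatial_votes(query_pts, frame_b, matches)
print(votes_a, votes_b)
```

Both frames contain the same three matching visual words, so a pure bag-of-visual-words score cannot tell them apart. The vote count is higher for frame A, where the matched regions keep their arrangement, so it could be used to re-rank the retrieved frames.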
Comments:
Is there a clustering algorithm that would perform better than K-means here, or is the clustering step not the main factor in performance? The approach used here borrows from text retrieval techniques. Can other text retrieval techniques also be adapted to image retrieval?