Paper Reading for aMMAI: 3月 2013

2013年3月28日星期四

[Summary] Latent Dirichlet Allocation

Topic: Latent Dirichlet Allocation

Author: D. Blei, A. Ng, and M. Jordan

Summary:

Latent Dirichlet Allocation(LDA) is a three-level hierarchical Bayesian model, in which each item of a collection is modeled as a finite mixture over a set of topics. Each topic is, in turn, modeled as an infinite mixture over an underlying set of topic prob.

The method of PLSA that we read in last week suffers that the number of parameters in the model grows linearly with the size of the corpus, which leads to serious problems with overfitting. LDA uses the assumption that documents are exchangeable; the specific ordering of the documents in a corpus can also be neglected.
(Exchangeability is not equivalent to an assumption that the random variables are independent and identically distributed. It means “conditionally independent and identically distributed,” where the conditioning is with respect to an underlying latent parameter of a probability distribution.)

Notation and terminology

A word is the basic unit of discrete data, defined to be an item from a vocabulary indexed by {1,…, V}
The vth word in the vocabulary is represented by a V-vector w such that wv = 1 and wu = 0 for u ≠ v
A document is a sequence of N words denoted by w = (w1,w2,…,wN), where wn is the nth word in the sequence
Corpus is a collection of M documents denoted by D = {w1,w2,…,wM}

LDA process:

Choose N ~ Poisson(xi).
Choose theta ~ Dir(alpha).
For each of the N words wn:

Choose a topic zn ~ Multinomial(theta).
Choose a word wn from p(wn|zn,beta), a multinomial probability conditioned on the topic zn.

Generally speaking, choose a topic from the topic distribution function for each document. Then, choose a word from the corresponding words for the given topic until all N words are picked.

PLSA

LDA

We can compare the graphical representation of LDA and PLSA models as the above.
PLSA:
The PLSA model learns the topic mixtures P(z|d) only for those documents on which it is trained. The parameters for a k-topic PLSA model are k multinomial distributions of size V and M mixtures over the k hidden topics. This gives kV +kM parameters and therefore linear growth in M, which will cause overfitting.
LDA:
LDA overcomes both of these problems by treating the topic mixture weights as a k-parameter hidden random variable rather than a large set of individual parameters which are explicitly linked to the training set. The k+kV parameters in a k-topic LDA model do not grow with the size of the training corpus.

LDA must find parameters alpha and beta that maximize the (marginal) log-likelihood of the data. It uses EM algorithm as the following:
Variational EM:

(E-step) For each document, find the optimizing values of the variational parameters {gammad*, phid*:d belongs to D}
(M-step) Maximize the resulting lower bound on the log likelihood with respect to the model parameters alpha and beta.

Comments:
It improves the drawback of PLSA and gets better performance, so there are more practical applicaitons adopt it. However, the proposal is not intuitive and hard to understand. Maybe I will understand in depth as I read more times.

2013年3月21日星期四

[Summary] Probabilistic Latent Semantic Indexing

Topic: Probabilistic Latent Semantic Indexing

Author: T. Hofmann

Summary:

PLSA is an approach to automated document indexing which is based on a statistical latent class model for factor analysis of count data. The model is able to deal with synonymy and polyonymy.

Latent Semantic Analysis(LSA) usually takes the high dimensional vector space representation of documents based on TF and applies Singular Value Decomposition (SVD) to the corresponding term/document matrix.

PLSA is based on the likelihood principle and defines a proper generative model of the data. The core of PLSA is a statistical model called aspect model.

unobserved class variable z (z1...zK)
word w (w1...wM)
document d (d1...dN)

The generative model is defined as follows:

select a document d with prob. P(d)
pick a latent class z with prob. P(z|d)
generate a word w with prob P(w|z)

The aspect model is a statistical mixture model which is based on two assumptions:

Observation pairs (d,w) are assumed to be generated independently.
The conditional independence assumption is made that conditioned on the latent class z, word w are generated independently of the specific document d.

Following the likelihood principle, one determines P(d), P(z|d), and P(w|z) by maximization of the log-likelihood function

where n(d,w) is the term frequency.

The standard procedure for maximum likelihood estimation is the Expectation Maximization (EM) algorithm. It contains two steps:

Expectation (E): Posterior prob. are computed for the latent variables z, based on the current estimates of the parameters.
Maximization (M): Parameters are updated for given posterior prob. computed in the E-step.

In order to prevent overfitting of the traing algorihm, the author introduces a variable beta. The proposal is called tempered EM (TEM).

Beta equals 1 at first, and performs EM until the performance on held-out data deteriorates.
Decrease beta by a constant factor eta, and perform EM continuely.
Stop when decreasing beta does not yield further improvement.

The comparison between Mixture Decomposition and Singular Value Decomposition:

If we define matrices by U=(P(di|zk))i,k, V=(P(wj|zk))j,k, Sigma=diag(P(zk))k. Then, the joint prob. model P can be written as P=U(Sigma)V, which can be compared with the SVD in LSA.
The weighted sum over outer products between rows of U and V reflects conditional independence in PLSA.
The left/ right eigenvectors in SVD are seen to correspond to the factors P(w|z) and the component distributions P(d|z) of the aspect model.
The mixing proportions P(z) in PLSA substitute the singular values of the SVD in LSA.

Comments:
PLSA gives a probability meaning to the proposed model, and it takes advantage of statistical standards methods for model fitting, overfitting control and model combination. The same approach is also used in speech processing in my impression, and maybe we can use it in the social network domain for link prediction, or it can find one's action characteristics by adding a time domain and some adjustment.

2013年3月14日星期四

[Summary] Aggregating local descriptors into a compact image representation

Topic: Aggregating local descriptors into a compact image representation

Author: Herve Jegou

Summary:

Accuracy, efficiency, and the memory usage are the three constraints that have to be considered jointly in image searching on a large scale.

The proposed approach contains three main parts to optimize the accuracy:

1. The presentation: how to aggregate local image descriptors into a vector representation

The proposed algorithm, VLAD(vector of locally aggregated descriptors), is to accumulate, for each visual word ci, the differences x-ci of the vectors x assigned to ci. It can be seen as a simplification of the Fisher kernel(only consider the "mean" factor). As in figure 1, we can observe that similar pictures have similar VLAD descriptors.

2. Dimensionality reduction of the vectors

It used principal component analysis(PCA) for dimensionality reduction: the eigenvectors associated with the D' most energetic eigenvalues of the covariance matrix are used to define a matrix M mapping vector x(D dimension) to x'=Mx(D' dimension). Then, by using the ADC(asymmetric distance computation) approach, it encoded the vector x' to q(x').

3. The indexing algorithm

Dataset: Holidays, UKB, Flickr

The choice of D' is constrained by the structure of ADC, which D' is a multiple of m.
The optimization is solely based on the mean square error quantization criterion.
There is a tradeoff on D'. If D' is large, the projection error vector is limited but a large quantization error is introduced.
The proposed approach significantly outperforms the state of the art.

Comments:
The simplification of GMM and proposed a new method is terrific! However, the paper doesn't explain why the weight and the variance in the GMM are not be considered. What if the weight and the variance are included? Do they really have no impact on the results?

2013年3月7日星期四

[Summary] Efficient visual search of videos cast as text retrieval

Topic: Efficient visual search of videos cast as text retrieval

Author: J. Sivic, and A. Zisserman

Summary:

The aim of this paper is to retrieve key frames of a video with which Google retrieves text documents containing specific words. The paper investigates a text retrieval-like approach to be successfully employed in this work.

The approach is as follows:

Detect and extract the viewpoint invariant descriptors by SIFT.
By using K-means, cluster the extracted SIFT descriptors. The clustering result can be seen as "visual vocabulary", and each cluster represents a "visual word". The visual words to an image can be thought of as the real "words" in an article.
Calculate the "TF-IDF" of each "visual words", which is usually used in a text system.
Build up a stop list and consider spatial consistency. The stop list is a list which is used to filter out the words which have high frequency but little real meaning, like "a", "the", etc. Spatial consistency considers the spatial distance of different "visual words".

There's some difference between document retrieval by bag-of-word and frame retrieval by bag-of-visual word:

Bag-of-word does not have spatial information, while bag-of-visual word contains spatial information.
An image query typically contains more visual words than a text query.
Web page retrieval can use some link structure indicator to improve the efficiency.

Comments
Is there any clustering algorithm that has a better performance than K-means in this algorithm? Or is it not the main case to the performance? The approach using here refers to the text retrieval techniques. Can other technique in text retrieval be transformed into the image-based?

2013年3月28日 星期四