Author: D. Blei, A. Ng, and M. Jordan
Summary:
Latent Dirichlet Allocation(LDA) is a three-level hierarchical Bayesian model, in which each item of a collection is modeled as a finite mixture over a set of topics. Each topic is, in turn, modeled as an infinite mixture over an underlying set of topic prob.
The method of PLSA that we read in last week suffers that the number of parameters in the model grows linearly with the size of the corpus, which leads to serious problems with overfitting. LDA uses the assumption that documents are exchangeable; the specific ordering of the documents in a corpus can also be neglected.
(Exchangeability is not equivalent to an assumption that the random variables are independent and identically distributed. It means “conditionally independent and identically distributed,” where the conditioning is with respect to an underlying latent parameter of a probability distribution.)
Notation and terminology
- A word is the basic unit of discrete data, defined to be an item from a vocabulary indexed by {1,…, V}
- The vth word in the vocabulary is represented by a V-vector w such that wv = 1 and wu = 0 for u ≠ v
- A document is a sequence of N words denoted by w = (w1,w2,…,wN), where wn is the nth word in the sequence
- Corpus is a collection of M documents denoted by D = {w1,w2,…,wM}
- Choose N ~ Poisson(xi).
- Choose theta ~ Dir(alpha).
- For each of the N words wn:
- Choose a topic zn ~ Multinomial(theta).
- Choose a word wn from p(wn|zn,beta), a multinomial probability conditioned on the topic zn.
PLSA
LDA
We can compare the graphical representation of LDA and PLSA models as the above.
PLSA:
The PLSA model learns the topic mixtures P(z|d) only for those documents on which it is trained. The parameters for a k-topic PLSA model are k multinomial distributions of size V and M mixtures over the k hidden topics. This gives kV +kM parameters and therefore linear growth in M, which will cause overfitting.
LDA:
LDA overcomes both of these problems by treating the topic mixture weights as a k-parameter hidden random variable rather than a large set of individual parameters which are explicitly linked to the training set. The k+kV parameters in a k-topic LDA model do not grow with the size of the training corpus.
LDA must find parameters alpha and beta that maximize the (marginal) log-likelihood of the data. It uses EM algorithm as the following:
Variational EM:
- (E-step) For each document, find the optimizing values of the variational parameters {gammad*, phid*:d belongs to D}
- (M-step) Maximize the resulting lower bound on the log likelihood with respect to the model parameters alpha and beta.
It improves the drawback of PLSA and gets better performance, so there are more practical applicaitons adopt it. However, the proposal is not intuitive and hard to understand. Maybe I will understand in depth as I read more times.
沒有留言:
張貼留言