LDA Model Introduction
The Latent Dirichlet Allocation (LDA) model is a widely used probabilistic graphical model for analyzing text data. It allows us to discover latent topics within a collection of documents and understand how these topics are represented by words.
To understand the LDA model, we can use plate notation, which is a visual representation of probabilistic graphical models. In this notation, boxes are referred to as “plates” and represent replicates or repeated entities. In the case of LDA, the outer plate represents documents, while the inner plate represents the word positions within a document. Each word position is associated with a choice of topic and word.
Let’s define the variables used in LDA (a short sketch of the generative process using these variables follows the list):
- M represents the number of documents in the corpus.
- N represents the number of words in a given document (document i has N_i words).
- α is the parameter of the Dirichlet prior on the per-document topic distributions.
- β is the parameter of the Dirichlet prior on the per-topic word distributions.
- θ_i is the topic distribution for document i.
- φ_k is the word distribution for topic k.
- z_ij is the topic assigned to the j-th word in document i.
- w_ij is the j-th word in document i itself, i.e., the observed word.
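To make the generative story concrete, here is a minimal NumPy sketch that samples a toy corpus from these variables. All sizes (M, K, V, document lengths) and the values of α and β below are illustrative assumptions, not values from the text.

```python
import numpy as np

rng = np.random.default_rng(0)

M, K, V = 3, 2, 10       # documents, topics, vocabulary size (assumed for the demo)
alpha = np.full(K, 0.5)  # Dirichlet prior on per-document topic distributions
beta = np.full(V, 0.1)   # Dirichlet prior on per-topic word distributions

phi = rng.dirichlet(beta, size=K)      # phi[k] is the word distribution for topic k

corpus = []
for i in range(M):
    N_i = rng.integers(5, 10)          # number of words in document i
    theta_i = rng.dirichlet(alpha)     # topic distribution for document i
    doc = []
    for j in range(N_i):
        z_ij = rng.choice(K, p=theta_i)    # topic for the j-th word in document i
        w_ij = rng.choice(V, p=phi[z_ij])  # word drawn from that topic's distribution
        doc.append(int(w_ij))
    corpus.append(doc)

print(corpus)  # only these word indices are observed; theta, phi, and z stay latent
```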
In plate notation, the variable W (representing the words w_ij) is grayed out to indicate that it is the only observed variable; all the other variables are latent.
The LDA model places a sparse Dirichlet prior on the topic-word distribution. This means the probability distribution over words in a topic is skewed, so only a small set of words has high probability. This variant of LDA is the one most widely applied in practice.
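As a quick illustration of this sparsity effect (the vocabulary size and concentration values here are assumptions for demonstration), compare Dirichlet samples drawn with a symmetric parameter of 1 against one much smaller than 1:

```python
import numpy as np

rng = np.random.default_rng(1)
V = 10  # illustrative vocabulary size

dense = rng.dirichlet(np.full(V, 1.0))    # beta = 1: fairly even word probabilities
sparse = rng.dirichlet(np.full(V, 0.05))  # beta << 1: mass piles onto a few words

print(np.round(dense, 3))
print(np.round(sparse, 3))
```

With the small concentration parameter, nearly all probability mass typically lands on one or two words, which is exactly the skew described above.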
In the plate notation for LDA with Dirichlet-distributed topic-word distributions, K represents the number of topics, and φ_1, φ_2, …, φ_K are V-dimensional vectors that store the parameters of these distributions (V is the number of words in the vocabulary).
To understand the entities represented by θ and φ, we can think of them as the two factors of a decomposition of the document-word matrix that represents the corpus being modeled. In this view, θ is an M × K matrix whose rows are indexed by documents and whose columns are indexed by topics, while φ is a K × V matrix whose rows are indexed by topics and whose columns are indexed by words. Therefore, φ_1, φ_2, …, φ_K refer to the rows of φ, each a distribution over words for a specific topic, and θ_1, θ_2, …, θ_M refer to the rows of θ, each a distribution over topics for a particular document.
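As a rough sketch of this matrix view, the following uses scikit-learn's LatentDirichletAllocation on a tiny made-up corpus; the documents, the choice of n_components=2, and the top-word printout are illustrative assumptions. Here lda.transform(X) returns the document-topic matrix corresponding to θ, while lda.components_ holds the (unnormalized) topic-word weights corresponding to φ.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

docs = [
    "the cat sat on the mat",
    "dogs and cats are pets",
    "stock markets rose sharply today",
    "investors traded stocks and bonds",
]

vectorizer = CountVectorizer()
X = vectorizer.fit_transform(docs)          # document-word count matrix (M x V)

lda = LatentDirichletAllocation(n_components=2, random_state=0).fit(X)

theta = lda.transform(X)                    # M x K: per-document topic distributions
phi = lda.components_                       # K x V: unnormalized topic-word weights
phi = phi / phi.sum(axis=1, keepdims=True)  # normalize each row into a distribution

vocab = vectorizer.get_feature_names_out()
for k, row in enumerate(phi):
    top = row.argsort()[::-1][:3]           # three highest-probability words in topic k
    print(f"topic {k}:", [vocab[t] for t in top])
```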
By applying the LDA model to a collection of documents, we can uncover the underlying topics and their associated word distributions. This allows us to gain insights into the structure and content of the document corpus, enabling various applications such as document categorization, topic modeling, and information retrieval.
Reference
Wikipedia contributors, “Latent Dirichlet allocation,” Wikipedia, Apr. 02, 2024. https://en.wikipedia.org/wiki/Latent_Dirichlet_allocation