Volume 72, Issues 7–9, March 2009, Pages 1775–1781
Advances in Machine Learning and Computational Intelligence — 16th European Symposium on Artificial Neural Networks 2008
Edited By F.-M. Schleif, M. Biehl and A. Vellido
A density-based method for adaptive LDA model selection
- a Key Laboratory of Intelligent Information Processing, Institute of Computing Technology, Chinese Academy of Sciences, Beijing 100080, China
- b Graduate University of the Chinese Academy of Sciences, Beijing 100039, China
- Received 1 August 2007, Revised 20 December 2007, Accepted 18 June 2008, Available online 28 August 2008
- Communicated by T. Heskes
Abstract
Topic models have been successfully used in information classification and retrieval. These models can capture word correlations in a collection of textual documents with a low-dimensional set of multinomial distributions, called "topics". However, it is important but difficult to select the appropriate number of topics for a specific dataset. In this paper, we study the inherent connection between the best topic structure and the distances among topics in Latent Dirichlet allocation (LDA), and propose a density-based method for adaptively selecting the best LDA model. Experiments show that the proposed method can achieve performance matching the best of LDA without manually tuning the number of topics.
Keywords
- Latent Dirichlet allocation
- Topic model
- Topic
1. Introduction
Statistical topic models have been successfully applied to many tasks, including information classification [1], [3], [16], information retrieval [4], [14], and data mining [8], [15]. These models capture the word correlations in a corpus with a low-dimensional set of multinomial distributions, called "topics", and find a relatively short description of the documents.
Latent Dirichlet allocation (LDA) is a widely used generative topic model [1], [4], [11], [14]. In LDA, a document is viewed as a distribution over topics, while a topic is a distribution over words. To generate a document, LDA first samples a document-specific multinomial distribution over topics from a Dirichlet distribution, and then repeatedly samples the words of the document from the corresponding multinomial distributions.
The topics discovered by LDA capture the correlations between words, but LDA cannot capture the correlations between topics because of the independence assumption underlying the Dirichlet distribution. However, topic correlations are common in real-world data, and ignoring them limits LDA's ability to model large-scale data and to predict new data. In recent years many researchers have explored richer structures to model topic correlations. One example is the correlated topic model (CTM) [2]. Like LDA, CTM represents each document as a mixture of topics, but the mixture proportions are sampled from a logistic normal distribution, and the covariance matrix captures the correlations between every pair of topics. To capture these correlations with a more flexible structure, Li et al. [10] proposed the Pachinko allocation model (PAM), which uses a directed acyclic graph (DAG) to model the semantic structure. Each leaf node in the DAG represents a word in the vocabulary, and each interior node corresponds to a topic. PAM expands the definition of a topic to be not only a distribution over words (as in other topic models), but also a distribution over other topics, called a "super topic".
Although these models can describe topic correlations flexibly, they all face the same difficulty of determining the number of topics (the parameter K), which determines the topic structure extracted by the model. Teh et al. [13] applied the hierarchical Dirichlet process (HDP) to automatically learn the number of topics in an LDA model, and Li et al. [9] proposed a nonparametric Bayesian prior for PAM based on a variant of the HDP.
This approach builds on the nonparametric nature of the Bayesian analysis tool known as the Dirichlet process (DP) mixture model, but it requires constructing both an HDP model and an LDA model for the same dataset. In this paper, we propose a new method that adaptively selects the best LDA model based on topic density, integrating model selection and parameter estimation into the same framework. By modeling the generation process of a new topic, we find that words connecting several topics are likely to generate new topics. Furthermore, the best K for a model is determined not only by the size of the dataset, but also by the inherent correlations in the document collection. After computing the density of each topic, we find the most unstable topics under the old structure and iteratively update the parameter K until the model is stable.
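To make the idea concrete, the following is a minimal, hypothetical Python sketch of such an adaptive loop. It assumes that density is measured by the average pairwise cosine similarity between topic-word distributions and that a helper fit_lda(corpus, k) returns the estimated K×V topic-word matrix; both are illustrative assumptions, and the exact definitions used by our approach are introduced later in the paper.

```python
import numpy as np

def topic_density(beta):
    """Average pairwise cosine similarity between topic-word distributions.

    beta: K x V matrix, each row a topic's distribution over words.
    (Illustrative proxy for a topic-density measure, not the paper's
    exact definition.)
    """
    unit = beta / np.linalg.norm(beta, axis=1, keepdims=True)
    sim = unit @ unit.T                       # K x K cosine similarities
    k = beta.shape[0]
    off_diag = sim.sum() - np.trace(sim)      # exclude self-similarity
    return off_diag / (k * (k - 1))

def select_k(corpus, k_init, fit_lda, max_iter=20):
    """Iteratively adjust K until the topic structure stops improving.

    fit_lda(corpus, k) is a hypothetical helper returning the K x V
    topic-word matrix estimated by LDA (e.g. via Gibbs sampling).
    """
    k, best_k = k_init, k_init
    best_density = np.inf
    for _ in range(max_iter):
        beta = fit_lda(corpus, k)
        d = topic_density(beta)
        if d < best_density:                  # less overlap between topics
            best_density, best_k = d, k
            k += 1                            # try a finer topic structure
        else:
            break                             # structure has stabilized
    return best_k
```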
The rest of this paper is organized as follows. In Section 2, we review the basic principles of LDA and the model selection method based on the HDP. In Section 3, we study the meaning of the parameter K and analyze the inherent connection between topic correlations and LDA model performance. In Section 4, we propose our approach, and in Section 5 we show the experimental results. Finally, we draw conclusions and outline future work in Section 6.
2. Related work
2.1. Latent Dirichlet Allocation (LDA)
LDA is a generative probabilistic model with a three-level structure of words, topics, and documents. In LDA, each document is viewed as a distribution over topics, while each topic is a distribution over words. To generate a document, LDA first samples a document-specific multinomial distribution over topics from a Dirichlet distribution, and then repeatedly samples the words of the document from these topics. LDA and its variants have been successfully applied in many works [2], [10], [15], [16].
Fig. 1 is the graphical model representation of LDA. Consider a corpus D containing V unique words and M documents, where each document d consists of a sequence of words {w1, w2, …, wNd}. Given an appropriate topic number K, the generative process for a document d is as follows:
- (a)
Sample a K-vector θd from the Dirichlet distribution p(θ|α), where θd is the topic mixture proportion of document d.
- (b)
For i=1…Nd, sample word wi in document d from the document-specific multinomial distribution p(wi|θd, β),
where α is a K-vector of Dirichlet parameters, and p(θ|α) is given by
p(θ|α) = [Γ(α1 + ⋯ + αK) / (Γ(α1) ⋯ Γ(αK))] θ1^(α1−1) ⋯ θK^(αK−1).
β is a K×V matrix of word probabilities, where βij = p(wj = 1 | zi = 1), i = 1, …, K; j = 1, …, V.
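As an illustration, the following Python sketch samples a toy document by this generative process. It draws the intermediate topic assignment z for each word explicitly, which is equivalent to sampling wi from p(wi|θd, β); the α and β values below are toy assumptions chosen only for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

def generate_document(alpha, beta, n_words):
    """Sample one document under the LDA generative process.

    alpha : length-K vector of Dirichlet parameters.
    beta  : K x V matrix; row i is topic i's distribution over the vocabulary.
    Returns a list of word indices into the vocabulary.
    """
    theta = rng.dirichlet(alpha)                   # (a) document-specific topic proportions
    words = []
    for _ in range(n_words):
        z = rng.choice(len(alpha), p=theta)        # (b) pick a topic for this position
        w = rng.choice(beta.shape[1], p=beta[z])   #     then a word from that topic
        words.append(w)
    return words

# Toy usage: K = 2 topics over a vocabulary of V = 5 words.
alpha = np.array([1.0, 1.0])
beta = np.array([[0.50, 0.30, 0.10, 0.05, 0.05],
                 [0.05, 0.05, 0.10, 0.30, 0.50]])
doc = generate_document(alpha, beta, n_words=10)
```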