Comparing LDA with pLSI as a Dimensionality Reduction Method in Document Clustering
- Tomonari Masada,
- Senya Kiyasu,
- Sueharu Miyahara
Abstract
In this paper, we compare latent Dirichlet allocation (LDA) with probabilistic latent semantic indexing (pLSI) as dimensionality reduction methods and investigate their effectiveness in document clustering on real-world document sets. To cluster documents, we use a method based on the multinomial mixture model, which is known as an efficient framework for text mining. Clustering results are evaluated by F-measure, i.e., the harmonic mean of precision and recall. We use Japanese and Korean Web articles for evaluation and regard the category assigned to each article as the ground truth. Our experiment shows that dimensionality reduction via LDA or pLSI yields document clusters of almost the same quality as those obtained from the original feature vectors. We can therefore reduce the vector dimension without degrading cluster quality. Further, both LDA and pLSI are more effective than random projection, the baseline method in our experiment. However, our experiment reveals no meaningful difference between LDA and pLSI. This result suggests that LDA does not replace pLSI, at least for dimensionality reduction in document clustering.
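As a rough illustration of two ingredients named in the abstract, the sketch below computes an F-measure as the harmonic mean of precision and recall for one cluster against one category, and applies a Gaussian random projection to toy word-count vectors. The function names, the toy data, and the choice of a Gaussian projection matrix are assumptions made for this example; the abstract does not specify how the paper builds its random projection or how it aggregates F-measure over clusters.

```python
import random


def f_measure(cluster, category):
    """Harmonic mean of precision and recall for one cluster against one category.

    Both arguments are sets of document ids (hypothetical representation).
    """
    overlap = len(cluster & category)
    if overlap == 0:
        return 0.0
    precision = overlap / len(cluster)   # share of the cluster that belongs to the category
    recall = overlap / len(category)     # share of the category captured by the cluster
    return 2 * precision * recall / (precision + recall)


def random_projection(vectors, target_dim, seed=0):
    """Project each row of `vectors` to `target_dim` dimensions.

    Assumption: a dense Gaussian projection matrix; other schemes exist.
    """
    rng = random.Random(seed)
    source_dim = len(vectors[0])
    # Entries drawn from N(0, 1/target_dim) so squared lengths are preserved in expectation.
    scale = target_dim ** -0.5
    matrix = [[rng.gauss(0.0, 1.0) * scale for _ in range(target_dim)]
              for _ in range(source_dim)]
    return [[sum(x * matrix[i][j] for i, x in enumerate(row))
             for j in range(target_dim)]
            for row in vectors]


# Toy word-count vectors: four documents over a six-word vocabulary (made-up data).
docs = [
    [3, 1, 0, 0, 0, 2],
    [2, 2, 1, 0, 0, 1],
    [0, 0, 0, 4, 2, 0],
    [0, 1, 0, 3, 3, 0],
]
reduced = random_projection(docs, target_dim=2)

# Hypothetical clustering output compared with a ground-truth category.
cluster = {0, 1, 3}
category = {0, 1}
score = f_measure(cluster, category)
```

Here the cluster contains both documents of the category plus one extra, so precision is 2/3 and recall is 1. In the setting the abstract describes, the reduced vectors would come from LDA or pLSI topic proportions rather than from a random matrix, with random projection serving only as the baseline.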
- Book Title: Large-Scale Knowledge Resources. Construction and Application
- Book Subtitle: Third International Conference on Large-Scale Knowledge Resources, LKR 2008, Tokyo, Japan, March 3-5, 2008. Proceedings
- Pages: 13-26
- Copyright: 2008
- DOI: 10.1007/978-3-540-78159-2_2
- Print ISBN: 978-3-540-78158-5
- Online ISBN: 978-3-540-78159-2
- Series Title: Lecture Notes in Computer Science
- Series Volume: 4938
- Series ISSN: 0302-9743
- Publisher: Springer Berlin Heidelberg
- Copyright Holder: Springer-Verlag Berlin Heidelberg
- Authors: Tomonari Masada (1), Senya Kiyasu (1), Sueharu Miyahara (1)
- Author Affiliations: 1. Nagasaki University, 1-14 Bunkyo-machi, Nagasaki, 852-8521, Japan