Article

A Topic Modeling Analysis of TCGA Breast and Lung Cancer Transcriptomic Data

Physics Department, University of Turin and INFN, via P. Giuria 1, 10125 Turin, Italy
* Author to whom correspondence should be addressed.
Cancers 2020, 12(12), 3799; https://doi.org/10.3390/cancers12123799
Received: 19 October 2020 / Revised: 7 December 2020 / Accepted: 11 December 2020 / Published: 16 December 2020
(This article belongs to the Special Issue Cancer Modeling and Network Biology)
Simple Summary: Topic modeling was introduced to classify natural language texts by inferring their topic structure from word frequencies. This paper assumes that, analogously, cancer subtype identity, which is crucial for correct diagnosis and treatment planning, can be extracted from gene expression patterns with similar techniques. Focusing on breast and lung cancer, we show that state-of-the-art topic modeling techniques can successfully classify known subtypes and identify cohorts of patients with different survival probabilities. The topic structure hidden in expression data can be viewed as a biologically relevant low-dimensional data representation that can be used to build efficient classifiers of expression patterns.
Abstract: Topic modeling is a widely used technique to extract relevant information from large arrays of data. The problem of finding a topic structure in a dataset was recently recognized to be analogous to the community detection problem in network theory. Leveraging this analogy, a new class of topic modeling strategies has been introduced to overcome some of the limitations of classical methods. This paper applies these recent ideas to TCGA transcriptomic data on breast and lung cancer. The established cancer subtype organization is well reconstructed in the inferred latent topic structure. Moreover, we identify specific topics that are enriched in genes known to play a role in the corresponding disease and are strongly related to the survival probability of patients. Finally, we show that a simple neural network classifier operating in the low-dimensional topic space predicts the cancer subtype of a test expression sample with high accuracy.
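The analogy between topic modeling and community detection can be made concrete with a toy sketch. Here samples play the role of documents and genes the role of words, so an expression table becomes a weighted bipartite network on which a community detection method (for example, a stochastic block model) could then be run. The data, the sample and gene names, and the helper functions `build_bipartite_edges` and `node_strengths` are hypothetical illustrations under these assumptions, not the authors' pipeline.

```python
from collections import defaultdict

# Hypothetical toy expression table: sample -> {gene: integer read count}.
# Sample and gene names are illustrative only, not taken from TCGA.
expression = {
    "sample_A": {"GENE1": 5, "GENE2": 0, "GENE3": 2},
    "sample_B": {"GENE1": 0, "GENE2": 7, "GENE3": 1},
    "sample_C": {"GENE1": 4, "GENE2": 0, "GENE3": 0},
}


def build_bipartite_edges(table):
    """Return weighted (sample, gene, count) edges, skipping zero counts."""
    return [
        (sample, gene, count)
        for sample, genes in table.items()
        for gene, count in genes.items()
        if count > 0
    ]


def node_strengths(edges):
    """Sum edge weights per node (weighted degree) on both sides of the graph."""
    strength = defaultdict(int)
    for sample, gene, count in edges:
        strength[sample] += count
        strength[gene] += count
    return dict(strength)


edges = build_bipartite_edges(expression)
strengths = node_strengths(edges)
```

In this representation, a group of gene nodes found by community detection would correspond to a topic, and a group of sample nodes to a candidate cohort; the sketch stops at building the network and does not perform the inference itself.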
Keywords: network-based cancer data analysis; topic modeling; gene expression; network theory; stochastic block modeling

MDPI and ACS Style

Valle, F.; Osella, M.; Caselle, M. A Topic Modeling Analysis of TCGA Breast and Lung Cancer Transcriptomic Data. Cancers 2020, 12, 3799. https://doi.org/10.3390/cancers12123799
