Topic Model: LDA Model
Purpose of Analyzing Yearly Topics
Methodology
- Principle of LDA Model
A topic model is a type of unsupervised model for discovering the abstract “topics” that occur in a collection of documents. If a document (such as a poem or an article) is about a particular topic, words related to that topic are expected to appear more frequently.
Latent Dirichlet Allocation (LDA) is a generative topic model proposed by Blei et al. in 2003. It is also known as a three-tier Bayesian probability model, with a three-tier structure of document (D), topic (Z), and word (W). Using the LDA topic model, we can mine the latent topics in a data set and then analyze its main information and the feature words associated with each topic.
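The document, topic, and word tiers can be made concrete by simulating the generative story that LDA assumes. The following is a minimal sketch using only the Python standard library. The vocabulary, the two topic-word distributions, and the value of `alpha` are hypothetical and chosen for illustration. They are not taken from the magazine data.

```python
import random

def sample_dirichlet(alpha, rng):
    """Draw one sample from a Dirichlet distribution via normalized Gamma draws."""
    draws = [rng.gammavariate(a, 1.0) for a in alpha]
    total = sum(draws)
    return [d / total for d in draws]

def generate_document(topic_word, vocab, alpha, length, rng):
    """Generate one document following the LDA generative story."""
    # 1. Document tier: draw the topic mixture theta ~ Dirichlet(alpha).
    theta = sample_dirichlet(alpha, rng)
    words = []
    for _ in range(length):
        # 2. Topic tier: draw a topic z for this word position from theta.
        z = rng.choices(range(len(theta)), weights=theta)[0]
        # 3. Word tier: draw the word w from topic z's word distribution.
        w = rng.choices(vocab, weights=topic_word[z])[0]
        words.append(w)
    return theta, words

rng = random.Random(0)
vocab = ["moon", "river", "city", "street"]
# Hypothetical topic-word distributions (each row sums to 1).
topic_word = [
    [0.5, 0.5, 0.0, 0.0],  # a "nature" topic
    [0.0, 0.0, 0.5, 0.5],  # an "urban" topic
]
theta, doc = generate_document(topic_word, vocab, alpha=[0.5, 0.5], length=8, rng=rng)
print(theta, doc)
```

Training an LDA model runs this story in reverse: given only the observed words, it estimates the topic mixtures and topic-word distributions that most plausibly produced them.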
- Data Preprocessing
  - Reorganize records that have co-authors
  - Remove meaningless characters and keep Chinese characters only
  - Build a stopword dictionary
  - Segment the text with Jieba
  - Remove stopwords
  - Remove duplicate terms
  - Convert the result into a word dictionary
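The character filtering, segmentation, stopword removal, and de-duplication steps above can be sketched as a single function. This is an illustrative sketch, not the project's actual code: the function name `preprocess`, the regular expression (which covers only the basic CJK Unified Ideographs range), and the tiny stopword set are assumptions. The segmenter is passed in as a callable so the example runs without third-party packages.

```python
import re

# Matches runs of characters in the basic CJK Unified Ideographs block (an assumed range).
CHINESE_RUN = re.compile(r"[\u4e00-\u9fff]+")

def preprocess(text, segment, stopwords):
    """Keep Chinese characters only, segment, drop stopwords, and de-duplicate."""
    seen = set()
    result = []
    # Punctuation, digits, and Latin text fall outside the matched runs and are dropped.
    for run in CHINESE_RUN.findall(text):
        # `segment` is any callable that turns a string into an iterable of terms.
        for term in segment(run):
            # Skip stopwords and terms already kept, preserving first-seen order.
            if term in stopwords or term in seen:
                continue
            seen.add(term)
            result.append(term)
    return result

# Stand-in segmenter: `list` splits a string into single characters (illustration only).
tokens = preprocess("月光, 照在 river 上的月光!", segment=list, stopwords={"的", "在"})
print(tokens)  # ['月', '光', '照', '上']
```

With Jieba installed, passing `jieba.lcut` as `segment` would yield word-level terms instead of single characters.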
- Modelling
  - Call <corpora.dictionary>: Build a mapping between words and their integer ids.
  - Call <dictionary.doc2bow>: Convert each document into the bag-of-words (BoW) format.
  - Call <gensim.models.ldamodel>: Train the LDA model.
  - Call <pyLDAvis.gensim>: Visualize the LDA topic model results.
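The first two steps above produce the data structures that the model is trained on: a word-to-id mapping, and one list of (id, count) pairs per document. The pure-Python sketch below shows the shape of that data. It is not gensim's implementation, the helper names `build_dictionary` and `doc_to_bow` are hypothetical, and the id assignment order (first appearance) is an assumption that may differ from what gensim produces.

```python
from collections import Counter

def build_dictionary(documents):
    """Assign an integer id to each distinct term, in order of first appearance."""
    token2id = {}
    for doc in documents:
        for term in doc:
            if term not in token2id:
                token2id[term] = len(token2id)
    return token2id

def doc_to_bow(doc, token2id):
    """Convert a tokenized document into sorted (term_id, count) pairs."""
    # Terms missing from the dictionary are ignored.
    counts = Counter(token2id[t] for t in doc if t in token2id)
    return sorted(counts.items())

docs = [["moon", "river", "moon"], ["city", "street", "river"]]
token2id = build_dictionary(docs)
corpus = [doc_to_bow(d, token2id) for d in docs]
print(token2id)  # {'moon': 0, 'river': 1, 'city': 2, 'street': 3}
print(corpus)    # [[(0, 2), (1, 1)], [(1, 1), (2, 1), (3, 1)]]
```

The bag-of-words format discards word order and keeps only counts, which matches LDA's assumption that each word in a document is drawn independently given the document's topic mixture.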
Visualization Results
Yearly Topics of Voice & Verse Magazine
① Result of 2013 (click here to see the interactive HTML)
② Result of 2019 (click here to see the interactive HTML)
To see the results of other years, click the corresponding year:
2011, 2012, 2014, 2015, 2016, 2017, 2018
Code