LDA Modelling

Data Preparation

To further explore the collective memories of these 60 interviewees, unsupervised learning offers a useful way to identify common keywords and recurring themes. I employed Latent Dirichlet Allocation (LDA) for topic modeling, using the Gensim library to uncover hidden topic structures within the interview data. Preprocessing involved text segmentation, stop word removal, dictionary creation, and selecting the number of topics (a sketch of this pipeline follows the stopword list below). For segmentation, I opted for the Jieba library rather than pycantonese, because Jieba's support for a custom user dictionary proved essential for segmenting these transcripts accurately.

The custom dictionary encompasses personal names, place names, and Cantonese terms that are not commonly found in standard written Chinese. An illustration of the dictionary format is shown below:
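Jieba's user-dictionary format is one entry per line, with an optional frequency and an optional part-of-speech tag; the file is loaded with jieba.load_userdict. The entries below are hypothetical placeholders illustrating the format only, not items from the project's actual dictionary:

```python
# Hypothetical entries illustrating Jieba's user-dictionary format
# (word, optional frequency, optional part-of-speech tag) — these are
# placeholders, not the project's actual dictionary entries.
with open("userdict.txt", "w", encoding="utf-8") as f:
    f.write("調景嶺 10 ns\n")   # place name (ns)
    f.write("嘉頓 10 nt\n")     # organization name (nt)
    f.write("飲茶 5 v\n")       # Cantonese term (v)
```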

The list of stopwords is as follows: "就", "呢", "嘅", "係", "我", "噉", "咁", "都", "喺", "嗰", "佢", "啊", "冇", "有", "呀", "咗", "自己", "乜嘢", "早期", "所以", "因為", "喇", "嘛", "同埋", "咁樣", "㗎", "啦", "我想講", "哈哈", "考考", "好多好多", "一個", "不過", "好多", "哈哈哈", "呀呀", "其實", "當時", "另外", "好似", "可以", "好像", "或者", "以前", "譬如", "正中", "多利", "點解會", "一批", "三批", "本身", "如果", "知道", "同一", "一直", "包括", "仲有", "加入", "記得", "好好", "當年", "一年", "只有", "有個", "學下", "這個", "之前", "我覺", "已經", "不如", "一定", "主要", "一間", "而家", "一路", "甚麼", "確有", "我入", "加上", "部分", "一位", "覺得", "嗰陣", "嗰陣時", "我哋", "之後", "唔係", "呢個", "嗰度", "亦都", "唔知", "有啲", "邊度", "嗰時".
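Putting these pieces together, a minimal sketch of the preprocessing pipeline might look like the following (the file name userdict.txt and the interviews variable are assumptions for illustration, and the stopword set is abridged):

```python
# A minimal sketch of the preprocessing pipeline described above.
import jieba
from gensim import corpora

jieba.load_userdict("userdict.txt")  # custom names, places, Cantonese terms

# Abridged stopword set; the full list is given above
stopwords = {"就", "呢", "嘅", "係", "我", "噉", "咁"}

def tokenize(text):
    # Segment with Jieba, then drop stopwords and whitespace-only tokens
    return [tok for tok in jieba.cut(text)
            if tok.strip() and tok not in stopwords]

interviews = ["..."]  # placeholder for the 60 transcript strings
texts = [tokenize(doc) for doc in interviews]

dictionary = corpora.Dictionary(texts)            # token -> id mapping
corpus = [dictionary.doc2bow(t) for t in texts]   # bag-of-words vectors
```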

LDA

After preprocessing, I trained the LDA topic model. The training function takes a corpus, a dictionary, and the desired number of topics as input, uses the Gensim library to fit the model, and returns the trained model. In the end, I generated 8 topics.
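A sketch of this training step using Gensim's LdaModel is shown below; the passes and random_state hyperparameters are illustrative defaults, not necessarily the project's actual settings:

```python
# A sketch of the LDA training step with Gensim's LdaModel.
from gensim.models import LdaModel

def train_lda(corpus, dictionary, num_topics):
    # Fit and return a Gensim LDA model
    return LdaModel(corpus=corpus,
                    id2word=dictionary,
                    num_topics=num_topics,
                    passes=10,        # illustrative, not a confirmed setting
                    random_state=42)  # illustrative, for reproducibility

lda_model = train_lda(corpus, dictionary, num_topics=8)

# Inspect the top words of each of the 8 topics
for topic_id, words in lda_model.print_topics(num_topics=8, num_words=10):
    print(topic_id, words)
```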

Furthermore, I used pyLDAvis to visualize the results of the LDA modeling. It generates an Intertopic Distance Map and displays the top 30 most salient terms. Circles that are closer together represent more similar topics, and the size of each circle reflects the proportion of the corpus associated with that topic. An example is shown below:
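A minimal sketch of the visualization step, assuming pyLDAvis 3.x (where the Gensim adapter lives in pyLDAvis.gensim_models) and a hypothetical output file name:

```python
# A sketch of the pyLDAvis step: prepare the intertopic distance map
# from the trained model and save it as an interactive HTML page.
import pyLDAvis
import pyLDAvis.gensim_models

vis = pyLDAvis.gensim_models.prepare(lda_model, corpus, dictionary)
pyLDAvis.save_html(vis, "lda_vis.html")  # placeholder file name
```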