LDA Modeling
Data Preparation
To further explore the collective memories of these 60 interviewees, I turned to unsupervised learning, which offers a valuable way to identify common keywords and recurring themes. Specifically, I employed Latent Dirichlet Allocation (LDA) for topic modeling, using the Gensim library to uncover hidden topic structures within the interview data. Preprocessing involved text segmentation, stop-word removal, dictionary creation, and determining the optimal number of topics.
I opted to use the Jieba library and its user-dictionary feature for segmentation, rather than pycantonese, because the ability to load a custom dictionary proved essential for segmenting this material accurately.
The dictionary encompasses personal names, place names, and Cantonese terms that are not commonly found in standard written Chinese. An example is shown below:

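Jieba's user dictionary expects one entry per line in the format "word frequency part-of-speech", with the frequency and tag optional. The entries below are illustrative placeholders for the three kinds of terms described above, not actual lines from oralhistory_dictionary.txt (陳大文 stands in for a personal name, 調景嶺 for a place name, and 徙置區 for a local Cantonese term):

    陳大文 10 nr
    調景嶺 8 ns
    徙置區 5 n
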
The list of stopwords is as follows:
"就","呢","嘅","係","我","噉","咁","都","喺","嗰","佢","啊","冇","有","呀","咗","自己","乜嘢","早期","所以","因為","喇","嘛","同埋","咁樣","㗎","啦","我想講","哈哈","考考","好多好多","一個","不過","好多","哈哈哈","呀呀","其實","當時","另外","好似","可以","好像","或者","以前","譬如","正中","多利","點解會","一批","三批","本身","如果","知道","同一","一直","包括","仲有","加入","記得","好好","當年","一年","只有","有個","學下","這個","之前","我覺","已經","不如","一定","主要","一間","而家","一路","甚麼","確有","我入","加上","部分","一位","覺得","嗰陣","嗰陣時","我哋","之後","唔係","呢個","嗰度","亦都","唔知","有啲","邊度","嗰時".
LDA
After preprocessing, I trained a Latent Dirichlet Allocation (LDA) topic model with the Gensim library. The training function takes a corpus, a dictionary, and the desired number of topics as input and returns the fitted model. The final model uses 8 topics.
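The listing below imports Gensim's CoherenceModel, the standard tool for this selection step. As a rough sketch (the candidate range of 2 to 14 topics and the c_v coherence measure are assumptions, not the project's recorded settings), the topic count can be chosen by training one model per candidate and keeping the most coherent one:

    # Sketch: choose num_topics by coherence; assumes the Gensim imports from the listing below
    def select_num_topics(corpus, dictionary, texts, candidates=range(2, 15)):
        best_k, best_score = None, float('-inf')
        for k in candidates:
            model = models.LdaModel(corpus, num_topics=k, id2word=dictionary, passes=15)
            score = CoherenceModel(model=model, texts=texts, dictionary=dictionary,
                                   coherence='c_v').get_coherence()
            if score > best_score:
                best_k, best_score = k, score
        return best_k
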
Furthermore, I used pyLDAvis to visualize the results of the LDA modeling. It generates an Intertopic Distance Map and displays the 30 most salient terms. The closer two circles are on the map, the more similar the topics they represent; the size of each circle corresponds to the proportion of the text associated with that topic. An example is shown below:

import os
import docx
import jieba
import pyLDAvis
import pyLDAvis.gensim_models
from gensim import corpora, models
from gensim.models.coherencemodel import CoherenceModel  # available for topic-number selection
from google.colab import drive

# Mount Google Drive so the interview files are accessible
drive.mount('/content/drive')

# Custom dictionary of personal names, place names, and Cantonese terms,
# loaded once rather than on every preprocessing call
jieba.load_userdict('/content/oralhistory_dictionary.txt')

# Stop words (a set gives fast membership tests)
STOPWORDS = {"就","呢","嘅","係","我","噉","咁","都","喺","嗰","佢","啊","冇",
             "有","呀","咗","自己","乜嘢","早期","所以","因為","喇","嘛","同埋",
             "咁樣","㗎","啦","我想講","哈哈","考考","好多好多","一個","不過","好多",
             "哈哈哈","呀呀","其實","當時","另外","好似","可以","好像","或者","以前",
             "譬如","正中","多利","點解會","一批","三批","本身","如果","知道","同一",
             "一直","包括","仲有","加入","記得","好好","當年","一年","只有","有個",
             "學下","這個","之前","我覺","已經","不如","一定","主要","一間","而家",
             "一路","甚麼","確有","我入","加上","部分","一位","覺得","嗰陣","嗰陣時",
             "我哋","之後","唔係","呢個","嗰度","亦都","唔知","有啲","邊度","嗰時"}

# 1. Convert DOCX to text
def toText(path):
    doc = docx.Document(path)
    # Join every paragraph; returning only the first would discard most of the interview
    return '\n'.join(parag.text for parag in doc.paragraphs)

# 2. Preprocessing: segment with Jieba, then drop stop words and single characters
def jieba_preprocess_text(text):
    return [word for word in jieba.cut(text) if word not in STOPWORDS and len(word) > 1]

# 3. Create dictionary and corpus
def create_dictionary_corpus(documents):
    dictionary = corpora.Dictionary(documents)
    corpus = [dictionary.doc2bow(doc) for doc in documents]
    return dictionary, corpus

# 4. Train LDA model
def train_lda_model(corpus, dictionary, num_topics=8):
    lda_model = models.LdaModel(corpus, num_topics=num_topics, id2word=dictionary, passes=15)
    return lda_model

# 5. Visualize the model with pyLDAvis
def visualize(lda_model, corpus, dictionary):
    pyLDAvis.enable_notebook()
    data = pyLDAvis.gensim_models.prepare(lda_model, corpus, dictionary)
    return pyLDAvis.display(data)

# Jieba LDA model
folder = '/content/drive/MyDrive/Oral History Project codes and files/六十人訪問(finalized1)'
docx_files = [os.path.join(folder, file) for file in os.listdir(folder) if file.endswith('.docx')]
documents = [toText(file) for file in docx_files]
jieba_documents = [jieba_preprocess_text(doc) for doc in documents]
jieba_dictionary, jieba_corpus = create_dictionary_corpus(jieba_documents)
jieba_lda_model = train_lda_model(jieba_corpus, jieba_dictionary)

# Topics output
print('--------------------------- Jieba LDA Model ------------------------------')
for idx, topic in jieba_lda_model.print_topics(-1):
    print(f'Topic: {idx} \nWords: {topic}')

# Graph output
visualize(jieba_lda_model, jieba_corpus, jieba_dictionary)
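
Since notebook output in Colab is transient, the interactive chart can also be written to a standalone HTML file with pyLDAvis's save_html; a short sketch (the filename is an arbitrary choice, not one used in the project):

    # Persist the interactive chart; the output filename is an arbitrary example
    vis_data = pyLDAvis.gensim_models.prepare(jieba_lda_model, jieba_corpus, jieba_dictionary)
    pyLDAvis.save_html(vis_data, 'jieba_lda_vis.html')
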
References:
- Teng Yuan Chang, ‘直觀理解 LDA (Latent Dirichlet Allocation) 與文件主題模型’, Medium, 19 February 2019, https://tengyuanchang.medium.com/%E7%9B%B4%E8%A7%80%E7%90%86%E8%A7%A3-lda-latent-dirichlet-allocation-%E8%88%87%E6%96%87%E4%BB%B6%E4%B8%BB%E9%A1%8C%E6%A8%A1%E5%9E%8B-ab4f26c27184, accessed 20 June 2025.
- Neutrino3316, ‘结巴中文分词’ (jieba Chinese word segmentation), https://github.com/fxsjy/jieba, accessed 20 June 2025.
- pyLDAvis 3.4.1, https://pypi.org/project/pyLDAvis/, accessed 20 June 2025.