LDA Modeling
Data Preparation
To further explore the collective memories of these 60 interviewees, I turned to unsupervised learning, which offers a valuable way to identify common keywords and recurring themes. Specifically, I employed Latent Dirichlet Allocation (LDA) for topic modeling, using the Gensim library to uncover hidden topic structures within the interview data. Preprocessing involved text segmentation, stop-word removal, dictionary creation, and determining the optimal number of topics.
I opted to use the Jieba library and its user-dictionary feature for segmentation, rather than pycantonese, because the ability to load a custom dictionary proved essential for segmenting this material accurately.
The dictionary encompasses personal names, place names, and Cantonese terms that are not commonly found in standard written Chinese. An example is shown below:

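Jieba's user dictionary expects one entry per line in the format "word frequency part-of-speech", with the frequency and tag optional. The entries below are illustrative placeholders for the three kinds of terms described above, not actual lines from oralhistory_dictionary.txt (陳大文 stands in for a personal name, 調景嶺 for a place name, and 徙置區 for a local Cantonese term):

    陳大文 10 nr
    調景嶺 8 ns
    徙置區 5 n
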
The list of stopwords is as follows:
"就","呢","嘅","係","我","噉","咁","都","喺","嗰","佢","啊","冇","有","呀","咗","自己","乜嘢","早期","所以","因為","喇","嘛","同埋","咁樣","㗎","啦","我想講","哈哈","考考","好多好多","一個","不過","好多","哈哈哈","呀呀","其實","當時","另外","好似","可以","好像","或者","以前","譬如","正中","多利","點解會","一批","三批","本身","如果","知道","同一","一直","包括","仲有","加入","記得","好好","當年","一年","只有","有個","學下","這個","之前","我覺","已經","不如","一定","主要","一間","而家","一路","甚麼","確有","我入","加上","部分","一位","覺得","嗰陣","嗰陣時","我哋","之後","唔係","呢個","嗰度","亦都","唔知","有啲","邊度","嗰時".
LDA
After preprocessing, I trained a Latent Dirichlet Allocation (LDA) topic model with the Gensim library. The training function takes a corpus, a dictionary, and the desired number of topics as input and returns the fitted model. The final model uses 8 topics.
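The listing below imports Gensim's CoherenceModel, the standard tool for this selection step. As a rough sketch (the candidate range of 2 to 14 topics and the c_v coherence measure are assumptions, not the project's recorded settings), the topic count can be chosen by training one model per candidate and keeping the most coherent one:

    # Sketch: choose num_topics by coherence; assumes the Gensim imports from the listing below
    def select_num_topics(corpus, dictionary, texts, candidates=range(2, 15)):
        best_k, best_score = None, float('-inf')
        for k in candidates:
            model = models.LdaModel(corpus, num_topics=k, id2word=dictionary, passes=15)
            score = CoherenceModel(model=model, texts=texts, dictionary=dictionary,
                                   coherence='c_v').get_coherence()
            if score > best_score:
                best_k, best_score = k, score
        return best_k
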
Furthermore, I used pyLDAvis to visualize the results of the LDA modeling. It generates an Intertopic Distance Map and displays the 30 most salient terms. The closer two circles are on the map, the more similar the topics they represent; the size of each circle corresponds to the proportion of the text associated with that topic. An example is shown below:

import os
import docx
import jieba
import pyLDAvis
import pyLDAvis.gensim_models
from gensim import corpora, models
from gensim.models.coherencemodel import CoherenceModel  # available for topic-number selection
from google.colab import drive

# Mount Google Drive so the interview files are accessible
drive.mount('/content/drive')

# Custom dictionary of personal names, place names, and Cantonese terms,
# loaded once rather than on every preprocessing call
jieba.load_userdict('/content/oralhistory_dictionary.txt')

# Stop words (a set gives fast membership tests)
STOPWORDS = {"就","呢","嘅","係","我","噉","咁","都","喺","嗰","佢","啊","冇",
             "有","呀","咗","自己","乜嘢","早期","所以","因為","喇","嘛","同埋",
             "咁樣","㗎","啦","我想講","哈哈","考考","好多好多","一個","不過","好多",
             "哈哈哈","呀呀","其實","當時","另外","好似","可以","好像","或者","以前",
             "譬如","正中","多利","點解會","一批","三批","本身","如果","知道","同一",
             "一直","包括","仲有","加入","記得","好好","當年","一年","只有","有個",
             "學下","這個","之前","我覺","已經","不如","一定","主要","一間","而家",
             "一路","甚麼","確有","我入","加上","部分","一位","覺得","嗰陣","嗰陣時",
             "我哋","之後","唔係","呢個","嗰度","亦都","唔知","有啲","邊度","嗰時"}

# 1. Convert DOCX to text
def toText(path):
    doc = docx.Document(path)
    # Join every paragraph; returning only the first would discard most of the interview
    return '\n'.join(parag.text for parag in doc.paragraphs)

# 2. Preprocessing: segment with Jieba, then drop stop words and single characters
def jieba_preprocess_text(text):
    return [word for word in jieba.cut(text) if word not in STOPWORDS and len(word) > 1]

# 3. Create dictionary and corpus
def create_dictionary_corpus(documents):
    dictionary = corpora.Dictionary(documents)
    corpus = [dictionary.doc2bow(doc) for doc in documents]
    return dictionary, corpus

# 4. Train LDA model
def train_lda_model(corpus, dictionary, num_topics=8):
    lda_model = models.LdaModel(corpus, num_topics=num_topics, id2word=dictionary, passes=15)
    return lda_model

# 5. Visualize the model with pyLDAvis
def visualize(lda_model, corpus, dictionary):
    pyLDAvis.enable_notebook()
    data = pyLDAvis.gensim_models.prepare(lda_model, corpus, dictionary)
    return pyLDAvis.display(data)

# Jieba LDA model
folder = '/content/drive/MyDrive/Oral History Project codes and files/六十人訪問(finalized1)'
docx_files = [os.path.join(folder, file) for file in os.listdir(folder) if file.endswith('.docx')]
documents = [toText(file) for file in docx_files]
jieba_documents = [jieba_preprocess_text(doc) for doc in documents]
jieba_dictionary, jieba_corpus = create_dictionary_corpus(jieba_documents)
jieba_lda_model = train_lda_model(jieba_corpus, jieba_dictionary)

# Topics output
print('--------------------------- Jieba LDA Model ------------------------------')
for idx, topic in jieba_lda_model.print_topics(-1):
    print(f'Topic: {idx} \nWords: {topic}')

# Graph output
visualize(jieba_lda_model, jieba_corpus, jieba_dictionary)
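
Since notebook output in Colab is transient, the interactive chart can also be written to a standalone HTML file with pyLDAvis's save_html; a short sketch (the filename is an arbitrary choice, not one used in the project):

    # Persist the interactive chart; the output filename is an arbitrary example
    vis_data = pyLDAvis.gensim_models.prepare(jieba_lda_model, jieba_corpus, jieba_dictionary)
    pyLDAvis.save_html(vis_data, 'jieba_lda_vis.html')
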
References:
- Teng Yuan Chang, ‘直觀理解 LDA (Latent Dirichlet Allocation) 與文件主題模型’, Medium, 19 February 2019, https://tengyuanchang.medium.com/%E7%9B%B4%E8%A7%80%E7%90%86%E8%A7%A3-lda-latent-dirichlet-allocation-%E8%88%87%E6%96%87%E4%BB%B6%E4%B8%BB%E9%A1%8C%E6%A8%A1%E5%9E%8B-ab4f26c27184, accessed 20 June 2025.
- Neutrino3316, ‘结巴中文分词’ (jieba Chinese word segmentation), https://github.com/fxsjy/jieba, accessed 20 June 2025.
- pyLDAvis 3.4.1, https://pypi.org/project/pyLDAvis/, accessed 20 June 2025.