LDA Model Experiments
Using an LDA (Latent Dirichlet Allocation) model trained on our datasets, we aim to uncover the hidden topics and their corresponding word distributions within the given documents. Our primary objective is to train an LDA model on the text data for content classification; from the trained model we can then extract the latent topics and gain a deeper understanding of the content of our documents.
1. Data Preprocessing
Our Datasets:
- The family correspondence of Luis Peng Fan, from the Chinese University of Hong Kong (CUHK)
- The Report of Oral History of Overseas Chinese in Cuba, from the University of Hong Kong
We preprocess the datasets by placing each article in its own cell of a .xlsx file, one article per row, which can then be loaded as shown below.
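A minimal loading sketch, assuming a hypothetical file name and column name (each row of the sheet holds one article):

```python
import pandas as pd

# Hypothetical file and column names: one article per row in column "text".
df = pd.read_excel("cuba_letters.xlsx")
articles = df["text"].astype(str).tolist()
```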
2. Modelling
- Using the jieba package to segment text and remove stopwords: jieba is a popular Chinese text segmentation library that splits Chinese text into individual words. In the context of LDA modeling, a common preprocessing step is to remove stopwords, i.e. frequently occurring words that carry little meaningful information. jieba handles the tokenization, after which the stopwords can be filtered from the token stream, helping to improve the quality of the LDA model (see the first sketch after this list).
- Using the sklearn package to train the LDA model: sklearn (scikit-learn) is a widely used machine learning library in Python that provides a comprehensive set of tools for various tasks, including topic modeling. Its LatentDirichletAllocation implementation lets you specify parameters such as the number of topics, the number of iterations, and other settings of the LDA algorithm (see the second sketch after this list).
- Using the pyLDAvis package to visualize the LDA model: pyLDAvis is a Python library that offers interactive visualization tools designed specifically for LDA models. After training the model with sklearn, pyLDAvis generates an interactive visualization of the discovered topics, their relationships, and the distribution of terms within each topic, which assists in interpreting and exploring the model's output (see the sketch in Section 4).
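A minimal sketch of the segmentation step, assuming a hypothetical stopword file (one word per line); in the second experimental setting of Section 3, it would also list the kinship terms:

```python
import jieba

# Hypothetical stopword list; it may optionally include kinship terms
# such as 妹妹, 姑母, 兒子 (see Section 3).
with open("stopwords.txt", encoding="utf-8") as f:
    stopwords = {line.strip() for line in f if line.strip()}

def tokenize(text):
    """Segment Chinese text with jieba, dropping whitespace and stopwords."""
    return [w for w in jieba.cut(text) if w.strip() and w not in stopwords]

# `articles` is the list of raw documents loaded in Section 1.
docs = [" ".join(tokenize(a)) for a in articles]
```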
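A sketch of the training step; the number of topics and the iteration count below are illustrative placeholders, not the values used in our experiments:

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

# Bag-of-words counts over the space-joined tokens from the previous sketch.
# This token_pattern keeps single-character Chinese words, which the
# default pattern would discard.
vectorizer = CountVectorizer(token_pattern=r"(?u)\b\w+\b")
X = vectorizer.fit_transform(docs)

# Illustrative hyperparameters: n_components is the number of topics.
lda = LatentDirichletAllocation(n_components=5, max_iter=50, random_state=0)
lda.fit(X)

# Inspect the ten highest-weighted terms of each topic.
terms = vectorizer.get_feature_names_out()
for k, weights in enumerate(lda.components_):
    top = [terms[i] for i in weights.argsort()[::-1][:10]]
    print(f"Topic {k}: {' '.join(top)}")
```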
3. LDA Results
We produce two versions of the LDA results, differing in whether family-relation words (e.g. 妹妹 younger sister, 姑母 aunt, 兒子 son) are treated as stopwords, because we find that the CUHK family-letter dataset contains so many kinship terms that the topics are otherwise hard to distinguish. At the same time, we regard these words as important, because they represent the family relations among the Cuban Chinese.
With these family-relation stopwords removed, the topics are:
When the stopwords (e.g. 兒子 son, 女兒 daughter, 母親 mother, 姑母 aunt) are included, the topics are:
4. Visualization
By applying the pyLDAvis package, we obtain an interactive HTML page of the results. The page shows an Intertopic Distance Map and the Top-30 Most Relevant Terms for each topic. The larger a topic's circle, the more prevalent that topic is in the corpus. From this we can identify the most relevant topics and terms for the Cuban Chinese materials.
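A sketch of this step, continuing from the model trained in Section 2; note that in pyLDAvis 3.4+ the scikit-learn helper lives in pyLDAvis.lda_model, while older releases expose it as pyLDAvis.sklearn:

```python
import pyLDAvis
import pyLDAvis.lda_model  # use pyLDAvis.sklearn on versions before 3.4

# `lda`, `X`, and `vectorizer` come from the training sketch in Section 2.
panel = pyLDAvis.lda_model.prepare(lda, X, vectorizer)
pyLDAvis.save_html(panel, "lda_visualization.html")  # open in a browser
```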