Topic Modeling

Generate Topics

To learn more about the important events that happened each year, we used topic modeling, a machine learning technique, to interpret the tens of millions of words published by the newspaper. We performed topic modeling with the BERTopic model. Because BERTopic was originally designed for languages such as English, in which words are naturally separated by spaces, we first preprocessed the text data. We segmented each year's content into words with CkipTagger and then removed stop words to reduce their influence on the topics. Finally, we chose a pre-trained transformer model from Hugging Face to run the BERTopic algorithm and generated about 10 topics for the newspapers of each year.
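The preprocessing steps above can be sketched as follows. This is a minimal illustration, not the project's actual code: the function name `preprocess`, the toy segmenter, and the stop-word set are hypothetical. The `segment` argument is assumed to take a list of sentences and return a list of token lists, which matches how a CkipTagger `WS` object is called, so a loaded `WS` instance could be passed in its place.

```python
from typing import Callable, Iterable, List, Set


def preprocess(sentences: Iterable[str],
               segment: Callable[[List[str]], List[List[str]]],
               stop_words: Set[str]) -> List[str]:
    """Segment each sentence, drop stop words, and join the tokens with spaces."""
    cleaned = []
    for tokens in segment(list(sentences)):
        kept = [t for t in tokens if t.strip() and t not in stop_words]
        cleaned.append(" ".join(kept))
    return cleaned


def toy_segment(batch: List[str]) -> List[List[str]]:
    """Stand-in segmenter (hypothetical): splits on '/' instead of running CkipTagger."""
    return [sentence.split("/") for sentence in batch]


# With CkipTagger one would instead load the model, e.g. ws = WS("./data"),
# and call preprocess(sentences, ws, stop_words).
data = preprocess(["香港/的/天氣/很/好"], toy_segment, {"的", "很"})
print(data)
```

The result is a list of space-joined strings, which is the input format the topic model expects in the code below.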

Code

import re
from ckiptagger import data_utils, construct_dictionary, WS, POS, NER
from bertopic import BERTopic
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

# download the pre-trained model from Hugging Face
MODEL_NAME = "elliotthwang/t5-small-finetuned-xlsum-chinese-tradition"
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForSeq2SeqLM.from_pretrained(MODEL_NAME)

# data is the list of sentences after word segmentation;
# all the words in one sentence are concatenated with " "

def generate_models(model, data):
    # create the topic_model object
    # note: BERTopic documents sentence-transformers models and Hugging Face
    # feature-extraction pipelines as embedding backends; check that your
    # BERTopic version accepts this model object as embedding_model
    topic_model = BERTopic(embedding_model=model,
                           language="multilingual",
                           calculate_probabilities=True,
                           verbose=True,
                           nr_topics=8)

    # fit the model and generate the topics
    topics, probs = topic_model.fit_transform(data)

    # return the fitted topic_model object
    return topic_model

# fit the model on one year's data
topic_model = generate_models(model, data)

# get the info of all the topics generated
topic_model.get_topic_info()

# get the most frequent topic
topic_model.get_topic(0)

Topic Visualization

After choosing the best transformer model for the topic modeling, we generated topics for each year of the newspaper and reduced the number of topics to 10. The words in a topic are the important words that represent that topic. We then used several methods provided by the package to generate the visualizations and made some improvements to them. Because the visualizations are generated with the Plotly package, the graphs support interactions such as zooming and dragging. We therefore saved all the visualizations as HTML, which preserves most of the interactive functions the graphs support.

Heat Map

The first visualization is the heat map. Both the x-axis and the y-axis list the topics generated by the model, so each cell shows the relationship between a pair of topics. The darker the color, the more closely related the two topics are.

Code

heat_map = topic_model.visualize_heatmap()
heat_map.write_html("heatmap.html")
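
To make the color scale concrete: BERTopic describes this heat map as based on the cosine similarity between topic embeddings. The sketch below shows that computation on made-up two-dimensional vectors. The function name `cosine_similarity_matrix` and the example embeddings are illustrative assumptions, not values from our model.

```python
import math
from typing import List


def cosine_similarity_matrix(vectors: List[List[float]]) -> List[List[float]]:
    """Return the pairwise cosine similarity between all vectors."""
    norms = [math.sqrt(sum(x * x for x in v)) for v in vectors]
    return [
        [sum(a * b for a, b in zip(u, v)) / (norms[i] * norms[j])
         for j, v in enumerate(vectors)]
        for i, u in enumerate(vectors)
    ]


# Hypothetical topic embeddings: topics 0 and 1 point in similar directions,
# while topic 2 is orthogonal to topic 0.
topic_embeddings = [[1.0, 0.0], [0.8, 0.6], [0.0, 1.0]]
similarity = cosine_similarity_matrix(topic_embeddings)
for row in similarity:
    print([round(value, 2) for value in row])
```

Each row and column corresponds to one topic, so the matrix is symmetric and its diagonal is 1, which matches the layout of the heat map.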

Intertopic Distance Map

The second visualization is the intertopic distance map. Like the heat map, it shows the relationships between the generated topics.

Code

dist_map = topic_model.visualize_topics()
dist_map.write_html("distmap.html")

Bar Chart

The third visualization is the bar chart. It shows eight topics, each with its top words and their scores. The score is computed with c-TF-IDF, a metric of a word's importance within a topic. The higher a word's score, the more representative the word is of that topic. However, showing only the topic, word, and score is not informative enough, so we added a hyperlink to each word in the bar chart that points to the library's search API. When the user clicks a word, the browser opens the library web page with the query results for that word in the database, which are photos of the newspaper pages containing that word.
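
As an illustration of how such a score behaves, the sketch below computes a simplified c-TF-IDF following the formula given in the BERTopic documentation, W(t, c) = tf(t, c) * log(1 + A / f(t)), where tf(t, c) is the count of term t in topic c, f(t) is its count across all topics, and A is the average number of words per topic. The function name `c_tf_idf` and the toy documents are hypothetical, and BERTopic's own implementation differs in details such as normalization.

```python
import math
from collections import Counter
from typing import Dict, List


def c_tf_idf(docs_per_topic: Dict[int, List[str]]) -> Dict[int, Dict[str, float]]:
    """Score each term per topic; documents are space-separated token strings."""
    # Term counts inside each topic (all documents of a topic are merged).
    tf = {topic: Counter(" ".join(docs).split())
          for topic, docs in docs_per_topic.items()}

    # Term counts across all topics.
    total = Counter()
    for counts in tf.values():
        total.update(counts)

    # Average number of words per topic.
    avg_words = sum(total.values()) / len(tf)

    return {
        topic: {term: count * math.log(1 + avg_words / total[term])
                for term, count in counts.items()}
        for topic, counts in tf.items()
    }


# Toy corpus: "market" appears in both topics, so it scores lower than
# the words that are specific to one topic.
scores = c_tf_idf({0: ["stock market stock"], 1: ["typhoon market"]})
for topic, term_scores in scores.items():
    print(topic, sorted(term_scores, key=term_scores.get, reverse=True))
```

A word that is frequent in one topic but rare elsewhere receives a high score, which is why the top-scoring words are good labels for a topic.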

To make the words clickable, we studied the Plotly package to understand how the visualization is generated. Since we saved all the visualizations as HTML, we looked for a way to wrap each word in an <a> tag. Once we understood how the bar chart is generated, we added <a> tags around the words in the program. In the resulting HTML, each word is wrapped in an <a> element carrying the hyperlink, which makes the word clickable.
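
The link-building step can be sketched in isolation as below. The helper name `make_link` is hypothetical, and percent-encoding the query with `urllib.parse.quote` is an added assumption rather than part of our original program. The search URL is the one used in our code.

```python
from urllib.parse import quote

# Search URL of the library repository; {query} is replaced by the quoted word.
SEARCH_URL = ("https://repository.lib.cuhk.edu.hk/en/islandora/search/"
              "{query}?type=edismax&cp=cuhk:2581135")


def make_link(word: str) -> str:
    """Wrap a word in an <a> tag that links to its search results."""
    # Surround the word with double quotes (as in our search URL) and
    # percent-encode it so that it is safe inside a URL.
    query = quote('"{}"'.format(word))
    return "<a href='{}'>{}</a>".format(SEARCH_URL.format(query=query), word)


print(make_link("news"))
```

Plotly renders a small subset of HTML, including `<a href>`, in text such as axis tick labels, which is why a string built this way can be used directly as a bar label.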

Code

import itertools
import numpy as np
from typing import List

import plotly.graph_objects as go
from plotly.subplots import make_subplots


def visualize_barchart(topic_model,
                       topics: List[int] = None,
                       top_n_topics: int = 8,
                       n_words: int = 5,
                       custom_labels: bool = False,
                       title: str = "Topic Word Scores",
                       width: int = 250,
                       height: int = 250) -> go.Figure:

    colors = itertools.cycle(["#D55E00", "#0072B2", "#CC79A7",
                                "#E69F00", "#56B4E9", "#009E73", "#F0E442"])

    # Select topics based on top_n and topics args
    freq_df = topic_model.get_topic_freq()
    freq_df = freq_df.loc[freq_df.Topic != -1, :]
    if topics is not None:
        topics = list(topics)
    elif top_n_topics is not None:
        topics = sorted(freq_df.Topic.to_list()[:top_n_topics])
    else:
        topics = sorted(freq_df.Topic.to_list()[0:6])

    # Initialize figure
    if topic_model.custom_labels_ is not None and custom_labels:
        subplot_titles = [topic_model.custom_labels_[topic + topic_model._outliers] \
                                for topic in topics]
    else:
        subplot_titles = [f"Topic {topic}" for topic in topics]
    columns = 4
    rows = int(np.ceil(len(topics) / columns))
    fig = make_subplots(rows=rows,
                        cols=columns,
                        shared_xaxes=False,
                        horizontal_spacing=.1,
                        vertical_spacing=.4 / rows if rows > 1 else 0,
                        subplot_titles=subplot_titles)

    # Add barchart for each topic
    row = 1
    column = 1
    for topic in topics:
        # keep the top n_words words, reversed so the highest score is drawn at the top
        top_words = [word for word, _ in topic_model.get_topic(topic)][:n_words][::-1]
        words = []
        for top_word in top_words:
            # wrap the word in a link to the library search results for that word
            url = ('https://repository.lib.cuhk.edu.hk/en/'
                   'islandora/search/"{}"?type=edismax&cp=cuhk:2581135').format(top_word)
            words.append("<a href='{}'>{} </a>".format(url, top_word))
        scores = [score for _, score in topic_model.get_topic(topic)][:n_words][::-1]

        fig.add_trace(
            go.Bar(x=scores,
                   y=words,
                   orientation='h',
                   marker_color=next(colors)),
            row=row, col=column)

        if column == columns:
            column = 1
            row += 1
        else:
            column += 1

    # Stylize graph
    fig.update_layout(
        template="plotly_white",
        showlegend=False,
        title={
            'text': f"<b>{title}</b>",
            'x': .5,
            'xanchor': 'center',
            'yanchor': 'top',
            'font': dict(
                size=22,
                color="Black")
        },
        width=width*4,
        height=height*rows if rows > 1 else height * 1.3,
        hoverlabel=dict(
            bgcolor="white",
            font_size=16,
            font_family="Rockwell"
        ),
    )

    fig.update_xaxes(showgrid=True)
    fig.update_yaxes(showgrid=True)

    return fig