Methodology - Digital Scholarship Projects, CUHK Library

Data Preprocessing

The raw dataset contained 5 million posts from Facebook, Twitter, and Reddit. To prepare the data for analysis, we applied the following preprocessing steps:

Spell checking and correction for user-generated typos
Lemmatisation to reduce words to their root forms
Character filtering to retain only alphanumeric characters (removing emojis and special symbols)
Short entry removal to discard posts with 5 words or fewer
Stop-word removal to eliminate common noise words

After preprocessing, the dataset was reduced to approximately 3.8 million text entries ready for modelling.
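The filtering steps above can be sketched in plain Python. This is a minimal illustration, not the project's actual pipeline: the stop-word set is a small hypothetical sample, the function names are invented for the example, and spell checking and lemmatisation are left out because they depend on an external library that the text does not name.

```python
import re

# Hypothetical, deliberately tiny stop-word list; a real run would use a fuller one.
STOP_WORDS = {"i", "a", "an", "and", "the", "is", "it", "to", "of", "in", "my"}

# Posts with 5 words or fewer are discarded, as described above.
MAX_SHORT_LENGTH = 5


def filter_characters(text):
    """Replace everything except letters, digits and whitespace with a space."""
    return re.sub(r"[^A-Za-z0-9\s]", " ", text)


def preprocess(posts):
    """Apply character filtering, short-entry removal and stop-word removal."""
    cleaned = []
    for post in posts:
        tokens = filter_characters(post).split()
        if len(tokens) <= MAX_SHORT_LENGTH:
            continue  # short entry: drop the whole post
        # Assumption: stop words are matched case-insensitively.
        kept = [t for t in tokens if t.lower() not in STOP_WORDS]
        cleaned.append(" ".join(kept))
    return cleaned
```

Note that the steps follow the order listed above, so the word count used for short-entry removal is taken before stop words are removed.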

Model Selection Strategy

We tested three different embedding models to find the best fit for social media health data.

The first model we tried was all-MiniLM-L6-v2, a general-purpose embedding model. It performed moderately well.

Chart 1: 1st Model Output

The second model was Bio_ClinicalBERT, which is designed for medical and clinical texts. However, it performed poorly on our data.

Chart 2: 2nd Model Output

The third model was BiomedNLP-PubMedBERT, another medical domain model. Like Bio_ClinicalBERT, it struggled with our dataset.

Chart 3: 3rd Model Output

Therefore, we selected all-MiniLM-L6-v2 as our primary embedding model for the rest of the analysis.

Domain Filtering

To ensure our analysis focused specifically on immunotherapy-related discussions, we constructed a custom domain keyword dictionary. This dictionary included terms related to:

Cancer types and treatment names (e.g., immunotherapy, PD-1, CAR-T, chemotherapy)
Side effects and symptoms (e.g., fatigue, nausea, oral sores)
Healthcare systems and costs (e.g., insurance, coverage, financial toxicity)

Posts that did not contain any domain-relevant keyword were removed. This filtering reduced the dataset from approximately 3.8 million to 2.7 million relevant posts.
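As an illustration of the keyword filter, the sketch below keeps a post only if it matches at least one dictionary term. The keyword list contains just the example terms quoted above, not the full dictionary, and the matching rules (case-insensitive, whole words, hyphen or space treated as interchangeable) are assumptions made for this example.

```python
import re

# Only the example terms named in the text; the real dictionary is larger.
DOMAIN_KEYWORDS = [
    "immunotherapy", "PD-1", "CAR-T", "chemotherapy",
    "fatigue", "nausea", "oral sores",
    "insurance", "coverage", "financial toxicity",
]


def build_keyword_pattern(keywords):
    """Compile one regex that matches any keyword as a whole word or phrase."""
    alternatives = []
    for keyword in keywords:
        # Assumption: "PD-1" and "PD 1" should both match, so the parts of a
        # multi-part term may be separated by hyphens or whitespace.
        parts = re.split(r"[\s-]+", keyword.strip())
        alternatives.append(r"[\s-]+".join(re.escape(p) for p in parts))
    return re.compile(r"\b(?:" + "|".join(alternatives) + r")\b", re.IGNORECASE)


def filter_by_domain(posts, pattern):
    """Return only the posts that contain at least one domain keyword."""
    return [post for post in posts if pattern.search(post)]
```

Because character filtering strips hyphens earlier in the pipeline, matching on "PD 1" as well as "PD-1" matters in practice. A usage example: `filter_by_domain(posts, build_keyword_pattern(DOMAIN_KEYWORDS))`.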

Topic Modelling

We used BERTopic to identify latent themes in patient discussions. BERTopic uses transformer-based embeddings and clustering to discover topics automatically. The process generated 24 distinct topics.

Chart 4: Topic Modelling Output
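A minimal sketch of fitting BERTopic with the selected embedding model might look like the following. It assumes the bertopic and sentence-transformers packages are installed and uses default settings throughout; the hyperparameters behind the 24 topics reported above are not stated here, so none are set. The function name `fit_topics` is hypothetical.

```python
# The embedding model selected earlier in this methodology.
EMBEDDING_MODEL = "all-MiniLM-L6-v2"


def fit_topics(documents, embedding_model=EMBEDDING_MODEL):
    """Fit BERTopic on a list of strings and return the model and topic ids."""
    # Imported here so the heavy dependency is only loaded when actually used.
    from bertopic import BERTopic

    topic_model = BERTopic(embedding_model=embedding_model)
    # fit_transform returns one topic id per document plus probabilities.
    topics, _probabilities = topic_model.fit_transform(list(documents))
    return topic_model, topics
```

After fitting, `topic_model.get_topic_info()` returns a table of topics with their sizes and top terms. Topic -1, if present, collects outlier documents that were not assigned to any cluster.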

Sentiment Analysis

To quantify patient sentiment over time, we used a RoBERTa-based sentiment classifier, a transformer-based model fine-tuned for sentiment analysis. Specifically, we used the cardiffnlp/twitter-roberta-base-sentiment-latest model.

Chart 5: Sentiment Analysis Output
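The sketch below shows one way the sentiment step could be wired up, assuming the Hugging Face transformers package is installed. The monthly grouping is an assumption made for illustration, since the time granularity is not stated above, and both function names are hypothetical.

```python
from collections import Counter, defaultdict

MODEL_NAME = "cardiffnlp/twitter-roberta-base-sentiment-latest"


def classify_sentiment(texts, model_name=MODEL_NAME):
    """Return one predicted sentiment label per input text."""
    # Imported here so the model is only downloaded and loaded when needed.
    from transformers import pipeline

    classifier = pipeline("sentiment-analysis", model=model_name)
    # truncation=True cuts posts that exceed the model's maximum input length.
    results = classifier(list(texts), truncation=True)
    return [result["label"] for result in results]


def label_counts_by_month(labelled_posts):
    """Count labels per calendar month from (date, label) pairs."""
    counts = defaultdict(Counter)
    for posted_on, label in labelled_posts:
        counts[posted_on.strftime("%Y-%m")][label] += 1
    return {month: dict(c) for month, c in sorted(counts.items())}
```

Pairing each post's date with the label from `classify_sentiment` and passing the pairs to `label_counts_by_month` gives per-month counts that can be plotted as a trend.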