Data Preprocessing
The raw dataset contained 5 million posts from Facebook, Twitter, and Reddit. To prepare the data for analysis, we applied the following preprocessing steps:
- Spell checking and correction of user-generated typos
- Lemmatisation to reduce words to their dictionary (base) forms
- Character filtering to retain only alphanumeric characters (removing emojis and special symbols)
- Short entry removal to discard posts with 5 words or fewer
- Stop-word removal to eliminate common, low-information words
After preprocessing, the dataset was reduced to approximately 3.8 million text entries ready for modelling.
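The rule-based steps above (character filtering, short entry removal, stop-word removal) can be sketched as follows. This is a minimal illustration under stated assumptions, not the pipeline actually used: the stop-word list is a tiny hypothetical sample, lowercasing is an added assumption, and spell correction and lemmatisation are omitted because they depend on external libraries the text does not name.

```python
import re

# Hypothetical, deliberately tiny stop-word list; a real run would use a fuller one.
STOP_WORDS = {"a", "an", "and", "the", "is", "are", "was", "to", "of",
              "in", "on", "for", "my", "i", "it"}

# Posts with 5 words or fewer are discarded, so at least 6 words are required.
MIN_WORDS = 6


def filter_characters(text):
    """Replace every non-alphanumeric character (emojis, symbols) with a space."""
    cleaned = re.sub(r"[^A-Za-z0-9\s]", " ", text)
    return re.sub(r"\s+", " ", cleaned).strip()


def preprocess(post):
    """Return the cleaned tokens of a post, or None if the post is too short."""
    # Lowercasing is an assumption made so stop-word matching is case-insensitive.
    tokens = filter_characters(post).lower().split()
    if len(tokens) < MIN_WORDS:
        return None
    return [t for t in tokens if t not in STOP_WORDS]


if __name__ == "__main__":
    print(preprocess("My fatigue was worse after the second infusion 😩"))
    print(preprocess("Great news 🎉!!"))
```

The length check runs before stop-word removal, mirroring the order of the list above; applying it afterwards would discard more posts.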
Model Selection Strategy
We tested three embedding models to find the best fit for social media health data.
The first model was all-MiniLM-L6-v2, a general-purpose sentence embedding model. It performed moderately well on our data.

chart 1: 1st model output
The second model was Bio_ClinicalBERT, which is designed for medical and clinical texts. However, it performed poorly on our data.

chart 2: 2nd model output
The third model was BiomedNLP-PubMedBERT, another medical-domain model. Like Bio_ClinicalBERT, it struggled with our dataset.

chart 3: 3rd model output
Because the general-purpose model performed better on our data than both medical-domain models, we selected all-MiniLM-L6-v2 as the primary embedding model for the rest of the analysis.
Domain Filtering
To ensure our analysis focused specifically on immunotherapy-related discussions, we constructed a custom domain keyword dictionary. This dictionary included terms related to:
- Cancer types and treatment names (e.g., immunotherapy, PD-1, CAR-T, chemotherapy)
- Side effects and symptoms (e.g., fatigue, nausea, oral sores)
- Healthcare systems and costs (e.g., insurance, coverage, financial toxicity)
Posts that did not contain any domain-relevant keyword were removed. This filtering reduced the dataset from approximately 3.8 million to approximately 2.7 million relevant posts.
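A keyword filter of this kind can be sketched as below. The keyword list contains only the example terms quoted above, not the full dictionary, and the whole-word matching and punctuation normalisation are assumptions chosen so that a term such as PD-1 still matches once hyphens have been stripped by the character filtering step.

```python
import re

# Only the example terms given in the text; the full dictionary is not reproduced here.
DOMAIN_KEYWORDS = [
    "immunotherapy", "PD-1", "CAR-T", "chemotherapy",
    "fatigue", "nausea", "oral sores",
    "insurance", "coverage", "financial toxicity",
]


def normalise(text):
    """Lowercase and turn every run of non-alphanumeric characters into one space."""
    return re.sub(r"[^a-z0-9]+", " ", text.lower()).strip()


# One alternation over all keywords, anchored on word boundaries so that
# "car t" does not match inside an unrelated word such as "cart".
KEYWORD_PATTERN = re.compile(
    r"\b(?:" + "|".join(re.escape(normalise(k)) for k in DOMAIN_KEYWORDS) + r")\b"
)


def is_relevant(post):
    """True if the post contains at least one domain keyword."""
    return KEYWORD_PATTERN.search(normalise(post)) is not None


def filter_posts(posts):
    """Keep only posts that mention at least one domain keyword."""
    return [p for p in posts if is_relevant(p)]


if __name__ == "__main__":
    sample = ["I love pizza", "CAR-T worked for me", "Insurance denied my claim"]
    print(filter_posts(sample))
```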
Topic Modelling
We used BERTopic to identify latent themes in patient discussions. BERTopic combines transformer-based embeddings with clustering to discover topics automatically. The process generated 24 distinct topics.

chart 4: topic modelling output
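Fitting BERTopic on the filtered posts might look like the sketch below. It assumes the `bertopic` package is installed and that the selected all-MiniLM-L6-v2 model is passed in by name; all other settings are library defaults, since the text does not state the configuration used. `fit_topics` and `topic_sizes` are hypothetical helper names introduced for illustration.

```python
from collections import Counter


def fit_topics(docs, embedding_model="all-MiniLM-L6-v2"):
    """Fit BERTopic on a list of documents and return the model and topic labels."""
    # Imported lazily because bertopic is a heavy optional dependency.
    from bertopic import BERTopic

    topic_model = BERTopic(embedding_model=embedding_model)
    topics, _probabilities = topic_model.fit_transform(docs)
    return topic_model, topics


def topic_sizes(topics):
    """Count documents per topic, ignoring the outlier label -1 that BERTopic assigns."""
    return Counter(t for t in topics if t != -1)


if __name__ == "__main__":
    # Toy labels standing in for the output of fit_topics on real posts.
    labels = [0, 0, 1, -1, 2, 1, 0]
    print(topic_sizes(labels))
```

The number of keys returned by `topic_sizes` is the number of discovered topics, excluding posts that were left unassigned.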
Sentiment Analysis
To quantify patient sentiment over time, we used a RoBERTa-based sentiment analysis model. Specifically, we used the cardiffnlp/twitter-roberta-base-sentiment-latest model.


chart 5: sentiment analysis output
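One way to turn per-post labels into a sentiment trend over time is sketched below. It assumes the Hugging Face `transformers` package is installed and that each post carries a month stamp. The numeric mapping of labels to -1, 0, and 1 and the monthly averaging are illustrative assumptions, not the aggregation described in the text.

```python
from collections import defaultdict

MODEL_NAME = "cardiffnlp/twitter-roberta-base-sentiment-latest"

# Assumed numeric scale for averaging; the text does not specify one.
LABEL_TO_SCORE = {"negative": -1.0, "neutral": 0.0, "positive": 1.0}


def classify(texts):
    """Return one sentiment label per text using the Hugging Face pipeline."""
    # Imported lazily because transformers is a heavy optional dependency.
    from transformers import pipeline

    classifier = pipeline("sentiment-analysis", model=MODEL_NAME)
    # truncation=True cuts posts that exceed the model's maximum input length.
    return [result["label"].lower() for result in classifier(texts, truncation=True)]


def monthly_mean_sentiment(records):
    """Average the sentiment score per month.

    records: iterable of (month, label) pairs, with month formatted as "YYYY-MM".
    """
    buckets = defaultdict(list)
    for month, label in records:
        buckets[month].append(LABEL_TO_SCORE[label])
    return {month: sum(scores) / len(scores) for month, scores in sorted(buckets.items())}


if __name__ == "__main__":
    toy = [("2023-01", "positive"), ("2023-01", "negative"),
           ("2023-02", "neutral"), ("2023-02", "positive")]
    print(monthly_mean_sentiment(toy))
```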