Data Preprocessing
The raw dataset contained 5 million posts from Facebook, Twitter, and Reddit. To prepare the data for analysis, we applied the following preprocessing steps:
- Spell checking and correction for user-generated typos
- Lemmatization to reduce words to their base (dictionary) forms
- Character filtering to retain only alphanumeric characters (removing emojis and special symbols)
- Short entry removal to discard posts with 5 words or fewer
- Stop-word removal to eliminate common noise words
After preprocessing, the dataset was reduced to approximately 3.8 million text entries ready for modeling.
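The steps above can be sketched in a few lines of Python. This is a minimal illustration rather than the pipeline actually used: the spell checker and lemmatizer are represented by hypothetical pass-through hooks (`correct`, `lemmatize`), the stop-word list is a small illustrative sample, and the character filter assumes ASCII alphanumerics.

```python
import re
from typing import Callable, Optional, Set

# Illustrative stop-word sample only; the list used in the study is not specified.
STOP_WORDS: Set[str] = {
    "a", "an", "the", "and", "or", "of", "to", "in", "is", "it", "i", "my", "was", "for", "on",
}

# Anything that is not an ASCII letter, digit, or whitespace (emojis, punctuation, symbols).
_NON_ALNUM = re.compile(r"[^A-Za-z0-9\s]+")


def identity(token: str) -> str:
    """Placeholder hook: returns the token unchanged."""
    return token


def preprocess_post(
    text: str,
    correct: Callable[[str], str] = identity,    # hypothetical spell-correction hook
    lemmatize: Callable[[str], str] = identity,  # hypothetical lemmatization hook
    min_words: int = 6,                          # posts with 5 words or fewer are dropped
    stop_words: Set[str] = STOP_WORDS,
) -> Optional[str]:
    """Apply the five steps in the listed order; return None for discarded posts."""
    # Steps 1-2: per-token spell correction, then lemmatization.
    tokens = [lemmatize(correct(tok)) for tok in text.split()]
    # Step 3: character filtering (non-alphanumeric runs become spaces).
    tokens = _NON_ALNUM.sub(" ", " ".join(tokens)).split()
    # Step 4: short-entry removal.
    if len(tokens) < min_words:
        return None
    # Step 5: stop-word removal (case-insensitive match).
    return " ".join(t for t in tokens if t.lower() not in stop_words)
```

In practice, `correct` and `lemmatize` would be backed by a spell-checking and an NLP library of choice; posts for which the function returns `None` are the ones removed from the dataset.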
Model Selection Strategy
We tested three different embedding models to find the best fit for social media health data.
The first model we tried was all-MiniLM-L6-v2, a general-purpose embedding model. It performed moderately well.

The second model was Bio_ClinicalBERT, which is designed for medical and clinical texts. However, it performed poorly on our data.

The third model was BiomedNLP-PubMedBERT, another medical domain model. Like Bio_ClinicalBERT, it struggled with our dataset.

Therefore, we selected all-MiniLM-L6-v2 as our primary embedding model for the rest of the analysis.
Domain Filtering
To ensure our analysis focused specifically on immunotherapy-related discussions, we constructed a custom domain keyword dictionary. This dictionary included terms related to:
- Cancer types and treatment names (e.g., immunotherapy, PD-1, CAR-T, chemotherapy)
- Side effects and symptoms (e.g., fatigue, nausea, oral sores)
- Healthcare systems and costs (e.g., insurance, coverage, financial toxicity)
Posts that did not contain any domain-relevant keyword were removed. This filtering reduced the dataset from approximately 3.8 million to 2.7 million relevant posts.
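A keyword filter of this kind could look roughly like the sketch below. The dictionary contains only the example terms listed above (the full dictionary is not reproduced here), and the category names, the case-insensitive matching, and the treatment of hyphens and spaces as interchangeable are assumptions made for illustration.

```python
import re
from typing import Dict, Iterable, List

# Example terms only; category names are illustrative.
DOMAIN_KEYWORDS: Dict[str, List[str]] = {
    "treatment": ["immunotherapy", "PD-1", "CAR-T", "chemotherapy"],
    "side_effects": ["fatigue", "nausea", "oral sores"],
    "healthcare_costs": ["insurance", "coverage", "financial toxicity"],
}


def _keyword_pattern(keyword: str) -> str:
    """Build a pattern in which hyphens and spaces are interchangeable separators."""
    parts = re.split(r"[\s\-]+", keyword)
    # At least one separator is required, so "CAR-T" does not match the word "cart".
    return r"[\s\-]+".join(re.escape(p) for p in parts)


def build_domain_regex(keywords: Dict[str, List[str]]) -> "re.Pattern[str]":
    """Compile one case-insensitive, whole-word regex covering every keyword."""
    terms = sorted((kw for group in keywords.values() for kw in group), key=len, reverse=True)
    alternation = "|".join(_keyword_pattern(kw) for kw in terms)
    return re.compile(rf"\b(?:{alternation})\b", re.IGNORECASE)


DOMAIN_REGEX = build_domain_regex(DOMAIN_KEYWORDS)


def is_domain_relevant(post: str) -> bool:
    """True if the post contains at least one domain keyword."""
    return DOMAIN_REGEX.search(post) is not None


def filter_posts(posts: Iterable[str]) -> List[str]:
    """Keep only the posts that mention a domain keyword."""
    return [p for p in posts if is_domain_relevant(p)]
```

Matching on whole words keeps short terms from firing inside unrelated words, and accepting either a hyphen or a space means the filter still works on text whose punctuation was stripped during preprocessing.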

Topic Modeling
We used BERTopic to identify latent themes in patient discussions. BERTopic uses transformer-based embeddings and clustering to discover topics automatically. The process generated 24 distinct topics.
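For orientation, a minimal BERTopic run might be set up as follows. This is a sketch under assumptions: it uses BERTopic's default clustering settings and passes all-MiniLM-L6-v2 as the embedding model, while the actual configuration used in the study is not stated. The helper `count_topics` is a hypothetical convenience function that excludes BERTopic's outlier label (-1) when counting topics.

```python
from typing import List, Sequence, Tuple


def fit_topics(docs: Sequence[str], embedding_model_name: str = "all-MiniLM-L6-v2") -> Tuple[object, List[int]]:
    """Fit BERTopic on the documents and return the model with per-document topic ids."""
    # Imported lazily so count_topics can be used without bertopic installed.
    from bertopic import BERTopic

    topic_model = BERTopic(embedding_model=embedding_model_name)
    topics, _probabilities = topic_model.fit_transform(list(docs))
    return topic_model, list(topics)


def count_topics(topic_ids: Sequence[int]) -> int:
    """Count distinct topics, ignoring -1, which BERTopic assigns to outlier documents."""
    return len({t for t in topic_ids if t != -1})
```

After fitting, `topic_model.get_topic_info()` returns a table of topics with their sizes and representative terms, which is a convenient starting point for labeling themes by hand.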

Sentiment Analysis
To quantify patient sentiment over time, we applied a RoBERTa-based sentiment classifier, specifically the cardiffnlp/twitter-roberta-base-sentiment-latest model.
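One way to turn per-post labels into a sentiment trend is sketched below, using the Hugging Face `transformers` pipeline with the model named above. The numeric mapping of labels (negative = -1, neutral = 0, positive = 1) and the monthly averaging are assumptions chosen for illustration; the aggregation actually used is not described here.

```python
from collections import defaultdict
from datetime import date
from typing import Dict, Iterable, List, Tuple

MODEL_NAME = "cardiffnlp/twitter-roberta-base-sentiment-latest"

# Assumed numeric mapping, used only to average labels over time.
LABEL_TO_SCORE = {"negative": -1.0, "neutral": 0.0, "positive": 1.0}


def classify_posts(texts: List[str]) -> List[str]:
    """Return one lowercase sentiment label per post."""
    # Imported lazily so the aggregation helper works without transformers installed.
    from transformers import pipeline

    classifier = pipeline("sentiment-analysis", model=MODEL_NAME)
    results = classifier(texts, truncation=True)  # truncate posts longer than the model limit
    return [r["label"].lower() for r in results]


def monthly_mean_sentiment(dated_labels: Iterable[Tuple[date, str]]) -> Dict[str, float]:
    """Average the mapped sentiment scores per calendar month (keys are 'YYYY-MM')."""
    buckets: Dict[str, List[float]] = defaultdict(list)
    for day, label in dated_labels:
        buckets[day.strftime("%Y-%m")].append(LABEL_TO_SCORE[label.lower()])
    return {month: sum(scores) / len(scores) for month, scores in sorted(buckets.items())}
```

Pairing each post's timestamp with the label from `classify_posts` and passing the pairs to `monthly_mean_sentiment` yields a simple time series that ranges from -1 (uniformly negative) to 1 (uniformly positive).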

