User Clustering - Digital Scholarship Projects, CUHK Library

Algorithms Comparison

In this section, we cluster Twitter users to identify distinct user types.

We tested three clustering algorithms: K-means, CLARA, and DBSCAN. We chose K-means because it was the most stable and reproducible method (ARI = 0.946), produced six interpretable and reasonably sized personas, and fit our goal of actionable user segmentation better than DBSCAN’s highly uneven density clusters or CLARA’s weaker stability.

Criterion	K-means	CLARA	DBSCAN	Why it matters
Stability (ARI)	0.946	0.664	0.548	The persona system must be reproducible across resamples
Cluster Structure	6 interpretable clusters	6 clusters, but less distinct	2 clusters + noise; highly uneven	We need usable, explainable segments
Cluster Balance	Reasonably distributed	Moderate	One huge cluster + one tiny niche cluster	Very uneven clusters are hard to operationalize
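
The stability and balance criteria in the table can be illustrated with a short sketch. The project's own code is in R; the Python below is only an illustration, not the project's implementation. It computes the Adjusted Rand Index (ARI) between two cluster labelings of the same users, plus the share of users in each cluster. The function names and the toy labels are hypothetical. How the project resampled the data and averaged the scores is not stated here, so that step is left out.

```python
from collections import Counter
from math import comb


def adjusted_rand_index(labels_a, labels_b):
    """Adjusted Rand Index between two labelings of the same users."""
    if len(labels_a) != len(labels_b):
        raise ValueError("labelings must cover the same users")
    n = len(labels_a)
    if n < 2:
        return 1.0  # no pairs of users to disagree on

    # Count pairs of users that share a cluster in both labelings, in A, and in B.
    together_both = sum(comb(c, 2) for c in Counter(zip(labels_a, labels_b)).values())
    together_a = sum(comb(c, 2) for c in Counter(labels_a).values())
    together_b = sum(comb(c, 2) for c in Counter(labels_b).values())

    expected = together_a * together_b / comb(n, 2)  # agreement expected by chance
    maximum = (together_a + together_b) / 2
    if maximum == expected:
        return 1.0  # degenerate case, e.g. both labelings are one single cluster
    return (together_both - expected) / (maximum - expected)


def cluster_shares(labels):
    """Fraction of users in each cluster, largest cluster first."""
    n = len(labels)
    return {cluster: count / n for cluster, count in Counter(labels).most_common()}


# Made-up labels for eight users from two hypothetical runs (not project data).
run_1 = [1, 1, 1, 2, 2, 3, 3, 3]
run_2 = [2, 2, 2, 1, 1, 3, 3, 1]
print(adjusted_rand_index(run_1, run_2))
print(cluster_shares(run_1))
```

ARI depends only on which users are grouped together, so renumbering the clusters between runs does not change the score. A value near 1 means the two runs recover almost the same grouping, and a value near 0 means agreement no better than chance.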

Clustering Results

We found that users are not a single homogeneous population. Instead, they fall into six stable and interpretable personas with different diffusion mechanisms and scale profiles. These personas play different roles in the diffusion process: some are better suited to immediate amplification (A), some are better bridge-like candidates (B), and some are more strongly associated with deep cascade potential (C). Audience selection should therefore not rely on follower size alone; it should combine network mechanism, exposure pattern, and strategic campaign goal.

Findings

Users are not homogeneous; they form six meaningful personas.
Most users are in broad middle segments, not extreme influencer groups.
A (immediate amplification): K-means Cluster 1
B (bridge / breakout proxy): K-means Cluster 4
C (deep cascade tail): K-means Cluster 1

Persona Analysis Based on Scale Features

This graph shows that the clusters differ not only in diffusion mechanism, but also in user scale and activity.

The graph plots the scale profile of each user cluster, which helps us judge whether the differences across personas are driven by account size and activity as well as by diffusion mechanism. Some clusters are clearly high-scale and highly active, while others are low-scale or middle-layer groups, so the clustering captures meaningful differences in both mechanism and scale.
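
One way to build such a scale profile is to standardise each scale feature over all users and then average it within each cluster. The Python sketch below assumes that approach; the project's actual features and transformations may differ. The feature names `followers` and `tweets`, the function name, and the four toy users are hypothetical.

```python
from collections import defaultdict
from statistics import mean, pstdev


def scale_profile(rows, labels):
    """Mean z-score of each feature within each cluster.

    rows   : list of dicts, one per user, mapping feature name -> number
    labels : cluster label for each user, in the same order as rows
    """
    if len(rows) != len(labels):
        raise ValueError("rows and labels must have the same length")
    if not rows:
        return {}
    features = list(rows[0])

    # Standardise each feature over all users so features are comparable.
    zscores = defaultdict(lambda: defaultdict(list))  # cluster -> feature -> z-scores
    for feature in features:
        values = [row[feature] for row in rows]
        mu, sd = mean(values), pstdev(values)
        for value, label in zip(values, labels):
            zscores[label][feature].append((value - mu) / sd if sd > 0 else 0.0)

    # Average the z-scores inside each cluster.
    return {
        cluster: {feature: mean(z) for feature, z in by_feature.items()}
        for cluster, by_feature in zscores.items()
    }


# Four made-up users in two hypothetical clusters (not project data).
users = [
    {"followers": 100, "tweets": 10},
    {"followers": 200, "tweets": 20},
    {"followers": 5000, "tweets": 300},
    {"followers": 7000, "tweets": 500},
]
profile = scale_profile(users, [1, 1, 2, 2])
for cluster, features in sorted(profile.items()):
    print(cluster, features)
```

A cluster whose mean z-scores are well above zero on every feature is a high-scale, highly active group, while values near or below zero indicate middle-layer or low-scale groups.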

R Code Display

https://github.com/oxbn8/twitter_diffusion/blob/main/clustering_dynamic_network_lite.R