User Clustering

Algorithm Comparison

In this section, we cluster Twitter users to identify distinct user types.

We tested three clustering algorithms: K-means, CLARA, and DBSCAN. We chose K-means because it was the most stable and reproducible method (ARI = 0.946), produced six interpretable and reasonably sized personas, and fit our goal of actionable user segmentation better than DBSCAN’s highly uneven density clusters or CLARA’s weaker stability.

| Criterion | K-Means | CLARA | DBSCAN | Why it matters |
|---|---|---|---|---|
| Stability (ARI) | 0.946 | 0.664 | 0.548 | Persona system must be reproducible across resamples |
| Cluster Structure | 6 interpretable clusters | 6 clusters, but less distinct | 2 clusters + noise; highly uneven | We need usable, explainable segments |
| Cluster Balance | Reasonably distributed | Moderate | One huge cluster + tiny niche cluster | Very uneven clusters are hard to operationalize |
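The stability scores compared above are adjusted Rand index (ARI) values. The exact resampling protocol is not described here, so the following is only a minimal sketch of one common approach: cluster the data several times, keep the labels for a shared set of users, and average the ARI over all pairs of runs. The function names `adjusted_rand_index` and `mean_pairwise_ari` and the toy labelings are illustrative assumptions, not the pipeline actually used. scikit-learn's `adjusted_rand_score` computes the same quantity as the first function.

```python
from collections import Counter
from itertools import combinations
from math import comb


def adjusted_rand_index(labels_a, labels_b):
    """ARI between two labelings of the same users (1.0 = identical partitions)."""
    if len(labels_a) != len(labels_b):
        raise ValueError("both labelings must cover the same users")
    total_pairs = comb(len(labels_a), 2)
    if total_pairs == 0:
        return 1.0  # fewer than two users: partitions agree trivially

    # Count pairs of users placed together in both, in A only, and in B only.
    together_both = sum(comb(c, 2) for c in Counter(zip(labels_a, labels_b)).values())
    together_a = sum(comb(c, 2) for c in Counter(labels_a).values())
    together_b = sum(comb(c, 2) for c in Counter(labels_b).values())

    expected = together_a * together_b / total_pairs  # agreement expected by chance
    maximum = (together_a + together_b) / 2
    if maximum == expected:
        return 1.0  # degenerate case, e.g. both labelings are a single cluster
    return (together_both - expected) / (maximum - expected)


def mean_pairwise_ari(runs):
    """Average ARI over every pair of runs (hypothetical stability summary)."""
    scores = [adjusted_rand_index(a, b) for a, b in combinations(runs, 2)]
    return sum(scores) / len(scores)


# Toy labelings for four users (made-up data, for illustration only).
run_1 = [0, 0, 1, 1]
run_2 = [1, 1, 0, 0]  # same partition, cluster IDs swapped
run_3 = [0, 1, 0, 1]  # a different partition

same_partition = adjusted_rand_index(run_1, run_2)
different_partition = adjusted_rand_index(run_1, run_3)
stability = mean_pairwise_ari([run_1, run_2, run_3])
```

Because ARI ignores how cluster IDs are numbered, `run_1` and `run_2` count as identical partitions. This property is what makes ARI usable for comparing independent clustering runs, where the same group may receive a different ID each time.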

Clustering Results

We found that users are not a single homogeneous population. Instead, they fall into six stable and interpretable personas with different diffusion mechanisms and scale profiles. These personas play different roles in the diffusion process: some are better suited to immediate amplification (A), some are better bridge-like candidates (B), and some are more associated with deep cascade potential (C). This means audience selection should not rely on follower size alone; it should be based on the combination of network mechanism, exposure pattern, and strategic campaign goal.

User Clustering with Persona Tags

 

Findings

  • Users are not homogeneous; they form six meaningful personas.
  • Most users are in broad middle segments, not extreme influencer groups.
  • A (immediate amplification): K-means Cluster 1
  • B (bridge / breakout proxy): K-means Cluster 4
  • C (deep cascade tail): K-means Cluster 1

Persona Analysis based on scale features

This graph shows the scale profile of each user cluster. It helps us check whether the personas also differ in account size and activity, not only in diffusion mechanism. We find that some clusters are clearly high-scale and highly active, while others are low-scale or middle-layer groups, which means user clustering captures meaningful differences in both mechanism and scale.
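A scale profile of this kind can be summarized by taking a robust statistic of each scale feature within each cluster. The sketch below is an illustration under assumptions: the feature names `followers` and `tweets`, the `cluster` key, and the three sample users are all hypothetical, and the median is used only because account-size features tend to be heavily skewed. The actual features and aggregation behind the graph are not specified here.

```python
from collections import defaultdict
from statistics import median


def scale_profile(users, features):
    """Median of each scale feature per cluster, plus the cluster size."""
    grouped = defaultdict(list)
    for user in users:
        grouped[user["cluster"]].append(user)

    profile = {}
    for cluster, members in sorted(grouped.items()):
        # One median per feature, computed over the cluster's members.
        summary = {name: median(m[name] for m in members) for name in features}
        summary["n_users"] = len(members)
        profile[cluster] = summary
    return profile


# Made-up users; field names and values are illustrative assumptions.
users = [
    {"cluster": 0, "followers": 100, "tweets": 10},
    {"cluster": 0, "followers": 300, "tweets": 30},
    {"cluster": 1, "followers": 50000, "tweets": 900},
]

profile = scale_profile(users, ["followers", "tweets"])
for cluster, summary in profile.items():
    print(cluster, summary)
```

Reporting `n_users` next to the medians keeps cluster size visible, so a high-scale cluster with few members is not mistaken for a large segment.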