Author Similarity Based on Author-Topic Model – Digital Scholarship Projects, CUHK Library

Author-Topic Model

The author-topic model uses a topic-based representation to model both the content of documents and the interests of authors. As in the author model, a group of authors, a_d, decides to write document d. For each word in the document, an author is chosen uniformly at random from a_d. Then, as in the topic model, a topic is drawn from that author's distribution over topics, and the word is generated from the chosen topic's distribution over words.
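The generative process described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the code used for this project: the function name `generate_document`, the authors "A" and "B", and the toy distributions are all hypothetical, and in practice the author-topic and topic-word distributions are learned from the corpus rather than written by hand.

```python
import random

def generate_document(authors, author_topic, topic_word, n_words, rng=None):
    """Sample (author, topic, word) triples for one document.

    authors      : list of the document's authors (a_d)
    author_topic : dict mapping author -> list of topic probabilities
    topic_word   : list with one dict (word -> probability) per topic
    """
    rng = rng or random.Random()
    triples = []
    for _ in range(n_words):
        # 1. choose one of the document's authors uniformly at random
        author = rng.choice(authors)
        # 2. choose a topic from that author's distribution over topics
        topic = rng.choices(range(len(topic_word)),
                            weights=author_topic[author])[0]
        # 3. generate the word from the chosen topic's distribution over words
        vocab = list(topic_word[topic])
        weights = [topic_word[topic][w] for w in vocab]
        word = rng.choices(vocab, weights=weights)[0]
        triples.append((author, topic, word))
    return triples

# Toy (hypothetical) parameters: author A writes mostly about topic 0.
author_topic = {"A": [0.9, 0.1], "B": [0.2, 0.8]}
topic_word = [{"world": 0.5, "life": 0.5}, {"darkness": 1.0}]
sample = generate_document(["A", "B"], author_topic, topic_word,
                           n_words=5, rng=random.Random(0))
```

Each sampled word carries the author and topic that produced it. These two hidden assignments are what inference has to recover from the observed words.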

Fig. Author Similarity: bubbles that are closer together represent more similar authors

Click here to see the bubble chart of authors with more than 5 poems in this magazine: 0207-LargerThan5.html

Click here to see the bubble chart of authors with more than 12 poems in this magazine: 0203-LargerThan12.html

Table. Author Similarity Table of 宋子江 (Top 10)

AuthorName	Size	Similarity_cos	Similarity_Hellinger
宋子江	77	100.00%	100.00%
陳永財	33	96.04%	74.30%
黃淑嫻	12	96.04%	74.30%
黃峪	23	96.04%	74.30%
陳偉哲	13	88.40%	70.97%
吳詠彤	30	85.02%	65.11%
午夜歌手	18	80.44%	64.05%
梁璧君	19	76.22%	64.10%
洪慧	12	71.82%	68.21%
藍朗	21	64.72%	65.56%

where Size is the number of poems by each author;

Similarity_cos and Similarity_Hellinger are similarity scores based on cosine similarity and Hellinger distance, two popular measures of author similarity.
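As a minimal sketch of how these two measures can be computed, the following Python compares two authors represented as probability vectors over topics. The function names and the example vectors are hypothetical. The page does not say how the Hellinger distance was converted into the percentage shown in the table, so the `1 - distance` conversion below is an assumption made for illustration only.

```python
import math

def cosine_similarity(p, q):
    """Cosine of the angle between two topic vectors (1.0 = same direction)."""
    dot = sum(a * b for a, b in zip(p, q))
    norm_p = math.sqrt(sum(a * a for a in p))
    norm_q = math.sqrt(sum(b * b for b in q))
    return dot / (norm_p * norm_q)

def hellinger_distance(p, q):
    """Hellinger distance between two probability distributions, in [0, 1]."""
    squared = sum((math.sqrt(a) - math.sqrt(b)) ** 2 for a, b in zip(p, q))
    return math.sqrt(squared / 2)

def hellinger_similarity(p, q):
    """Assumed conversion to a similarity score: 1 - Hellinger distance."""
    return 1.0 - hellinger_distance(p, q)

# Hypothetical topic distributions for two authors over three topics.
author_a = [0.7, 0.2, 0.1]
author_b = [0.6, 0.3, 0.1]
cos_ab = cosine_similarity(author_a, author_b)
hel_ab = hellinger_similarity(author_a, author_b)
```

Under either measure an author compared with themselves scores 100%, which matches the first row of the table. The two scores can still rank other authors differently, because cosine similarity depends only on the angle between the vectors while Hellinger distance compares the square roots of the probabilities.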

(Click a specific author name to see that author's topics.)

Mutual terms: “世界” (world), “生命” (life), “黑暗” (darkness)……

Fig. Comparison-1

Fig. Comparison-2

Fig. Author Similarity: network diagram

Code

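import datetime
import pandas as pd
import pyLDAvis.gensim  # newer pyLDAvis releases rename this module to pyLDAvis.gensim_models

# Keep only the first occurrence of each repeated word within a document.
# Assumes `df` is already loaded and its 'full_text' column holds comma-separated words.
def drop_repeated_words(text):
    # dict.fromkeys discards duplicates while preserving first-seen order
    return ','.join(dict.fromkeys(text.split(',')))

df['full_text'] = df['full_text'].apply(drop_repeated_words)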

[1] Rosen-Zvi, Michal, et al. “The author-topic model for authors and documents.” arXiv preprint arXiv:1207.4169 (2012).

[2] Rosen-Zvi, Michal, et al. “Learning author-topic models from text corpora.” ACM Transactions on Information Systems (TOIS) 28.1 (2010): 1-38.

[3] Sievert, Carson, and Kenneth Shirley. “LDAvis: A method for visualizing and interpreting topics.” Proceedings of the workshop on interactive language learning, visualization, and interfaces. 2014.

[4] Chuang, Jason, Christopher D. Manning, and Jeffrey Heer. “Termite: Visualization techniques for assessing textual topic models.” Proceedings of the international working conference on advanced visual interfaces. 2012.