Author Similarity Based on Author-Topic Model

Author-Topic Model

The author-topic model [1, 2] uses a topic-based representation to model both the content of documents and the interests of authors. As in the author model, a group of authors, a_d, decides to write document d. For each word in the document, an author is chosen uniformly at random from a_d. Then, as in the topic model, a topic is chosen from that author's distribution over topics, and the word is generated from the chosen topic's distribution over words.
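The generative process described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the code used for this project: the function name generate_document, the matrices theta (author-topic distributions) and phi (topic-word distributions), and the toy sizes are all hypothetical, and the distributions are drawn at random rather than learned from poems.

```python
import numpy as np


def generate_document(authors_d, theta, phi, n_words, rng):
    """Sample (author, topic, word) triples for one document.

    authors_d: indices of the authors of the document (a_d)
    theta:     (n_authors, n_topics) array, one topic distribution per author
    phi:       (n_topics, vocab_size) array, one word distribution per topic
    """
    triples = []
    for _ in range(n_words):
        # An author is chosen uniformly at random from a_d.
        author = rng.choice(authors_d)
        # A topic is chosen from that author's distribution over topics.
        topic = rng.choice(theta.shape[1], p=theta[author])
        # The word is generated from the chosen topic.
        word = rng.choice(phi.shape[1], p=phi[topic])
        triples.append((int(author), int(topic), int(word)))
    return triples


# Toy setup (hypothetical sizes): 3 authors, 2 topics, 5 vocabulary words.
rng = np.random.default_rng(0)
n_authors, n_topics, vocab_size = 3, 2, 5
theta = rng.dirichlet(np.ones(n_topics), size=n_authors)
phi = rng.dirichlet(np.ones(vocab_size), size=n_topics)

# A document of 8 words written jointly by authors 0 and 2.
doc = generate_document([0, 2], theta, phi, n_words=8, rng=rng)
```

Fitting the model runs this process in reverse: given only the words and the author lists, it estimates theta and phi. The rows of theta are the per-author topic distributions that the similarity scores compare.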

Fig. Author Similarity: bubbles that are closer together represent more similar authors

Bubble chart of authors with more than 5 poems in this magazine: 0207-LargerThan5.html

Bubble chart of authors with more than 12 poems in this magazine: 0203-LargerThan12.html

Table. Author Similarity Table of 宋子江 (Top 10)

AuthorName    Size    Similarity_cos    Similarity_Hellinger
宋子江        77      100.00%           100.00%
陳永財        33      96.04%            74.30%
黃淑嫻        12      96.04%            74.30%
黃峪          23      96.04%            74.30%
陳偉哲        13      88.40%            70.97%
吳詠彤        30      85.02%            65.11%
午夜歌手      18      80.44%            64.05%
梁璧君        19      76.22%            64.10%
洪慧          12      71.82%            68.21%
藍朗          21      64.72%            65.56%

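where Size is the number of poems by each author.

Cosine similarity and Hellinger distance are two popular measures for comparing authors' topic distributions. The table reports both as similarity percentages, so a higher value means the two authors are more alike.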
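As a minimal sketch of how these two measures can be computed from a pair of author-topic distributions, consider the code below. The function names and the toy vectors are hypothetical. Turning the Hellinger distance into a similarity as 1 minus the distance is an assumption made for illustration, not necessarily the conversion used to produce the table.

```python
import numpy as np


def cosine_similarity(p, q):
    """Cosine of the angle between two topic-distribution vectors."""
    p = np.asarray(p, dtype=float)
    q = np.asarray(q, dtype=float)
    return float(p @ q / (np.linalg.norm(p) * np.linalg.norm(q)))


def hellinger_distance(p, q):
    """Hellinger distance between two discrete distributions (0 to 1)."""
    p = np.asarray(p, dtype=float)
    q = np.asarray(q, dtype=float)
    return float(np.sqrt(0.5 * np.sum((np.sqrt(p) - np.sqrt(q)) ** 2)))


def hellinger_similarity(p, q):
    """Assumed conversion: similarity = 1 - Hellinger distance."""
    return 1.0 - hellinger_distance(p, q)


# Toy topic distributions for two hypothetical authors (each sums to 1).
author_a = [0.7, 0.2, 0.1]
author_b = [0.6, 0.3, 0.1]

sim_cos = cosine_similarity(author_a, author_b)
sim_hel = hellinger_similarity(author_a, author_b)
```

Under this conversion both scores equal 1 (100%) when an author is compared with themselves, which matches the first row of the table.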

Click an author's name to see that author's topics.

Shared terms: “世界” (world), “生命” (life), “黑暗” (darkness), …

Fig. Comparison-1

Fig. Comparison-2

Fig. Author Similarity: network diagram

Code

import datetime
import pandas as pd
import pyLDAvis.gensim
# Keep only one occurrence of each repeated word within a document.
# Assumes df is a DataFrame whose 'full_text' column holds comma-separated tokens.
def drop_repeated_words(text):
    # dict.fromkeys keeps the first occurrence of each token, in original order
    return ','.join(dict.fromkeys(text.split(',')))

df['full_text'] = df['full_text'].apply(drop_repeated_words)

[1] Rosen-Zvi, Michal, et al. “The author-topic model for authors and documents.” arXiv preprint arXiv:1207.4169 (2012).

[2] Rosen-Zvi, Michal, et al. “Learning author-topic models from text corpora.” ACM Transactions on Information Systems (TOIS) 28.1 (2010): 1-38.

[3] Sievert, Carson, and Kenneth Shirley. “LDAvis: A method for visualizing and interpreting topics.” Proceedings of the workshop on interactive language learning, visualization, and interfaces. 2014.

[4] Chuang, Jason, Christopher D. Manning, and Jeffrey Heer. “Termite: Visualization techniques for assessing textual topic models.” Proceedings of the international working conference on advanced visual interfaces. 2012.