Author-Topic Model
The author-topic model uses a topic-based representation to model both the content of documents and the interests of authors. As in the author model, a group of authors, a_d, decides to write document d. For each word in the document, an author is chosen uniformly at random from a_d. Then, as in the topic model, a topic is chosen from a distribution over topics specific to that author, and the word is generated from the chosen topic.
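The generative process described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the code used to produce the results on this page: the function name `generate_document`, the toy authors "A" and "B", and the toy distributions are all hypothetical.

```python
import random


def generate_document(authors, author_topic, topic_word, n_words, rng=None):
    """Sample one document under the author-topic generative process.

    authors      -- list of co-authors of the document (the set a_d)
    author_topic -- dict mapping author -> list of topic probabilities
    topic_word   -- list with one dict per topic, mapping word -> probability
    Returns a list of (author, topic, word) triples, one per word.
    """
    rng = rng or random.Random()
    triples = []
    for _ in range(n_words):
        # 1. Choose an author uniformly at random among the co-authors.
        author = rng.choice(authors)
        # 2. Choose a topic from that author's distribution over topics.
        topic = rng.choices(range(len(topic_word)), weights=author_topic[author])[0]
        # 3. Generate the word from the chosen topic's distribution over words.
        vocab = list(topic_word[topic])
        weights = [topic_word[topic][w] for w in vocab]
        word = rng.choices(vocab, weights=weights)[0]
        triples.append((author, topic, word))
    return triples


# Hypothetical toy parameters: two authors, two topics.
toy_author_topic = {"A": [0.9, 0.1], "B": [0.2, 0.8]}
toy_topic_word = [{"world": 0.5, "life": 0.5}, {"darkness": 0.7, "life": 0.3}]
sample = generate_document(["A", "B"], toy_author_topic, toy_topic_word, 10,
                           random.Random(0))
```

In a real model the two sets of distributions are not given but estimated from the corpus; the sketch only shows the direction in which words are assumed to be generated.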
Fig. Author Similarity: bubbles that are closer together represent more similar authors
Bubble chart of authors with more than 5 poems in this magazine: 0207-LargerThan5.html
Bubble chart of authors with more than 12 poems in this magazine: 0203-LargerThan12.html
Table. Author Similarity Table of 宋子江 (Top 10)
AuthorName | Size | Similarity_cos | Similarity_Hellinger |
宋子江 | 77 | 100.00% | 100.00% |
陳永財 | 33 | 96.04% | 74.30% |
黃淑嫻 | 12 | 96.04% | 74.30% |
黃峪 | 23 | 96.04% | 74.30% |
陳偉哲 | 13 | 88.40% | 70.97% |
吳詠彤 | 30 | 85.02% | 65.11% |
午夜歌手 | 18 | 80.44% | 64.05% |
梁璧君 | 19 | 76.22% | 64.10% |
洪慧 | 12 | 71.82% | 68.21% |
藍朗 | 21 | 64.72% | 65.56% |
where Size is the number of poems by each author.
Cosine similarity and Hellinger distance are two popular measures of author similarity.
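To make the two measures concrete, here is a minimal sketch of how they can be computed for two authors' topic distributions. The function names and the example vectors are hypothetical. Reporting the Hellinger column as 1 minus the distance is an assumption, made only because the table lists an author's similarity to himself as 100%; the page does not state the exact formula.

```python
import math


def cosine_similarity(p, q):
    """Cosine of the angle between two topic-probability vectors."""
    dot = sum(a * b for a, b in zip(p, q))
    norm_p = math.sqrt(sum(a * a for a in p))
    norm_q = math.sqrt(sum(b * b for b in q))
    return dot / (norm_p * norm_q)


def hellinger_distance(p, q):
    """Hellinger distance between two discrete distributions (0 = identical)."""
    squared = sum((math.sqrt(a) - math.sqrt(b)) ** 2 for a, b in zip(p, q))
    return math.sqrt(0.5 * squared)


def hellinger_similarity(p, q):
    """ASSUMPTION: similarity taken as 1 - distance, so identical authors score 1."""
    return 1.0 - hellinger_distance(p, q)


# Hypothetical topic distributions for two authors over three topics.
author_x = [0.7, 0.2, 0.1]
author_y = [0.6, 0.3, 0.1]
cos_xy = cosine_similarity(author_x, author_y)
hel_xy = hellinger_similarity(author_x, author_y)
```

Cosine similarity compares the direction of the two vectors, while Hellinger distance is defined specifically for probability distributions, which may explain why the two columns in the table rank some authors differently.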
(Click an author's name to see that author's topics.)
Shared terms: “世界” (world), “生命” (life), “黑暗” (darkness)...
Fig. Comparison-1
Fig. Comparison-2
Fig. Author Similarity: network diagram
Code
import datetime
import pandas as pd
import pyLDAvis.gensim
# Keep only the first occurrence of each repeated word within an article.
# Assumes df is an already-loaded DataFrame whose 'full_text' column holds
# comma-separated words; dict.fromkeys drops duplicates and preserves order.
df['full_text'] = df['full_text'].apply(
    lambda text: ','.join(dict.fromkeys(text.split(',')))
)