Methodology
Because of the limitations of the web page format, we only present the key concepts of the methodology here; see our Google Colab for the full details.
OCR
Because traditional Chinese newspapers are printed in a vertical layout, we tried many OCR engines (especially free ones) before finding a satisfactory method. After comparing the recognition results of the different engines, we settled on Google Vision OCR, run during the free trial period of Google Cloud. The link to Google Vision OCR is here, and the detailed code is in the Colab. Note that beyond the free trial, Google Vision OCR requires payment.
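As a rough sketch (not our full pipeline; the file name is a placeholder and a service-account credential is assumed to be configured), a single scanned page can be sent to Google Vision's document-text endpoint like this:

from google.cloud import vision

# Assumes GOOGLE_APPLICATION_CREDENTIALS points to a service-account key.
client = vision.ImageAnnotatorClient()

# "page_1950_01.jpg" is a placeholder for one scanned newspaper page.
with open("page_1950_01.jpg", "rb") as f:
    image = vision.Image(content=f.read())

# document_text_detection handles dense text; the language hint helps
# with traditional Chinese, including vertically laid-out columns.
response = client.document_text_detection(
    image=image,
    image_context={"language_hints": ["zh-Hant"]},
)
print(response.full_text_annotation.text)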
Word Segmentation
Unlike NLP in English, text analysis in Chinese requires an extra step: segmenting sentences into words. Here we mainly considered two Python packages: Jieba and CkipTagger. Both packages can also tag each word with its part of speech, such as proper noun, verb, or adjective. According to the developers' reports, CkipTagger has higher segmentation accuracy, so we first applied CkipTagger as our segmentation model. In practice, however, we found that Jieba has some advantages in recognizing proper nouns, so we ended up applying both.
An additional practical note: word segmentation is the most time-consuming part of the project, so we stored the segmentation results in a txt file and reloaded them later instead of recomputing them.
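A minimal sketch of this caching idea (the cache file name and the segment_all() helper are placeholders standing in for the segmentation code below):

import os
import json

CACHE_PATH = './segmented_1950.txt'   # placeholder cache file name

def load_or_segment(sentence_list, segment_all):
    # Reuse the cached result if it exists; otherwise segment and save.
    if os.path.exists(CACHE_PATH):
        with open(CACHE_PATH, encoding='utf-8') as f:
            return json.load(f)
    result = segment_all(sentence_list)   # e.g. the Jieba or CkipTagger code below
    with open(CACHE_PATH, 'w', encoding='utf-8') as f:
        json.dump(result, f, ensure_ascii=False)
    return result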
Jieba:
import jieba.posseg as pseg

def splice(all_sen):
    result = [[] for x in range(len(all_sen))]
    # ./stop_words.txt contains the words we want the program to delete from the result.
    # The stop_words list is designed by us; you may design your own if you need to.
    stop_words = open('./stop_words.txt', encoding='utf-8', errors='ignore').readlines()
    stop_words = [stop_word.rstrip() for stop_word in stop_words]
    i = 0
    while i < len(all_sen):
        result[i].append(all_sen[i])
        words = pseg.cut(all_sen[i])          # segment and POS-tag the sentence
        kk = []
        for word, flag in words:
            if word in stop_words and len(word) == 1:
                continue                      # drop single-character stop words
            kk.append(word + ':' + flag)      # keep "word:POS" pairs
        result[i].append(kk)
        i += 1
    return result
CkipTagger:
import re
from ckiptagger import data_utils, construct_dictionary, WS, POS, NER

data_utils.download_data_gdown("./")  # gdrive-ckip
# To use a GPU:
# 1. Install tensorflow-gpu (see the installation instructions)
# 2. Set the CUDA_VISIBLE_DEVICES environment variable, e.g. os.environ["CUDA_VISIBLE_DEVICES"] = "0"
# 3. Set disable_cuda=False, e.g. ws = WS("./data", disable_cuda=False)
# To use the CPU:
ws = WS("./data")
pos = POS("./data")
ner = NER("./data")
# 1950.txt is the traditional Chinese text we need to read in;
# errors="ignore" skips any characters that cannot be decoded.
f = open('1950.txt', encoding='utf-8', errors="ignore")
sentences = ''
for line in f.readlines():
    line = re.sub('\n', '', line)           # strip the newline from each line
    line = re.sub('[a-zA-Z0-9]', '', line)  # remove digits and Latin letters
    sentences += line
sentence_list = re.split(r'[,,。.]', sentences)  # split into a list of sentences
f.close()
word_sentence_list = ws(
    sentence_list,
    # sentence_segmentation = True, # To consider delimiters
    # segment_delimiter_set = {",", "。", ":", "?", "!", ";"}, # This is the default set of delimiters
    # recommend_dictionary = dictionary1, # words in this dictionary are encouraged
    # coerce_dictionary = dictionary2,    # words in this dictionary are forced
)
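The WS output can then be fed to the POS and NER models from the same package. The continuation below is a sketch (the output file name is a placeholder) of how the POS tags are attached and the result is stored in a txt file, as mentioned above:

# POS tags and named entities are computed from the word segmentation result.
pos_sentence_list = pos(word_sentence_list)
entity_sentence_list = ner(word_sentence_list, pos_sentence_list)

# Cache the "word:POS" pairs in a txt file so segmentation is not repeated.
with open('./ckip_1950_segmented.txt', 'w', encoding='utf-8') as out:
    for words, tags in zip(word_sentence_list, pos_sentence_list):
        out.write(' '.join(w + ':' + t for w, t in zip(words, tags)) + '\n')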
Word Cloud & Frequency
This part builds on the word segmentation. After segmentation, we counted the frequency of each word and used the Python wordcloud package to generate the corresponding word cloud image, together with a histogram of word frequencies sorted in descending order. In the initial word frequency statistics, we found too many single-character prepositions (such as “為”), which made the largest words in the word cloud meaningless. So we removed the words with a length of one and recomputed the statistics.
In addition, since the segmentation step also tags each word with its part of speech (verbs, nouns, etc.), we separately counted the proper nouns after segmentation and generated the corresponding word clouds and word-frequency plots for them.
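A minimal sketch of this step, assuming word_sentence_list and pos_sentence_list from the CkipTagger code above; the font path is a placeholder (a CJK-capable font is required to render Chinese glyphs):

from collections import Counter
from wordcloud import WordCloud
import matplotlib.pyplot as plt

# Count word frequencies, dropping single-character words as described above.
freq = Counter(w for words in word_sentence_list for w in words if len(w) > 1)

# A second counter restricted to proper nouns (CkipTagger tags them as "Nb").
proper_freq = Counter(
    w
    for words, tags in zip(word_sentence_list, pos_sentence_list)
    for w, t in zip(words, tags)
    if t == 'Nb' and len(w) > 1
)

# Generate the word cloud; font_path must point to a CJK-capable font file.
wc = WordCloud(font_path='./NotoSansTC-Regular.otf', width=800, height=600,
               background_color='white').generate_from_frequencies(freq)
plt.imshow(wc)
plt.axis('off')
plt.show()

# Bar chart of the 20 most frequent words, in descending order.
top = freq.most_common(20)
plt.bar(range(len(top)), [c for _, c in top])
plt.xticks(range(len(top)), [w for w, _ in top], rotation=90, fontsize=8)
plt.show()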
Relationship Analysis
In this part, we want to find the relationships between important people in the newspaper. We first counted and ranked the most frequent names, and then counted how often pairs of these names appear in the same sentence as a measure of the strength of their relationship. Using this measure, we visualized a graph of the 40 most important people for each year.
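A sketch of this co-occurrence counting is given below; networkx is one possible way to draw the resulting graph, and the person_names set and top_n threshold are placeholders rather than our exact settings:

from collections import Counter
from itertools import combinations
import networkx as nx
import matplotlib.pyplot as plt

def cooccurrence_graph(word_sentence_list, person_names, top_n=40):
    # Count how often each pair of names appears in the same sentence.
    pair_counts = Counter()
    for words in word_sentence_list:
        present = sorted(set(w for w in words if w in person_names))
        for a, b in combinations(present, 2):
            pair_counts[(a, b)] += 1

    # Keep only edges among the top_n most frequent names.
    name_counts = Counter(w for words in word_sentence_list
                          for w in words if w in person_names)
    top_names = {n for n, _ in name_counts.most_common(top_n)}

    G = nx.Graph()
    for (a, b), c in pair_counts.items():
        if a in top_names and b in top_names:
            G.add_edge(a, b, weight=c)
    return G

# person_names would be the person names extracted from the NER step, e.g.:
# G = cooccurrence_graph(word_sentence_list, person_names)
# nx.draw(G, with_labels=True, node_size=50)
# plt.show()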
Topic Modeling
Based on the word segmentation produced with CkipTagger, we used BERTopic, a topic modeling technique that leverages transformers and c-TF-IDF to create dense clusters allowing for easily interpretable topics. Although the package ships with pre-trained transformer models supporting several languages, the topic modeling result on our text with the bundled model was not satisfactory. After searching for a traditional Chinese model, we replaced the original model with pre-trained traditional Chinese transformer models found on Hugging Face. The result is much better and more meaningful than with the original model provided by the package, and it can also be shown as a bar chart, intertopic distance map, and heat map for easier comprehension.
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

# Import the traditional Chinese model from Hugging Face.
tokenizer = AutoTokenizer.from_pretrained(
    "elliotthwang/t5-small-finetuned-xlsum-chinese-tradition")
model = AutoModelForSeq2SeqLM.from_pretrained(
    "elliotthwang/t5-small-finetuned-xlsum-chinese-tradition")

from bertopic import BERTopic
from sklearn.datasets import fetch_20newsgroups

# Example corpus from the BERTopic documentation; in our project, docs is
# the list of segmented newspaper sentences instead.
docs = fetch_20newsgroups(subset='all', remove=('headers', 'footers', 'quotes'))['data']

# Create a new topic_model object and fit it on the documents.
topic_model = BERTopic()
topics, probs = topic_model.fit_transform(docs)
To make our visualization results more interactive, we added hyperlinks to the words in the plots, so that clicking a word redirects the browser to the page of the newspaper collection containing that word. Since the BERTopic package uses Plotly to generate its plots, we modified the original visualization code in the BERTopic package: we added tags containing the URL links to the newspaper collection provided by the library. As a result, the words in the new visualizations become clickable and readers can find the original newspaper pages they are interested in.
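The built-in visualizations can be produced and exported to HTML as sketched below; the clickable-word behaviour described above comes from our patched copy of the BERTopic plotting code, not from these standard calls:

# Standard BERTopic visualizations (Plotly figures).
fig_bar = topic_model.visualize_barchart()   # bar chart of top words per topic
fig_map = topic_model.visualize_topics()     # intertopic distance map
fig_heat = topic_model.visualize_heatmap()   # topic similarity heat map

# Export to standalone HTML; this is the version in which our added
# <a href="..."> tags (pointing to the library's newspaper collection)
# render as clickable links.
fig_bar.write_html("topics_barchart.html")
fig_map.write_html("topics_distance_map.html")
fig_heat.write_html("topics_heatmap.html")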
Sentiment Analysis
We want to trace how the attitude of The Observatory Review toward different objects (proper nouns) changed over its publication history. However, the smallest unit of sentiment analysis is a sentence. Our solution is to analyse the attitude of each sentence and assign that value to every proper noun it contains. For example, if the analysis result of “中國美國攜手進步” (“China and the US progress hand in hand”) is “1”, meaning very positive, then “中國” and “美國” each receive a value of “1”. After all sentences of one year were analysed, we summarized the results to see how The Observatory Review regarded each object in that year. Combining all years, we can visualize how the sentiment toward one proper noun changed during 1950-1985.
Required Packages:
from zhconv import convert
from bixin import predict
import numba as nb
import re
from joblib import Parallel, delayed
import multiprocessing
import numpy as np
import matplotlib.pyplot as plt
Get the sentiment value of each sentence:
@nb.jit(forceobj=True)  # object mode: the body calls Python libraries (re, zhconv, bixin)
def run(all_sentence):
    result = ['' for x in range(len(all_sentence))]
    i = 0
    while i < len(all_sentence):
        result[i] = all_sentence[i]
        result[i] = re.sub('\n', '', result[i])
        # Each input line holds two fields separated by ' || '; the first is the
        # sentence text, which we convert to simplified Chinese for the sentiment model.
        tmp = convert(result[i].split(' || ')[0], 'zh-cn')
        # predict() can be swapped for a different sentiment analysis module.
        result[i] = result[i].split(' || ')[1] + ' || ' + str(predict(tmp))
        i += 1
    return result
Then, assign this value to the words each sentence contains, according to the word segmentation result, and summarize the values according to a chosen statistical standard. Here is our standard as an example:
Object | Word_Frequency | Total_Positive_Value | Appear_as_Positive | Total_Negative_Value | Appear_as_Negative
美國   | 667            | 259.66               | 367                | -187.28              | 300
台灣   | 472            | 165.06               | 247                | -141.23              | 225
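A minimal sketch of this aggregation, assuming sentiment_lines is the output of run() above (one "words || score" string per sentence) and that the first field lists the sentence's proper nouns separated by spaces; the exact field layout is an assumption of this sketch:

from collections import defaultdict

def summarize(sentiment_lines):
    # stats[noun] = [frequency, total_pos, appear_pos, total_neg, appear_neg]
    stats = defaultdict(lambda: [0, 0.0, 0, 0.0, 0])
    for line in sentiment_lines:
        nouns, score = line.split(' || ')
        score = float(score)
        for noun in nouns.split():        # assumed space-separated proper nouns
            entry = stats[noun]
            entry[0] += 1
            if score > 0:
                entry[1] += score
                entry[2] += 1
            elif score < 0:
                entry[3] += score
                entry[4] += 1
    return stats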
Finally, with these data, we can visualize the result as a stacked bar chart:
# neg/pos are the per-year negative/positive totals for one object, years is the
# list of year labels, and colorN / getPos(pos_sc) set the bar colours
# (the full preparation of these variables is in the Colab).
plt.bar(np.arange(26), neg, alpha=0.5, label='Negative', color=colorN)
plt.bar(np.arange(26), pos, alpha=0.5, bottom=neg, label='Positive', color=getPos(pos_sc))
plt.legend(loc='upper center')
plt.xticks(np.arange(26), years, fontsize=8)
tickcolor = []  # tick colouring continues in the Colab