Word Clouds in One Year’s Newspaper
A word cloud is a visual representation of text data, usually a colorful graphic built from the words that appear in the text. Its core value is to convey the information hidden behind a large body of text through the visual emphasis of high-frequency keywords.
Creating Word Clouds
Since the data we want to analyze covers the 《天文臺》 newspaper from 1950 to 1985, analyzing the entire corpus at once would consume a great deal of memory and time. Beyond that, an analysis aggregated over all the years would be of little interest. We therefore decided to divide the corpus into yearly units and to perform word cloud and word frequency analysis on the newspaper data year by year.
Before generating the word cloud, we first used ckiptagger to segment one year of data (a txt file) into words. Next, we filtered the tokens against an installed stop word file to remove meaningless words. Finally we generated the word cloud shown in Fig.1.
Fig.1. Initial unprocessed word cloud.
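The stop-word filtering step can be sketched as follows. This is a minimal sketch, not the project's actual code: the file name `stopwords.txt` and the sample tokens are illustrative assumptions.

```python
# Minimal sketch of stop-word filtering, assuming ckiptagger has already
# produced a flat token list. "stopwords.txt" is a hypothetical file name
# with one stop word per line.
def load_stopwords(path):
    with open(path, encoding="utf-8") as f:
        return {line.strip() for line in f if line.strip()}

def remove_stopwords(tokens, stopwords):
    return [t for t in tokens if t not in stopwords]

# Example with an in-memory stop-word set instead of a file:
stopwords = {"的", "了", "在"}
tokens = ["中國", "的", "政治", "在", "變化"]
print(remove_stopwords(tokens, stopwords))  # ['中國', '政治', '變化']
```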
As Fig.1 shows, most of the largest words in this first word cloud are single characters. These are meaningless on their own in Chinese, so we decided to filter the data further and exclude single-character words from the weighting. Fig.2 shows the result.
Fig.2. Word cloud after filtering out words of length one
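Dropping single-character tokens amounts to a one-line length filter. A minimal sketch, with an illustrative token list:

```python
# Keep only tokens longer than one character; single-character tokens
# produced by Chinese word segmentation are usually uninformative here.
tokens = ["美", "美國", "日", "日本", "政治"]
filtered = [t for t in tokens if len(t) > 1]
print(filtered)  # ['美國', '日本', '政治']
```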
After this screening, the picture becomes much clearer. Fig.2 shows that political events in a few major countries received the most attention. The diagram contains about 300 words drawn from the newspaper. The words in the largest fonts are all country names such as “中國”, “美國”, “日本” and “台灣”. By comparison, words such as “共黨”, “國軍” and “自由” carry relatively little weight. This suggests that during this year the newspaper paid more attention to international politics and relations, and relatively little to political struggles within China.
We then wondered whether we could isolate the proper nouns in the text. To do so, we filtered the words with the pos() function, which assigns each word a part-of-speech tag. Fig.3 shows the result.
Fig.3. Word cloud containing only proper nouns
Fig.3 clearly shows that Churchill, Stalin and other international politicians appear frequently, which further indicates that the newspaper followed changes in international relations closely during this year.
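The proper-noun filter can be sketched with parallel word and tag lists, as returned by ckiptagger's WS and POS (the tag "Nb" marks proper nouns in the CKIP tagset). The sample words and tags below are illustrative, not taken from the corpus:

```python
# Keep only words tagged "Nb" (proper noun) in the CKIP tagset;
# the parallel lists stand in for ckiptagger's ws()/pos() output.
words = ["邱吉爾", "訪問", "美國", "史達林", "發表", "談話"]
tags  = ["Nb",    "VC",  "Nc",  "Nb",    "VE",  "Na"]

proper_nouns = [w for w, t in zip(words, tags) if t == "Nb"]
print(proper_nouns)  # ['邱吉爾', '史達林']
```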
Code
The function yield_cloud(input_file, output_file, size) shows how we generate the word cloud diagram from a txt file. input_file is the name of the text file (.txt) to analyze, and output_file is the name of the generated word cloud image (.png). Users can customize how many words the cloud contains through the size parameter.
ws() is a function from ckiptagger that segments a list of sentences into lists of words.
import re
from itertools import chain

from ckiptagger import WS, POS
from wordcloud import WordCloud

# Load the ckiptagger models once at module level; "./data" is the
# directory containing the downloaded ckiptagger model files.
ws = WS("./data")
pos = POS("./data")

def yield_cloud(input_file, output_file, size):
    try:
        # input_file is the Traditional Chinese text we need to read;
        # skip undecodable bytes with errors="ignore".
        f = open(input_file, encoding='utf-8', errors="ignore")
    except OSError:
        return
    sentences = ''
    for line in f.readlines():
        line = re.sub('\n', '', line)           # strip newlines from each line
        line = re.sub('[a-zA-Z0-9]', '', line)  # drop digits and Latin letters
        sentences += line
    sentence_list = re.split(r'[,,。.]', sentences)  # obtain the list of sentences
    f.close()
    word_sentence_list = ws(
        sentence_list,
        # sentence_segmentation = True,  # To consider delimiters
        # segment_delimiter_set = {",", "。", ":", "?", "!", ";"},  # This is the default set of delimiters
        # recommend_dictionary = dictionary1,  # words in this dictionary are encouraged
        # coerce_dictionary = dictionary2,     # words in this dictionary are forced
    )
    pos_word_sentence_list = pos(word_sentence_list)
    word_list = list(chain.from_iterable(word_sentence_list))
    pos_word_list = list(chain.from_iterable(pos_word_sentence_list))
    # Keep only proper nouns ("Nb") longer than one character,
    # joined with spaces for WordCloud.
    word_string = ''
    for j in range(len(word_list)):
        if pos_word_list[j] == 'Nb' and len(word_list[j]) != 1:
            word_string += word_list[j] + ' '
    wc = WordCloud(
        height=300,
        width=500,
        background_color='white',  # background color
        max_words=size,            # maximum number of words shown
        mask=None,                 # background image mask
        max_font_size=None,        # maximum displayed font size
        font_path='./KAIU.TTF',    # a Chinese font (.TTF) is required for Chinese text
        random_state=None,         # random seed for word colors
        prefer_horizontal=0.9)     # ratio of horizontal to vertical words
    wc.generate(word_string)
    wc.to_file(output_file)