Word Frequency Analysis – Digital Scholarship Projects, CUHK Library

Word Frequency Presented in a Bar Chart

The word frequency diagram serves much the same purpose as the word cloud, but we use the bar chart to obtain more precise word frequency statistics.

Bar Chart Analysis

We counted word frequencies in The Observatory Review for each year from 1950 to 1985. Only one year is shown here as an example; the word frequency images for all years are available on our Google Drive.

Fig.1 shows the word frequency diagram for one year. The top 10 most common words displayed by Word Cloud & Frequency show that “美國” (United States), “中共” (Chinese Communist Party), “蘇聯” (Soviet Union), “台灣” (Taiwan), and “日本” (Japan) appear often in the newspaper, while “香港” (Hong Kong) ranks lower. This suggests that the newspaper was mainly concerned with international relations during the Cold War, especially the political situation in East Asia, and gave comparatively little attention to local Hong Kong news.

Fig.1. Word frequency diagram presented in a Bar Chart

Code

The function yield_freq(input_file, output_file, size) shows how we generated the word frequency diagram from an input txt file.

The matplotlib package is used to draw the bar chart.

ws() is the word segmentation function from ckiptagger; it splits a list of sentences into lists of words, one list per sentence.
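As a rough illustration of the data shapes involved, the sketch below flattens a segmenter result into a single word list. The nested list is a made-up stand-in for what ws() returns (one list of tokens per sentence); it is not real output from the newspaper corpus, and ckiptagger itself is not called here.

```python
from itertools import chain

# Hypothetical segmenter output: one list of tokens per input sentence.
# These tokens are invented for illustration only.
word_sentence_list = [
    ["美國", "總統", "訪問", "日本"],
    ["蘇聯", "與", "中共", "會談"],
]

# Flatten the per-sentence lists into one token list, as yield_freq does.
word_list = list(chain.from_iterable(word_sentence_list))
print(word_list)
```

Flattening discards sentence boundaries, which is acceptable here because only the overall count of each word matters.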

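import re
from itertools import chain

import matplotlib
import matplotlib.pyplot as plt
import pandas as pd
from ckiptagger import WS

# Assumption: the ckiptagger model files have been downloaded to ./data
ws = WS("./data")


def yield_freq(input_file, output_file, size):
    try:
        # input_file is the Traditional Chinese text to read;
        # errors="ignore" skips any bytes that cannot be decoded
        f = open(input_file, encoding='utf-8', errors="ignore")
    except OSError:
        # Nothing to plot if the file cannot be opened
        return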

    sentences = ''
    for line in f.readlines():
        line = re.sub('\n', '', line)  # strip the newline from each line
        line = re.sub('[a-zA-Z0-9]', '', line)  # drop ASCII letters and digits
        sentences += line

    sentence_list = re.split(r'[，,。.]', sentences)  # split into a list of sentences
    f.close()

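    # Segment each sentence into words with ckiptagger
    word_sentence_list = ws(sentence_list)

    # Flatten the per-sentence word lists into one list
    word_list = list(chain.from_iterable(word_sentence_list))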

    # Remove stop words
    def remove_stop_words(file_name, seg_list):
        with open(file_name, encoding='utf-8') as f:
            # A set makes each membership test fast
            stop_words = {line.rstrip() for line in f}
        # Build a new list rather than calling remove() inside the loop,
        # which would change the list's length while iterating over it
        return [seg for seg in seg_list if seg not in stop_words]

    # stop_words.txt must be present in the working directory (one stop word per line)
    file_name = './stop_words.txt'
    seg_list = remove_stop_words(file_name, word_list)


    # Drop single-character tokens
    word_list = []
    for j in seg_list:
        if len(j) == 1:
            continue
        word_list.append(j)

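    # Keep the `size` most frequent words
    freq = pd.Series(word_list).value_counts().head(size)

    # Use a font that can render Chinese labels; YouYuan must be installed
    matplotlib.rc("font", family='YouYuan')
    freq.plot(kind='bar')
    plt.rcParams['savefig.dpi'] = 600  # resolution (DPI) of the saved image
    plt.rcParams['figure.dpi'] = 1000  # default figure DPI; affects figures created after this line
    plt.xticks(size=4, rotation=40)  # shrink and rotate the x-axis labels
    plt.savefig(output_file)
    plt.show()  # Fig. 1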
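
The counting stage can be checked on its own, without ckiptagger or a plot. The sketch below is a minimal stand-in, not the code used for the project: top_words is a hypothetical helper, it uses collections.Counter in place of pandas, and the tokens are made up. It applies the same two filters as yield_freq (stop words and single-character tokens) before counting.

```python
from collections import Counter


def top_words(tokens, stop_words, size):
    """Return the `size` most frequent tokens as (word, count) pairs."""
    kept = [
        tok for tok in tokens
        if tok not in stop_words and len(tok) > 1  # same filters as yield_freq
    ]
    return Counter(kept).most_common(size)


# Made-up tokens for illustration, not taken from the newspaper corpus.
tokens = ["美國", "中共", "美國", "的", "香港", "一個", "美國", "中共"]
print(top_words(tokens, stop_words={"一個"}, size=2))
```

Running a small case like this makes it easy to confirm that the stop word list and the single-character filter behave as intended before generating charts for every year.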