Methodology

Post-Optical Character Recognition (OCR) Process[1]

1. Getting the OCR Results

First, we use the Google, Tencent, and Ali OCR engines to process the raw images separately. This step is straightforward, since we only need to call the OCR APIs provided by the platforms. Initially, we wanted to treat the whole newspaper as one long string and align the three outputs. However, we realized that different OCR engines segment the images differently. In the later steps we therefore did not focus on the order of the sentences but concentrated on improving the quality of the OCR result for each sentence: we perform the post-OCR correction sentence by sentence and then combine the sentences.
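
For reference, here is a minimal sketch of this step. It assumes, as the full script later in this section does, that each engine wrapper returns a dictionary mapping a bounding box (x_min, y_min, x_max, y_max) to the text recognized inside it; the wrapper names come from our own tencent_ocr, google_ocr, and ali_ocr modules.

import tencent_ocr   # project wrappers around the platform OCR APIs
import google_ocr
import ali_ocr

def run_all_engines(img_path):
    # Each *_detect_text call is assumed to return
    # {(x_min, y_min, x_max, y_max): recognized_text} for one page image.
    tencent_result = tencent_ocr.tencent_detect_text(img_path)
    google_result = google_ocr.google_detect_text(img_path)
    ali_result = ali_ocr.ali_detect_text(img_path)
    return tencent_result, google_result, ali_result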

2. Quality Improvement

After the OCR step, we concentrated on enhancing the quality of each recognized sentence. Since the Tencent OCR engine segments the page into larger chunks, we used its output as our basis. We then used the bounding-box coordinates to find the Google OCR and Ali OCR results that fall inside the corresponding Tencent bounding box. At this point, the OCR results are aligned at the sentence level.
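
A sketch of this alignment step is given below. It keeps the same assumption about (x_min, y_min, x_max, y_max) bounding-box keys; the 30-pixel tolerance applied to the Ali results mirrors the one used in the full script further down.

def inside(inner, outer, tol=0):
    # True if the box `inner` lies inside `outer`, allowing a pixel tolerance.
    return (inner[0] >= outer[0] - tol and inner[1] >= outer[1] - tol and
            inner[2] <= outer[2] + tol and inner[3] <= outer[3] + tol)

def align_to_tencent(tencent_result, google_result, ali_result):
    # For each Tencent chunk, collect the Google and Ali fragments whose
    # bounding boxes fall inside the chunk's bounding box.
    aligned = {}
    for box, text in tencent_result.items():
        google_part = [t for b, t in google_result.items() if inside(b, box)]
        ali_part = [t for b, t in ali_result.items() if inside(b, box, tol=30)]
        aligned[box] = (text, ' '.join(google_part), ' '.join(ali_part))
    return aligned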

3. Voting Policy

The voting policy is the final step; it selects the word that appears in the corrected result at each position of a sentence. The method is straightforward: with three OCR engines, if a word appears in at least two of the results, we consider it correct. If the three results all give different words for the same position, the word from the Tencent OCR result is kept.
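
A minimal word-level sketch of this policy is shown below. It assumes the three sentences have already been aligned and split into word lists of equal length, which simplifies the character-level comparison performed in the full script:

def vote(tencent_words, google_words, ali_words):
    # For each position, keep the word supported by at least two engines;
    # fall back to the Tencent word when all three disagree.
    corrected = []
    for t, g, a in zip(tencent_words, google_words, ali_words):
        if t == g or t == a:
            corrected.append(t)   # Tencent agrees with at least one engine
        elif g == a:
            corrected.append(g)   # Google and Ali outvote Tencent
        else:
            corrected.append(t)   # no majority: keep the Tencent word
    return ' '.join(corrected)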

[1] William B. Lund and Eric K. Ringger. 2009. Improving optical character recognition through efficient multiple system alignment. In Proceedings of the 9th ACM/IEEE-CS joint conference on Digital libraries (JCDL ’09). Association for Computing Machinery, New York, NY, USA, 231–240. https://doi.org/10.1145/1555400.1555437.

Code

The following Python 3.7 code implements the post-OCR process described above:

import tencent_ocr
import google_ocr
import ali_ocr
import os

image_list = os.listdir('../processed')
post_list = os.listdir('../results')
num = 1
for image in image_list:
    print( str(num) + '/2306')
    num += 1
    image_temp = image[:-4] + '.txt'
    if (image_temp not in post_list):
        img_path = '../processed/' + image

        tencent_result = tencent_ocr.tencent_detect_text(img_path)
        google_result = google_ocr.google_detect_text(img_path)
        ali_result = ali_ocr.ali_detect_text(img_path)

        # Because the Tencent OCR engine segments the page into larger chunks,
        # we use tencent_result as the basis. For each chunk in tencent_result,
        # we fetch all items in google_result and ali_result whose bounding
        # boxes fall inside that chunk, then vote word by word to produce the
        # best result for the chunk.

        tencent_key_list = [*tencent_result]
        tencent_key_list.sort()
        google_key_list = [*google_result]
        google_key_list.sort()
        ali_key_list = [*ali_result]
        ali_key_list.sort()
        corrected = []

        for key in tencent_key_list:
            tencent_temp = tencent_result[key]
            google_temp = []
            ali_temp = []
            dict_temp = {}

            for google_key in google_key_list:
                if google_key[0] >= key[0] and google_key[1] >= key[1] and google_key[2] <= key[2] and google_key[3] <= key[3]:
                    google_temp.append(google_result[google_key])

            for ali_key in ali_key_list:
                if ali_key[0] >= key[0] - 30 and ali_key[1] >= key[1] - 30 and ali_key[2] <= key[2] + 30 and ali_key[3] <= key[3] + 30:
                    ali_temp.append(ali_result[ali_key])

            tencent_temp_list = tencent_temp.split(' ')
            # split the Google and Ali fragments into words for the overlap search
            google_temp = ' '.join(google_temp).split(' ')
            ali_temp = ' '.join(ali_temp).split(' ')
            google_start = 0
            google_end = 0
            ali_start = 0
            ali_end = 0

            for item in tencent_temp_list:
                if item in google_temp:
                    google_start = tencent_temp.index(item)
                    break

            for item in tencent_temp_list:
                if item in ali_temp:
                    ali_start = tencent_temp.index(item)
                    break

            flag = 0
            for item in tencent_temp_list:
                if flag:
                    if item not in google_temp:
                        google_end = tencent_temp.index(item) - 1
                        break
                else:
                    if item in google_temp:
                        flag = 1

            flag = 0
            for item in tencent_temp_list:
                if flag:
                    if item not in ali_temp:
                        ali_end = tencent_temp.index(item) - 1
                        break
                else:
                    if item in ali_temp:
                        flag = 1

            google_temp = ' '.join(google_temp)
            ali_temp = ' '.join(ali_temp)
            # Character-level voting inside the overlapping region:
            # keep the Tencent character when it matches Google or Ali;
            # if Google and Ali agree on a different character, use theirs.
            for i in range(len(tencent_temp)):
                if i < google_start or i < ali_start:
                    continue
                elif i >= google_end or i >= ali_end:
                    continue
                elif i >= google_start + len(google_temp) or i >= ali_start + len(ali_temp):
                    continue
                else:
                    if tencent_temp[i] == google_temp[i - google_start] or \
                            tencent_temp[i] == ali_temp[i - ali_start]:
                        continue
                    elif google_temp[i - google_start] == ali_temp[i - ali_start]:
                        tencent_temp = tencent_temp[:i] + \
                            google_temp[i - google_start] + tencent_temp[i+1:]

            corrected.append(tencent_temp)
        result = ' '.join(corrected)
        filename = '../results/' + image[:-4] + '.txt'
        with open(filename, 'w') as file:
            file.write(result)

Natural Language Processing (NLP) Process

Our NLP pipeline consists of three steps: text preprocessing, bag-of-words model development, and result generation.

1. Text Preprocessing

First, in the Manipulate() function, we remove special symbols from the texts using the re package (everything that is not a letter is replaced with a space). Because ambiguous or misrecognized words may remain, we use the textblob package to correct them, and we store the results in result.csv and name.csv (which stores the dates of the newspaper issues).

To enhance accuracy, we test each sentence's quality with the LanguageTool package (the quality_test(result) function), assuming that a sentence with nearly correct grammar is more useful for our study than noise.
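
These two steps can be sketched as follows; this is a minimal version that operates on a single piece of text, and the three-match threshold mirrors the quality_test() function in the full script below.

import re

import language_tool_python
from textblob import TextBlob

def preprocess(text):
    # Keep only alphabetic characters, then let TextBlob fix likely OCR misspellings.
    text = re.sub(r'[^a-zA-Z]+', ' ', text)
    return str(TextBlob(text).correct())

def is_good_quality(sentence, max_matches=3):
    # Keep a sentence only if LanguageTool reports fewer than `max_matches` issues.
    tool = language_tool_python.LanguageTool('en-US')
    matches = tool.check(sentence)
    tool.close()
    return len(matches) < max_matches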

2. Developing the Bag-of-Words Model[2]

In the main() function, using result.csv and name.csv, we apply CountVectorizer (from sklearn.feature_extraction.text) to compute the word counts that form the bag-of-words model.

Bag of words is a natural language processing technique for text modelling; in technical terms, it is a method of feature extraction from text data. It is a simple and flexible way of extracting features from documents. A bag of words is a representation of text that describes the occurrence of words within a document: it keeps track of word counts and disregards grammatical detail and word order. It is called a “bag” of words because any information about the order or structure of words in the document is discarded; the model is only concerned with whether a word occurs in a document, not where it appears.
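
For illustration, here is a toy example of building a bag-of-words matrix with the same CountVectorizer class used in main(); the two sample sentences are made up purely for demonstration.

from sklearn.feature_extraction.text import CountVectorizer

docs = ["the market reopened after the storm",
        "the storm closed the market for a week"]

count = CountVectorizer()
word_count = count.fit_transform(docs)   # sparse document-term matrix
print(count.get_feature_names_out())     # the vocabulary (older scikit-learn uses get_feature_names())
print(word_count.toarray())              # per-document word counts; order and grammar are ignored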

3. Generating Results

After the above processing, we used TfidfTransformer (from sklearn.feature_extraction.text) to calculate the TF-IDF score of each word in order to identify the keywords.

TF-IDF (term frequency-inverse document frequency) is a statistical measure that evaluates how relevant a word is to a document in a collection of documents.

TF-IDF was invented for document search and information retrieval. A word's score increases proportionally with the number of times it appears in a document, but is offset by the number of documents that contain the word. So words that are common in every document, such as ‘this’, ‘what’, and ‘if’, rank low even though they may appear many times, because they say little about any particular document. Conversely, if the word ‘bug’ appears many times in one document but rarely in the others, it is probably highly relevant to that document.
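
Continuing the toy example above, the TF-IDF weights can be obtained from the word counts with TfidfTransformer, using the same smooth_idf setting as in main():

from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer

docs = ["the market reopened after the storm",
        "the storm closed the market for a week"]

word_count = CountVectorizer().fit_transform(docs)
tfidf = TfidfTransformer(smooth_idf=True, use_idf=True)
tf_idf_matrix = tfidf.fit_transform(word_count)

# Words frequent in one document but rare in the other get the highest weights;
# words that appear in every document are down-weighted.
print(tf_idf_matrix.toarray())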

Then we use the clear(df) function to remove stop words and other uninformative tokens from the word list. Finally, we choose the ten words with the largest TF-IDF scores as the keywords of that issue of the newspaper.

[2] A Gentle Introduction to the Bag-of-Words Model. Machine Learning Mastery. https://machinelearningmastery.com/gentle-introduction-bag-words-model/

Code

The following Python 3.7 code performs the text preprocessing and the TF-IDF keyword extraction described above.

# -*- coding: utf-8 -*-
import re
import os
import language_tool_python
import matplotlib.pyplot as plt
import pandas as pd
import numpy as np
from numpy import nan as NaN
from textblob import TextBlob
from gensim.test.utils import datapath
from gensim.models.word2vec import Text8Corpus
from gensim.models.phrases import Phrases, ENGLISH_CONNECTOR_WORDS
from gensim.test.utils import common_texts
from gensim.models.doc2vec import Doc2Vec, TaggedDocument
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfTransformer
import textblob



def quality_test(result):
    # Grammar check: a result is considered good quality if LanguageTool
    # reports fewer than 3 matches (potential issues); otherwise it is
    # treated as noise.
    tool = language_tool_python.LanguageTool('en-US')
    matches = tool.check(str(result))
    tool.close()
    return len(matches) < 3
    

def Manipulate():
    path = './resultsInDate'
    results = []
    name = []
    dir_list = os.listdir(path)
    for dir in dir_list:
        name.append(dir[:27])
        with open(path + '/' + dir, 'r', errors="ignore") as file:
            line=(file.read())
            line=re.sub(r'[^a-zA-Z]+', " ", str(line))
            b = TextBlob(line)
            line = str(b.correct())
        results.append(line)
    df = pd.DataFrame(results)
    df.to_csv( 'result.csv', index=False)
    daname = pd.DataFrame(name)
    daname.to_csv('name.csv', index=False)

def clear(df):

    a=['me', 'my', 'myself', 'we', 'our', 'ours', 'ourselves',
'you', "you're", "you've", "you'll", "you'd", 'your', 'yours', 
'yourself', 'yourselves', 'he', 'him', 'his', 'himself', 'she', 
"she's", 'her', 'hers', 'herself', 'it', "it's", 'its', 'itself', 
'they', 'them', 'their', 'theirs', 'themselves', 'what', 'which', 
'who', 'whom', 'this', 'that', "that'll", 'these', 'those', 'am', 
'is', 'are', 'was', 'were', 'be', 'been', 'being', 'have', 'has', 
'had', 'having', 'do', 'does', 'did', 'doing', 'a', 'an', 'the', 
'and', 'but', 'if', 'or', 'because', 'as', 'until', 'while', 'of',
'at', 'by', 'for', 'with', 'about', 'against', 'between', 'into', 
'through', 'during', 'before', 'after', 'above', 'below', 'to', 
'from', 'up', 'down', 'in', 'out', 'on', 'off', 'over', 'under',
'again', 'further', 'then', 'once', 'here', 'there', 'when', 'where',
'why', 'how', 'all', 'any', 'both', 'each', 'few', 'more', 'most', 
'other', 'some', 'such', 'no', 'nor', 'not', 'only', 'own', 'same', 
'so', 'than', 'too', 'very', 's', 't', 'can', 'will', 'just', 'don', 
"don't", 'should', "should've", 'now', 'd', 'll', 'm', 'o', 're', 've',
'y', 'ain', 'aren', "aren't", 'couldn', "couldn't", 'didn', "didn't", 
'doesn', "doesn't", 'hadn', "hadn't", 'hasn', "hasn't", 'haven', "haven't",
'isn', "isn't", 'ma', 'mightn', "mightn't", 'mustn', "mustn't", 'needn','mo','ka'
"needn't", 'shan', "shan't", 'shouldn', "shouldn't", 'wasn', "wasn't", 
'weren', "weren't", 'won', "won't", 'wouldn', "wouldn't",'one','jan','feb',
'mar','april','may','june','july','aug','sep','oct','nov','dec','ta','th','ar','te','ltd'
'os','qui','el']

    for word in a:
        if word in df.index:
            df.drop(word, axis=0, inplace=True)



def main():
    # generate the result
    Manipulate()
    # initialize the trained model=====================================
    csv_data = pd.read_csv('./result.csv',encoding='utf-8',engine="python")
    name = pd.read_csv('./name.csv',encoding='utf-8',engine="python")  # index starts from 0
    # visualize(csv_data)---------optional
    #using the count vectorizer
    li=[]
    keyword = pd.DataFrame(index=name.iloc[:,0],columns=[1,2,3,4,5,6,7,8,9,10])
    for a in range (0,csv_data.shape[0],1):
        li.append(csv_data.iloc[a,0])
    count = CountVectorizer(encoding='utf-8')
    word_count=count.fit_transform(li)
    feature_names = count.get_feature_names_out()
    tfidf_transformer=TfidfTransformer(smooth_idf=True,use_idf=True)
    tfidf_transformer.fit(word_count)
    tf_idf_vector=tfidf_transformer.transform(word_count)
    for a in range(0,csv_data.shape[0],1):
        list=[]
        first_document_vector=tf_idf_vector[a]
        df_tfifd= pd.DataFrame(first_document_vector.T.todense(), index=feature_names,columns=["tfidf"])
        df_tfifd=df_tfifd.sort_values(by=["tfidf"],ascending=False)
        clear(df_tfifd)
        for b in range(0,10,1):
            keyword.iloc[a,b]=df_tfifd.index[b]
        #we can see the word with largest tf-idf is the most important word.
    keyword.to_excel('keyword.xlsx')

if __name__  == "__main__":
    main()

Visualization

We developed an interactive timeline together with a bar chart of the frequencies of the top 50 keywords within the selected period.

Code

The following code in Python 3.7 is for the data visualization.

# Run this app with `python app.py` and
# visit http://127.0.0.1:8050/ in your web browser.

from dash import Dash
from dash import dcc
from dash import html
from dash import Input
from dash import Output
import plotly.express as px
import pandas as pd
import json
from sklearn.feature_extraction.text import CountVectorizer

external_stylesheets = ['https://codepen.io/chriddyp/pen/bWLwgP.css']
app = Dash(__name__, external_stylesheets=external_stylesheets)

# assume you have a "long-form" data frame
# see https://plotly.com/python/px-arguments/ for more options

pd_reader=pd.read_csv('keyword.csv')
corpus = pd.read_csv('name.csv')
corpus = corpus.values.tolist()
names=[]

for a in range(0,len(corpus),1):
    names.append(corpus[a][0])
names.sort()

li=''
frequency = {}
for a in range (0,pd_reader.shape[0],1):
    for b in range (1,pd_reader.shape[1],1):
        if pd_reader.iloc[a,b] in frequency.keys():
            frequency[pd_reader.iloc[a,b]].append(pd_reader.iloc[a,0])
        else:
            frequency[pd_reader.iloc[a,b]] = [pd_reader.iloc[a,0]]
        li=li+" "+pd_reader.iloc[a,b]
li=[li]
count = CountVectorizer(encoding='utf-8')
word_count=count.fit_transform(li)
features = count.get_feature_names_out()
list=[]

for a in range(0,len(features),1):
    list.append(word_count[0,a])

df = pd.DataFrame({
    "Features":features,
    "Frequency": list
})

df.sort_values(by="Frequency",inplace=True,ascending=False)
df=df[:50]
fig = px.bar(df, x="Features", y="Frequency")#barmode="group"
#fig = px.bar(df, x="Fruit", y="Amount", color="City", barmode="group")

app.layout = html.Div(children=[
    html.H1(children='Keywords in the HongKongNews', style={'color':'blue', 'marginLeft':450, 'marginTop':100}),
    dcc.Graph(
        id='example-graph',
        figure=fig
    ),

    dcc.RangeSlider(0, 530, 1, value=[0, 530], marks=None, allowCross = False, id='my-range-slider'),
    html.Div(id='output-container-range-slider', style={'color':'red', 'font-size':28, 'marginLeft': 550}),
    html.Div(id='click-data')
])

@app.callback(
    Output('output-container-range-slider', 'children'),
    [Input('my-range-slider', 'value')])

def update_output(value):
    start = names[value[0]][-8:]
    end = names[value[1]][-8:]
    return 'Start: ' + start + ' End: ' + end

@app.callback(
    Output('example-graph', 'figure'),
    [Input('my-range-slider', 'value')])

def update_figure(value):
    selected = names[value[0]:value[1] + 1]
    li=''
    global frequency
    frequency = {}
    for a in range (0,pd_reader.shape[0],1):
        if pd_reader.iloc[a,0] in selected:
            for b in range (1,pd_reader.shape[1],1):
                if pd_reader.iloc[a,b] in frequency.keys():
                    frequency[pd_reader.iloc[a,b]].append(pd_reader.iloc[a,0])
                else:
                    frequency[pd_reader.iloc[a,b]] = [pd_reader.iloc[a,0]]
                li=li+" "+pd_reader.iloc[a,b]

    li=[li]
    count = CountVectorizer(encoding='utf-8')
    word_count=count.fit_transform(li)
    features = count.get_feature_names_out()
    list=[]
    for a in range(0,len(features),1):
        list.append(word_count[0,a])
    df = pd.DataFrame({
        "Features":features,
        "Frequency": list
    })

    df.sort_values(by="Frequency",inplace=True,ascending=False)
    df=df[:50]
    fig = px.bar(df, x="Features", y="Frequency")
    fig.update_layout(transition_duration=50)
    return fig

@app.callback(
    Output('click-data', 'children'),
    Input('example-graph', 'clickData'))

def display_click_data(clickData):
    def create_table(li):
        whole_li = []
        test = 0
        while(True):
            temp = []
            for i in range(4):
                try:
                    temp.append(li.pop())
                except:
                    temp.append("")
                    test = 1

            whole_li.append(temp)
            if test == 1:
                break
        table = html.Table(
            [
                html.Tr(
                [
                    html.Td([
                        html.A(text,href = f"https://repository.lib.cuhk.edu.hk/en/islandora/search/mods_titleInfo_title_ms:(%22{text}_001%22)?cp=cuhk:hk-tabloid")
                    ]) for text in li
                ]
                ) for li in whole_li
                            ]
            )
        return table

    if clickData == None:
        return None
    else:
        table = create_table(frequency[clickData['points'][0]['x']])
        return table
if __name__ == '__main__':
    app.run_server(debug=True)