Methodology

To see the process from extracting the text files to the natural language processing (NLP) results, please follow the link to the GitHub static display of the data processing and NLP demo.
Please click the button to launch the Colab notebook for an interactive code display of the data processing and NLP.

CKIP Transformers as method

In this project, we used CKIP Transformers, a transformer library newly developed by the CKIP Lab (the Chinese Knowledge and Information Processing research team at Academia Sinica). This Python package provides transformer models for Traditional Chinese text (ALBERT, BERT, GPT-2) together with natural language processing (NLP) tools: word segmentation, part-of-speech (POS) tagging, and named-entity recognition (NER). CKIP Transformers is also integrated with the popular Hugging Face NLP library and ships with pretrained models, which are trained in part on news articles from the Central News Agency (CNA) and on Academia Sinica's modern Chinese corpus. We therefore judged CKIP Transformers a good fit for our project as a handy, ready-to-use black-box tool.
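As an illustration, loading and running the three NLP drivers follows the pattern shown in the CKIP Transformers documentation; the model choice and the sample sentence below are our own placeholders:

    # Minimal sketch following the CKIP Transformers documentation (v0.3.1).
    from ckip_transformers.nlp import CkipWordSegmenter, CkipPosTagger, CkipNerChunker

    # Each driver downloads its pretrained model from the Hugging Face hub on first use.
    ws_driver = CkipWordSegmenter(model="bert-base")
    pos_driver = CkipPosTagger(model="bert-base")
    ner_driver = CkipNerChunker(model="bert-base")

    text = ["中共代表團抵達日內瓦。"]   # placeholder sentence
    ws = ws_driver(text)               # word segmentation
    pos = pos_driver(ws)               # POS tags for the segmented words
    ner = ner_driver(text)             # one list of NerToken(word, ner, idx) per input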

Data set

All the OCR-extracted text from the aforementioned 3,250 issues, comprising 13,030 pages of valid records of the published Observatory Review, is used in this pilot project. Here, "valid record" means that the outliers have already been excluded from the project; see the explanation below.

Data pre-processing 

Excluding the outliers

A few of the newspaper samples are excluded from this study as outliers. While most published issues of the Observatory Review are semiweekly newspapers of 4 pages (occasionally with 2 extra pages as a special issue), the outliers are book-length special memorial issues of more than a hundred pages. To prevent these densely packed special issues from diluting the overall result, we excluded their files from the samples. On top of that, two pages of the Observatory Review special issue of 1955-10-10 are also excluded because the text content is missing from the record. This leaves, in total, 3,250 issues comprising 13,030 pages of valid records.
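In code, the exclusion step could look like the sketch below; the file name and column names are our own assumptions, not the project's actual script:

    # Hypothetical sketch of the outlier exclusion; column names are assumptions.
    import pandas as pd

    pages = pd.read_csv("ocr_pages.csv")    # one row per OCR page: date, page_no, text

    # Regular issues have 4 pages, special issues up to 6; memorial books have 100+.
    pages_per_issue = pages.groupby("date")["page_no"].transform("count")
    keep = pages_per_issue <= 6

    # Drop the two 1955-10-10 special-issue pages whose text content is missing.
    keep &= ~((pages["date"] == "1955-10-10") & pages["text"].isna())

    valid = pages[keep]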

Limitations of data 

First, we organized all the OCR text records into a dataframe by running a Python script that consists of the standard built-in file-opening and line-reading functions. There are, however, limitations to our data. For instance, punctuation appears in both halfwidth and fullwidth variants, Simplified and Traditional Chinese characters are mixed within words, and special characters appear between words and phrases. This makes data preprocessing and cleaning immensely tricky even with the regular expression library. We also briefly explored open-source tools and API services such as Tesseract OCR, but were unable to obtain OCR results good enough to revise the existing data.
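As one example of the cleaning involved, fullwidth ASCII variants can be folded into halfwidth form with Unicode NFKC normalization. This is a sketch of the idea, not the exact script we used; note that NFKC also rewrites other compatibility characters, so it has to be applied with care:

    import re
    import unicodedata

    def normalize_line(line: str) -> str:
        # Fold fullwidth ASCII variants to halfwidth, e.g. "Ａ" -> "A", "１" -> "1",
        # "，" -> ",". Ideographic punctuation such as "。" is left unchanged.
        line = unicodedata.normalize("NFKC", line)
        # Strip control characters and BOM debris wedged between words and phrases.
        return re.sub(r"[\u0000-\u001f\ufeff]", "", line)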

Natural language processing (NLP) from part-of-speech tagging (POS) to named-entity recognition (NER)

After some trial and error, we abandoned part-of-speech tagging and turned to named-entity recognition. This required a bit of editing, for example trimming off the excess newline characters ('\n') that had been arbitrarily inserted into the text during OCR. With that done, we were eventually able to run named-entity recognition with the CKIP Transformers tool. With the help of the Google Colab Pro+ service, which provides higher computing performance and a cloud IDE for background processing, we were able to run the natural language processing tool continuously through day and night. The model returns a list of segmented words with NER type tags, stored in a special NerToken class holding the word, the NER type, and the starting and ending indices in the sentence, e.g. (毛澤東, PERSON, (201, 204)), (國泰航空公司, ORG, (2898, 2904)), (以色列, GPE, (3836, 3839)). The result is then reorganized into a nested dictionary for later use:
result = {yearMonthDay1: {word1: {nerType1: frequency, nerType2:frequency, …}, word2: {…}}, yearMonthDay2: {…}}
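The sketch below shows how such a structure can be assembled from the NER output; the names issue_dates and issue_texts are hypothetical parallel lists, and the NerToken fields follow the CKIP documentation:

    issue_dates = ["1955-01-01"]                 # hypothetical parallel lists
    issue_texts = ["毛澤東在北京發表談話。"]
    ner_output = ner_driver(issue_texts)         # ner_driver as in the sketch above

    # result = {date: {word: {ner_type: frequency}}}
    result = {}
    for date, tokens in zip(issue_dates, ner_output):
        day = result.setdefault(date, {})
        for tok in tokens:                       # tok is a NerToken(word, ner, idx)
            counts = day.setdefault(tok.word, {})
            counts[tok.ner] = counts.get(tok.ner, 0) + 1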

Data post-processing

Manual Data Cleaning

Unlike for many other languages, data processing and cleaning tools for Traditional Chinese characters are still limited. Considering also the limitations of our data set, we decided to perform the post-processing data cleaning manually. Manual examination preserves the integrity and clarity of the result with precision. Although it is a time-consuming process, it leaves us with well-structured data for visualization and, hopefully, further study.

What did we actually do?

The post-processing data cleaning consists of two parts.

First, the inclusion list. Starting with the words with the highest frequency of occurrence, we manually examined the output and picked the valid entities to keep. A word is valid if it is a recognizable entity (a person, organization, event, etc.) that existed, acted, or happened at the time of the newspaper's publication. With the help of the CUHK Library's modern history resources, especially the Hong Kong Studies collection, which includes works (我怎樣結交羅邱杜; 香港天文臺報創刊卅四週年陳孝威社長從事國民外交卅年紀念特刊) by the founder and editor of The Observatory Review (陳孝威), we were able to recognize the historical names and events in context, as well as some of the newspaper's specific word usages. Eventually, we filtered down to a list containing around 1,000 entities of the types PERSON, ORG (organization), and EVENT, and around 100 entities of the type GPE.

About the named-entity types (more details at CKIP Entity Types):

  • PERSON: People, including fictional
  • ORG: Companies, agencies, institutions, etc.
  • EVENT: Named hurricanes, battles, wars, sports events, etc.
  • GPE: Countries, cities, states

Some issues encountered in the validity check:

  • Specific use of acronyms: 羅邱杜 and 杜邱羅 refer to the first characters of 羅斯福, 邱吉爾, and 杜爾斯
  • Aliases that refer to the same entity:
    ORG: 俄共、蘇俄、俄寇、蘇聯、蘇共
    EVENT: 溪山保衛戰、溪山之戰、溪山大戰
    PERSON: 梁啟超 (name), 卓如 and 任甫 (courtesy names), 任公 (art name, hao)

Second, the replacement pairs of entities. Some entities, though still recognized by the named-entity recognition, may contain characters distorted in the OCR record. We therefore further examined the result to infer the intended words: words containing characters that structurally resemble the correct ones but are incomprehensible under a normal reading. Again, starting with the highest-occurrence words, we checked and replaced the problematic characters by assessing their structural resemblance and by referring to the original scanned copies of the newspaper (see the sketch after the list below).

Some cases encountered when examining characters of resemblance:

  • Characters of resemblance:
    東京:柬京
    文化大革命:文化犬革命、文化火革命
    毛澤東:毛泽柬、毛澤柬、毛瀑東、毛澤策、毛澤束、毛滓東
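A minimal sketch of how such replacement pairs can be folded back into the frequency dictionary (the mapping uses the examples above; the merging logic is our own assumption):

    # OCR-distorted variants mapped to their canonical form (examples from above).
    replacements = {
        "柬京": "東京",
        "文化犬革命": "文化大革命", "文化火革命": "文化大革命",
        "毛泽柬": "毛澤東", "毛澤柬": "毛澤東", "毛瀑東": "毛澤東",
        "毛澤策": "毛澤東", "毛澤束": "毛澤東", "毛滓東": "毛澤東",
    }

    def merge_variants(day_counts: dict) -> dict:
        """Fold distorted spellings into the canonical entity, summing frequencies."""
        merged = {}
        for word, ner_counts in day_counts.items():
            canon = replacements.get(word, word)
            target = merged.setdefault(canon, {})
            for ner_type, freq in ner_counts.items():
                target[ner_type] = target.get(ner_type, 0) + freq
        return merged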

As a result, we built inclusion lists of valid words and filter dictionaries that screen out invalid or insignificant entities. Finally, the result can be visualized and displayed.
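To illustrate how the filters come together, the sketch below keeps only entities on the inclusion list, per NER type; the short sets stand in for the roughly 1,000 curated entities:

    # Stand-ins for the curated inclusion lists (~1,000 entities in the real filters).
    include = {
        "PERSON": {"毛澤東", "陳孝威"},
        "ORG": {"國泰航空公司"},
        "GPE": {"以色列"},
    }

    def apply_inclusion(result: dict) -> dict:
        """Keep only (word, ner_type) pairs that appear on the inclusion lists."""
        filtered = {}
        for date, words in result.items():
            day = {}
            for word, counts in words.items():
                kept = {t: f for t, f in counts.items() if word in include.get(t, set())}
                if kept:
                    day[word] = kept
            filtered[date] = day
        return filtered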

To have a look at the manually built filters, please follow the link to the static display of the NER word filters demo.
Please click the button to launch the Colab notebook for an interactive code display of the NER word filters.

Remarks 

We have shown that it is possible to extract meaningful information from existing OCR records, to display pieces of history through visualization, and to improve the accessibility of the content through library search. It is important to note, though, that there is a risk of arbitrary interpretations or conclusions being drawn from the existing suboptimal data. It might therefore be a good idea to revise the existing text with another OCR engine or service before any further study of the data. As for data cleaning, approaching a specialist might allow more thorough and efficient pre- and post-processing. Finally, for visualization, other tools such as Tableau could also be a good choice for interactive display.

Acknowledgement

CKIP Transformers — CKIP Transformers v0.3.1 documentation (ckip-transformers.readthedocs.io)
CKIP Lab 中文詞知識庫小組 (sinica.edu.tw)

References

陳孝威, et al. 我怎樣結交羅邱杜. 天文台報社, 1967.
陳孝威. 香港天文臺報創刊卅四週年陳孝威社長從事國民外交卅年紀念特刊. 香港天文臺報社, 1972.