Character Modification

Character Modification

Digitizing the Chinese Student Weekly is never an easy job. Thanks to the effort by the Library, the full texts of the Weekly are available before the commencement of the project, which significantly reduces our effort in extracting the content from the Weekly. However, since most existing optical character recognition (OCR) software does not perform as well as in English, especially in a historical Chinese newspaper like The Weekly, errors and inconsistencies are common in the full text. The errors in the full text may hinder the performance of the named entity recognition (NER) in the next step. Hence, it is vital for our team to perform a preprocessing step before performing NER.

By our observation, there are four common errors made by the OCR software. This includes (1) character-symbol misrecognition (e.g. misrecognized Chinese character ten (十) as symbol plus (+)), (2) symbols misinterpretation (e.g. misinterpreted full stop (。) as dots (.), leading to incorrect segmentation), (3) high-stroke-density character misrecognition (e.g. Chinese character 攤), and (4) incorrect radical (部首) assignment.

To spot and correct the inconsistencies in the text, we utilize the ChatGPT-3.5 API, developed by the company OpenAI, with the following prompt:

Objective: Correct the incorrect Chinese characters or words extracted by OCR in the following Hong Kong articles into Traditional Chinese based on the preceding text and Chinese word meanings.

Article: {input}
Output:

(Translated, Original prompt:
“目標: 根據前文後理和中文詞義,把下列香港文章中經OCR提取的錯誤漢字或詞語改正為繁體中文
文章:{input}
輸出:”)

It is worth mentioning that this method does not correct characters based on the scanned copies of the Weekly. Instead, the LLM performs the modification based on its understanding of natural language1, which can spot and correct inconsistencies in sentences. This method, however, does not provide guarantees that the corrected characters are equivalent to those on the Weekly. Even though this method cannot correct the characters perfectly, the result is still surprising with the average text-modified rate being around 12%2. An example of the result is shown below as the comparison before and after performing the text modification process.

Before ModificationAfter Modification
校際音樂比賽本月廿七日又開始了,是項比賽今年進入第十三屆o十三年來,由於教育當局的鼓勵,音樂界及社會人士的支持,及教師與學生的努力,使這每年一度的學校音樂節,和學校音樂的水平有了長足的進展o
校際音樂比賽本月廿七日又開始了,是項比賽今年進入第十三屆十三年來,由於教育當局的鼓勵,音樂界及社會人士的支持,及教師與學生的努力,使這每年一度的學校音樂節,和學校音樂的水平有了長足的進展
在朗誦方面,則加設九歲以下及十二歲以下的誦比賽,及華籍東方選手組朗誦比賽,小學組詩篇誦比賽,短歌組等節目o 此賽今年的特色,便是小學生也得以參加,及英文組分開中英國籍。中文誦是去年開始增設的項目,只限中學生參加,可以說是一種新嘗試。由於這種嘗試頗受一般重視,和去年所表現的成很圓滿,於是今年度便更進一步,增加了兒童組和小學組,內容則力求普遍。在朗誦方面,則加設九歲以下及十二歲以下的誦比賽,及華籍東方選手組朗誦比賽,小學組詩篇誦比賽,短歌組等節目此賽今年的特色,便是小學生也得以參加,及英文組分開中英國籍。中文誦是去年開始增設的項目,只限中學生參加,可以說是一種新嘗試。由於這種嘗試頗受一般重視,和去年所表現的成很圓滿,於是今年度便更進一步,增加了兒童組和小學組,內容則力求普遍。
Table 1: An example of text before and after text modification.

Code

The code is provided as follows:

  1. Natural Language Understanding is one of the key features provided by LLMs, which allow LLMs to understand languages written by humans. This can be achieved by pre-training LLMs with tons of training data. Since the training process is lengthy and requires extremely large resources, the pre-training process is done by OpenAI and our team can perform character correction by inferencing the model through APIs. ↩︎
  2. According to our statistics, by comparing the original extracted text from OCR software and the modified text outputted by ChatGPT API, the modified character is 3.4 million (3,397,587). The character modified rate is 12.383% with the total character count being 27 million (27,437,468) between the years 1953 and 1962. ↩︎