OCR & Analysis of The Hongkong News (1942-1945)
Data Analytics Practice Opportunity 2021/22
Background
First published on 31st December 1941, six days after the Christmas Day surrender of the British Colony, The Hongkong News was initially printed on the abandoned presses of the South China Morning Post. It ceased publication on 17th August 1945, after Japan was forced to surrender to the Allies.
The Hongkong News records the undiluted voice and mindset of the Japanese administration of occupied Hong Kong. By studying it, scholars have gradually come to understand the particular situation of Hong Kong at that time, when domination was maintained through large-scale internment and assurances of certain victory. With The Hongkong News, we can return to that history and trace the whole trajectory of ‘the new masters of East Asia’, from the Colony’s Imperial overlord to abject surrender.
Target
We aim to extract keywords from the newspaper that show the trends of the period of the Japanese Occupation of Hong Kong. Extracting keywords directly is not easy, because the newspaper is only available as images. To deal with this, we first used OCR APIs to extract the text from the newspaper images, and then applied NLP techniques to obtain the useful information. We then encountered another problem: the quality of the result from a single OCR API did not meet our expectations. We therefore read several papers and found a method to improve the quality of the OCR result. Finally, we visualized the data with an interactive approach.
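As an illustration of the keyword step only, the following is a minimal sketch that counts the most frequent words in text that has already been produced by OCR. It is not the project's actual method: the function name `top_keywords`, the tiny stop-word list, and the sample sentence are all assumptions made up for this example, and the OCR and OCR-correction steps are not shown.

```python
import re
from collections import Counter

# Hypothetical, deliberately small stop-word list (an assumption for this sketch);
# a real analysis would use a fuller list.
STOP_WORDS = {"the", "a", "an", "of", "and", "to", "in", "on", "at",
              "is", "was", "for", "by", "with"}


def top_keywords(ocr_text, n=5):
    """Return the n most frequent non-stop-words in already-OCRed text."""
    # Lowercase the text and keep alphabetic runs only, which also drops
    # stray punctuation that OCR output often contains.
    tokens = re.findall(r"[a-z]+", ocr_text.lower())
    # Ignore stop words and very short tokens (often OCR noise).
    counts = Counter(t for t in tokens if t not in STOP_WORDS and len(t) > 2)
    return counts.most_common(n)


# Made-up sample text, not a quotation from the newspaper.
sample = ("Rice rations announced. Rice prices fixed by the Governor. "
          "Rations for the district.")
print(top_keywords(sample, 2))
```

Running the same count separately for each issue or month would give a simple per-period frequency table that an interactive chart could then display.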
Project Team
This project is conducted by a group of students in the Data Analytics Practice Opportunity 2021/22:
- Hangji LI (CSE/3)
- Deyuan KONG (RMSC/3)