Introduction

NLP of Observatory

Analysis Object & Project Target

The Observatory Review (天文臺 in Chinese) is a newspaper founded by 陳孝威 in 1936. It mainly reported political and military events from around the world. As a lieutenant general of the KMT (Kuomintang), its editor, Chan, paid particular attention to the international situation in Asia. Rather than limiting the paper to investigation and reporting, Chan devoted much of its space to his own views and predictions. The newspaper became famous in February 1941, when Chan predicted the war between the Soviet Union and Nazi Germany. Because of his acute judgment, we believe the newspaper is worth careful study and analysis. By analyzing it, we want to discover how social focus and attitudes towards different subject matters changed across periods.

For various reasons, the image records of the Observatory Review are scattered, and a limited number of records are missing. Thanks to the CUHK Library, we have access to better-preserved newspaper data. These data are scanned JPG files covering 1950 to 1985 (the Observatory Review suspended publication during 1973-1982). In total, the Library's database holds 3250 issues, comprising 13030 pages, of validly published Observatory Review with OCR-extracted text. Our data came from "The observatory review | CUHK Digital Repository".

Project Process Sketch

The original data in the collection is stored as images, so we needed to convert it into text for further analysis. Although text files extracted by raw OCR were already available, their quality did not meet our requirements. So, first, we needed to run OCR again on the original images. Second, because written Chinese does not separate words with spaces, we needed to perform word segmentation to split each sentence into words. Then we could carry out analysis and visualization, such as word clouds, relationship analysis, topic modeling, and sentiment analysis. We believe these NLP and visualization results can help researchers in history and society learn more about The Observatory Review.
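To make the word-segmentation step more concrete, the following is a minimal Python sketch of one simple dictionary-based approach, forward maximum matching, followed by the word count that a word cloud would be built from. It is not necessarily the method used in this project; the function name forward_max_match, the toy vocabulary, and the sample sentence are all hypothetical and chosen only for illustration.

```python
from collections import Counter


def forward_max_match(text, vocab, max_len=4):
    """Greedy forward maximum matching.

    At each position, take the longest substring (up to max_len characters)
    that appears in vocab; fall back to a single character otherwise.
    """
    words = []
    i = 0
    while i < len(text):
        # Try the longest candidate first, shrinking down to one character.
        for size in range(min(max_len, len(text) - i), 0, -1):
            piece = text[i:i + size]
            if size == 1 or piece in vocab:
                words.append(piece)
                i += size
                break
    return words


# Hypothetical toy vocabulary and sentence, for illustration only.
vocab = {"天文臺", "報導", "國際", "局勢", "國際局勢"}
sentence = "天文臺報導國際局勢"

tokens = forward_max_match(sentence, vocab)
freq = Counter(tokens)  # word frequencies, the usual input to a word cloud
print(tokens)
print(freq.most_common())
```

In practice, a real vocabulary would be far larger, and a dedicated Chinese segmentation library would likely give better results on OCR output than this greedy rule.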

Group Information

TAO Zhisheng, Jason (CSCI/4) taozhs1022@163.com

XU Tao, Tom (CSCI/4) taoxu8330@gmail.com

WEN Ruizhe, Winston (CSCI/4) rzwen5137@gmail.com

Coaches

Vincent LUM

Michael YIP

Joseph LAU