Methodology

To see the streamlined image detection, extraction, and text recognition process, please follow the link to the code display.
Please click the button to launch the Colab Notebook for an interactive code demo.

 

Data Recognition, Extraction, Categorization, and Display

In this project, we used the Illustration Detector for image recognition, a recently published computer vision repository developed by the Visual Geometry Group (VGG) at the University of Oxford. The model is based on the EfficientDet neural network architecture for efficient object detection and was trained on the Scottish Chapbooks data set from the National Library of Scotland. The repository consists of Python code accompanied by VGG tools, including List Annotator (LISA) for reviewing and refining image detections and the VGG Image Search Engine (VISE) for image queries. VGG's earlier publication, Visual Analysis of Chapbooks Printed in Scotland, has shown that the model can handle images from early printed books. Apart from the illustration detector, we also used the OpenCV computer vision library for image extraction, the Google Cloud Vision API for optical character recognition, and the Bootstrap front-end framework for displaying the results.

Dataset

The Amusement News data set consists of 126 issues with 456 pages, covering publications between 1952 and 1959.

The data set is available here in the CUHK Research Data Repository.

Data processing 

Limitations

The accuracy of the results may be limited because we used the pretrained illustration detector as is, without further adjustment. The detector may have overlooked some relevant illustrations or captured irrelevant ones. In addition, some copies of the Amusement News have creases or signs of repair, which may also affect image detection. The printed text in the newspaper does not follow a single layout or run in a single direction, which may affect the OCR results.

Image detection and extraction

At the outset, we attempted to retrain the illustration detector on the Amusement News data set. We retrained the model with 75 images in which the locations of the illustrations had been manually annotated. However, the accuracy obtained with such a small training set was not promising, especially compared with that of the pretrained model, which was trained on over 6,500 chapbook images. Given the limited scope of the project, we decided to use the illustration detector as is.

With the method settled, the image detection and extraction process is rather straightforward. We first retrieve the raw images of the Amusement News from the repository and feed them to the illustration detector. Here, we use only the Automatic Detection of Illustrations portion rather than the whole pipeline proposed by VGG. The illustration detector returns the bounding coordinates of the detected images in a JSON file. We then read the JSON file and use the built-in image extraction methods of the OpenCV library to crop out the illustrations.
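
The cropping step can be sketched roughly as follows. This is a minimal illustration, not the project's actual code: the JSON layout (a list of objects with x, y, width, and height keys), the function names, and the output file naming are all assumptions, and the real detector output may be structured differently. The crop itself is a plain array slice on the image loaded by OpenCV.

```python
import json
from pathlib import Path


def clamp_box(x, y, w, h, img_w, img_h):
    """Clip an (x, y, width, height) box to the image and return (x0, y0, x1, y1)."""
    x0 = max(0, int(round(x)))
    y0 = max(0, int(round(y)))
    x1 = min(img_w, int(round(x + w)))
    y1 = min(img_h, int(round(y + h)))
    return x0, y0, x1, y1


def crop_illustrations(image_path, json_path, out_dir):
    """Crop every detected region of one page image and save each as a JPEG."""
    import cv2  # imported here so clamp_box can be used without OpenCV installed

    page = cv2.imread(str(image_path))
    if page is None:
        raise FileNotFoundError(f"cannot read image: {image_path}")
    img_h, img_w = page.shape[:2]

    # ASSUMPTION: the JSON file is a list of
    # {"x": ..., "y": ..., "width": ..., "height": ...} objects in pixels.
    boxes = json.loads(Path(json_path).read_text(encoding="utf-8"))

    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    saved = []
    for i, box in enumerate(boxes):
        x0, y0, x1, y1 = clamp_box(
            box["x"], box["y"], box["width"], box["height"], img_w, img_h
        )
        if x1 <= x0 or y1 <= y0:
            continue  # skip boxes that fall entirely outside the page
        crop = page[y0:y1, x0:x1]  # rows are y, columns are x
        out_path = out_dir / f"{Path(image_path).stem}_{i:03d}.jpg"
        cv2.imwrite(str(out_path), crop)
        saved.append(out_path)
    return saved
```

Clamping the box before slicing guards against detections that extend slightly past the page edge, which would otherwise produce empty or truncated crops.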

Text recognition, illustration selection, and image album display

After extracting the illustrations from the newspaper, we separated the portraits from other illustrations such as comics and advertisements. Then, we used Google Cloud Vision to retrieve the Traditional Chinese characters, mainly the image captions, from the selected illustrations.
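
A caption lookup along these lines might look like the sketch below, which uses the google-cloud-vision Python client. It is an illustration under stated assumptions rather than the project's code: the function names are hypothetical, the "zh-Hant" language hint is our choice, and the post-filter that keeps only CJK ideographs is an added step that the project may not have used. Running ocr_caption requires Google Cloud credentials.

```python
def keep_cjk(text):
    """Keep CJK ideographs and line breaks, then drop lines left empty."""
    kept = []
    for ch in text:
        # CJK Unified Ideographs and Extension A cover most Traditional Chinese text
        if ch == "\n" or "\u4e00" <= ch <= "\u9fff" or "\u3400" <= ch <= "\u4dbf":
            kept.append(ch)
    lines = [line for line in "".join(kept).split("\n") if line]
    return "\n".join(lines)


def ocr_caption(image_path):
    """Send one cropped illustration to Cloud Vision and return its Chinese text."""
    from google.cloud import vision  # needs google-cloud-vision and credentials

    client = vision.ImageAnnotatorClient()
    with open(image_path, "rb") as f:
        image = vision.Image(content=f.read())

    # ASSUMPTION: hinting Traditional Chinese; the project settings are not documented
    response = client.text_detection(
        image=image,
        image_context={"language_hints": ["zh-Hant"]},
    )
    if response.error.message:
        raise RuntimeError(response.error.message)

    # The first annotation, when present, holds the full detected text
    annotations = response.text_annotations
    full_text = annotations[0].description if annotations else ""
    return keep_cjk(full_text)
```

The filter discards digits, Latin letters, and punctuation, so it suits name matching but would lose dates or prices if those were needed later.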

We further identified the Cantonese Opera actors in the pictures and categorized them by name. Initially, we attempted to arrange the portraits automatically using only the textual information retrieved from the captions. This turned out to be impractical because a good portion of the extracted images either produced poor OCR text or did not contain sufficient textual description. The process therefore became semi-automated, aided by human intervention. We categorized the portraits by first examining the descriptions and names in the extracted captions, and some of the portraits were checked by eye against reference books (花月總留痕 : 香港粵劇回眸1930s-1970s and 錦繡梨園 : 1950至1959年香港粤劇) from the Library’s Hong Kong Studies collection. Occasionally, we also went back to the raw image files of the Amusement News to verify the categorization. In total, we identified over 100 people related to Cantonese Opera.
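
The automated half of this semi-automated step could be approximated as below. This is a sketch only: the function name, the idea of matching captions against a hand-maintained list of names, and the sample captions in the usage note are assumptions made for illustration and are not taken from the project data.

```python
from collections import defaultdict


def group_by_actor(captions, known_names):
    """Group portraits by the known names found in their OCR captions.

    captions: dict mapping a portrait file name to its OCR caption text.
    known_names: list of actor names to look for (hypothetical, hand-maintained).
    Returns (groups, needs_review), where needs_review lists portraits whose
    caption matched no known name and so need a manual check.
    """
    groups = defaultdict(list)
    needs_review = []
    for filename, caption in sorted(captions.items()):
        matches = [name for name in known_names if name in caption]
        if matches:
            # A portrait showing several actors is filed under each of them
            for name in matches:
                groups[name].append(filename)
        else:
            needs_review.append(filename)
    return dict(groups), needs_review
```

For example, with made-up captions {"p001.jpg": "任劍輝近影", "p002.jpg": "白雪仙與任劍輝", "p003.jpg": ""} and the names ["任劍輝", "白雪仙"], the first two portraits are grouped by name and p003.jpg is returned for manual review against the reference books.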

We selected the actors and stars who appeared most frequently for display. To keep the process easy to reuse and maintain, we chose the Bootstrap CSS framework. We set up an image carousel and photo album webpages to display the Cantonese Opera stars, which link further to our Amusement News repository and the Tabloid Newspaper collection. In addition, we took a further step and performed digital restoration on some of the photos. We colourized some of the selected photos with Palette to show how they might have looked in the old days.
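
As a rough illustration of the selection and display steps, the sketch below counts how often each name appears and generates carousel markup. It assumes Bootstrap 5 class and attribute names (for example data-bs-ride), since the Bootstrap version used in the project is not stated, and the function names and file paths are hypothetical.

```python
from collections import Counter
from html import escape


def top_actors(portrait_names, n):
    """Return the n names that appear most often, most frequent first."""
    return [name for name, _ in Counter(portrait_names).most_common(n)]


def carousel_html(carousel_id, slides):
    """Build Bootstrap 5 carousel markup from (image_url, caption) pairs."""
    cid = escape(carousel_id, quote=True)
    items = []
    for i, (url, caption) in enumerate(slides):
        active = " active" if i == 0 else ""  # exactly one slide starts active
        items.append(
            f'    <div class="carousel-item{active}">\n'
            f'      <img src="{escape(url, quote=True)}" class="d-block w-100" '
            f'alt="{escape(caption, quote=True)}">\n'
            f'      <div class="carousel-caption"><h5>{escape(caption)}</h5></div>\n'
            f"    </div>"
        )
    controls = "\n".join(
        f'  <button class="carousel-control-{d}" type="button" '
        f'data-bs-target="#{cid}" data-bs-slide="{d}">\n'
        f'    <span class="carousel-control-{d}-icon" aria-hidden="true"></span>\n'
        f'    <span class="visually-hidden">{label}</span>\n'
        f"  </button>"
        for d, label in (("prev", "Previous"), ("next", "Next"))
    )
    return (
        f'<div id="{cid}" class="carousel slide" data-bs-ride="carousel">\n'
        f'  <div class="carousel-inner">\n'
        + "\n".join(items)
        + "\n  </div>\n"
        + controls
        + "\n</div>"
    )
```

Generating the markup from a list keeps the page easy to regenerate when portraits are added; the output still needs the Bootstrap CSS and JavaScript files loaded on the page to behave as a carousel.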

Remarks

We streamlined a semi-automated process for image extraction, categorization, and display for our tabloid newspaper. Even though it is not the fully automated pipeline we intended, the established process and code should nevertheless be applicable to similar collections of digitized newspapers and books.

In this pilot project, we focused on the extracted portraits for display. As a next step, we hope to add illustrations of live performances, movies, and behind-the-scenes moments if possible.

In the future, we hope to scale up the project with a retrained illustration detector dedicated to our data set. We also hope to partner with experts in domains that match the theme of the newspaper for more efficient data clustering and in-depth interpretation. In addition, we may apply deep-learning colourization and a facial recognition model.

Acknowledgement

Visual Geometry Group – University of Oxford

OpenCV

Detect text in images | Cloud Vision API | Google Cloud

Bootstrap · The most popular HTML, CSS, and JS library in the world. (getbootstrap.com)

Palette – Colorize Photos

Reference

岳清. 花月總留痕 : 香港粵劇回眸1930s-1970s = Remembrance of evanescent times past : a retrospective look at Hong Kong Cantonese opera. 香港第一版., 三聯書店香港有限公司, 2019.

岳清. 錦繡梨園 : 1950至1959年香港粤劇. 初版., 一點文化有限公司, 2005.

Heritage and Integration- A Study of Hong Kong Cantonese Opera Films (filmarchive.gov.hk)