Methodology
1. Pre-processing of Raw Images
The original scans of The Hong Kong News have the yellowish-brown tone typical of aged newsprint (Fig 1). Such raw images are not optimal for character recognition, so image processing is required. Using Adobe Photoshop, the brightness and contrast of every newspaper issue were adjusted and the pages were converted to black and white. After these adjustments, the processed newspaper image (Fig 2) is ready for optical character recognition.
Fig 1 (Left):
Before image processing
Fig 2 (Right):
Pre-processed image
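Since these adjustments were made manually in Adobe Photoshop, no code is involved in this step. The sketch below merely illustrates an equivalent scripted version of the same grayscale, brightness and contrast adjustments using the Pillow library in Python; the file name and enhancement factors are assumptions, not the exact Photoshop settings used.
# Illustrative sketch only: the actual adjustments were made manually in Photoshop.
from PIL import Image, ImageEnhance

img = Image.open('hkn_1942_01_01.jpg').convert('L')  # convert the scan to black and white
img = ImageEnhance.Brightness(img).enhance(1.3)      # brighten the page (factor assumed)
img = ImageEnhance.Contrast(img).enhance(1.5)        # raise the contrast (factor assumed)
img.save('hkn_1942_01_01_processed.jpg')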
2. Optical Character Recognition (OCR)
After pre-processing the raw images, ABBYY is deployed. ABBYY is a commercial software package that uses optical character recognition (OCR) to convert scanned documents into a range of digital formats. Here, the recognized text is exported in HTML format, which is crucial for extracting sub-sections in the next stage.
Fig 3: OCR through ABBYY
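ABBYY was operated through its desktop interface, so this step also involves no code. As a rough open-source analogue, the sketch below shows how a comparable scan-to-HTML step could be performed with pytesseract, which exports hOCR (an HTML-based format); this is an illustrative substitution, not the tool used in this project, and the file names are assumptions.
# Illustrative alternative only: the project used ABBYY, not Tesseract.
from PIL import Image
import pytesseract

page = Image.open('hkn_1942_01_01_processed.jpg')                 # pre-processed page (assumed name)
hocr = pytesseract.image_to_pdf_or_hocr(page, extension='hocr')   # OCR output as HTML-like hOCR

with open('hkn_1942_01_01.html', 'wb') as f:
    f.write(hocr)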
3. Extract Desirable Paragraph (Around Town)
As mentioned above, the OCR output is saved in HTML format so that Beautiful Soup can be used in Python. Beautiful Soup is a Python package that parses HTML files into structured data. In this project, all text after the heading "Around Town" (the section that reports local news) is extracted. Below is the Python code used to extract this section from each issue of The Hong Kong News.
import re
import os
import glob

import requests
import cssutils
from bs4 import BeautifulSoup

# Silence urllib3 warnings emitted through requests
requests.packages.urllib3.disable_warnings()


def check_font_size(css_path: str, font_class: str, size_range: int):
    '''
    Check whether the font size of the given class is larger than the required value
    '''
    with open(css_path, encoding='utf-8') as f:
        css_file = f.read()
    sheet = cssutils.parseString(css_file)
    css = {}
    for rule in sheet:
        if rule.type == rule.STYLE_RULE:
            style = rule.selectorText
            css[style] = {}
            for item in rule.style:
                propertyname = item.name
                value = item.value
                css[style][propertyname] = value
    ft = css['.' + font_class]['font']
    font_size = int(re.search(r'\d*', ft).group())
    return font_size > size_range


current_working_directory = os.getcwd()

# Create a folder to store the final text files
result_folder = os.path.join(current_working_directory, 'result')
if not os.path.exists(result_folder):
    os.makedirs(result_folder)

path = os.path.join(current_working_directory, 'ZZ_HTML (8)')  # path of html folder

for filename in glob.glob(path + '/*.htm'):  # Process htm files one by one
    # Read html file
    with open(filename, "r", encoding='utf-8') as f:
        html_doc = f.read()

    # Parse html file and find the "Around Town" heading by matching: Around or Town or Around Town
    soup = BeautifulSoup(html_doc, 'html.parser')
    AT_heading = soup.find_all('span', string=re.compile('(Around)|(Town)|(Around Town)'))

    # Path of the corresponding css file
    css_path = filename.split(".htm")[0] + '_files/' + filename.split(".htm")[0].split('/')[-1] + '.css'

    for word in reversed(AT_heading):
        font_class = word['class'][0]
        # Check if this is a large heading font, then gather all text after this heading
        if check_font_size(css_path, font_class, size_range=14):
            # Get the text inside every <span> tag after "Around Town"
            spans = word.find_all_next("span")
            result_text = ""
            # Combine all text
            for span in spans:
                result_text = result_text + span.text

            textfilename = filename.split(".htm")[0].split('/')[-1]
            # Write txt file and save it to the folder "result"
            with open('result/' + textfilename + '.txt', 'w', encoding='utf-8') as file:
                file.write(result_text)
            break
4. Natural Language Processing (NLP) on Raw Text
Now, the raw text under the "Around Town" section has been pulled out of every issue. However, this text still contains unwanted characters such as "*", "#" and "@", as well as Chinese characters. Thus, a Natural Language Processing (text-cleaning) step is executed in Python to remove these characters, as shown in the code below.
# Remove unwanted characters from the extracted text
cleaned_folder = os.path.join(current_working_directory, 'cleaned')
if not os.path.exists(cleaned_folder):
    os.makedirs(cleaned_folder)

clean_path = os.path.join(current_working_directory, 'result')

for file in glob.glob(clean_path + '/*.txt'):
    # Read the extracted text of one issue
    with open(file, 'r', encoding='utf-8') as input_file:
        clean_doc = input_file.read()
    textfilename = os.path.splitext(os.path.basename(file))[0]

    # Replace non-ASCII characters (e.g. Chinese characters) with a space
    clean_text = re.sub("([^\x00-\x7F])+", " ", clean_doc)
    # Remove backslashes
    middle_text = re.sub(r"[\\]", "", clean_text)
    # Remove leftover OCR symbols
    last_text = re.sub("[*<>^]", "", middle_text)

    # Write the cleaned txt file to the folder "cleaned"
    with open('cleaned/' + textfilename + '.txt', 'w', encoding='utf-8') as output_file:
        output_file.write(last_text)
5. Named-Entity Recognition (NER)
Traditionally, Named-Entity Recognition (NER) is performed with programming tools such as Python libraries. With the advent of generative artificial intelligence (AI), however, it has become much easier to extract the time, place and event from text. This research uses the CUHK AI Chatbot for Learning Support: by simply asking ChatGPT to find all the places mentioned in the text, together with an example of the desired format (place 1: exact sentence 1), it returns the result in just a few seconds. The resulting database of wartime events is saved in a csv file for later use.
Fig 4: NER process through CUHK AI ChatGPT (CUHK, 2024)
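Because the chatbot replies in plain text, a small post-processing step is needed to turn its answer into the csv file mentioned above. The sketch below assumes the reply follows the requested "place: exact sentence" format; the reply text, column names and file name are illustrative assumptions.
# Minimal sketch: the reply text, column names and file name are assumptions.
import csv

chatbot_reply = """Queen's Road: A fire broke out on Queen's Road yesterday evening.
Kowloon: Rice distribution resumed in Kowloon this morning."""

with open('wartime_events.csv', 'w', newline='', encoding='utf-8') as f:
    writer = csv.writer(f)
    writer.writerow(['place', 'sentence'])
    for line in chatbot_reply.splitlines():
        if ':' not in line:
            continue
        place, sentence = line.split(':', 1)
        writer.writerow([place.strip(), sentence.strip()])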
6. Visualisation and Overlay of Historical Map
The default basemap would be an odd choice, as the coastline, road network and urban fabric of Hong Kong have all changed considerably since the 1940s. It is therefore preferable to use historic maps, ideally from between 1941 and 1945. After liaising with the owner and founder of Hong Kong Historic Maps (https://www.hkmaps.hk/), we obtained his permission to use the 1945 map in our study. In ArcGIS Online, by linking to the external map tile service, the georeferenced historic map can be visualized and overlaid on our project.
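The overlay itself was configured inside ArcGIS Online by linking to the external tile service. Purely as an illustration of the same idea in Python, the sketch below overlays a historic tile layer on a modern web map with the folium library; the tile URL template is a placeholder, not the actual hkmaps.hk endpoint.
# Minimal sketch, assuming an XYZ tile service of the georeferenced 1945 map.
import folium

m = folium.Map(location=[22.28, 114.16], zoom_start=13)  # centred on Victoria Harbour

folium.TileLayer(
    tiles='https://example.org/hk1945/{z}/{x}/{y}.png',   # placeholder tile endpoint
    attr='Hong Kong Historic Maps (hkmaps.hk), 1945 map',
    name='1945 historic map',
    overlay=True,
).add_to(m)

folium.LayerControl().add_to(m)
m.save('historic_overlay.html')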
7. Generation of Interactive Story Map
Lastly, the aforementioned csv file is imported into ArcGIS Online to depict the spatial information of the historic events. In addition, ArcGIS StoryMaps is used to create an online platform that offers features such as interactive maps and a time slider for users to explore.
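The import was carried out through the ArcGIS Online web interface. For completeness, a minimal sketch of the same step with the ArcGIS API for Python is shown below; it assumes the csv from step 5 contains latitude and longitude columns, and the credentials, file name and item title are placeholders.
# Minimal sketch, assuming the ArcGIS API for Python and a csv with
# latitude/longitude columns; credentials and titles are placeholders.
from arcgis.gis import GIS

gis = GIS('https://www.arcgis.com', 'username', 'password')

# Upload the csv of wartime events and publish it as a hosted point layer
csv_item = gis.content.add(
    {'type': 'CSV', 'title': 'Hong Kong News wartime events'},
    data='wartime_events.csv',
)
analysis = gis.content.analyze(item=csv_item)  # infer field types and location columns
events_layer = csv_item.publish(publish_parameters=analysis['publishParameters'])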