Data Preparation
Two Excel sheets are needed before generating the interactive map. The first provides the geographic coordinates (latitude and longitude) of each location, along with its place type, and is used to map the locations in ArcGIS. The second contains quotes extracted from the 60 interviews, limited to interviewee statements that reference specific place names at CUHK. For each quote, it records the interviewee’s name, the places mentioned, the quote itself, the video URL, and the source information.
An example of the first sheet (Location Information):

An example of the second sheet (Quotes):

A key question remained: how to extract place names from the transcripts? To address this, I used Python to identify locations and their related quotes. First, I used the pycantonese library for word segmentation. I then checked whether any of the resulting words ended with a suffix commonly associated with Cantonese place names. (My list of location suffixes included: ‘樓’, ‘書院’, ‘學院’, ‘餐廳’, ‘館’, ‘研究所’, ‘宿’, ‘街’, ‘堂’, ‘校’, ‘大學’, ‘hall’, ‘站’, ‘地方’, ‘嗰度’.) Sentences containing words with these suffixes were then added to a list of location-related sentences.
The example below demonstrates the output of this code: a sentence such as ‘嗰時候中文大學仲未成立,但係有個聯合招生廣告,我就參加咗呢個聯招’ can be extracted because it contains the keyword ‘大學’. In this way, we can obtain the necessary quotes and place names (e.g., 中文大學) for preparing the Excel sheets.

In total, this project identified 81 place names and 519 quotes.
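The suffix check described above can be sketched roughly as follows. This is a minimal illustration, not the project's actual script: the function names are hypothetical, the lower-casing step (so that ‘Hall’ also matches ‘hall’) is an assumption, and the way pycantonese segments any particular sentence may differ from what the examples in this post suggest.

```python
# Suffix list copied from the write-up above.
LOCATION_SUFFIXES = (
    "樓", "書院", "學院", "餐廳", "館", "研究所", "宿", "街",
    "堂", "校", "大學", "hall", "站", "地方", "嗰度",
)


def segment(sentence):
    """Segment a Cantonese sentence into words with pycantonese."""
    import pycantonese  # imported lazily so the helpers below work without it
    return pycantonese.segment(sentence)


def location_words(words, suffixes=LOCATION_SUFFIXES):
    """Return the words that end with one of the location suffixes."""
    # str.endswith accepts a tuple, so one call checks every suffix.
    # Lower-casing is an assumption so that 'Hall' also matches 'hall'.
    return [w for w in words if w.lower().endswith(tuple(suffixes))]


def location_sentences(sentences, segmenter=segment):
    """Pair each sentence that mentions a candidate place with those words."""
    hits = []
    for sentence in sentences:
        places = location_words(segmenter(sentence))
        if places:
            hits.append((sentence, places))
    return hits
```

Passing the segmenter as a parameter keeps the suffix logic testable on pre-segmented text. Generic suffixes such as ‘地方’ and ‘嗰度’ will also match non-specific phrases, so the resulting list still needs manual review before it goes into the Excel sheets.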
Video URL
I also embedded the original interview videos into the interactive map, leveraging Whisper AI to generate timestamps for the interview audio.
However, the off-the-shelf version of Whisper that I used cannot accurately transcribe Cantonese: it generates timestamps, but the accompanying text is in written Chinese rather than a transcription of the spoken Cantonese. An example of this is shown below.

Therefore, I used an LLM to determine the correct timestamps. Specifically, I used Gemini via Poe. The prompt setup consisted of the following:
- You will be given a JSON file containing timestamps and corresponding text from a video.
- Note that this JSON file may contain errors or inaccuracies in the timestamps and text.
- Your task is to locate the most accurate timestamp for the sentence I sent to you.
- Use the provided JSON data to find the best matching text snippet.
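As a rough illustration of how such a request might be assembled, the sketch below combines the instructions above, the timestamp JSON, and one target quote into a single message. It assumes the JSON follows the usual Whisper segment layout (a list of entries with `start`, `end`, and `text` fields); the function names and the layout of the final message are hypothetical and not necessarily what was sent to Gemini.

```python
import json

# Instruction lines taken from the prompt setup listed above.
INSTRUCTIONS = [
    "You will be given a JSON file containing timestamps and corresponding text from a video.",
    "Note that this JSON file may contain errors or inaccuracies in the timestamps and text.",
    "Your task is to locate the most accurate timestamp for the sentence I sent to you.",
    "Use the provided JSON data to find the best matching text snippet.",
]


def compact_segments(segments):
    """Keep only the fields the model needs: start, end, text."""
    # Assumes Whisper-style segments; other fields (ids, tokens) are dropped.
    return [
        {
            "start": round(s["start"], 2),
            "end": round(s["end"], 2),
            "text": s["text"].strip(),
        }
        for s in segments
    ]


def build_prompt(segments, quote):
    """Assemble one message: instructions, timestamp JSON, then the quote."""
    # ensure_ascii=False keeps the Chinese text readable instead of \u escapes.
    data = json.dumps(compact_segments(segments), ensure_ascii=False, indent=2)
    rules = "\n".join(f"- {line}" for line in INSTRUCTIONS)
    return f"{rules}\n\nJSON:\n{data}\n\nSentence:\n{quote}"
```

Dropping the unused fields keeps the message shorter, which matters for long interviews. Because the transcript is in written Chinese while the quotes are in Cantonese, the model is matching on meaning rather than exact wording, so the returned timestamps are worth spot-checking against the video.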
Mapping
Finally, I used ArcGIS Online to create an interactive map. I then used the ‘Join’ and ‘Relate’ functions in ArcGIS to integrate the location data with the quote data.
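Both functions match rows on a shared key, so a join or relate only works if the place names in the two sheets agree exactly. The sketch below is an optional pre-check, not part of ArcGIS: it assumes each sheet has been exported to rows of dictionaries (for example with `csv.DictReader`) and that both share a column named `place`, which is a hypothetical column name.

```python
def unmatched_places(location_rows, quote_rows, key="place"):
    """Return place names used in the quotes sheet but missing from the location sheet."""
    # Strip whitespace so a trailing space does not count as a different place.
    known = {row[key].strip() for row in location_rows}
    missing = {row[key].strip() for row in quote_rows} - known
    return sorted(missing)


def quotes_by_place(quote_rows, key="place"):
    """Group quotes by place: one location row can relate to many quote rows."""
    grouped = {}
    for row in quote_rows:
        grouped.setdefault(row[key].strip(), []).append(row)
    return grouped
```

If `unmatched_places` returns anything, those quotes would have no point on the map to attach to, so the spelling in one of the sheets needs correcting first.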
The map includes separate layers for New Asia College, Chung Chi College, United College, and the Chinese University of Hong Kong. Additional layers categorize individuals by their period of affiliation with CUHK: those enrolled or working before 1963, between 1963 and 1976, and between 1976 and 1980.
Interactive Map