Cantonese Transcription
This project primarily uses the ‘CUHK 60th Anniversary Oral History Project’ dataset, which includes interviews with 60 individuals: teaching staff, administrative staff, and other university employees. This valuable dataset is notable for containing both video interviews and summaries of those interviews. My hypothesis is that the summaries do not comprehensively express the interviewees’ viewpoints and focus; for example, they may omit content that the interviewees repeated. (This hypothesis was later confirmed.) To test it, the first major task was to transcribe the entire video content.
Initially, I used Whisper for Cantonese transcription, loading a pre-trained model through the Hugging Face Transformers library to convert Cantonese speech into written Chinese text. The model I chose, alvanlii/whisper-small-cantonese, is a Whisper model fine-tuned specifically for Cantonese transcription.
```python
from transformers import pipeline

# Fine-tuned Whisper model for Cantonese speech recognition
MODEL_NAME = "alvanlii/whisper-small-cantonese"
lang = "zh"
device = "cuda"  # use "cpu" if no GPU is available

pipe = pipeline(
    task="automatic-speech-recognition",
    model=MODEL_NAME,
    chunk_length_s=30,  # split long recordings into 30-second chunks
    device=device,
)

# Force the decoder to transcribe (not translate) in Chinese
pipe.model.config.forced_decoder_ids = pipe.tokenizer.get_decoder_prompt_ids(
    language=lang, task="transcribe"
)

result = pipe("audio.wav")
print(result["text"])
```
(Source: alvanlii/whisper-small-cantonese and Digital Humanities Initiative, HKUST)
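Since every interview in the dataset needs to be transcribed, it is convenient to loop the pipeline over a whole folder of audio files, along the lines of the batch-processing techniques in the HKUST tutorial cited above. The sketch below is a minimal version of that idea; the folder names are illustrative assumptions, and `pipe` is the pipeline defined earlier.

```python
from pathlib import Path

# Illustrative paths: adjust to wherever the interview audio actually lives
AUDIO_DIR = Path("interviews")
OUT_DIR = Path("transcripts")
OUT_DIR.mkdir(exist_ok=True)

for audio_file in sorted(AUDIO_DIR.glob("*.wav")):
    result = pipe(str(audio_file))  # reuse the pipeline defined above
    out_path = OUT_DIR / (audio_file.stem + ".txt")
    out_path.write_text(result["text"], encoding="utf-8")
    print(f"Transcribed {audio_file.name} -> {out_path.name}")
```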
Of course, the transcription results from Whisper are not perfect. This may be due to factors such as variation in speakers’ accents and network connectivity issues (some interviews were conducted via Zoom). Common errors included repeated sentences and words, as well as the model’s failure to recognize personal names, proper nouns, and place names.
Here’s an example: on the left is the original output from Whisper, and on the right is the revised text generated by the LLM.

The next step involves using an LLM to refine the transcription results. In this case, I used Gemini via Poe, with the following prompt:
1. You will receive two text inputs:
* **Transcription:** A raw, imperfect transcription of Cantonese audio generated by the AI model. This transcription may contain errors, omissions, and disfluencies.
* **Summary:** A well-organized, accurate summary of the same audio content. This summary represents the key points and overall structure of the audio.
2. **Process:**
* **Compare and Align:** Carefully compare the transcription and the summary. Identify sections of the transcription that correspond to points in the summary.
* **Correct Errors:** Correct any obvious errors in the transcription, such as misspellings, incorrect word choices, or grammatical mistakes. Prioritize corrections that align the transcription with the summary.
* **Fill in Gaps:** If the transcription is missing information that is present in the summary, add the missing information to the transcription, phrasing it in a natural and Cantonese-appropriate way.
* **Remove Redundancies and Disfluencies:** Remove any unnecessary repetitions.
* **Maintain Cantonese Style:** Preserve the natural flow and style of spoken Cantonese. Avoid making the transcription sound overly formal or unnatural.
* **Focus on Meaning:** The primary goal is to retain all the content of the transcription and the essence of the original spoken Cantonese.
3. **Output:** Provide a refined and improved version of the original transcription.
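I ran this prompt interactively through Poe’s chat interface, but the same refinement step can be scripted. The sketch below is one way to do so using Google’s google-generativeai Python package rather than Poe; the model name, file paths, and the abbreviated `SYSTEM_PROMPT` are assumptions for illustration, not the exact setup used in this project.

```python
import google.generativeai as genai

genai.configure(api_key="YOUR_API_KEY")  # assumption: a Gemini API key
model = genai.GenerativeModel("gemini-1.5-flash")  # illustrative model choice

# SYSTEM_PROMPT holds the numbered instructions above (abbreviated here)
SYSTEM_PROMPT = "You will receive two text inputs: a Transcription and a Summary ..."

# Illustrative file paths for one interview's transcription and summary
transcription = open("transcripts/interview01.txt", encoding="utf-8").read()
summary = open("summaries/interview01.txt", encoding="utf-8").read()

# Send the prompt plus both texts and print the refined transcription
response = model.generate_content(
    f"{SYSTEM_PROMPT}\n\nTranscription:\n{transcription}\n\nSummary:\n{summary}"
)
print(response.text)
```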
References
- alvanlii/whisper-small-cantonese, Hugging Face, https://huggingface.co/alvanlii/whisper-small-cantonese, accessed 20 June 2025.
- HKUST Digital Humanities Initiative, ‘Transcribe Cantonese Speech to Text: with Code Samples and Automated Batch Processing Techniques’, https://digitalhumanities.hkust.edu.hk/tutorials/transcribe-cantonese-speech-to-text-with-code-samples-and-automated-batch-processing-techniques/, accessed 20 June 2025.