Finding the Hidden Gems in CUHK Library’s Audio Collection: Machine Learning as a Tool for Audio Analysis

Chanting is a traditional Chinese practice of reading, composing and teaching classical poetry and prose in a specific melodic style, with variations across dialects, lineages and personal preferences. The Chinese University of Hong Kong Library (CUHK Library) holds archival audio materials deposited and donated by various scholars. A large portion of these are recorded lectures on teaching, intermingled with Cantonese chanting, and have enormous research value. CUHK Library has been turning these materials into online digital collections for preservation and open access, e.g., the Rulan Chao Pian Collection in the CUHK Digital Repository. However, because subject analysis approaches vary, the records for these online recordings lack detail on the content and breakdown of non-ethnic music. Consequently, researchers must invest considerable time and effort in manually sorting out the chanting activities from these hour-long recordings. CUHK Library therefore initiated a pilot project to develop a machine-learning-based classifier as a rapid tool for identifying speech and chanting activities in the digital audio repository through automated analysis.

Fig 1. Rulan Chao Pian Collection in CUHK Digital Repository

In this project, an open-source GitHub project for audio analysis was adopted for segmentation, classifier training and prediction. It includes a Python library that provides supervised machine learning models, such as Support Vector Machine (SVM), for classification. The approach involves extracting audio features, performing statistical analysis to identify differences and storing the results in a feature vector. The feature vector can then be used to classify new data based on patterns observed in the categorized dataset.
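The idea of turning audio into a feature vector and classifying it can be illustrated with a minimal sketch. It does not reproduce the adopted GitHub project or its API. Everything here is an assumption made for illustration: the two features (frame energy and zero-crossing rate), the synthetic "tonal" and "noisy" signals standing in for real recordings, and the function names. To stay dependency-free, a simple nearest-centroid classifier is used in place of an SVM.

```python
import math
import random
import statistics

RATE = 8000  # samples per second (assumed for this sketch)

def short_term_features(samples, frame_len=400):
    """Return (energy, zero-crossing rate) for each non-overlapping frame."""
    feats = []
    for start in range(0, len(samples) - frame_len + 1, frame_len):
        frame = samples[start:start + frame_len]
        energy = sum(x * x for x in frame) / frame_len
        crossings = sum(1 for a, b in zip(frame, frame[1:]) if (a >= 0) != (b >= 0))
        feats.append((energy, crossings / (frame_len - 1)))
    return feats

def feature_vector(samples, frame_len=400):
    """Summarize frame-level features as [mean, std] for each feature."""
    vec = []
    for column in zip(*short_term_features(samples, frame_len)):
        vec.append(statistics.fmean(column))
        vec.append(statistics.pstdev(column))
    return vec

def train_centroids(labelled_vectors):
    """Average the vectors of each label (nearest-centroid stand-in for an SVM)."""
    grouped = {}
    for label, vec in labelled_vectors:
        grouped.setdefault(label, []).append(vec)
    return {label: [statistics.fmean(col) for col in zip(*vecs)]
            for label, vecs in grouped.items()}

def predict(centroids, vec):
    """Return the label whose centroid is closest to the feature vector."""
    return min(centroids, key=lambda label: math.dist(centroids[label], vec))

def make_tone(freq, n=RATE):
    """Synthetic steady tone (hypothetical stand-in for melodic audio)."""
    return [0.5 * math.sin(2 * math.pi * freq * i / RATE) for i in range(n)]

def make_noise(seed, n=RATE):
    """Synthetic uniform noise (hypothetical stand-in for non-melodic audio)."""
    rng = random.Random(seed)
    return [rng.uniform(-0.5, 0.5) for _ in range(n)]

# Build a small categorized dataset and train on it.
training = [("tonal", feature_vector(make_tone(f))) for f in (200, 220, 250)]
training += [("noisy", feature_vector(make_noise(s))) for s in (1, 2, 3)]
centroids = train_centroids(training)

# Classify clips that were not part of the training data.
print(predict(centroids, feature_vector(make_tone(262))))
print(predict(centroids, feature_vector(make_noise(99))))
```

A real classifier for speech and chanting would need richer features and real labelled recordings; the sketch only shows the flow from frames, to statistics, to a feature vector, to a prediction.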

The training data being applied was mainly from

These audio resources were processed to reduce noise and normalize volume in order to improve the quality of the training data. They were then segmented and categorized as “Chanting”, “Speech” or “Silence” according to their content.
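The volume normalization and segmentation steps can be sketched roughly as follows. This is not the project's actual pipeline, which the text does not specify. The peak-normalization method, the energy threshold, the frame length and the function names are all assumptions, and the sketch only separates "Silence" from "Active" audio. Telling chanting from speech would still require the trained classifier or manual labelling.

```python
import math

def peak_normalize(samples, target_peak=0.9):
    """Scale the signal so its largest absolute sample equals target_peak."""
    peak = max((abs(x) for x in samples), default=0.0)
    if peak == 0.0:
        return list(samples)  # all-silent input: nothing to scale
    scale = target_peak / peak
    return [x * scale for x in samples]

def segment_by_energy(samples, rate, frame_len, threshold):
    """Return (start_sec, end_sec, label) segments, label 'Silence' or 'Active'."""
    segments = []
    for start in range(0, len(samples) - frame_len + 1, frame_len):
        frame = samples[start:start + frame_len]
        energy = sum(x * x for x in frame) / frame_len
        label = "Silence" if energy < threshold else "Active"
        t0, t1 = start / rate, (start + frame_len) / rate
        if segments and segments[-1][2] == label:
            # Same label as the previous frame: extend that segment.
            segments[-1] = (segments[-1][0], t1, label)
        else:
            segments.append((t0, t1, label))
    return segments

# Synthetic clip (assumed data): 1 s of silence, 1 s of a quiet tone, 1 s of silence.
rate = 8000
tone = [0.2 * math.sin(2 * math.pi * 440 * i / rate) for i in range(rate)]
clip = [0.0] * rate + tone + [0.0] * rate

normalized = peak_normalize(clip)
segments = segment_by_energy(normalized, rate, frame_len=800, threshold=0.01)
for start_sec, end_sec, label in segments:
    print(f"{start_sec:.1f}-{end_sec:.1f} s: {label}")
```

Normalizing before thresholding means one energy threshold can be applied to recordings made at different levels, which is one reason to run the steps in this order.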