Machine Learning Method


A Python library and a free software application were adopted in this project:

pyAudioAnalysis, a Python library for audio analysis, was used to perform segmentation, classifier training, and prediction;

Audacity, a free and open-source audio editor, was used to prepare the training data set and to review the audio against its label file. It is available on Windows, macOS, GNU/Linux, and other operating systems.


In this project, the proposed procedure is as follows:

  • Prepare the audio data in the format required by an SVM package
  • Define the kernel and parameters of the SVM classifier
  • Find the best parameter C using cross-validation
  • Train on the whole training set with the best parameter C
  • Test with real data

The function audioTrainTest.extract_features_and_train() performs feature extraction, train/test splitting, cross-validation, and classifier training and evaluation in a single call.
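As a sketch of how this call might look with the window settings used later in this report (the class folder names match the data set described below, but the model name is a placeholder, and the exact argument order should be checked against the installed pyAudioAnalysis version):

```python
# Sketch of the training call, assuming pyAudioAnalysis is installed.
# Window/step values match those used in this project; the model
# name is a placeholder.
ST_WIN = ST_STEP = 0.005     # short-term window and step (seconds)
MT_WIN, MT_STEP = 4.0, 1.0   # mid-term window and step (seconds)
CLASS_DIRS = ["Chanting", "Speech", "Silence"]  # one folder per class

def train_chant_classifier(model_name="svm_chant"):
    from pyAudioAnalysis import audioTrainTest as aT
    # Extracts features from each class folder, cross-validates over C,
    # and saves the trained SVM model under `model_name`.
    aT.extract_features_and_train(CLASS_DIRS, MT_WIN, MT_STEP,
                                  ST_WIN, ST_STEP, "svm", model_name)
```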

Data Collection & Preprocessing

As the audio was recorded in various environments with general-purpose recording equipment, preprocessing of the audio signal was necessary; it ensures that the correct features are extracted from the data during training. Loudness normalization brought the root-mean-square (RMS) amplitude of each signal to the target level of -14 dB. The noise level was then reduced by 20 dB using a noise profile taken from each recording.
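The loudness-normalization step can be illustrated with a short sketch (the sine wave here is a synthetic stand-in for a recording; in the project the normalization was done in Audacity):

```python
import math

def rms_db(samples):
    """RMS level of a signal in dB relative to full scale (1.0)."""
    rms = math.sqrt(sum(s * s for s in samples) / len(samples))
    return 20 * math.log10(rms)

def normalize_to(samples, target_db=-14.0):
    """Scale the signal so that its RMS level equals target_db."""
    gain = 10 ** (target_db / 20) / 10 ** (rms_db(samples) / 20)
    return [s * gain for s in samples]

# Synthetic 440 Hz sine wave at 16 kHz as a stand-in for a recording.
signal = [0.5 * math.sin(2 * math.pi * 440 * n / 16000) for n in range(16000)]
normalized = normalize_to(signal, -14.0)  # RMS level is now -14 dB
```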

The audio files were finally segmented and sorted into three folders, “Chanting”, “Speech” and “Silence”, each representing the corresponding category.

Audio Feature Extraction

Fig 1. Short-term and mid-term feature vector

Audio signals are segmented into fixed-size time frames (mid-term windows). Each mid-term window is further segmented into shorter time frames (short-term windows). The number of frames depends on the windowing step.
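With window length w, step h, and signal length T, the number of frames is ⌊(T − w)/h⌋ + 1. A quick check with this project's settings (the 16 kHz sample rate and 60 s duration are illustrative assumptions):

```python
def num_frames(total_len, win, step):
    """Number of full windows of length `win` advancing by `step` samples."""
    return (total_len - win) // step + 1

SR = 16_000                            # assumed sample rate (Hz)
mid_win, mid_step = 4 * SR, 1 * SR     # 4 s mid-term window, 1 s step
st_win = st_step = int(0.005 * SR)     # 0.005 s = 80 samples

# Mid-term frames in a hypothetical 60-second recording.
mid_frames = num_frames(60 * SR, mid_win, mid_step)    # 57
# Short-term frames inside one mid-term window (non-overlapping,
# since window and step are equal).
short_frames = num_frames(mid_win, st_win, st_step)    # 800
```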

In pyAudioAnalysis, 34 audio features and their delta values were extracted to construct a 68-dimensional feature vector for each short-term window. At the mid-term level, the mean and standard deviation of each feature were computed, resulting in a 136-dimensional mid-term feature vector. This mid-term feature vector was used for the segment-level classification.
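The mean/standard-deviation aggregation can be sketched in a few lines (random numbers stand in for real short-term features):

```python
import random
import statistics

N_SHORT_FEATURES = 68  # 34 features + 34 deltas per short-term frame

# Fake short-term feature matrix: one 68-d vector per short-term frame
# (800 frames here, matching a 4 s mid-term window of 0.005 s frames).
random.seed(0)
frames = [[random.random() for _ in range(N_SHORT_FEATURES)]
          for _ in range(800)]

# Mid-term statistics: mean and standard deviation of each feature
# over all short-term frames in the mid-term window.
columns = list(zip(*frames))
means = [statistics.fmean(col) for col in columns]
stds = [statistics.stdev(col) for col in columns]
mid_term_vector = means + stds  # 136-dimensional
```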

Table 1. List of audio features extracted in pyAudioAnalysis

Classifier Parameter Selection

In this project, the short-term window and step are both 0.005 s, and the mid-term window and step are 4 s and 1 s respectively. Two main parameters, the kernel and the regularization parameter C, were used to set up the SVM classifier. The kernel is linear by default, while C is also known as the penalty parameter. In general, the best value of C varies between classification problems, and a common way to find it is a grid search. Our first grid ranges from 0.01 to 20, for instance (0.01, 0.1, 0.5, 1, 2, 5, 10, 20), which covers the range typical of most problems. If C = 0.1 performs best in the cross-validation experiments, the second grid is a finer range from 0.01 to 0.5 that locates a more precise value of C.
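The coarse-then-fine search can be sketched as follows. The cross-validation score function below is a stand-in (in the project the scores come from pyAudioAnalysis's cross-validation), but the two-stage selection logic is the same:

```python
def best_c(candidates, cv_score):
    """Return the C with the highest cross-validation score."""
    return max(candidates, key=cv_score)

def fake_cv_score(c):
    # Stand-in for a real cross-validation score; peaks at C = 0.1.
    return 1.0 / (1.0 + abs(c - 0.1))

# Stage 1: coarse grid covering the range typical of most problems.
coarse = [0.01, 0.1, 0.5, 1, 2, 5, 10, 20]
c_coarse = best_c(coarse, fake_cv_score)

# Stage 2: finer grid around the coarse winner (0.01 to 0.5 here,
# matching the example in the text).
fine = [0.01, 0.05, 0.08, 0.1, 0.12, 0.15, 0.2, 0.3, 0.5]
c_fine = best_c(fine, fake_cv_score)
```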


Our trained classifier with parameter C = 0.1 achieved the best F1 score, 91.9%. This high (>90%) F1 score shows that it is possible to distinguish chanting from normal speech.
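For reference, the F1 score is the harmonic mean of precision and recall; a minimal sketch of how such a score is computed from prediction counts (the counts below are illustrative, not the project's):

```python
def f1_score(tp, fp, fn):
    """F1 = harmonic mean of precision and recall."""
    precision = tp / (tp + fp)   # fraction of predicted positives that are correct
    recall = tp / (tp + fn)      # fraction of actual positives that are found
    return 2 * precision * recall / (precision + recall)

# Illustrative counts only: 90 true positives, 10 false positives,
# 6 false negatives.
score = f1_score(tp=90, fp=10, fn=6)
```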

The model was then evaluated on the 84-minute recording of “Chinese Poetry Recitation 2022”; the prediction took 160 seconds. 94% of the chanting parts were successfully detected, and the overall accuracy was 79.2%.

Fig 2. Predicted result of the recording “Chinese Poetry Recitation” (「露港秋唱」)