Topic Retrieval

Text Preprocessing and Chunking

The foundation of our approach lies in the numerical representation of natural language text. Because computers operate on numbers, the textual data must be converted into a suitable numerical format before similarities can be computed effectively. In this study, we first preprocess the text by organizing it by volume and applying a series of data cleaning procedures to remove extraneous or unreadable characters.

Following the cleaning process, the text is segmented into smaller chunks. Unlike the segmentation discussed earlier, which targets human readability, this chunking is designed specifically for machine learning purposes. We divide the text into chunks of 300 words each, with an overlap of 100 words between consecutive chunks. The overlap helps preserve the broader context of a paragraph across chunk boundaries, so that the retrieval system can return more robust and relevant results for a query.
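A minimal Python sketch of this sliding-window chunking is shown below. The function name and the whitespace tokenization are illustrative assumptions rather than our exact implementation; in practice, a character- or word-segmentation step suited to classical Chinese would replace the simple split.

```python
def chunk_tokens(tokens, chunk_size=300, overlap=100):
    """Split a token sequence into overlapping chunks.

    Each chunk holds chunk_size tokens, and consecutive chunks share
    `overlap` tokens, so the window advances by chunk_size - overlap.
    """
    stride = chunk_size - overlap
    chunks = []
    start = 0
    while start < len(tokens):
        chunks.append(tokens[start:start + chunk_size])
        if start + chunk_size >= len(tokens):
            break  # the last chunk already reaches the end of the text
        start += stride
    return chunks


# Illustrative usage: whitespace tokenization stands in for whatever
# segmentation is actually applied to the cleaned text of one volume.
cleaned_text = "..."  # output of the cleaning step (placeholder)
chunks = [" ".join(c) for c in chunk_tokens(cleaned_text.split())]
```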

Embedding and Topic Retrieval

Once the chunks are generated, they are converted into numerical embeddings using a deep learning model. These embeddings allow similarity scores to be computed between the user query and each text chunk. The retrieval process identifies the three chunks most relevant to the user's topic query; the original text segments corresponding to these candidate chunks are then traced back to produce the final retrieval results.
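The sketch below illustrates this embedding and top-3 retrieval step, assuming a sentence-transformers style encoder. The specific model name is an assumption made for illustration and is not necessarily the model used in this study.

```python
from sentence_transformers import SentenceTransformer, util

# `chunks` is the list of chunk strings produced in the chunking step.
# The model name below is an assumption; any multilingual sentence
# encoder could be substituted.
model = SentenceTransformer("paraphrase-multilingual-MiniLM-L12-v2")
chunk_embeddings = model.encode(chunks, convert_to_tensor=True)

def retrieve_top_chunks(query, k=3):
    """Return (chunk index, cosine score) pairs for the k best-matching chunks."""
    query_embedding = model.encode(query, convert_to_tensor=True)
    scores = util.cos_sim(query_embedding, chunk_embeddings)[0]
    top = scores.topk(min(k, len(chunks)))
    return list(zip(top.indices.tolist(), top.values.tolist()))

# The returned indices are traced back to the original text segments
# to produce the final retrieval results.
candidates = retrieve_top_chunks("煉丹")  # hypothetical topic query
```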

Evaluation Procedure

To evaluate the effectiveness of the retrieval method, we leveraged a dataset of annotated quotes from 《呂祖全書》, comprising 25 quotes and a total of 53 distinct descriptions drawn from three academic studies. The evaluation process involved the following steps:

  1. Retrieval: For each topic query, the system retrieves the three most relevant text chunks.
  2. Segment Traceback: The retrieved chunks are then mapped back to the original text segments.
  3. Similarity Evaluation: The matching accuracy between these segments and the ground-truth descriptions is quantified using the Dynamic Time Warping (DTW) algorithm; punctuation differences are ignored so that only the textual content is compared (a sketch of this step follows the list).
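The following sketch shows one way to carry out this DTW-based matching at the character level, written from scratch rather than relying on a specific library; the punctuation set stripped before alignment is an illustrative assumption.

```python
import re

_PUNCT = re.compile(r"[\s，。、；：「」『』（）？！,.;:!?()\"']")

def strip_punctuation(text):
    """Remove whitespace and common ASCII/CJK punctuation before comparison."""
    return _PUNCT.sub("", text)

def dtw_distance(a, b):
    """Character-level DTW with a 0/1 local cost (match vs. mismatch)."""
    n, m = len(a), len(b)
    INF = float("inf")
    dp = [[INF] * (m + 1) for _ in range(n + 1)]
    dp[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = 0.0 if a[i - 1] == b[j - 1] else 1.0
            dp[i][j] = cost + min(dp[i - 1][j],      # stretch a
                                  dp[i][j - 1],      # stretch b
                                  dp[i - 1][j - 1])  # advance both
    return dp[n][m]

# Toy example: the two strings differ only in punctuation and a suffix.
d = dtw_distance(strip_punctuation("道生一，一生二。"),
                 strip_punctuation("道生一、一生二，二生三。"))
```

A lower DTW distance indicates a closer alignment between the retrieved segment and the ground-truth description.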

Evaluation Metrics and Observations

The evaluation is based on the inclusion ratio of the ground-truth descriptions within the retrieved segments (one possible way to compute this ratio is sketched after the list below). Out of the 53 segments evaluated:

  • 37 segments showed an exact inclusion of the ground truth.
  • More than 96% of the segments contained 80% or more of the ground truth text.
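For reference, the sketch below shows one possible way to compute such an inclusion ratio. The longest-common-subsequence definition is an assumption made for illustration, since the exact formula is not spelled out here; `strip_punctuation` is the helper defined in the DTW sketch above.

```python
def inclusion_ratio(segment, truth):
    """Fraction of the ground-truth description recovered by the segment.

    Assumed definition for illustration: length of the longest common
    subsequence (after punctuation stripping) divided by the length of
    the ground truth; the study's exact formula may differ.
    """
    a, b = strip_punctuation(segment), strip_punctuation(truth)
    n, m = len(a), len(b)
    dp = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            if a[i - 1] == b[j - 1]:
                dp[i][j] = dp[i - 1][j - 1] + 1
            else:
                dp[i][j] = max(dp[i - 1][j], dp[i][j - 1])
    return dp[n][m] / m if m else 1.0
```

Under this definition, a ratio of 1.0 corresponds to exact inclusion of the ground truth, and a ratio of at least 0.8 corresponds to the "80% or more" threshold reported above.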

The analysis revealed several noteworthy observations:

  • Although three relevant text segments are retrieved for each query, the evaluation considered only the top-ranked segment, which may understate the overall accuracy.
  • Longer descriptions tended to yield more accurate inclusion results, highlighting the benefit of extended context for precise matching.

Summary

Our retrieval strategy not only enables efficient cross-text search but also traces results back to detailed descriptions in the source text. By integrating text preprocessing, overlapping chunking, embedding-based retrieval, and a DTW-based evaluation, the proposed method improves the accuracy and robustness of topic retrieval in classical texts.