Text Segmentation

Dataset Preparation and Preprocessing

Our text segmentation process begins with the digital re-documentation of the source material. This involves:

  • Adding Punctuation: The original text is augmented with punctuation to mark sentence and paragraph boundaries.
  • Re-documenting: The text is re-documented into coherent sentences and paragraphs, which supports multiple forms of downstream analysis, such as passage retrieval.
  • Data Cleaning: Unwanted punctuation and extraneous symbols (e.g., “@”, “】”) are removed to ensure text consistency (see the sketch after this list).
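The cleaning step can be implemented as a simple character filter. The following is a minimal sketch: “@” and “】” come from the examples above, while the remaining blacklisted symbols are assumptions for illustration, not the project's actual list.

```python
import re

# Hypothetical blacklist: "@" and "】" come from the examples above;
# the remaining symbols are assumed for illustration.
UNWANTED = re.compile(r"[@【】〖〗◎*#]")

def clean_text(text: str) -> str:
    """Remove unwanted symbols and collapse redundant whitespace."""
    text = UNWANTED.sub("", text)
    return re.sub(r"[ \t]+", " ", text).strip()
```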

The initial segmentation, based on re-documented sentences, results in a dataset with the following statistics:

  • Average Word Count (per passage): Approximately 97.07 words.
  • Maximum Word Count (in a passage): 7,708 words.
  • Total Number of Passages: 3,615.

To further refine the text for machine learning, additional segmentation is performed with AI-assisted algorithms so that each paragraph contains fewer than 256 words. After another round of cleaning for any remaining unwanted characters, this supplemental process yields a revised dataset of 4,277 passages. A sketch of the length-capping step follows.
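The fragment below is a minimal, rule-based approximation of this step: it greedily packs sentences into chunks under the length limit. The actual pipeline uses an AI-driven strategy; the function name and sentence-boundary pattern here are assumptions for illustration.

```python
import re

def split_passage(passage: str, max_words: int = 256) -> list[str]:
    """Greedily pack sentences into chunks of fewer than max_words
    "words" (for Chinese text, characters stand in for words).
    A single sentence longer than the limit is kept intact here;
    the actual pipeline resolves such cases with an AI model."""
    sentences = [s for s in re.split(r"(?<=[。!?;])", passage) if s]
    chunks, current = [], ""
    for sent in sentences:
        if current and len(current) + len(sent) >= max_words:
            chunks.append(current)
            current = sent
        else:
            current += sent
    if current:
        chunks.append(current)
    return chunks
```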

Segmentation Method

The segmentation process is executed in multiple stages:

  1. Volume-Based Division:
    The text is first divided by volumes, demarcated by markers such as “卷二” (“Volume Two”), preserving the inherent structure of the historical document (see the sketch after this list).
  2. Algorithmic Segmentation:
    After the initial volume-based split, the text is further segmented using computational algorithms specifically designed to identify logical sentence or paragraph boundaries.
  3. AI-Based Refinement:
    An AI-driven segmentation strategy is applied to enforce upper limits on paragraph lengths (e.g., less than 256 words), ensuring consistency and optimal chunk sizes for subsequent processing.
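A minimal sketch of stage 1, the volume-based division, is shown below. It assumes volume headers are lines beginning with 卷 followed by Chinese numerals (e.g., 卷二, 卷十三); the marker set in the real source may be richer.

```python
import re

# Assumed header pattern: a line starting with 卷 plus Chinese numerals
# (e.g., 卷二, 卷十三); the marker set in the real source may be richer.
VOLUME_MARKER = re.compile(r"(?m)^(?=卷[一二三四五六七八九十百]+)")

def split_volumes(text: str) -> list[str]:
    """Split the full text into volumes at each 卷-header, keeping headers."""
    return [part for part in VOLUME_MARKER.split(text) if part.strip()]
```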

Evaluation of Text Segmentation

An evaluation dataset was constructed by collecting quotes on 《呂祖全書》 from three academic papers:

  • 《呂祖全書正宗》──清代北京覺源壇的歷史及其呂祖天仙派信仰
  • 清代四種《呂祖全書》與呂祖扶乩道壇的關係
  • 識見、修煉與降乩──從南宋到清中葉呂洞賓顯化度人的事蹟分析呂祖信仰的變化

From these sources, 25 quotes were collected and broken into 75 ground-truth samples for evaluation. Key statistical outcomes include:

  • Total Evaluation Segments: 75
  • Exact Inclusions: 25 segments
  • Accuracy (at least 80% inclusion): Approximately 70.67% (53 of 75 segments)

It is noted that errors in the source text itself can also reduce the measured accuracy. A sketch of the inclusion metric follows.
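The accuracy figure above can be approximated with a simple inclusion metric. The sketch below assumes that “inclusion” means the fraction of a ground-truth sample's characters covered by the longest contiguous match in its best-matching passage; the exact criterion used in the evaluation may differ.

```python
def longest_common_substring(a: str, b: str) -> int:
    """Length of the longest contiguous substring shared by a and b."""
    best = 0
    prev = [0] * (len(b) + 1)
    for ca in a:
        curr = [0] * (len(b) + 1)
        for j, cb in enumerate(b, 1):
            if ca == cb:
                curr[j] = prev[j - 1] + 1
                best = max(best, curr[j])
        prev = curr
    return best

def inclusion_ratio(sample: str, passages: list[str]) -> float:
    """Fraction of the sample covered by its best-matching passage."""
    return max(longest_common_substring(sample, p) for p in passages) / len(sample)

def accuracy(samples: list[str], passages: list[str], threshold: float = 0.8) -> float:
    """Share of samples whose inclusion ratio meets the threshold."""
    hits = sum(inclusion_ratio(s, passages) >= threshold for s in samples)
    return hits / len(samples)
```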

Alignment Algorithm Using Dynamic Time Warping (DTW)

To assess the alignment between the segmented text and the ground truth, we applied the Dynamic Time Warping (DTW) algorithm. This method involves:

  • Cost Calculation:
    The DTW algorithm computes the alignment cost between the generated segment and the ground truth text. The cost matrix is defined as follows:
    • Skipped Words: Each skipped word is assigned a cost of 1, so skipping k words costs k.
    • Incorrect Match: Assigned a cost of 2.
    • Correct Match: Assigned a cost of 0.
  • Accuracy Determination:
    A lower cost indicates higher similarity between the segmented text and the ground truth, while a larger cost correlates with reduced alignment accuracy. A sketch of this cost computation follows the list.
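Below is a minimal sketch of the cost computation described above, implemented as a dynamic-programming alignment over characters (for Chinese text, characters stand in for words). The function name alignment_cost and the character-level granularity are assumptions for illustration, not the project's actual implementation.

```python
def alignment_cost(segment: str, truth: str) -> int:
    """Alignment cost under the scheme above: each skipped character
    costs 1 (so skipping k costs k), an incorrect match costs 2, and
    a correct match costs 0."""
    m, n = len(segment), len(truth)
    # dp[i][j]: minimal cost of aligning segment[:i] with truth[:j]
    dp = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        dp[i][0] = i                      # skip all of segment[:i]
    for j in range(1, n + 1):
        dp[0][j] = j                      # skip all of truth[:j]
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            step = 0 if segment[i - 1] == truth[j - 1] else 2
            dp[i][j] = min(
                dp[i - 1][j - 1] + step,  # correct / incorrect match
                dp[i - 1][j] + 1,         # skip a segment character
                dp[i][j - 1] + 1,         # skip a truth character
            )
    return dp[m][n]
```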

Example:
Ground Truth:

今古仙佛,哀愍眾生,已曾宜說無限妙法,欲冀眾生,聞法受持,免墮輪迴

Our Segment:

今古  哀愍眾生,已曾宜說無數妙法,欲冀眾生  聞法受持,免墮輪迴

Cost = 4

In this example, the cost of 4 reflects the discrepancies: the two skipped characters “仙佛” contribute a cost of 2, and the incorrect match of “數” against “限” contributes a cost of 2. A usage sketch reproducing this value follows.
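Under the alignment_cost sketch above, and assuming punctuation is stripped before alignment, this example reproduces the reported cost:

```python
import re

def strip_punct(s: str) -> str:
    # Remove commas, periods, and whitespace before alignment (assumed preprocessing).
    return re.sub(r"[,,。、\s]+", "", s)

truth = "今古仙佛,哀愍眾生,已曾宜說無限妙法,欲冀眾生,聞法受持,免墮輪迴"
segment = "今古 哀愍眾生,已曾宜說無數妙法,欲冀眾生 聞法受持,免墮輪迴"

# Two skipped characters (仙佛) cost 1 each; one incorrect match (限 vs 數) costs 2.
print(alignment_cost(strip_punct(segment), strip_punct(truth)))  # -> 4
```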

Summary

By leveraging robust text segmentation methodologies, careful data cleaning, and alignment algorithms such as DTW, our approach improves the precision of retrieving and evaluating historical passages. The dual-stage segmentation preserves both natural linguistic boundaries and machine-readable consistency, providing a firm foundation for detailed textual analysis and downstream digital humanities research.