Merging OCR Results
We need to merge the text extracted by the two OCR engines. The merging is carried out manually because there are only 112 formulae in Shang-Han-Lun. In Figure 1, the passages on the left both refer to Ma-Huang-Tang-Fang (麻黃湯方). Because it has been manually proofread, the ASCDC OCR result usually has higher character accuracy. The GJ.cool OCR result, on the other hand, is more precise in detecting line breaks. In Figure 1, the text highlighted in blue should be recognized as 「甘草\n一兩炙\n味甘平」, where "\n" represents a line break. The ASCDC OCR engine recognizes all of the characters correctly, while GJ.cool's engine recognizes the line breaks correctly.
ASCDC OCR Result 麻黃湯方 麻黃三兩去節味甘溫桂枝二兩去皮味辛熱甘草一兩炙味甘平 杏仁七十個湯泡去皮尖味辛溫 右四味以水九升先煮麻黃減二升去上沬內諸藥煮取
GJ.cool OCR Result 麻黃湯方開 麻黃 三兩去節 味甘溫桂枝 二兩去皮 味辛熱 甘草 兩炙 味甘平 杏仁 七十個湯泡去 皮尖味辛溫 右四味以水九升先煮麻黃減二升去上沫内諸藥煮取
Merged Result 麻黃湯方 麻黃 三兩 去節 味甘溫 桂枝 二兩 去皮 味辛熱 甘草 一兩 炙 味甘平 杏仁 七十個 湯泡去皮尖 味辛溫 ////// 右四味以水九升先煮麻黃減二升去上沬內諸藥煮取
Figure 1: Merging OCR
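Although the merge described here is done by hand, the idea of taking the characters from one result and the line breaks from the other can be sketched in code. The following is only an illustration under stated assumptions: the function name `transfer_breaks` is hypothetical, the separator is assumed to be a single character, and each break is anchored to the nearest matched character before it, which may misplace breaks when the two texts differ heavily.

```python
from difflib import SequenceMatcher


def transfer_breaks(accurate: str, segmented: str, sep: str = "\n") -> str:
    """Copy the separator positions of `segmented` onto the characters of `accurate`.

    `accurate` plays the role of the proofread text (no line breaks);
    `segmented` plays the role of the text with reliable line breaks.
    `sep` is assumed to be a single character.
    """
    plain = segmented.replace(sep, "")

    # Map every matched character index in `plain` to its index in `accurate`.
    matcher = SequenceMatcher(None, plain, accurate, autojunk=False)
    to_accurate = {}
    for a, b, size in matcher.get_matching_blocks():
        for k in range(size):
            to_accurate[a + k] = b + k

    # For each break, find the nearest matched character before it and
    # record the offset in `accurate` that directly follows that character.
    offsets = set()
    seen = 0  # number of non-separator characters read so far
    for ch in segmented:
        if ch != sep:
            seen += 1
            continue
        prev = seen - 1
        while prev >= 0 and prev not in to_accurate:
            prev -= 1
        if prev >= 0:
            offsets.add(to_accurate[prev] + 1)

    # Rebuild the accurate text with the transferred breaks.
    out = []
    for i, ch in enumerate(accurate):
        if i in offsets:
            out.append(sep)
        out.append(ch)
    return "".join(out)


# Fragments taken from Figure 1: the first has every character,
# the second has the line breaks but is missing one character.
merged = transfer_breaks("甘草一兩炙味甘平", "甘草\n兩炙\n味甘平")
print(merged.split("\n"))
```

For the Figure 1 fragment this yields the three lines 甘草, 一兩炙 and 味甘平. A result produced this way would still need to be checked by hand.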
Chinese Text Project Diff Tool
The manual comparison is done with the Chinese Text Project Diff Tool [1], which identifies and highlights the differences between two passages. The tool greatly speeds up the process.
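To illustrate the kind of comparison such a diff tool performs, here is a minimal sketch using Python's standard `difflib` module. It is not the Chinese Text Project implementation; the helper name `list_differences` is hypothetical, and the two input strings are the closing lines of the ASCDC and GJ.cool results in Figure 1.

```python
from difflib import SequenceMatcher


def list_differences(a: str, b: str) -> list[tuple[str, str, str]]:
    """Return (operation, text_in_a, text_in_b) for every span where a and b differ."""
    matcher = SequenceMatcher(None, a, b, autojunk=False)
    return [
        (tag, a[i1:i2], b[j1:j2])
        for tag, i1, i2, j1, j2 in matcher.get_opcodes()
        if tag != "equal"  # keep only replacements, insertions, deletions
    ]


ascdc = "右四味以水九升先煮麻黃減二升去上沬內諸藥煮取"
gjcool = "右四味以水九升先煮麻黃減二升去上沫内諸藥煮取"

for tag, left, right in list_differences(ascdc, gjcool):
    print(tag, left, right)
```

On these two lines the only reported difference is the adjacent pair of variant characters (沬內 versus 沫内), which is the sort of discrepancy a proofreader then resolves by hand.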
Formatting Data
While merging the OCR results, we also format each formula as the sequence {formula name, herb name, dosage, preparation method, property}. This makes the later formula modelling easier.
Original Text | Formatted Text |
---|---|
甘草 一兩炙 味甘平 | 甘草 一兩 炙 味甘平 |
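The splitting shown in the table can be sketched with a regular expression. This is a minimal illustration, not the procedure used here, which is manual. It assumes that a dosage is written as Chinese numerals followed by a single measure word; the list of measure words and the name `format_entry` are hypothetical and would need to be extended for the full text.

```python
import re

# Assumption: a dosage is one or more Chinese numerals plus one measure word.
# The measure words listed here are illustrative, not exhaustive.
DOSAGE = re.compile(r"^([一二三四五六七八九十百半]+[兩個升斤枚])(.*)$")


def format_entry(entry: str) -> str:
    """Split a fused 'dosage + preparation' token into two separate fields."""
    fields = []
    for token in entry.split():
        match = DOSAGE.match(token)
        if match and match.group(2):
            # Dosage followed by a preparation method: emit them separately.
            fields.extend([match.group(1), match.group(2)])
        else:
            # Herb name, property, or a dosage with nothing attached.
            fields.append(token)
    return " ".join(fields)


# Entries taken from the GJ.cool result in Figure 1.
print(format_entry("麻黃 三兩去節 味甘溫"))
print(format_entry("桂枝 二兩去皮 味辛熱"))
```

Each entry comes out with the dosage (三兩, 二兩) and the preparation method (去節, 去皮) as separate fields, matching the layout of the merged result in Figure 1.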
[1] Chinese Text Project Diff Tool, Chinese Text Project, https://ctext.org/plugins/texttools/#diff