Merging OCR Result

We need to merge the text extracted by the two OCR sources. The merging is carried out manually, which is feasible because there are only 112 formulae in Shang-Han-Lun. In Figure 1, the two passages on the left both correspond to Ma-Huang-Tang-Fang (麻黃湯方). Because it has been manually proofread, the ASCDC OCR result usually has higher character accuracy. The GJ.cool OCR result, on the other hand, is more precise in detecting linebreaks. In Figure 1, the text highlighted in blue should be recognized as 「甘草\n一兩炙\n味甘平」, where "\n" represents a linebreak. The ASCDC OCR engine recognizes all of the characters correctly, while the GJ.cool engine recognizes the linebreaks correctly.

ASCDC OCR Result

麻黃湯方
麻黃三兩去節味甘溫桂枝二兩去皮味辛熱甘草一兩炙味甘平
杏仁七十個湯泡去皮尖味辛溫
右四味以水九升先煮麻黃減二升去上沬內諸藥煮取

GJ.cool OCR Result

麻黃湯方開
麻黃
三兩去節
味甘溫桂枝
二兩去皮
味辛熱
甘草
兩炙
味甘平
杏仁
七十個湯泡去
皮尖味辛溫
右四味以水九升先煮麻黃減二升去上沫内諸藥煮取

Merged Result

麻黃湯方
麻黃
三兩
去節
味甘溫
桂枝
二兩
去皮
味辛熱
甘草
一兩
炙
味甘平
杏仁
七十個
湯泡去皮尖
味辛溫
//////
右四味以水九升先煮麻黃減二升去上沬內諸藥煮取

Figure 1: Merging OCR

Figure 2: Original text of 「甘草 一兩炙 味甘平」
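The merge described in this section was done by hand. Purely as an illustration of the idea (characters from the ASCDC result, linebreaks from the GJ.cool result), the following Python sketch aligns the two strings with the standard-library `difflib.SequenceMatcher` and carries the linebreaks over. The function name `transfer_linebreaks` is hypothetical, and the rule that an unmatched character joins the line that follows it is an assumption made for this example, not a rule stated in the text.

```python
from difflib import SequenceMatcher


def transfer_linebreaks(accurate: str, segmented: str) -> str:
    """Keep the characters of `accurate`, insert the linebreaks of `segmented`.

    Hypothetical helper: `accurate` plays the role of the ASCDC text,
    `segmented` the role of the GJ.cool text.
    """
    flat = segmented.replace("\n", "")

    # Boundary positions in `flat` at which `segmented` has a linebreak.
    breaks = []
    seen = 0
    for ch in segmented:
        if ch == "\n":
            breaks.append(seen)
        else:
            seen += 1

    # boundary[j] = boundary in `accurate` that corresponds to boundary j in `flat`.
    # The first assignment wins, so an unmatched character in `accurate`
    # ends up on the line that follows the break (an assumption of this sketch).
    boundary = [None] * (len(flat) + 1)
    matcher = SequenceMatcher(None, accurate, flat, autojunk=False)
    for _tag, i1, i2, j1, j2 in matcher.get_opcodes():
        for k in range(j2 - j1 + 1):
            if boundary[j1 + k] is None:
                boundary[j1 + k] = i1 + min(k, i2 - i1)

    # Cut `accurate` at the mapped boundaries.
    pieces, start = [], 0
    for cut in sorted({boundary[j] for j in breaks}):
        pieces.append(accurate[start:cut])
        start = cut
    pieces.append(accurate[start:])
    return "\n".join(pieces)


# The 甘草 passage from Figure 1: ASCDC has every character, GJ.cool has the breaks.
ascdc = "甘草一兩炙味甘平"
gjcool = "甘草\n兩炙\n味甘平"
print(transfer_linebreaks(ascdc, gjcool).split("\n"))
# ['甘草', '一兩炙', '味甘平']
```

A heuristic of this kind can misplace a character when the two results disagree right at a linebreak, which is one reason manual checking remains reasonable for a corpus of this size.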

Chinese Text Project Diff Tool

The manual comparison is done with the Chinese Text Project Diff Tool [1], which identifies and highlights the differences between two passages. The tool greatly speeds up the process.

Figure 3: Result of Chinese Text Project Diff Tool
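For readers who want to reproduce this kind of comparison offline, a minimal sketch with Python's standard-library `difflib` is given here. It is not the Chinese Text Project tool itself, and the helper name `char_diff` is hypothetical. It lists the character spans in which the last line of the two OCR results in Figure 1 differ.

```python
from difflib import SequenceMatcher


def char_diff(a: str, b: str):
    """Return (operation, span in a, span in b) for every non-matching span."""
    matcher = SequenceMatcher(None, a, b, autojunk=False)
    return [
        (tag, a[i1:i2], b[j1:j2])
        for tag, i1, i2, j1, j2 in matcher.get_opcodes()
        if tag != "equal"
    ]


# Last line of the ASCDC and GJ.cool results shown in Figure 1.
ascdc = "右四味以水九升先煮麻黃減二升去上沬內諸藥煮取"
gjcool = "右四味以水九升先煮麻黃減二升去上沫内諸藥煮取"

print(char_diff(ascdc, gjcool))
# [('replace', '沬內', '沫内')]
```

The two results differ only in visually similar character variants, which are easy to overlook when the passages are compared by eye.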

Formatting Data

When we merged the OCR results, we formatted each formula in the sequence {formula name, herb name, dosage, preparation method, property}, with one item per line. This layout makes the later formula modelling easier.

Original Text

甘草
一兩炙
味甘平

Formatted Text

甘草
一兩
炙
味甘平

Table 1: Differences between the original text and the formatted text
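To show how the formatted text can feed the later modelling, the following Python sketch parses a merged formula into records. It assumes the layout of the Merged Result in Figure 1: the formula name on the first line, then exactly four lines per herb (name, dosage, preparation method, property), then a `//////` line before the decoction instructions. The class and function names are hypothetical, and formulae in which a herb lacks one of the four fields would need extra handling.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Herb:
    name: str
    dosage: str
    preparation: str
    herb_property: str


@dataclass
class Formula:
    name: str
    herbs: List[Herb]
    instructions: str


def parse_formula(text: str, separator: str = "//////") -> Formula:
    """Parse one merged formula; assumes four lines per herb (see lead-in)."""
    head, _, tail = text.partition(separator)
    lines = [ln.strip() for ln in head.splitlines() if ln.strip()]
    if not lines:
        raise ValueError("empty formula text")
    name, fields = lines[0], lines[1:]
    if len(fields) % 4 != 0:
        raise ValueError("expected four lines per herb")
    herbs = [Herb(*fields[i:i + 4]) for i in range(0, len(fields), 4)]
    return Formula(name, herbs, tail.strip())


# Merged Result of Ma-Huang-Tang-Fang, copied from Figure 1.
merged = """麻黃湯方
麻黃
三兩
去節
味甘溫
桂枝
二兩
去皮
味辛熱
甘草
一兩
炙
味甘平
杏仁
七十個
湯泡去皮尖
味辛溫
//////
右四味以水九升先煮麻黃減二升去上沬內諸藥煮取
"""

formula = parse_formula(merged)
for herb in formula.herbs:
    print(herb.name, herb.dosage, herb.preparation, herb.herb_property)
```

Keeping one field per line means the parser needs no knowledge of Chinese word boundaries, which is the main benefit of the formatting step.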
  1. Chinese Text Project Diff Tool, Chinese Text Project, https://ctext.org/plugins/texttools/#diff