Formulae Separation
In this stage, we will extract the formulae from the OCR result. With the identified terminal keywords, the extraction can be done with a linear scan. This methodology can be applied to the OCR result from both ASCDC and Gujicool. The algorithm is described in the following pseudocode.
# Pseudocode of formulae extraction
formulae_collection = empty_list
current_formula = empty_list
recording = False
for each page in ocr_result:
for each line in page:
if line.startswith(any of the starting keywords):
mark start of the formula
set recording to True
record current line in current_formula
skip to next line
if line.contain(any of the ending keywords):
mark end of the formula
set recording to False
record current line in current_formula
add current_formula to formulae_collection
clear current_formula
skip to next line
if recording is True:
add current line to current_formula
By observing the patterns in this edition of Shang-Han-Lun, we have identified the terminal keywords. The starting keyword is the formula name while the ending keywords are 「右、上、於」. The ending keywords usually describe the number of herbs used in the formula or the derivation from the parent formula. The detailed process will be explained in the following passage. The blue text shown below is our target formula, the Ma-Huang-Tang-Fang (麻黃湯方). The formula starts with its name and ends with 「右」. The phrase 右四味 means that 4 herbs are used in this formula.
《仲景全書》卷二 頁10 太陽病頭痛發熱身疼腰痛骨節疼痛惡風無汗而喘者 麻黃湯主之 此太陽傷寒也寒則傷榮頭痛身疼腰痛以至牽連骨 節疼痛者太陽經榮血不利也內經曰風寒客于人使 人毫毛畢直皮膚閉而為熱者寒在表也風并於衛衛 實而榮虛者自汗出而惡風寒也寒并於榮榮實而衛 虛者無汗而惡風也以榮強衛弱故氣逆而喘與麻黃 湯以發其汗 麻黃湯方 麻黃三兩去節味甘溫桂枝二兩去皮味辛熱甘草一兩炙味甘平 杏仁七十個湯泡去皮尖味辛溫
《仲景全書》卷二 頁11 右四味以水九升先煮麻黃減二升去上沬內諸藥煮取 半升令去滓溫服八合覆取微似汗不須啜粥餘如桂枝 法將息 內經曰寒淫於內治以甘熱佐以苦辛麻黃甘草開肌 發汗桂枝杏仁散寒下氣 太陽與陽明合病喘而胸滿者不可下宜麻黃湯主之 陽受氣於胸中喘而胸滿者陽氣不宣發壅而逆也心 下滿腹滿皆為實當下之此以為胸滿非裏實故不可 下雖有陽明然與太陽合病為屬表是與麻黃湯發汗 太陽病十日以去脈浮細而嗜臥者外已解也設胸滿脇 痛者與小柴胡湯脈但浮者與麻黃湯一
Table 1 gives more examples of the ending keywords. The keyword 「於」indicates that the current formula is derived from another formula. In this example, Gui-Zhi-Jia-Fu-Ji-Tang-Fang (桂枝加附子湯方) is derived from Gui-Zhi-Tang-Fang (桂枝湯方) by adding 附子.
Ending keyword | Usage | Formula extracted |
---|---|---|
右 | Indicate the number of herbs used | 麻黃湯方 麻黃三兩去節味甘溫桂枝二兩去皮味辛熱甘草一兩炙味甘平 杏仁七十個湯泡去皮尖味辛溫 右四味以水九升先煮麻黃減二升去上沬內諸藥煮取 |
上 | Indicate the number of herbs used | 芍藥甘草附子湯方 芍藥三兩味酸微寒甘草三兩炙味甘平 附子一枚炮去皮破八片味辛熱 巳上三味以水五升煮取一升五合去滓分溫服疑非仲 |
於 | Describe the derivation from the parent formula (桂枝湯方) | 桂枝加附子湯方 於桂枝湯方內加附子一枚炮去皮破八片餘依前法 |
Variant Chinese Characters
To prevent the variant Chinese characters (異體字) from causing a mismatch of keywords, we convert the following variant Chinese characters to a single form. For example, the Chinese formal numerals 「壹貳叄肆伍陸柒捌玖拾」are converted to the regular numerals 「一二三四五六七八九十」.
Python Code for Formulae Separation
It is highly recommended to view the code in a code editor.
First, prepare the data used in the separation. Initialize the dataframe columns if necessary.
import pandas as pd
ascdc_linebreak_targets = ["../ascdc_ocr_result/v1/v1_linebreak_targetText.txt",
"../ascdc_ocr_result/v2/v2_linebreak_targetText.txt",
"../ascdc_ocr_result/v3/v3_linebreak_targetText.txt"]
df = pd.read_csv("../shanghanlun_index_linebreak.csv")
# df['StartPage'] = np.nan
# df['EndPage'] = np.nan
# df['AlternativeName'] = df['AlternativeName'].fillna('-')
# if 'StartIndex' in df:
# del df['StartIndex']
# if 'EndIndex' in df:
# del df['EndIndex']
# df['Text'] = ''
Then, start the formulae separation.
formulaIndex = 0
for target in ascdc_linebreak_target:
with open(target, 'r') as f:
currentPageNumber = ''
volumeNumber = ''
contents = f.readlines()
# pageCount == -1 indicates that no formula is recording
pageCount = -1
for lIndex, line in enumerate(contents):
if formulaIndex > 110:
break
# skip empty line
if line == '\n':
continue
# detecting ======== 004971387_0XXX.jpg ========\n
if '=' in line:
# ["======== 004971387_0XXX", "jpg ========\n"]
lineSegments = line.split('.')
# "004971387_0XXX"
pageReferenceNumber = lineSegments[0].split(' ')[-1]
# [004971387", "0XXX"]
volumeNumber, currentPageNumber = [int(x) for x in pageReferenceNumber.split('_')]
# pageCount == -1 indicates that no formula is recording, no need to update pageCount
if pageCount != -1:
pageCount += 1
df.at[formulaIndex, 'Text'] += "//////\n"
# Each formula has limit of 2 pages
if pageCount == 2:
# last formula: 0th page and 1st page(last page)
df.at[formulaIndex, 'EndPage'] = currentPageNumber - 1
formulaIndex += 1
pageCount = -1
continue
# scan for formula name
if pageCount == -1:
altName = df.at[formulaIndex, 'AlternativeName']
if line.startswith(df.at[formulaIndex, 'Name']) or (altName != '-' and line.startswith(altName)):
pageCount = 0
df.at[formulaIndex, 'StartPage'] = currentPageNumber
df.at[formulaIndex, 'VolumeID'] = volumeNumber
# scan for ending keyword
if pageCount != -1:
indicating_herbCount = line.startswith('右') or '上' in line[:5]
indicating_dependency = line.startswith('於')
if indicating_herbCount or indicating_dependency:
df.at[formulaIndex, 'Text'] += line
if lIndex + 1 < len(contents) and indicating_dependency and '依' not in line and '法' not in line:
df.at[formulaIndex, 'Text'] += contents[lIndex + 1]
# mark the end for current formula
df.at[formulaIndex, 'EndPage'] = currentPageNumber
formulaIndex += 1
pageCount = -1
continue
if formulaIndex < 110:
altName = df.at[formulaIndex + 1, 'AlternativeName']
# triggered when the formula text does not contain the ending keywod
# and the scanner meet a new formula when the current formula has not ended
if line.startswith(df.at[formulaIndex + 1, 'Name']) or (altName != '-' and line.startswith(altName)):
print(f'triggered at {formulaIndex} {df.at[formulaIndex + 1, "Name"]}')
df.at[formulaIndex, 'EndPage'] = currentPageNumber
pageCount = 0
formulaIndex += 1
df.at[formulaIndex, 'StartPage'] = currentPageNumber
df.at[formulaIndex, 'VolumeID'] = volumeNumber
if 0 <= pageCount <= 1:
df.at[formulaIndex, 'Text'] += line