Formulae Separation

Formulae Separation

In this stage, we will extract the formulae from the OCR result. With the identified terminal keywords, the extraction can be done with a linear scan. This methodology can be applied to the OCR result from both ASCDC and Gujicool. The algorithm is described in the following pseudocode.

# Pseudocode of formulae extraction

formulae_collection = empty_list
current_formula = empty_list
recording = False

for each page in ocr_result:
    for each line in page:
        if line.startswith(any of the starting keywords):
            mark start of the formula
            set recording to True
            record current line in current_formula
            skip to next line
            
        if line.contain(any of the ending keywords):
            mark end of the formula
            set recording to False
            record current line in current_formula
            add current_formula to formulae_collection
            clear current_formula
            skip to next line

        if recording is True:
            add current line to current_formula

By observing the patterns in this edition of Shang-Han-Lun, we have identified the terminal keywords. The starting keyword is the formula name while the ending keywords are 「右、上、於」. The ending keywords usually describe the number of herbs used in the formula or the derivation from the parent formula. The detailed process will be explained in the following passage. The blue text shown below is our target formula, the Ma-Huang-Tang-Fang (麻黃湯方). The formula starts with its name and ends with 「右」. The phrase 右四味 means that 4 herbs are used in this formula.

《仲景全書》卷二 頁10

太陽病頭痛發熱身疼腰痛骨節疼痛惡風無汗而喘者
麻黃湯主之
此太陽傷寒也寒則傷榮頭痛身疼腰痛以至牽連骨
節疼痛者太陽經榮血不利也內經曰風寒客于人使
人毫毛畢直皮膚閉而為熱者寒在表也風并於衛衛
實而榮虛者自汗出而惡風寒也寒并於榮榮實而衛
虛者無汗而惡風也以榮強衛弱故氣逆而喘與麻黃
湯以發其汗
麻黃湯方
麻黃三兩去節味甘溫桂枝二兩去皮味辛熱甘草一兩炙味甘平
杏仁七十個湯泡去皮尖味辛溫
《仲景全書》卷二 頁11

右四味以水九升先煮麻黃減二升去上沬內諸藥煮取
半升令去滓溫服八合覆取微似汗不須啜粥餘如桂枝
法將息
內經曰寒淫於內治以甘熱佐以苦辛麻黃甘草開肌
發汗桂枝杏仁散寒下氣
太陽與陽明合病喘而胸滿者不可下宜麻黃湯主之
陽受氣於胸中喘而胸滿者陽氣不宣發壅而逆也心
下滿腹滿皆為實當下之此以為胸滿非裏實故不可
下雖有陽明然與太陽合病為屬表是與麻黃湯發汗
太陽病十日以去脈浮細而嗜臥者外已解也設胸滿脇
痛者與小柴胡湯脈但浮者與麻黃湯一

Table 1 gives more examples of the ending keywords. The keyword 「於」indicates that the current formula is derived from another formula. In this example, Gui-Zhi-Jia-Fu-Ji-Tang-Fang (桂枝加附子湯方) is derived from Gui-Zhi-Tang-Fang (桂枝湯方) by adding 附子.

Ending keywordUsageFormula extracted
Indicate the number of herbs used麻黃湯方
麻黃三兩去節味甘溫桂枝二兩去皮味辛熱甘草一兩炙味甘平
杏仁七十個湯泡去皮尖味辛溫
四味以水九升先煮麻黃減二升去上沬內諸藥煮取
Indicate the number of herbs used芍藥甘草附子湯方
芍藥三兩味酸微寒甘草三兩炙味甘平
附子一枚炮去皮破八片味辛熱
三味以水五升煮取一升五合去滓分溫服疑非仲
Describe the derivation from the parent formula (桂枝湯方)桂枝加附子湯方
桂枝湯方內加附子一枚炮去皮破八片餘依前法
Table 1: List of ending keywords and their usage

Variant Chinese Characters

To prevent the variant Chinese characters (異體字) from causing a mismatch of keywords, we convert the following variant Chinese characters to a single form. For example, the Chinese formal numerals 「壹貳叄肆伍陸柒捌玖拾」are converted to the regular numerals 「一二三四五六七八九十」.

Figure 1: List of variant Chinese characters

Python Code for Formulae Separation

It is highly recommended to view the code in a code editor.

First, prepare the data used in the separation. Initialize the dataframe columns if necessary.

import pandas as pd

ascdc_linebreak_targets = ["../ascdc_ocr_result/v1/v1_linebreak_targetText.txt",
                          "../ascdc_ocr_result/v2/v2_linebreak_targetText.txt",
                          "../ascdc_ocr_result/v3/v3_linebreak_targetText.txt"]

df = pd.read_csv("../shanghanlun_index_linebreak.csv")

# df['StartPage'] = np.nan
# df['EndPage'] = np.nan
# df['AlternativeName'] = df['AlternativeName'].fillna('-')
# if 'StartIndex' in df:
#     del df['StartIndex']
# if 'EndIndex' in df:
#     del df['EndIndex']
# df['Text'] = ''

Then, start the formulae separation.

formulaIndex = 0

for target in ascdc_linebreak_target:
    with open(target, 'r') as f:
        currentPageNumber = ''
        volumeNumber = ''

        contents = f.readlines()
        # pageCount == -1 indicates that no formula is recording
        pageCount = -1

        for lIndex, line in enumerate(contents):
            if formulaIndex > 110:
                break
                
            # skip empty line
            if line == '\n':
                continue

            # detecting ======== 004971387_0XXX.jpg ========\n
            if '=' in line:
                # ["======== 004971387_0XXX", "jpg ========\n"]
                lineSegments = line.split('.')
                # "004971387_0XXX"
                pageReferenceNumber = lineSegments[0].split(' ')[-1]
                # [004971387", "0XXX"]
                volumeNumber, currentPageNumber = [int(x) for x in pageReferenceNumber.split('_')]
                # pageCount == -1 indicates that no formula is recording, no need to update pageCount
                if pageCount != -1:
                    pageCount += 1
                    df.at[formulaIndex, 'Text'] += "//////\n"
                # Each formula has limit of 2 pages
                if pageCount == 2:
                    # last formula: 0th page and 1st page(last page)
                    df.at[formulaIndex, 'EndPage'] = currentPageNumber - 1 
                    formulaIndex += 1
                    pageCount = -1
                continue

            # scan for formula name
            if pageCount == -1:
                altName = df.at[formulaIndex, 'AlternativeName']
                if line.startswith(df.at[formulaIndex, 'Name']) or (altName != '-' and line.startswith(altName)):
                    pageCount = 0
                    df.at[formulaIndex, 'StartPage'] = currentPageNumber
                    df.at[formulaIndex, 'VolumeID'] = volumeNumber

            # scan for ending keyword
            if pageCount != -1:
                indicating_herbCount = line.startswith('右') or '上' in line[:5]
                indicating_dependency = line.startswith('於')
                
                if indicating_herbCount or indicating_dependency:
                    df.at[formulaIndex, 'Text'] += line
    
                    if lIndex + 1 < len(contents) and indicating_dependency and '依' not in line and '法' not in line:
                        df.at[formulaIndex, 'Text'] += contents[lIndex + 1]
    
                    # mark the end for current formula
                    df.at[formulaIndex, 'EndPage'] = currentPageNumber
                    formulaIndex += 1
                    pageCount = -1
                    continue

            if formulaIndex < 110:
                altName = df.at[formulaIndex + 1, 'AlternativeName']

                # triggered when the formula text does not contain the ending keywod
                # and the scanner meet a new formula when the current formula has not ended
                if line.startswith(df.at[formulaIndex + 1, 'Name']) or (altName != '-' and line.startswith(altName)):
                    print(f'triggered at {formulaIndex} {df.at[formulaIndex + 1, "Name"]}')
                    df.at[formulaIndex, 'EndPage'] = currentPageNumber
                    pageCount = 0
                    formulaIndex += 1
                    df.at[formulaIndex, 'StartPage'] = currentPageNumber
                    df.at[formulaIndex, 'VolumeID'] = volumeNumber

            if 0 <= pageCount <= 1:
                df.at[formulaIndex, 'Text'] += line