Formulae Modelling

Overview

There are two types of artifacts in formulae modelling: the metadata of a formula and the structured representations of its content. Table 1 shows the metadata and Table 2 shows the three structured representations.

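| Attribute | Value |
| --- | --- |
| Name | 麻黃湯方 |
| VolumeID | 4971388 |
| Subchapter | 辨太陽病脈證并治中第六 |
| StartPage | 10 |
| EndPage | 11 |
| Text | 麻黃湯方 麻黃三兩 去節 味甘溫 桂枝二兩 去皮 味辛熱 … |
| HerbCount | 4 |
| Parent | None |

Table 1: Metadata of Ma-Huang-Tang-Fang (麻黃湯方)

Chinese Quantity

| | 甘草 | 桂枝 | 麻黃 | 杏仁 | 芍藥 |
| --- | --- | --- | --- | --- | --- |
| 麻黃湯方 | 一兩 | 二兩 | 三兩 | 七十個 | |

Quantity in Gram

| | 甘草 | 桂枝 | 麻黃 | 杏仁 | 芍藥 |
| --- | --- | --- | --- | --- | --- |
| 麻黃湯方 | 15.625 | 31.25 | 46.875 | 28 | 0 |

One-Hot Encoded

| | 甘草 | 桂枝 | 麻黃 | 杏仁 | 芍藥 |
| --- | --- | --- | --- | --- | --- |
| 麻黃湯方 | 1 | 1 | 1 | 1 | 0 |

Table 2: Structured representations of Ma-Huang-Tang-Fang (麻黃湯方)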

Extracting Metadata

The Name, VolumeID and Subchapter attributes are entered manually according to the content page (Figure 1) of Zhongjing quan shu. StartPage, EndPage and Text are extracted in the Formulae Separation phase. The Parent and HerbCount attributes are extracted in this phase.

Parent

As mentioned in the Formulae Separation phase, the keyword 「於」 indicates that the current formula is derived from another formula.

Code for extracting parent
# Aim: among the formulae that do not clearly state the number of herbs used,
# identify those that depend on another formula

import pandas as pd

merge_result = pd.read_excel("../shanghanlun_mergeOCR_afterExpansion.xlsx", header=0, index_col=0)

for index, (formula_name, data) in enumerate(merge_result.iterrows()):
    indexInExcel = index + 2
    sentences = data["Text"].strip().splitlines()
    sentences = sentences[1:] # discard the first line, which is the formula name
    sentences = list(filter(lambda line: line != "//////", sentences))
    
    if "右" in sentences[-1] or "上" in sentences[-1]:
        continue

    if "依" in sentences[-1]:
        print(f"Row {indexInExcel}: {formula_name} depends on other formula.")
        print(f"\t{sentences[-1]}")
    else:
        print(f"Row {indexInExcel}: {formula_name} provides no info.")
        print(f"\t{sentences[-1]}")
        continue

    if sentences[-1].startswith("於") and not sentences[-1].startswith("於此方"):
        referFormula = sentences[-1].replace("於", "").partition("方")[0] + "方"
        merge_result.loc[formula_name, "parent"] = referFormula
        print(f"\t depend on {referFormula}")
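
The parent lookup at the end of the loop can be illustrated in isolation. The following is a minimal sketch, not the project's exact code: the helper name `extract_parent` is hypothetical, and unlike the loop above it only strips the leading 「於」 instead of every 「於」 in the line.

```python
from typing import Optional


def extract_parent(last_line: str) -> Optional[str]:
    """Return the parent formula named by a trailing 「於…方」 line, if any."""
    # 「於此方」 means "this formula" and has to be resolved by hand.
    if not last_line.startswith("於") or last_line.startswith("於此方"):
        return None
    # Drop the leading 「於」 and keep everything up to the first 「方」.
    name, sep, _ = last_line[1:].partition("方")
    return name + sep if sep else None


print(extract_parent("於桂枝湯方内加附子一枚炮去皮破八片餘依前法"))  # 桂枝湯方
print(extract_parent("於此方内去桂枝加白朮四兩依前法"))  # None
```

Returning `None` for 「於此方」 keeps the ambiguous case out of the automatic path, so it can be reviewed manually.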

HerbCount

There are 2 cases, depending on the ending words.

In the first case, the formula ends with 「右」 or 「上」, and the phrase that follows the keyword gives the number of herbs used. For example, 「右四味」 indicates that 4 herbs are used. We translate the Chinese numeral 「四」 to the Arabic numeral 4 with the pycnnum package; likewise, 「一十四」 is translated to 14.
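
To make the conversion concrete without depending on pycnnum, here is a small hand-written stand-in. It is only a sketch: `chinese_to_int` and `herb_count` are hypothetical names, and the parser handles just the simple digit-and-multiplier numerals that appear in herb counts, not everything `cn2num` supports.

```python
DIGITS = {"一": 1, "二": 2, "三": 3, "四": 4, "五": 5,
          "六": 6, "七": 7, "八": 8, "九": 9}
MULTIPLIERS = {"十": 10, "百": 100, "千": 1000}


def chinese_to_int(text: str) -> int:
    """Convert a simple Chinese numeral such as 「一十四」 to an int."""
    total = 0
    pending = 0  # digit waiting for a multiplier such as 十
    for char in text:
        if char in DIGITS:
            pending = DIGITS[char]
        elif char in MULTIPLIERS:
            # A bare 十 (no digit in front) counts as 1 x 10.
            total += (pending or 1) * MULTIPLIERS[char]
            pending = 0
        elif char == "廿":
            total += 20
        else:
            raise ValueError(f"unexpected character: {char}")
    return total + pending


def herb_count(last_line: str) -> int:
    """Read the herb count from a line such as 「右四味…」."""
    phrase = last_line.partition("味")[0]
    # Strip the leading keyword; longer prefixes are tried first.
    for prefix in ("右上", "巳上", "右", "上"):
        if phrase.startswith(prefix):
            phrase = phrase[len(prefix):]
            break
    return chinese_to_int(phrase)


print(chinese_to_int("一十四"))  # 14
print(herb_count("右四味㕮咀以水七升"))  # 4
```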

Figure 1: Content page of Zhongjing quan shu
Code for extracting herbCount on 「右」or「上」
# Aim: find the number of herbs used in each formula
# if the number is not stated, log the formula for further data cleaning (e.g. expansion)

import pandas as pd
from pycnnum import cn2num

merge_result = pd.read_excel("../shanghanlun_mergeOCR_afterExpansion.xlsx", header=0, index_col=0)

# note that all Chinese financial numerals were replaced by ordinary numerals in the previous phase
chinese_ordinary_num = [*"一二三四五六七八九十廿百千"]

count = 0
for index, (formula_name, data) in enumerate(merge_result.iterrows()):
    indexInExcel = index + 2
    sentences = data["Text"].strip().splitlines()
    sentences = sentences[1:] # discard the first line, which is the formula name
    sentences = list(filter(lambda line: line != "//////", sentences))
    
    if not ("右" in sentences[-1] or "上" in sentences[-1] or "於" in sentences[-1]):
        print(f"Row {indexInExcel}: {formula_name} has not listed the number of herbs used.")
        print(f"\t{sentences[-1]}")
        count += 1
        continue
        
    herbCountPhrase = sentences[-1].partition("味")[0]
    # strip the leading keyword; 「右上」 must be removed before 「右」, otherwise it can never match
    herbCountPhrase = herbCountPhrase.replace("右上", "").replace("巳上", "").replace("右", "")
    abnormalWords = list(filter(lambda char: char not in chinese_ordinary_num, herbCountPhrase))
    if len(abnormalWords) > 0:
        print(f"Row {indexInExcel}: {herbCountPhrase}. Abnormal: {abnormalWords}")
    else:
        herbCount = cn2num(herbCountPhrase)
        print(f"Row {indexInExcel}: {herbCount}")
        merge_result.loc[formula_name, "herbCount"] = herbCount
        
print(f"There are {count} formulae with no clear indications.")

In the second case, the formula ends with the keyword 「於」 rather than 「右」 or 「上」, and we need to expand it with its parent, that is, copy the content of the parent formula into the child formula. The 1st expansion in Table 3 is done by the program, while the 2nd expansion is done manually. After the expansion, the formula ends with 「右」 or 「上」, so the method above can be used to record the herb count.

Child formula text:
桂枝加附子湯方
桂枝湯方內加附子一枚炮去皮破八片餘依前法

1st expansion:
桂枝加附子湯方
桂枝
三兩
去皮
味辛熱
芍藥
三兩
味苦酸微寒
甘草
二兩
味甘平
生薑
三兩
味辛溫
大棗
十二枚
味甘溫
右五味㕮咀以水七升微火煮取三升去滓適寒溫服一
於桂枝湯方内加附子一枚炮去皮破八片餘依前法

2nd expansion:
桂枝加附子湯方
桂枝
三兩
去皮
味辛熱
芍藥
三兩
味苦酸微寒
甘草
二兩
味甘平
生薑
三兩
味辛溫
大棗
十二枚
味甘溫
附子
一枚
炮去皮破八片
味㕮咀以水七升微火煮取三升去滓適寒溫服一

Table 3: Expansion of Gui-Zhi-Jia-Fu-Zi-Tang-Fang (桂枝加附子湯方)
Code for expanding formula
# Aim: expand the formulae that depend on another formula
# i.e. if formulaB depends on formulaA, insert the 'Text' of formulaA into the text of formulaB

import pandas as pd

merge_result = pd.read_excel("../shanghanlun_mergeOCR_afterExpansion.xlsx", header=0, index_col=0)

for index, (formula_name, data) in enumerate(merge_result.iterrows()):
    indexInExcel = index + 2
    if data["parent"] != '無':
        sentences = data["Text"]
        sentences = sentences.partition("\n")

        # partition() returns an empty separator when "\n" is not found
        if sentences[1] != "":
            if data["parent"] not in merge_result.index:
                print(f"Row {indexInExcel}: Cannot find the source: {data['parent']}")
                print()
                continue
                
            if merge_result.loc[data["parent"], "parent"] != "無":
                currentFormula = formula_name
                print(f"Row {indexInExcel}: {formula_name} ", end='')
                while currentFormula != "無":
                    source = merge_result.loc[currentFormula, 'parent']
                    print(f"-> {source}", end='')
                    currentFormula = source
                print()
                print(f"\tThere is a complex dependency")
                print()
                continue

            referText = merge_result.loc[data["parent"], "Processed_Text"]
            referText = referText.partition('\n')[-1] # exclude the name of parent formula
            sentences = sentences[0] + "\n" + referText + ("" if referText[-1] == '\n' else '\n') + sentences[2]
            print(f"Row {indexInExcel}")
            print(sentences)
            print()
            merge_result.loc[formula_name, "Processed_Text"] = sentences
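
The string manipulation at the core of the 1st expansion can be shown on its own. This is a sketch under simplifying assumptions: `expand` is a hypothetical helper, the texts are shortened excerpts, and the dependency checks from the loop above are left out.

```python
def expand(child_text: str, parent_text: str) -> str:
    """Insert the parent's body between the child's name and the rest of the child."""
    child_name, _, child_rest = child_text.partition("\n")
    # Skip the parent's first line, which is the parent's own name.
    parent_body = parent_text.partition("\n")[2]
    if parent_body and not parent_body.endswith("\n"):
        parent_body += "\n"
    return child_name + "\n" + parent_body + child_rest


parent = "桂枝湯方\n桂枝\n三兩\n右五味㕮咀"
child = "桂枝加附子湯方\n於桂枝湯方内加附子一枚"
print(expand(child, parent))
```

The result keeps the child's name on the first line, followed by the parent's herbs and instructions, with the child's own 「於…」 line last, which matches the layout of the 1st expansion in Table 3.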

Three exceptions were found.

The first one is Zhu-Fu-Tang-Fang (朮附湯方). Read literally, the phrase 「於此方內」 means that the parent formula is the formula itself, which is impossible. After some research, we found that the phrase refers to the previous formula, Gui-Zhi-Jia-Fu-Zi-Tang-Fang (桂枝加附子湯方).

朮附湯方
於此方内去桂枝加白朮四兩依前法

The second and third are Mi-Jian-Dao-Fang (蜜煎導方) and Zhu-Dan-Zhi-Fang (豬膽汁方). They contain no ending keywords, so we handle them manually.

蜜煎導方
蜜
七合
一味内銅器中微火煎之稍凝似飴狀攪之勿

豬膽汁方
大豬膽
一枚
瀉汁和醋少許以灌穀道中

Extracting Structured Representations

We have 3 ways to model the formula content.

The first table is filled with the Chinese quantity for each herb.

The second table is filled with the quantity in grams. Note that different herbs are measured in different units. For example, 甘草 is measured in 兩 while 杏仁 is measured in 個. Some herbs are measured with mass-specifiers, while others are measured with count-specifiers. After some research, we converted all units/classifiers to grams.

The third table is one-hot encoded. An entry is 1 if the herb is present in the formula and 0 otherwise.
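
As a small illustration of how the representations relate, the sketch below derives the gram values for the herbs measured in 兩 and the one-hot row for 麻黃湯方 from the values in Table 2. The variable and function names are hypothetical, the 兩 amounts are entered as plain numbers rather than parsed from Chinese, and 杏仁 is left out of the gram step because it needs a herb-specific conversion.

```python
HERBS = ["甘草", "桂枝", "麻黃", "杏仁", "芍藥"]  # column order of Table 2
GRAMS_PER_LIANG = 15.625  # conversion factor for 兩 used in this project

chinese_quantity = {"甘草": "一兩", "桂枝": "二兩", "麻黃": "三兩", "杏仁": "七十個"}
liang = {"甘草": 1, "桂枝": 2, "麻黃": 3}  # herbs measured in 兩 only


def one_hot(quantities: dict, herbs: list) -> list:
    """1 if the herb appears in the formula, 0 otherwise."""
    return [1 if herb in quantities else 0 for herb in herbs]


grams = {herb: amount * GRAMS_PER_LIANG for herb, amount in liang.items()}
print(grams)  # {'甘草': 15.625, '桂枝': 31.25, '麻黃': 46.875}
print(one_hot(chinese_quantity, HERBS))  # [1, 1, 1, 1, 0]
```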

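Chinese Quantity

| | 甘草 | 桂枝 | 麻黃 | 杏仁 | 芍藥 |
| --- | --- | --- | --- | --- | --- |
| 麻黃湯方 | 一兩 | 二兩 | 三兩 | 七十個 | |

Quantity in Gram

| | 甘草 | 桂枝 | 麻黃 | 杏仁 | 芍藥 |
| --- | --- | --- | --- | --- | --- |
| 麻黃湯方 | 15.625 | 31.25 | 46.875 | 28 | 0 |

One-Hot Encoded

| | 甘草 | 桂枝 | 麻黃 | 杏仁 | 芍藥 |
| --- | --- | --- | --- | --- | --- |
| 麻黃湯方 | 1 | 1 | 1 | 1 | 0 |

Table 2: Structured representations of 麻黃湯方

| Type | Units |
| --- | --- |
| mass-specifiers | 兩、分、觔、斤、銖、錢、方寸匕 |
| count-specifiers | 個、枚、升、把、合 |
| comparative-specifier | 雞子大 |

Table 4: Measurement of herbs in Shang-Han-Lun

| 杏仁 | 烏梅 | 桃仁 |
| --- | --- | --- |
| 0.3g | 0.9g | 0.3g |

Table 5: Part of the conversion table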

A critical part of the modelling is to identify the herbs correctly. Fortunately, the ASCDC has provided us with a comprehensive Chinese Medicine List. Moreover, the team of Text Analysis on Collected Exegesis of Recipes [1] in Data Analytics Practice Opportunity 2021/22 has added more herb names to the ASCDC list, which saved us a lot of time. We found that 25 herbs were not recorded in the list, so we added them.

Code for finding unrecorded herbs
import pandas as pd
from collections import Counter

merge_result = pd.read_excel("../shanghanlun_mergeOCR_afterExpansion.xlsx", header=0, index_col=0) 
herb_names = pd.read_csv("../herbs_name.csv", header=None)
herb_names_set = set(herb_names.iloc[:, 0])

count = 0
foundHerbs = set()
herbsFrequency = Counter()
for index, (formula_name, data) in enumerate(merge_result.iterrows()):
    indexInExcel = index + 2
    sentences = data["Processed_Text"].strip().splitlines()
    sentences = sentences[1:] # discard the first line, which is the formula name
    sentences = list(filter(lambda line: line != "//////", sentences))

    if data['herbCount'] == -1:
        print(f"Row {indexInExcel}: {formula_name} has no herbCount yet.")
        continue

    formula_herbs = set()
    ordered_herbs = []
    observedHerbCount = 0
    actualHerbCount = data['herbCount']
    for sentence in sentences:
        if sentence in herb_names_set:
            observedHerbCount += 1
            formula_herbs.add(sentence)
            ordered_herbs.append(sentence)
            foundHerbs.add(sentence)
            herbsFrequency[sentence] += 1
    diff = actualHerbCount - observedHerbCount
    count += diff
    if (diff != 0):
        print(f"Row {indexInExcel}: {formula_name} has {diff} missing herbs.")
        print(f"\tObserved herbs: {ordered_herbs}.")
        
print(f"There are {count} missing herbs in herbs_name.")
print(f"Observed {len(foundHerbs)} herbs in ShangHanLun.")
print(f"Observed herbs:\n{foundHerbs}")
print(f"The corresponding frequency of herbs: {herbsFrequency}")
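
The check that drives this script is a comparison between the declared herb count and the number of lines that match a known herb name. A minimal sketch of that comparison follows; `missing_herbs` is a hypothetical helper, and the line list and herb set are made-up excerpts chosen so that one herb is missing from the list.

```python
def missing_herbs(lines: list, known_herbs: set, declared_count: int):
    """Return (number of unrecognised herbs, herbs that were recognised)."""
    observed = [line for line in lines if line in known_herbs]
    return declared_count - len(observed), observed


# Illustrative excerpt: 杏仁 is deliberately absent from the known-herb set.
lines = ["麻黃", "三兩", "去節", "桂枝", "二兩", "去皮", "甘草", "一兩", "杏仁", "七十個"]
known = {"麻黃", "桂枝", "甘草"}

diff, observed = missing_herbs(lines, known, declared_count=4)
print(diff)      # 1
print(observed)  # ['麻黃', '桂枝', '甘草']
```

A non-zero difference only says that some herb was not recognised; the observed list then shows which herbs were matched, so the unrecognised name can be found by reading the formula text.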

Some formulae use 「各」 to give one dosage for several herbs. For example, the phrase 「各一兩」 in the following formula means that 芍藥, 生薑, 甘草 and 麻黃 each have a dosage of 「一兩」.

桂枝麻黃各半湯方
桂枝
一兩十六銖
去皮
芍藥
生薑

甘草

麻黃

各一兩
去節
大棗
四枚

杏仁
二十四個
湯浸去皮尖及兩仁者
右七味以水五升先煮麻黃一二沸去上沫内諸藥煮取
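
One way to handle 「各」 is to collect herb names until a dosage line appears and then give that dosage to every pending herb. The sketch below applies this idea to the formula above. It is an illustration only: `assign_dosages` is a hypothetical helper, and the unit tuple and herb set are cut down to what this one formula needs.

```python
UNITS = ("兩", "銖", "枚", "個")  # reduced unit list, enough for this example
KNOWN_HERBS = {"桂枝", "芍藥", "生薑", "甘草", "麻黃", "大棗", "杏仁"}


def assign_dosages(lines: list) -> dict:
    """Map each herb to its Chinese dosage, sharing 「各…」 dosages."""
    dosages = {}
    pending = []  # herbs seen since the last dosage line
    for line in lines:
        if line in KNOWN_HERBS:
            pending.append(line)
        elif pending and line.endswith(UNITS):
            # 「各一兩」 becomes 「一兩」; lines without 「各」 are unchanged.
            dosage = line.rpartition("各")[2]
            for herb in pending:
                dosages[herb] = dosage
            pending.clear()
    return dosages


lines = ["桂枝", "一兩十六銖", "去皮", "芍藥", "生薑", "", "甘草", "", "麻黃", "",
         "各一兩", "去節", "大棗", "四枚", "", "杏仁", "二十四個",
         "湯浸去皮尖及兩仁者", "右七味以水五升先煮麻黃一二沸去上沫内諸藥煮取"]
print(assign_dosages(lines))
```

Preparation lines such as 「去皮」 are skipped simply because they do not end with a unit.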

Furthermore, 3 formulae (十棗湯方, 半夏散及湯方, 牡蠣澤瀉散方) use 「各等分」 to indicate that all herbs used have the same dosage, but the actual dosage is not stated. To prevent errors, we omit them from the quantity models.

半夏散及湯方
半夏

味辛溫
桂枝
去皮
味辛熱
甘草

味甘平
以上各等分
巳上三味各别搗篩巳合治之白飲和服方寸匕曰三服

Code for Modelling

First, we need to load the required data such as the unit conversion dictionary.

import pandas as pd
from pycnnum import cn2num  # used by getHerbDosage below

merge_result = pd.read_excel("../shanghanlun_mergeOCR_afterExpansion.xlsx", header=0, index_col=0)
herb_names = pd.read_csv("../herbs_name.csv", header=None)
herb_names_set = set(herb_names.iloc[:, 0])

special_units_excel = pd.read_excel("../special_unit_conversion.xlsx", sheet_name=None, index_col=0, nrows=1)

special_units_conversion = dict()
for special_unit, dataframe in special_units_excel.items():
    special_units_conversion.update(dataframe.to_dict('index'))

special_units = set([
    '個',
    '枚',
    '升',
    '把',
    '合',
    '雞子大'
])

ordinary_units = set([
    '兩',
    '分',
    '觔',
    '斤', 
    '銖',
    '錢'
])

# reference [2]:
# https://www.theqi.com/cmed/class/class1/note_18.html
ordinary_units_conversion = {
    '兩': 15.625,
    '分': 4.05,
    '觔': 3.69,
    '斤': 3.69, 
    '銖': 0.65,
    '錢': 3,
}

chinese_ordinary_num = [*"一二三四五六七八九十廿百千半"]

units = tuple(ordinary_units.union(special_units))

formulae = merge_result.index.copy()
# one dict per formula, keyed by herb name; both are converted to DataFrames after the modelling loop
quantity_model = {formula: {} for formula in formulae}
chinese_quantity_model = {formula: {} for formula in formulae}
# herbsFrequency is the Counter built in the "finding unrecorded herbs" code above
for formula in formulae:
    for herb, _ in herbsFrequency.most_common():
        quantity_model[formula][herb] = 0
        chinese_quantity_model[formula][herb] = '無'

unit_combinations = set()
arbitrary_equal_share = []
concrete_equal_share = []
abnormal_phrase = []
herb_with_size = []
missing_herb_formulae = []

Then we define some auxiliary functions.

def getHerbDosage(currentHerb, targetUnit, sentence, indexInExcel):
    # note: formula_name is read from the enclosing modelling loop (global variable)
    # only 石膏 uses 雞子大
    if targetUnit == "雞子大":
        abnormal_phrase.append(f"Row {indexInExcel}: {formula_name}, {currentHerb + sentence}")
        return special_units_conversion["雞子大"]["石膏"]
        
    # eg. 十二兩半 partitioned into 十二,兩,半
    dosagePhrase, unit, addHalf = sentence.rpartition(targetUnit)

    if any(char not in chinese_ordinary_num for char in dosagePhrase):
        abnormal_phrase.append(f"Row {indexInExcel}: {formula_name}, {currentHerb + sentence}")
    
    unitWeight = 0
    if targetUnit in ordinary_units:
        unitWeight = ordinary_units_conversion[targetUnit]
    elif targetUnit in special_units:
        if  dosagePhrase.startswith("大者"):
            dosagePhrase = dosagePhrase[2:]
            concated_herbName = currentHerb + "大者"
            unitWeight = special_units_conversion[targetUnit][concated_herbName]
        else:
            unitWeight = special_units_conversion[targetUnit][currentHerb]
            
    # 半兩
    dosage = 0.5 if '半' == dosagePhrase else cn2num(dosagePhrase)
    if len(addHalf) > 0:
        dosage += 0.5
    if unitWeight != 0:
        dosage *= unitWeight
    return dosage

def getPresentUnits(sentence):
    presentUnits = []
    for unit in units:
        pos = sentence.find(unit)
        if pos != -1:
            presentUnits.append((pos, unit))
    presentUnits.sort(key=lambda pair: pair[0]) # sort by pos
    presentUnits = tuple([unit for (pos, unit) in presentUnits])
    return presentUnits
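
To show how a compound dosage such as 「一兩十六銖」 is handled, the sketch below splits the phrase at each unit in order of appearance and sums the gram values. It is a simplified illustration rather than the functions above: `split_dosage` and `to_grams` are hypothetical names, only two units are included, and the numeral parser is passed in as a parameter (in the pipeline this role is played by `cn2num`; here a small lookup table stands in for it).

```python
GRAMS_PER_UNIT = {"兩": 15.625, "銖": 0.65}  # two of the ordinary units used above


def split_dosage(phrase: str, units) -> list:
    """Split 「一兩十六銖」 into [("一", "兩"), ("十六", "銖")]."""
    # Order the units by where they first appear in the phrase.
    found = sorted((phrase.find(unit), unit) for unit in units if unit in phrase)
    pairs = []
    rest = phrase
    for _, unit in found:
        number, _, rest = rest.partition(unit)
        pairs.append((number, unit))
    return pairs


def to_grams(phrase: str, parse_number) -> float:
    """Sum the gram value of every (number, unit) pair in the phrase."""
    return sum(parse_number(number) * GRAMS_PER_UNIT[unit]
               for number, unit in split_dosage(phrase, GRAMS_PER_UNIT))


# Stand-in numeral parser for this example only.
numbers = {"一": 1, "十六": 16}
print(split_dosage("一兩十六銖", GRAMS_PER_UNIT))  # [('一', '兩'), ('十六', '銖')]
print(round(to_grams("一兩十六銖", numbers.__getitem__), 3))  # 26.025
```

Using `partition` consumes the phrase from left to right, so each unit only sees the number that directly precedes it.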

Here is the modelling for Chinese quantity and quantity in grams.

for index, (formula_name, data) in enumerate(merge_result.iterrows()):
    indexInExcel = index + 2
    sentences = data["Processed_Text"].strip()
    if "各等分" in sentences:
        arbitrary_equal_share.append(f"Row {indexInExcel}: {formula_name}")
        continue

    if "各" in sentences:            
        concrete_equal_share.append(f"Row {indexInExcel}: {formula_name}")
        print(f"Row {indexInExcel}: {formula_name} contains 各")

    sentences = sentences.splitlines()
    sentences = sentences[1:-1] # discard the formula name and the '右X味' line
    sentences = list(filter(lambda line: line != "//////", sentences))

    if data['herbCount'] == -1:
        print(f"Row {indexInExcel}: {formula_name} has no herbCount yet.")
        continue

    print(f"Row {indexInExcel}: {formula_name}")
    herbCount = 0 
    actualHerbCount = data['herbCount']
    currentHerbs = []
    for sentence in sentences:
        if sentence in herb_names_set:
            currentHerbs.append(sentence)
            continue
            
        if len(currentHerbs) == 0:
            continue

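        if sentence == '方寸匕':
            # take the herb this dosage belongs to; 方寸匕 is recorded as 2 in the gram model
            currentHerb = currentHerbs.pop()
            print(currentHerb, sentence, 2)
            quantity_model[formula_name][currentHerb] = 2
            herbCount += 1
            continue

        # preparation_keywords (phrases describing herb preparation) is not defined in this section;
        # it must be defined before this loop runs
        if any(prep_word in sentence for prep_word in preparation_keywords):
            continue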

        if sentence.endswith(units) or sentence[:-1].endswith(units):
            _, _, sentence = sentence.rpartition('各')
            while len(currentHerbs) > 0:
                currentHerb = currentHerbs.pop()
                chinese_quantity_model[formula_name][currentHerb] = sentence
                print(currentHerb, sentence, end=' ')
                currentDosage = 0
                if sentence.startswith("大者"):
                    herb_with_size.append(currentHerb)

                presentUnits = getPresentUnits(sentence)
                
                # handle case with multiple ordinary units, such as 一兩十分
                if len(presentUnits) > 1:
                    print(presentUnits, end=' ')
                    # the units are always listed in order of appearance,
                    # so unit_combinations never contains two permutations of the same pair
                    unit_combinations.add(presentUnits)

                    # work on a copy so that `sentence` stays intact for the next herb sharing this dosage
                    remaining = sentence
                    for unit in presentUnits:
                        head, _, remaining = remaining.partition(unit)
                        dosagePhrase = head + unit
                        currentDosage += getHerbDosage(currentHerb, unit, dosagePhrase, indexInExcel)

                # one unit case
                else:
                    targetUnit = presentUnits[0]
                    currentDosage += getHerbDosage(currentHerb, targetUnit, sentence, indexInExcel)
                
                herbCount += 1
                quantity_model[formula_name][currentHerb] = currentDosage
                print(currentDosage)

    if herbCount != data['herbCount']:
        diff = data['herbCount'] - herbCount
        print(f"Row {indexInExcel}: {formula_name} is missing {diff} herbs.")
        missing_herb_formulae.append(f"Row {indexInExcel}: {formula_name}")

    print()

print("Below are the unit combinations")
print(unit_combinations)
print("Below are the formula(s) containing 各等分")
print(arbitrary_equal_share)
print("Below are the formula(s) containing 各")
print(concrete_equal_share)
print("Below are the formula(s) with characters other than Chinese numerals in the dosage")
print(abnormal_phrase)
print("Below are the formula(s) with missing herbs")
print(missing_herb_formulae)

quantity_model_df = pd.DataFrame.from_dict(quantity_model, orient='index')
chinese_quantity_df = pd.DataFrame.from_dict(chinese_quantity_model, orient='index')

Here is the modelling for the one-hot model.

from collections import Counter 

one_hot_model = pd.DataFrame(index=merge_result.index.copy())
for herb, _ in herbsFrequency.most_common():
    one_hot_model[herb] = 0

for index, (formula_name, data) in enumerate(merge_result.iterrows()):
    indexInExcel = index + 2
    sentences = data["Processed_Text"].strip()        
    sentences = sentences.splitlines()
    sentences = sentences[1:] # discard the first line, which is the formula name
    sentences = list(filter(lambda line: line != "//////", sentences))

    if data['herbCount'] == -1:
        print(f"Row {indexInExcel}: {formula_name} has no herbCount yet.")
        continue

    herbCount = 0 
    currentHerb = ""
    for sentence in sentences:
        if sentence in herb_names_set:
            currentHerb = sentence
            one_hot_model.loc[formula_name, currentHerb] = 1
            herbCount += 1
            continue
            
        if currentHerb == "":
            continue

    if herbCount != data['herbCount']:
        diff = data['herbCount'] - herbCount
        print(f"Row {indexInExcel}: {formula_name} is missing {diff} herbs.")

one_hot_model.to_excel("../one_hot_model.xlsx", index=True)

1. Text Analysis on Collected Exegesis of Recipes, Data Analytics Practice Opportunity 2021/22, https://dsprojects.lib.cuhk.edu.hk/en/projects/chinese-medicine/home/
2. Hao Wanshan's lectures on Shang-Han-Lun, Lecture 18 (郝萬山講《傷寒論》 第18講), https://www.theqi.com/cmed/class/class1/note_18.html