Formulae Modelling
Overview
There are two types of artifacts in formulae modelling: the metadata of the formula and the structured representation of the formula content. Table 1 shows the metadata and Table 2 shows the structured representation. Note that there are three structured representations: the Chinese quantity, the quantity in grams, and the one-hot encoding.
Attribute | Example value |
---|---|
Name | 麻黃湯方 |
VolumeID | 4971388 |
Subchapter | 辨太陽病脈證并治中第六 |
StartPage | 10 |
EndPage | 11 |
Text | 麻黃湯方 麻黃三兩 去節 味甘溫 桂枝二兩 去皮 味辛熱 … |
HerbCount | 4 |
Parent | None |

Formula | 甘草 | 桂枝 | 麻黃 | 杏仁 | 芍藥 |
---|---|---|---|---|---|
Chinese Quantity | | | | | |
麻黃湯方 | 一兩 | 二兩 | 三兩 | 七十個 | 無 |
Quantity in Grams | | | | | |
麻黃湯方 | 15.625 | 31.25 | 46.875 | 28 | 0 |
One-Hot Encoded | | | | | |
麻黃湯方 | 1 | 1 | 1 | 1 | 0 |
Extracting Metadata
The Name, VolumeID and Subchapter attributes are entered manually according to the content page (Figure 1) of Zhongjing quan shu. StartPage, EndPage and Text are extracted in the Formulae Separation phase. The Parent and HerbCount attributes are extracted in this phase.
Parent
As mentioned in the Formula Separation phase, the keyword 「於」indicates that the current formula is derived from another formula.
Code for extracting parent
# Aim: among the formulae that do not state the number of herbs used,
# identify those that depend on another formula
import pandas as pd

merge_result = pd.read_excel("../shanghanlun_mergeOCR_afterExpansion.xlsx", header=0, index_col=0)

for index, (formula_name, data) in enumerate(merge_result.iterrows()):
    indexInExcel = index + 2  # row 1 of the sheet is the header and Excel rows are 1-based
    sentences = data["Text"].strip().splitlines()
    sentences = sentences[1:]  # discard the first line, which is the formula name
    sentences = list(filter(lambda line: line != "//////", sentences))
    # formulae ending with 右 or 上 state their herb count directly
    if "右" in sentences[-1] or "上" in sentences[-1]:
        continue
    if "依" in sentences[-1]:
        print(f"Row {indexInExcel}: {formula_name} depends on other formula.")
        print(f"\t{sentences[-1]}")
    else:
        print(f"Row {indexInExcel}: {formula_name} provides no info.")
        print(f"\t{sentences[-1]}")
        continue
    # 於此方 refers to the previous formula and is handled manually
    if sentences[-1].startswith("於") and not sentences[-1].startswith("於此方"):
        referFormula = sentences[-1].replace("於", "").partition("方")[0] + "方"
        merge_result.loc[formula_name, "parent"] = referFormula
        print(f"\t depend on {referFormula}")
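To make the string handling concrete, here is a minimal, self-contained sketch of the parent extraction applied to a single last line. The helper name `extract_parent` is hypothetical and not part of the original scripts; it slices off the leading 「於」 instead of calling `replace`, which is a small simplification of the code above.

```python
from typing import Optional


def extract_parent(last_line: str) -> Optional[str]:
    """Return the parent formula named after a leading 於, or None.

    Hypothetical helper for illustration only.
    """
    # 於此方 points back to the preceding formula, so it is left for manual handling.
    if not last_line.startswith("於") or last_line.startswith("於此方"):
        return None
    # Everything between 於 and the first 方 is the parent's name.
    name, sep, _ = last_line[1:].partition("方")
    return name + sep if sep else None


print(extract_parent("於桂枝湯方內加附子一枚炮去皮破八片餘依前法"))  # 桂枝湯方
print(extract_parent("於此方内去桂枝加白朮四兩依前法"))  # None
```

The second call shows why the 「於此方」 case needs separate treatment: the line names no parent explicitly.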
HerbCount
There are two cases, depending on the ending keyword of the formula text.
In the first case, the phrase after the ending keyword 「右」 or 「上」 indicates the number of herbs used in the formula. For example, the phrase 「右四味」 indicates that 4 herbs are used. We can translate the Chinese number 「四」 to the Arabic number 4 with the pycnnum package. Another translation example is 「一十四」 to 14.
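For readers without pycnnum installed, the conversion can be sketched with a small hand-written function. This is only an illustrative stand-in under our own assumptions (numbers below ten thousand, no 零), not pycnnum's implementation; the name `simple_cn2num` is hypothetical.

```python
DIGITS = {"一": 1, "二": 2, "三": 3, "四": 4, "五": 5,
          "六": 6, "七": 7, "八": 8, "九": 9}
MULTIPLIERS = {"十": 10, "百": 100, "千": 1000}


def simple_cn2num(text: str) -> int:
    """Convert a small Chinese numeral such as 一十四 to an int (sketch only)."""
    total, digit = 0, 0
    for char in text:
        if char in DIGITS:
            digit = DIGITS[char]
        elif char in MULTIPLIERS:
            # A bare 十 (as in 十二) counts as one ten.
            total += (digit or 1) * MULTIPLIERS[char]
            digit = 0
        elif char == "廿":
            total += 20
            digit = 0
        else:
            raise ValueError(f"unexpected character: {char}")
    return total + digit


print(simple_cn2num("四"))      # 4
print(simple_cn2num("一十四"))  # 14
```

Raising on unexpected characters mirrors the idea of logging abnormal phrases for manual cleaning rather than guessing a value.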
Code for extracting herbCount on 「右」or「上」
# Aim: find the number of herbs used in each formula
# if there is no indication, log the formula for further data cleaning such as expansion
import pandas as pd
from pycnnum import cn2num

merge_result = pd.read_excel("../shanghanlun_mergeOCR_afterExpansion.xlsx", header=0, index_col=0)
# note that all Chinese financial numerals were replaced by ordinary numerals in the last phase
chinese_ordinary_num = [*"一二三四五六七八九十廿百千"]
count = 0

for index, (formula_name, data) in enumerate(merge_result.iterrows()):
    indexInExcel = index + 2
    sentences = data["Text"].strip().splitlines()
    sentences = sentences[1:]  # discard the first line, which is the formula name
    sentences = list(filter(lambda line: line != "//////", sentences))
    if not ("右" in sentences[-1] or "上" in sentences[-1] or "於" in sentences[-1]):
        print(f"Row {indexInExcel}: {formula_name} has not listed the number of herbs used.")
        print(f"\t{sentences[-1]}")
        count += 1
        continue
    herbCountPhrase = sentences[-1].partition("味")[0]
    # strip the two-character prefixes first; removing 右 first would leave a stray 上 from 右上
    herbCountPhrase = herbCountPhrase.replace("右上", "").replace("巳上", "").replace("右", "")
    abnormalWords = list(filter(lambda char: char not in chinese_ordinary_num, herbCountPhrase))
    if len(abnormalWords) > 0:
        print(f"Row {indexInExcel}: {herbCountPhrase}. Abnormal: {abnormalWords}")
    else:
        herbCount = cn2num(herbCountPhrase)
        print(f"Row {indexInExcel}: {herbCount}")
        merge_result.loc[formula_name, "herbCount"] = herbCount

print(f"There are {count} formulae with no clear indications.")
In the second case, the formula uses the ending keyword 「於」 but not 「右」 or 「上」. We need to expand the formula with its parent, which means filling the content of the parent formula into the child formula. The 1st expansion in Table 3 is done programmatically, while the 2nd expansion is done manually. After the expansion, the formula ends with 「右」 or 「上」, so we can use the above method to record the herb count.
Child formula text | 1st expansion | 2nd expansion |
---|---|---|
桂枝加附子湯方 於桂枝湯方內加附子一枚炮去皮破八片餘依前法 | 桂枝加附子湯方 桂枝 三兩 去皮 味辛熱 芍藥 三兩 味苦酸微寒 甘草 二兩 炙 味甘平 生薑 三兩 切 味辛溫 大棗 十二枚 擘 味甘溫 右五味㕮咀以水七升微火煮取三升去滓適寒溫服一 於桂枝湯方内加附子一枚炮去皮破八片餘依前法 | 桂枝加附子湯方 桂枝 三兩 去皮 味辛熱 芍藥 三兩 味苦酸微寒 甘草 二兩 炙 味甘平 生薑 三兩 切 味辛溫 大棗 十二枚 擘 味甘溫 附子 一枚 炮去皮破八片 右六味㕮咀以水七升微火煮取三升去滓適寒溫服一 |
Code for expanding formula
# Aim: expand the formulae which depend on another formula
# i.e. if formulaB depends on formulaA, add the 'Text' of formulaA into the text of formulaB
import pandas as pd

merge_result = pd.read_excel("../shanghanlun_mergeOCR_afterExpansion.xlsx", header=0, index_col=0)

for index, (formula_name, data) in enumerate(merge_result.iterrows()):
    indexInExcel = index + 2
    if data["parent"] != '無':
        sentences = data["Text"]
        sentences = sentences.partition("\n")  # (formula name, "\n", remaining text)
        # fix: the original compared len(...) with -1, which is always true;
        # the separator is empty when the text has no line break
        if sentences[1] != "":
            if data["parent"] not in merge_result.index:
                print(f"Row {indexInExcel}: Cannot find the source: {data['parent']}")
                print()
                continue
            # the parent itself has a parent: report the chain and leave it for manual handling
            if merge_result.loc[data["parent"], "parent"] != "無":
                currentFormula = formula_name
                print(f"Row {indexInExcel}: {formula_name} ", end='')
                while currentFormula != "無":
                    source = merge_result.loc[currentFormula, 'parent']
                    print(f"-> {source}", end='')
                    currentFormula = source
                print()
                print(f"\tThere is a complex dependency")
                print()
                continue
            referText = merge_result.loc[data["parent"], "Processed_Text"]
            referText = referText.partition('\n')[-1]  # exclude the name of the parent formula
            sentences = sentences[0] + "\n" + referText + ("" if referText.endswith('\n') else '\n') + sentences[2]
            print(f"Row {indexInExcel}")
            print(sentences)
            print()
            merge_result.loc[formula_name, "Processed_Text"] = sentences
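The core of the 1st expansion can be sketched on plain strings, without the spreadsheet. The function name `expand_with_parent` is hypothetical, and the two texts are heavily abridged from Table 3 purely for illustration (one herb only, truncated instructions).

```python
def expand_with_parent(child_text: str, parent_text: str) -> str:
    """Insert the parent's body between the child's name and the child's own lines."""
    child_name, _, child_rest = child_text.partition("\n")
    parent_body = parent_text.partition("\n")[2]  # drop the parent's name line
    if not parent_body.endswith("\n"):
        parent_body += "\n"
    return child_name + "\n" + parent_body + child_rest


# Abridged toy texts, one item per line as in the Text column.
parent = "桂枝湯方\n桂枝\n三兩\n去皮\n右五味㕮咀以水七升"
child = "桂枝加附子湯方\n於桂枝湯方內加附子一枚炮去皮破八片餘依前法"

expanded = expand_with_parent(child, parent)
print(expanded)
```

The child's 「於」 line is kept at the end on purpose, so that the manual 2nd expansion can still see which herbs to add or remove.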
Three exceptions were found.
The first one is Zhu-Fu-Tang-Fang (朮附湯方). Read literally, the phrase 「於此方內」 means that the parent formula is the formula itself, which is logically impossible. After some research, we found that the phrase refers to the previous formula, which is Gui-Zhi-Jia-Fu-Zi-Tang-Fang (桂枝加附子湯方).
朮附湯方
於此方内去桂枝加白朮四兩依前法
The second and third ones are Mi-Jian-Dao-Fang (蜜煎導方) and Zhu-Dan-Zhi-Fang (豬膽汁方). They contain no ending keywords, so we handle them manually.
蜜煎導方 蜜 七合 一味内銅器中微火煎之稍凝似飴狀攪之勿
豬膽汁方 大豬膽 一枚 瀉汁和醋少許以灌穀道中
Extracting Structured Representations
We have 3 ways to model the formula content.
The first table is filled with the Chinese quantity for each herb.
The second table is filled with the quantity in grams. Note that different herbs are measured in different units. For example, 甘草 is measured in 兩 and 杏仁 is measured in 個. Some herbs are measured with mass-specifiers, while others are measured with count-specifiers. After some research, we converted the units/classifiers to grams.
The third table is one-hot encoded. The entry is filled with one if the herb is present, otherwise, it is filled with zero.
Formula | 甘草 | 桂枝 | 麻黃 | 杏仁 | 芍藥 |
---|---|---|---|---|---|
Chinese Quantity | | | | | |
麻黃湯方 | 一兩 | 二兩 | 三兩 | 七十個 | 無 |
Quantity in Grams | | | | | |
麻黃湯方 | 15.625 | 31.25 | 46.875 | 28 | 0 |
One-Hot Encoded | | | | | |
麻黃湯方 | 1 | 1 | 1 | 1 | 0 |

Specifier type | Units |
---|---|
mass-specifiers | 兩、分、觔、斤、銖、錢、方寸匕 |
count-specifiers | 個、枚、升、把、合 |
comparative-specifier | 雞子大 |

Unit | 杏仁 | 烏梅 | 桃仁 |
---|---|---|---|
個 | 0.3g | 0.9g | 0.3g |
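As a worked example of the mass-unit conversion, the sketch below multiplies each count by a grams-per-unit factor and sums the parts. The two factors are the ones used later in this document's conversion dictionary (兩 = 15.625 g, 銖 = 0.65 g); the function name `to_grams` and the pre-parsed `(count, unit)` input are simplifying assumptions of ours.

```python
# Grams per unit, taken from the conversion dictionary used later in this document.
GRAMS_PER_UNIT = {"兩": 15.625, "銖": 0.65}


def to_grams(parts):
    """Sum a dosage given as (count, unit) pairs, e.g. 一兩十六銖 -> [(1, "兩"), (16, "銖")]."""
    return sum(count * GRAMS_PER_UNIT[unit] for count, unit in parts)


# 麻黃 三兩 in 麻黃湯方: 3 x 15.625 g
print(to_grams([(3, "兩")]))  # 46.875

# 桂枝 一兩十六銖: 15.625 + 16 x 0.65, roughly 26.025 g (subject to floating-point rounding)
print(to_grams([(1, "兩"), (16, "銖")]))
```

The first result matches the 麻黃 entry in the gram table above. Count-specifiers such as 個 need a per-herb weight instead of a single factor, which is why they are stored in a separate lookup.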
A critical part of the modelling is to identify the herbs correctly. Fortunately, the ASCDC has provided us with a comprehensive Chinese Medicine List. Moreover, the team of Text Analysis on Collected Exegesis of Recipes [1] in Data Analytics Practice Opportunity 2021/22 has added more herb names to the ASCDC list. This saves us a lot of time. We found that 25 herbs were not recorded in the list, so we added them.
Code for finding unrecorded herbs
import pandas as pd
from collections import Counter

merge_result = pd.read_excel("../shanghanlun_mergeOCR_afterExpansion.xlsx", header=0, index_col=0)
herb_names = pd.read_csv("../herbs_name.csv", header=None)
herb_names_set = set(herb_names.iloc[:, 0])
count = 0
foundHerbs = set()
herbsFrequency = Counter()

for index, (formula_name, data) in enumerate(merge_result.iterrows()):
    indexInExcel = index + 2
    sentences = data["Processed_Text"].strip().splitlines()
    sentences = sentences[1:]  # discard the first line, which is the formula name
    sentences = list(filter(lambda line: line != "//////", sentences))
    if data['herbCount'] == -1:
        print(f"Row {indexInExcel}: {formula_name} has no herbCount yet.")
        continue
    formula_herbs = set()
    ordered_herbs = []
    observedHerbCount = 0
    actualHerbCount = data['herbCount']
    for sentence in sentences:
        if sentence in herb_names_set:
            observedHerbCount += 1
            formula_herbs.add(sentence)
            ordered_herbs.append(sentence)
            foundHerbs.add(sentence)
            herbsFrequency[sentence] += 1
    # a positive difference means some herbs of this formula are not in the herb list
    diff = actualHerbCount - observedHerbCount
    count += diff
    if diff != 0:
        print(f"Row {indexInExcel}: {formula_name} has {diff} missing herbs.")
        print(f"\tObserved herbs: {ordered_herbs}.")

print(f"There are {count} missing herbs in herbs_name.")
print(f"Observed {len(foundHerbs)} herbs in ShangHanLun.")
print(f"Observed Herbs:\n{foundHerbs}")
print(f"The corresponding frequency of herbs: {herbsFrequency}")
Some formulae use 「各」 to give a single dosage for multiple herbs. For example, the phrase 「各一兩」 in the following formula means that the dosages of 芍藥, 生薑, 甘草 and 麻黃 are all 「一兩」.
桂枝麻黃各半湯方
桂枝
一兩十六銖
去皮
芍藥
生薑
切
甘草
炙
麻黃
各一兩
去節
大棗
四枚
擘
杏仁
二十四個
湯浸去皮尖及兩仁者
右七味以水五升先煮麻黃一二沸去上沫内諸藥煮取
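The way a 「各」 line is shared among several herbs can be sketched as follows: herb names are queued until a dosage line arrives, and that dosage is then given to every queued herb. The small `HERBS` and `UNITS` collections and the function name `assign_dosages` are illustrative assumptions that cover only this one formula.

```python
# Illustrative vocabulary covering only 桂枝麻黃各半湯方.
HERBS = {"桂枝", "芍藥", "生薑", "甘草", "麻黃", "大棗", "杏仁"}
UNITS = ("兩", "銖", "枚", "個")


def assign_dosages(lines):
    """Map each herb to its Chinese dosage phrase, sharing 各 dosages among queued herbs."""
    dosages = {}
    pending = []  # herbs that have not received a dosage yet
    for line in lines:
        if line in HERBS:
            pending.append(line)
        elif pending and line.endswith(UNITS):
            dosage = line.rpartition("各")[2]  # drop a leading 各 if present
            for herb in pending:
                dosages[herb] = dosage
            pending.clear()
        # any other line (去皮, 切, 炙, ...) is a preparation note and is skipped
    return dosages


lines = ["桂枝", "一兩十六銖", "去皮", "芍藥", "生薑", "切", "甘草", "炙",
         "麻黃", "各一兩", "去節", "大棗", "四枚", "擘", "杏仁", "二十四個",
         "湯浸去皮尖及兩仁者"]
result = assign_dosages(lines)
for herb, dosage in result.items():
    print(herb, dosage)
```

Here 芍藥, 生薑, 甘草 and 麻黃 all receive 「一兩」, while 桂枝 keeps its own 「一兩十六銖」 because its dosage line arrives before the next herb is queued.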
Furthermore, 3 formulae (十棗湯方, 半夏散及湯方, 牡蠣澤瀉散方) use 「各等分」 to indicate that every herb used has the same dosage. However, the actual dosage is not mentioned. To prevent errors, we omit them in the quantity model.
半夏散及湯方
半夏
洗
味辛溫
桂枝
去皮
味辛熱
甘草
炙
味甘平
以上各等分
巳上三味各别搗篩巳合治之白飲和服方寸匕曰三服
Code for Modelling
First, we need to load the required data such as the unit conversion dictionary.
import pandas as pd
from pycnnum import cn2num  # used by getHerbDosage below

merge_result = pd.read_excel("../shanghanlun_mergeOCR_afterExpansion.xlsx", header=0, index_col=0)
herb_names = pd.read_csv("../herbs_name.csv", header=None)
herb_names_set = set(herb_names.iloc[:, 0])
# one sheet per special unit; the single data row maps herb name -> grams per unit
special_units_excel = pd.read_excel("../special_unit_conversion.xlsx", sheet_name=None, index_col=0, nrows=1)
special_units_conversion = dict()
for special_unit, dataframe in special_units_excel.items():
    special_units_conversion.update(dataframe.to_dict('index'))
special_units = set([
    '個',
    '枚',
    '升',
    '把',
    '合',
    '雞子大'
])
ordinary_units = set([
    '兩',
    '分',
    '觔',
    '斤',
    '銖',
    '錢'
])
# reference [2]:
# https://www.theqi.com/cmed/class/class1/note_18.html
ordinary_units_conversion = {
    '兩': 15.625,
    '分': 4.05,
    '觔': 3.69,
    '斤': 3.69,
    '銖': 0.65,
    '錢': 3,
}
chinese_ordinary_num = [*"一二三四五六七八九十廿百千半"]
units = tuple(ordinary_units.union(special_units))
# NOTE: preparation_keywords is used by the modelling loop, but its definition is not shown
# in this document. The empty list is a placeholder assumption; fill it in before running.
preparation_keywords = []
formulae = merge_result.index.copy()
quantity_model = {formula: {} for formula in formulae}
chinese_quantity_model = {formula: {} for formula in formulae}
# herbsFrequency is the Counter built in "Code for finding unrecorded herbs"
for formula in formulae:
    for herb, _ in herbsFrequency.most_common():
        quantity_model[formula][herb] = 0
        chinese_quantity_model[formula][herb] = '無'
unit_combinations = set()
arbitrary_equal_share = []
concrete_equal_share = []
abnormal_phrase = []
herb_with_size = []
missing_herb_formulae = []
Then, we define some auxiliary functions.
def getHerbDosage(currentHerb, targetUnit, sentence, indexInExcel):
    # note: formula_name is read from the enclosing modelling loop
    # only 石膏 uses 雞子大
    if targetUnit == "雞子大":
        abnormal_phrase.append(f"Row {indexInExcel}: {formula_name}, {currentHerb + sentence}")
        return special_units_conversion["雞子大"]["石膏"]
    # e.g. 十二兩半 is partitioned into 十二, 兩, 半
    dosagePhrase, unit, addHalf = sentence.rpartition(targetUnit)
    if any(char not in chinese_ordinary_num for char in dosagePhrase):
        abnormal_phrase.append(f"Row {indexInExcel}: {formula_name}, {currentHerb + sentence}")
    unitWeight = 0
    if targetUnit in ordinary_units:
        unitWeight = ordinary_units_conversion[targetUnit]
    elif targetUnit in special_units:
        # a 大者 prefix selects the weight of the large variant of the herb
        if dosagePhrase.startswith("大者"):
            dosagePhrase = dosagePhrase[2:]
            concated_herbName = currentHerb + "大者"
            unitWeight = special_units_conversion[targetUnit][concated_herbName]
        else:
            unitWeight = special_units_conversion[targetUnit][currentHerb]
    # e.g. 半兩
    dosage = 0.5 if '半' == dosagePhrase else cn2num(dosagePhrase)
    if len(addHalf) > 0:
        dosage += 0.5
    if unitWeight != 0:
        dosage *= unitWeight
    return dosage

def getPresentUnits(sentence):
    # return the units found in the phrase, ordered by their position
    presentUnits = []
    for unit in units:
        pos = sentence.find(unit)
        if pos != -1:
            presentUnits.append((pos, unit))
    presentUnits.sort(key=lambda pair: pair[0])  # sort by pos
    presentUnits = tuple([unit for (pos, unit) in presentUnits])
    return presentUnits
Here is the modelling for the Chinese quantity and the quantity in grams.
for index, (formula_name, data) in enumerate(merge_result.iterrows()):
    indexInExcel = index + 2
    sentences = data["Processed_Text"].strip()
    if "各等分" in sentences:
        arbitrary_equal_share.append(f"Row {indexInExcel}: {formula_name}")
        continue
    if "各" in sentences:
        concrete_equal_share.append(f"Row {indexInExcel}: {formula_name}")
        print(f"Row {indexInExcel}: {formula_name} contains 各")
    sentences = sentences.splitlines()
    sentences = sentences[1:-1]  # discard the formula name and the '右X味' line
    sentences = list(filter(lambda line: line != "//////", sentences))
    if data['herbCount'] == -1:
        print(f"Row {indexInExcel}: {formula_name} has no herbCount yet.")
        continue
    print(f"Row {indexInExcel}: {formula_name}")
    herbCount = 0
    actualHerbCount = data['herbCount']
    currentHerbs = []  # herbs still waiting for a dosage line (several when 各 is used)
    for sentence in sentences:
        if sentence in herb_names_set:
            currentHerbs.append(sentence)
            continue
        if len(currentHerbs) == 0:
            continue
        if sentence == '方寸匕':
            # fix: assign to the pending herbs instead of a stale currentHerb
            while len(currentHerbs) > 0:
                currentHerb = currentHerbs.pop()
                print(currentHerb, sentence, 2)
                quantity_model[formula_name][currentHerb] = 2
                herbCount += 1
            continue
        if any(prep_word in sentence for prep_word in preparation_keywords):
            continue
        if sentence.endswith(units) or sentence[:-1].endswith(units):
            _, _, sentence = sentence.rpartition('各')
            while len(currentHerbs) > 0:
                currentHerb = currentHerbs.pop()
                chinese_quantity_model[formula_name][currentHerb] = sentence
                print(currentHerb, sentence, end=' ')
                currentDosage = 0
                if sentence.startswith("大者"):
                    herb_with_size.append(currentHerb)
                presentUnits = getPresentUnits(sentence)
                # handle the case with multiple ordinary units, such as 一兩十分
                if len(presentUnits) > 1:
                    print(presentUnits, end=' ')
                    # units are ordered by position, so each combination is recorded in reading order
                    unit_combinations.add(presentUnits)
                    # fix: consume a copy so that `sentence` stays intact for other herbs sharing this dosage
                    remaining = sentence
                    for unit in presentUnits:
                        head, _, remaining = remaining.partition(unit)
                        currentDosage += getHerbDosage(currentHerb, unit, head + unit, indexInExcel)
                # one unit case
                else:
                    targetUnit = presentUnits[0]
                    currentDosage += getHerbDosage(currentHerb, targetUnit, sentence, indexInExcel)
                herbCount += 1
                quantity_model[formula_name][currentHerb] = currentDosage
                print(currentDosage)
    if herbCount != actualHerbCount:
        diff = actualHerbCount - herbCount
        print(f"Row {indexInExcel}: {formula_name} miss {diff} herbs.")
        missing_herb_formulae.append(f"Row {indexInExcel}: {formula_name}")
    print()

print("Below are the unit combinations")
print(unit_combinations)
print("Below are the formula(s) containing 各等分")
print(arbitrary_equal_share)
print("Below are the formula(s) containing 各")
print(concrete_equal_share)
print("Below are the formula(s) having non-Chinese character in dosage")
print(abnormal_phrase)
print("Below are the formula(s) with missing herbs")
print(missing_herb_formulae)
quantity_model_df = pd.DataFrame.from_dict(quantity_model, orient='index')
chinese_quantity_df = pd.DataFrame.from_dict(chinese_quantity_model, orient='index')
Here is the modelling for the one-hot model.
from collections import Counter

# herbsFrequency is the Counter built in "Code for finding unrecorded herbs"
one_hot_model = pd.DataFrame(index=merge_result.index.copy())
for herb, _ in herbsFrequency.most_common():
    one_hot_model[herb] = 0

for index, (formula_name, data) in enumerate(merge_result.iterrows()):
    indexInExcel = index + 2
    sentences = data["Processed_Text"].strip()
    sentences = sentences.splitlines()
    sentences = sentences[1:]  # discard the first line, which is the formula name
    sentences = list(filter(lambda line: line != "//////", sentences))
    if data['herbCount'] == -1:
        print(f"Row {indexInExcel}: {formula_name} has no herbCount yet.")
        continue
    herbCount = 0
    currentHerb = ""
    for sentence in sentences:
        if sentence in herb_names_set:
            currentHerb = sentence
            one_hot_model.loc[formula_name, currentHerb] = 1  # mark the herb as present
            herbCount += 1
            continue
        if currentHerb == "":
            continue
    if herbCount != data['herbCount']:
        diff = data['herbCount'] - herbCount
        print(f"Row {indexInExcel}: {formula_name} miss {diff} herbs.")

one_hot_model.to_excel("../one_hot_model.xlsx", index=True)
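Stripped of the spreadsheet handling, the one-hot encoding reduces to a membership test per herb. The sketch below uses plain Python and the 麻黃湯方 row from Table 2; the function name `one_hot_encode` and the five-herb vocabulary are illustrative assumptions, whereas the real vocabulary is the full list of observed herbs.

```python
def one_hot_encode(formula_herbs, vocabulary):
    """Return {formula: [1 or 0 per vocabulary herb]}, 1 meaning the herb is present."""
    return {
        name: [1 if herb in herbs else 0 for herb in vocabulary]
        for name, herbs in formula_herbs.items()
    }


vocabulary = ["甘草", "桂枝", "麻黃", "杏仁", "芍藥"]
formula_herbs = {"麻黃湯方": {"甘草", "桂枝", "麻黃", "杏仁"}}

encoded = one_hot_encode(formula_herbs, vocabulary)
print(encoded["麻黃湯方"])  # [1, 1, 1, 1, 0]
```

Working from herb presence rather than from the gram table keeps the formulae that use 「各等分」 in the one-hot model, even though they are left out of the quantity models.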
1. Text Analysis on Collected Exegesis of Recipes, Data Analytics Practice Opportunity 2021/22, https://dsprojects.lib.cuhk.edu.hk/en/projects/chinese-medicine/home/
2. 郝萬山講《傷寒論》 第18講 (Hao Wanshan's lectures on the Shanghan Lun, Lecture 18), https://www.theqi.com/cmed/class/class1/note_18.html