Methodology
Text Digitization
For the data source, we chose to extract our data from the digital copies of the text on the CUHK Library's Digital Repository. Before any deeper analysis could be carried out, however, the page images had to be converted into machine-readable text. We are grateful to Academia Sinica, whose OCR technology enabled the scanned pages on the Digital Repository to be transformed into digital text.
Proofreading
After receiving the first digitized draft, we proofread the text and returned it to Academia Sinica for calibration in order to reduce errors introduced during scanning. Although the characters themselves were converted correctly, the output was not well suited to text analysis because the text was neither organized nor properly spaced. After consideration, we therefore switched our data source to the Chinese Text Project, an online platform that hosts a well-aligned and marked-up version of the text. The text on the platform was carefully checked against the OCR version before use.
Data Cleansing
Upon receiving the prepared text files, we built a keyword database for the scanning algorithm. During this process, we found that some ingredient names in the text were not recorded in the database, and that name variations would also lower the accuracy of the algorithm. Data cleansing was therefore carried out to improve accuracy: we manually compared parts of the text with the ingredient database, and then either updated the content of the database or added exceptions for name variations to the algorithm.
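As a minimal, self-contained sketch of how such name variations could be normalised before matching (the variant mapping below is a hypothetical placeholder, not our actual exception list), a lookup table can map a variant spelling back to the canonical database name:

# Sketch only: the mapping entries are hypothetical placeholders
name_variants = {
    "變體名": "正名",  # variant spelling found in the text -> canonical database name
}

def canonical(name):
    # Return the canonical database form of a name; unknown names pass through unchanged
    return name_variants.get(name, name)

print(canonical("變體名"))  # -> "正名"
print(canonical("甘草"))    # -> "甘草" (no variant recorded)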
Data Extraction
After data cleansing, we moved on to data extraction. The extraction algorithm is written in Python because of its diverse libraries. The key code is shown below.
import csv

# Load the ingredient names from the database file (藥物種名.csv) into a list
ingredients = []
with open("藥物種名.csv", 'r', encoding="utf-8-sig") as file:
    csvreader = csv.reader(file)
    for row in csvreader:
        str1 = ' '.join(row)
        ingredients.append(str1)
First, the code reads the ingredient data from the database (stored in .csv format) and stores it in a list for the comparisons that follow.
# Load the index file, which lists each topic name followed by its recipes
indexElement = []
with open("index.csv", "r", encoding="utf-8") as file:
    csvreader = csv.reader(file)
    for row in csvreader:
        strr = ' '.join(row)
        indexElement.append(strr)

# Regroup the flat index into alternating topic names and recipe lists
dictList = []
rcp = []
for x in range(len(indexElement)):
    if x == 0 or not indexElement[x-1]:
        # A topic name starts the file or follows a blank line
        dictList.append(indexElement[x])
    elif not indexElement[x] or indexElement[x] == '.':
        # A blank line or "." ends the current topic's recipe list
        dictList.append(rcp)
        rcp = []
        continue
    else:
        rcp.append(indexElement[x])

indexTable = ConvertListToDict(dictList)
Hash table (indexTable)
Key: Name of topic
Value: List of recipes
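The helper ConvertListToDict is not reproduced above. A minimal sketch, assuming dictList alternates between a topic name and the list of its recipes, might look like this:

def ConvertListToDict(lst):
    # Pair each topic name (even index) with the recipe list that follows it (odd index)
    result = {}
    for i in range(0, len(lst), 2):
        result[lst[i]] = lst[i + 1]
    return result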
# For each topic, read its text file and extract the ingredients used in every recipe
herbs = {}
for x in indexTable:
    with open(x + ".txt", "r", encoding="utf-8") as fl:
        line = fl.read()
    for i in range(len(indexTable[x])):
        if i + 1 != len(indexTable[x]):
            # Slice the text between the current recipe's marker and the next recipe's marker
            portion = line[line.find(indexTable[x][i] + "〔"):line.find(indexTable[x][i+1] + "〔")]
        else:
            # The last recipe of the topic runs to the end of the file
            portion = line[line.find(indexTable[x][i] + "〔"):]
        # Record every known ingredient that appears in this portion, without duplicates
        tmp = []
        for ingredient in ingredients:
            if ingredient in portion:
                tmp.append(ingredient)
        herbs[indexTable[x][i]] = list(set(tmp))
A “〔” marker was added after the name of each recipe in the book so that the program can recognise the portion of text to be read for each recipe.
Hash table (herbs)
Key: Name of recipe
Value: List of ingredients
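As a toy illustration of the marker-based slicing (the recipe names and contents below are invented), line.find(name + "〔") locates the start of a recipe, and slicing up to the next marker isolates that recipe's text:

# Toy example only; recipe names and contents are invented
line = "甲方〔……甘草……〕乙方〔……人參……〕"
names = ["甲方", "乙方"]
portion = line[line.find(names[0] + "〔"):line.find(names[1] + "〔")]
print(portion)  # -> "甲方〔……甘草……〕"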
# Count how often each ingredient appears across the recipes of every topic
frequency = {}
for func in indexTable:
    tp = {}
    for j in range(len(indexTable[func])):
        for ingredient in herbs[indexTable[func][j]]:
            if ingredient not in tp.keys():
                tp[ingredient] = 1
            else:
                tp[ingredient] += 1
    frequency[func] = tp
Nested hash table (frequency)
Key: Name of topic
Value: Another hash table which stores the frequency of ingredient in each topic
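For instance, the most frequently used ingredients of a topic can then be read directly from the nested table; the snippet below simply picks one topic as an example:

# Example query: the five most frequent ingredients of one topic
topic = next(iter(frequency))  # any topic serves as an example here
top5 = sorted(frequency[topic].items(), key=lambda kv: kv[1], reverse=True)[:5]
for ingredient, count in top5:
    print(ingredient, count)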
# Count how often each unordered pair of ingredients appears together in the same recipe
concurrent = {}
for x in indexTable:
    tp = {}
    for y in range(len(indexTable[x])):
        for z in range(len(herbs[indexTable[x][y]])):
            for i in range(z + 1, len(herbs[indexTable[x][y]])):
                a = herbs[indexTable[x][y]][z]
                b = herbs[indexTable[x][y]][i]
                if frozenset([a, b]) not in tp.keys():
                    tp[frozenset([a, b])] = 1
                else:
                    tp[frozenset([a, b])] += 1
    concurrent[x] = tp
Nested hash table (concurrent)
Key: Name of topic
Value: Another hash table which stores the frequency of each pair of ingredients
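Similarly, the most common ingredient pairs of a topic can be listed from this table; the snippet below again picks one topic as an example:

# Example query: the five most frequent ingredient pairs of one topic
topic = next(iter(concurrent))  # any topic serves as an example here
top_pairs = sorted(concurrent[topic].items(), key=lambda kv: kv[1], reverse=True)[:5]
for pair, count in top_pairs:
    print(" + ".join(pair), count)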