Methodology - Digital Scholarship Projects, CUHK Library

Methodology – Data Extraction

Because the content of The Bibliography is stored as Word tables with fragmented information, we had to process the data manually to retrieve the following related objects:

Names: the authors and the people mentioned in their biographies and related works
Titles: poetry works and other classical writings
Places: the authors’ places of origin (from their biographies) and other related places
Institutions: groups of interest that appear in the authors’ biographies, e.g. poetry societies and workplaces

Sample page content:

Why Extract the Data Manually?

To preserve the integrity of the data for future use, we retrieved all of the above items from each poet’s entry and recorded them in an Excel spreadsheet. We also tried non-manual methods (i.e. programming and software) to retrieve the objects as a list:

Python: we were not able to retrieve a sufficiently detailed list, so we abandoned this approach
CORPRO 庫博: a tool developed by Prof. Chueh Ho-chia of National Taiwan University for retrieving object lists from Chinese texts. We tried it and it retrieved a list of people, but we still needed to spend time distinguishing the 514 authors from the other related people
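
The separation step described for the CORPRO output can be pictured as a simple lookup against a known author list. The following is only a minimal sketch under stated assumptions: the function name, the input lists, and the placeholder names are hypothetical, and it does not reflect CORPRO's actual output format or any script used by the project.

```python
def split_authors(extracted_names, known_authors):
    """Split a list of extracted names into authors and other related people.

    Both arguments are hypothetical inputs: `extracted_names` stands in for
    a tool-generated name list, `known_authors` for a list of the authors.
    Duplicates are dropped and first-seen order is preserved.
    """
    author_set = set(known_authors)
    authors, others = [], []
    seen = set()
    for name in extracted_names:
        if name in seen:
            continue  # skip names already classified
        seen.add(name)
        if name in author_set:
            authors.append(name)
        else:
            others.append(name)
    return authors, others


# Placeholder names for illustration only (not taken from The Bibliography).
extracted = ["Poet A", "Teacher C", "Poet A", "Poet B"]
authors, others = split_authors(extracted, ["Poet A", "Poet B"])
```

Exact string matching of this kind would miss a person who appears under more than one name form, so some manual checking may still be needed even with such a filter.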

After testing the above methods, our team decided to extract the data for all the objects from the book manually so that the result would be as detailed as possible. It was a painful process, but the product will be a complete table of all the objects and their relationships. It will benefit future research once the data table is opened for use.

Below is a screenshot of the main table, into which we extracted all the data objects from each poet’s entry:

In this project, we then used the data for relationship network and spatial visualisation.
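
As an illustration of how a table of this kind could feed a relationship network, the sketch below turns rows into a weighted edge list. It is a hedged example, not the project's actual workflow: the column names `poet` and `related_people`, the `;` separator, and the sample rows are all assumptions made for this illustration.

```python
from collections import Counter


def build_edges(rows, source_col="poet", target_col="related_people", sep=";"):
    """Count (source, target) pairs found in table rows.

    Column names and separator are hypothetical; adjust them to match
    the real spreadsheet layout.
    """
    edges = Counter()
    for row in rows:
        source = row[source_col].strip()
        for target in row[target_col].split(sep):
            target = target.strip()
            # Ignore empty cells and self-references.
            if source and target and source != target:
                edges[(source, target)] += 1
    return edges


# Placeholder rows for illustration only (not taken from the main table).
rows = [
    {"poet": "Poet A", "related_people": "Poet B; Teacher C"},
    {"poet": "Poet B", "related_people": "Poet A;"},
]
edges = build_edges(rows)
```

Each key in `edges` is a directed pair and each value is how often that pair occurs, which is a shape that network visualisation tools commonly accept as an edge list.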