
Methodology - Data Extraction

Because The Bibliography is stored as Word tables with fragmented information, we had to process the data manually to retrieve the different related objects:

 

A sample page content:

Bibliography Sample page

 

Why Extract the Data Manually?

To preserve the integrity of the data for future use, we retrieved all of the above related items from each poet's entry into an Excel spreadsheet. We also tried two non-manual methods (programming and software) for retrieving the objects as a list:

  1. Python: we were not able to retrieve a sufficiently detailed list, so we abandoned this approach.
  2. CORPRO 庫博: a tool developed by Prof. Chueh Ho-chia of National Taiwan University for retrieving object lists from Chinese texts. It retrieved a list of people, but we would have needed to spend time distinguishing the 514 authors from the other related people.
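The separation step described in the second item can be sketched in Python as follows. This is only an illustration, not the method the team used: it assumes a list of the author names is already available, the function name `split_authors` is hypothetical, and the names in the sample data are placeholders. It relies on exact string matching, so name variants (for example courtesy names or pen names) would still need manual review.

```python
def split_authors(extracted_people, known_authors):
    """Partition extracted names into known authors and other related people.

    Assumption: `known_authors` is a separately prepared list of author names.
    Matching is exact after trimming whitespace; duplicates are dropped.
    """
    known = {name.strip() for name in known_authors}
    authors, others = [], []
    seen = set()
    for raw in extracted_people:
        name = raw.strip()
        if not name or name in seen:
            continue  # skip blanks and names already handled
        seen.add(name)
        if name in known:
            authors.append(name)
        else:
            others.append(name)
    return authors, others


# Placeholder names only; not taken from The Bibliography.
extracted = ["Poet A", "Editor X", "Poet B", "Poet A ", "Friend Y"]
known = ["Poet A", "Poet B", "Poet C"]

authors, others = split_authors(extracted, known)
print(authors)
print(others)
```

Even with such a filter, every name in the "others" group would still have to be checked by hand for variant spellings, which is part of why a fully manual pass can be the more reliable choice.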

After testing the above methods, our team decided to extract the data for all the objects from the book manually so that the result would be as detailed as possible. It was a painful process, but the product will be a complete table of all the objects and their relationships. It will benefit future research once the data table is open for use.

 

Below is a screenshot of the main table, into which we extracted all the data objects from each poet's entry:

Poet's data in spreadsheet

 

In this project, we then used the data for relationship network and spatial visualisation.
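As a rough illustration of how a table of this kind can feed a relationship network, the sketch below turns rows into an edge list and counts how many connections each person has. The column names (`poet`, `related_person`, `relationship`), the function names, and the sample rows are all hypothetical placeholders; the project's actual spreadsheet layout and visualisation tools are not described here.

```python
import csv
import io
from collections import Counter

# Hypothetical export of the spreadsheet; columns and rows are placeholders.
SAMPLE_CSV = """poet,related_person,relationship
Poet A,Poet B,friend
Poet A,Editor X,editor
Poet B,Poet A,friend
"""


def build_edges(csv_text):
    """Read rows and return (source, target, relationship) tuples."""
    reader = csv.DictReader(io.StringIO(csv_text))
    edges = []
    for row in reader:
        source = row["poet"].strip()
        target = row["related_person"].strip()
        if source and target:  # ignore rows missing either endpoint
            edges.append((source, target, row["relationship"].strip()))
    return edges


def degree_counts(edges):
    """Count how many edges touch each person, in either direction."""
    counts = Counter()
    for source, target, _ in edges:
        counts[source] += 1
        counts[target] += 1
    return counts


edges = build_edges(SAMPLE_CSV)
degrees = degree_counts(edges)
print(edges)
print(degrees.most_common())
```

An edge list in this shape is a common input format for network visualisation software, so keeping one row per relationship in the source table makes the later step straightforward.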