Text Data Preparation: A Practice in R using the Sheng Xuanhuai Collection

Text mining, the use of computer software or packages to extract and analyse text data from a corpus of textual sources (such as documents, correspondence, books and journals) in order to reveal meaningful patterns and relationships hidden in the text, is becoming popular in digital scholarship research. The technique can be understood as the application of linguistic, statistical and machine learning methods to a set of structured or unstructured textual sources in order to make sense of the content with minimal manual effort. Many of the software tools and packages for this task were developed to analyse English-language materials. Processing and analysing natural language is demanding for computers, and handling Chinese-language text in an ancient writing style is more difficult still. The CUHK Library holds many rare and precious Chinese-language materials and has digitized them. The Library’s Digital Scholarship team is interested in experimenting with text-mining techniques to shed light on the Library’s Chinese digital collections.

This project was initiated by the Library and created by Dr. Yun Tai, our Postdoctoral Fellow in Digital Scholarship. It uses “R” to process and analyse the Sheng Xuanhuai Collection, which is owned by the Art Museum of CUHK, to demonstrate how computational text processing and analysis can be done for Chinese texts. The Sheng Xuanhuai Collection contains correspondence between the entrepreneur Sheng Xuanhuai and other individuals in the late Qing period. The Library has digitized all 77 volumes of correspondence, which comprise more than 30,000 pages and over 7.5 million words. In addition, transcriptions of these texts are available as text files. The texts are also coded with labels/variables such as title, sender name, receiver name, date, keywords and locations mentioned in the texts. Given this large corpus of text data, the Digital Scholarship team finds that transforming the texts into machine-readable formats allows researchers to conduct studies using computational text analysis and other relevant methods. Hence, this project aims to explore possible ways of conducting research with the texts in this collection.

R word-segmentation packages (e.g. jiebaR) were selected for this project and have initially been applied to two volumes as a proof of concept. The project describes the entire computational text-processing workflow, from setting up the R environment to creating a matrix of word counts (a term-document matrix, or TDM) and a word cloud. All the code is available here. This is only the first stage of the work. It is hoped that the detailed documentation will enable researchers to work further on other Sheng documents and to open up further possibilities for text analysis, such as text clustering or topic modelling, in the next stage of development. If you are interested in working with the team on the next stage, you are welcome to contact the team at