Data and Methods

The CUHK Electronic Theses and Dissertations (“ETD”) collection is a large body of digitized theses by postgraduate students across all disciplines of CUHK. It is hosted by the CUHK Library as one of its Digitised Collections. The collection dates back to the beginnings of CUHK in the mid-1960s and consists of the full text as PDF files as well as metadata for each thesis. The files are publicly accessible and can be downloaded.

Below is an example of one of the theses in the ETD (Fig. 1).

Fig. 1: Example of an ETD record (Source:

The metadata generally provides information on the thesis title and the subjects it covers, the author, the supervisors and their affiliation, the degree type, and the language, along with a few more technical details.

For this analysis for the Data Analytics Practice Opportunity, we made use of the thesis metadata, containing thesis title, subjects, authorship information, the departments/divisions, language, and graduation year. Several other fields were discarded. For the automated bulk download of the collection entries, we used the resumption token of the CUHK Digital Repository, provided by the library.
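The `oai_dc` records and the resumption token suggest that the repository is harvested via the OAI-PMH protocol, in which each `ListRecords` response may carry a token that is sent back to request the next batch. The following Python sketch illustrates such a harvesting loop under that assumption; it is not the code used for this analysis. The helper names `fetch_page` and `harvest` are hypothetical, and the endpoint URL has to be supplied by the caller.

```python
import xml.etree.ElementTree as ET
from urllib.parse import urlencode
from urllib.request import urlopen

# Namespace of OAI-PMH 2.0 response documents.
OAI_NS = "{http://www.openarchives.org/OAI/2.0/}"


def fetch_page(base_url, params):
    """Download one OAI-PMH response (hypothetical helper, needs network access)."""
    with urlopen(base_url + "?" + urlencode(params)) as response:
        return response.read()


def harvest(base_url, fetch=fetch_page, metadata_prefix="oai_dc"):
    """Yield every <record> element, following resumption tokens until none is left."""
    params = {"verb": "ListRecords", "metadataPrefix": metadata_prefix}
    while True:
        root = ET.fromstring(fetch(base_url, params))
        for record in root.iter(OAI_NS + "record"):
            yield record
        token = root.find(".//" + OAI_NS + "resumptionToken")
        # A missing or empty token marks the last batch.
        if token is None or not (token.text or "").strip():
            break
        # Follow-up requests carry only the verb and the token.
        params = {"verb": "ListRecords", "resumptionToken": token.text.strip()}
```

Passing the download function as the `fetch` argument keeps the paging logic separate from the network call, so the loop can be tried out on stored responses first.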

Below is an example of the data structure after retrieving the data.

    <oai_dc:dc xmlns:oai_dc="" xmlns:dc="" xmlns:xsi="" xsi:schemaLocation="">
      <dc:title>Inhibitory and facilitatory effects on the perception of repeatedly presented stimuli.</dc:title>
      <dc:title>Repetition effects</dc:title>
      <dc:subject>Priming (Psychology)</dc:subject>
      <dc:subject>Word recognition</dc:subject>
      <dc:description>Kin Fai Ellick, Wong.</dc:description>
      <dc:description>Thesis (M.Phil.)--Chinese University of Hong Kong, 1997.</dc:description>
      <dc:description>Includes bibliographical references (leaves 74-83).</dc:description>
      <dc:contributor>Wong, Kin Fai Ellick.</dc:contributor>
      <dc:contributor>Chinese University of Hong Kong Graduate School. Division of Psychology.</dc:contributor>
      <dc:format>86 leaves : ill. ; 30 cm.</dc:format>
      <dc:rights>Use of this resource is governed by the terms and conditions of the Creative Commons “Attribution-NonCommercial-NoDerivatives 4.0 International” License (</dc:rights>

The metadata for each thesis was extracted and stored in an .xml file. Each piece of information is enclosed by an opening tag <…> and a closing tag </…>.
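As an illustration of how such a record can be read, the following Python sketch collects the repeated Dublin Core fields into lists. It is a minimal sketch, not the extraction code used here. The namespace URIs are the standard `oai_dc` and Dublin Core ones and are an assumption, since they are blank in the example above; `parse_record` is a hypothetical helper name, and the sample is shortened to a few fields.

```python
import xml.etree.ElementTree as ET
from collections import defaultdict

# Standard Dublin Core element namespace (assumed, see lead-in).
DC_NS = "http://purl.org/dc/elements/1.1/"


def parse_record(xml_text):
    """Return a dict mapping each dc field name to the list of its values."""
    root = ET.fromstring(xml_text)
    fields = defaultdict(list)
    for elem in root.iter():
        if elem.tag.startswith("{" + DC_NS + "}"):
            name = elem.tag.split("}", 1)[1]  # strip the namespace prefix
            fields[name].append((elem.text or "").strip())
    return dict(fields)


SAMPLE = """<oai_dc:dc xmlns:oai_dc="http://www.openarchives.org/OAI/2.0/oai_dc/"
                       xmlns:dc="http://purl.org/dc/elements/1.1/">
  <dc:title>Inhibitory and facilitatory effects on the perception of repeatedly presented stimuli.</dc:title>
  <dc:title>Repetition effects</dc:title>
  <dc:subject>Priming (Psychology)</dc:subject>
  <dc:subject>Word recognition</dc:subject>
  <dc:description>Thesis (M.Phil.)--Chinese University of Hong Kong, 1997.</dc:description>
</oai_dc:dc>"""

record = parse_record(SAMPLE)
print(record["subject"])  # ['Priming (Psychology)', 'Word recognition']
```

Keeping every field as a list reflects that elements such as title, subject, and description occur several times per record.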


Certain elements were systematically extracted from the existing text data and factorized or retained as numeric variables. Tab. 1 below shows their distribution, whereas the processing and distribution of the text variables can be found in the next section.

Variable | Values | Description
Date | 1967-2021 | Publication year of thesis
Language | English; Chinese | Binary for the primary thesis language
Degree | Masters Degree; PhD Degree | Binary for the degree level the thesis was written for
Departments/Divisions (Top 10) | School of Life Sciences; CUHK Business School; Dep. of Physics; Dep. of Computer Science and Engineering; Dep. of Chemistry; Fac. of Education; Dep. of Electronic Engineering; Dep. of Medicine and Therapeutics; Dep. of Information Engineering; Dep. of Mathematics | Categorical variable for the affiliated departments. In a few cases a further distinction into the departments was not possible based on the metadata.
Faculty | Fac. of Science; Fac. of Engineering; Fac. of Social Science; Fac. of Arts; Fac. of Medicine; CUHK Business School; Fac. of Education; Fac. of Law | Categorical variable for the home faculty of the departments/divisions.
Academic Area | STEM; Social Sciences | Categorical variable for the academic area of the faculties.
Tab. 1: Variables
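To illustrate how year and degree can be pulled out of the free-text description field, here is a minimal Python sketch based on the example record shown earlier. The regular expression, the function names `extract_degree_and_year` and `degree_level`, and the rule for mapping degree strings to the binary level are assumptions for illustration; the source does not document the actual extraction rules.

```python
import re

# Matches e.g. "Thesis (M.Phil.)--Chinese University of Hong Kong, 1997."
THESIS_PATTERN = re.compile(r"Thesis \((?P<degree>[^)]+)\)--.*?(?P<year>\d{4})")


def extract_degree_and_year(descriptions):
    """Return (degree string, year) from the first matching description, else (None, None)."""
    for text in descriptions:
        match = THESIS_PATTERN.search(text)
        if match:
            return match.group("degree"), int(match.group("year"))
    return None, None


def degree_level(degree):
    """Map a raw degree string to a binary level (assumed rule, not from the source)."""
    if degree is None:
        return None
    key = degree.replace(".", "").lower()
    if key.startswith("ph"):
        return "PhD Degree"
    if key.startswith("m"):
        return "Masters Degree"
    return None  # other degree strings would need manual review


descriptions = [
    "Kin Fai Ellick, Wong.",
    "Thesis (M.Phil.)--Chinese University of Hong Kong, 1997.",
    "Includes bibliographical references (leaves 74-83).",
]
degree, year = extract_degree_and_year(descriptions)
print(degree_level(degree), year)  # Masters Degree 1997
```

The prefix rule is deliberately crude; any degree string it does not recognise is returned as `None` rather than guessed.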

The frequency distribution of the available theses by year is left-skewed, with an almost continuous increase in the number of theses over time. The edited ETD dataset contains 315 theses for the first ten years (1967-1976), compared to 6,200 for the last ten years of the analysis (2012-2021). The increase appears roughly linear, with few outlier years (Fig. 2).

Fig. 2: Number of Theses in Each Year

Breaking the theses down by language, we can clearly observe the higher prevalence of theses written in English (Fig. 3). Over roughly the last ten years, the ratio of English-language to Chinese-language theses has remained more or less stable, with a clear majority of theses written in English.

Fig. 3: Theses Frequency Distribution by Language

We see a diverging trend in recent years: the number of PhD theses in the ETD is growing, while the number of Masters theses is declining (Fig. 4). This becomes particularly visible after 2009.

Fig. 4: Theses Frequency Distribution by Degree

The departments/divisions – mainly departments, but also schools or programs – were extracted from the metadata (Tab. 2). For theses written in the Faculty of Education and the Faculty of Law, a further classification was not possible based on the metadata.

Departments/Divisions | Number of Theses | Share (%)
School of Life Sciences | 1,731 | 8.9
CUHK Business School | 1,674 | 8.6
Dep. of Physics | 931 | 4.8
Dep. of Computer Science and Engineering | 825 | 4.2
Dep. of Chemistry | 807 | 4.1
Fac. of Education | 785 | 4.0
Dep. of Electronic Engineering | 738 | 3.8
Dep. of Medicine and Therapeutics | 667 | 3.4
Dep. of Information Engineering | 643 | 3.3
Dep. of Mathematics | 632 | 3.2
Dep. of Economics | 559 | 2.9
School of Architecture | 551 | 2.8
Dep. of Chinese Language and Literature | 512 | 2.6
Dep. of Psychology | 491 | 2.5
Dep. of History | 486 | 2.5
Dep. of Systems Engineering and Engineering Management | 468 | 2.4
Dep. of Mechanical and Automation Engineering | 397 | 2.0
Dep. of Cultural and Religious Studies | 392 | 2.0
Dep. of Statistics | 388 | 2.0
School of Biomedical Sciences | 367 | 1.9
Dep. of English | 342 | 1.8
Dep. of Sociology | 336 | 1.7
Dep. of Philosophy | 302 | 1.5
Dep. of Anatomical and Cellular Pathology | 287 | 1.5
School of Journalism and Communication | 278 | 1.4
Dep. of Geography and Resource Management | 269 | 1.4
Dep. of Social Work | 204 | 1.0
Dep. of Music | 191 | 1.0
Dep. of Surgery | 191 | 1.0
Dep. of Government and Public Administration | 190 | 1.0
Dep. of Fine Arts | 183 | 0.9
Dep. of Anthropology | 165 | 0.8
The Jockey Club School of Public Health and Primary Care | 162 | 0.8
Nethersole School of Nursing | 149 | 0.8
Dep. of Chemical Pathology | 143 | 0.7
Dep. of Orthopaedics and Traumatology | 138 | 0.7
Gender Studies Program | 101 | 0.5
School of Chinese Medicine | 100 | 0.5
Dep. of Obstetrics and Gynaecology | 86 | 0.4
School of Pharmacy | 86 | 0.4
Dep. of Linguistics and Modern Languages | 83 | 0.4
Dep. of Microbiology | 70 | 0.4
Dep. of Ophthalmology and Visual Sciences | 64 | 0.3
Dep. of Translation | 58 | 0.3
Dep. of Biomedical Engineering | 47 | 0.2
Earth System and Geoinformation Sciences Program | 46 | 0.2
Dep. of Imaging and Interventional Radiology | 42 | 0.2
Fac. of Law | 42 | 0.2
Div. of Earth and Environmental Sciences | 38 | 0.2
Centre for China Studies | 37 | 0.2
Dep. of Anaesthesia and Intensive Care | 37 | 0.2
Dep. of Japanese Studies | 33 | 0.2
The School of Accountancy | 9 | 0.0
Dep. of Marketing | 4 | 0.0
Dep. of Finance | 1 | 0.0
The Divinity School of Chung Chi College | 1 | 0.0
Tab. 2: Departments/Divisions

The above departments were further manually allocated to their home faculties (Tab. 3).

Faculty | Number of Theses | Share (%)
Faculty of Science | 4,643 | 23.8
Faculty of Engineering | 3,118 | 16.0
Faculty of Social Science | 3,016 | 15.5
Faculty of Arts | 2,748 | 14.1
Faculty of Medicine | 2,519 | 12.9
CUHK Business School | 1,688 | 8.7
Faculty of Education | 785 | 4.0
Faculty of Law | 42 | 0.2
Tab. 3: Faculty

In a last step, the faculties were categorized according to the commonly used academic macro-areas (Tab. 4). We acknowledge that there are different views about certain classifications. Moreover, different academic traditions can lead to different self-identification of a department.

Academic Area | Number of Theses | Share (%)
Social Sciences | 5,489 | 28.1
Tab. 4: Academic Area
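The aggregation from faculties to academic areas can be checked with simple arithmetic. The Python sketch below sums the faculty counts from Tab. 3 for one candidate grouping. The grouping itself is an assumption: the text does not list which faculties form the Social Sciences area, and the three faculties below were chosen only because their counts add up to the Social Sciences figure in Tab. 4.

```python
# Thesis counts per faculty, copied from Tab. 3.
faculty_counts = {
    "Faculty of Science": 4643,
    "Faculty of Engineering": 3118,
    "Faculty of Social Science": 3016,
    "Faculty of Arts": 2748,
    "Faculty of Medicine": 2519,
    "CUHK Business School": 1688,
    "Faculty of Education": 785,
    "Faculty of Law": 42,
}

# Assumed grouping (not stated in the text): it reproduces the Tab. 4 count.
SOCIAL_SCIENCES = [
    "Faculty of Social Science",
    "CUHK Business School",
    "Faculty of Education",
]

social_total = sum(faculty_counts[name] for name in SOCIAL_SCIENCES)
print(social_total)  # 5489
```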

Unsupervised Natural Language Processing

A latent topic model analysis assumes that a body of different texts contains undetected or hidden – “latent” – topics through which the texts are connected. Treating texts as word vectors with one word as a unit, simple topic models calculate the correlations between these vectors. For each text – in our case the word vectors from the titles and the keywords – the proportion of each topic can then be calculated. If the model estimates that a latent topic is highly likely to be present in a text, the topic is assigned a value close to 1; if it is highly unlikely, the value is close to 0.
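To make the 0-to-1 scale concrete: in a correlated topic model, the model family named later in this section, each document has one real-valued score per topic, and these scores are mapped to proportions with a softmax transformation. The Python sketch below shows only this mapping, not model estimation; the function name `topic_proportions` and the example scores are made up for illustration.

```python
import math


def topic_proportions(scores):
    """Map real-valued topic scores to proportions in [0, 1] that sum to 1 (softmax)."""
    shift = max(scores)  # subtracting the maximum avoids overflow in exp()
    exps = [math.exp(s - shift) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


# Made-up scores for three topics: the first topic clearly dominates.
proportions = topic_proportions([5.0, 0.0, 0.0])
print([round(p, 3) for p in proportions])
```

A document with one dominant score receives a proportion close to 1 for that topic and values close to 0 for the others.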

We first applied the conventional preprocessing steps for common bag-of-words models at the word level.

  • Non-ASCII characters, numbers, and all punctuation symbols were removed from the data, as they either carry no interpretable meaning or serve a purely functional role in a sentence.
  • Multiple white spaces were collapsed into a single space, and capital letters were converted to lowercase.
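These first cleaning steps can be sketched in a few lines of Python. This is an illustration, not the original code (the analysis itself refers to R); the function name `normalize` is hypothetical, and replacing punctuation with a space rather than deleting it is an assumption.

```python
import re


def normalize(text):
    """Apply the first cleaning pass: ASCII letters only, single spaces, lowercase."""
    # Drop non-ASCII characters.
    text = text.encode("ascii", errors="ignore").decode("ascii")
    # Replace digits and punctuation with a space so adjacent words stay separate.
    text = re.sub(r"[^A-Za-z\s]", " ", text)
    # Collapse runs of white space into a single space.
    text = re.sub(r"\s+", " ", text).strip()
    return text.lower()


print(normalize("Priming (Psychology),  1997"))  # priming psychology
```

Replacing punctuation with a space keeps hyphenated terms such as "word-recognition" as two tokens instead of fusing them into one.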

In the second step, additional groups of irrelevant words were excluded, as they are generally also non-interpretative or are residual characters from earlier editing:

  • Single isolated letters, numerals, and roman numerals.
  • Auxiliary verbs in all tenses.
  • All pronouns, conjunctions, and prepositions, according to official English language lists.
  • Time words, including weekdays, times of the day, and months.
  • Words that refer to the structure or goal of a thesis, rather than their content, such as “review” or “hypothesis”.
  • Words in pinyin.
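The second filtering step can be sketched as a lookup against exclusion lists. The word lists in the Python sketch below are short placeholders chosen for illustration and are not the lists used in the analysis; only “review” and “hypothesis” are taken from the text. Pinyin filtering is left out because it would need a syllable lexicon that the text does not specify.

```python
# Placeholder lists (assumptions); the real lists are far longer.
ROMAN_NUMERALS = {"ii", "iii", "iv", "vi", "vii", "viii", "ix", "xi", "xii"}
AUXILIARIES = {"be", "is", "are", "was", "were", "been", "have", "has", "had",
               "do", "does", "did", "will", "would", "can", "could", "may", "might"}
FUNCTION_WORDS = {"we", "you", "he", "she", "it", "they", "this", "that",
                  "and", "or", "but", "nor", "of", "in", "on", "at", "to",
                  "for", "with", "by", "from"}
TIME_WORDS = {"monday", "tuesday", "january", "february", "morning", "evening"}
STRUCTURE_WORDS = {"review", "hypothesis"}  # examples given in the text

EXCLUDED = (ROMAN_NUMERALS | AUXILIARIES | FUNCTION_WORDS
            | TIME_WORDS | STRUCTURE_WORDS)


def filter_tokens(text):
    """Split cleaned, lowercase text and drop single letters and excluded words."""
    return [tok for tok in text.split() if len(tok) > 1 and tok not in EXCLUDED]


print(filter_tokens("a review of priming effects in word recognition ii"))
# ['priming', 'effects', 'word', 'recognition']
```

Single-letter roman numerals such as "i", "v", and "x" are already caught by the length check, so the numeral list only needs the longer forms.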

We also removed duplicated words, e.g. those that appeared both in the title and in the keywords. Finally, we applied the stopword list of R (Version 2023.09.1) and stemmed the words. After transforming the text data into a document-term matrix, we estimated a Correlated Topic Model with 15 latent topics and derived the per-document topic prevalence for each thesis. If the predicted probability of a topic exceeded a threshold of 0.5, the topic was assigned a 1 for that thesis; otherwise it received a 0. The distribution of the dichotomized topic prevalence was used to construct the edge weights for the network visualizations.
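The dichotomization and one possible way of turning the indicators into edge weights are sketched below in Python. The text does not say which nodes the networks connect, so counting theses per faculty-topic pair is an assumption made for illustration, as are the function names and the toy proportions (made-up values, not results).

```python
from collections import Counter


def dichotomize(proportions, threshold=0.5):
    """Turn per-document topic proportions into 0/1 indicators."""
    return [[1 if p > threshold else 0 for p in row] for row in proportions]


def edge_weights(groups, indicators):
    """Count, for each (group, topic) pair, the documents whose indicator is 1."""
    weights = Counter()
    for group, row in zip(groups, indicators):
        for topic, flag in enumerate(row):
            if flag:
                weights[(group, topic)] += 1
    return weights


# Toy data: four theses, three topics (made-up values).
theta = [
    [0.70, 0.20, 0.10],
    [0.10, 0.80, 0.10],
    [0.40, 0.35, 0.25],  # no topic exceeds 0.5, so this thesis adds no edge
    [0.60, 0.30, 0.10],
]
faculties = ["Science", "Arts", "Science", "Science"]

weights = edge_weights(faculties, dichotomize(theta))
print(dict(weights))  # {('Science', 0): 2, ('Arts', 1): 1}
```

If the topic proportions of a thesis sum to 1, at most one topic can exceed 0.5, so each thesis contributes to at most one edge under this construction.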