Data and Methods – Digital Scholarship Projects, CUHK Library

Data and Methodology

The CUHK Electronic Theses and Dissertations (“ETD”) is a large collection of digitized graduate theses from postgraduate students across all disciplines of CUHK, hosted by the CUHK Library; it is one of the Digitised Collections and can be accessed here. It dates back to the beginnings of CUHK in the mid-1960s, and consists of both full text in pdf files as well as metadata of each thesis. The files are publicly accessible and can be downloaded.

Below is an example of one of the theses in the ETD (Fig. 1).

**Fig. 1**: Example of an ETD record (Source: https://repository.lib.cuhk.edu.hk/en/islandora/object/cuhk%3A321953/metadata).

The metadata generally provides information on the theses title and the subjects it covers, the author, supervisors and their affiliation, the degree type, language, among a few other more technical details.

For the purpose of this analysis for the Data Analytics Practice Opportunity, we made use of the theses metadata, containing thesis title, subjects, authorship information, the departments/divisions, language, and graduation year. Several other fields were discarded. For the automatic mass download of the collection entries, we used the resumption token of the CUHK Digital Repository, provided by the library.

Below is an example for the data structure after retrieving the data.

<record>
  <header>
    <identifier>oai:cuhk-dr:cuhk_321953</identifier>
    <datestamp>2023-04-27T17:56:00Z</datestamp>
    <setSpec>cuhk_etd</setSpec>
  </header>
  <metadata>
    <oai_dc:dc xmlns:oai_dc="http://www.openarchives.org/OAI/2.0/oai_dc/" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:schemaLocation="http://www.openarchives.org/OAI/2.0/oai_dc/ http://www.openarchives.org/OAI/2.0/oai_dc.xsd">
      <dc:title>Inhibitory and facilitatory effects on the perception of repeatedly presented stimuli.</dc:title>
      <dc:title>Repetition effects</dc:title>
      <dc:subject>Memory</dc:subject>
      <dc:subject>Priming (Psychology)</dc:subject>
      <dc:subject>Word recognition</dc:subject>
      <dc:description>Kin Fai Ellick, Wong.</dc:description>
      <dc:description>Thesis (M.Phil.)--Chinese University of Hong Kong, 1997.</dc:description>
      <dc:description>Includes bibliographical references (leaves 74-83).</dc:description>
      <dc:contributor>Wong, Kin Fai Ellick.</dc:contributor>
      <dc:contributor>Chinese University of Hong Kong Graduate School. Division of Psychology.</dc:contributor>
      <dc:date>1997</dc:date>
      <dc:type>Text</dc:type>
      <dc:type>bibliography</dc:type>
      <dc:format>print</dc:format>
      <dc:format>86 leaves : ill. ; 30 cm.</dc:format>
      <dc:identifier>cuhk:321953</dc:identifier>
      <dc:identifier>http://library.cuhk.edu.hk/record=b5889345</dc:identifier>
      <dc:language>eng</dc:language>
      <dc:rights>Use of this resource is governed by the terms and conditions of the Creative Commons “Attribution-NonCommercial-NoDerivatives 4.0 International” License (http://creativecommons.org/licenses/by-nc-nd/4.0/)</dc:rights>
      <dc:identifier>https://repository.lib.cuhk.edu.hk/en/item/cuhk-321953</dc:identifier>
    </oai_dc:dc>
  </metadata>
</record>

The metadata for each thesis was extracted and stored in an .xml file. The different information was contained by the opening delimitator <…> and ended by </…>.

Description

Certain elements were systematically extracted from the existing text data and factorized or retained as numeric variables. Tab. 1 below shows their distribution, whereas the processing and distribution of the text variables can be found in the next section.

Variable	Categories	Description
Date	1967-2021	Publication year of thesis
Language	English Chinese	Binary for the primary thesis language
Degree	Masters Degree PhD Degree	Binary for the degree level the thesis was written for
Departments/Divisions (Top 10)	School of Life Sciences CUHK Business School Dep. of Physics Dep. of Computer Science and Engineering Dep. of Chemistry Fac. of Education Dep. of Electronic Engineering Dep. of Medicine and Therapeutics Dep. of Information Engineering Dep. of Mathematics [+46]	Categorical variable for the affiliated departments. In a few cases a further distinction into the departments was not possible based on the metadata.
Faculty	Fac. of Science Fac. of Engineering Fac. of Social Science Fac. of Arts Fac. of Medicine CUHK Business School Fac. of Education Fac. of Law	Categorical variable for the home faculty of the departments/divisions.
Academic Area	STEM Social Sciences Humanities	Categorical variable for the academic area of the faculties.

Tab. 1: Variables

The frequency distribution of the theses available by year indicates a left-skewed distribution, with an almost continuous increase in the number of theses by later years. For the first ten years (1967-1976), 315 theses are in the edited ETD dataset, compared to 6,200 for the last ten years of the analysis (2012-2021). The increase appears rather linear with little tendency of outlier years (Fig. 2).

**Fig. 2**: Number of Theses in Each Year

Highlighting the share of English and Chinese language theses, we can clearly observe the higher prevalence of theses written in English (Fig. 3). Over the last around ten years, the ratio of English-to-Chinese language theses seems to remain more or less stable, with a clear majority of theses written in English.

**Fig. 3**: Theses Frequency Distribution by Language

We see a diverging trend in the last years, with the number of PhD theses in the ETD growing, while the number of Masters theses seeing a decline (Fig. 4). This becomes particularly visible after the year 2009.

**Fig. 4**: Theses Frequency Distribution by Degree

From the metadata, the information of the departments/divisions – mainly department, but also schools or programs – was extracted (Tab. 2). For the theses written in the faculty of education and the faculty of law, a further classification was not possible from the information from the metadata.

Departments/Divisions	Number of Theses	Share
School of Life Sciences	1,731	8.9
CUHK Business School	1,674	8.6
Dep. of Physics	931	4.8
Dep. of Computer Science and Engineering	825	4.2
Dep. of Chemistry	807	4.1
Fac. of Education	785	4.0
Dep. of Electronic Engineering	738	3.8
Dep. of Medicine and Therapeutics	667	3.4
Dep. of Information Engineering	643	3.3
Dep. of Mathematics	632	3.2
Dep. of Economics	559	2.9
School of Architecture	551	2.8
Dep. of Chinese Language and Literature	512	2.6
Dep. of Psychology	491	2.5
Dep. of History	486	2.5
Dep. of Systems Engineering and Engineering Management	468	2.4
Dep. of Mechanical and Automation Engineering	397	2.0
Dep. of Cultural and Religious Studies	392	2.0
Dep. of Statistics	388	2.0
School of Biomedical Sciences	367	1.9
Dep. of English	342	1.8
Dep. of Sociology	336	1.7
Dep. of Philosophy	302	1.5
Dep. of Anatomical and Cellular Pathology	287	1.5
School of Journalism and Communication	278	1.4
Dep. of Geography and Resource Management	269	1.4
Dep. of Social Work	204	1.0
Dep. of Music	191	1.0
Dep. of Surgery	191	1.0
Dep. of Government and Public Administration	190	1.0
Dep. of Fine Arts	183	0.9
Dep. of Anthropology	165	0.8
The Jockey Club School of Public Health and Primary Care	162	0.8
Nethersole School of Nursing	149	0.8
Dep. of Chemical Pathology	143	0.7
Dep. of Orthopaedics and Traumatology	138	0.7
Gender Studies Program	101	0.5
School of Chinese Medicine	100	0.5
Dep. of Obstetrics and Gynaecology	86	0.4
School of Pharmacy	86	0.4
Dep. of Linguistics and Modern Languages	83	0.4
Dep. of Microbiology	70	0.4
Dep. of Opthalmology and Visual Sciences	64	0.3
Dep. of Translation	58	0.3
Dep. of Biomedical Engineering	47	0.2
Earth System and Geoinformation Sciences Program	46	0.2
Dep. of Imaging and Interventional Radiology	42	0.2
Fac. of Law	42	0.2
Div. of Earth and Environmental Sciences	38	0.2
Centre for China Studies	37	0.2
Dep. of Anaesthesia and Intensive Care	37	0.2
Dep. of Japanese Studies	33	0.2
The School of Accountancy	9	0.0
Dep. of Marketing	4	0.0
Dep. of Finance	1	0.0
The Divinity School of Chung Chi College	1	0.0
Missings	954	4.9
	19,513	100

Tab. 2: Departments/Divisions

The above departments were further manually allocated to their home faculties (Tab. 3).

Faculty	Number of Theses	Share
Faculty of Science	4,643	23.8
Faculty of Engineering	3,118	16.0
Faculty of Social Science	3,016	15.5
Faculty of Arts	2,748	14.1
Faculty of Medicine	2,519	12.9
CUHK Business School	1,688	8.7
Faculty of Education	785	4.0
Faculty of Law	42	0.2
Missings	954	4.9
	19,513	100

Tab. 3: Faculty

In a last step, they were categorized according to the commonly used academic macro-area. We acknowledge that there are different views about certain classifications. Moreover, different academic traditions can lead to different self-identification of a department. (Tab. 4).

Academic Area	Number of Theses	Share
STEM	10,280	52.7
Social Sciences	5,489	28.1
Humanities	2,790	14.3
Missings	954	4.9
	19,513	100

Tab. 4: Academic Area

Unsupervised Natural Language Processing

For a latent topic model analysis it is assumed that within a body of different texts, there are undetected or hidden – “latent” – topics, through which the texts are connected. Treating texts as word vectors with one word as a unit, simple topic models calculate the correlations between each vector. For each text – in our case the word vectors from the titles and the keywords – the proportion of each topic to be present can then be calculated. If the model calculates the presence of a latent topic to be highly likely in a text, it will be assigned a value approximating 1; if it is highly unlikely, it will approximate 0.

At the beginning, we applied the conventional editing steps for common bag-of-words models on the levels of words.

Non-ASCII characters, numbers, and any form of punctuation symbols were excluded from the data, as they are non-interpretative or functional to a sentence.
Additional white spaces were limited to a single space, capitalized letters were made to lowercase letters.

In the second step, additional word groups which were irrelevant were excluded, as they generally are also non-interpretative or residual characters from earlier editing:

Single isolated letters, numerals, and roman numerals.
Auxiliary verbs in all tenses.
All pronouns, conjunctions, and prepositions, according to official English language lists.
Time words, including weekdays, times of the day, and months.
Words that refer to the structure or goal of a thesis, rather than their content, such as “review” or “hypothesis”.
Words in pinyin.

We also removed duplicated words, e.g. those that appeared both in the title and in the keywords. Finally, we also applied the stopword list of R (Version 2023.09.1), and stemmed the words. After transforming the text data into a document-term-matrix, we calculated a Correlational Topic Model for 15 latent topics, and derived the topic-per-document prevalence for each theses. If the predicted probability exceeded a 0.5 threshold, it was assigned a 1, if not it received a 0. The distribution of the dichotomized topic prevalence were used to constructed the edge weights for the network visualizations.