Data and Preprocessing
The project started from the CUHK Archive of Professor Yang Chen-Ning, which the project presentation summarises as 1,283 observations. The archival materials include articles and notebooks, correspondence, media materials, photographs, and other items. Because the archive alone does not fully represent Yang’s collaboration network, the team expanded the evidence base separately for each layer.
For the coauthorship layer, the archive was supplemented with publicly available bibliographic data from Google Scholar and INSPIRE-HEP. The two sources were cross-referenced and merged into a cleaned publication table containing 260 publications from 1947 to 2019. The merged file preserves publication metadata such as title, year, journal, coauthors, DOI, and citation information.
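The cross-referencing step can be illustrated with a minimal sketch. It assumes each source has already been exported as a list of records, that records are matched on a normalised title plus year, and that the first non-empty value wins when both sources supply a field. The function names, field names, and placeholder records are hypothetical and are not taken from the project's own scripts, which used Pandas.

```python
import re


def title_key(title):
    """Lower-case a title and collapse punctuation so near-identical titles match."""
    return re.sub(r"[^a-z0-9]+", " ", title.lower()).strip()


def merge_sources(scholar, inspire):
    """Merge two record lists into one table, one entry per (title, year)."""
    merged = {}
    for source, records in (("scholar", scholar), ("inspire", inspire)):
        for rec in records:
            key = (title_key(rec["title"]), rec["year"])
            entry = merged.setdefault(key, {"sources": []})
            entry["sources"].append(source)
            for field, value in rec.items():
                # Keep the first non-empty value seen for each field.
                if value and not entry.get(field):
                    entry[field] = value
    return sorted(merged.values(), key=lambda e: (e["year"], e["title"]))


# Placeholder records for illustration only (not real bibliographic data).
scholar = [
    {"title": "Example Paper One", "year": 1954, "journal": "", "doi": ""},
    {"title": "Example paper two.", "year": 1956, "journal": "Journal B", "doi": ""},
]
inspire = [
    {"title": "Example paper one", "year": 1954, "journal": "Journal A",
     "doi": "10.0000/example.1"},
]

table = merge_sources(scholar, inspire)
for row in table:
    print(row["year"], row["title"], row["sources"])
```

Matching on title and year alone can miss records whose titles differ between sources, so a real merge would likely also compare DOIs and leave ambiguous pairs for manual review.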
For the correspondence layer, relevant letters were identified through keyword search, human inspection, and cross-referencing within the archival files. The current processed table contains 88 correspondence items and 216 normalised person-item rows, covering 105 named individuals between 1951 and 1993. Institutional and country labels were then added to make the correspondence data usable for geographic analysis.
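The difference between correspondence items and person-item rows can be shown with a small sketch: each item lists the people it involves, and the long table holds one row per item and person. The field names (`item_id`, `year`, `people`), the whitespace-only name cleaning, and the placeholder records are assumptions made for illustration, not the project's actual schema.

```python
def normalise_name(name):
    """Collapse repeated and surrounding whitespace in a name."""
    return " ".join(name.split())


def to_person_item_rows(items):
    """Expand each item into one row per distinct person named in it."""
    rows = []
    for item in items:
        seen = set()
        for raw in item["people"]:
            person = normalise_name(raw)
            if person in seen:
                continue  # skip duplicate mentions within the same item
            seen.add(person)
            rows.append(
                {"item_id": item["item_id"], "year": item["year"], "person": person}
            )
    return rows


# Placeholder items for illustration only.
items = [
    {"item_id": "C001", "year": 1951, "people": ["Person A", " Person  B "]},
    {"item_id": "C002", "year": 1993, "people": ["Person B", "Person B", "Person C"]},
]

rows = to_person_item_rows(items)
people = {row["person"] for row in rows}
years = [row["year"] for row in rows]
print(len(items), "items,", len(rows), "rows,", len(people), "people,",
      min(years), "to", max(years))
```

In this layout a person who appears in several items contributes several rows, which is why the row count exceeds both the item count and the number of named individuals.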
For the photo and co-mention layer, the team began with the provided archival dataset and used text extraction, field normalisation, name cleaning, translation, and geocoding to transform descriptive records into map-ready data. The map-processing pipeline visible in the shared folder narrows the data from 327 rows with coordinate candidates to 118 Yang-related rows, and finally to 107 confirmed geocoded records across 63 places between 1924 and 2006. A separate unified photo table contains 257 processed entries with bilingual names and locations.
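The staged narrowing described above can be sketched as three successive filters with a count recorded at each stage. The field names (`description`, `place`, `lat`, `lon`, `confirmed`), the keyword test for relevance, and the placeholder rows are hypothetical; the actual pipeline may have decided relevance and confirmation differently, for example through manual review.

```python
def has_coordinates(row):
    """True when the row carries both a latitude and a longitude candidate."""
    return row.get("lat") is not None and row.get("lon") is not None


def filter_funnel(rows, subject="Yang"):
    """Apply the three filters in order and report the size of each stage."""
    with_coords = [r for r in rows if has_coordinates(r)]
    related = [r for r in with_coords if subject.lower() in r["description"].lower()]
    confirmed = [r for r in related if r.get("confirmed")]
    return {
        "with_coordinates": len(with_coords),
        "related": len(related),
        "confirmed": len(confirmed),
        "places": len({r["place"] for r in confirmed}),
        "records": confirmed,
    }


# Placeholder rows for illustration only.
photo_rows = [
    {"description": "Yang Chen-Ning at a ceremony", "place": "Place A",
     "lat": 1.0, "lon": 2.0, "confirmed": True},
    {"description": "Portrait of Yang", "place": "Place A",
     "lat": 1.0, "lon": 2.0, "confirmed": True},
    {"description": "Yang at a conference", "place": "Place B",
     "lat": 3.0, "lon": 4.0, "confirmed": False},
    {"description": "Campus view", "place": "Place C",
     "lat": 5.0, "lon": 6.0, "confirmed": True},
    {"description": "Yang family photo", "place": "Place D",
     "lat": None, "lon": None, "confirmed": False},
]

funnel = filter_funnel(photo_rows)
print({k: v for k, v in funnel.items() if k != "records"})
```

Recording the count at every stage makes it easy to check a run against the reported figures and to see where records were dropped.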

[Photograph of the C.N. Yang Archive Opening Ceremony, with Sin Wai-kin, Yang Chen-Ning, and Li Kwok-Cheung]. (1999). C.N. Yang Archive (Hanger 10, Folder 30). The Chinese University of Hong Kong Library, Hong Kong.

[Photograph of the Honorary Doctor of Science Degree Ceremony at the Chinese University of Hong Kong, with Li Kwok-Cheung, Yang Chen-Ning, and Ho Man-Wui]. (1998). C.N. Yang Archive (Hanger 10, Folder 31). The Chinese University of Hong Kong Library, Hong Kong.
The workflow combined manual archival work with computational enrichment. According to the presentation and processed outputs, the main methods included:
- Python and Pandas for scraping, cleaning, reshaping, and deduplication
- Cross-referencing Google Scholar and INSPIRE-HEP for publication coverage
- Manual identification and validation of correspondence items
- LLM-assisted institution and country identification for people and organisations
- Geocoding with Geopy to standardise map coordinates
- Final filtering to keep only confirmed, cleaned, and analysable records