Methodology Involved
As mentioned in the introduction, we developed a companion translation model for the original ancient Chinese database so that readers can better compare the charts with the original text when consulting them. Accordingly, our research methodology is divided into two parts: Visualization and Translation Modeling.
Methodology in Visualization
We mainly use Python and Excel for data extraction and visualization. The raw data is stored in an Excel document consisting of over 2,000 land-renting records. We locate the numerical data by anchoring on specific words in the text; for example, the total area of 34 can be read from between the adverb “凡” and the measure word “亩”. In this way, we extracted the data we needed using Excel functions.
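As a minimal illustration of this anchor-word idea (the sample string and the regular expression below are ours for demonstration; the actual extraction was done with Excel functions), the total area can be read out as follows:

import re

# Illustrative fragment of a record: "凡" precedes the total area and "畝" closes it.
record = "凡卅四畝"
match = re.search(r'凡(.+?)畝', record)
if match:
    print(match.group(1))  # "卅四", still written with archaic numerals at this stage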
Then, we use Python to convert the extracted words into modern expressions and to turn them into numerical data using the Python library chinese2digits. For data recorded in different units, we performed unit conversion based on the content of Zoumalou Slips Research.
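For the unit conversion, a minimal sketch is shown below; the figures are invented, and the ratio of 240 bu per mu is our reading of Zoumalou Slips Research, so both should be treated as assumptions:

# Hypothetical converted values for one record: 21 mu and 60 bu.
mu, bu = 21, 60
BU_PER_MU = 240  # assumption: 1 mu = 240 bu for the Zoumalou land records
total_area_mu = mu + bu / BU_PER_MU
print(total_area_mu)  # 21.25 mu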
Finally, we organized a table containing the various types of data extracted from the text, as shown in the figure, including land area, recording time, drought conditions, and so on.
import pandas as pd
import chinese2digits as c2d
import re

excel_file = pd.ExcelFile('zoumalou1.xlsx')
# sheet_names = excel_file.sheet_names
# print(sheet_names)
data = pd.read_excel('zoumalou1.xlsx', sheet_name='Sheet1')
print(data.columns)

# Replace the archaic numeral characters with their modern equivalents.
data['畝數'] = data['畝數'].str.replace('廿', '二十')
data['畝數'] = data['畝數'].str.replace('卅', '三十')
data['畝數'] = data['畝數'].str.replace('卌', '四十')
data['步數'] = data['步數'].str.replace('廿', '二十')
data['步數'] = data['步數'].str.replace('卅', '三十')
data['步數'] = data['步數'].str.replace('卌', '四十')
data.to_excel('output_file.xlsx', index=False)

data1 = pd.read_excel('output_file.xlsx')

def process_string(value):
    # Extract the numbers recognized by chinese2digits and keep the last bracketed list.
    processed_value = str(c2d.takeNumberFromString(value))
    match = re.findall(r'\[(.*?)\]', processed_value)
    if match:
        last_bracket_content = match[-1]
        processed_value = last_bracket_content
        processed_value = processed_value.replace("'", "")
    return processed_value

for column in ['畝數', '步數']:
    column_to_process = data1[column]
    processed_column = column_to_process.astype(str).apply(lambda x: process_string(x))
    data1[column] = processed_column

data1.to_excel('output4_file.xlsx', index=False)
Click on the following link to view the Visualization Demonstration.
Methodology in Translation Modeling
Earlier we showed some visualizations of the Zoumalou slips data, but looking back at the original dataset, the text in the Excel sheet is so obscure that even with the charts it is hard to relate them to these bewildering words. To solve this problem, we adapted a translation model for the literary texts to better interpret the original dataset. The model combines natural language processing with a large language model and has three main features: keyword separation, predictive translation, and a language converter.
Feature Words Extraction
The first feature is keyword separation. We use the Python package jieba, a widely used Chinese word segmentation library, and adapt it to the ancient texts by adding a custom dictionary of domain keywords.
We use jieba to count the recurring words in the original data by frequency, and then infer the semantics of the surrounding words from where they occur in the passage.
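A minimal sketch of this frequency count, using collections.Counter over jieba's segmentation output, might look as follows (the sample text is one record from our dataset):

import jieba
from collections import Counter

text = "下伍丘軍吏黃元,田十町,凡廿一畝,皆二年常限。"
# Count how often each segmented token appears in the record.
freq = Counter(jieba.lcut(text))
for word, count in freq.most_common(10):
    print(word, count)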
This method is very effective for the Zoumalou slips, which have a consistent text structure; for example, the location, identity, and person's name in the first sentence of the original can be accurately recognized by jieba once the keywords are introduced.
Comparing Figure #10 and Figure #11, it is easy to see that the first part has been slightly adjusted. Before optimization, the place name “下伍丘”, the identity “軍吏”, and the person's name “黃元” were wrongly separated, forming four phrases with no practical meaning; after pre-training with the Zoumalou database, the optimized result accurately identifies the three phrases mentioned above.
import jieba as j
import pandas as pd

df = pd.read_csv('demonstration.csv')
#...#
# Add the domain keywords (assumed to be stored in the first column of the CSV) to jieba's dictionary.
for word in df.iloc[:, 0].astype(str):
    j.add_word(word)

t = "下伍丘軍吏黃元,田十町,凡廿一畝,皆二年常限。旱敗不收,畝收布六寸六分。凡為布□丈四尺二寸,……四年十一月三日付庫吏番有。畝收錢卅七,凡為錢七百九十五錢,四年十一月五日付庫吏番有。"
words = j.lcut(t)
for w in words:
    print(w)
#...#
Predictive Translation
We then used the BERT model to process the optimized corpus. Below is a rough structure diagram of the whole BERT model.
What we did was to reconstruct the mapping network in this diagram following NLPCC 2023, which is in effect a simplification of the fusion of the two mapping networks.
Because of version compatibility issues between repositories, we ignore some weight keys of the original translation model when loading it, which has essentially no negative impact on the final translation results.
#...#
class CPTForConditionalGeneration(CPTPretrainedModel):
    base_model_prefix = "model"
    _keys_to_ignore_on_load_missing = [
        r"final_logits_bias",
        r"encoder\.version",
        r"decoder\.version",
        r"lm_head\.weight",
    ]

    def __init__(self, config):
        super().__init__(config)
        self.model = CPTModel(config)
        self.register_buffer("final_logits_bias", torch.zeros((1, self.model.shared.num_embeddings)))
        self.lm_head = nn.Linear(config.d_model, self.model.shared.num_embeddings, bias=False)
        self.init_weights()

    def get_encoder(self):
        return self.model.get_encoder()

    def get_decoder(self):
        return self.model.get_decoder()

    def resize_token_embeddings(self, new_num_tokens: int) -> nn.Embedding:
        new_embeddings = super().resize_token_embeddings(new_num_tokens)
        self._resize_final_logits_bias(new_num_tokens)
        return new_embeddings

    def _resize_final_logits_bias(self, new_num_tokens: int) -> None:
        old_num_tokens = self.final_logits_bias.shape[-1]
        if new_num_tokens <= old_num_tokens:
            new_bias = self.final_logits_bias[:, :new_num_tokens]
        else:
            extra_bias = torch.zeros((1, new_num_tokens - old_num_tokens), device=self.final_logits_bias.device)
            new_bias = torch.cat([self.final_logits_bias, extra_bias], dim=1)
        self.register_buffer("final_logits_bias", new_bias)

    def get_output_embeddings(self):
        return self.lm_head

    def set_output_embeddings(self, new_embeddings):
        self.lm_head = new_embeddings

    @add_start_docstrings_to_model_forward(CPT_INPUTS_DOCSTRING)
    @replace_return_docstrings(output_type=Seq2SeqLMOutput, config_class=_CONFIG_FOR_DOC)
    #...#

#...#
    @add_start_docstrings_to_model_forward(CPT_INPUTS_DOCSTRING)
    @add_code_sample_docstrings(
        tokenizer_class=_TOKENIZER_FOR_DOC,
        checkpoint=_CHECKPOINT_FOR_DOC,
        output_type=Seq2SeqSequenceClassifierOutput,
        config_class=_CONFIG_FOR_DOC,
    )
#...#
Adapted critical code in the library file of the Erya model
See LINK for source file details
Here, the preprocessing of the text is basically complete. We then utilize the large language model ChatGPT to complete the rest of the translation task, i.e., to translate the input ancient text based on the attention mechanism and with reference to the token sequence. The specific steps are as follows:
- Perform a preliminary analysis with the CPT model
- Determine the token sequence and the corresponding tensor of the text
- Enter the analyzed data as preconditions into ChatGPT
- Use the prompted ChatGPT to translate and output the ancient texts
Language Converter
Considering that users may have different language preferences, we have also added a language converter to convert the output to Simplified Chinese, Traditional Chinese and English.
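A minimal sketch of such a converter is shown below; the helper name is hypothetical, and routing the English output through the ChatGPT prompt rather than OpenCC is our own design assumption:

from opencc import OpenCC

def convert_output(text, target):
    # OpenCC handles the two Chinese scripts; English is requested from ChatGPT directly.
    if target == "simplified":
        return OpenCC('t2s').convert(text)
    if target == "traditional":
        return OpenCC('s2t').convert(text)
    return text  # "english": already produced in English by the translation prompt

print(convert_output("旱灾导致颗粒无收", "traditional"))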
Combining all the models and methods mentioned above, here is some of the key code for calling the API for ancient-text translation:
import requests
from opencc import OpenCC
from transformers import BertTokenizer
from modeling_cpt import CPTForConditionalGeneration
#...#
def main():
    try:
        # Load the Erya4FT tokenizer and CPT model once, before the interactive loop.
        tokenizer = BertTokenizer.from_pretrained("RUCAIBox/Erya4FT")
        model = CPTForConditionalGeneration.from_pretrained("RUCAIBox/Erya4FT")
        while True:
            content = input("Content: ")
            input_ids = tokenizer(content, return_tensors='pt')
            input_ids.pop("token_type_ids")  # CPT's generate() does not accept token_type_ids
            pred_ids = model.generate(max_new_tokens=256, **input_ids)
            # Draft translation produced by the CPT model itself (not sent to the API here).
            prep = tokenizer.batch_decode(pred_ids, skip_special_tokens=True)
            # Serialize the tokenized input so that it can be embedded in the ChatGPT prompt.
            input_ids_str = str(input_ids['input_ids'][0].tolist())
            url = ""  # insert your api here
            headers = {
                "Content-Type": "application/json",
                "Cache-Control": "no-cache",
                "Ocp-Apim-Subscription-Key": ""  # insert your key here
            }
            # The prompt (in Chinese) supplies the tokenized data and asks the model to translate
            # the classical text into modern Chinese, returning only the translation itself.
            data = {
                "model": "gpt-35-turbo",
                "messages": [
                    {
                        "role": "user",
                        "content": "现在请你将以下的经过tokenizer的文言文文本翻译成现代文,首先给出tokenizer后的数据" + input_ids_str + "请参考以上数据正确识别词汇并翻译以下文言文文本" + content + "要求输出的内容中只有翻译结果而没有这些问题提示,结尾一定不能出现问题和提示"
                    }
                ]
            }
            response = requests.post(url, headers=headers, json=data)
            response.raise_for_status()
            result = response.json()
            extracted_content = result['choices'][0]['message']['content']
            # Convert the returned translation from Simplified to Traditional Chinese.
            c = OpenCC('s2t')
            extracted_content = c.convert(extracted_content)
            print("\n" + "Result: " + extracted_content + "\n")
    except Exception as ex:
        print("Exception:", ex)

if __name__ == "__main__":
    main()
#...#
Click on the following link to view the Translation Sample.