Methodology
BERT Model
BERT (Bidirectional Encoder Representations from Transformers) [1] is a pre-trained language model built on the Transformer architecture; it uses a bidirectional encoder to learn language representations during pre-training [2]. BERT has achieved strong results in natural language processing and is widely used in tasks such as text classification, named entity recognition, and sentiment analysis.
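As a concrete illustration of how a pre-trained BERT model is typically applied to such tasks, the following is a minimal Python sketch using the Hugging Face transformers library; it is not taken from the cited works, and the checkpoint name bert-base-uncased is simply one common public choice.

# Minimal sketch: extracting contextual representations from a pre-trained BERT
# with the Hugging Face `transformers` library (assumed to be installed; the
# checkpoint "bert-base-uncased" is one common public choice).
import torch
from transformers import BertTokenizer, BertModel

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = BertModel.from_pretrained("bert-base-uncased")
model.eval()

sentence = "The stock market reacted positively to the news."
inputs = tokenizer(sentence, return_tensors="pt")

with torch.no_grad():
    outputs = model(**inputs)

# outputs.last_hidden_state holds one contextual vector per token,
# with shape (batch_size, sequence_length, hidden_size).
token_embeddings = outputs.last_hidden_state
# The [CLS] vector is commonly used as a sentence-level feature for
# downstream classifiers such as sentiment analysis.
cls_embedding = token_embeddings[:, 0, :]
print(cls_embedding.shape)  # torch.Size([1, 768])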
BERT is built from a stack of Transformer encoders [3]. Each encoder layer contains a multi-head self-attention mechanism and a feedforward neural network. The self-attention mechanism allows the model to attend to all other words in the input when producing the representation of each word, which yields bidirectional contextual understanding.
The pre-training of BERT involves two tasks: Masked Language Modeling (MLM) and Next Sentence Prediction (NSP) [4]. In the MLM task, a portion of the words in the input text is randomly masked, and the model must predict the masked words. In the NSP task, the model must predict whether two sentences are consecutive in the original text, which teaches it the relationship between sentences. The attention computation used by BERT is given in Equation 1 [5].
Attention(Q, K, V) = Softmax(QKᵀ / √d_k) V        (1)
In this formula, Q, K, and V denote the query, key, and value matrices, respectively; they are obtained by linear transformations of the input embedding matrix, and d_k is the dimension of the key vectors. The Attention function computes the dot product of the queries and keys, applies the Softmax function to obtain attention weights, and multiplies the weights by the value vectors to produce a weighted representation.
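As a concrete reference, the following is a minimal PyTorch sketch of the scaled dot-product attention in Equation 1; the tensor shapes and toy dimensions are illustrative assumptions, not values from the cited works.

import torch
import torch.nn.functional as F

def scaled_dot_product_attention(Q, K, V):
    # Q, K, V: (batch, seq_len, d_k) tensors obtained by linear
    # transformations of the input embeddings (Equation 1).
    d_k = K.size(-1)
    # Dot product of queries and keys, scaled by sqrt(d_k).
    scores = torch.matmul(Q, K.transpose(-2, -1)) / (d_k ** 0.5)
    # Softmax over the key dimension gives the attention weights.
    weights = F.softmax(scores, dim=-1)
    # Weighted sum of the value vectors.
    return torch.matmul(weights, V)

# Toy example: a batch of one sequence with 4 tokens and d_k = 8.
x = torch.randn(1, 4, 8)
W_q, W_k, W_v = (torch.nn.Linear(8, 8) for _ in range(3))
out = scaled_dot_product_attention(W_q(x), W_k(x), W_v(x))
print(out.shape)  # torch.Size([1, 4, 8])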
In BERT, multi-head self-attention is applied first: each attention head computes its own attention, the outputs of all heads are concatenated, and the result is passed through a feedforward neural network. By stacking multiple encoder layers, BERT produces deep bidirectional contextual representations, which allows it to learn rich language features and achieve strong performance across natural language processing tasks.
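The per-layer structure described above can be sketched as follows. This is an illustrative PyTorch approximation of one encoder layer: the hidden size of 768, 12 heads, and 12 stacked layers mirror the common BERT-base configuration, but the layer normalization placement and other details are simplified assumptions rather than BERT's exact implementation.

import torch
import torch.nn as nn

class EncoderLayer(nn.Module):
    # Simplified Transformer encoder layer: multi-head self-attention
    # followed by a position-wise feedforward network, each with a
    # residual connection and layer normalization.
    def __init__(self, d_model=768, num_heads=12, d_ff=3072):
        super().__init__()
        self.attn = nn.MultiheadAttention(d_model, num_heads, batch_first=True)
        self.ffn = nn.Sequential(
            nn.Linear(d_model, d_ff), nn.GELU(), nn.Linear(d_ff, d_model)
        )
        self.norm1 = nn.LayerNorm(d_model)
        self.norm2 = nn.LayerNorm(d_model)

    def forward(self, x):
        # The heads are computed and concatenated inside nn.MultiheadAttention.
        attn_out, _ = self.attn(x, x, x)
        x = self.norm1(x + attn_out)
        x = self.norm2(x + self.ffn(x))
        return x

# Stacking several layers yields deep bidirectional representations
# (BERT-base stacks 12 such layers).
encoder = nn.Sequential(*[EncoderLayer() for _ in range(12)])
x = torch.randn(1, 16, 768)   # (batch, tokens, hidden size)
print(encoder(x).shape)       # torch.Size([1, 16, 768])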
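For completeness, the random masking step of the MLM pre-training task described earlier can be sketched as below. The 15% masking rate and the [MASK] token id of 103 are assumptions based on the standard BERT recipe and the bert-base-uncased vocabulary; the label value -100 is the conventional ignore index for the loss.

import torch

def mask_tokens(input_ids, mask_token_id=103, mask_prob=0.15):
    # Randomly replace a fraction of token ids with the [MASK] id and
    # keep the original ids as prediction targets (MLM objective).
    labels = input_ids.clone()
    mask = torch.rand(input_ids.shape) < mask_prob
    labels[~mask] = -100                 # positions the loss ignores
    masked_inputs = input_ids.clone()
    masked_inputs[mask] = mask_token_id  # assumed [MASK] id (103 in bert-base-uncased)
    return masked_inputs, labels

# Toy example with arbitrary token ids.
ids = torch.randint(1000, 2000, (1, 10))
masked, labels = mask_tokens(ids)
print(masked)
print(labels)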
[1] Li M, Chen L, Zhao J, et al. Sentiment analysis of Chinese stock reviews based on BERT model[J]. Applied Intelligence, 2021, 51: 5016-5024.
[2] Zhang T, Xu B, Thung F, et al. Sentiment analysis for software engineering: How far can pre-trained transformer models go?[C]//2020 IEEE International Conference on Software Maintenance and Evolution (ICSME). IEEE, 2020: 70-80.
[3] Gu X, Liu L, Yu H, et al. On the transformer growth for progressive bert training[J]. arXiv preprint arXiv:2010.12562, 2020.
[4] Won D Y, Lee Y J, Choi H J, et al. Contrastive representations pre-training for enhanced discharge summary BERT[C]//2021 IEEE 9th International Conference on Healthcare Informatics (ICHI). IEEE, 2021: 507-508.
[5] Zhang Z, Zhang G, Li J, et al. Sentiment analysis of daily texts based on BERT[J]. Journal of Qiqihar University (Natural Science Edition), 2024(06): 37-41. doi:10.20171/j.cnki.23-1419/n.2024.06.005.