Methodology

Workflow

This project is split into two phases. The first phase covers data processing, including OCR of the original press release documents. The second phase focuses on developing an interactive platform.

Phase 1: Data Processing

Dataset Preview

Documents from 1983–1997 and 1998–2006 come in two different formats. The former are PDFs that require OCR to convert them into machine-readable text. The latter are already directly usable and need no OCR. This step therefore focuses on processing the 1983–1997 documents.
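The split by year range can be expressed as a small routing check. The following is a minimal sketch; the function name `needs_ocr` and the constant names are hypothetical and chosen only for illustration.

```python
# Year ranges taken from the dataset description above.
OCR_YEARS = range(1983, 1998)      # 1983-1997: PDFs that require OCR
DIRECT_YEARS = range(1998, 2007)   # 1998-2006: already directly usable


def needs_ocr(year: int) -> bool:
    """Return True if a document from `year` must go through OCR."""
    if year in OCR_YEARS:
        return True
    if year in DIRECT_YEARS:
        return False
    # Years outside the dataset are rejected rather than guessed at.
    raise ValueError(f"year {year} is outside the 1983-2006 dataset")
```

Under this sketch, only documents for which `needs_ocr` returns True would enter the OCR steps.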

OCR

(1) Run a first OCR pass with PaddlePaddle.

(2) Use Grok3 to correct errors in the OCR output.

(3) Use Grok3 to merge documents that cover the same event (duplicate Chinese and English versions).
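The three steps above can be sketched as one pipeline. This is a sketch under stated assumptions, not the project's actual implementation: the OCR engine and the Grok3 calls are passed in as plain functions, and the names `run_ocr_pipeline`, `ocr`, `correct`, and `event_key` are hypothetical. In practice `ocr` would wrap the PaddlePaddle OCR call, while `correct` and `event_key` would wrap Grok3 requests.

```python
from collections import defaultdict
from typing import Callable, Dict, Iterable, List


def run_ocr_pipeline(
    sources: Iterable[str],
    ocr: Callable[[str], str],
    correct: Callable[[str], str],
    event_key: Callable[[str], str],
) -> Dict[str, List[str]]:
    """Run OCR, correct the text, and group documents by event.

    All three callables are placeholders (assumptions) for the real
    OCR engine and language-model calls.
    """
    grouped: Dict[str, List[str]] = defaultdict(list)
    for source in sources:
        raw_text = ocr(source)            # step (1): first-pass OCR
        clean_text = correct(raw_text)    # step (2): fix OCR errors
        # step (3): documents that map to the same key are treated as
        # versions of the same event (e.g. Chinese and English copies).
        grouped[event_key(clean_text)].append(clean_text)
    return dict(grouped)
```

Keeping each step behind a function argument lets the pipeline be tested with simple stand-ins before any real OCR or model call is wired in.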

Phase 2: Interactive Platform Development