Workflow
This project can be split into two phases. The first phase involves data processing, including OCR scanning of the original press release documents. The second phase focuses on developing an interactive platform.

Phase 1: Data Processing
Dataset Preview
Documents from 1983–1997 and 1998–2006 are in two different formats. The former are PDFs that require OCR to convert them into machine-readable text. The latter are already in a directly usable format and need no OCR. Therefore, this step focuses on processing the data from 1983–1997.

OCR
(1) Use PaddlePaddle for the first-pass OCR.

(2) Use Grok3 to improve the accuracy of the OCR output.

(3) Use Grok3 to merge documents that cover the same event (duplicate Chinese and English versions).

Phase 2: Interactive Platform Development

This phase developed an interactive web platform to visualize university event data from Markdown files and linked PDFs using Dash, Plotly, and Dash Bootstrap Components (DBC).
- Data Parsing and Preprocessing
  Parsed Markdown files in the repository using re for event extraction (title, date, content, labels) and stored them in a Pandas DataFrame. Converted dates with pd.to_datetime. Integrated PDF metadata from the Index Excel file using Pandas.
- Event Categorization
  Categorized events with a category_mapping dictionary, assigning labels to groups (e.g., Academic, Cultural) using the categorize_by_labels function.
- Interactive Visualization
  Built a timeline and pie chart using Plotly’s make_subplots. The timeline shows events by category with go.Scatter, colored via CATEGORY_COLORS. The pie chart displays the category distribution. Features include:
  - Search with dbc.Input for keyword filtering.
  - Multi-select dcc.Dropdown for label filtering (up to 50 labels).
  - dcc.DatePickerRange and dcc.RangeSlider for date filtering, synced via callbacks.
  - Clickable points display event details using dcc.Graph.
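
The parsing and categorization steps can be illustrated with a minimal sketch. The Markdown layout assumed here (a `## ` title line followed by `Date:` and `Labels:` lines), the contents of `category_mapping`, and the signature of `categorize_by_labels` are all assumptions made for illustration; the project's actual files and function may differ.

```python
import re
from datetime import datetime

# Hypothetical event layout; the real Markdown files may be structured differently.
EVENT_PATTERN = re.compile(
    r"^## (?P<title>[^\n]+)\n"
    r"Date: (?P<date>\d{4}-\d{2}-\d{2})\n"
    r"Labels: (?P<labels>[^\n]*)\n"
    r"\n(?P<content>.*?)(?=^## |\Z)",
    re.MULTILINE | re.DOTALL,
)

# Hypothetical mapping from category name to the labels it covers.
category_mapping = {
    "Academic": {"lecture", "seminar"},
    "Cultural": {"exhibition", "concert"},
}


def categorize_by_labels(labels, mapping=category_mapping, default="Other"):
    """Return the first category whose label set overlaps the given labels."""
    wanted = {label.lower() for label in labels}
    for category, known in mapping.items():
        if wanted & known:
            return category
    return default


def parse_events(markdown_text):
    """Extract title, date, labels, content, and category for each event."""
    events = []
    for match in EVENT_PATTERN.finditer(markdown_text):
        labels = [part.strip() for part in match["labels"].split(",") if part.strip()]
        events.append(
            {
                "title": match["title"].strip(),
                "date": datetime.strptime(match["date"], "%Y-%m-%d"),
                "labels": labels,
                "content": match["content"].strip(),
                "category": categorize_by_labels(labels),
            }
        )
    return events
```

The resulting list of dictionaries can be passed to `pd.DataFrame(events)`. Dates are parsed with `datetime.strptime` here only to keep the sketch free of third-party dependencies; with Pandas, `pd.to_datetime` fills the same role.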
- PDF Integration
  Linked PDFs to events via direct or similarity matching (difflib.SequenceMatcher). Generated previews with pdf2image, encoded as base64 JPEGs. Displayed full PDFs in an iframe via a Flask route (send_from_directory). Managed temporary files in ./temp.
- User Interface
  Used DBC with FLATLY theme for a responsive layout. Included a title, input controls, graph, event details panel, and PDF viewer. Callbacks handled interactivity, with dash.callback_context syncing controls.
- Technical Notes
  Ensured error handling for file parsing and PDF conversion. Optimized performance with limited dropdown options and file cleanup. Secured PDF serving with send_from_directory.
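
The direct-or-similarity matching and the base64 preview encoding described under PDF Integration might look roughly like the sketch below. The function names `match_pdf` and `jpeg_data_uri`, the comparison against file names without their extension, and the 0.6 similarity threshold are assumptions for illustration, not values taken from the project.

```python
import base64
from difflib import SequenceMatcher


def match_pdf(title, pdf_names, threshold=0.6):
    """Return the PDF file name that best matches an event title, or None.

    A direct (case-insensitive) match on the name without its extension is
    tried first; otherwise the most similar name by SequenceMatcher ratio is
    used, provided it reaches the assumed threshold.
    """
    normalized = title.strip().lower()
    by_stem = {name.rsplit(".", 1)[0].strip().lower(): name for name in pdf_names}
    if normalized in by_stem:
        return by_stem[normalized]

    best_name, best_score = None, 0.0
    for stem, name in by_stem.items():
        score = SequenceMatcher(None, normalized, stem).ratio()
        if score > best_score:
            best_name, best_score = name, score
    return best_name if best_score >= threshold else None


def jpeg_data_uri(jpeg_bytes):
    """Encode raw JPEG bytes as a data URI that can serve as an image src."""
    encoded = base64.b64encode(jpeg_bytes).decode("ascii")
    return f"data:image/jpeg;base64,{encoded}"
```

In this sketch, `jpeg_data_uri` expects the bytes of a preview page that has already been rendered and saved as JPEG, for example from an image produced by pdf2image. Returning `None` for weak matches keeps unrelated PDFs from being attached to an event.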
The platform enables dynamic event exploration and PDF access, leveraging Pandas, Dash, Plotly, DBC, re, pdf2image, base64, and Flask.