Methodology - Digital Scholarship Projects, CUHK Library

Data Preprocessing

Missing values were imputed with the mean of the corresponding column.

# Impute missing values with each numeric column's mean
df1 = df1.fillna(df1.mean(numeric_only=True))

Consistent with the paper, the cellular data were log2-transformed. For the remaining variables, a natural log transformation was applied for the econometric methods so that the coefficients reflect rates of change, while a Z-score transformation using StandardScaler() was applied for the machine learning methods to improve prediction accuracy.
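The three transformations can be sketched as follows. This is a minimal illustration, not the project's actual code: the column names (`p16`, `bmi`, `sleep_duration`) and the toy values are assumptions, and all values are assumed to be strictly positive so that the logarithms are defined.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Hypothetical toy data; the column names are illustrative assumptions.
df = pd.DataFrame({
    "p16": [1.0, 2.0, 4.0, 8.0],
    "bmi": [22.0, 27.5, 31.0, 35.2],
    "sleep_duration": [6.0, 7.5, 8.0, 5.5],
})
cellular_cols = ["p16"]
other_cols = ["bmi", "sleep_duration"]

# Cellular data: log2 transform.
cellular_log2 = np.log2(df[cellular_cols])

# Econometric track: natural log of the remaining variables.
econ = np.log(df[other_cols])

# Machine learning track: Z-score of the remaining variables.
scaler = StandardScaler()
ml = pd.DataFrame(
    scaler.fit_transform(df[other_cols]),
    columns=other_cols,
    index=df.index,
)
```

In a train/test setting, the scaler would normally be fitted on the training rows only and then applied to the test rows, so that no test information leaks into the scaling.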

Research Structure

The research was divided into three interconnected parts: biomarker estimation, anomaly detection, and aging intervention.

1) Biomarker Estimation Overview

Motivation: This section addresses the challenge of estimating biological age when biomarker measurements are unavailable. It explores how readily observable physiological, lifestyle, and clinical factors can help to predict an individual’s underlying aging process.

Method: Correlation analysis and linear regression were employed to identify significant relationships between observable factors and premature aging. K-means clustering was also performed to look for patterns across the resulting clusters.

To begin, an individual linear regression was run for each variable, with age included in every regression to adjust for the effect of chronological age on biological aging. Regressions were then run for entire variable groups (anthropometric and behavioural), again with age included as an adjustment factor in each model. Other cellular variables were excluded from these group models because this section focuses on factors that are easily accessible to individuals. Finally, a full-model regression was run with all variables. This initial process helped to determine and visualise which variables were individually significant for the p16 and p21 biomarkers.
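The age-adjusted regressions described above might look like the sketch below. It uses statsmodels' formula interface as an assumption, since the source does not name the regression library, and the data are synthetic with hypothetical column names.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Synthetic data for illustration only; column names are assumptions.
rng = np.random.default_rng(0)
n = 200
df = pd.DataFrame({
    "age": rng.uniform(20, 60, n),
    "bmi": rng.normal(28, 4, n),
    "sleep_duration": rng.normal(7, 1, n),
})
df["p16"] = 0.03 * df["age"] + 0.1 * df["bmi"] + rng.normal(0, 0.5, n)

predictors = ["bmi", "sleep_duration"]

# One regression per variable, each adjusted for chronological age.
rows = []
for var in predictors:
    fit = smf.ols(f"p16 ~ {var} + age", data=df).fit()
    rows.append({
        "variable": var,
        "coef": fit.params[var],
        "p_value": fit.pvalues[var],
        "adj_r2": fit.rsquared_adj,
    })
individual = pd.DataFrame(rows)

# Group-level regression: all variables of a group plus age.
group_fit = smf.ols("p16 ~ " + " + ".join(predictors) + " + age", data=df).fit()
```

The `individual` table collects each variable's coefficient and p-value, which is the kind of summary that can then be plotted to compare the individual models.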

Next, regression models combining up to 7 variables were searched, targeting the largest number of individually significant variables and the highest adjusted R-squared. The aim was to find initial combinations of variables that are individually significant in predicting the senescence biomarkers, and then to use those results to run additional combinations and determine the final estimation model.

For visualisation purposes, correlation matrices were generated with all variables. Graphs were also generated to compare the individual regression models and the top 5 models in the combination search process for each of p16 and p21.

Finally, K-means clustering was performed. The best silhouette score was first identified across different combinations of variables (for example, [“p16”, “sleep duration”]), and each selected feature set was then clustered. For each resulting cluster, the mean values of key variables were computed and compared to identify patterns or trends that might distinguish the natural groups found by the algorithm.
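The silhouette-based selection and the per-cluster comparison can be sketched as follows. The helper name `best_feature_set`, the fixed number of clusters, the pairwise feature sets, and the standardisation step are all assumptions for illustration; the data are synthetic.

```python
from itertools import combinations

import numpy as np
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score
from sklearn.preprocessing import StandardScaler


def best_feature_set(df, features, n_clusters=2, set_size=2, random_state=42):
    """Return (feature combination, silhouette score, labels) with the best score."""
    best = None
    for combo in combinations(features, set_size):
        X = StandardScaler().fit_transform(df[list(combo)])
        km = KMeans(n_clusters=n_clusters, n_init=10, random_state=random_state)
        labels = km.fit_predict(X)
        score = silhouette_score(X, labels)
        if best is None or score > best[1]:
            best = (combo, score, labels)
    return best


# Synthetic data: two groups that differ in p16 and sleep duration only.
rng = np.random.default_rng(0)
df = pd.DataFrame({
    "p16": np.concatenate([rng.normal(1, 0.2, 50), rng.normal(5, 0.2, 50)]),
    "sleep_duration": np.concatenate([rng.normal(8, 0.2, 50), rng.normal(5, 0.2, 50)]),
    "bmi": rng.normal(28, 3, 100),
})

combo, score, labels = best_feature_set(df, ["p16", "sleep_duration", "bmi"])

# Compare the mean of each variable across the resulting clusters.
cluster_means = df.assign(cluster=labels).groupby("cluster").mean()
```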

2) Anomaly Detection Overview

Motivation: Using estimated and measured biomarker levels, this section focuses on identifying individuals whose biological aging deviates significantly from expected norms, thereby flagging those at elevated risk of accelerated premature aging. We anticipate that this approach will be of particular value in a clinical context, enabling the detection of premature aging‑related changes in patients with obesity.

Method: Supervised machine learning models were used to estimate p16 biomarker levels, given their strong correlation with the available variables. Then, the best‑performing model was selected as the estimator of the premature aging biomarker. This model was deployed to detect anomalies by comparing the residual between a user’s actual p16 value and the model’s prediction.

The dataset was randomly split into training and test sets at a 70:30 ratio.

from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

The candidate regression models were Linear Regression, Random Forest, Decision Tree, Support Vector Regression, and KNN regression. To identify the best-performing model and the best combination of features, a pipeline was constructed that performed feature selection using SelectKBest (with f_regression as the scoring function) followed by model fitting. The number of selected features k was varied from 2 up to the total number of available features. For each combination of model and k, the pipeline was trained on the training set and evaluated on the test set using R² and MSE. This allowed us to compare model performance across different feature subsets and select the configuration that minimised prediction error for p16 biomarker estimation. The following code snippet was used for this purpose.

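from sklearn.feature_selection import SelectKBest, f_regression
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.pipeline import Pipeline

# `models` maps a model name to an unfitted estimator for each candidate
# listed above; X_train, X_test, y_train, y_test come from the split above.
results = []

# Try every number of selected features k with every candidate model
for k in range(2, X.shape[1] + 1):
    for name, model in models.items():
        pipeline = Pipeline([
            ('best_features', SelectKBest(score_func=f_regression, k=k)),
            ('model', model),
        ])

        pipeline.fit(X_train, y_train)
        y_pred = pipeline.predict(X_test)

        # Recover the names of the k features kept by SelectKBest
        selector = pipeline.named_steps['best_features']
        mask = selector.get_support()
        list_of_feature = X_train.columns[mask].to_list()

        results.append({
            "model": name,
            "k_features": ', '.join(list_of_feature),
            "r2": r2_score(y_test, y_pred),
            "mse": mean_squared_error(y_test, y_pred),
        })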

For anomaly detection, we used the studentised residual to determine whether the difference between a user’s actual p16 value and the model-estimated value was large enough to label the sample an “anomaly”. The studentised residual indicates whether a new sample is sufficiently distinct to be considered an influential point, making it well suited to this purpose [2]. Note that this platform is intended for research purposes only and may not provide sufficiently accurate predictions for clinical use; it should not be treated as a substitute for formal clinical diagnosis. Nevertheless, we hope that our findings can inform future clinical applications with larger sample sizes.
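As an illustration of the idea, the sketch below computes a studentised prediction residual for one new sample under an ordinary least squares fit. It is written under stated assumptions: the estimator is taken to be linear (the leverage term is not defined in the same way for the other candidate models), the function name `studentised_residual_new` is hypothetical, and the cut-off of 3 is a common rule of thumb rather than a threshold given in the source.

```python
import numpy as np


def studentised_residual_new(X_train, y_train, x_new, y_new):
    """Studentised prediction residual of one new sample under an OLS fit."""
    X = np.column_stack([np.ones(len(X_train)), X_train])  # add intercept
    beta, *_ = np.linalg.lstsq(X, y_train, rcond=None)
    resid = y_train - X @ beta
    n, p = X.shape
    s2 = resid @ resid / (n - p)                      # residual variance
    x0 = np.concatenate([[1.0], np.atleast_1d(x_new)])
    leverage = x0 @ np.linalg.solve(X.T @ X, x0)      # leverage of the new point
    return (y_new - x0 @ beta) / np.sqrt(s2 * (1.0 + leverage))


# Synthetic training data: y = 2x + noise.
rng = np.random.default_rng(0)
X_train = rng.uniform(0, 10, size=(100, 1))
y_train = 2.0 * X_train[:, 0] + rng.normal(0, 0.5, 100)

THRESHOLD = 3.0  # assumed cut-off, not taken from the source

t_typical = studentised_residual_new(X_train, y_train, [5.0], 10.0)
t_outlier = studentised_residual_new(X_train, y_train, [5.0], 20.0)

is_anomaly_typical = abs(t_typical) > THRESHOLD
is_anomaly_outlier = abs(t_outlier) > THRESHOLD
```

Because the training data are not refitted with the new sample, the statistic compares the new residual with the spread of residuals the model has already seen, scaled up slightly for points far from the training data.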

3) Aging Intervention Overview

Motivation: This section evaluates evidence-based interventions capable of slowing or reversing senescence biomarkers. It also examines practical frameworks for individuals to monitor the effectiveness of these interventions on their personal biological aging trajectory.

Method: ANOVA and linear regression were used to analyse trial data, assess the statistical significance of interventions and quantify their impact on preventing premature aging.

To determine the significance of the 12-week MVPA intervention programme, ANOVA tests were conducted on every tracked variable, namely the changes in the cellular variables and in selected anthropometric variables. For visualisation, side-by-side box-and-whisker plots were generated to compare the control group with the MVPA intervention group.
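A per-variable one-way ANOVA of this kind can be sketched with `scipy.stats.f_oneway`. The use of SciPy is an assumption, as the source does not name the library, and the trial data and column names below are synthetic and hypothetical.

```python
import numpy as np
import pandas as pd
from scipy.stats import f_oneway

# Synthetic trial data for illustration; column names are assumptions.
rng = np.random.default_rng(1)
trial = pd.DataFrame({
    "group": ["control"] * 30 + ["mvpa"] * 30,
    "delta_p16": np.concatenate([
        rng.normal(0.0, 0.3, 30),    # control: no change on average
        rng.normal(-1.0, 0.3, 30),   # intervention: simulated reduction
    ]),
    "delta_bmi": rng.normal(0.0, 1.0, 60),
})

# One-way ANOVA of each tracked change variable across the two groups.
rows = []
for var in ["delta_p16", "delta_bmi"]:
    control = trial.loc[trial["group"] == "control", var]
    mvpa = trial.loc[trial["group"] == "mvpa", var]
    f_stat, p_value = f_oneway(control, mvpa)
    rows.append({"variable": var, "F": f_stat, "p_value": p_value})
anova = pd.DataFrame(rows)
```

With matplotlib installed, a side-by-side comparison for one variable could then be drawn with `trial.boxplot(column="delta_p16", by="group")`.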

As in the biomarker estimation section, individual regression models were run for each recorded anthropometric variable against the change in both p16 and p21. This time, additional regressions were run with the corresponding interaction terms (each variable multiplied by a dummy variable equal to 1 for the intervention group and 0 for the control group) to identify any additional effect of MVPA on the independent variables that was reflected in the change in the aging biomarkers. Finally, a correlation matrix was also generated for visual interpretation.
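A regression with such an interaction term might be written as in the sketch below. The data are synthetic, the column names are hypothetical, and including the main effect of the dummy variable alongside the interaction is an assumption about the model specification, which the source does not spell out.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Synthetic trial data; mvpa is the dummy (1 = intervention, 0 = control).
rng = np.random.default_rng(2)
n = 120
trial = pd.DataFrame({
    "mvpa": np.repeat([0, 1], n // 2),
    "bmi": rng.normal(30, 3, n),
})
trial["delta_p16"] = (
    0.02 * trial["bmi"]
    - 0.2 * trial["mvpa"] * trial["bmi"]   # simulated extra effect under MVPA
    + rng.normal(0, 0.3, n)
)

# The bmi:mvpa coefficient measures how the bmi slope differs in the MVPA group.
fit = smf.ols("delta_p16 ~ bmi + mvpa + bmi:mvpa", data=trial).fit()
interaction_coef = fit.params["bmi:mvpa"]
interaction_p = fit.pvalues["bmi:mvpa"]
```

A significant interaction coefficient would suggest that the relationship between the variable and the biomarker change differs between the intervention and control groups.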