Data Preprocessing

Raw Data

Preparing the dataset for machine learning analysis required addressing missing data. Ecological datasets, such as those describing seagrass growth under different environmental conditions, are complex and naturally variable, so missing values were inevitable. To keep the dataset usable without compromising the validity of the analysis, we filled in missing values using K-Nearest Neighbors (KNN) imputation.

KNN Imputation Overview

KNN imputation estimates a missing value for a data point from the ‘K’ closest points in the parameter space, where ‘K’ is a predefined number of neighbors. The neighbors are identified with a distance metric, typically Euclidean distance, and the missing value is replaced by the mean or median of the neighbors' values, depending on the data distribution. In the scikit-learn KNNImputer used here, the imputed value is the mean of the neighbors' values. The method assumes that the way points cluster in the parameter space carries information about the structure of the dataset. This makes it well suited to ecological data, where similar observations often cluster together in terms of their environmental conditions or physiological characteristics.

Brief Code for KNN

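import pandas as pd
import numpy as np
from sklearn.impute import KNNImputer

# `array` is the numeric data matrix, with missing entries stored as np.nan
# Each missing entry is replaced by the mean of that feature over the 3 nearest rows
imputer = KNNImputer(n_neighbors=3)
array = imputer.fit_transform(array)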
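
As a small illustration of what the imputer does, the sketch below runs KNNImputer on a made-up 4 x 3 array. The values are hypothetical and not taken from our dataset. The first row is missing its third value, and its two nearest rows on the observed columns supply the replacement.

```python
import numpy as np
from sklearn.impute import KNNImputer

# Hypothetical toy data: rows are samples, columns are numeric features.
toy = np.array([
    [1.0, 2.0, np.nan],   # third feature is missing
    [1.1, 2.1, 3.0],      # close to row 0 on the observed features
    [0.9, 1.9, 5.0],      # also close to row 0
    [8.0, 9.0, 20.0],     # far from row 0
])

# Distances are computed on the features that both rows have observed;
# the missing entry becomes the mean of its 2 nearest neighbours.
imputer = KNNImputer(n_neighbors=2)
toy_imputed = imputer.fit_transform(toy)

print(toy_imputed[0, 2])  # mean of 3.0 and 5.0
```

The distant fourth row does not contribute, which is the behaviour we rely on when similar observations cluster together.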

Preprocessed Data

Machine Learning Algorithms

One-Hot Encoding


One-hot encoding transforms categorical variables into a numeric form that machine learning algorithms can use. For each unique category in a variable, it creates a new binary column that takes the value 1 if the category is present for a record and 0 otherwise.


from sklearn.preprocessing import OneHotEncoder
categorical_features = ['Treatment', 'Species']
categorical_transformer = OneHotEncoder()
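
To show what the encoder produces, here is a minimal sketch on three hypothetical records that use the same two column names. The records are made up for illustration.

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

# Hypothetical records using the two categorical columns named above.
records = pd.DataFrame({
    'Treatment': ['Control', 'MHW', 'MHW'],
    'Species': ['H. ovalis', 'H. ovalis', 'H. beccarii'],
})

encoder = OneHotEncoder()
# fit_transform returns a sparse matrix by default; convert it for display.
encoded = encoder.fit_transform(records).toarray()

# Categories are sorted within each variable, giving the column order:
# Treatment=Control, Treatment=MHW, Species=H. beccarii, Species=H. ovalis
print(encoder.categories_)
print(encoded)
```

Each row contains exactly one 1 per original variable, so the first record (Control, H. ovalis) becomes 1, 0, 0, 1.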

Linear Regression

Linear regression was the primary algorithm used in our analysis, given its effectiveness in identifying relationships between independent variables and a continuous dependent variable.

Model Selection

Linear regression was chosen for its simplicity, interpretability, and efficiency in predicting outcomes. It estimates the relationships between variables, giving a clear view of how different environmental factors and seagrass characteristics influence seagrass resistance to heatwaves.


import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline

# One-hot encode the categorical columns before they reach the regressor
preprocessor = ColumnTransformer(
    transformers=[('cat', categorical_transformer, categorical_features)])
model = Pipeline(steps=[('preprocessor', preprocessor),
                        ('regressor', LinearRegression())])

# X (predictors) and y (response) are assumed to be loaded beforehand
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)
model.fit(X_train, y_train)
y_pred = model.predict(X_test)

# Predict for every Treatment x Species combination
sample_data = pd.DataFrame({
    'Treatment': ['Control', 'Control', 'MHW', 'MHW'],
    'Species': ['H. ovalis', 'H. beccarii', 'H. ovalis', 'H. beccarii']
})

sample_predictions = model.predict(sample_data)
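
Because the listing above depends on data loaded elsewhere, the following self-contained sketch fits the same kind of pipeline on a small hypothetical dataset. The response values are invented (a baseline of 10, minus 4 under MHW, plus 2 for H. ovalis) purely to show that an additive pattern is recovered by the model. They are not results from our study.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder

# Hypothetical data: every Treatment x Species combination, observed twice.
toy_X = pd.DataFrame({
    'Treatment': ['Control', 'Control', 'MHW', 'MHW'] * 2,
    'Species': ['H. ovalis', 'H. beccarii', 'H. ovalis', 'H. beccarii'] * 2,
})
# Made-up response with purely additive effects:
# baseline 10, minus 4 under MHW, plus 2 for H. ovalis.
toy_y = (10.0
         - 4.0 * (toy_X['Treatment'] == 'MHW')
         + 2.0 * (toy_X['Species'] == 'H. ovalis')).to_numpy()

toy_model = Pipeline(steps=[
    ('preprocessor', ColumnTransformer(
        transformers=[('cat', OneHotEncoder(), ['Treatment', 'Species'])])),
    ('regressor', LinearRegression()),
])
toy_model.fit(toy_X, toy_y)

# Predict for the four distinct combinations (the first four rows).
toy_pred = toy_model.predict(toy_X.iloc[:4])
print(np.round(toy_pred, 2))
```

With an additive response like this one, the predictions match the invented values for each combination. A response with a Treatment x Species interaction would not be fitted exactly by this model.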

Random Forest Algorithm

To complement our linear regression analysis and address its limitations, we also employed the random forest algorithm. This approach is particularly useful for dealing with nonlinear relationships and interactions between variables that linear regression may not capture effectively.


from sklearn.compose import ColumnTransformer
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder
from sklearn.tree import DecisionTreeRegressor

one_hot_encoder = OneHotEncoder()
preprocessor = ColumnTransformer(
    transformers=[('cat', one_hot_encoder, categorical_features)],
    remainder='passthrough'  # Not necessary since there are no numeric features, but left for completeness
)

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

model = Pipeline(steps=[
    ('preprocessor', preprocessor),
    ('regressor', DecisionTreeRegressor(random_state=42))  # Using DecisionTreeRegressor (a single tree) here
])
model.fit(X_train, y_train)

y_pred = model.predict(X_test)
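
The listing above fits a single DecisionTreeRegressor. For the ensemble that this section's title refers to, a minimal sketch using scikit-learn's RandomForestRegressor might look like the following. The dataset and the n_estimators value are assumptions chosen for illustration, and the response includes an invented interaction (the MHW effect differs by species) of the kind a linear model without interaction terms would miss.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import RandomForestRegressor
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder

# Hypothetical data: 5 replicates of each Treatment x Species combination.
combos = pd.DataFrame({
    'Treatment': ['Control', 'Control', 'MHW', 'MHW'],
    'Species': ['H. ovalis', 'H. beccarii', 'H. ovalis', 'H. beccarii'],
})
rf_X = pd.concat([combos] * 5, ignore_index=True)
# Made-up response with an interaction: the MHW effect differs by species.
combo_means = np.array([10.0, 8.0, 3.0, 7.0])
rf_y = np.tile(combo_means, 5)

rf_model = Pipeline(steps=[
    ('preprocessor', ColumnTransformer(
        transformers=[('cat', OneHotEncoder(), ['Treatment', 'Species'])])),
    # An ensemble of 100 trees, each grown on a bootstrap sample of the rows.
    ('regressor', RandomForestRegressor(n_estimators=100, random_state=42)),
])
rf_model.fit(rf_X, rf_y)

# Predictions for the four distinct combinations.
rf_pred = rf_model.predict(combos)
print(np.round(rf_pred, 1))
```

Averaging over many trees makes the predictions less sensitive to any single split than one decision tree would be, at the cost of a less directly interpretable model.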

Model Validation

We validated our machine learning models, primarily linear regression and random forest, using Mean Squared Error (MSE) as the key metric. MSE is the average of the squared errors, that is, the squared differences between the actual and predicted values. A lower MSE indicates predictions that lie closer to the actual observations, which guided us in refining the models. This approach allowed us to assess and improve our models' predictions of the resilience of Halophila beccarii and Halophila ovalis to marine heatwaves, supporting findings that are reliable and useful for conservation efforts.
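
For n observations with actual values y_i and predictions p_i, MSE = (1/n) * sum of (y_i - p_i)^2. The short sketch below computes it by hand and with scikit-learn's mean_squared_error. The numbers are hypothetical and are not results from our models.

```python
import numpy as np
from sklearn.metrics import mean_squared_error

# Hypothetical observed and predicted values (not results from our models).
y_true = np.array([3.0, 5.0, 2.0, 7.0])
y_hat = np.array([2.5, 5.0, 3.0, 8.0])

# MSE by hand: the mean of the squared differences.
mse_manual = np.mean((y_true - y_hat) ** 2)
# The same quantity via scikit-learn.
mse_sklearn = mean_squared_error(y_true, y_hat)

print(mse_manual, mse_sklearn)  # both 0.5625
```

Because the errors are squared, one large miss raises the MSE more than several small ones, which is worth keeping in mind when comparing models on small test sets.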