The Biological Context
The green sea turtle (Chelonia mydas) is a large, migratory marine reptile found in tropical and subtropical oceans worldwide. As one of the few herbivorous sea turtles, they play a crucial role as "lawnmowers of the sea," maintaining the health of seagrass beds and coral reef ecosystems. Due to numerous threats, including habitat degradation, plastic pollution, and bycatch, the species is listed as Endangered on the IUCN Red List.

Project Objective
The primary goal of this analysis is to develop a robust Species Distribution Model (SDM) to predict suitable habitats for the green sea turtle on a global scale. This model aims to identify the key environmental factors driving their distribution and generate a predictive map that can guide conservation efforts.
Data and Methodology
1. Data Sources
- Occurrence Data: Species locations were obtained from the Global Biodiversity Information Facility (GBIF). The data was rigorously cleaned to include only reliable records (e.g., human observations) and remove duplicates.
- Environmental Data: A comprehensive set of marine environmental variables was sourced from Bio-Oracle. These predictors include sea surface temperature, salinity, depth (bathymetry), chlorophyll concentration, pH, and various nutrient levels.

Figure 1: Key environmental predictor variables used in the model. These maps show the global patterns of temperature, salinity, and ocean depth, which are critical drivers of marine life distribution.
2. Modeling and Validation
A Generalized Linear Model (GLM) was used to model the relationship between turtle presences and the environmental conditions. To build the model, a presence-background approach was used, comparing the environment at known turtle locations against 10,000 random "background" points from across the study area.
Crucially, to ensure the model's predictive power is not overestimated, the data was split into two independent sets:
- A Training Set (70% of the data) used to fit the GLM.
- A Testing Set (30% of the data) held back to evaluate the model's performance on unseen data.
The model's performance was quantified using the Area Under the Curve (AUC) of the Receiver Operating Characteristic (ROC) plot. Our model achieved an AUC of 0.9707 on the test data, indicating excellent predictive accuracy.
Results and Interpretation
The trained GLM was used to predict habitat suitability across the world's oceans. The resulting map highlights areas with environmental conditions favorable for Chelonia mydas.

Figure 2: Final model output. The top panel shows the predicted habitat suitability (warmer colors indicate higher suitability). The bottom panel overlays the known GBIF occurrences (blue dots), demonstrating a strong visual match between predictions and reality.
Interpretation of the Results
The model successfully identified key habitats for the green sea turtle. The areas with the highest predicted suitability (in yellow) correspond to:
- Warm, shallow coastal waters in tropical and subtropical zones.
- Major known congregation areas like the Caribbean Sea, the Coral Triangle in Southeast Asia, the Great Barrier Reef, and the coastal waters of Brazil and East Africa.
- The model correctly predicts low suitability in cold, deep, open-ocean environments, which are unsuitable for this species.
The strong visual alignment between the high-suitability areas and the independent occurrence points (bottom panel) provides confidence in the model's predictions. The high AUC score further validates that the model has effectively learned the species' environmental niche.
Value and Applications
This habitat suitability map is a powerful tool for conservation. It can be used to:
- Prioritize conservation efforts by identifying critical habitats that require protection.
- Inform the design and placement of Marine Protected Areas (MPAs).
- Assess how climate change might impact the future distribution of suitable habitats.
- Guide future field research to discover new populations.
Future Directions: Incorporating Biotic Interactions
This model is based on abiotic (non-living) factors like temperature and salinity. However, a species' distribution is also heavily influenced by biotic interactions, especially the availability of food.
A powerful next step would be to create a more ecologically complete model by incorporating the distribution of the green sea turtle's primary food sources: seagrasses and algae.
This could be accomplished by:
- Developing separate SDMs for key seagrass or algae species to predict their suitable habitats.
- Using the resulting food-source suitability map as an additional predictor variable in a new Chelonia mydas model.
Such an analysis would help distinguish between generally suitable areas and critical foraging grounds, providing a much more nuanced and powerful tool for targeted conservation actions.
Reproducible R Code
The complete R script used for this analysis is provided below for transparency and reproducibility.
# Load essential libraries
library(tidyverse)
library(sf)
library(geodata)
library(raster)
library(dismo)
library(rJava)
library(dplyr)
library(terra)
library(pROC)
# Initialize Java (needed for some dismo functions)
.jinit()
# ========== Load and clean GBIF occurrence data ==========
# Replace with the path to your data file
gbif_data <- read.table("/path/to/your/occurrence.txt",
header = TRUE, sep = "\t", quote = "", stringsAsFactors = FALSE, fill = TRUE)
# Filter for valid coordinates and reliable records
cleaned_data <- gbif_data %>%
filter(!is.na(decimalLatitude) & !is.na(decimalLongitude)) %>%
filter(decimalLatitude != 0 & decimalLongitude != 0) %>%
filter(basisOfRecord %in% c("HUMAN_OBSERVATION", "OBSERVATION")) %>%
distinct(decimalLatitude, decimalLongitude, .keep_all = TRUE)
# Convert to a spatial sf object
points_sf <- st_as_sf(cleaned_data, coords = c("decimalLongitude", "decimalLatitude"), crs = 4326)
# ========== Load and Assemble Environmental Data (Bio-Oracle) ==========
data_dir <- "/path/to/your/environmental_data/"
# Load individual NetCDF layers
temp_layer <- rast(file.path(data_dir, "thetao_baseline_2000_2019_depthsurf_74ff_39fa_9adc_U1746345739954.nc"))[[1]]
depth_layer <- rast(file.path(data_dir, "terrain_characteristics_6d78_4f59_9e1f_U1746345111304.nc"))[[1]]
# ... and so on for all other layers ...
sal_layer <- rast(file.path(data_dir, "so_baseline_2000_2019_depthsurf_49ed_5fc4_602c_U1746345608381.nc"))[[1]]
# Function to align all raster layers to the same grid
align_layer <- function(target, reference) {
target_crop <- crop(target, reference)
resample(target_crop, reference)
}
ref <- temp_layer
# Combine all aligned layers into a single stack
env_stack <- c(
temp_layer, align_layer(depth_layer, ref), align_layer(sal_layer, ref) # Abbreviated for example
# Add all other aligned layers here
)
# Make sure to name all layers
names(env_stack) <- c("temp", "depth", "salinity", ...)
# ========== Prepare Data for Modeling ==========
# Extract environmental values for presence points
points_vect <- vect(points_sf)
pres_vals <- terra::extract(env_stack, points_vect)
pres_vals$presence <- 1
# Generate 10,000 random background points
set.seed(42)
bg_points <- spatSample(env_stack, size = 10000, method = "random", na.rm = TRUE, as.points = TRUE)
bg_vals <- terra::extract(env_stack, bg_points)
bg_vals$presence <- 0
# Combine presence and background data into a single dataframe
model_df <- na.omit(rbind(pres_vals, bg_vals))
# ========== Train-Test Split and Model Evaluation ==========
# Split data into 70% training and 30% testing
set.seed(42)
train_idx <- sample(seq_len(nrow(model_df)), size = 0.7 * nrow(model_df))
train_data <- model_df[train_idx, ]
test_data <- model_df[-train_idx, ]
# Fit GLM on the TRAINING data only
glm_model <- glm(presence ~ ., data = train_data[, -1], family = binomial)
summary(glm_model)
# Predict on the held-out TEST data
test_preds <- predict(glm_model, newdata = test_data[, -which(names(test_data) == "presence")], type = "response")
# Evaluate performance with AUC
roc_obj <- roc(test_data$presence, test_preds)
auc_score <- auc(roc_obj)
print(paste("Model AUC on test data:", round(auc_score, 4)))
# ========== Predict and Plot Final Map ==========
# Use the trained model to predict suitability across the entire map
prediction <- predict(env_stack, glm_model, type = "response")
# Save the final plot
png("/path/to/save/your/plot/predicted_habitat.png", width=800, height=800)
par(mfrow= c(2,1), mar=c(2,2,2,2)) # Adjust margins
plot(prediction, main = "Predicted Habitat Suitability for Chelonia mydas")
plot(prediction, main = "With Chelonia mydas GBIF Occurrences")
points(points_sf, col = "blue", pch = 20, cex = 0.5)
dev.off()