Machine Learning for Climate Change Modeling

I built a large-scale, end-to-end data science pipeline to study how long-run climate trends align with agricultural performance across Africa. Rather than focusing on causal claims, the objective was to integrate massive, heterogeneous climate and crop datasets into a unified analytical framework, extract interpretable signals, and surface regions where climate stress and agricultural underperformance increasingly coincide.

The project emphasizes robust data ingestion, large-scale preprocessing, feature engineering across multiple time scales, and ensemble modeling, operating on tens of millions of raw records spanning over a century of climate and production data.

Total data processed (raw):

  • ~8.6M monthly climate observations

  • ~2.9M daily city-level temperature records

  • ~27.6M daily multivariable weather rows

  • ~2.5M agricultural production records

Feature Engineering


Large-Scale Data Ingestion & Cleaning: Designed fault-tolerant loaders and cleaning routines to reliably ingest climate and agricultural data with inconsistent encodings and formats.

Technical highlights:

  • Automatic fallback encoding handling (UTF-8 → Latin-1) to prevent pipeline failure

  • Parsing of monthly, daily, and annual time formats into standardized datetime indices

  • Explicit handling of invalid sentinel values (e.g., −99, “\N”) without silent imputation

Data reduction: Through geographic filtering and normalization, I reduced datasets from >41M rows to ~575K analytically relevant records, while preserving long-run trends.


Geographic Normalization & Dataset Integration: Built a robust city-to-country and region-level mapping system to align city-based climate data with country-level agricultural statistics.

Technical highlights:

  • Text normalization using accent stripping, casing control, and whitespace trimming

  • Fuzzy string matching (token-based similarity thresholds) to resolve naming inconsistencies

  • Explicit city-to-country linkage enabling climate–agriculture joins across scales

  • Regional grouping (North, West, East, Central, Southern Africa) for comparative analysis

This step enabled cross-scale analysis, allowing urban heat and long-term climate signals to be compared against national agricultural outcomes.


Feature Engineering Across Climate & Agriculture: Converted raw time-series data into interpretable, model-ready features capturing trends, variability, and climate stress.

Technical highlights:

  • Rolling averages and smoothing windows to isolate long-term temperature signals

  • Seasonal completeness filters to prevent biased yearly aggregates

  • Agricultural deviation metrics comparing observed production to expected growth

  • Climate interaction features linking temperature, rainfall, wind variability, and volatility

Feature groups engineered:

  • Long-term warming rates

  • Climate variability and interaction metrics

  • Composite heat vulnerability indicators


Machine Learning Forecasting & Risk Scoring: Implemented an ensemble learning framework to project future climate risk while maintaining interpretability and uncertainty awareness.

Technical highlights:

  • Decade-level aggregation to stabilize long-run learning signals

  • Ensemble of Random Forest and Gradient Boosting regressors

  • Explicit geographic features (latitude, longitude, distance from equator)

  • Standardized training pipeline with reproducible splits and scaling

  • Composite climate risk index combining exposure, warming magnitude, and variability

Scale & output:

  • Trained on 60+ years of historical climate behavior

  • Generated 144 future climate risk scenarios across cities, timelines, and pathways

  • Produced ranked risk tables and spatial vulnerability maps