Rufai Omowunmi Balogun, Caroline Margaux Gevaert, Derrick Mirindi, Aaron Opdyke, Capucine Anne Veronique Riom, Pierre Chrzanowski, and Edward Charles Anderson
This repository contains a config-driven pipeline for downloading and benchmarking global building-footprint and settlement-layer datasets over selected AOIs.
The download pipeline covers vector datasets such as Overture, Global Building Atlas, and 3D-GloBFP, and raster datasets such as Google Open Buildings, TEMPO, WSF-Tracker, and GHSL Built-up and Height.
The validation pipeline compares each candidate dataset against AOI-specific reference building footprints. It uses an AOI inventory file to find the city folder, AOI geometry, and reference data for each evaluation area.
The repository is set up to run over the AOIs listed in data/02_interim/aoi_tracker.csv. Each row points to the AOI boundary and reference footprint files for a city.
Reference dataset used for benchmarking is publicly accessible here:
The validation framework has two evaluation paths.
Vector datasets are validated as object-based building footprints. The pipeline clips and repairs geometries where needed, removes very small polygons below the configured minimum area threshold, tiles each AOI into fixed evaluation units, and performs greedy IoU-based one-to-one matching between reference and candidate footprints. It then reports TP, FP, FN, precision, recall, F1, IoU summaries, boundary F-scores, area error, size-bin metrics, and city-level count and density summaries.
Raster datasets are validated as gridded built-up layers. The pipeline reprojects each raster to the AOI CRS, aligns it to one or more evaluation grids, rasterizes the reference footprints into fractional built-up area, thresholds both reference and prediction to built-up masks, and computes area-based precision, recall, F1, relative area error, quantity disagreement, allocation disagreement, and building-count estimates derived from the reference mean building size.
Both paths write tile-level outputs first and then aggregate to city-level summary tables and figures.
| Module | Role |
|---|---|
src/downloader.py |
UrbanDownloader — thin orchestrator that loads the AOI inventory and dispatches per-dataset download runners |
src/download/* |
Source-specific download runners — vector: Overture, GBA, GloBFP; raster: Google OBT, TEMPO, GHSL, WSF Tracker — plus shared DuckDB connection setup |
src/validator.py |
UrbanValidator — thin orchestrator that dispatches vector and raster validation per city |
src/validate/vector_runner.py |
Tile-level vector validation, match consolidation, city summaries, density summaries, and vector figures |
src/validate/raster_runner.py |
Tile-level raster validation, city summaries, and raster figures |
src/metrics/vector/* |
IoU matching, boundary metrics, size-bin metrics, and vector tile assembly |
src/metrics/raster/* |
Raster alignment, reference rasterization, binarization, disagreement metrics, and raster tile assembly |
src/plots/output.py |
City-level summaries and the standard vector/raster figures |
src/plots/figures.py |
Figure dispatchers for vector and raster validation |
src/plots/visualize.py |
Interactive dataset-coverage map (build_inventory_map) for the AOI/reference inventory |
src/utils/* |
AOI loading, tiling, building loading, geometry repair, memory helpers, and weighted aggregation |
src/config.py |
Typed dataclass config for the download pipeline |
Configuration is split across two files:
configs/data_configs.yaml— controls which datasets to download and from whereconfigs/validation_configs.yaml— controls validation thresholds, candidate datasets, evaluation grids, and output format
conda env create -f environment.yaml
conda activate urban_validation
pip install duckdb psutil earthengine-api
earthengine authenticate # required for Google OBT and GHSL downloadsfrom src.downloader import UrbanDownloader
UrbanDownloader("configs/data_configs.yaml").download_vector()from src.downloader import UrbanDownloader
UrbanDownloader("configs/data_configs.yaml").download_raster()Raster files are saved as:
- Google OBT:
data/01_raw/<city>/raster/<city_slug>_obt_<year>.tif - Microsoft TEMPO:
data/01_raw/<city>/raster/<city_slug>_tempo_<quarter>.tif - GHSL:
data/01_raw/<city>/raster/<city_slug>_ghsl_<product>_<year>.tif
from src.validator import UrbanValidator
v = UrbanValidator("configs/validation_configs.yaml")
v.validate_vector()Outputs per city are written to outputs/metrics/<city>/ and outputs/figures/<city>/. Vector outputs include tile metrics, match records, city-level summary tables, size-bin metrics, and density/count summaries. Candidate datasets and preprocessing thresholds are controlled via configs/validation_configs.yaml.
See notebooks/01_vector_validator.ipynb for the Colab-ready notebook.
from src.validator import UrbanValidator
v = UrbanValidator("configs/validation_configs.yaml")
v.validate_raster()Each raster dataset entry in configs/validation_configs.yaml specifies a name, year, binarization method, and optionally a native resolution and rasterization settings. The pipeline resolves the exact file for each city from the year field — for example, setting year: 2020 for ghsl_built_s loads <city_slug>_ghsl_built_s_2020.tif. Multiple years of the same product can be validated by adding separate entries.
Each dataset's min_building_m2 sets the minimum-detectable-building threshold used to binarize predictions at native resolution (tau_frac_native = min_building_m2 / native_pixel_area); categorical binarization methods (wsf_tracker, binary, nonzero, value_in) bypass this and use predicted area > 0 directly. The global raster.preprocessing.min_building_m2 sets the threshold used to binarize the reference layer at each evaluation grid resolution, keeping the reference "built" definition consistent across datasets. raster.preprocessing.native_resolution_guard controls whether evaluation grids finer than a dataset's native_resolution_m (scaled by tolerance_factor) are skipped (skip_finer), rejected (error), or allowed with a warning (warn_only).
See notebooks/02_raster_validator.ipynb for the Colab-ready notebook.
The validation pipeline writes city-level summaries and figures for both vector and raster datasets. The standard outputs now include building-count analytics in addition to IoU, area error, and F1 summaries:
outputs/metrics/<city>/vector_city_summary_all_datasets.{parquet,csv}includes reference/candidate count totals and count deltas.outputs/metrics/<city>/vector_city_density_summary.{parquet,csv}includes per-source counts, densities, and count-vs-reference deltas.outputs/metrics/<city>/raster_city_summary_all_datasets.{parquet,csv}includes predicted/reference building counts, count deltas, and relative count differences.outputs/figures/<city>/contains the standard tile/F1/IoU plots plus building-count comparison figures.
The datasets are organized as follows:
data/
01_raw/<city>/
aoi/ AOI boundary files
vector/ candidate building parquets
raster/ raster settlement layers
02_interim/
aoi_tracker.csv
outputs/
metrics/<city>/
vector_metrics_tiles_<dataset>.parquet
vector_matches_all_datasets.parquet
vector_city_summary_all_datasets.{parquet,csv}
vector_city_density_summary.{parquet,csv}
raster_metrics_tiles_<dataset>.parquet
raster_city_summary_all_datasets.{parquet,csv}
figures/<city>/
*_f1_*.png
*_iou_*.png
*_building_counts_*.png
The exact filenames depend on the enabled datasets and evaluation grids in configs/validation_configs.yaml.
This code repository and corresponding datasets are distributed under the MIT License. See LICENSE for more information
@misc{urban_validation_gfdrr2026,
title={An assessment of satellite derived global urban datasets for operational and analytical use cases},
author={Rufai Omowunmi Balogun, Caroline Margaux Gevaert, Derrick Mirindi, Aaron Opdyke, Capucine Anne Veronique Riom, Pierre Chrzanowski, Edward Charles Anderson},
year={2026},
organization={GFDRR, The World Bank Group},
type={Dataset},
howpublished={\url{https://github.com/GFDRR/urban_validation}}
}
- The Global Facility for Disaster Reduction and Recovery (GFDRR), The World Bank Group,
- The Gates Foundation,
- The Humanitarian Openstreet Map (HOTOSM),
- University of Edinburgh
- The SPARC Project for access to datasets from partner countries and cities for on ground validation, and the broader World Bank Group Digital Earth Partnership Team for feedbacks and inputs


