GFDRR/urban_validation

Project overview

This repository contains a config-driven pipeline for downloading and benchmarking global building-footprint and settlement-layer datasets over selected AOIs.

The download pipeline covers vector datasets such as Overture, Global Building Atlas, and 3D-GloBFP, and raster datasets such as Google Open Buildings, TEMPO, WSF-Tracker, and GHSL Built-up and Height.

The validation pipeline compares each candidate dataset against AOI-specific reference building footprints. It uses an AOI inventory file to find the city folder, AOI geometry, and reference data for each evaluation area.

Data availability

The repository is set up to run over the AOIs listed in data/02_interim/aoi_tracker.csv. Each row points to the AOI boundary and reference footprint files for a city.

AOIs with high-quality references

The reference dataset used for benchmarking is publicly accessible here:

Validation approach

The validation framework has two evaluation paths.

Vector datasets are validated as object-based building footprints. The pipeline clips and repairs geometries where needed, removes very small polygons below the configured minimum area threshold, tiles each AOI into fixed evaluation units, and performs greedy IoU-based one-to-one matching between reference and candidate footprints. It then reports TP, FP, FN, precision, recall, F1, IoU summaries, boundary F-scores, area error, size-bin metrics, and city-level count and density summaries.
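
The matching step described above can be sketched as follows. This is a minimal illustration, not the repository's implementation: footprints are simplified to axis-aligned boxes so that the example needs only the standard library, and the names box_iou, greedy_match, detection_scores, and the 0.5 IoU threshold are assumptions chosen for illustration.

```python
from itertools import product

# Simplification (assumption): a footprint is an axis-aligned box
# (xmin, ymin, xmax, ymax) rather than an arbitrary polygon.
Box = tuple[float, float, float, float]


def box_area(b: Box) -> float:
    """Area of a box; degenerate or inverted boxes have area 0."""
    return max(0.0, b[2] - b[0]) * max(0.0, b[3] - b[1])


def box_iou(a: Box, b: Box) -> float:
    """Intersection over union of two boxes."""
    inter = box_area((max(a[0], b[0]), max(a[1], b[1]),
                      min(a[2], b[2]), min(a[3], b[3])))
    union = box_area(a) + box_area(b) - inter
    return inter / union if union > 0 else 0.0


def greedy_match(reference, candidate, iou_threshold=0.5):
    """Greedy one-to-one matching: take pairs in descending IoU order and
    accept a pair only if neither footprint has been matched already."""
    pairs = [
        (box_iou(r, c), i, j)
        for (i, r), (j, c) in product(enumerate(reference), enumerate(candidate))
    ]
    pairs = [p for p in pairs if p[0] >= iou_threshold]
    pairs.sort(key=lambda p: p[0], reverse=True)

    used_ref, used_cand, matches = set(), set(), []
    for iou, i, j in pairs:
        if i not in used_ref and j not in used_cand:
            used_ref.add(i)
            used_cand.add(j)
            matches.append((i, j, iou))
    return matches


def detection_scores(n_ref, n_cand, n_matched):
    """TP = matched pairs, FP = unmatched candidates, FN = unmatched references."""
    tp, fp, fn = n_matched, n_cand - n_matched, n_ref - n_matched
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {"tp": tp, "fp": fp, "fn": fn,
            "precision": precision, "recall": recall, "f1": f1}
```

Because each footprint can be used at most once, a reference building that overlaps two candidates contributes one TP and one FP rather than two TPs.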

Vector validation method

Raster datasets are validated as gridded built-up layers. The pipeline reprojects each raster to the AOI CRS, aligns it to one or more evaluation grids, rasterizes the reference footprints into fractional built-up area, thresholds both reference and prediction to built-up masks, and computes area-based precision, recall, F1, relative area error, quantity disagreement, allocation disagreement, and building-count estimates derived from the reference mean building size.
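
As a rough sketch of the area-based metrics listed above, the example below binarizes two small grids of built-up fractions and derives the scores from the resulting masks. It assumes equal-area grid cells, a "greater than or equal" threshold comparison, and the common two-class definitions of quantity disagreement (|FP - FN| / N) and allocation disagreement (2 * min(FP, FN) / N). The function names are hypothetical and the repository's exact formulas may differ.

```python
def binarize(fractions, threshold):
    """Mark a cell as built-up when its built-up fraction reaches the threshold."""
    return [[f >= threshold for f in row] for row in fractions]


def raster_scores(reference_mask, predicted_mask):
    """Cell-count (equal-area) agreement metrics between two boolean grids."""
    tp = fp = fn = n = 0
    for ref_row, pred_row in zip(reference_mask, predicted_mask):
        for ref, pred in zip(ref_row, pred_row):
            n += 1
            if ref and pred:
                tp += 1
            elif pred:
                fp += 1
            elif ref:
                fn += 1

    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * tp / (2 * tp + fp + fn) if tp + fp + fn else 0.0
    ref_area, pred_area = tp + fn, tp + fp
    return {
        "precision": precision,
        "recall": recall,
        "f1": f1,
        # Signed difference in built-up area relative to the reference.
        "relative_area_error": (pred_area - ref_area) / ref_area if ref_area else 0.0,
        # Disagreement caused by predicting the wrong total amount of built-up area.
        "quantity_disagreement": abs(fp - fn) / n if n else 0.0,
        # Disagreement caused by placing built-up cells in the wrong locations.
        "allocation_disagreement": 2 * min(fp, fn) / n if n else 0.0,
    }
```

Under these definitions the two disagreement terms sum to the overall share of misclassified cells, (FP + FN) / N, which makes it possible to tell whether a dataset's error comes mostly from over- or under-prediction or from misplacement.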

Raster validation method

Both paths write tile-level outputs first and then aggregate to city-level summary tables and figures.
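
One plausible way to roll tile-level results up to a city-level row is to pool the counts across tiles and weight per-tile averages by the number of matches. The sketch below assumes that approach and a hypothetical tile record with tp, fp, fn, and mean_iou keys. The repository's weighted aggregation helpers may use different weights or column names.

```python
def aggregate_tiles(tiles):
    """Pool tile counts into one city-level summary.

    Each tile is a dict with 'tp', 'fp', 'fn' and 'mean_iou' keys
    (hypothetical field names used for illustration only).
    """
    tp = sum(t["tp"] for t in tiles)
    fp = sum(t["fp"] for t in tiles)
    fn = sum(t["fn"] for t in tiles)

    # Metrics are computed from pooled counts, so dense tiles carry more
    # weight than sparse ones instead of every tile counting equally.
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * tp / (2 * tp + fp + fn) if tp + fp + fn else 0.0

    # Mean IoU is weighted by the number of matched pairs in each tile.
    mean_iou = sum(t["mean_iou"] * t["tp"] for t in tiles) / tp if tp else 0.0

    return {"tp": tp, "fp": fp, "fn": fn, "precision": precision,
            "recall": recall, "f1": f1, "mean_iou": mean_iou}
```

Pooling counts avoids the bias of a plain average of tile F1 scores, where a tile with two buildings would influence the city score as much as a tile with two hundred.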

Code organization

Module | Role
src/downloader.py | UrbanDownloader: thin orchestrator that loads the AOI inventory and dispatches per-dataset download runners
src/download/* | Source-specific download runners (vector: Overture, GBA, GloBFP; raster: Google OBT, TEMPO, GHSL, WSF Tracker), plus shared DuckDB connection setup
src/validator.py | UrbanValidator: thin orchestrator that dispatches vector and raster validation per city
src/validate/vector_runner.py | Tile-level vector validation, match consolidation, city summaries, density summaries, and vector figures
src/validate/raster_runner.py | Tile-level raster validation, city summaries, and raster figures
src/metrics/vector/* | IoU matching, boundary metrics, size-bin metrics, and vector tile assembly
src/metrics/raster/* | Raster alignment, reference rasterization, binarization, disagreement metrics, and raster tile assembly
src/plots/output.py | City-level summaries and the standard vector/raster figures
src/plots/figures.py | Figure dispatchers for vector and raster validation
src/plots/visualize.py | Interactive dataset-coverage map (build_inventory_map) for the AOI/reference inventory
src/utils/* | AOI loading, tiling, building loading, geometry repair, memory helpers, and weighted aggregation
src/config.py | Typed dataclass config for the download pipeline

Configuration is split across two files:

  • configs/data_configs.yaml — controls which datasets to download and from where
  • configs/validation_configs.yaml — controls validation thresholds, candidate datasets, evaluation grids, and output format

Usage Example

Setup requirements

conda env create -f environment.yaml
conda activate urban_validation
pip install duckdb psutil earthengine-api
earthengine authenticate   # required for Google OBT and GHSL downloads

Data Preparation and Download

Vector datasets download pipeline

from src.downloader import UrbanDownloader

UrbanDownloader("configs/data_configs.yaml").download_vector()

Raster download pipeline

from src.downloader import UrbanDownloader

UrbanDownloader("configs/data_configs.yaml").download_raster()

Raster files are saved as:

  • Google OBT: data/01_raw/<city>/raster/<city_slug>_obt_<year>.tif
  • Microsoft TEMPO: data/01_raw/<city>/raster/<city_slug>_tempo_<quarter>.tif
  • GHSL: data/01_raw/<city>/raster/<city_slug>_ghsl_<product>_<year>.tif

Data Validation Pipeline

Vector datasets validation

from src.validator import UrbanValidator

v = UrbanValidator("configs/validation_configs.yaml")
v.validate_vector()

Outputs per city are written to outputs/metrics/<city>/ and outputs/figures/<city>/. Vector outputs include tile metrics, match records, city-level summary tables, size-bin metrics, and density/count summaries. Candidate datasets and preprocessing thresholds are controlled via configs/validation_configs.yaml.

See notebooks/01_vector_validator.ipynb for the Colab-ready notebook.

Raster datasets validation

from src.validator import UrbanValidator

v = UrbanValidator("configs/validation_configs.yaml")
v.validate_raster()

Each raster dataset entry in configs/validation_configs.yaml specifies a name, year, binarization method, and optionally a native resolution and rasterization settings. The pipeline resolves the exact file for each city from the year field — for example, setting year: 2020 for ghsl_built_s loads <city_slug>_ghsl_built_s_2020.tif. Multiple years of the same product can be validated by adding separate entries.
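
The file lookup described above can be illustrated with a small helper. This is a sketch under assumptions: the function name resolve_raster_path, its parameters, and the placeholder city names are hypothetical, and the pattern only covers products whose filenames end in a year, as in the ghsl_built_s example.

```python
from pathlib import Path


def resolve_raster_path(data_root, city, city_slug, name, year):
    """Build the expected raster path for one dataset entry.

    Follows the documented layout data/01_raw/<city>/raster/ and the
    <city_slug>_<name>_<year>.tif pattern from the ghsl_built_s example.
    """
    filename = f"{city_slug}_{name}_{year}.tif"
    return Path(data_root) / "01_raw" / city / "raster" / filename


# Validating several years of one product means one entry per year.
entries = [
    {"name": "ghsl_built_s", "year": 2018},
    {"name": "ghsl_built_s", "year": 2020},
]
paths = [
    resolve_raster_path("data", "example_city", "example_city", e["name"], e["year"])
    for e in entries
]
```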

Three settings control raster binarization and grid selection:

  • Each dataset's min_building_m2 sets the minimum-detectable-building threshold used to binarize predictions at native resolution (tau_frac_native = min_building_m2 / native_pixel_area). Categorical binarization methods (wsf_tracker, binary, nonzero, value_in) bypass this threshold and use predicted area > 0 directly.
  • The global raster.preprocessing.min_building_m2 sets the threshold used to binarize the reference layer at each evaluation grid resolution, which keeps the reference "built" definition consistent across datasets.
  • raster.preprocessing.native_resolution_guard controls what happens to evaluation grids finer than a dataset's native_resolution_m (scaled by tolerance_factor): they are skipped (skip_finer), rejected (error), or allowed with a warning (warn_only).
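
To make the threshold arithmetic and the resolution guard concrete, here is a minimal sketch. It assumes square native pixels (so native_pixel_area = native_resolution_m ** 2) and assumes a grid counts as "finer" when its cell size is below native_resolution_m * tolerance_factor. The function names are hypothetical and the repository may apply the tolerance differently.

```python
import warnings


def native_fraction_threshold(min_building_m2, native_resolution_m):
    """tau_frac_native = min_building_m2 / native_pixel_area (square pixels assumed)."""
    native_pixel_area = native_resolution_m ** 2
    return min_building_m2 / native_pixel_area


def select_grids(grid_sizes_m, native_resolution_m,
                 tolerance_factor=1.0, mode="skip_finer"):
    """Apply a native-resolution guard to a list of evaluation grid sizes."""
    limit = native_resolution_m * tolerance_factor
    too_fine = [g for g in grid_sizes_m if g < limit]
    if not too_fine:
        return list(grid_sizes_m)

    if mode == "skip_finer":
        # Drop grids that are finer than the dataset can support.
        return [g for g in grid_sizes_m if g >= limit]
    if mode == "error":
        raise ValueError(f"grids {too_fine} are finer than {limit} m")
    if mode == "warn_only":
        warnings.warn(f"grids {too_fine} are finer than {limit} m")
        return list(grid_sizes_m)
    raise ValueError(f"unknown guard mode: {mode!r}")
```

For example, with min_building_m2 = 25 and a 10 m native resolution, a pixel is treated as built-up once at least 25 / 100 = 0.25 of its area is predicted as building.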

See notebooks/02_raster_validator.ipynb for the Colab-ready notebook.

Result visualization

The validation pipeline writes city-level summaries and figures for both vector and raster datasets. The standard outputs now include building-count analytics in addition to IoU, area error, and F1 summaries:

  • outputs/metrics/<city>/vector_city_summary_all_datasets.{parquet,csv} includes reference/candidate count totals and count deltas.
  • outputs/metrics/<city>/vector_city_density_summary.{parquet,csv} includes per-source counts, densities, and count-vs-reference deltas.
  • outputs/metrics/<city>/raster_city_summary_all_datasets.{parquet,csv} includes predicted/reference building counts, count deltas, and relative count differences.
  • outputs/figures/<city>/ contains the standard tile/F1/IoU plots plus building-count comparison figures.

File Organization

The datasets are organized as follows:

data/
  01_raw/<city>/
    aoi/           AOI boundary files
    vector/        candidate building parquets
    raster/        raster settlement layers
  02_interim/
    aoi_tracker.csv

outputs/
  metrics/<city>/
    vector_metrics_tiles_<dataset>.parquet
    vector_matches_all_datasets.parquet
    vector_city_summary_all_datasets.{parquet,csv}
    vector_city_density_summary.{parquet,csv}
    raster_metrics_tiles_<dataset>.parquet
    raster_city_summary_all_datasets.{parquet,csv}
  figures/<city>/
    *_f1_*.png
    *_iou_*.png
    *_building_counts_*.png

The exact filenames depend on the enabled datasets and evaluation grids in configs/validation_configs.yaml.

License

This code repository and the corresponding datasets are distributed under the MIT License. See LICENSE for more information.

Citation

@misc{urban_validation_gfdrr2026,
  title={An assessment of satellite derived global urban datasets for operational and analytical use cases},
  author={Rufai Omowunmi Balogun and Caroline Margaux Gevaert and Derrick Mirindi and Aaron Opdyke and Capucine Anne Veronique Riom and Pierre Chrzanowski and Edward Charles Anderson},
  year={2026},
  organization={GFDRR, The World Bank Group},
  type={Dataset},
  howpublished={\url{https://github.com/GFDRR/urban_validation}}
}

Acknowledgment

  1. The Global Facility for Disaster Reduction and Recovery (GFDRR), The World Bank Group
  2. The Gates Foundation
  3. The Humanitarian OpenStreetMap Team (HOTOSM)
  4. The University of Edinburgh
  5. The SPARC Project, for access to datasets from partner countries and cities for on-the-ground validation, and the broader World Bank Group Digital Earth Partnership Team, for feedback and input

About

An independent validation framework for assessing the accuracy and biases of globally available satellite-derived building datasets
