address_levels hierarchy inconsistent: 3-level countries have variable depth (LV 62%, SK 55% missing finest level) #509

@yharby

Summary

In the 2026-03-18.0 release, several countries listed with 3 address_levels have a variable number of populated levels. Consumers who assume the finest level (address_levels[3], using 1-based indexing as in DuckDB) always contains the city/municipality lose significant coverage: 62% of addresses in Latvia and 55% in Slovakia.

This is similar to #367 (US addresses with NULL address_levels[2]), but affects the 3-level countries more severely.

Affected Countries

3-level countries with variable depth

| Country | Total addresses | level3 populated | level3 NULL (city must come from level2 or level1) | % lost if only checking level3 |
| --- | --- | --- | --- | --- |
| LV (Latvia) | 548,712 | 208,256 | 340,456 | 62.0% |
| SK (Slovakia) | 1,697,528 | 757,325 | 940,203 | 55.4% |
| EE (Estonia) | 2,228,661 | 2,076,759 | 151,902 | 6.8% |
| IT (Italy) | 25,914,431 | 25,912,438 | 1,993 | <0.01% |
| TW (Taiwan) | 9,630,602 | 9,630,597 | 5 | <0.01% |
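
The "% lost" column follows directly from the counts. As a quick sanity check, here is a minimal Python sketch of that arithmetic, with the figures copied from the table above. The names `counts` and `pct_lost` are illustrative only and not part of any Overture tooling.

```python
# Counts copied from the table above: (total addresses, level3 NULL).
counts = {
    "LV": (548_712, 340_456),
    "SK": (1_697_528, 940_203),
    "EE": (2_228_661, 151_902),
    "IT": (25_914_431, 1_993),
    "TW": (9_630_602, 5),
}


def pct_lost(total: int, level3_null: int) -> float:
    """Percentage of addresses that have no value at level3."""
    return 100.0 * level3_null / total


# Rounded to one decimal place, as reported in the table.
loss = {cc: round(pct_lost(total, nulls), 1) for cc, (total, nulls) in counts.items()}

for cc, pct in loss.items():
    print(f"{cc}: {pct}%")
```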

Latvia — 3 distinct hierarchy patterns

Pattern 1 (111K): Major cities — only level1 populated
  level1=Rīga, level2=NULL, level3=NULL
  → City IS level1 (Rīga)

Pattern 2 (229K): Novads + town — level1 and level2 populated
  level1=Jēkabpils nov., level2=Jēkabpils, level3=NULL
  → City IS level2 (Jēkabpils)

Pattern 3 (208K): Novads + pagasts + village — all 3 levels
  level1=Olaines nov., level2=Olaines pag., level3=Jāņupe
  → City IS level3 (Jāņupe)

Slovakia — 2 patterns

Pattern 1 (940K): District — level3 NULL
  level1=Bratislavský, level2=Bratislava-Ružinov, level3=NULL
  → City IS level2 (Bratislava-Ružinov)

Pattern 2 (757K): Municipality — all 3 levels
  level1=Prešovský, level2=Spišská Nová Ves, level3=Spišská Nová Ves
  → City IS level3 (Spišská Nová Ves)

Estonia — 2 patterns

Pattern 1 (152K): Linn (town) — level3 NULL
  level1=Ida-Viru maakond, level2=Narva linn, level3=NULL
  → City IS level2 (Narva linn)

Pattern 2 (2.1M): Village/district — all 3 levels
  level1=Harju maakond, level2=Tallinna linn, level3=Kesklinn
  → City IS level3 (Kesklinn)
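
For consumers who have already loaded rows into Python, the patterns above can be detected per row. The sketch below mirrors the CASE expression in the "Query to Reproduce" section, generalized to any array length. It assumes each row exposes address_levels as a list of dicts with a `value` key (roughly what a Parquet reader would return for a list of structs); the function name and the sample `rows` list are hypothetical, with values taken from the examples above.

```python
from collections import Counter
from typing import Optional


def finest_populated_level(address_levels: Optional[list]) -> str:
    """Return 'levelN' for the finest level with a non-NULL value, else 'none'."""
    levels = address_levels or []
    # Walk from the finest (last) entry to the coarsest (first), 1-based.
    for idx in range(len(levels), 0, -1):
        entry = levels[idx - 1]
        if entry is not None and entry.get("value") is not None:
            return f"level{idx}"
    return "none"


# Sample rows built from the patterns listed above (assumed in-memory shape).
rows = [
    ("LV", [{"value": "Rīga"}, {"value": None}, {"value": None}]),
    ("LV", [{"value": "Jēkabpils nov."}, {"value": "Jēkabpils"}, {"value": None}]),
    ("LV", [{"value": "Olaines nov."}, {"value": "Olaines pag."}, {"value": "Jāņupe"}]),
    ("SK", [{"value": "Bratislavský"}, {"value": "Bratislava-Ružinov"}, {"value": None}]),
    ("EE", [{"value": "Ida-Viru maakond"}, {"value": "Narva linn"}, {"value": None}]),
]

# Count rows per (country, finest populated level), like the GROUP BY in the query.
patterns = Counter((cc, finest_populated_level(levels)) for cc, levels in rows)

for key, cnt in sorted(patterns.items()):
    print(key, cnt)
```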

Also: US (related to #367)

The 2-level US data still has 37.6M addresses (30%) with address_levels[2] = NULL. Of those, 85% (32.1M) have postal_city as a fallback, but 5.5M US addresses have no city information at all — no level2 AND no postal_city. Top states affected: TX (1.7M), MS (852K), CA (575K), FL (455K).

Query to Reproduce

-- Shows all country × depth combinations
SELECT country,
  len(address_levels) as levels_count,
  CASE
    WHEN address_levels[3].value IS NOT NULL THEN 'level3'
    WHEN address_levels[2].value IS NOT NULL THEN 'level2'
    WHEN address_levels[1].value IS NOT NULL THEN 'level1'
    ELSE 'none'
  END AS finest_populated_level,
  count(*) as cnt
FROM read_parquet(
  's3://overturemaps-us-west-2/release/2026-03-18.0/theme=addresses/type=address/*',
  hive_partitioning=0
)
GROUP BY country, levels_count, finest_populated_level
ORDER BY country, levels_count, finest_populated_level;

Suggestion

It would help consumers if the documentation clarified that:

  1. address_levels depth is variable within a country — the array length doesn't guarantee all values are populated
  2. The recommended city extraction pattern is a COALESCE cascade (finest → coarsest):
    COALESCE(address_levels[3].value, address_levels[2].value, address_levels[1].value)
  3. For US addresses without level2, postal_city is the intended fallback (it covers 85% of the addresses that are missing level2)

This would prevent other consumers from hitting the same issue we did when building a geocoder on top of this data.
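
As a rough illustration of points 2 and 3 combined, here is a Python sketch of the extraction logic applied to one row at a time. It is an assumption-laden sketch, not an official Overture recipe: the row shape (a dict with `country`, `address_levels` as a list of dicts with a `value` key, and `postal_city`), the function name `extract_city`, and the sample US values are all hypothetical. It also assumes that US level1 is never a city, which is why the US branch skips level1 and goes straight to postal_city.

```python
from typing import Optional


def extract_city(row: dict) -> Optional[str]:
    """Best-effort city for one address row, or None if nothing is available."""
    levels = row.get("address_levels") or []
    values = [(entry or {}).get("value") for entry in levels]

    if row.get("country") == "US":
        # Assumption: US level1 is not a city, so do not cascade down to it.
        level2 = values[1] if len(values) > 1 else None
        return level2 if level2 is not None else row.get("postal_city")

    # Equivalent of COALESCE(level3, level2, level1): finest non-NULL value wins.
    return next((v for v in reversed(values) if v is not None), None)


# Latvia pattern 2 from the issue: level3 is NULL, so the city comes from level2.
lv_row = {
    "country": "LV",
    "address_levels": [{"value": "Jēkabpils nov."}, {"value": "Jēkabpils"}, {"value": None}],
}

# Hypothetical US row with no level2 but a postal_city.
us_row = {
    "country": "US",
    "address_levels": [{"value": "TX"}, {"value": None}],
    "postal_city": "Austin",
}

print(extract_city(lv_row))
print(extract_city(us_row))
```

The country check is the main design choice here: a plain finest-to-coarsest cascade would return level1 for a US address without level2, which under the stated assumption is not a city.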

Environment

  • Release: 2026-03-18.0
  • Queried via DuckDB 1.5 + MotherDuck
  • 39 countries, 469M addresses total
