diff --git a/6-lsib-demo.md b/6-lsib-demo.md
index 8685d47..b2e7eff 100644
--- a/6-lsib-demo.md
+++ b/6-lsib-demo.md
@@ -1,8 +1,8 @@
 # 6. LSIB <-> Overture matching demo
 
-| [<< 5. Base Theme](5-base-theme.md) | [Home](README.md) | >> |
+| [<< 5. Base Theme](5-base-theme.md) | [Home](README.md) | [7. Matching polygon features to Overture >>](7-buildings-matching.md) |
 
-This page is a companion to companion to the LSIB <-> Overture matching demo in this notebook: `3-lsib_overture.ipynb`. The notebook is the runnable
+This page is a companion to the LSIB <-> Overture matching demo in this notebook: `lsib-matching.ipynb`. The notebook is the runnable
 artifact; this file holds the conceptual background — why we made the
 choices we did, what each decision point means, and how to use the outputs
 beyond the notebook itself.
@@ -372,3 +372,7 @@ the headline:
    something the two datasets are (correctly) saying different things
    about. Preserve both sides' metadata in the output so downstream users
    can read the findings in context.
+
+---
+
+| [<< 5. Base Theme](5-base-theme.md) | [Home](README.md) | [7. Matching polygon features to Overture >>](7-buildings-matching.md) |
diff --git a/7-buildings-matching.md b/7-buildings-matching.md
index 206d808..9d5c766 100644
--- a/7-buildings-matching.md
+++ b/7-buildings-matching.md
@@ -1,6 +1,8 @@
-# Matching MGCP polygon features to Overture
+# 7. Matching polygon features to Overture
 
-This lesson is the prose companion to notebook `4-buildings-matching.ipynb`.
+| [<< 6. LSIB matching demo](6-lsib-demo.md) | [Home](README.md) | [8. Matching concepts and pipeline context >>](8-matching-concepts.md) |
+
+This lesson is the prose companion to notebook `buildings-matching.ipynb`.
 The notebook contains the runnable demo; this page explains what the demo
 is doing, why the methodology is designed the way it is, and how to read
 the results it produces.
@@ -30,7 +32,7 @@ doesn't change.
 
 ## Running the demo
 
-See `notebooks/4-buildings-matching.ipynb` and the project README for
+See `notebooks/buildings-matching.ipynb` and the project README for
 setup instructions. The notebook expects:
 
 - The MGCP W079N26 cell unpacked into `data/mgcp/W079N26/`. Download
@@ -282,109 +284,10 @@ the dataset as a whole; it's the feature code. The notebook
 cross-tabulates the five categories per feature code, and the patterns
 it surfaces are what the next two sections turn into adoption decisions.
 
-## The two-rate diagnostic
-
-The cross-tab is informative but dense. The two-rate diagnostic
-compresses it into two numbers per feature code:
-
-- **Match rate** — of all polygons in this feature code, what fraction
-  found any Overture match? (The inverse of the unmatched rate.)
-- **Clean rate** — of the polygons that matched, what fraction matched
-  cleanly 1:1 on every pass?
-
-The two move independently, and the combination is more informative
-than either alone:
-
-| Pattern | Example (this cell) | What it means |
-|---|---|---|
-| High match, high clean | AL015 Building (85% / 99%) | Most polygons match; almost all are clean. Direct GERS ID works. |
-| Low match, high clean | BH080 (32% / 100%) | When it matches, it matches well. Coverage gap, not schema mismatch. |
-| High match, low clean | BA030 Island (66% / 3%) | Coverage is fine, but matches fragment. Needs a link table. |
-| Zero match | BA040 Tidal Water, BH140 River | No polygon counterpart at all. Defer to a different geometry or theme. |
-
-Codes that fall in between on both axes (EC030 Trees at 45% / 41%,
-DA010 Soil Surface Region at 27% / 48%) need case-by-case judgment.
-Clean rate is undefined when match rate is zero; the matrix in the
-next section handles this by checking match rate first.
-
-## GERS adoption decision matrix
-
-The two rates per feature code, combined with a minimum-sample
-threshold, produce a categorical assignment. The notebook computes this
-with explicit thresholds (`MATCH_RATE_HIGH = 80`, `CLEAN_RATE_HIGH = 80`,
-`MIN_SAMPLE = 5`) that classify each fcode into one of five buckets:
-
-- **Direct GERS ID attachment.** High match rate, high clean rate. The
-  MGCP feature carries a GERS ID directly, and downstream joins work
-  without further mechanism.
-- **Link table.** High match rate, low clean rate. A `(mgcp_uid,
-  gers_id)` crosswalk handles the many-to-one or one-to-many cases.
-- **Deferred.** Zero match rate. Integration needs a different geometry
-  type or theme.
-- **Review.** Partial match rate, mixed clean rate. Human judgment
-  needed.
-- **Insufficient sample.** Below the minimum sample threshold.
-
-The notebook shows the full matrix. One finding from that run is worth
-surfacing in prose.
-
-### The link-table bucket is nearly empty
-
-The skeleton anticipated this bucket would be substantial: feature codes
-that match often but fragment when they do. The matrix run disagrees.
-The codes that fragment in this cell (BA030 Island, EC030 Trees) have
-moderate match rates, not high ones, so they land in `review` rather
-than `link table`. The codes with high match rates don't fragment.
-
-This is a finding about the data, not a flaw in the methodology. At the
-1:100K capture scale of the Bahamas cell, the cardinality problem
-appears to be largely binary: feature codes either match cleanly or
-don't match at all, with the "matches frequently but messily" middle
-ground appearing more rarely than expected. Whether this generalizes to
-other MGCP cells, to TDS v7, or to denser capture scales is exactly the
-kind of question the next iteration should investigate.
-
-The 80/80 thresholds are conservative. Lowering `MATCH_RATE_HIGH` to 50
-would pull moderate-match codes into `link table`, making the matrix
-look more like the skeleton expected, at the cost of recommending
-direct or link-table adoption for codes that match less than two-thirds
-of the time. The notebook makes the thresholds visible at the top of
-the classifier cell.
-
-## Limitations and next steps
-
-This demo is what it is: a methodology demonstration against one MGCP
-cell. What it shows is encouraging but not generalizable.
-
-- **Polygon-only matching.** The bulk of MGCP's feature coverage at
-  1:100K is captured as points. Point-in-polygon matching is the
-  natural follow-on.
-- **The Bahamas cell is sparse.** Mostly ocean, 1:100K, single
-  contributor, 2015. The methodology behaves differently at higher
-  densities and with more recent data.
-- **The two-tier IoU rule wasn't specifically validated for cross-scale
-  matching.** The ~13% low-tier rate suggests it's catching real
-  matches, but the rule's behavior under very different capture scales
-  is worth characterizing more rigorously.
-- **Centroid containment direction is asymmetric** in a way that may be
-  backwards for `base/land_cover` and similar passes.
-- **Schema version is TRD 3.0, not operational current.** Running the
-  methodology against TDS v7 or current MGCP TRD 4 is the natural next
-  test.
-
-The matrix doesn't tell you what to do; it tells you which decisions to
-make and on what evidence. For the data shown here, the decisions are:
-direct GERS adoption for buildings, defer water-line features to a
-different theme or geometry type, and review the middle band case by
-case.
-
-## References
-
-- Overture's [GERS documentation](https://docs.overturemaps.org/gers/).
-- The [Overture schema reference](https://docs.overturemaps.org/schema/reference/buildings/building/)
-  for `buildings/building` and the `base` theme types.
-- NGA's [Geospatial Analysis Integrity Tool (GAIT)](https://github.com/ngageoint/Geospatial-Analysis-Integrity-Tool)
-  for the canonical MGCP TRD 3.0 attribute and feature-code
-  definitions.
-- The TDS DCS Extraction Guide v7.1 (NGA) for feature-code names
-  shared between MGCP and TDS.
+---
+
+For a deeper look at the concepts behind cardinality, iterative matching,
+and how this methodology relates to Overture's production pipelines, see
+[Lesson 8: Matching concepts and pipeline context](8-matching-concepts.md).
+
+| [<< 6. LSIB matching demo](6-lsib-demo.md) | [Home](README.md) | [8. Matching concepts and pipeline context >>](8-matching-concepts.md) |
diff --git a/8-matching-concepts.md b/8-matching-concepts.md
new file mode 100644
index 0000000..7708c3d
--- /dev/null
+++ b/8-matching-concepts.md
@@ -0,0 +1,319 @@
+# 8. Matching concepts and pipeline context
+
+| [<< 7. Matching polygon features to Overture](7-buildings-matching.md) | [Home](README.md) |
+
+This page builds on the two matching demos in lessons 6 and 7. It
+covers the conceptual foundation for cardinality-based matching
+decisions, how to iterate on a matching pipeline, and how the demo
+methodology relates to the production pipelines that power Overture's
+official releases.
+
+## Why cardinality is the right diagnostic
+
+The word "cardinality" appears throughout data engineering and GIS work,
+usually in the context of database relationships. Esri's ArcGIS
+documentation defines it plainly: cardinality describes how records in
+two different tables relate to one another -- one-to-one, one-to-many,
+many-to-one, or many-to-many -- and treats it as foundational for any
+relate or relationship class.[1] Their usage, though, is about
+*modeling* relationships you already understand. You know a fire station
+has many personnel, so you declare that cardinality in the geodatabase
+schema.
+
+What makes cardinality useful in a matching context is different: you
+don't know the relationship in advance. You discover it from the
+geometry. And what you discover determines what you hand the end user.
+
+This idea has a formal counterpart in the record linkage literature. The
+standard framing, going back to Fellegi and Sunter (1969) and codified
+in Peter Christen's *Data Matching* (2012), treats linking as a
+classification problem: pairs of records are assigned to a match set or
+a non-match set based on comparison scores.[2] The Python Record Linkage
+Toolkit, a widely used open-source implementation of these methods,
+makes the distinction explicit in its API: `OneToOneLinking` and
+`OneToManyLinking` are separate post-classification steps, because the
+*shape* of the match -- not just the score -- determines how to handle
+the output.[3]
+
+Research in large-scale record linkage has established why this matters
+operationally. Moretti, Valentino, and Tuoto (2019) describe the
+"selection of unique links" as a distinct phase of a record linkage
+pipeline in official statistics -- a constrained optimization problem
+that specifically enforces 1:1 cardinality when that's what the
+application requires.[4] The key insight is that a matching algorithm
+can produce high-confidence scores while still generating a match
+structure that violates the cardinality the application actually needs.
+Zhang, Rubinstein, and Gemmell (2014) formalize this further, showing
+that enforcing 1:1 matching through bipartite graph optimization
+significantly improves precision and recall compared to simply
+thresholding similarity scores.[5]
+
+The framework in this demo extends that logic into a decision tool. We
+don't enforce any cardinality upfront -- we let the geometry produce
+whatever structure it produces, then observe the result. Observed
+cardinality at the feature-code level drives the adoption
+recommendation:
+
+- **1:1 (clean)** -- the match structure supports direct ID adoption.
+  The MGCP feature can carry a GERS ID, and downstream joins work
+  without further mechanism.
+- **1:many (aggregated)** -- one MGCP polygon matched several Overture
+  features. Direct adoption breaks; a link table is the artifact.
+- **many:1 (fragmented)** -- several MGCP polygons point at the same
+  Overture feature. Same result: the link table is the product.
+- **unmatched** -- no Overture counterpart. Defer.
+
+This is what the GERS adoption matrix is doing: it's cardinality,
+summarized at the feature-code level, translated into a recommendation
+about what kind of crosswalk artifact the downstream user actually
+needs. The matching algorithm is the means; cardinality observation is
+the decision mechanism.
+
+This framing isn't standard in the geospatial literature, which tends
+to treat matching as an end in itself and evaluate it on recall and
+precision. The contribution here is using observed match cardinality as
+the primary axis for an ID adoption decision -- treating the *shape* of
+the match as the variable that determines what you build, not just
+whether the match succeeded.
+
+## The two-rate diagnostic
+
+The cross-tab is informative but dense. The two-rate diagnostic
+compresses it into two numbers per feature code:
+
+- **Match rate** — of all polygons in this feature code, what fraction
+  found any Overture match? (The complement of the unmatched rate.)
+- **Clean rate** — of the polygons that matched, what fraction matched
+  cleanly 1:1 on every pass?
+
+The two move independently, and the combination is more informative
+than either alone:
+
+| Pattern | Example (this cell) | What it means |
+|---|---|---|
+| High match, high clean | AL015 Building (85% / 99%) | Most polygons match; almost all are clean. Direct GERS ID works. |
+| Low match, high clean | BH080 (32% / 100%) | When it matches, it matches well. Coverage gap, not schema mismatch. |
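+| High match, low clean | BA030 Island (66% / 3%) | Coverage is reasonable, but matches fragment. This is the link-table pattern, although 66% sits below the 80% `MATCH_RATE_HIGH` cutoff, so the matrix classifies it as review. |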
+| Zero match | BA040 Tidal Water, BH140 River | No polygon counterpart at all. Defer to a different geometry or theme. |
+
+Codes that fall in between on both axes (EC030 Trees at 45% / 41%,
+DA010 Soil Surface Region at 27% / 48%) need case-by-case judgment.
+Clean rate is undefined when match rate is zero; the matrix in the
+next section handles this by checking match rate first.
+
+## GERS adoption decision matrix
+
+The two rates per feature code, combined with a minimum-sample
+threshold, produce a categorical assignment. The notebook computes this
+with explicit thresholds (`MATCH_RATE_HIGH = 80`, `CLEAN_RATE_HIGH = 80`,
+`MIN_SAMPLE = 5`) that classify each fcode into one of five buckets:
+
+- **Direct GERS ID attachment.** High match rate, high clean rate. The
+  MGCP feature carries a GERS ID directly, and downstream joins work
+  without further mechanism.
+- **Link table.** High match rate, low clean rate. A `(mgcp_uid,
+  gers_id)` crosswalk handles the many-to-one or one-to-many cases.
+- **Deferred.** Zero match rate. Integration needs a different geometry
+  type or theme.
+- **Review.** Partial match rate, mixed clean rate. Human judgment
+  needed.
+- **Insufficient sample.** Below the minimum sample threshold.
+
+The notebook shows the full matrix. One finding from that run is worth
+surfacing in prose.
+
+### The link-table bucket is nearly empty
+
+The skeleton anticipated this bucket would be substantial: feature codes
+that match often but fragment when they do. The matrix run disagrees.
+The codes that fragment in this cell (BA030 Island, EC030 Trees) have
+moderate match rates, not high ones, so they land in `review` rather
+than `link table`. The codes with high match rates don't fragment.
+
+This is a finding about the data, not a flaw in the methodology. At the
+1:100K capture scale of the Bahamas cell, the cardinality problem
+appears to be largely binary: feature codes either match cleanly or
+don't match at all, with the "matches frequently but messily" middle
+ground appearing more rarely than expected. Whether this generalizes to
+other MGCP cells, to TDS v7, or to denser capture scales is exactly the
+kind of question the next iteration should investigate.
+
+The 80/80 thresholds are conservative. Lowering `MATCH_RATE_HIGH` to 50
+would pull moderate-match codes into `link table`, making the matrix
+look more like the skeleton expected, at the cost of recommending
+direct or link-table adoption for codes that match less than two-thirds
+of the time. The notebook makes the thresholds visible at the top of
+the classifier cell.
+
+## How these demos relate to production matching pipelines
+
+The two demos in this workshop -- LSIB boundary matching and MGCP
+building matching -- are deliberate simplifications. They're designed to
+be readable and runnable on a laptop, and to expose the decision points
+clearly. Understanding what they leave out, and why, is part of
+understanding what production conflation actually involves.
+
+### Buildings: where the demo and production are algorithmically similar
+
+The Overture buildings production pipeline uses the same core matching
+signal as this demo: Intersection over Union (IoU), with a threshold of
+0.5. That's not a coincidence -- it's the right signal for buildings
+because geometry is the only reliable cross-source signal available.
+Building names are sparse or absent across most sources, and the largest
+source volumes come from ML-derived datasets (Microsoft, Google) that
+have no stable or meaningful name attributes at all.
+
+The production pipeline adds significant operational scaffolding the
+demo doesn't have:
+
+- **Pre-match filtering by violation type.** Buildings are checked for
+  size anomalies (too small or too large for their source type),
+  invalid geometries, and duplicate records before matching runs.
+  The demo skips this entirely.
+- **Source priority ordering.** When multiple sources produce a
+  building for the same location, production resolves the conflict
+  by strict priority: OSM first, then licensed government sources,
+  then ML-derived sources. The demo works against a single source.
+- **Cross-theme quality checks.** After matching and merging, the
+  production pipeline filters buildings that inappropriately
+  intersect roads, water bodies, or have digitization artifacts.
+  These require having the transportation and base themes available
+  alongside the buildings data.
+- **Scale.** The demo processes ~1,300 polygons against Overture's
+  S3 parquet files for a 1-degree cell. Production runs against
+  the full global corpus, partitioned across a Spark cluster.
+
+The matching algorithm itself -- IoU, threshold, best-match selection,
+cardinality filtering -- is structurally the same. The difference is
+everything around it.
+
+### Divisions: where production is architecturally different
+
+The LSIB boundary demo and the Overture divisions production pipeline
+share a goal (linking boundary features across datasets) but use
+substantially different approaches. The divisions pipeline adds three
+things the demo has no equivalent of.
+
+**H3-based blocking.** The production pipeline partitions candidate
+pairs spatially using H3 cells before any scoring runs. The H3
+resolution varies by administrative subtype: countries are blocked at
+a coarse resolution covering hundreds of thousands of square kilometers,
+while neighborhoods and microhoods are blocked at a resolution covering
+less than a square kilometer. This means a candidate pair is only scored
+if both features fall in overlapping H3 cells at the resolution
+appropriate for their subtype. In the LSIB demo, the equivalent of
+blocking is the `pair_key` join -- you only compare features that share
+a country-pair code. H3 blocking is the spatial generalization of that
+idea, applicable to any administrative feature regardless of whether it
+carries an explicit country-code attribute.
+
+**Multi-signal composite scoring.** The LSIB demo scores pairs on two
+geometric signals: buffer overlap and length ratio. The divisions
+pipeline scores on a composite of name similarity and geographic
+similarity, combined with configurable weights. The composite is
+evaluated under multiple weighting scenarios and the minimum score is
+taken -- a conservative strategy that penalizes pairs where one signal
+is strong but the other is weak. A near-perfect geometric match with a
+very different name scores poorly, and vice versa. Geographic overlap is
+still a signal, but it's one input into the composite rather than the
+whole answer.
+
+**Multilingual name embeddings.** Name similarity in production is
+computed using a cross-lingual sentence transformer model (XLM-RoBERTa)
+applied to all name variants for a feature, with average pooling across
+variants. This means "München" and "Munich" produce similar embeddings
+without requiring an explicit translation table. The LSIB demo has no
+name signal at all -- LSIB boundaries are identified by country-pair
+code, not by name.
+
+Divisions matching needs names and buildings matching doesn't, and the
+reason is the data. Every division has a name, and that name is the
+primary human-recognizable signal for whether two features represent
+the same entity. Most buildings have no name, and where one exists it
+is often absent from ML-derived sources. The matching signal has to
+be something reliably present in the data.
+
+### History matching: what both pipelines do that neither demo does
+
+Both the divisions and buildings production pipelines implement a form
+of history-first matching: before any geometric or attribute scoring
+runs, the pipeline checks whether a candidate feature was already
+assigned a GERS ID in a prior release. If it was, it gets that ID back
+directly, without re-running the full scoring logic.
+
+This is the production equivalent of iterative matching. In the demo,
+iteration is a manual loop: run the matcher, inspect the unresolved
+cases, apply a different strategy, and append the results to the
+crosswalk. In production, the first iteration already happened in a
+previous release, and the history pass captures its output
+automatically. New or changed features fall through to the full
+scoring pipeline; stable features are matched by identity.
+
+The crosswalk file this demo produces is, structurally, a hand-built
+version of that history register -- a record of which external IDs have
+already been resolved to GERS IDs, which can be checked before running
+any geometry comparison on subsequent passes.
+
+## Limitations and next steps
+
+The buildings demo in lesson 7 is a methodology demonstration against
+one MGCP cell. Its results are encouraging but do not generalize yet.
+
+- **Polygon-only matching.** The bulk of MGCP's feature coverage at
+  1:100K is captured as points. Point-in-polygon matching is the
+  natural follow-on.
+- **The Bahamas cell is sparse.** Mostly ocean, 1:100K, single
+  contributor, 2015. The methodology behaves differently at higher
+  densities and with more recent data.
+- **The two-tier IoU rule wasn't specifically validated for cross-scale
+  matching.** The ~13% low-tier rate suggests it's catching real
+  matches, but the rule's behavior under very different capture scales
+  is worth characterizing more rigorously.
+- **Centroid containment direction is asymmetric** in a way that may be
+  backwards for `base/land_cover` and similar passes.
+- **Schema version is TRD 3.0, not operational current.** Running the
+  methodology against TDS v7 or current MGCP TRD 4 is the natural next
+  test.
+
+The matrix doesn't tell you what to do; it tells you which decisions to
+make and on what evidence. For the data shown here, the decisions are:
+direct GERS adoption for buildings, defer water-line features to a
+different theme or geometry type, and review the middle band case by
+case.
+
+## References
+
+- Overture's [GERS documentation](https://docs.overturemaps.org/gers/).
+- The [Overture schema reference](https://docs.overturemaps.org/schema/reference/buildings/building/)
+  for `buildings/building` and the `base` theme types.
+- NGA's [Geospatial Analysis Integrity Tool (GAIT)](https://github.com/ngageoint/Geospatial-Analysis-Integrity-Tool)
+  for the canonical MGCP TRD 3.0 attribute and feature-code
+  definitions.
+- The TDS DCS Extraction Guide v7.1 (NGA) for feature-code names
+  shared between MGCP and TDS.
+
+### Notes on cardinality sources
+
+[1] Esri, "Relates and Relationship Classes Explained," ArcGIS Training
+Blog, February 2022.
+https://community.esri.com/t5/esri-training-blog/relates-and-relationship-classes-explained/ba-p/900757
+
+[2] Christen, Peter. *Data Matching: Concepts and Techniques for Record
+Linkage, Entity Resolution, and Duplicate Detection.* Springer, 2012.
+Referenced via the Python Record Linkage Toolkit documentation.
+
+[3] Python Record Linkage Toolkit, Classification reference (v0.15).
+`OneToOneLinking` and `OneToManyLinking` classes.
+https://recordlinkage.readthedocs.io/en/latest/ref-classifiers.html
+
+[4] Moretti, Diego, Luca Valentino, and Tiziana Tuoto. "Optimization
+Routines for Enforcing One-to-One Matches in Record Linkage Problems."
+*The R Journal* 11/01, June 2019.
+https://journal.r-project.org/archive/2019/RJ-2019-008/RJ-2019-008.pdf
+
+[5] Zhang, Duo, Benjamin I. P. Rubinstein, and Jim Gemmell. "Principled
+Graph Matching Algorithms for Integrating Multiple Data Sources." arXiv
+preprint, 2014.
+https://arxiv.org/abs/1402.0282
+
+| [<< 7. Matching polygon features to Overture](7-buildings-matching.md) | [Home](README.md) |
diff --git a/README.md b/README.md
index 696b96a..1bbbfb6 100644
--- a/README.md
+++ b/README.md
@@ -15,6 +15,8 @@
 4. [Global Entity Reference System (GERS)](4-gers.md)
 5. [Base Theme](5-base-theme.md)
 6. [LSIB ↔ Overture matching demo](6-lsib-demo.md)
+7. [Matching polygon features to Overture](7-buildings-matching.md)
+8. [Matching concepts and pipeline context](8-matching-concepts.md)
 
 ---
 
@@ -71,4 +73,4 @@ ATTACH 'https://labs.overturemaps.org/data/latest.ddb' as overture;
 
 -- Now you can just reference `overture.place` for type=place features
 SELECT count(1) from overture.place;
-```
\ No newline at end of file
+```