Skip to content

NO TICKET fix(importers): prevent duplicate well-name collisions in well inventory and water-level imports#633

Merged
jirhiker merged 2 commits into
stagingfrom
kas-reinforce-duplicate-logic-check
Apr 13, 2026
Merged

NO TICKET fix(importers): prevent duplicate well-name collisions in well inventory and water-level imports#633
jirhiker merged 2 commits into
stagingfrom
kas-reinforce-duplicate-logic-check

Conversation

@ksmuczynski

@ksmuczynski ksmuczynski commented Apr 1, 2026

Copy link
Copy Markdown
Contributor

Summary

This PR hardens the CSV importers around duplicate water well names.

It updates the well inventory importer so it:

  • preserves the current re-import/idempotency behavior
  • reuses an existing well when the row clearly matches a prior import
  • fails the row if a water well with the same Thing.name already exists elsewhere in the database
  • avoids silently creating duplicate Thing records

It also keeps the water-level importer defensive so ambiguous well lookups no longer crash the CLI and instead return a row-level validation error.

Why

This PR addresses the following problem / context:

I found that duplicate Thing.name records for water well could exist in the database, and at least one duplicate appears to have been introduced by the well inventory import path.

That caused two issues:

  • the well inventory importer created a second Thing with the same well name
  • the water-level importer crashed with MultipleResultsFound when trying to resolve a well by name

How

Implementation summary - the following was changed / added / removed:

  • Added a database-wide duplicate-name check in the well inventory import flow after the existing re-import detection logic
  • Fail the row with a clear well_name_point_id validation error when the well already exists
  • Kept the water-level duplicate-well guard so ambiguous lookups fail gracefully instead of aborting the command
  • Added regression coverage for both behaviors

Why This Approach

The well inventory importer still needs to support safe reruns of the same import, so the existing _find_existing_imported_well() logic is evaluated first.

Only if the row does not match a prior import do we check whether a well with the same name already exists in the database and block creation.

That gives us:

  • idempotent re-imports
  • no silent duplicate Thing creation
  • clearer operator-facing errors

Notes

Any special considerations, workarounds, or follow-up work to note?

  • This change does not clean up existing duplicate Thing.name rows already in the database. Any pre-existing duplicates should be identified and resolved separately.
  • This PR intentionally treats well inventory as a create-oriented importer, not a general update/upsert workflow for existing wells. If updating existing wells via CSV is needed, that should be scoped as a separate feature with explicit merge rules.

…orts

Preserve well inventory re-import idempotency while blocking creation of a new water well when the same `Thing.name` already exists in the database.

Also keep the water-level importer defensive against ambiguous well lookups so duplicate well names produce row-level validation errors instead of crashing the CLI with `MultipleResultsFound`.

This reduces the risk of creating duplicate `Thing` records and makes import failures clearer for operators.

Copilot AI left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Hardens the well inventory and water-level CSV import flows against duplicate Thing.name collisions for water well records, preventing silent duplicate creation and ensuring ambiguous well lookups fail as row-level validation errors instead of crashing.

Changes:

  • Well inventory import: after existing “re-import/idempotency” detection, block creation when a water well with the same Thing.name already exists elsewhere in the DB.
  • Water-level import: catch MultipleResultsFound during well resolution and surface a row-level validation error instead of aborting.
  • Adds regression tests covering both the “existing DB well name” inventory case and the “duplicate name matches” water-level case.

Reviewed changes

Copilot reviewed 4 out of 4 changed files in this pull request and generated 2 comments.

File Description
services/well_inventory_csv.py Adds DB-wide duplicate-name guard and maps the resulting row error to well_name_point_id.
services/water_level_csv.py Catches MultipleResultsFound from ambiguous well-name lookups and records a row-level validation error.
tests/test_well_inventory.py Adds coverage asserting inventory import fails when the well name already exists in the DB.
tests/test_water_level_csv_service.py Adds coverage ensuring water-level import reports duplicate well-name matches instead of crashing.

Comment on lines +839 to +870

with session_ctx() as session:
session.add(Thing(name=row["well_name_point_id"], thing_type="water well"))
session.commit()

file_path = tmp_path / "well-inventory-existing-db-well.csv"
with file_path.open("w", encoding="utf-8", newline="") as f:
writer = csv.DictWriter(f, fieldnames=list(row.keys()))
writer.writeheader()
writer.writerow(row)

result = well_inventory_csv(file_path)

assert result.exit_code == 1, result.stderr
errors = result.payload.get("validation_errors", [])
assert errors
assert errors[0]["field"] == "well_name_point_id"
assert (
errors[0]["error"]
== "Well already exists in database for well_name_point_id 'TEST-0001'"
)

with session_ctx() as session:
things = (
session.query(Thing)
.filter(
Thing.name == row["well_name_point_id"],
Thing.thing_type == "water well",
)
.all()
)
assert len(things) == 1

Copilot AI Apr 1, 2026

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This test inserts a Thing into the shared session-scoped test database but never deletes it. Because the test suite shares DB state across tests (see tests/conftest.py session fixture), leaving this row behind can make other well-inventory tests flaky by causing unexpected name collisions. Clean up the created Thing (e.g., capture its id and delete it in a finally, or use a fixture that yields and deletes).

Suggested change
with session_ctx() as session:
session.add(Thing(name=row["well_name_point_id"], thing_type="water well"))
session.commit()
file_path = tmp_path / "well-inventory-existing-db-well.csv"
with file_path.open("w", encoding="utf-8", newline="") as f:
writer = csv.DictWriter(f, fieldnames=list(row.keys()))
writer.writeheader()
writer.writerow(row)
result = well_inventory_csv(file_path)
assert result.exit_code == 1, result.stderr
errors = result.payload.get("validation_errors", [])
assert errors
assert errors[0]["field"] == "well_name_point_id"
assert (
errors[0]["error"]
== "Well already exists in database for well_name_point_id 'TEST-0001'"
)
with session_ctx() as session:
things = (
session.query(Thing)
.filter(
Thing.name == row["well_name_point_id"],
Thing.thing_type == "water well",
)
.all()
)
assert len(things) == 1
thing_name = row["well_name_point_id"]
try:
with session_ctx() as session:
session.add(Thing(name=thing_name, thing_type="water well"))
session.commit()
file_path = tmp_path / "well-inventory-existing-db-well.csv"
with file_path.open("w", encoding="utf-8", newline="") as f:
writer = csv.DictWriter(f, fieldnames=list(row.keys()))
writer.writeheader()
writer.writerow(row)
result = well_inventory_csv(file_path)
assert result.exit_code == 1, result.stderr
errors = result.payload.get("validation_errors", [])
assert errors
assert errors[0]["field"] == "well_name_point_id"
assert (
errors[0]["error"]
== "Well already exists in database for well_name_point_id 'TEST-0001'"
)
with session_ctx() as session:
things = (
session.query(Thing)
.filter(
Thing.name == thing_name,
Thing.thing_type == "water well",
)
.all()
)
assert len(things) == 1
finally:
# Clean up the Thing created for this test to avoid leaking state
with session_ctx() as session:
(
session.query(Thing)
.filter(
Thing.name == thing_name,
Thing.thing_type == "water well",
)
.delete(synchronize_session=False)
)
session.commit()

Copilot uses AI. Check for mistakes.
Comment on lines +283 to +288
if error_text.startswith(
"Well already exists in database for well_name_point_id "
):
field = "well_name_point_id"
else:
field = _extract_field_from_value_error(error_text)

Copilot AI Apr 1, 2026

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Mapping the validation field by checking error_text.startswith(...) is brittle and couples control flow to an exact error-message string. Consider raising a structured/typed exception (or returning a structured error) that carries the target field explicitly, so future message wording changes don’t silently regress field attribution.

Suggested change
if error_text.startswith(
"Well already exists in database for well_name_point_id "
):
field = "well_name_point_id"
else:
field = _extract_field_from_value_error(error_text)
field = _extract_field_from_value_error(error_text)

Copilot uses AI. Check for mistakes.

@jirhiker jirhiker left a comment

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

both of changes seem related to querying the db to check for an existing well. this should be a shared function

@ksmuczynski

Copy link
Copy Markdown
Contributor Author

There are two related checks here:

  1. well-inventory-specific re-import detection via _find_existing_imported_well()
  2. the generic database lookup for an existing water well by Thing.name

I think only the second one should be shared.

_find_existing_imported_well() is importer-specific idempotency logic. It is intentionally checking for evidence that the row matches a prior well-inventory import, which is different from a general “does a well with this
name already exist?” query.

Example for _find_existing_imported_well():

  • a prior well inventory import already created well TEST-0001
  • that import also created the expected well inventory field event on the same inventory datetime, or the expected groundwater-level sample name for the same measurement datetime
  • in that case, a rerun of the same inventory row should reuse the existing imported well and succeed

That is different from this case:

  • a Thing named TEST-0001 already exists in the database
  • but it was created by some other workflow, or the expected inventory lineage does not match
  • in that case, the row should not be treated as a safe rerun

The overlap between these changes is the generic Thing.name == well_name_point_id and thing_type == "water well" lookup. I agree that should be extracted into a shared helper and then each importer can apply its own
behavior to the result set.

Example for the generic lookup in water-level import:

  • CSV row references TEST-0001
  • if 0 wells match, report Unknown well_name_point_id
  • if 1 well matches, attach the measurement to that well
  • if 2+ wells match, report that multiple wells were found and fail the row

Example for the generic lookup in well-inventory import:

  • after _find_existing_imported_well() does not find a prior import match
  • if any water well already exists with Thing.name = TEST-0001
  • fail the row with “well already exists in the database” rather than creating a duplicate Thing

So I agree the generic water-well lookup should be shared, but I would keep _find_existing_imported_well() separate because it encodes well-inventory-specific rerun behavior rather than just a name-based existence check.

I can make that refactor in a follow-up commit.

@ksmuczynski

Copy link
Copy Markdown
Contributor Author

@jirhiker Okay, I added a helper find_water_wells_by_name(...) in services/thing_helper.py. It is reused in the well inventory and water-level CSV importers in the following way:

  • water-level importer (services/water_level_csv.py): uses it for 0 / 1 / many resolution
  • well inventory importer (services/well_inventory_csv.py): uses it only for the generic “already exists in DB” check after _find_existing_imported_well(...)

So the shared query is centralized, but the two importer-specific behaviors remain separate.

@ksmuczynski ksmuczynski requested a review from jirhiker April 8, 2026 15:17
Extract the generic water-well lookup by `Thing.name` into a shared helper and reuse it in the well inventory and water-level CSV importers.

Keep importer-specific behavior separate:
- water-level import still distinguishes unknown vs ambiguous matches
- well-inventory import still applies its own re-import detection before enforcing the duplicate-name check
Comment on lines +283 to +284
if error_text.startswith(
"Well already exists in database for well_name_point_id "

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Since this string is used in multiple places I would recommend creating a global constant at the beginning and then inserting it into f-strings wherever it is used. This way you don't have to worry about having the text correspond between its different invocations and can easily change it in the future. Something like

WELL_EXISTS_SUBSTRING = "Well already in database for well_name_point_id"

...
...
...
if error_text.startswith(WELL_EXISTS_SUBSTRING)
...
...
raise ValueError(f"{WELL_EXISTS_SUBSTRING} `{model.well_name_point_id}`")

)


def test_bulk_upload_water_levels_reports_duplicate_well_name_matches():

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This test raises what appears to be an omission in the DB. Shouldn't names be unique? That would prevent this error from even occuring. There's even a TODO above the name column:

# TODO: should name be unique?

@jirhiker jirhiker merged commit 8cd1103 into staging Apr 13, 2026
8 checks passed
@jirhiker jirhiker deleted the kas-reinforce-duplicate-logic-check branch April 13, 2026 15:20
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants