Skip to content

refactor(waterdata): decompose the 2033-LOC utils.py god-module into cohesive modules#318

Draft
thodson-usgs wants to merge 1 commit into
DOI-USGS:mainfrom
thodson-usgs:refactor/split-waterdata-utils
Draft

refactor(waterdata): decompose the 2033-LOC utils.py god-module into cohesive modules#318
thodson-usgs wants to merge 1 commit into
DOI-USGS:mainfrom
thodson-usgs:refactor/split-waterdata-utils

Conversation

@thodson-usgs
Copy link
Copy Markdown
Collaborator

Why

dataretrieval/waterdata/utils.py had grown to 2033 LOC spanning ~6 unrelated domains — request building, response parsing, result finalization, pagination/async, stats post-processing, and validation — plus constants and the public engines. It was the package's one genuine god-module: changing how requests are built means navigating the same file as type coercion, datetime parsing, and the async driver.

(An architecture review concluded the package's OO is otherwise appropriate — the chunker is well-factored, the modern getters are cleanly delegated — so this is a modularization, not an OO-pattern refactor.)

What

Split it into six cohesive private modules under dataretrieval/waterdata/, moving every definition verbatim (no signature/logic changes):

module holds
_constants.py URLs, _OUTPUT_ID_BY_SERVICE, regexes, param sets (dependency-free)
_http.py headers, _error_body, _raise_for_non_200, retry-after
_validate.py arg normalization/validation (_get_args, _check_*)
_requests.py request building (_construct_api_requests, CQL2, dates)
_responses.py geometry-agnostic parsing / finalization / stats shaping
_engine.py pagination/async driver (_paginate, _run_sync, …)

utils.py (2033 → 651 LOC) becomes a thin façade that re-exports the internal API (explicit __all__, 56 names), so every existing from dataretrieval.waterdata.utils import … and mock.patch("dataretrieval.waterdata.utils.<name>") keeps working — no import sites or tests were touched.

Seven functions remain physically defined in the façade (get_ogc_data, _fetch_once, get_stats_data, _get_resp_data, _ogc_parse_response, _walk_pages, _handle_stats_nesting) because the test suite monkeypatches them (or gpd) by their module-global name: a function's global lookups resolve in its defining module, so they must stay in utils.py for the patches to take effect. The geopandas import probe stays with them, and the pagination logger keeps the name dataretrieval.waterdata.utils (a caplog test pins it).

Behavior-preserving

  • 56 top-level definitions moved AST-identically — none lost, none duplicated (verified mechanically).
  • 469 tests pass, 2 skipped; ruff clean; all 6 modules import without cycles; chunking.py untouched.

Note

Overlaps with the in-flight error-taxonomy (#313) and namespace (#315) PRs on waterdata/ imports — sequence on merge (rebasing each surfaces any conflict, as we've done for the others).

🤖 Generated with Claude Code

…ils/ package

## Why

`dataretrieval/waterdata/utils.py` had grown to 2033 LOC spanning ~6 unrelated
domains -- request building, response parsing, result finalization,
pagination/async, stats post-processing, and validation -- plus constants and
the public engines. It was the package's one genuine god-module. (An
architecture review found the package's OO is otherwise appropriate, so this is
a modularization, not an OO-pattern refactor.)

## What

Convert `utils.py` into a `utils/` package: the public surface stays in
`utils/__init__.py` (a thin facade) and the implementation is split across six
cohesive submodules, moving every definition verbatim (no signature/logic
changes):

| submodule | holds |
|---|---|
| `utils/constants.py` | URLs, `_OUTPUT_ID_BY_SERVICE`, regexes, param sets (dependency-free) |
| `utils/http.py` | headers, `_error_body`, `_raise_for_non_200`, retry-after |
| `utils/validate.py` | arg normalization/validation (`_get_args`, `_check_*`) |
| `utils/requests.py` | request building (`_construct_api_requests`, CQL2, dates) |
| `utils/responses.py` | geometry-agnostic parsing / finalization / stats shaping |
| `utils/engine.py` | pagination/async driver (`_paginate`, `_run_sync`, ...) |

`utils/__init__.py` re-exports the internal API (explicit `__all__`, 56 names),
so every existing `from dataretrieval.waterdata.utils import ...` and
`mock.patch("dataretrieval.waterdata.utils.<name>")` keeps working -- no import
sites or tests were touched. `dataretrieval.waterdata.utils` resolves to the
package's `__init__`, so the import path is unchanged from when it was a module.

Seven functions remain physically defined in `utils/__init__.py`
(`get_ogc_data`, `_fetch_once`, `get_stats_data`, `_get_resp_data`,
`_ogc_parse_response`, `_walk_pages`, `_handle_stats_nesting`) because the test
suite monkeypatches them (or `gpd`) by their `dataretrieval.waterdata.utils.*`
name, and a function's global lookups resolve in its defining module. The
geopandas probe stays with them, and the pagination logger keeps the name
`dataretrieval.waterdata.utils` (a caplog test pins it). These could later move
to the `engine`/`responses` submodules -- which do not import the package, so
there is no cycle -- but that requires re-targeting the test patches; left as a
follow-up.

## Behavior-preserving

- 56 top-level definitions moved verbatim -- none lost, none duplicated.
- 469 tests pass, 2 skipped; ruff clean; submodules import without cycles
  (`constants` <- `http`/`validate` <- `requests`/`responses` <- `engine` <-
  `__init__`); `chunking.py` untouched.

## Note

Overlaps with the error-taxonomy (DOI-USGS#313) and namespace (DOI-USGS#315) PRs on `waterdata/`
imports -- sequence on merge.

🤖 Generated with [Claude Code](https://claude.com/claude-code)
@thodson-usgs thodson-usgs force-pushed the refactor/split-waterdata-utils branch from df8943e to cca750e Compare June 2, 2026 00:50
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant