From d152b509f5ac12b78900eb4030fdfc166b264257 Mon Sep 17 00:00:00 2001
From: igerber
Date: Sat, 9 May 2026 10:35:52 -0400
Subject: [PATCH 1/9] HAD Phase 5 wave 1: agent-facing surfaces

Add _handle_had + _handle_had_event_study to practitioner.py, routing both HeterogeneousAdoptionDiD result classes through HAD-specific Baker et al. (2025) step guidance: did_had_pretest_workflow (step 3), ContinuousDiD/CallawaySantAnna routing nudge (step 4), bandwidth_diagnostics + simultaneous bands (step 6), per-horizon WAS event-study disaggregation (step 7), design-auto-detection + last-cohort-only-WAS framing (step 8). Symmetric pair: _handle_continuous gains Step-4 routing to HAD on no-untreated panels - the HAD <-> ContinuousDiD routing loop is now bidirectional.

Extend _check_nan_att with ndarray branch (lazy numpy import + np.all(np.isnan(arr)) semantics so partial-NaN arrays don't over-fire the warning). Scalar path bit-exact preserved across all 12 untouched handlers.

Add full HAD section + result-class blocks + ## HAD Pretests index covering all 7 pretest entry points + Choosing-an-Estimator row to diff_diff/guides/llms-full.txt (the bundled-in-wheel agent reference). Tighten the existing Continuous treatment intensity Choosing row with "(some units untreated)" so the HAD vs ContinuousDiD contrast is explicit. Framing: "no untreated unit" / dose variation, never "no comparison group" - locked by negative-assertion tests on both handler text and llms-full.txt section.

docs/doc-deps.yaml: remove the llms-full.txt deferral note on had.py and add llms-full.txt entries to had.py, had_pretests.py, and practitioner.py blocks.

21 new tests (14 in tests/test_practitioner.py::TestHADDispatch + 6 in tests/test_guides.py::TestLLMsFullHADCoverage + 1 fixture-minimality regression locking the "handlers are STRING-ONLY at runtime" stability invariant).

Closes the Phase 5 "agent surfaces" gap; T21 pretest tutorial and T22 weighted/survey tutorial remain queued as separate notebook PRs.
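The intended all-NaN ndarray semantics can be sketched as follows (illustrative only; check_nan_att_sketch is a hypothetical standalone form, not the shipped helper):

```python
import math
import numpy as np

def check_nan_att_sketch(att):
    """Hypothetical sketch: warn on scalar-NaN or all-NaN ndarray att."""
    try:
        scalar = float(att)  # scalar path: pre-Phase-5 behavior, unchanged
    except (TypeError, ValueError):
        # ndarray path: reached only when float() rejects the input
        try:
            arr = np.asarray(att, dtype=float)
        except (TypeError, ValueError):
            return []
        # np.all, not np.any: partial-NaN horizons are legitimate
        # event-study output and must not trigger the warning
        if arr.size and bool(np.all(np.isnan(arr))):
            return ["all per-horizon estimates are NaN"]
        return []
    return ["NaN ATT"] if math.isnan(scalar) else []
```

A partial-NaN array such as [nan, 0.2] falls through to the no-warning branch, while an all-NaN array fires exactly one warning string.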
Co-Authored-By: Claude Opus 4.7 (1M context)
---
 CHANGELOG.md                   |   5 +
 TODO.md                        |   2 +-
 diff_diff/guides/llms-full.txt | 156 ++++++++++++++-
 diff_diff/practitioner.py      | 340 ++++++++++++++++++++++++++++++---
 docs/doc-deps.yaml             |  13 +-
 tests/test_guides.py           |  70 +++++++
 tests/test_practitioner.py     | 179 +++++++++++++++--
 7 files changed, 723 insertions(+), 42 deletions(-)

diff --git a/CHANGELOG.md b/CHANGELOG.md
index 663a6b60..95073e41 100644
--- a/CHANGELOG.md
+++ b/CHANGELOG.md
@@ -5,6 +5,11 @@ All notable changes to this project will be documented in this file.
 The format is based on [Keep a Changelog](https://keepachangelog.com/en/1.1.0/),
 and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0.html).
 
+## [Unreleased]
+
+### Added
+- **HAD `practitioner_next_steps()` handler + `llms-full.txt` reference section** (Phase 5). Adds `_handle_had` and `_handle_had_event_study` to `diff_diff/practitioner.py::_HANDLERS`, routing both `HeterogeneousAdoptionDiDResults` (single-period) and `HeterogeneousAdoptionDiDEventStudyResults` (event-study) through HAD-specific Baker et al. (2025) step guidance: `did_had_pretest_workflow` (step 3 — paper Section 4.2 step-2 closure on the event-study path), `ContinuousDiD` / `CallawaySantAnna` routing nudge (step 4 — fires on the wrong-estimator-for-this-data path), `bandwidth_diagnostics` inspection on continuous designs and simultaneous (sup-t) `cband_*` reading on weighted event-study fits (step 6), per-horizon WAS event-study disaggregation (step 7), and the explicit design-auto-detection / last-cohort-only-WAS framing (step 8). Symmetric pair: `_handle_continuous` gains a Step-4 nudge to `HeterogeneousAdoptionDiD` for ContinuousDiD users on no-untreated panels — the routing loop is now bidirectional.
Extends `_check_nan_att` with an ndarray branch via lazy `numpy` import for HAD's per-horizon `att` array; uses `np.all(np.isnan(arr))` semantics so partial-NaN arrays (legitimate event-study output under degenerate horizon-specific designs) do not over-fire the warning. Scalar path is bit-exact preserved across all 12 untouched handlers. Adds full HAD section + `HeterogeneousAdoptionDiDResults` / `HeterogeneousAdoptionDiDEventStudyResults` blocks + `## HAD Pretests` index covering all 7 pretest entry points + Choosing-an-Estimator row to `diff_diff/guides/llms-full.txt` (the bundled-in-wheel agent reference). Tightens the existing `Continuous treatment intensity` Choosing row to `(some units untreated)` so the contrast with the new HAD row is explicit. Framing convention follows the "no untreated unit" / dose variation language; locked by negative-assertion tests on both the handler text and the `llms-full.txt` HAD section. `docs/doc-deps.yaml` updated to remove the `llms-full.txt` deferral note on `had.py` and add `llms-full.txt` entries to `had.py`, `had_pretests.py`, and `practitioner.py` blocks. Patch-level (additive on stable surfaces). 21 new tests (14 in `tests/test_practitioner.py::TestHADDispatch` + 6 in `tests/test_guides.py::TestLLMsFullHADCoverage` + 1 fixture-minimality regression locking the "handlers are STRING-ONLY at runtime" stability invariant). Closes the Phase 5 "agent surfaces" gap; T21 pretest tutorial and T22 weighted/survey tutorial remain queued as separate notebook PRs. + ## [3.3.2] - 2026-04-26 ### Added diff --git a/TODO.md b/TODO.md index 73b0251d..17d2659e 100644 --- a/TODO.md +++ b/TODO.md @@ -109,7 +109,7 @@ Deferred items from PR reviews that were not addressed before merge. | `HeterogeneousAdoptionDiD` Phase 3 R-parity: Phase 3 ships coverage-rate validation on synthetic DGPs (not tight point parity against `chaisemartin::stute_test` / `yatchew_test`). 
Tight numerical parity requires aligning bootstrap seed semantics and `B` across numpy/R and is deferred. | `tests/test_had_pretests.py` | Phase 3 | Low | | `HeterogeneousAdoptionDiD` Phase 3 nprobust bandwidth for Stute: some Stute variants on continuous regressors use nprobust-style optimal bandwidth selection. Phase 3 uses OLS residuals from a 2-parameter linear fit (no bandwidth selection). nprobust integration is a future enhancement; not in paper scope. | `diff_diff/had_pretests.py::stute_test` | Phase 3 | Low | | `HeterogeneousAdoptionDiD` Phase 4: Pierce-Schott (2016) replication harness; reproduce paper Figure 2 values and Table 1 coverage rates. | `benchmarks/`, `tests/` | Phase 2a | Low | -| `HeterogeneousAdoptionDiD` Phase 5: `practitioner_next_steps()` integration, tutorial notebook, and `llms-full.txt` HeterogeneousAdoptionDiD section (preserving UTF-8 fingerprint). README catalog + bundled `llms.txt` entry + `docs/api/had.rst` + `docs/references.rst` citation landed in PR #372 docs refresh. | `diff_diff/practitioner.py`, `tutorials/`, `diff_diff/guides/llms-full.txt` | Phase 2a | Low | +| `HeterogeneousAdoptionDiD` Phase 5 follow-up tutorials (T21 HAD pretest workflow notebook + T22 weighted/survey HAD tutorial). `practitioner_next_steps()` HAD handlers + `llms-full.txt` HeterogeneousAdoptionDiD section + Choosing-an-Estimator row landed in Phase 5 wave 1. | `tutorials/`, `tests/test_t21_*_drift.py`, `tests/test_t22_*_drift.py` | Phase 2a | Low | | `HeterogeneousAdoptionDiD` time-varying dose on event study: Phase 2b REJECTS panels where `D_{g,t}` varies within a unit for `t >= F` (the aggregation uses `D_{g, F}` as the single regressor for all horizons, paper Appendix B.2 constant-dose convention). A follow-up PR could add a time-varying-dose estimator for these panels; current behavior is front-door rejection with a redirect to `ChaisemartinDHaultfoeuille`. 
| `diff_diff/had.py::_validate_had_panel_event_study` | Phase 2b | Low |
 | `HeterogeneousAdoptionDiD` repeated-cross-section support: paper Section 2 defines HAD on panel OR repeated cross-section, but Phase 2a is panel-only. RCS inputs (disjoint unit IDs between periods) are rejected by the balanced-panel validator with the generic "unit(s) do not appear in both periods" error. A follow-up PR will add an RCS identification path based on pre/post cell means (rather than unit-level first differences), with its own validator and a distinct `data_mode` / API surface. | `diff_diff/had.py::_validate_had_panel`, `diff_diff/had.py::_aggregate_first_difference` | Phase 2a | Medium |
 | SyntheticDiD: bootstrap cross-language parity anchor against R's default `synthdid::vcov(method="bootstrap")` (refit; rebinds `opts` per draw) or Julia `Synthdid.jl::src/vcov.jl::bootstrap_se` (refit by construction). Same-library validation (placebo-SE tracking, AER §6.3 MC truth) is in place; a cross-language anchor is desirable to bolster the methodology contract. Julia is the cleanest target — minimal wrapping work and refit-native vcov. Tolerance target: 1e-6 on Monte Carlo samples (different BLAS + RNG paths preclude 1e-10). The R-parity fixture from the previous release was deleted because it pinned the now-removed fixed-weight path. | `benchmarks/R/`, `benchmarks/julia/`, `tests/` | follow-up | Low |
diff --git a/diff_diff/guides/llms-full.txt b/diff_diff/guides/llms-full.txt
index e4f161c3..65f84ae5 100644
--- a/diff_diff/guides/llms-full.txt
+++ b/diff_diff/guides/llms-full.txt
@@ -590,6 +590,68 @@ results = est.fit(data, outcome='outcome', unit='unit', time='period',
 results.print_summary()
 ```
+### HeterogeneousAdoptionDiD
+
+Heterogeneous Adoption DiD estimator (de Chaisemartin, Ciccia, D'Haultfœuille & Knau 2026).
Targets a Weighted Average Slope (WAS) on **Heterogeneous Adoption Designs where no unit remains untreated** — every unit receives the treatment at some positive dose level, so the comparison structure comes from dose variation across units rather than from an untreated holdout. Treatment varies in intensity, not in status. Uses a bias-corrected local-linear estimator at the dose support boundary on continuous-dose designs (Design 1' / Design 1) and a 2SLS Wald-IV estimator on the mass-point design. + +```python +HeterogeneousAdoptionDiD( + design: str = "auto", # "auto" / "continuous_at_zero" / "continuous_near_d_lower" / "mass_point" + alpha: float = 0.05, + n_bootstrap: int = 999, # Multiplier-bootstrap iterations for sup-t bands + seed: int | None = None, + h: float | None = None, # Bias-corrected local-linear bandwidth (auto-selected if None) + b: float | None = None, # Pilot bandwidth (auto-selected if None) + rcond: float | None = None, +) +``` + +**Alias:** `HAD` + +**fit() parameters:** + +```python +had.fit( + data: pd.DataFrame, + outcome_col: str, + unit_col: str, + time_col: str, + dose_col: str, + first_treat_col: str | None = None, # Required on staggered panels (last-cohort auto-filter trigger) + aggregate: str | None = None, # None (single scalar WAS) or "event_study" (per-horizon WAS) + cband: bool = True, # Simultaneous (sup-t) confidence bands on weighted event-study fits + survey_design: SurveyDesign | None = None, # Survey weights, strata, PSU, FPC + weights: np.ndarray | None = None, # pweight shortcut (mutually exclusive with survey_design) +) -> HeterogeneousAdoptionDiDResults | HeterogeneousAdoptionDiDEventStudyResults +``` + +**Usage:** + +```python +from diff_diff import HeterogeneousAdoptionDiD, did_had_pretest_workflow + +# Vet the testable identifying assumptions first: +report = did_had_pretest_workflow( + data, outcome_col='y', unit_col='unit', time_col='t', + dose_col='d', first_treat_col='first_treat') +print(report.summary()) + 
+# Single-period scalar WAS: +est = HeterogeneousAdoptionDiD() +results = est.fit(data, outcome_col='y', unit_col='unit', + time_col='t', dose_col='d', + first_treat_col='first_treat') +print(results.summary()) + +# Multi-period per-horizon WAS: +es = est.fit(data, outcome_col='y', unit_col='unit', + time_col='t', dose_col='d', + first_treat_col='first_treat', + aggregate='event_study') +``` + +**Staggered panels.** On multi-cohort panels with `aggregate="event_study"`, `fit()` auto-filters to the last treatment cohort plus never-treated units (paper Appendix B.2) and emits a `UserWarning` naming kept/dropped counts. The estimand is then a **last-cohort-only WAS**, not a multi-cohort average. For full multi-cohort staggered support, see `ChaisemartinDHaultfoeuille`. + ### StackedDiD Stacked DiD estimator (Wing, Freedman & Hollingsworth 2024). Addresses TWFE bias with corrective Q-weights. @@ -1157,6 +1219,65 @@ Each event study effect dict contains: `effect`, `se`, `t_stat`, `p_value`, `con **Methods:** `summary()`, `print_summary()`, `to_dataframe()` +### HeterogeneousAdoptionDiDResults + +Single-period results container for `HeterogeneousAdoptionDiD`. 
+
+| Attribute | Type | Description |
+|-----------|------|-------------|
+| `att` | `float` | Point estimate of the WAS parameter on the β-scale |
+| `se` | `float` | Standard error on the β-scale |
+| `t_stat` | `float` | T-statistic |
+| `p_value` | `float` | P-value |
+| `conf_int` | `tuple[float, float]` | Confidence interval |
+| `alpha` | `float` | CI level used at fit time |
+| `design` | `str` | Resolved design: `"continuous_at_zero"`, `"continuous_near_d_lower"`, or `"mass_point"` |
+| `target_parameter` | `str` | `"WAS"` (Design 1') or `"WAS_d_lower"` (Design 1 / mass-point) |
+| `d_lower` | `float` | Support infimum (`0.0` on Design 1', `min(d)` otherwise) |
+| `dose_mean` | `float` | `D_bar = (1/G) * sum(D_{g,2})` |
+| `n_obs` | `int` | Units contributing to estimation |
+| `n_treated` | `int` | Units with `D > d_lower` |
+| `n_control` | `int` | Units at or below `d_lower` |
+| `inference_method` | `str` | `"analytical_nonparametric"` or `"analytical_2sls"` |
+| `vcov_type` | `str \| None` | Mass-point only: `"classical"`, `"hc1"`, or `"cr1"` |
+| `bandwidth_diagnostics` | `BandwidthResult \| None` | MSE-DPI selector output (continuous designs); `None` on `mass_point` |
+| `survey_metadata` | `SurveyMetadata \| None` | Repo-standard survey metadata when `survey_design=` / `weights=` is supplied |
+
+**Methods:** `summary()`, `print_summary()`, `to_dict()`, `to_dataframe()`
+
+### HeterogeneousAdoptionDiDEventStudyResults
+
+Per-horizon event-study results container for `HeterogeneousAdoptionDiD` with `aggregate="event_study"`. The anchor horizon `e = -1` is excluded by construction.
+
+| Attribute | Type | Description |
+|-----------|------|-------------|
+| `event_times` | `np.ndarray` | Integer event-time labels `e = t - F`, sorted ascending |
+| `att` | `np.ndarray` | Per-horizon WAS point estimates |
+| `se` | `np.ndarray` | Per-horizon standard errors |
+| `t_stat` | `np.ndarray` | Per-horizon t-statistics |
+| `p_value` | `np.ndarray` | Per-horizon p-values |
+| `conf_int_low` | `np.ndarray` | Pointwise CI lower bounds |
+| `conf_int_high` | `np.ndarray` | Pointwise CI upper bounds |
+| `cband_low` | `np.ndarray \| None` | Simultaneous (sup-t) band lower bounds; `None` on unweighted fits or when `cband=False` |
+| `cband_high` | `np.ndarray \| None` | Simultaneous (sup-t) band upper bounds |
+| `cband_crit_value` | `float \| None` | Sup-t critical value used for the simultaneous band |
+| `cband_method` | `str \| None` | `"multiplier_bootstrap"` when populated |
+| `cband_n_bootstrap` | `int \| None` | Bootstrap iterations used for the band |
+| `n_obs_per_horizon` | `np.ndarray` | Per-horizon contributing-unit counts |
+| `alpha` | `float` | CI level used at fit time |
+| `design` | `str` | Shared across horizons (paper Appendix B.2 invariant) |
+| `target_parameter` | `str` | Same convention as the single-period result |
+| `d_lower` | `float` | Support infimum, shared across horizons |
+| `dose_mean` | `float` | `D_bar` on the fit sample |
+| `F` | `object` | First-treatment period label |
+| `n_units` | `int` | Unique units contributing to the fit (post last-cohort filter) |
+| `inference_method` | `str` | `"analytical_nonparametric"` or `"analytical_2sls"` |
+| `survey_metadata` | `SurveyMetadata \| None` | Populated on weighted fits |
+| `variance_formula` | `str \| None` | Per-horizon variance family label |
+| `effective_dose_mean` | `float \| None` | Weighted denominator |
+
+**Methods:** `summary()`, `print_summary()`, `to_dict()`, `to_dataframe()`
+
 ### TROPResults
 
 | Attribute | Type | Description |
@@ -1265,6 +1386,38 @@ did =
DifferenceInDifferences(inference="wild_bootstrap", n_bootstrap=999, results = did.fit(data, outcome='y', treatment='treated', time='post') ``` +## HAD Pretests + +Diagnostic pretests for the `HeterogeneousAdoptionDiD` identifying assumptions (de Chaisemartin, Ciccia, D'Haultfœuille & Knau 2026). The composite workflow `did_had_pretest_workflow` is the recommended entry point — call it before reporting WAS as causal. + +```python +from diff_diff import ( + did_had_pretest_workflow, + qug_test, stute_test, yatchew_hr_test, + stute_joint_pretest, joint_pretrends_test, joint_homogeneity_test, +) + +# Composite workflow — bundles QUG + Stute + Yatchew per the paper's three-step battery +report = did_had_pretest_workflow( + data, outcome_col='y', unit_col='unit', time_col='t', + dose_col='d', first_treat_col='first_treat', + aggregate='overall', # or 'event_study' for joint Stute on multi-period panels + survey_design=None) # SurveyDesign for survey-aware pretests (Phase 4.5 C) +print(report.summary()) +print(report.all_pass, report.verdict) +``` + +Individual tests: + +- `qug_test(d)` — Assumption 5 support condition. Extreme order statistics, Exp(1)/Exp(1) limit law. **Permanently rejects** non-`None` `survey_design=` / `weights=` (`NotImplementedError`) per Phase 4.5 C0 deferral — extreme-value functionals are not smooth in the empirical CDF, so standard survey machinery does not yield a calibrated test. +- `stute_test(d, dy)` — Assumption 7 mean-independence of trends via Cramér-von Mises functional with Mammen wild bootstrap. Survey-aware via PSU-level Mammen multiplier bootstrap. +- `yatchew_hr_test(d, dy, *, null="linearity")` — Assumption 8 linearity of `E[ΔY|D]` via Yatchew (1997) heteroskedasticity-robust variance-ratio test. The `null="mean_independence"` mode (R `YatchewTest::yatchew_test(order=0)`) is also exposed for placebo-style mean-independence testing. Survey-aware via closed-form weighted variance components (no bootstrap). 
+- `stute_joint_pretest(residuals_dict, d)` — joint Cramér-von Mises across K horizons with shared-η Mammen wild bootstrap (Delgado-Manteiga 2001 / Hlávka-Hušková 2020). Residuals-in core; the two data-in wrappers below construct residuals for the two paper-spelled nulls. +- `joint_pretrends_test(...)` — joint pre-trends on K pre-periods (paper Section 4.2 step 2 closure on the event-study path). +- `joint_homogeneity_test(...)` — joint linearity-and-homogeneity on K post-periods. + +The QUG-under-survey deferral is permanent; the linearity-family pretests support `survey_design=` (pweight, PSU, FPC) per Phase 4.5 C. Stratified designs and replicate-weight designs are deferred to follow-up PRs. + ## Honest DiD Sensitivity Analysis Rambachan & Roth (2023) robust inference allowing bounded parallel trends violations. @@ -1734,7 +1887,8 @@ DIFF_DIFF_BACKEND=rust pytest # Force Rust (fail if unavailable) | Staggered treatment timing | `CallawaySantAnna`, `ImputationDiD`, or `SunAbraham` | | Few treated units / synthetic control | `SyntheticDiD` | | Interactive fixed effects / factor confounding | `TROP` | -| Continuous treatment intensity | `ContinuousDiD` | +| Continuous treatment intensity (some units untreated) | `ContinuousDiD` | +| No untreated unit / universal rollout (every unit treated at different doses) | `HeterogeneousAdoptionDiD` | | Two-criterion treatment, simultaneous (2x2x2 DDD) | `TripleDifference` | | Two-criterion treatment, staggered timing + eligibility | `StaggeredTripleDifference` | | Nonlinear outcome (binary/count) with staggered timing | `WooldridgeDiD` | diff --git a/diff_diff/practitioner.py b/diff_diff/practitioner.py index cd1d4235..5b942454 100644 --- a/diff_diff/practitioner.py +++ b/diff_diff/practitioner.py @@ -41,8 +41,11 @@ "ContinuousDiDResults": "ContinuousDiD", "TripleDifferenceResults": "TripleDifference (DDD)", "BaconDecompositionResults": "BaconDecomposition", + "HeterogeneousAdoptionDiDResults": "HeterogeneousAdoptionDiD 
(HAD)", + "HeterogeneousAdoptionDiDEventStudyResults": "HeterogeneousAdoptionDiD (Event Study)", } + # --------------------------------------------------------------------------- # Public API # --------------------------------------------------------------------------- @@ -83,9 +86,7 @@ def practitioner_next_steps( completed = set(completed_steps or []) unknown = completed - STEPS if unknown: - raise ValueError( - f"Unknown step names: {unknown}. Valid names: {sorted(STEPS)}" - ) + raise ValueError(f"Unknown step names: {unknown}. Valid names: {sorted(STEPS)}") # Estimation is always complete if we have a results object completed.add("estimation") @@ -543,10 +544,7 @@ def _handle_synthetic(results: Any): "ATTs — departures signal that something is being picked " "up pre-treatment, weakening the causal interpretation." ), - code=( - "placebo_df = results.in_time_placebo()\n" - "print(placebo_df)" - ), + code=("placebo_df = results.in_time_placebo()\n" "print(placebo_df)"), priority="medium", step_name="sensitivity", ), @@ -589,10 +587,7 @@ def _handle_synthetic(results: Any): "data. Show whether the ATT moves materially across a " "grid of values to gauge robustness to this choice." ), - code=( - "sens_df = results.sensitivity_to_zeta_omega()\n" - "print(sens_df)" - ), + code=("sens_df = results.sensitivity_to_zeta_omega()\n" "print(sens_df)"), priority="low", step_name="sensitivity", ), @@ -731,6 +726,28 @@ def _handle_continuous(results: Any): ), step_name="parallel_trends", ), + _step( + baker_step=4, + label="Switch to HeterogeneousAdoptionDiD if no untreated units", + why=( + "ContinuousDiD's identification assumes a never-treated " + "comparison group exists (units with dose = 0). When every " + "unit is treated at some positive dose level — a universal " + "rollout where treatment varies in intensity, not status — " + "use HeterogeneousAdoptionDiD instead. 
HAD identifies a " + "Weighted Average Slope (WAS) at the dose support boundary " + "by leveraging dose variation across units." + ), + code=( + "# If your panel has no units with first_treat == 0, switch:\n" + "from diff_diff import HeterogeneousAdoptionDiD\n" + "had = HeterogeneousAdoptionDiD()\n" + "had_results = had.fit(\n" + " data, outcome_col='y', unit_col='unit',\n" + " time_col='t', dose_col='d', first_treat_col='first_treat')" + ), + step_name="estimator_selection", + ), _step( baker_step=7, label="Plot dose-response curve", @@ -739,10 +756,7 @@ def _handle_continuous(results: Any): "level. The dose-response curve reveals the functional form " "of the treatment-dose relationship." ), - code=( - "from diff_diff import plot_dose_response\n" - "plot_dose_response(results)" - ), + code=("from diff_diff import plot_dose_response\n" "plot_dose_response(results)"), step_name="heterogeneity", ), _step( @@ -830,6 +844,257 @@ def _handle_bacon(results: Any): return steps, warnings +def _handle_had(results: Any): + """HeterogeneousAdoptionDiD single-period guidance. + + Five Baker et al. steps (3, 4, 6, 7, 8). HAD's design absence is + "no untreated unit" - comparison comes from dose variation across + units, not from an untreated holdout. Treatment varies in intensity, + not in status. + """ + steps = [ + _step( + baker_step=3, + label="Run the HAD pretest battery", + why=( + "did_had_pretest_workflow bundles the paper's three " + "testable identifying assumptions: QUG (Assumption 5 " + "support condition), Stute (Assumption 7 mean-independence " + "of trends), and Yatchew-HR (Assumption 8 linearity of " + "E[ΔY|D]). Assumption 5/6 boundary continuity is not " + "testable - the workflow vets what can be vetted." 
+ ), + code=( + "from diff_diff import did_had_pretest_workflow\n" + "report = did_had_pretest_workflow(\n" + " data, outcome_col='y', unit_col='unit',\n" + " time_col='t', dose_col='d',\n" + " first_treat_col='first_treat')\n" + "print(report.summary())" + ), + step_name="parallel_trends", + ), + _step( + baker_step=4, + label="Switch to ContinuousDiD or CallawaySantAnna if untreated units exist", + why=( + "HAD targets the no-untreated-unit case where every unit " + "is treated at some positive dose. If your panel actually " + "contains units with D = 0 (genuinely untreated), HAD's " + "WAS divisor under-weights the never-treated subset and a " + "different estimator is correct: ContinuousDiD for " + "dose-response on data with untreated controls, or " + "CallawaySantAnna for binary-staggered timing." + ), + code=( + "# Check for untreated units:\n" + "if (data['first_treat'] == 0).any():\n" + " # Untreated units exist - switch to ContinuousDiD:\n" + " from diff_diff import ContinuousDiD\n" + " cdid = ContinuousDiD()\n" + " cdid_results = cdid.fit(\n" + " data, outcome='y', unit='unit', time='t',\n" + " first_treat='first_treat', dose='d')\n" + " # Or CallawaySantAnna for binary-staggered timing:\n" + " # from diff_diff import CallawaySantAnna\n" + " # cs = CallawaySantAnna(control_group='never_treated')" + ), + step_name="estimator_selection", + ), + _step( + baker_step=6, + label="Inspect bandwidth diagnostics (continuous designs)", + why=( + "Continuous-dose designs (continuous_at_zero / " + "continuous_near_d_lower) use an MSE-DPI bandwidth selector " + "for the bias-corrected local-linear estimator. Bandwidth " + "choice affects WAS - verify the selector landed on a " + "viable bandwidth (not boundary-clipped or near-degenerate). " + "result.bandwidth_diagnostics is None on the mass_point " + "design (parametric, no bandwidth)." 
+ ), + code=( + "# Inspect the auto-selected bandwidths:\n" + "result.bandwidth_diagnostics # None on mass_point\n" + "# Re-fit with explicit h= / b= to test sensitivity" + ), + priority="medium", + step_name="sensitivity", + ), + _step( + baker_step=7, + label="Re-fit with aggregate='event_study' for per-horizon WAS", + why=( + "On multi-period panels, the event-study aggregate returns " + "per-event-time WAS estimates instead of a single scalar. " + "Reveals whether dose response grows, decays, or stabilizes " + "across post-treatment horizons. Pre-period placebos serve " + "as a parallel-trends sanity check." + ), + code=( + "from diff_diff import HeterogeneousAdoptionDiD\n" + "est = HeterogeneousAdoptionDiD()\n" + "es = est.fit(\n" + " data, outcome_col='y', unit_col='unit',\n" + " time_col='t', dose_col='d',\n" + " first_treat_col='first_treat',\n" + " aggregate='event_study')" + ), + priority="medium", + step_name="heterogeneity", + ), + _step( + baker_step=8, + label="Verify design auto-detection with explicit design=", + why=( + "design='auto' picks one of {continuous_at_zero, " + "continuous_near_d_lower, mass_point} from the dose " + "support. Re-fit with an explicit design= to verify the " + "auto-detection matched your panel structure - WAS vs " + "WAS_d_lower target parameters, and the bias-corrected " + "local-linear vs 2SLS estimation paths, differ in " + "interpretation." 
+ ), + code=( + "# Refit with each candidate design and compare:\n" + "from diff_diff import HeterogeneousAdoptionDiD\n" + "for d in ['continuous_at_zero', 'continuous_near_d_lower',\n" + " 'mass_point']:\n" + " try:\n" + " alt = HeterogeneousAdoptionDiD(design=d).fit(...)\n" + " print(d, alt.att, alt.target_parameter)\n" + " except Exception as e:\n" + " print(d, 'not applicable:', e)" + ), + priority="medium", + step_name="robustness", + ), + ] + warnings = _check_nan_att(results) + return steps, warnings + + +def _handle_had_event_study(results: Any): + """HeterogeneousAdoptionDiD event-study guidance. + + Five Baker et al. steps (3, 4, 6, 7, 8). Same framing convention as + _handle_had: "no untreated unit", dose variation, treatment varies + in intensity not status. + """ + steps = [ + _step( + baker_step=3, + label="Run the HAD pretest battery (event-study mode)", + why=( + "On multi-period panels, did_had_pretest_workflow with " + "aggregate='event_study' runs QUG plus joint Stute " + "pre-trends plus joint homogeneity-linearity Stute. The " + "joint Stute variants close the paper Section 4.2 step-2 " + "gap that the overall path explicitly flags as deferred." + ), + code=( + "from diff_diff import did_had_pretest_workflow\n" + "report = did_had_pretest_workflow(\n" + " data, outcome_col='y', unit_col='unit',\n" + " time_col='t', dose_col='d',\n" + " first_treat_col='first_treat',\n" + " aggregate='event_study')\n" + "print(report.summary())" + ), + step_name="parallel_trends", + ), + _step( + baker_step=4, + label="Switch to ContinuousDiD or CallawaySantAnna if untreated units exist", + why=( + "HAD targets the no-untreated-unit case. If your panel " + "contains units with D = 0, switch to " + "ContinuousDiD(aggregate='eventstudy') for dose-response " + "event study with untreated controls, or CallawaySantAnna " + "with aggregate='event_study' for binary-staggered timing." 
+ ), + code=( + "# Check for untreated units:\n" + "if (data['first_treat'] == 0).any():\n" + " from diff_diff import ContinuousDiD\n" + " cdid = ContinuousDiD()\n" + " es = cdid.fit(\n" + " data, outcome='y', unit='unit', time='t',\n" + " first_treat='first_treat', dose='d',\n" + " aggregate='eventstudy')" + ), + step_name="estimator_selection", + ), + _step( + baker_step=6, + label="Use simultaneous (sup-t) confidence bands when reading multiple horizons", + why=( + "Pointwise CIs over-reject when you read multiple horizons " + "as a joint pattern. On weighted fits (survey_design= or " + "weights=), fit(cband=True) constructs simultaneous (sup-t) " + "bands across horizons via multiplier bootstrap. " + "result.cband_low / cband_high give the band endpoints; " + "cband_crit_value reports the sup-t critical value used." + ), + code=( + "from diff_diff import HeterogeneousAdoptionDiD\n" + "est = HeterogeneousAdoptionDiD(n_bootstrap=999, seed=42)\n" + "es = est.fit(\n" + " data, outcome_col='y', unit_col='unit',\n" + " time_col='t', dose_col='d',\n" + " first_treat_col='first_treat',\n" + " aggregate='event_study',\n" + " survey_design=design, cband=True)\n" + "es.cband_low, es.cband_high # simultaneous band endpoints" + ), + priority="medium", + step_name="sensitivity", + ), + _step( + baker_step=7, + label="Inspect per-horizon WAS arrays + pre-period placebos", + why=( + "Per-horizon WAS reveals adoption-effect dynamics. " + "Pre-period placebo horizons (event_times <= -2) should be " + "near zero - large pre-period coefficients flag a " + "parallel-trends or anticipation problem. The anchor " + "horizon e = -1 is excluded by construction." 
+ ), + code=( + "import numpy as np\n" + "es.event_times, es.att, es.se # per-horizon arrays\n" + "# Pre-period placebos (should be near zero):\n" + "pre_mask = es.event_times <= -2\n" + "es.att[pre_mask], es.se[pre_mask]" + ), + step_name="heterogeneity", + ), + _step( + baker_step=8, + label="Report the last-cohort-only WAS framing on staggered panels", + why=( + "On staggered panels (multiple treatment cohorts), fit() " + "auto-filters to the last treatment cohort plus " + "never-treated units and emits a UserWarning naming " + "kept/dropped counts (paper Appendix B.2). The resulting " + "estimand is a last-cohort-only WAS, NOT a multi-cohort " + "average - report it as such, and consider " + "ChaisemartinDHaultfoeuille for full staggered support." + ), + code=( + "# Inspect the kept/dropped cohort counts in the\n" + "# UserWarning emitted at fit time.\n" + "# For full multi-cohort support, see:\n" + "# from diff_diff import ChaisemartinDHaultfoeuille" + ), + priority="medium", + step_name="robustness", + ), + ] + warnings = _check_nan_att(results) + return steps, warnings + + def _handle_generic(results: Any): """Fallback for unknown result types.""" steps = [ @@ -880,6 +1145,8 @@ def _handle_generic(results: Any): "ContinuousDiDResults": _handle_continuous, "TripleDifferenceResults": _handle_triple, "BaconDecompositionResults": _handle_bacon, + "HeterogeneousAdoptionDiDResults": _handle_had, + "HeterogeneousAdoptionDiDEventStudyResults": _handle_had_event_study, } @@ -887,19 +1154,45 @@ def _handle_generic(results: Any): # Internal helpers # --------------------------------------------------------------------------- def _check_nan_att(results: Any) -> List[str]: - """Return warnings if ATT is NaN.""" + """Return warnings if ATT is NaN. + + Scalar path executes byte-identically to the pre-Phase-5 helper for + backcompat with the existing 12 untouched handlers. 
The ndarray + branch is reached only when ``float(att)`` raises TypeError on a + numpy array (HAD's event-study ``att`` field) and fires only when + every horizon is NaN - partial-NaN arrays are legitimate event-study + output (single-cluster collapse, degenerate horizon-specific design) + and would over-fire if flagged. Falls through to ``_handle_generic`` + too: any future estimator returning ndarray ``att`` without a + dedicated handler gets the same all-NaN warning shape. + """ # Check .att (DiDResults), .overall_att (staggered), .avg_att (MultiPeriod) att = getattr(results, "att", None) if att is None: att = getattr(results, "overall_att", None) if att is None: att = getattr(results, "avg_att", None) - if att is not None: + if att is None: + return [] + try: + scalar = float(att) + except (TypeError, ValueError): + # Ndarray path (HAD event-study, future ndarray-att estimators). + # Use np.all (not np.any): partial-NaN arrays are legitimate. try: - att = float(att) + import numpy as np + + arr = np.asarray(att, dtype=float) except (TypeError, ValueError): return [] - if att is not None and math.isnan(att): + if arr.size and bool(np.all(np.isnan(arr))): + return [ + "All per-horizon estimates are NaN — check data " + "preparation and model specification before proceeding " + "with diagnostics." + ] + return [] + if math.isnan(scalar): return [ "Estimation produced NaN ATT — check data preparation and " "model specification before proceeding with diagnostics." 
@@ -907,9 +1200,7 @@ def _check_nan_att(results: Any) -> List[str]: return [] -def _filter_steps( - steps: List[Dict[str, Any]], completed: Set[str] -) -> List[Dict[str, Any]]: +def _filter_steps(steps: List[Dict[str, Any]], completed: Set[str]) -> List[Dict[str, Any]]: """Remove steps whose _step_name is in the completed set.""" filtered = [] for s in steps: @@ -938,8 +1229,9 @@ def _print_output(output: Dict[str, Any]) -> None: for step in output["next_steps"]: priority = step.get("priority", "high") marker = "*" if priority == "high" else "-" - print(f"\n {marker} [{priority.upper()}] Step {step['baker_step']}: " - f"{step['label']}") + print( + f"\n {marker} [{priority.upper()}] Step {step['baker_step']}: " f"{step['label']}" + ) print(f" Why: {step['why']}") if step.get("code"): for line in step["code"].split("\n"): diff --git a/docs/doc-deps.yaml b/docs/doc-deps.yaml index 7c7aefca..e335c646 100644 --- a/docs/doc-deps.yaml +++ b/docs/doc-deps.yaml @@ -385,9 +385,9 @@ sources: section: "Universal Rollout (No Untreated Markets)" type: user_guide note: "Tip cross-link to T20 in the no-untreated section" - # Note: llms-full.txt does not yet have a HeterogeneousAdoptionDiD section - # (deferred to TODO.md Phase 5 follow-up); the dependency mapping will be - # added when that section lands. 
+ - path: diff_diff/guides/llms-full.txt + section: "HeterogeneousAdoptionDiD" + type: user_guide diff_diff/had_pretests.py: drift_risk: medium @@ -401,6 +401,9 @@ sources: - path: diff_diff/guides/llms.txt section: "Estimators" type: user_guide + - path: diff_diff/guides/llms-full.txt + section: "HAD Pretests" + type: user_guide diff_diff/local_linear.py: drift_risk: low @@ -799,6 +802,10 @@ sources: docs: - path: diff_diff/guides/llms-practitioner.txt type: user_guide + - path: diff_diff/guides/llms-full.txt + section: "Practitioner Workflow" + type: user_guide + note: "HAD handlers (_handle_had / _handle_had_event_study) emit did_had_pretest_workflow + bandwidth_diagnostics references; symmetric Step-4 routing in _handle_continuous" # ── Visualization (visualization group) ──────────────────────────── diff --git a/tests/test_guides.py b/tests/test_guides.py index e51f6920..89651dd4 100644 --- a/tests/test_guides.py +++ b/tests/test_guides.py @@ -272,3 +272,73 @@ def test_module_docstring_mentions_helper(): import diff_diff assert "get_llm_guide" in diff_diff.__doc__ + + +# --------------------------------------------------------------------------- +# llms-full.txt — HeterogeneousAdoptionDiD coverage (Phase 5) +# --------------------------------------------------------------------------- +class TestLLMsFullHADCoverage: + """Lock the HAD section additions to llms-full.txt against deletion + or framing drift. 
Phase 5 surfaces the agent-facing API contract for + HeterogeneousAdoptionDiD on the bundled-in-wheel guide.""" + + def test_llms_full_has_had_section(self): + text = get_llm_guide("full") + assert "### HeterogeneousAdoptionDiD" in text + + def test_llms_full_had_results_classes(self): + text = get_llm_guide("full") + assert "### HeterogeneousAdoptionDiDResults" in text + assert "### HeterogeneousAdoptionDiDEventStudyResults" in text + + def test_llms_full_had_pretests_section(self): + text = get_llm_guide("full") + assert "## HAD Pretests" in text + for fn in ( + "did_had_pretest_workflow", + "qug_test", + "stute_test", + "yatchew_hr_test", + "stute_joint_pretest", + "joint_pretrends_test", + "joint_homogeneity_test", + ): + assert fn in text, f"HAD Pretests section missing reference to {fn}" + + def test_llms_full_had_choosing_row(self): + text = get_llm_guide("full") + # The Choosing-an-Estimator table must list HAD with a row that + # names either "no untreated" or "universal rollout" framing. + # Find the Choosing section and check within it. + idx = text.index("## Choosing an Estimator") + choosing = text[idx:] + assert "HeterogeneousAdoptionDiD" in choosing + assert ("no untreated" in choosing.lower()) or ("universal rollout" in choosing.lower()) + + def test_llms_full_had_framing_no_comparison_group(self): + # Per feedback_had_framing_precision: HAD's design absence is + # "no untreated unit" — comparison comes from dose variation + # across units. The phrases "no comparison group" and + # "missing comparison" must NOT appear in the HAD section. + text = get_llm_guide("full") + had_start = text.index("### HeterogeneousAdoptionDiD") + # Find the next top-level or H3 boundary that is NOT another HAD + # section to scope the assertion to HAD-specific content. The + # HAD estimator section is followed by ### StackedDiD; the + # results-class section ends at ### TROPResults. We check the + # estimator section text only (most likely place for framing + # drift). 
+ had_end = text.index("### StackedDiD", had_start) + had_text = text[had_start:had_end].lower() + assert "no comparison group" not in had_text + assert "missing comparison" not in had_text + + def test_llms_full_paper_citation(self): + # Lead-author "D'Haultfœuille" appears in the HAD section. + # Naturally preserves the UTF-8 'œ' fingerprint asserted by + # test_utf8_encoding_preserved without a synthetic mark. + text = get_llm_guide("full") + had_start = text.index("### HeterogeneousAdoptionDiD") + had_end = text.index("### StackedDiD", had_start) + had_text = text[had_start:had_end] + assert "D'Haultfœuille" in had_text diff --git a/tests/test_practitioner.py b/tests/test_practitioner.py index 2b2b62c0..18eb2d76 100644 --- a/tests/test_practitioner.py +++ b/tests/test_practitioner.py @@ -1,11 +1,14 @@ """Tests for the practitioner guidance module.""" +import numpy as np import pytest from diff_diff import ( BaconDecomposition, CallawaySantAnna, DifferenceInDifferences, + HeterogeneousAdoptionDiDEventStudyResults, + HeterogeneousAdoptionDiDResults, MultiPeriodDiD, generate_did_data, generate_staggered_data, @@ -32,9 +35,7 @@ def did_data(): @pytest.fixture(scope="session") def staggered_data(): - return generate_staggered_data( - n_units=60, n_periods=8, treatment_effect=2.0, seed=42 - ) + return generate_staggered_data(n_units=60, n_periods=8, treatment_effect=2.0, seed=42) @pytest.fixture(scope="session") @@ -46,9 +47,7 @@ def did_results(did_data): @pytest.fixture(scope="session") def multi_period_results(did_data): es = MultiPeriodDiD() - return es.fit( - did_data, outcome="outcome", unit="unit", time="period", treatment="treated" - ) + return es.fit(did_data, outcome="outcome", unit="unit", time="period", treatment="treated") @pytest.fixture(scope="session") @@ -171,6 +170,43 @@ def mock_stacked_results(): return r +@pytest.fixture +def mock_had_results(): + r = HeterogeneousAdoptionDiDResults.__new__(HeterogeneousAdoptionDiDResults) + r.att = 0.5 + return 
r + + +@pytest.fixture +def mock_had_event_study_results(): + r = HeterogeneousAdoptionDiDEventStudyResults.__new__(HeterogeneousAdoptionDiDEventStudyResults) + # 5 horizons: e in {-3, -2, 0, 1, 2} + r.att = np.array([0.01, -0.02, 0.30, 0.45, 0.50]) + r.event_times = np.array([-3, -2, 0, 1, 2]) + return r + + +@pytest.fixture +def mock_had_results_nan_att(): + r = HeterogeneousAdoptionDiDResults.__new__(HeterogeneousAdoptionDiDResults) + r.att = float("nan") + return r + + +@pytest.fixture +def mock_had_event_study_results_all_nan(): + r = HeterogeneousAdoptionDiDEventStudyResults.__new__(HeterogeneousAdoptionDiDEventStudyResults) + r.att = np.full(5, np.nan) + return r + + +@pytest.fixture +def mock_had_event_study_results_partial_nan(): + r = HeterogeneousAdoptionDiDEventStudyResults.__new__(HeterogeneousAdoptionDiDEventStudyResults) + r.att = np.array([0.5, np.nan, 0.3]) + return r + + # --------------------------------------------------------------------------- # Tests: return schema # --------------------------------------------------------------------------- @@ -345,16 +381,12 @@ def test_filter_sensitivity(self, cs_results): assert len(filtered["next_steps"]) < len(full["next_steps"]) def test_filter_all_steps(self, cs_results): - output = practitioner_next_steps( - cs_results, completed_steps=list(STEPS), verbose=False - ) + output = practitioner_next_steps(cs_results, completed_steps=list(STEPS), verbose=False) assert len(output["next_steps"]) == 0 def test_invalid_step_name_raises(self, did_results): with pytest.raises(ValueError, match="Unknown step names"): - practitioner_next_steps( - did_results, completed_steps=["invalid_step"], verbose=False - ) + practitioner_next_steps(did_results, completed_steps=["invalid_step"], verbose=False) # --------------------------------------------------------------------------- @@ -439,7 +471,8 @@ def test_hausman_pretest_in_guidance(self, mock_efficient_results): def test_hausman_snippet_uses_classmethod(self, 
mock_efficient_results): output = practitioner_next_steps(mock_efficient_results, verbose=False) hausman_steps = [ - s for s in output["next_steps"] + s + for s in output["next_steps"] if "hausman" in s["label"].lower() or "Hausman" in s["label"] ] assert len(hausman_steps) > 0 @@ -458,3 +491,123 @@ class FakeResults: output = practitioner_next_steps(FakeResults(), verbose=False) assert len(output["next_steps"]) > 0 assert output["estimator"] == "FakeResults" + + +# --------------------------------------------------------------------------- +# Tests: HeterogeneousAdoptionDiD (HAD) handler dispatch +# --------------------------------------------------------------------------- +class TestHADDispatch: + def test_had_results_dispatch(self, mock_had_results): + output = practitioner_next_steps(mock_had_results, verbose=False) + assert len(output["next_steps"]) > 0 + assert output["estimator"] == "HeterogeneousAdoptionDiD (HAD)" + + def test_had_event_study_dispatch(self, mock_had_event_study_results): + output = practitioner_next_steps(mock_had_event_study_results, verbose=False) + assert len(output["next_steps"]) > 0 + assert output["estimator"] == "HeterogeneousAdoptionDiD (Event Study)" + + def test_had_pretest_workflow_referenced(self, mock_had_results): + output = practitioner_next_steps(mock_had_results, verbose=False) + all_code = " ".join(s.get("code", "") for s in output["next_steps"]) + assert "did_had_pretest_workflow" in all_code + + def test_had_event_study_pretest_workflow_referenced(self, mock_had_event_study_results): + output = practitioner_next_steps(mock_had_event_study_results, verbose=False) + all_code = " ".join(s.get("code", "") for s in output["next_steps"]) + assert "did_had_pretest_workflow" in all_code + assert "aggregate='event_study'" in all_code + + def test_had_bandwidth_diagnostics_referenced(self, mock_had_results): + output = practitioner_next_steps(mock_had_results, verbose=False) + all_text = " ".join( + (s.get("code", "") + " " + 
s.get("why", "")) for s in output["next_steps"] + ) + assert "bandwidth_diagnostics" in all_text + + def test_had_event_study_simultaneous_bands_referenced(self, mock_had_event_study_results): + output = practitioner_next_steps(mock_had_event_study_results, verbose=False) + all_text = " ".join( + (s.get("code", "") + " " + s.get("why", "")) for s in output["next_steps"] + ) + assert "cband" in all_text + # Either "sup-t" wording or "simultaneous" wording is acceptable. + assert ("sup-t" in all_text) or ("simultaneous" in all_text) + + def test_had_no_comparison_group_framing(self, mock_had_results, mock_had_event_study_results): + for fixture in (mock_had_results, mock_had_event_study_results): + output = practitioner_next_steps(fixture, verbose=False) + all_text = " ".join( + (s.get("code", "") + " " + s.get("why", "") + " " + s.get("label", "")) + for s in output["next_steps"] + ) + all_text += " ".join(output["warnings"]) + assert "no comparison group" not in all_text.lower() + assert "missing comparison" not in all_text.lower() + + def test_had_nan_warning_scalar(self, mock_had_results_nan_att): + output = practitioner_next_steps(mock_had_results_nan_att, verbose=False) + warnings = " ".join(output["warnings"]) + assert "NaN" in warnings or "nan" in warnings.lower() + + def test_had_event_study_nan_warning_array(self, mock_had_event_study_results_all_nan): + output = practitioner_next_steps(mock_had_event_study_results_all_nan, verbose=False) + warnings = " ".join(output["warnings"]) + assert "per-horizon" in warnings or "All" in warnings + + def test_had_partial_nan_array_no_warning(self, mock_had_event_study_results_partial_nan): + # Partial-NaN arrays are legitimate event-study output (some + # horizons may collapse on degenerate-design grounds while others + # remain finite). The all-NaN warning must NOT fire here. + output = practitioner_next_steps(mock_had_event_study_results_partial_nan, verbose=False) + # No "per-horizon" or "All ... 
NaN" warning string should appear. + warnings = " ".join(output["warnings"]) + assert "per-horizon" not in warnings + assert "All " not in warnings + + def test_had_step_4_estimator_selection_present( + self, mock_had_results, mock_had_event_study_results + ): + for fixture in (mock_had_results, mock_had_event_study_results): + output = practitioner_next_steps(fixture, verbose=False) + step_4_steps = [s for s in output["next_steps"] if s["baker_step"] == 4] + assert len(step_4_steps) >= 1 + all_text = " ".join((s.get("code", "") + " " + s.get("why", "")) for s in step_4_steps) + assert "ContinuousDiD" in all_text + assert "CallawaySantAnna" in all_text + + def test_handle_continuous_step_4_routes_to_had(self, mock_continuous_results): + # Symmetric pair: ContinuousDiD users with no untreated units + # should be routed to HeterogeneousAdoptionDiD. + output = practitioner_next_steps(mock_continuous_results, verbose=False) + step_4_steps = [s for s in output["next_steps"] if s["baker_step"] == 4] + assert len(step_4_steps) >= 1 + all_text = " ".join((s.get("code", "") + " " + s.get("why", "")) for s in step_4_steps) + assert "HeterogeneousAdoptionDiD" in all_text + + def test_handle_generic_ndarray_att_triggers_warning(self): + # Cross-handler regression: a future estimator that returns + # ndarray att and falls through to _handle_generic must produce + # the same all-NaN warning as the dedicated HAD event-study path. + class FutureNdarrayAttResults: + att: np.ndarray + + r = FutureNdarrayAttResults() + r.att = np.full(3, np.nan) + output = practitioner_next_steps(r, verbose=False) + warnings = " ".join(output["warnings"]) + assert "per-horizon" in warnings or "All" in warnings + + def test_had_handlers_string_only_no_attribute_reads( + self, mock_had_results, mock_had_event_study_results + ): + # Stability invariant #7: handlers are STRING-ONLY at runtime. 
+ # The fixtures construct results with ONLY .att (and event_times + # on the event-study fixture); confirm no AttributeError is + # raised when the handlers run. Protects against a future + # refactor that starts reading result. inside a + # handler and silently breaks the minimal-fixture contract. + for fixture in (mock_had_results, mock_had_event_study_results): + output = practitioner_next_steps(fixture, verbose=False) + assert isinstance(output, dict) + assert "next_steps" in output From 6d6e950e322443f19131aacb0e855f37e367172a Mon Sep 17 00:00:00 2001 From: igerber Date: Sat, 9 May 2026 10:56:42 -0400 Subject: [PATCH 2/9] Address PR #402 R1 review (1 P1, 4 P2) MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit P1 methodology fix: Step-4 routing in _handle_had + _handle_had_event_study no longer says "switch away from HAD if untreated units exist" - that contradicts REGISTRY § HeterogeneousAdoptionDiD edge cases (line 2403: "Authors do NOT require untreated units to be dropped"; line 2408 + had.py:1325: never-treated units RETAINED on staggered event-study). Reframed as the actual estimand differentiator: HAD targets WAS at the dose support boundary; ContinuousDiD targets per-dose ATT(d) / ACRT(d) and requires never-treated controls. Routing fires only when the user wants the ATT(d) estimand AND has never-treated controls, not on "untreated units exist". Tightens the corresponding Choosing-an-Estimator table row to surface WAS vs ATT(d) as the differentiator. P2 (a) signatures: llms-full.txt HAD constructor + fit() blocks now match the actual HeterogeneousAdoptionDiD.__init__ / .fit signatures exactly. Drops invented kwargs (h, b, rcond) and adds the real ones (d_lower, kernel, vcov_type, robust, cluster). aggregate default corrected from None to "overall". fit() now lists survey, weights, cband (positional-or-keyword) and survey_design + trends_lin (keyword-only). 
P2 (b) snippet bugs: result.bandwidth_diagnostics -> results.bandwidth_diagnostics (matching the plural convention of other handlers); sup-t snippet now imports SurveyDesign and constructs sd before passing survey_design=sd (was survey_design=design with no design defined). P2 (c) tests: New TestLLMsFullHADCoverage tests use inspect.signature(HeterogeneousAdoptionDiD.__init__) and .fit() to regress the documented signatures against the real API. New test_llms_full_had_section_methodology_compatible_with_untreated locks the negative assertion that the section does NOT carry framing contradicting the registry. Practitioner tests gain test_had_step_4_does_not_misframe_untreated_unit_routing + test_had_handler_snippets_are_valid_python_syntax (catches snippet syntax errors via ast.parse) + test_handle_continuous_step_4_snippet_is_valid_python. 83 tests pass (47 in test_practitioner including 5 new + 36 in test_guides including 9 new). Co-Authored-By: Claude Opus 4.7 (1M context) --- CHANGELOG.md | 2 +- diff_diff/guides/llms-full.txt | 33 ++++++----- diff_diff/practitioner.py | 91 ++++++++++++++++------------- tests/test_guides.py | 101 ++++++++++++++++++++++++++++----- tests/test_practitioner.py | 82 +++++++++++++++++++++++++- 5 files changed, 237 insertions(+), 72 deletions(-) diff --git a/CHANGELOG.md b/CHANGELOG.md index 95073e41..f7eca04f 100644 --- a/CHANGELOG.md +++ b/CHANGELOG.md @@ -8,7 +8,7 @@ and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0 ## [Unreleased] ### Added -- **HAD `practitioner_next_steps()` handler + `llms-full.txt` reference section** (Phase 5). Adds `_handle_had` and `_handle_had_event_study` to `diff_diff/practitioner.py::_HANDLERS`, routing both `HeterogeneousAdoptionDiDResults` (single-period) and `HeterogeneousAdoptionDiDEventStudyResults` (event-study) through HAD-specific Baker et al. 
(2025) step guidance: `did_had_pretest_workflow` (step 3 — paper Section 4.2 step-2 closure on the event-study path), `ContinuousDiD` / `CallawaySantAnna` routing nudge (step 4 — fires on the wrong-estimator-for-this-data path), `bandwidth_diagnostics` inspection on continuous designs and simultaneous (sup-t) `cband_*` reading on weighted event-study fits (step 6), per-horizon WAS event-study disaggregation (step 7), and the explicit design-auto-detection / last-cohort-only-WAS framing (step 8). Symmetric pair: `_handle_continuous` gains a Step-4 nudge to `HeterogeneousAdoptionDiD` for ContinuousDiD users on no-untreated panels — the routing loop is now bidirectional. Extends `_check_nan_att` with an ndarray branch via lazy `numpy` import for HAD's per-horizon `att` array; uses `np.all(np.isnan(arr))` semantics so partial-NaN arrays (legitimate event-study output under degenerate horizon-specific designs) do not over-fire the warning. Scalar path is bit-exact preserved across all 12 untouched handlers. Adds full HAD section + `HeterogeneousAdoptionDiDResults` / `HeterogeneousAdoptionDiDEventStudyResults` blocks + `## HAD Pretests` index covering all 7 pretest entry points + Choosing-an-Estimator row to `diff_diff/guides/llms-full.txt` (the bundled-in-wheel agent reference). Tightens the existing `Continuous treatment intensity` Choosing row to `(some units untreated)` so the contrast with the new HAD row is explicit. Framing convention follows the "no untreated unit" / dose variation language; locked by negative-assertion tests on both the handler text and the `llms-full.txt` HAD section. `docs/doc-deps.yaml` updated to remove the `llms-full.txt` deferral note on `had.py` and add `llms-full.txt` entries to `had.py`, `had_pretests.py`, and `practitioner.py` blocks. Patch-level (additive on stable surfaces). 
21 new tests (14 in `tests/test_practitioner.py::TestHADDispatch` + 6 in `tests/test_guides.py::TestLLMsFullHADCoverage` + 1 fixture-minimality regression locking the "handlers are STRING-ONLY at runtime" stability invariant). Closes the Phase 5 "agent surfaces" gap; T21 pretest tutorial and T22 weighted/survey tutorial remain queued as separate notebook PRs. +- **HAD `practitioner_next_steps()` handler + `llms-full.txt` reference section** (Phase 5). Adds `_handle_had` and `_handle_had_event_study` to `diff_diff/practitioner.py::_HANDLERS`, routing both `HeterogeneousAdoptionDiDResults` (single-period) and `HeterogeneousAdoptionDiDEventStudyResults` (event-study) through HAD-specific Baker et al. (2025) step guidance: `did_had_pretest_workflow` (step 3 — paper Section 4.2 step-2 closure on the event-study path), an estimand-difference routing nudge to `ContinuousDiD` (step 4 — fires when the user wants per-dose ATT(d) / ACRT(d) curves rather than HAD's WAS estimand and has never-treated controls; framed around estimand difference, NOT around the existence of untreated units, since HAD remains valid with a small never-treated share per REGISTRY § HeterogeneousAdoptionDiD edge cases and explicitly retains never-treated units on the staggered event-study path per paper Appendix B.2 / `had.py:1325`), `results.bandwidth_diagnostics` inspection on continuous designs and simultaneous (sup-t) `cband_*` reading on weighted event-study fits (step 6), per-horizon WAS event-study disaggregation (step 7), and the explicit design-auto-detection / last-cohort-only-WAS framing (step 8). Symmetric pair: `_handle_continuous` gains a Step-4 nudge to `HeterogeneousAdoptionDiD` for ContinuousDiD users on no-untreated panels (this direction is correct because ContinuousDiD's identification requires never-treated controls). 
Extends `_check_nan_att` with an ndarray branch via lazy `numpy` import for HAD's per-horizon `att` array; uses `np.all(np.isnan(arr))` semantics so partial-NaN arrays (legitimate event-study output under degenerate horizon-specific designs) do not over-fire the warning. Scalar path is bit-exact preserved across all 12 untouched handlers. Adds full HAD section + `HeterogeneousAdoptionDiDResults` / `HeterogeneousAdoptionDiDEventStudyResults` blocks + `## HAD Pretests` index covering all 7 pretest entry points + Choosing-an-Estimator row to `diff_diff/guides/llms-full.txt` (the bundled-in-wheel agent reference); the documented constructor + `fit()` signatures match the real `HeterogeneousAdoptionDiD.__init__` / `.fit` API exactly (verified by `inspect.signature`-based regression tests). Tightens the existing `Continuous treatment intensity` Choosing row to surface ATT(d) vs WAS as the estimand differentiator. `docs/doc-deps.yaml` updated to remove the `llms-full.txt` deferral note on `had.py` and add `llms-full.txt` entries to `had.py`, `had_pretests.py`, and `practitioner.py` blocks. Patch-level (additive on stable surfaces). 26 new tests (16 in `tests/test_practitioner.py::TestHADDispatch` + 9 in `tests/test_guides.py::TestLLMsFullHADCoverage` + 1 fixture-minimality regression locking the "handlers are STRING-ONLY at runtime" stability invariant). Closes the Phase 5 "agent surfaces" gap; T21 pretest tutorial and T22 weighted/survey tutorial remain queued as separate notebook PRs. ## [3.3.2] - 2026-04-26 diff --git a/diff_diff/guides/llms-full.txt b/diff_diff/guides/llms-full.txt index 65f84ae5..e2a2e2d3 100644 --- a/diff_diff/guides/llms-full.txt +++ b/diff_diff/guides/llms-full.txt @@ -592,17 +592,19 @@ results.print_summary() ### HeterogeneousAdoptionDiD -HeterogeneousAdoption DiD estimator (de Chaisemartin, Ciccia, D'Haultfœuille & Knau 2026). 
Targets a Weighted Average Slope (WAS) on **Heterogeneous Adoption Designs where no unit remains untreated** — every unit receives the treatment at some positive dose level, so the comparison structure comes from dose variation across units rather than from an untreated holdout. Treatment varies in intensity, not in status. Uses a bias-corrected local-linear estimator at the dose support boundary on continuous-dose designs (Design 1' / Design 1) and a 2SLS Wald-IV estimator on the mass-point design.
+Heterogeneous Adoption DiD estimator (de Chaisemartin, Ciccia, D'Haultfœuille & Knau 2026). Targets a Weighted Average Slope (WAS) at the dose support boundary on **Heterogeneous Adoption Designs** — designs where treatment varies in dose intensity across units. Comparison comes from dose variation across units. The estimator does NOT require dropping never-treated units: a small share of never-treated units is fully compatible (paper edge case — Garrett et al. 2020 retained 12 untreated counties out of 2,954), and on staggered event-study panels never-treated units are explicitly retained as the untreated-group comparison (paper Appendix B.2). Uses a bias-corrected local-linear estimator at the dose support boundary on continuous-dose designs (Design 1' / Design 1) and a 2SLS Wald-IV estimator on the mass-point design.
```python HeterogeneousAdoptionDiD( - design: str = "auto", # "auto" / "continuous_at_zero" / "continuous_near_d_lower" / "mass_point" + design: str = "auto", # "auto" / "continuous_at_zero" / "continuous_near_d_lower" / "mass_point" + d_lower: float | None = None, # Support infimum; auto-detected when None + kernel: str = "epanechnikov", # Local-linear kernel alpha: float = 0.05, - n_bootstrap: int = 999, # Multiplier-bootstrap iterations for sup-t bands + vcov_type: str | None = None, # Mass-point only: "classical" (default) or "hc1" + robust: bool = False, # Mass-point only: HC1 robust SE shorthand + cluster: str | None = None, # Mass-point only: cluster column for CR1 cluster-robust SE + n_bootstrap: int = 999, # Multiplier-bootstrap iterations for sup-t bands (event-study + weighted) seed: int | None = None, - h: float | None = None, # Bias-corrected local-linear bandwidth (auto-selected if None) - b: float | None = None, # Pilot bandwidth (auto-selected if None) - rcond: float | None = None, ) ``` @@ -614,14 +616,17 @@ HeterogeneousAdoptionDiD( had.fit( data: pd.DataFrame, outcome_col: str, - unit_col: str, - time_col: str, dose_col: str, + time_col: str, + unit_col: str, first_treat_col: str | None = None, # Required on staggered panels (last-cohort auto-filter trigger) - aggregate: str | None = None, # None (single scalar WAS) or "event_study" (per-horizon WAS) + aggregate: str = "overall", # "overall" (single scalar WAS) or "event_study" (per-horizon WAS) + survey: SurveyDesign | None = None, # DEPRECATED alias of survey_design= + weights: np.ndarray | None = None, # DEPRECATED pweight shortcut alias cband: bool = True, # Simultaneous (sup-t) confidence bands on weighted event-study fits - survey_design: SurveyDesign | None = None, # Survey weights, strata, PSU, FPC - weights: np.ndarray | None = None, # pweight shortcut (mutually exclusive with survey_design) + *, + survey_design: SurveyDesign | None = None, # Canonical survey-design kwarg (weights, 
strata, PSU, FPC) + trends_lin: bool = False, # Eq 17 linear-trend detrending (event-study; mutually exclusive with survey_design) ) -> HeterogeneousAdoptionDiDResults | HeterogeneousAdoptionDiDEventStudyResults ``` @@ -636,7 +641,7 @@ report = did_had_pretest_workflow( dose_col='d', first_treat_col='first_treat') print(report.summary()) -# Single-period scalar WAS: +# Single-period scalar WAS (aggregate="overall" default): est = HeterogeneousAdoptionDiD() results = est.fit(data, outcome_col='y', unit_col='unit', time_col='t', dose_col='d', @@ -1887,8 +1892,8 @@ DIFF_DIFF_BACKEND=rust pytest # Force Rust (fail if unavailable) | Staggered treatment timing | `CallawaySantAnna`, `ImputationDiD`, or `SunAbraham` | | Few treated units / synthetic control | `SyntheticDiD` | | Interactive fixed effects / factor confounding | `TROP` | -| Continuous treatment intensity (some units untreated) | `ContinuousDiD` | -| No untreated unit / universal rollout (every unit treated at different doses) | `HeterogeneousAdoptionDiD` | +| Continuous treatment intensity, per-dose ATT(d) / ACRT(d) (requires never-treated controls) | `ContinuousDiD` | +| Continuous treatment intensity, WAS at dose support boundary (compatible with universal rollout or small never-treated share) | `HeterogeneousAdoptionDiD` | | Two-criterion treatment, simultaneous (2x2x2 DDD) | `TripleDifference` | | Two-criterion treatment, staggered timing + eligibility | `StaggeredTripleDifference` | | Nonlinear outcome (binary/count) with staggered timing | `WooldridgeDiD` | diff --git a/diff_diff/practitioner.py b/diff_diff/practitioner.py index 5b942454..0cfb9876 100644 --- a/diff_diff/practitioner.py +++ b/diff_diff/practitioner.py @@ -876,28 +876,31 @@ def _handle_had(results: Any): ), _step( baker_step=4, - label="Switch to ContinuousDiD or CallawaySantAnna if untreated units exist", + label="Confirm WAS is the target estimand (vs ATT(d) for ContinuousDiD)", why=( - "HAD targets the no-untreated-unit case where 
every unit " - "is treated at some positive dose. If your panel actually " - "contains units with D = 0 (genuinely untreated), HAD's " - "WAS divisor under-weights the never-treated subset and a " - "different estimator is correct: ContinuousDiD for " - "dose-response on data with untreated controls, or " - "CallawaySantAnna for binary-staggered timing." + "HAD targets WAS (Weighted Average Slope) at the dose " + "support boundary. If you specifically want per-dose " + "ATT(d) / ACRT(d) dose-response curves AND your panel " + "has never-treated controls (units with first_treat == 0), " + "ContinuousDiD is the alternative — different estimand, " + "and ContinuousDiD's identification requires never-treated " + "controls. HAD itself remains valid even with a small " + "share of never-treated units (paper compatibility; see " + "REGISTRY § HeterogeneousAdoptionDiD edge cases — " + "Garrett et al. 2020 retained 12 untreated counties out " + "of 2,954). The choice is about estimand, not about " + "whether untreated units exist." 
), code=( - "# Check for untreated units:\n" - "if (data['first_treat'] == 0).any():\n" - " # Untreated units exist - switch to ContinuousDiD:\n" - " from diff_diff import ContinuousDiD\n" - " cdid = ContinuousDiD()\n" - " cdid_results = cdid.fit(\n" - " data, outcome='y', unit='unit', time='t',\n" - " first_treat='first_treat', dose='d')\n" - " # Or CallawaySantAnna for binary-staggered timing:\n" - " # from diff_diff import CallawaySantAnna\n" - " # cs = CallawaySantAnna(control_group='never_treated')" + "# HAD reports WAS at the dose support boundary.\n" + "# If you instead want per-dose ATT(d)/ACRT(d) dose-response\n" + "# curves AND the panel has never-treated controls:\n" + "from diff_diff import ContinuousDiD\n" + "cdid = ContinuousDiD()\n" + "cdid_results = cdid.fit(\n" + " data, outcome='y', unit='unit', time='t',\n" + " first_treat='first_treat', dose='d',\n" + " aggregate='dose')" ), step_name="estimator_selection", ), @@ -910,13 +913,12 @@ def _handle_had(results: Any): "for the bias-corrected local-linear estimator. Bandwidth " "choice affects WAS - verify the selector landed on a " "viable bandwidth (not boundary-clipped or near-degenerate). " - "result.bandwidth_diagnostics is None on the mass_point " + "results.bandwidth_diagnostics is None on the mass_point " "design (parametric, no bandwidth)." ), code=( "# Inspect the auto-selected bandwidths:\n" - "result.bandwidth_diagnostics # None on mass_point\n" - "# Re-fit with explicit h= / b= to test sensitivity" + "results.bandwidth_diagnostics # None on mass_point" ), priority="medium", step_name="sensitivity", @@ -1005,23 +1007,29 @@ def _handle_had_event_study(results: Any): ), _step( baker_step=4, - label="Switch to ContinuousDiD or CallawaySantAnna if untreated units exist", + label="Confirm WAS is the target estimand (vs ATT(d) for ContinuousDiD)", why=( - "HAD targets the no-untreated-unit case. 
If your panel " - "contains units with D = 0, switch to " - "ContinuousDiD(aggregate='eventstudy') for dose-response " - "event study with untreated controls, or CallawaySantAnna " - "with aggregate='event_study' for binary-staggered timing." + "HAD targets per-event-time WAS at the dose support " + "boundary. If you instead want per-dose ATT(d) / ACRT(d) " + "dose-response curves AND your panel has never-treated " + "controls, ContinuousDiD(aggregate='eventstudy') is the " + "alternative — different estimand, and it requires " + "never-treated controls. " + "HAD itself remains valid even with a small share of " + "never-treated units (paper compatibility); on staggered " + "panels HAD's last-cohort filter explicitly RETAINS " + "never-treated units as the untreated-group comparison " + "(paper Appendix B.2). The choice is about estimand." ), code=( - "# Check for untreated units:\n" - "if (data['first_treat'] == 0).any():\n" - " from diff_diff import ContinuousDiD\n" - " cdid = ContinuousDiD()\n" - " es = cdid.fit(\n" - " data, outcome='y', unit='unit', time='t',\n" - " first_treat='first_treat', dose='d',\n" - " aggregate='eventstudy')" + "# HAD reports per-event-time WAS at the dose boundary.\n" + "# If you instead want per-dose ATT(d)/ACRT(d) event-study\n" + "# curves AND the panel has never-treated controls:\n" + "from diff_diff import ContinuousDiD\n" + "cdid = ContinuousDiD()\n" + "cdid_es = cdid.fit(\n" + " data, outcome='y', unit='unit', time='t',\n" + " first_treat='first_treat', dose='d',\n" + " aggregate='eventstudy')" ), step_name="estimator_selection", ), @@ -1033,18 +1041,21 @@ def _handle_had_event_study(results: Any): "as a joint pattern. On weighted fits (survey_design= or " "weights=), fit(cband=True) constructs simultaneous (sup-t) " "bands across horizons via multiplier bootstrap. " - "result.cband_low / cband_high give the band endpoints; " - "cband_crit_value reports the sup-t critical value used." 
+ "results.cband_low / results.cband_high give the band " + "endpoints; results.cband_crit_value reports the sup-t " + "critical value used." ), code=( - "from diff_diff import HeterogeneousAdoptionDiD\n" + "from diff_diff import HeterogeneousAdoptionDiD, SurveyDesign\n" + "# Construct your survey design (adapt to your data):\n" + "sd = SurveyDesign(weights='weight_col')\n" "est = HeterogeneousAdoptionDiD(n_bootstrap=999, seed=42)\n" "es = est.fit(\n" " data, outcome_col='y', unit_col='unit',\n" " time_col='t', dose_col='d',\n" " first_treat_col='first_treat',\n" " aggregate='event_study',\n" - " survey_design=design, cband=True)\n" + " survey_design=sd, cband=True)\n" "es.cband_low, es.cband_high # simultaneous band endpoints" ), priority="medium", diff --git a/tests/test_guides.py b/tests/test_guides.py index 89651dd4..bc3609a9 100644 --- a/tests/test_guides.py +++ b/tests/test_guides.py @@ -308,30 +308,101 @@ def test_llms_full_had_pretests_section(self): def test_llms_full_had_choosing_row(self): text = get_llm_guide("full") # The Choosing-an-Estimator table must list HAD with a row that - # names either "no untreated" or "universal rollout" framing. - # Find the Choosing section and check within it. + # accurately reflects the contract: HAD targets WAS at the dose + # support boundary and is compatible with universal-rollout + # panels (and panels with a small never-treated share — paper + # edge case at REGISTRY § HeterogeneousAdoptionDiD edge cases). idx = text.index("## Choosing an Estimator") choosing = text[idx:] assert "HeterogeneousAdoptionDiD" in choosing - assert ("no untreated" in choosing.lower()) or ("universal rollout" in choosing.lower()) - - def test_llms_full_had_framing_no_comparison_group(self): - # Per feedback_had_framing_precision: HAD's design absence is - # "no untreated unit" — comparison comes from dose variation - # across units. The phrases "no comparison group" and - # "missing comparison" must NOT appear in the HAD section. 
+ # Row must mention WAS as the estimand differentiator (not a + # blanket "if untreated → not HAD" rule which would be wrong + # per registry). + assert "WAS" in choosing + + def test_llms_full_had_section_methodology_compatible_with_untreated(self): + # Per docs/methodology/REGISTRY.md HeterogeneousAdoptionDiD edge + # cases (line ~2403): "Authors do NOT require untreated units + # to be dropped" and (line ~2408) the staggered event-study path + # explicitly RETAINS never-treated units. The HAD section must + # NOT carry framing that says HAD is incompatible with + # never-treated / untreated units. text = get_llm_guide("full") had_start = text.index("### HeterogeneousAdoptionDiD") - # Find the next top-level or H3 boundary that is NOT another HAD - # section to scope the assertion to HAD-specific content. The - # HAD estimator section is followed by ### StackedDiD; the - # results-class section ends at ### TROPResults. We check the - # estimator section text only (most likely place for framing - # drift). had_end = text.index("### StackedDiD", had_start) had_text = text[had_start:had_end].lower() + # Negative assertions on framing that contradicts the registry. assert "no comparison group" not in had_text assert "missing comparison" not in had_text + forbidden_phrases = ( + "no never-treated units", + "requires no untreated", + "drop untreated", + "must not contain untreated", + "not compatible with untreated", + ) + for phrase in forbidden_phrases: + assert phrase not in had_text, ( + f"HAD section must not carry the phrase {phrase!r}: " + f"per REGISTRY § HeterogeneousAdoptionDiD edge cases, " + f"HAD is compatible with a small share of never-treated " + f"units and explicitly retains them on staggered " + f"event-study panels (Appendix B.2)." + ) + + def test_llms_full_had_constructor_signature_matches_real_api(self): + # Documented constructor parameter list must align with the + # actual HeterogeneousAdoptionDiD.__init__ signature. 
Catches + # the failure mode where the guide invents kwargs that don't + # exist (h, b, rcond) or omits real ones (d_lower, kernel, + # vcov_type, robust, cluster). + import inspect + + from diff_diff import HeterogeneousAdoptionDiD + + sig_params = set(inspect.signature(HeterogeneousAdoptionDiD.__init__).parameters) + sig_params.discard("self") + text = get_llm_guide("full") + had_start = text.index("### HeterogeneousAdoptionDiD") + had_end = text.index("### StackedDiD", had_start) + had_text = text[had_start:had_end] + block_start = had_text.index("HeterogeneousAdoptionDiD(") + # Multi-line signature ends with "\n)" — close-paren on its own + # line. Searching for ")" alone would hit close-parens inside + # parameter comments (e.g. "(default)"). + block_end = had_text.index("\n)", block_start) + ctor_block = had_text[block_start:block_end] + for param in sig_params: + assert f"{param}:" in ctor_block or f"{param} " in ctor_block, ( + f"Constructor block in the HAD guide section is missing " + f"the real public parameter {param!r}. The guide must " + f"document the actual HeterogeneousAdoptionDiD.__init__ " + f"signature." + ) + + def test_llms_full_had_fit_signature_matches_real_api(self): + # Documented fit() parameter list must align with the actual + # HeterogeneousAdoptionDiD.fit signature. + import inspect + + from diff_diff import HeterogeneousAdoptionDiD + + sig_params = set(inspect.signature(HeterogeneousAdoptionDiD.fit).parameters) + sig_params.discard("self") + text = get_llm_guide("full") + had_start = text.index("### HeterogeneousAdoptionDiD") + had_end = text.index("### StackedDiD", had_start) + had_text = text[had_start:had_end] + block_start = had_text.index("had.fit(") + block_end = had_text.index(") -> ", block_start) + fit_block = had_text[block_start:block_end] + for param in sig_params: + assert f"{param}:" in fit_block or f"{param} " in fit_block, ( + f"fit() block in the HAD guide section is missing the " + f"real public parameter {param!r}. 
The guide must " + f"document the actual HeterogeneousAdoptionDiD.fit " + f"signature." + ) def test_llms_full_paper_citation(self): # Lead-author "D'Haultfœuille" appears in the HAD section. diff --git a/tests/test_practitioner.py b/tests/test_practitioner.py index 18eb2d76..a993dc52 100644 --- a/tests/test_practitioner.py +++ b/tests/test_practitioner.py @@ -568,13 +568,54 @@ def test_had_partial_nan_array_no_warning(self, mock_had_event_study_results_par def test_had_step_4_estimator_selection_present( self, mock_had_results, mock_had_event_study_results ): + # Step-4 must surface the WAS-vs-ATT(d) estimand difference (not + # a blanket "if untreated → not HAD" rule which would contradict + # REGISTRY § HeterogeneousAdoptionDiD edge cases lines ~2403/2408). for fixture in (mock_had_results, mock_had_event_study_results): output = practitioner_next_steps(fixture, verbose=False) step_4_steps = [s for s in output["next_steps"] if s["baker_step"] == 4] assert len(step_4_steps) >= 1 - all_text = " ".join((s.get("code", "") + " " + s.get("why", "")) for s in step_4_steps) + all_text = " ".join( + (s.get("code", "") + " " + s.get("why", "") + " " + s.get("label", "")) + for s in step_4_steps + ) + # Routing nudge must name ContinuousDiD as the estimand + # alternative; framing must center on WAS vs ATT(d) (the + # actual estimand differentiator), NOT on whether untreated + # units exist. assert "ContinuousDiD" in all_text - assert "CallawaySantAnna" in all_text + assert "WAS" in all_text + assert "ATT(d)" in all_text + + def test_had_step_4_does_not_misframe_untreated_unit_routing( + self, mock_had_results, mock_had_event_study_results + ): + # Per REGISTRY: HAD is compatible with a small share of + # never-treated units (paper edge case), and on staggered + # event-study panels never-treated units are explicitly RETAINED + # (Appendix B.2 / had.py:1325). The Step-4 routing must NOT + # carry the wrong "if untreated → not HAD" framing. 
+ for fixture in (mock_had_results, mock_had_event_study_results): + output = practitioner_next_steps(fixture, verbose=False) + step_4_steps = [s for s in output["next_steps"] if s["baker_step"] == 4] + all_text = " ".join( + (s.get("code", "") + " " + s.get("why", "") + " " + s.get("label", "")) + for s in step_4_steps + ).lower() + forbidden_phrases = ( + "switch away from had", + "had's was divisor under-weights", + "drop untreated", + "must drop never-treated", + ) + for phrase in forbidden_phrases: + assert phrase not in all_text, ( + f"HAD Step-4 must not carry the phrase {phrase!r}: " + f"per REGISTRY § HeterogeneousAdoptionDiD edge cases, " + f"HAD is compatible with a small share of never-treated " + f"units and explicitly retains them on staggered " + f"event-study panels." + ) def test_handle_continuous_step_4_routes_to_had(self, mock_continuous_results): # Symmetric pair: ContinuousDiD users with no untreated units @@ -611,3 +652,40 @@ def test_had_handlers_string_only_no_attribute_reads( output = practitioner_next_steps(fixture, verbose=False) assert isinstance(output, dict) assert "next_steps" in output + + def test_had_handler_snippets_are_valid_python_syntax( + self, mock_had_results, mock_had_event_study_results + ): + # Snippet smoke test: every code block emitted by the HAD + # handlers must parse as valid Python. Catches the failure mode + # where snippets reference undefined names with placeholder + # syntax that doesn't compile (e.g. `survey_design=design` with + # no `design` defined in scope, or attribute typos that break + # copy/paste). 
+ import ast + + for fixture in (mock_had_results, mock_had_event_study_results): + output = practitioner_next_steps(fixture, verbose=False) + for step in output["next_steps"]: + code = step.get("code", "") + if not code.strip(): + continue + try: + ast.parse(code) + except SyntaxError as e: + pytest.fail( + f"Step {step['baker_step']} ({step['label']!r}) " + f"emits a code snippet that does not parse as " + f"valid Python: {e}\n\nSnippet:\n{code}" + ) + + def test_handle_continuous_step_4_snippet_is_valid_python(self, mock_continuous_results): + # Same syntax check on the symmetric Step-4 in _handle_continuous. + import ast + + output = practitioner_next_steps(mock_continuous_results, verbose=False) + step_4_steps = [s for s in output["next_steps"] if s["baker_step"] == 4] + for step in step_4_steps: + code = step.get("code", "") + if code.strip(): + ast.parse(code) # raises SyntaxError on failure From ed2842a83881f1ffdc050862159357fdd47578f8 Mon Sep 17 00:00:00 2001 From: igerber Date: Sat, 9 May 2026 11:09:38 -0400 Subject: [PATCH 3/9] Address PR #402 R2 review (1 P1, 1 P2) P1 pretest assumption labels: _handle_had step-3 + llms-full.txt HAD Pretests section + qug_test/stute_test bullets misstated which paper assumption each shipped test actually tests: - qug_test was labeled "Assumption 5 support condition", but QUG tests H_0: d_lower = 0 (paper Theorem 4 / step 1 of the workflow). Assumption 5 is the Design 1 sign-identification condition and is NOT testable via pre-trends per REGISTRY.md:2270. - stute_test was labeled "Assumption 7 mean-independence", but stute_test is the Assumption 8 linearity test (paper Section 4.2 step 3 / Appendix D). Assumption 7 is pre-trends (step 2). - did_had_pretest_workflow(aggregate="overall") was implied to cover step 2, but the workflow runs steps 1 + 3 only - step 2 is explicitly not covered on the overall path (had_pretests.py:4434-4441 + the workflow's verdict flags the gap). 
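The corrected step-to-assumption mapping can be summarized in a self-contained sketch; the step numbers, test names, and the overall-path coverage rule below are taken from this commit message, not from the shipped `had_pretests.py` API:

```python
# Hypothetical summary of the paper Section 4.2 battery as relabeled
# by this commit (illustrative table, not library code):
BATTERY = {
    1: ("qug_test", "Theorem 4 support-infimum test, H0: d_lower = 0"),
    2: ("joint_pretrends_test", "Assumption 7 pre-trends (event-study path only)"),
    3: ("stute_test / yatchew_hr_test", "Assumption 8 linearity of E[dY|D]"),
}

def covered_steps(aggregate: str) -> tuple:
    # Workflow contract per this commit: the overall path runs steps
    # 1 + 3 only, and its verdict flags the Assumption 7 / step 2 gap.
    return (1, 3) if aggregate == "overall" else (1, 2, 3)

print(covered_steps("overall"))      # (1, 3)
print(covered_steps("event_study"))  # (1, 2, 3)
```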
Rewrote both surfaces to match the actual contracts: QUG = paper Theorem 4 support-infimum test (step 1, decides Design 1' vs Design 1); Stute / Yatchew-HR = Assumption 8 linearity tests (step 3); Assumption 7 step 2 closure requires aggregate="event_study" (joint Stute pre-trends). Assumption 7 / step 2 gap is explicitly flagged on the overall path so agents do not assume coverage where there is none. P2 result-class field tables incomplete: HeterogeneousAdoptionDiDResults table was missing n_mass_point, n_above_d_lower, cluster_name, bias_corrected_fit, variance_formula, effective_dose_mean. HeterogeneousAdoptionDiDEventStudyResults table was missing vcov_type, cluster_name, bandwidth_diagnostics, bias_corrected_fit, filter_info. Added all missing fields with correct types and descriptions. Tests added (3 new, 86 total): - test_llms_full_had_results_class_field_lists_match_real_dataclass: uses dataclasses.fields() to enumerate every public field on both result classes and assert each appears in the documented table. Catches future drift where new fields land but the guide is not updated. - test_llms_full_had_pretests_assumption_labels_correct: scans the qug_test and stute_test bullets in the HAD Pretests section and enforces positive labels (support-infimum / Theorem 4 / linearity) + forbids positive Assumption-5 / Assumption-7 misclaims (negative disclaimers like "QUG does NOT test Assumption 5" remain allowed). - test_had_step_3_pretest_assumption_labels_correct: same checks on the practitioner.py _handle_had step-3 why-text; also requires positive acknowledgment of the Assumption 7 / step 2 gap on the overall workflow path. 
Co-Authored-By: Claude Opus 4.7 (1M context) --- diff_diff/guides/llms-full.txt | 48 ++++++++++----- diff_diff/practitioner.py | 23 ++++--- tests/test_guides.py | 107 +++++++++++++++++++++++++++++++++ tests/test_practitioner.py | 46 ++++++++++++++ 4 files changed, 201 insertions(+), 23 deletions(-) diff --git a/diff_diff/guides/llms-full.txt b/diff_diff/guides/llms-full.txt index e2a2e2d3..bc6184c5 100644 --- a/diff_diff/guides/llms-full.txt +++ b/diff_diff/guides/llms-full.txt @@ -1226,7 +1226,7 @@ Each event study effect dict contains: `effect`, `se`, `t_stat`, `p_value`, `con ### HeterogeneousAdoptionDiDResults -Single-period results container for `HeterogeneousAdoptionDiD`. +Single-period results container for `HeterogeneousAdoptionDiD`. The table below enumerates every public dataclass field; a regression test in `tests/test_guides.py` (`test_llms_full_had_results_class_field_lists_match_real_dataclass`) compares this list against the real `dataclasses.fields()` of the result class. | Attribute | Type | Description | |-----------|------|-------------| @@ -1243,16 +1243,22 @@ Single-period results container for `HeterogeneousAdoptionDiD`. 
| `n_obs` | `int` | Units contributing to estimation | | `n_treated` | `int` | Units with `D > d_lower` | | `n_control` | `int` | Units at or below `d_lower` | +| `n_mass_point` | `int | None` | Mass-point design only: units exactly at `d_lower`; `None` on continuous designs | +| `n_above_d_lower` | `int | None` | Mass-point design only: units strictly above `d_lower`; `None` on continuous designs | | `inference_method` | `str` | `"analytical_nonparametric"` or `"analytical_2sls"` | | `vcov_type` | `str | None` | Mass-point only: `"classical"`, `"hc1"`, or `"cr1"` | -| `bandwidth_diagnostics` | `BandwidthResult | None` | MSE-DPI selector output (continuous designs); `None` on `mass_point` | +| `cluster_name` | `str | None` | Cluster column name when CR1 cluster-robust SE is requested; `None` otherwise | | `survey_metadata` | `SurveyMetadata | None` | Repo-standard survey metadata when `survey_design=` / `weights=` is supplied | +| `bandwidth_diagnostics` | `BandwidthResult | None` | MSE-DPI selector output (continuous designs); `None` on `mass_point` | +| `bias_corrected_fit` | `BiasCorrectedFit | None` | Phase 1c bias-corrected local-linear fit object (continuous designs); `None` on `mass_point` | +| `variance_formula` | `str | None` | HAD-specific SE label on the weighted continuous path: `"pweight"` (CCT 2014 weighted-robust) or `"survey_binder_tsl"` (Binder 1983); `None` on unweighted / mass-point fits | +| `effective_dose_mean` | `float | None` | Weighted denominator used by the β̂-scale rescaling on the weighted continuous path; `None` on unweighted fits | **Methods:** `summary()`, `print_summary()`, `to_dict()`, `to_dataframe()` ### HeterogeneousAdoptionDiDEventStudyResults -Per-horizon event-study results container for `HeterogeneousAdoptionDiD` with `aggregate="event_study"`. The anchor horizon `e = -1` is excluded by construction. +Per-horizon event-study results container for `HeterogeneousAdoptionDiD` with `aggregate="event_study"`. 
The anchor horizon `e = -1` is excluded by construction. The table below enumerates every public dataclass field; a regression test (`test_llms_full_had_results_class_field_lists_match_real_dataclass`) compares this list against the real `dataclasses.fields()`. | Attribute | Type | Description | |-----------|------|-------------| @@ -1263,11 +1269,6 @@ Per-horizon event-study results container for `HeterogeneousAdoptionDiD` with `a | `p_value` | `np.ndarray` | Per-horizon p-values | | `conf_int_low` | `np.ndarray` | Pointwise CI lower bounds | | `conf_int_high` | `np.ndarray` | Pointwise CI upper bounds | -| `cband_low` | `np.ndarray | None` | Simultaneous (sup-t) band lower bounds; `None` on unweighted fits or when `cband=False` | -| `cband_high` | `np.ndarray | None` | Simultaneous (sup-t) band upper bounds | -| `cband_crit_value` | `float | None` | Sup-t critical value used for the simultaneous band | -| `cband_method` | `str | None` | `"multiplier_bootstrap"` when populated | -| `cband_n_bootstrap` | `int | None` | Bootstrap iterations used for the band | | `n_obs_per_horizon` | `np.ndarray` | Per-horizon contributing-unit counts | | `alpha` | `float` | CI level used at fit time | | `design` | `str` | Shared across horizons (paper Appendix B.2 invariant) | @@ -1277,9 +1278,19 @@ Per-horizon event-study results container for `HeterogeneousAdoptionDiD` with `a | `F` | `object` | First-treatment period label | | `n_units` | `int` | Unique units contributing to the fit (post last-cohort filter) | | `inference_method` | `str` | `"analytical_nonparametric"` or `"analytical_2sls"` | +| `vcov_type` | `str | None` | Mass-point only: `"classical"`, `"hc1"`, or `"cr1"`; `None` on continuous designs | +| `cluster_name` | `str | None` | Cluster column name when CR1 is requested; `None` otherwise | | `survey_metadata` | `SurveyMetadata | None` | Populated on weighted fits | +| `bandwidth_diagnostics` | `list[BandwidthResult | None] | None` | Per-horizon MSE-DPI selector 
output (continuous designs); `None` on `mass_point`; entries can be `None` on degenerate horizons | +| `bias_corrected_fit` | `list[BiasCorrectedFit | None] | None` | Per-horizon Phase 1c bias-corrected local-linear fit objects; `None` on `mass_point`; entries can be `None` on degenerate horizons | +| `filter_info` | `dict | None` | Staggered last-cohort auto-filter metadata (`F_last`, `n_kept`, `n_dropped`, `dropped_cohorts`); `None` when no filter applied | | `variance_formula` | `str | None` | Per-horizon variance family label | | `effective_dose_mean` | `float | None` | Weighted denominator | +| `cband_low` | `np.ndarray | None` | Simultaneous (sup-t) band lower bounds; `None` on unweighted fits or when `cband=False` | +| `cband_high` | `np.ndarray | None` | Simultaneous (sup-t) band upper bounds | +| `cband_crit_value` | `float | None` | Sup-t critical value used for the simultaneous band | +| `cband_method` | `str | None` | `"multiplier_bootstrap"` when populated | +| `cband_n_bootstrap` | `int | None` | Bootstrap iterations used for the band | **Methods:** `summary()`, `print_summary()`, `to_dict()`, `to_dataframe()` @@ -1393,7 +1404,7 @@ results = did.fit(data, outcome='y', treatment='treated', time='post') ## HAD Pretests -Diagnostic pretests for the `HeterogeneousAdoptionDiD` identifying assumptions (de Chaisemartin, Ciccia, D'Haultfœuille & Knau 2026). The composite workflow `did_had_pretest_workflow` is the recommended entry point — call it before reporting WAS as causal. +Diagnostic pretests for the `HeterogeneousAdoptionDiD` identifying assumptions (de Chaisemartin, Ciccia, D'Haultfœuille & Knau 2026). The composite workflow `did_had_pretest_workflow` is the recommended entry point — call it before reporting WAS as causal. 
The workflow follows paper Section 4.2's three-step battery: **step 1** is the QUG support-infimum test (decides whether Design 1' or Design 1 applies); **step 2** is the Assumption 7 pre-trends test (joint Stute on the event-study path; explicitly NOT covered on the overall path because a single-pre-period panel cannot support the joint variant); **step 3** is the Assumption 8 linearity test (`stute_test` or `yatchew_hr_test`). On the default `aggregate="overall"` path the workflow runs steps 1 + 3 only and the returned `verdict` flags the Assumption 7 gap; pass `aggregate="event_study"` on a multi-period panel to close that gap. ```python from diff_diff import ( @@ -1402,11 +1413,16 @@ from diff_diff import ( stute_joint_pretest, joint_pretrends_test, joint_homogeneity_test, ) -# Composite workflow — bundles QUG + Stute + Yatchew per the paper's three-step battery +# Composite workflow: +# aggregate="overall" -> steps 1 + 3 (QUG + Assumption 8 linearity) +# step 2 (Assumption 7 pre-trends) NOT covered; +# verdict explicitly flags this gap. +# aggregate="event_study" -> steps 1 + 2 + 3 (QUG + joint Stute pre-trends + +# joint homogeneity-linearity Stute) on multi-period panels. report = did_had_pretest_workflow( data, outcome_col='y', unit_col='unit', time_col='t', dose_col='d', first_treat_col='first_treat', - aggregate='overall', # or 'event_study' for joint Stute on multi-period panels + aggregate='overall', survey_design=None) # SurveyDesign for survey-aware pretests (Phase 4.5 C) print(report.summary()) print(report.all_pass, report.verdict) @@ -1414,12 +1430,12 @@ print(report.all_pass, report.verdict) Individual tests: -- `qug_test(d)` — Assumption 5 support condition. Extreme order statistics, Exp(1)/Exp(1) limit law. **Permanently rejects** non-`None` `survey_design=` / `weights=` (`NotImplementedError`) per Phase 4.5 C0 deferral — extreme-value functionals are not smooth in the empirical CDF, so standard survey machinery does not yield a calibrated test. 
-- `stute_test(d, dy)` — Assumption 7 mean-independence of trends via Cramér-von Mises functional with Mammen wild bootstrap. Survey-aware via PSU-level Mammen multiplier bootstrap. -- `yatchew_hr_test(d, dy, *, null="linearity")` — Assumption 8 linearity of `E[ΔY|D]` via Yatchew (1997) heteroskedasticity-robust variance-ratio test. The `null="mean_independence"` mode (R `YatchewTest::yatchew_test(order=0)`) is also exposed for placebo-style mean-independence testing. Survey-aware via closed-form weighted variance components (no bootstrap). +- `qug_test(d)` — paper Theorem 4 support-infimum test (`H_0: d_lower = 0`; the QUG decides whether Design 1' or Design 1 applies in step 1 of the workflow). Extreme order statistics, Exp(1)/Exp(1) limit law. The QUG itself does NOT test Assumption 5 (which is the Design 1 sign-identification condition and is not testable via pre-trends per registry). **Permanently rejects** non-`None` `survey_design=` / `weights=` (`NotImplementedError`) per Phase 4.5 C0 deferral — extreme-value functionals are not smooth in the empirical CDF, so standard survey machinery does not yield a calibrated test. +- `stute_test(d, dy)` — Assumption 8 linearity of `E[ΔY|D]` (paper Section 4.2 step 3) via Stute Cramér-von Mises functional with Mammen wild bootstrap. Survey-aware via PSU-level Mammen multiplier bootstrap. +- `yatchew_hr_test(d, dy, *, null="linearity")` — Assumption 8 linearity of `E[ΔY|D]` (alternative test for step 3) via Yatchew (1997) heteroskedasticity-robust variance-ratio test. The `null="mean_independence"` mode (R `YatchewTest::yatchew_test(order=0)`) is also exposed for placebo-style mean-independence testing. Survey-aware via closed-form weighted variance components (no bootstrap). - `stute_joint_pretest(residuals_dict, d)` — joint Cramér-von Mises across K horizons with shared-η Mammen wild bootstrap (Delgado-Manteiga 2001 / Hlávka-Hušková 2020). 
Residuals-in core; the two data-in wrappers below construct residuals for the two paper-spelled nulls. -- `joint_pretrends_test(...)` — joint pre-trends on K pre-periods (paper Section 4.2 step 2 closure on the event-study path). -- `joint_homogeneity_test(...)` — joint linearity-and-homogeneity on K post-periods. +- `joint_pretrends_test(...)` — Assumption 7 joint pre-trends on K pre-periods (paper Section 4.2 step 2 closure on the event-study path). +- `joint_homogeneity_test(...)` — joint linearity-and-homogeneity on K post-periods (event-study step 3 alternative). The QUG-under-survey deferral is permanent; the linearity-family pretests support `survey_design=` (pweight, PSU, FPC) per Phase 4.5 C. Stratified designs and replicate-weight designs are deferred to follow-up PRs. diff --git a/diff_diff/practitioner.py b/diff_diff/practitioner.py index 0cfb9876..d468cc86 100644 --- a/diff_diff/practitioner.py +++ b/diff_diff/practitioner.py @@ -857,12 +857,18 @@ def _handle_had(results: Any): baker_step=3, label="Run the HAD pretest battery", why=( - "did_had_pretest_workflow bundles the paper's three " - "testable identifying assumptions: QUG (Assumption 5 " - "support condition), Stute (Assumption 7 mean-independence " - "of trends), and Yatchew-HR (Assumption 8 linearity of " - "E[ΔY|D]). Assumption 5/6 boundary continuity is not " - "testable - the workflow vets what can be vetted." + "On a two-period panel did_had_pretest_workflow runs " + "paper Section 4.2 step 1 (QUG support-infimum test - " + "decides Design 1' vs Design 1) and step 3 (Stute / " + "Yatchew-HR Assumption 8 linearity tests). Step 2 " + "(Assumption 7 pre-trends) is NOT covered on the overall " + "path - a single pre-period cannot support the joint " + "Stute variant - and the returned verdict explicitly " + "flags that gap. To close step 2, refit on a multi-period " + "panel with aggregate='event_study'. 
Assumptions 3 / 5 / 6 " + "(uniform continuity at the boundary, Design 1 sign / " + "WAS_d_lower identification) are NOT testable via " + "pre-trends - the workflow vets only what can be vetted." ), code=( "from diff_diff import did_had_pretest_workflow\n" @@ -870,7 +876,10 @@ def _handle_had(results: Any): " data, outcome_col='y', unit_col='unit',\n" " time_col='t', dose_col='d',\n" " first_treat_col='first_treat')\n" - "print(report.summary())" + "print(report.summary())\n" + "# verdict explicitly flags the Assumption 7 gap on the\n" + "# overall path; aggregate='event_study' on a multi-period\n" + "# panel adds joint Stute pre-trends + joint homogeneity-linearity." ), step_name="parallel_trends", ), diff --git a/tests/test_guides.py b/tests/test_guides.py index bc3609a9..57752bf9 100644 --- a/tests/test_guides.py +++ b/tests/test_guides.py @@ -413,3 +413,110 @@ def test_llms_full_paper_citation(self): had_end = text.index("### StackedDiD", had_start) had_text = text[had_start:had_end] assert "D'Haultfœuille" in had_text + + def test_llms_full_had_results_class_field_lists_match_real_dataclass(self): + # Every public dataclass field on HeterogeneousAdoptionDiDResults + # and HeterogeneousAdoptionDiDEventStudyResults must appear in the + # documented field table. Catches the failure mode where new + # result fields land but the guide isn't updated, so agents + # treating llms-full.txt as the authoritative surface miss + # available diagnostics / metadata. 
+ import dataclasses + + from diff_diff import ( + HeterogeneousAdoptionDiDEventStudyResults, + HeterogeneousAdoptionDiDResults, + ) + + text = get_llm_guide("full") + + # Single-period result class + sp_start = text.index("### HeterogeneousAdoptionDiDResults") + sp_end = text.index("### HeterogeneousAdoptionDiDEventStudyResults", sp_start) + sp_block = text[sp_start:sp_end] + for field in dataclasses.fields(HeterogeneousAdoptionDiDResults): + assert f"`{field.name}`" in sp_block, ( + f"HeterogeneousAdoptionDiDResults guide block is missing " + f"the public dataclass field {field.name!r}. The table " + f"must enumerate every field so agents see all available " + f"diagnostics / metadata." + ) + + # Event-study result class + es_start = text.index("### HeterogeneousAdoptionDiDEventStudyResults") + es_end = text.index("### TROPResults", es_start) + es_block = text[es_start:es_end] + for field in dataclasses.fields(HeterogeneousAdoptionDiDEventStudyResults): + assert f"`{field.name}`" in es_block, ( + f"HeterogeneousAdoptionDiDEventStudyResults guide block " + f"is missing the public dataclass field {field.name!r}." + ) + + def test_llms_full_had_pretests_assumption_labels_correct(self): + # Per docs/methodology/REGISTRY.md HeterogeneousAdoptionDiD + # § "Assumptions / Theorems / Estimators": + # - Assumption 5 = Design 1 sign identification (NOT testable) + # - Assumption 6 = Design 1 WAS_d_lower identification (NOT testable) + # - Assumption 7 = pre-trends (paper Section 4.2 step 2) + # - Assumption 8 = linearity (paper Section 4.2 step 3) + # The HAD Pretests section must NOT mislabel these: + # - qug_test is the support-infimum test (H0: d_lower = 0), + # NOT "Assumption 5" (which is non-testable per registry). + # - stute_test is Assumption 8 (linearity), NOT Assumption 7. 
+ text = get_llm_guide("full") + pretests_start = text.index("## HAD Pretests") + pretests_end = text.index("## Honest DiD", pretests_start) + pretests_block = text[pretests_start:pretests_end] + # qug_test bullet: must positively label QUG as a support-infimum + # test, NOT as a positive "Assumption 5 support condition" claim + # (a negative disclaimer "does NOT test Assumption 5" is fine). + forbidden_qug_positive_claims = ( + "Assumption 5 support condition", + "QUG (Assumption 5", + "qug_test`) — Assumption 5", + "qug_test(d)` — Assumption 5", + ) + # stute_test bullet: must positively label as Assumption 8 + # linearity, NOT as Assumption 7 mean-independence. + forbidden_stute_positive_claims = ( + "stute_test(d, dy)` — Assumption 7", + "Stute (Assumption 7", + "Assumption 7 mean-independence", + ) + for line in pretests_block.splitlines(): + if line.startswith("- `qug_test"): + # Positive claim of what QUG IS: + assert ( + "support-infimum" in line + or "support infimum" in line + or "Theorem 4" in line + or "H_0: d_lower" in line + ), ( + f"qug_test bullet must positively label QUG as the " + f"support-infimum / Theorem-4 test. Line: {line!r}" + ) + for phrase in forbidden_qug_positive_claims: + assert phrase not in line, ( + f"qug_test bullet must not positively claim QUG " + f"is an 'Assumption 5' test ({phrase!r}). QUG " + f"tests H_0: d_lower = 0; Assumption 5 is the " + f"Design 1 sign-identification condition (NOT " + f"testable per registry). A negative disclaimer " + f"that QUG does NOT test Assumption 5 is fine. " + f"Line: {line!r}" + ) + if line.startswith("- `stute_test"): + # Positive claim of what Stute IS: + assert "Assumption 8" in line or "linearity" in line.lower(), ( + f"stute_test bullet must positively label as " + f"Assumption 8 / linearity test. 
Line: {line!r}" + ) + for phrase in forbidden_stute_positive_claims: + assert phrase not in line, ( + f"stute_test bullet must not positively claim " + f"Stute is an Assumption 7 mean-independence " + f"test ({phrase!r}). stute_test is Assumption 8 " + f"linearity (paper Section 4.2 step 3); " + f"Assumption 7 is pre-trends (step 2, only " + f"covered on the event-study path). Line: {line!r}" + ) diff --git a/tests/test_practitioner.py b/tests/test_practitioner.py index a993dc52..298ba8cb 100644 --- a/tests/test_practitioner.py +++ b/tests/test_practitioner.py @@ -689,3 +689,49 @@ def test_handle_continuous_step_4_snippet_is_valid_python(self, mock_continuous_ code = step.get("code", "") if code.strip(): ast.parse(code) # raises SyntaxError on failure + + def test_had_step_3_pretest_assumption_labels_correct(self, mock_had_results): + # Per docs/methodology/REGISTRY.md and diff_diff/had_pretests.py + # docstrings: + # - did_had_pretest_workflow(aggregate="overall") covers paper + # Section 4.2 steps 1 + 3 ONLY; step 2 (Assumption 7 + # pre-trends) is explicitly NOT covered on the overall path. + # - qug_test = support-infimum test (H0: d_lower = 0), + # NOT "Assumption 5" (Design 1 sign identification, which is + # not testable per registry). + # - stute_test = Assumption 8 linearity, NOT Assumption 7 + # mean-independence. + # The single-period Step-3 guidance must not mislabel these. + output = practitioner_next_steps(mock_had_results, verbose=False) + step_3_steps = [s for s in output["next_steps"] if s["baker_step"] == 3] + assert len(step_3_steps) == 1 + why = step_3_steps[0].get("why", "") + # Must NOT call QUG an "Assumption 5" test. + assert "QUG (Assumption 5" not in why, ( + "Step-3 why-text must not call QUG an 'Assumption 5' test - " + "QUG tests H_0: d_lower = 0 (paper Theorem 4); Assumption 5 " + "is the Design 1 sign-identification condition and is NOT " + "testable per registry." + ) + # Must NOT claim Stute is Assumption 7 mean-independence. 
+ forbidden = ( + "Stute (Assumption 7", + "Stute / Yatchew-HR Assumption 7", + "Assumption 7 mean-independence", + ) + for phrase in forbidden: + assert phrase not in why, ( + f"Step-3 why-text must not carry the phrase {phrase!r} - " + f"stute_test / yatchew_hr_test are Assumption 8 linearity " + f"tests (paper Section 4.2 step 3); Assumption 7 (pre-trends) " + f"is paper step 2 and is NOT covered on the overall workflow " + f"path - the workflow's verdict explicitly flags that gap." + ) + # Must positively acknowledge the Assumption 7 / step 2 gap on + # the overall path (not silently imply it's covered). + assert "Assumption 7" in why or "step 2" in why, ( + "Step-3 why-text must explicitly mention Assumption 7 / step 2 " + "to acknowledge the gap on the overall workflow path - " + "agents reading the guidance must not assume the workflow " + "covers what it does not cover." + ) From 00ed3ab37edaf51ce02aad6ae618564d5eb1fbed Mon Sep 17 00:00:00 2001 From: igerber Date: Sat, 9 May 2026 11:29:52 -0400 Subject: [PATCH 4/9] Address PR #402 R3 review (2 P2) P2 (a) variance_formula / effective_dose_mean field descriptions in the single-period results table understated weighted mass-point fits. The single-period table previously said both fields were continuous-path-only / None on mass-point fits, but per had.py:3585-3629 weighted mass-point fits populate variance_formula in {"pweight_2sls", "survey_binder_tsl_2sls"} and effective_dose_mean as the weighted Wald-IV dose gap. Both rows now enumerate all four variance_formula labels and describe the mass-point Wald-IV dose-gap denominator explicitly. P2 (b) llms-practitioner.txt Step 4 decision tree was stale: it routed ALL continuous-intensity designs to ContinuousDiD, omitting HAD. That contradicted the new runtime estimator-selection guidance and would steer agents to the wrong estimator on universal-rollout panels. 
Updated the continuous branch to distinguish ContinuousDiD (per-dose ATT(d) / ACRT(d), requires never-treated controls) from HeterogeneousAdoptionDiD (WAS at dose support boundary, compatible with universal rollout or small never-treated share). Tests added (2 new, 88 total): - test_llms_full_had_variance_formula_describes_all_designs: scans the variance_formula and effective_dose_mean rows in the single-period HAD results table; asserts all four variance_formula labels (pweight, survey_binder_tsl, pweight_2sls, survey_binder_tsl_2sls) and the mass-point Wald-IV dose-gap semantics are present. - test_llms_practitioner_step_4_distinguishes_had_from_continuous: parses the Step 4 decision tree from get_llm_guide("practitioner"); asserts both ContinuousDiD and HeterogeneousAdoptionDiD appear in the continuous branch and that never-treated / universal-rollout framing is the routing distinguisher. Co-Authored-By: Claude Opus 4.7 (1M context) --- diff_diff/guides/llms-full.txt | 4 +- diff_diff/guides/llms-practitioner.txt | 9 +++- tests/test_guides.py | 71 ++++++++++++++++++++++++++ 3 files changed, 81 insertions(+), 3 deletions(-) diff --git a/diff_diff/guides/llms-full.txt b/diff_diff/guides/llms-full.txt index bc6184c5..17dca40f 100644 --- a/diff_diff/guides/llms-full.txt +++ b/diff_diff/guides/llms-full.txt @@ -1251,8 +1251,8 @@ Single-period results container for `HeterogeneousAdoptionDiD`. 
The table below | `survey_metadata` | `SurveyMetadata | None` | Repo-standard survey metadata when `survey_design=` / `weights=` is supplied | | `bandwidth_diagnostics` | `BandwidthResult | None` | MSE-DPI selector output (continuous designs); `None` on `mass_point` | | `bias_corrected_fit` | `BiasCorrectedFit | None` | Phase 1c bias-corrected local-linear fit object (continuous designs); `None` on `mass_point` | -| `variance_formula` | `str | None` | HAD-specific SE label on the weighted continuous path: `"pweight"` (CCT 2014 weighted-robust) or `"survey_binder_tsl"` (Binder 1983); `None` on unweighted / mass-point fits | -| `effective_dose_mean` | `float | None` | Weighted denominator used by the β̂-scale rescaling on the weighted continuous path; `None` on unweighted fits | +| `variance_formula` | `str | None` | HAD-specific SE label on weighted fits, populated on BOTH continuous and mass-point designs: `"pweight"` (continuous, CCT 2014 weighted-robust on the `weights=` shortcut), `"survey_binder_tsl"` (continuous, Binder 1983 TSL on the `survey_design=` path), `"pweight_2sls"` (mass-point, weighted 2SLS HC1 / CR1 sandwich on the `weights=` shortcut), or `"survey_binder_tsl_2sls"` (mass-point, Binder 1983 TSL on the `survey_design=` path). `None` on unweighted fits | +| `effective_dose_mean` | `float | None` | Weighted denominator used by the β̂-scale rescaling, populated on weighted fits across all designs: weighted `mean(d)` (`continuous_at_zero`), weighted `mean(d − d_lower)` (`continuous_near_d_lower`), or weighted Wald-IV dose gap `mean(d | Z=1, w) − mean(d | Z=0, w)` (`mass_point`). `None` on unweighted fits | **Methods:** `summary()`, `print_summary()`, `to_dict()`, `to_dataframe()` diff --git a/diff_diff/guides/llms-practitioner.txt b/diff_diff/guides/llms-practitioner.txt index acb0adaa..c853c2b7 100644 --- a/diff_diff/guides/llms-practitioner.txt +++ b/diff_diff/guides/llms-practitioner.txt @@ -158,7 +158,14 @@ Is this a triple-difference (DDD) design? 
(Two criteria: e.g., policy + eligibil |-- YES, staggered timing: StaggeredTripleDifference (SDDD) | Is treatment continuous (doses/intensities)? -|-- YES: ContinuousDiD (CDiD) +|-- YES, panel has never-treated units (some units with first_treat == 0, +| i.e. dose == 0 throughout): ContinuousDiD (CDiD) for per-dose +| ATT(d) / ACRT(d) dose-response curves +|-- YES, no never-treated units (universal rollout — every unit treated +| at some positive dose): HeterogeneousAdoptionDiD (HAD) for +| Weighted Average Slope (WAS) at the dose support boundary. +| HAD is also compatible with a small never-treated share if +| the WAS estimand is what you want. | Is treatment adoption staggered (multiple cohorts, different timing)? |-- YES: Do NOT use plain TWFE. Use one of: diff --git a/tests/test_guides.py b/tests/test_guides.py index 57752bf9..2a53f9bd 100644 --- a/tests/test_guides.py +++ b/tests/test_guides.py @@ -452,6 +452,77 @@ def test_llms_full_had_results_class_field_lists_match_real_dataclass(self): f"is missing the public dataclass field {field.name!r}." ) + def test_llms_full_had_variance_formula_describes_all_designs(self): + # Per diff_diff/had.py:3585-3629, weighted mass-point fits populate + # variance_formula in {"pweight_2sls", "survey_binder_tsl_2sls"} and + # weighted continuous fits in {"pweight", "survey_binder_tsl"}. The + # documented description must cover ALL four labels (not just the + # two continuous ones) so agents reading the guide on a weighted + # mass-point fit do not misread the available inference metadata. + text = get_llm_guide("full") + sp_start = text.index("### HeterogeneousAdoptionDiDResults") + sp_end = text.index("### HeterogeneousAdoptionDiDEventStudyResults", sp_start) + sp_block = text[sp_start:sp_end] + # Find the variance_formula row in the table. 
+ for line in sp_block.splitlines(): + if line.startswith("| `variance_formula`"): + for label in ( + "pweight", + "survey_binder_tsl", + "pweight_2sls", + "survey_binder_tsl_2sls", + ): + assert label in line, ( + f"variance_formula row must enumerate the {label!r} " + f"label - weighted mass-point fits populate " + f"pweight_2sls / survey_binder_tsl_2sls per " + f"had.py:3585-3629. Line: {line!r}" + ) + break + else: + pytest.fail("variance_formula row not found in HAD results table") + # effective_dose_mean: must mention mass-point Wald-IV dose gap. + for line in sp_block.splitlines(): + if line.startswith("| `effective_dose_mean`"): + assert "mass_point" in line or "Wald-IV" in line or "mass-point" in line, ( + f"effective_dose_mean row must mention mass-point " + f"semantics - weighted mass-point fits populate the " + f"weighted Wald-IV dose gap per had.py:3642-3660. " + f"Line: {line!r}" + ) + break + else: + pytest.fail("effective_dose_mean row not found in HAD results table") + + def test_llms_practitioner_step_4_distinguishes_had_from_continuous(self): + # The official practitioner workflow guide (returned by + # get_llm_guide("practitioner")) routes continuous treatments. It + # must distinguish ContinuousDiD (per-dose ATT(d), requires + # never-treated controls) from HeterogeneousAdoptionDiD (WAS at + # dose boundary, compatible with universal rollout). Pre-PR the + # decision tree routed ALL continuous-intensity designs to + # ContinuousDiD - which is wrong for no-untreated panels. + text = get_llm_guide("practitioner") + # Locate the Step 4 decision tree. + s4_start = text.index("## Step 4: Choose Estimation Method") + # Step 5 is the next section header; cap the slice there. + s5_start = text.index("## Step ", s4_start + 1) + s4_block = text[s4_start:s5_start] + # Both HAD and ContinuousDiD must appear in the continuous branch. 
+ assert "HeterogeneousAdoptionDiD" in s4_block, ( + "practitioner guide Step 4 decision tree must mention " + "HeterogeneousAdoptionDiD as the alternative to ContinuousDiD " + "on no-untreated / universal-rollout panels." + ) + assert "ContinuousDiD" in s4_block + # Universal-rollout / no-untreated framing should be present so + # readers know which branch routes where. + assert "never-treated" in s4_block.lower() or "untreated" in s4_block.lower(), ( + "practitioner guide Step 4 must describe the never-treated / " + "universal-rollout distinction that drives the HAD vs " + "ContinuousDiD routing." + ) + def test_llms_full_had_pretests_assumption_labels_correct(self): # Per docs/methodology/REGISTRY.md HeterogeneousAdoptionDiD # § "Assumptions / Theorems / Estimators": From d4b909157a6a38dfcf7d41355985efcfd1406154 Mon Sep 17 00:00:00 2001 From: igerber Date: Sat, 9 May 2026 11:46:55 -0400 Subject: [PATCH 5/9] Address PR #402 R4 review (1 P1, 1 P3) MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit P1 HAD Step-3 overstated pretest coverage on weighted/survey fits: practitioner_next_steps() said did_had_pretest_workflow runs QUG on both the overall and event-study paths without noting that the workflow explicitly skips QUG whenever survey_design= / survey= / weights= is supplied (Phase 4.5 C0 deferral, had_pretests.py:4488-4495 + REGISTRY § "QUG Null Test" Note (Phase 4.5 C0)). On weighted fits the workflow emits a UserWarning and returns a linearity-conditional verdict only. Both _handle_had and _handle_had_event_study Step-3 why-text + code snippets now explicitly state that survey-weighted fits skip QUG and yield a linearity-conditional verdict (the weighted verdict is conditional on QUG holding by assumption). The event-study text also notes that joint Stute pre-trends and joint homogeneity-linearity themselves remain available under survey weighting via the PSU-level Mammen multiplier bootstrap. 
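The weighted-fit behavior described above can be sketched in isolation. This is a hypothetical shape, not the real `did_had_pretest_workflow` from `diff_diff/had_pretests.py`: the point is only that weighted fits skip QUG with a `UserWarning` and return a linearity-conditional verdict, while unweighted fits run step 1.

```python
import warnings


def pretest_workflow_sketch(weighted, linearity_rejects):
    """Minimal sketch of the Phase 4.5 C0 deferral described above.

    Hypothetical function shape (not diff_diff's API): on weighted
    fits QUG is skipped with a UserWarning, so the verdict is only
    conditional on QUG holding by assumption.
    """
    if weighted:
        warnings.warn(
            "survey weights supplied: QUG support-infimum test skipped "
            "(Phase 4.5 C0 deferral); verdict is linearity-conditional.",
            UserWarning,
        )
        qug = None  # step 1 deliberately not checked on weighted fits
    else:
        qug = "ran"
    if linearity_rejects:
        verdict = "reject"
    else:
        verdict = "linearity-conditional pass" if weighted else "pass"
    return qug, verdict
```

An agent consuming the Step-3 guidance should therefore treat a weighted fail-to-reject verdict as weaker evidence than the unweighted one.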
P3 REGISTRY § HeterogeneousAdoptionDiD requirements checklist was stale: marked "Phase 5: practitioner_next_steps() integration" and "Phase 5 (remaining): llms-full.txt section" as pending. Updated to reflect this PR landing wave 1 of Phase 5; only T21 (HAD pretest workflow tutorial) and T22 (weighted/survey HAD tutorial) remain queued, both tracked in TODO.md. Tests added (1 new, 89 total): - test_had_step_3_flags_qug_under_survey_deferral: asserts both HAD handler variants surface the QUG-under-survey skip and the linearity-conditional-verdict caveat. Without this caveat agents may assume step 1 / Design 1' vs Design 1 was checked on weighted fits when the library deliberately does not check it there. Co-Authored-By: Claude Opus 4.7 (1M context) --- diff_diff/practitioner.py | 40 +++++++++++++++++++++++++++--------- docs/methodology/REGISTRY.md | 5 +++-- tests/test_practitioner.py | 37 +++++++++++++++++++++++++++++++++ 3 files changed, 70 insertions(+), 12 deletions(-) diff --git a/diff_diff/practitioner.py b/diff_diff/practitioner.py index d468cc86..8ccec7f0 100644 --- a/diff_diff/practitioner.py +++ b/diff_diff/practitioner.py @@ -857,18 +857,26 @@ def _handle_had(results: Any): baker_step=3, label="Run the HAD pretest battery", why=( - "On a two-period panel did_had_pretest_workflow runs " - "paper Section 4.2 step 1 (QUG support-infimum test - " + "On a two-period unweighted panel did_had_pretest_workflow " + "runs paper Section 4.2 step 1 (QUG support-infimum test - " "decides Design 1' vs Design 1) and step 3 (Stute / " "Yatchew-HR Assumption 8 linearity tests). Step 2 " "(Assumption 7 pre-trends) is NOT covered on the overall " "path - a single pre-period cannot support the joint " "Stute variant - and the returned verdict explicitly " "flags that gap. To close step 2, refit on a multi-period " - "panel with aggregate='event_study'. 
Assumptions 3 / 5 / 6 " - "(uniform continuity at the boundary, Design 1 sign / " - "WAS_d_lower identification) are NOT testable via " - "pre-trends - the workflow vets only what can be vetted." + "panel with aggregate='event_study'. On survey-weighted " + "fits (survey_design= / survey= / weights=) the workflow " + "skips QUG with a UserWarning (permanent Phase 4.5 C0 " + "deferral - extreme order statistics are not smooth " + "functionals of the empirical CDF) and returns a " + "linearity-conditional verdict only - so step 1 coverage " + "is unweighted-only and the reported verdict on weighted " + "fits is conditional on QUG holding by assumption. " + "Assumptions 3 / 5 / 6 (uniform continuity at the " + "boundary, Design 1 sign / WAS_d_lower identification) " + "are NOT testable via pre-trends - the workflow vets only " + "what can be vetted." ), code=( "from diff_diff import did_had_pretest_workflow\n" @@ -879,7 +887,9 @@ def _handle_had(results: Any): "print(report.summary())\n" "# verdict explicitly flags the Assumption 7 gap on the\n" "# overall path; aggregate='event_study' on a multi-period\n" - "# panel adds joint Stute pre-trends + joint homogeneity-linearity." + "# panel adds joint Stute pre-trends + joint homogeneity-linearity.\n" + "# Passing survey_design= / weights= skips QUG (Phase 4.5 C0)\n" + "# and returns a linearity-conditional verdict only." ), step_name="parallel_trends", ), @@ -997,11 +1007,21 @@ def _handle_had_event_study(results: Any): baker_step=3, label="Run the HAD pretest battery (event-study mode)", why=( - "On multi-period panels, did_had_pretest_workflow with " - "aggregate='event_study' runs QUG plus joint Stute " + "On multi-period unweighted panels, did_had_pretest_workflow " + "with aggregate='event_study' runs QUG plus joint Stute " "pre-trends plus joint homogeneity-linearity Stute. The " "joint Stute variants close the paper Section 4.2 step-2 " - "gap that the overall path explicitly flags as deferred." 
+ "gap that the overall path explicitly flags as deferred. " + "On survey-weighted fits (survey_design= / survey= / " + "weights=) the workflow skips QUG with a UserWarning " + "(permanent Phase 4.5 C0 deferral) and returns a " + "linearity-conditional verdict only - so step 1 coverage " + "is unweighted-only on the event-study path too, and the " + "weighted verdict is conditional on QUG holding by " + "assumption. The joint Stute pre-trends and joint " + "homogeneity-linearity tests themselves remain available " + "under survey weighting via PSU-level Mammen multiplier " + "bootstrap." ), code=( "from diff_diff import did_had_pretest_workflow\n" diff --git a/docs/methodology/REGISTRY.md b/docs/methodology/REGISTRY.md index 5e5b6824..a59949e6 100644 --- a/docs/methodology/REGISTRY.md +++ b/docs/methodology/REGISTRY.md @@ -2548,9 +2548,10 @@ Shipped in `diff_diff/had_pretests.py` as `stute_joint_pretest()` (residuals-in - [x] Phase 3: `did_had_pretest_workflow()` composite helper. Two-period `data`-only entry point (Phase 2a overall-path dispatch); reduces panel via `_aggregate_first_difference` and runs all three IMPLEMENTED tests at a shared `alpha`. `seed` forwards to `stute_test` only (QUG and Yatchew are deterministic). Returns `HADPretestReport` with priority-ordered verdict string. Because Phase 3 ships steps 1 + 3 of the paper's four-step workflow but **not** step 2 (Assumption 7 pre-trends test via Equation 18), the fail-to-reject verdict explicitly flags the Assumption 7 gap rather than claiming unconditional TWFE safety: `"QUG and linearity diagnostics fail-to-reject; Assumption 7 pre-trends test NOT run (paper step 2 deferred to Phase 3 follow-up)"`. Verdict priority follows the paper's one-way rule (TWFE admissible only if NO test rejects): **conclusive rejections are the primary verdict and are NEVER hidden by inconclusive status** — any unresolved-step note is appended via `"; additional steps unresolved: ..."` rather than replacing the rejection. 
The pure `"inconclusive - QUG NaN"` / `"inconclusive - both Stute and Yatchew linearity tests NaN"` forms only fire when NO conclusive test rejects AND a required step is unresolved. The partial-workflow fail-to-reject verdict may carry a `"(Yatchew NaN - skipped)"` (or Stute) suffix when one linearity test is NaN but the other is conclusive (step 3 resolved via the paper's "Stute OR Yatchew" wording). Bundled rejection-reason strings name each failed assumption in the conclusive-rejection case. `all_pass` is `True` iff QUG is conclusive AND at least one of Stute/Yatchew is conclusive AND no conclusive test rejects. **Non-negative-dose contract**: all three raw linearity helpers (`qug_test`, `stute_test`, `yatchew_hr_test`) raise a front-door `ValueError` on any `d < 0`, mirroring the `_validate_had_panel` guard (paper Section 2 HAD support restriction). Multi-period panels pre-slice to `(F-1, F)` before calling; joint-horizon dispatch deferred to Phase 3 follow-up. - [ ] Phase 4: Pierce-Schott (2016) replication harness reproduces Figure 2 values. - [ ] Phase 4: Full DGP 1/2/3 coverage-rate reproduction from Table 1. -- [ ] Phase 5: `practitioner_next_steps()` integration for HAD results. +- [x] Phase 5 (wave 1, PR #402): `practitioner_next_steps()` integration for HAD results - `_handle_had` and `_handle_had_event_study` route both result classes through HAD-specific Baker et al. (2025) step guidance with bidirectional HAD ↔ ContinuousDiD Step-4 routing closure. The `_check_nan_att` helper extends to ndarray `att` (HAD event-study) via `np.all(np.isnan(arr))` semantics; scalar path bit-exact preserved. 
+- [x] Phase 5 (wave 1, PR #402): `llms-full.txt` HeterogeneousAdoptionDiD section + result-class blocks + `## HAD Pretests` index + Choosing-an-Estimator row landed; constructor / fit() signatures match the real API (regression-tested via `inspect.signature`); result-class field tables enumerate every public dataclass field (regression-tested via `dataclasses.fields()`); `llms-practitioner.txt` Step 4 decision tree distinguishes ContinuousDiD (per-dose ATT(d), needs never-treated) from HeterogeneousAdoptionDiD (WAS, universal-rollout-compatible). - [x] Phase 5 (partial): README catalog one-liner, bundled `llms.txt` `## Estimators` entry, `docs/api/had.rst` (autoclass for the three classes), and `docs/references.rst` citation landed in PR #372 docs refresh. -- [ ] Phase 5 (remaining): Tutorial notebook + `llms-full.txt` HeterogeneousAdoptionDiD section (preserving the UTF-8 fingerprint). +- [ ] Phase 5 (remaining): T21 HAD pretest workflow tutorial + T22 weighted/survey HAD tutorial - tracked in `TODO.md`. - [ ] Documentation of non-testability of Assumptions 5 and 6. - [ ] Warnings for staggered treatment timing (redirect to `ChaisemartinDHaultfoeuille`). - [ ] `NotImplementedError` phase pointer when `covariates=` is passed (Theorem 6 future work). diff --git a/tests/test_practitioner.py b/tests/test_practitioner.py index 298ba8cb..fa62dd0f 100644 --- a/tests/test_practitioner.py +++ b/tests/test_practitioner.py @@ -690,6 +690,43 @@ def test_handle_continuous_step_4_snippet_is_valid_python(self, mock_continuous_ if code.strip(): ast.parse(code) # raises SyntaxError on failure + def test_had_step_3_flags_qug_under_survey_deferral( + self, mock_had_results, mock_had_event_study_results + ): + # Per diff_diff/had_pretests.py:4488-4495 + REGISTRY § "QUG Null + # Test" Note (Phase 4.5 C0): when survey_design= / survey= / + # weights= is supplied, did_had_pretest_workflow skips the QUG + # step with a UserWarning and returns a linearity-conditional + # verdict only. 
Both HAD handler variants must surface this + # caveat so agents do not assume step 1 / Design 1' vs Design 1 + # was checked on weighted fits when the library deliberately + # cannot check it there. + for fixture in (mock_had_results, mock_had_event_study_results): + output = practitioner_next_steps(fixture, verbose=False) + step_3_steps = [s for s in output["next_steps"] if s["baker_step"] == 3] + assert len(step_3_steps) == 1 + text = (step_3_steps[0].get("why", "") + " " + step_3_steps[0].get("code", "")).lower() + # Must mention that survey-weighted fits skip QUG. + assert "skip" in text and "qug" in text, ( + "Step-3 text must explicitly say survey-weighted fits " + "skip QUG (Phase 4.5 C0 deferral). Without this caveat " + "agents may assume step 1 / Design 1' vs Design 1 was " + "checked on weighted fits when the library deliberately " + "does not check it there." + ) + # Must mention "linearity-conditional" verdict OR equivalent + # framing so agents know the weighted verdict is conditional + # on QUG holding by assumption. + assert ( + "linearity-conditional" in text + or "linearity conditional" in text + or "qug holding by assumption" in text + ), ( + "Step-3 text must describe the weighted verdict as " + "linearity-conditional / conditional on QUG holding by " + "assumption." 
+ ) + def test_had_step_3_pretest_assumption_labels_correct(self, mock_had_results): # Per docs/methodology/REGISTRY.md and diff_diff/had_pretests.py # docstrings: From 0ecb635284ed477709c7aa6592da3f435785665d Mon Sep 17 00:00:00 2001 From: igerber Date: Sat, 9 May 2026 11:54:29 -0400 Subject: [PATCH 6/9] Address PR #402 R5 review (1 P3, doc-drift fix) P3 dataclass-docstring drift: PR #402 R3 fixed the llms-full.txt field descriptions to acknowledge that weighted mass-point HAD fits populate variance_formula in {"pweight_2sls", "survey_binder_tsl_2sls"} and effective_dose_mean as the weighted Wald-IV dose gap (per had.py:3585-3660), but the HeterogeneousAdoptionDiDResults dataclass field docstrings in had.py:347-366 still said those fields were continuous-only / None on mass-point - leaving two source-of-truth surfaces disagreeing about the same public result object. Updated both field docstrings to enumerate all four variance_formula labels (continuous + mass-point variants under both `weights=` shortcut and `survey_design=` paths) and to describe the mass-point weighted Wald-IV dose-gap denominator semantics (`mean(D | Z=1, w) - mean(D | Z=0, w)` where Z = 1{D > d_lower}). Tests added (1 new, 90 total): - test_had_results_dataclass_docstrings_match_weighted_mass_point_contract: uses inspect.getsource(HeterogeneousAdoptionDiDResults) to scan the class source and assert the variance_formula docstring mentions both pweight_2sls and survey_binder_tsl_2sls labels, and the effective_dose_mean docstring mentions mass-point Wald-IV semantics. Locks both field docstrings against drift back to the continuous-only framing now that the llms-full.txt guide and the actual fit() code populate these on mass-point fits. 
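The mass-point denominator named in this fix, the weighted Wald-IV dose gap mean(D | Z=1, w) - mean(D | Z=0, w) with Z = 1{D > d_lower}, can be computed as a standalone sketch (assuming numpy; this is an illustration of the formula, not the had.py implementation):

```python
import numpy as np


def weighted_wald_iv_dose_gap(d, w, d_lower):
    # Sketch of the mass-point effective_dose_mean semantics described
    # above: weighted mean dose among units above the support infimum
    # minus weighted mean dose among units at it (Z = 1{D > d_lower}).
    d = np.asarray(d, dtype=float)
    w = np.asarray(w, dtype=float)
    z = d > d_lower
    return np.average(d[z], weights=w[z]) - np.average(d[~z], weights=w[~z])
```

With uniform weights this reduces to the plain difference of subgroup dose means, which is the unweighted Wald-IV denominator.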
Co-Authored-By: Claude Opus 4.7 (1M context) --- diff_diff/had.py | 42 ++++++++++++++++++-------------- tests/test_practitioner.py | 49 ++++++++++++++++++++++++++++++++++++++ 2 files changed, 73 insertions(+), 18 deletions(-) diff --git a/diff_diff/had.py b/diff_diff/had.py index 2819cd09..3a7aef74 100644 --- a/diff_diff/had.py +++ b/diff_diff/had.py @@ -345,25 +345,31 @@ class HeterogeneousAdoptionDiDResults: # Phase 4.5 weighted-path extras (optional so unweighted fits stay unchanged) variance_formula: Optional[str] = None - """HAD-specific label for the SE formula on the weighted continuous - path: ``"pweight"`` (weighted-robust CCT 2014) under ``weights=``, - ``"survey_binder_tsl"`` (Binder 1983 TSL with PSU/strata/FPC) under - ``survey=SurveyDesign(...)``, ``None`` on unweighted or mass-point - fits. Orthogonal to ``survey_metadata`` which is the repo-standard - :class:`diff_diff.survey.SurveyMetadata` shared with downstream - report/diagnostic consumers (no HAD-specific leakage).""" + """HAD-specific label for the SE formula on weighted fits, populated + on BOTH continuous and mass-point designs (Phase 4.5 A / B): + ``"pweight"`` (continuous, weighted-robust CCT 2014 under the + ``weights=`` shortcut), ``"survey_binder_tsl"`` (continuous, Binder + 1983 TSL with PSU/strata/FPC under ``survey_design=SurveyDesign(...)``), + ``"pweight_2sls"`` (mass-point, weighted 2SLS HC1/CR1 sandwich + under the ``weights=`` shortcut), or ``"survey_binder_tsl_2sls"`` + (mass-point, Binder 1983 TSL under ``survey_design=``). ``None`` on + unweighted fits. 
Orthogonal to ``survey_metadata`` which is the + repo-standard :class:`diff_diff.survey.SurveyMetadata` shared with + downstream report/diagnostic consumers (no HAD-specific leakage).""" effective_dose_mean: Optional[float] = None - """Weighted denominator used by the beta-scale rescaling on the - continuous path: ``sum(w_g · D_g) / sum(w_g)`` for - ``continuous_at_zero`` or ``sum(w_g · (D_g - d_lower)) / sum(w_g)`` - for ``continuous_near_d_lower``. Reduces bit-exactly to - ``dose_mean`` / ``mean(D - d_lower)`` when weights are uniform or - absent. ``None`` when ``fit()`` was called without - ``survey=`` / ``weights=`` (use ``dose_mean`` there). Exists because - ``dose_mean`` is the raw sample mean of the dose column; under - weighted fits the estimator's actual denominator is the weighted - mean, and users reconstructing the β-scale value by hand need the - weighted one.""" + """Weighted denominator used by the beta-scale rescaling, populated + on weighted fits across all designs: ``sum(w_g · D_g) / sum(w_g)`` + on ``continuous_at_zero``, ``sum(w_g · (D_g - d_lower)) / sum(w_g)`` + on ``continuous_near_d_lower``, and the weighted Wald-IV dose gap + ``mean(D | Z=1, w) - mean(D | Z=0, w)`` on ``mass_point`` (where + ``Z = 1{D > d_lower}``). On the continuous designs reduces + bit-exactly to ``dose_mean`` / ``mean(D - d_lower)`` when weights + are uniform or absent. ``None`` when ``fit()`` was called without + ``survey_design=`` / ``survey=`` / ``weights=`` (use ``dose_mean`` + there). 
Exists because ``dose_mean`` is the raw sample mean of the + dose column; under weighted fits the estimator's actual denominator + is the weighted form above, and users reconstructing the β-scale + value by hand need the weighted one.""" def __repr__(self) -> str: base = ( diff --git a/tests/test_practitioner.py index fa62dd0f..aad9c96b 100644 --- a/tests/test_practitioner.py +++ b/tests/test_practitioner.py @@ -690,6 +690,55 @@ def test_handle_continuous_step_4_snippet_is_valid_python(self, mock_continuous_ if code.strip(): ast.parse(code) # raises SyntaxError on failure + + def test_had_results_dataclass_docstrings_match_weighted_mass_point_contract(self): + # PR #402 R3 fixed the llms-full.txt field descriptions to + # acknowledge that weighted mass-point fits populate + # variance_formula in {"pweight_2sls", "survey_binder_tsl_2sls"} + # and effective_dose_mean as the weighted Wald-IV dose gap (per + # had.py:3585-3660). PR #402 R5 P3 caught that the dataclass + # field docstrings still said those fields were continuous-only + # / None on mass-point - leaving two source-of-truth surfaces + # disagreeing about the same public result object. Lock the + # dataclass docstrings against drift back to the continuous-only + # framing. + import inspect + + from diff_diff.had import HeterogeneousAdoptionDiDResults + + # Bare string literals after a dataclass field assignment are + # discarded at runtime (PEP 224 was rejected) - they are not + # attached to any __doc__ or to __dataclass_fields__ metadata, + # so there is nothing to introspect for them directly. + # Instead, read the class source via inspect.getsource() and check + # the field-docstring blocks we care about. + src = inspect.getsource(HeterogeneousAdoptionDiDResults) + # variance_formula docstring must enumerate all 4 labels. 
+ assert "pweight_2sls" in src, ( + "HeterogeneousAdoptionDiDResults.variance_formula docstring " + "must mention `pweight_2sls` (weighted mass-point HC1/CR1 " + "label per had.py:3585-3629). Otherwise the dataclass " + "docstring contradicts llms-full.txt and the actual " + "implementation." + ) + assert "survey_binder_tsl_2sls" in src, ( + "HeterogeneousAdoptionDiDResults.variance_formula docstring " + "must mention `survey_binder_tsl_2sls` (weighted mass-point " + "Binder-TSL label)." + ) + # effective_dose_mean docstring must mention mass-point Wald-IV. + assert "mass_point" in src or "mass-point" in src, ( + "HeterogeneousAdoptionDiDResults.effective_dose_mean " + "docstring must mention mass-point semantics; weighted " + "mass-point fits populate it as the weighted Wald-IV dose " + "gap per had.py:3642-3660." + ) + assert "Wald-IV" in src or "Z=1" in src, ( + "HeterogeneousAdoptionDiDResults.effective_dose_mean " + "docstring must describe the weighted Wald-IV dose gap " + "semantics (or the underlying Z=1/Z=0 subgroup-mean form) " + "for mass-point fits." + ) + def test_had_step_3_flags_qug_under_survey_deferral( self, mock_had_results, mock_had_event_study_results ): From da24a59596c0f8435aa72d0347fc3ffca21022e5 Mon Sep 17 00:00:00 2001 From: igerber Date: Sat, 9 May 2026 12:13:38 -0400 Subject: [PATCH 7/9] Address PR #402 R6 review (1 P3, mass-point + survey vcov caveat) P3 mass-point + survey vcov requirement: per had.py:3495-3507 the mass-point design rejects the default classical vcov family on the survey_design= path with NotImplementedError (the survey path composes Binder-TSL on the HC1-scale influence function, which targets V_HC1 rather than the classical sandwich). The Step-6 sup-t / cband snippet in _handle_had_event_study and the HAD section in llms-full.txt presented weighted event-study fits as a generic survey_design= path without surfacing this constraint, so the example as written would fail at fit time on a mass-point panel. 
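The constraint can be sketched as a guard (hypothetical function shape; the real check lives inside had.py's fit path):

```python
def check_mass_point_survey_vcov(design, survey_design, vcov_type):
    # Sketch of the constraint described above (hypothetical shape):
    # the survey path composes Binder-TSL on the HC1-scale influence
    # function, so mass-point + survey fits must request vcov_type="hc1";
    # the classical default raises NotImplementedError there. Unweighted
    # mass-point fits and the continuous designs are unaffected.
    if design == "mass_point" and survey_design is not None and vcov_type != "hc1":
        raise NotImplementedError(
            "design='mass_point' with survey_design= requires vcov_type='hc1' "
            "(Binder-TSL consumes the HC1-scale influence function)"
        )
```

This is why `vcov_type='hc1'` is the safe default in survey-aware snippets: it satisfies the mass-point guard and is ignored on the continuous designs.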
Both surfaces now make the requirement explicit: - The Step-6 snippet uses HeterogeneousAdoptionDiD(vcov_type='hc1', ...) with an inline comment explaining that hc1 is required on mass-point + survey and is a no-op on the continuous designs (which use the CCT-2014 robust SE regardless), making it a safe default for the survey-aware example. - A new "Mass-point + survey constraint" paragraph in the HAD section of llms-full.txt documents the same requirement and routing. Tests added (2 new, 92 total): - test_had_event_study_sup_t_snippet_uses_hc1_for_mass_point_survey_compatibility: asserts the sup-t / cband snippet either uses vcov_type='hc1' / robust=True or surfaces the mass-point + survey vcov requirement inline so agents adapting the snippet on a mass-point panel know to add it. - test_llms_full_had_section_documents_mass_point_survey_vcov_requirement: asserts the HAD section documents the mass-point + survey vcov requirement (vcov_type mention paired with mass-point context). Co-Authored-By: Claude Opus 4.7 (1M context) --- diff_diff/guides/llms-full.txt | 2 ++ diff_diff/practitioner.py | 11 +++++++++- tests/test_guides.py | 24 ++++++++++++++++++++ tests/test_practitioner.py | 40 ++++++++++++++++++++++++++++++++++ 4 files changed, 76 insertions(+), 1 deletion(-) diff --git a/diff_diff/guides/llms-full.txt b/diff_diff/guides/llms-full.txt index 17dca40f..8de2788b 100644 --- a/diff_diff/guides/llms-full.txt +++ b/diff_diff/guides/llms-full.txt @@ -657,6 +657,8 @@ es = est.fit(data, outcome_col='y', unit_col='unit', **Staggered panels.** On multi-cohort panels with `aggregate="event_study"`, `fit()` auto-filters to the last treatment cohort plus never-treated units (paper Appendix B.2) and emits a `UserWarning` naming kept/dropped counts. The estimand is then a **last-cohort-only WAS**, not a multi-cohort average. For full multi-cohort staggered support, see `ChaisemartinDHaultfoeuille`. 
+**Mass-point + survey constraint.** When fitting `design="mass_point"` with `survey_design=` (or the deprecated `survey=` alias), `vcov_type="hc1"` (or `robust=True`) is required: the survey path composes the standard error via Binder-TSL on the HC1-scale influence function, so the default classical sandwich path raises `NotImplementedError`. Passing `vcov_type="hc1"` is a safe default on weighted survey examples since `vcov_type` is unused on the continuous designs (CCT-2014 robust SE is the only formula there). + ### StackedDiD Stacked DiD estimator (Wing, Freedman & Hollingsworth 2024). Addresses TWFE bias with corrective Q-weights. diff --git a/diff_diff/practitioner.py b/diff_diff/practitioner.py index 8ccec7f0..973c1ecf 100644 --- a/diff_diff/practitioner.py +++ b/diff_diff/practitioner.py @@ -1078,7 +1078,16 @@ def _handle_had_event_study(results: Any): "from diff_diff import HeterogeneousAdoptionDiD, SurveyDesign\n" "# Construct your survey design (adapt to your data):\n" "sd = SurveyDesign(weights='weight_col')\n" - "est = HeterogeneousAdoptionDiD(n_bootstrap=999, seed=42)\n" + "# vcov_type='hc1' is REQUIRED on the mass-point design under\n" + "# survey_design= (the default classical sandwich raises\n" + "# NotImplementedError on the survey path because the\n" + "# Binder-TSL composition consumes the HC1-scale IF -\n" + "# see had.py:3495-3507). 
On the continuous designs the\n" + "# vcov_type kwarg is unused (CCT-2014 robust SE is the\n" + "# only formula), so passing vcov_type='hc1' is a no-op\n" + "# there and a safe default for the survey-aware example.\n" + "est = HeterogeneousAdoptionDiD(\n" + " n_bootstrap=999, seed=42, vcov_type='hc1')\n" "es = est.fit(\n" " data, outcome_col='y', unit_col='unit',\n" " time_col='t', dose_col='d',\n" diff --git a/tests/test_guides.py b/tests/test_guides.py index 2a53f9bd..afc41251 100644 --- a/tests/test_guides.py +++ b/tests/test_guides.py @@ -452,6 +452,30 @@ def test_llms_full_had_results_class_field_lists_match_real_dataclass(self): f"is missing the public dataclass field {field.name!r}." ) + def test_llms_full_had_section_documents_mass_point_survey_vcov_requirement(self): + # Per had.py:3495-3507 the mass-point design rejects the default + # classical vcov family on the survey_design= path + # (NotImplementedError). The HAD section must surface this + # requirement so an agent reading llms-full.txt and writing a + # weighted mass-point fit knows to pass vcov_type='hc1' + # explicitly. Without this caveat the documented fit() example + # can fail at fit time on a mass-point panel. + text = get_llm_guide("full") + had_start = text.index("### HeterogeneousAdoptionDiD") + had_end = text.index("### StackedDiD", had_start) + had_text = text[had_start:had_end] + # Must mention the mass-point + survey vcov requirement. + # Accept either explicit "vcov_type" mention near "mass" wording + # or the explicit "hc1" / "robust=True" pairing with mass-point. + lower = had_text.lower() + assert "vcov_type" in lower and ("mass-point" in lower or "mass_point" in lower), ( + "HAD section must document the mass-point + survey vcov " + "requirement: passing vcov_type='hc1' (or robust=True) is " + "required on design='mass_point' under survey_design= " + "(per had.py:3495-3507). Without this caveat the documented " + "weighted fit example can raise NotImplementedError." 
+ ) + def test_llms_full_had_variance_formula_describes_all_designs(self): # Per diff_diff/had.py:3585-3629, weighted mass-point fits populate # variance_formula in {"pweight_2sls", "survey_binder_tsl_2sls"} and diff --git a/tests/test_practitioner.py b/tests/test_practitioner.py index aad9c96b..836a76b9 100644 --- a/tests/test_practitioner.py +++ b/tests/test_practitioner.py @@ -690,6 +690,46 @@ def test_handle_continuous_step_4_snippet_is_valid_python(self, mock_continuous_ if code.strip(): ast.parse(code) # raises SyntaxError on failure + def test_had_event_study_sup_t_snippet_uses_hc1_for_mass_point_survey_compatibility( + self, mock_had_event_study_results + ): + # Per had.py:3495-3507 the mass-point design rejects the + # default classical vcov family on the survey_design= path + # (NotImplementedError). The Step-6 sup-t snippet shows a + # generic weighted event-study fit; if it uses the default + # vcov_type a copy/paste on a mass-point panel raises at + # fit time. Snippet must either use vcov_type='hc1' / + # robust=True OR explicitly note the requirement so agents + # can adapt. + output = practitioner_next_steps(mock_had_event_study_results, verbose=False) + step_6_steps = [s for s in output["next_steps"] if s["baker_step"] == 6] + assert len(step_6_steps) >= 1 + # Find the sup-t / cband step (sensitivity step). + sup_t = next( + (s for s in step_6_steps if "cband" in s.get("code", "")), + None, + ) + assert sup_t is not None, "sup-t / cband step not found at baker_step=6" + snippet = sup_t.get("code", "") + # Either the snippet itself uses vcov_type='hc1' / robust=True + # OR it documents the requirement inline (so agents adapting + # the snippet on a mass-point panel know to add it). 
+ ok = ( + "vcov_type='hc1'" in snippet + or 'vcov_type="hc1"' in snippet + or "robust=True" in snippet + or ("mass-point" in snippet and "vcov_type" in snippet) + or ("mass_point" in snippet and "vcov_type" in snippet) + ) + assert ok, ( + "Sup-t / cband snippet must either use vcov_type='hc1' / " + "robust=True or surface the mass-point + survey vcov " + "requirement inline. Per had.py:3495-3507 the default " + "classical sandwich raises NotImplementedError on the " + "mass-point + survey path; the example as written would " + "fail at fit time on a mass-point panel." + ) + def test_had_results_dataclass_docstrings_match_weighted_mass_point_contract(self): # PR #402 R3 fixed the llms-full.txt field descriptions to # acknowledge that weighted mass-point fits populate From 4a247587df7099933e0624c2d34099f53fe36d65 Mon Sep 17 00:00:00 2001 From: igerber Date: Sat, 9 May 2026 12:30:18 -0400 Subject: [PATCH 8/9] Address PR #402 R7 review (1 P1 + 1 P2) MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit P1 step-2 / Assumption 7 closure precondition: per docs/methodology/REGISTRY.md HeterogeneousAdoptionDiD § "Assumption 7 / step 2 closure" + had_pretests.py:4738-4756 + 2769, aggregate="event_study" closes paper Section 4.2 step 2 ONLY IF the panel has at least one earlier placebo pre-period beyond the base F-1. With only the base F-1 pre-period available (minimal 3-period event-study, or 4-period under trends_lin=True where the consumed F-2 placebo is auto-dropped), the workflow sets pretrends_joint=None, all_pass=False, and appends 'joint pre-trends skipped (no earlier pre-period)' to the verdict - step 2 stays uncovered. The previous Step-3 wording in both _handle_had and _handle_had_event_study + the HAD Pretests intro in llms-full.txt said generically that aggregate="event_study" closes the step-2 gap, which is overbroad and could mislead agents on minimal valid event-study panels. 
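The precondition arithmetic described above can be sketched as a standalone helper. This is an illustrative toy under the commit's description only (the function name is hypothetical, not the diff_diff API): the base `F-1` pre-period is always consumed as the reference, and `trends_lin=True` additionally consumes the `F-2` placebo, so the joint pre-trends test needs at least one pre-period beyond those.

```python
# Toy sketch: does the panel carry an earlier placebo pre-period
# beyond what the fit itself consumes, so the joint Stute
# pre-trends test (paper Section 4.2 step 2) can actually run?
def step2_pretrends_covered(n_pre_periods, trends_lin=False):
    # F-1 reference is always consumed; trends_lin also eats F-2.
    consumed = 2 if trends_lin else 1
    # When this is False, the workflow described above falls back to
    # pretrends_joint=None, all_pass=False, and the verdict suffix
    # 'joint pre-trends skipped (no earlier pre-period)'.
    return n_pre_periods - consumed >= 1
```

For example, a minimal 3-period event-study fit has one pre-period (the base `F-1`) and stays uncovered, as does a 4-period fit under `trends_lin=True`.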
All three surfaces now make the precondition explicit AND document the pretrends_joint=None / 'joint pre-trends skipped' fallback verdict so agents know what to expect when the precondition fails. P2 missing regression coverage: the prior tests locked assumption labels and the QUG-under-survey caveat but did not lock the earlier-pre-period precondition - that is why the overstatement landed in the new agent-facing surfaces without tripping the existing guide / practitioner tests. Tests added (2 new, 94 total): - test_had_step_3_documents_earlier_pre_period_precondition_for_step_2: asserts both HAD handler variants surface the 'earlier pre-period' / placebo precondition AND the pretrends_joint=None / 'joint pre-trends skipped' fallback. - test_llms_full_had_pretests_documents_earlier_pre_period_precondition: same lock on the HAD Pretests section in llms-full.txt. Co-Authored-By: Claude Opus 4.7 (1M context) --- diff_diff/guides/llms-full.txt | 2 +- diff_diff/practitioner.py | 20 ++++++++++++++---- tests/test_guides.py | 24 ++++++++++++++++++++++ tests/test_practitioner.py | 37 ++++++++++++++++++++++++++++++++++ 4 files changed, 78 insertions(+), 5 deletions(-) diff --git a/diff_diff/guides/llms-full.txt b/diff_diff/guides/llms-full.txt index 8de2788b..7e66f9b3 100644 --- a/diff_diff/guides/llms-full.txt +++ b/diff_diff/guides/llms-full.txt @@ -1406,7 +1406,7 @@ results = did.fit(data, outcome='y', treatment='treated', time='post') ## HAD Pretests -Diagnostic pretests for the `HeterogeneousAdoptionDiD` identifying assumptions (de Chaisemartin, Ciccia, D'Haultfœuille & Knau 2026). The composite workflow `did_had_pretest_workflow` is the recommended entry point — call it before reporting WAS as causal. 
The workflow follows paper Section 4.2's three-step battery: **step 1** is the QUG support-infimum test (decides whether Design 1' or Design 1 applies); **step 2** is the Assumption 7 pre-trends test (joint Stute on the event-study path; explicitly NOT covered on the overall path because a single-pre-period panel cannot support the joint variant); **step 3** is the Assumption 8 linearity test (`stute_test` or `yatchew_hr_test`). On the default `aggregate="overall"` path the workflow runs steps 1 + 3 only and the returned `verdict` flags the Assumption 7 gap; pass `aggregate="event_study"` on a multi-period panel to close that gap. +Diagnostic pretests for the `HeterogeneousAdoptionDiD` identifying assumptions (de Chaisemartin, Ciccia, D'Haultfœuille & Knau 2026). The composite workflow `did_had_pretest_workflow` is the recommended entry point — call it before reporting WAS as causal. The workflow follows paper Section 4.2's three-step battery: **step 1** is the QUG support-infimum test (decides whether Design 1' or Design 1 applies); **step 2** is the Assumption 7 pre-trends test (joint Stute on the event-study path; explicitly NOT covered on the overall path because a single-pre-period panel cannot support the joint variant); **step 3** is the Assumption 8 linearity test (`stute_test` or `yatchew_hr_test`). On the default `aggregate="overall"` path the workflow runs steps 1 + 3 only and the returned `verdict` flags the Assumption 7 gap; pass `aggregate="event_study"` on a multi-period panel **with at least one earlier placebo pre-period beyond the base `F-1`** to close that gap. With only the base `F-1` pre-period available (minimal 3-period event-study, or 4-period under `trends_lin=True` where the consumed `F-2` placebo is dropped), the workflow still sets `pretrends_joint=None`, `all_pass=False`, and appends `joint pre-trends skipped (no earlier pre-period)` to the verdict — step 2 stays uncovered. 
```python from diff_diff import ( diff --git a/diff_diff/practitioner.py b/diff_diff/practitioner.py index 973c1ecf..1d3b8e73 100644 --- a/diff_diff/practitioner.py +++ b/diff_diff/practitioner.py @@ -865,7 +865,13 @@ def _handle_had(results: Any): "path - a single pre-period cannot support the joint " "Stute variant - and the returned verdict explicitly " "flags that gap. To close step 2, refit on a multi-period " - "panel with aggregate='event_study'. On survey-weighted " + "panel with aggregate='event_study' AND verify the panel " + "has at least one earlier placebo pre-period beyond F-1; " + "if only the base pre-period F-1 is available, the " + "workflow still sets pretrends_joint=None, all_pass=False, " + "and a 'joint pre-trends skipped (no earlier pre-period)' " + "verdict suffix - in that case step 2 stays uncovered " + "even on the event-study path. On survey-weighted " "fits (survey_design= / survey= / weights=) the workflow " "skips QUG with a UserWarning (permanent Phase 4.5 C0 " "deferral - extreme order statistics are not smooth " @@ -1010,9 +1016,15 @@ def _handle_had_event_study(results: Any): "On multi-period unweighted panels, did_had_pretest_workflow " "with aggregate='event_study' runs QUG plus joint Stute " "pre-trends plus joint homogeneity-linearity Stute. The " - "joint Stute variants close the paper Section 4.2 step-2 " - "gap that the overall path explicitly flags as deferred. " - "On survey-weighted fits (survey_design= / survey= / " + "joint Stute pre-trends variant closes the paper Section " + "4.2 step-2 gap ONLY IF the panel carries at least one " + "earlier placebo pre-period beyond the base F-1. With " + "only the base F-1 pre-period present (e.g. 
a minimal " + "valid 3-period event-study fit, or a 4-period fit under " + "trends_lin=True where the consumed F-2 placebo gets " + "dropped), pretrends_joint=None, all_pass=False, and the " + "verdict carries 'joint pre-trends skipped (no earlier " + "pre-period)' - step 2 stays uncovered. On survey-weighted fits (survey_design= / survey= / " "weights=) the workflow skips QUG with a UserWarning " "(permanent Phase 4.5 C0 deferral) and returns a " "linearity-conditional verdict only - so step 1 coverage " diff --git a/tests/test_guides.py b/tests/test_guides.py index afc41251..bb704a6a 100644 --- a/tests/test_guides.py +++ b/tests/test_guides.py @@ -547,6 +547,30 @@ def test_llms_practitioner_step_4_distinguishes_had_from_continuous(self): "ContinuousDiD routing." ) + def test_llms_full_had_pretests_documents_earlier_pre_period_precondition(self): + # Same precondition as the practitioner test: per + # docs/methodology/REGISTRY.md HeterogeneousAdoptionDiD + # § "Assumption 7 / step 2 closure" + had_pretests.py:4738-4756 + + # 2769, aggregate="event_study" closes step 2 ONLY IF the + # panel carries at least one earlier placebo pre-period beyond + # the base F-1. The HAD Pretests section in llms-full.txt must + # document this precondition so agents do not assume any + # multi-period event-study fit closes step 2. + text = get_llm_guide("full") + pretests_start = text.index("## HAD Pretests") + pretests_end = text.index("## Honest DiD", pretests_start) + pretests_block = text[pretests_start:pretests_end] + lower = pretests_block.lower() + assert "earlier" in lower and ("pre-period" in lower or "placebo" in lower), ( + "HAD Pretests section must document the 'earlier pre-period' " + "precondition for step-2 closure on the event-study path." + ) + assert "skipped" in lower or "pretrends_joint=none" in lower, ( + "HAD Pretests section must surface the " + "'joint pre-trends skipped' / pretrends_joint=None fallback " + "when no earlier pre-period exists." 
+ ) + def test_llms_full_had_pretests_assumption_labels_correct(self): # Per docs/methodology/REGISTRY.md HeterogeneousAdoptionDiD # § "Assumptions / Theorems / Estimators": diff --git a/tests/test_practitioner.py b/tests/test_practitioner.py index 836a76b9..2c9c5813 100644 --- a/tests/test_practitioner.py +++ b/tests/test_practitioner.py @@ -779,6 +779,43 @@ def test_had_results_dataclass_docstrings_match_weighted_mass_point_contract(sel "for mass-point fits." ) + def test_had_step_3_documents_earlier_pre_period_precondition_for_step_2( + self, mock_had_results, mock_had_event_study_results + ): + # Per docs/methodology/REGISTRY.md HeterogeneousAdoptionDiD + # § "Assumption 7 / step 2 closure" + had_pretests.py:4738-4756 + + # 2769: aggregate="event_study" closes step 2 ONLY IF the panel + # carries at least one earlier placebo pre-period beyond the + # base F-1. With only F-1 available the workflow sets + # pretrends_joint=None, all_pass=False, and the verdict carries + # 'joint pre-trends skipped (no earlier pre-period)'. Both HAD + # handler variants must surface this precondition - otherwise + # agents reading the guidance can think any multi-period + # event-study fit closes step 2 when it does not. + for fixture in (mock_had_results, mock_had_event_study_results): + output = practitioner_next_steps(fixture, verbose=False) + step_3_steps = [s for s in output["next_steps"] if s["baker_step"] == 3] + assert len(step_3_steps) == 1 + text = (step_3_steps[0].get("why", "") + " " + step_3_steps[0].get("code", "")).lower() + # Must mention "earlier" pre-period / placebo precondition. + assert "earlier" in text and ("pre-period" in text or "placebo" in text), ( + "Step-3 text must mention the 'earlier pre-period' " + "precondition for closing Assumption 7 / step 2 on the " + "event-study path. 
With only the base F-1 pre-period " + "the workflow returns pretrends_joint=None and the " + "verdict carries 'joint pre-trends skipped (no earlier " + "pre-period)' - step 2 stays uncovered." + ) + # Must mention the skip-fallback verdict so agents know + # what to expect when the precondition fails. + assert "skipped" in text or "pretrends_joint=none" in text, ( + "Step-3 text must surface the 'joint pre-trends skipped' " + "/ pretrends_joint=None fallback when no earlier " + "pre-period exists - otherwise agents cannot tell " + "whether step 2 was actually covered on a minimal " + "event-study fit." + ) + def test_had_step_3_flags_qug_under_survey_deferral( self, mock_had_results, mock_had_event_study_results ): From b2bfdd09d4f3fc7253a742f7b16aec8e6c040cae Mon Sep 17 00:00:00 2001 From: igerber Date: Sat, 9 May 2026 12:38:17 -0400 Subject: [PATCH 9/9] Address PR #402 R8 review (1 P3, to_dict() docstring drift) P3 doc drift: PR #402 R3 fixed llms-full.txt, R5 fixed the dataclass field docstrings, but HeterogeneousAdoptionDiDResults.to_dict() still described variance_formula as continuous-only ("pweight" / "survey_binder_tsl") and omitted the mass-point Wald-IV effective_dose_mean semantics. Three internal source-of-truth surfaces were now disagreeing about the same public result object's to_dict() output shape. Updated to_dict() docstring to enumerate all four variance_formula labels (pweight, survey_binder_tsl, pweight_2sls, survey_binder_tsl_2sls) and to describe the per-design effective_dose_mean semantics (continuous mean of D / D - d_lower vs mass-point weighted Wald-IV dose gap mean(D | Z=1, w) - mean(D | Z=0, w)). Mirrors the field-docstring contract from R5. Tests added (1 new, 95 total): - test_had_results_to_dict_docstring_matches_weighted_mass_point_contract: reads HeterogeneousAdoptionDiDResults.to_dict.__doc__ and asserts it enumerates all four variance_formula labels and the mass-point Wald-IV effective_dose_mean semantics. 
Mirrors the existing dataclass-field-docstring lock. Co-Authored-By: Claude Opus 4.7 (1M context) --- diff_diff/had.py | 15 +++++++++++++-- tests/test_practitioner.py | 38 ++++++++++++++++++++++++++++++++++++++ 2 files changed, 51 insertions(+), 2 deletions(-) diff --git a/diff_diff/had.py b/diff_diff/had.py index 3a7aef74..6a717bcb 100644 --- a/diff_diff/had.py +++ b/diff_diff/had.py @@ -483,9 +483,20 @@ def to_dict(self) -> Dict[str, Any]: ``design_effect`` / ``sum_weights`` / ``weight_range`` + ``n_strata`` / ``n_psu`` / ``df_survey`` (latter three ``None`` on the ``weights=`` shortcut). - - ``variance_formula``: ``"pweight"`` or ``"survey_binder_tsl"``. + - ``variance_formula``: HAD-specific SE label, populated on BOTH + continuous and mass-point designs (Phase 4.5 A / B): + ``"pweight"`` (continuous, weighted-robust CCT 2014 under + ``weights=``), ``"survey_binder_tsl"`` (continuous, Binder + 1983 TSL under ``survey_design=``), ``"pweight_2sls"`` + (mass-point, weighted 2SLS HC1/CR1 sandwich under ``weights=``), + or ``"survey_binder_tsl_2sls"`` (mass-point, Binder 1983 TSL + under ``survey_design=``). See the field docstring above for + the full contract. - ``effective_dose_mean``: weighted denominator used by the - beta-scale rescaling.""" + beta-scale rescaling - weighted ``mean(D)`` on + ``continuous_at_zero``, weighted ``mean(D - d_lower)`` on + ``continuous_near_d_lower``, or the weighted Wald-IV dose gap + ``mean(D | Z=1, w) - mean(D | Z=0, w)`` on ``mass_point``.""" return { "att": self.att, "se": self.se, diff --git a/tests/test_practitioner.py b/tests/test_practitioner.py index 2c9c5813..0db8d02e 100644 --- a/tests/test_practitioner.py +++ b/tests/test_practitioner.py @@ -730,6 +730,44 @@ def test_had_event_study_sup_t_snippet_uses_hc1_for_mass_point_survey_compatibil "fail at fit time on a mass-point panel." 
) + def test_had_results_to_dict_docstring_matches_weighted_mass_point_contract(self): + # Parallel to the dataclass-field-docstring regression below: + # PR #402 R8 P3 caught that HeterogeneousAdoptionDiDResults.to_dict() + # docstring still described variance_formula as continuous-only + # / "pweight" or "survey_binder_tsl", contradicting the field + # docstrings (fixed in R5) and llms-full.txt (fixed in R3). + # Lock the to_dict() docstring against drift back. + from diff_diff.had import HeterogeneousAdoptionDiDResults + + doc = HeterogeneousAdoptionDiDResults.to_dict.__doc__ or "" + for label in ( + "pweight", + "survey_binder_tsl", + "pweight_2sls", + "survey_binder_tsl_2sls", + ): + assert label in doc, ( + f"HeterogeneousAdoptionDiDResults.to_dict() docstring " + f"must enumerate the {label!r} variance_formula label - " + f"weighted mass-point fits populate pweight_2sls / " + f"survey_binder_tsl_2sls per had.py:3585-3629. The " + f"to_dict() docstring is a public source-of-truth " + f"surface and must match the field docstrings + " + f"llms-full.txt HAD section." + ) + # effective_dose_mean: must mention mass-point Wald-IV semantics. + assert "mass_point" in doc or "mass-point" in doc, ( + "HeterogeneousAdoptionDiDResults.to_dict() docstring must " + "describe the mass-point effective_dose_mean semantics; " + "weighted mass-point fits populate it as the weighted " + "Wald-IV dose gap per had.py:3642-3660." + ) + assert "Wald-IV" in doc or "Z=1" in doc, ( + "HeterogeneousAdoptionDiDResults.to_dict() docstring must " + "describe the weighted Wald-IV dose gap semantics for " + "mass-point fits." + ) + def test_had_results_dataclass_docstrings_match_weighted_mass_point_contract(self): # PR #402 R3 fixed the llms-full.txt field descriptions to # acknowledge that weighted mass-point fits populate