
HAD Phase 5 wave 1: agent-facing surfaces (_handle_had + llms-full.txt)#402

Merged
igerber merged 9 commits into main from had-phase-5-agent-surfaces
May 9, 2026

Conversation

@igerber
Owner

@igerber igerber commented May 9, 2026

Summary

  • Adds _handle_had and _handle_had_event_study to practitioner.py::_HANDLERS, routing both HAD result classes through HAD-specific Baker et al. (2025) step guidance (5 steps: pretest workflow, sibling-estimator routing, bandwidth/sup-t bands, event-study disaggregation, design-auto-detection + last-cohort-only-WAS framing).
  • Symmetric pair: _handle_continuous gains a Step-4 nudge to HeterogeneousAdoptionDiD for ContinuousDiD users on no-untreated panels, closing the routing loop in both directions.
  • Extends _check_nan_att with an ndarray branch (lazy numpy import, np.all(np.isnan(arr)) semantics so partial-NaN arrays don't over-fire). Scalar path bit-exact preserved across all 12 untouched handlers.
  • Adds full HAD section + result-class blocks + ## HAD Pretests index (7 pretest entry points) + Choosing-an-Estimator row to diff_diff/guides/llms-full.txt. Tightens the existing Continuous treatment intensity Choosing row with (some units untreated) for explicit contrast with the new HAD row.
  • 21 new tests (14 in TestHADDispatch + 6 in TestLLMsFullHADCoverage + 1 fixture-minimality regression locking the "handlers are STRING-ONLY at runtime" stability invariant).
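The string-keyed dispatch described above can be sketched as follows. All handler bodies and guidance strings here are illustrative stand-ins, not the library's actual code; only the names _HANDLERS, _handle_had, _handle_continuous, and practitioner_next_steps come from the PR text:

```python
# Hypothetical sketch of a string-keyed handler registry. Keying on the
# result class *name* keeps handlers "STRING-ONLY at runtime": dispatch
# never needs to import the result classes themselves.

def _handle_had(result):
    # Stand-in for the HAD-specific step guidance described above.
    return ["run pretest workflow", "check estimator routing"]

def _handle_continuous(result):
    # Stand-in for the Step-4 nudge toward HeterogeneousAdoptionDiD.
    return ["consider HeterogeneousAdoptionDiD on no-untreated panels"]

_HANDLERS = {
    "HeterogeneousAdoptionDiDResults": _handle_had,
    # the PR adds a dedicated _handle_had_event_study; reusing one here
    # keeps the sketch short
    "HeterogeneousAdoptionDiDEventStudyResults": _handle_had,
    "ContinuousDiDResults": _handle_continuous,
}

def practitioner_next_steps(result):
    # Dispatch on the class-name string, with a generic fallback.
    handler = _HANDLERS.get(type(result).__name__)
    if handler is None:
        return ["no estimator-specific guidance available"]
    return handler(result)
```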

Closes the Phase 5 "agent surfaces" gap; T21 pretest tutorial and T22 weighted/survey tutorial remain queued as separate notebook PRs.

Test plan

  • pytest tests/test_practitioner.py -v (47 passed; 16 new HAD tests)
  • pytest tests/test_guides.py -v (31 passed; 6 new HAD coverage tests)
  • black diff_diff/practitioner.py tests/test_practitioner.py tests/test_guides.py
  • ruff check diff_diff/practitioner.py tests/test_practitioner.py tests/test_guides.py
  • mypy diff_diff/practitioner.py (clean; pre-existing prep_dgp.py errors unrelated)
  • CI green after ready-for-ci label

🤖 Generated with Claude Code

Add _handle_had + _handle_had_event_study to practitioner.py, routing
both HeterogeneousAdoptionDiD result classes through HAD-specific Baker
et al. (2025) step guidance: did_had_pretest_workflow (step 3),
ContinuousDiD/CallawaySantAnna routing nudge (step 4),
bandwidth_diagnostics + simultaneous bands (step 6), per-horizon WAS
event-study disaggregation (step 7), design-auto-detection +
last-cohort-only-WAS framing (step 8).

Symmetric pair: _handle_continuous gains Step-4 routing to HAD on
no-untreated panels - the HAD <-> ContinuousDiD routing loop is now
bidirectional.

Extend _check_nan_att with ndarray branch (lazy numpy import +
np.all(np.isnan(arr)) semantics so partial-NaN arrays don't over-fire
the warning). Scalar path bit-exact preserved across all 12 untouched
handlers.

Add full HAD section + result-class blocks + ## HAD Pretests index
covering all 7 pretest entry points + Choosing-an-Estimator row to
diff_diff/guides/llms-full.txt (the bundled-in-wheel agent reference).
Tighten the existing Continuous treatment intensity Choosing row with
"(some units untreated)" so the HAD vs ContinuousDiD contrast is
explicit. Framing: "no untreated unit" / dose variation, never "no
comparison group" - locked by negative-assertion tests on both
handler text and llms-full.txt section.

docs/doc-deps.yaml: remove the llms-full.txt deferral note on had.py
and add llms-full.txt entries to had.py, had_pretests.py, and
practitioner.py blocks.

21 new tests (14 in tests/test_practitioner.py::TestHADDispatch +
6 in tests/test_guides.py::TestLLMsFullHADCoverage + 1 fixture-minimality
regression locking the "handlers are STRING-ONLY at runtime" stability
invariant).

Closes the Phase 5 "agent surfaces" gap; T21 pretest tutorial and T22
weighted/survey tutorial remain queued as separate notebook PRs.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@github-actions

github-actions Bot commented May 9, 2026

Overall Assessment

⚠️ Needs changes

Executive Summary

  • P1 The new HAD estimator-selection guidance now says “if untreated units exist, switch away from HAD,” but the registry and had.py explicitly allow a small untreated share and retain never-treated units on the staggered event-study path.
  • P2 llms-full.txt documents a HAD constructor/fit signature that does not exist (h, b, rcond; aggregate=None) instead of the real public API.
  • P2 Two new practitioner snippets use undefined names (result.bandwidth_diagnostics, survey_design=design), and the new tests only check keyword presence, not whether the advertised snippets/signatures are valid.
  • _check_nan_att’s ndarray handling itself looks correct and consistent with the HAD event-study NaN contract; I did not find an inference/SE regression in the live code path.

Methodology

Code Quality

  • P2 Two new practitioner snippets are internally inconsistent: result.bandwidth_diagnostics uses an undefined singular name, and the sup-t example passes survey_design=design with no design object in scope (diff_diff/practitioner.py:L905-L920, diff_diff/practitioner.py:L1029-L1048). Impact: copy/paste of the advertised guidance fails immediately. Concrete fix: use declared placeholders (results.bandwidth_diagnostics, survey_design=sd or similar) and add a compile/smoke test for emitted code blocks.

Performance

No findings.

Maintainability

No additional findings beyond the doc/test issues below.

Tech Debt

No findings. The TODO.md change properly tracks the remaining tutorial follow-up rather than leaving it implicit.

Security

No findings.

Documentation/Tests

  • P2 The new llms-full.txt HAD API block does not match the actual public API. The guide invents constructor args h, b, and rcond, omits real args d_lower, kernel, vcov_type, robust, and cluster, and documents aggregate: str | None = None instead of the real aggregate="overall" default (diff_diff/guides/llms-full.txt:L597-L625, diff_diff/had.py:L2641-L2651, diff_diff/had.py:L2786-L2806, diff_diff/had.py:L2845-L2906). Impact: agents using the bundled guide will generate failing HAD calls. Concrete fix: sync the guide to the real signatures/defaults and add a regression test that compares the documented parameter list to inspect.signature(HeterogeneousAdoptionDiD.__init__) and .fit().
  • P2 The new tests lock in the bad framing and only assert substring presence, not semantic correctness or snippet validity (tests/test_guides.py:L280-L344, tests/test_practitioner.py:L499-L610). Impact: CI currently blesses the methodology drift above and would not catch the signature/snippet errors. Concrete fix: replace the “must say no untreated/universal rollout” assertion with checks aligned to the registry contract, and add signature/snippet smoke tests.

Path to Approval

  1. Rewrite HAD selection guidance in practitioner.py and llms-full.txt so it matches the registry: HAD can coexist with a small untreated share and, on staggered event-study panels, with last-cohort-plus-never-treated filtering.
  2. Correct the llms-full.txt HAD constructor/fit blocks to the actual had.py signatures and defaults.
  3. Fix the broken practitioner snippets and add tests that validate semantics/signatures/snippet executability, not just keyword presence.

P1 methodology fix: Step-4 routing in _handle_had + _handle_had_event_study
no longer says "switch away from HAD if untreated units exist" - that
contradicts REGISTRY § HeterogeneousAdoptionDiD edge cases (line 2403:
"Authors do NOT require untreated units to be dropped"; line 2408 +
had.py:1325: never-treated units RETAINED on staggered event-study).
Reframed as the actual estimand differentiator: HAD targets WAS at the
dose support boundary; ContinuousDiD targets per-dose ATT(d) / ACRT(d)
and requires never-treated controls. Routing fires only when the user
wants the ATT(d) estimand AND has never-treated controls, not on
"untreated units exist". Tightens the corresponding Choosing-an-Estimator
table row to surface WAS vs ATT(d) as the differentiator.

P2 (a) signatures: llms-full.txt HAD constructor + fit() blocks now
match the actual HeterogeneousAdoptionDiD.__init__ / .fit signatures
exactly. Drops invented kwargs (h, b, rcond) and adds the real ones
(d_lower, kernel, vcov_type, robust, cluster). aggregate default
corrected from None to "overall". fit() now lists survey, weights,
cband (positional-or-keyword) and survey_design + trends_lin
(keyword-only).

P2 (b) snippet bugs: result.bandwidth_diagnostics -> results.bandwidth_diagnostics
(matching the plural convention of other handlers); sup-t snippet now
imports SurveyDesign and constructs sd before passing
survey_design=sd (was survey_design=design with no design defined).

P2 (c) tests: New TestLLMsFullHADCoverage tests use
inspect.signature(HeterogeneousAdoptionDiD.__init__) and .fit() to
regress the documented signatures against the real API. New
test_llms_full_had_section_methodology_compatible_with_untreated locks
the negative assertion that the section does NOT carry framing
contradicting the registry. Practitioner tests gain
test_had_step_4_does_not_misframe_untreated_unit_routing +
test_had_handler_snippets_are_valid_python_syntax (catches snippet
syntax errors via ast.parse) +
test_handle_continuous_step_4_snippet_is_valid_python.
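The two test patterns above (signature regression via inspect.signature, snippet validity via ast.parse) can be sketched generically. The fit stub and documented-parameter set below are hypothetical stand-ins, not diff_diff's real API:

```python
import ast
import inspect

def fit(data, aggregate="overall", *, survey_design=None):
    """Stand-in for the real estimator's fit(); the params are illustrative."""

# What a guide like llms-full.txt might claim the signature is.
DOCUMENTED_PARAMS = {"data", "aggregate", "survey_design"}

def documented_params_match(func, documented):
    # Regress the documented parameter list against the real signature,
    # so invented kwargs (e.g. h, b, rcond) fail the comparison.
    real = set(inspect.signature(func).parameters)
    return real == documented

def snippet_is_valid_python(snippet):
    # Catch snippet syntax errors (e.g. a dangling `kwarg=`) at test time.
    try:
        ast.parse(snippet)
        return True
    except SyntaxError:
        return False
```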

83 tests pass (47 in test_practitioner including 5 new + 36 in
test_guides including 9 new).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@igerber
Owner Author

igerber commented May 9, 2026

/ai-review

@github-actions

github-actions Bot commented May 9, 2026

🔁 AI review rerun (requested by @igerber)

Head SHA: 6d6e950e322443f19131aacb0e855f37e367172a


Overall Assessment

⚠️ Needs changes

Executive Summary

  • The prior re-review blockers look resolved: the untreated-unit/HAD routing now matches the registry, the HAD ctor/fit signatures in llms-full.txt match had.py, and the broken practitioner snippets were fixed.
  • P1 — The new HAD pretest guidance misstates what the shipped tests actually test: QUG is documented as “Assumption 5,” stute_test() as “Assumption 7,” and _handle_had() implies the default two-period workflow covers Assumption 7 even though the actual workflow explicitly does not.
  • P2 — The new HAD result-class blocks in llms-full.txt are not API-complete, and the new tests do not protect that surface from drift.
  • The remaining tutorial follow-up is properly tracked in TODO.md, so that part is non-blocking.

Methodology

  • P1 — diff_diff/practitioner.py:L857-L865 and diff_diff/guides/llms-full.txt:L1417-L1419 misstate the paper/registry contract. qug_test() is the support-infimum/QUG test (H0: d_lower = 0), not “Assumption 5”; stute_test() is the Assumption 8 linearity test, not an Assumption 7 mean-independence test; and did_had_pretest_workflow(..., aggregate="overall") runs steps 1 and 3 only, with Assumption 7/step 2 explicitly absent on that path. The shipped source and registry say this directly at diff_diff/had_pretests.py:L1291-L1373, diff_diff/had_pretests.py:L1565-L1668, diff_diff/had_pretests.py:L4434-L4449, docs/methodology/REGISTRY.md:L2265-L2271, and docs/methodology/REGISTRY.md:L2480-L2489. Impact: the new agent-facing surfaces can tell users that the default two-period workflow validated Assumption 7 or Design 1 Assumption 5 when it did not, which is a methodology misrepresentation on the identification-check surface. Concrete fix: rewrite the overall HAD pretest descriptions so they say “QUG/support-infimum test + Assumption 8 linearity tests; Assumption 7 pre-trends is only covered on aggregate="event_study",” and remove the Assumption 5 mapping from QUG.

Code Quality

No findings.

Performance

No findings.

Maintainability

No findings.

Tech Debt

  • P3 — TODO.md:L112 properly tracks the remaining Phase 5 tutorial follow-up (T21/T22). Impact: none; this is acceptable deferred work under the project rules. Concrete fix: none required in this PR.

Security

No findings.

Documentation/Tests

  • P2 — The new llms-full.txt result-class tables are incomplete relative to the actual HAD result dataclasses. The single-period block omits real fields such as n_mass_point, n_above_d_lower, cluster_name, bias_corrected_fit, variance_formula, and effective_dose_mean; the event-study block omits vcov_type, cluster_name, bandwidth_diagnostics, bias_corrected_fit, and filter_info. Compare diff_diff/guides/llms-full.txt:L1227-L1284 with diff_diff/had.py:L330-L355 and diff_diff/had.py:L684-L733. The new tests only check section presence, untreated-unit framing, and signature params, not result-field completeness or assumption labeling (tests/test_guides.py:L285-L358, tests/test_practitioner.py:L510-L688). Impact: agents treating llms-full.txt as the authoritative surface will miss available diagnostics/metadata, and CI will not catch further drift on these blocks. Concrete fix: either make the tables exhaustive or label them as partial summaries, and add a regression that compares the documented field list against the actual dataclass fields or to_dict() keys.
  • Could not execute pytest in this sandbox: pytest is not installed here, and importing diff_diff also fails because numpy is unavailable. This review is therefore static.

Path to Approval

  1. Correct the HAD pretest wording in diff_diff/practitioner.py and diff_diff/guides/llms-full.txt so the test/assumption mapping matches had_pretests.py and REGISTRY.md: QUG = support-infimum test, stute_test()/yatchew_hr_test() = Assumption 8 linearity, and Assumption 7 is only covered on the event-study path.
  2. Add regression tests that fail if the docs or practitioner text again claim that qug_test() is Assumption 5, stute_test() is Assumption 7, or did_had_pretest_workflow(..., aggregate="overall") covers step 2/Assumption 7.

P1 pretest assumption labels: _handle_had step-3 + llms-full.txt HAD
Pretests section + qug_test/stute_test bullets misstated which paper
assumption each shipped test actually tests:
- qug_test was labeled "Assumption 5 support condition", but QUG tests
  H_0: d_lower = 0 (paper Theorem 4 / step 1 of the workflow).
  Assumption 5 is the Design 1 sign-identification condition and is
  NOT testable via pre-trends per REGISTRY.md:2270.
- stute_test was labeled "Assumption 7 mean-independence", but
  stute_test is the Assumption 8 linearity test (paper Section 4.2
  step 3 / Appendix D). Assumption 7 is pre-trends (step 2).
- did_had_pretest_workflow(aggregate="overall") was implied to cover
  step 2, but the workflow runs steps 1 + 3 only - step 2 is
  explicitly not covered on the overall path (had_pretests.py:4434-4441
  + the workflow's verdict flags the gap).

Rewrote both surfaces to match the actual contracts: QUG = paper
Theorem 4 support-infimum test (step 1, decides Design 1' vs Design 1);
Stute / Yatchew-HR = Assumption 8 linearity tests (step 3); Assumption
7 step 2 closure requires aggregate="event_study" (joint Stute
pre-trends). Assumption 7 / step 2 gap is explicitly flagged on the
overall path so agents do not assume coverage where there is none.

P2 result-class field tables incomplete: HeterogeneousAdoptionDiDResults
table was missing n_mass_point, n_above_d_lower, cluster_name,
bias_corrected_fit, variance_formula, effective_dose_mean.
HeterogeneousAdoptionDiDEventStudyResults table was missing vcov_type,
cluster_name, bandwidth_diagnostics, bias_corrected_fit, filter_info.
Added all missing fields with correct types and descriptions.

Tests added (3 new, 86 total):
- test_llms_full_had_results_class_field_lists_match_real_dataclass:
  uses dataclasses.fields() to enumerate every public field on both
  result classes and assert each appears in the documented table.
  Catches future drift where new fields land but the guide is not
  updated.
- test_llms_full_had_pretests_assumption_labels_correct: scans the
  qug_test and stute_test bullets in the HAD Pretests section and
  enforces positive labels (support-infimum / Theorem 4 / linearity)
  + forbids positive Assumption-5 / Assumption-7 misclaims (negative
  disclaimers like "QUG does NOT test Assumption 5" remain allowed).
- test_had_step_3_pretest_assumption_labels_correct: same checks on
  the practitioner.py _handle_had step-3 why-text; also requires
  positive acknowledgment of the Assumption 7 / step 2 gap on the
  overall workflow path.
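The dataclasses.fields() drift check described in the first test above can be sketched with a stand-in result dataclass and guide text (illustrative names only):

```python
from dataclasses import dataclass, fields

@dataclass
class DemoResults:
    # Stand-in result class; the real HAD result classes have many more fields.
    att: float
    vcov_type: str
    cluster_name: str = ""

def undocumented_fields(cls, guide_text):
    # Enumerate every public dataclass field and report any that the
    # documented table text does not mention. New fields that land
    # without a guide update show up here, catching future drift.
    public = [f.name for f in fields(cls) if not f.name.startswith("_")]
    return [name for name in public if name not in guide_text]
```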

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@igerber
Owner Author

igerber commented May 9, 2026

/ai-review

@github-actions

github-actions Bot commented May 9, 2026

🔁 AI review rerun (requested by @igerber)

Head SHA: ed2842a83881f1ffdc050862159357fdd47578f8


Overall Assessment

✅ Looks good

Executive Summary

  • The prior re-review methodology blocker is resolved: the new HAD practitioner text and llms-full pretest section now correctly map QUG to the support-infimum test, Stute/Yatchew to Assumption 8, and the aggregate="overall" workflow to paper steps 1 + 3 only.
  • The new HAD handler dispatch, ndarray all-NaN warning path, and llms-full dataclass-field coverage are consistent with had.py and had_pretests.py.
  • P2 — diff_diff/guides/llms-full.txt still misdocuments the single-period HAD variance_formula and effective_dose_mean fields for weighted mass_point fits; the code populates 2SLS-specific values there.
  • P2 [Newly identified] — the official practitioner guide returned by get_llm_guide("practitioner") still routes all continuous-intensity designs to ContinuousDiD and omits HAD, so the new runtime estimator-selection guidance is not propagated across all agent-facing workflow docs.
  • The Phase 5 tutorial follow-up is properly tracked in TODO.md, so that part is non-blocking.

Methodology

  • No findings. The previous assumption-label mismatch is fixed in diff_diff/practitioner.py:L857-L871 and diff_diff/guides/llms-full.txt:L1407-L1438, which now match diff_diff/had_pretests.py:L4435-L4449 and docs/methodology/REGISTRY.md:L2480-L2501.

Code Quality

  • No findings.

Performance

  • No findings.

Maintainability

  • No findings.

Tech Debt

  • P3 — The remaining HAD tutorial work is appropriately tracked in TODO.md:L112-L112. Impact: none. Concrete fix: none required in this PR.

Security

  • No findings.

Documentation/Tests

  • P2 — The new single-period HAD results table still understates weighted mass_point metadata. diff_diff/guides/llms-full.txt:L1249-L1255 says variance_formula and effective_dose_mean are continuous-path-only / None on mass-point fits, but the actual static fit populates pweight_2sls / survey_binder_tsl_2sls and a weighted Wald-IV dose-gap on weighted mass-point paths in diff_diff/had.py:L3585-L3628 and diff_diff/had.py:L3642-L3659. Impact: agents using llms-full.txt as the authoritative result-surface guide will misread available inference metadata on weighted mass-point fits. Concrete fix: update those two field descriptions to include the 2SLS labels and weighted mass-point denominator semantics, and extend the guide regression to check these semantics instead of only field-name presence.
  • P2 [Newly identified] — The practitioner workflow guide is still stale for continuous-treatment estimator selection. diff_diff/_guides_api.py:L7-L10 exposes llms-practitioner.txt as the official "practitioner" guide, and diff_diff/guides/llms-full.txt:L34-L37 tells users to consult it, but diff_diff/guides/llms-practitioner.txt:L151-L162 still routes every continuous-intensity design to ContinuousDiD. That now conflicts with the new runtime reroute in diff_diff/practitioner.py:L729-L749 and with ContinuousDiD's never-treated requirement in diff_diff/continuous_did.py:L348-L360. Impact: agent consumers following the official workflow guide can still be steered to the wrong estimator on universal-rollout / no-untreated panels. Concrete fix: update the Step 4 decision tree and surrounding workflow prose to distinguish ContinuousDiD from HeterogeneousAdoptionDiD, and add a regression on get_llm_guide("practitioner") coverage.

Execution note: this was a static review only. I could not run pytest in this sandbox because pytest is not installed, and importing diff_diff fails here because numpy is unavailable.

P2 (a) variance_formula / effective_dose_mean field descriptions in
the single-period results table understated weighted mass-point fits.
The single-period table previously said both fields were
continuous-path-only / None on mass-point fits, but per had.py:3585-3629
weighted mass-point fits populate variance_formula in {"pweight_2sls",
"survey_binder_tsl_2sls"} and effective_dose_mean as the weighted
Wald-IV dose gap. Both rows now enumerate all four variance_formula
labels and describe the mass-point Wald-IV dose-gap denominator
explicitly.

P2 (b) llms-practitioner.txt Step 4 decision tree was stale: it routed
ALL continuous-intensity designs to ContinuousDiD, omitting HAD. That
contradicted the new runtime estimator-selection guidance and would
steer agents to the wrong estimator on universal-rollout panels.
Updated the continuous branch to distinguish ContinuousDiD (per-dose
ATT(d) / ACRT(d), requires never-treated controls) from
HeterogeneousAdoptionDiD (WAS at dose support boundary, compatible with
universal rollout or small never-treated share).

Tests added (2 new, 88 total):
- test_llms_full_had_variance_formula_describes_all_designs: scans the
  variance_formula and effective_dose_mean rows in the single-period
  HAD results table; asserts all four variance_formula labels (pweight,
  survey_binder_tsl, pweight_2sls, survey_binder_tsl_2sls) and the
  mass-point Wald-IV dose-gap semantics are present.
- test_llms_practitioner_step_4_distinguishes_had_from_continuous:
  parses the Step 4 decision tree from get_llm_guide("practitioner");
  asserts both ContinuousDiD and HeterogeneousAdoptionDiD appear in the
  continuous branch and that never-treated / universal-rollout framing
  is the routing distinguisher.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@igerber
Owner Author

igerber commented May 9, 2026

/ai-review

@github-actions

github-actions Bot commented May 9, 2026

🔁 AI review rerun (requested by @igerber)

Head SHA: 00ed3ab37edaf51ce02aad6ae618564d5eb1fbed


Overall Assessment

⚠️ Needs changes

Executive Summary

  • The prior re-review doc issues are mostly resolved: llms-full.txt now documents weighted mass-point HAD metadata in diff_diff/guides/llms-full.txt:L1254-L1255, and the official practitioner guide now distinguishes ContinuousDiD from HeterogeneousAdoptionDiD in diff_diff/guides/llms-practitioner.txt:L160-L168.
  • P1 — the new HAD Step-3 runtime guidance overstates pretest coverage on weighted/survey fits. practitioner_next_steps() says did_had_pretest_workflow runs QUG on the overall and event-study paths, but the actual workflow skips QUG whenever survey_design= / survey= / weights= is used.
  • The ndarray branch in _check_nan_att() looks correct: it preserves the old scalar path and warns only on all-NaN arrays, avoiding false positives on partial-NaN HAD event-study output in diff_diff/practitioner.py:L1176-L1218.
  • P3 — docs/methodology/REGISTRY.md still marks the HAD practitioner integration / llms-full section as pending, so the authoritative registry is stale relative to this PR and TODO.md:L112-L112.

Methodology

  • Severity: P1. diff_diff/practitioner.py:L857-L882 and diff_diff/practitioner.py:L997-L1014 state that did_had_pretest_workflow runs QUG on the overall and event-study paths. But the actual workflow explicitly skips QUG under survey_design= / survey= / weights= and returns only a linearity-conditional verdict per diff_diff/had_pretests.py:L4488-L4495 and docs/methodology/REGISTRY.md:L2445-L2450. Impact: agent consumers of the new runtime HAD guidance can be told that the support-infimum / Design-1-vs-Design-1' step was checked on weighted designs when the library deliberately cannot check it there. Concrete fix: update both HAD Step-3 messages and snippets to say that QUG coverage applies only on unweighted paths, and explicitly note that weighted paths skip QUG and require a linearity-conditional interpretation; add a regression covering this wording.

Code Quality

No findings.

Performance

No findings.

Maintainability

No findings.

Tech Debt

No findings. The remaining tutorial follow-up is appropriately tracked in TODO.md:L112-L112.

Security

No findings.

Documentation/Tests

  • Severity: P3. docs/methodology/REGISTRY.md:L2507-L2555 still says the HAD practitioner_next_steps() integration and llms-full.txt HAD section are pending, even though this PR lands them and TODO.md:L112-L112 reflects that. Impact: the Methodology Registry is now out of sync with the shipped agent-facing surface, which weakens it as the source of truth for future reviews. Concrete fix: update the Phase 5 notes/checklist in REGISTRY.md to reflect the landed wave-1 surfaces and leave only the tutorial follow-ups pending.
  • Static review only: I could not run pytest here because pytest is not installed in this sandbox.

Path to Approval

  1. Amend the HAD Step-3 guidance in diff_diff/practitioner.py:L857-L882 and diff_diff/practitioner.py:L997-L1014 so it explicitly states that survey_design= / survey= / weights= skip QUG and yield only a linearity-conditional verdict.
  2. Add targeted regressions in tests/test_practitioner.py asserting that both HAD handler variants mention the QUG-under-survey deferral and do not imply full step-1 coverage on weighted paths.

P1 HAD Step-3 overstated pretest coverage on weighted/survey fits:
practitioner_next_steps() said did_had_pretest_workflow runs QUG on
both the overall and event-study paths without noting that the workflow
explicitly skips QUG whenever survey_design= / survey= / weights= is
supplied (Phase 4.5 C0 deferral, had_pretests.py:4488-4495 + REGISTRY
§ "QUG Null Test" Note (Phase 4.5 C0)). On weighted fits the workflow
emits a UserWarning and returns a linearity-conditional verdict only.

Both _handle_had and _handle_had_event_study Step-3 why-text + code
snippets now explicitly state that survey-weighted fits skip QUG and
yield a linearity-conditional verdict (the weighted verdict is
conditional on QUG holding by assumption). The event-study text also
notes that joint Stute pre-trends and joint homogeneity-linearity
themselves remain available under survey weighting via the PSU-level
Mammen multiplier bootstrap.

P3 REGISTRY § HeterogeneousAdoptionDiD requirements checklist was
stale: marked "Phase 5: practitioner_next_steps() integration" and
"Phase 5 (remaining): llms-full.txt section" as pending. Updated to
reflect this PR landing wave 1 of Phase 5; only T21 (HAD pretest
workflow tutorial) and T22 (weighted/survey HAD tutorial) remain
queued, both tracked in TODO.md.

Tests added (1 new, 89 total):
- test_had_step_3_flags_qug_under_survey_deferral: asserts both HAD
  handler variants surface the QUG-under-survey skip and the
  linearity-conditional-verdict caveat. Without this caveat agents may
  assume step 1 / Design 1' vs Design 1 was checked on weighted fits
  when the library deliberately does not check it there.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@igerber
Owner Author

igerber commented May 9, 2026

/ai-review

@github-actions

github-actions Bot commented May 9, 2026

🔁 AI review rerun (requested by @igerber)

Head SHA: d4b909157a6a38dfcf7d41355985efcfd1406154


Overall Assessment

✅ Looks good

Executive Summary

  • Previous P1 resolved: both new HAD Step-3 handler variants now explicitly state that survey/weighted workflows skip QUG and return only a linearity-conditional verdict, matching the shipped workflow and registry. diff_diff/practitioner.py:L857-L893, diff_diff/practitioner.py:L1007-L1034, diff_diff/had_pretests.py:L4488-L4565
  • The new HAD/ContinuousDiD routing is methodologically consistent with the current implementation: ContinuousDiD still requires a D=0 comparison group in this library, while HAD remains valid with a small never-treated share and retains never-treated units on the staggered event-study path. diff_diff/continuous_did.py:L348-L360, docs/methodology/REGISTRY.md:L721-L723, docs/methodology/REGISTRY.md:L2400-L2408, docs/methodology/REGISTRY.md:L2535-L2536
  • _check_nan_att()’s ndarray branch is conservative and correctly scoped: it warns only on all-NaN arrays and avoids false positives on legitimate partial-NaN HAD event-study output. diff_diff/practitioner.py:L1196-L1238
  • Severity: P3. llms-full.txt now correctly documents weighted mass-point variance_formula / effective_dose_mean, but the unchanged single-period result-field docstrings in diff_diff/had.py still describe those fields as continuous-only / None on mass-point. diff_diff/guides/llms-full.txt:L1254-L1255, diff_diff/had.py:L347-L366, implementation diff_diff/had.py:L3565-L3660
  • Static review only: this sandbox lacks numpy and pytest, so I could not run the targeted tests or import the package.

Methodology

  • No findings. The prior QUG-under-survey wording defect is fixed, and the new practitioner/guide text now matches the workflow and registry on step coverage, untreated-unit compatibility, and last-cohort-only event-study framing. diff_diff/practitioner.py:L857-L893, diff_diff/practitioner.py:L1007-L1034, diff_diff/guides/llms-full.txt:L1405-L1440, docs/methodology/REGISTRY.md:L2431-L2450, docs/methodology/REGISTRY.md:L2551-L2554

Code Quality

  • No findings. _check_nan_att() preserves the old scalar path and adds a narrow ndarray fallback without introducing inline inference logic or partial NaN guards. diff_diff/practitioner.py:L1196-L1238

Performance

  • No findings. The new runtime cost on the common path is minimal, and the numpy import is lazy and only taken for array-valued att. diff_diff/practitioner.py:L1217-L1225

Maintainability

  • No findings. Phase-5 bookkeeping is now internally aligned across registry, TODO, and doc dependency mapping. docs/doc-deps.yaml:L383-L405, docs/methodology/REGISTRY.md:L2551-L2554, TODO.md:L112

Tech Debt

  • No findings. The remaining HAD follow-up work is properly tracked as tutorial debt rather than left as an implicit gap. TODO.md:L112

Security

  • No findings.

Documentation/Tests

  • Severity: P3. Impact: llms-full.txt and the actual implementation agree that single-period weighted mass-point HAD fits populate variance_formula and effective_dose_mean, but the unchanged HeterogeneousAdoptionDiDResults field docstrings still say those fields are continuous-only / None on mass-point. That leaves two internal source-of-truth surfaces disagreeing about the same public result object. diff_diff/guides/llms-full.txt:L1254-L1255, diff_diff/had.py:L347-L366, diff_diff/had.py:L3565-L3660
  • Concrete fix: update the variance_formula and effective_dose_mean field docstrings in diff_diff/had.py to include the mass-point labels and weighted Wald-IV dose-gap semantics already implemented; optionally add a doc-drift regression for those descriptions alongside the existing guide tests.

P3 dataclass-docstring drift: PR #402 R3 fixed the llms-full.txt field
descriptions to acknowledge that weighted mass-point HAD fits populate
variance_formula in {"pweight_2sls", "survey_binder_tsl_2sls"} and
effective_dose_mean as the weighted Wald-IV dose gap (per
had.py:3585-3660), but the HeterogeneousAdoptionDiDResults dataclass
field docstrings in had.py:347-366 still said those fields were
continuous-only / None on mass-point - leaving two source-of-truth
surfaces disagreeing about the same public result object.

Updated both field docstrings to enumerate all four variance_formula
labels (continuous + mass-point variants under both `weights=` shortcut
and `survey_design=` paths) and to describe the mass-point weighted
Wald-IV dose-gap denominator semantics (`mean(D | Z=1, w) -
mean(D | Z=0, w)` where Z = 1{D > d_lower}).
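The weighted Wald-IV dose-gap denominator documented above is a simple difference of weighted group means; a minimal numpy sketch (function name hypothetical, semantics as stated in the docstring fix):

```python
import numpy as np

def weighted_wald_iv_dose_gap(d, w, d_lower=0.0):
    """Weighted Wald-IV dose-gap denominator, illustrative sketch.

    Z = 1{D > d_lower} splits units into high- and low-dose groups; the
    gap is the difference in weighted mean dose between the two:
        mean(D | Z=1, w) - mean(D | Z=0, w)
    """
    d = np.asarray(d, dtype=float)
    w = np.asarray(w, dtype=float)
    z = d > d_lower
    return np.average(d[z], weights=w[z]) - np.average(d[~z], weights=w[~z])
```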

Tests added (1 new, 90 total):
- test_had_results_dataclass_docstrings_match_weighted_mass_point_contract:
  uses inspect.getsource(HeterogeneousAdoptionDiDResults) to scan the
  class source and assert the variance_formula docstring mentions both
  pweight_2sls and survey_binder_tsl_2sls labels, and the
  effective_dose_mean docstring mentions mass-point Wald-IV semantics.
  Locks both field docstrings against drift back to the
  continuous-only framing now that the llms-full.txt guide and the
  actual fit() code populate these on mass-point fits.
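The source-scanning lock described above can be illustrated with a toy class (the excerpted class body and field descriptions here are hypothetical; the real test inspects HeterogeneousAdoptionDiDResults from diff_diff.had). Dataclass-style "field docstrings" are bare string literals, so they are invisible to `__doc__` but show up in `inspect.getsource()`:

```python
import inspect

class FakeResults:  # hypothetical stand-in for HeterogeneousAdoptionDiDResults
    variance_formula: str = None
    """Variance label: "pweight", "survey_binder_tsl", "pweight_2sls",
    or "survey_binder_tsl_2sls" (weighted mass-point variants)."""
    effective_dose_mean: float = None
    """Continuous: mean dose; mass-point: weighted Wald-IV dose gap
    mean(D | Z=1, w) - mean(D | Z=0, w) with Z = 1{D > d_lower}."""

def assert_weighted_mass_point_contract(cls):
    # Field "docstrings" are not attached to any object, so scan the source.
    src = inspect.getsource(cls)
    assert "pweight_2sls" in src and "survey_binder_tsl_2sls" in src
    assert "Wald-IV" in src
```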

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@igerber
Owner Author

igerber commented May 9, 2026

/ai-review

@github-actions

github-actions Bot commented May 9, 2026

🔁 AI review rerun (requested by @igerber)

Head SHA: 0ecb635284ed477709c7aa6592da3f435785665d


Overall Assessment

✅ Looks good

Executive Summary

  • The prior P3 from the last review is resolved: the HAD result-class docstrings in diff_diff/had.py:L347-L371 now match the weighted mass-point behavior already documented in diff_diff/guides/llms-full.txt:L1254-L1255.
  • The new HAD practitioner routing is methodologically aligned with the registry: ContinuousDiD still requires never-treated / D=0 support, while HAD remains compatible with a small untreated share and retains never-treated units on the staggered event-study path (diff_diff/practitioner.py:L729-L748, diff_diff/practitioner.py:L897-L923, diff_diff/practitioner.py:L1038-L1063, diff_diff/guides/llms-practitioner.txt:L160-L168, diff_diff/continuous_did.py:L348-L360, docs/methodology/REGISTRY.md:L2400-L2408).
  • The new HAD pretest guidance now uses the right assumption labels and correctly surfaces the survey-weighted QUG deferral (diff_diff/practitioner.py:L857-L893, diff_diff/practitioner.py:L1007-L1034, diff_diff/guides/llms-full.txt:L1405-L1438, docs/methodology/REGISTRY.md:L2548-L2552).
  • _check_nan_att() is a narrow, conservative change: it only warns on all-NaN ndarray att and leaves partial-NaN event-study arrays alone, which matches the documented HAD event-study contract (diff_diff/practitioner.py:L1196-L1235).
  • P3 only: the new weighted HAD event-study sup-t guidance omits the documented mass-point vcov_type='hc1' requirement, so the default survey example can still raise NotImplementedError on mass-point weighted fits.
  • Static review only: I could not run pytest or import the package in this sandbox because pytest and numpy are unavailable.

Methodology

No findings. The changed HAD guidance matches the Methodology Registry on pretest scope, untreated-unit compatibility, and last-cohort-only event-study framing (diff_diff/practitioner.py:L857-L1122, diff_diff/guides/llms-full.txt:L1405-L1438, docs/methodology/REGISTRY.md:L2374-L2382, docs/methodology/REGISTRY.md:L2400-L2408, docs/methodology/REGISTRY.md:L2551-L2552).

Code Quality

No findings. _check_nan_att() remains a presentation-layer warning helper and does not introduce any inline inference anti-pattern; estimator inference still routes through safe_inference() (diff_diff/practitioner.py:L1196-L1235, diff_diff/had.py:L3553-L3562, diff_diff/had.py:L4447-L4455).

Performance

No findings. The added ndarray handling is lazy and only imports numpy on the rare array-valued att path (diff_diff/practitioner.py:L1217-L1228).

Maintainability

No findings. The Phase 5 bookkeeping is internally aligned across registry, TODO, and doc dependency mapping (docs/methodology/REGISTRY.md:L2551-L2554, TODO.md:L112-L112, docs/doc-deps.yaml:L388-L405).

Tech Debt

No findings. Remaining HAD tutorial work is explicitly tracked in TODO.md:L112-L112, so it is non-blocking under the stated review policy.

Security

No findings.

Documentation/Tests

  • Severity: P3. Impact: diff_diff/practitioner.py:L1067-L1088 and the new HAD fit docs in diff_diff/guides/llms-full.txt:L616-L630 present weighted event-study sup-t usage as a generic survey_design= / weights= path, but mass-point HAD rejects the default classical variance family whenever the IF matrix is consumed downstream (diff_diff/had.py:L3495-L3507, diff_diff/had.py:L4277-L4291). That restriction is already documented in the registry (docs/methodology/REGISTRY.md:L2380-L2382), so this is informational rather than blocking, but the example as written can still fail on a supported design family.
    Concrete fix: either make the example explicitly continuous-design-only, or add vcov_type='hc1' / robust=True to the weighted mass-point example and note the caveat inline. Add a regression test for that runtime contract; current tests only check that the guide mentions sup-t and that snippets parse as Python, not that the weighted mass-point example reflects the supported vcov combinations (tests/test_practitioner.py:L528-L535, tests/test_practitioner.py:L656-L680, tests/test_guides.py:L383-L405).

P3 mass-point + survey vcov requirement: per had.py:3495-3507 the
mass-point design rejects the default classical vcov family on the
survey_design= path with NotImplementedError (the survey path composes
Binder-TSL on the HC1-scale influence function, which targets V_HC1
rather than the classical sandwich). The Step-6 sup-t / cband snippet
in _handle_had_event_study and the HAD section in llms-full.txt
presented weighted event-study fits as a generic survey_design= path
without surfacing this constraint, so the example as written would
fail at fit time on a mass-point panel.
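A minimal sketch of that guard, under the stated assumptions (function and parameter names are hypothetical; only the constraint itself, mass-point + survey requires the HC1-scale family, comes from the source):

```python
def resolve_vcov(design, survey, vcov_type="classical"):
    """Hypothetical guard for the documented mass-point + survey constraint.

    The survey path composes Binder-TSL on the HC1-scale influence
    function (targeting V_HC1), so the default classical family is
    rejected rather than silently mis-scaled.
    """
    if design == "mass_point" and survey and vcov_type not in ("hc1", "robust"):
        raise NotImplementedError(
            "mass-point + survey_design requires vcov_type='hc1' "
            "(or robust=True): Binder-TSL targets V_HC1, not the "
            "classical sandwich"
        )
    # On continuous designs hc1 is a no-op: CCT-2014 robust SEs are used anyway.
    return vcov_type
```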

Both surfaces now make the requirement explicit:
- The Step-6 snippet uses HeterogeneousAdoptionDiD(vcov_type='hc1', ...)
  with an inline comment explaining that hc1 is required on mass-point
  + survey and is a no-op on the continuous designs (which use the
  CCT-2014 robust SE regardless), making it a safe default for the
  survey-aware example.
- A new "Mass-point + survey constraint" paragraph in the HAD section
  of llms-full.txt documents the same requirement and routing.

Tests added (2 new, 92 total):
- test_had_event_study_sup_t_snippet_uses_hc1_for_mass_point_survey_compatibility:
  asserts the sup-t / cband snippet either uses vcov_type='hc1' /
  robust=True or surfaces the mass-point + survey vcov requirement
  inline so agents adapting the snippet on a mass-point panel know to
  add it.
- test_llms_full_had_section_documents_mass_point_survey_vcov_requirement:
  asserts the HAD section documents the mass-point + survey vcov
  requirement (vcov_type mention paired with mass-point context).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@igerber
Owner Author

igerber commented May 9, 2026

/ai-review

@github-actions

github-actions Bot commented May 9, 2026

🔁 AI review rerun (requested by @igerber)

Head SHA: da24a59596c0f8435aa72d0347fc3ffca21022e5


Overall Assessment

⚠️ Needs changes

The prior P3 about weighted HAD event-study guidance is resolved: the new guide and practitioner Step 6 now surface the mass-point vcov_type='hc1' requirement (diff_diff/guides/llms-full.txt:L658-L660, diff_diff/practitioner.py:L1066-L1097). But I found one new P1 on the changed HAD pretest guidance: the new agent-facing text overstates when Assumption 7 is actually checked on the event-study path.

Executive Summary

  • Previous review item resolved: the mass-point + weighted/survey vcov_type='hc1' caveat is now documented on the new HAD surfaces (diff_diff/guides/llms-full.txt:L658-L660, diff_diff/practitioner.py:L1066-L1097).
  • P1 [Newly identified]: the new HAD pretest guidance says aggregate='event_study' on a multi-period panel closes the step-2 / Assumption 7 gap, but the registry and runtime contract only allow that when there is at least one earlier pre-period beyond F-1 (diff_diff/guides/llms-full.txt:L1407-L1423, diff_diff/practitioner.py:L857-L893, diff_diff/practitioner.py:L1007-L1034, docs/methodology/REGISTRY.md:L2487-L2489).
  • The underlying workflow already handles the edge case correctly: when no earlier placebo pre-period exists, it sets pretrends_joint=None, keeps all_pass=False, and flags joint pre-trends skipped (diff_diff/had_pretests.py:L4738-L4756, diff_diff/had_pretests.py:L4822-L4831, tests/test_had_pretests.py:L2443-L2449, tests/test_had_pretests.py:L4998-L5049).
  • _check_nan_att() remains a conservative presentation-layer helper; I did not find a new inference anti-pattern or NaN-propagation bug in the changed estimator code (diff_diff/practitioner.py:L1205-L1249).
  • I could not run pytest here because the sandbox does not have the pytest module installed.

Methodology

  • Severity: P1 [Newly identified]
    Impact: The new HAD pretest copy in diff_diff/guides/llms-full.txt:L1407-L1423 and both new practitioner handlers in diff_diff/practitioner.py:L857-L893 and diff_diff/practitioner.py:L1007-L1034 tells users that switching to aggregate='event_study' on a multi-period panel closes the Assumption 7 / paper-step-2 gap. That is broader than the documented contract. The Methodology Registry says step 2 closes only when there are two pre-periods (the base F-1 plus an earlier placebo); otherwise the workflow must leave pretrends_joint=None and report joint pre-trends skipped (no earlier pre-period) (docs/methodology/REGISTRY.md:L2487-L2489). The implementation and tests follow that narrower rule (diff_diff/had_pretests.py:L4738-L4756, diff_diff/had_pretests.py:L4822-L4831, tests/test_had_pretests.py:L2443-L2449, tests/test_had_pretests.py:L4998-L5049). On minimal valid event-study panels, the new agent-facing guidance can therefore tell users/LLMs that step 2 was covered when it was not.
    Concrete fix: Update every changed Step-3 / HAD-pretests surface to say that aggregate='event_study' closes step 2 only if an earlier placebo pre-period beyond F-1 exists (and survives any trends_lin consumption / filtering). Otherwise say explicitly that the workflow returns pretrends_joint=None, all_pass=False, and a joint pre-trends skipped (no earlier pre-period) verdict.

Code Quality

  • No findings.

Performance

  • No findings.

Maintainability

  • No findings.

Tech Debt

  • No findings. Remaining HAD tutorial follow-ups are properly tracked in TODO.md:L107-L113, so they are non-blocking under the review policy.

Security

  • No findings.

Documentation/Tests

  • Severity: P2
    Impact: The new guide/practitioner regression tests lock assumption labels and the QUG-under-survey caveat, but they do not lock the “event-study only closes step 2 when an earlier placebo pre-period exists” precondition. That is why the overstatement above can land in the new agent-facing surfaces without tripping the added tests (tests/test_guides.py:L550-L617, tests/test_practitioner.py:L782-L835). The underlying workflow contract is already regression-tested elsewhere (tests/test_had_pretests.py:L2443-L2449, tests/test_had_pretests.py:L4998-L5049).
    Concrete fix: Add explicit tests/test_guides.py and tests/test_practitioner.py assertions that the HAD pretest text mentions the earlier-pre-period requirement and the pretrends_joint=None / joint pre-trends skipped fallback.

Path to Approval

  1. Correct the new HAD pretest guidance in diff_diff/practitioner.py and diff_diff/guides/llms-full.txt so it does not claim that any multi-period aggregate='event_study' fit closes Assumption 7; require an earlier placebo pre-period beyond F-1, and document the joint pre-trends skipped fallback otherwise.
  2. Add guide/practitioner regression tests for that caveat, mirroring the existing workflow-contract tests in tests/test_had_pretests.py:L2443-L2449 and tests/test_had_pretests.py:L4998-L5049.

P1 step-2 / Assumption 7 closure precondition: per
docs/methodology/REGISTRY.md HeterogeneousAdoptionDiD
§ "Assumption 7 / step 2 closure" + had_pretests.py:4738-4756 + 2769,
aggregate="event_study" closes paper Section 4.2 step 2 ONLY IF the
panel has at least one earlier placebo pre-period beyond the base F-1.
With only the base F-1 pre-period available (minimal 3-period
event-study, or 4-period under trends_lin=True where the consumed F-2
placebo is auto-dropped), the workflow sets pretrends_joint=None,
all_pass=False, and appends 'joint pre-trends skipped (no earlier
pre-period)' to the verdict - step 2 stays uncovered.

The previous Step-3 wording in both _handle_had and
_handle_had_event_study + the HAD Pretests intro in llms-full.txt said
generically that aggregate="event_study" closes the step-2 gap, which
is overbroad and could mislead agents on minimal valid event-study
panels. All three surfaces now make the precondition explicit AND
document the pretrends_joint=None / 'joint pre-trends skipped' fallback
verdict so agents know what to expect when the precondition fails.
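The closure rule above reduces to a small precondition check; a hedged sketch (function name, return shape, and the sorted pre-period convention are all hypothetical, the rule itself is from the registry contract):

```python
def close_step_2(pre_periods, base_period, trends_lin=False):
    """Hypothetical sketch of the step-2 / Assumption 7 closure rule.

    aggregate='event_study' closes paper step 2 only when at least one
    placebo pre-period EARLIER than the base F-1 survives. trends_lin=True
    consumes the closest earlier placebo (F-2), which can exhaust them.
    pre_periods is assumed sorted ascending; base_period is F-1.
    Returns (pretrends_joint, all_pass, verdict_notes).
    """
    earlier = [p for p in pre_periods if p < base_period]
    if trends_lin and earlier:
        earlier = earlier[:-1]  # closest earlier placebo consumed by the trend
    if not earlier:
        return None, False, ["joint pre-trends skipped (no earlier pre-period)"]
    return "run_joint_pretrends", True, []
```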

P2 missing regression coverage: the prior tests locked assumption
labels and the QUG-under-survey caveat but did not lock the
earlier-pre-period precondition - that is why the overstatement
landed in the new agent-facing surfaces without tripping the existing
guide / practitioner tests.

Tests added (2 new, 94 total):
- test_had_step_3_documents_earlier_pre_period_precondition_for_step_2:
  asserts both HAD handler variants surface the 'earlier pre-period'
  / placebo precondition AND the pretrends_joint=None / 'joint
  pre-trends skipped' fallback.
- test_llms_full_had_pretests_documents_earlier_pre_period_precondition:
  same lock on the HAD Pretests section in llms-full.txt.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@igerber
Owner Author

igerber commented May 9, 2026

/ai-review

@github-actions

github-actions Bot commented May 9, 2026

🔁 AI review rerun (requested by @igerber)

Head SHA: 4a247587df7099933e0624c2d34099f53fe36d65


Overall Assessment

Looks good

The prior P1 from the re-review is resolved. I did not find any unmitigated P0/P1 issues in the changed HAD guidance, registry updates, or practitioner dispatch surfaces.

Executive Summary

  • The previous methodology mismatch is fixed: the new HAD guide and practitioner text now correctly state that event-study step-2 closure requires an earlier placebo pre-period beyond F-1, matching both the Methodology Registry and the runtime workflow fallback in did_had_pretest_workflow() (docs/methodology/REGISTRY.md:L2482-L2489, diff_diff/had_pretests.py:L4738-L4797, diff_diff/guides/llms-full.txt:L1409-L1442, diff_diff/practitioner.py:L857-L898, diff_diff/practitioner.py:L1014-L1047).
  • The weighted-path caveats are surfaced correctly on the new agent-facing HAD surfaces: QUG-under-survey remains explicitly deferred, and mass-point survey examples now call out the vcov_type="hc1" requirement (diff_diff/guides/llms-full.txt:L658-L660, diff_diff/guides/llms-full.txt:L1435-L1442, diff_diff/practitioner.py:L874-L898, diff_diff/practitioner.py:L1078-L1110).
  • The new _check_nan_att() ndarray branch is a presentation-layer helper only; it warns only on all-NaN per-horizon arrays and does not introduce a new inference anti-pattern (diff_diff/practitioner.py:L1217-L1250).
  • P3 [Newly identified]: HeterogeneousAdoptionDiDResults.to_dict() still documents weighted variance_formula as only "pweight" / "survey_binder_tsl", which now disagrees with the updated field docstrings and actual mass-point weighted outputs (diff_diff/had.py:L347-L369, diff_diff/had.py:L471-L488).
  • I could not run the targeted tests in this environment because pytest and numpy are not installed.

Methodology

  • No findings. The changed HAD-facing guidance now matches the documented source contract for Assumption 7 / step-2 closure and its skip fallback (docs/methodology/REGISTRY.md:L2482-L2489, diff_diff/had_pretests.py:L4738-L4797).

Code Quality

  • No findings.

Performance

  • No findings.

Maintainability

  • No findings.

Tech Debt

  • No findings. The remaining HAD tutorial follow-ups are explicitly tracked in TODO.md:L107-L113, so they are non-blocking under the review policy.

Security

  • No findings.

Documentation/Tests

  • Severity: P3 [Newly identified]
    Impact: HeterogeneousAdoptionDiDResults.to_dict() still describes weighted variance_formula as continuous-only and omits the mass-point Wald-IV denominator wording, so not all public documentation surfaces agree after this PR updated the field docstrings and llms-full (diff_diff/had.py:L347-L369, diff_diff/had.py:L471-L488). This is informational only; the runtime fields themselves appear correct.
    Concrete fix: Update the to_dict() docstring in diff_diff/had.py:L471-L488 to mirror the full weighted contract already documented in diff_diff/had.py:L347-L369, including "pweight_2sls" / "survey_binder_tsl_2sls" and the mass-point effective_dose_mean semantics.
  • I could not execute pytest here because the sandbox is missing both pytest and numpy, so test validation is based on diff inspection rather than a local run.

P3 doc drift: PR #402 R3 fixed llms-full.txt, R5 fixed the dataclass
field docstrings, but HeterogeneousAdoptionDiDResults.to_dict() still
described variance_formula as continuous-only ("pweight" /
"survey_binder_tsl") and omitted the mass-point Wald-IV
effective_dose_mean semantics. Three internal source-of-truth surfaces
were now disagreeing about the same public result object's
to_dict() output shape.

Updated to_dict() docstring to enumerate all four variance_formula
labels (pweight, survey_binder_tsl, pweight_2sls,
survey_binder_tsl_2sls) and to describe the per-design
effective_dose_mean semantics (continuous mean of D / D - d_lower vs
mass-point weighted Wald-IV dose gap mean(D | Z=1, w) -
mean(D | Z=0, w)). Mirrors the field-docstring contract from R5.

Tests added (1 new, 95 total):
- test_had_results_to_dict_docstring_matches_weighted_mass_point_contract:
  reads HeterogeneousAdoptionDiDResults.to_dict.__doc__ and asserts it
  enumerates all four variance_formula labels and the mass-point
  Wald-IV effective_dose_mean semantics. Mirrors the existing
  dataclass-field-docstring lock.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@igerber
Owner Author

igerber commented May 9, 2026

/ai-review

@github-actions

github-actions Bot commented May 9, 2026

🔁 AI review rerun (requested by @igerber)

Head SHA: b2bfdd09d4f3fc7253a742f7b16aec8e6c040cae


Overall Assessment

Looks good

Executive Summary

  • The prior re-review P3 is resolved: HeterogeneousAdoptionDiDResults.to_dict() and the weighted-field docstrings now agree with the actual weighted mass-point contract (diff_diff/had.py:L347-L372, diff_diff/had.py:L468-L499).
  • The new HAD practitioner guidance matches the Methodology Registry on step-2 closure requiring an earlier placebo pre-period, QUG deferral under survey weighting, untreated-unit compatibility, and last-cohort-only event-study framing (docs/methodology/REGISTRY.md:L2481-L2489, docs/methodology/REGISTRY.md:L2533-L2538, diff_diff/practitioner.py:L857-L899, diff_diff/practitioner.py:L1013-L1155).
  • The ContinuousDiD ↔ HAD routing is consistent with implementation: this codebase still rejects no-D=0 ContinuousDiD panels, while HAD remains valid with a small never-treated share when the estimand is WAS (diff_diff/continuous_did.py:L348-L360, diff_diff/practitioner.py:L729-L748, diff_diff/practitioner.py:L902-L929).
  • _check_nan_att()’s new ndarray branch is consistent with the HAD event-study NaN contract: it warns only on all-NaN horizon arrays and does not over-fire on partial-NaN outputs (diff_diff/practitioner.py:L1217-L1261, diff_diff/had.py:L540-L545).
  • The new guide/practitioner surfaces are well regression-locked for signature drift, methodology wording, weighted mass-point docs, and runtime handler behavior (tests/test_guides.py:L280-L640, tests/test_practitioner.py:L499-L935).
  • I could not run the targeted tests locally because this sandbox does not have pytest or numpy installed.

Methodology

  • No findings. The changed HAD-facing guidance is aligned with the registry and cited method notes, including the earlier-pre-period requirement for step-2 closure, the QUG-under-survey deferral, untreated-unit compatibility, and the Appendix B.2 last-cohort filter (docs/methodology/REGISTRY.md:L2400-L2408, docs/methodology/REGISTRY.md:L2481-L2489, docs/methodology/REGISTRY.md:L2533-L2538, diff_diff/guides/llms-full.txt:L593-L660, diff_diff/guides/llms-full.txt:L1407-L1442, diff_diff/guides/llms-practitioner.txt:L151-L168).

Code Quality

  • No findings. The new practitioner handlers stay string-only at runtime, avoid inline inference anti-patterns, and the ndarray NaN warning path is narrowly scoped (diff_diff/practitioner.py:L847-L1261, tests/test_practitioner.py:L642-L691).

Performance

  • No findings. The only runtime change is a lazy numpy import in the exceptional ndarray path of _check_nan_att(), so diff inspection shows no meaningful hot-path regression (diff_diff/practitioner.py:L1238-L1255).

Maintainability

  • No findings. The new HAD docs and practitioner surfaces are backed by focused regression tests for signatures, field tables, wording, and dispatch behavior, which should make future drift obvious (tests/test_guides.py:L353-L519, tests/test_practitioner.py:L733-L935).

Tech Debt

  • No findings. The remaining HAD tutorial work is explicitly tracked in TODO.md, so it is non-blocking under the review policy (TODO.md:L111-L114).

Security

  • No findings.

Documentation/Tests

  • No findings. The previous public-doc mismatch on weighted mass-point variance_formula / effective_dose_mean is fixed and now covered by tests (diff_diff/had.py:L347-L372, diff_diff/had.py:L468-L499, tests/test_practitioner.py:L733-L818, tests/test_guides.py:L479-L519).
  • Residual risk: I could not execute pytest here because the environment is missing both pytest and numpy, so test validation is based on diff inspection only.

@igerber igerber added the ready-for-ci Triggers CI test workflows label May 9, 2026
@igerber igerber merged commit df270ab into main May 9, 2026
28 of 29 checks passed
@igerber igerber deleted the had-phase-5-agent-surfaces branch May 9, 2026 18:07