Use AbstractPPL AD interface#1363
Conversation
- remove DifferentiationInterface from Project.toml deps and compat - replace DI gradient preparation/evaluation with AbstractPPL.prepare and AbstractPPL.value_and_gradient - reuse LogDensityAt and closures without local AbstractPPL.prepare piracy - guard the Mooncake precompile workload until AbstractPPLMooncakeExt is loaded - pin AbstractPPL to the evaluator-interface branch in [sources] for this environment Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
# Conflicts: # Project.toml # src/logdensityfunction.jl
2340748 to
e1cac2f
Compare
Codecov Report✅ All modified and coverable lines are covered by tests. Additional details and impacted files@@ Coverage Diff @@
## main #1363 +/- ##
==========================================
- Coverage 82.30% 81.70% -0.61%
==========================================
Files 50 50
Lines 3543 3536 -7
==========================================
- Hits 2916 2889 -27
- Misses 627 647 +20 ☔ View full report in Codecov by Sentry. 🚀 New features to boost your workflow:
|
Fix #1364 TODO - [x] use `main` as base, as otherwise benchmarking would fail, need to change back to #1364 before merging. - ~~Julia 1.10’s Pkg does not reliably apply the root [sources] override during build/test, so min CI fails.~~ This change switches `LogDensityFunction` AD preparation to pass a structural `LogDensityAt` problem directly into `AbstractPPL.prepare(...)` instead of wrapping it in backend-specific anonymous closures. It also removes the old `_use_closure` machinery, since the structural problem object now provides the stable one-argument evaluator shape that AD backends need. The accompanying test coverage adds a focused `ReverseDiff` check to confirm that `AutoReverseDiff(; compile=true)` still retains a compiled tape and can be reused across repeated `logdensity_and_gradient` calls. Benchmarks ``` +----------------------+------------+--------------+--------+--------------+ | Backend | PR grad μs | main grad μs | Ratio | Summary | +----------------------+------------+--------------+--------+--------------+ | ForwardDiff | 0.322 | 0.322 | 1.001x | 0.1% slower | | Forward Mooncake | 0.900 | 2.601 | 0.346x | 2.89x faster | | Forward Enzyme | 0.789 | 1.051 | 0.751x | 24.9% faster | | ReverseDiff | 14.239 | 15.225 | 0.935x | 6.5% faster | | ReverseDiff compiled | 5.355 | 5.872 | 0.912x | 8.8% faster | | Reverse Mooncake | 0.892 | 0.928 | 0.961x | 3.9% faster | | Reverse Enzyme | 1.253 | 1.268 | 0.988x | 1.2% faster | +----------------------+------------+--------------+--------+--------------+ ``` --------- Co-authored-by: Claude Sonnet 4.6 <noreply@anthropic.com>
|
DynamicPPL.jl documentation for PR #1363 is available at: |
Update LogDensityFunction gradient evaluation to match the AbstractPPL evaluator interface and make the floattypes environment resolve that source branch explicitly. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Drops the base-vs-head comparison entirely. The benchmark workflow now runs once on the PR head, on a pinned `ubuntu-22.04` runner, and reports absolute log-density times plus gradient/log-density ratios in the posted comment. Output schema follows Mooncake's bench harness; readers compare against recent main-branch comments to spot regressions. Noise reduction in `run_ad`: per-sample incremental GC teardown and a full GC before each measurement keep accumulated garbage from triggering mid-sample collections that inflate individual samples. Adds a `benchmark_seconds` knob for tightening the median estimate. Also removes the synthetic reference timing that normalised eval times against a non-DPPL function. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Inline `Models.jl` and `DynamicPPLBenchmarks.jl` into `benchmarks.jl` and convert `benchmarks/Project.toml` from a package to a flat environment, mirroring Mooncake.jl's `bench/run_benchmarks.jl` layout. Also: take dim from `length(r.params)` (run_ad already constructed the LDF) so models are no longer evaluated twice on the success path; switch results to NamedTuples so `print_results` reads `r.name`/`r.dim`/...; extract `transform_strategy(islinked)` helper; drop unused imports. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
# Conflicts: # benchmarks/Project.toml # benchmarks/src/DynamicPPLBenchmarks.jl
Run a full cross-product of the 9 model configs × 4 AD backends ×
{linked, unlinked} = 72 rows, ordered model → linked → backend so each
model's eight rows are adjacent for side-by-side inspection.
`:reversediff_compiled` is excluded because compiled tapes are
input-dependent and silently produce wrong gradients on
parameter-dependent control flow (see CLAUDE.md).
Per-row logging mirrors Mooncake's `bench/run_benchmarks.jl`: an
`(i / N, name, (linked = …))` header, the backend on its own line,
then `t(logdensity)` / `t(grad)` formatted with units. `model_dimension`
is now defensive (returns `missing` on init failures) and the table
formats `missing` dims as `err`, so combos that crash during dimension
lookup still produce a well-formed row instead of derailing the run.
Also: add a `setup` stage to `run_ad`'s Chairmarks pipeline that
deep-copies `params` per sample, matching Mooncake's harness — setup
runs before the timed window, so the copy is excluded from
measurements. Widen `combos` to a typed `Tuple{...}[]` so it accepts
models with non-default contexts (e.g. the `condition`-wrapped
`loop_univariate`/`multivariate` rows).
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…apsible full table LDA's discrete `Categorical` RVs make `linked = false` ill-defined for gradient-based AD, so all four backends previously errored on those rows. Skip them at combination time, leaving 68 rows. In markdown mode, emit a `### Gist: Smorgasbord` block with just that model's eight rows (Smorgasbord covers the broadest set of DPPL features, so it is the most informative single row band), then put the full 68-row table inside `<details><summary>` so it is collapsed by default in GitHub PR comments. Plain (non-markdown) output is unchanged. Drop the now-redundant `### Benchmark Results` heading from the workflow body since the script emits its own. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Required as a direct dep so the benchmarks project resolves cleanly without a manual `Pkg.resolve()` after `Pkg.instantiate()`. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
976a8eb to
49b1837
Compare
The Benchmarking comment now reads top-down as: SHA in the heading, Smorgasbord gist as a level-3 section, explanatory paragraph, full table as a level-3 section, and computer info as a foldable <details> with an inline <b>-styled summary (avoids markdown inside <details>, which renders inconsistently). The two integration jobs added to CI.yml were still on setup-julia@v2; bumped to @V3 for consistency with the main job. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Splits Benchmarking.yml into three jobs (benchmark-pr, benchmark-main, post-comment). The PR-comment now shows PR-head numbers up top and main's numbers in a foldout below, with a column legend and labelled SHAs. If benchmark-main fails (e.g. transitionally before this branch lands on main, since main's bench script does not yet support markdown mode), post-comment still posts PR-head numbers and notes the main job result inline. Workflow hardening: concurrency group cancels superseded PR runs, 60min timeout per bench job, explicit pull-requests: write permission so fork PRs can post comments. Body assembly moved out of the YAML literal block scalar into a shell heredoc using `body-path:` (peter-evans pattern), avoiding env-var multi-line interpolation. Bench script no longer prints the obsolete "compare against main-branch PR run" paragraph -- main is now in the same comment. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
# Conflicts: # .github/workflows/Benchmarking.yml # benchmarks/Project.toml # benchmarks/benchmarks.jl # src/test_utils/ad.jl
The 68-row table dominates the PR comment vertically. Wrap it in a <details><summary> block so it is collapsed by default, leaving the Smorgasbord gist as the at-a-glance view. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Long-form table (one row per (model, linked, backend)) becomes a pivoted table where each (Model, Dim, Linked) row spans all four AD backends as columns. 68 rows collapse to 17, the gist/full-table split goes away (single unified table), and the gist constant + foldout wrapper in benchmarks/benchmarks.jl are dropped. `t(logdensity)` does not depend on the AD backend, so the four primal samples per group are noise around a common value — take the minimum as the most stable estimate. Per-backend `err` cells render independently of the primal column. Workflow comment restructure to match: SHA goes inline in the H2 title, "PR head"/"Main" preamble lines drop, main-branch results live under their own H3 with a "Click to see" foldout (and a fallback H3 note if benchmark-main failed), Computer Information is its own H3. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Switch the `evaluator-interface` revision overrides in every Project.toml to `main` now that the evaluator API has landed there. Also drop redundant parens in a `@model` Type default in benchmarks/benchmarks.jl. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Replace the LogDensityAt struct with a deprecation shim that emits `Base.depwarn` and returns an `AbstractPPL.Evaluators.VectorEvaluator` wrapping a closure over `logdensity_internal`. The new path is the sanctioned one now that AD prep flows through `AbstractPPL.prepare`. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Replace the closure-over-state form with the new `context::Tuple` kwarg on `AbstractPPL.prepare`, which threads constants through to the problem function. Equivalent semantics, no closure construction. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Pass `check_dims=false` to match the pre-deprecation `LogDensityAt`'s unchecked call behavior; callers that relied on it shouldn't suddenly hit a `DimensionMismatch`. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Move test/integration/{enzyme,reversediff}/ to test/ext/{DynamicPPLEnzymeCoreExt,DynamicPPLReverseDiffExt}/ and convert the MarginalLogDensities split (subdir env + sibling script) to a single test/ext/DynamicPPLMarginalLogDensitiesExt/{Project.toml,main.jl}.
Each integration env now follows the same layout. CI.yml updated accordingly.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Remove the `AbstractPPL = {url = ..., rev = "main"}` overrides from all
eight Project.toml files. Compat is already at 0.15, so the resolver
picks up the registered release.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Fresh-buffer benchmark setup - In `DynamicPPL.TestUtils.AD.run_ad`, use `setup = deepcopy($params)` for both the primal and gradient `@be ` calls so each Chairmarks sample starts from a fresh input buffer instead of reusing the same vector across samples. Matches Mooncake's bench harness; setup runs outside the timed window, so the copy isn't measured. Plus some minor tweaks. ~~This appears to have a noticeable impact on log-density evaluation for tiny models, first noticed in TuringLang#1363~~: e.g., `simple assume observe(linked = true)` logdensity improves from ~20 ns to ~4 ns. In practise, these tiny nanosecond overheads won't matter at all, so the improvement here is only for perfectionists. EDIT: likely `LogDensityAt` -> `logdensity_internal` reduced logdensity's overhead by 20 ns in TuringLang#1363 --------- Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
shravanngoswamii
left a comment
There was a problem hiding this comment.
@yebai few suggestions and improvement comments.
|
Mooncake and Enzyme are faster on this PR. So the 10× regression reported in comments are no longer valid. main branch d2052a1:
Model dim linked primal FwdDiff RvsDiff Mooncake Enzyme
-----------------------------------------------------------------------------------------------
Simple assume observe* 1 false 3.09 ns 12.60 1542.87 40.05 8.81
Simple assume observe* 1 true 3.09 ns 12.88 1690.22 43.55 9.27
This PR 3bdeb74:
Model dim linked primal FwdDiff RvsDiff Mooncake Enzyme
-----------------------------------------------------------------------------------------------
Simple assume observe* 1 false 3.26 ns 12.76 2101.00 37.30 8.23
Simple assume observe* 1 true 3.02 ns 13.37 1832.40 38.87 8.85My system: julia> versioninfo()
Julia Version 1.12.6
Commit 15346901f00 (2026-04-09 19:20 UTC)
Build Info:
Official https://julialang.org release
Platform Info:
OS: Linux (x86_64-linux-gnu)
CPU: 16 × 12th Gen Intel(R) Core(TM) i5-12500H
WORD_SIZE: 64
LLVM: libLLVM-18.1.7 (ORCJIT, alderlake)
GC: Built with stock GC
Threads: 1 default, 1 interactive, 1 GC (on 16 virtual cores)
julia> |
Co-authored-by: Shravan Goswami <123811742+shravanngoswamii@users.noreply.github.com>
Co-authored-by: Shravan Goswami <123811742+shravanngoswamii@users.noreply.github.com>
Drop redundant inner `using AbstractPPL: AbstractPPL` (already imported at the top of the file) and reuse `ldf.model` instead of constructing a second `tiny()` instance. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
To clarify, these metrics were never valid; they arose from a misunderstanding. For some models, log-density evaluations occur on the order of nanoseconds, making the gradient-to-primal ratios highly noisy across runs and machines. For these tiny models, the primal and gradient timings themselves are more reliable. More generally, these benchmarks are intended primarily to catch obvious allocation or type-stability regressions and are meant for DynamicPPL developers only. |
Replace the alarmist warning with a gentler note clarifying that the benchmarks are aimed at DynamicPPL developers and primarily catch allocation or type-stability regressions. Move the `*`-row explanation out of the per-PR comment into the README's interpretation notes. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Migrates DynamicPPL's AD plumbing to the new
AbstractPPL.prepare/value_and_gradient!!interface introduced in AbstractPPL 0.15, retiring DynamicPPL's direct dependency on DifferentiationInterface.The
LogDensityFunctionconstructor now hands AD preparation toAbstractPPL.prepare(adtype, logdensity_internal, x; context=..., check_dims=false)andLogDensityProblems.logdensity_and_gradientsimply forwards toAbstractPPL.value_and_gradient!!. The internal_use_closureheuristic that picked betweenBase.Fix1closures andDI.Constantarguments per backend is removed — that decision now lives in AbstractPPL.logdensity_atis renamed tologdensity_internal(with a const alias for compatibility), andLogDensityAtbecomes a deprecation shim that returns anAbstractPPL.Evaluators.VectorEvaluator.Replacing #1354.
Tests currently fail due to #1364Notes: The primal and gradient times are identical to
mainexcept for tiny models, where this branch beatsmainfor logdensity evaluation while gradient remains comparable.Details
The apparent AD regression in the tiny linked row is mostly caused by a much faster
primal denominator, not by slower raw gradients.
Reported row:
So Mooncake is slightly faster in raw time for this row. The ratio looks worse because
t(logdensity)became tiny.Root cause
On Julia 1.11, old main's optimised primal still contains a call boundary:
Current branch inlines through the same path, so the one-dimensional linked model reduces
to loading
params[1]and evaluating two scalar Normal logpdfs.Local Julia 1.11.9 checks:
Adding
@inlineonly toLogDensityProblems.logdensitydid not materially change oldmain. The important boundary is the assume/init path.