docs(geneva): add user guide for profiling stateful UDF memory #248
Adds a new page under Geneva → Transforms that walks Geneva users through profiling their own stateful UDFs with memray to find memory leaks before they cause worker OOMs.

Covers:
- Why stateful UDFs are memory-risk-prone (actor lifetime, cross-batch retention)
- The opt-in tracker pattern: wrap setup() with memray.Tracker conditional on an env var, propagated to Ray workers via ray_cluster(extra_env=...)
- How to read memray summary / flamegraph / tree output
- Four common leak patterns (growing cache, accumulating buffers, closure captures, ML-library autograd retention) with bad/good code side by side
- A "does my profiling actually work?" sanity check

Linear: https://linear.app/lancedb/issue/GEN-512
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
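For readers of this thread, the opt-in tracker pattern described in the commit message might look roughly like the sketch below. The class `ProfiledUDF`, its `close()` method, and the per-process file name are hypothetical and are not taken from the page itself. The sketch also assumes the tracker should stay open across batches rather than cover `setup()` alone. `memray` is imported lazily, so it is only required when the env var is set.

```python
import os


class ProfiledUDF:
    """Hypothetical stateful UDF; only setup() and the env var name come from the thread."""

    ENV_VAR = "MY_UDF_MEMRAY_OUT_DIR"

    def __init__(self):
        self._tracker = None
        self.model = None

    def setup(self):
        out_dir = os.environ.get(self.ENV_VAR)
        if out_dir:
            # Imported lazily so memray is only needed when profiling is on.
            import memray

            os.makedirs(out_dir, exist_ok=True)
            # One file per worker process; memray refuses to overwrite a file.
            path = os.path.join(out_dir, f"udf-{os.getpid()}.bin")
            self._tracker = memray.Tracker(path)
            # Entered manually so tracking outlives setup() (an assumption).
            self._tracker.__enter__()
        self.model = {"weights": [0.0] * 1024}  # stand-in for real state

    def __call__(self, batch):
        return [x * 2 for x in batch]

    def close(self):
        # Stop tracking and flush the capture file, if profiling was enabled.
        if self._tracker is not None:
            self._tracker.__exit__(None, None, None)
            self._tracker = None
```

With the env var unset, `setup()` never touches memray, so the same UDF can run unprofiled in production.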
…ng guide

memray alone gives users the Python-side allocation story, but real worker OOMs are easier to attribute by also watching process RSS (``resource.getrusage().ru_maxrss``) and Arrow's own allocator (``pyarrow.total_allocated_bytes()``) side-by-side over time.

Adds:
- A drop-in ``log_memory(seq)`` helper Geneva users can paste into their UDF (with macOS/Linux ru_maxrss unit handling)
- A short explanation of what each number actually represents (rss = OS-given, arrow_live = live Arrow buffers, gap = "everything else": Python heap, native libs, allocator retention)
- A five-row diagnostic table mapping observed patterns to root cause and concrete first-fix advice (allocator retention, real Arrow leak, oversized peak, pathological row, healthy)
- A pointer to Geneva's reference integration test, whose stdout logs this exact breakdown so users can see the patterns in practice

Linear: https://linear.app/lancedb/issue/GEN-512
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
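A helper along the lines the commit message describes could be sketched as follows. This is not the page's actual `log_memory` implementation: the output format and the returned dict are assumptions, and the `pyarrow` import is made optional so the sketch runs without Arrow installed. Note that `ru_maxrss` is a high-water mark, so the `rss` value here never decreases.

```python
import resource
import sys

try:
    import pyarrow as pa
except ImportError:  # keep the helper usable without Arrow installed
    pa = None


def log_memory(seq):
    """Print and return peak RSS, live Arrow bytes, and the gap between them."""
    maxrss = resource.getrusage(resource.RUSAGE_SELF).ru_maxrss
    # ru_maxrss is reported in bytes on macOS and in kibibytes on Linux.
    rss = maxrss if sys.platform == "darwin" else maxrss * 1024
    arrow_live = pa.total_allocated_bytes() if pa is not None else 0
    gap = rss - arrow_live  # Python heap, native libs, allocator retention
    mib = 1024 * 1024
    print(
        f"[mem] seq={seq} rss={rss / mib:.1f}MiB "
        f"arrow_live={arrow_live / mib:.1f}MiB gap={gap / mib:.1f}MiB"
    )
    return {"seq": seq, "rss": rss, "arrow_live": arrow_live, "gap": gap}
```

Calling it once per batch with an increasing `seq` gives a time series that can be compared against the diagnostic table.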
dantasse left a comment
Should this be in "running jobs" section instead of the "transforms" section?
Otherwise, looks good, easy enough to follow! Just a few questions, nothing blocking I think
    with ray_cluster(
        local=True,
        extra_env={"MY_UDF_MEMRAY_OUT_DIR": "/tmp/my-udf-profile"},
Suggested change:
    - extra_env={"MY_UDF_MEMRAY_OUT_DIR": "/tmp/my-udf-profile"},
    + extra_env={"MY_UDF_MEMRAY_OUT_DIR": os.environ.get("MY_UDF_MEMRAY_OUT_DIR")},
maybe?
also is with ray_cluster(...) what we're currently recommending ppl to do for a local ray cluster with extra env vars?
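One detail about the suggestion above: `os.environ.get` returns `None` when the variable is unset. Assuming `extra_env` expects string values (the thread does not confirm this), a small helper can forward only the variables that are actually set. `passthrough_env` is a hypothetical name, not part of Geneva.

```python
import os


def passthrough_env(*names, defaults=None):
    """Collect env vars to forward to workers, skipping any that are unset."""
    env = dict(defaults or {})
    for name in names:
        value = os.environ.get(name)
        if value is not None:
            env[name] = value
    return env


# Intended use (not executed here):
#   with ray_cluster(local=True, extra_env=passthrough_env("MY_UDF_MEMRAY_OUT_DIR")):
#       ...
```

With this shape, profiling stays off unless the caller exports the variable before launching the job.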
memray reports two values you'll care about most:
- **Peak heap** (`metadata.peak_memory`): the high-water mark. This is what triggers OOMs. A peak well above your `setup()` allocations means a batch transiently doubles memory before freeing.
it could mean transient doubling, or a lot of other things, right? like a slow memory leak would still result in high peak_memory, I assume
correct, it would just take longer to surface, which is unfortunately what some of our customers seem to be experiencing.
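The distinction raised here can be illustrated with the standard library's `tracemalloc` instead of memray (a deliberate substitution, so the sketch runs anywhere). Both toy workloads below end with a similar peak, but only the leaky one still holds memory at the end. All function names are hypothetical.

```python
import tracemalloc


def measure(process_batch, n_batches):
    """Return (retained_bytes, peak_bytes) after running n_batches."""
    tracemalloc.start()
    try:
        for i in range(n_batches):
            process_batch(i)
        current, peak = tracemalloc.get_traced_memory()
    finally:
        tracemalloc.stop()
    return current, peak


def transient_batch(i):
    scratch = bytearray(4_000_000)  # large, but freed when the call returns
    return len(scratch)


_retained = []


def leaky_batch(i):
    _retained.append(bytearray(200_000))  # small, but never released


transient = measure(transient_batch, 20)
leaky = measure(leaky_batch, 20)
print("transient (retained, peak):", transient)
print("leaky     (retained, peak):", leaky)
```

A high peak alone does not say which case applies. Comparing the retained bytes, or watching whether the peak keeps rising with batch count, separates a transient spike from a slow leak.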
| Pattern | Diagnosis | First thing to try |
|---|---|---|
| `rss` climbs slowly, `arrow_live` flat near zero, big growing `gap` | Allocator retention: Python freed it but `glibc` is keeping the pages | `ctypes.CDLL("libc.so.6").malloc_trim(0)` periodically (Linux only); or set `MALLOC_TRIM_THRESHOLD_=131072` |
where would I set MALLOC_TRIM_THRESHOLD_? I guess that's an env var? This is confusing to me - I don't have a sense what's happening here, so I would probably copy these in as magic incantations but feel bad about it :)
But this whole section is a little over my head anyway. Good to have it, I think, but I hope I never need it!
Yeah, that was a new one for me as well; it's just a Linux environment variable. So much of the tuning on Linux/macOS/Windows is different that it really makes me wish there was a more standard API/config for this stuff.
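To make the two "incantations" less magic: `MALLOC_TRIM_THRESHOLD_` is an environment variable that glibc reads when a process starts, so it has to be in the worker's environment before launch (for example through `extra_env`, assuming that is how worker env vars are set). `malloc_trim(0)` is a glibc call that asks the allocator to hand free heap pages back to the OS. A guarded wrapper might look like this, where `trim_malloc` is a hypothetical name:

```python
import ctypes
import sys


def trim_malloc():
    """Ask glibc to return free heap pages to the OS; None if unsupported."""
    if not sys.platform.startswith("linux"):
        return None
    try:
        libc = ctypes.CDLL("libc.so.6")
        # malloc_trim returns 1 if memory was released, 0 otherwise.
        return bool(libc.malloc_trim(0))
    except (OSError, AttributeError):
        return None  # non-glibc libc (for example musl) or symbol missing


released = trim_malloc()
print("malloc_trim released memory:", released)
```

Calling it every N batches is one way to apply the table's "periodically" advice. It only helps in the allocator-retention case, not with a genuine leak.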
If `memray summary` doesn't show leaked bytes growing roughly with batch count after this change, your tracker isn't actually attached (most often: the env var isn't reaching workers; re-check `extra_env`).
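A canary for this sanity check could be as small as the sketch below: a UDF that retains a known number of bytes per batch, so the expected leak is easy to compare with what the profiler reports. `CanaryLeakUDF` and its methods are hypothetical and not the page's own example.

```python
class CanaryLeakUDF:
    """Hypothetical UDF that deliberately leaks a fixed number of bytes per batch."""

    def __init__(self, bytes_per_batch=1_000_000):
        self.bytes_per_batch = bytes_per_batch
        self._leak = []
        self.batches = 0

    def setup(self):
        self._leak.clear()
        self.batches = 0

    def __call__(self, batch):
        # Retained on self, so it survives across batches for the actor's lifetime.
        self._leak.append(bytearray(self.bytes_per_batch))
        self.batches += 1
        return batch

    def expected_leak_bytes(self):
        return self.batches * self.bytes_per_batch


udf = CanaryLeakUDF()
udf.setup()
for _ in range(5):
    udf([1, 2, 3])
print("expected leaked bytes:", udf.expected_leak_bytes())
```

If the profile for a run of this UDF shows nothing close to the expected figure, the tracker is probably not attached in the worker.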
Geneva's own test suite ships a reference implementation of this pattern in `src/stress_tests/_memray_probe.py` and `src/stress_tests/test_memray_stateful_udf.py`, plus a GitHub Actions workflow (`memray-stateful-udf-profile.yml`) that uploads the per-actor `.bin` and rendered flamegraph as a CI artifact. Feel free to copy that scaffolding for your own project's UDFs.
Can we reference the geneva source? I guess so - if you pay for geneva, you get the source. It kinda feels weird; we basically never do otherwise. But I guess there's not anything actually wrong here. Ok, I guess I don't really have a question here 😆
Summary
Adds a new page Geneva → Transforms → Profiling Memory that helps Geneva users find memory leaks in their own stateful UDFs with memray, before those leaks cause worker OOMs in production backfills.
Counterpart to lancedb/geneva#796 (the integration test + CI workflow). This PR is the user-facing guide; the geneva PR is the in-repo regression gate.
What the page covers
- … `FatalWorkerOOMError`, RSS growing during backfill, models / caches / running statistics in `self`).
- … `setup()` with `memray.Tracker` conditional on an env var, then propagate the env var to Ray workers via `ray_cluster(extra_env=...)`.
- … `lru_cache(maxsize=...)` … `pa.Array` … `torch.inference_mode()` / `eval()`

Files
- `docs/geneva/udfs/profiling-memory.mdx`: new page (~250 lines)
- `docs/docs.json`: adds the new page to the Transforms sidebar group, between `error_handling` and `blobs`

Test plan
- `docs.json` is valid JSON (`python3 -c 'import json; json.load(open("docs/docs.json"))'`)
- … `title`, `sidebarTitle`, `description`, `icon`)
- … `/geneva/udfs/udfs`, `/geneva/jobs/troubleshooting`, `/geneva/udfs/advanced-configuration`

🤖 Generated with Claude Code