docs(geneva): add user guide for profiling stateful UDF memory #248

Merged
justinrmiller merged 3 commits into main from gen-512-memray-udf-profiling-docs
May 21, 2026

@justinrmiller
Contributor

Summary

Adds a new page Geneva → Transforms → Profiling Memory that helps Geneva users find memory leaks in their own stateful UDFs with memray, before those leaks cause worker OOMs in production backfills.

Counterpart to lancedb/geneva#796 (the integration test + CI workflow). This PR is the user-facing guide; the geneva PR is the in-repo regression gate.

What the page covers

  1. Why stateful UDFs need memory profiling — actor lifetime, cross-batch retention, why a leak that looks tiny in a unit test becomes catastrophic on a 10M-row backfill.
  2. When to profile — concrete symptoms (e.g. FatalWorkerOOMError, RSS growing during backfill, models / caches / running statistics in self).
  3. How to profile with memray — the opt-in tracker pattern: wrap setup() with memray.Tracker conditional on an env var, then propagate the env var to Ray workers via ray_cluster(extra_env=...).
  4. Reading the output — what peak heap and leaked allocations actually mean, what a healthy vs unhealthy profile looks like.
  5. Four common leak patterns — with bad → good code side by side:
    • Unbounded cache → lru_cache(maxsize=...)
    • Accumulating per-call buffers → summarize, don't retain
    • Closure capture of batch arrays → extract small values, not the whole pa.Array
    • PyTorch autograd retention → torch.inference_mode() / eval()
  6. A confidence check — temporarily plant a known leak to verify the tracker is actually attached (most common failure: env var not reaching workers).
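
A minimal sketch of the opt-in tracker pattern from item 3, assuming a plain Python class as a stand-in for a stateful UDF. The class name `ProfiledUDF`, the `close()` hook, and the output file naming are hypothetical and are not Geneva's API; only `memray.Tracker` and the `MY_UDF_MEMRAY_OUT_DIR` variable name come from the PR. memray is imported lazily, so the dependency is needed only when profiling is switched on.

```python
import atexit
import os
import uuid


class ProfiledUDF:
    """Stand-in for a stateful UDF; not Geneva's actual base class or decorator."""

    def __init__(self):
        self._tracker = None
        self.model = None

    def setup(self):
        out_dir = os.environ.get("MY_UDF_MEMRAY_OUT_DIR")
        if out_dir:
            import memray  # lazy import: only required when profiling is enabled

            os.makedirs(out_dir, exist_ok=True)
            # One file per process; memray refuses to overwrite an existing file.
            name = f"udf-{os.getpid()}-{uuid.uuid4().hex[:8]}.bin"
            self._tracker = memray.Tracker(os.path.join(out_dir, name))
            self._tracker.__enter__()
            atexit.register(self.close)  # best-effort flush at interpreter exit
        # Placeholder for the expensive state a real setup() would build.
        self.model = {"weights": [0.0] * 1024}

    def __call__(self, batch):
        return [x * 2 for x in batch]

    def close(self):
        # Idempotent: safe to call directly and again from atexit.
        if self._tracker is not None:
            self._tracker.__exit__(None, None, None)
            self._tracker = None
```

With the variable unset, `setup()` does no profiling work at all, which is what makes the pattern safe to leave in production code. Whether a worker process exits cleanly enough for the `atexit` hook to run depends on how the worker is shut down, so treat the flush as best effort.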

Files

  • docs/geneva/udfs/profiling-memory.mdx — new page (~250 lines)
  • docs/docs.json — adds the new page to the Transforms sidebar group, between error_handling and blobs

Test plan

  • docs.json is valid JSON (python3 -c 'import json; json.load(open("docs/docs.json"))')
  • Frontmatter matches the Mintlify style used by sibling pages (title, sidebarTitle, description, icon)
  • Internal links resolve: /geneva/udfs/udfs, /geneva/jobs/troubleshooting, /geneva/udfs/advanced-configuration
  • Preview the rendered page in Mintlify before merge

🤖 Generated with Claude Code

Adds a new page under Geneva → Transforms that walks Geneva users
through profiling their own stateful UDFs with memray to find memory
leaks before they cause worker OOMs. Covers:

- Why stateful UDFs are memory-risk-prone (actor lifetime,
  cross-batch retention)
- The opt-in tracker pattern: wrap setup() with memray.Tracker
  conditional on an env var, propagated to Ray workers via
  ray_cluster(extra_env=...)
- How to read memray summary / flamegraph / tree output
- Four common leak patterns (growing cache, accumulating buffers,
  closure captures, ML-library autograd retention) with bad/good
  code side by side
- A "does my profiling actually work?" sanity check
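
Two of the leak patterns listed above can be sketched in plain Python. This is illustrative only and is not taken from the new page: `_embed`, `LeakyEmbedder`, `BoundedEmbedder`, and `RunningMean` are hypothetical names, and `_embed` stands in for whatever expensive per-key computation a real UDF performs.

```python
from functools import lru_cache


def _embed(key):
    """Hypothetical expensive computation, stubbed out for the example."""
    return tuple(float(ord(c)) for c in key)


class LeakyEmbedder:
    """Bad: the dict gains an entry for every distinct key, for the actor's lifetime."""

    def __init__(self):
        self._cache = {}

    def __call__(self, key):
        if key not in self._cache:
            self._cache[key] = _embed(key)
        return self._cache[key]


class BoundedEmbedder:
    """Better: lru_cache evicts the least recently used entry once maxsize is reached."""

    def __init__(self, maxsize=1024):
        self._embed = lru_cache(maxsize=maxsize)(_embed)

    def __call__(self, key):
        return self._embed(key)

    def cache_size(self):
        return self._embed.cache_info().currsize


class RunningMean:
    """Summarize, don't retain: keep two numbers instead of every batch seen."""

    def __init__(self):
        self.count = 0
        self.total = 0.0

    def update(self, batch):
        self.count += len(batch)
        self.total += sum(batch)

    @property
    def mean(self):
        return self.total / self.count if self.count else 0.0
```

The bounded variants trade a little recomputation or precision for a memory footprint that no longer depends on how many rows the actor has processed.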

Linear: https://linear.app/lancedb/issue/GEN-512

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@mintlify
Contributor

mintlify Bot commented May 19, 2026

Preview deployment for your docs. Learn more about Mintlify Previews.

| Project | Status | Preview | Updated (UTC) |
|---|---|---|---|
| lancedb-bcbb4faf | 🟢 Ready | View Preview | May 19, 2026, 11:32 PM |

…ng guide

memray alone gives users the Python-side allocation story, but real
worker OOMs are easier to attribute by also watching process RSS
(``resource.getrusage().ru_maxrss``) and Arrow's own allocator
(``pyarrow.total_allocated_bytes()``) side-by-side over time. Adds:

- A drop-in ``log_memory(seq)`` helper Geneva users can paste into
  their UDF (with macOS/Linux ru_maxrss unit handling)
- A short explanation of what each number actually represents
  (rss = OS-given, arrow_live = live Arrow buffers, gap = "everything
  else": Python heap, native libs, allocator retention)
- A five-row diagnostic table mapping observed patterns to root cause
  and concrete first-fix advice (allocator retention, real Arrow leak,
  oversized peak, pathological row, healthy)
- A pointer to Geneva's reference integration test, whose stdout logs
  this exact breakdown so users can see the patterns in practice
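
A rough sketch of what a `log_memory(seq)` helper along these lines might look like. It is not the helper shipped in the docs page: the dict layout and log format are assumptions, and the `ImportError` fallback (reporting `arrow_live` as 0 when pyarrow is missing) is added only so the sketch runs anywhere. Note that `ru_maxrss` is the process's high-water RSS, reported in kilobytes on Linux and bytes on macOS.

```python
import resource
import sys

try:
    import pyarrow as pa
except ImportError:  # assumption: treat Arrow usage as 0 when pyarrow is absent
    pa = None

MIB = 1024 * 1024


def memory_snapshot():
    """Return rss, arrow_live, and their gap, all in bytes."""
    maxrss = resource.getrusage(resource.RUSAGE_SELF).ru_maxrss
    # Linux reports kilobytes, macOS reports bytes.
    rss = maxrss if sys.platform == "darwin" else maxrss * 1024
    arrow_live = pa.total_allocated_bytes() if pa is not None else 0
    return {"rss": rss, "arrow_live": arrow_live, "gap": rss - arrow_live}


def log_memory(seq):
    """Print one line per call so growth over batches is easy to eyeball."""
    snap = memory_snapshot()
    print(
        f"[mem seq={seq}] rss={snap['rss'] / MIB:.1f}MiB "
        f"arrow_live={snap['arrow_live'] / MIB:.1f}MiB "
        f"gap={snap['gap'] / MIB:.1f}MiB"
    )
    return snap
```

Calling `log_memory(i)` every N batches inside the UDF gives the three columns the diagnostic table refers to. Because `ru_maxrss` never decreases, a flat `rss` line means no new peak was reached, not that memory was returned to the OS.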

Linear: https://linear.app/lancedb/issue/GEN-512

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Contributor

@dantasse left a comment

Should this be in "running jobs" section instead of the "transforms" section?

Otherwise, looks good, easy enough to follow! Just a few questions, nothing blocking I think

Comment thread docs/geneva/udfs/profiling-memory.mdx Outdated

with ray_cluster(
    local=True,
    extra_env={"MY_UDF_MEMRAY_OUT_DIR": "/tmp/my-udf-profile"},
Contributor

Suggested change
-    extra_env={"MY_UDF_MEMRAY_OUT_DIR": "/tmp/my-udf-profile"},
+    extra_env={"MY_UDF_MEMRAY_OUT_DIR": os.environ.get("MY_UDF_MEMRAY_OUT_DIR")},

maybe?
also is with ray_cluster(...) what we're currently recommending ppl to do for a local ray cluster with extra env vars?
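
One wrinkle with the suggested change: `os.environ.get(...)` returns `None` when the variable is unset, and the thread does not say how `ray_cluster` treats a `None` value. A hedged sketch that sidesteps the question by omitting the key entirely; `profiling_env` is a hypothetical helper name, not part of Geneva.

```python
import os

ENV_VAR = "MY_UDF_MEMRAY_OUT_DIR"


def profiling_env(environ=None):
    """Forward the profiling variable only when the caller has actually set it."""
    environ = os.environ if environ is None else environ
    out_dir = environ.get(ENV_VAR)
    return {ENV_VAR: out_dir} if out_dir else {}


# Intended use, following the snippet under review:
#   with ray_cluster(local=True, extra_env=profiling_env()):
#       ...
```

With the variable unset, workers receive an empty `extra_env` and the tracker stays off, which keeps the opt-in behavior intact.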


memray reports two values you'll care about most:

- **Peak heap** (`metadata.peak_memory`) — the high-water mark. This is what triggers OOMs. A peak well above your `setup()` allocations means a batch transiently doubles memory before freeing.
Contributor

it could mean transient doubling, or a lot of other things, right? like a slow memory leak would still result in high peak_memory, I assume

Contributor Author

correct, it would just take longer to surface, which is unfortunately what some of our customers seem to be experiencing.
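
The distinction raised here (a transient spike versus a slow leak, both of which raise the peak) can be illustrated with the standard library's `tracemalloc` rather than memray. This is a toy sketch: `run_batches`, `transient_spike`, and `slow_leak` are hypothetical names, and the allocation sizes are arbitrary.

```python
import tracemalloc


def run_batches(process, n_batches):
    """Return (bytes retained after each batch, peak bytes) for a workload."""
    tracemalloc.start()
    baseline, _ = tracemalloc.get_traced_memory()
    retained = []
    for i in range(n_batches):
        process(i)
        current, _ = tracemalloc.get_traced_memory()
        retained.append(current - baseline)
    _, peak = tracemalloc.get_traced_memory()
    tracemalloc.stop()
    return retained, peak - baseline


def transient_spike(_i):
    # Large scratch buffer that is released as soon as the batch returns.
    scratch = bytearray(8 * 1024 * 1024)
    scratch[0] = 1


_leaked = []


def slow_leak(_i):
    # Small buffer per batch that is never released.
    _leaked.append(bytearray(100 * 1024))


spike_retained, spike_peak = run_batches(transient_spike, 20)
leak_retained, leak_peak = run_batches(slow_leak, 20)
```

In the spike case the peak is large but retained memory stays flat across batches. In the leak case each batch adds a little, so the peak is simply the latest total and keeps rising with batch count, which is why a slow leak can take a long backfill to surface.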


| Pattern | Diagnosis | First thing to try |
|---|---|---|
| `rss` climbs slowly, `arrow_live` flat near zero, big growing `gap` | Allocator retention — Python freed it but `glibc` is keeping the pages | `ctypes.CDLL("libc.so.6").malloc_trim(0)` periodically (Linux only); or set `MALLOC_TRIM_THRESHOLD_=131072` |
Contributor

where would I set MALLOC_TRIM_THRESHOLD_? I guess that's an env var? This is confusing to me - I don't have a sense what's happening here, so I would probably copy these in as magic incantations but feel bad about it :)

But this whole section is a little over my head anyway. Good to have it, I think, but I hope I never need it!

Contributor Author

Yeah, that was a new one for me as well; it's just a Linux environment variable. So much of the tuning on Linux/macOS/Windows is different that it really makes me wish there were a more standard API/config for this stuff.
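
For readers with the same question, a hedged sketch of the two options from the table row. glibc reads `MALLOC_TRIM_THRESHOLD_` from the environment when a process starts, so it has to be in the worker's environment before Python launches; passing it through the same `extra_env` mechanism is an assumption here, not something the thread confirms. `trim_native_heap` and `WORKER_ENV` are hypothetical names.

```python
import ctypes
import sys

# Assumption: handed to workers the same way as the profiling variable.
WORKER_ENV = {"MALLOC_TRIM_THRESHOLD_": "131072"}


def trim_native_heap():
    """Ask glibc to return free heap pages to the OS.

    Returns True if malloc_trim was called, False where it is unavailable
    (non-Linux platforms, or Linux images without glibc).
    """
    if not sys.platform.startswith("linux"):
        return False
    try:
        libc = ctypes.CDLL("libc.so.6")
        trim = libc.malloc_trim
    except (OSError, AttributeError):
        return False
    trim.argtypes = [ctypes.c_size_t]
    trim.restype = ctypes.c_int
    trim(0)  # 0 = keep no extra padding at the top of the heap
    return True
```

Calling `trim_native_heap()` every few hundred batches is one way to apply the "periodically" advice. It only helps in the allocator-retention case, where Python has already freed the objects.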


If `memray summary` doesn't show leaked bytes growing roughly with batch count after this change, your tracker isn't actually attached (most often: the env var isn't reaching workers — re-check `extra_env`).
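
A sketch of what "planting a known leak" could look like, assuming a plain class as the UDF stand-in. The `MY_UDF_PLANT_LEAK` flag and the class name are hypothetical, and the 1 MiB per call figure is chosen only so the growth is easy to spot in the profile.

```python
import os

PLANT_FLAG = "MY_UDF_PLANT_LEAK"  # hypothetical opt-in flag for the sanity check


class SanityCheckUDF:
    """Stand-in UDF that deliberately retains 1 MiB per call when the flag is set."""

    def __init__(self):
        self._planted = []

    def __call__(self, batch):
        if os.environ.get(PLANT_FLAG) == "1":
            # Deliberate leak: remove after confirming the tracker sees it.
            self._planted.append(bytearray(1024 * 1024))
        return [x + 1 for x in batch]
```

With the flag set on the workers, leaked bytes should grow by roughly 1 MiB per call. If they do not, the profile is not coming from the process that runs the UDF.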

Geneva's own test suite ships a reference implementation of this pattern in `src/stress_tests/_memray_probe.py` and `src/stress_tests/test_memray_stateful_udf.py`, plus a GitHub Actions workflow (`memray-stateful-udf-profile.yml`) that uploads the per-actor `.bin` and rendered flamegraph as a CI artifact. Feel free to copy that scaffolding for your own project's UDFs.
Contributor

Can we reference the geneva source? I guess so - if you pay for geneva, you get the source. It kinda feels weird; we basically never do otherwise. But I guess there's not anything actually wrong here. Ok, I guess I don't really have a question here 😆

@justinrmiller merged commit 51b0570 into main on May 21, 2026
2 checks passed
@justinrmiller deleted the gen-512-memray-udf-profiling-docs branch on May 21, 2026, 17:02