Skip to content

gh-149584: Fix excessive overhead in the Tachyon profiler regarding the cache behavior#149649

Open
pablogsal wants to merge 11 commits into
python:mainfrom
pablogsal:speedups
Open

gh-149584: Fix excessive overhead in the Tachyon profiler regarding the cache behavior#149649
pablogsal wants to merge 11 commits into
python:mainfrom
pablogsal:speedups

Conversation

@pablogsal
Copy link
Copy Markdown
Member

@pablogsal pablogsal commented May 10, 2026

Some ideas after the discussion in the issue with @maurycy. The profiler was spending too much time on repeated remote-memory bookkeeping, full remote page reads for small fixed-size structs, repeated remote writes of unchanged frame-cache state, and Python object allocation churn on steady-state frame-cache hits.

This PR improves the profiler by:

  • reading hot interpreter/thread/frame structs with exact remote reads instead of page-cache reads
  • tracking the number of live page-cache entries so cache clear/search only scans the used prefix
  • batching predicted interpreter/thread/frame reads with process_vm_readv() on Linux
  • reusing cached FrameInfo and thread id objects when frame-cache hits dominate

The last_profiled_frame remote-write suppression is already present in current upstream/main, so this branch keeps that baseline behavior and builds on top of it.

Benchmark

Benchmarked with:

./python Tools/inspection/benchmark_external_inspection.py --duration 3

For the per-commit measurements, I used the same benchmark workload in quiet mode with cache_frames=True and all_threads=True.

Step Work rate Sample rate Avg call time Incremental win
upstream/main baseline 382,593/s 363,963/s 2.614 us -
Exact reads for hot structs 549,741/s 514,642/s 1.819 us +167,148/s (+43.7%)
Track live page-cache entries 747,946/s 688,282/s 1.337 us +198,205/s (+36.1%)
Batch predicted remote reads 897,703/s 806,084/s 1.114 us +149,756/s (+20.0%)
Reuse profiler result objects 1,187,314/s 1,047,698/s 0.842 us +289,611/s (+32.3%)

Final benchmark using the benchmark script:

Metric Baseline Final Change
Work rate 382,593/s 1,180,434/s 3.08x faster
Sample rate 363,963/s 1,038,965/s 2.85x faster
Avg call time 2.614 us 0.85 us -67.5%
Success rate 99.67% 99.84% +0.17 pp

Final benchmark output:

Average call time:   0.85 us
Work rate:           1,180,434.2 calls/sec
Sample rate:         1,038,964.9 samples/sec
Success rate:        99.84%

pablogsal added 2 commits May 10, 2026 19:31
Use exact remote reads for interpreter state, thread state, and
interpreter frame structs instead of pulling full remote pages into the
profiler page cache. This matches the core change from
python#149585.
The profiler clears the page cache between samples, so live entries are
always packed at the front. Track the live count and only clear/search
that prefix instead of scanning all 1024 slots on the hot path.
@pablogsal
Copy link
Copy Markdown
Member Author

@maurycy do you mind reviewing this PR?

pablogsal added 3 commits May 10, 2026 20:22
Use the frame cache to predict the next thread state and top frame
address, then batch interpreter/thread/frame reads with process_vm_readv
when profiling a Linux target. Reuse prefetched frame buffers in the
frame walker when the prediction is valid.
Cache the last FrameInfo tuple per code object/instruction offset, reuse
cached thread id objects, and append cached parent frames directly on
full frame-cache hits. This cuts Python allocation churn in the
steady-state profiler path.
@maurycy
Copy link
Copy Markdown
Contributor

maurycy commented May 11, 2026

@pablogsal Awesome job, I'm delighted :) Definitely will do!

Perhaps it's worth an entry in the "Optimizations" section?

One thing that I'm wondering immediately is extending benchmark_external_inspection with get_async_stack_trace or even get_all_awaited_by.

Comment thread Modules/_remote_debugging/code_objects.c
Comment thread Modules/_remote_debugging/threads.c Outdated
Comment thread Modules/_remote_debugging/module.c
Comment thread Modules/_remote_debugging/_remote_debugging.h Outdated
@pablogsal
Copy link
Copy Markdown
Member Author

One thing that I'm wondering immediately is extending benchmark_external_inspection with get_async_stack_trace or even get_all_awaited_by.

Added async benchmark modes so the benchmark harness can exercise get_stack_trace(), get_async_stack_trace(), and get_all_awaited_by().

@pablogsal
Copy link
Copy Markdown
Member Author

Perhaps it's worth an entry in the "Optimizations" section?

Since is beta1 I think we are good for now. Later makes sense

@pablogsal pablogsal requested a review from maurycy May 13, 2026 00:03
@pablogsal
Copy link
Copy Markdown
Member Author

Can you give it another go?

@maurycy
Copy link
Copy Markdown
Contributor

maurycy commented May 13, 2026

@pablogsal Left two more comments:

(If they do not show, you have to check the resolved threads; the comments are follow-ups.)

@pablogsal
Copy link
Copy Markdown
Member Author

Will take a look in 1h

@maurycy
Copy link
Copy Markdown
Contributor

maurycy commented May 13, 2026

@pablogsal One last thing (in the current round): after v. before numbers for the newly added async benchmarks in the description, to see the gain and for future reference (for git blame tourists, like me)

@pablogsal
Copy link
Copy Markdown
Member Author

I'm traveling so I don't have access to the same box where I run the original ones so I will need to do these in a different one :)

@maurycy
Copy link
Copy Markdown
Contributor

maurycy commented May 13, 2026

I'm traveling so I don't have access to the same box where I run the original ones so I will need to do these in a different one :)

Thank you. Happy PyCon! :)

@pablogsal
Copy link
Copy Markdown
Member Author

@maurycy addressed the comments

Comment thread Modules/_remote_debugging/module.c Outdated
Comment on lines +526 to +529
InterpreterThreadCacheEntry *entry =
&self->cached_tstates[interpreter_thread_cache_index(interpreter_addr)];
entry->interpreter_addr = interpreter_addr;
entry->thread_state_addr = thread_state_addr;
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Given:

typedef struct {
    uintptr_t interpreter_addr;
    uintptr_t thread_state_addr;
    uint64_t code_object_generation;
} InterpreterThreadCacheEntry;

https://github.com/python/cpython/pull/149649/changes#diff-a6ce0bef7825b17047f05fdec7aa610bc1d2d2b368e72b6871a79c3b1d2dfaf9R242-R246

Can code_object_generation become stale here?

Comment on lines +561 to +562
self->cached_generation_interpreter_addr = interpreter_addr;
self->cached_code_object_generation = code_object_generation;
Copy link
Copy Markdown
Contributor

@maurycy maurycy May 20, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Given:

typedef struct {
    uintptr_t interpreter_addr;
    uintptr_t thread_state_addr;
    uint64_t code_object_generation;
} InterpreterThreadCacheEntry;

https://github.com/python/cpython/pull/149649/changes#diff-a6ce0bef7825b17047f05fdec7aa610bc1d2d2b368e72b6871a79c3b1d2dfaf9R242-R246

Can thread_state_addr become stale here?

I've added these two "mirror" comments because I'm puzzled about the lifecycle of InterpreterThreadCacheEntry.

Copy link
Copy Markdown
Member Author

@pablogsal pablogsal May 20, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh sorry, let me push a fix. The intended design is that each writer fully owns its cache slot: the same pattern the other shortcut already uses, where tstate and generation each have their own (interpreter_addr, value) pair. The L2 table joins both values under one slot, which is what makes the lifecycle hard to reason about. I'll split InterpreterThreadCacheEntry into two single-purpose tables, InterpreterTstateCacheEntryand InterpreterGenerationCacheEntry, so the L2 matches the L1 design and each writer fully owns its table.

Comment on lines +219 to 222
PyObject *last_frame_info;
ptrdiff_t last_addrq;
uintptr_t addr_code_adaptive;
} CachedCodeMetadata;
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

But this is also per interpreter, or not?

Copy link
Copy Markdown
Member Author

@pablogsal pablogsal May 20, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is notactually per-interpreter: code_object_cache is a single hashtable on the unwinder, keyed by the remote code object's address. It's shared across all interpreters in the target process.

@maurycy
Copy link
Copy Markdown
Contributor

maurycy commented May 20, 2026

@pablogsal Thank you. 🙏🏼 I've left three comments. The complexity gets real. 😞 Bump about the new benchmark numbers. Random idea is improving test coverage with many interpreters, similarly to:

def test_cache_vs_no_cache_equivalence(self):
"""Test that cache_frames=True and cache_frames=False produce equivalent results."""
script_body = """\
def level3():
sock.sendall(b"ready"); sock.recv(16)
def level2():
level3()
def level1():
level2()
level1()
"""
with self._target_process(script_body) as (
p,
client_socket,
make_unwinder,
):
_wait_for_signal(client_socket, b"ready")
# Sample with cache
unwinder_cache = make_unwinder(cache_frames=True)
frames_cached = self._get_frames_with_retry(
unwinder_cache, {"level1", "level2", "level3"}
)
# Sample without cache
unwinder_no_cache = make_unwinder(cache_frames=False)
frames_no_cache = self._get_frames_with_retry(
unwinder_no_cache, {"level1", "level2", "level3"}
)
client_socket.sendall(b"done")
self.assertIsNotNone(frames_cached)
self.assertIsNotNone(frames_no_cache)
# Same number of frames
self.assertEqual(len(frames_cached), len(frames_no_cache))
# Same function names in same order
funcs_cached = [f.funcname for f in frames_cached]
funcs_no_cache = [f.funcname for f in frames_no_cache]
self.assertEqual(funcs_cached, funcs_no_cache)
# For level3 (leaf frame), due to timing races we can be at either
# sock.sendall() or sock.recv() - both are valid. For parent frames,
# the locations should match exactly.
# Valid locations for level3: line 3 has two statements
# sock.sendall(b"ready") -> col 4-26
# sock.recv(16) -> col 28-41
level3_valid_cols = {(4, 26), (28, 41)}
for i in range(len(frames_cached)):
loc_cached = frames_cached[i].location
loc_no_cache = frames_no_cache[i].location
if frames_cached[i].funcname == "level3":
# Leaf frame: can be at either statement
self.assertIn(
(loc_cached.col_offset, loc_cached.end_col_offset),
level3_valid_cols,
)
self.assertIn(
(loc_no_cache.col_offset, loc_no_cache.end_col_offset),
level3_valid_cols,
)
else:
# Parent frames: must match exactly
self.assertEqual(loc_cached, loc_no_cache)

@pablogsal
Copy link
Copy Markdown
Member Author

@pablogsal Thank you. 🙏🏼 I've left three comments. The complexity gets real. 😞 Bump about the new benchmark numbers.

I won't have access to my box until next week :(

Random idea is improving test coverage with many interpreters, similarly to:

I don't think this helps a lot. I tried out of curiosity and this test passes without any of the bugfixes we have added. We can try to add more tests in a separate PR if you want :)

@pablogsal
Copy link
Copy Markdown
Member Author

Measured in a different machine:

get_async_stack_trace() — baseline vs final

Metric Baseline Final Change
Work rate 101,117/s 160,804/s 1.59x faster
Sample rate 97,857/s 153,686/s 1.57x faster
Avg call time 9.89 µs 6.22 µs -37.1%
Success rate 98.93% 98.90% -0.03 pp

get_all_awaited_by() — baseline vs final

Metric Baseline Final Change
Work rate 19,241/s 22,103/s 1.15x faster
Sample rate 19,157/s 21,973/s 1.15x faster
Avg call time 51.97 µs 45.24 µs -12.9%
Success rate 100.00% 100.00% flat

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants