gh-149584: Fix excessive overhead in the Tachyon profiler regarding the cache behavior by pablogsal · Pull Request #149649 · python/cpython

pablogsal · 2026-05-10T18:29:54Z

Some ideas after the discussion in the issue with @maurycy. The profiler was spending too much time on repeated remote-memory bookkeeping, full remote page reads for small fixed-size structs, repeated remote writes of unchanged frame-cache state, and Python object allocation churn on steady-state frame-cache hits.

This PR improves the profiler by:

reading hot interpreter/thread/frame structs with exact remote reads instead of page-cache reads
tracking the number of live page-cache entries so cache clear/search only scans the used prefix
batching predicted interpreter/thread/frame reads with process_vm_readv() on Linux
reusing cached FrameInfo and thread id objects when frame-cache hits dominate

The last_profiled_frame remote-write suppression is already present in current upstream/main, so this branch keeps that baseline behavior and builds on top of it.

Benchmark

Benchmarked with:

./python Tools/inspection/benchmark_external_inspection.py --duration 3

For the per-commit measurements, I used the same benchmark workload in quiet mode with cache_frames=True and all_threads=True.

Step	Work rate	Sample rate	Avg call time	Incremental win
`upstream/main` baseline	382,593/s	363,963/s	2.614 us	-
Exact reads for hot structs	549,741/s	514,642/s	1.819 us	+167,148/s (+43.7%)
Track live page-cache entries	747,946/s	688,282/s	1.337 us	+198,205/s (+36.1%)
Batch predicted remote reads	897,703/s	806,084/s	1.114 us	+149,756/s (+20.0%)
Reuse profiler result objects	1,187,314/s	1,047,698/s	0.842 us	+289,611/s (+32.3%)

Final benchmark using the benchmark script:

Metric	Baseline	Final	Change
Work rate	382,593/s	1,180,434/s	3.08x faster
Sample rate	363,963/s	1,038,965/s	2.85x faster
Avg call time	2.614 us	0.85 us	-67.5%
Success rate	99.67%	99.84%	+0.17 pp

Final benchmark output:

Average call time:   0.85 us
Work rate:           1,180,434.2 calls/sec
Sample rate:         1,038,964.9 samples/sec
Success rate:        99.84%

Issue: _remote_debugging: reading whole pages over and over #149584

Use exact remote reads for interpreter state, thread state, and interpreter frame structs instead of pulling full remote pages into the profiler page cache. This matches the core change from python#149585.

The profiler clears the page cache between samples, so live entries are always packed at the front. Track the live count and only clear/search that prefix instead of scanning all 1024 slots on the hot path.

pablogsal · 2026-05-10T18:34:01Z

@maurycy do you mind reviewing this PR?

Use the frame cache to predict the next thread state and top frame address, then batch interpreter/thread/frame reads with process_vm_readv when profiling a Linux target. Reuse prefetched frame buffers in the frame walker when the prediction is valid.

Cache the last FrameInfo tuple per code object/instruction offset, reuse cached thread id objects, and append cached parent frames directly on full frame-cache hits. This cuts Python allocation churn in the steady-state profiler path.

maurycy · 2026-05-11T09:46:05Z

@pablogsal Awesome job, I'm delighted :) Definitely will do!

Perhaps it's worth an entry in the "Optimizations" section?

One thing that I'm wondering immediately is extending benchmark_external_inspection with get_async_stack_trace or even get_all_awaited_by.

pablogsal · 2026-05-13T00:02:11Z

One thing that I'm wondering immediately is extending benchmark_external_inspection with get_async_stack_trace or even get_all_awaited_by.

Added async benchmark modes so the benchmark harness can exercise get_stack_trace(), get_async_stack_trace(), and get_all_awaited_by().

pablogsal · 2026-05-13T00:03:18Z

Perhaps it's worth an entry in the "Optimizations" section?

Since is beta1 I think we are good for now. Later makes sense

pablogsal · 2026-05-13T00:04:00Z

Can you give it another go?

maurycy · 2026-05-13T08:26:06Z

@pablogsal Left two more comments:

gh-149584: Fix excessive overhead in the Tachyon profiler regarding the cache behavior #149649 (comment)
gh-149584: Fix excessive overhead in the Tachyon profiler regarding the cache behavior #149649 (comment)

(If they do not show, you have to check the resolved threads; the comments are follow-ups.)

pablogsal · 2026-05-13T08:27:10Z

Will take a look in 1h

maurycy · 2026-05-13T09:04:59Z

@pablogsal One last thing (in the current round): after v. before numbers for the newly added async benchmarks in the description, to see the gain and for future reference (for git blame tourists, like me)

pablogsal · 2026-05-13T09:22:50Z

I'm traveling so I don't have access to the same box where I run the original ones so I will need to do these in a different one :)

maurycy · 2026-05-13T09:52:48Z

I'm traveling so I don't have access to the same box where I run the original ones so I will need to do these in a different one :)

Thank you. Happy PyCon! :)

pablogsal · 2026-05-19T23:10:14Z

@maurycy addressed the comments

maurycy · 2026-05-20T03:40:26Z

+    InterpreterThreadCacheEntry *entry =
+        &self->cached_tstates[interpreter_thread_cache_index(interpreter_addr)];
+    entry->interpreter_addr = interpreter_addr;
+    entry->thread_state_addr = thread_state_addr;


Given:

typedef struct { uintptr_t interpreter_addr; uintptr_t thread_state_addr; uint64_t code_object_generation; } InterpreterThreadCacheEntry;

https://github.com/python/cpython/pull/149649/changes#diff-a6ce0bef7825b17047f05fdec7aa610bc1d2d2b368e72b6871a79c3b1d2dfaf9R242-R246

Can code_object_generation become stale here?

maurycy · 2026-05-20T03:43:50Z

+        self->cached_generation_interpreter_addr = interpreter_addr;
+        self->cached_code_object_generation = code_object_generation;


Given:

typedef struct { uintptr_t interpreter_addr; uintptr_t thread_state_addr; uint64_t code_object_generation; } InterpreterThreadCacheEntry;

https://github.com/python/cpython/pull/149649/changes#diff-a6ce0bef7825b17047f05fdec7aa610bc1d2d2b368e72b6871a79c3b1d2dfaf9R242-R246

Can thread_state_addr become stale here?

I've added these two "mirror" comments because I'm puzzled about the lifecycle of InterpreterThreadCacheEntry.

Oh sorry, let me push a fix. The intended design is that each writer fully owns its cache slot: the same pattern the other shortcut already uses, where tstate and generation each have their own (interpreter_addr, value) pair. The L2 table joins both values under one slot, which is what makes the lifecycle hard to reason about. I'll split InterpreterThreadCacheEntry into two single-purpose tables, InterpreterTstateCacheEntryand InterpreterGenerationCacheEntry, so the L2 matches the L1 design and each writer fully owns its table.

maurycy · 2026-05-20T03:51:25Z

+    PyObject *last_frame_info;
+    ptrdiff_t last_addrq;
    uintptr_t addr_code_adaptive;
 } CachedCodeMetadata;


But this is also per interpreter, or not?

This is notactually per-interpreter: code_object_cache is a single hashtable on the unwinder, keyed by the remote code object's address. It's shared across all interpreters in the target process.

maurycy · 2026-05-20T04:01:58Z

@pablogsal Thank you. 🙏🏼 I've left three comments. The complexity gets real. 😞 Bump about the new benchmark numbers. Random idea is improving test coverage with many interpreters, similarly to:

cpython/Lib/test/test_external_inspection.py

Lines 3381 to 3452 in a8e3247

    
               def test_cache_vs_no_cache_equivalence(self): 
        
                   """Test that cache_frames=True and cache_frames=False produce equivalent results.""" 
        
                   script_body = """\ 
        
                       def level3(): 
        
                           sock.sendall(b"ready"); sock.recv(16) 
        
                       def level2(): 
        
                           level3() 
        
                       def level1(): 
        
                           level2() 
        
                       level1() 
        
                       """ 
        
                   with self._target_process(script_body) as ( 
        
                       p, 
        
                       client_socket, 
        
                       make_unwinder, 
        
                   ): 
        
                       _wait_for_signal(client_socket, b"ready") 
        
                       # Sample with cache 
        
                       unwinder_cache = make_unwinder(cache_frames=True) 
        
                       frames_cached = self._get_frames_with_retry( 
        
                           unwinder_cache, {"level1", "level2", "level3"} 
        
                       ) 
        
                       # Sample without cache 
        
                       unwinder_no_cache = make_unwinder(cache_frames=False) 
        
                       frames_no_cache = self._get_frames_with_retry( 
        
                           unwinder_no_cache, {"level1", "level2", "level3"} 
        
                       ) 
        
                       client_socket.sendall(b"done") 
        
                   self.assertIsNotNone(frames_cached) 
        
                   self.assertIsNotNone(frames_no_cache) 
        
                   # Same number of frames 
        
                   self.assertEqual(len(frames_cached), len(frames_no_cache)) 
        
                   # Same function names in same order 
        
                   funcs_cached = [f.funcname for f in frames_cached] 
        
                   funcs_no_cache = [f.funcname for f in frames_no_cache] 
        
                   self.assertEqual(funcs_cached, funcs_no_cache) 
        
                   # For level3 (leaf frame), due to timing races we can be at either 
        
                   # sock.sendall() or sock.recv() - both are valid. For parent frames, 
        
                   # the locations should match exactly. 
        
                   # Valid locations for level3: line 3 has two statements 
        
                   #   sock.sendall(b"ready") -> col 4-26 
        
                   #   sock.recv(16) -> col 28-41 
        
                   level3_valid_cols = {(4, 26), (28, 41)} 
        
                   for i in range(len(frames_cached)): 
        
                       loc_cached = frames_cached[i].location 
        
                       loc_no_cache = frames_no_cache[i].location 
        
                       if frames_cached[i].funcname == "level3": 
        
                           # Leaf frame: can be at either statement 
        
                           self.assertIn( 
        
                               (loc_cached.col_offset, loc_cached.end_col_offset), 
        
                               level3_valid_cols, 
        
                           ) 
        
                           self.assertIn( 
        
                               (loc_no_cache.col_offset, loc_no_cache.end_col_offset), 
        
                               level3_valid_cols, 
        
                           ) 
        
                       else: 
        
                           # Parent frames: must match exactly 
        
                           self.assertEqual(loc_cached, loc_no_cache)

pablogsal · 2026-05-20T05:56:17Z

@pablogsal Thank you. 🙏🏼 I've left three comments. The complexity gets real. 😞 Bump about the new benchmark numbers.

I won't have access to my box until next week :(

Random idea is improving test coverage with many interpreters, similarly to:

I don't think this helps a lot. I tried out of curiosity and this test passes without any of the bugfixes we have added. We can try to add more tests in a separate PR if you want :)

pablogsal · 2026-05-20T06:05:26Z

Measured in a different machine:

`get_async_stack_trace()` — baseline vs final

Metric	Baseline	Final	Change
Work rate	101,117/s	160,804/s	1.59x faster
Sample rate	97,857/s	153,686/s	1.57x faster
Avg call time	9.89 µs	6.22 µs	-37.1%
Success rate	98.93%	98.90%	-0.03 pp

`get_all_awaited_by()` — baseline vs final

Metric	Baseline	Final	Change
Work rate	19,241/s	22,103/s	1.15x faster
Sample rate	19,157/s	21,973/s	1.15x faster
Avg call time	51.97 µs	45.24 µs	-12.9%
Success rate	100.00%	100.00%	flat

bedevere-app Bot added the awaiting core review label May 10, 2026

bedevere-app Bot mentioned this pull request May 10, 2026

_remote_debugging: reading whole pages over and over #149584

Open

pablogsal added 2 commits May 10, 2026 19:31

pythongh-149584: Avoid page reads for hot profiler structs

c5520d3

Use exact remote reads for interpreter state, thread state, and interpreter frame structs instead of pulling full remote pages into the profiler page cache. This matches the core change from python#149585.

pythongh-149584: Track live remote page cache entries

8be8d7d

The profiler clears the page cache between samples, so live entries are always packed at the front. Track the live count and only clear/search that prefix instead of scanning all 1024 slots on the hot path.

pablogsal force-pushed the speedups branch from fa43c34 to 7899a6a Compare May 10, 2026 18:31

pablogsal mentioned this pull request May 10, 2026

gh-149584: Do not use page cache for thread/frame/interp reads #149585

Closed

pablogsal force-pushed the speedups branch from 7899a6a to d5d3f3b Compare May 10, 2026 18:35

pablogsal added 3 commits May 10, 2026 20:22

pythongh-149584: Reuse profiler result objects

c69a0f3

Cache the last FrameInfo tuple per code object/instruction offset, reuse cached thread id objects, and append cached parent frames directly on full frame-cache hits. This cuts Python allocation churn in the steady-state profiler path.

pythongh-149584: Add NEWS for Tachyon profiler overhead fix

7a85c9a

pablogsal force-pushed the speedups branch from d5d3f3b to 7a85c9a Compare May 10, 2026 19:22

maurycy mentioned this pull request May 12, 2026

_remote_debugging: returning the same samples over and over #149718

Closed

maurycy reviewed May 12, 2026

View reviewed changes

Comment thread Modules/_remote_debugging/code_objects.c

Comment thread Modules/_remote_debugging/threads.c Outdated

Comment thread Modules/_remote_debugging/module.c

Comment thread Modules/_remote_debugging/_remote_debugging.h Outdated

pablogsal added 3 commits May 13, 2026 00:53

Address review feedback

46a0b2c

Add async benchmark modes

f7fe3be

cleanp

2995187

pablogsal requested a review from maurycy May 13, 2026 00:03

small cleanp

d576d38

fixup! small cleanp

05799a7

maurycy reviewed May 20, 2026

View reviewed changes

fixup! fixup! small cleanp

4245b13

		self->cached_generation_interpreter_addr = interpreter_addr;
		self->cached_code_object_generation = code_object_generation;

Uh oh!

Conversation

pablogsal commented May 10, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Benchmark

Uh oh!

pablogsal commented May 10, 2026

Uh oh!

maurycy commented May 11, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

pablogsal commented May 13, 2026

Uh oh!

pablogsal commented May 13, 2026

Uh oh!

pablogsal commented May 13, 2026

Uh oh!

maurycy commented May 13, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

pablogsal commented May 13, 2026

Uh oh!

maurycy commented May 13, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

pablogsal commented May 13, 2026

Uh oh!

maurycy commented May 13, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

pablogsal commented May 19, 2026

Uh oh!

maurycy May 20, 2026

Choose a reason for hiding this comment

Uh oh!

maurycy May 20, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

pablogsal May 20, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

maurycy May 20, 2026

Choose a reason for hiding this comment

Uh oh!

pablogsal May 20, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

maurycy commented May 20, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

pablogsal commented May 20, 2026

Uh oh!

pablogsal commented May 20, 2026

get_async_stack_trace() — baseline vs final

get_all_awaited_by() — baseline vs final

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

pablogsal commented May 10, 2026 •

edited

Loading

maurycy commented May 11, 2026 •

edited

Loading

maurycy commented May 13, 2026 •

edited

Loading

maurycy commented May 13, 2026 •

edited

Loading

maurycy commented May 13, 2026 •

edited

Loading

maurycy May 20, 2026 •

edited

Loading

pablogsal May 20, 2026 •

edited

Loading

pablogsal May 20, 2026 •

edited

Loading

maurycy commented May 20, 2026 •

edited

Loading

`get_async_stack_trace()` — baseline vs final

`get_all_awaited_by()` — baseline vs final