Conversation
  rank -= super_block_rank[s_block];
- auto pos = find_basicblock_is(rank, s_block);
+ // auto pos = find_basicblock_is(rank, s_block);
+ auto pos = find_basicblock(rank, s_block);
SUGGESTION: Potential select performance regression
Switching from the interpolation search (`find_basicblock_is`) to the linear scan (`find_basicblock`) can add extra SIMD scans on random data and reduce select throughput. If the prior change was for correctness, consider restoring the interpolation path with a guard for edge cases (e.g., zero-width range) to keep the faster path when valid.
Code Review Summary
Status: 4 Issues Found | Recommendation: Address before merge
Files Reviewed (4 files)
Reviewed by gpt-5.2-codex-20260114 · 1,358,861 tokens
Codecov Report
❌ Patch coverage is

@@ Coverage Diff @@
## main #34 +/- ##
==========================================
+ Coverage 86.17% 86.47% +0.29%
==========================================
Files 11 11
Lines 2850 2868 +18
Branches 561 557 -4
==========================================
+ Hits 2456 2480 +24
+ Misses 257 253 -4
+ Partials 137 135 -2
Flags with carried forward coverage won't be shown. ☔ View full report in Codecov by Sentry.
@Kilo please perform profiling review

@Kilo please perform profiling review on this pr

https://github.com/Kilo please perform profiling review on this PR

@Kilo please perform profiling performance review on this PR

@Malkovsky Profiling/performance review for this PR:

- Source vs target change: now calls `find_basicblock` instead of `find_basicblock_is` in the select path (around line 525).
- Performance impact (expected): regression in throughput/latency (especially for larger superblocks), since this replaces the interpolation-search style locator with a more linear basic-block scan path.
- Most likely cause: increased per-query block traversal and branch work in the hot path.

I do not see benchmark CI output attached to this PR, so there is no measured delta table to quote here. Based on the algorithmic path change alone, this is likely a real degradation rather than noise.

Suggested benchmark focus for confirmation: … vs …; include repetitions and min-time to reduce variance.

@Kilo please profile the changes of the PR
@Malkovsky Performance profiling review for PR #34:

Expected performance impact (based on code path change):

Most likely cause of degradation:

No measurable benchmark delta can be claimed until this PR is run on a benchmark-capable runner.
pong

Kilo Dispatch Result
Prompt Run Log

Performing one time database migration, may take a few minutes...

pong

at getModel (src/provider/provider.ts:1222:13)

pong

1 similar comment

pong
Force-pushed from a858d3f to 237ea15
@nikolay please review performance impact of the changes

@Malkovsky (empty Kilo output)

@nikolay please analyse performance impact

@nikolay please investigate performance impact

@nikolay please investigate performance impact

@nikolay please check performance impact of the changes

@Malkovsky (empty Kilo output)

@nikolay please check the performance impact of the changes
@Malkovsky Now I have all the data to produce the final report:

Performance Review: test/ai_profiling vs origin/main

Configuration

Code Change

The PR replaces one function call:

- auto pos = find_basicblock_is(rank, s_block);
+ // auto pos = find_basicblock_is(rank, s_block);
+ auto pos = find_basicblock(rank, s_block);
Timing Summary

Detailed Timing Results

| Benchmark | Base (ns) | New (ns) | Delta | Class |
|---|---|---|---|---|
| BM_SelectNonInterleaved/n:1024 | 40.5 | 38.0 | -6.0% | IMPROVEMENT |
| BM_SelectNonInterleaved/n:65536 | 43.3 | 42.8 | -1.3% | neutral |
| BM_SelectNonInterleaved/n:4194304 | 49.2 | 48.8 | -0.9% | neutral |
| BM_SelectNonInterleaved/n:268435456 | 104.2 | 105.3 | +1.0% | neutral |
| BM_SelectNonInterleaved12p5PercentFill/n:1024 | 40.0 | 37.6 | -6.0% | IMPROVEMENT |
| BM_SelectNonInterleaved12p5PercentFill/n:65536 | 43.0 | 42.7 | -0.7% | neutral |
| BM_SelectNonInterleaved12p5PercentFill/n:4194304 | 49.0 | 48.6 | -0.8% | neutral |
| BM_SelectNonInterleaved12p5PercentFill/n:268435456 | 98.8 | 106.9 | +8.2% | REGRESSION |
| BM_SelectNonInterleaved87p5PercentFill/n:1024 | 40.0 | 37.6 | -6.0% | IMPROVEMENT |
| BM_SelectNonInterleaved87p5PercentFill/n:65536 | 42.9 | 42.7 | -0.4% | neutral |
| BM_SelectNonInterleaved87p5PercentFill/n:4194304 | 49.2 | 48.8 | -0.9% | neutral |
| BM_SelectNonInterleaved87p5PercentFill/n:268435456 | 101.5 | 108.6 | +6.9% | REGRESSION |
| BM_SelectZeroNonInterleaved/n:1024 | 41.2 | 40.7 | -1.3% | neutral |
| BM_SelectZeroNonInterleaved/n:65536 | 45.2 | 44.9 | -0.7% | neutral |
| BM_SelectZeroNonInterleaved/n:4194304 | 52.5 | 51.8 | -1.2% | neutral |
| BM_SelectZeroNonInterleaved/n:268435456 | 109.2 | 108.5 | -0.6% | neutral |
| BM_SelectZeroNonInterleaved12p5PercentFill/n:1024 | 41.1 | 40.5 | -1.5% | neutral |
| BM_SelectZeroNonInterleaved12p5PercentFill/n:65536 | 45.0 | 44.9 | -0.1% | neutral |
| BM_SelectZeroNonInterleaved12p5PercentFill/n:4194304 | 52.4 | 51.9 | -1.1% | neutral |
| BM_SelectZeroNonInterleaved12p5PercentFill/n:268435456 | 110.5 | 110.2 | -0.3% | neutral |
| BM_SelectZeroNonInterleaved87p5PercentFill/n:1024 | 25.0 | 22.8 | -8.5% | IMPROVEMENT |
| BM_SelectZeroNonInterleaved87p5PercentFill/n:65536 | 36.7 | 34.3 | -6.7% | IMPROVEMENT |
| BM_SelectZeroNonInterleaved87p5PercentFill/n:4194304 | 43.8 | 43.9 | +0.2% | neutral |
| BM_SelectZeroNonInterleaved87p5PercentFill/n:268435456 | 58.9 | 59.3 | +0.7% | neutral |
OVERALL_GEOMEAN (compare.py): -1.24% (contender is faster overall)
Rank benchmarks (control — unchanged code): net +2.1% mean, stddev 3.6% — confirms this environment has ~2-4% noise floor for large-n benchmarks. The BM_RankZeroNonInterleaved/n:4194304 outlier (+12.4%) is noise consistent with this machine's background variance.
Hardware Counter Profile
Unavailable — kernel perf_event_paranoid setting blocks access to hardware performance counters in this CI environment. libpfm is installed but perf_event_open returns permission denied.
Key Findings

- Small-n select (n=1024, L1-cached) improves ~6% across all fill densities. Replacing `find_basicblock_is` with `find_basicblock` removes the interpolation overhead and branch misprediction cost at small sizes where the interpolation guess is not useful — the whole superblock structure fits in L1 cache and a direct linear SIMD scan is faster.
- Large-n `SelectNonInterleaved` with skewed fill (12.5%, 87.5%) at n=268435456 regresses +6.9% to +8.2%. At this size (32MB bit vector, at the L3 cache capacity boundary), memory access dominates and the interpolation search in `find_basicblock_is` provides a useful prediction: for skewed fill densities the interpolation guess is often accurate, allowing the basic block to be found in 1-2 SIMD scans instead of up to 4. Removing it forces 4 iterations every time.
- `SelectNonInterleaved` at 50% fill (n=268435456) is only +1.0% (neutral) — consistent with the noise floor. At uniform density the interpolation guess is less accurate anyway.
- `SelectZeroNonInterleaved` is unaffected as expected — the PR only touches the select(ones) code path; `find_basicblock_is_zeros` is still used.
- The `BM_SelectZeroNonInterleaved87p5PercentFill` improvements (-6.7% to -8.5%) at small n are likely noise/confounding from different CPU state during the benchmark run, since the zero-select path was not changed. The control rank benchmarks show similar-magnitude unexplained variance.
Verdict
REGRESSES PERFORMANCE
The PR introduces statistically significant regressions (+6.9% to +8.2%, p < 0.003 by Mann-Whitney U test) in SelectNonInterleaved for skewed fill densities (12.5% and 87.5% ones) at large data sizes (n=268435456, 32MB), precisely the workloads where the removed interpolation search find_basicblock_is was most beneficial. While small-n (n=1024) select improves ~6%, large working sets that exceed L2/L3 cache are the performance-critical case for succinct data structure workloads. Removing the interpolation optimization is a regression for those cases.
@nikolay please provide the performance impact of the changes

@@ -66,6 +71,19 @@ jobs:
  - name: Install Kilo CLI
    run: npm install -g @kilocode/cli
SUGGESTION: Cache global Kilo CLI install
`npm install -g @kilocode/cli` on every run adds noticeable latency; consider caching `~/.npm` (or using a local tool dir + `npm ci`) so subsequent runs reuse the tarballs and speed up the workflow.
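One possible shape for such a cache step, assuming `actions/cache` and npm's default cache directory (a sketch, not a tested change to this workflow):

```yaml
- name: Cache npm downloads
  uses: actions/cache@v4
  with:
    path: ~/.npm
    key: npm-${{ runner.os }}-kilocode-cli

- name: Install Kilo CLI
  run: npm install -g @kilocode/cli
```

With the tarball cache warm, the global install still runs but skips the network fetch, which is usually the slow part.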
@nikolay please analyse performance impact of the changes

let prefix = '';
try {
  const parent = await github.rest.issues.getComment({
SUGGESTION: Avoid extra API call on fallback
When the reply attempt fails, the fallback path fetches the parent comment only to @mention the author. If reducing runtime/API calls is a goal, consider skipping the lookup (post without the @mention) or passing the author login as an input, which removes an extra API round-trip on this path.
@Malkovsky Started Kilo run: /perf-review please analyse performance impact of the changes

@nikolay please compare performance impact

@Malkovsky Started Kilo run: /perf-review please compare performance impact

@Malkovsky (empty Kilo output)

@Malkovsky Started Kilo run: /perf-review please compare performance impact

@Malkovsky (empty Kilo output)
->RangeMultiplier(4)
->Range(8, 1ull << 34)
->Range(1ull << 10, 1ull << 30)
->Iterations(10000000)
SUGGESTION: Consider time-based iteration control for runtime
With `Range(1<<10..1<<30)` and fixed 10M iterations per size, total benchmark runtime can grow quickly. Using `->MinTime(...)` or scaling iterations by n keeps runs tractable while still capturing performance trends.
Testing AI profiling