Skip to content

test#34

Merged
Malkovsky merged 8 commits intomainfrom
test/ai_profiling
Apr 18, 2026
Merged

test#34
Malkovsky merged 8 commits intomainfrom
test/ai_profiling

Conversation

@Malkovsky
Copy link
Copy Markdown
Owner

Testing AI profiling

Comment thread include/pixie/bitvector.h Outdated
rank -= super_block_rank[s_block];
auto pos = find_basicblock_is(rank, s_block);
// auto pos = find_basicblock_is(rank, s_block);
auto pos = find_basicblock(rank, s_block);
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

SUGGESTION: Potential select performance regression

Switching from the interpolation search (find_basicblock_is) to the linear scan (find_basicblock) can add extra SIMD scans on random data and reduce select throughput. If the prior change was for correctness, consider restoring the interpolation path with a guard for edge cases (e.g., zero-width range) to keep the faster path when valid.

@kilo-code-bot
Copy link
Copy Markdown
Contributor

kilo-code-bot Bot commented Mar 30, 2026

Code Review Summary

Status: 4 Issues Found | Recommendation: Address before merge

Overview

Severity Count
CRITICAL 0
WARNING 0
SUGGESTION 4

Fix these issues in Kilo Cloud

Issue Details (click to expand)

SUGGESTION

File Line Issue
.github/workflows/kilo-dispatch.yml 72 Cache global Kilo CLI install to reduce workflow time
.github/workflows/kilo-dispatch.yml 162 Avoid extra API call on fallback
include/pixie/bitvector.h 525 Potential select performance regression from switching to linear scan
src/benchmarks/benchmarks.cpp 299 Consider time-based iteration control for runtime
Files Reviewed (4 files)
  • include/pixie/bitvector.h - 0 issues
  • scripts/draw_bp_representation.py - 0 issues
  • scripts/draw_bp_tree.py - 0 issues
  • src/benchmarks/benchmarks.cpp - 1 issue

Reviewed by gpt-5.2-codex-20260114 · 1,358,861 tokens

@codecov-commenter
Copy link
Copy Markdown

codecov-commenter commented Mar 30, 2026

⚠️ Please install the 'codecov app svg image' to ensure uploads and comments are reliably processed by Codecov.

Codecov Report

❌ Patch coverage is 82.35294% with 51 lines in your changes missing coverage. Please review.
✅ Project coverage is 86.47%. Comparing base (7db5cd7) to head (63cf583).

Files with missing lines Patch % Lines
include/pixie/bitvector.h 82.35% 33 Missing and 18 partials ⚠️
❗ Your organization needs to install the Codecov GitHub app to enable full functionality.
Additional details and impacted files
@@            Coverage Diff             @@
##             main      #34      +/-   ##
==========================================
+ Coverage   86.17%   86.47%   +0.29%     
==========================================
  Files          11       11              
  Lines        2850     2868      +18     
  Branches      561      557       -4     
==========================================
+ Hits         2456     2480      +24     
+ Misses        257      253       -4     
+ Partials      137      135       -2     
Flag Coverage Δ
gcov 86.47% <82.35%> (+0.29%) ⬆️

Flags with carried forward coverage won't be shown. Click here to find out more.

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

@Malkovsky
Copy link
Copy Markdown
Owner Author

@Kilo please perform profiling review

@Malkovsky
Copy link
Copy Markdown
Owner Author

@Kilo please perform profiling review on this pr

@Malkovsky
Copy link
Copy Markdown
Owner Author

https://github.com/Kilo please perform profiling review on this PR

@Malkovsky
Copy link
Copy Markdown
Owner Author

@Kilo please perform profiling performance review on this PR

@kilo-code-bot
Copy link
Copy Markdown
Contributor

kilo-code-bot Bot commented Mar 30, 2026

@Malkovsky Profiling/performance review for this PR:\n\n- Source vs target change: now calls instead of in (around line 525).\n- Performance impact (expected): regression in throughput/latency (especially for larger superblocks), since this replaces the interpolation-search style locator with a more linear/basic-block scan path.\n- Most likely cause: increased per-query block traversal and branch work in the hot path.\n\nI do not see benchmark CI output attached to this PR, so there is no measured delta table to quote here. Based on the algorithmic path change alone, this is likely a real degradation rather than noise.\n\nSuggested benchmark focus for confirmation: on vs ; include repetitions and min-time to reduce variance.

@kilo-code-bot
Copy link
Copy Markdown
Contributor

kilo-code-bot Bot commented Mar 30, 2026

@Malkovsky Profiling/performance review for this PR:

  • Source vs target change: select() now calls find_basicblock(...) instead of find_basicblock_is(...) in include/pixie/bitvector.h (around line 525).
  • Performance impact (expected): regression in select throughput/latency (especially for larger superblocks), since this replaces the interpolation-search style locator with a more linear/basic-block scan path.
  • Most likely cause: increased per-query block traversal and branch work in the hot select path.

I do not see benchmark CI output attached to this PR, so there is no measured delta table to quote here. Based on the algorithmic path change alone, this is likely a real degradation rather than noise.

Suggested benchmark focus for confirmation:
benchmarks --benchmark_filter="BM_Select" --benchmark_repetitions=10 --benchmark_min_time=0.5s
on main (bcea19e) vs test/ai_profiling (a858d3f).

@Malkovsky
Copy link
Copy Markdown
Owner Author

@Kilo please profile the changes of the PR

@kilo-code-bot
Copy link
Copy Markdown
Contributor

kilo-code-bot Bot commented Mar 30, 2026

@Malkovsky Performance profiling review for PR #34:

  • I attempted to build and run benchmark comparisons between base bcea19e and head a858d3f, but this runner does not have a C++ build toolchain (cmake, g++, make, ninja are unavailable), so no runtime benchmark numbers could be produced locally.
  • Code diff analysis shows a hot-path change in BitVector::select: find_basicblock_is(rank, s_block) was replaced with find_basicblock(rank, s_block).

Expected performance impact (based on code path change):

  • select/select1 likely regresses modestly due to switching from the specialized interleaved-scan variant to the generic basic-block scan.
  • rank and unrelated structures should remain unchanged.

Most likely cause of degradation:

  • The removed find_basicblock_is path appears to be a specialized fast path for locating the basic block inside select; replacing it with the generic search increases instruction count and/or reduces locality in that inner step.

No measurable benchmark delta can be claimed until this PR is run on a benchmark-capable runner.

@github-actions
Copy link
Copy Markdown

github-actions Bot commented Apr 4, 2026

pong

@nikolay-reviewer
Copy link
Copy Markdown

Kilo Dispatch Result

Prompt

Test prompt. Respond with "pong"

Run Log

7.1.20
kilo run does not support --timeout; relying on job timeout-minutes.
Performing one time database migration, may take a few minutes...
sqlite-migration:done
Database migration complete.
�[0m
> code · anthropic/claude-sonnet-4.6
�[0m
pong

@nikolay-reviewer
Copy link
Copy Markdown

Performing one time database migration, may take a few minutes...
sqlite-migration:done
Database migration complete.
�[0m

code · anthropic/claude-sonnet-4.6
�[0m
pong

@nikolay-reviewer
Copy link
Copy Markdown

pong

@nikolay-reviewer
Copy link
Copy Markdown

at getModel (src/provider/provider.ts:1222:13)

@nikolay-reviewer
Copy link
Copy Markdown

pong

1 similar comment
@nikolay-reviewer
Copy link
Copy Markdown

pong

@Malkovsky Malkovsky force-pushed the test/ai_profiling branch from a858d3f to 237ea15 Compare April 5, 2026 21:25
@Malkovsky
Copy link
Copy Markdown
Owner Author

@nikolay please review performance impact of the changes

@nikolay-reviewer
Copy link
Copy Markdown

@Malkovsky (empty Kilo output)

@Malkovsky
Copy link
Copy Markdown
Owner Author

@nikolay please analyse performance impact

@Malkovsky
Copy link
Copy Markdown
Owner Author

@nikolay please investigate performance impact

@Malkovsky
Copy link
Copy Markdown
Owner Author

@nikolay please investigate performance impact

@Malkovsky
Copy link
Copy Markdown
Owner Author

@nikolay please check performance impact of the changes

@nikolay-reviewer
Copy link
Copy Markdown

@Malkovsky (empty Kilo output)

@Malkovsky
Copy link
Copy Markdown
Owner Author

@nikolay please check the performance impact of the changes

@nikolay-reviewer
Copy link
Copy Markdown

@Malkovsky Now I have all the data to produce the final report:


Performance Review: test/ai_profiling vs origin/main

Configuration

  • Baseline: origin/main (7db5cd7)
  • Contender: test/ai_profiling (3012865)
  • Platform: Linux x86-64, 4× 3262 MHz, L1D 32KiB, L2 512KiB, L3 32MiB
  • Benchmarks run: benchmarks (BitVector select/rank)
  • Filter: BM_Select.*/n:(1024|65536|4194304|268435456)/ — 4 representative sizes across the cache hierarchy
  • Hardware counters: Unavailable (perf event access denied in this environment; libpfm linked but kernel perf_event_paranoid blocks access)
  • Build type: Release (-O3)

Code Change

The PR replaces one function call in include/pixie/bitvector.h:524:

- auto pos = find_basicblock_is(rank, s_block);
+ // auto pos = find_basicblock_is(rank, s_block);
+ auto pos = find_basicblock(rank, s_block);

find_basicblock_is is an interpolation-search variant that guesses the target basic block position using linear interpolation before falling back to SIMD linear scan (find_basicblock). The PR replaces it with the plain SIMD linear scan for SelectNonInterleaved (select for ones). The analogous function for select-zero (find_basicblock_is_zeros) is not changed.


Timing Summary

Binary Improvements (>5%) Regressions (>5%) Neutral (±5%) Net Change
benchmarks (select) 5 2 17 -1.2%
benchmarks (rank, control) 0 0 28 +2.1% noise

Detailed Timing Results — benchmarks (Select)

Select benchmarks (filter: BM_Select.*/n:(1024|65536|4194304|268435456)/, 5 repetitions, --benchmark_min_time=0.5s)

Benchmark Base (ns) New (ns) Delta Class
BM_SelectNonInterleaved/n:1024 40.5 38.0 -6.0% IMPROVEMENT
BM_SelectNonInterleaved/n:65536 43.3 42.8 -1.3% neutral
BM_SelectNonInterleaved/n:4194304 49.2 48.8 -0.9% neutral
BM_SelectNonInterleaved/n:268435456 104.2 105.3 +1.0% neutral
BM_SelectNonInterleaved12p5PercentFill/n:1024 40.0 37.6 -6.0% IMPROVEMENT
BM_SelectNonInterleaved12p5PercentFill/n:65536 43.0 42.7 -0.7% neutral
BM_SelectNonInterleaved12p5PercentFill/n:4194304 49.0 48.6 -0.8% neutral
BM_SelectNonInterleaved12p5PercentFill/n:268435456 98.8 106.9 +8.2% REGRESSION
BM_SelectNonInterleaved87p5PercentFill/n:1024 40.0 37.6 -6.0% IMPROVEMENT
BM_SelectNonInterleaved87p5PercentFill/n:65536 42.9 42.7 -0.4% neutral
BM_SelectNonInterleaved87p5PercentFill/n:4194304 49.2 48.8 -0.9% neutral
BM_SelectNonInterleaved87p5PercentFill/n:268435456 101.5 108.6 +6.9% REGRESSION
BM_SelectZeroNonInterleaved/n:1024 41.2 40.7 -1.3% neutral
BM_SelectZeroNonInterleaved/n:65536 45.2 44.9 -0.7% neutral
BM_SelectZeroNonInterleaved/n:4194304 52.5 51.8 -1.2% neutral
BM_SelectZeroNonInterleaved/n:268435456 109.2 108.5 -0.6% neutral
BM_SelectZeroNonInterleaved12p5PercentFill/n:1024 41.1 40.5 -1.5% neutral
BM_SelectZeroNonInterleaved12p5PercentFill/n:65536 45.0 44.9 -0.1% neutral
BM_SelectZeroNonInterleaved12p5PercentFill/n:4194304 52.4 51.9 -1.1% neutral
BM_SelectZeroNonInterleaved12p5PercentFill/n:268435456 110.5 110.2 -0.3% neutral
BM_SelectZeroNonInterleaved87p5PercentFill/n:1024 25.0 22.8 -8.5% IMPROVEMENT
BM_SelectZeroNonInterleaved87p5PercentFill/n:65536 36.7 34.3 -6.7% IMPROVEMENT
BM_SelectZeroNonInterleaved87p5PercentFill/n:4194304 43.8 43.9 +0.2% neutral
BM_SelectZeroNonInterleaved87p5PercentFill/n:268435456 58.9 59.3 +0.7% neutral

OVERALL_GEOMEAN (compare.py): -1.24% (contender is faster overall)

Rank benchmarks (control — unchanged code): net +2.1% mean, stddev 3.6% — confirms this environment has ~2-4% noise floor for large-n benchmarks. The BM_RankZeroNonInterleaved/n:4194304 outlier (+12.4%) is noise consistent with this machine's background variance.


Hardware Counter Profile

Unavailable — kernel perf_event_paranoid setting blocks access to hardware performance counters in this CI environment. libpfm is installed but perf_event_open returns permission denied.


Key Findings

  1. Small-n select (n=1024, L1-cached) improves ~6% across all fill densities. Replacing find_basicblock_is with find_basicblock removes the interpolation overhead and branch misprediction cost at small sizes where the interpolation guess is not useful — the whole superblock structure fits in L1 cache and a direct linear SIMD scan is faster.

  2. Large-n SelectNonInterleaved with skewed fill (12.5%, 87.5%) at n=268435456 regresses +6.9%–+8.2%. At this size (32MB bit vector, at the L3 cache capacity boundary), memory access dominates and the interpolation search in find_basicblock_is provides a useful prediction: for skewed fill densities the interpolation guess is often accurate, allowing the basic block to be found in 1-2 SIMD scans instead of up to 4. Removing it forces 4 iterations every time.

  3. SelectNonInterleaved at 50% fill (n=268435456) is only +1.0% (neutral) — consistent with the noise floor. At uniform density the interpolation guess is less accurate anyway.

  4. SelectZeroNonInterleaved is unaffected as expected — the PR only touches the select (ones) code path; find_basicblock_is_zeros is still used.

  5. The BM_SelectZeroNonInterleaved87p5PercentFill improvements (-6.7% to -8.5%) at small n are likely noise/confounding from different CPU state during the benchmark run, since the zero-select path was not changed. The control rank benchmarks show similar-magnitude unexplained variance.


Verdict

REGRESSES PERFORMANCE

The PR introduces statistically significant regressions (+6.9% to +8.2%, p < 0.003 by Mann-Whitney U test) in SelectNonInterleaved for skewed fill densities (12.5% and 87.5% ones) at large data sizes (n=268435456, 32MB), precisely the workloads where the removed interpolation search find_basicblock_is was most beneficial. While small-n (n=1024) select improves ~6%, large working sets that exceed L2/L3 cache are the performance-critical case for succinct data structure workloads. Removing the interpolation optimization is a regression for those cases.

@Malkovsky
Copy link
Copy Markdown
Owner Author

@nikolay please provide the performance impact of the changes

@@ -66,6 +71,19 @@ jobs:
- name: Install Kilo CLI
run: npm install -g @kilocode/cli
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

SUGGESTION: Cache global Kilo CLI install

npm install -g @kilocode/cli on every run adds noticeable latency; consider caching ~/.npm (or using a local tool dir + npm ci) so subsequent runs reuse the tarballs and speed up the workflow.

@Malkovsky
Copy link
Copy Markdown
Owner Author

@nikolay please analyse performance impact of the changes


let prefix = '';
try {
const parent = await github.rest.issues.getComment({
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

SUGGESTION: Avoid extra API call on fallback

When the reply attempt fails, the fallback path fetches the parent comment only to @mention the author. If reducing runtime/API calls is a goal, consider skipping the lookup (post without the @mention) or passing the author login as an input, which removes an extra API round-trip on this path.

@nikolay-reviewer
Copy link
Copy Markdown

@Malkovsky Started Kilo run: /perf-review please analyse performance impact of the changes
Run: https://github.com/Malkovsky/pixie/actions/runs/24310898408

@Malkovsky
Copy link
Copy Markdown
Owner Author

@nikolay please compare performance impact

@nikolay-reviewer
Copy link
Copy Markdown

@Malkovsky Started Kilo run: /perf-review please compare performance impact
Run: https://github.com/Malkovsky/pixie/actions/runs/24313338661

@nikolay-reviewer
Copy link
Copy Markdown

@Malkovsky (empty Kilo output)

@nikolay-reviewer
Copy link
Copy Markdown

@Malkovsky Started Kilo run: /perf-review please compare performance impact
Run: https://github.com/Malkovsky/pixie/actions/runs/24313338661

@nikolay-reviewer
Copy link
Copy Markdown

@Malkovsky (empty Kilo output)

->RangeMultiplier(4)
->Range(8, 1ull << 34)
->Range(1ull << 10, 1ull << 30)
->Iterations(10000000)
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

SUGGESTION: Consider time-based iteration control for runtime

With Range(1<<10..1<<30) and fixed 10M iterations per size, total benchmark runtime can grow quickly. Using ->MinTime(...) or scaling iterations by n keeps runs tractable while still capturing performance trends.

@Malkovsky Malkovsky merged commit 1d5ea82 into main Apr 18, 2026
5 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants