feat:Implement perf event groups, scaled reads, and group snapshots by SiyuanSun0736 · Pull Request #22 · multikernel/kernelscript

SiyuanSun0736 · 2026-05-19T07:59:25Z

Overview

This PR introduces the ability to group multiple perf metrics (e.g., cache misses, branch misses, cycles) into a single scheduling group. This ensures that counters observing the same workload are started and stopped together, solving the issue of misaligned results from independently managed counters.

Additionally, it adds multiplex-aware read APIs and static validation of PMU slot limits, and fixes several userspace codegen edge cases so that snapshot data can be consumed reliably.

Key Features & User-Facing Changes

1. High-Level Grouping API

New group field: Added a high-level group field in perf_options to easily attach members to a leader.

var cache = attach(prog, perf_options { perf_type: perf_type_hardware, perf_config: cache_misses }, 0)
var branch = attach(prog, perf_options { perf_type: perf_type_hardware, perf_config: branch_misses, group: cache }, 0)

Compatibility: The lower-level group_fd: leader.perf_fd approach is preserved for backward compatibility.

2. Multiplex-Aware Read APIs

read(att): Now returns scaled values by default. When PMU multiplexing occurs, the raw count is multiplied by time_enabled / time_running; without multiplexing the result equals the raw count.
read_raw(att): Returns the uncorrected, raw counter values.
read_details(att): Returns a struct containing raw, scaled, time_enabled, and time_running—ideal for manual delta or rate calculations.
read_group(leader): Captures an atomic snapshot of the entire group. Returns up to 16 ID/Value pairs (where values[] are pre-scaled according to snapshot timing) and snapshot time fields.
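For readers unfamiliar with multiplex scaling, here is a minimal sketch of the correction that read() is described as applying. The helper name `ks_scale_count`, the zero result when the counter never ran, and the saturation at INT64_MAX are assumptions made for illustration, not the PR's generated code; `unsigned __int128` is a GCC/Clang extension.

```c
#include <assert.h>
#include <stdint.h>

/* The 128-bit intermediate is what keeps raw * time_enabled from overflowing. */
static_assert(sizeof(unsigned __int128) == 16, "128-bit intermediates required");

/* Hypothetical helper: estimate the full-interval count of a multiplexed counter. */
int64_t ks_scale_count(uint64_t raw, uint64_t time_enabled, uint64_t time_running)
{
    /* Fast path: the counter ran for the whole enabled window, nothing to correct. */
    if (time_running == time_enabled)
        return (int64_t)raw;

    /* The counter was never scheduled, so there is no basis for an estimate. */
    if (time_running == 0)
        return 0;

    /* Slow path: scaled = raw * time_enabled / time_running, computed in 128 bits. */
    unsigned __int128 scaled = (unsigned __int128)raw * time_enabled / time_running;

    /* Saturate rather than wrap if the estimate does not fit in an i64. */
    return scaled > (unsigned __int128)INT64_MAX ? INT64_MAX : (int64_t)scaled;
}
```

A counter that was scheduled for half of its enabled window reports roughly twice its raw count, which is why read() and read_raw() only diverge under multiplexing.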

3. Group Lifecycle Management

Group Restarts: Dynamically attaching a new member to an existing active group now triggers a disable/reset/enable sequence on the whole group, ensuring counters start from zero together.
Cascading Detach: Detaching a group leader no longer conservatively rejects the operation. It now cascades and automatically detaches all active members.

4. Compile-Time PMU Slot Validation

Statically visible perf groups are now evaluated during the type-checking phase to calculate hardware PMU slot consumption.
Compilation fails early if the group is too large. The limit defaults to 4, or is probed dynamically from sysfs, and can be overridden via the KERNELSCRIPT_PERF_GROUP_MAX_EVENTS environment variable.
perf_type_software and perf_type_tracepoint are correctly excluded from hardware PMU slot counts.
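The slot check described above can be sketched roughly as follows. This is an illustration under stated assumptions: the enum, the function names, and the rule that an unparsable override falls back to the default are all hypothetical, and sysfs probing is left out. A caller would pass `getenv("KERNELSCRIPT_PERF_GROUP_MAX_EVENTS")` as `env_value`.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical mirror of the perf_type values named in the description. */
enum ks_perf_type {
    KS_PERF_TYPE_HARDWARE,
    KS_PERF_TYPE_SOFTWARE,
    KS_PERF_TYPE_TRACEPOINT
};

/* Count members that occupy a hardware PMU slot; software and
 * tracepoint events are excluded, as the description states. */
size_t ks_group_pmu_slots(const enum ks_perf_type *members, size_t n)
{
    assert(members != NULL || n == 0);
    size_t slots = 0;
    for (size_t i = 0; i < n; i++) {
        if (members[i] != KS_PERF_TYPE_SOFTWARE &&
            members[i] != KS_PERF_TYPE_TRACEPOINT)
            slots++;
    }
    return slots;
}

/* Resolve the limit: a positive decimal override wins, otherwise the
 * fallback (4 in the description) is used. */
size_t ks_group_limit(const char *env_value, size_t fallback)
{
    if (env_value == NULL || *env_value == '\0')
        return fallback;
    char *end = NULL;
    unsigned long parsed = strtoul(env_value, &end, 10);
    if (*end != '\0' || parsed == 0)
        return fallback;
    return (size_t)parsed;
}

/* Nonzero when the group's hardware slot usage is within the limit. */
int ks_group_fits(const enum ks_perf_type *members, size_t n, size_t limit)
{
    return ks_group_pmu_slots(members, n) <= limit;
}
```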

Internal & Codegen Improvements

Array IR Lowering: Fixed array indexing and dereferencing in IR lowering to ensure user-space C code generates correctly when iterating over read_group() snapshot arrays (snapshot.ids[i] / snapshot.values[i]).
Array Initialization: Modified non-literal array initializations to "declare first, then memcpy", preventing invalid C generation from snapshot struct fields.
Variable Declarations: Fixed an issue where reused for loop counters and subsequent variables of the same name produced duplicate function-level C declarations.
Read Helpers: Added raw/details/group perf read helpers, leveraging 128-bit intermediate values for safe multiplex scaling.

Documentation & Examples

examples/perf_cache_miss.ks: Refactored to use the new group API. Added demonstrations of read_details() for rate calculation and read_group() for iterating through snapshot id/value pairs.
examples/perf_page_fault.ks: Extended to demonstrate updated perf read semantics.
Docs: Updated README.md, SPEC.md, and BUILTINS.md to reflect group semantics, read interfaces, and PMU slot constraints.

Test Coverage

Added IR and codegen assertions for both group_fd and high-level group paths.
Covered member-attach group restarts, ioctl generation, and cascading leader detaches.
Covered multiplex scaling fast/slow paths for read(), and helper generation for read_raw(), read_details(), and read_group().
Covered oversized static group validation during compilation.
Added regression tests for for loop counter variable reuse in userspace codegen.

- Introduced `group_fd` field in the perf options structure to allow attaching BPF programs to a group of perf events.
- Updated the `ks_open_perf_event` function to accept `group_fd` and handle group event management.
- Implemented helper functions for managing active members of perf event groups, ensuring that group leaders cannot be detached while active members exist.
- Enhanced the generated code to include necessary checks and structures for handling multiplexed perf events.
- Added tests to validate the new group management features and ensure correct code generation for group-related operations.

- Introduced functions to manage performance event groups, including detection of maximum events and validation of static groups.
- Added support for new performance read functions: `read_raw`, `read_details`, and `read_group`, along with their corresponding structures and handling in the code generation.
- Enhanced the type checker to validate performance event group attachments and ensure no cycles exist in group leader relationships.
- Updated userspace code generation to track usage of new performance read functions and manage group attachments.
- Added tests for new functionality, including validation of oversized static performance event groups and code generation for new read functions.

congwang-mk · 2026-05-22T00:47:39Z

Design suggestion: collapse the four read verbs into one generic `read()`

The grouping + multiplex-scaling work here is solid. Before it lands, though, I'd like to push back on the read API surface. This PR introduces four read verbs that all take a single PerfAttachment:

read(att) -> i64 (now scaled)
read_raw(att) -> i64
read_details(att) -> PerfReadDetails
read_group(leader) -> PerfGroupRead

Three of those are just views of the same single-fd read — read and read_raw are literally fields of what read_details already returns — and read_group exists as a second verb only because a group leader and a non-leader share the same type. That's a lot of permanent, generically-named builtins (read / read_raw / read_group collide with the universal notion of "read") for what is fundamentally "snapshot a perf counter."

Proposal: one verb, one unified result

Make read the single generic reader, dispatched on the argument's static type (the compiler already does this via builtin_return_type_for_call), returning one struct that subsumes all four behaviors:

PerfRead {
  raw: i64,            // self raw count   (== values[0] pre-scaling)
  scaled: i64,         // self multiplex-corrected count (== values[0])
  time_enabled: u64,
  time_running: u64,
  count: u32,          // members in snapshot; 1 for a non-group event
  values: [i64; N],    // per-member scaled values
  ids:    [u64; N],    // per-member perf ids
}

A non-group event simply has count == 1, values[0] == scaled. So the other three verbs disappear:

read_raw(x) → read(x).raw
read_details(x) → read(x)
read_group(x) → read(x) + the array fields

var c = read(cache)
print("cache scaled: %lld  raw: %lld", c.scaled, c.raw)

var snap = read(cache)            // same call; just use the group fields
var i = 0
while (i < snap.count) {          // count == 1 for a plain event
    print("id=%llu value=%lld", snap.ids[i], snap.values[i])
    i = i + 1
}

Why this is also a net simplification of this PR

Removes 3 builtins + 1 result struct. Just read + PerfRead.
Deletes the PERF_FORMAT_GROUP string-rewrite toggle. Since any read may expose the group, the read_format is always-on, so the Str.global_replace against whitespace-exact generated C goes away. That toggle is brittle today — if the read_format block's formatting ever changes, the replace silently no-ops and read_group misparses every snapshot.
Forces the capacity fix that's currently latent. Right now KS_PERF_GROUP_MAX_VALUES (16, hardcoded in the result struct) and the compile-time group limit (KERNELSCRIPT_PERF_GROUP_MAX_EVENTS / sysfs num_counters, effectively unbounded) are decoupled — a group the compiler accepts can be silently truncated at runtime, which defeats the whole "atomic snapshot" guarantee, and the static validator is best-effort (only sees perf_options struct literals in-function). Under the unified design the array rides every read, so you must reconcile them: make N the hard ceiling, clamp the detected limit to it (min(detected, N)), and reject statically-visible groups that exceed it. Truncation then only ever applies to dynamically-built groups the compiler can't see.
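To make the reconciliation concrete, here is a small sketch of the clamp described in the last point. The function names are hypothetical, and treating "group size" as a plain member count is an assumption; only the 16-entry capacity and the min(detected, N) rule come from the comment above.

```c
#include <assert.h>
#include <stddef.h>

/* Capacity of the result struct's values[]/ids[] arrays (N in the proposal). */
#define KS_PERF_GROUP_MAX_VALUES 16

static_assert(KS_PERF_GROUP_MAX_VALUES == 16, "must match the result array length");

/* Effective limit = min(detected, N): the detected or overridden limit
 * can never exceed what a single snapshot is able to carry. */
size_t ks_effective_group_limit(size_t detected)
{
    return detected < KS_PERF_GROUP_MAX_VALUES ? detected
                                               : (size_t)KS_PERF_GROUP_MAX_VALUES;
}

/* Nonzero when a statically visible group should be accepted. */
int ks_static_group_accepted(size_t members, size_t detected)
{
    return members <= ks_effective_group_limit(detected);
}
```

With this rule a detected limit of 64 still rejects a 17-member static group, so runtime truncation can only affect groups the compiler cannot see.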

Trait-shaped, perf-only for now

Keep it perf-only, but structure read as a dispatch point so a future Map/RingBuffer reader becomes a new arm rather than a new verb — three single places:

validate_read_function — the allow-list of readable arg types.
builtin_return_type_for_call — the arg-type → result-type table.
the codegen name→helper resolution keyed on the argument's type.

The validator comment already reads "read() currently requires a PerfAttachment", so this is the direction the verb was already headed.

Cost

read() stops returning a bare i64 — the common case becomes read(att).scaled, and the -1 error sentinel moves into .scaled (so read(x) < 0 becomes read(x).scaled < 0). That, plus folding the read_group call sites into read, is the entire migration surface.

Happy to send a follow-up commit implementing this if you're on board.

Fold calls into primary expressions so chained access like read(cache).scaled parses cleanly. Make read() return PerfRead and remove the split raw/details/group helpers. Always request PERF_FORMAT_ID and PERF_FORMAT_GROUP, clamp static group limits to 16, and update docs, examples, and tests.

SiyuanSun0736 · 2026-05-22T06:19:47Z

Description

This update consolidates all perf counter reads around a single, unified PerfRead snapshot API, addressing previous review feedback regarding API bloat and latent silent-truncation bugs.

Additionally, this PR formalizes our API philosophy regarding data retrieval (Perf vs. Maps vs. RingBuffers) to ensure clear semantic boundaries moving forward.

Key Technical Changes

Unified API Result: read(handle) now returns a comprehensive PerfRead struct. The common path for users becomes read(att).scaled, while .raw, timing data, and group snapshot arrays (.values, .ids) are always available on the same returned value.
Codegen Simplification: Userspace codegen now routes all reads through a single ks_perf_attachment_read path, entirely removing the need for separate raw/details/group helper builtins.
Always-On Group Formatting: Generated C code now always requests PERF_FORMAT_ID and PERF_FORMAT_GROUP. Same-time group values and IDs are inherently available from the read snapshot, allowing us to safely delete the brittle Str.global_replace toggle.
Strict Capacity Validation: Applies the capacity fix raised in review. Static perf group validation now clamps the effective group limit to the 16-entry PerfRead capacity, and the compiler rejects oversized groups by checking both PMU slot usage and member count, which prevents silent truncation at runtime for statically visible groups.
Docs & Parser: Updated parser, documentation, and examples to match chained access patterns (e.g., read(cache).scaled).
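As a rough illustration of what the single read path has to do, the sketch below decodes the buffer the kernel returns when PERF_FORMAT_GROUP and PERF_FORMAT_ID are requested together with PERF_FORMAT_TOTAL_TIME_ENABLED and PERF_FORMAT_TOTAL_TIME_RUNNING. That the generated code also requests the two timing flags is an assumption inferred from the timing fields on PerfRead. The struct and function names are hypothetical and are not the PR's `ks_perf_attachment_read`.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define KS_PERF_GROUP_MAX_VALUES 16

/* Hypothetical C mirror of the PerfRead struct from the review proposal. */
struct ks_perf_read {
    int64_t  raw;           /* leader raw count */
    int64_t  scaled;        /* leader multiplex-corrected count */
    uint64_t time_enabled;
    uint64_t time_running;
    uint32_t count;         /* members captured in this snapshot */
    int64_t  values[KS_PERF_GROUP_MAX_VALUES];
    uint64_t ids[KS_PERF_GROUP_MAX_VALUES];
};

/* raw * enabled / running with a 128-bit intermediate (GCC/Clang extension). */
static int64_t scale_count(uint64_t raw, uint64_t enabled, uint64_t running)
{
    if (running == enabled)
        return (int64_t)raw;
    if (running == 0)
        return 0;
    return (int64_t)((unsigned __int128)raw * enabled / running);
}

/* Kernel layout in u64 words: nr, time_enabled, time_running, then nr
 * pairs of { value, id }. Returns 0 on success, -1 on a short or empty buffer. */
int ks_parse_group_read(const uint64_t *buf, size_t words, struct ks_perf_read *out)
{
    assert(buf != NULL && out != NULL);
    if (words < 3)
        return -1;
    uint64_t nr = buf[0];
    if (nr == 0 || nr > (words - 3) / 2)
        return -1;

    *out = (struct ks_perf_read){0};
    out->time_enabled = buf[1];
    out->time_running = buf[2];
    /* Entries beyond the struct capacity are dropped in this sketch. */
    out->count = nr > KS_PERF_GROUP_MAX_VALUES ? KS_PERF_GROUP_MAX_VALUES : (uint32_t)nr;

    for (uint32_t i = 0; i < out->count; i++) {
        out->values[i] = scale_count(buf[3 + 2 * i], out->time_enabled, out->time_running);
        out->ids[i] = buf[4 + 2 * i];
    }
    out->raw = (int64_t)buf[3];
    out->scaled = out->values[0];
    return 0;
}
```

A non-group event goes through the same path and simply yields count == 1, which is what lets read(att).scaled and the group arrays share one result type.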

Architectural Note: Semantic Boundaries for read() (Pushing back on generic dispatch)

In the previous review, it was suggested that we structure read() as a polymorphic dispatch point to eventually support Maps and RingBuffers. After careful consideration of the current language syntax, this PR instead advocates separating verb semantics.

We propose strictly bounding read() to static snapshot retrievals (like Perf counters), rather than making it a universal accessor. Here is the rationale:

Maps: We already have an elegant and idiomatic array-indexing syntax (map[key]). Overloading read(map, key) would introduce redundant ways to do the same thing and degrade the language's expressiveness.
RingBuffers: We currently utilize a dispatch()-driven push model (Event Loop / Callbacks) for stream processing.
Semantic Clarity: A RingBuffer (continuous stream of events) and a Perf Counter (a point-in-time static snapshot) are fundamentally different in control flow. Forcing them into the same read() verb conflates "polling a stream" with "taking a snapshot."

By keeping read() strictly for snapshots, [] for state lookups, and dispatch() for event streams, we maintain a clear, predictable, and highly specialized API philosophy.

SiyuanSun0736 added 3 commits May 19, 2026 07:00

Enhanced the output of cache miss counts and branch miss counts for p…

68bfab7

…erformance event groups; added snapshot index printing functionality; updated userspace code generation tests to verify variable reuse logic.

SiyuanSun0736 wants to merge 4 commits into multikernel:main from SiyuanSun0736:perf-group
