
feat: Implement perf event groups, scaled reads, and group snapshots #22

Open
SiyuanSun0736 wants to merge 4 commits into multikernel:main from SiyuanSun0736:perf-group

@SiyuanSun0736
Contributor

Overview

This PR introduces the ability to group multiple perf metrics (e.g., cache misses, branch misses, cycles) into a single scheduling group. This ensures that counters observing the same workload are started and stopped together, solving the issue of misaligned results from independently managed counters.

Additionally, it adds multiplex-aware read APIs and static PMU slot limit validation, and fixes several userspace codegen edge cases to stabilize snapshot data consumption.

Key Features & User-Facing Changes

1. High-Level Grouping API

  • New group field: Added a high-level group field in perf_options to easily attach members to a leader.
      var cache = attach(prog, perf_options { perf_type: perf_type_hardware, perf_config: cache_misses }, 0)
      var branch = attach(prog, perf_options { perf_type: perf_type_hardware, perf_config: branch_misses, group: cache }, 0)

  • Compatibility: The lower-level group_fd: leader.perf_fd approach is preserved for backward compatibility.

2. Multiplex-Aware Read APIs

  • read(att): Now returns scaled values by default. When PMU multiplexing occurs, the raw count is multiplied by time_enabled / time_running; without multiplexing, the result equals the raw count.
  • read_raw(att): Returns the uncorrected, raw counter values.
  • read_details(att): Returns a struct containing raw, scaled, time_enabled, and time_running—ideal for manual delta or rate calculations.
  • read_group(leader): Captures an atomic snapshot of the entire group. Returns up to 16 ID/Value pairs (where values[] are pre-scaled according to snapshot timing) and snapshot time fields.
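
The scaling rule behind read() can be sketched in C as follows. This is a minimal illustration, not the PR's generated helper: the name ks_scale_count is hypothetical, the zero-running-time behavior is an assumption, and unsigned __int128 is a GCC/Clang extension standing in for the 128-bit intermediate the PR mentions.

```c
#include <stdint.h>

/* Hypothetical helper (not the PR's generated code): estimate what a
 * multiplexed counter would have read had it been on the PMU for the
 * whole enabled window, i.e. raw * time_enabled / time_running. */
int64_t ks_scale_count(uint64_t raw, uint64_t time_enabled,
                       uint64_t time_running)
{
    /* Fast path: the event ran the whole time, so raw is already exact. */
    if (time_running == time_enabled)
        return (int64_t)raw;

    /* Assumption: an event that never ran has nothing to extrapolate. */
    if (time_running == 0)
        return 0;

    /* Slow path: widen to 128 bits so raw * time_enabled cannot overflow
     * before the division brings the value back into range. */
    unsigned __int128 wide = (unsigned __int128)raw * time_enabled;
    return (int64_t)(wide / time_running);
}
```

For example, a counter that read 1000 while running for half of its enabled time (time_enabled = 200, time_running = 100) scales to 2000.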

3. Group Lifecycle Management

  • Group Restarts: Dynamically attaching a new member to an existing active group now triggers a disable/reset/enable sequence on the whole group, ensuring counters start from zero together.
  • Cascading Detach: Detaching a group leader no longer conservatively rejects the operation. It now cascades and automatically detaches all active members.
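
A rough sketch of the disable/reset/enable sequence for a group restart, assuming it is driven through the leader's file descriptor with the standard perf ioctls. The function name ks_restart_group is hypothetical; the ioctl requests and PERF_IOC_FLAG_GROUP come from linux/perf_event.h.

```c
#include <linux/perf_event.h>
#include <sys/ioctl.h>

/* Hypothetical helper: restart a whole perf group through its leader fd.
 * PERF_IOC_FLAG_GROUP makes each ioctl apply to the leader and to every
 * member, so all counters stop, zero, and start together.
 * Returns 0 on success, -1 if any step fails (errno is set by ioctl). */
int ks_restart_group(int leader_fd)
{
    if (ioctl(leader_fd, PERF_EVENT_IOC_DISABLE, PERF_IOC_FLAG_GROUP) < 0)
        return -1;
    if (ioctl(leader_fd, PERF_EVENT_IOC_RESET, PERF_IOC_FLAG_GROUP) < 0)
        return -1;
    if (ioctl(leader_fd, PERF_EVENT_IOC_ENABLE, PERF_IOC_FLAG_GROUP) < 0)
        return -1;
    return 0;
}
```

In this sketch the caller would invoke ks_restart_group(leader_fd) right after the new member has been opened with the leader as its group_fd.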

4. Compile-Time PMU Slot Validation

  • Statically visible perf groups are now evaluated during the type-checking phase to calculate hardware PMU slot consumption.
  • Compilation fails early if the group is too large. The limit defaults to 4 (or is probed dynamically from sysfs) and can be overridden via the KERNELSCRIPT_PERF_GROUP_MAX_EVENTS environment variable.
  • perf_type_software and perf_type_tracepoint are correctly excluded from hardware PMU slot counts.
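
The slot accounting described above could look roughly like this in C. It is only a sketch: the compiler performs this check during type checking, the enum and function names here are hypothetical, sysfs probing is left out, and the parsing rules for the override are an assumption.

```c
#include <errno.h>
#include <stdlib.h>

/* Stand-ins for the perf_type_* values named in the PR description. */
enum ks_perf_type { KS_PERF_HARDWARE, KS_PERF_SOFTWARE, KS_PERF_TRACEPOINT };

#define KS_DEFAULT_GROUP_LIMIT 4

/* Resolve the group limit from an override string such as
 * getenv("KERNELSCRIPT_PERF_GROUP_MAX_EVENTS"). Anything that is not a
 * positive decimal integer falls back to the default (assumed behavior). */
int ks_group_limit(const char *override)
{
    if (override && *override) {
        char *end;
        errno = 0;
        long v = strtol(override, &end, 10);
        if (errno == 0 && *end == '\0' && v > 0 && v <= 1024)
            return (int)v;
    }
    return KS_DEFAULT_GROUP_LIMIT;
}

/* Count the events that occupy a hardware PMU slot; software and
 * tracepoint events are excluded, as the PR describes. */
int ks_hw_slots(const enum ks_perf_type *types, int n)
{
    int slots = 0;
    for (int i = 0; i < n; i++)
        if (types[i] != KS_PERF_SOFTWARE && types[i] != KS_PERF_TRACEPOINT)
            slots++;
    return slots;
}

/* A statically visible group fits if its hardware slots stay within limit. */
int ks_group_fits(const enum ks_perf_type *types, int n, int limit)
{
    return ks_hw_slots(types, n) <= limit;
}
```

A caller would pass ks_group_limit(getenv("KERNELSCRIPT_PERF_GROUP_MAX_EVENTS")) as the limit argument of ks_group_fits.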

Internal & Codegen Improvements

  • Array IR Lowering: Fixed array indexing and dereferencing in IR lowering to ensure user-space C code generates correctly when iterating over read_group() snapshot arrays (snapshot.ids[i] / snapshot.values[i]).
  • Array Initialization: Modified non-literal array initializations to "declare first, then memcpy", preventing invalid C generation from snapshot struct fields.
  • Variable Declarations: Fixed an issue where reused for loop counters and subsequent variables of the same name produced duplicate function-level C declarations.
  • Read Helpers: Added raw/details/group perf read helpers, leveraging 128-bit intermediate values for safe multiplex scaling.

Documentation & Examples

  • examples/perf_cache_miss.ks: Refactored to use the new group API. Added demonstrations of read_details() for rate calculation and read_group() for iterating through snapshot id/value pairs.
  • examples/perf_page_fault.ks: Extended to demonstrate updated perf read semantics.
  • Docs: Updated README.md, SPEC.md, and BUILTINS.md to reflect group semantics, read interfaces, and PMU slot constraints.

Test Coverage

  • Added IR and codegen assertions for both group_fd and high-level group paths.
  • Covered member-attach group restarts, ioctl generation, and cascading leader detaches.
  • Covered multiplex scaling fast/slow paths for read(), and helper generation for read_raw(), read_details(), and read_group().
  • Covered oversized static group validation during compilation.
  • Added regression tests for for loop counter variable reuse in userspace codegen.

- Introduced `group_fd` field in the perf options structure to allow
  attaching BPF programs to a group of perf events.
- Updated the `ks_open_perf_event` function to accept `group_fd` and
  handle group event management.
- Implemented helper functions for managing active members of perf
  event groups, ensuring that group leaders cannot be detached while
  active members exist.
- Enhanced the generated code to include necessary checks and
  structures for handling multiplexed perf events.
- Added tests to validate the new group management features and
  ensure correct code generation for group-related operations.
- Introduced functions to manage performance event groups, including detection of maximum events and validation of static groups.
- Added support for new performance read functions: `read_raw`, `read_details`, and `read_group`, along with their corresponding structures and handling in the code generation.
- Enhanced the type checker to validate performance event group attachments and ensure no cycles exist in group leader relationships.
- Updated userspace code generation to track usage of new performance read functions and manage group attachments.
- Added tests for new functionality, including validation of oversized static performance event groups and code generation for new read functions.
…erformance event groups; added snapshot index printing functionality; updated userspace code generation tests to verify variable reuse logic.
@congwang-mk
Contributor

Design suggestion: collapse the four read verbs into one generic read()

The grouping + multiplex-scaling work here is solid. Before it lands, though, I'd like to push back on the read API surface. This PR introduces four read verbs that all take a single PerfAttachment:

  • read(att) -> i64 (now scaled)
  • read_raw(att) -> i64
  • read_details(att) -> PerfReadDetails
  • read_group(leader) -> PerfGroupRead

Three of those are just views of the same single-fd read — read and read_raw are literally fields of what read_details already returns — and read_group exists as a second verb only because a group leader and a non-leader share the same type. That's a lot of permanent, generically-named builtins (read / read_raw / read_group collide with the universal notion of "read") for what is fundamentally "snapshot a perf counter."

Proposal: one verb, one unified result

Make read the single generic reader, dispatched on the argument's static type (the compiler already does this via builtin_return_type_for_call), returning one struct that subsumes all four behaviors:

PerfRead {
  raw: i64,            // self raw count   (== values[0] pre-scaling)
  scaled: i64,         // self multiplex-corrected count (== values[0])
  time_enabled: u64,
  time_running: u64,
  count: u32,          // members in snapshot; 1 for a non-group event
  values: [i64; N],    // per-member scaled values
  ids:    [u64; N],    // per-member perf ids
}

A non-group event simply has count == 1, values[0] == scaled. So the other three verbs disappear:

  • read_raw(x) → read(x).raw
  • read_details(x) → read(x)
  • read_group(x) → read(x) + the array fields

var c = read(cache)
print("cache scaled: %lld  raw: %lld", c.scaled, c.raw)

var snap = read(cache)            // same call; just use the group fields
var i = 0
while (i < snap.count) {          // count == 1 for a plain event
    print("id=%llu value=%lld", snap.ids[i], snap.values[i])
    i = i + 1
}
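
On the generated-C side, filling such a struct from one read(2) could look roughly like the sketch below. It assumes the event was opened with read_format = PERF_FORMAT_GROUP | PERF_FORMAT_ID | PERF_FORMAT_TOTAL_TIME_ENABLED | PERF_FORMAT_TOTAL_TIME_RUNNING, for which the kernel returns nr, time_enabled, time_running, then nr {value, id} pairs. The struct layout and the names ks_perf_read and ks_perf_decode are illustrative, not taken from the PR.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define KS_PERF_GROUP_MAX_VALUES 16 /* capacity named in the review */

/* Illustrative C mirror of the proposed PerfRead (layout is an assumption). */
struct ks_perf_read {
    int64_t  raw;
    int64_t  scaled;
    uint64_t time_enabled;
    uint64_t time_running;
    uint32_t count;
    int64_t  values[KS_PERF_GROUP_MAX_VALUES];
    uint64_t ids[KS_PERF_GROUP_MAX_VALUES];
};

/* raw * enabled / running with a 128-bit intermediate (GCC/Clang extension). */
static int64_t ks_scale(uint64_t raw, uint64_t enabled, uint64_t running)
{
    if (running == enabled)
        return (int64_t)raw;
    if (running == 0)
        return 0;
    return (int64_t)(((unsigned __int128)raw * enabled) / running);
}

/* Decode a group read buffer of `words` 64-bit words.
 * Returns 0 on success, or -1 if the buffer is short or the group would
 * not fit in the fixed-size arrays (refusing rather than truncating). */
int ks_perf_decode(const uint64_t *buf, size_t words, struct ks_perf_read *out)
{
    if (words < 3)
        return -1;
    uint64_t nr = buf[0];
    if (nr == 0 || nr > KS_PERF_GROUP_MAX_VALUES || words < 3 + 2 * nr)
        return -1;

    memset(out, 0, sizeof *out);
    out->time_enabled = buf[1];
    out->time_running = buf[2];
    out->count = (uint32_t)nr;
    for (uint64_t i = 0; i < nr; i++) {
        out->values[i] = ks_scale(buf[3 + 2 * i], buf[1], buf[2]);
        out->ids[i] = buf[3 + 2 * i + 1];
    }
    out->raw = (int64_t)buf[3];   /* first entry, unscaled */
    out->scaled = out->values[0]; /* first entry, multiplex-corrected */
    return 0;
}
```

An event that is not part of a larger group decodes with count == 1, which matches the "count == 1 for a plain event" case above.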

Why this is also a net simplification of this PR

  • Removes 3 builtins + 1 result struct. Just read + PerfRead.
  • Deletes the PERF_FORMAT_GROUP string-rewrite toggle. Since any read may expose the group, the read_format is always-on, so the Str.global_replace against whitespace-exact generated C goes away. That toggle is brittle today — if the read_format block's formatting ever changes, the replace silently no-ops and read_group misparses every snapshot.
  • Forces the capacity fix that's currently latent. Right now KS_PERF_GROUP_MAX_VALUES (16, hardcoded in the result struct) and the compile-time group limit (KERNELSCRIPT_PERF_GROUP_MAX_EVENTS / sysfs num_counters, effectively unbounded) are decoupled — a group the compiler accepts can be silently truncated at runtime, which defeats the whole "atomic snapshot" guarantee, and the static validator is best-effort (only sees perf_options struct literals in-function). Under the unified design the array rides every read, so you must reconcile them: make N the hard ceiling, clamp the detected limit to it (min(detected, N)), and reject statically-visible groups that exceed it. Truncation then only ever applies to dynamically-built groups the compiler can't see.

Trait-shaped, perf-only for now

Keep it perf-only, but structure read as a dispatch point so a future Map/RingBuffer reader becomes a new arm rather than a new verb. That touches exactly three places:

  1. validate_read_function — the allow-list of readable arg types.
  2. builtin_return_type_for_call — the arg-type → result-type table.
  3. the codegen name→helper resolution keyed on the argument's type.

The validator comment already reads "read() currently requires a PerfAttachment", so this is the direction the verb was already headed.

Cost

read() stops returning a bare i64 — the common case becomes read(att).scaled, and the -1 error sentinel moves into .scaled (so read(x) < 0 becomes read(x).scaled < 0). That, plus folding the read_group call sites into read, is the entire migration surface.

Happy to send a follow-up commit implementing this if you're on board.

Fold calls into primary expressions so chained access like read(cache).scaled parses cleanly.
Make read() return PerfRead and remove the split raw/details/group helpers.
Always request PERF_FORMAT_ID and PERF_FORMAT_GROUP, clamp static group limits to 16, and update docs, examples, and tests.
@SiyuanSun0736
Contributor Author

Description

This update consolidates all perf counter reads around a single, unified PerfRead snapshot API, addressing previous review feedback regarding API bloat and latent silent-truncation bugs.

Additionally, this PR formalizes our API philosophy regarding data retrieval (Perf vs. Maps vs. RingBuffers) to ensure clear semantic boundaries moving forward.

Key Technical Changes

  • Unified API Result: read(handle) now returns a comprehensive PerfRead struct. The common path for users becomes read(att).scaled, while .raw, timing data, and group snapshot arrays (.values, .ids) are always available on the same returned value.
  • Codegen Simplification: Userspace codegen now routes all reads through a single ks_perf_attachment_read path, entirely removing the need for separate raw/details/group helper builtins.
  • Always-On Group Formatting: Generated C code now always requests PERF_FORMAT_ID and PERF_FORMAT_GROUP. Same-time group values and IDs are inherently available from the read snapshot, allowing us to safely delete the brittle Str.global_replace toggle.
  • Strict Capacity Validation: Forces the capacity fix. Static perf group validation now explicitly clamps the effective group limit to the 16-entry PerfRead capacity. The compiler now proactively rejects oversized groups by checking both PMU slot usage and member count, preventing silent truncation at runtime.
  • Docs & Parser: Updated parser, documentation, and examples to match chained access patterns (e.g., read(cache).scaled).
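
The capacity rule can be illustrated with a small C sketch. It assumes the member count is checked against the 16-entry PerfRead capacity and the hardware slot count against min(detected, 16); the names and the verdict enum are hypothetical, and the real check lives in the compiler's static validation.

```c
#define KS_PERF_READ_CAPACITY 16 /* PerfRead array capacity from the PR */

enum ks_group_verdict {
    KS_GROUP_OK,
    KS_GROUP_TOO_MANY_MEMBERS,
    KS_GROUP_TOO_MANY_HW_SLOTS
};

/* Clamp the detected PMU limit to the snapshot capacity: min(detected, 16). */
int ks_effective_limit(int detected)
{
    return detected < KS_PERF_READ_CAPACITY ? detected : KS_PERF_READ_CAPACITY;
}

/* Validate a statically visible group.
 *   members  - every event in the group, including software/tracepoint ones,
 *              since each occupies one snapshot entry
 *   hw_slots - only the events that consume a hardware PMU slot */
enum ks_group_verdict ks_check_static_group(int members, int hw_slots,
                                            int detected_limit)
{
    if (members > KS_PERF_READ_CAPACITY)
        return KS_GROUP_TOO_MANY_MEMBERS;
    if (hw_slots > ks_effective_limit(detected_limit))
        return KS_GROUP_TOO_MANY_HW_SLOTS;
    return KS_GROUP_OK;
}
```

Checking the two numbers separately matters because a group of mostly software events can pass the PMU slot check and still overflow the fixed-size snapshot arrays.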

Architectural Note: Semantic Boundaries for read() (Pushing back on generic dispatch)

In the previous review, it was suggested that we structure read() as a polymorphic dispatch point to eventually support Maps and RingBuffers. After careful consideration of the current language syntax, this PR instead advocates separating verb semantics.

We propose strictly bounding read() to static snapshot retrievals (like Perf counters), rather than making it a universal accessor. Here is the rationale:

  1. Maps: We already have an elegant and idiomatic array-indexing syntax (map[key]). Overloading read(map, key) would introduce redundant ways to do the same thing and degrade the language's expressiveness.
  2. RingBuffers: We currently utilize a dispatch()-driven push model (Event Loop / Callbacks) for stream processing.
  3. Semantic Clarity: A RingBuffer (continuous stream of events) and a Perf Counter (a point-in-time static snapshot) are fundamentally different in control flow. Forcing them into the same read() verb conflates "polling a stream" with "taking a snapshot."

By keeping read() strictly for snapshots, [] for state lookups, and dispatch() for event streams, we maintain a clear, predictable, and highly specialized API philosophy.
