Port frontend tile fusion to EmitC mainline #704

Open
Likai-19 wants to merge 2 commits into hw-native-sys:main from Likai-19:tile_front_fusion
@Likai-19

Summary

Reintroduce frontend tile fusion on the current A5 EmitC mainline behind
--enable-op-fusion, but keep the implementation intentionally small:

  • run fusion planning and scheduling on tile-native PTO IR before
    PTOViewToMemref
  • mark fused tile ops with pto.last_use directly on scheduled block-local
    spans
  • preserve the final EmitC contract by emitting
    [[pto::last_use(... )]] CALLEE(...)
  • do not introduce or preserve a pto.fusion_region / pto.yield
    lifecycle in the shared mainline

In other words, this PR keeps the user-visible goal of "frontend op scheduling
+ final last_use emission", while removing the larger FusionRegion-based IR
contract from the implementation.

What changed

Driver and pipeline

  • add --enable-op-fusion on the current ptoas driver
  • gate it to --pto-arch=a5 with --pto-level=level2|level3
  • run the frontend fusion core on tile-native PTO IR:
    • FusionPlan
    • OpScheduling
    • PTOMarkLastUse
  • keep this pipeline before PTOViewToMemref
  • leave unsupported configurations on the ordinary unfused path with warnings
    instead of failing compilation
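
The gating rule above can be sketched as a small decision function. This is only an illustration under stated assumptions: `FusionGate`, `decideFusionGate`, and the warning strings are hypothetical names invented here, not the driver's actual code.

```cpp
#include <string>

// Hypothetical result of the gating decision (illustrative names only).
struct FusionGate {
  bool enabled;        // true when the frontend fusion pipeline should run
  std::string warning; // non-empty when the request was ignored
};

// Decide whether frontend fusion runs for the given option values.
// Unsupported configurations fall back to the unfused path with a warning
// rather than an error, matching the behavior described in the PR.
FusionGate decideFusionGate(bool enableOpFusion, const std::string &arch,
                            const std::string &level) {
  if (!enableOpFusion)
    return {false, ""};
  if (arch != "a5")
    return {false, "op fusion requires --pto-arch=a5; using unfused path"};
  if (level != "level2" && level != "level3")
    return {false,
            "op fusion requires --pto-level=level2|level3; using unfused path"};
  return {true, ""};
}
```

Returning a warning string instead of failing keeps the caller free to decide how the diagnostic is reported.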

Frontend fusion core

  • port the tile-fusion planning/scheduling support needed on the current
    mainline:
    • FusionAnalysis
    • FusionOpSemantics
    • PTOFusionPlan
    • PTOOpScheduling
  • represent accepted fusion groups as contiguous scheduled spans in a block
    rather than wrapping them in a region op

last_use implementation

  • introduce PTOMarkLastUse as the place that computes pto.last_use
  • make the analysis span-based instead of region/yield-based:
    • collect each contiguous scheduled group span from
      pto.fusion.group_id / pto.fusion.order
    • compute last-use per tile operand slot inside that span
    • block a bit if the tile value is used later in the same span
    • also block a bit if the tile value is used later in the parent block after
      the span
  • encode last_use per tile operand slot, with the following rules:
    • scalar operands do not occupy slots
    • DPS init / output tile slots are preserved but always stay 0
    • repeated SSA tile operands are evaluated independently per slot

EmitC last_use output

  • keep the final output contract as [[pto::last_use(... )]] CALLEE(...)
  • lower marked fused tile ops through a PTOAS-local marker callee path in
    PTOToEmitC
  • rewrite that marker to the final C++ attribute spelling in
    CppPostprocess
  • fix marker bit ordering so single-DPS-init tile intrinsics follow the final
    emitted operand order, which keeps the output tile slot at 0 in the final
    emitted attribute

Explicit non-goals / removed scope

  • no pto.fusion_region
  • no pto.yield
  • no PTOFusionRegionGen
  • no PTOFlattenFusionRegion
  • no shared-pass preservation contract for fusion-region lifecycle through
    PTOViewToMemref, memory planning, reserved-buffer resolution, sync
    insertion, or tile-handle materialization

Why this shape

The original larger port bundled three concerns together:

  1. frontend fusion planning/scheduling
  2. region formation / flattening
  3. final EmitC last_use emission

For the current goal, only (1) and (3) are essential. This PR keeps the
useful part of the feature and localizes the extra complexity to
PTOMarkLastUse, instead of requiring multiple existing shared passes to
understand and preserve a new region lifecycle.

Testing

Added focused tile-fusion coverage for:

  • fusion planning:
    • join
    • diamond
    • interleaved join
    • treshape boundary
    • dynamic-shape negative case
  • scheduling:
    • basic compaction
    • treshape bridge
    • pure-op bridge
    • negative region / call / SSA boundary cases
  • last_use:
    • slot-mask encoding
    • repeated SSA operands
    • post-span later-use blocking
  • end-to-end EmitC output:
    • final [[pto::last_use(... )]] emission
    • absence of residual pto.fusion_region / pto.yield
  • control surface:
    • CLI visibility / gating
    • non-fused fallback behavior
    • adapter placement in level2 and level3 shared lowering paths

Focused verification run:

  • llvm-lit -sv build/test/lit/tile_fusion


@gemini-code-assist gemini-code-assist Bot left a comment


Code Review

This pull request introduces a frontend tile-fusion optimization pipeline for the A5 EmitC mainline, adding passes for fusion planning, instruction scheduling, and last-use marking, along with supporting semantic analyses and C++ post-processing. The review feedback highlights a critical scheduling bug in PTOOpScheduling.cpp where moving only the placement operator breaks the contiguity of the fusion group. Additionally, improvements are suggested to translate a Chinese comment to English in PTOMarkLastUse.cpp, replace std::isdigit with llvm::isDigit in CppPostprocess.cpp to prevent potential undefined behavior, and simplify a redundant ArrayRef conversion in FusionAnalysis.cpp.

!canMoveLaterAcross(placement, blockingOp))
break;

placement->moveAfter(blockingOp);

high

Moving only placement later via placement->moveAfter(blockingOp) leaves the previously scheduled members of the group behind, which breaks the contiguity of the fusion group. To maintain contiguity, all previously scheduled members of the group must be moved together with placement, or the scheduling logic should be revised to avoid breaking contiguity.
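
One way to picture the first option is to move the whole scheduled run as a unit. The sketch below models a block as a vector of op names and is not taken from `PTOOpScheduling.cpp`; `moveSpanLater` is a hypothetical helper, and a real pass would also have to check that every member of the run may legally cross the blocking op.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Move the contiguous run block[first..last] (inclusive) one position later by
// hoisting the op at index last + 1 in front of the run. The members of the
// run stay adjacent and keep their relative order.
std::vector<std::string> moveSpanLater(std::vector<std::string> block,
                                       std::size_t first, std::size_t last) {
  assert(first <= last && last + 1 < block.size());
  std::rotate(block.begin() + first, block.begin() + last + 1,
              block.begin() + last + 2);
  return block;
}
```

For example, with the block `load, g0, g1, blocker, tail` and the run `g0, g1`, the result is `load, blocker, g0, g1, tail`, so the group is still contiguous after the move.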

lastUseMask.push_back(0);
continue;
}
// isSpanLocalLastUseCandidate的检查范围大于hasLaterUseAfterSpan

medium

Please translate this Chinese comment to English to maintain consistency and readability for international developers.

Suggested change
- // isSpanLocalLastUseCandidate的检查范围大于hasLaterUseAfterSpan
+ // The check scope of isSpanLocalLastUseCandidate is larger than hasLaterUseAfterSpan

: encoded.slice(pos, next);
if (token.empty())
return false;
if (!llvm::all_of(token, [](char c) { return std::isdigit(c); }))

medium

Using std::isdigit with a char argument can lead to undefined behavior if the character is signed and has a negative value. It is safer and more idiomatic in LLVM/MLIR to use llvm::isDigit.

Suggested change
- if (!llvm::all_of(token, [](char c) { return std::isdigit(c); }))
+ if (!llvm::all_of(token, [](char c) { return llvm::isDigit(c); }))
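
For readers without the LLVM headers at hand, the same idea can be shown in standalone C++. This is a sketch only: `isAsciiDigit` and `isAllDigits` are hypothetical helpers, with `isAsciiDigit` standing in for `llvm::isDigit` as a plain range comparison that is well defined for every `char` value, including negative ones.

```cpp
#include <string>

// Locale-independent ASCII digit test. Unlike std::isdigit, a range comparison
// has no undefined behavior when `c` is a negative signed char.
constexpr bool isAsciiDigit(char c) { return c >= '0' && c <= '9'; }

// True if `token` is non-empty and consists only of ASCII digits.
bool isAllDigits(const std::string &token) {
  if (token.empty())
    return false;
  for (char c : token)
    if (!isAsciiDigit(c))
      return false;
  return true;
}
```

If `std::isdigit` must be kept, the usual alternative is to cast the argument with `static_cast<unsigned char>(c)` before the call.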

if (info.vRow == ShapedType::kDynamic || info.vCol == ShapedType::kDynamic)
info.unprovenReason = IterationDomainUnprovenReason::DynamicShape;

for (Value value : ArrayRef<Value>(anchorValues).drop_front()) {

medium

The explicit conversion to ArrayRef<Value> is redundant because anchorValues is already an ArrayRef<Value>. You can simplify this by calling drop_front() directly on anchorValues.

Suggested change
- for (Value value : ArrayRef<Value>(anchorValues).drop_front()) {
+ for (Value value : anchorValues.drop_front()) {

@reedhecre

reedhecre commented May 26, 2026

Codex Review

This comment is updated automatically by the review bot.

  • PR: #704 Port frontend tile fusion to EmitC mainline
  • Author: Likai-19
  • Base/Head: main / tile_front_fusion
  • Head SHA: b74c0783b44e
  • Trigger: new open PR detected
  • Generated At: 2026-05-26T13:01:18Z
  • Status: completed

Summary

Found 3 issues: pure-op bridge scheduling does not take effect, non-contiguous fusion groups are not rejected, and FusionPlan can plan unschedulable groups across side-effecting hard boundaries.

Findings

  1. P1: Pure non-fusion ops are classified as hard barriers, so `op_scheduling_pure_op_bridge` cannot pass (lib/PTO/Transforms/TileFusion/PTOOpScheduling.cpp:85)

classifySchedulingBarrier returns as soon as getFusionOpSemantics() succeeds. For memory-effect-free bridge ops like arith.constant and arith.index_cast, getFusionOpSemantics() still succeeds with FusionOpKind::HardBoundary, so the later isMemoryEffectFree() fallback is never reached. That makes canMoveEarlierAcross/canMoveLaterAcross refuse to cross those pure ops, so the planned group in test/lit/tile_fusion/op_scheduling_pure_op_bridge.pto stays split instead of becoming the contiguous span the test checks for.
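
The ordering problem can be illustrated with a toy classifier. Everything below is a hypothetical sketch: `ToyOp`, `SchedulingBarrier`, and the `Compute` enumerator are invented for illustration (only `FusionOpKind::HardBoundary` appears in the review), and the real function's signature and return type are not shown in this PR page. The sketch lets the memory-effect-free fallback run even when semantics lookup succeeds with `HardBoundary`.

```cpp
// Illustrative kinds; only HardBoundary is named in the review.
enum class FusionOpKind { Compute, HardBoundary };
enum class SchedulingBarrier { None, Hard };

struct ToyOp {
  bool hasFusionSemantics; // stands in for getFusionOpSemantics() succeeding
  FusionOpKind kind;       // meaningful only when hasFusionSemantics is true
  bool memoryEffectFree;   // stands in for isMemoryEffectFree(op)
};

SchedulingBarrier classifySchedulingBarrier(const ToyOp &op) {
  // Fusible ops are never barriers.
  if (op.hasFusionSemantics && op.kind != FusionOpKind::HardBoundary)
    return SchedulingBarrier::None;
  // Do not return early on HardBoundary: pure bridge ops such as
  // arith.constant should still be crossable.
  if (op.memoryEffectFree)
    return SchedulingBarrier::None;
  return SchedulingBarrier::Hard;
}
```

Whether every memory-effect-free `HardBoundary` op should be crossable is a design decision for the pass; the sketch only shows one ordering that reaches the fallback.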

  2. P1: Split fusion groups are silently accepted after scheduling, so the new negative tests still exit 0 (lib/PTO/Transforms/TileFusion/PTOMarkLastUse.cpp:133)

Once a group has started, collectGroupSpansInBlock simply continues across every non-fusion op. If pto-op-scheduling cannot compact a group across a call, region op, or other unmovable gap, pto-mark-last-use still treats the separated members as one span and the pipeline succeeds. That means test/lit/tile_fusion/op_scheduling_negative_call_boundary.pto, ...negative_region.pto, and ...negative_ssa.pto will not produce the expected failure, and downstream EmitC sees last-use metadata for a group that never became contiguous.
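
A stricter span collector could reject any group that resumes after a gap. The sketch below is an assumption-laden model, not the code in `PTOMarkLastUse.cpp`: ops are reduced to a vector of group IDs (with -1 for non-fusion ops), `pto.fusion.order` is ignored, and `Span`, `collectContiguousSpans`, and `spanCount` are hypothetical names.

```cpp
#include <cstddef>
#include <set>
#include <vector>

struct Span {
  int groupId;
  std::size_t begin; // index of the first member
  std::size_t end;   // one past the last member
};

// groupIds[i] is the fusion group of op i, or -1 for a non-fusion op.
// Returns false, leaving `spans` empty, if any group appears in more than one
// run, i.e. scheduling failed to make it contiguous.
bool collectContiguousSpans(const std::vector<int> &groupIds,
                            std::vector<Span> &spans) {
  spans.clear();
  std::set<int> closed; // groups whose span has already ended
  std::size_t i = 0;
  while (i < groupIds.size()) {
    int id = groupIds[i];
    if (id < 0) {
      ++i;
      continue;
    }
    if (closed.count(id)) {
      spans.clear();
      return false; // the group resumes after a gap
    }
    std::size_t begin = i;
    while (i < groupIds.size() && groupIds[i] == id)
      ++i;
    spans.push_back({id, begin, i});
    closed.insert(id);
  }
  return true;
}

// Convenience wrapper: number of spans, or -1 when a split group is found.
int spanCount(const std::vector<int> &groupIds) {
  std::vector<Span> spans;
  if (!collectContiguousSpans(groupIds, spans))
    return -1;
  return static_cast<int>(spans.size());
}
```

With this check, an input such as `{0, -1, 0}` is rejected instead of being treated as one span, which is the failure the negative tests expect.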

  3. P2: FusionPlan can fuse across side-effecting PTO barriers that OpScheduling will never cross (lib/PTO/Transforms/TileFusion/PTOFusionPlan.cpp:85)

hasHardBoundaryBetween only treats terminators, region ops, and CallOpInterface as hard boundaries. Side-effecting PTO ops such as pto.tstore, barriers, and sync ops are therefore invisible to the planner, even though pto-op-scheduling classifies them as hard barriers and refuses to move across them. A producer/consumer pair separated by one of those ops can still receive the same pto.fusion.group_id, leaving an irreparable split group that is only tolerated later instead of being rejected.
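
One possible direction is for the planner and the scheduler to share a single boundary predicate. The following sketch uses a toy op model with boolean flags; `ToyOp`, its fields, and `isHardBoundary` are hypothetical, and only the name `hasHardBoundaryBetween` comes from the review.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

struct ToyOp {
  bool isTerminator;
  bool hasRegions;
  bool isCall;
  bool isSideEffectingBarrier; // e.g. a store, barrier, or sync op
};

// Single predicate intended to be used by both planning and scheduling.
bool isHardBoundary(const ToyOp &op) {
  return op.isTerminator || op.hasRegions || op.isCall ||
         op.isSideEffectingBarrier;
}

// True if any op strictly between `producer` and `consumer` is a hard boundary.
bool hasHardBoundaryBetween(const std::vector<ToyOp> &block,
                            std::size_t producer, std::size_t consumer) {
  assert(producer < consumer && consumer < block.size());
  for (std::size_t i = producer + 1; i < consumer; ++i)
    if (isHardBoundary(block[i]))
      return true;
  return false;
}
```

With a shared predicate, a producer/consumer pair separated by a side-effecting op is never given the same group ID, so the scheduler is not asked to close a gap it cannot cross.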


3 participants