Port frontend tile fusion to EmitC mainline by Likai-19 · Pull Request #704 · hw-native-sys/PTOAS

Likai-19 · 2026-05-26T11:47:38Z

Summary

Reintroduce frontend tile fusion on the current A5 EmitC mainline behind
--enable-op-fusion, but keep the implementation intentionally small:

- run fusion planning and scheduling on tile-native PTO IR before
  PTOViewToMemref
- mark fused tile ops with pto.last_use directly on scheduled block-local
  spans
- preserve the final EmitC contract by emitting
  [[pto::last_use(... )]] CALLEE(...)
- do not introduce or preserve a pto.fusion_region / pto.yield
  lifecycle in the shared mainline

In other words, this PR keeps the user-visible goal of "frontend op scheduling
+ final last_use emission", while removing the larger FusionRegion-based IR
contract from the implementation.

What changed

Driver and pipeline

- add --enable-op-fusion on the current ptoas driver
- gate it to --pto-arch=a5 with --pto-level=level2|level3
- run the frontend fusion core on tile-native PTO IR:
  - FusionPlan
  - OpScheduling
  - PTOMarkLastUse
- keep this pipeline before PTOViewToMemref
- leave unsupported configurations on the ordinary unfused path with warnings
  instead of failing compilation

Frontend fusion core

- port the tile-fusion planning/scheduling support needed on the current
  mainline:
  - FusionAnalysis
  - FusionOpSemantics
  - PTOFusionPlan
  - PTOOpScheduling
- represent accepted fusion groups as contiguous scheduled spans in a block
  rather than wrapping them in a region op

`last_use` implementation

- introduce PTOMarkLastUse as the place that computes pto.last_use
- make the analysis span-based instead of region/yield-based:
  - collect each contiguous scheduled group span from
    pto.fusion.group_id / pto.fusion.order
  - compute last-use per tile operand slot inside that span
  - do not set a slot's bit if the tile value is used later in the same span
  - also do not set it if the tile value is used later in the parent block
    after the span
- encode last_use per tile operand slot, with the following rules:
  - scalar operands do not occupy slots
  - DPS init / output tile slots are preserved but always stay 0
  - repeated SSA tile operands are evaluated independently per slot
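
The slot rules above can be illustrated with a small standalone sketch. It is not the PTOMarkLastUse code: `Operand`, `Op`, `usedAtOrAfter`, and `computeLastUseMask` are hypothetical stand-ins for MLIR values and ops, and the sketch assumes that "used later" means "appears as an operand of any later op in the same block", which covers both the rest of the span and the parent block after it.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical stand-ins for MLIR values and ops; not the PTOAS data model.
struct Operand {
  int value;       // SSA value id
  bool isTile;     // scalar operands do not occupy a slot
  bool isDpsInit;  // DPS init / output tile slot
};

struct Op {
  std::vector<Operand> operands;
};

// True if `value` is an operand of any op in block[from, block.size()).
// Scanning to the end of the block covers later uses inside the span and
// later uses in the parent block after the span.
inline bool usedAtOrAfter(const std::vector<Op> &block, std::size_t from,
                          int value) {
  for (std::size_t i = from; i < block.size(); ++i)
    for (const Operand &operand : block[i].operands)
      if (operand.value == value)
        return true;
  return false;
}

// Returns one entry per tile operand slot of block[index], in operand order.
inline std::vector<int> computeLastUseMask(const std::vector<Op> &block,
                                           std::size_t index) {
  assert(index < block.size());
  std::vector<int> mask;
  for (const Operand &operand : block[index].operands) {
    if (!operand.isTile)
      continue;  // scalars take no slot
    if (operand.isDpsInit) {
      mask.push_back(0);  // output slot is kept but always 0
      continue;
    }
    // Each slot is evaluated on its own, so a value repeated in two slots
    // gets the same answer in both.
    mask.push_back(usedAtOrAfter(block, index + 1, operand.value) ? 0 : 1);
  }
  return mask;
}
```

For an op with operands (tile %1, tile %2, init %3), where %1 is read again later in the block and %2 is not, this sketch yields the mask 0, 1, 0.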

EmitC `last_use` output

- keep the final output contract as [[pto::last_use(... )]] CALLEE(...)
- lower marked fused tile ops through a PTOAS-local marker callee path in
  PTOToEmitC
- rewrite that marker to the final C++ attribute spelling in
  CppPostprocess
- fix marker bit ordering so single-DPS-init tile intrinsics follow the final
  emitted operand order, which keeps the output tile slot at 0 in the final
  emitted attribute

Explicit non-goals / removed scope

- no pto.fusion_region
- no pto.yield
- no PTOFusionRegionGen
- no PTOFlattenFusionRegion
- no shared-pass preservation contract for fusion-region lifecycle through
  PTOViewToMemref, memory planning, reserved-buffer resolution, sync
  insertion, or tile-handle materialization

Why this shape

The original larger port bundled three concerns together:

1. frontend fusion planning/scheduling
2. region formation / flattening
3. final EmitC last_use emission

For the current goal, only (1) and (3) are essential. This PR keeps the
useful part of the feature and localizes the extra complexity to
PTOMarkLastUse, instead of requiring multiple existing shared passes to
understand and preserve a new region lifecycle.

Testing

Added focused tile-fusion coverage for:

- fusion planning:
  - join
  - diamond
  - interleaved join
  - treshape boundary
  - dynamic-shape negative case
- scheduling:
  - basic compaction
  - treshape bridge
  - pure-op bridge
  - negative region / call / SSA boundary cases
- last_use:
  - slot-mask encoding
  - repeated SSA operands
  - post-span later-use blocking
- end-to-end EmitC output:
  - final [[pto::last_use(... )]] emission
  - absence of residual pto.fusion_region / pto.yield
- control surface:
  - CLI visibility / gating
  - non-fused fallback behavior
  - adapter placement in level2 and level3 shared lowering paths

Focused verification run:

llvm-lit -sv build/test/lit/tile_fusion

gemini-code-assist

Code Review

This pull request introduces a frontend tile-fusion optimization pipeline for the A5 EmitC mainline, adding passes for fusion planning, instruction scheduling, and last-use marking, along with supporting semantic analyses and C++ post-processing. The review feedback highlights a critical scheduling bug in PTOOpScheduling.cpp where moving only the placement operator breaks the contiguity of the fusion group. Additionally, improvements are suggested to translate a Chinese comment to English in PTOMarkLastUse.cpp, replace std::isdigit with llvm::isDigit in CppPostprocess.cpp to prevent potential undefined behavior, and simplify a redundant ArrayRef conversion in FusionAnalysis.cpp.

gemini-code-assist · 2026-05-26T11:50:18Z

+          !canMoveLaterAcross(placement, blockingOp))
+        break;
+
+      placement->moveAfter(blockingOp);


Moving only placement later via placement->moveAfter(blockingOp) leaves the previously scheduled members of the group behind, which breaks the contiguity of the fusion group. To maintain contiguity, all previously scheduled members of the group must be moved together with placement, or the scheduling logic should be revised to avoid breaking contiguity.
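
As a rough illustration of the "move all previously scheduled members together" option, the sketch below models a block as a vector of op names and shifts a whole contiguous span past a blocking op. `moveSpanAfter` is a hypothetical helper, not part of PTOOpScheduling.cpp, and it assumes the caller has already checked that every member of the span may legally move across the ops it jumps over.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Moves the contiguous span block[first, last) so that it sits immediately
// after block[blocker], where blocker >= last. The relative order inside the
// span, and of the ops it jumps over, is preserved.
// Assumption: legality of moving each span member across those ops has
// already been established by the caller.
inline void moveSpanAfter(std::vector<std::string> &block, std::size_t first,
                          std::size_t last, std::size_t blocker) {
  assert(first <= last && last <= blocker && blocker < block.size());
  std::rotate(block.begin() + first, block.begin() + last,
              block.begin() + blocker + 1);
}
```

With the block `g0, g1, x, g2` and the scheduled span `g0, g1`, calling `moveSpanAfter(block, 0, 2, 2)` gives `x, g0, g1, g2`, so the group stays contiguous. Moving only `g1` would give `g0, x, g1, g2`, which is the split the comment describes.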

gemini-code-assist · 2026-05-26T11:50:18Z

+        lastUseMask.push_back(0);
+        continue;
+      }
+      // isSpanLocalLastUseCandidate的检查范围大于hasLaterUseAfterSpan


Please translate this Chinese comment to English to maintain consistency and readability for international developers.

Suggested change

- // isSpanLocalLastUseCandidate的检查范围大于hasLaterUseAfterSpan
+ // The check scope of isSpanLocalLastUseCandidate is larger than hasLaterUseAfterSpan

gemini-code-assist · 2026-05-26T11:50:19Z

+                                      : encoded.slice(pos, next);
+    if (token.empty())
+      return false;
+    if (!llvm::all_of(token, [](char c) { return std::isdigit(c); }))


Using std::isdigit with a char argument can lead to undefined behavior if the character is signed and has a negative value. It is safer and more idiomatic in LLVM/MLIR to use llvm::isDigit.

Suggested change

- if (!llvm::all_of(token, [](char c) { return std::isdigit(c); }))
+ if (!llvm::all_of(token, [](char c) { return llvm::isDigit(c); }))
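
For readers without an LLVM checkout, here is a minimal standalone sketch of the same idea. `isAsciiDigit` is a hand-written stand-in for `llvm::isDigit` (a plain range check on `'0'..'9'`), and `isDigitViaCctype` shows the other common remedy, casting to `unsigned char` before calling `std::isdigit`. The names `isDecimalToken` and `checkDecimalTokenExamples` are illustrative only and do not come from CppPostprocess.cpp.

```cpp
#include <algorithm>
#include <cassert>
#include <cctype>
#include <string_view>

// Locale-independent range check; stand-in for llvm::isDigit.
inline bool isAsciiDigit(char c) { return c >= '0' && c <= '9'; }

// Keeps std::isdigit but converts to unsigned char first, so a negative
// char value is never passed to the <cctype> function.
inline bool isDigitViaCctype(char c) {
  return std::isdigit(static_cast<unsigned char>(c)) != 0;
}

// True if the token is non-empty and consists only of ASCII digits.
inline bool isDecimalToken(std::string_view token) {
  return !token.empty() &&
         std::all_of(token.begin(), token.end(), isAsciiDigit);
}

// Small usage example.
inline void checkDecimalTokenExamples() {
  assert(isDecimalToken("0102"));
  assert(!isDecimalToken(""));
  assert(!isDecimalToken("1a"));
  assert(!isDecimalToken("\xE9"));  // byte outside ASCII, negative if char is signed
}
```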

gemini-code-assist · 2026-05-26T11:50:19Z

+  if (info.vRow == ShapedType::kDynamic || info.vCol == ShapedType::kDynamic)
+    info.unprovenReason = IterationDomainUnprovenReason::DynamicShape;
+
+  for (Value value : ArrayRef<Value>(anchorValues).drop_front()) {


The explicit conversion to ArrayRef<Value> is redundant because anchorValues is already an ArrayRef<Value>. You can simplify this by calling drop_front() directly on anchorValues.

Suggested change

- for (Value value : ArrayRef<Value>(anchorValues).drop_front()) {
+ for (Value value : anchorValues.drop_front()) {

reedhecre · 2026-05-26T12:50:27Z

Codex Review

This comment is updated automatically by the review bot.

PR: #704 Port frontend tile fusion to EmitC mainline
Author: Likai-19
Base/Head: main / tile_front_fusion
Head SHA: b74c0783b44e
Trigger: new open PR detected
Generated At: 2026-05-26T13:01:18Z
Status: completed

Summary

Found 3 issues: pure-op bridge scheduling does not take effect, non-contiguous fusion groups are not rejected, and FusionPlan can plan unschedulable groups across side-effecting hard boundaries.

Findings

P1 Pure non-fusion ops are classified as hard barriers, so `op_scheduling_pure_op_bridge` cannot pass (lib/PTO/Transforms/TileFusion/PTOOpScheduling.cpp:85)

classifySchedulingBarrier returns as soon as getFusionOpSemantics() succeeds. For memory-effect-free bridge ops like arith.constant and arith.index_cast, getFusionOpSemantics() still succeeds with FusionOpKind::HardBoundary, so the later isMemoryEffectFree() fallback is never reached. That makes canMoveEarlierAcross/canMoveLaterAcross refuse to cross those pure ops, so the planned group in test/lit/tile_fusion/op_scheduling_pure_op_bridge.pto stays split instead of becoming the contiguous span the test checks for.
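
The ordering problem can be reproduced in a few lines. The sketch below is a simplified model, not the code in PTOOpScheduling.cpp: `OpInfo`, `Barrier`, the two-value `FusionOpKind`, and both classifier functions are hypothetical, and the "purity first" variant is only one possible way to let the memory-effect check take part in the decision.

```cpp
#include <cassert>
#include <optional>

enum class FusionOpKind { Compute, HardBoundary };
enum class Barrier { None, Hard };

// Hypothetical summary of what the scheduler knows about one op.
struct OpInfo {
  std::optional<FusionOpKind> semantics;  // result of a semantics lookup
  bool memoryEffectFree;                  // e.g. arith.constant
};

// Order described in the finding: any successful semantics lookup returns
// immediately, so the purity fallback is unreachable for HardBoundary ops.
inline Barrier classifyEarlyReturn(const OpInfo &op) {
  if (op.semantics)
    return *op.semantics == FusionOpKind::HardBoundary ? Barrier::Hard
                                                       : Barrier::None;
  return op.memoryEffectFree ? Barrier::None : Barrier::Hard;
}

// One possible reordering: a HardBoundary result falls through to the
// purity check instead of ending the classification.
inline Barrier classifyPurityFirst(const OpInfo &op) {
  if (op.semantics && *op.semantics != FusionOpKind::HardBoundary)
    return Barrier::None;
  return op.memoryEffectFree ? Barrier::None : Barrier::Hard;
}

// Small usage example: a pure bridge op reported as HardBoundary.
inline void checkPureBridgeExample() {
  OpInfo pureBridge{FusionOpKind::HardBoundary, true};
  assert(classifyEarlyReturn(pureBridge) == Barrier::Hard);
  assert(classifyPurityFirst(pureBridge) == Barrier::None);
}
```

In this model a side-effecting op reported as `HardBoundary` is still a hard barrier under both orderings; only the pure bridge case changes.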

P1 Split fusion groups are silently accepted after scheduling, so the new negative tests still exit 0 (lib/PTO/Transforms/TileFusion/PTOMarkLastUse.cpp:133)

Once a group has started, collectGroupSpansInBlock simply continues across every non-fusion op. If pto-op-scheduling cannot compact a group across a call, region op, or other unmovable gap, pto-mark-last-use still treats the separated members as one span and the pipeline succeeds. That means test/lit/tile_fusion/op_scheduling_negative_call_boundary.pto, ...negative_region.pto, and ...negative_ssa.pto will not produce the expected failure, and downstream EmitC sees last-use metadata for a group that never became contiguous.
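
A contiguity check of the kind this finding asks for might look like the sketch below. It models a block as a sequence of optional group ids (`std::nullopt` for a non-fusion op) and refuses to build spans when a group resumes after a gap. `Span` and `collectContiguousSpans` are hypothetical names, and the real pass would read pto.fusion.group_id from ops and report a diagnostic rather than return an empty optional.

```cpp
#include <cassert>
#include <cstddef>
#include <optional>
#include <set>
#include <vector>

// Half-open range [begin, end) of block positions belonging to one group.
struct Span {
  int groupId;
  std::size_t begin;
  std::size_t end;
};

// Returns the spans in block order, or std::nullopt if some group id shows
// up again after its span was already closed by a different op.
inline std::optional<std::vector<Span>> collectContiguousSpans(
    const std::vector<std::optional<int>> &block) {
  std::vector<Span> spans;
  std::set<int> closedGroups;
  std::size_t i = 0;
  while (i < block.size()) {
    if (!block[i]) {
      ++i;  // non-fusion op outside any span
      continue;
    }
    const int id = *block[i];
    if (closedGroups.count(id))
      return std::nullopt;  // group was split: reject instead of merging
    std::size_t end = i;
    while (end < block.size() && block[end] && *block[end] == id)
      ++end;
    assert(end > i);
    spans.push_back({id, i, end});
    closedGroups.insert(id);
    i = end;
  }
  return spans;
}
```

For the sequence `0, 0, (none), 1, 1` this returns two spans, while `0, (none), 0` is rejected, which is the behavior the negative tests expect from a split group.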

P2 FusionPlan can fuse across side-effecting PTO barriers that OpScheduling will never cross (lib/PTO/Transforms/TileFusion/PTOFusionPlan.cpp:85)

hasHardBoundaryBetween only treats terminators, region ops, and CallOpInterface as hard boundaries. Side-effecting PTO ops such as pto.tstore, barriers, and sync ops are therefore invisible to the planner, even though pto-op-scheduling classifies them as hard barriers and refuses to move across them. A producer/consumer pair separated by one of those ops can still receive the same pto.fusion.group_id, leaving an irreparable split group that is only tolerated later instead of being rejected.

Zhendong404 and others added 2 commits May 26, 2026 19:40

feature(tile fusion): support tile op scheduling and marking last_use

fffafff

enhance robust of fusion pass

b74c078

gemini-code-assist Bot reviewed May 26, 2026

Likai-19 wants to merge 2 commits into hw-native-sys:main from Likai-19:tile_front_fusion

3 participants
