Expose configurable frontend pipe slot_num #710

Merged
zhangstevenunity merged 5 commits into main from codex/issue-709-slot-num
May 28, 2026

Conversation

@zhangstevenunity
Collaborator

Summary: add optional slot_num to frontend aic/aiv initialize_pipe ops, forward it during lowering, relax internal pipe slot_num verification to positive values, and keep local_slot_num bounded by slot_num.

Tests: ninja -C build-main-wsl ptoas; llvm-lit related frontend/internal slot_num tests; llvm-lit existing tpush/tpop regression tests.

Fixes #709

@gemini-code-assist gemini-code-assist Bot left a comment

Code Review

This pull request introduces an optional compile-time integer attribute slot_num to the frontend pipe initialization operations (pto.aic_initialize_pipe and pto.aiv_initialize_pipe). This attribute controls the GM ring FIFO depth, which previously defaulted to fixed values of 8 or 4 depending on the communication direction (dir_mask). The changes include updating the ODS definitions, adding parsing and printing logic, enhancing verification rules to ensure slot_num is positive and local_slot_num does not exceed it, and updating lowering passes to propagate this attribute. Documentation and comprehensive LIT tests have also been added to verify correct behavior and error handling. I have no feedback to provide as there are no review comments.
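The verification rules summarized in this review can be modeled with a minimal Python sketch. This is not the actual verifier (which is C++ in the PTO dialect); the function name `verify_pipe_slots` is hypothetical. Two behaviors are assumptions: that an omitted local_slot_num falls back to slot_num (as the later Codex review describes for lowering), and that local_slot_num must itself be positive.

```python
from typing import Optional


def verify_pipe_slots(slot_num: int, local_slot_num: Optional[int] = None) -> int:
    """Hypothetical model of the slot_num checks; not the real PTO verifier.

    Returns the effective local_slot_num on success and raises ValueError
    when one of the described constraints is violated.
    """
    # Rule from the PR: slot_num only needs to be positive.
    if slot_num <= 0:
        raise ValueError(f"slot_num must be positive, got {slot_num}")

    # Assumption: an omitted local_slot_num defaults to slot_num.
    effective_local = slot_num if local_slot_num is None else local_slot_num

    # Assumption: a non-positive local_slot_num is also invalid.
    if effective_local <= 0:
        raise ValueError(f"local_slot_num must be positive, got {effective_local}")

    # Rule from the PR: local_slot_num must not exceed slot_num.
    if effective_local > slot_num:
        raise ValueError(
            f"local_slot_num ({effective_local}) exceeds slot_num ({slot_num})"
        )
    return effective_local


if __name__ == "__main__":
    for args in [(8, None), (8, 4), (4, 5), (0, None)]:
        try:
            print(args, "->", verify_pipe_slots(*args))
        except ValueError as err:
            print(args, "-> rejected:", err)
```

The sketch deliberately leaves out the dir_mask-dependent defaults of 8 or 4, since the mapping from direction to default is not spelled out in this conversation.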

@reedhecre
reedhecre commented May 27, 2026

Codex Review

This comment is updated automatically by the review bot.

  • PR: #710 Expose configurable frontend pipe slot_num
  • Author: zhangstevenunity
  • Base/Head: main / codex/issue-709-slot-num
  • Head SHA: e90932529bc3
  • Trigger: new commits pushed to the PR
  • Generated At: 2026-05-28T05:16:25Z
  • Previous Head SHA: 996ff697f605
  • Status: completed

Summary

Now that the PR exposes the frontend slot_num, there is still no cross-check against pto.reserve_buffer.size; invalid IR compiles successfully and corrupts local FIFO memory at runtime.

Findings

  1. P2 Configurable `slot_num` is not checked against the reserved consumer FIFO size (lib/PTO/IR/PTO.cpp:11760)

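From this point the frontend slot_num accepts any positive integer, and lowering defaults local_slot_num to slot_num when it is not given explicitly. However, ReserveBufferOp::verify() in the same file still only checks size > 0. It does not check that the local FIFO buffer covers slot_size * effective_local_slot_num on A2/A3, nor slot_size * effective_slot_num on A5. As a result, IR such as slot_num = 2 paired with a pto.reserve_buffer sized for only one slot still passes verification. PlanMemory reserves only the declared number of bytes, while the generated TPipe<..., 2, ...> accesses the local FIFO as two slots, so at runtime it goes out of bounds or overwrites an adjacent reserved buffer. This is a real contract mismatch newly introduced by exposing slot_num to the frontend; the verifier should reject such IR directly, and a negative test should be added.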
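The size check this finding asks for can be sketched as follows. This is an illustrative Python model only, not the ReserveBufferOp::verify() implementation; the function names, the `arch` strings, and the byte values in the usage example are hypothetical. The formulas come from the finding: slot_size * effective_local_slot_num on A2/A3 and slot_size * effective_slot_num on A5.

```python
from typing import Optional


def required_fifo_bytes(slot_size: int, slot_num: int,
                        local_slot_num: Optional[int] = None,
                        arch: str = "A3") -> int:
    """Minimum local FIFO size in bytes implied by the finding (hypothetical model)."""
    # Per the finding, an omitted local_slot_num defaults to slot_num.
    effective_local = slot_num if local_slot_num is None else local_slot_num
    if arch in ("A2", "A3"):
        return slot_size * effective_local
    if arch == "A5":
        return slot_size * slot_num
    raise ValueError(f"unknown arch: {arch}")


def check_reserve_buffer(reserved_bytes: int, slot_size: int, slot_num: int,
                         local_slot_num: Optional[int] = None,
                         arch: str = "A3") -> None:
    """Reject a reservation that is too small for the FIFO it backs."""
    needed = required_fifo_bytes(slot_size, slot_num, local_slot_num, arch)
    if reserved_bytes < needed:
        raise ValueError(
            f"reserve_buffer size {reserved_bytes} is smaller than the "
            f"{needed} bytes the local FIFO requires"
        )


if __name__ == "__main__":
    # The case from the finding: slot_num = 2 but only one slot reserved.
    try:
        check_reserve_buffer(reserved_bytes=1024, slot_size=1024, slot_num=2)
    except ValueError as err:
        print("rejected:", err)
    # Reserving both slots satisfies the check.
    check_reserve_buffer(reserved_bytes=2048, slot_size=1024, slot_num=2)
    print("accepted")
```

A negative test along these lines would pair slot_num = 2 with a one-slot reservation and expect a verifier diagnostic instead of a successful compile.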

@zhangstevenunity zhangstevenunity marked this pull request as ready for review May 27, 2026 08:41
@zhangstevenunity zhangstevenunity merged commit 239f1d1 into main May 28, 2026
14 checks passed
@reedhecre

A5 board test passed

  • Trigger: merged
  • Source commit: 239f1d1cc9ff
  • Result summary: OK 21 / FAIL 0 / SKIP 0
  • Log: /root/ptoas-board-monitor-a5/logs/20260528_142705_merged_pr710.log
  • Result TSV: /root/ptoas-board-monitor-a5/logs/20260528_142705_merged_pr710.tsv

@reedhecre

A3 board test failed

  • Trigger: merged
  • Source commit: 239f1d1cc9ff
  • Result summary: OK 217 / FAIL 2 / SKIP 1
  • Log: /home/zhongxuan/ptoas-board-monitor/runtime/logs/20260528_150905_merged_pr710.log
  • Failed stage: board-validation / exit=1

Failed test cases

  • syncall_binding (run, exit=1)
  • tprefetch_async_binding (run, exit=1)

@reedhecre

A3 board test failure details: PR #710

syncall_binding

stage=run info=exit=1

[ERROR] aclrtSynchronizeStream(stream) failed: 507014 (/home/zhongxuan/ptoas-board-monitor/runtime/runs/20260528_150905_merged_pr710/npu_validation/SyncAll/syncall_binding/main.cpp:84)
[ERROR] RecentErrMsg: EZ9999: Inner Error!
EZ9999[PID: 2719252] 2026-05-28-15:52:55.090.662 (EZ9999):  The error from device(chipId:3, dieId:1), serial number is 126, there is an exception of aicore error, core id is 15, error code = 0, dump info: pc start: 0x124e00000000, current: 0x124e00000188, vec error info: 0, mte error info: 0xc503000032, ifu error info: 0x212c1a0800300, ccu error info: 0x1cc000000000009b, cube error info: 0, biu error info: 0, aic error mask: 0x6500020bd00028c, para base: 0x12c100000000.[FUNC:PrintCoreInfo][FILE:device_error_core_proc.cc][LINE:645]
        TraceBack (most recent call last):
       The extend info: errcode:(0, 0, 0) errorStr: timeout or trap error. fixp_error0 info: 0x3000032, fixp_error1 info: 0xc5, fsmId:0, tslot:2, thread:0, ctxid:0, blk:0, sublk:0, subErrType:4.[FUNC:PrintCoreInfo][FILE:device_error_core_proc.cc][LINE:658]
       Kernel task happen error, retCode=0x25, [aicore timeout].[FUNC:PreCheckTaskErr][FILE:davinci_kernel_task.cc][LINE:1729]
       AICORE Kernel task happen error, retCode=0x25.[FUNC:GetError][FILE:stream.cc][LINE:1475]
       [AIC_INFO] after execute:args print end[FUNC:GetError][FILE:stream.cc][LINE:1475]
       [DFX_INFO]Aicore kernel execute failed, device_id=7, stream_id=46, report_stream_id=46, task_id=0, flip_num=0, fault kernel_name=_Z22syncall_binding_kernelPii, fault kernel info ext=_Z22syncall_binding_kernelPii, program id=0, hash=3129332313788381512.[FUNC:GetError][FILE:stream.cc][LINE:1475]
       rtStreamSynchronize execution failed, reason=aicore timeout[FUNC:FuncErrorReason][FILE:error_message_manage.cc][LINE:65]
       synchronize stream failed, runtime result = 507014[FUNC:ReportCallError][FILE:log_inner.cpp][LINE:148]
[2026-05-28 15:52:56] ERROR: testcase failed (exit 1): syncall_binding

tprefetch_async_binding

stage=run info=exit=1

[ERROR] aclrtSynchronizeStream(stream) failed: 507035 (/home/zhongxuan/ptoas-board-monitor/runtime/runs/20260528_150905_merged_pr710/npu_validation/TPrefetchAsync/tprefetch_async_binding/main.cpp:91)
[ERROR] RecentErrMsg: EZ9999: Inner Error!
EZ9999[PID: 3104816] 2026-05-28-15:53:36.337.814 (EZ9999):  The error from device(chipId:3, dieId:1), serial number is 127, there is an exception of aivec error, core id is 31, error code = 0, dump info: pc start: 0x124e00000000, current: 0x124e00000160, vec error info: 0x7400008068, mte error info: 0x9800000052, ifu error info: 0x212c1a0800b40, ccu error info: 0x1ce6000000000052, cube error info: 0, biu error info: 0, aic error mask: 0x6500020bd00028c, para base: 0x12c100000000.[FUNC:PrintCoreInfo][FILE:device_error_core_proc.cc][LINE:645]
        TraceBack (most recent call last):
       The extend info: errcode:(0, 0x200000000000000, 0) errorStr: The MPU address access is invalid. fixp_error0 info: 0x52, fixp_error1 info: 0x98, fsmId:1, tslot:3, thread:0, ctxid:0, blk:0, sublk:0, subErrType:4.[FUNC:PrintCoreInfo][FILE:device_error_core_proc.cc][LINE:658]
       Kernel task happen error, retCode=0x31, [vector core exception].[FUNC:PreCheckTaskErr][FILE:davinci_kernel_task.cc][LINE:1729]
       AIV Kernel happen error, retCode=0x31.[FUNC:GetError][FILE:stream.cc][LINE:1475]
       [AIC_INFO] after execute:args print end[FUNC:GetError][FILE:stream.cc][LINE:1475]
       [DFX_INFO]Aicore kernel execute failed, device_id=7, stream_id=46, report_stream_id=46, task_id=0, flip_num=0, fault kernel_name=_Z30tprefetch_async_binding_kernelPfPa, fault kernel info ext=_Z30tprefetch_async_binding_kernelPfPa, program id=0, hash=8435686547367685641.[FUNC:GetError][FILE:stream.cc][LINE:1475]
       rtStreamSynchronize execution failed, reason=vector core exception[FUNC:FuncErrorReason][FILE:error_message_manage.cc][LINE:65]
       synchronize stream failed, runtime result = 507035[FUNC:ReportCallError][FILE:log_inner.cpp][LINE:148]
[2026-05-28 15:53:37] ERROR: testcase failed (exit 1): tprefetch_async_binding

Development

Successfully merging this pull request may close these issues.

Expose configurable GM-ring slot_num on aic/aiv_initialize_pipe (currently hardcoded to 8/4)
