Skip to content

support vpto backend#699

Open
mouliangyu wants to merge 266 commits into
hw-native-sys:mainfrom
mouliangyu:feature-vpto-backend-merge
Open

support vpto backend#699
mouliangyu wants to merge 266 commits into
hw-native-sys:mainfrom
mouliangyu:feature-vpto-backend-merge

Conversation

@mouliangyu
Copy link
Copy Markdown

@mouliangyu mouliangyu commented May 25, 2026

Summary

This PR adds the VPTO backend to PTOAS. The new backend is selected with --pto-backend=vpto; the default backend remains emitc.

Main additions:

  • VPTO IR/types/ops and VPTO-specific verification/lowering support.
  • VPTO backend pipeline in tools/ptoas, including VPTO module normalization, TileLang template expansion, VPTO pointer normalization, wrapper-op expansion, LLVM lowering, host stub emission, and fatobj emission.
  • TileLang DSL package, TileOp template libraries, and related lit/ST coverage.
  • CMake/CTest integration for TileLang DSL tests and install support for ptoas.

Impact on non-VPTO code paths

The default ptoas path still uses the emitc backend. VPTO split/normalize, TileLang expand, VPTO pointer cleanup, VPTO LLVM emission, and fatobj emission are only entered through the VPTO backend branch or explicitly registered passes.

This PR does change shared PTO IR and shared passes used by non-VPTO flows:

  • !pto.ptr now carries a PTO memory space. The parser/printer accepts forms such as !pto.ptr<T, gm> and !pto.ptr<T, ub>, with GM as the default. getPTOAddressSpaceAttr(Type) now reads the memory space from PtrType, and PTOViewToMemref preserves that memory space when lowering ptr types to memrefs.
  • Shared parser/printer behavior changes for several PTO ops. set_flag, wait_flag, get_buf, and rls_buf now use custom parsers/printers. get_buf/rls_buf also accept pipe attrs in addition to the previous sync op type attrs. castptr, get_tensor_view_stride, and mem_bar are added as shared IR surface.
  • Several tile ops gain a default precision_mode attr: tcolexpanddiv, tdiv, tdivs, texp, tlog, trecip, trowexpanddiv, trsqrt, and tsqrt. The verifier requires HIGH_PRECISION to be used only with f16/f32 element types.
  • Shared verifiers are tightened for existing ops: castptr validates supported cast forms and memory-space consistency; A5 tload/tstore validate vec layouts; shift-like tile ops require src0/src1/dst element types to match; scalar tile ops such as tadds/tmuls/tsubs require valid rows to match on A5; tpart* ops validate source valid shapes against destination valid shapes; tsels validates row-major layout for mask/src/dst on A5.
  • Shared lowering changes include PTOViewToMemref preserving ptr memory spaces, inserting pto.bind_tile anchors for original tile_buf function arguments, lowering get_tensor_view_stride, and preserving attrs when rebuilding selected tile ops. PTOPlanMemory handles null alias inputs defensively. PTOToEmitC maps the extended ReluPreMode values ScalarRelu, VectorRelu, and Pwl.

Validation

Local validation run on this branch:

  • ninja -C build-ci-local ptoas ptobc install
  • ninja -C build-ci-local check-pto (416/416)
  • sample CI (OK=263 FAIL=0 SKIP=16)
  • docker/test_ptoas_cli.sh
  • TileLang DSL unit tests (312 tests)
  • VPTO SIM validation (281/281)
  • TileLang ST simulator smoke (85/85)

docker/collect_ptoas_dist.sh was checked locally and failed because the script only collects LLVM libraries from the CI LLVM workspace layout; the local LLVM installation is under the actions-runner tool cache.

Zhendong404 and others added 30 commits May 25, 2026 11:13
    Introduce a three-pass pipeline that lowers PTO tile ops to vector-level
    implementations via TileLang DSL templates:

    - ExpandTileOp: invokes TileLang Python DSL to instantiate template
      functions and replaces tile ops with func.call. SpecKey covers all
      operands; tile_buf operands are passed through without bridging.
    - PTOInlineLibCall: extended to recognize tilelang instance functions via
      the  attribute set by the DSL frontend.
    - FoldTileBufIntrinsics: resolves pto.tile_buf_addr / tile_valid_rows /
      tile_valid_cols, including dynamic valid-shape via pto.bind_tile chain
      tracing.
    - MemrefToTileBuf: recovers tile_buf types from memref + bind_tile
      metadata after PlanMemory/InsertSync.
    - PTOViewToMemref: insert pto.bind_tile anchors for tile_buf function
      args so MemrefToTileBuf can recover them.

    Adds new PTO ops (tile_buf_addr/tile_valid_rows/tile_valid_cols),
    ptoas pipeline wiring, design docs, and unit tests.
mouliangyu and others added 20 commits May 25, 2026 11:14
Relocate FileCheck-based VPTO .pto tests from test/vpto to test/lit/vpto so they are discovered by the lit framework.

Wrap auto-vecscope tests in a vector kernel submodule to match the current VPTO container form.
* add tcolargmax/min tileops lib implementation

* fix tcolargmax/min implementations and tile_buf declaration

* add init texp high precision implementation

* add texp high precision implementation

* remove exp high precision code in math.py(already in custom.py)

* add empty lines in math.py

* fix texp high precision test case's eps

* fix texp test case

* move exp hp from custom to exp_hp.py
* Rename VPTO MTE semantic ops

Input: existing VPTO memory-transfer semantic ops and tests using dma/cube/acc_store-style names.

Output: public semantic ops use mte_<src>_<dst> names, with ptr addrspace aliases for the new memory-space spelling.

Key steps: update ODS/parser/lowering/expand/ptr-normalize paths; migrate lit, VPTO, and TileLang test inputs; refresh VPTO/ISA docs and legacy raw-op wrapper mappings.

---------

Co-authored-by: mouliangyu <mouliangyu@huawei.com>
* Add OP for TMrgSort

* Add OP for TMrgSort | fix review

* Add constraints to tmov template for UB2UB ND2ND only

Add constraint function to restrict template_tmov_basic to only support:
- UB2UB: Both src and dst must be in Unified Buffer (memory_space="ub")
- ND2ND: Both tiles must have N-dimensional layout (s_layout=NONE_BOX)

Other scenarios (GM2UB, UB2GM, specialized layouts) require different
implementation paths and are explicitly rejected by this constraint.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

---------

Co-authored-by: caojian5 <caojian5@huawei.com>
Co-authored-by: Claude Opus 4.7 <noreply@anthropic.com>
* Add HP (HIGH_PRECISION) support for TDiv

* Add HP for divs recip rowexpanddiv colexpanddiv

* [fix] Remove redundant else branch in HIGH_PRECISION mode since MLIR validation guarantees f16/f32 only

* [fix] Add missing HIGH_PRECISION test kernels for tdivs ST test

HP test cases were defined in cases.py but missing from implementation
files, causing FileNotFoundError when compare.py tried to read output.bin.

Added 12 HP kernels across three synced layers:
- tdivs.pto: HIGH_PRECISION kernels with precision_mode attribute
- main.cpp: kCases[] entries and launch wrapper declarations
- launch.cpp: Launch functions with correct IEEE 754 scalar values

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

---------

Co-authored-by: caojian5 <caojian5@huawei.com>
Co-authored-by: Claude Opus 4.7 <noreply@anthropic.com>
Copy link
Copy Markdown

@gemini-code-assist gemini-code-assist Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

This pull request significantly expands the project's documentation by adding detailed ISA specifications, design documents for core operations like mad and acc_store, and a framework for TileLang ST validation. It also integrates the PTO-Gym submodule, updates build and Docker configurations for TileLang DSL support, and introduces a script for automated release document generation. The review feedback identifies several areas for improvement in the generation script, including more robust version string parsing, avoiding incorrect heading demotion within code blocks, replacing fragile string matching with structured markers, and ensuring anchor uniqueness to prevent broken links in merged documentation.

notes[version] = "Release refresh"

def key_fn(item: str) -> tuple[int, ...]:
return tuple(int(part) for part in item.split("."))
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

medium

This list comprehension will raise a ValueError if any part of the version string is not a digit (e.g., "0.4-beta"). Since this script is used for release documentation generation, consider adding a check to ensure only numeric parts are converted to integers to avoid runtime crashes.

new_level = min(6, len(hashes) + levels)
return f"{'#' * new_level} {heading}"

return re.sub(r"^(#{1,6})\s+(.*)$", replace, text, flags=re.MULTILINE)
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

medium

The regex r"^(#{1,6})\s+(.*)$" used with re.MULTILINE matches any line starting with a hash followed by a space. This can incorrectly demote comments inside fenced code blocks (e.g., in Python or Bash snippets) if they are at the start of a line. It is recommended to use a more robust way to identify headings that are not part of code blocks, such as splitting the text by code block markers before applying the substitution.

Comment on lines +175 to +178
if "For detailed semantics, C-style pseudocode, and CCE mappings" in line:
continue
if "CCE correspondence" in line or "builtin mapping" in line.lower():
continue
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

medium

This logic for stripping unwanted lines relies on exact string matches for prose, which is fragile. If the source document (vpto-spec.md) is updated with slightly different wording, these lines will no longer be stripped. Consider using more stable markers or a structured approach to identify content to be removed.

Comment on lines +212 to +213
if anchor_suffix:
return f"]({anchor_suffix})"
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

medium

When an intra-bundle link contains an anchor (e.g., [Foo](chapter.md#anchor)), it is rewritten to ](#anchor). This assumes that the anchor name is unique across all inlined chapters in the merged document. If multiple chapters use the same anchor names, these links may break or point to the wrong location. Consider prefixing internal anchors with the chapter name during the merge process to ensure uniqueness.

@reedhecre
Copy link
Copy Markdown

reedhecre commented May 25, 2026

Codex Review

该评论由 review 机器人自动更新。

  • PR: support vpto backend #699 support vpto backend
  • Author: mouliangyu
  • Base/Head: main / feature-vpto-backend-merge
  • Head SHA: db2bf1f7ba2d
  • Trigger: PR 有新提交
  • Generated At: 2026-05-26T14:50:06Z
  • Previous Head SHA: 2a4ccae86aec
  • Status: completed

Summary

PR #699 的 VPTO tile-op expansion 存在 3 个明确问题:tmrgsort format2 丢失了 excutedexhausted 语义,trsqrt 则接受了未实现的 HIGH_PRECISION 模式。

Findings

  1. P1 VPTO `tmrgsort` format2 完全忽略了 `excuted` 向量操作数 lib/TileOps/tmrgsort_template.py:167

TMrgSortOp 的现有契约仍然把 outs(dst, excuted) 作为 format2 语义的一部分,现有 EmitC lowering 也会把它作为 executedNumList 传给 TMRGSORT(...)。但这里新增的 TileLang 模板里,template_tmrgsort_multi_list{2,3,4} 虽然接收了 ex_vec,后续却从不读取它,而是仅根据 tile 的 valid_col 重新计算每个 list 的结构数,并把整个 tmp 复制到 dst。这会把非默认 executed-list 输入静默丢掉,导致 VPTO backend 对 format2 产生与现有 backend 不一致的 merge 边界/结果。test/lit/pto/tmrgsort_executed_constant_emitc.pto 里的常量 executed-list 用例在 VPTO 路径下会被错误编译。

  1. P1 `tmrgsort.exhausted` 属性没有传到 TileLang helper,`exhausted=true` 会被降成 `false` lib/PTO/Transforms/ExpandTileOp.cpp:345

tmrgsort_template.py 依赖 pto.get_op_attr("exhausted", ...) 来设置 vmrgsort4 的 exhausted bit,但 appendOpContextAttrs() 现在只序列化 round/cmp/precision 相关属性,根本没有把 TMrgSortOpexhausted 带过去。结果是所有 format2 pto.tmrgsort 在 VPTO tile-op expansion 中都会走默认值 0,即便源 IR 明确写了 exhausted = true。这会直接改变 exhausted case 的 ISA 配置,现有 test/tilelang_st/.../tmrgsort 里的 exhausted=true 场景会被误编译。

  1. P2 `TRsqrtOp` 暴露了 `precision_mode`,但 VPTO 模板完全没有实现该语义 lib/TileOps/trsqrt_template.py:13

这个 PR 给 TRsqrtOp 新增了 precision_mode 属性,verifier 也允许用户传 HIGH_PRECISION。同时 ExpandTileOp 还会像其他数学 op 一样把该属性带入模板路径。但 template_trsqrt 仍然只有默认的 vsqrt + vdiv 实现,既没有读取 precision_mode,文件里还保留着 HIGH_PRECISION 未实现的 TODO。也就是说,用户请求 HIGH_PRECISION 会成功编译,却静默退化成默认路径,属于明确的 contract mismatch,并会改变数值结果。

@mouliangyu mouliangyu force-pushed the feature-vpto-backend-merge branch from 2a4ccae to db2bf1f Compare May 26, 2026 14:35
@FangRui0
Copy link
Copy Markdown
Contributor

/run a3

@reedhecre
Copy link
Copy Markdown

已接收 /run a3,A3 板测器会处理这条请求。

页面会自动刷新,可以直接看当前阶段、排队情况和最近结果。

@reedhecre
Copy link
Copy Markdown

A3 板测失败

  • 触发方式:manual
  • 源码提交:03089d80ad9c
  • 结果汇总:OK 217 / FAIL 2 / SKIP 1
  • 日志:/home/zhongxuan/ptoas-board-monitor/runtime/logs/20260527_104105_manual_pr699.log
  • 手动指令:/run a3
  • 触发人:FangRui0
  • 触发评论:support vpto backend #699 (comment)
  • 失败阶段:board-validation / exit=1

失败用例

  • syncall_binding (run, exit=1)
  • tprefetch_async_binding (run, exit=1)

@reedhecre
Copy link
Copy Markdown

A3 板测失败详情:PR #699

syncall_binding

stage=run info=exit=1

[ERROR] aclrtSynchronizeStream(stream) failed: 507014 (/home/zhongxuan/ptoas-board-monitor/runtime/runs/20260527_104105_manual_pr699/npu_validation/SyncAll/syncall_binding/main.cpp:84)
[ERROR] RecentErrMsg: EZ9999: Inner Error!
EZ9999[PID: 4175868] 2026-05-27-11:26:05.538.466 (EZ9999):  The error from device(chipId:2, dieId:0), serial number is 146, there is an exception of aicore error, core id is 17, error code = 0, dump info: pc start: 0x124800000000, current: 0x124800000188, vec error info: 0, mte error info: 0x7a06000033, ifu error info: 0x212c200024600, ccu error info: 0x40a01900778000d8, cube error info: 0, biu error info: 0, aic error mask: 0x6500020bd00028c, para base: 0x12c100000000.[FUNC:PrintCoreInfo][FILE:device_error_core_proc.cc][LINE:645]
        TraceBack (most recent call last):
       The extend info: errcode:(0, 0, 0) errorStr: timeout or trap error. fixp_error0 info: 0x6000033, fixp_error1 info: 0x7a, fsmId:1, tslot:0, thread:0, ctxid:0, blk:0, sublk:0, subErrType:4.[FUNC:PrintCoreInfo][FILE:device_error_core_proc.cc][LINE:658]
       Kernel task happen error, retCode=0x25, [aicore timeout].[FUNC:PreCheckTaskErr][FILE:davinci_kernel_task.cc][LINE:1729]
       AICORE Kernel task happen error, retCode=0x25.[FUNC:GetError][FILE:stream.cc][LINE:1475]
       [AIC_INFO] after execute:args print end[FUNC:GetError][FILE:stream.cc][LINE:1475]
       [DFX_INFO]Aicore kernel execute failed, device_id=4, stream_id=46, report_stream_id=46, task_id=0, flip_num=0, fault kernel_name=_Z22syncall_binding_kernelPii, fault kernel info ext=_Z22syncall_binding_kernelPii, program id=0, hash=3129332313788381512.[FUNC:GetError][FILE:stream.cc][LINE:1475]
       rtStreamSynchronize execution failed, reason=aicore timeout[FUNC:FuncErrorReason][FILE:error_message_manage.cc][LINE:65]
       synchronize stream failed, runtime result = 507014[FUNC:ReportCallError][FILE:log_inner.cpp][LINE:148]
[2026-05-27 11:26:06] ERROR: testcase failed (exit 1): syncall_binding
tprefetch_async_binding

stage=run info=exit=1

[ERROR] aclrtSynchronizeStream(stream) failed: 507035 (/home/zhongxuan/ptoas-board-monitor/runtime/runs/20260527_104105_manual_pr699/npu_validation/TPrefetchAsync/tprefetch_async_binding/main.cpp:91)
[ERROR] RecentErrMsg: EZ9999: Inner Error!
EZ9999[PID: 348335] 2026-05-27-11:26:46.014.212 (EZ9999):  The error from device(chipId:2, dieId:0), serial number is 147, there is an exception of aivec error, core id is 34, error code = 0, dump info: pc start: 0x124800000000, current: 0x124800000160, vec error info: 0x1e000000a8, mte error info: 0xa50312b08b, ifu error info: 0x2dfffc0e34f40, ccu error info: 0x40a00e0000000052, cube error info: 0, biu error info: 0, aic error mask: 0x6500020bd00028c, para base: 0x12c100000000.[FUNC:PrintCoreInfo][FILE:device_error_core_proc.cc][LINE:645]
        TraceBack (most recent call last):
       The extend info: errcode:(0, 0x200000000000000, 0) errorStr: The MPU address access is invalid. fixp_error0 info: 0x312b08b, fixp_error1 info: 0xa5, fsmId:1, tslot:0, thread:0, ctxid:0, blk:0, sublk:0, subErrType:4.[FUNC:PrintCoreInfo][FILE:device_error_core_proc.cc][LINE:658]
       Kernel task happen error, retCode=0x31, [vector core exception].[FUNC:PreCheckTaskErr][FILE:davinci_kernel_task.cc][LINE:1729]
       AIV Kernel happen error, retCode=0x31.[FUNC:GetError][FILE:stream.cc][LINE:1475]
       [AIC_INFO] after execute:args print end[FUNC:GetError][FILE:stream.cc][LINE:1475]
       [DFX_INFO]Aicore kernel execute failed, device_id=4, stream_id=46, report_stream_id=46, task_id=0, flip_num=0, fault kernel_name=_Z30tprefetch_async_binding_kernelPfPa, fault kernel info ext=_Z30tprefetch_async_binding_kernelPfPa, program id=0, hash=8435686547367685641.[FUNC:GetError][FILE:stream.cc][LINE:1475]
       rtStreamSynchronize execution failed, reason=vector core exception[FUNC:FuncErrorReason][FILE:error_message_manage.cc][LINE:65]
       synchronize stream failed, runtime result = 507035[FUNC:ReportCallError][FILE:log_inner.cpp][LINE:148]
[2026-05-27 11:26:48] ERROR: testcase failed (exit 1): tprefetch_async_binding

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.