Question: kernel authoring roadmap (Triton / TileLang)? #15

gugudeshubao · 2026-05-09T00:29:04Z

Collaborator

Hi — I read in docs/inference_engine_differences.md that kernels are hand-written CUDA today, and that Triton /
TileLang integration is a possible future surface but not a current goal.

I’d appreciate any public roadmap or design discussion on DSL-backed or multi-backend kernel authoring. Concretely:

Multi-backend / multi-source operators: As FlashRT grows (more models, shapes, and hardware), do you envision
the same logical op being implemented or targeted through more than one path (e.g. hand-tuned CUDA for
production hotspots + optional Triton/TileLang or codegen paths for faster iteration or portability), or do you
expect CUDA + CUTLASS-style stacks to remain the single source of truth for the performance-critical path?

Leveraging community tooling: A major constraint for small teams is operator velocity (writing, tuning, and
validating kernels). Are you open to intentionally leaning on ecosystem pieces (Triton, TileLang / related MLIR
flows, vendor libraries, etc.) where they fit FlashRT’s small-batch / graph-capture model? Not necessarily
replacing the hot path on day one, but to reduce maintenance surface or speed up bring-up for new primitives.

If the answer is “stay CUDA-first for the foreseeable future,” that’s useful too — it sets expectations for
contributors who might otherwise assume a DSL or codegen layer is on the near-term horizon.

Thanks.

LiangSu8899 · 2026-05-09T13:09:55Z

Maintainer

I appreciate you asking that — that's a super constructive question, and yeah, I've been turning it over in my head for a while.

Some context first, because "why hand-written CUDA" reads like dogma without it. Before FlashRT I spent a fair amount of time on the toolchain side of the same problem: TRT / Triton / MLIR-TensorRT paths for openpi (https://github.com/LiangSu8899/openpi-jax_torch_mlir-trt; the Triton version is in there too). Small-batch VLA inference produces atypical tile shapes, and with those shapes none of the paths got below ~70 ms for me; that was the wall I kept hitting. The wall-clock cost of debugging the toolchain (opaque autotune choices, version-pin churn, MLIR diagnostics) also ended up larger than just writing the kernel by hand once I knew what shape it should take.

I want to be careful: this isn't "toolchains bad, CUDA good". Triton, TileLang, modelopt, MLIR-TensorRT are all really well-built, and I still reach for them as the fast path to a deployment baseline. The point is narrower — for atypical model shapes + small-batch + realtime + an aggressive perf target, the cost/benefit flips. Going through the toolchain attempts is also what gave me a much sharper sense of where the kernel actually needs to be different, which made the handwritten version much faster to land than it would have been if I'd started there. The two phases were complementary, not opposed.

So FlashRT's intent is to fill the gap those toolchains don't cover well — not to compete with them. They handle the typical-shape / batched / less-latency-sensitive case far better than I could. FlashRT is for the cases they leave on the floor.

That shapes how I think about your questions:

Multi-backend operator dispatch — yes, this is where I want to grow

This is the direction I most want to push on. Same logical op, multiple implementations behind a registry, dispatched on hardware + dtype. Some of this exists already (FA2 vs CUTLASS FMHA on Thor; cuBLAS Lt FP8 vs CUTLASS SM100 FP8), but it's ad-hoc. A proper dispatch layer is the cleanest way to grow cross-hardware support without breaking paths that already work.

I haven't built it yet, partly because I don't have a strong opinion on the API, and partly because FlashRT is a one-person project right now (lol), so the surface I can keep healthy is finite. Cross-hardware support is something I really want to grow into, and it's exactly where I'm hoping community contributions can pick up, both on design and on per-hardware kernel paths.
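To make the "registry, dispatched on hardware + dtype" idea concrete, here is a minimal sketch of what such a layer could look like. Nothing in it is FlashRT's actual API: `KernelKey`, `KernelRegistry`, the `"any"` fallback arch, the priority scheme, and the stub kernels are all hypothetical, and the backend names are labels only.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Tuple


@dataclass(frozen=True)
class KernelKey:
    """Hypothetical dispatch key: logical op + target arch + dtype."""
    op: str
    arch: str   # e.g. "sm100", or "any" for an arch-agnostic fallback
    dtype: str  # e.g. "fp8"


class KernelRegistry:
    """Maps a dispatch key to candidate implementations, best first."""

    def __init__(self) -> None:
        self._impls: Dict[KernelKey, List[Tuple[int, str, Callable]]] = {}

    def register(self, op: str, arch: str, dtype: str, name: str, priority: int = 0):
        def decorator(fn: Callable) -> Callable:
            entries = self._impls.setdefault(KernelKey(op, arch, dtype), [])
            entries.append((priority, name, fn))
            # Keep the highest-priority implementation at index 0.
            entries.sort(key=lambda e: e[0], reverse=True)
            return fn
        return decorator

    def resolve(self, op: str, arch: str, dtype: str) -> Tuple[str, Callable]:
        # Prefer an arch-specific kernel, then fall back to the "any" entry.
        for key in (KernelKey(op, arch, dtype), KernelKey(op, "any", dtype)):
            entries = self._impls.get(key)
            if entries:
                _, name, fn = entries[0]
                return name, fn
        raise LookupError(f"no kernel for op={op} arch={arch} dtype={dtype}")


registry = KernelRegistry()


# Stub implementations: real entries would wrap compiled CUDA kernels.
@registry.register("gemm", "sm100", "fp8", name="cutlass_sm100_fp8", priority=10)
def _gemm_cutlass(a, b):
    return "cutlass_sm100_fp8 would run here"


@registry.register("gemm", "any", "fp8", name="cublaslt_fp8")
def _gemm_cublaslt(a, b):
    return "cublaslt_fp8 would run here"


name, kernel = registry.resolve("gemm", "sm100", "fp8")
```

The point of the fallback entry is that adding a new per-hardware path only adds a registration; existing paths keep resolving exactly as before.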

DSLs in-tree (Triton / TileLang as an authoring layer) — I'd rather keep them outside

This is where I want FlashRT to stay narrow. The reasons, in order of weight:

Deployment dep weight. Triton bundles its own LLVM build, and TileLang brings its own compiler stack. As hard runtime deps these inflate the build matrix, version-pin surface, and container size, which is awkward for edge / Jetson, and that might be half of what FlashRT targets.
Maintenance surface. Owning bug-fix surface across CUDA + DSL-prototype + DSL-runtime is more than one person can do well (FlashRT is a solo project at the moment), and more than is warranted by the marginal velocity gain on the production path.
They're already great externally. Prototype a primitive in your preferred DSL, measure, and if it's a long-term win we port it to handwritten CUDA. Each tool does what it's best at and FlashRT's runtime contract stays narrow.

For vendor / header-only libraries (CUTLASS, cuBLAS Lt, FA2, modelopt) — case-by-case, used where they fit. The line I draw is roughly "header-only or static-link, narrow API surface".

Operator velocity — real, but the right place to push it is at the edges

You're right that velocity is the binding constraint for a small team. My read: toolchains buy real velocity for new primitives and typical shapes, and that's exactly where I'd want them used — externally, to prototype, to get a baseline, to validate the design. Inside FlashRT the marginal gain is smaller and the maintenance cost is larger, so the trade flips.

If a primitive has a clean DSL prototype with measurements, that's a useful starting point and I'd happily take a CUDA port as a PR. The two open primitives where this would help most right now are FP4 SigLIP FFN and decoder all-proj — both have favorable simulation numbers and are blocked on me writing the kernel.
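As a rough illustration of what "a prototype with measurements" could include, here is a small harness that checks a prototype against a reference and records median latency. It is only a sketch under stated assumptions: `evaluate_prototype`, the tolerance, and the CPU stand-in functions are hypothetical, and real GPU kernels would need a device synchronization around the timed call.

```python
import statistics
import time
from typing import Callable, Dict, Sequence


def max_abs_error(a: Sequence[float], b: Sequence[float]) -> float:
    """Largest elementwise absolute difference between two outputs."""
    return max(abs(x - y) for x, y in zip(a, b))


def median_latency_ms(fn: Callable, args: tuple, warmup: int = 3, iters: int = 20) -> float:
    """Median wall-clock latency in milliseconds over `iters` timed calls."""
    for _ in range(warmup):
        fn(*args)
    samples = []
    for _ in range(iters):
        t0 = time.perf_counter()
        fn(*args)  # NOTE: a GPU kernel launch would need a sync here before stopping the clock
        samples.append((time.perf_counter() - t0) * 1e3)
    return statistics.median(samples)


def evaluate_prototype(reference: Callable, prototype: Callable,
                       args: tuple, tol: float = 1e-6) -> Dict[str, object]:
    """Compare a prototype against a reference: correctness first, then latency."""
    err = max_abs_error(reference(*args), prototype(*args))
    return {
        "max_abs_err": err,
        "ok": err <= tol,
        "ref_ms": median_latency_ms(reference, args),
        "proto_ms": median_latency_ms(prototype, args),
    }


# CPU stand-ins for a reference op and a prototype of the same op (y = a*x + y).
def reference_axpy(a, x, y):
    return [a * xi + yi for xi, yi in zip(x, y)]


def prototype_axpy(a, x, y):
    return list(map(lambda xi, yi: a * xi + yi, x, y))


x = [float(i) for i in range(1024)]
y = [1.0] * 1024
report = evaluate_prototype(reference_axpy, prototype_axpy, (2.0, x, y))
```

A report like this, attached to a PR together with the shapes that were tested, gives the CUDA port a concrete target to match.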

TL;DR

Production hot path: handwritten CUDA + CUTLASS. Stays that way for the foreseeable future.
Multi-backend operator dispatch: wanted, partially exists, this is where I'm hoping community contributions go.
DSLs as in-tree authoring layer: not planned. Use them externally; PR a CUDA port if it's a long-term win.
Vendor / header-only libs: case-by-case.

Thanks again for thinking carefully about this. If you have specific scenarios or primitives in mind, leave a comment any time, or I'm happy to set up a call if you'd rather discuss in more depth. I appreciate the attention to the project.


gugudeshubao · 2026-05-20T07:10:11Z

Collaborator Author

Thanks.
