Question: kernel authoring roadmap (Triton / TileLang)? #15
Replies: 2 comments
-
|
I appreciate you asking that; it's a super constructive question, and yeah, I've been turning it over in my head for a while.

Some context first, because "why hand-written CUDA" reads like dogma without it. Before FlashRT I spent a fair amount of time on the toolchain side of the same problem: TRT / Triton / MLIR-TensorRT paths for openpi (https://github.com/LiangSu8899/openpi-jax_torch_mlir-trt). The Triton version is in there too. The atypical tile shapes that fall out of small-batch VLA inference meant none of those paths cleared ~70 ms for me; that was the ceiling I kept hitting. And the wall-clock cost of debugging the toolchain (opaque autotune choices, version-pin churn, MLIR diagnostics) ended up larger than just writing the kernel by hand once I knew what shape it should take.

I want to be careful: this isn't "toolchains bad, CUDA good". Triton, TileLang, modelopt, MLIR-TensorRT are all really well-built, and I still reach for them as the fast path to a deployment baseline. The point is narrower: for atypical model shapes + small-batch + realtime + an aggressive perf target, the cost/benefit flips. Going through the toolchain attempts is also what gave me a much sharper sense of where the kernel actually needs to be different, which made the handwritten version much faster to land than it would have been if I'd started there. The two phases were complementary, not opposed.

So FlashRT's intent is to fill the gap those toolchains don't cover well, not to compete with them. They handle the typical-shape / batched / less-latency-sensitive case far better than I could. FlashRT is for the cases they leave on the floor.

That shapes how I think about your questions:

Multi-backend operator dispatch: yes, this is where I want to grow

This is the direction I most want to push on. Same logical op, multiple implementations behind a registry, dispatched on hardware + dtype.
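A minimal sketch of the "multiple implementations behind a registry" idea, assuming a Python-side table keyed on (op, hardware, dtype). Everything in it is hypothetical: `KernelImpl`, `register`, `dispatch`, the arch strings `"thor"` and `"sm100"`, and the priorities are illustrative only and are not FlashRT's API. The stub functions stand in for real kernel launchers, and the priorities say nothing about which kernel is actually faster.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Tuple


@dataclass(frozen=True)
class KernelImpl:
    name: str                # implementation label, e.g. "fa2"
    fn: Callable[..., str]   # stand-in for a real kernel launcher
    priority: int = 0        # higher wins when several impls match


# (op, arch, dtype) -> candidate implementations (hypothetical layout)
_REGISTRY: Dict[Tuple[str, str, str], List[KernelImpl]] = {}


def register(op: str, arch: str, dtype: str, name: str, priority: int = 0):
    """Decorator that adds one implementation for (op, arch, dtype)."""
    def deco(fn):
        key = (op, arch, dtype)
        _REGISTRY.setdefault(key, []).append(KernelImpl(name, fn, priority))
        return fn
    return deco


def dispatch(op: str, arch: str, dtype: str) -> KernelImpl:
    """Return the highest-priority implementation, or fail loudly."""
    candidates = _REGISTRY.get((op, arch, dtype), [])
    if not candidates:
        raise LookupError(f"no kernel registered for {(op, arch, dtype)}")
    return max(candidates, key=lambda impl: impl.priority)


# Stub registrations; priorities are arbitrary placeholders.
@register("attention", "thor", "fp16", name="fa2", priority=0)
def _fa2_attention():
    return "fa2 launched"


@register("attention", "thor", "fp16", name="cutlass_fmha", priority=1)
def _cutlass_attention():
    return "cutlass_fmha launched"


@register("gemm", "sm100", "fp8", name="cublaslt_fp8", priority=1)
def _cublaslt_gemm():
    return "cublaslt_fp8 launched"


@register("gemm", "sm100", "fp8", name="cutlass_sm100_fp8", priority=0)
def _cutlass_gemm():
    return "cutlass_sm100_fp8 launched"
```

Calling `dispatch("attention", "thor", "fp16")` returns the registered entry with the highest priority, and an unregistered combination raises `LookupError` instead of silently falling back, so a new hardware path cannot change the behavior of paths that already work.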
Some of this exists already (FA2 vs CUTLASS FMHA on Thor; cuBLAS Lt FP8 vs CUTLASS SM100 FP8), but it's ad hoc. A proper dispatch layer is the cleanest way to grow cross-hardware support without breaking paths that already work. I haven't built it yet, partly because I don't have a strong opinion on the API, and partly because FlashRT is a one-person project right now (lol), so the surface I can keep healthy is finite. Cross-hardware support is something I really want to grow into, and it's exactly the place I'm hoping community contributions can pick up: on design, and on per-hardware kernel paths.

DSLs in-tree (Triton / TileLang as an authoring layer): I'd rather keep them outside

This is where I want FlashRT to stay narrow. The reasons, in order of weight:
For vendor / header-only libraries (CUTLASS, cuBLAS Lt, FA2, modelopt): case-by-case, used where they fit. The line I draw is roughly "header-only or static-link, narrow API surface".

Operator velocity: real, but the right place to push it is at the edges

You're right that velocity is the binding constraint for a small team. My read: toolchains buy real velocity for new primitives and typical shapes, and that's exactly where I'd want them used: externally, to prototype, to get a baseline, to validate the design. Inside FlashRT the marginal gain is smaller and the maintenance cost is larger, so the trade flips. If a primitive has a clean DSL prototype with measurements, that's a useful starting point and I'd happily take a CUDA port as a PR. The two open primitives where this would help most right now are FP4 SigLIP FFN and decoder all-proj; both have favorable simulation numbers and are blocked on me writing the kernel.

TL;DR
Thanks again for thinking carefully about this. If you have specific scenarios or primitives in mind, leave a comment any time, or I'm happy to set up a call if you'd rather discuss in more detail. Appreciate the attention to the project. |
-
|
Thanks. |
-
Hi — I read in docs/inference_engine_differences.md that kernels are hand-written CUDA today, and that Triton /
TileLang integration is a possible future surface but not a current goal.
I’d appreciate any public roadmap or design discussion on DSL-backed or multi-backend kernel authoring. Concretely:
Multi-backend / multi-source operators — As FlashRT grows (more models, shapes, and hardware), do you envision
the same logical op being implemented or targeted through more than one path (e.g. hand-tuned CUDA for
production hotspots + optional Triton/TileLang or codegen paths for faster iteration or portability), or do you
expect CUDA + CUTLASS-style stacks to remain the single source of truth for the performance-critical path?
Leveraging community tooling — A major constraint for small teams is operator velocity (writing, tuning, and
validating kernels). Are you open to intentionally leaning on ecosystem pieces (Triton, TileLang / related MLIR
flows, vendor libraries, etc.) where they fit FlashRT’s small-batch / graph-capture model — not necessarily
replacing the hot path on day one, but to reduce maintenance surface or speed up bring-up for new primitives?
If the answer is “stay CUDA-first for the foreseeable future,” that’s useful too — it sets expectations for
contributors who might otherwise assume a DSL or codegen layer is on the near-term horizon.
Thanks.