Kernel embeddings #98
using "stringImportPaths" for better readability
ensureInit in the buffers is not strictly required, but we can't know what a user will do, so added a check
version(DComputeTestCUDA)
{
    Platform.initialise();
please add an additional test instead of altering this one.
Got it! Once everything is working smoothly, I will update the test.
this(size_t elems)
{
    ensureInit();
why are these calls needed here?
this(T[] arr)
{
    ensureInit();

 * The PTX file is read and embedded at compile time via the D compiler's
 * string import mechanism (-J / stringImportPaths in dub.json). No file
 * I/O occurs at runtime.
 *
 * Example:
 *     Program p = Program.fromEmbedded!"kernel.ptx"();
 */
static Program fromEmbedded(string filename)()
Currently the compiler emits the PTX file after compilation, so unless you double compile this then I don't think it will work as expected. You would need to essentially reference a symbol and then in the link phase have the compiler generate an object file for it and link that in.
For a better developer experience, we could write a program (as a subpackage) and run it as a pre-build step, so that the PTX is generated before the host code is compiled. Alternatively, the user can update the pre-build commands manually, with no subpackage needed. Reaching this goal will also need some improvements in LDC itself.
Summary
With this PR, all DCompute runtime infrastructure is managed lazily and transparently behind the scenes. Developers only need to write their host code, allocate memory (`Buffer`), and launch their compute kernels directly using `launch!k`.
Major Changes
1. Lazy Static Init Runtime (`source/dcompute/driver/cuda/runtime.d`)
- A shared module constructor (`shared static this()`) that initializes CUDA, discovers active GPUs, allocates the default `Context` (Device 0), and pushes it onto the context stack.
- A thread-local module constructor (`static this()`) that ensures every thread gets a lock-free, dedicated `Queue` (`CUstream`) with zero resource contention.
- An `ensureInit()` guard as a defensive safety fallback for edge cases.
2. Context-Sensitive Compile-Time PTX Embedding (`source/dcompute/driver/cuda/package.d`)
- The `import()` expression is placed inside the `launch!k` template definition.
- Because `launch!` is a template, it is instantiated inside the parent project's compilation context.
- This allows the `dcompute` library to compile as a standard static library without requiring any local PTX files or string import flags, while seamlessly embedding the consumer project's custom PTX at compile time.
3. Defensive Safety Triggers (`source/dcompute/driver/cuda/buffer.d`)
- `ensureInit()` triggers inside both `Buffer!T` constructors.
4. dub.json update
- `"stringImportPaths": ["."]` or the `-J` flag should be used with the path where the PTX is generated.
Developer Workflow & Flow of State
1. Compilation Flow:
- `@compute` modules (e.g. `tests/kernel.d`) are compiled directly into PTX intermediate assembly (`kernels_cuda800_64.ptx`).
- `-J.` (the current directory) is passed to the host compilation.
- The host code instantiates `launch!matmul`. The compiler processes `import("kernels_cuda800_64.ptx")`, embedding the PTX directly into your executable.
2. Execution Flow:
- When a `Buffer` is instantiated, the underlying static constructors initialize CUDA, assign the default device, push the GPU context, and initialize the active thread's CUDA stream.
- When `launch!` is executed, it checks if `Program.globalProgram` is initialized. Seeing it is null, it passes the embedded PTX string to `cuModuleLoadData`, registering your custom kernels in the GPU context.
Current State & Validation
All internal unittests and client applications compile, link, and validate successfully, each with a single command:
- `dub test --compiler=ldc2` completes and passes successfully.
- `dub run --force --compiler=ldc2` builds cleanly from scratch, embeds custom `matmul` kernels, executes them on the GPU, and validates output against host CPU matrices.
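For readers who want to see the initialization and load-once pattern from the sections above in isolation, here is a minimal sketch in plain D. It makes no CUDA calls: `FakeContext`, `FakeQueue`, `modulesLoaded`, and the free function `launch` are hypothetical stand-ins invented for illustration, and the real code in `runtime.d` and `package.d` may be structured differently.

```d
import std.stdio : writeln;

// Hypothetical stand-ins for the CUDA driver objects; no GPU calls are made.
struct FakeContext { int device; }
struct FakeQueue { int id; }

__gshared FakeContext* defaultContext; // process-wide, set once
FakeQueue* threadQueue;                // module-level variables are thread-local in D
__gshared string globalProgram;        // stays null until the first launch
__gshared int modulesLoaded;           // counts simulated module loads

// Runs once per process, before main.
shared static this()
{
    defaultContext = new FakeContext(0);
}

// Runs once in every thread, when that thread starts.
static this()
{
    threadQueue = new FakeQueue(1);
}

// Defensive fallback: does nothing if the constructors already ran.
// A real implementation would need synchronization around the shared state.
void ensureInit()
{
    if (defaultContext is null) defaultContext = new FakeContext(0);
    if (threadQueue is null) threadQueue = new FakeQueue(1);
}

// Load-once check performed on every launch.
void launch(string embeddedPtx)
{
    ensureInit();
    if (globalProgram is null)
    {
        globalProgram = embeddedPtx; // stands in for the cuModuleLoadData call
        ++modulesLoaded;
    }
}

void main()
{
    launch("// fake ptx");
    launch("// fake ptx");
    writeln(modulesLoaded); // prints 1: the second launch reuses the loaded program
}
```

In this sketch the second `launch` skips the load because `globalProgram` is already set, which mirrors the null check on `Program.globalProgram` described under Execution Flow.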