Add c_funcify registry, start deprecating COp class #2218

Open
jessegrabowski wants to merge 8 commits into pymc-devs:main from jessegrabowski:c-dispatch

jessegrabowski commented Jun 13, 2026

Member

Closes #2006

This PR provides the machinery and a proof of concept for a c_funcify registry and a pattern for moving COps out of the op definitions and into link/c/dispatch. The C-backend specifics currently held by the COp class move onto a new CImpl class, and these are produced by functions registered to c_funcify. Aside from that bit of special machinery, the setup mirrors what the other backends already do. This is admittedly something of an intermediate form — ideally we'd be purely functional — but I wanted a minimal proof-of-concept before going nuts and converting every single COp. By doing things this way, existing COps self-register: the registry's default returns the op as its own implementation, so the transition is smooth and this PR changes no behavior for anything that isn't explicitly migrated (cache keys and generated source stay byte-identical for unmigrated ops).

How it hangs together:

  • c_funcify is a singledispatch returning a CImpl (or CFileImpl, which loads C from .c/.h files the way ExternalCOp does). The default returns the op itself when it's a CLinkerOp, so every existing COp is its own impl.
  • CLinker resolves each node's implementation through the registry while compiling, and VMLinker / OpWiseCLinker / DebugMode all go through the same path — so cvm, c, and c|py are covered. The graph node keeps its original op, which is why the cache keys don't move.
  • Registrations live under link/c/dispatch/ and load lazily from dispatch/__init__.py, exactly like link/numba/dispatch and link/jax/dispatch.
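
A minimal sketch of the registry shape described above, using stand-in classes rather than the real PyTensor ones. The `Op`, `CLinkerOp`, and `CImpl` classes and the two toy ops are hypothetical; only the `functools.singledispatch` behavior is real. Raising `NotImplementedError` for ops with no C code is also an assumption about how the default signals "nothing registered".

```python
from functools import singledispatch


class Op:
    """Stand-in for a backend-agnostic op (hypothetical, not PyTensor's class)."""


class CLinkerOp(Op):
    """Stand-in for a legacy op that still carries its own C methods."""

    def c_code(self, name):
        return f"/* legacy C for {name} */"


class CImpl:
    """Stand-in for a detached C implementation produced by the registry."""

    def __init__(self, op):
        self.op = op

    def c_code(self, name):
        return f"/* dispatched C for {name} */"


@singledispatch
def c_funcify(op):
    # Default: an unmigrated op that already has C methods is its own impl.
    if isinstance(op, CLinkerOp):
        return op
    # Assumption: ops without any C code signal that by raising.
    raise NotImplementedError(f"No C implementation for {type(op).__name__}")


class LegacyAdd(CLinkerOp):
    """An unmigrated op: nothing is registered for it."""


class MigratedCheck(Op):
    """A migrated op: plain Op, with its C code living in the registry."""


@c_funcify.register(MigratedCheck)
def _c_funcify_migrated_check(op):
    return CImpl(op)
```

With this shape, `c_funcify(LegacyAdd())` hands back the op instance itself, while `c_funcify(MigratedCheck())` returns a separate `CImpl`, so callers can treat both the same way.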

The proof-of-concept migration is CheckAndRaise: it drops from COp to a plain Op, its c_code moves to link/c/dispatch/raise_op.py, and its exception type still flows at runtime through a ParamsType carried on the impl. The emitted code remains unchanged, as does the cache key.

Some special cases that will need addressing before the bulk migration:

  • OpenMPOp probes the compiler for -fopenmp support and threads an openmp flag through the op. That probe needs to become a cached helper in the dispatch layer, and openmp should become an impl-time decision rather than op/graph state (with a pickle shim for graphs that were saved with it set).
  • Elemwise, CAReduce, and Composite don't have their own kernels; they wrap a scalar op's c_code in loops, and Composite holds an inner fgraph of scalar ops. Their impls will recurse through c_funcify to fetch the inner impls, which is also what drags the ~90 ScalarOps along for the ride.
  • Most params are static props that can become compile-time #defines, but the genuinely runtime ones (like CheckAndRaise's exception type) stay on the impl as params, as done here.
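
To make the recursion in the second bullet concrete, here is a toy sketch in which the wrapper's impl fetches the scalar kernel through the same registry and wraps it in a loop. The `ScalarAdd` and `Elemwise` classes, the closure-based impls, and the loop variable `n` are illustrative assumptions, not the PR's actual code.

```python
from functools import singledispatch


class ScalarAdd:
    """Hypothetical scalar op."""


class Elemwise:
    """Hypothetical wrapper op holding a scalar op."""

    def __init__(self, scalar_op):
        self.scalar_op = scalar_op


@singledispatch
def c_funcify(op):
    raise NotImplementedError(type(op).__name__)


@c_funcify.register(ScalarAdd)
def _scalar_add(op):
    def c_code(inputs, output):
        a, b = inputs
        return f"{output} = {a} + {b};"

    return c_code


@c_funcify.register(Elemwise)
def _elemwise(op):
    # Recurse through the registry to fetch the inner scalar kernel.
    inner = c_funcify(op.scalar_op)

    def c_code(inputs, output):
        # Index every operand with the loop variable, then wrap in a loop.
        body = inner([f"{name}[i]" for name in inputs], f"{output}[i]")
        return f"for (npy_intp i = 0; i < n; ++i) {{ {body} }}"

    return c_code


loop = c_funcify(Elemwise(ScalarAdd()))(["x", "y"], "z")
# loop == "for (npy_intp i = 0; i < n; ++i) { z[i] = x[i] + y[i]; }"
```

Because the wrapper never touches the scalar op's methods directly, migrating the wrapper forces every scalar op it can hold to be reachable through the registry as well.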

From there the plan is to migrate family by family (subtensor, shape, scalar/elemwise, sparse, …), each one a self-contained move of c_code off the op and into a dispatch/ module, until COp has no subclasses left and can be removed. I'd like to add a coverage test over the op inventory so we fail loudly if a family is ever forgotten — otherwise a missing registration just silently falls back to perform.
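
One possible shape for that coverage test, sketched with toy classes: walk the op class tree and flag anything that still resolves to the registry default. The op names here are placeholders, and a real test would likely need an allowlist for ops that legitimately have no C implementation.

```python
from functools import singledispatch


class Op: ...
class Subtensor(Op): ...
class Shape(Op): ...
class Forgotten(Op): ...


@singledispatch
def c_funcify(op):
    raise NotImplementedError(type(op).__name__)


@c_funcify.register(Subtensor)
@c_funcify.register(Shape)
def _registered(op):
    return "impl"


def all_subclasses(cls):
    """Recursively collect every subclass of cls."""
    found = set()
    for sub in cls.__subclasses__():
        found.add(sub)
        found |= all_subclasses(sub)
    return found


def unregistered_ops(base=Op):
    """Names of op classes that would fall through to the default."""
    default = c_funcify.registry[object]
    return sorted(
        sub.__name__
        for sub in all_subclasses(base)
        if c_funcify.dispatch(sub) is default
    )


missing = unregistered_ops()  # the test would assert this list is empty
```

Here `missing` contains only `"Forgotten"`, which is exactly the kind of silent gap the test is meant to surface.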

ricardoV94 commented Jun 13, 2026

Member

You need to give your bot some flair, make it talk like a pirate or something.

Two self-contradictory notes:

  1. Back when I was optimizing C code I wanted to get rid of ExternalCOp. Only one Op uses it (DimShuffle), and it makes that Op slower than code specialized for the node's actual new order, because the function stays agnostic, has to read a dynamically sized spec, etc. It's also quite a lot of extra LOC and docs to support what was basically a POC.

  2. Re: "Most params are static props". The Theano guys were pushing hard toward using params so that different op parametrizations compile to the same code, which gives you more cache hits and faster compiles. That was aimed at letting PyTensor scale to very large models (or folks who want graphs with 100k nodes). So fighting this pushes against one of the benefits the CVM has over Numba's fully specialized JIT.

These are self-contradictory, right? I want DimShuffle to be as lightweight as possible because it's a glue Op that shows up everywhere. If it's just an expand_dims, it shouldn't have logic to handle drop (and validate that the dropped dim has size 1) or transpose, etc. I never pushed these changes, but I had a branch that dropped it and brought the cvm closer in perf on the radon model.

For other Ops that naturally carry more weight (not just a metadata change like DimShuffle), I still think the params idea is nice, with the goal of faster cvm compiles.
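
A toy illustration of the trade-off between the two notes, assuming (as a simplification) that the compile cache is keyed on a hash of the generated source. The two generators and the `params->order` field are hypothetical; PyTensor's real cache keys include more than the source text.

```python
import hashlib


def specialized_source(new_order):
    """Bake the permutation into the C text: one source per parametrization."""
    lines = [f"dims[{i}] = in_dims[{j}];" for i, j in enumerate(new_order)]
    return "\n".join(lines)


def generic_source(ndim):
    """Read the permutation at runtime from params: one source per ndim."""
    return f"for (int i = 0; i < {ndim}; ++i) dims[i] = in_dims[params->order[i]];"


def cache_key(source):
    return hashlib.sha256(source.encode()).hexdigest()


orders = [(0, 1, 2), (2, 1, 0), (1, 0, 2)]
specialized_keys = {cache_key(specialized_source(o)) for o in orders}
generic_keys = {cache_key(generic_source(len(o))) for o in orders}
```

The three permutations produce three distinct specialized keys but a single generic key: specialization buys a leaner kernel per node at the cost of more compilations, while runtime params share one compiled module across parametrizations.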

Introduce the singledispatch c_funcify registry returning detached CImpl
implementations, resolve CLinker through it, and route OpWiseCLinker, the
VM, and DebugMode to the dispatched C thunk with a Python fallback.

Make CheckAndRaise a plain Op and register CheckAndRaiseImpl, moving its
c_code and ParamsType into the detached impl.

Make DimShuffle a plain Op and register DimShuffleImpl, which emits per-node
specialized C from the static permutation and shape, replacing the
runtime-spec dimshuffle.c kernel.

DimShuffle was its last consumer and now uses the C dispatch registry, so
the ExternalCOp base class, its section-loading machinery, and the tests
exercising it are dead code.
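
The "dispatched C thunk with a Python fallback" routing from the first commit message can be pictured roughly as follows. This is a sketch under assumptions: `make_thunk`, the toy ops, and the use of `NotImplementedError` as the fallback signal are hypothetical, and the strings stand in for real compiled and Python thunks.

```python
from functools import singledispatch


class Exp:
    """Hypothetical op with a registered C implementation."""

    def perform(self, x):
        return f"python:{x}"


class Log:
    """Hypothetical op with no C registration."""

    def perform(self, x):
        return f"python:{x}"


@singledispatch
def c_funcify(op):
    raise NotImplementedError(type(op).__name__)


@c_funcify.register(Exp)
def _exp(op):
    return lambda x: f"c:{x}"


def make_thunk(op):
    """Prefer the dispatched C thunk; fall back to the op's Python perform."""
    try:
        return c_funcify(op)
    except NotImplementedError:
        return op.perform
```

In this sketch `make_thunk(Exp())` yields the C path and `make_thunk(Log())` quietly yields the Python one, which is why a forgotten registration is easy to miss without a dedicated test.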

jessegrabowski commented Jun 14, 2026

Member Author

To entice you to support this PR, I ripped out all the CFileImpl stuff. I also killed ExternalCOp and migrated DimShuffle to c_dispatch form. It now has a code printer that inspects the node and emits specialized kernels. Here are three examples. I ran some microbenchmarks and found a ~2x speedup (pure DimShuffle in different cases, pre/post refactor). I will try it on radon once my big model is done fitting.

/* CASE: expand_dims
 * op            = ExpandDims{axes=(0, 2)}
 * input_ndim    = 1
 * new_order     = ['x', 0, 'x']
 * _new_order    = [-1, 0, -1]   (-1 == augmented/'x')
 * drop          = ()
 * input.shape   = (None,)   (None == statically unknown)
 * cache_version = (1,)
 */

{
    npy_intp itemsize = PyArray_ITEMSIZE(input);

    npy_intp dimensions[3];
    npy_intp strides[3];
    dimensions[0] = 1;
    strides[0] = itemsize;
    dimensions[1] = PyArray_DIMS(input)[0];
    strides[1] = PyArray_DIMS(input)[0] == 1 ? itemsize : PyArray_STRIDES(input)[0];
    dimensions[2] = 1;
    strides[2] = itemsize;

    Py_XDECREF(res);
    // Borrow only the writable flag from the input; NPY_OWNDATA stays 0.
    res = (PyArrayObject*)PyArray_New(
    &PyArray_Type, 3, dimensions,
    PyArray_TYPE(input), strides,
    PyArray_DATA(input), itemsize,
    (NPY_ARRAY_WRITEABLE * PyArray_ISWRITEABLE(input)),
    NULL);
    if (res == NULL) {
        FAIL;
    }

    // Declare the result a view of the input and recompute its flags.
    Py_INCREF((PyObject*)input);
    PyArray_SetBaseObject(res, (PyObject*)input);
    PyArray_UpdateFlags(res, NPY_ARRAY_UPDATE_ALL);
}

/* CASE: transpose_dynamic
 * op            = MatrixTranspose
 * input_ndim    = 2
 * new_order     = [1, 0]
 * _new_order    = [1, 0]   (-1 == augmented/'x')
 * drop          = ()
 * input.shape   = (None, None)   (None == statically unknown)
 * cache_version = (1,)
 */

{
    npy_intp itemsize = PyArray_ITEMSIZE(input);

    npy_intp dimensions[2];
    npy_intp strides[2];
    dimensions[0] = PyArray_DIMS(input)[1];
    strides[0] = PyArray_DIMS(input)[1] == 1 ? itemsize : PyArray_STRIDES(input)[1];
    dimensions[1] = PyArray_DIMS(input)[0];
    strides[1] = PyArray_DIMS(input)[0] == 1 ? itemsize : PyArray_STRIDES(input)[0];

    Py_XDECREF(res);
    // Borrow only the writable flag from the input; NPY_OWNDATA stays 0.
    res = (PyArrayObject*)PyArray_New(
    &PyArray_Type, 2, dimensions,
    PyArray_TYPE(input), strides,
    PyArray_DATA(input), itemsize,
    (NPY_ARRAY_WRITEABLE * PyArray_ISWRITEABLE(input)),
    NULL);
    if (res == NULL) {
        FAIL;
    }

    // Declare the result a view of the input and recompute its flags.
    Py_INCREF((PyObject*)input);
    PyArray_SetBaseObject(res, (PyObject*)input);
    PyArray_UpdateFlags(res, NPY_ARRAY_UPDATE_ALL);
}

/* CASE: transpose_known_dim
 * op            = MatrixTranspose
 * input_ndim    = 2
 * new_order     = [1, 0]
 * _new_order    = [1, 0]   (-1 == augmented/'x')
 * drop          = ()
 * input.shape   = (None, 3)   (None == statically unknown)
 * cache_version = (1,)
 */

{
    npy_intp itemsize = PyArray_ITEMSIZE(input);

    npy_intp dimensions[2];
    npy_intp strides[2];
    dimensions[0] = PyArray_DIMS(input)[1];
    strides[0] = PyArray_STRIDES(input)[1];
    dimensions[1] = PyArray_DIMS(input)[0];
    strides[1] = PyArray_DIMS(input)[0] == 1 ? itemsize : PyArray_STRIDES(input)[0];

    Py_XDECREF(res);
    // Borrow only the writable flag from the input; NPY_OWNDATA stays 0.
    res = (PyArrayObject*)PyArray_New(
    &PyArray_Type, 2, dimensions,
    PyArray_TYPE(input), strides,
    PyArray_DATA(input), itemsize,
    (NPY_ARRAY_WRITEABLE * PyArray_ISWRITEABLE(input)),
    NULL);
    if (res == NULL) {
        FAIL;
    }

    // Declare the result a view of the input and recompute its flags.
    Py_INCREF((PyObject*)input);
    PyArray_SetBaseObject(res, (PyObject*)input);
    PyArray_UpdateFlags(res, NPY_ARRAY_UPDATE_ALL);
}
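
For readers wondering how a printer could arrive at the `dimensions`/`strides` lines above, here is a small Python sketch that reproduces them from `new_order` and the static shape. The function name and structure are guesses, not the PR's code. One more assumption: a static length of exactly 1 keeps the runtime check, since none of the three examples shows that case.

```python
def emit_dims_and_strides(new_order, static_shape):
    """Return C lines filling `dimensions` and `strides` for one node.

    new_order uses "x" for an inserted length-1 axis, otherwise an input axis.
    static_shape uses None for a statically unknown length.
    """
    lines = []
    for out_axis, src in enumerate(new_order):
        if src == "x":
            # Inserted broadcast axis: length 1, stride is irrelevant.
            lines.append(f"dimensions[{out_axis}] = 1;")
            lines.append(f"strides[{out_axis}] = itemsize;")
            continue
        dim = f"PyArray_DIMS(input)[{src}]"
        stride = f"PyArray_STRIDES(input)[{src}]"
        lines.append(f"dimensions[{out_axis}] = {dim};")
        known = static_shape[src]
        if known is not None and known != 1:
            # Length is statically known not to be 1: no runtime check needed.
            lines.append(f"strides[{out_axis}] = {stride};")
        else:
            # Unknown (or possibly 1): normalize the stride at runtime.
            lines.append(f"strides[{out_axis}] = {dim} == 1 ? itemsize : {stride};")
    return lines


expand = emit_dims_and_strides(["x", 0, "x"], (None,))
known = emit_dims_and_strides([1, 0], (None, 3))
```

For the `transpose_known_dim` case this drops the `== 1` branch on the axis whose length is statically 3, matching the only difference between the second and third examples.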

@jessegrabowski

Member Author

I also killed OpenMPOp and migrated everything in elemwise.py into the new format. That handles both special cases (and cleans up a bunch of dead code). I'll stop now. If we like this, the rest of the cutover is purely mechanical.

Replace the class-cached update_self_openmp (which set config.openmp = False
process-wide when the compiler lacked OpenMP) with a pure, memoized
openmp_supported() probe. self.openmp is now the request; effective use is
request AND compiler support, resolved lazily at codegen.

Elemwise was the only consumer. It now derives from COp directly and carries
the openmp request plus _use_openmp/c_compile_args itself; the dead omp.h
header path (always shadowed by Elemwise.c_headers) is dropped.

Extract _c_all into a module-level _elemwise_c_all generator, register
ElemwiseImpl, and make Elemwise a plain Op with no C methods. CAReduce calls
the generator directly; the openmp decision moves from the op to the impl.

CAReduce becomes a plain Op; NonZeroDimsCAReduce's _c_all override becomes an
error_on_empty_reduce_axis flag the generator reads. The elemwise and careduce
C generators move into link.c.dispatch.elemwise beside the impls, leaving
tensor/elemwise.py free of any C codegen.
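
The "request AND compiler support, resolved lazily" rule from the first of these commit messages might look something like the sketch below. `compiler_accepts_openmp` is a hypothetical stand-in for actually try-compiling with -fopenmp, hard-coded here to report no support, and `probe_log` exists only to show how often the probe runs.

```python
from functools import cache

probe_log = []  # records each real probe, for illustration only


def compiler_accepts_openmp():
    """Hypothetical stand-in for try-compiling a tiny program with -fopenmp."""
    probe_log.append("probed")
    return False  # pretend the compiler lacks OpenMP


@cache
def openmp_supported():
    """Pure, memoized probe: no global config is mutated as a side effect."""
    return compiler_accepts_openmp()


def use_openmp(requested):
    # Short-circuit: never probe the compiler unless OpenMP was requested.
    return requested and openmp_supported()


not_requested = use_openmp(False)   # no probe happens
first = use_openmp(True)            # probes once
second = use_openmp(True)           # served from the cache
```

The request stays untouched on the op, and the effective decision is recomputed cheaply wherever code is generated, without flipping a process-wide setting.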
Development

Successfully merging this pull request may close these issues.

Reduce status of C backend to "just another backend"
