Add c_funcify registry, start deprecating COp class by jessegrabowski · Pull Request #2218 · pymc-devs/pytensor

jessegrabowski · 2026-06-13T04:10:16Z

Closes #2006

This PR provides the machinery and a proof of concept for a c_funcify registry and a pattern for moving COps out of the op definitions and into link/c/dispatch. The C-backend specifics currently held by the COp class move onto a new CImpl class, and these are produced by functions registered to c_funcify. Aside from that bit of special machinery, the setup mirrors what the other backends already do. This is admittedly something of an intermediate form — ideally we'd be purely functional — but I wanted a minimal proof-of-concept before going nuts and converting every single COp. By doing things this way, existing COps self-register: the registry's default returns the op as its own implementation, so the transition is smooth and this PR changes no behavior for anything that isn't explicitly migrated (cache keys and generated source stay byte-identical for unmigrated ops).

How it hangs together:

c_funcify is a singledispatch returning a CImpl (or CFileImpl, which loads C from .c/.h files the way ExternalCOp does). The default returns the op itself when it's a CLinkerOp, so every existing COp is its own impl.
CLinker resolves each node's implementation through the registry while compiling, and VMLinker / OpWiseCLinker / DebugMode all go through the same path — so cvm, c, and c|py are covered. The graph node keeps its original op, which is why the cache keys don't move.
Registrations live under link/c/dispatch/ and load lazily from dispatch/__init__.py, exactly like link/numba/dispatch and link/jax/dispatch.
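To make the dispatch pattern concrete, here is a minimal sketch using `functools.singledispatch`. It is a toy under stated assumptions, not the code in this PR: `Op`, `CLinkerOp`, and `CImpl` are bare stand-ins for the real classes, `LegacyAdd` and `PurePythonOp` are hypothetical, and returning `None` for an op with no C implementation is only an assumption about how the default signals a fallback to `perform`.

```python
from functools import singledispatch


class Op:
    """Stand-in for a backend-agnostic op."""


class CLinkerOp(Op):
    """Stand-in for an unmigrated op that still carries its own c_code."""

    def c_code(self):
        return "/* legacy C kernel */"


class CImpl:
    """Stand-in for a detached C implementation produced by the registry."""

    def c_code(self):
        raise NotImplementedError


class LegacyAdd(CLinkerOp):  # hypothetical unmigrated op
    pass


class PurePythonOp(Op):  # hypothetical op with no C implementation
    pass


class CheckAndRaise(Op):  # toy version of the migrated op: a plain Op
    pass


class CheckAndRaiseImpl(CImpl):
    def c_code(self):
        return "/* C kernel that now lives in the dispatch module */"


@singledispatch
def c_funcify(op):
    # Default: an unmigrated op is its own implementation.
    if isinstance(op, CLinkerOp):
        return op
    # Assumption: None tells the linker to fall back to Op.perform.
    return None


@c_funcify.register(CheckAndRaise)
def _(op):
    # Migrated op: the C code lives on a detached impl, not on the op.
    return CheckAndRaiseImpl()
```

In this sketch an unmigrated op resolves to itself and a migrated op resolves to its detached impl, while the graph node would keep holding the original op in both cases.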

The proof-of-concept migration is CheckAndRaise: it drops from COp to a plain Op, its c_code moves to link/c/dispatch/raise_op.py, and its exception type still flows at runtime through a ParamsType carried on the impl. The emitted code remains unchanged, as does the cache key.

Some special cases that will need addressing before the bulk migration:

OpenMPOp probes the compiler for -fopenmp support and threads an openmp flag through the op. That probe needs to become a cached helper in the dispatch layer, and openmp should become an impl-time decision rather than op/graph state (with a pickle shim for graphs that were saved with it set).
Elemwise, CAReduce, and Composite don't have their own kernels; they wrap a scalar op's c_code in loops, and Composite holds an inner fgraph of scalar ops. Their impls will recurse through c_funcify to fetch the inner impls, which is also what drags the ~90 ScalarOps along for the ride.
Most params are static props that can become compile-time #defines, but the genuinely runtime ones (like CheckAndRaise's exception type) stay on the impl as params, as done here.
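The recursion described for Elemwise can be sketched as follows. This is a rough illustration only: `ScalarAdd`, `ScalarMul`, `ScalarImpl`, and the string templates are hypothetical, and the toy `Elemwise` and `ElemwiseImpl` here share nothing with the real classes beyond the name. The point is just that the outer impl asks the registry for the inner scalar impl and wraps its C snippet in a loop.

```python
from dataclasses import dataclass
from functools import singledispatch


@dataclass(frozen=True)
class ScalarAdd:  # hypothetical scalar op
    pass


@dataclass(frozen=True)
class ScalarMul:  # hypothetical scalar op
    pass


@dataclass(frozen=True)
class Elemwise:  # toy wrapper op: holds a scalar op, has no kernel of its own
    scalar_op: object


@dataclass(frozen=True)
class ScalarImpl:
    template: str

    def c_code(self, x, y, out):
        return self.template.format(x=x, y=y, out=out)


@dataclass(frozen=True)
class ElemwiseImpl:
    inner: ScalarImpl

    def c_code(self):
        # Wrap the scalar kernel in a flat loop over the elements.
        body = self.inner.c_code("x[i]", "y[i]", "out[i]")
        return f"for (npy_intp i = 0; i < n; ++i) {{ {body} }}"


@singledispatch
def c_funcify(op):
    raise NotImplementedError(f"no C implementation for {type(op).__name__}")


@c_funcify.register(ScalarAdd)
def _(op):
    return ScalarImpl("{out} = {x} + {y};")


@c_funcify.register(ScalarMul)
def _(op):
    return ScalarImpl("{out} = {x} * {y};")


@c_funcify.register(Elemwise)
def _(op):
    # Recurse through the registry to fetch the inner scalar impl.
    return ElemwiseImpl(inner=c_funcify(op.scalar_op))


print(c_funcify(Elemwise(ScalarAdd())).c_code())
```

This is also why migrating the wrapper ops pulls the scalar ops along: the wrapper impl cannot be built until every inner op resolves through the same registry.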

From there the plan is to migrate family by family (subtensor, shape, scalar/elemwise, sparse, …), each one a self-contained move of c_code off the op and into a dispatch/ module, until COp has no subclasses left and can be removed. I'd like to add a coverage test over the op inventory so we fail loudly if a family is ever forgotten — otherwise a missing registration just silently falls back to perform.

ricardoV94 · 2026-06-13T08:28:32Z

You need to give your bot some flair, make it talk like a pirate or something.

Two self-contradictory notes:

Back when I was optimizing C code I wanted to get rid of ExternalCOp. Only one Op uses it (DimShuffle), and it makes that Op slower than code specialized for the node's actual new order, because the function stays agnostic, has to read a dynamically sized spec, etc. It's also quite a lot of extra LOC / docs to support what was basically a POC.
Re: "Most params are static props". The Theano guys were pushing hard in the direction of using these so that different op parametrizations compile to the same code and you get more cache hits and faster compiles. That was in the direction of allowing PyTensor to scale to very large models (or folks who want graphs with 100k nodes). So fighting this is pushing against one of the benefits the CVM has over numba's fully specialized JIT.

These are self-contradictory, right? I want DimShuffle to be as lightweight as possible because it's a glue Op that shows up everywhere. If it's just an expand_dims it shouldn't have logic to handle drop (and validate that it's a size-1 dim being dropped) or transpose, etc. I never pushed these changes, but I had a branch that dropped it and made the cvm closer in perf on radon.

For other ops that naturally carry more weight (not just a metadata change like DimShuffle), I still think the params idea is nice, with the goal of faster cvm compile.

jessegrabowski · 2026-06-14T03:57:51Z

To entice you to support this PR, I ripped out all the CFileImpl stuff. I also killed ExternalCOp and migrated DimShuffle to c_dispatch form. It now has a code printer that inspects the node and emits specialized kernels. Here are three examples. I ran some microbenchmarks and found a ~2x speedup (just pure dimshuffle in different cases, pre/post refactor). I will try it on radon once my big model is done fitting.

/* CASE: expand_dims
 * op            = ExpandDims{axes=(0, 2)}
 * input_ndim    = 1
 * new_order     = ['x', 0, 'x']
 * _new_order    = [-1, 0, -1]   (-1 == augmented/'x')
 * drop          = ()
 * input.shape   = (None,)   (None == statically unknown)
 * cache_version = (1,)
 */

{
    npy_intp itemsize = PyArray_ITEMSIZE(input);

    npy_intp dimensions[3];
    npy_intp strides[3];
    dimensions[0] = 1;
    strides[0] = itemsize;
    dimensions[1] = PyArray_DIMS(input)[0];
    strides[1] = PyArray_DIMS(input)[0] == 1 ? itemsize : PyArray_STRIDES(input)[0];
    dimensions[2] = 1;
    strides[2] = itemsize;

    Py_XDECREF(res);
    // Borrow only the writable flag from the input; NPY_OWNDATA stays 0.
    res = (PyArrayObject*)PyArray_New(
    &PyArray_Type, 3, dimensions,
    PyArray_TYPE(input), strides,
    PyArray_DATA(input), itemsize,
    (NPY_ARRAY_WRITEABLE * PyArray_ISWRITEABLE(input)),
    NULL);
    if (res == NULL) {
        FAIL;
    }

    // Declare the result a view of the input and recompute its flags.
    Py_INCREF((PyObject*)input);
    PyArray_SetBaseObject(res, (PyObject*)input);
    PyArray_UpdateFlags(res, NPY_ARRAY_UPDATE_ALL);
}

/* CASE: transpose_dynamic
 * op            = MatrixTranspose
 * input_ndim    = 2
 * new_order     = [1, 0]
 * _new_order    = [1, 0]   (-1 == augmented/'x')
 * drop          = ()
 * input.shape   = (None, None)   (None == statically unknown)
 * cache_version = (1,)
 */

{
    npy_intp itemsize = PyArray_ITEMSIZE(input);

    npy_intp dimensions[2];
    npy_intp strides[2];
    dimensions[0] = PyArray_DIMS(input)[1];
    strides[0] = PyArray_DIMS(input)[1] == 1 ? itemsize : PyArray_STRIDES(input)[1];
    dimensions[1] = PyArray_DIMS(input)[0];
    strides[1] = PyArray_DIMS(input)[0] == 1 ? itemsize : PyArray_STRIDES(input)[0];

    Py_XDECREF(res);
    // Borrow only the writable flag from the input; NPY_OWNDATA stays 0.
    res = (PyArrayObject*)PyArray_New(
    &PyArray_Type, 2, dimensions,
    PyArray_TYPE(input), strides,
    PyArray_DATA(input), itemsize,
    (NPY_ARRAY_WRITEABLE * PyArray_ISWRITEABLE(input)),
    NULL);
    if (res == NULL) {
        FAIL;
    }

    // Declare the result a view of the input and recompute its flags.
    Py_INCREF((PyObject*)input);
    PyArray_SetBaseObject(res, (PyObject*)input);
    PyArray_UpdateFlags(res, NPY_ARRAY_UPDATE_ALL);
}

/* CASE: transpose_known_dim
 * op            = MatrixTranspose
 * input_ndim    = 2
 * new_order     = [1, 0]
 * _new_order    = [1, 0]   (-1 == augmented/'x')
 * drop          = ()
 * input.shape   = (None, 3)   (None == statically unknown)
 * cache_version = (1,)
 */

{
    npy_intp itemsize = PyArray_ITEMSIZE(input);

    npy_intp dimensions[2];
    npy_intp strides[2];
    dimensions[0] = PyArray_DIMS(input)[1];
    strides[0] = PyArray_STRIDES(input)[1];
    dimensions[1] = PyArray_DIMS(input)[0];
    strides[1] = PyArray_DIMS(input)[0] == 1 ? itemsize : PyArray_STRIDES(input)[0];

    Py_XDECREF(res);
    // Borrow only the writable flag from the input; NPY_OWNDATA stays 0.
    res = (PyArrayObject*)PyArray_New(
    &PyArray_Type, 2, dimensions,
    PyArray_TYPE(input), strides,
    PyArray_DATA(input), itemsize,
    (NPY_ARRAY_WRITEABLE * PyArray_ISWRITEABLE(input)),
    NULL);
    if (res == NULL) {
        FAIL;
    }

    // Declare the result a view of the input and recompute its flags.
    Py_INCREF((PyObject*)input);
    PyArray_SetBaseObject(res, (PyObject*)input);
    PyArray_UpdateFlags(res, NPY_ARRAY_UPDATE_ALL);
}
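The per-axis lines in the three kernels above follow a simple pattern, which the sketch below reproduces. It is a reconstruction from those three examples only, not the PR's actual printer: the function name `emit_dimshuffle_axes` is hypothetical, and the branch for a statically known size of 1 is an assumption, since none of the examples exercises it.

```python
def emit_dimshuffle_axes(new_order, static_shape):
    """Emit the dimensions[]/strides[] assignments for one DimShuffle node.

    new_order:    output axes, each an input axis index or "x" for a new axis
    static_shape: input shape, with None where the size is statically unknown
    """
    lines = []
    for out_axis, src in enumerate(new_order):
        if src == "x":
            # Inserted length-1 axis: any stride works, itemsize is used.
            lines.append(f"dimensions[{out_axis}] = 1;")
            lines.append(f"strides[{out_axis}] = itemsize;")
            continue

        dim = f"PyArray_DIMS(input)[{src}]"
        stride = f"PyArray_STRIDES(input)[{src}]"
        lines.append(f"dimensions[{out_axis}] = {dim};")

        known = static_shape[src]
        if known is None:
            # Size unknown at compile time: keep the runtime length-1 check.
            lines.append(f"strides[{out_axis}] = {dim} == 1 ? itemsize : {stride};")
        elif known == 1:
            # Assumption: a statically length-1 axis takes the itemsize stride.
            lines.append(f"strides[{out_axis}] = itemsize;")
        else:
            # Size statically known and not 1: the check is dropped.
            lines.append(f"strides[{out_axis}] = {stride};")
    return lines


print("\n".join(emit_dimshuffle_axes([1, 0], (None, 3))))
```

The specialization shows up in the `transpose_known_dim` case: because the second input dimension is statically 3, the emitted stride for output axis 0 skips the runtime `== 1` branch.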

jessegrabowski · 2026-06-14T05:58:59Z

I also killed OpenMPOp and migrated everything in elemwise.py into the new format. That handles both special cases (and cleans up a bunch of dead code). I'll stop now. If we like this, the rest of the cutover is purely mechanical.

Replace the class-cached update_self_openmp (which set config.openmp = False process-wide when the compiler lacked OpenMP) with a pure, memoized openmp_supported() probe. self.openmp is now the request; effective use is request AND compiler support, resolved lazily at codegen.
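A pure, memoized probe of this kind might look roughly like the sketch below. The name `openmp_supported` comes from the commit message, but everything else is an assumption for illustration: the default compiler name `"cc"`, the probe program, the timeout, and the `use_openmp` helper are not taken from the PR.

```python
import functools
import shutil
import subprocess
import tempfile
from pathlib import Path

# Tiny program that only compiles and links if OpenMP is available.
_PROBE_SOURCE = """\
#include <omp.h>
int main(void) { return omp_get_max_threads() > 0 ? 0 : 1; }
"""


@functools.cache
def openmp_supported(compiler="cc"):
    """Return True if `compiler` can build with -fopenmp.

    Pure apart from running the compiler: it never writes to global config,
    and functools.cache ensures each compiler is probed at most once.
    """
    if shutil.which(compiler) is None:
        return False
    with tempfile.TemporaryDirectory() as tmp:
        src = Path(tmp) / "probe.c"
        src.write_text(_PROBE_SOURCE)
        cmd = [compiler, "-fopenmp", str(src), "-o", str(Path(tmp) / "probe")]
        try:
            result = subprocess.run(cmd, capture_output=True, timeout=60)
        except (OSError, subprocess.TimeoutExpired):
            return False
    return result.returncode == 0


def use_openmp(requested, compiler="cc"):
    """Effective decision: the op's request AND compiler support.

    The probe only runs when OpenMP was actually requested, which keeps
    the resolution lazy.
    """
    return bool(requested) and openmp_supported(compiler)
```

Keeping the request on the op and the capability check in a cached helper means a compiler without OpenMP support no longer flips a process-wide setting; it only changes what a given codegen call decides.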

CAReduce becomes a plain Op; NonZeroDimsCAReduce's _c_all override becomes an error_on_empty_reduce_axis flag the generator reads. The elemwise and careduce C generators move into link.c.dispatch.elemwise beside the impls, leaving tensor/elemwise.py free of any C codegen.

jessegrabowski requested a review from ricardoV94 June 13, 2026 04:10

jessegrabowski added C-backend refactor labels Jun 13, 2026

jessegrabowski added 3 commits June 13, 2026 22:32

Add c_funcify C dispatch registry and route linkers through it

168468e

Introduce the singledispatch c_funcify registry returning detached CImpl implementations, resolve CLinker through it, and route OpWiseCLinker, the VM, and DebugMode to the dispatched C thunk with a Python fallback.

Migrate CheckAndRaise to the C dispatch registry

9a4a720

Make CheckAndRaise a plain Op and register CheckAndRaiseImpl, moving its c_code and ParamsType into the detached impl.

Migrate DimShuffle to the C dispatch registry

a1b6a6b

Make DimShuffle a plain Op and register DimShuffleImpl, which emits per-node specialized C from the static permutation and shape, replacing the runtime-spec dimshuffle.c kernel.

jessegrabowski force-pushed the c-dispatch branch from a39ebdc to a1b6a6b on June 14, 2026 03:37

Remove unused ExternalCOp

ffa8712

DimShuffle was its last consumer and now uses the C dispatch registry, so the ExternalCOp base class, its section-loading machinery, and the tests exercising it are dead code.

jessegrabowski added 4 commits June 14, 2026 01:11

Remove OpenMPOp, inlining its remainder into Elemwise

36025dc

Elemwise was the only consumer. It now derives from COp directly and carries the openmp request plus _use_openmp/c_compile_args itself; the dead omp.h header path (always shadowed by Elemwise.c_headers) is dropped.

Migrate Elemwise to the C dispatch registry

6bdc7ce

Extract _c_all into a module-level _elemwise_c_all generator, register ElemwiseImpl, and make Elemwise a plain Op with no C methods. CAReduce calls the generator directly; the openmp decision moves from the op to the impl.

jessegrabowski force-pushed the c-dispatch branch from 6837a9c to db46dd8 on June 14, 2026 06:15

jessegrabowski added the major label Jun 14, 2026

jessegrabowski wants to merge 8 commits into pymc-devs:main from jessegrabowski:c-dispatch

2 participants
