Add c_funcify registry, start deprecating COp class #2218
jessegrabowski wants to merge 8 commits into
You need to give your bot some flair, make it talk like a pirate or something. Two self-contradictory notes:
These are self-contradictory, right? I want DimShuffle to be as lightweight as possible because it's a glue Op that shows up everywhere. If it's just an expand_dims, it shouldn't have logic to handle drop (and validate that it's a size-1 dimension being dropped) or transpose, etc. I never pushed these changes, but I had a branch that dropped it and made the cvm closer in perf with the radon. For other ops that naturally carry more weight (not just a metadata change like DimShuffle), I still think the params idea is nice, with the goal of faster cvm compile.
Introduce the singledispatch c_funcify registry returning detached CImpl implementations, resolve CLinker through it, and route OpWiseCLinker, the VM, and DebugMode to the dispatched C thunk with a Python fallback.
Make CheckAndRaise a plain Op and register CheckAndRaiseImpl, moving its c_code and ParamsType into the detached impl.
Make DimShuffle a plain Op and register DimShuffleImpl, which emits per-node specialized C from the static permutation and shape, replacing the runtime-spec dimshuffle.c kernel.
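For readers skimming the commit list, here is a minimal sketch of the registry shape these commits describe. Everything in it is a simplified stand-in: the `Op`, `CLinkerOp`, `CImpl`, `CheckAndRaise`, and `CheckAndRaiseImpl` classes below are placeholders rather than the PR's definitions, `resolve_path` is a hypothetical helper, and returning `None` to signal the Python fallback is an assumption made for illustration.

```python
from functools import singledispatch


class Op:
    """Stand-in for a plain graph op with no C methods."""


class CLinkerOp(Op):
    """Stand-in for a legacy op that still carries its own C code."""

    def c_code(self):
        return "/* legacy C */"


class CImpl:
    """Stand-in for a detached C implementation built for one op."""

    def __init__(self, op):
        self.op = op


@singledispatch
def c_funcify(op):
    # Default: a legacy op acts as its own implementation, so unmigrated
    # ops need no explicit registration.
    if isinstance(op, CLinkerOp):
        return op
    # Assumption for this sketch: None means "no C implementation known".
    return None


class CheckAndRaise(Op):
    """Stand-in for the migrated op: a plain Op holding only metadata."""

    def __init__(self, exc_type=ValueError):
        self.exc_type = exc_type


class CheckAndRaiseImpl(CImpl):
    def c_code(self):
        # The C source lives on the impl, not on the op.
        return f"/* would raise {self.op.exc_type.__name__} */"


@c_funcify.register(CheckAndRaise)
def _(op):
    return CheckAndRaiseImpl(op)


def resolve_path(op):
    """Hypothetical helper: which thunk a linker would pick for this op."""
    return "py" if c_funcify(op) is None else "c"
```

The design point is that the op instance stays on the graph node untouched; only the lookup result differs between a legacy op (itself) and a migrated op (a separate impl object).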
a39ebdc to a1b6a6b
DimShuffle was its last consumer and now uses the C dispatch registry, so the ExternalCOp base class, its section-loading machinery, and the tests exercising it are dead code.
To entice you to support this PR, I ripped out all the

/* CASE: expand_dims
 * op = ExpandDims{axes=(0, 2)}
 * input_ndim = 1
 * new_order = ['x', 0, 'x']
 * _new_order = [-1, 0, -1] (-1 == augmented/'x')
 * drop = ()
 * input.shape = (None,) (None == statically unknown)
 * cache_version = (1,)
 */
{
    npy_intp itemsize = PyArray_ITEMSIZE(input);
    npy_intp dimensions[3];
    npy_intp strides[3];
    dimensions[0] = 1;
    strides[0] = itemsize;
    dimensions[1] = PyArray_DIMS(input)[0];
    strides[1] = PyArray_DIMS(input)[0] == 1 ? itemsize : PyArray_STRIDES(input)[0];
    dimensions[2] = 1;
    strides[2] = itemsize;
    Py_XDECREF(res);
    // Borrow only the writable flag from the input; NPY_OWNDATA stays 0.
    res = (PyArrayObject*)PyArray_New(
        &PyArray_Type, 3, dimensions,
        PyArray_TYPE(input), strides,
        PyArray_DATA(input), itemsize,
        (NPY_ARRAY_WRITEABLE * PyArray_ISWRITEABLE(input)),
        NULL);
    if (res == NULL) {
        FAIL;
    }
    // Declare the result a view of the input and recompute its flags.
    Py_INCREF((PyObject*)input);
    PyArray_SetBaseObject(res, (PyObject*)input);
    PyArray_UpdateFlags(res, NPY_ARRAY_UPDATE_ALL);
}

/* CASE: transpose_dynamic
 * op = MatrixTranspose
 * input_ndim = 2
 * new_order = [1, 0]
 * _new_order = [1, 0] (-1 == augmented/'x')
 * drop = ()
 * input.shape = (None, None) (None == statically unknown)
 * cache_version = (1,)
 */
{
    npy_intp itemsize = PyArray_ITEMSIZE(input);
    npy_intp dimensions[2];
    npy_intp strides[2];
    dimensions[0] = PyArray_DIMS(input)[1];
    strides[0] = PyArray_DIMS(input)[1] == 1 ? itemsize : PyArray_STRIDES(input)[1];
    dimensions[1] = PyArray_DIMS(input)[0];
    strides[1] = PyArray_DIMS(input)[0] == 1 ? itemsize : PyArray_STRIDES(input)[0];
    Py_XDECREF(res);
    // Borrow only the writable flag from the input; NPY_OWNDATA stays 0.
    res = (PyArrayObject*)PyArray_New(
        &PyArray_Type, 2, dimensions,
        PyArray_TYPE(input), strides,
        PyArray_DATA(input), itemsize,
        (NPY_ARRAY_WRITEABLE * PyArray_ISWRITEABLE(input)),
        NULL);
    if (res == NULL) {
        FAIL;
    }
    // Declare the result a view of the input and recompute its flags.
    Py_INCREF((PyObject*)input);
    PyArray_SetBaseObject(res, (PyObject*)input);
    PyArray_UpdateFlags(res, NPY_ARRAY_UPDATE_ALL);
}

/* CASE: transpose_known_dim
 * op = MatrixTranspose
 * input_ndim = 2
 * new_order = [1, 0]
 * _new_order = [1, 0] (-1 == augmented/'x')
 * drop = ()
 * input.shape = (None, 3) (None == statically unknown)
 * cache_version = (1,)
 */
{
    npy_intp itemsize = PyArray_ITEMSIZE(input);
    npy_intp dimensions[2];
    npy_intp strides[2];
    dimensions[0] = PyArray_DIMS(input)[1];
    strides[0] = PyArray_STRIDES(input)[1];
    dimensions[1] = PyArray_DIMS(input)[0];
    strides[1] = PyArray_DIMS(input)[0] == 1 ? itemsize : PyArray_STRIDES(input)[0];
    Py_XDECREF(res);
    // Borrow only the writable flag from the input; NPY_OWNDATA stays 0.
    res = (PyArrayObject*)PyArray_New(
        &PyArray_Type, 2, dimensions,
        PyArray_TYPE(input), strides,
        PyArray_DATA(input), itemsize,
        (NPY_ARRAY_WRITEABLE * PyArray_ISWRITEABLE(input)),
        NULL);
    if (res == NULL) {
        FAIL;
    }
    // Declare the result a view of the input and recompute its flags.
    Py_INCREF((PyObject*)input);
    PyArray_SetBaseObject(res, (PyObject*)input);
    PyArray_UpdateFlags(res, NPY_ARRAY_UPDATE_ALL);
}
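The three cases differ only in their `dimensions`/`strides` assignments, so a generator along the following lines could plausibly produce them. This is a rough sketch, not the PR's `DimShuffleImpl`: the function name `emit_dims_and_strides` is made up for illustration, and the branch for an axis statically known to have length 1 is an assumption, since none of the cases above exercise it.

```python
def emit_dims_and_strides(new_order, static_shape):
    """Return the C lines that fill `dimensions` and `strides`.

    new_order uses -1 for an inserted ('x') axis, like `_new_order` in the
    header comments. static_shape holds an int per input axis, or None when
    the length is statically unknown.
    """
    lines = []
    for out_axis, in_axis in enumerate(new_order):
        if in_axis == -1:
            # Inserted axis: length 1, any stride works, itemsize is used.
            lines.append(f"dimensions[{out_axis}] = 1;")
            lines.append(f"strides[{out_axis}] = itemsize;")
            continue
        dim = f"PyArray_DIMS(input)[{in_axis}]"
        stride = f"PyArray_STRIDES(input)[{in_axis}]"
        lines.append(f"dimensions[{out_axis}] = {dim};")
        known = static_shape[in_axis]
        if known is None:
            # Length unknown at compile time: keep the runtime check.
            lines.append(
                f"strides[{out_axis}] = {dim} == 1 ? itemsize : {stride};"
            )
        elif known == 1:
            # Assumption: a statically length-1 axis takes itemsize directly.
            lines.append(f"strides[{out_axis}] = itemsize;")
        else:
            # Length statically known and not 1: the check is dropped.
            lines.append(f"strides[{out_axis}] = {stride};")
    return lines
```

Calling it with `[1, 0]` and `(None, 3)` reproduces the assignments of the `transpose_known_dim` case, where the runtime length check disappears for the axis whose size is known.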
I also killed
Replace the class-cached update_self_openmp (which set config.openmp = False process-wide when the compiler lacked OpenMP) with a pure, memoized openmp_supported() probe. self.openmp is now the request; effective use is request AND compiler support, resolved lazily at codegen.
Elemwise was the only consumer. It now derives from COp directly and carries the openmp request plus _use_openmp/c_compile_args itself; the dead omp.h header path (always shadowed by Elemwise.c_headers) is dropped.
Extract _c_all into a module-level _elemwise_c_all generator, register ElemwiseImpl, and make Elemwise a plain Op with no C methods. CAReduce calls the generator directly; the openmp decision moves from the op to the impl.
CAReduce becomes a plain Op; NonZeroDimsCAReduce's _c_all override becomes an error_on_empty_reduce_axis flag the generator reads. The elemwise and careduce C generators move into link.c.dispatch.elemwise beside the impls, leaving tensor/elemwise.py free of any C codegen.
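To make the request-versus-support split in these commits concrete, here is a minimal sketch of a pure, memoized probe. The name `openmp_supported` comes from the commit message, but its signature here, the `_try_compile_openmp` helper, `use_openmp`, and the default compiler name `"cc"` are all assumptions for illustration, not the PR's code.

```python
import os
import shutil
import subprocess
import tempfile
from functools import lru_cache

# Tiny program that only compiles and links when OpenMP is available.
_PROBE_SOURCE = (
    "#include <omp.h>\n"
    "int main(void) { return omp_get_max_threads() < 1; }\n"
)


def _try_compile_openmp(compiler):
    """Hypothetical helper: True if `compiler` builds the probe with -fopenmp."""
    if shutil.which(compiler) is None:
        return False
    with tempfile.TemporaryDirectory() as tmp:
        src = os.path.join(tmp, "probe.c")
        with open(src, "w") as f:
            f.write(_PROBE_SOURCE)
        result = subprocess.run(
            [compiler, "-fopenmp", src, "-o", os.path.join(tmp, "probe")],
            capture_output=True,
        )
    return result.returncode == 0


@lru_cache(maxsize=None)
def openmp_supported(compiler="cc"):
    # Pure and memoized: no global config is mutated, and each compiler
    # is probed at most once per process.
    return _try_compile_openmp(compiler)


def use_openmp(requested, compiler="cc"):
    # Effective use is "request AND compiler support". The probe only runs
    # when OpenMP was actually requested.
    return bool(requested) and openmp_supported(compiler)
```

Because `use_openmp` short-circuits on the request, graphs that never ask for OpenMP never trigger a compiler invocation.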
6837a9c to db46dd8
Closes #2006
This PR provides the machinery and a proof of concept for a c_funcify registry and a pattern for moving COps out of the op definitions and into link/c/dispatch. The C-backend specifics currently held by the COp class move onto a new CImpl class, and these are produced by functions registered to c_funcify. Aside from that bit of special machinery, the setup mirrors what the other backends already do. This is admittedly something of an intermediate form (ideally we'd be purely functional), but I wanted a minimal proof-of-concept before going nuts and converting every single COp. By doing things this way, existing COps self-register: the registry's default returns the op as its own implementation, so the transition is smooth and this PR changes no behavior for anything that isn't explicitly migrated (cache keys and generated source stay byte-identical for unmigrated ops).

How it hangs together:

- c_funcify is a singledispatch returning a CImpl (or CFileImpl, which loads C from .c/.h files the way ExternalCOp does). The default returns the op itself when it's a CLinkerOp, so every existing COp is its own impl.
- CLinker resolves each node's implementation through the registry while compiling, and VMLinker/OpWiseCLinker/DebugMode all go through the same path, so cvm, c, and c|py are covered. The graph node keeps its original op, which is why the cache keys don't move.
- […] link/c/dispatch/ and load lazily from dispatch/__init__.py, exactly like link/numba/dispatch and link/jax/dispatch.

The proof-of-concept migration is CheckAndRaise: it drops from COp to a plain Op, its c_code moves to link/c/dispatch/raise_op.py, and its exception type still flows at runtime through a ParamsType carried on the impl. The emitted code remains unchanged, as does the cache key.

Some special cases that will need addressing before the bulk migration:

- OpenMPOp probes the compiler for -fopenmp support and threads an openmp flag through the op. That probe needs to become a cached helper in the dispatch layer, and openmp should become an impl-time decision rather than op/graph state (with a pickle shim for graphs that were saved with it set).
- Elemwise, CAReduce, and Composite don't have their own kernels; they wrap a scalar op's c_code in loops, and Composite holds an inner fgraph of scalar ops. Their impls will recurse through c_funcify to fetch the inner impls, which is also what drags the ~90 ScalarOps along for the ride.
- […] #defines, but the genuinely runtime ones (like CheckAndRaise's exception type) stay on the impl as params, as done here.

From there the plan is to migrate family by family (subtensor, shape, scalar/elemwise, sparse, …), each one a self-contained move of c_code off the op and into a dispatch/ module, until COp has no subclasses left and can be removed. I'd like to add a coverage test over the op inventory so we fail loudly if a family is ever forgotten; otherwise a missing registration just silently falls back to perform.
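One way the coverage test could look, as a rough sketch under stated assumptions: the classes below are toy stand-ins (not the real op inventory), `all_subclasses` and `unregistered_ops` are hypothetical helpers, and the explicit allowlist of Python-only ops is my own addition. It leans on the `dispatch()` method that `functools.singledispatch` functions expose to detect classes that fall through to the default.

```python
from functools import singledispatch


class Op:
    """Stand-in base class for the op inventory."""


class Shape(Op):
    """Stand-in for an op whose family has been migrated."""


class Subtensor(Op):
    """Stand-in for an op whose family was forgotten."""


class PythonOnlyOp(Op):
    """Stand-in for an op that intentionally has no C implementation."""


@singledispatch
def c_funcify(op):
    # Default path: nothing registered for this op type.
    return None


@c_funcify.register(Shape)
def _(op):
    return "shape impl"


def all_subclasses(cls):
    """Collect every direct and indirect subclass of `cls`."""
    found, stack = set(), [cls]
    while stack:
        for sub in stack.pop().__subclasses__():
            if sub not in found:
                found.add(sub)
                stack.append(sub)
    return found


def unregistered_ops(base, dispatcher, python_only=()):
    """List op classes that would silently hit the dispatcher's default."""
    default = dispatcher.dispatch(object)
    missing = [
        cls
        for cls in all_subclasses(base)
        if cls not in python_only and dispatcher.dispatch(cls) is default
    ]
    return sorted(missing, key=lambda cls: cls.__name__)
```

A test would then assert that `unregistered_ops(Op, c_funcify, python_only={PythonOnlyOp})` is empty; in this toy inventory it returns `[Subtensor]` until that class is registered.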