Level-2 enrichment: integrate Jelly and build a dataflow/taint analysis layer
PROBLEM
The level-2 analysis enrichment is currently a stub. The original plan targeted
CodeQL, which was never implemented and is unsuitable: it requires an external,
proprietary engine that cannot be embedded in the self-contained cants binary
produced by bun --compile.
We will replace it with Jelly (https://github.com/cs-au-dk/jelly, Aarhus
University). Jelly is a pure-TypeScript static analyzer that computes a call
graph, flow-insensitive (Andersen-style) points-to information, and library
access paths. It is BSD-3-Clause licensed with no native dependencies, so it can
run in-process inside the cants binary. This was verified by a spike: Jelly
builds and runs under Bun, exposes a usable library API (analyzeFiles, Solver,
fragmentState.subsetEdges), and was successfully embedded in a compiled
single-file binary that ran the analysis with no Node runtime and no node_modules.
Jelly provides a call graph and points-to information, but not dataflow or taint.
On top of Jelly we will build a flow-sensitive, relational, summary-based System
Dependence Graph (SDG) and expose taint analysis as a query over it. The design
uses hammock (single-entry, multi-exit) region decomposition with bottom-up
relational summaries, treating Jelly as a fixed call-graph and points-to oracle.
Sources, sinks, sanitizers, and library models are supplied as data (not
hardcoded), and the same relational-summary format represents both
user-specified models and computed function summaries.
This issue covers three pieces of work:
- Jelly integration (level-2 call-graph enrichment) plus the taint/dataflow layer.
- Removal of the dead CodeQL references from code and docs.
- Addition of the missing Apache-2.0 LICENSE file.
PART 1 - JELLY INTEGRATION AND DATAFLOW LAYER
Substrate decisions (locked):
- Depend on @cs-au-dk/jelly as a package and deep-import its compiled
lib/.
Do not vendor its source. The published package ships lib/ with no
exports restriction, so deep imports are supported.
- Treat Jelly as a frozen oracle: read its solved state (call graph, points-to,
access paths); never modify its solver.
- Build a flow-sensitive, relational summary layer on top. This is
flow-sensitive on value flow and locals, and flow-insensitive on the heap
(heap precision is capped by Jelly's Andersen solve).
Steps:
- Add @cs-au-dk/jelly and @babel/preset-typescript (dev) as dependencies.
- Check in a
bun patch for the one bun --compile blocker (see CAVEATS).
- Wire analyzeFiles/Solver behind a flag; run one whole-program solve.
- Build the identity-mapping layer: map Jelly's location-based and
access-path-based node identities to our canonical signature keys (the keys
produced by computeSignatureForDecl and used by symbol_table / call_graph).
- Merge Jelly's call graph into the existing call_graph with provenance
"jelly", reconciled against the level-1 tsc-resolver edges.
- Record global and module-level state (module bindings, singletons, exported
mutables) and their read/write sites; these become summary inputs/outputs.
- Condense the call graph into strongly connected components (Tarjan).
- Build an exceptional control-flow graph per function (edges for return,
throw, break, continue, yield, await), then decompose each CFG into hammock
regions: single-entry, multi-exit.
- Process regions bottom-up (innermost first). For each region build a local
relational dataflow over access-path vertices and summarize it as a set of
labeled edges (source vertex -> destination vertex) indexed by exit, plus a
global read/write footprint.
- Collapse each region into a summary node in the enclosing CFG, splicing each
exceptional exit to the nearest enclosing handler or propagating it outward.
- Compose function summaries bottom-up over the SCC-condensation DAG. At each
call site, apply the callee summary by binding formals to actuals via
access-path rewriting and splicing exits.
- Within each SCC (mutual recursion), iterate summary computation to a
monotone fixpoint; summaries of recursive members are co-defined.
- Represent external/library behavior as model summaries in the same
relational format. Unmodeled externals default to a conservative
pass-through (all arguments and reachable heap flow to the return value and
to external state).
- Merge function summaries, region summaries, global-state effects, and model
summaries into one System Dependence Graph keyed by canonical signatures.
- Expose taint analysis as a query over the SDG: seed at sources, propagate
labeled reachability along dependence edges, block at sanitizers that lie on
the path (flow-sensitive), and report a flow when a source-kind label
reaches a matching sink. Reconstruct witness paths lazily over the reverse
value-flow edges.
- Make sources, sinks, sanitizers, and pass-through models configurable as
data via CLI flags and a JSON spec file (with stdin support), validated
against a JSON Schema. Precedence: built-in pack, then config file, then
inline flags. Report the matching model id per flow for explainability.
- Emit a new taint_flows section in analysis.json (source, sink, rule,
sanitized flag, and path). Co-evolve the matching python-sdk Pydantic models.
- After an initial call-graph-only MVP, upgrade propagation to be points-to
backed (alias-aware) over Jelly's constraint graph.
PART 2 - REMOVE CODEQL REFERENCES
The CodeQL path was only ever a stub. Retarget all references to Jelly.
- Rename src/semantic_analysis/codeql/ to jelly/ and buildCodeqlCallGraph to
buildJellyCallGraph.
- Update src/core.ts (level-2 enrich call and comments).
- Update src/cli.ts (--analysis-level help text: "2 = + CodeQL enrichment").
- Update src/options/options.ts (level-2 doc comment).
- Update comments in src/semantic_analysis/callGraph.ts and index.ts.
- Update README.md (--help block, "Deeper analysis" example, level-2 prose).
PART 3 - ADD LICENSE
The LICENSE file is missing, though README links to it and both package.json and
pyproject.toml declare Apache-2.0.
- Add a verbatim Apache-2.0 LICENSE file (matching the codeanalyzer-python and
codeanalyzer-java siblings).
- Add a NOTICE / third-party-license entry crediting @cs-au-dk/jelly
(BSD-3-Clause) for attribution in the compiled binary.
CAVEATS AND KNOWN RISKS
-
bun --compile and Babel: Jelly's parser passes Babel transform plugins by
string name, which the bundler cannot statically resolve, so a naive compiled
binary fails at parse time with "Cannot find module". Fix via a checked-in
bun patch that passes imported plugin objects instead of string ids; also
add @babel/preset-typescript (a dead-branch require inside @babel/core) as a
dependency. A CJS/ESM interop quirk (the default export arriving as
default.default under --compile) is handled in our own glue, not Jelly's.
Upstreaming the plugin fix is worthwhile: Jelly's own standalone binary build
likely has the same latent bug, and a merge would let us drop the patch.
-
Deep-import stability: we depend on Jelly's internal module paths (e.g.
lib/analysis/solver.js). If upstream later adds an exports map, these break.
Track this and request a stable, supported entrypoint upstream.
-
Two parsing foundations: level-1 uses the tsc (ts-morph) type-aware resolver;
Jelly parses with Babel (type-stripping, not type-aware). The call graphs
must be reconciled in the merge step.
-
Identity model mismatch: Jelly node identities are location-based and
access-path-based; our call_graph and symbol_table are keyed by canonical
signatures. The identity-mapping layer is the required bridge and is on the
critical path for both call-graph merge and taint.
-
Termination: the relational summaries use access-path vertices, so the
interprocedural fixpoint does not terminate without bounding access-path
depth. k-limiting is mandatory (CLI knob, e.g. --taint-field-depth, default
3); taint-kind sets are bounded bitsets.
-
Precision: the analysis is intentionally over-approximate (sound-leaning,
false-positive prone). Precision is recovered downstream by ranking/pruning;
the engine should not trade soundness for a lower false-positive rate.
-
Inherited unsoundness: dynamic eval, reflection, and monkey-patching are not
modeled (Jelly limitation), and npm type declarations are not reliable enough
to trust for soundness.
-
Incrementality: aspirational, not in the initial scope. Jelly runs
whole-program from scratch each time, which is the main obstacle. Record
summary-to-fact dependency edges and content-hash summaries from the start so
incremental re-analysis can be switched on later without a rewrite.
-
Binary size: embedding Jelly (with Babel and TypeScript) produced a roughly
78 MB compiled binary in the spike, up from the current cants size. Confirm
this is acceptable.
SUGGESTED STAGING
PR A LICENSE plus CodeQL removal (independent, no Jelly dependency).
PR B Jelly dependency, bun patch, identity mapping, call-graph merge; CI
proves the compiled binary runs Jelly in-process.
PR C MVP taint: relational summaries, hammock regions, SCC fixpoint with
k-limiting; demonstrate one source-to-sink flow with a correctly applied
sanitizer.
PR D Configurable models: CLI, JSON Schema validation, default model pack.
PR E taint_flows output and lazy witness reconstruction.
PR F Points-to-backed (alias-aware) propagation.
PR G Incremental re-analysis (switch on the recorded dependency edges).
Context: this supersedes the original CodeQL-based level-2 plan. The
single-binary goal is the driver; Jelly is the only credible engine that runs in
pure JavaScript and can live inside the cants binary.
Level-2 enrichment: integrate Jelly and build a dataflow/taint analysis layer
PROBLEM
The level-2 analysis enrichment is currently a stub. The original plan targeted
CodeQL, which was never implemented and is unsuitable: it requires an external,
proprietary engine that cannot be embedded in the self-contained
cantsbinaryproduced by
bun --compile.We will replace it with Jelly (https://github.com/cs-au-dk/jelly, Aarhus
University). Jelly is a pure-TypeScript static analyzer that computes a call
graph, flow-insensitive (Andersen-style) points-to information, and library
access paths. It is BSD-3-Clause licensed with no native dependencies, so it can
run in-process inside the
cantsbinary. This was verified by a spike: Jellybuilds and runs under Bun, exposes a usable library API (analyzeFiles, Solver,
fragmentState.subsetEdges), and was successfully embedded in a compiled
single-file binary that ran the analysis with no Node runtime and no node_modules.
Jelly provides a call graph and points-to information, but not dataflow or taint.
On top of Jelly we will build a flow-sensitive, relational, summary-based System
Dependence Graph (SDG) and expose taint analysis as a query over it. The design
uses hammock (single-entry, multi-exit) region decomposition with bottom-up
relational summaries, treating Jelly as a fixed call-graph and points-to oracle.
Sources, sinks, sanitizers, and library models are supplied as data (not
hardcoded), and the same relational-summary format represents both
user-specified models and computed function summaries.
This issue covers three pieces of work:
PART 1 - JELLY INTEGRATION AND DATAFLOW LAYER
Substrate decisions (locked):
lib/.Do not vendor its source. The published package ships
lib/with noexportsrestriction, so deep imports are supported.access paths); never modify its solver.
flow-sensitive on value flow and locals, and flow-insensitive on the heap
(heap precision is capped by Jelly's Andersen solve).
Steps:
bun patchfor the onebun --compileblocker (see CAVEATS).access-path-based node identities to our canonical signature keys (the keys
produced by computeSignatureForDecl and used by symbol_table / call_graph).
"jelly", reconciled against the level-1 tsc-resolver edges.
mutables) and their read/write sites; these become summary inputs/outputs.
throw, break, continue, yield, await), then decompose each CFG into hammock
regions: single-entry, multi-exit.
relational dataflow over access-path vertices and summarize it as a set of
labeled edges (source vertex -> destination vertex) indexed by exit, plus a
global read/write footprint.
exceptional exit to the nearest enclosing handler or propagating it outward.
call site, apply the callee summary by binding formals to actuals via
access-path rewriting and splicing exits.
monotone fixpoint; summaries of recursive members are co-defined.
relational format. Unmodeled externals default to a conservative
pass-through (all arguments and reachable heap flow to the return value and
to external state).
summaries into one System Dependence Graph keyed by canonical signatures.
labeled reachability along dependence edges, block at sanitizers that lie on
the path (flow-sensitive), and report a flow when a source-kind label
reaches a matching sink. Reconstruct witness paths lazily over the reverse
value-flow edges.
data via CLI flags and a JSON spec file (with stdin support), validated
against a JSON Schema. Precedence: built-in pack, then config file, then
inline flags. Report the matching model id per flow for explainability.
sanitized flag, and path). Co-evolve the matching python-sdk Pydantic models.
backed (alias-aware) over Jelly's constraint graph.
PART 2 - REMOVE CODEQL REFERENCES
The CodeQL path was only ever a stub. Retarget all references to Jelly.
buildJellyCallGraph.
PART 3 - ADD LICENSE
The LICENSE file is missing, though README links to it and both package.json and
pyproject.toml declare Apache-2.0.
codeanalyzer-java siblings).
(BSD-3-Clause) for attribution in the compiled binary.
CAVEATS AND KNOWN RISKS
bun --compile and Babel: Jelly's parser passes Babel transform plugins by
string name, which the bundler cannot statically resolve, so a naive compiled
binary fails at parse time with "Cannot find module". Fix via a checked-in
bun patchthat passes imported plugin objects instead of string ids; alsoadd @babel/preset-typescript (a dead-branch require inside @babel/core) as a
dependency. A CJS/ESM interop quirk (the default export arriving as
default.default under --compile) is handled in our own glue, not Jelly's.
Upstreaming the plugin fix is worthwhile: Jelly's own standalone binary build
likely has the same latent bug, and a merge would let us drop the patch.
Deep-import stability: we depend on Jelly's internal module paths (e.g.
lib/analysis/solver.js). If upstream later adds an exports map, these break.
Track this and request a stable, supported entrypoint upstream.
Two parsing foundations: level-1 uses the tsc (ts-morph) type-aware resolver;
Jelly parses with Babel (type-stripping, not type-aware). The call graphs
must be reconciled in the merge step.
Identity model mismatch: Jelly node identities are location-based and
access-path-based; our call_graph and symbol_table are keyed by canonical
signatures. The identity-mapping layer is the required bridge and is on the
critical path for both call-graph merge and taint.
Termination: the relational summaries use access-path vertices, so the
interprocedural fixpoint does not terminate without bounding access-path
depth. k-limiting is mandatory (CLI knob, e.g. --taint-field-depth, default
3); taint-kind sets are bounded bitsets.
Precision: the analysis is intentionally over-approximate (sound-leaning,
false-positive prone). Precision is recovered downstream by ranking/pruning;
the engine should not trade soundness for a lower false-positive rate.
Inherited unsoundness: dynamic eval, reflection, and monkey-patching are not
modeled (Jelly limitation), and npm type declarations are not reliable enough
to trust for soundness.
Incrementality: aspirational, not in the initial scope. Jelly runs
whole-program from scratch each time, which is the main obstacle. Record
summary-to-fact dependency edges and content-hash summaries from the start so
incremental re-analysis can be switched on later without a rewrite.
Binary size: embedding Jelly (with Babel and TypeScript) produced a roughly
78 MB compiled binary in the spike, up from the current cants size. Confirm
this is acceptable.
SUGGESTED STAGING
PR A LICENSE plus CodeQL removal (independent, no Jelly dependency).
PR B Jelly dependency, bun patch, identity mapping, call-graph merge; CI
proves the compiled binary runs Jelly in-process.
PR C MVP taint: relational summaries, hammock regions, SCC fixpoint with
k-limiting; demonstrate one source-to-sink flow with a correctly applied
sanitizer.
PR D Configurable models: CLI, JSON Schema validation, default model pack.
PR E taint_flows output and lazy witness reconstruction.
PR F Points-to-backed (alias-aware) propagation.
PR G Incremental re-analysis (switch on the recorded dependency edges).
Context: this supersedes the original CodeQL-based level-2 plan. The
single-binary goal is the driver; Jelly is the only credible engine that runs in
pure JavaScript and can live inside the cants binary.