Performance issue list based on LDBC benchmark

I'm creating this issue to track some deeper performance issues I've noticed in the LDBC SNB SF1 benchmark [summary table](https://github.com/prrao87/graph-benchmark-ldbc?tab=readme-ov-file#high-level-results). lance-graph is ~10x slower than the alternatives on nearly all the queries, which points to more fundamental causes than per-query quirks.

## Why this dataset? 

LDBC SNB SF1 is a reasonably large and complex dataset with several wide tables, as opposed to the [earlier benchmark](https://github.com/prrao87/graph-benchmark) on an artificially generated social network dataset, which had fewer, narrower tables. The issues this benchmark uncovers are therefore more likely to surface in users' real-world workloads, so it's worth investigating them more deeply from a query parsing and planning perspective.

## Summary of issues (high-level)

- [x] The previous version of lance-graph used the `execute` API, which rebuilt the catalog each time it was called (since the dataset was passed as an input parameter). In v0.5.1, lance-graph introduces a new `CypherEngine` API, which caches the catalog, so we can compare the pure query execution times with other engines in the benchmark repo.
- [ ] Currently, there is no short-circuit for existence checks because of a parser limitation in lance-graph. The parser rejects `COUNT(DISTINCT x) > 0`, which blocks the simple boolean returns required by Q20-30 of the LDBC benchmark and forces full counts plus Python post-processing. It also removes the chance to stop early on `LIMIT 1` / `EXISTS`, a pattern that other engines can exploit. This is addressed by #123.

- [ ] There are query optimizer limitations around multi-`MATCH` + `WHERE`. Per #117, many queries must be expressed as a single `MATCH` followed by a `WHERE` clause for filtering to work, which restricts how flexibly users can write their own queries. From a performance perspective, this makes join ordering and alias pruning inefficient, and it risks huge intermediate joins if predicate pushdown or join reordering isn't considered during planning. From a DevEx perspective, the user should be able to write all kinds of patterns (including multi-`MATCH` + `WHERE` queries) without facing a performance penalty.

- [x] No parameter binding / plan cache. Parameters are manually inlined (string replace), so each query is parsed and planned from scratch and can't be cached as a prepared statement. File: `query.py` (`apply_params` + `execute_query`). This would be fixed by #103.

- [x] Self-referencing aliases in a Cypher query pattern don't work, so the most expensive query in the LDBC benchmark (Q30) cannot be run for comparison against other tools. This is addressed by #111.


- [x] LDBC queries typically involve large fan-out traversals. In the DataFusion engine, graph traversal is implemented as joins over edge tables, so if the engine doesn't reorder joins aggressively, high-degree steps (e.g., Comment ↔ Post ↔ Tag) can become expensive quickly.
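
To make the fan-out and predicate-pushdown items concrete, here is a minimal, engine-agnostic sketch in plain Python. The table names, sizes, and the `hash_join` helper are made up for illustration and are not lance-graph or DataFusion APIs. It only shows how the size of the intermediate join result changes depending on whether a filter is applied before or after the join.

```python
# Toy edge tables (hypothetical shapes, not the real LDBC schema).
# Every comment replies to one of 50 posts; every post carries 20 tags.
comment_reply_of_post = [(c, c % 50) for c in range(1_000)]      # (comment, post)
post_has_tag = [(p, t) for p in range(50) for t in range(20)]    # (post, tag)


def hash_join(left, right, left_key, right_key):
    """Plain hash join: build an index on `right`, probe with `left`."""
    index = {}
    for row in right:
        index.setdefault(row[right_key], []).append(row)
    return [l + r for l in left for r in index.get(l[left_key], [])]


TARGET_TAG = 7

# Plan A: join everything first, filter afterwards.
joined = hash_join(comment_reply_of_post, post_has_tag, 1, 0)
plan_a_intermediate = len(joined)
plan_a = [row for row in joined if row[3] == TARGET_TAG]

# Plan B: push the tag filter below the join.
filtered_tags = [row for row in post_has_tag if row[1] == TARGET_TAG]
plan_b = hash_join(comment_reply_of_post, filtered_tags, 1, 0)
plan_b_intermediate = len(plan_b)

print(plan_a_intermediate, plan_b_intermediate)  # 20000 1000
print(sorted(plan_a) == sorted(plan_b))          # True
```

Both plans return the same rows, but Plan A materializes 20x more intermediate rows in this toy setup. Whether lance-graph's plans actually behave like Plan A is exactly what needs to be confirmed by reading the real query plans.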


> [!NOTE]
> Some of the above may not be query planning issues. I've been inspecting the query plans with GPT 5.2 Codex, to the best of my ability, to see what could explain the performance gap. Experienced human eyes are needed to look through the plans and identify where they fall short of the query plans from Kuzu/Ladybug and Neo4j.
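
As a rough illustration of the existence-check item, the sketch below compares the "full count, then compare in Python" workaround with an early-exit check. It is plain Python over a stand-in row source. The `scan` and `is_match` helpers are hypothetical and do not reflect how lance-graph executes queries.

```python
from typing import Iterable, Iterator, List


def scan(rows: Iterable[int], scanned: List[int]) -> Iterator[int]:
    """Yield rows one by one, recording how many were actually read."""
    for row in rows:
        scanned[0] += 1
        yield row


def is_match(row: int) -> bool:
    # Hypothetical predicate standing in for a graph pattern match.
    return row % 10 == 3


ROWS = range(100_000)

# Workaround: compute the full distinct count, then compare in Python.
scanned_count = [0]
distinct_matches = {row for row in scan(ROWS, scanned_count) if is_match(row)}
exists_via_count = len(distinct_matches) > 0

# Short-circuit: stop at the first matching row.
scanned_exists = [0]
exists_short_circuit = any(is_match(row) for row in scan(ROWS, scanned_exists))

print(exists_via_count, scanned_count[0])       # True 100000
print(exists_short_circuit, scanned_exists[0])  # True 4
```

Both approaches give the same boolean answer, but the counting version has to read every row, while the early-exit version stops as soon as one match is found.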

We can expand/contract this list as new issues become apparent, and link them to existing sub-issues in this repo.
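
For reference, the parameter binding / plan cache item can be sketched as follows. This is a toy model, not the lance-graph API: `plan` is a hypothetical stand-in for parse + plan, cached on the exact query text, and the Cypher template is illustrative only. It shows why string-replace inlining defeats any text-keyed plan cache, while bound parameters let the plan be reused.

```python
from functools import lru_cache

PLAN_CALLS = 0  # counts how often the stand-in planner actually runs


@lru_cache(maxsize=128)
def plan(query_text: str) -> tuple:
    """Hypothetical stand-in for parse + plan, keyed on the exact query text."""
    global PLAN_CALLS
    PLAN_CALLS += 1
    return ("plan", query_text)


def run_inlined(template: str, person_id: int) -> tuple:
    # String-replace inlining: each distinct value yields a distinct query
    # text, so the cache never hits.
    return plan(template.replace("$personId", str(person_id)))


def run_bound(template: str, person_id: int) -> tuple:
    # Parameter binding: the text stays constant, so the plan is reused and
    # the value travels separately.
    return plan(template), {"personId": person_id}


TEMPLATE = "MATCH (p:Person) WHERE p.id = $personId RETURN p.firstName"

for pid in (1, 2, 3):
    run_inlined(TEMPLATE, pid)
inlined_plans = PLAN_CALLS

PLAN_CALLS = 0
plan.cache_clear()
for pid in (1, 2, 3):
    run_bound(TEMPLATE, pid)
bound_plans = PLAN_CALLS

print(inlined_plans, bound_plans)  # 3 1
```

With inlining, three parameter values cost three planning passes. With binding, the same three executions share a single plan.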
