Performance issue list based on LDBC benchmark #122

@prrao87

I'm creating this issue to uncover some deeper performance issues I've noticed in the LDBC SNB-SF1 benchmark summary table - it's clear that there are more fundamental reasons why lance-graph is ~10x slower than other alternatives for nearly all the queries.

Why this dataset?

LDBC SNB SF1 is a reasonably large and complex dataset with several wide tables as opposed to the earlier benchmark on an artificially generated social network dataset, which had fewer, narrower tables. This means that we're uncovering issues with the LDBC benchmark that are more likely to surface in users' real-world workloads, so it's worth investigating deeper from a query parsing and planning perspective.

Summary of issues (high-level)

  • The previous version of lance-graph used the execute API, which rebuilt the catalog each time it was called (since the dataset was passed as an input parameter). In v0.5.1, lance-graph introduces a new CypherEngine API, which caches the catalog, so we can compare the pure query execution times with other engines in the benchmark repo.

  • There is currently no short‑circuit for existence checks because of a parser limitation in lance-graph. The parser rejects COUNT(DISTINCT x) > 0, which blocks the simple boolean returns required by Q20-30 of the LDBC benchmark and forces full counts plus Python post‑processing. That removes the chance to stop early via LIMIT 1 / EXISTS, a pattern that other engines can exploit. This is addressed by Parser rejects boolean comparison after COUNT(DISTINCT ...) during projection #123.

  • There are query optimizer limitations around multi‑MATCH + WHERE. Per Multi-match query with WHERE predicate downstream filter/projection fails #117, many queries must be expressed as a single MATCH followed by the WHERE clause in order to filter at all, which restricts how users can write their own queries. From a performance perspective, it makes join‑order/alias‑pruning inefficient and risks huge intermediate joins if predicate pushdown or join reordering isn't considered during planning. In general, from a DevEx perspective, the user should be able to write all kinds of patterns (including multi-match + WHERE queries) without facing a performance penalty.

  • No parameter binding / plan cache. Parameters are manually inlined (string replace), so each query is parsed and planned from scratch and can’t be cached as a prepared statement. File: query.py (apply_params + execute_query). This would be fixed by feat: Properly support parameter placeholders in Python/Rust #103.

  • A self-referencing alias in a Cypher query pattern doesn't work, meaning that the most expensive query in the LDBC benchmark (Q30) cannot be run in lance-graph and compared against other tools. This is addressed by Self-pointing Cypher pattern causes lance-graph to hang indefinitely for high cardinality datasets #111.

  • LDBC queries typically involve large fan‑out traversals, whose cost grows quickly. In the DataFusion engine, graph traversal is implemented as joins over edge tables, so if the engine doesn’t reorder joins aggressively, high‑degree steps (e.g., Comment ↔ Post ↔ Tag) can become expensive.
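To make the existence-check point concrete, here is a minimal pure-Python sketch of the difference between "count everything, then compare with 0" and stopping at the first match. It does not use lance-graph at all: `scan`, `exists_via_count`, and `exists_short_circuit` are hypothetical names, and the row source is just an iterable standing in for whatever the engine would scan.

```python
from typing import Hashable, Iterable, Iterator, List


def scan(rows: Iterable[Hashable], touched: List[int]) -> Iterator[Hashable]:
    """Yield rows one by one, recording how many were actually read."""
    for row in rows:
        touched[0] += 1
        yield row


def exists_via_count(rows: Iterable[Hashable], touched: List[int]) -> bool:
    """Emulates COUNT(DISTINCT x) followed by a `> 0` check in client code:
    every row is read and deduplicated before the answer is known."""
    distinct = set(scan(rows, touched))
    return len(distinct) > 0


def exists_short_circuit(rows: Iterable[Hashable], touched: List[int]) -> bool:
    """Emulates LIMIT 1 / EXISTS: stop as soon as one row is seen."""
    return any(True for _ in scan(rows, touched))


def rows_touched(strategy, n_rows: int) -> int:
    """Run a strategy over n_rows synthetic rows and report rows read."""
    touched = [0]
    strategy(range(n_rows), touched)
    return touched[0]
```

With this sketch, `rows_touched(exists_via_count, 100_000)` reads all 100,000 rows, while `rows_touched(exists_short_circuit, 100_000)` reads one. Whether a real engine can stop that early depends on its plan, so treat this only as an illustration of the work that a boolean return could avoid.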

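The parameter-binding bullet in the list above can be sketched the same way. This is an illustration under stated assumptions, not lance-graph's API: `plan` is a hypothetical stand-in for parse + plan, `run_inlined` mimics the string-replace approach described for query.py, and `run_bound` shows what a plan cache keyed on the parameterized query text would allow.

```python
from functools import lru_cache
from typing import Callable, Dict, List, Tuple

PLAN_CALLS = {"count": 0}


def plan(query_text: str) -> Tuple[str, str]:
    """Hypothetical stand-in for parsing and planning; counts invocations."""
    PLAN_CALLS["count"] += 1
    return ("plan", query_text)


@lru_cache(maxsize=128)
def cached_plan(query_text: str) -> Tuple[str, str]:
    """Plan cache keyed on the exact query text."""
    return plan(query_text)


def run_inlined(template: str, params: Dict[str, object]) -> Tuple[str, str]:
    """String-replace inlining: each distinct value produces a new query
    text, so a text-keyed cache misses on every new value.
    (Naive replace also mishandles names that prefix each other.)"""
    text = template
    for name, value in params.items():
        text = text.replace(f"${name}", repr(value))
    return cached_plan(text)


def run_bound(template: str, params: Dict[str, object]):
    """Bound parameters: the plan depends only on the template, and the
    values would be supplied at execution time."""
    return cached_plan(template), params


def count_plans(runner: Callable, template: str, values: List[int]) -> int:
    """Return how many times planning ran for the given parameter values."""
    cached_plan.cache_clear()
    PLAN_CALLS["count"] = 0
    for v in values:
        runner(template, {"person_id": v})
    return PLAN_CALLS["count"]
```

For a template such as `MATCH (p:Person) WHERE p.id = $person_id RETURN p.firstName` run with five different ids, `count_plans(run_inlined, ...)` plans five times and `count_plans(run_bound, ...)` plans once.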
Note

Some of the above may not be query planning issues, but I've been explaining and inspecting the query plans with GPT 5.2 Codex to the best of my ability to see what could explain the performance gap. Experienced human eyes are needed to look through the plans and identify where they fall short of the corresponding query plans from Kuzu/Ladybug and Neo4j.

We can expand/contract this list as new issues become apparent, and link them to existing sub-issues in this repo.
