
perf: batch community cohesion to stop hangs on large repos#183

Merged
tirth8205 merged 1 commit into tirth8205:main from realkotob:perf/batch-community-cohesion
Apr 11, 2026

Conversation


@realkotob realkotob commented Apr 9, 2026

The problem

Running code-review-graph build on a large source tree (~9k files, ~100k edges, ~93k nodes) on my MacBook M1 Pro effectively hangs during post-processing. The parse phase finishes fine, but community detection never returns and Python is pinned at 100% CPU:

INFO: Progress: 9234/9234 files parsed
INFO: FTS index rebuilt: 93541 rows indexed
INFO: igraph not available, using file-based community detection
    ... <never returns, 100% CPU> ...

A sampled stack trace points inside _detect_file_based.

Why it happens

_compute_cohesion walks the entire edge list to count internal/external edges for one community. It's called once per community, from three places:

  • _detect_file_based — once per file community ← the killer
  • _detect_leiden — once per Leiden cluster
  • _detect_leiden_sub — once per sub-cluster

The resulting work is O(edges × communities). For the file-based fallback on a large repo that's ~9k × ~100k ≈ ~10⁹ pure-Python comparisons — the hang I hit.

The Leiden path has the same shape but with far fewer clusters, so it was merely slow rather than catastrophic. It's still a real bug, just less dramatic.

The fix

Compute cohesion for all communities in a single O(edges) pass.

New helper _compute_cohesion_batch(community_member_qns, all_edges) builds a qualified_name → community_index reverse map once — each node appears in at most one community since all three detection functions produce partitions — then walks every edge exactly once and buckets it into internal/external counters per community.

Total work drops from O(edges × communities) to O(edges + Σ|members|). For the big repo, community detection now completes in seconds instead of hanging.

All three detection functions (_detect_file_based, _detect_leiden, _detect_leiden_sub) are refactored to:

  1. Collect member sets for all qualifying communities first.
  2. Call _compute_cohesion_batch once.
  3. zip the cohesions back onto the community metadata.

_compute_cohesion is kept as a thin single-community wrapper around the batch helper, so existing callers and the direct unit tests (which pin exact cohesion values like 0.5 and 1.0) keep working unchanged.
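The batch pass described above can be sketched as follows. This is a minimal illustration, not the project's actual code: the function and variable names, the `(src, dst)` edge shape, and the cohesion formula (internal / (internal + external)) are assumptions inferred from the description and the pinned test values (0.5, 1.0, 0.6667).

```python
def compute_cohesion_batch(community_member_qns, all_edges):
    """One O(edges) pass over the graph instead of one pass per community."""
    # Reverse map: qualified_name -> community index. Valid because all
    # three detection functions produce partitions, so each node belongs
    # to at most one community.
    qn_to_comm = {}
    for i, members in enumerate(community_member_qns):
        for qn in members:
            qn_to_comm[qn] = i

    internal = [0] * len(community_member_qns)
    external = [0] * len(community_member_qns)

    # Walk every edge exactly once and bucket it per community.
    for src, dst in all_edges:
        cs = qn_to_comm.get(src)
        cd = qn_to_comm.get(dst)
        if cs is not None and cs == cd:
            internal[cs] += 1          # both endpoints inside one community
        else:
            if cs is not None:
                external[cs] += 1      # edge leaves community cs
            if cd is not None and cd != cs:
                external[cd] += 1      # edge enters community cd

    # Assumed formula: internal / (internal + external), 0.0 for isolated sets.
    return [i / (i + e) if (i + e) else 0.0
            for i, e in zip(internal, external)]
```

Each detection function would then `zip` the returned list back onto its community metadata, in the same order the member sets were collected, which is why the integration test pins exact positions.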

Test coverage

Three layers of regression protection, all added in this PR:

| Test | Level | Catches |
|------|-------|---------|
| `test_compute_cohesion_batch_matches_single` | Unit | Batch-vs-single parity on a mixed-edge fixture |
| `test_detect_file_based_integration` | Integration (deterministic) | Pins exact member sets and cohesion values on a hand-built fixture with asymmetric cohesions (1.0 vs 0.6667) — catches zip misalignment or a wrong `member_qns` list passed to the batch helper |
| `test_detected_cohesions_match_direct_computation` | End-to-end (algorithm-agnostic) | Runs `detect_communities` on the seeded fixture (with an extra edge added to break cohesion symmetry) and asserts every stored cohesion equals what `_compute_cohesion` produces directly on the same member set |

All three were validated with a mutation test: results.reverse() was injected into _compute_cohesion_batch, all three tests failed with clear, specific error messages, and the mutation was then reverted. Without the fixture asymmetry, test_detected_cohesions_match_direct_computation alone wouldn't catch a simple swap, so the test explicitly asserts that at least two distinct cohesion values are seen; a future fixture regression that accidentally re-symmetrizes the fixture will fail loudly.

Test plan

  • uv run pytest tests/test_communities.py — 23 passed, 1 skipped (igraph-only test)
  • uv run pytest tests/ — 624 passed, 1 skipped, 2 xpassed
  • uv run ruff check code_review_graph/ tests/test_communities.py — clean
  • uv run mypy code_review_graph/communities.py --ignore-missing-imports --no-strict-optional — clean
  • Mutation test: injected results.reverse() into _compute_cohesion_batch, confirmed all three new tests fail loudly with specific error messages, then reverted.

Out of scope

  • No changes to the Leiden algorithm, edge weights (EDGE_WEIGHTS), or edge-kind handling.
  • No change to the cohesion field's serialization or rounding.
  • No change to the public _compute_cohesion signature.
  • A separate issue I observed during end-to-end verification — g.community_leiden(...) itself can be very slow on graphs this size, likely because it's called without an n_iterations cap. That's a distinct problem and will be addressed in a follow-up PR.

Community detection was effectively hanging on the Godot source
(~9k files, ~100k edges, ~93k nodes) during the post-parse phase. The
parser finished fine, but `_detect_file_based` never returned, with
Python pinned at 100% CPU.

The cause was a quadratic pattern in `_compute_cohesion`: for each
community, it walked every edge in the graph once to count
internal/external edges. That's O(edges) per community, called once
per community, so the whole step was O(edges × communities).

- `_detect_file_based` creates one community per file. On Godot that's
  ~9k × ~100k ≈ ~10⁹ pure-Python comparisons — the hang I hit.
- `_detect_leiden` and `_detect_leiden_sub` have the same shape, just
  with fewer clusters, so they were merely slow rather than
  catastrophic.

Fix: compute cohesion for all communities in a single O(edges) pass
via a new `_compute_cohesion_batch` helper. It builds a
`qualified_name → community_index` reverse map once (since all three
detection functions produce partitions, each node belongs to at most
one community), then walks every edge exactly once and buckets it
into internal/external counters per community. Total work drops from
O(edges × communities) to O(edges + Σ|members|).

Both `_detect_leiden` and `_detect_leiden_sub` are refactored to
collect all cluster member sets first and then call
`_compute_cohesion_batch` once, the same pattern as the file-based
path.

`_compute_cohesion` is kept as a thin wrapper around the batch helper
so existing callers and the direct unit tests (which pin exact
values like 0.5 and 1.0) keep working unchanged.

Test coverage:

- Unit: 3 new tests verify `_compute_cohesion_batch` matches
  `_compute_cohesion` on a mixed-edge fixture, handles empty
  community lists, and handles edge-less input.
- Integration (deterministic): `test_detect_file_based_integration`
  builds an asymmetric fixture (one community with cohesion 1.0,
  another with cohesion 0.6667) and pins exact member sets and
  cohesion values — catches zip misalignment or wrong `member_qns`
  passed to the batch helper.
- End-to-end (algorithm-agnostic):
  `test_detected_cohesions_match_direct_computation` runs
  `detect_communities` on the seeded two-cluster fixture (with an
  extra edge added to break cohesion symmetry) and asserts that
  every stored community cohesion equals what `_compute_cohesion`
  produces when called directly on that community's member set.
  Works whether Leiden (igraph) or the file-based fallback is active.

All three new tests were validated with a mutation test:
`results.reverse()` was injected into `_compute_cohesion_batch`, all
three failed with clear, specific error messages, and the mutation
was then reverted.

Full suite: 624 passed, 1 skipped, 2 xpassed. Ruff and mypy clean.
@tirth8205 tirth8205 merged commit 84005ba into tirth8205:main Apr 11, 2026
1 check passed
dg264 added a commit to dg264/code-review-graph that referenced this pull request Apr 11, 2026
…, batch store_communities

Narrowed scope per review feedback — cohesion batch fix and build.py
changes are handled by tirth8205#183 and tirth8205#184 respectively. This PR contributes
only the optimizations unique to this branch:

1. **Remove `_detect_leiden_sub`**: The recursive second-pass Leiden on
   every community >50 nodes caused exponential blow-up on large graphs.
   With 3k+ communities, each sub-pass re-scanned all edges and ran
   a full Leiden + cohesion pass. Removed entirely — the first-pass
   partitioning is sufficient.

2. **Cap Leiden iterations** (`n_iterations=2`): The default runs until
   convergence, which can take unbounded time on dense code graphs.
   Two passes produce equivalent partition quality for dependency graphs.

3. **Batch UPDATE in `store_communities`**: Replace per-member
   `UPDATE nodes SET community_id` with a single
   `WHERE qualified_name IN (...)` per community. Fully parameterized
   (nosec B608 is correct).

4. **Progress logging**: Added `logger.info()` at each phase boundary
   (node loading, igraph construction, Leiden execution, cohesion,
   completion) so users can monitor progress on large builds.
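Point 3 above can be sketched as follows. The `nodes` table shape, the function signature, and the commit behavior are assumptions based on the commit message, not the repository's actual code; the point is that the `IN` list is built from `?` placeholders only, so every value still goes through parameter binding.

```python
import sqlite3


def store_communities(conn: sqlite3.Connection,
                      communities: dict[int, list[str]]) -> None:
    """One parameterized UPDATE per community instead of one per member."""
    for community_id, member_qns in communities.items():
        if not member_qns:
            continue  # avoid an invalid empty IN () clause
        placeholders = ",".join("?" * len(member_qns))
        # Only the placeholder list is string-built; all values are bound
        # parameters, which is why the nosec B608 annotation is correct.
        conn.execute(
            f"UPDATE nodes SET community_id = ? "
            f"WHERE qualified_name IN ({placeholders})",
            [community_id, *member_qns],
        )
    conn.commit()
```

This trades N round-trips through the SQL engine per community for one, which is where the ~5 min → ~15 s improvement in the table below would come from.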

## Performance (tested on 132k-file monorepo, M3 Pro)

| Phase                | Before             | After          |
|----------------------|--------------------|----------------|
| Leiden clustering    | >2 hours (hung)    | ~45 seconds    |
| store_communities    | ~5 min             | ~15 seconds    |

Co-Authored-By: Cursor <noreply@cursor.com>
Made-with: Cursor
tirth8205 added a commit that referenced this pull request Apr 11, 2026
…, batch store_communities (#213)

Narrowed scope per review feedback — cohesion batch fix and build.py
changes are handled by #183 and #184 respectively. This PR contributes
only the optimizations unique to this branch:

1. **Remove `_detect_leiden_sub`**: The recursive second-pass Leiden on
   every community >50 nodes caused exponential blow-up on large graphs.
   With 3k+ communities, each sub-pass re-scanned all edges and ran
   a full Leiden + cohesion pass. Removed entirely — the first-pass
   partitioning is sufficient.

2. **Cap Leiden iterations** (`n_iterations=2`): The default runs until
   convergence, which can take unbounded time on dense code graphs.
   Two passes produce equivalent partition quality for dependency graphs.

3. **Batch UPDATE in `store_communities`**: Replace per-member
   `UPDATE nodes SET community_id` with a single
   `WHERE qualified_name IN (...)` per community. Fully parameterized
   (nosec B608 is correct).

4. **Progress logging**: Added `logger.info()` at each phase boundary
   (node loading, igraph construction, Leiden execution, cohesion,
   completion) so users can monitor progress on large builds.

## Performance (tested on 132k-file monorepo, M3 Pro)

| Phase                | Before             | After          |
|----------------------|--------------------|----------------|
| Leiden clustering    | >2 hours (hung)    | ~45 seconds    |
| store_communities    | ~5 min             | ~15 seconds    |


Made-with: Cursor

Co-authored-by: Cursor <noreply@cursor.com>
Co-authored-by: Tirth Kanani <tirthkanani18@gmail.com>
tirth8205 added a commit that referenced this pull request Apr 11, 2026
Unreleased fixes since v2.2.2 that users are complaining about:
- #208 Claude Code hook schema (fixes #97, #138, #163, #168, #172, #182,
  #188, #191, #201) — v2.2.2 generates an invalid hooks schema and
  timeouts in ms instead of seconds; PreCommit is also not a real event.
- #205 SQLite transaction nesting (fixes #110, #135, #181) — implicit
  transactions from the legacy sqlite3 default caused "cannot start a
  transaction within a transaction" on update.
- #166 Go method receivers resolved from field_identifier.
- #170 UTF-8 decode errors in detect_changes (fixes #169).
- #142 --platform target filters (fixes #133).
- #213 / #183 large-repo community detection hangs.
- #220 CI lint + tomllib on Python 3.10.
- #159 missing pytest-cov dev dep.
- #154 JSX component CALLS edges.

Plus features: #177 Codex, #165 Luau (#153), #217 REFERENCES edge,
#215 recurse_submodules, #185 gitignore default (#175), #171 gitignore
docs (#157).

Verified locally on Python 3.11: ruff clean, mypy clean, bandit clean,
691 tests pass, coverage 73.72%.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@tirth8205 tirth8205 mentioned this pull request Apr 11, 2026
tirth8205 added a commit that referenced this pull request Apr 11, 2026
Unreleased fixes since v2.2.2 that users are complaining about:
- #208 Claude Code hook schema (fixes #97, #138, #163, #168, #172, #182,
  #188, #191, #201) — v2.2.2 generates an invalid hooks schema and
  timeouts in ms instead of seconds; PreCommit is also not a real event.
- #205 SQLite transaction nesting (fixes #110, #135, #181) — implicit
  transactions from the legacy sqlite3 default caused "cannot start a
  transaction within a transaction" on update.
- #166 Go method receivers resolved from field_identifier.
- #170 UTF-8 decode errors in detect_changes (fixes #169).
- #142 --platform target filters (fixes #133).
- #213 / #183 large-repo community detection hangs.
- #220 CI lint + tomllib on Python 3.10.
- #159 missing pytest-cov dev dep.
- #154 JSX component CALLS edges.

Plus features: #177 Codex, #165 Luau (#153), #217 REFERENCES edge,
#215 recurse_submodules, #185 gitignore default (#175), #171 gitignore
docs (#157).

Verified locally on Python 3.11: ruff clean, mypy clean, bandit clean,
691 tests pass, coverage 73.72%.

Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
