Skip to content

perf: optimize database pools, batch commits, and node health checks#605

Closed
wahh3b-lgtm wants to merge 2 commits into
PasarGuard:mainfrom
wahh3b-lgtm:pr-4-db-node-perf-v2
Closed

perf: optimize database pools, batch commits, and node health checks#605
wahh3b-lgtm wants to merge 2 commits into
PasarGuard:mainfrom
wahh3b-lgtm:pr-4-db-node-perf-v2

Conversation

@wahh3b-lgtm

@wahh3b-lgtm wahh3b-lgtm commented Jun 14, 2026

Copy link
Copy Markdown

Changes

  • DB connection pool (config.py, app/db/base.py): Increased pool_recycle to 1800s, added configurable pool_timeout (default 15s) with ge=1 validation, enabled pool_use_lifo=True
  • Performance indexes (Alembic migration): Added composite indexes via proper migration (a1b2c3d4e5f6) — (status, expire), (status, used_traffic), (node_id, created_at) and others with if_not_exists=True
  • Record usages (app/jobs/record_usages.py): Cached dialect lookup, batched node stats in single transaction with per-row safe_execute fallback that continues on individual failures
  • Node health checks (app/node/__init__.py): Concurrent asyncio.gather() with snapshot revalidation and exception logging for failed probes
  • XRayConfig (app/core/xray.py): Reverted TLS reads to synchronous (run_until_complete on a running loop always raised RuntimeError)
  • .env.example: Updated pool defaults to match new config

Files Changed

  • config.py
  • app/db/base.py
  • app/db/__init__.py
  • app/db/migrations/versions/a1b2c3d4e5f6_add_performance_indexes.py (new)
  • app/jobs/record_usages.py
  • app/node/__init__.py
  • app/core/xray.py
  • .env.example

Verification

  • Ruff lint + format pass
  • 13/13 tests pass
  • No database schema changes beyond additive indexes

Summary by CodeRabbit

  • New Features
    • Added database performance indexes across multiple tables.
    • Added a configurable database connection pool timeout setting (with minimum validation).
  • Chores
    • Updated the default database connection pool recycle setting from 300 to 1800 seconds.
    • Improved database connection pool behavior (including LIFO usage) to better manage connections.
  • Bug Fixes
    • Improved node health checks and usage recording reliability during transient database issues.

- DB pool: increase pool_recycle to 1800s, add pool_timeout (default 15s), enable pool_use_lifo, add ge=1 validation
- Alembic migration: add composite indexes (status/expire, status/used_traffic, node_id/created_at, etc.)
- Record usages: cache dialect lookup, batch node stats in single transaction with per-row safe_execute fallback
- Node health checks: concurrent asyncio.gather with snapshot revalidation and exception logging
- XRayConfig: revert TLS reads to synchronous (run_until_complete fails on running loop)
- Update .env.example with new pool defaults
@coderabbitai

coderabbitai Bot commented Jun 14, 2026

Copy link
Copy Markdown

Review Change Stack

No actionable comments were generated in the recent review. 🎉

ℹ️ Recent review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: 6cb247a3-309b-4a09-8e42-09227ef9033d

📥 Commits

Reviewing files that changed from the base of the PR and between ea36e34 and 7ceb822.

📒 Files selected for processing (1)
  • app/node/__init__.py
🚧 Files skipped from review as they are similar to previous changes (1)
  • app/node/init.py

Walkthrough

This PR tunes database connection pool settings (pool_recycle default raised to 1800, new configurable pool_timeout, pool_use_lifo enabled), adds a migration with performance indexes on several tables, refactors node health lookups to run concurrently, rewrites node stats recording to use a single batched transaction with per-node fallback, caches the DB dialect, and replaces deepcopy with shallow copies in Hysteria config handling.

Changes

Performance & Reliability Improvements

Layer / File(s) Summary
DB pool settings and engine wiring
config.py, .env.example, app/db/base.py, app/db/__init__.py
DatabaseSettings gains a pool_timeout field (default 15) and raises pool_recycle default to 1800; the async engine is wired to database_settings.pool_timeout with pool_use_lifo=True; .env.example reflects the new default; JWT, System, and User are re-exported from app/db/__init__.py.
Alembic migration: performance indexes
app/db/migrations/versions/a1b2c3d4e5f6_add_performance_indexes.py
New migration a1b2c3d4e5f6 creates idempotent non-unique indexes on users, nodes, notification_reminders, hosts, temp_keys, and node_stats tables, with a matching downgrade() to drop them.
Concurrent node health lookup in NodeManager
app/node/__init__.py
Adds _get_nodes_by_health(expected) that snapshots nodes under a reader lock, fetches health concurrently via asyncio.gather(..., return_exceptions=True), skips per-node failures with logging, and validates node identity before returning; public health-query methods delegate to this helper.
Batched node stats recording with dialect cache and fallback
app/jobs/record_usages.py
Adds _dialect_cache so get_dialect() avoids repeated DB lookups; record_node_stats_batched() now aggregates totals into a single upsert_params list, executes one batched engine.begin transaction, and falls back to per-node safe_execute writes on OperationalError/DatabaseError.
Hysteria finalmask shallow copy
app/core/xray.py
Replaces deepcopy with a dict comprehension for finalmask and list() for udpmasks in _hysteria_finalmask_from_stream.

Sequence Diagram(s)

sequenceDiagram
  rect rgba(70, 130, 180, 0.5)
    Note over Scheduler,safe_execute: Batched node stats recording
    Scheduler->>record_node_stats_batched: invoke with node_stats
    record_node_stats_batched->>record_node_stats_batched: aggregate upsert_params per node
    record_node_stats_batched->>engine.begin: single transaction under JOB_SEM
    alt batch success
      engine.begin-->>record_node_stats_batched: committed
    else OperationalError/DatabaseError
      engine.begin-->>record_node_stats_batched: exception raised
      record_node_stats_batched->>safe_execute: per-node fallback write
      safe_execute-->>record_node_stats_batched: done
    end
  end
Loading
sequenceDiagram
  rect rgba(60, 179, 113, 0.5)
    Note over Caller,Node: Concurrent node health lookup
    Caller->>NodeManager: get_healthy_nodes()
    NodeManager->>NodeManager: snapshot nodes under reader lock
    NodeManager->>asyncio.gather: node.get_health() for all nodes concurrently
    asyncio.gather->>Node: get_health()
    Node-->>asyncio.gather: Health or exception
    asyncio.gather-->>NodeManager: results list
    NodeManager->>NodeManager: skip exceptions, re-validate node identity
    NodeManager-->>Caller: filtered node list
  end
Loading

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~25 minutes

Poem

🐇 Hoppity-hop through the query lane,
Indexes planted to ease the strain.
Pools recycled at a longer pace,
Health checks gathered with concurrent grace.
Shallow copies, one batch to rule—
This bunny's panel just got cool! 🌟

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

Check name Status Explanation Resolution
Docstring Coverage ⚠️ Warning Docstring coverage is 29.41% which is insufficient. The required threshold is 80.00%. Write docstrings for the functions missing them to satisfy the coverage threshold.
✅ Passed checks (4 passed)
Check name Status Explanation
Description Check ✅ Passed Check skipped - CodeRabbit’s high-level summary is enabled.
Title check ✅ Passed The pull request title accurately and concisely summarizes the main changes: optimizing database pools, batch commits, and node health checks, which align with all major changes across the codebase.
Linked Issues check ✅ Passed Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check ✅ Passed Check skipped because no linked issues were found for this pull request.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing Touches
🧪 Generate unit tests (beta)
  • Create PR with unit tests

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands and usage tips.

@coderabbitai coderabbitai Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
app/jobs/record_usages.py (1)

515-516: ⚠️ Potential issue | 🔴 Critical | ⚡ Quick win

Python 2 syntax error: Invalid exception handling.

Line 515 uses Python 2 exception syntax except ValueError, TypeError: which is invalid in Python 3 and will cause a SyntaxError when the module is loaded. The correct Python 3 syntax requires parentheses for multiple exception types.

🐛 Proposed fix
-        except ValueError, TypeError:
+        except (ValueError, TypeError):
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@app/jobs/record_usages.py` around lines 515 - 516, The except clause uses
Python 2 syntax with comma-separated exception types which is invalid in Python
3 and causes a SyntaxError. In the exception handling block where
invalid_uids.append(uid) is called, change the except statement from using a
comma between ValueError and TypeError to wrapping both exception types in
parentheses, using the correct Python 3 syntax for handling multiple exception
types.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@app/node/__init__.py`:
- Around line 78-81: The asyncio.gather() call in the get_healthy_nodes()
function lacks a timeout on individual node.get_health() calls, which means a
single hung probe can block the entire operation and stall downstream work. Wrap
each node.get_health() call in the generator expression with
asyncio.wait_for(..., timeout=10) to match the timeout pattern already
implemented in app/jobs/node_checker.py. The existing return_exceptions=True
parameter will automatically catch and handle the timeout exceptions, allowing
timed-out nodes to be skipped gracefully.

---

Outside diff comments:
In `@app/jobs/record_usages.py`:
- Around line 515-516: The except clause uses Python 2 syntax with
comma-separated exception types which is invalid in Python 3 and causes a
SyntaxError. In the exception handling block where invalid_uids.append(uid) is
called, change the except statement from using a comma between ValueError and
TypeError to wrapping both exception types in parentheses, using the correct
Python 3 syntax for handling multiple exception types.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: d494fb77-7c18-4819-b000-f555f9df5079

📥 Commits

Reviewing files that changed from the base of the PR and between 5155eac and ea36e34.

📒 Files selected for processing (8)
  • .env.example
  • app/core/xray.py
  • app/db/__init__.py
  • app/db/base.py
  • app/db/migrations/versions/a1b2c3d4e5f6_add_performance_indexes.py
  • app/jobs/record_usages.py
  • app/node/__init__.py
  • config.py
💤 Files with no reviewable changes (1)
  • app/db/init.py

Comment thread app/node/__init__.py
Co-authored-by: coderabbitai[bot] <136622811+coderabbitai[bot]@users.noreply.github.com>
@M03ED

M03ED commented Jun 14, 2026

Copy link
Copy Markdown
Contributor

_dialect_cache race on first call:
Two coroutines can both see _dialect_cache is None, both enter GetDB(), both set the cache. Not a bug in practice (they'd both set it to the same value), but it can open two connections on startup instead of one.
using engine.begin() can only be done inside safe_execute that handle all errors, i see all of your pull requests targeg main branch, were not gonna accept any pr that target main branch

@M03ED

M03ED commented Jun 14, 2026

Copy link
Copy Markdown
Contributor

idx_users_status and idx_users_status_expire overlap — if you query WHERE status = ? AND expire < ?, only the composite one is needed and the single-column one becomes redundant overhead on writes.
i don't know why temp_keys need index.
database migrations must be generated with help of alembic and indexes should be defined in db models so alembic can detect the changes

@wahh3b-lgtm

Copy link
Copy Markdown
Author

Superseded by #615 targeting dev branch

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants