Skip to content

Feature/multi credential failover 0.3.x#2554

Open
fengluodb wants to merge 25 commits into
release/0.3.xfrom
feature/multi-credential-failover-0.3.x
Open

Feature/multi credential failover 0.3.x#2554
fengluodb wants to merge 25 commits into
release/0.3.xfrom
feature/multi-credential-failover-0.3.x

Conversation

@fengluodb

Copy link
Copy Markdown
Collaborator

Description

Related Issue

Type of Change

  • Bug fix (non-breaking change that fixes an issue)
  • New feature (non-breaking change that adds functionality)
  • Breaking change (fix or feature that would cause existing functionality to not work as expected)
  • Documentation update
  • Refactoring (no functional changes)
  • Performance improvement
  • Test update

Changes Made

Testing

  • I have added tests that prove my fix is effective or that my feature works
  • New and existing unit tests pass locally with my changes
  • I have tested this on the following platforms:
    • Linux
    • macOS
    • Windows

Checklist

  • My code follows the project's coding style
  • I have performed a self-review of my code
  • I have commented my code, particularly in hard-to-understand areas
  • I have made corresponding changes to the documentation
  • My changes generate no new warnings
  • Any dependent changes have been merged and published

Screenshots (if applicable)

Additional Notes

chenxiaobin-lang and others added 20 commits June 3, 2026 10:49
Add OrderedCredentialSwitcher for N-credential failover, MultiCredentialVLM and FailoverEmbedder for credential switching, support automatic migration from legacy backup config
Separate the monolithic PERMANENT error class so the switcher reacts to
the actual root cause:

- 400 (request-level parameter error) -> PERMANENT, fail-fast: same
  request fails on every credential of the same model.
- 401/403/unauthorized/accountoverdue -> new AUTH class: credential-level,
  advances to the next credential in multi-credential mode.
- new CONTENT_SAFETY class (moderation rejections) and INPUT_TOO_LARGE ->
  fail-fast: switching credentials cannot help.

classify_api_error now checks CONTENT_SAFETY before PERMANENT so a
moderation message containing "400" is not misclassified. On fail-fast the
failover wrappers re-raise the original exception (preserving type/info)
instead of wrapping it, so callers can react (e.g. truncate on
input_too_large). AllCredentialsFailedError is reserved for chain
exhaustion. The legacy PrimaryBackupSwitcher also switches on AUTH to keep
existing backup behavior.
get_active_index() previously mutated state: when a failback threshold was
met it would decrement the active index. Because observability properties
(active_credential_index / active_credential_id) call it, merely reading the
current credential for logging or metrics could accidentally advance the
failback state machine.

Split the concern: get_active_index() is now a pure read, and a new public
maybe_failback() performs the one-step failback (and logs an info line when
the active credential index changes). The request loops in MultiCredentialVLM
and FailoverEmbedder call maybe_failback() at the top of each attempt, so the
failback behavior is unchanged while pure reads no longer have side effects.
The failover loops exited on `idx >= n OR total_attempts >= total_max_retries`
(default 10). With more than 10 credentials, or when failback churn inflated
the attempt count, this could raise AllCredentialsFailedError before every
credential had actually been tried, leaving lower-priority credentials unused.

Remove total_max_retries entirely from MultiCredentialVLM and FailoverEmbedder:
credential exhaustion is now decided solely by reaching the end of the chain
(idx >= n), and per-credential retries remain the responsibility of each
underlying instance via its own max_retries. The aggregated error tuple now
records the failing credential index instead of the attempt counter.

Adds a regression test covering more than 10 credentials all being tried.
encoding_format, model_path, cache_dir, enable_fusion, res_level and
max_video_frames describe how a model runs, not which credential is used.
All credentials of a single embedding model share the same model, so these
belong on the parent EmbeddingModelConfig, not on each credential.

Keeping them on the credential forced a `cred.X or config.X` merge in
_create_failover_embedder, which silently dropped explicit falsy values
(enable_fusion=False, res_level=0, max_video_frames=0) and fell back to the
parent value.

Remove these six fields from EmbeddingCredential and read them directly from
the parent config when building per-credential embedders. id/provider/model/
api_key/api_base/api_version/ak/sk/region/host/extra_headers remain
credential-level.
FailoverEmbedder.get_dimension() returned a hardcoded 2048 when the first
embedder had no get_dimension(). That path is reached only when wrapping
sparse embedders, which have no fixed dense dimension; returning a fabricated
2048 silently feeds a wrong dimension to callers (e.g. schema creation).

Delegate to the first embedder and raise AttributeError when it has no
get_dimension(), surfacing the misuse instead of hiding it.
Two issues in the cross-instance token usage merge:

1. Encapsulation: FailoverVLM / MultiCredentialVLM / FailoverEmbedder reached
   into other instances' private _token_tracker. Add a public token_tracker
   accessor on VLMBase and use it in the VLM mergers.

2. Double counting in FailoverEmbedder: embedders share a process-wide
   singleton token tracker (_get_token_tracker), so all wrapped embedders point
   at the same object. Merging N identical trackers inflated usage N-fold.
   Return a single instance's usage directly instead of merging.
Splitting 401/403/unauthorized/accountoverdue out of PERMANENT into the new
AUTH class (commit b477b9f) regressed the circuit breaker: it only tripped
immediately on PERMANENT/QUOTA_EXCEEDED, so auth errors no longer opened the
breaker right away.

For a single embedding instance an auth failure (key invalid / no permission /
overdue) is persistent and retrying is pointless, so the breaker should still
trip immediately. Add ERROR_CLASS_AUTH to the immediate-trip set and update the
classification tests to assert the new AUTH class (403 still trips the breaker).
Without setting _last_switch_time, maybe_failback() retreats to idx 0
immediately because the default 0 timestamp is always older than the
600s timeout, so the unavailable last credential never actually gets
exercised.
@github-actions

Copy link
Copy Markdown

PR Reviewer Guide 🔍

Here are some key observations to aid the review process:

⏱️ Estimated effort to review: 4 🔵🔵🔵🔵⚪
🏅 Score: 88
🧪 PR contains tests
🔒 No security concerns identified
✅ No TODO sections
🔀 Multiple PR themes

Sub-PR theme: Add multi-credential failover for embeddings

Relevant files:

  • openviking_cli/utils/config/embedding_config.py
  • openviking/models/embedder/init.py
  • openviking/models/embedder/base.py
  • tests/unit/test_embedding_credential_model_override.py
  • tests/unit/test_failover_embedder_permanent_advance.py

Sub-PR theme: Add multi-credential failover for VLM

Relevant files:

  • openviking_cli/utils/config/vlm_config.py
  • openviking/models/vlm/base.py
  • openviking/models/vlm/init.py
  • tests/unit/test_vlm_failover.py

Sub-PR theme: Add ordered credential switcher and error classification improvements

Relevant files:

  • openviking/utils/model_retry.py
  • openviking/utils/exceptions.py
  • tests/unit/test_ordered_credential_switcher.py
  • tests/unit/test_model_retry.py
  • tests/unit/test_circuit_breaker.py

⚡ No major issues detected

@github-actions

Copy link
Copy Markdown

PR Code Suggestions ✨

Explore these optional code suggestions:

CategorySuggestion                                                                                                                                    Impact
General
Use project-standard get_logger for MultiCredentialVLM

Use the project's standard get_logger utility instead of the raw logging.getLogger
to ensure consistent logging behavior (structured logging, log levels, etc.) across
the codebase.

openviking/models/vlm/base.py [619-622]

 self._vlm_instances = vlm_instances
 self._credential_ids = credential_ids
-self._logger = logging.getLogger(__name__)
+self._logger = get_logger(__name__)
 self._switcher = OrderedCredentialSwitcher(
Suggestion importance[1-10]: 5

__

Why: The suggestion replaces raw logging.getLogger with the project-standard get_logger utility (already imported in the file), ensuring consistent logging behavior across the codebase. This is a valid, moderate-impact code quality improvement.

Low

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

Status: Backlog

Development

Successfully merging this pull request may close these issues.

2 participants