One command. Every AI model you have. Automatically assembled into the best team for each task.
```bash
nvh "What is a binary search tree?"             # → answers (single best advisor)
nvh "Fix the timeout bug in council.py"         # → auto-detects coding task → agent mode
nvh "Should we use Redis or Postgres?"          # → auto-detects debate → council (3+ advisors)
nvh "take a screenshot and describe my desktop" # → desktop agent (vision + tools)
nvh "setup comfyui"                             # → agent installs, configures, launches
```

```bash
pip install nvhive
nvh                  # first-run setup auto-detects GPU, installs local AI, configures providers
nvh "your question"  # just ask — nvhive figures out the rest

# With desktop agent (vision + mouse/keyboard control)
pip install "nvhive[vision]"   # screenshot, click, type, scroll
pip install "nvhive[browser]"  # headless browser automation (playwright)
pip install "nvhive[all]"      # everything
```

On first run, nvh automatically launches a guided 3-step setup:
Works immediately with local models (no signup needed). Every step is skippable — press Enter to skip. Run nvh setup or type setup in the REPL anytime to reconfigure.
GPU tier → model recommendations:
| VRAM | Text Model | Vision Model | Behavior |
|---|---|---|---|
| 0 GB (no GPU) | Cloud only | Cloud fallback | Free tiers first (Groq, LLM7, GitHub) |
| 4 GB | Cloud fallback | moondream (2 GB) | Vision-only local, text via cloud |
| 8 GB | nemotron-mini (4 GB) | moondream (2 GB) | Basic local + desktop agent |
| 12 GB | qwen2.5-coder:7b (5 GB) | minicpm-v (5 GB) | Coding + vision local |
| 16 GB | qwen2.5-coder:7b (5 GB) | minicpm-v (5 GB) | Coding + vision local |
| 24 GB | gemma2:27b (16 GB) | llama3.2-vision (7 GB) | Strong text + best vision |
| 48 GB | llama3.3:70b (40 GB) | llama3.2-vision (7 GB) | Full power local |
| 96+ GB | llama3.3:70b + qwen2.5-coder:32b | llama3.2-vision | Multi-model local |
| 128+ GB | 3 text models | llama3.2-vision | Full local council, $0 |
Setup auto-detects your VRAM and recommends models that fit concurrently. No root/sudo needed — Ollama installs to ~/.nvh/.
NVIDIA GPU Quick Start — local inference + desktop agent on your hardware
```bash
# Install nvHive with desktop agent support — no root needed
pip install "nvhive[vision]"
nvh   # setup auto-detects GPU, installs Ollama, pulls models, configures everything
```

That's it. Setup auto-installs Ollama (no root/sudo), downloads the right models for your GPU, and enables the desktop agent. After setup:
- Simple queries route to your GPU automatically — $0, private, nothing leaves your machine
- Complex queries escalate to cloud only when local models aren't confident enough
- Desktop agent takes screenshots, controls mouse/keyboard, and installs software
- `nvh "take a screenshot"` → vision model analyzes your screen
- `nvh "setup comfyui"` → agent clones, installs, and launches the app
- The adaptive routing engine measures quality over time and adjusts automatically
Works on Ubuntu, Windows, macOS. No root required — installs to ~/.nvh/.
Council scored 68% higher than a single model — at $0 cost. Three free providers (Ollama + Groq + Google) running in parallel outperformed a single Nemotron Super on accuracy, completeness, and coherence. Real benchmark on NVIDIA DGX Spark. See full results below.
You type one command. nvHive figures out the rest. It detects what you're asking for, checks which advisors are healthy, and assembles the best team for the task — automatically. More advisors connected = smarter behavior, with zero configuration.
What makes it different:
- Smart team assembly. nvHive doesn't just route to one model — it generates expert agents based on your question and matches each one to the best LLM for their specialty. A "Security Engineer" agent gets routed to a provider strong at security tasks. A "Database Expert" gets one suited to database queries. Matching uses adaptive routing data once enough queries have been collected, with curated defaults for new installations.
- Automatic orchestration. Coding tasks get a planner + coder + reviewer. Complex questions get a council of specialists. Simple questions get the fastest advisor. All automatic based on intent detection and available advisors.
- Scales with what you have. 1 provider? Single-model answers. 3+ providers? Council automatically on complex questions, multi-model verification on code. Local GPU? Free inference alongside cloud. DGX Spark? Three 70B models in parallel, fully local.
- Performant by default. Uses all available advisors within reason. Simple questions don't trigger council. Budget limits always enforced. Switch to cost mode for minimal spend.
- 4-layer safety guardrails. Command blocklist, filesystem boundary enforcement, secrets redaction, and resource limits — guardrails block destructive commands like `rm -rf /`.
9 layers from pip install to GPU inference — install, 3-step setup, 4 user interfaces, intent detection, 5 execution modes, smart routing, tool registry, 23+ AI providers, and the hardware stack. Every layer is local-first with cloud fallback. No root needed.
You type something. nvHive detects your intent and picks the right mode — no commands to memorize.
Zero commands to memorize. Same natural language across CLI, REPL, WebUI, and API. Routing improves over time — after 20 queries per provider, it's fully data-driven.
How routing works: Each request is scored across capability (40%), cost (30%), latency (20%), and health (10%), then routed to the highest-scoring provider. On failure, nvHive tries the next provider in the fallback chain.
```bash
nvh routing-stats   # see learned vs static scores
nvh health          # provider resilience dashboard
```

Local-first with NVIDIA GPUs: Simple queries route to Nemotron on your NVIDIA GPU via Ollama — no cloud, no cost, no data leaving your machine. GPU detection via pynvml reads VRAM, driver version, and CUDA version to select the optimal local model. The `--prefer-nvidia` flag gives a 1.3x routing bonus to keep inference on NVIDIA hardware whenever quality allows.
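The weighted scoring described above can be sketched in plain Python. This is a minimal illustration of the 40/30/20/10 split, not nvHive's internal data structures; the provider names and raw metrics below are hypothetical.

```python
# Minimal sketch of weighted provider scoring (illustrative only).
# Each metric is normalized to 0..1, where higher is better.
WEIGHTS = {"capability": 0.40, "cost": 0.30, "latency": 0.20, "health": 0.10}

def score(provider: dict) -> float:
    """Weighted sum of normalized metrics, per the 40/30/20/10 split."""
    return sum(WEIGHTS[k] * provider[k] for k in WEIGHTS)

def pick(providers: dict) -> list[str]:
    """Return provider names ordered best-first; the tail is the fallback chain."""
    return sorted(providers, key=lambda name: score(providers[name]), reverse=True)

# Hypothetical metrics for three providers:
providers = {
    "local-gpu": {"capability": 0.70, "cost": 1.00, "latency": 0.90, "health": 1.00},
    "free-api":  {"capability": 0.80, "cost": 1.00, "latency": 0.60, "health": 0.80},
    "paid-api":  {"capability": 0.95, "cost": 0.20, "latency": 0.70, "health": 0.95},
}
chain = pick(providers)  # best first; on failure, try the next in the chain
```

With these made-up numbers the free local GPU wins despite lower raw capability, which is the behavior the cost and health weights are there to produce.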
AI that can see your screen, control your mouse and keyboard, install software, and navigate browsers — powered by local vision models.
```bash
# In the REPL — just talk naturally
nvh
> take a screenshot and describe what you see
> setup comfyui
> open firefox and go to github.com
> click on the search bar and type "nvhive"

# From the CLI — same natural language
nvh "take a screenshot and describe my desktop"
nvh "install numpy"
nvh "open a terminal and run nvidia-smi"
```

How it works: nvhive auto-detects whether your input is a question, a simple action, or a multi-step task:
| Input | Detection | What happens |
|---|---|---|
| "what is python?" | Question | Sent to LLM directly |
| "open firefox" | Simple action | Executes immediately |
| "take a screenshot" | Task | Agent loop: screenshot → vision analysis → report |
| "setup comfyui" | Task | Agent loop: git clone → pip install → launch → verify |
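A toy version of that detection could be a keyword heuristic like the following. This is purely illustrative — the word lists and function name are invented here, and nvHive's actual classifier is more involved.

```python
# Toy intent detector (illustrative heuristic, not nvHive's actual classifier).
QUESTION_WORDS = ("what", "why", "how", "who", "when", "where", "is", "are", "should")
TASK_VERBS = ("setup", "install", "build", "fix", "refactor", "take a screenshot")
ACTION_VERBS = ("open", "click", "type", "run", "launch")

def detect(text: str) -> str:
    t = text.lower().strip().rstrip("?")
    if text.strip().endswith("?") or t.split()[0] in QUESTION_WORDS:
        return "question"          # sent to an LLM directly
    if any(t.startswith(v) or v in t for v in TASK_VERBS):
        return "task"              # multi-step agent loop
    if any(t.startswith(v) for v in ACTION_VERBS):
        return "action"            # executes immediately
    return "question"
```

Under this sketch, "what is python?" classifies as a question, "open firefox" as an action, and "setup comfyui" as a task — matching the rows in the table above.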
Vision pipeline: screenshot (pyautogui) → local vision model (llama3.2-vision / minicpm-v) → coordinate estimation → mouse/keyboard action → verify with another screenshot. Falls back to cloud vision (GPT-4o, Gemini, Claude) if no local vision model.
No root needed. Works on Linux (X11), macOS, and Windows. Install with `pip install "nvhive[vision]"`.
Multi-model coding agent with dynamic expert referral, iterative QA refinement, parallel execution, and vision/browser tools. Scales from no-GPU to DGX Spark.
```bash
# One-time setup: pulls the right models for your GPU
nvh agent --setup

# Run a coding task
nvh agent "Fix the streaming timeout bug in council.py"
nvh agent "Add unit tests for the auth middleware" --dir ./myproject
nvh agent "Refactor the router to use health-aware selection" -y

# Advanced: sandbox, workspace, parallel pipeline
nvh agent "Build the notification service" --sandbox   # Docker-isolated execution
nvh agent "task" --workspace ./api,./frontend          # multi-repo context
nvh agent "refactor the auth module" --sandbox         # runs in Docker container
```

How it works: Intent detection classifies the task, the orchestrator generates expert agents matched to the best LLMs, agents run in parallel where possible, dynamic expert referral fills knowledge gaps on-demand, and an iterative QA loop refines until the task is completed. See the Architecture Overview diagram above for the full flow.
```bash
nvh agent "task" --iterative               # enable iterative QA refinement
nvh agent "task" --iterative --budget 2.50 # cap spend at $2.50
```

| Feature | What It Does |
|---|---|
| Dynamic Expert Referral | Agents self-identify knowledge gaps and emit REFER: Need a Database Expert for sharding — the system dynamically spawns the specialist, gets the answer, and feeds it back. Max depth prevents infinite recursion. |
| Iterative QA Refinement | Generate agents → run with referrals → post-QA reviews → if gaps found, spawn new agents informed by feedback → repeat until PASSED or budget exhausted. |
| Parallel Pipeline | Decomposes tasks into independent subtasks, runs them concurrently (bounded semaphore), respects dependencies, VRAM-aware model swapping with context preservation. |
| Vision + Desktop Control | Screenshot capture, image analysis via vision LLMs (GPT-4o, Claude, Gemini, LLaVA), OCR, mouse/keyboard automation with pyautogui. Agents can see and interact with GUIs. |
| Browser Automation | Headless browser navigation, screenshots, form filling via Playwright. HTTP requests, process management, Docker tools. |
| Docker Sandbox | --sandbox flag runs agent shell commands inside a Docker container — memory-limited, CPU-limited, no network by default, non-root user. Falls back to local if Docker unavailable. |
| Execution Checkpoints | File state snapshots before execution. Automatic rollback on failure — restores modified files, deletes newly created ones. |
| LLM Drift Detection | Monitors provider quality over time using EMA. Alerts when a provider drops >20% vs historical average. Auto-reroutes traffic away from degraded providers. |
| Code Analysis | Static analysis for code smells (long functions, deep nesting, complex conditionals, magic numbers, missing docstrings), tech debt scoring, complexity hotspots, missing test detection. |
| Multi-Repo Workspaces | --workspace aggregates multiple repos into a single agent context. Cross-repo import detection, language detection, shared file patterns. Read-only support for reference repos. |
| VS Code Extension | Agent tasks, code review, test generation, council queries, and explain — all from the VS Code sidebar. Auto-starts nvh serve if needed. |
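The EMA-based drift detection from the table above (alert when quality drops more than 20% versus the historical average) can be sketched as follows. The smoothing constants here are arbitrary choices for illustration; only the 20% threshold comes from the description.

```python
# Sketch of EMA-based drift detection (illustrative; mirrors the >20% rule above).
class DriftDetector:
    def __init__(self, alpha: float = 0.2, threshold: float = 0.20):
        self.alpha = alpha          # EMA smoothing factor (fast-moving estimate)
        self.threshold = threshold  # alert when EMA drops >20% vs baseline
        self.ema = None
        self.baseline = None

    def observe(self, quality: float) -> bool:
        """Feed one quality score in 0..1; return True when drift is detected."""
        if self.ema is None:
            self.ema = self.baseline = quality
            return False
        self.ema = self.alpha * quality + (1 - self.alpha) * self.ema
        # Baseline tracks the long-run average much more slowly.
        self.baseline = 0.01 * quality + 0.99 * self.baseline
        return self.ema < self.baseline * (1 - self.threshold)

detector = DriftDetector()
for q in [0.9] * 20:          # healthy provider: no alerts
    assert not detector.observe(q)
drifted = [detector.observe(0.3) for _ in range(10)]  # quality collapses
```

Because the fast EMA falls well below 80% of the slow baseline within a few bad observations, the detector flags the degraded provider, at which point a router can shift traffic away from it.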
Scales with your hardware — 6 tiers from no-GPU to DGX Spark:
| GPU | VRAM | Tier | Models | Mode |
|---|---|---|---|---|
| DGX Spark | 128 GB | Tier 5 | Nemotron 70B + Llama 70B + Qwen 72B (3 models, all local) | Multi |
| RTX 6000 Pro BSE | 96 GB | Tier 4 | Cloud planner + Llama 70B coder + Qwen 32B reviewer (dual local) | Multi |
| A100 / A6000 | 48-80 GB | Tier 3 | Cloud planner + Llama 70B coder (--mode multi for dual local) | Auto |
| RTX 3090 / 4090 | 24 GB | Tier 2 | Cloud planner + Gemma 2 27B coder | Single |
| RTX 4060 Ti | 16 GB | Tier 1 | Cloud planner + Qwen Coder 7B | Single |
| No GPU | — | Tier 0 | Fully cloud | Single |
```bash
nvh agent --setup              # pull recommended models
nvh agent --remove             # clean up models
nvh agent "task" --mode multi  # force multi-model (Tier 3+)
nvh agent "task" --mode single # force single model
nvh agent "task" --git         # auto-branch + commit changes
nvh agent "task" --no-quality  # skip lint/syntax gates
```

Multi-model mode (Tier 4-5, or --mode multi on Tier 3): a DIFFERENT model reviews the coder's output, catching bugs the coder's architecture has blind spots for. Cross-model verification is measurably better than self-review.
Quality gates: after the agent modifies files, ruff lint + syntax checks run automatically. If they fail, the agent gets the errors and fixes them in a feedback loop.
```bash
nvh review               # review staged changes
nvh review HEAD~3..HEAD  # review last 3 commits
nvh review 42            # review GitHub PR #42
nvh review --mode multi  # two models review independently
```

Multi-model code review: two different LLM architectures review independently, then findings are synthesized. Catches bugs that self-review and single-model review miss.
```bash
nvh test-gen nvh/core/council.py  # generate tests for a file
nvh test-gen --coverage-gaps      # find and fill coverage gaps
```

Reads your code, identifies untested paths, generates pytest tests, runs them, and iterates until they pass. The agent that improves itself — it writes the tests that verify its own future changes.
When one model isn't enough, nvHive runs the same query through multiple providers in parallel, then synthesizes their responses. Expert personas are generated for the query (e.g., Backend Engineer, Architect, DBA), each assigned to a different model. Responses are collected, analyzed for agreement using keyword overlap and an LLM judge, and then synthesized by a non-member provider (where available) into a final council response with a confidence score and individual perspectives.
Why this works: Different models have different blind spots. Council mode surfaces all perspectives and synthesizes the best of each.
Confidence scoring: Every council response includes an agreement metric (e.g., "Strong consensus" vs "Split decision") based on pairwise response similarity. Tells you when to trust the consensus.
Cost: Council with 3 free providers costs $0. Council with 3 Nemotron variants on a single NVIDIA GPU costs $0 and never leaves your machine. Premium cloud council costs ~3x a single query.
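A stripped-down version of the keyword-overlap agreement metric mentioned above might look like the following. The stop-word list, 0.5 cutoff, and function names are illustrative; the real system also uses an LLM judge alongside the overlap score.

```python
# Toy agreement metric via mean pairwise keyword overlap (Jaccard). Illustrative only.
from itertools import combinations

def keywords(text: str) -> set[str]:
    stop = {"the", "a", "an", "is", "for", "and", "of", "to"}
    return {w for w in text.lower().split() if w not in stop}

def agreement(responses: list[str]) -> float:
    """Mean pairwise Jaccard similarity across all council responses."""
    sims = []
    for a, b in combinations(responses, 2):
        ka, kb = keywords(a), keywords(b)
        sims.append(len(ka & kb) / len(ka | kb) if ka | kb else 0.0)
    return sum(sims) / len(sims)

def label(score: float) -> str:
    return "Strong consensus" if score >= 0.5 else "Split decision"

votes = ["use redis for session storage",
         "redis for session storage is fine",
         "use postgres"]
```

With two advisors backing Redis and one backing Postgres, the mean pairwise overlap lands well below 0.5, so this sketch would report a "Split decision" — a signal to weigh the synthesis more carefully.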
```bash
nvh convene "Should we use Redis or Postgres for session storage?"
# → 3 models debate → synthesis with confidence score
```

Throwdown goes beyond council. Three passes, each building on the last. In the first pass, three experts analyze the query independently. In the second pass, each expert critiques the others — finding blind spots and challenging assumptions. The final synthesis integrates all perspectives into a single, thoroughly vetted answer.
```bash
nvh throwdown "Review this architecture for scalability issues"
# Pass 1: 3 experts analyze independently
# Pass 2: experts critique each other's analysis
# Pass 3: final synthesis integrating all perspectives
```

Why throwdown beats single-model: A single model gives you one perspective, once. Throwdown gives you three perspectives, challenged by three critiques, then synthesized. Errors get caught. Assumptions get questioned. The final answer is more thorough than any single pass.
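With a stubbed model call, the three-pass throwdown flow can be orchestrated roughly like this. Everything here is a placeholder sketch: `ask` stands in for a real provider call, and the expert roles are example names, not nvHive's API.

```python
# Rough three-pass throwdown skeleton; `ask` is a stand-in for a real LLM call.
def ask(prompt: str) -> str:
    return f"[answer to: {prompt}]"   # placeholder model

def throwdown(query: str, experts=("Architect", "Backend Engineer", "DBA")) -> str:
    # Pass 1: each expert analyzes the query independently.
    analyses = {e: ask(f"As a {e}, analyze: {query}") for e in experts}
    # Pass 2: each expert critiques the other experts' analyses.
    critiques = {
        e: ask(f"As a {e}, critique: "
               + " | ".join(v for k, v in analyses.items() if k != e))
        for e in experts
    }
    # Pass 3: synthesize all analyses and critiques into one vetted answer.
    material = list(analyses.values()) + list(critiques.values())
    return ask("Synthesize into one answer: " + " | ".join(material))

final = throwdown("Review this architecture for scalability issues")
```

The structure is the point: the synthesis prompt sees every analysis and every critique, so the final pass can keep what survived scrutiny and drop what did not.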
```bash
# Confidence-gated escalation: try free first, upgrade only if needed
nvh ask --escalate "Design a distributed lock manager"
# → groq (free, confidence: 42%) → auto-escalated to openai

# Cross-model verification: a second model checks the answer
nvh ask --verify "Is eval() safe in Python?"
# → groq answers → google verifies (9/10, no issues)

# Both together: cheapest possible verified answer
nvh ask --escalate --verify "Explain the CAP theorem"
```

nvh setup detects your NVIDIA GPU, selects which models fit in your VRAM, and pulls them automatically. Supports both NVIDIA Nemotron and Google Gemma 4 (NVIDIA-optimized) for local council with two different architectures.
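Conceptually, the `--escalate` flow shown above is just a loop over a cheapest-first provider chain: answer with the cheapest provider, check its confidence, and only move up the chain if confidence falls short. A sketch with stub providers (the names, scores, and threshold below are hypothetical):

```python
# Confidence-gated escalation sketch; providers are stubs, ordered cheapest-first.
def escalating_ask(question: str, providers, threshold: float = 0.6):
    """Try cheap providers first; escalate while confidence is below threshold."""
    answer, confidence, used = None, 0.0, None
    for name, call in providers:
        answer, confidence = call(question)
        used = name
        if confidence >= threshold:
            break                      # confident enough; stop escalating
    return answer, confidence, used

providers = [
    ("groq",   lambda q: ("quick free answer", 0.42)),  # hypothetical confidence
    ("openai", lambda q: ("premium answer", 0.91)),
]
answer, conf, used = escalating_ask("Design a distributed lock manager", providers)
# The low first confidence (0.42 < 0.6) triggers escalation to the paid provider.
```

The cost property falls out of the ordering: the paid provider is only invoked on the fraction of queries where the free answer was not confident enough.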
VRAM determines which models run locally: with no VRAM you get cloud-only routing (free tiers first); 8-24 GB runs small 7B models locally alongside cloud; 24-48 GB runs medium 27B models with a cloud planner; 48-96 GB runs large 70B models with a cloud orchestrator; and 128 GB+ (DGX Spark) runs all models locally at $0, fully private.
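The VRAM thresholds above reduce to a simple lookup. The function name and return strings here are illustrative; the tier boundaries come from the paragraph above.

```python
# Illustrative VRAM → local-model-class mapping, per the tiers described above.
def local_model_class(vram_gb: float) -> str:
    if vram_gb >= 128:
        return "all models local ($0, fully private)"
    if vram_gb >= 48:
        return "large 70B models + cloud orchestrator"
    if vram_gb >= 24:
        return "medium 27B models + cloud planner"
    if vram_gb >= 8:
        return "small 7B models + cloud"
    return "cloud-only (free tiers first)"
```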
```bash
nvh setup
# Step 3/3: Local GPU inference
#   Detected: NVIDIA GeForce RTX 4090 (24GB VRAM)
#   Models: nemotron-small, gemma4:26b
#   Pulling nemotron-small... ✓
#   Pulling gemma4:26b... ✓
#   Local council ready — multiple models for consensus
```

What nvh setup handles: GPU detection via pynvml reads your VRAM, driver, and CUDA version. Based on available VRAM, it selects the optimal models — small models for modest GPUs, dual-architecture setups (Nemotron + Gemma 4) for mid-range cards, and full 70B models for high-end hardware. It checks whether Ollama is installed and running, then auto-pulls all models that fit. Once complete, the adaptive routing engine tracks each model's quality on your specific hardware.
After setup, routing is automatic:
- Simple queries → local Nemotron or Gemma 4 on your GPU (free, private)
- Council mode → both models collaborate locally, catching different blind spots
- Complex queries → cloud providers when local quality isn't sufficient
- `nvh bench` measures your GPU's actual tok/s and compares it with community baselines
- The adaptive routing engine measures each model's quality on YOUR hardware
Full GPU detection + VRAM guide
| Layer | Technology | Hardware | Use Case |
|---|---|---|---|
| Local | Ollama + Nemotron | Consumer GPUs (RTX 3060+) | Default local inference, privacy mode |
| Local | Ollama + Gemma 4 | Consumer GPUs (RTX 3060+) | NVIDIA-optimized, reasoning + multimodal |
| Cloud | NVIDIA NIM API | NVIDIA cloud | Specialized models, 1000 free credits |
| Enterprise | Triton Inference Server | H100 / A100 / L40 | Production multi-model serving, TensorRT-LLM |
| Agent | NemoClaw / OpenShell | Any | Agent orchestration with nvHive routing |
| Detection | pynvml | Any NVIDIA GPU | VRAM, driver, CUDA, temp, power, PCIe |
--prefer-nvidia gives a 1.3x routing bonus to all NVIDIA-backed providers, keeping inference on NVIDIA hardware whenever quality allows.
nvHive exposes multiple integration surfaces: a CLI (nvh), a web dashboard (nvh webui), a Python SDK (import nvh), an MCP server for Claude Code, and OpenAI/Anthropic-compatible API proxies. All integrations feed through the same adaptive router, council engine, and 4-layer guardrails, routing to your local GPU, free cloud providers, paid cloud, or NVIDIA NIM as appropriate.
API Proxies — point existing SDKs at nvHive:
| SDK | Configuration |
|---|---|
| Anthropic | ANTHROPIC_BASE_URL=http://localhost:8000/v1/anthropic |
| OpenAI | OPENAI_BASE_URL=http://localhost:8000/v1/proxy |
| Claude Code | claude mcp add nvhive -- python -m nvh.mcp_server |
nvHive works alongside OpenClaw as a routing layer, and integrates with NemoClaw (NVIDIA's agent framework) as both inference provider and MCP tool server.
```bash
nvh migrate --from openclaw  # import your existing API keys
nvh nemoclaw --start         # start proxy for NemoClaw agents
```

Note: Anthropic recently changed billing for third-party tools. See the integration guide for details.
4-layer guardrails protect your system. Agents are designed to keep within the project directory, block destructive commands, and prevent secret leakage. Docker sandbox adds container isolation.
How the guardrails work: Every agent command passes through four layers in sequence. First, a blocklist rejects known-dangerous commands (rm -rf, mkfs, etc.). Second, path boundary enforcement checks that all file operations stay within the project directory. Third, secrets redaction strips API keys and credentials from agent context. Fourth, resource limits cap execution time and memory. If --sandbox is enabled, commands run inside a Docker container with memory/CPU limits, no network access, and a non-root user. All executions are checkpointed so changes can be rolled back on failure.
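The first two layers, the command blocklist and path-boundary enforcement, reduce to checks like the following. The patterns and function names are examples for illustration, not the shipped blocklist.

```python
# Sketch of guardrail layers 1-2: command blocklist + path boundary (illustrative).
from pathlib import Path

BLOCKLIST = ("rm -rf", "mkfs")  # example dangerous patterns, not the real list

def command_allowed(cmd: str) -> bool:
    """Layer 1: reject commands matching known-dangerous patterns."""
    return not any(pattern in cmd for pattern in BLOCKLIST)

def path_allowed(path: str, project_root: str) -> bool:
    """Layer 2: reject file operations that escape the project directory."""
    root = Path(project_root).resolve()
    target = Path(project_root, path).resolve()
    return target == root or root in target.parents

assert not command_allowed("rm -rf / --no-preserve-root")
assert command_allowed("ls -la")
assert path_allowed("src/main.py", "/tmp/project")
assert not path_allowed("../../etc/passwd", "/tmp/project")
```

Resolving the target path before comparing it to the project root is what catches `..` traversal: `../../etc/passwd` resolves outside the root, so the check fails even though the raw string starts inside the project.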
nvHive is a routing layer. Any AI application can add multi-provider routing:
```python
import nvh

# Drop-in OpenAI-compatible interface
response = await nvh.complete([
    {"role": "user", "content": "Explain quicksort"}
])

# Inspect routing without executing
decision = await nvh.route("complex question about databases")

# Council consensus
result = await nvh.convene("Architecture review", cabinet="engineering")

# Provider health check
status = await nvh.health()
```

| Command | What It Does |
|---|---|
| `nvh agent "task"` | Dynamic expert referral + iterative QA refinement (6 GPU tiers) |
| `nvh agent --setup` | Pull recommended local models for your GPU |
| `nvh agent --mode multi` | Force multi-model: separate planner, coder, reviewer |
| `nvh agent --sandbox` | Execute shell commands inside a Docker container |
| `nvh agent --workspace ./a,./b` | Multi-repo context for cross-project tasks |
| `nvh agent --git` | Auto-create branch + commit changes |
| `nvh review` | Multi-model code review (staged changes, PRs, commit ranges) |
| `nvh test-gen file.py` | AI test generation with automatic verification |
| `nvh analyze ./src` | Code smells, tech debt score, complexity hotspots |
| `nvh drift` | Check for LLM quality degradation across providers |
| Command | What It Does |
|---|---|
| `nvh "question"` | Smart route to best available model |
| `nvh convene "question"` | Council consensus (3+ models collaborate) |
| `nvh throwdown "question"` | Three-pass deep analysis with critique |
| `nvh poll "question"` | Side-by-side provider comparison |
| `nvh safe "question"` | Local only — nothing leaves your machine |
| `nvh ask --escalate` | Try free first, escalate if uncertain |
| `nvh ask --verify` | Cross-model verification |
| Command | What It Does |
|---|---|
| `nvh serve` | Start the API server (OpenAI + Anthropic compatible proxy) |
| `nvh webui` | Launch the web dashboard |
| `nvh health` | Provider resilience dashboard |
| `nvh nvidia` | NVIDIA GPU infrastructure status |
| `nvh bench` | GPU speed test (tokens/sec) |
| `nvh setup` | Interactive provider setup |
| `nvh doctor` | Full diagnostic dump for troubleshooting |
Full command reference (50+ commands)
23 providers. 63 models. 25 free — no credit card required.
| Tier | Providers | Rate Limits |
|---|---|---|
| Free (no signup) | Ollama (local), LLM7 | Unlimited / 30 RPM |
| Free (email signup) | Groq, GitHub Models, Cerebras, SambaNova, Cohere, AI21, SiliconFlow, HuggingFace | 15-30 RPM |
| Free (account) | Google Gemini, Mistral, NVIDIA NIM | 15-1000 RPM |
| Paid | OpenAI, Anthropic, DeepSeek, Fireworks, Together, OpenRouter, Grok | Pay per token |
Real data from NVIDIA DGX Spark (GB10, 120GB). Judged by OpenAI with ground truth verification on math prompts.
| Mode | Accuracy | Completeness | Coherence | Overall | Cost |
|---|---|---|---|---|---|
| Single Model (Nemotron Super) | 5.5 | 5.7 | 5.0 | 5.1 | $0.00 |
| Council (Free: Ollama + Groq + Google) | 9.0 | 8.0 | 9.0 | 8.6 | $0.00 |
In our reference benchmark (16 prompts, DGX Spark hardware, judged by OpenAI), council consensus scored 68% higher than a single model on the same prompts. Ground truth verification on math problems caught errors the single model missed. Results will vary by hardware and workload — run nvh bench to measure on your setup.
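The 68% headline follows directly from the Overall column in the table:

```python
# How the headline improvement is derived from the Overall scores above.
single, council = 5.1, 8.6
improvement_pct = (council - single) / single * 100   # ≈ 68.6%
```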
| Model | Size | tok/s |
|---|---|---|
| gemma3 | 3.3 GB | 119.3 |
| nemotron-mini | 2.7 GB | 85.7 |
| gemma4 (e4b) | 9.6 GB | 61.7 |
| llama3.1 | 4.9 GB | 48.2 |
| nemotron-3-super | 86 GB | 23.6 |
```bash
nvh bench                    # GPU speed (tokens/sec)
nvh bench -q                 # speed + quality comparison
nvh health                   # provider resilience
nvh why                      # explain last routing decision
nvh estimate --gpu rtx_4090  # predict tok/s on any GPU
```

16 prompts across code generation, debugging, reasoning, math, creative writing, and Q&A. LLM judge + ground truth verification. Run it yourself. Publish the results.
| Guide | Description |
|---|---|
| Getting Started | First-time setup |
| Commands | Full CLI reference (50+ commands) |
| Providers | 23 providers, rate limits, free tiers |
| Council System | Multi-LLM consensus with confidence scoring |
| Releasing | Release runbook, version bumps, PyPI publishing |
| Windows Troubleshooting | Encoding, segfaults, port 80, nvh.exe locks |
| GPU Detection | Auto-detection, model selection, OOM protection |
| Claude Code | MCP server setup |
| NemoClaw | NVIDIA NemoClaw integration |
| OpenClaw Integration | Works alongside OpenClaw |
| SDK & API | Python SDK, REST API, proxies |
| Deploy Without Root | No-root install on servers (Ollama, keyring, systemd user service) |
| Web UI | Web UI |
| Agent Tools | Agent tools |
| Configuration | Configuration |
| Architecture | System design and adaptive routing |
- Data Privacy: When using cloud providers, queries are transmitted to third-party APIs (OpenAI, Anthropic, Google, Groq, etc.) subject to each provider's privacy policy. Use `nvh safe` or `--prefer-nvidia` to keep all inference local.
- AI Accuracy: AI-generated outputs may contain errors. Review agent-modified files before committing to production. nvHive provides guardrails and rollback but does not guarantee output correctness.
- Provider Liability: nvHive is a routing and orchestration layer. Response quality, availability, and pricing are determined by third-party providers and subject to change.
- Security: Safety guardrails use pattern-matching heuristics and may not catch all edge cases. For sensitive environments, use `--sandbox` with Docker isolation.
- Benchmarks: Benchmark results were measured on NVIDIA DGX Spark reference hardware. Results vary by hardware, provider, and workload.
- QA Gate: The iterative QA reviewer is an LLM judge (not a static analysis tool). QA verdicts are probabilistic assessments, not deterministic guarantees.
MIT License. See LICENSE for details.




