| title | OmniNPC - Nemotron Live Demo |
|---|---|
| emoji | 🎮 |
| colorFrom | green |
| colorTo | indigo |
| sdk | docker |
| app_port | 7860 |
| pinned | true |
Hierarchical real-time screen-space embodied agents via the Nemotron Super+Nano pattern
Three NPCs. Rendered pixels only. No engine API.
Nemotron Nano VL 8B processes the canvas every ~700ms and outputs a JSON action. Every 10 seconds, Nemotron Super 49B reads the scoreboard and rewrites the strategy directive that steers all three NPCs simultaneously. The directive injection is the entire communication channel between the two loops -- Super writes a sentence, Nano reads it.
The key result: without a Super directive, Nano never evades threats (0% across 50 trials). Inject a specific "EVADE" directive and evasion reaches 100% [89-100% CI]. The hierarchy is not an optimization. It is the only mechanism that produces non-default behavior.
export NVIDIA_NIM_API_KEY=nvapi-...
pip install -r requirements.txt
python api_server.py --port 8000Open http://localhost:8000/ and click START.
Left: live scoreboard. Center: 3D arena with HOSTILE (red), ALLY (blue), WANDERER (yellow) and the human player. Top-right: per-call Nano latency chart. Right panel: streaming scene descriptions from each NPC's inference call.
The browser captures a JPEG of the Three.js canvas every ~700ms and sends it to the FastAPI server via POST /infer. The server prepends the current Super directive into Nano's system prompt, calls NVIDIA NIM, and returns the action JSON. Three NPCs run this loop in round-robin. Every 10 seconds, a separate POST /super call sends the scoreboard text to Super 49B, which returns one sentence of strategic guidance. That sentence becomes the directive in every subsequent Nano call until the next Super update.
The directive injection is the entire communication channel between the two loops. Super does not call Nano directly. It writes a sentence. Nano reads it.
Each inference cycle, Nano returns a scene_description field alongside the action. The reasoning panel streams these in real time, labeled by NPC role. The SUPER-49B directive is shown at the top -- this is what steers all three NPCs simultaneously.
Without a Super directive, Nano is a deterministic coin-collector. Across 50 trials on a conflict scene (threat visible on left, coin ahead), the evasion rate was 0% [0-7% CI]. Every single trial: move_forward toward the coin.
Inject a specific "EVADE IMMEDIATELY" directive from Super 49B and the evasion rate reaches 100% [89-100% CI] on the conflict frame. The VTRT evaluation (50 trials, random threat positions) shows 70% [56-81% CI] evasion with the directive versus 0% [0-7% CI] without.
The Super directive is not an optimization. It is the only mechanism that produces non-default behavior. This is the hierarchy claim, demonstrated empirically.
Press V, say a command, and it routes to the named NPC for a 6-second override window.
Press V in-game to enable voice. The Web Speech API transcribes your command in the browser -- no server round-trip for the transcription. The command is parsed into a target NPC and an action, then injected as a high-priority directive override for 6 seconds.
| Command | Effect |
|---|---|
hostile go to blue zone |
HOSTILE walks to the blue tile |
ally come here |
ALLY follows the player |
wanderer dance |
WANDERER stops and dances |
everyone stop |
All three NPCs idle |
hostile run away |
HOSTILE flees from the player |
Cross-Modal Sync Score (CMSS): 100% [93-100% CI] across 50 trials with unambiguous directional commands.
All evaluations run against cloud-hosted NVIDIA NIM. Latency is dominated by API round-trip (~2.3s). The architecture targets local deployment at <1 Hz per NPC.
| Metric | Result | N | Notes |
|---|---|---|---|
| VTRT evasion w/ directive | 96% [86-99%] | 50 | Phase 3; FSM + safety verifier |
| VTRT evasion no directive | 0% [0-7%] | 50 | Nano alone — never evades |
| Directive lift (survival) | +100pp | -- | 0%→96%; hierarchy proof |
| Survive directive adherence | 100% [89-100%] | 30 | |
| Collect directive adherence | 100% [89-100%] | 30 | |
| ACRL mean +/- std | 3527+/-1175ms | 50 | |
| ACRL P95 | 4448ms | 50 | |
| CMSS direction alignment | 100% [93-100%] | 50 | |
| Adversarial JSON validity | 100% | 50 | |
| MUR | 95.8% [92-98%] | 240 | Memory utilization rate |
| SGPR | 70.4% [64-76%] | 240 | Scene graph parse rate |
| RDS mean JSD (Phase-1 baseline=0.0) | 0.6667 | 3 role pairs | Phase 4 role differentiation |
| Grid Coverage | 0.7+/-0.5 cells | 24 | Super cycles |
- Event-triggered inference with frame differencing (threshold: 19)
- Skip rate: ~48% of frames skipped per session
- Compact output format toggle added
- Inference efficiency logged to CSV and JSON summary
- NPCMemory: per-role episodic memory (last 3 events) injected into Super directive
- OccupancyGrid: 8x8 2D world model updated from scene descriptions, feeds Super
- SceneGraph: structured entity extraction from Nano JSON responses (SGPR 70.4%)
- Memory utilization: 95.8% of Nano calls receive non-empty memory context
- NPCStateMachine: per-NPC FSM with 5 states (EXPLORE/DEFEND/COLLECT/EVADE/IDLE)
- FSM enforces minimum 2-frame state commitment — prevents mid-evasion reversals
- Super now returns structured JSON directive: state, duration, priority fields
- SafetyVerifier: rule-based action filter blocks contradictory actions per FSM state
- VTRT evasion improved from 90% (Phase 2) to 96% (Phase 3)
- Shared Blackboard: thread-safe ring buffer (max 5 entries) for short messages in the format
{"from": role, "msg": str, "t": timestamp} - Nano now reads the last 2 blackboard entries for local context before acting
- Role-Specific Directives: Super now emits 3 differentiated JSON directives for HOSTILE, ALLY, and WANDERER instead of one broadcast directive
- Tool Use: Super uses lightweight Python tools for pathfinding and entity position lookups derived from OccupancyGrid and NPCMemory
- RDS mean JSD reached 0.6667 against the Phase-1 baseline of 0.0, confirming role differentiation
The perception loop uses frame differencing to skip inference on visually identical frames. Each NPC maintains a per-role frame cache. Inference is triggered only when the mean absolute pixel difference exceeds a calibrated threshold (19.0), or when a maximum repeat count is reached. Across live sessions, ~48% of frames are skipped, reducing redundant NIM API calls without affecting behavioral quality.
Most calls land between 2300-3000ms (API floor + processing). The right tail comes from NIM rate-limit back-off. Local deployment eliminates both.
To reproduce:
python run_evaluation.py --trials audio --n 50
python run_evaluation.py --trials visual --n 100
python run_evaluation.py --trials hierarchy --n 30
python run_evaluation.py --trials adversarial --n 10
python run_analysis.py
# Output: data/results/PAPER_RESULTS.txt| Default 2025 agent | OmniNPC | |
|---|---|---|
| Sensing | Engine API (game.get_state()) |
Rendered pixels only |
| Cadence | Request-response, sequential | Continuous: 3 NPCs at ~0.37 Hz each, in parallel |
| Reasoning | One model, one prompt | Hierarchical: Nano 8B (reflex) + Super 49B (strategy) + FSM state machine |
| Agents | Usually 1 | 3 agents sharing one directive + on-demand dialogue |
| Modality | Vision or audio | Vision + microphone simultaneously |
Pixel-only sensing is a hard requirement for generality. If the agent needs game.get_state(), it works in 0% of games because 99% of games do not expose that. If it works from pixels, every game ever made is a valid target environment.
- Create a new Space at
huggingface.co/new-space - Select Docker as the SDK, set App port to
7860 - Clone the Space repo and push this codebase to it:
git remote add space https://huggingface.co/spaces/<your-username>/omninpc git push space main
- In your Space settings, add a Secret named
NVIDIA_NIM_API_KEYwith your NIM key - The Space will build the Docker image and serve the game at your Space URL
The Dockerfile installs only what api_server.py needs (no Panda3D, no audio stack). Build time is roughly 2-3 minutes.
export NVIDIA_NIM_API_KEY=nvapi-...
python api_server.py --port 8000
# open http://localhost:8000/export NVIDIA_NIM_API_KEY=nvapi-...
python run_demo.py
# requires: pip install panda3d SpeechRecognition pyaudio- Python 3.10+ (3.11 recommended)
- NVIDIA NIM API key -- free at
build.nvidia.com
git clone <repo>
cd omninpc
python -m venv venv && source venv/bin/activate
pip install -r requirements.txt
echo "NVIDIA_NIM_API_KEY=nvapi-..." > .env| Input | Action |
|---|---|
| WASD | Walk |
| Mouse | Look |
| Space | Jump |
| Shift | Sprint |
| E | Pick up coin |
| V | Toggle voice |
| M | Mute / unmute |
| R | Record session (.webm) |
| Esc | Release mouse |
omninpc/
api_server.py FastAPI entry -- what you deploy
html_game/
index.html
game.js
Dockerfile HF Spaces / Docker target
inference/ Nano + Super agents, NIM client, prompt templates
orchestrator/ Async perception + strategy loops
actions/ Action schema + executors
capture/ Screen capture (dxcam / mss fallback)
capture/frame_differ.py Frame differencing + skip logic
metrics/ Latency tracker + CSV logger (15 columns, including npc_role, did_infer, skip_reason)
metrics/latency_tracker.py Skip rate + inference metrics
evaluation/ Trial runner, results analyser
run_evaluation.py Evaluation entry point
run_analysis.py Paper table generator
data/results/ trial_results.json, PAPER_RESULTS.txt
docs/
images/ Screenshots and architecture diagrams
figures/ Generated evaluation charts
generate_figures.py Figure generation script
tests/ 26 pytest tests, all passing
config/config.yaml
tools/calibrate_threshold.py Threshold calibration script
Cloud-hosted NIM adds ~2.3s of latency per call. Mean ACRL is 2668ms and mean VTRT is 3345ms -- well above production NPC standards. The architecture targets local deployment where these drop below 1 Hz.
Test frames in the evaluation suite are 128x128 synthetic images. The directive lift results hold in live play, but the controlled evaluation uses synthetic input, not captured game frames.
CMSS of 100% is measured on unambiguous directional commands. Ambiguous natural-language commands ("go to the health station") are not covered by this metric.
At temperature 0.1, Nano is near-deterministic. Role differentiation (HOSTILE vs ALLY vs WANDERER) in live play comes from Super's per-role directive content, not from within-model temperature diversity.
Aadi Joshi and Kavya Bhand
MIT






