aadi-joshi/omninpc

---
title: OmniNPC - Nemotron Live Demo
emoji: 🎮
colorFrom: green
colorTo: indigo
sdk: docker
app_port: 7860
pinned: true
---
OmniNPC

Hierarchical real-time screen-space embodied agents via the Nemotron Super+Nano pattern

Three NPCs. Rendered pixels only. No engine API.



Nemotron Nano VL 8B processes the canvas every ~700ms and outputs a JSON action. Every 10 seconds, Nemotron Super 49B reads the scoreboard and rewrites the strategy directive that steers all three NPCs simultaneously. The directive injection is the entire communication channel between the two loops -- Super writes a sentence, Nano reads it.

The key result: without a Super directive, Nano never evades threats (0% across 50 trials). Inject a specific "EVADE" directive and evasion reaches 100% [89-100% CI]. The hierarchy is not an optimization. It is the only mechanism that produces non-default behavior.


export NVIDIA_NIM_API_KEY=nvapi-...
pip install -r requirements.txt
python api_server.py --port 8000

Open http://localhost:8000/ and click START.


What it looks like

Game screenshot

Left: live scoreboard. Center: 3D arena with HOSTILE (red), ALLY (blue), WANDERER (yellow) and the human player. Top-right: per-call Nano latency chart. Right panel: streaming scene descriptions from each NPC's inference call.


How it works

Architecture diagram

The browser captures a JPEG of the Three.js canvas every ~700ms and sends it to the FastAPI server via POST /infer. The server prepends the current Super directive into Nano's system prompt, calls NVIDIA NIM, and returns the action JSON. Three NPCs run this loop in round-robin. Every 10 seconds, a separate POST /super call sends the scoreboard text to Super 49B, which returns one sentence of strategic guidance. That sentence becomes the directive in every subsequent Nano call until the next Super update.

The directive injection is the entire communication channel between the two loops. Super does not call Nano directly. It writes a sentence. Nano reads it.
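A minimal sketch of the injection step, assuming the system prompt is a plain string. `BASE_PROMPT`, `DirectiveStore`, and `build_system_prompt` are hypothetical names, not taken from the repository. Only the behavior comes from the description above: Super writes one sentence, and every later Nano call reads it.

```python
import threading

# Hypothetical base prompt; the real prompt templates live in inference/.
BASE_PROMPT = "You control one NPC. Reply with a single JSON action."


class DirectiveStore:
    """Holds the latest Super directive; the only link between the two loops."""

    def __init__(self) -> None:
        self._lock = threading.Lock()
        self._directive = ""  # empty until Super has produced its first sentence

    def set(self, sentence: str) -> None:
        # Called by the slow loop (about every 10 seconds).
        with self._lock:
            self._directive = sentence.strip()

    def get(self) -> str:
        # Called by the fast loop (about every 700 ms).
        with self._lock:
            return self._directive


def build_system_prompt(store: DirectiveStore, role: str) -> str:
    """Prepend the current directive, if any, to Nano's system prompt."""
    directive = store.get()
    header = f"STRATEGY DIRECTIVE: {directive}\n" if directive else ""
    return f"{header}ROLE: {role}\n{BASE_PROMPT}"


store = DirectiveStore()
no_directive = build_system_prompt(store, "HOSTILE")
store.set("EVADE IMMEDIATELY: move away from any visible threat.")
with_directive = build_system_prompt(store, "HOSTILE")
```

Because the store is the only shared state, the two loops can run at different rates without calling each other.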

The reasoning trace

Reasoning trace

Each inference cycle, Nano returns a scene_description field alongside the action. The reasoning panel streams these in real time, labeled by NPC role. The SUPER-49B directive is shown at the top -- this is what steers all three NPCs simultaneously.
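The sketch below shows one way a Nano reply could be validated before it reaches the game. The `scene_description` field comes from the text above. The `action` key, every action name other than `move_forward`, and the fallback behavior are assumptions made for illustration.

```python
import json

# Only "move_forward" appears in this README; the other names are placeholders.
ALLOWED_ACTIONS = {"move_forward", "turn_left", "turn_right", "idle"}
FALLBACK = {"action": "idle", "scene_description": ""}


def parse_nano_reply(raw: str) -> dict:
    """Return a validated action dict, or a safe fallback for malformed output."""
    # Models sometimes wrap JSON in prose; keep only the outermost {...} span.
    start, end = raw.find("{"), raw.rfind("}")
    if start == -1 or end <= start:
        return dict(FALLBACK)
    try:
        data = json.loads(raw[start:end + 1])
    except json.JSONDecodeError:
        return dict(FALLBACK)
    action = data.get("action") if isinstance(data, dict) else None
    if not isinstance(action, str) or action not in ALLOWED_ACTIONS:
        return dict(FALLBACK)
    return {
        "action": action,
        "scene_description": str(data.get("scene_description", "")),
    }


good = parse_nano_reply('Sure: {"action": "move_forward", "scene_description": "coin ahead"}')
bad = parse_nano_reply("I think the NPC should jump {oops")
```

Falling back to a no-op action keeps the NPC loop running when the model returns malformed output.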


The key finding

Without a Super directive, Nano is a deterministic coin-collector. Across 50 trials on a conflict scene (threat visible on left, coin ahead), the evasion rate was 0% [0-7% CI]. Every single trial: move_forward toward the coin.

Inject a specific "EVADE IMMEDIATELY" directive from Super 49B and the evasion rate reaches 100% [89-100% CI] on the conflict frame. The VTRT evaluation (50 trials, random threat positions) shows 70% [56-81% CI] evasion with the directive versus 0% [0-7% CI] without.

Directive lift

The Super directive is not an optimization. It is the only mechanism that produces non-default behavior. This is the hierarchy claim, demonstrated empirically.
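The README does not say how the bracketed intervals are computed. They are consistent with a 95% Wilson score interval, so the sketch below uses that method as an assumption. The success counts (0 of 50, 30 of 30, 35 of 50) are inferred from the percentages and trial counts quoted in this README, not read from the result files.

```python
import math


def wilson_interval(successes: int, n: int, z: float = 1.96) -> tuple:
    """Wilson score interval for a binomial proportion (z = 1.96 gives 95%)."""
    if n <= 0:
        raise ValueError("n must be positive")
    p = successes / n
    denom = 1 + z * z / n
    center = (p + z * z / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    # Clamp to [0, 1] to absorb floating-point rounding at the extremes.
    return max(0.0, center - half), min(1.0, center + half)


for label, k, n in [
    ("no directive", 0, 50),
    ("directive, conflict frame", 30, 30),
    ("directive, random threat positions", 35, 50),
]:
    lo, hi = wilson_interval(k, n)
    print(f"{label}: {100 * k / n:.0f}% [{100 * lo:.0f}-{100 * hi:.0f}%]")
```

Unlike the normal approximation, the Wilson interval stays informative at 0% and 100%, where a naive interval collapses to zero width.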


Voice commands

Voice command

Press V, say a command, and it routes to the named NPC for a 6-second override window.

Press V in-game to enable voice. The Web Speech API transcribes your command in the browser -- no server round-trip for the transcription. The command is parsed into a target NPC and an action, then injected as a high-priority directive override for 6 seconds.

| Command | Effect |
|---|---|
| hostile go to blue zone | HOSTILE walks to the blue tile |
| ally come here | ALLY follows the player |
| wanderer dance | WANDERER stops and dances |
| everyone stop | All three NPCs idle |
| hostile run away | HOSTILE flees from the player |

Cross-Modal Sync Score (CMSS): 100% [93-100% CI] across 50 trials with unambiguous directional commands.
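As a rough illustration of the parse-and-override step, the sketch below splits a transcript into a target and an instruction and attaches a 6-second expiry. The `Override` type and `parse_command` function are hypothetical. In the project, transcription happens in the browser via the Web Speech API, so the real parser may live in JavaScript.

```python
import time
from dataclasses import dataclass
from typing import Optional, Tuple

ROLES = ("hostile", "ally", "wanderer")
OVERRIDE_SECONDS = 6.0  # override window stated above


@dataclass
class Override:
    targets: Tuple[str, ...]  # upper-case NPC roles the command applies to
    instruction: str          # remaining words, passed on as the directive
    expires_at: float         # timestamp after which the override lapses

    def active(self, now: float) -> bool:
        return now < self.expires_at


def parse_command(transcript: str, now: Optional[float] = None) -> Optional[Override]:
    """Split 'hostile run away' into a target role and an instruction."""
    words = transcript.lower().strip().split()
    if len(words) < 2:
        return None
    head, rest = words[0], " ".join(words[1:])
    if head == "everyone":
        targets = tuple(r.upper() for r in ROLES)
    elif head in ROLES:
        targets = (head.upper(),)
    else:
        return None  # no recognizable target; ignore the utterance
    start = time.monotonic() if now is None else now
    return Override(targets, rest, start + OVERRIDE_SECONDS)


cmd = parse_command("Hostile go to blue zone", now=100.0)
```

Storing an expiry timestamp rather than a countdown lets the fast loop check `active()` on each call without a separate timer.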


Benchmark results

All evaluations were run against cloud-hosted NVIDIA NIM, so latency is dominated by the API round trip (~2.3s). The architecture targets local deployment, where per-call latency is expected to fall below one second per NPC.

| Metric | Result | N | Notes |
|---|---|---|---|
| VTRT evasion w/ directive | 96% [86-99%] | 50 | Phase 3; FSM + safety verifier |
| VTRT evasion no directive | 0% [0-7%] | 50 | Nano alone; never evades |
| Directive lift (survival) | +96pp | -- | 0%→96%; hierarchy proof |
| Survive directive adherence | 100% [89-100%] | 30 | |
| Collect directive adherence | 100% [89-100%] | 30 | |
| ACRL mean +/- std | 3527+/-1175ms | 50 | |
| ACRL P95 | 4448ms | 50 | |
| CMSS direction alignment | 100% [93-100%] | 50 | |
| Adversarial JSON validity | 100% | 50 | |
| MUR | 95.8% [92-98%] | 240 | Memory utilization rate |
| SGPR | 70.4% [64-76%] | 240 | Scene graph parse rate |
| RDS mean JSD (Phase-1 baseline=0.0) | 0.6667 | 3 role pairs | Phase 4 role differentiation |
| Grid Coverage | 0.7+/-0.5 cells | 24 Super cycles | |
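The RDS row reports a mean Jensen-Shannon divergence (JSD) over three role pairs. The sketch below shows how such a score could be computed from per-role action frequencies. The base-2 logarithm (which bounds JSD at 1) and the dictionary layout are assumptions, and the example distributions are invented to show the two extremes.

```python
import math
from itertools import combinations


def jsd(p: dict, q: dict) -> float:
    """Jensen-Shannon divergence in bits (0 = identical, 1 = disjoint)."""
    total = 0.0
    for key in set(p) | set(q):
        pk, qk = p.get(key, 0.0), q.get(key, 0.0)
        mk = 0.5 * (pk + qk)  # mixture distribution
        if pk > 0:
            total += 0.5 * pk * math.log2(pk / mk)
        if qk > 0:
            total += 0.5 * qk * math.log2(qk / mk)
    return total


def mean_pairwise_jsd(dists: dict) -> float:
    """Average JSD over all unordered pairs of roles."""
    pairs = list(combinations(sorted(dists), 2))
    return sum(jsd(dists[a], dists[b]) for a, b in pairs) / len(pairs)


# Illustrative action frequencies only; not measured data from the project.
identical = {role: {"move_forward": 1.0} for role in ("HOSTILE", "ALLY", "WANDERER")}
disjoint = {
    "HOSTILE": {"move_forward": 1.0},
    "ALLY": {"turn_left": 1.0},
    "WANDERER": {"idle": 1.0},
}
baseline_score = mean_pairwise_jsd(identical)
max_score = mean_pairwise_jsd(disjoint)
```

Under these assumptions, a score of 0.0 means every role picks actions with the same frequencies, and 1.0 means no two roles ever pick the same action.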

Phase 1 — Efficiency

  • Event-triggered inference with frame differencing (threshold: 19)
  • Skip rate: ~48% of frames skipped per session
  • Compact output format toggle added
  • Inference efficiency logged to CSV and JSON summary

Phase 2 — State & Spatial Memory

  • NPCMemory: per-role episodic memory (last 3 events) injected into Super directive
  • OccupancyGrid: 8x8 2D world model updated from scene descriptions, feeds Super
  • SceneGraph: structured entity extraction from Nano JSON responses (SGPR 70.4%)
  • Memory utilization: 95.8% of Nano calls receive non-empty memory context
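The two memory structures can be sketched with standard containers. The class names match the bullets above, but the method names, the string event format, and the cell labels are assumptions. The repository's versions may differ.

```python
from collections import deque


class NPCMemory:
    """Per-role episodic memory that keeps only the most recent events."""

    def __init__(self, max_events: int = 3) -> None:
        self._events = {}  # role -> deque of event strings
        self._max = max_events

    def record(self, role: str, event: str) -> None:
        self._events.setdefault(role, deque(maxlen=self._max)).append(event)

    def context(self, role: str) -> str:
        """Text suitable for injection into a prompt; empty if no events yet."""
        return "; ".join(self._events.get(role, ()))


class OccupancyGrid:
    """8x8 world model; each cell stores the last label seen there."""

    def __init__(self, size: int = 8) -> None:
        self.size = size
        self.cells = [[None] * size for _ in range(size)]

    def mark(self, row: int, col: int, label: str) -> None:
        if 0 <= row < self.size and 0 <= col < self.size:
            self.cells[row][col] = label

    def coverage(self) -> int:
        """Number of cells that have been observed at least once."""
        return sum(cell is not None for row in self.cells for cell in row)


memory = NPCMemory()
for event in ["saw coin", "player approached", "collected coin", "threat on left"]:
    memory.record("ALLY", event)

grid = OccupancyGrid()
grid.mark(2, 5, "coin")
grid.mark(9, 9, "ignored")  # out of bounds, silently dropped
```

A bounded `deque` discards the oldest event automatically, which keeps the injected context at a fixed size.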

Phase 3 — Structured Planning & Safety

  • NPCStateMachine: per-NPC FSM with 5 states (EXPLORE/DEFEND/COLLECT/EVADE/IDLE)
  • FSM enforces minimum 2-frame state commitment — prevents mid-evasion reversals
  • Super now returns structured JSON directive: state, duration, priority fields
  • SafetyVerifier: rule-based action filter blocks contradictory actions per FSM state
  • VTRT evasion improved from 90% (Phase 2) to 96% (Phase 3)
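A compact sketch of the state commitment and action filter described in this phase. The five state names and the 2-frame minimum come from the bullets above. The rule table in `BLOCKED`, the substitute action, and the method names are assumptions made for illustration.

```python
from enum import Enum


class State(Enum):
    EXPLORE = "EXPLORE"
    DEFEND = "DEFEND"
    COLLECT = "COLLECT"
    EVADE = "EVADE"
    IDLE = "IDLE"


MIN_COMMIT_FRAMES = 2  # minimum frames a state is held before it may change

# Hypothetical rule table: actions treated as contradicting a state.
BLOCKED = {State.EVADE: {"move_forward"}, State.IDLE: {"move_forward"}}
SAFE_ACTION = "idle"  # placeholder substitute action


class NPCStateMachine:
    def __init__(self) -> None:
        self.state = State.EXPLORE
        self.frames_in_state = 0

    def step(self, requested: State) -> State:
        """Advance one frame; honor a change only after the commitment window."""
        if requested != self.state and self.frames_in_state >= MIN_COMMIT_FRAMES:
            self.state = requested
            self.frames_in_state = 0
        self.frames_in_state += 1
        return self.state


def verify(state: State, action: str) -> str:
    """Rule-based filter: replace actions that contradict the current state."""
    return SAFE_ACTION if action in BLOCKED.get(state, set()) else action


fsm = NPCStateMachine()
requests = [State.EXPLORE, State.EXPLORE, State.EVADE, State.COLLECT, State.COLLECT]
trace = [fsm.step(s).name for s in requests]
```

In this trace the first COLLECT request arrives one frame into EVADE and is ignored, so the NPC stays in EVADE for two frames before switching.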

Phase 4 — Multi-Agent Coordination

  • Shared Blackboard: thread-safe ring buffer (max 5 entries) for short messages in the format {"from": role, "msg": str, "t": timestamp}
  • Nano now reads the last 2 blackboard entries for local context before acting
  • Role-Specific Directives: Super now emits 3 differentiated JSON directives for HOSTILE, ALLY, and WANDERER instead of one broadcast directive
  • Tool Use: Super uses lightweight Python tools for pathfinding and entity position lookups derived from OccupancyGrid and NPCMemory
  • RDS mean JSD reached 0.6667 against the Phase-1 baseline of 0.0, confirming role differentiation
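The shared blackboard can be approximated with a locked, bounded deque. The entry format and the 5-entry limit follow the first bullet above. The class layout and method names are assumptions.

```python
import threading
import time
from collections import deque
from typing import Optional


class Blackboard:
    """Thread-safe ring buffer of short inter-NPC messages."""

    def __init__(self, max_entries: int = 5) -> None:
        self._lock = threading.Lock()
        self._entries = deque(maxlen=max_entries)  # oldest entry drops first

    def post(self, role: str, msg: str, t: Optional[float] = None) -> None:
        entry = {"from": role, "msg": msg, "t": time.time() if t is None else t}
        with self._lock:
            self._entries.append(entry)

    def latest(self, n: int = 2) -> list:
        """Return the n most recent entries, oldest first."""
        if n <= 0:
            return []
        with self._lock:
            return list(self._entries)[-n:]


board = Blackboard()
for i in range(7):
    board.post("ALLY", f"msg {i}", t=float(i))
recent = board.latest()  # what a Nano call would read before acting
```

Copying the entries while holding the lock means a reader never sees the buffer partway through an update.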

Inference efficiency

The perception loop uses frame differencing to skip inference on visually identical frames. Each NPC maintains a per-role frame cache. Inference is triggered only when the mean absolute pixel difference exceeds a calibrated threshold (19.0), or when a maximum repeat count is reached. Across live sessions, ~48% of frames are skipped, reducing redundant NIM API calls without affecting behavioral quality.
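The skip logic can be sketched as follows, assuming frames arrive as raw 8-bit grayscale bytes of equal length. The real pipeline receives JPEGs, which would need decoding first. The 19.0 threshold is from the text. The `max_repeats` default and all names are hypothetical.

```python
class FrameDiffer:
    """Decide per NPC role whether a new frame warrants an inference call."""

    def __init__(self, threshold: float = 19.0, max_repeats: int = 5) -> None:
        self.threshold = threshold      # mean absolute pixel difference (0-255)
        self.max_repeats = max_repeats  # hypothetical cap on consecutive skips
        self._last = {}                 # role -> last frame that triggered inference
        self._skipped = {}              # role -> consecutive skips so far

    @staticmethod
    def mean_abs_diff(a: bytes, b: bytes) -> float:
        if len(a) != len(b):
            return float("inf")  # resolution changed; treat as a new scene
        return sum(abs(x - y) for x, y in zip(a, b)) / max(len(a), 1)

    def should_infer(self, role: str, frame: bytes) -> bool:
        last = self._last.get(role)
        changed = last is None or self.mean_abs_diff(frame, last) > self.threshold
        forced = self._skipped.get(role, 0) >= self.max_repeats
        if changed or forced:
            self._last[role] = frame
            self._skipped[role] = 0
            return True
        self._skipped[role] = self._skipped.get(role, 0) + 1
        return False


differ = FrameDiffer(max_repeats=2)
dark, bright = bytes([10] * 64), bytes([200] * 64)
decisions = [differ.should_infer("HOSTILE", f) for f in [dark, dark, dark, dark, bright]]
```

In this run the first frame always triggers inference, the next two identical frames are skipped, the fourth is forced by the repeat cap, and the bright frame triggers on the pixel difference.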

Metrics overview

Latency distribution

ACRL distribution

Most calls land between 2300 and 3000ms (API floor plus processing). The right tail comes from NIM rate-limit back-off. Local deployment would remove both.
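For reference, the summary statistics in the benchmark table (mean, standard deviation, P95) can be derived from raw per-call latencies along these lines. The nearest-rank percentile method is an assumption, and the sample values are invented to mimic a right-tailed distribution.

```python
import statistics


def summarize_latency(samples_ms: list) -> dict:
    """Mean, sample standard deviation, and 95th percentile in milliseconds."""
    ordered = sorted(samples_ms)
    # Nearest-rank P95: the smallest value with at least 95% of samples at or below it.
    rank = max(1, -(-95 * len(ordered) // 100))  # ceil(0.95 * n) using integers
    return {
        "mean": statistics.mean(ordered),
        "std": statistics.stdev(ordered) if len(ordered) > 1 else 0.0,
        "p95": ordered[rank - 1],
    }


# Made-up samples for illustration; not measurements from the project.
samples = [2300, 2400, 2500, 2600, 2700, 2800, 2900, 3000, 3100, 5000]
summary = summarize_latency(samples)
```

A single slow call, such as one delayed by back-off, pulls P95 and the standard deviation up far more than it moves the mean.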

To reproduce:

python run_evaluation.py --trials audio --n 50
python run_evaluation.py --trials visual --n 100
python run_evaluation.py --trials hierarchy --n 30
python run_evaluation.py --trials adversarial --n 10
python run_analysis.py
# Output: data/results/PAPER_RESULTS.txt

Five things that make this different

| | Default 2025 agent | OmniNPC |
|---|---|---|
| Sensing | Engine API (game.get_state()) | Rendered pixels only |
| Cadence | Request-response, sequential | Continuous: 3 NPCs at ~0.37 Hz each, in parallel |
| Reasoning | One model, one prompt | Hierarchical: Nano 8B (reflex) + Super 49B (strategy) + FSM state machine |
| Agents | Usually 1 | 3 agents sharing one directive + on-demand dialogue |
| Modality | Vision or audio | Vision + microphone simultaneously |

Pixel-only sensing is a hard requirement for generality. An agent that needs game.get_state() works only in the small minority of games that expose such an API. An agent that works from pixels can, in principle, target any game that renders to a screen.


Deploy to Hugging Face Spaces

  1. Create a new Space at huggingface.co/new-space
  2. Select Docker as the SDK, set App port to 7860
  3. Clone the Space repo and push this codebase to it:
    git remote add space https://huggingface.co/spaces/<your-username>/omninpc
    git push space main
  4. In your Space settings, add a Secret named NVIDIA_NIM_API_KEY with your NIM key
  5. The Space will build the Docker image and serve the game at your Space URL

The Dockerfile installs only what api_server.py needs (no Panda3D, no audio stack). Build time is roughly 2-3 minutes.


Run locally

HTML game (recommended)

export NVIDIA_NIM_API_KEY=nvapi-...
python api_server.py --port 8000
# open http://localhost:8000/

Panda3D FPS arena (research view)

export NVIDIA_NIM_API_KEY=nvapi-...
python run_demo.py
# requires: pip install panda3d SpeechRecognition pyaudio

Requirements

  • Python 3.10+ (3.11 recommended)
  • NVIDIA NIM API key -- free at build.nvidia.com
git clone <repo>
cd omninpc
python -m venv venv && source venv/bin/activate
pip install -r requirements.txt
echo "NVIDIA_NIM_API_KEY=nvapi-..." > .env

Controls

| Input | Action |
|---|---|
| WASD | Walk |
| Mouse | Look |
| Space | Jump |
| Shift | Sprint |
| E | Pick up coin |
| V | Toggle voice |
| M | Mute / unmute |
| R | Record session (.webm) |
| Esc | Release mouse |

Repo layout

omninpc/
  api_server.py             FastAPI entry -- what you deploy
  html_game/
    index.html
    game.js
  Dockerfile                HF Spaces / Docker target
  inference/                Nano + Super agents, NIM client, prompt templates
  orchestrator/             Async perception + strategy loops
  actions/                  Action schema + executors
  capture/                  Screen capture (dxcam / mss fallback)
  capture/frame_differ.py   Frame differencing + skip logic
  metrics/                  Latency tracker + CSV logger (15 columns, including npc_role, did_infer, skip_reason)
  metrics/latency_tracker.py  Skip rate + inference metrics
  evaluation/               Trial runner, results analyser
  run_evaluation.py         Evaluation entry point
  run_analysis.py           Paper table generator
  data/results/             trial_results.json, PAPER_RESULTS.txt
  docs/
    images/                 Screenshots and architecture diagrams
    figures/                Generated evaluation charts
    generate_figures.py     Figure generation script
  tests/                    26 pytest tests, all passing
  config/config.yaml
  tools/calibrate_threshold.py  Threshold calibration script

Limitations

Cloud-hosted NIM adds ~2.3s of latency per call. Mean ACRL is 2668ms and mean VTRT is 3345ms, both well above production NPC standards. The architecture targets local deployment, where these latencies are expected to drop below one second.

Test frames in the evaluation suite are 128x128 synthetic images. The directive lift results hold in live play, but the controlled evaluation uses synthetic input, not captured game frames.

CMSS of 100% is measured on unambiguous directional commands. Ambiguous natural-language commands ("go to the health station") are not covered by this metric.

At temperature 0.1, Nano is near-deterministic. Role differentiation (HOSTILE vs ALLY vs WANDERER) in live play comes from Super's per-role directive content, not from within-model temperature diversity.


Authors

Aadi Joshi and Kavya Bhand


License

MIT
