OmniNPC

title	OmniNPC - Nemotron Live Demo
emoji	🎮
colorFrom	green
colorTo	indigo
sdk	docker
app_port	7860
pinned	true

Hierarchical real-time screen-space embodied agents via the Nemotron Super+Nano pattern

Three NPCs. Rendered pixels only. No engine API.

Nemotron Nano VL 8B processes the canvas every ~700ms and outputs a JSON action. Every 10 seconds, Nemotron Super 49B reads the scoreboard and rewrites the strategy directive that steers all three NPCs simultaneously. The directive injection is the entire communication channel between the two loops -- Super writes a sentence, Nano reads it.

The key result: without a Super directive, Nano never evades threats (0% across 50 trials). Inject a specific "EVADE" directive and evasion reaches 100% [89-100% CI]. The hierarchy is not an optimization. It is the only mechanism that produces non-default behavior.

export NVIDIA_NIM_API_KEY=nvapi-...
pip install -r requirements.txt
python api_server.py --port 8000

Open http://localhost:8000/ and click START.

What it looks like

Left: live scoreboard. Center: 3D arena with HOSTILE (red), ALLY (blue), WANDERER (yellow) and the human player. Top-right: per-call Nano latency chart. Right panel: streaming scene descriptions from each NPC's inference call.

How it works

The browser captures a JPEG of the Three.js canvas every ~700ms and sends it to the FastAPI server via POST /infer. The server prepends the current Super directive to Nano's system prompt, calls NVIDIA NIM, and returns the action JSON. The three NPCs take turns through this loop in round-robin order. Every 10 seconds, a separate POST /super call sends the scoreboard text to Super 49B, which returns one sentence of strategic guidance. That sentence becomes the directive in every subsequent Nano call until the next Super update.
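
The directive hand-off described above can be sketched in a few lines. This is a minimal illustration, not the code in api_server.py: `DirectiveStore`, `build_system_prompt`, `BASE_PROMPT`, and the "STRATEGY DIRECTIVE:" prefix are all hypothetical names chosen for the example.

```python
from dataclasses import dataclass, field
from itertools import cycle
from threading import Lock

ROLES = ("HOSTILE", "ALLY", "WANDERER")

# Hypothetical base prompt; the real prompt templates live in inference/.
BASE_PROMPT = "You control the {role} NPC. Reply with one JSON action."


@dataclass
class DirectiveStore:
    """Holds the latest Super directive: one loop writes it, the other reads it."""
    _text: str = ""
    _lock: Lock = field(default_factory=Lock)

    def set(self, text: str) -> None:
        with self._lock:
            self._text = text.strip()

    def get(self) -> str:
        with self._lock:
            return self._text


def build_system_prompt(role: str, store: DirectiveStore) -> str:
    """Prepend the current directive (if any) to the role's base prompt."""
    directive = store.get()
    base = BASE_PROMPT.format(role=role)
    if directive:
        return f"STRATEGY DIRECTIVE: {directive}\n{base}"
    return base


store = DirectiveStore()
turns = cycle(ROLES)  # round-robin order for the perception loop

first = build_system_prompt(next(turns), store)   # no directive has been set yet
store.set("EVADE IMMEDIATELY")                    # what a /super call would write
second = build_system_prompt(next(turns), store)  # the next NPC sees the directive
print(first)
print(second)
```

The only shared state is a single string, which is the point of the design: Super never calls Nano, it only changes what the next Nano prompt says.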

The directive injection is the entire communication channel between the two loops. Super does not call Nano directly. It writes a sentence. Nano reads it.

The reasoning trace

Each inference cycle, Nano returns a scene_description field alongside the action. The reasoning panel streams these in real time, labeled by NPC role. The SUPER-49B directive is shown at the top -- this is what steers all three NPCs simultaneously.
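
A sketch of how a Nano reply carrying both fields might be validated before it reaches the game. The action vocabulary and the idle fallback are assumptions for illustration; only move_forward and the scene_description field are named in this README.

```python
import json

# Hypothetical action vocabulary; only "move_forward" appears in this README.
ALLOWED_ACTIONS = {"move_forward", "turn_left", "turn_right", "idle"}
FALLBACK = {"action": "idle", "scene_description": ""}


def parse_nano_reply(raw: str) -> dict:
    """Return a validated action dict, or a safe fallback on malformed output."""
    try:
        reply = json.loads(raw)
    except json.JSONDecodeError:
        return dict(FALLBACK)
    action = reply.get("action") if isinstance(reply, dict) else None
    if not isinstance(action, str) or action not in ALLOWED_ACTIONS:
        return dict(FALLBACK)
    return {
        "action": action,
        "scene_description": str(reply.get("scene_description", "")),
    }


ok = parse_nano_reply('{"action": "move_forward", "scene_description": "coin ahead"}')
bad = parse_nano_reply("this is not JSON")
print(ok)
print(bad)
```

Falling back to a neutral action keeps the loop running when the model returns something unparseable, and the scene_description string is what the reasoning panel would display.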

The key finding

Without a Super directive, Nano is a deterministic coin-collector. Across 50 trials on a conflict scene (threat visible on left, coin ahead), the evasion rate was 0% [0-7% CI]. Every single trial: move_forward toward the coin.

Inject a specific "EVADE IMMEDIATELY" directive from Super 49B and the evasion rate reaches 100% [89-100% CI] on the conflict frame. The VTRT evaluation (50 trials, random threat positions) shows 70% [56-81% CI] evasion with the directive versus 0% [0-7% CI] without.

The Super directive is not an optimization. It is the only mechanism that produces non-default behavior. This is the hierarchy claim, demonstrated empirically.
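
The README does not say how the bracketed confidence intervals were computed. A 95% Wilson score interval reproduces the 0/50 and 30/30 ranges quoted here, so the sketch below assumes that method; the function name is illustrative.

```python
import math
from typing import Tuple


def wilson_interval(successes: int, trials: int, z: float = 1.96) -> Tuple[float, float]:
    """Wilson score interval for a binomial proportion (z=1.96 for ~95%)."""
    if trials <= 0:
        raise ValueError("trials must be positive")
    p = successes / trials
    denom = 1 + z**2 / trials
    center = (p + z**2 / (2 * trials)) / denom
    half = z * math.sqrt(p * (1 - p) / trials + z**2 / (4 * trials**2)) / denom
    # Clamp to [0, 1] to absorb floating-point noise at the extremes.
    return max(0.0, center - half), min(1.0, center + half)


no_directive = wilson_interval(0, 50)     # Nano alone: 0 evasions in 50 trials
with_directive = wilson_interval(30, 30)  # assumed 30 of 30 with the directive
print(no_directive, with_directive)
```

Under this assumption, 0 of 50 gives an upper bound of about 7% and 30 of 30 gives a lower bound of about 89%. Unlike the simple normal approximation, the Wilson interval stays informative when the observed rate is exactly 0% or 100%.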

Voice commands

Press V in-game to enable voice, then say a command. The Web Speech API transcribes it in the browser, so transcription needs no server round-trip. The command is parsed into a target NPC and an action, then injected as a high-priority directive override for that NPC for 6 seconds.

Command	Effect
`hostile go to blue zone`	HOSTILE walks to the blue tile
`ally come here`	ALLY follows the player
`wanderer dance`	WANDERER stops and dances
`everyone stop`	All three NPCs idle
`hostile run away`	HOSTILE flees from the player

Cross-Modal Sync Score (CMSS): 100% [93-100% CI] across 50 trials with unambiguous directional commands.
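
The parsing step can be sketched as follows. This is an assumption-laden illustration, not the parser in game.js: it treats the first word as the target and the rest as the action, and `Override` and `parse_command` are hypothetical names.

```python
import time
from dataclasses import dataclass
from typing import Optional, Tuple

ROLES = ("hostile", "ally", "wanderer")
OVERRIDE_SECONDS = 6.0  # override window stated in this README


@dataclass
class Override:
    targets: Tuple[str, ...]
    action: str
    expires_at: float

    def active(self, now: float) -> bool:
        """True while the override window is still open."""
        return now < self.expires_at


def parse_command(transcript: str, now: Optional[float] = None) -> Optional[Override]:
    """Split '<target> <action...>' into an Override, or None if unrecognised."""
    words = transcript.lower().split()
    if len(words) < 2:
        return None
    head, action = words[0], " ".join(words[1:])
    if head == "everyone":
        targets = ROLES
    elif head in ROLES:
        targets = (head,)
    else:
        return None
    start = time.monotonic() if now is None else now
    return Override(targets, action, start + OVERRIDE_SECONDS)


cmd = parse_command("hostile go to blue zone", now=100.0)
print(cmd)
```

Passing `now` explicitly makes the expiry testable. In a live loop the default monotonic clock avoids jumps from wall-clock adjustments.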

Benchmark results

All evaluations run against cloud-hosted NVIDIA NIM. Latency is dominated by the API round-trip (~2.3s). The architecture targets local deployment, where per-call latency is expected to fall below 1 second (more than 1 Hz per NPC).

Metric	Result	N	Notes
VTRT evasion w/ directive	96% [86-99%]	50	Phase 3; FSM + safety verifier
VTRT evasion no directive	0% [0-7%]	50	Nano alone — never evades
Directive lift (survival)	+96pp	--	0%→96%; hierarchy proof
Survive directive adherence	100% [89-100%]	30
Collect directive adherence	100% [89-100%]	30
ACRL mean +/- std	3527+/-1175ms	50
ACRL P95	4448ms	50
CMSS direction alignment	100% [93-100%]	50
Adversarial JSON validity	100%	50
MUR	95.8% [92-98%]	240	Memory utilization rate
SGPR	70.4% [64-76%]	240	Scene graph parse rate
RDS mean JSD (Phase-1 baseline=0.0)	0.6667	3 role pairs	Phase 4 role differentiation
Grid Coverage	0.7+/-0.5 cells	24	Super cycles
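
The RDS row reports a mean Jensen-Shannon divergence (JSD) over the three role pairs. A minimal sketch of that computation follows, assuming each role is summarised as a probability distribution over actions and that the logarithm is base 2 (so each pairwise value lies between 0 and 1). The README states neither detail, and the distributions below are made-up inputs, not measured data.

```python
import math
from itertools import combinations


def jsd(p: dict, q: dict) -> float:
    """Jensen-Shannon divergence between two discrete distributions (log base 2)."""
    total = 0.0
    for key in set(p) | set(q):
        pk, qk = p.get(key, 0.0), q.get(key, 0.0)
        mk = 0.5 * (pk + qk)  # mixture distribution
        if pk > 0:
            total += 0.5 * pk * math.log2(pk / mk)
        if qk > 0:
            total += 0.5 * qk * math.log2(qk / mk)
    return total


def mean_pairwise_jsd(dists: dict) -> float:
    """Average JSD over every unordered pair of roles."""
    pairs = list(combinations(sorted(dists), 2))
    return sum(jsd(dists[a], dists[b]) for a, b in pairs) / len(pairs)


# Illustrative action distributions only.
example = {
    "HOSTILE": {"move_forward": 0.7, "turn_left": 0.3},
    "ALLY": {"move_forward": 0.2, "idle": 0.8},
    "WANDERER": {"turn_left": 0.5, "turn_right": 0.5},
}
print(round(mean_pairwise_jsd(example), 4))
```

Identical distributions score 0 and distributions with no shared actions score 1, which is why a Phase-1 baseline of 0.0 corresponds to all roles behaving the same way.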

Phase 1 — Efficiency

Event-triggered inference with frame differencing (threshold: 19)
Skip rate: ~48% of frames skipped per session
Compact output format toggle added
Inference efficiency logged to CSV and JSON summary

Phase 2 — State & Spatial Memory

NPCMemory: per-role episodic memory (last 3 events) injected into Super directive
OccupancyGrid: 8x8 2D world model updated from scene descriptions, feeds Super
SceneGraph: structured entity extraction from Nano JSON responses (SGPR 70.4%)
Memory utilization: 95.8% of Nano calls receive non-empty memory context
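
A compact sketch of the two memory structures listed above, assuming a bounded deque per role and a plain 8x8 list of lists. The method names (`record`, `context`, `mark`, `visited`) are illustrative and may not match the classes in the repository.

```python
from collections import deque


class NPCMemory:
    """Per-role episodic memory that keeps only the most recent events."""

    def __init__(self, roles, max_events=3):
        self._events = {role: deque(maxlen=max_events) for role in roles}

    def record(self, role, event):
        self._events[role].append(event)  # the oldest event falls off automatically

    def context(self, role):
        """Text that could be injected into a prompt; empty if nothing is stored."""
        return "; ".join(self._events[role])


class OccupancyGrid:
    """Coarse 2D world model: each cell stores the last label seen there."""

    def __init__(self, size=8):
        self.size = size
        self.cells = [[None] * size for _ in range(size)]

    def mark(self, row, col, label):
        if 0 <= row < self.size and 0 <= col < self.size:
            self.cells[row][col] = label

    def visited(self):
        return sum(cell is not None for row in self.cells for cell in row)


memory = NPCMemory(["HOSTILE", "ALLY", "WANDERER"])
for event in ["saw coin", "saw player", "took damage", "reached blue tile"]:
    memory.record("ALLY", event)

grid = OccupancyGrid()
grid.mark(2, 3, "coin")
grid.mark(9, 9, "ignored")  # out of bounds, silently dropped
print(memory.context("ALLY"), grid.visited())
```

Bounding the memory to three events keeps the injected context short, which matters when every extra prompt token adds latency to each call.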

Phase 3 — Structured Planning & Safety

NPCStateMachine: per-NPC FSM with 5 states (EXPLORE/DEFEND/COLLECT/EVADE/IDLE)
FSM enforces minimum 2-frame state commitment — prevents mid-evasion reversals
Super now returns structured JSON directive: state, duration, priority fields
SafetyVerifier: rule-based action filter blocks contradictory actions per FSM state
VTRT evasion improved from 90% (Phase 2) to 96% (Phase 3)
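
The commitment rule and the action filter can be illustrated together. This sketch assumes a frame counter per state and a static table of blocked actions. The table contents, the idle fallback, and the `step` and `verify_action` names are hypothetical.

```python
from enum import Enum


class State(Enum):
    EXPLORE = "EXPLORE"
    DEFEND = "DEFEND"
    COLLECT = "COLLECT"
    EVADE = "EVADE"
    IDLE = "IDLE"


MIN_COMMIT_FRAMES = 2  # minimum frames spent in a state before it may change

# Hypothetical rule table: actions that contradict a given state.
BLOCKED_ACTIONS = {
    State.EVADE: {"move_forward"},
    State.IDLE: {"move_forward", "turn_left", "turn_right"},
}


class NPCStateMachine:
    def __init__(self, initial=State.EXPLORE):
        self.state = initial
        self.frames_in_state = 0

    def step(self, requested):
        """Advance one frame; honour a state change only after the commitment window."""
        if requested != self.state and self.frames_in_state >= MIN_COMMIT_FRAMES:
            self.state = requested
            self.frames_in_state = 0
        self.frames_in_state += 1
        return self.state


def verify_action(state, action, fallback="idle"):
    """Replace an action that contradicts the current state with a fallback."""
    return fallback if action in BLOCKED_ACTIONS.get(state, set()) else action


fsm = NPCStateMachine()
history = [fsm.step(State.EVADE) for _ in range(3)]
print([s.value for s in history], verify_action(fsm.state, "move_forward"))
```

Because a new state must be held for two frames, a single contradictory model output cannot pull an NPC out of EVADE halfway through the manoeuvre.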

Phase 4 — Multi-Agent Coordination

Shared Blackboard: thread-safe ring buffer (max 5 entries) for short messages in the format {"from": role, "msg": str, "t": timestamp}
Nano now reads the last 2 blackboard entries for local context before acting
Role-Specific Directives: Super now emits 3 differentiated JSON directives for HOSTILE, ALLY, and WANDERER instead of one broadcast directive
Tool Use: Super uses lightweight Python tools for pathfinding and entity position lookups derived from OccupancyGrid and NPCMemory
RDS mean JSD reached 0.6667 against the Phase-1 baseline of 0.0, confirming role differentiation
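
A sketch of the shared blackboard as described: a lock-protected deque capped at five entries, with readers taking the last two. The entry format follows the one given above; the class and method names are assumptions.

```python
import threading
import time
from collections import deque


class Blackboard:
    """Thread-safe ring buffer of short inter-NPC messages."""

    def __init__(self, max_entries=5):
        self._entries = deque(maxlen=max_entries)
        self._lock = threading.Lock()

    def post(self, role, msg, t=None):
        entry = {"from": role, "msg": msg, "t": time.time() if t is None else t}
        with self._lock:
            self._entries.append(entry)  # the oldest entry is evicted when full

    def latest(self, n=2):
        """Return up to n most recent entries, oldest first."""
        if n <= 0:
            return []
        with self._lock:
            return list(self._entries)[-n:]


board = Blackboard()
for i in range(7):
    board.post("ALLY", f"m{i}", t=float(i))

recent = board.latest()
print([entry["msg"] for entry in recent])
```

Copying the entries into a list while holding the lock means readers never iterate the deque while another thread is appending to it.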

Inference efficiency

The perception loop uses frame differencing to skip inference on visually identical frames. Each NPC maintains a per-role frame cache. Inference is triggered only when the mean absolute pixel difference exceeds a calibrated threshold (19.0), or when a maximum repeat count is reached. Across live sessions, ~48% of frames are skipped, reducing redundant NIM API calls without affecting behavioral quality.
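
The skip logic above can be sketched without any image library by treating a frame as raw grayscale bytes. The `FrameDiffer` name and `should_infer` method are illustrative, the default repeat limit is made up, and comparing against the last frame that triggered inference (rather than the last frame seen) is an assumption.

```python
class FrameDiffer:
    """Decide per role whether a new frame differs enough to justify inference."""

    def __init__(self, threshold=19.0, max_repeats=5):
        self.threshold = threshold      # mean absolute pixel difference
        self.max_repeats = max_repeats  # hypothetical cap on consecutive skips
        self._last = {}                 # role -> last frame sent to inference
        self._skips = {}                # role -> consecutive skipped frames

    def should_infer(self, role, frame: bytes) -> bool:
        last = self._last.get(role)
        if last is None or not frame or len(last) != len(frame):
            changed = True  # nothing comparable is cached yet
        else:
            diff = sum(abs(a - b) for a, b in zip(last, frame)) / len(frame)
            changed = diff > self.threshold
        if changed or self._skips.get(role, 0) >= self.max_repeats:
            self._last[role] = frame
            self._skips[role] = 0
            return True
        self._skips[role] = self._skips.get(role, 0) + 1
        return False


differ = FrameDiffer(threshold=19.0, max_repeats=2)
base = bytes([100] * 16)
similar = bytes([110] * 16)    # mean difference 10, below the threshold
different = bytes([200] * 16)  # well above the threshold

decisions = [
    differ.should_infer("ALLY", base),
    differ.should_infer("ALLY", similar),
    differ.should_infer("ALLY", similar),
    differ.should_infer("ALLY", similar),  # forced by the repeat cap
    differ.should_infer("ALLY", different),
]
print(decisions)
```

The repeat cap guarantees that an NPC watching a static scene still gets a fresh inference periodically, so a stale action cannot persist indefinitely.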

Latency distribution

Most calls land between 2300 and 3000ms (API floor plus processing). The right tail comes from NIM rate-limit back-off. Local deployment would remove both the API round-trip and the back-off.

To reproduce:

python run_evaluation.py --trials audio --n 50
python run_evaluation.py --trials visual --n 100
python run_evaluation.py --trials hierarchy --n 30
python run_evaluation.py --trials adversarial --n 10
python run_analysis.py
# Output: data/results/PAPER_RESULTS.txt

Five things that make this different

	Default 2025 agent	OmniNPC
Sensing	Engine API (`game.get_state()`)	Rendered pixels only
Cadence	Request-response, sequential	Continuous: 3 NPCs at ~0.37 Hz each, in parallel
Reasoning	One model, one prompt	Hierarchical: Nano 8B (reflex) + Super 49B (strategy) + FSM state machine
Agents	Usually 1	3 agents sharing one directive + on-demand dialogue
Modality	Vision or audio	Vision + microphone simultaneously

Pixel-only sensing is a hard requirement for generality. If the agent needs game.get_state(), it works in almost no games, because the vast majority do not expose such an API. If it works from pixels, any game that renders to a screen is a valid target environment.

Deploy to Hugging Face Spaces

Create a new Space at huggingface.co/new-space
Select Docker as the SDK, set App port to 7860

Clone the Space repo and push this codebase to it:

git remote add space https://huggingface.co/spaces/<your-username>/omninpc
git push space main

In your Space settings, add a Secret named NVIDIA_NIM_API_KEY with your NIM key
The Space will build the Docker image and serve the game at your Space URL

The Dockerfile installs only what api_server.py needs (no Panda3D, no audio stack). Build time is roughly 2-3 minutes.

Run locally

HTML game (recommended)

export NVIDIA_NIM_API_KEY=nvapi-...
python api_server.py --port 8000
# open http://localhost:8000/

Panda3D FPS arena (research view)

export NVIDIA_NIM_API_KEY=nvapi-...
pip install panda3d SpeechRecognition pyaudio  # extra dependencies for this view
python run_demo.py

Requirements

Python 3.10+ (3.11 recommended)
NVIDIA NIM API key -- free at build.nvidia.com

git clone <repo>
cd omninpc
python -m venv venv && source venv/bin/activate
pip install -r requirements.txt
echo "NVIDIA_NIM_API_KEY=nvapi-..." > .env

Controls

Input	Action
WASD	Walk
Mouse	Look
Space	Jump
Shift	Sprint
E	Pick up coin
V	Toggle voice
M	Mute / unmute
R	Record session (.webm)
Esc	Release mouse

Repo layout

omninpc/
  api_server.py             FastAPI entry -- what you deploy
  html_game/
    index.html
    game.js
  Dockerfile                HF Spaces / Docker target
  inference/                Nano + Super agents, NIM client, prompt templates
  orchestrator/             Async perception + strategy loops
  actions/                  Action schema + executors
  capture/                  Screen capture (dxcam / mss fallback)
  capture/frame_differ.py   Frame differencing + skip logic
  metrics/                  Latency tracker + CSV logger (15 columns, including npc_role, did_infer, skip_reason)
  metrics/latency_tracker.py  Skip rate + inference metrics
  evaluation/               Trial runner, results analyser
  run_evaluation.py         Evaluation entry point
  run_analysis.py           Paper table generator
  data/results/             trial_results.json, PAPER_RESULTS.txt
  docs/
    images/                 Screenshots and architecture diagrams
    figures/                Generated evaluation charts
    generate_figures.py     Figure generation script
  tests/                    26 pytest tests, all passing
  config/config.yaml
  tools/calibrate_threshold.py  Threshold calibration script

Limitations

Cloud-hosted NIM adds ~2.3s of latency per call. Mean ACRL is 2668ms and mean VTRT is 3345ms -- well above production NPC standards. The architecture targets local deployment, where these latencies are expected to drop below 1 second.

Test frames in the evaluation suite are 128x128 synthetic images. The directive lift results hold in live play, but the controlled evaluation uses synthetic input, not captured game frames.

CMSS of 100% is measured on unambiguous directional commands. Ambiguous natural-language commands ("go to the health station") are not covered by this metric.

At temperature 0.1, Nano is near-deterministic. Role differentiation (HOSTILE vs ALLY vs WANDERER) in live play comes from Super's per-role directive content, not from within-model temperature diversity.

Authors

Aadi Joshi and Kavya Bhand

License

MIT
