A tiny, opinionated starter for OpenAI's gpt-realtime-2. Two files do the work. The browser handles audio over WebRTC. Python forwards one handshake message with your API key attached. That's it.
You ─🎤─► Browser ═══════ audio + tool events ═══════ OpenAI Realtime
│
└── /session ── Python (40 lines, just adds the API key)
- Voice in, voice out, real-time, low latency.
- Function calling. One JS object to add new tools.
- API key stays on the server. The browser never sees it.
- One env file controls model, voice, and system prompt.
- Production-friendly. Startup validation, health check, structured errors.
- No build step. No bundler. No frameworks.
git clone <this-repo> && cd gpt2-rt
cp .env.example .env # paste your OPENAI_API_KEY
uv run server.py
Open http://localhost:3000, click the mic, say "what time is it?" or "what's the weather where I am?".
Prefer pip?
pip install -r requirements.txt
python server.py
- Browser opens the mic and creates a WebRTC peer connection.
- Browser POSTs its SDP offer to /session.
- Python attaches Authorization: Bearer $OPENAI_API_KEY and forwards the offer to https://api.openai.com/v1/realtime/calls along with the session config (model, voice, instructions).
- OpenAI returns its SDP answer. Python passes it back.
- WebRTC peer connection opens directly between browser and OpenAI. Audio streams. A side data channel named oai-events carries JSON events (transcripts, tool calls).
- When the model wants to call a tool, the browser runs the matching function in TOOL_IMPL, sends the result back, then triggers a response.
Python never touches audio. It is signaling only.
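As a rough illustration of the last two steps, the browser's handler for the oai-events channel might look like the sketch below. It assumes tool calls arrive as response.output_item.done events whose item has type function_call, and that results go back as a conversation.item.create event followed by response.create. The transcript event name is the one this README mentions. handleServerEvent, send, and onTranscript are hypothetical names, not necessarily what static/index.html uses.

```javascript
// Sketch of a data-channel event router. `toolImpl` maps tool names to
// functions, `send` posts one JSON event back to the model, and
// `onTranscript` receives transcript text as it streams in.
async function handleServerEvent(event, toolImpl, send, onTranscript) {
  // Streaming transcript of the model's speech.
  if (event.type === "response.audio_transcript.delta") {
    onTranscript(event.delta);
    return;
  }

  // A completed function-call item: run the matching tool.
  if (event.type === "response.output_item.done" &&
      event.item && event.item.type === "function_call") {
    const { name, call_id, arguments: rawArgs } = event.item;
    let output;
    try {
      const impl = toolImpl[name];
      if (!impl) throw new Error(`Unknown tool: ${name}`);
      output = await impl(JSON.parse(rawArgs || "{}"));
    } catch (err) {
      // Report failures to the model instead of leaving the call hanging.
      output = { error: String(err.message || err) };
    }

    // Return the result, then ask the model to continue speaking.
    send({
      type: "conversation.item.create",
      item: { type: "function_call_output", call_id, output: JSON.stringify(output) }
    });
    send({ type: "response.create" });
  }
}
```

In the page, this would be called from the data channel's onmessage handler with the parsed event, TOOL_IMPL, and a send function that serializes each event with JSON.stringify before writing it to the channel.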
The starter ships three real tools, no extra API keys required:
- get_current_time uses the browser's locale and timezone. Try "what day is it?" or "what time is it in my timezone?".
- get_weather uses browser geolocation (with permission) and the free wttr.in service. Try "what's the weather where I am?" or "how warm is it in Tokyo?".
- see_screen captures your screen on demand and sends it to the model. Click Share Screen, then ask "what's on my screen?" or "help me with this". The model describes what it sees and guides you through the task.
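For a sense of what a tool like get_weather involves, here is a minimal sketch of the wttr.in side. It assumes the service's JSON output (?format=j1) and its current_condition fields. weatherUrl, summarizeWeather, and getWeather are illustrative helpers, not the starter's actual code.

```javascript
// Build a wttr.in URL for a place name. An empty place lets wttr.in
// guess the location from the request's IP address.
function weatherUrl(place = "") {
  return `https://wttr.in/${encodeURIComponent(place)}?format=j1`;
}

// Reduce the JSON payload to the few fields worth speaking aloud.
function summarizeWeather(payload) {
  const now = payload.current_condition?.[0];
  if (!now) return { error: "No current conditions in response." };
  return {
    temp_c: Number(now.temp_C),
    feels_like_c: Number(now.FeelsLikeC),
    description: now.weatherDesc?.[0]?.value ?? "unknown"
  };
}

// Fetch and summarize. The return value is plain JSON, so it can be
// handed straight back to the model as a tool result.
async function getWeather({ place } = {}) {
  const res = await fetch(weatherUrl(place));
  if (!res.ok) return { error: `wttr.in returned HTTP ${res.status}` };
  return summarizeWeather(await res.json());
}
```

Keeping the result small matters here: the model reads the whole tool output, so a trimmed summary keeps the spoken answer short.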
Two places in static/index.html:
// 1. Describe the tool to the model
const TOOLS = [
/* ...existing tools... */
{
type: "function",
name: "flip_coin",
description: "Flip a fair coin.",
parameters: { type: "object", properties: {}, required: [] }
}
];
// 2. Implement it. Return any JSON-serializable value. Async is fine.
const TOOL_IMPL = {
/* ...existing impls... */
flip_coin: () => ({ result: Math.random() < 0.5 ? "heads" : "tails" })
};
Reload, click the mic, ask the model to flip a coin. It figures out when to call the function.
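Tools that take arguments follow the same two-step pattern. The sketch below is a hypothetical roll_dice tool. It assumes the starter passes the parsed arguments object as the first argument to each TOOL_IMPL function, so check static/index.html for the actual calling convention.

```javascript
// 1. Schema: one optional integer parameter, described with JSON Schema.
const rollDiceTool = {
  type: "function",
  name: "roll_dice",
  description: "Roll one die and return the number rolled.",
  parameters: {
    type: "object",
    properties: {
      sides: { type: "integer", description: "Number of sides, default 6." }
    },
    required: []
  }
};

// 2. Implementation: validate the argument, since the model may omit it
// or send something unexpected.
function rollDice(args = {}) {
  const sides = Number.isInteger(args.sides) && args.sides >= 2 ? args.sides : 6;
  return { sides, result: 1 + Math.floor(Math.random() * sides) };
}
```

To wire it in, append rollDiceTool to TOOLS and add roll_dice: rollDice to TOOL_IMPL. Asking "roll a twenty-sided die" should then lead the model to call it with sides set to 20.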
Edit .env:
REALTIME_VOICE=ballad
REALTIME_MODEL=gpt-realtime-2
REALTIME_INSTRUCTIONS=You are a sarcastic concierge. Answer in one sentence.
Voices to try: alloy, ash, ballad, coral, echo, sage, shimmer, verse.
curl http://localhost:3000/healthz
{"ok":true,"model":"gpt-realtime-2","voice":"ash"}
- Put the server behind HTTPS in production. The browser already needs HTTPS for the mic unless you stay on localhost.
- OPENAI_API_KEY is validated at startup. Misconfig fails fast with a clear message.
- /session returns the upstream HTTP status if OpenAI rejects the request, so your monitoring picks it up.
- The server adds Authorization per request. Rotate the key in .env and restart. No code changes.
- Logs go to stdout in JSON-ish format. Pipe to your log aggregator of choice.
server.py # FastAPI signaling server
static/index.html # the entire frontend
requirements.txt # fastapi, uvicorn, httpx, python-dotenv
.env.example # copy to .env, add your key
- More tools: drop them into TOOLS and TOOL_IMPL. Async is fine.
- Multiple voices/personas: pass instructions and voice via session.update from the browser at runtime.
- Transcripts: the data channel already streams response.audio_transcript.delta events. Render them in the page.
- Vision / screen capture: click Share Screen to start a screen share. When you ask the model to look at your screen, the see_screen tool fires. It calls ImageCapture.grabFrame() on the screen share track, which pulls one decoded frame directly from the OS without needing a video element or any playback state. The frame is scaled to 1280px wide, encoded as JPEG, and injected into the conversation as an input_image message over the existing WebRTC data channel. No extra API call, no backend involved.
- Robots, hardware, anything: tool calls are just JavaScript. Hit a WebSocket, fire a webhook, call a serial port.
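The resize-and-inject step described for see_screen can be approximated as follows. This is a sketch under assumptions: the image is sent as a conversation.item.create event carrying an input_image part with a data URL, and fitWidth and imageMessage are hypothetical helper names. Grabbing the frame and encoding it as JPEG are left out because they need browser APIs.

```javascript
// Compute target dimensions for a frame, capping the width while
// preserving aspect ratio. In this sketch, frames narrower than the
// cap are left at their original size rather than upscaled.
function fitWidth(width, height, maxWidth = 1280) {
  if (width <= maxWidth) return { width, height };
  const scale = maxWidth / width;
  return { width: maxWidth, height: Math.round(height * scale) };
}

// Wrap base64-encoded JPEG data in a user message event that can be
// serialized and sent over the data channel.
function imageMessage(base64Jpeg) {
  return {
    type: "conversation.item.create",
    item: {
      type: "message",
      role: "user",
      content: [
        { type: "input_image", image_url: `data:image/jpeg;base64,${base64Jpeg}` }
      ]
    }
  };
}
```

For a 2560x1440 screen share, fitWidth returns 1280x720. Capping the width keeps the encoded frame small enough to send as a single data channel message in most cases, though the exact size limit depends on the browser.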
MIT.