From 897ce33eb0567e557f9c45494e6d9fb3fa5997ef Mon Sep 17 00:00:00 2001
From: humabot <71288277+humabot@users.noreply.github.com>
Date: Mon, 30 Mar 2026 15:10:21 -0500
Subject: [PATCH] Add local Whisper setup and harden Electron startup
Expand speech support to include a repo-local OpenAI Whisper workflow alongside Azure Speech. This updates setup, configuration, settings UI, and runtime speech handling so the app can bootstrap and use a local Whisper CLI with sane defaults and a smoke-test script.
Also harden Electron startup and renderer behavior by clearing ELECTRON_RUN_AS_NODE during npm start/dev, enforcing a single-instance lock, simplifying preload quit behavior, and replacing fragile CDN-backed UI assets with local resources or guarded fallbacks to reduce renderer crashes and background SSL noise.
Documentation and examples are updated to match the new speech setup flow, environment variables, and startup behavior.
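The startup hardening above can be sketched in plain Node terms. The helper name `sanitizeElectronEnv` is an assumption for illustration; `ELECTRON_RUN_AS_NODE` and `app.requestSingleInstanceLock()` are real Electron facilities:

```javascript
// Remove ELECTRON_RUN_AS_NODE from a copy of the environment so a
// spawned Electron binary boots as a GUI app rather than a bare Node
// process. Helper name is illustrative, not the patch's actual code.
function sanitizeElectronEnv(env) {
  const clean = { ...env };
  delete clean.ELECTRON_RUN_AS_NODE;
  return clean;
}

// In main.js, the single-instance lock is roughly:
//   const gotLock = app.requestSingleInstanceLock();
//   if (!gotLock) app.quit();
```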
---
.gitignore | 5 +-
README.md | 78 +-
chat.html | 12 +-
env.example | 102 +--
llm-response.html | 48 +-
main.js | 49 +-
package-lock.json | 20 +
package.json | 8 +-
preload.js | 6 +-
scripts/test-speech.js | 25 +
settings.html | 63 +-
setup.sh | 381 +++++----
src/core/config.js | 8 +-
src/managers/window.manager.js | 2 +-
src/services/speech.service.js | 1354 +++++++++++++++++++-------------
src/styles/common.css | 4 +-
src/ui/settings-window.js | 45 ++
17 files changed, 1364 insertions(+), 846 deletions(-)
create mode 100644 scripts/test-speech.js
diff --git a/.gitignore b/.gitignore
index 65a6a54..e249383 100644
--- a/.gitignore
+++ b/.gitignore
@@ -1,5 +1,8 @@
node_modules/
.env
+.venv-whisper/
+.whisper-models/
eng.traineddata
dist/
-.DS_Store
\ No newline at end of file
+.DS_Store
+*.log
diff --git a/README.md b/README.md
index e9de5af..7b7015e 100644
--- a/README.md
+++ b/README.md
@@ -22,7 +22,7 @@
-
+
---
@@ -53,7 +53,7 @@ https://github.com/user-attachments/assets/896a7140-1e85-405d-bfbe-e05c9f3a816b
### 🚀 **AI-Powered Intelligence**
- **Direct Image Analysis**: Screenshots are analyzed by Gemini (no Tesseract OCR)
-- **Voice Commands**: Optional Azure Speech (Whisper questions, get instant answers)
+- **Voice Commands**: Optional Azure Speech or local OpenAI Whisper
- **Context Memory**: Remembers entire interview conversation
- **Multi-Language Support**: C++, Python, Java, JavaScript, C
- **Smart Response Window**: Draggable with close button
@@ -68,7 +68,7 @@ https://github.com/user-attachments/assets/896a7140-1e85-405d-bfbe-e05c9f3a816b
- **Floating Overlay Bar**: Compact command center with camera, mic, and skill selector
- **Draggable Answer Window**: Move and resize AI response window anywhere
- **Close Button**: Clean × button to close answer window when needed
-- **Auto-Hide Mic**: Microphone button appears only when Azure Speech is configured
+- **Auto-Hide Mic**: Microphone button appears only when a speech provider is available
- **Interactive Chat**: Full conversation window with markdown support
### 🎨 **Visual Design**
@@ -133,7 +133,7 @@ https://github.com/user-attachments/assets/896a7140-1e85-405d-bfbe-e05c9f3a816b
- [x] **Global shortcuts** (capture, visibility, interaction, chat, settings)
- [x] **Session memory** and chat UI
- [x] **Language picker** and DSA skill prompt
-- [x] **Optional Azure Speech** integration with auto‑hide mic
+- [x] **Optional Azure Speech / local Whisper** integration with auto‑hide mic
- [x] **Multi‑monitor** and area capture APIs
- [x] **Window binding** and positioning system
- [x] **Settings management** with app icon/stealth modes
@@ -157,12 +157,22 @@ The setup script automatically handles configuration. You only need:
# Required: Google Gemini API Key (setup script will ask for this)
GEMINI_API_KEY=your_gemini_api_key_here
-# Optional: Azure Speech Recognition (add later if you want voice features)
+# Optional: Speech Recognition (pick one provider)
+SPEECH_PROVIDER=whisper
+
+# Azure option
AZURE_SPEECH_KEY=your_azure_speech_key
AZURE_SPEECH_REGION=your_region
+
+# Local Whisper option
+WHISPER_COMMAND=whisper
+WHISPER_MODEL_DIR=.whisper-models
+WHISPER_MODEL=base
+WHISPER_LANGUAGE=en
+WHISPER_SEGMENT_MS=4000
```
-**Note**: Speech recognition is completely optional. If Azure credentials are not provided, the microphone button will be automatically hidden from all interfaces.
+**Note**: Speech recognition is completely optional. If no configured provider is available, the microphone button will be automatically hidden from all interfaces.
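The auto-hide behavior boils down to a provider availability check along these lines (the function name and config shape are assumptions for illustration, not the app's actual API):

```javascript
// Decide whether a usable speech provider is configured; the mic
// button is hidden whenever this returns false. Sketch only.
function speechProviderAvailable(cfg) {
  if (cfg.SPEECH_PROVIDER === 'azure') {
    return Boolean(cfg.AZURE_SPEECH_KEY && cfg.AZURE_SPEECH_REGION);
  }
  if (cfg.SPEECH_PROVIDER === 'whisper') {
    return Boolean(cfg.WHISPER_COMMAND);
  }
  return false;
}
```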
## 🚀 Quick Start & Installation
@@ -187,7 +197,9 @@ AZURE_SPEECH_REGION=your_region
**That's it!** The setup script will:
- Install all dependencies automatically
-- Create and configure your `.env` file
+- Create your `.env` file from `env.example` if needed
+- Set up a local Whisper virtualenv in `.venv-whisper`
+- Configure `.env` to use local Whisper by default
- Build the app (if needed)
- Launch OpenCluely ready to use (if it doesn't launch, run `npm install` and then `npm start`)
@@ -196,6 +208,8 @@ AZURE_SPEECH_REGION=your_region
- **Windows**: Use Git Bash (comes with Git for Windows), WSL, or any bash environment
- **macOS/Linux**: Use your regular terminal
- **All platforms**: No manual npm commands needed - the setup script handles everything
+- **Windows Whisper path**: `setup.sh` now writes `WHISPER_COMMAND=.venv-whisper/Scripts/whisper.exe`
+- **macOS/Linux Whisper path**: `setup.sh` writes `WHISPER_COMMAND=.venv-whisper/bin/whisper`
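The per-platform paths above can be expressed as a small helper, mirroring what `setup.sh` writes into `WHISPER_COMMAND` (the function itself is a hypothetical sketch):

```javascript
// Pick the venv Whisper binary for the current platform, matching the
// paths setup.sh writes to .env. Sketch only, not the script's code.
function venvWhisperPath(platform) {
  return platform === 'win32'
    ? '.venv-whisper/Scripts/whisper.exe'
    : '.venv-whisper/bin/whisper';
}
```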
### 🎛️ Setup Script Options
@@ -204,28 +218,50 @@ AZURE_SPEECH_REGION=your_region
./setup.sh --ci # Use npm ci instead of npm install
./setup.sh --no-run # Setup only, don't launch the app
./setup.sh --install-system-deps # Install sox for microphone (optional)
+./setup.sh --skip-whisper # Skip the local Whisper bootstrap
```
-### 🔧 **Optional: Azure Speech Setup** (For Voice Features)
+### 🔧 **Optional: Speech Setup** (For Voice Features)
+
+Voice recognition is optional. You can use either Azure Speech or local OpenAI Whisper.
-Voice recognition is completely optional. The setup script will create a `.env` file with just the required Gemini key. To add voice features:
+For the local Whisper path, `./setup.sh` now handles the full repo-local setup:
-1. Get Azure Speech credentials:
+1. Creates `.venv-whisper`
+2. Installs `openai-whisper`
+3. Points `.env` at `.venv-whisper/bin/whisper`
+4. Creates `.whisper-models`
+5. Runs `npm run test-speech`
+
+1. For Azure Speech:
- Visit [Azure Portal](https://portal.azure.com/)
- Create a Speech Service
- Copy your key and region
-2. Add to your `.env` file:
+2. For local Whisper:
+ - Run `./setup.sh --install-system-deps`
+ - Or install required audio tools such as `ffmpeg` and `sox` yourself
+ - On Windows, install audio tooling separately and prefer Git Bash or WSL for `setup.sh`
+
+3. Add one provider to your `.env` file:
```env
- # Already configured by setup script
GEMINI_API_KEY=your_gemini_api_key_here
-
- # Add these for voice features (optional)
+ SPEECH_PROVIDER=azure
AZURE_SPEECH_KEY=your_azure_speech_key
AZURE_SPEECH_REGION=your_region
```
-3. Restart the app - microphone buttons will now appear automatically
+ ```env
+ GEMINI_API_KEY=your_gemini_api_key_here
+ SPEECH_PROVIDER=whisper
+ WHISPER_COMMAND=whisper
+ WHISPER_MODEL_DIR=.whisper-models
+ WHISPER_MODEL=base
+ WHISPER_LANGUAGE=en
+ WHISPER_SEGMENT_MS=4000
+ ```
+
+4. Restart the app - microphone buttons will now appear automatically
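The Whisper `.env` values above map onto a CLI invocation roughly like the following. The flag names (`--model`, `--model_dir`, `--language`, `--output_format`) are real `openai-whisper` CLI options; the helper itself is a hypothetical sketch, not the speech service's code:

```javascript
// Build the argument list for the local Whisper CLI from env-style
// config, with the same defaults env.example documents. Sketch only.
function buildWhisperArgs(audioPath, cfg) {
  return [
    audioPath,
    '--model', cfg.WHISPER_MODEL || 'base',
    '--model_dir', cfg.WHISPER_MODEL_DIR || '.whisper-models',
    '--language', cfg.WHISPER_LANGUAGE || 'en',
    '--output_format', 'txt',
  ];
}
```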
## 🎮 How to Use
@@ -265,10 +301,11 @@ Voice recognition is completely optional. The setup script will create a `.env`
- **Image Understanding**: DSA prompt is applied only for new image-based queries; chat messages don’t include the full prompt
- **Multi-monitor & Area Capture**: Programmatic APIs allow targeting a display and optional rectangular crop for focused analysis
-#### 🔊 **Optional Voice Features** (Azure Speech)
-- **Real-time Transcription**: Speak questions naturally
+#### 🔊 **Optional Voice Features** (Azure Speech / Local Whisper)
+- **Chunked Local Transcription**: Local Whisper transcribes short recorded segments on your machine
+- **Real-time Transcription**: Azure Speech supports live interim recognition
- **Listening Animation**: Visual feedback during recording
-- **Interim Results**: See transcription as you speak
+- **Interim Results**: Available with Azure Speech
- **Auto-processing**: Instant AI responses to voice input
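The chunked local mode works on fixed-length segments sized by `WHISPER_SEGMENT_MS`. The boundary math can be sketched as follows (purely illustrative; the real speech service records and transcribes segments as they arrive rather than slicing a finished file):

```javascript
// Compute [start, end] millisecond boundaries for fixed-length
// transcription segments; the final segment may be shorter. Sketch only.
function segmentBoundaries(totalMs, segmentMs = 4000) {
  const bounds = [];
  for (let start = 0; start < totalMs; start += segmentMs) {
    bounds.push([start, Math.min(start + segmentMs, totalMs)]);
  }
  return bounds;
}
```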
---
@@ -305,7 +342,8 @@ Voice recognition is completely optional. The setup script will create a `.env`
- **Microphone/voice not working**
- Voice is optional - ignore related warnings if you don't need it
- - To enable: install `sox` (Linux/macOS) and add Azure keys to `.env`
+ - Azure mode: add valid Azure keys to `.env`
+ - Whisper mode: install `openai-whisper`, `ffmpeg`, and `sox`, then set `SPEECH_PROVIDER=whisper`
@@ -341,7 +379,7 @@ This project is licensed under the MIT License - see the [LICENSE](LICENSE) file
## 🙏 Acknowledgments
- **Google Gemini**: Powering AI intelligence
-- **Azure Speech**: Optional voice recognition
+- **Azure Speech / Whisper**: Optional voice recognition
- **Electron**: Cross-platform desktop framework
- **Community**: Amazing contributors and feedback
diff --git a/chat.html b/chat.html
index d986477..baff03c 100644
--- a/chat.html
+++ b/chat.html
@@ -4,10 +4,8 @@
Chat
-
-
-
+
@@ -336,6 +352,16 @@
Speech Recognition
+
+
+
Speech Provider
+
Choose Azure Speech or a local OpenAI Whisper CLI
+
+
+
Azure Speech Key
@@ -350,6 +376,39 @@
+
+
+
+
Whisper Command
+
CLI command for local Whisper, such as whisper or python3 -m whisper
+
+
+
+
+
+
+
Whisper Model
+
Local model name used by the Whisper CLI
+
+
+
+
+
+
Whisper Language
+
Language code for local transcription
+
+
+
+
+
+
Whisper Segment Length
+
Chunk size in milliseconds for local transcription
+
+
+
+
+ Local Whisper runs on this machine and needs a Whisper CLI installed. These settings apply immediately for the current app session; use .env for startup defaults.
+