Unified VAD detector, grounded speech scoring, and platform-gated VoiceActivated by flexiondotorg · Pull Request #126 · linuxmatters/jivetalking

flexiondotorg · 2026-06-16T01:13:46Z

Summary

Reworks Pass 1 voice-activity detection and the speech/room-tone region model. Net change is 25 files, +2418 / -3316 (removes more than it adds).

Unified voice-activity detector (analyser_vad.go): a single detector replaces the former separate speech/room-tone detection. Level histogram with an Otsu speech/silence split, percentile noise floor, hysteresis-built speech runs with adaptive gap-tolerance, and a spectral veto (centroid 200 to 6000 Hz, entropy). The old detector was removed.
Grounded speech scoring: candidates are scored SNR-primary with a saturating duration-adequacy term and a consistency tie-break (scoreSpeechCandidateGrounded). Election flipped from longest-wins to highest-score (findBestSpeechRegion). The seven candidateWeight* constants and the old scorer are gone.
Room-tone candidates concept retired from the schema, sidecar, and report. This is a full removal: the package has no RoomToneCandidateMetrics references left.
VoiceActivated now means platform-gated capture: it reads the floored (digital-silence) interval fraction (flooredFraction, threshold 0.20), not the old below-split fraction. The below-split signal conflated digital silence with quiet speech gaps and fired on sparse per-speaker podcast tracks; the floored fraction measures the gate signature directly.
JSON sidecar fix: IntervalSample and candidateSidecarLine marshal methods sanitise non-finite floats (NaN, ±Inf) to null, mirroring the run-record convention. This stopped an encoding/json error that truncated diagnostic sidecars on digitally-silent audio.
Docs synced: AGENTS.md and docs/ updated to the unified detector, highest-score election, and floored-fraction VoiceActivated.
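
As an illustration of the histogram split and percentile floor summarised above, here is a minimal Go sketch. It is not the code in analyser_vad.go: the names `otsuSplit` and `percentile`, the bin layout, and the nearest-rank percentile are assumptions made for the example, and inputs are assumed to be finite, non-empty levels in dB.

```go
package main

import (
	"fmt"
	"sort"
)

// otsuSplit (hypothetical helper) bins levels into a fixed-width histogram
// over [minDB, maxDB) and returns the upper edge of the bin that maximises
// the between-class variance of the low and high classes.
func otsuSplit(levels []float64, minDB, maxDB float64, bins int) float64 {
	hist := make([]float64, bins)
	width := (maxDB - minDB) / float64(bins)
	var total, sumAll float64
	for _, l := range levels {
		i := int((l - minDB) / width)
		if i < 0 {
			i = 0
		}
		if i >= bins {
			i = bins - 1
		}
		hist[i]++
		total++
	}
	for i, c := range hist {
		sumAll += float64(i) * c
	}
	var wLow, sumLow, bestVar float64
	best := 0
	for i, c := range hist {
		wLow += c
		sumLow += float64(i) * c
		wHigh := total - wLow
		if wLow == 0 {
			continue // no samples in the low class yet
		}
		if wHigh == 0 {
			break // every sample is already in the low class
		}
		mLow := sumLow / wLow
		mHigh := (sumAll - sumLow) / wHigh
		v := wLow * wHigh * (mLow - mHigh) * (mLow - mHigh)
		if v > bestVar {
			bestVar, best = v, i
		}
	}
	return minDB + float64(best+1)*width
}

// percentile (hypothetical helper) returns a nearest-rank percentile,
// with p in [0, 1]. The input slice must be non-empty.
func percentile(levels []float64, p float64) float64 {
	s := append([]float64(nil), levels...)
	sort.Float64s(s)
	return s[int(p*float64(len(s)-1))]
}

func main() {
	levels := []float64{-62, -60, -61, -59, -22, -20, -19, -21, -60, -20}
	fmt.Println("split:", otsuSplit(levels, -80, 0, 80))
	fmt.Println("noise floor (p10):", percentile(levels, 0.10))
}
```

On a clearly bimodal input the split lands between the quiet cluster and the speech cluster, and the low percentile sits inside the quiet cluster.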

Calibration

The VoiceActivated threshold (0.20) is set from a corpus sweep, not hand-picked. Floored fraction cleanly separates the classes: the three platform-gated recordings sit at 44 to 70%, while the entire corpus sits at or below 0.10%. That leaves a gap of roughly 44 percentage points; the threshold is placed ~190x above the corpus ceiling and ~2.2x below the lowest gated file.

Validation

Full corpus sweep plus three real platform-gated recordings: all three gated files flag VoiceActivated: true, all 51 corpus stems (48 continuous + 3 short clips) flag false, zero mismatches.
Hop-separation sweep confirmed the 250 ms interval (100 ms regresses speech election on 7 stems).
Hermetic unit tests for the floored fraction and the grounded scorer (no audio decode).
just build, just test, just lint all green.

Speech search is now independent of room-tone detection. Instead of searching only after the elected room-tone region's end (which hides early speech on short clips), the search now scans the whole file from t=0 to EOF. The elected room-tone region, when present, only excludes its own intervals from speech runs (to keep metrics clean), never sets a one-way search floor.

Changes:
- Add roomToneSpan struct to carry the elected span with explicit "none" sentinel (zero value = no room tone)
- Change findSpeechCandidatesFromIntervals signature to accept the span instead of a single roomToneEnd duration
- Iterate from intervals[0] to EOF, not from a calculated start index
- Compute rmsP50 median over non-excluded intervals only (room tone does not bias the speech reference)
- Mask room-tone intervals in place as forced non-speech, routed through the existing interruption branch (a run cuts at the boundary, never bridged across)
- Delete speechSearchStartBuffer constant (no longer needed)
- Rewire selectSpeechProfile to pass the elected span from noiseSelection.roomToneResult.BestRegion

Signed-off-by: Martin Wimpress <code@wimpress.io>

…Pass 1

Replace selectNoiseProfile + selectSpeechProfile pair with detectVoiceActivity: one bimodal histogram split feeds both speech region and noise floor. This fixes the structural overlap bug and removes magic-number detection gates.

Detector design: per-interval level histogram on momentary-LUFS axis (primary; RMS fallback). Otsu split, clamped [noiseFloor+6, p75]. Low-percentile noise floor from same histogram (minimum-statistics, not region-elected). Per-interval flag: speech iff level >= split AND spectral veto passes (same rule guards loud bridged gaps). Two-mechanism run builder: data-derived hysteresis margin + p75 gap-tolerance from gap distribution (replaces fixed 2s interruption tolerance).

Reuse: findBestSpeechRegion election, golden refinement, band decode, spectral-veto bounds unchanged.

New: detectVoiceActivity orchestrator, Otsu split, percentile floor, hysteresis margin, gap-p75 tolerance, longest-low-cluster region picker.

Moved helpers: medianFloat64/medianAndMAD, extractNoiseProfileFromIntervals for re-measure contract.

Updated: Noise.FloorSource = "vad_percentile", VoiceActivated recomputed from low-cluster fraction.

Validation (48 corpus stems): all hit spec (-16 +/- 0.5 LUFS, TP <= -1 dBTP). Three stems elect NO speech profile (same as RMS-axis test; not an axis issue). Noise floor rises +9.8 dB mean (new percentile vs old region-elected; larger than predicted +3-7). Speech region duration drops (semantic: variable-length runs vs fixed 60s, voicing +0.06 to 0.85-1.00). No output pumping regression (final DR -5.7 dB mean, driven by edge astats values, no over-gating signature).

Detected audio: to ears, gate is the bottleneck, not detection. Recommendation: zero-candidate misses flagged for gate-overhaul re-validation, not a VAD regression.

Old detector code (selectNoiseProfile, selectSpeechProfile, detectVoiceActivated, room-tone candidates) left in-tree for now.

Signed-off-by: Martin Wimpress <code@wimpress.io>
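
The two-mechanism run builder described in this commit can be sketched roughly as follows. This is a simplified illustration, not the project's implementation: the `run` type, `buildRuns`, the symmetric enter/stay margin, and a gap tolerance expressed in interval counts are all assumptions. In the commit the margin is data-derived and the tolerance comes from the p75 of the gap distribution; here both are plain parameters.

```go
package main

import "fmt"

// run is a half-open range [start, end) of interval indices (hypothetical type).
type run struct{ start, end int }

// buildRuns groups per-interval levels into speech runs.
// A run opens when the level reaches split+margin, stays open while the level
// is at least split-margin, and tolerates up to maxGap consecutive quieter
// intervals before it is closed at the last speech interval.
func buildRuns(levels []float64, split, margin float64, maxGap int) []run {
	var runs []run
	inRun := false
	start, lastSpeech := 0, -1
	for i, l := range levels {
		enter := l >= split+margin
		stay := l >= split-margin
		switch {
		case !inRun && enter:
			inRun, start, lastSpeech = true, i, i
		case inRun && stay:
			lastSpeech = i
		case inRun && i-lastSpeech > maxGap:
			// Gap exceeded the tolerance: close the run at the last speech interval.
			runs = append(runs, run{start, lastSpeech + 1})
			inRun = false
		}
	}
	if inRun {
		runs = append(runs, run{start, lastSpeech + 1})
	}
	return runs
}

func main() {
	levels := []float64{-60, -20, -20, -50, -20, -60, -60, -60, -20}
	// One quiet interval is bridged; the three-interval gap splits the runs.
	fmt.Println(buildRuns(levels, -40, 3, 1))
}
```

The hysteresis margin stops a run from flickering open and closed around the split, while the gap tolerance bridges short pauses inside continuous speech.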

Recover three VAD failures (LMP-70-popey, LMP-74-popey, LMP-82-mark) that were falsely rejected by spectral veto. These speakers have healthy speech centroid above 4500 Hz. New bound at 6000 Hz passes all 45 currently elected stems and aligns with ITU-T P.56 wideband (50-7000 Hz). Validated: all three failures now elect speech regions with centroid 3012-3328 Hz. Signed-off-by: Martin Wimpress <code@wimpress.io>
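
For readers unfamiliar with the centroid check, the sketch below shows one common way to compute a spectral centroid and test it against the 200 to 6000 Hz bounds quoted in this PR. The function names, the inclusive comparison, and the magnitude-weighted definition are assumptions for illustration. The entropy half of the veto is omitted because no threshold is given here.

```go
package main

import (
	"fmt"
	"math"
)

const (
	centroidMinHz = 200.0  // lower bound quoted in the PR summary
	centroidMaxHz = 6000.0 // upper bound raised from 4500 Hz in this commit
)

// spectralCentroid returns the magnitude-weighted mean frequency of a
// spectrum whose bin k is centred at k*binHz. It returns NaN for an
// all-zero spectrum.
func spectralCentroid(mags []float64, binHz float64) float64 {
	var num, den float64
	for k, m := range mags {
		num += float64(k) * binHz * m
		den += m
	}
	if den == 0 {
		return math.NaN()
	}
	return num / den
}

// passesCentroidVeto reports whether a centroid lies inside the speech band.
// NaN fails both comparisons, so a silent interval is rejected.
func passesCentroidVeto(centroidHz float64) bool {
	return centroidHz >= centroidMinHz && centroidHz <= centroidMaxHz
}

func main() {
	c := spectralCentroid([]float64{0, 1, 1}, 1000)
	fmt.Println(c, passesCentroidVeto(c))
	fmt.Println(passesCentroidVeto(5000)) // would have failed a 4500 Hz ceiling
	fmt.Println(passesCentroidVeto(7000))
}
```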

… unified VAD

Delete the old two-part detector.

- Delete old speech run-formation code (`findSpeechCandidatesFromIntervals`, speech scoring, `roomToneSpan` type, dead run-builder constants)
- Delete old room-tone detector files entirely (`analyser_candidates_room_tone_scoring.go`, `analyser_candidates_room_tone.go`)
- Move retained seed estimators (`estimateNoiseFloorAndThreshold`, `computeSilenceMedians`) to new `analyser_noise_seed.go` for reuse by buildInputMeasurements
- Remove tests that exercised only the old run-builder; fix stale test fixtures (centroid bounds changed in prior commit)

Unified VAD detector is now the sole live path. Build, lint, and tests all green.

Signed-off-by: Martin Wimpress <code@wimpress.io>

Remove RoomToneCandidates field, RoomToneRegionMetrics type, and candidate-measurement functions. ElectedRoomToneSample now sourced directly from Pass 1 analysis, decoupled from the removed candidates slice. Speech candidates remain active. Update RunRecord, sidecars, report rendering, and tests; golden fixture regenerated. No DSP or audio quality changes; byte-identical output (candidates block only). Signed-off-by: Martin Wimpress <code@wimpress.io>

- Introduce grounded scorer: primary SNR-margin gate (saturating above 40 dB), duration adequacy (full credit at 30 s+), consistency tie-break (lower within-region level-variance wins); replace seven theatre weights and composite with unified single-axis model
- Election now selects highest-score candidate instead of longest duration
- Add levelVariance helper to compute per-interval RMS/LUFS variance within a region
- New hermetic unit tests for scorer (SNR monotonicity, saturation, tie-break) and election logic (highest-score wins, voice-activated, fallback)
- Update TestElectSpeechProfile assertion from longest-wins to highest-score

Thresholds (snrSaturationMargin, speechDurationAdequacyMinimum) are Phase 3 sweep placeholders.

Signed-off-by: Martin Wimpress <code@wimpress.io>
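
The shape of the grounded scorer can be illustrated with the sketch below. It is a guess at the structure, not scoreSpeechCandidateGrounded itself: the `candidate` type, the linear ramps, and multiplying the SNR and duration terms are assumptions. The 40 dB and 30 s values come from this commit message, which itself calls them placeholders.

```go
package main

import "fmt"

// candidate holds the three quantities the commit message names (hypothetical type).
type candidate struct {
	snrDB         float64 // SNR margin over the noise floor
	durationS     float64 // region length in seconds
	levelVariance float64 // within-region level variance
}

const (
	snrSaturationDB   = 40.0 // SNR credit saturates above this margin
	durationAdequacyS = 30.0 // full duration credit at or above this length
)

// clamp01 limits x to [0, 1]; NaN and negative values map to 0.
func clamp01(x float64) float64 {
	if !(x > 0) {
		return 0
	}
	if x > 1 {
		return 1
	}
	return x
}

// score combines a saturating SNR term with a saturating duration term.
func score(c candidate) float64 {
	return clamp01(c.snrDB/snrSaturationDB) * clamp01(c.durationS/durationAdequacyS)
}

// elect returns the index of the highest-scoring candidate. Equal scores are
// broken by lower level variance, then by earlier position. ok is false when
// there are no candidates.
func elect(cs []candidate) (idx int, ok bool) {
	idx = -1
	for i, c := range cs {
		if idx < 0 {
			idx = i
			continue
		}
		sb, sc := score(cs[idx]), score(c)
		if sc > sb || (sc == sb && c.levelVariance < cs[idx].levelVariance) {
			idx = i
		}
	}
	return idx, idx >= 0
}

func main() {
	cs := []candidate{
		{snrDB: 20, durationS: 120, levelVariance: 1.0}, // longest, but low SNR
		{snrDB: 35, durationS: 40, levelVariance: 2.0},
		{snrDB: 45, durationS: 30, levelVariance: 0.5}, // saturated, steadiest
		{snrDB: 50, durationS: 60, levelVariance: 0.8}, // saturated, less steady
	}
	fmt.Println(elect(cs))
}
```

Under these assumptions the longest candidate no longer wins by default: two candidates saturate both terms, and the one with the lower level variance is elected.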

IntervalSample and candidateSidecarLine marshal methods now route through sanitiseValue to null non-finite floats (NaN, +Inf, -Inf), mirroring the run-record convention. This prevents encoding/json errors that truncated .intervals.jsonl and aborted .candidates.jsonl on digitally-silent (voice-gated) audio. Finite values are byte-identical. Signed-off-by: Martin Wimpress <code@wimpress.io>
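
For context, Go's encoding/json returns an UnsupportedValueError when asked to encode NaN or ±Inf. The sketch below shows one way a marshal method can map those values to null. The `sample` type, its field names, and the `finiteOrNil` and `encode` helpers are hypothetical; the project's sanitiseValue may work differently.

```go
package main

import (
	"encoding/json"
	"fmt"
	"math"
)

// sample is a stand-in for a sidecar line (hypothetical type and fields).
type sample struct {
	TimeS   float64
	LevelDB float64
}

// finiteOrNil returns nil for NaN and ±Inf so the value encodes as JSON null.
func finiteOrNil(v float64) *float64 {
	if math.IsNaN(v) || math.IsInf(v, 0) {
		return nil
	}
	return &v
}

// MarshalJSON encodes the sample with non-finite floats replaced by null.
func (s sample) MarshalJSON() ([]byte, error) {
	return json.Marshal(struct {
		TimeS   *float64 `json:"time_s"`
		LevelDB *float64 `json:"level_db"`
	}{finiteOrNil(s.TimeS), finiteOrNil(s.LevelDB)})
}

// encode is a small convenience wrapper that returns the JSON as a string.
func encode(s sample) (string, error) {
	b, err := json.Marshal(s)
	return string(b), err
}

func main() {
	// A digitally silent window measures -Inf dB; it encodes as null.
	out, err := encode(sample{TimeS: 1.5, LevelDB: math.Inf(-1)})
	fmt.Println(out, err)
}
```

Finite values pass through the pointer unchanged, so their encoding matches what a plain float64 field would produce.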

…ed fraction

Changed the VAD signal from "low-cluster intervals below the split" (which conflated digital silence and quiet speech gaps) to "intervals pinned at the digital-silence floor" (the platform-gated capture signature). Threshold set to 0.20 (20%) from calibration data: ~190x above the corpus ceiling (0.10%) and ~2.2x below the lowest gated TT202 track (44.08%). The below-split signal was rejected because sparse per-speaker podcast tracks run 50-75% below-split, making them indistinguishable from platform-gated recordings.

- Rename `lowClusterFraction()` to `flooredFraction()` to measure only floored intervals (level <= vadLevelFloorDB), skipping NaN but counting fully silent (-inf) windows in the denominator
- Update `vadVoiceActivatedFraction` constant from 0.60 to 0.20
- Update VoiceActivated comment in NoiseMetrics to reflect platform-gated detection intent
- Add test cases validating gated tracks (floored > 0.20) and sparse podcast tracks (floored = 0.0) are correctly distinguished

Signed-off-by: Martin Wimpress <code@wimpress.io>
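
A minimal sketch of the floored-fraction rule described in this commit follows. The function signatures are assumptions, the floor is passed as a parameter because the value of vadLevelFloorDB is not stated here, and the strict `>` comparison against 0.20 is inferred from the test description rather than confirmed.

```go
package main

import (
	"fmt"
	"math"
)

// voiceActivatedThreshold mirrors the 0.20 value quoted in the commit message.
const voiceActivatedThreshold = 0.20

// flooredFraction returns the share of measured intervals whose level is at
// or below floorDB. NaN levels are skipped entirely; -Inf levels count in the
// denominator and, being below any floor, in the numerator too.
func flooredFraction(levels []float64, floorDB float64) float64 {
	var floored, counted int
	for _, l := range levels {
		if math.IsNaN(l) {
			continue
		}
		counted++
		if l <= floorDB {
			floored++
		}
	}
	if counted == 0 {
		return 0
	}
	return float64(floored) / float64(counted)
}

// voiceActivated reports whether the track looks platform-gated.
func voiceActivated(levels []float64, floorDB float64) bool {
	return flooredFraction(levels, floorDB) > voiceActivatedThreshold
}

func main() {
	const floorDB = -100.0 // placeholder value for the example only
	gated := []float64{math.Inf(-1), math.Inf(-1), -20, -22, math.NaN()}
	sparse := []float64{-55, -50, -20, -60}
	fmt.Println(flooredFraction(gated, floorDB), voiceActivated(gated, floorDB))
	fmt.Println(flooredFraction(sparse, floorDB), voiceActivated(sparse, floorDB))
}
```

The sparse track has many quiet intervals but none pinned at the floor, so it is not flagged, which is the distinction the below-split signal could not make.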

…Activated semantics

- Update AGENTS.md: unified VAD detector, highest-score speech election, VoiceActivated floored-fraction definition
- Update docs/Pipeline.md: add VAD flow, updated speech-gate thresholds, highest-score scoring strategy
- Update docs/Levelator-Comparison-And-Gap-Analysis.md: VoiceActivated meaning for floored fraction
- Update docs/Usage.md: VoiceActivated flag purpose
- Fix code comments: highest-score strategy in candidate-shared; centroid 4500→6000 Hz in candidate-speech

Signed-off-by: Martin Wimpress <code@wimpress.io>

cubic-dev-ai

2 issues found across 25 files

Confidence score: 3/5

internal/processor/analyser_noise_seed.go has non-deterministic top-candidate selection on equal scores, so prescan noise-floor results can change based on interval ordering and produce inconsistent analysis/report outputs between runs — add a deterministic tie-break (lower RMS, then index) before truncating candidates prior to merge.
internal/report/report_full.md.golden describes a below-Otsu-split fraction while the implemented metric is fixed-floor digital-silence fraction, which can mislead readers and make validation expectations drift — align the definition with the actual metric (or update the metric and regenerate the golden) before merging.

…c Voice activated definition

- Add seed ordering to break ties in noise-floor candidate scoring for reproducible prescan results
- Update Voice activated gloss from "below-split" to "floored (digital-silence)" to match metric change
- Regenerate report golden file with updated definition

Signed-off-by: Martin Wimpress <code@wimpress.io>
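
The deterministic tie-break requested in the review can be sketched as a total ordering applied before truncation. The `seed` type and `topSeeds` function are hypothetical. The ordering (score descending, then lower RMS, then lower index) follows the reviewer's suggestion and may not match what the fix actually does. Scores and RMS values are assumed finite.

```go
package main

import (
	"fmt"
	"sort"
)

// seed is a stand-in for a noise-floor candidate (hypothetical type).
type seed struct {
	index int     // position of the interval in the source audio
	score float64 // higher is better
	rmsDB float64 // lower is quieter
}

// topSeeds returns the n best seeds under a total order: score descending,
// then RMS ascending, then index ascending. Because indices are unique, the
// result does not depend on the order of the input slice.
func topSeeds(cs []seed, n int) []seed {
	out := append([]seed(nil), cs...)
	sort.Slice(out, func(i, j int) bool {
		a, b := out[i], out[j]
		if a.score != b.score {
			return a.score > b.score
		}
		if a.rmsDB != b.rmsDB {
			return a.rmsDB < b.rmsDB
		}
		return a.index < b.index
	})
	if n < 0 {
		n = 0
	}
	if n < len(out) {
		out = out[:n]
	}
	return out
}

func main() {
	cs := []seed{
		{index: 0, score: 0.9, rmsDB: -50},
		{index: 1, score: 0.9, rmsDB: -55},
		{index: 2, score: 0.9, rmsDB: -55},
		{index: 3, score: 0.5, rmsDB: -70},
	}
	fmt.Println(topSeeds(cs, 2))
}
```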

cubic-dev-ai

0 issues found across 3 files (changes from recent commits).

Requires human review: Major refactor of core VAD, speech scoring, and noise profiling; AI review found no issues, but it touches critical algorithms and requires human verification for correctness and regression safety.

flexiondotorg added 9 commits June 15, 2026 04:13

cubic-dev-ai Bot reviewed Jun 16, 2026

Comment thread internal/processor/analyser_noise_seed.go

Comment thread internal/report/report_full.md.golden Outdated

cubic-dev-ai Bot reviewed Jun 16, 2026

flexiondotorg merged commit 04af1ef into main Jun 16, 2026
16 checks passed

flexiondotorg deleted the vad branch June 16, 2026 01:30

flexiondotorg merged 10 commits into main from vad
