Unified VAD detector, grounded speech scoring, and platform-gated VoiceActivated#126

Merged: flexiondotorg merged 10 commits into main from vad on Jun 16, 2026

@flexiondotorg

Contributor

Summary

Reworks Pass 1 voice-activity detection and the speech/room-tone region model. Net change is 25 files, +2418 / -3316 (removes more than it adds).

  • Unified voice-activity detector (analyser_vad.go): a single detector replaces the former separate speech/room-tone detection. Level histogram with an Otsu speech/silence split, percentile noise floor, hysteresis-built speech runs with adaptive gap-tolerance, and a spectral veto (centroid 200 to 6000 Hz, entropy). The old detector was removed.
  • Grounded speech scoring: candidates are scored SNR-primary with a saturating duration-adequacy term and a consistency tie-break (scoreSpeechCandidateGrounded). Election flipped from longest-wins to highest-score (findBestSpeechRegion). The seven candidateWeight* constants and the old scorer are gone.
  • Room-tone candidates concept retired from the schema, sidecar, and report. This is a full removal: the package has no RoomToneCandidateMetrics references left.
  • VoiceActivated now means platform-gated capture: it reads the floored (digital-silence) interval fraction (flooredFraction, threshold 0.20), not the old below-split fraction. The below-split signal conflated digital silence with quiet speech gaps and fired on sparse per-speaker podcast tracks; the floored fraction measures the gate signature directly.
  • JSON sidecar fix: IntervalSample and candidateSidecarLine marshal methods sanitise non-finite floats (NaN, ±Inf) to null, mirroring the run-record convention. This stopped an encoding/json error that truncated diagnostic sidecars on digitally-silent audio.
  • Docs synced: AGENTS.md and docs/ updated to the unified detector, highest-score election, and floored-fraction VoiceActivated.
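For readers unfamiliar with the histogram split named in the first bullet, the following is a minimal sketch of Otsu's method over a level histogram. It is not the code in analyser_vad.go: the function name `otsuSplit`, the integer-bin representation, and the -1 "no split" sentinel are assumptions made for illustration.

```go
package main

// otsuSplit returns the bin index t that maximises the between-class variance
// of a histogram (Otsu's method). Bins [0, t) form the low (silence) class and
// bins [t, len(hist)) form the high (speech) class.
// It returns -1 when no split leaves both classes populated.
// Hypothetical helper: name and signature are not taken from the PR.
func otsuSplit(hist []int) int {
	total, sumAll := 0, 0.0
	for i, c := range hist {
		total += c
		sumAll += float64(i) * float64(c)
	}

	best, bestVar := -1, 0.0
	wLow, sumLow := 0, 0.0
	for t := 1; t < len(hist); t++ {
		// Move bin t-1 from the high class into the low class.
		wLow += hist[t-1]
		sumLow += float64(t-1) * float64(hist[t-1])
		wHigh := total - wLow
		if wLow == 0 || wHigh == 0 {
			continue
		}
		meanLow := sumLow / float64(wLow)
		meanHigh := (sumAll - sumLow) / float64(wHigh)
		d := meanLow - meanHigh
		// Between-class variance, up to a constant factor of 1/total^2.
		v := float64(wLow) * float64(wHigh) * d * d
		if v > bestVar {
			best, bestVar = t, v
		}
	}
	return best
}
```

On a clearly bimodal histogram such as `[]int{10, 10, 0, 0, 0, 10, 10}` the sketch places the split at the first bin after the low cluster. Ties across the empty valley resolve to the lowest index because the comparison is strict.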

Calibration

The VoiceActivated threshold (0.20, i.e. 20%) is set from a corpus sweep, not hand-picked. Floored fraction cleanly separates the classes: the three platform-gated recordings sit at 44 to 70%, while the entire corpus sits at or below 0.10%. That leaves a gap of roughly 44 percentage points; the threshold is placed with ~190x margin over the corpus ceiling and ~2.2x under the lowest gated file.

Validation

  • Full corpus sweep plus three real platform-gated recordings: all three gated files flag VoiceActivated: true, all 51 corpus stems (48 continuous + 3 short clips) flag false, zero mismatches.
  • Hop-separation sweep confirmed the 250 ms interval (100 ms regresses speech election on 7 stems).
  • Hermetic unit tests for the floored fraction and the grounded scorer (no audio decode).
  • just build, just test, just lint all green.

  Speech search is now independent of room-tone detection. Instead of
  searching only after the elected room-tone region's end (which hides
  early speech on short clips), the search now scans the whole file from
  t=0 to EOF. The elected room-tone region, when present, only excludes
  its own intervals from speech runs (to keep metrics clean), never sets
  a one-way search floor.

  Changes:
  - Add roomToneSpan struct to carry the elected span with explicit
    "none" sentinel (zero value = no room tone)
  - Change findSpeechCandidatesFromIntervals signature to accept the
    span instead of a single roomToneEnd duration
  - Iterate from intervals[0] to EOF, not from a calculated start index
  - Compute rmsP50 median over non-excluded intervals only (room tone
    does not bias the speech reference)
  - Mask room-tone intervals in place as forced non-speech, routed
    through the existing interruption branch (a run cuts at the boundary,
    never bridged across)
  - Delete speechSearchStartBuffer constant (no longer needed)
  - Rewire selectSpeechProfile to pass the elected span from
    noiseSelection.roomToneResult.BestRegion

Signed-off-by: Martin Wimpress <code@wimpress.io>
…Pass 1

  Replace selectNoiseProfile + selectSpeechProfile pair with detectVoiceActivity:
  one bimodal histogram split feeds both speech region and noise floor. This fixes
  the structural overlap bug and removes magic-number detection gates.

  Detector design: per-interval level histogram on momentary-LUFS axis (primary;
  RMS fallback). Otsu split, clamped [noiseFloor+6, p75]. Low-percentile noise
  floor from same histogram (minimum-statistics, not region-elected). Per-interval
  flag: speech iff level >= split AND spectral veto passes (same rule guards loud
  bridged gaps). Two-mechanism run builder: data-derived hysteresis margin + p75
  gap-tolerance from gap distribution (replaces fixed 2s interruption tolerance).
  Reuse: findBestSpeechRegion election, golden refinement, band decode, spectral-
  veto bounds unchanged. New: detectVoiceActivity orchestrator, Otsu split,
  percentile floor, hysteresis margin, gap-p75 tolerance, longest-low-cluster
  region picker. Moved helpers: medianFloat64/medianAndMAD, extractNoiseProfile
  FromIntervals for re-measure contract. Updated: Noise.FloorSource = "vad_
  percentile", VoiceActivated recomputed from low-cluster fraction.
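The two-mechanism run builder described above (a hysteresis margin plus a gap tolerance) can be sketched roughly as follows. This is an illustrative reading of the commit message, not the detector's code: `run`, `buildRuns`, and the parameter names are hypothetical, the spectral veto is omitted, and the caller is assumed to supply `enter`, `exit` (with `exit <= enter`), and a `maxGap` derived elsewhere, for example from the 75th percentile of observed gap lengths.

```go
package main

// run is a half-open range [Start, End) of interval indices.
// Hypothetical type for illustration.
type run struct{ Start, End int }

// buildRuns groups per-interval levels (dB) into speech runs.
// A run opens when a level reaches enter, is extended by any level that
// reaches exit, and is closed once more than maxGap intervals have passed
// since the last interval that reached exit. Trailing gap intervals are
// trimmed from the closed run.
func buildRuns(levels []float64, enter, exit float64, maxGap int) []run {
	var runs []run
	inRun := false
	start, lastSpeech := 0, 0
	for i, l := range levels {
		switch {
		case !inRun && l >= enter:
			inRun, start, lastSpeech = true, i, i
		case inRun && l >= exit:
			lastSpeech = i
		case inRun && i-lastSpeech > maxGap:
			runs = append(runs, run{Start: start, End: lastSpeech + 1})
			inRun = false
		}
	}
	if inRun {
		runs = append(runs, run{Start: start, End: lastSpeech + 1})
	}
	return runs
}
```

The lower `exit` threshold keeps a run alive through intervals that dip slightly below the entry level, while `maxGap` bridges short pauses; a fixed tolerance would be the special case where `maxGap` is a constant.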

  Validation (48 corpus stems): all hit spec (-16 +/- 0.5 LUFS, TP <= -1 dBTP).
  Three stems elect NO speech profile (same as RMS-axis test; not an axis issue).
  Noise floor rises +9.8 dB mean (new percentile vs old region-elected; larger than
  predicted +3-7). Speech region duration drops (semantic: variable-length runs vs
  fixed 60s, voicing +0.06 to 0.85-1.00). No output pumping regression (final DR
  -5.7 dB mean, driven by edge astats values, no over-gating signature). Detected
  audio: to ears, gate is the bottleneck, not detection. Recommendation:
  zero-candidate misses flagged for gate-overhaul re-validation, not a VAD
  regression.

  Old detector code (selectNoiseProfile, selectSpeechProfile, detectVoiceActivated,
  room-tone candidates) left in-tree for now.

Signed-off-by: Martin Wimpress <code@wimpress.io>
  Recover three VAD failures (LMP-70-popey, LMP-74-popey, LMP-82-mark) that
  were falsely rejected by spectral veto. These speakers have healthy speech
  centroid above 4500 Hz. New bound at 6000 Hz passes all 45 currently elected
  stems and aligns with ITU-T P.56 wideband (50-7000 Hz). Validated: all three
  failures now elect speech regions with centroid 3012-3328 Hz.

Signed-off-by: Martin Wimpress <code@wimpress.io>
… unified VAD

  Delete the old two-part detector.

  - Delete old speech run-formation code (`findSpeechCandidatesFromIntervals`,
    speech scoring, `roomToneSpan` type, dead run-builder constants)
  - Delete old room-tone detector files entirely
    (`analyser_candidates_room_tone_scoring.go`,
    `analyser_candidates_room_tone.go`)
  - Move retained seed estimators (`estimateNoiseFloorAndThreshold`,
    `computeSilenceMedians`) to new `analyser_noise_seed.go` for reuse by
    buildInputMeasurements
  - Remove tests that exercised only the old run-builder; fix stale test
    fixtures (centroid bounds changed in prior commit)

  Unified VAD detector is now the sole live path. Build, lint, and tests
  all green.

Signed-off-by: Martin Wimpress <code@wimpress.io>
  Remove RoomToneCandidates field, RoomToneRegionMetrics type, and
  candidate-measurement functions. ElectedRoomToneSample now sourced
  directly from Pass 1 analysis, decoupled from the removed candidates
  slice. Speech candidates remain active. Update RunRecord, sidecars,
  report rendering, and tests; golden fixture regenerated. No DSP or
  audio quality changes; byte-identical output (candidates block only).

Signed-off-by: Martin Wimpress <code@wimpress.io>
  - Introduce grounded scorer: primary SNR-margin gate (saturating above 40 dB),
    duration adequacy (full credit at 30 s+), consistency tie-break (lower
    within-region level-variance wins); replace seven theatre weights and
    composite with unified single-axis model
  - Election now selects highest-score candidate instead of longest duration
  - Add levelVariance helper to compute per-interval RMS/LUFS variance within
    a region
  - New hermetic unit tests for scorer (SNR monotonicity, saturation, tie-break)
    and election logic (highest-score wins, voice-activated, fallback)
  - Update TestElectSpeechProfile assertion from longest-wins to highest-score

  Thresholds (snrSaturationMargin, speechDurationAdequacyMinimum) are Phase 3
  sweep placeholders.
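A rough sketch of the scoring shape this commit describes, for orientation only. The 40 dB saturation and 30 s adequacy figures come from the message above; everything else is assumed: the `candidate` type, the function names, combining the two terms as a product, and using population variance for the consistency term. The real scoreSpeechCandidateGrounded and findBestSpeechRegion may differ.

```go
package main

import "math"

// candidate is a hypothetical stand-in for a speech candidate region.
type candidate struct {
	SNRMarginDB   float64 // region level above the noise floor, in dB
	DurationSec   float64
	LevelVariance float64 // within-region level variance
}

const (
	snrSaturationDB     = 40.0 // SNR credit saturates above this margin
	durationAdequateSec = 30.0 // full duration credit at or above this length
)

// saturate maps x linearly onto [0, 1], reaching 1 at full.
func saturate(x, full float64) float64 {
	return math.Min(math.Max(x/full, 0), 1)
}

// score combines the SNR and duration terms. The product is an assumption.
func score(c candidate) float64 {
	return saturate(c.SNRMarginDB, snrSaturationDB) *
		saturate(c.DurationSec, durationAdequateSec)
}

// levelVariance returns the population variance of per-interval levels.
func levelVariance(levels []float64) float64 {
	if len(levels) == 0 {
		return 0
	}
	mean := 0.0
	for _, l := range levels {
		mean += l
	}
	mean /= float64(len(levels))
	v := 0.0
	for _, l := range levels {
		d := l - mean
		v += d * d
	}
	return v / float64(len(levels))
}

// elect returns the index of the highest-scoring candidate, preferring the
// lower level variance on an exact score tie. It returns -1 for no candidates.
func elect(cands []candidate) int {
	best := -1
	for i, c := range cands {
		if best < 0 {
			best = i
			continue
		}
		sb, sc := score(cands[best]), score(c)
		if sc > sb || (sc == sb && c.LevelVariance < cands[best].LevelVariance) {
			best = i
		}
	}
	return best
}
```

Because both terms saturate, a very long but noisy region no longer beats a shorter, cleaner one, which is the behavioural change from longest-wins election.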

Signed-off-by: Martin Wimpress <code@wimpress.io>
  IntervalSample and candidateSidecarLine marshal methods now route
  through sanitiseValue to null non-finite floats (NaN, +Inf, -Inf),
  mirroring the run-record convention. This prevents encoding/json errors
  that truncated .intervals.jsonl and aborted .candidates.jsonl on
  digitally-silent (voice-gated) audio. Finite values are byte-identical.
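For context, encoding/json refuses to encode NaN and ±Inf float64 values, which is why a single fully silent interval could abort a sidecar write. A minimal sketch of the null-on-non-finite pattern follows; the type `intervalSample`, its fields, the JSON keys, and the helper `finiteOrNil` are hypothetical and stand in for the real IntervalSample and sanitiseValue.

```go
package main

import (
	"encoding/json"
	"math"
)

// finiteOrNil returns a pointer to v, or nil when v is NaN or ±Inf, so that
// encoding/json emits null instead of returning an unsupported-value error.
func finiteOrNil(v float64) *float64 {
	if math.IsNaN(v) || math.IsInf(v, 0) {
		return nil
	}
	return &v
}

// intervalSample is a hypothetical, cut-down sample record.
type intervalSample struct {
	Index   int
	LevelDB float64
}

// MarshalJSON routes the float field through finiteOrNil before encoding.
func (s intervalSample) MarshalJSON() ([]byte, error) {
	return json.Marshal(struct {
		Index   int      `json:"index"`
		LevelDB *float64 `json:"level_db"`
	}{
		Index:   s.Index,
		LevelDB: finiteOrNil(s.LevelDB),
	})
}

// encodeSample is a small convenience wrapper used to show the output.
func encodeSample(s intervalSample) (string, error) {
	b, err := json.Marshal(s)
	return string(b), err
}
```

With this shape, a sample whose level is -Inf encodes as `{"index":1,"level_db":null}`, while a finite level such as -23.5 is written unchanged.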

Signed-off-by: Martin Wimpress <code@wimpress.io>
…ed fraction

  Changed the VoiceActivated signal from "low-cluster intervals below the split" (which
  conflated digital silence and quiet speech gaps) to "intervals pinned at the
  digital-silence floor" (the platform-gated capture signature). Threshold set
  to 0.20 (20%) from calibration data: ~190x above the corpus ceiling (0.10%)
  and ~2.2x below the lowest gated TT202 track (44.08%). The below-split signal
  was rejected because sparse per-speaker podcast tracks run 50-75% below-split,
  making them indistinguishable from platform-gated recordings.

  - Rename `lowClusterFraction()` to `flooredFraction()` to measure only floored
    intervals (level <= vadLevelFloorDB), skipping NaN but counting fully silent
    (-inf) windows in the denominator
  - Update `vadVoiceActivatedFraction` constant from 0.60 to 0.20
  - Update VoiceActivated comment in NoiseMetrics to reflect platform-gated
    detection intent
  - Add test cases validating gated tracks (floored > 0.20) and sparse podcast
    tracks (floored = 0.0) are correctly distinguished
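The counting rule in the first bullet can be sketched as below. This is an approximation written from the commit message, not the function in the repository: the floor value of -120 dB is a placeholder (the real vadLevelFloorDB is not stated here), and the strict `>` comparison against 0.20 is inferred from the "floored > 0.20" wording.

```go
package main

import "math"

const (
	levelFloorDB           = -120.0 // placeholder; the real vadLevelFloorDB is not given in this PR text
	voiceActivatedFraction = 0.20   // threshold from the calibration sweep
)

// flooredFraction returns the share of intervals pinned at or below the
// digital-silence floor. NaN levels are skipped entirely; -Inf levels are
// counted in the denominator and, being below the floor, in the numerator.
func flooredFraction(levels []float64) float64 {
	floored, counted := 0, 0
	for _, l := range levels {
		if math.IsNaN(l) {
			continue
		}
		counted++
		if l <= levelFloorDB {
			floored++
		}
	}
	if counted == 0 {
		return 0
	}
	return float64(floored) / float64(counted)
}

// voiceActivated reports whether the floored fraction exceeds the threshold.
func voiceActivated(levels []float64) bool {
	return flooredFraction(levels) > voiceActivatedFraction
}
```

A sparse podcast track with long quiet gaps at, say, -65 dB yields a floored fraction of zero under this rule, whereas a gated capture with half of its intervals at -Inf yields 0.5 and is flagged.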

Signed-off-by: Martin Wimpress <code@wimpress.io>
…Activated semantics

  - Update AGENTS.md: unified VAD detector, highest-score speech election, VoiceActivated floored-fraction definition
  - Update docs/Pipeline.md: add VAD flow, updated speech-gate thresholds, highest-score scoring strategy
  - Update docs/Levelator-Comparison-And-Gap-Analysis.md: VoiceActivated meaning for floored fraction
  - Update docs/Usage.md: VoiceActivated flag purpose
  - Fix code comments: highest-score strategy in candidate-shared; centroid 4500→6000 Hz in candidate-speech

Signed-off-by: Martin Wimpress <code@wimpress.io>

@cubic-dev-ai cubic-dev-ai Bot left a comment

Contributor

2 issues found across 25 files

Confidence score: 3/5

  • internal/processor/analyser_noise_seed.go has non-deterministic top-candidate selection on equal scores, so prescan noise-floor results can change based on interval ordering and produce inconsistent analysis/report outputs between runs — add a deterministic tie-break (lower RMS, then index) before truncating candidates prior to merge.
  • internal/report/report_full.md.golden describes a below-Otsu-split fraction while the implemented metric is fixed-floor digital-silence fraction, which can mislead readers and make validation expectations drift — align the definition with the actual metric (or update the metric and regenerate the golden) before merging.

Comment thread internal/processor/analyser_noise_seed.go
Comment thread internal/report/report_full.md.golden Outdated
…c Voice activated definition

  - Add seed ordering to break ties in noise-floor candidate scoring for reproducible prescan results
  - Update Voice activated gloss from "below-split" to "floored (digital-silence)" to match metric change
  - Regenerate report golden file with updated definition
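One way to make top-candidate selection reproducible is a total ordering applied before truncation. The sketch below follows the reviewer's suggested order (score, then lower RMS, then index); the commit's actual "seed ordering" may use different keys, and `seedCandidate` and `topCandidates` are hypothetical names.

```go
package main

import "sort"

// seedCandidate is a hypothetical noise-floor seed candidate.
type seedCandidate struct {
	Index int     // position of the interval in the file
	Score float64 // higher is better
	RMS   float64 // level in dB; lower is preferred on ties
}

// topCandidates returns the n best candidates under a total ordering:
// higher score first, then lower RMS, then lower index. Because no two
// distinct candidates compare equal, the result does not depend on the
// order of the input slice. The input is not modified.
func topCandidates(cands []seedCandidate, n int) []seedCandidate {
	out := append([]seedCandidate(nil), cands...)
	sort.Slice(out, func(i, j int) bool {
		a, b := out[i], out[j]
		if a.Score != b.Score {
			return a.Score > b.Score
		}
		if a.RMS != b.RMS {
			return a.RMS < b.RMS
		}
		return a.Index < b.Index
	})
	if n < 0 {
		n = 0
	}
	if n < len(out) {
		out = out[:n]
	}
	return out
}
```

Sorting on score alone would leave equal-score candidates in an unspecified order, since sort.Slice is not stable, so the truncation could keep different candidates from run to run.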

Signed-off-by: Martin Wimpress <code@wimpress.io>

@cubic-dev-ai cubic-dev-ai Bot left a comment

Contributor

0 issues found across 3 files (changes from recent commits).

Requires human review: Major refactor of core VAD, speech scoring, and noise profiling; AI review found no issues, but it touches critical algorithms and requires human verification for correctness and regression safety.

@flexiondotorg flexiondotorg merged commit 04af1ef into main Jun 16, 2026
16 checks passed
@flexiondotorg flexiondotorg deleted the vad branch June 16, 2026 01:30