Unified VAD detector, grounded speech scoring, and platform-gated VoiceActivated #126
Merged
Speech search is now independent of room-tone detection. Instead of
searching only after the elected room-tone region's end (which hides
early speech on short clips), the search now scans the whole file from
t=0 to EOF. The elected room-tone region, when present, only excludes
its own intervals from speech runs (to keep metrics clean), never sets
a one-way search floor.
Changes:
- Add roomToneSpan struct to carry the elected span with explicit
"none" sentinel (zero value = no room tone)
- Change findSpeechCandidatesFromIntervals signature to accept the
span instead of a single roomToneEnd duration
- Iterate from intervals[0] to EOF, not from a calculated start index
- Compute rmsP50 median over non-excluded intervals only (room tone
does not bias the speech reference)
- Mask room-tone intervals in place as forced non-speech, routed
through the existing interruption branch (a run cuts at the boundary,
never bridged across)
- Delete speechSearchStartBuffer constant (no longer needed)
- Rewire selectSpeechProfile to pass the elected span from
noiseSelection.roomToneResult.BestRegion
Signed-off-by: Martin Wimpress <code@wimpress.io>
…Pass 1

Replace selectNoiseProfile + selectSpeechProfile pair with
detectVoiceActivity: one bimodal histogram split feeds both speech region
and noise floor. This fixes the structural overlap bug and removes
magic-number detection gates.

Detector design:
- Per-interval level histogram on momentary-LUFS axis (primary; RMS
fallback)
- Otsu split, clamped [noiseFloor+6, p75]
- Low-percentile noise floor from same histogram (minimum-statistics, not
region-elected)
- Per-interval flag: speech iff level >= split AND spectral veto passes
(same rule guards loud bridged gaps)
- Two-mechanism run builder: data-derived hysteresis margin + p75
gap-tolerance from gap distribution (replaces fixed 2s interruption
tolerance)

Reuse: findBestSpeechRegion election, golden refinement, band decode,
spectral-veto bounds unchanged.
New: detectVoiceActivity orchestrator, Otsu split, percentile floor,
hysteresis margin, gap-p75 tolerance, longest-low-cluster region picker.
Moved helpers: medianFloat64/medianAndMAD, extractNoiseProfileFromIntervals
for re-measure contract.
Updated: Noise.FloorSource = "vad_percentile", VoiceActivated recomputed
from low-cluster fraction.

Validation (48 corpus stems):
- All hit spec (-16 +/- 0.5 LUFS, TP <= -1 dBTP)
- Three stems elect NO speech profile (same as RMS-axis test; not an axis
issue)
- Noise floor rises +9.8 dB mean (new percentile vs old region-elected;
larger than predicted +3-7)
- Speech region duration drops (semantic: variable-length runs vs fixed
60s, voicing +0.06 to 0.85-1.00)
- No output pumping regression (final DR -5.7 dB mean, driven by edge
astats values, no over-gating signature)

Detected audio: to ears, gate is the bottleneck, not detection.
Recommendation: zero-candidate misses flagged for gate-overhaul
re-validation, not a VAD regression.

Old detector code (selectNoiseProfile, selectSpeechProfile,
detectVoiceActivated, room-tone candidates) left in-tree for now.
Signed-off-by: Martin Wimpress <code@wimpress.io>
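The split-and-clamp step described in the commit above can be sketched roughly as follows. This is a minimal illustration, not the code in `analyser_vad.go`: the function names `otsuSplit` and `clampSplit`, the histogram range and the 1 dB bin width are all assumptions, and the choice to let the lower bound win when the clamp bounds cross is a guess.

```go
package main

import (
	"fmt"
	"math"
)

// otsuSplit (hypothetical name) histograms the finite levels into fixed-width
// bins over [lo, hi) and returns the upper edge of the bin that maximises
// between-class variance. It returns NaN when there is nothing to split.
func otsuSplit(levels []float64, lo, hi, binWidth float64) float64 {
	if binWidth <= 0 || hi <= lo {
		return math.NaN()
	}
	nBins := int(math.Ceil((hi - lo) / binWidth))
	hist := make([]float64, nBins)
	total := 0.0
	for _, l := range levels {
		if math.IsNaN(l) || math.IsInf(l, 0) {
			continue // non-finite windows carry no level information
		}
		b := int((l - lo) / binWidth)
		if b < 0 {
			b = 0
		}
		if b >= nBins {
			b = nBins - 1
		}
		hist[b]++
		total++
	}
	if total == 0 {
		return math.NaN()
	}
	sumAll := 0.0
	for i, c := range hist {
		sumAll += float64(i) * c
	}
	var wLow, sumLow, bestVar float64
	bestBin := 0
	for i, c := range hist {
		wLow += c
		sumLow += float64(i) * c
		wHigh := total - wLow
		if wLow == 0 {
			continue
		}
		if wHigh == 0 {
			break
		}
		mLow := sumLow / wLow
		mHigh := (sumAll - sumLow) / wHigh
		// Between-class variance, up to a constant factor of 1/total^2.
		v := wLow * wHigh * (mLow - mHigh) * (mLow - mHigh)
		if v > bestVar {
			bestVar, bestBin = v, i
		}
	}
	return lo + float64(bestBin+1)*binWidth
}

// clampSplit (hypothetical name) applies the [noiseFloor+6, p75] clamp from
// the commit message. Assumption: the lower bound wins if the bounds cross.
func clampSplit(split, noiseFloor, p75 float64) float64 {
	return math.Max(noiseFloor+6, math.Min(split, p75))
}

func main() {
	// Toy bimodal input: five quiet windows and five speech-level windows.
	levels := []float64{-60, -60, -60, -60, -60, -20, -20, -20, -20, -20}
	split := otsuSplit(levels, -80, 0, 1)
	fmt.Println("raw split:", split)
	fmt.Println("clamped:  ", clampSplit(split, -60, -20))
}
```

On a cleanly bimodal input the raw Otsu split sits just above the quiet cluster, which is where the lower clamp bound earns its place: it keeps the threshold a fixed margin above the noise floor.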
Recover three VAD failures (LMP-70-popey, LMP-74-popey, LMP-82-mark) that
were falsely rejected by spectral veto. These speakers have healthy speech
centroid above 4500 Hz. New bound at 6000 Hz passes all 45 currently
elected stems and aligns with ITU-T P.56 wideband (50-7000 Hz). Validated:
all three failures now elect speech regions with centroid 3012-3328 Hz.
Signed-off-by: Martin Wimpress <code@wimpress.io>
… unified VAD
Delete the old two-part detector.
- Delete old speech run-formation code (`findSpeechCandidatesFromIntervals`,
speech scoring, `roomToneSpan` type, dead run-builder constants)
- Delete old room-tone detector files entirely
(`analyser_candidates_room_tone_scoring.go`,
`analyser_candidates_room_tone.go`)
- Move retained seed estimators (`estimateNoiseFloorAndThreshold`,
`computeSilenceMedians`) to new `analyser_noise_seed.go` for reuse by
buildInputMeasurements
- Remove tests that exercised only the old run-builder; fix stale test
fixtures (centroid bounds changed in prior commit)
Unified VAD detector is now the sole live path. Build, lint, and tests
all green.
Signed-off-by: Martin Wimpress <code@wimpress.io>
Remove RoomToneCandidates field, RoomToneRegionMetrics type, and
candidate-measurement functions. ElectedRoomToneSample now sourced directly
from Pass 1 analysis, decoupled from the removed candidates slice. Speech
candidates remain active. Update RunRecord, sidecars, report rendering, and
tests; golden fixture regenerated. No DSP or audio quality changes;
byte-identical output (candidates block only).
Signed-off-by: Martin Wimpress <code@wimpress.io>
- Introduce grounded scorer: primary SNR-margin gate (saturating above 40 dB),
duration adequacy (full credit at 30 s+), consistency tie-break (lower
within-region level-variance wins); replace seven theatre weights and
composite with unified single-axis model
- Election now selects highest-score candidate instead of longest duration
- Add levelVariance helper to compute per-interval RMS/LUFS variance within
a region
- New hermetic unit tests for scorer (SNR monotonicity, saturation, tie-break)
and election logic (highest-score wins, voice-activated, fallback)
- Update TestElectSpeechProfile assertion from longest-wins to highest-score
Thresholds (snrSaturationMargin, speechDurationAdequacyMinimum) are Phase 3
sweep placeholders.
Signed-off-by: Martin Wimpress <code@wimpress.io>
IntervalSample and candidateSidecarLine marshal methods now route through
sanitiseValue to null non-finite floats (NaN, +Inf, -Inf), mirroring the
run-record convention. This prevents encoding/json errors that truncated
.intervals.jsonl and aborted .candidates.jsonl on digitally-silent
(voice-gated) audio. Finite values are byte-identical.
Signed-off-by: Martin Wimpress <code@wimpress.io>
…ed fraction
Changed the VAD signal from "low-cluster intervals below the split" (which
conflated digital silence and quiet speech gaps) to "intervals pinned at the
digital-silence floor" (the platform-gated capture signature). Threshold set
to 0.20 (20%) from calibration data: ~190x above the corpus ceiling (0.10%)
and ~2.2x below the lowest gated TT202 track (44.08%). The below-split signal
was rejected because sparse per-speaker podcast tracks run 50-75% below-split,
making them indistinguishable from platform-gated recordings.
- Rename `lowClusterFraction()` to `flooredFraction()` to measure only floored
intervals (level <= vadLevelFloorDB), skipping NaN but counting fully silent
(-inf) windows in the denominator
- Update `vadVoiceActivatedFraction` constant from 0.60 to 0.20
- Update VoiceActivated comment in NoiseMetrics to reflect platform-gated
detection intent
- Add test cases validating gated tracks (floored > 0.20) and sparse podcast
tracks (floored = 0.0) are correctly distinguished
Signed-off-by: Martin Wimpress <code@wimpress.io>
…Activated semantics
- Update AGENTS.md: unified VAD detector, highest-score speech election,
VoiceActivated floored-fraction definition
- Update docs/Pipeline.md: add VAD flow, updated speech-gate thresholds,
highest-score scoring strategy
- Update docs/Levelator-Comparison-And-Gap-Analysis.md: VoiceActivated
meaning for floored fraction
- Update docs/Usage.md: VoiceActivated flag purpose
- Fix code comments: highest-score strategy in candidiate-shared; centroid
4500→6000 Hz in candidate-speech
Signed-off-by: Martin Wimpress <code@wimpress.io>
Contributor
2 issues found across 25 files
Confidence score: 3/5
- `internal/processor/analyser_noise_seed.go` has non-deterministic top-candidate selection on equal scores, so prescan noise-floor results can change based on interval ordering and produce inconsistent analysis/report outputs between runs. Add a deterministic tie-break (lower RMS, then index) before truncating candidates prior to merge.
- `internal/report/report_full.md.golden` describes a below-Otsu-split fraction while the implemented metric is fixed-floor digital-silence fraction, which can mislead readers and make validation expectations drift. Align the definition with the actual metric (or update the metric and regenerate the golden) before merging.
…c Voice activated definition
- Add seed ordering to break ties in noise-floor candidate scoring for
reproducible prescan results
- Update Voice activated gloss from "below-split" to "floored
(digital-silence)" to match metric change
- Regenerate report golden file with updated definition
Signed-off-by: Martin Wimpress <code@wimpress.io>
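The tie-break the review asked for (score first, then lower RMS, then index) can be expressed as a total ordering applied before truncation. The sketch below is illustrative only: the `seedCandidate` type and `topCandidatesSketch` function are hypothetical, and it assumes scores and RMS values are finite, since NaN would break the ordering.

```go
package main

import (
	"fmt"
	"sort"
)

// seedCandidate is a hypothetical stand-in for a noise-floor seed candidate.
type seedCandidate struct {
	Index int     // position of the interval in the source
	Score float64 // higher is better
	RMS   float64 // dB; lower wins on equal score
}

// topCandidatesSketch returns the best n candidates under a total ordering:
// score descending, then RMS ascending, then index ascending. Because no two
// distinct candidates compare equal, the result does not depend on input order.
func topCandidatesSketch(cands []seedCandidate, n int) []seedCandidate {
	out := append([]seedCandidate(nil), cands...) // leave the caller's slice untouched
	sort.Slice(out, func(i, j int) bool {
		a, b := out[i], out[j]
		if a.Score != b.Score {
			return a.Score > b.Score
		}
		if a.RMS != b.RMS {
			return a.RMS < b.RMS
		}
		return a.Index < b.Index
	})
	if n < 0 {
		n = 0
	}
	if n < len(out) {
		out = out[:n]
	}
	return out
}

func main() {
	cands := []seedCandidate{
		{Index: 0, Score: 1.0, RMS: -50},
		{Index: 1, Score: 1.0, RMS: -55},
		{Index: 2, Score: 0.5, RMS: -60},
		{Index: 3, Score: 1.0, RMS: -55},
	}
	for _, c := range topCandidatesSketch(cands, 2) {
		fmt.Println(c.Index, c.Score, c.RMS)
	}
}
```

Sorting on the full key before slicing is what makes the truncated set reproducible; sorting on score alone would leave the survivors among equal scores to whatever order the intervals arrived in.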
Contributor
0 issues found across 3 files (changes from recent commits).
Requires human review: Major refactor of core VAD, speech scoring, and noise profiling; AI review found no issues, but it touches critical algorithms and requires human verification for correctness and regression safety.
Summary
Reworks Pass 1 voice-activity detection and the speech/room-tone region model. Net change is 25 files, +2418 / -3316 (removes more than it adds).
- … (`analyser_vad.go`): a single detector replaces the former separate speech/room-tone detection. Level histogram with an Otsu speech/silence split, percentile noise floor, hysteresis-built speech runs with adaptive gap-tolerance, and a spectral veto (centroid 200 to 6000 Hz, entropy). The old detector was removed.
- … (`scoreSpeechCandidateGrounded`). Election flipped from longest-wins to highest-score (`findBestSpeechRegion`). The seven `candidateWeight*` constants and the old scorer are gone.
- … `RoomToneCandidateMetrics` references left.
- `VoiceActivated` now means platform-gated capture: it reads the floored (digital-silence) interval fraction (`flooredFraction`, threshold 0.20), not the old below-split fraction. The below-split signal conflated digital silence with quiet speech gaps and fired on sparse per-speaker podcast tracks; the floored fraction measures the gate signature directly.
- `IntervalSample` and `candidateSidecarLine` marshal methods sanitise non-finite floats (NaN, ±Inf) to null, mirroring the run-record convention. This stopped an `encoding/json` error that truncated diagnostic sidecars on digitally-silent audio.
- `AGENTS.md` and `docs/` updated to the unified detector, highest-score election, and floored-fraction `VoiceActivated`.

Calibration

The `VoiceActivated` threshold (0.20) is set from a corpus sweep, not hand-picked. Floored fraction cleanly separates the classes: three platform-gated recordings at 44 to 70%, the entire corpus at or below 0.10%. A 44-point gap, threshold placed with ~190x margin over the corpus ceiling and ~2.2x under the lowest gated file.

Validation

- … `VoiceActivated: true`, all 51 corpus stems (48 continuous + 3 short clips) flag false, zero mismatches.
- `just build`, `just test`, `just lint` all green.