feat(livekit): auto-flag stereo on audio tracks with num_channels == 2#1023

Open
jkp wants to merge 1 commit into livekit:main from supertest-ai:stereo-autodetect-upstream

Conversation

@jkp jkp commented Apr 18, 2026

Problem

When an application publishes a 2-channel NativeAudioSource via the Rust SDK and pushes asymmetric stereo AudioFrames (e.g. silence on L, speech on R), the client-side receive path sees identical L and R content on every frame — as if the track were mono-duplicated. SDP advertises stereo=1 thanks to the standard Opus fmtp, and MediaStreamTrack.getSettings().channelCount returns 2 on the receiver, but the actual decoded content is mono.

Root cause: the server locks the track to mono-encoded Opus at negotiation time, and libwebrtc's Opus encoder downmixes asymmetric input accordingly, before any content reaches the wire. The server makes this decision based on AddTrackRequest.audio_features (specifically TF_STEREO) and the deprecated AddTrackRequest.stereo bool. The JS client SDK (LocalParticipant.ts) sets both flags when opts.forceStereo === true or when MediaStreamTrack.getSettings().channelCount === 2. The Rust SDK never sets either — it doesn't read num_channels off the source, and TrackPublishOptions has no force_stereo / audio_preset field.

Fix

If the track being published has a NativeAudioSource with num_channels == 2, push TfStereo into audio_features and set the deprecated stereo bool. Both go on the existing AddTrackRequest already built in publish_track, right next to the analogous TfPreconnectBuffer handling.

RtcAudioSource::num_channels() is private (generated via enum_dispatch! in libwebrtc), so we match the Native variant directly to reach the public NativeAudioSource::num_channels(). A wildcard arm covers the #[non_exhaustive] enum.
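The shape of the change can be sketched with simplified stand-in types (the real patch edits `publish_track` against the SDK's generated protobuf types; the field and variant names below mirror the description above, not the exact generated code):

```rust
// Stand-in for the SDK's NativeAudioSource, which exposes a public
// num_channels() accessor.
struct NativeAudioSource {
    num_channels: u32,
}

impl NativeAudioSource {
    fn num_channels(&self) -> u32 {
        self.num_channels
    }
}

// RtcAudioSource::num_channels() is generated privately via enum_dispatch!,
// so we match the Native variant directly to reach the public accessor.
#[non_exhaustive]
enum RtcAudioSource {
    Native(NativeAudioSource),
}

// Stand-in for the protobuf AddTrackRequest fields the patch touches.
#[derive(Default)]
struct AddTrackRequest {
    audio_features: Vec<i32>,
    stereo: bool, // deprecated, still set for robustness across server versions
}

const TF_STEREO: i32 = 2; // stand-in for proto::AudioTrackFeature::TfStereo

fn flag_stereo_if_needed(req: &mut AddTrackRequest, source: &RtcAudioSource) {
    #[allow(unreachable_patterns)]
    let channels = match source {
        RtcAudioSource::Native(native) => native.num_channels(),
        // wildcard arm covers the #[non_exhaustive] enum: unknown variants
        // keep the server's default (mono) negotiation
        _ => 1,
    };
    if channels == 2 {
        req.audio_features.push(TF_STEREO);
        req.stereo = true;
    }
}
```

This mirrors the JS SDK's behavior of setting both the feature flag and the deprecated bool, so the hint survives on both old and new servers.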

Verified

Built liblivekit_ffi for aarch64-apple-darwin with this patch, dropped it into a Python SDK install, ran a LiveKit session with an agent publishing 2-channel 24 kHz audio (silence on L, TTS speech on R), and captured the decoded track at the Chrome client via MediaStreamTrackProcessor:

frame 5:  ch=2 L=-Infinity dBFS R=-12.8 dBFS  |diff|=Infinity
frame 10: ch=2 L=-Infinity dBFS R=-90.5 dBFS  |diff|=Infinity
frame 15: ch=2 L=-Infinity dBFS R=-21.8 dBFS  |diff|=Infinity
...

L reads at codec floor on every frame; R tracks the speech envelope. Without the patch, L == R exactly on every frame with content magnitude matching whatever we pushed on R.
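For reference, the per-channel readings above come down to an RMS-to-dBFS computation over each channel of the interleaved decoded frame. A minimal illustrative helper (not part of the patch; assumes channel 0 = L, channel 1 = R):

```rust
/// RMS level of one channel of an interleaved stereo f32 buffer, in dBFS.
/// An all-zero channel yields f64::NEG_INFINITY, matching the
/// "-Infinity dBFS" readings on the silent left channel.
fn channel_dbfs(interleaved: &[f32], channel: usize) -> f64 {
    let mut sum_sq = 0.0f64;
    let mut n = 0usize;
    // Samples for one channel sit at every second index, offset by channel.
    for &s in interleaved.iter().skip(channel).step_by(2) {
        sum_sq += (s as f64) * (s as f64);
        n += 1;
    }
    let rms = (sum_sq / n as f64).sqrt();
    20.0 * rms.log10()
}
```

A mono-downmixed track shows the two channel readings tracking each other exactly; with the patch, L stays at negative infinity while R moves with the speech.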

Context

This came out of building a stereo marker channel on top of the speech track (silent L → marker tones placed sample-aligned with speech boundaries on L, speech PCM on R) for turn-boundary synchronization. All publisher-side workarounds we tried (disabling APM via AudioSourceOptions, labelling as SOURCE_SCREENSHARE_AUDIO, pinning max_bitrate, pushing native 48 kHz frames to bypass the resampler) failed because the server's mono negotiation precedes everything else. The JS SDK's explicit stereo hint on AddTrackRequest is the only thing that prevents it.

Happy to split the stereo bool (deprecated) and the TfStereo feature into separate conditionals if you prefer — currently sending both for robustness across server versions.

Without setting TF_STEREO in AddTrackRequest.audio_features (or the
deprecated stereo bool), the LiveKit server negotiates the published
audio track as mono. libwebrtc's Opus encoder then downmixes any
asymmetric stereo content to mono-duplicated-both-channels, so the
receiver sees identical L and R on every frame regardless of what the
publisher pushed via capture_frame.

The JS SDK sets these same flags based on ``MediaStreamTrack.getSettings().channelCount``
or explicit ``opts.forceStereo`` (see livekit/client-sdk-js
LocalParticipant.ts ``isStereo`` handling). The Rust SDK doesn't
expose an equivalent option and doesn't infer from the audio source's
declared ``num_channels``.

This patch closes that gap: if the track's underlying source declares
``num_channels == 2``, flag the track as stereo on the AddTrackRequest.
``RtcAudioSource::num_channels()`` is not a public accessor (generated
via ``enum_dispatch!``), so we match the ``Native`` variant directly
and keep a wildcard arm for the ``#[non_exhaustive]`` enum.

Verified end-to-end against a Chrome client via
``MediaStreamTrackProcessor``: before the patch, L == R on every frame
with all our content on R; after, L stays at codec floor (-100+ dBFS)
while R carries the TTS speech envelope, matching what the publisher
pushed.

Discussion context: timeline-protocol-v6 stereo marker channel
(silent-L / TTS-on-R for sample-aligned marker tones) kept showing
identical L and R at the client despite SDP advertising stereo=1 and
AudioSource being constructed with num_channels=2. Every workaround
we tried on the Python side (APM options, ``SOURCE_SCREENSHARE_AUDIO``,
``max_bitrate`` pinning, native 48 kHz source, disabling libwebrtc
voice processing) failed because the server-side SDP negotiation
already locked the track to mono before our audio ever reached the
encoder.
@CLAassistant

CLA assistant check
Thank you for your submission! We really appreciate it. Like many open source projects, we ask that you sign our Contributor License Agreement before we can accept your contribution.


Jamie Kirkpatrick does not appear to be a GitHub user. You need a GitHub account to be able to sign the CLA. If you already have a GitHub account, please add the email address used for this commit to your account.
