When using `text_to_dialogue.convert_with_timestamps`, character alignment data becomes degenerate (all timestamps collapse to the same value) under certain conditions. The audio itself is fine; only the alignment data is affected.
The degradation seems triggered by either:
- Any single-voice input exceeding ~250 chars / ~15s of audio, OR
- More than 6 lines of dialogue / total audio duration exceeding ~30s
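Under the assumption that these are hard thresholds (they're empirical observations from this report, not documented API limits), a pre-flight check might look like:

```python
def likely_to_degrade(texts: list[str],
                      max_chars: int = 250,
                      max_inputs: int = 6) -> bool:
    """Heuristic guess at whether a request will trigger the alignment bug.

    The limits are empirical observations, not documented API constraints;
    treat them as assumptions.
    """
    return any(len(t) > max_chars for t in texts) or len(texts) > max_inputs
```

We run a check like this before calling `convert_with_timestamps` and fall back to the chunked path described below when it returns `True`.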
Workaround
Splitting long inputs into shorter ones (~200 chars each) and capping requests at ~6 inputs produces 100% unique timestamps across all segments. But this noticeably degrades the audio output, because the TTS model has less context per request.
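A minimal sketch of that chunking, splitting on sentence boundaries. The helper names (`split_text`, `batch_inputs`) and the 200-char / 6-input limits are ours, not SDK API:

```python
import re

MAX_CHARS = 200   # empirical per-input limit before alignment degrades
MAX_INPUTS = 6    # empirical per-request limit


def split_text(text: str, max_chars: int = MAX_CHARS) -> list[str]:
    """Split text on sentence boundaries into chunks of <= max_chars."""
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    chunks, current = [], ""
    for s in sentences:
        candidate = f"{current} {s}".strip()
        if current and len(candidate) > max_chars:
            chunks.append(current)
            current = s
        else:
            current = candidate
    if current:
        chunks.append(current)
    return chunks


def batch_inputs(inputs: list[tuple[str, str]]) -> list[list[tuple[str, str]]]:
    """Chunk any over-long (text, voice_id) pair, then group the flattened
    list into batches of <= MAX_INPUTS, one batch per API request."""
    flat = [(chunk, voice) for text, voice in inputs
            for chunk in split_text(text)]
    return [flat[i:i + MAX_INPUTS] for i in range(0, len(flat), MAX_INPUTS)]
```

Each batch then becomes its own `convert_with_timestamps` request, with the returned timestamps offset by the running audio duration.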
So what we usually do is a first pass with `text_to_dialogue.convert`, then an STT pass on the resulting audio to rebuild alignment data. But that is of course subject to STT errors, and it wastes credits if we use ElevenLabs for the STT.
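The rebuild step can be sketched independently of which STT is used: given word-level `(text, start, end)` tuples from any transcription, expand them to per-character start times by linear interpolation. This is an approximation, not true phoneme-level alignment:

```python
def word_to_char_times(
    words: list[tuple[str, float, float]],
) -> tuple[str, list[float]]:
    """Expand word-level (text, start, end) timestamps into per-character
    start times by interpolating linearly within each word."""
    chars: list[str] = []
    starts: list[float] = []
    for text, start, end in words:
        step = (end - start) / max(len(text), 1)
        for i, ch in enumerate(text):
            chars.append(ch)
            starts.append(round(start + i * step, 3))
    return "".join(chars), starts
```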
Reproduction: long individual input
Here's an example where the degradation seems to be triggered by a long individual speech.
```python
from elevenlabs import ElevenLabs, DialogueInput

client = ElevenLabs()

VOICE_A = "SAz9YHcvj6GT2YYXdXww"  # River
VOICE_B = "nPczCjzI2devNBz1zQrb"  # Brian
VOICE_C = "iP95p4xoKVk53GoZ742B"  # Chris

inputs = [
    # Long input (~427 chars, ~28s of audio)
    DialogueInput(
        text="Patrol car is tucked into the parking lot of this abandoned prison. "
        "Bebar, the driver and senior officer, scrolls on his phone. "
        "Quick is restless, this is not his idea of police work at all. "
        "A heavily-tattooed Gangbanger sits ramrod straight in his pulled-over car. "
        "Behind him: Quick and Bebar in their patrol car, overheads flashing. "
        "Traffic whips by, students play on the athletic fields. "
        "There is a sense of foreboding here.",
        voice_id=VOICE_A,
    ),
    DialogueInput(text="He ran the red.", voice_id=VOICE_B),
    DialogueInput(
        text="When he saw us. Jacoby Tate, thirty four, multiple priors, "
        "two weapons charges. Use utmost caution.",
        voice_id=VOICE_C,
    ),
    DialogueInput(text="Quick moves to exit the car.", voice_id=VOICE_A),
    DialogueInput(
        text="Whoa. Easy, tiger. Unlawful use of a weapon and we know "
        "there is an AK out there. What do we do here.",
        voice_id=VOICE_B,
    ),
    DialogueInput(text="I am doing it.", voice_id=VOICE_C),
    DialogueInput(text="We wait for back-up, rookie.", voice_id=VOICE_B),
]

result = client.text_to_dialogue.convert_with_timestamps(
    inputs=inputs,
    model_id="eleven_v3",
    output_format="mp3_44100_128",
)

for vs in result.voice_segments:
    chars = result.alignment.characters[vs.character_start_index:vs.character_end_index]
    starts = result.alignment.character_start_times_seconds[
        vs.character_start_index:vs.character_end_index
    ]
    unique = len(set(starts))
    print(
        f"Seg {vs.dialogue_input_index}: "
        f"{unique}/{len(chars)} unique ({unique * 100 // len(chars)}%) "
        f"start={vs.start_time_seconds:.1f}s"
    )
```
Observed output
```
Seg 0: 427/427 unique (100%) start=0.0s
Seg 1: 15/15 unique (100%) start=28.4s
Seg 2: 1/99 unique (1%) start=31.4s    <-- degrades
Seg 3: 1/28 unique (3%) start=31.4s
Seg 4: 13/101 unique (12%) start=31.4s
Seg 5: 4/14 unique (28%) start=46.4s
Seg 6: 28/28 unique (100%) start=48.9s <-- recovers
```
Segments 2–4 have nearly all timestamps collapsed to a single value. The per-segment `start_time_seconds` is also wrong: segments 2, 3, and 4 all report 31.4s, which can't be correct since they're sequential.
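The collapse is easy to detect programmatically; we use a check along these lines (the helper names and the 90% threshold are our arbitrary choices, not SDK API):

```python
def unique_ratio(starts: list[float]) -> float:
    """Fraction of character start times in a segment that are distinct."""
    return len(set(starts)) / len(starts) if starts else 1.0


def degraded_segments(per_segment_starts: list[list[float]],
                      threshold: float = 0.9) -> list[int]:
    """Return indices of segments whose timestamps look collapsed."""
    return [i for i, starts in enumerate(per_segment_starts)
            if unique_ratio(starts) < threshold]
```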
Also reproducible with many short inputs
Even when no individual input is long, alignment degrades at the tail end once there are 8+ short inputs. With 6 short inputs, alignment is 100% across all segments.
```python
inputs = [
    DialogueInput(text="Patrol car is tucked into the parking lot.", voice_id=VOICE_A),
    DialogueInput(text="He ran the red.", voice_id=VOICE_B),
    DialogueInput(text="When he saw us. Multiple priors, two weapons.", voice_id=VOICE_C),
    DialogueInput(text="Quick moves to exit the car.", voice_id=VOICE_A),
    DialogueInput(text="Whoa. Easy, tiger. What do we do here.", voice_id=VOICE_B),
    DialogueInput(text="I am doing it.", voice_id=VOICE_C),
    DialogueInput(text="We wait for back-up, rookie.", voice_id=VOICE_B),
    DialogueInput(text="Yeah, I am not here for that. You wait for back-up.", voice_id=VOICE_C),
]
```
```
Seg 0: 427/427 unique (100%) start=0.0s
Seg 1: 15/15 unique (100%) start=28.8s
Seg 2: 1/99 unique (1%) start=31.4s    <-- degrades
Seg 3: 1/28 unique (3%) start=31.4s
Seg 4: 13/101 unique (12%) start=31.4s
Seg 5: 4/14 unique (28%) start=46.4s   <-- still bad
Seg 6: 1/28 unique (3%) start=50.1s
Seg 7: 17/51 unique (33%) start=50.1s
```
Environment
- Python SDK: `elevenlabs` (latest from pip)
- Model: `eleven_v3`
- Output format: `mp3_44100_128`
Notes
None of these limits appear in the API docs. The only documented constraint is max 10 unique voice IDs per request.