
text_to_dialogue.convert_with_timestamps: alignment degrades after long input or many segments #760

@dr-skot

Description


When using text_to_dialogue.convert_with_timestamps, character alignment data becomes degenerate (all timestamps collapse to the same value) under certain conditions. The audio itself is fine — only the alignment data is affected.

The degradation seems triggered by either:

  1. Any single-voice input exceeding ~250 chars / ~15s of audio, OR
  2. More than 6 lines of dialogue / total audio duration exceeding ~30s

Workaround

Splitting long inputs into shorter ones (~200 chars) and capping requests at ~6 inputs produces 100% unique timestamps across all segments. But it also produces noticeably lower-quality audio, since the TTS model has less context to work with.
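For reference, the chunking workaround can be sketched like this, assuming the empirical limits above (~200 chars per input, at most 6 inputs per request). Inputs are modeled as plain (text, voice_id) tuples so the sketch stands alone; with the SDK you would build a DialogueInput from each tuple before calling the API.

```python
import re

# Assumed limits, taken from the observations above; not documented by the API.
def split_long_text(text, max_chars=200):
    """Split text on sentence boundaries into pieces of at most max_chars."""
    pieces, current = [], ""
    for sentence in re.split(r"(?<=[.!?])\s+", text.strip()):
        if current and len(current) + 1 + len(sentence) > max_chars:
            pieces.append(current)
            current = sentence
        else:
            current = f"{current} {sentence}".strip()
    if current:
        pieces.append(current)
    return pieces

def chunk_inputs(inputs, max_chars=200, max_per_request=6):
    """Yield batches of (text, voice_id) tuples sized for one request."""
    flat = [
        (piece, voice_id)
        for text, voice_id in inputs
        for piece in split_long_text(text, max_chars)
    ]
    for i in range(0, len(flat), max_per_request):
        yield flat[i:i + max_per_request]
```

Each batch would then go through convert_with_timestamps separately, at the cost of the context loss described above.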

Our usual workaround is therefore a first pass with text_to_dialogue.convert, followed by an STT pass on the resulting audio to rebuild alignment data. That is of course subject to STT errors, and it wastes credits if we use ElevenLabs for the STT.
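The alignment-rebuilding half of that workaround reduces to turning word-level STT timings back into per-character start times. A minimal sketch, assuming the STT pass yields (word, start_s, end_s) triples (the exact response shape depends on the STT model and SDK version), and approximating character starts by linear interpolation within each word:

```python
def words_to_char_alignment(words):
    """Return (characters, char_start_times_seconds) from word timings.

    words: iterable of (text, start_s, end_s) triples, one per STT word.
    Character starts are interpolated linearly inside each word, which is
    an approximation, not the true per-character timing.
    """
    characters, starts = [], []
    for text, start, end in words:
        if not text:
            continue
        duration = end - start
        for i, ch in enumerate(text):
            characters.append(ch)
            starts.append(start + duration * i / len(text))
    return characters, starts
```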

Reproduction: long individual input

Here's an example where the degradation seems to be triggered by a long individual speech.

from elevenlabs import ElevenLabs, DialogueInput

client = ElevenLabs()

VOICE_A = "SAz9YHcvj6GT2YYXdXww"  # River
VOICE_B = "nPczCjzI2devNBz1zQrb"  # Brian
VOICE_C = "iP95p4xoKVk53GoZ742B"  # Chris

inputs = [
    # Long input (~427 chars, ~28s of audio)
    DialogueInput(
        text="Patrol car is tucked into the parking lot of this abandoned prison. "
             "Bebar, the driver and senior officer, scrolls on his phone. "
             "Quick is restless, this is not his idea of police work at all. "
             "A heavily-tattooed Gangbanger sits ramrod straight in his pulled-over car. "
             "Behind him: Quick and Bebar in their patrol car, overheads flashing. "
             "Traffic whips by, students play on the athletic fields. "
             "There is a sense of foreboding here.",
        voice_id=VOICE_A,
    ),
    DialogueInput(text="He ran the red.", voice_id=VOICE_B),
    DialogueInput(
        text="When he saw us. Jacoby Tate, thirty four, multiple priors, "
             "two weapons charges. Use utmost caution.",
        voice_id=VOICE_C,
    ),
    DialogueInput(text="Quick moves to exit the car.", voice_id=VOICE_A),
    DialogueInput(
        text="Whoa. Easy, tiger. Unlawful use of a weapon and we know "
             "there is an AK out there. What do we do here.",
        voice_id=VOICE_B,
    ),
    DialogueInput(text="I am doing it.", voice_id=VOICE_C),
    DialogueInput(text="We wait for back-up, rookie.", voice_id=VOICE_B),
]

result = client.text_to_dialogue.convert_with_timestamps(
    inputs=inputs,
    model_id="eleven_v3",
    output_format="mp3_44100_128",
)

for vs in result.voice_segments:
    chars = result.alignment.characters[vs.character_start_index:vs.character_end_index]
    starts = result.alignment.character_start_times_seconds[
        vs.character_start_index:vs.character_end_index
    ]
    unique = len(set(starts))
    print(
        f"Seg {vs.dialogue_input_index}: "
        f"{unique}/{len(chars)} unique ({unique * 100 // len(chars)}%)  "
        f"start={vs.start_time_seconds:.1f}s"
    )

Observed output

Seg 0: 427/427 unique (100%)  start=0.0s
Seg 1: 15/15 unique (100%)    start=28.4s
Seg 2: 1/99 unique (1%)       start=31.4s   <-- degrades
Seg 3: 1/28 unique (3%)       start=31.4s
Seg 4: 13/101 unique (12%)    start=31.4s
Seg 5: 4/14 unique (28%)      start=46.4s
Seg 6: 28/28 unique (100%)    start=48.9s   <-- recovers

Segments 2–4 have nearly all timestamps collapsed to a single value. voice_segments.start_time_seconds is also wrong: segments 2, 3, and 4 all report 31.4s, which cannot be correct since they play back sequentially.
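A minimal check for this failure mode: flag any segment whose ratio of unique character start times falls below a threshold. Segment boundaries are passed as (start_index, end_index) pairs into the flat alignment array, matching the character_start_index / character_end_index fields used in the loop above; the 0.9 threshold is an arbitrary choice, not an API value.

```python
def degraded_segments(char_start_times, segment_bounds, min_unique_ratio=0.9):
    """Return indices of segments whose timestamps have collapsed.

    char_start_times: flat list of per-character start times (seconds).
    segment_bounds: list of (start_index, end_index) slices, one per segment.
    """
    bad = []
    for seg_idx, (lo, hi) in enumerate(segment_bounds):
        starts = char_start_times[lo:hi]
        if starts and len(set(starts)) / len(starts) < min_unique_ratio:
            bad.append(seg_idx)
    return bad
```

On the output above this would flag segments 2–5, giving callers a cheap way to decide when to fall back to the STT pass.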

Also reproducible with many short inputs

Even when no individual input is long, alignment degrades toward the tail with 8+ short inputs. With 6 short inputs, alignment is 100% unique across all segments.

inputs = [
    DialogueInput(text="Patrol car is tucked into the parking lot.", voice_id=VOICE_A),
    DialogueInput(text="He ran the red.", voice_id=VOICE_B),
    DialogueInput(text="When he saw us. Multiple priors, two weapons.", voice_id=VOICE_C),
    DialogueInput(text="Quick moves to exit the car.", voice_id=VOICE_A),
    DialogueInput(text="Whoa. Easy, tiger. What do we do here.", voice_id=VOICE_B),
    DialogueInput(text="I am doing it.", voice_id=VOICE_C),
    DialogueInput(text="We wait for back-up, rookie.", voice_id=VOICE_B),
    DialogueInput(text="Yeah, I am not here for that. You wait for back-up.", voice_id=VOICE_C),
]

Observed output
Seg 0: 427/427 unique (100%)  start=0.0s
Seg 1: 15/15 unique (100%)    start=28.8s
Seg 2: 1/99 unique (1%)       start=31.4s   <-- degrades
Seg 3: 1/28 unique (3%)       start=31.4s
Seg 4: 13/101 unique (12%)    start=31.4s
Seg 5: 4/14 unique (28%)      start=46.4s   <-- still bad
Seg 6: 1/28 unique (3%)       start=50.1s   
Seg 7: 17/51 unique (33%)     start=50.1s

Environment

  • Python SDK: elevenlabs (latest pip)
  • Model: eleven_v3
  • Output format: mp3_44100_128

Notes

None of these limits appear in the API docs. The only documented constraint is max 10 unique voice IDs per request.
