
text_to_dialogue.convert_with_timestamps: alignment degrades after long input or many segments #760

@dr-skot

Description


When using text_to_dialogue.convert_with_timestamps, character alignment data becomes degenerate (all timestamps collapse to the same value) under certain conditions. The audio itself is fine — only the alignment data is affected.

The degradation seems triggered by either:

  1. Any single-voice input exceeding ~250 chars / ~15s of audio, OR
  2. More than 6 lines of dialogue / total audio duration exceeding ~30s

Workaround

Splitting long inputs into shorter ones (~200 chars) and capping requests at ~6 inputs produces 100% unique timestamps across all segments. But it also produces noticeably lower-quality audio, since the TTS model has less context to work with.
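For reference, the chunking workaround can be sketched like this, assuming the empirical limits above (~200 chars per input, at most 6 inputs per request). Inputs are modeled as plain (text, voice_id) tuples so the sketch stands alone; with the SDK you would build a DialogueInput from each tuple before calling the API.

```python
import re

# Assumed limits, taken from the observations above; not documented by the API.
def split_long_text(text, max_chars=200):
    """Split text on sentence boundaries into pieces of at most max_chars."""
    pieces, current = [], ""
    for sentence in re.split(r"(?<=[.!?])\s+", text.strip()):
        if current and len(current) + 1 + len(sentence) > max_chars:
            pieces.append(current)
            current = sentence
        else:
            current = f"{current} {sentence}".strip()
    if current:
        pieces.append(current)
    return pieces

def chunk_inputs(inputs, max_chars=200, max_per_request=6):
    """Yield batches of (text, voice_id) tuples sized for one request."""
    flat = [
        (piece, voice_id)
        for text, voice_id in inputs
        for piece in split_long_text(text, max_chars)
    ]
    for i in range(0, len(flat), max_per_request):
        yield flat[i:i + max_per_request]
```

Each batch would then go through convert_with_timestamps separately, at the cost of the context loss described above.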

Our usual workaround is therefore a first pass with text_to_dialogue.convert, followed by an STT pass on the resulting audio to rebuild alignment data. That is of course subject to STT errors, and it wastes credits if we use ElevenLabs for the STT.
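The alignment-rebuilding half of that workaround reduces to turning word-level STT timings back into per-character start times. A minimal sketch, assuming the STT pass yields (word, start_s, end_s) triples (the exact response shape depends on the STT model and SDK version), and approximating character starts by linear interpolation within each word:

```python
def words_to_char_alignment(words):
    """Return (characters, char_start_times_seconds) from word timings.

    words: iterable of (text, start_s, end_s) triples, one per STT word.
    Character starts are interpolated linearly inside each word, which is
    an approximation, not the true per-character timing.
    """
    characters, starts = [], []
    for text, start, end in words:
        if not text:
            continue
        duration = end - start
        for i, ch in enumerate(text):
            characters.append(ch)
            starts.append(start + duration * i / len(text))
    return characters, starts
```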

Reproduction: long individual input

Here's an example where the degradation seems to be triggered by a long individual speech.

from elevenlabs import ElevenLabs, DialogueInput

client = ElevenLabs()

VOICE_A = "SAz9YHcvj6GT2YYXdXww"  # River
VOICE_B = "nPczCjzI2devNBz1zQrb"  # Brian
VOICE_C = "iP95p4xoKVk53GoZ742B"  # Chris

inputs = [
    # Long input (~427 chars, ~28s of audio)
    DialogueInput(
        text="Patrol car is tucked into the parking lot of this abandoned prison. "
             "Bebar, the driver and senior officer, scrolls on his phone. "
             "Quick is restless, this is not his idea of police work at all. "
             "A heavily-tattooed Gangbanger sits ramrod straight in his pulled-over car. "
             "Behind him: Quick and Bebar in their patrol car, overheads flashing. "
             "Traffic whips by, students play on the athletic fields. "
             "There is a sense of foreboding here.",
        voice_id=VOICE_A,
    ),
    DialogueInput(text="He ran the red.", voice_id=VOICE_B),
    DialogueInput(
        text="When he saw us. Jacoby Tate, thirty four, multiple priors, "
             "two weapons charges. Use utmost caution.",
        voice_id=VOICE_C,
    ),
    DialogueInput(text="Quick moves to exit the car.", voice_id=VOICE_A),
    DialogueInput(
        text="Whoa. Easy, tiger. Unlawful use of a weapon and we know "
             "there is an AK out there. What do we do here.",
        voice_id=VOICE_B,
    ),
    DialogueInput(text="I am doing it.", voice_id=VOICE_C),
    DialogueInput(text="We wait for back-up, rookie.", voice_id=VOICE_B),
]

result = client.text_to_dialogue.convert_with_timestamps(
    inputs=inputs,
    model_id="eleven_v3",
    output_format="mp3_44100_128",
)

for vs in result.voice_segments:
    chars = result.alignment.characters[vs.character_start_index:vs.character_end_index]
    starts = result.alignment.character_start_times_seconds[
        vs.character_start_index:vs.character_end_index
    ]
    unique = len(set(starts))
    print(
        f"Seg {vs.dialogue_input_index}: "
        f"{unique}/{len(chars)} unique ({unique * 100 // len(chars)}%)  "
        f"start={vs.start_time_seconds:.1f}s"
    )

Observed output

Seg 0: 427/427 unique (100%)  start=0.0s
Seg 1: 15/15 unique (100%)    start=28.4s
Seg 2: 1/99 unique (1%)       start=31.4s   <-- degrades
Seg 3: 1/28 unique (3%)       start=31.4s
Seg 4: 13/101 unique (12%)    start=31.4s
Seg 5: 4/14 unique (28%)      start=46.4s
Seg 6: 28/28 unique (100%)    start=48.9s   <-- recovers

Segments 2–4 have nearly all timestamps collapsed to a single value. voice_segments.start_time_seconds is also wrong: segments 2, 3, and 4 all report 31.4s, which cannot be correct since they play back sequentially.
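A minimal check for this failure mode: flag any segment whose ratio of unique character start times falls below a threshold. Segment boundaries are passed as (start_index, end_index) pairs into the flat alignment array, matching the character_start_index / character_end_index fields used in the loop above; the 0.9 threshold is an arbitrary choice, not an API value.

```python
def degraded_segments(char_start_times, segment_bounds, min_unique_ratio=0.9):
    """Return indices of segments whose timestamps have collapsed.

    char_start_times: flat list of per-character start times (seconds).
    segment_bounds: list of (start_index, end_index) slices, one per segment.
    """
    bad = []
    for seg_idx, (lo, hi) in enumerate(segment_bounds):
        starts = char_start_times[lo:hi]
        if starts and len(set(starts)) / len(starts) < min_unique_ratio:
            bad.append(seg_idx)
    return bad
```

On the output above this would flag segments 2–5, giving callers a cheap way to decide when to fall back to the STT pass.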

Also reproducible with many short inputs

Even when no individual input is long, alignment degrades toward the tail with 8+ short inputs. With 6 short inputs, alignment is 100% unique across all segments.

inputs = [
    DialogueInput(text="Patrol car is tucked into the parking lot.", voice_id=VOICE_A),
    DialogueInput(text="He ran the red.", voice_id=VOICE_B),
    DialogueInput(text="When he saw us. Multiple priors, two weapons.", voice_id=VOICE_C),
    DialogueInput(text="Quick moves to exit the car.", voice_id=VOICE_A),
    DialogueInput(text="Whoa. Easy, tiger. What do we do here.", voice_id=VOICE_B),
    DialogueInput(text="I am doing it.", voice_id=VOICE_C),
    DialogueInput(text="We wait for back-up, rookie.", voice_id=VOICE_B),
    DialogueInput(text="Yeah, I am not here for that. You wait for back-up.", voice_id=VOICE_C),
]

Observed output
Seg 0: 427/427 unique (100%)  start=0.0s
Seg 1: 15/15 unique (100%)    start=28.8s
Seg 2: 1/99 unique (1%)       start=31.4s   <-- degrades
Seg 3: 1/28 unique (3%)       start=31.4s
Seg 4: 13/101 unique (12%)    start=31.4s
Seg 5: 4/14 unique (28%)      start=46.4s   <-- still bad
Seg 6: 1/28 unique (3%)       start=50.1s   
Seg 7: 17/51 unique (33%)     start=50.1s

Environment

  • Python SDK: elevenlabs (latest pip)
  • Model: eleven_v3
  • Output format: mp3_44100_128

Notes

None of these limits appear in the API docs. The only documented constraint is max 10 unique voice IDs per request.
