Motivation
Multi-turn conversations rely on KV reuse (prefix caching, radix attention) to reduce the cost of long prefills as the context length grows. However, allowing full KV reuse between exact copies of the same sample (e.g., between dataset copies) is not the intended behavior.
Proposed Solution
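Add a distinct cache salt to each copy of an identical sample, so that the copies do not share KV cache entries while later turns of the same conversation still can.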
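To illustrate the idea, here is a minimal sketch of salted prefix hashing. It is not taken from any particular inference engine: `BLOCK_SIZE`, `block_hashes`, `shared_prefix_blocks`, and the salt strings are all hypothetical names chosen for this example, and the chained SHA-256 scheme is only an assumption about how a prefix cache might key its blocks.

```python
import hashlib

BLOCK_SIZE = 4  # tokens per cache block (illustrative value, an assumption)


def block_hashes(token_ids, cache_salt=""):
    """Chain-hash each full block of tokens.

    The salt seeds the chain, so every block hash depends on it.
    Hypothetical helper, not a real library API.
    """
    hashes = []
    parent = hashlib.sha256(cache_salt.encode("utf-8")).hexdigest()
    n_full = len(token_ids) - len(token_ids) % BLOCK_SIZE
    for start in range(0, n_full, BLOCK_SIZE):
        block = token_ids[start:start + BLOCK_SIZE]
        payload = parent + ":" + ",".join(map(str, block))
        parent = hashlib.sha256(payload.encode("utf-8")).hexdigest()
        hashes.append(parent)
    return hashes


def shared_prefix_blocks(a, b):
    """Count how many leading block hashes two requests have in common."""
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return n


sample = list(range(10))  # stand-in token IDs for one dataset sample

# Two copies of the same sample, each with its own salt.
copy_0 = block_hashes(sample, cache_salt="copy-0")
copy_1 = block_hashes(sample, cache_salt="copy-1")

# A later turn of copy 0: same salt, longer context.
turn_2 = block_hashes(sample + [10, 11, 12, 13], cache_salt="copy-0")

print(shared_prefix_blocks(copy_0, copy_1))  # blocks reusable across copies
print(shared_prefix_blocks(copy_0, turn_2))  # blocks reusable across turns
```

Under these assumptions, the two copies share no cache blocks, while the second turn of the same copy still reuses every full block from the first turn.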
Alternatives Considered
No response
Additional Context
No response