Skip to content

[CORE-16144] Sync timeout shadow linking#30574

Open
Lazin wants to merge 3 commits into
redpanda-data:devfrom
Lazin:fix/sync-timeout-shadow-linking
Open

[CORE-16144] Sync timeout shadow linking#30574
Lazin wants to merge 3 commits into
redpanda-data:devfrom
Lazin:fix/sync-timeout-shadow-linking

Conversation

@Lazin
Copy link
Copy Markdown
Contributor

@Lazin Lazin commented May 21, 2026

A bunch of CI failure fixes.

  • add a timeout to the fence_epoch call and handle timeout error
  • add parallel fanout for the RedpandaService.validate_metastore to avoid timeout at scale

Backports Required

  • none - not a bug fix
  • none - this is a backport
  • none - issue does not exist in previous branches
  • none - papercut/not impactful enough to backport
  • v26.1.x
  • v25.3.x
  • v25.2.x

Release Notes

  • none

Lazin added 2 commits May 21, 2026 10:19
Throw ss::timed_out_error instead of std::runtime_error so the missed
sync deadline is identifiable by type rather than message string. Add a
timeout parameter to fence_epoch (default unchanged at 10s, moved into
ctp_stm as a static constexpr member) so callers can pass their own
deadline.

Add //src/v/model to the ctp_stm bazel target since ctp_stm.h now
includes model/timeout_clock.h.

Signed-off-by: Evgeny Lazin <4lazin@gmail.com>
Plumb a timeout parameter through ctp_stm_api::fence_epoch so callers
can pass their own deadline. The bulk-replicate paths in the frontend
now hand fence_epoch the same budget they already use for the L0
upload: replicate_at_offset forwards its `timeout` argument and
replicate forwards opts.timeout.value_or(L0_replicate_default_timeout).

At both call sites, catch ss::timed_out_error and convert it to a
typed errc (raft::errc::timeout / kafka::error_code::request_timed_out)
instead of re-throwing. This keeps the cluster-link replicator's
catch-all from logging the transient sync-timeout case at ERROR.

Signed-off-by: Evgeny Lazin <4lazin@gmail.com>
Copilot AI review requested due to automatic review settings May 21, 2026 15:03
@Lazin Lazin requested a review from dotnwat May 21, 2026 15:05
validate_metastore previously iterated topics x partitions serially,
issuing one ValidatePartition admin RPC at a time. At fleet scale
(thousands of partitions per topic) the wall-clock summed across
calls overran ducktape test timeouts even when each call completed
quickly.

Fan the per-partition calls out through a bounded ThreadPoolExecutor
(default 16 workers, exposed as max_concurrent_partitions) so the
wall-clock cost is dominated by the slowest call rather than the sum.
Per-partition pagination remains sequential (resume_at_offset is
intrinsically ordered), and anomaly collection is guarded by a lock.
Any per-partition exception is surfaced once all futures settle.

Signed-off-by: Evgeny Lazin <4lazin@gmail.com>
@Lazin Lazin force-pushed the fix/sync-timeout-shadow-linking branch from fd541e5 to 9917296 Compare May 21, 2026 15:07
Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR addresses CI failures/timeouts by bounding fence_epoch sync time and by reducing end-to-end metastore validation time in large-scale ducktape runs.

Changes:

  • Add an explicit timeout parameter to ctp_stm::fence_epoch / ctp_stm_api::fence_epoch and throw a typed timeout exception on sync failure.
  • Propagate the replicate timeout budget into fence_epoch and translate ss::timed_out_error into typed timeout results in cloud-topics frontend replicate paths.
  • Validate metastore partitions concurrently (bounded by a new max_concurrent_partitions parameter) to reduce wall-clock time.

Reviewed changes

Copilot reviewed 7 out of 7 changed files in this pull request and generated 3 comments.

Show a summary per file
File Description
tests/rptest/services/redpanda.py Adds concurrent per-partition metastore validation within each topic to reduce ducktape wall-clock time.
src/v/cloud_topics/level_zero/stm/ctp_stm.h Adds a default sync timeout and extends fence_epoch API with a timeout parameter.
src/v/cloud_topics/level_zero/stm/ctp_stm.cc Uses caller-provided timeout for STM sync and switches timeout failure to ss::timed_out_error.
src/v/cloud_topics/level_zero/stm/ctp_stm_api.h Extends ctp_stm_api::fence_epoch to accept a timeout and documents intended usage.
src/v/cloud_topics/level_zero/stm/ctp_stm_api.cc Plumbs timeout through ctp_stm_api::fence_epoch into ctp_stm.
src/v/cloud_topics/level_zero/stm/BUILD Adds the model dependency required for timeout_clock types.
src/v/cloud_topics/frontend/frontend.cc Passes replicate timeouts into fence_epoch and maps timeout exceptions into typed timeout results/logging.
Comments suppressed due to low confidence (2)

tests/rptest/services/redpanda.py:6167

  • This comment is inaccurate: iterating futures via as_completed() and immediately calling f.result() will raise as soon as the first future fails, and will not wait for the remaining futures to finish. If you want to wait for all partitions to settle before surfacing an exception, collect exceptions and re-raise after the loop (or use concurrent.futures.wait with return_when=ALL_COMPLETED).
        while True:
            try:
                list_resp = admin.metastore().list_cloud_topics(

src/v/cloud_topics/frontend/frontend.cc:1340

  • Similar to replicate(): this maps ss::timed_out_error to raft::errc::timeout even if _partition->is_leader() is false (leadership-transfer case). To avoid misclassifying not-leader as timeout, gate the timeout mapping on still being leader (e.g., if (is_timeout && !not_leader)).
        if (is_timeout) {
            co_return raft::errc::timeout;
        }

Comment on lines +6092 to +6096
partition, paginating to bound the work per call. Partitions are
validated concurrently (bounded by `max_concurrent_partitions`) so the
wall-clock cost scales with the slowest partition rather than the sum
across partitions; at fleet scale (thousands of partitions) the
sequential variant overran ducktape timeouts.
Comment on lines 437 to +441
} else {
vlog(_log.debug, "ctp_stm::fence_epoch sync timeout (not leader)");
}
throw std::runtime_error(fmt_with_ctx(fmt::format, "Sync timeout"));
// The stm sync only return false on timeout or leadership transfer.
throw ss::timed_out_error{};
ntp(),
e);
}
if (is_timeout) {
@dotnwat dotnwat requested review from andrwng and pgellert May 22, 2026 00:16
cd_log,
ssx::is_shutdown_exception(e) ? ss::log_level::debug
: ss::log_level::warn,
(ssx::is_shutdown_exception(e) || is_timeout)
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why isn't it valuable to still log a warning about timeouts, especially if it will have user-facing effects? Same above

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants