[CORE-16144] Sync timeout shadow linking by Lazin · Pull Request #30574 · redpanda-data/redpanda

Lazin · 2026-05-21T15:03:53Z

A bunch of CI failure fixes.

add a timeout to the fence_epoch call and handle timeout error
add parallel fanout for the RedpandaService.validate_metastore to avoid timeout at scale

Backports Required

Release Notes

none

Throw ss::timed_out_error instead of std::runtime_error so the missed sync deadline is identifiable by type rather than message string. Add a timeout parameter to fence_epoch (default unchanged at 10s, moved into ctp_stm as a static constexpr member) so callers can pass their own deadline. Add //src/v/model to the ctp_stm bazel target since ctp_stm.h now includes model/timeout_clock.h. Signed-off-by: Evgeny Lazin <4lazin@gmail.com>

Plumb a timeout parameter through ctp_stm_api::fence_epoch so callers can pass their own deadline. The bulk-replicate paths in the frontend now hand fence_epoch the same budget they already use for the L0 upload: replicate_at_offset forwards its `timeout` argument and replicate forwards opts.timeout.value_or(L0_replicate_default_timeout). At both call sites, catch ss::timed_out_error and convert it to a typed errc (raft::errc::timeout / kafka::error_code::request_timed_out) instead of re-throwing. This keeps the cluster-link replicator's catch-all from logging the transient sync-timeout case at ERROR. Signed-off-by: Evgeny Lazin <4lazin@gmail.com>

validate_metastore previously iterated topics x partitions serially, issuing one ValidatePartition admin RPC at a time. At fleet scale (thousands of partitions per topic) the wall-clock summed across calls overran ducktape test timeouts even when each call completed quickly. Fan the per-partition calls out through a bounded ThreadPoolExecutor (default 16 workers, exposed as max_concurrent_partitions) so the wall-clock cost is dominated by the slowest call rather than the sum. Per-partition pagination remains sequential (resume_at_offset is intrinsically ordered), and anomaly collection is guarded by a lock. Any per-partition exception is surfaced once all futures settle. Signed-off-by: Evgeny Lazin <4lazin@gmail.com>

Copilot

Pull request overview

This PR addresses CI failures/timeouts by bounding fence_epoch sync time and by reducing end-to-end metastore validation time in large-scale ducktape runs.

Changes:

Add an explicit timeout parameter to ctp_stm::fence_epoch / ctp_stm_api::fence_epoch and throw a typed timeout exception on sync failure.
Propagate the replicate timeout budget into fence_epoch and translate ss::timed_out_error into typed timeout results in cloud-topics frontend replicate paths.
Validate metastore partitions concurrently (bounded by a new max_concurrent_partitions parameter) to reduce wall-clock time.

Reviewed changes

Copilot reviewed 7 out of 7 changed files in this pull request and generated 3 comments.

Show a summary per file

File	Description
tests/rptest/services/redpanda.py	Adds concurrent per-partition metastore validation within each topic to reduce ducktape wall-clock time.
src/v/cloud_topics/level_zero/stm/ctp_stm.h	Adds a default sync timeout and extends `fence_epoch` API with a timeout parameter.
src/v/cloud_topics/level_zero/stm/ctp_stm.cc	Uses caller-provided timeout for STM sync and switches timeout failure to `ss::timed_out_error`.
src/v/cloud_topics/level_zero/stm/ctp_stm_api.h	Extends `ctp_stm_api::fence_epoch` to accept a timeout and documents intended usage.
src/v/cloud_topics/level_zero/stm/ctp_stm_api.cc	Plumbs timeout through `ctp_stm_api::fence_epoch` into `ctp_stm`.
src/v/cloud_topics/level_zero/stm/BUILD	Adds the `model` dependency required for `timeout_clock` types.
src/v/cloud_topics/frontend/frontend.cc	Passes replicate timeouts into `fence_epoch` and maps timeout exceptions into typed timeout results/logging.

Comments suppressed due to low confidence (2)

tests/rptest/services/redpanda.py:6167

This comment is inaccurate: iterating futures via as_completed() and immediately calling f.result() will raise as soon as the first future fails, and will not wait for the remaining futures to finish. If you want to wait for all partitions to settle before surfacing an exception, collect exceptions and re-raise after the loop (or use concurrent.futures.wait with return_when=ALL_COMPLETED).

        while True:
            try:
                list_resp = admin.metastore().list_cloud_topics(

src/v/cloud_topics/frontend/frontend.cc:1340

Similar to replicate(): this maps ss::timed_out_error to raft::errc::timeout even if _partition->is_leader() is false (leadership-transfer case). To avoid misclassifying not-leader as timeout, gate the timeout mapping on still being leader (e.g., if (is_timeout && !not_leader)).

        if (is_timeout) {
            co_return raft::errc::timeout;
        }

+        partition, paginating to bound the work per call. Partitions are
+        validated concurrently (bounded by `max_concurrent_partitions`) so the
+        wall-clock cost scales with the slowest partition rather than the sum
+        across partitions; at fleet scale (thousands of partitions) the
+        sequential variant overran ducktape timeouts.


        } else {
            vlog(_log.debug, "ctp_stm::fence_epoch sync timeout (not leader)");
        }
-        throw std::runtime_error(fmt_with_ctx(fmt::format, "Sync timeout"));
+        // The stm sync only return false on timeout or leadership transfer.
+        throw ss::timed_out_error{};


              ntp(),
              e);
        }
+        if (is_timeout) {


andrwng · 2026-05-22T02:55:53Z

              cd_log,
-              ssx::is_shutdown_exception(e) ? ss::log_level::debug
-                                            : ss::log_level::warn,
+              (ssx::is_shutdown_exception(e) || is_timeout)


Why isn't it valuable to still log a warning about timeouts, especially if it will have user-facing effects? Same above

Lazin added 2 commits May 21, 2026 10:19

Copilot AI review requested due to automatic review settings May 21, 2026 15:03

github-actions Bot added area/build area/redpanda labels May 21, 2026

Copilot started reviewing on behalf of Lazin May 21, 2026 15:04 View session

Lazin requested a review from dotnwat May 21, 2026 15:05

Lazin force-pushed the fix/sync-timeout-shadow-linking branch from fd541e5 to 9917296 Compare May 21, 2026 15:07

Copilot AI reviewed May 21, 2026

View reviewed changes

dotnwat requested review from andrwng and pgellert May 22, 2026 00:16

andrwng reviewed May 22, 2026

View reviewed changes

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

[CORE-16144] Sync timeout shadow linking#30574

[CORE-16144] Sync timeout shadow linking#30574
Lazin wants to merge 3 commits into
redpanda-data:devfrom
Lazin:fix/sync-timeout-shadow-linking

Lazin commented May 21, 2026 •

edited

Loading

Uh oh!

Copilot AI left a comment

Uh oh!

andrwng May 22, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

Conversation

Lazin commented May 21, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Backports Required

Release Notes

Uh oh!

Copilot AI left a comment

Choose a reason for hiding this comment

Pull request overview

Reviewed changes

Uh oh!

andrwng May 22, 2026

Choose a reason for hiding this comment

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

Lazin commented May 21, 2026 •

edited

Loading