[CORE-16144] Sync timeout shadow linking#30574
Open
Lazin wants to merge 3 commits into
Open
Conversation
Throw ss::timed_out_error instead of std::runtime_error so the missed sync deadline is identifiable by type rather than message string. Add a timeout parameter to fence_epoch (default unchanged at 10s, moved into ctp_stm as a static constexpr member) so callers can pass their own deadline. Add //src/v/model to the ctp_stm bazel target since ctp_stm.h now includes model/timeout_clock.h. Signed-off-by: Evgeny Lazin <4lazin@gmail.com>
Plumb a timeout parameter through ctp_stm_api::fence_epoch so callers can pass their own deadline. The bulk-replicate paths in the frontend now hand fence_epoch the same budget they already use for the L0 upload: replicate_at_offset forwards its `timeout` argument and replicate forwards opts.timeout.value_or(L0_replicate_default_timeout). At both call sites, catch ss::timed_out_error and convert it to a typed errc (raft::errc::timeout / kafka::error_code::request_timed_out) instead of re-throwing. This keeps the cluster-link replicator's catch-all from logging the transient sync-timeout case at ERROR. Signed-off-by: Evgeny Lazin <4lazin@gmail.com>
validate_metastore previously iterated topics x partitions serially, issuing one ValidatePartition admin RPC at a time. At fleet scale (thousands of partitions per topic) the wall-clock summed across calls overran ducktape test timeouts even when each call completed quickly. Fan the per-partition calls out through a bounded ThreadPoolExecutor (default 16 workers, exposed as max_concurrent_partitions) so the wall-clock cost is dominated by the slowest call rather than the sum. Per-partition pagination remains sequential (resume_at_offset is intrinsically ordered), and anomaly collection is guarded by a lock. Any per-partition exception is surfaced once all futures settle. Signed-off-by: Evgeny Lazin <4lazin@gmail.com>
fd541e5 to
9917296
Compare
Contributor
There was a problem hiding this comment.
Pull request overview
This PR addresses CI failures/timeouts by bounding fence_epoch sync time and by reducing end-to-end metastore validation time in large-scale ducktape runs.
Changes:
- Add an explicit timeout parameter to
ctp_stm::fence_epoch/ctp_stm_api::fence_epochand throw a typed timeout exception on sync failure. - Propagate the replicate timeout budget into
fence_epochand translatess::timed_out_errorinto typed timeout results in cloud-topics frontend replicate paths. - Validate metastore partitions concurrently (bounded by a new
max_concurrent_partitionsparameter) to reduce wall-clock time.
Reviewed changes
Copilot reviewed 7 out of 7 changed files in this pull request and generated 3 comments.
Show a summary per file
| File | Description |
|---|---|
| tests/rptest/services/redpanda.py | Adds concurrent per-partition metastore validation within each topic to reduce ducktape wall-clock time. |
| src/v/cloud_topics/level_zero/stm/ctp_stm.h | Adds a default sync timeout and extends fence_epoch API with a timeout parameter. |
| src/v/cloud_topics/level_zero/stm/ctp_stm.cc | Uses caller-provided timeout for STM sync and switches timeout failure to ss::timed_out_error. |
| src/v/cloud_topics/level_zero/stm/ctp_stm_api.h | Extends ctp_stm_api::fence_epoch to accept a timeout and documents intended usage. |
| src/v/cloud_topics/level_zero/stm/ctp_stm_api.cc | Plumbs timeout through ctp_stm_api::fence_epoch into ctp_stm. |
| src/v/cloud_topics/level_zero/stm/BUILD | Adds the model dependency required for timeout_clock types. |
| src/v/cloud_topics/frontend/frontend.cc | Passes replicate timeouts into fence_epoch and maps timeout exceptions into typed timeout results/logging. |
Comments suppressed due to low confidence (2)
tests/rptest/services/redpanda.py:6167
- This comment is inaccurate: iterating futures via as_completed() and immediately calling f.result() will raise as soon as the first future fails, and will not wait for the remaining futures to finish. If you want to wait for all partitions to settle before surfacing an exception, collect exceptions and re-raise after the loop (or use concurrent.futures.wait with return_when=ALL_COMPLETED).
while True:
try:
list_resp = admin.metastore().list_cloud_topics(
src/v/cloud_topics/frontend/frontend.cc:1340
- Similar to replicate(): this maps ss::timed_out_error to raft::errc::timeout even if
_partition->is_leader()is false (leadership-transfer case). To avoid misclassifying not-leader as timeout, gate the timeout mapping on still being leader (e.g.,if (is_timeout && !not_leader)).
if (is_timeout) {
co_return raft::errc::timeout;
}
Comment on lines
+6092
to
+6096
| partition, paginating to bound the work per call. Partitions are | ||
| validated concurrently (bounded by `max_concurrent_partitions`) so the | ||
| wall-clock cost scales with the slowest partition rather than the sum | ||
| across partitions; at fleet scale (thousands of partitions) the | ||
| sequential variant overran ducktape timeouts. |
Comment on lines
437
to
+441
| } else { | ||
| vlog(_log.debug, "ctp_stm::fence_epoch sync timeout (not leader)"); | ||
| } | ||
| throw std::runtime_error(fmt_with_ctx(fmt::format, "Sync timeout")); | ||
| // The stm sync only return false on timeout or leadership transfer. | ||
| throw ss::timed_out_error{}; |
| ntp(), | ||
| e); | ||
| } | ||
| if (is_timeout) { |
andrwng
reviewed
May 22, 2026
| cd_log, | ||
| ssx::is_shutdown_exception(e) ? ss::log_level::debug | ||
| : ss::log_level::warn, | ||
| (ssx::is_shutdown_exception(e) || is_timeout) |
Contributor
There was a problem hiding this comment.
Why isn't it valuable to still log a warning about timeouts, especially if it will have user-facing effects? Same above
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
A bunch of CI failure fixes.
fence_epochcall and handle timeout errorRedpandaService.validate_metastoreto avoid timeout at scaleBackports Required
Release Notes