Skip to content

tiflash-proxy CI: flaky proxy_tests in fast_add_peer and snapshot merge #10879

@JaySon-Huang

Description

@JaySon-Huang

Bug Report

Please answer these questions before submitting your issue. Thanks!

1. Minimal reproduce step (Required)

Found when running CI integration tests for pingcap/tidb-engine-ext#457 under contrib/tiflash-proxy:

cd contrib/tiflash-proxy
make ci_test
# Or run proxy_tests only:
cargo test -p proxy_tests --test proxy

Under parallel CI or high load, the following four tests fail intermittently (process exits with SIGABRT). Two are temporarily marked #[ignore] in tests; two can still block CI.

Test File Current status
shared::fast_add_peer::simple::simple_normal::test_simple_from_invalid_source proxy_tests/proxy/shared/fast_add_peer/simple.rs #[ignore]
shared::fast_add_peer::fp::test_overlap_apply_tikv_snap_in_the_middle proxy_tests/proxy/shared/fast_add_peer/fp.rs #[ignore]
shared::fast_add_peer::fp::test_prehandle_snapshot_after_restart_reset proxy_tests/proxy/shared/fast_add_peer/fp.rs Not ignored; may still fail
shared::snapshot::test_split_merge proxy_tests/proxy/shared/snapshot.rs Not ignored; may still fail

1) test_simple_from_invalid_source

  • Scenario: FAP selects an invalid source (fap_mock_add_peer_from_id=100), then falls back to the slow-path snapshot flow.
  • Failure: components/raftstore/src/store/peer_storage.rs:1141
  • Assertion: assert_eq!(apply_state, last_applied_state) — when generating a snapshot, apply_state in KV does not match the in-memory peer state (e.g. KV index=10, memory index=11).
  • Related failpoint: on_pre_write_apply_state=return(true) (tests persist apply state early for leader snapshot logic).

2) test_overlap_apply_tikv_snap_in_the_middle

  • Scenario: On a mono-store (raftstore v1 single RocksDB), FAP snapshot and TiKV snapshot apply overlap; store 3 previously had region 1 (k1=v1), and after split region 1000 writes k1=v13.
  • Failure: must_get_finally / check_key_ex — expected k1=v13, got k1=v1.
  • Test comment: Raftstore v1 is mono store, so there could be v1 written by old_one_1_k3 (fp.rs, ~line 554).

3) test_prehandle_snapshot_after_restart_reset

  • Scenario: 3-store cluster with FAP enabled. Store 2 joins via FAP (invalid source id 4), then k3=v3 is written. Store 3 joins via FAP from store 2 while on_ob_pre_handle_snapshot_s3=pause delays TiKV snapshot prehandle. Test waits for FAP phase1 Persisted, resets cached_region_info.inited_or_fallback (kind=2, simulating restart), removes pause, sets fap_core_no_prehandle=panic, then checks store 3 disk has k3=v3.
  • Failure: fp.rs:289 / must_getcan't get value Some("v3") for key 6B33, actual=None.
  • Observed race (from build-log.4.txt):
    • FAP copy_data_from store 2 → store 3 only copies k1=v1, missing k3=v3, because store 2's mock kvstore does not yet contain k3 when FAP runs.
    • Store 2 may catch up via TiKV snapshot idx=9 with key_count: 1 (k1 only); leader then sends empty MsgAppend index:9 without re-sending the k3 put entry.
    • TiKV snapshot idx=10 prehandle for store 3 stays paused; after FAP flush (k1 only), resume + fap_core_no_prehandle skips TiKV prehandle when kvstore_region_exist==true, so k3 never reaches disk (no post apply snapshot for peer 3003 in failing logs).
  • Contrast with passing run: copy_data_from includes both k1 and k3; leader explicitly sends k3 entry in MsgAppend to store 2 before store 3 FAP copy.

4) test_split_merge

  • Scenario: After split and merge, the test checks immediately after must_merge that each node's mock kvstore no longer contains the source region (e.g. region 1000).
  • Failure: proxy_tests/proxy/shared/snapshot.rs:346node {id} should has no region {source_id}.
  • Timing: must_merge only waits for PD-side merge completion; store 1's CommitMerge / ffi_handle_destroy can finish later than PD returns (logs show store 2/3 ~11.120, store 1 ~11.147–11.150). In the same repo, normal.rs sleeps 1000ms after merge and test_handle_destroy sleeps 100ms, but test_split_merge has no wait.

2. What did you expect to see? (Required)

  • make ci_test / proxy_tests should pass reliably; these cases should not be flaky due to timing or mono-store stale data.
  • Mock engine-store and raftstore state should stay consistent across merge, FAP fallback, overlapping snapshot apply, and FAP + paused TiKV snapshot prehandle resume.

3. What did you see instead (Required)

CI logs intermittently show failures during the proxy_tests stage (build succeeds):

  1. test_simple_from_invalid_source: peer_storage.rs:1141 apply_state mismatch; snap-generator thread panics.
  2. test_overlap_apply_tikv_snap_in_the_middle: can't get value Some("v13") for key 6B31 (must_get_finally).
  3. test_prehandle_snapshot_after_restart_reset: can't get value Some("v3") for key 6B33 (must_get at fp.rs:289).
  4. test_split_merge: node 1 should has no region 1000 (source region still present in store 1's kvstore).

Suggested fix directions (for tracking)

test_split_merge (test / mock, higher priority)

  • After must_merge, poll until all stores satisfy !kvstore.contains_key(source_id) (similar to must_wait_until_cond_generic), or add a wait like normal.rs.
  • Optional: in mock, do not recreate merged/destroyed regions via the handle_admin Vacant path.

test_prehandle_snapshot_after_restart_reset (test / FAP source readiness)

  • Before adding store 3, wait until store 2 has k3=v3 on disk (e.g. check_key(..., b"k3", b"v3", None, Some(true), Some(vec![2]))), so FAP copy is not racing store 2 apply.
  • After removing on_ob_pre_handle_snapshot_s3 pause, poll until TiKV snapshot apply finishes on store 3 before asserting disk keys.
  • Short term: mark #[ignore] if needed to unblock CI.

test_prehandle_snapshot_after_restart_reset (product / mock)

  • FAP copy_data_from should read from persisted engine state (or wait until source peer data at applied_index is fully visible), not a racing in-memory mock kvstore snapshot.
  • When TiKV snapshot prehandle resumes after FAP has partially written data, merge/apply missing keys from the TiKV SST instead of skipping prehandle due to kvstore_region_exist + fap_core_no_prehandle.

test_simple_from_invalid_source (product / FAP)

  • In fap_fallback_to_slow / fap_fallback_to_slow_with_lock (engine_store_ffi/src/core/fast_add_peer.rs), sync KV apply_state on fallback so slow-path snapshot generation matches the in-memory applied index.

test_overlap_apply_tikv_snap_in_the_middle (mock snapshot / mono-store)

  • When applying a snapshot, clear overlapping stale data from other regions on the same store by key range (write_snapshot_to_db_data / ffi_apply_prehandled_snapshot), so region 1's k1=v1 does not shadow region 1000's k1=v13.

4. What is your TiFlash version? (Required)

  • Observed in contrib/tiflash-proxy CI (make ci_test), on branches that integrate tidb-engine-ext proxy_tests (e.g. release-8.5 cherry-picks).
  • TiFlash main-repo version: matches the integrated branch at build time (CI logs show commit-based git_hash, not a fixed release tag).

Reference paths (in tiflash repo)

  • contrib/tiflash-proxy/proxy_tests/proxy/shared/snapshot.rstest_split_merge
  • contrib/tiflash-proxy/proxy_tests/proxy/shared/fast_add_peer/simple.rstest_simple_from_invalid_source
  • contrib/tiflash-proxy/proxy_tests/proxy/shared/fast_add_peer/fp.rstest_overlap_apply_tikv_snap_in_the_middle, test_prehandle_snapshot_after_restart_reset, prehandle_snapshot_after_restart
  • contrib/tiflash-proxy/proxy_components/mock-engine-store/src/mock_store/mock_core.rscopy_data_from
  • contrib/tiflash-proxy/proxy_components/mock-engine-store/src/mock_store/mock_write_impls.rs — CommitMerge kvstore.remove
  • contrib/tiflash-proxy/proxy_components/engine_store_ffi/src/core/forward_raft/fap_snapshot.rs — FAP snapshot skip / apply
  • contrib/tiflash-proxy/proxy_components/engine_store_ffi/src/core/forward_raft/snapshot.rson_ob_pre_handle_snapshot_s3, fap_core_no_prehandle
  • contrib/tiflash-proxy/proxy_components/engine_store_ffi/src/core/fast_add_peer.rs — FAP fallback

Metadata

Metadata

Assignees

No one assigned

    Labels

    type/bugThe issue is confirmed as a bug.

    Type

    No type
    No fields configured for issues without a type.

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions