schemastore,gc: switch GC keeper to generic service-ts safety path#4670
schemastore,gc: switch GC keeper to generic service-ts safety path#4670
Conversation
|
[APPROVALNOTIFIER] This PR is NOT APPROVED This pull-request has been approved by: The full list of commands accepted by this bot can be found here. DetailsNeeds approval from an approver in each of these files:Approvers can indicate their approval by writing |
📝 WalkthroughWalkthroughSchema-store GC keeper now uses a single precomputed Changes
Sequence Diagram(s)sequenceDiagram
participant Keeper as SchemaStoreGCKeeper
participant GC as gc package
participant Store as SafePointService
Keeper->>GC: EnsureServiceTsSafety(ctx, serviceID, ts)
GC->>Store: SetServiceGCSafepoint / SetGCBarrier (mode-dependent)
alt ts behind returned minimum
Store-->>GC: minSafePoint (>= ts)
GC-->>Keeper: ErrSnapshotLostByGC
else success
Store-->>GC: ok
GC-->>Keeper: success
end
Note over Keeper,GC: On close
Keeper->>GC: UnifyDeleteGcSafepoint(ctx, serviceID)
GC->>Store: UnifyDelete (remove service safepoint)
Store-->>GC: ok
GC-->>Keeper: closed
Estimated code review effort🎯 4 (Complex) | ⏱️ ~60 minutes Possibly related PRs
Suggested labels
Suggested reviewers
Poem
🚥 Pre-merge checks | ✅ 1 | ❌ 2❌ Failed checks (2 warnings)
✅ Passed checks (1 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches📝 Generate docstrings
🧪 Generate unit tests (beta)
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
There was a problem hiding this comment.
Code Review
This pull request refactors the schema store's GC keeper to simplify service ID management and introduces a more precise GC safety API, EnsureServiceTsSafety, which supports both changefeed start timestamps and exact timestamps. The GC refresh interval was also increased from 10 seconds to 1 minute. Review feedback identifies a bug in ensureServiceGCSafetyNextGen where non-GC errors are incorrectly reported as timestamp issues and notes that the wrong timestamp is reported in error messages when using the exact safety mode.
| if mode == serviceTsSafetyStartTs { | ||
| return errors.ErrStartTsBeforeGC.GenWithStackByArgs(ts) | ||
| } | ||
| if !errors.IsGCBarrierTSBehindTxnSafePointError(err) { | ||
| return errors.WrapError(errors.ErrUpdateGCBarrierFailed, err) | ||
| } | ||
|
|
||
| minBarrierTS, barrierErr := UnifyGetServiceGCSafepoint(ctx, pdCli, keyspaceID, serviceID) | ||
| if barrierErr != nil { | ||
| return barrierErr | ||
| } |
There was a problem hiding this comment.
The current implementation of ensureServiceGCSafetyNextGen incorrectly returns ErrStartTsBeforeGC for any error encountered during SetGCBarrier when in serviceTsSafetyStartTs mode. This includes non-GC related errors like network failures or context cancellations. It should first verify if the error is indeed due to the timestamp being behind the GC safe point. Furthermore, ErrStartTsBeforeGC expects two arguments (the requested timestamp and the current safe point), but only one is provided here, which will lead to a malformed error message.
| if mode == serviceTsSafetyStartTs { | |
| return errors.ErrStartTsBeforeGC.GenWithStackByArgs(ts) | |
| } | |
| if !errors.IsGCBarrierTSBehindTxnSafePointError(err) { | |
| return errors.WrapError(errors.ErrUpdateGCBarrierFailed, err) | |
| } | |
| minBarrierTS, barrierErr := UnifyGetServiceGCSafepoint(ctx, pdCli, keyspaceID, serviceID) | |
| if barrierErr != nil { | |
| return barrierErr | |
| } | |
| if !errors.IsGCBarrierTSBehindTxnSafePointError(err) { | |
| return errors.WrapError(errors.ErrUpdateGCBarrierFailed, err) | |
| } | |
| minBarrierTS, barrierErr := UnifyGetServiceGCSafepoint(ctx, pdCli, keyspaceID, serviceID) | |
| if barrierErr != nil { | |
| return barrierErr | |
| } | |
| if mode == serviceTsSafetyStartTs { | |
| return errors.ErrStartTsBeforeGC.GenWithStackByArgs(ts, minBarrierTS) | |
| } | |
| } | ||
| case serviceTsSafetyExactTs: | ||
| if lowerBound != math.MaxUint64 && ts < lowerBound { | ||
| return errors.ErrSnapshotLostByGC.GenWithStackByArgs(ts+1, lowerBound) |
There was a problem hiding this comment.
In serviceTsSafetyExactTs mode, the error should report the actual timestamp ts that was found to be lost. Using ts+1 is inconsistent with an 'exact' safety mode and likely stems from the startTs convention where startTs is expected to keep startTs-1 readable. For this new internal API, it's better to be explicit and report the requested timestamp.
| return errors.ErrSnapshotLostByGC.GenWithStackByArgs(ts+1, lowerBound) | |
| return errors.ErrSnapshotLostByGC.GenWithStackByArgs(ts, lowerBound) |
There was a problem hiding this comment.
Actionable comments posted: 2
🧹 Nitpick comments (1)
pkg/txnutil/gc/gc_service.go (1)
62-80: Duplicate safety check afterensureServiceGCSafetyClassicalready validates.
ensureServiceGCSafetyClassiccallscheckServiceTsSafetywhich performs the samets < lowerBound+1check (lines 155-157). The check at lines 76-78 is redundant and will never trigger becausecheckServiceTsSafetywould have returned the error first.♻️ Proposed fix to remove duplicate check
func ensureChangefeedStartTsSafetyClassic(ctx context.Context, pdCli GCServiceClient, gcServiceID string, ttl int64, startTs uint64) error { minServiceGCTs, err := ensureServiceGCSafetyClassic(ctx, pdCli, gcServiceID, ttl, startTs, serviceTsSafetyStartTs) if err != nil { return errors.Trace(err) } log.Info("set gc safepoint for changefeed", zap.String("gcServiceID", gcServiceID), zap.Uint64("expectedGCSafepoint", startTs), zap.Uint64("actualGCSafepoint", minServiceGCTs), zap.Int64("ttl", ttl)) - - // startTs should be greater than or equal to minServiceGCTs + 1, otherwise gcManager - // would return a ErrSnapshotLostByGC even though the changefeed would appear to be successfully - // created/resumed. See issue `#6350` for more detail. - if startTs > 0 && startTs < minServiceGCTs+1 { - return errors.ErrStartTsBeforeGC.GenWithStackByArgs(startTs, minServiceGCTs) - } return nil }🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@pkg/txnutil/gc/gc_service.go` around lines 62 - 80, The duplicate startTs safety check in ensureChangefeedStartTsSafetyClassic is redundant because ensureServiceGCSafetyClassic already invokes checkServiceTsSafety which enforces the same "startTs < lowerBound+1" validation; remove the redundant if-block that compares startTs with minServiceGCTs+1 and returns ErrStartTsBeforeGC (the comparison and error return inside ensureChangefeedStartTsSafetyClassic), leaving only the call to ensureServiceGCSafetyClassic and the logging, so validation is centralized in checkServiceTsSafety/ensureServiceGCSafetyClassic.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@pkg/txnutil/gc/gc_service.go`:
- Around line 135-137: The branch checking mode == serviceTsSafetyStartTs
returns ErrStartTsBeforeGC with only ts, producing an inconsistent/partial
message compared to the other path which passes both ts and lowerBound; update
the branch in gc_service.go (the serviceTsSafetyStartTs check) to obtain the
current safe point (lowerBound) before returning and call
ErrStartTsBeforeGC.GenWithStackByArgs(ts, lowerBound) — or, if you intentionally
only have one value, switch both call sites to a single-argument variant so the
error format is consistent; ensure you reference ErrStartTsBeforeGC and
GenWithStackByArgs in the serviceTsSafetyStartTs branch when making the change.
- Around line 121-147: The current fragile string-based error classification
used by errors.IsGCBarrierTSBehindTxnSafePointError in
ensureServiceGCSafetyNextGen (after calling SetGCBarrier) should be replaced
with a structured check: inspect the PD gc client package
(github.com/tikv/pd/client/clients/gc) for a sentinel error value, exported
error type, or error code you can match with errors.Is or errors.As; update
errors.IsGCBarrierTSBehindTxnSafePointError to use that structured type/code (or
add a new helper that uses errors.As/Is against the PD error type), and then
change the branch in ensureServiceGCSafetyNextGen to rely on that robust check;
if the PD package lacks structured errors, add a small wrapper in your package
that normalizes PD errors into a sentinel error you can errors.Is against and
use that in the SetGCBarrier handling and in related helpers like
UnifyGetServiceGCSafepoint.
---
Nitpick comments:
In `@pkg/txnutil/gc/gc_service.go`:
- Around line 62-80: The duplicate startTs safety check in
ensureChangefeedStartTsSafetyClassic is redundant because
ensureServiceGCSafetyClassic already invokes checkServiceTsSafety which enforces
the same "startTs < lowerBound+1" validation; remove the redundant if-block that
compares startTs with minServiceGCTs+1 and returns ErrStartTsBeforeGC (the
comparison and error return inside ensureChangefeedStartTsSafetyClassic),
leaving only the call to ensureServiceGCSafetyClassic and the logging, so
validation is centralized in checkServiceTsSafety/ensureServiceGCSafetyClassic.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Organization UI
Review profile: CHILL
Plan: Pro
Run ID: 7ac9fedf-3164-463f-9d80-fbe07c79361c
📒 Files selected for processing (3)
logservice/schemastore/gc_keeper.gologservice/schemastore/gc_keeper_test.gopkg/txnutil/gc/gc_service.go
| func ensureServiceGCSafetyNextGen( | ||
| ctx context.Context, | ||
| pdCli GCServiceClient, | ||
| serviceID string, | ||
| keyspaceID uint32, | ||
| ttl int64, | ||
| ts uint64, | ||
| mode serviceTsSafetyMode, | ||
| ) error { | ||
| gcCli := pdCli.GetGCStatesClient(keyspaceID) | ||
| _, err := SetGCBarrier(ctx, gcCli, serviceID, ts, time.Duration(ttl)*time.Second) | ||
| if err == nil { | ||
| return nil | ||
| } | ||
| if mode == serviceTsSafetyStartTs { | ||
| return errors.ErrStartTsBeforeGC.GenWithStackByArgs(ts) | ||
| } | ||
| if !errors.IsGCBarrierTSBehindTxnSafePointError(err) { | ||
| return errors.WrapError(errors.ErrUpdateGCBarrierFailed, err) | ||
| } | ||
|
|
||
| minBarrierTS, barrierErr := UnifyGetServiceGCSafepoint(ctx, pdCli, keyspaceID, serviceID) | ||
| if barrierErr != nil { | ||
| return barrierErr | ||
| } | ||
| return checkServiceTsSafety(mode, ts, minBarrierTS) | ||
| } |
There was a problem hiding this comment.
🧩 Analysis chain
🏁 Script executed:
#!/bin/bash
# Search for how ErrGCBarrierTSBehindTxnSafePoint is defined/used in the pd client
rg -n "ErrGCBarrierTSBehindTxnSafePoint" --type goRepository: pingcap/ticdc
Length of output: 361
🏁 Script executed:
# First, let's examine the pb_error.go file to see the exact implementation
cat -n pkg/errors/pb_error.go | head -50Repository: pingcap/ticdc
Length of output: 1314
🏁 Script executed:
# Check if there are typed errors or error codes for this scenario
rg -n "GCBarrier" --type go -A 3 -B 3 | head -80Repository: pingcap/ticdc
Length of output: 5248
🏁 Script executed:
# Search for how this error is created in PD client or related code
rg -n "IsGCBarrierTSBehindTxnSafePointError" --type go -B 5 -A 2Repository: pingcap/ticdc
Length of output: 1765
🏁 Script executed:
# Check if there are error codes or typed errors in the errors package
fd "errors" -t f --type go | xargs rg -l "type.*Error" | head -10Repository: pingcap/ticdc
Length of output: 525
🏁 Script executed:
# Check imports and how PD errors are typically handled
cat -n pkg/txnutil/gc/gc_service.go | grep -E "import|pdpb" -A 20 | head -30Repository: pingcap/ticdc
Length of output: 881
🏁 Script executed:
# Search for SetGCBarrier implementation to understand what error types are returned
rg -n "func SetGCBarrier" --type go -A 10Repository: pingcap/ticdc
Length of output: 957
🏁 Script executed:
# Look for how other similar error checks are structured in the codebase
rg -n "IsKeyspaceNotExistError\|IsGCBarrier" --type go -B 2 -A 2Repository: pingcap/ticdc
Length of output: 39
🏁 Script executed:
# Check if there's a pattern of using error codes or typed errors in PD interactions
rg -n "pdpb\." pkg/txnutil/gc/gc_service.go | head -20Repository: pingcap/ticdc
Length of output: 39
🏁 Script executed:
# Check if pdgc (PD GC client) errors are typically handled with string matching or typed errors
rg -n "pdgc\|gc\.GCStatesClient" --type go -B 3 -A 5 | head -60Repository: pingcap/ticdc
Length of output: 39
🏁 Script executed:
# Look at the full SetGCBarrier function to see if errors from PD are wrapped
cat -n pkg/txnutil/gc/gc_service.go | sed -n '246,270p'Repository: pingcap/ticdc
Length of output: 1354
🏁 Script executed:
# Search for any documentation or patterns on how to handle PD errors
rg -n "github.com/tikv/pd" --type go | head -20Repository: pingcap/ticdc
Length of output: 1385
🏁 Script executed:
# Check if there are any error codes or error types available from pd client
rg -n "Error\|Code" pkg/txnutil/gc/gc_service.go | head -15Repository: pingcap/ticdc
Length of output: 39
🏁 Script executed:
# Check if PD errors have typed error support or codes
rg -n "pdgc\." pkg/txnutil/gc/gc_service.goRepository: pingcap/ticdc
Length of output: 436
🏁 Script executed:
# Look at the error types available from PD gc client
find . -name "*.go" -path "*/vendor/*" -prune -o -type f -name "*.go" -exec grep -l "github.com/tikv/pd/client/clients/gc" {} \; | head -10Repository: pingcap/ticdc
Length of output: 423
🏁 Script executed:
# Search for how other parts of the codebase handle PD gc errors
rg -n "SetGCBarrier\|DeleteGCBarrier\|GetGCState" --type go -B 3 -A 8 | grep -A 8 "err" | head -50Repository: pingcap/ticdc
Length of output: 39
🏁 Script executed:
# Check if the error from PD is wrapped anywhere else
rg -n "SetGCBarrier" pkg/txnutil/gc/ --type go -B 2 -A 15 | head -80Repository: pingcap/ticdc
Length of output: 5604
String-based error classification is fragile; consider exploring if PD provides error codes or structured error types.
At line 138, the code uses errors.IsGCBarrierTSBehindTxnSafePointError(err) which relies on string matching via strings.Contains(err.Error(), "ErrGCBarrierTSBehindTxnSafePoint") to classify errors returned from the PD client. This is fragile if the error message changes in future PD versions.
Verify with the PD library maintainers whether github.com/tikv/pd/client/clients/gc provides error codes or structured error types (e.g., error codes, sentinel error values, or error interfaces) that could replace string matching for more robust error classification.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@pkg/txnutil/gc/gc_service.go` around lines 121 - 147, The current fragile
string-based error classification used by
errors.IsGCBarrierTSBehindTxnSafePointError in ensureServiceGCSafetyNextGen
(after calling SetGCBarrier) should be replaced with a structured check: inspect
the PD gc client package (github.com/tikv/pd/client/clients/gc) for a sentinel
error value, exported error type, or error code you can match with errors.Is or
errors.As; update errors.IsGCBarrierTSBehindTxnSafePointError to use that
structured type/code (or add a new helper that uses errors.As/Is against the PD
error type), and then change the branch in ensureServiceGCSafetyNextGen to rely
on that robust check; if the PD package lacks structured errors, add a small
wrapper in your package that normalizes PD errors into a sentinel error you can
errors.Is against and use that in the SetGCBarrier handling and in related
helpers like UnifyGetServiceGCSafepoint.
|
[FORMAT CHECKER NOTIFICATION] Notice: To remove the 📖 For more info, you can check the "Contribute Code" section in the development guide. |
There was a problem hiding this comment.
🧹 Nitpick comments (1)
pkg/txnutil/gc/gc_service.go (1)
111-113: Please lock in the equality boundary with tests.Both branches now intentionally allow
ts == lowerBoundand only fail onts < lowerBound, while the returned error reportsts+1. That off-by-one is subtle enough that an explicit equality test would help prevent regressions.Also applies to: 138-140
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@pkg/txnutil/gc/gc_service.go` around lines 111 - 113, The conditional that checks "if minServiceGCTs != math.MaxUint64 && ts < minServiceGCTs" allows ts == minServiceGCTs but the error uses ts+1, so add an explicit equality assertion in both places (the existing block and the similar block around lines 138-140) and update tests to cover the boundary: add unit tests that call the code path with ts == minServiceGCTs (should succeed) and ts == minServiceGCTs-1 (should produce errors.ErrSnapshotLostByGC with the exact expected argument), ensuring the returned GenWithStackByArgs value matches the intended lower-bound semantics.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Nitpick comments:
In `@pkg/txnutil/gc/gc_service.go`:
- Around line 111-113: The conditional that checks "if minServiceGCTs !=
math.MaxUint64 && ts < minServiceGCTs" allows ts == minServiceGCTs but the error
uses ts+1, so add an explicit equality assertion in both places (the existing
block and the similar block around lines 138-140) and update tests to cover the
boundary: add unit tests that call the code path with ts == minServiceGCTs
(should succeed) and ts == minServiceGCTs-1 (should produce
errors.ErrSnapshotLostByGC with the exact expected argument), ensuring the
returned GenWithStackByArgs value matches the intended lower-bound semantics.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Organization UI
Review profile: CHILL
Plan: Pro
Run ID: 432706a8-1b2a-4e8a-a323-eb1063b79902
📒 Files selected for processing (1)
pkg/txnutil/gc/gc_service.go
What problem does this PR solve?
Issue Number: close #xxx
What is changed and how it works?
Check List
Tests
Questions
Will it cause performance regression or break compatibility?
Do you need to update user documentation, design documentation or monitoring documentation?
Release note
Summary by CodeRabbit
Improvements
Bug Fixes