MDEV-38147 Mariadb error 1950 after SST #5316
Open
ayurchen wants to merge 2 commits into
After a mariabackup SST the joiner could fail with
ER_GTID_STRICT_OUT_OF_ORDER (error 1950)
while re-binlogging transactions received over IST.
The cause is that the binary log copied from the donor carries a
Gtid_list whose position can be ahead of the storage-engine snapshot:
BACKUP STAGE BLOCK_COMMIT blocks the engine commit (2PC step 3) but not
the binary log write (step 2), so transactions can be present in the
copied binlog that are not committed in the copied engine snapshot.
After the SST the joiner reports the (committed) engine position to the
cluster, IST resends those transactions, and re-binlogging them under
gtid_strict_mode=ON collides with the Gtid_list, which is already ahead
of them -> error 1950.
(MDEV-34483 made the engine snapshot stop short of the binlog, which is
what exposed this.)
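The collision can be illustrated with a toy model of the strict-mode check. This is only a sketch, not MariaDB code: ToyBinlogState, GtidStrictOutOfOrder and replay_ist are hypothetical names, the positions (engine checkpoint 100, Gtid_list 105) are made up, and the model assumes the strict check rejects any seq_no that is not above the last one binlogged in the same domain.

```python
class GtidStrictOutOfOrder(Exception):
    """Toy stand-in for ER_GTID_STRICT_OUT_OF_ORDER (error 1950)."""


class ToyBinlogState:
    """Hypothetical model: highest seq_no binlogged so far, per domain."""

    def __init__(self, gtid_list=None, strict=True):
        self.last_seq_no = dict(gtid_list or {})
        self.strict = strict

    def binlog(self, domain_id, seq_no):
        last = self.last_seq_no.get(domain_id, 0)
        if self.strict and seq_no <= last:
            raise GtidStrictOutOfOrder(
                f"seq_no {seq_no} in domain {domain_id} is not above {last}")
        self.last_seq_no[domain_id] = max(last, seq_no)


def replay_ist(state, domain_id, engine_checkpoint, cluster_head):
    """Re-binlog every transaction after the engine checkpoint."""
    for seq_no in range(engine_checkpoint + 1, cluster_head + 1):
        state.binlog(domain_id, seq_no)


# Made-up positions: the engine snapshot is committed up to 100,
# but the copied binlog's Gtid_list already says 105.
copied = ToyBinlogState(gtid_list={0: 105})
try:
    replay_ist(copied, domain_id=0, engine_checkpoint=100, cluster_head=110)
    collided = False
except GtidStrictOutOfOrder:
    collided = True  # the first resent seq_no (101) is not above 105
```

In this model the very first resent transaction is rejected, which mirrors why the joiner fails as soon as IST starts re-binlogging.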
The copied binary log carries no transactions the joiner needs - only a
Gtid_list - so instead of shipping and then having to truncate/reconcile
it, the joiner now starts a fresh binary log and seeds its GTID position
from the storage-engine checkpoint during recovery. That checkpoint is
the committed cluster position, i.e. exactly where IST resumes, so the
joiner's binary log stays in lockstep with the rest of the cluster and
no out-of-order GTID can occur.
This works for both wsrep_gtid_mode settings; only the binlog domain of
the cluster stream differs:
- wsrep_gtid_mode=ON : wsrep_gtid_domain_id (cluster writes are
re-tagged to it), which is the domain stored in the checkpoint;
- wsrep_gtid_mode=OFF: gtid_domain_id (cluster writes keep the node's
configured domain).
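The domain choice and the seeding step can be sketched as follows. This is a simplified illustration, not the code in sql/log.cc: the function names, the dict representation of the GTID state and the example values (domain 9, domain 0, seq_no 100) are all assumptions made for the example.

```python
def cluster_binlog_domain(wsrep_gtid_mode, wsrep_gtid_domain_id, gtid_domain_id):
    """Pick the binlog domain that the cluster stream is written to."""
    # ON: cluster writes are re-tagged to wsrep_gtid_domain_id.
    # OFF: cluster writes keep the node's configured gtid_domain_id.
    return wsrep_gtid_domain_id if wsrep_gtid_mode else gtid_domain_id


def seed_binlog_gtid_state(checkpoint_seq_no, wsrep_gtid_mode,
                           wsrep_gtid_domain_id, gtid_domain_id):
    """Return a fresh {domain_id: seq_no} state seeded at the checkpoint."""
    domain = cluster_binlog_domain(
        wsrep_gtid_mode, wsrep_gtid_domain_id, gtid_domain_id)
    return {domain: checkpoint_seq_no}


# Hypothetical settings: wsrep_gtid_domain_id=9, gtid_domain_id=0,
# storage-engine checkpoint at seq_no 100.
state_on = seed_binlog_gtid_state(100, True, 9, 0)    # {9: 100}
state_off = seed_binlog_gtid_state(100, False, 9, 0)  # {0: 100}
```

Either way the seeded position equals the checkpoint, so the first transaction resent by IST (checkpoint + 1) is strictly above it.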
Async-replica positions (mysql.gtid_slave_pos) are part of the engine
snapshot and survive the SST unchanged, so a Galera node can still serve
as an async master or replica across the SST.
This commit:
- sql/log.cc: adds wsrep_seed_binlog_gtid_state(), called from
do_binlog_recovery() when the joiner has no binary log, seeding the
binlog GTID state for the cluster domain to the SE checkpoint position.
- scripts/wsrep_sst_mariabackup.sh: no longer moves the donor's binary
log into place on the joiner.
- extra/mariabackup: stop flushing and copying the donor's current
binary log under --galera-info (removed write_current_binlog_file()).
Its only purpose was to ship that binary log to the joiner, which now
discards it; flushing needlessly rotated the donor's binary log on
every SST. xtrabackup_galera_info and xtrabackup_binlog_info are still
written.
- sql/wsrep_sst.cc: logs the position actually adopted from storage
(the authoritative post-SST position) rather than the script-reported
one.
- sql/handler.cc: downgrades the "Discovered discontinuity in recovered
wsrep transaction XIDs" message in wsrep_order_and_check_continuity()
from warning to debug level. With parallel appliers a snapshot
routinely captures prepared XIDs that are not contiguous with the
engine checkpoint, so this is normal during SST recovery and of no
value in regular operation; the transactions past the checkpoint are
re-delivered by the cluster (IST/SST) regardless.
- Adds an MDEV-38147 MTR test reproducing the issue.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
With log_bin=ON a transaction is committed via two-phase commit (the
binary log is the second participant), so it passes through the InnoDB
XA-prepare state. While a donor is held in BLOCK_COMMIT for a
mariabackup backup, its parallel appliers (wsrep_slave_threads > 1)
leave one or more such writesets prepared-but-not-yet-committed, and
the snapshot captures them.
On a freshly SST'd joiner nothing resolves these prepared transactions:
binlog crash recovery does not run (the joiner has no in-use binlog to
recover from), and the wsrep continuity-based commit is inactive
because wsrep_emulate_bin_log is FALSE when log_bin is ON. The leftover
prepared transactions then abort startup with "Found <N> prepared
transactions!". Note this does not depend on the prepared set being
non-contiguous - even a contiguous run aborts, because nothing commits
or rolls it back.
Roll back these transactions in xarecover_handlerton().
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
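The startup behaviour described above can be modelled with a small sketch. It is a deliberately simplified toy, not the logic of xarecover_handlerton(): recover_prepared, StartupAborted and the rollback_orphans flag are hypothetical names, and the model assumes binlog crash recovery, when it runs, commits the prepared XIDs found in the binlog and rolls back the rest.

```python
class StartupAborted(Exception):
    """Toy stand-in for the 'Found <N> prepared transactions!' abort."""


def recover_prepared(prepared, xids_in_binlog=None, rollback_orphans=False):
    """Resolve prepared XIDs; return (committed, rolled_back) lists.

    xids_in_binlog is None when binlog crash recovery does not run,
    which is the situation on a freshly SST'd joiner.
    """
    committed, rolled_back, leftover = [], [], []
    for xid in prepared:
        if xids_in_binlog is not None:
            # Binlog crash recovery decides the outcome of each XID.
            (committed if xid in xids_in_binlog else rolled_back).append(xid)
        elif rollback_orphans:
            # Behaviour proposed by this commit: nobody owns the XID.
            rolled_back.append(xid)
        else:
            leftover.append(xid)
    if leftover:
        raise StartupAborted(f"Found {len(leftover)} prepared transactions!")
    return committed, rolled_back


# A contiguous run of prepared XIDs still aborts without the fix.
try:
    recover_prepared([101, 102, 103])
    aborted = False
except StartupAborted:
    aborted = True

fixed = recover_prepared([101, 102, 103], rollback_orphans=True)
```

Rolling the orphans back is safe in this model for the reason the first commit gives: transactions past the checkpoint are re-delivered by the cluster.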
Pull request created in: https://jira.mariadb.org/browse/MDEV-38147
Investigation into MDEV-38147 revealed that, when the --galera-info option is given during SST, mariabackup rotates the binlog and ships it to the joiner. The file is likely to contain wrong Gtid_list info, and the joiner uses it to initialize its binlog.
Since that file is useless, don't rotate the binlog or ship the file; instead, the joiner can generate its own correct Gtid_list.
MDEV-40179 - roll back orphaned prepared transactions.