Skip to content

MDEV-40128 Use per-cell latch in lock_move_reorganize_page()#5297

Merged
iMineLink merged 1 commit into
11.8from
MDEV-40128
Jun 30, 2026
Merged

MDEV-40128 Use per-cell latch in lock_move_reorganize_page()#5297
iMineLink merged 1 commit into
11.8from
MDEV-40128

Conversation

@iMineLink

Copy link
Copy Markdown
Contributor

lock_move_reorganize_page() was acquiring lock_sys.latch in exclusive mode (via LockMutexGuard) for the entire body of phase 2 (lock chain iteration, bitmap reset, and lock_rec_add_to_queue() calls). The function however only touches record locks belonging to a single page, which all live in a single lock_sys.rec_hash cell. Holding that cell latch in exclusive mode via LockGuard is sufficient:

  • The cell latch protects the cell's lock chain and the bitmaps of the lock_t objects in it (lock_rec_bitmap_reset and the new bit set by lock_rec_add_to_queue()).
  • It also protects lock->type_mode, including the LOCK_WAIT bit. The canonical clear in lock_reset_lock_and_trx_wait() runs under the cell latch, and lock_grant() invokes it before taking trx->mutex, so the bit is cell-latch state rather than trx->mutex state. Phase 1 only clears the bit and leaves trx->lock.wait_lock intact; the copy in old_locks keeps LOCK_WAIT and phase 2 re-adds the lock with it, so the wait relationship (guarded by lock_sys.wait_mutex) is preserved across the move. Neither trx->mutex nor wait_mutex is required here.
  • Each owning trx's mutex is acquired per-iteration to protect that trx's trx_locks list and lock_heap during lock_rec_add_to_queue().

The global exclusive latch was over-strong: it blocked every concurrent lock_sys.rd_lock() acquirer in lock_rec_lock() and lock_table() server-wide for the duration of the reorganize, contributing disproportionately to the lock_sys.latch convoy under heavy concurrency.

The TMLockGuard fast-path empty check at the top of the function is preserved; for cells with no locks the cost is still just a TSX-elided read.

@iMineLink iMineLink requested a review from Copilot June 29, 2026 16:05
@CLAassistant

Copy link
Copy Markdown

CLA assistant check
Thank you for your submission! We really appreciate it. Like many open source projects, we ask that you sign our Contributor License Agreement before we can accept your contribution.
You have signed the CLA already but the status is still pending? Let us recheck it.

@gemini-code-assist gemini-code-assist Bot left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

This pull request optimizes the locking behavior in lock_move_reorganize_page by replacing the global LockMutexGuard with a more granular LockGuard on the specific hash cell. This prevents blocking concurrent lock acquisitions server-wide. Additionally, detailed comments have been added to explain the lock protection model and the handling of the LOCK_WAIT bit. There are no review comments, and I have no additional feedback to provide.

Important

The consumer version of Gemini Code Assist on GitHub is being sunset. Starting June 18, 2026, new organization installations will be blocked, and all code review activity will officially cease on July 17, 2026.
For more details on the timeline and next steps, please review the Help Documentation.

@iMineLink iMineLink requested a review from dr-m June 29, 2026 16:06

Copilot AI left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR reduces lock contention during InnoDB page reorganize lock relocation by replacing a server-wide lock_sys exclusive latch with an exclusive latch on only the relevant rec_hash cell for the affected page. This aligns the locking scope with the data actually touched (record locks for a single page), aiming to reduce global lock convoying under high concurrency.

Changes:

  • Replace LockMutexGuard (global lock_sys.wr_lock) with LockGuard (per-rec_hash cell latch) in lock_move_reorganize_page().
  • Preserve the existing TSX-elided empty-fast-path check via TMLockGuard.
  • Add expanded rationale comments documenting correctness of cell-latch protection (including LOCK_WAIT handling) and concurrency implications.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

@dr-m dr-m left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good catch, and great source code comments.

lock_move_reorganize_page() was acquiring lock_sys.latch in exclusive
mode (via LockMutexGuard) for the entire body of phase 2 (lock chain
iteration, bitmap reset, and lock_rec_add_to_queue() calls). The
function however only touches record locks belonging to a single page,
which all live in a single lock_sys.rec_hash cell. Holding that cell
latch in exclusive mode via LockGuard is sufficient:

- The cell latch protects the cell's lock chain and the bitmaps of the
  lock_t objects in it (lock_rec_bitmap_reset and the new bit set by
  lock_rec_add_to_queue()).
- It also protects lock->type_mode, including the LOCK_WAIT bit. The
  canonical clear in lock_reset_lock_and_trx_wait() runs under the cell
  latch, and lock_grant() invokes it before taking trx->mutex, so the bit
  is cell-latch state rather than trx->mutex state. Phase 1 only clears
  the bit and leaves trx->lock.wait_lock intact; the copy in old_locks
  keeps LOCK_WAIT and phase 2 re-adds the lock with it, so the wait
  relationship (guarded by lock_sys.wait_mutex) is preserved across the
  move. Neither trx->mutex nor wait_mutex is required here.
- Each owning trx's mutex is acquired per-iteration to protect that trx's
  trx_locks list and lock_heap during lock_rec_add_to_queue().

The global exclusive latch was over-strong: it blocked every concurrent
lock_sys.rd_lock() acquirer in lock_rec_lock() and lock_table()
server-wide for the duration of the reorganize, contributing
disproportionately to the lock_sys.latch convoy under heavy concurrency.

The TMLockGuard fast-path empty check at the top of the function is
preserved; for cells with no locks the cost is still just a TSX-elided
read.
@iMineLink iMineLink merged commit 232e3da into 11.8 Jun 30, 2026
17 of 19 checks passed
@iMineLink iMineLink deleted the MDEV-40128 branch June 30, 2026 11:54
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Development

Successfully merging this pull request may close these issues.

4 participants