
SDSTOR-21760: Fix table not found issue during index recovery #877

Open
yuwmao wants to merge 1 commit into eBay:stable/v7.x from yuwmao:index_crash


@yuwmao
Contributor

@yuwmao yuwmao commented Apr 17, 2026

Journal–table metadata mismatch due to CP vs destroy table ordering

Here is the issue description:
A split hits the crash flip and marks its parent buffer with m_crash_flag_on during transact_bufs (src/lib/index/wb_cache.cpp:237-247). Within the same logical window, the table is removed: index_table::destroy() immediately removes its superblock from meta via MetaBlkService::remove_sub_sb (src/include/homestore/index/index_table.hpp:135-147 → src/lib/meta/meta_blk_service.cpp:872+).
The CP flush starts later. It writes the txn_journal to meta first, then begins flushing dirty buffers, and crashes when it reaches the flagged parent buffer (src/lib/index/wb_cache.cpp:860-871, 896-903). On restart, recovery replays the persisted txn_journal and attempts to repair the table by ordinal, but the table superblock is gone and the table is not loaded, which trips the HS_DBG_ASSERT in repair_index_node (src/lib/index/index_service.cpp:205-212).

Key ordering rules that cause the mismatch
- Table destroy persistence is immediate at destroy(): meta superblock is removed synchronously (not tied to CP).
- Index CP flush ordering is fixed: (1) persist txn_journal; (2) flush dirty buffers; crash can occur at (2).
- Thus it’s possible to have a persisted journal entry for a table whose superblock was already removed.
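
The ordering rules above can be illustrated with a small toy model. This is only a sketch under stated assumptions: `ToyIndexStore`, its members, and `replay_failing_sequence` are hypothetical names made up for illustration and are not HomeStore APIs. The model also assumes that a completed CP leaves nothing to replay.

```cpp
#include <cassert>
#include <cstdint>
#include <set>
#include <vector>

// Toy model only: none of these names are HomeStore APIs.
struct ToyIndexStore {
    std::set<uint32_t> superblocks;          // ordinals with a persisted superblock
    std::vector<uint32_t> pending_journal;   // in-memory txn journal entries
    std::vector<uint32_t> persisted_journal; // journal entries already written to meta

    void create_table(uint32_t ordinal) { superblocks.insert(ordinal); }

    // A split dirties buffers and records a journal entry in memory only.
    void split(uint32_t ordinal) { pending_journal.push_back(ordinal); }

    // Destroy removes the superblock right away, independent of any CP.
    void destroy_table(uint32_t ordinal) { superblocks.erase(ordinal); }

    // CP flush: (1) persist the journal, (2) flush dirty buffers.
    // Returns false if the simulated crash fires during step (2).
    bool cp_flush(bool crash_during_buffer_flush) {
        persisted_journal.insert(persisted_journal.end(),
                                 pending_journal.begin(), pending_journal.end());
        pending_journal.clear();
        if (crash_during_buffer_flush) { return false; }
        persisted_journal.clear(); // assumption: a completed CP needs no replay
        return true;
    }

    // Recovery view: journaled ordinals whose table can no longer be found.
    std::vector<uint32_t> orphaned_journal_entries() const {
        std::vector<uint32_t> orphans;
        for (uint32_t ordinal : persisted_journal) {
            if (superblocks.count(ordinal) == 0) { orphans.push_back(ordinal); }
        }
        return orphans;
    }
};

// Replays the failing sequence; returns the ordinals recovery cannot resolve.
inline std::vector<uint32_t> replay_failing_sequence() {
    ToyIndexStore store;
    store.create_table(7);
    store.split(7);          // journal entry exists only in memory
    store.destroy_table(7);  // superblock is gone immediately
    const bool completed = store.cp_flush(true); // journal persisted, then crash
    assert(!completed);
    (void)completed;
    return store.orphaned_journal_entries();
}
```

In this model, `replay_failing_sequence()` leaves a persisted journal entry for ordinal 7 with no matching superblock, which corresponds to the state in which recovery would fail to find the table.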

Solution:
Trigger a CP flush when deleting an index table, to force the deletion to be separated from other modifications.
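
A minimal sketch of how a CP flush on destroy could close the window, using a toy model. All names here (`CpOnDestroyStore` and its members) are hypothetical and not HomeStore APIs. The sketch assumes the flush runs before the superblock is removed; the description above does not state the exact placement, so treat that ordering as an assumption.

```cpp
#include <cassert>
#include <cstdint>
#include <set>
#include <vector>

// Toy model only: none of these names are HomeStore APIs.
struct CpOnDestroyStore {
    std::set<uint32_t> superblocks;          // ordinals with a persisted superblock
    std::vector<uint32_t> pending_journal;   // in-memory txn journal entries
    std::vector<uint32_t> persisted_journal; // journal entries already written to meta

    void create_table(uint32_t ordinal) { superblocks.insert(ordinal); }
    void split(uint32_t ordinal) { pending_journal.push_back(ordinal); }

    // CP flush: (1) persist the journal, (2) flush dirty buffers.
    // Returns false if the simulated crash fires during step (2).
    bool cp_flush(bool crash_during_buffer_flush) {
        persisted_journal.insert(persisted_journal.end(),
                                 pending_journal.begin(), pending_journal.end());
        pending_journal.clear();
        if (crash_during_buffer_flush) { return false; }
        persisted_journal.clear(); // assumption: a completed CP needs no replay
        return true;
    }

    // Destroy first drains pending modifications through a CP flush
    // (assumed ordering), and only then removes the superblock.
    // Returns false if the simulated crash interrupted the flush.
    bool destroy_table(uint32_t ordinal, bool crash_during_buffer_flush) {
        if (!cp_flush(crash_during_buffer_flush)) {
            return false; // crashed: the superblock is still present for recovery
        }
        superblocks.erase(ordinal);
        return true;
    }

    // Recovery view: journaled ordinals whose table can no longer be found.
    std::vector<uint32_t> orphaned_journal_entries() const {
        std::vector<uint32_t> orphans;
        for (uint32_t ordinal : persisted_journal) {
            if (superblocks.count(ordinal) == 0) { orphans.push_back(ordinal); }
        }
        return orphans;
    }
};
```

In this model, a crash during the flush leaves the journal entry and the superblock both present, and a completed flush leaves no journal entry behind when the superblock is removed. Either way, no journal entry points at a missing table.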

@codecov-commenter

codecov-commenter commented Apr 17, 2026


Codecov Report

❌ Patch coverage is 0% with 1 line in your changes missing coverage. Please review.
⚠️ Please upload report for BASE (stable/v7.x@568f9ff). Learn more about missing BASE report.

| Files with missing lines | Patch % | Lines |
|---|---|---|
| src/include/homestore/index/index_table.hpp | 0.00% | 0 Missing and 1 partial ⚠️ |
Additional details and impacted files
@@              Coverage Diff               @@
##             stable/v7.x     #877   +/-   ##
==============================================
  Coverage               ?   48.13%           
==============================================
  Files                  ?      110           
  Lines                  ?    12936           
  Branches               ?     6221           
==============================================
  Hits                   ?     6227           
  Misses                 ?     2574           
  Partials               ?     4135           


@xiaoxichen
Collaborator

@yuwmao can we rebase this PR and try to see if it can merge?

@yuwmao yuwmao changed the base branch from master to stable/v7.x May 9, 2026 07:04
@yuwmao
Contributor Author

yuwmao commented May 9, 2026

> @yuwmao can we rebase this PR and try to see if it can merge?

Rebased.

@yuwmao yuwmao marked this pull request as draft May 9, 2026 07:11
@yuwmao yuwmao marked this pull request as ready for review May 9, 2026 07:12