Skip to content

Core: Fix background thread leak in ScanTaskIterable#16768

Open
sejal-gupta-ksolves wants to merge 1 commit into
apache:mainfrom
sejal-gupta-ksolves:fix/scan-task-iterable-thread-leak
Open

Core: Fix background thread leak in ScanTaskIterable#16768
sejal-gupta-ksolves wants to merge 1 commit into
apache:mainfrom
sejal-gupta-ksolves:fix/scan-task-iterable-thread-leak

Conversation

@sejal-gupta-ksolves

Copy link
Copy Markdown

Closes: #16758

Problem

When downstream query engines (such as StarRocks, Trino, or Spark) cancel or abort a REST table scan early due to client disconnects, timeouts, or query limits, they trigger the cleanup sequence on the outer execution container.

In Apache Iceberg, ScanTaskIterable.close() was implemented as an empty no-op method. Because this outer close() call failed to cascade the shutdown signal to the underlying data structures:

  • The internal shutdown state atomic flag remained false.
  • Background PlanTaskWorker threads continued running indefinitely.
  • Once the internal taskQueue reached its 1000 item capacity limit, all active worker threads became permanently deadlocked inside offerWithTimeout(), leading to thread pool exhaustion on the engine coordinator side.

Solution

  1. Implemented State Tracking and Cleanup: Added thread-safe execution barriers inside ScanTaskIterable.close() utilizing shutdown.compareAndSet(false, true).
  2. Queue Eviction Matrix: Updated the close block to explicitly flush taskQueue, planTasks, and initialFileScanTasks lists upon termination. This allows background threads stuck in an offer wait cycle to instantly unblock, evaluate the flipped shutdown state, and exit gracefully.
  3. Decoupled Iterator Lifecycle: Refactored the internal ScanTasksIterator.close() block to eliminate redundant code duplication, rewriting it to delegate its cleanup tasks straight up to ScanTaskIterable.this.close(). This ensures unified thread termination safety across all potential entry points.
  4. Regression Test Addition: Designed and integrated TestScanTaskIterableLeak under the org.apache.iceberg.rest test package, proving that active planning thread allocations successfully scale back down to 0 upon premature termination.

Verification Testing

# 1. Clean format code using Spotless rules
./gradlew spotlessApply

# 2. Run static quality analysis lint checks on modified packages
./gradlew :iceberg-core:compileJava :iceberg-core:compileTestJava

# 3. Verify the core build pass and execute the regression test case
./gradlew :iceberg-core:test --tests "org.apache.iceberg.rest.TestScanTaskIterableLeak" --info

@github-actions github-actions Bot added the core label Jun 11, 2026
@singhpk234 singhpk234 self-requested a review June 11, 2026 15:18
Comment thread core/src/main/java/org/apache/iceberg/rest/ScanTaskIterable.java
Comment thread core/src/main/java/org/apache/iceberg/rest/ScanTaskIterable.java
Comment thread core/src/main/java/org/apache/iceberg/rest/ScanTaskIterable.java
@chenwyi2

Copy link
Copy Markdown

At present, In StarRocks FE deployment, when a query is cancelled or times out during the DeployScanRanges / scan-planning phase, close() can be invoked?

@sejal-gupta-ksolves sejal-gupta-ksolves force-pushed the fix/scan-task-iterable-thread-leak branch from 07f4e38 to be101a2 Compare June 12, 2026 07:36
@sejal-gupta-ksolves

Copy link
Copy Markdown
Author

@chenwyi2 Yes, close() will be invoked automatically by StarRocks FE without requiring any downstream code changes.

During query planning and range deployment, StarRocks processes the file splits via Iceberg's CloseableIterable scan stream. When a query is cancelled or times out, StarRocks' coordinator explicitly cascades a .close() signal down to this execution stream wrapper to free resources.

Because this patch unifies the cleanup paths by having the iterator directly delegate to ScanTaskIterable.this.close(), any termination signal triggered by StarRocks will instantly flip the shutdown barrier, empty the planning queues, and gracefully exit the blocked background worker threads.

Fixes an issue where background PlanTaskWorker threads remain indefinitely blocked in offerWithTimeout() when a query is cancelled or abandoned early because the outer ScanTaskIterable.close() method was a no-op.
@sejal-gupta-ksolves sejal-gupta-ksolves force-pushed the fix/scan-task-iterable-thread-leak branch from be101a2 to 70f17fc Compare June 12, 2026 07:53
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

Projects

None yet

Development

Successfully merging this pull request may close these issues.

REST ScanTaskIterable: background workers not stopped on iterable close; offerWithTimeout can block forever when consumer aborts

3 participants