feat(pool): add reusable pod reset policy by restart pool pod containers #557
fengcone wants to merge 12 commits into alibaba:main
- Introduce PodRecyclePolicy to control recycle behavior of Pods in BatchSandbox with options Delete and Restart
- Extend CapacitySpec and Pool status with Restarting count for Pods undergoing recycle
- Enhance allocator to track Pods released and requiring recycle processing
- Implement RestartTracker to manage Pod restart lifecycle including kill signaling and status tracking
- Integrate RestartTracker into PoolReconciler for handling Pod recycle based on policy
- Add RBAC for pods/exec to allow container exec for kill operations during restart
- Persist Pod recycle metadata in annotations to track state machine and attempts
- Implement automatic cleanup of Pods that fail recycle restart
- Refactor scheduler and reconciler logic to exclude Pods in restart flow from allocation
- Add extensive handling for concurrency and error management in restart operations
- Update CRD schema to include podRecyclePolicy and restarting count fields
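As a rough illustration of the policy described in the list above, the choice between the two options could be sketched as follows. Only the names `PodRecyclePolicy`, `Delete`, and `Restart` come from the commit message; the constant identifiers, the `recycleAction` helper, and the choice to treat an unset policy as `Delete` are assumptions made for this sketch, not the PR's actual code.

```go
package main

import "fmt"

// PodRecyclePolicy names how a released pool Pod is recycled.
// The two values come from the commit message; the Go identifiers are assumed.
type PodRecyclePolicy string

const (
	PodRecyclePolicyDelete  PodRecyclePolicy = "Delete"
	PodRecyclePolicyRestart PodRecyclePolicy = "Restart"
)

// recycleAction is a hypothetical helper that picks the next step for a
// released Pod. Treating an empty or unknown policy as Delete is an assumption.
func recycleAction(policy PodRecyclePolicy, restartFailed bool) string {
	if policy == PodRecyclePolicyRestart && !restartFailed {
		return "restart"
	}
	// Delete policy, or a Restart that failed: remove the Pod instead of
	// returning it to the pool.
	return "delete"
}

func main() {
	fmt.Println(recycleAction(PodRecyclePolicyRestart, false))
	fmt.Println(recycleAction(PodRecyclePolicyRestart, true))
	fmt.Println(recycleAction(PodRecyclePolicyDelete, false))
}
```

The fallback from a failed restart to deletion mirrors the "automatic cleanup of Pods that fail recycle restart" item; how the real controller detects that failure is not shown here.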
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 6d0211b1a2
ℹ️ About Codex in GitHub
Codex has been enabled to automatically review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
When you sign up for Codex through ChatGPT, Codex can also answer questions or update the PR, like "@codex address that feedback".
@codex review
💡 Codex Review
Reviewed commit: 429c0236fa
…ting pods from allocation alibaba#452
- Add restart-timeout flag to controller with default 90s for pod restart operations
- Pass restartTimeout value to restartTracker for managing pod lifecycle
- Modify restartTracker to use configurable restartTimeout instead of constant
- Exclude pods in restarting state from allocator's available pod list
- Add unit test to verify allocator excludes restarting pods during scheduling
- Update e2e test to deploy controller with restart-timeout=10s for timeout testing
- Add setup and teardown steps in e2e test for namespace, CRDs, and controller deployment
- Reduce pod restart timeout wait in e2e test from 4 minutes to 1 minute for faster feedback
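A minimal sketch of how a `restart-timeout` flag with a 90s default might be parsed and checked, using only the standard library `flag` package. The flag name and default are taken from the commit message; the function names `parseRestartTimeout` and `restartTimedOut`, and the use of a standalone `FlagSet`, are assumptions for illustration.

```go
package main

import (
	"flag"
	"fmt"
	"time"
)

// defaultRestartTimeout matches the 90s default named in the commit message.
const defaultRestartTimeout = 90 * time.Second

// parseRestartTimeout is a hypothetical helper that reads the
// restart-timeout flag from the given arguments.
func parseRestartTimeout(args []string) (time.Duration, error) {
	fs := flag.NewFlagSet("controller", flag.ContinueOnError)
	timeout := fs.Duration("restart-timeout", defaultRestartTimeout,
		"maximum time to wait for a pod restart to complete")
	if err := fs.Parse(args); err != nil {
		return 0, err
	}
	return *timeout, nil
}

// restartTimedOut reports whether a restart that has been running for
// elapsed has exceeded the configured timeout.
func restartTimedOut(elapsed, timeout time.Duration) bool {
	return elapsed > timeout
}

func main() {
	timeout, err := parseRestartTimeout([]string{"--restart-timeout=10s"})
	if err != nil {
		fmt.Println("parse error:", err)
		return
	}
	fmt.Println("timeout:", timeout)
	fmt.Println("timed out after 95s:", restartTimedOut(95*time.Second, timeout))
}
```

Passing the timeout as a parameter rather than a package constant is what lets an e2e test run the controller with a much shorter value, as the list above describes.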
429c023 to 2a339c8
…ve error handling
- Add check to ensure pod exists in PodAllocation before deletion during release
- Update test cases to verify pod removal and recycling behavior correctly
- Return error immediately after logging failure to handle pod recycle in pool controller
@codex review
💡 Codex Review
Reviewed commit: 36c0788b6b
022a41a to 6255351
- Add FinalizerPoolRecycle for pool mode BatchSandbox with restart policy
- Implement ensureFinalizer helper to manage finalizers robustly
- Handle pool recycle process before task cleanup on BatchSandbox deletion
- Enhance canAllocate to exclude pods not ready after recycle confirmation
- Modify handlePodRecycle to support restart timeout from pool annotations
- Adjust PoolReconciler to process pod recycle before scheduling and allocation
- Introduce needsRecycleConfirmation to detect pods needing recycle handling
- Count recycling pods in pool scaling decisions instead of restarting pods
- Update allocator to skip pods that cannot allocate (e.g., still recycling)
- Add unit tests for canAllocate logic on pod labels and annotations
- Update e2e test to verify Delete policy deletes pods and pool replenishment
- Remove deprecated InitialRestartCounts from PodRecycleMeta for clarity
- Refactor restartTracker to remove embedded restartTimeout field
- Update restartTracker HandleRestart call to accept timeout parameter
- Clean up logging and error handling for finalizer and pod recycle operations
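The `ensureFinalizer` helper mentioned above is not shown in this conversation. One way such a helper could behave, sketched on a plain string slice so it runs without Kubernetes dependencies, is below. The signature, the returned "changed" flag, and the finalizer string value are all assumptions; only the names `ensureFinalizer` and `FinalizerPoolRecycle` appear in the commit message.

```go
package main

import "fmt"

// FinalizerPoolRecycle uses a hypothetical value; the source names the
// constant but not its string.
const FinalizerPoolRecycle = "example.com/pool-recycle"

// ensureFinalizer returns a copy of finalizers in which name is present or
// absent as requested, plus whether anything changed. Callers would only
// need to issue an update to the API server when changed is true.
func ensureFinalizer(finalizers []string, name string, present bool) ([]string, bool) {
	out := make([]string, 0, len(finalizers)+1)
	found := false
	for _, f := range finalizers {
		if f == name {
			found = true
			if !present {
				continue // drop the finalizer
			}
		}
		out = append(out, f)
	}
	if present && !found {
		out = append(out, name)
	}
	return out, found != present
}

func main() {
	fs, changed := ensureFinalizer(nil, FinalizerPoolRecycle, true)
	fmt.Println(fs, changed)
	fs, changed = ensureFinalizer(fs, FinalizerPoolRecycle, false)
	fmt.Println(fs, changed)
}
```

Making the helper idempotent means a reconcile loop can call it on every pass without generating redundant writes.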
65281ec to 8aa745a
@codex review
💡 Codex Review
Reviewed commit: 8aa745acbd
- Add support for customizing restart timeout via annotation in docs
- Replace PodsToRecycle slice with RecyclingPods set for efficient lookup
- Refactor allocator to skip pods currently recycling in allocation logic
- Remove deprecated canAllocate function and simplify recycling checks
- Update pool reconciler to collect recycling pods and handle recycling in batch
- Refactor handlePodRecycle to process multiple pods and aggregate errors
- Enhance restart tracker logging with timeout and elapsed time details
- Remove outdated controller unit tests and reduce e2e test scope for simplicity
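Two ideas from the list above can be sketched together: a set of recycling pod names that the allocator consults, and a restart timeout that an annotation can override. The annotation key, the function names, and the use of pod names as set keys are assumptions for this sketch; the commit message only names `RecyclingPods` and says the timeout is customizable via annotation.

```go
package main

import (
	"fmt"
	"time"
)

// restartTimeoutAnnotation is a hypothetical key; the real one is not
// given in this conversation.
const restartTimeoutAnnotation = "example.com/restart-timeout"

// resolveRestartTimeout returns the annotation value when it parses to a
// positive duration, and the fallback otherwise.
func resolveRestartTimeout(annotations map[string]string, fallback time.Duration) time.Duration {
	raw, ok := annotations[restartTimeoutAnnotation]
	if !ok {
		return fallback
	}
	d, err := time.ParseDuration(raw)
	if err != nil || d <= 0 {
		return fallback
	}
	return d
}

// allocatable filters out pods that are currently recycling. The set gives
// constant-time membership checks, unlike scanning a slice per pod.
func allocatable(pods []string, recycling map[string]struct{}) []string {
	out := make([]string, 0, len(pods))
	for _, p := range pods {
		if _, busy := recycling[p]; busy {
			continue
		}
		out = append(out, p)
	}
	return out
}

func main() {
	recycling := map[string]struct{}{"pod-b": {}}
	fmt.Println(allocatable([]string{"pod-a", "pod-b", "pod-c"}, recycling))

	ann := map[string]string{restartTimeoutAnnotation: "30s"}
	fmt.Println(resolveRestartTimeout(ann, 90*time.Second))
}
```

Falling back silently on an unparsable annotation is one possible choice; a real controller might prefer to log or surface the error.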
52258f5 to 7d09efd
@codex review
7d09efd to 9f6fcd7
💡 Codex Review
Reviewed commit: 7d09efd68c
9f6fcd7 to beb1e05
@codex review
💡 Codex Review
Reviewed commit: beb1e05e25
- Separate pooled and non-pooled pod retrieval for clarity and efficiency
- Extract pod fetching by names into a dedicated method
- Enhance addDeallocatedFromLabel to operate on provided pod slice
- Simplify releasePods to use improved pod fetch and labeling methods

feat(controller): improve pod restart detection accuracy
- Record container restart counts and startedAt timestamps before restart trigger
- Add restart detection via increased restart count or updated startedAt time
- Log detailed restart detection method for container state analysis
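The restart detection described in the second commit above compares a snapshot taken before the restart trigger with the container's current state. A minimal sketch, assuming a simplified snapshot struct in place of the Kubernetes container status type, could look like this. The type and function names are hypothetical; the two detection signals (restart count and startedAt) come from the commit message.

```go
package main

import (
	"fmt"
	"time"
)

// containerSnapshot is a simplified, hypothetical stand-in for the fields
// recorded before a restart is triggered.
type containerSnapshot struct {
	RestartCount int32
	StartedAt    time.Time
}

// detectRestart reports whether the container restarted between the two
// snapshots, and which signal revealed it.
func detectRestart(before, after containerSnapshot) (bool, string) {
	if after.RestartCount > before.RestartCount {
		return true, "restartCount"
	}
	// The count may not have changed yet (or at all) even though the
	// container process was replaced, so also compare start times.
	if after.StartedAt.After(before.StartedAt) {
		return true, "startedAt"
	}
	return false, ""
}

func main() {
	start := time.Now()
	before := containerSnapshot{RestartCount: 2, StartedAt: start}
	after := containerSnapshot{RestartCount: 2, StartedAt: start.Add(5 * time.Second)}

	restarted, method := detectRestart(before, after)
	fmt.Println(restarted, method)
}
```

Returning the detection method alongside the boolean makes it straightforward to log which signal fired, in line with the last bullet of that commit.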
Summary
Testing
Breaking Changes
Restart to Delete)
Checklist