feat(pool): add reusable pod reset policy by restart pool pod containers #557

Open
fengcone wants to merge 12 commits into alibaba:main from fengcone:feature/public-podRestart1

@fengcone
Collaborator

Summary

Testing

  • Not run (explain why)
  • Unit tests
  • Integration tests
  • e2e / manual verification

Breaking Changes

  • None
  • Yes (the default recycling strategy for SandboxPool has been changed from Restart to Delete)

Checklist

  • Linked Issue or clearly described motivation
  • Added/updated docs (if needed)
  • Added/updated tests (if needed)
  • Security impact considered
  • Backward compatibility considered

- Introduce PodRecyclePolicy to control recycle behavior of Pods in BatchSandbox with options Delete and Restart
- Extend CapacitySpec and Pool status with Restarting count for Pods undergoing recycle
- Enhance allocator to track Pods released and requiring recycle processing
- Implement RestartTracker to manage Pod restart lifecycle including kill signaling and status tracking
- Integrate RestartTracker into PoolReconciler for handling Pod recycle based on policy
- Add RBAC for pods/exec to allow container exec for kill operations during restart
- Persist Pod recycle metadata in annotations to track state machine and attempts
- Implement automatic cleanup of Pods that fail recycle restart
- Refactor scheduler and reconciler logic to exclude Pods in restart flow from allocation
- Add extensive handling for concurrency and error management in restart operations
- Update CRD schema to include podRecyclePolicy and restarting count fields
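
The policy named in the first bullet could be modeled roughly as follows. This is only a sketch: the type name `PodRecyclePolicy` and the options Delete and Restart come from the description above, while the constant names, the `normalizePolicy` helper, and the fallback to Delete (inferred from the Breaking Changes note) are assumptions rather than the PR's actual code.

```go
package main

// PodRecyclePolicy controls what happens to a pooled Pod after it is released.
type PodRecyclePolicy string

const (
	// PodRecyclePolicyDelete deletes the Pod and lets the pool replenish itself.
	PodRecyclePolicyDelete PodRecyclePolicy = "Delete"
	// PodRecyclePolicyRestart restarts the Pod's containers so the Pod can be reused.
	PodRecyclePolicyRestart PodRecyclePolicy = "Restart"
)

// normalizePolicy is a hypothetical helper: it keeps an explicit Restart
// and maps an unset or unknown value to Delete.
func normalizePolicy(p PodRecyclePolicy) PodRecyclePolicy {
	if p == PodRecyclePolicyRestart {
		return PodRecyclePolicyRestart
	}
	return PodRecyclePolicyDelete
}
```

With this shape, a pool that leaves the field empty would be treated as Delete, and only an explicit Restart opts into container restarts.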

@chatgpt-codex-connector chatgpt-codex-connector bot left a comment

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 6d0211b1a2

ℹ️ About Codex in GitHub

Codex has been enabled to automatically review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

When you sign up for Codex through ChatGPT, Codex can also answer questions or update the PR, like "@codex address that feedback".

@fengcone
Collaborator Author

@codex review


@chatgpt-codex-connector chatgpt-codex-connector bot left a comment

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 429c0236fa

…ting pods from allocation alibaba#452

- Add restart-timeout flag to controller with default 90s for pod restart operations
- Pass restartTimeout value to restartTracker for managing pod lifecycle
- Modify restartTracker to use configurable restartTimeout instead of constant
- Exclude pods in restarting state from allocator’s available pod list
- Add unit test to verify allocator excludes restarting pods during scheduling
- Update e2e test to deploy controller with restart-timeout=10s for timeout testing
- Add setup and teardown steps in e2e test for namespace, CRDs, and controller deployment
- Reduce pod restart timeout wait in e2e test from 4 minutes to 1 minute for faster feedback
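
A configurable timeout like the one in the first bullet is commonly wired up with the standard `flag` package. The sketch below assumes the flag is spelled `restart-timeout` and defaults to 90s, as the commit message says; the function name `parseRestartTimeout`, the flag set name, and the help text are illustrative and not taken from the PR.

```go
package main

import (
	"flag"
	"time"
)

// defaultRestartTimeout mirrors the 90s default named in the commit message.
const defaultRestartTimeout = 90 * time.Second

// parseRestartTimeout (hypothetical name) parses controller arguments and
// returns the restart timeout, falling back to the default when the flag is absent.
func parseRestartTimeout(args []string) (time.Duration, error) {
	fs := flag.NewFlagSet("controller", flag.ContinueOnError)
	timeout := fs.Duration("restart-timeout", defaultRestartTimeout,
		"maximum time to wait for a pod restart")
	if err := fs.Parse(args); err != nil {
		return 0, err
	}
	return *timeout, nil
}
```

Passing `--restart-timeout=10s`, as the e2e deployment in this commit does, would yield a 10 second timeout; passing nothing yields 90 seconds.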
@fengcone force-pushed the feature/public-podRestart1 branch from 429c023 to 2a339c8 on March 24, 2026 12:09
…ve error handling

- Add check to ensure pod exists in PodAllocation before deletion during release
- Update test cases to verify pod removal and recycling behavior correctly
- Return error immediately after logging failure to handle pod recycle in pool controller
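
The existence check in the first bullet can be pictured with a plain map. This is a minimal sketch under assumptions: `podAllocation` here is a simplified stand-in (pod name to owning sandbox name) and `release` is a hypothetical helper, not the allocator's real data structure or method.

```go
package main

// podAllocation is a simplified stand-in that maps a pod name to the
// sandbox currently holding it.
type podAllocation map[string]string

// release removes pod from the allocation only if it is present and reports
// whether the pod was actually released and therefore needs recycling.
func release(alloc podAllocation, pod string) bool {
	if _, ok := alloc[pod]; !ok {
		// Unknown pod: nothing to delete and nothing to recycle.
		return false
	}
	delete(alloc, pod)
	return true
}
```

Returning a boolean lets the caller queue for recycling only the pods that were really allocated, so releasing the same pod twice does not trigger a second recycle.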
@fengcone
Collaborator Author

@codex review


@chatgpt-codex-connector chatgpt-codex-connector bot left a comment

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 36c0788b6b

@fengcone force-pushed the feature/public-podRestart1 branch from 022a41a to 6255351 on March 24, 2026 13:48
@Pangjiping added the feature (New feature or request) and component/k8s (For kubernetes runtime) labels on Mar 24, 2026
- Add FinalizerPoolRecycle for pool mode BatchSandbox with restart policy
- Implement ensureFinalizer helper to manage finalizers robustly
- Handle pool recycle process before task cleanup on BatchSandbox deletion
- Enhance canAllocate to exclude pods not ready after recycle confirmation
- Modify handlePodRecycle to support restart timeout from pool annotations
- Adjust PoolReconciler to process pod recycle before scheduling and allocation
- Introduce needsRecycleConfirmation to detect pods needing recycle handling
- Count recycling pods in pool scaling decisions instead of restarting pods
- Update allocator to skip pods that cannot allocate (e.g., still recycling)
- Add unit tests for canAllocate logic on pod labels and annotations
- Update e2e test to verify Delete policy deletes pods and pool replenishment
- Remove deprecated InitialRestartCounts from PodRecycleMeta for clarity
- Refactor restartTracker to remove embedded restartTimeout field
- Update restartTracker HandleRestart call to accept timeout parameter
- Clean up logging and error handling for finalizer and pod recycle operations
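
An idempotent finalizer helper of the kind the second bullet describes might look like the sketch below. It works on a plain string slice so it stays self-contained; the finalizer value is a placeholder (the real `FinalizerPoolRecycle` string is not shown in this thread), and the PR's `ensureFinalizer` may have a different signature.

```go
package main

// finalizerPoolRecycle is a placeholder value, not the PR's real finalizer string.
const finalizerPoolRecycle = "example.io/pool-recycle"

// ensureFinalizer returns the finalizer list with name present exactly once,
// plus a flag telling the caller whether an update is needed.
func ensureFinalizer(finalizers []string, name string) ([]string, bool) {
	for _, f := range finalizers {
		if f == name {
			return finalizers, false
		}
	}
	// Copy before appending so the caller's slice is never mutated in place.
	out := make([]string, 0, len(finalizers)+1)
	out = append(out, finalizers...)
	return append(out, name), true
}
```

The changed flag lets a reconciler skip the API update when the finalizer is already present, which keeps repeated reconciles cheap.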
@fengcone force-pushed the feature/public-podRestart1 branch from 65281ec to 8aa745a on March 26, 2026 09:55
@fengcone
Collaborator Author

@codex review


@chatgpt-codex-connector chatgpt-codex-connector bot left a comment

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 8aa745acbd

@fengcone requested a review from Pangjiping on March 26, 2026 15:48
- Add support for customizing restart timeout via annotation in docs
- Replace PodsToRecycle slice with RecyclingPods set for efficient lookup
- Refactor allocator to skip pods currently recycling in allocation logic
- Remove deprecated canAllocate function and simplify recycling checks
- Update pool reconciler to collect recycling pods and handle recycling in batch
- Refactor handlePodRecycle to process multiple pods and aggregate errors
- Enhance restart tracker logging with timeout and elapsed time details
- Remove outdated controller unit tests and reduce e2e test scope for simplicity
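
Replacing a slice with a set, as the second and third bullets describe, turns the allocator's "is this pod recycling?" check into a constant-time map lookup. The sketch below assumes pods are identified by name; the type `recyclingPods` and the function `allocatable` are illustrative names, not the PR's.

```go
package main

// recyclingPods is a set of names of pods that are currently being recycled.
type recyclingPods map[string]struct{}

// allocatable returns the candidates that are not in the recycling set,
// preserving their original order.
func allocatable(candidates []string, recycling recyclingPods) []string {
	out := make([]string, 0, len(candidates))
	for _, name := range candidates {
		if _, busy := recycling[name]; busy {
			continue // still recycling, so not available for allocation
		}
		out = append(out, name)
	}
	return out
}
```

For example, with candidates `a`, `b`, `c` and `b` in the recycling set, only `a` and `c` remain available.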
@fengcone force-pushed the feature/public-podRestart1 branch 4 times, most recently from 52258f5 to 7d09efd on March 27, 2026 03:21
@fengcone
Collaborator Author

@codex review

@fengcone force-pushed the feature/public-podRestart1 branch from 7d09efd to 9f6fcd7 on March 27, 2026 04:04

@chatgpt-codex-connector chatgpt-codex-connector bot left a comment

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 7d09efd68c

@fengcone force-pushed the feature/public-podRestart1 branch from 9f6fcd7 to beb1e05 on March 27, 2026 04:09
@fengcone
Collaborator Author

@codex review


@chatgpt-codex-connector chatgpt-codex-connector bot left a comment

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: beb1e05e25

@alibaba deleted a comment from chatgpt-codex-connector bot on Mar 27, 2026
- Separate pooled and non-pooled pod retrieval for clarity and efficiency
- Extract pod fetching by names into a dedicated method
- Enhance addDeallocatedFromLabel to operate on provided pod slice
- Simplify releasePods to use improved pod fetch and labeling methods

feat(controller): improve pod restart detection accuracy

- Record container restart counts and startedAt timestamps before restart trigger
- Add restart detection via increased restart count or updated startedAt time
- Log detailed restart detection method for container state analysis
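
The two detection signals listed above (a higher restart count, or a newer startedAt) can be combined as in the sketch below. It uses a small local struct instead of the Kubernetes `ContainerStatus` type so that it stays self-contained; `containerSnapshot`, `restarted`, and the returned method strings are assumed names for illustration only.

```go
package main

import "time"

// containerSnapshot records what was observed for one container before the
// restart was triggered (or what is observed now).
type containerSnapshot struct {
	RestartCount int32
	StartedAt    time.Time
}

// restarted reports whether a container has restarted since the snapshot and
// names the signal that revealed it, which is useful for logging.
func restarted(before, now containerSnapshot) (bool, string) {
	if now.RestartCount > before.RestartCount {
		return true, "restartCount"
	}
	if now.StartedAt.After(before.StartedAt) {
		return true, "startedAt"
	}
	return false, ""
}
```

Checking startedAt as a second signal covers the case where the restart count alone does not show the change, which appears to be the accuracy gap this commit targets.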
Labels

component/k8s (For kubernetes runtime), feature (New feature or request)

3 participants