Added timeout and cancellation support to ModelClient polling loop…#559
Open
Timandes wants to merge 3 commits intoalibaba:masterfrom
Open
Added timeout and cancellation support to ModelClient polling loop…#559Timandes wants to merge 3 commits intoalibaba:masterfrom
Timandes wants to merge 3 commits intoalibaba:masterfrom
Conversation
Timandes
added a commit
to Timandes/ROCK
that referenced
this pull request
Mar 4, 2026
…ing loops - Add DEFAULT_POLL_TIMEOUT constant (60 seconds) - Add PollOptions interface for timeout and AbortSignal support - Update popRequest() to accept timeout and cancellation via AbortSignal - Update waitForFirstRequest() to accept timeout and cancellation - Add comprehensive unit tests for timeout and cancellation scenarios Fixes issue where while(true) loops could block forever without timeout or cancellation mechanism. Aligns with Python SDK PR alibaba#559.
d044c6b to
6a34796
Compare
BCeZn
reviewed
Mar 4, 2026
…aba#550) Replace hardcoded DEFAULT_POLL_TIMEOUT constant with ROCK_MODEL_CLIENT_POLL_TIMEOUT environment variable. Default is now None (infinite wait) which is appropriate for waiting on Agent Actions. Changes: - Add ROCK_MODEL_CLIENT_POLL_TIMEOUT env var in rock/env_vars.py - Remove DEFAULT_POLL_TIMEOUT = 60.0 constant from client.py - pop_request/wait_for_first_request now read timeout from env var - Add tests for environment variable functionality
BCeZn
reviewed
Mar 5, 2026
…#550) Per PR review feedback, use env_vars.ROCK_MODEL_CLIENT_POLL_TIMEOUT directly as the default value for timeout parameter instead of checking it inside the function body.
BCeZn
approved these changes
Mar 5, 2026
Timandes
added a commit
to Timandes/ROCK
that referenced
this pull request
Mar 30, 2026
…ing loops - Add DEFAULT_POLL_TIMEOUT constant (60 seconds) - Add PollOptions interface for timeout and AbortSignal support - Update popRequest() to accept timeout and cancellation via AbortSignal - Update waitForFirstRequest() to accept timeout and cancellation - Add comprehensive unit tests for timeout and cancellation scenarios Fixes issue where while(true) loops could block forever without timeout or cancellation mechanism. Aligns with Python SDK PR alibaba#559.
Timandes
added a commit
to Timandes/ROCK
that referenced
this pull request
Mar 30, 2026
…ing loops - Add DEFAULT_POLL_TIMEOUT constant (60 seconds) - Add PollOptions interface for timeout and AbortSignal support - Update popRequest() to accept timeout and cancellation via AbortSignal - Update waitForFirstRequest() to accept timeout and cancellation - Add comprehensive unit tests for timeout and cancellation scenarios Fixes issue where while(true) loops could block forever without timeout or cancellation mechanism. Aligns with Python SDK PR alibaba#559.
Timandes
added a commit
that referenced
this pull request
Mar 31, 2026
* feat: add ts-sdk
* chore: add related docs&examples
* fix(ts-sdk): use snake_case for response fields to match API
* fix: make basic-usage.ts works
* fix: make background-tasks.ts works
* docs: update doc files
* fix: make complete-workflow.ts works
* fix: make file-operations.ts works
* chore: add eslint
* feat(ts-sdk)!: adopt camelCase naming convention for public API
- Add ts-case-convert for automatic case conversion
- Implement HTTP layer auto-transform (request → snake_case, response → camelCase)
- Refactor all response types to use camelCase
- Update examples and documentation
BREAKING CHANGE: All response fields now use camelCase instead of snake_case
(e.g., sandbox_id → sandboxId, exit_code → exitCode)
* fix(ts-sdk): use snake_case for URL query parameters
- Fix getStatus URL to use sandbox_id instead of sandboxId
- Fix logger getLogLevel to use lowercase for level matching
* refactor(ts-sdk): rename methods to camelCase convention
- Rename write_file() to writeFile()
- Rename read_file() to readFile()
- Add ESLint naming-convention rule to enforce camelCase
- Update examples to use new method names
* feat(ts-sdk): include response headers in all response types
- Modify HttpUtils.get/post/postMultipart to return structured response
with {status, result, headers}
- Add HeaderFieldsSchema with cluster, requestId, eagleeyeTraceid
- Update all Response types to include header-derived fields
- Extract headers in all Sandbox methods and merge into results
* chore(ts-sdk): release v1.2.3
- Fix HttpUtils generic type parameter usage in client methods
- Fix EnvHubClient to use response.result from new HttpUtils return type
* fix: add missing field 'error'
* Revert "fix: add missing field 'error'"
This reverts commit 2ec4dcf.
* Revert "chore(ts-sdk): release v1.2.3"
This reverts commit 829b7aa.
* Revert "feat(ts-sdk): include response headers in all response types"
This reverts commit 2368f86.
* feat(ts-sdk): include response headers and error info in HTTP responses
* chore(ts-sdk): release v1.2.4
* fix(ts-sdk): correct type parameters for HttpUtils methods
* feat(ts-sdk): add integration test setup with Jest projects mode
- Configure Jest projects to separate unit and integration tests
- Unit tests run from src/, integration tests from tests/integration/
- Add test:unit and test:integration npm scripts
- Add sandbox lifecycle integration test (start, isAlive, stop)
* feat(ts-sdk): implement uploadDir for directory upload to sandbox
- Implement uploadDir method in LinuxFileSystem following Python SDK pattern
- Add arun method to AbstractSandbox interface for command execution
- Support validation, local tar packing, upload, remote extraction, and cleanup
- Add integration tests for uploadDir functionality
* fix(ts-sdk): correct Content-Type handling for multipart/form-data requests
- Remove manual Content-Type header that prevented axios from adding boundary
- Set Content-Type to null in post config to allow axios auto-detection
- Add unit tests for postMultipart method
* fix(ts-sdk): add shims for __dirname ESM compatibility
__dirname is not available in ESM modules. Enable tsup shims option
to inject proper polyfill using fileURLToPath(import.meta.url).
* fix(ts-sdk): use unquoted heredoc for APT speedup command
Using single-quoted heredoc delimiter (<< 'EOF') prevented shell
variable expansion, causing $(lsb_release -cs) to be written literally
instead of being evaluated to the Ubuntu codename (e.g., 'jammy').
This fix ensures the command substitution is properly expanded when
configuring APT mirror sources.
* fix(ts-sdk): add missing /download/ path in Python V31212 install URL
* fix(ts-sdk): sync VERSION with package.json dynamically
VERSION constant was hardcoded to '1.2.1' while package.json declared
'1.2.4', causing version mismatch. Now reads version from package.json
at runtime to ensure consistency.
* feat(ts-sdk): add RuntimeEnv, ModelService and Agent modules
- Add RuntimeEnv module for Python/Node.js runtime management in sandbox
- Add ModelService module for model service lifecycle in sandbox
- Add Agent module for automated task orchestration
- Remove old model/service.ts (local Node.js server approach)
- Add example files for new modules
- Update README with new module documentation
This resolves the 'server directory not found' issue by using
Python SDK's model service inside sandbox via pip install.
* fix(ts-sdk): add timeout and cancellation support to ModelClient polling loops
- Add DEFAULT_POLL_TIMEOUT constant (60 seconds)
- Add PollOptions interface for timeout and AbortSignal support
- Update popRequest() to accept timeout and cancellation via AbortSignal
- Update waitForFirstRequest() to accept timeout and cancellation
- Add comprehensive unit tests for timeout and cancellation scenarios
Fixes issue where while(true) loops could block forever without
timeout or cancellation mechanism. Aligns with Python SDK PR #559.
* fix(ts-sdk): integrate typed exceptions (raiseForCode) into Sandbox client
The exception classes (RockException, BadRequestRockError, etc.) and
raiseForCode() were well-designed but never used. All errors were thrown
as generic new Error(...), losing error code information.
Now all Sandbox methods check response.result.code and call raiseForCode()
to throw appropriate typed exceptions based on error code ranges:
- 4xxx: BadRequestRockError
- 5xxx: InternalServerRockError
- 6xxx: CommandRockError
This matches the Python SDK's exception handling pattern.
* fix(ts-sdk): add Zod validation to API response parsing
Add runtime validation for all API responses using Zod schemas,
matching Python SDK's Pydantic validation behavior (Model(**result)).
- Import response schemas for runtime validation
- Add Schema.parse() calls in execute(), getStatus(), createSession(),
closeSession(), runInSession(), readFile(), isAlive() methods
- Add 9 test cases verifying Zod validation catches invalid data types
This addresses the PR review feedback that Zod schemas were defined
but .parse() was never called, only used for z.infer<> type extraction.
* refactor(ts-sdk): remove duplicated code across modules
- RunModeType/RunMode: re-export from common/constants.ts in sandbox/types.ts
- extractNohupPid: re-export from utils/http.ts in sandbox/utils.ts
- sleep: import from utils/retry.ts in model/client.ts instead of private method
Addresses code duplication feedback from PR #492 review.
* fix(ts-sdk): change default backoff from 1.0 to 2.0 for exponential retry
Default backoff=1.0 resulted in fixed delay intervals (no exponential
backoff). Changed to backoff=2.0 so delays double on each retry,
which is the standard exponential backoff behavior.
Addresses PR #492 review feedback.
* fix(ts-sdk): add command injection protection across modules
Add comprehensive input validation and shell escaping to prevent
command injection attacks in network, file_system, and runtime_env
modules.
- Add shell.ts utility module with safe escaping functions
- Refactor network.ts to use script-based execution with validation
- Add path/username validation in file_system.ts for chown/chmod
- Add extensive test coverage for injection protection scenarios
Inspired by Python SDK shlex.quote() approach.
* chore: bump version
* fix(ts-sdk): remove fragile session detection in arun()
Replace String(e).includes('already exists') pattern with Python SDK
behavior:
- NORMAL mode: run command directly without pre-creating session
- NOHUP mode: only create session when session parameter is not provided
This eliminates dependency on server error message format, making the
SDK more robust against server-side message changes or localization.
Add unit tests for session creation behavior and update integration
tests to explicitly create sessions when needed.
* fix(ts-sdk): use winston.Container for logger cache management
Replace manual Map-based loggerCache with winston.Container to prevent
potential memory leaks in long-running processes. Add clearLoggerCache()
and getLoggerCacheSize() functions for cache lifecycle management.
* refactor(ts-sdk): remove deprecated Constants class (dead code)
The Constants class was deprecated with all empty string values for
BASE_URL_* properties and was never used anywhere in the codebase.
Tests now verify that Constants is not exported from the module.
* fix(ts-sdk): use winston logger with warn-once for deprecated decorator
- Replace console.warn with Winston logger for consistency
- Implement warn-once pattern using Set to track warned keys
- Each deprecated function/class only warns once per session
- Add clearDeprecatedWarnings() for testing isolation
- Add comprehensive test suite (11 tests)
* refactor(ts-sdk): make sandbox config defaults configurable via env vars
Replace hardcoded default values in sandbox config with environment
variables, allowing users to customize defaults without code changes.
New environment variables:
- ROCK_DEFAULT_IMAGE, ROCK_DEFAULT_MEMORY, ROCK_DEFAULT_CPUS
- ROCK_DEFAULT_CLUSTER, ROCK_DEFAULT_AUTO_CLEAR_SECONDS
- ROCK_DEFAULT_GROUP_SIZE, ROCK_DEFAULT_START_CONCURRENCY
- ROCK_DEFAULT_START_RETRY_TIMES
- ROCK_DEFAULT_ARUN_TIMEOUT, ROCK_DEFAULT_NOHUP_WAIT_TIMEOUT
- ROCK_DEFAULT_NOHUP_WAIT_INTERVAL, ROCK_DEFAULT_STATUS_CHECK_INTERVAL
This addresses PR #492 review feedback about hardcoded defaults
scattered in sandbox/client.ts and config.ts.
* refactor(ts-sdk): consolidate types - remove sandbox/types.ts
- Move SpeedupType to network.ts where it's used
- Update all RunMode/RunModeType imports to use common/constants.ts
- Remove unused RuntimeEnvId and AgentType from types.ts
- Delete sandbox/types.ts and types.test.ts
This addresses the PR review feedback about duplicated type definitions.
* refactor(ts-sdk): remove misleading isBrowser() - SDK requires Node.js only
- Remove isBrowser() function that implied browser compatibility
- Simplify getEnv() and isEnvSet() to directly use process.env
- SDK fundamentally cannot run in browsers due to Node.js module imports
- Add tests to verify isBrowser is not exported
* fix(ts-sdk): initialize sandboxId in RockEnv and use HttpUtils
- Add static factory method RockEnv.create() for async initialization
- Call 'make' API to get sandboxId during initialization
- Replace raw AxiosInstance with HttpUtils for consistent camelCase/snake_case conversion
- Update make() factory function to be async
- Add comprehensive tests for RockEnv
* fix(ts-sdk): add error handling for JSON.parse and use async file I/O in ModelClient
- Add try/catch for JSON.parse in parseRequestLine and parseResponseLine
to handle corrupted log files gracefully with meaningful error messages
- Replace readFileSync/appendFileSync with async readFile/appendFile
from fs/promises to avoid blocking the event loop
- Re-throw parse errors immediately in popRequest instead of retrying
- Add tests for JSON parse error handling and async file I/O usage
Fixes PR #492 review feedback
* fix(ts-sdk): use shared https.Agent for connection pooling
- Add module-level sharedHttpsAgent with keepAlive enabled
- Prevents creating new https.Agent on every HTTP request
- Enables TLS session reuse and connection pooling
- Follows Python SDK pattern (_SHARED_SSL_CONTEXT)
Fixes PR #492 reviewer feedback about performance degradation under load.
* fix(ts-sdk): use async file I/O in Sandbox.uploadByPath
Replace sync fs.existsSync and fs.readFileSync with async
fs/promises.access and fs/promises.readFile to avoid blocking
the event loop when uploading large files.
* fix(ts-sdk): use lazy evaluation for Zod schema defaults
Zod .default(envVars.XXX) captured values at module load time.
If environment variables changed after module load, the schema still
used stale values. Now using factory functions .default(() => envVars.XXX)
to evaluate defaults at parse time, matching Python SDK's Field(default_factory=...) pattern.
* fix(ts-sdk): use public PyPI mirror as default for ROCK_PIP_INDEX_URL
Change the default pip index URL from China-specific mirror
(mirrors.aliyun.com) to the public PyPI mirror (pypi.org) for
better compatibility with international users of this open-source SDK.
* fix(ts-sdk): update repository URLs to alibaba/ROCK
* chore: bump version
* fix(ts-sdk): preserve HTTP response in errors for status code detection
Previously, HTTP errors were wrapped in generic Error objects, losing
the response property. This prevented callers from detecting specific
HTTP status codes (e.g., 401 Unauthorized, 403 Forbidden).
Now re-throws the original AxiosError, allowing callers to access
error.response.status for proper error handling, consistent with
Python SDK behavior.
* chore: bump version
* docs(ts-sdk): add CHANGELOG entry for v1.2.7
* feat(ts-sdk): add OSS file download and enhanced upload mode (v1.3.0)
- Add downloadFile() method for downloading files from sandbox via OSS
- Add uploadMode parameter to uploadByPath() ('auto' | 'direct' | 'oss')
- Add getOssStsCredentials() and isTokenExpired() for OSS STS management
- Add new types: DownloadFileResponse, OssCredentials, UploadMode
- Add ENSURE_OSSUTIL_SCRIPT constant for ossutil installation
- Update documentation for v1.3.0 release
* fix(ts-sdk): OSS client now uses HTTPS by default (v1.3.1)
The ali-oss client defaults to HTTP protocol, but OSS buckets typically
require HTTPS connections. This caused connection refused errors when
uploading files via OSS.
Added secure: true to OSS client initialization in setupOss() method.
* fix(ts-sdk): signatureUrl now uses correct object parameter (v1.3.2)
The ali-oss signatureUrl method requires an options object with 'expires'
property, not a raw number. Passing a number caused TypeError when the
library tried to access options.method.toUpperCase().
Changed: signatureUrl(objectName, 600) -> signatureUrl(objectName, { expires: 600 })
* fix(ts-sdk): wrap multi-line scripts with bash -c in nohup mode (v1.3.3)
Multi-line scripts passed to arun() with nohup mode caused errors because
nohup cannot handle multi-line content directly. The fix wraps multi-line
scripts with 'bash -c $'script\'' syntax.
Before: nohup <multi-line-script> # fails with 'missing operand'
After: nohup bash -c $'<multi-line-script>' # works correctly
* fix(ts-sdk): update ossutil commands for v2 compatibility (v1.3.4)
ossutil v2 removed the -b flag and requires --region flag. Credentials
should be passed via environment variables instead of command-line flags.
Changes:
- Use OSS_ACCESS_KEY_ID, OSS_ACCESS_KEY_SECRET, OSS_SESSION_TOKEN env vars
- Add --region flag (required in v2)
- Remove -b flag (deprecated in v2)
Fixes errors: 'unknown shorthand flag: b' and 'region must be set in sign version 4'
* fix(ts-sdk): use command-line parameters for ossutil cp (v1.3.5)
ossutil v2 does not require separate config command. Credentials should be
passed directly as command-line parameters to ossutil cp, matching Python SDK.
Before: ossutil config + ossutil cp with env vars (incorrect)
After: ossutil cp with --access-key-id, --access-key-secret, --sts-token,
--endpoint, --region parameters (correct)
Fixes 'region must be set in sign version 4' error
* fix(ts-sdk): use ROCK_OSS_BUCKET_ENDPOINT and normalize region format (v1.3.6)
Fixed incompatibility between setupOss (ali-oss) and downloadFile (ossutil)
by using ROCK_OSS_BUCKET_ENDPOINT env var and normalizing region format.
Changes:
- Use ROCK_OSS_BUCKET_ENDPOINT if available (format: oss-cn-hangzhou.aliyuncs.com)
- Normalize region by removing 'oss-' prefix (oss-cn-hangzhou -> cn-hangzhou)
- Matches Python SDK configuration format
* fix(ts-sdk): wrap ossutil command with bash -c for nohup PATH (v1.3.7)
When using arun() with nohup mode, the shell is /bin/sh which may not
have /usr/local/bin in PATH. Since ossutil is typically installed in
/usr/local/bin, this causes 'command not found' errors.
Solution: wrap ossutil commands with 'bash -c' to ensure correct PATH.
Matches Python SDK implementation using shlex.quote().
* feat(ts-sdk): add configurable OSS timeout support (v1.3.8)
Add ROCK_OSS_TIMEOUT environment variable with default 300000ms (5 minutes).
uploadByPath and downloadFile now accept optional timeout parameter.
Priority: function parameter > environment variable > default
Fixes ResponseTimeoutError for large file uploads/downloads.
* fix(ts-sdk): use multipartUpload for large files >= 1MB (v1.3.9)
Large file uploads were failing with 'socket connection was closed
unexpectedly' error. Now using multipartUpload for files >= 1MB,
matching Python SDK's oss2.resumable_upload behavior.
- Files < 1MB: use put()
- Files >= 1MB: use multipartUpload() with 1MB parts
* feat(ts-sdk): add progress callback support
Breaking change: uploadByPath and downloadFile now use options object pattern.
- Added ProgressInfo, UploadPhase, DownloadPhase types
- Added UploadOptions and DownloadOptions interfaces with onProgress callback
- OSS operations (multipartUpload, get) report progress percentage (0-100)
- Sandbox operations (wget, ossutil) report -1 (progress unavailable)
- Updated all tests to use new options object API signature
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Summary
Fixes #550 - Add timeout and cancellation support to
ModelClientpolling loops.Problem
ModelClient中的轮询循环 (while True) 缺少超时机制和取消支持,可能导致程序永久阻塞:pop_request方法无限等待请求wait_for_first_request方法无限等待日志文件asyncio.CancelledError取消信号Solution
结合方案一和方案二,实现完整的超时和取消支持:
超时机制
timeout参数,默认值 60 秒time.monotonic()计算超时时间TimeoutError取消支持
asyncio.CancelledErrorChanges
Core Implementation
rock/sdk/model/client.pyDEFAULT_POLL_TIMEOUT = 60.0constantpop_request(index, timeout=DEFAULT_POLL_TIMEOUT)- 添加超时参数wait_for_first_request(timeout=DEFAULT_POLL_TIMEOUT)- 添加超时参数asyncio.CancelledErrorTests
tests/unit/sdk/model/test_model_client.pytest_pop_request_raises_timeout_error_when_timeout_expirestest_wait_for_first_request_raises_timeout_error_when_timeout_expirestest_pop_request_propagates_cancelled_errortest_wait_for_first_request_propagates_cancelled_errorDocumentation
Examples
examples/model_client_demo.py- Comprehensive demo showcasing:Usage Example
Breaking Changes
None. The
timeoutparameter has a sensible default value, making the change backward compatible.Related