Skip to content

Added timeout and cancellation support to ModelClient polling loop…#559

Open
Timandes wants to merge 3 commits intoalibaba:masterfrom
Timandes:fix/gh-issue-550
Open

Added timeout and cancellation support to ModelClient polling loop…#559
Timandes wants to merge 3 commits intoalibaba:masterfrom
Timandes:fix/gh-issue-550

Conversation

@Timandes
Copy link
Copy Markdown
Collaborator

@Timandes Timandes commented Mar 3, 2026

Summary

Fixes #550 - Add timeout and cancellation support to ModelClient polling loops.

Problem

ModelClient 中的轮询循环 (while True) 缺少超时机制和取消支持,可能导致程序永久阻塞:

  • pop_request 方法无限等待请求
  • wait_for_first_request 方法无限等待日志文件
  • 无法响应 asyncio.CancelledError 取消信号

Solution

结合方案一和方案二,实现完整的超时和取消支持:

  1. 超时机制

    • 添加 timeout 参数,默认值 60 秒
    • 使用 time.monotonic() 计算超时时间
    • 超时后抛出 TimeoutError
  2. 取消支持

    • 显式捕获 asyncio.CancelledError
    • 记录日志后重新抛出,遵循项目规范

Changes

Core Implementation

  • rock/sdk/model/client.py
    • Add DEFAULT_POLL_TIMEOUT = 60.0 constant
    • pop_request(index, timeout=DEFAULT_POLL_TIMEOUT) - 添加超时参数
    • wait_for_first_request(timeout=DEFAULT_POLL_TIMEOUT) - 添加超时参数
    • Both methods now properly handle asyncio.CancelledError

Tests

  • tests/unit/sdk/model/test_model_client.py
    • test_pop_request_raises_timeout_error_when_timeout_expires
    • test_wait_for_first_request_raises_timeout_error_when_timeout_expires
    • test_pop_request_propagates_cancelled_error
    • test_wait_for_first_request_propagates_cancelled_error

Documentation

  • Updated SDK documentation (English & Chinese)
  • Added new Model Service SDK section

Examples

  • examples/model_client_demo.py - Comprehensive demo showcasing:
    • Timeout configuration
    • Cancellation handling
    • Normal request retrieval

Usage Example

  from rock.sdk.model.client import ModelClient

  async def main():
      client = ModelClient()

      try:
          # Wait with 30-second timeout
          await client.wait_for_first_request(timeout=30.0)

          #  Pop request with default 60-second timeout
          request = await client.pop_request(index=1)
      except TimeoutError as e:
          print(f"Operation timed out: {e}")

  asyncio.run(main())

Breaking Changes

None. The timeout parameter has a sensible default value, making the change backward compatible.

Related

Timandes added a commit to Timandes/ROCK that referenced this pull request Mar 4, 2026
…ing loops

- Add DEFAULT_POLL_TIMEOUT constant (60 seconds)
- Add PollOptions interface for timeout and AbortSignal support
- Update popRequest() to accept timeout and cancellation via AbortSignal
- Update waitForFirstRequest() to accept timeout and cancellation
- Add comprehensive unit tests for timeout and cancellation scenarios

Fixes issue where while(true) loops could block forever without
timeout or cancellation mechanism. Aligns with Python SDK PR alibaba#559.
…aba#550)

Replace hardcoded DEFAULT_POLL_TIMEOUT constant with ROCK_MODEL_CLIENT_POLL_TIMEOUT
environment variable. Default is now None (infinite wait) which is appropriate for
waiting on Agent Actions.

Changes:
- Add ROCK_MODEL_CLIENT_POLL_TIMEOUT env var in rock/env_vars.py
- Remove DEFAULT_POLL_TIMEOUT = 60.0 constant from client.py
- pop_request/wait_for_first_request now read timeout from env var
- Add tests for environment variable functionality
…#550)

Per PR review feedback, use env_vars.ROCK_MODEL_CLIENT_POLL_TIMEOUT
directly as the default value for timeout parameter instead of
checking it inside the function body.
Timandes added a commit to Timandes/ROCK that referenced this pull request Mar 30, 2026
…ing loops

- Add DEFAULT_POLL_TIMEOUT constant (60 seconds)
- Add PollOptions interface for timeout and AbortSignal support
- Update popRequest() to accept timeout and cancellation via AbortSignal
- Update waitForFirstRequest() to accept timeout and cancellation
- Add comprehensive unit tests for timeout and cancellation scenarios

Fixes issue where while(true) loops could block forever without
timeout or cancellation mechanism. Aligns with Python SDK PR alibaba#559.
Timandes added a commit to Timandes/ROCK that referenced this pull request Mar 30, 2026
…ing loops

- Add DEFAULT_POLL_TIMEOUT constant (60 seconds)
- Add PollOptions interface for timeout and AbortSignal support
- Update popRequest() to accept timeout and cancellation via AbortSignal
- Update waitForFirstRequest() to accept timeout and cancellation
- Add comprehensive unit tests for timeout and cancellation scenarios

Fixes issue where while(true) loops could block forever without
timeout or cancellation mechanism. Aligns with Python SDK PR alibaba#559.
Timandes added a commit that referenced this pull request Mar 31, 2026
* feat: add ts-sdk

* chore: add related docs&examples

* fix(ts-sdk): use snake_case for response fields to match API

* fix: make basic-usage.ts works

* fix: make background-tasks.ts works

* docs: update doc files

* fix: make complete-workflow.ts works

* fix: make file-operations.ts works

* chore: add eslint

* feat(ts-sdk)!: adopt camelCase naming convention for public API

- Add ts-case-convert for automatic case conversion
- Implement HTTP layer auto-transform (request → snake_case, response → camelCase)
- Refactor all response types to use camelCase
- Update examples and documentation

BREAKING CHANGE: All response fields now use camelCase instead of snake_case
(e.g., sandbox_id → sandboxId, exit_code → exitCode)

* fix(ts-sdk): use snake_case for URL query parameters

- Fix getStatus URL to use sandbox_id instead of sandboxId
- Fix logger getLogLevel to use lowercase for level matching

* refactor(ts-sdk): rename methods to camelCase convention

- Rename write_file() to writeFile()
- Rename read_file() to readFile()
- Add ESLint naming-convention rule to enforce camelCase
- Update examples to use new method names

* feat(ts-sdk): include response headers in all response types

- Modify HttpUtils.get/post/postMultipart to return structured response
  with {status, result, headers}
- Add HeaderFieldsSchema with cluster, requestId, eagleeyeTraceid
- Update all Response types to include header-derived fields
- Extract headers in all Sandbox methods and merge into results

* chore(ts-sdk): release v1.2.3

- Fix HttpUtils generic type parameter usage in client methods
- Fix EnvHubClient to use response.result from new HttpUtils return type

* fix: add missing field 'error'

* Revert "fix: add missing field 'error'"

This reverts commit 2ec4dcf.

* Revert "chore(ts-sdk): release v1.2.3"

This reverts commit 829b7aa.

* Revert "feat(ts-sdk): include response headers in all response types"

This reverts commit 2368f86.

* feat(ts-sdk): include response headers and error info in HTTP responses

* chore(ts-sdk): release v1.2.4

* fix(ts-sdk): correct type parameters for HttpUtils methods

* feat(ts-sdk): add integration test setup with Jest projects mode

- Configure Jest projects to separate unit and integration tests
- Unit tests run from src/, integration tests from tests/integration/
- Add test:unit and test:integration npm scripts
- Add sandbox lifecycle integration test (start, isAlive, stop)

* feat(ts-sdk): implement uploadDir for directory upload to sandbox

- Implement uploadDir method in LinuxFileSystem following Python SDK pattern
- Add arun method to AbstractSandbox interface for command execution
- Support validation, local tar packing, upload, remote extraction, and cleanup
- Add integration tests for uploadDir functionality

* fix(ts-sdk): correct Content-Type handling for multipart/form-data requests

- Remove manual Content-Type header that prevented axios from adding boundary
- Set Content-Type to null in post config to allow axios auto-detection
- Add unit tests for postMultipart method

* fix(ts-sdk): add shims for __dirname ESM compatibility

__dirname is not available in ESM modules. Enable tsup shims option
to inject proper polyfill using fileURLToPath(import.meta.url).

* fix(ts-sdk): use unquoted heredoc for APT speedup command

Using single-quoted heredoc delimiter (<< 'EOF') prevented shell
variable expansion, causing $(lsb_release -cs) to be written literally
instead of being evaluated to the Ubuntu codename (e.g., 'jammy').

This fix ensures the command substitution is properly expanded when
configuring APT mirror sources.

* fix(ts-sdk): add missing /download/ path in Python V31212 install URL

* fix(ts-sdk): sync VERSION with package.json dynamically

VERSION constant was hardcoded to '1.2.1' while package.json declared
'1.2.4', causing version mismatch. Now reads version from package.json
at runtime to ensure consistency.

* feat(ts-sdk): add RuntimeEnv, ModelService and Agent modules

- Add RuntimeEnv module for Python/Node.js runtime management in sandbox
- Add ModelService module for model service lifecycle in sandbox
- Add Agent module for automated task orchestration
- Remove old model/service.ts (local Node.js server approach)
- Add example files for new modules
- Update README with new module documentation

This resolves the 'server directory not found' issue by using
Python SDK's model service inside sandbox via pip install.

* fix(ts-sdk): add timeout and cancellation support to ModelClient polling loops

- Add DEFAULT_POLL_TIMEOUT constant (60 seconds)
- Add PollOptions interface for timeout and AbortSignal support
- Update popRequest() to accept timeout and cancellation via AbortSignal
- Update waitForFirstRequest() to accept timeout and cancellation
- Add comprehensive unit tests for timeout and cancellation scenarios

Fixes issue where while(true) loops could block forever without
timeout or cancellation mechanism. Aligns with Python SDK PR #559.

* fix(ts-sdk): integrate typed exceptions (raiseForCode) into Sandbox client

The exception classes (RockException, BadRequestRockError, etc.) and
raiseForCode() were well-designed but never used. All errors were thrown
as generic new Error(...), losing error code information.

Now all Sandbox methods check response.result.code and call raiseForCode()
to throw appropriate typed exceptions based on error code ranges:
- 4xxx: BadRequestRockError
- 5xxx: InternalServerRockError
- 6xxx: CommandRockError

This matches the Python SDK's exception handling pattern.

* fix(ts-sdk): add Zod validation to API response parsing

Add runtime validation for all API responses using Zod schemas,
matching Python SDK's Pydantic validation behavior (Model(**result)).

- Import response schemas for runtime validation
- Add Schema.parse() calls in execute(), getStatus(), createSession(),
  closeSession(), runInSession(), readFile(), isAlive() methods
- Add 9 test cases verifying Zod validation catches invalid data types

This addresses the PR review feedback that Zod schemas were defined
but .parse() was never called, only used for z.infer<> type extraction.

* refactor(ts-sdk): remove duplicated code across modules

- RunModeType/RunMode: re-export from common/constants.ts in sandbox/types.ts
- extractNohupPid: re-export from utils/http.ts in sandbox/utils.ts
- sleep: import from utils/retry.ts in model/client.ts instead of private method

Addresses code duplication feedback from PR #492 review.

* fix(ts-sdk): change default backoff from 1.0 to 2.0 for exponential retry

Default backoff=1.0 resulted in fixed delay intervals (no exponential
backoff). Changed to backoff=2.0 so delays double on each retry,
which is the standard exponential backoff behavior.

Addresses PR #492 review feedback.

* fix(ts-sdk): add command injection protection across modules

Add comprehensive input validation and shell escaping to prevent
command injection attacks in network, file_system, and runtime_env
modules.

- Add shell.ts utility module with safe escaping functions
- Refactor network.ts to use script-based execution with validation
- Add path/username validation in file_system.ts for chown/chmod
- Add extensive test coverage for injection protection scenarios

Inspired by Python SDK shlex.quote() approach.

* chore: bump version

* fix(ts-sdk): remove fragile session detection in arun()

Replace String(e).includes('already exists') pattern with Python SDK
behavior:
- NORMAL mode: run command directly without pre-creating session
- NOHUP mode: only create session when session parameter is not provided

This eliminates dependency on server error message format, making the
SDK more robust against server-side message changes or localization.

Add unit tests for session creation behavior and update integration
tests to explicitly create sessions when needed.

* fix(ts-sdk): use winston.Container for logger cache management

Replace manual Map-based loggerCache with winston.Container to prevent
potential memory leaks in long-running processes. Add clearLoggerCache()
and getLoggerCacheSize() functions for cache lifecycle management.

* refactor(ts-sdk): remove deprecated Constants class (dead code)

The Constants class was deprecated with all empty string values for
BASE_URL_* properties and was never used anywhere in the codebase.
Tests now verify that Constants is not exported from the module.

* fix(ts-sdk): use winston logger with warn-once for deprecated decorator

- Replace console.warn with Winston logger for consistency
- Implement warn-once pattern using Set to track warned keys
- Each deprecated function/class only warns once per session
- Add clearDeprecatedWarnings() for testing isolation
- Add comprehensive test suite (11 tests)

* refactor(ts-sdk): make sandbox config defaults configurable via env vars

Replace hardcoded default values in sandbox config with environment
variables, allowing users to customize defaults without code changes.

New environment variables:
- ROCK_DEFAULT_IMAGE, ROCK_DEFAULT_MEMORY, ROCK_DEFAULT_CPUS
- ROCK_DEFAULT_CLUSTER, ROCK_DEFAULT_AUTO_CLEAR_SECONDS
- ROCK_DEFAULT_GROUP_SIZE, ROCK_DEFAULT_START_CONCURRENCY
- ROCK_DEFAULT_START_RETRY_TIMES
- ROCK_DEFAULT_ARUN_TIMEOUT, ROCK_DEFAULT_NOHUP_WAIT_TIMEOUT
- ROCK_DEFAULT_NOHUP_WAIT_INTERVAL, ROCK_DEFAULT_STATUS_CHECK_INTERVAL

This addresses PR #492 review feedback about hardcoded defaults
scattered in sandbox/client.ts and config.ts.

* refactor(ts-sdk): consolidate types - remove sandbox/types.ts

- Move SpeedupType to network.ts where it's used
- Update all RunMode/RunModeType imports to use common/constants.ts
- Remove unused RuntimeEnvId and AgentType from types.ts
- Delete sandbox/types.ts and types.test.ts

This addresses the PR review feedback about duplicated type definitions.

* refactor(ts-sdk): remove misleading isBrowser() - SDK requires Node.js only

- Remove isBrowser() function that implied browser compatibility
- Simplify getEnv() and isEnvSet() to directly use process.env
- SDK fundamentally cannot run in browsers due to Node.js module imports
- Add tests to verify isBrowser is not exported

* fix(ts-sdk): initialize sandboxId in RockEnv and use HttpUtils

- Add static factory method RockEnv.create() for async initialization
- Call 'make' API to get sandboxId during initialization
- Replace raw AxiosInstance with HttpUtils for consistent camelCase/snake_case conversion
- Update make() factory function to be async
- Add comprehensive tests for RockEnv

* fix(ts-sdk): add error handling for JSON.parse and use async file I/O in ModelClient

- Add try/catch for JSON.parse in parseRequestLine and parseResponseLine
  to handle corrupted log files gracefully with meaningful error messages
- Replace readFileSync/appendFileSync with async readFile/appendFile
  from fs/promises to avoid blocking the event loop
- Re-throw parse errors immediately in popRequest instead of retrying
- Add tests for JSON parse error handling and async file I/O usage

Fixes PR #492 review feedback

* fix(ts-sdk): use shared https.Agent for connection pooling

- Add module-level sharedHttpsAgent with keepAlive enabled
- Prevents creating new https.Agent on every HTTP request
- Enables TLS session reuse and connection pooling
- Follows Python SDK pattern (_SHARED_SSL_CONTEXT)

Fixes PR #492 reviewer feedback about performance degradation under load.

* fix(ts-sdk): use async file I/O in Sandbox.uploadByPath

Replace sync fs.existsSync and fs.readFileSync with async
fs/promises.access and fs/promises.readFile to avoid blocking
the event loop when uploading large files.

* fix(ts-sdk): use lazy evaluation for Zod schema defaults

Zod .default(envVars.XXX) captured values at module load time.
If environment variables changed after module load, the schema still
used stale values. Now using factory functions .default(() => envVars.XXX)
to evaluate defaults at parse time, matching Python SDK's Field(default_factory=...) pattern.

* fix(ts-sdk): use public PyPI mirror as default for ROCK_PIP_INDEX_URL

Change the default pip index URL from China-specific mirror
(mirrors.aliyun.com) to the public PyPI mirror (pypi.org) for
better compatibility with international users of this open-source SDK.

* fix(ts-sdk): update repository URLs to alibaba/ROCK

* chore: bump version

* fix(ts-sdk): preserve HTTP response in errors for status code detection

Previously, HTTP errors were wrapped in generic Error objects, losing
the response property. This prevented callers from detecting specific
HTTP status codes (e.g., 401 Unauthorized, 403 Forbidden).

Now re-throws the original AxiosError, allowing callers to access
error.response.status for proper error handling, consistent with
Python SDK behavior.

* chore: bump version

* docs(ts-sdk): add CHANGELOG entry for v1.2.7

* feat(ts-sdk): add OSS file download and enhanced upload mode (v1.3.0)

- Add downloadFile() method for downloading files from sandbox via OSS
- Add uploadMode parameter to uploadByPath() ('auto' | 'direct' | 'oss')
- Add getOssStsCredentials() and isTokenExpired() for OSS STS management
- Add new types: DownloadFileResponse, OssCredentials, UploadMode
- Add ENSURE_OSSUTIL_SCRIPT constant for ossutil installation
- Update documentation for v1.3.0 release

* fix(ts-sdk): OSS client now uses HTTPS by default (v1.3.1)

The ali-oss client defaults to HTTP protocol, but OSS buckets typically
require HTTPS connections. This caused connection refused errors when
uploading files via OSS.

Added secure: true to OSS client initialization in setupOss() method.

* fix(ts-sdk): signatureUrl now uses correct object parameter (v1.3.2)

The ali-oss signatureUrl method requires an options object with 'expires'
property, not a raw number. Passing a number caused TypeError when the
library tried to access options.method.toUpperCase().

Changed: signatureUrl(objectName, 600) -> signatureUrl(objectName, { expires: 600 })

* fix(ts-sdk): wrap multi-line scripts with bash -c in nohup mode (v1.3.3)

Multi-line scripts passed to arun() with nohup mode caused errors because
nohup cannot handle multi-line content directly. The fix wraps multi-line
scripts with 'bash -c $'script\'' syntax.

Before: nohup <multi-line-script> # fails with 'missing operand'
After:  nohup bash -c $'<multi-line-script>' # works correctly

* fix(ts-sdk): update ossutil commands for v2 compatibility (v1.3.4)

ossutil v2 removed the -b flag and requires --region flag. Credentials
should be passed via environment variables instead of command-line flags.

Changes:
- Use OSS_ACCESS_KEY_ID, OSS_ACCESS_KEY_SECRET, OSS_SESSION_TOKEN env vars
- Add --region flag (required in v2)
- Remove -b flag (deprecated in v2)

Fixes errors: 'unknown shorthand flag: b' and 'region must be set in sign version 4'

* fix(ts-sdk): use command-line parameters for ossutil cp (v1.3.5)

ossutil v2 does not require separate config command. Credentials should be
passed directly as command-line parameters to ossutil cp, matching Python SDK.

Before: ossutil config + ossutil cp with env vars (incorrect)
After:  ossutil cp with --access-key-id, --access-key-secret, --sts-token,
        --endpoint, --region parameters (correct)

Fixes 'region must be set in sign version 4' error

* fix(ts-sdk): use ROCK_OSS_BUCKET_ENDPOINT and normalize region format (v1.3.6)

Fixed incompatibility between setupOss (ali-oss) and downloadFile (ossutil)
by using ROCK_OSS_BUCKET_ENDPOINT env var and normalizing region format.

Changes:
- Use ROCK_OSS_BUCKET_ENDPOINT if available (format: oss-cn-hangzhou.aliyuncs.com)
- Normalize region by removing 'oss-' prefix (oss-cn-hangzhou -> cn-hangzhou)
- Matches Python SDK configuration format

* fix(ts-sdk): wrap ossutil command with bash -c for nohup PATH (v1.3.7)

When using arun() with nohup mode, the shell is /bin/sh which may not
have /usr/local/bin in PATH. Since ossutil is typically installed in
/usr/local/bin, this causes 'command not found' errors.

Solution: wrap ossutil commands with 'bash -c' to ensure correct PATH.
Matches Python SDK implementation using shlex.quote().

* feat(ts-sdk): add configurable OSS timeout support (v1.3.8)

Add ROCK_OSS_TIMEOUT environment variable with default 300000ms (5 minutes).
uploadByPath and downloadFile now accept optional timeout parameter.

Priority: function parameter > environment variable > default

Fixes ResponseTimeoutError for large file uploads/downloads.

* fix(ts-sdk): use multipartUpload for large files >= 1MB (v1.3.9)

Large file uploads were failing with 'socket connection was closed
unexpectedly' error. Now using multipartUpload for files >= 1MB,
matching Python SDK's oss2.resumable_upload behavior.

- Files < 1MB: use put()
- Files >= 1MB: use multipartUpload() with 1MB parts

* feat(ts-sdk): add progress callback support

Breaking change: uploadByPath and downloadFile now use options object pattern.

- Added ProgressInfo, UploadPhase, DownloadPhase types
- Added UploadOptions and DownloadOptions interfaces with onProgress callback
- OSS operations (multipartUpload, get) report progress percentage (0-100)
- Sandbox operations (wget, ossutil) report -1 (progress unavailable)
- Updated all tests to use new options object API signature
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

[Bug] Python SDK: ModelClient polling loops lack timeout and cancellation support

2 participants