Make gpu_test binary optional; unblock Nix installs (#498) #499
Conversation
Promptless prepared a documentation update related to this change (triggered by runpod/runpod-python PR #499).
```python
tmp_path = Path(tmp.name)

try:
    os.chmod(tmp_path, 0o750)
```
Pull request overview
Removes the precompiled gpu_test ELF binary from the universal PyPI wheel to unblock installs on strict/non-glibc platforms (e.g., Nix), and introduces an opt-in installer path via a new runpod install-gpu-test CLI plus a release-asset upload in CD.
Changes:
- Stop packaging `runpod/serverless/binaries/gpu_test` in wheels/sdists; keep only `README.md`.
- Add `runpod install-gpu-test` CLI + download/sha256 verification logic and associated tests.
- Extend CD workflow to compile `gpu_test` and upload it (and its `.sha256`) to the matching GitHub release.
Reviewed changes
Copilot reviewed 16 out of 18 changed files in this pull request and generated 6 comments.
| File | Description |
|---|---|
| `pyproject.toml` | Removes `gpu_test` from setuptools package data to keep wheels universal. |
| `setup.py` | Removes `gpu_test` from `package_data`. |
| `MANIFEST.in` | Ensures `gpu_test` is excluded from the sdist manifest. |
| `runpod/cli/groups/install/functions.py` | Implements download + sha256 verification + atomic install. |
| `runpod/cli/groups/install/commands.py` | Adds the `install-gpu-test` Click command. |
| `runpod/cli/entry.py` | Registers the new CLI command. |
| `.github/workflows/CD-publish_to_pypi.yml` | Adds job to compile and upload `gpu_test` + checksum to the GitHub release. |
| `tests/test_cli/test_install/test_install_gpu_test.py` | Adds unit tests for installer URL construction, checksum validation, and error paths. |
| `tests/test_package/test_wheel_contents.py` | Adds regression test asserting the wheel is `py3-none-any` and excludes `gpu_test`. |
| `docs/serverless/*.md`, `runpod/serverless/binaries/README.md`, `CHANGELOG.md` | Documents the new opt-in installation flow and fallback behavior. |
| `.gitignore` | Ignores locally-built `gpu_test` artifacts. |
```yaml
- name: Upload binary to release
  env:
    GH_TOKEN: ${{ secrets.GITHUB_TOKEN }}
  run: |
    gh release upload "${{ needs.release-please.outputs.tag_name }}" \
      runpod/serverless/binaries/gpu_test \
      runpod/serverless/binaries/gpu_test.sha256 \
      --clobber
```
`gh release upload` uses `${{ needs.release-please.outputs.tag_name }}` even when running under `force_publish`. If no release/tag was created (or `tag_name` is empty), this step will fail and block the workflow. Either gate this job/step on `release_created`, or supply a deterministic tag to upload against in the force-publish path.
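One way to address both halves of this comment is a sketch like the following. Names outside the diff shown here (the `force_publish` input, the `release_created` output) are assumed from the PR description, not confirmed from the workflow file:

```yaml
# Sketch: run only when a release exists or force-publish was requested,
# and fall back to the dispatched ref when release-please produced no tag.
- name: Upload binary to release
  if: needs.release-please.outputs.release_created == 'true' || inputs.force_publish
  env:
    GH_TOKEN: ${{ secrets.GITHUB_TOKEN }}
  run: |
    gh release upload "${{ needs.release-please.outputs.tag_name || github.ref_name }}" \
      runpod/serverless/binaries/gpu_test \
      runpod/serverless/binaries/gpu_test.sha256 \
      --clobber
```

The `||` operator in GitHub Actions expressions returns the right operand when the left is empty, which makes it a convenient default for possibly-unset job outputs.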
```python
subprocess.run(
    [sys.executable, "-m", "build", "--wheel", "--outdir", str(out_dir)],
    cwd=REPO_ROOT,
    check=True,
    capture_output=True,
)
```
This test invokes `python -m build`, but the `build` package is not in the project's test dependency group (only in dev). In CI (`uv sync --group test`), this will raise `ModuleNotFoundError: No module named 'build'` and fail the suite. Add `build` to the test dependencies (or ensure the test environment installs the dev group as well).
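Assuming the project declares PEP 735-style dependency groups in `pyproject.toml` (the exact group layout is an assumption, not taken from the diff), the fix would look roughly like:

```toml
[dependency-groups]
test = [
    "pytest",
    "build",  # required by tests/test_package/test_wheel_contents.py
]
```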
```python
help=(
    "Download the optional gpu_test CUDA health-check binary from the "
    "GitHub release matching the installed runpod version. "
    "Runpod GPU workers only — no-op on CPU-only environments."
),
```
The command help text says this is a "no-op on CPU-only environments", but `install_gpu_test_cli` always attempts the download regardless of GPU availability. Either remove that claim from the help text or implement an explicit CPU-only short-circuit so behavior matches the CLI help.
```python
    Accepts either '1.9.0' or 'v1.9.0' — the leading 'v' is optional.
    """
    clean = version.lstrip("v")
```
`version.lstrip("v")` strips all leading `v` characters (e.g., `"vv1.2.3"` becomes `"1.2.3"`), which is broader than intended. Prefer removing only a single leading `"v"` (e.g., `removeprefix("v")` or `version[1:] if version.startswith("v") else version`) to avoid surprising tag construction.
```diff
- clean = version.lstrip("v")
+ clean = version[1:] if version.startswith("v") else version
```
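The difference is easy to demonstrate. A minimal sketch, using a hypothetical `strip_v` helper for the single-prefix variant:

```python
def strip_v(version: str) -> str:
    # Remove at most one leading "v" (str.removeprefix needs Python 3.9+).
    return version[1:] if version.startswith("v") else version


# lstrip("v") strips *every* leading "v", which is broader than intended:
assert "vv1.2.3".lstrip("v") == "1.2.3"

# The single-prefix variant leaves the second "v" alone:
assert strip_v("vv1.2.3") == "v1.2.3"
assert strip_v("v1.9.0") == "1.9.0"
assert strip_v("1.9.0") == "1.9.0"
```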
```python
def _parse_sha256(checksum_body: bytes) -> str:
    """Extract the hex digest from a 'sha256 filename' line."""
    text = checksum_body.decode("utf-8", errors="replace").strip()
    first_token = text.split()[0] if text else ""
    if len(first_token) != 64:
        raise BinaryDownloadError(
            f"checksum file did not contain a sha256 digest: {text!r}"
        )
    return first_token.lower()
```
`_parse_sha256()` only checks token length, not that it is valid hex. A 64-character non-hex token will incorrectly pass parsing and then fail later as `BinaryChecksumMismatch`, which misclassifies the problem. Validate with a hex regex / `string.hexdigits` check and raise `BinaryDownloadError` when the checksum file content is malformed.
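A sketch of the suggested validation, with a stand-in exception class since the SDK's `BinaryDownloadError` is not reproduced here:

```python
import string


class BinaryDownloadError(Exception):
    """Stand-in for the SDK's exception class."""


def parse_sha256(checksum_body: bytes) -> str:
    # Validate length *and* hex content so a 64-char HTML fragment is
    # reported as a malformed checksum file, not a checksum mismatch.
    text = checksum_body.decode("utf-8", errors="replace").strip()
    token = text.split()[0] if text else ""
    if len(token) != 64 or not all(c in string.hexdigits for c in token):
        raise BinaryDownloadError(
            f"checksum file did not contain a sha256 digest: {text!r}"
        )
    return token.lower()
```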
```yaml
- name: Checkout code
  uses: actions/checkout@v5
  with:
    ref: ${{ needs.release-please.outputs.tag_name }}
```
This job runs on the `force_publish` path too, but it checks out `ref: ${{ needs.release-please.outputs.tag_name }}`. When `release_created` is false, `tag_name` can be empty/undefined, causing checkout to fail. Consider defaulting to `github.ref_name` (or adding an explicit `tag_name` workflow input) when force-publishing.
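A sketch of the `github.ref_name` default, relying on the `||` fallback in GitHub Actions expressions (whether an explicit workflow input would fit better depends on the rest of the workflow, which is not shown here):

```yaml
- name: Checkout code
  uses: actions/checkout@v5
  with:
    # Fall back to the dispatched ref when release-please produced no tag.
    ref: ${{ needs.release-please.outputs.tag_name || github.ref_name }}
```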
The pre-compiled Linux x86_64 ELF binary at `runpod/serverless/binaries/gpu_test` was bundled into the universal `py3-none-any` wheel, breaking installs on Nix and other non-glibc platforms. Runtime already falls back to `nvidia-smi` when the binary is missing (see `rp_gpu_fitness._run_gpu_test_fallback`), so this is a safe change for end users. Runpod GPU workers that want the full memory-allocation test back can use `runpod install-gpu-test` (added in a follow-up commit).
MANIFEST.in's exclude directive is sufficient once stale build/ artifacts are cleared. Revert the belt-and-suspenders config sections added in db0a02f per spec-review feedback (YAGNI).
Fetches gpu_test from the matching GitHub release asset, verifies sha256, and writes it atomically into the destination path. Used by the 'runpod install-gpu-test' CLI command (next commit) to restore the native CUDA memory-allocation test on GPU workers.
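The verify-then-install-atomically behavior described above can be condensed into a sketch. This is a hypothetical distillation, not the PR's actual helper (which also handles the download and raises SDK-specific exceptions rather than `ValueError`):

```python
import hashlib
import os
import tempfile
from pathlib import Path


def install_verified(data: bytes, expected_sha256: str, dest: Path) -> None:
    # Verify the digest before touching the destination.
    digest = hashlib.sha256(data).hexdigest()
    if digest != expected_sha256.lower():
        raise ValueError(f"sha256 mismatch: got {digest}")

    # Write to a temp file in the destination directory, then os.replace()
    # it in: rename is atomic on POSIX within one filesystem, so a crash
    # never leaves a half-written gpu_test binary.
    dest.parent.mkdir(parents=True, exist_ok=True)
    fd, tmp_name = tempfile.mkstemp(dir=dest.parent)
    try:
        with os.fdopen(fd, "wb") as tmp:
            tmp.write(data)
        os.chmod(tmp_name, 0o750)
        os.replace(tmp_name, dest)
    except OSError:
        # Clean up the orphaned temp file on chmod/replace failure.
        if os.path.exists(tmp_name):
            os.unlink(tmp_name)
        raise
```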
…#498) Address code-review feedback on 6983e74:
- Unlink the temp file if `os.chmod`/`os.replace` fails so repeated install attempts don't accumulate `tmp*` orphans in the binaries dir.
- `BinaryChecksumMismatch` now includes the release URL and downloaded byte count — a 64-char digest diff turns into an immediate "oh, that's a Cloudflare error page" diagnosis.
- `URLError` reason now formatted with `repr()` for structured output.
- Add module docstring to `runpod/cli/groups/install/__init__.py` to match sibling groups.
- Two new tests cover the temp-file cleanup path and the enriched checksum error message.
Opt-in installer for the CUDA memory-allocation test binary that's no longer bundled in the universal PyPI wheel. Runpod GPU worker Dockerfiles should add 'RUN runpod install-gpu-test' after 'pip install runpod' to restore the native test. Uses the download helper from the previous commits; this commit just wraps it in a click CLI command with --version / --dest overrides.
Address code-review feedback on dd9c114:
- Move inline `CliRunner` / `install_gpu_test_cli` imports to top of file per PEP 8.
- Add test for `--dest` override that asserts `_default_install_path` is not consulted when the user supplies a destination.
- Add test for the 'unknown' version exit path which surfaces the `--version` hint to the user.
Builds gpu_test inside nvidia/cuda:11.8.0-devel-ubuntu22.04 on each release-please release (nvcc only needs the CUDA SDK, not GPU hardware, so ubuntu-latest works) and uploads it plus its sha256 to the GitHub release. The 'runpod install-gpu-test' CLI added in earlier commits downloads from these assets, restoring the native CUDA memory-allocation test on Runpod GPU workers that opt in via their Dockerfile.
…#498) Address code-review feedback on a726d07:
- Critical: mirror `pypi-publish`'s `if:` on `upload-gpu-test-asset` so `workflow_dispatch` + `force_publish=true` still flows through to `notify-runpod-workers`. Previously the new job's narrower gate silently skipped worker notifications on force-publish runs.
- Add optional `docker/login-action` step to avoid Docker Hub anonymous rate limits on release days. Gated on `vars.DOCKERHUB_USERNAME` so it's a no-op until the secret is provisioned.
- Remove the compiled `gpu_test` binary from git. CI compiles and uploads it as a release asset per a726d07; tracking it in source control was the root cause that originally let the binary sneak into the 'universal' wheel (#498). `.gitignore` now guards against accidental re-add.
CodeQL flagged 0o755 as world-readable. The binary is installed into a Python package directory and runs inside the runpod worker process as a single user context — 'others' access isn't needed. 0o750 keeps owner rwx + group r-x (covers Docker usage where the process user matches the installing user) while dropping world access.
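The permission semantics being argued for can be checked directly with the standard `stat` module; a small illustration (not part of the PR's code):

```python
import stat

MODE = 0o750

# Owner: read/write/execute; group: read/execute; others: no access.
assert MODE & stat.S_IRWXU == stat.S_IRWXU
assert MODE & stat.S_IRWXG == stat.S_IRGRP | stat.S_IXGRP
assert MODE & stat.S_IRWXO == 0

# stat.filemode() renders the bits the way `ls -l` would:
assert stat.filemode(stat.S_IFREG | MODE) == "-rwxr-x---"
```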
Summary
Removes the pre-compiled Linux x86_64 `gpu_test` ELF binary from the universal PyPI wheel so `pip install runpod` works cleanly on Nix and other strict/non-glibc platforms. Adds a `runpod install-gpu-test` CLI command and CI release-asset upload so Runpod GPU workers can opt back into the native CUDA memory-allocation test.

Fixes #498.
Changes
- `pyproject.toml` / `setup.py` / `MANIFEST.in`: stop packaging `gpu_test` into the wheel. Binary is also removed from git at `runpod/serverless/binaries/gpu_test` (closes the root cause — tracked binary + auto-include packaging rules was how it snuck in).
- `runpod install-gpu-test` CLI (`runpod/cli/groups/install/`) that downloads the binary from the GitHub release matching the installed runpod version, verifies sha256, and installs it atomically with temp-file cleanup on failure.
- `.github/workflows/CD-publish_to_pypi.yml`: new `upload-gpu-test-asset` job compiles `gpu_test` via the existing `build_tools/compile_gpu_test.sh` (inside `nvidia/cuda:11.8.0-devel-ubuntu22.04`) and uploads it + sha256 to the release. Gated on the same `release_created || force_publish` condition as `pypi-publish` so the force-publish path still notifies worker repos. Optional Docker Hub login step to avoid anonymous rate limits when the secret is provisioned.
- Docs updated (`docs/serverless/gpu_binary_compilation.md`, `docs/serverless/worker_fitness_checks.md`, `runpod/serverless/binaries/README.md`, `CHANGELOG.md`).
Runtime fallback to `nvidia-smi` was already in place (`rp_gpu_fitness._run_gpu_test_fallback`), so end users without the binary continue to get a GPU availability check — just not the memory-allocation test.

Runpod GPU worker migration
After this ships, Dockerfiles that rely on the native memory-allocation test need:
```dockerfile
RUN pip install runpod && runpod install-gpu-test
```

Without this, those workers drop to the `nvidia-smi` fallback.

Known follow-ups (out of scope)
- `setup.py` duplicates `pyproject.toml` and could be removed entirely (PEP 517 doesn't execute it).
- Gate `pypi-publish` on a prior compile job so a Docker Hub outage can't produce a half-shipped release (wheel without matching asset).

Test plan
- `pytest tests/test_package/test_wheel_contents.py` — wheel excludes `gpu_test`, wheel tag is `py3-none-any`
- `pytest tests/test_cli/test_install/` — 12 tests covering URL construction, happy path, checksum mismatch, HTTP/URL errors, temp-file cleanup on replace failure, `--version` override, `--dest` override, unknown-version exit
- `pytest tests/test_serverless/test_modules/test_fitness/` — 96 tests, no regressions
- A failure in `test_download_files_from_urls` is pre-existing on `origin/main`
- `unzip -l dist/runpod-*.whl` — only `README.md` and `__init__.py` under `runpod/serverless/binaries/`
- `runpod install-gpu-test --help` renders with `--version` / `--dest` options
- `workflow_dispatch` with `force_publish=true` on a staging tag to verify the `upload-gpu-test-asset` job compiles and uploads correctly