Make gpu_test binary optional; unblock Nix installs (#498) #499

Closed
deanq wants to merge 14 commits into main from deanq-fix-498-optional-gpu-test-binary

Conversation


@deanq deanq commented Apr 17, 2026

Summary

Removes the pre-compiled Linux x86_64 gpu_test ELF binary from the universal PyPI wheel so pip install runpod works cleanly on Nix and other strict/non-glibc platforms. Adds a runpod install-gpu-test CLI command and CI release-asset upload so Runpod GPU workers can opt back into the native CUDA memory-allocation test.

Fixes #498.

Changes

  • pyproject.toml / setup.py / MANIFEST.in: stop packaging gpu_test into the wheel. The binary is also removed from git at runpod/serverless/binaries/gpu_test (closes the root cause — a tracked binary combined with auto-include packaging rules was how it snuck into the wheel).
  • New runpod install-gpu-test CLI (runpod/cli/groups/install/) that downloads the binary from the GitHub release matching the installed runpod version, verifies sha256, and installs it atomically with temp-file cleanup on failure.
  • .github/workflows/CD-publish_to_pypi.yml: new upload-gpu-test-asset job compiles gpu_test via the existing build_tools/compile_gpu_test.sh (inside nvidia/cuda:11.8.0-devel-ubuntu22.04) and uploads it + sha256 to the release. Gated on the same release_created || force_publish condition as pypi-publish so the force-publish path still notifies worker repos. Optional Docker Hub login step to avoid anonymous rate limits when the secret is provisioned.
  • Docs updated (docs/serverless/gpu_binary_compilation.md, docs/serverless/worker_fitness_checks.md, runpod/serverless/binaries/README.md, CHANGELOG.md).
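For illustration, the download–verify–install flow described above can be sketched roughly as follows. This is a minimal sketch with hypothetical names; `install_gpu_test` here is not the PR's actual helper (that lives in runpod/cli/groups/install/functions.py), but it shows the sha256 check, the atomic os.replace, and the temp-file cleanup on failure:

```python
import hashlib
import os
import tempfile
import urllib.request


def install_gpu_test(url: str, expected_sha256: str, dest: str) -> None:
    """Download a binary, verify its sha256, and install it atomically."""
    data = urllib.request.urlopen(url).read()  # fetch the release asset
    digest = hashlib.sha256(data).hexdigest()
    if digest != expected_sha256.lower():
        raise ValueError(
            f"checksum mismatch: got {digest}, expected {expected_sha256}"
        )
    # Write to a temp file in the destination directory so os.replace
    # stays on the same filesystem and is therefore atomic.
    fd, tmp_path = tempfile.mkstemp(dir=os.path.dirname(dest))
    try:
        with os.fdopen(fd, "wb") as tmp:
            tmp.write(data)
        os.chmod(tmp_path, 0o750)  # owner rwx, group r-x, no world access
        os.replace(tmp_path, dest)
    except BaseException:
        # Clean up the temp file so failed installs don't leave orphans.
        if os.path.exists(tmp_path):
            os.unlink(tmp_path)
        raise
```

Because the temp file and the destination share a directory, a crash mid-download can never leave a half-written gpu_test at the final path.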

Runtime fallback to nvidia-smi was already in place (rp_gpu_fitness._run_gpu_test_fallback), so end users without the binary continue to get a GPU availability check — just not the memory-allocation test.

Runpod GPU worker migration

After this ships, Dockerfiles that rely on the native memory-allocation test need:

RUN pip install runpod && runpod install-gpu-test

Without this, those workers drop to the nvidia-smi fallback.

Known follow-ups (out of scope)

  • setup.py duplicates pyproject.toml and could be removed entirely (PEP 517 doesn't execute it).
  • Consider gating pypi-publish on a prior compile job so a Docker Hub outage can't produce a half-shipped release (wheel without matching asset).

Test plan

  • pytest tests/test_package/test_wheel_contents.py — wheel excludes gpu_test, wheel tag is py3-none-any
  • pytest tests/test_cli/test_install/ — 12 tests covering URL construction, happy path, checksum mismatch, HTTP/URL errors, temp-file cleanup on replace failure, --version override, --dest override, unknown-version exit
  • pytest tests/test_serverless/test_modules/test_fitness/ — 96 tests, no regressions
  • Full suite: 480/481 pass; the single failure (test_download_files_from_urls) is pre-existing on origin/main
  • unzip -l dist/runpod-*.whl — only README.md and __init__.py under runpod/serverless/binaries/
  • runpod install-gpu-test --help renders with --version / --dest options
  • CI dry-run required before first release: trigger workflow_dispatch with force_publish=true on a staging tag to verify the upload-gpu-test-asset job compiles and uploads correctly

Comment thread on runpod/cli/groups/install/functions.py — Fixed

promptless bot commented Apr 17, 2026

Promptless prepared a documentation update related to this change.

Triggered by runpod/runpod-python PR #499

Added documentation for the new runpod install-gpu-test CLI command, including installation instructions, Dockerfile examples for GPU workers, CLI options, and fallback behavior when the binary isn't installed.

Review: Serverless fitness checks

tmp_path = Path(tmp.name)

try:
    os.chmod(tmp_path, 0o750)

Copilot AI left a comment


Pull request overview

Removes the precompiled gpu_test ELF binary from the universal PyPI wheel to unblock installs on strict/non-glibc platforms (e.g., Nix), and introduces an opt-in installer path via a new runpod install-gpu-test CLI plus a release-asset upload in CD.

Changes:

  • Stop packaging runpod/serverless/binaries/gpu_test in wheels/sdists; keep only README.md.
  • Add runpod install-gpu-test CLI + download/sha256 verification logic and associated tests.
  • Extend CD workflow to compile gpu_test and upload it (and its .sha256) to the matching GitHub release.

Reviewed changes

Copilot reviewed 16 out of 18 changed files in this pull request and generated 6 comments.

Summary per file:

  • pyproject.toml — Removes gpu_test from setuptools package data to keep wheels universal.
  • setup.py — Removes gpu_test from package_data.
  • MANIFEST.in — Ensures gpu_test is excluded from the sdist manifest.
  • runpod/cli/groups/install/functions.py — Implements download + sha256 verification + atomic install.
  • runpod/cli/groups/install/commands.py — Adds the install-gpu-test Click command.
  • runpod/cli/entry.py — Registers the new CLI command.
  • .github/workflows/CD-publish_to_pypi.yml — Adds job to compile and upload gpu_test + checksum to the GitHub release.
  • tests/test_cli/test_install/test_install_gpu_test.py — Adds unit tests for installer URL construction, checksum validation, and error paths.
  • tests/test_package/test_wheel_contents.py — Adds regression test asserting the wheel is py3-none-any and excludes gpu_test.
  • docs/serverless/*.md, runpod/serverless/binaries/README.md, CHANGELOG.md — Documents the new opt-in installation flow and fallback behavior.
  • .gitignore — Ignores locally-built gpu_test artifacts.


Comment on lines +110 to +117
- name: Upload binary to release
  env:
    GH_TOKEN: ${{ secrets.GITHUB_TOKEN }}
  run: |
    gh release upload "${{ needs.release-please.outputs.tag_name }}" \
      runpod/serverless/binaries/gpu_test \
      runpod/serverless/binaries/gpu_test.sha256 \
      --clobber

Copilot AI Apr 17, 2026


gh release upload uses ${{ needs.release-please.outputs.tag_name }} even when running under force_publish. If no release/tag was created (or tag_name is empty), this step will fail and block the workflow. Either gate this job/step on release_created, or supply a deterministic tag to upload against in the force-publish path.
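One way to address this, sketched below, is to gate the upload step on the release actually existing; this is an illustrative fragment, not the PR's actual workflow, and the job/output names are taken from the quoted snippet:

```yaml
- name: Upload binary to release
  # Only run when release-please actually created a release/tag;
  # on force_publish runs without a tag, skip instead of failing.
  if: ${{ needs.release-please.outputs.release_created == 'true' }}
  env:
    GH_TOKEN: ${{ secrets.GITHUB_TOKEN }}
  run: |
    gh release upload "${{ needs.release-please.outputs.tag_name }}" \
      runpod/serverless/binaries/gpu_test \
      runpod/serverless/binaries/gpu_test.sha256 \
      --clobber
```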

Comment on lines +24 to +29
subprocess.run(
    [sys.executable, "-m", "build", "--wheel", "--outdir", str(out_dir)],
    cwd=REPO_ROOT,
    check=True,
    capture_output=True,
)

Copilot AI Apr 17, 2026


This test invokes python -m build, but the build package is not in the project's test dependency group (only in dev). In CI (uv sync --group test), this will raise ModuleNotFoundError: No module named 'build' and fail the suite. Add build to the test dependencies (or ensure the test environment installs the dev group as well).

Comment on lines +29 to +33
help=(
    "Download the optional gpu_test CUDA health-check binary from the "
    "GitHub release matching the installed runpod version. "
    "Runpod GPU workers only — no-op on CPU-only environments."
),

Copilot AI Apr 17, 2026


The command help text says this is a "no-op on CPU-only environments", but install_gpu_test_cli always attempts the download regardless of GPU availability. Either remove that claim from the help text or implement an explicit CPU-only short-circuit so behavior matches the CLI help.


    Accepts either '1.9.0' or 'v1.9.0' — the leading 'v' is optional.
    """
    clean = version.lstrip("v")

Copilot AI Apr 17, 2026


version.lstrip("v") strips all leading v characters (e.g., "vv1.2.3" becomes "1.2.3"), which is broader than intended. Prefer removing only a single leading "v" (e.g., removeprefix("v") or version[1:] if version.startswith("v") else version) to avoid surprising tag construction.

Suggested change
clean = version.lstrip("v")
clean = version[1:] if version.startswith("v") else version
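For illustration, the difference between the two approaches on a pathological tag (str.removeprefix requires Python 3.9+):

```python
version = "vv1.2.3"  # pathological tag with a doubled prefix

# lstrip treats its argument as a character set and strips every
# leading 'v', so "vv1.2.3" collapses to "1.2.3".
stripped_all = version.lstrip("v")

# removeprefix strips at most one exact prefix, leaving "v1.2.3".
stripped_one = version.removeprefix("v")

print(stripped_all)  # → 1.2.3
print(stripped_one)  # → v1.2.3
```

For well-formed tags like "v1.9.0" the two behave identically; they only diverge on malformed input, which is exactly where the surprise would bite.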

Comment on lines +64 to +72
def _parse_sha256(checksum_body: bytes) -> str:
    """Extract the hex digest from a 'sha256 filename' line."""
    text = checksum_body.decode("utf-8", errors="replace").strip()
    first_token = text.split()[0] if text else ""
    if len(first_token) != 64:
        raise BinaryDownloadError(
            f"checksum file did not contain a sha256 digest: {text!r}"
        )
    return first_token.lower()

Copilot AI Apr 17, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

_parse_sha256() only checks token length, not that it is valid hex. A 64-character non-hex token will incorrectly pass parsing and then fail later as BinaryChecksumMismatch, which misclassifies the problem. Validate with a hex regex / string.hexdigits check and raise BinaryDownloadError when the checksum file content is malformed.
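A sketch of the stricter validation the comment suggests; `is_sha256_hex` is a hypothetical helper name, not code from the PR:

```python
import string


def is_sha256_hex(token: str) -> bool:
    """Check that a token is exactly 64 hexadecimal characters."""
    return len(token) == 64 and all(c in string.hexdigits for c in token)


# A 64-char non-hex token passes a length-only check but fails here:
print(is_sha256_hex("a" * 64))  # → True
print(is_sha256_hex("z" * 64))  # → False
```

Rejecting non-hex tokens at parse time lets the installer raise BinaryDownloadError for a corrupt checksum file instead of misreporting it as a checksum mismatch.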

Comment on lines +85 to +89
- name: Checkout code
  uses: actions/checkout@v5
  with:
    ref: ${{ needs.release-please.outputs.tag_name }}


Copilot AI Apr 17, 2026


This job runs on the force_publish path too, but it checks out ref: ${{ needs.release-please.outputs.tag_name }}. When release_created is false, tag_name can be empty/undefined, causing checkout to fail. Consider defaulting to github.ref_name (or adding an explicit tag_name workflow input) when force-publishing.
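One possible fallback, per the comment's suggestion (an illustrative fragment, not the PR's actual workflow; GitHub Actions expressions return the first truthy operand, so an empty tag_name falls through to github.ref_name):

```yaml
- name: Checkout code
  uses: actions/checkout@v5
  with:
    # Prefer the release tag; fall back to the ref that triggered the
    # workflow when force-publishing without a release-please tag.
    ref: ${{ needs.release-please.outputs.tag_name || github.ref_name }}
```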

deanq added 14 commits April 17, 2026 01:06
The pre-compiled Linux x86_64 ELF binary at
runpod/serverless/binaries/gpu_test was bundled into the universal
py3-none-any wheel, breaking installs on Nix and other non-glibc
platforms. Runtime already falls back to nvidia-smi when the binary
is missing (see rp_gpu_fitness._run_gpu_test_fallback), so this is
a safe change for end users.

Runpod GPU workers that want the full memory-allocation test back
can use 'runpod install-gpu-test' (added in a follow-up commit).
MANIFEST.in's exclude directive is sufficient once stale build/
artifacts are cleared. Revert the belt-and-suspenders config
sections added in db0a02f per spec-review feedback (YAGNI).
Fetches gpu_test from the matching GitHub release asset, verifies
sha256, and writes it atomically into the destination path. Used
by the 'runpod install-gpu-test' CLI command (next commit) to
restore the native CUDA memory-allocation test on GPU workers.
…#498)

Address code-review feedback on 6983e74:
- Unlink the temp file if os.chmod/os.replace fails so repeated
  install attempts don't accumulate tmp* orphans in the binaries dir.
- BinaryChecksumMismatch now includes the release URL and downloaded
  byte count — a 64-char digest diff turns into an immediate 'oh,
  that's a Cloudflare error page' diagnosis.
- URLError reason now formatted with repr() for structured output.
- Add module docstring to runpod/cli/groups/install/__init__.py to
  match sibling groups.
- Two new tests cover the temp-file cleanup path and the enriched
  checksum error message.
Opt-in installer for the CUDA memory-allocation test binary that's
no longer bundled in the universal PyPI wheel. Runpod GPU worker
Dockerfiles should add 'RUN runpod install-gpu-test' after
'pip install runpod' to restore the native test.

Uses the download helper from the previous commits; this commit
just wraps it in a click CLI command with --version / --dest
overrides.
Address code-review feedback on dd9c114:
- Move inline CliRunner / install_gpu_test_cli imports to top of file
  per PEP 8.
- Add test for --dest override that asserts _default_install_path is
  not consulted when the user supplies a destination.
- Add test for the 'unknown' version exit path which surfaces the
  --version hint to the user.
Builds gpu_test inside nvidia/cuda:11.8.0-devel-ubuntu22.04 on each
release-please release (nvcc only needs the CUDA SDK, not GPU
hardware, so ubuntu-latest works) and uploads it plus its sha256
to the GitHub release.

The 'runpod install-gpu-test' CLI added in earlier commits downloads
from these assets, restoring the native CUDA memory-allocation test
on Runpod GPU workers that opt in via their Dockerfile.
…#498)

Address code-review feedback on a726d07:

- Critical: mirror pypi-publish's 'if:' on upload-gpu-test-asset so
  workflow_dispatch + force_publish=true still flows through to
  notify-runpod-workers. Previously the new job's narrower gate
  silently skipped worker notifications on force-publish runs.
- Add optional docker/login-action step to avoid Docker Hub
  anonymous rate limits on release days. Gated on vars.DOCKERHUB_USERNAME
  so it's a no-op until the secret is provisioned.
- Remove the compiled gpu_test binary from git. CI compiles and
  uploads it as a release asset per a726d07; tracking it in source
  control was the root cause that originally let the binary sneak
  into the 'universal' wheel (#498). .gitignore now guards against
  accidental re-add.
CodeQL flagged 0o755 as world-readable. The binary is installed
into a Python package directory and runs inside the runpod worker
process as a single user context — 'others' access isn't needed.
0o750 keeps owner rwx + group r-x (covers Docker usage where the
process user matches the installing user) while dropping world
access.
@deanq deanq force-pushed the deanq-fix-498-optional-gpu-test-binary branch from 6e0cdf4 to f5e5af1 Compare April 17, 2026 08:11
@deanq deanq closed this Apr 17, 2026
@deanq deanq deleted the deanq-fix-498-optional-gpu-test-binary branch April 17, 2026 16:21


Development

Successfully merging this pull request may close these issues.

Requesting built .whl files hosted on PyPi/etc for easier build on other distros
