Make gpu_test binary optional; unblock Nix installs (#498) #499

Closed
deanq wants to merge 14 commits into main from deanq-fix-498-optional-gpu-test-binary

Conversation


@deanq deanq commented Apr 17, 2026

Summary

Removes the pre-compiled Linux x86_64 gpu_test ELF binary from the universal PyPI wheel so pip install runpod works cleanly on Nix and other strict/non-glibc platforms. Adds a runpod install-gpu-test CLI command and CI release-asset upload so Runpod GPU workers can opt back into the native CUDA memory-allocation test.

Fixes #498.

Changes

  • pyproject.toml / setup.py / MANIFEST.in: stop packaging gpu_test into the wheel. The binary is also removed from git at runpod/serverless/binaries/gpu_test (closes the root cause — a tracked binary combined with auto-include packaging rules was how it snuck into the wheel).
  • New runpod install-gpu-test CLI (runpod/cli/groups/install/) that downloads the binary from the GitHub release matching the installed runpod version, verifies sha256, and installs it atomically with temp-file cleanup on failure.
  • .github/workflows/CD-publish_to_pypi.yml: new upload-gpu-test-asset job compiles gpu_test via the existing build_tools/compile_gpu_test.sh (inside nvidia/cuda:11.8.0-devel-ubuntu22.04) and uploads it + sha256 to the release. Gated on the same release_created || force_publish condition as pypi-publish so the force-publish path still notifies worker repos. Optional Docker Hub login step to avoid anonymous rate limits when the secret is provisioned.
  • Docs updated (docs/serverless/gpu_binary_compilation.md, docs/serverless/worker_fitness_checks.md, runpod/serverless/binaries/README.md, CHANGELOG.md).
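For illustration, the download–verify–install flow described above can be sketched roughly as follows. This is a minimal sketch with hypothetical names; `install_gpu_test` here is not the PR's actual helper (that lives in runpod/cli/groups/install/functions.py), but it shows the sha256 check, the atomic os.replace, and the temp-file cleanup on failure:

```python
import hashlib
import os
import tempfile
import urllib.request


def install_gpu_test(url: str, expected_sha256: str, dest: str) -> None:
    """Download a binary, verify its sha256, and install it atomically."""
    data = urllib.request.urlopen(url).read()  # fetch the release asset
    digest = hashlib.sha256(data).hexdigest()
    if digest != expected_sha256.lower():
        raise ValueError(
            f"checksum mismatch: got {digest}, expected {expected_sha256}"
        )
    # Write to a temp file in the destination directory so os.replace
    # stays on the same filesystem and is therefore atomic.
    fd, tmp_path = tempfile.mkstemp(dir=os.path.dirname(dest))
    try:
        with os.fdopen(fd, "wb") as tmp:
            tmp.write(data)
        os.chmod(tmp_path, 0o750)  # owner rwx, group r-x, no world access
        os.replace(tmp_path, dest)
    except BaseException:
        # Clean up the temp file so failed installs don't leave orphans.
        if os.path.exists(tmp_path):
            os.unlink(tmp_path)
        raise
```

Because the temp file and the destination share a directory, a crash mid-download can never leave a half-written gpu_test at the final path.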

Runtime fallback to nvidia-smi was already in place (rp_gpu_fitness._run_gpu_test_fallback), so end users without the binary continue to get a GPU availability check — just not the memory-allocation test.

Runpod GPU worker migration

After this ships, Dockerfiles that rely on the native memory-allocation test need:

RUN pip install runpod && runpod install-gpu-test

Without this, those workers drop to the nvidia-smi fallback.

Known follow-ups (out of scope)

  • setup.py duplicates pyproject.toml and could be removed entirely (PEP 517 doesn't execute it).
  • Consider gating pypi-publish on a prior compile job so a Docker Hub outage can't produce a half-shipped release (wheel without matching asset).

Test plan

  • pytest tests/test_package/test_wheel_contents.py — wheel excludes gpu_test, wheel tag is py3-none-any
  • pytest tests/test_cli/test_install/ — 12 tests covering URL construction, happy path, checksum mismatch, HTTP/URL errors, temp-file cleanup on replace failure, --version override, --dest override, unknown-version exit
  • pytest tests/test_serverless/test_modules/test_fitness/ — 96 tests, no regressions
  • Full suite: 480/481 pass; the single failure (test_download_files_from_urls) is pre-existing on origin/main
  • unzip -l dist/runpod-*.whl — only README.md and __init__.py under runpod/serverless/binaries/
  • runpod install-gpu-test --help renders with --version / --dest options
  • CI dry-run required before first release: trigger workflow_dispatch with force_publish=true on a staging tag to verify the upload-gpu-test-asset job compiles and uploads correctly

Comment thread on runpod/cli/groups/install/functions.py — Fixed

promptless bot commented Apr 17, 2026

Promptless prepared a documentation update related to this change.

Triggered by runpod/runpod-python PR #499

Added documentation for the new runpod install-gpu-test CLI command, including installation instructions, Dockerfile examples for GPU workers, CLI options, and fallback behavior when the binary isn't installed.

Review: Serverless fitness checks

tmp_path = Path(tmp.name)

try:
    os.chmod(tmp_path, 0o750)

Copilot AI left a comment


Pull request overview

Removes the precompiled gpu_test ELF binary from the universal PyPI wheel to unblock installs on strict/non-glibc platforms (e.g., Nix), and introduces an opt-in installer path via a new runpod install-gpu-test CLI plus a release-asset upload in CD.

Changes:

  • Stop packaging runpod/serverless/binaries/gpu_test in wheels/sdists; keep only README.md.
  • Add runpod install-gpu-test CLI + download/sha256 verification logic and associated tests.
  • Extend CD workflow to compile gpu_test and upload it (and its .sha256) to the matching GitHub release.

Reviewed changes

Copilot reviewed 16 out of 18 changed files in this pull request and generated 6 comments.

Summary per file:

  • pyproject.toml — Removes gpu_test from setuptools package data to keep wheels universal.
  • setup.py — Removes gpu_test from package_data.
  • MANIFEST.in — Ensures gpu_test is excluded from the sdist manifest.
  • runpod/cli/groups/install/functions.py — Implements download + sha256 verification + atomic install.
  • runpod/cli/groups/install/commands.py — Adds the install-gpu-test Click command.
  • runpod/cli/entry.py — Registers the new CLI command.
  • .github/workflows/CD-publish_to_pypi.yml — Adds job to compile and upload gpu_test + checksum to the GitHub release.
  • tests/test_cli/test_install/test_install_gpu_test.py — Adds unit tests for installer URL construction, checksum validation, and error paths.
  • tests/test_package/test_wheel_contents.py — Adds regression test asserting the wheel is py3-none-any and excludes gpu_test.
  • docs/serverless/*.md, runpod/serverless/binaries/README.md, CHANGELOG.md — Documents the new opt-in installation flow and fallback behavior.
  • .gitignore — Ignores locally-built gpu_test artifacts.


Comment on lines +110 to +117
- name: Upload binary to release
  env:
    GH_TOKEN: ${{ secrets.GITHUB_TOKEN }}
  run: |
    gh release upload "${{ needs.release-please.outputs.tag_name }}" \
      runpod/serverless/binaries/gpu_test \
      runpod/serverless/binaries/gpu_test.sha256 \
      --clobber

Copilot AI Apr 17, 2026


gh release upload uses ${{ needs.release-please.outputs.tag_name }} even when running under force_publish. If no release/tag was created (or tag_name is empty), this step will fail and block the workflow. Either gate this job/step on release_created, or supply a deterministic tag to upload against in the force-publish path.
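One way to address this, sketched below, is to gate the upload step on the release actually existing; this is an illustrative fragment, not the PR's actual workflow, and the job/output names are taken from the quoted snippet:

```yaml
- name: Upload binary to release
  # Only run when release-please actually created a release/tag;
  # on force_publish runs without a tag, skip instead of failing.
  if: ${{ needs.release-please.outputs.release_created == 'true' }}
  env:
    GH_TOKEN: ${{ secrets.GITHUB_TOKEN }}
  run: |
    gh release upload "${{ needs.release-please.outputs.tag_name }}" \
      runpod/serverless/binaries/gpu_test \
      runpod/serverless/binaries/gpu_test.sha256 \
      --clobber
```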

Comment on lines +24 to +29
subprocess.run(
    [sys.executable, "-m", "build", "--wheel", "--outdir", str(out_dir)],
    cwd=REPO_ROOT,
    check=True,
    capture_output=True,
)

Copilot AI Apr 17, 2026


This test invokes python -m build, but the build package is not in the project's test dependency group (only in dev). In CI (uv sync --group test), this will raise ModuleNotFoundError: No module named 'build' and fail the suite. Add build to the test dependencies (or ensure the test environment installs the dev group as well).

Comment on lines +29 to +33
help=(
    "Download the optional gpu_test CUDA health-check binary from the "
    "GitHub release matching the installed runpod version. "
    "Runpod GPU workers only — no-op on CPU-only environments."
),

Copilot AI Apr 17, 2026


The command help text says this is a "no-op on CPU-only environments", but install_gpu_test_cli always attempts the download regardless of GPU availability. Either remove that claim from the help text or implement an explicit CPU-only short-circuit so behavior matches the CLI help.


    Accepts either '1.9.0' or 'v1.9.0' — the leading 'v' is optional.
    """
    clean = version.lstrip("v")

Copilot AI Apr 17, 2026


version.lstrip("v") strips all leading v characters (e.g., "vv1.2.3" becomes "1.2.3"), which is broader than intended. Prefer removing only a single leading "v" (e.g., removeprefix("v") or version[1:] if version.startswith("v") else version) to avoid surprising tag construction.

Suggested change
clean = version.lstrip("v")
clean = version[1:] if version.startswith("v") else version
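For illustration, the difference between the two approaches on a pathological tag (str.removeprefix requires Python 3.9+):

```python
version = "vv1.2.3"  # pathological tag with a doubled prefix

# lstrip treats its argument as a character set and strips every
# leading 'v', so "vv1.2.3" collapses to "1.2.3".
stripped_all = version.lstrip("v")

# removeprefix strips at most one exact prefix, leaving "v1.2.3".
stripped_one = version.removeprefix("v")

print(stripped_all)  # → 1.2.3
print(stripped_one)  # → v1.2.3
```

For well-formed tags like "v1.9.0" the two behave identically; they only diverge on malformed input, which is exactly where the surprise would bite.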

Comment on lines +64 to +72
def _parse_sha256(checksum_body: bytes) -> str:
    """Extract the hex digest from a 'sha256 filename' line."""
    text = checksum_body.decode("utf-8", errors="replace").strip()
    first_token = text.split()[0] if text else ""
    if len(first_token) != 64:
        raise BinaryDownloadError(
            f"checksum file did not contain a sha256 digest: {text!r}"
        )
    return first_token.lower()

Copilot AI Apr 17, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

_parse_sha256() only checks token length, not that it is valid hex. A 64-character non-hex token will incorrectly pass parsing and then fail later as BinaryChecksumMismatch, which misclassifies the problem. Validate with a hex regex / string.hexdigits check and raise BinaryDownloadError when the checksum file content is malformed.
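A sketch of the stricter validation the comment suggests; `is_sha256_hex` is a hypothetical helper name, not code from the PR:

```python
import string


def is_sha256_hex(token: str) -> bool:
    """Check that a token is exactly 64 hexadecimal characters."""
    return len(token) == 64 and all(c in string.hexdigits for c in token)


# A 64-char non-hex token passes a length-only check but fails here:
print(is_sha256_hex("a" * 64))  # → True
print(is_sha256_hex("z" * 64))  # → False
```

Rejecting non-hex tokens at parse time lets the installer raise BinaryDownloadError for a corrupt checksum file instead of misreporting it as a checksum mismatch.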

Comment on lines +85 to +89
- name: Checkout code
  uses: actions/checkout@v5
  with:
    ref: ${{ needs.release-please.outputs.tag_name }}


Copilot AI Apr 17, 2026


This job runs on the force_publish path too, but it checks out ref: ${{ needs.release-please.outputs.tag_name }}. When release_created is false, tag_name can be empty/undefined, causing checkout to fail. Consider defaulting to github.ref_name (or adding an explicit tag_name workflow input) when force-publishing.
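One possible fallback, per the comment's suggestion (an illustrative fragment, not the PR's actual workflow; GitHub Actions expressions return the first truthy operand, so an empty tag_name falls through to github.ref_name):

```yaml
- name: Checkout code
  uses: actions/checkout@v5
  with:
    # Prefer the release tag; fall back to the ref that triggered the
    # workflow when force-publishing without a release-please tag.
    ref: ${{ needs.release-please.outputs.tag_name || github.ref_name }}
```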

deanq added 14 commits April 17, 2026 01:06
The pre-compiled Linux x86_64 ELF binary at
runpod/serverless/binaries/gpu_test was bundled into the universal
py3-none-any wheel, breaking installs on Nix and other non-glibc
platforms. Runtime already falls back to nvidia-smi when the binary
is missing (see rp_gpu_fitness._run_gpu_test_fallback), so this is
a safe change for end users.

Runpod GPU workers that want the full memory-allocation test back
can use 'runpod install-gpu-test' (added in a follow-up commit).
MANIFEST.in's exclude directive is sufficient once stale build/
artifacts are cleared. Revert the belt-and-suspenders config
sections added in db0a02f per spec-review feedback (YAGNI).
Fetches gpu_test from the matching GitHub release asset, verifies
sha256, and writes it atomically into the destination path. Used
by the 'runpod install-gpu-test' CLI command (next commit) to
restore the native CUDA memory-allocation test on GPU workers.
…#498)

Address code-review feedback on 6983e74:
- Unlink the temp file if os.chmod/os.replace fails so repeated
  install attempts don't accumulate tmp* orphans in the binaries dir.
- BinaryChecksumMismatch now includes the release URL and downloaded
  byte count — a 64-char digest diff turns into an immediate 'oh,
  that's a Cloudflare error page' diagnosis.
- URLError reason now formatted with repr() for structured output.
- Add module docstring to runpod/cli/groups/install/__init__.py to
  match sibling groups.
- Two new tests cover the temp-file cleanup path and the enriched
  checksum error message.
Opt-in installer for the CUDA memory-allocation test binary that's
no longer bundled in the universal PyPI wheel. Runpod GPU worker
Dockerfiles should add 'RUN runpod install-gpu-test' after
'pip install runpod' to restore the native test.

Uses the download helper from the previous commits; this commit
just wraps it in a click CLI command with --version / --dest
overrides.
Address code-review feedback on dd9c114:
- Move inline CliRunner / install_gpu_test_cli imports to top of file
  per PEP 8.
- Add test for --dest override that asserts _default_install_path is
  not consulted when the user supplies a destination.
- Add test for the 'unknown' version exit path which surfaces the
  --version hint to the user.
Builds gpu_test inside nvidia/cuda:11.8.0-devel-ubuntu22.04 on each
release-please release (nvcc only needs the CUDA SDK, not GPU
hardware, so ubuntu-latest works) and uploads it plus its sha256
to the GitHub release.

The 'runpod install-gpu-test' CLI added in earlier commits downloads
from these assets, restoring the native CUDA memory-allocation test
on Runpod GPU workers that opt in via their Dockerfile.
…#498)

Address code-review feedback on a726d07:

- Critical: mirror pypi-publish's 'if:' on upload-gpu-test-asset so
  workflow_dispatch + force_publish=true still flows through to
  notify-runpod-workers. Previously the new job's narrower gate
  silently skipped worker notifications on force-publish runs.
- Add optional docker/login-action step to avoid Docker Hub
  anonymous rate limits on release days. Gated on vars.DOCKERHUB_USERNAME
  so it's a no-op until the secret is provisioned.
- Remove the compiled gpu_test binary from git. CI compiles and
  uploads it as a release asset per a726d07; tracking it in source
  control was the root cause that originally let the binary sneak
  into the 'universal' wheel (#498). .gitignore now guards against
  accidental re-add.
CodeQL flagged 0o755 as world-readable. The binary is installed
into a Python package directory and runs inside the runpod worker
process as a single user context — 'others' access isn't needed.
0o750 keeps owner rwx + group r-x (covers Docker usage where the
process user matches the installing user) while dropping world
access.
@deanq deanq force-pushed the deanq-fix-498-optional-gpu-test-binary branch from 6e0cdf4 to f5e5af1 Compare April 17, 2026 08:11
@deanq deanq closed this Apr 17, 2026
@deanq deanq deleted the deanq-fix-498-optional-gpu-test-binary branch April 17, 2026 16:21


Development

Successfully merging this pull request may close these issues.

Requesting built .whl files hosted on PyPi/etc for easier build on other distros
