Refactor: Improve Proxy Handling and Secure Boot in GPU Install Script by cjac · Pull Request #1374 · GoogleCloudDataproc/initialization-actions

cjac · 2026-01-23T22:43:49Z

GPU Initialization Action Enhancements for Secure Boot, Proxy, and Reliability

This large update significantly improves the install_gpu_driver.sh script and its accompanying documentation, focusing on robust support for complex environments involving Secure Boot and HTTP/S proxies, and increasing overall reliability and maintainability.

1. gpu/README.md:

Comprehensive Documentation for Secure Boot & Proxy:
- Added a major section: "Building Custom Images with Secure Boot and Proxy Support". This details the end-to-end process using the GoogleCloudDataproc/custom-images repository to create Dataproc images with NVIDIA drivers signed for Secure Boot. It covers environment setup, key management in GCP Secret Manager, Docker builder image creation, and running the image generation process.
- Added a major section: "Launching a Cluster with the Secure Boot Custom Image". This explains how to use the custom-built images to launch Dataproc clusters with --shielded-secure-boot. It includes instructions for private network setups using Google Cloud Secure Web Proxy, leveraging scripts from the GoogleCloudDataproc/cloud-dataproc repository for VPC, subnet, and proxy configuration.
- Includes essential verification steps for checking driver status, module signatures, and system logs on the cluster nodes.
Enhanced Proxy Metadata: Clarified and expanded descriptions for proxy-related metadata: http-proxy, https-proxy, proxy-uri, no-proxy, and http-proxy-pem-uri.
New Section: "Enhanced Proxy Support": Explicitly outlines the script's capabilities in proxied environments, including custom CA certificate handling, automatic tool configuration (curl, apt, dnf, gpg, Java), and bypass mechanisms.
Troubleshooting: Added specific points for debugging network and proxy issues.

2. gpu/install_gpu_driver.sh:

Robust Proxy Handling (set_proxy):
- Completely revamped to handle http-proxy, https-proxy, and proxy-uri metadata, determining the correct proxy values for HTTP and HTTPS.
- Dynamically sets HTTP_PROXY, HTTPS_PROXY, and NO_PROXY environment variables.
- Updates /etc/environment with the current proxy settings.
- Conditionally configures gcloud proxy settings only if the gcloud SDK version is 547.0.0 or greater.
- Performs TCP and HTTP(S) connection tests to the proxy to validate setup.
- Configures apt and dnf to use the proxy.
- Ensures dirmngr or gnupg2-smime is installed and configures dirmngr.conf to use the HTTP proxy.
- Installs custom proxy CA certificates from http-proxy-pem-uri into system, Java, and Conda trust stores. Switches to HTTPS for proxy communications when a CA cert is provided.
- Includes comprehensive verification steps for the proxy and certificate setup.
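The `/etc/environment` update listed above can be made idempotent, so that re-running the script does not pile up duplicate assignments. The following is a minimal sketch of one way to do that, not the PR's actual code: the helper name `set_env_file_var` and the example proxy address are assumptions made for illustration.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not taken from the PR): set NAME=value in an
# environment-style file, replacing any earlier assignment of NAME.
set_env_file_var() {
  local file="$1" name="$2" value="$3"
  local tmp
  tmp="$(mktemp)"
  # Keep every existing line that is not an assignment to NAME.
  if [[ -f "${file}" ]]; then
    grep -v "^${name}=" "${file}" > "${tmp}" || true
  fi
  printf '%s=%s\n' "${name}" "${value}" >> "${tmp}"
  # Write back in place so the target keeps its owner and mode.
  cat "${tmp}" > "${file}"
  rm -f "${tmp}"
}

# Example against a scratch file; the real target would be /etc/environment.
# set_env_file_var /tmp/environment HTTP_PROXY "http://proxy.example.internal:3128"
```

Calling the helper twice with the same arguments leaves exactly one `HTTP_PROXY=` line, which is the property that matters when the initialization action is re-run on the same node.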
Reliable GPG Key Importing (import_gpg_keys):
- Introduced a new function import_gpg_keys to handle GPG key fetching and importing in a proxy-aware manner using curl over HTTPS, replacing direct gpg --recv-keys calls to keyservers.
- This function supports fetching keys by URL or Key ID and is used throughout the script for repository setup (NVIDIA Container Toolkit, CUDA, Bigtop, Adoptium, Docker, Google Cloud, CRAN-R, MySQL).
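To illustrate fetching keys by URL or key ID over HTTPS, here is a rough sketch. It is not the PR's `import_gpg_keys` implementation: the helper names `key_ref_to_url` and `fetch_key`, the default keyserver, and the retry count are all assumptions. The only external behaviour relied on is that `curl` honours the proxy environment variables and that HKP keyservers answer `/pks/lookup?op=get` requests over HTTPS.

```shell
#!/usr/bin/env bash
# Hypothetical helper: turn a key reference (HTTPS URL or key ID) into a URL
# that curl can fetch through an HTTP(S) proxy.
key_ref_to_url() {
  local ref="$1"
  local keyserver="${2:-keyserver.ubuntu.com}"  # assumed default keyserver
  if [[ "${ref}" == https://* ]]; then
    printf '%s\n' "${ref}"
  else
    # HKP lookup served over HTTPS; accept the ID with or without a 0x prefix.
    printf 'https://%s/pks/lookup?op=get&search=0x%s\n' "${keyserver}" "${ref#0x}"
  fi
}

# Hypothetical helper: download an ASCII-armored key and store it as a
# binary keyring file. Requires network access, so it is not run here.
fetch_key() {
  local ref="$1" out="$2"
  curl -fsSL --retry 3 "$(key_ref_to_url "${ref}")" | gpg --dearmor -o "${out}"
}
```

Because the download goes through `curl` rather than `gpg --recv-keys`, the same proxy and CA settings that apply to every other download also apply to key retrieval.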
Conda/Mamba Environment (install_pytorch):
- Refined package list: numba, pytorch, tensorflow[and-cuda], rapids, pyspark, and cuda-version<=${CUDA_VERSION}. Explicit CUDA runtime (e.g., cudart_spec) is no longer added, allowing the solver more flexibility.
- Uses Mamba preferentially, with a Conda fallback.
- Implements cache/environment clearing logic based on install_gpu_driver-main and pytorch sentinels to allow forced refreshes.
- Improved error handling for environment creation, with specific messages for Mamba failures in proxied environments.
NVIDIA Driver Handling:
- set_driver_version: Uses curl -I for a more lightweight HEAD request to check URL validity.
- build_driver_from_github: Caches the open kernel module source tarball from GitHub to GCS. Checks for existing signed and loadable modules to avoid unnecessary rebuilds.
- execute_github_driver_build: Refactored to accept tarball paths. popd removed to balance pushd in caller. Removed a debug echo of the sign-file exit code.
- Added make -j$(nproc) to modules_install for parallelization.
- Post-install verification loop checks modinfo for signer: to confirm modules are signed.
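The `signer:` check in the last item can be sketched as a small filter over `modinfo` output. This is an illustrative version only; the function name `modinfo_has_signer` is an assumption, and the PR's actual verification loop is not shown in this thread.

```shell
#!/usr/bin/env bash
# Hypothetical helper: read `modinfo` output on stdin and succeed only if it
# contains a "signer:" field with a non-empty value.
modinfo_has_signer() {
  awk -F': *' '$1 == "signer" && $2 != "" { found = 1 } END { exit found ? 0 : 1 }'
}

# Example use on a node where the modules are installed:
# for m in nvidia nvidia_uvm nvidia_modeset nvidia_drm; do
#   modinfo "${m}" | modinfo_has_signer || echo "WARNING: ${m} is not signed" >&2
# done
```

An unsigned module typically has no `signer:` line at all, so the filter treats both a missing field and an empty value as "not signed".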
Lifecycle Improvements:
- prepare_to_install: Moved curl_retry_args definition earlier.
- install_nvidia_gpu_driver: Checks if nvidia module loads at the start and marks incomplete if not.
- main: Added mark_complete install_gpu_driver-main at the end.
- configure_dkms_certs: Always fetches keys from secret manager if PSN is set to ensure modulus_md5sum is available.
- install_gpu_agent: Checks if METADATA_HTTP_PROXY_PEM_URI is non-empty before using it.
Secure Boot Check: Issues a warning instead of exiting for Secure Boot on Debian < 2.2.

gemini-code-assist · 2026-01-23T22:44:13Z

Summary of Changes

Hello @cjac, I'm Gemini Code Assist¹! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request significantly improves the GPU driver installation script's reliability and adaptability, particularly in complex network environments requiring HTTP/HTTPS proxies and for systems utilizing Secure Boot. The changes focus on making the installation process more robust, configurable, and resilient to common issues like network restrictions and module signing requirements, while also refining the Conda environment setup and updating documentation.

Highlights

Enhanced Proxy Configuration: Introduced support for https-proxy and proxy-uri metadata, refined NO_PROXY handling with intelligent defaults, and enabled custom CA certificate installation into system, Java, and Conda trust stores. Proxy connections now default to HTTPS when a custom CA is provided, and the script includes pre-flight tests for proxy connectivity and reachability.
Robust GPG Key Import: A new import_gpg_keys function centralizes GPG key fetching from URLs or keyservers, ensuring it respects proxy settings and custom CA certificates, replacing previous direct curl | gpg --import calls.
Secure Boot Signing Refinements: The configure_dkms_certs function now consistently fetches keys from Secret Manager if private_secret_name is set, ensuring modulus_md5sum is available. Kernel module signing is integrated into the build process, with checks to verify modules are signed and loadable after installation.
Resilient Driver Installation: The install_nvidia_gpu_driver function now includes an initial check for the nvidia module's loadability, triggering a re-installation attempt if it fails. curl commands for downloads now incorporate retry flags and honor proxy settings.
Conda Environment Adjustments: The PyTorch Conda environment package list was streamlined by removing TensorFlow. Specific workarounds were added for Debian 10, including using conda instead of mamba and disabling SSL verification.
Documentation Updates: The gpu/README.md file has been updated to reflect the new proxy metadata (https-proxy, proxy-uri, no-proxy), detail the enhanced proxy support, and add proxy-related troubleshooting guidance.

¹ Review the Privacy Notices, Generative AI Prohibited Use Policy, Terms of Service, and learn how to configure Gemini Code Assist in GitHub here. Gemini can make mistakes, so double check it and use code with caution.

gemini-code-assist

Code Review

This pull request significantly enhances the GPU driver installation script by introducing robust proxy handling, improving Secure Boot integration, and refining Conda environment setup. Key improvements include flexible proxy configuration with support for HTTPS proxies and custom CA certificates, a new import_gpg_keys function for reliable GPG key fetching, and more thorough verification steps for signed kernel modules under Secure Boot. The documentation has also been updated to reflect these new features and provide better troubleshooting guidance. Overall, these changes make the script more resilient and configurable for diverse network environments and security requirements.

gpu/install_gpu_driver.sh

gemini-code-assist · 2026-01-23T22:49:42Z

gpu/install_gpu_driver.sh

+  if [[ -v METADATA_HTTP_PROXY_PEM_URI ]] && [[ -n "${METADATA_HTTP_PROXY_PEM_URI}" ]]; then
+      if [[ -z "${trusted_pem_path:-}" ]]; then
+          echo "WARNING: METADATA_HTTP_PROXY_PEM_URI is set, but trusted_pem_path is not defined." >&2
+      else
+          curl_retry_args+=(--cacert "${trusted_pem_path}")
+      fi


The warning METADATA_HTTP_PROXY_PEM_URI is set, but trusted_pem_path is not defined indicates a potential issue. trusted_pem_path is only set within set_proxy if both a proxy (http-proxy/https-proxy) and a PEM URI are provided. If http-proxy-pem-uri is provided but no http-proxy or https-proxy is set, set_proxy returns early, leaving trusted_pem_path undefined. This could lead to GPG key imports failing to use the custom CA, even if the PEM URI is present.
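The guard in the hunk above can be factored into a helper that only adds `--cacert` when the certificate file is actually readable, and reports back whether it did. This is a sketch under assumptions: the helper name `add_cacert_if_present` and the initial contents of `curl_retry_args` are placeholders, and it does not address the separate question of where `trusted_pem_path` should be set when no proxy is configured.

```shell
#!/usr/bin/env bash
# Placeholder defaults; the real script defines its own curl_retry_args.
curl_retry_args=(--fail --silent --show-error --retry 3)

# Hypothetical helper: append --cacert only when the PEM path is non-empty
# and readable. Returns 1 otherwise so the caller can decide how to react.
add_cacert_if_present() {
  local pem_path="${1:-}"
  if [[ -n "${pem_path}" && -r "${pem_path}" ]]; then
    curl_retry_args+=(--cacert "${pem_path}")
    return 0
  fi
  return 1
}

# Example use:
# add_cacert_if_present "${trusted_pem_path:-}" \
#   || echo "WARNING: proxy CA not available; curl will use system trust only" >&2
```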

gemini-code-assist · 2026-01-23T22:49:42Z

gpu/install_gpu_driver.sh

    pkg_proxy_conf_file="/etc/apt/apt.conf.d/99proxy"
-    cat > "${pkg_proxy_conf_file}" <<EOF
-Acquire::http::Proxy "http://${METADATA_HTTP_PROXY}";
-Acquire::https::Proxy "http://${METADATA_HTTP_PROXY}";
-EOF
+    echo "Acquire::http::Proxy \"http://${effective_proxy}\";" > "${pkg_proxy_conf_file}"
+    echo "Acquire::https::Proxy \"http://${effective_proxy}\";" >> "${pkg_proxy_conf_file}"
+    echo "DEBUG: set_proxy: Configured apt proxy: ${pkg_proxy_conf_file}"
  elif is_rocky ; then
    pkg_proxy_conf_file="/etc/dnf/dnf.conf"
-
    touch "${pkg_proxy_conf_file}"
-
-    if grep -q "^proxy=" "${pkg_proxy_conf_file}"; then
-      sed -i.bak "s@^proxy=.*@proxy=${HTTP_PROXY}@" "${pkg_proxy_conf_file}"
-    elif grep -q "^\[main\]" "${pkg_proxy_conf_file}"; then
-      sed -i.bak "/^\[main\]/a proxy=${HTTP_PROXY}" "${pkg_proxy_conf_file}"
+    sed -i.bak '/^proxy=/d' "${pkg_proxy_conf_file}"
+    if grep -q "^\[main\]" "${pkg_proxy_conf_file}"; then
+      sed -i.bak "/^\\\[main\\\\]/a proxy=http://${effective_proxy}" "${pkg_proxy_conf_file}"
    else
-      local TMP_FILE=$(mktemp)
-      printf "[main]\nproxy=%s\n" "${HTTP_PROXY}" > "${TMP_FILE}"
-
-      cat "${TMP_FILE}" "${pkg_proxy_conf_file}" > "${pkg_proxy_conf_file}".new
-      mv "${pkg_proxy_conf_file}".new "${pkg_proxy_conf_file}"
+      echo -e "[main]\nproxy=http://${effective_proxy}" >> "${pkg_proxy_conf_file}"
+    fi
+    echo "DEBUG: set_proxy: Configured dnf proxy: ${pkg_proxy_conf_file}"
+  fi


The apt and dnf proxy configurations (Acquire::http::Proxy "http://${effective_proxy}"; and proxy=http://${effective_proxy}) use an http:// prefix. If effective_proxy is derived solely from https_proxy_val (meaning only an HTTPS proxy was specified), this could lead to apt/dnf attempting to connect to an HTTPS proxy using an HTTP scheme. While a later sed command attempts to correct this if http-proxy-pem-uri is set, it might be incorrect if http-proxy-pem-uri is not provided.

Suggested change

    echo "Acquire::http::Proxy \"${HTTP_PROXY}\";" > "${pkg_proxy_conf_file}"
    echo "Acquire::https::Proxy \"${HTTPS_PROXY}\";" >> "${pkg_proxy_conf_file}"
    echo "DEBUG: set_proxy: Configured apt proxy: ${pkg_proxy_conf_file}"

This was resolved by using a variable scheme
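A variable scheme could look roughly like the sketch below. It follows the behaviour described elsewhere in this PR, where the proxy is addressed over HTTPS when a CA certificate is supplied and over HTTP otherwise. The helper names `proxy_url` and `apt_proxy_lines` and the example host are assumptions, not names from the script.

```shell
#!/usr/bin/env bash
# Hypothetical helper: build a proxy URL from host:port, choosing the scheme
# from whether a proxy CA certificate URI was supplied.
proxy_url() {
  local host_port="$1" pem_uri="${2:-}"
  local scheme="http"
  if [[ -n "${pem_uri}" ]]; then
    scheme="https"
  fi
  printf '%s://%s\n' "${scheme}" "${host_port}"
}

# Hypothetical helper: print the two apt directives from full proxy URLs,
# so no scheme is hard-coded in the configuration text itself.
apt_proxy_lines() {
  local http_url="$1" https_url="$2"
  printf 'Acquire::http::Proxy "%s";\n' "${http_url}"
  printf 'Acquire::https::Proxy "%s";\n' "${https_url}"
}

# Example use (writing to the real apt path needs root):
# url="$(proxy_url proxy.example.internal:3128 "${METADATA_HTTP_PROXY_PEM_URI:-}")"
# apt_proxy_lines "${url}" "${url}" > /etc/apt/apt.conf.d/99proxy
```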

gpu/install_gpu_driver.sh

gpu/README.md

cjac · 2026-01-24T02:28:47Z

/gcbrun

cjac · 2026-01-24T21:52:02Z

/gcbrun

cjac · 2026-01-27T23:26:26Z

I've made substantial changes. Please perform another review, discarding previous review comments. Let's start from the top once again.

/gemini review

gemini-code-assist

Code Review

This pull request significantly refactors the GPU driver installation script, enhancing proxy handling, improving Secure Boot integration, and refining Conda environment setup. The documentation has been updated to reflect these changes, providing clearer guidance on proxy configuration and custom image building with Secure Boot. The introduction of the import_gpg_keys function centralizes GPG key management, making the script more robust in various network environments. Overall, these changes contribute to a more reliable and configurable GPU initialization process.

gpu/install_gpu_driver.sh

gemini-code-assist · 2026-01-27T23:28:55Z

gpu/install_gpu_driver.sh

+      building_file=""
+      rm "${local_tarball}"
+      make clean
+      popd


The execute_github_driver_build function includes a popd command. However, this function is called from build_driver_from_github, which also has a popd at its end (line 1331). This results in an extra popd call, which could lead to a directory stack underflow or unexpected behavior if the script's current directory is not what's expected. Ensure that pushd and popd calls are balanced.
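One way to keep `pushd` and `popd` balanced is for each function to pop exactly what it pushed, on both the success and the failure path, so callers never need to compensate. The sketch below shows that pattern in isolation; `run_in_dir` is a hypothetical helper, not a function from the script.

```shell
#!/usr/bin/env bash
# Hypothetical helper: run a command inside DIR and always restore the
# directory stack before returning the command's exit status.
run_in_dir() {
  local dir="$1"
  shift
  pushd "${dir}" > /dev/null || return 1
  local rc=0
  "$@" || rc=$?
  popd > /dev/null || return 1
  return "${rc}"
}

# Example use:
# run_in_dir open-gpu-kernel-modules make clean
```

With this shape, the caller's own `pushd`/`popd` pair stays untouched regardless of whether the inner command succeeds.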

gpu/README.md

gpu/install_gpu_driver.sh

cjac · 2026-01-28T00:57:47Z

I've made substantial changes in response to your previous review. Please perform another review.

/gemini review

gemini-code-assist

Code Review

This pull request introduces significant enhancements to the GPU installation script, primarily focusing on robust proxy handling and support for Secure Boot. The documentation in gpu/README.md has been extensively updated with detailed instructions for these complex scenarios, which is a great improvement. The script gpu/install_gpu_driver.sh has been refactored to centralize GPG key imports, improve caching logic for driver and source builds, and add comprehensive proxy configuration, including custom CA certificate handling.

My review focuses on ensuring the new logic is robust and consistent. I've identified a minor documentation inconsistency and a couple of potential issues in the script related to command execution order and error handling. Overall, this is a very strong refactoring that greatly improves the reliability and maintainability of the script.

gpu/README.md

gpu/install_gpu_driver.sh

cjac · 2026-01-28T01:44:39Z

/gemini review

gemini-code-assist

Code Review

This is a significant and valuable pull request that greatly enhances the GPU installation script's reliability, especially in complex environments with proxies and Secure Boot. The refactoring for proxy handling is comprehensive, and the new GPG key import mechanism is a major improvement for robustness. The documentation updates in the README are thorough and will be very helpful for users. The changes are well-executed, but I've identified one high-severity issue related to directory stack management (pushd/popd) in the new execute_github_driver_build function that should be addressed to prevent potential issues with script execution flow.

gpu/install_gpu_driver.sh

cjac · 2026-02-06T23:00:12Z

/gcbrun

dilipgodhia · 2026-02-25T23:23:52Z

cloudbuild/cloudbuild.yaml

+    id: 'dataproc-2.3-debian12-tests'
+    waitFor: ['gcr-push']
+    entrypoint: 'bash'
+    args: ['cloudbuild/run-presubmit-on-k8s.sh', 'gcr.io/$PROJECT_ID/init-actions-image:$BUILD_ID', '$BUILD_ID', '2.3-debian12']


nit: need to replace all variables like $BUILD_ID and others with ${BUILD_ID}

dilipgodhia · 2026-02-25T23:31:16Z

gpu/README.md

+          "TEMP_BUCKET": "YOUR_TEMP_BUCKET",
+          "CUSTOM_IMAGE_NAME": "YOUR_BUILT_IMAGE_NAME",
+          "PURPOSE": "secure-boot-cluster",
+          // Add these for a private, proxied environment


nit: `//` comments may not be valid in JSON

dilipgodhia · 2026-02-25T23:33:07Z

gpu/README.md

+    *   Set `CUSTOM_IMAGE_NAME` to the image you built in the `custom-images` process.
+
+3.  **Create the Private Environment and Cluster:**
+    This script sets up the VPC, subnets, Secure Web Proxy, and then creates the Dataproc cluster using the custom image. The `--shielded-secure-boot` flag is handled internally by the scripts when a `CUSTOM_IMAGE_NAME` is provided.


Add a note: "make sure that you are in the right directory ...

dilipgodhia · 2026-02-25T23:36:46Z

gpu/install_gpu_driver.sh

    "2.1" ) DEFAULT_CUDA_VERSION="12.4.1" ;;
-    "2.2" ) DEFAULT_CUDA_VERSION="12.6.3" ;;
-    "2.3" ) DEFAULT_CUDA_VERSION="12.6.3" ;;
+    "2.2" ) DEFAULT_CUDA_VERSION="13.0.1" ;;


should this be 13.1.1?

dilipgodhia · 2026-02-25T23:46:20Z

gpu/install_gpu_driver.sh

      -c conda-forge -c nvidia -c rapidsai \
-      numba pytorch tensorflow[and-cuda] rapids pyspark \
-      "cuda-version<=${CUDA_VERSION}" "${cudart_spec}"
+      ${conda_pkg} 2> "${conda_err_file}"


nit: may need to put quotes around this
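If the package list is held in a bash array rather than a plain string, it can be expanded with full quoting while still passing one argument per package. The sketch below uses the package names from the removed lines of the hunk; the array name `conda_pkgs`, the helper `print_args`, and the fallback `CUDA_VERSION` value are assumptions for illustration.

```shell
#!/usr/bin/env bash
CUDA_VERSION="${CUDA_VERSION:-12.4}"  # placeholder; the script sets its own

# Each element is one argument. Quoting protects the glob characters in
# "tensorflow[and-cuda]" and the "<" in the version constraint.
conda_pkgs=(
  numba
  pytorch
  "tensorflow[and-cuda]"
  rapids
  pyspark
  "cuda-version<=${CUDA_VERSION}"
)

# Hypothetical helper: print one argument per line, to show how the
# quoted array expansion is split.
print_args() {
  printf '%s\n' "$@"
}

print_args "${conda_pkgs[@]}"
```

The same `"${conda_pkgs[@]}"` expansion can be placed where the hunk currently expands `${conda_pkg}` unquoted.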

dilipgodhia · 2026-02-25T23:46:37Z

gpu/install_gpu_driver.sh

+    local conda_exit_code=$?
+    set -e
+
+    if [[ ${conda_exit_code} -ne 0 ]]; then


nit: may need to put quotes around this

dilipgodhia · 2026-02-25T23:52:09Z

gpu/install_gpu_driver.sh

+      ${gsutil_cmd} cp "${local_tarball}.building" "${gcs_tarball}.building"
+      building_file="${gcs_tarball}.building"
+
+      pushd open-gpu-kernel-modules


pushd does not appear to have a corresponding popd. is that ok?

Excellent catch, @dilipgodhia! You are correct, the popd was missing within the execute_github_driver_build function to balance the pushd open-gpu-kernel-modules. I've added the popd to the end of the function to fix this. Thanks! e37b8eb

dilipgodhia · 2026-02-25T23:55:54Z

gpu/install_gpu_driver.sh

+    echo "Extracting source tarball..."
+    tar xzf "${local_tarball}" -C "${workdir}"
+    mv "${workdir}/open-gpu-kernel-modules-${DRIVER_VERSION}" "${workdir}/open-gpu-kernel-modules"
+    # rm -f "${local_tarball}" # Keep the local tarball for potential reuse


nit: remove # command line unless it is a comment

This is indeed a comment, @dilipgodhia. The # at the beginning of the rm command is intentional to keep the line but prevent execution, as the comment on the same line explains the intent is to preserve the tarball for potential reuse.

dilipgodhia · 2026-02-26T00:08:26Z

gpu/install_gpu_driver.sh

+        ethtool_info: ($ethtool_info | if . == "null" or . == "" then null else . end),
+        ethtool_driver: ($ethtool_driver | if . == "null" or . == "" then null else . end)
+      }')
+    ni_array=$(echo "$ni_array" | jq --argjson item "$interface_json" '. += [$item]')


nit: may need to put curly braces around this array

Thanks, @dilipgodhia! This section of the evaluate_network function has been refactored. The network interface details are now directly integrated into the main JSON object without using the intermediate ni_array variable, so this specific concern is obsolete in the current version.

dilipgodhia · 2026-02-26T00:16:41Z

gpu/install_gpu_driver.sh

+  # Configure gcloud proxy
+  local gcloud_version
+  gcloud_version=$(gcloud version --format="value(google_cloud_sdk)")
+  if version_ge "${gcloud_version}" "547.0.0"; then


consider creating an uppercase variable for the hard-coded value

Good suggestion, @dilipgodhia! This has been updated in the latest version. The script now uses a constant min_gcloud_proxy_ver="547.0.0" for better readability and maintainability.
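For readers unfamiliar with the comparison, a version check against a named minimum might look like the sketch below. The body of `version_ge` is an assumption (the script's own implementation is not shown in this thread) and relies on GNU `sort -V`; only the names `version_ge` and `min_gcloud_proxy_ver` come from the discussion above.

```shell
#!/usr/bin/env bash
# Assumed implementation of version_ge: succeeds when $1 >= $2 under
# version ordering. Relies on `sort -V` from GNU coreutils.
version_ge() {
  local have="$1" want="$2"
  # The lower version sorts first; have >= want exactly when that is want.
  [[ "$(printf '%s\n%s\n' "${want}" "${have}" | sort -V | head -n 1)" == "${want}" ]]
}

min_gcloud_proxy_ver="547.0.0"

# Example use, mirroring the hunk above:
# gcloud_version="$(gcloud version --format='value(google_cloud_sdk)')"
# if version_ge "${gcloud_version}" "${min_gcloud_proxy_ver}"; then
#   echo "gcloud supports proxy configuration"
# fi
```

Version-aware sorting matters here: a plain string comparison would rank `99.0.0` above `547.0.0`.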

dilipgodhia · 2026-02-26T00:26:43Z

gpu/install_gpu_driver.sh

+    if [[ -f "/etc/environment" ]]; then
+        JAVA_HOME="$(awk -F= '/^JAVA_HOME=/ {print $2}' /etc/environment)"
+        if [[ -n "${JAVA_HOME:-}" && -f "${JAVA_HOME}/bin/keytool" ]]; then
+            "${JAVA_HOME}/bin/keytool" -import -cacerts -storepass changeit -noprompt -alias swp_ca -file "${proxy_ca_pem}"


please check if the keytool requires -importcert if dataproc 2.0 is used.

Thanks for the check, @dilipgodhia! According to the keytool documentation, -import is a valid alias for -importcert. They are functionally equivalent for importing certificates. The current command should work correctly across different Java versions, including the ones used in Dataproc 2.0.

dilipgodhia

Review done

cjac · 2026-03-30T18:42:30Z

/gcbrun

cjac · 2026-03-31T00:42:39Z

/gcbrun

cjac · 2026-04-02T13:32:49Z

/gcbrun

cjac · 2026-04-02T13:58:50Z

/gcbrun

…uild process

This commit introduces significant improvements to the GPU initialization action, focusing on robustness, configurability, and debuggability.

**Core Enhancements:**

* **Version Updates:** Updated mappings for CUDA, NVIDIA Drivers, CUDNN, and NCCL to support newer versions (up to CUDA 13.1).
* **GCS Caching for CUDA Runfile:** The script now caches the CUDA runfile in the GCS bucket, similar to the driver, reducing download times on subsequent runs.
* **Refactored Proxy Handling (`set_proxy`):**
    * Completely overhauled to support `http-proxy`, `https-proxy`, `proxy-uri`, and `http-proxy-pem-uri` metadata.
    * Dynamically determines proxy protocol (HTTP/HTTPS) based on PEM URI presence.
    * Configures environment variables, `/etc/environment`, gcloud, apt/dnf, and dirmngr.
    * Installs the proxy CA certificate into OS, Java, and Conda trust stores if provided.
    * Includes TCP and HTTPS connectivity tests through the configured proxy.
* **Network Evaluation (`evaluate_network`):**
    * New function to gather extensive network configuration, metadata, IP information, and connectivity test results.
    * Saves the output to `/run/dpgce-network.json` for debugging and use by other scripts.
    * Includes helper functions (`get_network_info`, `get_primary_ip`, etc.) to easily query this file.
* **Improved Open Kernel Module Build:**
    * Caches the NVIDIA open-gpu-kernel-modules source tarball in GCS.
    * Refactored build logic into `execute_github_driver_build`.
    * Added checks to only rebuild modules if they are missing, unsigned, or fail to load.
    * Enhanced module signing process within the build.
* **Robust GPG Key Import (`import_gpg_keys`):**
    * New function to handle GPG key fetching from URLs or Keyservers.
    * Replaces various `curl | gpg --import` and `gpg --recv-keys` calls with a more resilient and unified approach.
* **Mamba Integration:** The script now attempts to use `mamba` for faster Conda environment creation for PyTorch, with a fallback to `conda`. Includes error handling for common mamba/proxy issues.
* **PyTorch Environment Cache Purge:** Added logic to automatically clear the GCS cache and local environment for the PyTorch Conda package if a rebuild is likely needed (e.g., after driver changes).

**Other Changes:**

* Updated default CUDA version for Dataproc 2.2+ images to 13.1.0.
* Adjusted `NVCC_GENCODE` flags for different CUDA versions to optimize for relevant GPU architectures.
* Refined `configure_dkms_certs` to always fetch keys from Secret Manager if `PSN` metadata is set.
* Added a check to `install_nvidia_gpu_driver` to force re-installation if the `nvidia` module doesn't load.
* Moved network evaluation and tool setup earlier in `prepare_to_install`.
* Minor fixes and quoting improvements throughout the script.

cjac · 2026-04-02T14:12:45Z

/gcbrun

cjac · 2026-04-02T14:31:57Z

/gcbrun

cjac · 2026-04-02T14:41:19Z

/gcbrun

This commit significantly expands and refines the `gpu/README.md` to guide users on effectively using the GPU initialization action, particularly emphasizing the use of the companion tooling in the `GoogleCloudDataproc/cloud-dataproc` repository.

**Key README Changes:**

* **Recommended Approach:** Strongly recommends using the `cloud-dataproc/gcloud` scripts (`bin/create-dpgce`, `bin/recreate-dpgce`) for cluster creation, especially for complex setups involving custom images, Secure Boot, or proxies.
* **`env.json` Configuration:** Detailed explanation of key properties in `env.json` for configuring GPU clusters through the `cloud-dataproc` tooling.
* **Secure Boot Custom Images:** Added a comprehensive section on "Building Custom Images with Secure Boot and Proxy Support," referencing the `GoogleCloudDataproc/custom-images` repository and the `examples/secure-boot/` toolkit.
* **Launching with Custom Image:** Explains how to launch a cluster using the built custom image with Secure Boot enabled, again using the `cloud-dataproc/gcloud` scripts.
* **Network Evaluation & Proxy Support:** New sections describing the built-in network diagnostics (`evaluate_network`, `/run/dpgce-network.json`) and the enhanced proxy support capabilities, including custom CA certificate handling.
* **Metadata Parameters:** Updated descriptions for proxy-related metadata (`http-proxy`, `https-proxy`, `proxy-uri`, `http-proxy-pem-uri`, etc.).
* **Troubleshooting:** Enhanced troubleshooting guide, including tips for network/proxy issues and referencing the network diagnostics file.
* **Clarity:** Improved overall structure and clarity of instructions.

**Other Changes:**

* Reverted the functional changes to `install_gpu_driver.sh` and `test_gpu.py` that were present in the previous diff. The script and tests are now back to the state before the caching, proxy, and test refactoring enhancements.

cjac · 2026-04-02T15:49:00Z

/gcbrun

cjac self-assigned this Jan 23, 2026

gemini-code-assist bot reviewed Jan 23, 2026

cjac force-pushed the gpu-202601 branch from f6c93fd to ae9be1e Compare January 23, 2026 23:09

cjac requested review from a team, GHWendy, JerryLeiDing, KoopaKing, abmodi, bradmiro, cnauroth, codelixir, dilipgodhia, jayadeep-jayaraman, kuldeepkk-dev, rrohanarora and xby-G January 24, 2026 06:57

cjac force-pushed the gpu-202601 branch from ae9be1e to f2cee3d Compare January 24, 2026 21:51

gemini-code-assist bot reviewed Jan 27, 2026

cjac force-pushed the gpu-202601 branch 4 times, most recently from 25df48a to 3f5811e Compare January 28, 2026 00:56

gemini-code-assist bot reviewed Jan 28, 2026

gemini-code-assist bot reviewed Jan 28, 2026

dilipgodhia reviewed Feb 25, 2026

dilipgodhia reviewed Feb 26, 2026

cjac force-pushed the gpu-202601 branch from aecceac to e37b8eb Compare March 13, 2026 03:42

codelixir removed their request for review March 27, 2026 11:18

cjac force-pushed the gpu-202601 branch from f5de693 to a6a1f17 Compare April 2, 2026 13:58

cjac force-pushed the gpu-202601 branch 2 times, most recently from b0d005a to 631ecd2 Compare April 2, 2026 14:12

cjac force-pushed the gpu-202601 branch from 631ecd2 to 202fb72 Compare April 2, 2026 14:41

cjac force-pushed the gpu-202601 branch from 202fb72 to 040a851 Compare April 2, 2026 15:48
