`gpu/README.md`

GPUs require special drivers and software which are not pre-installed on
[Dataproc](https://cloud.google.com/dataproc) clusters by default.
This initialization action installs GPU drivers for NVIDIA GPUs on the master
and worker nodes of a Dataproc cluster.

## Default versions

Specifying a supported value for the `cuda-version` metadata variable
will select compatible values for Driver, cuDNN, and NCCL from the script's
internal matrix. Default CUDA versions are typically:

* Dataproc 1.5: `11.6.2`
* Dataproc 2.0: `12.1.1`
* Dataproc 2.1: `12.4.1`
* Dataproc 2.2 & 2.3: `12.6.3`
Refer to internal arrays in `install_gpu_driver.sh` for the full matrix.)*

CUDA | Full Version | Driver | cuDNN | NCCL | Tested Dataproc Image Versions
-----| ------------ | --------- | --------- | -------| ---------------------------
11.8 | 11.8.0 | 525.147.05| 9.5.1.17 | 2.21.5 | 2.0, 2.1 (Debian, Ubuntu)
12.0 | 12.0.1 | 525.147.05| 8.8.1.3 | 2.16.5 | 2.0, 2.1 (Debian, Ubuntu)
12.4 | 12.4.1 | 550.135 | 9.1.0.70 | 2.23.4 | 2.0, 2.1 (Debian, Ubuntu); 2.2+ (Debian, Ubuntu, Rocky)
12.6 | 12.6.3 | 550.142 | 9.6.0.74 | 2.23.4 | 2.2+ (Debian, Ubuntu, Rocky)

*Note: Secure Boot is only supported on Dataproc 2.2+ images.*
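For example, to pin one of the CUDA releases above when creating a cluster directly with `gcloud`, pass the `cuda-version` metadata key. The sketch below only assembles and prints the command so it can be reviewed before running; the cluster name, region, and accelerator type are placeholder assumptions:

```bash
#!/usr/bin/env bash
# Illustrative values -- replace with your own.
REGION="us-central1"
CLUSTER_NAME="my-gpu-cluster"

# cuda-version comes from the compatibility matrix above.
CREATE_CMD="gcloud dataproc clusters create ${CLUSTER_NAME} \
  --region ${REGION} \
  --worker-accelerator type=nvidia-tesla-t4,count=2 \
  --metadata cuda-version=12.4.1 \
  --initialization-actions gs://goog-dataproc-initialization-actions-${REGION}/gpu/install_gpu_driver.sh"

# Print rather than execute, so the command can be reviewed first.
echo "${CREATE_CMD}"
```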

**Supported Operating Systems:**

[best practices](/README.md#how-initialization-actions-are-used)
of using initialization actions in production.

This initialization action will install NVIDIA GPU drivers and the CUDA toolkit.
Optional components like cuDNN, NCCL, and PyTorch can be included via
metadata.

The recommended way to create a Dataproc cluster with GPU support, especially for environments requiring custom images, Secure Boot, or private networks with proxies, is to use the tooling provided in the [GoogleCloudDataproc/cloud-dataproc](https://github.com/GoogleCloudDataproc/cloud-dataproc) repository. This approach simplifies configuration and automates the `gcloud` command generation.

**Steps:**

1. **Clone the `cloud-dataproc` Repository:**
```bash
git clone https://github.com/GoogleCloudDataproc/cloud-dataproc.git
cd cloud-dataproc/gcloud
```

2. **Configure Your Environment:**
* Copy the sample configuration: `cp env.json.sample env.json`
* Edit `env.json` to match your desired cluster setup.

**Note on JSON Examples:** Any lines in the JSON examples in this document starting with `//` are comments for explanation and must be removed before using the JSON.

**Key `env.json` Properties:**

* **Required:**
* `PROJECT_ID`: Your Google Cloud Project ID.
* `REGION`: The GCP region for the cluster.
* `ZONE`: The GCP zone within the region.
* `BUCKET`: A GCS bucket for staging and temporary files.
* **GPU Related:**
* `GPU_MASTER_ACCELERATORS`: e.g., "type=nvidia-tesla-t4,count=1" (Optional, can be omitted if no GPU on master)
* `GPU_WORKER_ACCELERATORS`: e.g., "type=nvidia-tesla-t4,count=1" (Optional, to have GPUs on workers)
* **Image:**
* `DATAPROC_IMAGE_VERSION`: e.g., "2.2-debian12" (Required if not using `CUSTOM_IMAGE_NAME`)
* `CUSTOM_IMAGE_NAME`: Set this to the name of your pre-built custom image if you have one (e.g., from the Secure Boot image building process).
* **Optional (Defaults & Advanced):**
* `MACHINE_TYPE_MASTER`, `MACHINE_TYPE_WORKER`
* `NUM_MASTERS`, `NUM_WORKERS`
* `BOOT_DISK_SIZE`, `BOOT_DISK_TYPE`
* `NETWORK`, `SUBNET`: For specifying existing networks.
* `INTERNAL_IP_ONLY`: Set to `true` for private clusters.
* **Proxy Settings:** `SWP_IP`, `SWP_PORT`, `SWP_HOSTNAME`, `PROXY_PEM_URI`, `PROXY_PEM_HASH` (for private networks with Secure Web Proxy).
* **Secure Boot:** `ENABLE_SECURE_BOOT` (set to `true` if using a Secure Boot enabled custom image).

The `install_gpu_driver.sh` initialization action is automatically added by the scripts in `bin/` if any `GPU_*_ACCELERATORS` are defined in `env.json`.

3. **Create the Cluster:**
Make sure you are in the `cloud-dataproc/gcloud` directory before running these commands.
* To create a new environment (VPC, subnet, proxy if configured) and the cluster:
```bash
bash bin/create-dpgce
```
* To recreate the cluster in an existing environment defined by `env.json`:
```bash
bash bin/recreate-dpgce
```

These scripts will parse `env.json` and construct the appropriate `gcloud dataproc clusters create` command with all necessary flags, including the initialization action, metadata, scopes, and network settings.
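As a rough sketch of what those wrappers do (the real scripts handle many more options), the snippet below reads two values from an `env.json` and assembles the corresponding flags, adding the initialization action when a GPU accelerator is configured; all values are placeholders:

```bash
# Minimal stand-in for an env.json (placeholder values).
cat > /tmp/env.json <<'EOF'
{"PROJECT_ID": "my-project", "REGION": "us-west4",
 "GPU_WORKER_ACCELERATORS": "type=nvidia-tesla-t4,count=1"}
EOF

# Pull values out with python3 (jq works too, if installed).
get() { python3 -c "import json,sys; print(json.load(open('/tmp/env.json')).get(sys.argv[1],''))" "$1"; }

REGION=$(get REGION)
ACCEL=$(get GPU_WORKER_ACCELERATORS)

FLAGS="--region ${REGION}"
# Per the note above, defining a GPU accelerator also pulls in the init action.
if [ -n "${ACCEL}" ]; then
  FLAGS="${FLAGS} --worker-accelerator ${ACCEL} --initialization-actions gs://goog-dataproc-initialization-actions-${REGION}/gpu/install_gpu_driver.sh"
fi
echo "${FLAGS}"
```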

For detailed instructions on Secure Boot custom image creation and private network setup, see the "Building Custom Images with Secure Boot and Proxy Support" section below.

### Using for Custom Image Creation

This script accepts the following metadata parameters:
* `cudnn-version`: (Optional) Specify cuDNN version (e.g., `8.9.7.29`).
* `nccl-version`: (Optional) Specify NCCL version.
* `include-pytorch`: (Optional) `yes`|`no`. Default: `no`.
If `yes`, installs PyTorch, Numba, TensorFlow, RAPIDS, and PySpark
in a Conda environment (named by `gpu-conda-env`). **This also registers
the created Conda environment as a Jupyter kernel.**
* `gpu-conda-env`: (Optional) Name for the PyTorch Conda environment.
Default: `dpgce`.
* `container-runtime`: (Optional) E.g., `docker`, `containerd`, `crio`.
For NVIDIA Container Toolkit configuration. Auto-detected if not specified.
* `http-proxy`: (Optional) Proxy address and port for HTTP requests (e.g., `your-proxy.com:3128`).
* `https-proxy`: (Optional) Proxy address and port for HTTPS requests (e.g., `your-proxy.com:3128`). Defaults to `http-proxy` if not set.
* `proxy-uri`: (Optional) A single proxy URI for both HTTP and HTTPS. Overridden by `http-proxy` or `https-proxy` if they are set.
* `no-proxy`: (Optional) Comma or space-separated list of hosts/domains to bypass the proxy. Defaults include localhost, metadata server, and Google APIs. User-provided values are appended to the defaults.
* `http-proxy-pem-uri`: (Optional) A `gs://` path to the
PEM-encoded CA certificate file for the proxy specified in
`http-proxy`/`https-proxy`. Required if the proxy uses TLS with a certificate not in the default system trust store. This certificate will be added to the system, Java, and Conda trust stores, and proxy connections will use HTTPS.
* `invocation-type`: (For Custom Images) Set to `custom-images` by image
building tools. Not typically set by end-users creating clusters.
* **Secure Boot Signing Parameters:** Used if Secure Boot is enabled and
  ```
  modulus_md5sum=<md5sum-of-your-mok-key-modulus>
  ```
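Several of the proxy-related keys above are typically supplied together. The sketch below simply assembles a `--metadata` value from them; host names, ports, and bucket paths are placeholders:

```bash
# Placeholder proxy endpoint and CA location.
PROXY="your-proxy.example.com:3128"
PEM_URI="gs://your-bucket/proxy-ca.pem"

# Metadata keys are joined with commas on the gcloud command line.
METADATA="http-proxy=${PROXY}"
METADATA="${METADATA},https-proxy=${PROXY}"
METADATA="${METADATA},http-proxy-pem-uri=${PEM_URI}"
METADATA="${METADATA},no-proxy=internal.example.com"

echo "--metadata ${METADATA}"
```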

### Network Evaluation

This script includes a network evaluation function (`evaluate_network`) that runs early during execution. It gathers detailed information about the instance's network environment, including:

* GCP Metadata (instance, project, network interface details)
* Local IP and routing table information (`ip` commands)
* DNS configuration (`/etc/resolv.conf`)
* Proxy settings from metadata
* External connectivity tests (e.g., public IP, reachability of key services)
* Kerberos configuration status

The results are stored in `/run/dpgce-network.json` and printed to the log. This allows subsequent script logic to make more informed decisions based on the actual network state. Helper functions like `has_default_route()`, `is_proxy_enabled()`, and `can_reach_gstatic()` are available to query this information.
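On a running node, the recorded evaluation can be inspected directly. A minimal sketch (the file exists only after the initialization action has run):

```bash
NET_JSON="/run/dpgce-network.json"
if [ -r "${NET_JSON}" ]; then
  # Pretty-print the recorded network state.
  python3 -m json.tool "${NET_JSON}"
else
  echo "no network evaluation found at ${NET_JSON}"
fi
```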

### Enhanced Proxy Support

This script includes robust support for environments requiring an HTTP/HTTPS proxy:

* **Configuration:** Use the `http-proxy`, `https-proxy`, or `proxy-uri` metadata to specify your proxy server (host:port).
* **Custom CA Certificates:** If your proxy uses a custom CA (e.g., self-signed), provide the CA certificate in PEM format via the `http-proxy-pem-uri` metadata (as a `gs://` path).
* **Integrity Check:** Optionally, provide the SHA256 hash of the PEM file via `http-proxy-pem-sha256` to ensure the downloaded file is correct.
* The script will:
* Install the CA into the system trust store (`update-ca-certificates` or `update-ca-trust`).
* Add the CA to the Java cacerts trust store.
* Configure Conda to use the system trust store.
* Switch proxy communications to use HTTPS.
* **Tool Configuration:** The script automatically configures `curl`, `apt`, `dnf`, `gpg`, `pip`, and Java to use the specified proxy settings and custom CA if provided. This is now guided by the results of the `evaluate_network` function.
* **Bypass:** The `no-proxy` metadata allows specifying hosts to bypass the proxy. Defaults include `localhost`, the metadata server, `.google.com`, and `.googleapis.com` to ensure essential services function correctly.
* **Verification:** The script performs connection tests to the proxy and attempts to reach external sites (google.com, nvidia.com) through the proxy to validate the configuration before proceeding with downloads.
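The authoritative default bypass list lives in `install_gpu_driver.sh`; as an approximation, the final `no-proxy` list is assembled roughly like this (the user-supplied value is a hypothetical example):

```bash
# Defaults per the Bypass bullet above (exact entries are defined in
# install_gpu_driver.sh; this list is an approximation).
DEFAULT_NO_PROXY="localhost,127.0.0.1,metadata.google.internal,.google.com,.googleapis.com"

# Hypothetical user-supplied no-proxy metadata value (space-separated here).
USER_NO_PROXY="internal.example.com .corp.example.net"

# Normalize spaces to commas and append to the defaults.
NO_PROXY="${DEFAULT_NO_PROXY},$(printf '%s' "${USER_NO_PROXY}" | tr ' ' ',')"
echo "${NO_PROXY}"
```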

### Loading Built Kernel Module & Secure Boot

When the script needs to build NVIDIA kernel modules from source (e.g., using
not suitable), special considerations apply if Secure Boot is enabled.
or `dmesg` output for errors like "Operation not permitted" or messages
related to signature verification failure.

## Building Custom Images with Secure Boot and Proxy Support

For environments requiring NVIDIA drivers to be signed for Secure Boot, especially when operating behind an HTTP/S proxy, you must first build a custom Dataproc image. This process uses tools from the [GoogleCloudDataproc/custom-images](https://github.com/GoogleCloudDataproc/custom-images) repository, specifically the scripts within the `examples/secure-boot/` directory.

**Base Image:** Typically Dataproc 2.2-debian12 or newer.

**Process Overview:**

1. **Clone `custom-images` Repository:**
```bash
git clone https://github.com/GoogleCloudDataproc/custom-images.git
cd custom-images
```

2. **Configure Build:** Set up `env.json` with your project, network, and bucket details. See the `examples/secure-boot/env.json.sample` in the `custom-images` repo.

3. **Prepare Signing Keys:** Ensure Secure Boot signing keys are available in GCP Secret Manager. Use `examples/secure-boot/create-key-pair.sh` from the `custom-images` repo to create/manage these.

4. **Build Docker Image:** Build the builder environment: `docker build -t dataproc-secure-boot-builder:latest .`

5. **Run Image Generation:** Use `generate_custom_image.py` within the Docker container, typically orchestrated by `examples/secure-boot/pre-init.sh`. The core customization script `examples/secure-boot/install_gpu_driver.sh` handles driver installation, proxy setup, and module signing.

* Refer to the [Secure Boot example documentation](https://github.com/GoogleCloudDataproc/custom-images/tree/master/examples/secure-boot) for detailed `docker run` commands and metadata requirements (proxy settings, secret names, etc.).

### Launching a Cluster with the Secure Boot Custom Image

Once you have successfully built a custom image with signed drivers, you can create a Dataproc cluster with Secure Boot enabled.

**Important:** To launch a Dataproc cluster with the `--shielded-secure-boot` flag and have NVIDIA drivers function correctly, you MUST use a custom image created through the process detailed above. Standard Dataproc images do not contain the necessary signed modules.

**Network and Cluster Setup:**

To create the cluster in a private network environment with a Secure Web Proxy, use the scripts from the [GoogleCloudDataproc/cloud-dataproc](https://github.com/GoogleCloudDataproc/cloud-dataproc) repository:

1. **Clone `cloud-dataproc` Repository:**
```bash
git clone https://github.com/GoogleCloudDataproc/cloud-dataproc.git
cd cloud-dataproc/gcloud
```

2. **Configure Environment:**
* Copy `env.json.sample` to `env.json`.
* Edit `env.json` with your project details, ensuring you specify the custom image name and any necessary proxy details if you intend to run in a private network. Example:
```json
{
"PROJECT_ID": "YOUR_GCP_PROJECT_ID",
"REGION": "us-west4",
"ZONE": "us-west4-a",
"BUCKET": "YOUR_STAGING_BUCKET",
"TEMP_BUCKET": "YOUR_TEMP_BUCKET",
"CUSTOM_IMAGE_NAME": "YOUR_BUILT_IMAGE_NAME",
"PURPOSE": "secure-boot-cluster",
// Add these for a private, proxied environment
"PRIVATE_RANGE": "10.43.79.0/24",
"SWP_RANGE": "10.44.79.0/24",
"SWP_IP": "10.43.79.245",
"SWP_PORT": "3128",
"SWP_HOSTNAME": "swp.your-project.example.com"
// ... other variables as needed
}
```
* Set `CUSTOM_IMAGE_NAME` to the image you built in the `custom-images` process.

3. **Create the Private Environment and Cluster:**
This script sets up the VPC, subnets, Secure Web Proxy, and then creates the Dataproc cluster using the custom image. The `--shielded-secure-boot` flag is handled internally by the scripts when a `CUSTOM_IMAGE_NAME` is provided.
Make sure you are in the `cloud-dataproc/gcloud` directory, then run:
```bash
bash bin/create-dpgce-private
```
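Since JSON has no comment syntax, any `//` annotation lines (like those in the example above) must be stripped before the file is used. A small sketch using a throwaway file:

```bash
# Strip whole-line // comments from an annotated env.json.
cat > /tmp/env.json.annotated <<'EOF'
{
  "PROJECT_ID": "my-project",
  // remove me before use
  "REGION": "us-west4"
}
EOF

grep -v '^[[:space:]]*//' /tmp/env.json.annotated > /tmp/env.json

# Confirm the result parses as JSON.
python3 -m json.tool /tmp/env.json
```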

**Verification:**

1. SSH into the master node of the created cluster.
2. Check driver status: `sudo nvidia-smi`
3. Verify module signature: `sudo modinfo nvidia | grep signer` (should show your custom CA).
4. Check for errors: `dmesg | grep -iE "Secure Boot|NVRM|nvidia"`
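Step 3 is easy to script, since `modinfo` emits plain `key: value` text. The sketch below embeds canned output so the parsing can be shown off-cluster; on a real node you would feed it `sudo modinfo nvidia`, and the signer string depends on your CA:

```bash
# Canned `modinfo nvidia` excerpt (on a real node: sudo modinfo nvidia).
MODINFO_OUTPUT='filename:       /lib/modules/6.1.0/kernel/drivers/video/nvidia.ko
license:        NVIDIA
signer:         Your Custom CA
sig_key:        12:34:56'

# Extract the value of the signer: line.
SIGNER=$(printf '%s\n' "${MODINFO_OUTPUT}" | sed -n 's/^signer:[[:space:]]*//p')
echo "module signed by: ${SIGNER}"
```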

### Verification

1. Once the cluster has been created, you can access the Dataproc cluster and
handles metric creation and reporting.
* **Installation Failures:** Examine the initialization action log on the
affected node, typically `/var/log/dataproc-initialization-script-0.log`
(or a similar name if multiple init actions are used).
* **Network/Proxy Issues:** If using a proxy, double-check the `http-proxy`, `https-proxy`, `proxy-uri`, `no-proxy`, `http-proxy-pem-uri`, and `http-proxy-pem-sha256` metadata settings. Ensure the proxy allows access to NVIDIA domains, GitHub, and package repositories. Check the init action log for curl errors or proxy test failures. The `/run/dpgce-network.json` file contains detailed network diagnostics.
* **GPU Agent Issues:** If the agent was installed (`install-gpu-agent=true`),
check its service logs using `sudo journalctl -u gpu-utilization-agent.service`.
* **Driver Load or Secure Boot Problems:** Review `dmesg` output and
* The script extensively caches downloaded artifacts (drivers, CUDA `.run`
files) and compiled components (kernel modules, NCCL, Conda environments)
to a GCS bucket. This bucket is typically specified by the
`dataproc-temp-bucket` cluster property or metadata. Downloads and cache operations are proxy-aware.
* **First Run / Cache Warming:** Initial runs on new configurations (OS,
kernel, or driver version combinations) that require source compilation
(e.g., for NCCL or kernel modules when no pre-compiled version is
Expand All @@ -324,4 +425,4 @@ handles metric creation and reporting.
Debian-based systems, including handling of archived backports repositories
to ensure dependencies can be met.
* Tested primarily with Dataproc 2.0+ images. Support for older Dataproc
1.5 images is limited.