`gpu/README.md`

GPUs require special drivers and software which are not pre-installed on
[Dataproc](https://cloud.google.com/dataproc) clusters by default.
This initialization action installs GPU drivers for NVIDIA GPUs on the master
and worker nodes of a Dataproc cluster.

## Default versions

Specifying a supported value for the `cuda-version` metadata variable
will select compatible values for Driver, cuDNN, and NCCL from the script's
internal matrix. Default CUDA versions are typically:

* Dataproc 1.5: `11.6.2`
* Dataproc 2.0: `12.1.1`
* Dataproc 2.1: `12.4.1`
* Dataproc 2.2 & 2.3: `12.6.3`
Refer to internal arrays in `install_gpu_driver.sh` for the full matrix.)*

CUDA | Full Version | Driver | cuDNN | NCCL | Tested Dataproc Image Versions
-----| ------------ | --------- | --------- | -------| ---------------------------
11.8 | 11.8.0 | 525.147.05| 9.5.1.17 | 2.21.5 | 2.0, 2.1 (Debian, Ubuntu)
12.0 | 12.0.1 | 525.147.05| 8.8.1.3 | 2.16.5 | 2.0, 2.1 (Debian, Ubuntu)
12.4 | 12.4.1 | 550.135 | 9.1.0.70 | 2.23.4 | 2.0, 2.1 (Debian, Ubuntu); 2.2+ (Debian, Ubuntu, Rocky)
12.6 | 12.6.3 | 550.142 | 9.6.0.74 | 2.23.4 | 2.2+ (Debian, Ubuntu, Rocky)

*Note: Secure Boot is only supported on Dataproc 2.2+ images.*
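For example, to pin one of the CUDA releases above when creating a cluster directly with `gcloud`, pass the `cuda-version` metadata key. The sketch below only assembles and prints the command so it can be reviewed before running; the cluster name, region, and accelerator type are placeholder assumptions:

```bash
#!/usr/bin/env bash
# Illustrative values -- replace with your own.
REGION="us-central1"
CLUSTER_NAME="my-gpu-cluster"

# cuda-version comes from the compatibility matrix above.
CREATE_CMD="gcloud dataproc clusters create ${CLUSTER_NAME} \
  --region ${REGION} \
  --worker-accelerator type=nvidia-tesla-t4,count=2 \
  --metadata cuda-version=12.4.1 \
  --initialization-actions gs://goog-dataproc-initialization-actions-${REGION}/gpu/install_gpu_driver.sh"

# Print rather than execute, so the command can be reviewed first.
echo "${CREATE_CMD}"
```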

**Supported Operating Systems:**

[best practices](/README.md#how-initialization-actions-are-used)
of using initialization actions in production.

This initialization action will install NVIDIA GPU drivers and the CUDA toolkit.
Optional components like cuDNN, NCCL, and PyTorch can be included via
metadata.

The recommended way to create a Dataproc cluster with GPU support, especially for environments requiring custom images, Secure Boot, or private networks with proxies, is to use the tooling provided in the [GoogleCloudDataproc/cloud-dataproc](https://github.com/GoogleCloudDataproc/cloud-dataproc) repository. This approach simplifies configuration and automates the `gcloud` command generation.

**Steps:**

1. **Clone the `cloud-dataproc` Repository:**
```bash
git clone https://github.com/GoogleCloudDataproc/cloud-dataproc.git
cd cloud-dataproc/gcloud
```

2. **Configure Your Environment:**
* Copy the sample configuration: `cp env.json.sample env.json`
* Edit `env.json` to match your desired cluster setup.

**Note on JSON Examples:** Any lines in the JSON examples in this document starting with `//` are comments for explanation and must be removed before using the JSON.

**Key `env.json` Properties:**

* **Required:**
* `PROJECT_ID`: Your Google Cloud Project ID.
* `REGION`: The GCP region for the cluster.
* `ZONE`: The GCP zone within the region.
* `BUCKET`: A GCS bucket for staging and temporary files.
* **GPU Related:**
* `GPU_MASTER_ACCELERATORS`: e.g., "type=nvidia-tesla-t4,count=1" (Optional, can be omitted if no GPU on master)
* `GPU_WORKER_ACCELERATORS`: e.g., "type=nvidia-tesla-t4,count=1" (Optional, to have GPUs on workers)
* **Image:**
* `DATAPROC_IMAGE_VERSION`: e.g., "2.2-debian12" (Required if not using `CUSTOM_IMAGE_NAME`)
* `CUSTOM_IMAGE_NAME`: Set this to the name of your pre-built custom image if you have one (e.g., from the Secure Boot image building process).
* **Optional (Defaults & Advanced):**
* `MACHINE_TYPE_MASTER`, `MACHINE_TYPE_WORKER`
* `NUM_MASTERS`, `NUM_WORKERS`
* `BOOT_DISK_SIZE`, `BOOT_DISK_TYPE`
* `NETWORK`, `SUBNET`: For specifying existing networks.
* `INTERNAL_IP_ONLY`: Set to `true` for private clusters.
* **Proxy Settings:** `SWP_IP`, `SWP_PORT`, `SWP_HOSTNAME`, `PROXY_PEM_URI`, `PROXY_PEM_HASH` (for private networks with Secure Web Proxy).
* **Secure Boot:** `ENABLE_SECURE_BOOT` (set to `true` if using a Secure Boot enabled custom image).

The `install_gpu_driver.sh` initialization action is automatically added by the scripts in `bin/` if any `GPU_*_ACCELERATORS` are defined in `env.json`.

3. **Create the Cluster:**
Make sure you are in the `cloud-dataproc/gcloud` directory before running these commands.
* To create a new environment (VPC, subnet, proxy if configured) and the cluster:
```bash
bash bin/create-dpgce
```
* To recreate the cluster in an existing environment defined by `env.json`:
```bash
bash bin/recreate-dpgce
```

These scripts will parse `env.json` and construct the appropriate `gcloud dataproc clusters create` command with all necessary flags, including the initialization action, metadata, scopes, and network settings.
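As a rough sketch of what those wrappers do (the real scripts handle many more options), the snippet below reads two values from an `env.json` and assembles the corresponding flags, adding the initialization action when a GPU accelerator is configured; all values are placeholders:

```bash
# Minimal stand-in for an env.json (placeholder values).
cat > /tmp/env.json <<'EOF'
{"PROJECT_ID": "my-project", "REGION": "us-west4",
 "GPU_WORKER_ACCELERATORS": "type=nvidia-tesla-t4,count=1"}
EOF

# Pull values out with python3 (jq works too, if installed).
get() { python3 -c "import json,sys; print(json.load(open('/tmp/env.json')).get(sys.argv[1],''))" "$1"; }

REGION=$(get REGION)
ACCEL=$(get GPU_WORKER_ACCELERATORS)

FLAGS="--region ${REGION}"
# Per the note above, defining a GPU accelerator also pulls in the init action.
if [ -n "${ACCEL}" ]; then
  FLAGS="${FLAGS} --worker-accelerator ${ACCEL} --initialization-actions gs://goog-dataproc-initialization-actions-${REGION}/gpu/install_gpu_driver.sh"
fi
echo "${FLAGS}"
```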

For detailed instructions on Secure Boot custom image creation and private network setup, see the "Building Custom Images with Secure Boot and Proxy Support" section below.

### Using for Custom Image Creation

This script accepts the following metadata parameters:
* `cudnn-version`: (Optional) Specify cuDNN version (e.g., `8.9.7.29`).
* `nccl-version`: (Optional) Specify NCCL version.
* `include-pytorch`: (Optional) `yes`|`no`. Default: `no`.
If `yes`, installs PyTorch, Numba, TensorFlow, RAPIDS, and PySpark
in a Conda environment (named by `gpu-conda-env`). **This also registers
the created Conda environment as a Jupyter kernel.**
* `gpu-conda-env`: (Optional) Name for the PyTorch Conda environment.
Default: `dpgce`.
* `container-runtime`: (Optional) E.g., `docker`, `containerd`, `crio`.
For NVIDIA Container Toolkit configuration. Auto-detected if not specified.
* `http-proxy`: (Optional) Proxy address and port for HTTP requests (e.g., `your-proxy.com:3128`).
* `https-proxy`: (Optional) Proxy address and port for HTTPS requests (e.g., `your-proxy.com:3128`). Defaults to `http-proxy` if not set.
* `proxy-uri`: (Optional) A single proxy URI for both HTTP and HTTPS. Overridden by `http-proxy` or `https-proxy` if they are set.
* `no-proxy`: (Optional) Comma or space-separated list of hosts/domains to bypass the proxy. Defaults include localhost, metadata server, and Google APIs. User-provided values are appended to the defaults.
* `http-proxy-pem-uri`: (Optional) A `gs://` path to the
PEM-encoded CA certificate file for the proxy specified in
`http-proxy`/`https-proxy`. Required if the proxy uses TLS with a certificate not in the default system trust store. This certificate will be added to the system, Java, and Conda trust stores, and proxy connections will use HTTPS.
* `invocation-type`: (For Custom Images) Set to `custom-images` by image
building tools. Not typically set by end-users creating clusters.
* **Secure Boot Signing Parameters:** Used if Secure Boot is enabled and
  ```
  modulus_md5sum=<md5sum-of-your-mok-key-modulus>
  ```
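Several of the proxy-related keys above are typically supplied together. The sketch below simply assembles a `--metadata` value from them; host names, ports, and bucket paths are placeholders:

```bash
# Placeholder proxy endpoint and CA location.
PROXY="your-proxy.example.com:3128"
PEM_URI="gs://your-bucket/proxy-ca.pem"

# Metadata keys are joined with commas on the gcloud command line.
METADATA="http-proxy=${PROXY}"
METADATA="${METADATA},https-proxy=${PROXY}"
METADATA="${METADATA},http-proxy-pem-uri=${PEM_URI}"
METADATA="${METADATA},no-proxy=internal.example.com"

echo "--metadata ${METADATA}"
```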

### Network Evaluation

This script includes a network evaluation function (`evaluate_network`) that runs early during execution. It gathers detailed information about the instance's network environment, including:

* GCP Metadata (instance, project, network interface details)
* Local IP and routing table information (`ip` commands)
* DNS configuration (`/etc/resolv.conf`)
* Proxy settings from metadata
* External connectivity tests (e.g., public IP, reachability of key services)
* Kerberos configuration status

The results are stored in `/run/dpgce-network.json` and printed to the log. This allows subsequent script logic to make more informed decisions based on the actual network state. Helper functions like `has_default_route()`, `is_proxy_enabled()`, and `can_reach_gstatic()` are available to query this information.
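On a running node, the recorded evaluation can be inspected directly. A minimal sketch (the file exists only after the initialization action has run):

```bash
NET_JSON="/run/dpgce-network.json"
if [ -r "${NET_JSON}" ]; then
  # Pretty-print the recorded network state.
  python3 -m json.tool "${NET_JSON}"
else
  echo "no network evaluation found at ${NET_JSON}"
fi
```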

### Enhanced Proxy Support

This script includes robust support for environments requiring an HTTP/HTTPS proxy:

* **Configuration:** Use the `http-proxy`, `https-proxy`, or `proxy-uri` metadata to specify your proxy server (host:port).
* **Custom CA Certificates:** If your proxy uses a custom CA (e.g., self-signed), provide the CA certificate in PEM format via the `http-proxy-pem-uri` metadata (as a `gs://` path).
* **Integrity Check:** Optionally, provide the SHA256 hash of the PEM file via `http-proxy-pem-sha256` to ensure the downloaded file is correct.
* The script will:
* Install the CA into the system trust store (`update-ca-certificates` or `update-ca-trust`).
* Add the CA to the Java cacerts trust store.
* Configure Conda to use the system trust store.
* Switch proxy communications to use HTTPS.
* **Tool Configuration:** The script automatically configures `curl`, `apt`, `dnf`, `gpg`, `pip`, and Java to use the specified proxy settings and custom CA if provided. This is now guided by the results of the `evaluate_network` function.
* **Bypass:** The `no-proxy` metadata allows specifying hosts to bypass the proxy. Defaults include `localhost`, the metadata server, `.google.com`, and `.googleapis.com` to ensure essential services function correctly.
* **Verification:** The script performs connection tests to the proxy and attempts to reach external sites (google.com, nvidia.com) through the proxy to validate the configuration before proceeding with downloads.
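The authoritative default bypass list lives in `install_gpu_driver.sh`; as an approximation, the final `no-proxy` list is assembled roughly like this (the user-supplied value is a hypothetical example):

```bash
# Defaults per the Bypass bullet above (exact entries are defined in
# install_gpu_driver.sh; this list is an approximation).
DEFAULT_NO_PROXY="localhost,127.0.0.1,metadata.google.internal,.google.com,.googleapis.com"

# Hypothetical user-supplied no-proxy metadata value (space-separated here).
USER_NO_PROXY="internal.example.com .corp.example.net"

# Normalize spaces to commas and append to the defaults.
NO_PROXY="${DEFAULT_NO_PROXY},$(printf '%s' "${USER_NO_PROXY}" | tr ' ' ',')"
echo "${NO_PROXY}"
```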

### Loading Built Kernel Module & Secure Boot

When the script needs to build NVIDIA kernel modules from source (e.g., using
not suitable), special considerations apply if Secure Boot is enabled.
or `dmesg` output for errors like "Operation not permitted" or messages
related to signature verification failure.

## Building Custom Images with Secure Boot and Proxy Support

For environments requiring NVIDIA drivers to be signed for Secure Boot, especially when operating behind an HTTP/S proxy, you must first build a custom Dataproc image. This process uses tools from the [GoogleCloudDataproc/custom-images](https://github.com/GoogleCloudDataproc/custom-images) repository, specifically the scripts within the `examples/secure-boot/` directory.

**Base Image:** Typically Dataproc 2.2-debian12 or newer.

**Process Overview:**

1. **Clone `custom-images` Repository:**
```bash
git clone https://github.com/GoogleCloudDataproc/custom-images.git
cd custom-images
```

2. **Configure Build:** Set up `env.json` with your project, network, and bucket details. See the `examples/secure-boot/env.json.sample` in the `custom-images` repo.

3. **Prepare Signing Keys:** Ensure Secure Boot signing keys are available in GCP Secret Manager. Use `examples/secure-boot/create-key-pair.sh` from the `custom-images` repo to create/manage these.

4. **Build Docker Image:** Build the builder environment: `docker build -t dataproc-secure-boot-builder:latest .`

5. **Run Image Generation:** Use `generate_custom_image.py` within the Docker container, typically orchestrated by `examples/secure-boot/pre-init.sh`. The core customization script `examples/secure-boot/install_gpu_driver.sh` handles driver installation, proxy setup, and module signing.

* Refer to the [Secure Boot example documentation](https://github.com/GoogleCloudDataproc/custom-images/tree/master/examples/secure-boot) for detailed `docker run` commands and metadata requirements (proxy settings, secret names, etc.).

### Launching a Cluster with the Secure Boot Custom Image

Once you have successfully built a custom image with signed drivers, you can create a Dataproc cluster with Secure Boot enabled.

**Important:** To launch a Dataproc cluster with the `--shielded-secure-boot` flag and have NVIDIA drivers function correctly, you MUST use a custom image created through the process detailed above. Standard Dataproc images do not contain the necessary signed modules.

**Network and Cluster Setup:**

To create the cluster in a private network environment with a Secure Web Proxy, use the scripts from the [GoogleCloudDataproc/cloud-dataproc](https://github.com/GoogleCloudDataproc/cloud-dataproc) repository:

1. **Clone `cloud-dataproc` Repository:**
```bash
git clone https://github.com/GoogleCloudDataproc/cloud-dataproc.git
cd cloud-dataproc/gcloud
```

2. **Configure Environment:**
* Copy `env.json.sample` to `env.json`.
* Edit `env.json` with your project details, ensuring you specify the custom image name and any necessary proxy details if you intend to run in a private network. Example:
```json
{
"PROJECT_ID": "YOUR_GCP_PROJECT_ID",
"REGION": "us-west4",
"ZONE": "us-west4-a",
"BUCKET": "YOUR_STAGING_BUCKET",
"TEMP_BUCKET": "YOUR_TEMP_BUCKET",
"CUSTOM_IMAGE_NAME": "YOUR_BUILT_IMAGE_NAME",
"PURPOSE": "secure-boot-cluster",
// Add these for a private, proxied environment
"PRIVATE_RANGE": "10.43.79.0/24",
"SWP_RANGE": "10.44.79.0/24",
"SWP_IP": "10.43.79.245",
"SWP_PORT": "3128",
"SWP_HOSTNAME": "swp.your-project.example.com"
// ... other variables as needed
}
```
* Set `CUSTOM_IMAGE_NAME` to the image you built in the `custom-images` process.

3. **Create the Private Environment and Cluster:**
This script sets up the VPC, subnets, Secure Web Proxy, and then creates the Dataproc cluster using the custom image. The `--shielded-secure-boot` flag is handled internally by the scripts when a `CUSTOM_IMAGE_NAME` is provided.
Make sure you are in the `cloud-dataproc/gcloud` directory, then run:
```bash
bash bin/create-dpgce-private
```
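Since JSON has no comment syntax, any `//` annotation lines (like those in the example above) must be stripped before the file is used. A small sketch using a throwaway file:

```bash
# Strip whole-line // comments from an annotated env.json.
cat > /tmp/env.json.annotated <<'EOF'
{
  "PROJECT_ID": "my-project",
  // remove me before use
  "REGION": "us-west4"
}
EOF

grep -v '^[[:space:]]*//' /tmp/env.json.annotated > /tmp/env.json

# Confirm the result parses as JSON.
python3 -m json.tool /tmp/env.json
```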

**Verification:**

1. SSH into the master node of the created cluster.
2. Check driver status: `sudo nvidia-smi`
3. Verify module signature: `sudo modinfo nvidia | grep signer` (should show your custom CA).
4. Check for errors: `dmesg | grep -iE "Secure Boot|NVRM|nvidia"`
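Step 3 is easy to script, since `modinfo` emits plain `key: value` text. The sketch below embeds canned output so the parsing can be shown off-cluster; on a real node you would feed it `sudo modinfo nvidia`, and the signer string depends on your CA:

```bash
# Canned `modinfo nvidia` excerpt (on a real node: sudo modinfo nvidia).
MODINFO_OUTPUT='filename:       /lib/modules/6.1.0/kernel/drivers/video/nvidia.ko
license:        NVIDIA
signer:         Your Custom CA
sig_key:        12:34:56'

# Extract the value of the signer: line.
SIGNER=$(printf '%s\n' "${MODINFO_OUTPUT}" | sed -n 's/^signer:[[:space:]]*//p')
echo "module signed by: ${SIGNER}"
```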

### Verification

1. Once the cluster has been created, you can access the Dataproc cluster and
handles metric creation and reporting.
* **Installation Failures:** Examine the initialization action log on the
affected node, typically `/var/log/dataproc-initialization-script-0.log`
(or a similar name if multiple init actions are used).
* **Network/Proxy Issues:** If using a proxy, double-check the `http-proxy`, `https-proxy`, `proxy-uri`, `no-proxy`, `http-proxy-pem-uri`, and `http-proxy-pem-sha256` metadata settings. Ensure the proxy allows access to NVIDIA domains, GitHub, and package repositories. Check the init action log for curl errors or proxy test failures. The `/run/dpgce-network.json` file contains detailed network diagnostics.
* **GPU Agent Issues:** If the agent was installed (`install-gpu-agent=true`),
check its service logs using `sudo journalctl -u gpu-utilization-agent.service`.
* **Driver Load or Secure Boot Problems:** Review `dmesg` output and
* The script extensively caches downloaded artifacts (drivers, CUDA `.run`
files) and compiled components (kernel modules, NCCL, Conda environments)
to a GCS bucket. This bucket is typically specified by the
`dataproc-temp-bucket` cluster property or metadata. Downloads and cache operations are proxy-aware.
* **First Run / Cache Warming:** Initial runs on new configurations (OS,
kernel, or driver version combinations) that require source compilation
(e.g., for NCCL or kernel modules when no pre-compiled version is
Expand All @@ -324,4 +425,4 @@ handles metric creation and reporting.
Debian-based systems, including handling of archived backports repositories
to ensure dependencies can be met.
* Tested primarily with Dataproc 2.0+ images. Support for older Dataproc
1.5 images is limited.