
CTM-397: Replace GKE/Helm Galaxy path with GCE VM-based deployment#4904

Open
LizBaldo wants to merge 20 commits into develop from CTM-397-deploy-galaxy-on-GCE


@LizBaldo LizBaldo commented Apr 7, 2026

Collaborator

Summary

Replaces the GKE/Helm-based Galaxy deployment path with a GCE VM-based deployment using galaxy-k8s-boot. Rather than provisioning a full GKE cluster and deploying Galaxy via Helm, Leo now creates a single GCE VM that runs Galaxy via Ansible/microk8s.

What changed

Galaxy VM provisioning (GKEInterpreter.installGalaxyVm)

  • Creates a GCE VM with a boot disk, data disk, and PostgreSQL disk
  • Passes galaxy-user-email GCE metadata (from app.auditInfo.creator) so the workspace user becomes the Galaxy admin
  • Gets or creates a galaxy-batch-runner service account in the user's project
  • Grants the pet SA roles/batch.jobsEditor at the project level and roles/iam.serviceAccountUser on the Batch SA, so the VM can submit GCP Batch jobs
  • Creates an NFS firewall rule (leonardo-galaxy-allow-nfs-for-batch) so Batch VMs can reach the Galaxy VM's NFS server (TCP/UDP 2049 and 111)
  • Polls the instance until it has an external IP; stores it as loadBalancerIp so the Leo proxy can reach the VM across VPC boundaries
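
The last step above (polling until the instance has an external IP) can be sketched roughly as follows. This is a minimal illustration, not Leo's implementation: `InstanceView`, `pollForExternalIp`, and the attempt/delay parameters are hypothetical names, and the real code presumably polls the Compute API inside an effect type rather than blocking a thread.

```scala
import scala.annotation.tailrec

object ExternalIpPolling {
  // Hypothetical stand-in for the instance data read from the Compute API.
  final case class InstanceView(name: String, externalIp: Option[String])

  // Calls `fetch` up to `maxAttempts` times, sleeping `delayMillis` between tries,
  // and returns the first external IP seen, or None if none appears in time.
  def pollForExternalIp(
    fetch: () => InstanceView,
    maxAttempts: Int,
    delayMillis: Long
  ): Option[String] = {
    @tailrec
    def loop(attemptsLeft: Int): Option[String] =
      if (attemptsLeft <= 0) None
      else
        fetch().externalIp match {
          case Some(ip) => Some(ip)
          case None =>
            // Only sleep if another attempt will follow.
            if (attemptsLeft > 1) Thread.sleep(delayMillis)
            loop(attemptsLeft - 1)
        }
    loop(maxAttempts)
  }
}
```

The returned address is what would then be stored as loadBalancerIp; a `None` result would correspond to the app failing to become ready.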

Network topology and IP choice

  • Leo's GKE cluster is in Leo's GCP project; Galaxy VMs are created in the user's workspace project. The two VPCs are not peered, so the VM's internal IP (10.x.x.x) is not routable from Leo's pod
  • The leonardo-allow-http firewall rule (TCP port 80, source 0.0.0.0/0, targeting VMs with the leonardo network tag) allows Leo to reach the VM's external IP. Galaxy VMs are created with the leonardo tag
  • Both the readiness health check (isVmReachable) and the Akka HTTP proxy use the external IP

⚠️ Security note: leonardo-allow-http currently allows port 80 from 0.0.0.0/0, meaning the Galaxy VM is reachable directly from the internet, bypassing Leo's workspace-level authorization. This is tracked as a follow-up — options include restricting source ranges to Leo's GKE node CIDR, a GCP service-account-based firewall rule, or a shared-secret header enforced by Galaxy's nginx. See PR discussion for details.

Galaxy VM readiness health check

  • isProxyAvailable routes through Leo's own proxy hostname; in BEE environments the wildcard DNS resolves to the ingress controller's external IP, unreachable via hairpin NAT from within the GKE pod → TCP timeout
  • Added AppDAO.isVmReachable(ip, port) — a direct http4s HTTP GET to http://<externalIp>:80/. No proxy hostname resolution required. MockAppDAO returns IO.pure(isUp) for tests
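
For illustration, a direct reachability check of this kind might look like the sketch below. The PR uses http4s; this version swaps in the JDK 11+ `java.net.http.HttpClient` so that it stays self-contained. The `VmReachability` object, the timeout parameter, and the choice to treat any HTTP response as "reachable" are assumptions of the sketch, not details taken from the PR.

```scala
import java.net.URI
import java.net.http.{HttpClient, HttpRequest, HttpResponse}
import java.time.Duration
import scala.util.Try

object VmReachability {
  // Builds the direct health-check URL, e.g. http://203.0.113.7:80/
  def healthUri(ip: String, port: Int): URI = URI.create(s"http://$ip:$port/")

  // Returns true if the VM sends back any HTTP response within the timeout.
  // Connection failures and timeouts (IOExceptions) are reported as false.
  // Assumption: any status code counts as "reachable".
  def isVmReachable(ip: String, port: Int, timeout: Duration = Duration.ofSeconds(5)): Boolean =
    Try {
      val client = HttpClient.newBuilder().connectTimeout(timeout).build()
      val request = HttpRequest.newBuilder(healthUri(ip, port)).timeout(timeout).GET().build()
      client.send(request, HttpResponse.BodyHandlers.discarding()).statusCode()
    }.isSuccess
}
```

Because the URL is built from the IP alone, no proxy hostname or wildcard DNS lookup is involved, which is the property the bullet above relies on.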

Leo proxy: HTTP support for Galaxy VM backends

  • Added useHttp: Boolean = false to HostReady — when true the proxy connects via plain HTTP port 80 (ws:// for WebSocket) instead of HTTPS port 443
  • KubernetesDnsCache sets useHttp = true for AppType.Galaxy apps and maps the fake proxy hostname to the VM's external IP
  • ProxyService.handleHttpRequest / handleWebSocketRequest branch on useHttp; all non-Galaxy backends are unchanged
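
A rough sketch of the branching described above, under simplifying assumptions: `HostReady` here is a cut-down stand-in with only a hostname and the new flag, and `httpBase` / `wsBase` are hypothetical helpers. The `wss://` scheme for non-Galaxy WebSocket backends is assumed, not stated in the PR.

```scala
object ProxyTarget {
  // Simplified stand-in for Leo's HostReady; only `useHttp` comes from the PR text.
  final case class HostReady(hostname: String, useHttp: Boolean = false)

  // Base URL for ordinary HTTP requests to the backend.
  def httpBase(host: HostReady): String =
    if (host.useHttp) s"http://${host.hostname}:80"
    else s"https://${host.hostname}:443"

  // Base URL for WebSocket connections to the backend (wss:// is an assumption).
  def wsBase(host: HostReady): String =
    if (host.useHttp) s"ws://${host.hostname}:80"
    else s"wss://${host.hostname}:443"
}
```

Defaulting `useHttp` to false keeps every existing call site on the HTTPS path, which matches the claim that non-Galaxy backends are unchanged.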

Leo proxy: path handling for Galaxy VM (ProxyService.proxyAppRequest)

  • Leo forwards the full path (e.g. /proxy/google/v1/apps/{project}/{app}/galaxy/...) to the Galaxy VM unchanged
  • galaxy-k8s-boot's ansible playbook configures Galaxy's nginx ingress.path to the value of galaxy_prefix, which Leo passes as a GCE metadata item (galaxy-url-prefix). Galaxy's nginx therefore serves at the full Leo proxy path, so all requests route correctly without any path rewriting in Leo
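
The "no rewriting" property can be illustrated with a small sketch. The helper names are hypothetical, and the prefix format is inferred from the example path in the first bullet rather than copied from the code.

```scala
object GalaxyPrefix {
  // Hypothetical helper: the prefix Leo would pass as the galaxy-url-prefix metadata item.
  def galaxyUrlPrefix(project: String, app: String): String =
    s"/proxy/google/v1/apps/$project/$app/galaxy"

  // True when a path forwarded by Leo falls under the prefix Galaxy's nginx serves.
  def isServedByGalaxy(requestPath: String, prefix: String): Boolean =
    requestPath == prefix || requestPath.startsWith(prefix + "/")
}
```

Since the same string is used both as the proxy route and as Galaxy's ingress path, a forwarded path matches without Leo having to strip or rewrite anything.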

NFS PVC size: GB → GiB conversion fix (GKEInterpreter.installGalaxyVm)

  • pvSizeGi = nfsDisk.size.gb - 11 treated decimal GB as binary GiB. For a 500 GB disk: the disk holds ~466 GiB but Leo requested 489 GiB → NFS provisioner fails with insufficient available space, leaving all Galaxy pods Pending
  • Fixed to convert first: diskSizeGiB = (nfsDisk.size.gb.toLong * 1000^3) / 1024^3, then subtract 11 GiB overhead
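
The arithmetic can be checked with a short sketch. The object and method names are illustrative, and the use of integer division (rounding down to whole GiB) is an assumption about how the fix is written.

```scala
object NfsPvcSize {
  private val BytesPerGB: Long = 1000L * 1000L * 1000L
  private val BytesPerGiB: Long = 1024L * 1024L * 1024L
  val OverheadGiB: Long = 11L

  // Whole GiB that fit on a disk whose size is given in decimal GB (rounds down).
  def diskSizeGiB(sizeGb: Int): Long = (sizeGb.toLong * BytesPerGB) / BytesPerGiB

  // PVC request after the fix: convert to GiB first, then subtract the overhead.
  def pvSizeGi(sizeGb: Int): Long = diskSizeGiB(sizeGb) - OverheadGiB

  // The pre-fix calculation, kept only for comparison.
  def oldPvSizeGi(sizeGb: Int): Long = sizeGb.toLong - OverheadGiB
}
```

For a 500 GB disk, `diskSizeGiB(500)` is 465 and `pvSizeGi(500)` is 454, whereas `oldPvSizeGi(500)` is the 489 that exceeded the disk's capacity.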

Lifecycle: restore from existing disks

  • restore = msg.appType == AppType.Galaxy && msg.createDisk.isEmpty: when a Galaxy app is created without a new disk, the disks already exist (prior app was deleted keeping disks)
  • In restore mode: skips creating the PostgreSQL disk; passes restore_galaxy=true metadata to Ansible
  • CreateAppParams.restore: Boolean propagates the flag through to installGalaxyVm
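
A minimal sketch of how the restore flag could be derived and turned into VM metadata. The types are simplified stand-ins (the real message and `AppType` carry more fields and cases), and omitting the `restore_galaxy` key entirely when not restoring is an assumption.

```scala
object RestoreMode {
  sealed trait AppType
  object AppType {
    case object Galaxy extends AppType
    case object Cromwell extends AppType
  }

  // Simplified create message: createDisk is Some(...) only when a new disk is requested.
  final case class CreateAppMessage(appType: AppType, createDisk: Option[String])

  // Restore applies only to Galaxy apps created without a new disk.
  def isRestore(msg: CreateAppMessage): Boolean =
    msg.appType == AppType.Galaxy && msg.createDisk.isEmpty

  // Metadata for the VM; restore_galaxy=true is added only in restore mode.
  def vmMetadata(creatorEmail: String, restore: Boolean): Map[String, String] = {
    val base = Map("galaxy-user-email" -> creatorEmail)
    if (restore) base + ("restore_galaxy" -> "true") else base
  }
}
```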

Lifecycle: delete keeping disks

  • DeleteAppMessage(diskId = None) → VM is deleted, both disks are preserved
  • DeleteAppMessage(diskId = Some(...)) → VM + both disks deleted
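
The two cases reduce to a pattern match on `diskId`. In this sketch `DeleteAppMessage` is cut down to two fields, and the `Cleanup` type is a hypothetical name used only to make the outcome explicit.

```scala
object DeleteMode {
  // Simplified stand-in for the real message.
  final case class DeleteAppMessage(appName: String, diskId: Option[Long])

  sealed trait Cleanup
  case object DeleteVmOnly extends Cleanup     // disks are preserved for a later restore
  case object DeleteVmAndDisks extends Cleanup // VM and both disks are removed

  def cleanupFor(msg: DeleteAppMessage): Cleanup =
    msg.diskId match {
      case None    => DeleteVmOnly
      case Some(_) => DeleteVmAndDisks
    }
}
```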

Config cleanup

  • Removed gcpBatchServiceAccountEmail from reference.conf / GalaxyVmConfig / Config.scala — Leo now creates the SA dynamically via getOrCreateServiceAccount instead of relying on a pre-configured email

Architecture notes

| Property | GKE Galaxy (before) | GCE VM Galaxy (now) |
|---|---|---|
| Backend | GKE cluster + Helm | Single GCE VM (galaxy-k8s-boot anvil branch) |
| VM bootstrap | N/A | GCE user-data cloud-init via guest agent |
| Proxy backend protocol | HTTPS port 443 (nginx ingress TLS) | HTTP port 80 (nginx reverse proxy) |
| Backend IP | Ingress load balancer IP (external) | VM external IP |
| Proxy reachability | Via GKE LoadBalancer service IP | Via leonardo-allow-http firewall (0.0.0.0/0 → port 80, leonardo tag) |
| Readiness check | isProxyAvailable via proxy hostname | isVmReachable direct HTTP to external IP |
| Batch jobs | N/A | GCP Batch via galaxy-batch-runner SA |

Security comparison: old GKE-based vs. new VM-based

| Property | Old (GKE-based) | New (VM-based) |
|---|---|---|
| Protocol | HTTPS (mTLS) | Plain HTTP |
| Target IP | GKE load balancer (external) | VM external IP |
| Port exposed | 443 | 80 |
| Firewall source range | 0.0.0.0/0 | 0.0.0.0/0 |
| Certificate validation | Yes, Leo-issued cert on nginx | None |

Regressions introduced:

  1. No encryption — traffic between Leo's pod and the Galaxy VM crosses the public internet in plaintext
  2. Direct VM access — port 80 is open to 0.0.0.0/0, so anyone who discovers the VM's external IP can reach Galaxy directly, bypassing Leo's authentication

Alternatives for follow-up:

  • Option A — VPC peering / Private Service Connect (recommended structural fix): peer Leo's project VPC with the user's workspace VPC so Leo can reach the VM on its internal IP; the 0.0.0.0/0 firewall rule is no longer needed
  • Option B — HTTPS on the Galaxy VM (closest to old security posture): configure nginx on the Galaxy VM with Leo's CA certificate (as Jupyter VMs do); flip useHttp = false for Galaxy; requires provisioning Leo certs onto the VM during installGalaxyVm
  • Option C — GCP IAP or Cloud Armor (lower-effort mitigation): restricts direct VM access without VPC changes, but does not encrypt the Leo→VM leg

Test plan

  • Unit tests pass (GKEInterpreterSpec, LeoPubsubMessageSubscriberSpec)
  • Scala formatting clean
  • BEE: create Galaxy app → VM boots, Ansible runs, status goes to Running
  • BEE: access Galaxy through Leo proxy URL
  • BEE: verify workspace user email is the Galaxy admin (not a hardcoded address)
  • BEE: delete app keeping disks → VM deleted, disks remain
  • BEE: re-create app from existing disks → restore_galaxy=true passed to Ansible, Galaxy restores state
  • Follow-up: restrict leonardo-allow-http source range from 0.0.0.0/0 to Leo's GKE node CIDR

@LizBaldo LizBaldo requested a review from a team as a code owner April 7, 2026 17:39
Liz Baldo and others added 4 commits April 7, 2026 13:51
At Galaxy VM creation, grant the pet SA roles/batch.jobsEditor on the
user's project so it can submit and monitor GCP Batch jobs. When the
Batch SA lives in the same project, also grant serviceAccountUser on it;
cross-project Batch SAs must have that binding configured externally.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

codecov Bot commented Apr 8, 2026


Codecov Report

❌ Patch coverage is 79.89418% with 38 lines in your changes missing coverage. Please review.
✅ Project coverage is 73.86%. Comparing base (5b339ea) to head (aa73c84).
⚠️ Report is 8 commits behind head on develop.

| Files with missing lines | Patch % | Lines |
|---|---|---|
| .../dsde/workbench/leonardo/util/GKEInterpreter.scala | 85.43% | 22 Missing ⚠️ |
| ...itute/dsde/workbench/leonardo/dao/HttpAppDAO.scala | 0.00% | 9 Missing ⚠️ |
| ...de/workbench/leonardo/dns/KubernetesDnsCache.scala | 0.00% | 3 Missing ⚠️ |
| ...e/dsde/workbench/leonardo/dao/HttpJupyterDAO.scala | 0.00% | 2 Missing ⚠️ |
| ...workbench/leonardo/http/service/ProxyService.scala | 75.00% | 2 Missing ⚠️ |

@@             Coverage Diff             @@
##           develop    #4904      +/-   ##
===========================================
- Coverage    74.08%   73.86%   -0.22%     
===========================================
  Files          131      131              
  Lines        11100    11162      +62     
  Branches       895      920      +25     
===========================================
+ Hits          8223     8245      +22     
- Misses        2877     2917      +40     
| Files with missing lines | Coverage Δ |
|---|---|
| ...titute/dsde/workbench/leonardo/config/Config.scala | 97.79% <100.00%> (+0.03%) ⬆️ |
| ...orkbench/leonardo/config/KubernetesAppConfig.scala | 95.00% <ø> (ø) |
| ...stitute/dsde/workbench/leonardo/dao/ProxyDAO.scala | 25.00% <ø> (ø) |
| .../leonardo/monitor/LeoPubsubMessageSubscriber.scala | 76.80% <100.00%> (-0.20%) ⬇️ |
| ...workbench/leonardo/util/BuildHelmChartValues.scala | 97.87% <ø> (-0.47%) ⬇️ |
| ...tute/dsde/workbench/leonardo/util/GKEAlgebra.scala | 80.00% <ø> (-20.00%) ⬇️ |
| ...e/dsde/workbench/leonardo/dao/HttpJupyterDAO.scala | 19.44% <0.00%> (-0.56%) ⬇️ |
| ...workbench/leonardo/http/service/ProxyService.scala | 76.88% <75.00%> (-0.67%) ⬇️ |
| ...de/workbench/leonardo/dns/KubernetesDnsCache.scala | 0.00% <0.00%> (ø) |
| ...itute/dsde/workbench/leonardo/dao/HttpAppDAO.scala | 0.00% <0.00%> (ø) |
... and 1 more

... and 2 files with indirect coverage changes



Legend:
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 5b339ea...aa73c84.


@LizBaldo LizBaldo requested a review from lucymcnatt April 8, 2026 15:58
@LizBaldo LizBaldo requested a review from afgane April 13, 2026 13:00
@LizBaldo

Collaborator Author

I am currently blocked from testing further because I lack permission on the Galaxy image used to boot the VM:
Required 'compute.images.useReadOnly' permission for 'projects/anvil-and-terra-development/global/images/galaxy-k8s-boot-v2026-02-25'

I think I would need the Galaxy team to grant allAuthenticatedUsers access to the image in the anvil-and-terra-development project.

@LizBaldo

Collaborator Author

I am currently blocked from testing further because I lack permission on the Galaxy image used to boot the VM: Required 'compute.images.useReadOnly' permission for 'projects/anvil-and-terra-development/global/images/galaxy-k8s-boot-v2026-02-25'

I think I would need the Galaxy team to grant allAuthenticatedUsers access to the image in the anvil-and-terra-development project.

The Galaxy team made the image public so I am currently unblocked :)

@aednichols

Contributor

the Galaxy VM is reachable directly from the internet

I'm not sure this is a concern, don't the instances live in a VPC with NAT?

@LizBaldo

Collaborator Author

the Galaxy VM is reachable directly from the internet

I'm not sure this is a concern, don't the instances live in a VPC with NAT?

I agree, but this is a departure from how we used to handle it. I am not sure the compliance review covered this, so I want to triple-check before merging.

Liz Baldo and others added 4 commits May 8, 2026 13:21
…, gcpBatchSaProject parsing

- Pass app.auditInfo.creator as galaxy-user-email GCE metadata so the
  actual workspace user (not dev@galaxyproject.org) becomes the Galaxy admin
- Fix scala.io.Source resource leak in installGalaxyVm using scala.util.Using
- Update sourceImage to galaxy-k8s-boot-v2026-06-10 and gitBranch to "anvil"
- Fix HOST_IP to use GCE metadata server instead of external ifconfig.me
- Fix gcpBatchSaProject SA email parsing: lift(1) + stripSuffix instead of
  lastOption + replace to avoid matching suffix in unexpected positions
- Correct stale comments: galaxy_url_prefix → galaxy_prefix, dev → anvil branch,
  wrong "internal IP" comment corrected to "external IP"

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
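
The gcpBatchSaProject parsing fix named in the commit message above (lift(1) + stripSuffix) can be sketched as follows. This is an illustration only: the object and method names are hypothetical, and it assumes the usual `name@project.iam.gserviceaccount.com` form for user-managed service account emails.

```scala
object BatchSaProject {
  private val Suffix = ".iam.gserviceaccount.com"

  // Takes the part after '@' and strips the suffix only from its end.
  // Returns None when the input has no '@' or the remaining project id is empty.
  def projectFromSaEmail(email: String): Option[String] =
    email.split("@").lift(1).map(_.stripSuffix(Suffix)).filter(_.nonEmpty)
}
```

With `lift(1)`, an input that contains no `@` yields `None`, whereas `lastOption` would return the whole string; `stripSuffix` only removes the domain when it appears at the very end, whereas `replace` would remove it wherever it occurs.
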
The pre-baked galaxy-k8s-boot image carries cloud-init state from its build,
so cloud-init treats new VM launches as subsequent boots and skips runcmd —
causing "No startup scripts to run" in Guest Agent logs and a VM that never
bootstraps Galaxy.

Fix: pass the bootstrap script as the "startup-script" metadata key instead
of "user-data". The GCE Guest Agent always executes startup-script on boot,
regardless of cloud-init state.

galaxy-user-data.sh is reformatted from cloud-config YAML to a plain bash
script. The sudo -u debian block now uses a single-quoted heredoc delimiter
(<<'DEBIAN_EOF') to avoid apostrophes in comments breaking shell quoting.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…n playbook.yml

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>