CTM-397: Replace GKE/Helm Galaxy path with GCE VM-based deployment #4904
LizBaldo wants to merge 20 commits
At Galaxy VM creation, grant the pet SA roles/batch.jobsEditor on the user's project so it can submit and monitor GCP Batch jobs. When the Batch SA lives in the same project, also grant serviceAccountUser on it; cross-project Batch SAs must have that binding configured externally. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Codecov Report
❌ Patch coverage is …
Additional details and impacted files
@@ Coverage Diff @@
## develop #4904 +/- ##
===========================================
- Coverage 74.08% 73.86% -0.22%
===========================================
Files 131 131
Lines 11100 11162 +62
Branches 895 920 +25
===========================================
+ Hits 8223 8245 +22
- Misses 2877 2917 +40
... and 2 files with indirect coverage changes
I am currently blocked from testing further because I lack permission on the Galaxy image used for the boot VM. I think I would need the Galaxy team to make the image accessible to allAuthenticatedUsers in the anvil-and-terra-development project.
The Galaxy team made the image public, so I am currently unblocked :)
I'm not sure this is a concern. Don't the instances live in a VPC with NAT?
I agree, but this is a departure from how we used to handle it. I am not sure that the compliance review covered this, so I want to triple-check before merging.
…, gcpBatchSaProject parsing
- Pass app.auditInfo.creator as galaxy-user-email GCE metadata so the actual workspace user (not dev@galaxyproject.org) becomes the Galaxy admin
- Fix scala.io.Source resource leak in installGalaxyVm using scala.util.Using
- Update sourceImage to galaxy-k8s-boot-v2026-06-10 and gitBranch to "anvil"
- Fix HOST_IP to use GCE metadata server instead of external ifconfig.me
- Fix gcpBatchSaProject SA email parsing: lift(1) + stripSuffix instead of lastOption + replace to avoid matching suffix in unexpected positions
- Correct stale comments: galaxy_url_prefix → galaxy_prefix, dev → anvil branch, wrong "internal IP" comment corrected to "external IP"
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
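The SA email parsing fix in this commit can be illustrated with a minimal sketch. It assumes the usual `name@project-id.iam.gserviceaccount.com` shape for service account emails; the object and function names are hypothetical and the "old" variant is only a guess at the pattern the commit message describes, not Leo's actual code.

```scala
object BatchSaProject {
  // Standard suffix of user-managed service account emails.
  private val SaSuffix = ".iam.gserviceaccount.com"

  // Hypothetical helper: extract the project ID from an SA email.
  // lift(1) safely takes the part after '@' (None when there is no '@'),
  // and stripSuffix only removes the suffix at the very end of that part.
  def projectOf(saEmail: String): Option[String] =
    saEmail.split("@").toList.lift(1).map(_.stripSuffix(SaSuffix))

  // Assumed older shape, for comparison: lastOption + replace.
  // replace removes every occurrence of the suffix, and lastOption
  // returns the whole input when there is no '@' at all.
  def projectOfOld(saEmail: String): Option[String] =
    saEmail.split("@").lastOption.map(_.replace(SaSuffix, ""))
}
```

With this sketch, `BatchSaProject.projectOf("galaxy-batch-runner@my-project.iam.gserviceaccount.com")` yields `Some("my-project")`, while a string without `@` yields `None` instead of being passed through as if it were a project ID.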
The pre-baked galaxy-k8s-boot image carries cloud-init state from its build, so cloud-init treats new VM launches as subsequent boots and skips runcmd — causing "No startup scripts to run" in Guest Agent logs and a VM that never bootstraps Galaxy. Fix: pass the bootstrap script as the "startup-script" metadata key instead of "user-data". The GCE Guest Agent always executes startup-script on boot, regardless of cloud-init state. galaxy-user-data.sh is reformatted from cloud-config YAML to a plain bash script. The sudo -u debian block now uses a single-quoted heredoc delimiter (<<'DEBIAN_EOF') to avoid apostrophes in comments breaking shell quoting. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
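A minimal sketch of the metadata change described in this commit, assuming the VM metadata is assembled as a plain string map. The object name, function name, and parameters are hypothetical; only the metadata keys (`startup-script`, `galaxy-user-email`, `galaxy-url-prefix`) come from this PR.

```scala
object GalaxyVmMetadata {
  // Hypothetical helper: build the GCE metadata items for a Galaxy VM.
  def build(bootstrapScript: String, userEmail: String, urlPrefix: String): Map[String, String] =
    Map(
      // The guest agent runs "startup-script" on every boot, so the
      // bootstrap no longer depends on cloud-init treating the launch
      // as a first boot (which is what the "user-data" key relied on).
      "startup-script"    -> bootstrapScript,
      "galaxy-user-email" -> userEmail,
      "galaxy-url-prefix" -> urlPrefix
    )
}
```

The point of the sketch is only the key choice: the script body is a plain bash script under `startup-script`, and no `user-data` item is set.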
…n playbook.yml Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Summary
Replaces the GKE/Helm-based Galaxy deployment path with a GCE VM-based deployment using galaxy-k8s-boot. Rather than provisioning a full GKE cluster and deploying Galaxy via Helm, Leo now creates a single GCE VM that runs Galaxy via Ansible/microk8s.
What changed
Galaxy VM provisioning (`GKEInterpreter.installGalaxyVm`)
- Passes `galaxy-user-email` GCE metadata (from `app.auditInfo.creator`) so the workspace user becomes the Galaxy admin
- Creates a `galaxy-batch-runner` service account in the user's project
- Grants `roles/batch.jobsEditor` at the project level and `roles/iam.serviceAccountUser` on the Batch SA, so the VM can submit GCP Batch jobs
- Adds a firewall rule (`leonardo-galaxy-allow-nfs-for-batch`) so Batch VMs can reach the Galaxy VM's NFS server (TCP/UDP 2049 and 111)
- … `loadBalancerIp` so the Leo proxy can reach the VM across VPC boundaries
Network topology and IP choice
- The Galaxy VM's internal IP (`10.x.x.x`) is not routable from Leo's pod
- The `leonardo-allow-http` firewall rule (TCP port 80, source `0.0.0.0/0`, targeting VMs with the `leonardo` network tag) allows Leo to reach the VM's external IP. Galaxy VMs are created with the `leonardo` tag
- The readiness check (`isVmReachable`) and the Akka HTTP proxy use the external IP
Galaxy VM readiness health check
- `isProxyAvailable` routes through Leo's own proxy hostname; in BEE environments the wildcard DNS resolves to the ingress controller's external IP, unreachable via hairpin NAT from within the GKE pod → TCP timeout
- `AppDAO.isVmReachable(ip, port)`: a direct http4s HTTP GET to `http://<externalIp>:80/`. No proxy hostname resolution required.
- `MockAppDAO` returns `IO.pure(isUp)` for tests
Leo proxy: HTTP support for Galaxy VM backends
- Adds `useHttp: Boolean = false` to `HostReady`: when true, the proxy connects via plain HTTP port 80 (ws:// for WebSocket) instead of HTTPS port 443
- `KubernetesDnsCache` sets `useHttp = true` for `AppType.Galaxy` apps and maps the fake proxy hostname to the VM's external IP
- `ProxyService.handleHttpRequest` / `handleWebSocketRequest` branch on `useHttp`; all non-Galaxy backends are unchanged
Leo proxy: path handling for Galaxy VM (`ProxyService.proxyAppRequest`)
- Leo forwards the full proxy path (`/proxy/google/v1/apps/{project}/{app}/galaxy/...`) to the Galaxy VM unchanged
- `ingress.path` is set to the value of `galaxy_prefix`, which Leo passes as a GCE metadata item (`galaxy-url-prefix`). Galaxy's nginx therefore serves at the full Leo proxy path, so all requests route correctly without any path rewriting in Leo
NFS PVC size: GB → GiB conversion fix (`GKEInterpreter.installGalaxyVm`)
- `pvSizeGi = nfsDisk.size.gb - 11` treated decimal GB as binary GiB. For a 500 GB disk, the disk holds ~466 GiB but Leo requested 489 GiB → NFS provisioner fails with `insufficient available space`, leaving all Galaxy pods `Pending`
- Fix: `diskSizeGiB = (nfsDisk.size.gb.toLong * 1000^3) / 1024^3`, then subtract 11 GiB overhead
Lifecycle: restore from existing disks
- `restore = msg.appType == AppType.Galaxy && msg.createDisk.isEmpty`: when a Galaxy app is created without a new disk, the disks already exist (prior app was deleted keeping disks)
- Passes `restore_galaxy=true` metadata to Ansible
- `CreateAppParams.restore: Boolean` propagates the flag through to `installGalaxyVm`
Lifecycle: delete keeping disks
- `DeleteAppMessage(diskId = None)` → VM is deleted, both disks are preserved
- `DeleteAppMessage(diskId = Some(...))` → VM + both disks deleted
Config cleanup
- Removes `gcpBatchServiceAccountEmail` from `reference.conf` / `GalaxyVmConfig` / `Config.scala`; Leo now creates the SA dynamically via `getOrCreateServiceAccount` instead of relying on a pre-configured email
Architecture notes
- … `anvil` branch)
- `user-data` cloud-init via guest agent
- `leonardo-allow-http` firewall (`0.0.0.0/0` → port 80, `leonardo` tag)
- `isProxyAvailable` via proxy hostname
- `isVmReachable` direct HTTP to external IP
- `galaxy-batch-runner` SA
Security comparison: old GKE-based vs. new VM-based
- `0.0.0.0/0`
- `0.0.0.0/0`
Regressions introduced:
- … `0.0.0.0/0`, so anyone who discovers the VM's external IP can reach Galaxy directly, bypassing Leo's authentication
Alternatives for follow-up:
- … `0.0.0.0/0` firewall rule is no longer needed
- … `useHttp = false` for Galaxy; requires provisioning Leo certs onto the VM during `installGalaxyVm`
Test plan
- Unit tests (`GKEInterpreterSpec`, `LeoPubsubMessageSubscriberSpec`)
- … `restore_galaxy=true` passed to Ansible, Galaxy restores state
- … `leonardo-allow-http` source range from `0.0.0.0/0` to Leo's GKE node CIDR
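
For reviewers checking the NFS PVC size fix described under "What changed", the arithmetic can be sketched as follows. This is a standalone illustration under the assumption that the disk size arrives as a whole number of decimal GB; the object and function names are hypothetical, and only the formula and the 11 GiB overhead come from the PR description.

```scala
object NfsPvcSizing {
  private val BytesPerGB: Long  = 1000L * 1000L * 1000L // decimal gigabyte
  private val BytesPerGiB: Long = 1024L * 1024L * 1024L // binary gibibyte
  private val OverheadGiB: Long = 11L                   // overhead from the PR description

  // Convert a disk size in decimal GB to whole GiB (integer division rounds down).
  def diskSizeGiB(sizeGb: Int): Long =
    (sizeGb.toLong * BytesPerGB) / BytesPerGiB

  // New behaviour: convert to GiB first, then subtract the overhead.
  def pvSizeGiB(sizeGb: Int): Long =
    diskSizeGiB(sizeGb) - OverheadGiB

  // Old behaviour: subtract the overhead from the GB figure as if it were GiB.
  def oldPvSizeGi(sizeGb: Int): Long =
    sizeGb.toLong - OverheadGiB
}
```

For a 500 GB disk, `diskSizeGiB(500)` is 465 (rounded down from roughly 465.66), `pvSizeGiB(500)` is 454, and `oldPvSizeGi(500)` is 489, which is the request that exceeded the disk's real capacity.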