Openshell failure #712
Description
Agent Diagnostic
elyees@elyees-HP-EliteBook-x360-1030-G2 / $ openshell gateway destroy --name nemoclaw && openshell gateway start
• Destroying gateway nemoclaw...
✓ Gateway nemoclaw destroyed.
✓ Checking Docker
✓ Downloading gateway
x Initializing environment x Gateway failed: openshell
Gateway failed to start
The gateway encountered an unexpected error during startup.
To fix:
- Check container logs for details
  openshell doctor logs --name openshell
- Run diagnostics
  openshell doctor check --name openshell
- Try destroying and recreating the gateway
  openshell gateway destroy --name openshell && openshell gateway start
- If the issue persists, report it at https://github.com/nvidia/openshell/issues
Error: × K8s namespace not ready
╰─▶ timed out waiting for namespace 'openshell' to exist: Error from server (NotFound): namespaces "openshell" not found
container logs:
time="2026-03-31T20:22:30Z" level=info msg="Running flannel backend."
I0331 20:22:30.260501 94 vxlan_network.go:68] watching for new subnet leases
I0331 20:22:30.260524 94 vxlan_network.go:115] starting vxlan device watcher
I0331 20:22:30.275553 94 iptables.go:358] bootstrap done
I0331 20:22:30.287214 94 iptables.go:358] bootstrap done
time="2026-03-31T20:22:32Z" level=info msg="Started tunnel to 172.18.0.2:6443"
time="2026-03-31T20:22:32Z" level=info msg="Connecting to proxy" url="wss://172.18.0.2:6443/v1-k3s/connect"
time="2026-03-31T20:22:32Z" level=info msg="Stopped tunnel to 127.0.0.1:6443"
time="2026-03-31T20:22:32Z" level=info msg="Proxy done" err="context canceled" url="wss://127.0.0.1:6443/v1-k3s/connect"
time="2026-03-31T20:22:32Z" level=info msg="error in remotedialer server [400]: websocket: close 1006 (abnormal closure): unexpected EOF"
time="2026-03-31T20:22:32Z" level=info msg="Handling backend connection request [b2e3d1c467dd]"
time="2026-03-31T20:22:32Z" level=info msg="Connected to proxy" url="wss://172.18.0.2:6443/v1-k3s/connect"
time="2026-03-31T20:22:32Z" level=info msg="Remotedialer connected to proxy" url="wss://172.18.0.2:6443/v1-k3s/connect"
time="2026-03-31T20:22:49Z" level=info msg="Starting network policy controller version v2.6.3-k3s1, built on 2026-03-04T22:29:48Z, go1.25.7"
I0331 20:22:49.291901 94 network_policy_controller.go:164] Starting network policy controller
I0331 20:22:49.465585 94 network_policy_controller.go:179] Starting network policy controller full sync goroutine
E0331 20:22:53.159942 94 resource_quota_controller.go:460] "Error during resource discovery" err="unable to retrieve the complete list of server APIs: metrics.k8s.io/v1beta1:
stale GroupVersion discovery: metrics.k8s.io/v1beta1"
I0331 20:22:53.197871 94 garbagecollector.go:792] "failed to discover some groups" groups="map[\"metrics.k8s.io/v1beta1\":\"stale GroupVersion discovery: metrics.k8s.io/
v1beta1\"]"
I0331 20:23:02.766833 94 pod_startup_latency_tracker.go:108] "Observed pod startup duration" pod="agent-sandbox-system/agent-sandbox-controller-0"
podStartSLOduration=23.698492291 podStartE2EDuration="41.766786415s" podCreationTimestamp="2026-03-31 20:22:21 +0000 UTC" firstStartedPulling="2026-03-31 20:22:43.545748443 +0000 UTC
m=+33.592945921" lastFinishedPulling="2026-03-31 20:23:01.614042575 +0000 UTC m=+51.661240045" observedRunningTime="2026-03-31 20:23:02.765706124 +0000 UTC m=+52.812903653"
watchObservedRunningTime="2026-03-31 20:23:02.766786415 +0000 UTC m=+52.813983948"
E0331 20:23:14.021273 94 handler_proxy.go:143] error resolving kube-system/metrics-server: no endpoints available for service "metrics-server"
E0331 20:23:23.169836 94 resource_quota_controller.go:460] "Error during resource discovery" err="unable to retrieve the complete list of server APIs: metrics.k8s.io/v1beta1:
stale GroupVersion discovery: metrics.k8s.io/v1beta1"
I0331 20:23:23.210245 94 garbagecollector.go:792] "failed to discover some groups" groups="map[\"metrics.k8s.io/v1beta1\":\"stale GroupVersion discovery: metrics.k8s.io/
v1beta1\"]"
W0331 20:23:27.205027 94 handler_proxy.go:99] no RequestInfo found in the context
E0331 20:23:27.205144 94 controller.go:113] "Unhandled Error" err="loading OpenAPI spec for \"v1beta1.metrics.k8s.io\" failed with: Error, could not get list of group versions
for APIService"
I0331 20:23:27.205196 94 controller.go:126] OpenAPI AggregationController: action for item v1beta1.metrics.k8s.io: Rate Limited Requeue.
W0331 20:23:27.206150 94 handler_proxy.go:99] no RequestInfo found in the context
E0331 20:23:27.206345 94 controller.go:102] "Unhandled Error" err=<
loading OpenAPI spec for "v1beta1.metrics.k8s.io" failed with: failed to download v1beta1.metrics.k8s.io: failed to retrieve openAPI spec, http error: ResponseCode: 503, Body:
service unavailable
, Header: map[Content-Type:[text/plain; charset=utf-8] X-Content-Type-Options:[nosniff]]
>
I0331 20:23:27.206393 94 controller.go:109] OpenAPI AggregationController: action for item v1beta1.metrics.k8s.io: Rate Limited Requeue.
I0331 20:23:38.911768 94 pod_startup_latency_tracker.go:108] "Observed pod startup duration" pod="kube-system/local-path-provisioner-6bc6568469-pkxph"
podStartSLOduration=36.402677146 podStartE2EDuration="1m16.91172317s" podCreationTimestamp="2026-03-31 20:22:22 +0000 UTC" firstStartedPulling="2026-03-31 20:22:43.397386178 +0000
UTC m=+33.444583654" lastFinishedPulling="2026-03-31 20:23:23.906432208 +0000 UTC m=+73.953629678" observedRunningTime="2026-03-31 20:23:24.82696973 +0000 UTC m=+74.874167197"
watchObservedRunningTime="2026-03-31 20:23:38.91172317 +0000 UTC m=+88.958920730"
I0331 20:23:38.966021 94 pod_startup_latency_tracker.go:108] "Observed pod startup duration" pod="kube-system/coredns-7566b5ff58-28j8r" podStartSLOduration=22.185733663
podStartE2EDuration="1m16.966002736s" podCreationTimestamp="2026-03-31 20:22:22 +0000 UTC" firstStartedPulling="2026-03-31 20:22:43.527197919 +0000 UTC m=+33.574395394"
lastFinishedPulling="2026-03-31 20:23:38.307467001 +0000 UTC m=+88.354664467" observedRunningTime="2026-03-31 20:23:38.913871271 +0000 UTC m=+88.961068813"
watchObservedRunningTime="2026-03-31 20:23:38.966002736 +0000 UTC m=+89.013200220"
time="2026-03-31T20:23:50Z" level=info msg="Slow SQL (started: 2026-03-31 20:23:48.019318608 +0000 UTC m=+98.066516115) (total time: 2.589620038s): INSERT INTO kine(name, created,
deleted, create_revision, prev_revision, lease, value, old_value) values(?, ?, ?, ?, ?, ?, ?, ?)" duration=2.589620038s
time="2026-03-31T20:23:50Z" level=info msg="Slow SQL (started: 2026-03-31 20:23:49.313800541 +0000 UTC m=+99.360998062) (total time: 1.330503926s): INSERT INTO kine(name, created,
deleted, create_revision, prev_revision, lease, value, old_value) values(?, ?, ?, ?, ?, ?, ?, ?)" duration=1.330503926s
E0331 20:23:53.173950 94 resource_quota_controller.go:460] "Error during resource discovery" err="unable to retrieve the complete list of server APIs: metrics.k8s.io/v1beta1:
stale GroupVersion discovery: metrics.k8s.io/v1beta1"
I0331 20:23:53.216466 94 garbagecollector.go:792] "failed to discover some groups" groups="map[\"metrics.k8s.io/v1beta1\":\"stale GroupVersion discovery: metrics.k8s.io/
v1beta1\"]"
E0331 20:24:14.022215 94 handler_proxy.go:143] error resolving kube-system/metrics-server: no endpoints available for service "metrics-server"
E0331 20:24:23.178879 94 resource_quota_controller.go:460] "Error during resource discovery" err="unable to retrieve the complete list of server APIs: metrics.k8s.io/v1beta1:
stale GroupVersion discovery: metrics.k8s.io/v1beta1"
I0331 20:24:23.222340 94 garbagecollector.go:792] "failed to discover some groups" groups="map[\"metrics.k8s.io/v1beta1\":\"stale GroupVersion discovery: metrics.k8s.io/
v1beta1\"]"
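Most of the repeated E-level lines in the log above concern metrics.k8s.io and metrics-server, which reports no endpoints available; they may be unrelated to the missing 'openshell' namespace. A minimal sketch for hiding those lines when rereading the logs, assuming the gateway runs under the default name openshell (as the "To fix" hints suggest). filter_metrics_noise is a hypothetical helper name, not part of the openshell CLI:

```shell
# Hypothetical helper: drop log lines that mention the metrics API or
# metrics-server so that other startup errors are easier to spot.
filter_metrics_noise() {
  grep -v -e 'metrics\.k8s\.io' -e 'metrics-server'
}

# Assumption: the gateway started without --name is called "openshell".
if command -v openshell >/dev/null 2>&1; then
  openshell doctor logs --name openshell 2>&1 | filter_metrics_noise
fi
```

Whatever remains after filtering (for example the Slow SQL messages or pod startup durations) is a better starting point for working out why the namespace was never created.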
elyees@elyees-HP-EliteBook-x360-1030-G2 / $ openshell doctor logs --name nemoclaw
openshell doctor check --name nemoclaw
Error: × error reading container logs: Docker responded with status code 404: No such container: openshell-cluster-nemoclaw
error: unexpected argument '--name' found
Usage: openshell doctor check [OPTIONS]
For more information, try '--help'.
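The two diagnostic commands above may have failed for reasons separate from the startup error. The gateway was destroyed under the name nemoclaw and then restarted without a name, and the "To fix" hints refer to openshell, so the container that exists is probably not openshell-cluster-nemoclaw. In addition, the usage line shows that doctor check does not accept --name. A sketch of the retry, assuming the default gateway name is openshell and that containers follow the openshell-cluster-<name> pattern seen in the 404 message (container_name is a hypothetical helper):

```shell
# Assumption: `openshell gateway start` without --name creates "openshell".
gateway_name="openshell"

# Hypothetical helper: container name pattern inferred from the 404 message.
container_name() {
  printf 'openshell-cluster-%s\n' "$1"
}

# List matching containers, including stopped ones, to confirm the name.
if command -v docker >/dev/null 2>&1; then
  docker ps -a --filter "name=$(container_name "$gateway_name")"
fi

# Re-run the diagnostics against the gateway that was actually started.
if command -v openshell >/dev/null 2>&1; then
  openshell doctor logs --name "$gateway_name"
  openshell doctor check   # no --name, per the usage message above
fi
```

If docker ps lists no matching container, the gateway name assumption is wrong and the name printed by gateway start should be used instead.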
Description
I have tried to set up the nemoclaw instance more than 7 times. At first I had issues with my Docker instance not being able to start, so I reconfigured my DNS, restarted Docker, and uninstalled the nemoclaw instance before beginning the installation process again. I keep getting the error shown above: the CLI retries three times, and every attempt ends with the same output.
Reproduction Steps
Run curl -fsSL https://www.nvidia.com/nemoclaw.sh | bash
sudo journalctl -xeu docker.service | cat
Environment
Linux Ubuntu 24.04
openshell 0.0.19
Docker version 29.3.0, build 5927d80
Logs
Agent-First Checklist
- I pointed my agent at the repo and had it investigate this issue
- I loaded relevant skills (e.g., debug-openshell-cluster, debug-inference, openshell-cli)
- My agent could not resolve this; the diagnostic above explains why