Openshell failure #712

@kooya3

Agent Diagnostic

elyees@elyees-HP-EliteBook-x360-1030-G2 / $ openshell gateway destroy --name nemoclaw && openshell gateway start
• Destroying gateway nemoclaw...
✓ Gateway nemoclaw destroyed.
✓ Checking Docker
✓ Downloading gateway
x Initializing environment
x Gateway failed: openshell

Gateway failed to start

The gateway encountered an unexpected error during startup.

To fix:

  1. Check container logs for details

    openshell doctor logs --name openshell

  2. Run diagnostics

    openshell doctor check --name openshell

  3. Try destroying and recreating the gateway

    openshell gateway destroy --name openshell && openshell gateway start

  4. If the issue persists, report it at https://github.com/nvidia/openshell/issues

Error: × K8s namespace not ready
╰─▶ timed out waiting for namespace 'openshell' to exist: Error from server (NotFound): namespaces "openshell" not found

  container logs:
    time="2026-03-31T20:22:30Z" level=info msg="Running flannel backend."
    I0331 20:22:30.260501      94 vxlan_network.go:68] watching for new subnet leases
    I0331 20:22:30.260524      94 vxlan_network.go:115] starting vxlan device watcher
    I0331 20:22:30.275553      94 iptables.go:358] bootstrap done
    I0331 20:22:30.287214      94 iptables.go:358] bootstrap done
    time="2026-03-31T20:22:32Z" level=info msg="Started tunnel to 172.18.0.2:6443"
    time="2026-03-31T20:22:32Z" level=info msg="Connecting to proxy" url="wss://172.18.0.2:6443/v1-k3s/connect"
    time="2026-03-31T20:22:32Z" level=info msg="Stopped tunnel to 127.0.0.1:6443"
    time="2026-03-31T20:22:32Z" level=info msg="Proxy done" err="context canceled" url="wss://127.0.0.1:6443/v1-k3s/connect"
    time="2026-03-31T20:22:32Z" level=info msg="error in remotedialer server [400]: websocket: close 1006 (abnormal closure): unexpected EOF"
    time="2026-03-31T20:22:32Z" level=info msg="Handling backend connection request [b2e3d1c467dd]"
    time="2026-03-31T20:22:32Z" level=info msg="Connected to proxy" url="wss://172.18.0.2:6443/v1-k3s/connect"
    time="2026-03-31T20:22:32Z" level=info msg="Remotedialer connected to proxy" url="wss://172.18.0.2:6443/v1-k3s/connect"
    time="2026-03-31T20:22:49Z" level=info msg="Starting network policy controller version v2.6.3-k3s1, built on 2026-03-04T22:29:48Z, go1.25.7"
    I0331 20:22:49.291901      94 network_policy_controller.go:164] Starting network policy controller
    I0331 20:22:49.465585      94 network_policy_controller.go:179] Starting network policy controller full sync goroutine
    E0331 20:22:53.159942      94 resource_quota_controller.go:460] "Error during resource discovery" err="unable to retrieve the complete list of server APIs: metrics.k8s.io/v1beta1:
  stale GroupVersion discovery: metrics.k8s.io/v1beta1"
    I0331 20:22:53.197871      94 garbagecollector.go:792] "failed to discover some groups" groups="map[\"metrics.k8s.io/v1beta1\":\"stale GroupVersion discovery: metrics.k8s.io/
  v1beta1\"]"
    I0331 20:23:02.766833      94 pod_startup_latency_tracker.go:108] "Observed pod startup duration" pod="agent-sandbox-system/agent-sandbox-controller-0"
  podStartSLOduration=23.698492291 podStartE2EDuration="41.766786415s" podCreationTimestamp="2026-03-31 20:22:21 +0000 UTC" firstStartedPulling="2026-03-31 20:22:43.545748443 +0000 UTC
  m=+33.592945921" lastFinishedPulling="2026-03-31 20:23:01.614042575 +0000 UTC m=+51.661240045" observedRunningTime="2026-03-31 20:23:02.765706124 +0000 UTC m=+52.812903653"
  watchObservedRunningTime="2026-03-31 20:23:02.766786415 +0000 UTC m=+52.813983948"
    E0331 20:23:14.021273      94 handler_proxy.go:143] error resolving kube-system/metrics-server: no endpoints available for service "metrics-server"
    E0331 20:23:23.169836      94 resource_quota_controller.go:460] "Error during resource discovery" err="unable to retrieve the complete list of server APIs: metrics.k8s.io/v1beta1:
  stale GroupVersion discovery: metrics.k8s.io/v1beta1"
    I0331 20:23:23.210245      94 garbagecollector.go:792] "failed to discover some groups" groups="map[\"metrics.k8s.io/v1beta1\":\"stale GroupVersion discovery: metrics.k8s.io/
  v1beta1\"]"
    W0331 20:23:27.205027      94 handler_proxy.go:99] no RequestInfo found in the context
    E0331 20:23:27.205144      94 controller.go:113] "Unhandled Error" err="loading OpenAPI spec for \"v1beta1.metrics.k8s.io\" failed with: Error, could not get list of group versions
  for APIService"
    I0331 20:23:27.205196      94 controller.go:126] OpenAPI AggregationController: action for item v1beta1.metrics.k8s.io: Rate Limited Requeue.
    W0331 20:23:27.206150      94 handler_proxy.go:99] no RequestInfo found in the context
    E0331 20:23:27.206345      94 controller.go:102] "Unhandled Error" err=<
    loading OpenAPI spec for "v1beta1.metrics.k8s.io" failed with: failed to download v1beta1.metrics.k8s.io: failed to retrieve openAPI spec, http error: ResponseCode: 503, Body:
  service unavailable
    , Header: map[Content-Type:[text/plain; charset=utf-8] X-Content-Type-Options:[nosniff]]
    >
    I0331 20:23:27.206393      94 controller.go:109] OpenAPI AggregationController: action for item v1beta1.metrics.k8s.io: Rate Limited Requeue.
    I0331 20:23:38.911768      94 pod_startup_latency_tracker.go:108] "Observed pod startup duration" pod="kube-system/local-path-provisioner-6bc6568469-pkxph"
  podStartSLOduration=36.402677146 podStartE2EDuration="1m16.91172317s" podCreationTimestamp="2026-03-31 20:22:22 +0000 UTC" firstStartedPulling="2026-03-31 20:22:43.397386178 +0000
  UTC m=+33.444583654" lastFinishedPulling="2026-03-31 20:23:23.906432208 +0000 UTC m=+73.953629678" observedRunningTime="2026-03-31 20:23:24.82696973 +0000 UTC m=+74.874167197"
  watchObservedRunningTime="2026-03-31 20:23:38.91172317 +0000 UTC m=+88.958920730"
    I0331 20:23:38.966021      94 pod_startup_latency_tracker.go:108] "Observed pod startup duration" pod="kube-system/coredns-7566b5ff58-28j8r" podStartSLOduration=22.185733663
  podStartE2EDuration="1m16.966002736s" podCreationTimestamp="2026-03-31 20:22:22 +0000 UTC" firstStartedPulling="2026-03-31 20:22:43.527197919 +0000 UTC m=+33.574395394"
  lastFinishedPulling="2026-03-31 20:23:38.307467001 +0000 UTC m=+88.354664467" observedRunningTime="2026-03-31 20:23:38.913871271 +0000 UTC m=+88.961068813"
  watchObservedRunningTime="2026-03-31 20:23:38.966002736 +0000 UTC m=+89.013200220"
    time="2026-03-31T20:23:50Z" level=info msg="Slow SQL (started: 2026-03-31 20:23:48.019318608 +0000 UTC m=+98.066516115) (total time: 2.589620038s): INSERT INTO kine(name, created,
  deleted, create_revision, prev_revision, lease, value, old_value) values(?, ?, ?, ?, ?, ?, ?, ?)" duration=2.589620038s
    time="2026-03-31T20:23:50Z" level=info msg="Slow SQL (started: 2026-03-31 20:23:49.313800541 +0000 UTC m=+99.360998062) (total time: 1.330503926s): INSERT INTO kine(name, created,
  deleted, create_revision, prev_revision, lease, value, old_value) values(?, ?, ?, ?, ?, ?, ?, ?)" duration=1.330503926s
    E0331 20:23:53.173950      94 resource_quota_controller.go:460] "Error during resource discovery" err="unable to retrieve the complete list of server APIs: metrics.k8s.io/v1beta1:
  stale GroupVersion discovery: metrics.k8s.io/v1beta1"
    I0331 20:23:53.216466      94 garbagecollector.go:792] "failed to discover some groups" groups="map[\"metrics.k8s.io/v1beta1\":\"stale GroupVersion discovery: metrics.k8s.io/
  v1beta1\"]"
    E0331 20:24:14.022215      94 handler_proxy.go:143] error resolving kube-system/metrics-server: no endpoints available for service "metrics-server"
    E0331 20:24:23.178879      94 resource_quota_controller.go:460] "Error during resource discovery" err="unable to retrieve the complete list of server APIs: metrics.k8s.io/v1beta1:
  stale GroupVersion discovery: metrics.k8s.io/v1beta1"
    I0331 20:24:23.222340      94 garbagecollector.go:792] "failed to discover some groups" groups="map[\"metrics.k8s.io/v1beta1\":\"stale GroupVersion discovery: metrics.k8s.io/
  v1beta1\"]"
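
The container log above repeats a small set of errors, mostly around `metrics.k8s.io/v1beta1` and `metrics-server`. As a rough triage aid, the sketch below counts klog error lines (those whose first field is `E` followed by the four-digit date) per source location. The function name `summarize_klog_errors` is hypothetical, and the sketch assumes the log has been saved to a file or is piped on stdin. Wrapped continuation lines are ignored because they do not start with a klog header.

```shell
# Hypothetical helper: count klog error lines per source location.
# Assumes the klog header layout "E<mmdd> <time> <pid> <file>:<line>] <message>".
summarize_klog_errors() {
  awk '
    $1 ~ /^E[0-9][0-9][0-9][0-9]$/ {
      loc = $4
      sub(/\]$/, "", loc)   # drop the trailing "]" after file:line
      count[loc]++
    }
    END { for (loc in count) print count[loc], loc }
  ' | sort -k1,1nr -k2,2
}
```

A possible invocation, assuming the log was saved as `gateway.log` (a hypothetical file name), is `summarize_klog_errors < gateway.log`. The most frequent locations appear first, which may help separate recurring noise from the error that actually blocks startup.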

elyees@elyees-HP-EliteBook-x360-1030-G2 / $ openshell doctor logs --name nemoclaw
openshell doctor check --name nemoclaw
Error: × error reading container logs: Docker responded with status code 404: No such container: openshell-cluster-nemoclaw

error: unexpected argument '--name' found

Usage: openshell doctor check [OPTIONS]

For more information, try '--help'.
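
One detail in the output above may be worth checking: the gateway that was destroyed is named `nemoclaw`, but `openshell gateway start` was run without `--name`, and the failure message refers to `openshell`. The later `doctor logs --name nemoclaw` call then reports that no container `openshell-cluster-nemoclaw` exists. The sketch below lists the gateway containers Docker actually knows about. It assumes containers are named `openshell-cluster-<gateway name>`, which is inferred only from the 404 message and is not confirmed OpenShell behavior; both function names are hypothetical.

```shell
# Assumption: gateway containers are named "openshell-cluster-<gateway name>",
# inferred from the 404 message above; not confirmed OpenShell behavior.
gateway_container_name() {
  printf 'openshell-cluster-%s\n' "$1"
}

# List every container (running or stopped) whose name contains that prefix,
# together with its status.
list_gateway_containers() {
  docker ps --all --filter 'name=openshell-cluster-' --format '{{.Names}}\t{{.Status}}'
}
```

If `list_gateway_containers` shows a container other than the one for `nemoclaw`, reading its logs directly with `docker logs "$(gateway_container_name openshell)"` might surface the startup error, assuming the default gateway name is `openshell`.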

Description

I have tried to set up the nemoclaw instance more than seven times. At first my Docker daemon would not start, so I reconfigured my DNS, restarted Docker, and uninstalled the nemoclaw instance before starting the installation again. I still get the error shown above: the CLI retries, but all three attempts end with the same output.

Reproduction Steps

Run curl -fsSL https://www.nvidia.com/nemoclaw.sh | bash

sudo journalctl -xeu docker.service | cat

Environment

Linux Ubuntu 24.04
openshell 0.0.19
Docker version 29.3.0, build 5927d80

Logs

Agent-First Checklist

  • I pointed my agent at the repo and had it investigate this issue
  • I loaded relevant skills (e.g., debug-openshell-cluster, debug-inference, openshell-cli)
  • My agent could not resolve this — the diagnostic above explains why
