Merged
69 changes: 69 additions & 0 deletions devnets_monitor/.claude/skills/devnet-ops/SKILL.md
@@ -0,0 +1,69 @@
---
name: devnet-ops
description: Operate, inspect, monitor, and debug ethrex EL nodes on ethpandaops devnets (status sweeps, peers/logs, blob & fork tracking, Hive results, wipe/resync, incident history). Use whenever the user asks to check/inspect a devnet node, ssh into one, read logs, investigate a devnet error, track blob inclusion or fork schedule, see Hive conformance, deploy/wipe a node, or ask about past devnet problems.
---

# devnet-ops

This ethrex devnet ops + monitoring toolkit lives in the `devnets_monitor/`
directory of the `ethrex-tooling` repo (ported from the standalone `ethrex-devnets`
repo). It is self-contained: procedures, per-devnet history, and the `dv` CLI all
live here. Run every command from `devnets_monitor/` (`cd devnets_monitor` first);
all paths below are relative to that directory.

Before any devnet operation, and before answering a question about a node or incident, READ:

1. `CLAUDE.md` — the working agreement (golden rules, exact data sources, the wipe
sequence).
2. `docs/devnet-ops.md` — generic access & inspection procedures (SSH, inventory,
container layout, build & deploy, debug logging, wipe & resync, Dora API).
Substitute `<devnet>` with the target network.
3. `docs/history/<devnet>.md` — per-devnet facts and incident history (roster, fork
schedule, commit map, known issues with root cause + recovery). These files are
gitignored (local-only; they embed host internals), so a freshly-cloned tree has
none. If a devnet has no history file, create one from
`docs/history/_template.md` as you learn facts.

## The `dv` CLI

Run from `devnets_monitor/` via `uv run dv ...`. Read-only by default; only `dv wipe`
mutates (gated behind `--yes`). The target devnet resolves in this order: explicit
argument, then the `$DEVNET` env var, then `default` in `config/devnets.yaml`.

```
uv run dv discover <devnet> # refresh roster/forks/image from the ethpandaops repo (gh)
uv run dv status <devnet> [node] # EL build/head/peers/state@head + CL + watchtower (--json)
uv run dv peers <devnet> <node> # peer count, inbound/outbound, client mix, body-serving fails
uv run dv logs <devnet> <node> [--since 2m]
uv run dv cl <devnet> <node> [--since 3m]
uv run dv collect <devnet> [all|blobs|health|hive|forks] # into data/ethrex-devnets.sqlite
uv run dv blob <devnet> # blob inclusion per proposer + ethrex-vs-others (decay lens)
uv run dv fork <devnet> # fork schedule + EIPs + countdown
uv run dv hive <devnet> # Hive conformance summary (groups from config/devnets.yaml)
uv run dv serve # read-only dashboard at http://127.0.0.1:8099
uv run dv wipe <devnet> <node> --yes # MUTATING: recover a wedged EL
```
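
The devnet resolution order described above can be sketched as follows. This is a minimal illustration, not the repo's code: `resolve_devnet` and the in-memory `registry` dict (standing in for a parsed `config/devnets.yaml`) are hypothetical names.

```python
import os
from typing import Mapping, Optional


def resolve_devnet(
    arg: Optional[str],
    registry: dict,
    env: Optional[Mapping[str, str]] = None,
) -> str:
    """Pick the target devnet: explicit arg, then $DEVNET, then the registry default."""
    env = os.environ if env is None else env
    name = arg or env.get("DEVNET") or registry.get("default")
    if not name:
        raise ValueError("no devnet given and no default configured")
    if name not in registry.get("devnets", {}):
        raise KeyError(f"unknown devnet: {name}")
    return name


# Stand-in for the parsed registry file (assumed shape: `default` + `devnets`).
registry = {
    "default": "glamsterdam-devnet-5",
    "devnets": {"glamsterdam-devnet-5": {}, "bal-devnet-3": {}},
}
```

Rejecting names that are not in the registry keeps a typo from silently turning into an SSH hostname.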

## Workflow

1. Read `docs/devnet-ops.md` for HOW (procedures).
2. Read `docs/history/<devnet>.md` for WHAT/WHY — check whether the current symptom
matches a known issue before investigating from scratch; many recur.
3. ALWAYS verify the live host before trusting a node name. `dv status` reads the
live `docker inspect execution` image; a `*-ethrex-*` node may have been swapped
to another client.
4. Default to read-only. `dv wipe` and any deploy/recreate are mutating; confirm
with the user first.
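
For step 3, a rough sketch of checking the live image, assuming the raw JSON printed by `docker inspect execution` has already been fetched from the host. The helper names and the "repository name contains `ethrex`" heuristic are assumptions for illustration, not what `dv status` necessarily does.

```python
import json


def live_el_image(docker_inspect_output: str) -> str:
    """Return the image string from `docker inspect execution` JSON output."""
    containers = json.loads(docker_inspect_output)
    if not containers:
        raise ValueError("docker inspect returned no containers")
    return containers[0]["Config"]["Image"]


def image_repo(image: str) -> str:
    """Strip a digest and/or tag, keeping only the repository part."""
    repo = image.split("@", 1)[0]
    last_segment = repo.rsplit("/", 1)[-1]
    if ":" in last_segment:  # a colon in the last segment is the tag separator
        repo = repo[: repo.rfind(":")]
    return repo


def is_ethrex(image: str) -> bool:
    # Assumption: ethrex images carry "ethrex" in the repository name.
    return "ethrex" in image_repo(image).lower()
```

A node named `*-ethrex-*` whose live image fails `is_ethrex` is the swapped-client case the workflow warns about.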

## Adding a devnet

1. Add an entry under `devnets:` in `config/devnets.yaml` (see `config/schema.md`).
2. `uv run dv discover <devnet>` to populate `config/devnets/<devnet>.yaml`.
3. Create `docs/history/<devnet>.md` from `docs/history/_template.md`.

## Maintenance

When you discover a new incident, divergence, or devnet fact, append a dated entry
to `docs/history/<devnet>.md` (local-only, gitignored; do not commit it). If a
procedure changed, update `docs/devnet-ops.md` and commit that. The fork -> EIP map
is `config/eips.json` (sourced via eipmcp; re-run `get_hardfork` to refresh).
33 changes: 33 additions & 0 deletions devnets_monitor/.gitignore
@@ -0,0 +1,33 @@
# No secrets, ever. The entries below guard by filename; a pre-commit content
# audit (rg for xatu/jwt/Bearer/password/secret/BEGIN KEY) is the real gate.
# SSH keys, JWTs, and Xatu credentials live only on hosts, never in this repo.

# Data store + caches (regenerable)
data/*.sqlite
data/*.sqlite-*
data/*.parquet
.cache/
tmp/

# Secrets / captured run commands (may embed tokens or host internals)
*.secret
*.jwt
*xatu*cred*
*credentials*
runlike-*.txt
run-*.sh
run-execution*.txt

# Per-devnet incident history (may embed host internals); keep only the template
docs/history/*.md
!docs/history/_template.md

# Python
__pycache__/
*.pyc
.venv/
.uv/

# Editor / OS
.DS_Store
*.swp
142 changes: 142 additions & 0 deletions devnets_monitor/CLAUDE.md
@@ -0,0 +1,142 @@
# CLAUDE.md — ethrex-devnets

Working agreement for any agent operating in this repo. Read this first, then
`docs/architecture.md` for the component map and `docs/devnet-ops.md` for the
operational procedures.

## What this repo is

A standalone ops + monitoring toolkit for ethrex on ethpandaops devnets. NOT part
of the ethrex codebase; it operates ethrex nodes from the outside (SSH + docker +
JSON-RPC) and pulls public devnet data (Dora, Hive, the ethpandaops devnet repo).
The owner uses it to track ethrex behavior across forks ("forks news").

It replaces git-excluded notes that used to live scattered inside the ethrex
working copies. This is now the single home for the devnet ops runbook and
per-devnet incident history.

## Golden rules

1. **Read-only by default.** Inspection (status/peers/logs/curl/collect) never
mutates. Every mutating action (wipe, deploy, recreate) MUST be gated behind an
explicit flag (`--yes`) and refuse to run without it. When in doubt, do not mutate.
2. **No secrets in git, ever.** SSH keys, JWTs, and Xatu/ClickHouse credentials
live only on hosts. Before any `git add` of new/edited bash or docs, run:
`rg -n 'xatu|jwt|Bearer|password|secret|-----BEGIN' lib/bash/ bin/ docs/ config/`
and require empty output. The `.gitignore` guards by filename; this audit
guards by content. Captured `runlike`/`run-*.sh` files (which embed host
internals) are gitignored; never commit them.
3. **Generic over devnets.** Nothing is hardcoded to one devnet. New devnet =
one entry in `config/devnets.yaml` + `dv discover <name>`. Never add a
`NODES=()`-style hardcoded roster; read it from `config/devnets/<name>.yaml`.
4. **Verify the live host before trusting a name.** Inventory is INTENDED config;
a `*-ethrex-*` node may have been manually swapped to another client. The
status path always reads the live `docker inspect execution` image.
5. **Preserve the incident-tested remote sequences.** The shell snippets that run
ON the devnet host (the wipe sequence, the status/peers probes) are generalized
from a working, incident-tested helper. They run on the host over SSH and stay
shell; the `wipe` sequence in particular is load-bearing (see "wipe" below).
Change them only with a clear reason and preserve every step.
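
The content audit from rule 2 can also be approximated in Python, for example inside a pre-commit hook. This is a sketch only: it reuses the same pattern and directory list as the `rg` command above, but unlike `rg` it does not honor `.gitignore`, and the function names are hypothetical.

```python
import re
from pathlib import Path
from typing import Iterator, List, Sequence, Tuple

# Same alternation as the rg command in rule 2 (case-sensitive, like plain rg).
AUDIT_PATTERN = re.compile(r"xatu|jwt|Bearer|password|secret|-----BEGIN")
AUDIT_DIRS = ("lib/bash", "bin", "docs", "config")


def audit_text(text: str) -> List[int]:
    """Return 1-based line numbers that match the audit pattern."""
    return [
        number
        for number, line in enumerate(text.splitlines(), start=1)
        if AUDIT_PATTERN.search(line)
    ]


def audit_tree(root: Path, dirs: Sequence[str] = AUDIT_DIRS) -> Iterator[Tuple[Path, int]]:
    """Yield (path, line number) for every hit under the audited directories."""
    for name in dirs:
        base = root / name
        if not base.is_dir():
            continue  # missing directories are skipped, not treated as errors
        for path in sorted(base.rglob("*")):
            if path.is_file():
                for number in audit_text(path.read_text(errors="replace")):
                    yield path, number
```

As with the `rg` form, the gate is "no hits at all": `list(audit_tree(Path(".")))` must be empty before a `git add`.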

## Architecture in one paragraph

One codebase: **Python**, run via `uv`, with a single `dv` console entry point.
Python does everything local: config/YAML/JSON parsing, SSH orchestration
(subprocess to the system `ssh`), the SQLite store, analysis, and the dashboard.
The only shell involved is the snippets that execute ON the devnet host (docker,
runlike, curl, the wipe sequence); those are sent over `ssh ... bash -s` as
heredoc strings because they run on the host regardless of local language. A host
status probe emits JSON that Python parses in-process; there is no local-bash
intermediary and no cross-language file seam. Earlier drafts used a separate bash
CLI layer; it was collapsed into Python to remove the two-language seam and the
shell-injection / fragile-parsing surface. Full detail in `docs/architecture.md`.
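
A minimal sketch of that pattern: a shell snippet held as a string, sent over `ssh ... bash -s` with its value passed as a positional argument, and JSON parsed in-process. `IMAGE_PROBE`, `ssh_argv`, `run_probe`, and `validate_container` are hypothetical names, and the probe is illustrative rather than the repo's actual status probe.

```python
import json
import re
import shlex
import subprocess
from typing import List, Sequence

# Hypothetical probe. It runs ON the host and reads the container name
# from "$1" rather than having it interpolated into the script text.
IMAGE_PROBE = r'''
set -euo pipefail
image=$(docker inspect --format '{{.Config.Image}}' "$1")
printf '{"container": "%s", "image": "%s"}\n' "$1" "$image"
'''

CONTAINER_RE = re.compile(r"[A-Za-z0-9][A-Za-z0-9_.-]*")


def validate_container(name: str) -> str:
    """Allow only characters Docker itself accepts in container names."""
    if not CONTAINER_RE.fullmatch(name):
        raise ValueError(f"invalid container name: {name!r}")
    return name


def ssh_argv(host: str, args: Sequence[str]) -> List[str]:
    """Build the local argv; the script itself travels on stdin."""
    # ssh hands the remote command to the remote shell as one string,
    # so every positional argument is quoted before joining.
    remote = " ".join(["bash", "-s", "--", *map(shlex.quote, args)])
    return ["ssh", host, remote]


def run_probe(host: str, script: str, args: Sequence[str]) -> dict:
    """Run a snippet on the host and parse the JSON it prints."""
    proc = subprocess.run(
        ssh_argv(host, args),
        input=script,
        text=True,
        capture_output=True,
        check=True,
        timeout=30,
    )
    return json.loads(proc.stdout)
```

A call would look like `run_probe(host, IMAGE_PROBE, [validate_container("execution")])`. Quoting in `ssh_argv` matters because the remote shell re-parses the command line even though the local side never invokes a shell.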

## Conventions

- **Commits:** conventional commits, short. Types: `feat`, `fix`, `docs`, `chore`,
`refactor`. No Co-Authored-By lines. Run the secret audit before committing.
- **Prose / user-facing text:** use `;` or `,`, never the double-hyphen dash.
- **Python:** run via `uv` (`uv run dv ...`); stdlib first; deps minimal
(`pyyaml`, `requests`, and `fastapi`/`uvicorn`/`jinja2` only for the dashboard).
No premature abstraction. One module per concern (`config`, `ssh`, `remote`,
`status`, `discover`, `wipe`, `store`, `dora`, `hive`, `forks`, `collect`,
`blobtrack`, `forkview`). Idempotent, re-runnable collectors (upsert on primary
key, watermark incremental fetch).
- **Remote shell snippets:** keep them in `devnets/remote.py` as named string
constants; send via `ssh ... bash -s`. Validate any value interpolated into a
remote command (durations, node names) before sending; never f-string raw user
input into a shell command. Prefer passing values as positional args to
`bash -s` over interpolation.
- **Mutations:** only `dv wipe` mutates, gated behind `--yes`. Everything else is
read-only.
- **Simplicity over complexity.** This is a personal/small-team ops tool, not a
platform. Prefer the smallest thing that works.
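
The "idempotent, re-runnable collectors" convention above can be illustrated with SQLite's upsert. The table layout, column names, and function names below are assumptions for the sketch, not the store's real schema.

```python
import sqlite3
from typing import Iterable, Mapping

SCHEMA = """
CREATE TABLE IF NOT EXISTS slots (
    devnet        TEXT    NOT NULL,
    slot          INTEGER NOT NULL,
    proposer_name TEXT,
    blob_count    INTEGER,
    PRIMARY KEY (devnet, slot)
)
"""

# Upsert on the primary key: re-collecting a slot overwrites it in place.
UPSERT = """
INSERT INTO slots (devnet, slot, proposer_name, blob_count)
VALUES (?, ?, ?, ?)
ON CONFLICT (devnet, slot) DO UPDATE SET
    proposer_name = excluded.proposer_name,
    blob_count    = excluded.blob_count
"""


def store_slots(db: sqlite3.Connection, devnet: str, slots: Iterable[Mapping]) -> None:
    """Write a batch of slots; safe to call again with overlapping data."""
    rows = [
        (devnet, s["slot"], s.get("proposer_name"), s.get("blob_count"))
        for s in slots
    ]
    with db:  # one transaction per batch
        db.execute(SCHEMA)
        db.executemany(UPSERT, rows)


def watermark(db: sqlite3.Connection, devnet: str) -> int:
    """Highest slot already stored, or -1 when nothing has been collected."""
    db.execute(SCHEMA)
    row = db.execute(
        "SELECT COALESCE(MAX(slot), -1) FROM slots WHERE devnet = ?", (devnet,)
    ).fetchone()
    return row[0]
```

An incremental run would fetch only slots above `watermark(...)`; because writes are upserts, re-fetching a few slots below it is harmless.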

## Data sources (exact)

- **SSH:** `devops@<node>.srv.<devnet>.ethpandaops.io`. Per-node docker
containers: `execution` (ethrex), `beacon` (CL), `validator`, `snooper-engine`
(engine proxy CL<->EL, logs FCU/newPayload), `xatu-sentry`, `vector`,
`ethereum-node-docker-watchtower`, `prometheus`, `node_exporter`.
- **EL RPC** on node: `localhost:8545`, namespaces `eth,net,web3,admin,debug`
(NO `txpool`). Metrics `localhost:6060/metrics` (Prometheus; cumulative
counters, no current-pool-size gauge).
- **Dora** (blob inclusion over time, the practical source):
`<dora_base>/api/v1/slots?limit=N&with_missing=1&with_orphaned=1` ->
`data.slots[]` each with `slot`, `proposer_name`, `blob_count`,
`eth_block_number`, `status`, `time`, `epoch`, `gas_used`, `execution_times[]`;
paginate via `data.next_page` (also supports `min_slot`/`max_slot` range params).
- **ethpandaops config API:** `<config_base>/api/v1/nodes/inventory` (enodes/ENRs).
- **Devnet repo (source of truth)** via `gh api`:
`repos/<devnets_repo>/contents/ansible/inventories/<repo_path>/inventory.ini`
(group `[ethrex:children]` = roster), `.../group_vars/all/images.yaml`
(`default_ethereum_client_images.ethrex` = image tag),
`network-configs/<repo_path>/metadata/genesis.json` (chainId, fork timestamps,
blobSchedule).
- **Hive:** `https://hive.ethpandaops.io` group listings; per-devnet groups in
`devnets.yaml` `hive_groups`. Fetch the API directly; do not depend on any
external/Claude-only wrapper script.
- **eipmcp** (MCP tool, not HTTP): EIP/fork data for `dv fork` enrichment;
`dv eips-refresh` regenerates `config/eips.json`. Not auto-fetched at runtime.
- **Xatu / ClickHouse (FUTURE, credential-gated):** each node's `xatu-sentry`
ships CL events (incl. `blob_sidecar`) via gRPC/TLS to
`server.xatu-experimental.ethpandaops.io:443`. The public query endpoint
`clickhouse.xatu.ethpandaops.io` needs credentials we do NOT have, and devnet
data may live only in the experimental instance. Treat as optional; the core
must NOT depend on it. Dora is the available substitute.
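
A sketch of walking the Dora slots endpoint listed above. Only the URL shape, `data.slots[]`, and `data.next_page` come from this document; how `next_page` is fed back into the next request is not specified here, so that part is left to a caller-supplied `fetch_page` callable. All function names are hypothetical.

```python
from typing import Callable, Iterator, Optional
from urllib.parse import urlencode


def slots_url(dora_base: str, limit: int) -> str:
    """First-page URL, using the query parameters documented above."""
    query = urlencode({"limit": limit, "with_missing": 1, "with_orphaned": 1})
    return f"{dora_base}/api/v1/slots?{query}"


def iter_slots(
    fetch_page: Callable[[Optional[object]], dict],
    max_pages: int = 100,
) -> Iterator[dict]:
    """Yield slots page by page until `data.next_page` is absent or empty.

    `fetch_page(cursor)` returns the parsed JSON body; cursor is None for the
    first page and the previous `data.next_page` value afterwards.
    """
    cursor = None
    for _ in range(max_pages):  # hard cap so a bad cursor cannot loop forever
        data = fetch_page(cursor).get("data") or {}
        yield from data.get("slots") or []
        cursor = data.get("next_page")
        if not cursor:
            return
```

Keeping the HTTP call behind `fetch_page` also makes the pagination loop testable without network access.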

## The `wipe` sequence (do not break)

Recover a wedged EL. The datadir is owned by uid 1004, so wipe via a root
container. The full, incident-tested sequence:

1. Pause watchtower.
2. Capture `runlike execution` (ABORT if the capture is not a valid `docker run`).
3. Note whether `--nat.extip` is already present.
4. `docker pull`.
5. `docker rm -f execution`.
6. Root-container `rm -rf` of the datadir.
7. Recreate with `-d` (runlike omits detach).
8. **`docker restart snooper-engine`** (REQUIRED: the proxy holds a stale
   connection to the old EL; without this the CL can't drive the fresh EL, which
   logs "No messages from the consensus layer"). If a node talks to the EL
   directly without a snooper, restart `beacon` instead.
9. Unpause watchtower.
10. Print post-status.

Gated behind `--yes`.
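
Two local-side checks from the sequence above, sketched in Python: the `--yes` gate and the "abort unless the capture is a valid `docker run`" rule, plus re-adding `-d`. This is an illustration under assumptions (a single-line `runlike` capture, hypothetical function names); it is not the incident-tested host sequence itself.

```python
import shlex


def require_yes(yes: bool) -> None:
    """Refuse to run a mutating action unless --yes was passed."""
    if not yes:
        raise SystemExit("refusing to wipe without --yes")


def detached_run_command(capture: str) -> str:
    """Validate a runlike capture and return it with -d added."""
    tokens = shlex.split(capture)
    if tokens[:2] != ["docker", "run"] or len(tokens) < 3:
        # Abort before anything is removed: without a valid capture the
        # container could not be recreated afterwards.
        raise ValueError("runlike capture is not a valid `docker run`; aborting")
    return shlex.join(["docker", "run", "-d", *tokens[2:]])


def has_nat_extip(capture: str) -> bool:
    """Report whether the captured command already carries --nat.extip."""
    return any(token.startswith("--nat.extip") for token in shlex.split(capture))
```

Running the capture check before `docker rm -f execution` is what makes the destructive steps recoverable.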

## Adding a devnet

1. Add an entry under `devnets:` in `config/devnets.yaml` (see `config/schema.md`).
2. `dv discover <name>` to populate `config/devnets/<name>.yaml` from the repo.
3. Optionally `dv collect <name> all` to start a history.
4. Create `docs/history/<name>.md` from `docs/history/_template.md` and log
facts/incidents there as you learn them (dated entries; what / why / recovery).

## Incident history is part of the job

`docs/history/<devnet>.md` is hard-won knowledge: root causes and recovery steps
for real wedges (blob decay, snap-leftover state wedges, 0-inbound-peers, etc.).
When you discover a new incident or devnet fact, append a dated entry. Check the
history before investigating a symptom from scratch; many recur.

## Where to look

- `docs/architecture.md` — components, the local-Python / remote-shell boundary, data flow.
- `docs/devnet-ops.md` — operational procedures (SSH, inspection, build/deploy,
debug logging, wipe/resync, Dora API).
- `docs/history/<devnet>.md` — per-devnet facts + dated incident log.
- `config/schema.md` — config file field reference.
86 changes: 86 additions & 0 deletions devnets_monitor/README.md
@@ -0,0 +1,86 @@
# Devnets Monitor

Ops + monitoring toolkit for [ethrex](https://github.com/lambdaclass/ethrex) on
[ethpandaops](https://ethpandaops.io) devnets, living in the `devnets_monitor/`
directory of `ethrex-tooling` (ported from the standalone `ethrex-devnets` repo).
Single home for the devnet ops runbook, per-devnet incident history, and the
tooling that watches ethrex as new forks (glamsterdam, BAL, fusaka, ...) roll out.

Self-contained: it has its own Python package (`uv`-managed) and `dv` CLI; run all
commands from this directory (`cd devnets_monitor`).

Generic across devnets: everything is parameterized by devnet name
(`glamsterdam-devnet-5`, `bal-devnet-3`, ...). Read-only by default; every
mutating action (wipe, deploy) is gated behind an explicit flag.

## What it does

Four capability areas, one front door (`dv`):

1. **Node health monitoring** — multi-node, multi-client status sweeps (EL
build/commit/head/peers/sync/state-at-head, CL sync line, watchtower);
peer mix; log tails.
2. **Blob & fork tracking** — blob inclusion per proposer over time, ethrex vs
other clients; fork schedule + EIP-per-fork; next-fork countdown.
3. **Hive / conformance** — pull and summarize Hive group runs (bal, bal-quick,
future fork groups) and pass rates for ethrex.
4. **Ops automation** — wipe/resync, debug-log capture, watchtower control,
wrapped as safe (mutation-gated) commands.

## Layout

```
devnets/ # the Python package: CLI, SSH orchestration, collectors, analysis, store
  cli.py # `dv` argparse dispatcher (the daily driver)
  remote.py # shell snippets that run ON the devnet host (sent over ssh bash -s)
config/ # devnets.yaml (registry) + devnets/<name>.yaml (discovered cache) + eips.json
docs/ # devnet-ops runbook, per-devnet history, architecture/agent guide
data/ # SQLite store (gitignored, regenerable)
web/ # FastAPI dashboard (read-only, localhost)
pyproject.toml # uv-managed; exposes the `dv` console script
```

## Quick start

Everything runs through `uv run dv` (or just `dv` if the venv is on PATH):

```bash
# discover a devnet's roster + fork schedule from the ethpandaops repo
uv run dv discover glamsterdam-devnet-5

# live health sweep across all ethrex nodes
uv run dv status glamsterdam-devnet-5

# collect historical data into SQLite, then analyze
uv run dv collect glamsterdam-devnet-5 all
uv run dv blob glamsterdam-devnet-5 # blob inclusion per proposer over time
uv run dv fork glamsterdam-devnet-5 # fork schedule + EIPs + countdown

# pull Hive conformance results
uv run dv hive glamsterdam-devnet-5

# read-only dashboard at http://127.0.0.1:8099
uv run dv serve

uv run dv --help # full subcommand list (read-only vs mutating)
```

The default devnet (`config/devnets.yaml` `default:`) is used when no `<devnet>`
argument is given. Override per-invocation with `DEVNET=<name>`.

## Requirements

- `uv` (drives the whole CLI; manages the Python env)
- `ssh` (access to `devops@<node>.srv.<devnet>.ethpandaops.io`)
- `gh` authenticated with read access to the ethpandaops devnet repos (for `dv discover`)
- On the devnet hosts: `docker`, `runlike` (for the wipe path), `curl`

## Conventions

- Read-only by default. Mutating commands (`dv wipe`) refuse to run without `--yes`.
- No secrets in git. SSH keys, JWTs, and Xatu credentials live only on hosts; a
content audit gates commits (see CLAUDE.md).
- User-facing text uses `;` or `,`, not the double-hyphen dash.

See `CLAUDE.md` for the full working agreement and `docs/architecture.md` for how
the pieces fit together.
48 changes: 48 additions & 0 deletions devnets_monitor/config/devnets.yaml
@@ -0,0 +1,48 @@
# ethrex-devnets: registry of ethpandaops devnets this toolkit operates on.
#
# This file is the static, hand-maintained entry point. Per-devnet LIVE facts
# (node roster, fork schedule, image tag) are auto-discovered into
# config/devnets/<name>.yaml by `dv discover <name>` and read from there.
# Keep secrets OUT of this file; it is committed.
#
# Field reference: see config/schema.md

# Default devnet used when a command is given no explicit <devnet> argument
# and the DEVNET env var is unset.
default: glamsterdam-devnet-5

devnets:
  glamsterdam-devnet-5:
    # GitHub repo holding ansible inventory + network-configs (source of truth
    # for roster, image tags, genesis/fork schedule). Used by `dv discover`.
    devnets_repo: ethpandaops/glamsterdam-devnets
    # Sub-path inside that repo for this devnet (ansible inventory dir name and
    # network-configs dir name). For glamsterdam the dirs are "devnet-5".
    repo_path: devnet-5
    # Public service base URLs (no trailing slash).
    dora_base: https://dora.glamsterdam-devnet-5.ethpandaops.io
    config_base: https://config.glamsterdam-devnet-5.ethpandaops.io
    # ethrex devnet branch + image builder target (informational).
    branch: glamsterdam-devnet-5
    builder_target: lambdaclass/ethrex@glamsterdam-devnet-5
    # SSH host pattern. Optional; defaults to the ethpandaops scheme below.
    # Placeholders: {user}, {node}, {devnet}. Override only if a devnet differs.
    # ssh_user: devops
    # ssh_host_template: "{user}@{node}.srv.{devnet}.ethpandaops.io"
    # Hive groups to pull conformance results from for this devnet.
    hive_groups:
      - bal
      - bal-quick

# ---- Template for adding another devnet (uncomment + adjust, then run
# ---- `dv discover <name>`). repo_path is the dir name inside the repo.
#
#   bal-devnet-3:
#     devnets_repo: ethpandaops/bal-devnets
#     repo_path: devnet-3
#     dora_base: https://dora.bal-devnet-3.ethpandaops.io
#     config_base: https://config.bal-devnet-3.ethpandaops.io
#     branch: bal-devnet-3
#     builder_target: lambdaclass/ethrex@bal-devnet-3
#     hive_groups:
#       - bal
Empty file.