diff --git a/AGENTS.md b/AGENTS.md
new file mode 100644
index 0000000..9582427
--- /dev/null
+++ b/AGENTS.md
@@ -0,0 +1,31 @@
+
+
+# Agent Guide for Apache Kvrocks Controller
+
+This file is read by automated agents (security scanners, code analyzers,
+AI assistants) operating on this repository. It points them at the
+human-authored references they should consult before producing output.
+
+## Security
+
+Security model: [SECURITY.md](./SECURITY.md) → [THREAT_MODEL.md](./THREAT_MODEL.md)
+
+Agents that scan this repository should consult `SECURITY.md` and the linked
+`THREAT_MODEL.md` for the project's threat model — in particular the
+control-plane API/UI authentication posture, the controller's failover
+authority over managed Kvrocks clusters, metadata-store trust, and the SSRF
+surface of node registration — before reporting issues.
diff --git a/SECURITY.md b/SECURITY.md
new file mode 100644
index 0000000..1dbd7e5
--- /dev/null
+++ b/SECURITY.md
@@ -0,0 +1,14 @@
+# Security Policy
+
+## Reporting a Vulnerability
+
+Apache Kvrocks Controller follows the [Apache Software Foundation security process](https://www.apache.org/security/).
+Please report suspected vulnerabilities **privately** to `security@apache.org` (the Kvrocks PMC is reachable
+at `private@kvrocks.apache.org`). Do **not** open public GitHub issues or pull requests for security reports.
+
+## Threat Model
+
+What the controller treats as in/out of scope, the security properties it claims and disclaims (the
+control-plane API/UI authentication posture, its failover authority over managed clusters, metadata-store
+trust, and the SSRF surface of node registration), the adversary model, and how findings are triaged are
+documented in [THREAT_MODEL.md](./THREAT_MODEL.md). Reporters and triagers should consult it alongside this policy.
diff --git a/THREAT_MODEL.md b/THREAT_MODEL.md
new file mode 100644
index 0000000..2214b37
--- /dev/null
+++ b/THREAT_MODEL.md
@@ -0,0 +1,273 @@
+
+
+# Threat Model — Apache Kvrocks Controller
+
+## §1 Header
+
+- **Project:** Apache Kvrocks Controller — a Go cluster-management control plane for Apache Kvrocks.
+  It probes managed Kvrocks nodes and performs **failover**, scales clusters out/in, manages many
+  clusters from one (itself-clustered) controller, and stores cluster metadata in a pluggable backend
+  (ETCD by default; ZooKeeper / Consul / Raft optional) *(documented — README, `config/config.yaml`)*.
+- **Modelled against:** `apache/kvrocks-controller` `unstable`/HEAD (2026-05-31).
+- **Status:** **DRAFT — v0, not yet reviewed by the Kvrocks PMC.** Produced by the ASF Security team via
+  the `threat-model-producer` rubric
+  (). Companion to the
+  `apache/kvrocks` model; this one covers the **control plane**, whose trust surface differs.
+- **Version binding:** versioned with the project.
+- **Reporting cross-reference:** §8-property violations → report privately per `SECURITY.md` /
+  ; §3 / §9 findings closed citing this document.
+- **Provenance legend:** *(documented)* / *(maintainer)* / *(inferred)* as in the sibling model; each
+  *(inferred)* routes to a §14 question.
+- **Draft confidence:** ~12 documented / 0 maintainer / ~40 inferred.
+
+The controller exposes an **HTTP API** (default `addr: 127.0.0.1:9379`) and a bundled **web UI**, through
+which operators create clusters, add/remove nodes, migrate slots, and trigger or automate failover
+*(documented — README, config)*. It is, in effect, the **administrative authority** over every Kvrocks
+cluster it manages: whoever controls the controller controls those clusters' topology and data placement.
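The default bind described above (`addr: 127.0.0.1:9379`) can be checked mechanically. The following is a minimal Go sketch, outside the patch itself, of a helper that reports whether a listen address stays on loopback. The name `isLoopbackBind` is hypothetical and not taken from the controller's code, and treating the literal host `localhost` as loopback is an assumption.

```go
package main

import (
	"fmt"
	"net"
)

// isLoopbackBind reports whether a listen address such as "127.0.0.1:9379"
// binds only to a loopback interface. Hypothetical helper for illustration.
func isLoopbackBind(addr string) (bool, error) {
	host, _, err := net.SplitHostPort(addr)
	if err != nil {
		return false, fmt.Errorf("parse %q: %w", addr, err)
	}
	// Assumption: "localhost" resolves to a loopback address on this host.
	if host == "localhost" {
		return true, nil
	}
	ip := net.ParseIP(host)
	if ip == nil {
		// Empty host (":9379" listens on all interfaces) or an unresolved
		// hostname: treat as potentially exposed.
		return false, nil
	}
	// "0.0.0.0" and "::" are unspecified addresses, not loopback.
	return ip.IsLoopback(), nil
}

func main() {
	for _, addr := range []string{"127.0.0.1:9379", "0.0.0.0:9379", ":9379", "[::1]:9379"} {
		ok, err := isLoopbackBind(addr)
		fmt.Printf("%-16s loopback=%v err=%v\n", addr, ok, err)
	}
}
```

A deployment check could warn when this returns `false` and no authenticating proxy fronts the API.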
+
+## §2 Scope and intended use
+
+Primary intended use *(documented)*: an operations control plane that a cluster administrator runs to
+manage one or more Kvrocks clusters, backed by a consensus/metadata store for its own state and HA.
+
+Caller roles:
+
+- **API / UI client** — whoever can reach `:9379` or the web UI. **The config exposes no built-in
+  authentication knob**, so by default this role is *unauthenticated* and gated only by `addr` /
+  network reachability *(inferred — `config/config.yaml` has no API-auth section; default `addr` is
+  localhost; `server/middleware` present but role unconfirmed)*. This is the load-bearing §14 wave-1 item.
+- **Metadata store (ETCD/…)** — trusted source of truth for cluster topology; the controller believes it.
+- **Managed Kvrocks node** — a data node the controller connects to and issues admin/cluster commands to.
+- **Peer controller** — another controller instance in the same HA group, coordinating via the store.
+- **Operator / deployer** — controls `config.yaml`, the store, TLS material, network exposure, and the
+  managed-node credentials. Fully trusted; **out of model** as adversary (§3).
+
+**Component-family table:**
+
+| Family | Entry point | Touches outside process | In model? |
+| --- | --- | --- | --- |
+| HTTP control-plane API | `:9379`, `server/` | network; store; managed nodes | **Yes** |
+| Web UI / dashboard | `webui/` | browser; the API | **Yes** |
+| Metadata store engine | `store/engine` (etcd/zk/consul/raft) | network (the store) | **Yes** |
+| Controller → Kvrocks-node client | probing, cluster ops, failover | network (managed nodes) | **Yes** |
+| Controller HA / leader election | via the store | network | **Yes** |
+| Build / tooling | `cmd`, `scripts`, `x.py`, `Makefile` | — | No → §3 |
+
+## §3 Out of scope (explicit non-goals)
+
+- **The operator / deployer as adversary** — controls `config.yaml`, the metadata store, and the
+  deployment (§9) *(inferred)*.
+- **Hardening of the metadata store itself** (ETCD/ZK/Consul auth, ACLs, TLS) — the controller consumes
+  it; securing it is the operator's job, though the controller must use the credentials/TLS the operator
+  supplies *(documented — etcd `username`/`password`/`tls` fields exist, default empty/off)*.
+- **The managed Kvrocks nodes' own threat model** — covered by the `apache/kvrocks` model; this document
+  covers the controller's *use* of them.
+- **Network/transport hardening** beyond the TLS the controller supports.
+- **Build tooling and tests.**
+
+## §4 Trust boundaries and data flow
+
+The primary trust boundary is the **HTTP API / web-UI surface**. A request that reaches it can read or
+mutate cluster topology — create/delete clusters, add/remove nodes, migrate slots, force failover
+*(documented — feature list)*. With no built-in auth (pending §14), the boundary is effectively the
+network ACL around `:9379` *(inferred)*.
+
+Secondary boundaries:
+
+- **Controller ↔ metadata store:** the controller trusts store contents as ground truth for topology and
+  leadership. An actor who can write the store can redirect the controller *(inferred)*.
+- **Controller → managed node:** the controller connects out to node addresses taken from its
+  metadata/API input and issues admin/cluster commands, holding whatever node credentials the operator
+  configured *(inferred)*. Node addresses supplied via the API are an **SSRF / connect-to-arbitrary-host**
+  vector if the caller is untrusted (§9).
+
+**Reachability preconditions:**
+
+- A finding in the **API/UI** path is in-model if reachable by a client the deployment treats as untrusted
+  (which, absent built-in auth, is "anyone who can reach the port" — see §14 wave-1).
+- A finding in **store handling** is in-model if reachable from API input or from store contents an
+  attacker could influence.
+- A finding in the **node client** (e.g.
SSRF, credential mishandling) is in-model if a node address or
+  connection parameter is attacker-influenceable via the API.
+- A finding reachable only from `config.yaml` or operator-held credentials is out of model (§3).
+
+## §5 Assumptions about the environment
+
+- **Runtime:** a Go binary on a POSIX host; default API bind `127.0.0.1:9379` *(documented)*.
+- **Metadata store:** an operator-run ETCD (default) or alternative, reachable at the configured addr,
+  with operator-chosen auth/TLS (defaults: no auth, `tls.enable: false`) *(documented — config)*.
+- **Managed nodes:** reachable Kvrocks instances the operator has provisioned; the controller is assumed
+  to be authorized to administer them *(inferred)*.
+- **Network:** the operator controls who can reach the API, the store, and the nodes *(inferred)*.
+- **What the controller does to its host/network (*(inferred)* — wave-2):** binds the API port; connects
+  out to the store and to managed nodes; reads `config.yaml`; may write logs/metrics. Not assumed to
+  execute arbitrary host commands.
+
+## §5a Build-time and configuration variants
+
+| Knob | Default *(documented — config)* | Effect | Insecure-default ruling |
+| --- | --- | --- | --- |
+| API `addr` | `127.0.0.1:9379` | Localhost-only by default | Safe default; exposing it without a fronting auth proxy is the risk |
+| API authentication | **none present in config** | Control plane appears unauthenticated | **Open (wave-1):** is there *any* API/UI auth, and is "network-trust only" the supported posture?
|
+| API/UI TLS | *(unconfirmed)* | Plaintext control-plane traffic | **Open (wave-1)** |
+| `storage_type` | `etcd` | Choice of metadata backend (etcd/zk/consul/raft; raft = experimental, "not recommended for production") | raft-in-prod is an operator misuse (§11) |
+| etcd `username`/`password`/`tls.enable` | empty / `false` | Store auth + encryption off unless set | Operator responsibility (§10) |
+
+## §6 Assumptions about inputs
+
+| Entry point | Parameter | Attacker-controllable? | Caller/operator must enforce |
+| --- | --- | --- | --- |
+| API: create/scale/failover/migrate | cluster/node/slot args | **yes** (absent auth, any API client) | front with authn/authz; network ACL |
+| API: add node | **node address/host:port** | **yes** | restrict who may register nodes (SSRF surface) |
+| Web UI | form/query fields | **yes** | output encoding; CSRF protection; auth |
+| metadata store reads | topology/leadership records | from a **trusted** store *(inferred)* | store auth/TLS (operator) |
+| node responses | probe/command replies | from a **trusted** node *(inferred)* | node authenticity (TLS/credentials) |
+| `config.yaml` | all keys | **no — operator-trusted** | never sourced from a request |
+
+## §7 Adversary model
+
+- **Primary adversary:** an actor who can reach the controller API or web UI but is not the operator —
+  realistically present whenever the API is exposed beyond a trusted network without a fronting auth layer.
+  Capabilities: call any API endpoint (create/delete clusters, add/remove nodes, **force failover**,
+  migrate slots), register node addresses (SSRF), drive the UI.
+- **Goals:** seize administrative control of managed Kvrocks clusters; cause data movement/loss via slot
+  migration or failover; make the controller connect to attacker-chosen hosts; read/modify topology.
+- **Out of model:** the operator, anyone with `config.yaml` / store / managed-node credentials, and
+  (pending §14) a malicious peer controller or a compromised metadata store.
+
+## §8 Security properties the project provides
+
+*(Thin by design for a control plane assumed to run on a trusted network; all *(inferred)* pending §14.)*
+
+1. **Failover correctness under trusted operation.** Given a healthy store and honest peers, failover
+   promotes a valid replica rather than corrupting topology *(inferred — core function)*. *Symptom:* split
+   brain / promotion of a stale or wrong node / topology corruption. *Severity:* high (availability/integrity).
+2. **Leader-election safety (single active controller).** HA controllers coordinate via the store so that
+   conflicting administrative actions are serialized *(inferred)*. *Symptom:* two controllers issuing
+   conflicting cluster ops. *Severity:* high.
+3. **Memory/handler safety on API input.** Malformed API/UI requests do not crash or corrupt the controller
+   *(inferred)*. *Symptom:* panic/crash/OOB from crafted input. *Severity:* medium–high.
+4. **Authentication / authorization — UNRESOLVED.** Whether the API/UI authenticate callers at all is the
+   central open question (§14 wave-1); absent that, the controller claims **no** access-control property and
+   relies entirely on §10 network controls.
+
+## §9 Security properties the project does NOT provide
+
+- **No built-in control-plane authentication by default** *(inferred)* — exposing `:9379` (or the UI) to an
+  untrusted network without a fronting auth proxy yields an **unauthenticated administrative control plane**
+  over every managed cluster.
+- **No transport encryption guarantee by default** for the API/UI; store TLS is off unless configured.
+- **No defence against a malicious operator** or anyone holding store/managed-node credentials (§3).
+- **The controller trusts its metadata store and its peers** — it is not a defence against a tampered store
+  or a Byzantine peer controller (pending §14).
+- **No SSRF protection on registered node addresses** *(inferred)* — the controller connects to host:port
+  values it is given; an untrusted caller who can add nodes can aim those connections.
+
+**False friends:**
+
+- *A bundled web UI looks like an admin console with a login, but (pending §14) may have no authentication at
+  all* — its presence does not imply an access-control boundary.
+- *Managing "clusters" implies isolation, but the controller is a single authority across all managed
+  clusters* — compromising it is not contained to one cluster.
+
+**Well-known attack classes left to the operator:** unauthenticated admin-API exposure; CSRF/XSS on the web
+UI; SSRF via node registration; plaintext control traffic; an unsecured ETCD that anyone can rewrite.
+
+## §10 Downstream (operator) responsibilities
+
+- **Do not expose the API/UI to an untrusted network without an authenticating reverse proxy / network ACL**
+  (until/unless built-in auth is confirmed).
+- Enable TLS for the API/UI, for the store connection (`etcd.tls`), and for managed-node connections on
+  untrusted networks.
+- Secure the metadata store (ETCD/ZK/Consul) with its own auth + ACLs; supply those credentials to the
+  controller; do not run the experimental `raft` engine in production.
+- Restrict and protect the credentials the controller uses to administer managed nodes.
+- Restrict who may register node addresses (SSRF surface).
+
+## §11 Known misuse patterns
+
+- Binding the controller API to `0.0.0.0` on a shared network without fronting authentication.
+- Running the experimental `raft` store engine in production (the README warns against it).
+- Pointing the controller at an ETCD with no auth/TLS on a reachable network.
+- Treating the web UI as an authenticated admin console when it is not.
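§9 and §10 leave restricting registered node addresses to the operator, and the document treats an SSRF allow-list as a possible hardening item. The following is a minimal Go sketch of such a check, outside the patch itself, assuming an operator-supplied list of CIDR ranges. The type `nodeAllowList` and its functions are hypothetical and do not exist in the controller; the addresses and ports are illustrative. Hostnames are rejected outright here, since resolving them safely needs more care than this sketch takes.

```go
package main

import (
	"fmt"
	"net/netip"
)

// nodeAllowList is a hypothetical set of networks in which managed nodes
// are allowed to live.
type nodeAllowList []netip.Prefix

// newNodeAllowList parses CIDR strings such as "10.0.0.0/24".
func newNodeAllowList(cidrs ...string) (nodeAllowList, error) {
	list := make(nodeAllowList, 0, len(cidrs))
	for _, c := range cidrs {
		p, err := netip.ParsePrefix(c)
		if err != nil {
			return nil, fmt.Errorf("bad CIDR %q: %w", c, err)
		}
		list = append(list, p.Masked())
	}
	return list, nil
}

// permits reports whether a literal "ip:port" falls inside an allowed network.
// Anything that is not a literal IP and port is returned as an error.
func (l nodeAllowList) permits(hostPort string) (bool, error) {
	ap, err := netip.ParseAddrPort(hostPort)
	if err != nil {
		return false, fmt.Errorf("node address %q is not a literal ip:port: %w", hostPort, err)
	}
	// Unmap so an IPv4-mapped IPv6 address is compared as plain IPv4.
	addr := ap.Addr().Unmap()
	for _, p := range l {
		if p.Contains(addr) {
			return true, nil
		}
	}
	return false, nil
}

func main() {
	allow, err := newNodeAllowList("10.0.0.0/24")
	if err != nil {
		panic(err)
	}
	for _, a := range []string{"10.0.0.7:6666", "169.254.169.254:80", "example.com:6666"} {
		ok, err := allow.permits(a)
		fmt.Printf("%-22s allowed=%v err=%v\n", a, ok, err)
	}
}
```

A check like this would run before the controller dials a newly registered address; it complements, and does not replace, authentication in front of the API.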
+
+## §11a Known non-findings (recurring false positives)
+
+*(v0 seed — PMC's real list is the key §14 input.)*
+
+- **"Unauthenticated admin API / no login on the UI"** against the default config — by design *if* the PMC
+  rules network-trust the supported posture (else it is `VALID`); resolved in §14 wave-1.
+- **"ETCD has no auth/TLS"** — operator-configured store, defaults documented (§3/§10).
+- **Findings in `cmd/`, `scripts/`, `x.py`, build tooling, tests** — out of scope (§3).
+- **Controller can delete/failover clusters** — that is its purpose; an authorized call is not a finding (§7).
+
+## §12 Conditions that would change this model
+
+- Introduction (or confirmation) of built-in API/UI authentication/authorization — would add §8 properties
+  and shrink §9/§10.
+- A change to the default `addr` / TLS posture.
+- A new API surface or store engine, or promotion of the `raft` engine to supported.
+- Treating peer controllers or the metadata store as untrusted (pulls them into §7).
+- Any report not cleanly routable to a §13 disposition.
+
+## §13 Triage dispositions
+
+| Disposition | Meaning | Licensed by |
+| --- | --- | --- |
+| `VALID` | Violates a claimed property via an in-scope adversary/input. | §8, §6, §7 |
+| `VALID-HARDENING` | No §8 property broken, but a §11 misuse warrants hardening (e.g. add CSRF tokens, SSRF allow-list). | §11 |
+| `OUT-OF-MODEL: trusted-input` | Requires control of config / store / node credentials. | §6 |
+| `OUT-OF-MODEL: adversary-not-in-scope` | Requires operator / store / peer capability. | §7, §3 |
+| `OUT-OF-MODEL: unsupported-component` | Lands in tooling/tests, or the experimental `raft` engine. | §3, §5a |
+| `OUT-OF-MODEL: non-default-build` | Only under a discouraged/non-default config. | §5a |
+| `BY-DESIGN: property-disclaimed` | Concerns a §9-disclaimed property (no built-in auth/TLS default, single cross-cluster authority). | §9 |
+| `KNOWN-NON-FINDING` | Matches a §11a entry.
| §11a |
+| `MODEL-GAP` | Routes to none of the above → revise the model. | §12 |
+
+## §14 Open questions for the maintainers
+
+**Wave 1 — the central auth question (§2/§5a/§8/§9):**
+1. Does the controller API have **any built-in authentication/authorization** (token, mTLS, basic-auth), or
+   is it **network-trust only**? If the latter, is exposing it behind an operator's own auth proxy the
+   *supported* posture, so an "unauthenticated API" report is `BY-DESIGN`? *Proposed:* network-trust only
+   today; operator fronts it; unauth-on-trusted-network is by design, unauth-on-untrusted-network is misuse.
+2. Does the **web UI** authenticate, and does it have **CSRF** protection on state-changing calls?
+   *Proposed:* same as the API; CSRF is a hardening item.
+3. Is **TLS** supported/expected for the API and managed-node connections? *Proposed:* operator-configurable;
+   plaintext acceptable only on trusted networks.
+
+**Wave 2 — store, nodes, SSRF (§4/§6/§9):**
+4. How are **managed-node admin credentials** stored and used (in the store? in config?), and what protects
+   them at rest? *Proposed:* operator-supplied; stored in the metadata store; protected at the store layer.
+5. Are **node addresses** registered via the API validated/allow-listed, or will the controller connect to
+   any host:port given (SSRF)? *Proposed:* currently connects as given; allow-listing is a hardening item.
+6. Does the controller **trust the metadata store and peer controllers** as honest (out of §7), or should a
+   tampered store / Byzantine peer be modelled? *Proposed:* trusted; out of scope.
+
+**Wave 3 — properties & §11a (§8/§11a):**
+7. What **failover-correctness / split-brain** guarantees does the controller make, and under what store/peer
+   assumptions? *Proposed:* correct given a healthy store and a quorum of honest peers.
+8. What do scanners/researchers most often report here that the PMC considers a **non-finding**? (Seeds §11a.)
+
+**Meta:**
+9.
Confirm this model lives as root `THREAT_MODEL.md` referenced from a new `SECURITY.md`, separate from the
+   `apache/kvrocks` model. *Proposed:* yes.
+
+## §15 Machine-readable companion
+
+Deferred for v0; a `threat-model.yaml` can later encode the §6 trust table, §2/§3 scoping, §8 rows, §9 false
+friends, §11a non-findings, and §13 dispositions.
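One way a tool might consume the §13 dispositions once a machine-readable companion exists is sketched below in Go, outside the patch itself. The labels and section references are copied from the §13 table; the Go identifiers (`Disposition`, `licensedBy`, `parseDisposition`) are hypothetical, and no `threat-model.yaml` schema is assumed.

```go
package main

import "fmt"

// Disposition is one of the triage labels from the §13 table.
type Disposition string

const (
	Valid                Disposition = "VALID"
	ValidHardening       Disposition = "VALID-HARDENING"
	OutTrustedInput      Disposition = "OUT-OF-MODEL: trusted-input"
	OutAdversary         Disposition = "OUT-OF-MODEL: adversary-not-in-scope"
	OutUnsupported       Disposition = "OUT-OF-MODEL: unsupported-component"
	OutNonDefaultBuild   Disposition = "OUT-OF-MODEL: non-default-build"
	ByDesignDisclaimed   Disposition = "BY-DESIGN: property-disclaimed"
	KnownNonFinding      Disposition = "KNOWN-NON-FINDING"
	ModelGap             Disposition = "MODEL-GAP"
)

// licensedBy mirrors the "Licensed by" column of the §13 table.
var licensedBy = map[Disposition][]string{
	Valid:              {"§8", "§6", "§7"},
	ValidHardening:     {"§11"},
	OutTrustedInput:    {"§6"},
	OutAdversary:       {"§7", "§3"},
	OutUnsupported:     {"§3", "§5a"},
	OutNonDefaultBuild: {"§5a"},
	ByDesignDisclaimed: {"§9"},
	KnownNonFinding:    {"§11a"},
	ModelGap:           {"§12"},
}

// parseDisposition accepts only labels that appear in the §13 table.
func parseDisposition(label string) (Disposition, bool) {
	d := Disposition(label)
	if _, ok := licensedBy[d]; !ok {
		return "", false
	}
	return d, true
}

func main() {
	for _, label := range []string{"BY-DESIGN: property-disclaimed", "WONTFIX"} {
		d, ok := parseDisposition(label)
		fmt.Printf("%q known=%v licensed by %v\n", label, ok, licensedBy[d])
	}
}
```

Rejecting unknown labels keeps triage output aligned with the table, so a report that fits no row surfaces as `MODEL-GAP` instead of an ad hoc label.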