Skip to content

evpn: ADR-0059 slice 4 — M40 manual smoke against FRR EVPN-MH#89

Open
lance0 wants to merge 6 commits into
mainfrom
evpn/adr-0059-slice-4-m40-frr-aliasing-ecmp
Open

evpn: ADR-0059 slice 4 — M40 manual smoke against FRR EVPN-MH#89
lance0 wants to merge 6 commits into
mainfrom
evpn/adr-0059-slice-4-m40-frr-aliasing-ecmp

Conversation

@lance0
Copy link
Copy Markdown
Owner

@lance0 lance0 commented May 13, 2026

Summary

First end-to-end validation of the slice 3b coordinator + Pass 1b diff path against a real upstream peer. M40 proves that rustbgpd consumes real FRR EVPN-MH routes (Type 1 EAD-per-EVI + Type 2 with shared ESI) and installs an FDB nexthop group on the receiving VTEP via `NDA_NH_ID`. After this lands, aliasing-ECMP graduates from "shipped, kernel-primitive smoke green" (PR #88) to "shipped, FRR-validated against 10.3.1" — the gating evidence for flipping the production-default knob.

Topology

`tests/interop/m40-evpn-aliasing-ecmp-frr.clab.yml`:

```
pe-rb (rustbgpd VTEP) ── pe-a (FRR, ES1)
└── pe-c (FRR, ES1)
```

rustbgpd peers with two FRR VTEPs that share ES1 (`es-id 1`, `es-sys-mac aa:bb:cc:dd:ee:ff` — same convention as M32). FRR side reuses the existing `start-frr-vtep-mh.sh` so the bond-as-ES shape (FRR EVPN-MH requires it for the local ES to register) is the same known-good config the M32 reflector test exercises.

Test driver

`scripts/test-m40-evpn-aliasing-ecmp-frr.sh`:

Phase 1 — Inject a test MAC on pe-a's `esdummy` bond. Expect:

  • FDB row on rustbgpd's `vxlan100` with `nhid N` (decimal; iproute2's `bridge fdb show` always prints decimal NHIDs).
  • `ip nexthop show` lists `id N group X/Y … fdb` whose two members resolve to pe-a's and pe-c's VTEP IPs.

Phase 2 — Stop pe-c's bgpd. EAD-per-EVI from pe-c withdraws; per the projection invariant (`empty alias_vtep_ips ⇔ alias_group_key.is_none()`) the entry collapses from FDB-NHG to single-dst pointing at pe-a. Expect the row to fall back to `dst pe-a` (no `nhid`) and the group + pe-c member to tear down.

Phase 3 — `bridge fdb del` removes the MAC on pe-a. Expect the row + remaining FDB nexthops for the test VTEPs to disappear.

What's deliberately not in this PR

  • Per-flow ECMP via real traffic. Kernel hashing + container topology make traffic-based ECMP assertions flaky; PR evpn-linux: ADR-0059 slice 3b — wire FDB-NHG onto the apply path #88's actor-level tests + M40's kernel-programming assertions are sufficient evidence that the data path is wired correctly. A traffic-validation PR is a v1.1 follow-up if operators ask.
  • Privileged-runner CI gate. Same as M36/M37/M38/M39 — manual / local only. ROADMAP already tracks the privileged-runner gate as a Gate 9 follow-up; M40 lands in the same slot once that exists.
  • pe-a ↔ pe-c peering. Both FRR nodes treat the ES as local (`evpn mh es-id` makes it local regardless of peering). DF election between them is not tested here; the test focuses on rustbgpd's kernel programming.

Test plan

  • `cargo fmt --all -- --check`
  • `cargo clippy --workspace --all-targets -- -D warnings`
  • `cargo test --workspace --no-fail-fast` (no Rust source changes; pre-existing test set passes — the slice 3b actor-level tests added in PR evpn-linux: ADR-0059 slice 3b — wire FDB-NHG onto the apply path #88 cover the coordinator the new M40 exercises against the kernel)
  • Bash syntax check on the driver (`bash -n`)
  • Manual containerlab smoke on the cloudbox (privileged runner):
    ```bash
    docker build -t rustbgpd:dev .
    containerlab deploy -t tests/interop/m40-evpn-aliasing-ecmp-frr.clab.yml
    bash tests/interop/scripts/test-m40-evpn-aliasing-ecmp-frr.sh
    containerlab destroy -t tests/interop/m40-evpn-aliasing-ecmp-frr.clab.yml
    ```

References

First end-to-end validation of the slice 3b coordinator + Pass 1b
diff path against a real upstream peer. After this, aliasing-ECMP
moves from "shipped, kernel-primitive smoke green" to "shipped,
FRR-validated against 10.3.1" — gating evidence to flip the
production-default knob.

Topology (`tests/interop/m40-evpn-aliasing-ecmp-frr.clab.yml`):

  pe-rb (rustbgpd VTEP) ── pe-a (FRR, ES1)
                       └── pe-c (FRR, ES1)

rustbgpd peers with two FRR VTEPs that share ES1 (es-id 1,
es-sys-mac aa:bb:cc:dd:ee:ff — same as M32). The FRR side
reuses `start-frr-vtep-mh.sh` so the bond-as-ES shape (FRR
requires it for EVPN-MH to register the local ES) is the same
known-good config the M32 reflector test exercises.

Driver (`scripts/test-m40-evpn-aliasing-ecmp-frr.sh`):

  Phase 1 — Inject a test MAC on pe-a's `esdummy` bond. Expect:
    - FDB row on rustbgpd's `vxlan100` with `nhid N` (decimal —
      iproute2's `bridge fdb show` always prints decimal NHIDs).
    - `ip nexthop show` lists `id N group X/Y … fdb` whose two
      members resolve to pe-a's and pe-c's VTEP IPs.
  Phase 2 — Stop pe-c's bgpd. EAD-per-EVI from pe-c withdraws;
    per the projection invariant (`empty alias_vtep_ips ⇔
    alias_group_key.is_none()`) the entry collapses from
    FDB-NHG to single-dst pointing at pe-a. Expect the row to
    fall back to `dst pe-a` (no `nhid`) and the group + pe-c
    member to be torn down.
  Phase 3 — `bridge fdb del` removes the MAC on pe-a. Expect the
    row + remaining FDB nexthops for the test VTEPs to disappear.

Manual / local only — containerlab needs privileges CI doesn't
carry. M39's privileged-runner gate (ROADMAP follow-up) is what
brings this into automated coverage; until then the test runs
the same way M36 / M37 / M38 / M39 do.

Files: clab topology, rustbgpd VTEP TOML, two FRR confs, the
driver. The rustbgpd-side bridge + VXLAN topology is set up by
the existing generic `start-rustbgpd-vtep.sh` (parameterized by
loopback IP + VNI). ROADMAP flips slice 3b → shipped and slice
4 → in-flight; CHANGELOG `[Unreleased] ### Added` gets the slice
4 entry.
Copy link
Copy Markdown

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Adds the M40 manual containerlab interop smoke to validate ADR-0059 aliasing-ECMP (FDB nexthop groups via NDA_NH_ID) against real FRR EVPN-MH, plus documentation updates recording slice-4 status.

Changes:

  • Adds a new containerlab topology for rustbgpd ↔ 2× FRR EVPN-MH peers sharing an ES.
  • Adds a new M40 bash driver that injects/withdraws a MAC and asserts FDB-NHG programming via bridge fdb show + ip nexthop show.
  • Updates ROADMAP.md and CHANGELOG.md to reflect slice-4 progress.

Reviewed changes

Copilot reviewed 7 out of 7 changed files in this pull request and generated 3 comments.

Show a summary per file
File Description
tests/interop/scripts/test-m40-evpn-aliasing-ecmp-frr.sh New M40 driver script that validates FDB-NHG install, drain-to-single-dst, and teardown.
tests/interop/m40-evpn-aliasing-ecmp-frr.clab.yml New containerlab topology wiring rustbgpd to two FRR EVPN-MH VTEPs.
tests/interop/configs/rustbgpd-m40-vtep.toml rustbgpd config for M40 EVPN VTEP + two EVPN neighbors.
tests/interop/configs/frr-bgpd-m40-vtep-a.conf FRR pe-a config: ES1 + EVPN BGP session to rustbgpd.
tests/interop/configs/frr-bgpd-m40-vtep-c.conf FRR pe-c config: ES1 (alias-only) + EVPN BGP session to rustbgpd.
ROADMAP.md Marks slice 4 (M40) as in-flight and updates status text for ADR-0059.
CHANGELOG.md Adds an Unreleased entry describing the new M40 manual smoke and what it validates.
Comments suppressed due to low confidence (1)

tests/interop/scripts/test-m40-evpn-aliasing-ecmp-frr.sh:88

  • The topology already starts rustbgpd via /usr/local/bin/start-rustbgpd-vtep.sh in the containerlab exec section, but the driver also calls start_rustbgpd here. That will try to launch a second rustbgpd process (and by default uses /usr/local/bin/start-rustbgpd.sh, which isn’t mounted in this topology), making the smoke unreliable. Pick a single owner for daemon startup: either remove this start_rustbgpd call, or stop auto-starting in the .clab.yml and rely on the driver.
resolve_grpc_addr
start_rustbgpd


💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment on lines +38 to +42
SCRIPT_DIR="$(cd "$(dirname "$0")" && pwd)"
# shellcheck source=test-lib.sh
source "$SCRIPT_DIR/test-lib.sh"

PE_RB="clab-${TOPO}-pe-rb"
Comment on lines +198 to +201
log "[phase 2] stop pe-c — withdrawal of EAD-per-EVI must drain the group"
docker exec "$PE_C" sh -c "pkill -TERM bgpd || true; sleep 1; pkill -KILL bgpd || true" \
>/dev/null 2>&1
ok "pe-c bgpd stopped"
- ip link set eth1 up
- ip addr add 10.0.1.1/24 dev eth2
- ip link set eth2 up
- /usr/local/bin/start-rustbgpd-vtep.sh 10.0.0.1 100
Three findings on the slice 4 PR (#89):

1. `test-lib.sh::preflight` runs at source time and expects
   the rustbgpd container at `clab-${TOPO}-rustbgpd`. M40's
   node is `pe-rb`, so the source-time check would fail and
   the driver would exit before any assertion ran. Override
   `RUSTBGPD=$PE_RB` + `export RUSTBGPD` before sourcing the
   lib — same pattern M39 uses for its `pe1` node.

2. Phase 2 used `pkill -TERM/-KILL bgpd` inside pe-c, but
   FRR's `watchfrr` would auto-restart bgpd a second later
   and re-originate EAD-per-EVI, racing the drain assertion.
   Switch to `docker stop --time 5 pe-c` so watchfrr +
   bgpd go down together — M1 relies on the auto-restart
   behavior; M40 needs the opposite (deterministic drain).

3. The clab `exec` block already runs `start-rustbgpd-vtep.sh`
   which launches rustbgpd in the background. The driver was
   ALSO calling `start_rustbgpd` from test-lib, which would
   spawn a second daemon competing for the same gRPC port.
   Match the M36 pattern: only call `resolve_grpc_addr` +
   poll `grpc_health` until ready; don't double-spawn.
Copy link
Copy Markdown

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 7 out of 7 changed files in this pull request and generated 2 comments.

Comments suppressed due to low confidence (5)

tests/interop/scripts/test-m40-evpn-aliasing-ecmp-frr.sh:156

  • In the wait_for_rb predicate, $TEST_MAC is wrapped in single quotes inside the sh -c string, so it will not be expanded by the outer shell and (unless exported into the container) will be treated as the literal string $TEST_MAC inside the container. This will prevent the predicate from ever matching the MAC and cause the wait to time out. Prefer embedding the expanded value (or passing args without sh -c) so the grep sees the actual MAC.
wait_for_rb \
    "FDB row for $TEST_MAC has nhid" \
    "bridge fdb show dev $VXLAN | grep -i '$TEST_MAC' | grep -E ' nhid [0-9]+'" \
    45

tests/interop/scripts/test-m40-evpn-aliasing-ecmp-frr.sh:235

  • Same quoting issue as phase 1: $TEST_MAC and $PE_A_IP are inside single quotes in the sh -c command string, so they won’t expand and the wait_for_rb predicate can never succeed. Remove the single quotes (or otherwise ensure the host expands these values before passing into the container).
wait_for_rb \
    "FDB row falls back to single-dst (no nhid)" \
    "bridge fdb show dev $VXLAN | grep -i '$TEST_MAC' | grep -vE ' nhid [0-9]+' | grep -E ' dst $PE_A_IP'" \
    45

tests/interop/scripts/test-m40-evpn-aliasing-ecmp-frr.sh:259

  • Same quoting issue as earlier phases: the wait_for_rb predicate uses '$TEST_MAC', which won’t expand (and isn’t exported into the container), so the negated grep will not reliably assert the MAC is absent. Expand the MAC value before sending the command to sh -c (or avoid sh -c).
wait_for_rb \
    "FDB row for $TEST_MAC removed from rustbgpd's bridge FDB" \
    "! bridge fdb show dev $VXLAN | grep -qi '$TEST_MAC'" \
    45

tests/interop/scripts/test-m40-evpn-aliasing-ecmp-frr.sh:118

  • wait_frr_established already emits ok/fail and updates the pass/fail counters. Wrapping it with && ok ... || fail ... double-counts results and can report misleading totals. Call wait_frr_established directly and rely on its return code if you need to abort.
log "[baseline] both FRR peers Established to rustbgpd"
wait_frr_established "$PE_A" "10.0.0.1" "pe-a EVPN" \
    && ok "pe-a Established" || fail "pe-a not Established"
wait_frr_established "$PE_C" "10.0.1.1" "pe-c EVPN" \
    && ok "pe-c Established" || fail "pe-c not Established"

tests/interop/scripts/test-m40-evpn-aliasing-ecmp-frr.sh:186

  • The ip nexthop show parsing assumes member IDs are strictly decimal (group [0-9/]+), but elsewhere in the repo the netns FDB-NHG tests explicitly allow iproute2 to print member IDs in hex (e.g. group 0x30000001). To make this interop script resilient across iproute2 versions, accept both decimal and 0x... forms and strip the 0x prefix before matching member lines.
nh_dump=$(nexthop_show)
group_line=$(echo "$nh_dump" | grep -E "^id $nhid group [0-9/]+ .*fdb" || true)
if [ -n "$group_line" ]; then
    ok "group line present: $group_line"
else
    fail "no group line for nhid=$nhid in ip nexthop show"
    echo "--- nh_dump ---" >&2
    echo "$nh_dump" | head -30 >&2
fi

# Extract member IDs from the group line: `group 12/13 ...` → 12 and 13.
member_ids=$(echo "$group_line" \
    | grep -oE 'group [0-9/]+' \
    | awk '{print $2}' \
    | tr '/' ' ')

break
fi
if [ "$i" -eq 15 ]; then
fail "gRPC endpoint not reachable within 30s"
Comment on lines +24 to +26
# Shared ESI Type 0 derived from es-sys-mac aa:bb:cc:dd:ee:ff +
# es-id 1 (matches the M32 convention). pe-a and pe-c don't
# peer directly with each other — we don't need their Type 4
Two findings on `bd5d69c`:

1. ESI type comment in `m40-evpn-aliasing-ecmp-frr.clab.yml`
   said "Type 0" but the test asserts on a leading `03:` (Type 3
   MAC-based ESI per RFC 7432 §5; same as `start-frr-vtep-mh.sh`
   documents). Carried over wrong from M32's clab comment.
   Corrected.

2. Hard prereqs in the driver called `fail` but continued
   running, producing cascading false-negative assertions in
   subsequent phases. The four checkpoints whose failure
   invalidates every downstream phase (gRPC ready,
   Phase-1 FDB-row-with-nhid, Phase-2 single-dst-fallback,
   Phase-3 FDB-row-removed) now `print_summary` (which exits
   non-zero) on failure. The in-phase soft assertions still
   accumulate per the M-series convention so the operator
   sees all simultaneous breakage in a single run.
Copy link
Copy Markdown

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 7 out of 7 changed files in this pull request and generated 3 comments.

Comment on lines +175 to +180
log "[phase 1] ip nexthop show has the matching group with two members"
# Group line shape: `id <gid> group <mid>/<mid> ... fdb`. Members
# print as `id <mid> via <ip> fdb`. We assert:
# - the group exists with our nhid
# - exactly two members via pe-a's IP and pe-c's IP
nh_dump=$(nexthop_show)
Comment on lines +236 to +238
docker stop --time 5 "$PE_C" >/dev/null 2>&1 \
&& ok "pe-c container stopped" \
|| fail "could not stop pe-c container"
fdb_row=$(fdb_show_test_mac)
nhid=$(echo "$fdb_row" | grep -oE ' nhid [0-9]+' | awk '{print $2}' | head -1)
if [ -z "$nhid" ]; then
fail "could not parse nhid from FDB row: $fdb_row"
Three findings on `d47a4d0`:

1. Member-count assertion was permissive: checked that both
   expected VTEP IPs appeared in the group but tolerated extra
   members. Now asserts exactly 2 distinct gateways covering
   pe-a + pe-c; an unexpected third member trips the check.

2. `docker stop pe-c` failure was soft-failed (continued into
   the drain assertions even though the prerequisite withdrawal
   never happened). Now `print_summary` on failure — Phase 2 is
   meaningless without it.

3. `nhid` parse failure left `\$nhid` empty, so the next
   `^id  group ...` grep would match anything starting with
   `id `. Hard-exit if the parse fails so the cascade can't
   fire.
Copy link
Copy Markdown

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 7 out of 7 changed files in this pull request and generated 1 comment.

Comments suppressed due to low confidence (2)

tests/interop/scripts/test-m40-evpn-aliasing-ecmp-frr.sh:202

  • Because this script sources test-lib.sh (which enables set -euo pipefail), this command substitution will cause the whole script to exit early if grep finds no match (e.g., if ip nexthop show output format differs or the group line is temporarily absent). Add || true / a non-failing parser (e.g., awk) so the script can fall through to the explicit if [ -n ... ] failure path and still print a summary.
# Extract member IDs from the group line: `group 12/13 ...` → 12 and 13.
member_ids=$(echo "$group_line" \
    | grep -oE 'group [0-9/]+' \
    | awk '{print $2}' \
    | tr '/' ' ')
if [ -n "$member_ids" ]; then
    ok "parsed group member IDs: $member_ids"
else
    fail "could not parse group member IDs from: $group_line"
fi

tests/interop/scripts/test-m40-evpn-aliasing-ecmp-frr.sh:219

  • Same strict-mode issue here: if a member line doesn’t match, grep -oE 'via ...' (and the grep -v / grep -c pipelines below) can return exit code 1 inside command substitutions, and set -e will abort the script before it can report a structured failure. Make these parsing pipelines non-fatal (e.g., append || true, or use awk/wc -l for counting).
    mline=$(echo "$nh_dump" | grep -E "^id $mid via [0-9.]+ .*fdb" || true)
    mip=$(echo "$mline" | grep -oE 'via [0-9.]+' | awk '{print $2}')
    if [ -n "$mip" ]; then
        member_ips="$member_ips $mip"
    fi
done
# Sort + dedupe so we count distinct gateways (one IP shouldn't
# appear twice in a healthy group; the count check below rules
# that out).
member_ip_list=$(echo "$member_ips" | tr ' ' '\n' | sort -u | grep -v '^$')
member_ip_count=$(echo "$member_ip_list" | grep -c .)
if [ "$member_ip_count" -eq 2 ] \

Comment on lines +170 to +175
fdb_row=$(fdb_show_test_mac)
nhid=$(echo "$fdb_row" | grep -oE ' nhid [0-9]+' | awk '{print $2}' | head -1)
if [ -z "$nhid" ]; then
fail "could not parse nhid from FDB row: $fdb_row"
print_summary
fi
`test-lib.sh` sets `set -euo pipefail`; the captured-grep
pipelines in the driver could trip `inherit_errexit` (one
`shopt` away) and abort the script before the explicit
empty-checks could handle the no-match path. Add defensive
`|| true` to every captured grep pipeline:

  - nhid parse out of the FDB row
  - group-line member-ID list parse
  - per-member `via <ip>` extraction
  - dedupe + count of distinct gateways

Each one is paired with an explicit `[ -z ... ]` / count check
that already covers the failure mode; the `|| true` keeps the
no-match path flowing into that check instead of taking down
the script.
Copy link
Copy Markdown

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 7 out of 7 changed files in this pull request and generated 2 comments.

Comments suppressed due to low confidence (2)

tests/interop/scripts/test-m40-evpn-aliasing-ecmp-frr.sh:266

  • Same quoting issue as Phase 1: grep -i '$TEST_MAC' searches for the literal $TEST_MAC rather than the actual MAC, so this Phase 2 wait condition can never succeed. Use a pattern that expands to the MAC value (or export TEST_MAC into the container exec environment).
wait_for_rb \
    "FDB row falls back to single-dst (no nhid)" \
    "bridge fdb show dev $VXLAN | grep -i '$TEST_MAC' | grep -vE ' nhid [0-9]+' | grep -E ' dst $PE_A_IP'" \
    45 \
    || { print_summary; }

tests/interop/scripts/test-m40-evpn-aliasing-ecmp-frr.sh:291

  • Same quoting issue as earlier phases: grep -qi '$TEST_MAC' matches the literal $TEST_MAC, so this wait condition will not detect the MAC being removed. Expand the variable in the host script (or export it) before passing the command to sh -c.
wait_for_rb \
    "FDB row for $TEST_MAC removed from rustbgpd's bridge FDB" \
    "! bridge fdb show dev $VXLAN | grep -qi '$TEST_MAC'" \
    45 \
    || { print_summary; }

# bail out on failure rather than emit cascading false negatives.
wait_for_rb \
"FDB row for $TEST_MAC has nhid" \
"bridge fdb show dev $VXLAN | grep -i '$TEST_MAC' | grep -E ' nhid [0-9]+'" \
Comment on lines +120 to +123
wait_frr_established "$PE_A" "10.0.0.1" "pe-a EVPN" \
&& ok "pe-a Established" || fail "pe-a not Established"
wait_frr_established "$PE_C" "10.0.1.1" "pe-c EVPN" \
&& ok "pe-c Established" || fail "pe-c not Established"
One legit finding fixed; one false positive skipped.

Fixed: `wait_frr_established` already calls `ok` / `fail` and
increments the pass/fail counters internally. The
`&& ok ... || fail ...` wrapper M32 / older M-series scripts use
double-counts the result. Dispatch the helper directly and use
`|| print_summary` to abort on a failed prereq instead. (The
existing sibling tests have the same bug; cleanup in a separate
pass.)

Skipped (false positive): Copilot flagged `grep -i '$TEST_MAC'`
inside the `sh -c` predicate as not expanding `$TEST_MAC`. The
nested single quotes are inside an outer double-quoted string,
so bash expands `$TEST_MAC` on the host before the string ever
reaches `sh -c`. Verified with `bash -c 'str="...'\''$VAR'\''..."; echo "$str"`
— variable is interpolated, single quotes survive as literals
to protect the MAC's colons from word-splitting in `sh -c`. The
quoting is correct.

This is round 5/5 — per the project memory rule capping
Copilot loops at ~5 rounds before switching to deep human
review, no further Copilot rounds will be requested. Branch
ready for human review.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants