Add support for MP-BGP by taspelund · Pull Request #584 · oxidecomputer/maghemite

taspelund · 2025-12-15T02:41:50Z

Initial implementation of MP-BGP, supporting IPv4 Unicast and IPv6 Unicast address-families.

Adds MP_REACH_NLRI and MP_UNREACH_NLRI types and handling, capabilities negotiation and handling, updated message parsing framework, enhanced UPDATE error handling (RFC 7606), lots of tests, and more flexible BGP next-hop encoding support (in preparation for extended next-hop and BGP unnumbered).

taspelund · 2026-01-07T01:02:32Z

test-interop is currently failing because interop hasn't gotten the updates to support the new mgadm bgp config neighbor create syntax. This is addressed in https://github.com/oxidecomputer/testbed/pull/244 and the test should start passing when we pull this in (either by merging the testbed PR into main, or by pinning maghemite to the PR branch).

Implement MP_REACH_NLRI and MP_UNREACH_NLRI (RFC 4760) for IPv6 route exchange. Wire format (RFC 7606 compliance): - Always encode MP-BGP attributes first in UPDATE - Never encode both traditional and MP-BGP encodings in the same UPDATE - De-duplicate received non-MP attributes - Reject duplicate MP-BGP attributes with NOTIFICATION - Handle received UPDATE carrying traditional and MP-BGP encodings - Handle received UPDATE carrying both MP_REACH_NLRI and MP_UNREACH_NLRI Next-hop handling: - Add BgpNexthop enum: Ipv4, Ipv6Single, Ipv6Double (link-local + GUA) - Add UpdateMessage::nexthop() to centralize next-hop extraction logic - Preserve raw bytes in MpReach for capability-aware parsing Fanout refactor: - Split into Fanout4/Fanout6 using zero-sized marker types - Add RouteUpdate enum carrying typed prefix Vecs - Replace FSM event passing to use AFI/SAFI-specific announcements Session layer updates: - Process MP_REACH_NLRI/MP_UNREACH_NLRI in received UPDATEs - Extract IPv4/IPv6 prefixes separately for RIB updates - Encode IPv4 Unicast in traditional nlri/withdrawn fields - Encode IPv6 Unicast in MP-BGP attributes Message parsing: - Add MpReach/MpUnreach structs - Add prefixes4_from_wire()/prefixes6_from_wire() helpers - Restrict UPDATE nlri/withdrawn fields to Prefix4 - Restrict NEXT_HOP attribute to Ipv4Addr - Add Display impl for Aggregator/As4Aggregator Error handling: - Add UnsupportedAddressFamily error variant - Add MalformedAttributeList error variant Testing: - Add property-based tests for codec round-trips - Cover BgpNexthop, MpReach, MpUnreach, UpdateMessage variants - Test RFC 7606 encoding order and deduplication behavior Closes: #397

Continue MP-BGP (RFC 4760) implementation with MpReachNlri and MpUnreachNlri path attribute types. Add comprehensive framework for parse error handling for all BGP message types, with revised UPDATE error handling per RFC 7606. Message layer changes: - Rewrite MpReachNlri/MpUnreachNlri path attribute types to use parsed types in the struct, moving structural parsing into connection layer and treating unsupported AFI/SAFIs as parsing errors (if we don't support it, then we didn't advertise it, and you sending it to us is an error). - Introduce MessageParseError enum covering all message types (Open, Update, Notification, RouteRefresh) with structured error reasons - Add UpdateParseError with RFC 7606 error actions and reason types preserving section context (withdrawn/attributes/nlri) - Add OpenParseError, NotificationParseError, RouteRefreshParseError for non-UPDATE message failures - Implement mandatory attribute validation (ORIGIN, AS_PATH, NEXT_HOP) with treat-as-withdraw semantics per RFC 7606 - Track parse errors in UpdateMessage for session-layer processing Session layer changes: - Add AfiSafiState enum to track capability negotiation outcome per address family (Unconfigured/Advertised/Negotiated) - Add StopReason::ParseError variant carrying error codes for proper NOTIFICATION message generation - Handle ConnectionEvent::ParseError in all FSM states - Add TcpConnectionFails event for recv loop IO errors Connection layer changes: - Distinguish parse errors from IO errors via RecvError enum - Send ParseError events to FSM for all message types instead of converting to generic IO errors, preserving error codes for NOTIFICATION generation Testing: - Add unit tests for attribute error action selection per RFC 7606 - Add unit tests for attribute flag validation - Add unit tests for error collection during UPDATE parsing - Add unit tests for mandatory attribute validation - Update proptest strategies for new MpReachNlri/MpUnreachNlri types

Splits RouteUpdate into V4(RouteUpdate4) and V6(RouteUpdate6) variants, each of which have an IP-version specific Announce/Withdraw variant. This uses the type system to ensure that any sender of a RouteUpdate will be forced to use multiple events if they want to trigger both an announcement and a withdrawal, upholding the RFC 7606 requirement that Updates cannot be encoded with more than one of the following populated: - Traditional Withdrawn field (IPv4) - Traditional NLRI field (IPv4) - MP_REACH_NLRI (IPv4 or IPv6) - MP_UNREACH_NLRI (IPv4 or IPv6) Additionally, this cleans up some unhelpful tests (like verifying length or emptiness of nlri/withdrawn fields) and adds several more: - round-trip codec tests for route-refresh and notification - simultaneous traditional/mp-bgp encoding of ipv4 unicast - making sure tests covering mp-bgp enums cover all variants - making sure tests covering BgpNexthop enum cover all variants

Split ImportExportPolicy into typed IPv4 and IPv6 variants (ImportExportPolicy4, ImportExportPolicy6) to enable proper per-AF filtering. This adds IPv6 import filtering. Key changes: - Add ImportExportPolicy4/6 types for compile-time type safety - Add ImportExportPolicyV1 for API backward compatibility - Update SessionInfo to use per-AF policy fields - Add per-AF export policy change handling - Add per-AF route refresh (used for import policy changes) - Add mark_bgp_peer_stale4/6 for per-AF stale marking The Neighbor struct retains legacy allow_import/allow_export fields for API compatibility, with conversion helpers to combine per-AF policies back to the legacy format. Closes: #211

- Adds length checks for message and path attribute parsing - Adds AtomicAggregate mandatory path attribute (basic support) - Updates Aggregator/As4Aggregator path attribute types to use parsed fields instead of raw bytes and have a real Display impl - Adds logging of non-zero reserved field in MP_REACH_NLRI - Adds unit tests to ensure length validation works properly

Removes the filtering of originated routes from a received UPDATE. Scenario: 1. Peer advertises us a route that we are originating and we discard it. 2. We stop advertising that route. 3. Us withdrawing our advertisement doesn't cause the peer to choose a different bestpath for that route. 4. We no longer have a path for that route from this peer. This could also be addressed by: A) Sending a route-refresh to the peer when we stop originating routes. B) Inserting a Local path into the RIB for routes we are originating which is preferred over BGP paths learned from peers during bestpath. (A) is not a good idea because we don't know how large the peer's RIB will be and we'd have to re-process their entire RIB every time we stop advertising even a single route. (B) is fine conceptually, as most routing stacks use this approach: When using a "network" statement to pull routes from other protocols into the BGP topology or creating an aggregate, a locally originated path generally goes into the BGP LOC-RIB that is preferred over any peer-learned path. If an existing routing table entry doesn't exist, then this generally creates a blackhole route for the originated prefix (in order to prevent routing loops from occurring as a result of missing entries for the component routes). In the case of maghemite, well, really in the case of dendrite, this routing loop prevention is done without using blackhole routes: packets arriving on a front panel port that enocounter a NAT "miss" are dropped rather than submitted to an LPM lookup in the External routing table (avoiding us becoming a transit router + hairpinning packets sent to us from outside the rack). All that to say: we don't need local routes for any kind of forwarding (or blackholing) data plane behavior. This brings us to the crux of the issue: Inter-VPC routing. Since we don't currently have a native inter-VPC data plane within the rack, then we need to have installed an External route that covers our originated prefix regardless of what mask is used. If we do an exact match check and filtering, then we just make it harder for our peer to advertise us a route we need. We effectively require the peer to advertise us reachability to our own prefix, so long as the mask doesn't match what we're using (e.g. if we send a /24, they have to send us any mask other than a /24 that covers the original IP range). So while removing this filter doesn't address the issues inherent to us needing the peer to provide us a route back to our own address space, it does make it simpler for us to learn that route (no longer needing a route with anything but our own mask) and to avoid extra logic to ensure we can recover routes that we filtered.

Adds a new API version for MP-BGP. Updates mgadm to use the new API version. Adds backwards compatible type and conversion for peers created with old API versions.

This reverts commit 7260db3.

Adds an optional nexthop parameter to the address-family config for BGP peers. This allows an operator to define the nexthop used in route advertisements on a per AFI/SAFI basis, rather than unconditionally deriving the next-hop from the connection (SIP of the TCP session). The current implementation enforces that the NLRI address-family and the next-hop address-family are homogenous. However, this is intended to be changed when Extended Next-Hop support is added. Adds Unit tests for the next-hop derivation function. Adds Integration test cases: - IPv4 NLRI (only) over IPv6 connection (explicit IPv4 next-hop) - IPv6 NLRI (only) over IPv4 connection (explicit IPv6 next-hop) - Dual Stack NLRI over IPv4 connection (explicit IPv6 next-hop) - Dual Stack NLRI over IPv6 connection (explicit IPv4 next-hop) Refactors test helpers to allow caller to specify what NLRI should be exchanged between peers in the test (v4 only, v6 only, or Dual Stack). Updates Neighbor create/update API and bumps mgadm to use it.

Expose detailed peer session state, configuration, and metrics through a new v3.0.0 API endpoint. Includes FSM state, timers with remaining duration, connection counters, and advertised capabilities. Changes: - Add PeerCounters and BgpCapability types for API responses - Ossify DynamicTimerInfo as DynamicTimerInfoV1 for backwards compat - Add new DynamicTimerInfo with remaining field for v3 API - Add PeerTimersV1 for backward compatibility with v1/v2 - Extend PeerInfo with timers, counters, config (v3 only) - Add get_timers() and get_peer_info() to SessionRunner - Plumb peer_group through PeerConfig and NeighborInfo - Implement get_neighbors_v3 endpoint returning PeerInfo - Constrain get_neighbors_v2 to VERSION_IPV6_BASIC..VERSION_MP_BGP - Add Capability::code() method for OpenAPI serialization - Limit BgpNexthop doc comment in OpenAPI schema Closes: #387

- Adds handler for nexthop override updates - Removes PathAttributesChanged FSM event This was functionally equivalent to ReadvertiseRoutes except it didn't have an address-family discriminator. When a change occurs that affects both address-families, two ReadvertiseRoute events can be sent in place of one PathAttributesChanged event.

Replace BTreeSet with HashMap in the RIB to explicitly track path identity via PathKey. Paths from the same source (peer ID for BGP, nexthop+vlan_id for static routes) now properly replace each other rather than coexisting. This ensures correct bestpath selection when routes are updated. Adds new unit tests for bestpaths and rib insertion/removal to ensure path identity properties are upheld.

Fix deadlock in trigger_bestpath_when() caused by re-acquiring already-held Mutex locks. Iterate over the acquired lock reference instead of calling full_rib*() which tries to re-acquire the same mutex. Optimize fanout reads by caching it once per operation instead of reading from persistent storage for each prefix. Apply optimization to both bestpath triggering and nexthop shutdown operations. Fix test database conflicts by using thread names to ensure each test function gets a unique database path.

Open file-backed logger in append mode so old entries aren't lost. Use the thread name in the filename to avoid multiple processes stomping on each other. With a per-process instance of slog::Logger and no central authority across processes to coordinate writes, it's possible for two processes to corrupt the log file contents by writing at the same time. Signed-off-by: Trey Aspelund <trey@oxidecomputer.com>

Signed-off-by: Trey Aspelund <trey@oxidecomputer.com>

Moving common neighbor args into NeighborCommon resulted in some of the CLI args appearing in a different order, breaking scripts elsewhere (e.g. init.sh for interop topology). This unifies numbered/unnumbered neighbors into a single CLI which is then split into Neighbor + UnnumberedNeighbor API types for the appropriate call.

Expose NDP interfaces and their neighbor cache via the API. This allows an operator to see what the current config/state is for a managed unnumbered interface. Renames get_neighbors_unified() to get_neighbors_v4() aligning with the versioned method names.

Fixes issue where update API would return "no such interface" errors for unnumbered neighbors. Allows API handlers to determine if there is already an active session for unnumbered peers. Without this, there was no handling for updates to an existing unnumbered peer (requiring a delete/add workflow).

Fix lookup that was using PeerId::Ip instead of PeerId::Interface. Ensure that interface removal empties the interface scope map even if the interface isn't a registered interface (e.g. if there was a partial failure during an earlier call).

Fixes a bug where multiple peers with the same IP have their routes tracked as coming from the same source. This is really only exposed when using link-local addrs (BGP unnumbered) since that's the only time the same IP can be found on multiple peers. Scenario: - R2 has BGP unnumbered peers R1 / R3 - R1/R3 both use the same link-local IP, e.g. fe80::1 - R2 learns 3fff:10::/64 from R1, 3fff:30::/64 from R3 - R2 stores fe80::1 in BgpPathProperties.peer field for both R1/R3 - R1 shuts down the BGP session - R2 cleans up all BGP routes matching BgpPathProperties.peer == fe80::1 - 3fff:20::/64 is correctly removed from the RIB - 3fff:30::/64 is erroneously removed from the RIB <-- wrong! test_three_router_chain_unnumbered has been updated to exercise this exact scenario - which failed prior to the IpAddr -> PeerId type update. Also updates API to use Path where peer is PeerId instead of IpAddr.

Fixes CLI commands that only accepted IP addresses, preventing operations on unnumbered BGP peers (identified by interface name). Changes: - Clear command: addr (IpAddr) → peer (String) - Added routing logic to detect peer type via parse_peer_type() - Routes to clear_neighbor_v2() for numbered peers - Routes to clear_unnumbered_neighbor() for unnumbered peers - FSM history: peer (Option<IpAddr>) → peer (Option<String>) - Parses peer string to PeerId enum before API call - Updated get_fsm_history() handler to convert types - Message history: peer (IpAddr) → peer (String) - Parses peer string to PeerId enum before API call - Updated get_message_history() handler to convert types - Removed dead code: ExportPolicy struct (never used) All commands now follow the pattern established by neighbor read/ delete/create commands, which already supported both peer types. Users can now specify peers as either IP addresses (numbered) or interface names (unnumbered) consistently across all CLI commands.

taspelund · 2026-01-31T05:42:43Z

CI failures for falcon job seem to be due to https://www.illumos.org/issues/17853
It would seem that if our instance of falcon is the only one running, then cleanup of the VNICs would trigger this bug

Adds a dummy0 vnic that persists beyond the shutdown of the falcon topology. This ensures there is no deletion of a final vnic so there's no panic causing the test runner to fail.

taspelund · 2026-02-03T13:07:37Z

I've added a workaround into falcon-lab.sh to avoid triggering a panic on the buildomat runner. This should avoid test failures due to the underlying OS panicking.

Updates API to no longer return an error when configuring BGP unnumbered on an interface that doesn't yet exist on the system. This is needed for early networking, as the BGP config and uplinks are created in parallel and there is no guarantee that the uplinks will get setup first. In order to support this, interface existence is ensured before completing new connections -- both inbound and outbound. For inbound connections, the Dispatcher queries the unnumbered manager before routing the new connection to the interface-based FSM. For outbound connections, the FSM queries the unnumbered manager before resolving the peer's socket addr that we connect() to. A new test is added that runs a BGP FSM through the lifecycle of an unnumbered interface, ensuring the FSM stays in the expected states based on the status of an interface (missing, added w/ NDP neighbor and removed). This supplements the existing tests that exercise the NDP neighbor lifecycle, together covering the possible state changes for an unnumbered interface and its NDP discovery.

When receiving BgpNexthop::Ipv6Double, choose the appropriate address based on session type: - Use link-local address for unnumbered peer - Use global address for numbered peer Also update next-hop handling for traditional NLRI encoding. When we receive IPv4 Unicast routes encoded in the traditional NLRI field, check for the NEXT_HOP attribute (Ipv4Addr) instead of grabbing a BgpNexthop that must be destructured (and also wouldn't make sense as an IPv6 value).

Adds update_unnumbered_session() wrapper that properly uses PeerIp::Interface to do the peer lookup. Fixes a lookup failure that occurred when the previous code tried to do a lookup by IP.

Unify the numbered/unnumbered neighbor creation flow by having Router own the methods to do so, calling into the ndp manager simply for interface registration. This flow now matches how it works for numbered peers.

Adds a monitoring thread for detecting system interface addition and removal on an ongoing basis. This is required above and beyond the initial changes to the Dispatcher and initiate_connection() because the FSM triggers only happen as a result of new connections. We cannot depend on inbound connections if we aren't able to start the NDP threads because we can't expect peers to try to connect to us if we haven't sent an RA yet, which means that passive connections could sit indefinitely without a timer-based mechanism. Currently the only OS level interface query that we have is not interface-specific, rather it collects all interfaces known to the system. If we're going to be querying the OS for all interfaces all the time, we may as well centralize that logic into the NDP manager and reduce the surface area in the FSM that needs to be aware of the system state. This also updates the mock impl of the unnumbered manager to represent a third state: ndp initialization. The prior two-state implementation could not represent a scenario where an interface is configured and exists on the system, but does not have ndp initialized. This exact scenario is what we'd encounter without a monitor thread: 1) unnumbered interface configured 2) ndp initialization skipped (interface doesn't exist on system) 3) interface created 4) no router advertisements sent to trigger inbound connections These updates to the mock impl are now used in the unnumbered interface lifecycle tests, which would have caught the issue and determined that the prior work was insufficient for addressing the problem.

Replace hardcoded wait durations in unnumbered BGP tests with constants derived from the test's BGP timer configuration. This ensures verification loops wait long enough to be meaningful while avoiding excessive test duration. Add module-level constants: - CONNECT_RETRY_VERIFICATION: 3s (3 connect_retry cycles) - ESTABLISHED_VERIFICATION: 8s (hold_time + margin) Test timing improvements: - test_unnumbered_peer_expiry_and_rediscovery: 31s -> 9s - test_unnumbered_interface_lifecycle: 49s -> 18s

Add visibility into NDP manager internals including: - Monitor thread state (running/stopped) - Pending interfaces (configured but not on system) - Active interfaces with rx/tx thread state New CLI structure: mgadm ndp status manager <asn> mgadm ndp status interfaces <asn> mgadm ndp status interface <asn> <name> New API endpoint: GET /ndp/manager

test_same_linklocal_multiple_interfaces was setting Router1's discovered peer to fe80::1 (its own address), causing sessions to attempt self-connections. Use fe80::2 (Router2) instead. Topology: eth0 (scope 2) eth0 (scope 2) ┌─────────────┐ ┌─────────────┐ │ fe80::1%2 │◄─────────────────►│ fe80::2%2 │ │ │ link 0 │ │ │ Router1 │ │ Router2 │ │ │ link 1 │ │ │ fe80::1%3 │◄─────────────────►│ fe80::2%3 │ └─────────────┘ └─────────────┘ eth1 (scope 3) eth1 (scope 3) NDP discovery (what each router sees as its peer): R1's view: eth0 -> fe80::2%2, eth1 -> fe80::2%3 (R2) R2's view: eth0 -> fe80::1%2, eth1 -> fe80::1%3 (R1) The test still validates scope_id differentiation for the same IP discovered on multiple interfaces.

taspelund requested a review from rcgoodfellow December 15, 2025 02:41

taspelund self-assigned this Dec 15, 2025

taspelund added needs testing bgp Border Gateway Protocol mgd Maghemite daemon customer For any bug reports or feature requests tied to customer requests rust Pull requests that update rust code labels Dec 15, 2025

rcgoodfellow added this to the 18 milestone Dec 18, 2025

taspelund mentioned this pull request Dec 31, 2025

bgp: consider removing unconditional rx filtering of originated routes #583

Open

taspelund force-pushed the trey/mp-bgp branch from 6fccf6d to e908474 Compare January 3, 2026 05:49

rcgoodfellow mentioned this pull request Jan 3, 2026

ipv6 e2e integration oxidecomputer/omicron#9570

Open

17 tasks

taspelund marked this pull request as ready for review January 8, 2026 01:04

rcgoodfellow mentioned this pull request Jan 8, 2026

BGP unnumbered #598

Merged

taspelund added 16 commits January 10, 2026 00:42

Bump API for MP-BGP

e00e6a2

Adds a new API version for MP-BGP. Updates mgadm to use the new API version. Adds backwards compatible type and conversion for peers created with old API versions.

Revert "bgp: remove update filtering of originated routes"

490a0d0

This reverts commit 7260db3.

mg-common: fix addrobj format for loopbackmanager

2671585

Signed-off-by: Trey Aspelund <trey@oxidecomputer.com>

remove unused import

e77a190

Signed-off-by: Trey Aspelund <trey@oxidecomputer.com>

taspelund added 10 commits January 29, 2026 11:24

Add MP-BGP/unnumbered support to bgp exported API

b392f7b

Fix PeerId lookup for unnumbered APIs

d5856ca

Fix bgp neighbor sorting... again

ef56abf

mgd: more unnumbered fixes

4b48a5f

Fix lookup that was using PeerId::Ip instead of PeerId::Interface. Ensure that interface removal empties the interface scope map even if the interface isn't a registered interface (e.g. if there was a partial failure during an earlier call).

Merge branch 'main' into trey/mp-bgp

508e2e7

rcgoodfellow linked an issue Jan 30, 2026 that may be closed by this pull request

bgp: IPv6 support #137

Open

2 tasks

falcon-lab.sh: remove pinned dendrite rev

bbac087

falcon-lab.sh: add workaround for illumos #17853

1c1b5c2

Adds a dummy0 vnic that persists beyond the shutdown of the falcon topology. This ensures there is no deletion of a final vnic so there's no panic causing the test runner to fail.

internet-diglett and others added 15 commits February 4, 2026 22:00

accept v6 prefixes in ApplyRequest

9e81f5d

feed clippy

2d7b14e

bump openapi

367c1b1

fixup! bump openapi

3f7eb28

read unnumbered peers from db on startup

035d857

cargo fmt

284a11d

Add more claude stuff to gitignore

932e8af

bgp: fix updates to unnumbered neighbors

78d7d9a

Adds update_unnumbered_session() wrapper that properly uses PeerIp::Interface to do the peer lookup. Fixes a lookup failure that occurred when the previous code tried to do a lookup by IP.

Move unnumbered neighbor creation into router

dc4b492

Unify the numbered/unnumbered neighbor creation flow by having Router own the methods to do so, calling into the ndp manager simply for interface registration. This flow now matches how it works for numbered peers.

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Add support for MP-BGP#584

Add support for MP-BGP#584
taspelund wants to merge 58 commits intomainfrom
trey/mp-bgp

taspelund commented Dec 15, 2025

Uh oh!

taspelund commented Jan 7, 2026

Uh oh!

taspelund commented Jan 31, 2026

Uh oh!

taspelund commented Feb 3, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

Conversation

taspelund commented Dec 15, 2025

Uh oh!

taspelund commented Jan 7, 2026

Uh oh!

taspelund commented Jan 31, 2026

Uh oh!

taspelund commented Feb 3, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants