Skip to content

Add support for MP-BGP#584

Open
taspelund wants to merge 58 commits intomainfrom
trey/mp-bgp
Open

Add support for MP-BGP#584
taspelund wants to merge 58 commits intomainfrom
trey/mp-bgp

Conversation

@taspelund
Copy link
Contributor

Initial implementation of MP-BGP, supporting IPv4 Unicast and IPv6 Unicast address-families.

Adds MP_REACH_NLRI and MP_UNREACH_NLRI types and handling, capabilities negotiation and handling, updated message parsing framework, enhanced UPDATE error handling (RFC 7606), lots of tests, and more flexible BGP next-hop encoding support (in preparation for extended next-hop and BGP unnumbered).

@taspelund taspelund self-assigned this Dec 15, 2025
@taspelund taspelund added needs testing bgp Border Gateway Protocol mgd Maghemite daemon customer For any bug reports or feature requests tied to customer requests rust Pull requests that update rust code labels Dec 15, 2025
@rcgoodfellow rcgoodfellow added this to the 18 milestone Dec 18, 2025
@taspelund
Copy link
Contributor Author

test-interop is currently failing because interop hasn't gotten the updates to support the new mgadm bgp config neighbor create syntax. This is addressed in https://github.com/oxidecomputer/testbed/pull/244 and the test should start passing when we pull this in (either by merging the testbed PR into main, or by pinning maghemite to the PR branch).

@taspelund taspelund marked this pull request as ready for review January 8, 2026 01:04
@rcgoodfellow rcgoodfellow mentioned this pull request Jan 8, 2026
Implement MP_REACH_NLRI and MP_UNREACH_NLRI (RFC 4760) for IPv6 route
exchange.

Wire format (RFC 7606 compliance):
- Always encode MP-BGP attributes first in UPDATE
- Never encode both traditional and MP-BGP encodings in the same UPDATE
- De-duplicate received non-MP attributes
- Reject duplicate MP-BGP attributes with NOTIFICATION
- Handle received UPDATE carrying traditional and MP-BGP encodings
- Handle received UPDATE carrying both MP_REACH_NLRI and MP_UNREACH_NLRI

Next-hop handling:
- Add BgpNexthop enum: Ipv4, Ipv6Single, Ipv6Double (link-local + GUA)
- Add UpdateMessage::nexthop() to centralize next-hop extraction logic
- Preserve raw bytes in MpReach for capability-aware parsing

Fanout refactor:
- Split into Fanout4/Fanout6 using zero-sized marker types
- Add RouteUpdate enum carrying typed prefix Vecs
- Replace FSM event passing to use AFI/SAFI-specific announcements

Session layer updates:
- Process MP_REACH_NLRI/MP_UNREACH_NLRI in received UPDATEs
- Extract IPv4/IPv6 prefixes separately for RIB updates
- Encode IPv4 Unicast in traditional nlri/withdrawn fields 
- Encode IPv6 Unicast in MP-BGP attributes

Message parsing:
- Add MpReach/MpUnreach structs
- Add prefixes4_from_wire()/prefixes6_from_wire() helpers
- Restrict UPDATE nlri/withdrawn fields to Prefix4
- Restrict NEXT_HOP attribute to Ipv4Addr
- Add Display impl for Aggregator/As4Aggregator

Error handling:
- Add UnsupportedAddressFamily error variant
- Add MalformedAttributeList error variant

Testing:
- Add property-based tests for codec round-trips
- Cover BgpNexthop, MpReach, MpUnreach, UpdateMessage variants
- Test RFC 7606 encoding order and deduplication behavior

Closes: #397
Continue MP-BGP (RFC 4760) implementation with MpReachNlri and
MpUnreachNlri path attribute types. Add comprehensive framework for
parse error handling for all BGP message types, with revised UPDATE
error handling per RFC 7606.

Message layer changes:
- Rewrite MpReachNlri/MpUnreachNlri path attribute types to use parsed
  types in the struct, moving structural parsing into connection layer
  and treating unsupported AFI/SAFIs as parsing errors (if we don't
  support it, then we didn't advertise it, and you sending it to us is
  an error).
- Introduce MessageParseError enum covering all message types (Open,
  Update, Notification, RouteRefresh) with structured error reasons
- Add UpdateParseError with RFC 7606 error actions and reason types
  preserving section context (withdrawn/attributes/nlri)
- Add OpenParseError, NotificationParseError, RouteRefreshParseError
  for non-UPDATE message failures
- Implement mandatory attribute validation (ORIGIN, AS_PATH, NEXT_HOP)
  with treat-as-withdraw semantics per RFC 7606
- Track parse errors in UpdateMessage for session-layer processing

Session layer changes:
- Add AfiSafiState enum to track capability negotiation outcome per
  address family (Unconfigured/Advertised/Negotiated)
- Add StopReason::ParseError variant carrying error codes for proper
  NOTIFICATION message generation
- Handle ConnectionEvent::ParseError in all FSM states
- Add TcpConnectionFails event for recv loop IO errors

Connection layer changes:
- Distinguish parse errors from IO errors via RecvError enum
- Send ParseError events to FSM for all message types instead of
  converting to generic IO errors, preserving error codes for
  NOTIFICATION generation

Testing:
- Add unit tests for attribute error action selection per RFC 7606
- Add unit tests for attribute flag validation
- Add unit tests for error collection during UPDATE parsing
- Add unit tests for mandatory attribute validation
- Update proptest strategies for new MpReachNlri/MpUnreachNlri types
Splits RouteUpdate into V4(RouteUpdate4) and V6(RouteUpdate6) variants,
each of which have an IP-version specific Announce/Withdraw variant.
This uses the type system to ensure that any sender of a RouteUpdate
will be forced to use multiple events if they want to trigger both an
announcement and a withdrawal, upholding the RFC 7606 requirement that
Updates cannot be encoded with more than one of the following populated:
- Traditional Withdrawn field (IPv4)
- Traditional NLRI field (IPv4)
- MP_REACH_NLRI (IPv4 or IPv6)
- MP_UNREACH_NLRI (IPv4 or IPv6)

Additionally, this cleans up some unhelpful tests (like verifying length
or emptiness of nlri/withdrawn fields) and adds several more:
- round-trip codec tests for route-refresh and notification
- simultaneous traditional/mp-bgp encoding of ipv4 unicast
- making sure tests covering mp-bgp enums cover all variants
- making sure tests covering BgpNexthop enum cover all variants
Split ImportExportPolicy into typed IPv4 and IPv6 variants
(ImportExportPolicy4, ImportExportPolicy6) to enable proper
per-AF filtering. This adds IPv6 import filtering.

Key changes:
- Add ImportExportPolicy4/6 types for compile-time type safety
- Add ImportExportPolicyV1 for API backward compatibility
- Update SessionInfo to use per-AF policy fields
- Add per-AF export policy change handling
- Add per-AF route refresh (used for import policy changes)
- Add mark_bgp_peer_stale4/6 for per-AF stale marking

The Neighbor struct retains legacy allow_import/allow_export
fields for API compatibility, with conversion helpers to
combine per-AF policies back to the legacy format.

Closes: #211
- Adds length checks for message and path attribute parsing
- Adds AtomicAggregate mandatory path attribute (basic support)
- Updates Aggregator/As4Aggregator path attribute types to use parsed
  fields instead of raw bytes and have a real Display impl
- Adds logging of non-zero reserved field in MP_REACH_NLRI
- Adds unit tests to ensure length validation works properly
Removes the filtering of originated routes from a received UPDATE.

Scenario:
1. Peer advertises us a route that we are originating and we discard it.
2. We stop advertising that route.
3. Us withdrawing our advertisement doesn't cause the peer to choose a
   different bestpath for that route.
4. We no longer have a path for that route from this peer.

This could also be addressed by:
A) Sending a route-refresh to the peer when we stop originating routes.
B) Inserting a Local path into the RIB for routes we are originating
   which is preferred over BGP paths learned from peers during bestpath.

(A) is not a good idea because we don't know how large the peer's RIB
will be and we'd have to re-process their entire RIB every time we stop
advertising even a single route.

(B) is fine conceptually, as most routing stacks use this approach:
When using a "network" statement to pull routes from other protocols
into the BGP topology or creating an aggregate, a locally originated
path generally goes into the BGP LOC-RIB that is preferred over any
peer-learned path. If an existing routing table entry doesn't exist,
then this generally creates a blackhole route for the originated prefix
(in order to prevent routing loops from occurring as a result of missing
entries for the component routes).
In the case of maghemite, well, really in the case of dendrite, this
routing loop prevention is done without using blackhole routes: packets
arriving on a front panel port that enocounter a NAT "miss" are dropped
rather than submitted to an LPM lookup in the External routing table
(avoiding us becoming a transit router + hairpinning packets sent to us
from outside the rack). All that to say: we don't need local routes for
any kind of forwarding (or blackholing) data plane behavior.

This brings us to the crux of the issue: Inter-VPC routing.
Since we don't currently have a native inter-VPC data plane within the
rack, then we need to have installed an External route that covers our
originated prefix regardless of what mask is used. If we do an exact
match check and filtering, then we just make it harder for our peer to
advertise us a route we need. We effectively require the peer to
advertise us reachability to our own prefix, so long as the mask doesn't
match what we're using (e.g. if we send a /24, they have to send us any
mask other than a /24 that covers the original IP range).

So while removing this filter doesn't address the issues inherent to
us needing the peer to provide us a route back to our own address space,
it does make it simpler for us to learn that route (no longer needing
a route with anything but our own mask) and to avoid extra logic to
ensure we can recover routes that we filtered.
Adds a new API version for MP-BGP.
Updates mgadm to use the new API version.
Adds backwards compatible type and conversion for peers created with old
API versions.
Adds an optional nexthop parameter to the address-family config for BGP
peers. This allows an operator to define the nexthop used in route
advertisements on a per AFI/SAFI basis, rather than unconditionally
deriving the next-hop from the connection (SIP of the TCP session).
The current implementation enforces that the NLRI address-family and the
next-hop address-family are homogenous. However, this is intended to be
changed when Extended Next-Hop support is added.

Adds Unit tests for the next-hop derivation function.

Adds Integration test cases:
 - IPv4 NLRI (only) over IPv6 connection (explicit IPv4 next-hop)
 - IPv6 NLRI (only) over IPv4 connection (explicit IPv6 next-hop)
 - Dual Stack NLRI over IPv4 connection (explicit IPv6 next-hop)
 - Dual Stack NLRI over IPv6 connection (explicit IPv4 next-hop)

Refactors test helpers to allow caller to specify what NLRI should be
exchanged between peers in the test (v4 only, v6 only, or Dual Stack).

Updates Neighbor create/update API and bumps mgadm to use it.
Expose detailed peer session state, configuration, and metrics
through a new v3.0.0 API endpoint. Includes FSM state, timers
with remaining duration, connection counters, and advertised
capabilities.

Changes:
- Add PeerCounters and BgpCapability types for API responses
- Ossify DynamicTimerInfo as DynamicTimerInfoV1 for backwards compat
- Add new DynamicTimerInfo with remaining field for v3 API
- Add PeerTimersV1 for backward compatibility with v1/v2
- Extend PeerInfo with timers, counters, config (v3 only)
- Add get_timers() and get_peer_info() to SessionRunner
- Plumb peer_group through PeerConfig and NeighborInfo
- Implement get_neighbors_v3 endpoint returning PeerInfo
- Constrain get_neighbors_v2 to VERSION_IPV6_BASIC..VERSION_MP_BGP
- Add Capability::code() method for OpenAPI serialization
- Limit BgpNexthop doc comment in OpenAPI schema

Closes: #387
- Adds handler for nexthop override updates
- Removes PathAttributesChanged FSM event
  This was functionally equivalent to ReadvertiseRoutes except it didn't
  have an address-family discriminator. When a change occurs that
  affects both address-families, two ReadvertiseRoute events can be sent
  in place of one PathAttributesChanged event.
Replace BTreeSet with HashMap in the RIB to explicitly track path
identity via PathKey. Paths from the same source (peer ID for BGP,
nexthop+vlan_id for static routes) now properly replace each other
rather than coexisting. This ensures correct bestpath selection when
routes are updated.

Adds new unit tests for bestpaths and rib insertion/removal to ensure
path identity properties are upheld.
Fix deadlock in trigger_bestpath_when() caused by re-acquiring
already-held Mutex locks. Iterate over the acquired lock reference
instead of calling full_rib*() which tries to re-acquire the same mutex.

Optimize fanout reads by caching it once per operation instead of
reading from persistent storage for each prefix. Apply optimization to
both bestpath triggering and nexthop shutdown operations.

Fix test database conflicts by using thread names to ensure each test
function gets a unique database path.
Open file-backed logger in append mode so old entries aren't lost.
Use the thread name in the filename to avoid multiple processes stomping
on each other. With a per-process instance of slog::Logger and no
central authority across processes to coordinate writes, it's possible
for two processes to corrupt the log file contents by writing at the
same time.

Signed-off-by: Trey Aspelund <trey@oxidecomputer.com>
Signed-off-by: Trey Aspelund <trey@oxidecomputer.com>
Signed-off-by: Trey Aspelund <trey@oxidecomputer.com>
Moving common neighbor args into NeighborCommon resulted in some of the
CLI args appearing in a different order, breaking scripts elsewhere
(e.g. init.sh for interop topology). This unifies numbered/unnumbered
neighbors into a single CLI which is then split into Neighbor +
UnnumberedNeighbor API types for the appropriate call.
Expose NDP interfaces and their neighbor cache via the API. This allows
an operator to see what the current config/state is for a managed
unnumbered interface.

Renames get_neighbors_unified() to get_neighbors_v4() aligning with the
versioned method names.
Fixes issue where update API would return "no such interface" errors
for unnumbered neighbors. Allows API handlers to determine if there
is already an active session for unnumbered peers. Without this, there
was no handling for updates to an existing unnumbered peer (requiring a
delete/add workflow).
Fix lookup that was using PeerId::Ip instead of PeerId::Interface.
Ensure that interface removal empties the interface scope map even if
the interface isn't a registered interface (e.g. if there was a partial
failure during an earlier call).
Fixes a bug where multiple peers with the same IP have their routes
tracked as coming from the same source. This is really only exposed
when using link-local addrs (BGP unnumbered) since that's the only time
the same IP can be found on multiple peers.

Scenario:
- R2 has BGP unnumbered peers R1 / R3
- R1/R3 both use the same link-local IP, e.g. fe80::1
- R2 learns 3fff:10::/64 from R1, 3fff:30::/64 from R3
- R2 stores fe80::1 in BgpPathProperties.peer field for both R1/R3
- R1 shuts down the BGP session
- R2 cleans up all BGP routes matching BgpPathProperties.peer == fe80::1
- 3fff:20::/64 is correctly removed from the RIB
- 3fff:30::/64 is erroneously removed from the RIB <-- wrong!

test_three_router_chain_unnumbered has been updated to exercise this
exact scenario - which failed prior to the IpAddr -> PeerId type update.

Also updates API to use Path where peer is PeerId instead of IpAddr.
Fixes CLI commands that only accepted IP addresses, preventing
operations on unnumbered BGP peers (identified by interface name).

Changes:
- Clear command: addr (IpAddr) → peer (String)
  - Added routing logic to detect peer type via parse_peer_type()
  - Routes to clear_neighbor_v2() for numbered peers
  - Routes to clear_unnumbered_neighbor() for unnumbered peers

- FSM history: peer (Option<IpAddr>) → peer (Option<String>)
  - Parses peer string to PeerId enum before API call
  - Updated get_fsm_history() handler to convert types

- Message history: peer (IpAddr) → peer (String)
  - Parses peer string to PeerId enum before API call
  - Updated get_message_history() handler to convert types

- Removed dead code: ExportPolicy struct (never used)

All commands now follow the pattern established by neighbor read/
delete/create commands, which already supported both peer types.

Users can now specify peers as either IP addresses (numbered) or
interface names (unnumbered) consistently across all CLI commands.
@rcgoodfellow rcgoodfellow linked an issue Jan 30, 2026 that may be closed by this pull request
2 tasks
@taspelund
Copy link
Contributor Author

CI failures for falcon job seem to be due to https://www.illumos.org/issues/17853
It would seem that if our instance of falcon is the only one running, then cleanup of the VNICs would trigger this bug

Adds a dummy0 vnic that persists beyond the shutdown of the falcon
topology. This ensures there is no deletion of a final vnic so there's
no panic causing the test runner to fail.
@taspelund
Copy link
Contributor Author

I've added a workaround into falcon-lab.sh to avoid triggering a panic on the buildomat runner. This should avoid test failures due to the underlying OS panicking.

internet-diglett and others added 15 commits February 4, 2026 22:00
Updates API to no longer return an error when configuring BGP unnumbered
on an interface that doesn't yet exist on the system. This is needed for
early networking, as the BGP config and uplinks are created in parallel
and there is no guarantee that the uplinks will get setup first.

In order to support this, interface existence is ensured before
completing new connections -- both inbound and outbound.
For inbound connections, the Dispatcher queries the unnumbered manager
before routing the new connection to the interface-based FSM.
For outbound connections, the FSM queries the unnumbered manager before
resolving the peer's socket addr that we connect() to.

A new test is added that runs a BGP FSM through the lifecycle of an
unnumbered interface, ensuring the FSM stays in the expected states
based on the status of an interface (missing, added w/ NDP neighbor and
removed). This supplements the existing tests that exercise the NDP
neighbor lifecycle, together covering the possible state changes for
an unnumbered interface and its NDP discovery.
When receiving BgpNexthop::Ipv6Double, choose the appropriate address
based on session type:
- Use link-local address for unnumbered peer
- Use global address for numbered peer

Also update next-hop handling for traditional NLRI encoding.
When we receive IPv4 Unicast routes encoded in the traditional NLRI
field, check for the NEXT_HOP attribute (Ipv4Addr) instead of grabbing a
BgpNexthop that must be destructured (and also wouldn't make sense as an
IPv6 value).
Adds update_unnumbered_session() wrapper that properly uses
PeerIp::Interface to do the peer lookup. Fixes a lookup failure that
occurred when the previous code tried to do a lookup by IP.
Unify the numbered/unnumbered neighbor creation flow by having Router
own the methods to do so, calling into the ndp manager simply for
interface registration. This flow now matches how it works for numbered
peers.
Adds a monitoring thread for detecting system interface addition and
removal on an ongoing basis. This is required above and beyond the
initial changes to the Dispatcher and initiate_connection() because the
FSM triggers only happen as a result of new connections. We cannot
depend on inbound connections if we aren't able to start the NDP threads
because we can't expect peers to try to connect to us if we haven't sent
an RA yet, which means that passive connections could sit indefinitely
without a timer-based mechanism. Currently the only OS level interface
query that we have is not interface-specific, rather it collects all
interfaces known to the system. If we're going to be querying the OS for
all interfaces all the time, we may as well centralize that logic into
the NDP manager and reduce the surface area in the FSM that needs to be
aware of the system state.

This also updates the mock impl of the unnumbered manager to represent
a third state: ndp initialization. The prior two-state implementation
could not represent a scenario where an interface is configured and
exists on the system, but does not have ndp initialized. This exact
scenario is what we'd encounter without a monitor thread:
1) unnumbered interface configured
2) ndp initialization skipped (interface doesn't exist on system)
3) interface created
4) no router advertisements sent to trigger inbound connections

These updates to the mock impl are now used in the unnumbered interface
lifecycle tests, which would have caught the issue and determined that
the prior work was insufficient for addressing the problem.
Replace hardcoded wait durations in unnumbered BGP tests with
constants derived from the test's BGP timer configuration. This
ensures verification loops wait long enough to be meaningful while
avoiding excessive test duration.

Add module-level constants:
- CONNECT_RETRY_VERIFICATION: 3s (3 connect_retry cycles)
- ESTABLISHED_VERIFICATION: 8s (hold_time + margin)

Test timing improvements:
- test_unnumbered_peer_expiry_and_rediscovery: 31s -> 9s
- test_unnumbered_interface_lifecycle: 49s -> 18s
Add visibility into NDP manager internals including:
- Monitor thread state (running/stopped)
- Pending interfaces (configured but not on system)
- Active interfaces with rx/tx thread state

New CLI structure:
  mgadm ndp status manager <asn>
  mgadm ndp status interfaces <asn>
  mgadm ndp status interface <asn> <name>

New API endpoint: GET /ndp/manager
test_same_linklocal_multiple_interfaces was setting Router1's
discovered peer to fe80::1 (its own address), causing sessions to
attempt self-connections. Use fe80::2 (Router2) instead.

Topology:
        eth0 (scope 2)                    eth0 (scope 2)
       ┌─────────────┐                   ┌─────────────┐
       │  fe80::1%2  │◄─────────────────►│  fe80::2%2  │
       │             │      link 0       │             │
       │   Router1   │                   │   Router2   │
       │             │      link 1       │             │
       │  fe80::1%3  │◄─────────────────►│  fe80::2%3  │
       └─────────────┘                   └─────────────┘
        eth1 (scope 3)                    eth1 (scope 3)

NDP discovery (what each router sees as its peer):
  R1's view:  eth0 -> fe80::2%2,  eth1 -> fe80::2%3  (R2)
  R2's view:  eth0 -> fe80::1%2,  eth1 -> fe80::1%3  (R1)

The test still validates scope_id differentiation for the same IP
discovered on multiple interfaces.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

bgp Border Gateway Protocol customer For any bug reports or feature requests tied to customer requests mgd Maghemite daemon needs testing rust Pull requests that update rust code

Projects

None yet

Development

Successfully merging this pull request may close these issues.

bgp: IPv6 support

3 participants