BGP unnumbered by rcgoodfellow · Pull Request #598 · oxidecomputer/maghemite

rcgoodfellow · 2026-01-02T09:08:56Z

This PR adds support for BGP unnumbered.

Depends on

The Main Event

Unnumbered peers are added as a new type of peer with their own set of API functions. When an unnumbered peer is added it goes into the pending state. The new peer request is handed over to an unnumbered peer manager where IPv6/NDP router advertisements (RA) are used to discover the address of the peer over the interface provided. When a router is discovered on the interface the pending peer then turns into a real peer. We use the IPv6 link-local address discovered in the RA and create a BGP session using that address. From here the extant BGP machinery works in the same way it currently does.

Neighbor discovery is a one-shot process when the unnumbered peer is created. Once we determine the peer address through an RA, that is what we use. If the peer address changes, the session needs to be deleted and recreated so discovery will happen again. No attempt is made to track neighbor state beyond initial session establishment. This may be revisited in a future release. One shot semantics has the benefit of being simple and stable. I don't believe we want to be overly sensitive to NDP RA dynamics as that would undercut the intent of things like BGP hold time and TCP connection tracking. Having a persistent monitor of NDP dynamics will need to be designed carefully.

Unnumbered support is designed to work in a point to point setting. We do not currently handle situations where there are multiple reachable routers reachable over a broadcast domain that the unnumbered interface is connected to. We simply attempt to establish a session with the first peer we find. This is how things have generally worked for unnumbered in my experience. We could develop something more robust, but this becomes a rather complex state machine where we have to try to establish sessions with every peer we discover. This could lead to doomed BGP sessions iterating through a state machine indefinitely.

Testing

This PR also adds a new testing substrate in the falcon-lab folder. BGP unnumbered requires multicast ICMPv6/NDP to work. This means we need a testing environment that routers connected over a broadcast domain, and some other unnumbered implementation to peer with to ensure we are not deluding ourselves with self-peering. Using illumos zones would have been nice here, but none of the BGP implementations I'm aware of on illumos besides mgd fully support unnumbered. FRR on the other hand has great support for unnumbered. So peering with a linux-based router puts us in falcon territory.

Another really important thing to test for unnumbered is mg-lower interaction. We need test coverage on nexthop interface mapping for data plane route installation. The interop test environment is currently upper-half only. Due to license nonsense, the interop topology machinery is also stuck in purgatory. So in this PR I decided to add a new falcon-based testing substrate that only uses unencumbered peers and the oxide nodes are softnpu vms so we have a full lower half with dendrite, P4 and the whole 9 yards. This also means the testing machinery can actually live in this repository which is a nice win. The best way to get acquainted with this new testing substrate is starting by reading through the first test.

falcon-lab/src/util.rs

falcon-lab/src/test.rs

taspelund · 2026-01-12T15:10:46Z

ndp/src/packet.rs

-            lifetime: 0, //indicates this is not a default router
+            // XXX arista routers will only peer with neighbors that have a
+            // non-zero lifetime. This is extremely bad behavior as a non-zero
+            // lifetime indicates the router is advertising itself as a default
+            // router. We need to plumb an arista-workaround option for this.
+            lifetime: 1800,
+            //lifetime: 0, //indicates this is not a default router


Are we wanting to add a config option for this in this PR? if we address it in a later PR, we may have to introduce a new dropshot API version to accommodate it

This is still a TODO for this PR. We should talk live about what we want to do.

Going to plumb a user accessible option act-as-a-default-ipv6-router that takes the lifetime as a parameter.

This is now done.

.github/buildomat/jobs/build.sh

.github/buildomat/jobs/falcon-lab.sh

taspelund · 2026-01-12T15:19:53Z

bgp/src/messages.rs

+                    // Default to supporting v4 over v6 encoding.
+                    Capability::ExtendedNextHopEncoding {
+                        elements: vec![ExtendedNexthopElement {
+                            afi: Afi::Ipv4.into(),
+                            safi: u8::from(Safi::Unicast).into(),
+                            nh_afi: Afi::Ipv6.into(),
+                        }],
+                    },
+                ],


I don't think we can use this as a default in Open messages for all peers.
My understanding of RFC 8950 is that we should only advertise this capability if we intend to advertise IPv4 Unicast NLRI with IPv6 next-hops... which is not something we can assume all the time, especially for preexisting numbered IPv4 peers.

I'm not immediately seeing how advertising this capability would be harmful even when not used, but happy to talk through the issues this could cause.

I don't recall if RFC 8950 says that ENHE advertises an additional next-hop AFI for the session or if ENHE advertises the only next-hop AFI for the session, but my experience with other routing stacks is that they assume all routes will use the ENHE next-hop AFI when that cap has been negotiated.

My concern is that unconditionally advertising ENHE would mean that we're allowing the peer to dictate which next-hop AFI we should use, rather than explicitly disallowing it when we don't want it.
e.g.
We configure a numbered peer over v4 and want to continue using v4 nexthops. If we enable ENHE unconditionally, it's now up to the peer to dictate whether we should send v4 or v6 next-hops depending on whether they advertise ENHE.

For this PR we'll advertise extended nexthop for unnumbered sessions only and revisit broader support later.

This is now done.

bgp/src/messages.rs

taspelund · 2026-01-12T15:23:32Z

bgp/src/messages.rs

+    ExtendedNextHopEncoding {
+        //XXX trying to avoid a version bump on 86 billion data structures
+        // right now.
+        #[schemars(skip)]
+        elements: Vec<ExtendedNexthopElement>,
+    },


If you want help with the struct version updates, I'm happy to do so. I think it would be best to try and get this into the API when we release ENHE so it can be visible to consumers of the API (e.g. when they poll the neighbor status to see what capabilities have been exchanged).

Yeah, this is still a TODO on this PR. Now that we have the API versioning machinery in place, I think we need to think a bit about having raw BGP messages in the API definitions. In the original message history APIs I had these data structures as opaque objects to avoid all the API pain as we fill in capabilities.

bgp/src/messages.rs

bgp/src/params.rs

falcon-lab/src/frr.rs

falcon-lab/src/illumos.rs

ndp/src/manager.rs

taspelund · 2026-01-12T21:04:51Z

Another couple items:

There aren't any cargo tests for any of the rust code. I'd really like to see some unit tests for the unnumbered manager, and possibly some integration tests (although, it understandably seems difficult to integrate unnumbered into bgp/src/test.rs)
I'd like to see some docs update for this. I've made an effort to get BGP documented (docs/bgp-architecture.md), so it would be nice to have the BGP-related changes added to that doc + to have the unnumbered manager documented too (perhaps in its own markdown file), even if it's just LLM-generated docs that have been quickly reviewed. This isn't a blocker, but a nice to have.

taspelund · 2026-01-13T16:28:13Z

Re-running proptest -- it failed w/ the tokio "error 0" signature

rcgoodfellow · 2026-01-14T20:55:04Z

This logic needs to be updated

maghemite/bgp/src/session.rs

Lines 7542 to 7561 in 365caf9

    
           RouteUpdate::V4(RouteUpdate4::Announce(nlri)) => { 
        
               let nh4 = match self.derive_nexthop(Afi::Ipv4, pc)? { 
        
                   BgpNexthop::Ipv4(addr) => addr, 
        
                   _ => { 
        
                       return Err(Error::InvalidAddress( 
        
                           "IPv4 routes require IPv4 next-hop".into(), 
        
                       )); 
        
                   } 
        
               }; 
        
               let mut path_attributes = self.router.base_attributes(); 
        
               path_attributes.push(PathAttributeValue::NextHop(nh4).into()); 
        
               UpdateMessage { 
        
                   withdrawn: vec![], 
        
                   path_attributes, 
        
                   nlri, 
        
                   ..Default::default() 
        
               } 
        
           }

taspelund · 2026-01-23T02:10:49Z

One other small item I noticed: .github/buildomat/jobs/falcon-lab.sh doesn't have execute permissions tracked by git.
Should be easy to add a commit after running git update-index --chmod=+x .github/buildomat/jobs/falcon-lab.sh so there's no chance of a permissions issue later.

Moves unnumbered peers from being a one-shot operation that creates an FSM when NDP discovery is successful, to having a persistent FSM that is created/deleted when the unnumbered peer is created/deleted (just like numbered peers). Since the FSM is now always present and there's no more pending peers, the pending unnumbered members concept has been removed. The FSM is updated to query the UnnumberedManager for an NDP neighbor as part of initiate_connection(). This allows the FSM to manage its own state transitions based on FSM events rather than allowing NDP events to drive the state machine. The outbound connection attempt logic is unchanged: connection attempts are still asynchronous and failures are transparent to the FSM, allowing it to transition back to Idle upon a ConnectRetryTimerExpires event if a successful connection event doesn't arrive first. However, now an unknown NDP neighbor at the time of the connection attempt is treated as an async failure (no outbound connection success will be reported) and the FSM will cycle back to Idle when ConnectRetryTimerExpires. Similarly, if the FSM receives an inbound connection attempt and there is no known NDP neighbor, the connection is rejected. The unnumbered manager receives interface registration/unregistration events when unnumbered peers are created/deleted. While the unnumbered peer exists, the unnumbered manager runs the tx/rx loops to maintain a single-entry NDP cache holding the discovered peer. The unnumbered manager also maintains a map of scope_id to interface name (learned from the OS). The NDP cache is queried by the FSM during connection attempts, and the scope_id to interface map is used by the connection Dispatcher. The Dispatcher has been updated to query the unnumbered manager to map a scope_id (from the SocketAddr it accept()'s) back to an interface name which allows it to find the appropriate SessionRunner/FSM. Source IP validation for the inbound connection is done by the FSM: the new connection's Source IP is checked against the NDP peer IP if there is no ongoing connection, else it will check against the ongoing connection's Peer IP (If it doesn't match, reject the new connection. If it does match, consider it a collision.). In order to facilitate all this session mapping, the addr_to_session data structures have been updated (now called peer_to_session) to use a new PeerId enum as the key instead of an IpAddr. PeerId has two variants: Ip that holds an IpAddr (matching the old behavior) and Interface which holds a String of the interface name. This allows the session lookups to work for unnumbered peers by interface name without needing to know anything about the dynamic NDP peer state. Local BGP integration tests are added to exercise the unnumbered code paths for session establishment, route exchange, and NDP state changes. The NDP state changes are manually manipulated using a new trait called UnnumberedManager and its test impl UnnumberedManagerMock. The existing code is accessed via the prod impl UnnumberedManagerNdp.

Add test coverage for unnumbered BGP sessions, including multi-interface support, NDP lifecycle testing, and route exchange validation. Channel based tests rely on each bind() call being used to register ownership of a single SocketAddr into the NET HashMap. This means that in order to have more than 1 peer in a channel based test, you must call bind() multiple times. However, the Dispatcher is responsible for bind() as well as for spawning a Listener for inbound connections. So to make this work, the new test_three_router_chain_unnumbered() creates a Dispatcher per unnumbered interface. Since it is a SocketAddr being registered in NET, we can re-use the same link-local address in multiple places simply by disambiguating them with a unique scope_id. This also allows us to test having two peers with the same link-local address learned via two different links. New tests verify: - Sessions persist through NDP updates and expiry - Sessions reconnect with current NDP neighbor after reset - scope_id isolates same link-local IP on different interfaces - IPv4/IPv6 routes exchange with correct link-local nexthops Fixes existing unnumbered tests by correcting scope_id handling.

Get rid of the unnumbered nexthop hash map in ndp manager in favor of encoding BGP unnumbered nexthop interfaces into the Path itself. This ensures that BGP is the source of truth for the next-hop and interface used for routes from a given session, rather than using the NDP cache which can update without BGP's knowledge. This also has the added benefit of allowing BGP to disambiguate the same link-local address across multiple interfaces -- each unnumbered peer knows what its interface name is, and thus can supply its name alongside the link-local next-hop in order to support the same link-local address via multiple interfaces at the same time. This also updates the mgd API to support both numbered and unnumbered peers via unified endpoints. Unfortunately, dropshot cannot represent a non-scalar type (like enum PeerId) as a Query. So the API has been moved to representing the peer as a String, which will be parsed as an IpAddr if possible with a fallback of accepting the String as an interface name.

Signed-off-by: Trey Aspelund <trey@oxidecomputer.com>

Moving common neighbor args into NeighborCommon resulted in some of the CLI args appearing in a different order, breaking scripts elsewhere (e.g. init.sh for interop topology). This unifies numbered/unnumbered neighbors into a single CLI which is then split into Neighbor + UnnumberedNeighbor API types for the appropriate call.

Expose NDP interfaces and their neighbor cache via the API. This allows an operator to see what the current config/state is for a managed unnumbered interface. Renames get_neighbors_unified() to get_neighbors_v4() aligning with the versioned method names.

Fixes issue where update API would return "no such interface" errors for unnumbered neighbors. Allows API handlers to determine if there is already an active session for unnumbered peers. Without this, there was no handling for updates to an existing unnumbered peer (requiring a delete/add workflow).

Fix lookup that was using PeerId::Ip instead of PeerId::Interface. Ensure that interface removal empties the interface scope map even if the interface isn't a registered interface (e.g. if there was a partial failure during an earlier call).

Fixes a bug where multiple peers with the same IP have their routes tracked as coming from the same source. This is really only exposed when using link-local addrs (BGP unnumbered) since that's the only time the same IP can be found on multiple peers. Scenario: - R2 has BGP unnumbered peers R1 / R3 - R1/R3 both use the same link-local IP, e.g. fe80::1 - R2 learns 3fff:10::/64 from R1, 3fff:30::/64 from R3 - R2 stores fe80::1 in BgpPathProperties.peer field for both R1/R3 - R1 shuts down the BGP session - R2 cleans up all BGP routes matching BgpPathProperties.peer == fe80::1 - 3fff:20::/64 is correctly removed from the RIB - 3fff:30::/64 is erroneously removed from the RIB <-- wrong! test_three_router_chain_unnumbered has been updated to exercise this exact scenario - which failed prior to the IpAddr -> PeerId type update. Also updates API to use Path where peer is PeerId instead of IpAddr.

Fixes CLI commands that only accepted IP addresses, preventing operations on unnumbered BGP peers (identified by interface name). Changes: - Clear command: addr (IpAddr) → peer (String) - Added routing logic to detect peer type via parse_peer_type() - Routes to clear_neighbor_v2() for numbered peers - Routes to clear_unnumbered_neighbor() for unnumbered peers - FSM history: peer (Option<IpAddr>) → peer (Option<String>) - Parses peer string to PeerId enum before API call - Updated get_fsm_history() handler to convert types - Message history: peer (IpAddr) → peer (String) - Parses peer string to PeerId enum before API call - Updated get_message_history() handler to convert types - Removed dead code: ExportPolicy struct (never used) All commands now follow the pattern established by neighbor read/ delete/create commands, which already supported both peer types. Users can now specify peers as either IP addresses (numbered) or interface names (unnumbered) consistently across all CLI commands.

rcgoodfellow added this to the 18 milestone Jan 2, 2026

rcgoodfellow force-pushed the ry/mgd-v6-router-disco branch from 3b513d3 to ad340c9 Compare January 2, 2026 16:31

rcgoodfellow changed the title ~~add ndp router manager~~ BGP unnumbered Jan 3, 2026

rcgoodfellow force-pushed the ry/mgd-v6-router-disco branch from 18b39ed to 54b29ba Compare January 3, 2026 08:35

rcgoodfellow mentioned this pull request Jan 3, 2026

ipv6 e2e integration oxidecomputer/omicron#9570

Open

17 tasks

rcgoodfellow force-pushed the ry/mgd-v6-router-disco branch from fa6bc23 to c423402 Compare January 4, 2026 05:07

rcgoodfellow marked this pull request as ready for review January 5, 2026 22:09

rcgoodfellow force-pushed the ry/mgd-v6-router-disco branch 5 times, most recently from f96b4d7 to fbbb83f Compare January 8, 2026 09:14

rcgoodfellow changed the base branch from main to trey/mp-bgp January 8, 2026 09:14

rcgoodfellow force-pushed the ry/mgd-v6-router-disco branch from b7b85c5 to 0d429d2 Compare January 9, 2026 01:47

rcgoodfellow force-pushed the trey/mp-bgp branch from 1f0a732 to f13f187 Compare January 10, 2026 01:26

rcgoodfellow force-pushed the ry/mgd-v6-router-disco branch 3 times, most recently from 0527ed9 to 80e651a Compare January 12, 2026 08:07