Skip to content

BGP unnumbered#598

Merged
taspelund merged 17 commits intotrey/mp-bgpfrom
ry/mgd-v6-router-disco
Jan 29, 2026
Merged

BGP unnumbered#598
taspelund merged 17 commits intotrey/mp-bgpfrom
ry/mgd-v6-router-disco

Conversation

@rcgoodfellow
Copy link
Collaborator

@rcgoodfellow rcgoodfellow commented Jan 2, 2026

This PR adds support for BGP unnumbered.

Depends on

The Main Event

Unnumbered peers are added as a new type of peer with their own set of API functions. When an unnumbered peer is added it goes into the pending state. The new peer request is handed over to an unnumbered peer manager where IPv6/NDP router advertisements (RA) are used to discover the address of the peer over the interface provided. When a router is discovered on the interface the pending peer then turns into a real peer. We use the IPv6 link-local address discovered in the RA and create a BGP session using that address. From here the extant BGP machinery works in the same way it currently does.

Neighbor discovery is a one-shot process when the unnumbered peer is created. Once we determine the peer address through an RA, that is what we use. If the peer address changes, the session needs to be deleted and recreated so discovery will happen again. No attempt is made to track neighbor state beyond initial session establishment. This may be revisited in a future release. One shot semantics has the benefit of being simple and stable. I don't believe we want to be overly sensitive to NDP RA dynamics as that would undercut the intent of things like BGP hold time and TCP connection tracking. Having a persistent monitor of NDP dynamics will need to be designed carefully.

Unnumbered support is designed to work in a point to point setting. We do not currently handle situations where there are multiple reachable routers reachable over a broadcast domain that the unnumbered interface is connected to. We simply attempt to establish a session with the first peer we find. This is how things have generally worked for unnumbered in my experience. We could develop something more robust, but this becomes a rather complex state machine where we have to try to establish sessions with every peer we discover. This could lead to doomed BGP sessions iterating through a state machine indefinitely.

Testing

This PR also adds a new testing substrate in the falcon-lab folder. BGP unnumbered requires multicast ICMPv6/NDP to work. This means we need a testing environment that routers connected over a broadcast domain, and some other unnumbered implementation to peer with to ensure we are not deluding ourselves with self-peering. Using illumos zones would have been nice here, but none of the BGP implementations I'm aware of on illumos besides mgd fully support unnumbered. FRR on the other hand has great support for unnumbered. So peering with a linux-based router puts us in falcon territory.

Another really important thing to test for unnumbered is mg-lower interaction. We need test coverage on nexthop interface mapping for data plane route installation. The interop test environment is currently upper-half only. Due to license nonsense, the interop topology machinery is also stuck in purgatory. So in this PR I decided to add a new falcon-based testing substrate that only uses unencumbered peers and the oxide nodes are softnpu vms so we have a full lower half with dendrite, P4 and the whole 9 yards. This also means the testing machinery can actually live in this repository which is a nice win. The best way to get acquainted with this new testing substrate is starting by reading through the first test.

@rcgoodfellow rcgoodfellow added this to the 18 milestone Jan 2, 2026
@rcgoodfellow rcgoodfellow force-pushed the ry/mgd-v6-router-disco branch from 3b513d3 to ad340c9 Compare January 2, 2026 16:31
@rcgoodfellow rcgoodfellow changed the title add ndp router manager BGP unnumbered Jan 3, 2026
@rcgoodfellow rcgoodfellow force-pushed the ry/mgd-v6-router-disco branch from 18b39ed to 54b29ba Compare January 3, 2026 08:35
@rcgoodfellow rcgoodfellow force-pushed the ry/mgd-v6-router-disco branch from fa6bc23 to c423402 Compare January 4, 2026 05:07
@rcgoodfellow rcgoodfellow marked this pull request as ready for review January 5, 2026 22:09
@rcgoodfellow rcgoodfellow force-pushed the ry/mgd-v6-router-disco branch 5 times, most recently from f96b4d7 to fbbb83f Compare January 8, 2026 09:14
@rcgoodfellow rcgoodfellow changed the base branch from main to trey/mp-bgp January 8, 2026 09:14
@rcgoodfellow rcgoodfellow force-pushed the ry/mgd-v6-router-disco branch from b7b85c5 to 0d429d2 Compare January 9, 2026 01:47
@rcgoodfellow rcgoodfellow force-pushed the ry/mgd-v6-router-disco branch 3 times, most recently from 0527ed9 to 80e651a Compare January 12, 2026 08:07
Comment on lines 77 to 82
lifetime: 0, //indicates this is not a default router
// XXX arista routers will only peer with neighbors that have a
// non-zero lifetime. This is extremely bad behavior as a non-zero
// lifetime indicates the router is advertising itself as a default
// router. We need to plumb an arista-workaround option for this.
lifetime: 1800,
//lifetime: 0, //indicates this is not a default router
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Are we wanting to add a config option for this in this PR? if we address it in a later PR, we may have to introduce a new dropshot API version to accommodate it

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is still a TODO for this PR. We should talk live about what we want to do.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Going to plumb a user accessible option act-as-a-default-ipv6-router that takes the lifetime as a parameter.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is now done.

Comment on lines 571 to 579
// Default to supporting v4 over v6 encoding.
Capability::ExtendedNextHopEncoding {
elements: vec![ExtendedNexthopElement {
afi: Afi::Ipv4.into(),
safi: u8::from(Safi::Unicast).into(),
nh_afi: Afi::Ipv6.into(),
}],
},
],
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think we can use this as a default in Open messages for all peers.
My understanding of RFC 8950 is that we should only advertise this capability if we intend to advertise IPv4 Unicast NLRI with IPv6 next-hops... which is not something we can assume all the time, especially for preexisting numbered IPv4 peers.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not immediately seeing how advertising this capability would be harmful even when not used, but happy to talk through the issues this could cause.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't recall if RFC 8950 says that ENHE advertises an additional next-hop AFI for the session or if ENHE advertises the only next-hop AFI for the session, but my experience with other routing stacks is that they assume all routes will use the ENHE next-hop AFI when that cap has been negotiated.

My concern is that unconditionally advertising ENHE would mean that we're allowing the peer to dictate which next-hop AFI we should use, rather than explicitly disallowing it when we don't want it.
e.g.
We configure a numbered peer over v4 and want to continue using v4 nexthops. If we enable ENHE unconditionally, it's now up to the peer to dictate whether we should send v4 or v6 next-hops depending on whether they advertise ENHE.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For this PR we'll advertise extended nexthop for unnumbered sessions only and revisit broader support later.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is now done.

Comment on lines +3977 to +4223
ExtendedNextHopEncoding {
//XXX trying to avoid a version bump on 86 billion data structures
// right now.
#[schemars(skip)]
elements: Vec<ExtendedNexthopElement>,
},
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If you want help with the struct version updates, I'm happy to do so. I think it would be best to try and get this into the API when we release ENHE so it can be visible to consumers of the API (e.g. when they poll the neighbor status to see what capabilities have been exchanged).

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, this is still a TODO on this PR. Now that we have the API versioning machinery in place, I think we need to think a bit about having raw BGP messages in the API definitions. In the original message history APIs I had these data structures as opaque objects to avoid all the API pain as we fill in capabilities.

@taspelund
Copy link
Contributor

Another couple items:

  1. There aren't any cargo tests for any of the rust code. I'd really like to see some unit tests for the unnumbered manager, and possibly some integration tests (although, it understandably seems difficult to integrate unnumbered into bgp/src/test.rs)
  2. I'd like to see some docs update for this. I've made an effort to get BGP documented (docs/bgp-architecture.md), so it would be nice to have the BGP-related changes added to that doc + to have the unnumbered manager documented too (perhaps in its own markdown file), even if it's just LLM-generated docs that have been quickly reviewed. This isn't a blocker, but a nice to have.

@taspelund
Copy link
Contributor

Re-running proptest -- it failed w/ the tokio "error 0" signature

@rcgoodfellow
Copy link
Collaborator Author

This logic needs to be updated

maghemite/bgp/src/session.rs

Lines 7542 to 7561 in 365caf9

RouteUpdate::V4(RouteUpdate4::Announce(nlri)) => {
let nh4 = match self.derive_nexthop(Afi::Ipv4, pc)? {
BgpNexthop::Ipv4(addr) => addr,
_ => {
return Err(Error::InvalidAddress(
"IPv4 routes require IPv4 next-hop".into(),
));
}
};
let mut path_attributes = self.router.base_attributes();
path_attributes.push(PathAttributeValue::NextHop(nh4).into());
UpdateMessage {
withdrawn: vec![],
path_attributes,
nlri,
..Default::default()
}
}

@taspelund
Copy link
Contributor

One other small item I noticed: .github/buildomat/jobs/falcon-lab.sh doesn't have execute permissions tracked by git.
Should be easy to add a commit after running git update-index --chmod=+x .github/buildomat/jobs/falcon-lab.sh so there's no chance of a permissions issue later.

@rcgoodfellow rcgoodfellow force-pushed the ry/mgd-v6-router-disco branch from 6232580 to 9bfb191 Compare January 26, 2026 23:50
@taspelund taspelund mentioned this pull request Jan 27, 2026
Moves unnumbered peers from being a one-shot operation that creates an
FSM when NDP discovery is successful, to having a persistent FSM that is
created/deleted when the unnumbered peer is created/deleted (just like
numbered peers). Since the FSM is now always present and there's no
more pending peers, the pending unnumbered members concept has been
removed.

The FSM is updated to query the UnnumberedManager for an NDP neighbor
as part of initiate_connection(). This allows the FSM to manage its own
state transitions based on FSM events rather than allowing NDP events
to drive the state machine. The outbound connection attempt logic is
unchanged: connection attempts are still asynchronous and failures
are transparent to the FSM, allowing it to transition back to Idle
upon a ConnectRetryTimerExpires event if a successful connection event
doesn't arrive first. However, now an unknown NDP neighbor at the time
of the connection attempt is treated as an async failure (no outbound
connection success will be reported) and the FSM will cycle back to Idle
when ConnectRetryTimerExpires. Similarly, if the FSM receives an inbound
connection attempt and there is no known NDP neighbor, the connection
is rejected.

The unnumbered manager receives interface registration/unregistration
events when unnumbered peers are created/deleted. While the unnumbered
peer exists, the unnumbered manager runs the tx/rx loops to maintain
a single-entry NDP cache holding the discovered peer. The unnumbered
manager also maintains a map of scope_id to interface name (learned from
the OS). The NDP cache is queried by the FSM during connection attempts,
and the scope_id to interface map is used by the connection Dispatcher.

The Dispatcher has been updated to query the unnumbered manager to map
a scope_id (from the SocketAddr it accept()'s) back to an interface
name which allows it to find the appropriate SessionRunner/FSM. Source
IP validation for the inbound connection is done by the FSM: the new
connection's Source IP is checked against the NDP peer IP if there is no
ongoing connection, else it will check against the ongoing connection's
Peer IP (If it doesn't match, reject the new connection. If it does
match, consider it a collision.).

In order to facilitate all this session mapping, the addr_to_session
data structures have been updated (now called peer_to_session) to
use a new PeerId enum as the key instead of an IpAddr. PeerId has
two variants: Ip that holds an IpAddr (matching the old behavior)
and Interface which holds a String of the interface name. This allows
the session lookups to work for unnumbered peers by interface name
without needing to know anything about the dynamic NDP peer state.

Local BGP integration tests are added to exercise the unnumbered code
paths for session establishment, route exchange, and NDP state changes.
The NDP state changes are manually manipulated using a new trait called
UnnumberedManager and its test impl UnnumberedManagerMock. The existing
code is accessed via the prod impl UnnumberedManagerNdp.
Add  test coverage for unnumbered BGP sessions, including
multi-interface support, NDP lifecycle testing, and route exchange
validation.

Channel based tests rely on each bind() call being used to register
ownership of a single SocketAddr into the NET HashMap. This means that
in order to have more than 1 peer in a channel based test, you must call
bind() multiple times. However, the Dispatcher is responsible for bind()
as well as for spawning a Listener for inbound connections. So to make
this work, the new test_three_router_chain_unnumbered() creates a
Dispatcher per unnumbered interface. Since it is a SocketAddr being
registered in NET, we can re-use the same link-local address in multiple
places simply by disambiguating them with a unique scope_id. This also
allows us to test having two peers with the same link-local address
learned via two different links.

New tests verify:
- Sessions persist through NDP updates and expiry
- Sessions reconnect with current NDP neighbor after reset
- scope_id isolates same link-local IP on different interfaces
- IPv4/IPv6 routes exchange with correct link-local nexthops

Fixes existing unnumbered tests by correcting scope_id handling.
Get rid of the unnumbered nexthop hash map in ndp manager in favor of
encoding BGP unnumbered nexthop interfaces into the Path itself. This
ensures that BGP is the source of truth for the next-hop and interface
used for routes from a given session, rather than using the NDP cache
which can update without BGP's knowledge. This also has the added
benefit of allowing BGP to disambiguate the same link-local address
across multiple interfaces -- each unnumbered peer knows what its
interface name is, and thus can supply its name alongside the link-local
next-hop in order to support the same link-local address via multiple
interfaces at the same time.

This also updates the mgd API to support both numbered and unnumbered
peers via unified endpoints. Unfortunately, dropshot cannot represent
a non-scalar type (like enum PeerId) as a Query. So the API has been
moved to representing the peer as a String, which will be parsed as an
IpAddr if possible with a fallback of accepting the String as an
interface name.
Signed-off-by: Trey Aspelund <trey@oxidecomputer.com>
Moving common neighbor args into NeighborCommon resulted in some of the
CLI args appearing in a different order, breaking scripts elsewhere
(e.g. init.sh for interop topology). This unifies numbered/unnumbered
neighbors into a single CLI which is then split into Neighbor +
UnnumberedNeighbor API types for the appropriate call.
Expose NDP interfaces and their neighbor cache via the API. This allows
an operator to see what the current config/state is for a managed
unnumbered interface.

Renames get_neighbors_unified() to get_neighbors_v4() aligning with the
versioned method names.
Fixes issue where update API would return "no such interface" errors
for unnumbered neighbors. Allows API handlers to determine if there
is already an active session for unnumbered peers. Without this, there
was no handling for updates to an existing unnumbered peer (requiring a
delete/add workflow).
Fix lookup that was using PeerId::Ip instead of PeerId::Interface.
Ensure that interface removal empties the interface scope map even if
the interface isn't a registered interface (e.g. if there was a partial
failure during an earlier call).
Fixes a bug where multiple peers with the same IP have their routes
tracked as coming from the same source. This is really only exposed
when using link-local addrs (BGP unnumbered) since that's the only time
the same IP can be found on multiple peers.

Scenario:
- R2 has BGP unnumbered peers R1 / R3
- R1/R3 both use the same link-local IP, e.g. fe80::1
- R2 learns 3fff:10::/64 from R1, 3fff:30::/64 from R3
- R2 stores fe80::1 in BgpPathProperties.peer field for both R1/R3
- R1 shuts down the BGP session
- R2 cleans up all BGP routes matching BgpPathProperties.peer == fe80::1
- 3fff:20::/64 is correctly removed from the RIB
- 3fff:30::/64 is erroneously removed from the RIB <-- wrong!

test_three_router_chain_unnumbered has been updated to exercise this
exact scenario - which failed prior to the IpAddr -> PeerId type update.

Also updates API to use Path where peer is PeerId instead of IpAddr.
Fixes CLI commands that only accepted IP addresses, preventing
operations on unnumbered BGP peers (identified by interface name).

Changes:
- Clear command: addr (IpAddr) → peer (String)
  - Added routing logic to detect peer type via parse_peer_type()
  - Routes to clear_neighbor_v2() for numbered peers
  - Routes to clear_unnumbered_neighbor() for unnumbered peers

- FSM history: peer (Option<IpAddr>) → peer (Option<String>)
  - Parses peer string to PeerId enum before API call
  - Updated get_fsm_history() handler to convert types

- Message history: peer (IpAddr) → peer (String)
  - Parses peer string to PeerId enum before API call
  - Updated get_message_history() handler to convert types

- Removed dead code: ExportPolicy struct (never used)

All commands now follow the pattern established by neighbor read/
delete/create commands, which already supported both peer types.

Users can now specify peers as either IP addresses (numbered) or
interface names (unnumbered) consistently across all CLI commands.
@taspelund taspelund merged commit 7846b13 into trey/mp-bgp Jan 29, 2026
15 checks passed
@taspelund taspelund deleted the ry/mgd-v6-router-disco branch January 29, 2026 19:24
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants