Conversation
3b513d3 to
ad340c9
Compare
18b39ed to
54b29ba
Compare
fa6bc23 to
c423402
Compare
f96b4d7 to
fbbb83f
Compare
b7b85c5 to
0d429d2
Compare
1f0a732 to
f13f187
Compare
0527ed9 to
80e651a
Compare
ndp/src/packet.rs
Outdated
| lifetime: 0, //indicates this is not a default router | ||
| // XXX arista routers will only peer with neighbors that have a | ||
| // non-zero lifetime. This is extremely bad behavior as a non-zero | ||
| // lifetime indicates the router is advertising itself as a default | ||
| // router. We need to plumb an arista-workaround option for this. | ||
| lifetime: 1800, | ||
| //lifetime: 0, //indicates this is not a default router |
There was a problem hiding this comment.
Are we wanting to add a config option for this in this PR? if we address it in a later PR, we may have to introduce a new dropshot API version to accommodate it
There was a problem hiding this comment.
This is still a TODO for this PR. We should talk live about what we want to do.
There was a problem hiding this comment.
Going to plumb a user accessible option act-as-a-default-ipv6-router that takes the lifetime as a parameter.
There was a problem hiding this comment.
This is now done.
bgp/src/messages.rs
Outdated
| // Default to supporting v4 over v6 encoding. | ||
| Capability::ExtendedNextHopEncoding { | ||
| elements: vec![ExtendedNexthopElement { | ||
| afi: Afi::Ipv4.into(), | ||
| safi: u8::from(Safi::Unicast).into(), | ||
| nh_afi: Afi::Ipv6.into(), | ||
| }], | ||
| }, | ||
| ], |
There was a problem hiding this comment.
I don't think we can use this as a default in Open messages for all peers.
My understanding of RFC 8950 is that we should only advertise this capability if we intend to advertise IPv4 Unicast NLRI with IPv6 next-hops... which is not something we can assume all the time, especially for preexisting numbered IPv4 peers.
There was a problem hiding this comment.
I'm not immediately seeing how advertising this capability would be harmful even when not used, but happy to talk through the issues this could cause.
There was a problem hiding this comment.
I don't recall if RFC 8950 says that ENHE advertises an additional next-hop AFI for the session or if ENHE advertises the only next-hop AFI for the session, but my experience with other routing stacks is that they assume all routes will use the ENHE next-hop AFI when that cap has been negotiated.
My concern is that unconditionally advertising ENHE would mean that we're allowing the peer to dictate which next-hop AFI we should use, rather than explicitly disallowing it when we don't want it.
e.g.
We configure a numbered peer over v4 and want to continue using v4 nexthops. If we enable ENHE unconditionally, it's now up to the peer to dictate whether we should send v4 or v6 next-hops depending on whether they advertise ENHE.
There was a problem hiding this comment.
For this PR we'll advertise extended nexthop for unnumbered sessions only and revisit broader support later.
There was a problem hiding this comment.
This is now done.
| ExtendedNextHopEncoding { | ||
| //XXX trying to avoid a version bump on 86 billion data structures | ||
| // right now. | ||
| #[schemars(skip)] | ||
| elements: Vec<ExtendedNexthopElement>, | ||
| }, |
There was a problem hiding this comment.
If you want help with the struct version updates, I'm happy to do so. I think it would be best to try and get this into the API when we release ENHE so it can be visible to consumers of the API (e.g. when they poll the neighbor status to see what capabilities have been exchanged).
There was a problem hiding this comment.
Yeah, this is still a TODO on this PR. Now that we have the API versioning machinery in place, I think we need to think a bit about having raw BGP messages in the API definitions. In the original message history APIs I had these data structures as opaque objects to avoid all the API pain as we fill in capabilities.
|
Another couple items:
|
|
Re-running proptest -- it failed w/ the tokio "error 0" signature |
|
This logic needs to be updated Lines 7542 to 7561 in 365caf9 |
7c5dd39 to
7f78e2b
Compare
|
One other small item I noticed: |
6232580 to
9bfb191
Compare
Moves unnumbered peers from being a one-shot operation that creates an FSM when NDP discovery is successful, to having a persistent FSM that is created/deleted when the unnumbered peer is created/deleted (just like numbered peers). Since the FSM is now always present and there's no more pending peers, the pending unnumbered members concept has been removed. The FSM is updated to query the UnnumberedManager for an NDP neighbor as part of initiate_connection(). This allows the FSM to manage its own state transitions based on FSM events rather than allowing NDP events to drive the state machine. The outbound connection attempt logic is unchanged: connection attempts are still asynchronous and failures are transparent to the FSM, allowing it to transition back to Idle upon a ConnectRetryTimerExpires event if a successful connection event doesn't arrive first. However, now an unknown NDP neighbor at the time of the connection attempt is treated as an async failure (no outbound connection success will be reported) and the FSM will cycle back to Idle when ConnectRetryTimerExpires. Similarly, if the FSM receives an inbound connection attempt and there is no known NDP neighbor, the connection is rejected. The unnumbered manager receives interface registration/unregistration events when unnumbered peers are created/deleted. While the unnumbered peer exists, the unnumbered manager runs the tx/rx loops to maintain a single-entry NDP cache holding the discovered peer. The unnumbered manager also maintains a map of scope_id to interface name (learned from the OS). The NDP cache is queried by the FSM during connection attempts, and the scope_id to interface map is used by the connection Dispatcher. The Dispatcher has been updated to query the unnumbered manager to map a scope_id (from the SocketAddr it accept()'s) back to an interface name which allows it to find the appropriate SessionRunner/FSM. Source IP validation for the inbound connection is done by the FSM: the new connection's Source IP is checked against the NDP peer IP if there is no ongoing connection, else it will check against the ongoing connection's Peer IP (If it doesn't match, reject the new connection. If it does match, consider it a collision.). In order to facilitate all this session mapping, the addr_to_session data structures have been updated (now called peer_to_session) to use a new PeerId enum as the key instead of an IpAddr. PeerId has two variants: Ip that holds an IpAddr (matching the old behavior) and Interface which holds a String of the interface name. This allows the session lookups to work for unnumbered peers by interface name without needing to know anything about the dynamic NDP peer state. Local BGP integration tests are added to exercise the unnumbered code paths for session establishment, route exchange, and NDP state changes. The NDP state changes are manually manipulated using a new trait called UnnumberedManager and its test impl UnnumberedManagerMock. The existing code is accessed via the prod impl UnnumberedManagerNdp.
Add test coverage for unnumbered BGP sessions, including multi-interface support, NDP lifecycle testing, and route exchange validation. Channel based tests rely on each bind() call being used to register ownership of a single SocketAddr into the NET HashMap. This means that in order to have more than 1 peer in a channel based test, you must call bind() multiple times. However, the Dispatcher is responsible for bind() as well as for spawning a Listener for inbound connections. So to make this work, the new test_three_router_chain_unnumbered() creates a Dispatcher per unnumbered interface. Since it is a SocketAddr being registered in NET, we can re-use the same link-local address in multiple places simply by disambiguating them with a unique scope_id. This also allows us to test having two peers with the same link-local address learned via two different links. New tests verify: - Sessions persist through NDP updates and expiry - Sessions reconnect with current NDP neighbor after reset - scope_id isolates same link-local IP on different interfaces - IPv4/IPv6 routes exchange with correct link-local nexthops Fixes existing unnumbered tests by correcting scope_id handling.
Get rid of the unnumbered nexthop hash map in ndp manager in favor of encoding BGP unnumbered nexthop interfaces into the Path itself. This ensures that BGP is the source of truth for the next-hop and interface used for routes from a given session, rather than using the NDP cache which can update without BGP's knowledge. This also has the added benefit of allowing BGP to disambiguate the same link-local address across multiple interfaces -- each unnumbered peer knows what its interface name is, and thus can supply its name alongside the link-local next-hop in order to support the same link-local address via multiple interfaces at the same time. This also updates the mgd API to support both numbered and unnumbered peers via unified endpoints. Unfortunately, dropshot cannot represent a non-scalar type (like enum PeerId) as a Query. So the API has been moved to representing the peer as a String, which will be parsed as an IpAddr if possible with a fallback of accepting the String as an interface name.
Signed-off-by: Trey Aspelund <trey@oxidecomputer.com>
Moving common neighbor args into NeighborCommon resulted in some of the CLI args appearing in a different order, breaking scripts elsewhere (e.g. init.sh for interop topology). This unifies numbered/unnumbered neighbors into a single CLI which is then split into Neighbor + UnnumberedNeighbor API types for the appropriate call.
Expose NDP interfaces and their neighbor cache via the API. This allows an operator to see what the current config/state is for a managed unnumbered interface. Renames get_neighbors_unified() to get_neighbors_v4() aligning with the versioned method names.
Fixes issue where update API would return "no such interface" errors for unnumbered neighbors. Allows API handlers to determine if there is already an active session for unnumbered peers. Without this, there was no handling for updates to an existing unnumbered peer (requiring a delete/add workflow).
Fix lookup that was using PeerId::Ip instead of PeerId::Interface. Ensure that interface removal empties the interface scope map even if the interface isn't a registered interface (e.g. if there was a partial failure during an earlier call).
Fixes a bug where multiple peers with the same IP have their routes tracked as coming from the same source. This is really only exposed when using link-local addrs (BGP unnumbered) since that's the only time the same IP can be found on multiple peers. Scenario: - R2 has BGP unnumbered peers R1 / R3 - R1/R3 both use the same link-local IP, e.g. fe80::1 - R2 learns 3fff:10::/64 from R1, 3fff:30::/64 from R3 - R2 stores fe80::1 in BgpPathProperties.peer field for both R1/R3 - R1 shuts down the BGP session - R2 cleans up all BGP routes matching BgpPathProperties.peer == fe80::1 - 3fff:20::/64 is correctly removed from the RIB - 3fff:30::/64 is erroneously removed from the RIB <-- wrong! test_three_router_chain_unnumbered has been updated to exercise this exact scenario - which failed prior to the IpAddr -> PeerId type update. Also updates API to use Path where peer is PeerId instead of IpAddr.
Fixes CLI commands that only accepted IP addresses, preventing operations on unnumbered BGP peers (identified by interface name). Changes: - Clear command: addr (IpAddr) → peer (String) - Added routing logic to detect peer type via parse_peer_type() - Routes to clear_neighbor_v2() for numbered peers - Routes to clear_unnumbered_neighbor() for unnumbered peers - FSM history: peer (Option<IpAddr>) → peer (Option<String>) - Parses peer string to PeerId enum before API call - Updated get_fsm_history() handler to convert types - Message history: peer (IpAddr) → peer (String) - Parses peer string to PeerId enum before API call - Updated get_message_history() handler to convert types - Removed dead code: ExportPolicy struct (never used) All commands now follow the pattern established by neighbor read/ delete/create commands, which already supported both peer types. Users can now specify peers as either IP addresses (numbered) or interface names (unnumbered) consistently across all CLI commands.
This PR adds support for BGP unnumbered.
Depends on
The Main Event
Unnumbered peers are added as a new type of peer with their own set of API functions. When an unnumbered peer is added it goes into the pending state. The new peer request is handed over to an unnumbered peer manager where IPv6/NDP router advertisements (RA) are used to discover the address of the peer over the interface provided. When a router is discovered on the interface the pending peer then turns into a real peer. We use the IPv6 link-local address discovered in the RA and create a BGP session using that address. From here the extant BGP machinery works in the same way it currently does.
Neighbor discovery is a one-shot process when the unnumbered peer is created. Once we determine the peer address through an RA, that is what we use. If the peer address changes, the session needs to be deleted and recreated so discovery will happen again. No attempt is made to track neighbor state beyond initial session establishment. This may be revisited in a future release. One shot semantics has the benefit of being simple and stable. I don't believe we want to be overly sensitive to NDP RA dynamics as that would undercut the intent of things like BGP hold time and TCP connection tracking. Having a persistent monitor of NDP dynamics will need to be designed carefully.
Unnumbered support is designed to work in a point to point setting. We do not currently handle situations where there are multiple reachable routers reachable over a broadcast domain that the unnumbered interface is connected to. We simply attempt to establish a session with the first peer we find. This is how things have generally worked for unnumbered in my experience. We could develop something more robust, but this becomes a rather complex state machine where we have to try to establish sessions with every peer we discover. This could lead to doomed BGP sessions iterating through a state machine indefinitely.
Testing
This PR also adds a new testing substrate in the
falcon-labfolder. BGP unnumbered requires multicast ICMPv6/NDP to work. This means we need a testing environment that routers connected over a broadcast domain, and some other unnumbered implementation to peer with to ensure we are not deluding ourselves with self-peering. Using illumos zones would have been nice here, but none of the BGP implementations I'm aware of on illumos besides mgd fully support unnumbered. FRR on the other hand has great support for unnumbered. So peering with a linux-based router puts us in falcon territory.Another really important thing to test for unnumbered is mg-lower interaction. We need test coverage on nexthop interface mapping for data plane route installation. The interop test environment is currently upper-half only. Due to license nonsense, the interop topology machinery is also stuck in purgatory. So in this PR I decided to add a new falcon-based testing substrate that only uses unencumbered peers and the oxide nodes are softnpu vms so we have a full lower half with dendrite, P4 and the whole 9 yards. This also means the testing machinery can actually live in this repository which is a nice win. The best way to get acquainted with this new testing substrate is starting by reading through the first test.