ipv6 e2e integration#9570

Merged
internet-diglett merged 80 commits into main from ry/ipv6-all-the-things
Feb 14, 2026

@rcgoodfellow
Contributor

@rcgoodfellow rcgoodfellow commented Dec 27, 2025

This PR pulls in various IPv6 work. The biggest code changes revolve around integrating new Maghemite APIs around IPv6 peers and unnumbered peers. This has meant changing data structures in the rack initialize API.

Before this PR, the rack initialize API was part of the client-side versioned bootstrap API, which made it impossible to change. However, the only client of the rack initialize API is wicketd, which effectively makes the three API endpoints under rack-initialize lockstep. With that in mind, the rack initialize endpoints have been factored out as a lockstep API, and a new version of the client-side bootstrap API has been created that deprecates use of rack-initialize.

Another tricky aspect of changing these data structures is that they are stored in the bootstore. In particular, we are changing the BGP peer member from an Ipv4Addr to an IpAddr. This should be a backwards-compatible change for the bootstore, but testing is required to confirm that.
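
The compatibility argument can be illustrated with a minimal, std-only sketch. It assumes the stored address is kept in its textual form; the helper names `parse_peer_addr` and `upgrade_v4` are hypothetical and do not appear in this PR.

```rust
use std::net::{AddrParseError, IpAddr, Ipv4Addr};

/// Hypothetical helper: parse a stored peer address into the wider type.
/// Assumption: the stored value is the address in string form.
fn parse_peer_addr(stored: &str) -> Result<IpAddr, AddrParseError> {
    stored.parse::<IpAddr>()
}

/// Round-trip an old IPv4-only value through its string form, the way an
/// old record would be read back by code that now expects an `IpAddr`.
fn upgrade_v4(old: Ipv4Addr) -> IpAddr {
    parse_peer_addr(&old.to_string())
        .expect("IPv4 text is also valid IpAddr text")
}
```

Any string that parses as an `Ipv4Addr` also parses as an `IpAddr`, so old records should still read. The reverse does not hold, so older code could not read a newly written IPv6 peer.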

Remaining work items:

  • External API updates for IPv6 BGP peers
  • External API updates for unnumbered BGP peers
  • Wicket updates for IPv6 BGP peers
  • Wicket updates for unnumbered BGP peers

Functional milestones:

  • Passing CI
  • Bring up a4x2 with BGP unnumbered peering, confirming
    • session establishment
    • ipv4 prefix exchange
    • ipv6 prefix exchange
    • working ipv4 probes
    • working ipv6 probes
  • Bring in asilomar racklette environment
    • session establishment
    • ipv4 prefix exchange
    • ipv6 prefix exchange
    • working ipv4 instance comms
    • working ipv6 instance comms

Depends on

@rcgoodfellow rcgoodfellow added this to the 18 milestone Dec 29, 2025
@rcgoodfellow rcgoodfellow force-pushed the ry/ipv6-all-the-things branch 5 times, most recently from c75fdb6 to b813569 Compare January 5, 2026 06:07
@rcgoodfellow rcgoodfellow force-pushed the ry/ipv6-all-the-things branch from b813569 to 1bd54d4 Compare January 5, 2026 18:43
@internet-diglett internet-diglett self-requested a review January 16, 2026 00:55
@internet-diglett
Contributor

Things look good to me so far. I'm not able to replicate the CI build-and-test failures on my local workstation, so those might be transient failures.

Looks like sled-agent is failing here on the deploy task:

sled-agent: Failed to delete all XDE devices

Caused by:
    0: Failure interacting with the OPTE ioctl(2) interface: command ListPorts failed: BadApiVersion { user: 38, kernel: 37 }
    1: command ListPorts failed: BadApiVersion { user: 38, kernel: 37 }
[ Jan  5 21:09:50 Stopping because all processes in service exited. ]
[ Jan  5 21:09:50 Executing stop method (:kill). ]
[ Jan  5 21:09:50 Executing start method ("ctrun -l child -o noorphan,regent /opt/oxide/sled-agent/sled-agent run /opt/oxide/sled-agent/pkg/config.toml &"). ]
[ Jan  5 21:09:50 Method "start" exited with status 0. ]

@internet-diglett
Contributor

internet-diglett commented Jan 16, 2026

Local deployment is working, so it looks like the deploy task will work once we pull the new xde / update the illumos image in CI.

@rcgoodfellow rcgoodfellow force-pushed the ry/ipv6-all-the-things branch from 88ec5ed to 4a5a0d4 Compare January 18, 2026 17:48
@rcgoodfellow rcgoodfellow force-pushed the ry/ipv6-all-the-things branch from 4a5a0d4 to a4fb72c Compare January 18, 2026 19:15
@rcgoodfellow rcgoodfellow force-pushed the ry/ipv6-all-the-things branch 2 times, most recently from dd2621a to d37350d Compare January 21, 2026 21:21
@rcgoodfellow rcgoodfellow force-pushed the ry/ipv6-all-the-things branch from d37350d to 4d42838 Compare January 21, 2026 23:55
internet-diglett and others added 7 commits February 10, 2026 01:15
Several small, related changes to `MaxPathConfig` and
`RouterLifetimeConfig`:

* remove `new_unchecked()` (required changing some `into()`s into
`try_into()`s, but I think this is quite a bit safer)
* add custom `Deserialize` impls that validate bounds
* add custom `JsonSchema` impls that describe the bounds (for
`MaxPathConfig`, the min value of 1 also causes progenitor to generate a
`NonZeroU8` in clients, which I didn't know it could do)
* remove a duplicate `MaxPathConfig` definition
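
The move from `new_unchecked()` to fallible conversion could look roughly like the sketch below. This is not the code from this PR: the upper bound `MAX_PATHS`, the error type `OutOfRange`, and the `get()` accessor are assumptions made for illustration. Only the minimum of 1 comes from the description above.

```rust
use std::convert::TryFrom;
use std::num::NonZeroU8;

/// Assumed upper bound, for illustration only; the real limit may differ.
const MAX_PATHS: u8 = 32;

/// Sketch of a bounded config value; zero is unrepresentable by construction.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub struct MaxPathConfig(NonZeroU8);

/// Hypothetical error carrying the rejected value.
#[derive(Debug, PartialEq, Eq)]
pub struct OutOfRange(pub u8);

impl TryFrom<u8> for MaxPathConfig {
    type Error = OutOfRange;

    fn try_from(value: u8) -> Result<Self, Self::Error> {
        // Reject zero (via NonZeroU8) and anything above the assumed maximum.
        match NonZeroU8::new(value) {
            Some(n) if value <= MAX_PATHS => Ok(Self(n)),
            _ => Err(OutOfRange(value)),
        }
    }
}

impl MaxPathConfig {
    /// Return the validated value as a plain integer.
    pub fn get(self) -> u8 {
        self.0.get()
    }
}
```

With no unchecked constructor, every value has to pass through `try_from`, which is why call sites that used `into()` need `try_into()` instead. A custom `Deserialize` impl could delegate to the same check.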

  async fn networking_bgp_exported(
      rqctx: RequestContext<Self::Context>,
- ) -> Result<HttpResponseOk<BgpExported>, HttpError>;
+ ) -> Result<HttpResponseOk<Vec<BgpExported>>, HttpError>;
Contributor

Even in the case of relatively bounded Vec returns, we typically have a paginated response interface. In the past where actual pagination has been impractical, we've faked it up e.g. by always returning a ResultsPage with next_page: None.

I see several instances of that that I think we should address.

Contributor

Here's an example of us doing this. Several benefits including future-proofing and client enumeration:

let instance_lookup =
    nexus.instance_lookup(&opctx, instance_selector)?;
let ips = nexus
    .instance_list_external_ips(&opctx, &instance_lookup)
    .await?;
Ok(HttpResponseOk(ResultsPage { items: ips, next_page: None }))
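
Reduced to its essentials, the "faked" pagination pattern is small. The sketch below uses a simplified stand-in for the response type rather than the real one, and the `single_page` helper is hypothetical.

```rust
/// Simplified stand-in for a paginated response type (assumption: the real
/// type has at least an item list and an optional next-page token).
#[derive(Debug, PartialEq)]
pub struct ResultsPage<T> {
    pub items: Vec<T>,
    pub next_page: Option<String>,
}

/// Hypothetical helper: wrap a complete, bounded list as a single final page.
/// `next_page: None` tells clients there is nothing further to fetch.
pub fn single_page<T>(items: Vec<T>) -> ResultsPage<T> {
    ResultsPage { items, next_page: None }
}
```

Clients written against the paginated shape keep working if real pagination is added later, since only `next_page` starts carrying a token.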

Contributor Author

async fn networking_bgp_imported(
    rqctx: RequestContext<Self::Context>,
    query_params: Query<params::BgpRouteSelector>,
) -> Result<HttpResponseOk<Vec<BgpImported>>, HttpError>;
Contributor

as above

Contributor Author

  progenitor::generate_api!(
-     spec = "../../openapi/bootstrap-agent/bootstrap-agent-1.0.0-127591.json",
+     spec = "../../openapi/bootstrap-agent/bootstrap-agent-2.0.0-632b71.json",
Contributor

Is this right? I was under the impression we needed to stay on bootstrap-agent 1.0.0 this release so that we could upgrade through this change from R17.

Contributor

Please double-check, but we think this is okay:

  • The only API changes here are to remove calls that now live in bootstrap-agent-lockstep, because those calls were only ever made by lockstep clients (during RSS).
  • The type changes made to types that are kept in the bootstore (slightly different than the bootstrap API, although there's overlap), under common/src/api/internal/shared/*/v2.rs, only made wire-compatible changes, allowing us to still deserialize the old bootstore. The kinds of changes made are:
    • adding new fields tagged with #[serde(default)] (e.g., BgpPeerConfig::router_lifetime) - still deserializes thanks to the default tag
    • making required fields optional (e.g., UplinkAddressConfig::address) - still deserializes and will show up as Some(_)
    • changing IP types that were ipv4-only to be generic IP (e.g., RackNetworkConfig::infra_ip_{first,last}) - still deserializes because we can parse an IPv4 string as a generic IpAddr

This is obviously all very manual and error prone, hence filing #9801, which we basically must do before any more changes need to happen to these types.

Contributor

The issue I had in mind is that an older bootstrap agent server will 404 any requests from clients generated from the 2.0.0 spec, because the server isn't aware of that version yet.

But if that's okay, then that all seems fine.

Contributor

... I completely forgot about this. I think we only tested bootstore compatibility via mupdate, which wouldn't run into problems here. 🤦

That said, I think we're okay, but please double check this too! The only endpoints left in the bootstrap API are baseboard_get() and components_get(). I don't think there are any callers of components_get(). There's one caller of baseboard_get(): other sled-agent instances to service a "sled add". This would fail mid-online-update, but it's probably okay to note that adding a sled during an update from R17 to R18 needs to wait until after all the OS updates are done?

Contributor

I think that makes sense, but I will admit to not having double-checked it. I suppose if there is a show-stopper we will catch it in upgrade testing (we should definitely make sure we perform online upgrade of a racklette from 17.1 to 18).

Verified online update from 17.2 to 18 on a racklette and it worked just fine for me.

Correction: Upgrading to this commit turned out to cause the BGP config to be lost (which manifests during a cold boot or in the next update). The fix is in #9863.

@internet-diglett internet-diglett enabled auto-merge (squash) February 14, 2026 01:41
@internet-diglett internet-diglett enabled auto-merge (squash) February 14, 2026 01:48
@internet-diglett internet-diglett merged commit 4a456c9 into main Feb 14, 2026
19 checks passed
@internet-diglett internet-diglett deleted the ry/ipv6-all-the-things branch February 14, 2026 03:49