Skip to content

feat: Allow xnet on engines#10215

Open
schneiderstefan wants to merge 4 commits into
masterfrom
stschnei/no-cycles-or-unbounded-xnet-on-engines
Open

feat: Allow xnet on engines#10215
schneiderstefan wants to merge 4 commits into
masterfrom
stschnei/no-cycles-or-unbounded-xnet-on-engines

Conversation

@schneiderstefan

@schneiderstefan schneiderstefan commented May 13, 2026

Copy link
Copy Markdown
Contributor

This commit opens engines up to send and receive XNet messages with 2
restrictions on messages that involve engines (engine->subnet,
subnet->engine, engine->engine, but not messages staying on the same
engine):

  1. Only bounded wait calls are allowed
  2. Messages cannot contain attached cycles

Messages that are not allowed will be rejected by the protocol and the
canister receives an error.

The implementation is in 3 places:

  1. The StreamBuilder rejects these messages on the sending side
  2. The XNet PayloadBuilder rejects these messages on the receiving side.
    No honest subnet would send any of these messages, but this protects
    from malicious subnets.
  3. The NetworkTopology now contains all other subnets again. Previously,
    when engines were not able to do any XNet, the NetworkTopology would
    only contain other subnets if it could send messages to them. The
    exception to this is the list of ecsda subnets, which does not contain
    other subnets that do not allow sending cycles to them. This is because
    the threshold ecdsa endpoint expect to be called with cycles attached.

@github-actions github-actions Bot added the feat label May 13, 2026
This commit opens engines up to send and receive XNet messages with 2
restrictions on messages that involve engines (engine->subnet,
subnet->engine, engine->engine, but not messages staying on the same
engine):
1. Only bounded wait calls are allowed
2. Messages cannot contain attached cycles

Messages that are not allowed will be rejected by the protocol and the
canister receives an error.

The implementation is in 3 places:
1. The StreamBuilder rejects these messages on the sending side
2. The XNet PayloadBuilder rejects these messages on the receiving side.
No honest subnet would send any of these messages, but this protects
from malicious subnets.
3. The NetworkTopology now contains all other subnets again. Previously,
when engines were not able to do any XNet, the NetworkTopology would
only contain other subnets if it could send messages to them. The
exception to this is the list of ecsda subnets, which does not contain
other subnets that do not allow sending cycles to them. This is because
the threshold ecdsa endpoint expect to be called with cycles attached.
@schneiderstefan schneiderstefan force-pushed the stschnei/no-cycles-or-unbounded-xnet-on-engines branch from 8364950 to 35f20e0 Compare June 4, 2026 15:29
@schneiderstefan schneiderstefan changed the title feat: Allow limited xnet on engines feat: Allow xnet on engines Jun 4, 2026
@schneiderstefan schneiderstefan marked this pull request as ready for review June 4, 2026 15:58
@schneiderstefan schneiderstefan requested a review from a team as a code owner June 4, 2026 15:58

@alin-at-dfinity alin-at-dfinity left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

First round of comments, still going through it.

Comment on lines +1914 to +1922
// subnets_for_certification also includes all three (via full_topology).
let cert_keys: Vec<_> = network_topology
.subnets_for_certification()
.keys()
.copied()
.collect();
assert!(all_keys.contains(&app_subnet_id));
assert!(all_keys.contains(&engine_subnet_id));
assert!(all_keys.contains(&nns_subnet_id));
assert!(cert_keys.contains(&app_subnet_id));
assert!(cert_keys.contains(&engine_subnet_id));
assert!(cert_keys.contains(&nns_subnet_id));

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This appears to be the same as the previous two tests (only written differently). Do we still need full_topology now that all subnets see all subnets in every "view"?

Comment thread rs/messaging/src/message_routing.rs
Comment thread rs/messaging/src/message_routing.rs
Comment thread rs/messaging/src/message_routing.rs Outdated
schneiderstefan and others added 3 commits June 8, 2026 14:15

@alin-at-dfinity alin-at-dfinity left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

More nitpicks, for the most part.


let chain_key_enabled_subnets = btreemap! {
shared_chain_key() => Valid(vec![app_subnet_id, engine_subnet_id]),
engine_only_chain_key() => Valid(vec![engine_subnet_id]),

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Clueless question: do system subnets have access to / use chain keys?

/// Tests that a guaranteed-response request from a CloudEngine subnet (own subnet) to a
/// non-engine subnet is rejected with a synthetic reject response.
#[test]
fn build_streams_engine_src_rejects_gr_request() {

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nit: AFAIK we've never used the GR abbreviation before (although I might have just missed it). It took me a second to parse it. I'd rather just spell it out.

originator_reply_callback: CallbackId::from(1),
refund: Cycles::new(100),
response_payload: Payload::Data(vec![]),
deadline: NO_DEADLINE,

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nit: Maybe we should make this a best-effort response, to make it clear that it's dropped because of the refund, not because it's guaranteed response.

Comment on lines +1042 to +1051
// The refund must NOT have been routed into the REMOTE_SUBNET stream.
let routed_refunds = result_state
.streams()
.get(&REMOTE_SUBNET)
.map_or(0, |s| s.refund_count());
assert_eq!(
0, routed_refunds,
"Refund leaked across engine boundary (own_subnet_type={own_subnet_type:?}, \
remote_subnet_type={remote_subnet_type:?})",
);

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should we also check that the refund is also gone from the refund pool (if that's what it's called)?

Comment on lines +3395 to +3399
/// Tests that a best-effort request carrying cycles from a CloudEngine subnet is dropped
/// at the engine boundary: cycles observed as lost, accept signal pushed, and a critical
/// error raised. Mirrors the sender-side test
/// `build_streams_engine_src_rejects_cycles_request` on the receiving side, which is
/// the security-critical filter against a malicious engine.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just thinking out loud:

  1. Should we be raising critical errors (i.e. page) if an engine misbehaves?
  2. Should we be recording such cycles as lost to begin with? (Assuming it's trivial to not do so. I don't think this matters enough to add another 100 or even 50 lines of code.)

// malicious peer. Drop it (any cycles are lost) and raise a critical
// error.
if let RequestOrResponse::Response(ref rep) = msg {
let is_gr = rep.deadline == NO_DEADLINE;

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nit: Same observation regarding the GR abbreviation. You can either expand it or not bother with the intermediate boolean and just plug rep.deadline == NO_DEADLINE into the if condition below.

CompoundCycles::new(rep.refund, own_cost_schedule),
);
}
stream.push_accept_signal();

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Taking this even further, this branch only differs from the (two) Request branch(es) in that it produces an Accept instead of Reject(EngineNotAllowed).

Even if you don't feel like unifying them all, you can still put them under the same match block, there appears to be no need for a separate if.

),
RejectReason::EngineNotAllowed => (
RejectCode::SysFatal,
"Guaranteed-response calls from CloudEngine subnets are not allowed".to_string(),

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nit:

Suggested change
"Guaranteed-response calls from CloudEngine subnets are not allowed".to_string(),
"Guaranteed-response calls and cycle transfers to / from CloudEngine subnets are not allowed".to_string(),

Comment on lines +70 to +71
#[serde(default, skip_serializing_if = "Vec::is_empty")]
pub engine_not_allowed_deltas: Vec<u64>,

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Damn, this just occurred to me now: whether we go with an explicit certification version or not, a canonical_state change requires a staged rollout: first deploy the new "variant" (to all subnets, in this case, since it affects stream structure) with the full logic required to handle it (in this case, to inflate the signal into a reject response, so not much); and only then deploy replica binaries that may produce it.

If we're to do this properly, the new certification version would be introduced with the field; and it would become the "current version" once we have code that can produce it. An explicit version would also allow us to rely on tests to tell us when it's safe to merge follow-up changes.

You should also make a copy of the original RejectSignals type (no engine_not_allowed_deltas field) under rs/canonical_state/encoding/old_types.rs (likely as RejectSignalsV25) and update encoding/tests/compatibility.rs and encoding/tests/test_fixtures.rs to also cover the new variant.

.map(|reason| reason as i32)
.collect::<Vec<i32>>(),
[1, 2, 3, 4, 5, 6, 7]
[1, 2, 3, 4, 5, 6, 7, 8]

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Similar (but less stringent) observation here as for rs/canonical_state. As per rs/replicated_state/best-practices-replicated-state.md, you should first deploy the ic_types and protobuf additions, without using them anywhere. Only once that has made it into a replica release can you proceed with the rest of the changes.

@alin-at-dfinity alin-at-dfinity left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

All done, LGTM modulo the one bug and the staged rollout.

})
})
.collect();
let subnet_ids: Vec<_> = all_subnet_ids.into_iter().collect();

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

all_subnet_ids is already a Vec<SubnetId>.

Suggested change
let subnet_ids: Vec<_> = all_subnet_ids.into_iter().collect();
let subnet_ids = self
.registry
.get_subnet_ids(validation_context.registry_version)
.map_err(Error::RegistryGetSubnetsFailed)?
.unwrap_or_default();

Comment on lines 312 to 313
("NNS", &uc_nns, "CloudEngine 2", &uc_ce_2a),
("CloudEngine 2", &uc_ce_2a, "NNS", &uc_nns),

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not part of your change, but these two appear redundant. AFAICT there is no difference between "CloudEngine 1" and "CloudEngine 2".

Suggested change

@@ -306,10 +315,10 @@
("CloudEngine 2", &uc_ce_2a, "CloudEngine 1", &uc_ce_1a),

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ditto.

Suggested change
("CloudEngine 2", &uc_ce_2a, "CloudEngine 1", &uc_ce_1a),

Comment on lines 601 to 605
let fixture = PayloadBuilderTestFixture::with_xnet_state_and_subnet_types(
1,
0,
btreemap![cloud_engine_subnet => SubnetType::CloudEngine],
None,
);

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Rant: This fixture is amazingly hard to read. Does this mean we get 4 subnets out of which SUBNET_1 is a CloudEngine? And that own_subnet_type defaults to Application?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants