feat: Allow xnet on engines#10215
Conversation
This commit opens engines up to send and receive XNet messages with 2 restrictions on messages that involve engines (engine->subnet, subnet->engine, engine->engine, but not messages staying on the same engine): 1. Only bounded wait calls are allowed 2. Messages cannot contain attached cycles Messages that are not allowed will be rejected by the protocol and the canister receives an error. The implementation is in 3 places: 1. The StreamBuilder rejects these messages on the sending side 2. The XNet PayloadBuilder rejects these messages on the receiving side. No honest subnet would send any of these messages, but this protects from malicious subnets. 3. The NetworkTopology now contains all other subnets again. Previously, when engines were not able to do any XNet, the NetworkTopology would only contain other subnets if it could send messages to them. The exception to this is the list of ecsda subnets, which does not contain other subnets that do not allow sending cycles to them. This is because the threshold ecdsa endpoint expect to be called with cycles attached.
8364950 to
35f20e0
Compare
alin-at-dfinity
left a comment
There was a problem hiding this comment.
First round of comments, still going through it.
| // subnets_for_certification also includes all three (via full_topology). | ||
| let cert_keys: Vec<_> = network_topology | ||
| .subnets_for_certification() | ||
| .keys() | ||
| .copied() | ||
| .collect(); | ||
| assert!(all_keys.contains(&app_subnet_id)); | ||
| assert!(all_keys.contains(&engine_subnet_id)); | ||
| assert!(all_keys.contains(&nns_subnet_id)); | ||
| assert!(cert_keys.contains(&app_subnet_id)); | ||
| assert!(cert_keys.contains(&engine_subnet_id)); | ||
| assert!(cert_keys.contains(&nns_subnet_id)); |
There was a problem hiding this comment.
This appears to be the same as the previous two tests (only written differently). Do we still need full_topology now that all subnets see all subnets in every "view"?
Co-authored-by: Alin Sinpalean <58422065+alin-at-dfinity@users.noreply.github.com>
alin-at-dfinity
left a comment
There was a problem hiding this comment.
More nitpicks, for the most part.
|
|
||
| let chain_key_enabled_subnets = btreemap! { | ||
| shared_chain_key() => Valid(vec![app_subnet_id, engine_subnet_id]), | ||
| engine_only_chain_key() => Valid(vec![engine_subnet_id]), |
There was a problem hiding this comment.
Clueless question: do system subnets have access to / use chain keys?
| /// Tests that a guaranteed-response request from a CloudEngine subnet (own subnet) to a | ||
| /// non-engine subnet is rejected with a synthetic reject response. | ||
| #[test] | ||
| fn build_streams_engine_src_rejects_gr_request() { |
There was a problem hiding this comment.
Nit: AFAIK we've never used the GR abbreviation before (although I might have just missed it). It took me a second to parse it. I'd rather just spell it out.
| originator_reply_callback: CallbackId::from(1), | ||
| refund: Cycles::new(100), | ||
| response_payload: Payload::Data(vec![]), | ||
| deadline: NO_DEADLINE, |
There was a problem hiding this comment.
Nit: Maybe we should make this a best-effort response, to make it clear that it's dropped because of the refund, not because it's guaranteed response.
| // The refund must NOT have been routed into the REMOTE_SUBNET stream. | ||
| let routed_refunds = result_state | ||
| .streams() | ||
| .get(&REMOTE_SUBNET) | ||
| .map_or(0, |s| s.refund_count()); | ||
| assert_eq!( | ||
| 0, routed_refunds, | ||
| "Refund leaked across engine boundary (own_subnet_type={own_subnet_type:?}, \ | ||
| remote_subnet_type={remote_subnet_type:?})", | ||
| ); |
There was a problem hiding this comment.
Should we also check that the refund is also gone from the refund pool (if that's what it's called)?
| /// Tests that a best-effort request carrying cycles from a CloudEngine subnet is dropped | ||
| /// at the engine boundary: cycles observed as lost, accept signal pushed, and a critical | ||
| /// error raised. Mirrors the sender-side test | ||
| /// `build_streams_engine_src_rejects_cycles_request` on the receiving side, which is | ||
| /// the security-critical filter against a malicious engine. |
There was a problem hiding this comment.
Just thinking out loud:
- Should we be raising critical errors (i.e. page) if an engine misbehaves?
- Should we be recording such cycles as lost to begin with? (Assuming it's trivial to not do so. I don't think this matters enough to add another 100 or even 50 lines of code.)
| // malicious peer. Drop it (any cycles are lost) and raise a critical | ||
| // error. | ||
| if let RequestOrResponse::Response(ref rep) = msg { | ||
| let is_gr = rep.deadline == NO_DEADLINE; |
There was a problem hiding this comment.
Nit: Same observation regarding the GR abbreviation. You can either expand it or not bother with the intermediate boolean and just plug rep.deadline == NO_DEADLINE into the if condition below.
| CompoundCycles::new(rep.refund, own_cost_schedule), | ||
| ); | ||
| } | ||
| stream.push_accept_signal(); |
There was a problem hiding this comment.
Taking this even further, this branch only differs from the (two) Request branch(es) in that it produces an Accept instead of Reject(EngineNotAllowed).
Even if you don't feel like unifying them all, you can still put them under the same match block, there appears to be no need for a separate if.
| ), | ||
| RejectReason::EngineNotAllowed => ( | ||
| RejectCode::SysFatal, | ||
| "Guaranteed-response calls from CloudEngine subnets are not allowed".to_string(), |
There was a problem hiding this comment.
Nit:
| "Guaranteed-response calls from CloudEngine subnets are not allowed".to_string(), | |
| "Guaranteed-response calls and cycle transfers to / from CloudEngine subnets are not allowed".to_string(), |
| #[serde(default, skip_serializing_if = "Vec::is_empty")] | ||
| pub engine_not_allowed_deltas: Vec<u64>, |
There was a problem hiding this comment.
Damn, this just occurred to me now: whether we go with an explicit certification version or not, a canonical_state change requires a staged rollout: first deploy the new "variant" (to all subnets, in this case, since it affects stream structure) with the full logic required to handle it (in this case, to inflate the signal into a reject response, so not much); and only then deploy replica binaries that may produce it.
If we're to do this properly, the new certification version would be introduced with the field; and it would become the "current version" once we have code that can produce it. An explicit version would also allow us to rely on tests to tell us when it's safe to merge follow-up changes.
You should also make a copy of the original RejectSignals type (no engine_not_allowed_deltas field) under rs/canonical_state/encoding/old_types.rs (likely as RejectSignalsV25) and update encoding/tests/compatibility.rs and encoding/tests/test_fixtures.rs to also cover the new variant.
| .map(|reason| reason as i32) | ||
| .collect::<Vec<i32>>(), | ||
| [1, 2, 3, 4, 5, 6, 7] | ||
| [1, 2, 3, 4, 5, 6, 7, 8] |
There was a problem hiding this comment.
Similar (but less stringent) observation here as for rs/canonical_state. As per rs/replicated_state/best-practices-replicated-state.md, you should first deploy the ic_types and protobuf additions, without using them anywhere. Only once that has made it into a replica release can you proceed with the rest of the changes.
alin-at-dfinity
left a comment
There was a problem hiding this comment.
All done, LGTM modulo the one bug and the staged rollout.
| }) | ||
| }) | ||
| .collect(); | ||
| let subnet_ids: Vec<_> = all_subnet_ids.into_iter().collect(); |
There was a problem hiding this comment.
all_subnet_ids is already a Vec<SubnetId>.
| let subnet_ids: Vec<_> = all_subnet_ids.into_iter().collect(); | |
| let subnet_ids = self | |
| .registry | |
| .get_subnet_ids(validation_context.registry_version) | |
| .map_err(Error::RegistryGetSubnetsFailed)? | |
| .unwrap_or_default(); |
| ("NNS", &uc_nns, "CloudEngine 2", &uc_ce_2a), | ||
| ("CloudEngine 2", &uc_ce_2a, "NNS", &uc_nns), |
There was a problem hiding this comment.
Not part of your change, but these two appear redundant. AFAICT there is no difference between "CloudEngine 1" and "CloudEngine 2".
| @@ -306,10 +315,10 @@ | |||
| ("CloudEngine 2", &uc_ce_2a, "CloudEngine 1", &uc_ce_1a), | |||
There was a problem hiding this comment.
Ditto.
| ("CloudEngine 2", &uc_ce_2a, "CloudEngine 1", &uc_ce_1a), |
| let fixture = PayloadBuilderTestFixture::with_xnet_state_and_subnet_types( | ||
| 1, | ||
| 0, | ||
| btreemap![cloud_engine_subnet => SubnetType::CloudEngine], | ||
| None, | ||
| ); |
There was a problem hiding this comment.
Rant: This fixture is amazingly hard to read. Does this mean we get 4 subnets out of which SUBNET_1 is a CloudEngine? And that own_subnet_type defaults to Application?
This commit opens engines up to send and receive XNet messages with 2
restrictions on messages that involve engines (engine->subnet,
subnet->engine, engine->engine, but not messages staying on the same
engine):
Messages that are not allowed will be rejected by the protocol and the
canister receives an error.
The implementation is in 3 places:
No honest subnet would send any of these messages, but this protects
from malicious subnets.
when engines were not able to do any XNet, the NetworkTopology would
only contain other subnets if it could send messages to them. The
exception to this is the list of ecsda subnets, which does not contain
other subnets that do not allow sending cycles to them. This is because
the threshold ecdsa endpoint expect to be called with cycles attached.