fix: provider state ownership via events#385
Conversation
There was a problem hiding this comment.
Code Review
This pull request updates the OpenFeature specification to shift the responsibility of emitting lifecycle events from the SDK to the providers. By requiring providers to emit events for status transitions such as initialization and context reconciliation, the SDK can derive provider status directly from the event stream, which helps prevent race conditions in multi-threaded environments. The changes include updated requirements, a new migration guide in Appendix E, and documentation adjustments. Reviewers identified an inconsistency where 'shutdown' was included in lists of methods requiring event emission despite the lack of a defined shutdown event; they provided suggestions to remove 'shutdown' from these requirements and ensure consistent terminology for context changes.
3177a63 to
4734a58
Compare
Providers emit events to signal all status transitions; the SDK derives provider status from these events rather than inferring from lifecycle method return values. - Add section 2.8 (Provider status) with requirements 2.8.1-2.8.4 - Add Condition 2.8.5 for simple providers (no lifecycle, no events) - Update 1.7.3/4/5 to reference provider events instead of init outcomes - Change 5.1.1 from MAY to MUST; reframe 5.3.1/2 as event-driven - Update 5.3.5 table (PROVIDER_CONTEXT_CHANGED -> READY, PROVIDER_RECONCILING -> RECONCILING) - Add non-normative threading note to 5.3.3 - Add Appendix E (Migrations) for legacy provider compatibility - Fix lint check for non-requirement entries Signed-off-by: Todd Baert <todd.baert@dynatrace.com>
Signed-off-by: Todd Baert <todd.baert@dynatrace.com>
Signed-off-by: Todd Baert <todd.baert@dynatrace.com>
b37af76 to
6f01fd6
Compare
erka
left a comment
There was a problem hiding this comment.
Nice work. Migrating providers will be challenging, but it’s for the best.
Agree |
lukas-reining
left a comment
There was a problem hiding this comment.
This looks good to me! Makes sense to go forward with this.
I left some small questions/remarks.
| Providers that do not define lifecycle methods or an event emission mechanism cannot emit events by design; see Condition 2.8.5. | ||
| Requirements 2.8.1-2.8.4 apply only to providers that define lifecycle methods and an event emission mechanism. |
There was a problem hiding this comment.
Can a provider without lifecycle methods but an event emission mechanism emit events?
I feel like in this case this works. But is does not work if the provider has lifecycle methods but no event emission mechanism. Right?
I feel like 2.8.1 indirectly requires an eventing mechanism if there is lifecycle methods.
|
|
||
| #### Condition 2.8.5 | ||
|
|
||
| > The provider does not define an `initialize` function or an event emission mechanism. |
There was a problem hiding this comment.
As above I am confused by the or.
If a provider does not have an initialize function we run PROVIDER_READY.
If I get 2.8.1 right, it could basically be:
| > The provider does not define an `initialize` function or an event emission mechanism. | |
| > The provider does not define an `initialize` function. |
There was a problem hiding this comment.
Also a bit confused.
With 5.1.1 going from MAY to MUST, does that mean providers without an event mechanism are not conformant to this version of the spec?
With this condition it seems we (SDK) intentionally support them. I guess that is just for the time being? Or expected to long term?
| > The provider **MUST** emit `PROVIDER_ERROR` if its `initialize` function terminates abnormally. | ||
|
|
||
| If the error is irrecoverable, the error code must indicate `PROVIDER_FATAL`. |
There was a problem hiding this comment.
in Requirement 5.3.5 of the current spec we say:
Some providers may emit events spontaneously, based on changes in their internal state (connections, caches, etc). The SDK must update its internal representation of the provider's state accordingly:
Should we add a requirement here that says that providers may emit a PROVIDER_ERROR at any time? I know that we also not say that they can't but I think it would make it clearer.
This would make the non normative from the beginning more misleading as the provider may error without initialization:
Providers that do not define lifecycle methods or an event emission mechanism cannot emit events by design
| The legacy path should be deprecated in the release that introduces the marker, with removal targeted for the next major version. | ||
| SDK authors should update any first-party providers and provider base classes to emit their own lifecycle events. |
There was a problem hiding this comment.
This makes much sense to me!
| The `PROVIDER_CONTEXT_CHANGED` is not emitted from the provider itself; the SDK implementation must run the `PROVIDER_CONTEXT_CHANGED` handlers if the `on context changed` function terminates normally. | ||
| It's possible that the `on context changed` function is invoked simultaneously or in quick succession; in this case the SDK will only run the `PROVIDER_CONTEXT_CHANGED` handlers after all reentrant invocations have terminated, and the last to terminate was successful (terminated normally). | ||
| The provider must emit `PROVIDER_CONTEXT_CHANGED` after it has successfully reconciled the context. | ||
| The `on context changed` function may be invoked simultaneously or in quick succession; the provider must be prepared to handle this case. |
There was a problem hiding this comment.
We are loosening the instruction a bit here around when the PROVIDER_CONTEXT_CHANGED is meant to run, is that deliberate?
Could/should we have that somewhere else?
| > The client **MUST** default, run error hooks, and indicate an error if flag resolution is attempted while the provider is in `NOT_READY`. | ||
|
|
||
| The client defaults and returns the `PROVIDER_NOT_READY` `error code` if evaluation is attempted before the provider is initialized (the provider is still in a `NOT_READY` state). | ||
| The SDK avoids calling the provider's resolver functions entirely ("short-circuits") if the provider is in this state. | ||
|
|
||
| see: [error codes](../types.md#error-code), [flag value resolution](./02-providers.md#22-flag-value-resolution) | ||
|
|
||
| #### Requirement 1.7.7 | ||
|
|
||
| > The client **MUST** default, run error hooks, and indicate an error if flag resolution is attempted while the provider is in `FATAL`. | ||
|
|
||
| The client defaults and returns the `PROVIDER_FATAL` `error code` if evaluation is attempted after the provider has transitioned to an irrecoverable error state. | ||
| The SDK avoids calling the provider's resolver functions entirely ("short-circuits") if the provider is in this state. | ||
|
|
||
| see: [error codes](../types.md#error-code), [flag value resolution](./02-providers.md#22-flag-value-resolution) |
There was a problem hiding this comment.
Do we have the client defaults case on evaluation during NOT_READY/FATAL defined somewhere else?
I guess we'll need to drop the "short-circuit" parts here but just want to make sure we keep the definitions on the defaults.
We could consider moving this into 1.4 alongside 1.4.10, without the short circuit of course.
| "id": "Requirement 1.7.9", | ||
| "machine_id": "requirement_1_7_9", | ||
| "id": "Requirement 1.7.7", | ||
| "machine_id": "requirement_1_7_7", |
There was a problem hiding this comment.
Nit (and very much a question to learn): I'm not very familiar with how these work, but assuming that code/tests are being generated from them. Is this re-numbering a good idea?
|
|
||
| #### Condition 2.8.5 | ||
|
|
||
| > The provider does not define an `initialize` function or an event emission mechanism. |
There was a problem hiding this comment.
Also a bit confused.
With 5.1.1 going from MAY to MUST, does that mean providers without an event mechanism are not conformant to this version of the spec?
With this condition it seems we (SDK) intentionally support them. I guess that is just for the time being? Or expected to long term?
| #### Requirement 5.3.5 | ||
|
|
||
| > If the provider emits an event, the value of the client's `provider status` **MUST** be updated to the status associated with that event **before** the SDK invokes any event handlers for that event, so that handlers observe a consistent status. | ||
| > When a provider emits an event, the SDK **MUST** update the `provider status` to the status associated with that event **before** invoking any event handlers for that event, so that handlers observe a consistent status. |
There was a problem hiding this comment.
The PR description states:
Trivial providers (no lifecycle methods, no event mechanism) are unaffected; the SDK treats them as ready immediately.
This requirement contradicts that.
In short: providers emit events to signal all status transitions; the SDK derives provider status from these events rather than inferring from lifecycle method return values.
This resolves a race condition wherein after the provider's init() completes, but before the SDK has emitted its synthetic READY event, the provider emits some other event, and ordering is not preserved, or messages are lost.
What changes for provider authors:
What changes for SDK authors:
A migration appendix is included for SDKs supporting legacy providers.
Resolves #365
This is an alternative to #380 which took a different approach (adding getState() to providers). That one works but is ~10x larger and introduces concurrent-safety requirements on the provider that don't exist if state lives only in the SDK. Same problem, simpler solution IMO. Happy to discuss tradeoffs.