Skip to content

fix: provider state ownership via events#385

Open
toddbaert wants to merge 4 commits into
mainfrom
fix/provider-event-derived-state
Open

fix: provider state ownership via events#385
toddbaert wants to merge 4 commits into
mainfrom
fix/provider-event-derived-state

Conversation

@toddbaert

@toddbaert toddbaert commented May 18, 2026

Copy link
Copy Markdown
Member

In short: providers emit events to signal all status transitions; the SDK derives provider status from these events rather than inferring from lifecycle method return values.

This resolves a race condition wherein after the provider's init() completes, but before the SDK has emitted its synthetic READY event, the provider emits some other event, and ordering is not preserved, or messages are lost.

What changes for provider authors:

  • Providers emit their own lifecycle events (PROVIDER_READY, PROVIDER_ERROR, etc.) instead of the SDK emitting them on the provider's behalf
  • Trivial providers (no lifecycle methods, no event mechanism) are unaffected; the SDK treats them as ready immediately

What changes for SDK authors:

  • The SDK no longer fires synthetic events after lifecycle methods return
  • Provider status is whatever the most recent event implies
  • Short-circuit behavior for NOT_READY and FATAL is no longer required (it created its own race)

A migration appendix is included for SDKs supporting legacy providers.

Resolves #365


This is an alternative to #380 which took a different approach (adding getState() to providers). That one works but is ~10x larger and introduces concurrent-safety requirements on the provider that don't exist if state lives only in the SDK. Same problem, simpler solution IMO. Happy to discuss tradeoffs.

@toddbaert toddbaert requested a review from a team as a code owner May 18, 2026 15:54

@gemini-code-assist gemini-code-assist Bot left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

This pull request updates the OpenFeature specification to shift the responsibility of emitting lifecycle events from the SDK to the providers. By requiring providers to emit events for status transitions such as initialization and context reconciliation, the SDK can derive provider status directly from the event stream, which helps prevent race conditions in multi-threaded environments. The changes include updated requirements, a new migration guide in Appendix E, and documentation adjustments. Reviewers identified an inconsistency where 'shutdown' was included in lists of methods requiring event emission despite the lack of a defined shutdown event; they provided suggestions to remove 'shutdown' from these requirements and ensure consistent terminology for context changes.

Comment thread specification.json Outdated
Comment thread specification/sections/02-providers.md Outdated
Comment thread specification/sections/05-events.md Outdated
toddbaert added 3 commits June 3, 2026 13:54
Providers emit events to signal all status transitions; the SDK derives
provider status from these events rather than inferring from lifecycle
method return values.

- Add section 2.8 (Provider status) with requirements 2.8.1-2.8.4
- Add Condition 2.8.5 for simple providers (no lifecycle, no events)
- Update 1.7.3/4/5 to reference provider events instead of init outcomes
- Change 5.1.1 from MAY to MUST; reframe 5.3.1/2 as event-driven
- Update 5.3.5 table (PROVIDER_CONTEXT_CHANGED -> READY, PROVIDER_RECONCILING -> RECONCILING)
- Add non-normative threading note to 5.3.3
- Add Appendix E (Migrations) for legacy provider compatibility
- Fix lint check for non-requirement entries

Signed-off-by: Todd Baert <todd.baert@dynatrace.com>
Signed-off-by: Todd Baert <todd.baert@dynatrace.com>
Signed-off-by: Todd Baert <todd.baert@dynatrace.com>
@toddbaert toddbaert force-pushed the fix/provider-event-derived-state branch from b37af76 to 6f01fd6 Compare June 3, 2026 17:54

@erka erka left a comment

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice work. Migrating providers will be challenging, but it’s for the best.

@toddbaert

Copy link
Copy Markdown
Member Author

Nice work. Migrating providers will be challenging, but it’s for the best.

Agree

@lukas-reining lukas-reining left a comment

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This looks good to me! Makes sense to go forward with this.
I left some small questions/remarks.

Comment on lines +318 to +319
Providers that do not define lifecycle methods or an event emission mechanism cannot emit events by design; see Condition 2.8.5.
Requirements 2.8.1-2.8.4 apply only to providers that define lifecycle methods and an event emission mechanism.

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can a provider without lifecycle methods but an event emission mechanism emit events?
I feel like in this case this works. But is does not work if the provider has lifecycle methods but no event emission mechanism. Right?
I feel like 2.8.1 indirectly requires an eventing mechanism if there is lifecycle methods.


#### Condition 2.8.5

> The provider does not define an `initialize` function or an event emission mechanism.

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As above I am confused by the or.
If a provider does not have an initialize function we run PROVIDER_READY.
If I get 2.8.1 right, it could basically be:

Suggested change
> The provider does not define an `initialize` function or an event emission mechanism.
> The provider does not define an `initialize` function.

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also a bit confused.
With 5.1.1 going from MAY to MUST, does that mean providers without an event mechanism are not conformant to this version of the spec?
With this condition it seems we (SDK) intentionally support them. I guess that is just for the time being? Or expected to long term?

Comment on lines +338 to +340
> The provider **MUST** emit `PROVIDER_ERROR` if its `initialize` function terminates abnormally.

If the error is irrecoverable, the error code must indicate `PROVIDER_FATAL`.

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

in Requirement 5.3.5 of the current spec we say:

Some providers may emit events spontaneously, based on changes in their internal state (connections, caches, etc). The SDK must update its internal representation of the provider's state accordingly:

Should we add a requirement here that says that providers may emit a PROVIDER_ERROR at any time? I know that we also not say that they can't but I think it would make it clearer.
This would make the non normative from the beginning more misleading as the provider may error without initialization:
Providers that do not define lifecycle methods or an event emission mechanism cannot emit events by design

Comment on lines +43 to +44
The legacy path should be deprecated in the release that introduces the marker, with removal targeted for the next major version.
SDK authors should update any first-party providers and provider base classes to emit their own lifecycle events.

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This makes much sense to me!

The `PROVIDER_CONTEXT_CHANGED` is not emitted from the provider itself; the SDK implementation must run the `PROVIDER_CONTEXT_CHANGED` handlers if the `on context changed` function terminates normally.
It's possible that the `on context changed` function is invoked simultaneously or in quick succession; in this case the SDK will only run the `PROVIDER_CONTEXT_CHANGED` handlers after all reentrant invocations have terminated, and the last to terminate was successful (terminated normally).
The provider must emit `PROVIDER_CONTEXT_CHANGED` after it has successfully reconciled the context.
The `on context changed` function may be invoked simultaneously or in quick succession; the provider must be prepared to handle this case.

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We are loosening the instruction a bit here around when the PROVIDER_CONTEXT_CHANGED is meant to run, is that deliberate?
Could/should we have that somewhere else?

Comment on lines -504 to -518
> The client **MUST** default, run error hooks, and indicate an error if flag resolution is attempted while the provider is in `NOT_READY`.

The client defaults and returns the `PROVIDER_NOT_READY` `error code` if evaluation is attempted before the provider is initialized (the provider is still in a `NOT_READY` state).
The SDK avoids calling the provider's resolver functions entirely ("short-circuits") if the provider is in this state.

see: [error codes](../types.md#error-code), [flag value resolution](./02-providers.md#22-flag-value-resolution)

#### Requirement 1.7.7

> The client **MUST** default, run error hooks, and indicate an error if flag resolution is attempted while the provider is in `FATAL`.

The client defaults and returns the `PROVIDER_FATAL` `error code` if evaluation is attempted after the provider has transitioned to an irrecoverable error state.
The SDK avoids calling the provider's resolver functions entirely ("short-circuits") if the provider is in this state.

see: [error codes](../types.md#error-code), [flag value resolution](./02-providers.md#22-flag-value-resolution)

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we have the client defaults case on evaluation during NOT_READY/FATAL defined somewhere else?
I guess we'll need to drop the "short-circuit" parts here but just want to make sure we keep the definitions on the defaults.

We could consider moving this into 1.4 alongside 1.4.10, without the short circuit of course.

Comment thread specification.json
"id": "Requirement 1.7.9",
"machine_id": "requirement_1_7_9",
"id": "Requirement 1.7.7",
"machine_id": "requirement_1_7_7",

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nit (and very much a question to learn): I'm not very familiar with how these work, but assuming that code/tests are being generated from them. Is this re-numbering a good idea?


#### Condition 2.8.5

> The provider does not define an `initialize` function or an event emission mechanism.

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also a bit confused.
With 5.1.1 going from MAY to MUST, does that mean providers without an event mechanism are not conformant to this version of the spec?
With this condition it seems we (SDK) intentionally support them. I guess that is just for the time being? Or expected to long term?

#### Requirement 5.3.5

> If the provider emits an event, the value of the client's `provider status` **MUST** be updated to the status associated with that event **before** the SDK invokes any event handlers for that event, so that handlers observe a consistent status.
> When a provider emits an event, the SDK **MUST** update the `provider status` to the status associated with that event **before** invoking any event handlers for that event, so that handlers observe a consistent status.

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The PR description states:

Trivial providers (no lifecycle methods, no event mechanism) are unaffected; the SDK treats them as ready immediately.

This requirement contradicts that.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Provider lifecycle race condition in multi-threaded SDKs

4 participants