Skip to content

CCOR-13137 - Handle 401 + 403 (w/auth error) Gracefully#121

Open
chrishagglund-ship-it wants to merge 6 commits into
mainfrom
fix/refresh-token-when-expired
Open

CCOR-13137 - Handle 401 + 403 (w/auth error) Gracefully#121
chrishagglund-ship-it wants to merge 6 commits into
mainfrom
fix/refresh-token-when-expired

Conversation

@chrishagglund-ship-it

@chrishagglund-ship-it chrishagglund-ship-it commented Jun 5, 2026

Copy link
Copy Markdown
Contributor

CCOR-13137

Rework automatic token refresh to be more like the python implementation, (react to 403 as well, require error string in json response)

Pull Request type

  • [ x ] Bug fix

Changes in this PR

Improve handling of errors from requests that were not authorized. Prior work added reactive automatic token refresh however the implementation was naive - this should bring it up to the standards of the preferred implementation from the python sdk.

Implementation requirements:

  • automatically refresh token inline when it is known to be expired
  • intercept 401 + 403 response that has error text values EXPIRED_TOKEN or INVALID_TOKEN, and make an inline attempt to get a fresh token
  • max 5 failures with exponential backoff + reset counter on success
  • fatal exception / terminate with error when retries is exhausted

…ion, (react to 403 as well, require error string in json response)
@codecov

codecov Bot commented Jun 5, 2026

Copy link
Copy Markdown

Codecov Report

❌ Patch coverage is 86.48649% with 10 lines in your changes missing coverage. Please review.

Files with missing lines Patch % Lines
...conductor/client/http/TokenRefreshInterceptor.java 85.29% 4 Missing and 1 partial ⚠️
...kes/conductor/client/http/OrkesAuthentication.java 88.00% 2 Missing and 1 partial ⚠️
...ctor/client/http/FatalAuthenticationException.java 50.00% 2 Missing ⚠️
Files with missing lines Coverage Δ Complexity Δ
...netflix/conductor/client/automator/TaskRunner.java 55.95% <100.00%> (+1.35%) 31.00 <4.00> (+5.00)
...main/java/io/orkes/conductor/client/ApiClient.java 15.06% <100.00%> (ø) 3.00 <0.00> (ø)
...ctor/client/http/FatalAuthenticationException.java 50.00% <50.00%> (ø) 1.00 <1.00> (?)
...kes/conductor/client/http/OrkesAuthentication.java 79.68% <88.00%> (+10.79%) 14.00 <4.00> (+4.00)
...conductor/client/http/TokenRefreshInterceptor.java 85.29% <85.29%> (ø) 11.00 <11.00> (?)

... and 4 files with indirect coverage changes

🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

@chrishagglund-ship-it chrishagglund-ship-it changed the title https://orkes.atlassian.net/browse/CCOR-13137 CCOR-13137 - Handle 401 + 403 (w/auth error) Gracefully Jun 5, 2026
}

/**
* Mints a fresh token. Shared by the lazy cache-TTL reload ({@link #getToken()})

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

NIT: Referencing the Python SDK is a trap. It's a claim that silently rots (Python changes, this comment becomes a lie no test catches) and a Java SDK reader can't act on it without going to read another language's codebase. Document what the code does on its own terms.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

s.b. cleaned up now

private static final String TOKEN_CACHE_KEY = "TOKEN";

// Stop minting after this many consecutive failures (mirrors the Python SDK).
private static final int MAX_TOKEN_REFRESH_FAILURES = 5;

@mp-orkes mp-orkes Jun 10, 2026

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

5 failed refresh attempts (~30s with exponential backoff) and the client is locked out, permanently.

I'm not sure if we are handling this correctly rn. I'll give an example.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Assume you have a worker. Simplest of workers.

class HelloWorker implements Worker {
  @Override
  public String getTaskDefName() {
      return "hello_task";
  }

  @Override
  public TaskResult execute(Task task) {
     result.getOutputData().put("message", "Hello World!");
     return result;
  }
}

var taskClient = new TaskClient(client);
var runnerConfigurer = new TaskRunnerConfigurer
  .Builder(taskClient, List.of(new HelloWorker()))
  .withThreadCount(1)
  .build();

runnerConfigurer.init();
  • You start a worker while the server is spinning up and the server takes longer than 30sec to be ready.
  • A refresh happens when the server is down or unreachable for 30sec.
Image

The worker (the process) will keep on running and trying to poll but will never be able to authenticate again.

@mp-orkes mp-orkes Jun 10, 2026

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As opposed to main which "self-heals"

┌─────────┬─────────────────┬─────────────┐
│         │ hammers /token? │ self-heals? │
├─────────┼─────────────────┼─────────────┤
│ main    │ yes (bad)       │ yes (good)  │
├─────────┼─────────────────┼─────────────┤
│ this PR │ no (good)       │ no (bad)    │
└─────────┴─────────────────┴─────────────┘

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

i agree. please see recent update that addresses these concerns.

@mp-orkes mp-orkes left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Overall direction might be good (interceptor + backoff), but please address the comments before merge.

The permanent lockout, and the way it's currently handled, is a blocker IMO.

}
synchronized (refreshLock) {
if (tokenRefreshFailures >= MAX_TOKEN_REFRESH_FAILURES) {
throw new ConductorClientException("Token refresh has failed " + tokenRefreshFailures

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Once we hit this condition the process can never recover. As already mentioned in this comment.

As it is, this shouldn't be a recoverable Exception. It should be fatal, either a dedicated unchecked error that propagates out, or an explicit process exit.

@mp-orkes mp-orkes requested review from c4lm and manan164 June 10, 2026 15:01
…ration) - demonstrate token issue locally in different ways
… in a place that makes sense and exit program
@chrishagglund-ship-it

chrishagglund-ship-it commented Jun 10, 2026

Copy link
Copy Markdown
Contributor Author

will now handle the permanent lockout by exiting the program. also enhanced the harness worker to have a basic mode that replicates the simple test worker connect and poll so i could run it conveniently to test that

edit: added some tests for the new fatal exception, also did manual testing with the harness worker both full and worker-only modes with bad and valid auth keys, worked as expected.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants