CCOR-13137 - Handle 401 + 403 (w/auth error) Gracefully#121
CCOR-13137 - Handle 401 + 403 (w/auth error) Gracefully#121chrishagglund-ship-it wants to merge 6 commits into
Conversation
…ion, (react to 403 as well, require error string in json response)
| } | ||
|
|
||
| /** | ||
| * Mints a fresh token. Shared by the lazy cache-TTL reload ({@link #getToken()}) |
There was a problem hiding this comment.
NIT: Referencing the Python SDK is a trap. It's a claim that silently rots (Python changes, this comment becomes a lie no test catches) and a Java SDK reader can't act on it without going to read another language's codebase. Document what the code does on its own terms.
There was a problem hiding this comment.
s.b. cleaned up now
| private static final String TOKEN_CACHE_KEY = "TOKEN"; | ||
|
|
||
| // Stop minting after this many consecutive failures (mirrors the Python SDK). | ||
| private static final int MAX_TOKEN_REFRESH_FAILURES = 5; |
There was a problem hiding this comment.
5 failed refresh attempts (~30s with exponential backoff) and the client is locked out, permanently.
I'm not sure if we are handling this correctly rn. I'll give an example.
There was a problem hiding this comment.
Assume you have a worker. Simplest of workers.
class HelloWorker implements Worker {
@Override
public String getTaskDefName() {
return "hello_task";
}
@Override
public TaskResult execute(Task task) {
result.getOutputData().put("message", "Hello World!");
return result;
}
}
var taskClient = new TaskClient(client);
var runnerConfigurer = new TaskRunnerConfigurer
.Builder(taskClient, List.of(new HelloWorker()))
.withThreadCount(1)
.build();
runnerConfigurer.init();
- You start a worker while the server is spinning up and the server takes longer than 30sec to be ready.
- A refresh happens when the server is down or unreachable for 30sec.
The worker (the process) will keep on running and trying to poll but will never be able to authenticate again.
There was a problem hiding this comment.
As opposed to main which "self-heals"
┌─────────┬─────────────────┬─────────────┐
│ │ hammers /token? │ self-heals? │
├─────────┼─────────────────┼─────────────┤
│ main │ yes (bad) │ yes (good) │
├─────────┼─────────────────┼─────────────┤
│ this PR │ no (good) │ no (bad) │
└─────────┴─────────────────┴─────────────┘
There was a problem hiding this comment.
i agree. please see recent update that addresses these concerns.
mp-orkes
left a comment
There was a problem hiding this comment.
Overall direction might be good (interceptor + backoff), but please address the comments before merge.
The permanent lockout, and the way it's currently handled, is a blocker IMO.
| } | ||
| synchronized (refreshLock) { | ||
| if (tokenRefreshFailures >= MAX_TOKEN_REFRESH_FAILURES) { | ||
| throw new ConductorClientException("Token refresh has failed " + tokenRefreshFailures |
There was a problem hiding this comment.
Once we hit this condition the process can never recover. As already mentioned in this comment.
As it is, this shouldn't be a recoverable Exception. It should be fatal, either a dedicated unchecked error that propagates out, or an explicit process exit.
…ration) - demonstrate token issue locally in different ways
… in a place that makes sense and exit program
|
will now handle the permanent lockout by exiting the program. also enhanced the harness worker to have a basic mode that replicates the simple test worker connect and poll so i could run it conveniently to test that edit: added some tests for the new fatal exception, also did manual testing with the harness worker both full and worker-only modes with bad and valid auth keys, worked as expected. |
CCOR-13137
Rework automatic token refresh to be more like the python implementation, (react to 403 as well, require error string in json response)
Pull Request type
Changes in this PR
Improve handling of errors from requests that were not authorized. Prior work added reactive automatic token refresh however the implementation was naive - this should bring it up to the standards of the preferred implementation from the python sdk.
Implementation requirements:
EXPIRED_TOKENorINVALID_TOKEN, and make an inline attempt to get a fresh token