Debugging Async Slow Runs #41

Merged
xzrderek merged 9 commits into main from derekx/test-long-run
Aug 11, 2025

@xzrderek xzrderek commented Aug 8, 2025

everything should be parallel now :)

tau bench takes ~3 min now instead of 15
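
To illustrate the kind of change described above, here is a minimal sketch of running independent rollouts concurrently with `asyncio.gather` instead of awaiting them one at a time. `run_rollout` and its simulated delay are hypothetical stand-ins, not code from this PR.

```python
import asyncio
import time


async def run_rollout(rollout_id: int, delay: float = 0.05) -> int:
    """Hypothetical rollout: the sleep stands in for awaiting I/O such as a tool call."""
    await asyncio.sleep(delay)
    return rollout_id


async def run_sequential(n: int) -> list:
    # Each rollout finishes before the next one starts.
    return [await run_rollout(i) for i in range(n)]


async def run_parallel(n: int) -> list:
    # All rollouts are scheduled at once; gather returns results in input order.
    return list(await asyncio.gather(*(run_rollout(i) for i in range(n))))


def timed(coro_fn, n: int) -> tuple:
    """Run one of the drivers above and return (results, elapsed seconds)."""
    start = time.perf_counter()
    results = asyncio.run(coro_fn(n))
    return results, time.perf_counter() - start


if __name__ == "__main__":
    _, seq_s = timed(run_sequential, 10)
    _, par_s = timed(run_parallel, 10)
    print(f"sequential: {seq_s:.2f}s, parallel: {par_s:.2f}s")
```

The speedup only appears when the rollouts spend their time awaiting I/O; any blocking call inside a rollout would stall the whole event loop.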

@xzrderek xzrderek changed the title from "WIP" to "Debugging Async Slow Runs" Aug 10, 2025
body = {"seed": session.seed}

-timeout = httpx.Timeout(3.0)
+timeout = httpx.Timeout(15.0)
Contributor Author

cc @mayinghan we should come up with a better solution to this timeout. for complex environments like tau, a reset can definitely take a long time. e.g. it takes ~12 seconds to reset all the environments for airline (each reset loads a large json, and we're doing it on a thread pool, so it's not truly concurrent)
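
As a rough sketch of why a thread pool may not help much here: parsing JSON is CPU-bound, and in standard CPython builds `json.loads` holds the GIL, so the parses largely run one after another even when spread across threads. The names below (`FAKE_DB`, `reset_env`, `reset_all`) are hypothetical and only mimic the shape of the reset described above; they are not taken from the PR.

```python
import json
import time
from concurrent.futures import ThreadPoolExecutor

# Hypothetical stand-in for the large JSON file an environment loads on reset.
FAKE_DB = json.dumps({"flights": [{"id": i, "seats": 100} for i in range(2000)]})


def reset_env(env_id: int) -> int:
    """Hypothetical reset: re-parse the JSON and return how many records were loaded."""
    data = json.loads(FAKE_DB)  # CPU-bound parse; holds the GIL in standard CPython
    return len(data["flights"])


def reset_all(env_ids: list, max_workers: int = 8) -> tuple:
    """Reset every environment on a thread pool; return (record counts, elapsed seconds)."""
    start = time.perf_counter()
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        counts = list(pool.map(reset_env, env_ids))
    return counts, time.perf_counter() - start


if __name__ == "__main__":
    counts, elapsed = reset_all(list(range(16)))
    print(f"reset {len(counts)} envs in {elapsed:.3f}s")
```

Comparing the elapsed time for `max_workers=1` against a larger pool is one way to check how much overlap the threads actually provide for a given workload.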

Contributor

what is this sleep for? I thought we fixed it with the health check?

Contributor Author

it’s not a sleep but a timeout, so if the env reset takes more than 15s, it’ll time out. this reset is called on cleanup when the rollouts end.
can we just remove the timeout altogether, since it’s possible for an env reset to take more than 15s, or is that dangerous?
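
The sleep-versus-timeout distinction can be sketched with the standard library alone. A timeout is only an upper bound, so a fast reset returns as soon as it finishes, while passing `timeout=None` to `asyncio.wait_for` waits indefinitely (httpx likewise treats a `None` timeout as "no timeout"). `reset_env` and `reset_with_timeout` are hypothetical helpers for illustration, not the PR's code.

```python
import asyncio
from typing import Optional


async def reset_env(duration: float) -> str:
    """Hypothetical reset call; the sleep stands in for the server doing work."""
    await asyncio.sleep(duration)
    return "reset ok"


async def reset_with_timeout(duration: float, timeout: Optional[float]) -> str:
    try:
        # Returns as soon as the reset completes; the timeout only caps the wait.
        return await asyncio.wait_for(reset_env(duration), timeout=timeout)
    except asyncio.TimeoutError:
        # The reset exceeded the budget and was cancelled.
        return "timed out"


if __name__ == "__main__":
    print(asyncio.run(reset_with_timeout(0.01, 1.0)))   # fast reset, generous cap
    print(asyncio.run(reset_with_timeout(0.5, 0.05)))   # slow reset, tight cap
    print(asyncio.run(reset_with_timeout(0.01, None)))  # no cap at all
```

The risk with removing the cap entirely is that a reset that never completes would block cleanup forever, which is presumably why some bound is kept.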

Collaborator

maybe we can also consider deleting that session completely from the mcp server? but then the server will never be able to persist any state after a single run

Contributor Author

@xzrderek xzrderek Aug 11, 2025

> but then the server will never be able to persist any state after a single run

i don't quite get what this means. i believe you added reset_session recently, and it's triggered at the end of the rollout. so aren't we already not persisting state after a run?

regardless, i'm gonna merge this in first and we can talk more later. i'm just calling out that the 15s timeout is likely not a viable long-term solution, but it's fine for now.

@xzrderek xzrderek requested review from benjibc and mayinghan August 10, 2025 07:36
@xzrderek xzrderek merged commit 16149d2 into main Aug 11, 2025
7 checks passed
@xzrderek xzrderek deleted the derekx/test-long-run branch August 11, 2025 01:33
3 participants