Surface Galaxy job failures after a single tool run#249
Conversation
A galaxy_run_tool whose job failed immediately wasn't showing up in
Orbit chat. run_tool returns at submission ("Started tool ..." / HTTP
200), the agent reads that as success and moves on, and unlike a workflow
invocation a single tool run leaves no loom-invocation block, so the
background poller never advances it either -- the failure stayed invisible
until the user explicitly asked.
Add a message_end hook (mirrors confusables-hint) that appends a
"submitted != finished, verify terminal state and report failures"
reminder onto successful run_tool / run_user_tool results, so the agent
checks the job in the same turn. Also sharpen the context guidance: the
verification block now splits workflow invocation from single tool run,
and the Galaxy step section spells out the run_tool -> poll
get_job_details -> report-errors path.
This covers the within-turn case. Proactive cross-turn polling of
tool-run jobs (the background-poller equivalent for jobs) is a separate,
larger design call and is intentionally left out.
| leaves no \`loom-invocation\` block — so nothing polls it for you. The | ||
| "Started tool …" / HTTP 200 response means the job was **submitted, not | ||
| that it finished**. Before you tell the user the step worked, confirm the | ||
| job reached a terminal \`ok\` state: call \`galaxy_get_job_details\` with one |
There was a problem hiding this comment.
why aren't we directly checking the job via the job id ? Jobs are not obliged to produce outputs at all.
There was a problem hiding this comment.
or might only produce collections, which would probably fail ?
does this work for map over jobs ?
| // they get their own loom-invocation block + background poller. | ||
| const JOB_RUN_TOOLS = new Set(["galaxy_run_tool", "galaxy_run_user_tool"]); | ||
|
|
||
| export const JOB_SUBMITTED_HINT = |
There was a problem hiding this comment.
Could this live in a yaml document that is easy to review and update, like the help terms document we have in Galaxy ? The instructions also seem partially redundant with the context.ts changes
|
This is what I've been iterating on with Claude -- still thinking it through, so welcome your read before I rework the PR: Both your points landed, and they pushed this somewhere a bit different from what's up now. Short version: the contested how-to (check via a dataset id, mind the jobs with no outputs / only collections / map-over) isn't content that needs a better home -- it's a sign the tool surface is wrong for the question. The only reason the prose has to spell out those caveats is that That said, map-over is the case that doesn't fully fall out of a job-id path: a batch run is N jobs, so "check the job" becomes "check all the jobs, or the output collection's This is the plan Claude came up with after some iteration -- posting it to react to, not as a settled call. The cut is by what's actually stable:
On the redundancy you flagged: the first instinct (move the detail into a skill the hook points at) is actually worse. The whole reason the hook is deterministic is that the ambient guidance already got ignored by a weak model. Shrink it to "verify -- go fetch the skill for how" and you've put a pull-on-demand hop back on the exact path that needed push: the model gets "verify" but has to choose to fetch the method, and if it doesn't it either ignores it (the original bug) or guesses a call (the wrong-dataset-id thing). A deterministic nudge can't depend on a pull for the part that makes it actionable. Concretely:
On the yaml idea: standing up a doc for a single gotcha is overkill, but you're onto something longer-term -- there's a growing family of these deterministic hint injectors, and a loom-local data-driven hint registry is probably the right "data not code" move for that family. Worth doing as its own refactor when it's clearly earning its keep, rather than bolting it onto this issue. Flipped back to draft while we sort the shape. Does the tool-fix-first framing match how you'd want to handle the no-output / collection / map-over cases, or is there a reason to keep a dataset-path stopgap in loom in the meantime? |
Fixes the trust gap in #210: a
galaxy_run_toolwhose job fails immediately wasn't showing up in Orbit chat.run_toolreturns at submission ("Started tool ..." / HTTP 200), the agent reads that as success and moves on, and -- unlike a workflow invocation -- a single tool run leaves noloom-invocationblock, so the background poller never advances it either. The failure stayed invisible until the user explicitly asked.Two parts, both aimed at the within-turn case:
job-status-hint.ts, amessage_endhook mirroringconfusables-hint.ts. On a non-errorgalaxy_run_tool/galaxy_run_user_toolresult it appends a short reminder that a 200 means submitted, not finished -- verify the job reached a terminalokstate (galaxy_get_job_detailson one of the run's output dataset ids, or check the output dataset's state) and report failures now rather than waiting to be asked. It's idempotent and format-agnostic (no id parsing, no Galaxy call, no timers); it fires at the exact moment the model is about to misread the 200.Note:
galaxy_get_job_detailstakes the output dataset id (it returns the job that produced that dataset), not a job id -- the guidance reflects that.Scope. This covers the within-turn case, which is the reported incident ("Galaxy UI immediately showed error"). Proactive cross-turn polling of tool-run jobs -- the background-poller equivalent for single jobs -- is a larger design call and intentionally left out. A cleaner durable fix would also live upstream in galaxy-mcp (have
run_toolreturn clearer "submitted, not complete" language plus the ids), which would make this brain-side nudge belt-and-suspenders.Testing: 529 root tests green (15 new across
tests/job-status-hint.test.ts+tests/job-run-context.test.ts), root + app typecheck clean. Not yet live-eyeballed in Orbit.Mitigates #210.