Skip to content

Surface Galaxy job failures after a single tool run#249

Draft
dannon wants to merge 1 commit into
galaxyproject:mainfrom
dannon:fix/210-job-error-surfacing
Draft

Surface Galaxy job failures after a single tool run#249
dannon wants to merge 1 commit into
galaxyproject:mainfrom
dannon:fix/210-job-error-surfacing

Conversation

@dannon

@dannon dannon commented Jun 9, 2026

Copy link
Copy Markdown
Member

Fixes the trust gap in #210: a galaxy_run_tool whose job fails immediately wasn't showing up in Orbit chat. run_tool returns at submission ("Started tool ..." / HTTP 200), the agent reads that as success and moves on, and -- unlike a workflow invocation -- a single tool run leaves no loom-invocation block, so the background poller never advances it either. The failure stayed invisible until the user explicitly asked.

Two parts, both aimed at the within-turn case:

  • Deterministic nudge. New job-status-hint.ts, a message_end hook mirroring confusables-hint.ts. On a non-error galaxy_run_tool / galaxy_run_user_tool result it appends a short reminder that a 200 means submitted, not finished -- verify the job reached a terminal ok state (galaxy_get_job_details on one of the run's output dataset ids, or check the output dataset's state) and report failures now rather than waiting to be asked. It's idempotent and format-agnostic (no id parsing, no Galaxy call, no timers); it fires at the exact moment the model is about to misread the 200.
  • Sharper prompt guidance. The verification block now splits "workflow invocation" from "single tool run" instead of lumping them, and the Galaxy step section gains a "Running a single tool" subsection spelling out the run_tool -> poll get_job_details -> report-errors path. The old generic guidance already existed and a weak model ignored it, which is why the deterministic hook carries the weight here.

Note: galaxy_get_job_details takes the output dataset id (it returns the job that produced that dataset), not a job id -- the guidance reflects that.

Scope. This covers the within-turn case, which is the reported incident ("Galaxy UI immediately showed error"). Proactive cross-turn polling of tool-run jobs -- the background-poller equivalent for single jobs -- is a larger design call and intentionally left out. A cleaner durable fix would also live upstream in galaxy-mcp (have run_tool return clearer "submitted, not complete" language plus the ids), which would make this brain-side nudge belt-and-suspenders.

Testing: 529 root tests green (15 new across tests/job-status-hint.test.ts + tests/job-run-context.test.ts), root + app typecheck clean. Not yet live-eyeballed in Orbit.

Mitigates #210.

A galaxy_run_tool whose job failed immediately wasn't showing up in
Orbit chat. run_tool returns at submission ("Started tool ..." / HTTP
200), the agent reads that as success and moves on, and unlike a workflow
invocation a single tool run leaves no loom-invocation block, so the
background poller never advances it either -- the failure stayed invisible
until the user explicitly asked.

Add a message_end hook (mirrors confusables-hint) that appends a
"submitted != finished, verify terminal state and report failures"
reminder onto successful run_tool / run_user_tool results, so the agent
checks the job in the same turn. Also sharpen the context guidance: the
verification block now splits workflow invocation from single tool run,
and the Galaxy step section spells out the run_tool -> poll
get_job_details -> report-errors path.

This covers the within-turn case. Proactive cross-turn polling of
tool-run jobs (the background-poller equivalent for jobs) is a separate,
larger design call and is intentionally left out.
leaves no \`loom-invocation\` block — so nothing polls it for you. The
"Started tool …" / HTTP 200 response means the job was **submitted, not
that it finished**. Before you tell the user the step worked, confirm the
job reached a terminal \`ok\` state: call \`galaxy_get_job_details\` with one

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why aren't we directly checking the job via the job id ? Jobs are not obliged to produce outputs at all.

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

or might only produce collections, which would probably fail ?
does this work for map over jobs ?

// they get their own loom-invocation block + background poller.
const JOB_RUN_TOOLS = new Set(["galaxy_run_tool", "galaxy_run_user_tool"]);

export const JOB_SUBMITTED_HINT =

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could this live in a yaml document that is easy to review and update, like the help terms document we have in Galaxy ? The instructions also seem partially redundant with the context.ts changes

@dannon dannon marked this pull request as draft June 11, 2026 01:23
@dannon

dannon commented Jun 11, 2026

Copy link
Copy Markdown
Member Author

This is what I've been iterating on with Claude -- still thinking it through, so welcome your read before I rework the PR:

Both your points landed, and they pushed this somewhere a bit different from what's up now. Short version: the contested how-to (check via a dataset id, mind the jobs with no outputs / only collections / map-over) isn't content that needs a better home -- it's a sign the tool surface is wrong for the question. The only reason the prose has to spell out those caveats is that get_job_details(dataset_id) can't directly answer "did this job reach a terminal ok state?" Give get_job_details a job-id path and the no-output and collection-only caveats go away -- and run_tool already returns the job id(s), so the agent has the handle without guessing. Relocating the prose to a yaml doc or a skill just moves a half-wrong claim into a tidier drawer. The real fix is the tool.

That said, map-over is the case that doesn't fully fall out of a job-id path: a batch run is N jobs, so "check the job" becomes "check all the jobs, or the output collection's populated_state." That's exactly the kind of thing the tool should encapsulate rather than prose -- but worth flagging as the one that needs real thought, not a one-liner.

This is the plan Claude came up with after some iteration -- posting it to react to, not as a settled call. The cut is by what's actually stable:

  • The imperative is stable and belongs in the deterministic hook: "this returns at submission, not completion -- before you report results, confirm the job hit a terminal ok state, and surface a failure now." Zero id-mechanic needed to be correct.
  • The argument shape belongs in the tool schema, not a prose nudge. When prose has to compensate for the tool, that's the smell.

On the redundancy you flagged: the first instinct (move the detail into a skill the hook points at) is actually worse. The whole reason the hook is deterministic is that the ambient guidance already got ignored by a weak model. Shrink it to "verify -- go fetch the skill for how" and you've put a pull-on-demand hop back on the exact path that needed push: the model gets "verify" but has to choose to fetch the method, and if it doesn't it either ignores it (the original bug) or guesses a call (the wrong-dataset-id thing). A deterministic nudge can't depend on a pull for the part that makes it actionable.

Concretely:

  1. Slim the hook to the pure imperative -- no dataset-id mechanic, no skill pointer. That answers the correctness point by deleting the wrong claim, not relocating it.
  2. Dedupe context.ts to the one fact that isn't redundant: a single tool run is a job, not an invocation, so nothing polls it.
  3. File the galaxy-mcp job-id path as a separate issue -- that's the structural fix for the no-output / collection / map-over cases. This loom PR ships on its own; the hook's imperative stays permanently (weak models ignore ambient guidance), and the tool fix just removes any temptation to put the mechanic back into prose.

On the yaml idea: standing up a doc for a single gotcha is overkill, but you're onto something longer-term -- there's a growing family of these deterministic hint injectors, and a loom-local data-driven hint registry is probably the right "data not code" move for that family. Worth doing as its own refactor when it's clearly earning its keep, rather than bolting it onto this issue.

Flipped back to draft while we sort the shape. Does the tool-fix-first framing match how you'd want to handle the no-output / collection / map-over cases, or is there a reason to keep a dataset-path stopgap in loom in the meantime?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants