Tool Error Recovery

skobeltsyn edited this page Mar 28, 2026 · 2 revisions

LLMs produce malformed tool calls. Agents.KT lets you fix them -- with code, with another agent, or with both.


The Problem

Large language models are probabilistic text generators. When they produce tool calls, things go wrong in predictable ways:

  • Trailing commas in JSON: {"a": 1, "b": 2,}
  • Markdown fencing around arguments: ```json\n{"a": 1}\n```
  • Wrong types: a number sent as a string "42" instead of 42
  • Missing required fields: the model forgets a parameter
  • Runtime failures: the tool itself throws because of bad input or transient errors

Without error recovery, any of these failures kills the agentic loop. The agent stops, the user gets nothing.


The Agents.KT Answer

Most frameworks handle malformed tool calls with special parser classes, retry middleware, or string-cleaning helpers buried in utility packages.

Agents.KT takes a different approach: the fixer is an agent. The Agent<IN, OUT> interface you use to build your application is the same one you use to repair broken tool calls. No new abstraction. No special machinery.

This means repair logic gets the full power of the framework: it can be deterministic (a pure function), LLM-driven (an agent with its own model), or a composition of both.


Error Taxonomy

Tool errors form a sealed hierarchy with four variants:

sealed interface ToolError {
    data class InvalidArgs(
        val rawArgs: String,
        val parseError: String,
        val expectedSchema: JsonSchema
    ) : ToolError

    data class DeserializationError(
        val rawValue: String,
        val targetType: KType,
        val cause: Throwable
    ) : ToolError

    data class ExecutionError(
        val args: ToolArgs,
        val cause: Throwable
    ) : ToolError

    data class EscalationError(
        val source: AgentRef,
        val reason: String,
        val severity: Severity,
        val originalError: ToolError,
        val attempts: Int
    ) : ToolError
}

| Error Type | When It Fires | Typical Cause |
|---|---|---|
| InvalidArgs | JSON parsing fails | Trailing commas, markdown fencing, truncated output |
| DeserializationError | JSON parses but cannot map to expected types | "42" instead of 42, missing keys |
| ExecutionError | Tool executor throws | Bad input values, transient I/O failures, business logic errors |
| EscalationError | Repair itself fails and escalates up | Exhausted retries, unrecoverable state |

The sealed hierarchy means when expressions are exhaustive -- the compiler tells you if you miss a case.
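
As a minimal sketch of that exhaustiveness check, the snippet below mirrors the hierarchy with simplified stand-in field types (plain strings and Throwable in place of JsonSchema, KType, ToolArgs, and AgentRef, which this page does not define). ToolErrorSketch and describe are hypothetical names used only for illustration. Deleting any branch of the when should make it fail to compile.

```kotlin
// Simplified stand-ins for the hierarchy above; the field types are assumptions.
sealed interface ToolErrorSketch {
    data class InvalidArgs(val rawArgs: String, val parseError: String) : ToolErrorSketch
    data class DeserializationError(val rawValue: String, val cause: Throwable) : ToolErrorSketch
    data class ExecutionError(val cause: Throwable) : ToolErrorSketch
    data class EscalationError(val reason: String, val attempts: Int) : ToolErrorSketch
}

// `when` used as an expression over a sealed type must cover every variant,
// so no `else` branch is needed.
fun describe(error: ToolErrorSketch): String = when (error) {
    is ToolErrorSketch.InvalidArgs -> "unparseable arguments: ${error.parseError}"
    is ToolErrorSketch.DeserializationError -> "type mismatch: ${error.cause.message}"
    is ToolErrorSketch.ExecutionError -> "tool threw: ${error.cause.message}"
    is ToolErrorSketch.EscalationError -> "escalated after ${error.attempts} attempts: ${error.reason}"
}
```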


The onError DSL

Each tool can declare error handlers using the onError {} block. Inside, three verbs match the three non-escalation error types:

tool("write_file", "Write content to a file") { args ->
    val path = args["path"] as String
    val content = args["content"] as String
    fileSystem.write(path, content)
}
onError {
    invalidArgs { args, error ->
        fix { args.trimMarkdownFencing() }
    }
    deserializationError { raw, error ->
        sanitize { raw.normalizePathSeparators() }
    }
    executionError { e ->
        retry(maxAttempts = 3, backoff = exponential())
    }
}

| Verb | Error Type | Purpose |
|---|---|---|
| invalidArgs { } | InvalidArgs | Fix unparseable JSON |
| deserializationError { } | DeserializationError | Fix type mismatches |
| executionError { } | ExecutionError | Handle runtime failures |

Deterministic Repair

The simplest recovery strategy is a pure function. No LLM, no network call -- just string manipulation.

fix { } -- Repair Invalid Arguments

onError {
    invalidArgs { args, error ->
        fix {
            args
                .trimMarkdownFencing()       // strip ```json ... ```
                .replace(Regex(",\\s*}"), "}") // remove trailing commas
                .replace(Regex(",\\s*]"), "]") // remove trailing commas in arrays
        }
    }
}

The lambda receives the raw argument string and returns a cleaned version. The framework re-parses the cleaned string and retries the tool call.

sanitize { } -- Repair Deserialization Errors

onError {
    deserializationError { raw, error ->
        sanitize {
            raw.normalizePathSeparators()   // backslash to forward slash
        }
    }
}

Same idea: transform the raw value so it deserializes correctly.
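
For the "42" instead of 42 case listed under The Problem, a sanitize step could be as small as the sketch below. unquoteIntegers is a hypothetical helper, not part of Agents.KT, and it is deliberately naive.

```kotlin
// Hypothetical helper: turn quoted integer values such as "42" into bare numbers.
// Naive on purpose: it also unquotes fields that are meant to stay strings
// (IDs, postal codes), so apply it only where the schema expects a number.
fun unquoteIntegers(raw: String): String =
    raw.replace(Regex(":\\s*\"(-?\\d+)\""), ": $1")
```

Given {"count": "42", "name": "x"}, this returns {"count": 42, "name": "x"}; the non-numeric value is left untouched.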

retry() -- Retry on Execution Errors

onError {
    executionError { e ->
        retry(maxAttempts = 3, backoff = exponential())
    }
}

This re-runs the tool executor with the same arguments. The backoff parameter controls the delay between attempts. Use this for transient failures like network timeouts or rate limits.
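
To make the backoff behaviour concrete, here is a generic sketch of retry with exponential backoff. It is not the Agents.KT implementation: retryWithBackoff, initialDelayMs, and factor are assumed names and defaults chosen for illustration.

```kotlin
// Sketch of retry with exponential backoff; names and defaults are assumptions.
fun <T> retryWithBackoff(
    maxAttempts: Int,
    initialDelayMs: Long = 100,
    factor: Double = 2.0,
    block: (attempt: Int) -> T
): T {
    require(maxAttempts >= 1) { "maxAttempts must be at least 1" }
    var delayMs = initialDelayMs
    var lastError: Exception? = null
    for (attempt in 1..maxAttempts) {
        try {
            return block(attempt)
        } catch (e: Exception) {
            lastError = e
            if (attempt < maxAttempts) {
                Thread.sleep(delayMs)                  // wait before the next attempt
                delayMs = (delayMs * factor).toLong()  // 100 ms, 200 ms, 400 ms, ...
            }
        }
    }
    throw lastError!!  // every attempt failed: surface the last failure
}
```

The delay grows only between attempts, so a call that succeeds on the first try pays no waiting cost, and the last failure is rethrown rather than swallowed.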


LLM-Driven Repair

When deterministic cleanup is not enough -- the JSON is too mangled, the error is too novel -- you can delegate repair to an agent.

Defining a Repair Agent

A repair agent is a regular Agent<String, String>. It takes the broken input as a string and returns a fixed string:

val jsonFixer = agent<String, String>("json-fixer") {
    prompt = """
        You are a JSON repair tool. You receive malformed JSON and return
        valid JSON. Do not add or remove fields. Only fix syntax errors.
        Return ONLY the fixed JSON, no explanation.
    """.trimIndent()

    model {
        ollama("qwen2.5:7b")
        temperature = 0.0   // greedy decoding: as repeatable as the model allows
    }

    budget { maxTurns = 1 }   // single-shot, no tool loop

    skills {
        skill<String, String>("fix-json", "Repairs broken JSON") {
            implementedBy { input -> input }  // LLM does the work via prompt
        }
    }
}

Using a Repair Agent in onError

tool("create_task", "Create a new task") { args ->
    val title = args["title"] as String
    taskService.create(title)
}
onError {
    invalidArgs { args, error ->
        fix(agent = jsonFixer, retries = 3)
    }
}

The framework sends the broken arguments to jsonFixer, re-parses the output, and retries the tool call. If the repaired output still fails to parse, the framework calls jsonFixer again, up to 3 times, before giving up.


Hybrid Strategies

The most robust approach: try deterministic repair first, and fall back to the LLM only if the deterministic fix returns null:

onError {
    invalidArgs { args, error ->
        fix {
            // Attempt 1: simple cleanup
            tryJsonCleanup(args)   // returns null if cleanup is insufficient
        } ?: fix(agent = jsonFixer, retries = 3)
            // Attempt 2: LLM-driven repair if deterministic fix returned null
    }
}

This gives you the speed of string manipulation for common cases (trailing commas, fencing) and the intelligence of an LLM for edge cases.
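
The fallback order can be modelled outside the framework as a list of repair strategies tried in sequence. This is a sketch of the idea only: repairWith, stripTrailingComma, and stubModelFixer are hypothetical names, and the stub stands in for a real model call.

```kotlin
// Try each strategy in order and return the first non-null result.
// A strategy returns the fixed string, or null when it cannot help.
fun repairWith(raw: String, strategies: List<(String) -> String?>): String? =
    strategies.firstNotNullOfOrNull { strategy -> strategy(raw) }

// Cheap deterministic strategy: only handles a trailing comma before '}'.
val stripTrailingComma: (String) -> String? = { raw ->
    val cleaned = raw.replace(Regex(",\\s*}"), "}")
    if (cleaned != raw) cleaned else null
}

// Stand-in for an expensive model-backed fixer; it records each call so the
// fallback order can be observed. A real fixer would call an LLM here.
fun stubModelFixer(calls: MutableList<String>): (String) -> String? = { raw ->
    calls += raw
    "{}"
}
```

Because firstNotNullOfOrNull stops at the first non-null result, the expensive strategy runs only when the cheap one declines.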

Helper Function Example

fun tryJsonCleanup(raw: String): String? {
    val cleaned = raw
        .trim()
        .removePrefix("```json").removePrefix("```")
        .removeSuffix("```")
        .trim()
        .replace(Regex(",\\s*}"), "}")
        .replace(Regex(",\\s*]"), "]")

    return try {
        // Verify it parses
        JsonParser.parse(cleaned)
        cleaned
    } catch (e: Exception) {
        null   // signal: deterministic fix was not enough
    }
}

Deterministic Agent

A repair agent does not have to use an LLM. You can build a fully deterministic agent using implementedBy:

val regexFixer = agent<String, String>("regex-fixer") {
    skills {
        skill<String, String>("fix", "Fix JSON with regex") {
            implementedBy { input ->
                input
                    .replace(Regex("(?s)```json\\s*(.+?)\\s*```"), "$1")
                    .replace(Regex(",\\s*([}\\]])"), "$1")
                    .replace(Regex("'"), "\"")  // naive: also rewrites apostrophes inside string values
            }
        }
    }
}

No LLM calls, no network latency, no inference cost -- yet it conforms to the Agent<String, String> interface, so it plugs into fix(agent = ...) unchanged. The framework does not care how the agent produces its output.
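
To see what the three regex passes do, the sketch below applies them as a plain function outside the framework. regexFix and FENCE are illustrative names; the regexes are the ones used by regex-fixer.

```kotlin
// Three backticks, built indirectly so this snippet can sit inside a fenced block.
val FENCE = "`".repeat(3)

// The same three regex passes as regex-fixer, as a plain function for experimentation.
fun regexFix(input: String): String =
    input
        .replace(Regex("(?s)${FENCE}json\\s*(.+?)\\s*$FENCE"), "$1")  // unwrap fenced JSON
        .replace(Regex(",\\s*([}\\]])"), "$1")                        // drop trailing commas
        .replace(Regex("'"), "\"")                                    // single to double quotes
```

For a fenced input whose body is {'a': 1, 'b': [1, 2,],} the function returns {"a": 1, "b": [1, 2]}. The quote pass is blunt: {"name": "O'Brien"} comes back with the apostrophe turned into a double quote, which breaks the string, so a production fixer would need a quote-aware pass.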


Tool-Level Defaults

When many tools share the same error handling, declare it once in a defaults {} block inside the skill:

skills {
    skill<String, String>("data-ops", "Data operations") {
        tools("read_file", "write_file", "delete_file")

        defaults {
            onError {
                invalidArgs { args, error ->
                    fix { tryJsonCleanup(args) } ?: fix(agent = jsonFixer, retries = 2)
                }
                executionError { e ->
                    retry(maxAttempts = 3, backoff = exponential())
                }
            }
        }

        tool("read_file", "Read a file") { args ->
            fileSystem.read(args["path"] as String)
        }

        tool("write_file", "Write a file") { args ->
            fileSystem.write(args["path"] as String, args["content"] as String)
        }

        tool("delete_file", "Delete a file") { args ->
            fileSystem.delete(args["path"] as String)
        }
        // Per-tool override: delete_file has stricter handling
        onError("delete_file") {
            executionError { e ->
                // No retry for destructive operations
                escalate()
            }
        }
    }
}

The rule: per-tool onError overrides defaults for that specific tool. All other tools inherit the defaults.
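
One way to picture the override rule is as a per-handler merge. This is an assumption inferred from the Complete Example, where read_file keeps the shared invalidArgs handler while defining its own executionError. Handlers and resolve are hypothetical names, and each slot holds a strategy name instead of a real handler for brevity.

```kotlin
// Hypothetical model of handler resolution; not Agents.KT types.
data class Handlers(
    val invalidArgs: String? = null,
    val deserializationError: String? = null,
    val executionError: String? = null
)

// A per-tool handler wins where it is set; unset slots fall back to the defaults.
fun resolve(defaults: Handlers, override: Handlers?): Handlers =
    if (override == null) defaults
    else Handlers(
        invalidArgs = override.invalidArgs ?: defaults.invalidArgs,
        deserializationError = override.deserializationError ?: defaults.deserializationError,
        executionError = override.executionError ?: defaults.executionError
    )
```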


Escalation and Throwing

When repair fails, the tool has two fundamentally different options. Understanding the distinction is critical -- it determines whether your system degrades gracefully or crashes hard.

escalate() -- Soft Failure

executionError { e ->
    escalate("File not found, cannot retry", severity = Severity.MEDIUM)
}

escalate() does not throw. It does not break the call stack. Instead, it wraps the error in an EscalationError and walks up the structure {} delegation tree -- from the repair agent to the tool, from the tool to its parent agent, from the parent agent to its parent, until something handles it.

The EscalationError carries everything the handler needs to decide:

data class EscalationError(
    val source: AgentRef,        // who escalated
    val reason: String,          // why they gave up
    val severity: Severity,      // LOW, MEDIUM, HIGH, CRITICAL
    val originalError: ToolError, // the root cause
    val attempts: Int            // how many repair attempts were made
) : ToolError

What a parent agent can do when it receives an escalation:

// In the parent agent's structure handler:
onEscalation { error ->
    when {
        // Retry with a stronger model
        error.severity == Severity.LOW ->
            retryWithModel("llama3:70b")

        // Route to a different skill that avoids this tool
        error.severity == Severity.MEDIUM ->
            rerouteToSkill("manual-fallback")

        // Give up gracefully with a meaningful result
        error.severity == Severity.HIGH ->
            returnPartialResult("Completed 3/5 tasks, failed on: ${error.reason}")

        // Escalate further up the tree
        error.severity == Severity.CRITICAL ->
            escalate(error.reason, Severity.CRITICAL)
    }
}

The key property: the agent that understands the problem handles it. A JSON repair agent knows nothing about business context -- it should escalate. The parent agent, which knows the domain, decides what "file not found" means for the workflow.
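
The walk up the delegation tree can be sketched with a parent pointer and an optional handler per agent. This models the idea only: AgentNode, Escalation, and escalateFrom are hypothetical names, and the real framework's types and dispatch may differ.

```kotlin
enum class Severity { LOW, MEDIUM, HIGH, CRITICAL }

data class Escalation(val source: String, val reason: String, val severity: Severity)

// One agent in the delegation tree. `onEscalation` returns a result when the
// agent handles the escalation, or null to let it pass through to the parent.
class AgentNode(
    val name: String,
    val parent: AgentNode? = null,
    val onEscalation: ((Escalation) -> String?)? = null
)

// Walk up from `start` until some ancestor handles the escalation.
// Returns null when nobody in the chain handles it.
fun escalateFrom(start: AgentNode, escalation: Escalation): String? {
    var current: AgentNode? = start.parent
    while (current != null) {
        val handled = current.onEscalation?.invoke(escalation)
        if (handled != null) return "${current.name}: $handled"
        current = current.parent
    }
    return null
}
```

Agents without a handler never see the escalation; the loop simply moves past them to the next ancestor.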

throwException() -- Hard Failure

executionError { e ->
    throwException("Credentials invalid, cannot proceed")
}

throwException() is a hard stop. It throws a ToolExecutionException that propagates through the pipeline like any JVM exception. The agentic loop halts. No recovery. No walking up the tree.

When to use throwException():

  • Security violations -- tool attempted to access something it shouldn't
  • Data corruption -- continuing would make things worse
  • Development bugs -- you want the test to fail loudly
  • Invariant violations -- the system is in a state that should be impossible

onError("delete_database") {
    executionError { e ->
        when (e.cause) {
            is SecurityException -> throwException("Unauthorized deletion attempt")
            is ConnectionException -> retry(maxAttempts = 2, backoff = exponential())
            else -> escalate("Unexpected error during deletion")
        }
    }
}

escalate() vs throwException() -- Decision Table

| Situation | Use | Why |
|---|---|---|
| Network timeout | retry() | Transient, likely to resolve |
| File not found | escalate() | Parent agent might try a different path |
| LLM returned garbage JSON 3 times | escalate() | Parent might switch models or skip the task |
| Invalid API credentials | throwException() | No agent can fix bad credentials |
| Disk full | escalate() | Parent might clean up or write elsewhere |
| Data corruption detected | throwException() | Continuing would make it worse |
| Tool is deprecated | escalate() | Parent can route to a replacement tool |
| Null pointer in tool executor | throwException() | This is a bug, surface it |

Escalation Flow

LLM calls tool with malformed args
  |
  v
onError.invalidArgs handler runs
  |
  +--> fix { cleanup(args) } succeeds
  |       |
  |       v
  |    Tool re-invoked with fixed args --> result returned, loop continues
  |
  +--> fix returns null (can't clean up deterministically)
         |
         v
       fix(agent = jsonFixer, retries = 3)
         |
         +--> jsonFixer fixes it on attempt 2 --> tool re-invoked, loop continues
         |
         +--> jsonFixer fails 3 times
                |
                +--> jsonFixer calls escalate("Schema mismatch, not a formatting issue")
                       |
                       v
                     EscalationError created (source=json-fixer, attempts=3)
                       |
                       v
                     Parent agent "coder" receives EscalationError
                       |
                       +--> coder handles: retries with stronger model
                       |       |
                       |       v
                       |    Stronger model produces valid tool call --> succeeds
                       |
                       +--> coder can't handle either
                               |
                               +--> coder escalates to ITS parent in structure {}
                               |
                               +--> or coder calls throwException() --> pipeline stops

Repair Agents Get These Tools Automatically

When an Agent<String, String> is used as a repair agent inside onError, the framework automatically injects two tools:

// Available inside any repair agent:
tool("escalate") {
    param("reason", STRING)
    param("severity", ENUM<Severity>)   // LOW, MEDIUM, HIGH, CRITICAL
    returns(Nothing)                     // never returns -- exits repair loop
}

tool("throwException") {
    param("reason", STRING)
    returns(Nothing)                     // never returns -- throws exception
}

This means an LLM-driven repair agent can decide for itself whether to escalate or throw. If the fixer model looks at the broken JSON and determines it's not a formatting problem but a schema mismatch, it can call escalate() with a reason -- and the parent agent receives that reason as context.

val smartFixer = agent<String, String>("smart-fixer") {
    prompt("""
        Fix the malformed JSON. If you can fix it, return the fixed JSON.
        If the problem is NOT formatting (e.g., wrong schema, missing required
        fields that you cannot infer), call escalate() with a clear reason.
        If the input is not JSON at all, call throwException().
    """.trimIndent())
    model { ollama("qwen2.5:7b"); temperature = 0.0 }
    budget { maxTurns = 3 }
    tools { escalate(); throwException() }
    skills {
        skill<String, String>("fix", "Fix or escalate JSON problems") {
            tools("escalate", "throwException")
        }
    }
}

Why Not Just Throw? Exceptions vs Escalation

The Eckel Argument

Bruce Eckel, in Thinking in Java, made an argument against Java's checked exceptions that shaped a generation of language design. His core observation:

Checked exceptions force every layer between the error source and the handler to declare or catch exceptions they don't understand. This creates coupling. The intermediate layers have no idea what to do with a SQLException -- they just re-throw it or, worse, swallow it.

The result was the infamous anti-pattern:

try {
    database.query(sql);
} catch (SQLException e) {
    // TODO: handle this
}

Eckel's position: exceptions should propagate up until someone who actually knows what to do catches them. Intermediate layers should not be forced to participate. This led him to favor unchecked exceptions -- errors that flow upward silently until a competent handler stops them. C# and, later, Kotlin made the same choice: neither language has checked exceptions.

How Agents.KT Applies This

Agents.KT's error recovery was designed with Eckel's insight in mind, but adapted for agent systems where "layers" are not call stacks but delegation trees.

| Eckel's Critique of Java | Agents.KT's Answer |
|---|---|
| Checked exceptions force every layer to handle or declare | escalate() walks up the tree silently -- intermediate agents don't need to know |
| Intermediate layers swallow exceptions they can't handle | Only agents with onEscalation {} handlers participate -- others pass through |
| catch (Exception e) {} hides bugs | EscalationError carries full context: reason, severity, original error, attempt count |
| Checked vs unchecked is a binary choice | Three-tier model: fix/retry (handle locally), escalate (walk up), throw (hard stop) |
| Exception handling is a side channel | Repair is the main channel -- same Agent<IN, OUT> interface as everything else |

Errors as Values, Not Control Flow

The deepest difference: Java exceptions (checked or unchecked) are control flow -- they unwind the stack, skip past code, and land in a catch block. The code between throw and catch is abandoned.

In Agents.KT, errors are values. A ToolError is a data class. An EscalationError is a data class. They don't unwind anything. They flow through the delegation tree as structured data that each agent can inspect, enrich, and act on.

// Java exception: control flow, stack unwinding, context lost
throw new ToolCallException("malformed JSON: trailing comma");

// Agents.KT escalation: value, delegation tree, context preserved
EscalationError(
    source = AgentRef("json-fixer"),
    reason = "Schema mismatch, not a formatting issue",
    severity = Severity.MEDIUM,
    originalError = InvalidArgs(
        rawArgs = "{...}",
        parseError = "...",
        expectedSchema = schema   // placeholder: the failing tool's JsonSchema
    ),
    attempts = 3
)

The parent agent receives a complete picture: who tried to fix it, why they failed, how many times, how severe they think it is, and what the original error was. Compare this with a Java catch block that receives a Throwable with a string message and a stack trace.

The Three-Tier Error Model

Eckel saw two tiers: checked (forced handling) and unchecked (optional handling). Agents.KT has three:

Tier 1: Handle locally
   fix { }, sanitize { }, retry()
   → "I know how to fix this. I'll handle it right here."
   → Like a method fixing its own input before proceeding.

Tier 2: Escalate
   escalate(reason, severity)
   → "I can't fix this, but it's not fatal. Someone upstream might know what to do."
   → Like Eckel's ideal unchecked exception: flows up until competence is found.

Tier 3: Throw
   throwException(reason)
   → "This is fundamentally broken. Stop everything."
   → Like a RuntimeException for genuine bugs or security violations.

The key insight from Eckel that Agents.KT preserves: the entity that understands the problem should handle it. A JSON repair agent knows syntax. A parent agent knows business context. The structure {} tree is the org chart -- escalation follows reporting lines, not call stacks.
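
The three tiers can also be written down as a value that a handler returns. The sketch below is illustrative only: Tier and classify are hypothetical names, and the mapping from exception type to tier follows the decision table above.

```kotlin
import java.io.FileNotFoundException
import java.io.IOException

// The three tiers as a value; names are illustrative, not Agents.KT types.
sealed interface Tier {
    data class HandleLocally(val maxAttempts: Int) : Tier   // Tier 1: retry in place
    data class Escalate(val reason: String) : Tier          // Tier 2: hand to the parent
    data class Throw(val reason: String) : Tier             // Tier 3: hard stop
}

// Order matters: FileNotFoundException is an IOException, so it is matched first.
fun classify(cause: Throwable): Tier = when (cause) {
    is FileNotFoundException -> Tier.Escalate("file not found")
    is IOException -> Tier.HandleLocally(maxAttempts = 3)
    is SecurityException -> Tier.Throw("unauthorized access")
    else -> Tier.Throw("unexpected: ${cause.javaClass.simpleName}")
}
```

Treating unknown failures as Tier 3 is a design choice in this sketch: an error nobody anticipated is more likely a bug than something a parent agent can route around.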


Complete Example

An agent with multiple tools, each with tailored error recovery:

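import java.io.File
import java.io.FileNotFoundException
import java.io.IOException
// Agents.KT imports omitted

val jsonFixer = agent<String, String>("json-fixer") {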
    prompt = "Fix the malformed JSON. Return only valid JSON."
    model { ollama("qwen2.5:7b"); temperature = 0.0 }
    budget { maxTurns = 1 }
    skills {
        skill<String, String>("fix", "Fix JSON") {
            implementedBy { it }
        }
    }
}

val fileAgent = agent<String, String>("file-manager") {
    prompt = "You manage files. Use tools to read, write, and list files."

    model { ollama("qwen2.5:7b") }
    budget { maxTurns = 10 }

    skills {
        skill<String, String>("manage-files", "File management operations") {
            tools("read_file", "write_file", "list_dir")

            // Shared defaults
            defaults {
                onError {
                    invalidArgs { args, error ->
                        fix { tryJsonCleanup(args) } ?: fix(agent = jsonFixer, retries = 2)
                    }
                }
            }

            tool("read_file", "Read file contents by path") { args ->
                val path = args["path"] as String
                File(path).readText()
            }
            onError("read_file") {
                executionError { e ->
                    when (e.cause) {
                        is FileNotFoundException -> escalate()
                        is IOException -> retry(maxAttempts = 3, backoff = exponential())
                        else -> throwException()
                    }
                }
            }

            tool("write_file", "Write content to a file") { args ->
                val path = args["path"] as String
                val content = args["content"] as String
                File(path).writeText(content)
                "Written ${content.length} bytes to $path"
            }
            onError("write_file") {
                deserializationError { raw, error ->
                    sanitize { raw.normalizePathSeparators() }
                }
                executionError { e ->
                    retry(maxAttempts = 2, backoff = exponential())
                }
            }

            tool("list_dir", "List files in a directory") { args ->
                val path = args["path"] as String
                File(path).listFiles()?.map { it.name } ?: emptyList<String>()
            }
            // list_dir inherits defaults -- no per-tool override needed
        }
    }

    onToolUse { name, args, result ->
        println("[$name] args=$args result=$result")
    }
}

// Usage
val result = fileAgent("Read the contents of /tmp/config.json and summarize it")

In this example:

  • All three tools share the invalidArgs default (deterministic cleanup, then LLM fixer).
  • read_file escalates on missing files, retries on I/O errors, and throws on unexpected failures.
  • write_file sanitizes path separators and retries on execution errors.
  • list_dir relies entirely on the shared defaults.
