Tool Error Recovery
LLMs produce malformed tool calls. Agents.KT lets you fix them -- with code, with another agent, or with both.
Large language models are probabilistic text generators. When they produce tool calls, things go wrong in predictable ways:
- Trailing commas in JSON: {"a": 1, "b": 2,}
- Markdown fencing around arguments: ```json\n{"a": 1}\n```
- Wrong types: a number sent as a string, "42" instead of 42
- Missing required fields: the model forgets a parameter
- Runtime failures: the tool itself throws because of bad input or transient errors
Without error recovery, any of these failures kills the agentic loop. The agent stops, the user gets nothing.
Most frameworks handle malformed tool calls with special parser classes, retry middleware, or string-cleaning utilities buried in utility packages.
Agents.KT takes a different approach: the fixer is an agent. The same Agent<IN, OUT> interface you use to build your application is the same interface you use to repair broken tool calls. No new abstraction. No special machinery.
This means repair logic gets the full power of the framework: it can be deterministic (a pure function), LLM-driven (an agent with its own model), or a composition of both.
Tool errors form a sealed hierarchy with four variants:
sealed interface ToolError {
    data class InvalidArgs(
        val rawArgs: String,
        val parseError: String,
        val expectedSchema: JsonSchema
    ) : ToolError

    data class DeserializationError(
        val rawValue: String,
        val targetType: KType,
        val cause: Throwable
    ) : ToolError

    data class ExecutionError(
        val args: ToolArgs,
        val cause: Throwable
    ) : ToolError

    data class EscalationError(
        val source: AgentRef,
        val reason: String,
        val severity: Severity,
        val originalError: ToolError,
        val attempts: Int
    ) : ToolError
}

| Error Type | When It Fires | Typical Cause |
|---|---|---|
| InvalidArgs | JSON parsing fails | Trailing commas, markdown fencing, truncated output |
| DeserializationError | JSON parses but cannot map to expected types | "42" instead of 42, missing keys |
| ExecutionError | Tool executor throws | Bad input values, transient I/O failures, business logic errors |
| EscalationError | Repair itself fails and escalates up | Exhausted retries, unrecoverable state |
The sealed hierarchy means when expressions are exhaustive -- the compiler tells you if you miss a case.
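The exhaustiveness guarantee can be seen in a minimal sketch. The types below are simplified stand-ins for the hierarchy above (fields reduced to plain strings so the snippet compiles without the framework), and `describe` is a hypothetical helper, not part of Agents.KT:

```kotlin
// Simplified stand-ins for the ToolError variants; not the framework's real types.
sealed interface ToolError {
    data class InvalidArgs(val rawArgs: String, val parseError: String) : ToolError
    data class DeserializationError(val rawValue: String, val targetType: String) : ToolError
    data class ExecutionError(val cause: Throwable) : ToolError
    data class EscalationError(val reason: String, val attempts: Int) : ToolError
}

// No `else` branch: a `when` used as an expression over a sealed type must
// cover every variant, so deleting any branch below is a compile error.
fun describe(error: ToolError): String = when (error) {
    is ToolError.InvalidArgs -> "unparseable arguments: ${error.parseError}"
    is ToolError.DeserializationError -> "cannot map '${error.rawValue}' to ${error.targetType}"
    is ToolError.ExecutionError -> "tool threw: ${error.cause.message}"
    is ToolError.EscalationError -> "repair gave up after ${error.attempts} attempts: ${error.reason}"
}
```

Adding a fifth variant to the sealed interface would make `describe` fail to compile until a matching branch is added.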
Each tool can declare error handlers using the onError {} block. Inside, three verbs match the three non-escalation error types:
tool("write_file", "Write content to a file") { args ->
    val path = args["path"] as String
    val content = args["content"] as String
    fileSystem.write(path, content)
}
onError {
    invalidArgs { args, error ->
        fix { args.trimMarkdownFencing() }
    }
    deserializationError { raw, error ->
        sanitize { raw.normalizePathSeparators() }
    }
    executionError { e ->
        retry(maxAttempts = 3, backoff = exponential())
    }
}

| Verb | Error Type | Purpose |
|---|---|---|
| invalidArgs { } | InvalidArgs | Fix unparseable JSON |
| deserializationError { } | DeserializationError | Fix type mismatches |
| executionError { } | ExecutionError | Handle runtime failures |
The simplest recovery strategy is a pure function. No LLM, no network call -- just string manipulation.
onError {
    invalidArgs { args, error ->
        fix {
            args
                .trimMarkdownFencing()             // strip ```json ... ```
                .replace(Regex(",\\s*}"), "}")     // remove trailing commas in objects
                .replace(Regex(",\\s*]"), "]")     // remove trailing commas in arrays
        }
    }
}

The lambda receives the raw argument string and returns a cleaned version. The framework re-parses the cleaned string and retries the tool call.
onError {
    deserializationError { raw, error ->
        sanitize {
            raw.normalizePathSeparators() // backslash to forward slash
        }
    }
}

Same idea: transform the raw value so it deserializes correctly.
onError {
    executionError { e ->
        retry(maxAttempts = 3, backoff = exponential())
    }
}

This re-runs the tool executor with the same arguments. The backoff parameter controls the delay between attempts. Use this for transient failures like network timeouts or rate limits.
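To make the retry behavior concrete, here is a minimal sketch of retry with exponential backoff. It is not the framework's implementation: `backoffDelayMs`, `retryWithBackoff`, the 100 ms base delay, and the 5 s cap are all assumptions chosen for illustration.

```kotlin
// Hypothetical helper: delay before retry number `attempt` (1-based),
// doubling each time and capped at maxMs.
fun backoffDelayMs(attempt: Int, baseMs: Long = 100, maxMs: Long = 5_000): Long {
    require(attempt >= 1) { "attempt is 1-based" }
    val shift = (attempt - 1).coerceAtMost(20) // bound the shift to avoid overflow
    return (baseMs shl shift).coerceAtMost(maxMs)
}

// Runs `block` up to maxAttempts times, sleeping between failed attempts.
// Rethrows the last failure once the attempts are exhausted.
fun <T> retryWithBackoff(
    maxAttempts: Int,
    sleep: (Long) -> Unit = { Thread.sleep(it) },
    block: (attempt: Int) -> T
): T {
    require(maxAttempts >= 1) { "maxAttempts must be at least 1" }
    var lastFailure: Exception? = null
    for (attempt in 1..maxAttempts) {
        try {
            return block(attempt)
        } catch (e: Exception) {
            lastFailure = e
            if (attempt < maxAttempts) sleep(backoffDelayMs(attempt))
        }
    }
    throw lastFailure!!
}
```

Passing `sleep` as a parameter keeps the sketch testable: a test can record the requested delays instead of actually waiting.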
When deterministic cleanup is not enough -- the JSON is too mangled, the error is too novel -- you can delegate repair to an agent.
A repair agent is a regular Agent<String, String>. It takes the broken input as a string and returns a fixed string:
val jsonFixer = agent<String, String>("json-fixer") {
    prompt = """
        You are a JSON repair tool. You receive malformed JSON and return
        valid JSON. Do not add or remove fields. Only fix syntax errors.
        Return ONLY the fixed JSON, no explanation.
    """.trimIndent()
    model {
        ollama("qwen2.5:7b")
        temperature = 0.0 // deterministic output
    }
    budget { maxTurns = 1 } // single-shot, no tool loop
    skills {
        skill<String, String>("fix-json", "Repairs broken JSON") {
            implementedBy { input -> input } // LLM does the work via prompt
        }
    }
}

tool("create_task", "Create a new task") { args ->
    val title = args["title"] as String
    taskService.create(title)
}
onError {
    invalidArgs { args, error ->
        fix(agent = jsonFixer, retries = 3)
    }
}

The framework sends the broken arguments to jsonFixer, takes the output, re-parses it, and retries the tool call. If the fix fails, it retries up to 3 times before giving up.
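The send, re-parse, retry cycle described above might look roughly like the sketch below. Everything here is hypothetical: `repairWithRetries` is not an Agents.KT function, `fixer` stands in for the repair agent, `parses` stands in for the framework's JSON parser, and feeding the original input to every attempt is an assumption about how retries work.

```kotlin
// Hypothetical sketch of a repair loop, not the framework's implementation.
// Returns the first candidate that parses, or null once retries are exhausted.
fun repairWithRetries(
    brokenArgs: String,
    retries: Int,
    fixer: (String) -> String,   // stand-in for the repair agent
    parses: (String) -> Boolean  // stand-in for the JSON parser
): String? {
    repeat(retries) {
        val candidate = fixer(brokenArgs)
        if (parses(candidate)) return candidate // repeat is inline, so this returns from the function
    }
    return null
}
```

A null result corresponds to the "giving up" case, at which point the handler has to escalate or throw.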
The most robust approach: try deterministic repair first, and fall back to the LLM only if the deterministic fix returns null:
onError {
    invalidArgs { args, error ->
        fix {
            // Attempt 1: simple cleanup
            tryJsonCleanup(args) // returns null if cleanup is insufficient
        } ?: fix(agent = jsonFixer, retries = 3)
        // Attempt 2: LLM-driven repair if deterministic fix returned null
    }
}

This gives you the speed of string manipulation for common cases (trailing commas, fencing) and the intelligence of an LLM for edge cases.
fun tryJsonCleanup(raw: String): String? {
    val cleaned = raw
        .trim()
        .removePrefix("```json").removePrefix("```")
        .removeSuffix("```")
        .trim()
        .replace(Regex(",\\s*}"), "}")
        .replace(Regex(",\\s*]"), "]")
    return try {
        // Verify it parses
        JsonParser.parse(cleaned)
        cleaned
    } catch (e: Exception) {
        null // signal: deterministic fix was not enough
    }
}

A repair agent does not have to use an LLM. You can build a fully deterministic agent using implementedBy:
val regexFixer = agent<String, String>("regex-fixer") {
    skills {
        skill<String, String>("fix", "Fix JSON with regex") {
            implementedBy { input ->
                input
                    .replace(Regex("(?s)```json\\s*(.+?)\\s*```"), "$1")
                    .replace(Regex(",\\s*([}\\]])"), "$1")
                    .replace(Regex("'"), "\"")
            }
        }
    }
}

Zero LLM calls, zero latency, zero cost -- but it conforms to the Agent<String, String> interface, so it plugs into fix(agent = ...) seamlessly. The framework does not care how the agent produces its output.
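The three replacements used by regex-fixer can be tried outside the framework as a plain function. The sketch below uses only the Kotlin standard library; `regexFix` and `FENCE` are names introduced here for illustration.

```kotlin
// Three backticks, built with repeat() so this sample stays safe to embed in Markdown.
val FENCE: String = "`".repeat(3)

// The same three replacements as the regex-fixer skill, as a plain function.
fun regexFix(input: String): String =
    input
        // Unwrap a fenced json block, keeping only its contents.
        .replace(Regex("(?s)" + FENCE + "json\\s*(.+?)\\s*" + FENCE), "$1")
        // Drop a trailing comma before a closing brace or bracket.
        .replace(Regex(",\\s*([}\\]])"), "$1")
        // Turn every single quote into a double quote.
        .replace(Regex("'"), "\"")
```

Given a fenced input containing {'a': 1, 'b': [1, 2,],}, this returns {"a": 1, "b": [1, 2]}. The quote replacement is blunt, though: an apostrophe inside a string value, as in {"note": "it's",}, is also rewritten and the result no longer parses. Inputs like that are where a fallback to an LLM-driven fixer earns its cost.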
When many tools share the same error handling, define it once in a defaults {} block at the skill level:
skills {
    skill<String, String>("data-ops", "Data operations") {
        tools("read_file", "write_file", "delete_file")

        defaults {
            onError {
                invalidArgs { args, error ->
                    fix { tryJsonCleanup(args) } ?: fix(agent = jsonFixer, retries = 2)
                }
                executionError { e ->
                    retry(maxAttempts = 3, backoff = exponential())
                }
            }
        }

        tool("read_file", "Read a file") { args ->
            fileSystem.read(args["path"] as String)
        }
        tool("write_file", "Write a file") { args ->
            fileSystem.write(args["path"] as String, args["content"] as String)
        }
        tool("delete_file", "Delete a file") { args ->
            fileSystem.delete(args["path"] as String)
        }

        // Per-tool override: delete_file has stricter handling
        onError("delete_file") {
            executionError { e ->
                // No retry for destructive operations
                escalate()
            }
        }
    }
}

The rule: a per-tool onError overrides the defaults for that specific tool, handler by handler. Here delete_file replaces only the executionError handler and still inherits the invalidArgs default. All other tools inherit the defaults unchanged.
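One way to picture the override rule is as a two-level lookup. The sketch below is a hypothetical model, not Agents.KT code: `ErrorKind`, `Handler`, and `resolveHandler` are illustrative names, and it assumes overrides apply per handler (which is what the later complete example suggests, since tools with their own onError still share the invalidArgs default).

```kotlin
// Hypothetical model of handler resolution; none of these names are framework API.
enum class ErrorKind { INVALID_ARGS, DESERIALIZATION, EXECUTION }

typealias Handler = String // a label stands in for a real handler lambda

// A per-tool handler wins when present; otherwise fall back to the shared default.
fun resolveHandler(
    tool: String,
    kind: ErrorKind,
    defaults: Map<ErrorKind, Handler>,
    perTool: Map<String, Map<ErrorKind, Handler>>
): Handler? = perTool[tool]?.get(kind) ?: defaults[kind]
```

With defaults for INVALID_ARGS and EXECUTION plus an EXECUTION override for delete_file, the lookup returns the override for delete_file execution errors, the default for everything else, and null where no handler is declared at either level.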
When repair fails, the tool has two fundamentally different options. Understanding the distinction is critical -- it determines whether your system degrades gracefully or crashes hard.
executionError { e ->
    escalate("File not found, cannot retry", severity = Severity.MEDIUM)
}

escalate() does not throw. It does not break the call stack. Instead, it wraps the error in an EscalationError and walks up the structure {} delegation tree -- from the repair agent to the tool, from the tool to its parent agent, from the parent agent to its parent, until something handles it.
The EscalationError carries everything the handler needs to decide:
data class EscalationError(
    val source: AgentRef,          // who escalated
    val reason: String,            // why they gave up
    val severity: Severity,        // LOW, MEDIUM, HIGH, CRITICAL
    val originalError: ToolError,  // the root cause
    val attempts: Int              // how many repair attempts were made
)

What a parent agent can do when it receives an escalation:
// In the parent agent's structure handler:
onEscalation { error ->
    when {
        // Retry with a stronger model
        error.severity == Severity.LOW ->
            retryWithModel("llama3:70b")

        // Route to a different skill that avoids this tool
        error.severity == Severity.MEDIUM ->
            rerouteToSkill("manual-fallback")

        // Give up gracefully with a meaningful result
        error.severity == Severity.HIGH ->
            returnPartialResult("Completed 3/5 tasks, failed on: ${error.reason}")

        // Escalate further up the tree
        error.severity == Severity.CRITICAL ->
            escalate(error.reason, Severity.CRITICAL)
    }
}

The key property: the agent that understands the problem handles it. A JSON repair agent knows nothing about business context -- it should escalate. The parent agent, which knows the domain, decides what "file not found" means for the workflow.
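The walk up the delegation tree can be modeled in a few lines. This is a hypothetical sketch rather than the framework's mechanism: `Node`, `Escalation`, and this standalone `escalate` function are illustrative, and the convention that a handler returns null to pass the escalation on is an assumption.

```kotlin
// Hypothetical model: an escalation is a plain value handed to each ancestor
// in turn until one of them produces an outcome.
enum class Severity { LOW, MEDIUM, HIGH, CRITICAL }

data class Escalation(val source: String, val reason: String, val severity: Severity)

class Node(
    val name: String,
    val parent: Node? = null,
    // Returns an outcome if this node handles the escalation, null to pass it on.
    val onEscalation: ((Escalation) -> String?)? = null
)

// Walks from the escalating node's parent toward the root. Nothing is thrown
// and no stack is unwound; nodes without a handler are simply skipped.
fun escalate(from: Node, escalation: Escalation): String? {
    var current = from.parent
    while (current != null) {
        val outcome = current.onEscalation?.invoke(escalation)
        if (outcome != null) return "${current.name}: $outcome"
        current = current.parent
    }
    return null // reached the root without finding a handler
}
```

In a three-level tree where only the root declares a handler, an escalation raised by the leaf passes through the middle node untouched and is resolved at the root.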
executionError { e ->
    throwException("Credentials invalid, cannot proceed")
}

throwException() is a hard stop. It throws a ToolExecutionException that propagates through the pipeline like any JVM exception. The agentic loop halts. No recovery. No walking up the tree.
When to use throwException():
- Security violations -- tool attempted to access something it shouldn't
- Data corruption -- continuing would make things worse
- Development bugs -- you want the test to fail loud
- Invariant violations -- the system is in a state that should be impossible
onError("delete_database") {
    executionError { e ->
        when (e.cause) {
            is SecurityException -> throwException("Unauthorized deletion attempt")
            is ConnectionException -> retry(maxAttempts = 2, backoff = exponential())
            else -> escalate("Unexpected error during deletion")
        }
    }
}

| Situation | Use | Why |
|---|---|---|
| Network timeout | retry() | Transient, likely to resolve |
| File not found | escalate() | Parent agent might try a different path |
| LLM returned garbage JSON 3 times | escalate() | Parent might switch models or skip the task |
| Invalid API credentials | throwException() | No agent can fix bad credentials |
| Disk full | escalate() | Parent might clean up or write elsewhere |
| Data corruption detected | throwException() | Continuing would make it worse |
| Tool is deprecated | escalate() | Parent can route to a replacement tool |
| Null pointer in tool executor | throwException() | This is a bug, surface it |
LLM calls tool with malformed args
  |
  v
onError.invalidArgs handler runs
  |
  +--> fix { cleanup(args) } succeeds
  |      |
  |      v
  |    Tool re-invoked with fixed args --> result returned, loop continues
  |
  +--> fix returns null (can't clean up deterministically)
         |
         v
       fix(agent = jsonFixer, retries = 3)
         |
         +--> jsonFixer fixes it on attempt 2 --> tool re-invoked, loop continues
         |
         +--> jsonFixer fails 3 times
                |
                +--> jsonFixer calls escalate("Schema mismatch, not a formatting issue")
                       |
                       v
                     EscalationError created (source=json-fixer, attempts=3)
                       |
                       v
                     Parent agent "coder" receives EscalationError
                       |
                       +--> coder handles: retries with stronger model
                       |      |
                       |      v
                       |    Stronger model produces valid tool call --> succeeds
                       |
                       +--> coder can't handle either
                              |
                              +--> coder escalates to ITS parent in structure {}
                              |
                              +--> or coder calls throwException() --> pipeline stops
When an Agent<String, String> is used as a repair agent inside onError, the framework automatically injects two tools:
// Available inside any repair agent:
tool("escalate") {
    param("reason", STRING)
    param("severity", ENUM<Severity>) // LOW, MEDIUM, HIGH, CRITICAL
    returns(Nothing)                  // never returns -- exits repair loop
}
tool("throwException") {
    param("reason", STRING)
    returns(Nothing)                  // never returns -- throws exception
}

This means an LLM-driven repair agent can decide for itself whether to escalate or throw. If the fixer model looks at the broken JSON and determines it's not a formatting problem but a schema mismatch, it can call escalate() with a reason -- and the parent agent receives that reason as context.
val smartFixer = agent<String, String>("smart-fixer") {
    prompt("""
        Fix the malformed JSON. If you can fix it, return the fixed JSON.
        If the problem is NOT formatting (e.g., wrong schema, missing required
        fields that you cannot infer), call escalate() with a clear reason.
        If the input is not JSON at all, call throwException().
    """.trimIndent())
    model { ollama("qwen2.5:7b"); temperature = 0.0 }
    budget { maxTurns = 3 }
    tools { escalate(); throwException() }
    skills {
        skill<String, String>("fix", "Fix or escalate JSON problems") {
            tools("escalate", "throwException")
        }
    }
}

Bruce Eckel, in Thinking in Java, made an argument against Java's checked exceptions that shaped a generation of language design. His core observation:
Checked exceptions force every layer between the error source and the handler to declare or catch exceptions they don't understand. This creates coupling. The intermediate layers have no idea what to do with a SQLException -- they just re-throw it or, worse, swallow it.
The result was the infamous anti-pattern:
try {
    database.query(sql);
} catch (SQLException e) {
    // TODO: handle this
}

Eckel's position: exceptions should propagate up until someone who actually knows what to do catches them. Intermediate layers should not be forced to participate. This led him to favor unchecked exceptions -- errors that flow upward silently until a competent handler stops them. C# and Kotlin made the same choice: neither language has checked exceptions.
Agents.KT's error recovery was designed with Eckel's insight in mind, but adapted for agent systems where "layers" are not call stacks but delegation trees.
| Eckel's Critique of Java | Agents.KT's Answer |
|---|---|
| Checked exceptions force every layer to handle or declare | escalate() walks up the tree silently -- intermediate agents don't need to know |
| Intermediate layers swallow exceptions they can't handle | Only agents with onEscalation {} handlers participate -- others pass through |
| catch (Exception e) {} hides bugs | EscalationError carries full context: reason, severity, original error, attempt count |
| Checked vs unchecked is a binary choice | Three-tier model: fix/retry (handle locally), escalate (walk up), throw (hard stop) |
| Exception handling is a side channel | Repair is the main channel -- same Agent<IN, OUT> interface as everything else |
The deepest difference: Java exceptions (checked or unchecked) are control flow -- they unwind the stack, skip past code, and land in a catch block. The code between throw and catch is abandoned.
In Agents.KT, errors are values. A ToolError is a data class. An EscalationError is a data class. They don't unwind anything. They flow through the delegation tree as structured data that each agent can inspect, enrich, and act on.
// Java exception: control flow, stack unwinding, context lost
throw new ToolCallException("malformed JSON: trailing comma");

// Agents.KT escalation: value, delegation tree, context preserved
EscalationError(
    source = AgentRef("json-fixer"),
    reason = "Schema mismatch, not a formatting issue",
    severity = Severity.MEDIUM,
    originalError = InvalidArgs(rawArgs = "{...}", parseError = "..."),
    attempts = 3
)

The parent agent receives a complete picture: who tried to fix it, why they failed, how many times, how severe they think it is, and what the original error was. Compare this with a Java catch block that receives a Throwable with a string message and a stack trace.
Eckel saw two tiers: checked (forced handling) and unchecked (optional handling). Agents.KT has three:
Tier 1: Handle locally
    fix { }, sanitize { }, retry()
    → "I know how to fix this. I'll handle it right here."
    → Like a method fixing its own input before proceeding.

Tier 2: Escalate
    escalate(reason, severity)
    → "I can't fix this, but it's not fatal. Someone upstream might know what to do."
    → Like Eckel's ideal unchecked exception: flows up until competence is found.

Tier 3: Throw
    throwException(reason)
    → "This is fundamentally broken. Stop everything."
    → Like a RuntimeException for genuine bugs or security violations.
The key insight from Eckel that Agents.KT preserves: the entity that understands the problem should handle it. A JSON repair agent knows syntax. A parent agent knows business context. The structure {} tree is the org chart -- escalation follows reporting lines, not call stacks.
An agent with multiple tools, each with tailored error recovery:
val jsonFixer = agent<String, String>("json-fixer") {
    prompt = "Fix the malformed JSON. Return only valid JSON."
    model { ollama("qwen2.5:7b"); temperature = 0.0 }
    budget { maxTurns = 1 }
    skills {
        skill<String, String>("fix", "Fix JSON") {
            implementedBy { it }
        }
    }
}

val fileAgent = agent<String, String>("file-manager") {
    prompt = "You manage files. Use tools to read, write, and list files."
    model { ollama("qwen2.5:7b") }
    budget { maxTurns = 10 }
    skills {
        skill<String, String>("manage-files", "File management operations") {
            tools("read_file", "write_file", "list_dir")

            // Shared defaults
            defaults {
                onError {
                    invalidArgs { args, error ->
                        fix { tryJsonCleanup(args) } ?: fix(agent = jsonFixer, retries = 2)
                    }
                }
            }

            tool("read_file", "Read file contents by path") { args ->
                val path = args["path"] as String
                File(path).readText()
            }
            onError("read_file") {
                executionError { e ->
                    when (e.cause) {
                        is FileNotFoundException -> escalate()
                        is IOException -> retry(maxAttempts = 3, backoff = exponential())
                        else -> throwException()
                    }
                }
            }

            tool("write_file", "Write content to a file") { args ->
                val path = args["path"] as String
                val content = args["content"] as String
                File(path).writeText(content)
                "Written ${content.length} bytes to $path"
            }
            onError("write_file") {
                deserializationError { raw, error ->
                    sanitize { raw.normalizePathSeparators() }
                }
                executionError { e ->
                    retry(maxAttempts = 2, backoff = exponential())
                }
            }

            tool("list_dir", "List files in a directory") { args ->
                val path = args["path"] as String
                File(path).listFiles()?.map { it.name } ?: emptyList<String>()
            }
            // list_dir inherits defaults -- no per-tool override needed
        }
    }
    onToolUse { name, args, result ->
        println("[$name] args=$args result=$result")
    }
}

// Usage
val result = fileAgent("Read the contents of /tmp/config.json and summarize it")

In this example:
- All three tools share the invalidArgs default (deterministic cleanup, then LLM fixer).
- read_file escalates on missing files, retries on I/O errors, and throws on unexpected failures.
- write_file sanitizes path separators and retries on execution errors.
- list_dir relies entirely on the shared defaults.
- Model & Tool Calling -- understand the agentic loop that these errors occur in
- Skill Selection & Routing -- how agents pick which skill to run
- Budget Controls -- prevent runaway loops during error recovery
- Observability Hooks -- monitor recovery attempts