How to debug LLM agents without guessing

the short answer

To debug an LLM agent, reproduce the exact run, read its execution path end to end, find the first step that went wrong (not the last symptom), inspect that step's real inputs and outputs, then make one change and re-run. Agent failures cascade, so the earliest broken step is almost always the true cause.

Debugging an LLM agent is different from debugging ordinary code. There's no stack trace pointing at line 42 — the agent made a sequence of decisions, each one feeding the next, and any of them could have quietly gone wrong. By the time you see a bad final answer, the actual mistake may have happened three tool calls earlier. Print statements scattered through the code rarely surface it, because the interesting state is the prompt, the model's reply, and the tool result, not your local variables.

agentis.ogbuilds.ai/agents/support-triage/debug

support-triage · snapshot #s8 · failed · first divergence at step 4

execution path · 6 steps

  1. action: received ticket #4825 from webhook
  2. llm: classify intent → billing dispute
  3. tool: crm.getCustomer(id=88213) → ok
  4. tool: billing.getInvoices → ETIMEDOUT (cause)
  5. llm: draft holding reply
  6. action: escalate to human queue

llm debug: step 4 is the cause, billing api unreachable. steps 5–6 are consequences. fix the cause, not the fallout.

where this happens in the app

the execution path with the first diverging step isolated — step 4 is where the run went wrong, not the last symptom. agentis marks the cause; the dimmed steps below are consequences.

  1. snapshot header: run id, error badge, and the first-divergence callout so you know where to look before you read the trace
  2. step 4 in red with the 'cause' label: billing.getInvoices timed out here; the three steps before it were fine
  3. llm debug: confirms the fix. the cause is the billing api timeout, not the agent's reasoning; lower the per-call timeout so it fails fast

Why agent bugs hide upstream

An agent run is a chain: the model reads context, decides on an action, calls a tool, reads the result, decides again. Each link depends on the one before it. When a link breaks — the model misreads a tool result, picks the wrong tool, or hallucinates an argument — every step after it is built on that mistake. So the symptom you notice (a wrong answer, an infinite loop, a tool error) is usually downstream of the real cause. Chasing the symptom leads you in circles; finding the first divergence leads you to the fix.

This is also why non-determinism makes agents so frustrating to debug. The same input can produce a different path on a second run, so a bug you saw once may not reproduce on demand. The defense is to capture the exact run that failed — its prompts, model replies, tool calls, and results — so you can study the real failure instead of trying to recreate it from memory.
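A minimal sketch of what capturing a run could look like, in Python. The `Step` fields, the `RunRecorder` class, and the JSON Lines layout are assumptions made for illustration; they are not agentis's storage format or any framework's API, and the tool arguments shown are made up.

```python
import json
import time
from dataclasses import asdict, dataclass, field
from typing import Any, List, Optional


@dataclass
class Step:
    """One link in the chain: what went in and what came out."""
    index: int
    kind: str                      # "llm", "tool", or "action"
    name: str                      # model or tool name
    inputs: Any                    # prompt or tool arguments
    output: Any = None             # model reply or tool result
    error: Optional[str] = None    # error text if the step failed
    started_at: float = field(default_factory=time.time)


class RunRecorder:
    """Collects steps in order and serializes them as JSON Lines."""

    def __init__(self, run_id: str) -> None:
        self.run_id = run_id
        self.steps: List[Step] = []

    def record(self, kind, name, inputs, output=None, error=None) -> Step:
        step = Step(len(self.steps) + 1, kind, name, inputs, output, error)
        self.steps.append(step)
        return step

    def dumps(self) -> str:
        # One JSON object per line, tagged with the run id.
        return "\n".join(
            json.dumps({"run_id": self.run_id, **asdict(s)}) for s in self.steps
        )


# Hypothetical usage: record two tool calls, the second one failing.
rec = RunRecorder("s8")
rec.record("tool", "crm.getCustomer", {"id": 88213}, output="ok")
rec.record("tool", "billing.getInvoices", {"customer_id": 88213}, error="ETIMEDOUT")
print(rec.dumps())
```

Writing one line per step keeps the record append-only, so a run that crashes halfway still leaves a readable prefix of its path.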

Read the path, don't read the code

When an agent misbehaves, the instinct is to reread the agent's source. That's usually the wrong place to look first, because the code is fine — it's the data flowing through it that's wrong. The model got a prompt you didn't expect, or a tool returned something malformed, or the context window dropped a crucial earlier turn. None of that is visible in the source; it's only visible in the run.

So make the run readable. For each step you want the prompt that was sent, the model's raw reply (including any tool call it requested), the tool that ran, its arguments, and what it returned. Lay those out in order and the failure usually becomes obvious: you can see the exact moment the agent acted on bad information. agentis condenses a run into exactly this ordered view, and lets you ask an LLM to point at the likely culprit step when the path is long.
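If you are laying the run out by hand, a small formatter is often enough. The sketch below assumes each step is a plain dict with `kind`, `name`, `inputs`, and either `output` or `error`; those field names and the sample values are illustrative, not a format agentis defines.

```python
from textwrap import shorten


def render_path(steps):
    """Return a plain-text, ordered view of a run: three lines per step."""
    lines = []
    for i, s in enumerate(steps, start=1):
        status = "ERROR" if s.get("error") else "ok"
        lines.append(f"[{i}] {s['kind']:<6} {s['name']}  ({status})")
        # Truncate long prompts and results so the path stays scannable.
        lines.append(f"     in : {shorten(repr(s.get('inputs')), width=100)}")
        out = s.get("error") or s.get("output")
        lines.append(f"     out: {shorten(repr(out), width=100)}")
    return "\n".join(lines)


# Hypothetical two-step run.
steps = [
    {"kind": "llm", "name": "classify_intent",
     "inputs": "ticket text", "output": "billing dispute"},
    {"kind": "tool", "name": "billing.getInvoices",
     "inputs": {"customer_id": 88213}, "error": "ETIMEDOUT"},
]
view = render_path(steps)
print(view)
```

Showing input and output side by side for every step is the point: the step where the output stops matching what you assumed is the one to inspect.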

how it works

  1. Reproduce (or capture) the failing run

    Don't debug from memory. Re-run the agent on the same input, or pull the logs of the run that actually failed. Because agents are non-deterministic, the run you study has to be a real one, not an approximation.

  2. Read the execution path top to bottom

    Walk the ordered sequence of prompts, model replies, tool calls, and results. Don't jump to the end; read from the start so you can see where it first went off course.

  3. Isolate the first failing step

    Find the earliest step where reality diverged from what you expected. Ignore later errors for now, since they're usually consequences of this one. The first divergence is almost always the real bug.

  4. Inspect that step's inputs and outputs

    Look at exactly what went in (the prompt, the tool arguments, the context) and what came out (the model's reply, the tool result). The mismatch between what you assumed and what actually happened is your answer.

  5. Make one change and re-run

    Change a single thing (tighten the prompt, fix the tool, or add the missing context), then re-run and read the path again. One change at a time keeps cause and effect clear instead of muddying the next run.
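The "read top to bottom, stop at the first divergence" part of the steps above can be sketched in a few lines of Python. This is an illustration under assumptions: steps are plain dicts, step names such as `classify_intent` are invented for the example, and "what you expected" is expressed as caller-supplied check functions keyed by step name.

```python
from typing import Callable, Dict, List, Optional, Tuple

Step = Dict[str, object]
Check = Callable[[Step], bool]


def first_divergence(
    steps: List[Step], checks: Dict[str, Check]
) -> Optional[Tuple[int, Step]]:
    """Return (1-based position, step) for the earliest step that errored
    or failed its expectation, or None if every step looks fine."""
    for pos, step in enumerate(steps, start=1):
        if step.get("error"):
            return pos, step
        check = checks.get(str(step["name"]))
        if check is not None and not check(step):
            return pos, step
    return None


# A six-step path shaped like the support-triage example (names are hypothetical).
path = [
    {"kind": "action", "name": "receive_ticket", "output": "ticket #4825"},
    {"kind": "llm", "name": "classify_intent", "output": "billing dispute"},
    {"kind": "tool", "name": "crm.getCustomer", "output": "ok"},
    {"kind": "tool", "name": "billing.getInvoices", "error": "ETIMEDOUT"},
    {"kind": "llm", "name": "draft_reply", "output": "holding reply"},
    {"kind": "action", "name": "escalate", "output": "human queue"},
]
checks = {"classify_intent": lambda s: bool(s.get("output"))}

hit = first_divergence(path, checks)
pos, step = hit
consequences = path[pos:]  # everything after the cause is treated as fallout
print(f"cause: step {pos} ({step['name']}), {len(consequences)} downstream steps")
```

The function returns as soon as it finds a problem, which is the design choice that matters here: later errors are deliberately not reported until the earliest one is fixed and the run is repeated.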

frequently asked

Why can't I just debug an agent with print statements?

Print statements show your local variables, but agent bugs live in the prompts, model replies, and tool results — the data flowing between steps. You need the actual conversation the agent had with the model and its tools, in order, not your code's internal state. That's what an execution path gives you and a print log doesn't.

The bug doesn't reproduce every time. How do I debug that?

Non-determinism means you can't rely on recreating the failure on demand. The fix is to capture every run so that when one fails, you already have its exact path saved. Then you debug the real failed run instead of trying to trigger it again. Capturing runs by default turns an intermittent bug into a readable record.
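One way to get capture by default is to wrap every tool function so that its arguments, result, or error are recorded whether or not anyone is watching. The decorator below is a hedged sketch: `captured`, `RUN_LOG`, and the two stand-in tool functions are hypothetical names, and a real setup would write to durable storage rather than an in-memory list.

```python
import functools

RUN_LOG = []  # in-memory stand-in for wherever captured runs are stored


def captured(tool_name):
    """Decorator that records each call's inputs and its output or error."""
    def decorator(fn):
        @functools.wraps(fn)
        def wrapper(*args, **kwargs):
            entry = {
                "kind": "tool",
                "name": tool_name,
                "inputs": {"args": args, "kwargs": kwargs},
            }
            try:
                entry["output"] = fn(*args, **kwargs)
                return entry["output"]
            except Exception as exc:
                entry["error"] = f"{type(exc).__name__}: {exc}"
                raise  # the agent still sees the failure
            finally:
                RUN_LOG.append(entry)  # recorded on success and on failure
        return wrapper
    return decorator


@captured("crm.getCustomer")
def get_customer(customer_id):
    return {"id": customer_id}  # stand-in for a real lookup


@captured("billing.getInvoices")
def get_invoices(customer_id):
    raise TimeoutError("ETIMEDOUT")  # simulated outage


get_customer(88213)
try:
    get_invoices(88213)
except TimeoutError:
    pass

for entry in RUN_LOG:
    print(entry["name"], entry.get("error", "ok"))
```

Because the append happens in `finally`, the failing call is recorded with its error before the exception propagates, so the intermittent failure is already on file the first time it happens.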

Should I look at the first error or the last one?

The first. Agent failures cascade — the model acts on a bad tool result, then everything after is built on that mistake. The last error you see is usually a symptom of an earlier one. Find the first step where the run diverged from what you expected, and you've almost always found the cause.

Do I need a special tool, or can I do this from raw logs?

You can do it from raw logs if they capture the prompts, model replies, and tool calls in order — most default logging doesn't. The work is reconstructing the ordered path from scattered lines. A tool like agentis ingests the logs and renders the run as a readable snapshot, which removes the reconstruction step.
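For a sense of what the reconstruction step involves, here is a rough Python sketch. It assumes the logs are JSON Lines and that each event carries a `run_id` and a `step` number; both field names are assumptions, and logs that lack such identifiers would need a different correlation strategy.

```python
import json
from collections import defaultdict


def rebuild_paths(log_lines):
    """Group JSON log lines by run_id and order each run by step number."""
    runs = defaultdict(list)
    for line in log_lines:
        line = line.strip()
        if not line:
            continue
        try:
            event = json.loads(line)
        except json.JSONDecodeError:
            continue  # skip lines that are not structured events
        if isinstance(event, dict) and "run_id" in event and "step" in event:
            runs[event["run_id"]].append(event)
    return {rid: sorted(evts, key=lambda e: e["step"]) for rid, evts in runs.items()}


# Interleaved, out-of-order lines from two hypothetical runs plus some noise.
raw = [
    '{"run_id": "s8", "step": 2, "name": "crm.getCustomer"}',
    'INFO server started',
    '{"run_id": "s9", "step": 1, "name": "classify_intent"}',
    '{"run_id": "s8", "step": 1, "name": "classify_intent"}',
    '{"run_id": "s8", "step": 3, "name": "billing.getInvoices", "error": "ETIMEDOUT"}',
]
paths = rebuild_paths(raw)
for event in paths["s8"]:
    print(event["step"], event["name"], event.get("error", "ok"))
```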

Last updated June 19, 2026