AI agent guardrails: stop the agent at the dangerous request, not after it

the short answer

AI agent guardrails are the controls that decide what an autonomous agent is actually allowed to do. The ones that matter most are enforced at the action layer: a proxy intercepts every request the agent makes, forwards safe traffic instantly, and holds destructive actions (DELETEs, DROP TABLE, rm -rf, kubectl delete) for a human to approve before they ever reach the target. That is exactly what agent·shield does.

Most "guardrails" sold for AI agents live inside the prompt: a system message asking the model to be careful, a list of things it shouldn't do, maybe a tool description that says "only for read operations". They help, right up until the model ignores them — and a model that's been jailbroken, confused by a poisoned web page, or simply wrong will cheerfully run the destructive call you told it never to make.

A real guardrail isn't advice the agent can talk itself out of; it's a control the agent can't route around. This page is about that kind: enforcement at the layer where the action actually happens — the network request — and why holding the dangerous ones for a human is the guardrail that survives a misbehaving model.

every request: agent·shield inspects every call the agent makes; the dangerous ones wait for a human, the rest pass instantly

Prompt guardrails are advice; action guardrails are enforcement

A guardrail written into the prompt is a request the model can decline. It works when the model is cooperative and breaks exactly when you need it most: under prompt injection, after a context window fills with conflicting instructions, or when the model hallucinates that deleting the table is the correct next step. You can't audit a prompt guardrail after the fact, and you can't prove it held — you can only hope.

An action guardrail sits outside the model entirely. agent·shield runs as a transparent HTTP proxy: the agent points its base URL at the proxy instead of calling services directly, and from then on every request — every API call, database query, and tool invocation that goes over HTTP — passes through a checkpoint the model has no control over. The model can decide to send a DELETE; it cannot decide whether that DELETE gets forwarded.

Three outcomes: forward, block, hold

Each intercepted request is matched against policies (regex over the method, path, and body) and gets one of three outcomes. Safe traffic is forwarded instantly, so the agent runs at full speed for the reads and routine writes that make up the vast majority of its calls. Calls that match a hard ban are blocked outright and never reach the target. And the dangerous middle, destructive or sensitive actions, is held in a human approval queue, paused, until a person approves or denies it.
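The three outcomes can be sketched as a first-match-wins policy table. This is a minimal illustration, not agent·shield's actual policy format: the `Policy` class, the `decide` function, the rule names, and the `/admin/` hard ban are all hypothetical. Only the idea of regex over method, path, and body comes from the description above.

```python
import re
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class Outcome(Enum):
    FORWARD = "forward"
    BLOCK = "block"
    HOLD = "hold"


@dataclass(frozen=True)
class Policy:
    """Hypothetical policy: optional regexes over method, path, and body."""
    name: str
    outcome: Outcome
    method: Optional[str] = None
    path: Optional[str] = None
    body: Optional[str] = None

    def matches(self, method: str, path: str, body: str) -> bool:
        # Every pattern that is set must match; unset patterns are ignored.
        checks = ((self.method, method), (self.path, path), (self.body, body))
        return all(
            re.search(pattern, value, re.IGNORECASE)
            for pattern, value in checks
            if pattern is not None
        )


# Example rules (assumptions for illustration). Order matters: first match wins.
POLICIES = [
    Policy("ban-admin-api", Outcome.BLOCK, path=r"^/admin/"),
    Policy("hold-http-delete", Outcome.HOLD, method=r"^DELETE$"),
    Policy("hold-drop-table", Outcome.HOLD, body=r"\bDROP\s+TABLE\b"),
]


def decide(method: str, path: str, body: str = ""):
    """Return (outcome, matched policy name); unmatched traffic is forwarded."""
    for policy in POLICIES:
        if policy.matches(method, path, body):
            return policy.outcome, policy.name
    return Outcome.FORWARD, None
```

Putting the hard bans first means a request that is both banned and destructive is blocked rather than queued, which is one reasonable ordering among several.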

That third outcome is the one prompt guardrails can't offer. A held DELETE, a held DROP TABLE, a held kubectl delete doesn't fail and doesn't silently go through; it waits, visible, for a human to make the call. The agent keeps its autonomy on everything safe, and a person stays in the loop only for the handful of actions that could actually do damage.
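To make the hold outcome concrete, here is a minimal in-memory sketch of an approval queue. The `HeldRequest` and `ApprovalQueue` names, their fields, and the status strings are assumptions for illustration; the source does not describe how agent·shield stores or exposes held requests.

```python
import uuid
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class HeldRequest:
    """A paused request waiting for a human decision (hypothetical shape)."""
    method: str
    path: str
    body: str
    policy: str
    status: str = "pending"  # pending -> approved | denied
    decided_by: Optional[str] = None
    request_id: str = field(default_factory=lambda: uuid.uuid4().hex)


class ApprovalQueue:
    def __init__(self) -> None:
        self._held: Dict[str, HeldRequest] = {}

    def hold(self, method: str, path: str, body: str, policy: str) -> str:
        """Park a request and return its id; nothing is forwarded yet."""
        req = HeldRequest(method, path, body, policy)
        self._held[req.request_id] = req
        return req.request_id

    def pending(self) -> List[HeldRequest]:
        """Requests still waiting, visible to reviewers."""
        return [r for r in self._held.values() if r.status == "pending"]

    def decide(self, request_id: str, reviewer: str, approve: bool) -> HeldRequest:
        """Record a human decision; a request can only be decided once."""
        req = self._held[request_id]
        if req.status != "pending":
            raise ValueError(f"request {request_id} already {req.status}")
        req.status = "approved" if approve else "denied"
        req.decided_by = reviewer
        return req
```

In a real proxy the caller would forward the request only after `decide` returns an approved status, and return an error to the agent on denial.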

Guardrails you can prove held

Guardrails are only worth as much as your ability to show they worked. Because agent·shield decides at the request level, every decision — forwarded, blocked, held, approved, denied — is written to an append-only audit log with the full request, the policy it matched, who approved or denied it, and when. When someone asks "what was the agent actually allowed to do, and who let it," there's a record, not a shrug.
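An audit record with those fields might look like the sketch below, which appends one JSON object per line to a file. The field names, the JSON Lines layout, and both function names are assumptions, not agent·shield's actual log schema. Opening a file in append mode is append-only by convention only; a production log would need stronger tamper protection.

```python
import json
from datetime import datetime, timezone
from pathlib import Path
from typing import List, Optional


def append_audit_record(
    log_path: str,
    decision: str,  # forwarded | blocked | held | approved | denied
    method: str,
    path: str,
    body: str,
    policy: Optional[str] = None,
    actor: Optional[str] = None,  # who approved or denied, if anyone
) -> dict:
    """Append one decision to the log and return the record written."""
    record = {
        "timestamp": datetime.now(timezone.utc).isoformat(),
        "decision": decision,
        "request": {"method": method, "path": path, "body": body},
        "policy": policy,
        "actor": actor,
    }
    # Mode "a" never rewrites earlier lines; each record is a new line.
    with Path(log_path).open("a", encoding="utf-8") as f:
        f.write(json.dumps(record) + "\n")
    return record


def read_audit_log(log_path: str) -> List[dict]:
    """Load every record in the order it was written."""
    with Path(log_path).open(encoding="utf-8") as f:
        return [json.loads(line) for line in f if line.strip()]
```

With records like these, "who let it" becomes a filter over `decision` and `actor` rather than a reconstruction from memory.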

Setup stays out of the agent's way: there's no SDK to adopt and no agent code to rewrite. You change the base URL the agent calls and write policies for the actions you care about. The guardrail is infrastructure, not a library buried in the agent, which is what makes it hard to bypass and easy to reason about. Worth stating plainly: agent·shield enforces only on traffic that flows through the proxy, so the guardrails cover exactly the calls you route through it. Point all of the agent's outbound HTTP at agent·shield and that covers everything the agent does over the network.
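As a rough illustration of the base URL change, the sketch below builds the same request against a configurable base. Everything named here is an assumption: the `AGENT_BASE_URL` variable, both addresses, and the helper functions are hypothetical, and how the proxy learns the real upstream target is not covered by this page.

```python
import os
import urllib.request
from typing import Optional

# Hypothetical addresses; substitute your real service and proxy endpoints.
DIRECT_BASE_URL = "https://api.example.com"
PROXY_BASE_URL = "http://localhost:8080"


def agent_base_url() -> str:
    """Read the base URL from the environment, defaulting to the proxy."""
    return os.environ.get("AGENT_BASE_URL", PROXY_BASE_URL)


def build_request(
    base_url: str, method: str, path: str, body: Optional[bytes] = None
) -> urllib.request.Request:
    """Build (but do not send) a request against the given base URL."""
    url = base_url.rstrip("/") + "/" + path.lstrip("/")
    return urllib.request.Request(url, data=body, method=method)


# The agent code is unchanged; only the base it is given differs.
direct = build_request(DIRECT_BASE_URL, "DELETE", "/users/42")
proxied = build_request(PROXY_BASE_URL, "DELETE", "/users/42")
```

Keeping the base URL in configuration rather than in code is what lets the switch happen without touching the agent itself.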

Prompt-level guardrails vs action-level guardrails

| | Prompt / tool-description guardrails | agent·shield (action-level) |
| Where it lives | Inside the model's context | A proxy outside the model |
| Can the model bypass it? | Yes (ignore, jailbreak, hallucinate) | No (it never sees the decision) |
| What it controls | What you ask the model to avoid | What requests actually get forwarded |
| Destructive actions | Hopefully skipped | Held for human approval |
| Proof it held | None | Append-only audit log of every decision |
| Setup | Edit prompts and tool specs | Point the agent's base URL at the proxy |

frequently asked

Why aren't prompt instructions enough to guardrail an agent?
Because the model can ignore them. Prompt injection, conflicting context, and plain hallucination all lead models to do things they were told not to. A prompt guardrail is advice; agent·shield is enforcement at the request layer, where the model has no vote on whether a destructive call goes through.
Won't a hard guardrail slow my agent down?
Only on the actions you choose to hold. Safe traffic — reads and routine writes — is forwarded instantly; agent·shield only pauses requests that match a destructive or sensitive policy. The agent runs at full speed for the vast majority of what it does.
What counts as a 'destructive action' by default?
The classic high-blast-radius patterns: HTTP DELETEs, DROP TABLE and TRUNCATE in a SQL body, rm -rf, kubectl delete, and mass updates with no WHERE clause. You can tune the policies — they're regex over method, path, and body — to match exactly what's dangerous in your environment.
Does this work with my existing agent framework?
Yes. agent·shield is a transparent HTTP proxy, so there's no SDK and no rewrite — you point the agent's outbound base URL at agent·shield and it inspects the traffic. The guardrail covers whatever HTTP calls you route through it.
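The default destructive patterns named in the FAQ above could be approximated with regexes like the ones below. These are illustrative guesses, not the patterns agent·shield ships: the rule names and expressions are assumptions, and they treat `rm -rf` and `kubectl delete` as literal command strings appearing in a request body (for example, a shell tool call sent over HTTP).

```python
import re
from typing import List

# Hypothetical body patterns, one per destructive class.
DESTRUCTIVE_BODY_PATTERNS = {
    "drop-table": re.compile(r"\bDROP\s+TABLE\b", re.IGNORECASE),
    "truncate": re.compile(r"\bTRUNCATE\b", re.IGNORECASE),
    "rm-rf": re.compile(r"\brm\s+-(?:rf|fr)\b"),
    "kubectl-delete": re.compile(r"\bkubectl\s+delete\b"),
    # UPDATE ... SET with no WHERE before the end of the statement (";").
    "update-without-where": re.compile(
        r"\bUPDATE\b[^;]*\bSET\b(?![^;]*\bWHERE\b)", re.IGNORECASE
    ),
}


def destructive_matches(body: str) -> List[str]:
    """Names of every destructive pattern found in the request body."""
    return [
        name
        for name, pattern in DESTRUCTIVE_BODY_PATTERNS.items()
        if pattern.search(body)
    ]


def is_destructive(method: str, body: str = "") -> bool:
    """HTTP DELETE is destructive by method; everything else by body."""
    return method.upper() == "DELETE" or bool(destructive_matches(body))
```

Regexes this simple will miss variants such as `rm -r -f` and can misfire on SQL inside string literals, which is one reason to tune the patterns to your own environment.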

Published May 12, 2026 · Last updated June 13, 2026
