AI agent guardrails that hold destructive actions for a human

Prompt guardrails are advice; action guardrails are enforcement

A guardrail written into the prompt is a request the model can decline. It works when the model is cooperative and breaks exactly when you need it most: under prompt injection, after a context window fills with conflicting instructions, or when the model hallucinates that deleting the table is the correct next step. You can't audit a prompt guardrail after the fact, and you can't prove it held — you can only hope.

An action guardrail sits outside the model entirely. agent·shield runs as a transparent HTTP proxy: the agent points its base URL at the proxy instead of calling services directly, and from then on every request — every API call, database query, and tool invocation that goes over HTTP — passes through a checkpoint the model has no control over. The model can decide to send a DELETE; it cannot decide whether that DELETE gets forwarded.

Three outcomes: forward, block, hold

Each intercepted request is matched against policies (regex over the method, path, and body) and gets one of three outcomes. Safe traffic is forwarded instantly, so the agent runs at full speed for the reads and routine writes that make up the bulk of its calls. Calls that match a hard ban are blocked outright and never reach the target. The dangerous middle, destructive or sensitive actions, is held in a human approval queue, paused until a person approves or denies it.
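To make the three outcomes concrete, here is a minimal sketch of a decision function in Python. It is not agent·shield's actual policy format or engine: the `Policy` class, the policy names, and the first-match-wins ordering are all assumptions made for illustration. Only the idea comes from the text above: regex over method, path, and body, resolving to forward, block, or hold.

```python
import re
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass(frozen=True)
class Policy:
    """Hypothetical policy shape: three regexes and an outcome."""
    name: str
    outcome: str          # "block" or "hold"
    method: str = r".*"   # matched against the whole HTTP method
    path: str = r".*"     # searched within the request path
    body: str = r".*"     # searched within the request body

    def matches(self, method: str, path: str, body: str) -> bool:
        return (
            re.fullmatch(self.method, method, re.IGNORECASE) is not None
            and re.search(self.path, path) is not None
            and re.search(self.body, body, re.IGNORECASE | re.DOTALL) is not None
        )


# Assumption: policies are checked in order and the first match wins,
# so hard bans are listed before holds.
POLICIES = [
    Policy("ban-admin-paths", "block", path=r"^/admin/"),
    Policy("hold-deletes", "hold", method=r"DELETE"),
    Policy("hold-drop-table", "hold", body=r"\bDROP\s+TABLE\b"),
]


def decide(method: str, path: str, body: str = "") -> Tuple[str, Optional[str]]:
    """Return (outcome, matched policy name); unmatched traffic is forwarded."""
    for policy in POLICIES:
        if policy.matches(method, path, body):
            return policy.outcome, policy.name
    return "forward", None
```

In this sketch a plain `GET /users/42` matches nothing and is forwarded, a `DELETE /users/42` is held, and a `DELETE /admin/users` is blocked because the ban is checked first.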

That third outcome is the one prompt guardrails can't offer. A held DELETE, a held DROP TABLE, a held kubectl delete doesn't fail and doesn't silently go through; it waits, visible, for a human to make the call. The agent keeps its autonomy on everything safe, and a person stays in the loop only for the handful of actions that could actually do damage.
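One way to picture "held" is a queue entry that stays pending until a person resolves it. The sketch below is an in-memory illustration only: `ApprovalQueue` and its method names are hypothetical, and nothing here reflects how agent·shield actually stores or surfaces held requests.

```python
import threading
import uuid
from typing import Optional


class ApprovalQueue:
    """Hypothetical in-memory queue of held requests awaiting a human decision."""

    def __init__(self) -> None:
        self._lock = threading.Lock()
        self._entries = {}  # held_id -> entry dict

    def hold(self, request: dict) -> str:
        """Park a request and return an id a reviewer can act on."""
        held_id = uuid.uuid4().hex
        with self._lock:
            self._entries[held_id] = {
                "request": request,
                "decision": None,           # becomes "approved" or "denied"
                "reviewer": None,
                "resolved": threading.Event(),
            }
        return held_id

    def pending(self) -> list:
        """List (id, request) pairs that nobody has decided yet."""
        with self._lock:
            return [(hid, e["request"]) for hid, e in self._entries.items()
                    if e["decision"] is None]

    def resolve(self, held_id: str, decision: str, reviewer: str) -> None:
        """Record a human decision and wake whoever is waiting on it."""
        if decision not in ("approved", "denied"):
            raise ValueError("decision must be 'approved' or 'denied'")
        with self._lock:
            entry = self._entries[held_id]
            if entry["decision"] is not None:
                raise RuntimeError("request already resolved")
            entry["decision"] = decision
            entry["reviewer"] = reviewer
        entry["resolved"].set()

    def wait(self, held_id: str, timeout: Optional[float] = None) -> Optional[str]:
        """Block until resolved; returns None if still pending after timeout."""
        with self._lock:
            entry = self._entries[held_id]
        entry["resolved"].wait(timeout)
        return entry["decision"]
```

The point of the shape is that a held request has exactly two exits, approved or denied, and both require a named reviewer; until then it neither fails nor goes through.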

Guardrails you can prove held

Guardrails are only worth as much as your ability to show they worked. Because agent·shield decides at the request level, every decision — forwarded, blocked, held, approved, denied — is written to an append-only audit log with the full request, the policy it matched, who approved or denied it, and when. When someone asks "what was the agent actually allowed to do, and who let it," there's a record, not a shrug.
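As a rough illustration of what one audit entry might carry, the sketch below appends one JSON object per line to a file. The field names, the JSON Lines layout, and both function names are assumptions, not agent·shield's real log schema. Opening a file in append mode also only means this writer never rewrites earlier lines; a genuinely append-only log needs enforcement at the storage layer.

```python
import json
from datetime import datetime, timezone
from typing import Optional

# The five decision types named in the text above.
DECISIONS = {"forwarded", "blocked", "held", "approved", "denied"}


def append_audit_record(log_path: str, decision: str, request: dict,
                        policy: Optional[str] = None,
                        reviewer: Optional[str] = None) -> dict:
    """Append one decision to the log as a single JSON line (hypothetical schema)."""
    if decision not in DECISIONS:
        raise ValueError(f"unknown decision: {decision}")
    record = {
        "at": datetime.now(timezone.utc).isoformat(),  # when
        "decision": decision,                          # what happened
        "request": request,                            # the full request
        "policy": policy,                              # which policy matched
        "reviewer": reviewer,                          # who approved or denied
    }
    # Mode "a" appends to the end of the file; existing lines are untouched.
    with open(log_path, "a", encoding="utf-8") as log:
        log.write(json.dumps(record, sort_keys=True) + "\n")
    return record


def read_audit_log(log_path: str) -> list:
    """Read every record back in the order it was written."""
    with open(log_path, encoding="utf-8") as log:
        return [json.loads(line) for line in log if line.strip()]
```

A held DELETE that is later approved would show up as two lines in this layout: one `held` record with the matched policy, then one `approved` record naming the reviewer.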

Setup stays out of the agent's way: there's no SDK to adopt and no agent code to rewrite. You change the base URL the agent calls and write policies for the actions you care about. The guardrail is infrastructure, not a library buried in the agent, which is what makes it hard to bypass and easy to reason about. Worth stating plainly: agent·shield enforces only on traffic that flows through the proxy, so guardrails cover the calls you route through it. Point all of the agent's outbound HTTP at agent·shield and every HTTP call it makes is covered.
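On the agent side, "change the base URL" can be as small as one environment variable. The sketch below shows only that idea: the variable name `AGENT_BASE_URL`, the proxy hostname, and the upstream `api.example.com` are all hypothetical, and how the proxy learns the real upstream target is not covered here.

```python
import os
from urllib.parse import urljoin

# Hypothetical upstream the agent would call if no proxy were configured.
DEFAULT_BASE_URL = "https://api.example.com/"


def service_url(path: str) -> str:
    """Build a request URL from whatever base URL the environment provides."""
    base = os.environ.get("AGENT_BASE_URL", DEFAULT_BASE_URL)
    # Normalize slashes so urljoin appends the path instead of replacing it.
    return urljoin(base.rstrip("/") + "/", path.lstrip("/"))
```

With `AGENT_BASE_URL` unset, `service_url("/v1/users/42")` targets the upstream directly. Set it to something like `http://agent-shield.internal:8080` and the same agent code sends the same request to the proxy instead, with no other change.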

Prompt-level guardrails vs action-level guardrails

	Prompt / tool-description guardrails	agent·shield (action-level)
Where it lives	Inside the model's context	A proxy outside the model
Can the model bypass it	Yes — ignore, jailbreak, hallucinate	No — it never sees the decision
What it controls	What you ask the model to avoid	What requests actually get forwarded
Destructive actions	Hopefully skipped	Held for human approval
Proof it held	None	Append-only audit log of every decision
Setup	Edit prompts and tool specs	Point the agent's base URL at the proxy

Frequently asked questions

Why aren't prompt instructions enough to guardrail an agent?

Because the model can ignore them. Prompt injection, conflicting context, and plain hallucination all lead models to do things they were told not to. A prompt guardrail is advice; agent·shield is enforcement at the request layer, where the model has no vote on whether a destructive call goes through.

Won't a hard guardrail slow my agent down?

Only on the actions you choose to hold. Safe traffic — reads and routine writes — is forwarded instantly; agent·shield only pauses requests that match a destructive or sensitive policy. The agent runs at full speed for the vast majority of what it does.

What counts as a 'destructive action' by default?

The classic high-blast-radius patterns: HTTP DELETEs, DROP TABLE and TRUNCATE in a SQL body, rm -rf, kubectl delete, and mass updates with no WHERE clause. You can tune the policies — they're regex over method, path, and body — to match exactly what's dangerous in your environment.
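For a sense of what such patterns can look like, here is a sketch of regexes for the cases listed above. These are illustrative guesses, not agent·shield's shipped defaults, and the pattern names are made up. Regex over a body is also a blunt tool: the UPDATE check, for example, treats the whole body as one statement.

```python
import re

FLAGS = re.IGNORECASE | re.DOTALL

# Hypothetical body patterns, one per case named in the text.
BODY_PATTERNS = {
    "drop-table": re.compile(r"\bDROP\s+TABLE\b", FLAGS),
    "truncate": re.compile(r"\bTRUNCATE\b", FLAGS),
    "rm-rf": re.compile(r"\brm\s+-(?:rf|fr)\b", FLAGS),
    "kubectl-delete": re.compile(r"\bkubectl\s+delete\b", FLAGS),
    # UPDATE ... SET with no WHERE anywhere after it in the body.
    "update-without-where": re.compile(
        r"\bUPDATE\s+\S+\s+SET\b(?!.*\bWHERE\b)", FLAGS
    ),
}


def destructive_matches(method: str, body: str) -> list:
    """Return the names of every destructive pattern this request trips."""
    hits = []
    if method.upper() == "DELETE":
        hits.append("http-delete")
    hits.extend(name for name, pattern in BODY_PATTERNS.items()
                if pattern.search(body))
    return hits
```

An `UPDATE users SET active = false` body trips `update-without-where`, while the same statement with `WHERE id = 42` appended does not. Tuning for your environment means adding, removing, or tightening entries like these.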

Does this work with my existing agent framework?

Yes. agent·shield is a transparent HTTP proxy, so there's no SDK and no rewrite — you point the agent's outbound base URL at agent·shield and it inspects the traffic. The guardrail covers whatever HTTP calls you route through it.

Published May 12, 2026 · Last updated June 13, 2026
