// Glossary · technical

Guardrails

Also: AI safety filters · policy layer · agent constraints

Hard-coded rules and filters that constrain AI agent behavior, layered above the model itself to enforce policy the model cannot be trusted to follow on its own.

Guardrails are the policy enforcement layer that sits between a language model and the user. The model produces an output. The guardrail evaluates the output against a set of rules. If the output passes, it reaches the user. If it fails, the system returns a safe fallback, escalates to a human, or retries with stricter context. The same layer can screen user input before it reaches the model. Rules cover what topics the agent can engage with, what claims it can make, what data it can reveal, and what tools it can invoke. Guardrails exist because prompting alone is not a reliable way to enforce policy. A model told "never give medical advice" in the system prompt can still give medical advice when the user asks the right way. A guardrail running a classifier on the output checks what the model actually produced, so it can catch the violation regardless of how the request was phrased.
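The pass, retry, and fallback flow can be sketched in a few lines of Python. This is a minimal illustration, not a reference implementation: `generate` is a hypothetical stand-in for the model call, `no_medical_advice` is a toy keyword rule standing in for a real classifier, and escalation to a human is left out for brevity.

```python
import re
from dataclasses import dataclass
from typing import Callable, List, Optional

# Hypothetical fallback text; a real deployment would use approved copy.
SAFE_FALLBACK = "I can't help with that. Please consult a qualified professional."


@dataclass
class Verdict:
    passed: bool
    rule: Optional[str] = None  # name of the rule that failed, if any


Rule = Callable[[str], Verdict]

# Toy pattern standing in for a trained medical-advice classifier.
MEDICAL = re.compile(r"\b(dosage|dose|prescri\w*)\b", re.IGNORECASE)


def no_medical_advice(output: str) -> Verdict:
    if MEDICAL.search(output):
        return Verdict(False, "no_medical_advice")
    return Verdict(True)


def check(output: str, rules: List[Rule]) -> Verdict:
    """Return the first failing verdict, or a passing one if every rule passes."""
    for rule in rules:
        verdict = rule(output)
        if not verdict.passed:
            return verdict
    return Verdict(True)


def guarded_reply(
    generate: Callable[[str, bool], str],  # assumed signature: (user_msg, strict) -> text
    user_msg: str,
    rules: List[Rule],
    max_retries: int = 1,
) -> str:
    strict = False
    for _ in range(max_retries + 1):
        output = generate(user_msg, strict)
        if check(output, rules).passed:
            return output
        strict = True  # retry once more with stricter context
    return SAFE_FALLBACK  # every attempt failed, so return the safe fallback
```

The point of the structure is that `guarded_reply` decides what reaches the user, and the model never sees or controls that decision.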

Common guardrail categories include topic restriction (no medical, legal, or financial advice from a marketing chatbot), PII handling (no Social Security numbers or PHI in responses), brand safety (no commitments outside the approved pricing list), and prompt-injection defense (no following instructions hidden in user input). The implementation stack ranges from open-source libraries like NeMo Guardrails and Guardrails AI to commercial offerings from companies like Lakera and Protect AI. Production deployments often combine a fast classifier model for high-volume filtering with regex rules for known patterns and LLM-as-judge for nuanced policy checks.
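One plausible way to combine the three techniques is a cascade: cheap regex rules first, a fast classifier second, and an LLM judge only for the ambiguous middle. The sketch below assumes that ordering. `classifier_score` and `llm_judge` are hypothetical callables supplied by the caller, and the thresholds are illustrative values, not recommendations.

```python
import re
from typing import Callable

# Known pattern: US Social Security number in the common NNN-NN-NNNN form.
SSN_PATTERN = re.compile(r"\b\d{3}-\d{2}-\d{4}\b")


def layered_check(
    text: str,
    classifier_score: Callable[[str], float],  # hypothetical: 0.0 (safe) to 1.0 (violation)
    llm_judge: Callable[[str], bool],          # hypothetical: True when the text is acceptable
    block_above: float = 0.9,                  # illustrative threshold
    allow_below: float = 0.1,                  # illustrative threshold
) -> str:
    # Layer 1: regex rules for known patterns, effectively free to run.
    if SSN_PATTERN.search(text):
        return "blocked:regex"

    # Layer 2: fast classifier handles the clear-cut cases at high volume.
    score = classifier_score(text)
    if score >= block_above:
        return "blocked:classifier"
    if score <= allow_below:
        return "allowed"

    # Layer 3: slower LLM-as-judge, reserved for the ambiguous middle band.
    return "allowed" if llm_judge(text) else "blocked:judge"
```

With this arrangement the expensive judge only runs on text the cheaper layers could not settle.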

Every production agent in the AI Support Department and AI Sales Department ships with guardrails as a non-negotiable layer. The rules get written during onboarding alongside the system prompt and brand voice spec, then live in code rather than in the prompt itself. This separation matters because the prompt lives inside the model's context window, where a clever prompt injection can override it. Guardrails run outside the model context and stay enforceable even when the model gets jailbroken. A useful framing is that guardrails are the seatbelts of AI deployment. Most of the time the model behaves well and they do nothing. The few times the model fails, they keep the failure from becoming a customer incident.
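A tool allowlist is one simple illustration of a rule that lives in code rather than in the prompt. In the sketch below, the tool names, the `registry` dictionary, and the `ToolNotAllowed` exception are all hypothetical. The check runs in the application, so nothing the model outputs can change it.

```python
from typing import Any, Callable, Dict

# Hypothetical allowlist, defined in code and never shown to the model.
ALLOWED_TOOLS = {"lookup_order", "create_ticket"}


class ToolNotAllowed(Exception):
    """Raised when the model requests a tool outside the allowlist."""


def invoke_tool(
    name: str,
    args: Dict[str, Any],
    registry: Dict[str, Callable[..., Any]],
) -> Any:
    # Enforced outside the model context: a jailbroken model can ask for
    # any tool it likes, but only allowlisted tools are ever executed.
    if name not in ALLOWED_TOOLS or name not in registry:
        raise ToolNotAllowed(name)
    return registry[name](**args)
```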

// Examples
  • A healthcare-adjacent SaaS chatbot uses a guardrail that flags any output mentioning specific drug names or dosages and replaces the response with "please consult your provider."
  • An e-commerce support agent runs every outbound message through a regex check that blocks the model from quoting prices outside the approved SKU list, catching one misquote per 2,000 conversations.
  • A fintech onboarding assistant uses a PII guardrail that detects Social Security numbers in user input and prevents the model from ever echoing them back into the response.
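
The price check in the e-commerce example might look something like the following. This is a sketch under stated assumptions: the approved prices are a hypothetical set passed in by the caller, and the pattern only handles dollar amounts written with a leading `$`.

```python
import re
from decimal import Decimal
from typing import List, Set

# Matches amounts such as $49, $39.99, or $1,299.00 (assumed format).
PRICE_PATTERN = re.compile(r"\$\s?(\d[\d,]*(?:\.\d{1,2})?)")


def unapproved_prices(message: str, approved: Set[Decimal]) -> List[Decimal]:
    """Return every dollar amount in the message that is not an approved price."""
    found = [Decimal(m.replace(",", "")) for m in PRICE_PATTERN.findall(message)]
    return [price for price in found if price not in approved]


def is_safe_to_send(message: str, approved: Set[Decimal]) -> bool:
    # Block the outbound message if it quotes any price outside the list.
    return not unapproved_prices(message, approved)
```

Using `Decimal` rather than `float` keeps comparisons exact, so `$49` and `$49.00` are treated as the same approved price.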
// Common questions
Are guardrails the same as the system prompt?
No. The system prompt instructs the model from inside the context window and can be overridden by injection or model error. Guardrails run as a separate enforcement layer outside the model context. A system prompt sets intent. Guardrails enforce policy regardless of what the model decides to do.
What is the performance cost of guardrails?
A lightweight classifier guardrail typically adds 50 to 200 milliseconds of latency per response. A full LLM-as-judge check typically adds 500 to 1,500 milliseconds. Exact figures depend on model size and hosting. Production systems balance latency against safety by using fast checks for high-volume traffic and slower checks for high-risk paths like tool calls and policy claims.
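Routing checks by risk can be sketched as a small selection function. The path labels and the `Check` type below are hypothetical. The idea is only that every response pays for the fast checks, while the slow ones are reserved for high-risk paths.

```python
from typing import Callable, List, Sequence

Check = Callable[[str], bool]  # returns True when the text passes

# Hypothetical labels for the high-risk paths named above.
HIGH_RISK_PATHS = {"tool_call", "policy_claim"}


def select_checks(path: str, fast: Sequence[Check], slow: Sequence[Check]) -> List[Check]:
    """Always run fast checks; add slow checks only on high-risk paths."""
    checks = list(fast)
    if path in HIGH_RISK_PATHS:
        checks.extend(slow)
    return checks


def passes(text: str, checks: Sequence[Check]) -> bool:
    # all() short-circuits, so a failed fast check skips the slower ones.
    return all(check(text) for check in checks)
```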
Can I write guardrails in plain English?
Sort of. NeMo Guardrails uses a domain-specific language called Colang that reads close to plain English. Guardrails AI takes a declarative approach, with validators specified through RAIL specs (an XML-based format) or Pydantic models. Both let policy owners write rules without deep ML knowledge, but production systems still need an engineer to verify the rules trigger correctly under adversarial input and to maintain them as the agent evolves.
Do guardrails prevent every bad output?
No. Guardrails catch the failure modes you anticipated and wrote rules for. Novel jailbreaks, edge-case phrasings, and emerging attack patterns slip through until you add new rules. The right mental model is defense-in-depth: guardrails plus citation requirements plus human review plus retrieval quality scoring, not any single layer alone.