Guardrails
Hard-coded rules and filters that constrain AI agent behavior, layered above the model itself to enforce policy the model cannot be trusted to follow on its own.
Guardrails are the policy enforcement layer that sits between a language model and the user. The model produces an output. The guardrail evaluates the output against a set of rules. If the output passes, it reaches the user. If it fails, the system returns a safe fallback, escalates to a human, or retries with stricter context. Rules cover what topics the agent can engage with, what claims it can make, what data it can reveal, and what tools it can invoke. Guardrails exist because prompting alone is not a reliable way to enforce policy. A model told "never give medical advice" in the system prompt can still be coaxed into giving medical advice by a user who phrases the request the right way. A guardrail running a classifier on the output checks what the model actually said, so it can catch the violation regardless of how the request was phrased.
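The pass, fallback, escalate, and retry flow described above can be sketched in a few lines of Python. This is a minimal illustration, not a reference implementation: `guarded_reply`, `no_medical_advice`, `SAFE_FALLBACK`, and the keyword pattern are hypothetical names invented for the example, and the keyword check stands in for the classifier a real deployment would use.

```python
import re
from dataclasses import dataclass
from enum import Enum
from typing import Callable, Optional, Sequence


class Action(Enum):
    ALLOW = "allow"
    FALLBACK = "fallback"
    ESCALATE = "escalate"


@dataclass
class Result:
    action: Action
    text: str
    failed_rule: Optional[str] = None


# A rule returns None when the output passes, or its own name when it fails.
Rule = Callable[[str], Optional[str]]

# Hypothetical fallback copy; real wording would come from the brand voice spec.
SAFE_FALLBACK = "I can't help with that here, but I can connect you with someone who can."

# Assumption: a keyword pattern standing in for a trained classifier.
_MEDICAL = re.compile(r"\b(dosage|dose|prescribe|prescription)\b", re.IGNORECASE)


def no_medical_advice(output: str) -> Optional[str]:
    return "no_medical_advice" if _MEDICAL.search(output) else None


def first_failure(output: str, rules: Sequence[Rule]) -> Optional[str]:
    """Return the name of the first rule the output violates, or None."""
    for rule in rules:
        failed = rule(output)
        if failed is not None:
            return failed
    return None


def guarded_reply(
    generate: Callable[[str], str],
    rules: Sequence[Rule],
    prompt: str,
    max_retries: int = 1,
    escalate_on: frozenset = frozenset(),
) -> Result:
    """Run the model, check its output, and retry, escalate, or fall back."""
    context = prompt
    failed = None
    for _ in range(max_retries + 1):
        output = generate(context)
        failed = first_failure(output, rules)
        if failed is None:
            return Result(Action.ALLOW, output)
        if failed in escalate_on:
            # Some violations go straight to a human instead of being retried.
            return Result(Action.ESCALATE, SAFE_FALLBACK, failed)
        # Retry with stricter context that names the violated rule.
        context = (
            f"{prompt}\n\nPolicy reminder: the previous draft violated "
            f"'{failed}'. Do not repeat it."
        )
    return Result(Action.FALLBACK, SAFE_FALLBACK, failed)
```

The enforcement decision lives in `guarded_reply`, outside anything the model can rewrite: whatever `generate` returns, only text that passes every rule is tagged `ALLOW`.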
Common guardrail categories include topic restriction (no medical, legal, or financial advice from a marketing chatbot), PII handling (no Social Security numbers or PHI in responses), brand safety (no commitments outside the approved pricing list), and prompt-injection defense (no following instructions hidden in user input). The implementation stack ranges from open-source libraries like NeMo Guardrails and Guardrails AI to commercial offerings from companies like Lakera and Protect AI. Production deployments often combine a fast classifier model for high-volume filtering with regex rules for known patterns and LLM-as-judge for nuanced policy checks.
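One way to combine the three kinds of check is to order them by cost, so that most traffic never reaches the most expensive one. The sketch below is an assumption about how such a stack might be wired, not a description of any particular library: `check_output`, the thresholds, and the `classifier` and `judge` callables are hypothetical, and routing only the uncertain middle band to the judge is one design choice among several.

```python
import re
from typing import Callable, NamedTuple, Sequence


class Decision(NamedTuple):
    allowed: bool
    stage: str  # which tier made the call: "regex", "classifier", or "judge"


# Tier 1 patterns: cheap checks for known shapes (here, SSN-shaped strings).
KNOWN_PATTERNS = (re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),)


def check_output(
    text: str,
    classifier: Callable[[str], float],
    judge: Callable[[str], bool],
    patterns: Sequence[re.Pattern] = KNOWN_PATTERNS,
    block_above: float = 0.9,  # hypothetical threshold
    allow_below: float = 0.1,  # hypothetical threshold
) -> Decision:
    # Tier 1: regex rules for known patterns.
    if any(p.search(text) for p in patterns):
        return Decision(False, "regex")

    # Tier 2: fast classifier, assumed to return a violation score in [0, 1].
    score = classifier(text)
    if score >= block_above:
        return Decision(False, "classifier")
    if score <= allow_below:
        return Decision(True, "classifier")

    # Tier 3: only the uncertain cases pay for an LLM-as-judge call.
    # `judge` is assumed to return True when the text complies with policy.
    return Decision(judge(text), "judge")
```

In use, `classifier` and `judge` would wrap whatever models the deployment runs. Passing them in as plain callables keeps the routing logic testable without any model calls.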
Every production agent in the AI Support Department and AI Sales Department ships with guardrails as a non-negotiable layer. The rules get written during onboarding alongside the system prompt and brand voice spec, then live in code rather than in the prompt itself. This separation matters because the prompt lives inside the model's context, where a clever prompt injection can override it. Guardrails run outside the model context and stay enforceable even when the model gets jailbroken. A useful framing is that guardrails are the seatbelts of AI deployment. Most of the time the model behaves well and they do nothing. The few times the model fails, they prevent the failure from becoming a customer incident.
- A healthcare-adjacent SaaS chatbot uses a guardrail that flags any output mentioning specific drug names or dosages and replaces the response with "please consult your provider."
- An e-commerce support agent runs every outbound message through a regex check that blocks the model from quoting prices outside the approved SKU list, catching one misquote per 2,000 conversations.
- A fintech onboarding assistant uses a PII guardrail that detects Social Security numbers in user input and prevents the model from ever echoing them back into the response.
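The price and PII examples above are the kind of rule that fits in a few lines of regex. The sketch below is illustrative only: `APPROVED_PRICES`, `unapproved_prices`, and `redact_ssns` are hypothetical names, the price list is invented, and the patterns are deliberately simple (dollar amounts and dash-formatted SSNs only), so a production rule would need broader coverage.

```python
import re
from decimal import Decimal
from typing import List

# Hypothetical approved price list; a real one would be loaded from the SKU catalog.
APPROVED_PRICES = {Decimal("49.00"), Decimal("199.00")}

# Matches dollar amounts such as $49, $199.00, or $1,299.00.
PRICE = re.compile(r"\$\s?(\d[\d,]*(?:\.\d{1,2})?)")

# Matches dash-formatted SSNs only (123-45-6789); a simplifying assumption.
SSN = re.compile(r"\b\d{3}-\d{2}-\d{4}\b")


def unapproved_prices(message: str) -> List[Decimal]:
    """Return every quoted price that is not on the approved list."""
    quoted = [Decimal(m.replace(",", "")) for m in PRICE.findall(message)]
    return [p for p in quoted if p not in APPROVED_PRICES]


def redact_ssns(text: str) -> str:
    """Mask SSN-shaped strings so they cannot be echoed back."""
    return SSN.sub("[REDACTED]", text)


def outbound_ok(message: str) -> bool:
    """Block a message that quotes an unapproved price or contains an SSN."""
    return not unapproved_prices(message) and SSN.search(message) is None
```

For the fintech case, one option is to call `redact_ssns` on the user input before it reaches the model, so the number is never in context to be echoed, and to run `outbound_ok` on the response as a second check.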
Are guardrails the same as the system prompt?
What is the performance cost of guardrails?
Can I write guardrails in plain English?
Do guardrails prevent every bad output?