// Glossary · technical

On-Device AI Agent

Also: local AI agent · edge AI agent

An AI agent running on hardware controlled by the buyer (on-prem server, local workstation, isolated VM) with no inference traffic leaving the network.

An On-Device AI Agent runs the model, the runtime, and the data layer on hardware the buyer owns and controls. No inference traffic to a public endpoint, no payload sent to a third-party vendor, no telemetry phoning home unless the buyer opts in. The agent reads from internal data sources, runs the model against them locally, and returns the answer inside the network perimeter. EOI ships this pattern through OpenClaw and Hermes on private GPU iron, documented in full on the local agent setup page.

The technical stack typically looks like a single H100 or dual L40S GPU server racked in the buyer data center, running a 70B open-weight model like Llama 3.1 or Qwen 2.5. The runtime integrates with the buyer identity provider (Okta, Entra ID, Active Directory) so users only retrieve documents they were already allowed to see. Connectors read from internal Confluence, SharePoint, on-prem Salesforce, file shares, and SQL databases. Indexes get built and refreshed locally. The buyer IT team operates the box like any other internal service.

Why this pattern exists is compliance, custody, and cost at volume. A regulated team in fintech or healthcare cannot push customer data through a hosted API without rebuilding the entire compliance argument. An on-device deployment collapses that argument to one sentence: the data never left the controlled environment. See Local LLM for the model layer and KYC AI for the financial-services workflow that drives most on-device deployments.

// Examples
  • A regional bank in Asia runs an on-device agent over 10 years of internal credit memos. The model reads underwriting history, recent transactions, and customer correspondence to draft new memos a credit officer reviews. Drafting time drops from 40 minutes to 6 minutes per file.
  • A hospital group runs a 70B local model on a GPU server in the same data center as the EHR to draft clinical summaries. The same project failed legal review two years earlier when it ran on a hosted API. The on-device version cleared review in a week.
  • A defense contractor runs an air-gapped install inside a SCIF with zero internet connectivity. The agent reads, classifies, and summarizes classified material without ever touching a network that touches the outside world.
// Common questions
What hardware do I need for an on-device AI agent?
A single H100 or dual L40S GPU server handles a 70B model serving 50 concurrent users with sub-100ms latency. Lighter workloads run on CPU with quantized smaller models. Heavier workloads (full 405B, high concurrency) need multi-GPU sharding. The right spec gets sized against the actual workload during the audit, not before.
How is this different from running ChatGPT in a private endpoint?
A private endpoint still runs the model on the vendor infrastructure. The buyer data leaves the network on every call, just to a smaller blast radius. On-device runs the model on the buyer hardware. The buyer data never leaves the perimeter. That distinction is what compliance teams actually care about.
Can the on-device agent call cloud APIs when needed?
Yes, if the buyer policy allows it. Most teams run hybrid: sensitive queries stay local, non-sensitive queries can optionally route to a cloud model. The agent decides per request based on rules the buyer writes. Strict air-gap configurations have no external connectivity at all.
What is the latency vs a cloud API?
On the internal network, expect 40 to 90ms for a 70B model on an H100, vs 600 to 1200ms round-trip to a public cloud endpoint. The local agent feels noticeably snappier because there is no internet hop. Air-gapped installs see the same internal latency since the GPU is the bottleneck, not the network.
// Related terms
// Ready to ship?

EOI runs fractional AI departments for funded teams under 50. Sales, Content, Ops, Support. Live in 14 days on a monthly retainer.