// Service · Local Agent Setup

Local AI agents, on your own iron.

OpenClaw + Hermes installed inside your network. Models running on hardware you control. Zero PII, contracts, or source code leaving the perimeter. Built for fintech, healthcare, legal, and government teams where the cloud is not an option.

Book a setup call → · Talk to Roy →

// The problem

Every cloud AI tool ships your data to someone else first.

Apollo, Outreach, Gong, ChatGPT, Claude: every modern AI workflow you have seen works the same way under the hood. You push your data to their endpoint, their model runs on their hardware, and the answer comes back. The contract says they will not train on it. The architecture says it left your network. For most teams that trade-off is fine. For a meaningful slice of the market it is a hard stop.

A private equity firm cannot send LP statements to a third-party API. A regional bank cannot push customer transaction histories to an endpoint in a different jurisdiction. A hospital cannot route patient charts through a vendor that signed a BAA last quarter but has not been audited yet. A defense contractor cannot let proprietary CAD files traverse the public internet at all. A law firm cannot push privileged communications through an LLM that logs requests for thirty days. The list is long, and the answer the cloud vendors keep giving (enterprise tier, SOC 2, data residency in your region) does not solve the core issue. The data still leaves your machines.

The other answer is local. Install the agent on your hardware, inside your network, behind your firewall. The model never sees a public endpoint. Your data never leaves. The compute happens on iron you own. We have been shipping this configuration for two years across regulated clients in Hong Kong, Singapore, and the EU, and it is now a productized engagement. If you want to understand whether your workload fits the shape, the AI Strategy Audit is the front door.

// Why cloud is off the table

Compliance, custody, and cost, in that order.

Compliance is the loudest reason teams come to us for a local install. GDPR Article 28 processors, HIPAA BAA chains, MAS guidelines on outsourcing in Singapore, HKMA cloud-risk circulars, the EU AI Act provisions for high-risk systems. None of them outright ban cloud AI. All of them make it expensive to defend. Local installs collapse the compliance argument to a single sentence: the data never left the controlled environment. That sentence is worth a lot of legal hours.

Custody is the second reason. When your agent runs on an external API, the vendor controls model behavior, model availability, and pricing. They deprecate a model and you have a migration. They raise prices and you pay. They have an outage and your workflow stops. Owning the model and the compute means the workflow keeps running when OpenAI is down, when Anthropic rate-limits you, when the vendor pushes a behavior change that breaks your prompt chain. The agent answers to you, not to a roadmap somewhere else.

Cost is third and only matters at volume. A team running 10 million inference calls a month on a hosted API is paying real money per call. The same workload on a properly sized local GPU rig amortizes to a fraction of that within twelve months and stays flat after. For teams below that volume, cloud is cheaper and we will say so. For teams above it, local pays for itself. We size the hardware against the workload before we quote anything.
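The break-even reasoning above comes down to a few lines of arithmetic. The sketch below is a rough illustration, not a quote: every figure in it (per-call price, hardware cost, monthly running cost) is a hypothetical placeholder, and the names `CostInputs` and `breakeven_months` exist only for this example.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class CostInputs:
    calls_per_month: int         # inference calls per month
    cloud_cost_per_call: float   # hosted API cost per call (hypothetical)
    hardware_capex: float        # one-time GPU server cost (hypothetical)
    local_opex_per_month: float  # power, hosting, maintenance (hypothetical)


def breakeven_months(c: CostInputs) -> Optional[float]:
    """Months until saved cloud spend has paid back the hardware.

    Returns None when local never pays back, i.e. the monthly cloud
    bill is no higher than the monthly cost of running the local box.
    """
    cloud_monthly = c.calls_per_month * c.cloud_cost_per_call
    monthly_saving = cloud_monthly - c.local_opex_per_month
    if monthly_saving <= 0:
        return None
    return c.hardware_capex / monthly_saving


# Placeholder figures only, not quotes or benchmarks.
high_volume = CostInputs(10_000_000, 0.002, 120_000.0, 5_000.0)
low_volume = CostInputs(200_000, 0.002, 120_000.0, 5_000.0)

print(breakeven_months(high_volume))  # 8.0
print(breakeven_months(low_volume))   # None
```

With these placeholder inputs the high-volume case pays back in eight months and the low-volume case never does, which is the pattern described above. A real sizing exercise would use your actual call volume and per-call cost.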

// What we install

Five pieces of a working local AI agent stack.

Not only a model on a server. A full operating agent: runtime, hardware, models, connectors, maintenance. Configured for your workload, not a generic demo.

OpenClaw + Hermes install

OpenClaw as the agent runtime, Hermes as the orchestration layer. Both deployed on your hardware, integrated with your identity provider (Okta, Entra ID, Active Directory), routed through your network policy. No telemetry calling home unless you opt in.

Hardware sizing + provisioning

We spec the box against the workload. Single H100 for a 70B model, dual L40S for cost-optimized inference, CPU-only with quantized models for lighter workloads. We can buy and ship the hardware or work with what your data center already has racked.

Model selection per workload

Llama 3 70B for general reasoning, Mistral Large for European-language workloads, Qwen 2.5 for multilingual including Cantonese and Mandarin, fine-tuned smaller models for domain-specific tasks. Right-sized per use case, not one giant model for everything.

Connector setup to local data sources

The agent connects to your wiki (Confluence, Notion-export, SharePoint), your CRM (on-prem Salesforce, internal CRM), your file shares, your databases. All connections stay inside the network. RAG indexes built and refreshed locally.

Maintenance + model updates

Optional monthly retainer covers security patches, model upgrades, prompt-chain tuning, and new connector requests. When Llama 4 ships or your wiki gets restructured, we update the install without you opening a ticket.

// The hardware

What a typical install looks like in practice.

Honest numbers from real deployments. Your spec will vary with workload size, concurrency, and which models you pick.

1× H100 / 2× L40S

Typical hardware spec

for a 70B-parameter model serving 50 concurrent users

40 to 90ms

Inference latency on internal network

vs 600 to 1200ms round-trip to cloud endpoints

7B to 405B

Model sizes supported

from Phi-3 small (7B) up to Llama 3.1 405B with multi-GPU sharding

4 to 8 weeks

Time from kickoff to live

hardware lead-time is the slowest piece, not the install
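The hardware figures above can be sanity-checked with a rule of thumb: weight memory is roughly parameter count times bytes per parameter, plus headroom for the KV cache and runtime. The sketch below assumes the 80 GB H100 variant, the 48 GB L40S, and a flat 20% overhead. The overhead figure and the function names are assumptions for illustration; real headroom depends on context length and concurrency.

```python
# Per-card memory in GB. Assumes the 80 GB H100 variant and the 48 GB L40S.
GPU_MEMORY_GB = {"H100": 80, "L40S": 48}

# Approximate bytes stored per parameter at each weight precision.
BYTES_PER_PARAM = {"fp16": 2.0, "int8": 1.0, "int4": 0.5}


def weights_gb(params_billion: float, precision: str) -> float:
    """Memory for the weights alone: billions of params x bytes per param = GB."""
    return params_billion * BYTES_PER_PARAM[precision]


def fits(params_billion: float, precision: str, gpu: str,
         count: int = 1, overhead: float = 0.2) -> bool:
    """True if weights plus a flat overhead fit in the combined GPU memory.

    The 20% default overhead is an assumed placeholder for KV cache
    and runtime buffers, not a measured value.
    """
    needed = weights_gb(params_billion, precision) * (1.0 + overhead)
    return needed <= GPU_MEMORY_GB[gpu] * count


print(fits(70, "fp16", "H100"))           # False: 168 GB needed vs 80 GB
print(fits(70, "int4", "H100"))           # True: 42 GB needed vs 80 GB
print(fits(70, "int8", "L40S", count=2))  # True: 84 GB needed vs 96 GB
```

Under these assumptions, a 70B model on a single 80 GB card implies quantized weights, while the dual L40S option has room for 8-bit weights.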

// Side by side

Cloud API agent vs local on-device agent.

Both run the same use case. Different trade-offs in compliance, cost, control, and operating posture. Honest read on which fits which team.

Cloud API agent

Data leaves your network on every call
Vendor controls model version, pricing, availability
Compliance argument needs DPAs, audits, region clauses
Per-token pricing scales with usage
Sub-second latency if endpoint is in-region
Zero hardware to manage
Outage at the vendor takes you offline
Right answer for high-velocity teams without sensitive data

Local on-device agent

Data never leaves the perimeter
You control the model, the version, the uptime
Compliance argument is one sentence: data stayed inside
Capex on hardware, flat opex on maintenance
Sub-100ms latency on internal network
You manage a GPU server (or we manage it for you)
Vendor outages do not affect your stack
Right answer for regulated industries and data-sensitive workloads

// The 4-to-8 week sprint

From audit call to live local agent in five steps.

Longer than our 14-day sprints for cloud-native departments. Hardware lead-time is the gating factor. The install itself takes days. Provisioning the iron takes weeks.

Step 01

Week 1 · Audit

We map your workload, your data sensitivity classes, your existing infrastructure, and the compliance constraints you are operating under. Output is a written recommendation: cloud, local, or hybrid. If your workload does not need local, we will say so before you spend a dollar on hardware.

Step 02

Week 1 to 2 · Hardware spec

We spec the GPU server against the model size, concurrency, and growth runway. You pick: buy and rack new hardware, repurpose what you have, or run on cloud-vendor hardware deployed on your premises (AWS Outposts, Azure Stack Hub). We send a parts list with vendor links.

Step 03

Week 3 to 5 · Install

OpenClaw and Hermes get deployed on the box. Network policy, identity integration, internal DNS, TLS certificates inside your CA. We build against your reference architecture, not against a SaaS install guide.

Step 04

Week 5 to 7 · Model fit

Pick the model, tune the prompts against your data, build the RAG indexes against your local sources, calibrate retrieval quality. We test on real workloads with a small internal user group before opening it to the rest of the company.

Step 05

Week 7 to 8 · Handoff

We document the install, train your IT team on day-to-day operations, and hand over a runbook. If you take the optional monthly retainer, we keep operating it. If not, your team owns it from here. Either way, the stack is yours.
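Step 04 mentions calibrating retrieval quality. One common way to put a number on that is recall@k over a small set of hand-labelled questions. The sketch below assumes that, for each test question, you have the ranked document IDs the index returned and the set of IDs a reviewer marked relevant. The function names and sample IDs are hypothetical.

```python
from typing import List, Set, Tuple


def recall_at_k(retrieved: List[str], relevant: Set[str], k: int) -> float:
    """Fraction of the relevant documents that appear in the top k results."""
    if not relevant:
        return 0.0
    hits = len(set(retrieved[:k]) & relevant)
    return hits / len(relevant)


def mean_recall_at_k(cases: List[Tuple[List[str], Set[str]]], k: int) -> float:
    """Average recall@k across all labelled test questions."""
    if not cases:
        return 0.0
    return sum(recall_at_k(ranked, rel, k) for ranked, rel in cases) / len(cases)


# Two hypothetical test questions: (ranked results, reviewer-marked relevant IDs).
cases = [
    (["memo-17", "memo-03", "memo-42"], {"memo-03", "memo-99"}),  # finds 1 of 2
    (["policy-1", "policy-2", "policy-3"], {"policy-2"}),         # finds 1 of 1
]

print(mean_recall_at_k(cases, k=3))  # 0.75
```

Tracking this number before and after a change to chunking or embeddings gives the small internal user group something concrete to compare against.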

// What it enables

The workflows you could not run on a public API.

A regional bank in Asia uses a local install to run a copilot over ten years of internal credit memos. The model reads underwriting history, recent transactions, and customer correspondence to draft new memos that a credit officer reviews. Every document stays on bank infrastructure. The model never sees a public endpoint. Officers cut memo drafting time from forty minutes to six minutes per file. Compliance signed off on day one because nothing left the network.

A hospital group runs a local agent over patient charts to draft clinical summaries for physicians. The model is fine-tuned on de-identified internal data and runs on a GPU server in the same data center as the EHR. Physicians get a structured summary, citations to the source notes, and a confidence score. The hospital did the same project on a hosted API two years ago and shut it down after legal review. The local install moved past legal review in a week.

A defense contractor uses an air-gapped install (no internet connection at all) to run code review and document classification over classified material. The hardware sits in a SCIF. The agent reads, classifies, and summarizes without ever touching a network that touches the outside world. This is the extreme case, but it works because nothing about OpenClaw or Hermes requires phoning home. For teams thinking about a fractional AI Ops Department but needing on-premises execution, this is the hybrid pattern we end up running.

These are not pilots. They are production workloads moving real money and real clinical decisions through agents running on hardware the customer owns. The reason these workloads exist now is that the models are good enough and the runtime makes the install repeatable. Three years ago you could not run a 70B model on a single GPU. Now you can, and the use cases catch up fast.

// Integration patterns

How the local agent connects to the rest of your stack.

Most teams do not run pure air-gap. They run hybrid. The local agent handles anything touching sensitive data and the cloud handles everything else. We set up routing rules so the agent decides per request which path to take. A query about a customer record goes local. A general research question can call out to a cloud model if your policy allows. The policy is yours to write. We implement it as written.
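The per-request routing described above can be sketched as a small policy check. This is an illustration of the idea, not the OpenClaw or Hermes configuration format: the `Policy` class, the tag names, and the fail-closed default for untagged requests are all assumptions made for the example.

```python
from dataclasses import dataclass
from enum import Enum
from typing import FrozenSet, Set


class Route(Enum):
    LOCAL = "local"
    CLOUD = "cloud"


@dataclass(frozen=True)
class Policy:
    sensitive_tags: FrozenSet[str]  # data classes that must stay inside
    allow_cloud: bool               # False models a strict air-gap


def choose_route(request_tags: Set[str], policy: Policy) -> Route:
    """Pick a path for one request under the customer-written policy."""
    if not policy.allow_cloud:
        return Route.LOCAL  # air-gap: nothing ever calls out
    if not request_tags:
        return Route.LOCAL  # unclassified requests fail closed (assumption)
    if request_tags & policy.sensitive_tags:
        return Route.LOCAL  # any sensitive tag keeps the request local
    return Route.CLOUD


hybrid = Policy(frozenset({"customer_record", "patient_chart"}), allow_cloud=True)

print(choose_route({"customer_record"}, hybrid))   # Route.LOCAL
print(choose_route({"general_research"}, hybrid))  # Route.CLOUD
```

Defaulting to the local path whenever a request is untagged or the policy is silent keeps a classification mistake from becoming a data egress.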

For data sources, the agent reads from internal Confluence, internal SharePoint, on-prem databases, internal Git repos, file shares, anything inside the network that speaks HTTP, SMB, or SQL. RAG indexes get built and refreshed locally on a schedule you control. For identity, we integrate with whatever you run: Okta, Entra ID, Active Directory, custom SAML. The agent inherits your permission model so users only retrieve documents they were already allowed to see.
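Permission inheritance in retrieval usually means filtering candidate documents against the user's group memberships before anything reaches the model. A minimal sketch, assuming each indexed document carries the set of groups allowed to read it and that the identity provider supplies the user's groups. The `Document` type and group names are hypothetical.

```python
from dataclasses import dataclass
from typing import FrozenSet, List, Set


@dataclass(frozen=True)
class Document:
    doc_id: str
    allowed_groups: FrozenSet[str]  # copied from the source system's ACL


def visible_to(user_groups: Set[str], candidates: List[Document]) -> List[Document]:
    """Keep only documents the user could already open in the source system."""
    return [d for d in candidates if d.allowed_groups & user_groups]


candidates = [
    Document("credit-memo-2021", frozenset({"credit-officers"})),
    Document("hr-handbook", frozenset({"all-staff"})),
    Document("board-minutes", frozenset({"executives"})),
]

# Groups would come from the identity provider at request time.
analyst_groups = {"all-staff", "credit-officers"}

print([d.doc_id for d in visible_to(analyst_groups, candidates)])
# ['credit-memo-2021', 'hr-handbook']
```

Filtering before generation, not after, means a restricted document never enters the prompt in the first place.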

For observability, logs and metrics ship to your existing stack (Splunk, Datadog on-prem, ELK, whatever you have racked). We do not run a parallel monitoring stack. For backups, model weights and indexes go to your backup system on the schedule your DR plan calls for. The local agent behaves like any other internal service and plays by the same rules.
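Shipping logs to an existing stack is easiest when the agent emits one structured record per line that your collector can already parse. A minimal sketch using Python's standard `logging` module. The field names and the `route` attribute are assumptions for illustration, and in a real deployment the stream would be whatever your log forwarder already reads.

```python
import io
import json
import logging


class JsonFormatter(logging.Formatter):
    """Render each log record as a single JSON object per line."""

    def format(self, record: logging.LogRecord) -> str:
        payload = {
            "time": self.formatTime(record),
            "level": record.levelname,
            "logger": record.name,
            "message": record.getMessage(),
            # 'route' is a hypothetical extra field set by the caller.
            "route": getattr(record, "route", None),
        }
        return json.dumps(payload)


def build_logger(stream) -> logging.Logger:
    """Create a logger that writes JSON lines to the given stream."""
    logger = logging.getLogger("local_agent")
    logger.setLevel(logging.INFO)
    logger.handlers.clear()
    logger.propagate = False
    handler = logging.StreamHandler(stream)
    handler.setFormatter(JsonFormatter())
    logger.addHandler(handler)
    return logger


buffer = io.StringIO()  # stand-in for the stream your collector reads
log = build_logger(buffer)
log.info("request served", extra={"route": "local"})
print(buffer.getvalue().strip())
```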

Excellent communication and top-notch quality of service. EOI has been a choice to accelerate our company, not only on a technical level, but also business-wise and creatively. If you need anyone to do your AI workflows, these guys are the experts.

Gregory Benjamins

CEO · Green Collective

// Pricing

One-time setup, optional monthly managed retainer.

Setup pricing depends on hardware scope, model count, and connector complexity. Retainer is optional and covers updates, tuning, and security patches. Hardware is billed at cost.

OpenClaw and Hermes installed on your hardware, inside your network
Hardware sizing and procurement support (we shop or you shop)
Model selection and fine-tuning against your data
Connector setup for wiki, CRM, file shares, databases
Identity integration with your SSO and permission model
Runbook handoff and IT team training included
Optional retainer: security patches, model upgrades, prompt tuning, new connectors

Book a setup call→

// Pair with a department

Local installs often pair with a fractional AI department for the operating layer on top. We install the local stack and run the workflow against it on a monthly retainer. Read the full explanation of how fractional AI departments work and which workloads sit best on local infrastructure.

Read the breakdown→

// FAQ

The questions founders ask before they apply.

01 · Do I need GPUs to run a local AI agent?

For most useful models, yes. A 7B model can run CPU-only with acceptable latency. Anything 13B and above wants a GPU for usable throughput. The sweet spot for production workloads is a single H100 or dual L40S, which handles a 70B model serving fifty concurrent users with sub-100ms latency.

02 · Can I use my existing server, or do I need to buy new hardware?

Depends on the box. If you have recent GPU iron (A100, H100, L40S, even an older A6000) we can almost certainly use it. CPU-only servers will run small quantized models but not production workloads. We do a hardware audit in week one and tell you honestly whether your existing kit fits or whether you need to procure.

03 · What models are supported?

Llama 3 and 3.1 (8B, 70B, 405B), Mistral Large and Mixtral, Qwen 2.5, Phi-3, DeepSeek, and any model on Hugging Face that fits the runtime. We default to open-weight models for license clarity, but we can run a private fine-tune of any of these against your data without that fine-tune leaving your network.

04 · Can the local agent call cloud APIs when needed?

Yes, if your policy allows. We can configure routing rules so sensitive queries stay local and non-sensitive queries optionally call out to a cloud model. The agent decides per request based on the rules you write. Strict air-gap configurations have no external connectivity at all and use only the local model.

05 · Who maintains the model after install?

Two options. Take the optional monthly retainer and we handle updates, patches, and tuning. Or take the handoff at week eight and your IT team runs it from there with our runbook. Most regulated clients pick the retainer because keeping current with model releases and security patches is real work.

06 · What about security patching?

OpenClaw and Hermes get patched on the same cadence as your other internal services. We deliver patches through your existing change management process, not via auto-update. Critical CVEs get same-day notification. The retainer covers patch testing and rollout against your reference architecture.

07 · Can it integrate with my CRM, wiki, and internal tools?

Yes. The agent reads from Confluence, SharePoint, Notion exports, internal Git, file shares, on-prem Salesforce, SQL databases, and anything else inside the network that speaks HTTP, SMB, or SQL. RAG indexes get built locally and refreshed on a schedule you set. Permissions inherit from your existing identity provider.

08 · What is the actual latency vs a cloud API?

On the internal network, expect 40 to 90ms for a 70B model on an H100, much faster than a 600 to 1200ms round trip to a public cloud endpoint. The local agent feels noticeably snappier because there is no internet hop. Air-gapped installs have the same internal latency since the bottleneck is the GPU, not the network.
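Latency figures like these are worth verifying on your own network. A minimal sketch of a timing harness that reports median and tail latency. The `fake_inference` function is a stand-in that just sleeps; you would replace it with a real request to your local endpoint. The helper names are hypothetical.

```python
import math
import time
from typing import Callable, List


def measure_ms(call: Callable[[], object], runs: int = 50) -> List[float]:
    """Time repeated calls and return each duration in milliseconds."""
    samples = []
    for _ in range(runs):
        start = time.perf_counter()
        call()
        samples.append((time.perf_counter() - start) * 1000.0)
    return samples


def percentile(samples: List[float], pct: int) -> float:
    """Nearest-rank percentile; pct is an integer from 1 to 100."""
    if not samples:
        raise ValueError("no samples to summarize")
    ordered = sorted(samples)
    rank = math.ceil(pct * len(ordered) / 100)
    return ordered[rank - 1]


def fake_inference() -> None:
    """Placeholder for a real request to the local model endpoint."""
    time.sleep(0.001)


samples = measure_ms(fake_inference, runs=20)
print(f"p50={percentile(samples, 50):.1f}ms  p95={percentile(samples, 95):.1f}ms")
```

Reporting the 95th percentile alongside the median matters because concurrent load tends to show up in the tail first.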

// From the notes

2026-05-25
What is a Fractional AI Department?
A fractional CFO runs your finance function part-time. A fractional AI Department runs a whole function full-time, for the cost of one hire. Here is how the math works.

// Ready to ship this?

Start a Local AI Agent Setup sprint. 4 to 8 weeks from kickoff.

Apply in 7 questions. EOI reviews every application within 24 hours.