Question 1

What hardware do I need for an on-device AI agent?

Accepted Answer

A single H100 or a dual-L40S GPU server can serve a 70B model to 50 concurrent users with sub-100ms latency, provided the weights are quantized: at 16-bit precision a 70B model needs roughly 140 GB for weights alone, more than either configuration offers. Lighter workloads run on CPU with smaller quantized models. Heavier workloads (a full 405B model, high concurrency) need multi-GPU sharding. The right spec is sized against the actual workload during the audit, not before.
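As a rough sizing check, the weight footprint can be estimated as parameter count times bytes per parameter. The sketch below is illustrative only: the function names and the 0.8 headroom factor are assumptions made for this example, and it ignores KV cache and activation memory, which grow with concurrency and context length. The card capacities used are 80 GB for an H100 and 48 GB per L40S.

```python
def weight_memory_gb(params_billion: float, bits_per_param: int) -> float:
    """Approximate memory for model weights only, in GB (1 GB = 1e9 bytes)."""
    bytes_total = params_billion * 1e9 * bits_per_param / 8
    return bytes_total / 1e9


def fits(params_billion: float, bits_per_param: int,
         gpu_gb: float, n_gpus: int = 1, headroom: float = 0.8) -> bool:
    """True if the weights fit within `headroom` of total GPU memory.

    The headroom factor is an assumed safety margin that leaves space
    for KV cache and activations; tune it against the real workload.
    """
    budget_gb = gpu_gb * n_gpus * headroom
    return weight_memory_gb(params_billion, bits_per_param) <= budget_gb


if __name__ == "__main__":
    for bits in (16, 8, 4):
        print(
            f"70B @ {bits}-bit: {weight_memory_gb(70, bits):.0f} GB weights, "
            f"1x H100: {fits(70, bits, 80)}, "
            f"2x L40S: {fits(70, bits, 48, n_gpus=2)}"
        )
```

Under these assumptions a 70B model only fits a single H100 at 4-bit, while the dual-L40S server also has room for 8-bit weights. A 405B model at 16-bit comes to about 810 GB, which is why it needs sharding across several GPUs.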

Question 2

How is this different from running ChatGPT in a private endpoint?

Accepted Answer

A private endpoint still runs the model on the vendor's infrastructure. The buyer's data leaves the network on every call, just with a smaller blast radius. An on-device deployment runs the model on the buyer's hardware, so the data never leaves the perimeter. That distinction is what compliance teams actually care about.

Question 3

Can the on-device agent call cloud APIs when needed?

Accepted Answer

Yes, if the buyer's policy allows it. Most teams run hybrid: sensitive queries stay local, and non-sensitive queries can optionally route to a cloud model. The agent decides per request based on rules the buyer writes. Strict air-gapped configurations have no external connectivity at all.
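A minimal sketch of what per-request routing could look like, assuming the buyer's rules are expressed as regular expressions that mark a request as sensitive. The `Rule` class, the `route` function, and the example patterns are all hypothetical; a real deployment would load its rules from the buyer's own policy.

```python
import re
from dataclasses import dataclass


@dataclass(frozen=True)
class Rule:
    name: str
    pattern: str  # regex; a match marks the request as sensitive


# Hypothetical example rules, not a recommended policy.
RULES = [
    Rule("ssn", r"\b\d{3}-\d{2}-\d{4}\b"),
    Rule("internal_keyword", r"(?i)\b(confidential|patient|salary)\b"),
]


def route(prompt: str, rules=RULES, cloud_allowed: bool = True) -> str:
    """Return "local" or "cloud" for a single request."""
    if not cloud_allowed:
        # Strict air-gap: nothing ever leaves the network.
        return "local"
    for rule in rules:
        if re.search(rule.pattern, prompt):
            # Any sensitive match keeps the request on-device.
            return "local"
    return "cloud"
```

For example, `route("My SSN is 123-45-6789")` returns `"local"`, while `route("Summarize this public press release")` returns `"cloud"`. Setting `cloud_allowed=False` models the air-gapped case, where every request stays local regardless of content.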

Question 4

How does the latency compare with a cloud API?

Accepted Answer

On the internal network, expect 40 to 90 ms for a 70B model on an H100, versus 600 to 1200 ms round-trip to a public cloud endpoint. The local agent feels noticeably snappier because there is no internet hop. Air-gapped installs see the same internal latency, since the GPU is the bottleneck, not the network.
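These figures are worth checking on your own network. The helper below is a simple sketch that times any callable and reports median and 95th-percentile latency. The names `measure_latency_ms` and `fake_local_call` are hypothetical, and the placeholder call just sleeps; swap in a real request to the local agent or to the cloud endpoint.

```python
import statistics
import time


def measure_latency_ms(call, n: int = 20):
    """Time `call()` n times and return (median, p95) in milliseconds."""
    samples = []
    for _ in range(n):
        start = time.perf_counter()
        call()
        samples.append((time.perf_counter() - start) * 1000)
    samples.sort()
    # Nearest-rank style p95, clamped to the last sample for small n.
    p95 = samples[min(len(samples) - 1, int(0.95 * len(samples)))]
    return statistics.median(samples), p95


def fake_local_call():
    """Placeholder standing in for a real request to the agent."""
    time.sleep(0.005)


median_ms, p95_ms = measure_latency_ms(fake_local_call, n=10)
print(f"median={median_ms:.1f} ms, p95={p95_ms:.1f} ms")
```

Running the same helper against both the local agent and a cloud endpoint, from the same client machine, gives a like-for-like comparison.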