Question 1

What models are supported on a local install?

Accepted Answer

Llama 3 (8B, 70B) and Llama 3.1 (8B, 70B, 405B), Mistral Large and Mixtral, Qwen 2.5, Phi-3, DeepSeek, and any other model on Hugging Face that fits the runtime. EOI defaults to open-weight models for license clarity. Private fine-tunes of any of these models also run locally, so the fine-tuned weights never leave the network.

Question 2

How does a Local LLM compare to GPT-4 or Claude on quality?

Accepted Answer

For specific workloads, a tuned 70B local model matches or beats the hosted frontier models. For broad general-purpose reasoning, the frontier hosted models still lead. The right comparison is not "is Llama as smart as GPT-4" but "does the local model handle this workload at the quality bar I need." For most departmental workflows the answer is yes.
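
One way to make "the quality bar I need" concrete is to score the local model on a small set of workload-specific test cases and compare the pass rate to a threshold. The sketch below is only an illustration: `stub_local_model`, the example prompts, and the 0.7 threshold are hypothetical stand-ins, not part of any EOI tooling, and a real evaluation would call the deployed model and use cases drawn from the actual workload.

```python
from typing import Callable, Sequence, Tuple

# A test case is a prompt plus a function that decides whether the output is acceptable.
Case = Tuple[str, Callable[[str], bool]]


def pass_rate(model: Callable[[str], str], cases: Sequence[Case]) -> float:
    """Return the fraction of cases whose model output passes its check."""
    if not cases:
        raise ValueError("need at least one test case")
    passed = sum(1 for prompt, check in cases if check(model(prompt)))
    return passed / len(cases)


def meets_quality_bar(
    model: Callable[[str], str], cases: Sequence[Case], quality_bar: float
) -> bool:
    """True if the model's pass rate is at or above the required bar."""
    return pass_rate(model, cases) >= quality_bar


def stub_local_model(prompt: str) -> str:
    """Hypothetical stand-in for a local model: returns canned labels."""
    canned = {
        "Route: invoice overdue 30 days": "billing",
        "Route: cannot log in to portal": "it-support",
        "Route: request for parental leave": "hr",
        "Route: laptop screen cracked": "billing",  # deliberately wrong
    }
    return canned.get(prompt, "")


# Hypothetical departmental workload: routing tickets to a queue.
CASES = [
    ("Route: invoice overdue 30 days", lambda out: out == "billing"),
    ("Route: cannot log in to portal", lambda out: out == "it-support"),
    ("Route: request for parental leave", lambda out: out == "hr"),
    ("Route: laptop screen cracked", lambda out: out == "it-support"),
]

if __name__ == "__main__":
    rate = pass_rate(stub_local_model, CASES)
    print(f"pass rate: {rate:.2f}")
    print("meets bar of 0.7:", meets_quality_bar(stub_local_model, CASES, 0.7))
```

Running the same cases against a hosted model gives a like-for-like comparison on the workload itself rather than on general benchmarks.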

Question 3

How much does the hardware cost?

Accepted Answer

A single H100 server runs about $30K to $40K depending on configuration. A dual L40S server is roughly $20K. CPU-only deployments of smaller quantized models can run on existing servers with no new hardware spend. Break-even against hosted API costs typically lands between 5 and 15 million tokens a month.
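
The break-even volume depends on how the hardware is amortized and on the hosted price being replaced. The sketch below shows the arithmetic under stated assumptions: the amortization period, the monthly operating cost, and the hosted price per million tokens are all hypothetical inputs chosen for illustration, not EOI figures or any provider's actual pricing. Only the roughly $20K dual L40S price comes from the answer above.

```python
def breakeven_tokens_per_month(
    hardware_cost_usd: float,
    amortization_months: int,
    monthly_operating_usd: float,
    hosted_price_per_million_usd: float,
) -> float:
    """Monthly token volume at which local and hosted costs are equal.

    Local monthly cost = amortized hardware + operating cost (power, hosting, support).
    Hosted monthly cost = tokens / 1e6 * price per million tokens.
    """
    if amortization_months <= 0:
        raise ValueError("amortization_months must be positive")
    if hosted_price_per_million_usd <= 0:
        raise ValueError("hosted_price_per_million_usd must be positive")
    monthly_local_cost = hardware_cost_usd / amortization_months + monthly_operating_usd
    return monthly_local_cost / hosted_price_per_million_usd * 1_000_000


if __name__ == "__main__":
    # Assumed inputs: $20K server, 40-month amortization, $100/month operating
    # cost, and a hypothetical blended hosted price of $60 per million tokens.
    tokens = breakeven_tokens_per_month(20_000, 40, 100, 60)
    print(f"break-even: {tokens / 1e6:.1f}M tokens per month")
```

Above the break-even volume the local deployment is cheaper per month. A lower hosted price or a shorter amortization period pushes the break-even volume higher.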

Question 4

Who maintains the model after install?

Accepted Answer

There are two options. EOI offers an optional monthly retainer that covers model upgrades, security patches, prompt tuning, and new connector requests. Alternatively, the customer's IT team takes the handoff at week eight and runs the system using the EOI runbook. Most regulated clients pick the retainer because staying current on model releases is real work.