Your weights. Your hardware. Your latency, your audit, your bill.
An open-weights LLM running on hardware you own — procurement, deployment stack, finetuning where it earns its keep, and an OpenAI-compatible internal endpoint your apps and engineers can talk to. For teams with PDPA-class data, steady high-volume workloads, or long-context economics that vendor per-token pricing punishes.
GPU server procurement, the serving runtime, and the open-weights model that fits the workload — not the headline.
GPU server procurement and on-prem or colo install — H100, L40S, A100 or RTX 6000 Ada, sized against the actual workload, not the spec sheet. NVMe for fast model load; redundant power; sensible cooling. We tell you the smallest box that does the job, not the largest one a reseller wants to quote.
vLLM for high-throughput production. TGI where it fits the model. Ollama or llama.cpp for the small-workload edge cases where a full inference server is overkill. Quantisation only where the eval still passes — never as a way to fit a too-big model on a too-small box.
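What that looks like in practice, as a minimal sketch using vLLM's Python API. The checkpoint, parallelism and sampling values below are placeholders, not a recommendation, and production traffic goes through vLLM's OpenAI-compatible server rather than this offline call.

```python
# Minimal vLLM smoke test. Model name and sampling values are illustrative;
# the eval, not the headline, decides the real checkpoint.
from vllm import LLM, SamplingParams

llm = LLM(
    model="Qwen/Qwen2.5-7B-Instruct",   # placeholder pick for illustration
    tensor_parallel_size=1,              # spread across GPUs only when the model needs it
    gpu_memory_utilization=0.90,
)

params = SamplingParams(temperature=0.2, max_tokens=256)
outputs = llm.generate(["Summarise the attached filing in two sentences."], params)
print(outputs[0].outputs[0].text)
```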
Llama, Qwen, DeepSeek, GLM, Mistral — chosen against your workload, your language mix and a licence you can actually live with. We read the licences. We test against your task before recommending. The smallest model that passes your eval is the right model — open-weights or otherwise.
LoRA for almost everything. Continued pretraining when the data and the budget warrant it. Almost never pretraining from scratch.
- LoRA / QLoRA, first. The shortest path to a model that knows your domain, your phrasing and your refusal policy. Days, not weeks. Cheap to redo when the source data shifts. Stacks cleanly on top of a base model you can rotate later. A minimal adapter sketch follows this list.
- Continued pretraining, when the corpus earns it. A meaningful private corpus — bilingual technical writing, decades of regulator filings, a domain-specific code base — can justify continued pretraining. We tell you honestly whether yours does. Most don't, and that is a feature.
- Pretraining from scratch is not the engagement. That's an eight-figure budget and a research team. If that's the right shape for you, you don't need us — you need NVIDIA's account team. We'll point you at the right doors.
- Eval set first, weights second. No finetune ships without a paired eval — golden cases, adversarial cases, and a regression bar from the base model. If the finetuned weights don't beat the base model on the eval, the finetune doesn't ship. Vibes are not a release criterion. The release gate is sketched after this list.
- Engagement shape. A four-to-six-week LoRA build — data curation, training, eval, deploy — with an optional retainer for the next two refresh cycles. Continued-pretraining engagements are scoped against the corpus during the diagnostic.
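The LoRA path, as a minimal sketch using Hugging Face PEFT. The base checkpoint, rank and target modules are placeholders; the real values come out of the data, the diagnostic and the eval.

```python
# Illustrative LoRA setup with PEFT. Every value here is a placeholder,
# not a tuned configuration.
from transformers import AutoModelForCausalLM
from peft import LoraConfig, get_peft_model

base = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-3.1-8B-Instruct")

lora = LoraConfig(
    r=16,                                   # adapter rank: small, cheap to retrain
    lora_alpha=32,
    lora_dropout=0.05,
    target_modules=["q_proj", "v_proj"],    # attention projections; model-dependent
    task_type="CAUSAL_LM",
)

model = get_peft_model(base, lora)
model.print_trainable_parameters()          # typically well under 1% of the base weights
```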
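And the release gate itself, as a sketch. The `Case` type and the grading stand in for your harness; the point is the comparison against the base model, not the plumbing.

```python
# Sketch of the release gate: the finetune ships only if it beats the base
# model on the paired eval. Grading (exact match, rubric, judge model) is
# your harness's job and is assumed here.
from dataclasses import dataclass

@dataclass
class Case:
    prompt: str
    passes: bool  # filled in by the grader for a given model's answer

def pass_rate(cases: list[Case]) -> float:
    return sum(c.passes for c in cases) / len(cases)

def release_gate(base_cases: list[Case], tuned_cases: list[Case], margin: float = 0.0) -> bool:
    # Golden and adversarial cases are graded for both models; the finetuned
    # weights must clear the base model's score, not just feel better.
    return pass_rate(tuned_cases) >= pass_rate(base_cases) + margin
```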
An OpenAI-compatible endpoint your apps already know how to talk to. Behind your SSO, your audit log, your perimeter.
- Drop-in API surface. OpenAI-compatible chat-completions and embeddings endpoints. Your existing app code, agent framework, RAG stack and CLI clients keep working — base URL changes, the rest doesn't. A client sketch follows this list.
- SSO-fronted end-user access. Engineers and internal tools authenticate through your existing identity provider — Entra ID, Okta, Google Workspace, your own AD. RBAC by group. Rate-limits per user and per app key. No long-lived plaintext tokens lying around in repos.
- Audit log on the wire. Every prompt, every response, every model and every caller — captured, indexed and queryable. Retention by class. The same discipline that lets infrastructure & security sign off on a hardened fleet, applied to the model traffic.
- Token economics that justify the box. Self-hosting beats per-token vendor pricing on three workloads in our experience: steady high-volume (the model is busy most of the day), long-context (you pay vendors for every input token of a 100k-context system prompt — you don't pay yourself), and data-residency (the workload simply isn't allowed to leave the building). Outside those, vendor APIs are cheaper and we will tell you so. A back-of-envelope break-even follows this list.
- Pairs with the rest of the toolbox. Hardening for the host fleet — see infra-and-security. Eval-harness, context engineering and memory discipline so the deployed model is actually useful — see consulting-and-training.
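How small the client change is, as a sketch using the standard OpenAI Python SDK. The internal URL, key handling and model alias are illustrative; yours come from the deploy.

```python
# Drop-in client sketch: only the base URL and key change.
from openai import OpenAI

client = OpenAI(
    base_url="https://llm.internal.example.com/v1",  # your endpoint, behind your SSO and gateway
    api_key="app-scoped-key",                        # issued per app, not a vendor key
)

resp = client.chat.completions.create(
    model="qwen2.5-7b-instruct",                     # whatever alias the serving layer exposes
    messages=[{"role": "user", "content": "Ping?"}],
)
print(resp.choices[0].message.content)
```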
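And the break-even arithmetic, as a back-of-envelope sketch. Every number below is a placeholder, not a quote; the diagnostic week replaces them with your measured volumes and real prices.

```python
# Back-of-envelope break-even. All figures are placeholders for illustration only.
monthly_tokens      = 2_000_000_000        # steady high-volume workload, input + output
vendor_per_m_tokens = 3.00                 # blended $/1M tokens at a vendor (placeholder)
server_amortised    = 4_500                # $/month: hardware over 36 months + power + colo (placeholder)

vendor_bill = monthly_tokens / 1_000_000 * vendor_per_m_tokens
print(f"vendor: ${vendor_bill:,.0f}/mo  vs  self-host: ${server_amortised:,.0f}/mo")
# Long-context workloads tilt this further: a 100k-token system prompt is billed
# on every vendor call, but costs nothing extra on a box you already own.
```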
What people ask before they brief us.
- When does self-hosting an LLM beat per-token vendor pricing?
- On three workloads in our experience: steady high-volume (the model is busy most of the day), long-context (you pay vendors for every input token of a 100k-context system prompt — you don't pay yourself), and data-residency (the workload simply isn't allowed to leave the building). Outside those, vendor APIs are usually cheaper and we will tell you so.
- Which open-weights models do you deploy?
- Llama, Qwen, DeepSeek, GLM and Mistral — chosen against your workload, your language mix, and a licence you can actually live with. We test against your task before recommending. The smallest model that passes your eval is the right model.
- Do you offer LoRA finetuning, or only deploy?
- Both. LoRA / QLoRA is the default — days, not weeks, cheap to redo when source data shifts. Continued pretraining when a meaningful private corpus earns it. Pretraining from scratch is not the engagement: that's an eight-figure budget and a research team.
- What does the engagement look like?
- A paid diagnostic week sizes the workload, picks the model, picks the box, and writes the engagement shape — hardware-only, deploy-only, finetune-only, or end-to-end. You walk away with a written plan and a number, regardless of whether we run it.
The diagnostic week sizes the workload, picks the model, picks the box, and writes the engagement shape — hardware-only, deploy-only, finetune-only, or end-to-end. You walk away with a written plan and a number, regardless of whether we run it.
Brief us →