Your weights. Your hardware. Your latency, your audit, your bill.
An open-weights LLM running on hardware you own — procurement, deployment stack, finetuning where it earns its keep, and an OpenAI-compatible internal endpoint your apps and engineers can talk to. For teams with PDPA-class data, steady high-volume workloads, or long-context economics that vendor per-token pricing punishes.
GPU server procurement, the serving runtime, and the open-weights model that fits the workload — not the headline.
GPU server procurement and on-prem or colo install — H100, L40S, A100 or RTX 6000 Ada, sized against the actual workload, not the spec sheet. NVMe for fast model load; redundant power; sensible cooling. We tell you the smallest box that does the job, not the largest one a reseller wants to quote.
vLLM for high-throughput production. TGI where it fits the model. Ollama or llama.cpp for the small-workload edge cases where a full inference server is overkill. Quantisation only where the eval still passes — never as a way to fit a too-big model on a too-small box.
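What that looks like in practice, as a minimal sketch using vLLM's Python API. The checkpoint, parallelism and sampling values below are placeholders, not a recommendation, and production traffic goes through vLLM's OpenAI-compatible server rather than this offline call.

```python
# Minimal vLLM smoke test. Model name and sampling values are illustrative;
# the eval, not the headline, decides the real checkpoint.
from vllm import LLM, SamplingParams

llm = LLM(
    model="Qwen/Qwen2.5-7B-Instruct",   # placeholder pick for illustration
    tensor_parallel_size=1,              # spread across GPUs only when the model needs it
    gpu_memory_utilization=0.90,
)

params = SamplingParams(temperature=0.2, max_tokens=256)
outputs = llm.generate(["Summarise the attached filing in two sentences."], params)
print(outputs[0].outputs[0].text)
```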
Llama, Qwen, DeepSeek, GLM, Mistral — chosen against your workload, your language mix and a licence you can actually live with. We read the licences. We test against your task before recommending. The smallest model that passes your eval is the right model — open-weights or otherwise.
LoRA for almost everything. Continued pretraining when the data and the budget warrant it. Almost never pretraining from scratch.
- LoRA / QLoRA, first. The shortest path to a model that knows your domain, your phrasing and your refusal policy. Days, not weeks. Cheap to redo when the source data shifts. Stacks cleanly on top of a base model you can rotate later. A minimal adapter sketch follows this list.
- Continued pretraining, when the corpus earns it. A meaningful private corpus — bilingual technical writing, decades of regulator filings, a domain-specific code base — can justify continued pretraining. We tell you honestly whether yours does. Most don't, and that is a feature.
- Pretraining from scratch is not the engagement. That's an eight-figure budget and a research team. If that's the right shape for you, you don't need us — you need NVIDIA's account team. We'll point you at the right doors.
- Eval set first, weights second. No finetune ships without a paired eval — golden cases, adversarial cases, and a regression bar from the base model. If the finetuned weights don't beat the base model on the eval, the finetune doesn't ship. Vibes are not a release criterion. The release gate is sketched after this list.
- Engagement shape. A four-to-six-week LoRA build — data curation, training, eval, deploy — with an optional retainer for the next two refresh cycles. Continued-pretraining engagements are scoped against the corpus during the diagnostic.
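The LoRA path, as a minimal sketch using Hugging Face PEFT. The base checkpoint, rank and target modules are placeholders; the real values come out of the data, the diagnostic and the eval.

```python
# Illustrative LoRA setup with PEFT. Every value here is a placeholder,
# not a tuned configuration.
from transformers import AutoModelForCausalLM
from peft import LoraConfig, get_peft_model

base = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-3.1-8B-Instruct")

lora = LoraConfig(
    r=16,                                   # adapter rank: small, cheap to retrain
    lora_alpha=32,
    lora_dropout=0.05,
    target_modules=["q_proj", "v_proj"],    # attention projections; model-dependent
    task_type="CAUSAL_LM",
)

model = get_peft_model(base, lora)
model.print_trainable_parameters()          # typically well under 1% of the base weights
```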
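And the release gate itself, as a sketch. The `Case` type and the grading stand in for your harness; the point is the comparison against the base model, not the plumbing.

```python
# Sketch of the release gate: the finetune ships only if it beats the base
# model on the paired eval. Grading (exact match, rubric, judge model) is
# your harness's job and is assumed here.
from dataclasses import dataclass

@dataclass
class Case:
    prompt: str
    passes: bool  # filled in by the grader for a given model's answer

def pass_rate(cases: list[Case]) -> float:
    return sum(c.passes for c in cases) / len(cases)

def release_gate(base_cases: list[Case], tuned_cases: list[Case], margin: float = 0.0) -> bool:
    # Golden and adversarial cases are graded for both models; the finetuned
    # weights must clear the base model's score, not just feel better.
    return pass_rate(tuned_cases) >= pass_rate(base_cases) + margin
```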
An OpenAI-compatible endpoint your apps already know how to talk to. Behind your SSO, your audit log, your perimeter.
- Drop-in API surface. OpenAI-compatible chat-completions and embeddings endpoints. Your existing app code, agent framework, RAG stack and CLI clients keep working — base URL changes, the rest doesn't. A client sketch follows this list.
- SSO-fronted end-user access. Engineers and internal tools authenticate through your existing identity provider — Entra ID, Okta, Google Workspace, your own AD. RBAC by group. Rate-limits per user and per app key. No long-lived plaintext tokens lying around in repos.
- Audit log on the wire. Every prompt, every response, every model and every caller — captured, indexed and queryable. Retention by class. The same discipline that lets infrastructure & security sign off on a hardened fleet, applied to the model traffic.
- Token economics that justify the box. Self-hosting beats per-token vendor pricing on three workloads in our experience: steady high-volume (the model is busy most of the day), long-context (you pay vendors for every input token of a 100k-context system prompt — you don't pay yourself), and data-residency (the workload simply isn't allowed to leave the building). Outside those, vendor APIs are cheaper and we will tell you so. A back-of-envelope break-even follows this list.
- Pairs with the rest of the toolbox. Hardening for the host fleet — see infra-and-security. Eval-harness, context engineering and memory discipline so the deployed model is actually useful — see consulting-and-training.
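How small the client change is, as a sketch using the standard OpenAI Python SDK. The internal URL, key handling and model alias are illustrative; yours come from the deploy.

```python
# Drop-in client sketch: only the base URL and key change.
from openai import OpenAI

client = OpenAI(
    base_url="https://llm.internal.example.com/v1",  # your endpoint, behind your SSO and gateway
    api_key="app-scoped-key",                        # issued per app, not a vendor key
)

resp = client.chat.completions.create(
    model="qwen2.5-7b-instruct",                     # whatever alias the serving layer exposes
    messages=[{"role": "user", "content": "Ping?"}],
)
print(resp.choices[0].message.content)
```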
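And the break-even arithmetic, as a back-of-envelope sketch. Every number below is a placeholder, not a quote; the diagnostic week replaces them with your measured volumes and real prices.

```python
# Back-of-envelope break-even. All figures are placeholders for illustration only.
monthly_tokens      = 2_000_000_000        # steady high-volume workload, input + output
vendor_per_m_tokens = 3.00                 # blended $/1M tokens at a vendor (placeholder)
server_amortised    = 4_500                # $/month: hardware over 36 months + power + colo (placeholder)

vendor_bill = monthly_tokens / 1_000_000 * vendor_per_m_tokens
print(f"vendor: ${vendor_bill:,.0f}/mo  vs  self-host: ${server_amortised:,.0f}/mo")
# Long-context workloads tilt this further: a 100k-token system prompt is billed
# on every vendor call, but costs nothing extra on a box you already own.
```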
What people ask before they brief us.
- When does self-hosting an LLM beat per-token vendor pricing?
- On three workloads in our experience: steady high-volume (the model is busy most of the day), long-context (you pay vendors for every input token of a 100k-context system prompt — you don't pay yourself), and data-residency (the workload simply isn't allowed to leave the building). Outside those, vendor APIs are usually cheaper and we will tell you so.
- Which open-weights models do you deploy?
- Llama, Qwen, DeepSeek, GLM and Mistral — chosen against your workload, your language mix, and a licence you can actually live with. We test against your task before recommending. The smallest model that passes your eval is the right model.
- Do you offer LoRA finetuning, or only deploy?
- Both. LoRA / QLoRA is the default — days, not weeks, cheap to redo when source data shifts. Continued pretraining when a meaningful private corpus earns it. Pretraining from scratch is not the engagement: that's an eight-figure budget and a research team.
- What does the engagement look like?
- A paid diagnostic week sizes the workload, picks the model, picks the box, and writes the engagement shape — hardware-only, deploy-only, finetune-only, or end-to-end. You walk away with a written plan and a number, regardless of whether we run it.
The diagnostic week sizes the workload, picks the model, picks the box, and writes the engagement shape — hardware-only, deploy-only, finetune-only, or end-to-end. You walk away with a written plan and a number, regardless of whether we run it.
Brief us →