Infra & Security 18 April 2026 · 8 min

Routing Classifications to Llama 3 to Save Frontier Tokens

By wGrow Project Team · 18 April 2026

The Homogeneous Architecture Trap

Singapore Chinese engineer working at a dual-monitor desk in a modern office.

Here’s how it usually goes. An engineering team discovers the OpenAI API, ships something in a week, and it works. Six months later the token bill is scaling with user growth and nobody can explain exactly why — because nobody stopped to ask whether every request actually needed a frontier model.

If you’re routing boolean classification tasks to GPT-4o, you’re paying senior-engineer rates for work a lightweight classifier could handle.

This is homogeneous LLM architecture: one endpoint, one model, every request hitting the same computational weight regardless of what you’re actually asking. The API equivalent of using a surgical robot to sort mail.

The problem runs deeper than cost. Sending a simple “is this a billing query or a technical query?” decision to a hosted frontier model introduces three penalties at once: API latency from the external network round-trip, per-token billing on a task that barely needs tokens to resolve, and contention on shared infrastructure you have no control over.

We built an intent router to address this. The system evaluates each inbound request and routes roughly 70% of traffic to a locally hosted Llama 3 8B instance running on vLLM. The remaining 30% escalates to GPT-4o. That split isn’t arbitrary — it reflects the actual complexity distribution of our production workloads, measured over two weeks of shadow traffic before we touched a single routing rule.

Unit Economics and the 70/30 Split

Traffic Split

70%

30%

Local Llama 3 8B
GPT-4o Escalation

Example: Target steady-state traffic distribution for a composite routing architecture.

The cost argument for local models starts at the token level but needs to be evaluated at the decision level. Those are different things.

At time of writing, OpenAI’s published GPT-4o rates stand at $2.50 per million input tokens and$ 10.00 per million output tokens (openai.com/api/pricing — verify before modelling). A typical intent classification payload at 200 input tokens and 10 output tokens runs about $0.0006 per call before network overhead. That sounds negligible. Then you run 50,000 classifications per day.

Llama 3 8B on a mid-tier GPU changes the denominator entirely. The model weights are open; vLLM is open-source. Your recurring cost is GPU instance time; A10G on-demand rates vary by provider, region, and commitment tier, so pull the current figure from your cloud provider’s pricing page before modelling the economics. An A10G carries 24 GB of VRAM, enough to fit the 8B model in FP16 with room for KV cache across concurrent short-sequence classification requests. At high utilization, the amortized cost per classification call can be an order of magnitude lower than the GPT-4o path, and lower still if you’re filling the GPU efficiently.

Latency is the second variable, and it’s the one that shows up in user experience. In our testing, warm GPT-4o classification calls measured in the 300–800ms range end-to-end, with provider-side queueing and generation as the dominant factors. The local vLLM path for the same short payloads returned in under 80ms. Cold connections add DNS and TLS setup costs on top, but those don’t recur on every pooled request. When a human is waiting on a classification result before the next screen loads, they feel that difference. Every time.

The 70/30 split came from measuring the confidence distribution of the local model across two weeks of shadow traffic. 70% of inbound requests produced logprob-confident classifications from Llama 3 8B. The other 30% were ambiguous, multi-intent, or edge cases requiring reasoning depth the 8B model simply doesn’t have.

Routing Statutory Board Tickets via vLLM

Technical illustration of a central junction routing geometric shapes into two paths.

The clearest production example is a ticket triaging system we shipped for a statutory board client in Singapore. The client was processing thousands of citizen feedback submissions monthly — across portal forms and email channels — and every single submission was hitting a commercial LLM API for classification across 15 predefined department codes.

Token costs were scaling linearly with portal usage, no complexity filter in sight. A submission reading “I want to update my address” incurred the same computational overhead as a multi-paragraph complaint spanning three departments with a dispute history attached.

We built a two-stage pipeline. Stage one: a Llama 3 8B classifier on a vLLM inference server receives the submission text and returns a department code from the predefined set. We convert the token logprobs into a normalized top-class probability; the router routes locally when that score exceeds 0.82, a threshold calibrated during the shadow period. Submissions scoring above it route locally. Those falling below get packaged and forwarded to GPT-4o for deeper analysis.

The result: an 80% capture rate at the local tier — higher than the 70% baseline across our general traffic. That gap reflects the structured nature of citizen form submissions; the inputs don’t vary as wildly as open-domain requests, so the local model’s confidence stays consistently high. Four out of five submissions never touch the external API. External LLM API spend for this classifier fell by roughly 80%; total serving cost depends separately on the fixed GPU-hours required to keep the local tier warm. Average routing latency for those submissions fell from roughly 600ms to under 100ms — a difference that matters when the confirmation screen needs the classification result to show the citizen their correct next steps.

Escalating WaterDoctor Telemetry to Frontier Models

Text classification is the easy case. The more instructive one is mixed telemetry analysis — which is what we run for WaterDoctor.

WaterDoctor deploys IoT sensors on water infrastructure, streaming continuous telemetry: water quality parameters, pump pressures, flow rates, equipment status flags. We run anomaly detection against that stream in near-real-time.

First-pass analysis runs on Llama 3 8B. The model receives structured context covering normal operational envelopes for each equipment type, reads each telemetry window, and makes a binary pass/flag decision. At normal operating volumes, the first pass handles roughly 90% of events as routine — a higher local capture rate than the ticket routing case, because anomaly detection against known signatures is a narrower, more structured problem than open-domain intent classification. The per-inference cost is low enough to treat this as a continuous background process.

When the local model flags an event, a second check applies. Pattern matches a known failure signature? Routes to a deterministic alert handler with no LLM in the loop.

The escalation path to GPT-4o fires on one specific condition: the local model flags an event and the pattern doesn’t match anything in the reference set. That means the anomaly is novel, multi-variable, or involves cross-sensor correlation the 8B model can’t confidently resolve. The router packages the relevant time-series window, the sensor metadata, and the local model’s uncertainty signal, then forwards the bundle to GPT-4o for root cause synthesis.

The analogy that fits: Llama 3 8B is the monitoring engineer reading logs at 3am. GPT-4o is the senior engineer you call when something doesn’t match anything in the runbook. You don’t pay senior engineers to read healthy server logs.

Infrastructure Footprint for Local Models

Deployment Stack

Intent Router / API Gateway

vLLM Server (PagedAttention)

Llama 3 8B Microservice

Single Mid-Tier GPU Instance

Owning the local inference tier means owning the operational complexity that comes with it. This is the trade-off you’re accepting when you move off a pure API approach — and it’s worth being honest about upfront.

We chose vLLM because naive model loading in Python doesn’t handle concurrent requests without manual batching logic. vLLM is an inference server built for production serving. Its core mechanism is paged attention, which manages the key-value cache in GPU memory using non-contiguous memory blocks — borrowing the virtual memory paging concept from operating systems. Without it, a spike in concurrent requests exhausts contiguous KV cache memory and requests queue or fail. With paged attention, the GPU handles variable-length concurrent sequences with significantly higher utilization.

Our deployment footprint for each 8B model instance is a single mid-tier GPU node. These instances run as microservices with standard health endpoints. The intent router sits upstream of both the vLLM server and the external LLM API — structurally it’s an API gateway with a routing policy keyed on confidence scores rather than URL path patterns.

Cold-start latency is the operational penalty worth naming explicitly. An 8B model takes 15 to 30 seconds to load onto a GPU. We keep a minimum of one warm instance per workload type to avoid cold-starts on the first request of the day. That adds a small fixed cost worth accounting for in your capacity plan.

Composite Systems as the Production Standard

Two IT professionals discussing data on a large monitor in an operations center.

Single-model LLM architectures work at prototype scale. At production scale, treating every request identically — regardless of complexity — is a cost structure that compounds quietly until it shows up on a finance report.

The teams who reduce AI operating costs over the next 12 months are likely those treating their LLM stack the way they already treat their compute stack: tiered by cost and capability, with explicit routing logic between tiers.

The first step isn’t building a router. It’s measuring your own traffic. Pull 30 days of LLM API logs, classify calls by output complexity, and count how many were binary decisions or entity extraction against a fixed schema. That number will likely be higher than you expect. Route those locally.

Keep the frontier API for synthesis, multi-step reasoning, and the cases your local model can’t handle with confidence. Our 70% figure reflects our workload distribution. Yours will differ — classification-heavy workloads skew higher; open-ended generation workloads skew lower. The routing principle holds regardless.

There’s also a resilience argument that gets less attention than cost. When you own the routing layer and the local inference tier, you own your core classification loop. When an API provider revises pricing, enforces new rate limits, or goes down, the locally-served share of your traffic keeps moving without interruption. That’s not a contingency plan. It’s a structural property of the architecture.

Measure your traffic. Build the router. The cost and resilience gains follow from those two steps.

← All field notes Brief a crew →