What is an LLM Gateway?

An LLM gateway is a proxy that sits between your applications and AI model providers. Here's what it does, how it works, and when you need one.

By Chris Therriault · 6 min read

An LLM gateway is a reverse proxy that sits between your applications and the large language model APIs they call. Every request your code makes to OpenAI, Anthropic, Google Gemini, or any other model provider goes through the gateway first. Every response passes back through it on the way out.

The gateway speaks the same wire format as the original API, so your application code doesn't change — you point your base URL at the gateway instead of at api.openai.com, and everything else stays the same.
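
As a minimal sketch of that switch, the snippet below builds the request URL from a base URL setting. The environment variable name LLM_BASE_URL and the host gateway.example.internal are assumptions made up for this illustration, not names from any particular gateway product.

```python
import os

# Default points straight at the provider.
DEFAULT_BASE_URL = "https://api.openai.com/v1"


def chat_completions_url(env=None):
    """Build the chat-completions URL from a base URL setting.

    LLM_BASE_URL is an assumed variable name for this sketch.
    """
    env = os.environ if env is None else env
    base = env.get("LLM_BASE_URL", DEFAULT_BASE_URL).rstrip("/")
    return f"{base}/chat/completions"


# Without the variable, calls go directly to the provider.
direct = chat_completions_url({})

# With it, the same code targets the (hypothetical) gateway host instead.
via_gateway = chat_completions_url(
    {"LLM_BASE_URL": "https://gateway.example.internal/v1/"}
)
```

The request path and body are identical in both cases; only the host the request is sent to differs.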

What sits at the gateway

Because every request passes through, the gateway has a unique position: it can observe, modify, or block any call before it reaches the model. That's what makes it the right place to enforce AI governance:

Cost tracking. Every request gets logged with its token counts and estimated cost, allocated to a project and team. No polling the provider's billing API, no waiting for end-of-month invoices, no manual attribution.
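
To make the attribution concrete, here is a rough sketch of how a per-request cost record might be computed from token counts. The price table, model names, and the `record_usage` helper are all hypothetical; real prices differ by provider and change over time.

```python
from dataclasses import dataclass

# Hypothetical prices in USD per one million tokens.
PRICES_PER_MILLION = {
    "example-large": {"input": 5.00, "output": 15.00},
    "example-small": {"input": 0.50, "output": 1.50},
}


@dataclass(frozen=True)
class UsageRecord:
    project: str
    team: str
    model: str
    input_tokens: int
    output_tokens: int
    cost_usd: float


def record_usage(project, team, model, input_tokens, output_tokens):
    """Estimate the cost of one request and attribute it to a project and team."""
    price = PRICES_PER_MILLION[model]
    cost = (
        input_tokens * price["input"] + output_tokens * price["output"]
    ) / 1_000_000
    return UsageRecord(
        project, team, model, input_tokens, output_tokens, round(cost, 6)
    )


# 1,200 input tokens and 300 output tokens on the larger placeholder model.
record = record_usage("support-bot", "cx", "example-large", 1200, 300)
```

Because the record is created at request time, the attribution to project and team is available immediately rather than at invoice time.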

PII detection. Before the prompt reaches the model, a detection layer scans it for personally identifiable information — names, email addresses, phone numbers, SSNs, health data. Depending on the configured policy, the gateway can block the request, replace the PII with deterministic tokens (restoring them in the response), or log and pass through.
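
The "replace and restore" option can be sketched as follows. This is an illustration under strong assumptions: the two regular expressions cover only simple email addresses and US-style SSNs, the keyed hash and the token format are choices made for this example, and SECRET_KEY is a placeholder. Production detection needs much broader coverage than this.

```python
import hashlib
import hmac
import re

# Placeholder key; a keyed hash keeps tokens deterministic without making
# them guessable from the original value alone.
SECRET_KEY = b"replace-me"

# Deliberately simple patterns, for illustration only.
PATTERNS = {
    "EMAIL": re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+"),
    "SSN": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
}


def tokenize_pii(text):
    """Replace matches with deterministic tokens; return the text and a mapping."""
    mapping = {}

    def make_replacer(kind):
        def replace(match):
            value = match.group(0)
            digest = hmac.new(
                SECRET_KEY, value.encode("utf-8"), hashlib.sha256
            ).hexdigest()[:8]
            token = f"<{kind}_{digest}>"
            mapping[token] = value
            return token

        return replace

    for kind, pattern in PATTERNS.items():
        text = pattern.sub(make_replacer(kind), text)
    return text, mapping


def restore_pii(text, mapping):
    """Put the original values back into a model response."""
    for token, value in mapping.items():
        text = text.replace(token, value)
    return text
```

The same input always yields the same token, so the model can still refer consistently to "the same person" across a conversation without seeing the underlying value.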

Budget enforcement. Each project can be assigned a Spend Token — a budget envelope with a hard dollar limit. When the balance is exhausted, the gateway returns an error instead of forwarding the request. This is a circuit breaker, not an alert.
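
A simplified model of that circuit-breaker behaviour is shown below. The `SpendToken` class and its methods are invented for this sketch and do not describe any product's actual API; the point is only that an exhausted balance raises an error before anything is forwarded.

```python
from decimal import Decimal


class BudgetExhausted(Exception):
    """Raised instead of forwarding a request when no budget remains."""


class SpendToken:
    """A budget envelope with a hard dollar limit (illustrative only)."""

    def __init__(self, project, limit_usd):
        self.project = project
        self.limit_usd = Decimal(limit_usd)
        self.spent_usd = Decimal("0")

    @property
    def balance_usd(self):
        return self.limit_usd - self.spent_usd

    def check(self):
        """Pre-flight check: refuse the request once the balance is used up."""
        if self.balance_usd <= 0:
            raise BudgetExhausted(f"{self.project}: budget exhausted")

    def charge(self, cost_usd):
        """Record the cost of a completed request against the envelope."""
        self.spent_usd += Decimal(cost_usd)
```

Decimal is used rather than float so that repeated small charges do not accumulate rounding error in the balance.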

Model routing. The gateway can enforce which models a given project is allowed to use, automatically route cost-sensitive requests to cheaper models, and run A/B tests comparing model performance.
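
The allowlist and cost-based routing could look roughly like this. The policy structure, project name, and model names are placeholders chosen for the example, and A/B testing is left out to keep the sketch short.

```python
# Hypothetical per-project policy.
POLICY = {
    "support-bot": {
        "allowed": {"example-large", "example-small"},
        "cheap_model": "example-small",
    },
}


class ModelNotAllowed(Exception):
    """Raised when a project requests a model outside its allowlist."""


def route_model(project, requested_model, cost_sensitive=False):
    """Return the model to call upstream, or raise if it is not allowed."""
    policy = POLICY[project]
    if requested_model not in policy["allowed"]:
        raise ModelNotAllowed(f"{requested_model} is not allowed for {project}")
    if cost_sensitive:
        # Downgrade cost-sensitive traffic to the project's cheaper model.
        return policy["cheap_model"]
    return requested_model
```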

Audit trail. Every request that passes through the gateway is recorded in an immutable log: the project, the model, the cost, the timestamp. This log is the foundation of any compliance review.
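
The article does not say how the log is made immutable. One common technique, used here purely as an assumption, is to chain entries by hash so that any later modification is detectable. The `AuditLog` class below is a hypothetical in-memory sketch of that idea, not a description of a real storage layer.

```python
import hashlib
import json


class AuditLog:
    """Append-only log in which each entry commits to the previous one."""

    GENESIS = "0" * 64

    def __init__(self):
        self._entries = []

    @staticmethod
    def _digest(body):
        payload = json.dumps(body, sort_keys=True).encode("utf-8")
        return hashlib.sha256(payload).hexdigest()

    def append(self, project, model, cost_usd, timestamp):
        prev_hash = self._entries[-1]["hash"] if self._entries else self.GENESIS
        body = {
            "project": project,
            "model": model,
            "cost_usd": cost_usd,
            "timestamp": timestamp,
            "prev_hash": prev_hash,
        }
        entry = dict(body, hash=self._digest(body))
        self._entries.append(entry)
        return entry

    def verify(self):
        """Recompute the chain; return False if any entry was altered."""
        prev_hash = self.GENESIS
        for entry in self._entries:
            body = {k: v for k, v in entry.items() if k != "hash"}
            if body["prev_hash"] != prev_hash:
                return False
            if self._digest(body) != entry["hash"]:
                return False
            prev_hash = entry["hash"]
        return True
```

Hash chaining makes tampering evident rather than impossible; a real deployment would pair it with write-once storage or external anchoring.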

How a request flows through the gateway

  1. Your application calls POST /v1/chat/completions — exactly as it would call OpenAI.
  2. The gateway receives the request and authenticates it (typically via an API key or Spend Token).
  3. The gateway runs the configured pre-flight checks: PII scan, budget check, model allowlist check.
  4. If the pre-flight checks pass, the gateway forwards the request to the upstream model provider — in the right wire format for that provider.
  5. The provider returns a response. For streaming requests, the gateway passes the SSE chunks through as they arrive.
  6. The gateway logs the completed request: tokens, cost, model, latency.
  7. The response reaches your application.
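
The steps above can be sketched as a single handler. Everything here is a simplification under stated assumptions: `handle_request`, the request dictionary shape, and the injected `checks`, `forward`, and `log` parameters are hypothetical, only non-streaming requests are covered, and cost calculation is omitted from the log entry for brevity.

```python
import time


class Rejected(Exception):
    """Raised when authentication or a pre-flight check fails."""


def handle_request(request, *, api_keys, checks, forward, log):
    # Step 2: authenticate the caller and resolve its project.
    project = api_keys.get(request["api_key"])
    if project is None:
        raise Rejected("unknown API key")

    # Step 3: run pre-flight checks; each one raises Rejected on failure,
    # so nothing is sent upstream unless all of them pass.
    for check in checks:
        check(project, request)

    # Steps 4 and 5: forward upstream and wait for the provider's response.
    started = time.perf_counter()
    response = forward(request)
    latency_ms = (time.perf_counter() - started) * 1000

    # Step 6: log the completed request.
    log.append({
        "project": project,
        "model": request["model"],
        "tokens": response["usage"]["total_tokens"],
        "latency_ms": latency_ms,
    })

    # Step 7: hand the response back to the application.
    return response


def allowlist_check(project, request):
    """Example pre-flight check with a placeholder allowlist."""
    if request["model"] not in {"example-small"}:
        raise Rejected("model not allowed")


def fake_upstream(request):
    """Stand-in for the provider call so the sketch runs offline."""
    return {
        "choices": [{"message": {"content": "hi"}}],
        "usage": {"total_tokens": 12},
    }


log = []
response = handle_request(
    {"api_key": "key-1", "model": "example-small", "messages": []},
    api_keys={"key-1": "demo"},
    checks=[allowlist_check],
    forward=fake_upstream,
    log=log,
)
```

Passing the checks and the upstream call in as parameters keeps the flow itself independent of any one provider or policy.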

For a typical request, the gateway adds less than 5 ms of latency. The governance layer pays for itself with the first billing surprise it prevents.

Why use a gateway instead of a client-side SDK

The most common alternative to a gateway is to handle governance in application code — wrap every LLM call with a cost-checking function, strip PII in a helper, log the response somewhere.

This works for a single application maintained by a single team. It breaks down when:

Multiple teams are making LLM calls. Application-code governance only works if every team remembers to use the right wrapper. One team that calls the SDK directly bypasses all of it. A gateway doesn't care who wrote the code or which SDK they used — it sees every request.

Multiple providers are in use. Implementing PII detection and cost logging for OpenAI, then for Anthropic, then for Gemini, then for Bedrock — in four different application code paths — is a substantial maintenance burden. A gateway implements it once, for all providers.

Compliance requires infrastructure-level enforcement. An auditor asking whether PII can reach your model providers is not satisfied by "we have a function that developers are supposed to call." They want to see a control that cannot be bypassed by application code. A gateway enforces the policy on the network path itself, outside any application's codebase, provided direct outbound access to the providers is blocked so that the gateway is the only route.

The policies change more often than the application code. If you need to tighten a budget limit, add a new model to the allowlist, or change the PII policy for a specific project — doing that in application code means a code change, a PR, a deploy. Doing it in the gateway's configuration means a dashboard change that takes effect immediately.

What a gateway is not

It is not an observability tool. Observability platforms (LangSmith, Braintrust, Helicone) are designed for prompt engineering and chain debugging — they help you understand what your prompts look like and why your chains behave the way they do. A gateway is designed for governance — who is spending what, whether the spend is authorised, whether the data handling is compliant. The two are complementary, not alternatives.

It is not a load balancer. Some LLM infrastructure tools focus on routing requests across multiple API keys to stay within rate limits. That's a valid problem to solve, but it's not governance. A gateway that load-balances without logging, PII detection, or budget enforcement is a traffic manager, not a governance layer.

It is not the same as an LLM proxy for local development. Some developers run local proxies to cache responses and reduce costs during development. These are development tools, not production governance infrastructure.

When do you need one?

You don't need an LLM gateway when you have a single application, a single team, a single provider, and you're still in early development. Direct API calls are fine.

You need a gateway when any of these are true:

  • More than one team is making LLM API calls in production
  • You have compliance obligations that require documented PII controls
  • You've received an unexplained AI invoice that you couldn't attribute to a specific project
  • You're deploying autonomous agents that can make many API calls without human review
  • Finance or Legal has asked you to document which data goes to which AI vendor
  • You've started using more than one model provider and the per-provider overhead is adding up

The threshold is lower than most engineering teams expect. The question is not "are we big enough for this?" The question is "do we have the controls in place before we need to explain ourselves?"

Getting started

Modern LLM gateways are designed to be a drop-in change. You update one environment variable, your base URL, and point it at the gateway (plus the API key, if the gateway issues its own credentials such as a Spend Token). Your application code, your SDK calls, and your prompt templates stay the same.

The gateway configuration (which projects get which budget, which models are allowed, what the PII policy is) lives in a dashboard that non-technical stakeholders can read and adjust without touching the application.

Deployment typically takes 30 minutes for a new organisation. The cost of not having a gateway is measured in billing surprises, compliance findings, and the engineering time spent building equivalent controls in application code, time that could have been spent on the product itself.
