What is Respan's AI gateway?

A transparent proxy at https://api.respan.ai/api/ that accepts OpenAI-compatible requests, routes them to 250+ models, applies fallback, response caching, and per-key limits, and logs every request automatically.

How do I migrate from calling OpenAI directly?

Keep your existing OpenAI SDK. Set base_url to https://api.respan.ai/api/ and use your Respan API key. Add your provider keys on the Integrations page so Respan can call models on your behalf.

How much latency does the gateway add?

The gateway adds roughly 50–150ms per call (network hop and processing). It may not suit products with strict latency requirements — test TTFT-sensitive workloads on your path before you commit.
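One way to check the added latency on your own path is to time the first streamed chunk with and without the gateway. The sketch below is a minimal timing helper that works on any iterable of text chunks; `measure_ttft` and `fake_stream` are hypothetical names, and the fake stream only stands in for a real streamed completion.

```python
import time
from typing import Iterable, Iterator, Tuple


def measure_ttft(chunks: Iterable[str]) -> Tuple[float, float, str]:
    """Return (seconds to first chunk, total seconds, joined text) for a stream."""
    start = time.perf_counter()
    first_at = None
    parts = []
    for chunk in chunks:
        if first_at is None:
            first_at = time.perf_counter()  # first token arrived
        parts.append(chunk)
    end = time.perf_counter()
    ttft = (first_at if first_at is not None else end) - start
    return ttft, end - start, "".join(parts)


def fake_stream(delay: float = 0.05) -> Iterator[str]:
    """Stand-in for a streamed completion; replace with real text chunks."""
    time.sleep(delay)  # simulated wait before the first token
    yield "Hello"
    yield ", world"


ttft, total, text = measure_ttft(fake_stream())
print(f"ttft={ttft * 1000:.0f}ms total={total * 1000:.0f}ms text={text!r}")
```

Run the same helper against a direct provider stream and a gateway stream from the same region, and compare the two TTFT distributions rather than single samples.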

Does the gateway work with Anthropic and Gemini?

Yes. Anthropic passthrough uses https://api.respan.ai/api/anthropic/; Gemini and Vertex have dedicated passthrough base URLs. Pin a different backend with X-Respan-Route-Provider when you need Anthropic-format requests on Vertex or Bedrock.

Can I route to self-hosted models?

Create a custom model alias on the Models page that maps to a supported provider model, or browse available models on the Models page. Your app still calls Respan; Respan forwards and logs like any other model.

How do per-key limits work?

In Settings → Limits, attach policies to API keys: cap cost, request count, or tokens per hour, day, or month. Warn sends Slack or email alerts while traffic continues; Block returns 429 when the threshold is hit. Create separate API keys per environment instead of using an env parameter on one key.
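A minimal sketch of the one-key-per-environment setup, assuming each key is stored in its own environment variable. The variable names and the `respan_key_for` helper are hypothetical; only the idea of separate keys per environment comes from the answer above.

```python
import os

# Hypothetical variable names: one Respan API key per environment.
KEY_ENV_VARS = {
    "development": "RESPAN_API_KEY_DEV",
    "staging": "RESPAN_API_KEY_STAGING",
    "production": "RESPAN_API_KEY_PROD",
}


def respan_key_for(environment: str, env=os.environ) -> str:
    """Look up the API key for one environment, failing loudly if it is missing."""
    try:
        var_name = KEY_ENV_VARS[environment]
    except KeyError:
        raise ValueError(f"unknown environment: {environment!r}") from None
    key = env.get(var_name)
    if not key:
        raise RuntimeError(f"{var_name} is not set")
    return key


# Example with an explicit mapping instead of the real process environment.
print(respan_key_for("staging", {"RESPAN_API_KEY_STAGING": "sk-test"}))
```

Because each key carries its own limit policy, a runaway staging job hits the staging cap without touching production traffic.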

How do failover and retries work?

Set fallback_models on the request or in Settings → Fallback. When the primary model fails, Respan tries backups in the order you specify. Configure retry_params and load_balance_group for retries and key rotation — cap retries in your app to avoid stacking with gateway retries.
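As a rough sketch, the per-request fallback and retry settings can be assembled in one place and passed as extra_body. The field names fallback_models, retry_params, num_retries, and retry_after appear on this page; the `reliability_params` helper, the default values, and the assumption that retry_after is in seconds are illustrative only.

```python
from typing import Any, Dict, List


def reliability_params(
    primary: str,
    fallbacks: List[str],
    num_retries: int = 1,
    retry_after: float = 0.5,  # assumed to be seconds
) -> Dict[str, Any]:
    """Build fallback and retry params for one request (pass as extra_body)."""
    if num_retries < 0:
        raise ValueError("num_retries must be >= 0")
    # Keep the caller's order, drop duplicates and the primary model itself.
    ordered = [m for m in dict.fromkeys(fallbacks) if m != primary]
    return {
        "fallback_models": ordered,
        "retry_params": {"num_retries": num_retries, "retry_after": retry_after},
    }


params = reliability_params(
    "gpt-5.4",
    ["claude-sonnet-4-20250514", "gpt-5.4", "gemini-2.5-flash"],
)
print(params)
```

Keeping the list ordered matters because backups are tried in the order you specify.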

How does response caching work?

Enable cache_enabled to reuse exact request/response pairs and cut cost and latency. Set cache_ttl and use cache_by_customer so entries stay scoped per customer_identifier. Filter cache hits on the Logs page; Anthropic prompt caching is separate.

What are customer_identifier and metadata?

Respan params on the same gateway call: customer_identifier tags the end user for the Users page and per-customer analytics. metadata keys become custom properties you filter and group by on Logs and traces — send them via extra_body, respan_params, or the X-Data-Respan-Params header.

How does the gateway connect to tracing and evals?

Every gateway request is logged automatically. Send customer_identifier and metadata on the same call to filter and group logs by end user, feature, or thread — then use tracing and evals on the same platform when you need span waterfalls or production scores.

AI Gateway for Production LLM Routing

Unified router or provider passthrough?

Use the unified router at https://api.respan.ai/api/ when you want one SDK and one base URL across providers, with inline models, fallback_models, and load_balance_group on the same request shape. Use passthrough (/api/anthropic/, /api/google/gemini, /api/google/vertexai/) when you already run provider-native SDKs or CLIs (Claude Code, Gemini CLI) and want to keep the provider's native API shape. Both shapes log every request and can run in the same project.
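The choice between the two shapes can be sketched as a small lookup that returns the base URL and any extra headers for a client. The Gemini and Vertex URLs below assume the passthrough paths sit on the same host as the router, the `gateway_target` helper is hypothetical, and the value "bedrock" for X-Respan-Route-Provider is a guess at the accepted format.

```python
from typing import Dict, Optional, Tuple

# Router and Anthropic URLs are quoted on this page; the Gemini and Vertex
# entries assume the same host as the router.
BASE_URLS = {
    "openai": "https://api.respan.ai/api/",
    "anthropic": "https://api.respan.ai/api/anthropic/",
    "gemini": "https://api.respan.ai/api/google/gemini",
    "vertexai": "https://api.respan.ai/api/google/vertexai/",
}


def gateway_target(
    api_shape: str, route_provider: Optional[str] = None
) -> Tuple[str, Dict[str, str]]:
    """Return (base_url, extra_headers) for the SDK that speaks api_shape."""
    if api_shape not in BASE_URLS:
        raise ValueError(f"unsupported API shape: {api_shape!r}")
    headers: Dict[str, str] = {}
    if route_provider is not None:
        # Pin a different backend while keeping the native request format.
        headers["X-Respan-Route-Provider"] = route_provider
    return BASE_URLS[api_shape], headers


base_url, headers = gateway_target("anthropic", route_provider="bedrock")
print(base_url, headers)
```

The returned pair maps onto the base URL and default-headers options that most provider SDKs expose on their client constructors.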

Unified router or provider passthrough for 250+ models, with failover, response caching, warn/block limits, and metadata on every logged request.

Production gateway on Respan

One endpoint for 250+ models — router or passthrough, failover, response caching, warn/block limits, and metadata on every logged request.

One API for every model

Route OpenAI-style calls through Respan to 250+ models, or keep each provider’s native SDK on a passthrough endpoint. Either way, every request is logged.

Stay up when models fail

If a model errors or rate-limits, try the next model in your fallback list, balance load across keys, and retry with backoff from one place.

Control spend and reuse answers

Set soft warnings or hard caps per API key, get Slack or email alerts when a threshold is crossed, and cache repeat prompts to cut cost and latency.

Respan gateway capabilities

Routing, reliability, limits, and metadata on one surface, tied to tracing and evals on the same platform.

Routing & models

Chat Completions and Responses on the unified router, or Anthropic, Gemini, and Vertex passthroughs. Swap slugs, pass an inline models list, add custom aliases, pin X-Respan-Route-Provider, or set credential_override per model slug.

Reliability & limits

Platform fallback_models or per-request lists; load_balance_group across deployments and customer_credentials across keys; retry_params with backoff. Limits warn or block on cost, requests, or tokens per API key.

Metadata & custom properties

customer_identifier, metadata, thread_identifier, and disable_log on one call, via extra_body, respan_params, or X-Data-Respan-Params when needed. Filter logs and traces by any key; Users page breaks down spend per end user.
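A small helper can keep these params consistent across call sites. This is a sketch only: the field names come from the paragraph above, while the `tagging_params` function, the rule that customer_identifier must be non-empty, and the sample values are assumptions.

```python
from typing import Any, Dict, Optional


def tagging_params(
    customer_identifier: str,
    metadata: Dict[str, Any],
    thread_identifier: Optional[str] = None,
    disable_log: bool = False,
) -> Dict[str, Any]:
    """Build the Respan params for one call; pass the result as extra_body."""
    if not customer_identifier:
        raise ValueError("customer_identifier is required")
    params: Dict[str, Any] = {
        "customer_identifier": customer_identifier,
        "metadata": dict(metadata),  # copy so callers can reuse their dict
    }
    if thread_identifier is not None:
        params["thread_identifier"] = thread_identifier  # groups multi-turn traffic
    if disable_log:
        params["disable_log"] = True
    return params


params = tagging_params(
    "user_123", {"feature": "chatbot"}, thread_identifier="thread_42"
)
print(params)
```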

What happens on each request

Router or passthrough, limits, cache, failover, and logging: one lifecycle from your app through Respan to the model provider.

Route · passthrough

Pick the unified router or passthrough — swap model slugs or native SDKs; Respan logs every request.

Failover · cache

Try fallback_models when the primary fails; cache hits return the stored response, scoped per customer.

Log · trace

Tag customer_identifier and metadata on calls; filter Logs for routing, token counts, and cost.

What breaks production gateway setups

Six gaps teams hit calling providers directly, and how Respan routing, cache, and automatic logging address each.

Direct provider keys in every service.

Per-team key sprawl with no shared caps. Issue Respan API keys per environment and team, set warn/block limit policies, and tag traffic with customer_identifier.

No failover on the hot path.

Upstream errors become user-facing downtime without a fallback list. Set fallback_models in Settings → Fallback or on the request.

Retries stacked three high.

Gateway retry_params plus app retries compound load. Configure retry_params in the platform or request and cap retries in your application.

Shared cache across customers.

Cache without cache_by_customer can return one user's answer to another. Enable cache_by_customer and validate cache_ttl before launch.

Gateway logs in a silo.

Calls that bypass the gateway miss unified logs. Route through Respan so every router and passthrough request is logged automatically.

No metadata on gateway calls.

When logs lack customer_identifier or metadata, you cannot filter by feature, tenant, or thread. Send these params on production paths; thread_identifier groups multi-turn traffic.

How to use the gateway in code

Point your client at https://api.respan.ai/api/, add provider keys, and ship. More examples are in the docs.

Get your Respan API key

Add provider credentials

Connect providers on Integrations or add credits on Billing.

Choose router or passthrough

One OpenAI-style base URL, or native Anthropic / Gemini URLs.

Send params on every call

Tag users, set fallback models, and enable cache in extra_body.

from openai import OpenAI

# Point the standard OpenAI SDK at the Respan unified router.
client = OpenAI(
    base_url="https://api.respan.ai/api/",
    api_key="YOUR_RESPAN_API_KEY",
)

response = client.chat.completions.create(
    model="gpt-5.4",
    messages=[{"role": "user", "content": "Hello!"}],
    # Respan params travel in extra_body next to the usual OpenAI fields.
    extra_body={
        "customer_identifier": "user_123",  # tags the end user
        "metadata": {"feature": "chatbot", "environment": "production"},
        # Tried in this order if the primary model fails.
        "fallback_models": ["claude-sonnet-4-20250514", "gemini-2.5-flash"],
        "cache_enabled": True,
        "cache_ttl": 600,
        "cache_options": {"cache_by_customer": True},  # scope entries per customer
    },
)
print(response.choices[0].message.content)

Gateway failure modes at production scale

Retries, cache policy, and streaming edge cases: what teams configure before the gateway becomes a single point of dependency.

Cascading retries

Gateway retry_params can retry upstream while your app retries the gateway. Configure num_retries and retry_after in the platform or request body, and cap application retries so layers do not stack.
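To see why the layers stack, here is a minimal sketch of an application-side retry cap plus the worst-case arithmetic. `GatewayError`, `call_with_cap`, and `worst_case_upstream_calls` are hypothetical names, and the multiplication assumes the gateway performs up to num_retries extra upstream attempts for every call your app makes.

```python
import time
from typing import Callable, TypeVar

T = TypeVar("T")


class GatewayError(Exception):
    """Hypothetical error type standing in for a failed gateway call."""


def call_with_cap(
    send: Callable[[], T],
    max_attempts: int = 2,
    base_delay: float = 0.5,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Call send() at most max_attempts times with exponential backoff."""
    if max_attempts < 1:
        raise ValueError("max_attempts must be >= 1")
    for attempt in range(1, max_attempts + 1):
        try:
            return send()
        except GatewayError:
            if attempt == max_attempts:
                raise  # budget spent: surface the error instead of looping
            sleep(base_delay * 2 ** (attempt - 1))
    raise AssertionError("unreachable")


def worst_case_upstream_calls(app_attempts: int, gateway_retries: int) -> int:
    """Upper bound on upstream calls if both layers exhaust their budgets."""
    return app_attempts * (1 + gateway_retries)


print(worst_case_upstream_calls(app_attempts=2, gateway_retries=3))
```

Under that assumption, two app attempts on top of three gateway retries can produce up to eight upstream calls for one user action, which is why the application cap should stay small.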

Stale response cache

A cache_ttl that is too long serves stale answers, and leaving cache_by_customer off can serve one user's answer to another. Set cache_options.is_cached_by_model when switching models so the same prompt does not return a cache entry from another model.
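The effect of the two scoping options can be illustrated with a toy cache key. This is not Respan's actual keying scheme; `cache_key` and its parameters are hypothetical and only show why a key that omits the customer or the model lets unrelated requests share an entry.

```python
import hashlib
import json
from typing import Any, Dict, List, Optional


def cache_key(
    model: str,
    messages: List[Dict[str, str]],
    customer_identifier: Optional[str] = None,
    by_customer: bool = True,
    by_model: bool = True,
) -> str:
    """Hash the parts of a request that should distinguish cache entries."""
    parts: Dict[str, Any] = {"messages": messages}
    if by_model:
        parts["model"] = model  # switching models yields a new entry
    if by_customer:
        parts["customer"] = customer_identifier  # entries stay per customer
    blob = json.dumps(parts, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(blob.encode("utf-8")).hexdigest()


msgs = [{"role": "user", "content": "What is my balance?"}]
shared_a = cache_key("gpt-5.4", msgs, "user_a", by_customer=False)
shared_b = cache_key("gpt-5.4", msgs, "user_b", by_customer=False)
scoped_a = cache_key("gpt-5.4", msgs, "user_a")
scoped_b = cache_key("gpt-5.4", msgs, "user_b")
print(shared_a == shared_b, scoped_a == scoped_b)
```

With customer scoping off, both users map to the same entry; with it on, the keys differ and each user gets their own cached response.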

disable_log and omit_log

disable_log records metrics only, with no request/response payloads. cache_options.omit_log skips creating a new log on cache hits. Use them when you need cost and latency without storing full bodies.

Respan is committed to maintaining compliance with the most rigorous international safety and security standards.

ISO 27001

Respan is fully compliant with ISO 27001, the internationally recognized standard for information security management.

SOC 2

We meet SOC 2 requirements to ensure secure and compliant management of data across all our systems.

GDPR

With operations designed for global compliance, we operate under GDPR - the world's strictest standard for data privacy.

HIPAA

Respan is HIPAA compliant with a Business Associate Agreement available for healthcare organizations.

Works with your entire stack

Use Respan with your favorite frameworks and tools.

Engineering deep-dives

Practical guides on the architecture decisions that surround the gateway:

LLM gateway. The long version of how a gateway decomposes into routing, fallback, caching, and budget control.
The default AI stack. What most teams converge on once they outgrow direct provider calls.
Local vs frontier LLMs. When self-hosted Llama or Qwen actually beats a frontier API on cost and latency.
Should you fine-tune an LLM?. The decision tree before you spend two weeks training.
RAG and vector databases. What the gateway should and shouldn't handle for retrieval-heavy apps.

Related guides: LLM tracing · LLM evals · LLM observability

Built for AI agents.
Break less.
Ship more.