Inference catalog
Call open models with one API
Ready-to-use LLMs served on OpenRelay's distributed GPU cloud. Swap in a model name and hit an OpenAI-compatible endpoint — no clusters to provision, billed by the token.
- OpenAI-compatible API
- Pay per token
- No cold starts
- No GPUs to manage
Llama 3.3 70B Instruct
Meta
- Context
- 128K
- Parameters
- 70B
- Input / 1M
- $0.55
- Output / 1M
- $0.80
Meta's flagship general-purpose model. Strong instruction following, tool use, and multilingual support for production assistants.
Qwen3 VL 30B A3B
Alibaba
- Context
- 256K
- Parameters
- 30B (MoE)
- Input / 1M
- $0.15
- Output / 1M
- $0.20
Vision-language MoE model with only 3B active parameters. Reads images and documents while staying cheap to run.
DeepSeek R1
DeepSeek
- Context
- 128K
- Parameters
- 671B (MoE)
- Input / 1M
- $0.55
- Output / 1M
- $2.19
Open reasoning model with chain-of-thought traces. Competitive with frontier closed models on math, code, and logic.
Qwen 2.5 Coder 32B
Alibaba
- Context
- 128K
- Parameters
- 32B
- Input / 1M
- $0.10
- Output / 1M
- $0.15
Code-specialized model that rivals proprietary copilots on completion, refactoring, and fill-in-the-middle tasks.
Llama 3.1 8B Instruct
Meta
- Context
- 128K
- Parameters
- 8B
- Input / 1M
- $0.03
- Output / 1M
- $0.06
Fast, inexpensive workhorse for chat, classification, and RAG. The best value for high-volume, latency-sensitive traffic.
Gemma 4 27B
- Context
- 128K
- Parameters
- 27B
- Input / 1M
- $0.10
- Output / 1M
- $0.18
Google's open model with strong multilingual reasoning, function calling, and JSON-mode reliability for agentic workflows.
New models are added regularly. Need something specific? Request a model.
Drop-in OpenAI-compatible API
Point your existing OpenAI SDK at OpenRelay, set the model, and you're live. The same request shape works across every model in the catalog — streaming, tool calls, and JSON mode included.
| Model ID | Context | Input / 1M | Output / 1M |
|---|---|---|---|
| Llama 3.3 70B Instruct | 128K | $0.55 | $0.80 |
| Qwen3 VL 30B A3B | 256K | $0.15 | $0.20 |
| DeepSeek R1 | 128K | $0.55 | $2.19 |
| Qwen 2.5 Coder 32B | 128K | $0.10 | $0.15 |
| Llama 3.1 8B Instruct | 128K | $0.03 | $0.06 |
| Gemma 4 27B | 128K | $0.10 | $0.18 |
Prices are per 1M tokens. You only pay for tokens you use — there is no per-hour GPU rental and no minimum spend. Need a dedicated, always-on deployment instead? Run your own GPU cluster.
Service-level agreements
Need a guaranteed SLA for a model?
The pay-per-token catalog runs on shared, best-effort capacity. For production workloads that need committed uptime, latency, and throughput, our team can stand up a dedicated deployment of any model with a contractual SLA. Tell us your model and traffic and sales will put together terms.
99.9% uptime
Contractual availability backed by service credits, with dedicated capacity reserved for your traffic.
Latency & throughput
Guaranteed time-to-first-token and tokens/sec targets, sized to your peak request volume.
Priority support
A named contact, private Slack channel, and 24/7 on-call for incident response and capacity changes.
Or email sales@openrelay.inc.
Start calling models in minutes
Grab an API key, pick a model, and send your first request. Free credits to start — no card required.