Inference catalog

Call open models with one API

Ready-to-use LLMs served on OpenRelay's distributed GPU cloud. Swap in a model name and hit an OpenAI-compatible endpoint — no clusters to provision, billed by the token.

  • OpenAI-compatible API
  • Pay per token
  • No cold starts
  • No GPUs to manage

Llama 3.3 70B Instruct

Meta

Most popular
Context
128K
Parameters
70B
Input / 1M
$0.55
Output / 1M
$0.80

Meta's flagship general-purpose model. Strong instruction following, tool use, and multilingual support for production assistants.

TextCode

Qwen3 VL 30B A3B

Alibaba

Vision
Context
256K
Parameters
30B (MoE)
Input / 1M
$0.15
Output / 1M
$0.20

Vision-language MoE model with only 3B active parameters. Reads images and documents while staying cheap to run.

TextVision

DeepSeek R1

DeepSeek

Reasoning
Context
128K
Parameters
671B (MoE)
Input / 1M
$0.55
Output / 1M
$2.19

Open reasoning model with chain-of-thought traces. Competitive with frontier closed models on math, code, and logic.

TextReasoning

Qwen 2.5 Coder 32B

Alibaba

Context
128K
Parameters
32B
Input / 1M
$0.10
Output / 1M
$0.15

Code-specialized model that rivals proprietary copilots on completion, refactoring, and fill-in-the-middle tasks.

TextCode

Llama 3.1 8B Instruct

Meta

Best value
Context
128K
Parameters
8B
Input / 1M
$0.03
Output / 1M
$0.06

Fast, inexpensive workhorse for chat, classification, and RAG. The best value for high-volume, latency-sensitive traffic.

Text

Gemma 4 27B

Google

Context
128K
Parameters
27B
Input / 1M
$0.10
Output / 1M
$0.18

Google's open model with strong multilingual reasoning, function calling, and JSON-mode reliability for agentic workflows.

TextCode

New models are added regularly. Need something specific? Request a model.

Drop-in OpenAI-compatible API

Point your existing OpenAI SDK at OpenRelay, set the model, and you're live. The same request shape works across every model in the catalog — streaming, tool calls, and JSON mode included.

Model IDContextInput / 1MOutput / 1M
Llama 3.3 70B Instruct128K$0.55$0.80
Qwen3 VL 30B A3B256K$0.15$0.20
DeepSeek R1128K$0.55$2.19
Qwen 2.5 Coder 32B128K$0.10$0.15
Llama 3.1 8B Instruct128K$0.03$0.06
Gemma 4 27B128K$0.10$0.18

Prices are per 1M tokens. You only pay for tokens you use — there is no per-hour GPU rental and no minimum spend. Need a dedicated, always-on deployment instead? Run your own GPU cluster.

Service-level agreements

Need a guaranteed SLA for a model?

The pay-per-token catalog runs on shared, best-effort capacity. For production workloads that need committed uptime, latency, and throughput, our team can stand up a dedicated deployment of any model with a contractual SLA. Tell us your model and traffic and sales will put together terms.

99.9% uptime

Contractual availability backed by service credits, with dedicated capacity reserved for your traffic.

Latency & throughput

Guaranteed time-to-first-token and tokens/sec targets, sized to your peak request volume.

Priority support

A named contact, private Slack channel, and 24/7 on-call for incident response and capacity changes.

Start calling models in minutes

Grab an API key, pick a model, and send your first request. Free credits to start — no card required.