Designing an inference API that developers actually enjoy using
Lessons from building our REST API: what we prioritized, what we cut, and why streaming, batching, and type safety matter more than features.
When we started building this platform, we asked ourselves: what makes an inference API feel good to use?
We looked at every API we depend on internally. The ones we enjoy using share three traits: consistent response shapes, meaningful rate limit headers, and SDKs that reflect the actual API surface rather than a stripped-down afterthought.
Consistent response shapes
Every API endpoint returns the same envelope:
{
  "data": { ... },
  "error": null,
  "meta": {
    "request_id": "req_abc123",
    "timestamp": "2026-03-15T10:00:00Z"
  }
}
Whether you're running a completion, generating embeddings, or streaming a chat response, the top-level shape never changes. This means your error handling middleware works everywhere without special cases.
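As a rough sketch of what that buys you, a single generic helper can unwrap every response. The field names below follow the envelope above; the shape of the error object (`code`, `message`) and the names `Envelope`, `ApiRequestError`, and `unwrap` are assumptions made for illustration, not part of the documented API.

```typescript
// Hypothetical error payload: the post only shows "error": null,
// so the code/message fields here are assumed.
interface ApiError {
  code: string;
  message: string;
}

// Mirrors the envelope shown above.
interface Envelope<T> {
  data: T | null;
  error: ApiError | null;
  meta: { request_id: string; timestamp: string };
}

// One error type for every endpoint, carrying the request ID for support tickets.
class ApiRequestError extends Error {
  readonly code: string;
  readonly requestId: string;

  constructor(code: string, message: string, requestId: string) {
    super(message);
    this.name = 'ApiRequestError';
    this.code = code;
    this.requestId = requestId;
  }
}

// Returns the payload on success and throws ApiRequestError otherwise.
function unwrap<T>(envelope: Envelope<T>): T {
  if (envelope.error !== null) {
    throw new ApiRequestError(
      envelope.error.code,
      envelope.error.message,
      envelope.meta.request_id,
    );
  }
  if (envelope.data === null) {
    throw new ApiRequestError(
      'empty_response',
      'Envelope contained neither data nor error',
      envelope.meta.request_id,
    );
  }
  return envelope.data;
}
```

Because the helper depends only on the top-level shape, the same function can sit in front of completions, embeddings, and any endpoint added later.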
Rate limit headers on every response
Every response includes X-RateLimit-Remaining, X-RateLimit-Limit, and X-RateLimit-Reset. Not just on error responses, but on every single response. This lets clients back off before they hit a 429, rather than discovering the limit only after a request fails.
We set the default limit at 50 requests per second per token. Enterprise customers can negotiate higher limits.
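Here is a minimal sketch of how a client might use those headers to pace itself. It assumes X-RateLimit-Reset is a Unix timestamp in seconds, which the post does not specify; the helper names and the 10% low-watermark pacing strategy are likewise illustrative choices, not platform recommendations.

```typescript
// Minimal read-only view of response headers. The fetch API's Headers object
// satisfies this shape, so a real response's headers can be passed directly.
interface HeaderReader {
  get(name: string): string | null;
}

interface RateLimitInfo {
  limit: number;
  remaining: number;
  resetAtMs: number; // ASSUMPTION: derived from a Unix timestamp in seconds
}

// Reads a numeric header, returning null when it is missing or malformed.
function readNumber(headers: HeaderReader, name: string): number | null {
  const raw = headers.get(name);
  if (raw === null || raw.trim() === '') return null;
  const value = Number(raw);
  return Number.isFinite(value) ? value : null;
}

function parseRateLimit(headers: HeaderReader): RateLimitInfo | null {
  const limit = readNumber(headers, 'X-RateLimit-Limit');
  const remaining = readNumber(headers, 'X-RateLimit-Remaining');
  const resetSeconds = readNumber(headers, 'X-RateLimit-Reset');
  if (limit === null || remaining === null || resetSeconds === null) return null;
  return { limit, remaining, resetAtMs: resetSeconds * 1000 };
}

// Returns how many milliseconds to wait before sending the next request.
// - Budget exhausted: wait until the window resets.
// - Budget below the low watermark: spread what is left across the window.
// - Otherwise: send immediately.
function delayBeforeNextRequest(
  info: RateLimitInfo,
  nowMs: number,
  lowWatermark = 0.1,
): number {
  const windowLeftMs = Math.max(0, info.resetAtMs - nowMs);
  if (info.remaining <= 0) return windowLeftMs;
  if (info.remaining < info.limit * lowWatermark) {
    return Math.ceil(windowLeftMs / info.remaining);
  }
  return 0;
}
```

A caller would run `parseRateLimit(response.headers)` after each response and sleep for `delayBeforeNextRequest(info, Date.now())` before the next call. If the reset header turns out to be a relative number of seconds instead, only the `resetAtMs` computation needs to change.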
Idempotency for safety
POST endpoints that create resources accept an optional Idempotency-Key header. If a request times out, retry with the same key and the operation executes only once. This is critical for batch inference jobs, where a single dropped connection shouldn't mean a double charge.
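The client side of this pattern can be sketched as a small retry wrapper. Only the Idempotency-Key header name comes from the API described here; `withIdempotentRetry`, its parameters, and the injected `send` callback are hypothetical names chosen so the example stays independent of any particular HTTP client.

```typescript
// Sends one attempt of a request. It receives the headers to attach,
// so the wrapper does not need to know which HTTP client is in use.
type SendAttempt<T> = (headers: Record<string, string>) => Promise<T>;

// Retries a request with the SAME idempotency key on every attempt, so the
// server can recognise a retry and avoid executing the operation twice.
async function withIdempotentRetry<T>(
  idempotencyKey: string,
  send: SendAttempt<T>,
  maxAttempts = 3,
  // ASSUMPTION: callers decide what counts as retryable. A real client
  // would normally retry only timeouts and connection failures.
  shouldRetry: (error: unknown) => boolean = () => true,
): Promise<T> {
  let lastError: unknown = new Error('maxAttempts must be at least 1');
  for (let attempt = 1; attempt <= maxAttempts; attempt++) {
    try {
      return await send({ 'Idempotency-Key': idempotencyKey });
    } catch (error) {
      lastError = error;
      if (!shouldRetry(error)) break;
    }
  }
  throw lastError;
}
```

The key should be generated once per logical operation (a UUID is a common choice) and reused across attempts; generating a fresh key inside the retry loop would defeat the purpose. The sketch omits delays between attempts, which a production client would likely add.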
TypeScript SDK that mirrors the API
Our SDK types are generated from the same OpenAPI spec that documents the API. If the API adds a field, the SDK picks it up immediately. No manual sync, no drift between docs and code.
We ship the SDK as a single import:
import { Client } from '@example/sdk';
const client = new Client({ token: process.env.API_TOKEN });
const result = await client.inference.complete({ model: 'my-model', prompt: 'Hello' });
The response is typed. Autocomplete works. The docs match.
What we left out
We deliberately avoided:
An API should do one thing well. Ours serves models. Every design decision either supports that goal or gets cut.