Product · March 15, 2026 · Product Team · 5 min read

Designing an inference API that developers actually enjoy using

Lessons from building our REST API: what we prioritized, what we cut, and why streaming, batching, and type safety matter more than features.

When we started building this platform, we asked ourselves: what makes an inference API feel good to use?


We looked at every API we depend on internally. The ones we enjoy using share three traits: consistent response shapes, meaningful rate limit headers, and SDKs that reflect the actual API surface rather than a stripped-down afterthought.


Consistent response shapes


Every API endpoint returns the same envelope:


{
  "data": { ... },
  "error": null,
  "meta": {
    "request_id": "req_abc123",
    "timestamp": "2026-03-15T10:00:00Z"
  }
}

Whether you're running a completion, generating embeddings, or streaming a chat response, the top-level shape never changes. This means your error handling middleware works everywhere without special cases.
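Because the envelope is uniform, a single helper can unwrap every response. The sketch below is illustrative and not part of our SDK: the names `Envelope`, `ApiError`, `ApiRequestError`, and `unwrap` are hypothetical, and the `code` and `message` fields on the error object are assumptions, since the example above only shows `"error": null`.

```typescript
// Assumed shape of a non-null error; the post only shows `"error": null`.
interface ApiError {
  code: string;
  message: string;
}

interface Meta {
  request_id: string;
  timestamp: string; // ISO 8601, as in the example envelope
}

// The envelope every endpoint returns, generic over the payload type.
interface Envelope<T> {
  data: T | null;
  error: ApiError | null;
  meta: Meta;
}

// Hypothetical error class that carries the request ID for support tickets and logs.
class ApiRequestError extends Error {
  readonly code: string;
  readonly requestId: string;

  constructor(code: string, message: string, requestId: string) {
    super(message);
    this.name = "ApiRequestError";
    this.code = code;
    this.requestId = requestId;
  }
}

// One unwrap function works for completions, embeddings, and chat alike,
// because the top-level shape never changes.
function unwrap<T>(envelope: Envelope<T>): T {
  if (envelope.error !== null) {
    throw new ApiRequestError(
      envelope.error.code,
      envelope.error.message,
      envelope.meta.request_id,
    );
  }
  if (envelope.data === null) {
    // Defensive case: neither data nor error was present.
    throw new ApiRequestError(
      "empty_response",
      "Envelope contained neither data nor error",
      envelope.meta.request_id,
    );
  }
  return envelope.data;
}
```

Middleware built this way needs no per-endpoint branches: it calls `unwrap` on whatever comes back and handles `ApiRequestError` in one place.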


Rate limit headers on every response


Every response includes X-RateLimit-Remaining, X-RateLimit-Limit, and X-RateLimit-Reset. Not just on error responses - on every single response. This lets clients back off proactively before they hit a 429, rather than discovering the limit only after a request fails.


We set the default limit at 50 requests per second per token. Enterprise customers can negotiate higher limits.
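One way a client might use these headers to pace itself is sketched below. Treat it as a rough illustration under stated assumptions: we assume here that X-RateLimit-Reset is a Unix timestamp in seconds (check the API reference before relying on that), and `parseRateLimit`, `delayBeforeNextRequest`, and the 10% low-watermark threshold are hypothetical choices, not SDK features.

```typescript
// Minimal structural type so the sketch works with the Fetch API's Headers
// object or any other case-aware header lookup.
interface HeaderReader {
  get(name: string): string | null;
}

interface RateLimitState {
  limit: number;
  remaining: number;
  resetAtMs: number; // when the current window resets, in epoch milliseconds
}

function readNumber(headers: HeaderReader, name: string): number | null {
  const raw = headers.get(name);
  if (raw === null || raw.trim() === "") return null;
  const value = Number(raw);
  return Number.isFinite(value) ? value : null;
}

// ASSUMPTION: X-RateLimit-Reset is a Unix timestamp in seconds.
function parseRateLimit(headers: HeaderReader): RateLimitState | null {
  const limit = readNumber(headers, "X-RateLimit-Limit");
  const remaining = readNumber(headers, "X-RateLimit-Remaining");
  const resetSeconds = readNumber(headers, "X-RateLimit-Reset");
  if (limit === null || remaining === null || resetSeconds === null) return null;
  return { limit, remaining, resetAtMs: resetSeconds * 1000 };
}

// Returns how many milliseconds to wait before the next request (0 = send now).
function delayBeforeNextRequest(
  state: RateLimitState,
  nowMs: number,
  lowWatermark = 0.1, // hypothetical threshold: start pacing at 10% of budget
): number {
  const msUntilReset = Math.max(0, state.resetAtMs - nowMs);
  // Budget exhausted: wait for the window to reset instead of earning a 429.
  if (state.remaining <= 0) return msUntilReset;
  // Plenty of budget left: no need to slow down.
  if (state.remaining > state.limit * lowWatermark) return 0;
  // Running low: spread the remaining requests over the rest of the window.
  return Math.ceil(msUntilReset / state.remaining);
}
```

A caller would run `parseRateLimit(response.headers)` after each response and sleep for `delayBeforeNextRequest(state, Date.now())` before sending the next one.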


Idempotency for safety


POST endpoints that create resources accept an optional Idempotency-Key header. If a request times out, retry with the same key and the operation will only execute once. This is critical for batch inference jobs where a single dropped connection shouldn't mean a double charge.
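The retry pattern described above might look like the following sketch. The `Send` callback and `postIdempotent` function are hypothetical names used for illustration; only the Idempotency-Key header comes from the API itself. The key detail is that the key is generated once, outside the retry loop, so every attempt carries the same value.

```typescript
import { randomUUID } from "node:crypto";

// Hypothetical transport: performs one POST with the given extra headers
// and resolves with the parsed response, or rejects on timeout/network error.
type Send<T> = (headers: Record<string, string>) => Promise<T>;

async function postIdempotent<T>(
  send: Send<T>,
  maxAttempts = 3,
  baseDelayMs = 200,
): Promise<T> {
  // Generate the key ONCE so that all retries are recognized as the same operation.
  const idempotencyKey = randomUUID();
  let lastError: unknown;

  for (let attempt = 1; attempt <= maxAttempts; attempt++) {
    try {
      return await send({ "Idempotency-Key": idempotencyKey });
    } catch (err) {
      // For brevity this sketch retries on any error; a real client would
      // likely retry only on timeouts and connection failures.
      lastError = err;
      if (attempt < maxAttempts) {
        // Exponential backoff: baseDelayMs, then 2x, then 4x, ...
        const delayMs = baseDelayMs * 2 ** (attempt - 1);
        await new Promise<void>((resolve) => setTimeout(resolve, delayMs));
      }
    }
  }
  throw lastError;
}
```

In practice `send` would wrap a `fetch` call that merges these headers into the POST request for a batch job. Because the server sees the same key on each attempt, a retry after a dropped connection returns the original result instead of creating (and charging for) a second job.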


TypeScript SDK that mirrors the API


Our SDK types are generated from the same OpenAPI spec that documents the API. If the API adds a field, the SDK picks it up the next time the types are regenerated. No manual sync, no drift between docs and code.


We ship the SDK as a single import:


import { Client } from '@example/sdk';

const client = new Client({ token: process.env.API_TOKEN });
const result = await client.inference.complete({ model: 'my-model', prompt: 'Hello' });

The response is typed. Autocomplete works. The docs match.


What we left out


We deliberately avoided:

  • GraphQL (adds complexity most teams don't need)
  • WebSocket subscriptions (ship separately as a focused WebSocket API)
  • Multi-part uploads (use signed URLs instead)

An API should do one thing well. Ours serves models. Every design decision either supports that goal or gets cut.