Engineering · March 8, 2026 · Product Team · 6 min read

How we reduced model cold starts by over 80% with predictive caching

A technical deep-dive into our GPU pre-warm strategy, the data that drove the decisions, and the trade-offs we accepted.

Cold starts are the most common complaint we hear about model inference. A user in Tokyo sends a request to a model that hasn't been invoked in 30 minutes, and they wait 3 seconds for the GPU to warm up before getting a response.


We set out to fix this.


The problem


Our inference network spans 300+ locations. Each location runs GPU-enabled containers. When a model hasn't been invoked in a while, the runtime offloads it to free VRAM. The next request pays a cold start penalty.


We measured the median cold start across our network at 2.8s. The p95 was 6.2s. For an API that aims for p95 under 200ms, that was unacceptable.


The approach: predictive pre-warming


Instead of keeping every model warm everywhere (wasteful), we built a prediction model that anticipates which regions will need which models.


The model considers:

  • Time-of-day patterns - for example, APAC regions are pre-warmed during local business hours
  • Recent inference history - A model invoked 5 times in the last hour stays warm; one invoked once in the last day may not
  • Deployment events - Newly deployed models are kept warm for the first 2 hours regardless of traffic
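As a rough illustration of the time-of-day signal, a regional check for "local business hours" could look like the sketch below. The function name, the fixed UTC offset, the 09:00 to 18:00 window, and the weekday-only rule are all assumptions made for this example; the post does not say how the signal is actually computed.

```python
from datetime import datetime, timedelta, timezone


def in_local_business_hours(now_utc, utc_offset_hours, start_hour=9, end_hour=18):
    """Return True if `now_utc` falls inside business hours for a region.

    Hypothetical helper: the window and the weekday-only rule are assumptions.
    """
    # Convert the UTC timestamp to the region's local time using a fixed offset.
    local = now_utc.astimezone(timezone(timedelta(hours=utc_offset_hours)))
    # Monday is 0 and Sunday is 6, so weekday() < 5 means Monday to Friday.
    return local.weekday() < 5 and start_hour <= local.hour < end_hour


# 01:00 UTC on Monday, March 9, 2026 is 10:00 in Tokyo (UTC+9).
tokyo_morning = in_local_business_hours(
    datetime(2026, 3, 9, 1, 0, tzinfo=timezone.utc), utc_offset_hours=9
)
```

A fixed offset ignores daylight saving time, which is fine for Tokyo but would need a proper time zone database for regions that observe it.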

Implementation


We store invocation metadata in a lightweight in-memory store in each region. Every 60 seconds, a regional coordinator runs a scoring function that decides which models to keep warm.
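A minimal sketch of that arrangement, assuming the store keeps raw invocation timestamps per model and the coordinator derives hourly and daily counts from them on each pass. `InvocationStore`, `coordinator_tick`, and the `deployed_at` mapping are hypothetical names for illustration, not the production design; the scoring function is passed in as a parameter.

```python
import time
from collections import defaultdict, deque


class InvocationStore:
    """In-memory record of invocation timestamps per model (hypothetical)."""

    def __init__(self, retention_s=86_400):
        self.retention_s = retention_s
        # model_id -> timestamps, assumed to be appended in chronological order
        self._events = defaultdict(deque)

    def record(self, model_id, ts=None):
        self._events[model_id].append(time.time() if ts is None else ts)

    def count_since(self, model_id, window_s, now):
        events = self._events[model_id]
        # Drop entries older than the retention window so memory stays bounded.
        while events and events[0] < now - self.retention_s:
            events.popleft()
        return sum(1 for ts in events if ts >= now - window_s)

    def model_ids(self):
        return list(self._events)


def coordinator_tick(store, deployed_at, should_warm, now):
    """Run one scoring pass and return the set of model IDs to keep warm."""
    warm = set()
    # Include deployed models that have not received any traffic yet.
    for model_id in set(store.model_ids()) | set(deployed_at):
        last_hour = store.count_since(model_id, 3600, now)
        last_day = store.count_since(model_id, 86_400, now)
        # Models with no recorded deploy time are treated as long-deployed.
        since_deploy = now - deployed_at.get(model_id, 0.0)
        if should_warm(last_hour, last_day, since_deploy):
            warm.add(model_id)
    return warm
```

In a real coordinator, something would call `coordinator_tick` every 60 seconds and then load or evict models to match the returned set; that part is omitted here.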


The scoring function is straightforward:


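    def should_warm(invocations_last_hour, invocations_last_day, seconds_since_deploy):
        """Return True if the model should be kept warm in this region."""
        # Recent traffic dominates: each invocation in the last hour adds 10 points.
        score = invocations_last_hour * 10
        # Daily traffic contributes a much weaker signal.
        score += invocations_last_day * 0.5
        # Newly deployed models get a boost for their first 2 hours (7200 s).
        if seconds_since_deploy < 7200:
            score += 50
        # Keep warm only when the score strictly exceeds the threshold.
        return score > 20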

No complex orchestration. A simple heuristic that handles 95% of cases.
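To make the threshold concrete, here is the same function run on a few inputs. The inputs are illustrative, and they assume the daily count includes invocations from the last hour, which the post does not state.

```python
def should_warm(invocations_last_hour, invocations_last_day, seconds_since_deploy):
    score = invocations_last_hour * 10
    score += invocations_last_day * 0.5
    if seconds_since_deploy < 7200:
        score += 50
    return score > 20


DAY = 86_400  # seconds; stands in for "deployed long ago"

busy = should_warm(5, 5, DAY)         # 50 + 2.5 = 52.5 -> True
quiet = should_warm(0, 1, DAY)        # 0 + 0.5 = 0.5   -> False
fresh = should_warm(0, 0, 600)        # deploy boost 50 -> True
borderline = should_warm(1, 20, DAY)  # 10 + 10 = 20, not above 20 -> False
two_recent = should_warm(2, 2, DAY)   # 20 + 1 = 21     -> True
```

Because the comparison is strict, a score of exactly 20 does not keep a model warm. Under the assumption above, two invocations in the last hour are enough to cross the threshold.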


Results


After rolling this out:

  • Median cold start: 2.8s → 340ms
  • p95 cold start: 6.2s → 1.1s
  • VRAM overhead: +15% (models kept warm consume GPU memory even when idle)

The 15% VRAM overhead was an acceptable trade-off for an 88% reduction in median cold start latency (82% at p95). We monitor this weekly and tune the threshold as traffic patterns evolve.
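The reduction percentages follow directly from the before and after figures listed above. A short check, with `pct_reduction` as a helper name chosen for this example:

```python
def pct_reduction(before_ms, after_ms):
    """Percentage drop from `before_ms` to `after_ms`, rounded to a whole number."""
    return round(100 * (before_ms - after_ms) / before_ms)


median_reduction = pct_reduction(2800, 340)  # 2.8s -> 340ms gives 88
p95_reduction = pct_reduction(6200, 1100)    # 6.2s -> 1.1s gives 82
```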


What we didn't do


We considered maintaining a persistent WebSocket connection to each region for instant warm signals. We decided against it - the complexity of managing 300+ persistent connections outweighed the benefit at current traffic levels. The polling-based approach, with each coordinator re-scoring every 60 seconds, is simpler to debug and reason about.


If your use case demands sub-100ms p99 cold starts, the answer today is to keep your model warm with a keep-alive ping every 3 minutes. Our predictive approach handles the common case; keep-alive handles the edge case.
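A keep-alive loop along those lines might look like the sketch below. `keep_alive` and its `ping` callable are placeholders: `ping` stands for whatever sends a minimal inference request to your model, and no real endpoint or client API is assumed. The `sleep` parameter exists only so the loop can be exercised without waiting.

```python
import time


def keep_alive(ping, interval_s=180, max_pings=None, sleep=time.sleep):
    """Call `ping()` every `interval_s` seconds; return the number of successful pings.

    Runs forever when `max_pings` is None.
    """
    sent = 0
    succeeded = 0
    while max_pings is None or sent < max_pings:
        try:
            ping()
            succeeded += 1
        except Exception as exc:
            # A failed ping should not stop the loop; report it and try again later.
            print(f"keep-alive ping failed: {exc}")
        sent += 1
        # Skip the final sleep once the requested number of pings has been sent.
        if max_pings is None or sent < max_pings:
            sleep(interval_s)
    return succeeded
```

The default of 180 seconds matches the 3-minute interval suggested above. Each ping is a real invocation, so it will also count toward the invocation history used for scoring.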