Explainer · AI Builder

Why Your AI App Crashes: Circuit Breakers for LLM Pipelines

Exponential backoff was fine when your database was local. Your LLM API is not. Here's the distributed systems pattern that keeps your architecture online when Anthropic goes down.


The infrastructure shift nobody planned for

Traditional applications owned their data layer. Your database sat on your network — or at worst, in your VPC. Queries resolved in milliseconds. Outages were rare, and when they happened, it was your problem to fix with your tools.

AI applications have a fundamentally different dependency graph. Your agentic workflow calls the Claude API 50 times per minute. Your RAG pipeline hits an embedding endpoint on every user query. Your background agents run continuous LLM loops for synthesis, classification, and routing.

Every one of those calls exits your infrastructure and enters someone else’s. And when that someone else goes down — or rate-limits you — the failure mode isn’t a graceful error message. It’s a cascade.

The cascade nobody designs for

A 529 Overloaded error from an LLM provider doesn’t just fail the request that triggered it. Here’s what actually happens:

  1. Background agents stall. Your agentic loops are mid-execution, waiting on API responses that aren’t coming. Each stalled agent holds a thread.
  2. Database connections back up. Those stalled threads are often holding database connections open — waiting to write the LLM response that never arrives.
  3. Thread pool exhaustion. LLM queries take seconds, not milliseconds. A dozen stalled requests can consume your entire thread pool in under a minute.
  4. Total infrastructure failure. A third-party outage just crashed your first-party systems. Your database, your queue, your web server — all starved of threads by requests waiting on an API that isn’t responding.

The irony is architectural: you built a robust application, then wired its critical path through a dependency you don’t control, without a plan for when that dependency disappears.

Why retries make it worse

Your first instinct is exponential backoff. After a failure, wait 1 second, then 2, then 4, then 8, doubling each time until the API recovers.

This works for millisecond-scale database queries. It does not work for LLM calls.

LLM queries take 2–30 seconds to resolve under normal conditions. An agent trapped in a retry loop with exponential backoff holds its thread for the full timeout on every attempt, plus the backoff delay in between: minutes per request, not milliseconds. Multiply that by every background agent, every user request, every concurrent pipeline, and you've built a Thundering Herd: your own threads starve while the retries wait, and the queued requests all slam the upstream API the moment it stabilizes.

Retries with backoff assume the downstream service will recover quickly and that your requests are cheap to hold. Neither assumption holds for LLM APIs.
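To put rough numbers on that, here's a back-of-the-envelope sketch. The 30-second timeout and five retries are illustrative assumptions, not measured values:

# Worst case for a single request stuck in a retry loop (illustrative numbers).
# Each attempt can hold the thread for the full request timeout, and the
# backoff delays stack on top of that.
REQUEST_TIMEOUT_S = 30                 # assumed per-attempt LLM timeout
BACKOFF_DELAYS_S = [1, 2, 4, 8, 16]    # exponential backoff between attempts

attempts = len(BACKOFF_DELAYS_S) + 1
thread_held_s = attempts * REQUEST_TIMEOUT_S + sum(BACKOFF_DELAYS_S)
print(f"{attempts} attempts can hold a thread for {thread_held_s} s")  # 211 s, about 3.5 minutes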

The circuit breaker pattern

You don’t need a retry loop. You need a circuit breaker — a classic distributed systems pattern that protects your application by severing the connection entirely.

The circuit breaker operates as a three-state machine:

Closed (normal operation)

Traffic flows to the LLM API. Every call is monitored. Failures are tallied against a configurable threshold (e.g., 5 failures in 60 seconds).

Open (tripped)

The failure tally crosses the threshold. The breaker trips. All subsequent requests are rejected instantly — no timeout, no waiting, no thread held. The circuit stays open for a configurable cooldown period.

This is the critical difference from retries. An open circuit breaker doesn’t queue requests or hold threads. It fails immediately. Your thread pool stays healthy. Your database connections stay available. Your first-party infrastructure survives the third-party outage.

Half-Open (probing)

After the cooldown expires, the breaker allows a single probe request through. If it succeeds, the circuit closes and normal traffic resumes. If it fails, the circuit stays open and the cooldown resets.

The probe is the recovery mechanism. Instead of flooding a recovering API with your full request volume (the Thundering Herd), you send one request. One success proves the API is back. Then — and only then — you restore full traffic.
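As a concrete reference for the three states above, here is a minimal, hand-rolled sketch of that state machine. It is illustrative only (not thread-safe, no metrics), which is why the implementation section below leans on an established library instead:

import time

class SimpleCircuitBreaker:
    """Minimal three-state circuit breaker: CLOSED -> OPEN -> HALF_OPEN."""

    def __init__(self, fail_max=5, reset_timeout=30):
        self.fail_max = fail_max            # failures allowed before tripping
        self.reset_timeout = reset_timeout  # cooldown before the probe
        self.failures = 0
        self.state = "CLOSED"
        self.opened_at = 0.0

    def call(self, fn, *args, **kwargs):
        if self.state == "OPEN":
            if time.monotonic() - self.opened_at < self.reset_timeout:
                raise RuntimeError("circuit open: failing fast")  # no thread held
            self.state = "HALF_OPEN"        # cooldown expired: allow one probe

        try:
            result = fn(*args, **kwargs)
        except Exception:
            self.failures += 1
            if self.state == "HALF_OPEN" or self.failures >= self.fail_max:
                self.state = "OPEN"         # trip, or re-trip after a failed probe
                self.opened_at = time.monotonic()
            raise

        self.failures = 0
        self.state = "CLOSED"               # probe succeeded: resume normal traffic
        return result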

The AI-specific pattern: fallback routing

In traditional circuit breaker implementations, an open circuit means the feature is unavailable. The user sees an error page. The background job retries later.

AI architecture gives you a better option: fallback routing.

When the primary circuit trips open, instead of rejecting the request outright, the breaker routes traffic to a fallback:

  1. Primary LLM goes down → circuit trips to OPEN
  2. Breaker routes to a cheaper proxy API — a different provider, a smaller model, a cached response
  3. If the proxy also fails → invoke a local model running on an edge node

The user gets degraded capability — maybe slower responses, less nuanced reasoning, or a simpler model. But the application stays online. The architecture absorbs the outage instead of propagating it.

This is the pattern that separates production AI systems from demos. A demo calls one API and crashes if it’s down. A production system has a fallback chain and degrades gracefully.

Implementation

The pattern is straightforward. Both Python and Java have mature circuit breaker libraries:

Python — PyBreaker:

import anthropic
import pybreaker

anthropic_client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

llm_breaker = pybreaker.CircuitBreaker(
    fail_max=5,           # trip after 5 failures
    reset_timeout=30,     # cooldown: 30 seconds
)

@llm_breaker
def call_primary_llm(prompt: str) -> str:
    response = anthropic_client.messages.create(
        model="claude-opus-4-7",
        max_tokens=1024,  # required by the Messages API
        messages=[{"role": "user", "content": prompt}]
    )
    return response.content[0].text

def call_llm_with_fallback(prompt: str) -> str:
    try:
        return call_primary_llm(prompt)
    except pybreaker.CircuitBreakerError:
        # Primary circuit is open — route to fallback (defined elsewhere)
        return call_fallback_llm(prompt)
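The same shape extends to the full chain from the fallback-routing section. A sketch under assumptions: call_proxy_llm and call_local_llm are hypothetical stand-ins for whichever proxy API and edge-hosted model you actually run, and the proxy gets its own breaker so its failures are tracked separately:

proxy_breaker = pybreaker.CircuitBreaker(fail_max=3, reset_timeout=60)

@proxy_breaker
def call_proxy_llm(prompt: str) -> str:
    ...  # hypothetical: a cheaper provider, a smaller model, or a cached response

def call_llm_with_fallback_chain(prompt: str) -> str:
    try:
        return call_primary_llm(prompt)
    except pybreaker.CircuitBreakerError:
        pass  # primary circuit is open
    try:
        return call_proxy_llm(prompt)
    except pybreaker.CircuitBreakerError:
        pass  # proxy circuit is open too
    return call_local_llm(prompt)  # hypothetical: last-resort local model on an edge node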

Java — Resilience4J:

import io.github.resilience4j.circuitbreaker.CallNotPermittedException;
import io.github.resilience4j.circuitbreaker.CircuitBreaker;
import io.vavr.control.Try;
import java.util.function.Supplier;

CircuitBreaker breaker = CircuitBreaker.ofDefaults("llm-api");

Supplier<String> decoratedCall = CircuitBreaker
    .decorateSupplier(breaker, () -> callPrimaryLLM(prompt));

// CallNotPermittedException is thrown when the circuit is open: recover to the fallback
String response = Try.ofSupplier(decoratedCall)
    .recover(CallNotPermittedException.class,
             e -> callFallbackLLM(prompt))
    .get();

Thirty lines of code. That’s the difference between a total infrastructure failure and a graceful degradation.

Configuration for LLM workloads

Standard circuit breaker defaults assume fast, cheap calls. LLM calls are neither. Tune accordingly; a configuration sketch follows the list:

  • fail_max: 3–5 failures. LLM APIs rarely throw isolated errors — if you’re getting failures, the provider is likely in a degraded state.
  • reset_timeout: 30–60 seconds. LLM provider outages tend to last minutes, not milliseconds. A 5-second cooldown will just generate more failed probes.
  • Timeout per request: Set an explicit timeout on your HTTP client (15–30 seconds for generation, 5 seconds for embeddings). Don’t rely on the default, which may be infinite.
  • One breaker per provider: If you use Claude for generation and OpenAI for embeddings, each gets its own circuit. A Claude outage shouldn’t trip your embedding pipeline.
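A minimal sketch of what those settings might look like alongside the PyBreaker setup above. The timeout values are illustrative, the client-level timeout assumes the Anthropic Python SDK's timeout parameter, and the embedding client is a hypothetical placeholder:

# One breaker per provider: a Claude outage shouldn't trip the embedding pipeline.
claude_breaker = pybreaker.CircuitBreaker(fail_max=5, reset_timeout=60)
embedding_breaker = pybreaker.CircuitBreaker(fail_max=5, reset_timeout=30)

# Explicit per-request timeouts (illustrative values), so a hung call can't
# hold a thread indefinitely.
anthropic_client = anthropic.Anthropic(timeout=30.0)     # generation
# embedding_client = SomeEmbeddingClient(timeout=5.0)    # hypothetical embeddings client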

Stop building glass cannons

The exponential backoff era assumed your dependencies were fast, cheap, and under your control. LLM APIs are slow, expensive, and operated by someone else. The failure mode of a retry loop against a slow dependency is a Thundering Herd that crashes your own infrastructure.

Wrap every third-party AI endpoint in a circuit breaker. Define your fallback chain. Accept degraded capability over total failure. It’s 30 lines of code to prevent a catastrophe.
