This is what status.anthropic.com looked like at 9pm CET today:
Two services in partial outage. Claude Code — the developer tool a meaningful fraction of engineers now ship from — degraded. Claude API itself wobbling around 98.6% uptime over the past month.
If you've built anything on top of Anthropic, today was a slow day. If you've built an AI test platform on top of Anthropic, today your customers' CI pipelines stalled, their PR reviews didn't post, their flaky tests didn't get diagnosed, and their new test generations queued forever.
That's the bet most AI testing tools made. We didn't.
"Great example why AI needs to be treated like infrastructure. In practice, we see: orchestration, redundancy, and governance matter more than the model itself."
— Marc Beierschoder, on LinkedIn the morning after the outage
The hidden assumption in "AI-powered" anything
Pick any AI testing platform launched in 2024 or 2025. Read their docs. They'll tell you "powered by GPT-4" or "built on Claude" with the kind of pride normally reserved for a cool framework choice. What they don't tell you is the second sentence: your platform's uptime is now their model's uptime. Inherited, not negotiated.
For SaaS that wraps an LLM as a feature (chat, summarization, autocomplete), this is annoying but survivable — users wait, retry, move on. For an AI test platform sitting in your CI gate, it's worse: the gate either blocks merges (bad) or fails open (worse). Neither is what you signed up for when you put the platform in front of your release process.
The nature of LLM outages makes this acute. They aren't clean — one provider goes degraded for hours while another runs fine, then they swap. There's no failover to a backup data center because the data center's not the bottleneck; the model is. If your platform calls one model, you wait.
What we actually do
QualityMax routes work across Claude, GPT, and Gemini per task. Not "we have a fallback if our primary fails." Not "switch the global model in a config." Per-task routing, baked into the codebase. The decision is in services/ai_crawl/generation/model_router.py and looks roughly like this:
| Task | Primary | Fallback | Why |
|---|---|---|---|
| Crawl decisions | Claude Haiku | GPT-4o-mini | Speed and cost — thousands of calls per crawl. |
| Test generation | Claude Sonnet / Opus | GPT-4 / Gemini 2.5 | Deep code reasoning, framework-specific idioms. |
| Self-healing | Claude Sonnet | GPT-4o | Diff understanding, selector inference. |
| AI review | Claude Opus | GPT-4 / Gemini Pro | Long-context analysis of diffs and tests. |
| Adversarial eval | Configurable (BYOLLM) | — | Customer chooses which model judges their AI features. |
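In code, that routing table is almost boringly small. A minimal sketch of the shape, assuming illustrative task names and model identifiers rather than our exact production values:

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Route:
    primary: str                 # "provider:model" identifier
    fallbacks: tuple[str, ...]   # tried in order if the primary fails

# Illustrative routing table; real task names and model IDs will differ.
ROUTES: dict[str, Route] = {
    "crawl_decision":  Route("anthropic:claude-haiku",  ("openai:gpt-4o-mini",)),
    "test_generation": Route("anthropic:claude-sonnet", ("openai:gpt-4", "google:gemini-2.5")),
    "self_healing":    Route("anthropic:claude-sonnet", ("openai:gpt-4o",)),
    "ai_review":       Route("anthropic:claude-opus",   ("openai:gpt-4", "google:gemini-pro")),
}

def candidates(task: str) -> list[str]:
    """Return the models to try for a task, primary first."""
    route = ROUTES[task]
    return [route.primary, *route.fallbacks]
```

The dict is the easy part. The payoff is that every call site asks for a task, never a model, so provider trouble is absorbed in one file instead of everywhere.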
When the Claude API went degraded today, our test generation kept running on GPT-4. When the OpenAI API blipped two weeks ago, the same workflow flipped to Claude. Customers didn't notice either incident. That's the point.
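The failover itself is just disciplined exception handling around that table. A sketch under the same assumptions, where call_model is a hypothetical stand-in for a provider client wrapper, not a real SDK call:

```python
import logging

logger = logging.getLogger(__name__)

class AllProvidersDown(Exception):
    """Raised only when every candidate model has failed."""

def call_model(model_id: str, prompt: str, timeout: float = 60.0) -> str:
    """Hypothetical stand-in for the provider client wrapper
    (the Anthropic/OpenAI/Google SDK call plus auth and retries)."""
    raise NotImplementedError

def run_task(task: str, prompt: str) -> tuple[str, str]:
    """Try each candidate model in order; return (model_id, output).

    Returning the model_id that actually answered is what lets the
    generation record say which provider produced the output.
    """
    last_error: Exception | None = None
    for model_id in candidates(task):  # candidates() from the routing sketch above
        try:
            return model_id, call_model(model_id, prompt)
        except Exception as exc:  # timeouts, 5xx, rate limits, degraded mode
            logger.warning("model %s failed for task %s (%s); trying next",
                           model_id, task, exc)
            last_error = exc
    raise AllProvidersDown(f"every model failed for task {task!r}") from last_error
```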
Honest about the tradeoffs
Multi-model routing isn't free. Close to free, but not quite. Three real costs we eat:
1. Behavior differences. The same prompt produces different output across providers. Claude tends to over-explain; GPT-4 tends to over-engineer; Gemini tends to be terser but less consistent on edge cases. We normalize via post-processing (strip preambles, validate schema, repair JSON; see the sketch after this list), but it's real engineering work that a single-vendor platform skips.
2. Eval complexity. When a generated test fails in production, "which model wrote it?" matters for diagnosis. We tag every generation with provider + model + version in generation_metadata on the test record. Required for the dogfooding loop — you can't improve a prompt if you don't know which prompt-and-model produced the bad output.
3. Cost ceiling. Routing to "the best model for this task" can route you to the expensive one. We log every AI call to an ai_cost_log table (schema sketched below) and show per-day, per-user, per-model spend in our admin dashboard. Without that visibility, multi-model routing becomes a budget bonfire.
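The normalization from item 1 is mostly mechanical. A minimal sketch, assuming JSON-shaped model outputs; real schema validation (pydantic or similar) is elided:

```python
import json
import re

# Matches markdown code-fence lines (three backticks, optional language tag).
FENCE_LINE = re.compile(r"^`{3}[a-zA-Z]*[ \t]*$", re.MULTILINE)

def normalize_json_output(raw: str) -> dict:
    """Reduce provider-specific quirks to one canonical shape.

    Handles the failure modes we see across providers: markdown code
    fences around the JSON, chatty preambles before it, and trailing
    commentary after it.
    """
    text = FENCE_LINE.sub("", raw)
    # Keep only the outermost JSON object, dropping preamble/epilogue.
    start, end = text.find("{"), text.rfind("}")
    if start == -1 or end == -1:
        raise ValueError("no JSON object found in model output")
    return json.loads(text[start : end + 1])
```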
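And the spend visibility from item 3 needs little more than a table and an insert helper. A sketch using SQLite for brevity; the column set is an assumption about what per-day, per-user, per-model reporting needs, not our actual schema:

```python
import sqlite3
import time

DDL = """
CREATE TABLE IF NOT EXISTS ai_cost_log (
    ts         REAL    NOT NULL,  -- unix timestamp of the call
    user_id    TEXT    NOT NULL,
    task       TEXT    NOT NULL,  -- crawl_decision, test_generation, ...
    provider   TEXT    NOT NULL,  -- anthropic, openai, google
    model      TEXT    NOT NULL,
    input_tok  INTEGER NOT NULL,
    output_tok INTEGER NOT NULL,
    usd_cost   REAL    NOT NULL
);
"""

def ensure_schema(db: sqlite3.Connection) -> None:
    db.executescript(DDL)

def log_call(db: sqlite3.Connection, user_id: str, task: str, provider: str,
             model: str, input_tok: int, output_tok: int, usd_cost: float) -> None:
    """Record one AI call; the dashboard aggregates by day, user, and model."""
    db.execute(
        "INSERT INTO ai_cost_log VALUES (?, ?, ?, ?, ?, ?, ?, ?)",
        (time.time(), user_id, task, provider, model, input_tok, output_tok, usd_cost),
    )
    db.commit()
```

Aggregating that table by day, user, and model is the whole dashboard query; the point is that the raw rows exist at all.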
Worth it? Today's outage made the math obvious. The cost of an AI test platform being down for four hours during a release cycle (merges blocked, or merges silently untested) is much higher than the prompt engineering tax.
The pitch in one line
If your AI test platform's status page mirrors a single LLM provider's, you don't have an AI test platform — you have a thin client.
What this means for you
If you're shopping AI test tools, the question isn't "which model do you use?" It's "which models, and what happens when one is down?" Most vendors will hedge or change the subject. The honest answer is a list of providers with fallback ordering and a recent outage anecdote where the fallback worked.
If you're already on QualityMax, today was a non-event by design. If you're on a single-vendor platform and today wasn't a non-event, you have your answer about the architecture you're paying for.
And if you're an engineer building anything else on top of an LLM — not just testing — copy this pattern. Per-task model routing is the cheapest insurance you'll ever buy.
Try QualityMax
AI test generation, self-healing, and PR-level evidence — routed across Claude, GPT, and Gemini so your CI doesn't go down when one provider does.
Get Started Free