Building an AI Incident Responder That Actually Works

It was 3:47 AM on a Tuesday when our Slack lit up: "API latency spike — p99 over 8 seconds." The on-call engineer opened their laptop, rubbed their eyes, and started the familiar ritual: check Grafana, grep the logs, trace the request, check the deployment history, try to correlate events across 14 microservices.

By the time they found the root cause — a bad connection pool config deployed 6 hours earlier — 2.5 hours had passed. The incident was resolved, but the damage was done: an SLA breach, unhappy customers, and an exhausted engineer.

That incident was the last straw. We decided to build something better.

The Problem: Incident Response Is Mostly Data Gathering

Here's a dirty secret about incident response: 80% of the time is spent gathering context, not fixing things. The actual fix is usually simple — rollback a deploy, scale up a service, restart a pod. But finding what to fix requires correlating data across dozens of sources.

We mapped out what our engineers actually do during an incident:

  1. Check alerting dashboards (Grafana/Datadog)
  2. Read the last 15 minutes of logs from affected services
  3. Check recent deployments (kubectl rollout history)
  4. Look at resource utilization (CPU, memory, connections)
  5. Trace a sample failing request across services
  6. Check if similar incidents happened before
  7. Formulate a hypothesis and verify

Steps 1-6 are pure data gathering. An AI agent can do all of that faster and more consistently than a sleep-deprived human.

Architecture: The Three-Phase Agent

We built our incident responder as a Cloudflare Worker that connects to Claude via the Anthropic API. The agent works in three phases:

Phase 1: Context Collection

When an alert fires, a webhook triggers our agent. It immediately starts collecting context in parallel:

// Triggered by the alert webhook: gather every context source in parallel.
async function collectContext(alert) {
  const [logs, deploys, metrics, traces] = await Promise.all([
    fetchRecentLogs(alert.service, '15m'),      // last 15 minutes of logs
    fetchRecentDeploys(alert.namespace, '24h'), // deploy history for the namespace
    fetchMetricsSnapshot(alert.service),        // CPU, memory, connections
    fetchFailingTraces(alert.service, 5),       // 5 sample failing requests
  ]);

  return { alert, logs, deploys, metrics, traces };
}
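Upstream of collectContext sits the webhook handler itself. A minimal sketch of the payload normalization, assuming an illustrative alert schema (the field names here are ours for the example, not a standard):

```javascript
// Normalize the incoming alert webhook payload before collection starts.
// Field names are illustrative -- adapt to your alerting tool's schema.
function parseAlert(payload) {
  if (!payload || !payload.service) {
    throw new Error('alert payload missing required "service" field');
  }
  return {
    service: payload.service,
    namespace: payload.namespace || 'production',
    severity: payload.severity || 'unknown',
    firedAt: payload.firedAt || new Date().toISOString(),
  };
}

// Inside the Worker, this is wired into the fetch handler:
//
//   export default {
//     async fetch(request, env) {
//       const alert = parseAlert(await request.json());
//       const context = await collectContext(alert);
//       ...
//     },
//   };
```

Rejecting malformed payloads up front keeps the later phases simple: everything downstream can assume a well-formed alert.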

We pull from four sources, mirroring the fetch calls above: the last 15 minutes of logs from the affected service, 24 hours of deployment history for the namespace, a current metrics snapshot (CPU, memory, connections), and a handful of failing traces.

All data is fetched in parallel. Total collection time: under 3 seconds.
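One wrinkle: a bare Promise.all rejects if any single source is down, and you still want a diagnosis when the log backend is flaky. A sketch of a more forgiving variant using Promise.allSettled (passing the fetchers in as an argument is a hypothetical refactor for testability, not our exact code):

```javascript
// Run all context fetchers in parallel, but tolerate individual failures:
// a down log backend shouldn't block the whole analysis.
async function collectContextSafe(alert, fetchers) {
  const names = Object.keys(fetchers);
  const results = await Promise.allSettled(names.map((n) => fetchers[n](alert)));
  const context = { alert };
  names.forEach((name, i) => {
    // A failed source becomes null; the prompt can note it as "unavailable".
    context[name] = results[i].status === 'fulfilled' ? results[i].value : null;
  });
  return context;
}
```

A partial context with a note that one source was unavailable is far more useful at 3 AM than no analysis at all.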

Phase 2: Analysis with Claude

We send the collected context to Claude Haiku with a carefully crafted system prompt:

const analysis = await anthropic.messages.create({
  model: 'claude-haiku-4-5-20251001',
  max_tokens: 2048,
  system: `You are an SRE incident analyst. Given alert data, logs, 
    metrics, deployment history, and traces, identify the most likely 
    root cause and suggest remediation steps.
    
    Rules:
    - Be specific: name the exact service, pod, or config that's failing
    - Correlate timing: if a deploy happened before the alert, flag it
    - Suggest the safest fix first (rollback > restart > scale > config change)
    - Rate your confidence: HIGH / MEDIUM / LOW
    - If unsure, say so — never guess on production systems`,
  messages: [{ role: 'user', content: formatContext(context) }],
});
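Much of the leverage is in formatContext. A simplified sketch of the shape we landed on; the section layout is the point, the helper itself is illustrative:

```javascript
// Turn collected context into a plain-text brief for the model.
// Clear section headers and aggressive truncation beat raw JSON dumps.
function formatContext(ctx) {
  const deployLines = (ctx.deploys || [])
    .map((d) => `- ${d.time} ${d.service} ${d.version}`)
    .join('\n') || '(none)';
  // Keep only the tail of the logs -- the model doesn't need megabytes.
  const logTail = (ctx.logs || []).slice(-50).join('\n');
  return [
    `## Alert\n${ctx.alert.service}: ${ctx.alert.severity} at ${ctx.alert.firedAt}`,
    `## Recent deploys (24h)\n${deployLines}`,
    `## Metrics snapshot\n${JSON.stringify(ctx.metrics, null, 2)}`,
    `## Log excerpt (last 50 lines)\n${logTail}`,
  ].join('\n\n');
}
```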

We chose Claude Haiku for speed — analysis completes in under 2 seconds. For complex incidents, we escalate to Sonnet.
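The escalation rule is a simple heuristic. A sketch, with an illustrative threshold (the Sonnet model alias is an assumption; pin whichever dated version you actually run):

```javascript
// Route by incident scope: Haiku for single-service incidents,
// Sonnet for anything spanning several services.
function pickModel(context) {
  const affected = context.alert.affectedServices || [context.alert.service];
  // Illustrative threshold: 3+ affected services usually means the kind of
  // cross-service reasoning worth the slower, stronger model.
  if (affected.length >= 3) {
    return 'claude-sonnet-4-5'; // assumed alias; pin a dated version in production
  }
  return 'claude-haiku-4-5-20251001';
}
```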

Phase 3: Action Recommendation

The agent posts to Slack with a structured incident brief:

🔴 INCIDENT DETECTED — api-gateway latency spike

📊 Summary:
- p99 latency jumped from 200ms to 8.2s at 03:41 UTC
- Error rate increased from 0.1% to 12.3%
- Affected: api-gateway, user-service (downstream)

🔍 Root Cause (HIGH confidence):
Deploy api-gateway v2.14.3 at 21:30 UTC changed connection 
pool max from 100 to 10 (likely typo in PR #847)

💡 Recommended Actions:
1. ROLLBACK api-gateway to v2.14.2 (safest)
   → kubectl rollout undo deployment/api-gateway -n production
2. OR hotfix: set POOL_MAX=100 in configmap
   → kubectl edit configmap api-gateway-config -n production

📋 Evidence:
- Connection pool exhaustion visible in logs (47 occurrences)
- Latency spike correlates exactly with deploy timestamp
- Pre-deploy metrics were healthy (p99: 180ms)

The on-call engineer can now act immediately instead of spending 2 hours gathering context.
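Delivery is a single Slack incoming-webhook call. A minimal sketch; the analysis field names are illustrative, and our real payload is richer:

```javascript
// Assemble the Slack brief from the model's structured analysis.
function buildSlackMessage(analysis) {
  const lines = [
    `🔴 INCIDENT DETECTED — ${analysis.title}`,
    '',
    `🔍 Root Cause (${analysis.confidence} confidence):`,
    analysis.rootCause,
    '',
    '💡 Recommended Actions:',
  ];
  analysis.actions.forEach((action, i) => lines.push(`${i + 1}. ${action}`));
  return { text: lines.join('\n') };
}

// Slack incoming webhooks accept a plain { text } JSON body.
async function postToSlack(webhookUrl, analysis) {
  await fetch(webhookUrl, {
    method: 'POST',
    headers: { 'Content-Type': 'application/json' },
    body: JSON.stringify(buildSlackMessage(analysis)),
  });
}
```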

Results: From 2.5 Hours to 8 Minutes

After 3 months in production, time from alert to identified root cause dropped from roughly 2.5 hours to about 8 minutes.

The biggest surprise? Engineers started trusting the agent within the first week. When they saw it correctly identify a cascading failure across 3 services in 4 seconds — something that would have taken them 30+ minutes — they were sold.

Lessons Learned

Start with read-only. Our agent doesn't execute any actions automatically. It only recommends. This was critical for building trust. We'll add auto-remediation for simple cases (like rollbacks) once we have 6 months of accuracy data.

Haiku is fast enough. We initially planned to use Sonnet for everything, but Haiku's speed (sub-2-second analysis) makes a real difference at 3 AM. We only escalate to Sonnet for multi-service incidents.

Context formatting matters more than prompt engineering. We spent more time formatting the data we send to Claude (clean log excerpts, pre-aggregated metrics, deployment diffs) than on the system prompt. Garbage in, garbage out.

Track everything. We log every analysis alongside the actual root cause (determined post-incident). This gives us accuracy metrics and training data for improving our prompts.
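The analysis log doubles as an accuracy dashboard. A sketch of the roll-up we run over it (record fields are illustrative):

```javascript
// Roll up logged diagnoses against post-incident root causes.
// Each record: { incidentId, confidence, predicted, actual }.
function accuracyByConfidence(records) {
  const buckets = {};
  for (const r of records) {
    const b = buckets[r.confidence] || (buckets[r.confidence] = { total: 0, correct: 0 });
    b.total += 1;
    if (r.predicted === r.actual) b.correct += 1;
  }
  return buckets; // e.g. { HIGH: { total: 40, correct: 37 }, ... }
}
```

Bucketing by the agent's own stated confidence tells you whether HIGH actually means high, which is exactly what gating auto-remediation later depends on.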

What's Next

We're working on Phase 2: auto-remediation for high-confidence diagnoses. If the agent is 95%+ confident and the fix is a rollback, why wake up a human? We're building a confirmation flow where the agent proposes, a second Claude instance validates, and only then executes.
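The gate we're prototyping is deliberately narrow. A sketch, with illustrative thresholds and field names:

```javascript
// Auto-remediation gate: only high-confidence rollbacks that a second,
// independent model call has validated are eligible for execution.
// Everything else still pages a human.
function canAutoRemediate(diagnosis) {
  return (
    diagnosis.confidence >= 0.95 &&
    diagnosis.action === 'rollback' &&
    diagnosis.validatedBySecondOpinion === true
  );
}
```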

The goal isn't to replace on-call engineers — it's to let them sleep through the incidents that don't need human creativity.


Want to build something similar? We've helped 3 teams implement AI-powered incident response. Get in touch — we'll share our playbook.