Building an AI Incident Responder That Actually Works

It was 3:47 AM on a Tuesday when our Slack lit up: "API latency spike — p99 over 8 seconds." The on-call engineer opened their laptop, rubbed their eyes, and started the familiar ritual: check Grafana, grep the logs, trace the request, check the deployment history, try to correlate events across 14 microservices.

By the time they found the root cause — a bad connection pool config deployed 6 hours earlier — 2.5 hours had passed. The incident was resolved, but the damage was done: an SLA breach, unhappy customers, and an exhausted engineer.

That incident was the last straw. We decided to build something better.

The Problem: Incident Response Is Mostly Data Gathering

Here's a dirty secret about incident response: 80% of the time is spent gathering context, not fixing things. The actual fix is usually simple — rollback a deploy, scale up a service, restart a pod. But finding what to fix requires correlating data across dozens of sources.

We mapped out what our engineers actually do during an incident:

  1. Check alerting dashboards (Grafana/Datadog)
  2. Read the last 15 minutes of logs from affected services
  3. Check recent deployments (kubectl rollout history)
  4. Look at resource utilization (CPU, memory, connections)
  5. Trace a sample failing request across services
  6. Check if similar incidents happened before
  7. Formulate a hypothesis and verify

Steps 1-6 are pure data gathering. An AI agent can do all of that faster and more consistently than a sleep-deprived human.

Architecture: The Three-Phase Agent

We built our incident responder as a Cloudflare Worker that connects to Claude via the Anthropic API. The agent works in three phases:

Phase 1: Context Collection

When an alert fires, a webhook triggers our agent. It immediately starts collecting context in parallel:

// Triggered by the alert webhook: gather every context source in parallel.
async function collectContext(alert) {
  const [logs, deploys, metrics, traces] = await Promise.all([
    fetchRecentLogs(alert.service, '15m'),      // last 15 minutes of logs
    fetchRecentDeploys(alert.namespace, '24h'), // deploy history for the namespace
    fetchMetricsSnapshot(alert.service),        // CPU, memory, connections
    fetchFailingTraces(alert.service, 5),       // 5 sample failing requests
  ]);

  return { alert, logs, deploys, metrics, traces };
}
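Upstream of collectContext sits the webhook handler itself. A minimal sketch of the payload normalization, assuming an illustrative alert schema (the field names here are ours for the example, not a standard):

```javascript
// Normalize the incoming alert webhook payload before collection starts.
// Field names are illustrative -- adapt to your alerting tool's schema.
function parseAlert(payload) {
  if (!payload || !payload.service) {
    throw new Error('alert payload missing required "service" field');
  }
  return {
    service: payload.service,
    namespace: payload.namespace || 'production',
    severity: payload.severity || 'unknown',
    firedAt: payload.firedAt || new Date().toISOString(),
  };
}

// Inside the Worker, this is wired into the fetch handler:
//
//   export default {
//     async fetch(request, env) {
//       const alert = parseAlert(await request.json());
//       const context = await collectContext(alert);
//       ...
//     },
//   };
```

Rejecting malformed payloads up front keeps the later phases simple: everything downstream can assume a well-formed alert.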

We pull from four sources, mirroring the fetch calls above: the last 15 minutes of logs from the affected service, 24 hours of deployment history for the namespace, a current metrics snapshot (CPU, memory, connections), and a handful of failing traces.

All data is fetched in parallel. Total collection time: under 3 seconds.
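One wrinkle: a bare Promise.all rejects if any single source is down, and you still want a diagnosis when the log backend is flaky. A sketch of a more forgiving variant using Promise.allSettled (passing the fetchers in as an argument is a hypothetical refactor for testability, not our exact code):

```javascript
// Run all context fetchers in parallel, but tolerate individual failures:
// a down log backend shouldn't block the whole analysis.
async function collectContextSafe(alert, fetchers) {
  const names = Object.keys(fetchers);
  const results = await Promise.allSettled(names.map((n) => fetchers[n](alert)));
  const context = { alert };
  names.forEach((name, i) => {
    // A failed source becomes null; the prompt can note it as "unavailable".
    context[name] = results[i].status === 'fulfilled' ? results[i].value : null;
  });
  return context;
}
```

A partial context with a note that one source was unavailable is far more useful at 3 AM than no analysis at all.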

Phase 2: Analysis with Claude

We send the collected context to Claude Haiku with a carefully crafted system prompt:

const analysis = await anthropic.messages.create({
  model: 'claude-haiku-4-5-20251001',
  max_tokens: 2048,
  system: `You are an SRE incident analyst. Given alert data, logs, 
    metrics, deployment history, and traces, identify the most likely 
    root cause and suggest remediation steps.
    
    Rules:
    - Be specific: name the exact service, pod, or config that's failing
    - Correlate timing: if a deploy happened before the alert, flag it
    - Suggest the safest fix first (rollback > restart > scale > config change)
    - Rate your confidence: HIGH / MEDIUM / LOW
    - If unsure, say so — never guess on production systems`,
  messages: [{ role: 'user', content: formatContext(context) }],
});
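Much of the leverage is in formatContext. A simplified sketch of the shape we landed on; the section layout is the point, the helper itself is illustrative:

```javascript
// Turn collected context into a plain-text brief for the model.
// Clear section headers and aggressive truncation beat raw JSON dumps.
function formatContext(ctx) {
  const deployLines = (ctx.deploys || [])
    .map((d) => `- ${d.time} ${d.service} ${d.version}`)
    .join('\n') || '(none)';
  // Keep only the tail of the logs -- the model doesn't need megabytes.
  const logTail = (ctx.logs || []).slice(-50).join('\n');
  return [
    `## Alert\n${ctx.alert.service}: ${ctx.alert.severity} at ${ctx.alert.firedAt}`,
    `## Recent deploys (24h)\n${deployLines}`,
    `## Metrics snapshot\n${JSON.stringify(ctx.metrics, null, 2)}`,
    `## Log excerpt (last 50 lines)\n${logTail}`,
  ].join('\n\n');
}
```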

We chose Claude Haiku for speed — analysis completes in under 2 seconds. For complex incidents, we escalate to Sonnet.
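The escalation rule is a simple heuristic. A sketch, with an illustrative threshold (the Sonnet model alias is an assumption; pin whichever dated version you actually run):

```javascript
// Route by incident scope: Haiku for single-service incidents,
// Sonnet for anything spanning several services.
function pickModel(context) {
  const affected = context.alert.affectedServices || [context.alert.service];
  // Illustrative threshold: 3+ affected services usually means the kind of
  // cross-service reasoning worth the slower, stronger model.
  if (affected.length >= 3) {
    return 'claude-sonnet-4-5'; // assumed alias; pin a dated version in production
  }
  return 'claude-haiku-4-5-20251001';
}
```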

Phase 3: Action Recommendation

The agent posts to Slack with a structured incident brief:

🔴 INCIDENT DETECTED — api-gateway latency spike

📊 Summary:
- p99 latency jumped from 200ms to 8.2s at 03:41 UTC
- Error rate increased from 0.1% to 12.3%
- Affected: api-gateway, user-service (downstream)

🔍 Root Cause (HIGH confidence):
Deploy api-gateway v2.14.3 at 21:30 UTC changed connection 
pool max from 100 to 10 (likely typo in PR #847)

💡 Recommended Actions:
1. ROLLBACK api-gateway to v2.14.2 (safest)
   → kubectl rollout undo deployment/api-gateway -n production
2. OR hotfix: set POOL_MAX=100 in configmap
   → kubectl edit configmap api-gateway-config -n production

📋 Evidence:
- Connection pool exhaustion visible in logs (47 occurrences)
- Latency spike correlates exactly with deploy timestamp
- Pre-deploy metrics were healthy (p99: 180ms)

The on-call engineer can now act immediately instead of spending 2 hours gathering context.
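Delivery is a single Slack incoming-webhook call. A minimal sketch; the analysis field names are illustrative, and our real payload is richer:

```javascript
// Assemble the Slack brief from the model's structured analysis.
function buildSlackMessage(analysis) {
  const lines = [
    `🔴 INCIDENT DETECTED — ${analysis.title}`,
    '',
    `🔍 Root Cause (${analysis.confidence} confidence):`,
    analysis.rootCause,
    '',
    '💡 Recommended Actions:',
  ];
  analysis.actions.forEach((action, i) => lines.push(`${i + 1}. ${action}`));
  return { text: lines.join('\n') };
}

// Slack incoming webhooks accept a plain { text } JSON body.
async function postToSlack(webhookUrl, analysis) {
  await fetch(webhookUrl, {
    method: 'POST',
    headers: { 'Content-Type': 'application/json' },
    body: JSON.stringify(buildSlackMessage(analysis)),
  });
}
```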

Results: From 2.5 Hours to 8 Minutes

After 3 months in production, time from alert to identified root cause dropped from roughly 2.5 hours to about 8 minutes.

The biggest surprise? Engineers started trusting the agent within the first week. When they saw it correctly identify a cascading failure across 3 services in 4 seconds — something that would have taken them 30+ minutes — they were sold.

Lessons Learned

Start with read-only. Our agent doesn't execute any actions automatically. It only recommends. This was critical for building trust. We'll add auto-remediation for simple cases (like rollbacks) once we have 6 months of accuracy data.

Haiku is fast enough. We initially planned to use Sonnet for everything, but Haiku's speed (sub-2-second analysis) makes a real difference at 3 AM. We only escalate to Sonnet for multi-service incidents.

Context formatting matters more than prompt engineering. We spent more time formatting the data we send to Claude (clean log excerpts, pre-aggregated metrics, deployment diffs) than on the system prompt. Garbage in, garbage out.

Track everything. We log every analysis alongside the actual root cause (determined post-incident). This gives us accuracy metrics and training data for improving our prompts.
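The analysis log doubles as an accuracy dashboard. A sketch of the roll-up we run over it (record fields are illustrative):

```javascript
// Roll up logged diagnoses against post-incident root causes.
// Each record: { incidentId, confidence, predicted, actual }.
function accuracyByConfidence(records) {
  const buckets = {};
  for (const r of records) {
    const b = buckets[r.confidence] || (buckets[r.confidence] = { total: 0, correct: 0 });
    b.total += 1;
    if (r.predicted === r.actual) b.correct += 1;
  }
  return buckets; // e.g. { HIGH: { total: 40, correct: 37 }, ... }
}
```

Bucketing by the agent's own stated confidence tells you whether HIGH actually means high, which is exactly what gating auto-remediation later depends on.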

What's Next

We're working on Phase 2: auto-remediation for high-confidence diagnoses. If the agent is 95%+ confident and the fix is a rollback, why wake up a human? We're building a confirmation flow where the agent proposes, a second Claude instance validates, and only then executes.
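The gate we're prototyping is deliberately narrow. A sketch, with illustrative thresholds and field names:

```javascript
// Auto-remediation gate: only high-confidence rollbacks that a second,
// independent model call has validated are eligible for execution.
// Everything else still pages a human.
function canAutoRemediate(diagnosis) {
  return (
    diagnosis.confidence >= 0.95 &&
    diagnosis.action === 'rollback' &&
    diagnosis.validatedBySecondOpinion === true
  );
}
```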

The goal isn't to replace on-call engineers — it's to let them sleep through the incidents that don't need human creativity.


Want to build something similar? We've helped 3 teams implement AI-powered incident response. Get in touch — we'll share our playbook.