The on-call rotation is one of the least glamorous parts of a software engineering career, and it may also be the first part that AI genuinely takes over.

What Autonomous Incident Response Actually Requires

Building an AI that can investigate a broken service sounds straightforward until you start listing what it actually needs: access to logs, the ability to run kubectl commands against a live cluster, awareness of existing runbooks, and some mechanism to verify that whatever fix it produces doesn’t quietly introduce three new failures while resolving one. Most teams that prototype this end up writing substantial scaffolding around a general-purpose LLM - a Claude wrapper here, a log-fetching hook there. HolmesGPT skips that construction project. It is a prebuilt LLM-backed agent designed specifically for on-call work, self-hostable, and capable of reading runbooks already stored in Notion, Confluence, or plain markdown files. It is currently the only open-source AI-SRE with that combination of properties.

The architecture behind the demo described here has three moving parts. HolmesGPT runs as a pod inside the cluster, subscribed to Alertmanager. When an alert fires, it investigates and hands a markdown report to a separate verifier pod. That verifier pod contains a small Claude wrapper - the “bridge” - which converts the markdown report and the service’s source code into an actual code patch. The patch then runs under mirrord exec, which gives the patched process the network identity, environment variables, and volume mounts of the real deploy/checkout pod. When that patched process calls the pricing service, it hits the real pricing pod, not a mock.

That last detail is the one that separates this from a test harness. The patched code runs against real downstream dependencies. If the pricing service has its own quirks that interact badly with the fix, the verifier has a chance to catch them, which a mock would not offer.
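
A minimal sketch of how a verifier might assemble the mirrord invocation for the patched process, assuming the `mirrord exec --target <target> -- <command>` form of the CLI. Only `mirrord exec` and the `deploy/checkout` target come from the demo; the helper name `mirrord_exec_command` and the script name `checkout_patched.py` are hypothetical.

```python
import shlex


def mirrord_exec_command(target: str, patched_cmd: list[str]) -> list[str]:
    """Wrap a command so it runs under mirrord against the given target.

    Hypothetical helper: it only builds the argument list and does not
    launch anything.
    """
    # Everything after "--" is the local command mirrord should run.
    return ["mirrord", "exec", "--target", target, "--", *patched_cmd]


# Assumed script name for the patched copy of the service.
cmd = mirrord_exec_command("deploy/checkout", ["python", "checkout_patched.py"])
print(shlex.join(cmd))
```

Building the argument list separately from running it keeps the sketch free of side effects; a real verifier would presumably hand the list to `subprocess.run` and then drive load at the patched process.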

The Cluster, the Bugs, and the Verdicts

The demo environment is a Python service called checkout that handles HTTP requests. On each request, checkout calls a pricing service to retrieve the item price. A loadgen pod hits checkout continuously. Prometheus scrapes checkout’s metrics, and Alertmanager fires when SLOs are violated. Everything runs in one namespace on a GKE cluster.

Two bugs were planted deliberately, each designed to produce a different verdict. The first scenario is an error-rate alert. A recent code change introduced a handling gap: any request for an item whose item_id ends in -3 raises a ValueError and returns an HTTP 500. The loadgen pod sends a mixed traffic pattern, and roughly 10% of requests hit this failure. A Prometheus rule called CheckoutErrorRateHigh fires once the error rate crosses 5%.
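
The handling gap can be pictured with a small sketch. This is not the demo's actual checkout code: the function names, the placeholder price, and the even traffic mix over `item-0` through `item-9` are assumptions chosen to reproduce the roughly 10% failure rate described above.

```python
def lookup_price(item_id: str) -> float:
    """Hypothetical stand-in for checkout's price handling."""
    if item_id.endswith("-3"):
        # The planted gap: this exception surfaces to the client as HTTP 500.
        raise ValueError(f"unsupported catalog shape for item_id={item_id}")
    return 9.99  # placeholder price, not taken from the demo


def handle_request(item_id: str) -> int:
    """Return the HTTP status code a request for item_id would produce."""
    try:
        lookup_price(item_id)
        return 200
    except ValueError:
        return 500


# Assumed traffic mix: ten item ids requested evenly, one of which ends in -3.
statuses = [handle_request(f"item-{i % 10}") for i in range(100)]
error_rate = statuses.count(500) / len(statuses)

# Mirrors the alert condition: fire once the error rate crosses 5%.
alert_would_fire = error_rate > 0.05
print(f"error_rate={error_rate:.2%} alert_would_fire={alert_would_fire}")
```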

HolmesGPT receives the alert and runs a 30-second investigation. The command that launches it against Alertmanager looks like this:

holmes investigate alertmanager \
  --alertmanager-url http://localhost:9093 \
  --alertmanager-alertname CheckoutErrorRateHigh \
  --model 'anthropic/claude-sonnet-4-20250514'

During those 30 seconds, it pulls pod descriptions, fetches logs, and walks the service configuration. Its conclusion identifies the error rate at 20.09% (against a 5% SLO), attributes the failures to ValueError: unsupported catalog shape for item_id=item-3, notes that the pattern is specific to item-3 checkout requests returning HTTP 500, and recommends fixing the application code to handle the item-3 catalog shape or adding proper validation.

That is a clean diagnosis. The exception came from the logs, the attribution is correct, and the recommendation is specific enough to act on.

The bridge step then converts this into code. Claude receives HolmesGPT’s markdown report plus the source of checkout.py and returns a minimal edit - in this case, handling the item-3 case by returning zero rather than raising. The verifier then runs two independent load tests: a baseline against the unpatched code and a patched run, each sending 100 requests under the same load with the same downstream services. The SLO condition under verification is error_rate > 5%. Because the baseline is the verifier’s own load test rather than the live error rate HolmesGPT observed, the baseline number differs from the 20.09% in the investigation report. The patched run clears the SLO without introducing regressions. Verdict: PASS.
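
The baseline-versus-patched comparison can be sketched in-process. The real verifier sends HTTP requests to a process running under mirrord; the version below only simulates that idea with two hypothetical price functions and an assumed even traffic mix, so the numbers it yields are illustrative rather than the demo's measurements.

```python
from typing import Callable


def price_unpatched(item_id: str) -> float:
    """Hypothetical pre-patch behaviour: raises on ids ending in -3."""
    if item_id.endswith("-3"):
        raise ValueError(f"unsupported catalog shape for item_id={item_id}")
    return 9.99  # placeholder price


def price_patched(item_id: str) -> float:
    """Hypothetical minimal edit: return zero instead of raising."""
    if item_id.endswith("-3"):
        return 0.0
    return 9.99  # placeholder price


def measure_error_rate(price_fn: Callable[[str], float], item_ids: list[str]) -> float:
    """Fraction of requests for which price_fn raises."""
    failures = 0
    for item_id in item_ids:
        try:
            price_fn(item_id)
        except Exception:
            failures += 1
    return failures / len(item_ids)


# Same 100-request load for both runs, as in the verifier's two load tests.
load = [f"item-{i % 10}" for i in range(100)]
baseline_rate = measure_error_rate(price_unpatched, load)
patched_rate = measure_error_rate(price_patched, load)

SLO_MAX_ERROR_RATE = 0.05  # the condition under verification: error_rate > 5%
print(f"baseline={baseline_rate:.2%} patched={patched_rate:.2%}")
print("patched run clears SLO:", patched_rate <= SLO_MAX_ERROR_RATE)
```

Feeding both runs the identical request list is what makes the two rates comparable. The sketch carries over only that property from the description above.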

The second bug was planted to produce the other verdict, testing what happens when the fix doesn’t hold. The verifier compares the patched run against both the SLO condition and a regression watchlist covering other signals. If the patched run still violates the SLO, or clears it but degrades something else, the verifier returns REJECT.
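
The decision rule described here can be sketched as a small function. The `RunMetrics` type, the `p95_latency_ms` signal, and the 10% regression tolerance are assumptions made for illustration; the source specifies only the `error_rate > 5%` condition and the existence of a regression watchlist.

```python
from dataclasses import dataclass, field


@dataclass
class RunMetrics:
    """Hypothetical summary of one load-test run."""
    error_rate: float
    # Watchlist signals, assumed to be "higher is worse" (e.g. latency).
    watchlist: dict[str, float] = field(default_factory=dict)


def verdict(
    baseline: RunMetrics,
    patched: RunMetrics,
    slo_max_error_rate: float = 0.05,
    tolerance: float = 0.10,  # assumed allowance before a signal counts as degraded
) -> str:
    """Return "PASS" or "REJECT" for a patched run."""
    # 1. The patched run must no longer violate the SLO condition.
    if patched.error_rate > slo_max_error_rate:
        return "REJECT"
    # 2. Clearing the SLO is not enough if a watched signal got worse.
    for name, base_value in baseline.watchlist.items():
        patched_value = patched.watchlist.get(name, base_value)
        if patched_value > base_value * (1 + tolerance):
            return "REJECT"
    return "PASS"


baseline = RunMetrics(error_rate=0.10, watchlist={"p95_latency_ms": 120.0})
print(verdict(baseline, RunMetrics(0.00, {"p95_latency_ms": 125.0})))
print(verdict(baseline, RunMetrics(0.08, {"p95_latency_ms": 118.0})))
print(verdict(baseline, RunMetrics(0.00, {"p95_latency_ms": 400.0})))
```

The three calls cover a fix that holds, a fix that leaves the SLO violated, and a fix that clears the SLO while degrading a watched signal.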

What This Means for Engineers Who Still Carry Pagers

There is a version of this story that sounds like a threat: the pipeline wakes up, diagnoses the issue, writes the fix, verifies it, and deploys - and the on-call engineer finds out the next morning when they read the incident summary. That version exists, and it is worth taking seriously as a career question.

The more immediate reality is different, and it matters for how engineers should be thinking about their own skill development right now. The pipeline described here does not deploy. It produces a verdict and a patch. A human still decides whether to apply the fix. What the pipeline removes is the most grinding part of the on-call experience: the 2am log trawl, the manual reproduction attempt, the slow triangulation of root cause. HolmesGPT took 30 seconds to do what might take a tired engineer 20 minutes at 2am - and it did so without needing to know the codebase in advance, because it read the logs the same way an engineer would.

Engineers who understand how these pipelines work - how mirrord exec gives a local or verifier process a pod’s full network identity, how the bridge step translates a prose investigation into a diff, how SLO conditions translate into automated pass/fail logic - are in a genuinely different position from engineers who treat the whole stack as a black box.

The skill that remains irreplaceable is knowing what SLO to set and why. The pipeline compares against error_rate > 5% because someone decided that 5% was the threshold worth alerting on, wrote the Prometheus rule, and wired it to Alertmanager. Choosing that number, understanding its relationship to user impact, deciding what belongs on the regression watchlist - none of that is in the pipeline. The pipeline is very good at executing against a clearly specified target. Specifying the target is still a human job.

The CheckoutErrorRateHigh alert fired at 20.09%. In the verifier’s load test, the patched run brought the error rate below the 5% threshold.