At SRECon 2025 in Santa Clara on March 25th, we had a chance to listen to and talk with some great speakers about how agentic AI will start to change the way SREs troubleshoot alerts and incidents, most notably Daria Barteneva of Microsoft and Theo Klein of Google, among many others. The event ended with a great talk from Charity Majors and Fred Hebert of Honeycomb.io challenging all of us agentic AI vendors: “AIOps: Prove It! An Open Letter to Vendors Selling AI for SREs”.
Based on what we heard at SRECon and what we are seeing emerge, both in our own product development and in agentic AI for troubleshooting more generally, we want to break down the different elements we see changing over the coming weeks. In this post, we’ll talk about the three ways we see agentic AI transforming observability runbooks: the “playbooks” that engineers, DevOps engineers, SREs and on-call engineers use to perform specific tasks or respond to known issues in a consistent way.
1) From static docs to living knowledge systems.
Traditional runbooks are often static, and the feedback loop for keeping them current is limited by the time engineers have to update them by hand. Some newer agentic AI systems, like our own at Relvy, let engineers bring in existing runbooks directly, both during initial onboarding and on an ongoing basis. The agentic AI system then uses these runbooks to begin diagnosing the associated alerts and incidents. More importantly, the AI can suggest updates to the runbooks themselves for engineers to review. Perhaps the existing runbook wasn't helpful, and the AI found something interesting by exploring other monitoring data. The result can be a living system that changes through quick feedback cycles on how well the agentic AI determines the root cause. At Relvy, we use this iterative process in our onboarding.
2) Speed up collaboration between the written runbook and action toward root cause analysis
Root cause analysis has traditionally required specific experience with a company’s infrastructure, time to review the available runbooks, and then the work of applying them across logs, metrics, events, traces and code. Today, an agentic AI system can use existing runbooks as a prompt, form hypotheses for the root cause, and quickly work out the fastest path to the root cause of the alert or issue, while keeping the queries it issues and the hypotheses it forms transparent. The initial root cause analysis is automated, yet engineers can still see clearly the assumptions made and the queries issued, and can even take immediate corrective action by talking with the AI agent about the alert itself. In this way the AI agent can rapidly improve the speed with which runbooks are applied, minimizing time to resolution both now and in the future.
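To make the transparency point concrete, here is a minimal sketch of what a reviewable investigation record could look like. It is not Relvy's implementation; the names `Hypothesis`, `InvestigationTrace`, `record` and `summary`, and the example alert and queries, are all hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Hypothesis:
    """One candidate root cause, with the evidence trail behind it (hypothetical)."""
    statement: str                    # what the agent suspects
    assumptions: list[str]            # what it took for granted
    queries: list[str]                # queries it issued to test the idea
    supported: Optional[bool] = None  # None until the evidence has been reviewed


@dataclass
class InvestigationTrace:
    """Everything an engineer needs to audit the agent's reasoning for one alert."""
    alert: str
    runbook_steps: list[str]
    hypotheses: list[Hypothesis] = field(default_factory=list)

    def record(self, hypothesis: Hypothesis) -> None:
        self.hypotheses.append(hypothesis)

    def summary(self) -> str:
        """Render the trace as plain text an on-call engineer can skim."""
        labels = {True: "supported", False: "ruled out", None: "open"}
        lines = [f"Alert: {self.alert}"]
        for number, h in enumerate(self.hypotheses, start=1):
            lines.append(f"{number}. {h.statement} [{labels[h.supported]}]")
            lines.extend(f"   assumed: {a}" for a in h.assumptions)
            lines.extend(f"   query: {q}" for q in h.queries)
        return "\n".join(lines)


# Illustrative data only; the alert and queries are made up.
trace = InvestigationTrace(
    alert="checkout latency p99 above threshold",
    runbook_steps=["Check recent deploys", "Check database connection pool"],
)
trace.record(Hypothesis(
    statement="Connection pool exhausted after latest deploy",
    assumptions=["Deploy timestamps in the change log are accurate"],
    queries=["service:checkout error:'pool timeout' last 30m"],
    supported=True,
))
print(trace.summary())
```

The design choice worth noting is that assumptions and queries are stored alongside each hypothesis rather than in a separate log, so an engineer reviewing the conclusion sees the reasoning in the same place.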
3) Collaborative Runbook creation between engineers and AI
This follows from the two changes above. As agentic systems collaborate with engineers to find the root cause of various alerts and incidents, we see vendors offering the ability to generate runbooks from the data gathered across many of those investigations. This means that one day we could see runbooks initially drafted by agentic AI based on incident observations and successful remediations. Engineers can then transparently review, refine and annotate the drafts - adding nuance, tribal knowledge or guardrails. Once again, the entire observability process improves with each iteration. We take a similar collaborative approach at Relvy when onboarding new users.
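As a rough sketch of that draft-then-review loop, the example below groups resolved incidents by root cause and remediation, turns the most frequent pairs into draft steps, and leaves approval to an engineer. This is an assumption about how such a workflow might be modeled, not a description of any vendor's product; `Incident`, `RunbookDraft`, `draft_runbook` and `review` are hypothetical names, and the incident data is invented.

```python
from collections import Counter
from dataclasses import dataclass, field


@dataclass
class Incident:
    alert: str
    root_cause: str
    remediation: str
    resolved: bool  # only successful remediations should feed a draft


@dataclass
class RunbookDraft:
    alert: str
    steps: list[str]
    annotations: list[str] = field(default_factory=list)  # engineer notes
    approved: bool = False  # drafts stay unapproved until a human signs off


def draft_runbook(alert: str, incidents: list[Incident]) -> RunbookDraft:
    """Draft steps from past resolved incidents, most frequent cause first."""
    counts = Counter(
        (i.root_cause, i.remediation)
        for i in incidents
        if i.alert == alert and i.resolved
    )
    steps = [
        f"Check for {cause}; if confirmed, {fix} (seen {n}x)"
        for (cause, fix), n in counts.most_common()
    ]
    return RunbookDraft(alert=alert, steps=steps)


def review(draft: RunbookDraft, notes: list[str], approve: bool) -> RunbookDraft:
    """Engineer review: attach tribal knowledge or guardrails, then decide."""
    draft.annotations.extend(notes)
    draft.approved = approve
    return draft


# Invented history for illustration.
history = [
    Incident("disk usage high", "unrotated logs", "rotate logs", True),
    Incident("disk usage high", "unrotated logs", "rotate logs", True),
    Incident("disk usage high", "core dumps", "clear dump directory", True),
    Incident("disk usage high", "unknown", "restart host", False),
]
draft = draft_runbook("disk usage high", history)
draft = review(draft, ["Never clear dumps before attaching them to the ticket"], approve=True)
print("\n".join(draft.steps))
```

Keeping `approved` false by default reflects the point above: the AI proposes, and the engineer's review is what turns a draft into a runbook.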
Some of what we’ve mentioned above is just coming online. But it's clear the runbook is evolving from a passive reference into more of an active agent in reliability strategy.
About Relvy
Our cost-effective, custom-tuned language models, which operate at 1/200th the cost of existing foundation models, make 24/7 agentic AI monitoring and debugging a reality. Get started instantly and see how Relvy can drastically reduce debugging time and costs, transforming your engineering processes today.