At Relvy, we are building an AI debugging assistant for on-call, so the OpenTelemetry demo application is near and dear to us. We have been using this application as a test bed to iterate on best practices for applying AI to explore observability data and help debug production issues. In this blog series, we hope to peel back the layers on the setup we’ve arrived at and the results we are seeing so far.
The OpenTelemetry demo application is a great resource for getting started with OpenTelemetry for observability. It is a microservice-based distributed system for an ecommerce application (the Astronomy Shop), with robust documentation for a variety of observability vendor backends. The application is made up of about 15 services covering ~12 languages/frameworks, all instrumented with OpenTelemetry. The demo comes with pre-configured Grafana dashboards for easy access to important metrics and logs.
Most importantly, the demo also has a library of problem scenarios that can be toggled on and off via feature flags. These problems cover a good range of topics (errors, latency issues, unreachable services, traffic spikes, etc.).
All of which leads to the question: how effective is AI at debugging these problem scenarios?
Answer: About 74% of the time. Read on to learn more.
The problem statement is quite simple: given access to observability data (metric dashboards, traces, and logs), can an AI system debug these problem scenarios? The AI should:
We’ve created benchmark datasets for this purpose using available systems such as the OpenTelemetry demo application, augmented with other synthetically generated software systems. For each system, we have a library of problem scenarios and a series of manually curated prompts from engineers trying to debug the issue. Here’s what an example sequence looks like (a sketch of how such a scenario could be represented follows the transcript):
User (Initial debugging prompt): some product pages are not loading
AI: <initial analysis response>
User: other product IDs working fine?
AI: …
User: what's the error for failing product ID?
AI: …
User: show complete trace for failed product request
AI: …
User: product recently added/modified in catalog?
AI: …
User: error rate for GetProduct requests for this product?
AI: …
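For illustration, here is a minimal sketch of how one of these scenarios could be represented. The dataclass, field names, and feature flag are placeholders for this post, not Relvy’s actual benchmark schema.

from dataclasses import dataclass, field

@dataclass
class ProblemScenario:
    """One benchmark case: a fault injected via a feature flag plus curated debugging prompts."""
    system: str                              # e.g. "opentelemetry-demo"
    feature_flag: str                        # the flag that injects the fault
    initial_prompt: str                      # alert-style description that kicks off the session
    follow_ups: list = field(default_factory=list)

scenario = ProblemScenario(
    system="opentelemetry-demo",
    feature_flag="productCatalogFailure",    # illustrative; flag names vary by demo version
    initial_prompt="some product pages are not loading",
    follow_ups=[
        "other product IDs working fine?",
        "what's the error for failing product ID?",
        "show complete trace for failed product request",
    ],
)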
We rate each AI response on the following 2-point scale:
Under the hood, we run an agentic AI system that executes a <plan, query, analyze> loop of the following sort.
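As a rough sketch (not Relvy’s actual implementation), the loop can be expressed as a planner that proposes the next query, a tool layer that runs it, and an analyst that turns the raw data into a finding. The Step fields and callables below are illustrative.

from dataclasses import dataclass
from typing import Callable, Optional

@dataclass
class Step:
    data_source: str   # "metrics", "traces", or "logs"
    query: str         # e.g. a Datadog query string
    rationale: str     # why the planner wants this data

def debug_loop(
    issue: str,
    plan: Callable[[str, list], Optional[Step]],
    run_query: Callable[[Step], str],
    analyze: Callable[[Step, str], str],
    max_steps: int = 8,
) -> list:
    """Run a <plan, query, analyze> loop until the planner stops or max_steps is reached."""
    findings = []
    for _ in range(max_steps):
        step = plan(issue, findings)               # PLAN: decide what to look at next, given findings so far
        if step is None:                           # the planner signals it has enough to explain the issue
            break
        raw_data = run_query(step)                 # QUERY: fetch metrics, traces, or logs
        findings.append(analyze(step, raw_data))   # ANALYZE: turn raw data into a finding
    return findings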
For the purposes of this blog, we’ve set up Datadog as the source of the underlying observability data.
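With Datadog as the backend, the query layer can be as thin as a call to Datadog’s public timeseries query endpoint (GET /api/v1/query). The sketch below shows the general shape; the example metric query is a placeholder rather than the exact query we run.

import os
import time
import requests

def query_datadog_metric(query: str, lookback_s: int = 3600) -> dict:
    """Fetch a timeseries from Datadog's metrics query API (GET /api/v1/query)."""
    now = int(time.time())
    resp = requests.get(
        "https://api.datadoghq.com/api/v1/query",
        headers={
            "DD-API-KEY": os.environ["DD_API_KEY"],
            "DD-APPLICATION-KEY": os.environ["DD_APP_KEY"],
        },
        params={"from": now - lookback_s, "to": now, "query": query},
        timeout=30,
    )
    resp.raise_for_status()
    return resp.json()

# Placeholder query: average request duration for the recommendation service over the last hour.
# The actual metric name depends on how the demo's telemetry is exported to Datadog.
series = query_datadog_metric("avg:trace.grpc.server.duration{service:recommendation}")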
This is one of the more involved problem scenarios, and the demo docs include a nice walkthrough of how to diagnose it using all the available telemetry data.
Here’s a sneak peek at the results before we take a deeper look:
We trigger Relvy with a simple description, something that could be alert text or even shorter, like:
recommendations are slow to load
Our AI is a system of agents that work together to debug these issues. Each agent has a published list of capabilities. For example:
A planner agent determines what steps to take next, dispatches instructions to individual agents, and finally compiles the analysis into a response for the on-call engineer.
Coming back to our problem with the recommendation endpoint being slow: the current generation of planning modules is able to come up with reasonable steps, similar to what an on-call engineer would take. Here are the steps our model (a custom-trained 14B-parameter model) produced:
{
  "steps": [
    {
      "title": "Analyze Recommendation Latency Metrics",
      "instruction": "Using the oteldemo - RED metrics dashboard, check the 'Avg Duration by Service' widget filtered for the recommendation service to see if there’s an unexpected increase in latency."
    },
    {
      "title": "Inspect Recommendation Service Traces",
      "instruction": "Fetch exemplar traces for the recommendation service and identify any delays within trace spans, specifically checking calls to flagd and product-catalog."
    },
    {
      "title": "Review Recommendation Service Logs",
      "instruction": "Review log entries for the recommendation service, filtering for errors or warnings that might indicate issues with dependency calls or timeouts."
    },
    {
      "title": "Check Recommendation Runtime Metrics",
      "instruction": "On the Runtime Metrics Dashboard, check the 'Python CPU' and 'Python Memory' widgets for the recommendation service to see if resource usage is abnormal."
    }
  ]
}
You will notice that the model has good awareness of the system, the potentially relevant microservices, and the available monitoring dashboards. This is possible because Relvy periodically runs exploratory queries to update its understanding of the available observability data and the queries that can be run, in addition to learning from user instructions. We will go into more detail on this in part 2 of this blog.
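As a simplified illustration of what such exploratory queries could look like against Datadog, the sketch below enumerates the available dashboards so a planner can reference them by title. The indexing logic here is illustrative, not Relvy’s actual implementation.

import os
import requests

def list_dashboards() -> list:
    """Enumerate available dashboards via Datadog's GET /api/v1/dashboard endpoint."""
    resp = requests.get(
        "https://api.datadoghq.com/api/v1/dashboard",
        headers={
            "DD-API-KEY": os.environ["DD_API_KEY"],
            "DD-APPLICATION-KEY": os.environ["DD_APP_KEY"],
        },
        timeout=30,
    )
    resp.raise_for_status()
    return resp.json().get("dashboards", [])

# Build a small index the planner can reference by title, e.g. "oteldemo - RED metrics"
dashboard_index = {d["title"]: d["id"] for d in list_dashboards()}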
Our dashboard analysis AI specializes in looking at metric data from monitoring dashboards and compiling the information that needs attention. Roughly, it is tuned to select the appropriate panels/widgets from relevant dashboards and analyze the underlying data against historical trends.
For example, one of the steps above instructs the dashboard metrics agent to check the ‘Avg Duration by Service’ widget in the oteldemo - RED metrics dashboard, and report any unexpected increase in latency.
This particular widget has ~120 time series lines, grouped by service and span name.
The AI’s job here is to figure out the subset of data that’s interesting to look at based on (a) anomalies and (b) any scoping mentioned in the instructions (such as service:recommendation in this case).
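A stripped-down version of that selection logic might look like the following, where each series is compared against its own historical baseline and filtered by the scope from the instruction. The window size and threshold are illustrative.

import statistics

def select_interesting_series(widget_series: dict, scope: str = "service:recommendation",
                              recent_points: int = 12, z_threshold: float = 3.0) -> list:
    """From a widget with many series, keep the ones that (a) look anomalous versus
    their own history and (b) match the scoping mentioned in the instruction."""
    flagged = []
    for name, points in widget_series.items():
        if scope not in name:                         # (b) scoping, e.g. only recommendation-service lines
            continue
        history, recent = points[:-recent_points], points[-recent_points:]
        if len(history) < 2 or not recent:
            continue
        baseline = statistics.mean(history)
        spread = statistics.pstdev(history) or 1e-9   # avoid division by zero on flat series
        z = (statistics.mean(recent) - baseline) / spread
        if abs(z) >= z_threshold:                     # (a) anomalous relative to its own baseline
            flagged.append(name)
    return flagged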
Here’s the type of analysis we are able to get with our setup.
Similarly, our trace analysis agent is able to process instructions such as “Fetch exemplar traces for the recommendation service and identify any delays within trace spans.”
The AI starts with a broad query (service:recommendation), looks at the resulting spans and aggregate statistics, and often chooses to drill down into specific slices (service:recommendation @app.cache_hit:false). It can also look at a few exemplar traces, to ultimately provide a summary like the one below.
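To make the drill-down step concrete, here is a simplified sketch of the kind of slice comparison involved, assuming the spans have already been fetched into flat dictionaries with a duration_ms field (the field names are illustrative).

from statistics import quantiles

def compare_slices(spans: list, key: str = "@app.cache_hit") -> dict:
    """Group already-fetched spans by an attribute (e.g. cache hit vs. miss) and
    compare latency between the resulting slices."""
    groups = {}
    for span in spans:
        slice_name = f"{key}:{span.get(key)}"
        groups.setdefault(slice_name, []).append(span["duration_ms"])
    summary = {}
    for slice_name, durations in groups.items():
        durations.sort()
        p95 = quantiles(durations, n=20)[-1] if len(durations) >= 2 else durations[-1]
        summary[slice_name] = {
            "count": len(durations),
            "p50_ms": durations[len(durations) // 2],
            "p95_ms": p95,    # last of the 19 cut points ≈ 95th percentile
        }
    return summary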
Putting everything together, the AI comes up with an overall summary with pointers to the 2-3 most relevant pieces of data.
The AI has learned that we do indeed see high latency, that the latency is variable, and that it is chiefly linked to spans with cache_hit set to false. It has also looked at logs to confirm that the cache itself is operational. This analysis was completed in 3 minutes.
For comparison, here’s what the docs say about the actual diagnosis:
With the agents described above, we achieve ~74% efficacy in correctly finding the root cause. Please see below for the complete set of results for each of the problem scenarios in the OpenTelemetry demo application.
At Relvy, we are building a debugging assistant that works well in complex production environments. In Part 2 of this series, we will talk about how Relvy explores observability data and solicits feedback to be able to come up with the steps and queries in the above results.
Relvy is a collaborative debugging assistant. We automate the initial debugging steps that an engineer is likely to take, the result of which you see above, and then answer follow-up queries. We will go deeper into the kind of queries Relvy can handle, and the ones that don’t work (yet) in Part 3.
And finally, we hope to talk more soon about how we enhance this demo application to inject new issues on a continuous basis. If there’s interest in creating a public benchmark for an AI debugging assistant for observability, we would love to talk.
As always, please reach out to us at hello at relvy.ai to learn more and collaborate with us.