At Relvy, we automatically identify the root cause of alerts to reduce the time to predict and prevent outages. Relvy analyzes logs, metrics, traces and events to help get to the bottom of production alerts and incidents. In this blog, we offer a deeper look into Relvy’s metric analysis capabilities.
At a very high level, Relvy runs a query-analyze loop that figures out, iteratively, which metrics to look at, and then how to interpret the metric data in the context of the production issue under investigation.
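To make this concrete, here is a minimal sketch of what such a query-analyze loop could look like. The structure and the helper functions (propose_queries, run_query, interpret) are illustrative stand-ins for the steps described in this post, not Relvy’s actual implementation.

```python
from dataclasses import dataclass, field


@dataclass
class Investigation:
    symptoms: str                      # e.g. the alert description
    hypothesis: str = ""               # current root-cause hypothesis
    evidence: list = field(default_factory=list)


def propose_queries(inv: Investigation) -> list[str]:
    """Hypothetical: decide which metric queries to run next,
    given the symptoms and the hypothesis so far."""
    return []


def run_query(query: str) -> list[tuple[float, float]]:
    """Hypothetical: fetch (timestamp, value) pairs for a query."""
    return []


def interpret(inv: Investigation, query: str, series) -> str:
    """Hypothetical: ask the language model whether this series looks
    anomalous or relevant, and return an updated hypothesis."""
    return inv.hypothesis


def investigate(symptoms: str, max_iterations: int = 5) -> Investigation:
    inv = Investigation(symptoms=symptoms)
    for _ in range(max_iterations):
        queries = propose_queries(inv)
        if not queries:                # nothing left worth looking at
            break
        for query in queries:
            series = run_query(query)
            inv.hypothesis = interpret(inv, query, series)
            inv.evidence.append((query, inv.hypothesis))
    return inv
```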
There are two main inputs that Relvy depends on:
Relvy regularly explores the available monitoring dashboards to build a knowledge base of every available metric and the attributes/labels defined for each. This tells Relvy what kinds of queries can be run in response to a production issue.
Customers configure Relvy by providing a seed set of instructions to guide Relvy’s AI agent. During onboarding, Relvy solves an initial set of alerts and solicits feedback from customer teams on the steps it took, and engineers guide Relvy to more accurately mimic their own workflows. See our previous blog: Part Two: AI Agents + why onboarding is crucial.
At the end of this process, Relvy is equipped with a set of workflows that form the basis of its agentic troubleshooting for future alerts. Of course, Relvy comes up with new workflows when the current set is not sufficient, and engineers are always encouraged to review/update these workflows. This is how Relvy gets better with time.
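As a rough illustration, a troubleshooting workflow of this kind might be captured as a small structured recipe that the agent can follow and engineers can edit. The fields and values below are hypothetical, not Relvy’s internal schema.

```python
# Hypothetical representation of a troubleshooting workflow;
# field names and contents are illustrative only.
workflow = {
    "name": "checkout-latency-alert",
    "trigger": "p99 latency alert on the checkout service",
    "steps": [
        "Check request rate and error rate for the checkout service",
        "Compare p99 latency against the same hour yesterday and last week",
        "If errors are elevated, break latency down by downstream dependency",
    ],
    # Engineers review and update these steps over time, which is how
    # the workflow library improves.
    "last_reviewed_by": "on-call engineer",
}
```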
Relvy uses the discovery process described above, together with this continuously updated instruction set, to figure out which metric queries to run given the issue symptoms and the root-cause hypothesis so far.
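A naive sketch of that query-selection step follows. The catalog entries below are made up, and the simple label-overlap filter is only a placeholder: in practice the language model, guided by the workflows above, decides which metrics to query and how.

```python
# Sketch of the kind of metric catalog that dashboard discovery could produce,
# and a naive way to shortlist metrics relevant to an issue. Illustrative only.

metric_catalog = [
    {"name": "http.server.duration", "labels": {"service", "route", "status_code"}},
    {"name": "db.query.latency",     "labels": {"service", "db_instance"}},
    {"name": "container.cpu.usage",  "labels": {"service", "pod"}},
]


def candidate_metrics(issue_labels: dict[str, str]) -> list[str]:
    """Return metric names whose label keys overlap with the labels
    attached to the alert/issue under investigation."""
    keys = set(issue_labels)
    return [m["name"] for m in metric_catalog if m["labels"] & keys]


print(candidate_metrics({"service": "checkout", "pod": "checkout-7f9c"}))
# All three metrics match on "service"; the agent would narrow these down
# and fill in the actual query details.
```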
The second piece of the puzzle is analyzing the underlying time series data in the context of the production issue under investigation. For any given time series, Relvy’s language model is primed to answer questions such as: is the data anomalous relative to its recent baseline, and is any deviation consistent with the timing and symptoms of the issue?
To answer these questions, Relvy typically looks at the time series data points along with baseline statistics over the past hour, the past 4 hours, the same hour on the previous day, and the same hour in the previous week.
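Here is a minimal sketch of that baseline comparison using pandas, assuming the metric is available as a pandas Series indexed by timestamp. The windows mirror the ones named above, while the statistics chosen (mean and p95) are illustrative.

```python
import pandas as pd


def baseline_stats(series: pd.Series, now: pd.Timestamp) -> dict:
    """series: metric values indexed by timestamp (DatetimeIndex)."""
    windows = {
        "past_1h":      (now - pd.Timedelta("1h"),   now),
        "past_4h":      (now - pd.Timedelta("4h"),   now),
        "same_hour_1d": (now - pd.Timedelta("25h"),  now - pd.Timedelta("24h")),
        "same_hour_7d": (now - pd.Timedelta("169h"), now - pd.Timedelta("168h")),
    }
    stats = {}
    for name, (start, end) in windows.items():
        window = series.loc[start:end]
        stats[name] = {"mean": window.mean(), "p95": window.quantile(0.95)}
    return stats
```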
For metrics with many attributes/labels, we use clustering methods for efficient analysis and run the same analysis as above on exemplars drawn from each cluster.
Here’s an example where Relvy identifies that trends are consistent across different slices of data.
Here’s an example where Relvy looks at 532 time series lines in one chart, groups them into 11 clusters, and analyzes exemplar data from each cluster. Relvy focuses on clusters that contain data that match attributes/labels corresponding to the alert/issue under investigation.
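To make the clustering step concrete, here is a rough sketch of grouping many time series by shape and picking one exemplar per cluster for analysis. KMeans over z-normalized series is an illustrative choice of method, not necessarily what Relvy uses.

```python
import numpy as np
from sklearn.cluster import KMeans


def cluster_exemplars(series_matrix: np.ndarray, n_clusters: int = 11) -> list[int]:
    """series_matrix: shape (n_series, n_timestamps), one row per label combination.
    Returns the row index of one exemplar (closest to the centroid) per cluster."""
    # Normalize each series so clustering groups by shape rather than scale.
    mean = series_matrix.mean(axis=1, keepdims=True)
    std = series_matrix.std(axis=1, keepdims=True) + 1e-9
    normalized = (series_matrix - mean) / std

    km = KMeans(n_clusters=n_clusters, n_init=10, random_state=0).fit(normalized)
    exemplars = []
    for c in range(n_clusters):
        members = np.where(km.labels_ == c)[0]
        distances = np.linalg.norm(normalized[members] - km.cluster_centers_[c], axis=1)
        exemplars.append(int(members[distances.argmin()]))
    return exemplars
```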
Relvy’s approach to automated root cause analysis is a dynamic and evolving process that now integrates time series analysis, monitoring dashboard exploration, and customer-driven learning alongside logs, traces, events, and code. By continuously refining its understanding of metrics, anomalies, and trends, Relvy diagnoses and helps prevent outages with increasing accuracy. The ability to cluster and analyze large sets of time series further strengthens Relvy’s capacity to extract meaningful insights, even in complex monitoring environments.
We pair this with cost-effective, custom-tuned language models that operate at 1/200th the cost of existing foundation models, making 24/7 agentic AI monitoring and debugging a reality. Relvy automatically identifies the root cause of alerts to reduce the time to predict and prevent outages.
https://www.relvy.ai/get-started