What is multi-cloud incident management?

Multi-cloud incident management is the practice of detecting, investigating, and resolving incidents that span multiple cloud providers (like AWS, Azure, and GCP) simultaneously. It requires tools and processes that can work across different cloud ecosystems.

Which cloud providers does Aurora support?

Aurora supports AWS, Azure, GCP, OVH, Scaleway, and Kubernetes. It uses native authentication for each provider (AWS STS AssumeRole, Azure Service Principal, GCP OAuth) for secure access.

Can Aurora investigate incidents across multiple clouds simultaneously?

Yes. Aurora's AI agents can execute commands and queries across multiple cloud providers in parallel during a single investigation. This eliminates the sequential context-switching that slows down manual investigation.

Do I need separate agents for each cloud provider?

No. Aurora uses a single unified agent that understands all supported cloud providers. You configure cloud connectors once, and the agent dynamically selects the appropriate tools and commands based on the investigation context.

How does Aurora handle cross-cloud dependencies?

Aurora builds an infrastructure dependency graph using Memgraph that maps relationships across all connected cloud providers. When an incident occurs, the AI traverses this graph to identify blast radius and upstream/downstream dependencies regardless of which cloud hosts them.

What is a multi-cloud AI SRE, and how is it different from a Kubernetes AI agent?

A multi-cloud AI SRE is an autonomous incident-investigation agent that works across more than one cloud provider (such as AWS, Azure, GCP, OVH, and Scaleway) in addition to Kubernetes. A cluster-scoped agent like K8sGPT focuses on pods, events, and misconfigurations. HolmesGPT reaches cloud services and databases through its toolsets, but neither tool unifies an entire multi-cloud estate into a single blast-radius graph. A multi-cloud AI SRE like Aurora queries every connected provider, runs the native CLI for each, and maps cross-cloud dependencies into one graph.

Does Aurora only diagnose incidents, or can it run commands during an investigation?

Aurora runs commands during investigations, which is its main differentiator from telemetry-only tools. Its LangGraph-orchestrated agents run kubectl, aws, az, and gcloud commands inside sandboxed Kubernetes pods to gather live evidence, build a Memgraph infrastructure knowledge graph for blast radius, and generate root-cause analyses and postmortems that export to Confluence, Notion, or SharePoint. It can also suggest code fixes or open pull requests. Destructive actions are human-gated, so the agent never makes a change without explicit approval.

Multi-Cloud Incident Management: The 2026 Guide

Key Takeaway: 89% of organizations use a multi-cloud strategy, but investigating incidents across AWS, Azure, and GCP simultaneously remains a major pain point. AI-powered tools that can query multiple cloud providers in parallel eliminate the context-switching that slows manual investigation by 3-5x.

Multi-cloud adoption has become the default strategy for enterprises. According to Flexera's 2024 State of the Cloud Report, 89% of organizations have a multi-cloud strategy. Gartner predicts that by 2027, over 90% of organizations will adopt multi-cloud approaches.

The reasons are clear: avoiding vendor lock-in, leveraging best-of-breed services, meeting data residency requirements, and improving resilience. But this architectural choice creates a significant operational challenge: how do you investigate and resolve incidents that span multiple cloud providers simultaneously?

What Is a Multi-Cloud AI SRE?

Definition: A multi-cloud AI SRE is an autonomous agent that investigates and helps resolve incidents across more than one cloud provider, rather than a single platform or Kubernetes alone. Instead of waiting for an engineer to context-switch between AWS, Azure, and GCP consoles, the agent queries every connected provider, runs the native CLI for each one (aws, az, gcloud, kubectl), maps cross-cloud dependencies, and produces a single root-cause analysis that spans the whole estate.

The distinction matters because most "AI SRE" tooling is scoped narrowly. Kubernetes-native agents such as HolmesGPT and K8sGPT reason brilliantly about pods, events, and cluster misconfigurations. HolmesGPT also reaches cloud and database sources through its toolsets, but neither tool is built to unify an entire multi-cloud estate into a single blast-radius graph. Observability-vendor agents such as Datadog Bits AI SRE investigate by reasoning over the telemetry already inside their own platform. Neither category matches the multi-cloud reality of most enterprises, which routinely run more than one cloud provider at once.

A true multi-cloud AI SRE is defined by three properties:

Provider breadth. It connects to multiple clouds natively, not just Kubernetes and not just one vendor's telemetry pipeline.
Execution, not only diagnosis. It can run real read commands against live infrastructure to gather evidence, rather than only interpreting metrics that were shipped to it.
Cross-cloud blast radius. It maintains a dependency model that spans providers, so it can tell you that an AWS RDS outage is degrading a GCP-hosted service through an Azure load balancer.
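The second property, execution limited to read commands, can be illustrated with a small allowlist check. This is a minimal sketch and not Aurora's actual policy: the verb lists and the `needs_approval` helper are assumptions made for illustration, and only kubectl and aws are covered because az and gcloud place their verbs differently and would need their own rules. Anything the check does not recognize is treated as requiring human approval.

```python
import shlex

# Hypothetical allowlists (illustration only, not Aurora's real policy).
KUBECTL_READ_VERBS = {"get", "describe", "logs", "top"}
AWS_READ_PREFIXES = ("describe-", "list-", "get-")


def needs_approval(command: str) -> bool:
    """Return True unless the command is recognized as read-only.

    The check fails closed: unknown CLIs, unusual argument orders, and
    unrecognized verbs all require a human to approve the command.
    """
    argv = shlex.split(command)
    if len(argv) >= 2 and argv[0] == "kubectl":
        # kubectl puts the verb first: kubectl <verb> <resource> ...
        return argv[1] not in KUBECTL_READ_VERBS
    if len(argv) >= 3 and argv[0] == "aws":
        # aws puts the operation third: aws <service> <operation> ...
        return not argv[2].startswith(AWS_READ_PREFIXES)
    return True


for cmd in ("kubectl get pods -n prod", "kubectl delete pod web-0"):
    print(cmd, "->", "approval required" if needs_approval(cmd) else "read-only")
```

Failing closed is the important design choice here: a command such as `aws s3 ls` is read-only in practice, but because it does not match the simple pattern it is still routed to a human rather than executed automatically.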

Aurora, the open-source (Apache-2.0) AI SRE from Arvo AI, was built around exactly these three properties.

Which AI SRE Should You Pick? A Decision Tree

The right tool depends almost entirely on how many clouds you actually run and what governance model you need.

If you run more than one cloud (any mix of AWS, Azure, GCP, OVH, Scaleway alongside Kubernetes), pick a multi-cloud agent. Cluster-scoped tools see the cluster clearly but have limited or no view once an incident crosses into a managed database, a queue, or a load balancer that lives in the cloud provider's control plane. Aurora is designed for this case: it queries every connected provider and stitches the findings into one Memgraph dependency graph for blast radius.

If you are Kubernetes-only and want CNCF governance, a CNCF Sandbox project may fit your requirements better than a vendor product. HolmesGPT (accepted as a CNCF Sandbox project in 2025) is the stronger agentic investigator, starting from a Prometheus alert and tracing through logs and metrics; K8sGPT (CNCF Sandbox since December 2023) is the faster health-check that surfaces CrashLoopBackOff, ImagePullBackOff, and similar misconfigurations in plain English. Both are excellent inside the cluster boundary.

If you are already standardized on one observability vendor and never plan to leave it, that vendor's built-in agent (Datadog Bits AI SRE, Dynatrace Davis, New Relic AI) is the path of least resistance, because the agent reasons over telemetry it already holds. The tradeoff is lock-in: the agent investigates within that platform, so its view is only as wide as the data you ship into it, and the capability is a paid add-on on top of per-host or per-user pricing.

If you need self-hosted, air-gapped, or vendor-neutral operation, pick an open-source agent you can run on your own infrastructure with your own models. Aurora is self-hosted and air-gapped capable, and it is bring-your-own-LLM, so you can point it at a local runtime such as Ollama instead of sending incident data to a third-party API.

How the Multi-Cloud Options Compare

The table below positions Aurora against the two Kubernetes-focused CNCF tools and the telemetry-locked incumbents. "Telemetry-locked" means the agent investigates by reasoning over data inside its own platform rather than querying your cloud control planes directly.

Capability	Aurora (Arvo AI)	HolmesGPT	K8sGPT	Datadog Bits AI SRE	Dynatrace Davis	New Relic AI
Cloud coverage	Multi-cloud: AWS, Azure, GCP, OVH, Scaleway, plus Kubernetes	Kubernetes and cloud-native; native cloud and database toolsets, but not a unified multi-cloud estate graph	Kubernetes-focused; surfaces resource misconfigurations	Scoped to telemetry ingested into Datadog	Scoped to telemetry ingested into Dynatrace	Scoped to telemetry ingested into New Relic
Executes live commands	Yes: runs kubectl, aws, az, gcloud in sandboxed Kubernetes pods to gather evidence	Read-only diagnosis by default; optional remediation toolset	Reads cluster resources; explains and suggests fixes	Reasons over collected telemetry; does not run your cloud CLIs	Reasons over collected telemetry	Reasons over collected telemetry
Cross-cloud blast radius	Yes: Memgraph infrastructure knowledge graph spanning all connected providers	Cluster dependency view	Cluster-level analysis	Correlates within the Datadog data model	Correlates within the Dynatrace topology (Smartscape)	Correlates within the New Relic entity model
Postmortems and RCA	Generates RCA and postmortems; exports to Confluence, Notion, SharePoint	Produces root-cause analysis from telemetry	Plain-English explanations and recommended fixes	AI root-cause and drafted postmortems inside Datadog	AI root-cause via Davis	AI-assisted analysis and summaries
Code fix / PR suggestions	Suggests code fixes and can open PRs (human-gated)	Operator mode can open PRs via GitHub	Suggested remediations	Surfaces related changes	Surfaces related changes	Surfaces related changes
Deployment model	Self-hosted, air-gapped capable	Self-hosted open source	Self-hosted open source	SaaS, inside Datadog	SaaS, inside Dynatrace	SaaS, inside New Relic
Model choice (BYO-LLM)	Yes: Ollama for local inference, plus OpenAI, Anthropic, Google, Vertex, OpenRouter	LLM-agnostic (OpenAI, Anthropic, Bedrock, Gemini, Ollama, and more)	Multiple AI backends supported	Vendor-managed models	Vendor-managed models	Vendor-managed models
License / cost	Open source, Apache-2.0, free	Open source, Apache-2.0	Open source, Apache-2.0 (CNCF Sandbox)	Paid add-on on top of Datadog pricing	Paid, part of Dynatrace platform	Paid, part of New Relic platform
Destructive actions	Human-gated; read-only by default	Read-only by default	Read-only analysis	Vendor-controlled, gated	Vendor-controlled, gated	Vendor-controlled, gated

The pattern is consistent. The CNCF tools are the right answer when the cluster is your whole world. The vendor agents are the right answer when you have committed to one telemetry platform and accept the lock-in and add-on cost. Aurora is the right answer when you run more than one cloud, want the agent to actually execute investigation commands rather than only read shipped telemetry, and need to keep everything self-hosted and vendor-neutral.

Top Challenges of Multi-Cloud Incident Management

Fragmented Observability

Each cloud provider has its own monitoring and logging ecosystem:

AWS: CloudWatch, X-Ray, CloudTrail
Azure: Azure Monitor, Application Insights, Log Analytics
GCP: Cloud Monitoring, Cloud Logging, Cloud Trace
Kubernetes: Prometheus, various logging solutions

When an incident spans multiple providers, engineers must context-switch between consoles, query languages, and data formats. A single investigation might require checking CloudWatch metrics, Azure Monitor alerts, and Kubernetes pod logs, all with different interfaces.
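To make the data-format problem concrete, here is a minimal sketch that merges log records from three providers into one timeline. The input records are hand-written and simplified: the CloudWatch Logs `timestamp` (milliseconds since epoch), the Cloud Logging `timestamp` and `textPayload`, and the Log Analytics `TimeGenerated` column are real field names, but the `Message` column and the sample values are assumptions for illustration.

```python
from datetime import datetime, timezone


def from_cloudwatch(event: dict) -> dict:
    # CloudWatch Logs events carry epoch milliseconds.
    ts = datetime.fromtimestamp(event["timestamp"] / 1000, tz=timezone.utc)
    return {"time": ts, "source": "aws", "message": event["message"]}


def from_gcp(entry: dict) -> dict:
    # Cloud Logging entries carry an RFC 3339 string ending in "Z".
    ts = datetime.fromisoformat(entry["timestamp"].replace("Z", "+00:00"))
    return {"time": ts, "source": "gcp", "message": entry["textPayload"]}


def from_azure(row: dict) -> dict:
    # Log Analytics rows expose the event time as TimeGenerated.
    ts = datetime.fromisoformat(row["TimeGenerated"].replace("Z", "+00:00"))
    return {"time": ts, "source": "azure", "message": row["Message"]}


def build_timeline(aws_events, gcp_entries, azure_rows) -> list:
    """Normalize every record to one shape and sort by time."""
    records = [from_cloudwatch(e) for e in aws_events]
    records += [from_gcp(e) for e in gcp_entries]
    records += [from_azure(r) for r in azure_rows]
    return sorted(records, key=lambda r: r["time"])


timeline = build_timeline(
    aws_events=[{"timestamp": 1767261605000, "message": "connection refused"}],
    gcp_entries=[{"timestamp": "2026-01-01T10:00:07Z", "textPayload": "upstream timeout"}],
    azure_rows=[{"TimeGenerated": "2026-01-01T10:00:03Z", "Message": "probe failed"}],
)
for record in timeline:
    print(record["time"].isoformat(), record["source"], record["message"])
```

Once every record shares one shape and one time zone, the order in which the failure moved across providers becomes visible without switching consoles.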

Inconsistent Tooling

Different cloud providers use different CLI tools (aws, az, gcloud, kubectl), different authentication mechanisms (IAM roles, service principals, service accounts), and different resource naming conventions. This inconsistency slows investigation and increases error rates.
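One way to reduce that inconsistency is a small catalog that maps a single intent to the equivalent command for each CLI. The sketch below only builds the argument lists and does not execute anything. The catalog structure and the `commands_for` helper are hypothetical; the commands themselves are standard invocations of each CLI.

```python
# One intent, four CLIs. The catalog layout is an illustrative assumption.
CATALOG = {
    "list_compute": {
        "aws": ["aws", "ec2", "describe-instances", "--output", "json"],
        "azure": ["az", "vm", "list", "--output", "json"],
        "gcp": ["gcloud", "compute", "instances", "list", "--format=json"],
        "kubernetes": ["kubectl", "get", "pods", "--all-namespaces", "-o", "json"],
    },
}


def commands_for(intent: str, providers: list[str]) -> dict[str, list[str]]:
    """Return the argv for `intent` on each requested provider."""
    by_provider = CATALOG[intent]
    missing = [p for p in providers if p not in by_provider]
    if missing:
        raise KeyError(f"no {intent!r} command for: {', '.join(missing)}")
    return {p: by_provider[p] for p in providers}


for provider, argv in commands_for("list_compute", ["aws", "gcp"]).items():
    print(provider, "->", " ".join(argv))
```

Keeping commands as argument lists rather than shell strings means they can later be passed to `subprocess.run` without shell quoting issues.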

Credential Management

Investigating incidents across clouds requires access credentials for each provider. Managing AWS access keys, Azure service principals, GCP service accounts, and Kubernetes kubeconfig files securely is a significant operational burden.

Blast Radius Assessment

In multi-cloud architectures, services often depend on resources across providers. A database in AWS might serve an application running in GCP, with traffic routed through Azure. Understanding the blast radius of an incident requires a cross-cloud dependency map.

Tribal Knowledge

Different team members often specialize in different clouds. When an incident spans AWS and Azure, you might need two specialists, and they might not be on call at the same time. Critical investigation knowledge is siloed.

"In a multi-cloud incident, the bottleneck isn't the tooling, it's finding someone who understands both AWS networking and Azure load balancing at 3 AM. AI agents that understand all clouds eliminate that dependency." (Noah Casarotto-Dinning, CEO at Arvo AI)

According to the 2024 State of Cloud Strategy Survey by HashiCorp, 90% of enterprises report that multi-cloud skills gaps are a significant barrier to effective cloud operations.

Strategies for Cross-Cloud Incident Response

Unified Monitoring

Implement a monitoring layer that aggregates signals from all cloud providers. Tools like Datadog, Grafana, and New Relic can ingest metrics from multiple clouds, providing a single pane of glass.

Standardized Alerting

Route all alerts through a single platform (PagerDuty, Opsgenie) regardless of which cloud generated them. This ensures consistent severity classification and escalation.
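Consistent severity classification usually comes down to a lookup table. The sketch below maps provider-native values onto one scale. The input values are real (Azure Monitor uses Sev0 through Sev4, Cloud Monitoring alert policies use CRITICAL, ERROR, and WARNING, and CloudWatch alarms report states rather than severities), but the target scale and every mapping choice are assumptions that each team would set for itself.

```python
# (provider, native value) -> shared severity. Mapping choices are assumptions.
SEVERITY_MAP = {
    ("azure", "Sev0"): "critical",
    ("azure", "Sev1"): "high",
    ("azure", "Sev2"): "medium",
    ("azure", "Sev3"): "low",
    ("azure", "Sev4"): "low",
    ("gcp", "CRITICAL"): "critical",
    ("gcp", "ERROR"): "high",
    ("gcp", "WARNING"): "medium",
    # CloudWatch alarm states carry no severity, so "high" is an arbitrary default.
    ("aws", "ALARM"): "high",
    ("aws", "INSUFFICIENT_DATA"): "low",
}


def normalize_severity(provider: str, native: str) -> str:
    """Translate a provider-native value; unknown inputs are flagged, not guessed."""
    return SEVERITY_MAP.get((provider, native), "unclassified")


for provider, native in [("azure", "Sev0"), ("gcp", "ERROR"), ("aws", "ALARM")]:
    print(provider, native, "->", normalize_severity(provider, native))
```

Returning "unclassified" for unknown values keeps a new or misspelled severity from being silently downgraded.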

Cross-Cloud Runbooks

Develop runbooks that account for multi-cloud scenarios. Instead of "check AWS CloudWatch," document the investigation flow across all relevant providers.

Infrastructure as Code

Use Terraform or similar tools to manage infrastructure across all providers. This creates a single source of truth for your cross-cloud architecture and makes it easier to identify configuration-related issues.

Automated Investigation

The most effective strategy is automating the cross-cloud investigation itself. AI agents that can query multiple cloud providers simultaneously eliminate the need for manual context-switching.

How Aurora Solves Multi-Cloud Incidents

Aurora was built specifically for multi-cloud incident management. Here's how it addresses each challenge:

Unified Cloud Connectors

Aurora connects to all major cloud providers through native connectors:

AWS: Uses STS AssumeRole for secure, temporary credentials
Azure: Azure Service Principal authentication
GCP: OAuth-based authentication
OVH: API key authentication
Scaleway: API token authentication
Kubernetes: Kubeconfig-based access

All connectors are configured once and used by the AI agent as needed during investigations.
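As a rough illustration of what "configured once" can look like, the sketch below models a connector as a provider name plus settings and checks that the settings match the provider's authentication method. Every field name here is hypothetical and chosen only to mirror the list above; this is not Aurora's configuration schema.

```python
from dataclasses import dataclass, field

# Hypothetical required settings per provider (not Aurora's real schema).
REQUIRED = {
    "aws": {"role_arn"},                                   # STS AssumeRole
    "azure": {"tenant_id", "client_id", "client_secret"},  # service principal
    "gcp": {"oauth_credentials"},                          # OAuth
    "ovh": {"api_key"},
    "scaleway": {"api_token"},
    "kubernetes": {"kubeconfig_path"},
}


@dataclass
class Connector:
    provider: str
    settings: dict = field(default_factory=dict)

    def missing_fields(self) -> list:
        """List required settings that have not been supplied."""
        if self.provider not in REQUIRED:
            raise ValueError(f"unsupported provider: {self.provider}")
        return sorted(REQUIRED[self.provider] - set(self.settings))


azure = Connector("azure", {"tenant_id": "example-tenant", "client_id": "example-client"})
print(azure.missing_fields())
```

Validating at configuration time means a missing secret surfaces during setup, not in the middle of an investigation.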

Infrastructure Discovery Pipeline

Aurora's infrastructure discovery runs in three phases:

Bulk Discovery: Enumerates all resources across all connected cloud providers
Detail Enrichment: Gathers detailed configuration and metadata for each resource
Connection Inference: Maps dependencies between resources (e.g., which EC2 instances connect to which RDS databases)

This builds a comprehensive infrastructure graph in Memgraph that the AI agent uses for blast radius analysis.
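The three phases can be sketched in a few lines of plain Python. This is a toy model under stated assumptions, not Aurora's pipeline: the inventory and detail data are invented, and the inference rule (an environment variable whose value equals another resource's endpoint implies a dependency) is just one plausible heuristic.

```python
# Invented sample data standing in for real provider API responses.
INVENTORY = {
    "aws": ["rds/orders-db"],
    "gcp": ["run/checkout"],
}
DETAILS = {
    "rds/orders-db": {"endpoint": "orders-db.example.internal"},
    "run/checkout": {"env": {"DB_HOST": "orders-db.example.internal"}},
}


def bulk_discovery(inventory: dict) -> list:
    """Phase 1: enumerate (provider, resource id) pairs across providers."""
    return [(provider, rid) for provider, ids in inventory.items() for rid in ids]


def enrich(found: list, details: dict) -> dict:
    """Phase 2: attach configuration and metadata to each resource."""
    return {rid: {"provider": provider, **details.get(rid, {})} for provider, rid in found}


def infer_connections(resources: dict) -> list:
    """Phase 3: emit (dependent, dependency) edges from matching endpoints."""
    endpoints = {r["endpoint"]: rid for rid, r in resources.items() if "endpoint" in r}
    edges = []
    for rid, resource in resources.items():
        for value in resource.get("env", {}).values():
            target = endpoints.get(value)
            if target and target != rid:
                edges.append((rid, target))
    return edges


resources = enrich(bulk_discovery(INVENTORY), DETAILS)
edges = infer_connections(resources)
print(edges)
```

The resulting edge list is the kind of data that would then be loaded into a graph database for traversal.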

Natural Language Investigation

Instead of learning a separate CLI and query language for every provider, engineers interact with Aurora through natural language:

"What caused the latency spike on the payment service?"
"Are there any failing pods in the production cluster?"
"Show me all resources affected by the us-east-1 connectivity issue"

Aurora translates these queries into the appropriate cloud-specific commands and aggregates the results.

Simultaneous Multi-Cloud Queries

During an investigation, Aurora's agents can execute commands across multiple cloud providers in parallel. While checking AWS CloudWatch metrics, it can simultaneously query Azure Monitor and Kubernetes pod status, something a human investigator would have to do sequentially.
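The difference between sequential and parallel querying is easy to show with a thread pool. In this sketch the three query functions are stand-ins that sleep instead of calling a real API, and their names and return values are invented for illustration. It demonstrates the general fan-out pattern, not Aurora's internal scheduler.

```python
import time
from concurrent.futures import ThreadPoolExecutor


# Stand-ins for real provider calls; each simulates 0.2 s of network latency.
def query_cloudwatch() -> str:
    time.sleep(0.2)
    return "cloudwatch: metrics fetched"


def query_azure_monitor() -> str:
    time.sleep(0.2)
    return "azure monitor: alerts fetched"


def query_pod_status() -> str:
    time.sleep(0.2)
    return "kubernetes: pod status fetched"


def investigate(queries: dict) -> dict:
    """Run every query concurrently and collect the results by name."""
    with ThreadPoolExecutor(max_workers=len(queries)) as pool:
        futures = {name: pool.submit(fn) for name, fn in queries.items()}
        return {name: future.result() for name, future in futures.items()}


start = time.perf_counter()
results = investigate({
    "aws": query_cloudwatch,
    "azure": query_azure_monitor,
    "kubernetes": query_pod_status,
})
elapsed = time.perf_counter() - start
print(results)
print(f"elapsed: {elapsed:.2f}s (three sequential calls would take about 0.6s)")
```

Threads suit this workload because the time is spent waiting on network responses, so total latency approaches that of the slowest single query.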

Dependency Graph

Aurora's Memgraph-powered infrastructure graph provides cross-cloud dependency mapping. When an AWS RDS instance goes down, Aurora automatically identifies the Azure-hosted application that depends on it and the GCP-based load balancer that routes traffic to it.
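The traversal behind a blast-radius query is a breadth-first search over reversed dependency edges. The sketch below uses plain Python in place of a graph database, with invented resource names that mirror the scenario above. In Memgraph the same question would be expressed as a Cypher query, which is not shown here.

```python
from collections import defaultdict, deque

# (dependent, dependency) edges. Resource names are invented for illustration.
DEPENDS_ON = [
    ("azure:app/checkout", "aws:rds/orders-db"),
    ("gcp:lb/frontend", "azure:app/checkout"),
    ("gcp:run/reports", "aws:rds/orders-db"),
    ("aws:lambda/emailer", "aws:sqs/outbox"),
]


def blast_radius(failed: str, edges: list) -> set:
    """Return every resource that depends on `failed`, directly or transitively."""
    dependents = defaultdict(set)
    for dependent, dependency in edges:
        dependents[dependency].add(dependent)

    seen = {failed}
    queue = deque([failed])
    while queue:
        node = queue.popleft()
        for dependent in dependents[node]:
            if dependent not in seen:
                seen.add(dependent)
                queue.append(dependent)
    return seen - {failed}


print(sorted(blast_radius("aws:rds/orders-db", DEPENDS_ON)))
```

The emailer function is correctly excluded because it depends on a queue, not on the failed database, and the `seen` set keeps the search from looping if the graph contains a cycle.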

Building a Multi-Cloud Incident Playbook

Map your cross-cloud dependencies: Use Aurora's infrastructure discovery or manually document how services interact across providers.
Standardize alerting: Route all alerts to a single platform with consistent severity levels.
Deploy unified investigation: Set up Aurora with connectors to all your cloud providers.
Create cross-cloud runbooks: Document investigation procedures that span providers.
Practice: Run game days that simulate multi-cloud incidents to test your team's response.
Review and improve: Use AI-generated postmortems to identify patterns in cross-cloud incidents.

Getting Started

git clone https://github.com/arvo-ai/aurora.git && cd aurora

make init                # generates secrets, copies .env.example to .env
nano .env                # add OPENROUTER_API_KEY, OPENAI_API_KEY or ANTHROPIC_API_KEY
make prod-prebuilt       # pulls prebuilt images from GHCR and starts

Configure your cloud providers in Aurora's settings, connect your monitoring tools, and the AI agent will automatically investigate incidents across all your cloud environments.

Learn more at aurorasre.ai or read the full documentation. For a deep dive into how Aurora's AI agents investigate incidents, see "What is Agentic Incident Management?" To understand how Aurora automates root cause analysis, read our "Complete Guide to RCA for SREs". For the security architecture of running an AI agent across cloud CLIs in production, see "AI agent kubectl safety: sandboxed execution for production".

Start free: aurora-ai.net (hosted, no infrastructure to run)
GitHub: github.com/Arvo-AI/aurora