Multi-Cloud Incident Management: Challenges and Solutions
Learn the top challenges of managing incidents across AWS, Azure, GCP, and Kubernetes simultaneously, and how AI-powered tools solve cross-cloud investigation.
Key Takeaway: 89% of organizations use a multi-cloud strategy, but investigating incidents across AWS, Azure, and GCP simultaneously remains a major pain point. AI-powered tools that can query multiple cloud providers in parallel eliminate the context-switching that slows manual investigation by 3-5x.
Multi-cloud has become the default strategy for enterprises. According to Flexera's 2024 State of the Cloud Report, 89% of organizations have a multi-cloud strategy, and Gartner predicts that by 2027 over 90% of organizations will adopt multi-cloud approaches.
The reasons are clear: avoiding vendor lock-in, leveraging best-of-breed services, meeting data residency requirements, and improving resilience. But this architectural choice creates a significant operational challenge: how do you investigate and resolve incidents that span multiple cloud providers simultaneously?
What Is a Multi-Cloud AI SRE?
Definition: A multi-cloud AI SRE is an autonomous agent that investigates and helps resolve incidents across more than one cloud provider, rather than a single platform or Kubernetes alone. Instead of waiting for an engineer to context-switch between AWS, Azure, and GCP consoles, the agent queries every connected provider, runs the native CLI for each one (aws, az, gcloud, kubectl), maps cross-cloud dependencies, and produces a single root-cause analysis that spans the whole estate.
The distinction matters because most "AI SRE" tooling is scoped narrowly. Kubernetes-native agents such as HolmesGPT and K8sGPT reason brilliantly about pods, events, and cluster misconfigurations; HolmesGPT also reaches cloud and database sources through its toolsets, but neither is built to unify an entire multi-cloud estate into a single blast-radius graph. Observability-vendor agents such as Datadog Bits AI SRE investigate by reasoning over the telemetry already inside their own platform. Neither the Kubernetes-native nor the vendor-telemetry approach matches the reality of most enterprises, which run workloads on more than one cloud provider.
A true multi-cloud AI SRE is defined by three properties:
- Provider breadth. It connects to multiple clouds natively, not just Kubernetes and not just one vendor's telemetry pipeline.
- Execution, not only diagnosis. It can run real read commands against live infrastructure to gather evidence, rather than only interpreting metrics that were shipped to it.
- Cross-cloud blast radius. It maintains a dependency model that spans providers, so it can tell you that an AWS RDS outage is degrading a GCP-hosted service through an Azure load balancer.
Aurora, the open-source (Apache-2.0) AI SRE from Arvo AI, was built around exactly these three properties.
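The second property, gathering evidence with read commands while keeping everything else behind a human, can be illustrated with a deny-by-default allowlist. This is a minimal sketch and not Aurora's actual policy: the `READ_ONLY_PREFIXES` list, `is_read_only`, and `gate` are hypothetical names, and a real allowlist would cover far more commands.

```python
import shlex

# Hypothetical allowlist of read-only command prefixes. Any command that does
# not start with one of these is routed to a human for approval.
READ_ONLY_PREFIXES = [
    ("kubectl", "get"),
    ("kubectl", "describe"),
    ("kubectl", "logs"),
    ("aws", "ec2", "describe-instances"),
    ("aws", "rds", "describe-db-instances"),
    ("az", "vm", "list"),
    ("gcloud", "compute", "instances", "list"),
]


def is_read_only(command: str) -> bool:
    """Return True only if the command starts with an allowlisted prefix."""
    argv = tuple(shlex.split(command))
    return any(argv[: len(prefix)] == prefix for prefix in READ_ONLY_PREFIXES)


def gate(command: str) -> str:
    """Decide whether a command may run unattended (deny by default)."""
    return "run" if is_read_only(command) else "needs-approval"


if __name__ == "__main__":
    for cmd in ("kubectl get pods -n prod", "kubectl delete pod api-0"):
        print(cmd, "->", gate(cmd))
```

Matching whole prefixes rather than searching for a verb anywhere in the command keeps the check conservative: a command such as `gcloud compute instances delete list` is not mistaken for a read just because a resource happens to be named `list`.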
Which AI SRE Should You Pick? A Decision Tree
The right tool depends almost entirely on how many clouds you actually run and what governance model you need.
If you run more than one cloud (any mix of AWS, Azure, GCP, OVH, Scaleway alongside Kubernetes), pick a multi-cloud agent. Cluster-scoped tools see the cluster clearly but have limited or no view once an incident crosses into a managed database, a queue, or a load balancer that lives in the cloud provider's control plane. Aurora is designed for this case: it queries every connected provider and stitches the findings into one Memgraph dependency graph for blast radius.
If you are Kubernetes-only and want CNCF governance, a CNCF Sandbox project may fit your requirements better than a vendor product. HolmesGPT (accepted as a CNCF Sandbox project in 2025) is the stronger agentic investigator, starting from a Prometheus alert and tracing through logs and metrics; K8sGPT (CNCF Sandbox since December 2023) is the faster health-check that surfaces CrashLoopBackOff, ImagePullBackOff, and similar misconfigurations in plain English. Both are excellent inside the cluster boundary.
If you are already standardized on one observability vendor and never plan to leave it, that vendor's built-in agent (Datadog Bits AI SRE, Dynatrace Davis, New Relic AI) is the path of least resistance, because the agent reasons over telemetry it already holds. The tradeoff is lock-in: the agent investigates within that platform, so its view is only as wide as the data you ship into it, and the capability is a paid add-on on top of per-host or per-user pricing.
If you need self-hosted, air-gapped, or vendor-neutral operation, pick an open-source agent you can run on your own infrastructure with your own models. Aurora is self-hosted and air-gapped capable, and it is bring-your-own-LLM, so you can point it at a local runtime such as Ollama instead of sending incident data to a third-party API.
How the Multi-Cloud Options Compare
The table below positions Aurora against the two Kubernetes-only CNCF tools and the telemetry-locked incumbents. "Telemetry-locked" means the agent investigates by reasoning over data inside its own platform rather than querying your cloud control planes directly.
| Capability | Aurora (Arvo AI) | HolmesGPT | K8sGPT | Datadog Bits AI SRE | Dynatrace Davis | New Relic AI |
|---|---|---|---|---|---|---|
| Cloud coverage | Multi-cloud: AWS, Azure, GCP, OVH, Scaleway, plus Kubernetes | Kubernetes and cloud-native; native cloud and database toolsets, but not a unified multi-cloud estate graph | Kubernetes-focused; surfaces resource misconfigurations | Scoped to telemetry ingested into Datadog | Scoped to telemetry ingested into Dynatrace | Scoped to telemetry ingested into New Relic |
| Executes live commands | Yes: runs kubectl, aws, az, gcloud in sandboxed Kubernetes pods to gather evidence | Read-only diagnosis by default; optional remediation toolset | Reads cluster resources; explains and suggests fixes | Reasons over collected telemetry; does not run your cloud CLIs | Reasons over collected telemetry | Reasons over collected telemetry |
| Cross-cloud blast radius | Yes: Memgraph infrastructure knowledge graph spanning all connected providers | Cluster dependency view | Cluster-level analysis | Correlates within the Datadog data model | Correlates within the Dynatrace topology (Smartscape) | Correlates within the New Relic entity model |
| Postmortems and RCA | Generates RCA and postmortems; exports to Confluence, Notion, SharePoint | Produces root-cause analysis from telemetry | Plain-English explanations and recommended fixes | AI root-cause and drafted postmortems inside Datadog | AI root-cause via Davis | AI-assisted analysis and summaries |
| Code fix / PR suggestions | Suggests code fixes and can open PRs (human-gated) | Operator mode can open PRs via GitHub | Suggested remediations | Surfaces related changes | Surfaces related changes | Surfaces related changes |
| Deployment model | Self-hosted, air-gapped capable | Self-hosted open source | Self-hosted open source | SaaS, inside Datadog | SaaS, inside Dynatrace | SaaS, inside New Relic |
| Model choice (BYO-LLM) | Yes: Ollama for local inference, plus OpenAI, Anthropic, Google, Vertex, OpenRouter | LLM-agnostic (OpenAI, Anthropic, Bedrock, Gemini, Ollama, and more) | Multiple AI backends supported | Vendor-managed models | Vendor-managed models | Vendor-managed models |
| License / cost | Open source, Apache-2.0, free | Open source, Apache-2.0 | Open source, CNCF Sandbox | Paid add-on on top of Datadog pricing | Paid, part of Dynatrace platform | Paid, part of New Relic platform |
| Destructive actions | Human-gated; read-only by default | Read-only by default | Read-only analysis | Vendor-controlled, gated | Vendor-controlled, gated | Vendor-controlled, gated |
The pattern is consistent. The CNCF tools are the right answer when the cluster is your whole world. The vendor agents are the right answer when you have committed to one telemetry platform and accept the lock-in and add-on cost. Aurora is the right answer when you run more than one cloud, want the agent to actually execute investigation commands rather than only read shipped telemetry, and need to keep everything self-hosted and vendor-neutral.
Top Challenges of Multi-Cloud Incident Management
Fragmented Observability
Each cloud provider has its own monitoring and logging ecosystem:
- AWS: CloudWatch, X-Ray, CloudTrail
- Azure: Azure Monitor, Application Insights, Log Analytics
- GCP: Cloud Monitoring, Cloud Logging, Cloud Trace
- Kubernetes: Prometheus, various logging solutions
When an incident spans multiple providers, engineers must context-switch between consoles, query languages, and data formats. A single investigation might require checking CloudWatch metrics, Azure Monitor alerts, and Kubernetes pod logs, each through a different interface.
Inconsistent Tooling
Different cloud providers use different CLI tools (aws, az, gcloud, kubectl), different authentication mechanisms (IAM roles, service principals, service accounts), and different resource naming conventions. This inconsistency slows investigation and increases error rates.
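To make the inconsistency concrete, the sketch below maps one logical question ("what compute is running?") to the four provider-specific invocations. The CLI subcommands are the standard ones for each tool; the `LIST_COMPUTE` table and `list_compute_command` helper are hypothetical names used only for illustration, and nothing is executed.

```python
# One logical question, four different command shapes and output flags.
LIST_COMPUTE = {
    "aws": ["aws", "ec2", "describe-instances", "--output", "json"],
    "azure": ["az", "vm", "list", "--output", "json"],
    "gcp": ["gcloud", "compute", "instances", "list", "--format=json"],
    "kubernetes": ["kubectl", "get", "pods", "--all-namespaces", "-o", "json"],
}


def list_compute_command(provider: str) -> list[str]:
    """Return the argv for listing compute resources on the given provider."""
    try:
        # Copy so callers can append flags without mutating the table.
        return list(LIST_COMPUTE[provider])
    except KeyError:
        raise ValueError(f"unsupported provider: {provider}") from None


if __name__ == "__main__":
    for name in LIST_COMPUTE:
        print(f"{name:<11}", " ".join(list_compute_command(name)))
```

Even in this small table, the JSON output flag alone is spelled three different ways, which is the kind of detail that costs time during an incident.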
Credential Management
Investigating incidents across clouds requires access credentials for each provider. Managing AWS access keys, Azure service principals, GCP service accounts, and Kubernetes kubeconfig files securely is a significant operational burden.
Blast Radius Assessment
In multi-cloud architectures, services often depend on resources across providers. A database in AWS might serve an application running in GCP, with traffic routed through Azure. Understanding the blast radius of an incident requires a cross-cloud dependency map.
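As a rough illustration of what a cross-cloud dependency map enables, the sketch below walks a small graph breadth-first to list everything downstream of a failed resource. The resource identifiers and the `DEPENDENTS` structure are hypothetical; a production system would hold this in a graph database rather than a dictionary.

```python
from collections import deque

# Edges point from a resource to the resources that depend on it.
# All identifiers are made up for this example.
DEPENDENTS = {
    "aws:rds/orders-db": ["gcp:run/orders-api"],
    "gcp:run/orders-api": ["azure:lb/public-frontend"],
    "azure:lb/public-frontend": [],
    "aws:s3/report-archive": [],
}


def blast_radius(failed: str, dependents: dict[str, list[str]]) -> list[str]:
    """Return every resource transitively affected by `failed`, nearest first."""
    seen = {failed}
    affected = []
    queue = deque([failed])
    while queue:
        node = queue.popleft()
        for child in dependents.get(node, []):
            if child not in seen:  # guard against cycles and duplicates
                seen.add(child)
                affected.append(child)
                queue.append(child)
    return affected


if __name__ == "__main__":
    print(blast_radius("aws:rds/orders-db", DEPENDENTS))
```

Breadth-first order lists direct dependents before indirect ones, which matches how responders usually triage: closest impact first.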
Tribal Knowledge
Different team members often specialize in different clouds. When an incident spans AWS and Azure, you might need two specialists, and they might not be on call at the same time. Critical investigation knowledge is siloed.
"In a multi-cloud incident, the bottleneck isn't the tooling — it's finding someone who understands both AWS networking and Azure load balancing at 3 AM. AI agents that understand all clouds eliminate that dependency." — Noah Casarotto-Dinning, CEO at Arvo AI
According to the 2024 State of Cloud Strategy Survey by HashiCorp, 90% of enterprises report that multi-cloud skills gaps are a significant barrier to effective cloud operations.
Strategies for Cross-Cloud Incident Response
Unified Monitoring
Implement a monitoring layer that aggregates signals from all cloud providers. Tools like Datadog, Grafana, and New Relic can ingest metrics from multiple clouds, providing a single pane of glass.
Standardized Alerting
Route all alerts through a single platform (PagerDuty, Opsgenie) regardless of which cloud generated them. This ensures consistent severity classification and escalation.
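Consistent severity classification usually means translating each source's labels onto one shared scale before the alert reaches the paging platform. The sketch below assumes a hypothetical P1 to P4 target scale and a hypothetical alert shape (`source`, `severity`, `title`); Azure Monitor's Sev0 to Sev4 labels are its documented scale, while the Prometheus labels are only a common convention.

```python
# Source label -> shared scale. The P1-P4 targets are an assumption of this
# sketch, as is the choice of where each source label lands.
SEVERITY_MAP = {
    "azure": {"Sev0": "P1", "Sev1": "P2", "Sev2": "P3", "Sev3": "P4", "Sev4": "P4"},
    "prometheus": {"critical": "P1", "warning": "P3", "info": "P4"},
}
DEFAULT_SEVERITY = "P3"  # unknown labels land in the middle rather than being dropped


def normalize_alert(alert: dict) -> dict:
    """Return a copy of the alert with its severity mapped onto the shared scale."""
    mapping = SEVERITY_MAP.get(alert["source"], {})
    severity = mapping.get(alert["severity"], DEFAULT_SEVERITY)
    return {**alert, "severity": severity}


if __name__ == "__main__":
    print(normalize_alert({"source": "azure", "severity": "Sev0", "title": "VM down"}))
    print(normalize_alert({"source": "prometheus", "severity": "warning", "title": "High latency"}))
```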
Cross-Cloud Runbooks
Develop runbooks that account for multi-cloud scenarios. Instead of "check AWS CloudWatch," document the investigation flow across all relevant providers.
Infrastructure as Code
Use Terraform or similar tools to manage infrastructure across all providers. This creates a single source of truth for your cross-cloud architecture and makes it easier to identify configuration-related issues.
Automated Investigation
The most effective strategy is automating the cross-cloud investigation itself. AI agents that can query multiple cloud providers simultaneously eliminate the need for manual context-switching.
How Aurora Solves Multi-Cloud Incidents
Aurora was built specifically for multi-cloud incident management. Here's how it addresses each challenge:
Unified Cloud Connectors
Aurora connects to all major cloud providers through native connectors:
- AWS: Uses STS AssumeRole for secure, temporary credentials
- Azure: Azure Service Principal authentication
- GCP: OAuth-based authentication
- OVH: API key authentication
- Scaleway: API token authentication
- Kubernetes: Kubeconfig-based access
All connectors are configured once and used by the AI agent as needed during investigations.
Infrastructure Discovery Pipeline
Aurora's infrastructure discovery runs in three phases:
- Bulk Discovery: Enumerates all resources across all connected cloud providers
- Detail Enrichment: Gathers detailed configuration and metadata for each resource
- Connection Inference: Maps dependencies between resources (e.g., which EC2 instances connect to which RDS databases)
This builds a comprehensive infrastructure graph in Memgraph that the AI agent uses for blast radius analysis.
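The three phases can be sketched with in-memory data. This is an illustration of the general pattern under stated assumptions, not Aurora's implementation: the function names, the `endpoint` and `connects_to` fields, and the sample inventory are all hypothetical.

```python
def bulk_discovery(inventories: dict) -> list:
    """Phase 1: flatten per-provider inventories into one resource list."""
    return [
        {"provider": provider, "id": resource_id}
        for provider, ids in inventories.items()
        for resource_id in ids
    ]


def enrich(resources: list, details: dict) -> list:
    """Phase 2: attach configuration metadata to each discovered resource."""
    return [{**r, **details.get(r["id"], {})} for r in resources]


def infer_connections(resources: list) -> list:
    """Phase 3: add an edge when a resource's configured target matches
    another resource's endpoint."""
    by_endpoint = {r["endpoint"]: r["id"] for r in resources if "endpoint" in r}
    edges = []
    for r in resources:
        for target in r.get("connects_to", []):
            if target in by_endpoint:
                edges.append((r["id"], by_endpoint[target]))
    return edges


# Hypothetical sample data standing in for real provider API responses.
INVENTORIES = {"aws": ["ec2/web-1", "rds/orders-db"], "gcp": ["run/orders-api"]}
DETAILS = {
    "rds/orders-db": {"endpoint": "orders-db.internal:5432"},
    "ec2/web-1": {"connects_to": ["orders-db.internal:5432"]},
    "run/orders-api": {"connects_to": ["orders-db.internal:5432", "cache.internal:6379"]},
}

if __name__ == "__main__":
    resources = enrich(bulk_discovery(INVENTORIES), DETAILS)
    print(infer_connections(resources))
```

Targets that match no known endpoint (here `cache.internal:6379`) produce no edge, so the graph only contains connections that the discovered inventory can actually confirm.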
Natural Language Investigation
Instead of learning a separate CLI and query language for each provider, engineers interact with Aurora through natural language:
- "What caused the latency spike on the payment service?"
- "Are there any failing pods in the production cluster?"
- "Show me all resources affected by the us-east-1 connectivity issue"
Aurora translates these queries into the appropriate cloud-specific commands and aggregates the results.
Simultaneous Multi-Cloud Queries
During an investigation, Aurora's agents can execute commands across multiple cloud providers in parallel. While checking AWS CloudWatch metrics, it can simultaneously query Azure Monitor and Kubernetes pod status, something a human investigator would have to do sequentially.
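The general fan-out pattern looks roughly like the sketch below, which uses Python's standard thread pool. It is not Aurora's code: the three query functions are stubs with hypothetical payloads, and the sleeps merely simulate network latency.

```python
from concurrent.futures import ThreadPoolExecutor
import time


# Stub queries standing in for real provider calls (hypothetical payloads).
def query_aws_cloudwatch():
    time.sleep(0.1)
    return {"alarm_state": "ALARM"}


def query_azure_monitor():
    time.sleep(0.1)
    return {"fired_alerts": 2}


def query_kubernetes_pods():
    time.sleep(0.1)
    raise TimeoutError("cluster API unreachable")


def investigate(queries: dict) -> dict:
    """Run every query concurrently and collect results or errors by name."""
    findings = {}
    with ThreadPoolExecutor(max_workers=len(queries)) as pool:
        futures = {name: pool.submit(fn) for name, fn in queries.items()}
        for name, future in futures.items():
            try:
                findings[name] = {"ok": True, "data": future.result()}
            except Exception as exc:  # one failing provider must not hide the others
                findings[name] = {"ok": False, "error": str(exc)}
    return findings


QUERIES = {
    "aws": query_aws_cloudwatch,
    "azure": query_azure_monitor,
    "kubernetes": query_kubernetes_pods,
}

if __name__ == "__main__":
    for provider, finding in investigate(QUERIES).items():
        print(provider, finding)
```

Recording failures per provider instead of raising keeps partial evidence available: an unreachable cluster is itself a finding, and it should not discard what the other two providers returned.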
Dependency Graph
Aurora's Memgraph-powered infrastructure graph provides cross-cloud dependency mapping. When an AWS RDS instance goes down, Aurora automatically identifies the Azure-hosted application that depends on it and the GCP-based load balancer that routes traffic to it.
Building a Multi-Cloud Incident Playbook
- Map your cross-cloud dependencies: Use Aurora's infrastructure discovery or manually document how services interact across providers.
- Standardize alerting: Route all alerts to a single platform with consistent severity levels.
- Deploy unified investigation: Set up Aurora with connectors to all your cloud providers.
- Create cross-cloud runbooks: Document investigation procedures that span providers.
- Practice: Run game days that simulate multi-cloud incidents to test your team's response.
- Review and improve: Use AI-generated postmortems to identify patterns in cross-cloud incidents.
Getting Started
```shell
git clone https://github.com/Arvo-AI/aurora.git
cd aurora
make init
make prod-prebuilt
```
Configure your cloud providers in Aurora's settings, connect your monitoring tools, and the AI agent will automatically investigate incidents across all your cloud environments.
Learn more at arvoai.ca or read the full documentation. For a deep dive into how Aurora's AI agents investigate incidents, see What is Agentic Incident Management?. To understand how Aurora automates root cause analysis, read our Complete Guide to RCA for SREs. For the security architecture of running an AI agent across cloud CLIs in production, see AI agent kubectl safety: sandboxed execution for production.