Multi-Cloud Incident Management: Challenges and Solutions

Learn the top challenges of managing incidents across AWS, Azure, GCP, and Kubernetes simultaneously, and how AI-powered tools solve cross-cloud investigation.

By Noah Casarotto-Dinning, CEO at Arvo AI

Key Takeaway: 89% of organizations use a multi-cloud strategy, but investigating incidents across AWS, Azure, and GCP simultaneously remains a major pain point. AI-powered tools that can query multiple cloud providers in parallel eliminate the context-switching that slows manual investigation by 3-5x.

Multi-cloud adoption has become the default strategy for enterprises. According to Flexera's 2024 State of the Cloud Report, 89% of organizations have a multi-cloud strategy. Gartner predicts that by 2027, over 90% of organizations will adopt multi-cloud approaches.

The reasons are clear: avoiding vendor lock-in, leveraging best-of-breed services, meeting data residency requirements, and improving resilience. But this architectural choice creates a significant operational challenge: how do you investigate and resolve incidents that span multiple cloud providers simultaneously?

What Is a Multi-Cloud AI SRE?

Definition: A multi-cloud AI SRE is an autonomous agent that investigates and helps resolve incidents across more than one cloud provider, rather than a single platform or Kubernetes alone. Instead of waiting for an engineer to context-switch between AWS, Azure, and GCP consoles, the agent queries every connected provider, runs the native CLI for each one (aws, az, gcloud, kubectl), maps cross-cloud dependencies, and produces a single root-cause analysis that spans the whole estate.

The distinction matters because most "AI SRE" tooling is scoped narrowly. Kubernetes-native agents such as HolmesGPT and K8sGPT reason brilliantly about pods, events, and cluster misconfigurations. HolmesGPT also reaches cloud and database sources through its toolsets, but neither tool is built to unify an entire multi-cloud estate into a single blast-radius graph. Observability-vendor agents such as Datadog Bits AI SRE investigate by reasoning over the telemetry already inside their own platform. Neither category covers an estate that spans several cloud providers, which is what most enterprises actually run.

A true multi-cloud AI SRE is defined by three properties:

  • Provider breadth. It connects to multiple clouds natively, not just Kubernetes and not just one vendor's telemetry pipeline.
  • Execution, not only diagnosis. It can run real read commands against live infrastructure to gather evidence, rather than only interpreting metrics that were shipped to it.
  • Cross-cloud blast radius. It maintains a dependency model that spans providers, so it can tell you that an AWS RDS outage is degrading a GCP-hosted service through an Azure load balancer.

Aurora, the open-source (Apache-2.0) AI SRE from Arvo AI, was built around exactly these three properties.

Which AI SRE Should You Pick? A Decision Tree

The right tool depends almost entirely on how many clouds you actually run and what governance model you need.

If you run more than one cloud (any mix of AWS, Azure, GCP, OVH, Scaleway alongside Kubernetes), pick a multi-cloud agent. Cluster-scoped tools see the cluster clearly but have limited or no view once an incident crosses into a managed database, a queue, or a load balancer that lives in the cloud provider's control plane. Aurora is designed for this case: it queries every connected provider and stitches the findings into one Memgraph dependency graph for blast radius.

If you are Kubernetes-only and want CNCF governance, a CNCF Sandbox project may fit your requirements better than a vendor product. HolmesGPT (accepted as a CNCF Sandbox project in 2025) is the stronger agentic investigator, starting from a Prometheus alert and tracing through logs and metrics; K8sGPT (CNCF Sandbox since December 2023) is the faster health-check that surfaces CrashLoopBackOff, ImagePullBackOff, and similar misconfigurations in plain English. Both are excellent inside the cluster boundary.

If you are already standardized on one observability vendor and never plan to leave it, that vendor's built-in agent (Datadog Bits AI SRE, Dynatrace Davis, New Relic AI) is the path of least resistance, because the agent reasons over telemetry it already holds. The tradeoff is lock-in: the agent investigates within that platform, so its view is only as wide as the data you ship into it, and the capability is a paid add-on on top of per-host or per-user pricing.

If you need self-hosted, air-gapped, or vendor-neutral operation, pick an open-source agent you can run on your own infrastructure with your own models. Aurora is self-hosted and air-gapped capable, and it is bring-your-own-LLM, so you can point it at a local runtime such as Ollama instead of sending incident data to a third-party API.

How the Multi-Cloud Options Compare

The table below positions Aurora against the two Kubernetes-focused CNCF tools and the telemetry-locked incumbents. "Telemetry-locked" means the agent investigates by reasoning over data inside its own platform rather than querying your cloud control planes directly.

| Capability | Aurora (Arvo AI) | HolmesGPT | K8sGPT | Datadog Bits AI SRE | Dynatrace Davis | New Relic AI |
| --- | --- | --- | --- | --- | --- | --- |
| Cloud coverage | Multi-cloud: AWS, Azure, GCP, OVH, Scaleway, plus Kubernetes | Kubernetes and cloud-native; native cloud and database toolsets, but not a unified multi-cloud estate graph | Kubernetes-focused; surfaces resource misconfigurations | Scoped to telemetry ingested into Datadog | Scoped to telemetry ingested into Dynatrace | Scoped to telemetry ingested into New Relic |
| Executes live commands | Yes: runs kubectl, aws, az, gcloud in sandboxed Kubernetes pods to gather evidence | Read-only diagnosis by default; optional remediation toolset | Reads cluster resources; explains and suggests fixes | Reasons over collected telemetry; does not run your cloud CLIs | Reasons over collected telemetry | Reasons over collected telemetry |
| Cross-cloud blast radius | Yes: Memgraph infrastructure knowledge graph spanning all connected providers | Cluster dependency view | Cluster-level analysis | Correlates within the Datadog data model | Correlates within the Dynatrace topology (Smartscape) | Correlates within the New Relic entity model |
| Postmortems and RCA | Generates RCA and postmortems; exports to Confluence, Notion, SharePoint | Produces root-cause analysis from telemetry | Plain-English explanations and recommended fixes | AI root-cause and drafted postmortems inside Datadog | AI root-cause via Davis | AI-assisted analysis and summaries |
| Code fix / PR suggestions | Suggests code fixes and can open PRs (human-gated) | Operator mode can open PRs via GitHub | Suggested remediations | Surfaces related changes | Surfaces related changes | Surfaces related changes |
| Deployment model | Self-hosted, air-gapped capable | Self-hosted open source | Self-hosted open source | SaaS, inside Datadog | SaaS, inside Dynatrace | SaaS, inside New Relic |
| Model choice (BYO-LLM) | Yes: Ollama for local inference, plus OpenAI, Anthropic, Google, Vertex, OpenRouter | LLM-agnostic (OpenAI, Anthropic, Bedrock, Gemini, Ollama, and more) | Multiple AI backends supported | Vendor-managed models | Vendor-managed models | Vendor-managed models |
| License / cost | Open source, Apache-2.0, free | Open source, Apache-2.0 | Open source, CNCF Sandbox | Paid add-on on top of Datadog pricing | Paid, part of Dynatrace platform | Paid, part of New Relic platform |
| Destructive actions | Human-gated; read-only by default | Read-only by default | Read-only analysis | Vendor-controlled, gated | Vendor-controlled, gated | Vendor-controlled, gated |

The pattern is consistent. The CNCF tools are the right answer when the cluster is your whole world. The vendor agents are the right answer when you have committed to one telemetry platform and accept the lock-in and add-on cost. Aurora is the right answer when you run more than one cloud, want the agent to actually execute investigation commands rather than only read shipped telemetry, and need to keep everything self-hosted and vendor-neutral.

Top Challenges of Multi-Cloud Incident Management

Fragmented Observability

Each cloud provider has its own monitoring and logging ecosystem:

  • AWS: CloudWatch, X-Ray, CloudTrail
  • Azure: Azure Monitor, Application Insights, Log Analytics
  • GCP: Cloud Monitoring, Cloud Logging, Cloud Trace
  • Kubernetes: Prometheus, various logging solutions

When an incident spans multiple providers, engineers must context-switch between consoles, query languages, and data formats. A single investigation might require checking CloudWatch metrics, Azure Monitor alerts, and Kubernetes pod logs — all with different interfaces.

Inconsistent Tooling

Different cloud providers use different CLI tools (aws, az, gcloud, kubectl), different authentication mechanisms (IAM roles, service principals, service accounts), and different resource naming conventions. This inconsistency slows investigation and increases error rates.
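
To make the inconsistency concrete, here is a minimal Python sketch that builds a comparable "list workloads as JSON" command for each of the four CLIs. The subcommands are standard read-only ones, but the `LIST_WORKLOADS` table and the `build_command` helper are illustrative names introduced for this example, not part of any tool discussed in this article.

```python
import shlex

# One read-only "list workloads" command per CLI.
# Even the JSON output flag differs across the four tools.
LIST_WORKLOADS = {
    "aws": ["aws", "ec2", "describe-instances", "--output", "json"],
    "azure": ["az", "vm", "list", "--output", "json"],
    "gcp": ["gcloud", "compute", "instances", "list", "--format=json"],
    "kubernetes": ["kubectl", "get", "pods", "--all-namespaces", "-o", "json"],
}


def build_command(provider: str) -> list[str]:
    """Return a copy of the argv list for a provider; raise on an unknown one."""
    try:
        return list(LIST_WORKLOADS[provider])
    except KeyError:
        raise ValueError(f"unsupported provider: {provider}") from None


if __name__ == "__main__":
    # Print the commands side by side instead of executing them.
    for name in LIST_WORKLOADS:
        print(f"{name:<11} {shlex.join(build_command(name))}")
```

The sketch only prints the commands. Running them would also require each CLI's own login flow, which is the credential problem covered in the next section.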

Credential Management

Investigating incidents across clouds requires access credentials for each provider. Managing AWS access keys, Azure service principals, GCP service accounts, and Kubernetes kubeconfig files securely is a significant operational burden.

Blast Radius Assessment

In multi-cloud architectures, services often depend on resources across providers. A database in AWS might serve an application running in GCP, with traffic routed through Azure. Understanding the blast radius of an incident requires a cross-cloud dependency map.
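
As a rough illustration of what a cross-cloud dependency map buys you, the sketch below walks a tiny hand-written graph to find everything downstream of a failed resource. The resource names and the `DEPENDS_ON` structure are hypothetical, and a plain dictionary stands in for a real graph database.

```python
from collections import deque

# Hypothetical edges: DEPENDS_ON[X] = [Y, ...] means "X depends on Y".
DEPENDS_ON = {
    "azure:lb/edge": ["gcp:svc/checkout"],
    "gcp:svc/checkout": ["aws:rds/orders-db"],
    "gcp:svc/search": ["gcp:sql/catalog-db"],
}


def blast_radius(failed: str, depends_on: dict[str, list[str]]) -> list[str]:
    """Return every resource that transitively depends on `failed`, in BFS order."""
    # Invert the edges so we can walk from a failed resource to its dependents.
    dependents: dict[str, list[str]] = {}
    for resource, deps in depends_on.items():
        for dep in deps:
            dependents.setdefault(dep, []).append(resource)

    seen = {failed}
    queue = deque([failed])
    affected: list[str] = []
    while queue:
        current = queue.popleft()
        for nxt in dependents.get(current, []):
            if nxt not in seen:
                seen.add(nxt)
                affected.append(nxt)
                queue.append(nxt)
    return affected


if __name__ == "__main__":
    print(blast_radius("aws:rds/orders-db", DEPENDS_ON))
    # → ['gcp:svc/checkout', 'azure:lb/edge']
```

The traversal itself is simple. The hard part in practice is keeping the edge list accurate across providers.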

Tribal Knowledge

Different team members often specialize in different clouds. When an incident spans AWS and Azure, you might need two specialists — and they might not be on call at the same time. Critical investigation knowledge is siloed.

"In a multi-cloud incident, the bottleneck isn't the tooling — it's finding someone who understands both AWS networking and Azure load balancing at 3 AM. AI agents that understand all clouds eliminate that dependency." — Noah Casarotto-Dinning, CEO at Arvo AI

According to the 2024 State of Cloud Strategy Survey by HashiCorp, 90% of enterprises report that multi-cloud skills gaps are a significant barrier to effective cloud operations.

Strategies for Cross-Cloud Incident Response

Unified Monitoring

Implement a monitoring layer that aggregates signals from all cloud providers. Tools like Datadog, Grafana, and New Relic can ingest metrics from multiple clouds, providing a single pane of glass.

Standardized Alerting

Route all alerts through a single platform (PagerDuty, Opsgenie) regardless of which cloud generated them. This ensures consistent severity classification and escalation.
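
One way to picture consistent severity classification is a small normalization layer in front of the alerting platform. The Python sketch below is an assumption-laden example: the payload fields are modeled loosely on CloudWatch alarm notifications and the Azure common alert schema but flattened for brevity, and the severity mapping is a placeholder you would tune to your own escalation policy.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Alert:
    source: str    # which cloud raised the alert
    resource: str  # affected resource or alarm identifier
    severity: str  # normalized: "critical", "warning", or "info"


# Assumed mappings, not an official equivalence between providers.
CLOUDWATCH_STATE = {"ALARM": "critical", "INSUFFICIENT_DATA": "warning", "OK": "info"}
AZURE_SEVERITY = {
    "Sev0": "critical",
    "Sev1": "critical",
    "Sev2": "warning",
    "Sev3": "info",
    "Sev4": "info",
}


def normalize(source: str, payload: dict) -> Alert:
    """Map a simplified provider payload onto the shared Alert shape."""
    if source == "aws":
        severity = CLOUDWATCH_STATE[payload["NewStateValue"]]
        return Alert("aws", payload["AlarmName"], severity)
    if source == "azure":
        severity = AZURE_SEVERITY[payload["severity"]]
        return Alert("azure", payload["alertTargetIDs"][0], severity)
    raise ValueError(f"no normalizer for source: {source}")


if __name__ == "__main__":
    print(normalize("aws", {"AlarmName": "orders-db-cpu", "NewStateValue": "ALARM"}))
    print(normalize("azure", {"severity": "Sev2", "alertTargetIDs": ["vm-checkout"]}))
```

Once every alert arrives in the same shape, escalation rules only need to be written once.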

Cross-Cloud Runbooks

Develop runbooks that account for multi-cloud scenarios. Instead of "check AWS CloudWatch," document the investigation flow across all relevant providers.

Infrastructure as Code

Use Terraform or similar tools to manage infrastructure across all providers. This creates a single source of truth for your cross-cloud architecture and makes it easier to identify configuration-related issues.

Automated Investigation

The most effective strategy is automating the cross-cloud investigation itself. AI agents that can query multiple cloud providers simultaneously eliminate the need for manual context-switching.

How Aurora Solves Multi-Cloud Incidents

Aurora was built specifically for multi-cloud incident management. Here's how it addresses each challenge:

Unified Cloud Connectors

Aurora connects to all major cloud providers through native connectors:

  • AWS: Uses STS AssumeRole for secure, temporary credentials
  • Azure: Azure Service Principal authentication
  • GCP: OAuth-based authentication
  • OVH: API key authentication
  • Scaleway: API token authentication
  • Kubernetes: Kubeconfig-based access

All connectors are configured once and used by the AI agent as needed during investigations.

Infrastructure Discovery Pipeline

Aurora's infrastructure discovery runs in three phases:

  1. Bulk Discovery: Enumerates all resources across all connected cloud providers
  2. Detail Enrichment: Gathers detailed configuration and metadata for each resource
  3. Connection Inference: Maps dependencies between resources (e.g., which EC2 instances connect to which RDS databases)

This builds a comprehensive infrastructure graph in Memgraph that the AI agent uses for blast radius analysis.
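
The three phases can be sketched in a few lines of Python. This is a toy model, not Aurora's implementation: the inventory is hard-coded, every function and resource name is hypothetical, and the connection heuristic (matching environment variable values against known endpoints) is just one assumed way such inference could work.

```python
# Toy data standing in for real provider APIs (all names hypothetical).
INVENTORY = {
    "aws": [{"id": "i-app1", "type": "ec2"}, {"id": "orders-db", "type": "rds"}],
    "gcp": [{"id": "checkout", "type": "cloud-run"}],
}
DETAILS = {
    "i-app1": {"env": {"DB_HOST": "orders-db.internal"}},
    "orders-db": {"endpoint": "orders-db.internal"},
    "checkout": {"env": {"DB_HOST": "orders-db.internal"}},
}


def bulk_discovery(inventory: dict) -> list[dict]:
    """Phase 1: flatten every provider's resources into one list."""
    return [{"provider": p, **r} for p, resources in inventory.items() for r in resources]


def enrich(resources: list[dict], details: dict) -> list[dict]:
    """Phase 2: attach configuration metadata to each resource."""
    return [{**r, **details.get(r["id"], {})} for r in resources]


def infer_connections(resources: list[dict]) -> list[tuple[str, str]]:
    """Phase 3: link a resource to any endpoint referenced in its env vars."""
    by_endpoint = {r["endpoint"]: r["id"] for r in resources if "endpoint" in r}
    edges = []
    for r in resources:
        for value in r.get("env", {}).values():
            if value in by_endpoint:
                edges.append((r["id"], by_endpoint[value]))
    return edges


if __name__ == "__main__":
    enriched = enrich(bulk_discovery(INVENTORY), DETAILS)
    print(infer_connections(enriched))
    # → [('i-app1', 'orders-db'), ('checkout', 'orders-db')]
```

Each resulting edge is a "depends on" pair that could be loaded into a graph store for blast radius queries.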

Natural Language Investigation

Instead of learning a different CLI tool and query language for each provider, engineers interact with Aurora through natural language:

  • "What caused the latency spike on the payment service?"
  • "Are there any failing pods in the production cluster?"
  • "Show me all resources affected by the us-east-1 connectivity issue"

Aurora translates these queries into the appropriate cloud-specific commands and aggregates the results.

Simultaneous Multi-Cloud Queries

During an investigation, Aurora's agents can execute commands across multiple cloud providers in parallel. While checking AWS CloudWatch metrics, it can simultaneously query Azure Monitor and Kubernetes pod status — something a human investigator would have to do sequentially.
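
The general idea of fanning out queries can be shown with Python's standard thread pool. This is a sketch under stated assumptions rather than a description of how Aurora schedules work: `query_provider` is a hypothetical stand-in that sleeps to simulate network latency instead of calling a real cloud API.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def query_provider(provider: str) -> dict:
    """Stand-in for a real provider query; sleeps to simulate network latency."""
    time.sleep(0.2)
    return {"provider": provider, "status": "ok"}


def query_all(providers: list[str]) -> list[dict]:
    """Run one query per provider concurrently; results keep the input order."""
    # max(1, ...) keeps the pool valid even when the provider list is empty.
    with ThreadPoolExecutor(max_workers=max(1, len(providers))) as pool:
        return list(pool.map(query_provider, providers))


if __name__ == "__main__":
    start = time.perf_counter()
    results = query_all(["aws", "azure", "kubernetes"])
    elapsed = time.perf_counter() - start
    print(results)
    print(f"elapsed: {elapsed:.2f}s")
```

Because the simulated calls overlap, total wall-clock time should track the slowest single query rather than the sum of all three, which is the benefit over checking each console in turn.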

Dependency Graph

Aurora's Memgraph-powered infrastructure graph provides cross-cloud dependency mapping. When an AWS RDS instance goes down, Aurora automatically identifies the GCP-hosted application that depends on it and the Azure load balancer that routes traffic to that application.

Building a Multi-Cloud Incident Playbook

  1. Map your cross-cloud dependencies: Use Aurora's infrastructure discovery or manually document how services interact across providers.
  2. Standardize alerting: Route all alerts to a single platform with consistent severity levels.
  3. Deploy unified investigation: Set up Aurora with connectors to all your cloud providers.
  4. Create cross-cloud runbooks: Document investigation procedures that span providers.
  5. Practice: Run game days that simulate multi-cloud incidents to test your team's response.
  6. Review and improve: Use AI-generated postmortems to identify patterns in cross-cloud incidents.

Getting Started

git clone https://github.com/Arvo-AI/aurora.git
cd aurora
make init
make prod-prebuilt

Configure your cloud providers in Aurora's settings, connect your monitoring tools, and the AI agent will automatically investigate incidents across all your cloud environments.

Learn more at arvoai.ca or read the full documentation. For a deep dive into how Aurora's AI agents investigate incidents, see What is Agentic Incident Management?. To understand how Aurora automates root cause analysis, read our Complete Guide to RCA for SREs. For the security architecture of running an AI agent across cloud CLIs in production, see AI agent kubectl safety: sandboxed execution for production.
