
Open-Source AI SRE: Aurora vs HolmesGPT vs K8sGPT (2026)

Compare Aurora, HolmesGPT, and K8sGPT — the three credible open-source AI SREs in 2026 — across architecture, execution, and integrations.

By Noah Casarotto-Dinning, CEO at Arvo AI

Key Takeaways

  • Three production-credible open-source AI SREs exist in 2026: Aurora (Arvo AI), HolmesGPT (Robusta + Microsoft, CNCF Sandbox), and K8sGPT (CNCF Sandbox), all Apache 2.0. OpenSRE (Tracer) is an emerging fourth, still in public alpha.
  • Two are true multi-step agents. HolmesGPT runs an iterative ReAct loop, and Aurora is a multi-step LangGraph agent with cross-cloud execution. K8sGPT is a rule-based scanner that uses an LLM only to explain findings.
  • Only Aurora handles multi-cloud out of the box (AWS, Azure, GCP, OVH, Scaleway, plus Kubernetes). HolmesGPT covers Kubernetes plus 30+ observability integrations. K8sGPT is Kubernetes-only.
  • Only Aurora and HolmesGPT generate remediation pull requests. Aurora opens PRs on GitHub and Bitbucket behind a human approval gate; HolmesGPT can open PRs with suggested fixes in Operator mode. K8sGPT is strictly read-only with no write actions.
  • All three support BYO LLM, including local inference via Ollama for air-gapped deployments — the differentiator over commercial AI SREs.

Of the 46+ companies offering "AI SRE" products in 2026, only a handful are open source, and only three are credible enough to deploy in production: Aurora, HolmesGPT, and K8sGPT. They get lumped together in marketing, but architecturally these three are different products solving different parts of the incident response problem.

This guide compares them on the things that actually matter: agent architecture, execution model, integration scope, and where you can deploy them. By the end, you should be able to pick the right one for your stack — or know whether you need all three.

What is an open-source AI SRE?

An open-source AI SRE is an AI agent that performs site reliability engineering work — alert triage, incident investigation, root cause analysis, remediation — under a permissive license that allows self-hosting, source-code audit, and modification. Three properties are non-negotiable:

  1. License: Apache 2.0, MIT, or equivalent. Source-available licenses (BSL, SSPL) do not count for most production teams.
  2. Self-hostable: runs entirely inside your environment without phoning home to a vendor.
  3. LLM-driven: uses large language models, not just static rules or regex. (This is what separates "AI SRE" from older AIOps tools.)

The reason this category matters: incident data is some of the most sensitive telemetry an organization produces. Self-hosted, auditable AI is the only model that works for regulated industries, air-gapped environments, or any team that doesn't want production telemetry leaving their perimeter.

For a deeper background, see our complete guide to AI SRE.

Why open source matters for AI SRE

Three reasons buyers in 2026 are explicitly asking for open-source AI SRE:

  • Data sovereignty. Incident telemetry includes log lines, configuration values, deployment IDs, and sometimes payloads. SaaS AI SREs send all of it to their backend and to a third-party LLM. Self-hosted means it stays in your VPC.
  • Audit transparency. Regulators and security teams want to know exactly what the agent does on production systems. Source code answers that question; vendor marketing does not.
  • Cost predictability. Per-user or per-incident pricing can balloon quickly. Open-source costs scale with infrastructure and LLM tokens — and Ollama-local inference can flatten the LLM bill entirely.

The trade-off is real: you operate the system yourself. For teams already operating Kubernetes and observability stacks, that's marginal effort. For teams without that operational maturity, a commercial AI SRE is often the right call.

How the four compare

The table below is the core comparison, verified from each project's GitHub repo, official docs, and source as of May 2026. OpenSRE (Tracer) is included as the emerging fourth entrant, still in public alpha.

| Dimension | Aurora | HolmesGPT | K8sGPT | OpenSRE |
|---|---|---|---|---|
| License | Apache 2.0 | Apache 2.0 | Apache 2.0 | Apache 2.0 |
| GitHub stars | 263 | 2,554 | 7,837 | 6,352 |
| Latest release | v1.2.15 (Apr 2026) | 0.26.0 (Apr 2026) | v0.4.32 (Apr 2026) | v0.1 (May 2026) |
| Maturity | Production (v1.x) | Production (CNCF Sandbox) | Production (CNCF Sandbox) | Public Alpha |
| CNCF status | Independent | Sandbox (Oct 2025) | Sandbox | Independent |
| Built by | Arvo AI | Robusta + Microsoft | k8sgpt-ai community | Tracer |
| Agent architecture | LangGraph supervisor + sub-agents | ReAct loop (ToolCallingLLM) | Rule-based scanner + LLM explainer | Multi-step investigation agent (originally LangGraph-based, since moved off it) |
| Multi-step reasoning | Yes | Yes | No (single-shot per analyzer) | Yes (multi-step reasoning) |
| Cloud providers | AWS, Azure, GCP, OVH, Scaleway | Kubernetes + AWS via MCP | Kubernetes only | AWS, GCP, Azure, Kubernetes |
| Kubernetes execution | kubectl in sandboxed pods | Read-only kubectl get/describe | Read-only via Kube API | Investigation; can optionally execute remediation |
| Other integrations | 22+ (PagerDuty, Datadog, Grafana, Slack, Confluence, Bitbucket, Jenkins, etc.) | 30+ toolsets (Prometheus, Grafana, Datadog, Loki, Jira, etc.) | None (Kubernetes-only by design) | 60+ tools (Grafana, Datadog, CloudWatch, PagerDuty, Opsgenie, Jira, Slack, GitHub, etc.) |
| Knowledge base / RAG | Weaviate vector search over runbooks + postmortems | Yes (via toolsets) | No | Not a documented first-class feature |
| Dependency graph | Memgraph (cross-cloud blast radius) | No | No | Context assembly across logs, metrics, configs, dependencies |
| Postmortem generation | Yes, exports to Confluence | Investigation reports only | No | Investigation report only (to Slack or PagerDuty) |
| Pull request remediation | GitHub + Bitbucket with human approval gate | GitHub PRs in Operator mode | None (strictly read-only) | No PR-based remediation; can optionally execute remediation actions |
| MCP server | Yes (~22 upfront tools, ~150-tool catalog) | Yes (consumes MCP servers) | No | Yes (supports MCP) |
| LLM providers | OpenAI, Anthropic, Google, Vertex, OpenRouter, Ollama | OpenAI, Anthropic, Azure OpenAI, Bedrock, Gemini, Vertex, Ollama | OpenAI, Azure, Cohere, Bedrock, SageMaker, Gemini, Vertex, HuggingFace, WatsonX, LocalAI, Ollama | Anthropic, OpenAI, Gemini, Bedrock, OpenRouter, NVIDIA NIM, Ollama |
| Air-gapped support | Yes (Ollama + image tarballs) | Yes (Ollama) | Yes (LocalAI / Ollama) | Self-hostable; local LLM via Ollama |
| Deployment | Docker Compose or Helm | Binary, API server, K8s Operator, Python SDK | Go binary, K8s operator | Python/FastAPI runtime (Docker, Railway, EC2, ECS) |

What is OpenSRE?

OpenSRE is an Apache 2.0 open-source framework for building AI SRE agents, maintained by Tracer. It ingests an alert, assembles context from logs, metrics, traces, and dependencies, reasons across your connected systems to identify the probable root cause, and posts a structured investigation report to Slack or PagerDuty. It is the newest entrant in this comparison: the repository was created in January 2026 and has grown quickly, passing 6,300 GitHub stars.

A note on the framework. OpenSRE was originally built on LangGraph, and that lineage is why it is often described as a LangGraph AI SRE. The current main branch has since moved off it: the README describes the present architecture as the state after removing the old graph and chain framework layers, and the pyproject.toml no longer lists LangGraph as a dependency. We flag this because the framework story is still in motion, which is the broader point about OpenSRE's maturity.

On that maturity: OpenSRE openly labels itself Public Alpha, and its README states that "core workflows are usable for early exploration, though not yet fully stable." Its first tagged release, v0.1, landed in May 2026, and it now ships rolling date-stamped releases (latest v2026.6.3). That is a meaningfully earlier stage than the other three projects here. Aurora is past v1.x with sandboxed cross-cloud execution, postmortem generation, and PR-based remediation, while HolmesGPT and K8sGPT are both CNCF Sandbox projects with multi-year release histories. OpenSRE is promising and fast-moving, but if you need production stability today it is the least battle-tested option in this group. It is best read as a build-your-own toolkit for teams comfortable tracking an alpha codebase.

The OSS AI SRE Maturity Spectrum

A useful way to position these tools is on a four-level spectrum of agent capability. Each level is strictly more capable than the one below — and each requires more architectural work to deploy safely.

| Level | What the agent does | Tools at this level |
|---|---|---|
| L1: Diagnostic Explainer | Reads system state, finds anomalies via deterministic rules, uses an LLM only to explain findings in natural language. No multi-step reasoning. Strictly read-only. | K8sGPT |
| L2: Read-Only Investigator | Runs an iterative ReAct loop. Picks tools dynamically. Investigates across multiple data sources (metrics, logs, traces, K8s state). Read-only by design. | HolmesGPT |
| L3: Investigation + Suggestion | Everything in L2, plus opens pull requests with suggested fixes. Humans review and merge. No autonomous writes to infrastructure. | HolmesGPT (Operator mode), Aurora |
| L4: Investigation + Approved Remediation | Everything in L3, plus can execute approved remediation actions (rollbacks, restarts, scale changes) inside guardrails, typically a sandboxed runtime with explicit human approval for destructive operations. | Aurora (with Bitbucket connector's human approval gate for destructive actions) |

No open-source tool today operates as a fully autonomous L5 (closed-loop remediation without human approval) — and that's by design. Most serious teams want explicit gates before agents touch production.
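One way to picture the L4 guardrail is a deny-by-default gate sitting in front of every command the agent proposes. The sketch below is illustrative only and is not Aurora's implementation; the verb lists and the `gate` function are assumptions made up for this example.

```python
from dataclasses import dataclass
from typing import Optional

# Hypothetical verb lists for illustration; a real policy would be stricter.
READ_ONLY_VERBS = {"get", "describe", "logs", "top"}
DESTRUCTIVE_VERBS = {"delete", "drain", "scale", "rollout", "apply", "patch"}


@dataclass
class Decision:
    allowed: bool
    reason: str


def gate(command: str, approved_by: Optional[str] = None) -> Decision:
    """Decide whether an agent-proposed kubectl command may run."""
    parts = command.split()
    if len(parts) < 2 or parts[0] != "kubectl":
        return Decision(False, "not a kubectl command")
    verb = parts[1]
    if verb in READ_ONLY_VERBS:
        return Decision(True, "read-only")
    if verb in DESTRUCTIVE_VERBS:
        if approved_by:
            return Decision(True, f"approved by {approved_by}")
        return Decision(False, "needs human approval")
    # Deny by default: anything unrecognized (including leading flags) never runs.
    return Decision(False, "unknown verb")
```

The important design choice is the last line: anything the policy does not recognize is refused rather than allowed, so a new or unusual command fails closed until someone adds it to a list.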

Aurora vs HolmesGPT — which should you choose?

Aurora and HolmesGPT are the two genuinely agentic options. The choice depends on how far your incidents reach: inside Kubernetes and its observability stack, or across multiple clouds.

Pick HolmesGPT when:

  • Your stack is heavily Kubernetes + Prometheus + Grafana and your incidents live there.
  • You want a tool that already integrates with 30+ observability sources, including Loki, AlertManager, NewRelic, Datadog APM, OpsGenie, and Slack.
  • You value CNCF governance and a fast-moving ecosystem.
  • You don't need cross-cloud (AWS APIs, Azure resources, GCP services) reasoning out of the box.

Pick Aurora when:

  • You operate across multiple clouds (AWS + Azure, GCP + AWS, etc.) and need an agent that can correlate incidents across providers.
  • You want auto-generated postmortems exported to Confluence.
  • You want the agent to draft remediation PRs against your codebase.
  • You need a graph-based blast radius model (Memgraph) for dependency analysis.
  • You want an MCP server so your IDE assistants (Cursor, Claude Desktop, Windsurf) can query live incident state.

In practice, some teams run both: HolmesGPT for in-cluster Kubernetes triage, Aurora for cross-cloud investigation and postmortem generation.
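The graph-based blast radius idea mentioned above is easiest to see in miniature. Aurora keeps its graph in Memgraph; the plain-Python sketch below only illustrates what a blast-radius query computes and is not Aurora's code. The service names and the `DEPENDS_ON` map are hypothetical.

```python
from collections import deque

# Hypothetical edges: each service maps to the things it depends on.
DEPENDS_ON = {
    "checkout": ["payments", "cart"],
    "payments": ["postgres"],
    "cart": ["redis"],
    "search": ["elasticsearch"],
}


def blast_radius(failed, depends_on):
    """Return every service that directly or transitively depends on `failed`."""
    # Invert the edges so we can walk from a dependency up to its dependents.
    dependents = {}
    for service, deps in depends_on.items():
        for dep in deps:
            dependents.setdefault(dep, []).append(service)

    affected = set()
    queue = deque([failed])
    while queue:
        node = queue.popleft()
        for parent in dependents.get(node, []):
            if parent not in affected:
                affected.add(parent)
                queue.append(parent)
    return affected


impacted = blast_radius("postgres", DEPENDS_ON)  # {"payments", "checkout"}
```

In this toy graph a `postgres` failure reaches `payments` and then `checkout`, while `search` and `cart` stay out of scope, which is the kind of answer an investigation needs before deciding who to page.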

Aurora vs K8sGPT — which should you choose?

This is closer to "which tool category do you need?" than a head-to-head.

Pick K8sGPT when:

  • You want the absolute simplest entry point to AI for Kubernetes — a single Go binary you can install with Homebrew and run as k8sgpt analyze --explain.
  • Your needs stop at "explain why this pod is broken" rather than multi-step incident investigation.
  • You want the maturity of a 7.8k-star CNCF Sandbox project with rule-based analyzers that won't hallucinate causes (because findings are produced deterministically before the LLM ever sees them).

Pick Aurora when:

  • You need agentic investigation, not just diagnostic explanation.
  • You operate beyond Kubernetes — cloud APIs, Terraform, monitoring tools, runbooks.
  • You want auto-generated postmortems and remediation PRs.

These two are complements, not competitors. Many teams run K8sGPT as a lightweight first-line scanner and Aurora (or HolmesGPT) for full incident investigation.

HolmesGPT vs K8sGPT — head-to-head

Despite both being CNCF Sandbox projects targeting Kubernetes, these are different categories.

| Aspect | HolmesGPT | K8sGPT |
|---|---|---|
| What it is | Multi-step AI agent | Rule-based scanner with LLM explanations |
| When it shines | Investigating an alert end-to-end across signals | Diagnosing why a specific resource is unhealthy |
| Latency | Seconds to minutes (multi-step) | Sub-second per analyzer |
| LLM cost | Higher (multiple calls per investigation) | Lower (one explanation per finding) |
| Hallucination risk | Higher (agent reasons across signals) | Lower (deterministic before LLM) |
| Best fit | On-call engineers handling alerts | Platform teams running periodic cluster audits |
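The difference in that table comes down to control flow. As a rough sketch in Python (not either project's actual code; `PODS`, `scripted_llm`, and the `get_logs` tool are made-up stand-ins), a scanner runs fixed rules and asks the model only to phrase the result, while a ReAct-style agent lets the model choose each next step:

```python
# Hypothetical cluster snapshot; a real tool would read this from the Kubernetes API.
PODS = [
    {"name": "api-7f9", "status": "CrashLoopBackOff"},
    {"name": "web-2c1", "status": "Running"},
]


def scan(pods):
    """Scanner style: deterministic rules produce the findings, no LLM involved."""
    return [f"{p['name']} is in {p['status']}" for p in pods if p["status"] != "Running"]


def explain(findings, llm):
    """Scanner style, step two: one LLM call per finding, only to phrase it."""
    return [llm(f"Explain this Kubernetes finding: {f}") for f in findings]


def react(question, llm, tools, max_steps=5):
    """Agent style: the LLM picks the next tool until it decides to answer."""
    transcript = [question]
    for _ in range(max_steps):
        action, arg = llm("\n".join(transcript))
        if action == "answer":
            return arg
        observation = tools[action](arg)
        transcript.append(f"{action}({arg}) -> {observation}")
    return "no conclusion within the step budget"


def scripted_llm(prompt):
    """Stand-in for a real model: fetch logs once, then answer."""
    if "get_logs" not in prompt:
        return ("get_logs", "api-7f9")
    return ("answer", "api-7f9 crashes because DB_URL is not set")


tools = {"get_logs": lambda pod: "KeyError: 'DB_URL'"}

findings = scan(PODS)
explanations = explain(findings, llm=lambda prompt: f"[model output for] {prompt}")
verdict = react("Why is api-7f9 failing?", scripted_llm, tools)
```

The scanner path makes exactly one model call per finding, and the finding itself never depends on the model. The agent path makes one call per step and the model decides what to look at, which is where both the extra reach and the extra latency, cost, and hallucination risk come from.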

K8sGPT's anonymization feature (which masks resource names and labels before sending to the LLM) is a meaningful privacy advantage that HolmesGPT does not match.
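To make the masking idea concrete, here is a minimal Python sketch of the general technique: swap sensitive names for placeholders before the prompt leaves your environment, then swap them back in the model's reply. K8sGPT itself is written in Go and this is not its implementation; the `mask` and `unmask` helpers and the `<RES_n>` token format are assumptions for illustration.

```python
def mask(text, names):
    """Replace each sensitive name with a placeholder; return text and the mapping."""
    mapping = {}
    # Longest names first, so "payments-db-0" is masked before a shorter "payments".
    for i, name in enumerate(sorted(names, key=len, reverse=True)):
        token = f"<RES_{i}>"
        mapping[token] = name
        text = text.replace(name, token)
    return text, mapping


def unmask(text, mapping):
    """Restore the original names in text that came back from the LLM."""
    for token, name in mapping.items():
        text = text.replace(token, name)
    return text


original = "Pod payments-db-0 in namespace acme-prod is OOMKilled"
masked, mapping = mask(original, ["payments-db-0", "acme-prod"])
restored = unmask(masked, mapping)
```

Only `masked` would be sent to the model. The mapping stays local, so the explanation can still be shown to the engineer with real resource names.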

When NOT to use open-source AI SRE

Honest take: open-source AI SRE is the right answer for most engineering-led, security-conscious teams. It's the wrong answer when:

  • You don't have the operational capacity to run another stateful service in production.
  • You want vendor support with SLAs and a phone number to call at 3 AM.
  • Your team is small enough that the LLM-API bill of an investigation-heavy agent will exceed the per-seat price of a SaaS AI SRE like Rootly, incident.io, or Resolve.ai.
  • You need certifications (SOC2, ISO 27001) at the AI-vendor layer rather than at the cloud-provider layer.

For teams in those situations, our comparison guide walks through the trade-offs.
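The LLM-bill-versus-seat-price point in the list above reduces to simple arithmetic you can run with your own numbers. The sketch below is a back-of-the-envelope model only; every input is a placeholder, not a real price or measured usage, and it ignores infrastructure and operator time on the self-hosted side.

```python
def monthly_llm_cost(investigations, calls_each, tokens_per_call, usd_per_million_tokens):
    """Token spend for one month of agent investigations."""
    tokens = investigations * calls_each * tokens_per_call
    return tokens / 1_000_000 * usd_per_million_tokens


def monthly_saas_cost(seats, usd_per_seat):
    """Per-seat subscription spend for one month."""
    return seats * usd_per_seat


# Placeholder inputs, not real prices or measured usage.
llm = monthly_llm_cost(investigations=200, calls_each=15,
                       tokens_per_call=8_000, usd_per_million_tokens=5.0)
saas = monthly_saas_cost(seats=5, usd_per_seat=40.0)
cheaper = "self-hosted" if llm < saas else "SaaS"
```

With these placeholder inputs the token bill is 120.0 against a seat bill of 200.0. Doubling the investigation count or the calls per investigation flips the result, which is why small teams with noisy alerting should run this check before committing.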

How to pilot an open-source AI SRE in your team

A six-step, low-risk pilot for any of the three tools:

  1. Pick one cluster and one observability source. Don't try to cover everything at once.
  2. Install in read-only mode first. All three tools default to read-only — keep it that way for the first two weeks.
  3. Connect one alert source. PagerDuty, Datadog, or Grafana — pick the one that's already firing real alerts.
  4. Run for two weeks alongside human on-call. Compare the agent's RCA conclusions to what your engineers determined. Track accuracy and time-to-RCA.
  5. Feed it your historical context. Aurora and HolmesGPT both support runbook + postmortem ingestion. Agents become dramatically more useful with organizational memory.
  6. Expand carefully. Add more clusters, then enable remediation suggestions, then (only after trust) approved automated actions for specific low-risk patterns.
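For step 4, it helps to decide up front how you will score the pilot. The sketch below is one simple way to do it, assuming you log one row per incident; the field names and the sample rows are invented for illustration, and the exact-string comparison stands in for what would normally be an engineer judging whether the two root causes agree.

```python
from statistics import median

# Hypothetical pilot log: one row per incident handled during the two weeks.
incidents = [
    {"agent_rca": "bad deploy", "human_rca": "bad deploy", "agent_min": 4, "human_min": 35},
    {"agent_rca": "disk full", "human_rca": "disk full", "agent_min": 6, "human_min": 50},
    {"agent_rca": "dns outage", "human_rca": "expired cert", "agent_min": 3, "human_min": 20},
    {"agent_rca": "oom kill", "human_rca": "oom kill", "agent_min": 9, "human_min": 60},
]


def score(rows):
    """Summarize RCA agreement and time-to-RCA for the agent versus humans."""
    matches = sum(1 for r in rows if r["agent_rca"] == r["human_rca"])
    return {
        "accuracy": matches / len(rows),
        "median_agent_min": median(r["agent_min"] for r in rows),
        "median_human_min": median(r["human_min"] for r in rows),
    }


summary = score(incidents)
```

Tracking the misses matters as much as the headline accuracy: the one disagreement in this sample is exactly the kind of case to review before moving on to step 6.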

Getting started with Aurora

Aurora is the multi-cloud, multi-tool option among open-source AI SREs. To run it:

```bash
git clone https://github.com/Arvo-AI/aurora.git
cd aurora
make init
make prod-prebuilt
```

Aurora supports multiple LLM providers (OpenAI, Anthropic, Google, OpenRouter) as well as local models via Ollama for air-gapped deployments. See the full documentation, our AI SRE complete guide, or our explainer on agentic incident management.

For the technical side of running an agent that executes kubectl against production, read our companion piece on AI agent kubectl safety and sandboxed execution. For closing the loop from rollback to remediation across the full delivery pipeline, see our CI/CD auto-remediation complete guide. For the two halves of the AI SRE workflow these open-source projects all converge toward, see our deep guides on AI-powered incident investigation and automated post-mortem generation. For the broader category landscape and the commercial peer set, see Top 15 AI SRE Tools in 2026; for how to actually deploy any of these inside your own perimeter, see Self-Hosted AI SRE.

