Complex technology follows a predictable arc. In the early stages, teams build in isolation—choosing their own tools, abstractions, and failure models. What appears to be flexibility eventually reveals itself as fragmentation when systems need to scale.
The solution isn't simply adding more features. It requires a shared operational philosophy. Kubernetes demonstrated this principle clearly. The project didn't just solve container orchestration—it established patterns for safely modifying production systems. The community refined these approaches, stress-tested them in production, and elevated them to industry standards.
AI infrastructure remains in its fragmented phase. The operational challenge has shifted from binary states—working or broken—to evaluating output quality. This requires fundamentally different tooling and practices. The path forward mirrors cloud-native's evolution: open source projects establishing common interfaces, with community adoption replacing ad-hoc implementations with documented, repeatable patterns.
Since our last update at KubeCon + CloudNativeCon North America 2025, Microsoft has continued investing across open-source AI infrastructure, multi-cluster operations, networking, observability, storage, and cluster lifecycle management. At KubeCon + CloudNativeCon Europe 2026 in Amsterdam, we're announcing capabilities that extend Kubernetes' operational maturity to modern workload requirements.
Building open source foundations for AI on Kubernetes
As AI workloads and Kubernetes infrastructure converge, the operational gaps in each increasingly overlap. A substantial portion of our upstream work this cycle has focused on making GPU-backed workloads first-class citizens in the cloud-native ecosystem.
On the scheduling front, Microsoft has collaborated with industry partners to advance open standards for hardware resource management:
- Dynamic Resource Allocation (DRA) has graduated to general availability, including the DRA example driver and DRA Admin Access.
- Workload Aware Scheduling for Kubernetes 1.36 adds DRA support in the Workload API and integrates with KubeRay, simplifying how developers request and manage high-performance infrastructure for training and inference.
- DRANet now includes upstream compatibility for Azure RDMA Network Interface Cards, extending DRA-based network resource management to high-performance hardware where GPU-to-NIC topology alignment directly impacts training performance.
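To make the DRA model concrete, the flow described above pairs a ResourceClaim with a pod that consumes it. This is an illustrative sketch rather than an AKS-specific configuration: the API version reflects DRA's GA shape, and the device class name and container image are placeholders you would replace with your driver's values.

```yaml
# A claim for one device from a (placeholder) GPU device class.
apiVersion: resource.k8s.io/v1
kind: ResourceClaim
metadata:
  name: single-gpu
spec:
  devices:
    requests:
    - name: gpu
      exactly:
        deviceClassName: gpu.example.com   # placeholder; provided by your DRA driver
---
# A pod that references the claim instead of requesting
# an opaque extended resource like nvidia.com/gpu.
apiVersion: v1
kind: Pod
metadata:
  name: trainer
spec:
  restartPolicy: Never
  resourceClaims:
  - name: gpu
    resourceClaimName: single-gpu
  containers:
  - name: train
    image: registry.example.com/train:latest   # placeholder image
    resources:
      claims:
      - name: gpu
```

Compared with extended resources, the claim carries structured device parameters, which is what lets the scheduler reason about topology (such as the GPU-to-NIC alignment mentioned above).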
Beyond scheduling, we've invested in tooling for deploying, operating, and securing AI workloads on Kubernetes:
- AI Runway is a new open-source project introducing a common Kubernetes API for inference workloads. It provides platform teams with centralized model deployment management and flexibility to adopt new serving technologies as the ecosystem evolves. The project includes a web interface for users who don't need Kubernetes expertise, built-in HuggingFace model discovery, GPU memory fit indicators, real-time cost estimates, and support for runtimes including NVIDIA Dynamo, KubeRay, llm-d, and KAITO.
- HolmesGPT has joined the Cloud Native Computing Foundation (CNCF) as a Sandbox project, bringing agentic troubleshooting capabilities into the shared cloud-native tooling ecosystem.
- Dalec, a newly onboarded CNCF project, defines declarative specifications for building system packages and producing minimal container images, with support for SBOM generation and provenance attestations at build time. Reducing attack surface and vulnerabilities at the build stage is critical for organizations running AI workloads at scale.
- Cilium received extensive Microsoft contributions this cycle, including native mTLS ztunnel support for sidecarless encrypted workload communication, Hubble metrics cardinality controls for managing observability costs, flow log aggregation to reduce storage volume, and two merged Cluster Mesh Cilium Feature Proposals advancing cross-cluster networking.
What's new in Azure Kubernetes Service
In addition to upstream contributions, we're introducing new Azure Kubernetes Service (AKS) capabilities across networking and security, observability, multi-cluster operations, storage, and cluster lifecycle management.
From IP-based controls to identity-aware networking
As Kubernetes deployments become more distributed, IP-based networking grows harder to manage. Visibility degrades, security policies become difficult to audit, and encrypting workload communication has historically required either a full service mesh or significant custom work. Our networking updates address this by moving security and traffic intelligence to the application layer, where it's both more meaningful and easier to operate.
Azure Kubernetes Application Network provides mutual TLS, application-aware authorization, and detailed traffic telemetry across ingress and in-cluster communication, with built-in multi-region connectivity. Teams get identity-aware security and traffic insight without the overhead of running a full service mesh.
For teams managing the deprecation of ingress-nginx, Application Routing with Meshless Istio provides a standards-based migration path: Kubernetes Gateway API support without sidecars, continued support for existing ingress-nginx configurations, and contributions to ingress2gateway for incremental migration.
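For reference, the Gateway API target of that migration expresses routing as an HTTPRoute attached to a Gateway, rather than annotations on an Ingress. A minimal illustrative fragment, with placeholder names:

```yaml
# Route "app.example.com" traffic to a backend Service,
# replacing an equivalent ingress-nginx Ingress rule.
apiVersion: gateway.networking.k8s.io/v1
kind: HTTPRoute
metadata:
  name: app-route
spec:
  parentRefs:
  - name: app-gateway        # the Gateway managed by the platform
  hostnames:
  - app.example.com
  rules:
  - matches:
    - path:
        type: PathPrefix
        value: /
    backendRefs:
    - name: app-svc          # existing backend Service
      port: 80
```

Tools like ingress2gateway generate this shape from existing Ingress objects, which is what makes the migration incremental.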
At the data plane level, WireGuard encryption with the Cilium data plane secures node-to-node traffic efficiently without application changes. Cilium mTLS in Advanced Container Networking Services extends this to pod-to-pod communication using X.509 certificates and SPIRE for identity management—authenticated, encrypted workload traffic without sidecars.
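The identity-aware model above can be expressed in policy. As a hedged sketch of Cilium's mutual authentication feature (labels and names are placeholders; the exact policy surface in Advanced Container Networking Services may differ), a CiliumNetworkPolicy can require authenticated peers instead of matching IP ranges:

```yaml
# Allow ingress to "payments" only from "checkout" endpoints,
# and require mutually authenticated (mTLS) connections.
apiVersion: cilium.io/v2
kind: CiliumNetworkPolicy
metadata:
  name: payments-require-auth
spec:
  endpointSelector:
    matchLabels:
      app: payments
  ingress:
  - fromEndpoints:
    - matchLabels:
        app: checkout
    authentication:
      mode: "required"   # peers must prove identity via SPIFFE/SPIRE-issued certs
```

Because the selectors are workload identities rather than CIDRs, the policy survives pod rescheduling and remains auditable as the cluster grows.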
Pod CIDR expansion removes a long-standing operational constraint by allowing clusters to grow their pod IP ranges in place rather than requiring a rebuild. Separately, administrators can now disable HTTP proxy environment variables for nodes and pods without modifying the control plane configuration.
Visibility that matches cluster complexity
Operating Kubernetes at scale requires clear, consistent visibility into infrastructure, networking, and workloads. Two persistent gaps we've addressed are GPU telemetry and network traffic observability, both increasingly critical as AI workloads move into production.
Teams running GPU workloads have often faced a significant monitoring blind spot: GPU utilization wasn't visible alongside standard Kubernetes metrics without manual exporter configuration. AKS now surfaces GPU performance and utilization directly into managed Prometheus and Grafana, integrating GPU telemetry into the same stack teams use for capacity planning and alerting.
On the network side, per-flow L3/L4 and supported L7 visibility across HTTP, gRPC, and Kafka traffic is now available, including IPs, ports, workloads, flow direction, and policy decisions. A new Azure Monitor experience provides built-in dashboards and one-click onboarding.
For teams managing metric volume, operators can now dynamically control which container-level metrics are collected using Kubernetes custom resources, keeping dashboards focused on actionable signals. Agentic container networking adds a web-based interface that translates natural-language queries into read-only diagnostics using live telemetry, shortening the path from problem identification to resolution.
Simpler operations across clusters and workloads
For organizations running workloads across multiple clusters, cross-cluster networking has historically meant custom plumbing, inconsistent service discovery, and limited visibility across cluster boundaries.
Azure Kubernetes Fleet Manager now addresses this with cross-cluster networking through a managed Cilium cluster mesh, providing unified connectivity across AKS clusters, a global service registry for cross-cluster service discovery, and intelligent routing with centrally managed configuration.
On the storage side, clusters can now consume storage from a shared Elastic SAN pool rather than provisioning and managing individual disks per workload. This simplifies capacity planning for stateful workloads with variable demands and reduces provisioning overhead at scale.
For teams needing a more accessible entry point to Kubernetes, AKS desktop is now generally available, letting developers run, test, and iterate on Kubernetes workloads on their local machines with the same configuration they'll use in production.
Safer upgrades and faster recovery
The cost of a failed upgrade compounds quickly in production, and recovery has historically been time-consuming and stressful. Several updates this cycle focus on making cluster changes safer, more observable, and more reversible.
Blue-green agent pool upgrades create a parallel pool with the new configuration rather than applying changes in place, allowing teams to validate behavior before shifting traffic and maintain a clear rollback path if issues arise.
Agent pool rollback complements this by allowing teams to revert a node pool to its previous Kubernetes version and node image when problems surface after an upgrade, without requiring a full rebuild. Together, these capabilities give operators meaningful control over the upgrade lifecycle.
For faster provisioning during scale-out events, prepared image specification lets teams define custom node images with preloaded containers, operating system settings, and initialization scripts, reducing startup time and improving consistency for environments requiring rapid, repeatable provisioning.
Connect with the Microsoft Azure team in Amsterdam
The Azure team will be at KubeCon + CloudNativeCon Europe 2026. Here's where to connect:
- Rules of the Road for Shared GPUs: AI Inference Scheduling at Wayve—Customer keynote, Tuesday, March 24, 2026, 9:37 AM CET
- Scaling Platform Ops with AI Agents: Troubleshooting to Remediation—Tuesday, March 24, 2026, 10:13 AM CET with Jorge Palma, Principal PDM Manager, Microsoft
- Building cross-cloud AI inference on Kubernetes with OSS—Wednesday, March 25, 2026, 1:15 PM CET with Jorge Palma, Principal PDM Manager, Microsoft and Anson Qian, Principal Software Engineer, Microsoft
- Visit booth #200 for live demos and conversations with the Azure and AKS teams
- Browse the full schedule of sessions by Microsoft speakers