GKE Agent Sandbox GA and AKS Ubuntu 24.04: What Changes

GKE Agent Sandbox hit GA with 300 sandboxes/s throughput; AKS defaulted to Ubuntu 24.04 with kernel 6.8 and containerd 2.0. Here's what changes for AI inference node pools in 2026.

Admin

May 30, 2026

When Google announced that GKE Agent Sandbox hit general availability on May 21, 2026, the headline number — 300 sandboxes per second per cluster at sub-200ms p90 — landed differently depending on where you sit. If you're running inference pipelines that spin up one long-lived pod per user request, the number sounds like marketing. If you're running AI coding assistants, agentic pipelines, or short-lived tool-call workers at scale, it changes your node-pool math entirely. Lovable reportedly runs roughly 200,000 projects per day on the new sandbox. That is not a toy workload.

Meanwhile, AKS quietly flipped a default that matters for every team baking custom node images: Ubuntu 24.04 is now the default OS for --os-sku Ubuntu on Kubernetes 1.35 and later, per the AKS release notes published January 4, 2026. That change pulls in Linux kernel 6.8, containerd 2.0, and a glibc bump that breaks older CUDA driver installer scripts. Neither of these announcements is difficult in isolation. Together, they define the platform engineering agenda for the rest of 2026 for teams running AI workloads on managed Kubernetes.

GKE Agent Sandbox: What Changed and Why Throughput Matters

Before Agent Sandbox, spinning up isolated execution environments for AI agents on GKE meant either accepting VM-level isolation costs (slow, expensive) or accepting shared-kernel container risks (fast, but the isolation story falls apart when your agent can run arbitrary code). gVisor gave you a middle ground but not at the throughput you need for short-lived, bursty agent workloads.

Agent Sandbox went into preview at KubeCon NA in November 2025 and reached GA six months later — with 16× growth in sandbox usage between preview and GA. That rate of adoption from a preview product, on a feature that requires deliberate opt-in, tells you this is not an edge case. Teams building agentic systems had been waiting for exactly this.

The 300/s Number in Context

Three hundred sandboxes per second is a cluster-level ceiling under ideal conditions: same-zone nodes, pre-warmed pools, well-tuned resource requests. Your real throughput depends on:

  • Node pool warm depth (how many sandbox-class nodes are already running and idle)
  • Sandbox spec size (CPU/memory requests per sandbox)
  • Network and scheduler overhead under concurrent burst

If you are building a system where agents run for 2–30 seconds each and you're targeting 10,000 concurrent users, 300/s throughput means you can absorb a burst of 18,000 new agent starts in a minute before queuing becomes visible. That is meaningful for most teams — but don't plan capacity at the theoretical max. Target 40–60% of peak throughput as your sustained planning number until you've profiled your own workload.
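The arithmetic behind those figures can be written down as a minimal sketch. The 300/s ceiling, the 40 to 60% planning band, and the 2 to 30 second agent lifetime come from the text above; the `BurstPlan` and `plan_burst` names are hypothetical, and the in-flight estimate uses Little's law (arrival rate times time in system), which only holds for a steady arrival rate.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class BurstPlan:
    peak_starts_per_minute: int
    sustained_rate_low: float    # sandboxes/s at the low planning percentage
    sustained_rate_high: float   # sandboxes/s at the high planning percentage
    in_flight_low: float         # concurrent sandboxes: short agents, low rate
    in_flight_high: float        # concurrent sandboxes: long agents, high rate


def plan_burst(peak_per_s: int = 300,
               plan_pct: tuple = (40, 60),
               agent_seconds: tuple = (2, 30)) -> BurstPlan:
    """Turn the headline rate into planning numbers (illustrative only)."""
    low = peak_per_s * plan_pct[0] / 100
    high = peak_per_s * plan_pct[1] / 100
    return BurstPlan(
        peak_starts_per_minute=peak_per_s * 60,
        sustained_rate_low=low,
        sustained_rate_high=high,
        # Little's law: in-flight work = arrival rate x time in system.
        in_flight_low=low * agent_seconds[0],
        in_flight_high=high * agent_seconds[1],
    )


plan = plan_burst()
print(plan.peak_starts_per_minute)                       # 18000
print(plan.sustained_rate_low, plan.sustained_rate_high)  # 120.0 180.0
print(plan.in_flight_low, plan.in_flight_high)            # 240.0 5400.0
```

Under these assumptions the steady-state sandbox count ranges from a few hundred to a few thousand, and that range is the number to carry into node-pool sizing.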

Node Pool Sizing for Burst Agentic Workloads

The operational change Agent Sandbox introduces is that you now need to reason about sandbox density per node rather than pod density. Each sandbox carries a fixed overhead from the sandbox runtime layer. A node that previously ran 80 standard containers might run 40–50 sandboxes, depending on sandbox size.
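As a rough sketch of that density reasoning, the function below divides a node's allocatable CPU and memory by the per-sandbox request plus a fixed runtime overhead. Every number in the example is a placeholder: the allocatable figures, the sandbox requests, and the overhead values are assumptions chosen for illustration, not measured Agent Sandbox costs, and the function names are hypothetical.

```python
import math


def sandboxes_per_node(node_cpu_m: int, node_mem_mib: int,
                       sandbox_cpu_m: int, sandbox_mem_mib: int,
                       overhead_cpu_m: int, overhead_mem_mib: int) -> int:
    """Sandboxes that fit on one node once per-sandbox overhead is included."""
    per_cpu = sandbox_cpu_m + overhead_cpu_m
    per_mem = sandbox_mem_mib + overhead_mem_mib
    # The scarcer of the two resources sets the density.
    return min(node_cpu_m // per_cpu, node_mem_mib // per_mem)


def nodes_needed(concurrent_sandboxes: int, density: int) -> int:
    """Nodes required to hold a given number of concurrent sandboxes."""
    return math.ceil(concurrent_sandboxes / density)


# Assumed allocatable capacity for an 8 vCPU / 32 GB node (placeholder values).
density = sandboxes_per_node(
    node_cpu_m=7900, node_mem_mib=28000,
    sandbox_cpu_m=150, sandbox_mem_mib=512,   # hypothetical sandbox requests
    overhead_cpu_m=20, overhead_mem_mib=64,   # hypothetical runtime overhead
)
print(density)                       # 46
print(nodes_needed(2000, density))   # 44
```

Replace the placeholders with the allocatable values reported by your own nodes and the overhead you observe under load; the shape of the calculation stays the same.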

The right approach is a dedicated sandbox node pool — separate from your inference serving pool — with aggressive autoscaling configured for fast scale-up and a generous cooldown:

apiVersion: container.cnrm.cloud.google.com/v1beta1
kind: ContainerNodePool
metadata:
  name: agent-sandbox-pool
spec:
  clusterRef:
    name: my-cluster
  location: us-central1
  autoscaling:
    minNodeCount: 2
    maxNodeCount: 50
    locationPolicy: BALANCED
  nodeConfig:
    sandboxConfig:
      sandboxType: gvisor
    machineType: n2-standard-8
    oauthScopes:
      - https://www.googleapis.com/auth/cloud-platform

Keep at least 2 nodes warm at all times. Cold-start latency from zero is where your p90 blows past the 200ms target. The 200ms figure assumes the node is already scheduled — it does not include Kubernetes node provisioning time.
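To see why cold starts dominate the tail, here is a small sketch that computes a nearest-rank p90 over a mix of warm and cold start latencies. The latency values (150 ms warm, 45 s for a start that waited on node provisioning) are invented for illustration, and `percentile` is a hypothetical helper, not part of any GKE tooling.

```python
import math


def percentile(samples: list, pct: int) -> float:
    """Nearest-rank percentile: smallest sample with at least pct% at or below it."""
    if not samples:
        raise ValueError("need at least one sample")
    ordered = sorted(samples)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


# Hypothetical latencies in milliseconds.
mostly_warm = [150.0] * 95 + [45_000.0] * 5    # 5% of starts hit a cold node
cold_heavy = [150.0] * 85 + [45_000.0] * 15    # 15% of starts hit a cold node

print(percentile(mostly_warm, 90))   # 150.0
print(percentile(cold_heavy, 90))    # 45000.0
```

Once more than 10% of starts in a window wait on node provisioning, p90 is the provisioning time rather than the sandbox start time, which is the case the warm minimum is there to prevent.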

Isolation Guarantees and What They Actually Mean

Agent Sandbox places a user-space kernel between the agent and the host. The agent's syscalls are intercepted and handled in user space rather than going directly to the host kernel. This is not full VM isolation: a determined attacker who finds a bug in the sandbox kernel can still attempt escape. But the attack surface is dramatically smaller than a shared-kernel container. For workloads where the agent executes user-supplied code (coding assistants, REPL environments, tool-call sandboxes), this is the right trade-off. You get isolation that is meaningfully stronger than namespaced containers at a cost that is meaningfully lower than a VM per agent.

AKS Ubuntu 24.04 as Default: The Image Pipeline Implications

For AKS clusters on Kubernetes 1.35+, --os-sku Ubuntu now provisions Ubuntu 24.04 (Noble Numbat) nodes by default. If you are building or maintaining custom node images — packer templates, golden AMI equivalents, daemonset-based driver installers — this change requires action before you upgrade to 1.35.

Kernel 6.8 and CUDA Driver Compatibility

Ubuntu 24.04 ships with Linux kernel 6.8. For GPU-attached node pools running NVIDIA workloads, kernel 6.8 breaks installer scripts that hard-code kernel header paths from the 5.15 era (Ubuntu 22.04 default). The practical fix is moving to the NVIDIA GPU Operator rather than hand-rolled daemonset installs:

helm repo add nvidia https://helm.ngc.nvidia.com/nvidia
helm repo update

helm install gpu-operator nvidia/gpu-operator \
  --namespace gpu-operator \
  --create-namespace \
  --set driver.enabled=true \
  --set toolkit.enabled=true \
  --set operator.defaultRuntime=containerd

The GPU Operator detects the running kernel at install time and pulls compatible driver modules. This is the approach that survives OS upgrades cleanly. If you're still running a custom daemonset that does apt-get install linux-headers-$(uname -r) and then runs the CUDA .run installer, plan a migration sprint before you hit 1.35.

containerd 2.0 on Ubuntu 24.04 Nodes

AKS Ubuntu 24.04 defaults to containerd 2.0. The headline compatibility concern is the CRI API version: containerd 2.0 drops the v1alpha2 CRI endpoint that some older Kubernetes tooling assumes. On a managed AKS cluster, Kubernetes 1.35 is paired with containerd 2.0 by design — the AKS team has validated this combination. The issue surfaces in your own tooling:

  • Custom node-problem-detector configs that scrape containerd socket paths
  • Falco rules written against containerd 1.x event schemas
  • Image pre-pull scripts that call crictl with flags deprecated in containerd 2.0

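Run crictl --version on a test 1.35 node before migrating production pools, and crictl version to see which runtime name and version the node actually reports. If your CI validation scripts call crictl pull or crictl images, test them explicitly rather than assuming the output they parse is unchanged.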
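A first pass at finding affected tooling can be a plain text scan of your scripts and configs. The sketch below flags lines that mention the v1alpha2 CRI package, the containerd socket, or crictl. The pattern list is illustrative and certainly incomplete, and `RISK_PATTERNS`, `audit_text`, and the sample script are all hypothetical.

```python
import re

# Strings worth a manual look before moving to containerd 2.0 nodes.
RISK_PATTERNS = {
    "cri-v1alpha2": re.compile(r"runtime\.v1alpha2"),
    "containerd-socket": re.compile(r"containerd\.sock"),
    "crictl": re.compile(r"\bcrictl\b"),
}


def audit_text(name: str, text: str) -> list:
    """Return (file name, line number, pattern label) for each flagged line."""
    findings = []
    for lineno, line in enumerate(text.splitlines(), start=1):
        for label, pattern in RISK_PATTERNS.items():
            if pattern.search(line):
                findings.append((name, lineno, label))
    return findings


# Hypothetical pre-pull script used only to exercise the scanner.
sample = """#!/bin/sh
crictl --runtime-endpoint unix:///run/containerd/containerd.sock pull nginx
# client code imports runtime.v1alpha2.RuntimeService
echo done
"""

for finding in audit_text("prepull.sh", sample):
    print(finding)
```

A hit is a prompt to read the tool's own compatibility notes, not proof of breakage. In practice you would feed this function the contents of each file in your node bootstrap and CI repositories.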

glibc and Python/ML Library Compatibility

Ubuntu 24.04 ships glibc 2.39, up from 2.35 on Ubuntu 22.04. For most workloads this is invisible. For AI/ML teams, the issue is pre-built Python wheels: some older versions of PyTorch, JAX, and their dependencies were compiled against the 22.04-era toolchain and can fail at runtime on 24.04 nodes with symbol-version errors such as GLIBCXX_3.4.30 not found if you're pinning to old wheel versions. Note that GLIBCXX_* symbols belong to libstdc++, the C++ runtime, rather than to glibc itself, so check both libraries when you debug one of these failures. The fix is straightforward: pin to wheel versions built for manylinux_2_28 or newer. But you need to discover this in staging, not production.

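# Check the glibc version shipped in the Ubuntu 24.04 (noble) userland.
# This reports the glibc inside the ubuntu:noble container image, not the node's own.
kubectl run glibc-check --image=ubuntu:noble --restart=Never --rm -it \
  -- ldd --version

# Inside a built image: print the PyTorch version and the libc the interpreter reports
python3 -c "import torch; print(torch.__version__); import platform; print(platform.libc_ver())"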
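The manylinux tag in a wheel's filename states the oldest glibc the wheel claims to support, so it can be compared against what the runtime reports. The sketch below parses that tag and compares it with `platform.libc_ver()`. The legacy alias table follows PEP 600; the helper names and the example filenames are hypothetical, and a passing check says nothing about libstdc++ symbols.

```python
import platform
import re
from typing import Optional, Tuple

# glibc floor implied by the legacy manylinux aliases (PEP 600).
LEGACY_GLIBC = {"manylinux1": (2, 5), "manylinux2010": (2, 12), "manylinux2014": (2, 17)}
_TAG = re.compile(r"manylinux_(\d+)_(\d+)_")


def wheel_glibc_floor(wheel_filename: str) -> Optional[Tuple[int, int]]:
    """Lowest glibc (major, minor) named in the wheel's platform tag, if any."""
    platform_tag = wheel_filename.removesuffix(".whl").rsplit("-", 1)[-1]
    floors = [(int(a), int(b)) for a, b in _TAG.findall(platform_tag)]
    floors += [v for alias, v in LEGACY_GLIBC.items() if alias + "_" in platform_tag]
    return min(floors) if floors else None


def runtime_glibc() -> Optional[Tuple[int, int]]:
    """glibc version of the running interpreter, or None on non-glibc systems."""
    name, version = platform.libc_ver()
    if name != "glibc" or not version:
        return None
    major, minor = version.split(".")[:2]
    return int(major), int(minor)


def floor_satisfied(floor: Tuple[int, int], runtime: Tuple[int, int]) -> bool:
    """A wheel's glibc floor is met when the runtime glibc is the same or newer."""
    return runtime >= floor


# Hypothetical wheel filenames.
print(wheel_glibc_floor("examplepkg-1.0.0-cp311-cp311-manylinux_2_28_x86_64.whl"))
print(wheel_glibc_floor("examplepkg-0.9.0-cp39-cp39-manylinux2014_x86_64.whl"))
print(floor_satisfied((2, 28), (2, 39)))
```

Run inside the image you actually deploy, `runtime_glibc()` gives the second argument to `floor_satisfied`. A wheel with no manylinux tag (for example a locally built `linux_x86_64` wheel) returns `None` and deserves a closer look.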

The Retirement Clock: Ubuntu 22.04 on AKS

Microsoft has published a clear schedule for Ubuntu 22.04 on AKS:

| Milestone | Date |
|---|---|
| No new node images / no security patches | June 30, 2027 |
| Node image removal | April 30, 2028 |

This is not an aggressive timeline: you have roughly 13 months until you stop receiving security patches on 22.04 nodes. But the real pressure is earlier, because any cluster upgrade to Kubernetes 1.35 will land you on Ubuntu 24.04 by default. If your upgrade cadence targets N-1, where N is the latest Kubernetes minor (currently 1.35), your next minor upgrade is the one that changes the OS, so you are essentially already on the 24.04 migration path.
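Working backward from those dates is simple date arithmetic. The sketch below uses the two milestones from the schedule and this article's publication date; the 12-week buffer and the function names are assumptions for illustration.

```python
from datetime import date, timedelta

PATCH_CUTOFF = date(2027, 6, 30)    # Ubuntu 22.04 on AKS: last images and patches
IMAGE_REMOVAL = date(2028, 4, 30)   # Ubuntu 22.04 on AKS: node image removal


def days_until(deadline: date, today: date) -> int:
    """Calendar days remaining before a deadline."""
    return (deadline - today).days


def upgrade_target(deadline: date, buffer_weeks: int) -> date:
    """Latest date to finish the upgrade while keeping the chosen buffer."""
    return deadline - timedelta(weeks=buffer_weeks)


published = date(2026, 5, 30)
print(days_until(PATCH_CUTOFF, published))   # 396
print(upgrade_target(PATCH_CUTOFF, 12))      # 2027-04-07
```

Pass `date.today()` instead of the publication date to get a live countdown, and size the buffer to match however long your last node OS migration actually took.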

GKE vs AKS: Sandbox Throughput and Node OS GA Status

| Dimension | GKE | AKS |
|---|---|---|
| Sandbox throughput (GA) | 300 sandboxes/s per cluster, p90 < 200ms (Agent Sandbox, GA May 21 2026) | No equivalent managed sandbox primitive at GA; VM-based node isolation via Confidential VMs |
| Default node OS (latest K8s minor) | Container-Optimized OS (COS) by default; gVisor-backed Agent Sandbox as opt-in pool type | Ubuntu 24.04 (Noble) default for --os-sku Ubuntu on K8s 1.35+ (GA Jan 2026) |

This table is not a scorecard: both platforms have real trade-offs. GKE's Agent Sandbox is purpose-built for the agentic workload pattern, which AKS has no direct answer to yet at the managed-service level. AKS's Ubuntu 24.04 default is a sensible modernization, but it puts more migration work on platform teams in the near term.

When to Upgrade and When to Wait

Not every team needs to act immediately on either of these changes.

Upgrade GKE to use Agent Sandbox now if:
- You run AI coding assistants, REPL workers, or tool-call executors where each user action spawns a short-lived, potentially code-executing container
- Your current isolation approach is either too slow (VMs per request) or too weak (shared-kernel containers running user code)
- You are building new capacity and can start with sandbox node pools from day one

Wait on GKE Agent Sandbox if:
- Your agents are long-running (minutes to hours) — the sandbox overhead becomes negligible at that timescale and the throughput advantage is irrelevant
- Your workload is pure inference serving with no code execution: standard GPU node pools with gVisor off are more efficient

Upgrade AKS to Ubuntu 24.04 now if:
- You are on Kubernetes 1.34 and planning a 1.35 upgrade in the next quarter — bake the OS migration into that upgrade rather than treating it as a separate workstream
- Your GPU driver install uses the NVIDIA GPU Operator (safe to upgrade)
- You have tested your Python/ML wheels against glibc 2.39

Wait on AKS Ubuntu 24.04 if:
- You have unresolved CUDA driver compatibility with kernel 6.8 (test first in a scratch cluster)
- You have hard dependencies on specific glibc 2.35-linked native extensions with no newer wheels available
- You haven't validated your Falco or security tooling against containerd 2.0 schemas

What to Do This Quarter

  • GKE + Agent Sandbox: Create a dedicated sandbox node pool with pre-warmed minimum nodes (≥2). Write a synthetic load test that hits your target sandbox/second rate at 50% of peak and measure real p90 — do not assume you'll see the 300/s headline number. Size your autoscaler based on measured burst, not the GA blog post.

  • AKS image pipelines: Stand up a 1.35 test cluster today with Ubuntu 24.04 nodes. Run your full node bootstrap sequence — driver installs, daemonsets, custom init containers — and capture any failures. Fix them before your next production upgrade window.

  • containerd 2.0 audit: List every tool that talks directly to the containerd socket or CRI endpoint. Check each against containerd 2.0 release notes. This includes Falco, any custom CNI plugins, image pre-pull controllers, and node-problem-detector.

  • Python wheel pinning: For AKS 24.04 nodes, regenerate your base inference images from scratch against a current base, such as python:3.11-slim-bookworm (a Debian 12 image) or an Ubuntu 24.04 image, and re-resolve your PyTorch/JAX dependency locks. Catch the glibc mismatch in the image build, not at pod startup.

  • Retirement calendar: Add June 30, 2027 (AKS Ubuntu 22.04 patch cutoff) to your infrastructure review calendar now. Work backward from that date to set a K8s 1.35 upgrade target that leaves a comfortable buffer.

The node OS and sandbox primitives that ship in H1 2026 set the foundation for AI inference capacity through 2027. The migration work is front-loaded and tedious. The alternative — scrambling when your 22.04 nodes stop receiving patches or your burst agentic workload hits scheduler queuing limits — is worse.
