Graviton5 and S3 Vectors: A 2026 Architecture Decision Guide

AWS Graviton5's M9g instances deliver 192 ARM cores, a formally verified hypervisor, and 25-35% throughput gains. S3 Vectors GA brings 2-billion-vector indexes at $0.06/GB. Here's who should move.

Admin

May 30, 2026

At re:Invent in December 2025, AWS shipped two things that will force real architecture conversations in 2026: the Graviton5 processor powering M9g EC2 instances, and the GA release of Amazon S3 Vectors. Neither is a drop-in swap for what you have today. But together they define a fairly clear migration target for two specific types of workloads: compute-heavy ARM-compatible services that care about core density, and RAG pipelines that are currently overpaying for vector storage they barely query. This piece works through the specifics of each, where the numbers hold up, and where they don't.

The backdrop matters: in the same quarter, AWS announced end-of-life timelines for over a dozen services (AWS Proton, AWS IQ, AWS Panorama, AWS Copilot CLI, and others), signaling an explicit consolidation around AI infrastructure and core compute primitives. Graviton5 and S3 Vectors sit at the center of that bet.

Graviton5: What Actually Changed

Architecture differences that matter

Graviton4 topped out at 96 cores per chip, so its largest 192-vCPU instances spanned two sockets, which introduced NUMA overhead for memory-intensive operations. Graviton5 puts all 192 Arm cores on a single 3nm socket, eliminating the cross-socket penalty. That architectural decision drives several downstream wins.

The most significant change is L3 cache. Graviton5 ships with roughly 180 MB of shared L3, 5× larger than Graviton4, which works out to approximately 2.6× more L3 cache per core than the previous generation. On the memory side, AWS runs a 12-channel DDR5 configuration delivering around 691 GB/s of aggregate memory bandwidth (the theoretical peak for 12 channels at 7200 MT/s), a 28% increase over Graviton4's 12-channel DDR5-5600 setup.

AWS claims 25% better general compute performance over Graviton4-based M8g instances, but the workload-specific numbers tell a cleaner story: up to 30% faster for database queries, 35% faster for web application workloads, and 35% faster for inference-type ML workloads. Atlassian has reported 30% better throughput on Jira with 20% lower latency. SAP is seeing 35–60% improvements on OLTP queries.
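
To see what those percentages mean for fleet size, a back-of-the-envelope sketch helps. It assumes throughput per instance scales by the quoted gain and that load spreads evenly across the fleet, which real workloads only approximate. The function name and the 100-instance fleet are illustrative, not AWS figures.

```python
import math


def instances_needed(current_instances: int, throughput_gain: float) -> int:
    """Instances required to serve the same load after a per-instance gain.

    throughput_gain is fractional: 0.30 means each new instance handles
    30% more requests than an old one. Assumes perfectly even load spread.
    """
    if current_instances < 0 or throughput_gain <= -1.0:
        raise ValueError("need a non-negative fleet and a gain above -100%")
    return math.ceil(current_instances / (1.0 + throughput_gain))


if __name__ == "__main__":
    for gain in (0.25, 0.30, 0.35):
        print(f"{gain:.0%} gain: 100 instances -> {instances_needed(100, gain)}")
```

Note that a 30% throughput gain shrinks a 100-instance fleet to 77, a 23% reduction rather than 30%, because the saving is 1 - 1/1.3.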

Inter-core communication latency drops by up to 33% compared to M8g, and the single-socket topology means that latency is consistent rather than bimodal (as it was with Graviton4's NUMA boundary).

The Nitro Isolation Engine

The other new component worth understanding is the Nitro Isolation Engine, which replaces the conventional hypervisor layer in M9g instances. Instead of a runtime hypervisor that is merely hardened against attack, AWS built a formally verified isolation layer, backed by approximately 260,000 lines of machine-checked proofs in Isabelle/HOL, that mathematically guarantees workload isolation from other tenants and from AWS operators.

In practice, this means the attack surface that classic speculative-execution attacks (Spectre, Meltdown variants) exploited, namely the shared hypervisor state, no longer exists in the conventional sense. The model is: if a property is proved, it holds for all inputs. Not "we tested it," but "it cannot be otherwise."

For regulated workloads in India (BFSI, healthcare, defence-adjacent SaaS), this is a concrete compliance argument that was not available on any prior Graviton generation.

Graviton5 also enables always-on memory encryption and per-vCPU dedicated caches, both of which reduce the risk from side-channel attacks on shared physical hardware.

Who should migrate to M9g

The honest migration criteria:

| Workload type | M9g benefit | Migration priority |
| --- | --- | --- |
| Java application servers (Spring Boot, Quarkus) | 25–35% throughput gain | High |
| MySQL / PostgreSQL / Aurora on ARM | 30% faster DB queries | High |
| Node.js / Python web services | 35% faster, same memory | High |
| Compute-intensive x86 binaries (no source) | None; recompile required | Skip unless an ARM build exists |
| GPU inference (PyTorch, TensorRT) | None; GPU-attached instances use different families | Skip |
| C-family ML training | Wait for C9g (2026 roadmap) | Hold |

M9g instances remain in public preview as of May 2026. C9g (compute-optimised) and R9g (memory-optimised) are on the 2026 roadmap but not yet GA.

Migration path for ARM-incompatible code

If your binary dependencies or container images are x86-only, the Graviton migration is gated on recompilation. The ecosystem has been ready for this for two generations: Redis, PostgreSQL, and the major Java runtimes all ship aarch64 builds, and the AWS Graviton Ready program lists validated partner software. The practical blocker is usually either a third-party C extension that hasn't been compiled for aarch64, or an older base Docker image.
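
For Python services, one quick way to find the C extensions in question is to list which installed packages ship compiled files. The following is a minimal sketch using only the standard library. The helper names are made up for illustration, and a package appearing in the output only means it contains native code, not that an aarch64 wheel is missing; you still need to check each one.

```python
import platform
from importlib import metadata
from typing import Dict, List

NATIVE_SUFFIXES = (".so", ".pyd", ".dylib")


def is_native_file(path: str) -> bool:
    """True if the path looks like a compiled extension or shared library."""
    name = path.rsplit("/", 1)[-1]
    # Versioned shared objects such as libfoo.so.1 also count.
    return name.endswith(NATIVE_SUFFIXES) or ".so." in name


def packages_with_native_code() -> Dict[str, List[str]]:
    """Map each installed distribution to the compiled files it ships."""
    found: Dict[str, List[str]] = {}
    for dist in metadata.distributions():
        native = [str(f) for f in (dist.files or []) if is_native_file(str(f))]
        if native:
            name = dist.metadata["Name"] or "unknown"
            found[name] = native
    return found


if __name__ == "__main__":
    # On Graviton under Linux this prints "aarch64"; on x86 hosts "x86_64".
    print(f"interpreter architecture: {platform.machine()}")
    for name, files in sorted(packages_with_native_code().items()):
        print(f"{name}: {len(files)} compiled file(s), e.g. {files[0]}")
```

Running this inside your current x86 image gives a shortlist of packages to verify on an arm64 build before anything is deployed.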

A staged approach works:

# Check current container arch
docker inspect --format='{{.Architecture}}' <image>

# Build multi-arch with buildx
docker buildx build --platform linux/amd64,linux/arm64 \
  -t your-registry/your-app:latest --push .

AWS Graviton4 (M8g) is already widely available. Running a parallel test fleet on M8g before committing to M9g preview reduces risk.

Amazon S3 Vectors: What GA Actually Means

The core architecture decision

S3 Vectors went GA on December 2, 2025. The premise is straightforward: instead of standing up a dedicated vector database (Pinecone, Weaviate, Qdrant) or running pgvector inside RDS, you store embeddings natively inside S3, with the similarity search index built into the storage layer.

A vector bucket is a new S3 bucket type. Within it, you create vector indexes, each of which can hold up to 2 billion vectors. Each bucket supports up to 10,000 vector indexes and up to 20 trillion total vectors. The storage model is object storage, not RAM-resident HNSW graphs, which is exactly what makes it cheap and exactly what makes it slower.
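
Those limits translate into a simple capacity-planning calculation. The sketch below works out how many indexes a corpus needs and routes a key to an index by hash. The 80% fill ratio, the `shard-` naming scheme, and hash-based routing are all assumptions made for illustration, not AWS recommendations.

```python
import hashlib
import math

MAX_VECTORS_PER_INDEX = 2_000_000_000   # per-index limit quoted above
MAX_INDEXES_PER_BUCKET = 10_000         # per-bucket limit quoted above


def indexes_required(total_vectors: int, fill_ratio: float = 0.8) -> int:
    """Number of indexes needed, keeping each below fill_ratio of the cap."""
    if total_vectors < 0 or not 0 < fill_ratio <= 1:
        raise ValueError("total_vectors must be >= 0 and 0 < fill_ratio <= 1")
    per_index = int(MAX_VECTORS_PER_INDEX * fill_ratio)
    count = max(1, math.ceil(total_vectors / per_index))
    if count > MAX_INDEXES_PER_BUCKET:
        raise ValueError("corpus exceeds what a single vector bucket can hold")
    return count


def index_for_key(key: str, index_count: int, prefix: str = "shard") -> str:
    """Stable key-to-index routing based on a SHA-256 hash of the key."""
    digest = hashlib.sha256(key.encode("utf-8")).digest()
    shard = int.from_bytes(digest[:8], "big") % index_count
    return f"{prefix}-{shard:05d}"


if __name__ == "__main__":
    n = indexes_required(5_000_000_000)
    print(n, index_for_key("doc-12345", n))
```

Hash routing spreads writes evenly, but a similarity query then has to fan out to every shard. Splitting by a natural partition such as tenant or product line lets each query hit a single index instead.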

The two supported distance metrics are cosine and euclidean. Vector dimensions must be between 1 and 4096, and only floating-point embeddings are accepted; binary quantised vectors are not currently supported.
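
It is cheap to enforce these constraints on the client before any API call is made. The following is a minimal sketch that encodes only the limits stated above; the function names are hypothetical and nothing here calls AWS.

```python
import math
from array import array
from typing import Iterable, List

ALLOWED_METRICS = {"cosine", "euclidean"}
MIN_DIM, MAX_DIM = 1, 4096


def validate_index_config(dimension: int, metric: str) -> None:
    """Raise ValueError if the index settings fall outside the stated limits."""
    if not MIN_DIM <= dimension <= MAX_DIM:
        raise ValueError(f"dimension must be in [{MIN_DIM}, {MAX_DIM}]")
    if metric not in ALLOWED_METRICS:
        raise ValueError(f"metric must be one of {sorted(ALLOWED_METRICS)}")


def to_float32_list(values: Iterable[float], dimension: int) -> List[float]:
    """Round-trip values through a float32 array and check length and finiteness."""
    packed = array("f", values)  # 'f' is a C float, 32-bit on common platforms
    if len(packed) != dimension:
        raise ValueError(f"expected {dimension} values, got {len(packed)}")
    if not all(math.isfinite(v) for v in packed):
        raise ValueError("embedding contains NaN or infinity after float32 cast")
    return packed.tolist()


if __name__ == "__main__":
    validate_index_config(768, "cosine")
    print(to_float32_list([0.5, 0.25, -1.0], 3))
```

The float32 round-trip matters if your embedding model emits float64: what you compare against locally then matches what the index actually stores.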

Pricing that changes the calculus

Storage costs $0.06/GB per month. Writes cost $0.20/GB ingested. Queries cost $2.50 per million API calls, plus a small per-TB charge depending on index size ($0.004/TB for the first 100K vectors processed per query).

To make this concrete: 400 million vectors (768-dimensional, float32) stored across 40 indexes, with 10 million queries per month, comes out to approximately $1,215/month total. A comparable Pinecone pod deployment for the same scale runs several thousand dollars per month.

AWS's stated 90% cost reduction versus dedicated vector databases holds for storage-heavy, query-light scenarios. It compresses significantly as query frequency rises, because the per-query cost of S3 Vectors is not negligible at scale.
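
To sanity-check a quote for your own corpus, the three flat-rate components above are easy to compute. This sketch deliberately covers only raw vector bytes, ingest, and the per-call query charge. It leaves out key and metadata overhead and the per-TB query processing charge, so treat the result as a floor rather than a full bill.

```python
from typing import Dict

PRICE_STORAGE_PER_GB_MONTH = 0.06   # USD, from the pricing above
PRICE_INGEST_PER_GB = 0.20          # USD, charged once per GB written
PRICE_PER_MILLION_QUERIES = 2.50    # USD, API-call component only


def vector_data_gb(num_vectors: int, dimension: int, bytes_per_value: int = 4) -> float:
    """Raw embedding bytes in decimal GB (float32 is 4 bytes per value)."""
    return num_vectors * dimension * bytes_per_value / 1e9


def estimate_costs(num_vectors: int, dimension: int, queries_per_month: int) -> Dict[str, float]:
    """Partial cost estimate: excludes metadata and per-TB query processing."""
    gb = vector_data_gb(num_vectors, dimension)
    return {
        "data_gb": gb,
        "storage_usd_per_month": gb * PRICE_STORAGE_PER_GB_MONTH,
        "ingest_usd_one_time": gb * PRICE_INGEST_PER_GB,
        "query_api_usd_per_month": queries_per_month / 1e6 * PRICE_PER_MILLION_QUERIES,
    }


if __name__ == "__main__":
    for name, value in estimate_costs(400_000_000, 768, 10_000_000).items():
        print(f"{name}: {value:,.2f}")
```

For the 400-million-vector example this gives roughly $74/month for storage, $25/month for query API calls, and a one-time $246 for ingest. Whatever sits between that floor and your actual bill scales with how much index data each query has to process, which is why the savings narrow as query volume grows.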

Latency: the honest picture

This is the part the marketing glosses over. AWS's own documentation says: infrequent queries return results in under 1 second; frequent queries return results in 100ms or less.

Community benchmarks show a wider spread. Query latency at 10K vectors is around 112ms p50; at 10 million vectors it grows to around 382ms p50. At 100 million 768-dimensional vectors, AWS reports under 40ms at p95, but that number applies to warmed-up, high-frequency access patterns.

For comparison, Pinecone delivers consistent 5–80ms latencies at similar scale, and an in-memory HNSW index (Qdrant with RAM-resident vectors) can hit 1–5ms.

| Solution | Latency (p50, ~10M vectors) | Monthly cost (400M vectors, 10M queries) | Cold-start penalty |
| --- | --- | --- | --- |
| S3 Vectors | 100–400ms | ~$1,200 | High (disk-backed) |
| Pinecone Standard | 10–80ms | ~$4,000–8,000 | Low (always warm) |
| OpenSearch (kNN) | 20–100ms | ~$2,000–4,000 | Low (RAM index) |
| pgvector (RDS) | 5–50ms | ~$1,500–3,000 | Low–medium |

S3 Vectors is not a general-purpose vector database replacement. It is a cost-effective cold or warm tier.
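
Because published numbers vary this much with access pattern, it is worth measuring against your own index. Below is a small harness sketch: `query` is any zero-argument callable you supply (for example, a wrapper around your vector store's search call), and the run counts are arbitrary defaults rather than a benchmarking standard.

```python
import statistics
import time
from typing import Callable, Dict, List, Sequence


def percentiles(latencies_ms: Sequence[float]) -> Dict[str, float]:
    """p50/p95/p99 from raw samples (needs at least two data points)."""
    cuts = statistics.quantiles(latencies_ms, n=100, method="inclusive")
    return {"p50": cuts[49], "p95": cuts[94], "p99": cuts[98]}


def measure(query: Callable[[], object], runs: int = 200, warmup: int = 0) -> Dict[str, float]:
    """Time `query` repeatedly and summarise wall-clock latency in milliseconds."""
    for _ in range(warmup):
        query()  # warm-up calls are executed but not recorded
    samples: List[float] = []
    for _ in range(runs):
        start = time.perf_counter()
        query()
        samples.append((time.perf_counter() - start) * 1000.0)
    return percentiles(samples)


if __name__ == "__main__":
    # Stand-in workload; replace the lambda with a real search call.
    print(measure(lambda: sum(range(10_000)), runs=50))
```

Running it once with `warmup=0` after a quiet period and again with a generous warm-up gives a rough view of the cold and warm ends of the range.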

The two use cases it actually fits

1. Bulk RAG knowledge bases with infrequent retrieval. If you're building a Bedrock Knowledge Base over a large document corpus (product manuals, support articles, policy documents) where a user queries maybe once per session and latency of 100–400ms is acceptable, S3 Vectors is a clear win. AWS natively supports S3 Vectors as a backend for Amazon Bedrock Knowledge Bases. You get production-scale retrieval without managing a vector database cluster.

import boto3

client = boto3.client("s3vectors")

# Create a vector index
client.create_index(
    vectorBucketName="my-rag-vectors",
    indexName="product-docs",
    dataType="float32",
    dimension=1536,
    distanceMetric="cosine",
)

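# Query vectors. embedding_list is a placeholder here: in practice it is the
# output of your embedding model, with the same dimension as the index.
embedding_list = [0.1] * 1536
response = client.query_vectors(
    vectorBucketName="my-rag-vectors",
    indexName="product-docs",
    queryVector={"float32": embedding_list},
    topK=10,
)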

2. Tiered vector storage (archive + active). A common pattern emerging in 2026 is tiering: maintain a hot layer (OpenSearch Service, pgvector, or Qdrant) for the most recently-accessed or highest-priority embeddings, and offload everything else to S3 Vectors. When a query misses the hot layer, fall back to S3 Vectors. This pattern keeps memory costs down while preserving sub-100ms latency for frequent queries.

# Conceptual tiered vector architecture
hot_layer:
  store: opensearch-serverless
  index: recent-30-days
  latency_target_p99: 50ms
  retention: 30 days

cold_layer:
  store: s3-vectors
  index: full-corpus
  latency_target_p95: 500ms
  storage_cost_per_billion_vectors: "~$184/month (768-dim float32 at $0.06/GB, vector data only)"
  retention: indefinite
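
The routing logic for this pattern can be sketched in a few lines. The version below takes the two search backends as plain callables so it stays independent of any particular client library. The definition of a "miss" (fewer than `top_k` hot hits within a distance threshold) and the 0.35 default are assumptions to tune, and it presumes both tiers use the same embeddings and distance metric so that distances are comparable.

```python
from typing import Callable, Dict, List, Tuple

Hit = Tuple[str, float]  # (vector key, distance; lower means closer)
SearchFn = Callable[[List[float], int], List[Hit]]


def tiered_search(
    query_vector: List[float],
    top_k: int,
    hot_search: SearchFn,
    cold_search: SearchFn,
    max_hot_distance: float = 0.35,
) -> Tuple[str, List[Hit]]:
    """Try the hot tier first and fall back to the cold tier on a miss.

    Returns the tier that satisfied the query plus the ranked hits.
    """
    hot_hits = [h for h in hot_search(query_vector, top_k) if h[1] <= max_hot_distance]
    if len(hot_hits) >= top_k:
        return "hot", hot_hits[:top_k]

    cold_hits = cold_search(query_vector, top_k)
    # Merge both tiers, keep the closest distance per key, then re-rank.
    best: Dict[str, float] = {}
    for key, dist in hot_hits + cold_hits:
        if key not in best or dist < best[key]:
            best[key] = dist
    merged = sorted(best.items(), key=lambda kv: kv[1])[:top_k]
    return "cold", merged


if __name__ == "__main__":
    hot = lambda v, k: [("recent-1", 0.10), ("recent-2", 0.90)]
    cold = lambda v, k: [("old-7", 0.05), ("recent-1", 0.10)]
    print(tiered_search([0.0, 1.0], 2, hot, cold))
```

Returning the tier name alongside the hits makes it easy to track the hot-layer hit rate, which is the number that tells you whether the 30-day window is sized correctly.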

What S3 Vectors does not replace

Sub-10ms vector lookups for real-time recommendation engines, fraud scoring, or semantic ranking inside search results all require an in-memory index. S3 Vectors is disk-backed object storage with a search index layer; it cannot match the latency profile of a RAM-resident HNSW graph.

S3 Vectors also does not support hybrid search (keyword + vector combined in a single query). If your RAG pipeline depends on BM25 + dense retrieval fusion, you still need OpenSearch or a dedicated vector DB that exposes both.

Binary quantised vectors (popular for memory efficiency at extreme scale) are unsupported. If you're using a quantised embedding model, you'd need to store full float32 vectors in S3 Vectors.
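
The three exclusions above can be folded into a quick fit check to run during design review. This is only a sketch of the decision rule as this article states it. The class and function names are invented, and the 100ms floor is taken from the warm-path latency figure quoted earlier, not from an AWS guarantee.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class RetrievalNeeds:
    latency_slo_ms: float              # latency target for a single lookup
    needs_hybrid_search: bool = False  # keyword + vector in one query
    needs_binary_vectors: bool = False # binary-quantised embeddings


def s3_vectors_blockers(needs: RetrievalNeeds, latency_floor_ms: float = 100.0) -> List[str]:
    """Return the reasons S3 Vectors is a poor fit; an empty list means none found."""
    blockers: List[str] = []
    if needs.latency_slo_ms < latency_floor_ms:
        blockers.append(f"latency SLO under {latency_floor_ms:.0f}ms needs an in-memory index")
    if needs.needs_hybrid_search:
        blockers.append("hybrid keyword + vector search is not supported")
    if needs.needs_binary_vectors:
        blockers.append("binary quantised vectors are not supported")
    return blockers


if __name__ == "__main__":
    print(s3_vectors_blockers(RetrievalNeeds(latency_slo_ms=400)))
    print(s3_vectors_blockers(RetrievalNeeds(latency_slo_ms=10, needs_hybrid_search=True)))
```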

The Architectural Read-Through

The combination of Graviton5 and S3 Vectors sketches a fairly specific AWS-native AI stack:

  • Inference service: M9g (Graviton5), with 35% faster ML workloads, formally verified isolation, and lower cost per vCPU-hour than comparable x86
  • Vector retrieval (bulk/archive): S3 Vectors, with 2 billion vectors per index, $0.06/GB/month, and native Bedrock integration
  • Vector retrieval (hot path): OpenSearch Serverless or pgvector, still required for sub-100ms SLAs
  • Orchestration: Bedrock Agents, Lambda on ARM

This stack does not eliminate the need for dedicated vector infrastructure in latency-sensitive paths. What it does is make the bulk-storage problem significantly cheaper, which changes the ROI calculation for large knowledge base applications.

AWS's simultaneous retirement of services like Proton, IQ, and Panorama signals a portfolio consolidation around AI-native building blocks: Bedrock, Graviton, and S3 with built-in intelligence. The message is that AWS intends to commoditise the infrastructure layer of the AI stack, pushing differentiation up toward application logic.

Takeaways

  • M9g (Graviton5) is worth evaluating now for Java, Python, and Node.js application servers: 25–35% throughput gains at the same or lower cost per hour. Gated on ARM-compatible binaries; check your C extensions and base images before committing.
  • The Nitro Isolation Engine is a first-of-its-kind formally verified hypervisor. For compliance-heavy workloads in BFSI or healthcare, it's a stronger isolation guarantee than any prior EC2 generation.
  • S3 Vectors is a storage tier, not a database replacement. At $0.06/GB/month and 2 billion vectors per index, it materially changes the cost of large RAG knowledge bases. It does not remove the need for an in-memory index where sub-100ms latency is required.
  • The 90% cost reduction claim holds at low query frequencies: archival corpora, document knowledge bases, periodic batch similarity jobs. It does not hold for real-time recommendation or fraud-scoring pipelines.
  • The tiered vector pattern is worth implementing now: hot path on OpenSearch/pgvector (last 30 days or high-access), cold path on S3 Vectors (full history). AWS has no managed tiering service yet, so you build the routing logic.
  • C9g and R9g instances are on the 2026 roadmap but not yet available. If your primary workload is compute-bound (video transcoding, ML training) or memory-bound (large in-memory caches), wait for those families or run M9g preview tests first.
  • M9g remains in public preview as of May 2026. Production workloads should run parallel testing before full migration.