NVIDIA Vera: The First CPU Designed for a World Where AI Agents Outnumber Humans

NVIDIA unveils the Vera CPU with 88 custom Olympus cores and 1.2 TB/s memory bandwidth, purpose-built for agentic AI workloads. Why the future of AI infrastructure depends on CPUs, not just GPUs.


For the better part of a decade, the artificial intelligence hardware conversation has been dominated by a single component: the GPU. NVIDIA's graphics processors — from the V100 to the A100 to the H100 to Blackwell — have been the limiting reagent of AI progress. Companies bid billions for GPU allocations. Governments negotiated chip export controls as matters of national security. The question "How many GPUs do you have?" became the primary measure of an AI company's competitive position.

Then, in early April 2026, NVIDIA made an announcement that quietly inverted this narrative. The company unveiled Vera — not a GPU, but a CPU. A custom-designed, Arm-based processor with 88 cores, 1.2 terabytes per second of memory bandwidth, and an architecture optimized not for training neural networks but for orchestrating the AI agents that run on them.

The announcement received modest coverage relative to NVIDIA's GPU launches. Market analysts focused on the Rubin GPU that Vera is designed to accompany. Tech journalists wrote up the specifications and moved on. But within the data center engineering community, Vera represents something more significant than another chip announcement: it is NVIDIA's bet that the bottleneck in AI infrastructure is about to shift from compute to orchestration — and that the era of the GPU as the sole center of gravity in AI hardware is ending.

Why a GPU Company Needs Its Own CPU

To understand why NVIDIA built Vera, you have to understand the architecture of an AI data center and where the latency actually lives.

In the training phase of AI development — the process of running trillions of mathematical operations over terabytes of data to produce a model — the GPU is unambiguously the bottleneck. Training a frontier model like Claude 4 or GPT-5 requires tens of thousands of GPUs running in parallel for weeks or months. The GPU's massive parallelism, its specialized tensor cores, and its high-bandwidth memory are all essential for this workload.

But training is a one-time cost. The ongoing cost — the operational expenditure that determines whether an AI business is profitable — is inference: running the trained model to generate outputs in response to user queries. And inference, particularly agentic inference, has a fundamentally different computational profile than training.

The Agentic Workload Problem

A traditional AI inference request — "Translate this sentence" or "Summarize this document" — is a single-shot GPU operation. The request arrives, the GPU processes it in milliseconds, and the response is returned. The CPU's role is minimal: receive the request, route it to the GPU, collect the output, send it back.

Agentic AI requests are categorically different. An autonomous agent executing a multi-step workflow — "Research the competitive landscape for our Q3 product launch, draft a strategy memo, create supporting charts, and schedule a review meeting" — involves dozens of individual reasoning steps, each of which requires:

  1. GPU inference to generate the model's next thought or action
  2. CPU processing to execute the action (query a database, call an API, run a computation, parse a response)
  3. Memory access to retrieve context from the agent's working memory
  4. Network I/O to communicate with external services
  5. Orchestration logic to decide what to do next based on the result
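The five steps above can be sketched as a minimal orchestration loop. Everything here is illustrative, not a real API: `model(context)` stands in for one GPU inference step that returns a JSON-encoded action, and `tools` is a dict of CPU-side functions.

```python
import json

def run_agent(task, model, tools, max_steps=20):
    """Minimal agent loop: alternate GPU inference with CPU-side tool work.

    `model` and `tools` are hypothetical stand-ins: model(context) runs one
    GPU inference step and returns a JSON action; tools[name](args) executes
    the named tool on the CPU.
    """
    context = [{"role": "user", "content": task}]
    for _ in range(max_steps):
        # Step 1: GPU inference generates the next thought/action
        action = json.loads(model(context))
        # Step 5: orchestration logic decides whether to stop
        if action["type"] == "finish":
            return action["answer"]
        # Steps 2-4: CPU work — execute the tool (DB query, API call,
        # computation), then parse and store the result
        result = tools[action["tool"]](action.get("args", {}))
        # Fold the observation back into working memory for the next step
        context.append({"role": "tool", "content": str(result)})
    raise RuntimeError("agent did not finish within max_steps")
```

Note that the GPU touches only one line of this loop; everything else is the CPU-bound orchestration work the article describes.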

In this workflow, the GPU is idle for the majority of the elapsed time. It generates a token sequence in milliseconds, then waits while the CPU executes the tool call, retrieves the result, constructs the next prompt, and sends it back to the GPU. AMD engineers have estimated that in agentic workloads, the CPU-side processing accounts for 60-80% of total latency.

```mermaid
gantt
    title Latency Breakdown: Agentic AI Request
    dateFormat X
    axisFormat %s

    section GPU Work
    Token Generation Step 1      :0, 50
    Token Generation Step 2      :200, 250
    Token Generation Step 3      :450, 500
    Token Generation Step 4      :700, 750

    section CPU Work
    Parse Output + Tool Call     :50, 200
    Execute Tool + Process Result :250, 450
    Context Assembly + Routing   :500, 700
    Final Response Assembly      :750, 850
```

This is the problem Vera was designed to solve. In a world where AI workloads are shifting from single-shot inference to multi-step agentic orchestration, the CPU becomes as important as the GPU — and a commodity server CPU designed for general-purpose computing is woefully inadequate for the job.

Inside Vera: The Architecture of an Agentic CPU

Vera is not a repurposed data center CPU with AI marketing. It is a ground-up design, built on lessons learned from NVIDIA's experience watching CPU bottlenecks constrain the performance of its GPU-accelerated data centers.

The Olympus Core

Vera's compute engine consists of 88 custom "Olympus" cores based on the Armv9.2 architecture. Each core supports two threads via NVIDIA's Spatial Multithreading technology, yielding 176 hardware threads per socket.

The core design prioritizes two characteristics that are essential for agentic workloads:

High single-threaded performance. Unlike GPU cores, which achieve performance through massive parallelism over simple operations, the Olympus cores are designed for complex, branching, sequential logic — the kind of computation that dominates agent orchestration (parsing JSON responses, executing conditional workflows, managing state machines).

Low and predictable latency. Agent orchestration requires deterministic timing. When an agent decides to call an external API, the CPU must construct the request, manage the network connection, and process the response with minimal jitter. Traditional data center CPUs, designed for throughput over predictability, often show wide variance in per-request latency.

The Memory Subsystem

Vera's most impressive specification is its memory subsystem: LPDDR5X memory providing up to 1.2 TB/s of bandwidth and up to 1.5 TB of capacity per socket, connected via SOCAMM modules.

| Specification | Vera CPU | Intel Xeon w9-3595X | AMD EPYC 9755 |
| --- | --- | --- | --- |
| Cores | 88 | 60 | 128 |
| Threads | 176 | 120 | 256 |
| Memory Bandwidth | 1.2 TB/s | 0.31 TB/s | 0.46 TB/s |
| Memory Capacity (max) | 1.5 TB | 4 TB | 3 TB |
| NVLink-C2C | 1.8 TB/s | N/A | N/A |
| Architecture | Armv9.2 | x86-64 | x86-64 |

The 1.2 TB/s bandwidth figure is roughly 3-4x what competing data center CPUs deliver. This bandwidth is critical for agentic workloads because agent orchestration involves constant context switching — loading and unloading conversation history, skill metadata, tool outputs, and agent state from memory. A memory-starved CPU becomes the dominant bottleneck in multi-agent systems where dozens of concurrent agents share a single processor.
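A back-of-envelope calculation shows how bandwidth translates into concurrent-agent headroom. The per-step context size and step rate below are assumed round figures for illustration, not measured numbers:

```python
# Rough ceiling: how many concurrent agents can a CPU feed before memory
# bandwidth saturates? Both workload figures below are assumptions.
CONTEXT_BYTES = 2 * 1024**2    # ~2 MB of working context per agent step (assumed)
STEPS_PER_SEC = 50             # orchestration steps per agent per second (assumed)
bytes_per_agent = CONTEXT_BYTES * STEPS_PER_SEC * 2   # read it in, write it back

for name, bw_tb in [("Vera (1.2 TB/s)", 1.2), ("typical x86 (0.4 TB/s)", 0.4)]:
    agents = (bw_tb * 1e12) / bytes_per_agent
    print(f"{name}: ~{agents:,.0f} agents before bandwidth saturates")
```

Whatever the exact workload figures, the ratio is what matters: 3x the bandwidth means roughly 3x the concurrent agents per socket before memory becomes the bottleneck.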

NVLink-C2C: The CPU-GPU Bridge

Perhaps the most architecturally significant feature of Vera is its NVLink-C2C interconnect, which provides 1.8 TB/s of coherent bandwidth between the CPU and GPU. This is not a standard PCIe connection — it is a direct, low-latency link that allows the CPU and GPU to share memory with near-zero overhead.

In an agentic context, NVLink-C2C eliminates the most expensive operation in the inference pipeline: copying data between CPU memory and GPU memory. When a Vera CPU receives a tool call result and needs to feed it back to the GPU for the next inference step, the data does not need to be serialized, transferred over a bus, and deserialized. It is already available in a shared memory space that both processors can access directly.

The practical impact is a reduction in per-step latency that compounds dramatically across multi-step agent workflows. If each CPU-GPU data transfer takes 2ms with PCIe and 0.1ms with NVLink-C2C, a ten-step agent workflow saves nearly 20ms — enough to make the difference between a responsive and a sluggish user experience.
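The compounding claim checks out with straightforward arithmetic, using the article's own illustrative per-transfer latencies:

```python
# Per-transfer latencies from the text (illustrative figures, not benchmarks)
PCIE_MS, NVLINK_MS = 2.0, 0.1
steps = 10
saving_ms = steps * (PCIE_MS - NVLINK_MS)
print(f"A {steps}-step workflow saves {saving_ms:.0f} ms")  # prints: saves 19 ms
```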

The Vera Rubin NVL72: A Rack-Scale Agent Computer

Vera is not sold as a standalone processor. It is the CPU component of the Vera Rubin platform, which packages CPUs and GPUs into integrated systems optimized for different AI workloads.

The flagship configuration is the Vera Rubin NVL72: a rack-scale supercomputer that couples 36 Vera CPUs with 72 Rubin GPUs connected via NVLink 6. This is not merely a bigger server — it is a fundamentally different architecture, designed so that every CPU can communicate with every GPU at full NVLink bandwidth, creating a single coherent system rather than a collection of independent nodes.

The NVL72's scale enables workloads that are impossible on smaller systems. Training trillion-parameter mixture-of-experts models — the architecture that powers most frontier models in 2026 — requires the kind of memory capacity (108 TB across the rack) and interconnect bandwidth (only achievable with NVLink 6) that the NVL72 provides.

But the more novel configuration is the Vera CPU Rack: a dedicated, liquid-cooled system packed with up to 256 Vera CPUs and zero GPUs. This GPU-free configuration is designed explicitly for agentic AI workloads — the orchestration, tool calling, environment management, and reinforcement learning tasks that are CPU-bound rather than GPU-bound.

NVIDIA's benchmarks claim that a single Vera CPU Rack can sustain over 22,500 concurrent CPU environments for agentic simulation, task planning, and result validation. In reinforcement learning contexts, this means an AI system can explore thousands of possible action sequences in parallel, evaluate their outcomes, and learn from the results — all on CPU compute, with GPU resources freed for the inference steps that actually require them.
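The fan-out pattern behind that claim can be sketched in a few lines: run many independent environments in parallel on CPU cores, score the outcomes, and keep the best. The `rollout` environment here is a toy seeded random walk, a hypothetical stand-in for real simulation or validation tasks; a production system would shard across processes or machines rather than threads to sidestep Python's GIL for CPU-bound work.

```python
from concurrent.futures import ThreadPoolExecutor
import random

def rollout(seed, horizon=100):
    """One toy agent environment: a seeded random-walk 'task'.
    A hypothetical stand-in for the simulation and validation
    environments described in the text."""
    rng = random.Random(seed)
    return seed, sum(rng.uniform(-1.0, 1.0) for _ in range(horizon))

def evaluate_in_parallel(n_envs, workers=8):
    """Fan n_envs independent environments across workers and keep
    the best-scoring (seed, reward) pair. Threads keep this sketch
    portable; real deployments would use processes for parallelism."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        results = list(pool.map(rollout, range(n_envs)))
    return max(results, key=lambda sr: sr[1])
```

Because each rollout seeds its own RNG, results are deterministic and reproducible, which is exactly the property reinforcement learning pipelines need when comparing thousands of candidate action sequences.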

The Strategic Context: NVIDIA's Platform Play

Vera is not just a chip. It is the latest move in NVIDIA's systematic transformation from a chip company into a platform company — an entity that controls the full hardware and software stack for AI computing.

Consider what NVIDIA now offers:

  • GPU: Rubin (training and inference)
  • CPU: Vera (orchestration and agentic workloads)
  • Networking: NVLink, NVSwitch, ConnectX (inter-chip and inter-node communication)
  • Software: CUDA, TensorRT, NeMo, Isaac (development frameworks)
  • Systems: NVL72, DGX (integrated rack-scale computers)

A data center operator who buys the full NVIDIA stack is locked into an integrated ecosystem where every component is optimized to work with every other component. The switching costs are enormous — not because any individual NVIDIA product is irreplaceable, but because the performance advantages of the integrated system are only available if you buy everything from NVIDIA.

This is the strategic logic behind Vera. If NVIDIA leaves the CPU to Intel or AMD, data center operators can mix and match components from different vendors. If NVIDIA provides its own CPU — one that integrates with its GPUs at a level that third-party CPUs cannot match (via NVLink-C2C) — the lock-in deepens, the margins expand, and the competitive moat widens.

AMD and Intel's Response

The competitive response has been swift, if not yet equal.

AMD has announced its next-generation Turin Zen 6 EPYC processors with specific SKUs optimized for "agentic inference orchestration." These processors will feature increased core counts (up to 192 cores), improved memory bandwidth (with DDR5-6400 support), and a new "Agentic Thread Manager" that prioritizes latency-sensitive agent orchestration threads over background workloads.

Intel's response is more architectural. In partnership with Google, Intel has announced a multi-year collaboration to develop custom Infrastructure Processing Units (IPUs) that handle the networking, scheduling, and orchestration tasks that currently burden the main CPU. This approach — offloading agent orchestration to a dedicated accelerator rather than competing head-to-head with NVIDIA's custom CPU — may prove more practical for data center operators who are already invested in Intel infrastructure.

The Broader Infrastructure Shift

Vera arrives in an infrastructure market that is undergoing its most dramatic transformation since the shift from mainframes to distributed computing.

The HBM Shortage

The AI infrastructure buildout is currently constrained by a severe shortage of High-Bandwidth Memory (HBM) — the stacked DRAM that powers GPU memory subsystems. Samsung, SK Hynix, and Micron are all running their HBM fabrication facilities at maximum capacity, and analysts project that shortages will persist into 2027.

The HBM shortage has two consequences for Vera. First, it increases the value of CPU-based computing that does not require HBM — reinforcing the economic case for the Vera CPU Rack configuration. Second, it constrains the availability of Rubin GPUs, meaning that even customers who want full Vera Rubin NVL72 systems may face 6-12 month waiting lists.

The Energy Crisis

AI data center energy consumption has become a genuine constraint on industry growth. A single NVL72 rack consumes approximately 120kW of power under full load — equivalent to roughly 100 average American households. The planned expansion of AI data center capacity by hyperscalers like Microsoft (50GW by 2030), Google (40GW by 2030), and Meta (30GW by 2030) will require new power generation equivalent to the total installed capacity of several mid-sized countries.
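Both comparisons above can be sanity-checked with quick arithmetic; the ~1.2 kW average US household draw is an assumed round figure:

```python
RACK_KW = 120                       # NVL72 rack draw under full load (from the text)
HOUSEHOLD_KW = 1.2                  # assumed average US household draw
print(f"One NVL72 rack ≈ {RACK_KW / HOUSEHOLD_KW:.0f} households")  # ≈ 100

planned_gw = 50 + 40 + 30           # Microsoft + Google + Meta 2030 plans
racks = planned_gw * 1e6 / RACK_KW  # GW -> kW, divided by per-rack draw
print(f"{planned_gw} GW would power ~{racks:,.0f} NVL72-class racks")  # ~1,000,000
```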

NVIDIA's response to the energy constraint is efficiency-focused: the Vera CPU is designed to achieve approximately 4x better performance per watt on agentic workloads compared to general-purpose data center CPUs. This efficiency advantage comes from the custom core design (which eliminates unnecessary computational units), the LPDDR5X memory system (which uses less power than standard DDR5), and the tight NVLink integration (which reduces data movement, the most energy-intensive operation in computing).

The CoreWeave Factor

The Vera announcement coincided with CoreWeave's massive expansion of GPU cloud capacity — including a new $21 billion expansion of its partnership with Meta and a multi-year compute supply agreement with Anthropic. CoreWeave has emerged as the primary GPU-as-a-service provider for AI companies that cannot afford to build their own data centers.

For Vera, CoreWeave's expansion is significant because it represents the channel through which most AI companies will access Vera-class hardware. Few companies outside the hyperscaler tier can justify purchasing rack-scale Vera Rubin systems. But renting CPU and GPU hours from CoreWeave — which will deploy Vera in its data centers — is accessible to any company with an API billing account.

The Terafab Project: Vertical Integration Taken to Its Logical Extreme

While NVIDIA's approach is horizontal platform integration (controlling every component in the data center stack), a different model of AI hardware infrastructure is taking shape in Austin, Texas.

The Terafab project — a multi-billion-dollar joint venture between Tesla, SpaceX, xAI, and Intel — aims to vertically integrate chip production from design through lithography, fabrication, and advanced packaging. The goal is to create a domestic supply chain for AI chips that is not dependent on TSMC's Taiwanese fabrication facilities.

Terafab represents the most ambitious attempt to date to solve the geopolitical risk that underpins the entire AI hardware industry: the concentration of advanced chip manufacturing in a single country (Taiwan) exposed to a single geopolitical risk (cross-strait relations with mainland China). Whether Terafab can achieve the manufacturing quality and yield rates that TSMC has spent decades perfecting is an open question, but the strategic logic of geographic diversification is hard to dispute.

What Vera Means for AI Engineers

For practitioners building AI systems in 2026, Vera's implications are both practical and philosophical.

Practical: Agent orchestration performance will improve dramatically on Vera-class hardware. Multi-agent systems that currently require careful optimization to maintain acceptable latency will run smoothly on hardware designed for their workload characteristics. This removes a significant engineering constraint and enables more ambitious agent architectures.

Philosophical: The existence of a CPU designed for agentic AI is a signal that the industry has moved beyond the experimental phase. When hardware companies — whose product cycles span years and whose capital investments span billions — commit to agentic AI as a primary use case, it validates the architectural direction of the entire agent ecosystem.

The GPU made deep learning possible. The TPU made large language models economical. And the Vera CPU may make autonomous agents practical at scale — not through raw compute power, but by ensuring that the orchestration layer that connects model reasoning to real-world action is as fast, reliable, and efficient as the reasoning itself.

NVIDIA's one-year release cadence — from Blackwell to Rubin to whatever comes next — suggests that the company views AI hardware as a market where standing still means falling behind. Vera is not the destination. It is the first step toward a hardware ecosystem where AI agents are not bottlenecked by the infrastructure they run on but are instead limited only by the intelligence of the models that power them.

That world — where the hardware is no longer the constraint — is the world NVIDIA is building toward. Whether we are ready for what happens when that constraint is removed is a different question entirely.
