
The Inference War Nobody Told Nvidia About

Training made Nvidia the most valuable company on earth. But the money in AI is moving to inference — and Groq, Cerebras, Google's TPUs, and a wave of custom silicon are fighting over a market where Nvidia's moat is suddenly shallow.

Flux Desk · 2026-06-05 · 8 min read

For three years the AI chip story was a single sentence: Nvidia sells the shovels, everyone digs, the stock goes up. CUDA was the moat, H100s were the currency, and every competitor's roadmap was a polite way of saying "we'll catch up eventually." That story was true for training. It is increasingly wrong for inference, and inference is where the volume — and eventually the profit pool — actually lives.

The shift is structural. Training a frontier model is a capital event: a few enormous clusters, run by a handful of labs, a few times a year. Inference is the opposite: billions of queries a day, every day, forever, with economics dominated by cost-per-token and latency, not raw FLOPs. A chip that wins training can lose inference badly, because training rewards raw throughput and cluster-scale networking while inference rewards memory bandwidth, responsiveness, and cheap tokens. That gap is the opening, and in 2026 it's crowded.
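To make "cost-per-token and latency" concrete, here is a minimal sketch of how a buyer might compare two serving setups. Everything in it is an assumption for illustration: the `ServingOption` fields, the option names, and every number are placeholders, not measurements of any real chip or vendor.

```python
from dataclasses import dataclass


@dataclass
class ServingOption:
    name: str                 # hypothetical label, not a real product
    hourly_cost_usd: float    # assumed all-in cost of running the system for one hour
    tokens_per_second: float  # assumed aggregate output throughput across all requests
    latency_ms: float         # assumed response latency for a typical request


def cost_per_million_tokens(option: ServingOption) -> float:
    """Dollars spent to generate one million tokens at the assumed throughput."""
    tokens_per_hour = option.tokens_per_second * 3600
    return option.hourly_cost_usd / tokens_per_hour * 1_000_000


def cheapest_within_latency(options, max_latency_ms):
    """Return the lowest cost-per-token option that meets the latency budget, or None."""
    eligible = [o for o in options if o.latency_ms <= max_latency_ms]
    if not eligible:
        return None
    return min(eligible, key=cost_per_million_tokens)


# Placeholder numbers chosen only to keep the arithmetic easy to follow.
options = [
    ServingOption("throughput-tuned", hourly_cost_usd=36.0,
                  tokens_per_second=10_000, latency_ms=800),
    ServingOption("latency-tuned", hourly_cost_usd=36.0,
                  tokens_per_second=2_500, latency_ms=150),
]

for budget_ms in (1000, 200):
    winner = cheapest_within_latency(options, budget_ms)
    print(budget_ms, winner.name, round(cost_per_million_tokens(winner), 2))
```

With these placeholder inputs, the throughput-tuned option works out to about $1 per million tokens and the latency-tuned one to about $4. A loose latency budget picks the first; a tight one leaves only the second. The ranking flips with the budget, which is why a single headline FLOPs figure says little about who wins a given inference workload.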

Groq and Cerebras are selling speed Nvidia can't match

The two most interesting challengers attacked the same weakness from opposite directions. Groq built a deterministic, SRAM-heavy architecture — its LPU — that throws out the memory hierarchy GPUs lean on and instead keeps the whole model close to compute, streaming tokens out at speeds that make GPU inference look sluggish. The numbers that circulated through 2025 and into this year — hundreds of tokens per second on large open models, with response latency low enough to feel instantaneous — weren't marketing fiction; developers building voice agents and real-time tools migrated specifically for the responsiveness.

Cerebras went in the opposite, equally absurd direction: a single wafer-scale chip the size of a dinner plate, the entire model living on one piece of silicon, no inter-chip networking tax. For the right workloads the throughput is staggering. Both companies are making the same bet: that the future of inference is a latency war, and that GPUs are carrying memory-bandwidth baggage that wafer-scale and SRAM-streaming designs simply don't.
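The memory-bandwidth argument can be sketched as back-of-envelope arithmetic. The sketch below assumes a simplified model in which generating each token for a single request requires reading every weight once, so memory bandwidth divided by model size caps tokens per second. It ignores batching, KV-cache traffic, compute limits, and multi-chip sharding, and the function names and all figures are hypothetical placeholders rather than specs of any real part.

```python
def weight_bytes(params_billions: float, bytes_per_param: float) -> float:
    """Total size of the model weights in bytes."""
    return params_billions * 1e9 * bytes_per_param


def max_decode_tokens_per_second(params_billions: float,
                                 bytes_per_param: float,
                                 bandwidth_tb_per_s: float) -> float:
    """Rough ceiling on single-request decode speed when bandwidth is the limit.

    Assumes every generated token needs one full pass over the weights,
    so the ceiling is bandwidth divided by weight size.
    """
    bandwidth_bytes_per_s = bandwidth_tb_per_s * 1e12
    return bandwidth_bytes_per_s / weight_bytes(params_billions, bytes_per_param)


# Hypothetical 70-billion-parameter model stored at one byte per parameter.
off_chip = max_decode_tokens_per_second(70, 1.0, bandwidth_tb_per_s=3.5)
on_chip = max_decode_tokens_per_second(70, 1.0, bandwidth_tb_per_s=35.0)

print(f"assumed off-chip memory ceiling: {off_chip:.0f} tokens/s")
print(f"assumed on-chip memory ceiling:  {on_chip:.0f} tokens/s")
```

Under these assumptions, a tenfold gain in effective bandwidth raises the per-request ceiling tenfold, from about 50 to about 500 tokens per second. That is the shape of the bet, not a benchmark: real systems batch requests and split models across chips, so actual numbers land elsewhere.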

The catch, and it's a real one, is breadth. Groq and Cerebras shine on a curated set of popular models they've hand-tuned. The moment you want a model they haven't optimized, or fine-grained control, or a workload outside their sweet spot, the GPU's flexibility reasserts itself. They're not replacing the datacenter; they're carving out the high-value, latency-sensitive top of the market — and that slice is growing fast enough to fund real businesses.

Google's TPUs are the quiet giant in the room

Everyone fixated on the startups misses the most credible threat: Google has been designing its own AI silicon for a decade, and the TPU is the only non-Nvidia accelerator deployed at genuine hyperscale, running both Google's own frontier models and a growing slice of external workloads through its cloud. The latest generations are competitive on training and increasingly attractive on inference economics, and crucially Google isn't trying to dethrone Nvidia in the open market — it's removing Nvidia from its own bill of materials, which at Google's scale is a multi-billion-dollar swing.

That's the pattern worth watching. The most serious custom silicon isn't being sold as chips; it's being used in-house and, at most, rented out through the owner's cloud. Amazon's Trainium and Inferentia, along with the rumored and confirmed in-house accelerators across the big labs and clouds, aren't merchant products. They're hedges. Every hyperscaler has concluded that paying Nvidia's gross margin on every inference query, forever, is intolerable, and they all have the volume to justify taping out their own chips. Nvidia's most dangerous competitors aren't startups; they're its biggest customers, quietly building the exit.

AMD finally has a story, and the software finally half-works

AMD spent years as the answer to a question nobody was really asking: a GPU that was competitive on paper and uncompetitive in practice because ROCm — its CUDA alternative — was a graveyard of broken kernels and missing features. The MI300 line changed the hardware conversation; its memory capacity made it genuinely attractive for serving large models where Nvidia parts were memory-starved. And after relentless pressure, the software stack crossed the threshold from "technically supported" to "actually shippable" for mainstream inference.

That's a meaningful crack in the moat. CUDA's dominance was never really about the silicon. It was about the nearly two decades of libraries, kernels, and developer muscle memory built on top of it. AMD didn't need ROCm to be better; it needed it to be good enough that a cost-conscious buyer serving open-weight models could switch without a rewrite. In 2026, for a narrow but real set of workloads, it is. Nvidia still wins the default, but "default" and "only option" are very different competitive positions.

Who actually wins

The honest answer is that nobody wins the whole thing, and that's the point. The AI chip market is fragmenting along workload lines, and the monolithic "Nvidia owns AI" thesis is being replaced by a more textured map. Frontier training stays Nvidia's for now — the software gravity and cluster-scale networking advantages are real and slow to erode. Latency-critical inference is a genuine contest where Groq and Cerebras have defensible technical edges. Hyperscale internal inference is bleeding toward custom silicon, with Google's TPU the furthest along. And the long tail of cost-sensitive open-model serving is finally a two-horse race with AMD.

The number that matters going forward isn't FLOPs or even market cap. It's cost-per-token at a given latency, and on that axis Nvidia's lead is workload-dependent and shrinking at the edges. The company will stay enormous; demand for compute is not in doubt. But the era when "AI chips" and "Nvidia" were synonyms is closing. The inference war is being fought in the gaps Nvidia's training dominance left open, and the gaps are getting wider.

The shovel-seller stays rich. It just no longer owns the mine.
