NVIDIA's "mysterious chip" behind the scenes—The era of reasoning ushers in "Four Major New Trends in Computing Power"

NVIDIA is integrating Groq's LPU (Language Processing Unit) technology into a new inference chip, with OpenAI signed on as one of its largest customers, shifting the main battlefield of AI computing power competition from training to inference. Shenwan Hongyuan Research believes that by 2026, the core keyword for the computing industry will be inference, with both token consumption and technology paradigms deeply restructured around this theme.

On February 28, according to The Wall Street Journal, NVIDIA plans to release a new inference chip next month at the GTC developer conference that integrates Groq’s “Language Processing Unit” (LPU) technology. NVIDIA CEO Jensen Huang described it as a “completely new system never seen before.” OpenAI has agreed to become one of the largest customers for this processor and will purchase large-scale “dedicated inference capacity” from NVIDIA.

Meanwhile, OpenAI last month also reached a multi-billion-dollar computing partnership with startup Cerebras, which claims its inference chips outpace NVIDIA's GPUs. Taken together, these moves show the AI giants shifting from an arms race over training compute to a multi-pronged buildout of inference compute.

Shenwan Hongyuan writes that in the token-economy era, four major trends are emerging in inference compute: first, deployment of pure CPUs (central processing units) is increasing, as low-cost inference demand accelerates the decentralization of compute; second, specialized architectures such as the LPU are rising, challenging the GPU's dominance in inference; third, domestic chip breakthroughs are accelerating and supply chains are diversifying; fourth, the structure of inference demand is shifting from "one-off training" to "massive token consumption," making cost-performance the key competitive factor.

The report states that companies capable of providing sufficient, high-cost-performance inference chips will benefit most, and breakthroughs in CPUs, LPUs, and domestic chips are central to this reshaping of the compute landscape.

Inference demand explodes, token consumption hits record highs

Shenwan Hongyuan sees two structural drivers behind the sustained expansion of demand: first, large-model monetization is accelerating, with models like Claude entering application markets and shipping multiple industry plugins; second, Agent deployment is accelerating, with products like OpenClaw and Qianwen Agent marking Agents' entry into real work and production scenarios. Every model call and every Agent task execution consumes substantial inference compute.

Data cited by Shenwan Hongyuan shows that during the Spring Festival, leading domestic large models saw significant growth in inference volume: on New Year's Eve, Doubao's inference throughput reached 63.3 billion tokens; Yuanbao's monthly active users hit 114 million; and Qianwen's "Spring Festival Free Trial" campaign drew over 120 million participants.

Further, data from OpenRouter, a global AI model API aggregation platform, reveals the scale of this trend. In the week of February 9-15, Chinese models surpassed US models in call volume for the first time, at 4.12 trillion tokens versus 2.94 trillion. The following week, February 16-22, Chinese models surged to 5.16 trillion tokens, roughly 25% week over week and a 127% increase over three weeks. Four of the top five models globally by call volume are Chinese.
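As a quick sanity check on these figures (a sketch in Python; the three-week base volume is not stated in the article, so it is inferred from the quoted growth rate):

```python
# Weekly token volumes for Chinese models on OpenRouter, in trillions,
# as quoted above.
wk_feb09 = 4.12   # week of Feb 9-15
wk_feb16 = 5.16   # week of Feb 16-22

wow = (wk_feb16 - wk_feb09) / wk_feb09
print(f"week-over-week growth: {wow:.1%}")                       # ~25.2%

# The quoted "127% increase over three weeks" implies a base volume
# three weeks earlier (not given in the article) of roughly:
implied_base = wk_feb16 / (1 + 1.27)
print(f"implied base three weeks earlier: {implied_base:.2f}T")  # ~2.27T
```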

LPU becomes a new star, with training and inference chips diverging

NVIDIA invested $20 billion to license Groq's core technology and recruited its key people, including founder Jonathan Ross, through a "core hiring" deal. Shenwan Hongyuan believes this move signals top-level recognition of the importance of pure inference chips.

The fundamental reason LPUs are more efficient than general-purpose GPUs at inference lies in architectural differences. AI inference has two phases, prefill and decode, and decode is especially slow for large models because each generated token must stream the model's weights through memory. LPUs are optimized specifically for this latency and memory-bandwidth bottleneck. Reports suggest NVIDIA's upcoming products may use the next-generation Feynman architecture, possibly with larger on-chip SRAM or 3D stacking to embed the LPU design more deeply.
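A rough roofline-style calculation makes the bottleneck concrete; the figures below are illustrative assumptions (a 70B-parameter model, ballpark HBM and SRAM bandwidths), not vendor specifications:

```python
# Back-of-the-envelope: why decode is memory-bandwidth bound. Each
# generated token streams every weight through the memory system once,
# so bandwidth caps single-stream decode speed. Numbers are illustrative
# assumptions, not vendor specs.
params = 70e9            # assume a 70B-parameter model
bytes_per_param = 2      # FP16/BF16 weights
weight_bytes = params * bytes_per_param   # 140 GB read per decoded token

def max_tokens_per_sec(mem_bandwidth_tb_s: float) -> float:
    """Upper bound on single-stream decode throughput."""
    return mem_bandwidth_tb_s * 1e12 / weight_bytes

print(f"HBM-class (~3.4 TB/s): {max_tokens_per_sec(3.4):.0f} tok/s")
print(f"SRAM-class (~80 TB/s): {max_tokens_per_sec(80):.0f} tok/s")
```

On these assumptions, an SRAM-fed design has more than an order of magnitude more headroom per stream, which is the core of the LPU pitch.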

Based on this, Shenwan Hongyuan predicts that future AI chips will have a clear division of labor: training will continue to use GPU-HBM combinations, while inference will evolve into ASIC + LPU-SRAM + SSD configurations. As compute demand shifts from training to inference, manufacturers focusing on inference chips will gain opportunities.

Inference systems are restructured, with CPU and network demand rising in tandem

Another key dimension of the current inference compute upgrade is the move from single chips to system-level innovation. Shenwan Hongyuan notes that as application scenarios shift from chatbots to Agents, requirements for latency, throughput, and reasoning depth rise simultaneously, pushing architectures toward a three-layer network.

The first layer is a fast-response layer, built on SRAM-equipped pure inference chips that provide ultra-low-latency feedback; the second is a slow-thinking layer, where large compute clusters handle complex logical reasoning and demand for multi-core, multi-threaded CPUs grows; the third is a memory layer, exemplified by NVIDIA's ContextMemory System, which manages Agents' long-term memory and KV caches on SSD storage orchestrated by a BlueField-4 DPU.
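A minimal sketch of how such a three-layer stack might route requests; all class and method names here are hypothetical illustrations, not NVIDIA APIs:

```python
from dataclasses import dataclass, field

@dataclass
class AgentRequest:
    prompt: str
    needs_deep_reasoning: bool = False

@dataclass
class ThreeLayerStack:
    # Memory layer: long-term context per agent (SSD-backed in a real system).
    kv_store: dict = field(default_factory=dict)

    def fast_layer(self, req: AgentRequest) -> str:
        # SRAM inference chip: ultra-low-latency replies.
        return f"[fast] reply to: {req.prompt}"

    def slow_layer(self, req: AgentRequest) -> str:
        # CPU/GPU cluster: multi-step logical reasoning.
        return f"[slow] reasoned answer to: {req.prompt}"

    def handle(self, agent_id: str, req: AgentRequest) -> str:
        history = self.kv_store.get(agent_id, [])          # recall memory layer
        answer = (self.slow_layer(req) if req.needs_deep_reasoning
                  else self.fast_layer(req))
        self.kv_store[agent_id] = history + [req.prompt]   # persist for next turn
        return answer

stack = ThreeLayerStack()
print(stack.handle("agent-1", AgentRequest("what's the ETA?")))
print(stack.handle("agent-1", AgentRequest("replan the route", needs_deep_reasoning=True)))
```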

NVIDIA is also adjusting its hardware strategy. Bundling Vera CPUs with Rubin GPUs had been standard practice, but proved costly for certain AI workloads. This month, NVIDIA announced an expanded collaboration with Meta Platforms, including the first large-scale deployment of pure-CPU systems to support Meta's ad-targeting AI Agents, marking a move beyond GPU-only sales.

Domestic compute breakthroughs accelerate

Shenwan Hongyuan emphasizes that the technological upgrade of domestic inference chips deserves close attention, as it is notably underappreciated by market expectations.

Technologically, the new generation of domestic inference chips delivers several fundamental improvements: support for low-precision data formats such as FP8/MXFP8/MXFP4, with compute reaching the 1-PFLOPS and 2-PFLOPS class; significantly stronger vector compute, with a new homogeneous design supporting both SIMD and SIMT programming models; and interconnect bandwidth raised 2.5x to 2 TB/s.
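To illustrate what the MX formats do, here is a conceptual sketch of MX-style block scaling: each 32-element block shares one power-of-two scale and its elements are stored in a narrow float format such as FP8 E4M3. The rounding helper is a crude stand-in (it ignores subnormals and special values), not a bit-exact implementation:

```python
import numpy as np

E4M3_MAX = 448.0   # largest finite value in FP8 E4M3
BLOCK = 32         # MX formats share one scale per 32 elements

def fake_e4m3_round(v: np.ndarray) -> np.ndarray:
    """Crude E4M3 rounding: keep 3 mantissa bits, ignore subnormals."""
    e = np.floor(np.log2(np.abs(v) + 1e-30))
    quantum = 2.0 ** (e - 3)
    return np.round(v / quantum) * quantum

def mx_quantize(x: np.ndarray):
    x = x.reshape(-1, BLOCK)
    amax = np.abs(x).max(axis=1)
    # Smallest power-of-two scale that keeps each block within E4M3 range.
    scales = 2.0 ** np.ceil(np.log2(amax / E4M3_MAX + 1e-30))
    q = fake_e4m3_round(np.clip(x / scales[:, None], -E4M3_MAX, E4M3_MAX))
    return q, scales   # q stored as FP8, scales as 8-bit exponents

vals = (np.random.randn(64) * 100).astype(np.float32)
q, scales = mx_quantize(vals)
recon = q * scales[:, None]
print("max relative error:",
      np.abs(recon - vals.reshape(-1, BLOCK)).max() / np.abs(vals).max())
```

The point of the shared per-block scale is that outliers only degrade precision within their own 32-element block, which is what makes FP8- and FP4-class storage viable for inference weights and activations.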

Particularly noteworthy is the implementation of PD (prefill/decode) separation at the chip level: the chip is paired with self-developed HBM in two different grades, yielding a PR version with low-cost HBM for prefill and recommendation scenarios, and a DT version for decode and training scenarios. The PR version aims to cut inference prefill costs significantly and is expected to launch in Q1 2026.
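A minimal sketch of what PD separation means at the serving level, assuming two differently provisioned workers (class names here are hypothetical): the compute-bound prefill pass runs on one device and hands its KV cache to a bandwidth-optimized decode device.

```python
from dataclasses import dataclass

@dataclass
class KVCache:
    tokens: list          # prompt tokens already processed
    # Real systems store per-layer key/value tensors here.

class PrefillWorker:
    """Compute-bound: processes the whole prompt in one batched pass."""
    def prefill(self, prompt_tokens: list) -> KVCache:
        return KVCache(tokens=list(prompt_tokens))

class DecodeWorker:
    """Bandwidth-bound: generates one token per step from the KV cache."""
    def decode(self, cache: KVCache, max_new: int) -> list:
        out = []
        for _ in range(max_new):
            nxt = f"tok{len(cache.tokens)}"  # stand-in for a model forward pass
            cache.tokens.append(nxt)
            out.append(nxt)
        return out

# The two stages can run on differently provisioned hardware, e.g. a
# "PR"-style part for prefill and a "DT"-style part for decode.
cache = PrefillWorker().prefill(["hello", "world"])
print(DecodeWorker().decode(cache, max_new=3))
```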

Supply-chain progress is also visible. According to a leading packaging company's reply to a first-round inquiry letter, its 2.5D packaging revenue comes mainly from high-performance-computing chips and rose rapidly from 0.5 billion yuan in 2022 to 18.2 billion yuan in 2024, indirectly confirming growing domestic supply capacity and an accelerating localization of the supply chain.

Risk Warning and Disclaimer

Markets carry risk; invest with caution. This article does not constitute personal investment advice and does not take into account individual users' specific investment goals, financial situations, or needs. Users should consider whether any opinions, views, or conclusions herein suit their particular circumstances. Investment carries risks, and responsibility rests with the individual.
