Large AI models are moving rapidly from the cloud to edge and endpoint devices, but computing power, memory, and power consumption have emerged as the major bottlenecks to their large-scale deployment. Neural Processing Units (NPUs), designed specifically for AI computing, have become key to overcoming these challenges. Arm China’s Zhouyi X3 NPU IP tackles the "computing power wall", "memory wall", and "power consumption wall" of edge-side AI through architectural innovation, hardware-software co-optimization, and an open ecosystem.
As edge-side AI shifts from CNNs to Transformers, the demand for high-precision floating-point computing has surged dramatically. Traditional NPUs struggle to meet the dynamic requirements of complex AI scenarios due to poor architectural adaptability and low computing power scheduling efficiency. To address the stringent computing power demands of large models on edge devices, Zhouyi X3 provides a highly efficient solution.
Zhouyi X3’s breakthrough lies in its underlying architectural innovation. It adopts a brand-new DSP+DSA architecture tailored for large models, enabling a shift from fixed-point to floating-point computing. This architecture supports both CNNs and Transformers, resolving the "specialization bias" issue of traditional NPUs. A single Cluster offers flexible computing power configuration ranging from 8 to 80 FP8 TFLOPS, which can accurately match the diverse computing needs of different scenarios. Compared with the previous generation, Zhouyi X3 has achieved significant performance upgrades: AIGC large model capability is improved by 10 times, and CNN model performance is enhanced by 30%-50%, fully unlocking the computing potential of large models.
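As a rough illustration of what that range covers, the sketch below estimates the FP8 compute a Transformer prefill workload would demand and compares it with the stated 8-80 TFLOPS per-Cluster envelope. The model size, prompt length, latency target, and utilization figure are illustrative assumptions, not Zhouyi X3 data.

```python
def prefill_tflops(params_billion: float, prompt_tokens: int,
                   latency_s: float, utilization: float = 0.5) -> float:
    """Back-of-envelope estimate: ~2 FLOPs per parameter per prompt token."""
    total_flops = 2 * params_billion * 1e9 * prompt_tokens
    return total_flops / latency_s / utilization / 1e12

# Illustrative example: a 3B-parameter model prefilling a 2048-token prompt in 1 s
need = prefill_tflops(params_billion=3, prompt_tokens=2048, latency_s=1.0)
low, high = 8, 80  # FP8 TFLOPS range quoted for a single Zhouyi X3 Cluster
print(f"~{need:.1f} FP8 TFLOPS needed "
      f"({'within' if low <= need <= high else 'outside'} the per-Cluster range)")
```

Under these assumptions the workload lands at roughly 25 FP8 TFLOPS, comfortably inside the configurable per-Cluster range.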
Large models feature massive parameter counts, making memory bandwidth and storage pressure another major bottleneck. Without efficient data read-write and storage handling, AI tasks suffer from lag and delayed responses. Zhouyi X3 addresses these memory challenges through a series of technical upgrades.
Test data shows that multi-core computing power scales with 70-80% linearity, compute utilization for large models in the Prefill phase reaches 72%, and effective bandwidth utilization in the Decode phase exceeds 100%[1]. These results validate its memory scheduling capability and system-level co-optimization.
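A hedged back-of-envelope sketch (illustrative figures, not vendor measurements) shows why the two phases stress different resources and are therefore reported with different metrics: decode re-reads the weights for every generated token and is bandwidth-bound, while prefill reuses each weight across the whole prompt and is compute-bound.

```python
def decode_weight_traffic_gbs(params_billion: float, bytes_per_param: float,
                              tokens_per_second: float) -> float:
    """Each decoded token streams roughly the full weight set from memory."""
    return params_billion * bytes_per_param * tokens_per_second  # GB/s

def prefill_flops_per_weight_byte(prompt_tokens: int, bytes_per_param: float) -> float:
    """During prefill the same weights are reused across every prompt token."""
    return 2 * prompt_tokens / bytes_per_param

# Illustrative 3B-parameter model with 1-byte (FP8) weights
print(f"Decode @ 20 tok/s: ~{decode_weight_traffic_gbs(3, 1, 20):.0f} GB/s of weight reads")
print(f"Prefill of 2048 tokens: ~{prefill_flops_per_weight_byte(2048, 1):.0f} FLOPs per weight byte")
```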
With limited battery capacity and thermal headroom, endpoint devices urgently need computing power and energy efficiency to be optimized together so that high-performance AI tasks can coexist with long battery life.
Zhouyi X3 integrates the AIFF hardware engine dedicated to AI applications, paired with a specialized hardened scheduler. This combination reduces CPU load to below 0.5% with extremely low scheduling latency. When the NPU processes multiple AI tasks in parallel, it does not rely on frequent CPU intervention for scheduling, significantly reducing communication overhead between the CPU and NPU, thereby lowering system power consumption and effectively extending device battery life.
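The offloaded-scheduling pattern this describes can be sketched as follows; the classes and calls below are conceptual stand-ins for illustration only, not the AIFF or scheduler interface. The host enqueues a batch of tasks once, the hardware-side scheduler orders and dispatches them, and the CPU sees a single completion event rather than per-task interrupts.

```python
from dataclasses import dataclass

@dataclass
class NpuTask:
    name: str
    priority: int

class HardwareScheduler:
    """Conceptual stand-in: ordering and dispatch happen on the NPU, not the CPU."""
    def __init__(self) -> None:
        self._queue: list[NpuTask] = []

    def submit(self, task: NpuTask) -> None:
        self._queue.append(task)  # one cheap host-side write per task

    def run_all(self) -> int:
        # The NPU-side engine orders and executes tasks without interrupting the host.
        completed = len(sorted(self._queue, key=lambda t: -t.priority))
        self._queue.clear()
        return completed  # a single completion event reaches the CPU

sched = HardwareScheduler()
for t in (NpuTask("detector", 2), NpuTask("llm_decode", 1), NpuTask("keyword_spot", 3)):
    sched.submit(t)
print(f"{sched.run_all()} tasks finished with one host notification")
```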
In addition, Zhouyi X3 adopts a scalable multi-core architecture and a hierarchical memory interconnection architecture, supporting flexible computing power scaling and extension. The system can achieve "on-demand power supply" based on the complexity of AI tasks, effectively reducing invalid computing and data movement to maximize energy utilization efficiency.
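A simplified, hypothetical illustration of the "on-demand power supply" idea is to power up only as many cores as the current task requires; the per-core throughput and core count below are placeholders, not Zhouyi X3 figures.

```python
import math

def active_cores(task_tflops: float, tflops_per_core: float, max_cores: int) -> int:
    """Power up only as many cores as the current task's demand requires."""
    return min(max_cores, max(1, math.ceil(task_tflops / tflops_per_core)))

print(active_cores(task_tflops=5,  tflops_per_core=10, max_cores=4))  # light task -> 1 core
print(active_cores(task_tflops=35, tflops_per_core=10, max_cores=4))  # heavy task -> 4 cores
```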
Addressing the three major bottlenecks of edge-side AI requires in-depth hardware-software collaboration. The Compass AI software platform, paired with Zhouyi X3, serves as a powerful enabler with its comprehensive usability, open ecosystem, and security assurance.
The Compass AI software platform provides an end-to-end unified toolchain that enables "one-click deployment and out-of-the-box use". It natively supports Hugging Face, mainstream AI frameworks, and operating systems, and is compatible with more than 160 operators and over 270 models. It is also deeply optimized for large-model inference, covering LLM, VLM, VLA, and MoE models, enabling a seamless transition from CNN to Transformer workloads and significantly lowering the barrier and cost of model deployment. Meanwhile, the platform’s quantization algorithms and dynamic shape support boost performance while reducing power consumption and avoiding wasted computation.
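As a concrete illustration of the first half of such a pipeline, the sketch below exports a Hugging Face model to ONNX with dynamic batch and sequence dimensions, the kind of framework-neutral graph a unified toolchain typically ingests. The model choice and file name are arbitrary examples, and the Compass-specific quantization and compilation steps that would follow are not shown here.

```python
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

model_id = "distilbert-base-uncased-finetuned-sst-2-english"  # example model only
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForSequenceClassification.from_pretrained(model_id).eval()
model.config.return_dict = False  # export a plain tuple of outputs

sample = tokenizer("edge AI example", return_tensors="pt")
torch.onnx.export(
    model,
    (sample["input_ids"], sample["attention_mask"]),
    "model.onnx",
    input_names=["input_ids", "attention_mask"],
    output_names=["logits"],
    # Dynamic axes keep batch size and sequence length flexible, matching the
    # dynamic shape capability described above.
    dynamic_axes={
        "input_ids": {0: "batch", 1: "sequence"},
        "attention_mask": {0: "batch", 1: "sequence"},
        "logits": {0: "batch"},
    },
)
```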
In addition, the Compass AI software platform offers a range of software tools and opens up core components such as its IR specification and open-source quantization tools. Using its DSL, developers can implement custom operators through rich NN compiler plugins, and with the visual debugging tools the platform enables full-link observability and optimization, greatly improving development efficiency across scenarios and providing the underlying software support for computing power scheduling and power control in edge-side AI.
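To make the plugin idea concrete, the sketch below shows the general custom-operator pattern: register a named operator with a reference implementation that a compiler can validate an NPU kernel against. The registry, decorator, and RMSNorm example are hypothetical stand-ins, not the Compass DSL or its actual plugin interface.

```python
import numpy as np

CUSTOM_OPS = {}

def register_op(name: str):
    """Register a custom operator implementation under a graph-level op name."""
    def wrap(fn):
        CUSTOM_OPS[name] = fn
        return fn
    return wrap

@register_op("RMSNorm")
def rms_norm(x: np.ndarray, weight: np.ndarray, eps: float = 1e-6) -> np.ndarray:
    # Reference (CPU) semantics that an NPU kernel could be checked against.
    scale = 1.0 / np.sqrt(np.mean(x * x, axis=-1, keepdims=True) + eps)
    return x * scale * weight

x = np.random.randn(1, 8).astype(np.float32)
print(CUSTOM_OPS["RMSNorm"](x, np.ones(8, dtype=np.float32)).shape)
```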
Guided by Arm China’s "AI Arm CHINA" strategy, the company will keep AI at its core, draw on the Arm® ecosystem for support, and build on local innovation. It will continue to advance the R&D of its four self-developed IP product lines: the Zhouyi NPU, Xingchen CPU, Shanhai SPU, and Linglong multimedia processor. By working with industry partners to build a Chinese intelligent computing ecosystem, Arm China aims to drive the large-scale deployment of edge-side AI.