Microsoft has introduced the Maia 200, a custom inference accelerator designed to improve the performance and efficiency of running large AI models across its cloud infrastructure. The company detailed the chip in an official blog post, framing it as part of a broader push toward purpose‑built hardware for the growing demands of AI workloads. According to Microsoft, Maia 200 is optimized specifically for inference and is already being deployed in select Azure datacenters. The announcement covers the chip’s technical specifications, early deployment results, and plans to integrate it into future generations of the company’s AI infrastructure.
The specifications support that positioning. Maia 200 packs more than 140 billion transistors and is optimized for low-precision compute, which is essential for modern inference workloads. Microsoft highlights more than 10 petaFLOPS of FP4 compute, more than 5 petaFLOPS of FP8 compute, 216 GB of HBM3e memory delivering 7 TB per second of bandwidth, 272 MB of on-chip SRAM, and a 750-watt thermal envelope. The company states that this makes Maia 200 the most performant first-party silicon offered by any hyperscaler, directly comparing the chip to Amazon’s third-generation Trainium and Google’s seventh-generation TPU.
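A quick back-of-the-envelope calculation shows what those numbers imply taken together. Dividing peak compute by memory bandwidth gives the arithmetic intensity a workload needs before it stops being memory-bound; the constants below come from the announcement, but the break-even interpretation is an illustration, not a Microsoft benchmark.

```python
# Back-of-the-envelope roofline check using the published Maia 200 figures.
# The constants come from the announcement; the break-even points are an
# illustration, not a measured result.

FP4_FLOPS = 10e15   # >10 petaFLOPS of FP4 compute
FP8_FLOPS = 5e15    # >5 petaFLOPS of FP8 compute
HBM_BW    = 7e12    # 216 GB of HBM3e at 7 TB/s

# Arithmetic intensity (FLOPs per byte moved from HBM) needed before the
# chip becomes compute-bound rather than memory-bound.
print(f"FP4 break-even: {FP4_FLOPS / HBM_BW:,.0f} FLOPs/byte")  # ~1,429
print(f"FP8 break-even: {FP8_FLOPS / HBM_BW:,.0f} FLOPs/byte")  # ~714
```

Single-stream LLM decoding typically sits well below those break-even points, which is why the memory subsystem discussed next matters as much as the headline FLOPS.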
Microsoft’s focus on inference reflects the reality that the cost of running AI models at scale is dominated by inference rather than training. Every Copilot request, every GPT-powered rewrite, and every enterprise chatbot interaction depends on inference hardware. According to Microsoft, Maia 200 delivers 30 percent better performance per dollar than the latest generation of hardware currently deployed in its fleet. The accelerator already supports OpenAI’s GPT-5.2 models and Microsoft’s Superintelligence team, which relies on Maia 200 for synthetic data generation and reinforcement learning.
One of the most notable engineering decisions is the redesign of the memory subsystem. Microsoft emphasizes that raw compute performance is not enough to achieve high token throughput. The chip uses narrow-precision data types, a specialized DMA engine, on-die SRAM, and a custom network-on-chip fabric to move data efficiently. These choices are intended to reduce the memory bottlenecks that often limit the performance of large language models.
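To make that bottleneck concrete, here is a minimal sketch of the memory-bound ceiling on single-stream decoding, where each generated token must stream the full weight set from HBM. The model size and precision are hypothetical; only the bandwidth figure comes from the announcement.

```python
# Sketch: upper bound on single-stream decode throughput when every token
# must stream the full weight set from HBM. The model size is hypothetical;
# the bandwidth figure is from the announcement.

HBM_BW_BYTES = 7e12    # 7 TB/s HBM3e bandwidth
PARAMS       = 200e9   # hypothetical 200B-parameter model
BYTES_PER_W  = 0.5     # FP4 weights: half a byte per parameter

weight_bytes = PARAMS * BYTES_PER_W
tokens_per_s = HBM_BW_BYTES / weight_bytes  # ignores KV cache, activations
print(f"Memory-bound ceiling: {tokens_per_s:,.0f} tokens/s per stream")  # 70
```

Batching, on-die SRAM reuse, and DMA-driven overlap all exist to push real throughput closer to the compute roofline than this naive ceiling suggests.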
Microsoft is also taking a different approach to networking. Instead of relying on proprietary interconnects, the company has built a two-tier scale-up network based on standard Ethernet. Each accelerator provides 2.8 TB per second of bidirectional scale-up bandwidth and supports predictable collective operations across clusters of up to 6,144 accelerators. This design signals a belief that Ethernet-based architectures can scale more cost-effectively and flexibly than proprietary fabrics.
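As a rough illustration of what that bandwidth buys, the sketch below estimates an idealized ring all-reduce over the scale-up fabric. Microsoft has not said which collective algorithms the network runs; the payload size, participant count, and per-direction bandwidth split are all assumptions.

```python
# Illustrative estimate of a ring all-reduce over the scale-up network.
# The 2.8 TB/s figure is from the announcement; the payload size, ring
# algorithm, and per-direction split are assumptions.

LINK_BW = 1.4e12       # assume 2.8 TB/s bidirectional = 1.4 TB/s each way
N       = 64           # hypothetical accelerators in one collective
TENSOR  = 2 * 1024**3  # hypothetical 2 GiB activation/gradient tensor

# A ring all-reduce moves 2*(N-1)/N of the tensor over each link.
bytes_on_wire = 2 * (N - 1) / N * TENSOR
print(f"Ideal all-reduce time: {bytes_on_wire / LINK_BW * 1e3:.2f} ms")  # ~3.02 ms
```

The "predictable collective operations" Microsoft emphasizes matter precisely because real fabrics deviate from this ideal through congestion and tail latency.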
The company reports that models were running on Maia 200 within days of receiving the first packaged silicon. Microsoft attributes this rapid turnaround to extensive pre-silicon simulation, early validation of the backend network, integration with Azure’s control plane, and the use of second-generation liquid cooling systems. These efforts reduced the time from first silicon to datacenter deployment by more than half compared with previous infrastructure programs.
The first Maia 200 racks are already operating in Microsoft’s US Central datacenter region near Des Moines, Iowa, with US West 3 near Phoenix, Arizona, scheduled to follow. Additional regions will be added as production scales.
Alongside the hardware, Microsoft is releasing a preview of the Maia SDK. The toolkit includes PyTorch integration, a Triton compiler, optimized kernel libraries, a low-level programming language called NPL, and a Maia simulator and cost calculator. The SDK is designed to make it easier for developers to port models across the different types of hardware in Azure’s increasingly heterogeneous environment.
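Microsoft has not yet published the Maia SDK’s API surface, but because the toolkit includes a Triton compiler, a standard Triton kernel gives a feel for the kind of code such a backend would consume. The example below is generic open-source Triton, not Maia-specific code.

```python
# A generic Triton kernel of the kind a Triton compiler backend consumes.
# This is standard open-source Triton (expects GPU tensors), not code from
# the Maia SDK, whose API surface has not been published.
import torch
import triton
import triton.language as tl

@triton.jit
def add_kernel(x_ptr, y_ptr, out_ptr, n, BLOCK: tl.constexpr):
    pid = tl.program_id(axis=0)              # one program per block of elements
    offs = pid * BLOCK + tl.arange(0, BLOCK)
    mask = offs < n                          # guard the ragged final block
    x = tl.load(x_ptr + offs, mask=mask)
    y = tl.load(y_ptr + offs, mask=mask)
    tl.store(out_ptr + offs, x + y, mask=mask)

def add(x: torch.Tensor, y: torch.Tensor) -> torch.Tensor:
    out = torch.empty_like(x)
    n = x.numel()
    grid = lambda meta: (triton.cdiv(n, meta["BLOCK"]),)
    add_kernel[grid](x, y, out, n, BLOCK=1024)
    return out
```

Kernels written this way are portable in principle: the same Triton source can be lowered to different backends, which is exactly the cross-hardware portability story the SDK is aiming at.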
Microsoft makes clear that Maia 200 is the start of a multi-generational roadmap, with future versions of the accelerator already in development. As Scott Guthrie, Microsoft’s executive vice president of Cloud + AI, notes, “The era of large scale AI is just beginning, and infrastructure will define what is possible.”
