At Google Cloud Next ’25, Google unveiled Ironwood, its seventh-generation Tensor Processing Unit (TPU). This latest iteration of Google’s custom AI accelerator is positioned as the company’s most powerful and scalable to date and, notably, the first TPU designed specifically for inference. For over a decade, TPUs have been a cornerstone of Google’s AI infrastructure, powering demanding training and serving workloads both internally and for its Cloud customers. Ironwood aims to build on this legacy, offering greater performance, capability, and energy efficiency tailored to the growing demand for large-scale inference.
Google emphasizes that Ironwood signifies a pivotal shift in AI development: the focus is moving from responsive models, which surface information for humans to interpret, to proactive models capable of generating insights and interpretations autonomously. This transition, described as the “age of inference,” envisions AI agents that proactively retrieve and generate data to collaboratively deliver insights and answers.

Ironwood is engineered to meet the significant computational and communication demands of this next phase of generative AI. The system scales up to 9,216 liquid-cooled chips interconnected by a new Inter-Chip Interconnect (ICI) network, with a full pod drawing nearly 10 MW of power. It is a key component of Google Cloud’s AI Hypercomputer architecture, which co-optimizes hardware and software for the most demanding AI workloads. Developers can use Google’s Pathways software stack to harness the combined computing power of tens of thousands of Ironwood TPUs.
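To make the scaling model concrete, here is a minimal data-parallel sketch with JAX, a common front end for Google’s TPU stack. Pathways itself is Google’s internal orchestration layer, so the mesh shape, batch size, and toy layer below are illustrative assumptions rather than Ironwood- or Pathways-specific APIs.

```python
# A minimal sketch of sharding work across TPU devices with JAX.
# The mesh shape and array sizes are illustrative assumptions.
import numpy as np
import jax
import jax.numpy as jnp
from jax.sharding import Mesh, NamedSharding, PartitionSpec as P

devices = jax.devices()                      # TPU chips visible to this host
mesh = Mesh(np.array(devices), axis_names=("data",))

# Shard the batch dimension of the input across all devices in the mesh.
sharding = NamedSharding(mesh, P("data"))
x = jax.device_put(jnp.ones((len(devices) * 128, 1024)), sharding)

@jax.jit
def layer(x):
    # A toy matmul standing in for one layer of a larger model; the
    # compiler partitions the work according to the input sharding.
    w = jnp.ones((1024, 1024))
    return jnp.dot(x, w)

y = layer(x)                                 # executes in parallel across shards
print(y.shape)                               # (num_devices * 128, 1024)
```

The same program runs unchanged on one chip or thousands; only the device mesh grows, which is the programming model the Pathways stack is built to serve.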
The design of Ironwood focuses on efficiently managing the complex computation and communication demands of “thinking models,” including Large Language Models (LLMs), Mixture of Experts (MoE) models, and advanced reasoning tasks. These models require substantial parallel processing and efficient memory access. Ironwood is designed to minimize data movement and latency while performing large-scale tensor manipulations, and it relies on a low-latency, high-bandwidth ICI network for coordinated, synchronous communication at full TPU pod scale.

Ironwood will be available to Google Cloud customers in two sizes: a 256-chip configuration and a larger 9,216-chip-per-pod configuration. Each individual chip has a peak compute of 4,614 TFLOPs, and Ironwood’s memory and network architecture is designed to keep data flowing fast enough to sustain peak performance at pod scale.
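As a sanity check on those figures, the pod-level peak follows directly from the per-chip number; this is simple arithmetic on the values above, not an additional vendor claim:

```latex
9{,}216 \text{ chips} \times 4{,}614 \;\tfrac{\text{TFLOP/s}}{\text{chip}}
  \approx 4.25 \times 10^{7} \;\text{TFLOP/s} \approx 42.5 \;\text{EFLOP/s}
```

In other words, a full Ironwood pod offers on the order of tens of exaFLOPs of peak compute.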
Additionally, Ironwood incorporates an enhanced SparseCore, an accelerator for processing the ultra-large embeddings used in ranking and recommendation workloads (the kind of sparse lookup sketched below). This expanded SparseCore support broadens the range of workloads that can be accelerated, including applications in finance and science.

Google Cloud highlights Ironwood’s significant performance gains alongside a focus on power efficiency. The TPU is designed to deliver more capacity per watt for customer workloads, featuring advanced liquid cooling and an optimized chip design. Ironwood also offers a substantial increase in High Bandwidth Memory (HBM) capacity and dramatically improved HBM bandwidth, while enhanced ICI bandwidth speeds communication between chips to support efficient distributed training and inference.
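For readers unfamiliar with the workload, the sketch below shows the core operation SparseCore accelerates: a sparse gather from a large embedding table. The table size, IDs, and JAX framing are illustrative assumptions; SparseCore’s actual interface is exposed through Google’s TPU software stack.

```python
# A minimal sketch of a sparse embedding lookup. All sizes are
# illustrative assumptions, not Ironwood specifics.
import jax
import jax.numpy as jnp

vocab, dim = 1_000_000, 128            # a scaled-down "ultra-large" table
table = jnp.zeros((vocab, dim))        # placeholder embedding weights

@jax.jit
def lookup(ids):
    # Gather one embedding row per sparse ID; ranking and recommendation
    # models issue huge numbers of these lookups per batch.
    return jnp.take(table, ids, axis=0)

vecs = lookup(jnp.array([3, 17, 42]))  # shape (3, 128)
```

The arithmetic per lookup is trivial; the bottleneck is scattered memory access across an enormous table, which is why a dedicated unit like SparseCore helps.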
Google Cloud emphasizes its experience in delivering AI compute and integrating it into large-scale services. Ironwood is presented as a solution for future AI demands, offering increased compute power, larger memory capacity, ICI networking advancements, and improved power efficiency. The company anticipates that Ironwood will enable further AI breakthroughs by its own developers and Google Cloud customers.