The need for scalable, high-performance infrastructure continues to grow exponentially as the AI landscape evolves. Our customers rely on Azure AI infrastructure to develop innovative AI-based solutions. That’s why today we’re delivering a new cloud-based AI supercomputing cluster built with Azure ND H200 v5 series virtual machines (VMs). These VMs are now generally available and are tailored to handle increasingly complex advanced AI workloads, from foundational model training to generative inferencing. The scale, efficiency, and enhanced performance of the ND H200 v5 VMs are already driving customer adoption of Microsoft AI services such as Azure Machine Learning and Azure OpenAI Service.
“We’re excited to adopt Azure’s new H200 VMs. We’ve seen that H200 offers improved performance with minimal porting effort. We look forward to using these VMs to accelerate our research, improve the ChatGPT experience, and further our mission.” —Trevor Cai, Head of Infrastructure, OpenAI.
The Azure ND H200 v5 VM is designed with Microsoft’s systems approach to improve efficiency and performance, and features eight NVIDIA H200 Tensor Core GPUs. Specifically, it addresses the gap that arises as GPUs grow in raw computational power much faster than their attached memory and memory bandwidth. Compared to the previous-generation Azure ND H100 v5 VMs, the Azure ND H200 v5 series delivers a 76% increase in high-bandwidth memory (HBM) to 141 GB per GPU and a 43% increase in HBM bandwidth to 4.8 TB/s. The increase in HBM bandwidth allows GPUs to access model parameters faster, helping reduce overall application latency, an important metric for real-time applications such as conversational agents. ND H200 v5 VMs can also accommodate more complex large language models (LLMs) within the memory of a single VM, allowing users to improve performance by avoiding the overhead of running distributed jobs across multiple VMs.
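To make the single-VM claim concrete, here is a minimal back-of-the-envelope sketch comparing the aggregate HBM of one eight-GPU ND H200 v5 VM against the memory needed to hold a model’s weights. The model sizes and precisions are illustrative assumptions, not official sizing guidance:

```python
# Rough, illustrative estimate of whether a model's weights fit in the
# aggregate HBM of a single VM. Model sizes and precisions below are
# example assumptions, not official sizing guidance.

GPUS_PER_VM = 8
H200_HBM_GB = 141   # per GPU, per the figures above
H100_HBM_GB = 80    # previous-generation comparison point

def weights_gb(params_billions: float, bytes_per_param: float) -> float:
    """Memory for model weights alone, ignoring KV cache and activations."""
    return params_billions * bytes_per_param

h200_total = GPUS_PER_VM * H200_HBM_GB  # 1,128 GB aggregate HBM
h100_total = GPUS_PER_VM * H100_HBM_GB  #   640 GB aggregate HBM

for name, params_b in [("70B model", 70), ("405B model", 405)]:
    for precision, bpp in [("FP16", 2), ("FP8", 1)]:
        need = weights_gb(params_b, bpp)
        fits = "fits" if need <= h200_total else "does not fit"
        print(f"{name} @ {precision}: ~{need:.0f} GB weights; "
              f"{fits} in one H200 VM ({h200_total} GB) "
              f"vs H100 VM ({h100_total} GB)")
```

The takeaway: weights that only fit when sharded across multiple previous-generation VMs can fit within the HBM of a single ND H200 v5 VM, avoiding cross-VM communication overhead.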
Additionally, the H200 supercomputing cluster design allows for more efficient management of GPU memory for model weights, key-value cache, and batch size, all of which directly affect the throughput, latency, and cost-effectiveness of LLM-based generative AI inference workloads. With its larger HBM capacity, the ND H200 v5 VM can support larger batch sizes, improving GPU utilization and throughput compared to the ND H100 v5 series for inference workloads on both small language models (SLMs) and LLMs. Initial testing shows up to a 35% increase in throughput for ND H200 v5 VMs compared to the ND H100 v5 series for inference workloads running the LLAMA 3.1 405B model (world size 8, input length 128, output length 8, and maximum batch sizes of 32 for H100 and 96 for H200). For more information about high-performance computing benchmarks on Azure, see the AI Benchmarking Guide in the Azure GitHub repository.
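As a rough illustration of why HBM capacity governs achievable batch size, the following sketch estimates a per-sequence key-value cache footprint and the largest batch that fits in the HBM left over after weights. The architecture parameters are illustrative assumptions, not the configuration used in the benchmark above:

```python
# Illustrative KV-cache sizing sketch. The model-architecture numbers
# below are assumptions for demonstration, not the configuration used
# in the benchmark quoted above; real deployments also reserve memory
# for activations and framework overhead, so practical limits are lower.

GPUS_PER_VM = 8

def kv_cache_gb(layers: int, kv_heads: int, head_dim: int,
                seq_len: int, bytes_per_elem: int = 2) -> float:
    """Per-sequence KV cache: one K and one V tensor per layer."""
    elems = 2 * layers * kv_heads * head_dim * seq_len
    return elems * bytes_per_elem / 1e9

def max_batch(hbm_per_gpu_gb: float, weights_gb: float,
              per_seq_gb: float) -> int:
    """Largest batch that fits after sharding weights across the VM."""
    free_gb = GPUS_PER_VM * hbm_per_gpu_gb - weights_gb
    return max(0, int(free_gb // per_seq_gb))

# Hypothetical 405B-class configuration (example values only).
per_seq = kv_cache_gb(layers=126, kv_heads=8, head_dim=128, seq_len=4096)
weights = 405.0  # ~405B parameters at FP8, weights only

for gpu, hbm in [("H100", 80), ("H200", 141)]:
    print(f"{gpu}: up to ~{max_batch(hbm, weights, per_seq)} sequences "
          f"(KV cache ~{per_seq:.2f} GB per sequence)")
```

Because the weights consume a fixed share of HBM, every additional gigabyte of capacity goes directly toward KV cache, which is what lets the H200 sustain the larger batch sizes reflected in the benchmark above.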
ND H200 v5 VMs come pre-integrated with Azure Batch, Azure Kubernetes Service, Azure OpenAI Service, and Azure Machine Learning, helping businesses get started right away. For detailed technical documentation on the new Azure ND H200 v5 VMs, see the ND H200 v5 series documentation on Microsoft Learn.
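As one example of the Azure Machine Learning integration, the sketch below provisions a GPU compute cluster with the Azure ML Python SDK v2. The VM size string and all resource names are assumptions for illustration; verify the exact size name for your region in the ND H200 v5 documentation:

```python
# Minimal sketch: create an Azure Machine Learning compute cluster
# backed by ND H200 v5 VMs using the Azure ML Python SDK v2 (azure-ai-ml).
# The size string and all names below are assumptions -- verify the
# exact VM size name in the ND H200 v5 series documentation.
from azure.ai.ml import MLClient
from azure.ai.ml.entities import AmlCompute
from azure.identity import DefaultAzureCredential

ml_client = MLClient(
    credential=DefaultAzureCredential(),
    subscription_id="<subscription-id>",
    resource_group_name="<resource-group>",
    workspace_name="<workspace>",
)

cluster = AmlCompute(
    name="nd-h200-v5-cluster",
    size="Standard_ND96isr_H200_v5",  # assumed ND H200 v5 size name
    min_instances=0,                  # scale to zero when idle
    max_instances=2,
    tier="Dedicated",
)

# Create (or update) the cluster; this returns a long-running operation.
ml_client.compute.begin_create_or_update(cluster).result()
print(f"Cluster '{cluster.name}' provisioned with size {cluster.size}.")
```

Setting min_instances to zero keeps the cluster cost-free while idle; Azure Machine Learning scales it up on demand when a training or inference job is submitted.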