Nvidia, Oracle, Google, Dell, and 13 other companies have reported how long their computers take to train the key neural networks in use today. Among the results were the first appearances of Nvidia’s next-generation GPU, the B200, and Google’s upcoming accelerator, called Trillium. On some tests the B200 delivered double the performance of today’s flagship Nvidia chip, the H100, and Trillium delivered nearly four times the performance of the chip Google tested in 2023.
The benchmark, called MLPerf v4.1, consists of six tasks: recommendation, pretraining of the large language models (LLMs) GPT-3 and BERT-large, fine-tuning of the Llama 2 70B large language model, object detection, graph node classification, and image generation.
Training GPT-3 is such an enormous task that it would be impractical to do it in full just to deliver a benchmark. Instead, the test is to train the model to a checkpoint at which experts have determined it would likely reach the goal if training continued. For Llama 2 70B, the goal is not to train the LLM from scratch but to take an already trained model and fine-tune it to specialize in a particular expertise, in this case, government documents. Graph node classification is a type of machine learning used in fraud detection and drug discovery.
As the focus of AI has shifted mainly toward generative AI, the set of tests has changed as well. This latest version of MLPerf marks a complete changeover in what is tested since the benchmark effort began. “At this point, all of the original benchmarks have been phased out,” says David Kanter, who leads the benchmark effort at MLCommons. In the previous round, some of the benchmarks took mere seconds to perform.
The performance of the best machine-learning systems on various benchmarks has improved faster than would be expected from Moore’s Law alone (blue line). Solid lines represent current benchmarks; dashed lines represent benchmarks that have been retired because they are no longer industry relevant. MLCommons
According to MLPerf’s calculations, AI training performance on the new suite of benchmarks is improving at roughly twice the rate one would expect from Moore’s Law. As the years have gone on, results have plateaued more quickly than they did at the start of MLPerf’s reign. Kanter attributes this mostly to the fact that companies have figured out how to run the benchmark tests on very large systems. Over time, Nvidia, Google, and others have developed software and networking technology that allows for near-linear scaling: doubling the processors cuts training time roughly in half.
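As a rough illustration of what near-linear scaling means in practice, here is a minimal sketch of how scaling efficiency is typically computed: the speedup actually achieved divided by the ideal speedup from adding chips. The timings in the example are made up for illustration, not MLPerf submissions.

```python
# Illustrative only: hypothetical timings, not MLPerf results.
# Scaling efficiency compares the speedup actually obtained from adding
# processors to the ideal (linear) speedup.

def scaling_efficiency(base_chips: int, base_minutes: float,
                       big_chips: int, big_minutes: float) -> float:
    """Return actual speedup divided by ideal speedup."""
    ideal_speedup = big_chips / base_chips        # 2x the chips -> ideally 2x faster
    actual_speedup = base_minutes / big_minutes   # how much faster the bigger system really was
    return actual_speedup / ideal_speedup

# Hypothetical example: doubling a 512-chip system to 1,024 chips
# cuts training time from 40 minutes to 21 minutes.
print(f"{scaling_efficiency(512, 40.0, 1024, 21.0):.0%}")  # ~95% of ideal linear scaling
```

Efficiencies near 100 percent are what “near-linear scaling” refers to; anything much lower means the extra chips are spending their time waiting on communication rather than computing.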
First Nvidia Blackwell training results
This round was the first training test for Nvidia’s next-generation GPU architecture, called Blackwell. For GPT-3 training and LLM fine-tuning, Blackwell (B200) roughly doubles the performance of H100 on a per-GPU basis. The gains were slightly less robust, but still significant, at 64% and 62% for recommender systems and image generation, respectively.
The Blackwell architecture, implemented in the Nvidia B200 GPU, continues an ongoing trend toward using less and less precise numbers to speed up AI. For certain parts of transformer neural networks such as ChatGPT, Llama 2, and Stable Diffusion, the Nvidia H100 and H200 use 8-bit floating-point numbers. The B200 brings that down to just 4 bits.
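To give a feel for what 4-bit floating point means, here is a minimal, illustrative sketch of rounding values onto the FP4 (E2M1) grid of representable magnitudes. This is not Nvidia’s actual Blackwell pipeline, which also applies per-block scale factors and other machinery omitted here.

```python
import numpy as np

# Illustrative sketch of low-precision rounding, not Nvidia's FP4 implementation.
# An E2M1 "FP4" number can represent only these magnitudes (plus a sign bit);
# real hardware pairs them with scale factors to cover a wider dynamic range.
FP4_MAGNITUDES = np.array([0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0])

def quantize_fp4(x: np.ndarray) -> np.ndarray:
    """Round each value to the nearest representable FP4 value (sign handled separately)."""
    signs = np.sign(x)
    mags = np.abs(x)
    # index of the closest representable magnitude for every element
    idx = np.argmin(np.abs(mags[..., None] - FP4_MAGNITUDES), axis=-1)
    return signs * FP4_MAGNITUDES[idx]

weights = np.array([-2.7, 0.31, 1.2, 5.1, 0.05])
print(quantize_fp4(weights))   # [-3.   0.5  1.   6.   0. ]
```

The payoff is that each number takes half the memory and bandwidth of an 8-bit value, at the cost of the much coarser rounding visible in the output.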
Google launches 6th generation hardware
Google showed the first results for its 6th-generation TPU, called Trillium, which it unveiled only last month, and a second round of results for its 5th-generation variant, the Cloud TPU v5p. In the 2023 edition, the search giant entered a different variant of the 5th-generation TPU, the v5e, which is designed more for efficiency than performance. Compared with the latter, Trillium delivers as much as a 3.8-fold performance boost on the GPT-3 training task.
But versus its arch-rival Nvidia, things didn’t look as rosy. A system made up of 6,144 TPU v5ps reached the GPT-3 training checkpoint in 11.77 minutes, placing a distant second to an 11,616-GPU Nvidia H100 system, which accomplished the task in about 3.44 minutes. That top TPU system was only about 25 seconds faster than an H100 computer half its size.
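Because the two systems differ so much in size, one rough way to put them on a common footing is to normalize by chip count. The quick back-of-the-envelope below uses only the figures reported above; “chip-minutes” is an informal normalization, not an official MLPerf metric.

```python
# Rough per-chip comparison using the reported results.
systems = {
    "Google TPU v5p": (6_144, 11.77),   # (chips, minutes to GPT-3 checkpoint)
    "Nvidia H100":    (11_616, 3.44),
}

for name, (chips, minutes) in systems.items():
    chip_minutes = chips * minutes
    print(f"{name}: {chip_minutes:,.0f} chip-minutes")
# The H100 run comes to roughly 40,000 chip-minutes versus about 72,000 for the
# TPU system, so the Nvidia machine isn't just larger; it also gets more done per chip.
```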
Dell Technologies’ computers used about 75 cents worth of electricity to fine-tune the Llama 2 70B large language model.
The closest head-to-head comparison between v5p and Trillium involves systems of 2,048 TPUs each. In that matchup, the upcoming Trillium shaved 2 minutes off the GPT-3 training time, nearly an 8 percent improvement on the v5p’s 29.6 minutes. Another difference between the Trillium and v5p entries is that Trillium is paired with AMD Epyc CPUs instead of the v5p’s Intel Xeons.
Google also used the Cloud TPU v5p to train its image generator, Stable Diffusion. At 2.6 billion parameters, Stable Diffusion is a light enough lift that MLPerf contestants are asked to train it to convergence rather than just to a checkpoint, as with GPT-3. A system of 1,024 TPUs came in second, finishing the job in 2 minutes 26 seconds, about a minute behind a same-size system made up of Nvidia H100s.
Training’s power consumption still unclear
The enormous energy cost of training neural networks has long been a source of concern, and MLPerf has begun to measure it. Dell Technologies was the lone entrant in the energy category, with an eight-server system containing 64 Nvidia H100 GPUs and 16 Intel Xeon Platinum CPUs. The only measurement made was of the LLM fine-tuning task (Llama 2 70B). The system consumed 16.4 megajoules during its roughly 5-minute run, an average power draw of about 55 kilowatts. That works out to about 75 cents worth of electricity at average U.S. rates.
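As a sanity check, the reported energy converts to the cost figure as follows. The electricity price used here, about 16.5 cents per kilowatt-hour, is an assumption based on the rough U.S. average; the exact rate behind the 75-cent figure isn’t stated.

```python
# Back-of-the-envelope check of the Dell energy result.
# Assumption: ~16.5 cents/kWh, roughly the average U.S. electricity price.
energy_megajoules = 16.4
run_seconds = 5 * 60

energy_kwh = energy_megajoules * 1e6 / 3.6e6          # 1 kWh = 3.6 MJ -> ~4.6 kWh
avg_power_kw = energy_megajoules * 1e3 / run_seconds  # ~55 kW average draw
cost_dollars = energy_kwh * 0.165                     # ~0.75, i.e. about 75 cents

print(f"{energy_kwh:.2f} kWh, {avg_power_kw:.1f} kW average, ${cost_dollars:.2f}")
```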
While that doesn’t say much on its own, the result does provide a ballpark for the power consumption of similar systems. Oracle, for example, reported a close performance result, 4 minutes 45 seconds, using the same number and types of GPUs and CPUs.