Training AI is an expensive feat – and a complex one as well

Martin Průcha, 08. 07. 2024


Nvidia has recently surpassed Microsoft as the most valuable company, with a market cap of 3.34 trillion USD. It is no secret that a great deal of this success can be attributed to the design of the graphics cards used to train AI models. In fact, multiple large AI labs, including but not limited to OpenAI/Microsoft, xAI, and Meta, are in a race to build GPU clusters with over 100,000 GPUs. Each such cluster would cost at least 4 billion dollars in hardware alone, not even counting the immense amount of electricity it consumes: around 150 MW. 
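To get a feel for where those figures come from, here is a rough back-of-envelope check. The per-GPU price and the all-in power draw below are illustrative assumptions, not reported numbers.

```python
# Rough sanity check of the cluster cost and power figures above.
# Both per-GPU numbers are assumptions chosen purely for illustration.

num_gpus = 100_000

# Assumed all-in capital cost per GPU (server, networking, storage included)
cost_per_gpu_usd = 40_000
capex_usd = num_gpus * cost_per_gpu_usd
print(f"Hardware cost: ~${capex_usd / 1e9:.0f} billion")   # ~$4 billion

# Assumed all-in draw per GPU: ~700 W for the H100 itself, plus the rest of
# the server, networking and cooling, roughly 1.5 kW in total
watts_per_gpu = 1_500
power_mw = num_gpus * watts_per_gpu / 1e6
print(f"Power draw: ~{power_mw:.0f} MW")                    # ~150 MW
```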

To put in perspective how much compute a 100,000-GPU cluster can provide: according to SemiAnalysis, OpenAI's total training compute for GPT-4 was ~2.15e25 BF16 FLOP (21.5 million ExaFLOP), run on ~20,000 A100s for 90 to 100 days. On a 100k H100 cluster, peak throughput would soar to 198/99 FP8/FP16 ExaFLOP per second, a 31.5x increase in peak theoretical AI training FLOPS compared to the 20k A100 cluster. 
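The comparison can be reproduced from Nvidia's published per-GPU peak figures. In the sketch below, the utilization factor and run length are illustrative guesses, not reported values; small differences from the quoted numbers come from rounding.

```python
# Back-of-envelope reconstruction of the cluster comparison above.
# Per-GPU throughputs are Nvidia's published dense peak figures;
# the utilization factor and run length are assumptions for illustration.

A100_BF16 = 312e12     # FLOP/s per A100, dense BF16
H100_FP8  = 1_979e12   # FLOP/s per H100, dense FP8
H100_FP16 = 989e12     # FLOP/s per H100, dense FP16/BF16

a100_cluster = 20_000 * A100_BF16    # ~6.2 ExaFLOP/s peak
h100_fp8     = 100_000 * H100_FP8    # ~198 ExaFLOP/s peak
h100_fp16    = 100_000 * H100_FP16   # ~99 ExaFLOP/s peak

print(f"Peak speed-up (FP8 vs BF16): {h100_fp8 / a100_cluster:.1f}x")

# Rough check of the ~2.15e25 FLOP GPT-4 figure, assuming ~40% utilization
# and a 95-day run (both assumed, not reported)
seconds = 95 * 24 * 3600
estimate = a100_cluster * 0.40 * seconds
print(f"Estimated GPT-4 training compute: {estimate:.2e} FLOP")
```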

It is staggering that this massive amount of computational power is already being acquired by several major tech companies. According to Observer, Meta, for example, has been one of the largest buyers, securing around 350,000 H100 GPUs. Microsoft is not falling behind, with an acquisition of approximately 150,000 H100 GPUs to improve its AI infrastructure (Observer). Google and Amazon have each purchased around 50,000 H100 GPUs to support their respective AI and cloud computing services (Tom's Hardware, Observer). Additionally, Tesla has redirected 12,000 H100 GPUs to Elon Musk's AI projects, and Musk's company xAI plans to turn an old factory in Memphis, Tennessee into a datacenter. 

It is not just about raw computational power, though. Companies also need to design efficient model architectures, and it is here that their efforts differ. OpenAI still holds the upper hand: its GPT-4 uses a better architecture, so while Google's Gemini Ultra, Nvidia's Nemotron 340B, and Meta's Llama 3 405B had a similar amount of computational power dedicated to training, their inferior architectures left them short of unlocking new capabilities, according to SemiAnalysis.

While Nvidia's advancements in AI hardware have propelled the company to new heights and unlocked unprecedented computational power, the challenge of efficiently training large-scale AI models persists. The need for sophisticated parallelism techniques (such as tensor parallelism, which SemiAnalysis explains in detail here), combined with the inefficiencies introduced by larger clusters and hardware reliability issues, means that realizing the full potential of these advanced systems remains an ongoing struggle for the AI community.
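To illustrate the idea behind tensor parallelism, here is a minimal NumPy sketch, a toy simulation rather than how production frameworks such as Megatron-LM are actually implemented: a weight matrix is split column-wise across simulated devices, each device computes its slice of the output independently, and the slices are gathered back together.

```python
# A minimal sketch of tensor (model) parallelism, simulated with NumPy.
# A single weight matrix is split column-wise across "devices"; each device
# computes its slice of the output, and the slices are concatenated.
# Real frameworks do this across GPUs with collective communication;
# this toy version only illustrates the math.

import numpy as np

rng = np.random.default_rng(0)

batch, d_in, d_out, num_devices = 4, 8, 16, 2

x = rng.standard_normal((batch, d_in))
W = rng.standard_normal((d_in, d_out))

# Reference: the full matmul on one "device"
y_full = x @ W

# Tensor-parallel version: each device holds a column slice of W
W_shards = np.split(W, num_devices, axis=1)           # split d_out across devices
partial_outputs = [x @ W_i for W_i in W_shards]       # computed independently
y_parallel = np.concatenate(partial_outputs, axis=1)  # gather step

assert np.allclose(y_full, y_parallel)
print("Column-parallel matmul matches the single-device result.")
```

In a real cluster each shard would live on a separate GPU, and the final gather would be a collective communication step over the network, which is exactly where the inefficiencies of very large clusters tend to show up.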

Author: Oldřich Příklenk

Picture: chatgpt

 

