DDN today launched a new version of its high-end storage solution for AI and high-performance computing, which Nvidia uses to power its massive Eos supercomputer. The AI400X2 Turbo features a 30% performance boost over the system it replaces, which will enable customers to train large language models more efficiently when paired with Nvidia GPUs.
DDN has a long history of developing storage solutions for the HPC business. In the new AI era, it has leveraged that leadership to serve the exploding need for high-speed storage to train large language models (LLMs) and other AI models.
While the training data in an LLM is rather modest by big data standards, the need to continually back up, or checkpoint, the LLM during a training session has driven the demand. For instance, when Nvidia started working with AI400X2 systems two years ago, it required a collection of storage systems capable of delivering 1 TB per second for reads and 500 GB per second for writes, according to James Coomer, senior vice president of products for DDN.
“That was very important to them,” Coomer says. “Even though this was an LLM and rationally you think that’s only words, that’s not big volumes, the model size becomes very large and they need to be checkpointed a lot.”
Nvidia, which is holding its GPU Technology Conference this week in San Jose, California, adopted the AI400X2 for its own supercomputer, dubbed Eos, which was launched in March 2022. The 18-exaflop cluster sports 48 AI400X2 appliances, which deliver 4.3 TB/sec reads and 3.1 TB/sec writes to the SuperPOD loaded with 576 DGX systems and more than 4,600 H100 GPUs.
“That write performance was a really big goal for them because of the checkpointing operations,” says Kurt Kuckein, vice president of marketing for DDN. “Their whole goal was to ensure around 2 TB/sec, and we were able to achieve above 3 [TB/sec] for the write performance.”
That total throughput would theoretically go up 30% with the new AI400X2 Turbo that DDN announced today. As a 2U appliance, the AI400X2 Turbo can read data at speeds up to 120 GB/s and write data at speeds up to 75 GB/s, with total IOPS of 3 million. That compares with 90 GB/s for reads and 65 GB/s for writes for the AI400X2, which the AI400X2 Turbo replaces atop the DDN stack.
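To see why checkpoint-heavy training cares so much about write bandwidth, a back-of-envelope calculation helps. The sketch below uses the per-appliance write figures quoted in this article (65 GB/s and 75 GB/s); the 1 TB checkpoint size is a hypothetical round number for illustration, not a figure from DDN or Nvidia.

```python
# Rough illustration: time spent flushing one checkpoint at a given
# sustained write bandwidth. The 1 TB checkpoint size is an assumed
# example value, not a published DDN/Nvidia figure.

def checkpoint_seconds(checkpoint_tb: float, write_gb_per_s: float) -> float:
    """Seconds to write a checkpoint of `checkpoint_tb` TB at
    `write_gb_per_s` GB/s (1 TB = 1000 GB)."""
    return checkpoint_tb * 1000 / write_gb_per_s

# Per-appliance write speeds from the article.
for name, bw in [("AI400X2", 65), ("AI400X2 Turbo", 75)]:
    t = checkpoint_seconds(1.0, bw)
    print(f"{name}: {t:.1f} s per hypothetical 1 TB checkpoint")
```

Since training typically pauses or slows while a checkpoint drains, shaving seconds off each flush compounds across the thousands of checkpoints taken during a long run.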
Customers will be able to leverage that 30% benefit in several ways: by cranking through more work in the same amount of time, getting the same job done faster, or getting the same job done with fewer storage systems, DDN says.
“We can reduce the number of appliances provisioned, and so potentially you get 30% savings in power versus just raw performance, training times and things like that,” Kuckein says. “Depending on the number of GPUs and things that you have, potentially you’re just reducing the storage footprint.”
When customers string multiple AI400X2 appliances together to Nvidia DGX systems or SuperPODs over 200Gb InfiniBand or Ethernet networks, the total throughput scales up accordingly. But it’s not just about the hardware investment, Coomer says.
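That linear scaling is consistent with the Eos figures cited earlier in the article: 48 appliances at the per-appliance rates of 90 GB/s reads and 65 GB/s writes land almost exactly on the quoted 4.3 TB/sec and 3.1 TB/sec aggregates. A minimal check, using only numbers from this article:

```python
# Aggregate throughput from stringing appliances together, assuming
# simple linear scaling (per-appliance specs taken from the article).

def aggregate_tb_per_s(n_appliances: int, per_appliance_gb_s: float) -> float:
    """Total TB/s for n appliances each sustaining per_appliance_gb_s GB/s."""
    return n_appliances * per_appliance_gb_s / 1000

reads = aggregate_tb_per_s(48, 90)   # 48 AI400X2 units, 90 GB/s reads each
writes = aggregate_tb_per_s(48, 65)  # 65 GB/s writes each
print(f"Eos estimate: {reads:.2f} TB/s reads, {writes:.2f} TB/s writes")
```

The computed 4.32 TB/s reads and 3.12 TB/s writes match the article's Eos numbers, suggesting the cluster delivers close to the full per-appliance rate even at 48 units.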
“For us, of course, the argument isn’t really that we do 120 GB/sec. The biggest argument by far is customers of ours have spent like $100 million in infrastructure and cooling and networks and data scientists and data centers and stuff. There’s a huge competitive play out there to get your models done faster. It’s about spending 5% of that budget max on storage if you choose DDN, then you get more productive output.”
DDN has experienced a significant increase in sales thanks to the GenAI boom. The company says its 2023 sales for AI storage were double the 2022 level.
“We didn’t know it was going to be like this,” Coomer said. “We posted a press release last year saying we shipped as much in the first quarter as we did in the previous year. This year, it kind of looks like it might turn out to be similar.”
The AI400X2 Turbo will be available soon. The appliances can be fitted with 2.5-inch NVMe drives in capacities from 30TB to 500TB. In addition to DDN’s file system, it includes quality of service, port zoning detection, and data integrity check/correction functionality.
Related Items:
AWS Delivers ‘Lightning’ Fast LLM Checkpointing for PyTorch
GenAI Doesn’t Need Bigger LLMs. It Needs Better Data
Why Object Storage Is the Answer to AI’s Biggest Challenge