AI is running up against hard physical limits. The insatiable demand for AI compute is stretching energy grids to their breaking point. Five years ago, when the last operating reactor at the Three Mile Island nuclear plant was shut down, no one could have predicted it would come back to life to power a single data center. But that is exactly what has happened with Microsoft’s recent power purchase agreement, and Microsoft is not alone in facing extreme AI energy challenges.
Today’s data center architects understand that every watt and square foot matters when it comes to deploying new AI applications. Enterprises cannot run AI on the hardware of yesterday, and storage is no exception. Choosing more energy- and space-efficient solid-state drives (SSDs) can free up the power and space needed for more AI model training and inferencing.
No conversation about data center power efficiency can happen without first understanding the extreme growth in compute power and data over the past ten years. Back in 2014, an average processor needed 100W of cooling. By 2024, that average had grown more than five times,1 with current NVIDIA H100 SXM GPUs needing 700W of cooling.2
Average rack power requirements have risen in step. Rack power in 2014 averaged about 4 to 5kW; by 2024 it had grown to 10 to 14kW,3 with GPU-based compute racks calling for much more. At the recent OCP Global Summit, both Microsoft and Google mentioned they had working racks that scale from hundreds of kilowatts to 1MW.
In addition, GenAI and other AI applications devour ever more data to deliver better models, driving a massive increase in data volume; for example, 3 to 5 billion new pages are added each month to the Common Crawl.5 Some AI model data sets have also more than doubled in size every two years.6
The challenges in delivering sufficient power and cooling for GPU infrastructure grab today’s headlines, but when power is limited, every watt matters. Beyond compute, storage represents a significant portion of data center energy usage.
For instance, published data from Meta shows that legacy hard disk drive (HDD) storage consumes 35% of AI recommendation engine cluster power.7 Microsoft reports that storage accounts for 33% of an Azure solution’s overall operational emissions, which correlate with energy consumption.8 In power-constrained environments, a watt used for storage is, in effect, one less watt for compute.
Data storage built on high-capacity SSDs lets you store more data in fewer devices than legacy storage. More to the point, all things being equal, fewer drives consume less energy, require fewer servers and less space, and as a result can reduce overall cooling requirements. The industry’s highest-capacity data center SSD, the Solidigm D5-P5336, available in capacities up to 61.44TB, can store massive data sets in a smaller operational power footprint than today’s highest-capacity HDDs.9
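To put the consolidation in concrete terms, here is a minimal sketch of the drive-count math, assuming 61.44TB QLC SSDs (per the D5-P5336 above) and 24TB HDDs (the capacity used in the comparison table below); the replication factor is left as a parameter, since different deployments mirror data differently.

```python
# Minimal consolidation sketch: how many drives a data set needs at a
# given per-drive capacity. Capacities come from the article; the
# replication factor is an assumption, not a recommendation.
import math

def drives_needed(usable_pb: float, drive_tb: float, copies: int = 1) -> int:
    """Drives required to hold `usable_pb` of data with `copies`-way mirroring."""
    return math.ceil(usable_pb * 1000 * copies / drive_tb)

# Unreplicated drive counts for 1PB of data:
print(drives_needed(1, 61.44))  # 17 x 61.44TB QLC SSDs
print(drives_needed(1, 24.0))   # 42 x 24TB HDDs, ~2.5x more devices
```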
We find the data capacity used per AI rack (4 DGX servers) ranges from 0.5 to 2.0PB for text-based AI applications to around 16PB for vision-based AI applications. Moreover, multiple vendors are showcasing up to 32PB per AI rack. To depict power savings in the table below, we elected to use 16PB of data per compute rack, recognizing that SSD power savings scale almost linearly with the amount of data needed.
For our comparison, we host the 16PB of data on either a TLC SSD cache with an HDD backend or an all-Solidigm QLC SSD solution.
**16PB of Data Storage per Compute Rack**

| Storage Config | TLC Cache With HDD Backend | All-Solidigm QLC SSD |
|---|---|---|
| Data locality | Split between TLC SSD cache and HDD bulk storage | All data in QLC NAND |
| Storage rack space | ~3 racks (78U). Cache: 18U (209 TLC SSDs at 7.68TB each) in 12-SSD/1U servers. Bulk storage: 60U (1,800 HDDs at 24TB each, assuming three-way mirroring) in 90-drive/3U JBODs | ~0.5 rack (21U). Bulk storage: 21U (521 SSDs at 61.44TB each, assuming two-way mirroring) in one 12-SSD/1U server plus two 32-drive/1U JBOFs, i.e., 76 SSDs per 3U of rack space |
| Storage power | 18.9kW. Cache: 1.3kW (209 TLC SSDs). Bulk storage: 17.6kW (1,800 HDDs) | 3.7kW (521 QLC SSDs) |
| Support power & rack space | 10.5kW (3.5kW per rack for a 3U PSU + 3U networking) and 18U of rack space (6U per rack across 3 racks) | 3.5kW (3U PSU + 3U networking) and 6U of rack space |
| Total power & rack space | 29.4kW, 96U over 3 racks | 7.2kW, 27U over 1 rack |
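The totals above are easy to sanity-check. The sketch below recomputes them, with per-drive wattages back-calculated from the table’s aggregate figures (roughly 6.2W per TLC SSD, 9.8W per HDD, and 7.1W per QLC SSD); treat these as illustrative assumptions rather than measured drive specifications.

```python
# Recompute the table's power totals. Per-drive watts are back-calculated
# from the table's aggregates and are illustrative assumptions, not specs.

def storage_kw(tiers, support_kw):
    """Total power: per-tier (drive count, watts/drive) plus PSU/networking."""
    return sum(n * w for n, w in tiers) / 1000 + support_kw

# TLC cache (209 SSDs) + HDD bulk (1,800 drives); 3 racks of support gear.
hybrid_kw = storage_kw([(209, 6.2), (1800, 9.8)], support_kw=3 * 3.5)
# All-QLC (521 x 61.44TB SSDs); a single rack of support gear.
qlc_kw = storage_kw([(521, 7.1)], support_kw=3.5)

savings_kw = hybrid_kw - qlc_kw  # ~22.2kW
dgx_h100_kw = 10.2               # per-server draw cited in the text below
print(f"hybrid: {hybrid_kw:.1f}kW, all-QLC: {qlc_kw:.1f}kW")
print(f"savings: {savings_kw:.1f}kW, "
      f"or {int(savings_kw / dgx_h100_kw)} extra DGX H100 servers")
```

Because drive power dominates the totals, the savings scale almost linearly with the amount of data stored, as noted above.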
Deploying an all-Solidigm D5-P5336 QLC SSD array would save the data center up to 22.2kW of power and more than 1.6 racks of space for 16PB of AI data. Your mileage may vary, but in general this is the scale of power and space savings you can realize by deploying QLC SSDs over legacy storage for a single rack of AI compute.
Saving 22.2kW of power may not seem like much when an NVIDIA DGX H100 server consumes 10.2kW, but it could mean deploying two more of them for AI applications in the data center. And the power savings only increase if more data per compute rack is needed for AI.
We would be remiss if we didn’t mention the cost differential to consider here. HDDs have historically cost less than SSDs on a $/TB basis, so acquisition costs may be higher for all-QLC SSD storage.
Nonetheless, for power-constrained retrofits, or even greenfield data center deployments with limited power, being able to save watts can be a make-or-break factor in bringing new AI applications online.
When it comes to power and space efficiency, today’s enterprise Solidigm QLC SSDs are transforming the modern data center. Choosing energy- and space-efficient SSD storage can deliver a fuller return on AI infrastructure investments.
Dave Sierra is a Product Marketing Analyst at Solidigm, where he focuses on solving the infrastructure efficiency challenges that face today's data centers.
Ace Stryker is the Director of Market Development at Solidigm, where he focuses on emerging applications for the company’s portfolio of data center storage solutions, with a special expertise in AI workloads and solutions.
1. Average rack power and power segmentation.
2. NVIDIA Tensor Core GPU datasheet: https://resources.nvidia.com/en-us-tensor-core/nvidia-tensor-core-gpu-datasheet
3. IDC: https://www.idc.com/getdoc.jsp?containerId=US50554523
5. Common Crawl: https://commoncrawl.org/
6. Epoch AI: https://epochai.org/trends#data
7. Meta Engineering: https://engineering.fb.com/2022/09/19/ml-applications/data-ingestion-machine-learning-training-meta/
8. “A Call for Research on Storage Emissions,” Carnegie Mellon and Microsoft Azure: https://hotcarbon.org/assets/2024/pdf/hotcarbon24-final126.pdf
9. Solidigm D5-P5336: https://www.solidigm.com/products/data-center/d5/p5336.html