Understanding the Benefits of Extreme Storage Density

In the era of generative AI, more data has been created than ever. Solidigm™ offers a solution to many challenges in the modern AI factory.

It's no secret that we love the massive density of Solidigm 61.44TB U.2 NVMe SSDs. We have conducted numerous endurance and performance tests, made scientific discoveries, and pushed world record computations to new, extraordinary heights. So, with the AI craze surging at a blistering pace all around us, the next logical step was to see how the Solidigm NVMe drives stack up in the dynamic world of AI 2024.

Image Caption: Front view of the Lenovo ThinkSystem SR675 V3 showing Solidigm D5-P5336

Solidigm D5-P5336 61.44TB as part of an AI data storage solution.
Solidigm D5-P5336 61.44TB as part of an AI data storage solution.

P5336

The D5-P5336 is a hyper-scalable, cost effective SSD for read-intensive workloads.

 

Solidigm 61.44TB QLC SSDs stand out for their remarkable storage capacity, enabling data centers to pack more storage into fewer drives. This extreme density is especially advantageous in AI servers, where data sets are growing exponentially, and efficient storage solutions are paramount. Using these high-capacity SSDs, data centers can reduce the number of physical drives, decrease footprint, reduce power consumption, and simplify maintenance.

Limited PCIe lanes in GPU servers

One of the primary challenges in modern GPU servers is the limited number of PCIe lanes available after the GPUs get their share. Critical for AI workloads, GPUs require substantial PCIe bandwidth, often leaving limited lanes for other components, including storage devices and networking. This constraint makes it essential to optimize the use of available PCIe lanes. Solidigm 61.44TB QLC SSDs offer a solution by providing massive storage capacity in a single drive, reducing the need for multiple drives, and conserving PCIe lanes for GPUs and other essential components.

Solidigm QLC SSD for extreme storage density

Figure 1. Internal view of Lenovo ThinkSystem SR675 V3 drive enclosure

AI workloads and storage requirements

AI workloads can be broadly categorized into three phases: data preparation, training and fine-tuning, and inferencing. Each phase has unique storage requirements, and Solidigm high-capacity SSDs can significantly enhance performance and efficiency across these phases.

Data preparation

Data preparation is the foundation of any AI project and involves data collection, cleaning, transformation, and augmentation. This phase requires extensive storage as raw data sets can be enormous. Solidigm 61.44TB QLC SSDs can store expansive raw data without compromising performance. Moreover, the high sequential read and write speeds of these SSDs ensure quick access to data, accelerating the preparation process. For data preparation, the Solidigm 61.44TB QLC SSDs meet all the demands outlined above with benefits such as:

  • Massive Storage Capacity: Efficient handling of large data sets.
  • High Sequential Speeds: Fast data access and processing.
  • Reduced Latency: Minimized delays in data retrieval, enhancing workflow efficiency.

Training and fine-tuning

Training AI models is an intensive process that involves feeding extensive data sets into neural networks to adjust weights and biases. This phase is computationally demanding and requires high IOPS (Input/Output Operations Per Second) and low latency storage to keep up with the rapid data exchanges between the storage and the GPUs. Solidigm SSDs excel in this regard, offering high performance and durability. The extreme density of these SSDs allows for more extensive data sets to be used in training, potentially leading to more accurate models. To meet the training and fine-tuning demands, the Solidigm SSDs deliver the following:

  • High IOPS: Supports rapid data exchanges essential for training.
  • Durability: QLC technology optimized for read-heavy workloads, ideal for repeated training cycles.
  • Scalability: Expand storage without adding physical drives, maintaining efficient use of PCIe lanes.

Inferencing

Once trained, AI models are deployed to make predictions or decisions based on new data, known as inferencing. This phase often requires quick access to pre-processed data and efficient handling of increased read requests. Solidigm 61.44TB QLC SSDs provide the necessary read performance and low latency to ensure that inferencing operations are carried out smoothly and swiftly. Solidigm SSDs exceed the performance and low latency by delivering the following benefits:

  • Fast Read Performance: Ensures rapid access to data for real-time inferencing.
  • Low Latency: Critical for applications requiring immediate responses.
  • High Capacity: Store extensive inferencing data and historical results efficiently.

Why is it important to get large storage as close to the GPU as possible?

For AI and machine learning, the proximity of storage to the GPU can significantly impact performance. It is increasingly important to consider several factors when properly designing an AI data center. This is why it is crucial to have extensive storage that is as close to the GPU as possible. As we explored recently, access to a sizable network-attached storage solution is starting to shape into a one-tool-in-the-belt, but relying on it alone might not always be the optimal choice.

Latency and bandwidth

A primary reason for placing ample storage close to the GPU is to minimize latency and maximize bandwidth. AI workloads, particularly during training, involve frequent and massive data transfers between the storage and the GPU. High latency can bottleneck the entire process, slowing training times and reducing efficiency.

In AI workloads, where rapid data availability is critical, low latency ensures that GPUs receive data promptly, reducing idle times and enhancing overall computational efficiency. During the training phase, massive volumes of data need to be continuously fed into the GPU for processing. By minimizing latency, DAS ensures that the high-speed demands of AI applications are met, leading to faster training times and more efficient workflows.

Solidigm D5-P5336 in NVIDIA GPU AI data storage.

Figure 2. Internal view of Lenovo ThinkSystem SR675 V3 view GPUs

NVMe SSDs maximize the potential of the PCIe interface, providing significantly faster data transfer and bypassing slower existing infrastructure. This high bandwidth is crucial for AI workloads requiring large data sets' swift movement. When storage is directly attached, the bandwidth available to the GPUs is maximized, allowing quicker access to the extensive data necessary for training complex models.

In contrast, network-attached storage of legacy installations introduces additional layers of latency and typically lowers bandwidth. Even with high-speed networks, the overhead associated with network protocols and potential network congestion can impede performance.

Having massive capacity directly attached to your GPU allows for data staging so that it does not have to wait to get the job done when the GPU starts crunching.

Data throughput and I/O performance

Local NVMe SSDs excel in handling a large number of Input/Output Operations Per Second (IOPS), which is crucial for the read/write-intensive nature of AI workloads. During the training phase, AI models require rapid access to vast data repositories, necessitating storage solutions that can keep up with the high demand for data transactions.

NVIDIA and Solidigm for AI data storage workloads.

Figure 3. NVIDIA L40S GPUs

The Solidigm D5-P5336, designed for high-capacity, high-performance scenarios, delivers exceptional IOPS, enabling faster data retrieval and writing processes. This capability ensures that the GPUs remain busy with computation rather than waiting for data, thereby maximizing efficiency and reducing training times. The high IOPS performance of local NVMe SSDs makes them ideal for the demanding environments of AI applications, where quick data access and processing are essential for optimal performance.

Data management

While in some scenarios, having ample storage directly attached to the GPU simplifies data management, this does add a necessary layer of data management to stage the data on the GPU server. In a perfect world, your GPU is busy crunching, and your CPU is going out to the network to save checkpoints or bring down new data.

61.44TB Solidigm drives help reduce the number of later types of transactions that need to occur. You could also account for this using a simplified network setup and distributed file systems. This straightforward approach can streamline workflows and reduce the potential for data-related errors or delays.

Data storage bay with Solidigm D5-P5336 61.44TB SSD.

Figure 4. Lenovo ThinkSystem SR675 V3 with Solidigm D5-P5336 61.44TB SSD

Suppose you are working inside a single server, fine-tuning models that fit within a handful of locally attached GPUs. In that case, you have the advantage of local storage, which is more straightforward to set up and manage than network storage solutions. Configuring, administrating, and maintaining network-attached storage can be complex and time-consuming, often requiring specialized knowledge and additional infrastructure. In contrast, local storage solutions like NVMe SSDs are more straightforward to integrate into existing server setups.

This simplicity in configuration and maintenance allows IT teams to focus more on optimizing AI workloads rather than dealing with the intricacies of network storage management. As a result, deploying and managing storage for AI applications becomes more straightforward and efficient with local NVMe SSDs.

Cost and scalability

While NAS solutions can scale horizontally by adding more storage devices, they also come with costs related to the network infrastructure and potential performance bottlenecks. Conversely, investing in high-capacity local storage can provide immediate performance benefits without extensive network upgrades.

Local storage solutions are often more cost-effective than network-attached storage systems (NAS) because they eliminate the need for expensive network hardware and complex configurations. Setting up and maintaining NAS involves significant investment in networking equipment, such as high-speed switches and routers, and ongoing network management and maintenance costs.

Large-capacity local SSDs integrated directly into the server are used as a staging area, reducing the need for additional infrastructure. This direct integration cuts down on hardware costs and simplifies the setup process, making it more budget-friendly for organizations looking to optimize their AI workloads without incurring high expenses.

Testing methodology

To thoroughly evaluate the performance of Solidigm 61.44TB QLC SSDs in an AI server setup, we will benchmark an array of four of the Solidigm D5-P5336 61.44TB SSDs installed in a Lenovo SR675 V3 server. This server configuration also includes a set of four NVIDIA L40S GPUs. The benchmarking tool used for this purpose is GDSIO, a specialized utility designed to measure storage performance in GPU-direct storage (GDS) environments. We looked at two configurations: one GPU to single drive performance and one GPU to four drives configured for RAID0.

AI data storage rig with Solidigm D5-P5336.

Figure 5. Lenovo ThinkSystem SR675 V3

Test parameters

The benchmarking process involves various test parameters that simulate different stages of the AI pipeline. These parameters include io_sizes, threads, and transfer_type, each chosen to represent specific aspects of AI workloads.

1. IO Sizes:

  • 4K, 128K, 256K, 512K, 1M, 4M, 16M, 64M, 128M: These varying I/O sizes help simulate different data transfer patterns. Smaller I/O sizes (128K, 256K, 512K) mimic scenarios where small data chunks are frequently accessed, typical during data preparation stages. Larger I/O sizes (1M, 4M, 16M, 64M, 128M) represent bulk data transfers often seen during training and inferencing stages, where entire data batches are moved.

2. Threads:

  • 1, 4, 16, 32: The number of threads represents the concurrency level of data access. A single thread tests the baseline performance, while higher thread counts (4, 16, 32) simulate more intensive, parallel data processing activities, akin to what occurs during large-scale training sessions where multiple data streams are handled simultaneously.

3. Transfer Types:

  • Storage->GPU (GDS): This transfer type leverages GPU Direct Storage (GDS), enabling direct data transfers between the SSDs and the GPUs, bypassing the CPU. This configuration is ideal for testing the efficiency of direct data paths and minimizing latency, reflecting real-time inferencing scenarios.
  • Storage->CPU->GPU: This traditional data transfer path involves moving data from storage to the CPU before transferring it to the GPU. This method simulates scenarios where intermediate processing or caching might occur at the CPU level, which is expected during the data preparation phase. We could argue that this data path would represent the performance regardless of the GPU Vendor.
  • Storage->PAGE_CACHE->CPU->GPU: This path uses the page cache for data transfers, where data is first cached in memory before being processed by the CPU and then transferred to the GPU. This configuration is useful for testing the impact of caching mechanisms and memory bandwidth on overall performance, which is pertinent during training when data might be pre-processed and cached for efficiency. Again, we could argue that this data path would represent the performance regardless of the GPU Vendor.

Mimicking AI pipeline stages

The benchmark tests are designed to reflect different stages of the AI pipeline, ensuring that the performance metrics obtained are relevant and comprehensive.

Data preparation

  • IO Sizes: Smaller (128K, 256K, 512K)
  • Threads: 1, 4
  • Transfer Types: "Storage->CPU->GPU", "Storage->PAGE_CACHE->CPU->GPU"
  • Purpose: Evaluate how the SSDs handle frequent small data transfers and CPU involvement, which is critical during data ingestion, cleaning, and augmentation phases.

Training and fine-tuning

  • IO Sizes: Medium to large (1M, 4M, 16M)
  • Threads: 4, 16, 32
  • Transfer Types: "Storage->GPU (GDS)", "Storage->CPU->GPU"
  • Purpose: Assess the performance under high data throughput conditions with multiple concurrent data streams, representing the intensive data handling required during model training and fine-tuning.

Inferencing

  • IO Sizes: Large to very large (16M, 64M, 128M) and 4K
  • Threads: 1, 4, 16
  • Transfer Types: Storage->GPU (GDS)
  • Purpose: Measure the efficiency of direct, large-scale data transfers to the GPU, which is crucial for real-time inferencing applications where quick data access and minimal latency are paramount. 4K is designed to look at RAG database searches occurring.

By varying these parameters and testing different configurations, we can obtain a detailed performance profile of the Solidigm 61.44TB QLC SSDs in a high-performance AI server environment, providing insights into their suitability and optimization for various AI workloads. We examined the data by running more than 1200 tests over the course of a few weeks.

Server configuration

Benchmark results

First up let's look at training and inferencing type workloads, the GPU Direct 1024K IO size represents model loading, training data being loaded to the GPU.

4Drive I/O Type     Transfer Type Threads Data Set Size (KiB) IO Size (KiB) Throughput (GiB/sec) Avg Latency (usecs) Operations
WRITE GPUD 8 777,375,744 1024 12.31 634.55 759,156
READ GPUD 8 579,439,616 1024 9.30 840.37 565,859
RANDWRITE GPUD 8 751,927,296 1024 12.04 648.67 734,304
RANDREAD GPUD 8 653,832,192 1024 10.50 743.89 638,508

Here is the inferencing type workload results, for a RAG type workload where fast random 4k data access to something like a vector DB needs to be imports. Efficient random I/O is necessary for scenarios where inferencing workloads need to access data in a non-sequential manner, such as with recommendation systems or search applications. The RAID0 configuration exhibits good performance for sequential and random operations, which is crucial for AI applications that involve a mix of access patterns like RAG. The read latency values are notably low, especially in the GPUD mode.

4Drive I/O Type     Transfer Type Threads Data Set Size (KiB) IO Size (KiB) Throughput (GiB/sec) Avg Latency (usecs) Operations
WRITE GPUD 8 69,929,336 4 1.12 27.32 17,482,334
READ GPUD 8 37,096,856 4 0.59 51.52 9,274,214
RANDWRITE GPUD 8 8,522,752 4 0.14 224.05 2,130,688
RANDREAD GPUD 8 21,161,116 4 0.34 89.99 5,290,279
RANDWRITE GPUD 8 57,083,336 4 0.91 33.42 14,270,834
RANDREAD GPUD 8 27,226,364 4 0.44 70.07 6,806,591

If you don't use GPU Direct, either due to unsupported libraries or GPUs, here is those two types if you utilize the CPU in the data transfer. In this specific server, the Lenovo SR 675 V3, since all PCIe devices go through the CPU root complex, we see comparable bandwidth, but we take a hit on our latency.

4Drive I/O Type     Transfer Type Threads Data Set Size (KiB) IO Size (KiB) Throughput (GiB/sec) Avg Latency (usecs) Operations
WRITE CPU_GPU 8 767,126,528 1024 12.24 638.05 749,147
READ CPU_GPU 8 660,889,600 1024 10.58 738.75 645,400
RANDWRITE CPU_GPU 8 752,763,904 1024 12.02 649.76 735,121
RANDREAD CPU_GPU 8 656,329,728 1024 10.47 746.26 640,947
WRITE CPU_GPU 8 69,498,220 4 1.11 27.47 17,374,555
READ CPU_GPU 8 36,634,680 4 0.58 52.31 9,158,670

The table indicates high throughput rates for read operations, particularly with the GPUD transfer type. For instance, read operations in GPUD mode reaches upward of10.5 GiB/sec. This benefits AI workloads, often requiring rapid data access for training large models.

The write throughput is also considerable, with values reaching up to 3.14 GiB/sec. While not as high as read throughput, it is still adequate for efficiently saving model checkpoints and intermediate results.

The balanced performance between random and sequential operations makes this configuration suitable for inferencing tasks, which often require a mix of these access patterns. While the latency values are not extremely low, they are still within acceptable limits for many inferencing applications.

Additionally, we see impressive throughput rates, with write operations reaching up to 12.31 GiB/sec and read operations up to 9.30 GiB/sec. This high throughput benefits AI workloads that require fast data access for model training and inferencing.

Sequential Reads and Optimization

Moving to 128M IO size and iterating through worker threads, we can see the result of optimizing a workload for a storage solution.

Transfer Type Threads Throughput (GiB/sec) Latency (usec)
Storage->CPU->GPU 16 25.134916 79528.88255
Storage->CPU->GPU 4 25.134903 19887.66948
Storage->CPU->GPU 32 25.12613 159296.2804
Storage->GPU (GDS) 4 25.057484 19946.07198
Storage->GPU (GDS) 16 25.044871 79770.6007
Storage->GPU (GDS) 32 25.031055 159478.8246
Storage->PAGE_CACHE->CPU->GPU 16 24.493948 109958.4447
Storage->PAGE_CACHE->CPU->GPU 32 24.126103 291792.8345
Storage->GPU (GDS) 1 23.305366 5362.611458
Storage->PAGE_CACHE->CPU->GPU 4 21.906704 22815.52797
Storage->CPU->GPU 1 15.27233 8182.667969
Storage->PAGE_CACHE->CPU->GPU 1 6.016992 20760.22778

Properly writing any application to interact with storage is paramount and needs to be considered as enterprises want to maximize their GPU investment.

GPU Direct

By isolating the GPU Direct-only performance across all the tests, we can get a general idea of how the NVIDIA technology shines.

I/O Type     Transfer Type Threads IO Depth Data Set Size (KiB) IO Size (KiB) Throughput (GiB/sec) Avg Latency (usecs) Operations
WRITE GPUD 8 - 777,375,744 1024         12.31 634.55 759,156
READ GPUD 8 - 579,439,616 1024 9.30 840.37 565,859
RANDWRITE GPUD 8 - 751,927,296 1024 12.04 648.67 734,304
RANDREAD GPUD 8 - 653,832,192 1024 10.50 743.89 638,508
WRITE GPUD 8 - 69,929,336 4 1.12 27.32 17,482,334
READ GPUD 8 - 37,096,856 4 0.59 51.52 9,274,214
RANDWRITE GPUD 8 - 8,522,752 4 0.14 224.05 2,130,688
RANDREAD GPUD 8 - 21,161,116 4 0.34 89.99 5,290,279
RANDWRITE GPUD 8 - 57,083,336 4 0.91 33.42 14,270,834
RANDREAD GPUD 8 - 27,226,364 4 0.44 70.07 6,806,591

Scaling over the single drive is apparent, and a properly designed system with some switched, high-speed PCIe interconnects would eliminate the bandwidth issues for larger servers.

Closing thoughts

Since this article focuses on the Solidigm 61.44TB P5336, let's take a step back and address the TLC vs. QLC debate around Performance vs. Capacity. When we look at other products in the Solidigm portfolio, such as the D7 line, which uses TLC 3D NAND, the capacity is limited in exchange for performance. In our testing, specifically with the 61.44TB Solidigm drives, we are seeing throughput performance in aggregate that can adequately keep GPUs fed with data at low latencies. We are hearing feedback from ODMs and OEMs about the demand for more and more storage as close as possible to the GPU, and the Solidigm D5-P5336 drive seems to fit the bill. Since there is usually a limited number of NVMe bays available in GPU servers, the dense Solidigm drives are at the top of the list for local GPU server storage.

Ultimately, the massive storage capacity these drives offer, alongside GPUs, is only part of the solution; they still need to perform well. Once you aggregate the single drive performance over several drives, it's clear that sufficient throughput is available even for the most demanding of tasks. In the case of the 4-drive RAID0 configuration using GDSIO, the total throughput for write operations could reach up to 12.31 GiB/sec, and for read operations, it could reach up to 25.13 GiB/sec.

Close up of RAID0 array for AI applications.

Figure 6. Lenovo ThinkSystem SR675 rear view

This level of throughput is more than sufficient for even the most demanding AI tasks, such as training large deep-learning models on massive datasets or running real-time inferencing on high-resolution video streams. The ability to scale performance by adding more drives to the RAID0 array makes it a compelling choice for AI applications where fast and efficient data access is crucial.

However, it's important to note that RAID0 configurations, while offering high performance, do not provide any data redundancy. Therefore, it's essential to implement appropriate backup and data protection strategies to prevent data loss in the event of a drive failure.

Another unique consideration in data centers today is power. With AI servers pulling down more power than ever and showing no signs of slowing, the total power available is one of the biggest bottlenecks for those looking to bring GPUs into their data centers. This means that there is even more focus on saving every watt possible. If you can get more TB per watt, we approach some interesting thought processes around TCO and infrastructure costs. Even taking these drives off the GPU server and putting them in a storage server at the rack scale, in aggregate, they can deliver massive throughput with extreme capacities.

Integrating Solidigm D5-P5336 61.44TB QLC SSDs with NVMe slot-limited AI servers represents a significant advancement in addressing the storage challenges of modern AI workloads. Their extreme density, performance characteristics, and TB/watt ratio make them ideal for data preparation, training and fine-tuning, and inferencing phases. By optimizing the use of PCIe lanes and providing high-capacity storage solutions, these SSDs enable the modern AI Factory to focus on developing and deploying more sophisticated and accurate models, driving innovation across the AI field.