Enterprise IT has several competing demands to reconcile. There are current and future application considerations, and there is infrastructure cost, both CapEx and OpEx, together with power, cooling, and space constraints. And, of course, there’s keeping the business running.
Unless they are new builds, most enterprise data centers use hybrid (SSD+HDD) storage arrays. But there’s another option: all-flash storage arrays (AFA), an SSD-only configuration that can offer significant benefits to enterprises across several dimensions, especially for AI activities. Many of the more innovative IT shops have already turned to AFA storage, and many of the rest are looking into it to support their AI data needs.
AI offers new ways of doing business and better ways to support critical processes. But the challenges for IT implementing AI are many, namely costly new compute, power, and cooling infrastructure along with the AI applications themselves. Given the cost of AI infrastructure, a primary concern is keeping it highly utilized. Adequate IO performance is one of the key factors in keeping AI infrastructure busy.
However, the real value to the enterprise coming from AI applications is not in the training or inferencing per se, but in all that this enables for the enterprise. For instance:
Any of these AI adjunct or follow-on activities could potentially use hybrid storage. But hybrid systems often start to slow down just when IO demand is highest.
Hybrid storage IO performance problems all stem from the basic architecture of these systems. Essentially, they try to optimize data placement so that hot (highly accessed) data resides on SSDs and cold (less accessed) data resides on HDDs. This may work well for data that can be readily classified by activity. But for data that can’t be so easily classified, or when data access activity increases, data often has to be moved from HDD to SSD and back again, a pattern otherwise known as thrashing. Thrashing adds to a hybrid system’s workload just when the system needs to devote all its resources to application IO.
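To make thrashing concrete, here is a minimal sketch of a two-tier array. It assumes a hypothetical policy that promotes every block on access and demotes the least recently used block when the SSD tier is full. Real hybrid arrays use far more sophisticated, vendor-specific placement logic, and the function name and slot counts are illustrative only.

```python
from collections import OrderedDict

def count_migrations(accesses, ssd_slots):
    """Count promotions (HDD -> SSD) and demotions (SSD -> HDD) for an access trace.

    Assumed policy (hypothetical): promote on every miss, demote the least
    recently used block once the SSD tier is over capacity.
    """
    ssd = OrderedDict()  # block id -> None, ordered from coldest to hottest
    promotions = demotions = 0
    for block in accesses:
        if block in ssd:
            ssd.move_to_end(block)  # already hot: refresh recency, no data movement
            continue
        promotions += 1  # cold block is copied up from HDD
        ssd[block] = None
        if len(ssd) > ssd_slots:
            ssd.popitem(last=False)  # coldest block is pushed back down to HDD
            demotions += 1
    return promotions, demotions

# A 4-block working set fits in a 4-slot SSD tier: only the warm-up moves data.
fits = count_migrations([0, 1, 2, 3] * 5, ssd_slots=4)
# A 5-block working set cycling through 4 slots: every access misses and moves data.
thrashes = count_migrations([0, 1, 2, 3, 4] * 5, ssd_slots=4)
print(fits, thrashes)
```

In this toy model, growing the working set by a single block turns a handful of one-time promotions into a migration on every access, which is the extra background work described above.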
We should mention that hybrid storage vendors have vastly different ways to optimize data placement to reduce and contain all this extra data movement. But in the end, when cold data needs to be accessed it must either be moved to SSD or accessed directly off HDD. When hot data is no longer accessed, it needs to be moved back to HDD to make room for more hot data.
On the other hand, AFA systems do not have nearly the same levels of data movement or performance problems as hybrid arrays during high IO activity. This is because there’s never a need to go to a slower storage tier to offload data or to retrieve data that hasn’t been accessed in a while. IO latency and performance vary far less in AFA storage than in hybrid systems, regardless of system sophistication. As a result, AFA systems provide much more consistent, high IO performance regardless of activity.
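A rough way to see why hybrid latency varies so much is to weight each tier’s latency by the share of reads it serves. The sketch below is illustrative only: the function name and the 100 µs SSD and 10 ms HDD latencies are assumed round numbers, not measurements from any product.

```python
def mean_latency_us(ssd_hit_ratio, ssd_us=100.0, hdd_us=10_000.0):
    """Average read latency when a given fraction of reads is served from SSD.

    ssd_us and hdd_us are assumed, illustrative per-IO latencies in microseconds.
    """
    return ssd_hit_ratio * ssd_us + (1.0 - ssd_hit_ratio) * hdd_us

# A hit ratio of 1.0 models an all-flash array; lower ratios model a hybrid
# array that has to serve some reads from HDD.
for hit_ratio in (1.0, 0.99, 0.95, 0.80):
    print(f"{hit_ratio:.0%} SSD hits -> {mean_latency_us(hit_ratio):,.0f} us average")
```

Under these assumptions, sending just 1% of reads to HDD roughly doubles the average latency, while an all-flash system stays at the SSD figure whatever the access pattern.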
The SSDs in an AFA do move data internally. When data is written, garbage collection relocates still-valid data so that used pages can be freed, and internal virtualization manages where data is stored inside the device. However, this movement and virtualization add almost no overhead for read IO and only minor overhead for writes, and the movement is one-way and confined to the device. Data moves from soon-to-be-freed pages to new pages, but never needs to move back again during garbage collection.
In a prior blog post, “The Incredible Power of Power Efficient Storage,” we discussed at length how Solidigm QLC SSDs can significantly reduce space and power requirements versus an all-HDD system, in support of data lakes for AI training and inferencing. To summarize the findings from that post: compared with an all-HDD solution for AI, the Solidigm 61.44TB QLC SSDs require fewer drives (521 SSDs vs 1800 HDDs), less power (22.2kW less), and less rack space (~60 fewer RUs) to support 1PB of data.
And the advantages of all-flash over hybrid storage go beyond performance, space, power, and cooling. For example, SSDs are much more reliable than HDDs. For consumer grade storage, SSDs are at least one third more reliable than HDDs.
And for enterprise class Solidigm SSDs, which are tested to specs far beyond normal SSD industry standards, it’s even better. In fact, Solidigm SSDs have not detected a single data corruption event in over 3.5B years of simulated operational life.1
How does better reliability benefit the enterprise when, for both AFA and hybrid systems, maintenance costs pay for any repair and servicing? What enterprises pay for maintenance on hybrid vs AFA systems is hard to compare directly because many factors are at play. But in general, costs will be higher for systems with higher failure rates because they need more replacement inventory and more service calls. And storage system performance suffers more when an HDD fails than when an SSD fails.2
One example of this disparity is with systems using erasure coding data protection. When an SSD or HDD in a stripe fails, all the other drives in the stripe must be read to reconstruct the data that was lost. This rebuild process can take a long time, depending on the drive’s speed and capacity. The bandwidth of SSDs is ~10x to 25x higher than that of HDDs. Also, IO latency for SSDs is measured in microseconds whereas HDD IO latency is measured in milliseconds, so SSDs perform IO on the order of 1000x faster than HDDs. As such, rebuild times for similar capacity drives tend to be much shorter for SSDs than for HDDs.
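As a back-of-the-envelope illustration, a lower bound on rebuild time is simply drive capacity divided by sustained rebuild bandwidth. The sketch below assumes a hypothetical 20TB drive and a 200 MB/s HDD rebuild rate, then applies the 10x and 25x bandwidth multipliers cited above. It ignores competing host IO and parity computation, so real rebuilds will take longer.

```python
def rebuild_hours(capacity_tb, rebuild_mb_per_s):
    """Lower bound on rebuild time: capacity divided by sustained rebuild rate."""
    capacity_mb = capacity_tb * 1_000_000  # decimal units: 1 TB = 1,000,000 MB
    return capacity_mb / rebuild_mb_per_s / 3600  # seconds -> hours

DRIVE_TB = 20.0        # assumed drive capacity, for illustration only
HDD_MB_PER_S = 200.0   # assumed sustained HDD rebuild rate, for illustration only

hdd_hours = rebuild_hours(DRIVE_TB, HDD_MB_PER_S)
print(f"HDD: {hdd_hours:.1f} h")
for speedup in (10, 25):  # SSD bandwidth multipliers quoted in the text
    ssd_hours = rebuild_hours(DRIVE_TB, HDD_MB_PER_S * speedup)
    print(f"SSD at {speedup}x bandwidth: {ssd_hours:.1f} h")
```

With these assumed figures, the HDD rebuild takes a little over a day, against roughly one to three hours for an SSD of the same capacity.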
During any rebuild activity, storage systems are busier, which only adds to the time it takes to bring the system back to full performance. Note also that after three years of service, HDDs fail at a higher rate than SSDs.3 All this tells us that when an HDD in a hybrid system fails, the performance of the system suffers more during rebuilds. An SSD failure in an AFA causes a similar setback, but the SSDs’ lower latency and higher bandwidth mean rebuild times are much shorter.
History is no friend to hybrid arrays. They have been in the data center for decades, but enterprise disk shipments peaked over a decade ago.
Some say HDD shipment declines seem to be flattening. But that doesn’t tell the whole story. Enterprise HDD unit shipments have decreased,4 and the only HDDs still being shipped in high volume are nearline disks, used in slower object storage. And right about the time enterprise disk shipments started their steep decline, SSD shipment volumes started to pick up.
In summary, there’s a multitude of benefits to using an AFA, all-SSD storage system over hybrid arrays for enterprise AI activities. These benefits include higher and more consistent IO performance when it matters, higher reliability, lower power, reduced cooling, and a smaller footprint, just to name a few.
Moreover, the decline in enterprise HDD shipments and the rise in SSD shipment volumes is yet another piece of evidence that the days of hybrid storage are limited. All-flash storage systems have become the new primary storage solution for enterprise AI workloads and other, similarly IO-intensive workloads.
Ace Stryker is the Director of Market Development at Solidigm, where he focuses on emerging applications for the company’s portfolio of data center storage solutions, with a special expertise in AI workloads and solutions.