AI Is Accelerating the Shift from Hybrid to All-Flash Arrays

Enterprise IT has several competing demands to reconcile. There are current and future application considerations, and there are infrastructure costs in both CapEx and OpEx, together with power, cooling, and space constraints. And, of course, there’s keeping the business running.

Apart from new builds, most enterprise data centers use hybrid (SSD+HDD) storage arrays. But there’s another solution: all-flash storage arrays (AFA), an SSD-only configuration that can offer enterprises significant benefits across several dimensions, especially for AI activities. Many of the more innovative IT shops have already turned to AFA storage, and many of the rest are looking into it to support their AI data needs.

AI offers new ways of doing business and better ways to support critical processes. But the challenges for IT implementing AI are many, namely costly new compute, power, and cooling infrastructure along with the AI applications themselves. Given the cost of AI infrastructure, a primary concern is keeping it highly utilized. Adequate IO performance is one of the key factors in keeping AI infrastructure busy.

However, the real value to the enterprise coming from AI applications is not in the training or inferencing per se, but in all that this enables for the enterprise. For instance:

  • Recommendation engines are compelling solutions, but their real value comes from customers clicking on recommendations. That’s when the rest of the IT infrastructure takes off in support of that click to generate another sale. These are not just one-off transactions. The retail world holds its collective breath every year when Black Friday and Christmas sales go live, hoping to keep up with all the sales activity arising from customer web activities.
  • Vision systems are great at identifying and tracking objects or people in a scene. But making use of that functionality to follow customer behavior, diagnose maladies, or identify faulty parts is just half the story. What the enterprise does with that information takes applications running on additional infrastructure to act. This is where the real business value of object tracking and people tracking lies.
  • LLMs are an impressive new technology, but to personalize them for the enterprise and reduce hallucinations, most organizations use a vector database in conjunction with a Retrieval-Augmented Generation (RAG) system. A RAG system takes in corporate data; encodes, embeds, or vectorizes it; loads it into a vector database; indexes it; and queries it to add enterprise-specific context to any incoming prompt. Vector database loading, indexing, and querying are significant IO-driven workloads. Depending on your prompt activity and the rate at which you acquire new information, these can consume a lot of IO resources.
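
The RAG flow in the last bullet can be sketched in a few lines. This is a minimal illustration under stated assumptions, not any vendor's implementation: `embed` is a toy word-count stand-in for a real embedding model, and `VectorStore` is a hypothetical in-memory class standing in for a vector database. In a real deployment, the load, index, and query steps are the ones that generate storage IO.

```python
import math
from collections import Counter

def embed(text):
    """Toy embedding: a normalized sparse word-count vector (stand-in for a real model)."""
    counts = Counter(text.lower().split())
    norm = math.sqrt(sum(c * c for c in counts.values())) or 1.0
    return {word: c / norm for word, c in counts.items()}

def cosine(a, b):
    """Cosine similarity of two already-normalized sparse vectors."""
    return sum(weight * b.get(word, 0.0) for word, weight in a.items())

class VectorStore:
    """Hypothetical in-memory stand-in for a vector database."""

    def __init__(self):
        self.rows = []  # list of (vector, original text) pairs

    def load(self, documents):
        # Embed each document and store it; a real database would also build an index.
        for doc in documents:
            self.rows.append((embed(doc), doc))

    def query(self, prompt, k=1):
        # Brute-force scan: score every stored vector against the prompt vector.
        q = embed(prompt)
        scored = sorted(self.rows, key=lambda row: cosine(q, row[0]), reverse=True)
        return [doc for _, doc in scored[:k]]

def augment(prompt, store, k=1):
    """Prepend the k most similar documents to the prompt as context."""
    context = "\n".join(store.query(prompt, k))
    return f"Context:\n{context}\n\nQuestion: {prompt}"

store = VectorStore()
store.load([
    "our refund policy allows returns within 30 days",
    "warehouse hours are 9 to 5",
])
print(augment("what is the refund policy", store))
```

The brute-force scan in `query` touches every stored vector on every prompt, which hints at why indexing and query traffic scale with both corpus size and prompt rate.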

Any of these AI-adjunct or follow-on activities could potentially run on hybrid storage, but as IO activity increases, hybrid systems begin to see performance degradation. That is, hybrid systems often slow down just when IO demand is highest.

Hybrid storage IO performance problems all stem from the basic architecture. Essentially, these systems try to optimize data placement so that hot (highly accessed) data resides on SSDs and cold (less accessed) data resides on HDDs. While this may work well for data that can be readily classified by activity, data that can’t be so easily classified, or data whose access activity increases, often must be moved from HDD to SSD and back again, a pattern otherwise known as thrashing. Thrashing increases a hybrid system’s workload just when the system needs to devote all its resources to application IO.
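
Thrashing can be illustrated with a small simulation. The sketch below assumes a deliberately simple policy (promote a block to SSD on every miss, demote the least recently used block when the SSD tier is full); real hybrid arrays use far more sophisticated placement logic, and the function name and parameters are hypothetical.

```python
from collections import OrderedDict

def simulate_tiering(accesses, ssd_slots):
    """Count block moves between tiers under a promote-on-miss, LRU-demote policy."""
    ssd = OrderedDict()  # block id -> None, ordered from oldest to newest access
    promotions = demotions = 0
    for block in accesses:
        if block in ssd:
            ssd.move_to_end(block)   # SSD hit: no data movement needed
            continue
        promotions += 1              # miss: copy the block from HDD to SSD
        ssd[block] = None
        if len(ssd) > ssd_slots:
            ssd.popitem(last=False)  # demote the least recently used block to HDD
            demotions += 1
    return promotions, demotions

# Working set fits in the SSD tier: each block is promoted once, then stays put.
fits = simulate_tiering([0, 1, 2, 3] * 100, ssd_slots=4)

# Working set is one block larger than the SSD tier: every access misses.
thrash = simulate_tiering([0, 1, 2, 3, 4] * 100, ssd_slots=4)

print("fits:  ", fits)
print("thrash:", thrash)
```

In this toy model, growing the working set by a single block turns 4 total moves into roughly a thousand, all of which compete with application IO.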

Data placement is a key differentiator in hybrid vs AFA storage

We should mention that hybrid storage vendors have vastly different ways to optimize data placement to reduce and contain all this extra data movement. But in the end, when cold data needs to be accessed it must either be moved to SSD or accessed directly off HDD. When hot data is no longer accessed, it needs to be moved back to HDD to make room for more hot data. 

On the other hand, AFA systems do not have nearly the same levels of data movement or performance problems as hybrid arrays during high IO activity. This is because there’s never a need to go to slower tier storage to offload data or retrieve data that hasn’t been accessed in a while. The variability in IO latency and performance is much narrower in AFA storage than in hybrid systems, regardless of system sophistication. As a result, AFA systems provide much more consistent, high IO performance regardless of activity.
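
A back-of-the-envelope model shows why even a small share of HDD accesses widens latency so much. The per-device latencies below (0.1 ms for SSD, 10 ms for HDD) are illustrative assumptions, not measurements of any product, and `mean_latency_ms` is a hypothetical helper.

```python
def mean_latency_ms(ssd_hit_ratio, ssd_ms=0.1, hdd_ms=10.0):
    """Average read latency when a fraction of reads is served from SSD and the rest from HDD."""
    if not 0.0 <= ssd_hit_ratio <= 1.0:
        raise ValueError("ssd_hit_ratio must be between 0 and 1")
    return ssd_hit_ratio * ssd_ms + (1.0 - ssd_hit_ratio) * hdd_ms

# A ratio of 1.0 models an all-flash array; lower ratios model a hybrid array
# whose SSD tier is missing some of the requested data.
for ratio in (1.0, 0.99, 0.95, 0.80):
    print(f"SSD hit ratio {ratio:.0%}: {mean_latency_ms(ratio):.3f} ms average")
```

Under these assumed figures, sending just 1% of reads to HDD roughly doubles the average latency, and the gap between the fastest and slowest individual IO is two orders of magnitude.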

AFA systems do have intrinsic SSD-level data movement: as data is written, the SSD relocates data to free up used pages. They also have internal virtualization that manages where data is stored inside the device. However, this movement and virtualization add almost no overhead for read IO and only minor overhead for writes, and they cause only one-way, device-level movement of data. During garbage collection, data moves from soon-to-be-freed pages to new pages, but never needs to move back again.
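
The one-way nature of that movement can be sketched with a toy model of garbage collection. This is a simplification under stated assumptions: a block is a Python list of pages, the `FREE` and `STALE` markers and the `garbage_collect` function are hypothetical, and real SSD firmware is far more involved.

```python
FREE, STALE = "<free>", "<stale>"  # hypothetical page-state markers

def garbage_collect(victim, spare):
    """Relocate still-valid pages from `victim` into `spare`, then erase `victim`.

    Returns the number of pages copied, i.e. the extra internal writes incurred.
    """
    valid = [page for page in victim if page not in (FREE, STALE)]
    if len(valid) > spare.count(FREE):
        raise ValueError("spare block has too few free pages")
    for page in valid:
        spare[spare.index(FREE)] = page    # write into the next free page
    victim[:] = [FREE] * len(victim)       # erase the whole victim block
    return len(valid)

victim = ["a", STALE, "b", STALE]
spare = [FREE, FREE, FREE, FREE]
copied = garbage_collect(victim, spare)
print(copied, spare, victim)
```

Valid data moves forward into the spare block once; nothing is ever copied back to where it came from, which is the contrast with the two-way promotion and demotion traffic in a tiered system.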

AFA advantages

In a prior blog post, The Incredible Power of Power Efficient Storage, we discussed at length how Solidigm QLC SSDs can significantly reduce space and power requirements versus an all-HDD system, in support of data lakes for AI training and inferencing. To summarize the findings from that post: Solidigm 61.44TB QLC SSDs require fewer drives (521 SSDs vs 1800 HDDs), less power (22.2kW less), and less rack space (~60 fewer RUs) to support 1PB of data than an all-HDD solution for AI.

And there are more than performance, space, power, and cooling advantages when comparing hybrid storage with all-flash storage. For example, SSDs are much more reliable than HDDs. For consumer-grade storage, SSDs are at least one-third more reliable than HDDs.

And for enterprise-class Solidigm SSDs, which are tested to specs far beyond normal SSD industry standards, it’s even better. In fact, testing of Solidigm SSDs has not detected a single data corruption event in over 3.5B years of simulated operational life.[1]

How does better reliability benefit the enterprise when, for both AFA and hybrid systems, maintenance fees cover any repair and servicing? What enterprises pay for maintenance on hybrid vs AFA systems is hard to compare directly because many factors are at play, but in general, costs will be higher for systems with higher failure rates because of the need for more replacement inventory and more service calls. And the performance of a storage system suffers more when an HDD fails than when an SSD fails.[2]

One example of this disparity is in systems that use RAID or erasure-coding data protection. When an SSD or HDD in a stripe fails, all the other drives in the stripe must be read to reconstruct the data that was lost. This rebuild process can take a long time, depending on the speed and capacity of the drive. The bandwidth of SSDs is ~10x to 25x higher than that of HDDs. Also, IO latency for SSDs is measured in microseconds whereas HDD IO latency is measured in milliseconds, so SSDs can complete an IO on the order of 1000x faster than HDDs. As such, rebuild times for similar-capacity drives tend to be much shorter for SSDs than for HDDs.

During any rebuild, storage systems are busier, which only adds to the time it takes to bring the system back to full performance. All this tells us that when an HDD in a hybrid system fails (and note that after three years of service, HDDs fail at a higher rate than SSDs[3]), the performance of the system suffers more during the rebuild. So, while an SSD failure in an AFA causes a similar setback, the SSDs’ much faster performance and higher bandwidth mean rebuild times are much shorter.
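
The rebuild-time argument can be made concrete with simple arithmetic: the best-case rebuild time is the drive capacity divided by its sustained bandwidth. The 20TB capacity and 250 MB/s HDD bandwidth below are illustrative assumptions; the 10x and 25x multipliers come from the bandwidth range quoted above. Real rebuilds take longer because the array is also serving application IO.

```python
def rebuild_hours(capacity_tb, bandwidth_mb_s):
    """Best-case rebuild time: drive capacity divided by sustained bandwidth."""
    if capacity_tb <= 0 or bandwidth_mb_s <= 0:
        raise ValueError("capacity and bandwidth must be positive")
    capacity_mb = capacity_tb * 1_000_000  # decimal units: 1 TB = 1,000,000 MB
    return capacity_mb / bandwidth_mb_s / 3600

CAPACITY_TB = 20   # assumed drive capacity, same for both drive types
HDD_MB_S = 250     # assumed sustained HDD bandwidth

hdd = rebuild_hours(CAPACITY_TB, HDD_MB_S)
ssd_low = rebuild_hours(CAPACITY_TB, HDD_MB_S * 10)   # SSD at 10x HDD bandwidth
ssd_high = rebuild_hours(CAPACITY_TB, HDD_MB_S * 25)  # SSD at 25x HDD bandwidth

print(f"HDD: {hdd:.1f} h, SSD: {ssd_high:.1f} to {ssd_low:.1f} h")
```

With these assumed figures, the HDD rebuild takes about 22 hours at best, while the SSD rebuild takes between about 1 and 2 hours, so the window of degraded performance shrinks by the same 10x to 25x factor.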

All-flash is the future of storage

History is no friend to hybrid arrays. They have been in the data center for decades, but enterprise disk shipments peaked over a decade ago. 

Some say HDD shipment declines seem to be flattening. But that doesn’t tell the whole story. Enterprise HDD unit shipments have decreased,[4] and the only HDDs still being shipped in high volume are nearline disks, used in slower object storage. And right about the time enterprise disk shipments started their steep decline, SSD shipment volumes started to pick up.

In summary, there’s a multitude of benefits to using an AFA, all-SSD storage system over hybrid arrays for enterprise AI activities. These benefits include higher and more consistent IO performance when it matters, higher reliability, lower power, reduced cooling, and a smaller footprint, just to name a few. 

Moreover, the decline in enterprise HDD shipment volumes and the rise in SSD shipment volumes are yet another piece of evidence that the days of hybrid storage are limited and that all-flash storage systems have become the new primary storage solution for enterprises’ AI workloads and other similarly IO-intensive workloads.


About the Author

Ace Stryker is the Director of Market Development at Solidigm, where he focuses on emerging applications for the company’s portfolio of data center storage solutions, with a special expertise in AI workloads and solutions.

Notes

  1. Source: Solidigm. Soft error rate testing conducted at Los Alamos Labs, at 1TB/day
  2. https://www.microsoft.com/en-us/research/wp-content/uploads/2016/08/a7-narayanan.pdf
  3. https://www.backblaze.com/blog/how-reliable-are-ssds/
  4. https://www.statista.com/statistics/285474/hdds-and-ssds-in-pcs-global-shipments-2012-2017/