Delivering a Sustainable Data Platform for the AI Era With WEKA

TechArena Podcast hosted by Allyson Klein and Jeniece Wnorowski

TechArena host Allyson Klein and Jeniece Wnorowski of Solidigm chat with WEKA’s Joel Kaufman, as he discusses the WEKA data platform and how the company’s innovation provides sustainable data management that scales for the AI era. Learn how WEKA delivers strong software-based data management and distributed storage functionality across the entire data platform that they provide on-prem or in the cloud for just about any workload.

 

Audio Transcript

This transcript has been edited for clarity and conciseness.

Allyson Klein: Welcome to the TechArena Data Insight Series. My name is Allyson Klein, and joining me back in studio is my co-host, Jeniece Wnorowski from Solidigm. Welcome to the program, Jeniece.

Jeniece Wnorowski: Thank you, Allyson, it's great to be back.

Allyson: We continue to have fantastic discussions with leaders in the industry, exploring what's happening with the data pipeline and how it is transforming as organizations try to tackle new capabilities with artificial intelligence. Tell me about who you've got lined up for us today to talk to.

Jeniece: Yeah, thank you, Allyson. I am super excited about our guest today. We have joining us Joel Kaufman, who is the Senior Technical Marketing Manager at WEKA. And for anyone who's been following anything AI, WEKA is definitely one of those organizations that is a major bright spot in innovation. And we're just super excited to hear from Joel about the technologies that they're working on and the innovation that they're working toward.

Allyson: Welcome to the program, Joel.

Joel Kaufman: Hi, glad to be here.

Allyson: So Joel, why don't you just start with an introduction of yourself and your background in data and how it led to the position that you've got at WEKA.

Joel: Yeah, I started out quite a while back, without dating myself too much. For a while, I worked for Silicon Graphics, which should take you way back in the day, doing a variety of things for them around high-performance computing, managing systems there, things like that. And then after a while, a few people who had moved on to a different company, which became NetApp, said, “Hey, you should come over and start working there.” And so I wound up at NetApp for a very long time, a little over 20 years, doing everything from introductory setup to managing entire teams of technical marketing engineers and then handling a lot of our data protection, data management and data replication programs. And then after a while, again, people you know always pull you around. And so I got this call saying, “You should come over and check out this very cool new technology at a company called WEKA.” At which point I moved out of the engineering side of technical marketing and more into the marketing side of technical marketing, where I help pull together and explain our technology in a way that is meaningful and makes sense to a lot of our customers, our partners, and sometimes even internally for training our own people at WEKA.

Allyson: Now, we recently had WEKA on the program, and I wanted to follow up to go a little bit more under the hood with your solutions. The first interview was great, and people should check that out, but there were so many more questions that I had at the end of that interview, so I'm so glad that Jeniece and I are getting the chance to talk to you. Before we go there, though, can you just do an introduction of WEKA for those who aren't familiar with your solutions and put a little context around how WEKA works within the industry?

Joel: Yeah, so the way to think of WEKA these days is that we're really this data platform. Partially through a lot of foresight and partially through a little bit of luck, we came to this point where there was this convergence of really high-performance compute and really high-performance networking, but storage seemed to be left behind in the dust a little bit. And so our founders took a look at it and said, you know, there's got to be a way of solving this storage problem as part of this, I guess you'd call it kind of a pyramid, right? A triangle of these three core things that go into infrastructure. And by utilizing, under the covers, an incredibly parallelized file system, we're able to deliver really strong software-based data management and distributed storage functionality across the entire data platform that we provide. And we can do it on-prem or in the cloud for pretty much any workload that's out there. We tend to focus a little bit today on the really performance-intensive workloads, things like AI/ML and high-performance compute [HPC], and all the different variations of what those implementations look like across a large number of industries.

Jeniece: That is amazing, Joel. Can you comment a little bit further, though, on any particular customer challenges that WEKA is kind of uniquely solving, and why is this a focus for you today?

Joel: Yeah, so if we take a look kind of at the history of what's been going on in the industry, we started out with things like HPC, and HPC was sort of this isolated, you know, it was built for large labs, maybe some universities that were doing this in the public sector space, and there was this trend of ultra-high performance. And that moved on for about 10 years. And then you fast forward towards today, and what we're finding is that a lot of the types of applications, the problems that customers really were trying to solve around, you know, even internal things, manufacturing, business and finance types of things, started to require more and more compute power. They started to require a lot more intelligence in what they're doing. And when you couple that with the rise of things like AI/ML and generative AI, it began this convergence. And so now we're seeing HPC converging into AI in the enterprise space. And so a lot of the challenges that we're seeing come from companies that are saying, I'm used to doing traditional enterprise-level IT. Here's a whole new classification of applications: they might be in the cloud, they might be on-prem, they're incredibly high performance. They have scale that they've never seen before, not just from a performance or a capacity standpoint, but even little things like the numbers of files that are being used to pull this data in for processing are at volumes that just literally have never been seen in the history of computing. And so it's about being able to take a step back and say, we've simplified this environment.

We're able to give you all the scale and performance you need. And by doing that, it really makes it a much simpler and easier-to-consume experience for those enterprise customers. Some of the customers that we're dealing with at this point aren't just the traditional top enterprise tier. It's a new generation of data center providers that are doing things like GPU as a service, right? They need to figure out how they handle tens or hundreds of customers all trying to consume massive farms of GPUs in a somewhat isolated manner and yet maintain consistent performance across all of it to give the best bang for the buck. And then to top it all off, you have certain customers, actually more often than not now, who are looking at the fact that traditional types of storage that are being represented as data platforms simply don't have the performance density to make them actually sustainable moving forward. So being able to say we can offer all these capabilities but reduce our entire carbon footprint, and make sure that it is sustainable from a community standpoint, is really becoming very, very crucial to a lot of these customers.

Allyson: You know, as you were talking, I was thinking, this is such a beautiful story. You're talking about how you've simplified things, and I think folks who have managed data for a really long time know that simplification is not an easy thing to do. So can you go a little bit deeper into how WEKA has approached data management in a unique way to deliver that simplification to customers across that diverse landscape?

Joel: Yeah, absolutely. So one of the things that we've noticed, and I heard you at the very beginning talking about data pipeline. We've been talking about data pipelines now for probably close to three years or so. What we discovered in a lot of customers is that when you start looking at legacy architectures, and I don't mean that in a really disparaging way, what I really mean is architectures that have done fantastic for traditional enterprise IT for years and years and years and probably will continue to do that for the future, they're not architected to look at different types of IO in a system and manage it in a really appreciable way. So a good example for this is, let's talk about AI in general, or generative AI. If you look at a pipeline for a workflow and a tool chain that's used for all these, you start out by ingesting data. It could be market data. It could be datasets for things like genomics, protein folding libraries, integration with cryo-EM systems. It could be images for doing manufacturing quality assurance (QA) on a variety of components, things like that. So you ingest with a certain IO profile for that. Then you turn around and say the next thing you need to do is normalize this data and transform it, maybe with an ETL [Extract, Transform, and Load] or an ELT [Extract, Load, and Transform] type of function, something that takes that data in. And suddenly you go from this big, maybe slow, but lots of streams of writes coming in, to now you have to do this blended IO of reads and writes and back and forth. And as the scale of these files gets larger and larger, now you have to do tons of metadata lookups. And eventually you get around to processing the data. And then the final step, well, not really the final step, but the next step in the pipeline is maybe you send it off to training in the AI model because you've normalized the data, now you do the ingest, you do the re-tuning and the training and the fine-tuning and so on. And that's a massive type of read function.

And then you take the data and you validate it, send it back if someone says the precision is not enough, and you start your automated loop. And this type of blended IO across the board has been a nightmare for most companies to handle. And it was so bad for a long time that even with going from hard drives to flash drives, you simply could not have storage systems that were architected to handle every stage. And so you wound up with: Here's my dedicated ingest system. Then I'll copy the data over to a system for doing the ETL. Then I'll copy the data again to pump it into these GPUs for training. And then I'll take the result back out and then I'll copy it back. And this has caused just massive complexity. So what WEKA has done, because we have such an ability to handle so many different IO profiles at the same time without any real performance deterioration, is flatten that entire copying architecture out. We've made it to the point where you can just have a single pool, or file system if you want to, of this data, and have so much performance across reads, writes, big files, small files, and numbers of files that we've removed that entire copy process. And ultimately, what it does is it helps you feed your compute platform, CPUs and GPUs, faster to keep them massively utilized, so they're not sitting there burning power, just idling along waiting for data to come in. And that simplification has really transformed what a lot of our customers are doing.
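
Editor's illustration: the copy chain Joel describes can be sketched as a toy model. This is a minimal sketch, not WEKA code; the `Stage` class, the `copies_needed` function, and the IO-profile strings are hypothetical names and paraphrases introduced only for this example. It assumes one bulk copy per hand-off between dedicated systems and no copies when every stage reads and writes a single shared pool.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Stage:
    """One step of the pipeline and the IO pattern it mostly produces."""
    name: str
    io_profile: str  # paraphrased from the conversation, not a measured value


# Stages in the order they are described in the transcript.
PIPELINE = [
    Stage("ingest", "many streams of writes"),
    Stage("transform (ETL/ELT)", "blended reads and writes, heavy metadata lookups"),
    Stage("training", "massive reads"),
    Stage("validation", "results sent back, loop restarts"),
]


def copies_needed(stages, shared_pool):
    """Count bulk data copies across the pipeline.

    Assumption: with a dedicated storage system per stage, every hand-off
    between consecutive stages is one copy; with a shared pool there are none.
    """
    if shared_pool:
        return 0
    return max(len(stages) - 1, 0)


if __name__ == "__main__":
    for stage in PIPELINE:
        print(f"{stage.name}: {stage.io_profile}")
    print("copies with dedicated systems:", copies_needed(PIPELINE, shared_pool=False))
    print("copies with a single pool:", copies_needed(PIPELINE, shared_pool=True))
```

Under these assumptions the dedicated-system layout needs three copies (ingest to ETL, ETL to training, results copied back), which matches the sequence described above, while the single-pool layout needs none.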

Jeniece: That is amazing. And just hearing you explain that simplification, as many others are kind of looking at the overall data pipeline to understand how to navigate through it, I think you guys have a really amazing handle on it. But can you take a step back? You've mentioned a lot of different workloads as well, but I'd love to understand even further, because you guys seem to have a really big affinity for working with data in the cloud. Can you speak a little more specifically about how you are solving data movement challenges across distributed systems?

Joel: Yeah, this is a real interesting one, and I kind of want to clarify a little bit. When we talk about data movement challenges, there's a unique aspect to it that I think is underestimated, and that is that data has an extreme amount of gravity. And so when we talk with customers, one of the biggest things that we do and consult with our customers about is not just, you know, can you move the data, but should you move the data? And so we're finding this really interesting combination of customers who are saying, “No, everything I'm going to do is on-prem,” and some who are entirely, 100% cloud native. We actually have one customer in the media and entertainment space, Preymaker, and everything they do is 100% cloud-focused. They don't want to deal with infrastructure. But more and more, we're beginning to see this trend of customers who are making these decisions about what data needs to be moved, and it doesn't necessarily have to be all of it. So we get into this very hybrid type of cloud play. So what WEKA does under the covers a little bit is we have a technology that really enables this called Snap-To-Object. And one of the things when we began this process of bringing WEKA as a product to market is we took a look at what costs look like and what a better way of maybe doing replication would look like, or data movement, really, in this case. Actually, let's call it data mobility or data liquidity if you want to. And so, it's being able to take a complete image of what the data is on a WEKA system and move it to an object store where we don't care where it lives. It could be on-prem, it could be a cloud object store, it's available in all the various hyperscalers, it doesn't really matter. But at that point, any other WEKA system can use it, as long as it has this special token key that can view what the snapshot looks like. We produce this key every time we take the snapshot and move the data, and if you can pass that key along to another WEKA system, it can then go access that data as long as it has access to the object store.

And so, you get this sort of combination of killer third-party witness of data, because it's now on an entirely separate object store system, so you have that separation of domains, and yet any other WEKA system can grab it. And so, we're seeing use cases. A great example: there's a pharmaceutical company in the Boston area doing really big work around protein folding and virology and things like that, which creates solutions for the health of their customers. And what they do is they do a lot of their pre-processing on-prem. And then when they do their final model training and final analysis, they snapshot the data up into the cloud, attach a cloud-based WEKA system, and then they can scale up massive amounts of rental compute power to address that data at really high performance levels, again, even in the cloud. And then once they're done, they get a couple of much smaller outputs, and they just send it right back down to on-prem for final archiving, storage, etc. And so we're seeing this type of data movement and distribution happening across a lot of our customers now.

Jeniece: So Joel, thank you so much for that. Could you also speak a little bit more about the work you guys recently introduced with the WEKAPod as a complement to your WEKA reference architecture, and tell us a bit about why that infrastructure?

Joel: So this has been kind of an interesting journey. If you look at WEKA from the start point, we are completely agnostic. We are a software solution, right? In fact, our original coding and our original builds were all in the cloud. We're one of the, I guess you could say, sort of oddball infrastructure companies, where instead of starting on-prem and saying, we'll figure out how to port it to the cloud, we started in the cloud, and then we had customer demand come in and say, hey, you should be on-prem because we have a real need locally. And one of the benefits of all this, just a little side note, is that because we're just software, we run the exact same binary, whether it's on-prem or in the cloud, and we don't change how we operate our data platform. And because of this, it gives us a certain amount of agnosticism that lets our customers, again, make those decisions to deploy anywhere. So cloud was evolution one. Then we moved on to evolution two, which was on-prem and then hybrid. And evolution three now is what is essentially the appliance, or the ultimate in consumption simplification for on-prem customers. And what this kind of came out to be is that for purposes of integrating with NVIDIA SuperPod systems and BasePod systems, they wanted to have an appliance effectively built out for use in those particular use cases. And so we partnered with one of the hardware vendors that we work with, and we've produced effectively a completely wrapped appliance that a customer can buy as a complete, effectively turnkey bundle as part of a SuperPod or BasePod deployment. And we go out the door now and give them that complete reference architecture with us as the data platform and the compute from NVIDIA, and make it super simple for them to use. Effectively at that point, it's a plug-and-play type of solution. Where we're starting to see real strong interest in this is from customers who either don't know what their AI solutions or high-performance compute solutions will look like, or are unsure what their requirements will be moving forward in the future. And so one of the things that we've done with this WEKAPod that's kind of interesting is that it's all Gen5 hardware under the covers. And what I mean is PCIe Gen5.

So you get the latest processors, you get the latest SSDs, flash drives, and the latest networking in there. And the ultimate result is that we give you this performance density where you can start with a very small environment. In fact, the smallest one of these is eight servers, or eight nodes, I guess you could say. And yet those eight nodes can go anywhere from half a petabyte to one petabyte, but can deliver performance that is absolutely unheard of, you know, 700-plus gigabytes per second, 18 million IOPS, latency that rivals a raw Fibre Channel SAN. In fact, in some cases even faster than that. And yet this entire stack is seven and a half kilowatts of power consumption completely. And so that type of performance density, yeah, it gives our customers so much flexibility in terms of saying, look, if I put in this really small system, I don't know what my future scaling will look like, but the performance is there, the ability to expand is there. I don't have to burn a ton of power and cooling for that footprint. You know, to be quite frank, this entire space is moving so fast. I mean, we've seen, you know, let alone the last two years, the last six months or even two months alone of how this industry has moved and the changes in what data structures look like and the hardware. It is so hard to future-proof yourself. And yet, this is probably as close as you're going to be able to get to something that, at least for that data platform component, really could help you out when there's an unknown future coming with the rate of change.
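
Editor's illustration: the "performance density" Joel refers to can be made concrete with a little arithmetic on the figures he quotes for the smallest eight-node configuration. This is a rough sketch only; it treats "700-plus gigabytes per second" as a floor of 700, assumes the quoted 7.5 kW covers all eight nodes, and the variable and function names are made up for this example.

```python
# Figures quoted in the conversation for the smallest configuration.
THROUGHPUT_GB_PER_S = 700   # "700-plus gigabytes per second", treated as a floor
IOPS = 18_000_000           # "18 million IOPS"
POWER_KW = 7.5              # "seven and a half kilowatts" for the entire stack
NODES = 8                   # "eight servers, or eight nodes"


def per_kilowatt(value, power_kw):
    """Return how much of `value` is delivered per kilowatt consumed."""
    return value / power_kw


gb_per_s_per_kw = per_kilowatt(THROUGHPUT_GB_PER_S, POWER_KW)
iops_per_kw = per_kilowatt(IOPS, POWER_KW)
watts_per_node = POWER_KW * 1000 / NODES  # assumes power is spread evenly

if __name__ == "__main__":
    print(f"throughput density: {gb_per_s_per_kw:.1f} GB/s per kW")
    print(f"IOPS density: {iops_per_kw:,.0f} IOPS per kW")
    print(f"average power per node: {watts_per_node:.1f} W")
```

On those assumptions the stack delivers roughly 93 GB/s and 2.4 million IOPS per kilowatt, at a little under one kilowatt per node on average.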

Allyson: You know, Joel, I loved how you talked about that, because I've been thinking about this quite a bit in terms of trying to forecast how the industry is going to keep delivering balanced platforms when you're looking at the innovation cycles that we're looking at, right? And so as you break down that platform and you did such a beautiful job of it, are there areas where you look at balancing performance, efficiency, and scale where you want the industry to really pay attention across logic, storage media, network and IO, where you're thinking, hey, this is going to become a bottleneck pretty soon? Or is there anything in particular that you would call out from an efficiency standpoint that the industry really needs to focus on?

Joel: Yeah, I think there's going to be a couple of turning points that are going to have to be addressed at some point. From a hardware standpoint, it's only going to go up from here, right? We've gone from 10-gig Ethernet 10 or 15 years ago to, you know, all the way through 100, 200, 400. And by this time next year, 800 gigabit is on the table. So I don't think we're going to have these huge network bottlenecks for the vast majority of workloads that are out there. And the same thing, processors will get faster, GPUs will get faster. Storage devices will still be kind of interesting in terms of how they go. I think there's two things that are going to have to be addressed. One is that in the storage industry and storage device category, there's really this kind of bifurcation that's happening. And just to be fair, this is a Joel opinion to a large extent. It's not necessarily a WEKA opinion. But what we're seeing is this weird bifurcation. You have traditional flash SSDs, the TLC layer, that are relatively high performance, good endurance, and so on and so forth. But so far, they've been somewhat limited in size. And so that's created this secondary type of device out there where you have bigger SSDs, flash devices, that don't necessarily have as much endurance, but capacities are significantly higher. And so the question becomes, can you go ahead and figure out a way of making either the fast versions, the TLC devices, bigger, or is there a way of making the bigger QLC devices a little bit faster? And so there's a bit of convergence there that needs to happen. If I had to place a bet, I'd put it on the TLC side because that seems to be where the innovation is happening a little bit faster.

But that being said, as Jeniece was saying, Solidigm, they've gone and turned the corner the other direction and said, we're going to make QLC that's getting faster and faster and faster. So I think when those start to collide head to head, that's going to be a real interesting point to see where that goes. Beyond that, though, the number one thing that's going to have to happen in the industry, it really is this addressing of sustainability. I came away from GTC [NVIDIA GPU Technology Conference], where we had a huge presence, and they were talking about Blackwell, the new GB200 systems from NVIDIA, and the Blackwell processor combined with the Grace processors, and a stack where it's going to be water-cooled pretty much by default when you buy the full, huge, disaggregated GPU there. And when Jensen was on stage talking about 120 kilowatts per rack, the question becomes, it may be significantly faster, and you can subdivide that up. I think we're going to reach this interesting point where it's going to be incredibly hard for various companies to actually be able to deploy something like a water-cooled GB200 system. And for customers it will come down to: do they have the power that they can actually bring in? And in more and more data centers, power is becoming an absolute problem because the utilities that are supplying them are going, we're out of power capacity. We literally cannot provide more amperage and current into your data center because we're out. Our major infrastructure, we don't have enough power plants to produce the power to feed these things. And so I think initially, this is going to be a very sparsely deployed type of system, simply not because the technology doesn't exist, but because the power doesn't exist. And that's going to be something that's going to become an ongoing reckoning through the industry, where what is the correct amount of compute power that you're able to even put against a problem without having to build custom data centers in places where they have surplus power? And that's going to be a very interesting challenge moving forward.

Jeniece: Wow, a lot of awesome insight there, Joel. You started off by talking about the bifurcation of hardware and who's up to the challenge, and I think we're definitely up to the challenge. And as you talk through sustainability and some of those elements, we couldn't agree more, which is why we really believe in not just creating drives that are higher capacity or fast, but asking how that helps attack this entire solution set. And your insight and work with WEKA has just been admirable from my standpoint personally and also from our company Solidigm's standpoint. But we do have to ask, we've been to a lot of shows, seen WEKA there, but where else can folks go just to learn more and engage with your team and maybe trial some of this work you're working on?

Joel: Yeah, you know, as is true, once it's on the web, you can always find it. So www.WEKA.io is really the gold standard for where you should go to find any and all information. We have links there for all of our solutions, both by industry and by technology type. And from there, you can click a few buttons, chat with people live online, and get answers and find out more about how we can help you out.

Allyson: Well, Joel, next time you guys bring out the purple Lamborghinis, I'd like an invitation for a ride. That was really cool at GTC, and I'm sure there were a lot of folks who were a bit envious of the folks that got a chance to take a look at those. Thank you so much for being on the program today. I've been following WEKA since last year when you guys were at Cloud Field Day, and I've just been so impressed with the solutions that you're delivering to the market. We want to keep having you on the program, and today just underscored why. Thanks for being here.

Joel: More than happy to do it, and absolutely, if the Lamborghinis come out, come find me. I'll make sure you get in for a ride.

Allyson: Jeniece, you're going to have to come with me.

Jeniece: We're going to have to find a three-seater.

Allyson: All right, and Jeniece, thank you so much for co-hosting. We will be back with our next episode soon as we explore the data pipeline.

Used with permission.