
No country left behind with sovereign AI
Ryan welcomes Stephen Watt, distinguished engineer and VP of Red Hat’s Office of the CTO, to chat about digital sovereignty and sovereign AI.
They explore major infrastructure constraints for things like power, cooling, and scarce hardware that cause the regional disparities we see in sovereign AI, plus why we need to extend Kubernetes and integrate the PyTorch stack not just for a sovereign cloud but for sovereign AI.

Red Hat’s Office of the CTO is a division of 150 software engineers and researchers working on their Research and Emerging Technologies arms, helping to shape the vision and strategy of Red Hat’s technology.

Connect with Stephen on LinkedIn.

Congrats to user Ittiel for winning a Populist badge on their answer to Print timestamps in Docker Compose logs.

TRANSCRIPT

[Intro Music]

Ryan Donovan: Hello, and welcome to the Stack Overflow Podcast, a place to talk all things software and technology. I'm your host, Ryan Donovan, and today we're talking about AI sovereignty and how engineers are extending Kubernetes, integrating the PyTorch stack, doing all those good things so [that] we can have open-source AI sovereignty. And my guest for that today is Steve Watt, who is a distinguished engineer and VP of the Office of the CTO at Red Hat. So, welcome to the show, Steve.

Stephen Watt: Thanks, Ryan. Excited to be here.

Ryan Donovan: Before we get into our topic today, we like to get to know our guest. How did you get into software and technology?

Stephen Watt: [A] long time ago, I actually started in South Africa, which is where I'm from. We were working on the early internet service providers, trying to get people connectivity over somewhat problematic African phone lines on 14.4 kbps modems.
But from there, I started building web applications, early Java, and from there I went into startups in the United States and IBM, building large systems around web service integration. And then, from there, my career went into emerging technologies and large big data analytical systems; distributed systems have been my focus, I'd say, for the last 15 to 20 years. And then Spark, and then Kubernetes, and now the PyTorch ecosystem and LLMs.

Ryan Donovan: So, I've heard a lot of folks talking around AI and data about AI sovereignty and data sovereignty. Let's take a little step back and define that. What are we talking about [with] sovereignty?

Stephen Watt: Yeah, I think this is a great question. There are two ways to articulate this, two different lenses. Sovereignty is about getting a set of sovereign guarantees for your application. Digital sovereignty is one lens, and I think it's the one that's most commonly talked about. So, that's [when] you're running your application, can you guarantee that it's running in a particular region, that it's being operated by people in a particular region, and that the data lives in a particular region? And can you, from a compliance standpoint, actually instrument all of that and provide assurances that you meet those compliance requirements? And then, there's what I would call the 'sovereign cloud' piece. Sovereign cloud's got these subsets of 'sovereign cloud' and 'sovereign AI,' but essentially what we're seeing there is there's a region, a nation, or a state, and that region wants to provide infrastructure for its constituents to be able to run these applications and get those sovereign guarantees. But there are additional incentives here, especially when it comes to sovereign AI. Specifically, that nation or state wants to ensure that their constituents aren't getting left behind. So, that's one.
Two, this infrastructure is complicated and expensive, and it requires specialized skills and specialized infrastructure to run, and they want to be able to provide that, along with some sort of discounted access to it, to their constituents. So, it often involves deploying and operating an infrastructure for their constituents, and there's almost always an attached mechanism that researchers, startups, [and] citizens can come through to get discounted access to that infrastructure.

Ryan Donovan: Okay, so it does have a sort of state-level control to it, right? Because sovereign, you think of the king, and this is inference under consent of the king, right?

Stephen Watt: Yes, exactly. It's a great way to describe it. And that is literally what's happening. There are kingdoms that are doing this. Saudi Arabia, I would say, was a first mover. And UAE is also a monarchy. And so, both of those were early movers in the sovereign AI space.

Ryan Donovan: We're talking today about how people are implementing that on a technical level. And I think you wonder, what is different from just setting up a data center on that country's soil and cutting the pipes outside of the country? What does it require to create sovereign AI and data?

Stephen Watt: Yeah, I think this is fascinating. This is such an interesting space. That's a loaded question, and I'll explain why: it helps to separate sovereign cloud from sovereign AI. So basically, imagine sovereign cloud is all the same things without actually ever bringing AI into the conversation. So, let’s say it's primarily cloud native; that stuff all runs on CPUs and can be powered and cooled in the data centers that all these regions have today.
So, that's, by and large, a pretty simple thing: it's more focused on who's operating it and on the guarantees that the data doesn't leave the region, but the compute's already there, assuming it's a data center in region, and it's not too complicated. Sovereign AI is way different. And what I mean by that is as soon as you bring in the latest infrastructure, the latest Nvidia and AMD chips, there's a whole lot of additional questions that get asked, which is, can your in-region data centers power and cool these things? Do you have the power? And as far as your data center goes, most of these are liquid-cooled. So, can your data center provide liquid cooling? Can you retrofit your data center to do liquid cooling? Is that cost-effective? And then, you start to see these regional dynamics play out. And so, if you go through this sort of rubric, then you start to see, okay, does land become a factor? And then, policy becomes a factor. And so, can you pour new concrete? Do you have the land to pour new concrete to build new data centers like the Stargate data centers that are being built in Texas? And do you have the water to liquid cool these? That's an issue that, by and large, we don't have in the United States, but in other geographies, say Western Europe, there isn't a surplus of available land, and building out this new infrastructure is complicated. And so, they also have to factor in that, at least until 2030 or 2035, when new installations may be completed, they have to do it with what they have.

Ryan Donovan: The politics and the concerns around water for data centers are a big sticking point for AI for a lot of people. And you talked about some of the forerunners in sovereign AI, Saudi Arabia, UAE; they're not known for their reserves of water, right?

Stephen Watt: Yeah, exactly. I'm not quite sure how much freshwater versus saline water makes a difference in those, and maybe it does, or it doesn't, but yeah.
Where they do have challenges is their thermal footprint in the area. So, they're running hot data centers in a very hot climate, and that can really put an additional load on the grid to actually liquid cool that. It is interesting, that thermal footprint is why the Nordics are really popular for building these data centers. And Finland, especially, is doing some really incredible things: as they build out this new infrastructure, they're integrating it into their cities. So, they're actually moving the excess heat away to power their cities and neighborhoods, which I think is just very clever.

Ryan Donovan: Yeah, to make these data centers there's a lot of software, too. And we talked about, [in] the beginning, extending Kubernetes and the PyTorch stack to enable a sort of sovereign AI. What sort of extensions are needed for this?

Stephen Watt: You asked about the software stack in the different regions around AI sovereignty. So, if I go into the United States, it's primarily focused on sovereignty concerns around open-weight models that are built in the US. Currently, most of the top open-weight models (the top six on the leaderboard) are all in China. And if you look in Europe, there's a very strong focus on self-determination, and there it's slightly different; there isn't such an open-weight focus. Theirs is more focused on a fully open stack from top to bottom that they fully control, even to the point of the silicon, where they're exploring RISC-V inference processors. And so, the stack is different based on those different needs. In the US, you will see, depending on the provenance of the infrastructure they're running on, if it's specifically an HPC infrastructure, some mixture of Slurm, which has been around for 20-plus years as a large-scale cluster orchestrator, and Kubernetes. There's also this journey around familiarity.
So, if you're specifically in an HPC world where you've been focused on running jobs (jobs are things that start and then finish), you're really focused on the metric of how long the job takes. But once it's done, then basically, you don't have a demand on the system until the next run of the job. That's really like model pre-training and post-training, so creating a model or improving one. Model inference, which is serving the model, is a very different dynamic. It's an app, and when you have an app in production, people depend on it, and when the app goes down, problems ensue. And so, inference is way more like that. It's an operational concern, an
📰Originally published at stackoverflow.blog
Staff Writer