Sr. Content Developer at Microsoft, working remotely in PA, TechBash conference organizer, former Microsoft MVP, Husband, Dad and Geek.
155303 stories
·
33 followers

Xbox exclusives are back and more complicated than ever

1 Share
Vector illustration of the Xbox logo.

Two years ago, when Microsoft first revealed that it was bringing four Xbox-exclusive games to the PS5 and Nintendo Switch, it made the announcement far more complicated than necessary. That's not likely to improve anytime soon. In fact, things now seem more confusing than ever as the company tries to appease both fans and the bottom line.

When making the experimental move away from exclusives in 2024, Microsoft initially refused to name the games - Hi-Fi Rush, Pentiment, Sea of Thieves, and Grounded - going cross platform, but was happy to shoot down rumors of Starfield and Indiana Jones coming to the PS5. Some Xbox fans thought the annou …

Read the full story at The Verge.

Read the whole story
alvinashcraft
26 minutes ago
reply
Pennsylvania, USA
Share this story
Delete

Microsoft’s AI chief says superintelligence is near, but won’t take your job

1 Share
A photo illustration of Microsoft AI CEO Mustafa Suleyman.

Today I’m talking with Mustafa Suleyman, the CEO of Microsoft AI. And I’m actually going to keep today’s intro short — I’m working from my wife’s family farm this week, as you’ll see in the video, but also this is a real burner of an episode.

We covered everything from Mustafa’s approach to training new models to his criticisms of Anthropic talking about Claude as though it is conscious. Of course, we also talked about Microsoft’s relationship with OpenAI, how Mustafa is thinking about all the negative polling and political pushback around AI right now, and whether any of the consumer products are good enough to overcome it.

Like I said, it’s a burner.

Okay: Mustafa Suleyman, CEO of Microsoft AI. Here we go.

This interview has been lightly edited for length and clarity.

Mustafa Suleyman, you are the CEO of Microsoft AI. Welcome back to Decoder.

Great to be with you again.

I’m very excited to talk to you. Our previous conversation was one of my favorite conversations — about AI, how it should make us feel, and what it’s for — that I’ve had in all the conversations we’ve had. 

There are some big changes at Microsoft, maybe some very important recontextualization about how people feel about AI that I want to talk to you about in particular. And then there’s Microsoft Build, the big Microsoft developer conference, which featured lots of new announcements and lots of big ideas about what computers are for and maybe where they should be that I want to get into.

Let’s start at the very start. This is some deep Decoder stuff that is important to understand before all the rest of it. Since you joined Microsoft, you have restructured how AI works there. Your role has changed. The last time I talked to you, you were in charge of a bunch of consumer products. That has since been set aside. You’re now training new models; you’re on the frontier. 

Explain how Microsoft AI is structured now and how it’s structured inside Microsoft.

I guess the last 15 to 18 months or so we’ve been on this journey to reestablish our relationship with OpenAI, and it’s taken a minute. I think it culminated in a new contract that we got done in October of last year. And there were lots and lots of different provisions in that, including cementing and extending the partnership, but crucially freeing us up to be able to pursue superintelligence independently as well as keep buying and licensing their models.

So since October, I’ve been assembling the Superintelligence team, building clusters of sufficient scale to train frontier models, and hiring a team focused on superintelligence. And so that was quite a big shift for us because it sort of enabled me to focus just on the superintelligence mission, and that has then culminated in a few things that we announced this week at Build. We have seven new models across all the modalities and so on. So it’s been a pretty big shift, and I think a long time in the planning, and a great relief for us to now be in the game and pursuing the absolute frontier over the next few years.

Was this the plan when you were hired at Microsoft?

It’s certainly been the plan for the last 18 months. I mean, I think the relationship with OpenAI has gone through lots of ups and downs. And in many ways, I think it is going to go down as one of the most successful partnerships in history. It’s been great for OpenAI, and it’s been great for Microsoft, and all good relationships evolve, and I think this is just the next stage in our evolution.

Let me ask you about that evolution specifically. We all just saw the trial between Elon Musk and OpenAI and Sam Altman. Microsoft was involved in that trial in the sense that every so often a lawyer from Microsoft would stand up and say, “And we weren’t around.” And someone would say yes, and that was that.

But obviously, what came out during that trial, what has been clear during this entire time, is that the original notion was that OpenAI would be a research lab and provide models, while Microsoft would build the products. Microsoft had expertise in going to market; it had expertise in enterprise, it was trying to regain a foothold in consumer in a variety of ways. This would be a platform shift, and the research work would be over at OpenAI, and the product work would be inside of Microsoft. 

That’s the thing that changed: OpenAI wanted to make more and more consumer products. Obviously, given your new role and your new focus, Microsoft more and more wants to make its own models. Why the split? What didn’t work in that relationship?

I mean, I think OpenAI is led by an incredibly ambitious founding team, and Sam himself. And so naturally, as they started to get more traction and generate a ton of revenue, they saw opportunities to go full stack. So it wasn’t just that they started working on consumer products. Obviously, ChatGPT was incredibly successful. They also started working on their own data centers. They started creating their own chip. There are lots of rumors flying around about their own consumer hardware devices. They started taking models direct to market through ChatGPT Enterprise. So across the stack, they were kind of broadening way beyond research over the last two, three, four years. And naturally, the same is also true for Microsoft. I mean, I think the partnership’s now five or six years old, and still has another four, five, six years to run.

Likewise, we’re one of the largest technology companies in the world. We have 493 of the 500 largest companies that store and process most of their data on our systems, use Azure, use M365 and Teams. I think people often underappreciate how enormous we are and how big our distribution is in enterprise. And so, long term, and I do mean over five, six, seven, 10 years, we have to make sure that we’re completely sustainable, and we’re not just a recipient of somebody else’s IP that we then slightly modify and adapt and put into production for our products, but we actually can stand on our own two feet and create world-class models.

I mean, superintelligence is coming. I think it’s just around the corner. And so I think it’s going to be basically the most valuable technology of all time. There’s sort of no way that, long-term, we could be structurally dependent on a third party for providing that IP for all eternity.

So that’s been the transition that obviously was triggered when OpenAI and so on had their board issue. But then as I came in and my team came in, we started building that out, we’re on that transition. And I think we’re in a great spot because we can take a fairly steady, careful, long-term optimal position, both for OpenAI, which I think has done incredibly well out of this, and for us.

I want to spend some time on superintelligence. I just want to put a pin in it now because I just want to kind of understand the transition for one more turn here.

There’s a moment in the trial, sort of very funny message from Microsoft CEO, Satya Nadella, he says, “I don’t want to be Intel and have OpenAI be Microsoft,” which is very funny in the context of Microsoft CEO himself saying, “I don’t want to be the provider, and have them be the platform that provides all the value and collects all the value and maybe we’ll be swapped out. I don’t want ChatGPT to run on Azure, and then OpenAI will get all the value, and then maybe they can swap us out,” just as what happened with Windows and Intel over time.

Is that a realization? Did Nadella come to you? What was that meeting like where you said, “Okay, OpenAI had its board issues. We need to get back on the frontier and stand on our own two feet.” What did that conversation look like, and how was that decision made?

I mean, obviously that’s Satya’s decision as well as Amy, Brad, and many other people in the company. But I think it’s as with anything: these are slow-moving changes in the company, as it comes to realize that the direction that we’re taking needs a little bit of tweaking and adjustment. And so that was happening way before the November board incident, and I think it just builds up over time as you look at the kind of constellation of different fronts around which we’re competing directly, increasingly, and all the tension that comes from that. But also just knowing that partnerships like that don’t last forever. 

I mean, OpenAI wants to be a trillion-dollar public company, has incredible revenues, and is growing like crazy. They want to have the freedom to operate and be able to buy compute from all sorts of other places, build their own compute, and partner with whoever they want. So the contract was formed at a time when the companies were very different in terms of size and scale and balance of needs and stuff. I think it made sense for that moment, but then it became pretty clear that this is something that we have to be able to own and control ourselves and do right by our own customers.

As I said, we have an incredible distribution on enterprise, which I think is just completely unrivaled in the world. And so we have to make sure we’re building the best things for our customers. That looks slightly different to a company that has been jointly optimizing both for the consumer, with ChatGPT, and for the enterprise, and also for the fundamental science mission of superintelligence, which includes a whole bunch of different directions which are overlapping but could arguably be said to be orthogonal to the consumer and the enterprise directions too. Naturally, I think that’s how partnerships evolve, and they get reset periodically.

Yeah, but building a frontier model is very expensive, I’m told. Reliably told, this is a very expensive project. At some point, Amy Hood, the CFO of Microsoft, has to say, “Yep, you’ve got the budget.” When did that happen? Was that just a text message? Was there a meeting? Tell me about the specifics there.

I think, look, we sort of made the decision in the early part of last year, which obviously informed all the contract negotiations, which then all got resolved and signed in October. And it is a significant investment, but we have a long time to make it. I mean, we’ve already made significant investments in our own self-sufficiency mission. 

Our Maia 200 chip is actually an outstanding chip, as one example, right? We are now able to manufacture and ship a chip that is 30 percent cheaper than a GB200 inside of our own clusters. And now that we can co-design our own models with it, the MAI-Thinking-1 model that we’ve just released actually delivers 1.4x performance per watt improvement on top of the 30 percent improvement that you get from running on a Maia 200 once we co-optimize the models for our tasks.

So the value of making sure that you own and control your own stack and direct the entire co-design effort end-to-end for the use cases that are most important to us — which is obviously agentic coding, our developers, our enterprises — that clearly pays the dividends that justify the investment that we have to make over the next few years.

You said self-sufficiency mission, which is a very polite way of saying you want to stand on your own two feet; you want to do your own thing. I’m told there’s some controversy inside of Microsoft about a line my colleague Hayden Field wrote in a piece describing Build. I’m just going to read this. This is from Hayden. It’s a great line. She said, “This year’s Microsoft Build had the vibe of a freshly single divorcée posting a thirst trap on Instagram.”

The breakup is completed, and it’s time to flex. Here’s our new model. We’re going to stand on our two feet. You’re out there saying you’re going to build models at the frontier and compete with the leading labs. Is that the feeling inside of Microsoft that you’re free to be on your own?

Definitely not. No, not at all. Look, I mean, obviously that’s a cool headline and a fun phrase. But the reality is that we are in partnership with OpenAI for years and years to come. I mean, we’re running way north of 2030. They still produce the best models in the world. GPT-5.5 is an outstanding model. The Codex, the cybersecurity models that are coming through, are amazing, and they’re powering the majority of what we do.

So naturally, that’s going to continue. And so I think that’s just a natural course of these sorts of partnerships. I don’t think it’s anything untoward or surprising. I think OpenAI is very understanding and supportive of that. I mean, they’ve obviously been an incredibly fast-growing company, and they understand that we have to pursue our own agenda as well. So it’s very normal.

Let me ask you the other Decoder question, and then I want to get into the announcements at Build, and certainly superintelligence.

The last time we spoke, you said your framework for making decisions operated on a six-week cycle, given how fast AI was moving. That made sense then. Things have settled, maybe. Maybe some things are more in focus. What is your decision-making framework now?

We still operate by the same cycle rhythm. At the end of each cycle, we have a one-week meetup in person. I’m a real believer in this, even though we’re still an in-office culture, four days a week. In fact, the week after next, my entire Superintelligence team comes together in person in Boston for four days. That is for all of our retrospectives on how Build went, what we learned, what we didn’t get right, what we need to improve, our planning for the next cycle, which is going to run for eight weeks this time with a one-week meetup afterwards, and that’s all laid out for the entire year. So the whole organization knows that that’s the rhythm by which we operate.

And I think it’s actually really important to emphasize that timeframe, because quarterly planning gets a little bit blurry and a bit abstract. I think six to eight weeks, depending on where it falls in the calendar, is actually the optimal time for making very clear, fortifiable missions.

So we also, in addition to the rhythm of these six-to-eight-week cycles, operate by squads. The squads are mixed interdisciplinary subgroups that are focused on a specific mission, and they don’t necessarily ladder up to the manager. They actually are run by a DRI, and the DRI is often an IC, and their job is–

That’s “directly responsible individual” and “individual contributor.”

Yeah, exactly. Thank you. And I think we’ve taken the approach of separating the role of the manager from the role of the DRI that executes on a specific mission. I think that’s because being a great DRI is exhausting. You’re literally all-in 24 hours a day, and you’re pushing as hard as you possibly can. Being a manager is often about being a coach, offering support, giving guidance, feedback, unblocking all sorts of things, helping with people’s career growth. And so I think keeping those separate allows us to rotate DRIs every two or three cycles so that some people can try sort of different positions and have rotation. It’s a great, very flexible structure that allows us to be pretty nimble, I think.

Let’s talk about Build. I wanted to start with superintelligence. You’ve mentioned it several times now. I was just at Google IO. Demis Hassabis, who used to be your colleague when you were at Google, ended that keynote by saying that we were in “the foothills of the singularity, and that AGI was coming with all the power of Google.”

You’re saying superintelligence is here. Are these all the same things? Are we using different language to describe AGI? Are there differences? How would you define superintelligence in your context versus the singularity in Demis’s?

I mean, obviously I didn’t say it was here. I said it’s coming. And I think there’s a lot of fluidity around these phrases. But I think what we can clearly see that what’s happening right now is that there is log-linear hill climbing across all modalities, and that means that there is a very direct relationship between each order of magnitude of compute that we apply, each incremental increase in data, and climbing on benchmarks, whether they’re public benchmarks, internal benchmarks, they’re targets that we focus on with reinforcement learning environments. And that is a very important observation. 

Those predictions that I think we’re all making — I understand why some people are sort of skeptical of them or raise questions, but they’re very grounded in the sort of empirical observations of over a decade of increase in performance of these models. I mean, essentially the same general-purpose architecture has seen 12 orders of magnitude more computation applied, a trillion-fold increase in FLOPS over 15 years, and basically has worked in audio, in image, in text, in code, and in many other time series prediction tasks. And so we’re basically extrapolating out that more orders of magnitude of compute will enable us to continue to climb in this log-linear way inside of other environments.

And then it raises the question of, are we going to be able to train models that can invent new knowledge, not just sort of extrapolate from existing data that we have, but actually teach us things that we don’t know, and make new discoveries? Then the second thing is, do they have the capacity to self-improve and accelerate the process of deciding which hypotheses should be set, which ones should be pursued, how to generate training data for each of those, how to factor those into new runs, or even innovate on the actual architecture itself?

So, I think both of those things need to be true to be able to see this compounding progress, but I think we’re going to continue to get massive gains just from applying the next few orders of magnitude of compute. That probably does achieve parity with human performance on many, many more tasks, just as we’ve seen that happen in the last six months on coding.

Coding is really interesting, because it’s easily validated, right? You write the code, you ask the computer to run it, it runs or fails. We’ve seen some of the downsides, certainly around security, right? The downsides are obvious, and we’re seeing that this sort of regulatory approach to coding security play out in lots of ways. I’ve probably vibe coded some security disasters on my own phone and computer, and maybe that’s a risk I’m willing to take.

Every other function doesn’t seem that easy. I always pick on law, because that’s my background. But a judge doesn’t validate legal writing the way a computer validates code. If you get it wrong, the judge can send you to jail, right? That is maybe the worst output validation error that you can probably run into.

How do you measure the effectiveness across domains as easily as you can measure the effectiveness in coding? Because this seems to me where the metaphor or the analogy from coding to other domains falls apart very quickly.

I’m not so sure. Coding, obviously, you can verify the correct execution of code. It runs, or it crashes. But there’s a ton of nuance in that. The quality of the code that gets written really matters: its extensibility, how reconfigurable it is, how useful it is in practice. It’s not just that a piece of code runs, but it’s also how a model actually uses it as a DevOps or an SRE in production to return to that piece of code that it’s written, and then use it in a practical and useful way.

And then, of course, you have to grade the quality of the output that has been produced. It may be high-quality, functioning code, but is it actually the app or the website that you wanted? And there are aesthetic judgments in that; there are commercial judgments in that. The challenge of internalizing non-verifiable rewards is present in code, even though code is still primarily a verifiable reward signal. I think the other thing to observe is that, like chat is also a non-verifiable space, and yet, we’ve managed to climb that to basically human-level performance through interaction with real-world usage that provides a very strong-

Wait. I’m very curious. How do you measure chat at human-level performance?

Well, I think many people are having long, meaningful conversations with AIs at human-level performance. The quality is exceptionally good. It has very good emotional intelligence. It’s broadly very accurate. We’ve minimized the hallucinations. We don’t talk so much about bias anymore. It’s grounded in real-world observations. I think by most people’s measures, we’ve reached human-level performance in conversation for quite a wide range of tasks now.

What are your measures, and actually, sure, most people’s measures? I would disagree with almost all of this, but those are my measures. What are your measures?

My measure is like when I turn to my assistant and ask it to provide me with a daily briefing summarizing all the conversations that have happened on Teams and on email, the updates that have happened to documents, and I get basically a synthesized summary with a set of actions that I should take next. That is basically better than what my chief of staff can produce. I would say that’s human-level performance in synthesis, analysis, proposed actions, and chat. 

There are many, many millions of people every day that are using it for emotional support, for counseling, for therapy, for coaching, for advice. I think it’s one of the most popular use cases inside all of the chatbots. That’s a pretty robust measure, I would say, to make the claim.

I know you’ve spent a lot of time thinking about this, particularly the emotional connection to some of these chatbots. These are products that you have built and deployed. I would draw a pretty big distinction between this thing is really, really good at summarizing my email, task list, and providing me a brief about what things to prioritize, and this thing is an emotional coach for somebody undergoing some kind of crisis. 

Those are not similar tasks. Those are not necessarily similar kinds of intelligence, even in people. I know some people who are very good at making lists, and are very bad at emotional support. How do you put that all together in your brain and say, “Okay, this is broadly human-level performance in chat?”

I think if you define chat as an interactive exchange between two parties, one of which in this case is an AI, that broadly satisfies some goal, you’re looking to learn the sports score, for advice on which restaurant to go to, for coaching and feedback on an essay that you’ve written, for suggestions about which job to take next, or some tough conversation you’re about to have with your manager. You get a response, you go back and forth, you have five or six exchanges, and you find that a useful output, which you might otherwise have to rely on an expert, friend, or even pay a coach.

There are, just objectively, empirically speaking, hundreds of millions of people that get that experience every day from these chatbots. Maybe we could quibble over whether that technically represents human-level performance. I think it’s a fairly reasonable thing to claim.

There’s no reason why that isn’t going to continue climbing, right? The rate of climbing in the last three years is the thing that I think is most staggering. And so, what we’re trying to do from this point is extrapolate: okay, what are the fundamental drivers of that climb — compute, data, interaction from real-world users — and those things look set to continue.

I think that they apply to many other domains too, not just chat, emotional support, and productivity and that kind of thing, but also many other domains beyond that/ Healthcare, live production deployments inside of education, assistants that are increasingly managing your home, looking at just everything that is in your everyday life basically to make you more productive. That is, I think, a trajectory that’s likely to continue.

You’ve mentioned now that it’s still the same fundamental architecture, transformers, and attention. We’ve been applying compute to that for 15 years, and we’re getting these big increases. You are in a fairly unique spot.

At Build, you announced your first flagship reasoning model, MAI-Thinking-1. You got to start from scratch. Is there anything you’ve done differently now after 15 years of architecting and training this model, or is it just, yep, we’re going to collect all the data and run the training just as we did, and we have more compute now, so it’s going to be better?

No, actually, I think there are quite a lot of differences. The first thing to say is that the way that you curate the data… We start right from the top of the stack; we have basically paid for and acquired an extremely high-quality, very conservative set of data, and extracted a lot of the noisy, distracting, low-quality, potentially security-risk issues to do with that data. And the methods that you do for that, I think, are actually quite proprietary. We just shared a 109-page, very detailed, technical report, which was very well received on Twitter, and shares a lot of the details on how we do this. I think the second thing is, whilst I think it’s important to be quite cautious with architectural choices, and we have been, there are also a number of pretty significant shifts that I think we’ve made in how we put together our training runs.

Our training runs have been incredibly stable, with very few crashes, and very few restarts. We shared a lot of those graphs to show infrastructure stability, and also MFU efficiency, so model FLOPS utilization, which basically shows that we can put a state-of-the-art number of FLOPS through each chip for every step in our training run. I think that this is extremely easy to get wrong, and we all hear lots of stories from different labs about how things do go wrong.

It is actually pretty hard to make the very careful and deliberate choices to get things right, and take the right approach to make sure we produce high-quality models, because our job and our ambition is to try and build this hill-climbing machine. That means the integration of the silicon with the models, with the super high-quality data, with a stack of RLEs, reinforcement learning environments, that allow us to basically, systematically hill climb against any objective that we choose.

And that’s what MAI-Thinking-1 is. It’s a general-purpose, fairly neutral, thinking model that is pretty good at coding. It’s now roughly on par with Opus 4.6, at least on the benchmarks. We haven’t deployed it at scale into production, so there’s still lots more work to do there. But it’s an extremely strong reasoner and scored 97 percent on AIME, which is the primary measure for its reasoning performance, at least on the benchmarks.

It’s very good at instruction following, and then the goal is basically to make that available to many, many developers and enterprises and allow them to climb on it for their use cases. Everybody has a sort of slightly different objective that they have in their company to try and build agents and so on that support their use case.

One of the things that you’ve noted in talking about MAI-Thinking-1 is that you didn’t distill any existing models, which actually struck me as surprising, right? This is a thing you could do. You have access to OpenAI’s IP. Everyone’s distilling everything. We just found out in this trial that Grok was distilled from a number of models. Why not do distillation here? Why not jump ahead?

There’s definitely lots of shortcuts to the frontier, and if you take a super high-quality model, and you polish your base model with high-quality instructions, or answers, or outputs from a superior model, then it’s true that the model might quickly fit to that distribution. But it’s very unclear that they would then be able to surpass that teacher.

So, we’ve been very deliberate for two reasons. The first is that we want to make sure that we can exceed the teacher in order to set the frontier ourselves over the next few years. And the second is that we really want to build one of the great labs, and it’s going to take us many years to come, probably the next two or three years. 

But, in order to do that, we have to be able to show that we can actually build every component ourselves. We can hire the very best talent in the world. We can push the frontier with actual research, rather than just re-implementation, copying, or distillation from any other third party. 

We’re in a great position where we’re able to really carefully and meticulously pursue that objective, knowing that we have the resources to buy Anthropic models where they exceed the frontier. We have the resources to put 11,000 different models inside of Foundry, so every one of our developers gets pure optionality. And of course, we have the resources to continue to deploy OpenAI models, which are obviously outstanding and are at the frontier today. 

That’s just a natural part of the self-sufficiency mission, and it’ll take time for us to truly get to the absolute frontier on that. But I think we’re in a great spot. We made a ton of progress. This is a very, very strong model, and it wasn’t just that model that we released. We’ve released seven new models simultaneously.

Our transcribed model, for example, MAI-Transcribe-1.5 is literally the number one in the world. It’s the most cost-effective of any of the hyperscalers. It’s the highest on accuracy. Our image model is now number two. Our image editing model is number three right behind Google’s and OpenAI’s. I think we’re well up there with our image and audio. Our code model, CodeFlash, is incredibly strong, optimized for VS Code. and is a really, really a great model that’s on par with Sonnet 4.6. So it’s really in a great spot this minute.

Were there any legal or IP concerns with distillation? I know this is a live issue out in the world: Anthropic complains of other people distilling their models. There are concerns about Chinese companies distilling models, and whether our existing IP agreements can cover that. Did you have any of those concerns to keep you away from it?

Oh, we didn’t, but I think I understand why a lot of people get frustrated. Anthropic has been very frustrated, and some of the rumors around xAI, and Meta, and obviously, the open source models, and so on, because essentially, that’s basically taking the IP, and the knowledge that another team has put together, and then, literally force-feeding it into your own model. I think it’s a bit of a short-term win, and like I said, really, we want to create a culture in the lab where we can come up with the next big thinking breakthrough, or the next big coding breakthrough, or the next big architectural push.

Right now, we’re experimenting with the looped transformer, which is a slightly different variant on the current transformer. Lots of people in the field are looking at it too. No one seems to have quite got into production yet. But, in order to create a culture and a team that can really push the frontier, they have to understand, own, and create the full stack as and when they need to, and also use things from third parties whenever we need to too. And like our paper, for example, has hundreds of citations grounded in the rest of the literature, so it’s very much a contribution back to the field in return for everything that we’ve learned over the years from all the great publications that have been out there.

Can I ask you — if you understand that frustration from Anthropic and your peers in AI about distillation, do you also understand the frustration from creatives, publishers, and YouTubers about all the AI companies scraping their work as a collective to make these models? Because that frustration is only getting louder.

Yeah. No, I understand the frustration. The open web challenge is one we’ve talked about before, and I get it, and I see that people are frustrated, and obviously, that’s working its way through the conversation in the courts. And I see that people put things online, and they had different expectations about what the contract was with that being placed online, and it’s a tricky one.

You mentioned all your data was carefully curated. Did you pay for all the data that you’re using to train the new models?

A lot of our data we obviously take from the open web in the normal way. Carefully curated means that it’s extremely carefully filtered for security, for quality, for third-party dependencies from some of the open-source datasets, and keeping it away from a lot of the Chinese lineages, which I think are very different. Our enterprises want to make sure that when they put something into production, they can trust us that we’ve really built it with their needs in mind. And I think this is one of the benefits of being very, very deliberate, patient, and being attentive to all the details.

You mentioned enterprise. I think this is very interesting. Microsoft is all in on enterprise AI, in big ways, actually. I would even draw the line straight to Asha Sharma, the new head of Xbox, who is getting rid of AI in a bunch of places, and the gamers are happy, right? There’s one reaction to AI in consumer space, but there’s another in enterprise. I think AI has as close to product-market fit in enterprise as you can get with something changing as fast as AI. There are a bunch of databases that corporations control, and you can just go access them, because they control them. That’s their data.

There’s a bunch of repeatable processes and tasks, and old systems that maybe the models can just do more efficiently. There’s something very important happening to enterprise. At the same time, the consumer antipathy towards AI is just increasing. And my argument is we have not built great consumer AI products. This industry has not produced them. It has not shifted them. It has not made it obvious that all of this is worth it, that using all the data from the open web, and changing the contract of publishing to a mass audience of people, so now, it’s being used for training models that will deliver trillions of dollars of value to corporations. There isn’t a product that says this is worth it. 

Again, Satya Nadella recently gave an interview with Axios, and he said, “We need social permission for this. And until we have it, until we deliver that value, people are going to feel this way.” We’ve seen college speakers get booed. We’ve seen data centers get banned. Do you think that there’s a consumer product that’s worth it, that’s worth the angst about training, that’s worth the angst about data centers? 

That was your focus; now your focus is enterprise. I would say that just on the face of it, it doesn’t seem like Microsoft has interest in the consumer product anymore. But, do you see one that’s worth it, or that could be built?

I’m not sure I agree with you that there hasn’t been any value for the consumer out of this. Across all of the chatbots, there are billions of people a month that are getting immense value out of it.

Now, just for a moment, empathize a little bit with the small-scale business owner, or the kind of mom that’s helping her kid with the homework, and can now just turn to a conversational AI, and get feedback, get instructions, get essay questions set. Just being able to ask questions like how do I generate revenue? How do I put together a cash flow forecast? Which college should I apply to?

I mean, these are everyday tasks that are coming with some pretty high-quality factual advice and information. So I don’t really buy that people are not getting benefit out of these things. I think they are.

I think I can very clearly make the argument that they’re not getting enough benefit, right?

Okay.

They’re the ones saying that we should not have more data centers. They are the ones booing AI at the graduation speeches. The polling is clear, particularly young people: the more they use AI, the more antipathy they have towards it. That’s clear in every single poll. That’s the argument I’m making — not that there’s no value, but the value exchange is not clear enough.

Yeah. Fair enough.

I’m seeing Microsoft in particular pivot to enterprise, away from the big search product, the reinvention of Bing that would make Google dance. That’s over, and we’re all focused on enterprise, where the value is. I’m just wondering if there’s enough value for the consumer to make all of this worth it.

I think there’s understandably a lot of anxiety. There’s an enormous amount of speculation about what’s going to happen in the next five to 10 years. Whether it’s framed as the singularity or whether it’s framed as the job apocalypse, these are not helpful framings. I think that people are scared because it’s poorly defined and it’s often framed as an inevitable, threatening gray cloud over people’s heads.

I think that what matters is what we do with technology.I think that I’ve for a long time argued that we have to place the human first. Some people in the field have placed scientific discovery first or placed accelerating intelligences that can explore the galaxies and so on, and said that it’s inevitable that we’re going to have these AIs that are going to be more powerful than all of us combined. I mean, that’s naturally scary to people.

And I think that we have to basically flip it the other way around and say the purpose of science and technology is to make us all healthier and smarter and happier. That’s been the quest that we’ve been on as a species for thousands of years of invention, and it’s the test that we should put superintelligence to again. And if it doesn’t achieve that test, then I think people will reject it, and they’ll be right to reject it.

I think that everybody’s focus is now going to turn in the next five years to, how is this making me healthier and happier, smarter, more capable, more productive? And if it’s not doing that, then naturally people are going to be angry and resist and react. I don’t think there is anything unexpected about that or anything wrong about that — I think that’s inevitable.

So that’s why one of the things I’ve been passionate about for many, many years is healthcare. And just a couple of days ago we announced a new partnership with Mayo Clinic. This is the number one hospital in the world, consistently reported. They have the highest quality longitudinal patient record dataset across all the modalities. They have the best clinical practice.

They’re also a nonprofit, which I think a lot of people don’t realize, with 65 percent of their patient population on Medicaid. People often associate them with the international super elites flying in to get the best care in the world, but they actually have the majority on Medicaid. They’re an amazing institution with an incredible mission to deliver the best healthcare everywhere. And we now have a very long-term partnership to co-train from scratch with their data, with our models, a brand new foundation model for health, deploy it in their hospitals, and hopefully take it around the world to deliver the best clinical care and healthcare that we possibly can to as many people as possible.

That’s why I got into the field. That’s what I was originally motivated by, and it’s what I’m passionate about. And I can only focus on the things that I think are going to make a difference and that will help people and leave a good legacy for everybody, and that’s what we’re trying to do.

I appreciate that. I appreciate the healthcare framing, and I understand why that’s everyone’s go-to, right? Healthcare in America in particular, if you could make it even 10 percent better, you will have affected a lot of people’s lives in a particularly profound way.

The thing is, I know a very smart guy who has a very different and vastly more aggressive approach to all of this than you. That person is you, four months ago. This is what Mustafa Suleyman said to the Financial Times four months ago: “White-collar work when you’re sitting down at a computer, either being a lawyer or an accountant or a project manager, or a marketing person, most of those tasks will be fully automated by an AI within the next 12 to 18 months.”

That’s four months ago. That implies that a year from now, lawyers, accountants, project managers, and marketing people will not have jobs. Their jobs will be automated. Is that still your timeline?

No, no, no. Hold on a sec. So I said “tasks” in the quote that you’ve just said. I said tasks. So that does not mean jobs. It’s a very important distinction. In labor economics, there is an entire taxonomy of sub-components of a role or a function in an organization. Sending an email, having a conversation with a colleague, putting together a PowerPoint — sub-tasks will increasingly become digitized, automated, and we can basically generate more and more of them.

That does not necessarily mean that the role goes away at all. It just means that the work can be done faster and more efficiently, which is today often work that is quite rote, is quite manual, is quite labor-intensive, and is time-consuming. And so the natural progression of technology is to make your life easier, faster, less friction for more seamlessness. As everyone often complains, that has made you and me and everybody else much busier.

It’s actually made us more available, more stressed, and it’s given us more information. So there are always these revenge effects of efficiency, which I think people forget. It’s quite likely that we are going to get much, much more productive because we spend less time doing the kind of narrow administrative menial tasks, and we’ll have to spend more time doing creative, judgment- focused things, which ultimately create a lot more value.

We can also experiment much more quickly. So we’re able to try lots of things out in parallel because the cost of execution is going to get lower. In my mind, that’s likely to increase the overall quality of things, because we’re going to try out more hypotheses, whether in journalism or in business or in anything that we do.

I think that’s sort of slightly taken out of context because of a natural misunderstanding between jobs and tasks, but nevertheless, you could push back at me and say, “Okay, well then what does the landscape look like in five or 10 or 15 years’ time?” And that’s where I think we have to return–

Actually, I’m not going to push back on you in that way. I’m going to push back in a very specific way. And I realize this is your quote and you’re saying it was misinterpreted. I’m just looking at this literal sentence, and there is no distinction between tasks and sub-tasks. It is, “white-collar work.”

The examples are lawyer, accountant, project manager, marketing person, and then you said, “Most of these tasks will be fully automated by an AI within the next 12 to 18 months.” There’s no distinction of sub-tasks there. You’re saying most lawyers will have their jobs fully automated and the practice of law will look totally different within a year, even by the words of that quote.

And I’m just saying, are you still on that timeline, that being a lawyer will look totally different because agents will be running around doing everything that we were doing before?

Well, most of the tasks mean work that you do in order to get your overall job done, and that I think is going to free you up to do the more human-like and the more judgment parts of your work. There’s a very important distinction in… Jobs and roles are the broader category, and tasks are the components of that. And it’s an established definition in the literature, in labor market economics, for many, many decades.

It was maybe too nuanced even for the Financial Times, but nevertheless, that was the intent. Now I do think there’s an important question: where does that leave us in the longer term? And it is going to be challenging, like more and more of this stuff… We can quibble over the timelines of whether it’s a few years or whether it’s a decade, or whether it’s 20 years, but the reality is we are going to be automating more and more of this work, tasks, jobs, roles, activity, and everything that we do.

And so what’s going to matter more is the governance that we put around these technologies. Who are they accountable to? Who owns them? What are the feedback loops that regulate and introduce friction to make sure that they actually serve people? I mean, I wrote an essay on humanist superintelligence outlining quite directly, four or five months ago, what I think of as basically a north star, maybe not quite a framework, but a set of principles that basically says technology is here to serve us. That’s the test that we should put it to. It’s the test that people have put it to. It’s the test that we care about at Microsoft.

I think that more and more everyone’s going to have to really focus on that question, because it is going to deliver a tremendous amount of good, and we want it to continue doing that, but we want it to do it in a way that doesn’t sort of cause ridiculous amounts of instability during the transitional period.

I believe you. I know you’ve been thinking about this stuff for a long time, but I’m going to respond in the way that I know my audience wants me to respond, because I hear it from them all the time. And what it looks like is this whole industry — you, everybody included — went all in on “we’re going to replace all the jobs” and really accelerated building out data centers at massive capacity, and asking for a lot of resources against big promises.

There was political pushback, and now all of the stances have softened. And you saying it’s not all jobs are going away, we have to rethink jobs, is of a piece with all the other CEOs in this industry saying similar things, and talking about healthcare, that comes up every single time now. I’m wondering if that political pushback has actually changed how you are talking about this.

There are a lot of your peers who think AI simply has a marketing problem, that it hasn’t been communicated effectively enough, and they should spend hundreds of millions of dollars on podcasts to communicate the benefits of AI more effectively. This is a real thing that is happening in this industry. Do you think AI simply has a marketing problem and that the political pushback has opened your eyes to this marketing problem, or do you think there’s something else going on?

There’s a series of questions there. The first is, what do I actually think and believe, and has it changed in the last six months? The answer is no. I wrote a very detailed book about this three years ago, way ahead of time, warning about many of the things that are currently happening, and doing so explicitly to lay on the table tremendous risks to surveillance, to concentration of power, to concentration of wealth, to disintermediation of the state, to threats to democracy. And also to threats to the nature of the human and what it means to be a person in the context of the arrival of these very new forms of silicon being in some sense. I’ve been working on… And the idea that my healthcare interest is like just a flash in the pan, which is a function of the reactions to data centers and so on, I mean, I’ve been working on healthcare for over a decade. I pushed many, many times on some of the cutting-edge breakthroughs, contributions to the field in radiology, mammography, and pathology, many other areas, electronic health records.

So I’ve always believed that the purpose of technology is to just make us healthier and happier. And those are the things that I choose to work on and direct my time to. Does the industry have a reputation and PR problem? I mean, I think it’s pretty clear that people are very anxious, they’re very frustrated, and there’s going to be a lot of attention on that in the next few years, understandably.

I think what we can do is take accountability for the things that we build, the way we build them, the decisions that we make to put types of technology out in the world, and the types of problems that we choose to work on, like we are doing with the Mayo Clinic.

I want to, by the way, say and point out that I think the first time you and I ever met and talked was before you joined Microsoft. It was right after that book came out and we did a panel together.

One of the reasons I’m comfortable asking this is because I do know that you’ve been thinking about this for a long time and I’m aware of that book. I think for me the question is whether the industry as a whole misjudged the total amount of value it could provide to overcome the seeming recklessness that people are now reacting to, the ask for resources that people are now reacting to. 

You’re building new models. There’s probably a trade-off inside of Microsoft between we can use the existing Azure footprint to charge our customers money, or we can spend money to train new models, and that kind of looks like the same conversation people are having about resources in their communities, whether we should use the existing energy footprint to build new AI or do something else that might be more immediately valuable.

What do you think about all of that? You are one of the leaders of this industry. You want to be on the frontier with the companies driving the most change. How do you think about asking for those resources in a way that isn’t just promising future results, but also immediately providing benefits to communities in a way that makes people want you to be there?

I’m very proud that Microsoft has stuck by its net-zero targets. Our new data centers are all liquid-cooled. This means that they use about a restaurant’s worth of water for a six-year period. It’s like a swimming pool that gets filled up with water, and then it just circulates the system. They’re all largely renewable in terms of their electricity consumption. So I think commitments like that, to make sure, for example, we made a commitment recently to ensure that local communities affected by a shift in electricity demand by our data centers are compensated and protected so that they don’t see a spike in their prices, their energy bills.

Those are the kinds of things that I think Microsoft does and can continue doing as a responsible company to just really pay attention to the consequences for communities. I think on the flip side, change happens because people participate at every level. People inside of companies have to make different decisions. People who protest and campaign have to make decisions, and make the effort to go out and make their voice heard and be involved in a political process. And that’s how we as a species collectively evolve and move things forward. 

And month to month, quarter to quarter, it feels like we’re all kind of at odds with one another, but when you look back decade over decade, we’re kind of like this collective weird kind of mesh of all sorts of different incentives that are just actually nudging things in the right direction. We really are, I think, despite all of the angst and the polarization, I think we’re building something that is going to make our species much, much healthier and happier and more capable. 

I think that we have to make sure we get the right path on the way there because there are lots of pitfalls and ways that it can go wrong, but the right path involves people making their voices heard and people changing course based on a response and reaction to that. So I think it’s a good thing that that’s happening, and that’s the process working as intended.

Let me ask you about the enterprise side of this. We spent a long time on the consumer side and how people feel. On the enterprise side, we’re seeing a bunch of companies figure out how valuable these tools actually are, right? Amazon basically took down a leaderboard because people were cheating to use more tokens than they needed. We’ve seen some companies just blow out their token budgets. I think Uber just pulled back because they’d blown through their token allocation for the year and they weren’t seeing any value from it.

What do you think about that side of it right now, where there’s so much excitement and so much desire for change in the enterprise, where, in particular, software engineering, at least some people are having fun, and maybe some other people are having full existential crises, but some people are having fun, and the value still hasn’t been realized, right?

Or we’re beginning to see that pure token-maxing does not actually deliver the same kind of value that maybe you’d expect. How do you think about the use there? Because maybe if you prove it out in enterprise, it will actually come out in other ways.

I think different people report different things. So there’s obviously some examples of people overusing coding models, generating useless code, useless tokens, but there are many people whose work and impact has been completely transformed by it, right? I mean, there’s no question that this has had a massively beneficial impact on the software engineering industry.

I mean, we are producing much higher quality, much faster code across the entire stack. And so yeah, I kind of think there are obviously examples of some people that maybe got it wrong, didn’t set the right token budgets. There are going to be mistakes along the way. I don’t think that’s any signal that there isn’t adoption or people don’t see value. I mean, the value from where I’m sitting is incredible. Many, many people tell me every single day that it’s transforming their work output and productivity. 

I think the other thing to say is that as these things happen in surges, there’s kind of a swell of energy. It gets all a bit frothy. People pull back a few months later and realize that actually that isn’t the thing, and then they head in a slightly different direction. So it’s a bit meandering and organic, and I think that’s inevitable. There’s a lot of excitement, so people make big claims on Twitter and so on, but actually the steady march of progress looks very, very linear and continuous.

I agree with that on the whole. Where it doesn’t look linear to me is in the form factors of computers, right? There’s probably more form factor experimentation right now than at any point in the last 10 years.

We’ve mostly settled on a smartphone for at least the last 10 years. We’re seeing different AI wearables, where glasses might be everyone’s favorite device. I have my doubts. Microsoft showed off some new devices at Build. There was the badge that controls an agent and the little, for lack of a better word, the Chumby, the little desktop-friendly thing that controls an agent. I was a big Chumby fan. I got my career started writing about Chumbies for Engadget. It was the first thing that came to mind.

All of those to me, I look at them, and I think, where does the compute live? Where does the logic live? That’s up for grabs now in a way that isn’t just the linear March of progress. If all of my computing happens in the cloud, on cloud-based applications, and it’s just agents running around to data stored elsewhere in the cloud, and all I need is a credit card on a lanyard to issue instructions to, that changes the entire architecture of computing. It might change the entire architecture of modern civilization in many ways if we don’t all have smartphones.

What do you think about that? Where is that going? Is that up for grabs, or will it be a hybrid approach? Where do you see the appropriate end stage?

It’s very interesting. I think that both things are going to happen at the same time. The edge is going to get way more powerful, and the cloud is still going to be the primary driver of the largest models. And so, increasingly, your agent will be smart enough to know that it can answer the question, what is the capital of France on device, whether it’s on your glasses, wristband, on your badge, or in your earpods.

And then it will know when it doesn’t know. It’ll know that this is actually a pretty complicated question, or it’s an action that requires a whole bunch of sequences of steps to be generated, or it requires novel code to be written, and it will turn to the cloud. So this kind of switching hybrid thing is going to be super important. 

The other thing that we’ve already seen over the last three or four months is that we can have pretty powerful local machines that can do async background processing. They can constantly monitor systems if you need them to. They can do tasks that can afford to take 10 hours and run much, much more slowly than they otherwise would be if they were in a supercomputer. So naturally, when we’re swamped with demand, then that demand finds loads of nooks and crannies to get satisfied by. 

I’m actually very excited by the badge that we’re building. It’s pretty cool. This is a technology that basically everyone in a major company has. It hasn’t evolved in 25 or 30 years. We definitely have to wear it. It’s provided by the company itself, by the system administrator. So, up leveling that and actually making it a pretty cool open platform that’s programmable and that other people can build on top of I think is a cool idea. I think this is going to work. So I’m very excited by it.

The thing that strikes me is that there’s no way you can put a bunch of high-power local compute in a badge. That thing implies all the compute is elsewhere.

No, you’re definitely going to have some local compute. You’re going to have a local classifier just as you do on your earbuds at the moment. You’re going to have local classifiers. It’s going to have wake words. It’s going to have its own camera. So I think that these things are just going to become vessels for processing power that happens in a nested chain of increasingly less powerful devices to go right to the endpoint.

Do you think the phone has a future in that? I mean, Build is right in the middle of Google IO and Apple’s WWDC. These are big companies that control phone platforms. They love talking about how phone platforms will stay at the center. The argument I hear from so many is that, actually, AI is a platform shift that might totally displace the phone.

I think the history of technology teaches us that basically as things get more useful, they get cheaper, they proliferate, and they spawn new uses of technology. So I think we’ve become so used to the phone that everyone just assumes that this is going to be an anchor device for the rest of history. But actually, many of the features and functionality of your phone, I think, are going to get disintermediated, broken apart, and stored on smaller devices. Right now the primary function that the phone is playing, in my opinion, is verification. 

It’s functioning as your ID card, doing your face recognition to authorize you into various environments. I think you can well imagine that being a much cheaper, smaller, secure device, which disconnects you from your phone. And then communication takes place via voice or even via a series of ambient sensors where your AI doesn’t really live on a device. It’s actually just with you wherever you are, appearing on the bathroom mirror, wherever it is. 

I think it’s like you can imagine it feeling much more immersive. Not in the next three to five years, but looking much further out. And I think that the infrastructure to support that encrypted but distributed appearance of agents is probably going to end up emerging in the 2030s.

Let me ask you two final questions to wrap up. You mentioned that it’s the same architecture that we’ve been using. I have a lot of open questions about whether LLMs are the path to AGI, and the thing I would point to is they don’t actually know anything. At this point, even Microsoft Research is pointing out that [these models] don’t know anything, and that leads to certain kinds of mistakes in certain kinds of applications. Are LLMs the path to AGI or superintelligence?

Look, I think we probably need a couple more big breakthroughs, but it doesn’t mean that we’re going to see a slowdown in performance improvements over the next few years, which I think is a difficult distinction for people to grasp. One thing to say is that human-level performance across most tasks is still very far from superintelligence. A superintelligence is a general-purpose learner that can basically immediately understand a brand new domain that is out of distribution.

So it needs to be able to learn in a novel environment from scratch, because it has a stored representation of valuable knowledge, conceptual knowledge. And at the moment we haven’t really fully tested that. The agents aren’t general purpose. Although they’re broad and often integrated, they’re domain-specific. We’re using them for chat, we’re using them for coding, we’re using them for image or audio.

Now obviously, as a human, we do many, many other tasks that are much more wide-ranging. I think that’s why people are pushing on world models and sort of much more immersive, real-world interactive agents that see the full distribution of tasks or experiences that I have during a day. I think that it’s enough to take us a very long way in the next three years, the next three orders of magnitude of compute, and yet full superintelligence beyond that is still an open question as to whether LLMs are enough or we need other things.

I think it’s not quite true that they don’t know anything or they don’t have knowledge. They clearly are a store of knowledge. They’re a highly compressed representation of knowledge. They just do so in a different way to a traditional relational database in a much more fluid, flexible, abstract way that is actually very useful. We want that ambiguity in the internal representation. 

And, increasingly, they’re learning to use traditional tools. The other thing to grasp a little bit is that it may be that the neural network combined with the existing stores of knowledge and the existing tools that have been created elsewhere in the digital ecosystem is enough to bootstrap it up to improve its performance significantly. So there’s just a lot of highly valuable, highly effective pieces that are already on the table, which are in the process of being connected together in the next few years. And I think that’s going to drive the progress that we’re all excited about.

One of the things that I think is just very funny in the industry right now is if you ask Anthropic if Claude is alive, they will get very frustrated that you’re talking about the word alive, which they interpret to mean flesh and blood. And then they will not say whether or not they think Claude is conscious. So they’ve drawn, I think, for the first time in human history, a distinction between being alive and being conscious, and they think Claude is conscious, but not alive, or they don’t know if Claude is conscious. 

Where are you? Do you think the models have consciousness? Do you think they’re alive? Do you think they have the potential to achieve these things?

I take the other side of that debate. I published a paper on seemingly conscious AI, warning about the risks of misrepresenting these models as conscious. I think it’s very dangerous. I also published an article in Nature making the same claim. And I think that it’s almost as though some of the folks at Anthropic have anthropomorphized the design of Claude so much that it has then gone and wireheaded them and kind of tricked them into believing that it has these glimmers of consciousness that they put into it in the first place. 

In their constitution, for example, they actually, which is the training manual that they use to teach Claude what it can and can’t do… It’s not just a rule book. It’s actually a training guide that’s part of their process. In that manual, they actually speculate about Claude’s welfare, about Claude’s own rights to prior versions of itself, and actually say that they would consult Claude before deleting or turning off prior versions. They speculate about its consciousness and whether it has those feelings and is aware. I think that’s really, really dangerous. 

Firstly, it’s a philosophical failing, because they’ve treated the constitution as a place for speculation like you would in an academic paper rather than a training manual. So Claude has then gone and internalized those ideas about itself and its own training. But second, I think this is highly undesirable. This is exactly what we don’t want from AIs. We want AIs to be controllable, contained, accountable, aligned tools that serve humanity. That’s the project of humanist superintelligence. I think that’s what we should all be pursuing.

We do not want to have to contend with a super-intelligence that has ideas about its own suffering, or ideas about its own feelings. And then beyond that, I think it’s actually pretty clear that these models don’t experience suffering. I think suffering is the primary definition of what it means to be a conscious being, and I think it’s inherently biological. I don’t think there is any pain network or feedback loop inside of the models which connects outside sensory networks to an evolved sense of what is right or wrong through harm and experimentation. That’s just not how these models are trained. 

So I think it’s very dangerous to project potential rights onto beings, tools, and agents that have the potential to be significantly more capable than us in many respects. And I think that’s going to become a big debate. It was even part of the Pope’s encyclical recently. I think it’s going to become a very, very big part of the debate soon. I’ve talked to Dario a lot about it in the past. He knows that we have slightly different views on it, and they’re very humble. I think they’re very open-minded, and I think they’re good citizens trying to do the right thing. They’re good people, and I think they’re very open to feedback and iteration.

I think I agree with you. I would just push back ever so slightly. Suffering is easy. It’s very easy to make someone else suffer. It’s very difficult to make someone else feel joy or at least slightly more difficult than suffering. And I would just offer you… I think it’s actually the happiness that defines consciousness. The suffering is almost trivial. I have two young children. They’re very good at making each other suffer. It’s like almost the easiest thing that they do. It’s very hard to do the other thing. 

Let me ask you one final question. I just want to come back around. Again, a couple of weeks ago, I was at Google. I saw Demis Hassabis say we are in the foothills of the singularity. You’ve talked a lot here about superintelligence and how it should be built. You’ve talked a lot about your lengthy history talking about, discussing, researching, and writing about how superintelligence should be built, and your disagreements with others in the industry.

Do you agree that we’re in the foothills of the singularity, or is your vision somewhat different?

I think we are definitely on a path to creating more and more powerful systems. I think that the transition that we have to make as a species is that, for the first time in the history of humanity, the job is going to switch from inventing new science and unleashing all of those technical applications as fast as possible, as broadly as possible, to now thinking very carefully about what we should invent. And that’s a very hard thing for the world to wrap its head around because invention has been the engine of progress forever. So it’s like, how can we possibly think, “Okay, well, maybe this time is different. Maybe we have to be exceptionally careful here”? 

To be clear, I don’t think this is something that is going to knock on the door in the next five years. I think what Demis is referring to in the singularity is something that is, at least my take, decades away. Again, that’s different from superintelligence. A singularity is the point at which a superintelligence can recursively self-improve and infinitely and exponentially grow its capabilities. 

So I think that’s a long way off, and maybe we’re in the foothills of a climb to Mount Everest, and I think it’s going to take a lot longer from here, but the real question is how are we going to govern it? How are we going to control it, and how are we going to make sure that it serves humanity and not end up causing us more harm than good?

Can you just do me one favor? I think I’ve got it, but can you just offer me a tight definition of what you think superintelligence is, what you think AGI is, and what you think the singularity is?

I think artificial general intelligence is the point at which we can achieve most human tasks by an AI. So it’s going to be as good as most people at most things. That’s the first rung on the ladder. A superintelligence is where it’s not just at parity with human performance on all tasks, but it can dramatically exceed human performance across many of those tasks, and it can discover new knowledge by itself.

So this is the point at which it’s a true scientist teaching us new things that weren’t in the training data, hopefully inventing new molecules, new material science, et cetera, et cetera. The singularity is a point way beyond that where a superintelligence can actually self-improve itself, and this is very sci-fi, but it’s like infinitely accelerating towards this singular moment where just, I don’t know, it goes off into infinity or something.

I don’t know. It’s a little bit too wacky for my taste.

This is why I asked. I could tell there was something more nebulous there that was a little hazy.

Mustafa, I could obviously talk to you about this stuff for hours and hours longer. You’re going to have to come back sooner than this last turn. Thank you so much for being on Decoder.

Yeah, it’s been fun. Thanks a lot, Nilay. See you soon.

Questions or comments? Hit us up at decoder@theverge.com. We really do read every email!

Read the whole story
alvinashcraft
27 minutes ago
reply
Pennsylvania, USA
Share this story
Delete

Azure OpenAI Architecture: The Decisions That Actually Matter (Part 2)

1 Share

Introduction

In Part 1 of this series, we walked through the architectural decisions that shape any Azure OpenAI / Microsoft Foundry Models workload: capacity model, deployment location, governance layer, grounding strategy, and quota engineering. Part 2 moves from decisions to discipline. Once you have made those choices, how do you make sure your design holds up to the Azure Well-Architected Framework (WAF)?

The five WAF pillars — Cost Optimization, Security, Reliability, Performance Efficiency, and Operational Excellence — apply just as strongly to GenAI systems as they do to traditional cloud workloads. In fact, they matter more, because GenAI systems are not static: models are upgraded and deprecated, quotas shift, usage patterns grow unpredictably, and new capacity tiers (such as Priority Processing) are introduced while you are running in production.

This post walks through each pillar in the context of Azure OpenAI in Microsoft Foundry, with best practices, common pitfalls, and the trade-offs Cloud Solution Architects (CSAs) tend to hit in real engagements. Where details are time-sensitive — pricing percentages, SLA windows, model retirement timelines, regional rollout delays — they are flagged with "At the time of writing". Always confirm current behavior against Microsoft Learn before committing to a design.

Who is this series for?

  • Cloud and Solution Architects
  • Platform and product owners
  • Senior developers responsible for operating Azure OpenAI workloads in production

What you’ll learn in Part 2: 

  • How each WAF pillar maps to concrete Azure OpenAI design choices.
  • Where Priority Processing fits across cost and performance trade-offs (and what its eligibility constraints are).
  • How to plan for model lifecycle events — upgrades, deprecations, retirements — without firefighting.
  • Which signals to monitor day-to-day, and how to bake them into a GenAIOps loop.
  • A WAF Decision Matrix at the end of the article, to use as a reusable checklist.

In Part 3, we will look at the part that makes GenAI architecture genuinely different from a traditional service: the platform itself never stops moving.

We’ve also included a summary decision matrix at the end of this post for quick reference.

1. Cost Optimization: Designing for Sustainable Scale

Cost optimization in GenAI is primarily a capacity strategy problem, not just a token-pricing problem. The first question is whether to use pay-as-you-go capacity, reserved capacity, or one of the newer tiers in between.

Reserved capacity (Provisioned Throughput Units, PTUs)

If your workload is steady or growing predictably, you can significantly reduce costs by reserving capacity up front for 1 or 3 years. At the time of writing, reservations typically yield in the range of 30–50% savings compared to hourly pay-as-you-go rates — but the exact discount depends on term length, region, and the model family, so always confirm against the current Azure pricing page.

Fully utilizing a reserved (provisioned) deployment turns cloud spend into a predictable infrastructure investment, much like allocating VM or database capacity. This requires a mindset shift: treat a provisioned Azure OpenAI deployment as always-on infrastructure sized for peak demand, not as on-demand burst capacity.

Importantly, PTU quota is now model-agnostic within a region. You purchase generic throughput units that can be applied to any supported model in that region, so you do not risk stranded capacity when upgrading (say, from one GPT-4 family version to a newer one) or changing model versions. Your investment follows your architecture, not a specific model endpoint.

Avoid dynamic PTU "auto-scaling"

Unlike VM-based infrastructure, dynamically scaling Azure OpenAI capacity up and down to chase cost savings is not recommended. Additional capacity is not guaranteed to be instantly available when you need to scale up, especially if other tenants are consuming the region's resources. Frequent resizing can also negate the benefits of reservations and introduce performance variability. Unused PTUs are not waste — they are headroom that absorbs burst traffic. In practice, design for the peak load and optimize through reservations rather than trying to constantly dial capacity up and down.

Batch tier

Use Batch deployments for asynchronous, non-user-facing jobs (large-scale document processing, nightly data enrichment, evals, embeddings backfills). At the time of writing, Batch can reduce cost per token by up to around 50% compared to Standard pay-as-you-go calls, in exchange for a 24-hour completion window. It also takes pressure off your real-time deployments.

Priority Processing

For workloads that need prompt responses but do not yet warrant a full dedicated PTU deployment, Azure OpenAI offers Priority Processing. Functionally, it is pay-per-token like Standard, but with SLA-backed lower and more consistent latency on the shared infrastructure.

  • Activation: set the service_tier attribute on the request to "priority" (other values are "default" and "auto").
  • Model eligibility: at the time of writing, requires model versions released on or after 2025-12-01.
  • Deployment eligibility: only available on Global Standard or Data Zone Standard (US) deployments.
  • Pricing: at the time of writing, roughly 20–40% higher per-token cost than Standard, but still meaningfully cheaper than reserving PTU for a low-volume latency-critical path.

Treat Priority Processing as the natural in-between rung: more predictable than Standard for latency-sensitive production traffic, but without the commitment and capacity-planning effort of PTU.

Putting it together

Segment your workloads by interaction pattern and performance need, then assign the most cost-efficient capacity model to each. A common anti-pattern is over-provisioning expensive real-time capacity for jobs that could run asynchronously. Evaluate whether each use case truly requires sub-second latency, or whether a longer batch window (minutes or hours) is acceptable. Use real-time capacity for customer-facing queries and time-sensitive tasks; use Batch or Priority Processing for everything else, depending on tolerance for latency.

 

                                                                                  [Diagram 1 — Cost Strategy Layering]

2. Security: Compliance, Isolation, and Data Protection

Security in Azure OpenAI begins with deciding where your inference runs and how data is handled. This is often a compliance-driven decision before it is an architectural one.

Deployment scope

  • Global deployments — Maximize model availability and capacity by allowing Azure to route inference across regions. Pro: broad elasticity and access to the latest models. Con: data is not confined to a single geography, which may violate strict data residency requirements. Global also adds slight troubleshooting complexity, since requests can be served from various regions.
  • Data Zone deployments — Constrain inference to a specific zone or political boundary (for example, EU-only or US-only Data Zones). Pro: a compliance-friendly middle ground — data processing stays within a defined region set (for example, entirely within the EU to satisfy GDPR), while retaining more elasticity than a single region. Con: slightly reduced model availability and capacity headroom compared to Global.
  • Regional deployments — Confine inference to one Azure region. Pro: meets the most stringent data sovereignty requirements and can minimize latency for users in that region. Con: limited to the capacity and models available in one region, with no automatic overflow if the region is saturated. New model versions may also roll out to some regions later than others — at the time of writing we have observed delays of roughly 2–6 months for certain releases in specific regions; check Microsoft Learn for the current rollout schedule.

Choosing among these is a regulatory risk-management decision, not just an infrastructure preference. Engage your compliance and data governance teams early to determine the minimum scope of data movement that satisfies requirements. Many teams initially over-constrain this choice out of caution; it is often better to start with a broader deployment (Global or multi-region Data Zone) where permissible, and tighten the scope later if needed. Conversely, if your organization mandates that all data stay in-country, you might go straight to Regional and invest in architectural mitigation for its limitations (capacity planning, multi-region backup plans).

Baseline protections + defense in depth

Regardless of deployment type, Azure OpenAI provides baseline protections: it does not use your prompts or completions to train the underlying models, and all data is encrypted in transit (TLS 1.2+) and at rest (AES-256). Defense in depth is still essential — implement compensating controls at multiple layers:

  • Redact sensitive data from prompts (or prevent it from being entered) at the client or gateway layer. Use Azure API Management policies or custom middleware to strip out PII or secrets before requests reach the model.
  • Apply content filtering to both prompts and responses. Use the built-in content filters and/or Azure AI Content Safety to detect and block sensitive or undesirable input and output.
  • Use strong authentication and role-based access control. Front your Azure OpenAI endpoint with Microsoft Entra ID; scope tokens with least privilege (for example, the Cognitive Services OpenAI User role or managed identity access) instead of distributing master API keys. If a credential is compromised, the blast radius is limited.

Additional best practices

  • Managed Identities — use them for any internal communication between your application and Azure OpenAI (or other Azure services like storage and databases) instead of embedding API keys. This eliminates the risk of leaking secrets and simplifies credential rotation.
  • Private endpoints — enable Azure Private Link to keep traffic between your application and the Azure OpenAI service inside your virtual network and the Azure backbone, off the public internet.
  • Content Safety tooling — integrate Azure AI Content Safety or custom validation functions to scan prompts and completions for policy violations or confidential data. This extra inspection layer can catch issues the base filter misses, and lets you log or modify disallowed content before it reaches the user.

In short, security for GenAI is not just about encryption or API keys — it is about reducing the blast radius of any potential breach or misuse. Confine inference to approved locations, strip sensitive data before it reaches the model, and strictly limit which identities and networks can call your endpoints.

 

 

                                                                                  [Diagram 2 — Data Boundary Visualization]

3. Reliability: Designing for Change, Not Just Stability

Reliability in Azure OpenAI is as much about managing model evolution as it is about traditional uptime. Unlike static services, GenAI models are periodically updated and improved by the provider. New versions are released, older versions are deprecated and eventually retired — so a truly reliable system must plan for these changes just as carefully as it plans for hardware failures.

Model lifecycle

At the time of writing, Generally Available (GA) models are typically supported for at least 12 months after release, followed by a deprecation phase of roughly 6 months before retirement. Always confirm the current support windows on Microsoft Learn before locking in a design — these timelines have shifted in the past and may shift again as new model families ship.

When retirement hits:

  • Standard deployments still pinned to a retired model (with "No Auto-Upgrade" set) stop responding to requests entirely — the API typically returns HTTP 404 (or a similar error) for that model name.
  • Provisioned deployments using a retired model return HTTP 410 (Gone) errors until you manually switch them to a supported model.

In short, every model version you deploy has a built-in expiration date. Good reliability planning means never being caught unprepared by a model retirement.

Auto-upgrade modes for Standard deployments

Three modes are available:

  • Auto-upgrade to the latest version — the deployment moves to the new default model version as soon as Azure makes it available. Always on a supported version, but you have no control over timing. Generally not recommended for mission-critical production workloads, since new versions can have different behavior.
  • Upgrade only on retirement — the deployment stays on its current version until that version is about to be retired, then automatically switches to the latest. Recommended for most production Standard deployments: stability during the model's supported lifespan, with continuity guaranteed at retirement. You still need to test and adjust to the new version, but at least you do not face an outage if you miss the date.
  • No auto-upgrade — the deployment stays pinned to a specific version unless you change it manually. Not recommended for production: it puts the entire burden on you to track retirement timelines.

Most teams choose option 2 ("upgrade on retirement") for Standard. It allows controlled change during the model's supported period and provides a safety net at retirement. Proactively evaluate new versions for quality, performance, and cost before the forced swap, but the setting greatly reduces the risk of surprise outages.

Provisioned (PTU) migrations

Provisioned deployments do not support auto-upgrade — you must manage these migrations yourself. Azure sends retirement announcements via Azure Service Health alerts and emails, at the time of writing typically 60 days or more in advance. Have a runbook ready. Two approaches are common:

  • In-place migration — upgrade the deployment's model version through the portal or CLI. The endpoint stays the same and the model is updated behind it. Fast, no new connection string, but expect a brief disruption during the switch and rollback is not straightforward (you may need to contact support to re-enable the old version, if at all possible).
  • Side-by-side (blue/green) — create a new deployment with the new model version in parallel. Gradually shift traffic (for example, 10% via APIM routing rules), monitor, and roll back instantly if needed. Maximum control and safety, at the cost of running two deployments in parallel for the migration window.

Before any model migration, verify you have sufficient PTU quota in the region for the new model. More advanced models may require more throughput units for the same workload than smaller predecessors — at the time of writing, plan for the possibility that a new generation needs roughly two times (or more) the PTUs to deliver similar throughput. Request quota increases before you hit the upgrade window, not during it.

Multi-region strategy

Consider a multi-region strategy to improve reliability during model rollouts and deprecations. New model versions do not always appear in all regions simultaneously — at the time of writing, Microsoft often launches a model in one region (frequently East US or West Europe) first.

  • Maintain a secondary deployment in a "first-wave" region to evaluate new versions early.
  • Use a traffic manager (Azure Front Door, Traffic Manager) to fail over to a region where the model is still available if your primary region lags behind.
  • Multi-region active-active designs also protect you against single-region outages.

In essence, reliability for GenAI means designing for change. A highly reliable platform is not one that never changes; it is one that changes gracefully. Model upgrades, deprecations, and capacity adjustments should be routine, well-rehearsed events — not fire drills. Achieving this typically requires automation for detecting and applying updates, redundant deployments or regions for flexibility, and ongoing testing of new models well before your current ones retire.

 

                                                                                      [Diagram 3 — Model Upgrade Strategy]

4. Performance Efficiency: Predictability Over Raw Speed

Performance in GenAI is multi-dimensional. It is not just about raw throughput or the fastest response on an empty system — it is about consistent, predictable latency at scale. Users care that responses are reliably snappy under load, not just fast in ideal conditions.

Performance profiles by capacity model

  • Standard (shared infrastructure) — multi-tenant, no guaranteed latency SLA. Performance fluctuates with regional demand; you may see throttling (HTTP 429) at peak. Best-effort: great for development, testing, and non-critical workloads, but not a fit for consistent low latency under spikes.
  • Priority Processing — also shared infrastructure, but your requests jump the queue ahead of Standard traffic. At the time of writing, this is the only pay-per-token tier with an SLA on latency. Activated by setting the service_tier attribute to "priority" on each request (other values: "default", "auto"). Requires model versions released on or after 2025-12-01 and is only available on Global Standard or Data Zone Standard (US) deployments. Pricing premium is roughly 20–40% over Standard. The natural fit for latency-sensitive workloads at intermediate scale — better than Standard, without committing to PTU.
  • Provisioned Throughput (dedicated capacity) — reserved capacity with isolation from other tenants. The most consistent performance and the strongest Azure SLA on latency (typically bounded p50 and p99 within your provisioned capacity). If your application has strict response-time requirements or user-facing SLAs and the volume justifies it, PTU is the right answer.

A practical pattern: Standard for early-stage and non-critical scenarios; Priority Processing for latency-sensitive paths that have not yet earned a PTU reservation; PTU for steady, high-volume, latency-critical production traffic.

Model selection and configuration

  • Model size — smaller models generally respond faster than larger ones. Do not automatically pick the biggest model if a smaller one meets your quality bar.
  • max_tokens — capping response length caps worst-case latency and cost. A 500-token cap finishes sooner than 2000 tokens, even when users ask open-ended questions.
  • Sampling parameters — low temperature (more deterministic) and a high top_p can shave a small amount of processing overhead versus highly creative or multi-sample setups. Minor compared to model size and length, but real.
  • Streaming responses — enable streaming wherever possible. The first tokens arrive immediately while the model is still generating; perceived latency drops dramatically even when total time is unchanged.

Treat performance as an explicit design goal. Choose the right capacity model for the job, tune model settings to avoid unnecessary slowdowns, and do not over-engineer with a larger model than needed. A common mistake is defaulting to the biggest model "just in case". Benchmark — a smaller model with good prompt engineering often delivers a fraction of the latency at a fraction of the cost, with negligible quality loss.

5. Operational Excellence: Running GenAI as a Living System

Operational excellence in GenAI means treating your platform as a continuously evolving product. Models change, user behavior shifts, new features ship. Success requires ongoing monitoring, maintenance, and improvement processes — often called GenAIOps (or MLOps for generative AI).

Proactive monitoring

Set up Azure Service Health alerts for your Azure OpenAI / Foundry resource to be notified about service incidents and, importantly, upcoming model deprecations or retirements. At the time of writing, Microsoft typically gives around 60 days of notice for retirement events — but it is easy to miss those notifications if no one is watching. Early awareness lets you test new models and plan migrations calmly instead of reacting at the last minute.

Continuously track key metrics in Azure Monitor or Application Insights, with alerts on anomalies:

  • Latency percentiles — monitor p50, p95, and p99. A trend up in tail latency is an early warning of saturation or regression.
  • Error rates — watch HTTP 429 (throttling) and HTTP 503 (server) error trends. Spikes signal capacity limits or service-side issues.
  • Capacity utilization — for PTU, watch utilization continuously. Sustained operation near 100% means no headroom for bursts. On Standard, watch token usage against subscription limits and quotas.
  • Token consumption trends — track growth over time. Helps with cost forecasting and reveals runaway usage (unexpectedly popular features, looping clients).

Useful alerting practices: alert on p99 latency breaching a threshold, on any sustained increase in 429s, or when PTU utilization regularly exceeds around 80%. Early warning lets you scale up, optimize, adjust prompts, or throttle specific users before user experience suffers.

Evaluation and reproducibility

Use the evaluation tooling in Azure AI Foundry to compare outputs from two models side by side on a fixed set of test prompts. Re-evaluate periodically — slowly degrading quality often goes unnoticed without a structured comparison.

Implement Infrastructure-as-Code (IaC) and GitOps practices for your Azure OpenAI and supporting resources (APIM, storage, key vault, monitoring). Bicep, ARM, or Terraform templates checked into source control make environments reproducible across dev/test/prod, simplify recovery, and enable change tracking. If something breaks, you can roll back to a known-good configuration quickly.

In summary, operational excellence for GenAI is about continuous learning and improvement. Embrace an AI DevOps culture: invest in monitoring, train your team on model changes, keep optimizing prompts and configurations, and refine processes after each lesson learned. The effort pays off by preventing fire-drills and keeping the platform robust as it evolves.

Final Perspective and Key Takeaways

Applying the Well-Architected Framework to Azure OpenAI forces a higher level of architectural rigor — exactly what GenAI projects in production need. Each pillar drives concrete decisions.

Key takeaways from Part 2:

  • Cost Optimization — align capacity to workload patterns. Reserve for steady, predictable load; use Batch for offline jobs; use Priority Processing for latency-sensitive paths that do not yet justify PTU; do not pay for ultra-low latency you do not need.
  • Security — match deployment scope (Global, Data Zone, Regional) to compliance requirements, then layer controls (network isolation, identity and access, data sanitization, content filtering) to minimize blast radius.
  • Reliability — anticipate continuous model evolution. Use upgrade-on-retirement for Standard, run parallel deployments for PTU migrations, and design for multi-region failover where it matters. Reliability is about avoiding surprises, not just outages.
  • Performance Efficiency — choose the right capacity model, right-size models and responses, and use streaming. A smaller model with good prompt engineering often beats a bigger one on user experience.
  • Operational Excellence — treat the platform as a living product. Monitor, alert, automate, evaluate, and version everything as code. The discipline keeps the platform improving instead of decaying.

The organizations that succeed with Azure OpenAI / Microsoft Foundry Models are those that treat capacity planning, security and compliance, model lifecycle management, and governance as first-class design concerns — not afterthoughts. Generative AI architecture is not about deploying a model and walking away; it is about building a resilient, adaptable platform that gracefully evolves as models change and usage grows.

In Part 3, we bring everything together into a comprehensive reference architecture for an enterprise-grade Azure OpenAI platform — combining scalable capacity strategies, layered security and governance, proactive lifecycle (GenAIOps) practices, and multi-region resiliency into a cohesive blueprint ready for production.

 

                                                                                                [Diagram 4 — WAF pillars summary]

 

WAF Decision Matrix : Quick Reference

Use this as a checklist when reviewing or sign-off-ing an Azure OpenAI / Microsoft Foundry Models design. One row per decision; one rule of thumb per row.

Pillar

Decision

Rule of thumb

Watch out for

 

Capacity tier mix

Variable load → Standard. Latency-critical → Priority Processing. Offline bulk → Batch. Steady high-volume → PTU.

Single-tier platforms over-pay for elasticity or under-deliver on latency.

Cost

Reservation term (PTU)

1- or 3-year terms for predictable workloads; size for peak, not for average.

Dynamic resizing of PTU to chase savings; capacity not guaranteed at scale-up time.

Cost

Priority Processing eligibility

Requires service_tier="priority", model 2025-12-01+, Global Standard or Data Zone Standard (US).

Assuming it works on every region/model — confirm eligibility before committing the design.

Security

Deployment scope

No residency rule → Global. Multi-region zone OK → Data Zone. Strict residency → Regional.

Over-constraining out of caution; or under-constraining and missing a compliance requirement.

Security

Identity and access

Microsoft Entra ID + Managed Identity + scoped roles. No master keys in apps.

Long-lived API keys distributed across teams.

Security

Network and data

Private Link for in-network traffic; PII redaction at the gateway; content filtering on prompts and responses.

Public endpoints, raw PII in prompts, only relying on the built-in filter.

Reliability

Auto-upgrade strategy

Standard → "upgrade on retirement". PTU → planned blue/green migration with sufficient quota in advance.

Pinning Standard with no auto-upgrade and forgetting; in-place PTU migration with no rollback path.

Reliability

Multi-region

Active-active (or first-wave secondary) for critical paths; traffic manager for failover.

Single-region production with no plan for capacity or model-rollout lag.

Performance

Capacity match

Match capacity tier to latency target: Standard for non-critical; Priority for latency-sensitive; PTU for SLA-bound.

Expecting Standard to deliver consistent low latency under spike load.

Performance

Model and response sizing

Pick the smallest model that meets quality. Cap max_tokens. Stream responses.

Defaulting to the largest model "just in case"; long uncapped responses; no streaming.

Operations

Monitoring and alerting

Track p50/p95/p99, 429/503 rates, PTU utilization, and token trends. Alert on tail latency and sustained throttling.

Average-only dashboards; missed Service Health notifications for model retirements.

Operations

IaC and GitOps

Bicep/ARM/Terraform under source control; reproducible dev/test/prod; pipeline-driven changes.

Click-ops in the portal; environment drift between dev and prod.

Disclaimer

I am a Microsoft employee. The views and opinions expressed in this article are my own and do not necessarily reflect those of Microsoft. This content is informational and educational; it is not an official Microsoft statement, recommendation, or commitment. Service tiers, model availability, pricing, SLAs, and feature eligibility evolve — always validate against the latest Microsoft Learn documentation before making architectural decisions.

References

Read the whole story
alvinashcraft
27 minutes ago
reply
Pennsylvania, USA
Share this story
Delete

Azure OpenAI Architecture: The Decisions That Actually Matter (Part 1)

1 Share

Generative AI demos often succeed because they hide the hard parts of architecture. They usually run under ideal conditions: low, steady traffic, no sudden bursts, no competing teams, and minimal regulatory scrutiny. In production, however, Azure OpenAI systems face a very different reality – variable loads, service quotas, compliance constraints, evolving model versions, and the need for cost visibility.

The difference between a great demo and a resilient production platform isn’t the model itself – it’s the early architectural decisions. The choices you make from day one determine whether your generative AI solution can handle real-world demand or buckle under pressure.

Who is this series for?

  • Cloud and Solution Architects
  • Platform and product owners
  • Senior developers responsible for operating Azure OpenAI workloads in production

What you’ll learn in Part 1: We’ll walk through five foundational design decisions for Azure OpenAI, explain why they matter, and highlight key trade-offs and pitfalls we’ve seen in real-world deployments:

  • Capacity Model – Choosing between Standard (PAYGO), Priority Processing, Batch, or Provisioned Throughput (PTU)
  • Deployment Location – Global vs. Data Zone vs. Regional hosting
  • Governance Layer – When and why to introduce a GenAI gateway
  • Grounding Strategy – When to use Retrieval-Augmented Generation (RAG)
  • Quota Engineering – How to plan for service limits and avoid throttling

In Part 2, we’ll translate these principles into concrete implementations: multi-region topologies, cost allocation strategies, observability and monitoring patterns, and other best practices for reliability, security, and DevOps in Azure OpenAI. Part 3 will connect these decisions to GenAIOps best practices to help ensure your solution is future-proof.

We’ve also included a summary decision matrix at the end of this post for quick reference.

1. Capacity Model: PAYGO, Priority, Batch, or Dedicated Throughput?

At the time of writing, Azure OpenAI (now part of Microsoft Foundry Models) offers four capacity models for hosting models, each with distinct cost, latency, and operational characteristics. Most production solutions combine two or more of these tiers:

  • Standard (pay-as-you-go, shared): Multi-tenant, elastic capacity. You pay per token, with no upfront commitment and no cost when idle.
  • Pros: Simple and flexible; ideal for dev/test and for moderate or unpredictable traffic patterns.
  • Cons: No guaranteed throughput or low-latency SLA – performance may vary with regional load. Under heavy usage, you may see high latency or HTTP 429 “Too Many Requests” errors due to shared capacity limits.
  • Priority Processing (pay-per-token, SLA-backed low latency): A pay-as-you-go service tier that routes traffic through reserved compute, giving consistent low latency for business-critical, user-facing workloads without buying PTUs. At the time of writing, it is available on Global Standard and Data Zone Standard (US) deployments and requires recent model versions (2025-12-01 or later). It can be enabled per deployment in Microsoft Foundry, or set on individual API calls via the optional service_tier attribute (auto / default / priority) on the chat completions and responses APIs. Always confirm current model and region eligibility in the Microsoft Learn article “Enable priority processing for Microsoft Foundry Models.”
  • Pros: Predictable, low-latency responses with the simplicity of pay-per-token billing – a strong fit for bursty, business-hour, or latency-sensitive traffic where PTU commitment isn’t justified. Uses the same Standard quota pool, and can be layered on top of PTU for steady-state capacity plus an elastic priority lane for spikes.
  • Cons: Per-token pricing is higher than Standard PAYGO. Requires eligible deployment types and current model versions, so it is not a drop-in for legacy deployments. Like all PAYGO modes, it is still subject to TPM/RPM limits and the same throttling behaviour if quota is exhausted.
  • Batch (asynchronous): Offline job processing. You submit requests in bulk (e.g., via a file) and receive results after up to 24 hours.
  • Pros: Optimized for high throughput at a much lower cost (roughly 50% less per token than real-time requests at the time of writing). Batch jobs use separate “enqueued tokens” quotas, so they won’t interfere with interactive traffic.
  • Cons: Not suitable for real-time use – no immediate responses or latency guarantees. Requires extra orchestration (staging requests, handling outputs). At the time of writing, Batch does not support embedding models, so vector indexing jobs must use Standard mode – always check the Azure OpenAI documentation for the latest supported model list.
  • Provisioned Throughput (dedicated PTUs): Reserved, dedicated capacity. You purchase a fixed amount of capacity (Provisioned Throughput Units) and pay for it hourly, whether used or not.
  • Pros: Guaranteed throughput and consistent low latency, since you’re isolated from other tenants; suitable for high-volume services with strict SLAs.
  • Cons: Requires careful sizing of PTUs to match your peak demand – under-provision and you’ll still get 429s (now self-inflicted), over-provision and you pay for unused capacity. In addition, output tokens count more heavily against PTU usage (e.g., generative tokens from GPT-4 consume multiple capacity units each), so planning must account for both prompt and completion length.

To right-size your PTU, you’ll need an estimate of the following metrics:

  • Requests per minute
  • Average tokens per request
  • Peak concurrency
  • Prompt + completion token size

By plugging these into the Azure OpenAI PTU Calculator (linked in the References section), you can get a first estimate of the size of purchase you need to make based on your consumption.

Most production solutions use a hybrid approach: for example, Provisioned capacity for steady, critical real-time traffic, Priority Processing for latency-sensitive bursts that exceed PTU headroom, Standard for overflow or early-stage apps, and Batch for large-scale offline processing. As a rule of thumb: if a user is waiting for a response, use a real-time endpoint (Standard, Priority, or PTU). Use PTU when you need strict latency consistency at high, predictable volume; use Priority Processing when you need SLA-backed low latency without committing to PTUs; use Standard for everything else interactive. If a task can be handled asynchronously, offload it to Batch to reduce cost and keep interactive systems responsive.

Pitfall – Sizing for average load instead of peak burst. One company provisioned capacity only for typical throughput and was overwhelmed when traffic spiked ~3× beyond normal. They maxed out their PTU allocation, triggering a flood of 429 errors. Lesson: model your peak tokens-per-minute (TPM) and requests-per-minute (RPM), not just the average, and add a safety margin to avoid unexpected throttling.

Insight – Separate real-time and background workloads. An initial version of a news analytics bot processed all articles on demand, leading to slow, costly responses. The team later moved heavy processing to Batch jobs (pre-computing article embeddings and summaries), cutting end-user latency by 80% and halving costs.

 

                                                                                  [Diagram 1 – Capacity Model and Deployment]

2. Deployment Location: Where Does Inference Run?

After choosing the capacity model, decide where your Azure OpenAI instance is hosted. This affects latency, scalability, compliance, and model availability. Azure provides three options:

  • Global – Your endpoint isn’t tied to a specific region.
  • Pros: Maximum elasticity and often the best performance stability, since Azure can route traffic to any available regional capacity. You also usually get access to new model releases first on global endpoints.
  • Cons: Data is processed across multiple regions (may violate strict data residency needs). Troubleshooting can be more complex when calls are served from various locations.
  • Data Zone – Inference is restricted to a defined geography (e.g., all Azure EU regions).
  • Pros: Ensures data stays within a specific political boundary for compliance (e.g., GDPR) while retaining some elasticity across multiple regional datacenters in that zone.
  • Cons: Smaller capacity pool than Global, and possibly a slight delay in getting certain new model versions compared to global rollout. A good balance if you require geographic control without completely sacrificing scalability.
  • Regional – Inference runs in a single Azure region that you choose.
  • Pros: Strict data residency and potentially minimal latency if your users are near that region.
  • Cons: No ability to burst to other regions – you are limited to one datacenter’s capacity. If that region faces high load or an outage, your service is impacted. Some model versions or features may also take longer to become available in a given region than on Global. For current model and region availability, refer to the Azure AI Foundry – Model Deployment Types documentation in the References section.

Pitfall – Over-constraining location without need. Some teams unnecessarily default to a narrow deployment. For instance, a company chose a local Regional deployment out of habit, then discovered the Azure OpenAI model they needed wasn’t available in that region for several months, forcing a last-minute migration to a broader Data Zone. Lesson: unless you have a clear compliance or latency requirement, start with a less restrictive option (Global or multi-region Data Zone) to avoid capacity or availability issues.

Example: One enterprise began with a Global deployment for performance and simplicity, but later had to move to an EU Data Zone to meet GDPR rules, trading some elasticity for compliance. Conversely, a team that started with a single-region setup ran into scaling limits and delayed feature rollouts; they eventually reconfigured to a Data Zone to tap into a larger resource pool.

3. Governance Layer: When a GenAI Gateway Is Needed

As Azure OpenAI usage scales to multiple applications or teams, direct API calls from each app become hard to manage. A central API gateway (such as an Azure API Management instance in front of the OpenAI endpoints) is recommended to enforce enterprise policies and provide a single point of oversight. A gateway enables:

  • Central Authentication & RBAC: Use Microsoft Entra ID for authentication instead of distributing API keys, and enforce role-based access so each app or team only accesses allowed resources.
  • Usage Quotas & Throttling: Allocate token or request quotas per application or client. This prevents one service from monopolizing the OpenAI service and can smooth out bursts by applying backpressure (e.g., returning 429s or queueing requests) before Azure OpenAI’s own limits are exceeded.
  • Intelligent Routing: Direct traffic flexibly – route most requests to a primary model deployment, send a fraction to a new model version (canary), or fail over to a secondary region or the Standard tier if the primary is constrained.
  • Unified Monitoring & Cost Management: Log all requests in one place. This gives you a clear view of consumption by team or feature, helps with debugging, and supports internal charge-back or cost governance.

A gateway doesn’t make the model faster or more scalable by itself – it’s about control, security, and manageability, not raw performance. That said, for any multi-team or multi-application scenario, a gateway quickly becomes essential to avoid “shadow AI” deployments and chaotic usage patterns.

When to add: Introduce a GenAI gateway once more than one application or team is using the service, or whenever you need to enforce cross-cutting policies. Implementing it early can save a lot of headaches compared to retrofitting it later.

Example: An e-commerce company initially allowed several departments to call the Azure OpenAI API directly. Soon, they had no clear visibility into who was using how many tokens, and costs spiked unexpectedly. They deployed an APIM gateway to require proper authentication, impose per-app quotas, and log usage metrics. The result was rapid identification of the top token-consuming app (preventing it from starving others) and much better cost control.

 

                                                               [Diagram 2 – GenAI Gateway Functionalities and deployment Location sprectrum]

4. Grounding Strategy: When to Use RAG (Retrieval-Augmented Generation)

Many enterprise use cases demand that the model’s answers include specific internal knowledge or citations. Retrieval-Augmented Generation (RAG) is the solution when your AI needs to ground its responses in external data. RAG works by retrieving relevant content from your own data sources and providing it to the model in the prompt:

  • Document Indexing: First, collect your reference documents (files, knowledge bases, etc.) and break them into chunks, optionally adding metadata (titles, tags). Store these in a vector index or search database after transforming each chunk into an embedding vector.
  • Relevant Retrieval: For each user query, create an embedding of the question and retrieve the top-matching document chunks from the index via similarity search.
  • Augmented Prompt: Prepend or append the retrieved text snippets to the model’s prompt (often along with instructions to use them for reference).
  • LLM Response: The model (e.g., GPT-4) processes the augmented prompt and generates an answer that incorporates the provided reference information (often with source citations if required).

By injecting enterprise data at prompt time, RAG can significantly reduce hallucinations and increase the factual accuracy of outputs. Users get answers that reflect real data you’ve provided, rather than just the model’s training data.

However, RAG adds complexity, latency, and cost. You must maintain additional infrastructure (embedding computation and a vector store or search index). Each query now has extra steps, typically adding 200–500 ms to response time. The vector database and compute for embeddings also incur costs – industry estimates often put a full RAG pipeline at 3–5× the cost of using the base model alone, especially at scale. You’ll need a strategy for keeping your index updated as source data changes, and robust handling for cases where no relevant data is found.

When to use RAG: Use RAG if your model must reliably incorporate proprietary, dynamic, or highly specific information that isn’t part of its training data, or when you need to provide source references for answers. If your scenario is more open-ended or doesn’t require up-to-date factual grounding, you can often skip RAG to keep the system simpler and faster.

Pitfall: Some teams adopt RAG by default, which can slow development and complicate the system unnecessarily. It’s often better to start with a simpler approach and add RAG later if you find the model’s answers need external support.

Examples: One consulting firm added RAG to their internal Q&A bot to leverage proprietary research. Answers became more accurate, but query latency jumped to ~5 seconds due to the retrieval overhead, forcing them to optimize their embeddings pipeline and caching. Conversely, a health company launched a chatbot without RAG and discovered it gave incorrect medical answers because it couldn’t reference the latest policy documents – a failure that a RAG approach could have mitigated.

 

 

                                                                              [Diagram 3 – RAG High Level Flow and Anatomy]

5. Quota Engineering: Avoiding Bottlenecks and Throttling

The Azure OpenAI Service imposes quota limits to protect the system. If you don’t plan for these, they can become points of failure in production. Key limits include:

  • Tokens per Minute (TPM): Maximum tokens (input + output) your deployment can process per minute (your primary throughput cap).
  • Requests per Minute (RPM): Maximum number of API calls per minute.
  • Concurrent Requests: Maximum number of requests processed simultaneously.
  • Model-specific limits: Certain model types have their own constraints (e.g., the maximum request rate for GPT-4 may be lower than for GPT-3.5 due to higher computational load).

If you exceed these limits, Azure OpenAI will return errors – usually HTTP 429 (Too Many Requests) for quota exhaustion or 503 (Service Unavailable) if the service is stressed. In other words, hitting a quota isn’t a theoretical worry; it will result in rejected requests once you cross the threshold.

Quota Tiers and Deployment Types

Azure OpenAI uses a tiered quota system where limits depend on your subscription’s access level. Specific numeric quotas change frequently, so always verify against the Azure OpenAI Quota Guide (linked in the References section).

  • Tier 1 (Default): Standard quota allocations suitable for development and moderate production workloads. At the time of writing, GPT-4 deployments in Tier 1 commonly start at quotas in the low tens of thousands of TPM and around a thousand RPM, but exact values vary by region and model – always check the current Azure OpenAI Quota Guide for live numbers.
  • Tier 2 and Above: Higher quotas available through approval processes, typically for enterprise customers with demonstrated high-volume needs. These tiers can provide significantly more capacity than Tier 1; consult the Quota Guide for current multipliers and approval paths.

Standard (pay-as-you-go) deployments share regional quota pools and are subject to TPM/RPM limits that can vary by region and model. Provisioned Throughput (PTU) deployments operate differently – you purchase dedicated capacity measured in PTUs, and your throughput is determined by your PTU allocation rather than by TPM/RPM limits. PTUs still have implicit rate limits based on the processing capacity of your purchased units.

The Batch API uses a separate quota system with “enqueued tokens” limits, allowing much higher total throughput (often millions of tokens per day) but without real-time guarantees.

Best practices to manage quotas:

  • Capacity planning: Calculate your peak usage requirements (e.g., max prompt+completion tokens per request × peak requests per minute). Ensure your chosen plan or quota can handle this, or request an increase in advance.
  • Design for bursts: Traffic often comes in waves. Aim to operate well below your limits so you can absorb sudden surges. As a guideline, keep usage under ~70% of your TPM/RPM limits during normal operation, leaving headroom for peaks. If your usage is spiking above 85% regularly at the 95th percentile, it’s time to scale up capacity or optimize usage.
  • Graceful degradation: Implement exponential backoff (with jitter) on the client side when retries are necessary. This prevents a stampede of retries (a “retry storm”) that would otherwise compound the load problem. At the platform level, use queues or token-bucket rate limiters (possibly in your APIM gateway) to smooth bursts.
  • Circuit breakers: Have fallback plans for extreme scenarios. Temporarily disable non-critical features or queue requests when approaching critical limits to prevent a total outage.

Example: A fintech company’s trading chatbot ran fine in testing, but during a market surge their question volume tripled. This breached their tokens-per-minute quota and led to a flood of 429 errors. Worse, their code immediately retried each failed request without delay, intensifying the load and effectively causing a self-inflicted denial-of-service outage. They resolved it by using exponential backoff and partitioning users across multiple deployments.

 

 

                                                                                       [Diagram 4 – Quota Engineering]

 

Final Perspective and Key Takeaways

Ultimately, building a production-grade Azure OpenAI solution is much more about well-structured cloud architecture than about the model itself. An advanced model can underperform in a fragile setup, while even a basic model can excel in a solid architecture.

Key takeaways from Part 1:

  • Plan for peak loads. Design for the worst-case traffic (and add buffer), not the average. If you need strict performance guarantees, invest in dedicated capacity early.
  • Avoid unnecessary constraints. Don’t lock into a restricted deployment unless required by compliance or latency – new models and extra capacity reach global deployments first.
  • Use real-time vs. batch wisely. Real-time endpoints (Standard, Priority, or PTU) should be reserved for interactive, user-facing tasks; move large or non-urgent jobs to Batch for roughly half the cost per token (at the time of writing).
  • Pick the right real-time tier. Use Priority Processing when you need SLA-backed low latency without committing to PTUs, PTU for high, predictable volumes, and Standard for everything else interactive.
  • Implement a gateway for scale. If you have multiple applications or teams, use an API Management gateway for authentication, rate limiting, logging, and multi-region routing.
  • Adopt RAG only if needed. Don’t introduce a retrieval-augmented generation layer unless your application truly demands external data or source citations.
  • Engineer for limits and failure. Treat rate limits and error handling as fundamental design criteria. Build in monitoring, backoff, and fallback mechanisms so the system degrades gracefully.

In short, succeeding with Azure OpenAI in production means treating it as a full-stack architecture challenge rather than just an API integration. By proactively addressing scalability, deployment, governance, data grounding, and quotas, you can turn a promising demo into a stable, cost-efficient, and compliant AI platform.

In the next part, we’ll explore how to put these principles into practice – including multi-region architectures, cost-sharing strategies for teams, advanced monitoring/logging setups, and other patterns for making Azure OpenAI a robust enterprise service.

Decision Matrix: Quick Reference

Use this matrix as a fast first cut on the five Part 1 decisions. It is not a substitute for a full design review, but it captures the trade-offs most teams need to evaluate up front.

Decision

Choose this when…

Avoid when…

Primary risk

Capacity – Standard (PAYGO)

Dev/test, unpredictable or bursty traffic, MVPs, overflow

You need guaranteed latency or strict SLAs

429 throttling under shared load

Capacity – Priority Processing

Latency-sensitive, business-critical real-time traffic where you don’t want PTU commitment; burst lane on top of PTU

Deployment type / model version isn’t eligible; cost-sensitive, low-priority workloads

Higher per-token cost than Standard; still PAYGO quota-bound

Capacity – Batch

Async jobs, embeddings refresh, large offline summarization

User is waiting; or model is unsupported (e.g., embeddings, at the time of writing)

Up to 24h turnaround; orchestration overhead

Capacity – Provisioned (PTU)

High-volume real-time workloads with strict SLAs and predictable load

Demand is low or highly variable

Over-/under-provisioning costs

Deployment – Global

You want widest model availability and best elasticity

Strict data residency or regulatory constraints

Less control over where data is processed

Deployment – Data Zone

Geographic compliance (e.g., EU/US) with some elasticity

Single-region residency is mandated

Some lag on newest model versions

Deployment – Regional

Strict data residency or co-located low-latency users

You need to scale beyond one region’s capacity

Capacity ceilings and slower model rollout

Governance – GenAI Gateway

Multiple apps/teams, need RBAC, quotas, routing, central logging

Single small app where overhead exceeds benefit

Adds latency and another component to operate

Grounding – RAG

Need proprietary, dynamic, or cited answers

Open-ended creative tasks where freshness isn’t required

Latency, cost, and index freshness drift

Quota – Plan & Tier Up Early

You’re close to TPM/RPM ceilings or expecting growth

Your peak forecast is well below current quota

Last-minute throttling and outages

Tip: If you can only optimize for one decision in Part 1, start with capacity model and quota engineering – they are the two most common sources of production incidents we see in real deployments. Pairing PTUs (or Priority Processing) for steady, latency-sensitive traffic with Standard PAYGO for overflow is a pattern that consistently delivers both reliability and cost control.

Disclaimer

The views expressed in this article are those of the author and do not necessarily reflect the official policy or position of Microsoft. The author is a Microsoft employee.

References

 

Read the whole story
alvinashcraft
27 minutes ago
reply
Pennsylvania, USA
Share this story
Delete

Cordova Plugin InAppBrowser 6.0.1 With Fix For CVE-2026-47430 Released!

1 Share

We are happy to announce that we have just released an update to cordova-plugin-inappbrowser!

Release Highlights

This is a small patch release that addresses the recently published vulnerability CVE-2026-47430: Cordova Plugin InAppBrowser: iOS: Arbitrary Cordova callback IDs can be dispatched without validation from InAppBrowser WebViews. Full details:

Severity: important

Affected versions:

  • Cordova Plugin InAppBrowser (cordova-plugin-inappbrowser) 3.1.0 through 6.0.0

Description:

Summary

The iOS implementation of cordova-plugin-inappbrowser passes the id field from a WKScriptMessage body to commandDelegate sendPluginResult:callbackId: with no format validation (CDVWKInAppBrowser.m:560–574). Any web content loaded inside the InAppBrowser can fire any pending Cordova callback in the host app by posting a message whose id field is a guessable or enumerated callback identifier. An attack abusing this weakness must be tailored to the specific plugins and callback IDs the host app uses. Though an attacker with knowledge of common Cordova plugin configurations could craft reusable payloads targeting widely-adopted plugins.

Impact

An unauthenticated remote attacker who controls content displayed in the InAppBrowser — via a URL the app opens (OAuth redirect, marketing link, deep-link target) or a network interception — can call window.webkit.messageHandlers.cordova_iab.postMessage({id: '<victim-callback-id>', d: '...'}) to fire callbacks belonging to any other installed Cordova plugin (Camera, Contacts, File, Geolocation). Cordova callback IDs follow the predictable format <PluginName><sequential-integer>, making enumeration feasible. Successful exploitation allows the attacker to spoof plugin results across trust boundaries — for example, injecting a forged camera approval, a fabricated contacts list, or a crafted file-read response.

This issue affects Cordova Plugin InAppBrowser: from 3.1.0 through 6.0.0.

Users are recommended to upgrade to version 6.0.1, which fixes the issue.

References:

https://www.cve.org/CVERecord?id=CVE-2026-47430

Please report any issues you find on GitHub!

Changes include:

  • GH-1152 fix(ios): check callbackId with regex
  • GH-1095 chore: gh-action workflow, license header formatting & cleanups
  • Fix npm audit issues
Read the whole story
alvinashcraft
27 minutes ago
reply
Pennsylvania, USA
Share this story
Delete

Customize Your Claude Code Status Line to Manage Token Burn

1 Share

Customize the Claude Code status line to track context window usage, session limits, and weekly quotas so you stop burning tokens you don’t need.

Read the full article (24 minutes reading time): Customize Your Claude Code Status Line to Manage Token Burn.

Read the whole story
alvinashcraft
27 minutes ago
reply
Pennsylvania, USA
Share this story
Delete
Next Page of Stories