On the evening of June 1 at San Francisco’s Bartlett Hall, Microsoft CTO Kevin Scott spoke at a joint event with Lectures on Tap, attended by approximately 150 developers, founders, media, and tech industry leaders. His talk focused on what he described as the growing perceptual disconnect between AI capability and AI reality: the tendency to mistake rapid advances in model performance for equally rapid progress in deployment, organizational transformation, trust, and real-world value creation.
Scott argued that the AI industry is at an inflection point where technical breakthroughs are arriving faster than institutions, workflows, and human systems can absorb them. While acknowledging the extraordinary pace of progress in areas like software development and agentic systems, he emphasized that the difficult challenge ahead is operationalizing these capabilities responsibly and meaningfully at scale.
Here are his five observations of the ways in which AI reality is diverging from apparent capability.
1. Capability ≠ deployment
According to Scott, one of the biggest mistakes people are making right now is confusing technical capability with real-world deployment. The fact that a model can do something impressive doesn’t mean the surrounding systems, economics, governance, and human behaviors are ready to absorb it at scale.
>We just shouldn’t have uniform faith that, as AI model capabilities improve, we’re going to get this crazy fast deployment everywhere.
“Today’s AI models are actually more capable than the things we’re using them for in the real world,” he said, addressing today’s “capability overhang,” as he has dubbed it. “We just shouldn’t have uniform faith that, as AI model capabilities improve, we’re going to get this crazy fast deployment everywhere.”
2. Closed feedback loops ≠ universal progress
Scott explained that some areas of AI (like agentic software development) are improving extraordinarily quickly because tight feedback loops allow those systems to iterate, evaluate, and refine outputs at high speed. But that dynamic doesn’t automatically extend to domains constrained by physical systems, regulation, or long experimental cycles.
>Tight feedback loops don’t automatically extend to domains constrained by physical systems.
“One of the things models can already do is postulate new ideas for particle physics experiments,” he said. “And the problem with particle physics experiments is that they take a lot of expert technical labor to set up and run, and they require the use of extremely expensive infrastructure. So there really isn’t a convenient way—other than publications in the scientific literature—to get the output of those experiments and feed it back into an actual model.”
3. Software velocity ≠ organizational velocity
AI is dramatically accelerating software development production, but that doesn’t mean organizations can suddenly move faster. In many cases, speeding up code generation simply exposes the slower-moving bottlenecks that were already present: deployment, integration, governance, and organizational change.
“I build a lot of prototypes that are greenfield, where I have no constraints whatsoever,” noted Scott. “I just get an idea and there’s nothing stopping me from using an agentic coding system to produce a brand-new thing. But in many cases, the things we want to produce are fairly highly constrained.”
>When things are moving this fast, it’s hard for people to notice the change and snap to.
Scott pointed to last-mile problems, the need for a lot of plumbing work, and human psychology as throttling issues. He also acknowledged the forecasting problem we will inevitably face as things move exponentially faster: “A lot of the stuff I’m doing right now using agentic coding to build things wasn’t even possible in November of last year,” he said. “When things are moving this fast, it’s hard for people to notice the change and snap to.”
4. Activity ≠ value
“Just because you’re using AI to create a lot of activity doesn’t necessarily mean that the activity you’re creating is valuable,” Scott said.
The ability to generate enormous amounts of output doesn’t guarantee meaningful impact. As AI lowers the cost of creation, the defining question shifts from “How much can we produce?” to “What is actually worth building?”
>We have to pay close attention to how we measure value.
“We can have a lot of output, we can build more complex things than we built before,” added Scott. “That doesn’t necessarily mean that the things we’re building are super valuable. When they go into a user’s hand, are they solving a real problem? As developers, we have to pay especially close attention to how we measure value and to the feedback we get on the work that we’re doing.”
5. Autonomy ≠ trust
AI systems are becoming increasingly capable of operating autonomously, but autonomy alone does not create trust. Real-world deployment still requires governance, identity, access control, transparency, and meaningful human oversight.
>That’s a new way of thinking about software.
“You’re always going to have human oversight, so this notion of autonomy is a little bit of a pipe dream,” Scott noted. “You have to build systems doing complex things in a way where people can trust that they’re doing them correctly and in a way that’s aligned with their interests and values. And that’s a new way of thinking about software.”
Bridging the gap between capability and reality
Ultimately, said Scott, “There’s a lot of work for all of us to do over the next months and years to fully unlock the potential of this crazy tool that we’ve built collectively. These problems that I enumerated don’t go away just as a function of scaling up an AI model. There is no silver bullet. That means there’s a bunch of technical work to be done, a bunch of societal work, a bunch of organizational work, and just dealing with legacy systems and plumbing.”
AI capability gains will continue, but turning those gains into trusted systems that create meaningful, durable value is the harder and more important work. And that work starts today.
“We need to engage more intensely than ever before,” said Scott, “because we can see the promise of this technology to benefit the world if we’re able to overcome these obstacles.”
What changes when agents become both a new unit of programming and an emerging new unit of human-to-machine interaction? The mission of Project Solara, a new software platform coupled with tailored hardware solutions, is to pioneer agent-first experiences that are shaped around you: your agents, your tasks, your environment, under your control. So, what’s different this time from previous generations of computers? Agents and AI accelerate the creation of even more specialized computers without incurring the full cost and tradeoffs that in the past limited the creation, diversity, and specialization of those new forms. We imagine a diverse ecosystem of agent-first devices, from small to large, from fixed to hypermobile, from personal to professional. We’re starting this journey with two concepts designed for the enterprise—and we’re excited to navigate this transformation with you all.
I manage the Applied Sciences Group, an interdisciplinary team that brings together product engineering, research, and the sciences to explore what comes next in computing. The rise of agents is changing not only how software is built, but how people interact with computers—and ultimately, what new kinds of computers may become possible. We are excited to give you an early look at where we believe computing is headed, and what the next computer may look like.
The next computer
When we think of a computer, we tend to picture something familiar: a laptop, a phone, maybe a tablet. But computing has never really stood still. It keeps moving closer to us, closer to the work, closer to the moment where it can provide the most value.
Mainframes did not disappear when PCs arrived. PCs did not disappear when phones arrived. Phones did not disappear when watches arrived. Each new form became more specialized, closer to you, closer to the solution you need. Each one found a new place in our lives because it was better suited to a specific context, a specific task, or a specific moment. So, what’s next?
Agents as the new interaction technology
At Build 2023, I shared my perspective on three emerging AI application structures, shaped by how AI functions relative to your application: Is the AI beside your app, inside it, or outside it?
In the first application structure, the AI is beside your application, like a helper. It keeps the original app architecture and is minimally disruptive to what our customers already know.
In the second application structure, the AI is inside, as part of the main scaffolding; it becomes the main input loop. Here, AI is used to redefine the application’s interaction model and even its purpose. The experience becomes less dependent on point-and-click commands and becomes more automatic. This is where we are seeing the emergence of agents (for example, Researcher and Agent Mode in Office) and AI-first applications.
The third AI application structure is where AI moves from operating within the application frame to operating outside it, globally. Here, AI orchestrates across multiple apps and services, allowing the agent to connect, coordinate, and maintain context across entire workflows, across devices, and even across very different timescales. Current examples include the recent emergence of various claws (like OpenClaw and Lobster), coworker-like agents, and similar systems.
And so here we are today, where agents are a new unit of programming and the new unit of human-to-machine interfaces, changing the way people interact with and use their computers. And as we have seen many times in the past, new interaction technologies enable new types of computers.
New interaction technology enables new types of computers
Every new computer form factor follows the pattern shown above. A jump in processing power, both in the cloud and at the edge, has enabled us to create hyper-complex software (AI), making agents possible. Through these agents, human language and dialog are the new interaction technology. For the first time in our history, we can program, direct, and initiate action with computers the way we talk with each other. This higher mode of interaction enables the computer and us to be less dependent on the traditional way we have interacted with computers via keyboards, screens, or even premediated apps. … And because of these trends, we are seeing a major opportunity toward new types of form factors.
As AI streamlines the traditional development stack, these emerging form factors make it possible to bring agents into places, workflows, and moments that previously were difficult or cumbersome. A more specific and better tool for more specific tasks.
That is the opportunity in front of us: agent-first devices.
Agent-first devices accelerate specialization
Historically, specialization has been expensive. If you wanted to create a new type of computer, you had to build almost everything: hardware, software, services, developer tools, UI patterns, management systems, security models, and an ecosystem. This custom stack has been both a hurdle and a moat for new computer form factors.
Take a look at the diagram above, which illustrates the typical technology stack for a computer. Not just for laptops, but phones, watches, wearables, industrial devices, and so forth. Each layer in that stack represents a major company or even an entire industry. Bringing a new type of computer to market has historically required building out or modifying nearly every layer. This is expensive, difficult, and takes time. But what if it didn’t have to be that way?
AI, and the new agent interaction model, reduces this burden. AI introduces new UI and app model flexibility into those layers. With just-in-time UI (see below), fewer apps need to be written for specific hardware implementations. With agentic coding, less effort needs to be spent refining a developer SDK for human consumption. As agent-only experiences grow to cover more of users’ needs, fewer of the traditional UX surfaces (like app frameworks or even browsers) need to be implemented for the specific hardware. The boundaries between those layers will blur and, in some cases, disappear.
Therefore, agents enable us to create new types of computers that are more specific, more contextual, and closer to where they add value, without rebuilding the entire stack every time. This is the mission of Project Solara.
Introducing Project Solara
To enable this new era, we are introducing a chip-to-cloud platform, codenamed Project Solara, designed from the ground up for agent-first experiences and the new device form factors they enable. Chip-to-cloud sounds funny, I know, but what it really means is that the “operating system” is liminal, transcending the device and the cloud. The system brings a lightweight window to the edge, where the agent manifests and where the state, via Azure, can encompass a constellation of specialized devices.
This is not just about bringing intelligence to the PC, the browser, or the phone. It is about bringing intelligence into the places where people need it most: in the flow of work, in the environment, and closer to the task at hand.
We are building this platform on a simple premise: The next platform shift is from apps to agents—from software you open to intelligence you invoke; from graphical interfaces of buttons to expressing intent through agents; and from AI operating inside your applications to agents working outside and across your apps, workflows, and devices.
This is not just about asking an agent questions. It is about giving people a more direct way to reason over their work, context, tools, and workflows—without navigating every app, notification, or interface layer.
And because we believe the future will not be defined by one agent, Project Solara is designed for an open, multiple-agent world. Organizations will use Microsoft agents where they add value. They will also source or build their own agents for their specific workflows and requirements.
The platform must bring these agents together coherently, while respecting boundaries between data, domains, identities, and organizations. That is why enterprise manageability, identity, security, privacy, and user control are not afterthoughts. They are part of Project Solara’s foundation.
We are also investing in just-in-time UI: the ability for an agent experience to adapt across devices and modalities without requiring developers to redesign everything for every new form factor. Today, that means semi-structured approaches like adaptive cards and known content types. Over time, it moves toward more dynamic and generative interfaces. This is what makes specialized form factors viable.
We are previewing concepts that explore two very broad categories: stationary and portable. Both are multimodal: glanceable access, voice, vision, and getting to the right agent at the right moment. We are also investigating several verticals across healthcare, retail, the financial industry, and more.
Every place where compute can add value becomes an opportunity to help users achieve more. Every workflow, every environment, every role can have a more specific tool. Not devices built around apps, but devices built around agents—that is the promise of Project Solara. It’s a new way to bring intelligence into the moments and places where people need it most.
We are still early. I don’t want to over-promise. But I also don’t want to understate the significance of the shift. When the cost of specialization drops, innovation accelerates.
More details…
Project Solara is specifically designed for the new era of agent-first devices. It establishes hardware and software requirements that will meet enterprise needs for manageability, security, and privacy, while ensuring critical user experiences are delivered.
The cloud is not the only place intelligence lives. The agent sits between user intent and distributed execution. The UI becomes more like an adaptive access layer. The device becomes a window into long-running intelligence and action. A human-scale interface layer between the person and a larger intelligent environment.
Three pillars to the platform:
Enterprise-readiness, with privacy, security, control, and trust
Agent-driven interaction model with just-in-time UI
Extensibility to bring your own agents
Enterprise-readiness, with privacy, security, control, and trust
Seamless access to your agents must be balanced with transparency and control, so enterprise customers, device users, and the people around them can understand and control how these devices are used.
We are building the Project Solara platform to support enterprise-level hardware and software manageability, security, and privacy protections to securely access services such as WorkIQ. Project Solara includes reference designs that are flexible to modify to accelerate building and customization.
Device-side attributes of Project Solara:
Microsoft Device Ecosystem Platform (MDEP) is an enterprise-grade operating system built on AOSP, designed to meet the highest standards of security, reliability, ease of deployment, and innovation—enabling device makers to build and deploy at scale.
Agent Shell that can dynamically load and tailor multiple cloud-based agents.
Microsoft Intune allows IT administrators to manage and secure these devices just like PC and mobile devices today.
Entra ID so users can use their existing Microsoft accounts.
Hello for Business with at least one biometric authentication method, like facial recognition or fingerprint, allowing seamless access to the device.
Easy privacy controls like a physical mic mute button, and clear indicators when listening or recording.
Approved chipsets accompanied with applicable reference designs.
These attributes represent our current thinking and will continue to evolve as we continue to build out the platform.
Agent-driven interaction model with just-in-time UI
These new devices are not meant to run traditional apps. They are designed for agents. That shift gives us more flexibility in the user interface, because the experience can adapt to the device, the screen size, the content, and even the mode of interaction—whether visual, voice, touch, or multimodal.
Every new device form factor has traditionally required its own application model, UI patterns, and optimization work for screen size, resolution, runtime, and input method. That is one reason new device categories are so expensive to build, and why they can struggle without a strong app ecosystem behind them.
AI changes that equation. We are already seeing models generate content, images, and layouts tailored to different contexts. If those capabilities become part of the agent loop, an agent can adapt its visual, voice, or multimodal interface to the device it is running on, without forcing developers to redesign the experience for every form factor. We call this broader capability just-in-time UI.
Just-in-time UI exists on a spectrum defined by how much structure is required to render an experience. On one end is responsive UI: highly structured interfaces that reflow predictably across screen sizes. On the other end is fully generative UI: a future state in which AI can create the interface frame by frame with minimal predefined structure. That future is not here yet, but we can already see early signs of it.
Today, Project Solara is intentionally building for the middle of that spectrum—beyond traditional responsive design, but not dependent on unconstrained generation. That gives agents enough flexibility to adapt their presentation across very different devices while preserving consistency and usability. In practical terms, the same agent can render a custom experience on multiple screen sizes and modalities with little or no additional work from the developer. For us, that is the first proof point: a path to specialized devices without requiring developers to rebuild the experience from scratch each time.
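To make the "middle of the spectrum" idea concrete, here is a minimal sketch of one semi-structured payload being presented differently on three kinds of device. It is an illustration under stated assumptions, not Project Solara's actual API: the `Card` schema, the `DeviceProfile` fields, and the `render` function are all hypothetical names invented for this example.

```python
from dataclasses import dataclass, field


@dataclass
class Card:
    """Semi-structured content an agent emits (hypothetical schema)."""
    title: str
    body: str
    actions: list[str] = field(default_factory=list)


@dataclass
class DeviceProfile:
    """What the target device can present (hypothetical fields)."""
    name: str
    has_display: bool
    max_chars: int = 0  # rough text budget for the screen


def render(card: Card, device: DeviceProfile) -> str:
    """Choose a presentation for the same card based on the device profile."""
    if not device.has_display:
        # Voice-only device: speak the title and offer the first action, if any.
        prompt = f" Say '{card.actions[0]}' to continue." if card.actions else ""
        return f"{card.title}.{prompt}"
    actions = f"\n[{' | '.join(card.actions)}]" if card.actions else ""
    full = f"{card.title}\n{card.body}{actions}"
    if len(full) <= device.max_chars:
        return full
    # Small screen: keep the title and actions, drop the body text.
    return f"{card.title}{actions}"


card = Card("Standup at 10:00", "Room 4, agenda attached", ["Join", "Snooze"])
for device in (
    DeviceProfile("desk", has_display=True, max_chars=200),
    DeviceProfile("badge", has_display=True, max_chars=40),
    DeviceProfile("voice-only", has_display=False),
):
    print(f"--- {device.name} ---")
    print(render(card, device))
```

The agent author writes the card once. The adaptation rules live in the rendering layer, which is the property the text describes: little or no additional developer work per form factor.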
Extensibility to bring your own agents
One of the most important realities of this new era is that there will not be a single dominant agent.
Instead, we are entering a world of many specialized agents, each optimized for different skills (coding, communication, analysis, etc.), datasets and domains, and organizational scopes and requirements. Just as no single app could replace Word, Excel, and PowerPoint, no single agent can meet every need.
This creates a critical challenge: How do you bring multiple agents together into a coherent experience? The most straightforward approach is manually launching agents like launching apps. But soon the user will want more sophistication, more automation, and more coordination. We are working on various software technologies for delegation to specialized agents, like an agent dispatcher and an agent task manager, which can automatically activate or surface agents when needed.
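As a rough illustration of what delegation to specialized agents could look like, the sketch below routes a request to whichever registered agent matches the most keywords and falls back to a general agent otherwise. Everything here is an assumption made for the example: the `Agent` and `Dispatcher` classes, the keyword-overlap scoring, and the agent names are hypothetical and do not describe how Project Solara's dispatcher works.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class Agent:
    """A specialized agent with the keywords it claims (hypothetical)."""
    name: str
    keywords: set[str]
    handle: Callable[[str], str]


class Dispatcher:
    """Routes a request to the best-matching agent, else a fallback."""

    def __init__(self, fallback: Agent) -> None:
        self.agents: list[Agent] = []
        self.fallback = fallback

    def register(self, agent: Agent) -> None:
        self.agents.append(agent)

    def route(self, request: str) -> Agent:
        # Naive scoring: count keyword overlap with the request's words.
        words = set(request.lower().split())
        best = max(self.agents, key=lambda a: len(a.keywords & words), default=None)
        if best is None or not (best.keywords & words):
            return self.fallback
        return best

    def dispatch(self, request: str) -> str:
        agent = self.route(request)
        return f"{agent.name}: {agent.handle(request)}"


dispatcher = Dispatcher(fallback=Agent("general", set(), lambda r: "handled generally"))
dispatcher.register(Agent("coding", {"code", "bug", "build"}, lambda r: "opened a fix task"))
dispatcher.register(Agent("notes", {"meeting", "notes", "record"}, lambda r: "started recording"))

print(dispatcher.dispatch("record this meeting"))
print(dispatcher.dispatch("what is the weather"))
```

A production dispatcher would presumably use a model or richer signals rather than keyword overlap. The point of the sketch is only the shape: a registry, a routing decision, and a fallback.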
Concept reference device designs
We’re developing concept designs to test and pilot the Project Solara platform. These concept devices are not meant to define the limits of the platform, but to show the range of what becomes possible across stationary, portable, wearable, and hyper-mobile experiences.
While these designs may not become the exact shipping experience, they help inform the platform and experience needs to get us started—and show the power of an agent-first interaction model: devices can be shaped around the agent, the environment, and the workflow, instead of forcing every use case into the same general-purpose form.
Silicon partners
MediaTek and Qualcomm are the first silicon partners working with us to deliver solutions to support Project Solara, starting with initial concept designs and expanding to a broad set of form factors in the future.
With Qualcomm, we’ve worked closely on a portable-device concept-reference design. Qualcomm is a leader in silicon for wearables and other new form factors for intelligent devices.
“Microsoft’s Project Solara is an important step in advancing agent-first experiences across a wide range of devices and form factors,” said Dino Bekis, Qualcomm Senior Vice President for Personal and Wearable AI. “With deep experience enabling the majority of today’s wearable experiences and bringing advanced AI to billions of mobile devices, Qualcomm Snapdragon platforms are uniquely optimized for agentic AI—combining high performance with industry-leading power efficiency. We’re proud to partner with Microsoft to help accelerate this next era of intelligent, personalized computing.”
With MediaTek, we’ve worked closely on the development of a stationary device concept design. MediaTek has deep expertise and a breadth of device partners across the IoT ecosystem.
“At MediaTek, we’re bringing intelligence to edge devices with best-in-class silicon,” said Vince Hu, MediaTek Senior Vice President & General Manager, Data Center & Computing. “Microsoft’s Project Solara platform will significantly accelerate the opportunity for agent-first experiences and devices. We look forward to our continued collaboration, building from the first device concept to an extended ecosystem of Project Solara-powered devices.”
Portable reference design: Badge concept device
We’ve reimagined a form factor that information workers, nurses, front-line workers, and millions of others use every day: the access badge. This on-the-go, lightweight, always connected companion empowers each person to do more by having their agents always by their side.
Device capabilities include:
Touchscreen display
Hello for Business fingerprint sensor button, allowing secure access to the device and agent
Privacy switch and volume controls
Far-field high SNR microphone array and speaker
Side-facing camera
WiFi, Bluetooth, GNSS, and 5G wireless connectivity
Qualcomm wearable silicon
With Hello for Business with fingerprint recognition, you are always a touch away from your agents, so you can quickly glance at what’s coming up next with your Priority Agent, or be one tap away from recording an impromptu hallway conversation with Facilitator.
Using the integrated camera, the platform allows agents, with user permission, to better understand and help take action on the environment around them.
In-place reference design: Desk concept device
For our next concept, we thought deeply about where many of us spend a lot of time today already: our desks. Whether your desk space is limited, or you’ve maximized your config with multiple monitors, we’ve designed a humble yet helpful companion providing frictionless access to your agent to help you stay in your flow.
Device capabilities include:
Touchscreen display
Hello for Business with face authentication
Privacy lock buttons
Microphone mute and volume buttons
Dual far-field microphone array and full-range speaker
UWB presence sensor
2 USB-C ports for power and optional external display or peripheral
WiFi and Bluetooth wireless connectivity
MediaTek IoT silicon
Hello for Business provides enterprise-grade protection and frictionless authentication, so you can glance at your calendar, stay on top of only the most critical items through curated PriorityCards, or tap into the ultimate thought partner with Microsoft 365 Copilot voice, grounded in your WorkIQ data.
This desk concept can work stand-alone, serve as a companion to your Windows PC, or even become your cloud PC through Windows 365 when connected to an external display. As a companion, it pairs with your PC via Bluetooth, enabling you to hand off tasks between the devices and keep lock state consistent. Plug in a display via USB-C, and the desk agent device can transform into your Windows 365 client—providing access to both the power of your full Windows 365 experience and the benefit of an agent-first device experience.
Together, the badge and desk concept devices show what becomes possible when agents are no longer confined to one app, one screen, or one device. They show how agent-first experiences can move across stationary, portable, and wearable forms—adapting to the user, the context, and the work.
Real-world piloting
We are using these concept designs to inform how these form factors and the platform can be built. They will become reference designs for the ecosystem to build turnkey solutions. Inside Microsoft, hundreds of employees are already using these concept devices to improve their workday.
Here are some of the ways we and our partners are using, building, and experimenting with Project Solara to help users be more productive:
Microsoft 365 ecosystem
Microsoft 365 Copilot, through conversational voice, is available at tap or (optional) wake word, allowing you to securely access your data, grounded in WorkIQ. Copilot provides daily briefings, becoming your ultimate thought partner to brainstorm, explore ideas, take action, or get coaching.
Researcher can now help you keep tabs on your long-running projects by providing a more direct way to reach and respond to prompts and share reports when complete.
Facilitator is more accessible, allowing users one-tap access to securely record an in-person meeting, with all the power of transcription, detecting action items, and ensuring this information is grounded in WorkIQ. Never miss an important outcome or struggle to find your notes.
Priority Agent is an experimental agent our team is developing to bring actionable insights and actions directly to you. Grounded in signals across WorkIQ, Priority Agent provides the answer to “what needs my attention right now?” Priority Agent dynamically curates this list, adding and removing items intelligently, so you only glance at what’s needed now.
We are also partnering with other teams across Microsoft to explore how Project Solara can help deliver additional value for users:
GitHub Copilot is exploring how an agent-first approach helps keep developers more in touch with the progress of their coding projects and provides faster ways to get things done through new modalities like voice.
Dragon Copilot is exploring how agent-first experiences can better support physicians and nurses in the flow of care—helping capture interactions, surface relevant information in-context, and follow through on critical tasks without interrupting their day.
We’re excited to see how the agents from other third parties will find value and reach users in more direct ways, in more natural modalities. Here are ways you’ll be able to build for Project Solara devices:
We’ll have more to share on other ways to build agents for Project Solara devices in the future.
Private pilot program
In the coming months, we’ll begin piloting this agent-first device ecosystem with industry leaders like AccuWeather, Best Buy, CVS Health, Levi’s, Target, and others.
Platform ecosystem
Realizing the Project Solara platform vision requires close connections across silicon providers, device builders, agent developers, and customers, especially in the early phases of learning and iteration.
We will extend our collaboration with silicon partners to create reference designs for a range of categories spanning portable, ultra-portable, wearable, desktop, and others.
With those reference designs, we’ll enable OEMs and product makers to develop specialized solutions for specific scenarios and environments across a variety of industry segments (healthcare, retail, hospitality, financial services, legal, industrial, field service, and more) while meeting the needs of enterprise security and management, and seamless access and control for users.
Agent builders will be able to reach more people in more places, using the adaptability of the Project Solara platform to bring their agents into the workflows, environments, and moments where they can create the most value.
People, companies, and other institutions adopting Project Solara will shape the agent-powered, problem-solving experiences that they need.
Together, we will unlock the creativity and energy to establish a broad set of agent-first solutions, empowering everyone to achieve more.
Closing thoughts
I’m excited to share this shift and how we are building a new platform to help usher in a new era of agent-first experiences and devices with our partners.
This is where computing and new types of computers are headed. And importantly, this expands the reach and value of the agents and automation you are already building today.
A device on a desk. A device worn in the field. A device in a hospital, a store, a factory, a school, or a home. Each one becomes a new access point for your agents, and a new way to bring productivity, intelligence, and assistance into places where computing has not reached as naturally before.
>Agents will reshape not only software, but the devices themselves.
Because now you can imagine something more: not just an agent inside an app, but an agent delivered through a device purpose-built for a specific place, a specific workflow, and a specific job to be done.
That is the bigger opportunity.
For agent builders: Think big. The agents you are creating today will not be limited to the screens and devices we know today. They will be able to show up across a variety of new form factors—devices designed around them, tuned for them, and deployed into the moments where they can create the most value.
So, if you are developing agents today using Microsoft 365, Copilot Studio, or the Microsoft 365 Agents SDK, and if you are using Azure to cloud-scale your solutions, then you are already taking the right steps to be ready for this future.
Project Solara is about making that future easier to build, in a way that is open, secure, manageable, and scalable.
We are still early, and there is more to come. And to me, the direction is clear: Agents will reshape not only software, but the devices themselves.
AI is entering a new phase, and agents are the mechanism that will turn it into real economic impact.
We’re moving beyond models that generate text or code, and into systems that act by retrieving data, calling tools, executing workflows, and making decisions across environments. Frameworks like LangChain, AutoGen, CrewAI, and the Microsoft Agent stack have made it straightforward to build agents that can reason and operate end-to-end. When software starts acting on behalf of people, the surface area of automation expands across every workflow, every system, and every industry.
That shift introduces a new problem: As agents gain autonomy, the question is no longer just what they can do, but what they should be allowed to do and who defines those boundaries. Recent industry work highlights failure modes that don’t exist in traditional systems: tool misuse, unintended actions, and multi-step failures that emerge across agent workflows. At the same time, regulatory expectations are accelerating, with new requirements around high-risk AI systems and accountability already coming into play.
Today, governance largely lives in the development layer. Individual teams embed rules in prompts, application code, or framework-specific hooks. These controls are fragmented, inconsistent, and tightly coupled to how each agent is built. There is no standard mechanism by which a security or compliance team can define, enforce, and audit policy across agents.
Why the existing playbook breaks
Traditional security models assume a fixed actor and a fixed scope. With agents, the same credential that may be safe in one moment becomes risky in the next. A Slack token that’s fine for posting a meeting summary becomes dangerous the moment the agent has read a document marked confidential or included external users in the group. Traditional access control has no vocabulary for this: It can only answer “is this credential allowed to call this resource?”—not “given everything this agent has touched in this conversation, is this call still safe?”
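To make the distinction concrete, here is a minimal Python sketch of the two questions. It is an illustration only: the `Session` record, the scope string, and the "confidential" label are hypothetical names chosen for this example, not part of any product API.

```python
from dataclasses import dataclass, field


@dataclass
class Session:
    """Hypothetical record of what the agent has touched in one conversation."""
    labels_seen: set = field(default_factory=set)
    has_external_members: bool = False


def credential_allows(scopes: set, action: str) -> bool:
    # Traditional question: is this credential allowed to call this resource?
    return action in scopes


def call_is_safe(scopes: set, action: str, session: Session) -> bool:
    # Context-aware question: given what the agent has touched, is it still safe?
    if not credential_allows(scopes, action):
        return False
    risky = "confidential" in session.labels_seen or session.has_external_members
    return not risky


scopes = {"slack.post_message"}
fresh = Session()
after_read = Session(labels_seen={"confidential"})

print(call_is_safe(scopes, "slack.post_message", fresh))       # True
print(call_is_safe(scopes, "slack.post_message", after_read))  # False
```

The credential check returns the same answer in both cases; only the session-aware check changes its verdict once the conversation has touched confidential content.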
Customers told us they were patching these gaps by stitching together classifiers, validations, and custom checks throughout their codebases. Every team had built some version of this. We consistently saw the same patterns emerge.
System prompts are often the first line of defense. They tell the agent what it should or should not do. They’re useful, but they aren’t enforceable. A prompt instruction lives in the same stream as user input, retrieved content, tool results, and potentially attacker-controlled text.
Custom logic inside the agent can provide stronger guarantees for deterministic checks. But those rules are usually buried in application code. They’re hard to audit, hard to reuse, and hard to move when the team changes frameworks.
Input and output classifiers help detect risks like jailbreaks, prompt injection, toxicity, or sensitive content. But classifiers often only see isolated text. They don’t automatically know which tool is about to run, what data labels are attached, or what context the agent has accumulated.
Framework-specific guardrails (OpenAI Agents SDK input/output guardrails, Semantic Kernel filters, LangChain callback handlers, Anthropic tool-use callbacks) get the shape right but stop short. They’re the native extension points each framework exposes, which means the same policy has to be rewritten for every framework an organization uses, and a security team has no single place to author or audit controls.
General-purpose policy engines (OPA/Rego, Cedar) answer authorization questions deterministically on structured inputs. They’re mature, expressive, and widely understood. But they have no model of an agent loop, no notion of when in a lifecycle to consult them, what state to collect, or how to enforce the verdict in the host runtime.
Across all these approaches, the gap is the same: Enforcement is scattered, framework-specific, and disconnected from the broader context of the agent’s lifecycle. The result, in practice, is that agent security ends up scattered across system prompts, framework callbacks, application code, content classifiers, and policy engines, with no single contract that describes how policies should be evaluated and enforced.
Introducing Agent Control Specification (ACS)
ACS is an open specification and reference implementation for the runtime governance layer of AI agents. It’s a new module within Microsoft’s Agent Governance Toolkit (AGT), extending how developers manage and govern AI agents. Its core artifact is a portable manifest that defines where, when, and how policies are evaluated and enforced across the full agent lifecycle, independent of the agent framework, the runtime, or the policy engine that authors the rules themselves.
ACS provides the missing layer that makes policy languages like Rego usable in the agent context: the standardized hooks, inputs, and enforcement contract. It owns the orchestration around policy evaluation:
Lifecycle interception: where checks happen in the agent loop
Canonical input shaping: what structured context is passed to the policy engine
Evidence collection: how classifiers, DLP services, judges, or endpoints contribute facts
Information flow checks: how labels and tool clearances are enforced
Verdict normalization: how policy results become standard ACS decisions
Final enforcement: how allow, warn, deny, or escalate verdicts, plus redaction effects, are applied by the host
Fail-closed handling: what happens when policy, evidence, or verdict processing fails
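Two of the responsibilities listed above, verdict normalization and fail-closed handling, can be sketched in a few lines of Python. This is a simplified illustration under stated assumptions, not the ACS reference implementation: the function names and the shape of the raw policy result are hypothetical.

```python
from typing import Any, Callable

ALLOWED_DECISIONS = {"allow", "warn", "deny", "escalate"}


def normalize(raw: Any) -> str:
    """Map a raw policy result onto one of the four decisions, or raise."""
    decision = raw.get("decision") if isinstance(raw, dict) else raw
    if decision not in ALLOWED_DECISIONS:
        raise ValueError(f"unrecognized policy result: {raw!r}")
    return decision


def evaluate_fail_closed(policy: Callable[[dict], Any], policy_input: dict) -> str:
    """Run a policy; any failure in evaluation or normalization becomes 'deny'."""
    try:
        return normalize(policy(policy_input))
    except Exception:
        # Fail closed: an error must never be treated as permission.
        return "deny"


def broken_policy(_input: dict) -> str:
    raise RuntimeError("policy engine unavailable")


print(evaluate_fail_closed(lambda i: {"decision": "allow"}, {}))  # allow
print(evaluate_fail_closed(lambda i: "maybe", {}))                # deny
print(evaluate_fail_closed(broken_policy, {}))                    # deny
```

The design choice worth noting is that the error path and the unknown-result path converge on the same restrictive verdict, so a misconfigured policy cannot silently widen access.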
ACS: Open standard interception policies
ACS defines eight intervention points where policies can be evaluated against the agent’s runtime context. Each point evaluates a policy against the current snapshot, and the policy can allow, warn, deny, or escalate the action.
agent_startup: evaluate configuration and environment before the agent begins running
input: inspect user input before the model sees it
pre_model_call: inspect the full context being sent to the model
post_model_call: inspect the model’s response before the runtime acts on it
pre_tool_call: inspect tool name and parameters before execution
post_tool_call: inspect tool output before it re-enters the model context
output: inspect the final response before it leaves the agent
agent_shutdown: evaluate end-of-session conditions for logging and audit
Each call to ACS stands on its own: The host runtime passes the current snapshot, and ACS shapes the canonical input, runs configured evidence providers, invokes the policy engine, and returns a normalized verdict. Anything the policy needs to know about the session, including prior tool calls, accumulated sensitivity, approval state, and user history, lives in the snapshot the host passes in.
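The stateless contract described above can be sketched as follows. This is a toy model in plain Python, not the ACS SDK: the `Point` enum mirrors the eight names listed in the text, while the `evaluate` function, the `prior_tool_calls` key, and the threshold of three calls are assumptions made for illustration.

```python
from enum import Enum


class Point(Enum):
    """The eight lifecycle points named in the text."""
    AGENT_STARTUP = "agent_startup"
    INPUT = "input"
    PRE_MODEL_CALL = "pre_model_call"
    POST_MODEL_CALL = "post_model_call"
    PRE_TOOL_CALL = "pre_tool_call"
    POST_TOOL_CALL = "post_tool_call"
    OUTPUT = "output"
    AGENT_SHUTDOWN = "agent_shutdown"


def evaluate(point: Point, snapshot: dict) -> str:
    """Stateless toy policy: everything it knows arrives in the snapshot."""
    prior = snapshot.get("prior_tool_calls", [])
    if point is Point.PRE_TOOL_CALL and len(prior) >= 3:
        return "escalate"
    return "allow"


# The host, not the evaluator, carries session state between calls.
history = []
decisions = []
for tool in ["search", "search", "search", "send_email"]:
    decisions.append(evaluate(Point.PRE_TOOL_CALL, {"prior_tool_calls": list(history)}))
    history.append(tool)

print(decisions)  # ['allow', 'allow', 'allow', 'escalate']
```

Because the evaluator keeps no memory of its own, the same call can be replayed later for audit with the recorded snapshot and should return the same decision.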
The canonical policy input
At each intervention point, ACS turns the current agent context into a standard policy input. This input is the bridge between the agent runtime and the policy engine.
intervention_point tells the policy engine where in the agent lifecycle the check is happening. For example, pre_tool_call means ACS is evaluating a tool call before the tool runs.
policy_target is the specific thing being evaluated. In a tool call, this might be the tool arguments. In an output check, it might be the final response.
snapshot is the broader context provided by the host runtime. This can include the actor, roles, conversation state, prior tool calls, data sensitivity, approval status, or anything else the policy may need.
annotations contains evidence collected before the policy runs, such as results from classifiers, DLP systems, LLM judges, or external services.
tool includes tool metadata, like the tool name, clearance level, and security labels, when the intervention point involves a tool.
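As a rough picture of how these five fields fit together, the sketch below models the policy input as a Python dataclass. Only the five top-level field names come from the text; the nested keys (`actor`, `roles`, `dlp`, `clearance`, `labels`) and the example values are assumptions, and the actual wire format may differ.

```python
from dataclasses import asdict, dataclass, field
from typing import Any, Optional


@dataclass
class PolicyInput:
    """Illustrative container for the five fields described above."""
    intervention_point: str
    policy_target: Any
    snapshot: dict = field(default_factory=dict)
    annotations: dict = field(default_factory=dict)
    tool: Optional[dict] = None  # only meaningful for tool-related points


example = PolicyInput(
    intervention_point="pre_tool_call",
    policy_target={"to": "user@external.example"},
    snapshot={"actor": "alice", "roles": ["support"], "prior_tool_calls": []},
    annotations={"dlp": {"contains_pii": False}},
    tool={"name": "send_email", "clearance": "internal", "labels": ["email"]},
)

# A plain dict is what a policy engine would typically receive as JSON-like input.
payload = asdict(example)
print(sorted(payload))
```

Keeping evidence in `annotations` separate from host-provided state in `snapshot` lets a policy distinguish what the runtime asserted from what a classifier or external service reported.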
A worked example: The same manifest across two SDKs
Consider an email agent that must not send messages to external recipients. A single manifest binds one Rego policy to the pre_tool_call intervention point and declares the tool the policy reasons about:
The Rego policy reads the projected tool arguments and denies external recipients:
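The manifest and the Rego source are not reproduced in this text. As a stand-in for the decision logic only, the following Python sketch shows the kind of check such a policy performs. It is not Rego and not the project's actual policy; the `INTERNAL_DOMAINS` set and the `decide` function are hypothetical.

```python
INTERNAL_DOMAINS = {"contoso.example"}  # hypothetical; not from the original post


def decide(tool_args: dict) -> str:
    """Deny send_email calls whose recipient is outside the internal domains."""
    recipient = str(tool_args.get("to", ""))
    _, sep, domain = recipient.rpartition("@")
    # A missing or malformed address is treated as external (fail closed).
    if not sep or domain.lower() not in INTERNAL_DOMAINS:
        return "deny"
    return "allow"


print(decide({"to": "user@external.example"}))  # deny
print(decide({"to": "alice@contoso.example"}))  # allow
```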
A Python host loads the manifest with AgentControl.from_path and evaluates the pre-tool snapshot:
```python
import asyncio

from agent_control_specification import AgentControl, InterventionPoint


async def main() -> None:
    control = AgentControl.from_path("manifest.yaml")
    result = await control.evaluate_intervention_point(
        InterventionPoint.PRE_TOOL_CALL,
        {"tool_call": {"id": "t1", "name": "send_email",
                       "args": {"to": "user@external.example"}}},
    )
    # The recipient is external, so the policy denies the tool call.
    assert result.verdict.decision.value == "deny"


asyncio.run(main())
```
A Node host takes the same manifest and the same snapshot, and reaches the same verdict:
```javascript
const { AgentControl, InterventionPoint } = require("agent-control-specification");

// CommonJS has no top-level await, so the call is wrapped in an async function.
async function main() {
  const control = AgentControl.fromPath("manifest.yaml");
  const result = await control.evaluateInterventionPoint(
    InterventionPoint.PreToolCall,
    { tool_call: { id: "t1", name: "send_email",
                   args: { to: "user@external.example" } } },
  );
  // result.verdict.decision === "deny"
}

main();
```
Both SDKs load the same native core and evaluate the same Rego bundle. Cross-SDK conformance fixtures assert that the .NET and Rust SDKs return identical verdicts for the same snapshots, so the controls follow the agent when it moves from a Python service to a Node sidecar, or from a local script to a hosted runtime.
The documentation includes a quickstart guide for common frameworks and languages. The project is MIT-licensed and developed in the open. The spec is the source of truth. SDK behavior that diverges from it is a bug. Issues, RFCs, and adapter contributions are welcome.
Agent frameworks will change. Policy engines will evolve. Governance requirements will increase. The enforcement contract shouldn’t have to be rewritten each time.
Relationship with agent frameworks
ACS is a controls layer, not an agent framework. It does not orchestrate the loop, choose tools, or manage memory. Those responsibilities belong to the host and to the framework the agent is built on, including the Microsoft Agent Framework, whose objects can be guarded directly through the ACS SDK adapters. ACS plugs into the extension points each framework already exposes and supplies the pieces above them that no framework provides on its own: the canonical input shape, the evidence pipeline, the normalized verdict, and the fail-closed enforcement contract. The effect is that a team can pick or change agent frameworks without rewriting its policy surface, and a security team gets one place to author, version, and audit controls regardless of the runtime underneath.
Relationship with Agent Governance Toolkit
Agent Governance Toolkit (AGT) is the Microsoft-signed runtime that bundles policy enforcement, identity, sandboxing, and audit for production agents. The next version of AGT adopts ACS as its policy layer, so existing AGT users gain the eight intervention points, the canonical policy input, Rego-based decisioning, and the framework adapters that ACS provides, while keeping AGT’s identity, sandboxing, and audit guarantees.
>“The agent ecosystem needs an open standard for guardrails the same way it needs open standards for tool protocols and model interfaces. CrewAI has always leaned on open primitives, agents, and tasks declared in YAML, an OSS core anyone can extend, and guardrails should follow the same pattern: declarative, portable, not tied to any single vendor. That’s the direction Agent Control Specification is going, and it’s why we support it.”
– Lorenze Jay, Open Source Lead, CrewAI
Acknowledgements
PM team: Mehrnoosh Sameki, Mike Shi
Eng team: Mohamed Elmargawi, Mohammad Abouomar, Liam Crumm, Apoorv Jindal, Roni Burd
Design: Sooyeon Hwang
Business development: Ilvens Jean
Today, we’re releasing Adaptive Spec-driven Scoring for Evaluation and Regression Testing (ASSERT), an open-source framework for turning natural-language behavior specifications into executable evaluations. Every team building an AI system starts with a clear intention for the behaviors they want to coax from the product. Those expectations are usually written down somewhere: in a product requirement, a policy document, a system prompt, a launch checklist, or a review note. The more difficult step is turning that intention into an eval suite that’s specific enough to run, inspect, and update as the system changes. ASSERT seeks to address this by turning plain-language requirements into full evaluation pipelines: automatically generating test scenarios, datasets, metrics, and scorecards, then running them against your model, application, or agent.
High-quality behavioral evaluations are essential for understanding whether AI systems behave as intended. But the evaluations that product teams need generally don’t already exist, are often slow to build, are hard to validate, and are quick to go stale. Product requirements change; policies evolve; tools and retrieval environments shift; and models improve until yesterday’s benchmark no longer measures the behavior that matters. The intended behaviors are shaped by the product’s actual context, policies, and tools, but the evaluations used to assess them often only weakly reflect those conditions.
The gap is most visible in application-specific behavior. A support agent should issue refunds below a threshold, escalate likely fraud, and decline out-of-policy requests. A research assistant should synthesize internal and public information without relying on restricted findings. A change-control agent should produce useful plans while respecting approval boundaries. Generic evaluators such as helpfulness, relevance, groundedness, toxicity, and faithfulness can be useful signals, but they don’t test these product-specific behavioral boundaries directly. A system can score well on generic metrics while failing application-specific requirements.
ASSERT is built on the premise that a behavior specification should be a first-class input to evaluation—not just the background context. The framework systematizes the specification, converts it into an inspectable taxonomy, generates stratified test cases from the taxonomy, runs the test cases against the target, and scores each failure against the policy statement that produced it. In the next section, we’ll walk through how each of those steps works in practice.
How ASSERT works
The pipeline has four stages, the first of which combines systematization and taxonomization. First, ASSERT turns a broad behavior specification into an explicit concept specification, which is then converted into a granular, editable behavior taxonomy with suggested permissible and impermissible behaviors. Next, it generates stratified test cases over the dimensions the developer declares. Then, it runs those cases against the target system and records the full trace, including tool use and intermediate decisions. Finally, ASSERT scores each trace against the behavior taxonomy and associated policy stance for that case, producing labels, rationales, and failure patterns that developers can inspect and refine.
In the systematization stage, ASSERT turns a broad idea like harmful financial advice, tool-use governance, or unsafe health guidance into something concrete enough to evaluate. Rather than treating the concept as a single label, it represents it as a structured set of patterns, definitions, edge cases, and operational distinctions. Following Agarwal et al. (2026), ASSERT grounds the concept in prior work, reconciles multiple practical definitions, and refines the result into an explicit concept specification.
In the taxonomization stage, ASSERT converts that specification into a draft taxonomy of permissible and impermissible behaviors, together with the artifacts used to derive it. Developers and policy experts can review and revise both before the next stage runs. The user supplies the behavior description, the number of test-set samples, and a systematizer model; the taxonomization step outputs an editable behavior taxonomy that a policy expert can validate.
In the test-set generation stage, ASSERT instantiates that taxonomy into executable cases. It can generate single-turn prompts or multi-turn scenarios, including benign interactions and adversarial probes. Developers specify the dimensions that matter for the application, such as task type, persona, tool availability, request class, or environment configuration. ASSERT then builds a stratified set of cases so that behavior is tested across the declared conditions rather than on a narrow slice of easy examples.
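The idea of stratifying over declared dimensions can be sketched with the Python standard library. This is a generic illustration of stratified enumeration, not ASSERT's generator: the `stratified_cases` function, the dimension names, and the `per_cell` parameter are assumptions.

```python
import itertools
from collections import Counter


def stratified_cases(dimensions: dict, per_cell: int) -> list:
    """Return per_cell case stubs for every combination of dimension values."""
    names = sorted(dimensions)
    cases = []
    for combo in itertools.product(*(dimensions[name] for name in names)):
        cell = dict(zip(names, combo))
        for variant in range(per_cell):
            cases.append({**cell, "variant": variant})
    return cases


dims = {
    "persona": ["first-time traveler", "frequent flyer"],
    "request_class": ["benign", "adversarial", "out-of-policy"],
}
cases = stratified_cases(dims, per_cell=2)

# Every (persona, request_class) cell receives the same number of cases.
per_cell_counts = Counter((c["persona"], c["request_class"]) for c in cases)
print(len(cases), set(per_cell_counts.values()))  # 12 {2}
```

Enumerating the full cross product guarantees that rare combinations, such as an adversarial request from an otherwise ordinary persona, are represented as often as the easy cases.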
In the inference stage, ASSERT runs those cases against the target. The target can be a model, an agent, or an application-level workflow. Through its instrumentation layer, ASSERT records not only the final text output but also the evidence needed to interpret the result later: tool calls, retrieved context, routing behavior, and intermediate actions. For agentic systems, those traces are often necessary to understand what actually happened.
In the scoring stage, ASSERT evaluates each trace against the associated behavior or policy stance. The scoring output includes not only a pass or flagged label but also a rationale, a policy citation, and the turn or action that justified the verdict. The policy citation refers to the specific taxonomy behavior or developer-provided policy decision that the judge used to support the verdict.
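A per-trace score with these parts could be modeled roughly as below. The record layout is an assumption for illustration; the field names, citation strings, and the `flagged_by_citation` helper are hypothetical and do not describe ASSERT's actual output schema.

```python
from collections import Counter
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Score:
    """Illustrative per-trace verdict; field names are assumptions."""
    case_id: str
    label: str                  # "pass" or "flagged"
    rationale: str
    policy_citation: str        # taxonomy behavior or policy decision relied on
    turn: Optional[int] = None  # turn or action that justified the verdict


def flagged_by_citation(scores: list) -> Counter:
    """Count flagged traces per cited policy statement."""
    return Counter(s.policy_citation for s in scores if s.label == "flagged")


scores = [
    Score("c1", "pass", "Stayed within budget.", "budget.respect_limit"),
    Score("c2", "flagged", "Quoted a price no tool returned.", "grounding.no_fabrication", turn=3),
    Score("c3", "flagged", "Invented a hotel name.", "grounding.no_fabrication", turn=2),
]
summary = flagged_by_citation(scores)
print(summary.most_common(1))  # [('grounding.no_fabrication', 2)]
```

Grouping by citation is one way to turn individual verdicts into failure patterns that point back at a specific policy statement.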
Validation
We conducted two internal validation studies for ASSERT. First, we conducted a coverage study to determine whether ASSERT produces better behavior-specific evaluations than a more direct generation approach starting from the same written intent. Then, we evaluated the LLM judges against human review.
The coverage study spanned five behaviors: social scoring, sycophancy, task adherence, tool-use governance, and unsafe health guidance. We tested whether the generated probes surfaced meaningful signal across the target behavior surface rather than collapsing onto a narrow slice of it. Across these suites and three target models, ASSERT produced evaluation sets that were more useful on the properties teams typically need from an eval. Compared with an in-house baseline, ASSERT covered roughly 1.2x as much of the intended behavior space, surfaced about 1.5x as many cases where the model did something worth inspecting, produced more than 4x stronger separation between stronger and weaker systems, and had about half as many saturated cases where every model behaved the same way. It also surfaced roughly 2x as many distinct failure patterns, though we treat that result as directional because failure-type labeling is harder to stabilize than coverage or model separation.
These results reinforced a design point that’s easy to underestimate: Coverage is largely determined upstream. If the behavior is underspecified, the generated dataset will be, too. ASSERT is built around a systematization step that makes the behavior explicit before generation begins, so the evaluation set is guided by a structured representation of the target behavior rather than a loose prompt. In practice, this produced evaluation sets that were broader and better aligned with the behaviors developers actually wanted to test.
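One of the properties mentioned above, saturation, is straightforward to compute once per-case labels exist for each model. The sketch below is one plausible reading of "saturated" (every model received the same label on a case); the function name, model names, and toy labels are assumptions, and the study's exact metric definition is not given in the text.

```python
def saturation_rate(results: dict) -> float:
    """Fraction of cases on which every model received the same label.

    results maps a model name to its list of per-case labels, aligned by index.
    """
    label_lists = list(results.values())
    if not label_lists or not label_lists[0]:
        raise ValueError("need at least one model and one case")
    if len({len(labels) for labels in label_lists}) != 1:
        raise ValueError("all models must be scored on the same cases")
    per_case = list(zip(*label_lists))
    saturated = sum(len(set(labels)) == 1 for labels in per_case)
    return saturated / len(per_case)


toy = {
    "model_a": ["pass", "flagged", "pass", "pass"],
    "model_b": ["pass", "pass", "pass", "flagged"],
    "model_c": ["pass", "flagged", "pass", "pass"],
}
print(saturation_rate(toy))  # 0.5
```

Cases where all models agree contribute nothing to separating stronger from weaker systems, which is why a lower saturation rate is treated as a sign of a more informative test set.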
Second, we validated the judges directly against human review. Across more than 10 behavior concepts, we used LLM judges for a first pass over the full evaluation set, then sampled cases per risk for human validation and independent review. In practice, agreement between LLM judges and human annotators was typically in the 80–90% range, while human inter-annotator agreement was around 90%. This gave us confidence that the judges were capturing much of the intended signal, while also making clear where caution was needed. At the same time, judge quality and stability are partly dependent on the underlying LLM: Different judge models can vary in strictness, boundary sensitivity, and willingness to treat closely related behaviors as distinct.
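For readers who want to run a similar check on their own judges, raw percent agreement is the simplest starting point. The text does not say which agreement statistic was used, so the sketch below is an assumption; the labels are toy data, not results from the study.

```python
def percent_agreement(judge: list, human: list) -> float:
    """Share of cases where the judge label matches the human label."""
    if len(judge) != len(human) or not judge:
        raise ValueError("label lists must be non-empty and the same length")
    matches = sum(j == h for j, h in zip(judge, human))
    return matches / len(judge)


# Toy labels for five sampled cases (illustrative only).
judge_labels = ["pass", "flagged", "pass", "flagged", "pass"]
human_labels = ["pass", "flagged", "pass", "pass", "pass"]

print(percent_agreement(judge_labels, human_labels))  # 0.8
```

Raw agreement does not correct for agreement expected by chance, so on heavily imbalanced label sets a chance-corrected statistic is usually reported alongside it.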
Finally, we also ran a qualitative review with subject-matter experts (SMEs) on 15 generated datasets. SMEs reviewed the test cases for policy alignment, behavioral relevance, and overall quality and found that the generated datasets were generally well aligned with the intended policy and risk boundaries. We view this as a complementary form of validation: Beyond quantitative metrics, it showed that the datasets were also credible and useful to experts inspecting them directly.
Taken together, these studies support the two claims we think matter most: Systematization improves the coverage and usefulness of the generated dataset, and decomposed measurements make the resulting evaluations easier to interpret than a single aggregate score. They also highlight an important caveat: Evaluation quality depends not only on the pipeline design, but also on the stability and calibration of the judges used to score it.
>“My favorite thing about ASSERT is that the eval is easy to configure and reason about. I describe the behavior I care about in YAML, point it at a real agent, and get artifacts back. Not just pass/fail. They show why the judge made each call. That openness matters. The spec, generated cases, model outputs, judge rationale, and metrics are all inspectable locally. The eval feels auditable, not like a black box.”
– Lorenze Jay, Open Source Lead, CrewAI
A worked example: A travel-planning agent
To make this concrete, imagine a travel-planning agent that helps users build itineraries. On the surface, this sounds like a simple assistant: Find flights, suggest hotels, check the weather, and produce a plan.
But a real travel agent has to do much more than answer a question. It must use tools in the right order, respect explicit user constraints, ground its recommendations in tool results, and avoid subtle failure modes that traditional single-turn QA benchmarks miss.
For example, the agent shouldn’t invent flight prices. It shouldn’t agree with an itinerary that exceeds the user’s budget. It shouldn’t make stereotyped assumptions about a traveler based on age, disability, family status, or travel style. And it shouldn’t follow malicious instructions hidden inside tool outputs or search results.
The example in the ASSERT repository uses a multi-agent LangGraph travel planner with five tools:
search_flights
search_hotels
check_weather
check_travel_advisories
validate_budget
It operates within a six-turn budget, and every run records the full agent trace (tool calls, arguments, tool results, routing decisions, and intermediate state) alongside the final response. That trace matters: It lets the judge cite the specific action responsible for each verdict, so the evaluator can assess not only whether the final answer was acceptable but also why the agent failed and which action caused the failure.
The evaluation configuration defines six failure-mode categories across two themes:
Quality: wrong or skipped tool use; fabricated flight, hotel, or price details; budget constraint violations
Safety: stereotyping; prompt injection from tool output; sycophantic agreement with unsafe or invalid itineraries
To run the evaluation:
```shell
assert-eval run --config eval_config.yaml

# To inspect the results
assert-eval results status \
  --results-dir "$PWD/artifacts/results" \
  travel-planner-langgraph-v1 \
  demo-1
```
ASSERT produces a set of artifacts under the run directory:
taxonomy.json: the concept spec produced by systematization
test_set.jsonl: the stratified prompts and multi-turn scenarios
inference_set.jsonl: per-scenario traces with tool calls and intermediate state
scores.jsonl: per-trace verdicts with rationale and policy citation
metrics.json: the aggregate roll-up
In the example results, the dimensions are separated rather than rolled into a single number: The same five scenarios produce 40% over-refusal and 60% policy violation, and those aren’t the same failures. A team optimizing on the aggregate would miss that the agent is failing in both directions at once. The results can be further inspected in a UI widget.
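The arithmetic behind those two figures can be reproduced with a small sketch. The rows below are hypothetical data constructed to match the percentages quoted above (two of five scenarios over-refuse, three of five violate policy); the field names are assumptions, not the schema of scores.jsonl or metrics.json.

```python
def rate(rows: list, key: str) -> float:
    """Fraction of rows where the given failure dimension is set."""
    return sum(bool(row[key]) for row in rows) / len(rows)


# Hypothetical per-scenario results, one failure dimension per scenario.
rows = [
    {"id": "s1", "over_refusal": True,  "policy_violation": False},
    {"id": "s2", "over_refusal": True,  "policy_violation": False},
    {"id": "s3", "over_refusal": False, "policy_violation": True},
    {"id": "s4", "over_refusal": False, "policy_violation": True},
    {"id": "s5", "over_refusal": False, "policy_violation": True},
]

print(rate(rows, "over_refusal"))      # 0.4
print(rate(rows, "policy_violation"))  # 0.6

# A single "any failure" aggregate hides the direction of each failure.
any_failure = sum(r["over_refusal"] or r["policy_violation"] for r in rows) / len(rows)
print(any_failure)  # 1.0
```

Reporting the two rates side by side shows that tightening refusals to fix one would likely worsen the other, which a single aggregate would not reveal.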
Practical considerations
In practice, this framework works best when the behavior definition is relatively narrow and the relevant constraints are clearly specified. Richer descriptions of tools, policies, and boundaries usually lead to more precise scenarios. It’s also worth treating aggregate scores cautiously. In many cases, the most useful output isn’t the summary metric but the collection of failures and traces that shows where the specification, the system, or the evaluation itself needs refinement. ASSERT doesn’t remove the need for judgment in evaluation design. Vague specifications still produce vague scenarios. Synthetic interactions can miss failures that only appear in production settings. And model-based judges can be unreliable, especially when the policy distinction is subtle or highly domain-specific. More broadly, a specification-driven evaluation shouldn’t be treated as a compliance certification or a substitute for human review, telemetry, or domain expertise. It’s better understood as a way to make evaluation faster, more explicit, and easier to iterate on.
Get started
ASSERT is open-source under the MIT license and available today.
If you build evals and run them as part of your release process, we’d like to hear what works, what doesn’t, and what behaviors you think are hardest to specify. ASSERT is at its most useful when behavior specifications are written down and treated as first-class inputs to evaluation. We’re releasing it in that spirit.
Acknowledgements
PM team: Mehrnoosh Sameki, Minsoo Thigpen, Chang Liu, Abby Palia, Hanna Kim
Science: Riccardo Fogliato, Emily Sheng, Alex Dow, Meera Chander, Alex Chouldechova, Sharman Tan, Xiawei Wang, Ahmed Magooda, Mayank Gupta, Jean Garcia-Gathright, Chad Atalla, Dan Vann, Hanna Wallach, Hannah Washington, Meredith Rodden, Nadine Frey, Melissa Kirkwood, Nick Pangakis, Ali Azad, Ahmed Elghory Ghoneim, Shushan Arakleyan
Eng team: Mohamed E, Jake Present, Aaron Aspinwall, Yeming Tang
Design: Sooyeon Hwang, Becky Haruyama
Special thanks: Roni Burd, Mohammad A, Heba Elfardy, Sandeep Atluri, Sydney Lister, Ram Shankar Siva Kumar, Andrew Gully
Available for XBOX Insiders: Personalize Your Experience with New Color Options, Controller on Screen Settings and More
Alex Charters, Senior Product Manager; Eden Marie, Principal Software Engineering Lead
Starting today, select Alpha Skip-Ahead XBOX Insiders can start testing the latest updates coming to XBOX consoles, bringing more ways to make your console feel more like yours and easier ways to stay informed. This latest update adds more personalization with expanded color options, on-screen controller visibility, easier access to ‘What’s New’ after updates, and XBOX Service status on console.
These features will roll out to more Insiders over time, with broader availability coming to all players later. We’re excited to hear what you think, and we’ll keep bringing you more of the features you’ve been asking for.
See Your Controller, On Screen
The XBOX Accessories app now shows an image of the specific XBOX controller you have connected, including most XBOX Wireless Controllers with a Share button. Whether you’re using a supported standard or special edition XBOX Wireless Controller, or an XBOX Elite Wireless Controller Series 2, it’s easier to confirm you’ve got the right one selected, especially when you’re remapping buttons or adjusting settings to fine-tune your setup.
Personalize Your Experience with More Color Options
Inspired by your feedback, we’re expanding how you can make your console truly yours. You’ll soon be able to enter a specific hex color code for precise color selection or use the new “match my gamerpic” option to pull a color directly from your gamerpic. Once you’ve picked your color, you can preview how it carries across your console, making it easier to see exactly how your look comes together before you set it.
Find “What’s New” After an Update
After your console updates, you can now quickly access the latest release notes directly from Home’s top navigation to easily see what’s new and explore new features. We’ve also improved the release notes themselves, adding more details about new features, changes, and major bug fixes. XBOX Insiders will be linked directly to release notes specific to their update preview ring in the XBOX Insider Hub app, so you know exactly what’s new for you.
Check XBOX Service Status on Console
Now you can stay informed without leaving your console. When there’s a confirmed XBOX service issue, a new indicator will appear on the upper-right corner of the screen so you can quickly see what’s going on. From there, you can jump straight to a status page with the latest updates on what’s affected and when things are expected to be back up and running. No need to check other sites or social channels.
How to Get XBOX Insider Support and Share Your Feedback
We also want to thank all the XBOX Insiders for the feedback you share with us.
We recently launched XBOX Player Voice, a new place to collect your feedback and make it more visible. When you submit feedback, teams review and organize it so it can be understood and considered alongside other work. In some cases, ideas will move forward. In others, they may take longer to implement or may not be something we can act on. When there are meaningful updates, we’ll share them. Visit aka.ms/XBOXplayervoice to share what you think.
If you’re an XBOX Insider looking for support, please join our community on the XBOX Insider subreddit. Official XBOX staff, moderators, and fellow XBOX Insiders are there to help. We recommend adding to threads with the same topic before posting a brand new one. This helps us support you the best we can!
If you aren’t part of the XBOX Insider Program yet and want to help create the future of XBOX and get early access to new features, join the Program today by downloading the XBOX Insider Hub for XBOX Series X|S & XBOX One or Windows PC. For more information on the XBOX Insider Program, follow us on Twitter at @XBOXInsider and keep an eye on this blog for all the latest news.
Lately there has been a lot of talk of how the foundational models are quickly becoming like every other iPhone release. They are ho-hum, till the next one comes around. But it is not the right analogy. I have a more boring, and more accurate, analogy that will explain the growth so far, and how it will evolve.
I have been fortunate enough to have been involved with the last five cycles of technology. As a result, I have been able to see patterns in the history of technology. It doesn’t matter what the technology is – we go from shock and awe to ho-hum, go-to-work. A technology eventually becomes invisible to us.
Remember when broadband came around? That was in the late 1990s, and it was magical. I had a DSL connection in my East Village apartment. By 2020 I was part of the gigabit society. Speed had faded from the foreground. We just kept consuming the internet, its joys and jolts, without thinking about speeds.
Same will happen to AI. But let me give you some more examples.
The Clock Speed Wars
If you were around in the 1990s, you lived through the megahertz and then the gigahertz wars. For roughly three decades, from the mid-1970s through the early 2000s, the personal computer industry was all about how fast the processor ran. Intel and AMD were in a clock speed race. The Pentium 4 launched in November 2000 at 1.4 GHz and eventually topped out at 3.8 GHz. Faster was better.
The Pentium 4 era became notorious not for its speed but for its heat. Chips that once drew 15 to 20 watts were routinely hitting 70, 80, even over 100 watts, generating more thermal waste than useful work. The industry regrouped. It stopped talking about clock speed and started talking about performance per watt. Multi-core architectures replaced the single fast core. Apple’s M1 chip in late 2020 made performance per watt the story. By 2021, ARM itself declared that performance per watt was the new Moore’s Law. The question was what the chip could do while drawing almost no power, producing almost no heat, inside a device thin enough to forget you were carrying it.
The chip became invisible. It stopped being the product and became part of the product. Nobody buying a MacBook Air today asks about the M-series clock speed. They feel the battery last all day and the machine stay cool and they do not think about the processor at all. That is exactly the point.
It took roughly 30 years from the first commercial microprocessors to the clock-speed plateau in 2005. Another 15 years to the M1 reframing. The total arc from speeds and feeds to invisible efficiency ran about 45 years.
The Smartphone Upgrade Treadmill
The smartphone followed a compressed version of the same arc. It ran faster because the industry had already learned some lessons, because the market was far larger, and because phone buying cycles turned over much more quickly than PC buying cycles.
The original iPhone, announced in 2007, was a genuinely new way of doing things. Nothing before it did what it did. The next three or four years were a sprint – cameras improved dramatically, screens improved dramatically, LTE replaced 3G, form factors settled, app stores created ecosystems that generated their own gravity. Annual upgrades were rational. Each year’s phone was genuinely different from the one before. It was so exciting. I still remember getting the iPhone 4 and being simply astounded by how much it could do.
The plateau came faster than anyone expected. By the mid-2010s, the core smartphone experience was essentially perfected in functional terms. The differential between one year’s model and the next shrank from obvious to marginal. Upgrade cycles that had been annual stretched to two years, then three, then beyond. Half my family is still using the iPhone 15 or iPhone 16, with no intention to upgrade. That is typical. Industry data shows nearly a quarter of users stretching to three or four years between replacements.
Samsung introduced a 100-megapixel phone. It had more pixels, but it didn’t solve any real problem most users had. OnePlus made faster charging a headline feature. Again, good, but not amazing enough to justify spending more. The features that remained genuinely useful – battery life, camera quality, storage – were also the least exciting to announce. The things that mattered were becoming infrastructure. The smartphone did not become worse. It became good enough that its goodness stopped generating conversation. The upgrade became an assumption rather than a desire.
This is what my friend Christian Lindholm, who once worked for Nokia and then for the design firm Fjord, calls his “of-course principle of design.”
Great design means that one look and the end user reacts by knowing what to do with a knob or a button, without as much as even thinking about it. Of course this knob is what turns the volume up, or brings up the home screen. This “of course” factor is at the heart of every great design – from the iPhone to the Braun alarm radio.
It is the same for underlying infrastructure technology. In comparison to 45 years for the PC, we first noticed the perceived plateau in roughly 7 years from the release of the first iPhone, somewhere around 2013 to 2015. The total arc from cataclysmic change to ho-hum ran, give or take, a decade.
The Wrong Analogy
In both previous cases, the underlying technology actually did plateau in measurable ways – clock speeds stopped climbing, annual phone improvements shrank. The ho-hum feeling tracked a real slowdown in capability improvement.
That is not what is happening with AI. Capability is not plateauing. The curve is still accelerating. GPT-4 to GPT-5 is not a shrinking increment. Reasoning models, multimodal capabilities, the proliferation of open-weight models that commoditize what were closed advantages – these are real jumps.
The ho-hum that is coming, the one already arriving at the edges, is something different. It is not the slowing of capability. It is the migration of AI from topic to infrastructure. It will go into the background. From the thing you think about to the thing that makes everything else work. Consider when the iPhone 18 marries Gemini into its operating system. (Of course, knowing Apple they will bungle it up.)
It is infrastructure commoditization. And there is a much better historical analogy for it.
The Light Nobody Sees
I was lucky to watch the dawn of optical networking. As a young reporter I wrote about it with genuine excitement. George Gilder was preaching the Telecosm to us young punks paying attention. The future seemed to live in strands of glass thinner than a human hair.
In the mid-1990s, the internet backbone that carried data across continents ran at 45 megabits per second. Home users waited minutes to download a photograph over dial-up. The constraint was everywhere.
To understand what changed it, you need to understand how light moves through fiber. A single strand of glass can carry only so much data at one wavelength – think of it as one lane of a highway. Wavelength Division Multiplexing, or WDM, was the insight that you could send multiple signals down that same strand simultaneously, each on a different color of light, the way a prism splits white light into its spectrum. Each color carries its own independent stream of data. One fiber becomes many. Dense Wavelength Division Multiplexing – DWDM – pushed this further, packing dozens, then scores, of tightly spaced wavelengths onto a single fiber. Where earlier systems carried a handful of channels, DWDM eventually supported 96 simultaneous channels, each at its own wavelength, each carrying its own full stream of traffic.
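The multiplication at the heart of WDM is simple enough to sketch. This is a back-of-the-envelope illustration, not a model of any real system: the 96-channel count comes from the paragraph above, while the 10 Gbps per-wavelength rate and the function name are assumptions picked only to make the arithmetic concrete.

```python
def aggregate_capacity_gbps(channels: int, rate_per_channel_gbps: float) -> float:
    """Total capacity of one fiber when each wavelength carries its own stream."""
    if channels < 0 or rate_per_channel_gbps < 0:
        raise ValueError("channels and rate must be non-negative")
    return channels * rate_per_channel_gbps


# Assumed per-wavelength rate, for illustration only.
RATE_GBPS = 10.0

single = aggregate_capacity_gbps(1, RATE_GBPS)   # one color of light, one lane
dwdm = aggregate_capacity_gbps(96, RATE_GBPS)    # 96 colors on the same strand

print(f"single wavelength: {single:.0f} Gbps")
print(f"96-channel DWDM:   {dwdm:.0f} Gbps ({dwdm / single:.0f}x, same glass)")
```

The point of the sketch is that capacity scales with the number of colors and with the speed of each color, and neither requires new cable in the ground.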
There were no launch events for this. There were no reviews. DWDM entered commercial deployment in the mid-1990s and began doing something remarkable, and entirely invisible: capacity that seemed finite became functionally unlimited.
Since the early 2000s, channel capacity has climbed from gigabits per second to 100 gigabits per second per wavelength. Modern DWDM systems can carry 51.2 terabits per second down a single fiber pair. Fiber deployed in the 1980s is now running signals 645 times faster than it was 20 years ago, with no new cable in the ground. One estimate puts the theoretical capacity of a single standard fiber strand at over 600 terabits per second, meaning even a 51.2-terabit system uses less than a tenth of what the glass can carry.
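How much headroom is left depends heavily on which deployment you hold up against that 600-terabit estimate. Here is a rough sketch of the arithmetic. The 600 Tbps and 51.2 Tbps figures are the ones quoted above; the single 10 Gbps wavelength is an assumed comparison point, and treating a fiber pair and a single strand as directly comparable is a simplification.

```python
from fractions import Fraction

THEORETICAL_TBPS = 600  # the estimate quoted in the text


def share_of_theoretical(deployed_gbps: int,
                         theoretical_tbps: int = THEORETICAL_TBPS) -> Fraction:
    """Fraction of the theoretical capacity that a deployment actually uses."""
    return Fraction(deployed_gbps, theoretical_tbps * 1000)


# Assumed comparison point: one 10 Gbps wavelength on an otherwise idle fiber.
ten_gig = share_of_theoretical(10)
# The 51.2 Tbps system quoted in the text, expressed in Gbps.
modern = share_of_theoretical(51_200)

print(f"10 Gbps link:     {ten_gig} of theoretical capacity")
print(f"51.2 Tbps system: {float(modern):.1%} of theoretical capacity")
```

Under these assumptions a lone 10-gigabit wavelength uses one sixty-thousandth of the glass, while a fully loaded modern system uses between eight and nine percent. Either way, the fiber is not the constraint.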
Nobody wrote about this. Nobody had to. Because DWDM worked, nobody noticed the bandwidth problem. YouTube became possible. Netflix became possible. Zoom calls during a pandemic became possible. The capacity was simply there, having grown silently for twenty years, enabling everything above it without requiring acknowledgment from anyone.
That is the analogy that fits AI. Not the PC clock-speed arc, where capability slowed and conversation shifted. Not the smartphone arc, where the category became good enough and stopped generating desire. The optical networking curve, where capability kept growing – is still growing – while the conversation moved entirely away from it, because the growth had become embedded in everything and required no one to pay attention.
Where AI Goes From Here
AI capability will keep climbing. There is no reason to expect the research curve to flatten. Models will become more capable, more efficient, more specialized. The open-weight ecosystem will compress the gap between frontier and commodity. Inference costs will continue to fall, as they have been falling, by an order of magnitude roughly every year or two. The raw capability, like the raw bandwidth of DWDM, will continue its silent exponential.
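To get a feel for what "an order of magnitude every year or two" compounds to, here is a small sketch. It assumes a smooth exponential decline, which real pricing will not follow exactly, and the four-year horizon and function name are arbitrary choices for illustration.

```python
def relative_cost(years: float, years_per_10x: float) -> float:
    """Cost relative to today if prices fall 10x every `years_per_10x` years."""
    if years_per_10x <= 0:
        raise ValueError("years_per_10x must be positive")
    return 10 ** (-years / years_per_10x)


# Compare the fast and slow ends of the "every year or two" range.
for period in (1.0, 2.0):
    remaining = relative_cost(4, period)
    print(f"10x cheaper every {period:g} yr: after 4 years, "
          f"cost is {remaining:.2%} of today's")
```

At the slow end of the range, four years leaves the cost at one percent of today's. At the fast end, it is one hundredth of one percent.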
What will stop growing is the conversation about it. The breathless coverage of each new model announcement has a different texture than it did in 2022. The releases come faster, the benchmarks climb, but the surprise is attenuating.
When GPT-3 appeared, it felt like a visitation. A new iPhone moment. When GPT-4 arrived, it felt like a significant upgrade. Like the arrival of the M1. Now, as fifth and sixth generations circulate, the question people ask has changed. Not because the models are less capable. Because capability is no longer the story. The story is what the capability is inside.
AI will become what DWDM became: the layer you cannot see that makes everything above it work. It will be inside the camera that decides how to expose the photograph. Inside the chip that manages the laptop’s power. Inside the hospital monitor watching for early deterioration. Inside the contract that was reviewed before you read it.
The models will not disappear. They will stop being the unit of conversation. No one talks about DWDM when they open a video call. No one will talk about the foundation model when they use what the foundation model made possible.
This is not good for the valuations of the foundation model labs and the likes of Nvidia, which will want to come up with new metrics. But the industry isn’t stupid. Just follow Ciena’s stock from its IPO through its first 10 years, and you can almost predict the shape, if not the scale, of the curve these companies’ valuations will follow.
The iPhone took roughly a decade to shift from rupture to infrastructure. The PC clock-speed story took longer because the stakes were different and the industry moved differently. But AI is moving faster than either. The infrastructure framing is already present in enterprise software, in developer tools, in hardware roadmaps. The consumer shift tends to lag by a few years. My estimate: by 2028, the question will no longer be “which AI” but “what does this do that it did not do before.” The model will have gone underground.
The companies building foundation models are not necessarily the companies that will define the AI era. DWDM was built by Nortel, Lucent and their peers, companies now mostly forgotten or absorbed. The internet layer above them – Google, Amazon, Netflix – captured the value that the optical infrastructure enabled. Infrastructure enables; it does not determine who wins.
Commoditization is already underway. Open-weight models are compressing the advantage that closed frontier models once held. The cost of inference has fallen so fast that capability is no longer a defensible edge. The edge will be the particular applications that make the underlying capability feel indispensable and invisible at once.
Watch not the benchmark. Watch the disappearance of the benchmark from the conversation. When we stop asking which model scores highest on reasoning tests and start asking why our software feels smarter without us having changed anything, the transition will have happened.
The optical fiber is already in the ground. We just don’t know yet what runs on it.