Sr. Content Developer at Microsoft, working remotely in PA, TechBash conference organizer, former Microsoft MVP, Husband, Dad and Geek.
155951 stories
·
33 followers

Mastodon looks to newsletters to help revive the open social web

1 Share
Mastodon’s newly launched newsletter feature lets anyone subscribe to creators by email, even without a Mastodon account.
Read the whole story
alvinashcraft
43 minutes ago
reply
Pennsylvania, USA
Share this story
Delete

Getting more from each token: How Copilot improves context handling and model routing

1 Share

As Copilot takes on more agentic work, from planning and editing to debugging, reviewing, and calling tools across longer sessions, efficiency means more than using fewer tokens. It means being smarter about how you use them.

Increasing efficiency starts with reducing what Copilot has to repeat from turn to turn, including context, tool definitions, and cached state. It continues with choosing the right model for the job. A quick explanation, a focused edit, and a complex multi-file change should not all be treated the same way.

We are working on both: improving the Copilot harness so more of each session goes toward the task itself, and expanding Auto so Copilot can pick the model that fits the work without asking developers to make that choice every time. This post focuses on harness improvements in GitHub Copilot for VS Code and on ongoing work to expand Auto across Copilot surfaces.

Increased prompt caching and deferred tools

In longer GitHub Copilot sessions in VS Code, the harness prepares a lot of recurring information for the model: instructions, repository context, conversation history, available tools, and the current state of the task. Some of that context is needed. Some of it can be cached, deferred, or loaded only when it becomes relevant.

Two improvements in GitHub Copilot for VS Code are doing most of the work here. Prompt caching helps Copilot reuse model state for repeated prompt prefixes instead of recomputing the same prefix on every request. Tool search lets the model load tool definitions on demand, instead of sending every full tool schema into context on every turn.

That matters more as agents use more tools. A session may need access to MCP tools, terminal commands, file operations, workspace search, and product-specific actions. Loading every full tool definition up front adds fixed cost to each turn, even when only a small number of tools are relevant to the task. With tool search, Copilot can keep the available toolset broad while sending less unnecessary tool schema into the model.

For a deeper technical look at the implementation, including prompt caching, cache-control breakpoints, provider-specific tool search, and how these changes work across long-running agentic sessions, read the VS Code technical deep dive.

Where GitHub Copilot auto model selection fits in

Auto answers a practical question: which model is the best fit for this task right now?

After your first prompt, Copilot uses task intent and current model health to choose a model that best fits the task. Different kinds of work, like quick explanations, focused edits, or multi-file changes, do not all benefit from the same level of reasoning, so Auto makes that call without requiring you to tune model settings.

In our evaluations, no single model consistently performed best across tasks. In many cases, a more efficient model reached the same outcome, while stronger models mattered most when the task required deeper reasoning. Auto learns where stronger reasoning improves the result. It routes up when the task demands it and stays more efficient when it does not. The goal is not to trade quality for cost, but to use the model that best fits the work.

How Auto selects the right model

Auto combines two signals: what model is healthy and available right now, and what kind of work Copilot is being asked to do.

  • Real-time model health: a dynamic engine tracks model availability, utilization, speed, error rates, and cost. A model may be capable of handling a task, but that does not mean it is the best choice at that moment. Auto takes current system conditions into account so Copilot can route to a model that is both capable and ready to respond.
  • Task-aware routing with HyDRA: a routing model that considers factors like reasoning depth, code complexity, debugging difficulty, and tool orchestration needs. HyDRA identifies models that can meet the quality bar for the task, then chooses the best fit among them.
Chart shows HyDRA quality vs cost savings across a 5 model production pool. Three HyDRA operating points illustrate tunability: (peak) exceeds Sonnet at 12.9% savings; (agg.) balances quality for 72.5% savings. 
Figure 1: Three HyDRA operating points illustrate tunability: (Peak) exceeds Sonnet at 12.9% savings; (Agg.) balances quality for 72.5% savings.
Chart showing quality and cost tradeoffs of HyDRA and other published research and commercial routers using SWEBench benchmarks. HyDRA (Cons.) ties OpenRouter Auto on resolution rate (70.8%) at 3.3x the savings. HyDRA (Aggr.) outperforms both Azure Foundry operating modes. 
Figure 2: HyDRA (Cons.) ties OpenRouter Auto on resolution rate (70.8%) at 3.3x the savings. HyDRA (Agg.) outperforms both Azure Foundry operating modes.

Taken together, these signals let Auto avoid a one-size-fits-all approach. The point is not to send every task to the biggest model, or every task to the cheapest one. It is to choose the model that fits the work.

Making Auto work in practice

Getting routing right in evaluations is only part of the problem. To make Auto useful in real workflows, we also had to account for how developers actually use Copilot: conversations get longer, context builds up, tasks shift, and developers work in many languages.

Cache-aware routing. Switching models on every turn may sound flexible, but it can work against efficiency. When a conversation stays on the same model, the prompt prefix can be cached and reused across turns. Switching models mid-conversation breaks that cache, which can cost more than the routing change saves. Auto avoids that by routing at natural cache boundaries: on the first turn, when there is no cache to lose, and after compaction, when Copilot summarizes older turns and the prompt prefix resets. Between those points, the selected model stays in place so the cache can keep building.

Routing across languages. Copilot serves developers around the world, so routing has to work in languages other than English. We trained the routing model on conversations across 16 language families, including CJK, European, and others. In evaluations, routing accuracy stayed within four points of the English baseline across language groups, with no statistically significant quality gap.

Chart showing the efficacy of high reasoning, low reasoning, and Auto across English, European, CJK, and other script families. Evaluation is based on an evaluation set sampled from production VS Code chat telemetry across 19 languages.
Figure 3: Intelligent routing stays within 4 points of English baseline. Model evaluations across English, European, CJK, and other script families, based on a held out evaluation set sampled from production VS Code chat telemetry across 19 languages.

Learning when escalation matters. Instead of labeling tasks as simply “easy” or “hard,” we trained the router to learn where models actually diverge. For each training query, responses from a less capable model and a more capable model are scored across quality dimensions. The router learns when the stronger model adds value, and when a more efficient model can produce an equally good result. For context-dependent messages in longer agentic sessions, the router is trained on complete multi-turn conversations, including the original user intent, recent assistant responses, and conversation metadata.

Auto with task intent is expanding

Auto with task intent is already live in Visual Studio Code, github.com, and mobile. It gives Copilot more signal about the kind of work you are doing, whether that is coding, debugging, planning, or using tools, so it can make a better model choice for the task.

We are continuing to expand that experience across Copilot. Next, we are bringing Auto with task intent to more surfaces and adding more ways for teams to make Auto the default.

  • Auto with task intent is coming to Copilot CLI, GitHub App, and additional IDEs.
  • Copilot Free and Student plans will be simplified to leverage Auto as the only model selection option.
  • Admin controls will let organizations set Auto as the default or enforce Auto as the only option.

Getting more value from your AI credits

Copilot is getting more efficient by default, but a few habits can help your credits go further.

  • Start with Auto. Auto is the strong default for many tasks because it chooses a model based on what you are trying to do, without making you pick one manually every time.
  • Keep context focused. Start a new session when you switch tasks, compact long-running sessions when needed, and mention the files you want Copilot to use when you already know where the relevant code lives. Less unnecessary context means more of the session goes toward the actual work.
  • Avoid changing models or settings mid-session. Switching models, reasoning levels, context size, or tool configuration can break cache reuse and make Copilot rebuild context. Set up the session the way you want it, then keep related work together.
  • Plan before parallelizing. For larger tasks, ask Copilot to plan first. Parallel agents can be useful when work can truly be split up, but they also consume credits in parallel, so use them deliberately.
  • Use only the tools you need. Tools and MCP servers are powerful, but broad toolsets can add extra context. Enable what is relevant to the task and turn off what you do not need. Check out agent finder in GitHub Copilot to help streamline your tool usage.
  • Check your usage. Your AI usage page shows where credits are going across features and models. In Copilot CLI, session-level usage can also help you spot expensive patterns while you work.

For the full guide, see How to get more out of your AI credits.

Get started

Auto model selection is available today across supported Copilot experiences. To learn more, see the Auto model selection docs. You can also share feedback in Copilot discussions.

We are continuing to make Copilot more efficient across the system so more of your credits go toward useful work, without requiring you to tune every model choice yourself.

The post Getting more from each token: How Copilot improves context handling and model routing appeared first on The GitHub Blog.

Read the whole story
alvinashcraft
44 minutes ago
reply
Pennsylvania, USA
Share this story
Delete

AI is accelerating cyberattacks—here’s how to stay ahead

1 Share

In March, we wrote that identity security has become the new pressure point for modern cyberattacks. Since then, AI has only increased that pressure.

AI helps cyberattackers move faster across the attack chain: personalizing social engineering at scale, automating reconnaissance, analyzing leaked credentials, identifying privileged users, probing exposed systems, and adapting tactics in real time. Attacks that once depended on manual effort can now unfold with greater speed, scale, and autonomy.

Yet even as methods evolve, identity remains one of the most common entry points. Every account, admin, workload, application, non-human identity, and AI agent can become a path to sensitive data and critical systems if not properly secured. Attackers do not need to break every defense; they only need to compromise or misuse the right identity with the right access at the right moment.

When attacks are accelerated by AI, speed and accuracy in detection and response are critical. Identity security can no longer operate in silos. Even a minor delay between when a threat is detected and action is taken can be the difference between suspicious activity becoming a contained incident or a business-impacting breach. This shift is reshaping how organizations think about security. The imperative is becoming clear: identity and security teams need comprehensive visibility and integrated solutions that streamline how they prevent, detect, and respond to identity threats.

Securing the future of identity at the speed of AI

One of the biggest security challenges organizations face today is fragmentation, and identity security is no exception. IAM and SOC teams often work across separate tools, separate workflows, and separate operational models. But identity attacks don’t respect those organizational boundaries.

Modern identity attacks span infrastructure, access control, and detection. At Microsoft, we understand this, and we are continuing to expand how Microsoft Entra and Microsoft Defender work together to provide more unified identity security experiences.

Actionable intelligence, everywhere

At RSA earlier this year, we unveiled our unified identity risk score, a new way to turn broader attack-chain insight into real-time access decisions. This score analyzes and correlates relevant signals across related accounts, sessions, workloads, and applications to surface a single, comprehensive evaluation of an identity’s true risk level and enable more dynamic response directly within authentication flows as part of risk-based Conditional Access policies.

View of a risky user within Entra ID Protection with new identity risk score and attack timeline.

Identity admins also gain a stronger operational experience through the new Microsoft Entra ID Protection experience. Rather than forcing identity teams to piece together risk signals across disconnected views, the updated experience brings deeper visibility into risky users, sign-ins, workloads, and associated detections in one place. The new identity risk score adds another layer of context by surfacing insights across related accounts and activity, including signals from Microsoft environments and connected identity activity beyond them. This helps admins understand whether a risky user, agent, workload, or sign-in is an isolated event or part of a broader pattern spanning sessions, applications, and associated accounts.

New user dashboard in Entra ID Protection which provides deeper visibility for identity admins into risky users, sign-ins, and associated detections.

New risky user details view provides more information about a user’s risk and the attack timeline within Entra ID Protection.

That richer context gives identity teams a more complete view of how risk is developing across the identity estate. Admins can better understand how risk is calculated, which related accounts or workloads contributed to the score, what detections are driving concern, and why a given identity requires attention. By connecting Microsoft and cross-environment signals into a single evaluation, the risk score helps identity admins prioritize the identities that matter most, make more informed access decisions, and explain the rationale behind remediation actions with greater confidence.

For security operations teams, this new score helps prioritize and triage investigations faster by focusing analysts on the identities that pose the greatest risk. But knowing what to fix is only half the challenge. In many organizations, security operations teams lack the needed permissions to take action; instead, they can only wait for separate IAM workflows to resolve the issue. That delay creates friction during moments when response speed matters most. Some solutions address this by giving SOC teams, or the security application itself, broad standing permissions across the identity environment. That may solve the permissions issue, but it also expands the blast radius if the application or identity is misused or compromised.

Microsoft takes a different approach because our solution natively spans identity infrastructure, the identity control plane, and ITDR. Customers get streamlined workflows across the full identity security lifecycle, and with a new identity-focused RBAC role, coming soon in public preview, security operations teams can access the core identity response actions they need without broad administrative permissions. This allows organizations to preserve least privilege access while reducing operational friction between IAM and SOC teams. Combined with the native privileged identity management in Microsoft Entra, organizations can also create just-in-time access policies for these response roles, further reducing standing privilege while still enabling responders to elevate quickly during incidents and investigations.

Together, unified risk, the new Microsoft Entra ID Protection experience, and least-privilege response roles give identity and security teams the shared context and governed action paths they need to move from insight to response faster.

Shifting left with proactive prevention

Shifting identity protection left means addressing risk earlier, before it becomes an active threat or incident. By continuously strengthening posture and adapting access controls as conditions change, organizations can reduce exposure, improve resilience, and stay ahead of emerging risks.

The Conditional Access Optimization Agent continues to evolve to help organizations keep pace with a rapidly changing threat landscape. Instead of manually auditing policies or reacting after gaps are exposed, the agent continuously analyzes identity signals, usage patterns, and emerging threats to recommend the right policy changes at the right time. New recommendations, like the “Block risky user agent” policy, are designed to address emerging attack vectors such as agent-based abuse and automated access attempts. These optimizations give organizations a more adaptive way to enforce Zero Trust, where access decisions continuously adjust based on risk and context rather than relying on one-time configuration.

And as part of our continued effort to help customers close the loop and move beyond reactive responses, we are soon bringing more threat detections and insights from Defender that are automatically fed directly into the Conditional Access Optimization recommendations in Microsoft Entra. Administrators receive clear, explainable, and reviewable recommendations that outline why the change is important, who is impacted, and what action to take, empowering a more proactive and preventative approach to mitigating future attacks.

Accelerating response

In AI-accelerated attacks, response speed matters just as much as visibility. Manual investigation and response will always be necessary, but in today’s AI-accelerated threat landscape, defenders need automation that helps level the playing field. That’s why we were so excited to extend the Security Alert Triage Agent to identity scenarios and pair it with automatic attack disruption and new predictive shielding capabilities. Together, these capabilities create an end-to-end automation loop that helps defenders triage identity threats, disrupt active attacks, drive response, and continuously harden posture before the next incident.

At Microsoft Security, we are building toward that future by embedding this kind of adaptive, AI-driven enforcement directly into identity security. That means accelerating detection across the attack chain, speeding up investigation and response through AI, and ensuring every authentication and access decision reflects real-time risk. It also means bringing IAM and security operations closer together, so identity signals, policy enforcement, and incident response work as one continuous system rather than separate workflows.

The future of identity security

In the AI era, identity is not just a control point. It is the system that connects prevention, detection, and response into a single, adaptive defense system. And Microsoft is building and operating that system as both the identity provider and policy enforcement layer, with real-time risk signals that can immediately influence access decisions. The organizations that defend identity fastest will be the organizations that defend everything else better.

-Sandeep Deo and Yaron Paryanty

Additional resources

Learn more about Microsoft Entra 

Prevent identity attacks, ensure least privilege access, unify access controls, and improve the experience for users with comprehensive identity and network access solutions across on-premises and clouds.

The post AI is accelerating cyberattacks—here’s how to stay ahead appeared first on Microsoft Security Blog.

Read the whole story
alvinashcraft
44 minutes ago
reply
Pennsylvania, USA
Share this story
Delete

Engage customers with Teams Phone Agent and custom voice agents built in Copilot Studio

1 Share

It’s often difficult for businesses to serve every customer right away during surges in call volumes. Callers sit on hold while staff race to work through the backlog. Meanwhile, agentic voice AI is opening entirely new ways to serve customers, such as getting answers to questions or even enabling them to pay a bill over the phone, including after hours and on weekends.

We are excited to announce Teams Phone Agent, along with the ability to bring custom voice agents your organization builds in Microsoft Copilot Studio to Microsoft Teams Phone.         

For customer-facing organizations using Teams Phone, like healthcare clinics or bank branches, these agents take repetitive calls off the plate of employees so they can focus on the conversations that truly need a human touch. All this enables faster issue resolution for customers.

How Teams Phone Agent works

Teams Phone Agent greets callers and resolves common requests with these skills.

  • Questions and Answers: Using configured knowledge bases that support file uploads and URLs, Teams Phone Agent answers callers' questions in natural conversation, so customers get answers quickly instead of being placed on hold or hunting through a website.
  • Appointment scheduling: Teams Phone Agent enables callers to book new appointments, reschedule or cancel existing ones, and find upcoming appointment details so customers can lock in a time without playing phone tag.
  • Conversational Routing with intelligent transfers: Teams Phone Agent supports the same tried and tested routing from traditional auto attendants, such as user or extension lookup and transfer to a user, call queue, and more through conversation. When additional assistance is needed, Teams Phone Agent passes the caller along with the full context of the conversation to the right individual or department. Customers can skip cumbersome phone menus and don’t have to repeat themselves.
  • Multilingual: Teams Phone Agent supports multilingual conversations across 60+ supported languages, allowing callers to interact naturally in their preferred language and helping organizations deliver global voice experiences at scale.

Automate what’s unique to your business with Copilot Studio voice agents

When you need to automate processes unique to your business, like letting patients fill a prescription over the phone, custom voice agents that your organization builds in Copilot Studio step in. Teams Phone Agent can seamlessly hand off a call to custom voice agents whenever those specialized skills are needed. Alternatively, you can set things up so customers can dial a Copilot Studio custom voice agent directly through a Teams Phone line.

Voice agents with Teams Phone in action

Here are a few illustrative use cases for how Teams Phone Agent and Copilot Studio voice agents can help organizations engage their customers:

  • Answer routine questions without making callers wait. A healthcare clinic can use Teams Phone Agent to answer common questions about hours, locations, accepted insurance, and appointment preparation, helping patients get answers quickly while staff focus on care coordination.
  • Book and reschedule appointments over the phone. A home services company can use Teams Phone Agent to help customers schedule, confirm, or change appointments for plumbing or electrical repairs in natural conversation, helping reduce back-and-forth calls and freeing employees from repetitive scheduling work.
  • Route customers to the right expert with context. A bank branch can use Teams Phone Agent to understand what type of support a caller needs, such as assistance with completing a mortgage loan application. Teams Phone Agent can then transfer the call to the right team with the conversation context included.
  • Complete business-specific tasks with a Copilot Studio voice agent. A pharmacy can use a Copilot Studio custom voice agent to help customers request a prescription refill by phone and check order status, giving customers a simpler way to manage routine needs without waiting for staff assistance.
  • Support customers after hours. A utility provider can use a Copilot Studio custom voice agent connected to Teams Phone to let customers report an outage or get billing help outside normal business hours.

 

An expanding ecosystem of voice agents for Teams Phone

We want customers to have choice across first-party and third-party voice agents. That is why we are working with select solution developers to integrate their voice agents with Teams Phone. AudioCodes is announcing general availability of its voice agent for Teams Phone today, with additional solutions expected in the future.

Get started

Teams Phone Agent and the ability to integrate custom voice agents built in Copilot Studio are now accessible through the Frontier program. Organizations can join Frontier to get early access to Microsoft’s latest AI innovations. Learn more.

Licensing during the Frontier preview

  • Teams Phone Agent is available via the Frontier program. Service limitations may apply.
  • Custom Copilot Studio voice agent experiences—whether reached through a Teams Phone Agent hand-off or by direct dial—are accessible via the Frontier program and are billed consumptively at a rate based on the orchestration type your organization selects when building the agent in Copilot Studio. Learn more.
  • Billing for Copilot Studio voice agent experiences in Teams Phone will roll out by early July, and usage will not be charged prior to this rollout. Service limitations may apply. When the billing experience is rolled out, all Frontier preview users will be required to set up billing to continue using Copilot Studio voice agents for Teams Phone.
  • Tenants must also meet standard Teams Phone prerequisites, including a properly configured Teams Phone resource account. Learn more.
  • All licensing, pricing, and service limits are subject to change. Additional information will be communicated at general availability.
Read the whole story
alvinashcraft
44 minutes ago
reply
Pennsylvania, USA
Share this story
Delete

Azure OpenAI Architecture: The Decisions That Actually Matter (Part 3)

1 Share

Introduction

Part 1 of this series tackled the architectural decisions that shape any Azure OpenAI / Microsoft Foundry Models workload — capacity model, deployment scope, governance layer, grounding strategy, and quota engineering. Part 2 turned those decisions into a Well-Architected Framework discipline. Part 3 looks at the part that makes GenAI architecture genuinely different from a traditional service: the platform itself never stops moving.

Models are released, promoted to GA, moved to Legacy, deprecated, and eventually retired. New regions come online; certain features (such as Priority Processing) light up only on specific model versions and deployment scopes. Fine-tuned models inherit the lifecycle of their base. Performance characteristics shift between releases. Reliability in this world is not just uptime — it is the ability to absorb continuous change without disrupting production.

That discipline is GenAIOps: the people, processes, and tooling that turn model upgrades from emergency events into routine operations. Part 2 already covers the core lifecycle mechanics and upgrade policy trade-offs through a Well-Architected lens. Part 3 stays focused on the operational and architectural practices that make change safe: evaluation of pipelines, observability, routing patterns, prompt governance, and abstraction. Where details are time-sensitive — stage thresholds, SLA windows, regional rollout delays, capacity tier eligibility — they are flagged with "At the time of writing". Always confirm current behavior against Microsoft Learn before committing to a design.

 

Who is this series for?

  • Cloud and Solution Architects
  • Platform and product owners
  • Senior developers responsible for operating Azure OpenAI workloads in production

What you’ll learn in Part 3: 

  • How to build an evaluation pipeline that promotes model upgrades the way CI/CD promotes code.
  • How to instrument full-stack observability so regressions surface early (latency, errors, token trends, quality drift).
  • How the Model Router pattern, canary releases, and tier-aware fallbacks turn model change into a configuration concern.
  • How to govern prompts as production artifacts with versioning, feature-flagged rollouts, and regression testing.
  • How to manage lifecycle-dependent assets (fine-tuned models) and regional rollout realities without firefighting — plus a GenAIOps Decision Matrix you can reuse as a checklist.

We’ve also included a summary decision matrix at the end of this post for quick reference.

1. Model lifecycle (recap)

Azure OpenAI/Microsoft Foundry models are living dependencies: new versions are released, promoted from Preview to GA, then eventually move through deprecation toward retirement. To avoid surprises, treat every deployed model version as having an expiration date and design so you can swap versions without rewriting application code. In general, use the Standard deployment auto-upgrade mode that preserves stability but guarantees continuity at retirement, and plan to deliberate blue/green migrations for dedicated (provisioned) capacity where auto-upgrade is not available. For the deeper mechanics (upgrade modes, retirement behavior, and migration playbooks), refer to Part 2’s Reliability section; the rest of this article focuses on the GenAIOps practices that make those upgrades routine.

 

                                                                                             Figure 1 — Models lifecycle

2. GenAIOps: Evaluating Before Promoting

Upgrading a model should not be a manual, subjective exercise. Azure AI Foundry provides evaluation capabilities that, combined with a regression prompt suite, turn model upgrades into measurable, repeatable decisions:

  • Side-by-side prompt comparisons across model versions.
  • Automated quality scoring (relevance, coherence, groundedness, safety, and fluency).
  • Structured-output validation (JSON conformance, schema validation).
  • Batch testing across comprehensive prompt libraries representative of real production traffic.
  • Custom evaluation metrics tailored to your domain.

Architectural best practice:

  • Maintain a curated regression prompt suite that mirrors real production traffic — including the long tail.
  • Run evaluation pipelines against candidate models before any production cut-over.
  • Integrate evaluation into CI/CD using Azure DevOps, GitHub Actions, or similar automation.
  • Define quality gates that must pass before promotion (e.g., groundedness ≥ a target threshold, p95 latency under a target budget). Pick numbers that fit your workload, not the article.

Model promotion should require passing the evaluation gates the same way application code requires passing unit tests. Without automated evaluation, model upgrades become high-risk, low-visibility events that teams avoid until forced by retirement deadlines — the exact pattern that keeps lifecycle work in the "emergency" column instead of the "scheduled" column.

Example evaluation workflow:

  • Trigger — a new model version reaches GA, or your migration playbook hits the R-90 step.
  • Deploy — the candidate model goes to a staging deployment.
  • Regress — the prompt suite (typically several hundred to several thousand prompts) is run against the candidate.
  • Compare — the candidate's outputs are scored against the current production model.
  • Inspect — humans review flagged differences; metrics, latency distributions, and cost-per-request go on the dashboard.
  • Gate — an approval step (manual or automated) decides whether the candidate proceeds to blue/green production deployment.

 

 

                                                                                       Figure 2 — Evaluation Pipeline

3. Observability: Full-Stack or It Didn't Happen

GenAIOps is more than one-time evaluation. Once a candidate's model has been promoted, you need continuous, end-to-end observability across the request path — not just at the model boundary. Without it, you are operating blind during model transitions.

At a minimum, instrument:

  • Prompt processing time (gateway through model invocation).
  • Model inference latency, expressed as p50, p95, and p99 — averages hide the experience of the slowest 5% of users.
  • Token consumption (prompt tokens, completion tokens, total) trended over time.
  • Error rates by class (429 throttling, 503 service unavailable, 400 validation errors, content-filter rejections).
  • Model version distribution — which versions are actually serving traffic right now.
  • User-satisfaction signals (thumbs-up/down, explicit feedback, session abandonment).

Many performance regressions only surface at scale. A model version that performs well in evaluation against a few hundred prompts may behave differently under production traffic patterns. Plan for that.

A practical metrics architecture on Azure tends to combine:

  • Application Insights for end-to-end request tracing across the application and gateway.
  • Azure Monitor for infrastructure, quota, and PTU utilization of metrics.
  • Custom telemetry for prompt-level success/failure tracking and quality scoring.
  • Log Analytics for forensic analysis when a regression is suspected.

Drift in model behavior rarely shows up as a single broken request — it surfaces as a slow shift in tail latency, fallback rate, or user-satisfaction signal. Monitoring that only looks at average will miss it.

4. The Model Router Pattern

As GenAI systems mature, a static single-model architecture becomes both limiting and expensive. A Model Router introduces dynamic, intelligent model selection in front of one or more model deployments.

Typical responsibilities of a router:

  • Send simple queries to a smaller, faster model and complex reasoning to a larger one.
  • Run canary releases of new model versions with percentage-based rollouts.
  • A/B test model variants to measure quality, latency, and cost differences.
  • Route to the right capacity tier — including falling back from Provisioned to Standard during migrations or capacity constraints.
  • Where the workload also needs lower-variance latency on the Standard side, route latency-critical traffic through Priority Processing on a Global Standard or Data Zone Standard (US) deployment, on a model version that supports it. (At the time of writing, Priority Processing is enabled by setting the service_tier attribute on the request and requires a model version released on or after 2025-12-01 — verify both eligibility constraints on Microsoft Learn before depending on it.

Decision logic can be driven by any combination of:

  • Query complexity (simple heuristics or a lightweight classifier).
  • User tier (e.g., free vs premium).
  • Response-time requirements (interactive vs background).
  • Cost constraints — pick the cheapest model that meets the quality bar.
  • Regional model availability and capacity.

Implementation options:

  • Azure API Management — built-in routing policies, weighted backends, retry policies.
  • Azure Front Door — global routing with health probes.
  • Custom routing service — maximum flexibility, more operational overhead.
  • Semantic Kernel or LangChain — framework-level routing logic embedded in the application.

Beyond cost and performance, the Model Router pattern decouples the application layer from any single model version. That decoupling is what makes lifecycle management tractable: when a model moves to Legacy, you change a router rule, not application code.

 

                                                                    Figure 3 — Model Router Architecture vs Blue/Green Deployment

 

5. Prompt Lifecycle Governance

Prompts are not strings embedded in code. They are production artifacts that influence quality, cost, and safety, and they evolve almost as often as the models behind them. Treat them as first-class assets.

Prompt templates

Separate stable system instructions from dynamic content (user input and retrieved context). This lets you version, test, and audit each layer independently.

Version control

Store prompts in Git — full history, code review, branching, and tagging. Treat prompt changes the way you treat code changes: pull request, review, and test before merging.

Feature-flagged rollouts

Roll out prompt changes gradually using feature flags. Monitor the impact on a subset of users before exposing the change broadly. The same observability stack that watches model upgrades should watch prompt rollouts.

Regression testing

Maintain a regression suite of expected prompt behaviors and run it whenever prompts or models change. The suite reuses the same evaluation pipeline you built in Section 2.

Prompt-level metrics

  • Success rate — did the prompt achieve its intended outcome?
  • Fallback rate — how often did users rephrase or abandon?
  • Satisfaction score — explicit user feedback.
  • Token efficiency — average tokens per successful completion (a leading indicator of cost regression).

PII and privacy safeguards

Customer prompts and completions are not used to train base models. That means logging is safe for debugging — but defense in depth still applies:

  • Redact PII (names, emails, phone numbers, addresses) before logs are written.
  • Apply RBAC on Log Analytics workspaces so only the right roles can access raw prompt data.
  • Govern data retention with automated purging after a defined window.
  • Keep audit trails of who accessed which logs and when.

Prompt quality is not a one-time effort. It is an ongoing operational discipline that needs tooling, processing, and measurement, in the same way application code does.

 

                                                                                        Figure 4 — Prompt Lifecycle Governance

6. Fine-Tuned Models: The Hidden Retirement Risk

Fine-tuned models inherit the lifecycle of their base model. That creates a cascading retirement risk that many teams overlook.

During base-model deprecation:

  • New fine-tuning jobs against that base are blocked — you can no longer create new fine-tuned versions.
  • Existing fine-tuned deployments continue serving inference, with no immediate impact.

When the base model is retired:

  • Fine-tuned deployments stop responding (HTTP 404), exactly like any other deployment pinned to a retired version.

The migration imperative is straightforward: retrain fine-tuned models on the successor base model well before the retirement date, ideally during the predecessor's Legacy or Deprecated stage.

Architectural considerations:

  • Track base-model dependencies explicitly in your asset inventory — the same place you track library and runtime versions.
  • Schedule retraining workflows aligned with base-model lifecycle dates, not with team availability.
  • Validate fine-tuned model quality on the new base; behavior can shift between base versions.
  • Keep training datasets in version-controlled storage, so retraining is reproducible.
  • Re-evaluate whether fine-tuning is still necessary; newer base models, combined with better prompting (few-shot, chain-of-thought, structured outputs), sometimes remove the need entirely.

Common mistake: investing heavily in fine-tuning without budgeting for the recurring retraining cost and lifecycle overhead. Improved prompting on a newer base model is often the cheaper path.

7. Regional Rollouts and Multi-Region Strategy

Successor models are not always available in every Azure region simultaneously. Microsoft typically releases a new version in a subset of regions first, with broader rollout following over weeks or months. At the time of writing, the regional rollout schedule is published per model on Microsoft Learn — confirm before assuming a particular region will receive a release on a particular day.

Maintain staging deployments in early-release regions

Even if production runs elsewhere, maintain a staging deployment in regions that tend to receive new models earliest. That gives you visibility into the successor's behavior before it auto-upgrades into your primary region.

Pre-test successor models before primary auto-upgrades

If your production deployment uses "Once the current version expires", the upgrade will happen automatically. Pre-testing in an early-release region lets you catch behavioral changes before they hit live traffic.

Multi-region routing for lifecycle flexibility

Azure Front Door or Azure API Management with multi-region back-ends lets you route based on model availability, capacity headroom (one region may have quota while another is exhausted), and latency. Combined with the Model Router pattern from Section 6, this turns regional staggering from a constraint into an option.

Account for capacity-tier eligibility in your routing

Some capacity tiers are scoped to specific deployment scopes — Priority Processing, for example, is offered on Global Standard and Data Zone Standard (US) deployments at the time of writing. Bake those eligibility constraints into routing rules, so a fallback path does not silently land in an ineligible deployment.

Multi-region strategy is no longer just a disaster-recovery concern. It is also lifecycle resilience — the ability to test, stage, and absorb model changes without coupling your platform to a single region release schedule.

8. Future-Proofing Through Abstraction

Future-proofing is architectural, not procedural. The goal is to design systems that adapt to change without requiring code rewrites every time a model is promoted, deprecated, or retired.

Abstract model calls behind a service layer

Avoid calling Azure OpenAI APIs directly from the application code. Introduce an internal Model Service that owns model selection, retry and fallback, prompt-template lookup, and response validation. The application asks for an outcome ("summarize this", "classify that"); the Model Service decides which model and which prompt to use.

Externalize model names and configuration

Store model identifiers, versions, and parameters in configuration or feature flags — never as hard-coded strings. Changing models then becomes a configuration change, not a deployment.

Centralize prompt logic

Maintain prompts in a registry or template repository, not scattered across codebases. This enables centralized versioning, A/B testing without code changes, and prompt optimization that is decoupled from application releases.

Avoid scattering model identifiers across the codebase

Use constants, enums, or configuration references rather than literal model strings repeated across many files. The number of files that have to change at upgrade time is a leading indicator of how painful the upgrade will be.

Benefits of abstraction:

  • Seamless model replacement — swap models without touching application logic.
  • Multi-model strategies — the Model Router pattern becomes trivial to add.
  • Provider flexibility — integrating additional or alternative providers becomes a service-layer change, not an application to rewrite.
  • Faster adoption of new capabilities — reasoning controls, function calling, structured outputs land in one place.

Common mistake: Prototyping with direct API calls for speed and never refactoring. The technical debt accumulates until a model upgrade requires an emergency engineering sprint.

 

                                                                        Figure 5 — Abstraction Layer for Future-Proofing.

Final Perspective

The most important shift this article asks for is a change in operational mindset:

  • Model upgrades are not emergencies. They are scheduled events.
  • Retirement deadlines are not surprising. They are published timelines, often with months of notice.
  • Architecture fails when teams treat models as static dependencies. They succeed when they treat models as evolving infrastructure.

In practice, GenAIOps means:

  • Automated evaluation that runs continuously, not just during migrations.
  • Controlled rollouts using blue/green or canary patterns.
  • Observability-driven decisions based on metrics, not intuition.
  • Lifecycle-aware planning, with retirement dates tracked alongside library and runtime upgrades.
  • Modular design that decouples applications from specific model versions.

Across the three parts of this series we have covered the architectural decisions that frame an Azure OpenAI / Microsoft Foundry Models workload (Part 1), the Well-Architected Framework discipline that keeps it sustainable (Part 2), and the GenAIOps practices that let it evolve without firefighting (Part 3). The organizations that succeed long-term are the ones that plan for model evolution from day one, invest in evaluation and observability tooling, decouple application logic from model specifics, and treat prompts and configurations as versioned artifacts.

Generative AI architecture is not about deploying a model endpoint. It is about building a platform that absorbs change gracefully as the AI landscape shifts. The retirement of a model should be a routine operational event, not a crisis. If your architecture makes model upgrades feel risky or expensive, refactor before the next retirement deadline forces your hand.

Lifecycle & GenAIOps Decision Matrix

Use this as a checklist when reviewing or signing off on the GenAIOps posture of an Azure OpenAI / Microsoft Foundry Models platform. One row per decision; one rule of thumb per row.

Area

Decision

Rule of thumb

Watch out for

Lifecycle

Version expiry tracking

Treat model versions as expiring dependencies: inventory every deployed model/version, track deprecation/retirement dates, and design so swapping versions is a configuration change (details on upgrade modes in Part 2).

Pinning versions without an owner; discovering retirement dates after an outage or emergency migration window.

Evaluation

Promotion gates

Pass the regression suite + meet domain-specific quality and latency thresholds before promoting any model.

Subjective "feels better" sign-off; gates that exist on paper but never block a release.

Evaluation

Pipeline integration

Evaluation runs in CI/CD on every candidate; the same suite watches prompt changes.

Manual evaluation runs that only happen under retirement pressure.

Observability

Latency and error metrics

Track p50/p95/p99 latency, 429/503/4xx rates, token trend, and model-version distribution. Alert on tail latency and sustained throttling.

Average-only dashboards; missed Service Health notifications for model retirements.

Observability

Quality drift

Trend per-prompt success rate, fallback rate, and user-satisfaction signals; surface drift before users complain.

Treating quality as a one-time evaluation event.

Architecture

Model Router

Centralize model selection, canary, and fallback (including Priority Processing on eligible deployments) behind a router service.

Application code that calls a specific model deployment by name; routing logic scattered across services.

Architecture

Abstraction layer

Application code asks for an outcome; the Model Service decides which model and prompt; configuration drives model selection.

Hard-coded model identifiers across many files; bypass paths that skip the service layer.

Prompts

Prompt governance

Prompts in Git, behind feature flags, with regression tests, prompt-level metrics, and PII redaction in logs.

Prompts copy-pasted across services; PII in logs; no rollback path for a regressed prompt.

Fine-tune

Fine-tuned model lifecycle

Track fine-tuned models against base-model dates; schedule retraining during the predecessor's Legacy/Deprecated window.

Treating fine-tuned models as permanent infrastructure; lost or unversioned training datasets.

Regional

Multi-region for lifecycle resilience

Maintain staging in early-release regions; route across regions to absorb staggered rollouts and capacity gaps.

Single-region production with no early-release staging; routing rules that ignore tier-eligibility constraints.

Disclaimer

I am a Microsoft employee. The views and opinions expressed in this article are my own and do not necessarily reflect those of Microsoft. This content is informational and educational; it is not an official Microsoft statement, recommendation, or commitment. Service tiers, model availability, lifecycle stages, deprecation timelines, regional rollouts, pricing, and SLAs evolve — always validate against the latest Microsoft Learn documentation before making architectural or migration decisions.

References

Read the whole story
alvinashcraft
44 minutes ago
reply
Pennsylvania, USA
Share this story
Delete

Microsoft 365 Copilot app learning series: From research to polished outputs

1 Share

Work today often spans chats, files, meetings, and tools. Our latest Microsoft 365 Copilot app learning series explores how connected experiences like Researcher, Copilot Pages, Copilot Notebooks, and Create can support workflows that move work from research and planning to collaboration and content creation. 

Across the series, you’ll see practical scenarios for organizing information, building shared context, creating content, and turning ideas into shareable outputs. 

Here’s what we will cover:

Researcher agent: Generate structured, source cited reports

Researcher in the Microsoft 365 Copilot app is designed to help with multi-step research tasks using web information and the work content you already have access to, such as files, emails, meetings, and chats. 

In this session, you’ll learn how to: 

  • Start with a research question and refine the scope 
  • Build structured reports with cited sources 
  • Adjust research direction to stay focused on your scenario 
  • Review summaries, visuals, and references before sharing results 
  • Export or adapt content into formats such as Word, PowerPoint, or PDF, when available 

This session is especially useful for market research, competitive analysis, planning, and stakeholder preparation where source visibility and review are important. 

PresentersCarl Mekala and Leo Daniel Gnanamanickam 

Watch video.

Copilot Pages: Turn chat responses into editable, shareable content

Copilot Pages helps you take a Copilot response and turn it into content you can edit, refine, and collaborate on with others. 

In this session, you’ll learn how to: 

  • Generate a structured draft from Copilot Chat 
  • Continue working in a Copilot Page 
  • Revise, expand, or reformat content using Copilot 
  • Tailor content for different audiences and use cases 
  • Share Pages for real-time collaboration across Teams, Outlook, or the Microsoft 365 app, depending on availability

This session is useful for brainstorming and content drafting, where teams need to turn ideas into shared, polished outputs. 

PresentersCora Chen and Oby Omu 

Watch video.

Copilot Notebooks: Organize project materials and get insights

Copilot Notebooks is designed to provide a focused workspace where you can bring together project materials such as files, notes, Pages, and other references to keep work grounded in the context you choose.

In this session, you’ll learn how to: 

  • Bring project references into one organized workspace 
  • Use Notebook overviews to identify themes and insights 
  • Guide responses with custom instructions 
  • Ask questions grounded in your selected references 
  • Generate supporting materials such as summaries or audio overviews, when available 
  • Collaborate while managing sharing and permissions settings

This session is especially helpful for ongoing projects that involve multiple content sources, project planning, and cross-team collaboration. 

PresentersMedha Madangarli and Derek Liddell 

Watch video.

Create: Turn ideas into branded visuals

Create can be used to generate visual assets and creative content from prompts, project context, and available brand resources. 

In this session, you’ll learn how to: 

  • Generate visual concepts from prompts 
  • Apply brand kits and brand elements, when available 
  • Create social-ready assets such as carousel posts 
  • Turn presentations into short videos or other shareable formats, when available 
  • Refine outputs so they better align to your audience, message, and brand guidance 

This session is useful for teams exploring practical ways to create shareable visual content while keeping brand alignment part of the process. 

PresentersLudo Ulrich and Ray Curiel 

Watch video.

Keep learning 

Have questions after watching the series? Join us on Tuesday, June 30, for a live Ask Microsoft Anything (AMA) with product experts focused on real-world scenarios, product tips, and questions from the community.

Whether you’re getting started or exploring advanced workflows, the AMA is a chance to hear practical guidance and learn how others are using Microsoft 365 Copilot app experiences in their everyday work. 

Register here.

 

Learn about the Microsoft 365 Insider program and sign up for the Microsoft 365 Insider newsletter to get the latest information about Insider features in your inbox once a month!

Read the whole story
alvinashcraft
45 minutes ago
reply
Pennsylvania, USA
Share this story
Delete
Next Page of Stories