As enterprise customers roll out and govern AI agents through Agent 365, they have been asking for pre-canned evals they can run out of the box. They want transparent, reproducible evaluations that reflect their own work in realistic environments, including interoperability: how agents connect across stacks and into Microsoft Agent 365 systems and tools. In response, we are investing in a comprehensive evaluation suite across Agent 365 Tools with realistic scenarios, configurable rubrics, and results that stand up to governance and audit as customers deploy agents into production. Introducing Evals for Agent Interop, the way to evaluate those cross-stack connections end to end in realistic scenarios.
Introducing Evals for Agent Interop
As a first step, we’re launching ‘Evals for Agent Interop’, a starter evaluation kit. ‘Evals for Agent Interop’ provides curated scenarios and representative data that emulate real digital work, along with an evaluation harness that organizations can use to self-run their agents across Microsoft 365 surfaces (Email, Documents, Teams, Calendar, and more). It’s designed to be simple to start, yet capable enough to reveal quality, efficiency, robustness, and user experience tradeoffs between agent implementations, so organizations can make informed choices quickly.
Get started: Download the starter evals and harness from our repo (https://aka.ms/EvalsForAgentInterop). We currently support Email and Calendar scenarios, and we’re rapidly expanding the kit with new scenarios, richer rubrics, and additional judge options.
Leaderboard: Strawman agents, frameworks, and LLMs
To help organizations benchmark and compare, we’re introducing a leaderboard that reports results for strawman agents written using different stacks, each a combination of an agent embodiment framework (e.g., Semantic Kernel, LangGraph) and an LLM (e.g., GPT 5.2). This gives organizations a clear view of how various approaches perform on the same scenarios and rubrics. The leaderboard will evolve as we add more agent types and frameworks, helping organizations determine the right set of agents for their Agent 365 Tools.
Why it matters
Customers want to optimize their AI agents for their unique business more easily. Enterprise AI is shifting from isolated model metrics to customer-informed evaluation. Businesses want to define rubrics, calibrate AI judges, and correlate offline results with production signals, tightening iteration cycles from months to days to hours. At Microsoft, we recognize that customers expect to bring their own grading criteria and scrutinize datasets for domain fit before they trust an agent in their environment. ‘Evals for Agent Interop’ is purpose-built for this new reality, unifying evaluation needs in one path: start with pre-canned evals, then tailor to your context.
How Evals for Agent Interop works
‘Evals for Agent Interop’ ships with templated, realistic and declarative evaluation specs. The harness measures programmatically verifiable signals (schema adherence, tool call correctness, policy checks) alongside calibrated AI judge assessments for qualities like helpfulness, coherence, and tone. This yields consistent, transparent, and reproducible results that teams can track over time, compare across agent variants, and share across organizations.
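To make the split between programmatic signals and judge assessments concrete, here is a minimal TypeScript sketch of how a declarative spec might be scored. Every name in it (EvalSpec, AgentRun, evaluateRun, the three checks, the judgeWeight blend) is a hypothetical illustration and not the schema or API of the actual harness; the judge score is assumed to be supplied by the caller as a number between 0 and 1.

```typescript
// Hypothetical shapes for illustration only; the real harness may differ.
interface ToolCall {
  tool: string;
  args: Record<string, unknown>;
}

interface AgentRun {
  toolCalls: ToolCall[];
  finalMessage: string;
}

interface EvalSpec {
  scenario: string;
  expectedTool: string;       // tool the agent is expected to call
  requiredArgs: string[];     // argument names that must be present
  forbiddenPhrases: string[]; // simple policy check on the final message
  judgeWeight: number;        // share of the score given to the AI judge, 0..1
}

interface EvalResult {
  scenario: string;
  checks: { toolCallCorrect: boolean; schemaAdherent: boolean; policyClean: boolean };
  score: number;
}

function hasAllArgs(args: Record<string, unknown>, required: string[]): boolean {
  return required.every((name) => name in args);
}

// Blends the share of passed programmatic checks with a caller-supplied judge score.
function evaluateRun(spec: EvalSpec, run: AgentRun, judgeScore: number): EvalResult {
  let call: ToolCall | null = null;
  for (const candidate of run.toolCalls) {
    if (candidate.tool === spec.expectedTool) {
      call = candidate;
      break;
    }
  }
  const message = run.finalMessage.toLowerCase();
  const checks = {
    toolCallCorrect: call !== null,
    schemaAdherent: call !== null && hasAllArgs(call.args, spec.requiredArgs),
    policyClean: !spec.forbiddenPhrases.some((p) => message.indexOf(p.toLowerCase()) !== -1),
  };
  const flags = [checks.toolCallCorrect, checks.schemaAdherent, checks.policyClean];
  const programmatic = flags.filter((ok) => ok).length / flags.length;
  const score = (1 - spec.judgeWeight) * programmatic + spec.judgeWeight * judgeScore;
  return { scenario: spec.scenario, checks, score };
}
```

Keeping the individual checks in the result, rather than only the blended score, is what lets a team see why a run lost points when tracking results over time.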
How it will evolve into a full evaluation suite
We’re building toward a full suite that helps organizations choose the right set of agents to run on their Agent 365 Tools:
Product teams within Microsoft define rubrics, train and calibrate judges, ship scenarios and data, and correlate offline scores with production metrics.
Customers bring their own data and grading logic via a that becomes the single source of truth for both offline grading and online guardrails at runtime. We’ll support custom tenant rubrics, with LLM or human grading for ambiguous cases.
Packaged governance includes audit trails, documented rubrics, and privacy posture aligned to usage. Over time, we intend to co-publish capability manifests, tool schemas, and calibration methods to foster transparency and community validation.
What can organizations do with the Evals for Agent Interop kit?
With ‘Evals for Agent Interop’, organizations can compare multiple agent candidates head-to-head on the same scenarios and rubrics, quantify quality and risk controls, and verify improvements (for example, a fine-tuned model or a different LLM) before broad rollout. As we expand the suite, these offline signals will align with online evaluation, so organizations can move from confidence to controlled deployment that is faster and safer, with clearer accountability.
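A head-to-head comparison of this kind can be sketched in a few lines of TypeScript. The input shape (a map from candidate name to per-scenario scores) and the choice to report both the mean and the worst score are assumptions made for illustration; the kit's actual report format is not described here.

```typescript
// Hypothetical report shapes; not taken from the Evals for Agent Interop repo.
interface ScenarioScore {
  scenario: string;
  score: number; // 0..1, higher is better
}

interface CandidateReport {
  candidate: string;
  mean: number;
  worst: number;
}

// Ranks candidates by mean score across the same scenarios, best first.
function compareCandidates(results: Record<string, ScenarioScore[]>): CandidateReport[] {
  const reports: CandidateReport[] = [];
  for (const candidate of Object.keys(results)) {
    const scores = results[candidate].map((s) => s.score);
    if (scores.length === 0) {
      continue; // nothing to compare for this candidate
    }
    const mean = scores.reduce((sum, s) => sum + s, 0) / scores.length;
    const worst = scores.reduce((min, s) => (s < min ? s : min), scores[0]);
    reports.push({ candidate, mean, worst });
  }
  return reports.sort((a, b) => b.mean - a.mean);
}
```

Reporting the worst score next to the mean is one simple way to surface risk: a candidate with a high average can still fail badly on a single scenario.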
Where to start (and what’s next)?
Clone the GitHub repo (https://aka.ms/EvalsForAgentInterop) with the starter evals and harness. Run the included scenarios to baseline your agents and understand gaps.
Tailor rubrics to your domain, then re-run to see how agent behavior shifts under your constraints.
We’ll expand ‘Evals for Agent Interop’ with new scenario families (document collaboration, communications, scheduling and tasking), richer scoring, and broader judge options, while integrating more tightly with Agent 365 Tools so evaluations and runtime guardrails share one source of truth.
I was building a Modal component that uses the <dialog> element’s showModal method. While testing the component, I discovered I could tab out of the <dialog> (in modal mode) and onto the address bar.
And I was surprised, because accessibility advice around modals has commonly taught us to trap focus within the modal. So this seemed wrong to me.
Upon further research, it seems like we no longer need to trap focus within the <dialog> (even in modal mode). So focus trapping is outdated advice if you use <dialog>.
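Here is a minimal TypeScript sketch of what that means in practice: open the dialog with showModal and add no Tab-key handling of your own. The ModalDialogLike interface and the openModal and closeModal helpers are names I made up for illustration; the interface only lists the HTMLDialogElement members the sketch uses, so a real dialog element can be passed in.

```typescript
// The subset of HTMLDialogElement used below, declared structurally so the
// sketch also type-checks outside a browser. A real <dialog> element fits it.
interface ModalDialogLike {
  open: boolean;
  showModal(): void;
  close(returnValue?: string): void;
}

// Opens the dialog in modal mode. There is deliberately no keydown listener
// that intercepts Tab: the browser handles focus, and the user can still
// reach the browser chrome.
function openModal(dialog: ModalDialogLike): void {
  // Guard first, because showModal() can throw on a dialog that is already open.
  if (!dialog.open) {
    dialog.showModal();
  }
}

function closeModal(dialog: ModalDialogLike, returnValue?: string): void {
  if (dialog.open) {
    dialog.close(returnValue);
  }
}
```

In a page, this would be called with something like openModal(document.querySelector("dialog") as HTMLDialogElement) from a button's click handler.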
Some notes for you
Instead of asking you to read through the entire GitHub Issue detailing the discussion, I summarized a couple of key points from notable people below.
Here are some comments from Scott O’Hara that tell us about the history and context of the focus-trapping advice:
WCAG is not normatively stating focus must be trapped within a dialog. Rather, the normative WCAG spec makes zero mention of requirements for focus behavior in a dialog.
The informative 2.4.3 focus order understanding doc does talk about limiting focus behavior within a dialog – but again, this is in the context of a scripted custom dialog and was written long before inert or <dialog> were widely available.
The purpose of the APG is to demonstrate how to use ARIA. And, without using native HTML features like <dialog> or inert, it is far easier to trap focus within the custom dialog than it is to achieve the behavior that the <dialog> element has.
Both the APG modal dialog and the WCAG understanding doc were written long before the inert attribute or the <dialog> element were widely supported. And, the alternative to instructing developers to trap focus in the dialog would have been to tell them that they needed to ensure that all focusable elements in the web page, outside of the modal dialog, received a tabindex=-1.
Léonie Watson weighs in and explains why it’s okay for a screen-reader user to move focus to the address bar:
In the page context you can choose to Tab out of the bottom and around the browser chrome, you can use a keyboard command to move straight to the address bar or open a particular menu, you can close the tab, and so on. This gives people a choice about how, why, and what they do to escape out of the context.
It seems logical (to me at least) for the same options to be available to people when in a dialog context instead of a page context.
Finally, Matatk shared the conclusion from the W3C’s Accessible Platform Architectures (APA) Working Group, which approved the notion that <dialog>’s showModal method doesn’t need to trap focus.
We addressed this question in the course of several APA meetings and came to the conclusion that the current behavior of the native dialog element should be kept as it is. So, that you can tab from the dialog to the browser functionalities.
We see especially the benefit that keyboard users can, for example, open a new tab to look something important up or to change a browser setting this way. At the same time, the dialog element thus provides an additional natural escape mechanism (i.e. moving to the address bar) in, for example, kiosk situations where the user cannot use other standard keyboard shortcuts.
From what I’m reading, it sounds like we don’t have to worry about focus trapping if we’re properly using the Dialog API’s showModal method!
Hope this news makes it easier for you to build components. 😉
Who will build the next version of the web? Mozilla wants to make it more likely that it’s you. We are committing time and resources to bring experienced builders into Mozilla for a short, programmed period, to work with our New Products leaders to build tools and products for the next version of the web.
A different program from a different kind of company
Our mission at Mozilla is to ensure the internet is a global public resource, open and accessible to all. We know that there are a lot of gifted, experienced and thoughtful technologists, designers, and builders who care as deeply about the internet as we do – but seek a different environment to explore what’s possible than what they might find across the rest of the tech industry.
Pioneers is intentionally structured to make it possible for those who don’t typically get the opportunity to create new products to participate. The program is paid, flexible (i.e. you can do it part-time if needed), and bounded. We’re not asking you to gamble your livelihood in order to explore how we can improve the internet.
This matters to me
My own career advanced the most dramatically in moments when change was piling on top of change and most people couldn’t grasp the compounding effects of these shifts. That’s why I stepped up to start an independent blogging company back in 2002 (Gizmodo) and again in 2004 (Engadget).
It’s also why, a lifetime later, I joined Mozilla to lead New Products, where I’ve had the good fortune of supporting the development of meaningful new Mozilla tools like Solo, Tabstack, 0DIN, and an enterprise version of Firefox.
Changing the game
We’ve designed Pioneers to make space for technologists — professionals comfortable working across code, product, and systems — to collaborate with Mozilla on foundational ideas for AI and the web in a way that reflects these shared values.
We’re looking for people to work with; this is not a contest for ideas, and you don’t apply with a pitch deck. Our vision:
Pioneers are paid. Participants receive compensation for their time and work.
It’s flexible, designed so participants can be in the program and continue to work on existing commitments. You don’t have to put your life on hold.
It’s hands-on. Builders work closely with Mozilla leaders to prototype and pressure-test concepts.
It’s bounded. The program is time-limited and focused, with clear expectations on both sides.
It’s real. Some ideas will move forward inside Mozilla. Some will not – and they’ll still be valuable. If it makes sense, there will be an opportunity for you to join Mozilla full-time to bring your concept to market.
Applications are open Monday, Jan. 26 and close Monday, Feb. 16, 2026.
Pioneers isn’t an accelerator, and it isn’t a traditional residency. It’s a way to explore foundational ideas for AI and the web in a high-trust environment, with the possibility of continuing that work at Mozilla.
If this sounds like the kind of work you want to do, we want to hear from you. Hopefully, by reading to the end of this post, you’re either thinking of applying yourself or know someone who should. I encourage you to check out (and share) Mozilla Pioneers. Thanks!
Retrospect is a perfect WordPress theme for photographers. It displays images at full resolution alongside your post content.
About 5.93% of WordPress.com users choose this theme for photo blogs and visual-first sites.
My experience
I liked the minimalist approach of the Retrospect theme. It’s a strong choice for art, travel, or photography blogs, where the focus should stay on visuals.
The layout is distraction-free, so your images can shine without clutter.
It also comes with patterns for newsletter sign-up, booking forms, and contact sections.
Setup is fast and intuitive. On mobile, images still look sharp without taking over the whole screen.
Choose this theme to:
Build a visual-first site, whether that’s a photo journal or personal blog.
Twenty Twenty-Three is a minimalist theme that gives you a clean starting point without heavy styling.
It offers plenty of style variations, but the base design stays flexible — more blank canvas than finished product.
About 5.53% of WordPress.com users choose this theme.
My experience
I liked the style variation options in this theme. The color palettes and typography options are vastly different from one another, making it easy to match the look to different kinds of sites.
If you’re building a simple one-page website, the template library and patterns make it easy to get started.
The minimal base also gives you room to experiment, which is great when you want more creative control over the design.
Choose this theme to:
Create a minimal design without heavy styling.
Maintain full creative control over your website’s design.
Twenty Twenty-Five sits at the cusp of a blank canvas and a fully designed premium theme.
It hits the right balance if you want something that looks and feels polished but is still easily customizable to your needs.
This theme is chosen by 3.83% of WordPress.com users.
My experience
My favorite part about this theme is the new and improved patterns.
There’s a wide variety of choices, from online store layouts to poster-style sections and event RSVP blocks.
No matter what type of site you’re building, you’re guaranteed to find something valuable here, which makes it a strong choice for beginners and more advanced users.
The style variations are also ready to use. You can switch between different color palettes and typography options without extra tweaking.
The bottom line: Twenty Twenty-Five sits between Twenty Twenty-Four, which is more specialized toward blogging with a clearly defined design, and Twenty Twenty-Three, which is broader and more open in its design.
Choose this theme to:
Gain a versatile base for many kinds of sites.
Build pages visually using patterns and templates.
Zoologist is an ideal theme for all sorts of blogging websites.
The single-column layout displays your posts in a clean, linear format, with no sidebars or distractions.
My experience
Zoologist has strong blogging roots.
To me, it felt like a great choice for anyone publishing long-form, whether that’s a business blog, a personal website, or a journal.
You can choose from several color variations to customize the visual design of your site.
The theme also offers templates and patterns similar to Twenty Twenty-Four, which help you add essential elements such as newsletter sign-up forms and waitlists.
My favorite part of this WordPress theme: It has little noise, with no unnecessary bells and whistles — just set it up and start publishing.
Choose this theme to:
Publish content using a simple, clean, single-column layout.
Build a straightforward blog or content-centric site.
Price: Available on the Business plan ($25/month on the annual plan)
If you’re building an online store, Tsubaki is a WordPress theme worth considering.
It’s designed for e-commerce and integrates seamlessly with WooCommerce, so your store, blog, and site all live in one place.
My experience
Tsubaki is built around e-commerce from the ground up.
The layout, navigation, and structure all support product displays and shopping flows.
The patterns are e-commerce-focused, with options for product categories, new arrivals, checkout sections, and more.
The additions don’t detract from the core blogging features, though. You can use this theme to host your blog while selling your physical or digital products.
Choose this theme to:
Build an online store or e-commerce site.
Combine shop and blog content on one site.
Use WooCommerce with an e-commerce-friendly layout.
Fewer’s clean content presentation and project-driven focus, which combine text and visuals neatly, make it an excellent choice for building portfolio sites.
Its design is clear and readable without being noisy, which helps keep the spotlight on your work.
My experience
I was immediately impressed by Fewer’s style variations.
The designs are versatile but not so loud that they shift focus away from the projects you want to highlight.
I found the typography especially clean and balanced.
Fewer is a solid choice if you want an elegant, content-first site that displays your work with minimal clutter.
While it’s great for portfolio sites, it’s also flexible enough to work for business or blog sites.
Choose this theme to:
Take advantage of style and typography variations.
Build a blog, portfolio, or content-driven site.
Keep the focus on your content or visuals through good typography and design.
Poema is a simple black-and-white text theme built in honor of writer and poet Fernando Pessoa.
It’s designed to focus 100% on the writing material, with no visuals or design elements overpowering the text.
Poema is perfect for poetry sites, personal journals, or anywhere writing needs to take center stage.
My experience
Entering the Poema theme feels like opening a poetry book.
The design is clean and clutter-free — just your words on the page. The layout feels classic and literary, with serif fonts, neutral colors, and lots of whitespace.
Despite the name, it works just as well for long-form essays, journal entries, or personal reflections.
Choose this theme to:
Create an elegant site focused on typography and reading.
Nook uses a classic two-column layout with a sidebar structure, giving it a familiar blog feel.
It’s a strong choice for someone creating a personal site, food blog, journal, or craft-focused blog.
My experience
Nook has a warm, nostalgic blog feel.
If I were building a site for fun or to explore a hobby, this is the theme I’d pick. It’s great for getting creative and connecting with people who share your interests.
The templates and patterns are especially helpful if you’re a beginner or want to get started quickly.
I also liked the overall familiarity of the theme. Everything feels intuitive — easy to set up for you and easy to navigate for your visitors.
Choose this theme to:
Create a classic two-column blog layout with a sidebar.
Price: Available on the Premium plan ($8/month on the annual plan)
Aether is a great WordPress theme for small-scale stores that want to weave storytelling into their business site.
It’s particularly suitable for handcrafted goods, boutique products, or small merch brands, where you want clean presentation and built-in store-style flows.
My experience
As soon as I entered the Aether theme, its focus was clear: it’s built to help you sell your products while combining shop functionality with a brand story, an About page, a testimonials section, and a visual gallery.
The homepage includes sections for best-sellers, brand story, testimonials, and contact info, so you can launch a shop with minimal custom work.
The patterns are small business-friendly, with options for Instagram grids, sitewide notices, product displays, and more.
Choose this theme to:
Run a small jewelry, accessories, or artisan brand store.
Showcase products with a focus on style and storytelling.
Launch quickly with a ready-to-use homepage and store-oriented sections.
Vivre is heavily inspired by fashion and lifestyle magazines, making it a good fit for publication sites.
The design has a stylish, editorial feel that enhances the reading experience. The font pairing (heavy sans with elegant serif) and generous whitespace give it a traditional magazine vibe.
My experience
Vivre feels like a magazine from the moment you open it.
It features bold visuals, strong headers, and stylized typography that feels like ink on paper, making it well-suited for editorial or publication sites.
The patterns are also especially helpful when finishing your site. You can quickly add a hero post, a recent content section, and a posts grid.
It’s a great theme if your site relies on strong visuals or a distinct brand style.
Choose this theme to:
Create a bold, magazine-style look.
Balance style with readability.
Build a fast-loading site, even with heavy stylistic elements.
The best WordPress theme is the one that matches your site’s purpose and saves you time down the road.
Use this quick checklist:
Does it match your use case? If you’re building a blog, pick a blogging theme. If you’re building a portfolio, choose one designed for showcasing work. You’ll get the right features without extra customization.
Does it fit the visual style you want? Most themes offer style variations, so check them all before committing. Think about the impression you want to make, whether it’s calm, minimalist, bold, or editorial.
Does it have the features you need? For example, if you’re selling products, you’ll want a theme with WooCommerce support. If you’re a beginner, look for a solid pattern library to help you build pages fast.
Can it scale with you? If you’re planning to add pages, products, or content over time, make sure the theme can handle it.
Is it well-maintained? Themes from trusted developers get regular updates, which means fewer bugs and better performance.
You can always switch themes later — it’s not irreversible. But investing time upfront helps you avoid dealing with broken layouts and user experience headaches down the road.
Get started with WordPress.com
WordPress.com gives you plenty of themes to build any kind of site.
But themes are just the start.
WordPress.com also takes care of the essentials that keep your site running smoothly:
With the Model Context Protocol (MCP), Anthropic created the de facto standard for AI models and agents to talk to third-party applications. After donating MCP to the Agentic AI Foundation last December, Anthropic today released a major new open extension to MCP that will allow MCP servers to serve up an interactive app-like experience right within the chat interface.
Anthropic, of course, is building this feature right into the web and desktop experiences for Claude. It’s worth stressing, however, that this is an open protocol, so any other chatbot provider can adopt this protocol, and any third-party service will be able to build these apps.
Support for MCP Apps is already available in Goose and Visual Studio Code (for Insiders), and it is coming later this week to ChatGPT from Anthropic competitor OpenAI.
Some of Anthropic’s early partners include the likes of Amplitude, Asana, Box, Canva, Clay, Figma, Hex, monday.com, and Slack. With the Box MCP App, for example, users will be able to search for files and preview documents inline in the chat experience — and then ask questions about those documents, too.
With the Slack app, meanwhile, users can use the AI model to write and edit message drafts and then post them to Slack. Among other things, it’s the MCP Apps framework that allows for the direct editing of these messages right in Claude.
The Slack MCP app (image credit: Slack).
“Enterprises need more from AI than powerful models. They also need a reliable way for those models to operate inside real business environments. By partnering with Anthropic, we are bringing Salesforce directly into our customers’ flow of work and providing the execution layer with context, data, governance, and trust,” says Nick Johnston, SVP of Strategic Tech Partnerships at Salesforce in today’s announcement. “That’s what powers the Agentic Enterprise.”
Soon, Slack owner Salesforce will also bring its Agentforce, Data 360 and Customer 360 apps to Claude.
The Asana MCP app (credit: Asana).
Some of the typical scenarios for using MCP Apps, which Anthropic first proposed in November, include interactive data exploration using dashboards, configuration wizards, document reviews and real-time monitoring.
At its core, MCP Apps relies on tools that supply user interface metadata and the user interface resources (HTML and JavaScript) to render them.
Building MCP Apps
The core primitives for defining MCP apps (credit: Anthropic).
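As a rough illustration of that pairing of a tool with a UI resource, here is a minimal TypeScript sketch. The type names, the field names (uiResourceUri, html) and the ui:// style URI are assumptions made for this example; they are not the field names from the MCP Apps specification, which should be consulted for the real schema.

```typescript
// Hypothetical shapes for illustration; not the actual MCP Apps schema.
interface UiResource {
  uri: string;      // identifier the tool's metadata points at
  mimeType: string; // e.g. "text/html"
  html: string;     // markup (and script) the host renders in a sandbox
}

interface UiTool {
  name: string;
  description: string;
  uiResourceUri?: string; // UI metadata: which resource renders this tool's result
}

// Host-side lookup: given a tool, find the UI resource it declares, if any.
function resolveUiResource(tool: UiTool, resources: UiResource[]): UiResource | null {
  if (tool.uiResourceUri === undefined) {
    return null; // plain tool with no interactive UI
  }
  for (const resource of resources) {
    if (resource.uri === tool.uiResourceUri) {
      return resource;
    }
  }
  return null; // declared but not served; the host can fall back to text output
}
```

The point of the indirection is that the tool only carries a reference, so a host can fetch and inspect the HTML separately before deciding to render it.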
Bringing this interactive UI experience to Claude and other chat-centric AI tools feels like a logical next step. Chat is, for better or worse, still the default way to interact with AI models, but for a while now, it has felt quite limited.
Anthropic isn’t the first one to think of this, of course. With its Apps SDK, OpenAI offers a somewhat similar framework, which also uses MCP at its core. Anthropic notes that both the OpenAI Apps SDK and the open-source MCP-UI project (created by Ido Salomon and Liad Yosef) pioneered many of these patterns.
“The projects proved that UI resources can and do fit naturally within the MCP ecosystem, with enterprises of all sizes adopting both the OpenAI and MCP-UI SDKs for production applications,” the Anthropic team writes.
And for the foreseeable future, developers who wrote MCP-UI apps will be able to continue to do so.
“MCP Apps builds upon the foundations of MCP-UI and the ChatGPT Apps SDK to give people a rich, visually interactive experience,” says Nick Cooper, Member of Technical Staff, OpenAI. “We’re proud to support this new open standard and look forward to seeing what developers build with it as we grow the selection of apps available in ChatGPT.”
On the security front, Anthropic notes that it implemented a number of guardrails to ensure the third-party code you are running on your MCP host cannot break out of its sandbox. These include sandboxed iframes with restricted permissions, the ability of hosts to review the HTML content before rendering, auditable UI-to-host messages, and the fact that users have to give explicit approval for UI-initiated tool calls.
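Two of those guardrails, auditable UI-to-host messages and explicit approval for UI-initiated tool calls, can be sketched as a small host-side gate. This TypeScript example is a simplified illustration under assumed names (UiMessage, AuditEntry, handleUiMessage) and a synchronous approval callback; it is not Anthropic's implementation or the protocol's actual message format.

```typescript
// Hypothetical message shapes; the real UI-to-host protocol may differ.
type UiMessage =
  | { kind: "tool-call"; tool: string; args: Record<string, unknown> }
  | { kind: "notify"; text: string };

interface AuditEntry {
  message: UiMessage;
  allowed: boolean;
}

// Records every message from the embedded UI and only lets a tool call
// through when the user-approval callback says yes.
function handleUiMessage(
  message: UiMessage,
  askUser: (tool: string) => boolean,
  auditLog: AuditEntry[],
): boolean {
  const allowed = message.kind === "tool-call" ? askUser(message.tool) : true;
  auditLog.push({ message, allowed }); // denied requests are logged too
  return allowed;
}
```

A real host would make the approval step asynchronous and show the user the tool name and arguments, but the control flow stays the same: nothing the UI requests runs without passing through the gate.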
Hey! It’s my first post for 2026, and I’m writing to you while watching our driveway getting dug out. I hope that, wherever you are, you are safe and warm and your data is still flowing!
This week brings exciting news for customers running GPU-intensive workloads, with the launch of our newest graphics and AI inference instances powered by NVIDIA’s latest Blackwell architecture. Along with several service enhancements and regional expansions, this week’s updates continue to expand the capabilities available to AWS customers.
Last week’s launches
Amazon EC2 G7e instances are now generally available — The new G7e instances accelerated by NVIDIA RTX PRO 6000 Blackwell Server Edition GPUs deliver up to 2.3 times better inference performance compared to G6e instances. With two times the GPU memory and support for up to 8 GPUs providing 768 GB of total GPU memory, these instances enable running medium-sized models of up to 70B parameters with FP8 precision on a single GPU. G7e instances are ideal for generative AI inference, spatial computing, and scientific computing workloads. Available now in US East (N. Virginia) and US East (Ohio).
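The single-GPU claim follows from simple arithmetic, sketched here in TypeScript. The helper names are mine, and the estimate counts model weights only, using decimal gigabytes and one byte per parameter for FP8; real deployments also need memory for the KV cache, activations, and runtime overhead.

```typescript
// Memory available on each GPU when the total is split evenly.
function perGpuMemoryGB(totalGB: number, gpuCount: number): number {
  return totalGB / gpuCount;
}

// Weight footprint in GB: billions of parameters times bytes per parameter
// (1e9 parameters * N bytes = N GB in decimal units).
function weightMemoryGB(paramsBillions: number, bytesPerParam: number): number {
  return paramsBillions * bytesPerParam;
}

// True when the weights alone fit in one GPU's memory.
function fitsOnOneGpu(paramsBillions: number, bytesPerParam: number, gpuMemoryGB: number): boolean {
  return weightMemoryGB(paramsBillions, bytesPerParam) <= gpuMemoryGB;
}
```

With the figures above, perGpuMemoryGB(768, 8) gives 96 GB per GPU, and a 70B-parameter model at FP8 needs about 70 GB for weights, leaving roughly 26 GB of headroom. The same model at 2 bytes per parameter (FP16) would need about 140 GB and would not fit on one GPU.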
Additional updates
I thought these projects, blog posts, and news items were also interesting:
Amazon Corretto January 2026 Quarterly Updates — AWS released quarterly security and critical updates for Amazon Corretto Long-Term Supported (LTS) versions of OpenJDK. Corretto 25.0.2, 21.0.10, 17.0.18, 11.0.30, and 8u482 are now available, ensuring Java developers have access to the latest security patches and performance improvements.
Amazon ECR now supports cross-repository layer sharing — Amazon Elastic Container Registry now enables you to share common image layers across repositories through blob mounting. This feature helps you achieve faster image pushes by reusing existing layers and reduce storage costs by storing common layers once and referencing them across repositories.
Amazon CloudWatch Database Insights expands to four additional regions — CloudWatch Database Insights on-demand analysis is now available in Asia Pacific (New Zealand), Asia Pacific (Taipei), Asia Pacific (Thailand), and Mexico (Central). This feature uses machine learning to help identify performance bottlenecks and provides specific remediation advice.
Amazon Connect adds conditional logic and real-time updates to Step-by-Step Guides — Amazon Connect Step-by-Step Guides now enables managers to build dynamic guided experiences that adapt based on user interactions. Managers can configure conditional user interfaces with dropdown menus that show or hide fields, change default values, or adjust required fields based on prior inputs. The feature also supports automatic data refresh from Connect resources, ensuring agents always work with current information.
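For the ECR layer-sharing item above, it may help to see what a blob mount looks like at the registry API level. The OCI Distribution Specification defines a cross-repository mount as a POST to the target repository's upload endpoint with mount and from query parameters. The TypeScript sketch below only builds that URL and interprets the response status; the helper names and the example registry host are hypothetical, and how ECR enables or authorizes this is not covered in the announcement summary, so treat any ECR-specific behavior as an assumption.

```typescript
// Builds the OCI Distribution Spec cross-repository mount URL:
// POST /v2/<target>/blobs/uploads/?mount=<digest>&from=<source>
function blobMountUrl(registryHost: string, targetRepo: string, sourceRepo: string, digest: string): string {
  return (
    "https://" + registryHost +
    "/v2/" + targetRepo + "/blobs/uploads/" +
    "?mount=" + encodeURIComponent(digest) +
    "&from=" + encodeURIComponent(sourceRepo)
  );
}

// Per the spec, 201 means the layer was mounted without re-uploading, and
// 202 means the registry opened a normal upload session instead.
function interpretMountStatus(status: number): "mounted" | "upload-required" | "error" {
  if (status === 201) {
    return "mounted";
  }
  if (status === 202) {
    return "upload-required";
  }
  return "error";
}
```

In practice, container clients such as Docker issue this request on push when they know a layer already exists in another repository on the same registry, which is where the faster pushes come from.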
Upcoming AWS events
Keep a lookout and be sure to sign up for these upcoming events:
Best of AWS re:Invent (January 28-29, Virtual) — Join us for this free virtual event bringing you the most impactful announcements and top sessions from AWS re:Invent. AWS VP and Chief Evangelist Jeff Barr will share highlights during the opening session. Sessions run January 28 at 9:00 AM PT for AMER, and January 29 at 9:00 AM SGT for APJ and 9:00 AM CET for EMEA. Register to access curated technical learning, strategic insights from AWS leaders, and live Q&A with AWS experts.
AWS Community Day Ahmedabad (February 28, 2026, Ahmedabad, India) — The 11th edition of this community-driven AWS conference brings together cloud professionals, developers, architects, and students for expert-led technical sessions, real-world use cases, tech expo booths with live demos, and networking opportunities. This free event includes breakfast, lunch, and exclusive swag.
Join the AWS Builder Center to learn, build, and connect with builders in the AWS community. Browse for upcoming in-person and virtual developer-focused events in your area.
That’s all for this week. Check back next Monday for another Weekly Roundup!