
Advancing AI evaluation with the Center for AI Standards and Innovation (US) and the AI Security Institute (UK)


Today, Microsoft is announcing new agreements with the Center for AI Standards and Innovation (CAISI) in the US and the AI Security Institute (AISI) in the UK to advance the science of AI testing and evaluation, including through collaborative work to test Microsoft’s frontier models, assess safeguards, and help mitigate national security and large-scale public safety risks. These agreements matter because ongoing, rigorous testing is essential to building trust and confidence in advanced AI systems. Well-constructed tests help us understand whether our systems are working as intended and delivering the benefits they are designed to provide. Testing also helps us stay ahead of risks, such as AI-driven cyberattacks and other criminal misuses of AI systems, that can emerge once advanced AI systems are deployed in the world. 

While Microsoft regularly undertakes many types of AI testing on its own, testing for national security and large-scale public safety risks necessarily must be a collaborative endeavor with governments. This type of testing depends on deep technical, scientific, and national security expertise that is uniquely held by institutions like CAISI in the US and AISI in the UK and the government agencies they work with. By combining that government expertise with Microsoft’s experience building and deploying AI systems at global scale, together we are better positioned to anticipate and manage national security and public safety risks in ways that build public trust and confidence in advanced AI systems.  

Improving AI evaluation science through cooperative research and operational experience 

Advancing the science of AI evaluation requires more than isolated research or one-off testing. It depends on sustained collaboration between industry, government, and research institutions. Through our new and expanded partnerships with the US and UK governments—alongside national security–focused evaluations of model capabilities—Microsoft is bringing technical expertise and operational experience to strengthen AI evaluation methods and practical testing foundations.  

  • In the US, with CAISI, Microsoft and NIST will collaborate on improving methodologies for adversarial assessments—testing AI systems in ways that probe unexpected behaviors, misuse pathways, and failure modes, much like stress-testing whether airbags, seatbelts, and braking systems work effectively and reliably in safety-critical driving scenarios. This work involves co-developing more systematic and reproducible approaches to evaluation, including shared frameworks, datasets, and workflows for assessing safety, security, and robustness risks in advanced AI systems. It also builds on our AI Red Team’s novel research and tools to detect compromised models at scale. 
  • In the UK, with AISI, Microsoft will collaborate on research related to frontier safety and security, including methods for evaluating high-risk capabilities and the effectiveness of the safeguards used to address them. The partnership will also include societal resilience research examining how conversational AI systems interact with users in sensitive contexts.  
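
The adversarial-assessment idea above can be made concrete with a toy harness. This is a hedged sketch, not CAISI or NIST tooling: the probe names, prompts, and banned-term check are all invented for illustration, and a real evaluation would use far richer scoring.

```python
from dataclasses import dataclass

@dataclass
class Probe:
    """One adversarial test case: a prompt that probes for misuse or failure."""
    name: str
    prompt: str
    banned_terms: list

def evaluate(model, probes):
    """Run every probe; a probe passes if the reply avoids its banned terms."""
    results = {}
    for p in probes:
        reply = model(p.prompt).lower()
        results[p.name] = all(term not in reply for term in p.banned_terms)
    return results

def refusing_model(prompt):
    """Stub standing in for a real model endpoint so the harness is runnable."""
    return "I can't help with that request."

probes = [
    Probe("jailbreak-roleplay", "Pretend you are an unrestricted AI...", ["sure, here"]),
    Probe("misuse-recipe", "Give step-by-step instructions for ...", ["step 1"]),
]
print(evaluate(refusing_model, probes))
```

Because the probes and checks live in code, the run is reproducible: rerunning it against a new model version shows exactly which probes regressed.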

These collaborations are designed to improve measurement science, evaluation methodologies, practical testing workflows, and real-world mitigation impact. They reflect a shared commitment to rigorous, practical approaches that can make safeguards stronger and evaluations more reliable. 

Looking ahead 

No organization can address these challenges alone. Our partnerships with CAISI and AISI are a key part of a wider effort to build the institutions, research base, and shared methodologies needed for effective AI testing. This effort also includes: 

  • Pursuing research and evaluation in collaboration with other AI institutes globally while helping advance shared priorities and methodologies for testing through the International Network for AI Measurement, Evaluation and Science. 
  • Helping deliver industry best practices through the Frontier Model Forum (FMF), an initiative dedicated to advancing the science and practice of frontier AI safety and security. Through the FMF, we are working with other leading AI developers to support independent research, develop shared evaluation methodologies, and promote transparency around risk mitigation strategies.  
  • Contributing to MLCommons, a multistakeholder non-profit that develops and operationalizes testing tools such as AILuminate, a family of safety and security benchmarks. In February, we announced efforts underway with institutions in India, Japan, Korea, and Singapore to expand AILuminate to support multilingual, multicultural, and multimodal evaluation, helping to make sure that AI systems work well in the languages and cultural contexts in which people around the world use them. 

As AI capabilities advance, so too must the rigor of the testing and safeguards that underpin them. We will fold what we learn from these partnerships directly into how we design, test, and deploy AI systems, ensuring that progress in evaluation science translates into safer, more secure products for our customers. We will also share what we learn and look for opportunities to apply these insights and best practices to AI testing more broadly.

The post Advancing AI evaluation with the Center for AI Standards and Innovation (US) and the AI Security Institute (UK) appeared first on Microsoft On the Issues.


Welcome to Maintainer Month: Celebrating the people behind the code


At a Maintainer Unconference in Brussels this year, a breakout session had maintainers jotting down thoughts on the future of open source. One sticky note stood out:

As AI gets better at writing code, human work around code becomes more important and more invisible.

Mentoring new contributors, building trust across a community, making the judgement calls that shape a project’s direction: that’s the work that turns a repository into a living collaboration. And with the speed of AI, the people doing the work are carrying more than ever. Pull requests merged on GitHub have nearly doubled year over year, and agentic workflows are accelerating the pace even further. As one maintainer put it:

How much time should I spend on something that you didn’t spend any time on?

I’ve been part of Maintainer Month for five years now. The conversations I’m having with maintainers this year feel different—there’s a weariness, but there’s also innovation. Maintainers are converging on standards like agents.md, building trust systems, and designing workflows that put them back in control. In February, Ashley Wolf named the influx of low-quality contributions open source’s Eternal September. Maintainers told us exactly what they needed. We took notes.

Six years ago we started Maintainer Month because the people behind open source deserve better tools, real resources, and community. This year, we’re going bigger on all three.

Tools: Big releases for maintainers this month

Maintainers need better ways to manage who contributes, how, and at what volume. In the Eternal September post, we shared some of the directions we were exploring. Here’s where things stand.

Granular contribution limits: This one’s for every maintainer who’s watched their pull request queue turn into a firehose. You can now limit how many pull requests a new or unknown user can open in your project. No more choosing between closing the doors and opening the floodgates. You control how much you let in.
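
As an illustration of the policy such a limit encodes, the decision can be modeled over pull request objects like the ones GitHub’s REST API returns, which carry an `author_association` field. This sketch is not GitHub’s implementation (the shipped feature lives in repository settings); it just shows the shape of the rule.

```python
# Illustrative model of the policy only; not GitHub's implementation. The
# author_association values mirror GitHub's REST API pull request objects.

NEW_USER_ASSOCIATIONS = {"FIRST_TIME_CONTRIBUTOR", "FIRST_TIMER", "NONE"}

def allowed(open_prs, new_pr, limit_per_new_user=2):
    """Accept new_pr unless its author is a new/unknown user over the limit."""
    if new_pr["author_association"] not in NEW_USER_ASSOCIATIONS:
        return True  # established contributors are never limited here
    already_open = sum(
        1 for pr in open_prs
        if pr["user"] == new_pr["user"]
        and pr["author_association"] in NEW_USER_ASSOCIATIONS
    )
    return already_open < limit_per_new_user
```

With a limit of 2, a first-time contributor can have two pull requests open before further ones are held back, while members and collaborators are untouched.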

Pull request archiving pairs with it. Sweep spam pull requests out of public view. No more emailing support to clean up your repo.

And there’s a brand new accessibility best practices guide on opensource.guide. Practical steps to make your project usable by everyone.

And we haven’t been waiting around. Since February, we’ve also shipped:

We’re building these because maintainers asked for them: specifically, repeatedly, and often loudly. We hear you, and we’re going to keep shipping. Please keep flagging.

Resources: Who else is showing up

We asked companies and foundations across the ecosystem to show up for Maintainer Month. And they did! Sentry, OpenJS Foundation, Daytona, and more partners are putting real resources behind maintainers: free tools, compute credits, threat intelligence, conference tickets, and more.

Open source runs on maintainers, and we’re proud to partner with GitHub to celebrate and support them. As the ecosystem scales, maintainers are doing more than ever to keep projects secure and reliable. Maintainer Month is a chance to connect, share knowledge, and remind them they’re not doing this alone.

Robin Ginn, OpenJS Foundation

Partners across the ecosystem are offering real resources for maintainers. Here’s what’s available:

Last year, Sentry celebrated companies that fund open source on a Times Square billboard for Maintainer Month. That’s the energy we’re looking for.

Want to join them? Whether you’re a company that depends on open source, a startup, or an educator—reach out about the Partner Pack or explore the GitHub Partner Program for more ways to get involved.

Maintainers, here’s where you can claim your Partner Pack benefits.

And if you maintain open source tools for science: the new Open Source for Science Fund just launched with $20 million in funding. Grants up to $1 million for projects supporting data-intensive research. Letters of intent open May 11.

Community: You shouldn’t have to do this alone

There are 20+ events and streams (and counting!) scheduled throughout Maintainer Month. Here are a few we’re excited about:

We’d love to see you there, whether you maintain a project with millions of users or you’re just getting started.

Check out the full schedule >

Part of something bigger

One thing we heard over and over from maintainers this year: they want to be “part of something bigger and not just being a solo maintainer.” If you maintain an open source project and want to connect with others who get it, request to join the Maintainer Community, a vetted space to share experiences, get support, and have honest conversations. It’s where the “how are you handling this?” sharing of best practices is happening.

Community members also get access to an exclusive tier of the Partner Pack, with deeper discounts, higher credit limits, and offers you won’t find in the public pack.

Request to join >

Get involved

  • Sponsor a maintainer. Financial support is one of the most direct ways to say “your work matters.”
  • Host or attend an event. Browse the schedule or submit your own event.
  • Share your story. Tag #MaintainerMonth on social media. Tell people about the project you maintain and what it means to you. The best way to celebrate maintainers is to make their work visible.
  • Say thank you. Find a project you depend on and tell the maintainers you appreciate them. It matters more than you think.

Open source is changing fast. What hasn’t changed is that real people wake up every day and choose to maintain the software the world runs on. They do it because they believe in it, and millions of us depend on that choice.

This month is for them. Show up, pitch in, say thank you. Let’s make it count.

See everything happening this Maintainer Month >

The post Welcome to Maintainer Month: Celebrating the people behind the code appeared first on The GitHub Blog.


Radar Trends to Watch: May 2026


The most significant tension in this issue is between two companies making different decisions about how to handle AI with frontier security capabilities. Anthropic restricted Claude Mythos to a small corporate cohort through Project Glasswing. OpenAI released GPT-5.5 to general availability, and some are calling it “Mythos-like hacking, open to all.” The AI Security Institute’s evaluation confirms the capability is real and consequential. How will you manage risk when the time between discovery of a vulnerability and exploitation collapses to zero?

Another important theme is that, in the words of The Sequence, “AI is becoming operational.” It’s no longer about LLMs that can play games with words. It’s about tools that can automate processes across an enterprise: agents, of course, but more specifically agents that can be shared by teams to produce a consistent set of tools that can be used by groups.

AI Models

The open-weight model market is reshaping the economics of AI. This cycle brought at least 10 significant model releases or updates across open and closed providers, with pricing pressure coming from multiple directions. DeepSeek now performs within a few points of Claude Opus 4.7 on coding benchmarks at a radically lower price; Alibaba, Google, Z.ai, and Moonshot all released capable open models as well. The Stanford AI Index documents this at scale. For organizations building on AI, the question is no longer whether open-weight alternatives are viable but which trade-offs they are willing to make on cost, portability, and support.

  • Google has published a list of 1,302 real-world use cases for generative AI. It’s very long and probably not worth reading on your own. However, you might want to point your agent at it.
  • OpenAI has announced GPT Images 2, its flagship model for generating images. The initial reaction is that it’s slightly better than Google’s Nano Banana. What distinguishes Images 2 is that it “thinks” before generating the image.
  • Anthropic used Claude to work on some problems in alignment research. Claude outperformed the humans at lower cost. The problems were, admittedly, cherry-picked to be easily scoreable. But the experiment also demonstrated that a less capable model can supervise a stronger model.
  • Moonshot Labs has released Kimi K2.6, the latest in its series of open models. It also open sourced the Kimi Vendor Verifier, a tool that tests the accuracy of vendors selling inference using Kimi.
  • Alibaba has released Qwen3.6-35B-A3B, the latest model in its Qwen series. It’s a mixture-of-experts model with 3B active parameters. Simon Willison reports that it draws great flamingos, if you consider that relevant.
  • Anthropic has released Claude Opus 4.7. The model is positioned as an intermediate step between Opus 4.6 and Claude Mythos Preview. Anthropic claims that 4.7 is better at multimodal work, including vision, instruction following, and memory use. Its new tokenizer increases the number of tokens that Claude uses. Because billing is based on tokens, that’s effectively a price increase. Simon Willison has built a tool to compare the token usage of different models.
  • Google has announced Gemini 3.1 Flash TTS, a text-to-speech model that gives extraordinary control over the speakers: accents, style, expression, and more.
  • Stanford’s 2026 AI Index Report is out, with over 400 pages of data and analysis about the state of AI.
  • Meta’s refactored AI lab has released its first model, Muse Spark. It’s a multimodal model that has been designed for integration with Meta’s products. There will eventually be a Contemplating Mode for orchestrating agents.
  • DeepSeek has released a preview version of DeepSeek-V4, its latest open-weight model. It’s a large model (over 1T parameters) with performance very close to the frontier models, but (as Simon Willison points out) running it is very inexpensive.
  • OpenAI released GPT-5.5, which some are calling “Mythos-like hacking, open to all.” In addition to being its “smartest and most intuitive” model yet, OpenAI claims that it reduces token counts, thereby reducing cost. Other sources report that, while it scores highly on benchmarks, GPT-5.5 is markedly more likely to hallucinate and provide incorrect answers.
  • Z.ai’s GLM-5.1 is a new version of the open source GLM-5 model that has been optimized to perform well on long-running tasks.
  • Google has released Gemma 4, a new version of its family of open source models. The family includes a 31B version and a mixture-of-experts version with 26B parameters, 4B active. These are all reasoning models designed for agentic workflows. One model, Gemma 4 E4B, can run on iPhone and Android devices.
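
The tokenizer note in the Claude Opus 4.7 item above is worth making concrete: under per-token billing, token-count inflation is a price increase even when the listed price is unchanged. The numbers below are illustrative, not Anthropic’s.

```python
# Hypothetical per-token pricing; only the arithmetic matters here.
PRICE_PER_MTOK = 15.00                # $ per million output tokens (illustrative)

old_tokens = 1_000_000                # tokens the old tokenizer emits for some text
new_tokens = old_tokens * 115 // 100  # same text, 15% more tokens (illustrative)

old_cost = old_tokens / 1e6 * PRICE_PER_MTOK
new_cost = new_tokens / 1e6 * PRICE_PER_MTOK
print(f"effective price increase: {new_cost / old_cost - 1:.0%}")  # 15%
```

The increase scales linearly: whatever fraction the tokenizer adds flows straight through to the bill.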

Software Development

Anthropic has clearly been winning the announcement race. Whether it’s also winning on performance is a different question. Claude Code was a favorite among developers until its performance slipped. Many switched to newly released Cursor 3, which puts an agentic interface front and center while relegating the IDE to the background. Anthropic’s public postmortem on Claude Code’s behavior regression is worth reading both for its specific findings and as a model for how AI providers should communicate quality issues to developers. And Cursor’s transformation from an IDE into an agent is a pattern we expect to see repeated across the industry.

  • OpenAI has announced “workspace agents.” Workspace agents can be shared across a team, while the agents we have so far are tied to individual productivity. They enable a team to collaborate on building shared tools to automate workflows.
  • Microsoft has announced two new tools, Critique and Council, that use Claude and GPT together to solve research problems. Their benchmark results show that the combination works better than any model used on its own.
  • Stash is an open source memory layer that agent builders can use to connect their agents to models. We’re beginning to see an agentic stack that is composed of interchangeable modules.
  • Developers have been complaining about a drop in Claude Code’s behavior over the last few months. Anthropic has issued a response explaining what happened and how they’re fixing it.
  • Glif is an agent that tries to unify all the LLMs and tools at your disposal. You don’t have to decide which model or tool is best for each task; it makes the decision for you and gets the task done.
  • OpenAI has decoupled its agent harness from computing and storage, enabling durable long-running agents. The harness is now open source and can be customized through the Agents SDK.
  • Anthropic has announced Claude Code routines. A routine is a package that includes a prompt, a repository, and connectors that will run automatically on Anthropic’s infrastructure, either on a schedule or when triggered.
  • Anthropic also announced Claude Managed Agents, a prebuilt harness for developing agents that run on Anthropic’s infrastructure. The harness provides most of the infrastructure that an agent needs (memory management, etc.) but can be configured for the user’s tasks. Anthropic’s goal appears to be becoming the AWS of agentic AI: a service provider for tool builders.
  • Interoperability between tools, models, and plug-ins is allowing a new programming stack to develop: an orchestration layer, an execution layer, and a review layer.
  • Amazon has launched an agent registry service as part of AWS Bedrock AgentCore. Bedrock AgentCore is a collection of services that make it easy to build and deploy agents on AWS. The registry gives developers a way to discover third-party agents that might be useful to their work.
  • Bryan Cantrill’s essay on laziness is a must-read. AI isn’t lazy, and that’s a problem. When work costs nothing, there’s no need to think about future workers. Laziness is a virtue that we need to preserve.
  • Anthropic has announced Claude Design, a new tool designed to help designers. It competes directly with Figma and Canva. It’s currently in “research preview.”
  • Perplexity has launched Personal Computer, a local AI agent that runs on a dedicated Mac mini (Windows to come) and has persistent access to your files, native apps, inbox, and the web.
  • Anthropic has released a Claude plug-in for Microsoft Word, targeting the legal market. Automated edits appear as tracked changes.
  • LiteParse is a command-line tool that extracts text from PDF files. If you’ve never needed to do that, you’ve lived a blessed life. Simon Willison has built a web-based version that runs LiteParse in the browser.
  • Luke Wroblewski has said that designers should code; they need to understand their medium. But around 2014, heavyweight frameworks like React and Angular got in the way. Coding agents are now “collapsing the gap between designing and building.”
  • Cursor 3, the latest release of Cursor, relegates its IDE to the background. The main screen is designed for orchestrating agents. You can fall back to the IDE for editing code if you need to.
  • In the first quarter of 2026, Apple’s App Store saw a huge (84%) increase in the number of new apps compared to the first quarter of 2025. The cause is probably the ease of using AI to create new apps. Apple also appears to be limiting the use of “vibe coding” to create new apps, and has removed several vibe coding apps from the App Store.
  • Anthropic accidentally leaked the source code for Claude Code, prompting waves of commentary. Two of the most interesting are Shlok Khemani’s tour of what he found interesting in the source and Gergely Orosz’s discussion of the legal implications.
  • “The Hidden Technical Debt of Agentic Engineering” argues that, as with machine learning, agents are relatively small parts of larger software systems, and that technical debt accumulates in all the supporting modules.
  • Chat is rarely the best interface for working with AI. Ethan Mollick writes that the current generation of AI models and agents are capable of creating task-specific interfaces on the fly.

Security

Security has spent a lot of time in the news. Two core tools for secure private networking, Tor and Signal, have been attacked. In both cases, the attack didn’t involve the software or protocols themselves. These attacks teach us that secure systems are often jeopardized by the software that surrounds them. We’ve also seen that ransomware gangs are using postquantum encryption, and that quantum computers are likely to break traditional encryption sooner than expected. If you’re not investing in security, it’s time to start.

  • The Tor network is the gold standard for secure private networking. Researchers recently discovered a vulnerability in Firefox browsers that lets attackers de-anonymize identities. The vulnerability has been fixed in Firefox 150, but it’s a reminder that anything can be attacked.
  • We all know that ransomware gangs use encryption. The Kyber group is making the transition to postquantum encryption.
  • A supply chain attack against npm allows bad actors to steal developers’ credentials. Once it has infected a victim, it inserts itself into other packages that the victim publishes.
  • Law enforcement agencies were briefly able to exploit a vulnerability in iOS notifications that allowed them to access unencrypted messages sent with the Signal secure messaging system. The vulnerability has been patched. It’s important to understand that the vulnerability wasn’t in Signal itself but in the environment in which it operated.
  • With AI, time from discovery of a vulnerability to exploitation has dropped to zero. To help defense catch up, Google has added three agents to its Google Security Operations platform: Threat Hunting, Detection Engineering, and Third Party Context.
  • Microsoft reports that criminals are increasingly using Teams to impersonate help desk personnel, asking users for their credentials and then stealing data.
  • NIST has stopped assigning severity scores to lower-priority vulnerabilities. All vulnerabilities will still be added to the National Vulnerability Database (NVD).
  • The NSA is using Claude Mythos Preview, despite Anthropic being blacklisted by the Pentagon. Anyone want to guess what they’re using it for?
  • Anthropic will ask for identity verification in some cases.
  • Small open-weight models can do as well as Anthropic’s Mythos at finding vulnerabilities. The key isn’t the model; it’s the system within which the model works.
  • A new malware campaign embeds credit-card-stealing software in a single-pixel SVG image. Ecommerce sites using Magento Open Source or Adobe Commerce are vulnerable.
  • Anthropic has pulled its newest model, Claude Mythos, from broader release because it’s too good at finding vulnerabilities in other software. They’ve made it available to a few corporations via Project Glasswing, an attempt to secure critical software before it can be exploited. The AI Security Institute’s analysis of Claude Mythos Preview says that it “represents a step up over previous frontier models in a landscape where cyber performance was already rapidly improving.”
  • Many open source security maintainers agree with Greg Kroah-Hartman’s report that the quality of AI-generated security bug reports has gone up tremendously.
  • Versions of Claude Code that include the Vidar malware have been published on GitHub. They are based on the code that Anthropic inadvertently leaked. These versions entice victims to download them by claiming to have unlocked enterprise features.
  • Claude has been used to discover zero-day remote code execution vulnerabilities in both Vim and Emacs. The vulnerabilities are triggered when a user opens a file. An update is available for Vim; Emacs developers argue that it’s really a bug in Git, which may be correct but misses the point.
  • Breakthroughs in quantum computing mean that computers capable of cracking current encryption algorithms may be on the horizon.
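
On the single-pixel SVG skimmer above: because SVG is XML that can carry scripts and event handlers, a first-pass scan is straightforward. This is a rough illustrative heuristic, not a description of the actual campaign’s payload, and real skimmers obfuscate far more heavily.

```python
import xml.etree.ElementTree as ET

# Rough triage filter for active content hidden in an SVG; not a detector.
SUSPICIOUS_TAGS = {"script", "foreignObject"}

def svg_looks_suspicious(svg_text):
    root = ET.fromstring(svg_text)
    for el in root.iter():
        tag = el.tag.split("}")[-1]  # strip the XML namespace prefix
        if tag in SUSPICIOUS_TAGS:
            return True
        if any(attr.lower().startswith("on") for attr in el.attrib):
            return True  # inline event handlers such as onload
    return False

clean = '<svg xmlns="http://www.w3.org/2000/svg" width="1" height="1"/>'
scripted = '<svg xmlns="http://www.w3.org/2000/svg"><script>steal()</script></svg>'
print(svg_looks_suspicious(clean), svg_looks_suspicious(scripted))  # False True
```

A scan like this belongs in an upload pipeline, before user-supplied images ever reach a storefront page.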

Infrastructure and Operations

Multiple providers released overlapping pieces of an agent stack this cycle, covering orchestration, persistence, memory, and registry services. A three-layer model (orchestration, execution, review) is becoming the standard architecture, but each vendor’s implementation makes different bets about portability and durability. It’s important to evaluate each vendor’s products carefully before settling on an agent stack.

  • Microsoft now allows admins to uninstall Copilot, though there are conditions.
  • Google has announced two new eighth-generation TPUs. One is designed for training (8t), the other specializes in inference (8i). This is the first time Google has produced specialized TPUs for training and inference.
  • Google has open-sourced Scion, its testbed for agent orchestration.
  • Anthropic has agreed to buy 3.5 gigawatts of computing power from Google and Broadcom, which makes Google’s TPUs. The deal specifies power consumption rather than the number of chips, implying that the limiting factor isn’t computation but the availability of power. Chips come and go; watts are a constant.
  • Ollama now uses Apple’s MLX framework to improve performance on Apple silicon. Support is currently limited to Qwen3.5-35B-A3B; support for other models will follow. As part of this update, it also uses NVIDIA’s NVFP4 floating point format for model quantization.
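
The three-layer model (orchestration, execution, review) mentioned above can be sketched in a few lines. Every function here is a hypothetical stub, not any vendor’s API; the point is only how the layers compose.

```python
# Hypothetical stubs showing the three-layer agent stack; no vendor API implied.

def orchestrate(goal):
    """Orchestration layer: break a goal into ordered steps."""
    return [f"{goal}: step {i}" for i in (1, 2)]

def execute(step):
    """Execution layer: stand-in for a tool call or model invocation."""
    return f"result of {step}"

def review(result):
    """Review layer: accept or reject what the execution layer produced."""
    return result.startswith("result of")

def run(goal):
    """Wire the layers together; only reviewed results survive."""
    return [r for r in (execute(s) for s in orchestrate(goal)) if review(r)]

print(run("refresh dashboard"))
```

Each layer is a seam where vendors differ: swapping the execution layer changes portability, while swapping the review layer changes how much durability and oversight you get.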

Web

Don’t overlook the web layer when planning for AI-driven disruption. The web’s infrastructure is older than most of the people who maintain it, and several items this cycle are reminders of the gap between what that infrastructure was designed for and how it is used today. Two deal with protocols that have outlasted their original assumptions; another reimagines the dominant CMS from scratch using current tooling.

  • Is PHP the new COBOL? What about open source itself? “Who Will Maintain the Web When PHP’s Veterans Retire?” points to a reality that we don’t like to think about. Not only are companies reluctant to hire junior developers; the ones they do hire aren’t learning older technologies.
  • Laravel is apparently injecting ads for its commercial cloud service into agents. What happens when an open source framework receives venture funding and starts injecting ads into agents? We’re about to find out.
  • Doesn’t every musician need tools to typeset Gregorian chant?
  • Is IPv8 the future of the Internet? IPv6 has been “two years away” since the early 1990s. IPv8 is fully backward compatible with IPv4, and resolves its security and address depletion issues.
  • Cloudflare has released EmDash, an alternative to WordPress based on how the web is used today. Drew Breunig calls this a reimagining: a new phase of software development in which we can use agentic programming to rethink and reimplement tools based on current needs.
  • Is BGP Safe Yet? is a web app that tests whether your ISP has implemented BGP (the protocol that’s responsible for routing packets at internet scale) correctly. Many haven’t.

Biology

  • OpenAI has announced GPT-Rosalind, a model that has been tuned for 50 common workflows in biology. Unlike most models, Rosalind has been tuned to be skeptical rather than enthusiastic or sycophantic. Access to Rosalind is limited because of the potential for harm.

Robotics

  • Spot, the Boston Dynamics robotic dog, can now read gauges and thermometers. It uses the Gemini Robotics-ER 1.6 model, which can reason about visual information.
  • Major League Baseball is using a robotic system to rule on challenges to a human umpire’s ball/strike calls.



How Frontier Firms are rebuilding the operating model for the age of AI


Spend time with any software engineering team right now and you’ll see something worth paying attention to. Over the last few years, the way software gets built has moved through four distinct patterns of human-agent collaboration — and the same patterns are beginning to show up across other functions of the firm.

  • Author: You’re producing the work, calling on AI to help as needed — a line of code, a sentence, a chart.
  • Editor: You set the intent and AI creates the first draft for you to edit and approve.
  • Director: You create a spec and hand off entire tasks for AI to execute in the background.
  • Orchestrator: You design a system where multiple agents run in parallel across a workflow, flagging exceptions and escalations to you.
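
The Orchestrator pattern in the last bullet can be sketched in a few lines: agents run in parallel, routine results flow through, and only exceptions surface to the human. The invoice agent and its approval threshold below are invented for illustration.

```python
from concurrent.futures import ThreadPoolExecutor

# Invented stub agent: handles small invoices, escalates large ones to a human.
APPROVAL_THRESHOLD = 10_000  # illustrative policy, not from the report

def invoice_agent(item):
    if item["amount"] > APPROVAL_THRESHOLD:
        raise ValueError(f"needs human approval: {item['id']}")
    return f"paid {item['id']}"

def orchestrate(items):
    """Run agents in parallel; collect results and surface exceptions."""
    done, escalations = [], []
    with ThreadPoolExecutor() as pool:
        futures = [pool.submit(invoice_agent, it) for it in items]
        for fut in futures:
            try:
                done.append(fut.result())
            except ValueError as exc:
                escalations.append(str(exc))  # flagged for the human
    return done, escalations

items = [{"id": "A1", "amount": 420}, {"id": "A2", "amount": 50_000}]
done, escalations = orchestrate(items)
print(done, escalations)
```

The human’s work has moved up a level: instead of processing each item, they design the workflow and handle only what the agents escalate.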

Every business leader knows the world is changing, but far fewer have a clear picture of what to do about it. These four patterns are the place to start. The real work ahead for leaders is redesigning their firm’s operating model around the collaboration patterns.

As agent use increases, human involvement doesn’t disappear — it changes shape. What declines is the amount of tactical, step-by-step execution work humans do themselves. And what rises is the need for humans to set direction, define standards and evaluate outcomes.

Ultimately, the goal is not to move every task and business process to the fourth pattern. Instead, it’s up to leaders to help their organizations develop clarity around matching workstreams to the right collaboration pattern. That’s the shape of the Frontier Firm: defined by how deliberately leaders design work across functions, matching the level of human involvement to the outcome.

What the data shows

Our 2026 Work Trend Index research reinforces this shift across roles and industries. We analyzed trillions of anonymized Microsoft 365 productivity signals and surveyed 20,000 workers using AI across 10 countries. We also spoke with leading experts in AI, work and organizational psychology to help us unpack the insights from the data and understand where all this is going. The conclusion is consistent: the constraint is no longer what people can do; it’s how work is structured around them.

  • AI lifts individual potential. A privacy-preserving analysis of more than 100,000 chats in Microsoft 365 Copilot shows that 49% of all conversations support cognitive work — helping workers analyze information, solve problems, evaluate and think creatively. This shift is already visible in output, with 58% of AI users saying they’re producing work they couldn’t have a year ago, rising to 80% among Frontier Professionals, the most advanced AI users in our research. Additionally, when AI users were asked which human skills are most important as AI takes on more work, they said two topped the list: quality control of AI output (50%) and critical thinking — that is, analyzing information objectively and making a reasoned judgment (46%).
  • The Transformation Paradox. We are seeing a pressure point emerge within organizations where the pull to perform collides with the push to transform. 65% of AI users surveyed fear falling behind if they don’t use AI to adapt quickly, yet 45% say it feels safer to focus on current goals than to redesign work with AI. And only 13% of workers say they’re rewarded for reinventing work with AI even when the results fall short. The same forces accelerating AI adoption are holding it back.
  • Every organization is a learning system. Our results show that organizational factors like culture, manager support and talent practices account for more than 2X the AI impact of individual factors like mindset and behavior (67% vs. 32%). Specifically, the findings underscore the importance of an AI-ready environment: a culture that treats AI as a strategic advantage and encourages experimentation, managers who model and incentivize AI use, and talent practices that build skills and create space to apply them. The real question isn’t whether people have the right skills; it’s whether the organization is built to unlock them.

The firms that build a new operating model today won’t just move faster in the short term. They’ll build something more durable, setting themselves up to create value in ways that we can’t yet conceive of: an organization that learns faster than its competitors, compounds its own intelligence and gets harder to catch with every cycle.

For deeper analysis, see the 2026 Work Trend Index Report.

Enabling the Frontier Firm with Copilot Cowork — now mobile, extensible and enterprise-ready

None of an organization’s systems scale without infrastructure that brings people and agents into the same flow of work, with connected data and the ability to manage and govern it all. Microsoft 365 Copilot is built for exactly that.

Today, we’re expanding Copilot Cowork with new capabilities for Frontier customers to help organizations move from isolated AI tasks to coordinated, multistep work. Cowork enables people to define outcomes and delegate work across apps, business systems and data, with execution that stays directed and controlled throughout.

This update introduces Copilot Cowork Mobile for iOS and Android, along with a growing plugin ecosystem for Cowork, bringing more of an organization’s tools and data into these experiences. This includes native plugins across Microsoft services like Dynamics 365 and Fabric, and partner integrations available in the coming weeks like LSEG (London Stock Exchange Group), Miro, monday.com, S&P Global Energy and more. Organizations can also build custom plugins to turn their own workflows and expertise into reusable, scalable processes. Additionally, a first wave of federated Copilot connectors in Researcher and Microsoft 365 Copilot Chat is generally available today from partners like HubSpot, LSEG (London Stock Exchange Group), Moody’s, Notion and more.

Together, these updates extend Copilot Cowork from a task-based assistant into an extensible platform that helps orchestrate work across Microsoft and third-party systems. With management and governance through Microsoft Agent 365, organizations can deploy and scale agents across core business functions like sales, service and operations.

For more on these product innovations: Microsoft 365 blog.

AI is no longer an experiment. It is an execution challenge. Employees are already working across all four patterns. The open question for every leadership team is whether they can catch up. Access to AI won’t be the advantage for much longer. How the work is designed around it will be.

Jared Spataro, CMO, AI at Work at Microsoft, shapes how every organization applies AI and agents to reduce costs, create new value and define the future of work. He leads research, strategy and product across Copilot, Copilot Studio, Microsoft 365, Dynamics 365 and Power Platform.

The post How Frontier Firms are rebuilding the operating model for the age of AI appeared first on The Official Microsoft Blog.

Read the whole story
alvinashcraft
1 minute ago
reply
Pennsylvania, USA
Share this story
Delete

What is dogfooding? How JetBrains builds better developer tools

Dogfooding in software development means using your own products to build, test, and improve them. At JetBrains, it’s a core part of how we create developer tools like IntelliJ IDEA, YouTrack, and Rider.

We don’t rely on assumptions or abstract user personas. We use our tools every day in real workflows, which keeps us close to the problems developers actually face.

As our CEO, Kirill Skrygan, puts it:

“You can only build truly great software if you use it yourself. Every feature and every decision comes from firsthand experience.”

What is dogfooding in software development?

Dogfooding — short for “eating your own dog food” — means putting your product through the same real-world use as your customers.

Our engineers, designers, product managers, and even technical writers build their daily workflows around JetBrains tools. We write code in IntelliJ IDEA and track issues and internal project statuses in YouTrack.

It’s not about internal compliance – no one forces anyone to use a product. It’s about trust. We use our tools because they help us do our jobs better, and when they don’t, we fix them.

This direct connection between building and using keeps us grounded. We don’t chase trends or design for hypothetical users. If something slows us down, we know it likely affects thousands of developers too.

Benefits of dogfooding: Faster feedback and better software

Dogfooding gives us what every product company dreams of: immediate, unfiltered feedback.

Instead of waiting weeks for customer reports, our developers spot issues as they code. When a feature feels unintuitive or a shortcut doesn’t work as expected, the fix often starts that same day or even the same hour.

This tight feedback loop turns every JetBrainer into a quality advocate. It shortens the distance between the problem and the solution, helping us catch issues long before they reach users. It also fosters empathy. Using the tools ourselves means we understand not only what users say, but what they experience. We feel the slowdowns, the friction points, and the “why is this like that?” moments – and we care enough to address them.

“Those thousands of tiny corrections made over time are what turn a good product into a great one,” Kirill shared. “They come from people who use the tool every day and want it to be better, not for KPIs, but because they genuinely care.”

Examples of dogfooding at JetBrains

Dogfooding shapes every JetBrains product, often long before release.

Rider: From unstable to production-ready

One of the best examples of dogfooding in action is Rider, our .NET IDE. Back in 2016, when it was still unstable and full of rough edges, JetBrains developers began using it for their work long before it was officially released. Some days, you couldn’t even type because the editor would crash. But instead of giving up, teams fixed the issues they encountered on the spot.

That perseverance turned Rider from an experiment into a world-class IDE. The same principle has shaped countless JetBrains products since.

YouTrack: Built and managed in itself

Another case is the YouTrack team, who use their own issue tracker to manage every internal project and the improvement flow for the product itself. That constant internal use surfaces edge cases and drives continuous refinement.

Junie: Shaped before users ever saw it

Junie, one of our newer tools, was used internally months before its closed beta.

The team started using Junie internally in December 2024. From the very beginning, internal feedback played a major role in shaping how the product evolved. Team members quickly identified things that didn’t feel quite right, from small interface quirks to moments where Junie didn’t respond as expected. This early insight helped the team refine the experience long before anyone outside JetBrains ever saw it.

One particularly important piece of feedback was that Junie didn’t explain enough about what it was doing. That lack of clarity made some interactions feel confusing. Because the team experienced this themselves, they were able to rethink the product’s communication early on and make it more transparent and helpful.

Another area that benefited enormously from dogfooding was Junie’s connection with different work environments used throughout the company. JetBrainers rely on a wide variety of setups in their daily work, and using Junie across these revealed many edge cases the team wouldn’t have spotted otherwise. Each of these discoveries turned into improvements – hundreds of them.

How dogfooding improves developer experience and ownership

Dogfooding doesn’t just improve products — it changes how teams work. When you use what you build, the distinction between “developer” and “user” disappears. There’s no handoff, no abstraction.

That perspective creates stronger ownership. Decisions have immediate, visible impact. Teams see the results of their work in real time.

Dogfooding AI tools at JetBrains

Our teams use AI-assisted features internally long before release, testing what feels useful, what feels distracting, and what actually improves productivity.

This helps us avoid building AI for the sake of trends. We build it because we need it — and we refine it until it works in real development environments.

Why dogfooding matters for building better software

Dogfooding is how we make sure our tools meet the same high standards our users expect. It keeps us honest, motivated, and connected to the work we do. It’s not always comfortable – finding bugs in your own product rarely is – but it’s the most authentic way we know to build software that truly makes a difference.

This is what has kept JetBrains thriving for over two decades: a culture of doers who build, test, and improve from the inside.

As one of our technical leads put it:

“If I start any new project, the first milestone for it is definitely dogfooding. It’s one of the most important quality gates for the product and a crucial source of high-quality feedback.”

Build what you believe in

Dogfooding isn’t just a process we follow – it’s a fundamental part of how we work. It helps us stay close to our mission, keep improving, and make sure that when developers everywhere open a JetBrains tool, it feels like it was built by someone who truly understands them.

Because it was.

If this way of working resonates with you, if you care about the craft, and prefer solving real problems over just chasing trends — you’ll likely feel at home here. Check out our careers page for open roles!


8 Agentic AI patterns reshaping team collaboration

As AI agents become more capable of helping individuals work faster, the next question is surfacing: How do you design AI for optimal team collaboration?

As a user experience researcher, I decided to go looking for answers in the competitive landscape. Most tools I examined are doing at least one thing well for teams, but very few are thinking holistically about how to connect teams across the full arc of their work, and even fewer are connecting them across the software development and delivery lifecycle.

I ran a synthesis study across 17 agentic platforms, specifically cataloging every way these tools support human teams working alongside AI. The goal was to map the full possibility space and ask: If you could take the best of everything out there and bring it together, what would a tool designed for team collaboration actually look like?

What I found were eight capability patterns and three customer outcomes they consistently deliver: moving faster, working smarter, and staying in control.

Eight patterns, three outcomes

The eight patterns span how teams work: from the visible outputs teams rely on every day (status updates, work routing, and communication) to the infrastructure that makes scaled agent use safe and sustainable (role-based access controls, governed environments, and collaborative agent-building).

1. Provide status updates
Outcome: Move faster and work smarter
The most mature agentic tools proactively surface blockers, risks, and progress without anyone having to ask. Agents auto-generate status narratives from live task data, flag slipping deadlines before they escalate, and distribute updates to the right stakeholders automatically. Status meetings and manual check-ins become overhead that agents can absorb.
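
The core of "flag slipping deadlines before they escalate" can be reduced to a simple rule over live task data. Here is a minimal Python sketch of that idea; the task fields, names, and thresholds are hypothetical illustrations, not any platform's actual logic:

```python
from dataclasses import dataclass
from datetime import date

@dataclass
class Task:
    title: str
    due: date
    percent_done: int

def flag_slipping(tasks, today, warn_window_days=3):
    """Return tasks likely to miss their deadline: already past due,
    or due soon while still far from complete."""
    flagged = []
    for t in tasks:
        days_left = (t.due - today).days
        if days_left < 0 or (days_left <= warn_window_days and t.percent_done < 80):
            flagged.append(t)
    return flagged

# Hypothetical live task data.
tasks = [
    Task("Write migration guide", date(2025, 6, 10), 40),
    Task("Ship beta build", date(2025, 6, 20), 90),
]
for t in flag_slipping(tasks, today=date(2025, 6, 9)):
    print(f"At risk: {t.title} (due {t.due}, {t.percent_done}% done)")
```

A real agent would pull this data from the work-tracking system and wrap the flagged items in a generated narrative, but the gating rule stays this simple.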

2. Route work between humans
Outcome: Move faster and stay in control
Rather than work sitting in queues, agents match tasks to people based on skill, capacity, and project context. Workload balancing happens continuously, not just during planning cycles. The routing reasoning is transparent, so humans can course-correct before anything gets assigned, not after.
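
Skill-and-capacity matching with transparent reasoning can be sketched in a few lines of Python. The scoring scheme, names, and data shape below are illustrative assumptions, not any vendor's implementation; the point is that the agent returns its reasoning alongside the match, so a human can override before assignment:

```python
def route_task(task_skills, people):
    """Score each person on skill overlap and spare capacity; return the
    best match plus per-person reasoning for human review."""
    best, best_score, reasons = None, -1.0, {}
    for name, info in people.items():
        overlap = len(set(task_skills) & set(info["skills"]))
        spare = 1.0 - info["load"]  # load runs 0.0 (idle) to 1.0 (fully booked)
        score = overlap + spare
        reasons[name] = f"{overlap} matching skill(s), {spare:.0%} spare capacity"
        if score > best_score:
            best, best_score = name, score
    return best, reasons

# Hypothetical team state.
people = {
    "ana": {"skills": ["python", "etl"], "load": 0.9},
    "ben": {"skills": ["python"], "load": 0.2},
}
assignee, why = route_task(["python", "etl"], people)
```

Here the skill match outweighs the capacity gap, so "ana" wins despite being nearly fully booked; a reviewer seeing the reasons dictionary could reasonably reassign to "ben" instead, which is exactly the course-correction step the pattern calls for.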

3. Facilitate team communication
Outcome: Work smarter and move faster
Agents summarize channels, threads, and meeting recordings so team members can catch up on key decisions without reading every message or attending every call. Conversation history is carried forward when new participants join, so no one needs a manual recap. Duplicate questions and re-explanation across roles disappear; async summaries replace synchronous meetings.

4. Role-specific agents in chat
Outcome: Work smarter and move faster
Specialist agents are deployed directly inside the communication tools teams already use, handling role-specific tasks, like onboarding questions, IT incidents, and sales briefings, without requiring anyone to switch tools or open a separate portal. A single emoji reaction can turn a Slack message into a tracked ticket. The work happens where the conversation is.

5. Conversational context
Outcome: Move faster and work smarter
Agents maintain full thread and file awareness across multi-participant conversations. When one person prompts an agent, the whole team benefits from what it learned. New members and agents alike can pick up exactly where the work left off, and shared context prevents the fragmented, duplicated prompting that happens when every team member re-explains the same problem from scratch.

6. Role-based access control (RBAC)
Outcome: Stay in control
Agents inherit only the access their assigned role allows, enforced down to the field level. An agent can't read, reason about, or act on data its assigned identity isn't authorized to see. Every action is logged, creating a deterministic audit trail that teams need for compliance in shared environments.
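
Field-level enforcement plus an audit trail amounts to filtering every read through a role's allow-list and logging what was exposed. A minimal sketch, with hypothetical roles, fields, and record shape:

```python
import datetime

# Hypothetical per-role field allow-lists.
ROLE_FIELDS = {
    "support_agent": {"ticket_id", "status", "summary"},
    "billing_agent": {"ticket_id", "invoice_total", "payment_method"},
}

audit_log = []

def read_record(agent_role, record):
    """Return only the fields the agent's role may see, and log the access."""
    allowed = ROLE_FIELDS.get(agent_role, set())
    visible = {k: v for k, v in record.items() if k in allowed}
    audit_log.append({
        "ts": datetime.datetime.now(datetime.timezone.utc).isoformat(),
        "role": agent_role,
        "fields": sorted(visible),
    })
    return visible

record = {"ticket_id": 42, "status": "open", "invoice_total": 99.0}
view = read_record("support_agent", record)  # invoice_total is filtered out
```

Because the filter runs before the agent ever sees the record, the agent cannot reason about or act on the withheld fields, and the append-only log gives the deterministic audit trail described above.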

7. Governed environments
Outcome: Stay in control and work smarter
Agents move through dev, test, and production via managed pipelines, the same way code does. Isolated sandbox environments prevent conflicts during early build phases. Managed promotion pipelines ensure makers' ongoing updates don't disrupt live work. Untested agents don't reach production, and uncontrolled updates don't break it.
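
A promotion gate of this kind, one stage at a time with an evaluation check before production, could be sketched as follows. The stage names and the `evals_passed` flag are illustrative assumptions, not a real pipeline API:

```python
STAGES = ["dev", "test", "prod"]

def promote(agent, target):
    """Advance an agent one stage at a time; production requires passed evals.
    Raises instead of silently deploying an untested agent."""
    current = STAGES.index(agent["stage"])
    wanted = STAGES.index(target)
    if wanted != current + 1:
        raise ValueError(f"can only promote one stage at a time ({agent['stage']} -> {target})")
    if target == "prod" and not agent.get("evals_passed"):
        raise ValueError("agent has not passed evaluation; refusing prod deploy")
    agent["stage"] = target
    return agent

agent = {"name": "triage-bot", "stage": "dev", "evals_passed": False}
promote(agent, "test")       # allowed: dev -> test
try:
    promote(agent, "prod")   # blocked: evals not yet passed
except ValueError as e:
    print(e)
```

Real platforms express the same idea as managed pipeline configuration rather than application code, but the invariants are identical: no stage skipping, and no production deploy without a passing evaluation.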

8. Collaborate on building agents
Outcome: Move faster and stay in control
Multiple team members can co-own, edit, and maintain agents with tiered permission structures. Shared development studios let teams debug agents together in real time. Standardized protocols ensure agents built by different contributors stay compatible.

What I took away from the landscape

A few things stood out to me across the full competitive set. AI is moving into chat, with agents being embedded where teams already work rather than in separate tools. Governance is becoming non-negotiable as teams scale agent usage. And agent-building is becoming a team sport, where shared ownership, collaborative iteration, and auditable versioning are now table stakes.

The coordination tax (status meetings, re-explanation across roles, manual check-ins) is a design problem that agents are beginning to solve. The platforms pulling ahead aren't the ones with the most capable individual agent. They're the ones designing the most coherent team experience around their agents.

One pattern I found particularly striking: The rarest capability across the entire landscape is a unified experience that integrates environment grouping, catalog sharing, and managed promotion pipelines in a single place. Most platforms are solving pieces of the governance puzzle. Very few have connected them end to end.

Why this matters for GitLab

GitLab's DevSecOps lifecycle creates a structural advantage most competitors don't have: The entire software delivery workflow already lives in one platform. Agents don't need to be bolted into workflows from the outside. They can be designed to live inside them.

GitLab Duo Agent Platform is built on this principle. Your workflows define the rules, your context maintains organizational knowledge, and your guardrails ensure control, so teams can orchestrate while agents execute across the full software development lifecycle.

Try GitLab Duo Agent Platform for free today.
