Sr. Content Developer at Microsoft, working remotely in PA, TechBash conference organizer, former Microsoft MVP, Husband, Dad and Geek.
150886 stories
·
33 followers

ADeLe: Predicting and explaining AI performance across tasks

1 Share
ADeLe | Three white line icons, showing a circle with a checkmark, a search document, and a set of tools, on a blue‑to‑green gradient background.

At a glance

  • AI benchmarks report performance on specific tasks but provide limited insight into underlying capabilities; ADeLe evaluates models by scoring both tasks and models across 18 core abilities, enabling direct comparison between task demands and model capabilities.
  • Using these ability scores, the method predicts performance on new tasks with ~88% accuracy, including for models such as GPT-4o and Llama-3.1.
  • It builds ability profiles and identifies where models are likely to succeed or fail, highlighting strengths and limitations across tasks.
  • By linking outcomes to task demands, ADeLe explains differences in performance, showing how it changes as task complexity increases.

AI benchmarks report how large language models (LLMs) perform on specific tasks but provide little insight into their underlying capabilities that drive their performance. They do not explain failures or reliably predict outcomes on new tasks. To address this, Microsoft researchers in collaboration with Princeton University and Universitat Politècnica de València introduce ADeLe (opens in new tab) (AI Evaluation with Demand Levels), a method that characterizes both models and tasks using a broad set of capabilities, such as reasoning and domain knowledge, so performance on new tasks can be predicted and linked to specific strengths and weaknesses in a model.

In a paper published in Nature, “General Scales Unlock AI Evaluation with Explanatory and Predictive Power (opens in new tab),” the team describes how ADeLe moves beyond aggregate benchmark scores. Rather than treating evaluation as a collection of isolated tests, it represents both benchmarks and LLMs using the same set of capability scores. These scores can then be used to estimate how a model will perform on tasks it has not encountered before. The research was supported by Microsoft’s Accelerating Foundation Models Research (AFMR) grant program.

ADeLe-based evaluation

ADeLe scores tasks across 18 core abilities, such as attention, reasoning, domain knowledge, and assigns each task a value from 0 to 5 based on how much it requires each ability. For example, a basic arithmetic problem might score low on quantitative reasoning, but an Olympiad-level proof would score much higher.

Evaluating a model across many such tasks produces an ability profile—a structured view of where the model performs and where it breaks down. Comparing this profile to the demands of a new task makes it possible to identify the specific gaps that lead to failure. The process is illustrated in Figure 1.

Diagram illustrating a two-stage AI evaluation framework: the top panel shows model performance on the ADeLe benchmark and resulting ability profiles, while the bottom panel shows how task-level scoring criteria are applied to derive task demand profiles.
Figure 1. Top: (1) Model performance on the ADeLe benchmark and (2) the resulting ability profiles, showing each model’s strengths and limitations across core abilities. Bottom: (1) Application of 18 scoring criteria to each task and (2) the resulting task profiles, showing the abilities each task requires.

Evaluating ADeLe

Using ADeLe, the team evaluated a range of AI benchmarks and model behaviors to understand what current evaluations capture and what they miss. The results show that many widely used benchmarks provide an incomplete and sometimes misleading picture of model capabilities and that a more structured approach can clarify those gaps and help predict how models will behave in new settings.

ADeLe shows that many benchmarks do not isolate the abilities they are intended to measure or only cover a limited range of difficulty levels. For example, a test designed to evaluate logical reasoning may also depend heavily on specialized knowledge or metacognition. Others focus on a narrow range of difficulty, omitting both simpler and more complex cases. By scoring tasks based on the abilities they require, ADeLe makes these mismatches visible and provides a way to diagnose existing benchmarks and design better ones.

Applying this framework to 15 LLMs, the team constructed ability profiles using 0–5 scores for each of 18 abilities. For each ability, the team measured how performance changes with task difficulty and used the difficulty level at which the model has a 50% chance of success as its ability score. Figure 2 illustrates these results as radial plots that show where the model performs well and where it breaks down.

Radar charts comparing ability profiles of 15 large language models across 18 abilities, grouped by model families: OpenAI models on the left, LLaMA models in the middle, and DeepSeek-R1-Distill-Qwen models on the right.
Figure 2. Ability profiles for 15 LLMs across 18 abilities. Left: OpenAI models. Middle: Llama models. Right: DeepSeek-R1 distilled models.

This analysis shows that models differ in their strengths and weaknesses across abilities. Newer models generally outperform older ones, but not consistently across all abilities. Performance on knowledge-heavy tasks depends strongly on model size and training, while reasoning-oriented models show clear gains on tasks requiring logic, learning, abstraction, and social inference. These patterns typically require multiple, separate analyses across different benchmarks and can still produce conflicting conclusions when task demands are not carefully controlled. ADeLe surfaces them within a single framework.

ADeLe also enables prediction. By comparing a model’s ability profile to the demands of a task, it can forecast whether the model will succeed, even on tasks that are unfamiliar. In experiments, this approach achieved approximately 88% accuracy for models like GPT-4o and LLaMA-3.1-405B, outperforming traditional methods. This makes it possible to both explain and anticipate potential failures before deployment, improving the reliability and predictability of AI model assessment.

Whether AI systems can truly reason is a central debate in the field. Some studies report strong reasoning performance, while others show they break down at scale. These results reflect differences in task difficulty. ADeLe shows that benchmarks labeled as measuring “reasoning” vary in what they require, from basic problem-solving to tasks that combine the need for advanced logic, abstraction, and domain knowledge. The same model can score above 90% on lower-demand tests and below 15% on more demanding ones, reflecting differences in task requirements rather than a change in capability.

Reasoning-oriented models like OpenAI’s o1 and GPT-5 show measurable gains over standard models—not only in logic and mathematics but also with interpreting user intent. However, performance declines as task demands increase. AI systems can reason, but only up to a point, and ADeLe identifies where that point is for each model.

Spotlight: Event Series

Microsoft Research Forum

Join us for a continuous exchange of ideas about research in the era of general AI. Watch the first four episodes on demand.

Opens in a new tab

Looking ahead

ADeLe is designed to evolve alongside advances in AI and can be extended to multimodal and embodied AI systems. It also has the potential to serve as a standardized framework for AI research, policymaking, and security auditing.

More broadly, it advances a more systematic approach to AI evaluation—one that explains system behavior and predicts performance. This work builds on earlier efforts, including Microsoft research on applying psychometrics to AI evaluation and recent work on Societal AI, emphasizing the importance of AI evaluation.

As general-purpose AI systems continue to outpace existing evaluation methods, approaches like ADeLe offer a path toward more rigorous and transparent assessment in real-world use. The research team is working to expand this effort through a broader community. Additional experiments, benchmark annotations, and resources are available on GitHub (opens in new tab).

Opens in a new tab

The post ADeLe: Predicting and explaining AI performance across tasks appeared first on Microsoft Research.

Read the whole story
alvinashcraft
27 minutes ago
reply
Pennsylvania, USA
Share this story
Delete

Run multiple agents at once with /fleet in Copilot CLI

1 Share

What if GitHub Copilot CLI could work on five files at the same time instead of one? That’s where /fleet comes in.

/fleet is a slash command in Copilot CLI that enables Copilot to simultaneously work with multiple subagents in parallel. Instead of working through tasks sequentially, Copilot now has a behind the scenes orchestrator that plans and breaks your objective into independent work items and dispatches multiple agents to execute them simultaneously. On different files, in different parts of your codebase, all at once.

Want to learn more about how /fleet works, and more importantly, how to use it most effectively? Let’s jump in.

How it works

When you run /fleet with a prompt, the behind-the-scenes orchestrator:

  1. Decomposes your task into discrete work items with dependencies.
  2. Identifies which items can run in parallel versus which must wait.
  3. Dispatches independent items as background sub-agents simultaneously.
  4. Polls for completion, then dispatches the next wave.
  5. Verifies outputs and synthesizes any final artifacts.

Each sub-agent gets its own context window but shares the same filesystem. They can’t talk to each other directly; only the orchestrator coordinates.

Think of it as a project lead who assigns work to a team, checks in on progress, and assembles the final deliverable.

Getting started

Start fleet mode by sending /fleet <YOUR OBJECTIVE PROMPT>. For example:

/fleet Refactor the auth module, update tests, and fix the related docs in the folder docs/auth/ 

That’s it. The orchestrator takes your objective, figures out what can be parallelized, and starts dispatching.

You can also run it non-interactively in your terminal:

copilot -p "/fleet <YOUR TASK>" --no-ask-user

The --no-ask-user flag is required for non-interactive mode, since there’s no way to respond to prompts. Now let’s look at what makes a good prompt.

Write prompts that parallelize well

The quality of your /fleet prompt determines how effectively work gets distributed. The key is giving the orchestrator enough structure to cleanly break down your task.

A good way to do that is being specific about deliverables. Map every work item to a concrete artifact like a file, a test suite, or a section of documentation. Vague prompts lead to sequential execution because the orchestrator can’t identify independent pieces.

For example, instead of: /fleet Build the documentation, you could try:

/fleet Create docs for the API module: 

- docs/authentication.md covering token flow and examples 

- docs/endpoints.md with request/response schemas for all REST endpoints 

- docs/errors.md with error codes and troubleshooting steps 

- docs/index.md linking to all three pages (depends on the others finishing first)

The second prompt gives the orchestrator four distinct deliverables, three of which can run in parallel, and one that depends on them.

Set explicit boundaries

Sub-agents work best when they know exactly where their scope starts and ends. When writing your prompt include:

  • File or module boundaries: Which directories or files each track owns
  • Constraints: What not to touch (e.g., no test changes, no dependency upgrades)
  • Validation criteria: Lint, type checks, tests that must pass

Here’s a prompt that showcases these boundaries:

/fleet Implement feature flags in three tracks: 

1. API layer: add flag evaluation to src/api/middleware/ and include unit tests that look for successful flag evaluation and tests API endpoints  

2. UI: wire toggle components in src/components/flags/ and introduce no new dependencies 

3. Config: add flag definitions to config/features.yaml  and validate against schema 

Run independent tracks in parallel. No changes outside assigned directories. 

Declare dependencies when they exist

If one piece of work depends on another, say so. The orchestrator will serialize those items and parallelize the rest. For example:

/fleet Migrate the database layer: 

1. Write new schema in migrations/005_users.sql 

2. Update the ORM models in src/models/user.ts (depends on 1) 

3. Update API handlers in src/api/users.ts (depends on 2) 

4. Write integration tests in tests/users.test.ts (depends on 2) 

 Items 3 and 4 can run in parallel after item 2 completes. 

Use custom agents for different jobs

You can define specialized agents in .github/agents/ and reference them in your /fleet prompt. Each agent can specify its own model, tools, and instructions. Be aware that if you don’t specify which model to use, agents will use the current default model.

# .github/agents/technical-writer.md 

--- 

name: technical-writer 

description: Documentation specialist 

model: claude-sonnet-4 

tools: ["bash", "create", "edit", "view"] 

--- 

You write clear, concise technical documentation. Follow the project style guide in /docs/styleguide.md. 

Then reference the custom agent in your prompt:

/fleet Use @technical-writer.md as the agent for all docs tasks and the default agent for code changes. 

This is useful when different tracks need different strengths such as using a heavier model for complex logic and a lighter one for boilerplate documentation.

How to verify subagents are deploying

Watch how the orchestrator deploys subagents, it’s the fastest way to learn how to write prompts that parallelize well.

Use this quick checklist:

  • Decomposition appears: Before it starts working, review the plan Copilot shares with you to see if it breaks work into multiple tracks, instead of one long linear plan.
  • Background task UI confirms activity: Once it begins working, run /tasks to open the tasks dialog and inspect running background tasks.
  • Parallel progress appears: Updates reference separate tracks moving at the same time.

If the fleet doesn’t seem to be parallelizing, try stopping Copilot’s work and asking for an explicit decomposition:

Decompose this into independent tracks first, then execute tracks in parallel. Report each track separately with status and blockers. 

Avoiding common pitfalls

Fleet is powerful, but a few gotchas are worth knowing upfront.

Partition your files

Sub-agents share a filesystem with no file locking. If two agents write to the same file, the last one to finish wins—silently. No error, no merge, just an overwrite.

The fix is to assign each agent distinct files in your prompt. If multiple agents need to contribute to a single file, consider having each write to a temporary path and let the orchestrator merge them at the end. Or set an explicit order for the agents to follow.

Keep prompts self-contained

Sub-agents can’t see the orchestrator’s conversation history. When the orchestrator dispatches a sub-agent, it passes along a prompt, but that prompt needs to include everything the sub-agent needs. If you’ve already gathered useful context earlier in the session, make sure your /fleet prompt includes it (or references files the sub-agents can read).

Steering a fleet in progress

After dispatching, you can send follow-up prompts to guide the orchestrator:

  • Prioritize failing tests first, then complete remaining tasks.
  • List active sub-agents and what each is currently doing.
  • Mark done only when lint, type check, and all tests pass.

When to use /fleet (and when not to)

/fleet shines when your task has natural parallelism—multiple files, independent modules, or separable concerns. It’s particularly effective for:

  • Refactoring across multiple files simultaneously.
  • Generating documentation for several components at once.
  • Implementing a feature that spans API, UI, and tests.
  • Running independent code modifications that don’t share state.

For strictly linear, single-file work, regular Copilot CLI prompts are simpler and just as fast. Fleet adds coordination overhead, so it pays off when there’s real work to distribute.

/fleet is most useful when you treat it like a team, not a magic trick. Start small. Pick a task with clear outputs, clean file boundaries, and obvious parallelism. See how the orchestrator decomposes the work, where it helps, and where it gets in the way. As you get more comfortable, push it further with larger refactors, multi‑track features, or docs and tests in parallel. The fastest way to learn when /fleet pays off is to try it on real work and adjust your prompts based on what you see.

The post Run multiple agents at once with /fleet in Copilot CLI appeared first on The GitHub Blog.

Read the whole story
alvinashcraft
27 minutes ago
reply
Pennsylvania, USA
Share this story
Delete

Developer’s Guide to Building ADK Agents with Skills

1 Share
The Agent Development Kit (ADK) SkillToolset introduces a "progressive disclosure" architecture that allows AI agents to load domain expertise on demand, reducing token usage by up to 90% compared to traditional monolithic prompts. Through four distinct patterns—ranging from simple inline checklists to "skill factories" where agents write their own code—the system enables agents to dynamically expand their capabilities at runtime using the universal agentskills.io specification. This modular approach ensures that complex instructions and external resources are only accessed when relevant, creating a scalable and self-extending framework for modern AI development.
Read the whole story
alvinashcraft
27 minutes ago
reply
Pennsylvania, USA
Share this story
Delete

How to Turn SQL Database changes into Real-Time Intelligence with Fabric Eventstreams | Data Exposed

1 Share
From: Microsoft Developer
Duration: 17:26
Views: 27

Every insert, update, and delete in your SQL database is a business event — but turning those changes into real-time streams has traditionally meant stitching together complex pumbling, managing destination tables and handling schema changes that introduce downtime. In this episode, we show you how DeltaFlow in Microsoft Fabric Eventstreams eliminates that complexity. You will see how to go from a SQL database to analytics-ready, queryable tables in minutes; with automatic schema registration, destination table creation, and schema evolution handling built in. Whether you're building real-time dashboards, building real-time AI applications, this episode shows you the fastest path from your SQL database to real-time intelligence in Microsoft Fabric.

✅ Chapters:
0:00 Introduction
1:15 Why isn't every SQL application real-time yet?
2:45 Real-Time Intelligence
3:15 Fabric Eventstream: Ingest, process and route events in real-time
5:45 Building event-driven, real-time applications with DeltaFlow
8:15 Demo
12:25 Handling Source Schema Changes
14:00 Demo
15:25 Getting started

✅ Resources:
Building real-time, event-driven applications with Database CDC feeds and Fabric Eventstreams DeltaFlow (Preview): https://blog.fabric.microsoft.com/en-US/blog/building-real-time-event-driven-applications-with-database-cdc-feeds-and-fabric-eventstreams-deltaflow-preview/

📌 Let's connect:
Twitter - Anna Hoffman, https://twitter.com/AnalyticAnna
Twitter - AzureSQL, https://aka.ms/azuresqltw

🔴 Watch even more Data Exposed episodes: https://aka.ms/dataexposedyt

🔔 Subscribe to our channels for even more SQL tips:
Microsoft Azure SQL: https://aka.ms/msazuresqlyt
Microsoft SQL Server: https://aka.ms/mssqlserveryt
Microsoft Developer: https://aka.ms/microsoftdeveloperyt

#AzureSQL #SQL #LearnSQL

Read the whole story
alvinashcraft
27 minutes ago
reply
Pennsylvania, USA
Share this story
Delete

The Hard Lessons of Cloud Migration: inDrive's Path from Monolith to Microservices

1 Share
Join us for a fascinating conversation with Alexander 'Sasha' Lisachenko (Software Architect) and Artem Gab (Senior Engineering Manager) from inDrive, one of the global leaders in mobility operating in 49 countries and processing over 8 million rides per day. Sasha and Artem take us through their four-year transformation journey from a monolithic bare-metal setup in a single data center to a fully cloud-native microservices architecture on AWS. They share the hard-earned lessons from their migration, including critical challenges with Redis cluster architecture, the discovery of single-threaded CPU bottlenecks, and how they solved hot key problems using Uber's H3 hexagon-based geospatial indexing. We dive deep into their migration from Redis to Valkey on ElastiCache, achieving 15-20% cost optimization and improved memory efficiency, and their innovative approach to auto-scaling ElastiCache clusters across multiple dimensions. Along the way, they reveal how TLS termination on master nodes created unexpected bottlenecks, how connection storms can cascade when Redis slows down, and why engine CPU utilization is the one metric you should never ignore. This is a story of resilience, technical problem-solving, and the reality of large-scale cloud transformations — complete with rollbacks, late-night incidents, and the eventual triumph of a fully elastic, geo-distributed platform serving riders and drivers across the globe.

With Alexander Lisachenko, Software Architect, inDrive ; With Artem Gab, Senior Engineering Manager, Runtime Systems, inDrive





  • Download audio: https://op3.dev/e/dts.podtrac.com/redirect.mp3/developers.podcast.go-aws.com/media/201.mp3
    Read the whole story
    alvinashcraft
    27 minutes ago
    reply
    Pennsylvania, USA
    Share this story
    Delete

    Imagine Cup 2026 Semifinalist: Builder Series Judges

    1 Share

    With submissions closing on January 9, selected startups advance into the semifinals and step into this experience. From meeting their mentors to participating in build labs and pitch clinics, founders sharpen their product, their story, and their readiness for the global stage.  

    In the semifinals, startups present live and step into the next level of the competition. 

    They pitch in front of a panel of AI experts, startup founders, and investors, each bringing real-world experience in building, scaling, and backing technology. Through live Q&A and direct feedback, startups gain insight that challenges their thinking, strengthens their approach, and helps move their solution forward. 

    Meet the semifinals judges (listed in alphabetical order): 
     

    Mike Abbott

    Mike Abbott is a Partner at Antler, co-leading its Australian operations and backing founders from day zero through scale. With a background in equity capital markets and M&A, he was an early Uber leader in Asia and later Head of Operations for Australia and New Zealand, helping scale the business from a small team to a multi-billion-dollar operation. As cofounder of Kaddy, a B2B marketplace acquired within three years, he brings deep experience in building, scaling, and investing in startups. 
     

     

     

     

    Todd Anglin

    Todd Anglin is a Partner Developer Relations Lead at Microsoft with proven experience building and scaling high-performing teams. With a background spanning web and mobile development, cloud native platforms, and low code tools, he has led product, developer relations, and go-to-market efforts across growing technology companies. Known for his strength in communication, he brings the ability to translate complex technical concepts for any audience while helping teams move quickly and build with impact. 

     

     

     

     

     

    Rania Awad

    Rania Awad is Chief Strategy Officer at Helfie.AI and a strategic leader at the intersection of AI, healthcare, and digital transformation. With experience across SaaS, health tech, and global digital businesses, she has led high-impact initiatives that turn bold ideas into scalable outcomes. Known for her cross-functional leadership and strong commercial lens, she brings a focus on connecting strategy to execution to drive meaningful impact. 

     

     

     

    Rick Clause

    Rick Claus is a Cloud Advocate Team Lead at Microsoft with over 25 years of experience in the IT industry. As part of the Developer Relations Cloud Advocacy team, he focuses on enabling cross-team collaboration and engaging global technical communities around Azure and hybrid cloud solutions. With a background in enterprise architecture, virtualization, and technical training, he brings deep expertise in connecting product, engineering, and technical audiences to improve the overall customer experience. 

     

     

     

     

     

    Sonia Cuff

    Sonia Cuff leads the Cloud Native & Linux team inside Microsoft's Developer Relations division, connecting with technical communities worldwide. She has over 30 years’ experience in tech, from large enterprises and government to small businesses and partners. Sonia is passionate about the connection between technology and business. 

     

     

     

     

     

    Mal Filipowska

    Małgorzata (Mal) Filipowska  is a venture capitalist with nearly a decade of experience investing in early-stage companies across emerging markets. As part of Seedstars International Ventures, a fund backed by global institutions including the World Bank, Rockefeller Foundation, and Visa Foundation, she manages a portfolio of over 130 companies across 40 countries, supporting founders across diverse and high-growth markets. She brings deep insight into scaling startups in these regions and a strong perspective on early-stage growth. 

     

     

     

     

    Alexandra Miele

    Alexandra Miele leads Platform at HOF Capital, where she drives portfolio engagement and builds strategic partnerships across the firm’s global network. With experience spanning venture, private capital, and institutional investing, she previously served as a Partner at a family office managing a $1B+ portfolio and held leadership roles at Rockefeller Capital Management and Goldman Sachs. She brings deep insight into alternative investments, growth strategy, and supporting companies from early stage through scale. 

     

     

     

     

     

    Nigel Parker is a technology leader with over 30 years of experience across cloud, data platforms, machine learning, and AI. Having led global engineering and architecture teams, including serving as Chief Engineer for Microsoft Asia Commercial Software Engineering, he co-founded Vivara, an AI-driven wellbeing platform and works as a Data & AI consultant at Arinco (The Artificial Intelligence Company). He brings deep expertise in building scalable systems, integrating AI, and designing technology with a strong focus on human outcomes. 

     

     

     

    Sarah Thiam is the founder and CEO of Germina Labs, a AI x Web3 studio focused on developer-facing products, programs and tooling. With a product and developer relations background at Microsoft, Protocol Labs and the Singapore government, she brings a well-rounded perspective to scaling technical ecosystems. 

     

     


     
    Up next 

     

    The top three startups will advance to the World Championship, where they will compete on the global stage for the title and a $100,000 USD prize, along with a mentorship session with Satya Nadella, Chairman and CEO of Microsoft. 

    This is where everything comes together, as startups step forward to showcase what they have built and how they are ready to scale. 

    Follow along on Instagram, LinkedInX and Facebook for the latest updates, startups announcements, and moments leading up to the World Championship. 

    Read the whole story
    alvinashcraft
    28 minutes ago
    reply
    Pennsylvania, USA
    Share this story
    Delete
    Next Page of Stories