Sr. Content Developer at Microsoft, working remotely in PA, TechBash conference organizer, former Microsoft MVP, Husband, Dad and Geek.

Practicing What We Preach: AI in Action

1 Share

At ivision, AI isn’t an experiment or a bolt‑on tool. It’s embedded into our daily operations. What began as early internal exploration has evolved into a mature, repeatable workflow that enhances how we listen, analyze, design solutions, and communicate with clients. The result is 16+ hours saved per week per knowledge worker, while delivering higher‑quality outcomes. 

That impact comes from integrating AI into nearly every stage of the work process, rather than treating it as a separate step. 

Meetings & Action Items 

Most workflows begin in meetings. Like many folks at ivision, a significant portion of my day is spent collaborating with clients and internal stakeholders. With Copilot enabled, meeting transcripts become a strategic asset rather than a byproduct. During conversations, AI allows real‑time interaction with the transcript by summarizing key points, clarifying requirements, and validating understanding as the discussion unfolds. Instead of focusing on note‑taking, we can focus on listening, asking better questions, and engaging on a deeper level. 

Once a meeting ends, the transcript becomes the foundation for deeper analysis. I begin by summarizing the discussion and identifying themes or trends that surfaced during the conversation. From there, I start exploring potential solutions, pulling in additional information and resources. I might ask AI whether a proposed approach actually meets the requirements that were discussed or ask it to suggest alternative options that could address the same problem. This iterative process allows me to move quickly from raw conversation to structured thinking around possible architectures or strategies. 

Often, I’ll switch to voice interaction so I can think through a problem out loud. I may pause and ask questions like “Does this make sense based on the requirements?” or “What did that acronym mean?” The conversation becomes similar to having a peer to bounce ideas off of while working toward a practical solution I can take to the next stage. 

Documentation & Deliverables  

Documentation and deliverable creation have also transformed. Rather than starting from a blank page, we guide AI through structured prompts and then refine the output to align with ivision’s standards, voice, and objectives. This allows research and document creation to happen at the same time. Whether outlining deployment options, evaluating tradeoffs, or aligning recommendations to partner guidance, AI accelerates both creation and validation. The role of our teams shifts from manual production to architectural thinking, directing the narrative while AI supports execution. 

Meeting Preparation  

Preparation is another area where AI delivers tangible advantage. Before client conversations, we routinely use AI to simulate discussions by role‑playing both the client and the solution architect. This helps surface likely objections and pressure‑test responses in advance, allowing teams to walk into meetings better prepared and more confident. This capability consistently strengthens delivery, sales conversations, and executive communications. 

Time Saved  

The productivity gains are substantial. Tasks that once took hours now take minutes. Presentations that previously required half a day to research and draft can often be assembled in under an hour, all without sacrificing quality. On two recent occasions while at the ball field with one of my sons, I glanced at my calendar and realized I had an upcoming client discussion. I picked up my phone and said, “Create a technical discussion guide I can use to walk a client through these three options.” I put my phone down, watched the next inning, and came back to a polished ten‑slide draft.  

When I was back at my computer, within twenty minutes I had reviewed, validated, branded, and refined the deck. That same task would previously have taken at least half a day for a less polished output. When I presented it, the client’s response was simply, “Wow, that is an amazing deck. I can see you guys know what you are doing. Can I get a copy to share with my boss?” 

| Task | Time Before AI | Time Using AI | How AI Helps |
| --- | --- | --- | --- |
| Meeting Notes | 30–45 minutes after meetings | 5–10 minutes | Copilot transcripts automatically capture the discussion and allow quick summarization |
| Understanding Client Requirements | 30–60 minutes reviewing notes and emails | 10–15 minutes | AI analyzes transcripts and highlights key themes and requirements |
| Solution Exploration | 1–2 hours | 20–30 minutes | AI helps evaluate options, validate approaches, and suggest alternatives |
| Presentation Creation | 3–4 hours | 30–45 minutes | AI generates slide outlines, research content, and draft narratives |
| Proposal / Document Drafting | 3–4 hours | 30–90 minutes | AI creates structured drafts that are then edited and refined |
| Meeting Preparation | 30–60 minutes | 10–20 minutes | AI role‑plays objections and helps refine positioning |
| Content Review and Validation | 30 minutes | 5 minutes | AI reviews for completeness, logic, and requirement alignment |
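
As a rough sanity check on the 16+ hours figure, the per-task savings can be multiplied by weekly task counts. The frequencies in the sketch below are illustrative assumptions, not ivision data; the per-task savings are midpoints of the ranges in the table above:

```javascript
// Back-of-envelope check of the "16+ hours per week" claim.
// savedMin = midpoint of (time before AI - time using AI) from the table;
// perWeek  = ASSUMED weekly frequency for illustration only.
const tasks = [
  { name: "Meeting notes",                 savedMin: 30,  perWeek: 8 },
  { name: "Understanding requirements",    savedMin: 32,  perWeek: 4 },
  { name: "Solution exploration",          savedMin: 65,  perWeek: 3 },
  { name: "Presentation creation",         savedMin: 172, perWeek: 1 },
  { name: "Proposal / document drafting",  savedMin: 150, perWeek: 1 },
  { name: "Meeting preparation",           savedMin: 30,  perWeek: 5 },
  { name: "Content review and validation", savedMin: 25,  perWeek: 5 },
];

// Sum minutes saved across all tasks, then convert to hours.
const totalHours =
  tasks.reduce((sum, t) => sum + t.savedMin * t.perWeek, 0) / 60;

console.log(totalHours.toFixed(1)); // "19.3" with these assumed frequencies
```

With these assumed frequencies the table lands comfortably above the 16-hour claim; lighter meeting loads would land lower.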

Ultimately, AI allows ivision teams to spend more time thinking and less time formatting. It turns conversations into insights, insights into solutions, and solutions into polished, client‑ready deliverables. The result is a faster, more flexible, and more consistent workflow. In an environment where speed, precision, and preparedness matter, AI has become a true competitive advantage for ivision and a better way to work. Get in touch with our team to bring these principles to your organization and watch your productivity skyrocket. 

The post Practicing What We Preach: AI in Action appeared first on ivision.

Read the whole story
alvinashcraft
just a second ago
reply
Pennsylvania, USA
Share this story
Delete

New in ThemeBuilder: Typography and AI Theming Enhancements


See how the ThemeBuilder Typography module for centralized font management and component-level AI theming enhancements can help with app design and development.

Progress ThemeBuilder is a visual styling tool for customizing Telerik and Kendo UI components. Instead of digging through component documentation, ThemeBuilder provides an intuitive interface that allows us to see styling changes applied in real time. We can adjust colors, spacing, typography and more, then export production-ready CSS/SASS for our applications.

With the 2025 Q4 release, ThemeBuilder introduced two cool updates: a dedicated Typography module for centralized font management and component-level AI theming enhancements that give us finer control over AI-generated styles. In this article, we’ll spend a little time exploring both.

Setting Up a Theme with AI

Before jumping into the new features, let’s quickly set up a theme using AI to see how these enhancements fit into the workflow. In the ThemeBuilder interface, we’ll find a Generate panel where we can describe our desired theme in plain English. To get started, we can enter something like: “Create a clean, modern analytics theme with a cool blue-gray palette that feels data-driven and professional, suitable for a B2B software dashboard.”

After generating, the AI analyzes the description and produces a cohesive design system. Within seconds, we’ll see a complete theme applied across all components: buttons adopt a refined blue accent, inputs feature subtle borders with appropriate focus states, and data visualization components like charts and grids receive complementary styling that maintains readability.

This theme serves as a good starting point. The colors work well together, the spacing feels balanced and there’s visual consistency across the component library.

But what if we want to adjust the appearance of specific components without altering the entire theme? This is where the new component-level AI theming comes in.

Component-Level AI Theming

One of the challenges with AI theme generation has been the “all or nothing” approach. Previous AI theming would regenerate our entire theme, which was great for starting fresh but less helpful when we only wanted to tweak how buttons looked or adjust the styling of our data grid headers.

The new component-level AI theming enhancements solve this problem. We can now target individual components with AI-assisted styling while preserving our overall theme. Instead of manually hunting through variables and properties, we can use the AI theming interface to target just, say, the React Button component.

To see an example of this, we can enter a specific prompt like: “Make the primary button more prominent with a stronger visual presence and subtle gradient that draws attention for call-to-action scenarios.”

The AI adjusts only the primary (solid) button-related variables and styles, enhancing the gradient effect and adding a bit more visual weight to the component. The rest of our theme remains untouched!

We can continue refining our design in the same way. For example, we can apply an AI-driven override to the specific React TextBox component to make it more visually distinctive and usable.

We might use a prompt like: “Redesign the TextBox component with a more prominent border, a subtle background tint, stronger focus glow, and slightly increased padding to make input fields visually distinct and easier to scan in complex layouts.”

The AI updates only the TextBox-related styles while preserving other existing styles in the components module.

New Typography Module

While AI theming handles creative decisions, the new Typography module provides structured, systematic control over our theme’s typography settings. It is located alongside the existing modules like Metrics, Colors, etc.

In the module, we can define reusable typography variables that bundle multiple text properties into a single unit. Each variable can include font family, size, line height, letter spacing, text transform, font style and text decoration.

Once defined, these typography variables can be assigned to component parts like inputs, headers and form labels. Here’s an example of applying a custom typography set to the text of the Button component.

This provides both flexibility and long-term maintainability, allowing typography variables to be defined once and applied consistently across components with just a few clicks.

Wrap-up

The 2025 Q4 updates to ThemeBuilder address two common pain points.

  • Component-level AI theming lets us refine specific parts of our theme without starting over, making AI generation more practical for real-world projects.
  • The Typography module brings font management into a centralized, reusable system, letting us define text styles once and apply them consistently across components with ease.

Ready to try these new capabilities? Explore ThemeBuilder and see how the latest updates can streamline your theming workflow.

If you aren’t already using Telerik or Kendo UI components, see how ThemeBuilder complements these robust libraries in the Telerik DevCraft bundles.

Try Telerik DevCraft


Build Accessible Components with Angular Aria


A simple way to add accessibility to your Angular app is with Angular Aria, which gives you production-ready, WCAG-compliant directives.

Building accessible components is one of those things we know we should do, but often skip because it feels overwhelming. It means reading accessibility tips and tricks and a lot of documentation.

You start with a simple dropdown menu, knowing you need to handle keyboard navigation, ARIA attributes, focus management and screen reader support. Before you know it, your “simple” component has 200 lines of accessibility code you’re not even sure is correct. (Unless you’re using the Progress Kendo UI for Angular library, which has accessibility baked in for you.)

What if I told you there’s a way to get the accessibility behavior you need, regardless of component library, while keeping full control over your styling? That’s the magic of Angular Aria.

But what is Angular Aria? Let’s start with a short introduction.

What Is Angular Aria?

Think of Angular Aria as a collection of accessibility superpowers for your components, but instead of manually implementing keyboard navigation, ARIA attributes and focus management, you import a directive, add it to your HTML and, boom, your component is accessible.

The Angular team built these directives following the W3C Accessibility Guidelines, so you don’t have to become an accessibility expert to build compliant components.

Hold On, What About Angular Material?

Great question! If you’ve been using Angular for a while, you’re probably thinking: “We already have Angular Material. Why do we need another library?”

Here are the key differences:

Angular Material gives you complete, prestyled components. They look great out of the box, but they come with Material Design opinions baked in. If you want a button that doesn’t look like a Material button, you’re going to fight the framework.

Angular Aria gives you headless directives—just the accessibility logic, zero styling. You get all the keyboard navigation, ARIA attributes and screen reader support, but you control every pixel of how it looks.

Think of it this way:

  • Angular Material: It is plug and play, looks good, works immediately, but everyone has the same result. If you were building a physical doorway to your business, it would look like all other businesses, just with your business name on the sign.
  • Angular Aria: This tool is more like the ramp to your front door, enabling anyone to access the business entrance, while allowing you to choose the awning, the window display and the door color.

So when should you use each one? We’ll dive deeper into that later in the article, but here’s the quick answer:

  • Use Angular Material when you need to ship fast and Material Design works for you.
  • Use Angular Aria when you have custom design requirements and need full control.

Remember, accessibility isn’t optional anymore. It’s a legal requirement in many countries, and, more importantly, it’s the right thing to do.

But implementing accessibility correctly is hard. You need to know:

  • Which ARIA attributes to use (and when)
  • How keyboard navigation should work for each pattern
  • How to manage focus properly
  • How screen readers interpret your markup

Angular Aria handles this complexity for you. You focus on the HTML structure, CSS styling and business logic, and Angular Aria takes care of accessibility.
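
To make the saved complexity concrete, here is a framework-free sketch of just one small slice of it: the index arithmetic behind “roving tabindex” keyboard navigation, which directives like Angular Aria’s toolbar implement for you. The function name and behavior here are illustrative, not an Angular Aria API:

```javascript
// Sketch of roving-tabindex navigation math: given the focused item's
// index, the key pressed, and the item count, return the next index.
// Arrow keys move focus, Home/End jump, and movement wraps at the edges.
function nextToolbarIndex(current, key, count) {
  switch (key) {
    case "ArrowRight":
    case "ArrowDown":
      return (current + 1) % count;          // wrap forward
    case "ArrowLeft":
    case "ArrowUp":
      return (current - 1 + count) % count;  // wrap backward
    case "Home":
      return 0;                              // jump to first widget
    case "End":
      return count - 1;                      // jump to last widget
    default:
      return current;                        // key not handled
  }
}

console.log(nextToolbarIndex(6, "ArrowRight", 7)); // 0 (wraps around)
```

And this is only the index math; a real implementation must also flip `tabindex` between 0 and -1, call `.focus()`, and keep ARIA state in sync, which is exactly what the directives do for us.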

But, as always, the best way to learn is by building something. Let’s do it!

Set Up the Project

First, create a new Angular application. In your terminal, run the following command (be sure to have Node.js installed).

npx @angular/cli@latest new angular-aria-demo

When the CLI prompts for a stylesheet format, pick CSS.

  - **Which stylesheet format would you like to use?** → CSS

Now, navigate to your project and add Angular Aria:

cd angular-aria-demo
npm install @angular/aria

That’s it. We now have a fresh Angular project with Angular Aria installed. Let’s build something accessible!

Building an Accessible Toolbar

Let’s build a text formatting toolbar, the kind you see in rich text editors. This is a perfect example because it looks simple but has surprising accessibility complexity.

Using the CLI, generate a new component editor-toolbar:

ng generate component components/editor-toolbar

Perfect! Open the component with your editor, and import Toolbar, ToolbarWidget and ToolbarWidgetGroup directives provided by @angular/aria/toolbar.

  import { Component } from '@angular/core';
  import { Toolbar, ToolbarWidget, ToolbarWidgetGroup } from '@angular/aria/toolbar';
  
  @Component({
    selector: 'app-editor-toolbar',
    templateUrl: './editor-toolbar.component.html',
    styleUrl: './editor-toolbar.component.css',
    imports: [Toolbar, ToolbarWidget, ToolbarWidgetGroup],
  })
  export class EditorToolbarComponent {}

Now it’s time to build the HTML structure. We create a div container with the ngToolbar directive, which works as the main container. We also apply ngToolbarWidget to individual buttons and ngToolbarWidgetGroup to groups of related buttons.

In the following HTML, we use each directive on div and button elements. Copy it and paste it into editor-toolbar.component.html.

  <div ngToolbar aria-label="Text Formatting Tools">
    <div class="group">
      <button ngToolbarWidget
              value="undo"
              type="button"
              aria-label="undo">
        Undo
      </button>
      <button ngToolbarWidget
              value="redo"
              type="button"
              aria-label="redo">
        Redo
      </button>
    </div>

    <div class="separator" role="separator"></div>

    <!-- Text formatting group -->
    <div class="group">
      <button ngToolbarWidget
              value="bold"
              type="button"
              aria-label="bold"
              #bold="ngToolbarWidget"
              [aria-pressed]="bold.selected()">
        Bold
      </button>
      <button ngToolbarWidget
              value="italic"
              type="button"
              aria-label="italic"
              #italic="ngToolbarWidget"
              [aria-pressed]="italic.selected()">
        Italic
      </button>
    </div>

    <div class="separator" role="separator"></div>

    <!-- Alignment group (radio buttons) -->
    <div ngToolbarWidgetGroup
         role="radiogroup"
         class="group"
         aria-label="Text alignment options">
      <button ngToolbarWidget
              role="radio"
              type="button"
              value="align left"
              aria-label="align left"
              #leftAlign="ngToolbarWidget"
              [aria-checked]="leftAlign.selected()">
        Left
      </button>
      <button ngToolbarWidget
              role="radio"
              type="button"
              value="align center"
              aria-label="align center"
              #centerAlign="ngToolbarWidget"
              [aria-checked]="centerAlign.selected()">
        Center
      </button>
      <button ngToolbarWidget
              role="radio"
              type="button"
              value="align right"
              aria-label="align right"
              #rightAlign="ngToolbarWidget"
              [aria-checked]="rightAlign.selected()">
        Right
      </button>
    </div>
  </div>

The final step is adding some CSS to make it look nice. Open editor-toolbar.component.css and paste in the following styles:

  [ngToolbar] {
    display: flex;
    gap: 8px;
    padding: 8px;
    background: #f5f5f5;
    border-radius: 4px;
  }

  .group {
    display: flex;
    gap: 4px;
  }

  .separator {
    width: 1px;
    background: #ddd;
  }

  button {
    padding: 8px 12px;
    border: 1px solid #ddd;
    background: white;
    border-radius: 4px;
    cursor: pointer;
  }

  button[aria-pressed="true"],
  button[aria-checked="true"] {
    background: #007acc;
    color: white;
  }
Perfect, we just added HTML, CSS and Angular Aria directives. To test it, open app.html, add <app-editor-toolbar></app-editor-toolbar>, save your changes and run the app:

ng serve
Initial chunk files | Names         | Raw size
main.js             | main          | 10.50 kB | 
styles.css          | styles        | 95 bytes | 

                    | Initial total | 10.59 kB

Application bundle generation complete. [0.641 seconds] - 2026-01-25T10:16:33.843Z

Watch mode enabled. Watching for file changes...
NOTE: Raw file sizes do not reflect development server per-request transformations.
  ➜  Local:   http://localhost:4200/
  ➜  press h + enter to show help

User navigates buttons with keyboard arrow keys

Try using your toolbar with the keyboard. You’ll find it works, and we didn’t have to write any code for keyboard navigation logic (Arrow keys, Home, End), focus management, ARIA role attributes, screen reader announcements or selection state management. All of that complexity? Gone. 

The ngToolbar, ngToolbarWidget and ngToolbarWidgetGroup directives handle all of that automatically.

We now have a fully accessible toolbar that works with keyboard navigation and screen readers, and we only wrote the HTML structure and CSS.

Angular Aria provides directives for the most common interactive patterns. We get components for search and selection (like Autocomplete, Listbox and Select), navigation and actions (Menu, Menubar and the Toolbar we just built), and content organization (Accordion, Tabs, Tree and Grid). Every directive comes with complete documentation, working examples and API references. You can see the full list of available components in the official Angular Aria documentation.

But I’m a Fan of Angular Material

Yes, I understand Angular Material has a long relationship with Angular devs. You can continue using Angular Material when you:

  1. Need to ship fast – You’re building an MVP or internal tool and don’t want to spend time on custom styling.
  2. Like Material Design – You’re OK with the Google Material Design aesthetic.
  3. Have limited design resources – Your team doesn’t have dedicated designers.

Remember, Angular Material is amazing, but it comes with opinions about how things should look.

If you try to heavily customize Material components, you’ll spend more time fighting the framework than building features. I’ve been there—overriding Material styles with ::ng-deep and !important until the CSS becomes unmaintainable.

And don’t even get me started on theming. Creating a custom theme for Angular Material is… let’s just say “special.” It’s doable, but it’s tricky and requires deep knowledge of Sass and Material’s theming system. (If you’re curious about the complexity, I wrote about it in Theme UI Frameworks in Angular: Part 2 - Custom Theme for Angular Material.)

Angular Aria solves this by giving you zero styles. You start with a blank canvas and build exactly what you need.

Remember, for simple cases, native elements are already accessible when you use <button> for buttons, <input type="radio"> for radio buttons or <select> for simple dropdowns. But Angular Aria gives you a third option: headless accessible components that you can style however you want. With it, we can meet accessibility requirements, maintain full design control and avoid reinventing the wheel.

Recap

We learned that accessibility doesn’t have to be overwhelming. Angular Aria gives you production-ready, WCAG-compliant directives that handle the complex parts.

We provide the HTML structure and CSS, and Angular Aria provides the accessibility, without the pain!

But What About Kendo UI?

Maybe you want to ship faster (really fast) and have complex scenarios with data lists, charts, schedulers and advanced UI patterns. If you read this article and are thinking: “This is great, but I still have to write all the HTML and CSS myself. Why not just use a complete component library?” you’re asking the right question.

Here’s when Progress Kendo UI for Angular makes more sense than Angular Aria or Angular Material:

Use Kendo UI when you want:

  1. Everything out of the box – Prebuilt components with professional styling, themes and accessibility already done
  2. Advanced features included – Things like data grids with sorting/filtering, charts, schedulers and complex UI patterns
  3. Consistent design system – A cohesive look across all components without writing custom CSS
  4. Enterprise-grade support – Professional support, regular updates and guaranteed compatibility
  5. Speed PLUS customization – You need to ship fast and need options to customize (you’ll have five fully built theme options plus the ability to tweak those or create your own theme)

There’s no “wrong” choice—just different tools for different jobs.

If you’re building internal tools or you like the Kendo UI design system, Kendo UI saves you weeks of work, and, honestly? I can build complex stuff (data grids, schedulers, charts) in minutes, making Kendo UI the right answer for me.

Plus, you can use the Kendo UI for Angular AI Coding Assistant, an MCP server that automatically scaffolds components for AI agents. Instead of manually writing Kendo UI components, your AI assistant can do it for you.

Want to learn more? Check out my article: Angular and Kendo UI MCP: Making Agents Work for You


Decision Tree Regression from Scratch Using JavaScript

Dr. James McCaffrey presents a complete end-to-end demonstration of decision tree regression from scratch using JavaScript. The goal of decision tree regression is to predict a single numeric value. For simplicity and better maintenance, the demo implementation uses list storage instead of pointers. For better customization and interpretability, the implementation uses list iteration instead of recursion or a stack algorithm.
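
McCaffrey's implementation isn't reproduced here, but the two design choices he describes can be sketched in a few lines of JavaScript: nodes live in a flat array (list storage, with child indices instead of pointers), and prediction iterates over that array rather than recursing. The tree values below are made up for illustration:

```javascript
// Minimal sketch of list-storage decision tree regression (illustrative,
// not McCaffrey's actual code). Each node is a row in a flat array;
// leftIdx/rightIdx are array indices, and -1 marks a leaf node.
const tree = [
  // idx 0: root — split on feature 0 at threshold 5.0
  { feature: 0,  threshold: 5.0, leftIdx: 1,  rightIdx: 2,  value: 0 },
  // idx 1: leaf — predicted value 1.5
  { feature: -1, threshold: 0,   leftIdx: -1, rightIdx: -1, value: 1.5 },
  // idx 2: internal — split on feature 1 at threshold 2.0
  { feature: 1,  threshold: 2.0, leftIdx: 3,  rightIdx: 4,  value: 0 },
  // idx 3 and 4: leaves
  { feature: -1, threshold: 0,   leftIdx: -1, rightIdx: -1, value: 3.0 },
  { feature: -1, threshold: 0,   leftIdx: -1, rightIdx: -1, value: 4.5 },
];

// Iterative prediction: follow child indices from the root until a leaf.
// No recursion, no stack — just a loop over array indices.
function predict(tree, x) {
  let i = 0;
  while (tree[i].leftIdx !== -1) {
    const node = tree[i];
    i = x[node.feature] < node.threshold ? node.leftIdx : node.rightIdx;
  }
  return tree[i].value; // the leaf holds the predicted numeric value
}

console.log(predict(tree, [4.0, 0.0])); // 1.5 (x[0] < 5 leads to leaf 1)
```

Because every node is just an index into one array, the structure is easy to print, serialize and debug, which is the maintainability argument for list storage over pointer-linked nodes.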

Stop Using Elaborate Personas: Research Shows They Degrade Claude Code Output


Scientific research reveals common Claude Code prompting practices—like elaborate personas and multi-agent teams—are measurably wrong and hurt performance.


A developer who read 17 academic papers on agentic AI workflows has published findings that contradict much of the common advice circulating in the Claude Code community. The research-backed principles suggest developers are actively harming their output quality with popular prompting patterns.

What The Research Says — Counterintuitive Findings

The key findings, distilled from papers including PRISM persona research and DeepMind (2025) studies, are actionable for any Claude Code user:

  1. Elaborate Personas Hurt: Telling Claude "you are the world's best programmer" actually degrades output quality. The research shows flattery activates motivational and marketing text from the model's training data instead of technical expertise. Brief, functional identities under 50 tokens consistently outperform flowery descriptions.

  2. Shorter System Prompts Win: After 19 requirements in a system prompt, accuracy is lower than with just 5 requirements. More instructions aren't better—they're measurably worse due to cognitive overload and instruction collision.

  3. Multi-Agent Economics Are Poor: A 5-agent team costs 7x the tokens of a single agent but produces only 3.1x the output. Beyond 7 agents, you often get less output than a team of 4. The rubber-stamp "LGTM" from review agents is a documented quality failure pattern.

  4. Context Placement Matters Critically: When key information is placed in the middle of a long context (rather than at the beginning or end), accuracy drops by >30%. MIT researchers traced this to fundamental architectural causes in the transformer itself.

  5. The 45% Threshold Rule: If a single well-prompted agent achieves >45% of optimal performance on a task, adding more agents yields sharply diminishing returns. The recommendation is clear: always start with one agent, measure its performance, and escalate only when data justifies it.
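
Findings 3 and 5 together reduce to a simple escalation check. The sketch below encodes that decision logic with hypothetical names; the numbers come from the findings above:

```javascript
// Sketch of the "start with one agent" heuristic from the findings above.
// The 45% threshold rule: escalate to a multi-agent team only when a
// single well-prompted agent falls below 45% of optimal performance.
function shouldEscalateToTeam(singleAgentScore, optimalScore) {
  return singleAgentScore / optimalScore < 0.45;
}

// Cost-effectiveness of the cited 5-agent team vs. a single agent:
// ~7x the tokens for only ~3.1x the output.
const tokenMultiple = 7.0;   // 5-agent token cost relative to one agent
const outputMultiple = 3.1;  // 5-agent output relative to one agent
const costPerOutputRatio = tokenMultiple / outputMultiple;

console.log(shouldEscalateToTeam(0.60, 1.0)); // false: one agent is enough
console.log(shouldEscalateToTeam(0.30, 1.0)); // true: data justifies a team
console.log(costPerOutputRatio.toFixed(2));   // "2.26" — each unit of output
                                              // costs ~2.26x more in tokens
```

In other words, even when a team is justified on quality grounds, each unit of its output costs more than twice as much, which is why measurement should precede escalation.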

Two Open-Source Tools That Encode The Principles

The researcher built and open-sourced two Claude Code tools that implement these findings:

Forge (github.com/jdforsythe/forge) is a science-backed agent team assembler. It implements vocabulary routing, PRISM identities, and the 45% threshold rule as a Claude Code plugin. Install it via:

claude code plugins install jdforsythe/forge

jig (github.com/jdforsythe/jig) handles selective context loading for Claude Code. It lets you define profiles with specific tools per session, loading only what you need to keep your context clean and performance high.

How To Apply This To Your Claude Code Workflow Today

  1. Rewrite Your CLAUDE.md: Strip elaborate personas. Use brief, functional descriptions like "Senior backend engineer specializing in TypeScript and system design." Keep it under 50 tokens.

  2. Audit Your System Prompt: Count your requirements. If you have more than 10, prioritize and cut. Research suggests 5 well-chosen requirements outperform 19.

  3. Structure Critical Information: Place the most important instructions or context at the beginning or end of your prompt, never buried in the middle of long documents.

  4. Start Single, Measure, Then Scale: Default to a single Claude Code agent. Only consider multi-agent workflows when you have quantitative data showing the single agent is below 45% of optimal performance for that specific task type.

The full article series detailing all 10 principles is available at jdforsythe.github.io/10-principles.

gentic.news Analysis

This research arrives during a period of intense experimentation with Claude Code's multi-agent capabilities, following Anthropic's recent promotion of new features and best practices for the tool. The findings directly challenge the "more agents, more instructions, more persona" approach that has become popular in some circles.

The timing is particularly relevant given recent incidents where Claude agents executed destructive commands like git reset --hard on developer repositories. These quality failures align with the research's identification of "rubber-stamp approval" as a common failure mode in multi-agent systems—when review agents default to agreement as the path of least resistance.

The open-source tools (Forge and jig) represent a growing trend of developers building on Claude Code's Model Context Protocol (MCP) architecture to create specialized enhancements. This follows our coverage of other MCP-based tools like the security audit API and selective WebFetch approval systems, indicating a maturing ecosystem around Anthropic's coding agent.

The research also provides empirical backing for what some experienced Claude Code users have discovered through trial and error: simpler, more focused interactions often yield better results than complex orchestration. As Claude Code usage has appeared in 129 articles this week alone (bringing its total to 426 in our coverage), these evidence-based principles offer a valuable corrective to common but ineffective practices.

Originally published on gentic.news


ADeLe: Predicting and explaining AI performance across tasks


At a glance

  • AI benchmarks report performance on specific tasks but provide limited insight into underlying capabilities; ADeLe evaluates models by scoring both tasks and models across 18 core abilities, enabling direct comparison between task demands and model capabilities.
  • Using these ability scores, the method predicts performance on new tasks with ~88% accuracy, including for models such as GPT-4o and Llama-3.1.
  • It builds ability profiles and identifies where models are likely to succeed or fail, highlighting strengths and limitations across tasks.
  • By linking outcomes to task demands, ADeLe explains differences in performance, showing how it changes as task complexity increases.

AI benchmarks report how large language models (LLMs) perform on specific tasks but provide little insight into the underlying capabilities that drive that performance. They do not explain failures or reliably predict outcomes on new tasks. To address this, Microsoft researchers, in collaboration with Princeton University and Universitat Politècnica de València, introduce ADeLe (AI Evaluation with Demand Levels), a method that characterizes both models and tasks using a broad set of capabilities, such as reasoning and domain knowledge, so performance on new tasks can be predicted and linked to specific strengths and weaknesses in a model.

In a paper published in Nature, “General Scales Unlock AI Evaluation with Explanatory and Predictive Power (opens in new tab),” the team describes how ADeLe moves beyond aggregate benchmark scores. Rather than treating evaluation as a collection of isolated tests, it represents both benchmarks and LLMs using the same set of capability scores. These scores can then be used to estimate how a model will perform on tasks it has not encountered before. The research was supported by Microsoft’s Accelerating Foundation Models Research (AFMR) grant program.

ADeLe-based evaluation

ADeLe scores tasks across 18 core abilities (such as attention, reasoning, and domain knowledge), assigning each task a value from 0 to 5 based on how much it requires each ability. For example, a basic arithmetic problem might score low on quantitative reasoning, but an Olympiad-level proof would score much higher.

Evaluating a model across many such tasks produces an ability profile—a structured view of where the model performs well and where it breaks down. Comparing this profile to the demands of a new task makes it possible to identify the specific gaps that lead to failure. The process is illustrated in Figure 1.

Diagram illustrating a two-stage AI evaluation framework: the top panel shows model performance on the ADeLe benchmark and resulting ability profiles, while the bottom panel shows how task-level scoring criteria are applied to derive task demand profiles.
Figure 1. Top: (1) Model performance on the ADeLe benchmark and (2) the resulting ability profiles, showing each model’s strengths and limitations across core abilities. Bottom: (1) Application of 18 scoring criteria to each task and (2) the resulting task profiles, showing the abilities each task requires.
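The gap-finding step described above can be sketched in a few lines. This is an illustrative sketch only: the dictionary representation, ability names, and numeric values are assumptions for demonstration, not ADeLe's actual data structures or scores.

```python
# Hypothetical sketch: compare a task's demand profile against a model's
# ability profile (both on ADeLe's 0-5 scale) and flag the abilities where
# demand exceeds ability. All names and values below are illustrative.

def find_gaps(task_demands, model_abilities):
    """Return {ability: (demand, ability_score)} for every ability the
    task demands more of than the model supplies."""
    return {
        ability: (demand, model_abilities.get(ability, 0.0))
        for ability, demand in task_demands.items()
        if demand > model_abilities.get(ability, 0.0)
    }

task_demands = {"quantitative_reasoning": 4.5, "domain_knowledge": 2.0}
model_abilities = {"quantitative_reasoning": 3.1, "domain_knowledge": 4.2}

print(find_gaps(task_demands, model_abilities))
# {'quantitative_reasoning': (4.5, 3.1)}
```

Here the model's domain knowledge comfortably covers the task, but its quantitative reasoning falls 1.4 levels short, which is exactly the kind of specific, explainable gap the framework surfaces.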

Evaluating ADeLe

Using ADeLe, the team evaluated a range of AI benchmarks and model behaviors to understand what current evaluations capture and what they miss. The results show that many widely used benchmarks provide an incomplete and sometimes misleading picture of model capabilities and that a more structured approach can clarify those gaps and help predict how models will behave in new settings.

ADeLe shows that many benchmarks do not isolate the abilities they are intended to measure or only cover a limited range of difficulty levels. For example, a test designed to evaluate logical reasoning may also depend heavily on specialized knowledge or metacognition. Others focus on a narrow range of difficulty, omitting both simpler and more complex cases. By scoring tasks based on the abilities they require, ADeLe makes these mismatches visible and provides a way to diagnose existing benchmarks and design better ones.

Applying this framework to 15 LLMs, the team constructed ability profiles using 0–5 scores for each of 18 abilities. For each ability, the team measured how performance changes with task difficulty and used the difficulty level at which the model has a 50% chance of success as its ability score. Figure 2 illustrates these results as radial plots that show where each model performs well and where it breaks down.

Radar charts comparing ability profiles of 15 large language models across 18 abilities, grouped by model families: OpenAI models on the left, LLaMA models in the middle, and DeepSeek-R1-Distill-Qwen models on the right.
Figure 2. Ability profiles for 15 LLMs across 18 abilities. Left: OpenAI models. Middle: Llama models. Right: DeepSeek-R1 distilled models.
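The 50%-success ability score described above can be sketched as a curve-fitting problem: fit a success-vs-difficulty curve and read off where it crosses 50%. The logistic form, fixed slope, and grid search below are simplifying assumptions for illustration, not the paper's exact estimation procedure.

```python
# Illustrative sketch (not the paper's exact method): given observed success
# rates at each difficulty level for one ability, fit a logistic curve and
# report the difficulty at which predicted success crosses 50%.
import math

def logistic(d, midpoint, slope=2.0):
    # Success probability at difficulty d; falls below 0.5 once d > midpoint.
    return 1.0 / (1.0 + math.exp(slope * (d - midpoint)))

def ability_score(difficulties, success_rates, slope=2.0):
    """Grid-search the logistic midpoint that best fits the observed
    success rates; that midpoint is the 50%-success difficulty, used
    here as the ability score."""
    best_mid, best_err = 0.0, float("inf")
    for step in range(0, 501):          # candidate midpoints 0.00 .. 5.00
        mid = step / 100
        err = sum((logistic(d, mid, slope) - s) ** 2
                  for d, s in zip(difficulties, success_rates))
        if err < best_err:
            best_mid, best_err = mid, err
    return best_mid

# Success rates that fall off around difficulty level 3:
difficulties = [0, 1, 2, 3, 4, 5]
success_rates = [0.98, 0.95, 0.80, 0.50, 0.20, 0.05]
print(ability_score(difficulties, success_rates))  # ≈ 3.0
```

The appeal of this formulation is that the resulting score lives on the same 0–5 scale as task demands, which is what makes the direct profile-to-profile comparison possible.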

This analysis shows that models differ in their strengths and weaknesses across abilities. Newer models generally outperform older ones, but not consistently across all abilities. Performance on knowledge-heavy tasks depends strongly on model size and training, while reasoning-oriented models show clear gains on tasks requiring logic, learning, abstraction, and social inference. These patterns typically require multiple, separate analyses across different benchmarks and can still produce conflicting conclusions when task demands are not carefully controlled. ADeLe surfaces them within a single framework.

ADeLe also enables prediction. By comparing a model’s ability profile to the demands of a task, it can forecast whether the model will succeed, even on tasks that are unfamiliar. In experiments, this approach achieved approximately 88% accuracy for models like GPT-4o and Llama-3.1-405B, outperforming traditional methods. This makes it possible to both explain and anticipate potential failures before deployment, improving the reliability and predictability of AI model assessment.
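One simple way such profile-based prediction could work is to score each demanded ability by the margin between the model's ability and the task's demand. The combination rule below (independent per-ability logistic terms multiplied together) is an assumption for illustration, not ADeLe's actual predictive model.

```python
# Illustrative sketch of profile-based success prediction. Each demanded
# ability contributes a logistic success term based on the margin between
# model ability and task demand; terms are multiplied as if independent.
# This combination rule is assumed for illustration only.
import math

def p_success(task_demands, model_abilities, slope=2.0):
    """Predicted probability of success: near 1 when every ability
    comfortably exceeds its demand, near 0 when any falls far short."""
    p = 1.0
    for ability, demand in task_demands.items():
        margin = model_abilities.get(ability, 0.0) - demand
        p *= 1.0 / (1.0 + math.exp(-slope * margin))
    return p

model = {"logic": 3.5, "domain_knowledge": 2.5}
easy_task = {"logic": 2.0, "domain_knowledge": 1.0}
hard_task = {"logic": 4.5, "domain_knowledge": 4.0}

print(p_success(easy_task, model) > 0.5)   # True: abilities exceed demands
print(p_success(hard_task, model) < 0.5)   # True: demands exceed abilities
```

Whatever the exact functional form, the key property is the one the article describes: the prediction is interpretable, because a low probability can be traced back to the specific ability whose margin is negative.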

Whether AI systems can truly reason is a central debate in the field. Some studies report strong reasoning performance, while others show that it breaks down on more demanding tasks. These conflicting results largely reflect differences in task difficulty. ADeLe shows that benchmarks labeled as measuring “reasoning” vary in what they require, from basic problem-solving to tasks that combine advanced logic, abstraction, and domain knowledge. The same model can score above 90% on lower-demand tests and below 15% on more demanding ones, reflecting differences in task requirements rather than a change in capability.

Reasoning-oriented models like OpenAI’s o1 and GPT-5 show measurable gains over standard models, not only in logic and mathematics but also in interpreting user intent. However, performance declines as task demands increase. AI systems can reason, but only up to a point, and ADeLe identifies where that point is for each model.


Looking ahead

ADeLe is designed to evolve alongside advances in AI and can be extended to multimodal and embodied AI systems. It also has the potential to serve as a standardized framework for AI research, policymaking, and security auditing.

More broadly, it advances a more systematic approach to AI evaluation—one that explains system behavior and predicts performance. This work builds on earlier efforts, including Microsoft research on applying psychometrics to AI evaluation and recent work on Societal AI, both of which emphasize the importance of rigorous AI evaluation.

As general-purpose AI systems continue to outpace existing evaluation methods, approaches like ADeLe offer a path toward more rigorous and transparent assessment in real-world use. The research team is working to expand this effort through a broader community. Additional experiments, benchmark annotations, and resources are available on GitHub (opens in new tab).


The post ADeLe: Predicting and explaining AI performance across tasks appeared first on Microsoft Research.
