Sr. Content Developer at Microsoft, working remotely in PA, TechBash conference organizer, former Microsoft MVP, Husband, Dad and Geek.
150497 stories
·
33 followers

Gemini 3.1 Flash Live: Making audio AI more natural and reliable

1 Share
Gemini 3.1 Flash Live is now available across Google products.
Read the whole story
alvinashcraft
16 minutes ago
reply
Pennsylvania, USA
Share this story
Delete

AsgardBench: A benchmark for visually grounded interactive planning

1 Share
[Image: AsgardBench banner with three white icons on a blue-to-purple gradient background: a laptop screen with an eye in the upper right corner, relational nodes, and a security shield with a checkmark]

At a glance

  • To successfully complete tasks, embodied AI agents must ground and update their plans based on visual feedback.
  • AsgardBench isolates whether agents can use visual observations to revise their plans as tasks unfold.
  • Spanning 108 controlled task instances across 12 task types, the benchmark requires agents to adapt their plans based on what they observe.
  • Because objects can be in different positions and states (e.g., clean or dirty), the same instruction can require different action sequences, even in the same environment.

Imagine a robot tasked with cleaning a kitchen. It needs to observe its environment, decide what to do, and adjust when things don’t go as expected: for example, when the mug it was asked to wash is already clean, or the sink is full of other items. This is the domain of embodied AI: systems that perceive their environment and act within it.

The field has made rapid progress, but evaluating these systems is harder than it looks. Many benchmarks test perception, navigation, and physical control all at once, making it difficult to isolate whether an AI agent is actually using what it perceives to make better decisions or just getting lucky because the environment is predictable enough to script around.

To address this, we created AsgardBench. In the paper “AsgardBench: Evaluating Visually Grounded Interactive Planning Under Minimal Feedback,” we describe how this benchmark poses a simple but demanding challenge: give an AI agent a household task, let it observe the environment through images, and see whether it can adjust its plan when what it perceives contradicts what it anticipated. Can it notice that the mug it needs to clean is already in the sink, or that it isn’t, and behave accordingly? That is the core question AsgardBench is designed to answer.

Built on AI2-THOR, an interactive 3D simulation environment used to train and evaluate AI agents on household tasks, AsgardBench positions agents near objects and gives them a small, fixed set of actions, such as find, pickup, put, clean, and toggle_on/off. At each turn, the agent proposes a full sequence of steps to complete the task, but only the first step executes. Throughout, the focus is squarely on plan adaptation: not whether an agent can navigate a room or manipulate an object, but whether it can use what it perceives to revise its next step.
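
The small action vocabulary described above can be sketched as a minimal Python interface. The enum members follow the actions named in the text, while the `Action` dataclass and the object-naming scheme (e.g., `"mug_1"`) are assumptions for illustration, not the benchmark's actual API.

```python
from dataclasses import dataclass
from enum import Enum


class ActionType(Enum):
    # The small, fixed action set named in the text.
    FIND = "find"
    PICKUP = "pickup"
    PUT = "put"
    CLEAN = "clean"
    TOGGLE_ON = "toggle_on"
    TOGGLE_OFF = "toggle_off"


@dataclass(frozen=True)
class Action:
    """One step in a proposed plan: an action applied to a target object."""
    kind: ActionType
    target: str  # object identifier, e.g. "mug_1" (hypothetical naming)


# A full plan is an ordered list of actions; per the benchmark's turn
# structure, only plan[0] would execute before the agent replans.
plan = [Action(ActionType.FIND, "mug_1"),
        Action(ActionType.CLEAN, "mug_1")]
```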

For example, the agent may discover that the mug is clean, dirty, or filled with coffee, or it may observe that the sink contains many other items, so the same instruction can require different action sequences as the task unfolds. This process is illustrated in Figure 1.

Figure 1: Agent observations and corresponding action plans in AsgardBench. Each image is paired with the plan generated from that observation. This illustrates how AsgardBench requires agents to update or change their plans based on new visual evidence rather than following a fixed sequence.

How it works

Agents start in interaction-ready positions, so navigation and viewpoint selection are not factors. A find action brings objects into view, and the environment handles the details of container sizing and placement, so the agent does not need to reason about which cabinet or countertop to use. The only inputs are color images, a history of attempted actions with simple success or failure signals, and the agent’s own record of what it plans to do next.

At each turn, the agent proposes a complete sequence of steps to finish the task, but only the first step executes. It then receives new images and a simple signal: did that action succeed or fail? This prevents the agent from scripting everything upfront and forces it to re-evaluate and revise its plan at every step. Built-in limits on total steps and repeated actions prevent endless loops. Because the environment provides only this minimal feedback, the agent must notice what it perceives (e.g., whether a mug is dirty, whether a faucet is running) and keep track of where it is in the task from one step to the next.
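
The execute-first-step loop described above can be sketched as follows. Here `agent.propose_plan` and `env.execute` are hypothetical stand-ins for the model call and the AI2-THOR wrapper, and the specific step and repeat limits are illustrative defaults, not AsgardBench's actual values.

```python
def run_episode(agent, env, task, max_steps=30, max_repeats=3):
    """Minimal sketch of the interaction loop: each turn the agent
    proposes a full plan, but only the first step executes, and the
    agent sees only new images plus a success/failure flag."""
    images = env.reset(task)      # initial color images
    history = []                  # (action, succeeded) pairs
    last_action, repeats = None, 0
    for _ in range(max_steps):
        plan = agent.propose_plan(task, images, history)
        if not plan:
            break
        action = plan[0]          # only the first step executes
        succeeded, images = env.execute(action)
        history.append((action, succeeded))
        # Built-in limit on repeated actions to prevent endless loops.
        repeats = repeats + 1 if action == last_action else 0
        last_action = action
        if repeats >= max_repeats:
            break
        if env.task_complete(task):
            return True, history
    return False, history
```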

Evaluating AsgardBench

We tested several leading vision-capable models on AsgardBench and found that consistent success depends on visual grounding. Across models, visual input substantially improved performance: most models more than doubled their success rates when given images instead of text-only descriptions of the scene. This contrasts with some prior benchmarks, where agents could perform reasonably well without vision by relying on textual feedback about what went wrong.

Providing that kind of detailed failure information raises performance for all models in AsgardBench, too, but it can mask the real problem. The strongest vision-capable models still outperform text-only agents even when those agents receive detailed feedback, demonstrating that the benchmark requires visual grounding that text alone cannot replicate. These results are illustrated in Figure 2.

Figure 2: Success rates for image-based and text-only conditions. Visual input substantially improves performance for all but the weakest agents, while text-only performance remains low, indicating that AsgardBench requires perception-based reasoning.

The results also revealed where today’s agents consistently fall short. Across all models, the same problems kept appearing: agents attempted infeasible actions (e.g., trying to clean a mug that was not in the sink), got stuck in repeated action loops, misinterpreted subtle visual cues (on/off, clean/dirty), and lost track of their progress in the task from one step to the next. This points to three weaknesses: distinguishing subtle visual details in cluttered scenes, maintaining an accurate picture of task progress across multiple steps, and consistently translating what the agent sees into timely plan updates. Taken together, these point to where the next generation of embodied agents will need to improve.


Implications and looking ahead

AsgardBench is useful as both a diagnostic and development tool. By varying what feedback agents receive (none, minimal, or detailed), researchers can isolate whether performance gains come from better perception, better memory, or better planning. Promising directions include systems that combine stronger visual understanding with better state tracking, training approaches that emphasize learning to repair plans mid-task, and evaluation methods that measure not just whether an agent succeeds but how well it adapted along the way.
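
One way to picture the feedback ablation described above is a single function that renders the three conditions an agent might see after each action. The message formats and the `reason` string are illustrative assumptions, not the benchmark's actual output.

```python
def feedback_message(level, succeeded, reason=None):
    """Render post-action feedback under one of three conditions:
    "none" (no signal), "minimal" (success/failure only), or
    "detailed" (failure plus an explanation of what went wrong)."""
    if level == "none":
        return ""
    if level == "minimal":
        return "success" if succeeded else "failure"
    if level == "detailed":
        if succeeded:
            return "success"
        # e.g. reason = "mug_1 is not in the sink" (hypothetical wording)
        return f"failure: {reason}"
    raise ValueError(f"unknown feedback level: {level}")
```

Holding the agent fixed and sweeping `level` across the three conditions is the kind of ablation that separates gains from perception, memory, and planning.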

The failure patterns AsgardBench surfaces point toward a concrete next step: building systems that can make finer visual distinctions, keep track of what changed more reliably across steps, and learn to revise plans mid-task rather than plowing ahead on a script. Agents that make progress on these challenges should be meaningfully better equipped for the messiness of real-world environments: unexpected object states, cluttered scenes, and the constant need to adapt.

AsgardBench is open source and available on GitHub, providing a foundation for advancing research in visually grounded planning.

Acknowledgements

We thank the AI2-THOR community for building the simulation platform and making reproducible embodied evaluation possible.


The post AsgardBench: A benchmark for visually grounded interactive planning appeared first on Microsoft Research.


How we standardized MariaDB in our Integration Server

Engineering at Wealthfront is centered on the idea that code should be written to facilitate testing, not the other way around. Without a staging environment to fall back on, we maximize confidence through a sophisticated, multi-layered testing strategy. While unit tests provide our most rigorous line of defense, our Integration Server is the workhorse that... Read more

Budget Bytes: Azure Data Leaders on AI & Budget (Sneak Peek)

From: Microsoft Developer
Duration: 0:47
Views: 106

What would you do with $25? Priya and Shireesh have some fun answers. You can also build a fully professional AI app for less! Check out Budget Bytes to see exactly how to do it using SQL Database. No big budget required.

🚀Budget Bytes Playlist: https://aka.ms/BudgetBytesPlaylist
🚀Visit the Repo: https://aka.ms/budgetbytes
🚀Get started for FREE: https://aka.ms/budgetbytes/freeoffer


Web App Performance and Migrations

From: Fritz's Tech Tips and Chatter
Duration: 3:52:07
Views: 95

Let's review the progress on the Blazor components and then make migrations easier and faster


136: 0 to 1,000 in a Year: Ionna CEO on Scaling to 30,000 by 2030

In this episode:
  • Ionna CEO Seth Cutler talks about achieving growth targets
  • Tesla Stops Making V3 Superchargers
  • Uber buying 10,000 Rivian R2s for autonomous duties
  • And much, much more!

Regular cohosts:
Tom Moloughney from State of Charge and EVchargingstations.com
https://evchargingstations.com/ |  https://www.youtube.com/StateOfChargeWithTomMoloughney
Kyle Conner from Out of Spec Studios
https://outofspecstudios.com/
Martyn Lee from EV News Daily
https://www.evnewsdaily.com/
Domenick Yoney from Drive Electric with Domenick
https://www.youtube.com/@DriveElectricWithDomenick

Download audio: https://dts.podtrac.com/redirect.mp3/audioboom.com/posts/8879022.mp3?modified=1774515910&sid=5141110&source=rss