Sr. Content Developer at Microsoft, working remotely in PA, TechBash conference organizer, former Microsoft MVP, Husband, Dad and Geek.

Read our white paper on a pragmatic approach to AI governance in America.

1 Share
The debate over AI governance is stuck in a false choice between over-regulation and no regulation. There is a middle way: A pragmatic, evidence-based approach that reco…
Read the whole story
alvinashcraft
5 hours ago
reply
Pennsylvania, USA
Share this story
Delete

8 inbox Windows 11 apps are getting a major update, here’s a closer look

Microsoft rolled out its latest Windows Insider flight to its Experimental and Experimental (26H1) channels this week, and with it comes a slew of new features and refinements for the inbox Windows apps: Calculator, Camera, Clock, Media Player, Notepad, Paint, Photos, and Sound Recorder.

None of it will make the evening news, but it’s the kind of release that makes Windows feel less broken to use every day.

Updates to Windows 11 Inbox apps

Calculator gets language layout fixes and more accurate square-root results

If you happen to be on version 11.2605.9.0 of the Calculator app, follow along as we walk through what’s new and shiny with this release.

The Windows team has fixed a bug where calculations that should equal zero instead returned a small leftover value, which could apparently crash the app. That should no longer happen.

To test the fix:

  1. Open Calculator
  2. Switch to Standard mode.
  3. Enter sqrt(2.25)-1.5
    1. Type 2.25
    2. Press √ button (or use the keyboard with Num Lock on, Alt key + 251)
    3. Press the – key
    4. Type 1.5
    5. Press =

Calculator improvement in Windows 11 Insider build

The results should produce a clean zero rather than a decimal artifact that would apparently crash the app. Let me know your results in the comments.
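
For readers wondering where such leftovers come from, here is a minimal Python sketch. It does not reproduce Calculator's own arithmetic engine, which the release notes don't describe. It only shows the general effect using ordinary binary floating point. The `snap_to_zero` helper and its tolerance are hypothetical, chosen purely for illustration.

```python
import math


def snap_to_zero(value: float, tol: float = 1e-12) -> float:
    """Return 0.0 when value is within tol of zero (hypothetical helper)."""
    return 0.0 if abs(value) < tol else value


# Mathematically this is zero, but binary floats leave a tiny residue.
residue = 0.1 + 0.2 - 0.3
print(residue != 0.0)         # the raw result is not exactly zero
print(snap_to_zero(residue))  # cleaned up before display

# With IEEE doubles, the article's test case happens to be exact,
# because 2.25 and 1.5 are both exactly representable.
print(math.sqrt(2.25) - 1.5)
```

Whether an app snaps, rounds, or switches to exact arithmetic is a design choice. The sketch only illustrates why a "should be zero" result can come back slightly off.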

High Contrast

Another Calculator fix deals with the High Contrast Aquatic and Desert options on the Settings page. The Aquatic and Desert options now display the correct text associated with the title color. Visit the Settings app to double-check.

  1. Open Settings
  2. Type Accessibility in the Settings Search bar
  3. Select High Contrast
  4. Open Calculator and visit the in-app Settings menu located within the hamburger menu

voilà

Language Layout

The Windows team has also adjusted its Arabic and Hebrew language support by fixing how they are laid out within the app. To test it out, look at the graph, number pad, equation field, and scroll buttons, and how they are now properly oriented.

Here’s how to test it out.

  1. Open Settings
  2. Go to Time and Language
  3. Switch Windows display language to Arabic or Hebrew
  4. Restart the app for the change to take effect

Reliability

The Windows team has assured Windows Insiders that Calculator will no longer ship with outdated code, which should end the occasional failure to resume. Ideally, this eliminates the “random” crashes that have occurred in the Calculator app.

Camera adds a continuous zoom slider and native QR code scanning

The Windows team didn’t stop at the Calculator app. It also tweaked the Camera app (version 2026.2605.7.0) in this release, and here’s how to quickly confirm the changes are installed.

Slider

For Windows users, the updated Camera app fixes some long-standing pain points, such as coarse zoom control: a new Zoom slider allows granular adjustment and replaces the standard three-level zoom options.

To test out the improved camera zoom,

1. Open the Camera app

2. Choose the Camera you plan to record with (depending on how many camera-powered apps you have installed)

3. Adjust the zoom slider to the desired level

Front-Facing Camera

For most laptop owners, the front-facing camera is an overlooked component of Windows devices, but that doesn’t mean the Windows team didn’t show it some love.

If you rely on that front-facing camera for enterprise chat meetings or family catch-ups, or you’re the rare field worker saddled with a Surface Pro, the Windows team now offers wide-angle support.

1. Open the camera

2. Switch to Front-Facing Camera

3. Confirm it loads without errors or crashes.

Multiple Resolution Support

As with most wide-angle cameras, Microsoft now supports higher-density resolution options.

To test it out, do steps 1 and 2 from the previous walkthrough, but on the next step, look through the video resolution list.

4. Select the desired resolution and confirm support.

QR Code Scanning

You can finally scan QR codes directly from the Camera app without having to install a third-party add-on.

To test out the native QR scanning support, simply:

  1. Open the Camera app
  2. Point the camera at the desired QR Code
  3. Confirm the camera is reading the QR Code and select the link

QR Code support may seem trivial to most Windows laptop users, but for the aid worker, hotel concierge, warehouse worker, nurse, or PoS technician, this addition is a lifesaver and long overdue.

Clock overhauls its timer behavior and adds a 15-minute snooze

The Clock app update (version 11.2605.9.0) is the largest in this batch, touching timers, Focus Sessions, alarms, the World Clock, and accessibility.

Apparently, the Timer app had an issue of not being able to count up after hitting zero. Not anymore; you should find that the timers function normally.

  1. Start a timer
  2. Let it count down to zero
  3. Watch as it counts up from there to measure how far you have gone past your targeted time.
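
The count-past-zero behavior can be sketched in a few lines of Python. This is an illustration, not the Clock app's actual code. The function name and the `+` prefix for overrun time are assumptions made for the example.

```python
def format_timer(target_seconds: int, elapsed_seconds: int) -> str:
    """Format a countdown that keeps counting up once the target has passed."""
    remaining = target_seconds - elapsed_seconds
    sign = "+" if remaining < 0 else ""          # mark time past the target
    minutes, seconds = divmod(abs(remaining), 60)
    return f"{sign}{minutes:02d}:{seconds:02d}"


print(format_timer(300, 290))  # ten seconds left
print(format_timer(300, 300))  # exactly zero
print(format_timer(300, 325))  # 25 seconds past the target
```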

Focus Sessions adds an Off option for the daily goal, which was previously absent, and completed tasks no longer clutter the active session list. A rounding bug that could show progress as a minute short (49 minutes instead of 50) is fixed.

  1. Open Clock
  2. Select Focus Sessions
  3. Set a daily goal to Off
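
As for the rounding bug mentioned above (49 minutes shown instead of 50), Microsoft hasn't said what caused it. One common way such an off-by-one appears is truncating elapsed seconds to whole minutes instead of rounding them. The sketch below assumes that cause purely for illustration, and both function names are hypothetical.

```python
def minutes_truncated(seconds: float) -> int:
    """Drop the fractional minute, so 49 min 59.5 s displays as 49."""
    return int(seconds // 60)


def minutes_rounded(seconds: float) -> int:
    """Round half up to the nearest whole minute."""
    return int(seconds / 60 + 0.5)


elapsed = 2999.5  # half a second short of 50 minutes
print(minutes_truncated(elapsed), minutes_rounded(elapsed))
```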

There is now a new 15-minute snooze option for procrastinators like me, which allows for even more ways to tell the computer to leave me alone.

  1. Set an alarm
  2. When it rings, choose the 15-minute snooze option.

The Countdown Widget now supports three simultaneous countdowns, up from two. The rest of the Clock changes are reliability and accessibility fixes:

  • The World Clock compare page now loads dates progressively as you scroll rather than stalling
  • Country and city names in World Clock have been refreshed to current names
  • An icon that incorrectly showed a moon during all-day daylight in polar regions is corrected
  • The back button in clock comparisons now takes you back one step instead of jumping the date to 1926
  • Newfoundland now uses the correct time zone (St. John’s)
  • Editing a disabled alarm no longer makes it appear enabled
  • Screen readers no longer read timer values twice, and now announce countdown names correctly
  • Keyboard focus no longer disappears after pressing the Timer Reset button

Media Player lets you customize caption styles and enforces playlist names

The Media Player update (version 11.2605.14.0) adds the ability to personalize closed caption appearance. Styling is connected to Windows caption settings, and a quick link inside the app opens those settings directly so you don’t have to go hunting through the Settings app to find them.

To customize your caption style:

  1. Open Media Player
  2. Play a video with captions
  3. Open caption settings and adjust styling
  4. Use the quick link to jump to Windows caption settings

Another improvement is in the labeling and titling of playlists. You can no longer create a playlist without giving it a title.

  1. Create a new playlist
  2. Try saving it blank
  3. Watch as the app requires a name before closing

A banner now appears in the play queue when the media library is still being indexed, explaining why some items may not show up yet.

The update also fixes a layout glitch with selected items in lists, improves file-type recognition to reduce playback failures, and addresses a crash that could occur when modifying the play queue during session transitions. The codec error dialog has been rewritten to give clearer guidance on what to do when a file needs a codec that isn’t installed.

Paint adds adjustable eraser transparency and fixes JPEG save

The Paint update (version 11.2605.61.0) brings two user-facing changes worth noting. The eraser tool now has a transparency slider, letting you control how much of the image shows through instead of erasing to a hard edge.

  1. Open Paint
  2. Select the eraser
  3. Adjust the transparency slider to control how much of the underlying image now appears.

You can also test the improved save behavior for rotated JPEGs.

  1. Open a rotated JPEG
  2. Press Save
  3. Paint should now overwrite the original file instead of showing a Save As prompt.

A crash that occurred when opening damaged or invalid image files is fixed; the app now shows an error message instead. The selection outline behavior from classic Paint has also been restored, where the outline now hides while you move, resize, or rotate a selection, which removes the visual clutter that had crept in with recent versions.

Stamp brush strokes now render without color shifts or artifacts, and the AI image generation panel spacing has been cleaned up. Several background crash fixes are included, covering the startup toolbar, background task completion, and app shutdown.

Notepad speeds up and fixes its Find/Replace behavior

Notepad’s update (version 11.2605.29.0) leads with a launch performance improvement, which is a welcome change given how much the app has grown in recent years.

The more substantive fixes in the update are in Find/Replace. Helper tips were persisting between searches instead of clearing, the empty-result notification has been replaced with a less intrusive inline tip, and a “Beginning/end of document reached” notice now appears when wrap-around is disabled so you know the search has bottomed out.

Notepad receives updates

Spell-check underlines that were sticking around after they should have cleared are also fixed, as is a crash that occurred specifically when pasting content copied from Visual Studio. Spanish and Portuguese keyboard users get a fix for a Ctrl+A shortcut conflict that was overriding the Select All behavior in those layouts.

One other detail worth noting is that a bug causing certain plain-text files to display incorrectly on open has been resolved, though Microsoft hasn’t specified which file types or encodings were affected.

Unlike the other seven apps in this build, the Notepad update is rolling out to Beta and Release Preview channels as well, so stable-channel Insiders will pick it up too.

Photos gets AI watermarking controls and better pixel art zoom

The Photos app (version 2026.11060.2004.0) adds an AI watermarking option for images generated or edited with Copilot.

  1. Open Photos
  2. Go to Settings
  3. Search for the AI watermarking option that should now read: Never, Always, or Ask Every Time
  4. Save an AI generated or altered image to confirm

You can now also navigate text within an image using the keyboard.

  1. Open an image with detectable text.
  2. Use the arrow keys, Shift+Arrow, Home, End, or Ctrl+A to navigate text within the image

Tiny images like 16×16 pixel art now zoom in much further to fill the screen without becoming blurry, which is a welcome fix for anyone who uses Photos to inspect icons or small graphics.

A crash that could occur during text recognition is also fixed, and keyboard focus in the navigation bar now requires a single Tab press to skip past hidden controls instead of three.

Sound Recorder fixes its waveform display and a memory leak

Yep, even the Sound Recorder got a few fixes.

The Sound Recorder update (version 11.2605.1.0) is the smallest in this batch, but it addresses some long-standing annoyances. The live waveform now renders correctly when recording with a Bluetooth microphone, which previously showed nothing. A stray horizontal scrollbar that appeared at the bottom of the waveform without serving any function has been removed, as has an issue where the Mark button appeared grayed out on first launch.

Markers are now automatically disabled for WAV recordings, since the WAV format cannot store them and they were being silently discarded before. Rapidly deleting multiple recordings no longer triggers a false “file doesn’t exist” error. A memory leak that occurred each time a recording started has also been resolved.

None of these updates introduce a new Copilot integration or a redesigned UI. What they do is close gaps in apps that ship on every Windows install and get used by people who have never touched an Insider build. QR scanning in Camera, continuous zoom, AI watermark controls in Photos, and a Clock that counts past zero are features that work the way users expect them to work.

The build is currently rolling out to the Experimental and Experimental (26H1) channels. App version numbers are confirmed in the release notes for each app and can be checked under Apps & Features in Windows Settings.

The post 8 inbox Windows 11 apps are getting a major update, here’s a closer look appeared first on Windows Latest

Microsoft is killing Windows 11’s worst Bluetooth bugs, AirPods and Beats now work better

Microsoft has just rolled out the biggest concentrated Windows 11 Bluetooth update yet, addressing a wide range of bugs and introducing faster AirPods pairing. In our tests, Windows Latest observed that LE Audio connections are now more reliable, and it’s actually noticeable as soon as you install June 2026’s optional update (Build 26200.8737+).

When I bought myself a set of Galaxy Buds, which I had been planning to get for a while because they’re one of the best products on the market, I paired them with my PC running Windows 11 24H2 at that point. I did not run into any major problem on the first day, but the next week, one of the buds had disconnected, and Windows was not recognizing it.

After several hours of struggle, I figured that turning off Bluetooth LE Audio was what allowed my PC to recognize both buds. It’s not that Bluetooth LE itself is a bad idea. In fact, Microsoft celebrated it when the feature rolled out to Windows 11. The issue is that Windows has always been bad at Bluetooth, and it’s largely due to legacy code.

Use LE Audio when available in Windows 11

On June 23, Microsoft rolled out Windows 11 KB5095093, and the talk of the town is the fancier features, such as Point-in-time restore and greater control over Windows updates with new pause options. But it also quietly delivered one of the biggest updates to Bluetooth.

As soon as I installed the update, I no longer ran into pairing issues with my Galaxy Buds. I also tested AirPods, which now pair faster.

Microsoft told me that it’s all part of Windows 11 KB5095093 (26200.8737 or 21200.8737+). Specifically, this update comes with better compatibility with specific audio devices, including AirPods and Beats Studio Pro.

2026-06 Preview Update (KB5095093) (26200.8737)

While the Beats Studio Pro is getting better microphone reliability, AirPods now pair faster and sound better on nearly all PCs. However, the June 2026 optional update also covers other problems, so it’s not just about Apple products or select premium brands.

What’s new in Windows 11’s biggest update for Bluetooth

Windows is finally better at syncing the mute state, so the next time you mute your mic in Windows, your Bluetooth headset will no longer think the mic is unmuted. After the update, both Windows and the Bluetooth headset should show the same mute state.

In our testing, we observed that if you mute your microphone from Windows 11’s audio mixer, your Bluetooth headset will also understand that the mic is muted. And if you tap on the headset, assuming it has touch control or something similar, to unmute the mic, Windows will respect your preference.

This feature works with most modern Bluetooth headsets, but Windows Latest understands that it requires devices that support the Hands-Free Profile (HFP).

Bluetooth audio stability in Windows

Another long-standing complaint has been that Bluetooth connections are not always stable, and it can be worse for certain manufacturers than others. In fact, in some cases, you might run into error code 0x9F with OEM drivers for Bluetooth. But today’s update also addresses those concerns.

Microsoft is not only making Bluetooth connections more stable, but also improving voice calls. You’ll observe better voice calls where Bluetooth won’t stutter when audio and mic are both being used together. This is particularly noticeable on PCs with Classic Audio drivers and Hands-Free Profile (HFP).

Last but not least, Microsoft also reduced the time it takes for LE Audio devices to play audio when you are using the microphone. For example, if you are in a call and you’re using Bluetooth LE Audio-compatible headphones, Windows currently plays audio a bit later when the mic is being used. Now, it’ll reduce that time and almost instantly play audio.

Bluetooth connections and Settings experience are getting more reliable

Microsoft also admitted that the Bluetooth & Devices page in Settings is a mess and often lags or crashes. I’ve personally run into issues where the page is slow to reflect the actual status of the connected hardware, and it is far from smooth.

Bluetooth devices page in Windows 11

With Windows 11 KB5095093, Microsoft has made the Bluetooth settings more consistent and smoother.

Finally, there are also connection improvements. For example, Bluetooth now takes less time to reconnect after you wake the PC from sleep or hibernation. Likewise, if you disconnected a Bluetooth device to connect to another one, you previously might not have been able to reconnect to the first device quickly. This has now been fixed.

This is a solid list of improvements, and we’re expecting more changes in the coming months. It’s all part of the company’s efforts to win back fans and revive Windows 11.

What else do you want Microsoft to address in the next update? Let me know in the comments below, and we’ll forward your feedback!

The post Microsoft is killing Windows 11’s worst Bluetooth bugs, AirPods and Beats now work better appeared first on Windows Latest

Microsoft introduces cheaper Surface devices with half the memory

Microsoft just added a cheaper 12-inch Surface Pro and 13-inch Surface Laptop to its lineup. Both models come equipped with 8GB of RAM instead of 16GB, costing $849 for the specced-down Surface Pro and $949 for the Surface Laptop, as spotted earlier by Windows Central.

When the 12-inch Surface Pro and 13-inch Surface Laptop launched in 2025, the base configurations offered 16GB of RAM and 256GB of UFS storage, for $799 and $899, respectively. But in April 2026, Microsoft hiked the prices to $1,049 and $1,199. The memory crunch has also resulted in a higher base price for Microsoft's newest Surface models.

Other than slashing device memory …

Read the full story at The Verge.

Celebrating over 15,000 young creators at the Coolest Projects 2026 online showcase

From first-time coders to seasoned makers, this year every single Coolest Projects creator brought all their creativity to their tech projects and made something to be proud of. Over 15,000 young people showcased more than 4,500 creations on a global stage. With participants from 40 countries and 47% girls, this year’s online showcase is a true reflection of what the next generation of tech creators looks like.

Yesterday our global livestream brought together the whole Coolest Projects community to celebrate the young people’s creations with some very special guests.

Meet our 2026 VIP judges and their favourite projects

Every year, we invite new special VIP judges to choose their favourite projects from each of the seven Coolest Projects categories. Meet our 2026 judges and find out about the projects they picked.

Ronit Levavi Morad, Chief of Staff at Google Research 

Ronit’s role involves reimagining the future of learning and she is a passionate advocate for technology in service of pedagogy, leading Google Research’s global AI literacy initiatives. She champions a human-centered vision for technology, leading programmes like AI Quests, a gamified experience designed to teach teens how AI can be applied to humanity’s greatest challenges.

Ronit’s favourite projects are:

Ben Powley, Senior Developer at Jagex

Ben is a senior developer at British video game company Jagex, working on RuneScape to deliver immersive experiences for both new and existing players. He previously worked at Ubisoft on the Assassin’s Creed series and the game Avatar: Frontiers of Pandora. Ben learned to code at 12 through making mods for Minecraft, and he has been passionate about game development ever since.

Ben’s favourite projects are:

Akari Kawaguchi, youth mentor, CoderDojo Japan

Akari is a 16-year-old CoderDojo member from Japan who has showcased her own projects at Coolest Projects seven times. She knows exactly what it takes to make something special for the showcase, and we’re so excited to have a young creator’s perspective on the judging panel!

Akari’s favourite projects are:

Sebin Sunny, CEO, EIC IIITM-K

Sebin is CEO of the Entrepreneurship and Innovation Center (EIC) and the Centre of Excellence in IoT Sensors at the Indian Institute of Information Technology and Management – Kerala (IIITM-K). A passionate advocate for innovation, he is committed to empowering entrepreneurs, creating sustainable solutions, and shaping the future of technology-driven industries.

Sebin’s favourite projects are:

Broadcom Coding with Commitment® award

The Broadcom Coding with Commitment® award shines a light on creators who use coding to support and strengthen their communities with a project that aligns with the 17 Sustainable Development Goals of the United Nations.

The start screen of Viridia – Back to the World

This year’s Broadcom Coding with Commitment® recipient for the online showcase is Egehan from Türkiye, with their Scratch project Viridia – Back to the World, an educational game designed to teach players about the importance of water and how to use it responsibly.

Get inspired and keep creating!

Browse the Coolest Projects 2026 online gallery to discover thousands of incredible projects from young people all over the world. 

Inspired to make your own project? Or encourage a young person you know? To get you started, we offer over 200 free coding projects, in English and many other languages.

Want to know more about next year’s showcase?

Coolest Projects will be back online in 2027. Sign up to the newsletter to be the first to hear about dates, deadlines, and exciting updates.

And did you know there are in-person Coolest Projects events around the globe? There is still time to take part in Coolest Projects India and other partner events this year. Find out more.

Thank you to the Coolest Projects sponsors

We want to say a big thank you to Broadcom Foundation, Allianz, Amazon Future Engineer, Qube-RT, Avnet, and GoTo for sponsoring Coolest Projects 2026 and helping to celebrate young tech creators around the world.

The post Celebrating over 15,000 young creators at the Coolest Projects 2026 online showcase appeared first on Raspberry Pi Foundation.

An Inside Look at Copilot in Excel

When you ask Copilot in Excel to “add a drop-down menu to this column” or “fix this formula in the team inventory tracker,” behind the scenes, Copilot reasons about your intent, picks and applies the right Excel features, and then edits the workbook. This can feel straightforward for some tasks, but now imagine you’re a finance professional asking Copilot in Excel: “What’s broken in this valuation model, and how does fixing it change the outcome?”

This is not a simple single-formula request; it is an end-to-end finance workflow. To tackle it, Copilot needs to inspect the workbook, understand the finance-specific business question, identify assumptions and formulas that may be wrong, correct the model, recalculate outputs, and explain the impact in a way a finance professional can trust. A good answer is not just “here is the new result.” A good answer is a clear, auditable explanation of what changed, why it changed, and its impact on values in the model.

That kind of prompt is what we think of as an L4 workflow: an open-ended business problem that requires multiple steps, multiple Excel capabilities, and domain judgment.  It is also a great way to understand why evals matter. 

To provide a great response to this question, many different operations and features need to ladder together successfully to deliver the final result. So how do we make sure every part comes together to solve the customer's intent? This is the job of our evaluation system, aka our “evals.”

We’ll provide a brief tour through how we think about evals in Excel: what they are, how we organize them, the benchmarks they produce, and how we use these results to make Copilot in Excel better every week.

What are evals, and why do we have them?

An eval is a repeatable test of Copilot quality. We give Copilot in Excel a task and a starting workbook, let it do its work, and then grade the output against a series of checks and rubrics. These checks go beyond just confirming whether a value is correct. They also assess whether the workbook is usable, auditable, well-formatted, and aligns to Excel best practices.

We evaluate more than the final workbook output.  For agentic workflows, the path to the result matters too: what steps the agent took, which tools it called, how many turns it needed, how long the experience took end to end, and more.  Looking at that trajectory helps give us a clearer measurement of whether the agent is becoming more accurate, capable, and efficient at how it reaches the desired result.

Evals measure Copilot quality and are how we move from “this feels better” to “this is measurably better.” A prompt tweak, model upgrade, or a new tool can improve one customer intent while quietly making another worse. Our evals help us catch those regressions before they reach customers, as well as quantify improvements from new capabilities and find quality gaps early enough to prioritize fixes.
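
To make the idea of an eval concrete, here is a minimal Python sketch of the task, starting workbook, and checks loop described above. It is not the Excel team's system. `EvalCase`, `run_eval`, `toy_agent`, and the dictionary-based workbook are all hypothetical stand-ins. It grades only the final output, not the agent's trajectory.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict

Workbook = Dict[str, object]        # cell address -> value or formula (simplified)
Check = Callable[[Workbook], bool]  # one pass/fail check on the output workbook


@dataclass
class EvalCase:
    task: str
    start: Workbook
    checks: Dict[str, Check] = field(default_factory=dict)


def run_eval(case: EvalCase, agent: Callable[[str, Workbook], Workbook]) -> Dict[str, bool]:
    """Run the agent on a copy of the starting workbook and grade each check."""
    output = agent(case.task, dict(case.start))
    return {name: check(output) for name, check in case.checks.items()}


def toy_agent(task: str, workbook: Workbook) -> Workbook:
    """Stand-in for the real agent: replaces a hard-coded total with a formula."""
    workbook["B4"] = "=SUM(B1:B3)"
    return workbook


case = EvalCase(
    task="Fix the total in B4",
    start={"B1": 10, "B2": 20, "B3": 30, "B4": 60},
    checks={
        "uses_formula": lambda wb: str(wb["B4"]).startswith("="),
        "inputs_untouched": lambda wb: (wb["B1"], wb["B2"], wb["B3"]) == (10, 20, 30),
    },
)
results = run_eval(case, toy_agent)
print(results)
```

Because each check is named, a failing run points at the specific quality dimension that regressed instead of reporting a single pass or fail.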

How we organize our evals

Excel is used by millions to complete a wide range of work: tracking lists, analyzing sales, managing operations, building financial models, planning budgets, and much more. Even the same task can have different expected answers depending on whether you’re a finance professional or a small business owner. To measure Copilot in Excel quality reliably, we ensure our eval cases represent the breadth of customer workflows and all the different flavors of spreadsheets they use.

We organize eval cases across several dimensions. Complexity is one of the most important dimensions, because it lets us test the building blocks separately while also testing the full customer workflow. We also categorize cases by customer role, domain, workbook characteristics, Excel feature usage, and more, so our benchmarks reflect the breadth and depth of real Excel work.

Complexity: L1 through L4

We think about task complexity as a ladder.  At the bottom are atomic actions and individual features.  Higher up are multi-step tasks.  At the top are open-ended workflows that sound more like business problems.  The important point is that an L4 workflow is only as strong as the L1, L2, and L3 capabilities underneath it.

  • L4 workflows: solving an open-ended business problem, such as debugging a valuation model or determining how much cash is available for debt service.
  • L3 multi-step tasks: combining features to complete a defined goal, such as creating a sales summary worksheet or building a cash-flow bridge.
  • L2 feature usage: using an Excel capability correctly, such as creating a PivotTable, applying conditional formatting, building a chart, or using formula auditing.
  • L1 actions: single operations such as inserting a row, editing a formula, formatting a cell, or creating a label.
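
One way to see why the lower rungs matter is to break eval results out by complexity level. The Python sketch below does that. The function name and the sample results are made up for illustration and do not reflect any real benchmark numbers.

```python
from collections import defaultdict


def pass_rate_by_level(results):
    """results: iterable of (level, passed) pairs -> {level: pass rate}."""
    totals = defaultdict(lambda: [0, 0])  # level -> [passed, total]
    for level, passed in results:
        totals[level][0] += int(passed)
        totals[level][1] += 1
    return {level: p / t for level, (p, t) in sorted(totals.items())}


# Invented sample data: a weak L2 feature shows up before it sinks an L4 workflow.
sample = [("L1", True), ("L1", True), ("L2", True), ("L2", False),
          ("L3", True), ("L4", False)]
print(pass_rate_by_level(sample))
```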

Let’s go back to the valuation example at the start of the blog: “What’s broken in this valuation model, and how does fixing it change the outcome?”

Copilot needs to break the work into several major steps: inspect the model structure, locate the key assumptions, audit the formulas, correct the calculation logic, compare the before-and-after outputs, and summarize the business impact.  Each step sounds simple at the surface, but each depends on many smaller capabilities working reliably.

Level | What it means | How it shows up in the valuation scenario
L4: Workflow | Solving an end-to-end business problem | “What’s broken in this valuation model, and how does fixing it change the outcome?”
L3: Multi-step task | Composing multiple Excel capabilities to reach a goal | Identify incorrect assumptions; correct formulas; recalculate outputs; quantify the impact
L2: Feature usage | Using one Excel capability correctly | Apply formula auditing; update cross-sheet references; build comparison tables; document changes
L1: Action | Performing a single atomic operation on the grid | Edit a formula; insert a table; format cells, values; write a label and notes; check cell reference

Customer role and vertical domain

What a good workbook looks like in Excel is often domain-specific.  A formula that is acceptable in a general spreadsheet may not meet the bar for a finance model.  A summary that works for a small business owner may not be enough for a consultant preparing a client deliverable for a large enterprise.  That is why we build and review eval cases with domain-specific expectations in mind.

We’ve worked closely with industry partners and customers to validate coverage, review input and output workbooks, and author grading rubrics and success criteria. Through these processes, we ensure our evals are realistic, align with expert human judgement, and reflect those experts’ taste and domain-specific expertise.

For example, we look at the shape of the workbook itself: number of sheets, workbook size, data density, formulas, tables, charts, and other Excel features. Real workbooks are rarely tidy single-tab examples. They can be large, messy, and full of context.

Another example is our finance-specific rubrics for model structure, formula construction, auditability, and presentation quality. 

From categories to benchmarks

Once we have categorized our eval cases, we curate them into benchmarks.  Our goal here isn’t a single benchmark or a single number.  Rather, we have different benchmarks for the different types of decisions we face while building Copilot in Excel. 

Some of our benchmarks provide broad coverage with an emphasis on comparability.  Others target Excel-specific behaviors and customer workflows for specific feature areas or customer segments.  Here is a subset of the benchmarks we use internally:

Public benchmarks

One public benchmark used by many spreadsheet products is SpreadsheetBench, a benchmark of spreadsheet tasks that provides broad coverage of common operations.  Internally, we’ve further curated and validated subsets of this benchmark to adjust ambiguous queries, improve grader correctness, and improve signal to make results more easily applicable.  However, we believe our internal benchmarks better reflect the type of work and expectations of Excel customers.

Private benchmarks

In addition to public benchmarks, we maintain several internal benchmarks within the Excel team. Some examples include:

  • RegressionBench helps us detect whether a change has broken something that used to work.
  • CustomerBench focuses on matching production customer distributions across task complexity, domains, workbook shapes, and other dimensions. This provides a signal that mirrors customer usage.
  • OfficeJSBench measures whether the operations Copilot generates to act on the workbook execute reliably and produce the intended result.
  • FinanceBench is one of our deep and challenging vertical benchmarks focused on finance workflows and uses finance-specific rubrics for correctness, structure, and auditability. We developed FinanceBench with input and review from partners like Financial Modeling Institute (FMI), Microsoft Finance, and other finance professionals and customers.

The collection of these benchmarks provides both broad and granular measurements on quality.  They allow us to measure where the agent is strong, where it can improve, and whether a change has the intended impact on the customer experience.

How eval results make Copilot in Excel better

Evals are not just a reporting mechanism; they are deeply integrated into how we build the AI experience in Excel. Eval benchmarks help:

  • Gate releases: Benchmark runs are part of how we decide whether a feature is ready to advance from our validation environments to customer availability. If evals detect issues or the improvement doesn’t show the intended impact, the change does not move forward.
  • Improve based on customer signals: When a customer conversation or signal reveals opportunity for improvement, we classify it and then create representative eval cases for it. We can then fix it and measure improvement in subsequent eval runs. 
  • Model training: We use rubrics and graders from evals to provide signal to tune models for spreadsheet work.
  • Inform the right model for the job: Comparable benchmark results help us route tasks to models that balance quality, latency, and capability, to provide the best experience to customers.
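
The release-gating idea in the list above can be sketched as a comparison of per-benchmark pass rates between a baseline and a candidate build. This Python example is an assumption-laden illustration. The `gate_release` function, the 1% tolerance, and all the scores are invented, and only the benchmark names come from the article.

```python
def gate_release(baseline, candidate, max_regression=0.01):
    """Return (ok, regressions) comparing per-benchmark pass rates.

    A benchmark regresses when the candidate falls more than
    max_regression below the baseline; missing benchmarks count as 0.0.
    """
    regressions = {
        name: (baseline[name], candidate.get(name, 0.0))
        for name in baseline
        if candidate.get(name, 0.0) < baseline[name] - max_regression
    }
    return (not regressions, regressions)


# Invented scores: the candidate improves one benchmark but breaks another.
baseline = {"RegressionBench": 0.98, "FinanceBench": 0.61}
candidate = {"RegressionBench": 0.95, "FinanceBench": 0.66}
ok, regressions = gate_release(baseline, candidate)
print(ok, regressions)
```

Here the gain on one benchmark does not offset the drop on the other, so the change would be held back. That mirrors the point that a change can improve one intent while quietly making another worse.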

Over time, this creates a virtuous cycle: feedback and usage identify opportunities, evals quantify them, fixes and training address them, benchmarks confirm the win, and the next release starts from a higher bar.

Why the hero scenario matters

Coming back to the valuation prompt, the reason it matters is that it represents the kind of work people increasingly want Copilot in Excel to help with. They do not just want help inserting a chart or writing one formula. They want help completing meaningful work: understanding a model, finding an issue, improving the analysis, and making the result easier to trust.

We can only deliver that L4 experience if the underlying layers are strong.  Formula edits have to be correct, feature usage has to be reliable, multi-step plans have to stay coherent, and the final workbook has to be clear, complete, and auditable.  That is why our eval system measures both the broad set of things people do in Excel and the deep workflows that matter most in demanding domains like finance.  
