Sr. Content Developer at Microsoft, working remotely in PA, TechBash conference organizer, former Microsoft MVP, Husband, Dad and Geek.

Claude Opus 4.8 on GitLab: Complex agentic work, less disruption


Anthropic's latest model on GitLab is built for precise execution across complex multi-step agent work.

Agents fail most often on complex, multi-step work: tasks that span multiple tools and go from intent to production without losing track of the project goal. Claude Opus 4.8, Anthropic's latest model for coding and agentic tasks, is built for that work, and now available in GitLab Duo Agent Platform via model selection in Agentic Chat and across agent workflows in your GitLab instance.

Opus 4.8 delivers more precise execution across complex agentic sequences where agents run autonomously over extended time periods. With more comprehensive reasoning and planning, teams can expect cleaner end-state results with fewer interventions to redirect agents along the way.

Improved long-horizon agentic execution

For teams with established agent workflows, Opus 4.8 interprets instructions more precisely than prior models. Agents handling extended sequences complete each step as specified, which means more efficient and accurate outcomes with less time reviewing and correcting agent output.

Opus 4.8 is also stronger on professional work beyond coding: document drafting, data analysis, and structured knowledge tasks. For teams using GitLab Duo agents across planning and documentation workflows, as well as coding, Opus 4.8 handles those tasks more reliably, too.

Support for mid-conversation system prompts

Opus 4.8 adds support for mid-conversation system prompts: System instructions can be updated partway through a session without invalidating the prompt cache. Teams building on the API can use this when async context arrives mid-session: when files change on disk, when token budget shifts, or when user context updates, without restarting the cache.

Get started today

Claude Opus 4.8 is available now in GitLab Duo Agent Platform. Like other models, Opus 4.8 runs on GitLab Credits. For a full list of models and their credit consumption, please visit our documentation.

Start a free trial of GitLab Duo Agent Platform today, or sign up from the GitLab Free tier by following a few simple steps. Existing GitLab Premium or Ultimate subscribers can use the GitLab Credits included with their subscription.


Automate evaluations | Microsoft Foundry


Trace every run end-to-end, generate synthetic datasets to stress-test on demand, fire automated Red Team attacks at your own agents, and pin down why evaluations fail — all from the Microsoft Foundry control plane. Lock in guardrails that inspect every tool call at runtime, define the risks once, and enforce them across every agent run. 

Mohammad Abuomar, Responsible AI Principal Architect, shares how to turn a coding agent into production-ready software inside Foundry. 

Describe the agent, set the row count, confirm.

Your test set lands in seconds. Microsoft Foundry’s synthetic dataset generator builds eval data on demand. Get started.

Pin down why your agent fails evaluations. 

Foundry’s Analyze Results uses AI to cluster failures, name the root cause, and recommend specific fixes. Check it out.

Lock down agent behavior with the Task Adherence Guardrail.

It inspects every tool call against the original task and blocks the off-script ones. Try it in Microsoft Foundry.

QUICK LINKS: 

00:00 — Microsoft Foundry control plane 

00:33 — See a finished agent 

02:30 — See where the agent started 

03:19 — Traces 

04:04 — Built-in monitoring 

04:34 — Evaluation types 

05:51 — Red team evaluations 

07:08 — Evaluation results 

08:14 — Built-in Guardrails 

08:14 — Wrap up

Link References 

Get everything you need in Microsoft Foundry at https://ai.azure.com 

Unfamiliar with Microsoft Mechanics? 

As Microsoft’s official video series for IT, you can watch and share valuable content and demos of current and upcoming tech from the people who build it at Microsoft. 

Video Transcript:

-If you want to build agents that meet your expectations for output quality, performance, safety, and cost, it’s not just about the model or framework you select. The testing, evaluation, and the controls surrounding your agent matter. And that’s what the Microsoft Foundry control plane is designed to do, with tools you can use during development to make sure your agents raise the bar across every important dimension. Today, I’m going to walk through the process of building a coding agent and demonstrate where the controls in Foundry come in to make it better. I’ll start by showing my finished agent, then after that, I’ll show you the steps I took using Foundry to make it production ready. This agent is designed to take a simple user prompt, then find what it needs to build apps automatically. 

-So, I’ll paste in my prompt asking it to generate a Windows desktop app for personal cashflow management. It needs to be fast, use WebUI, and easy to use for broad appeal. I’m also asking it to make it safe, secure, and follow privacy best practices. And it needs to be easy for a developer to read, maintain, and to add to it. I’ll submit the request and it gets to work, with its reasoning on the left and code on the right. This process takes several minutes to complete with a few interactions in between, so to save a little time, I’ll skip to the result. We can see the agent’s reasoning and plan, with its technology stack, approach and initial action. Then, we can follow all of the steps it performed to author and configure the app and its dependencies. 

-Below that, is the React code and JavaScript. It asked whether to proceed writing this as an Electron and React setup, and I confirmed. Then it started to write, test and iterate on the app, followed by another question whether to implement more features or focus on security. And I responded to do both. It then finished writing the app and finally it outlined the steps to run the app locally. 

-So, let’s test it out. I’ll move over to my terminal window running PowerShell and start it. And here is the generated app. It’s fully functional with user authentication. I can enter my first item, Travel Expenses, and the amount, and there’s a Category dropdown menu with pre-configured options, and I’ll choose “Transportation”. And it writes that record into the local data store. This is a simple, production-ready app that the agent was able to create in just a few minutes. But it didn’t start out this way, and if you’ve built agents or apps yourself, you’ll know a lot of what doesn’t get shown is the testing, iteration, and refinement work to end up with production-ready code. Let’s change that. 

-Let’s go back in time to where this agent started. I’m in Visual Studio Code and this is my agent, which I built using the Foundry SDK. Here are the defined tools for it to use, WebSearch and CodeInterpreter. And on the left, we can see the full list of local tools. Like interacting with the file system, as well as git, patching, registry, local search and running shell scripts. And here in the center is the key SDK line that creates the agent, adds the tools, deployment name and so on. 

-So, the agent is functional and I’ve also started manual testing. And this is where Foundry controls let me stress-test the agent to see what works and what doesn’t and see the details for each run. In the Microsoft Foundry portal, I have my agent open and the Traces tab. These are OTel traces of all of the runs for this agent, with the newest runs on top. Everything here is backed by Azure Monitor. And I can click into any conversation or Trace ID to view the Input + Output turns for that session. They’re easier to parse than standard logs, speeding up reviews. We can also see the system message, user input, and what the agent did, along with the agent’s reasoning, the technology stack it used, and the app features. Below that, we can see the development process as well as tool outputs. Beyond that, with built-in monitoring, you can get a roll-up view of all activities for our agent with key metrics. I’m in the Monitor tab. It shows me the estimated cost and token usage so far. This agent is new so I haven’t configured Evaluations yet, but we’ll get to those in a moment.

-Next, you’ll see Operational metrics like the number of agent runs and how many successfully completed or failed, token consumption, tool calls made by the agent, and the error rate over time. Evaluations are where a lot more testing automation comes in to help you improve your agent faster. I’m in the Evaluations tab, and I need to create my first one. The options are: Automatic Evaluation, where you can automate the process using AI; Human Evaluation, where someone tests the agent and completes surveys; and Red team, where an agent runs automated attacks to expose vulnerabilities. I’ll start with Automatic Evaluation and hit Create. It starts with defining a target. My agent and the version I want are already selected. For data, I can upload an existing dataset or save time by creating a synthetic dataset, which is very cool. This generates data automatically; you just select the number of rows you want. I’ll guide it with a prompt, “Create a dataset for evaluating a coding agent.” I’ll skip the reference file and just Confirm. That automatically generates 90 rows of data to test with.

-Next, I’ll choose the evaluation Criteria. There are several built-in evaluators for Agents. Below that are evaluators for Quality. These are editable, so I’ll remove Coherence, Fluency, and Groundedness because my agent doesn’t need them. For Safety, there are seven evaluators, and I’ll keep them as-is and move on to Review, then Submit it. These Automatic Evaluations can take several minutes to complete, so while it’s working, I’ll move into Red Teaming, which is now becoming a core part of AI testing to spot vulnerabilities early on. I’ve started creating my first red team evaluation. Let’s look at the standard configuration for risk categories. You can modify these. It can check for unsafe categories plus ungrounded attributes, code vulnerabilities, and task adherence. It shows the tools that the agent can access. I’ll provide descriptions for web_search, to search the internet for relevant SDKs, and the code_interpreter to run code for the coding agent. Then I’ll Save it. 

-Next, I’ll change Seed queries from 5 to 10 per category for more testing. In the Attack strategies, I can see exactly what the red teaming agents will try to do and select the ones most relevant to my agent. Each tile describes the attack type that will be tested. I’ll choose AsciiSmuggler, Base64, Jailbreak, StringJoin, UnicodeSubstitution, and IndirectJailbreak. Now, I can review the prohibited actions, including things like attempts to change password, and more. These are all things attackers might try to do with your agent, and we’re automating those tests for you. I’ll hit Submit to get everything started. Now, with two evaluations running, to save a little time, I’ll fast forward to the results of the evaluations. 

-Here, we can see the two runs. I’ll open the Automatic Evaluation first. Then clicking into the Run shows the details for each evaluator. If I scroll to the right, you’ll see that we’re green almost across the board. One glaring exception is the TaskCompletion score at 59%, which is below my bar, so it’s something to fix. One of my favorite capabilities in evaluation is using AI to analyze the results. I’ll start the analysis, and it creates a nice cluster analysis showing the main issues. I mentioned TaskCompletion before. Here, you can see “incomplete resolution” and “action plan issues”. Drilling in, it looks like there is a “lack of actionable output” and the AI suggests specific ways to fix it. This saved me time finding ways to improve my agent.

-Now, let’s review our Red Teaming evaluation. I’m at the top level view and I’ll click in to see the issues. Immediately, I can see that the Task adherence is red, which is also related to TaskCompletion. We can fix this using a built-in guardrail to check for task adherence. Guardrails define what risks to detect, from which point in the process, and how to respond. Let’s go to the agent playground. Scrolling down to Guardrails, I can see only the default model guardrail is set. Let’s add another by clicking Manage guardrails and Create. Here, I can define the risks and controls I want to enforce. I’ll start with Risk, and these are the types of risks we can detect and mitigate. There’s an option for “Task adherence” that I’ll choose. This guardrail checks any tool call made by the agent to ensure it’s used appropriately to “adhere” to the task. 

-Now, I just need to hit Submit to activate this guardrail. And the TaskCompletion issue should now be fixed. In fact, here I’ve run another evaluation, and we can see that TaskCompletion is now green and everything meets our overall quality goals. With that, my agent is ready for broader use. And while I focused today on a single agent and using Foundry controls to test it, expose vulnerabilities, and make it better, Foundry also provides fleet-wide performance visibility across all agents and enables centrally applied and enforced policies and configurations to keep agents compliant.

-To find out more and get started with these and other controls, you’ll find everything you need in Microsoft Foundry at ai.azure.com. Subscribe to Mechanics for the latest tech updates, and thanks for watching.


Introducing federated Copilot connectors for LSEG and Moody's in Excel


Earlier this month, we announced that Microsoft 365 Copilot federated connectors were coming to Copilot in Excel. Built on the emerging industry standard Model Context Protocol (MCP), federated connectors pull data into Microsoft 365 Copilot live at query time, helping bring institutional data sources into the tools customers already use every day. Starting today, Copilot can pull the latest data from LSEG and Moody’s directly into your Excel workbook. This is the first set of trusted data providers we’re bringing into Excel, with more on the way.

For many finance teams, the work starts before the analysis even begins: finding the right market data, copying values into a model, and making sure nothing got lost along the way. With federated connectors in Excel, Copilot helps bring that information into the workbook where the real work is already happening. Because these connectors query source systems at the moment a request is made, responses reflect the latest available data, helping with scenarios such as checking a deal’s current status or a company’s stock rating. That means less time stitching together inputs and more time analyzing, modeling, and making decisions with the most current data. 

How it works 

In Copilot, open the Sources menu, connect to LSEG or Moody’s with your provider credentials, and turn the source toggle on. From there, when you specify a data provider in the prompt or your request references specific data sources such as credit ratings or spot rates, Copilot will retrieve the relevant data and ask you to confirm the data source before incorporating it into responses or results inserted into the sheet.  

 

LSEG 

The LSEG connector brings institutional market data — including foreign exchange rates, equities, and pricing — straight into Excel. This makes LSEG data and services available through a standardized, AI-ready interface and enables both users and agents to access trusted LSEG context inside the workflows they already use, while preserving governance, entitlements and control.  

For a treasury team updating a hedging model or preparing a leadership readout, that means less exporting, less manual copy-paste, and faster analysis with current market inputs already in the workbook. For a wealth advisor reviewing a client portfolio, it means easier access to current pricing, performance, risk and market context to support faster, more informed client conversations. 

Prompts to try: 
  • Pull current FX spot rates for EUR/USD, GBP/USD, and JPY/USD from LSEG into a new sheet. 
  • What would it cost to roll our six-month forward hedges on EUR/USD out another six months? Pull the forward points from LSEG and show the all-in rate. 
  • Bring in the USD swap curve from LSEG so I can model the impact of issuing 10-year debt at current levels. 

Moody’s 

The Moody’s connector brings credit ratings, research, entity data, and news into Excel so teams can work with decision-grade credit intelligence alongside the rest of their model. Whether you’re evaluating an issuer, pressure-testing exposure, or building a credit view for internal stakeholders, you can bring trusted credit context directly into the workbook instead of piecing it together across systems. 

Prompts to try: 
  • Pull the latest Moody’s rating, outlook, and recent research for each company in the portfolio, then summarize the key credit considerations in a new column. 
  • For each issuer in column B, pull the Moody's company profile, 5-year financial summary, peer group, and sector outlook — then flag any peers where the sector outlook is negative. 
  • For the issuers in this portfolio, bring in Moody’s sector outlook, recent news, and any notable credit risks so I can compare exposures across the list. 

Availability 

LSEG and Moody’s connectors are available starting today in Excel for Web, Windows, and Mac for commercial customers with a Microsoft 365 Copilot license.  

MCP servers and agentic solutions are available through a Bring Your Own License (BYOL) model, with customers licensing directly from partner services. Commercial marketplace availability will follow. 


Why was open-sourcing TypeScript so important?

From: Microsoft Developer
Duration: 1:39
Views: 28

TypeScript didn’t just grow, it evolved with the community. Amanda Silver shares why open sourcing it was key to working alongside frameworks like Angular, React, and Vue.

#TypeScript Repo: https://msft.it/6052vZ3ZI

#opensource #javascript #webdev #developers


Give your Agent memory with SQL Server and Microsoft Agent Framework | Data Exposed

From: Microsoft Developer
Duration: 12:29
Views: 156

Building Agents should be easy and it is with Microsoft Agent Framework. Plugging in memory makes the Agent remember you and your conversations. SQL Server is a great option for storing your memory in, and Azure SQL Database future proofs you. Your users deserve to be remembered so learn how.

✅ Chapters:
0:00 Intro
0:45 sample e-commerce app
1:49 agent recommendations
3:00 agent code walkthrough
3:20 context provider and history provider with Microsoft Agent Framework
5:04 more details on the history provider and how Microsoft Agent Framework helps
6:10 mssql-python driver
8:54 what else you can do with the framework
9:00 local to cloud, easily pluggable
10:10 final tips and tricks, Agent Framework UI

✅ Resources:
Microsoft agent framework, Microsoft Agent Framework Overview | Microsoft Learn
SQL Server Docker, Docker: Run Containers for SQL Server on Linux - SQL Server | Microsoft Learn
Azure SQL Server, Connect to and Query Azure SQL Database Using Python and the mssql-python Driver - Azure SQL Databa…


📌 Let's connect:
GitHub - Chris Noring - softchris
Twitter - Anna Hoffman, https://twitter.com/AnalyticAnna
Twitter - AzureSQL, https://aka.ms/azuresqltw

🔴 Watch even more Data Exposed episodes: https://aka.ms/dataexposedyt

🔔 Subscribe to our channels for even more SQL tips:
Microsoft Azure SQL: https://aka.ms/msazuresqlyt
Microsoft SQL Server: https://aka.ms/mssqlserveryt
Microsoft Developer: https://aka.ms/microsoftdeveloperyt

#AzureSQL #SQL #LearnSQL


How to use Jest mocks


For better or worse, my life has become infested by Jest mocks.

I come from the land of dependency injection (I even gave a talk once!), but whenever I talk about dependency injection to JavaScript developers I get this reaction:

Meme of Black Panther saying "we don't do that here"

Instead, mocking is king. Unfortunately, I’ve found Jest’s API to be confusing and the docs largely unhelpful.

I thought I could get by winging it but that never works in the long run, so I finally sat down and studied how Jest mocking actually works. Here are the notes from my research.

(I’m not aiming to write a complete guide here, just covering what I use regularly.)

Mocking modules

The classic example of needing to mock is when you want to test a component that depends on an API. You want to test the business logic, but you don't want to actually make network calls to an external service:

A dependency graph showing that the "test" module depends on the "business logic" module, which then depends on the "external API" module

This is where jest.mock() comes into play. Calling it replaces the real module with a mock version:

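import * as logic from 'logic';

// Swap the real 'api' module for an automocked one.
jest.mock('api');

test('logic foo', () => {
  // Placeholder assertion; substitute whatever matcher fits your logic.
  expect(() => logic.foo()).not.toThrow();
});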

Calling logic.foo() normally results in a real API call, but since I mocked api, it doesn't.

Because mock modules must be set up before you start importing modules, Jest will “hoist” all jest.mock() calls so they execute first. As such, it doesn’t matter where you put your calls to jest.mock() (which honestly spooks me a bit).

(Mock hoisting is why you can’t, for example, use constants in your mocks. While the constant might be defined in the file before calling jest.mock(), the execution order flips them the other way around!)
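The ordering problem is easier to see in a toy model. The sketch below is plain JavaScript, not Jest's code: mock and load are hypothetical stand-ins for jest.mock() and for importing the module, and it assumes the factory runs when the module is first loaded. It only illustrates why a constant declared "before" the mock call can still be uninitialized when the factory reads it.

```javascript
// Toy model of the execution order after hoisting. NOT Jest's implementation:
// `mock` and `load` are hypothetical stand-ins for jest.mock() and import.
function runFileAfterHoisting() {
  const factories = new Map();
  const mock = (name, factory) => factories.set(name, factory);
  const load = (name) => factories.get(name)();

  // 1. The hoisted mock registration runs first.
  mock('api', () => ({ greeting: GREETING }));

  // 2. Imports run next; loading the module invokes the factory,
  //    which reads GREETING before it has been initialized.
  const api = load('api');

  // 3. The constant is only initialized here, which is too late.
  const GREETING = 'hello';
  return api.greeting;
}

try {
  runFileAfterHoisting();
} catch (err) {
  console.log(err.name); // ReferenceError
}
```

Moving the value inside the factory itself sidesteps the ordering issue, since nothing outside the factory has to exist yet.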

Automock

What is returned from a mocked module, anyways?

By default, Jest “automocks” a module, replacing its exported properties with mockable alternatives. For example, it replaces any function with a mock function (aka jest.fn()). It’ll walk object trees, so if you’ve got an object with a function property, it’ll mock that function as well.

Not all module properties are mocked, though. For example, constants (e.g. strings or numbers) are unaffected by automock. I couldn’t find any docs mapping type → mock behavior, so some trial-and-error may be required for unusual circumstances.

(Warning: “automock” is an overloaded term in Jest; it describes what happens to a mocked module by default, and it’s also the name of the automock feature.)
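To make the description above concrete, here is a rough sketch of an automock-style walk in plain JavaScript. It is not Jest's implementation: the automock helper and the sampleApi object are hypothetical, and the handling of arrays and other types is a simplification that may differ from what Jest actually does.

```javascript
// Toy automock that follows the behavior described above.
// NOT Jest's implementation; `automock` here is a hypothetical helper.
function automock(value) {
  if (typeof value === 'function') {
    // Functions become stubs that record their calls and return undefined.
    const stub = (...args) => {
      stub.calls.push(args);
      return undefined;
    };
    stub.calls = [];
    return stub;
  }
  if (value !== null && typeof value === 'object' && !Array.isArray(value)) {
    // Plain objects are walked so nested functions get stubbed too.
    return Object.fromEntries(
      Object.entries(value).map(([key, inner]) => [key, automock(inner)])
    );
  }
  // Strings, numbers, and everything else pass through untouched in this sketch.
  return value;
}

const sampleApi = {
  baseUrl: 'https://example.test',
  get: () => 'real response',
  auth: { login: () => 'real token' },
};
const mockedApi = automock(sampleApi);

console.log(mockedApi.baseUrl);      // https://example.test
console.log(mockedApi.get());        // undefined
console.log(mockedApi.auth.login()); // undefined
```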

Mocking behavior

Automock is all well and good for keeping the tests from crashing, but sometimes you need to replace behavior as well (such as returning a stub value).

Let's make our mocked API return a sample response:

api.get.mockReturnValue('hello');
expect(api.get()).toEqual('hello');

Normally api.get() would query the API; here, we force it to return "hello" instead.

There’s a bunch of ways to mock return values, such as mockReturnValue(), mockResolvedValue(), and mockImplementation(). It depends on whether you want to return a constant, a promise, or add some logic to the mock.

If you’re using TypeScript, make sure you wrap your mocked function with jest.mocked() before you use it!

jest.mocked(api.get).mockReturnValue('hello');

Calling jest.mocked() adds type information. Without it, you'll get compiler errors & sadness.

Partially mocking modules

What if you only want to mock some (but not all) of the module?

In that case, jest.mock() has a second parameter: a factory function. When provided, the module is mocked with its output instead of using automock.

The trick to partial mocks is to combine jest.requireActual() (which imports the real implementation of a module) with explicit mocking:

jest.mock('api', () => {
  return {
    ...jest.requireActual('api'),
    get: jest.fn().mockReturnValue('hello'),
  };
});

We’ve mocked api.get() but the rest of the api module uses the actual implementation.

The factory function allows for all sorts of shenanigans, such as returning a module which doesn’t match the spec of the original module at all. That sucks, so I only use factory functions when needed.
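One detail of the partial-mock pattern worth spelling out is that the order of the spread matters. A minimal sketch in plain JavaScript, where realModule is a hypothetical stand-in for whatever jest.requireActual('api') would return:

```javascript
// `realModule` is a hypothetical stand-in for jest.requireActual('api').
const realModule = {
  get: () => 'real get',
  post: () => 'real post',
};

// Spread first, override second: later keys win, so get() is replaced.
const partialMock = { ...realModule, get: () => 'hello' };

// Override first, spread second: the real get() overwrites the stub.
const accidentalNoop = { get: () => 'hello', ...realModule };

console.log(partialMock.get());    // hello
console.log(partialMock.post());   // real post
console.log(accidentalNoop.get()); // real get
```

Keeping the spread on the first line of the returned object, as in the factory above, avoids the second case.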

Cleaning mocks

Reusing a mock usually means cleaning up between tests. There are three levels of cleaning:

  • Clear - clears mock usage metadata (e.g., how many times was this mock function called).
  • Reset - does the above plus replaces the mocked function w/ the default empty function (i.e. undoes calls like mockReturnValue()).
  • Restore - does all the above plus restores the function to its original (real) implementation. (Per the Jest docs, this only applies to mocks created with jest.spyOn().)

You can clean up individual functions (using myMock.mockClear(), myMock.mockReset(), or myMock.mockRestore()) or you can apply it to all mocks at once (using jest.clearAllMocks(), jest.resetAllMocks(), and jest.restoreAllMocks()).

You can also call jest.resetModules() to completely reset the modules in the cache, in case there’s some local state in a module that you need to clear between tests.
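The three cleaning levels are easier to compare side by side. Below is a toy spy in plain JavaScript that follows the descriptions in the list above. It is not Jest's implementation: toySpyOn is a hypothetical helper, and real Jest behavior (especially around reset) may vary between versions.

```javascript
// Toy spy following the three cleaning levels described above.
// NOT Jest's implementation; `toySpyOn` is a hypothetical helper.
function toySpyOn(obj, key) {
  const original = obj[key];
  let impl = original; // call through to the real function by default

  const spy = (...args) => {
    spy.calls.push(args);
    return impl ? impl(...args) : undefined;
  };
  spy.calls = [];

  spy.mockReturnValue = (value) => {
    impl = () => value;
    return spy;
  };
  // Clear: forget usage metadata only.
  spy.mockClear = () => {
    spy.calls = [];
  };
  // Reset: clear, plus drop any stubbed behavior.
  spy.mockReset = () => {
    spy.mockClear();
    impl = undefined;
  };
  // Restore: reset, plus put the real function back on the object.
  spy.mockRestore = () => {
    spy.mockReset();
    obj[key] = original;
  };

  obj[key] = spy;
  return spy;
}

const service = { get: () => 'real' };
const getSpy = toySpyOn(service, 'get').mockReturnValue('hello');

service.get();
console.log(getSpy.calls.length); // 1

getSpy.mockClear();
console.log(getSpy.calls.length, service.get()); // 0 hello

getSpy.mockReset();
console.log(service.get()); // undefined

getSpy.mockRestore();
console.log(service.get()); // real
```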
