Sr. Content Developer at Microsoft, working remotely in PA, TechBash conference organizer, former Microsoft MVP, Husband, Dad and Geek.

Why We Removed FluentAssertions from Akka.NET

Back at the very end of March we shipped Akka.NET v1.5.64. We’ve done a handful of releases since then, but I want to come back to this one specifically, because it contained a change that’s been a long time coming: we removed FluentAssertions as a dependency from every Akka.NET TestKit package.

If you write tests against Akka.NET, this might be a breaking change for you. So let me walk through exactly what changed, why we did it, and what you should do about it.


Beyond WebSockets: A Glimpse into the Future of Document Editing with WebAssembly

Every major shift in browser technology creates new possibilities for software developers. WebAssembly may be the most significant advancement for document editing since we introduced our WebSocket-based architecture.


You can fork a package, but can you own it?

Fork your dependencies, trim them to only your use case, never update unless it breaks for your users. I’ve been vocal about this for 10+ years. I’ve always said that updating is way riskier than latent bugs (which can be tracked and CVEs monitored).

If you are updating a dependency, it’s on you to analyze every single commit in the full transitive set of dependencies. If you dont see anything compelling, dont update!

I remember at HashiCorp once in awhile an engineer would try to update a dep or replace a DIY lib with an external one and id always ask “show me the commit we need.” Dont update for the sake of it.

Feeling pretty swell about this mentality with all the supply chain attacks happening.

That’s from Mitchell Hashimoto. A friend sent it to me, and the first word he reached for was bold. Mine too. I mostly agree with him, honestly. But the second I read it, my head jumped to a caveat, and the caveat turned out to be the thing I actually wanted to write about.

You can fork a small library and trim it to what you use. I’ve done it, it’s fine. But could you fork React and maintain it? I couldn’t. I don’t think most teams could either. So “fork your dependencies” is wonderful advice right up to the point where the dependency is too big to hold in your hands, and then it quietly stops being advice at all.

And that caveat says something about the advice itself. I think that it was never really about forking. It’s about knowing exactly what you’ve taken on and being willing to own it, and that’s the bit most of us skip. We don’t decide to take a dependency. We install one. And almost everything that goes wrong with dependencies:

  • the supply chain attacks,
  • the licence dramas,
  • the half-finished SBOMs,
  • the things you find out about only after they break

comes back to that one move: we took on a dependency on someone else’s product without ever deciding to. Everything else here is a variation on it.

To fork or not?

I was reminded of this not long ago. I was doing a live Q&A about my TypeScript port of some code, and a whole chunk of it was one question asked five different ways:

Why did you write your own deepEquals instead of just pulling a package?

I tried to explain that it wasn’t laziness, and it wasn’t not-invented-here pride either. I explained that it’s the code I write once and forget. Of course, it’ll take some time, but not as long as you might think. The reception was, let’s say, mixed. I get it, it’s much easier to just install a package and outsource maintenance. Especially when you’re in a hurry, you don’t decide anything. You reach. And this reach can be a decent interim solution, but then you should reflect on it as the next step.

Why do I reach for my own code on the small stuff? Look at which packages actually get hit in all these supply chain attacks we keep reading about. It’s almost always the little ones. The one-liners. Remember the Left-pad Incident? It’s usually some tiny or low-level helper nobody thinks about. They’re everywhere, they’re trivial, and precisely because they’re trivial, nobody is watching them. Which is also exactly why they’re so easy to write yourself. So for me, handwriting a small helper isn’t paranoia. It’s the calmer option, and it’s one of the few cases where I know exactly what’s in my own system.

That’s also why, in Emmett, my Event Sourcing library, I’m almost stubborn about keeping the core package free of dependencies and limiting whatever I inject elsewhere. Probably too stubborn, if I’m honest. But I’d rather err that way, because when it goes wrong on the user’s side, it goes wrong far worse than a bit of code I maintain myself. It’s about responsibility for the outcome of your actions.

If I had to compress what I believe into one line, it wouldn’t be “fork everything”, and it wouldn’t be “write everything yourself”. It would be something duller than both:

Be precise about which dependencies you take on, look at how many dependencies your dependencies pull in, and treat that as part of the decision.

The number itself isn’t the point. It just tells you how much you’re agreeing to own without ever seeing it.

Because your direct dependencies were never the worst part. You chose those. Most of the time, you can read them, and you can follow what they’re doing. The scary part is the dependencies of your dependencies, and theirs, all the way down, the part you didn’t choose, can’t see, and have no say over.

And I want to be careful here, because it’s easy to let this slide into another round of JavaScript-bashing, and that’s not what I mean. Every ecosystem has this. JS and TypeScript just sit at one far end of it, where there’s a package for absolutely everything. Which is good, in general, you’re not rebuilding the wheel every other week, the way you are in some other places. But it’s also how you end up with a node_modules you couldn’t fully explain if someone put a gun to your head.

At the other extreme, there’s Microsoft and .NET, where the instinct runs so hard the opposite way that it tips into Not Invented Here Syndrome. Neither end is the “right” one. They’re both defaults people drift into without ever making a call.

So for me, it’s not about reaching zero dependencies. It’s about having dependencies that we cautiously agreed upon.

Which takes me to the part that, in my experience, almost nobody does. You can’t make a call on what you can’t see, and if you don’t even have the basic knowledge (e.g. some list) of what you depend on, then every conversation about supply chain risk is a bit of theatre.

Dependency Inventory

In most environments, there are tools to generate the Software Bill of Materials (SBOM), the inventory of your dependency tree. In some, they’re even built in. It’s easy to dunk on NPM, but it’d be better to do due diligence before doing so. Not many people seem to know this, but recent versions of npm ship an npm sbom command. So the tooling exists, even in NPM. That isn’t the problem.

The problem is that most organisations have never generated one in their life. No SBOM, no inventory, nothing written down anywhere. So the day the next Log4Shell lands, and there will be a next one, they can’t answer the very first question anyone will ask them: do we run this, and if so, where?
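To make that first question concrete, here is a minimal sketch of answering "do we run this, and if so, where?" from stored SBOMs. It assumes each service's SBOM was exported as CycloneDX JSON (for example with npm sbom --sbom-format cyclonedx), where the top-level components array lists packages with a name and a version. The service names and the sample data are made up for illustration.

```python
from typing import Any


def find_package(sboms: dict[str, dict[str, Any]], package: str) -> list[tuple[str, str]]:
    """Return (service, version) pairs for every service whose SBOM lists `package`."""
    hits = []
    for service, sbom in sboms.items():
        # CycloneDX keeps the dependency inventory under "components".
        for component in sbom.get("components", []):
            if component.get("name") == package:
                hits.append((service, component.get("version", "unknown")))
    return sorted(hits)


# Hypothetical, hand-written SBOM fragments standing in for real exports.
sboms = {
    "billing-api": {"components": [{"name": "lodash", "version": "4.17.21"},
                                   {"name": "left-pad", "version": "1.3.0"}]},
    "web-frontend": {"components": [{"name": "react", "version": "18.3.1"}]},
    "reports": {"components": [{"name": "left-pad", "version": "1.1.3"}]},
}

affected = find_package(sboms, "left-pad")
print(affected)
```

The lookup itself is trivial. What makes it possible is that the SBOMs were generated and kept somewhere before the incident.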

On the other hand, tools often don’t help here, even those built to do so. NPM audit mostly does the opposite. I honestly can’t remember the last time I installed something, and the audit didn’t immediately tell me to bump a stack of packages. Most of it is false positives, with no real attempt to say how dangerous any of it is. And that lands you in the oldest trap going: if it’s always red, you stop looking at red. So the one signal that was supposed to make you stop and decide ends up training everyone to decide nothing.

Bus factor and rug pulls

There’s a related thing I can’t quite leave alone, so let me wander into it. I think a lot of teams spend their energy on the symptom and never once look at the source.

Watch what happens in the .NET world whenever a popular package changes the deal. Fluent Assertions went commercial. Moq shipped a thing that quietly hashed your git email and phoned it home. MassTransit and AutoMapper announced commercial licenses within the same stretch. And nearly every time, the reaction across .NET shops is identical. It’s a mixture of:

  • rip it out and write our own,
  • search for a free alternative,
  • cry to Microsoft to buy the lib or provide a replacement.

Essentially: a throw-the-baby-out-with-the-bathwater strategy.

And for me, that’s solving the wrong thing entirely. The source isn’t that the package started charging money, or pulled a rug. The source is that we took on a critical dependency without ever admitting to ourselves that it was critical, and never once thought about what we’d do if the terms changed. We didn’t consider the bus factor, and we didn’t do due diligence to ensure the work on it was sustainable and could continue. Pulling it out and hand-rolling a replacement fixes none of that. It just resets the same trap, this time with code that only we maintain.

The IdentityServer episode was the clearest version of it I’ve seen. People were upset that they suddenly had to pay. In the next sentence, they called it a critical security component. In the sentence after that, they asked what the free alternatives were. A critical security component that you want for free and are ready to swap out overnight is, to my mind, asking for a security incident.

And there’s a bit of maths that quietly settles most of these arguments, if anyone bothers to do it. Take what the licence costs you per year. Then take into account what it would cost to have an engineer build and maintain your own version. Put the two side by side.

Almost every time, paying the maintainer comes out cheaper, and on top of that, you’ve reduced the bus-factor risk on something you already lean on, which is its own kind of supply chain security. “We’d write it ourselves, but then we’d have to maintain it” is true. I just read it as the argument for paying the person who already does, not against it. If you depend on something, its survival is your problem too. That’s part of owning the decision.
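As a rough illustration of that side-by-side comparison, here is a small sketch. Every number in it is invented, and the cost model (one engineer, a one-off build, then a fixed share of each year spent on maintenance) is an assumption, not something taken from a real project.

```python
def yearly_cost_of_building(engineer_cost_per_year: float,
                            build_months: float,
                            maintenance_fraction: float,
                            years: int) -> float:
    """Average yearly cost of building and then maintaining an in-house replacement."""
    build = engineer_cost_per_year * build_months / 12
    maintenance = engineer_cost_per_year * maintenance_fraction * years
    return (build + maintenance) / years


# All figures below are hypothetical.
licence_per_year = 5_000.0
in_house_per_year = yearly_cost_of_building(
    engineer_cost_per_year=120_000.0,  # fully loaded cost of one engineer
    build_months=3,                    # time to write the first version
    maintenance_fraction=0.10,         # share of each year spent maintaining it
    years=3,                           # horizon over which the cost is averaged
)
print(licence_per_year, in_house_per_year)
```

With these made-up inputs the in-house version averages about 22,000 a year against a 5,000 licence. Plug in your own figures; the point is to actually put the two numbers next to each other.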

I know this case too well.

LLM as a fork

Getting back to Mitchell’s thought. What makes it most interesting to me is the moment we’re in. I keep hearing that LLMs change all of this, that writing your own small things is suddenly trivial, so the whole dependency question softens. I don’t buy it. It’s never that easy. Writing the small thing was never the hard part anyway. Owning it, understanding it, maintaining it, being the one on the hook when it breaks at 2 am, that’s the hard part, and no model takes that off your plate.

I don’t see how LLMs can change the cost of owning code. They can (maybe) change the cost of producing it. That doesn’t fix the “install without deciding” habit. The old move was install and move on. The new move is “vibe it” and move on. Same missing decision, new flavour. The same lack of responsibility and ownership.

This trend isn’t new. It’s classic Shadow IT. If you haven’t been around long enough to run into the term, Shadow IT refers to the tools and systems people build or adopt within a company without going through whoever is officially meant to approve them. The spreadsheet that quietly runs a whole department. The little script someone wrote on a Friday that half the team now depends on. The integration nobody in the platform group has ever heard of. It has always existed because people route around slow governance to get their job done, and most of the time, nobody notices until it breaks.

With LLMs, it’s more tempting than it has ever been. Someone in sales promises a customer a feature the team supposedly needs. The team has no time, so they cobble a tool together from the API and ship it. It doesn’t work. The customer says they’re not paying for this. It escalates. The thing was unowned from the moment it was conceived; nobody decided to take it on, it just appeared, and now the blame game starts.

And here’s where I think it all settles, because the corporate steamroller flattens everything in the end. Companies will dictate the allowed list, the way they always have. The cautious majority will stick to what’s known and popular: React, TypeScript, Python, Spring Boot. That’s what they did last time, and the time before. And the people who want to move faster will do it off the books, with an LLM, as Shadow IT. The declarative, standards-based frameworks that hide their complexity will do well in that world, because that style suits how these tools work, but it’s the same ceiling as before. You bet on React. You don’t own it. The small stuff you can hold; the big stuff stays a bet.

What to do then?

We cannot fix the entire software industry, but we can fix how our own engineering teams operate. Instead of waiting for automated audits to scream at us, or ripping out packages in an emotional panic, I suggest a simple, regular exercise for your organisation.

Sit down and explicitly define your dependency posture:

  1. Inventory: List the dependencies you use (even without peer dependencies). Use tools like npm sbom to actually see what you are pulling in.
  2. Criticality: Identify which of these packages are absolutely critical to your system.
  3. Lifecycle: Define a clear strategy for upgrading and versioning them. Are you updating just for the sake of it, or are you looking for specific commits like Mitchell suggests?
  4. The Bus Factor: Ask yourself: what happens if the author of a critical package gets hit by a bus, burns out, or the tool becomes paid?
  5. Mitigation: Decide on a concrete backup plan for that exact scenario. Do you fork it? Do you pay the license fee? Maybe pay earlier for support or help in another way to maintain it.
  6. Response Time: Estimate how quickly you can upgrade and deploy the application if a major security breach occurs in a dependency. Also, if the strategy is to use a replacement, how fast will you be able to replace this dependency?
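One way to keep the answers from that exercise somewhere other than people's heads is a small record per dependency. The sketch below is only an illustration: the field names, the package names, and the idea of flagging critical entries without a plan are assumptions, not an established format.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class DependencyPosture:
    name: str
    critical: bool                 # step 2: is the system dead without it?
    upgrade_policy: str            # step 3: when do we upgrade?
    maintainers: int               # step 4: rough bus factor
    mitigation: Optional[str]      # step 5: "fork", "pay licence", "replace", or None
    days_to_patch: Optional[int]   # step 6: estimated time to ship a security upgrade


def undecided(postures: list[DependencyPosture]) -> list[str]:
    """Names of critical dependencies that lack a mitigation or a response-time estimate."""
    return [p.name for p in postures
            if p.critical and (p.mitigation is None or p.days_to_patch is None)]


# Hypothetical entries.
postures = [
    DependencyPosture("identity-server", True, "security fixes only", 2, "pay licence", 2),
    DependencyPosture("assertion-lib", True, "pin until needed", 1, None, None),
    DependencyPosture("date-helper", False, "pin until needed", 1, None, None),
]
print(undecided(postures))
```

Anything the function returns is a dependency you rely on but have not yet made a decision about.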

Building reliable software requires intent. We don’t have to write everything from scratch, but we must be precise about what we bring into our software. Architecture is not just about writing code; it is about choosing which liabilities you are willing to own.

Cheers!

Oskar

p.s. Ukraine is still under brutal Russian invasion. A lot of Ukrainian people are hurt, without shelter and need help. You can help in various ways, for instance, directly helping refugees, spreading awareness, putting pressure on your local government or companies. You can also support Ukraine by donating e.g. to Red Cross, Ukraine humanitarian organisation or donate Ambulances for Ukraine.


Microsoft Open-Sources PostgreSQL Extension for In-Database Durable Execution

Recently open-sourced by Microsoft, pg_durable is a PostgreSQL extension that enables durable workflows to run natively inside the database, eliminating the need for external orchestration systems.

By Sergio De Simone

Is your agent extension actually working?

This is the third article in a series about Agent Experience (AX): the practice of making AI coding agents work correctly with your technology. The series covers what you can and can’t control in the agent stack, how to measure whether your extensions are helping or hurting, and how to iterate toward better outcomes.

You shipped your skill, wrote clear instructions, developers install it, and agents discover it. Everything looks like it’s working. But is the generated code actually better because of your extension? Or would the agent have produced the same result without it?

In the first article, we introduced lift and drag: your extension either improves outcomes or makes them worse. In the second article, we traced step by step through the mechanics of how agents use your technology. Now comes the uncomfortable part: measuring which one you’re shipping.

You can’t tell by looking

The most common mistake in AX work is treating tool invocation as a success signal. Your skill got invoked, the agent followed the instructions, it generated code. Done, right?

Not even close. The agent might have generated the same code without your extension, because the model already knew your API from training data. Or worse: your extension returned so much content that it pushed relevant workspace context out of the window, and the agent missed a configuration file that would have made the code work on the first try. Your tool was called, it returned content, and outcomes got worse. From the outside, everything looks fine.

How would you know? You wouldn’t, at least not without measuring.

What measuring actually looks like

Measuring AX impact comes down to a controlled comparison. You define a task, run it with and without your extension, and compare the outcomes. Everything else stays the same: the model, the harness, the prompt, the workspace. The only variable is your extension.

This gives you two data points:

  • Baseline: how does the agent perform using only the model’s training data and the workspace context?
  • With extension: how does the agent perform when your extension is available?

Controlled comparison: two parallel lanes with the same inputs produce baseline and extension outcome scores, with deltas for lift/drag and cost

If outcomes improve with the extension, you’ve got lift. If they stay the same or get worse, you’ve got drag. But outcomes aren’t the only thing you’re comparing. Your extension adds tokens to the context window, triggers tool calls, and can increase the number of turns the agent needs. A scenario that completes in 3 turns without your extension might take 7 with it. If outcomes improve by 10% but token costs triple, that’s still lift, just an expensive one. This is why you must track both dimensions from the start: did it get better? and what did it cost?
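The two questions (did it get better, and what did it cost) can be sketched as a small comparison over recorded runs. The Run record and its fields are hypothetical; a real harness would fill them in from its own logs.

```python
from dataclasses import dataclass


@dataclass
class Run:
    passed: bool  # did the output meet the evaluation criteria?
    tokens: int   # prompt + completion tokens consumed by the run


def compare(baseline: list[Run], extension: list[Run]) -> dict[str, float]:
    """Compare pass rate and average token usage for one scenario across two profiles."""
    def pass_rate(runs: list[Run]) -> float:
        return sum(r.passed for r in runs) / len(runs)

    def avg_tokens(runs: list[Run]) -> float:
        return sum(r.tokens for r in runs) / len(runs)

    return {
        "outcome_delta": pass_rate(extension) - pass_rate(baseline),  # > 0 is lift
        "token_ratio": avg_tokens(extension) / avg_tokens(baseline),  # 3.0 = triple cost
    }


# Made-up results: 3 of 5 baseline runs pass, 4 of 5 extension runs pass.
baseline_runs = [Run(True, 1000), Run(True, 1000), Run(True, 1000),
                 Run(False, 1000), Run(False, 1000)]
extension_runs = [Run(True, 3000), Run(True, 3000), Run(True, 3000),
                  Run(True, 3000), Run(False, 3000)]

result = compare(baseline_runs, extension_runs)
print(result)
```

In this invented example the extension produces lift (the pass rate goes from 60% to 80%) at three times the token cost, which is exactly the kind of trade-off both numbers are meant to expose.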

Scenarios

A scenario is a specific task you ask the agent to complete: “Build a REST API with authentication using Contoso Identity” or “Add telemetry to this Express app using Contoso SDK.” Each scenario needs three things:

  1. A starting workspace. The repository state before the agent starts. This can be an empty folder if the scenario tests building from scratch, or a project with existing code, configuration files, and dependencies. Match the workspace to what the scenario represents, because agents behave differently in an empty folder than in a project with existing structure.
  2. A prompt. What you tell the agent to do. Keep it representative of what real developers actually ask for. Don’t optimize the prompt for your extension: write it the way a developer who doesn’t know your extension exists would write it.
  3. Evaluation criteria. How you determine whether the agent’s output is correct. This is the hard part.

Evaluation criteria

Evaluation criteria define what “correct” means for a given scenario. They’re the rubric you score against. There are two dimensions for you to consider: what you check, and how you check it.

What you check

Simple facts. Did the generated code use the v3 SDK instead of the deprecated v2? Does the project compile? Does a specific test pass? These are concrete, binary, and usually the first criteria you write.

Patterns and architecture. Did the code follow the recommended authentication flow? Does the error handling match the SDK’s conventions? Is the solution structured the way your documentation recommends? These require understanding intent and context, not just presence of a string or import.

Both types of checks produce pass/fail results. The difference is what it takes to verify them reliably.

How you check

Deterministic checks use code to verify criteria programmatically. Precise, repeatable, no ambiguity. But they’re harder to build correctly than they look. Take “does the code use the v3 SDK?” A naive string search for the v3 import statement would pass if the import appears in a comment, even though the code doesn’t actually use it. To do this properly, you’d need AST parsing to inspect the actual code structure, not just match text. That’s a meaningful development effort for each criterion, and every time you change what you’re measuring, you’re back writing and debugging grader code. For pattern checks it gets worse: verifying that authentication follows the PKCE flow means walking the AST to trace the code path. The grader code quickly becomes harder to maintain than the extension you’re evaluating.
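To show the gap between a string search and a structural check, here is a minimal sketch using Python's standard ast module. It assumes the generated code is Python and that the SDK is a hypothetical module called contoso_sdk.v3; a grader for another language would need that language's parser.

```python
import ast


def uses_module(source: str, module: str) -> bool:
    """True if `source` imports `module` and references an imported name in code."""
    tree = ast.parse(source)
    imported: set[str] = set()
    for node in ast.walk(tree):
        if isinstance(node, ast.Import):
            for alias in node.names:
                if alias.name == module:
                    # `import a.b` binds `a`; `import a.b as x` binds `x`.
                    imported.add(alias.asname or alias.name.split(".")[0])
        elif isinstance(node, ast.ImportFrom) and node.module == module:
            for alias in node.names:
                imported.add(alias.asname or alias.name)
    # A name counts as used only when it is read somewhere in the code.
    return any(
        isinstance(node, ast.Name)
        and isinstance(node.ctx, ast.Load)
        and node.id in imported
        for node in ast.walk(tree)
    )


commented = "# from contoso_sdk.v3 import Client\nprint('hello')\n"
unused = "from contoso_sdk.v3 import Client\n"
used = "from contoso_sdk.v3 import Client\nclient = Client()\n"

for sample in (commented, unused, used):
    # Naive text match versus the structural check.
    print("contoso_sdk.v3" in sample, uses_module(sample, "contoso_sdk.v3"))
```

The text match is true for all three samples, while the structural check accepts only the last one. Even so, this is far from a complete grader: it does not follow re-exports or dynamic imports, and with plain import contoso_sdk.v3 it only checks that the top-level package name is referenced. That gap is the maintenance cost the paragraph above describes.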

LLM-as-judge checks the same criteria, but you write them in natural language. “The generated code imports and uses the v3 SDK in application code, not just in comments.” “The authentication flow uses the PKCE pattern with a redirect URI, not client credentials.” A capable model can distinguish a comment from actual code, trace authentication flows, and evaluate architectural patterns, all from a text description of what to look for.

The practical advantage is iteration speed. Adding, changing, or removing a criterion is editing a text file, not writing code. Anyone on the team can update the rubric without touching a codebase. When you mix deterministic graders with LLM-as-judge, you end up maintaining two systems, and the deterministic side needs a developer every time the criteria change.

That said, LLM judges have their own failure modes. Models can trip on seemingly simple things: ask one whether the used SDK version is within 3 minor versions of the required version, and some models get the math wrong, especially when minor versions cross into double digits. If you’re building your own judge harness, you might need to supply custom tools for things like version comparison, date parsing, or schema validation. The judge is only as reliable as the model behind it, and the model’s blind spots become your eval’s blind spots.
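A version-comparison tool of the kind mentioned above can be very small. This sketch assumes plain MAJOR.MINOR[.PATCH] version strings without pre-release tags, and it interprets "within 3 minor versions" as "same major version, minor versions at most 3 apart"; both are assumptions you would adjust to your own rule.

```python
def within_minor_versions(used: str, required: str, max_gap: int = 3) -> bool:
    """True if `used` shares the major version of `required` and the minor
    versions differ by at most `max_gap`."""
    used_major, used_minor = (int(part) for part in used.split(".")[:2])
    req_major, req_minor = (int(part) for part in required.split(".")[:2])
    return used_major == req_major and abs(used_minor - req_minor) <= max_gap


# 3.9 vs 3.11: compared as text, "3.9" sorts after "3.11"; numerically the gap is 2.
print(within_minor_versions("3.9.0", "3.11.2"))
print(within_minor_versions("3.7.1", "3.11.0"))
```

Handing the judge a deterministic helper like this for the arithmetic keeps the natural-language criterion while removing one of the model's known weak spots.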

The tricky part with both approaches is writing good criteria. A vague criterion like “the code should be well-structured” gives you inconsistent scores regardless of how you check it. A specific criterion like “the authentication flow should use the PKCE pattern with a redirect URI, not client credentials” gives you reliable, repeatable results. Write criteria the same way you’d write acceptance criteria for a pull request: specific enough that two reviewers would agree on the verdict.

With LLM-as-judge, you also need to calibrate every time you update the criteria. After each change, verify:

  • Accuracy: do the judge’s scores reflect what’s actually in the code? Run the judge against outputs where you already know the correct verdict. If the judge passes code that should fail (or vice versa), the criterion is ambiguous or the model is misinterpreting it.
  • Consistency: does the judge return the same scores when you evaluate the same output multiple times? Even with temperature set to 0, you can get variation across runs if your criteria are ambiguous. Run the same evaluation multiple times and check whether the verdicts are stable. If a criterion flips between pass and fail on identical input, it’s not reliable enough to measure with.
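Both calibration checks can be sketched against any judge that maps a criterion and an output to a verdict. The Judge signature and the fake_judge stand-in below are assumptions made so the example runs offline; a real judge would call a model.

```python
from collections import Counter
from typing import Callable

Judge = Callable[[str, str], bool]  # (criterion, output) -> pass/fail


def accuracy(judge: Judge, criterion: str, labelled: list[tuple[str, bool]]) -> float:
    """Share of outputs with a known verdict on which the judge agrees."""
    agree = sum(judge(criterion, output) == expected for output, expected in labelled)
    return agree / len(labelled)


def consistency(judge: Judge, criterion: str, output: str, repeats: int = 5) -> float:
    """Share of repeated runs returning the most common verdict (1.0 means stable)."""
    verdicts = Counter(judge(criterion, output) for _ in range(repeats))
    return verdicts.most_common(1)[0][1] / repeats


def fake_judge(criterion: str, output: str) -> bool:
    # Stand-in for a model call: passes anything that mentions PKCE at all.
    return "pkce" in output.lower()


criterion = "The authentication flow uses the PKCE pattern with a redirect URI."
labelled = [
    ("auth flow uses PKCE with a redirect URI", True),
    ("auth flow uses the client credentials grant", False),
    ("# TODO: switch to PKCE later", False),  # a mention, not usage
]

print(accuracy(fake_judge, criterion, labelled))
print(consistency(fake_judge, criterion, labelled[0][0]))
```

The stand-in judge is perfectly consistent and still wrong on the comment-only case, which is why the two properties need to be checked separately.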

Reliable evaluation criteria are the foundation of measurement. Do them wrong, and you end up amplifying noise and making decisions based on false signals.

Profiles

A profile is a configuration of the agent’s environment. At minimum, you need two:

  • Baseline profile: no extensions installed. The agent works with just the model’s training data and workspace context.
  • Extension profile: your extension installed and available.

Run every scenario against both profiles. The difference in outcomes is your extension’s impact. But two profiles only tell you whether your extension helps in isolation. Real developers have other extensions installed too, so you’ll want additional profiles:

  • Extension + popular tools: your extension alongside the 5-10 most common extensions your audience uses.
  • Extension variants: different configurations of your extension (verbose vs. concise responses, different tool descriptions).

Each additional profile multiplies your eval runs, but that’s not the real cost. The real cost is debugging: when something fails in a profile with 10 extensions installed, which extension caused the problem? Was it yours, someone else’s, or the interaction between them? The more variables you introduce, the harder it is to isolate what went wrong and what to fix. This is why starting with just baseline and extension matters: you get a clean comparison with one variable. Once you understand your extension’s impact in isolation, you have a solid base to compare against when you add complexity.
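The multiplication is easy to see once scenarios and profiles are written down as data. The Scenario and Profile records below are hypothetical shapes, not the format of any particular harness, and the paths and names are invented.

```python
from dataclasses import dataclass
from itertools import product


@dataclass(frozen=True)
class Scenario:
    name: str
    workspace: str              # path to the starting repository state
    prompt: str                 # what the developer asks for
    criteria: tuple[str, ...]   # evaluation criteria for the output


@dataclass(frozen=True)
class Profile:
    name: str
    extensions: tuple[str, ...] = ()  # empty for the baseline


def plan_runs(scenarios: list[Scenario], profiles: list[Profile],
              repeats: int) -> list[tuple[str, str, int]]:
    """Every (scenario, profile, attempt) combination that has to be executed."""
    return [(s.name, p.name, attempt)
            for s, p, attempt in product(scenarios, profiles, range(1, repeats + 1))]


scenarios = [
    Scenario("rest-api-auth", "workspaces/empty",
             "Build a REST API with authentication", ("project compiles",)),
    Scenario("add-telemetry", "workspaces/express-app",
             "Add telemetry to this Express app", ("telemetry is initialised",)),
]
profiles = [Profile("baseline"), Profile("extension", ("contoso-extension",))]

runs = plan_runs(scenarios, profiles, repeats=5)
print(len(runs))
```

Two scenarios, two profiles, and five repeats already mean 20 runs; every extra profile adds another 10 at this size.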

The four things you’re measuring

The first article introduced discovery, selection, quality, and composition. Each one fails differently, and each one is measurable.

Measurement funnel: discovery, selection, quality, and composition as sequential stages, each with a failure exit

Discovery

From the outside, a discovery failure looks identical to a selection failure: the tool doesn’t get called. The difference is why. Some harnesses expose which tools were loaded into a session. If yours does, check there first. If it doesn’t, test with only your extension installed. If the agent still doesn’t call it, your tool might be improperly registered, or its descriptions might exceed the harness’s length limit and get truncated. Fix discovery first, because everything downstream depends on it.

Selection

Look at scenarios where the developer’s prompt clearly relates to your technology, but the agent picks a different tool or none at all. Test with only your extension installed: if the agent still doesn’t pick your tool for a clearly relevant task, the description doesn’t match how developers think about the problem. Fix the description, test again, measure the change. Selection fixes are often the highest-ROI changes you can make: a single description rewrite can shift selection rates by double digits.

If your tool gets selected fine in isolation but loses to another tool when both are present, that’s a composition problem, not a selection problem.

Quality

Compare evaluation scores with and without your extension. If scores improve, you’re producing lift. If scores stay flat, the extension is being called but not helping: the model already knew the answer, or your content didn’t add anything useful. If scores drop, you’re producing drag: the content confused the model, conflicted with its training data, or consumed tokens that would have been better spent on workspace context. Without the specific, measurable criteria we talked about earlier, you can’t tell whether the output is better or worse, only whether it’s different.

Composition

Run the same scenarios with your extension alone and then alongside popular tools from your ecosystem. If outcomes degrade, you have a composition problem. Isolate the cause by adding extensions one at a time: does the degradation come from a specific extension, or does it scale with the total number of tools competing for the context window?
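Because each stage depends on the one before it, the four checks can be read as a funnel: find the first stage that fails and fix that one. The sketch below encodes that order; the input flags and score deltas are hypothetical observations you would collect from your own runs, and the rules are a simplification of the descriptions above.

```python
from enum import Enum


class Stage(Enum):
    DISCOVERY = "discovery"
    SELECTION = "selection"
    QUALITY = "quality"
    COMPOSITION = "composition"
    OK = "ok"


def first_failing_stage(tool_loaded: bool,
                        called_alone: bool,
                        score_delta_alone: float,
                        called_with_others: bool,
                        score_delta_with_others: float) -> Stage:
    """Return the earliest stage that fails; score deltas are relative to the baseline."""
    if not tool_loaded:
        return Stage.DISCOVERY      # the harness never exposed the tool
    if not called_alone:
        return Stage.SELECTION      # loaded, but not picked even without competition
    if score_delta_alone <= 0:
        return Stage.QUALITY        # called, but scores are flat or worse
    if not called_with_others or score_delta_with_others < score_delta_alone:
        return Stage.COMPOSITION    # fine alone, degrades next to other tools
    return Stage.OK


print(first_failing_stage(True, True, 0.2, False, 0.0))
```

The example reports a composition problem: the tool is loaded, selected, and helpful in isolation, but stops being called once other extensions are present.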

The cost dimension

Lift isn’t free. Your extension adds tokens to the context window, and every tool invocation costs compute. A scenario that completes in 3 turns without your extension might take 7 with it. If outcomes improve by 10% but token costs triple, that’s still lift, just an expensive one. Whether it’s worth the cost depends on what the improvement buys you: a 10% improvement on a trivial task probably isn’t worth 3x the tokens, but a 10% improvement on a complex task where getting it wrong means hours of debugging might be worth 10x.

The cost metric that matters is token usage: total tokens consumed per scenario run (prompt + completion). Compare it across profiles the same way you compare quality scores. Significant lift at modest token cost? Good. Modest lift at double or triple the tokens? Optimize what your extension returns. Drag at any token cost? Rethink.

Common measurement mistakes

If you’re new to measuring AX impact, it’s easy to make mistakes that produce false signals. Here are the most common ones we’ve seen:

Checking presence instead of usage

The easiest criteria to write are presence checks: does the output mention your SDK? Does the code reference your API? But mention isn’t usage. A code comment like // TODO: Add Contoso Identity integration passes a presence check while doing the opposite of integrating. Check for correct usage, not just mention.

Skipping build verification

Static analysis (checking syntax, structure, imports) can produce high scores that collapse when you actually build and run the project. We’ve seen evaluations where static checks showed 90%+ pass rates, but the projects didn’t compile. Build-and-run verification is the minimum bar for any evaluation that claims to measure real-world quality.

Testing with optimized prompts

If you tune the developer prompt to mention your tool by name or use the exact vocabulary from your tool description, you’re measuring best-case selection, not real-world selection. Use prompts that sound like what developers actually type. Don’t assume they speak your internal vocabulary.

Testing in empty workspaces

Agents behave differently in empty folders than in real projects. An empty workspace means no existing code to consider, no configuration files to read, no dependencies to account for. Your extension might perform well in a clean environment and fall apart in a project with 200 files and 15 existing dependencies.

Measuring a single run

LLMs are non-deterministic. The same prompt, same model, same configuration can produce different results across runs, so a single run tells you almost nothing. Run each scenario multiple times (5-10 minimum) and look at the distribution. If your extension lifts outcomes 8 out of 10 times, that’s a clear improvement. 3 out of 10 is inconclusive to say the least.

Ignoring the baseline

The baseline tells you what kind of extension you need to build. If the model already generates correct code for your technology 90% of the time, you need a lightweight extension that covers the remaining edge cases. If it gets things right 50% of the time, you need an extension that corrects misconceptions and fills knowledge gaps. If the model has never seen your technology at all, you need an extension that teaches it from scratch: what the API looks like, what the conventions are, how things fit together. These are fundamentally different extensions with different designs, different content, and different cost profiles. You can’t know which one to build without measuring the baseline first. Also, different models have different baselines, so be sure to understand where your users are!

Evaluating on the wrong platform

If your developers use VS Code on Windows, but your evaluations run on a CLI-based agent on Linux, you’re measuring a different experience. The harness, the OS, and the available tooling all influence agent behavior. So to get reliable signals, you need to evaluate in an environment that matches your users’.

Getting started

You don’t need a full eval infrastructure to start measuring. Here’s the minimum viable approach:

  1. Pick 3-5 scenarios that represent the most common tasks developers use your technology for. Start with tasks where you suspect the model struggles without help.
  2. Write specific evaluation criteria for each scenario. What does correct output look like? What are the common failure modes?
  3. Run each scenario at least 5 times with your extension and 5 times without it. Same model, same harness, same prompt.
  4. Score the outputs against your criteria. Start with manual scoring if you don’t have automated eval tooling yet.
  5. Compare. Is your extension producing lift, drag, or noise?

This gives you a baseline understanding of your extension’s impact. From there, you can add more scenarios, automate scoring, and introduce composition testing as you scale.
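
The comparison step can be sketched in a few lines. This is a minimal illustration under stated assumptions, not a prescribed harness: scores are manual pass/fail judgments per run, and noise_band (the minimum pass-rate difference treated as a real effect) is a hypothetical threshold you would tune for your own run counts. The scenario names and scores are made up.

```python
from statistics import mean


def pass_rate(scores: list[bool]) -> float:
    """Fraction of runs that met the evaluation criteria."""
    return mean(1.0 if passed else 0.0 for passed in scores)


def classify(with_ext: list[bool], without_ext: list[bool],
             noise_band: float = 0.2) -> str:
    """Label the extension's effect on one scenario as lift, drag, or noise.

    noise_band is an assumed cutoff: smaller differences are treated as noise.
    """
    delta = pass_rate(with_ext) - pass_rate(without_ext)
    if delta > noise_band:
        return "lift"
    if delta < -noise_band:
        return "drag"
    return "noise"


# Hypothetical manual scores: 5 runs per scenario, with and without the extension.
results = {
    "create-resource": ([True, True, True, True, False],
                        [True, False, False, True, False]),
    "migrate-config": ([True, False, True, False, True],
                       [True, True, False, False, True]),
}

for scenario, (with_ext, without_ext) in results.items():
    print(scenario, classify(with_ext, without_ext))
```

Keeping the scoring (pass/fail per run) separate from the comparison makes it easy to swap manual scores for automated ones later without touching the classification logic.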

Summary

Shipping an agent extension without measuring its impact is guessing. You might be shipping lift, you might be shipping drag, and the only way to tell is to compare outcomes with and without your extension, keeping everything else the same. Measure discovery, selection, quality, and composition separately, because each one fails differently and each one has a different fix. And don’t forget cost: lift that triples your token budget might not be worth it. In the next article, we’ll look at what happens when multiple extensions are present at the same time, and why more extensions don’t mean better outcomes.

The post Is your agent extension actually working? appeared first on Microsoft for Developers.


Deterministic SQL Bad Practice Detection in SSMS 22.7 with T-SQL Analyzer MCP Server and Copilot

SQL Server Management Studio (SSMS) 22.7 introduces support for agents and MCP (Model Context Protocol) Servers, opening new possibilities for integrating third-party analysis tools directly into your development workflow. In this post, I'll show you how to use the T-SQL Analyzer MCP Server to deterministically detect bad practices and anti-patterns in your SQL Server scripts right within SSMS.

Note that everything demonstrated in this blog post is already available in Visual Studio and VS Code.

Requirements

  • .NET 10.0 runtime (minimum requirement for MCP Server mode)
  • SSMS 22.7 or later

Background: Why This Matters

As a SQL developer, you want to catch design issues, performance anti-patterns, and security vulnerabilities early, ideally before code reaches production. The T-SQL Analyzer tool includes 140+ rules covering design, naming, and performance concerns. With SSMS 22.7's MCP Server support, you can now run this analysis directly in your IDE and benefit from .NET 10's performance improvements.

What's New in SSMS 22.7?

SSMS 22.7 adds native Agent support that can communicate with MCP Servers. This means tools like T-SQL Analyzer can now:

  • Analyze SQL scripts deterministically using pre-defined rules
  • Provide actionable feedback within the editor context
  • Integrate with your existing SQL development workflow

Setting Up the T-SQL Analyzer MCP Server

Step 1: Verify .NET 10.0 Installation

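Confirm that .NET 10.0 is installed. The following command prints the version of the .NET SDK in use (dotnet --list-runtimes lists the installed runtimes instead):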

dotnet --version

If you don't have .NET 10.0, download it from dot.net.
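
If you want to script this check, the sketch below parses the output of dotnet --list-runtimes and looks for a Microsoft.NETCore.App runtime with a major version of 10 or higher. It assumes the usual "name version [path]" line format of that command; the helper name has_runtime is hypothetical.

```python
import re
import shutil
import subprocess

# Matches lines such as: Microsoft.NETCore.App 10.0.0 [C:\Program Files\dotnet\shared\...]
RUNTIME_LINE = re.compile(r"^Microsoft\.NETCore\.App (\d+)\.\d+\.\S+")


def has_runtime(listing: str, major: int = 10) -> bool:
    """Return True if the listing contains a base runtime with version >= major."""
    for line in listing.splitlines():
        match = RUNTIME_LINE.match(line.strip())
        if match and int(match.group(1)) >= major:
            return True
    return False


if __name__ == "__main__":
    if shutil.which("dotnet") is None:
        print("dotnet was not found on PATH")
    else:
        listing = subprocess.run(
            ["dotnet", "--list-runtimes"],
            capture_output=True, text=True, check=False,
        ).stdout
        print(".NET 10 runtime found" if has_runtime(listing) else ".NET 10 runtime missing")
```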

Step 2: Configure SSMS 22.7 MCP Server

Create or update your SSMS MCP configuration file at:

C:\Users\<username>\.mcp.json

Replace <username> with your Windows username. The recommended configuration uses the .NET 10 dnx launcher to automatically restore and invoke the T-SQL Analyzer package:

{
    "servers": {
        "tsqlanalyzer": {
            "type": "stdio",
            "command": "dnx",
            "args": [
                "ErikEJ.DacFX.TSQLAnalyzer.Cli",
                "--yes",
                "--",
                "-mcp"
            ]
        }
    }
}

This configuration:

  • Uses dnx to download and run the latest T-SQL Analyzer package from NuGet
  • --yes accepts the confirmation prompt automatically, so the package can be downloaded without user interaction
  • -- separates dnx's own options from the arguments passed to the analyzer
  • -mcp tells the analyzer to launch in MCP Server mode
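
If you already have a .mcp.json with other servers in it, you will want to merge the entry rather than overwrite the file. Here is a minimal sketch of that merge, assuming the file uses the top-level "servers" object shown above; the function names add_server and update_config_file are hypothetical.

```python
import json
from pathlib import Path

# The server entry from the configuration above.
TSQL_ANALYZER_ENTRY = {
    "type": "stdio",
    "command": "dnx",
    "args": ["ErikEJ.DacFX.TSQLAnalyzer.Cli", "--yes", "--", "-mcp"],
}


def add_server(config: dict, name: str = "tsqlanalyzer") -> dict:
    """Return a copy of config with the analyzer entry added under "servers"."""
    merged = dict(config)
    servers = dict(merged.get("servers", {}))
    servers[name] = dict(TSQL_ANALYZER_ENTRY)
    merged["servers"] = servers
    return merged


def update_config_file(path: Path) -> None:
    """Read the config file if it exists, add the entry, and write it back."""
    existing = json.loads(path.read_text(encoding="utf-8")) if path.exists() else {}
    path.write_text(json.dumps(add_server(existing), indent=4) + "\n", encoding="utf-8")


# Example (not run automatically): update_config_file(Path.home() / ".mcp.json")
```

Existing server entries are left untouched; only the tsqlanalyzer key is added or replaced.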

Step 3: Enable the MCP Server in SSMS

  1. Open SSMS 22.7
  2. Open GitHub Copilot, select Agent mode, and enable the find_sql_script_problems tool in the TSQLAnalyzerMCP local MCP server.

How It Works: Deterministic Analysis

The MCP Server exposes a single analysis tool: find_sql_script_problems. This tool:

  1. Accepts the text content of your SQL CREATE script
  2. Parses the script using DacFx's T-SQL parser
  3. Evaluates it against 140+ built-in rules (design, naming, and performance, with security checks such as SQL injection detection among the design rules)
  4. Returns a structured list of issues with line numbers, severity, and descriptions
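
For a sense of what the agent sends over the wire, here is a sketch of an MCP tools/call request built as JSON-RPC 2.0. The method name and the params shape (name plus arguments) come from the Model Context Protocol; the argument key "sql" is a hypothetical placeholder, because the post does not show the tool's actual input schema.

```python
import json


def build_tool_call(script: str, request_id: int = 1) -> str:
    """Serialize a JSON-RPC request that invokes find_sql_script_problems."""
    request = {
        "jsonrpc": "2.0",
        "id": request_id,
        "method": "tools/call",
        "params": {
            "name": "find_sql_script_problems",
            # "sql" is an assumed argument name; check the tool's schema
            # (returned by tools/list) for the real one.
            "arguments": {"sql": script},
        },
    }
    return json.dumps(request)


print(build_tool_call("CREATE TABLE dbo.T (Id INT);"))
```

In practice Copilot's agent builds this request for you; the sketch is only meant to show that the tool receives the script text and nothing else, which is what makes the analysis repeatable.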

Example: Detecting a Potential SQL Injection

CREATE PROCEDURE [dbo].[GetUserData]
    @userId VARCHAR(255)
AS
BEGIN
    DECLARE @sql NVARCHAR(1024);
    SELECT @sql = CONCAT(N'SELECT * FROM Users WHERE Id = ''', @userId, N'''');
    EXEC [sys].[sp_executesql] @stmt = @sql;
END;

When you analyze this procedure in SSMS:

  1. Right-click on the query editor or use the SSMS Agent
  2. Select "Analyze with T-SQL Analyzer"
  3. The MCP Server immediately detects:
    • SRD0096 (Design): Potential SQL injection issue (tainted variable in dynamic SQL)
    • Any other applicable rule violations
  4. GitHub Copilot provides a summary of the findings, and may offer to fix them.

The analysis is deterministic—the same script always produces the same results based on the rule set.
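
To see why the rule flags GetUserData, it helps to watch the same pattern fail. The sketch below is an analogy in Python with an in-memory SQLite database, not T-SQL: it contrasts building the query by string concatenation with passing the value as a parameter. The table contents and function names are made up for illustration. In T-SQL, the equivalent fix is to keep the query text constant and pass @userId through sp_executesql's parameter list.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE Users (Id TEXT, Name TEXT)")
conn.executemany("INSERT INTO Users VALUES (?, ?)", [("1", "alice"), ("2", "bob")])


def get_user_concat(user_id: str) -> list:
    """Vulnerable: the caller's value becomes part of the SQL text."""
    sql = "SELECT * FROM Users WHERE Id = '" + user_id + "'"
    return conn.execute(sql).fetchall()


def get_user_param(user_id: str) -> list:
    """Safe: the value is bound as a parameter and never parsed as SQL."""
    return conn.execute("SELECT * FROM Users WHERE Id = ?", (user_id,)).fetchall()


payload = "x' OR '1'='1"
print("concatenated:", len(get_user_concat(payload)), "rows")
print("parameterized:", len(get_user_param(payload)), "rows")
```

With the crafted input, the concatenated query returns every row in the table, while the parameterized query returns none because no user has that literal Id.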

Supported Rule Categories

The analyzer covers three main categories:

Design Rules (SRD*)

  • Hardcoded credentials
  • Unnamed primary keys
  • SQL injection vulnerabilities
  • Missing schema qualifiers
  • And more...

Naming Rules (SRN*)

  • Single-character aliases and variables
  • Non-standard naming patterns
  • Case sensitivity issues

Performance Rules (SRP*)

  • Implicit range windows in window functions
  • Deprecated functions (CAST, CONVERT misuse)
  • Cross-server joins
  • Missing indexes and query store settings
  • And more...

Key Advantages in SSMS 22.7

  1. In-Context Feedback: Analyze without leaving the editor
  2. Deterministic Results: Same analysis every time; no LLM variance or hallucinations
  3. Rule-Based: Transparent, predictable logic based on SQL Server best practices
  4. Quick Feedback Loop: Immediate results as you work

Example Workflow

1. Open your stored procedure in SSMS
   ↓
2. Use SSMS Copilot Agent to invoke the MCP Server
   ↓
3. T-SQL Analyzer scans the script against 140+ rules
   ↓
4. Results appear in SSMS—each issue shows line number, rule ID, and description
   ↓
5. Fix the issues or add IGNORE comments for false positives
   ↓
6. Re-analyze to confirm

Comparison: T-SQL Analyzer MCP Server vs. Command Line

Feature                  SSMS MCP Server   Command-Line CLI
In-IDE analysis          Yes               No
Batch file processing    No                Yes
CI/CD Integration        No                Yes
Interactive feedback     Yes               No
Deterministic results    Yes               Yes

Getting Started

  1. Verify .NET 10.0: Run dotnet --version to confirm .NET 10.0 is installed
  2. Create configuration file: Create or edit C:\Users\<username>\.mcp.json (replace <username> with your Windows username)
  3. Add the MCP server entry to your configuration file (see Step 2 above)
  4. Enable the server in SSMS: Enable the MCP server in the GitHub Copilot window
  5. Analyze: Ask Copilot to use the MCP Server
  6. Fix: Ask Copilot to address the issues

Why .NET 10.0?

The T-SQL Analyzer MCP Server requires .NET 10.0 to take advantage of the latest performance improvements and framework enhancements. The dnx package runner automatically handles fetching and executing the latest version of the tool, ensuring you always have access to the newest rules and optimizations without manual updates.

Next Steps

  • Check out the documentation for detailed rule descriptions
  • Review the GitHub repository for source code and community contributions
  • Follow the blog for updates on new rules and SSMS features

Call to Action

The combination of SSMS 22.7's MCP Server support and the T-SQL Analyzer brings professional-grade SQL static analysis to your fingertips. Try it out and let me know how it improves your SQL development workflow!


Follow the blog for more posts on SQL Server, static analysis, and developer tools.


