“We're seeing the rise of a new persona that uses our products, the autonomous agent. That means we need to design for them, too. Agent Experience (AX) is about creating products that agents can navigate, integrate with, and orchestrate effectively.”
In this episode, Matt Biilmann, CEO and co-founder of Netlify, joins Ankit Jain, founder of Aviator, to unpack the next evolution in software: agent experience.
Matt, known for coining the term Jamstack, shares how AI is transforming the way we build for the web, making almost everyone a web developer. From Agent Experience (AX) to open vs. closed agent ecosystems, he explains how autonomous agents will reshape software development and why he thinks we are just entering the decade of agents.
00:00 Introduction to Developer Experience and AI
03:00 The Evolution of Coding and Development
06:07 The Impact of AI on Software Development
09:59 Understanding Agent Experience (AX)
13:40 The Future of Human and Agent Collaboration
17:50 Open vs. Closed Systems in Development
21:50 Simplicity in Development Frameworks
25:31 The Role of Agents in Web Development
29:34 Predictions for the Future of Development
About Matt Biilmann
Matt Biilmann is CEO of Netlify, a company he co-founded in 2014. He has been building developer tools, content management systems, and web infrastructure for more than 30 years and is recognized for coining the term “Jamstack.”
About Hangar DX (https://dx.community/)
The Hangar is a community of senior DevOps and senior software engineers focused on developer experience. This is a space where vetted, experienced professionals can exchange ideas, share hard-earned wisdom, troubleshoot issues, and ultimately help each other in their projects and careers.
We invite developers who work in DX and platform teams at their respective companies or who are interested in developer productivity.
MCP (Model Context Protocol) is a simple, common way for AI models (LLMs) to talk to APIs and data. Think of it like a universal plug: if both sides support MCP, they can connect and work together. An MCP server is any service or app that “speaks MCP” and offers tools the model can use, publishing a list of tools, what each tool does, and what inputs (parameters) each tool needs.
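For a rough picture, a single entry in an MCP server's tool list can be sketched as a Python dictionary with a name, a description, and a JSON Schema for its inputs. The example below is modeled on that general shape, but it is not the exact definition the GitHub MCP Server publishes.

```python
# Simplified sketch of one MCP tool entry: a name, a human-readable
# description, and a JSON Schema describing the accepted parameters.
# The exact fields published by the GitHub MCP Server may differ.
example_tool = {
    "name": "list_issues",
    "description": "List issues in a GitHub repository.",
    "inputSchema": {
        "type": "object",
        "properties": {
            "owner": {"type": "string", "description": "Repository owner"},
            "repo": {"type": "string", "description": "Repository name"},
            "since": {
                "type": "string",
                "description": "Only include results updated at or after this ISO 8601 timestamp",
            },
        },
        "required": ["owner", "repo"],
    },
}
```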
The GitHub MCP Server is the foundation for many GitHub Copilot workflows, both inside and outside of GitHub. As an engineering team working on GitHub MCP, we’re always looking to deliver new features and functionality, while avoiding regressions and improving quality with every iteration. And how we name a tool, explain what it does, and spell out its parameters directly affects whether the model picks the right tool, in the right order, with the right arguments.
In our work, small edits matter: tightening a description, adding or removing a tool, or combining a few similar tools can shift results a lot. When descriptions are off, agents choose the wrong tool, skip a step, send arguments in the wrong format, or drop them entirely. The result is a poor experience for the user. We need a safe way to change MCP and know whether things actually got better, not worse. That’s where offline evaluation comes in.
Offline evaluation catches regressions before users see them and keeps the feedback loop short, so we can ship changes that genuinely improve performance.
This article walks through our evaluation pipeline and explains the metrics and algorithms that help us achieve these goals.
How automated offline evaluation works
Our offline evaluation pipeline checks how well our tool prompts work across different models. The tool instructions are kept simple and precise so the model can choose the right tool and fill in the correct parameters. Because LLMs vary in how they use tools, we systematically test each model–MCP pairing to measure compatibility, quality, and gaps.
We have curated datasets that we use as benchmarks. Every benchmark contains the following fields:
Input: This is a user request formulated in natural language.
Expected tools: Tools we expect to be called.
Expected arguments: Arguments we expect to be passed to each tool.
Here’s an example:
Asking how many issues were created in a given time period
Input: How many issues were created in the github/github-mcp-server repository during April 2025?
Expected tools: list_issues, with arguments along the lines of the sketch below.
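To make the shape concrete, such a benchmark could be represented as follows. The structure and the argument values (including the date filter) are illustrative assumptions, not the exact contents of our curated datasets.

```python
# Hypothetical benchmark entry for the example above; field names and
# argument values are illustrative rather than taken from the real dataset.
benchmark = {
    "input": (
        "How many issues were created in the github/github-mcp-server "
        "repository during April 2025?"
    ),
    "expected_tools": [
        {
            "name": "list_issues",
            "arguments": {
                "owner": "github",
                "repo": "github-mcp-server",
                # The date-range argument is assumed for illustration.
                "since": "2025-04-01T00:00:00Z",
            },
        }
    ],
}
```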
The evaluation pipeline has three stages: fulfillment, evaluation, and summarization.
Fulfillment: We run each benchmark across multiple models, providing the list of available MCP tools with every request. For each run, we record which tools the model invoked and the arguments it supplied.
Evaluation: We process the raw outputs and compute metrics and scores.
Summarization: We aggregate dataset-level statistics and produce the final evaluation report.
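For illustration, a compressed, toy version of these three stages might look like the sketch below. The call_model helper is a stand-in for the real model invocation, and the evaluation step is reduced to a single check so the example stays self-contained; none of these names come from our actual pipeline.

```python
# Toy version of the pipeline: fulfillment -> evaluation -> summarization.
def call_model(model, user_input, tools):
    # Stand-in for a real model call; it would send the input plus the MCP
    # tool list to the model and return the tool calls the model made.
    return [{"name": "list_issues", "arguments": {"owner": "github", "repo": "github-mcp-server"}}]

def evaluate(benchmarks, models, tools):
    # Fulfillment: run every benchmark against every model and record
    # which tools were invoked with which arguments.
    runs = [
        {"model": m, "expected": b["expected_tools"], "calls": call_model(m, b["input"], tools)}
        for m in models
        for b in benchmarks
    ]
    # Evaluation: compute per-run metrics (here, just "was every expected tool called?").
    for run in runs:
        called = {call["name"] for call in run["calls"]}
        run["correct_tools"] = all(t["name"] in called for t in run["expected"])
    # Summarization: aggregate into dataset-level statistics per model.
    per_model = {}
    for run in runs:
        per_model.setdefault(run["model"], []).append(run["correct_tools"])
    return {model: sum(flags) / len(flags) for model, flags in per_model.items()}
```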
Evaluation metrics and algorithms
Our evaluation targets two aspects: whether the model selects the correct tools and whether it supplies correct arguments.
Tool selection
When benchmarks involve a single tool call, tool selection reduces to a multi-class classification problem. Each benchmark is labeled with the tool it expects, and each tool is a “class.”
Models tasked with this classification are evaluated using accuracy, precision, recall, and F1-score.
Accuracy is the simplest measure: the percentage of correct classifications. In our case, it’s the percentage of inputs that resulted in the expected tool call, calculated over the whole dataset.
Precision is the proportion of correct calls to a tool out of all cases where that tool was called. Low precision means the model picks the tool even in cases where it shouldn’t be called.
Recall is the proportion of correct calls to a tool out of all cases where that tool was expected to be called. Low recall may indicate that the model doesn’t understand that the tool needs to be called, and either fails to call it or calls another tool instead.
F1-score is the harmonic mean of precision and recall, summarizing how well the model does on both.
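As an illustration, given the expected tool per benchmark and the tool a model actually called, all four scores can be computed with scikit-learn; the labels below are made up.

```python
# Toy tool-selection labels: one expected tool and one called tool per benchmark.
from sklearn.metrics import accuracy_score, precision_recall_fscore_support

expected = ["list_issues", "search_issues", "list_issues", "get_issue", "search_issues"]
called   = ["list_issues", "list_issues",   "list_issues", "get_issue", "search_issues"]

tools = ["list_issues", "search_issues", "get_issue"]
accuracy = accuracy_score(expected, called)
precision, recall, f1, _ = precision_recall_fscore_support(
    expected, called, labels=tools, zero_division=0
)
print(accuracy)                     # share of benchmarks with the expected tool call
print(dict(zip(tools, precision)))  # per-tool precision
print(dict(zip(tools, recall)))     # per-tool recall
print(dict(zip(tools, f1)))         # per-tool F1-score
```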
If the model confuses two tools, it can result in low precision or recall for these tools.
Two similar tools that used to be confused often are list_issues and search_issues. Let’s say we have 10 benchmarks for list_issues and 10 benchmarks for search_issues, and imagine that list_issues is called correctly in all 10 of its cases but is also called in 30% of the cases where search_issues should be called.
This means we’re going to have lower recall for search_issues and lower precision for list_issues:
Precision (list_issues) = 10 (correct calls) / (10 + 3 (calls made instead of search_issues)) = 0.77
Recall (search_issues) = 7 (correct calls) / 10 (cases where the tool was expected to be called) = 0.7
To get visibility into which tools are confused with each other, we build a confusion matrix. The confusion matrix for the search_issues and list_issues tools from the example above looks like this:
Expected tool / Called tool    search_issues    list_issues
search_issues                  7                3
list_issues                    0                10
The confusion matrix allows us to see the reason behind low precision and recall for certain tools and tweak their descriptions to minimize confusion.
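For illustration, the numbers from the example above can be reproduced with scikit-learn, showing how the matrix and the per-tool scores line up:

```python
# Rebuild the example: 10 benchmarks per tool, search_issues confused with
# list_issues in 3 of its 10 cases, list_issues always called correctly.
from sklearn.metrics import confusion_matrix, precision_score, recall_score

expected = ["search_issues"] * 10 + ["list_issues"] * 10
called   = ["search_issues"] * 7 + ["list_issues"] * 3 + ["list_issues"] * 10

labels = ["search_issues", "list_issues"]
print(confusion_matrix(expected, called, labels=labels))
# [[ 7  3]
#  [ 0 10]]
print(precision_score(expected, called, pos_label="list_issues"))  # 10 / 13 ≈ 0.77
print(recall_score(expected, called, pos_label="search_issues"))   # 7 / 10 = 0.7
```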
Argument correctness
Selecting the right tool isn’t enough. The model must also supply correct arguments. We’ve defined a set of argument-correctness metrics that pinpoint specific issues, making regressions easy to diagnose and fix.
We track four argument-quality metrics:
Argument hallucination: How often the model supplies argument names that aren’t defined for the tool.
All expected arguments provided: Whether every expected argument is present.
All required arguments provided: Whether all required arguments are included.
Exact value match: Whether provided argument values match the expected values exactly.
These metrics are computed for tools that were correctly selected. The final report summarizes each tool’s performance across all four metrics.
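As a rough sketch (not our actual pipeline code), these four checks for a single, correctly selected tool call could be expressed like this; the function and field names are ours for illustration:

```python
# Hypothetical helper: compare the arguments the model supplied against the
# tool's schema and the benchmark's expected arguments.
def argument_metrics(called_args, expected_args, schema_properties, required):
    defined = set(schema_properties)
    return {
        # Argument hallucination: names the model supplied that the tool doesn't define.
        "hallucinated_args": sorted(set(called_args) - defined),
        # All expected arguments provided.
        "all_expected_provided": set(expected_args) <= set(called_args),
        # All required arguments provided.
        "all_required_provided": set(required) <= set(called_args),
        # Exact value match for every expected argument.
        "exact_value_match": all(called_args.get(k) == v for k, v in expected_args.items()),
    }

print(argument_metrics(
    called_args={"owner": "github", "repo": "github-mcp-server", "state": "open"},
    expected_args={"owner": "github", "repo": "github-mcp-server"},
    schema_properties={"owner": {}, "repo": {}, "since": {}},
    required=["owner", "repo"],
))
# {'hallucinated_args': ['state'], 'all_expected_provided': True,
#  'all_required_provided': True, 'exact_value_match': True}
```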
Looking forward and filling the gaps
The current evaluation framework gives us a solid read on tool performance against curated datasets, but there’s still room to improve.
More is better
Benchmark volume is the weak point of offline evaluation. With so many classes (tools), we need more robust per-tool coverage: evaluations based on just a couple of examples per tool aren’t dependable on their own. Adding more benchmarks increases the reliability of the classification evaluation and of the other metrics.
Evaluation of multi-tool flows
Our current pipeline handles only single tool calls. In practice, tools are often invoked sequentially, with later calls consuming the outputs of earlier ones. To evaluate these flows, we must go beyond fetching the MCP tool list and actually execute tool calls (or mock their responses) during evaluation.
We’ll also update summarization. Today we treat tool selection as multi-class classification, which assumes one tool per input. For flows where a single input can trigger multiple tools, multi-label classification is the better fit.
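As a rough sketch of that direction, multi-label scoring could binarize the expected and called tool sets per input and reuse the same precision/recall machinery; the data below is illustrative:

```python
# Each benchmark expects a set of tools; the model may call a different set.
from sklearn.preprocessing import MultiLabelBinarizer
from sklearn.metrics import precision_recall_fscore_support

expected = [{"list_issues", "get_issue"}, {"search_issues"}]
called   = [{"list_issues"},              {"search_issues", "list_issues"}]

mlb = MultiLabelBinarizer().fit(expected + called)
precision, recall, f1, _ = precision_recall_fscore_support(
    mlb.transform(expected), mlb.transform(called), average="micro", zero_division=0
)
print(precision, recall, f1)
```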
Take this with you
Offline evaluation gives us a fast, safe way to iterate on MCP, so models pick the right GitHub tools with the right arguments. By combining curated benchmarks with clear metrics—classification scores for tool selection and targeted checks for argument quality—we turn vague “it seems better” into measurable progress and actionable fixes.
We’re not stopping here. We’re expanding benchmark coverage, refining tool descriptions to reduce confusion, and extending the pipeline to handle real multi-tool flows with execution or faithful mocks. These investments mean fewer regressions, clearer insights, and more reliable agents that help developers move faster.
Most importantly, this work raises the bar for product quality without slowing delivery. As we grow the suite and deepen the evaluation, you can expect steadier improvements to GitHub MCP Server—and a better, more predictable experience for anyone building with it.
This week on Azure Friday, we explore how Azure Logic Apps enables developers to build autonomous and conversational agentic workflows. The episode dives into how Agent Loop brings intelligence into workflows, supports multi-agent patterns, and makes it easy to connect agents, people, and enterprise systems — all on a secure, enterprise-ready platform.
🌮 Chapter Markers
0:00 - Introduction
1:05 - Automation with Agents and Workflows
2:20 - Autonomous Agents and Agent Loop
5:38 - Demo – IT Operations Agent
6:38 - Demo – Agent with a variety of tools
7:57 - Demo – Use native A2A chat client
9:49 - Demo – On Behalf Of consent flow within chat
13:40 - Demo – Agent observability and traceability
14:35 - Demo – Multi-agent patterns
18:20 - Wrap up
🌮 Follow us on social:
Scott Hanselman | @SHanselman – https://x.com/SHanselman
Blog – Azure Integration Services Blog | Microsoft Community Hub
YouTube – Azure Logic Apps
Azure Friday | @AzureFriday – https://x.com/AzureFriday
Jayson Street talks about low-tech ways that hackers can bypass cybersecurity systems, and ways you can protect your network and data against these attacks.