
Building Applications with AI Agents


Following the publication of his new book, Building Applications with AI Agents, I chatted with author Michael Albada about his experience writing the book and his thoughts on the field of AI agents.

Michael’s a machine learning engineer with nine years of experience designing, building, and deploying large-scale machine learning solutions at companies such as Uber, ServiceNow, and more recently, Microsoft. He’s worked on recommendation systems, geospatial modeling, cybersecurity, natural language processing, large language models, and the development of large-scale multi-agent systems for cybersecurity.

What’s clear from our conversation is that writing a book on AI these days is no small feat, but for Michael, the reward of the final result was well worth the time and effort. We also discussed the writing process, the struggle of keeping up with a fast-paced field, Michael’s views on SLMs and fine-tuning, and his latest work on Autotune at Microsoft.

Here’s our conversation, edited slightly for clarity.

Nicole Butterfield: What inspired you to write this book about AI agents originally? When you initially started this endeavor, did you have any reservations?

Michael Albada: When I joined Microsoft to work in the Cybersecurity Division, I knew that organizations were facing greater speed, scale, and complexity of attacks than they could manage, and it was both expensive and difficult. There are simply not enough cybersecurity analysts on the planet to help protect all these organizations, and I was really excited about using AI to help solve that problem.

It became very clear to me that this agentic pattern of design was an exciting new way to build that was really effective—and that these language models and reasoning models, as autoregressive models, generate tokens. Those tokens can be function signatures and can call additional functions to retrieve additional information and execute tools. And it was clear to me [that they were] going to really transform the way that we were going to do a lot of work, and it was going to transform a lot of the way that we do software engineering. But when I looked around, I did not see good resources on this topic.

And so, as I was giving presentations internally at Microsoft, I realized there’s a lot of curiosity and excitement, but people had to go straight to research papers or sift through a range of blog posts. I started putting together a document that I was going to share with my team, and I realized that this was something that folks across Microsoft and even across the entire industry were going to benefit from. And so I decided to really take it up as a more comprehensive project to be able to share with the wider community.

Did you have any initial reservations about taking on writing an entire book? I mean you had a clear impetus; you saw the need. But it is your first book, right? So was there anything that you were potentially concerned about starting the endeavor?

I’ve wanted to write a book for a very long time, and very specifically, I especially enjoyed Designing Machine Learning Systems by Chip Huyen and really looked up to her as an example. I remember reading O’Reilly books earlier. I was fortunate enough to also see Tim O’Reilly give a talk at one point and just really appreciated that [act] of sharing with the larger community. Can you imagine what software engineering would look like without resources, without that type of sharing? And so I always wanted to pay that forward. 

I remember as I was first getting into computer science hoping at one point in time I would have enough knowledge and expertise to be able to write my own book. And I think that moment really surprised me, as I looked around and realized I was working on agents and running experiments and seeing these things work and seeing that no one else had written in this space. The moment to write a book seemed to be right now.

Certainly I had some doubts about whether I was ready. I had not written a book before and so that’s definitely an intimidating project. The other big doubt that I had is just how fast the field moves. And I was afraid that if I were to take the time to write a book, how relevant might it still be even by the time of publication, let alone how well is it going to stand the test of time? And I just thought hard about it and I realized that with a big design pattern shift like this, it’s going to take time for people to start designing and building these types of agentic systems. And many of the fundamentals are going to stay the same. And so the way I tried to address that is to think beyond an individual framework [or] model and really think hard about the fundamentals and the principles and write it in such a way that it’s both useful and comes along with code that people can use, but really focuses on things that’ll hopefully stand the test of time and be valuable to a wider audience for a longer period.

Yeah, you absolutely did identify an opportunity! When you approached me with the proposal, it was on my mind as well, and it was a clear opportunity. But as you said, the concern about how quickly things are moving in the field is a question that I have to ask myself about every book that we sign. And you have some experience in writing this book, adjusting to what was happening in real time. Can you talk a little bit about your writing process, taking all of these new technologies, these new concepts, and writing these into a clear narrative that is captivating to this particular audience that you targeted, at a time when everything is moving so quickly?

I initially started by drafting a full outline and just getting the sort of rough structure. And as I look back on it, that rough structure has really held from the beginning. It took me a little over a year to write the book. And my writing process was basically a “thinking fast and slow” approach. I wanted to go through and get a rough draft of every single chapter laid out so that I really knew sort of where I was headed, what the tricky parts were going to be, where the logic gap might be too big if someone were to skip around chapters. I wanted [to write] a book that would be enjoyable start to finish but would also serve as a valuable reference if people were to drop in on any one section.

And to be honest, I think the changes in frameworks were much faster than I expected. When I started, LangChain was the clear leading framework, maybe followed closely by AutoGen. And now we look back on it and the focus is much more on LangGraph and CrewAI. It seemed like we might see some consolidation around a smaller number of frameworks, and instead we’ve just splintered and seen an explosion of frameworks where now Amazon has released Thread, and OpenAI has released their own [framework], and Anthropic has released their own.

So the fragmentation has only increased, which ironically underscores the approach that I took of not committing too hard to one framework but really focusing on the fundamentals that would apply across each of those. The pace of model development has been really staggering—reasoning models were just coming out as I was beginning to write this book, and that has really transformed the way we do software engineering, and it’s really increased the capabilities for these types of agentic design patterns.

So, in some ways, both more and less changed than I expected. I think the fundamentals and core content are looking more durable. I’m excited to see how that’s going to benefit people and readers going forward.

Absolutely. Absolutely. Thinking about readers, I think you may have gotten some guidance from our editorial team to really think about “Who is your ideal reader?” and focus on them as opposed to trying to reach too broad of an audience. But there are a lot of people at this moment who are interested in this topic from all different places. So I’m just wondering how you thought about your audience when you were writing?

My target audience has always been software engineers who want to use AI to build increasingly sophisticated systems, who want to do it to solve real work, whether for individual projects or for their organizations and teams. I didn’t anticipate just how many companies were going to rebrand the work they’re doing as agents and really focus on agentic solutions that are much more off-the-shelf. And so what I’m focused on is really understanding these patterns and learning how you can build these systems from the ground up. What’s exciting to see is that as these models keep getting better, they’re enabling more teams to build on this pattern.

And so I’m glad to see that there’s great tooling out there to make it easier, but I think it’s really helpful to be able to go and see how you build these things from the model up. The other thing I’ll add is that there’s also a wide range of product managers and executives who can really benefit from understanding these systems better and how they can transform their organizations. On the other hand, we’ve also seen a real increase in excitement and use around low-code and no-code agent builders: not only off-the-shelf products but also open source frameworks like Dify and n8n, and the new AgentKit that OpenAI just released, which provide these types of drag-and-drop graphical interfaces.

And of course, as I talk about in the book, agency is a spectrum: Fundamentally it’s about putting some degree of choice within the hands of a language model. And these sort of guardrailed, highly defined systems—they’re less agentic than providing a full language model with memory and with learning and with tools and potentially with self-improvement. But they still offer the opportunity for people to do very real work. 

What this book really is helpful for, then, is for this growing audience of low-code and no-code users to better understand how they could take those systems to the next level and translate those low-code versions into code versions. The growing use of coding models—things like Claude Code and GitHub Copilot—is lowering the barrier to entry so dramatically, making it easier for folks who have less of a technical background to still build really incredible solutions. This book can really serve [as], if not a gateway, then a really effective ramp to go from some of those early pilots and early projects onto things that are a little bit more hardened that they could actually ship to production.

So to reflect a little bit more on the process, what was one of the most formidable hurdles that you came across during the process of writing, and how did you overcome it? How do you think that ended up shaping the final book?

I think probably the most significant hurdle was just keeping up with some of the additional changes on the frameworks. Just making sure that the code that I was writing was still going to have enduring value.

As I was taking a second pass through the code I had written, some of it was already out of date. And so really continuously updating and improving, pulling in the latest models and upgrading to the latest APIs, just keeping up with that underlying change. Anyone in the industry is feeling that the pace of change is increasing over time. The best way that I managed that was constant learning: following closely what was happening and making sure that I was including some of the latest research findings, so that the book would be as current, relevant, and valuable as possible when it went to print.

If you could give one piece of advice to an aspiring author, what would that be?

Do it! I grew up loving books. They really have spoken to me so many times and in so many ways. And I knew that I wanted to write a book. I think many more people out there probably want to write a book than have written a book. So I would just say, you can! And please, even if your book does not do particularly well, there is an audience out there for it. Everyone has a unique perspective and a unique background and something unique to offer, and we all benefit from more of those ideas being put into print and being shared out with the larger world.

I will say, it is more work than I expected. I knew it was going to be a lot, but there are so many drafts you want to go through. It’s easy to write the first draft, but as you spend time with it, it’s very hard to say this is good enough, because nothing is ever perfect. Many of us have a perfectionist streak. We want to make things better. It’s very hard to say, “All right, I’m gonna stop here.” I think if you talk to many other writers, they also know their work is imperfect.

And it takes an interesting discipline to both keep putting in that work to make it as good as you possibly can and also the countervailing discipline to say this is enough, and I’m going to share this with the world and I can go and work on the next thing.

That’s a great message. Both positive and encouraging but also real, right? Just to switch gears to think a little bit more about agentic systems and where we are today: Was there anything you learned or saw or that developed about agentic systems during this process of writing the book that was really surprising or unexpected?

Honestly, it is the pace of improvement in these models. For folks who are not watching the research all that closely, it can just look like one press release after another. And especially for folks who are not based in Seattle or Silicon Valley or the hubs where this is what people are talking about and watching, it can seem like not a lot has changed since ChatGPT came out. [But] if you’re really watching the progress on these models over time, it is really impressive—the shift from supervised fine-tuning and reinforcement learning with human feedback over to reinforcement learning with verifiable rewards, and the shift to these reasoning models and recognizing that reasoning is scaling and that we need more environments and more high-quality graders. And as we keep building those out and training bigger models for longer, we’re seeing better performance over time and we can then distill that incredible performance out to smaller models. So the expectations are inflating really quickly. 

I think what’s happening is we’re judging each release against these very high expectations. And so sometimes people are disappointed with any individual release, but what we’re missing is this exponential compounding of performance that’s happening over time, where if you look back over three and six and nine and 12 months, we are seeing things change in really incredible ways. And I’d especially point to the coding models, led by Anthropic’s Claude, though Codex and Gemini are also really good. And even among the very best developers, the percentage of code that they are writing by hand is going down over time. It’s not that their skill or expertise is less required. It’s just that it is required to fix fewer and fewer things. This means that teams can move much, much faster and build in much more efficient ways. I think we’ve seen such progress on the models and software because we have so much training data and we can build such clear verifiers and graders. And so you can just keep tuning those models on that forever.

What we’re seeing now is an extension out to additional problems in healthcare, in law, in biology, in physics. And it takes a real investment to build those additional verifiers and graders and training data. But I think we’re going to continue to see some really impressive breakthroughs across a range of different sectors. And that’s very exciting—it’s really going to transform a number of industries.

You’ve touched on others’ expectations a little bit. You speak a lot at events and give talks and so on, and you’re out there in the world learning about what people think or assume about agentic systems. Are there any common misconceptions that you’ve come across? How do you respond to or address them?

So many misconceptions. Maybe the most fundamental one is that I do see some slightly delusional thinking about considering [LLMs] to be like people. Software engineers tend to think in terms of incremental progress; we want to look for a number that we can optimize and we make it better, and that’s really how we’ve gotten here. 

One wonderful way I’ve heard [it described] is that these are thinking rocks. We are still multiplying matrices and predicting tokens. And I would just encourage folks to focus on specific problems and see how well the models work. And it will work for some things and not for others. And there’s a range of techniques that you can use to improve it, but to just take a very skeptical and empirical and pragmatic approach and use the technology and tools that we have to solve problems that people care about. 

I see a fair bit of leaping to, “Can we just have an agent diagnose all of the problems on your computer for you? Can we just get an agent to do that type of thinking?” And maybe in the distant future that will be great. But really the field is driven by smart people working hard to move the numbers just a couple points at a time, and that compounds. And so I would just encourage people to think about these as very powerful and useful tools, but fundamentally they are models that predict tokens and we can use them to solve problems, and to really think about it in that pragmatic way.

What do you see as some of the most significant current trends in the field, or even challenges?

One of the biggest open questions right now is just how much big research labs training big expensive frontier models will be able to solve these big problems in generalizable ways as opposed to this countervailing trend of more teams doing fine-tuning. Both are really powerful and effective. 

Looking back over the last 12 months, the improvements in the small models have been really staggering. Three-billion-parameter models are getting very close to what 500-billion- and trillion-parameter models were doing not that many months ago. So when you have these smaller models, it’s much more feasible for ordinary startups and Fortune 500s and potentially even small and medium-sized businesses to take some of their data and fine-tune a model to better understand their domain, their context, how that business operates. . .

That’s something that’s really valuable to many teams: to own the training pipeline and be able to customize their models, and potentially customize the agents that they build on top of that, and really drive those closed learning feedback loops. So you have the agent solve a task, you collect the data from it, you grade it, and you can fine-tune the model on that. Mira Murati’s Thinking Machines is really targeting this, thinking that fine-tuning is the future. That’s a promising direction.

But what we’ve also seen is that big models can generalize. The big research labs—OpenAI and xAI and Anthropic and Google—are certainly investing heavily in a large number of training environments and a large number of graders, and they are getting better at a broad range of tasks over time. [It’s an open question] just how much those big models will continue to improve and whether they’ll get good enough fast enough for every company. Of course, the labs will say, “Use the models by API. Just trust that they’ll get better over time and just cut us large checks for all of your use cases over time.” So, as has always been the case, if you’re a smaller company with less traffic, go and use the big providers. But if you’re someone like a Perplexity or a Cursor that has a tremendous amount of volume, it’s probably going to make sense to own your own model. The cost per inference of ownership is going to be much lower.

What I suspect is that the threshold will come down over time—that it will also make sense for medium-sized tech companies, and maybe for the Fortune 500 in various use cases, and increasingly for small and medium-sized businesses, to have their own models. The healthy tension and competition between the big labs and the tooling that lets small companies own and customize their own models is going to be really interesting to watch over time, especially as the core base small models keep getting better and give you a better foundation to start from. And companies do love owning their own data and using those training ecosystems to provide a sort of differentiated intelligence and differentiated value.

You’ve talked a bit before about keeping up with all of these technological changes that are happening so quickly. In relation to that, I wanted to ask: How do you stay updated? You mentioned reading papers, but what resources do you find useful personally? Just so everyone out there can learn more about your process.

Yeah. One of them is just going straight to Google Scholar and arXiv. I have a couple key topics that are very interesting to me, and I search those regularly. 

LinkedIn is also fantastic. It is just fun to get connected to more people in the industry and watch the work that they’re sharing and publishing. I just find that smart people share very smart things on LinkedIn—it’s just an incredible feed of information. And then for all its pros and cons, X remains a really high-quality resource. It’s where so many researchers are, and there are great conversations happening there. So I love those as sort of my main feeds.

To close, would you like to talk about anything interesting that you’re working on now?

I recently was part of a team that launched something that we call Autotune. Microsoft just launched pilot agents: a way you can design and configure an agent to go and automate your incident investigation and your threat hunting, and help you protect your organization more easily and more safely. As part of this, we just shipped a new feature called Autotune, which will help you design and configure your agent automatically. And it can also then take feedback from how that agent is performing in your environment and update it over time. And we’re going to continue to build on that.

There are some exciting new directions we’re going where we think we might be able to make this technology available to more people. So stay tuned for that. And then we’re pushing an additional level of intelligence that combines Bayesian hyperparameter tuning with prompt optimization, which can help with automated model selection and help configure and improve your agent as it operates in production in real time. We think this type of self-learning is going to be really valuable and is going to help more teams get more value from the agents that they are designing and shipping.

That sounds great! Thank you, Michael.




Building AI Agents in Kotlin – Part 3: Under Observation


Previously in this series:

  1. Building AI Agents in Kotlin – Part 1: A Minimal Coding Agent
  2. Building AI Agents in Kotlin – Part 2: A Deeper Dive Into Tools

Two articles in, and our coding agent can already do quite a bit. It can explore projects, read and write code, execute shell commands, and run tests. Adding a definition of done (DoD) in our last article gave it the feedback loop it needed – the agent now iterates until all tests pass, not until it decides it’s done.

We should be celebrating, right? Well, yes and no.

As the agent gets more capable, debugging becomes more challenging. Each tool adds surface area. The DoD loop adds more calls and tokens. Evaluation runs can take hours, and when something fails, it is often unclear what failed or where the issue started.

Our agent can solve tasks now. Too bad we can’t see how.

This becomes a problem when you want to improve the agent, debug failures, or estimate costs. In this article, we work through this visibility gap. We start with the obvious options (they don’t quite work), review common observability tools for LLM apps, and integrate tracing with Langfuse. The result is a step-by-step view of the agent’s actions, including cost per action. Along the way, we uncover behavioral patterns and even find a bug we didn’t know existed.

Let’s start by understanding what we’re missing.

Your agent story

When your agent completes a coding task, it produces a chain of decisions and actions: Read this file, analyze that function, modify this code, run those tests. This chain is the agent’s trajectory, and it tells the story of how the agent works.

Trajectories matter because they reveal where the agent wastes time and where it goes off track. They show you when the agent reads the same file 47 times, or why it rewrites half your codebase when you asked for a simple bug fix. In short, you can’t improve what you can’t observe.

Up to now, we’ve been getting by with this simple setup:

handleEvents {  
    onToolCallStarting { ctx ->  
        println("Tool '${ctx.tool.name}' called with args: ${ctx.toolArgs.toString().take(100)}")  
    }  
}

This was fine for the early steps. It at least showed activity. But we’ve outgrown it. Without full parameters and observations, we can’t understand agent behavior in enough detail. And if we print everything, the console turns into noise: object dumps, file contents, and error messages all mixed together.

We need something better.

“Just turn on logging”

Let’s try the obvious solution first. Koog has built-in debug logging, so we switch Logback from ERROR to DEBUG and run it again.
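The switch itself is just raising the root logger level in logback.xml. As a rough sketch, you can also flip it programmatically at startup (assuming Logback is the SLF4J backend on your classpath):

import ch.qos.logback.classic.Level
import ch.qos.logback.classic.Logger
import org.slf4j.LoggerFactory

// Raise the root logger from ERROR to DEBUG before the agent runs.
// The cast only works when Logback is the actual logging backend.
fun enableDebugLogging() {
    val root = LoggerFactory.getLogger(org.slf4j.Logger.ROOT_LOGGER_NAME) as Logger
    root.level = Level.DEBUG
}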

Three minutes and thousands of lines later: Informative? Yes. Convenient? No!

Verbose logs have their place. In production, they help with postmortems. They also help when you want to extract insights and analyze behavior patterns across many runs. But when you’re debugging a single run and trying to understand why your agent is stuck in a loop, you need something built for humans, not statistical analysis.

The money trail

While those logs aren’t ideal for debugging, they did remind us about something else we should be tracking: usage statistics. OpenAI’s dashboard provides detailed breakdowns of API consumption per key. It answers one question quickly: Are we burning through our budget?

But it still doesn’t answer the question that matters during development: What does one run cost?

Okay, $25 in one week. Useful, but not the complete picture. The OpenAI dashboard does what it’s designed to do: track organization-level API usage. But for agent development, we need run-level insights. We can see the total, but not which tasks cost $0.50 and which cost $5.00.

Now add more moving parts. If we split work across sub-agents and use different providers for specific steps (to be discussed in upcoming articles), we end up with multiple dashboards and no single view of the true cost of a run. And if several people share the same agent setup, it gets even harder to see who spent what and where.

The API key approach works for one agent, one provider, and limited runs. It’s a start, but it doesn’t scale. We need observability.

The four-line integration

We’re not the first to encounter this issue. Over the last few years, an ecosystem of observability tools has formed around LLM apps and agents. There are proprietary tools such as Weights & Biases Weave and LangSmith, as well as open-source options like Langfuse, Opik, Arize Phoenix, OpenLLMetry, Helicone, OpenLIT, and Lunary. Some are cloud-based, others support self-hosting, and some offer both. Each comes with its own strengths and trade-offs.

After evaluating several options, our team chose Langfuse. The decision came down to practical factors: It’s open source with a self-hosted option (so traces stay under your control), it offers a free cloud tier for getting started, the UI makes it easier to inspect traces, and the team is responsive when questions come up.

The integration itself is straightforward. Koog needs four lines to connect to Langfuse:

) {
+   install(OpenTelemetry) {  
+       setVerbose(true) 
+       addLangfuseExporter()  
+   }
    handleEvents {  
        // existing handlers remain
    }
}

That’s it.

A quick note about setVerbose(true): by default, Koog sends only telemetry metadata, and prompts and responses are replaced with placeholders, which makes sense when traces can include customer data. Full details are sent only if you enable verbose mode. During agent development, you often need full visibility, and verbose mode enables that.

Setting up Langfuse takes about five minutes. For this article, we’re using their free cloud instance, but you can also run the full stack locally in Docker.

  1. Create an account at cloud.langfuse.com.
  2. Create an organization.
  3. Create a project.
  4. Click Create API Key.

You’ll get three values:

export LANGFUSE_HOST="https://cloud.langfuse.com"
export LANGFUSE_PUBLIC_KEY="<your-public-key>"
export LANGFUSE_SECRET_KEY="<your-secret-key>"

Koog reads these values from the execution environment. If you need to pass them in code, you can pass them directly to addLangfuseExporter (see the sketch after the full example below). Here’s our complete agent with observability enabled:

val executor = simpleOpenAIExecutor(System.getenv("OPENAI_API_KEY"))
val agent = AIAgent(
    promptExecutor = executor,
    llmModel = OpenAIModels.Chat.GPT5Codex,
    toolRegistry = ToolRegistry {
        tool(ListDirectoryTool(JVMFileSystemProvider.ReadOnly))
        tool(ReadFileTool(JVMFileSystemProvider.ReadOnly))
        tool(EditFileTool(JVMFileSystemProvider.ReadWrite))
        tool(ExecuteShellCommandTool(JvmShellCommandExecutor(), PrintShellCommandConfirmationHandler()))
    },
    systemPrompt = """
        You are a highly skilled programmer tasked with updating the provided codebase according to the given task.
        Your goal is to deliver production-ready code changes that integrate seamlessly with the existing codebase and solve given task.
        Ensure minimal possible changes done - that guarantees minimal impact on existing functionality.
        
        You have shell access to execute commands and run tests.
        After investigation, define expected behavior with test scripts, then iterate on your implementation until the tests pass.
        Verify your changes don't break existing functionality through regression testing, but prefer running targeted tests over full test suites.
        Note: the codebase may be fully configured or freshly cloned with no dependencies installed - handle any necessary setup steps.
        """.trimIndent(),
    strategy = singleRunStrategy(ToolCalls.SEQUENTIAL),
    maxIterations = 400
) {
    install(OpenTelemetry) {
        setVerbose(true) // Send full strings instead of HIDDEN placeholders
        addLangfuseExporter()
    }
    handleEvents {
        onToolCallStarting { ctx ->
            println("Tool '${ctx.tool.name}' called with args: ${ctx.toolArgs.toString().take(100)}")
        }
    }
}
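
As mentioned above, if you’d rather not rely on environment variables, the Langfuse credentials can also be passed to addLangfuseExporter directly. A minimal sketch, with parameter names taken from the Koog documentation at the time of writing (verify them against your Koog version):

install(OpenTelemetry) {
    setVerbose(true)
    addLangfuseExporter(
        langfuseUrl = "https://cloud.langfuse.com",  // or your self-hosted instance
        langfusePublicKey = "<your-public-key>",
        langfuseSecretKey = "<your-secret-key>"
    )
}

Prefer the environment variables for anything you commit or share, so the keys never end up in source control.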

The first trace

Let’s start with a simple task: Find the main() function in your project. Run it, wait for it to finish, and then open the Tracing tab in your Langfuse project.

Immediately, the table displays something useful: the Total Cost column. Koog reports token counts, Langfuse applies pricing, and you can see that finding main() costs $0.016.

Open the trace to see the full run.

The left-hand panel shows your agent’s trajectory: messages, tool calls, and observations in order, with indentation indicating the call hierarchy. The right-hand panel displays details for each step: prompts, responses, tool parameters, and per-span cost breakdown.

At the bottom, you’ll also see an execution graph that visualizes the flow. For more details on Koog’s graph-based strategies, check out the documentation. You can also read a deep dive by the Koog tech lead: Mixing the Secret AI Sauce: How to Design a Flexible, Graph-Based Strategy in Koog.

Discovering issues through observability

While preparing examples for this article, we noticed a failed tool call highlighted in red.

The observation was:

endLine=400 must be <= lineCount=394 or -1

The agent requested lines 0–400, but the file had only 394 lines. This points to a limitation in our tool implementation. Instead of failing, it should clamp the range and return the available lines (0–394), allowing the agent to continue. Without traces, this kind of issue often gets buried in logs. With traces, you see it in the exact run, at the exact step, and with the exact input.
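
The fix is small. As a sketch of the clamping idea (a hypothetical helper, not Koog’s actual ReadFileTool code):

// Clamp a requested line range to what the file actually contains instead of
// throwing when endLine exceeds lineCount; -1 means "read to the end".
fun clampLineRange(startLine: Int, endLine: Int, lineCount: Int): IntRange {
    val start = startLine.coerceIn(0, lineCount)
    val end = if (endLine == -1) lineCount else endLine.coerceIn(start, lineCount)
    return start until end
}

fun readRequestedLines(lines: List<String>, startLine: Int, endLine: Int): List<String> =
    lines.slice(clampLineRange(startLine, endLine, lines.size))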

That’s why observability matters. It shows not just that something failed, but also how and why, which makes fixes much easier.

Working with evaluation runs

Single traces are great for debugging, but evaluation runs need a batch view. 

When you run your agent across multiple SWE-bench Verified tasks (a standard benchmark for coding agents), you want grouped traces and aggregated cost.

Langfuse supports this through sessions. Add a session ID as a trace attribute:

install(OpenTelemetry) {
    setVerbose(true)
    addLangfuseExporter(  
        traceAttributes = listOf(  
            CustomAttribute("langfuse.session.id", "eval-run-1"),  
        )  
    )
}

With this configuration, traces from the evaluation share a session ID. In Langfuse, navigate to the Sessions tab to see aggregate duration and total cost.

In our run, we executed 50 SWE-bench Verified tasks (out of 500). With ten parallel instances, the run took 30 minutes and cost $66. This gives a baseline for future experiments: success rate, runtime, and cost.

Langfuse doesn’t show success scores by default. That’s expected. Tracing records what happened, not whether a run passed. If you want success metrics in Langfuse, your evaluation harness needs to score each attempt and report custom scores. It’s worth implementing if you have the supporting infrastructure.
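
For example, after grading an attempt, a harness could post a pass/fail score against the corresponding trace. A rough sketch, assuming Langfuse’s public scores endpoint and Basic auth with your key pair (check the Langfuse API reference), and assuming your harness already knows the trace ID it wants to score:

import java.net.URI
import java.net.http.HttpClient
import java.net.http.HttpRequest
import java.net.http.HttpResponse
import java.util.Base64

// Hypothetical helper: attach a numeric score (e.g. 1.0 = passed, 0.0 = failed) to a trace.
fun reportScore(traceId: String, name: String, value: Double) {
    val host = System.getenv("LANGFUSE_HOST")
    val credentials = "${System.getenv("LANGFUSE_PUBLIC_KEY")}:${System.getenv("LANGFUSE_SECRET_KEY")}"
    val auth = Base64.getEncoder().encodeToString(credentials.toByteArray())
    val body = """{"traceId":"$traceId","name":"$name","value":$value}"""
    val request = HttpRequest.newBuilder(URI.create("$host/api/public/scores"))
        .header("Authorization", "Basic $auth")
        .header("Content-Type", "application/json")
        .POST(HttpRequest.BodyPublishers.ofString(body))
        .build()
    HttpClient.newHttpClient().send(request, HttpResponse.BodyHandlers.ofString())
}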

Looking ahead

Observability isn’t just about debugging. It’s also how you learn about the agent’s behavior. 

With a few lines of code, we turned the agent from a black box into something you can inspect. We caught that line-range error that would likely have gone unnoticed. We learned that our 50-task evaluation costs $66. And we can now see which tasks consume tokens and which run efficiently.

In the next article, we’ll introduce a sub-agent pattern: Delegate specific tasks to smaller, cheaper models. With traces in place, you can decide what to delegate based on what the agent does in each run. No more guessing.

Resources


Thank you for reading! I’d be happy to hear about your own approaches to trace analysis – what patterns have you spotted in your agents’ behavior? What surprised you? Feel free to share your experience in the comments!


122: Eyes Off the Road: Rivian’s Bold Autonomy Reveal at AI Day

In this episode:
  • Rivian promises eyes-off autonomy and more in the not-distant future
  • The Fiat Topolino, weirdly, is coming to the US
  • The all-electric Subaru Uncharted’s new US price tag
    …and of course, much, much more







Download audio: https://dts.podtrac.com/redirect.mp3/audioboom.com/posts/8818313.mp3?modified=1765557252&sid=5141110&source=rss

Open Source C# Command-line tools for @StJudePlayLive

From: Fritz's Tech Tips and Chatter
Duration: 0:00
Views: 61

Made with Restream. Livestream on 30+ platforms at once via https://restream.io

We're working on CodeMedic, our open-source command-line tool for .NET developers. More at https://github.com/FritzAndFriends/CodeMedic


Australia Kicks Kids Off Social Media + Is the A.I. Water Issue Fake? + Hard Fork Wrapped

“I’m told that Australian teens, in preparation for this ban, have been exchanging phone numbers with each other.”

The Agile Organization as a Learning System With Tom Gilb and Simon Holzapfel


BONUS: The Agile Organization as a Learning System

Think Like a Farmer, Not a Factory Manager

"Go slow to go fast. If you want to go somewhere, go together as a team. Take a farmer's mentality."

 

Simon contrasts monoculture industrial thinking with the permaculture approach of Joel Salatin. Industrial approaches optimize for short-term efficiency but create fragile systems. Farmer thinking recognizes that healthy ecosystems require patience, diversity, and nurturing conditions for growth. The nervous system that's constantly stressed never builds much over time—think of the body, trust the body, let the body be a body.

Value Masters, Not Scrum Masters

"We need value masters, not Scrum Masters. Agile is a useful tool for delivering value, but value itself is primary. Everything else is secondary—Agile included."

 

Tom makes his most provocative point: if you asked a top manager whether they'd prefer an agile person or value delivery, the answer is obvious. Agile is one tactic among many for delivering value—not even a necessary one. The shift required is from process mastery to value mastery, from Scrum Masters to people who understand and can deliver on critical stakeholder values.

The DOVE Manifesto

"I wrote a paper called DOVE—Deliver Optimum Values Efficiently. It's the manifesto focusing on delivering value, delivering value, delivering value."

 

Tom offers his alternative to the Agile Manifesto: a set of principles laser-focused on value delivery. The document includes 10 principles on a single page that can guide any organization toward genuine impact. Everything else—processes, frameworks, methodologies—are secondary tools in service of this primary goal.

Read Tom's DOVE manifesto here

Building the Glue Between Social and Physical Technology

"Value is created in interactions. That's where the social and physical technology meet—that joyous boundary where stuff gets done."

 

Simon describes seeing the world through two lenses: physical technology (visible tools and systems) and social technology (culture, relationships, the air we breathe). Eric Beinhocker’s insight is that progress happens at the intersection. The Gilbian learning loops provide the structure; trust and human connection provide the fuel. Together, they create organizations that can actually learn and adapt.

 

Further Reading To Support Your Learning Journey

Resources & Further Reading

Explore these curated resources to deepen your understanding of strategic planning, value-based management, and transformative organizational change.

 


 

📚 Essential Reading

Competitive Engineering 

Tom Gilb's seminal book on requirements engineering and value-based development approaches.

What is Wrong with OKRs (Paper by Tom Gilb)
A critical analysis of the popular OKR framework and its limitations in measuring real value.

DOVE Manifesto by Tom Gilb
Detailed exploration of the DOVE (Deliver Optimum Values Efficiently) methodology for quantifying and optimizing stakeholder value.

 


 

🎓 Learning Materials

Tom Gilb's Strategy Ringbook
A comprehensive collection of strategic planning principles and practical frameworks.

Tom Gilb's Video at the Strategy Meetup
Watch Tom Gilb discuss key strategic concepts and answer questions from the community.

Design Process Paper by Tom Gilb
An in-depth look at value-driven design processes and their practical application.

Esko Kilpi's Work on Conversations
Exploring how organizational conversations shape thinking, decision-making, and change.

 


 

🧭 Frameworks & Models

OODA Loop
The Observe-Orient-Decide-Act decision cycle for rapid strategic thinking and adaptation.

 


 

🎯 Practical Tips

Measurement of Increased Value

Focus on tracking actual value delivery rather than activity completion. Establish baseline measurements and regularly assess improvements in stakeholder-defined value dimensions.

Quantify Critical Values

Identify the 3-5 most important value attributes for your stakeholders. Make these concrete and measurable, avoiding vague qualities in favor of specific, quantifiable metrics.

Measurement vs Testing Process

Understand the distinction: measurement tells you how much value exists, while testing validates whether something works. Use both strategically—test hypotheses early, then measure outcomes continuously.

 


 

🔗 Related Profiles

Todd Covert - Montessori School of the Berkshires
Educational leadership and innovative approaches to value-based learning environments.

 

About Tom Gilb and Simon Holzapfel

 

Tom Gilb, born in the US, lived in London, and then moved to Norway in 1958. An independent teacher, consultant, and writer, he has worked in software engineering, corporate top management, and large-scale systems engineering. As the saying goes, Tom was writing about Agile before Agile was named. In 1976, Tom introduced the term "evolutionary" in his book Software Metrics, advocating for development in small, measurable steps. Today, we talk about Evo, the name Tom uses to describe his approach. Tom has worked with Dr. Deming and holds a certificate personally signed by him.

You can listen to Tom Gilb's previous episodes here

 

You can link with Tom Gilb on LinkedIn 

 

Simon Holzapfel is an educator, coach, and learning innovator who helps teams work with greater clarity, speed, and purpose. He specializes in separating strategy from tactics, enabling short-cycle decision-making and higher-value workflows. Simon has spent his career coaching individuals and teams to achieve performance with deeper meaning and joy. Simon is also the author of the Equonomist newsletter on Substack.

And you can listen to Simon's previous episodes on the podcast here

 

You can link with Simon Holzapfel on LinkedIn.

 





Download audio: https://traffic.libsyn.com/secure/scrummastertoolbox/20251212_Simon_Tom_F.mp3?dest-id=246429