
How to Build AI Apps in the Browser with TensorFlow.js and WebGPU


Most developers think of AI the same way: you send data to a server, the server thinks, you get a response back. That mental model made sense for a long time. It still makes sense for a lot of use cases.

But there’s a quiet shift happening inside the browser environment that a lot of engineers are completely missing out on.

The modern browser isn’t just a glorified engine for rendering HTML and CSS anymore. It’s turning into a full-blown runtime for local intelligence. We’ve reached a point where you can ship raw machine learning models straight to a user's device and run inference completely client-side. No server trips, no API keys to protect, and once those initial assets load, zero dependency on an internet connection.

This is the reality of Web AI. If you're building for the web today, understanding this paradigm shift is easily one of the most valuable skills you can add to your stack.

In this guide, we’re going to pull back the curtain on how Web AI actually operates under the hood, break down the browser technology stack making it possible, and build a real, working image classifier using Teachable Machine and TensorFlow.js. Along the way, we’ll also set up a live benchmark so you can watch exactly how WebGL and WebGPU stack up against each other in real-time execution speeds.

Prerequisites

To follow along with this tutorial, you should have:

  • A working knowledge of JavaScript

  • Basic familiarity with HTML and how the browser works

  • Google Chrome installed (required for WebGPU support and Chrome's built-in AI APIs)

  • A code editor like VS Code with the Live Server extension installed (recommended for running the demo locally)

No prior machine learning experience is required.


What is Web AI?

Instead of sending data off to a distant cloud server, Web AI lets you run machine learning models directly on the user’s device inside their browser. It uses standard web tech like JavaScript, WebAssembly, and WebGPU to handle all the heavy lifting right then and there.

The simplest definition: intelligence that runs in the browser, without sending your data anywhere.

Most of us already interact with on-device AI every day without realizing it. Think about unlocking an iPhone. The second you lift it, Face ID maps out roughly 30,000 infrared points, feeds that data through a neural network living on Apple's local silicon, matches it against an encrypted embedding, and opens the phone. The whole process takes milliseconds and happens entirely offline.

Browser-based AI works on that exact same core architecture. The only real difference is that we're building on top of shared web standards rather than native hardware APIs. When you spin up a face-tracking model using TensorFlow.js or MediaPipe in Chrome, you're running that exact same pipeline:

Camera input → Local ML model → Local decision

No round trip. No server. The browser is your Neural Engine.

Browser AI vs Cloud AI

There’s no right or wrong answer here. It just depends on what you’re trying to build. Both approaches have their pros and cons, so it’s just a matter of picking the tool that fits your specific use case.

| | Browser AI (Client-Side) | Cloud AI (Server-Side) |
|---|---|---|
| Internet required | No (after the initial load) | Yes |
| Latency | Near-zero | Depends on network |
| Privacy | Data stays on device | Data leaves the device |
| Model size | Small to medium | As large as you need |
| Cost at inference time | Free | Per token or per request |

Use browser AI when:

  • You need split-second speed for things like tracking gestures or detecting objects live on a webcam

  • The app has to work offline (whether it's a PWA or just needs to survive spotty internet)

  • Privacy is a hard requirement to keep sensitive data like medical inputs, biometrics, or financial information strictly local

  • You want to reduce or eliminate API costs on high-frequency, lightweight predictions

Use cloud AI when:

  • You need large models like GPT-4, Gemini Pro, or Stable Diffusion

  • You need centralized model updates, A/B testing, or user analytics

  • You require serious GPU or TPU compute power

Most production systems actually use a mix of both. Take Google Photos: it handles face detection right on your device so it’s fast and private, but leaves the heavier categorization work for the cloud. Or think of a modern web app that might use TensorFlow.js locally to classify images instantly, but calls the Gemini API when it needs deeper language processing.

This hybrid setup, keeping lightweight intelligence at the edge and heavy compute in the cloud, is usually the sweet spot for most apps.
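One possible way to express a hybrid setup in code is a small routing function that prefers the on-device model and only calls the cloud when the local answer isn't confident enough. This is a sketch of one variation, not a prescribed design: `classifyLocally`, `classifyInCloud`, and the 0.8 threshold are hypothetical placeholders you would replace with your own model call, API call, and tuning.

```javascript
// Hybrid routing sketch: try the on-device model first and fall back to the
// cloud only when the local result is not confident enough.
// classifyLocally and classifyInCloud are hypothetical functions you supply;
// each is assumed to resolve to an object like { label, confidence }.
async function classifyHybrid(input, { classifyLocally, classifyInCloud, minConfidence = 0.8 }) {
  const local = await classifyLocally(input);
  if (local.confidence >= minConfidence) {
    return { ...local, source: 'browser' };
  }
  const remote = await classifyInCloud(input);
  return { ...remote, source: 'cloud' };
}
```

The cloud function is never invoked for confident local results, which is where the cost and latency savings would come from.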

The Technology Stack

Browser AI isn’t just a single tool – it’s a stacked layer of technologies. Knowing how these layers fit together makes it a lot easier to choose your setup and navigate the trade-offs.

Tensors

Before jumping into any ML framework, you need to understand tensors. Not deeply, just enough of a handle on them so you don't get blindsided by tensor shape errors, because they will happen and they can be tricky to debug.

Think of a tensor as a multi-dimensional grid of numbers. Whether your model is processing images, audio, or text, everything gets converted into this format first. Models only speak numbers, and tensors are the containers that hold them.

A single number       → 0D tensor (scalar):  42
A list of numbers     → 1D tensor (vector):  [0.2, 0.8, 0.5]
A table of numbers    → 2D tensor (matrix):  [[1,2,3],[4,5,6]]
An image              → 3D tensor:           shape [224, 224, 3]
A batch of images     → 4D tensor:           shape [32, 224, 224, 3]

Models accept inputs in specific shapes. If your tensor shape doesn't match the model's expected input, your code breaks. That's why understanding dimensions is practical, not just theoretical.
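To make the shape idea concrete, here is a small plain-JavaScript sketch that works out the shape of a nested array and compares it with an expected input shape. The helper names `inferShape` and `shapeMatches` are made up for this illustration and are not part of TensorFlow.js; the sketch also assumes the nested arrays are regular (every row has the same length).

```javascript
// Infer the shape of a regular (non-ragged) nested array,
// e.g. [[1, 2, 3], [4, 5, 6]] has shape [2, 3]. A plain number has shape [].
function inferShape(value) {
  const shape = [];
  let current = value;
  while (Array.isArray(current)) {
    shape.push(current.length);
    current = current[0];
  }
  return shape;
}

// Compare an actual shape with an expected one. A null entry in `expected`
// means "any size", which is handy for a batch dimension.
function shapeMatches(actual, expected) {
  return actual.length === expected.length &&
    expected.every((dim, i) => dim === null || dim === actual[i]);
}

const table = [[1, 2, 3], [4, 5, 6]];
console.log(inferShape(table));                                     // [ 2, 3 ]
console.log(shapeMatches([32, 224, 224, 3], [null, 224, 224, 3]));  // true
console.log(shapeMatches([224, 224, 3], [null, 224, 224, 3]));      // false
```

The last call fails because a single image is a 3D tensor while the expected input is a 4D batch, which is one of the most common shape mistakes.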

TensorFlow is literally named after this concept. Tensor + Flow = tensors flowing through neural networks.

Here's how you create tensors in TensorFlow.js:

// 1D tensor — a list of values
const scores = tf.tensor([0.1, 0.7, 0.2]);

// 3D tensor — a single image (height x width x RGB channels)
const image = tf.tensor([
  [[255, 0, 0], [0, 255, 0]],
  [[0, 0, 255], [255, 255, 0]]
]);

// 4D tensor — a batch of 32 images
const batch = tf.zeros([32, 224, 224, 3]);

TensorFlow.js

TensorFlow.js is Google's JavaScript version of TensorFlow. It lets you run pre-trained models right in the browser and, if you really want to, train new ones completely client-side.

The most important concept in TensorFlow.js is the backend, the hardware your model actually runs on. You can switch between backends depending on what the user's device supports, and it makes a significant difference to performance.

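// Pick ONE backend; each call replaces the previous choice.
// 'webgpu' and 'wasm' need their backend packages loaded first
// (@tensorflow/tfjs-backend-webgpu and @tensorflow/tfjs-backend-wasm).
await tf.setBackend('webgpu');     // fastest: true GPU compute
// await tf.setBackend('webgl');   // very fast: GPU via graphics shaders
// await tf.setBackend('wasm');    // fast: near-native CPU speed
// await tf.setBackend('cpu');     // slowest: plain JavaScript on CPU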

await tf.ready();
console.log('Running on:', tf.getBackend());

In practice, you want to try the fastest available backend and fall back gracefully if a user's browser doesn't support it:

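const backends = ['webgpu', 'webgl', 'wasm', 'cpu'];

for (const backend of backends) {
  try {
    // setBackend resolves to false when the backend fails to initialize,
    // and can throw if the backend was never registered
    const ok = await tf.setBackend(backend);
    if (!ok) continue;
    await tf.ready();
    console.log('Using backend:', backend);
    break;
  } catch {
    continue;
  }
}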

WebAssembly

WebAssembly (WASM) basically lets code written in C++ or Rust run inside the browser at near-native speeds. When it comes to AI, this is a big deal because heavy math operations like tensor calculations, data preprocessing, and running compressed models happen way faster in WASM than they ever could in standard JavaScript.

Under the hood, TensorFlow.js's WASM backend is using a compiled C++ runtime. If you're running compressed models on a device's CPU, switching to the WASM backend can make your app anywhere from 2 to 10 times faster than just sticking with regular JavaScript.

await tf.setBackend('wasm');
await tf.ready();

WebGL and WebGPU

This is where browser AI performance gets interesting.

WebGL was originally built for 3D graphics. But developers discovered that the parallel computation that GPUs use for rendering is exactly the kind of parallel computation neural networks need.

TensorFlow.js's WebGL backend encodes tensor operations as graphics shader programs and runs them on the GPU. It works well, but it's a workaround, as WebGL was never designed for this kind of work.

WebGPU is what was actually designed for the job. It launched in Chrome back in April 2023 after six years of collaboration between Apple, Google, Mozilla, Intel, and Microsoft.

Instead of just handling graphics, it's a modern API built from the ground up for general-purpose computing. When it comes to running AI models, it can be 2 to 3 times faster than WebGL, which means you can actually run significantly larger models right in the browser.

Here's how to check for WebGPU support and use it:

if ('gpu' in navigator) {
  console.log('WebGPU is supported');
  await tf.setBackend('webgpu');
} else {
  console.warn('WebGPU not available, falling back to WebGL');
  await tf.setBackend('webgl');
}

await tf.ready();
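The `'gpu' in navigator` check only tells you the API exists. The browser can still fail to provide a usable GPU adapter, in which case `navigator.gpu.requestAdapter()` resolves to `null`. A slightly more defensive sketch is shown below. The function name `pickGpuBackend` is made up for this example, and the navigator object is passed in as a parameter purely so the logic can be exercised outside a browser.

```javascript
// Decide between 'webgpu' and 'webgl' by actually requesting an adapter.
// `nav` is expected to be the browser's `navigator` object.
async function pickGpuBackend(nav) {
  if (nav && 'gpu' in nav && nav.gpu) {
    try {
      const adapter = await nav.gpu.requestAdapter();
      if (adapter) return 'webgpu';
    } catch (err) {
      // Adapter request failed; fall through to WebGL
    }
  }
  return 'webgl';
}

// In the browser (assumes tf and the WebGPU backend script are already loaded):
//   const backend = await pickGpuBackend(navigator);
//   await tf.setBackend(backend);
//   await tf.ready();
```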

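WebGPU is enabled by default in Chrome 113 and later on Windows, macOS, and ChromeOS. On platforms where it is still behind a flag (some Linux setups, for example), you can enable it for development at:

chrome://flags/#enable-unsafe-webgpu → Enable → Restart Chrome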

The performance progression across backends looks like this:

| Backend | What's happening under the hood | Relative speed |
|---|---|---|
| cpu | Plain JavaScript on CPU | Slow |
| wasm | Compiled C++ on CPU | Fast |
| webgl | GPU via graphics shaders | Very fast |
| webgpu | GPU via compute shaders | Fastest |

MediaPipe

MediaPipe is Google's framework for real-time perception tasks like hand tracking, face mesh detection, pose estimation, and object detection. Think of it as plug-and-play AI for anything that involves a camera.

You don't build these models yourself – you just import them and use them. MediaPipe is what actually powers the background blur in Google Meet and the visual filters in YouTube. Under the hood, its web runtime relies on WebAssembly, with GPU acceleration where available, to keep everything moving fast.

You can try all MediaPipe models interactively before writing any code at MediaPipe Studio.

How to Build AI in the Browser

Step 1: Train a Model with Teachable Machine

Teachable Machine is Google's no-code tool for building models. It lets you create custom image, audio, or pose classifiers right from your webcam without needing any machine learning experience. Once you're done, you can export them as TensorFlow.js models that are ready to drop straight into your app.

Here's how to get started:

  1. Go to teachablemachine.withgoogle.com

  2. Choose Image Project, standard image model.

  3. Create two or more classes. "Thumbs Up" and "Thumbs Down" is a simple starting point

  4. Record examples for each class using your webcam

  5. Click Train Model — training happens entirely in your browser

  6. Click Export Model and choose TensorFlow.js


When you export, you get three files:

  • model.json: The model architecture: layers, input/output shapes, and paths to the weights

  • weights.bin: The trained weights stored as binary data

  • metadata.json: Class labels, input size, and inference configuration
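Since metadata.json carries the class labels, you can read them instead of hardcoding the number of classes. The sketch below assumes the exported file contains a `labels` array and a numeric `imageSize` field. Those field names are an assumption based on typical exports, so check your own file. `describeMetadata` is a helper name invented for this example.

```javascript
// Pull the class labels and input size out of a parsed metadata.json object.
// The `labels` and `imageSize` field names are assumptions; verify them
// against your own export before relying on this.
function describeMetadata(metadata) {
  const labels = Array.isArray(metadata.labels) ? metadata.labels : [];
  const imageSize = typeof metadata.imageSize === 'number' ? metadata.imageSize : null;
  return { labels, imageSize, classCount: labels.length };
}

// In the browser, with the file served from the same folder:
//   const metadata = await (await fetch('./metadata.json')).json();
const example = { labels: ['Thumbs Up', 'Thumbs Down'], imageSize: 224 };
console.log(describeMetadata(example));
```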

A note on training data quality

Teachable Machine relies on supervised learning. You give the model labeled examples, and it figures out the underlying patterns. When you're gathering your data, two things matter way more than the sheer number of pictures you take:

  • Balance: If one class has significantly more examples than another, the model will be biased toward it. Keep the data roughly equal across classes.

  • Variety: Fifty photos from different angles, distances, and lighting conditions will easily outperform two hundred near-identical shots from the same spot. The model needs to understand the concept of a "thumbs up", not memorise one specific photo of your specific thumb.
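The balance point is easy to check with a few lines of code before you train. The following is a rough sketch: `findUnderrepresented` and the 0.5 ratio are hypothetical choices for illustration, not a Teachable Machine feature or an established rule.

```javascript
// Flag classes whose example count falls well below the largest class.
// `counts` maps class name -> number of examples.
// `minRatio` is a hypothetical threshold: classes with fewer than
// minRatio * (largest class count) examples are reported.
function findUnderrepresented(counts, minRatio = 0.5) {
  const values = Object.values(counts);
  if (values.length === 0) return [];
  const largest = Math.max(...values);
  if (largest === 0) return [];
  return Object.entries(counts)
    .filter(([, n]) => n / largest < minRatio)
    .map(([name]) => name);
}

console.log(findUnderrepresented({ 'Thumbs Up': 200, 'Thumbs Down': 60 })); // [ 'Thumbs Down' ]
```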

Keep in mind that the actual machine learning model is usually just a tiny fraction of your overall codebase. The vast majority of what you write is going to be standard JavaScript. At the end of the day, it's just another asset in your stack.

Step 2: Setting up and Writing the Code

Now that you have your model files, set up your project structure like this and create an index.html file:

your-project/
├── index.html
├── model.json
├── weights.bin
└── metadata.json

The model.json, weights.bin, and metadata.json files all go in the same folder as your index.html. The demo loads them from the same directory using const URL = "./".

To run it locally, open the folder in VS Code or your preferred IDE and use the Live Server extension. Just right-click index.html and select Open with Live Server. Opening the file directly in the browser without a server will cause CORS errors when loading the model files.

Step 3: Load the Model and Run Predictions

Paste the following in your index.html file. This demo loads your Teachable Machine model, starts your webcam, and runs continuous predictions in a loop:

<!DOCTYPE html>
<html lang="en">

<head>
    <meta charset="UTF-8">
    <meta name="viewport" content="width=device-width, initial-scale=1.0">
    <title>Teachable Machine - Webcam + Backend Switch Demo</title>
    <style>
        body {
            font-family: Arial;
            text-align: center;
            margin: 20px;
        }

        #webcam-container {
            margin-top: 20px;
        }

        #label-container {
            margin-top: 10px;
            font-size: 18px;
            font-weight: bold;
        }

        button.backend-btn {
            margin: 5px;
            padding: 8px 16px;
            font-size: 16px;
            cursor: pointer;
        }

        #status {
            margin-top: 10px;
            font-weight: bold;
            color: #0078ff;
        }

        table {
            margin: 20px auto;
            border-collapse: collapse;
            width: 80%;
            max-width: 600px;
        }

        th,
        td {
            border: 1px solid #ccc;
            padding: 10px;
        }

        th {
            background: #0078ff;
            color: white;
        }
    </style>
</head>

<body>
    <h2>AI in the web Demo</h2>

    <div>
        <button class="backend-btn" onclick="switchBackend('cpu')">CPU</button>
        <button class="backend-btn" onclick="switchBackend('webgl')">WebGL</button>
        <button class="backend-btn" onclick="switchBackend('webgpu')">WebGPU</button>
    </div>

    <p id="status">Click a backend to start</p>

    <table>
        <thead>
            <tr>
                <th>Backend</th>
                <th>Load Time (s)</th>
                <th>Inference Time (ms)</th>
                <th>Status</th>
            </tr>
        </thead>
        <tbody id="results"></tbody>
    </table>

    <div id="webcam-container"></div>
    <div id="label-container"></div>

    <script src="https://cdn.jsdelivr.net/npm/@tensorflow/tfjs@latest/dist/tf.min.js"></script>
    <script src="https://cdn.jsdelivr.net/npm/@tensorflow/tfjs-backend-webgpu"></script>
    <script
        src="https://cdn.jsdelivr.net/npm/@teachablemachine/image@latest/dist/teachablemachine-image.min.js"></script>

    <script>
        const URL = "./";
        const resultsTable = document.getElementById("results");
        const statusEl = document.getElementById("status");
        const backends = ["cpu", "webgl", "webgpu"];

        let model, webcam, maxPredictions;
        const backendResults = {};

        // Initialize webcam
        async function initWebcam() {
            if (!webcam) {
                webcam = new tmImage.Webcam(200, 200, true);
                await webcam.setup();
                await webcam.play();
                document.getElementById("webcam-container").appendChild(webcam.canvas);

                const labelContainer = document.getElementById("label-container");
                labelContainer.innerHTML = "";
                for (let i = 0; i < 2; i++) labelContainer.appendChild(document.createElement("div"));
            }
        }

        async function switchBackend(backend) {
            statusEl.innerText = `Switching to ${backend.toUpperCase()}...`;

            await initWebcam();

            try {
                const startLoad = performance.now();
                await tf.setBackend(backend);
                await tf.ready();
                model = await tmImage.load(URL + "model.json", URL + "metadata.json");
                maxPredictions = model.getTotalClasses();
                const endLoad = performance.now();
                const loadTime = ((endLoad - startLoad) / 1000).toFixed(2);

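                // Warm-up run: the first inference on GPU backends includes one-time
                // setup (such as shader compilation), so it is excluded from the timing
                await model.predict(webcam.canvas);

                // Single inference to measure time
                const startInference = performance.now();
                await model.predict(webcam.canvas);
                const endInference = performance.now();
                const inferenceTime = (endInference - startInference).toFixed(1);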

                // Store results
                backendResults[backend] = { loadTime, inferenceTime };

                updateTable();

                statusEl.innerText = `${backend.toUpperCase()} ready`;
            } catch (err) {
                console.error(`${backend} not supported:`, err);
                statusEl.innerText = `${backend.toUpperCase()} not supported`;
            }
        }


        function updateTable() {
            resultsTable.innerHTML = "";
            for (let backend of backends) {
                const row = document.createElement("tr");
                const backendCell = document.createElement("td");
                const loadCell = document.createElement("td");
                const inferenceCell = document.createElement("td");
                const statusCell = document.createElement("td");

                backendCell.textContent = backend.toUpperCase();

                if (backendResults[backend]) {
                    loadCell.textContent = backendResults[backend].loadTime;
                    inferenceCell.textContent = backendResults[backend].inferenceTime;
                    statusCell.textContent = "✓";
                } else {
                    loadCell.textContent = "-";
                    inferenceCell.textContent = "-";
                    statusCell.textContent = "-";
                }

                row.appendChild(backendCell);
                row.appendChild(loadCell);
                row.appendChild(inferenceCell);
                row.appendChild(statusCell);
                resultsTable.appendChild(row);
            }
        }

        // Continuous prediction loop
        async function loop() {
            if (webcam && model) {
                webcam.update();
                const prediction = await model.predict(webcam.canvas);
                const labelContainer = document.getElementById("label-container");
                labelContainer.innerHTML = "";
                for (let i = 0; i < maxPredictions; i++) {
                    const p = document.createElement("div");
                    p.textContent = `${prediction[i].className}: ${(prediction[i].probability * 100).toFixed(1)}%`;
                    labelContainer.appendChild(p);
                }
            }
            requestAnimationFrame(loop);
        }

        loop();
    </script>
</body>

</html>

A few things worth understanding about what this code is doing:

The switchBackend function does more than just swap the backend. Each time you click a backend button, it records how long the model takes to load on that backend and how long a single inference takes. Those numbers go straight into the comparison table so you can see the difference without having to look at console logs.

The loop function runs continuously using requestAnimationFrame. Every frame, it grabs the current webcam image, passes it to the model, and updates the prediction labels on screen. This is what makes the detection feel real-time.
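In a real app you usually want to act on the most likely class rather than print every probability. Here is a minimal sketch that works on the same `{ className, probability }` objects the demo receives from `model.predict`. The helper name `topPrediction` and the 0.7 threshold are choices made for this example.

```javascript
// Return the most probable class, or null when nothing clears the threshold.
// `predictions` has the shape used in the demo: [{ className, probability }, ...].
function topPrediction(predictions, minProbability = 0.7) {
  let best = null;
  for (const p of predictions) {
    if (best === null || p.probability > best.probability) best = p;
  }
  return best !== null && best.probability >= minProbability ? best : null;
}

// Inside the loop, after `const prediction = await model.predict(webcam.canvas);`:
//   const winner = topPrediction(prediction);
//   if (winner) console.log('Detected:', winner.className);
```

Returning null for low-confidence frames keeps the UI from flickering between classes when the model is unsure.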

Notice that initWebcam only runs once. It checks if webcam already exists before setting up. Switching backends reloads the model but keeps the same webcam stream running.

Open Chrome DevTools and go to the Network tab while the demo runs. After the model files finish loading, you'll see zero outbound requests. Every prediction is happening entirely in the browser.

Step 4: Switch Backends and Compare Performance

Once the demo is running, click each backend button one at a time: CPU, then WebGL, then WebGPU. The table updates after each switch and shows you the load time in seconds and inference time in milliseconds for each backend side by side.

Here's what you should expect to see:

  • CPU will be the slowest with everything running in plain JavaScript

  • WebGL will be noticeably faster as the GPU is now handling the tensor operations

  • WebGPU will typically be the fastest with true GPU compute and less overhead than WebGL

The exact numbers depend on your machine, but the gap between CPU and WebGPU is usually significant enough to see immediately in the table.
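A single timed run can be noisy, so if you want steadier numbers you could time several runs and report the median. The helper below is a sketch of that idea. The name `benchmark` and the default run counts are arbitrary choices for this example, and it relies only on `performance.now()`.

```javascript
// Time an async function: discard a few warm-up runs, then report the
// median (plus min and max) of the timed runs, all in milliseconds.
async function benchmark(fn, { warmupRuns = 3, timedRuns = 20 } = {}) {
  for (let i = 0; i < warmupRuns; i++) await fn();

  const samples = [];
  for (let i = 0; i < timedRuns; i++) {
    const start = performance.now();
    await fn();
    samples.push(performance.now() - start);
  }

  samples.sort((a, b) => a - b);
  const mid = Math.floor(samples.length / 2);
  const median = samples.length % 2
    ? samples[mid]
    : (samples[mid - 1] + samples[mid]) / 2;
  return { median, min: samples[0], max: samples[samples.length - 1], runs: samples.length };
}

// In the demo, after a backend has been selected and the model loaded:
//   const stats = await benchmark(() => model.predict(webcam.canvas));
//   console.log(`median ${stats.median.toFixed(1)} ms over ${stats.runs} runs`);
```

The median is less sensitive than the mean to the occasional slow frame caused by garbage collection or other tabs competing for the GPU.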


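Note: WebGPU is on by default in recent desktop Chrome on Windows, macOS, and ChromeOS. If the WebGPU button shows "not supported", first check that Chrome is up to date. On platforms where WebGPU is still behind a flag, go to chrome://flags/#enable-unsafe-webgpu, enable it, and restart Chrome.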

Chrome's Built-in AI APIs

Beyond loading your own models, Chrome is rolling out native AI capabilities that you can hook into directly through browser APIs. This means no managing bulky model files, no importing TensorFlow.js, and zero manual setup.

The powerhouse here is Gemini Nano, a lightweight version of Google's Gemini model built to run completely on-device inside Chrome. It handles tasks like smart replies and page summarization right in the browser without ever making a cloud call.

If you want to build with it, you can tap into these experimental APIs that Chrome exposes to developers:

chrome://flags → search "Prompt API for Gemini Nano" → Enable → Restart Chrome

These are still experimental and behind flags. But they show clearly where the platform is heading.
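Once the flag is on, calling the model looks roughly like the sketch below. Treat it as an illustration under assumptions: the Prompt API has changed shape across Chrome versions, and this version assumes a global `LanguageModel` object with `availability()` and `create()`, plus sessions with `prompt()` and `destroy()`. Check the current Chrome documentation before relying on any of these names. The wrapper `askOnDevice` is a helper invented for this example.

```javascript
// Ask the on-device model a question through Chrome's Prompt API, if present.
// `languageModel` defaults to the global Chrome exposes when the API is
// enabled; it is a parameter so the function returns null gracefully elsewhere.
async function askOnDevice(question, languageModel = globalThis.LanguageModel) {
  if (!languageModel) return null;                  // API not exposed in this browser

  const availability = await languageModel.availability();
  if (availability === 'unavailable') return null;  // this device cannot run the model

  const session = await languageModel.create();     // may trigger a model download
  try {
    return await session.prompt(question);
  } finally {
    session.destroy();                              // free the session's resources
  }
}

// Usage in a supporting browser:
//   const answer = await askOnDevice('Summarize this page in one sentence.');
//   console.log(answer ?? 'On-device model not available');
```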

For the full prerequisites and setup guide for Chrome's built-in AI, see the official Chrome AI getting started documentation.

Where Web AI Is Headed

The browser is evolving into something that doesn't really have a clean name yet. It's no longer just a document viewer, and it's not quite a native app runtime either. Instead, it's becoming an intelligent edge node – a piece of infrastructure that can perceive, process, and act all on its own, without constantly phoning home for permission.

A few massive shifts are already well underway:

  • Native AI built directly into the platform: AI capabilities are turning into standard browser APIs. Because they're cached and shared across the entire ecosystem, you won't have to re-download massive models for every single domain you visit.

  • AI-first browsers: Browsers designed with AI as their core foundation are already popping up. OpenAI's Atlas browser is a perfect early signal of this trend. Every year, the idea of the browser acting as an intelligent agent platform rather than a simple content renderer gets more concrete.

  • The developer shift: For developers, the immediate future is clear: a significant chunk of AI features that currently live on expensive servers will migrate straight to the client side. It won't be everything, but the lightweight, high-frequency, and privacy-sensitive tasks will absolutely make the jump.

WebGPU isn't just a flashy demo technology, and browser inference is definitely not a toy. These are serious production tools, and they're only getting more capable as AI models shrink and user hardware gets more powerful.

If you're currently building an interactive, AI-powered feature, it's well worth pausing to ask yourself: does this actually need a server?

Sometimes the answer is still yes. But more and more often, the answer is a definitive no.

What You Learned

In this tutorial, we covered:

  • What Web AI is and how it differs from cloud-based AI

  • When to use browser AI versus cloud AI and how a hybrid approach works

  • The technology stack behind browser AI: tensors, TensorFlow.js, WebAssembly, WebGL, WebGPU, and MediaPipe

  • How to train a custom model with Teachable Machine and export it for the browser

  • How to load that model and run it against live webcam input

  • How to benchmark WebGL vs WebGPU inference times to measure real performance differences

  • How to access Chrome's built-in AI APIs including Gemini Nano

If you found this useful or want to connect, you can find me on Twitter/X or LinkedIn.


Stop Joining Tables In Your “Modular” Monolith


Modular monoliths are all the rage.

You have well defined modules built around business capabilities. Maybe inside those modules you are using something like Clean Architecture. You have separation of concerns. You have direction of dependencies. Everything looks great and you think, “Wow, finally I have a really good structure.”

But do you?

YouTube

Check out my YouTube channel, where I post all kinds of content on Software Architecture & Design, including this video showing everything in this post.

Modular Monolith

You are looking at an order management system, viewing a particular order, and somebody wants to see the quantity on hand in the warehouse for the items on that order.

That seems harmless. Straightforward. Since you are in a monolith, you have the same database. No big deal. When you get the order line items out, if you are using a relational database, you just join with the inventory table.

Problem solved.

Except what looked like a simple answer to show some inventory on a screen may have just welded two modules together.

Your codebase can be lying to you. Even though you have this well structured solution with different projects around modules, it might really just be a nice looking wrapper on top of a very coupled database.

Your code might say Sales and Warehouse. But what do your queries say?

The Query That Crosses the Boundary

You might think it is architectural nonsense to hear “don’t cross a boundary,” especially in a situation like this. You are in a monolith. Why would you create two different queries when you can just have one with a join?

What is really the harm?

That is the trap you might not realize you are putting yourself into.

Jumping back to the order detail page, the request seems simple enough. Add another column that shows quantity on hand from the warehouse.

To do that, it is going to be pretty simple, right? Sales is building out the page. Sales gets the order details and the line items. From there, if we are using a relational database, we can just join to the inventory table wherever that lives as part of Warehousing.

Sure, they are separate things, but just go to the data directly. That is the tempting version.

I say that because it is simple, easy, and probably really fast to get that data and render it with a simple join. But architecturally, something important just happened.

You decided to couple Sales and Warehouse at the database level.

Now it is not just a query. You have a dependency.

That is the hidden coupling. You do not see it in code. Everything can look well structured, but the coupling is hidden inside your queries and how you are dealing with data access.

I know a lot of people immediately think, “Yes, this is why my system is a disaster. It is a rat’s nest.”

That happens because there is a free-for-all of coupling and integration at the database level. You can have some level of ownership and organization within code, but if you are not doing that at your database schema level, all bets are off.

Just Put an API in Front of It?

A solution you might be thinking of is to create an API. Sales calls the Warehouse API to get the inventory.

That way, Warehouse defines and uses its explicit schema, and so does Sales. Sales has to do the composition of getting the information from Warehouse through the API.
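As a rough illustration of that composition, here is a small in-memory sketch. Everything in it is hypothetical: the module shapes, the `getQuantityOnHand` function, and the field names are invented for the example, and in a real system the Warehouse side would query its own tables rather than an array.

```javascript
// Warehouse module: owns inventory data and exposes only this small contract.
function createWarehouseApi(inventoryRows) {
  const bySku = new Map(inventoryRows.map((row) => [row.sku, row.quantityOnHand]));
  return {
    // Returns a Map of sku -> quantity on hand (0 for unknown SKUs).
    getQuantityOnHand(skus) {
      return new Map(skus.map((sku) => [sku, bySku.get(sku) ?? 0]));
    },
  };
}

// Sales module: loads its own order lines, then composes them with
// Warehouse's answer instead of joining against Warehouse's tables.
function getOrderDetails(order, warehouseApi) {
  const quantities = warehouseApi.getQuantityOnHand(order.lines.map((l) => l.sku));
  return order.lines.map((l) => ({ ...l, quantityOnHand: quantities.get(l.sku) }));
}

const warehouse = createWarehouseApi([{ sku: 'ABC', quantityOnHand: 12 }]);
const order = { id: 1, lines: [{ sku: 'ABC', quantity: 2 }, { sku: 'XYZ', quantity: 1 }] };
console.log(getOrderDetails(order, warehouse));
```

Sales never sees how Warehouse stores inventory, so Warehouse can change its schema as long as `getQuantityOnHand` keeps returning the same shape.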

Sure, that solves a problem.

But you still have coupling. The form of coupling is just different. Before, you were coupled directly to the database. With an API, you are coupled to a contract.

That gives you more options. Maybe you have versioning. Maybe you can deprecate some APIs or mark something as obsolete. You have different ways of changing the behavior of how things work.

For example, say the requirements change and quantity on hand is stored differently. Now you have different warehouses, and you are showing which warehouse has what quantity on hand in what location.

If you were integrating directly into the database, you are handcuffed. That is especially true if you are not in the same codebase and cannot easily find all those queries.

Imagine you have some other codebase somewhere else that is also integrating into your database. Now you cannot change it. You are totally handcuffed.

With the API, at least you have some means to migrate. You have versioning. You have some course of action if you want to evolve and make change.

What Do You Actually Need?

There is more to it than that, because sometimes what you think you need and what you really need are two different things entirely.

I actually love this example because it highlights that.

If we are talking about needing the actual quantity on hand in the warehouse when looking at orders, maybe it is just for view purposes. Maybe it is also for placing an order, where you think you need the quantity on hand.

The thing is, that is not reality.

Sure, you may think this is a trivial example, but I can guarantee you that in your own business domain, this happens all the time.

With Sales, you might think about a particular SKU and the price for the product, or what you are selling it for. Purchasing might care about buying that product from a vendor and what the actual cost is. The Warehouse cares about that SKU, the quantity on hand, where it is in the warehouse, what bin it is in, and whatever else matters physically inside the warehouse.

But the idea of quantity on hand in Sales is not actually a thing.

It is not.

It is a totally different concept. It is a business function called Available to Promise.

Available to Promise is solving the actual issue, not necessarily from a technical perspective, but from a business perspective.

Sales does not actually care how many items are in the warehouse at any given time, or what the system thinks is there, because that is not reality.

Reality is going into the warehouse and seeing what parts or products are actually there and in what quantity. Something might be damaged. Something might be stolen. What is in your system is not reality. What is physically on the shelf is reality.

So something like ATP, Available to Promise, is a business function. It is the idea of, “What have we sold right now? What have we purchased from our vendors or manufacturers that we are going to receive? What number can I actually promise to sell because I know more are coming in?”

That is the number Sales cares about.
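To make the idea tangible, here is one simple way an ATP number could be computed from data Sales holds inside its own boundary. The formula and field names are illustrative assumptions for this sketch, not a definition taken from any particular system. Real ATP calculations are often time-phased and more involved.

```javascript
// Available to Promise, computed inside the Sales boundary.
// Assumed formula for illustration: last known stock, plus quantities on
// order from vendors, minus quantities already sold but not yet shipped.
function availableToPromise({ lastKnownOnHand, incomingFromVendors, soldNotShipped }) {
  return Math.max(0, lastKnownOnHand + incomingFromVendors - soldNotShipped);
}

console.log(availableToPromise({
  lastKnownOnHand: 10,
  incomingFromVendors: 25,
  soldNotShipped: 8,
})); // 27
```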

This goes back to the difference between what you think the requirement is and what you actually need.

When you model that properly, you may not need the coupling. You may not need to call some API to get another boundary’s data. You may not need to reach into its database or do a join or cross that boundary line at all.

You have all the data you need because you aligned everything with the business functions and the requirements within the boundary.

It Is All Tradeoffs

This is all about tradeoffs.

There really is no right answer, per se, because it comes down to what is right in your system.

What is the size of the system? How much is it evolving? If today it is just one extra query, no big deal. Maybe you have to do a little composition. Maybe it is a little bit slower. If that helps you tomorrow deal with coupling, then sure.

But if you have to deal with schema changes, and things are evolving, and you cannot change the schema because all these different things are coupled to the same underlying database, that is a problem.

There is your tradeoff.

How are you going to deal with regressions? If you change something, are you going to break other processes or application code? If you want to make a change, what different applications or parts of the same system do you have to coordinate with because you are changing the underlying schema?

There is no universal right answer here. It is a matter of the tradeoffs and the degree of coupling you are creating.

Which Camp Are You In?

I think there are probably two camps.

One camp is thinking this is insane. Sales and Warehouse? It is just a simple warehousing system. I can do the join. It is one database. It is a monolith. No problem.

And that may be working totally fine for you, depending on the size of your system and what else is accessing that database.

If it works, it works.

However, there are large enough systems where you absolutely need boundaries because you cannot evolve without them.

I can guarantee there are a lot of people living in those types of systems, where they are handcuffed because they cannot make any schema changes at all. They have no idea what application code or what queries are happening anywhere that are calling those tables.

You cannot make a change because you are going to break something else.

In those situations, you want to create an API because it is your contract. It is something you can evolve, at least a little bit easier than when clients are reaching directly into your database.

That is much more difficult to change.

Your Code Might Be Structured, But Your Data Might Not Be

If you just want to make that query and do that join, have at it. But realize the tradeoff you are making and the coupling you are creating.

Your application code might be well structured. It might look like it does not have much coupling. It might look like a modular monolith.

But if all the coupling is happening at the database level, what you still have is a big ball of mud.

It just has a nicer looking code structure wrapped around it.

Join CodeOpinion!
Developer-level members of my Patreon or YouTube channel get access to a private Discord server to chat with other developers about Software Architecture and Design and access to source code for any working demo application I post on my blog or YouTube. Check out my Patreon or YouTube Membership for more info.

The post Stop Joining Tables In Your “Modular” Monolith appeared first on CodeOpinion.


Stop guessing: the performance loop for production code


TL;DR: A benchmark can tell you whether code got faster. It cannot tell you whether the code mattered. For that, use a loop: profile with a profiling harness, improve a hot path, benchmark and compare, profile again, then ship and observe production.

The first benchmark I wrote looked deceptively easy.

A class. A few attributes. A method. Run BenchmarkDotNet and get a table.

It looked a lot like unit testing, which made me dangerously confident.

That confidence did not survive contact with real code.

Benchmarks resemble unit tests structurally, but they answer a completely different question. A unit test says, “Does this behavior still work?” A benchmark says, “What distribution of timings and allocations did this code produce under these conditions?” The second question is harder because the benchmark does not know whether the measured code matters.

Many performance investigations go wrong right there. The hard part is not adding [Benchmark] to a method. The hard part is deciding what to measure, what to cut away, what to keep, and when the result is good enough to ship.

The loop comes before the benchmark

The workflow I use has five steps:

  1. Profile using a profiling harness.
  2. Improve one hot path.
  3. Benchmark and compare.
  4. Profile the improved code again.
  5. Ship it and observe production.

The order matters.

Profiles find candidates.

Benchmarks compare alternatives.

Profiling comes first because it shows where the system spends time or allocates memory. Benchmarking comes after that because it compares a focused change under repeatable conditions.

If you skip the first profile, you are guessing. Sometimes you will guess right. Most of the time you will polish code that was not holding the system back.

Flame graph showing CPU cost in the NServiceBus publish pipeline
A flame graph can show the relationship between infrastructure code and business code before a benchmark exists.

The example I will use throughout this series comes from the NServiceBus pipeline. Conceptually, the pipeline is similar to middleware in ASP.NET Core. A message flows through a chain of behaviors. Each behavior can do work before and after the next behavior runs. Serialization, tracing, transactions, persistence, and user code can all sit behind that abstraction.

That makes the pipeline a good performance target. It runs on the message-processing hot path, and every user benefits if the infrastructure gets out of the way.

Start by becoming performance aware

Performance work does not have to start with a profiler. It can start with a few uncomfortable questions during normal development:

  • How often will this code run when the system is under load?
  • What does it allocate on that path?
  • Does it copy data that could be reused, pooled, streamed, or passed as a span?
  • Can setup work move out of the hot path?
  • Which parts are under our control, and which belong to another team, package, or service?
  • What would make us stop?

That last question is not a joke. Performance work has no natural stopping point. There is always another allocation to remove, another branch to simplify, another loop to tighten. Without a stopping rule, the investigation turns into hobby work.

I know this about myself. I once solved it at home by shutting off my internet around midnight. If I could no longer search for the next clue, I would finally go to bed.

For product work, perfect code is the wrong target. Code needs to be fast enough, cheap enough, and simple enough to maintain.

I call myself a principal chocolate lover these days, not a performance engineer. My job is still to ship useful software. Performance work has to earn its place beside everything else.

Why throughput and memory matter

When code runs at scale, small costs become visible. CPU time limits throughput. Allocations increase garbage collector pressure. More pressure means more pauses, more cores, more instances, or a larger cloud bill.

That bill is where performance stops sounding abstract. Someone puts down a credit card, the cloud turns CPU, memory, throughput units, and premium throughput units into cheerful line items, and then someone has to explain why the number got so large.

There is also a waste angle. If the same workload can run with fewer resources, the system uses less capacity for the same business outcome. That is good for cost, and it is good for the amount of energy we ask the platform to burn.

The Microsoft Teams migration to newer .NET versions is a good public example. The team reported large Azure compute cost reductions after moving to .NET 6 and benefiting from runtime performance improvements. Most teams will not get that size of win from one change. They do not need to. A few percent on a hot path can still matter when the code runs all day.

That is the mindset behind the loop: find repeated work, remove unnecessary cost, measure the change, and check whether the larger system improved.

A profiling harness makes the system visible

The first concrete step is a profiling harness. It runs the part of the system you want to understand while keeping unrelated work out of the profile.

For the NServiceBus pipeline investigation, the profiling harness used local infrastructure, a fast JSON serializer, and in-memory persistence. It was not comparing transports, serializers, databases, or cloud services. It was making pipeline invocation visible under a CPU profiler and a memory profiler.

EndpointConfiguration endpointConfiguration = new EndpointConfiguration("PipelineHarness");
endpointConfiguration.UseTransport<MsmqTransport>();
endpointConfiguration.UseSerialization<SystemJsonSerializer>();
endpointConfiguration.UsePersistence<NonDurablePersistence>();

IEndpointInstance endpoint = await Endpoint.Start(endpointConfiguration);

Console.WriteLine("Warmup complete. Attach profiler and press enter.");
Console.ReadLine();

for (int messageNumber = 0; messageNumber < 1000; messageNumber++)
{
    await endpoint.Publish(new SomethingHappened
    {
        Number = messageNumber
    });
}

Console.WriteLine("Published. Take snapshot and press enter.");
Console.ReadLine();

This is not production code. It is an instrument. The console prompts create clear snapshot points. The local transport keeps setup simple. The persistence choice avoids unrelated database work. The message loop runs the pipeline enough times for profiling tools to show useful patterns.

A good profiling harness has boring rules: build and run in Release mode, keep the run long enough to profile, remove avoidable noise, and emit symbols so profiler stacks point back to useful code. If tiered just-in-time compilation gets in the way during early investigation, disable it for the profiling harness and document that choice.
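
Some of those rules can be checked by the harness itself at startup. The sketch below is one way to do that, not part of the original harness. The helper name `FindHarnessProblems` is hypothetical, and it relies only on `DebuggableAttribute` and `Debugger.IsAttached` from the base class library.

```csharp
using System;
using System.Collections.Generic;
using System.Diagnostics;
using System.Reflection;

// Run these checks at the top of the harness, before the first snapshot prompt.
foreach (string warning in FindHarnessProblems(Assembly.GetExecutingAssembly()))
{
    Console.WriteLine($"Warning: {warning}");
}

// Returns one message for each condition that would distort a profile.
static string[] FindHarnessProblems(Assembly harness)
{
    var problems = new List<string>();

    // Debug builds carry a DebuggableAttribute that disables JIT optimizations.
    var debuggable = harness.GetCustomAttribute<DebuggableAttribute>();
    if (debuggable is not null && debuggable.IsJITOptimizerDisabled)
        problems.Add("built without optimizations; use a Release build");

    // An attached debugger changes JIT behavior and timings.
    if (Debugger.IsAttached)
        problems.Add("debugger attached; detach it and attach the profiler instead");

    return problems.ToArray();
}
```

Tiered compilation is a configuration choice rather than something to detect in code. It can be switched off with the `TieredCompilation` MSBuild property or the `DOTNET_TieredCompilation=0` environment variable.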

The profiling harness is not the final truth. It is the first map.

Profiles decide what deserves a benchmark

Once the profiling harness is running, take at least two views: memory and CPU. Memory often gives faster wins in .NET because allocations are easier to spot and easier to remove than algorithmic CPU costs.

Memory profiler view showing behavior chain allocations in the NServiceBus pipeline
The useful target was not every allocation in the process. It was the allocation pattern connected to pipeline invocation.

The profiler will show noise. Some of that noise may be large. That does not make it your best target.

In the pipeline example, some allocations came from Microsoft Message Queuing (MSMQ). That mattered less than it first appeared. MSMQ was not the target of the investigation, the code was outside the pipeline work, and its user base was shrinking. Optimizing the pipeline itself would help every transport. Optimizing MSMQ-specific overhead would not.

Context decides what the profiler output means. The tool can show you where cost appears. It cannot decide which cost is worth paying down.

Benchmark the hot path, not the whole system

After profiling points at a hot path, the benchmark can become small and focused. That usually means copying the relevant code into a controlled benchmark project, trimming unrelated dependencies, and comparing before and after versions.

Copying code feels wrong because duplication is usually how mess grows. In this case, the copy is part of a controlled experiment. It lets you remove dependency injection containers, external input/output, unrelated behaviors, and other moving parts that would blur the result.

[ShortRunJob]
[MemoryDiagnoser]
public class PipelineExecutionBenchmark
{
    BaseLinePipeline<IBehaviorContext> pipelineBeforeOptimizations;
    PipelineOptimization<IBehaviorContext> pipelineAfterOptimizations;
    BehaviorContext behaviorContext;

    [Params(10, 20, 40)]
    public int PipelineDepth { get; set; }

    [GlobalSetup]
    public void SetUp()
    {
        behaviorContext = new BehaviorContext();

        pipelineBeforeOptimizations = CreateBeforePipeline(PipelineDepth);
        pipelineAfterOptimizations = CreateAfterPipeline(PipelineDepth);
    }

    [Benchmark(Baseline = true)]
    public Task Before()
    {
        return pipelineBeforeOptimizations.Invoke(behaviorContext);
    }

    [Benchmark]
    public Task After()
    {
        return pipelineAfterOptimizations.Invoke(behaviorContext);
    }
}

The benchmark measures one thing: pipeline invocation after setup. [GlobalSetup] keeps construction out of the measured path. [Params] keeps the cases realistic. [ShortRunJob] keeps the feedback loop moving.

When the direction looks right, run a longer benchmark. Short runs are for steering. Longer runs are for confidence.

A benchmark win is not the finish line

A benchmark win is not the end. Put the improved code back into the profiling harness and profile again.

Optimized CPU flame graph showing less infrastructure overhead in the NServiceBus publish pipeline
After the optimization, the same profiling harness shows less infrastructure overhead around pipeline invocation.

This second profile answers a different question: did the focused improvement still matter when placed back into a larger execution path?

A microbenchmark might show a five-times faster operation. The whole subsystem may improve by less because it still does other work. That is fine. The dramatic benchmark number is not the prize. The system-level effect is.
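
The arithmetic behind that gap is Amdahl's law. The numbers in this sketch are assumed for illustration and are not measurements from the NServiceBus investigation.

```csharp
using System;

// Assumed numbers: the optimized operation took 20% of the subsystem's time,
// and the microbenchmark shows it running five times faster.
double overall = OverallSpeedup(fractionOfTime: 0.20, localSpeedup: 5.0);
Console.WriteLine($"Overall speedup: {overall:F2}x"); // about 1.19x

// Amdahl's law: the untouched part (1 - fraction) still costs what it did before.
static double OverallSpeedup(double fractionOfTime, double localSpeedup) =>
    1.0 / ((1.0 - fractionOfTime) + fractionOfTime / localSpeedup);
```

A five-times win on a fifth of the work is roughly a 19% improvement overall, which is why the second profile is worth taking.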

Then production gets the final vote. Watch throughput, latency, allocation rate, garbage collection behavior, and cost. If the assumptions were wrong, learn from that. If the assumptions were right, write down why the change worked so the team can reuse the knowledge.

Do not rewrite before you understand

Performance investigations often start near code that looks ugly. That can tempt a team into a rewrite. Rewrites feel clean because the new code has not met production yet.

The loop pushes in the other direction. Profile first. Improve one path. Benchmark the change. Profile again. Ship when the gain justifies the complexity. Write down the trade-offs.

After a few cycles, the team knows more about the code than it did before. Sometimes that knowledge makes a rewrite unnecessary. Sometimes it makes a rewrite safer because the team now has benchmarks, profiles, and production observations to guide the new design.

That is the point of the performance loop. It turns “this code is slow” into a repeatable investigation.

Common questions

Can I use this approach for application code?

Yes. Most applications have framework-like infrastructure inside them: message pumps, request pipelines, validation layers, serialization boundaries, caching layers, or database adapters. Start there if the business code is too broad to isolate.

Should every benchmark become a regression test?

No. Keep benchmarks that protect important hot paths. Delete or archive the experiments that helped you learn but would be expensive to maintain.

What should I do first on Monday?

Pick one path that runs often. Build a tiny profiling harness around it. Take one memory profile and one CPU profile. Do not write the benchmark until the profiles tell you what deserves one.

Performance loop status

  • [x] Understand the loop
  • [ ] Build profiling harness
  • [ ] Profile
  • [ ] Improve
  • [ ] Benchmark
  • [ ] Profile again
  • [ ] Ship and observe

Anatomy of an Open-Source AI Coding Agent Built in .NET: CodeAlta

May 27, 2026 9 minutes read

CodeAlta-Anatomy-Coding-Agent

A large language model doesn’t perform any action. It reads text and writes text. That is the whole contract.

So how does an “AI agent” edit your files, run your tests, and call your APIs? It runs a loop around the model. The model asks for an action, the program performs it, and the result goes back into the next prompt. Repeat until the model stops asking.

That loop is short. The code around it is not. This post walks through a real open-sourced one: CodeAlta, an agentic coding CLI written in .NET. We will look at the actual source, name the actual types, and separate the part that is simple from the part that is hard.

CodeAlta is opinionated, and it says so. Its manifesto is eight principles: efficient, transparent, keyboard-first, thread-oriented, provider-agnostic, native .NET, error-aware, pluggable. Two of them drive most of what follows. Provider-agnostic means no model SDK leaks into the core. Native .NET, with a deliberately narrow and auditable dependency graph, is treated as a constraint, not a tagline. That constraint is what makes the code worth reading.

Before diving into the code, here is a screenshot of CodeAlta running over the question:

“Explain how the method SendAsync() works and what it does It is located in file .\src\CodeAlta.Agent\LocalRuntime\LocalAgentSession.cs ln144”

The CodeAlta TUI is very impressive and is built on top of XenoAtom.Terminal.UI. Both CodeAlta and XenoAtom.Terminal.UI projects are primarily developed by Alexandre Mutel (xoofx).

CodeAlta.TUI

An agent is just a loop

Here is the core of CodeAlta’s LocalAgentSession.SendAsync, cut down to its shape:

while (true)
{
    var response = await CallModelAsync(conversation, tools);   // the LLM call
    conversation.Add(response.AssistantMessage);

    var toolCalls = response.GetToolCalls();
    if (toolCalls.Count == 0)
        return;                                                 // model is done

    foreach (var call in toolCalls)
        conversation.Add(await RunToolAsync(call));             // observe, append
}

Call the model. If it asked for tools, run them and append the results. Loop. If it didn’t, you’re done. This is the ReAct pattern, and it is the entire idea behind “agent.”

You could write this in an hour. Everything else in this post is about the gap between that snippet and an agent you would trust with your repository.

The loop, for real

The real SendAsync runs to a few hundred lines across the method and its helpers, not 15. It leans on a handful of collaborators, each with one job:

  • Session (LocalAgentSession) — owns one conversation and drives the loop.
  • Turn executor (ILocalAgentTurnExecutor, LocalAgentChatClientTurnExecutor) — performs one round-trip to the model.
  • Conversation (LocalAgentConversationMessage and its parts) — the working memory.
  • Store (ILocalAgentSessionStore) — the durable, event-sourced history.
  • Tools (AgentToolDefinition, LocalAgentToolBridge) — what the model is allowed to do.
  • Compaction (the Compaction folder) — keeps the context window from overflowing.
  • Steering (SteerAsync method) — lets you add input to a run that is already going, without stopping it. Like telling a driver “turn left up here” while the car keeps moving.

The session orchestrates several responsibilities, and the LLM call is only one of them. Most of the engineering lives in the others.

Detailled-Loop-Diagram

Before each turn, the session estimates how many tokens the prompt will use. If it is over a threshold, it compacts. Then it calls the model.

If the provider rejects the request because the context is too long, the session catches that specific error, compacts, and retries the same turn once. A naive loop would crash here.

After the model replies, the session checks for tool calls. No tool calls usually means the run is over. But first it drains any steering input that arrived mid-run, and if there is some, it keeps going. With tool calls, it runs each one, records a diff of any files that changed, appends the result, and loops.
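
The control flow described above can be sketched with a fake model. This is a simplified, synchronous illustration, not CodeAlta's code. `CallModel`, `Compact`, the `steering` queue, and the use of `InvalidOperationException` as the overflow signal are all hypothetical stand-ins.

```csharp
using System;
using System.Collections.Generic;

var conversation = new List<string> { "user: fix the failing test" };
var steering = new Queue<string>();   // input typed while the run is in progress
int modelCalls = 0;

// Fake model: rejects the first call as too long, asks for one tool, then finishes.
(string Reply, string? ToolCall) CallModel(List<string> messages)
{
    modelCalls++;
    if (modelCalls == 1) throw new InvalidOperationException("context too long");
    if (modelCalls == 2) return ("I need to run the tests.", "run_tests");
    return ("Done.", null);
}

// Placeholder compaction: collapse the history into a single summary message.
void Compact(List<string> messages)
{
    string summary = $"summary of {messages.Count} message(s)";
    messages.Clear();
    messages.Add(summary);
}

while (true)
{
    (string Reply, string? ToolCall) response;
    try
    {
        response = CallModel(conversation);
    }
    catch (InvalidOperationException)        // stand-in for a provider overflow error
    {
        Compact(conversation);
        response = CallModel(conversation);  // retry the same turn once
    }

    conversation.Add($"assistant: {response.Reply}");

    if (response.ToolCall is null)
    {
        if (steering.Count == 0) break;      // no tools and no steering: the run is over
        while (steering.Count > 0) conversation.Add($"user: {steering.Dequeue()}");
        continue;                            // steering arrived, so keep going
    }

    conversation.Add($"tool: result of {response.ToolCall}");
    if (modelCalls == 2) steering.Enqueue("also update the changelog"); // simulate mid-run steering
}

Console.WriteLine(string.Join(Environment.NewLine, conversation));
```

In this run the fake model is called four times: one rejected call, one retry that requests a tool, one reply that would have ended the run, and one more because steering input was waiting.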

Summarizer / Compaction

Interacting with an LLM consumes tokens, and tokens come at a cost. As a result, a key responsibility of any agent is to manage and minimize this usage. Compaction plays a crucial role by preserving essential information in the context while reducing token consumption.

The compaction layer lives in CodeAlta.Agent.LocalRuntime.Compaction/ and shrinks a session’s conversation history when it nears the model context limit, replacing summarized turns with a structured Markdown checkpoint.

The system is split into focused components: LocalAgentTokenEstimator and LocalAgentTokenBudgetResolver decide when compaction is needed and the allowed summary size; LocalAgentCompactionPlanner selects which messages to summarize, keep, or treat as oversized anchors; canonicalization, media stripping, serialization, and chunking are handled by LocalAgentCompactionCanonicalizer, LocalAgentMediaCompaction, LocalAgentCompactionSerializer, and LocalAgentCompactionChunker.

LocalAgentCompactionSummarizer orchestrates recursive summarization and shrink-to-fit passes, delegating LLM calls to LocalAgentTurnExecutorCompactionSummaryExecutor, which wraps ILocalAgentTurnExecutor.ExecuteTurnAsync.
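
To make the idea concrete, here is a deliberately small sketch of threshold-based compaction. It does not mirror CodeAlta's planner or summarizer. The four-characters-per-token estimate, the `keepRecent` parameter, and the `summarize` delegate (which stands in for the dedicated LLM call) are all assumptions.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

var history = Enumerable.Range(1, 50)
    .Select(i => $"message {i}: " + new string('x', 200))
    .ToList();

var result = CompactIfNeeded(history, tokenThreshold: 1000, keepRecent: 5,
    summarize: earlier => $"{earlier.Count} earlier messages summarized.");

Console.WriteLine($"{history.Count} messages -> {result.Count} messages"); // prints "50 messages -> 6 messages"

// Rough heuristic, assumed here: about four characters per token.
static int EstimateTokens(IEnumerable<string> messages) =>
    messages.Sum(m => (m.Length + 3) / 4);

// Replaces everything except the most recent messages with one checkpoint entry.
static List<string> CompactIfNeeded(
    List<string> messages,
    int tokenThreshold,
    int keepRecent,
    Func<IReadOnlyList<string>, string> summarize)
{
    if (EstimateTokens(messages) <= tokenThreshold || messages.Count <= keepRecent)
        return messages;   // still under budget, or nothing old enough to summarize

    var older = messages.Take(messages.Count - keepRecent).ToList();
    var recent = messages.Skip(messages.Count - keepRecent);

    var compacted = new List<string> { "## Checkpoint\n" + summarize(older) };
    compacted.AddRange(recent);
    return compacted;
}
```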

CodeAlata-Compaction

LLM Calls

The model is invoked through three paths, all converging on ILocalAgentTurnExecutor.ExecuteTurnAsync. First, the normal turn (LocalAgentSession.ExecuteTurnWithOverflowRecoveryAsync) streams user requests to the provider and relays responses.

Second, the summarizer (LocalAgentCompactionSummarizer) runs when context is near capacity and issues dedicated model calls to compress history into a checkpoint using a summarization prompt, not a continuation of the chat.

Third, if a context overflow still occurs, the catch handler (IsContextOverflow) triggers CompactCoreAsync and then replays the same turn on the compacted conversation. This can involve the retry path in LocalAgentSession (e.g., post-compact re-execution). A single user request may therefore trigger the main call, summarization calls, and a retry, treating overflow as a recoverable condition.

CodeAlta.LLMCall

 

Composing the prompt

People picture a system prompt as a long string a developer typed. In a working agent it is built per run, from parts. LocalAgentInstructionComposer.Compose assembles three.

First, the system message. Second, a runtime context block it generates on the spot: the date, the OS, the default shell (pwsh on Windows), the working directory, the project roots. Third, the developer instructions.

That third part is where it gets interesting. The composer walks up the directory tree and pulls in AGENTS.md and CLAUDE.md files it finds along the way. So when the agent seems to “just know” your project’s conventions, this is why. It read the file you committed.

Loaded skills get appended too, inside an <active_skills> block. The whole bundle is hashed. If the hash hasn’t changed since last turn, the session doesn’t re-log it.

Prompt engineering at this scale is deterministic assembly, closer to a build step than to creative writing.
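
A minimal sketch of that assembly step follows. The function names, the ordering of the instruction files, and the choice of SHA-256 are assumptions for illustration. The source only says that the composer walks up the directory tree and that the bundle is hashed.

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.Linq;
using System.Security.Cryptography;
using System.Text;

string workingDirectory = Directory.GetCurrentDirectory();
string runtimeContext =
    $"Date: {DateTime.UtcNow:yyyy-MM-dd}\nOS: {Environment.OSVersion}\nWorking directory: {workingDirectory}";
var instructions = FindInstructionFiles(workingDirectory).Select(path => File.ReadAllText(path));

var (prompt, hash) = Compose("You are a coding agent.", runtimeContext, instructions);

string? lastLoggedHash = null;   // would be remembered from the previous turn
if (hash != lastLoggedHash)
{
    Console.WriteLine($"Instructions changed ({hash[..12]}), logging {prompt.Length} characters.");
    lastLoggedHash = hash;
}

// Collects AGENTS.md and CLAUDE.md from the working directory up to the filesystem root.
// Assumption: outermost files come first so that nearer files can refine them.
static List<string> FindInstructionFiles(string startDirectory)
{
    var found = new List<string>();
    for (var dir = new DirectoryInfo(startDirectory); dir is not null; dir = dir.Parent)
    {
        foreach (string name in new[] { "AGENTS.md", "CLAUDE.md" })
        {
            string path = Path.Combine(dir.FullName, name);
            if (File.Exists(path)) found.Add(path);
        }
    }
    found.Reverse();
    return found;
}

// Joins the three parts and fingerprints the result so unchanged bundles can be detected.
static (string Text, string Hash) Compose(
    string systemMessage, string context, IEnumerable<string> developerInstructions)
{
    string text = string.Join("\n\n", new[] { systemMessage, context }.Concat(developerInstructions));
    byte[] digest = SHA256.HashData(Encoding.UTF8.GetBytes(text));
    return (text, Convert.ToHexString(digest));
}
```

Because the same inputs always produce the same text and the same hash, the step behaves like a build: it is reproducible and cheap to compare between turns.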

Here is what a request sent to the LLM looks like:

CodeAlta.Composing

Invoking Tools

The prompt also includes a list of available tools and their descriptions, enabling the LLM to decide when to schedule a tool call.

CodeAlta.Tools.Description

When asking CodeAlta to explain the method SendAsync(), the LLM asks the agent to call the read_file tool, providing the file path along with the offset and limit parameters. The tool is then executed, and the source file content is sent back to the LLM in a new request within the next loop.
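
On the agent side, executing such a request comes down to looking up a handler by name and passing it the arguments. The sketch below assumes string-valued arguments, a zero-based `offset`, and errors returned as text. The registry shape and `RunTool` are hypothetical and are not taken from CodeAlta's LocalAgentToolBridge.

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.Linq;

// Hypothetical registry: tool name -> handler that returns text for the next prompt.
var tools = new Dictionary<string, Func<IReadOnlyDictionary<string, string>, string>>
{
    ["read_file"] = arguments => ReadFileSlice(
        arguments["path"], int.Parse(arguments["offset"]), int.Parse(arguments["limit"])),
};

// Failures are returned as text so the model can see them and react.
string RunTool(string name, IReadOnlyDictionary<string, string> arguments)
{
    if (!tools.TryGetValue(name, out var handler)) return $"error: unknown tool '{name}'";
    try { return handler(arguments); }
    catch (Exception ex) { return $"error: {ex.Message}"; }
}

// Reads at most `limit` lines, skipping the first `offset` lines.
static string ReadFileSlice(string path, int offset, int limit) =>
    string.Join("\n", File.ReadLines(path).Skip(offset).Take(limit));

// Usage: simulate the model asking for two lines starting at line index 1.
string tempFile = Path.GetTempFileName();
File.WriteAllLines(tempFile, new[] { "line 0", "line 1", "line 2", "line 3" });

string toolResult = RunTool("read_file", new Dictionary<string, string>
{
    ["path"] = tempFile,
    ["offset"] = "1",
    ["limit"] = "2",
});
Console.WriteLine(toolResult); // prints "line 1" then "line 2"
File.Delete(tempFile);
```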

CodeAlta.Tools.Call

MCP Tools?

You can usually extend a coding agent with additional tools through the Model Context Protocol (MCP). For instance, at NDepend we have an open-source NDepend.MCP.Server built on top of the NDepend API. It exposes tools for workspace analysis, code inspection, and automated fixes for .NET projects. With pre-code scanning in place, such tools can help a coding agent reduce AI bias and avoid unnecessary token consumption.

From the LLM’s perspective, MCP tools work the same way as described earlier: the available tools and their descriptions are included in the prompt, and the model can decide which tool to call and with which parameters.

We also document the approach in this tutorial: Developing an MCP Server with C#: A Complete Guide.

Unlike Anthropic’s Claude Code, Copilot CLI and OpenAI Codex, CodeAlta does not currently include an MCP client. It’s worth noting that the tool is still in a pre-1.0 stage, so the architecture and integrations will evolve over time.

Memory of a Conversation

The obvious way to store a conversation is a List<Message> in memory. CodeAlta keeps that list, but it is not the source of truth.

Every user message, assistant reply, and tool result is appended to the store as an event. On load, ReplayConversation rebuilds the in-memory conversation from those events. If a compaction checkpoint exists, replay seeds from the summary and then replays everything after it.

This is event sourcing, a pattern most .NET developers already know from CQRS. The benefits come for free: sessions are durable, they survive a crash, and they resume exactly where they stopped.
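
Replay with a checkpoint can be sketched in a few lines. The event shape (a kind plus text) and the `Replay` function below are simplified assumptions, not CodeAlta's ReplayConversation.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical event log, appended in order. "checkpoint" marks a compaction summary.
var events = new List<(string Kind, string Text)>
{
    ("user", "add a retry policy"),
    ("assistant", "Reading the HTTP client setup."),
    ("tool", "read_file: 120 lines"),
    ("checkpoint", "Summary: user wants a retry policy; client lives in HttpSetup.cs."),
    ("assistant", "Adding the policy now."),
    ("tool", "edit_file: ok"),
};

var rebuiltConversation = Replay(events);
Console.WriteLine($"{events.Count} events -> {rebuiltConversation.Count} messages in memory"); // prints "6 events -> 3 messages in memory"

// Rebuilds the in-memory conversation from the durable log.
static List<string> Replay(IReadOnlyList<(string Kind, string Text)> log)
{
    // Seed from the most recent checkpoint, if any, and skip everything before it.
    int start = 0;
    for (int i = log.Count - 1; i >= 0; i--)
    {
        if (log[i].Kind == "checkpoint") { start = i; break; }
    }

    var rebuilt = new List<string>();
    for (int i = start; i < log.Count; i++)
        rebuilt.Add($"{log[i].Kind}: {log[i].Text}");
    return rebuilt;
}
```

The log itself is never rewritten. Compaction only adds a checkpoint event, so the full history stays available for inspection.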

CodeAlta.ReplayConversation

CodeAlta Overall Architecture

Now that we have a solid understanding of the orchestration happening in the main loop, here is a diagram of the overall CodeAlta architecture.

CodeAlta-Diagram

The host only ever sees IAgentSession and a stream of events. Everything below the session is swappable.

One LLM call, any provider

Here is the overall CodeAlta architecture at the code level. Unsurprisingly, CodeAlta.Agent is the root project of the solution.

One notable aspect shown in the graph is that there is a separate project for each LLM provider: OpenAI, Anthropic, XAI, Copilot, and Google GenAI.

The session layer never interacts directly with a model SDK. Instead, it depends on the interface ILocalAgentTurnExecutor, whose responsibility is simply to execute one turn. This interface acts as the abstraction layer that makes the different providers interchangeable.

Not every provider implements ILocalAgentTurnExecutor itself. The Anthropic and Google GenAI providers rely on the IChatClient interface defined in Microsoft.Extensions.AI, so they can reuse the class LocalAgentChatClientTurnExecutor, which implements ILocalAgentTurnExecutor.

IChatClient is a powerful standard interface for interacting with LLMs. We explain its usage in detail here: LLM Chat in .NET with IChatClient: The Complete Guide.

CodeAlta.Architecture

 

The Agent Client Protocol

In the overall graph from the previous section, you can see an important project: CodeAlta.Acp. At the time of writing, this feature is not yet enabled, but it is still worth mentioning. As stated in acp.md: protocol and backend support exists, but the frontend integration is intentionally hidden for now. The TUI does not register ACP backends, command-palette entries, slash commands, shortcuts, or management-dialog entry points because ACP has not yet been exercised and validated in the frontend.

Up to now, the agent has effectively always been ours. Whether wired to OpenAI, Anthropic, or Google, CodeAlta is the system in control — composing prompts, selecting tools, executing them, streaming tokens, and compacting history. The provider is reduced to a model API behind an IChatClient. In this setup, CodeAlta is the brain; the provider merely supplies the neurons on demand.

The out-of-process ACP agent flips this model.

ACP — the Agent Client Protocol — is a stdio-based JSON-RPC contract for interacting with a fully autonomous coding agent running as its own process: Claude Code, or Gemini CLI, or a vendor-specific agent. In this model, CodeAlta becomes the host. It forwards prompts, mediates permissions, and bridges access to the filesystem, while the remote process brings its own reasoning loop, toolset, planning logic, model access, and authentication. Crucially, CodeAlta no longer calls an LLM at all — the intelligence sits entirely on the other side of the socket.

This is unified through the interface IAgentSession, implemented by both LocalAgentSession and AcpAgentSession, which abstracts the execution model completely.

From one agent to orchestrating many

Finally let’s mention the project CodeAlta.Orchestration. If a single agent turn is one heartbeat, orchestration is the nervous system that keeps a whole organism of them alive at once. CodeAlta lets you run several conversations — “work threads” — in parallel, each talking to its own agent, each streaming events back to the UI while you keep typing. The hard part isn’t talking to a model; it’s doing it many times concurrently without the state dissolving into race conditions.

CodeAlta’s answer is to give every thread a single owner. Instead of sharing mutable state behind locks, each thread processes a serialized stream of commands — send this, steer that, abort, compact — and emits a clean stream of events in return. One writer, one mailbox, no tug-of-war. The runtime turns the messy, asynchronous reality of live agents into something the rest of the app can treat as a predictable feed.
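
In .NET, a common way to get "one writer, one mailbox" is a channel with a single reader. The sketch below uses `System.Threading.Channels` and is an assumption about the general pattern, not a copy of CodeAlta.Orchestration. The command names are placeholders.

```csharp
using System;
using System.Collections.Generic;
using System.Threading.Channels;
using System.Threading.Tasks;

// One mailbox per work thread: many callers may post commands, a single reader owns the state.
var commands = Channel.CreateUnbounded<string>(new UnboundedChannelOptions { SingleReader = true });
var emitted = new List<string>();   // touched only by the owner loop below

Task owner = Task.Run(async () =>
{
    await foreach (string command in commands.Reader.ReadAllAsync())
    {
        // Commands are handled one at a time, in arrival order, so no locks are needed.
        emitted.Add($"handled {command}");
    }
});

// Callers never touch the state directly. They only enqueue commands, here concurrently.
await Task.WhenAll(
    Task.Run(() => commands.Writer.TryWrite("send")),
    Task.Run(() => commands.Writer.TryWrite("steer")),
    Task.Run(() => commands.Writer.TryWrite("compact")));

commands.Writer.Complete();   // no more commands, which lets the owner loop finish
await owner;

Console.WriteLine($"{emitted.Count} events emitted"); // prints "3 events emitted"
```

The arrival order of the three commands is not deterministic in this example, but each one is processed exactly once and never concurrently with another.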

That design buys three things: concurrency that doesn’t corrupt, a frontend fully decoupled from the engine, and uniform handling whether the agent lives in-process or across an ACP socket. Orchestration, in short, is where CodeAlta stops being a chatbot and starts being a workspace.

Conclusion: What separates a good coding agent from a demo

Feedback from the Author

In a Reddit post about CodeAlta’s first public release, its author Alexandre Mutel answers the question: What was the biggest challenge making this project spanning all of the quirks of LLMs / inference providers / tool calling differences?


Answer: Not so many actually, but quite a few challenges:

  • Anthropic API .NET had a few issues for some non-Anthropic standard models: Handle missing streaming thinking signatures or Fix several HasNext() that should rely on response.HasMore

  • I had to implement the more efficient websocket protocol for ChatGPT/Codex subscriptions which is not implemented by OpenAI .NET. I relied on Codex CLI codebase to replicate the behavior, but it was possible to nicely integrate with the HttpClient/System.ClientModel to configure/plug custom protocol.

  • Google.GenAI / Gemini is completely broken due to unmerged PR here from Google folks. This one is annoying to the point that I wonder if I should not fork/maintain my own API/endpoint wrappers instead on relying on 1st(!) party providers.

  • I still have an unmerged Copilot SDK PR here but I removed support for Copilot SDK and went straight to Copilot API endpoints. The middle-man API is not worth the trouble, and it blocks many scenarios of CodeAlta own harness. Same for Codex app-server, I’m connecting directly to Codex endpoints instead.

Some Chinese provider/models have slight differences in their behavior, so they require a few knobs, but hopefully, it is possible to configure these knobs via the config file. (e.g. developer role not supported by some models)

The biggest challenge for developing this assistant was to go through every single details, and fix/improve them one after the other. I had already something working 2.5 months ago, but it took a lot more iteration to polish, improve the performance, fix provider issues…etc.

As I’m also heavily relying for the TUI on XenoAtom.Terminal.UI, I had to push several improvements/fixes there, and the separate repository/package model complicates a bit the development process, but I think in the end it is for the better, as I have clear dependencies and having to slow down between these dependencies helps to avoid taking bad shortcuts.

In the same post, Alexandre also writes: “I have spent 3 months working on this, mornings, evenings, and during my weekends, it is worth > 200 hours of my time”


So what matters?

Here is what I consider the answer to the initial question: what separates a good agent from a toy?

At its core, it is not just model choice or clever prompting — it is the amount of real-world failure exposure and the engineering effort spent turning those failures into predictable behavior. Unlike traditional software, agents compose uncertainty across multiple steps, so failures don’t just stack — they compound. The gap between demo and production is earned in logs, not notebooks. One can only imagine the volume of iteration behind systems like Claude Code, Codex, Copilot CLI or Cursor, where every edge case becomes a design constraint.

A demo agent is optimized for the happy path: clean tool outputs, short sessions, ideal model responses. A good agent is designed to survive entropy. It handles long-running sessions, degraded or partial model outputs, tool failures, and context saturation without collapsing — treating the environment as unreliable and building recovery as a first-class concern.

In practice, this means pushing work out of the model and into deterministic tooling — using structured operations like grep instead of asking the model to re-scan large contexts in natural language. A significant portion of CodeAlta’s codebase lives in exactly this layer: the machinery that keeps the system stable when everything around it is noisy, incomplete, or wrong.
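As a rough illustration of what "deterministic tooling" can look like, the sketch below implements a tiny grep-style search that a harness could expose as a tool. It is not taken from CodeAlta: the `GrepTool` class, the `Search` method and the `maxResults` cap are hypothetical names chosen for this example.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

// Hypothetical helper, not CodeAlta code: a deterministic "grep" that a harness
// could run itself instead of asking the model to re-read a large context.
public static class GrepTool
{
    // Returns the 1-based line number and text of every line matching the pattern.
    public static List<(int LineNumber, string Text)> Search(
        IEnumerable<string> lines, string pattern, int maxResults = 50)
    {
        var regex = new Regex(pattern, RegexOptions.IgnoreCase);
        var matches = new List<(int LineNumber, string Text)>();
        int lineNumber = 0;

        foreach (string line in lines)
        {
            lineNumber++;
            if (!regex.IsMatch(line))
            {
                continue;
            }

            matches.Add((lineNumber, line));

            // Cap the output so a broad pattern cannot flood the model's context.
            if (matches.Count >= maxResults)
            {
                break;
            }
        }

        return matches;
    }
}
```

A harness might call it as `GrepTool.Search(File.ReadLines(path), "TODO")` and hand only the matching lines back to the model. The same input always yields the same output, which is the property the paragraph above is after.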

Going Further

The CodeAlta GitHub repo provides some interesting documentation.

Also it appears that the source code of Claude Code leaked at the end of March 2026. You can refer to the GitHub project Claude Code Source Analysis to learn more about its internal architecture and implementation details.

The post Anatomy of an Open-Source AI Coding Agent Built in .NET: CodeAlta appeared first on NDepend Blog.


Control what your agentic workflows see with integrity filtering

GitHub Agentic Workflows filter untrusted GitHub content before it reaches the agent. Here’s why integrity filtering matters for repository maintainers, and how we built it.

Querying Reliable AI Resources with Telerik Agent Tools API


Telerik Document Processing Libraries don’t just let you create RAG resources for your LLMs. They also let you integrate those libraries into Microsoft’s latest agent-based tools.

In my previous post, I showed how to use Progress Telerik Agent Tools API to create a workflow that loads content into a Retrieval-Augmented Generation (RAG) resource that can be used with Large Language Models (LLMs). That RAG resource, integrated with an LLM in your application, helps your users get reliable, grounded answers, driven by the content you load into your resource.

This post shows how to tie your RAG resource to an LLM and integrate the combination in an application that allows users to query the resource—or any combination of resources that you’ve created. Specifically, I’ll show how to use Microsoft’s current technology for querying an AI resource, the ChatClientAgent object.

Telerik Agent Tools API provides a collection of toolsets that the chat client agent works with to query your RAG resources. You just have to load your RAG resource(s) into in-memory repositories, attach the appropriate toolsets and pass the resulting tools to a chat client agent. Once you’ve done that, you can submit prompts to the agent and get back the results driven by the RAG resources you’ve loaded.

Configuring Your Project

To create an application that can query your resources, you first need the Telerik.Documents.AI.* NuGet packages that work with the documents in your RAG resource (I covered those libraries in my previous post).

After that, you also need two Microsoft AI NuGet packages:

  • Azure.AI.OpenAI
  • Microsoft.Agents.AI.OpenAI

I used an ASP.NET Web API project for my case study so I also needed to add the Microsoft.AspNetCore.OpenApi package.

As I write this, the Microsoft.Agents.AI.OpenAI package is in “release candidate” mode, which means that, while its interfaces and functionality are fixed, I need to use the “prerelease” option when adding the package. Having said that, by the time you read this, the package may have moved to “latest stable” status (this is a very agile environment).

Configuring a Chat Client Agent

Before you create your ChatClientAgent agent object, you need to assemble a set of tools, tied to one or more of your RAG resources. Your first step is to create a repository of the right type (PDF, spreadsheet, Word/Word-related) and load the file that holds your RAG resource into that repository.

For my case study, I’m only working with PDF documents (what Progress Telerik calls “Fixed” documents), so I created an IFixedDocumentRepository repository named pdfRepo (all the repositories share a common interface so they all look very much alike).

Once I created the repository, I loaded my RAG resource using the repository’s Import method. The Import method must be passed a FileStream object pointing to the resource and its format which, in this case, was DocumentFormat.PDF (see my previous post for the details on creating that resource).

Here’s the code that loads my case study’s repository:

IFixedDocumentRepository pdfRepo =
    new InMemoryFixedDocumentRepository(new TimeSpan(0, 2, 0));
// Only import when the resource file is actually present
if (File.Exists(repoFilePath))
{
    // Import returns the name under which the document was stored
    string name = pdfRepo.Import(
        new FileStream(repoFilePath, FileMode.Open, FileAccess.Read),
        DocumentFormat.PDF,
        Path.GetFileNameWithoutExtension(repoFilePath));
}

If you provide the name of the file as the third parameter, the Import method returns that name (and returns “ImportedDocument” if you don’t provide the third parameter).

To support having the chat client agent query my PDF repository, I just need the FixedDocumentContentAgentTools toolset. To create that toolset, I have to pass two parameters: the repository itself and a folder that holds any images used in those PDF documents. When attaching a toolset to a repository, you’ll always have to pass the repository parameter, but other toolsets will require different parameters (and, often, no other parameters).

Once I’ve created that toolset, I extract its tools using the toolset’s GetTools method and add those tools to a list of AITool objects that will, eventually, be passed to my chat client agent. The code to do that looks like this:

FixedDocumentContentAgentTools pdfReadTools = new FixedDocumentContentAgentTools(pdfRepo, repoPath);
List<AITool> queryTools = new List<AITool>( pdfReadTools.GetTools() );

If I wanted to add more tools to my list, I would use the list’s AddRange method. As an example, the following code:

  • Creates a repository for holding spreadsheets
  • Imports a RAG resource file of spreadsheets into that repository using the repository’s Import method
  • Attaches the SpreadProcessingReadAgentTools toolset (which only needs to be passed a reference to the repository)
  • Extracts the toolset’s tools and adds them to my list of AITools

IWorkbookRepository workbookRepo =
    new InMemoryWorkbookRepository( new TimeSpan(0, 2, 0) );
workbookRepo.Import(workbookRepoPath, DocumentFormat.XLSX);

SpreadProcessingReadAgentTools workbookTools = new(workbookRepo);

queryTools.AddRange( workbookTools.GetTools() );

Querying Your Repositories

Now that I have a list of tools, I’m ready to create an agent. There are three steps to doing that (but you can do it in one line of code).

First, you need to create an AzureOpenAIClient object which, in its simplest form, just requires passing the URL for the LLM deployment you’ve created and the authorization key for that deployment.

One note: Using an authorization key is probably fine for development but, in production, you should authorize access using something more robust (e.g., Managed Identities in Entra ID). If you do use an authorization key, then you shouldn’t hardcode it into your application as I do here; move that key to some more secure location instead (e.g., Azure Key Vault).
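As a minimal sketch of keeping the key out of source code, the helper below reads it from an environment variable and fails loudly when it is missing. This is a simpler stand-in for the Key Vault approach mentioned above, not a replacement for it. The `Secrets` class, the `GetRequired` method and the variable name in the usage note are hypothetical.

```csharp
using System;

// Hypothetical helper: reads a required secret from the process environment
// so the key never appears in source control.
public static class Secrets
{
    public static string GetRequired(string variableName)
    {
        var value = Environment.GetEnvironmentVariable(variableName);

        // Fail fast with a clear message instead of sending an empty key to the service.
        if (string.IsNullOrWhiteSpace(value))
        {
            throw new InvalidOperationException(
                $"Environment variable '{variableName}' is not set.");
        }

        return value;
    }
}
```

With this in place, `string apiKey = Secrets.GetRequired("AZURE_OPENAI_API_KEY");` would supply the key (the variable name is a placeholder; use whatever your hosting environment defines).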

Once you’ve created your AzureOpenAIClient object, your second step is to call its GetChatClient method to configure and return a ChatClient object. The GetChatClient method just needs to be passed the name of the deployment you created for your LLM.

Finally, you need to call your ChatClient object’s AsAIAgent method to configure your agent. You can pass a variety of parameters as part of configuring your agent. I settled for specifying a name for my agent, some instructions on how my agent is to answer questions, and my list of tools:

AIAgent agent = new AzureOpenAIClient(
                                      new Uri(url),
                                      new AzureKeyCredential(apiKey))
        .GetChatClient(deploymentName)
        .AsAIAgent(
                instructions: "You provide guidance to Azure software developers",
                name: "Async App Expert",
                tools: queryTools);

With your chat client agent in hand, you process your user’s prompts by calling the agent’s RunAsync method and passing the prompt. The agent will return an AgentResponse object which has a Messages collection holding a list of responses drawn from the repositories associated with your tools (you probably want the first message). The Text property on a message will give you the agent’s response.

Typical code would look like this:

AgentResponse response = await agent.RunAsync("What NuGet packages do I need?");
string responseText = response.Messages[0].Text;

Interacting with Your RAG-Enabled Sources

Of course, you’ll also want to create a UI for your users to interact with when querying your RAG resource. In earlier posts, I showed how to leverage Telerik tools to create dedicated user-friendly UIs in Blazor or JavaScript. Alternatively, instead of creating a UI dedicated to your RAG-enabled resource, you might want to more tightly integrate your resource into a JavaScript application’s UI.

But, really, it’s up to you how you’ll use your RAG resource to support your users.
