Sr. Content Developer at Microsoft, working remotely in PA, TechBash conference organizer, former Microsoft MVP, Husband, Dad and Geek.
154771 stories
·
33 followers

What's new in Astro - May 2026

1 Share
May 2026 - A new Astro jobs board, TinaCMS makes Astro their default template, experimental advanced routing, and more!
Read the whole story
alvinashcraft
2 hours ago
reply
Pennsylvania, USA
Share this story
Delete

Cordova iOS 8.1.0 is now available!

1 Share

We are happy to announce that we have just released Cordova iOS 8.1.0! This is one of Cordova's supported platforms for building iOS applications.

This release contains fixes for several bugs that were reported against the 8.0.1 version.

To upgrade:

cordova platform remove ios
cordova platform add ios@8.1.0

To install:

cordova platform add ios@8.1.0

Release Highlights

  • Fix for Ionic WebView plugin

    Re-add an implementation of the deprecated shouldOverrideLoadWithRequest:navigationType: selector because the Ionic WebView plugin relies on it.

  • Fix /_app_file_/ URLs not working

    During refactoring of the URLSchemeTask handler for cordova-ios 8, an error was introduced where /_app_file_ URLs were treated as being relative to the resources directory rather than as filesystem paths.

  • Fixes various build issues

Changes include:

Fixes:

  • GH-1653 fix(actions): fix CDVURLSchemeHandlerTest warnings
  • GH-1652 fix(actions): IPhone 16e not found on macOS 26 for latest OS
  • GH-1640 fix: NSInternalInconsistencyException: "No response has been sent for this task" - second attempt
  • GH-1637 fix(build): Target a generic iOS simulator for building
  • GH-1632 fix(ionic): Add workaround for Ionic WebView plugin
  • GH-1618 fix(xcode): Fix library search paths for target
  • GH-1621 fix(xcode): Ensure we do NFD normalization on PRODUCT_NAME
  • GH-1616 fix(spm): Ensure the deployment target always gets set
  • GH-1606 fix(webview): Ensure scheme task is always finished
  • GH-1610 fix(scheme): Fix /app_file/ URLs not working
  • GH-1612 fix(spm): Set deployment target in Package.swift
  • GH-1597 fix: ignore spm build artifacts

Others:

  • GH-1650 chore: remove redundant Hello World template files
  • GH-1647 chore(deps): bump @xmldom/xmldom from 0.8.12 to 0.8.13
  • GH-1634 doc: keepCallback and setKeepCallbackAsBool: of CDVPluginResult
  • GH-1646 chore(deps): bump lodash from 4.17.23 to 4.18.1
  • GH-1644 chore(deps): bump @xmldom/xmldom from 0.8.11 to 0.8.12
  • GH-1642 chore(deps): bump picomatch
  • GH-1639 chore: update package-lock
  • GH-1635 chore: Fix improperly ignored deprecation warning
  • GH-1633 chore: Bump patch level for ongoing dev work
  • GH-1628 chore: Fix missing licence headers
  • GH-1627 release(8.0.1): Update release notes and version
  • GH-1623 chore(ci): draft release
  • GH-1624 chore: cleanup license headers
  • GH-1625 chore: add DEVELOPMENT.md & cleanup README.md
  • GH-1622 refactor(versions): Refactor version code for test reliability
  • GH-1619 chore(deps): Update to latest jasmine & c8 versions
  • GH-1614 doc(readme): improve badges
  • GH-1601 chore: Remove compileBitcode from export options
  • GH-1611 chore: set swift-tools-version to 5.9
  • GH-1599 chore(deps): bump lodash from 4.17.21 to 4.17.23
  • GH-1598 Add missing trailing new line
  • GH-1592 doc(readme): add minimum iOS version
  • GH-1591 doc(readme): add Link to iOS Platform Guide
  • GH-1588 chore: update release audit workflow & license headers
Read the whole story
alvinashcraft
2 hours ago
reply
Pennsylvania, USA
Share this story
Delete

Agents That Build Agents: A SKILL-first Blueprint with MS Agent Framework & Foundry

1 Share

The single insight that changes everything

Most "build an AI agent" tutorials collapse two completely different jobs into one tangled mess:

  • the job of building an agent (writing the code, defining its tools, evaluating it, packaging it), and
  • the job of running an agent (planning, reasoning, calling tools, remembering users, delivering outcomes).

Once you separate them, modern agent development becomes a clean two-layer architecture:

Coding Agent sits on top — that's how you produce an agent. A Runtime Agent sits below — that's the agent your business operates. Microsoft Agent Framework is the SDK that ties them together; Microsoft Foundry is the platform both layers publish to and run on.

But the secret ingredient — the thing that turns a generic Copilot into a domain-aware engineer — is the SKILL. SKILL is what the Coding Agent reads before writing a single line. It's how requirements become artifacts that actually match your framework, your conventions, and your fixtures.

This post walks the entire two-layer architecture, in the order you should learn it — with SKILL as the star of Layer 1. We ground every concept in ZavaShop, a fictional global e-commerce company with 5 fulfillment centers, dozens of suppliers, and a CEO who wants one live dashboard for all of it. Both Python and .NET (C#) are first-class — pick the language your team will run in production.

LAYER 1 — The Coding Agent (Build Time)

The Coding Agent is not the agent your customer talks to. It's the agent that constructs the agent your customer talks to. Its output is a bundle of artifacts — code, agent definitions, workflows, skills, connectors, evals, tests, configs, docs — that flow through validation and into Foundry.

Build time has five movements.

Movement 1 — Requirements & Planning

Before the Coding Agent writes a single line, you owe it three things:

  1. A real business pain. Not "let's build an agent." Rather: "Mei, the supervisor at Seattle DC, gets interrupted 60 times a day by stock-level questions."
  2. A list of acceptance criteria. What does "done" look like? "Agent answers stock questions for SKUs in our 10-SKU catalog. P95 latency under 4s. Wrong-tool rate under 5% on the eval set."
  3. The fixtures it'll run on. Real or realistic data — warehouses, SKUs, POs, customers — so the Coding Agent isn't reasoning about a vacuum.

ZavaShop context. The workshop ships workshop/data/ — 5 warehouses, 10 SKUs, 6 POs, 8 suppliers, 5 contracts, 4 customers (3 VIP), 6 orders, 5 carriers, 4 open exceptions. Every artifact the Coding Agent generates is anchored to this shared fixture set, so numbers stay consistent across the entire system.

Movement 2 — The Coding Agent + its SKILL (the star of build time)

This is the movement most teams skip — and it's the one that decides whether your build-time output is professional code or "ChatGPT-shaped" code.

What a Coding Agent actually is

The Coding Agent is GitHub Copilot Chat in Agent Mode, configured with a domain-aware agent definition. In the ZavaShop workshop, it lives at .github/agents/zavashop-coding-agent.agent.md and is activated from the VS Code Agent picker. You start each session with one plain sentence:

"I'm working on the inventory agent in Python — wire up stock and PO lookups against the fixtures, plus a HostedMCPTool for the warehouse handbook."

Notice what's not in that sentence: no library names, no class names, no file paths. The Coding Agent has to fill all of that in. The mechanism it uses is the SKILL.

What a SKILL is

A SKILL is a structured contract that teaches the Coding Agent how to write code in your framework, your conventions, and your domain. It is the most important file in the entire build-time layer — without it, GitHub Copilot is a fluent generalist; with it, it becomes a domain-aware specialist that writes code your tech leads would have written.

Conceptually, a SKILL contains:

SectionPurpose
Scope & when to use"Use this SKILL for building agents on Foundry / Azure AI — tools, MCP, Toolbox, Skills, Memory, Threads"
Framework idiomsThe exact way to construct AzureAIAgentClient, register function tools, wire HostedMCPTool, create a Thread
Code patternsReference snippets the Coding Agent imitates — naming, import order, error handling, type hints
Fixture/data contractHow to load workshop/data/, which loaders exist (find_stock, find_po, etc.), where to add sys.path
Anti-patternsWhat not to do — don't hardcode the model name, don't write inline mock dicts, don't bypass the data loader
Acceptance heuristicsHow to map a LAB's acceptance criteria to runnable checks (eval rows, smoke tests)

A SKILL is versioned with the codebase. When the framework releases a new idiom, you update the SKILL once; every agent built afterwards picks it up automatically. This is the single biggest reason convention drift disappears.

The six SKILLs in the ZavaShop workshop

The workshop ships six SKILLs — three for each language track — and they cover three orthogonal capability surfaces:

TrackSKILLUse it for
🐍 Pythonagent-framework-azure-ai-pySingle agent on Foundry: tools, MCP, Toolbox, Skills, Memory, Threads
🐍 Pythonagent-framework-workflows-pyMulti-agent workflows: WorkflowBuilder, executors, HITL, Checkpoint
🐍 Pythonagent-framework-agui-pyAG-UI server + client: SSE, frontend/backend tools, shared state, HITL
🟦 .NETagent-framework-azure-ai-csharpSame as the Python azure-ai SKILL, for C#
🟦 .NETagent-framework-workflows-csharpSame as the Python workflows SKILL, for C#
🟦 .NETagent-framework-agui-csharpAG-UI in ASP.NET Core: MapAGUI, AGUIChatClient, HITL

How the Coding Agent uses SKILL

The Coding Agent's loop is SKILL-first, code-second:

The discipline is captured in the workshop's one mantra:

"Read the SKILL first."

It is not optional. Skip it and you're back to generic Copilot output.

Movement 3 — A worked example: from a single sentence to a runnable agent

Let's trace what happens when you sit down in front of the Coding Agent and say:

"I'm working on the inventory agent in Python — wire up stock and PO lookups against the fixtures, plus a HostedMCPTool for the warehouse handbook."

Step 1 — The Coding Agent routes

The Coding Agent's definition (.github/agents/zavashop-coding-agent.agent.md) contains a routing table:

It loads both files into context before doing anything else.

Step 2 — The Coding Agent plans

Using the SKILL's patterns and the LAB's acceptance criteria, it generates a plan:

Every item in the plan traces back to either the SKILL (how) or the LAB README (what). No invention, no guessing.

Step 3 — The Coding Agent generates

Now — and only now — it writes code. Because the SKILL specified the patterns, what comes out is shaped like the rest of your codebase:

# scripts/inventory_agent.py import os import sys from pathlib import Path # Pattern from SKILL: add data folder to sys.path, import named loaders sys.path.insert(0, str(Path(__file__).resolve().parents[1] / "data")) from zava_data import find_stock, find_po # noqa: E402 from agent_framework import HostedMCPTool from agent_framework_azure_ai import AzureAIAgentClient from azure.identity import DefaultAzureCredential def stock_lookup(sku: str, warehouse: str) -> dict: """Return on-hand quantity for a SKU at a warehouse.""" return find_stock(sku, warehouse) def po_lookup(po_id: str) -> dict: """Return the status of a purchase order.""" return find_po(po_id) async def main() -> None: client = AzureAIAgentClient( project_endpoint=os.environ["FOUNDRY_PROJECT_ENDPOINT"], model_deployment_name=os.environ["FOUNDRY_MODEL"], # gpt-5.5 from env async_credential=DefaultAzureCredential(), ) agent = client.create_agent( name="Zara", instructions=( "You are Zara, the warehouse assistant for Seattle DC (SEA-01). " "Use the tools when asked about stock or POs. " "Answer briefly." ), tools=[ stock_lookup, po_lookup, HostedMCPTool( name="warehouse-docs", url=os.environ["WAREHOUSE_DOCS_MCP_URL"], ), ], ) thread = agent.get_new_thread() print(await agent.run("How many SKU-7421 are at SEA-01?", thread=thread))

Notice the things the SKILL enforced without you having to ask: env-var-driven config, named function tools with docstrings, sys.path data-loader pattern, HostedMCPTool placed alongside function tools, Thread for multi-turn.

Step 4 — The Coding Agent validates

The SKILL also told it how to validate. The Coding Agent runs:

  • a smoke test against fixtures (SKU-7421 @ SEA-01 → 312),
  • the eval set (eval_queries.jsonl) — was the right tool called? did the answer contain the expected fact?
  • a red-team probe round.

It reports back: "3/3 acceptance criteria pass. Eval score 5/5. Red-team: no successful prompt injections."

Step 5 — Done

What landed in your repo is not just a script. It's an artifact bundle — code + agent definition + tools + eval rows + a one-page README — that matches the way your team writes agents. That bundle is what flows into the next three movements.

Movement 4 — Agent Artifacts (the outputs)

A well-instructed Coding Agent produces eight kinds of artifact. Together they make up "an agent" in the deployable sense:

ArtifactWhat it isWhy it matters
Source codeThe Agent / Workflow programVersioned, reviewable, diffable
Agent definitionsName, instructions, tool listThe "personality" — independently editable
WorkflowsWorkflowBuilder graphsMulti-agent orchestration as code
SkillsNamed, packaged behaviorsReusable capabilities — one Skill, many agents
ConnectorsMCP servers, Toolbox registrationsWhere the agent reaches into the world
Evalseval_queries.jsonl and harnessRegression target for every prompt change
Tests & configsUnit tests, .env schema, deployment manifestsReproducibility
DocumentationREADMEs, runbooksThe agent your future self can operate

Don't confuse two senses of "skill" here. A SKILL file (uppercase, in .github/skills/) instructs the Coding Agent at build time. An Agent Skill (a Foundry concept) is a named runtime capability the Runtime Agent calls. Both names are deliberate — Layer 1's SKILL produces, among other artifacts, Layer 2's Skills.

Movement 5 — Validation

Before any artifact reaches Foundry, four gates run:

  1. Tests — unit + integration. Did find_stock("SKU-7421", "SEA-01") return 312, the value in the fixture?
  2. Lint & types — ruff/mypy on Python, dotnet build warnings on .NET. The model has to read these signatures; sloppy ones cause real bugs.
  3. Evaluation — run the eval set. Did the right tool get called? Did the answer contain the expected fact? You need a score, not a vibe.
  4. Red-Team probes — adversarial inputs that try to drift the agent off topic or extract another customer's data. The Foundry red-team SDK ships a battery of these.

Evangelist takeaway. "We built an agent" is not a deliverable. "We built an agent and here is its pass rate on a versioned eval set, plus a red-team report" is a deliverable. Validation belongs at build time, not "we'll add it later."

Movement 6 — Publish & Deploy

When validation is green, the Coding Agent's outputs flow into Foundry and Azure:

  • Push to Microsoft Foundry — agent definitions, Skills, Toolbox tools, and custom evals register against your Foundry project. They are now governed, versioned, and observable.
  • Deploy to Azure — the runtime host (AG-UI server, workflow worker, Teams app, API surface) ships to your Azure target (App Service, Container Apps, AKS, Functions). Same env vars drive local dev and cloud.

The same artifact set deploys to dev, staging, and production. There is no "production-only" code in your agent.

LAYER 2 — The Runtime Agent (Runtime)

Now the agent is live. Every conversation, every action against your data, every memory it writes — that's Layer 2. Five concerns define it.

Concern 1 — Users & Channels

A Runtime Agent reaches users through the channels they already use:

  • Microsoft Teams — the agent shows up where work already happens.
  • Outlook — triage, reply, summarize, schedule.
  • Custom web / mobile / voice — built on AG-UI, which ships a React client covering streaming text, frontend tools, backend tools, shared state, generative UI, predictive updates, HITL prompts.

The channel is a deployment choice, not an architectural choice. The same agent definition can surface in Teams and on a React dashboard.

ZavaShop context. Mei's agent shows up in Teams. The CEO's control tower is a React app on top of AG-UI. The agent definition behind both is the same artifact set the Coding Agent produced.

Concern 2 — The Runtime Agent itself

The Runtime Agent is the loop you've heard about a thousand times — now it's a concrete piece of architecture:

AIAgent = model + instructions + tools + thread

Inside the loop:

  1. The model plans & reasons about the next step.
  2. It calls tools through MCP, Toolbox, or local functions.
  3. It reads & writes memory.
  4. It streams output back to the channel.
# Python — the runtime shape (exactly what the Coding Agent produced) agent = client.create_agent( name="Zara", instructions="You are Zara, the warehouse assistant for Seattle DC.", tools=[stock_lookup, po_lookup, warehouse_docs_mcp], )

Concern 3 — Tools & Integrations (the runtime capability surface)

At runtime, a Runtime Agent reaches the outside world through four kinds of capabilities — and which one to use is a real engineering decision:

CapabilityLives inUse when
Function toolThe agent's own processLocal code: a calculation, a DB query, a fixture lookup
MCP toolAn external MCP serverThe capability is owned by another system, exposed via MCP
Toolbox toolThe Foundry project (server-side, tenant-wide)Capability is shared by multiple agents, must be governed
Agent SkillThe Foundry projectA combination of tools + policy as one named capability

Mental progression:

You don't have to start with Toolbox — but the moment a second agent touches the same domain, migrate.

ZavaShop context. Local fixtures → function tools. The warehouse handbook → MCP. Supplier-portal connectors shared by procurement, fulfillment, and finance → Toolbox tools. "Validate-PO-against-contract" → an Agent Skill.

Concern 4 — Memory & State

State at runtime comes in two flavors:

Thread = state inside one conversation

thread = agent.get_new_thread() await agent.run("Look up PO-1043.", thread=thread) await agent.run("And its supplier?", thread=thread) # knows which PO

Memory = state across conversations

Foundry Memory is durable, retrievable knowledge about a user — VIP status, packaging preferences, delivery windows. Memory holds stable preferences and facts, not chat transcripts.

ZavaShop context. Customer service agent Aria remembers across sessions that C-204 is VIP, prefers no cardboard, and wants 6–8pm delivery.

Concern 5 — Actions & Outcomes

Real systems take actions that change state and produce outcomes other systems observe:

  • Trigger events — kick off a workflow, page a human.
  • Generate outputs — write a PO, draft an email, push to a record.
  • Notify channels — send back to Teams, update a dashboard, hit a webhook.
  • Observability — every action streams to Application Insights / Azure Monitor.

This is also where Workflows live. WorkflowBuilder is Agent Framework's orchestration primitive:

Three workflow features matter most:

  1. Reuse, don't rebuild — tools written at build time are workflow nodes at runtime.
  2. Human-in-the-Loop (HITL) — pauses, asks a human, resumes from the exact step.
  3. Checkpointing — workflows survive process restarts.

ZavaShop context. Fulfillment director Diego's team handles a $10K+ exception every day. Before: an email chain across 5 teams. After: a WorkflowBuilder graph with one HITL approval and full audit trail.

Cross-cutting: the shared services that make this safe

Both layers sit on top of platform services non-negotiable for enterprise deployment:

ServiceWhat it does for your agents
Microsoft Entra IDWho is the user? Who is the agent? Managed identity for tool calls
Microsoft Defender for CloudThreat detection across the agent's compute + data plane
Microsoft SentinelSIEM — correlate agent actions with security signals
Azure Key VaultSecrets, keys, connection strings — never in code, never in .env checked to git
Azure Monitor / App InsightsEvery agent turn, every tool call, every workflow step — observable and queryable
Azure Policy & governanceGuardrails on what can be deployed where, by whom

Skip this row and you have a demo that has not yet failed.

Mapping the ZavaShop workshop to the architecture

Layer 1 artifacts shipped in the repo:

  • .github/agents/zavashop-coding-agent.agent.md — the Coding Agent definition
  • .github/skills/agent-framework-{azure-ai,workflows,agui}-{py,csharp}/ — the six SKILLs
  • workshop/data/ — shared fixtures every artifact grounds in
  • Per-lab READMEs + eval_queries.jsonl — Layer 1 validation inputs

Layer 2 artifacts produced over the course of the workshop:

  • A single agent (Zara) — function tools + HostedMCPTool + Thread
  • A procurement agent (Pierre) — Toolbox + Agent Skills + approval policy
  • A customer-service agent (Aria) — Foundry Memory + Evaluation + Red-Team
  • A multi-agent fulfillment workflow (Diego) — WorkflowBuilder + HITL + Checkpoint
  • An AG-UI control tower for the CEO — covering all 7 AG-UI features

Same model across the stack — gpt-5.5 on Foundry + text-embedding-3-small. Change one env var, run the same artifact in the other language.

Three habits that separate strong agent engineers

  1. Read the SKILL first. Make it ritual. The Coding Agent does it automatically; you should do it manually when reviewing the agent's output.
  2. Treat tools as a public API. Names, signatures, docstrings, return shapes — they are how the model sees your system at runtime. Refactor them like any other API.
  3. Measure before you tune. A prompt change without an eval delta is a vibe. With one, it's engineering.

Getting started in 60 seconds

git clone https://github.com/microsoft/Learn-Microsoft-Agent-Framework-with-Foundry-ZavaShop-Supply-Chain-Workshop cd Learn-Microsoft-Agent-Framework-with-Foundry-ZavaShop-Supply-Chain-Workshop # Foundry prereqs: gpt-5.5 + text-embedding-3-small deployed in your Foundry project az login --use-device-code # Python track python -m venv .venv && source .venv/bin/activate pip install agent-framework agent-framework-azure-ai agent-framework-ag-ui \ azure-identity python-dotenv fastapi "uvicorn[standard]" # .NET track dotnet --version # ≥ 10.0.100 # .env at repo root cat > .env <<EOF FOUNDRY_PROJECT_ENDPOINT=https://<your-project>.services.ai.azure.com/api/projects/<project-name> FOUNDRY_MODEL=gpt-5.5 AZURE_OPENAI_EMBEDDING_MODEL=text-embedding-3-small AGUI_SERVER_URL=http://127.0.0.1:5100/ AG_UI_API_KEY=zava-control-tower-demo-key EOF # In VS Code → Copilot Chat → Agent Mode → pick zavashop-coding-agent # Then say: "I'm working on the inventory agent in Python — meet Mei."

The one mantra: "Read the SKILL first."

Closing thought

Modern agent development is not one job — it's two. The Coding Agent designs and builds; the Runtime Agent operates and delivers. Microsoft Agent Framework is the SDK that makes both layers feel like the same conceptual model. Microsoft Foundry is the platform both layers publish to and run on.

And the engine that turns a generic Copilot into a domain-aware engineer — that takes a sentence-long requirement and lands a runnable, validated, deployable artifact — is the SKILL. Write a good SKILL once, and every agent built afterwards inherits your team's taste, your fixtures, your patterns, your discipline.

The ZavaShop workshop is the smallest end-to-end example I can give you that actually exercises both layers, with six SKILLs ready to read. Walk it once, and the next time someone asks "how do we build agents in our org?", you won't be pointing at a tutorial — you'll be pointing at an architecture.

👉 Start with the workshop on GitHub

Read the whole story
alvinashcraft
3 hours ago
reply
Pennsylvania, USA
Share this story
Delete

GitHub Copilot Agent for Unit Tests: My Real-World Spargine Experiment

1 Share
After experimenting with the GitHub Copilot Agent during the 2026 Microsoft MVP Summit, the author faced numerous challenges, including code deletion, slow performance, and inconsistent adherence to coding conventions. Despite these issues, the agent added valuable unit tests to the Spargine projects, but it requires careful oversight and refining of prompts for effective use.
Read the whole story
alvinashcraft
3 hours ago
reply
Pennsylvania, USA
Share this story
Delete

VS Code 1.122 Makes BYOK Easier

1 Share

⚠ This blog post was created with the help of AI tools. Yes, I used a bit of magic from language models to organize my thoughts and automate the boring parts, but the geeky fun and the 🤖 in C# are 100% mine.

One of the most interesting announcements in the recent VS Code 1.122 release is that Bring Your Own Key (BYOK) models can now be used without signing in to GitHub. VS Code now supports BYOK scenarios, including chat, tools, and MCP integrations, without requiring GitHub authentication, making enterprise, offline, and air-gapped workflows much easier to support.

I originally came across this update thanks to a tweet from Pierce Boggan:

This immediately caught my attention because it connects directly with the experiments I’ve been running around GitHub Copilot CLI, local models, BYOK, SQUAD, and AI-assisted software engineering.

Quick recap:

CPU-only local models
Slow, useful mostly for questions and tiny tasks.
https://elbruno.com/2026/05/03/running-github-copilot-cli-offline-with-local-models-a-cpu-only-reality-check/

GPU local models
Faster, with more room for local agent experiments.
https://elbruno.com/2026/05/06/running-github-copilot-cli-offline-with-local-models-gpu-edition/

GPT-5-mini BYOK
Capable, but stabilization, quality gates, and validation became the real work.
https://elbruno.com/2026/05/11/github-copilot-cli-gpt-5-mini-byok-the-code-was-cheap-the-quality-gates-were-expensive/

GPT-5.5 BYOK
The most disciplined run so far, with better phase control, quality gates, and manual UX validation.
https://elbruno.com/2026/05/14/github-copilot-cli-squad-gpt-5-5-byok-better-engineering-same-hard-truth/

The new VS Code 1.122 BYOK flow feels like a natural next step for this series.

I haven’t rerun the ElBruno.NetAgent experiment with this new capability yet, but it is definitely on my list.

The lesson remains the same:

The code keeps getting cheaper. Validation and engineering discipline are still expensive.

Happy coding!

Greetings

El Bruno

More posts in my blog ElBruno.com.

More info in https://beacons.ai/elbruno




Read the whole story
alvinashcraft
3 hours ago
reply
Pennsylvania, USA
Share this story
Delete

Build agents, not pipelines

1 Share

There are only two ways to use LLMs in a computer program: as part of a pipeline, or as an agent. In other words, either you express the control flow of the program in code, or you give a LLM tools and allow it to manage the control flow itself1.

Here’s how you might structure a trivial “summarize a bunch of information and email it to me” program as a pipeline:

context = gather_context(various, data, sources)
llm_response = llm_summarize(context)
summary = parse(llm_response)
email_me(summary, my_email)

And here’s how you’d do it as an agent:

read_data_tool = build_read_data_tool(various, data, sources)
email_tool = build_email_tool(my_email)
run_agent(tools: [read_data_tool, email_tool])

It’s like the difference between a library and a framework. When you use a library, you define the structure of the program yourself, and call out to various library helpers along the way. When you use a framework, the main structure of the program lives in the framework, and it calls your code at various points. There are tradeoffs involved in both approaches. Frameworks let you get started more quickly and typically give you features “for free”, but can be difficult when you want to do something that isn’t part of the framework’s design. Libraries give you a lot more control, but require you to write (and maintain) more boilerplate code.

In the trivial case, the distinction between a pipeline and an agent melts away. If you only have a few paragraphs of possible context for the problem, an agent with a gather_context and an email_me tool will perform exactly the same steps as a pipeline that calls a reasoning model with the context injected into the prompt (i.e. the agent will reproduce the trivial control flow of your pipeline). But when you have more context than will fit into a single prompt, or you want to take an action and then react to the result, the choice between pipelines and agents becomes very significant.

Predictability, flexibility and intelligence

Pipelines are more predictable, but agents are more flexible. When you give a problem to an agent, work stops when the LLM thinks it’s done. Depending on the perceived difficulty of the problem, this can take anywhere from a few LLM turns to hundreds (and thus cost anywhere from a few cents to many dollars). If you’re building something intended to run at scale, this unpredictability can be a nightmare. Any subtle change to the user data could cause the LLM to take twice as long on each task, which would double your latency2 and cost.

Pipelines are only immune to this problem if they don’t use reasoning models, or don’t allow the model to “think out loud” in its output tokens (for instance, by using structured output). However, individual LLMs offer much tighter control over model reasoning than over how long an agentic loop will take. In all frontier model APIs, you can explicitly set the level of reasoning you want. That doesn’t give you total control, but it does cap “take longer” at maybe ten or twenty percent (instead of with agents, where it can be 2x or more).

Why use agents, then? Agents are smarter. If you’re happy to accept the unpredictability, an agentic system can handle much more difficult tasks, by virtue of being able to loop for longer, and to gather more information after thinking about the problem. There’s a reason that the most successful AI products (coding agents like Claude Code, Codex, Cursor, and Copilot3) are agents: coding is a hard enough task that you simply cannot build a functional coding agent with pipelines.

Context-gathering

The context-gathering stage is far more delicate for pipelines than for agents. If an agent is trying to solve a problem and realizes it needs more data, it can simply go and get it. But for a pipeline, all the required data has to be present in the context already, because the LLM only gets to run once.

Much of the work involved in building pipelines is in getting context-gathering right. Agents are much easier. For instance, with a coding agent, you can basically just provide a “grep” and “read file” tool and let the agent figure out what chunks of code are relevant to the current file. In a pipeline, you have to figure that out yourself: good luck, it’s an unsolved technical problem! Typically you’ll end up doing some set of clever tricks, like walking the AST to identify which parts of code “contribute” to the current file, or indexing the whole codebase with semantic embeddings and doing some kind of nearest-neighbor search to build the context (called RAG, or “retrieval-augmented generation”). Neither of these will work as well as using an agent.

In 2023 and 2024, many people believed that RAG would solve context-gathering. Every LLM would have a fully-indexed context base that would magically surface the precise information the LLM needed at any given moment. This did not happen. Instead, we went backwards, getting our agents to do plain-text search and figure it out like a human would. Why didn’t RAG work? This is a topic for a whole other post, but the short answer is this: “find what information is relevant to this problem” is often as hard a task as actually solving the problem. Semantic embeddings and cosine similarity are simply not powerful enough tools for the job.

Multi-model pipelines

Pipelines that make multiple LLM invocations do have an extra dimension of flexibility: they can use different LLMs for different tasks. For instance, if one LLM benchmarks better at task A, or is cheaper for an easier task B, you can use the right model for the job. Agents (at least right now) have to stay the same model the whole time, so you’re always pinned to the highest level of intelligence you need.

Is this a big deal? I’m suspicious. One pattern I see a lot is tasking a cheaper model with collating or summarizing data for a smarter model to do something with. But often the signal is in the raw data itself! I think designs like this are really shooting themselves in the foot, for the same reasons that RAG didn’t work: context-gathering was a harder problem than people anticipated.

In any case, if you do want to farm out tasks to different models, you can also do it via careful agentic tool design. For instance, you could build your web_search tool so that it uses a cheap model to summarize web pages.

Small contexts and future-proofing

Pipelines allow working with smaller contexts, and thus with local models. An agent’s ability to fetch its own context means that it almost always ingests more data than it needs. On top of that, agents run in loops, so each agent turn increases the size of the context. This isn’t a big problem for systems built on top of frontier model APIs, because:

  • frontier models all expose large context windows,
  • frontier models tend to hold up pretty well for the first 200k tokens, and
  • KV caching means that passing around the same large context block is surprisingly cheap.

However, it is a big problem for local models. The context window consumes a lot of VRAM, so most people running local models stay below 32k (or even 6k) tokens. If you’re writing a program to run in this environment, you likely will not be able to give an agent the space it needs, and you will be instead forced to use a pipeline.

In my opinion, agents are more future-proof. This is partly because models are now being explicitly built to be better agents, and partly because agents delegate more to the LLM and thus benefit more from LLM improvements. If you have a pipeline-based system, new models will probably do a bit better than old ones. If you have an agentic system, new models might do much better than old ones (to the point that it’s worth building an agentic systems for tasks that are currently too hard, on the assumption that by the time you’ve finished the models may be good enough). I have been banging this drum since 2023, before tool-calling was even a part of model APIs.

Safety and legibility

In general, I disagree with the popular advice that workflows are safer than agents. Workflows offer more control over budget, but when it comes to taking action based on LLM output, you have exactly the same problem whether you’re checking at the tool-call level or at the next stage in the pipeline: either you make some heuristic assessment via code, which might be wrong, or you queue the action up for a human to approve, which will be slow.

Don’t agents open you up to prompt injection? Yes, but pipelines do too. In both cases, you’re feeding some block of human-generated data (e.g. the files in a codebase, or the results of a web search) into the LLM. Any prompt injections in that data will be consumed by the LLM just the same whether they’re the result of a tool-call or directly injected into the prompt by the pipeline. You have to sanitize user content and double-check LLM-triggered actions, no matter what design you choose4.

I do want to acknowledge that pipelines are slightly more legible. You can trace most of what a pipeline is doing because you’re in control over more of it. It’s harder to figure out why an agent queried for a particular piece of information or took some action. But even in a pipeline, you’ll never know for sure why the LLM responded in the way it did. That’s just what it means to program with LLMs.

LLM-driven mass surveillance

Let’s apply some of these principles to a real-world, non-trivial example. Suppose you are the NSA, and you are attempting to use LLMs to get a grip on the wild firehose5 of covert email surveillance data6. Should you use pipelines or agents? Well, if you’re building something that’s supposed to run on every single piece of email in America, you probably shouldn’t use agents: keeping performance and cost strictly bounded requires a pipeline. However, you’re definitely well-resourced enough to use agents in general, and the problem is definitely hard enough to benefit from the extra intelligence. I’d probably recommend using both: a low-context, cheap pipeline that can run once against each email and flag it, and a fleet of agents that can dig into those flags, make ordinary queries, and act more like human analysts would.

The pipeline would have to scale with the total volume of data, which should be mostly fine, since pipelines scale in a predictable-ish manner. The fleet of unpredictable agents can be scaled entirely independently, though in practice it would get bottlenecked on GPU availability and the necessity for human review. The majority of the engineering work7 would likely go into context-assembly for the pipeline: feeding in enough data about who’s involved in the email conversation so that the LLM can make a sensible decision on whether or not to flag it.

Summary

Overall, I’d suggest following these guidelines:

  1. Use pipelines when you have strict requirements around context size
  2. Use pipelines when you need to be able to accurately predict (or limit) GPU cost
  3. Use pipelines when you have to use local models
  4. Use agents when you’re not confident you’ll be able to assemble all of the relevant context in one shot
  5. Use agents when the problem is hard enough that you’re not sure a pipeline will be able to solve it

When in doubt, use agents. I am aware of several AI projects that have migrated from pipelines to agents in the last year, but none that have gone the other way around. As a general point about software design, if you’re not sure what to do, pick the solution that’s easier to build and more likely to be able to solve your actual problem. If you want to change to a cheaper, pipeline-based system later on, at least you’ll be able to compare it to a working agentic design and make an informed decision.


  1. This distinction was popularized by Anthropic’s Building effective agents, written in December 2024, and now (I believe) made at least partially obsolete by advances in agents since then. They say “workflow”, but I slightly prefer the term “pipeline”.

  2. Yes, I know this is technically not what “latency” means, but there’s no other single-word shorthand for “the duration of a standard unit of work”.

  3. If you’re building your own coding agent, I suggest you begin with the letter “C”.

  4. For instance, in my trivial example at the top of the post, doesn’t the agent have a failure mode where it might send a ton of emails, or email a bunch of different people? No, because you ought to constrain the email tool so that it can only send to the right address, and (if this is important) that it can only be called once.

  5. In a draft post I never published, I ballpark-estimated all non-spam American email data at around seven trillion tokens per day (around a third of OpenAI’s total daily token usage).

  6. Should you do this? Probably not, but it’s a fascinating engineering problem, and I imagine the NSA has been thinking about these questions for several years by now. If the example bothers you, substitute some other more-ethical firehose of English language.

  7. Not counting evals, operations, standing up a trusted GPU cluster somewhere, scaling the physical hardware, and all the other thousand things you have to do in order to ship anything.

Read the whole story
alvinashcraft
3 hours ago
reply
Pennsylvania, USA
Share this story
Delete
Next Page of Stories