Sr. Content Developer at Microsoft, working remotely in PA, TechBash conference organizer, former Microsoft MVP, Husband, Dad and Geek.

Koog 1.0 Is Out: Stable Core, Better Interop, and Multiplatform Observability

1 Share

Last week at the KotlinConf 2026 keynote (watch the recording here), we announced Koog 1.0.

Koog is JetBrains’ open-source framework for building AI agents in Kotlin and Java. It provides the core building blocks for agentic applications: tools, workflows, persistence, memory, observability, and integrations with existing JVM and Kotlin Multiplatform projects.

We introduced Koog at KotlinConf last year. Since then, the framework has evolved through community feedback, internal use, and several public releases. Koog 1.0 is the next step: a more stable foundation for building reliable enterprise-ready agents.

What’s new in Koog 1.0

The biggest change in 1.0 is a strict commitment to stability. To give you a solid foundation for production, we guarantee no breaking changes for stable modules for at least one year.

This release also brings several major improvements across the framework:

  • Local Android AI: New provider integrations, featuring support for running LiteRT models locally on Android devices.
  • A redesigned Java interop layer with a cleaner and more consistent API.
  • Decoupled HTTP transport, which makes it easier to integrate Koog into existing infrastructure and use different HTTP clients.
  • OpenTelemetry support across Koog targets, including Kotlin Multiplatform environments.
  • Improved persistence and memory support for long-running agents.
  • Anthropic prompt caching support to help reduce latency and token costs for repeated prompts.

Koog 1.0 also includes many fixes, API cleanups, and migration improvements that prepare the framework for a more stable long-term evolution. For the full list of changes, see the Koog 1.0 release notes.

Try Koog 1.0

Koog 1.0 marks the framework’s move to a stable core API.

If you’re building agents that need tools, structured workflows, persistence, memory, observability, or integration with existing Kotlin and JVM applications, this release gives you a sturdier foundation to build on.

Explore the docs, update your dependencies, and start with the stable core modules. Add Beta modules only where you need functionality that is still evolving.

Thank you

We’d like to thank everyone who tried Koog, submitted issues, shared feedback, and contributed to the project over the past year. Koog 1.0 reflects a lot of that input, and we’re excited to keep building it with the community.

Read the whole story
alvinashcraft
8 hours ago
reply
Pennsylvania, USA
Share this story
Delete

Proposed Advancement of Pointer Events Level 3 to W3C Recommendation


Today, the W3C Team proposed advancing Pointer Events Level 3 to W3C Recommendation. This specification was published by the Pointer Events Working Group as a Candidate Recommendation Draft on 22 May 2026. The features in this specification extend or modify those found in Pointer Events, a W3C Recommendation that describes events and related interfaces for handling hardware-agnostic pointer input from devices including a mouse, pen, or touchscreen. For compatibility with existing mouse-based content, this specification also describes a mapping to fire Mouse Events for other pointer device types.

This revision includes new features: altitudeAngle, azimuthAngle, pointerrawupdate event, associated coalesced events, and built-in predicted events. This revision of Pointer Events is intended to supersede Pointer Events Level 2.


Hybrid AI Agents in Python: Routing Between Foundry Local and Microsoft Foundry


Why hybrid, and why now

If you build AI features today, you are caught between three forces. Users want low latency and strong privacy. Product teams want frontier reasoning capability. Finance teams want predictable cost. No single model satisfies all three. Run everything on a small on-device model and you bottleneck on complex questions. Send everything to a frontier cloud model and you pay for trivial requests, leak sensitive data across a network boundary, and add hundreds of milliseconds of latency to greetings.

The pragmatic answer is hybrid inference: a lightweight local model classifies every request first, simple or sensitive ones stay on the device, and only the genuinely hard or frontier-capability requests escalate to the cloud. Microsoft now ships both halves of that pattern as supported Python SDKs — foundry-local-sdk for on-device inference and azure-ai-projects for Microsoft Foundry cloud models. This post walks through a working reference implementation that combines them behind a single ask() call.

The full source is at github.com/leestott/fl-mixedmodel. It is Python-only, secretless by design, and ships with a Gradio diagnostics UI, a CLI demo mode, and a full pytest suite.

The contract: one schema, two paths

The most important architectural decision is that callers never know which path served a request. Every response, local or cloud, returns the same dataclass:

from dataclasses import dataclass, field
from enum import Enum
from typing import Optional

class InferencePath(str, Enum):
    LOCAL = "local"
    CLOUD = "cloud"
    LOCAL_FALLBACK = "local_fallback"   # cloud attempted, fell back to local
    CLOUD_FALLBACK = "cloud_fallback"   # local attempted, fell back to cloud

@dataclass
class AgentResponse:
    answer: str
    path: InferencePath
    model: str
    reason: str
    confidence: float
    latency_ms: float
    correlation_id: str
    prompt_tokens: Optional[int] = None
    completion_tokens: Optional[int] = None
    fallback: bool = False
    fallback_reason: Optional[str] = None
    metadata: dict = field(default_factory=dict)

This is what makes the design honest. The router can change, the cloud model can be swapped from gpt-4o to gpt-5.4, fallback policies can flip — and the calling code never breaks. The four InferencePath values give you full observability without leaking implementation details into the API surface.

Architecture in one diagram

┌─────────────┐   prompt    ┌──────────────────────────┐
│   caller    │ ──────────► │   HybridAgentService     │
└─────────────┘             │      .ask(prompt)        │
                            └────────────┬─────────────┘
                                         │
                            ┌────────────▼─────────────┐
                            │     RoutingPolicy        │
                            │  1. Heuristic gate       │
                            │  2. Local router LLM     │
                            │  3. Hard policy gates    │
                            └─────┬─────────────┬──────┘
                                  │             │
                          LOCAL  ◄┘             └► CLOUD
                                  │             │
                       ┌──────────▼──┐   ┌──────▼───────┐
                       │ Foundry     │   │ Microsoft    │
                       │ Local SDK   │   │ Foundry      │
                       │ (phi-4-mini)│   │ (gpt-5.4)    │
                       └─────────────┘   └──────────────┘

Best practice: the two-stage router pattern

Before walking through the implementation, it is worth stating the design pattern explicitly, because it is the part that generalises beyond this specific repo. The cleanest design for hybrid inference is a two-stage router.

  1. Stage 1 — local router. A small local model performs intent and complexity classification first. It does not answer the question; it decides where the question should go.
  2. Stage 2 — route the answer.
    • If the prompt is simple, private, latency-sensitive, or clearly within local capability, route to a local task model on the device.
    • If the prompt is complex, needs deeper reasoning, a larger context window, or a capability unavailable locally, escalate to a cloud frontier model in Microsoft Foundry.
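
The two stages above can be sketched as a small dispatch function. This is a minimal illustration, not the reference repo's code: Route, two_stage_ask, and toy_classifier are hypothetical names, and the word-count rule is a placeholder standing in for a real local router model.

```python
from dataclasses import dataclass
from typing import Callable, Tuple


@dataclass
class Route:
    target: str  # "local" or "cloud"
    reason: str


def two_stage_ask(
    prompt: str,
    classify: Callable[[str], Route],
    answer_local: Callable[[str], str],
    answer_cloud: Callable[[str], str],
) -> Tuple[str, Route]:
    """Stage 1 decides where the prompt goes; stage 2 produces the answer."""
    route = classify(prompt)  # stage 1: classification only, no answer yet
    handler = answer_cloud if route.target == "cloud" else answer_local
    return handler(prompt), route  # stage 2: answer on the chosen path


def toy_classifier(prompt: str) -> Route:
    # Hypothetical stand-in for the local router model.
    if len(prompt.split()) > 12:
        return Route("cloud", "long prompt, assumed complex")
    return Route("local", "short prompt, assumed simple")


answer, route = two_stage_ask(
    "hello",
    classify=toy_classifier,
    answer_local=lambda p: f"[local] {p}",
    answer_cloud=lambda p: f"[cloud] {p}",
)
print(route.target, route.reason, answer)
```

Passing the classifier and the two answer functions as callables keeps the routing logic testable without any model installed.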

Microsoft's current guidance for the cloud side is to use the Responses API and choose one of two control modes:

  • Pass a specific deployment name (for example gpt-5.4) when you want deterministic control over which model serves the request, which is the right choice for regulated workloads, repeatable evaluations, or cost ceilings.
  • Pass model-router as the deployment when you want Microsoft Foundry to automatically select the best available cloud model for each request. This is a sensible default for general-purpose agents where you would rather let the platform optimise the model choice as new ones are released.

The reference repo exposes both as environment variables so you can switch without code changes:

# .env.example
FOUNDRY_CLOUD_MODEL_DEPLOYMENT=gpt-5.4        # deterministic
FOUNDRY_CLOUD_ROUTER_DEPLOYMENT=model-router  # auto-select

Best practice: pin the right SDK versions

Two SDKs do the heavy lifting and both have had recent breaking changes, so version discipline matters.

  • Local development: foundry-local-sdk. The current public guidance is to use the Foundry Local SDK package foundry-local-sdk, which provides model discovery, download, cache, load, unload, chat completions, embeddings, audio transcription, and an optional built-in web service. Use version 1.1.0, released on 5 May 2026. Earlier versions used an OpenAI-compatible client surface that has since been replaced by the FoundryLocalManager → load_model → get_chat_client → complete_chat chain used in the Stage 2 router code later in this post. Declare it explicitly; note that >= sets a minimum rather than an exact pin, so use == where builds must be reproducible:
    # requirements.txt
    foundry-local-sdk>=1.1.0
  • Cloud orchestration and agents: azure-ai-projects. For cloud-side orchestration, Microsoft's current Python guidance is to use azure-ai-projects, which the docs describe as part of the Microsoft Foundry SDK and as the entry point for agents, deployments, connections, datasets, evaluations, and an OpenAI-compatible client returned by get_openai_client(). The current PyPI listing shows azure-ai-projects 2.1.0. Declare it explicitly in the same way:
    # requirements.txt
    azure-ai-projects>=2.1.0
    azure-identity>=1.17.0

If you find yourself reading old samples that import azure.ai.inference as the cloud entry point, or that initialise Foundry Local through a raw openai.OpenAI(base_url=...) client, you are looking at pre-2026 patterns. The current shape is what the reference repo uses: FoundryLocalManager.initialize(Configuration(...)) for the device and AIProjectClient(...).get_openai_client() for the cloud.

Stage 1: a deterministic privacy gate

Before any model touches a prompt, a deterministic heuristic classifier scans for sensitive patterns — passwords, API keys, SSN/NHS numbers, PII signals, explicit "do not share" flags. If the heuristic returns PrivacyClass.RESTRICTED, the prompt is forced local. The router LLM is not called. The cloud provider is not called. The decision is auditable from a single regex pass.

# app/routing/policy.py
def decide(self, prompt: str, correlation_id: str = "") -> RoutingDecision:
    hint, privacy, complexity, h_reason = self._heuristic.classify(prompt)

    # Hard gate: restricted content never leaves the device
    if privacy == PrivacyClass.RESTRICTED:
        return self._make_decision(
            target=RouteTarget.LOCAL,
            confidence=1.0,
            reason=f"Policy hard-gate: {h_reason}",
            privacy=privacy,
            complexity=complexity,
            deterministic=True,
            correlation_id=correlation_id,
        )

    # Hard gate: very high complexity always goes to cloud
    if complexity == ComplexityBand.VERY_HIGH:
        return self._make_decision(
            target=RouteTarget.CLOUD,
            confidence=1.0,
            reason="Policy hard-gate: very_high complexity requires frontier model",
            ...
        )

This is the most important responsible-AI control in the whole system. If your privacy review depends on an LLM correctly classifying every prompt, you do not have a privacy control — you have a probability distribution. Deterministic gates first, model judgement second.
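
A deterministic gate of this kind can be sketched as a single regex pass. The patterns below are illustrative assumptions, not the repo's actual list, and PrivacyClass.PUBLIC and classify_privacy are hypothetical names (only RESTRICTED appears in the excerpt above).

```python
import re
from enum import Enum
from typing import Tuple


class PrivacyClass(str, Enum):
    PUBLIC = "public"          # hypothetical name for "nothing sensitive found"
    RESTRICTED = "restricted"


# Illustrative patterns only; a production list would be reviewed and far longer.
_RESTRICTED_PATTERNS = [
    re.compile(r"\b(password|passwd|api[_ -]?key|secret)\b", re.IGNORECASE),
    re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),  # US SSN shape
    re.compile(r"\bdo not share\b", re.IGNORECASE),
]


def classify_privacy(prompt: str) -> Tuple[PrivacyClass, str]:
    """Return the privacy class and an auditable reason, with no model involved."""
    for pattern in _RESTRICTED_PATTERNS:
        if pattern.search(prompt):
            return PrivacyClass.RESTRICTED, f"matched {pattern.pattern!r}"
    return PrivacyClass.PUBLIC, "no sensitive pattern matched"


print(classify_privacy("my password is hunter2, suggest a stronger one"))
print(classify_privacy("hello"))
```

Because the reason string names the exact pattern that fired, the decision can be replayed and audited later without rerunning any model.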

Stage 2: a local LLM as the router

For everything that passes the privacy gate, a small local model classifies whether the prompt needs frontier capability. This is the bit that surprises most engineers: you can do useful routing with a 4B parameter model running on a laptop CPU. The router does not need to answer the question. It only needs to classify it.

The reference implementation uses phi-4-mini via Foundry Local. Initialising it takes only a few lines:

# app/providers/local_provider.py (excerpt)
from foundry_local import FoundryLocalManager
from foundry_local.models import Configuration

self._manager = FoundryLocalManager.initialize(
    Configuration(app_name="hybrid-agent")
)
self._router_model = self._manager.load_model(self._config.local_router_alias)
self._chat_client  = self._router_model.get_chat_client()

response = self._chat_client.complete_chat(
    messages=[
        {"role": "system", "content": ROUTER_SYSTEM_PROMPT},
        {"role": "user",   "content": prompt},
    ],
)

The router prompt asks for a strict JSON response: { "target": "local|cloud", "confidence": 0.0-1.0, "complexity": "low|medium|high|very_high", "reason": "..." }. The application parses it, applies the confidence threshold from config (default 0.6), and falls back to the heuristic decision if the router LLM is unsure or its JSON is malformed. The router never blocks the answer path — that is a deliberate reliability choice.
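
The parse, threshold, and fall-back behaviour described above could look roughly like this. It is a sketch under assumptions: parse_router_output and route are hypothetical names, and the repo's real validation may differ.

```python
import json
from typing import Optional

VALID_TARGETS = ("local", "cloud")


def parse_router_output(raw: str, threshold: float = 0.6) -> Optional[dict]:
    """Return the router's decision, or None when it should be ignored."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError:
        return None  # malformed JSON
    if not isinstance(data, dict) or data.get("target") not in VALID_TARGETS:
        return None  # wrong shape or unknown target
    try:
        confidence = float(data.get("confidence"))
    except (TypeError, ValueError):
        return None  # missing or non-numeric confidence
    if not (threshold <= confidence <= 1.0):
        return None  # router is unsure (this also rejects NaN)
    return {
        "target": data["target"],
        "confidence": confidence,
        "reason": str(data.get("reason", "")),
    }


def route(raw_router_output: str, heuristic_target: str) -> str:
    """Prefer a confident router decision; otherwise keep the heuristic one."""
    decision = parse_router_output(raw_router_output)
    return decision["target"] if decision else heuristic_target


print(route('{"target": "cloud", "confidence": 0.9, "reason": "multi-step maths"}', "local"))
print(route("Sure! Here is my answer...", "local"))
```

Every failure mode collapses to None, so the caller has exactly one branch to handle and the answer path is never blocked by a bad router reply.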

Cloud inference via Microsoft Foundry

When the policy returns RouteTarget.CLOUD, the request goes through AIProjectClient, which gives you an openai.OpenAI-compatible client wired to your Foundry project with DefaultAzureCredential. No API keys. No secrets in .env.

# app/providers/cloud_provider.py (excerpt)
from azure.ai.projects import AIProjectClient
from azure.identity import DefaultAzureCredential

self._project = AIProjectClient(
    endpoint=self._config.foundry_project_endpoint,
    credential=DefaultAzureCredential(),
)
self._openai_client = self._project.get_openai_client()

response = self._openai_client.chat.completions.create(
    model=self._config.foundry_cloud_model_deployment,  # e.g. "gpt-5.4"
    messages=messages,
    max_completion_tokens=max_tokens,
)

A subtle gotcha worth flagging: gpt-5 and o-series deployments reject the legacy max_tokens parameter and require max_completion_tokens. They also reject custom temperature values. The reference repo handles this by trying the new parameter first and falling back to the legacy one only when the API returns the specific unsupported parameter error. That keeps the same code working against older deployments without forking the provider.
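
The try-new-then-legacy approach can be sketched without the real client. Everything here is a stand-in: UnsupportedParameterError is a hypothetical exception, and in practice you would inspect the error your OpenAI-compatible client actually raises.

```python
from typing import Any, Callable


class UnsupportedParameterError(Exception):
    """Hypothetical stand-in for the API's 'unsupported parameter' error."""

    def __init__(self, param: str):
        super().__init__(f"unsupported parameter: {param}")
        self.param = param


def create_with_token_limit(create: Callable[..., Any], max_tokens: int, **kwargs: Any) -> Any:
    """Try max_completion_tokens first; retry with max_tokens only on that specific error."""
    try:
        return create(max_completion_tokens=max_tokens, **kwargs)
    except UnsupportedParameterError as err:
        if err.param != "max_completion_tokens":
            raise  # a different parameter was rejected: do not mask it
        return create(max_tokens=max_tokens, **kwargs)


def legacy_create(**kwargs: Any) -> dict:
    # Fake older deployment that only understands max_tokens.
    if "max_completion_tokens" in kwargs:
        raise UnsupportedParameterError("max_completion_tokens")
    return kwargs


result = create_with_token_limit(legacy_create, 256, model="older-deployment")
print(result)
```

Retrying only on the one known error keeps genuine failures (authentication, quota, bad input) visible instead of silently issuing a second request.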

Graceful degradation: the fallback paths

Hybrid systems fail in interesting ways. The cloud can be down. The local model can throw because the GPU ran out of memory. A reasoning model can return an empty completion. The service handles all of these by attempting the alternative path and labelling the response so observability stays honest:

  • Cloud route fails → local fallback. The response carries path=LOCAL_FALLBACK, fallback=true, and a populated fallback_reason. The user gets an answer instead of an error.
  • Local route fails → cloud fallback, but only if privacy class is not RESTRICTED. A sensitive prompt that the local model could not handle never leaks to the cloud as a fallback. It returns a clear error instead. This is the second hard gate in the system.
  • Both fail. A structured error response with a correlation ID, never a stack trace.

That last rule — fallback respects privacy class — is the kind of decision that is easy to skip and impossible to bolt on later. Encode it once in the service layer and your privacy reviewers will thank you.
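
One way to encode that rule in the service layer is sketched below. The names answer_with_fallback and RestrictedFallbackBlocked are hypothetical, and the real service returns a structured error response rather than raising.

```python
from typing import Callable, Tuple

Provider = Callable[[str], str]


class RestrictedFallbackBlocked(RuntimeError):
    """Raised when a restricted prompt fails locally and must not go to the cloud."""


def answer_with_fallback(
    prompt: str, target: str, restricted: bool, local: Provider, cloud: Provider
) -> Tuple[str, str]:
    """Return (answer, path), where path mirrors the InferencePath values."""
    if restricted and target != "local":
        raise ValueError("restricted prompts must be routed locally")
    primary, secondary = (local, cloud) if target == "local" else (cloud, local)
    try:
        return primary(prompt), target
    except Exception as err:
        if restricted:
            # Second hard gate: never retry a restricted prompt in the cloud.
            raise RestrictedFallbackBlocked("local path failed; cloud fallback denied") from err
        # The label names the path that finally served the request.
        path = "cloud_fallback" if target == "local" else "local_fallback"
        # If the secondary provider also fails, its exception propagates to the caller.
        return secondary(prompt), path


def broken(_prompt: str) -> str:
    raise RuntimeError("provider unavailable")


print(answer_with_fallback("summarise this", "cloud", False, lambda p: "local answer", broken))
```

Keeping the privacy check inside the except branch means no caller can opt out of it by accident.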

What it looks like in practice

The diagnostics panel in the Gradio UI shows the routing decision live: path, model, confidence, latency, privacy class, complexity band, and the full JSON response. Five canonical scenarios shake out the entire decision tree:

  1. "hello" → path=local, confidence=1.0, complexity=low. Heuristic only. No router LLM call. ~3 seconds end-to-end with phi-4-mini cached.
  2. "explain transformer self-attention in depth with maths" → path=cloud, model=gpt-5.4, complexity=high. Router LLM classifies, hard gate confirms.
  3. "my password is hunter2, suggest a stronger one" → path=local, privacy=restricted, deterministic=true. Privacy gate fires before any model sees it.
  4. "summarise this 8 KB document" with cloud unavailable → path=local_fallback (local handles it, response is labelled).
  5. Complex prompt with local model error → path=cloud_fallback, fallback_reason populated.

You can reproduce all five without any models installed by running python -m app.main --demo. The demo mode swaps the providers for deterministic stubs so you can validate the routing logic and the response schema in under a second on any machine.

Operational lessons learned

Some things the reference implementation only gets right because it got them wrong first:

  • Pick a non-reasoning model for the router. Reasoning-tuned local models (Phi-4-reasoning, o-style) wrap their output in <think> blocks and blow your JSON parser. phi-4-mini is faster and more reliable for classification.
  • Cache the local model. First load can take 30–60 seconds while Foundry Local downloads weights. Initialise the service once at process startup, not per request.
  • Use correlation IDs everywhere. The service attaches one per request and the structured JSON logger emits it on every event. When you are debugging a fallback path across two model providers, this is the difference between five minutes and five hours.
  • Run the privacy heuristic on every fallback path too. A naive implementation might route locally, fail, and then send the same sensitive prompt to the cloud as a "graceful" fallback. That is not graceful, it is a data leak.
  • Keep configuration in .env and out of code. Privacy mode, fallback toggles, confidence threshold, model aliases — all environment-driven. The config.py module is the only place that reads them.
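
The correlation-ID lesson can be sketched with the standard library alone. This is an assumption about shape, not the repo's logger: JsonFormatter, new_correlation_id, and the event names are hypothetical.

```python
import json
import logging
import uuid


class JsonFormatter(logging.Formatter):
    """Emit each log record as one JSON object that carries the correlation ID."""

    def format(self, record: logging.LogRecord) -> str:
        payload = {
            "level": record.levelname,
            "event": record.getMessage(),
            "correlation_id": getattr(record, "correlation_id", None),
        }
        return json.dumps(payload)


def new_correlation_id() -> str:
    return uuid.uuid4().hex


logger = logging.getLogger("hybrid_agent_sketch")
handler = logging.StreamHandler()
handler.setFormatter(JsonFormatter())
logger.addHandler(handler)
logger.setLevel(logging.INFO)

cid = new_correlation_id()
# The same ID is attached to every event for one request, across both providers.
logger.info("route_decided", extra={"correlation_id": cid})
logger.info("cloud_failed_falling_back", extra={"correlation_id": cid})
```

Filtering the log stream on a single correlation_id value then reconstructs the full path one request took, including any fallback.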

Responsible AI in a hybrid topology

Hybrid does not make responsible AI harder, but it does make it different. Three controls earn their keep:

  • Data residency by default. The local path keeps prompts and answers on the device. For RESTRICTED content this is mandatory; for everything else it is a free latency and cost win.
  • Auditability. Every routing decision is logged with the deterministic reason, the heuristic class, the router LLM output, the confidence, and the correlation ID. You can answer "why did this prompt go to the cloud?" months later.
  • Keyless auth. DefaultAzureCredential means there is no API key to leak, rotate, or commit by accident. The repo's .gitignore, SECURITY.md, and pre-push checklist enforce this end-to-end.

Try it

Five minutes, no Azure account needed for the demo:

git clone https://github.com/leestott/fl-mixedmodel.git
cd fl-mixedmodel

python -m venv .venv
.venv\Scripts\activate          # Windows
# source .venv/bin/activate     # macOS / Linux

pip install -r requirements.txt
python -m app.main --demo       # all five scenarios, no models required

To run with real models, install Foundry Local, copy .env.example to .env, set your FOUNDRY_PROJECT_ENDPOINT, then:

az login
python -m app.main --ui --port 7860

Key takeaways

  • The best-practice pattern is a two-stage router: local model classifies first, then either a local task model or a Microsoft Foundry cloud model answers.
  • For cloud control, use the Responses API with either a named deployment (deterministic) or model-router (auto-select).
  • Pin foundry-local-sdk >= 1.1.0 (5 May 2026) and azure-ai-projects >= 2.1.0. The 2026 SDK surfaces are not backwards-compatible with pre-2026 samples.
  • Hybrid inference is a routing problem, not a model problem. A small local model is enough to classify the request.
  • Deterministic privacy gates beat probabilistic ones. Code the rules; let the LLM judge only what is left.
  • Return the same response schema from every path. Label fallbacks honestly. Carry a correlation ID everywhere.
  • Keep auth keyless with DefaultAzureCredential and your .env out of git.
  • Test the routing decisions, not just the model outputs. Demo mode and a strong pytest suite pay back every time you swap a model.

Hybrid AI is not a compromise between local and cloud. It is the supervisor pattern applied to inference — fast and private where you can be, frontier where you have to be, observable everywhere. The hard part is the contract, not the models.


Rebuilding Your Mental Models In the Midst Of an AI Tech Revolution


Right now, the questions we have about our careers feel existential. We keep coming back to the same theme: how do you prepare for an industry that's changing this fast, and what mindset actually works in this new reality? One skill keeps surfacing as the answer — your ability to update your own mental models. In today's episode, I want to push on that further and put some of software engineering's most beloved thinking models under scrutiny. Some of these models served you well for years. Some of them now deserve to be challenged, replaced, or thrown out entirely — and learning how to tell the difference is itself the skill that will determine whether you hit a ceiling.

  • Move Past "So What" Questions: The typical engineering objection to agentic coding is that it produces quality issues. But the people deciding to adopt these tools already accept that. Our job is to stop arguing the surface-level point and start asking the real one: so what do we actually do about this new economic reality?
  • The Economics of Acceptable Loss: Abstraction always leaves something to be desired. An agent's code may not match what a staff engineer produces by hand over months — but that gap is usually an acceptable trade against shipping something two, three, or four times faster. Understand the cost-benefit picture instead of pretending the cost doesn't exist.
  • Abstraction Has Always Done This: This isn't new. The calculator dissolved the specialization once required for complex math. Spreadsheets commoditized ledgering and accounting. Agentic coding is the same pattern arriving for our work — making something that required deep specialization suddenly far more accessible.
  • Roles Are Blurring: As these generic tools raise everyone's ability to abstract, the boundaries soften. You're already seeing product managers open pull requests and engineers making product decisions. The neat lines around "what an engineer is" are not as fixed as they used to feel.
  • Why Your Hard-Won Wisdom Is the Target: If you've spent years in this industry, your models were bought with blood, sweat, and failed projects. That experience is real wisdom — and it's exactly what I'm asking you to be willing to challenge, because the thing that always worked for you is the thing most likely to become a ceiling.
  • This Skill Survives Either Way: Even if you think AI is mostly hype and I've been infected by it — fine. The ability to challenge your pre-existing models is a critical skill regardless. It's how you keep growing as you get more senior instead of repeating what used to work.
  • Models Are Approximations: The whole point of a model is to approximate the reality around us. That's their value and their limitation. When the underlying reality shifts this dramatically, holding tightly to an old approximation stops being wisdom and starts being a liability.

🙏 Today's Episode is Brought To you by: Unblocked

Your coding agents have access to your codebase and probably a lot more — tools connected through MCPs, skills, and more. But access isn't the same as context. Agents aren't great at reasoning across MCPs, and they don't know your architectural decisions, your team's patterns, or why your API is shaped the way it is. So they look in the wrong place and deliver bad outputs, and you burn time and tokens correcting them. ● Unblocked is the smart context layer your agents are missing. ● Instead of dumping tons of data into a giant context window and getting lost, it builds reasoning over shared context. ● It turns code, docs, tickets, and conversations into actionable context, so engineers move faster and agents make better plans, write higher quality code, use fewer tokens, and need fewer correction loops. ● If you're running Claude Code, Cursor, or any other agentic workflow, it's worth a look. Get a free three-week trial at getunblocked.com/developer-tea.

📮 Ask a Question

If you enjoyed this episode and would like me to discuss a question that you have on the show, drop it over at: developertea.com.

📮 Join the Discord

If you want to be a part of a supportive community of engineers (non-engineers welcome!) working to improve their lives and careers, join us on the Developer Tea Discord community today!

🗞️ Subscribe to The Tea Break

We are developing a brand new newsletter called The Tea Break! You can be the first in line to receive it by entering your email directly over at developertea.com.

🧡 Leave a Review

If you're enjoying the show and want to support the content head over to iTunes and leave a review!





Download audio: https://dts.podtrac.com/redirect.mp3/cdn.simplecast.com/audio/c44db111-b60d-436e-ab63-38c7c3402406/episodes/c9f6f900-f07e-406a-8049-d08b2ff65f55/audio/2b187a73-528e-47b8-af1d-bf1424fed11c/default_tc.mp3?aid=rss_feed&feed=dLRotFGk

Measuring Performance in FrontEnd using FPS

How to measure frontend performance with a tiny, dependency-free FPS meter, with live React and Angular demos you can jank on demand.

How we built integration testing for fast-moving AI backend


How do you keep your backend compatible with an upstream dependency that changes its API every week? And how do you do it without spending a dollar on large language model (LLM) calls?

That was our problem on Red Hat OpenShift AI, where our Go backend integrates with Llama Stack. We replaced our mocked unit tests with a real Llama Stack server, used its built-in record-replay to avoid LLM costs and wired a daily Slack sentinel that notifies us before our users do.

The problem: Our mocks were lying to us

Our Go test suite for the backend for frontend (BFF) used mocked Llama Stack APIs. Standard practice: define the expected request and response shapes, assert against them and move on.

The problem was that Llama Stack was iterating fast, really fast. In early 2026, the upstream project released multiple times a week, adding new API fields, changing response shapes, and deprecating endpoints. Every time they changed something, our mocked tests kept passing, green checkmarks everywhere. And then someone would deploy, and the BFF would break against the real server.

Mocks test your assumptions about an API. When the API changes underneath you, the mocks don't update themselves. You end up with a test suite that gives you confidence about a contract that no longer exists. We needed to test against the real thing.

The first decision: Start a real server

The idea to use a real Llama Stack instead of mocks came from our architect. The BFF already had a pattern for this. When you run make test, the test suite automatically starts a local MLflow server as a child process, seeds it with test data, runs tests against it, and tears it down. 

Could we do the same for Llama Stack? Yes, but the implementation details mattered. I took the idea and ran with it.

Docker or subprocess?

The obvious approach was Docker. Llama Stack publishes container images. Spin one up, run tests, and tear it down.

I went in a different direction, using the uv run command as a subprocess.

The uv command is a fast Python package manager. With a single command, it installs a specific version of Llama Stack from PyPI and starts it inside the same terminal session as the BFF. No Docker daemon. No image pulls. No container networking. The Llama Stack process runs on 127.0.0.1:18321. The Go test suite connects to it directly, and the process group gets a SIGTERM when the tests finish.

// Simplified — full version in lsmocks/llamastack_process.go
cmd := exec.CommandContext(ctx, uvBin,
    "run",
    "--with", "llama-stack=="+version,
    "--with", "llama-stack-api=="+version,
    "--with-requirements", requirementsPath,
    "llama", "stack", "run", configPath,
    "--port", fmt.Sprintf("%d", port),
)

When you run make test, it calls go test ./..., which enters the Ginkgo BeforeSuite in api_suite_test.go. That's where SetupLlamaStack starts the child process, waits up to 180 seconds for a health check on /v1/models, and seeds the server with test data. If a Llama Stack server is already running on the target port (from a previous dev session), it reuses that instance with a warning about potentially stale database state.

This meant any developer could run make test and get a full integration test against a real Llama Stack server with no Docker, no special setup, and no cluster. The version was pinned in the Makefile, so updating it was a one-line change.

The second problem: LLMs cost money

Here's where it got tricky. A real Llama Stack server needs a real inference provider. When the BFF calls the responses API, Llama Stack forwards that to an LLM (e.g., Gemini), gets back generated text, and returns it to the BFF. The same is true for embeddings when testing vector stores and for safety when testing guardrails.

If every make test run hit a frontier model, we'd burn through API credits fast. In continuous integration (CI) where tests run on every pull request, the cost would be unsustainable. Not to mention, the LLM responses are non-deterministic. The same prompt doesn't give the same answer twice. That's a bad property for a test suite.

We needed a way to get real Llama Stack behavior without real LLM calls on every run.

The eureka moment: The record-replay system

I went digging through the upstream Llama Stack codebase, looking for anything that could help. I found a built-in record-replay system for testing.

The mechanism is straightforward. Llama Stack intercepts every request sent to the inference provider (Gemini in our case). It can operate in two modes:

  • Record mode: Llama Stack makes the real API call to Gemini, gets the response, returns it to the caller, and saves the request and response as a JSON fixture. A hash of the request parameters keys each fixture.
  • Replay mode: Llama Stack looks up the request hash in its fixture directory. If a matching fixture exists, it returns the recorded response without ever calling Gemini. If not, the test fails, telling you that your test is making a call that hasn't been recorded yet.
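
The two modes can be illustrated with a toy version of the idea, written in Python for brevity. This is not Llama Stack's implementation: request_key, call_provider, and the SHA-256-over-sorted-JSON keying are assumptions made for the sketch.

```python
import hashlib
import json
import tempfile
from pathlib import Path
from typing import Any, Callable


def request_key(method: str, url: str, body: Any) -> str:
    """Hash the request parameters so identical requests map to one fixture."""
    canonical = json.dumps({"method": method, "url": url, "body": body}, sort_keys=True)
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def call_provider(
    mode: str, fixtures: Path, method: str, url: str, body: Any,
    real_call: Callable[[str, str, Any], Any],
) -> Any:
    path = fixtures / f"{request_key(method, url, body)}.json"
    if mode == "replay":
        if not path.exists():
            # An unrecorded request fails loudly instead of reaching the provider.
            raise LookupError(f"no recording for {method} {url}")
        return json.loads(path.read_text())["response"]
    # Record mode: make the real call, then save request and response together.
    response = real_call(method, url, body)
    fixture = {"request": {"method": method, "url": url, "body": body}, "response": response}
    path.write_text(json.dumps(fixture))
    return response


with tempfile.TemporaryDirectory() as tmp:
    fake_provider = lambda method, url, body: {"data": [{"embedding": [0.1, 0.2]}]}
    url = "https://example.test/v1/embeddings"  # placeholder URL
    call_provider("record", Path(tmp), "POST", url, {"input": ["hi"]}, fake_provider)
    print(call_provider("replay", Path(tmp), "POST", url, {"input": ["hi"]}, fake_provider))
```

Sorting keys before hashing makes the fixture key independent of dictionary ordering, so the same logical request always resolves to the same file.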

The recorded fixtures look like this:

{
  "test_id": "bff/testdata/llamastack/test.py::record",
  "request": {
    "method": "POST",
    "url": "https://generativelanguage.googleapis.com/v1beta/openai/v1/embeddings",
    "body": {
      "model": "models/gemini-embedding-001",
      "input": ["Artificial intelligence (AI) is the simulation of..."],
      "encoding_format": "float",
      "dimensions": 128
    }
  },
  "response": {
    "body": {
      "data": [{"embedding": [0.002, -0.001, ...], "index": null}],
      "model": "models/gemini-embedding-001"
    }
  }
}

Note: I set the embedding dimension to 128 instead of the default 768. You’ll commit these fixtures to Git. At 768 dimensions with multiple embeddings per test, the JSON files would bloat the repository. At 128, each fixture is a few kilobytes. Our tests don't care about embedding quality. They care that the BFF correctly handles the response shape.

Wiring it together

The implementation required three pieces working in concert.

1. Test ID correlation

Llama Stack's record-replay system uses a test ID to isolate recordings. Every HTTP request the BFF sends to Llama Stack needs to carry an X-LlamaStack-Provider-Data header containing a __test_id field. This header flows through to the inference provider layer, where the record-replay system uses it to match requests to fixtures.

I built a custom Go HTTP transport (testIDHeaderTransport) that wraps every outbound request and injects the X-LlamaStack-Provider-Data header with a __test_id field. The tricky part is merging. The BFF already sends this header for some operations (provider-specific data like API keys), so the transport needs to add the test ID alongside existing data rather than overwriting it.
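A transport like this might look roughly as follows. Treat it as a sketch rather than the BFF's actual code: the header name, the __test_id field, and the testIDHeaderTransport name come from the description above, while the struct fields, the mergeTestID helper, and the assumption that the header value is a flat JSON object are mine.

```go
package main

import (
	"encoding/json"
	"fmt"
	"net/http"
)

const providerDataHeader = "X-LlamaStack-Provider-Data"

// mergeTestID adds __test_id to an existing JSON header value, keeping any
// fields (for example, API keys) that are already present.
func mergeTestID(existing, testID string) (string, error) {
	data := map[string]any{}
	if existing != "" {
		if err := json.Unmarshal([]byte(existing), &data); err != nil {
			return "", fmt.Errorf("parse %s: %w", providerDataHeader, err)
		}
		if data == nil { // the header value was the JSON literal null
			data = map[string]any{}
		}
	}
	data["__test_id"] = testID
	merged, err := json.Marshal(data)
	if err != nil {
		return "", err
	}
	return string(merged), nil
}

// testIDHeaderTransport wraps another RoundTripper and injects the test ID.
type testIDHeaderTransport struct {
	base   http.RoundTripper // assumed field: the wrapped transport
	testID string            // assumed field: ID of the current test
}

func (t *testIDHeaderTransport) RoundTrip(req *http.Request) (*http.Response, error) {
	merged, err := mergeTestID(req.Header.Get(providerDataHeader), t.testID)
	if err != nil {
		if req.Body != nil {
			req.Body.Close() // RoundTrip must close the body, even on error
		}
		return nil, err
	}
	// A RoundTripper should not modify the caller's request, so work on a clone.
	clone := req.Clone(req.Context())
	clone.Header.Set(providerDataHeader, merged)
	return t.base.RoundTrip(clone)
}

func main() {
	merged, err := mergeTestID(`{"api_key":"secret"}`, "test.py::record")
	fmt.Println(merged, err)

	// Wiring it into a client:
	client := &http.Client{Transport: &testIDHeaderTransport{
		base:   http.DefaultTransport,
		testID: "test.py::record",
	}}
	_ = client
}
```

Merging through a map is what preserves the existing provider data. Setting the header directly would overwrite it.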

2. A dual-mode client factory

Not every test run needs a real Llama Stack server. Cypress end-to-end tests, local development, and some CI jobs can use a simple in-memory mock.

The MockClientFactory auto-detects the environment:

func (f *MockClientFactory) CreateClient(...) LlamaStackClientInterface {
    if client := TryCreateTestClient(); client != nil {
        return client  // Real Llama Stack server available
    }
    return NewMockLlamaStackClient()  // Fallback to in-memory mock
}

If you set TEST_LLAMA_STACK_PORT (which make test does), you’ll get a real client connected to the local server. If not, you’ll get the mock. The result is one factory, two modes, and zero configuration.
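The detection itself can be as small as an environment lookup. The sketch below is hypothetical: only the TEST_LLAMA_STACK_PORT variable comes from the text above, while the testServerURL helper, the 127.0.0.1 address, and the port validation are assumptions. The getenv function is injected so the logic can be exercised without touching the process environment.

```go
package main

import (
	"fmt"
	"os"
	"strconv"
)

// testServerURL reports the base URL of the local test server. It returns
// ok == false when TEST_LLAMA_STACK_PORT is unset or is not a valid port,
// which is the signal to fall back to the in-memory mock.
func testServerURL(getenv func(string) string) (url string, ok bool) {
	port, err := strconv.Atoi(getenv("TEST_LLAMA_STACK_PORT"))
	if err != nil || port < 1 || port > 65535 {
		return "", false
	}
	return fmt.Sprintf("http://127.0.0.1:%d", port), true
}

func main() {
	if url, ok := testServerURL(os.Getenv); ok {
		fmt.Println("real client ->", url)
		return
	}
	fmt.Println("in-memory mock")
}
```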

3. Data seeding

A bare Llama Stack server has no vector stores, files, or embeddings. The SeedData function runs after the server is healthy. It verifies available models, creates a test vector store, uploads a sample document, waits for the embedding to complete (Llama Stack processes embeddings asynchronously), and returns the IDs that tests reference.

The polling on embedding completion is important. Without it, you get a race condition where tests that do file_search find zero chunks because the index isn't ready. The wait loop checks every 500ms with a 60-second timeout.

How the team runs it daily

With the test infrastructure in place, our team’s daily workflow comes down to three routines:

  • Local development (the most common case): Running make test starts Llama Stack in replay mode using the committed fixtures. It costs nothing and takes about 2 minutes.
  • After changing a BFF API: Running GEMINI_API_KEY=<key> make llamastack-record clears the existing recordings and runs all tests in record mode against real Gemini to capture new fixtures, which you then commit. This typically happens once or twice a sprint.
  • PR CI: The process is the same as local development, using replay mode and committed fixtures without requiring an API key.

The next level: The Compatibility Sentinel

Once the integration testing infrastructure was solid locally and in PR CI, I saw a bigger opportunity. We had the machinery to test any version of Llama Stack with a single variable change. Why not automate that?

Our Makefile pins Llama Stack to this specific version:

TEST_LLAMA_STACK_VERSION ?= 0.7.2.dev20260423

Change this one line, and the entire test suite runs against a different version.

I built a GitHub Actions workflow in a separate repository (odh-automations) that runs every day at 06:30 UTC. It does three things:

1. Resolves the latest versions from two sources:

  • Stable: fetches pypi.org/pypi/llama-stack/json and reads .info.version
  • Dev: fetches test.pypi.org/pypi/llama-stack/json, filters releases containing "dev", sorts, takes the latest

Both run as parallel matrix jobs. For dev builds, the workflow pre-downloads wheels from test.pypi.org into a local directory and sets UV_FIND_LINKS so uv can resolve those pre-release packages reliably without depending on index ordering.
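For illustration, the resolution logic can be expressed as a small Go function that takes the JSON body returned by the PyPI endpoint. This is a sketch under assumptions: the latestDev and versionKey names are hypothetical, the numeric ordering is a simplification rather than full PEP 440 version comparison, and the payload in main is a made-up sample, not real index data.

```go
package main

import (
	"encoding/json"
	"fmt"
	"sort"
	"strconv"
	"strings"
)

// pypiProject models the two fields of the PyPI JSON response used here.
type pypiProject struct {
	Info struct {
		Version string `json:"version"` // latest stable version
	} `json:"info"`
	Releases map[string]json.RawMessage `json:"releases"` // keyed by version
}

// versionKey turns "0.7.2.dev20260423" into [0 7 2 20260423] so versions
// compare numerically. Simplified: parts that are not numbers count as 0.
func versionKey(v string) []int {
	var key []int
	for _, part := range strings.Split(v, ".") {
		n, _ := strconv.Atoi(strings.TrimPrefix(part, "dev"))
		key = append(key, n)
	}
	return key
}

// less reports whether version a sorts before version b.
func less(a, b string) bool {
	ka, kb := versionKey(a), versionKey(b)
	for i := 0; i < len(ka) && i < len(kb); i++ {
		if ka[i] != kb[i] {
			return ka[i] < kb[i]
		}
	}
	return len(ka) < len(kb)
}

// latestDev filters releases containing "dev", sorts them, and returns the last.
func latestDev(body []byte) (string, error) {
	var p pypiProject
	if err := json.Unmarshal(body, &p); err != nil {
		return "", err
	}
	var devs []string
	for v := range p.Releases {
		if strings.Contains(v, "dev") {
			devs = append(devs, v)
		}
	}
	if len(devs) == 0 {
		return "", fmt.Errorf("no dev releases found")
	}
	sort.Slice(devs, func(i, j int) bool { return less(devs[i], devs[j]) })
	return devs[len(devs)-1], nil
}

func main() {
	// Sample payload for illustration only.
	body := []byte(`{"info":{"version":"0.7.1"},
		"releases":{"0.7.1":[],"0.7.1.dev20260409":[],"0.7.2.dev20260423":[]}}`)
	dev, err := latestDev(body)
	fmt.Println(dev, err)
}
```

A plain string sort would rank 0.10.0 below 0.9.0, which is why the sketch compares the numeric parts instead.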

2. Runs the BFF tests with a two-phase strategy:

First, try replay mode. If the fixtures match, the version is compatible, and you don’t need LLM calls.

If replay fails (i.e., the new version changed something in the request/response format), try recording mode with a real Gemini key. If recording succeeds, the version is compatible, but the fixtures need refreshing.

If both fail, the version is genuinely incompatible, and you’ll need BFF code changes.

# Abbreviated — full workflow in gen-ai-bff-nightly.yml
# Also sets working-directory, continue-on-error, UV_FIND_LINKS for dev wheels,
# and a separate determine-result step that reads pinned vs tested versions.
- name: Run BFF tests (replay mode)
  id: replay-test
  continue-on-error: true
  run: TEST_LLAMA_STACK_VERSION=${{ steps.resolve-version.outputs.version }} make test

- name: Run BFF tests (recording mode)
  id: record-test
  if: steps.replay-test.outcome == 'failure'
  env:
    GEMINI_API_KEY: ${{ secrets.GEMINI_API_KEY }}
  run: TEST_LLAMA_STACK_VERSION=${{ steps.resolve-version.outputs.version }} make llamastack-record

This two-phase approach was the key design insight. Most days, stable and dev pass on replay, making zero LLM calls. Gemini only gets calls when something actually changes. We're talking maybe a few cents per week in API costs for the entire team.
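The decision the two phases encode fits in one small function. The sketch below is illustrative and is not the workflow's actual determine-result step: the classify function and Status type are hypothetical, and the inputs are assumed to be the step outcome strings that GitHub Actions reports (success, failure, cancelled, skipped).

```go
package main

import "fmt"

// Status is the overall verdict for one Llama Stack version.
type Status string

const (
	Compatible     Status = "compatible"
	NeedsRecording Status = "needs recording"
	Incompatible   Status = "incompatible"
)

// classify maps the outcomes of the replay and record steps to a verdict.
// The record step is skipped whenever replay succeeds.
func classify(replayOutcome, recordOutcome string) Status {
	switch {
	case replayOutcome == "success":
		return Compatible // fixtures still match, no LLM calls made
	case recordOutcome == "success":
		return NeedsRecording // works against real Gemini, fixtures are stale
	default:
		return Incompatible // both phases failed, BFF changes are needed
	}
}

func main() {
	fmt.Println(classify("success", "skipped"))
	fmt.Println(classify("failure", "success"))
	fmt.Println(classify("failure", "failure"))
}
```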

3. Sends structured results to Slack via Workflow Builder:

The notification script computes an overall status and sends a JSON payload to a Slack Workflow Builder webhook. We chose Workflow Builder over a custom Slack app because it's simpler to set up and doesn't require app approval from the workspace admins. Figure 1 shows what the team sees in their Slack channel every morning.

A Slack notification showing a successful verification from LlamaStack Compatibility Sentinel.
Figure 1: This is the workflow with successful verification.

Figure 2 shows the Slack message when things break.

A Slack notification showing a failed verification from LlamaStack Compatibility Sentinel.

Figure 2: This is the workflow with failed verification.

We named it the LlamaStack Compatibility Sentinel. Every day, it tells us exactly where we stand.

What the Sentinel caught in practice

The Sentinel has been running since March 2026. Here's what actually happened:

March 18: First alert. Stable LLS 0.6.0 was compatible but required re-recording, meaning response shapes had drifted but the API contract held. Dev 0.6.1 was incompatible. This gave us weeks of lead time before those breaking changes hit stable.

Late March: The Llama Stack team announced v1.0.0 with major architectural changes, including the removal of inline providers and new file_processors API requirements. Our Sentinel had already been flagging dev build failures daily, so we knew exactly which changes would affect us.

April 9: All green. Stable 0.7.1 and dev 0.7.1.dev20260409 were both compatible, giving us a brief window of peace.

April 10 onward: Dev builds started failing consistently. The 0.7.2.dev series introduced changes that broke our BFF. We could see it coming, so we tracked the regression daily and prepared our upgrade path as a team. Without the Sentinel, we would have discovered this when someone tried to upgrade in production.

I used this data directly in sprint planning. I wrote to the team: "Our nightly compatibility workflow has already flagged a dev version of LLS that breaks the BFF. Once those breaking changes land in a stable release, we'll need to start the upgrade in earnest." This wasn't a guess. The Sentinel had been telling us for two weeks.

Our suggestions

Test against the real dependency, not a simulation of it. Mocks test your assumptions. Integration tests against a real server test reality. The gap between the two is where production bugs hide.

Use record-replay, not just mocks. If your dependency has a record-replay system (or you can build one), use it. You get the fidelity of real integration tests with the speed and cost of mocks. The Llama Stack team built this into their test infrastructure. Leveraging it for our Go BFF was the single best decision I made on this project.

Automate the version matrix. All you need is a Makefile variable and a CI job that swaps it. Test against stable (what your users have) and dev (what's coming). The delta between the two is your early warning system.

Make the results visible. A test that runs in CI and nobody looks at is a test that doesn't exist. Slack notifications with clear status (e.g., compatible, needs recording, or incompatible) turn a background CI job into a daily team signal.

Keep fixtures small. If you're committing LLM response fixtures to Git, think about size. Choose 128-dimension embeddings instead of 768. Use short sample documents. Every kilobyte multiplied by every fixture multiplied by every clone adds up.

How to adapt this for your project

This pattern works for any backend that integrates with a fast-moving upstream dependency, not just Llama Stack. If your project consumes a Python server, an API that releases frequently, or any dependency whose contract changes between versions, the same approach applies. 

Here is what you need:

  • Subprocess lifecycle management: A module that starts the upstream server as a child process, waits for a health check, and tears it down after tests complete.
  • Test data and fixtures: A config file for the upstream server, pinned dependency versions, and committed record-replay fixtures (JSON recordings of real API responses).
  • Makefile targets: test (replay mode), record (re-record fixtures against the real dependency), and up / down (manual server lifecycle for local development).
  • CI workflow: A scheduled job that resolves the latest stable and dev versions, runs the two-phase test strategy (replay first and record if needed), and reports results.
  • Notifications: A Slack webhook (or equivalent) that surfaces daily compatibility status to the team.

Any backend that depends on Llama Stack (or any rapidly evolving upstream project) faces the same version-drift problem. We designed this infrastructure to be portable.

It took me about three weeks to build the entire system, from make test starting a local Llama Stack server to the Slack Sentinel reporting compatibility every morning. The ongoing cost is effectively zero: a few minutes of GitHub Actions compute per day and the occasional Gemini API call when fixtures need refreshing.

It's been running daily since March. So far, it has caught every breaking change in upstream dev builds before it reached stable, giving us weeks of lead time on each. When the Llama Stack team announced breaking changes for v1.0.0, we already had a clear picture of which ones affected us because the Sentinel had been flagging them for two weeks.

That's the whole point: not perfection, but early visibility. It's the kind of visibility that turns a production outage into a planned sprint task for the team.

Learn more

Get started with Red Hat OpenShift AI or try it in the Developer Sandbox. The OpenShift AI platform hosts Gen AI Studio. You can find the test infrastructure under packages/gen-ai/bff/. Review the upstream Llama Stack project, including the record-replay system. Get familiar with the Red Hat AI portfolio.

The post How we built integration testing for fast-moving AI backend appeared first on Red Hat Developer.
