
Lock Down Values in Pulumi ESC with fn::final

Pulumi ESC (Environments, Secrets, and Configuration) allows you to compose environments by importing configuration and secrets from other environments, but this also means a child environment can silently override a value set by a parent. When that value is a security policy or a compliance setting, an accidental override can cause real problems. With the new fn::final built-in function, you can mark values as final, preventing child environments from overriding them. If a child environment tries to override a final value, ESC raises a warning and preserves the original value.

How it works

Let’s say you have a parent environment that sets the AWS region for all deployments. You can use fn::final to ensure no child environment can change it:

# project/parent-env
values:
  aws-region:
    fn::final: us-east-1

If a child environment tries to override the final value, ESC raises a "cannot override final value" warning.

# project/child-env
imports:
  - project/parent-env
values:
  aws-region: eu-west-1 # raises a warning

This evaluates to:

{
 "aws-region": "us-east-1"
}

In this scenario, the ESC environment is still valid, but the final value remains unchanged.

When to use fn::final

Use fn::final for:

  • Security-sensitive values that shouldn’t be changed
  • Compliance or policy settings enforced by a platform team
  • Shared base environments where certain values must remain consistent

Getting started

The fn::final function is available now in all Pulumi ESC environments. For more information, check out the fn::final documentation!


RT.Assistant: A Multi-Agent Voice Bot Using .NET and OpenAI

This is a guest post by Faisal Waris, an AI strategist in the telecom industry. Faisal built RT.Assistant to explore how .NET, F#, and the OpenAI Realtime API can come together in a production-style, multi-agent voice application.

RT.Assistant is a voice-enabled, multi-agent assistant built entirely in .NET — combining the OpenAI Realtime API over WebRTC for low-latency, bidirectional voice; F# discriminated unions and async state machines for agent orchestration; .NET MAUI (via Fabulous) for cross-platform native UI on iOS, Android, macOS, and Windows; and Microsoft.Extensions.AI for portable LLM integration with both OpenAI and Anthropic models.

Under the hood, a custom RTFlow framework hosts multiple specialized agents — a Voice Agent, a CodeGen Agent, a Query Agent, and an App Agent — that communicate over a strongly-typed async bus, while a deterministic state-machine (the “Flow”) keeps the non-deterministic LLM behavior in check. The sample also showcases an unconventional RAG approach: instead of vector search, user queries are translated into Prolog and executed against a logic-programming knowledge base embedded in a .NET MAUI HybridWebView, yielding precise, hallucination-resistant answers.

Why telecom plan selection?

Faisal works in the telecom industry, where one of the most common customer pain points is choosing the right phone plan. Carriers offer dozens of bundled plans that mix voice, data, hotspot, streaming, and promotional pricing in ways that are genuinely difficult to compare — even for the people selling them. It’s the kind of domain where a conversational AI assistant can make a real difference: customers ask natural-language questions and get precise, verifiable answers instead of sifting through comparison matrices or waiting on hold.

RT.Assistant uses this domain as a realistic proving ground. The application maintains a mocked — but representative — catalog of plans modeled after a major US carrier’s actual offerings. Let’s look at what makes these plans hard to compare in the first place.

Plan offerings

Phone plans (like many other offerings these days) are bundled products and services, which makes it non-trivial to ascertain which plan will work best for one’s needs. Consider the components of a typical contemporary plan:

  • Base plan with voice, text and data rates & limits
  • Mobile hotspot data rates & limits (20 GB, 50 GB, 100 GB?)
  • Premium data rates & limits (with premium data, one’s traffic gets preference on a crowded/overloaded network)
  • Streaming services, e.g. Netflix, Hulu, Apple TV, etc.
  • Taxes and fees, which may or may not be included in the plan price
  • Special discounts for military veterans, first responders, seniors, etc.
  • In-flight data rates & limits
  • Seasonal or campaign promotions

Additionally, the available features may be dependent on the number of lines (distinct phone numbers). For example, Netflix may be excluded for a single line but included for two or more lines.
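To make the line-dependent eligibility concrete, here is a minimal Python sketch of a plan whose features apply only within a line-count range. All names and numbers here (`Feature`, `hotspot_20gb`, the prices) are hypothetical illustrations, not the sample’s actual data model:

```python
from dataclasses import dataclass

@dataclass
class Feature:
    name: str
    min_lines: int  # feature applies only within this line-count range
    max_lines: int

@dataclass
class Plan:
    title: str
    per_line_price: dict[int, int]  # number of lines -> total monthly price
    features: list[Feature]

def available_features(plan: Plan, lines: int) -> list[str]:
    """Return the features a customer with `lines` lines would actually get."""
    return [f.name for f in plan.features if f.min_lines <= lines <= f.max_lines]

connect = Plan(
    title="Connect",
    per_line_price={1: 20, 2: 40},
    features=[
        Feature("netflix", min_lines=2, max_lines=8),     # excluded for a single line
        Feature("hotspot_20gb", min_lines=1, max_lines=8),
    ],
)

print(available_features(connect, 1))  # ['hotspot_20gb']
print(available_features(connect, 2))  # ['netflix', 'hotspot_20gb']
```

Even in this toy version, the “best plan” depends on how many lines you have, which is exactly what makes side-by-side comparison hard.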

Technologies Showcased

The system internally maintains a mocked—but representative—set of phone plans, modeled as Prolog facts, simulating offerings from a typical major telecom provider. From (a) capturing the voice input, to (b) querying the Prolog knowledge base, and finally (c) generating the results, multiple components work together seamlessly.

This sample highlights the integration of the following frameworks and technologies:

  • RTFlow: Multi-agent framework for realtime GenAI applications – written in F#
  • RTOpenAI: F# library for interfacing with the OpenAI realtime API via the WebRTC protocol
  • Fabulous for .NET MAUI: F# library for building native mobile applications (iOS, Android, and more) with Microsoft .NET MAUI
  • Tau Prolog: JavaScript-based Prolog engine, used in the native mobile app via the .NET MAUI HybridWebView control
  • Integration with the OpenAI & Anthropic ‘chat’ APIs for Prolog code generation with the Microsoft.Extensions.AI library

Overview

There is a lot going on here: generative AI, old-school symbolic AI, multi-agents, realtime voice and cross-platform native mobile apps, to name a few. The following explains how these are all stitched together into a comprehensive system.

  • Voice-enabled interaction: The assistant allows users to ask questions about phone plans through natural speech, making the experience conversational and intuitive.
  • Structured knowledge base: Plan details are represented as logically consistent Prolog facts, ensuring clarity and eliminating ambiguity in feature–price combinations.
  • Internally, multiple specialized agents work together to process and respond to user queries:
    • Voice Agent – Maintains a connection to the OpenAI realtime ‘voice’ API, enabling natural voice conversations about phone plans. It handles the steady stream of messages from the API, including tool calls containing natural language queries. These queries are then routed to other agents, and the resulting answers are returned to the voice model, which conveys them back to the user in audio form.
    • CodeGen Agent – Converts natural language queries into Prolog statements using another LLM API, then leverages the Tau Prolog engine to evaluate those statements against the knowledge base of facts.
    • Query Agent – Executes predefined (“canned”) Prolog queries directly against the Prolog engine for quick, structured lookups.
    • App Agent – Oversees communication among agents and reports activity back to the user interface for transparency and monitoring.

All agents are orchestrated by the RTFlow framework, which provides hosting and communication services. The diagram below illustrates the RTFlow agent arrangement for the RT.Assistant sample:

RTFlow

As there are multiple frameworks and technologies in play here, let’s briefly delve into each of them, in order of perceived importance.

1. RTFlow

RTFlow is a framework for building real-time, agentic applications. It is composed of three primary elements: Flow, Bus, and Agents.

Bus

The Bus provides the communication substrate that connects Agents to one another and to the Flow. It exposes two distinct logical channels:

  • Agent broadcast channel: All agent-intent messages published to this channel are broadcast to all agents.
  • Flow input channel: Messages published to this channel are delivered exclusively to the Flow. Agents do not receive these messages.

This separation allows agent collaboration to occur independently of system-level orchestration, while still enabling agents to explicitly signal the Flow when required.

Messages and State

Both Flow and Agents maintain private internal state and communicate exclusively via strongly typed, asynchronous messages. Message ‘schemas’ are defined as F# discriminated union (DU) types and are fixed at implementation time, providing:

  • Compile-time exhaustiveness checking
  • Explicit modeling of intent and system events
  • Clear separation between agent-level and flow-level concerns

Flow

The Flow is an asynchronous, deterministic state machine. Its state transitions are triggered solely by messages arriving on the Flow input channel.

Depending on application requirements, the Flow can range from minimal to highly directive:

  • Minimal control: A simple lifecycle state machine (e.g., Start → Run → Terminate), where agents primarily interact with each other via the broadcast channel and the Flow plays a supervisory role.
  • Orchestrated control: A more granular state machine in which agents primarily communicate with the Flow, and the Flow explicitly coordinates agent behavior based on its current state.

This design allows system-level determinism and control to be introduced incrementally, without constraining agent autonomy where it is unnecessary.

Topology

From a multi-agent systems perspective, RTFlow employs a hybrid bus–star topology:

  • The Bus enables broadcast-based, peer-style agent communication.
  • The Flow acts as a central coordinating node when orchestration is required.

This hybrid model balances scalability and decoupling with deterministic system control.

The F# language offers a clean way to model asynchronous state machines (more precisely, Mealy machines) where the states are functions and transitions happen via pattern matching over messages (DUs) or with ‘active patterns’. In the snippet below, the s_xxx functions are states and the M_Xxx values are messages that arrive on the Bus. The structure F packages the next state along with any output messages to be sent to agents.

let rec s_start msg = async {
  match msg with 
  | M_Start -> return F(s_run,[M_Started]) //transition to run
  | _       -> return F(s_start,[]) //stay in start state
  }

and s_run msg = async {
  match msg with 
  | M_DoSomething -> do! doSomething()
                     return F(s_run,[M_DidSomething])                     
  | M_Terminate   -> return F(s_terminate,[])
  | _             -> return F(s_run,[])
}

and s_terminate msg = async {
...

LLMs are inherently non-deterministic. RTFlow offers a way to control non-determinism to keep the overall system stable. As applications move from being human-centric to being more autonomous, we will need increasingly sophisticated methods to manage non-determinism. RTFlow’s approach is to inject a deterministic state machine in the mix to effect such control.

Given the relatively simple building blocks of RTFlow, we can construct rich agentic systems that support many realtime needs, with the ability to dial in the desired degree of control when needed.
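For readers less familiar with F#, the same states-as-functions pattern can be sketched in Python. The message names mirror the F# snippet above; the Bus and agent wiring are omitted, so this is only a sketch of the state machine itself:

```python
import asyncio

# Messages (stand-ins for the F# DU cases)
M_START, M_STARTED = "Start", "Started"
M_DO_SOMETHING, M_DID_SOMETHING = "DoSomething", "DidSomething"
M_TERMINATE = "Terminate"

# Each state is a coroutine: message -> (next_state, output_messages)
async def s_start(msg):
    if msg == M_START:
        return s_run, [M_STARTED]   # transition to run
    return s_start, []              # stay in start state

async def s_run(msg):
    if msg == M_DO_SOMETHING:
        return s_run, [M_DID_SOMETHING]
    if msg == M_TERMINATE:
        return s_terminate, []
    return s_run, []

async def s_terminate(msg):
    return s_terminate, []          # terminal state: absorb everything

async def run_flow(messages):
    """Feed messages through the machine, collecting output messages."""
    state, outputs = s_start, []
    for msg in messages:
        state, out = await state(msg)
        outputs.extend(out)
    return state, outputs

state, out = asyncio.run(run_flow([M_START, M_DO_SOMETHING, M_TERMINATE]))
print(out)  # ['Started', 'DidSomething']
```

Because each transition is an ordinary function return, the machine stays deterministic regardless of how non-deterministic the agents feeding it are.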

2. RTOpenAI

RTOpenAI wraps the OpenAI realtime voice API for native mobile (and desktop) apps. Its two key features are (a) support for the WebRTC protocol and (b) strongly-typed realtime protocol messages. These are discussed next.

WebRTC

The OpenAI voice API can be used via WebSockets or WebRTC, and WebRTC has some key advantages:

  • First, WebRTC was designed for bidirectional, realtime communication. It has built-in resiliency for minor network disconnects, which WebSockets, crucially, does not.
  • WebRTC has separate channels for voice and data (also video, which is not currently used). This means the application typically only needs to handle the data channel explicitly; the in/out audio channels are wired to the audio hardware by the underlying WebRTC libraries. With WebSockets, the application must explicitly handle in/out audio as well as the data.
  • WebRTC transmits audio via the Opus codec, which has excellent compression while retaining good audio quality. For WebSockets, multiple options exist: high-quality audio is sent as uncompressed 24 kHz PCM encoded as base64 strings, which requires roughly 10x the bandwidth of Opus. Other telephony formats are available, but their audio quality drops significantly.
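As a sanity check on the bandwidth comparison above, here is a quick back-of-the-envelope calculation. It assumes 16-bit mono PCM and a typical Opus voice bitrate of about 48 kbps; the actual figures vary with configuration:

```python
# Back-of-the-envelope check of the "10x" bandwidth claim (approximate figures).
pcm_bytes_per_sec = 24_000 * 2             # 24 kHz, 16-bit mono PCM
pcm_bits_per_sec = pcm_bytes_per_sec * 8   # ~384 kbps raw
base64_overhead = 4 / 3                    # base64 expands binary by ~33%
ws_bits_per_sec = pcm_bits_per_sec * base64_overhead  # ~512 kbps over WebSockets

opus_bits_per_sec = 48_000                 # typical Opus voice bitrate (~32-64 kbps)

print(round(ws_bits_per_sec / opus_bits_per_sec, 1))  # roughly 10x
```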

Strongly-Typed Event Handling

The RTOpenAI.Events library attempts to define F# types for all OpenAI realtime API protocol messages (that are currently documented).

Additionally, the server (and client) messages are wrapped in DUs, which is convenient for consuming applications; incoming events can be handled with simple pattern matching. After the realtime connection is established, there is a steady flow of incoming events from the server that the application needs to accept and handle. The following snippet is an impressionistic version of how the Voice Agent handles server events:

let handleEvent (ev:ServerEvent) = async {
  match ev with
  | SessionCreated                                   -> ...
  | ResponseOutputItemDone ev when isFunctionCall ev -> ...
  | _                                                -> ... //choose to ignore
}

The RTOpenAI library is a cross-platform .NET MAUI (see next) library and as such supports realtime voice applications for iOS, macOS, Android and Windows.

3. Fabulous .NET MAUI Controls

Microsoft .NET MAUI is a technology for building cross-platform native apps. The F# library Fabulous.MauiControls enables building of .NET MAUI apps in F#.

Fabulous is a functional-reactive UI framework (influenced by Elm and React).

Fabulous is a joy to use. UIs can be defined declaratively in simple, understandable F#. UI ‘events’ are messages, which again are F# DU types ‘handled’ with pattern matching. In the simplest case, events update the application state, which Fabulous then renders onto the screen.

Fabulous for .NET MAUI has a rich feature set that cannot be fully covered here, but the Counter App sample is replicated below to give a sense of how the library works:

/// A simple Counter app

type Model =         //application state
    { Count: int }

type Msg =           //DU message types
    | Increment
    | Decrement

let init () =
    { Count = 0 }

let update msg model = //function to handle UI events/messages
    match msg with
    | Increment -> { model with Count = model.Count + 1 }
    | Decrement -> { model with Count = model.Count - 1 }

let view model =  
    Application(
        ContentPage(
            VStack(spacing = 16.) {                     //view
                Image("fabulous.png")

                Label($"Count is {model.Count}")

                Button("Increment", Increment)
                Button("Decrement", Decrement)
            }
        )
    )

RT.Assistant is a .NET MAUI application, so the project structure is defined by .NET MAUI. It’s a single project that targets multiple platforms. Components specific to each target platform are under the Platforms folder:

/RT.Assistant
  /Platforms
    /Android
    /iOS
    /MacCatalyst
    /Windows

The platform-specific folders contain the components the native apps require (plists, app manifests, etc.). For example, here is the iOS plist.

RT.Assistant application code is 90% shared across platforms. However platform-specific libraries are required when interfacing with hardware that .NET MAUI does not cover. For WebRTC, RTOpenAI uses platform-native libraries with Native Library Interop. The iOS WebRTC binding library wraps the WebRTC.xcframework written in C++. And for Android the native libwebrtc.aar Android Archive is wrapped.

Since most mobile apps have both iOS and Android versions, .NET MAUI makes a lot of sense: instead of maintaining multiple code bases and dev teams, one can maintain a single code base with roughly 90% shared code across platforms. And unlike some other cross-platform frameworks (e.g. React Native), .NET MAUI apps are proper native apps. For example, it would be problematic to host a realtime multi-agent system like RTFlow in a JavaScript-based framework like React Native.

4. Prolog for RAG

To make the sample somewhat fun and interesting, I decided to use Prolog-based ‘RAG’. Generative AI meets Symbolic AI.

Prolog is a language for logic programming that was created almost 50 years ago, and it has endured well to this day. The best-known open-source implementation is SWI-Prolog; however, here I am using the much lighter-weight Tau Prolog engine, which runs in the browser.

Fortunately, web content can easily be hosted in .NET MAUI apps via the HybridWebView control. In RT.Assistant, a hidden web view loads the Tau engine and the plan facts.

Prolog Representation

The typical phone plans from the major telecoms are ‘rich’ offerings. The interplay of base plans, number of lines, features and promotions suggests a rules-engine-based approach. This is precisely where Prolog excels. By representing valid combinations of plans, features, and pricing as logical facts, Prolog ensures consistency and removes ambiguity.

Prolog is a declarative language for first-order logic programming. A Prolog ‘database’ consists of facts (e.g. plans and their features) and rules (to derive new facts from existing ones). A Prolog implementation will find any and all possible solutions that satisfy a query, given a set of facts and rules.

The ‘schema’ for the plan and its features is in plan_schema.pl. The skeletal form is:

plan(title,category,prices,features)
% where each feature may have a different attribute set

A partial fact for the ‘Connect’ plan is given below:

plan(
    "Connect",
    category("all"),
    prices([
      line(1, monthly_price(20), original_price(25)),
      line(2, monthly_price(40), original_price(26)),
      ...
    ]),
    features([

        feature(
            netflix(
                desc("Netflix Standard with Ads On Us"),
                included(yes)
            ),
            applies_to_lines(lines(2, 2))
        ),

        feature(
            autopay_monthly_discount(
                desc("$5 disc. per line up to 8 lines w/AutoPay & eligible payment method."),
                discount_per_line(5),
                lines_up_to(8),
                included_in_monthly_price(yes)
            ),
            applies_to_lines(all)
        ),
        ...

    ])
).

Note

The full Prolog fact may seem complex; however, the same rules expressed in a relational database schema would be far more complex to understand and query. The metadata (columns, tables, relations) required to represent the rules and facts would be far greater than what Prolog requires.

Query Processing

While we can obtain an answer by prompting the LLM with text descriptions of the plans along with the query, there is a sound reason for not doing so. LLMs are not perfect and can make mistakes. And here we desire a more precise answer. So, instead we transform the natural language user query into an equivalent Prolog query – with the help of an LLM. It is surmised that the reformulation of the question is easier for the LLM, i.e. the LLM is less likely to hallucinate compared to the case of generating the answer directly. For direct answer generation, the LLM will need to sift through a much larger context – the entire plan database as plain text. For query generation, the LLM need only look at the database ‘schema’ – which is much more compact, especially in the case of Prolog.

If query transformation goes awry then the Prolog query may fail entirely or produce strange results. Either way the user will be alerted and will not rely on the results to make a decision. If on the other hand, the answer is generated directly, a hallucination may subtly alter or miss facts. The user is likely to accept it without questioning because the answer looks plausible. This is a more egregious error.

Example Queries

Below are some typical questions that can be asked:

  • What categories of plans are available?
  • What is the lowest cost plan for 4 lines?
    • Follow up: Does this plan include taxes and fees in the price?
  • Are there any special plans for military veterans?
    • Follow up: What is the amount of mobile hotspot data available for this plan?

The RT.Assistant application shows the natural language query; the generated Prolog; and the Prolog query results on the UI in realtime.

Example:

Natural language query generated by voice model from conversation:

Find the plans in the category 'military_veteran' for 2 lines and list their costs.

Prolog query:

plan(Title,
     category(military_veteran),
     prices(Lines),
     _),
member(line(2, monthly_price(Price), _), Lines).

Note

In Prolog, uppercase-starting names are ‘free’ variables that can be bound to values. For example, ‘Title’ above will bind to each of the plan titles for the found solutions. A solution satisfies all constraints. One obvious constraint is ‘category=military_veteran’ so only Military Veteran plans will be considered.

Results:

Title = Connect Next Military, Lines =
[line(1,monthly_price(85),original_price(90)),
 line(2,monthly_price(130),original_price(140)),
 line(3,monthly_price(165),original_price(180)),
 line(4,monthly_price(200),original_price(220)),
 line(5,monthly_price(235),original_price(260))],
Price = 130

Title = Core, Lines = ...

If a Prolog error occurs, the system regenerates the Prolog query but this time includes the Prolog error message along with the original query. This cycle may be repeated up to a limit.
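The regenerate-on-error cycle can be sketched as follows. `generate_prolog` and `run_prolog` are hypothetical stand-ins for the LLM call and the Tau engine, stubbed here so the control flow is runnable:

```python
# Sketch of the retry cycle described above (names are illustrative).
MAX_ATTEMPTS = 3

def query_with_retry(nl_query, generate_prolog, run_prolog):
    error = None
    for _ in range(MAX_ATTEMPTS):
        # On retries, feed the Prolog error message back to the code generator.
        prompt = nl_query if error is None else f"{nl_query}\nPrevious error: {error}"
        prolog = generate_prolog(prompt)
        try:
            return run_prolog(prolog)
        except RuntimeError as e:   # Prolog evaluation failed
            error = str(e)
    raise RuntimeError(f"Query failed after {MAX_ATTEMPTS} attempts: {error}")

# Stubbed demo: the first generation is broken, the corrected retry succeeds.
attempts = []

def fake_generate(prompt):
    attempts.append(prompt)
    return "bad." if len(attempts) == 1 else "plan(T)."

def fake_run(code):
    if code == "bad.":
        raise RuntimeError("syntax error")
    return ["Connect"]

result = query_with_retry("list plans", fake_generate, fake_run)
print(result)  # ['Connect']
```

Capping the loop at a small attempt limit keeps latency bounded, which matters for a realtime voice interaction.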

Prolog Code Generation and Results

For code generation, the application allows a choice between Claude Sonnet 4.5 and GPT-5.1 (via the app Settings). The GPT Codex model was also tested, but its latency is too high for realtime needs.

For this particular task, GPT-5.1 has the clear edge, generating code that produces concise and relevant output. See this analysis for more details.

For what it’s worth, both models generate syntactically correct Prolog 99% of the time. (A retry loop corrects generated errors, if any.)

For question-answering, the OpenAI realtime model generates satisfactory answers to user queries from the generated Prolog output. Note that for any real production system there should be a well-crafted ‘eval’ suite to truly gauge the performance.

The post RT.Assistant: A Multi-Agent Voice Bot Using .NET and OpenAI appeared first on .NET Blog.


Stop Paying for the Same Answer Twice: Agent Cache in Fiddler Everywhere

Agent Caching in Fiddler Everywhere allows you to iterate as you build an agent without having to pay for every response when it hasn’t changed.

If you have ever built a model-powered agent, you know the development loop. Write some code, fire it at the endpoint, check the response, tweak the parsing, fire it again. Repeat until the output looks right. It is a perfectly normal workflow—and it quietly drains your token budget with every single iteration.

The new Progress Telerik Fiddler Everywhere Agent Cache feature is designed to break that cycle. Once you capture a response from a model-provider endpoint, you can flip a single switch and have Fiddler software replay that response for every subsequent matching call—without the request ever leaving your machine. Same output, zero additional tokens consumed on the provider side.

This post walks through exactly how that works, using a small open-source demo project to make everything concrete.

The Hidden Cost of Agent Development

Building an agent that calls a completion endpoint involves a lot of repetition that has nothing to do with the model itself. You are iterating on:

  • How you construct the prompt
  • How you parse and validate the structured response
  • How you surface the result to the rest of your system
  • How your error handling behaves when the response is malformed

None of those iterations require a new, unique response from the model. You already have a good one from the first call. But unless you manually save the raw response and mock it yourself, every invocation sends a fresh request, and the provider charges for it.

Once agents move beyond demos, three pressures show up together and stay for the duration of development:

  • Cost – Repeated runs during development burn budget. For a simple agent that exchanges a few hundred tokens per call, this might feel negligible. But development sessions involve dozens of sequential runs, teams have multiple developers iterating in parallel, and the costs compound quickly.
  • Latency – Every round trip to the provider stretches the feedback loop. When you are tweaking prompt construction or adjusting response parsing, waiting on a live call each time slows everything down.
  • Determinism – The same input does not always produce the same output. That variability makes it harder to isolate whether a difference in behavior came from your code change or from the model.

This is especially visible in teams that build many small, task-specific agents rather than one large agent. Even small per-run costs compound when iteration is constant—and none of that spend actually improves the agent.

What Teams Already Do

Most teams already compensate for this manually. Common patterns include separating development runs from real execution, validating agent wiring before triggering model calls, reusing mocked or previously captured responses, and avoiding live execution early to keep iteration fast.

These approaches work, but they are fragmented. Provider-level caching helps in some cases but is limited. Custom mocks and fixtures are costly to maintain. Replay logic often lives outside the main development flow, and different teams end up solving the same problem with different local tooling.

The problem is not a lack of solutions. It is the lack of a low-friction one that fits naturally into everyday iteration.

What Agent Cache Does

Fiddler Everywhere acts as a proxy that sits between your agent and the remote endpoint. When your agent makes an HTTPS call to, say, api.anthropic.com, Fiddler software intercepts it, forwards it and logs the full request-response pair in the Traffic pane.

The new Agent Calls tab is a focused view inside that pane. It automatically filters and displays HTTPS sessions that target supported model-provider endpoints—such as OpenAI, Anthropic and Gemini—so you are not wading through noise from other traffic. Every captured call gets a Caching toggle.

Enable the toggle, and Fiddler software starts intercepting any outbound call that matches that session’s request. Instead of forwarding the request, it immediately returns the cached response. The endpoint never receives the duplicate call. Your agent sees the exact same payload it would have received from a live call. Token count: zero.

Disable the toggle at any time and live traffic resumes, no restarts required.

How Agent Calls and Caching Behave

A few details that matter when you start using it:

  • Deterministic filtering: Sessions appear in Agent Calls automatically when Fiddler software detects traffic to a supported agentic endpoint. You do not need to configure which endpoints to watch.
  • First-match caching: If two or more sessions target the same endpoint (for example, https://api.anthropic.com/v1/messages) and both are cached, Fiddler software returns the response from the first cached session.
  • No rule interference: Fiddler rules are executed only for non-cached sessions. Cached responses are returned as-is, without rule evaluation.
  • Visibility split: After a session is cached, subsequent matching requests appear only in Live Traffic. The Agent Calls tab continues to show the original non-cached captures.

Why It Matters During Development

Agent Cache is built around three practical benefits that matter most during active development.

  1. Faster iterations: Replaying a cached response is instant. Instead of waiting on a round trip to the provider on every run, you get a result back immediately—shortening the feedback loop so you can move through prompt and code changes without unnecessary delays.
  2. Lower execution costs: Each cached run consumes zero tokens on the provider side. During active development, where the same request may be triggered dozens of times, this directly reduces the token spend that accumulates before a feature is even complete.
  3. More predictable behavior: A cached response is fixed and repeatable. Running the same agent logic against the same response on every iteration makes it straightforward to verify that a code change had the intended effect, without having to account for variability in live model output.

Demo: Bug Report Analyzer

To make this tangible, walk through the agent-cache-demo—a minimal Python agent that takes a fixed bug report and returns a structured analysis (severity, category, a plain-English summary and a suggested next step).

The input never changes between runs, which makes it a perfect showcase for Agent Cache: the model’s answer to an identical prompt is always reusable, so there is genuinely no reason to pay for it more than once.

What the Agent Does

The core of agent.py is straightforward:

message = client.messages.create(
    model=MODEL,
    max_tokens=256,
    system=SYSTEM_PROMPT,
    messages=[
        {"role": "user", "content": f"Analyze this bug report:\n\n{report}"}
    ],
)

It sends the bug report to the Claude API and expects a JSON response like this:

{
  "severity": "high",
  "category": "crash",
  "summary": "App crashes with a NullPointerException when attempting to log in under no network connectivity.",
  "suggested_next_step": "Add a null or connectivity check in NetworkManager.checkConnectivity() before network calls."
}

That response is then formatted and printed to the terminal:

── Bug Report Analysis ─────────────────────────────────────
Severity  : HIGH
Category  : crash
Summary   : App crashes with a NullPointerException when attempting to
            log in under no network connectivity.
Next step : Add a null or connectivity check in
            NetworkManager.checkConnectivity() before network calls.
────────────────────────────────────────────────────────────

Setup

Clone the repository and install dependencies:

git clone https://github.com/NickIliev/agent-cache-demo
cd agent-cache-demo  
  
python -m venv .venv  
source .venv/bin/activate  # macOS / Linux  
.venv\Scripts\activate  # Windows  
  
pip install -r requirements.txt  
export ANTHROPIC_API_KEY=sk-ant-... # macOS / Linux (Git Bash)  
set ANTHROPIC_API_KEY=sk-ant-... # Windows (CMD)  

The demo supports routing traffic through the Fiddler proxy or running directly against the provider. It also covers SSL/TLS trust configuration for HTTPS interception. See the repository README for full details on proxy setup, environment variables and certificate options.

Step 1: The First Live Call

Start Fiddler Everywhere and run the agent:

python agent.py  

The terminal shows the result and, crucially, the token consumption:

[tokens] Input: 312  |  Output: 68  |  Total: 380  

Switch to Fiddler Everywhere and open Traffic > Agent Calls. You will see the captured call to api.anthropic.com with the full request and response visible.

Fiddler Everywhere Traffic Agent Calls shows the captured call to api.anthropic.com with the full request and response.

This is your baseline. You paid for 380 tokens. That is fair—you needed the live call to validate the end-to-end flow.

Step 2: Enable the Cache

In the Agent Calls grid, find the captured session and flip its Caching switch to on. That is the entire configuration step.

Fiddler Everywhere Agent Calls has caching toggled on

Step 3: All Subsequent Runs Are Free

Run the agent again:

python agent.py  

The output in the terminal is byte-for-byte identical to the first run, including the token count display. Because the Caching switch was on, Fiddler served the stored response immediately and never forwarded the request to the provider. The endpoint never saw the call.

Fiddler caching stored response immediately and never forwarded the request to the provider.

You can now iterate on agent.py as many times as you need—refactor the display logic, adjust the JSON parsing, add logging—and none of those runs cost a single token.

On the Claude Console, only the first call was received by the provider. All subsequent calls were served from Fiddler's cache and were never billed.

When to Use Agent Cache

Agent Cache is a development-stage tool. It is particularly valuable when:

  • Iterating on response handling: Your agent already returns a correct response from the model. You are now working on how your code handles that response—formatting, validation, error recovery. None of that work requires fresh model calls.
  • Sharing a working state with teammates: Cache a known-good response and share the Fiddler session. Everyone on the team can iterate against the same replay without burning tokens or depending on network access to the provider.
  • Working offline or in restricted environments: Once the cache is populated, your agent keeps working even without connectivity to the provider.

Things to Keep in Mind

  • Cache matching is request-based. If your agent changes the prompt, the model or any request headers, the cached session will no longer match. Capture and cache the updated variant separately.
  • The cache lives in the current Fiddler session. Closing and reopening Fiddler clears the cache state, so the next run after a restart will make a live call. Review cached sessions periodically to keep stored responses aligned with your current workflow.
  • Cache is for development, not production. Agent Cache is designed for development workflows where deterministic, repeatable responses are the goal. When you are ready to validate against a live endpoint, disable the cache and resume live calls.

Availability

Agent Cache is available on Fiddler Everywhere Trial, Pro and Enterprise tiers. The feature is not included in Lite licenses.

Try It Yourself

The full demo is on GitHub: github.com/NickIliev/agent-cache-demo. Clone it, set your Anthropic API key, and you can see the before-and-after token counts yourself in under five minutes.

The point is not really the 380 tokens saved in a single run. It is the dozens of runs you make in a typical development session, the parallel runs across a team—all of which can stop paying for answers they already have.

Agent Cache does not change how you build agents. It just removes the tax on iterating.

If you aren’t already using Fiddler Everywhere, it does come with a free trial:

Try Fiddler Everywhere

Leave Feedback

Agent development workflows are still evolving quickly, and your feedback shapes what comes next. If you try Agent Cache during development—or if there is something you wish it did differently—we want to hear about it.


With AI Writing Code, What Are Developers For?


As a software consultant, I’ve noticed a pattern play out at nearly every client over the last year. A team adopts Cursor or Claude Code or Copilot and their productivity, especially on greenfield tasks, jumps noticeably. And then, someone asks: “If the AI can do this, what are the developers for?”

It’s a valid question, and one I’ve been thinking about myself as AI has improved at many software development tasks over the last year or so. Using these tools daily on client projects, internal work, and side projects has made the answer clear to me. No matter how good AI gets at some of our daily tasks, developers will still be needed for their systems thinking, their setting of guardrails for the AI, and most importantly, their human judgment.

Software Development Was Always Repetitive

For decades, a huge part of professional software development has been following the patterns that already exist in a codebase. You might stand up a new REST API, and that first endpoint is genuinely hard. You’re making real decisions about design patterns, systems architecture, URL structure, authentication/authorization, database access, caching, and error handling. But for endpoints two through twenty, you’re just following the recipe you already wrote. We run into this a lot with client teams. They might not have the experience to architect a well-designed system from scratch, but once we get them going on a good pattern, they can easily follow it for new features.

AI is also very good at following recipes. Point it at a codebase with established conventions and it’ll crank out the next endpoint, the next service method, or the next React component in the same shape as the ones before it. In that way it’s like a junior developer who reads the existing code before writing new code and carefully follows the established patterns.

So yes, a meaningful chunk of what we used to spend our days typing is now automatable. My fingers don’t hurt at the end of the day anymore, and I don’t think they’re going to again, especially as more developers leverage voice chat capabilities with their AI tools.

AI Doesn’t Know What Good Looks Like

Here’s what I keep running into. These models are trained on an internet’s worth of code. A lot of that code, most of it really, is mediocre. Developers who have tried to find solutions to their questions on Stack Overflow for the last decade already know this. Whether it is tutorial snippets, Reddit threads, Stack Overflow answers written in a hurry, or open source projects with no review process, a lot of the code on the internet (and in the world) is poor-to-mediocre. These models are fundamentally averaging machines that guess at the most likely next word or token. If you give them a vague prompt, you will most likely get back something that looks like the average of what’s out there on the internet. It might eventually compile, it might work, but it won’t reflect the specific decisions and tradeoffs your project needs.

I’ve never seen an AI look at a codebase and suggest a better architecture unless it’s specifically asked to by the developer running it. It probably won’t notice that your auth middleware has a subtle timing vulnerability. It won’t propose event sourcing because it picked up on a pattern of concurrency bugs in your shared state. It doesn’t know your deployment constraints, your team’s skill level, your future scalability needs, or the fact that your biggest customer hammers one particular endpoint at the same time every Monday morning.

Those are judgment calls that require human experience. In my experience, the developers getting the most out of AI right now are the ones whose judgment is already sharp. They can tell when the model missed and know exactly how to correct it.

The Specification Problem Remains

We have a joke in consulting: clients say they have “detailed specs” and then hand you the title of their project. I’ve been doing this long enough to know that the gap between what someone says they want and what they actually need is where most project risk can be found.

AI has the exact same problem. A quick, vague prompt by someone without experience can generate a lot of impressive-looking code fast. Then you spend three times as long iterating it into something that actually meets the requirements–requirements you should have pinned down before you started generating anything.

The teams I’ve seen get real traction with AI-assisted development aren’t the ones who figured out some magic prompt template. They’re the ones who already had their software fundamentals together: clear requirements, fast CI pipelines, strong automated test coverage, pull request reviews that actually catch things. Those aren’t AI skills. Those are engineering discipline. AI just raised the stakes on them.

If your feedback loops are slow (if you don’t know your code is broken until it’s in staging) then AI is only going to make you produce broken code faster, and that’s not a win.

What About the Next Generation?

This one’s harder to talk about, and I think our industry is being too quiet about it. Hiring managers at companies I work with are pausing junior roles. Not eliminating them altogether, just pausing hiring for them, so they can see how far their current teams can scale with AI. Honestly, it’s rough for new graduates right now.

I lived through the outsourcing scare of the mid-2000s. Teams I worked on lost people to offshore replacements. For a while it felt like the bottom was falling out. But it didn’t. The work evolved, the value proposition shifted, and eventually it stabilized. Those who moved up the value chain survived. I think something similar is happening with AI, but I’m not going to pretend it’s happening on the same timeline. AI is a much faster-moving disruption.

What I’ll say is this: the developers entering the field now need to lead with judgment earlier than my generation did. Writing decent algorithms or code that compiles aren’t differentiators when AI can do that. Understanding why certain decisions matter and being able to look at generated code and say “this won’t scale” or “this misses the actual requirement,” that’s what will stand out in the marketplace. It’s a higher bar, and the industry owes it to junior devs to be upfront about that instead of pretending nothing has changed. Recently, Mark Russinovich and Scott Hanselman from Microsoft published an excellent paper about what organizations can do to avoid some of the pitfalls of this era for junior engineering talent.

Developers Are For Judgment

After about a year of working with these tools, here’s where I’ve landed. Developers aren’t paid for typing code. The best ones never really were. It just felt that way because typing code took up so much of the day. Developers are paid for knowing which endpoint needs the cache and which one doesn’t. They are paid for catching that the generated migration will lock a production table for twenty minutes and slow down other critical tasks. They are paid for understanding that what the product owner thinks they need isn’t what they actually need, and for helping them see it.

AI now handles the mechanical translation of intent into code. Developers are the ones who make sure that intent is right in the first place and the ones who know how to fix it when it isn’t. And that’s not a new skill. It’s the skill that was always underneath the typing. We just get to spend more time on it now.

I recently went deep on the tactical side of all this with my guest Cory House on an extended edition of the Blue Blazes Podcast. We discuss choosing between AI harnesses, model selection, multi-agent workflows, and how to actually structure your prompts and feedback loops. If you’re adopting AI-assisted development on your team, give it a watch or listen.

The post With AI Writing Code, What Are Developers For? appeared first on Trailhead Technology Partners.


Introducing the Azure Cosmos DB Plugin for Cursor


We’re excited to announce the Cursor plugin for Azure Cosmos DB, bringing AI-powered database expertise, best-practices guidance, and live database connectivity directly into your coding workflow.

Whether you’re designing a new data model, optimizing queries, or building a RAG application with vector search, the Cursor plugin gives your AI assistant deep knowledge of Azure Cosmos DB so you can write better code, faster.

What’s in the Plugin

The Azure Cosmos DB plugin bundles three capabilities:

  • Rules: 10 best-practice rule sets that guide the AI when writing Cosmos DB code. No setup required; works immediately.
  • MCP Server: Live connection to your Cosmos DB account for querying data, discovering schemas, and running vector searches. Requires a deployed MCP Toolkit.
  • Saved Prompt: A one-click prompt to regenerate rules when upstream guidance updates. No setup required.

Finding and Installing the Plugin

From the Cursor Marketplace

  1. Open Cursor and go to the Marketplace panel
  2. Search for “Azure Cosmos DB”
  3. Click Install

Cursor editor marketplace showing the Azure Cosmos DB plugin available for installation, with search results highlighting access to Azure Cosmos DB accounts within the IDE.
Install azure cosmosdb plugin from the cursor marketplace

That’s it. The rules are active immediately; no configuration needed.

From a Direct Link

You can also install by visiting cursor.com/marketplace and searching for “Azure Cosmos DB”, or by sharing the plugin link with your team.

Cursor marketplace webpage showing recently added plugins, including Azure Cosmos DB, Svelte, Postman, Cisco ThousandEyes, Datadog, and Firetiger, with navigation options for browsing and managing plugins.

For Teams and Enterprise

Admins on Teams or Enterprise plans can add the plugin to a Team Marketplace for centralized distribution:

  1. Go to Dashboard → Settings → Plugins → Team Marketplaces
  2. Click Import and paste the plugin repository URL
  3. Set the plugin as Required (auto-install for everyone) or Optional (developer’s choice)

Using Rules — Zero-Setup Best Practices

The plugin includes 10 curated rule sets covering the most important aspects of working with Azure Cosmos DB. These rules are sourced from the Azure Cosmos DB Agent Kit and encode the same guidance that our engineering team recommends.

What the Rules Cover

  • Data Modeling — Document design, embedding vs. referencing, denormalization, type discriminators, size limits, and schema evolution
  • Partition Key Design — High-cardinality selection, query alignment, avoiding hot partitions, hierarchical partition keys, and synthetic keys
  • Query Optimization — Eliminating cross-partition queries, avoiding full scans, using projections, parameterized queries, and continuation-token pagination
  • SDK Best Practices — Singleton client pattern, Direct connection mode, retry handling, diagnostics logging, async patterns, ETags for concurrency, and availability strategies
  • Indexing Strategies — Composite indexes for ORDER BY, excluding unused paths, spatial indexes, and choosing the right index type
  • Throughput & Scaling — Autoscale vs. provisioned vs. serverless, right-sizing throughput, container vs. database throughput, and burst capacity
  • Global Distribution — Multi-region writes, consistency levels, automatic failover, conflict resolution, and zone redundancy
  • Monitoring & Diagnostics — Tracking RU consumption, P99 latency, throttling alerts, diagnostic logging, and Azure Monitor integration
  • Design Patterns — Change feed materialized views, ranking patterns, multi-tenant architectures, and event sourcing
  • Vector Search — Feature enablement, embedding policies, vector index configuration, VectorDistance queries, and repository patterns
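As a concrete illustration of the partition-key guidance, here is a small, hypothetical sketch (not taken from the plugin or its rules files) of two techniques that guidance describes: a synthetic key combining two properties for higher cardinality, and a hash suffix that spreads a hot key across a bounded number of sub-partitions:

```python
import hashlib

def synthetic_partition_key(tenant_id: str, user_id: str) -> str:
    # Combine two document properties into one higher-cardinality synthetic key,
    # so a single large tenant does not become a hot logical partition
    return f"{tenant_id}:{user_id}"

def suffixed_partition_key(base: str, buckets: int = 16) -> str:
    # Deterministically spread writes for a hot key across `buckets` sub-partitions;
    # a reader fans out across at most `buckets` partition key values
    digest = hashlib.sha256(base.encode("utf-8")).hexdigest()
    return f"{base}-{int(digest, 16) % buckets}"

print(synthetic_partition_key("contoso", "user-001"))  # contoso:user-001
print(suffixed_partition_key("contoso"))
```

The trade-off the rules call out: synthetic keys keep point reads cheap when you know both properties, while hash suffixes trade bounded read fan-out for even write distribution.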

How Rules Work in Practice

Once installed, Cursor’s AI agent automatically draws on these rules when you’re working with Cosmos DB code. You don’t need to do anything — just ask questions in chat or let the agent assist while you code.

Try these prompts:

What partition key should I use for a multi-tenant SaaS app?

Review my Cosmos DB data model for this e-commerce application

Optimize this query — it’s consuming too many RUs

Should I embed order items in the order document or store them separately?

The AI will respond with specific, actionable guidance grounded in Cosmos DB best practices — not generic database advice.

Managing Rules

You can control how each rule behaves from Cursor Settings → Rules:

  • Always — Rule applies to every conversation
  • Agent Decides — The AI activates the rule when it’s relevant (default)
  • Manual — Only used when you explicitly invoke it

Using MCP — Live Database Connectivity

The Model Context Protocol (MCP) Toolkit for Azure Cosmos DB connects Cursor to your actual Azure Cosmos DB account, enabling the AI to query real data, explore schemas, and search documents directly from chat.

Available MCP Tools

  • list_databases: List all databases in your Cosmos DB account
  • list_collections: List all containers in a database
  • get_approximate_schema: Sample documents to infer container schema
  • get_recent_documents: Retrieve the N most recent documents
  • find_document_by_id: Look up a specific document by its ID
  • text_search: Search documents where a property contains a phrase
  • vector_search: Run semantic similarity searches using Azure OpenAI embeddings
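Under the hood these are standard Model Context Protocol tools, so any MCP client (Cursor included) invokes them with the protocol's `tools/call` request. A request for `list_databases` would look roughly like this (a sketch of the MCP wire format, not toolkit-specific):

```json
{
  "jsonrpc": "2.0",
  "id": 1,
  "method": "tools/call",
  "params": {
    "name": "list_databases",
    "arguments": {}
  }
}
```

Tools that take parameters, such as `find_document_by_id`, receive them in the `arguments` object.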

Setting Up the MCP Server

The MCP tools require a deployed instance of the Azure Cosmos DB MCP Toolkit. Here’s how to get it running:

Step 1: Deploy the MCP Toolkit

git clone https://github.com/AzureCosmosDB/MCPToolKit.git
cd MCPToolKit
azd up

Or use the Deploy to Azure button in the MCPToolKit README.

Step 2: Set Environment Variables

After deployment, set two environment variables that the plugin uses to connect:

# The URL of your deployed MCP Toolkit (from deployment-info.json)

Note: JWT tokens expire after approximately one hour. Re-run the command above to refresh.

Step 3: Restart Cursor and Verify

After setting the environment variables, restart Cursor. Then open a chat and try:

List all databases in my Cosmos DB account

You should see the AI call the list_databases tool and return your actual database names.

Managing the MCP Server

Toggle the MCP server on or off from Cursor Settings → Features → Model Context Protocol. When disabled, the server won’t load and the tools won’t appear in chat. The rules continue to work independently.

Example MCP Workflows

Once connected, you can have conversations like:

Show me the schema of the users container in the mydb database

Get the latest 10 documents from the orders container

Search for documents where the name contains “Azure”

Find the document with id “user-001” in the users container

The AI will invoke the appropriate MCP tool, retrieve real data from your account, and use it to answer your question — all without leaving the editor.

Rules + MCP Together

The real power comes from combining rules and MCP. When you ask the AI to review a data model, it can:

  1. Query your actual schema using get_approximate_schema
  2. Analyze it against best practices using the data-modeling and partition-key rules
  3. Give you specific recommendations grounded in both your real data and Cosmos DB expertise

For example:

Look at the schema of my orders container and suggest improvements for partition key design and query performance

The AI will fetch your container’s schema via MCP, then apply partition key, query optimization, and indexing rules to provide tailored recommendations.

Getting Started

  1. Install the plugin from the Cursor Marketplace — search for “Azure Cosmos DB”
  2. Start using rules immediately — ask Cosmos DB questions in chat
  3. Optionally deploy the MCP Toolkit for live database connectivity
  4. Explore the plugin repository for full documentation

Resources

We’d love to hear your feedback. Try the plugin and let us know how it works for your Cosmos DB projects; file issues or contribute on GitHub.

About Azure Cosmos DB

Azure Cosmos DB is a fully managed and serverless NoSQL and vector database for modern app development, including AI applications. With its SLA-backed speed and availability as well as instant dynamic scalability, it is ideal for real-time NoSQL and MongoDB applications that require high performance and distributed computing over massive volumes of NoSQL and vector data.

To stay in the loop on Azure Cosmos DB updates, follow us on X, YouTube, and LinkedIn. Join the discussion with other developers on the #nosql channel on the Microsoft Open Source Discord.

The post Introducing the Azure Cosmos DB Plugin for Cursor appeared first on Azure Cosmos DB Blog.


GPT-5.4 mini and GPT-5.4 nano, which can describe 76,000 photos for $52


OpenAI today: Introducing GPT‑5.4 mini and nano. These models join GPT-5.4 which was released two weeks ago.

OpenAI's self-reported benchmarks show the new 5.4-nano out-performing their previous GPT-5 mini model when run at maximum reasoning effort. The new mini is also 2x faster than the previous mini.

Here's how the pricing looks - gpt-5.4-nano is notably even cheaper than Google's Gemini 3.1 Flash-Lite:

Model                   Input    Cached input   Output
gpt-5.4                 $2.50    $0.25          $15.00
gpt-5.4-mini            $0.75    $0.075         $4.50
gpt-5.4-nano            $0.20    $0.02          $1.25

Other models for comparison (input / output):

Claude Opus 4.6         $5.00 / $25.00
Claude Sonnet 4.6       $3.00 / $15.00
Gemini 3.1 Pro          $2.00 / $12.00
Claude Haiku 4.5        $1.00 / $5.00
Gemini 3.1 Flash-Lite   $0.25 / $1.50

I used GPT-5.4 nano to generate a description of this photo I took at the John M. Mossman Lock Collection:


llm -m gpt-5.4-nano -a IMG_2324.jpeg 'describe image'

Here's the output:

The image shows the interior of a museum gallery with a long display wall. White-painted brick walls are covered with many framed portraits arranged in neat rows. Below the portraits, there are multiple glass display cases with dark wooden frames and glass tops/fronts, containing various old historical objects and equipment. The room has a polished wooden floor, hanging ceiling light fixtures/cords, and a few visible pipes near the top of the wall. In the foreground, glass cases run along the length of the room, reflecting items from other sections of the gallery.

That took 2,751 input tokens and 112 output tokens, at a cost of 0.069 cents (less than a tenth of a cent). That means describing every single photo in my 76,000 photo collection would cost around $52.44.
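Checking that arithmetic, with the table prices read as dollars per million tokens (consistent with the quoted per-photo cost):

```python
# gpt-5.4-nano prices from the table above, in dollars per million tokens
INPUT_PER_M, OUTPUT_PER_M = 0.20, 1.25

def photo_cost(input_tokens: int, output_tokens: int) -> float:
    # Cost in dollars for one describe-image call
    return input_tokens / 1e6 * INPUT_PER_M + output_tokens / 1e6 * OUTPUT_PER_M

per_photo_cents = round(photo_cost(2751, 112) * 100, 3)
print(per_photo_cents)                           # 0.069
print(round(per_photo_cents / 100 * 76_000, 2))  # 52.44
```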

I released llm 0.29 with support for the new models.

Then I had OpenAI Codex loop through all five reasoning effort levels and all three models and produce this combined SVG grid of pelicans riding bicycles (generation transcripts here). I do like the gpt-5.4 xhigh one the best, it has a good bicycle (with nice spokes) and the pelican has a fish in its beak!

Described by Claude Opus 4.6: A 5x3 comparison grid of AI-generated cartoon illustrations of a pelican riding a bicycle. Columns are labeled "gpt-5.4-nano", "gpt-5.4-mini", and "gpt-5.4" across the top, and rows are labeled "none", "low", "medium", "high", and "xhigh" down the left side, representing quality/detail settings. In the "none" row, gpt-5.4-nano shows a chaotic white bird with misplaced arrows and tangled wheels on grass, gpt-5.4-mini shows a duck-like brown bird awkwardly straddling a motorcycle-like bike, and gpt-5.4 shows a stiff gray-and-white pelican sitting atop a blue tandem bicycle with extra legs. In the "low" row, nano shows a chubby round white bird pedaling with small feet on grass, mini shows a cleaner white bird riding a blue bicycle with motion lines, and gpt-5.4 shows a pelican with a blue cap riding confidently but with slightly awkward proportions. In the "medium" row, nano regresses to a strange bird standing over bowling balls on ice, mini shows two plump white birds merged onto one yellow-wheeled bicycle, and gpt-5.4 shows a more recognizable gray-and-white pelican on a red bicycle but with tangled extra legs. In the "high" row, nano shows multiple small pelicans crowded around a broken green bicycle on grass with a sun overhead, mini shows a tandem bicycle with two white pelicans and clear blue sky, and gpt-5.4 shows two pelicans stacked on a red tandem bike with the most realistic proportions yet. In the "xhigh" row, nano shows the most detailed scene with a pelican on a detailed bicycle with grass and a large sun but still somewhat jumbled anatomy, mini produces the cleanest single pelican on a yellow-accented bicycle with a light blue sky, and gpt-5.4 shows a well-rendered gray pelican on a teal bicycle with the best overall coherence. 
Generally, quality improves moving right across models and down through quality tiers, though "medium" is inconsistently worse than "low" for some models, and all images maintain a lighthearted cartoon style with pastel skies and simple backgrounds.

Tags: ai, openai, generative-ai, llms, llm, vision-llms, llm-pricing, pelican-riding-a-bicycle, llm-release
