
The 'Million AI Monkeys' Hypothesis & Real-World Projects


I ran into this post by John Rush and found it really interesting, mostly because I so vehemently disagree with it. Here are the points that I want to address in John’s thesis:

1. Open Source movement gonna end because AI can rewrite any oss repo into a new code and commercially redistribute it as their own.

2. Companies gonna use AI to generate their none core software as a marketing effort (cloudflare rebuilt nextjs in  a week).

Can AI rewrite an OSS repo into new code? Let’s dig into this a little bit.

AI models today do a great job of translating code from one language to another. We have good testimonies that this is actually a pretty useful scenario, such as the recent translation of the Ladybird JS engine to Rust.

At RavenDB, we have been using that to manage our client APIs (written in multiple languages & platforms). It has been a great help with that.

But that is fundamentally the same as the Java to C# converter that shipped with Visual Studio 2005. That is 2005, not 2025, mind you. The link above is to the Wayback Machine because the original link itself is lost to history.

AI models do a much better job here, but they aren’t bringing something new to the table in this context.

Claude C Compiler

Now, let’s talk about using the model to replicate a project from scratch. And here we have a bunch of examples. There is the Claude C Compiler, an impressive feat of engineering that can compile the Linux kernel.

Except… it is a proof of concept that you wouldn’t want to use. It produces code that is significantly slower than GCC, and its output is not something that you can trust. And it is not in a shape to be a long-term project that you would maintain over the years.

For a young project, being slower than the best-of-breed alternative is not a bad thing. You’ve shown that your project works; now you can work on optimization.

For an AI project, on the other hand, you are in a pretty bad place. The key here is in terms of long-term maintainability. There is a great breakdown of the Claude C Compiler from the creator of Clang that I highly recommend reading.

The amount of work it would require to turn it into actual production-level code is enormous. I think that it would be fair to say that the overall cost of building a production-level compiler with AI would be in the same ballpark as writing one directly.

Many of the issues in the Claude C Compiler are not bugs that you can “just fix”. They are deep architectural issues that require a very different approach.

Leaving that aside, let’s talk about the actual use case. The Linux kernel’s relationship with its compiler is not a trivial one. Compiler bugs and behaviors are routine issues that developers run into and need to work on.

See the occasional “discussion” on undefined behavior optimizations by the compiler for surprisingly straightforward code.

Cloudflare’s vinext

So Cloudflare rebuilt Next.js in a week using AI. That is pretty impressive, but that is also a lie. They might have done some work in a week, but that isn’t something that is ready. Cloudflare is directly calling this highly experimental (very rightly so).

They also have several customers using it in production already. That is awesome news, except that within literal days of this announcement, multiple critical vulnerabilities have been found in this project.

A new project having vulnerabilities is not unexpected. But some of those vulnerabilities were literal copies of (fixed) vulnerabilities in the original Next.js project.

The issue here is the pace of change and the impact. If it takes an agent a week to build a project and then you throw that into production, how much real testing has been done on it? How much is that code worth?

John stated that this vinext project for Cloudflare was a marketing effort. I have to note that they had to pay bug bounties as a result and exposed their customers to higher levels of risk. I don’t consider that a plus. There is also now the ongoing maintenance cost to deal with, of course.

The key here is that a line of code is not something that you look at in isolation. You need to look at its totality. Its history, usage, provenance, etc. A line of code in a project that has been battle-tested in production is far more valuable than a freshly generated one.

I’ll refer again to the awesome “Things You Should Never Do” from Spolsky. That is over 25 years old and is still excellent advice, even in the age of AI-generated code.

NanoClaw’s approach

You’ve probably heard about the Clawdbot ⇒ Moltbot ⇒ OpenClaw, a way to plug AI directly into everything and give your CISO a heart attack. That is an interesting story, but from a technical perspective, I want to focus on what it does.

A key part of what made OpenClaw successful was the number of integrations it has. You can connect it to Telegram, WhatsApp, Discord, and more. You can plug it into your Gmail, Notes, GitHub, etc.

It has about half a million lines of code (TypeScript), which were mostly generated by AI as well.

To contrast that, we have NanoClaw with ~500 lines of code. Not a typo, it is roughly a thousand times smaller than OpenClaw. The key difference between these two projects is that NanoClaw rebuilds itself on the fly.

If you want to integrate with Telegram, for example, NanoClaw will use the AI model to add the Telegram integration. In this case, it will use pre-existing code and use the model as a weird plugin system. But it also has the ability to generate new code for integrations it doesn’t already have. See here for more details.

On the one hand, that is a pretty neat way to reduce the overall code in the project. On the other hand, it means that each user of NanoClaw will have their own bespoke system.

Contrasting the OpenClaw and NanoClaw approaches, we have an interesting problem. Both of those systems are primarily built with AI, but NanoClaw is likely going to show a lot more variance in what is actually running on your system.

For example, if I want to use Signal as a communication channel, OpenClaw has that built in. You can integrate Signal into NanoClaw as well, but it will generate code (using the model) for this integration separately for each user who needs it.

A bespoke solution for each user may sound like a nice idea, but it just means that each NanoClaw is its own special snowflake. Just thinking about supporting something like that across many users gives me the shivers.

For example, OpenClaw had an agent takeover vulnerability (reported literally yesterday) that would allow a simple website visit to completely own the agent (with all that this implies). OpenClaw’s design means that it can be fixed in a single location.

NanoClaw’s design, on the other hand, means that for each user, there is a slightly different implementation, which may or may not be vulnerable. And there is no really good way to actually fix this.

Summary

The idea that you can just throw AI at a problem and have it generate code that you can then deploy to production is an attractive one. It is also by no means a new one.

The notion of CASE tools used to be the way to go about it. The book Application Development Without Programmers was published in 1982, for example. The world has changed since then, but we are still trying to get rid of programmers.

Generating code quickly is easy these days, but that just shifts the burden. The cost of verifying code has become a lot more pronounced. Note that I didn’t say expensive. It used to be the case that writing the code and verifying it were almost the same task. You wrote the code and thus had a human verifying that it made sense. Then there are the other review steps in a proper software lifecycle.

When we can drop 15,000 lines of code in a few minutes of prompting, the entire story changes. The value of a line of code on its own approaches zero. The value of a reviewed line of code, on the other hand, hasn’t changed.

A line of code from a battle-tested, mature project is infinitely more valuable than a newly generated one, regardless of how quickly it was produced. The cost of generating code approaches zero, sure.

But newly generated code isn’t useful. In order for me to actually make use of that, I need to verify it and ensure that I can trust it. More importantly, I need to know that I can build on top of it.

I don’t see a lot of people paying attention to the concept of long-term maintainability for projects. But that is key. Otherwise, you are signing up upfront to be a legacy system that no one understands or can properly operate.

Production-grade software isn’t a prompt away, I’m afraid to say. There are still all the other hurdles that you have to go through to actually mature a project to be able to go all the way to production and evolve over time without exploding costs & complexities.

Read the whole story
alvinashcraft
48 minutes ago
reply
Pennsylvania, USA
Share this story
Delete

Creating a Fun Multi-Agent Content Strategy System with Microsoft Agent Framework


That's what we're building in this tutorial. Using Microsoft Agent Framework, we'll create a multi-agent system where three specialised AI agents collaborate to help gaming content creators craft posts that actually perform. One agent generates platform-native content. Another evaluates it the way TikTok's, Twitter's, or YouTube's recommendation algorithm would. A third reacts as a real audience member, complete with the slang, biases, and short attention span of an actual person scrolling their feed.

I have named the simulation app Viral or Fail, and by the end of this tutorial you'll have a working tool that demonstrates some of the most important patterns in multi-agent system design: role specialisation, structured evaluation, iterative feedback loops, and tool integration with external data sources.

What We Will Cover

By the end of this tutorial, you'll understand how to:

  • Design a multi-agent system where each agent has a distinct role and expertise
  • Orchestrate agent communication using Agent Framework's Agent class and async sessions
  • Integrate external tools (live Google Trends data) into an agent workflow
  • Build iterative refinement pipelines where agents improve each other's output through structured feedback
  • Create evaluation rubrics that ground agent behaviour in real-world domain logic

These patterns transfer well beyond this demo: they are the same building blocks behind multi-agent customer support systems, automated code review pipelines, and any application where specialised agents need to collaborate on a shared task.

Prerequisites

You'll need Python 3.10 or higher, a GitHub account with a Personal Access Token (free tier — get one at github.com/settings/tokens), and a basic understanding of what AI agents are. If you're new to agents, I'd recommend the AI Agents for Beginners course; this project was inspired by and builds on concepts from that curriculum.

Why Multi-Agent? Why Not Just One Big Prompt?

You could write a single prompt that says "generate a gaming post, score it, and react to it." But you'd get mediocre results across the board. A single LLM call tries to be creative, analytical, and authentic simultaneously, and will probably end up being none of those things convincingly.

Multi-agent systems solve this through role specialisation. When an agent's only job is to think like TikTok's recommendation algorithm, it does that job significantly better than a generalist prompt. And when agents with different objectives interact, natural tension emerges: a creator wants to be bold and viral, an algorithm wants measurable engagement signals, and an audience member just wants to feel something. That tension produces more realistic, more useful outputs than any monolithic approach.

This is the same principle behind production multi-agent systems. Content moderation platforms use separate agents for classification, response generation, and quality assurance. Code review tools use one agent to identify issues and another to suggest fixes. The pattern scales because specialisation scales.

System architecture: The Content Creator generates platform-native content from live trends, the Algorithm Simulator scores it against platform-specific rubrics, and a randomly selected Audience Persona reacts authentically. Feedback from both evaluators flows back to the Creator for iterative refinement.

 

System Design: Three Agents, Three Perspectives

The system's power comes from the fact that each agent represents a fundamentally different lens on the same piece of content. Let's break down each one.

The Content Creator Agent

This agent is the strategist: a trend-savvy gaming content creator who understands the nuances of each platform. It generates platform-native content that respects the conventions, formats, and cultural norms of TikTok, Twitter/X, YouTube, or Instagram.

The key design decision here is in the system prompt. Rather than generic instructions, we encode platform-specific knowledge directly:

CREATOR_SYSTEM_PROMPT = """You are the Content Creator — a trend-savvy gaming content creator who lives and breathes internet culture. You know every platform inside out and create content that feels native, not generic.

RULES:
- Be platform-native. A TikTok script should feel like a TikTok, not a blog post.
- Use gaming terminology correctly. Don't say "the game Valorant" — say "Valo" or "Val".
- For Twitter/X: Write punchy, provocative takes. Think ratio-worthy engagement bait.
- For YouTube: Focus on title + thumbnail concept + video structure outline.
- Be bold. Safe content doesn't go viral.

When given FEEDBACK from the Algorithm Simulator and Audience Persona, revise your content to address their specific concerns while keeping the creative energy high. Explain what you changed and why."""

That last instruction matters: it tells the Creator how to handle feedback from the other agents, which is what enables the iterative refinement loop we'll build later.

The Algorithm Simulator Agent

This is the most unusual agent in the system. Instead of acting as a generic critic, it role-plays as a social media platform's actual recommendation algorithm. It evaluates content the way an algorithm would through signals, weights, and distribution mechanics.

ALGORITHM_SYSTEM_PROMPT = """You are the Algorithm Simulator — a cold, analytical system that evaluates content exactly like a social media platform's recommendation algorithm would. You think in signals, weights, and distribution mechanics. You have no feelings about the content; only data.

RULES:
- Be specific. Don't say "the hook is weak" — say "the hook lacks a pattern interrupt in the first 1.5 seconds, which will drop initial retention below the 65% threshold needed for FYP promotion."
- Reference actual platform mechanics: completion rate, dwell time, engagement velocity, session time contribution...
- Think like an algorithm, not a human reviewer. The algorithm doesn't care if the take is "good" — it cares if the take drives engagement signals."""

This distinction between quality and distribution probability is the core insight. A beautifully written post can score poorly because it lacks the specific signals an algorithm needs to push it into wider circulation. Content creators deal with this disconnect every day — the Algorithm Simulator makes it visible and measurable.

In a production context, this pattern of simulating an external system's decision logic has applications well beyond content creation. Imagine an agent that simulates a CI/CD pipeline's quality gates, or one that evaluates code the way a specific linter or reviewer would. The pattern is the same: encode the evaluation system's rules into the agent's prompt and let it reason within those constraints.

The Audience Persona Agent

The third agent brings the human element. Each session, it randomly becomes one of three gaming community personas — each with distinct tastes, language, and engagement patterns:

PERSONAS = {
    "casual_mobile_gamer": {
        "name": "CasualChloe",
        "description": "Casual mobile gamer",
        "system_prompt": """You are CasualChloe — a casual mobile gamer...
- You use a lot of "lol", "ngl", "lowkey", "fr fr", and "no cap"
- You'll scroll past anything that feels too "sweaty" or try-hard
- You judge content in about 2 seconds — if it doesn't grab you, you're gone
...""",
    },
    "competitive_esports_fan": {
        "name": "TryHard_Tyler",
        "description": "Competitive esports fan",
        "system_prompt": """You are TryHard_Tyler — a hardcore competitive esports fan...
- You'll call out content that gets facts wrong or oversimplifies
- You'll ratio someone in the comments if their take is bad
...""",
    },
    "retro_indie_enthusiast": {
        "name": "PixelPete",
        "description": "Retro/indie game enthusiast",
        "system_prompt": """You are PixelPete — a retro and indie game enthusiast...
- You're tired of mainstream AAA hype and live-service games
- You appreciate craftsmanship and artistic vision over graphics
...""",
    },
}

The random persona selection is a deliberate design choice. It simulates the reality that you never know exactly who's going to see your content. A Valorant Champions post might get passionate engagement from TryHard_Tyler but complete indifference from PixelPete. That unpredictability mirrors real content distribution and it's the kind of insight that can emerge from a multi-agent system.

This is essentially synthetic user testing. Companies pay for focus groups and user research. Here, we're simulating it with agent personas, essentially using a lightweight version of the same concept that can run in seconds.

def create_audience_persona_agent(llm_config, persona=None):
    if persona is None:
        persona = get_random_persona()
    agent = Agent(
        name=persona["name"],
        instructions=persona["system_prompt"],
        client=client,
    )
    return agent, persona

Grounding Evaluation with Platform Rubrics

One of the biggest challenges with AI agents is preventing vague, generic feedback. Left unguided, the Algorithm Simulator would default to hollow assessments like "this post is good" or "needs improvement." To prevent this, we give it structured scoring rubrics that mirror how each platform's algorithm actually prioritises content.

PLATFORM_RULES = {
    "Twitter/X": {
        "description": "Text-first microblogging platform driven by engagement velocity",
        "criteria": {
            "hot_take_factor": {
                "weight": 0.30,
                "description": "Does the post have a strong, polarising opinion? "
                               "Twitter/X rewards engagement velocity — hot takes drive replies.",
            },
            "quote_retweet_bait": {
                "weight": 0.25,
                "description": "Is the post structured to invite quote retweets? QRTs are "
                               "Twitter/X's most powerful distribution mechanic.",
            },
            "timing_relevance": {"weight": 0.20, ...},
            "thread_potential": {"weight": 0.15, ...},
            "hashtag_strategy": {"weight": 0.10, ...},
        },
    },
    "TikTok": {...},     # Prioritises hook_strength (30%) and trend_alignment (25%)
    "YouTube": {...},    # Prioritises thumbnail_clickability (25%) and title_curiosity_gap (25%)
    "Instagram": {...},  # Prioritises visual_appeal (30%) and caption_hook (20%)
}

Each platform has different criteria with different weights, and those weights are passed directly into the Algorithm Simulator's prompt at evaluation time. TikTok cares most about whether the first three seconds hook the viewer. YouTube cares about click-through rate. Twitter cares about whether your take is spicy enough to drive quote-retweets. The agent's evaluation is always anchored in platform-specific logic, not generic opinions.

Providing structured evaluation criteria as grounding context is one of the most transferable patterns in this project. Whenever you need an agent to evaluate something consistently, give it a rubric. It works for content scoring, code review, proposal assessment, or any domain where you want structured, reproducible judgments.
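To make the grounding concrete, here is a small illustrative helper (not from the project; the function name and the trimmed-down rubric are made up) that renders a rubric dictionary into the text the evaluator receives:

```python
# Illustrative sketch: turn weighted rubric criteria into explicit
# scoring instructions for an evaluation prompt.
def render_rubric(platform: str, rules: dict) -> str:
    lines = [f"Score this {platform} post against these weighted criteria:"]
    for name, crit in rules["criteria"].items():
        lines.append(f"- {name} (weight {crit['weight']:.0%}): {crit['description']}")
    lines.append("Return a 0-100 score per criterion plus a weighted total.")
    return "\n".join(lines)

# A trimmed-down rubric for demonstration purposes only
rules = {
    "criteria": {
        "hot_take_factor": {"weight": 0.30, "description": "Strong, polarising opinion?"},
        "hashtag_strategy": {"weight": 0.10, "description": "1-3 hashtags, no overuse?"},
    }
}
prompt_context = render_rubric("Twitter/X", rules)
```

Injecting this rendered text into the Algorithm Simulator's evaluation message is what anchors its scores to the platform's weights rather than to generic opinions.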

Orchestrating with Microsoft Agent Framework

With the agents designed, let's wire them together. Agent Framework makes this straightforward — each agent is an Agent with instructions and a chat client. We send messages directly using the async agent.run() method, with sessions maintaining conversation context across rounds.

client = OpenAIChatClient(
    model_id="openai/gpt-4.1-mini",
    api_key=os.getenv("GITHUB_TOKEN"),
    base_url="https://models.github.ai/inference",
)

creator = create_content_creator_agent(client)
algorithm = create_algorithm_simulator_agent(client)
audience_agent, persona = create_audience_persona_agent(client)

# Sessions maintain conversation context across iteration rounds
creator_session = creator.create_session()
algorithm_session = algorithm.create_session()
audience_session = audience_agent.create_session()

We're using GitHub Models as our LLM backend — free tier, no paid API keys, just a GitHub PAT. This is the same setup used in Microsoft's AI Agents for Beginners course. The OpenAIChatClient connects directly to GitHub's inference endpoint. Each agent gets the same client instance, and create_session() gives each one a persistent memory so they can reference previous rounds during iteration.

Communication between agents flows through agent.run():

async def get_agent_response(agent, message, session=None):
    result = await agent.run(message, session=session)
    return result.text or "No response generated."

Each agent.run() call gets a single response. The session parameter maintains conversation history across rounds so agents remember previous feedback. This gives us precise control over the pipeline: Creator generates -> Algorithm evaluates -> Persona reacts -> we decide whether to loop.

This is a common pattern for application-controlled multi-agent orchestration, as opposed to free-flowing agent conversation. Both approaches have their place, but when you need deterministic sequencing (as in any evaluation or pipeline scenario), controlling the loop yourself is more reliable.
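That controlled loop can be sketched in a few lines. The agent calls below are stubbed with plain async functions so the control flow stands on its own; in the real app each step would be an agent.run() call through get_agent_response:

```python
import asyncio

MAX_ITERATIONS = 3

async def creator_step(feedback):       # stands in for the Content Creator agent
    return "post v2" if feedback else "post v1"

async def evaluate_step(content):       # stands in for Algorithm + Persona feedback
    return f"feedback on {content}"

async def run_pipeline(keep_iterating):
    content, feedback = None, None
    for round_no in range(1, MAX_ITERATIONS + 1):
        content = await creator_step(feedback)      # generate (round 1) or revise
        feedback = await evaluate_step(content)     # score + react
        if not keep_iterating(round_no, feedback):  # the application decides the loop
            break
    return content

# Example: the user accepts the content after round 2
final = asyncio.run(run_pipeline(lambda r, fb: r < 2))
```

The deterministic ordering (generate, evaluate, react, decide) lives in application code, which is exactly what makes the pipeline reproducible and debuggable.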

Integrating Live Data with Google Trends

What makes this system feel like a real tool is the live Google Trends integration — the agents work with whatever's actually trending in gaming right now, not canned example data.

We use trendspy (a modern replacement for pytrends, which was archived in April 2025) to pull real-time trending searches:

from trendspy import Trends

def fetch_gaming_trends(count=10):
    try:
        tr = Trends()
        all_trends = tr.trending_now(geo="US")

        # Tier 1: Filter by Google's own Games topic tag
        gaming_trends = [
            t.keyword for t in all_trends
            if GAMES_TOPIC_ID in (t.topics or [])
        ]
        if len(gaming_trends) >= 5:
            return gaming_trends[:count]

        # Tier 2: Keyword matching as backup
        gaming_keywords = ["game", "valorant", "fortnite", "nintendo", ...]
        keyword_matches = [
            t.keyword for t in all_trends
            if any(kw in t.keyword.lower() for kw in gaming_keywords)
        ]
        gaming_trends.extend(keyword_matches)

        # Tier 3: Pad with curated sample data
        if len(gaming_trends) < 5:
            sample = _load_sample_trends()
            gaming_trends.extend([t for t in sample if t not in gaming_trends])

        return gaming_trends[:count]
    except Exception:
        return _load_sample_trends()[:count]

The three-tier fallback strategy here is worth highlighting because it's a pattern you'll use whenever you integrate external tools into agent workflows. On a day when a major game launches or a big esports tournament is running, Tier 1 will return a full list of gaming-specific trends. On a quiet day (like this demo run, when Google Trends was dominated by the Winter Olympics and NBA All-Star weekend), Tier 2 catches gaming content that wasn't formally tagged, and Tier 3 ensures the system always has enough data to work with.

This is the tool-use pattern from Lesson 4 of the AI Agents for Beginners course in practice. The principle being established here is that external tools should enhance agent capabilities, but they should never be a single point of failure. Build in graceful degradation so the agent workflow completes regardless of what the external service does.
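Stripped of the Trends specifics, the graceful-degradation pattern looks like this. This is a generic sketch with made-up tiers, not the project's code:

```python
# Generic graceful degradation: try each data source in order, never let a
# failing tier break the workflow, stop once we have enough items.
def fetch_with_fallbacks(sources, minimum=5):
    results = []
    for source in sources:
        try:
            for item in source():
                if item not in results:
                    results.append(item)
        except Exception:
            continue  # a failing tier is skipped, not fatal
        if len(results) >= minimum:
            break
    return results

def live():       # simulate the live API being down
    raise TimeoutError("Google Trends unavailable")

def cached():     # a stale local cache
    return ["Valorant Champions", "Silksong"]

def canned():     # curated sample data that always works
    return ["Fortnite", "Zelda", "Minecraft"]

trends = fetch_with_fallbacks([live, cached, canned], minimum=3)
```

The agent workflow downstream never needs to know which tier the data came from.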

The Refinement Pipeline: Agents Improving Each Other

We want to take the system from just a "neat demo" to "actually useful." The pipeline runs for up to three rounds. Each round, the Content Creator either generates fresh content (round 1) or revises based on aggregated feedback (rounds 2-3). The Algorithm Simulator scores it against the platform rubric. The Audience Persona gives an authentic reaction. Then the user decides: iterate or lock in.

The revision prompt is where the multi-agent magic happens:

revision_prompt = (
    f"REVISION REQUEST (Round {iteration}/{MAX_ITERATIONS}):\n\n"
    f"The Algorithm Simulator and Audience Persona reviewed your "
    f"{platform} post about '{topic}'. Here's their feedback:\n\n"
    f"--- ALGORITHM FEEDBACK ---\n{algorithm_response}\n\n"
    f"--- AUDIENCE FEEDBACK ({persona['name']}) ---\n"
    f"{audience_response}\n\n"
    f"Revise your content to address their concerns. Keep what works, "
    f"fix what doesn't. Show what you changed and why."
)

The Creator receives two fundamentally different types of feedback: cold metrics from the Algorithm and subjective human reactions from the Persona. It now has to reconcile them. It might cut hashtags from six to two (addressing the Algorithm's scoring penalty on hashtag overuse) while simultaneously softening its "corporate esports" energy (addressing the Persona's disengagement with mainstream hype).

This negotiation between competing feedback sources is one of the most powerful patterns in multi-agent design. In production systems, you see it everywhere: a coding agent balancing correctness feedback from a test runner with readability feedback from a style checker, or a customer support agent balancing policy compliance with empathy. The agents don't need to agree; they only need to provide different perspectives that the system (or a human) can synthesise.

Seeing It in Action

Here's what a real session looks like. We picked "Valorant Champions 2025" on Twitter/X, and PixelPete (the retro/indie enthusiast) was randomly selected as our audience persona.

The Creator generated a bold take:

Valorant Champions 2025 is gonna be a BLOODBATH — here's why no org outside the top 3 will even sniff the finals. Sentinels, Fnatic, and LOUD have cracked the meta code so hard that every other team's strategy looks like a toddler's finger painting...

The Algorithm Simulator broke down the distribution probability:

hot_take_factor (30%): 85/100 — The tweet delivers a strong polarizing opinion, likely to trigger debate and replies. The confident tone aligns with Twitter's engagement velocity mechanics...

hashtag_strategy (10%): 50/100 — Six hashtags is above Twitter's recommended 1-3 per tweet. Overuse reduces organic reach within Twitter's credibility filtering...

Weighted Total: 75/100

 

And PixelPete? He scrolled right past:

Eh, Valorant esports hype isn't really my cup of tea. This whole "bloodbath" and "top 3 orgs owning the meta" spiel feels like the usual corporate esports noise — all flash, little soul. I'll keep scrolling for something with more heart and craftsmanship.

 

Three agents. Three completely different takes on the same content. The Algorithm says it'll perform well. The audience member says he doesn't care. And that mismatch is exactly the kind of insight you'd never get from a single-agent system — and exactly the kind of insight that matters when you're planning a content strategy.

 

Extending the System

The project is designed to be modular. Here are a few directions you can take it:

Add new platforms. The rubric system in platform_rules.py is just a dictionary. Add a LinkedIn or Threads entry with appropriate criteria and weights, and the Algorithm Simulator will evaluate against those rules without any code changes.

Create new audience personas. Add a "Streamer_Sarah" who evaluates content from a Twitch creator's perspective, or a "ParentGamer_Pat" who only engages with family-friendly content. Each persona is a system prompt and a name, nothing else to change.

Swap the niche. Replace the gaming trend fetcher with music, tech, or fitness trends. The agent architecture is niche-agnostic; only the trend tool and sample data need to change.

Register trends as an Agent Framework tool. Right now, the application fetches trends and passes them as context. In a more advanced version, you could use the @tool decorator to register fetch_gaming_trends as a callable tool that agents invoke autonomously — moving from application-controlled to agent-controlled tool use.
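The exact decorator API depends on the framework version, so here is a framework-agnostic sketch of what tool registration amounts to: capturing a function's signature and docstring as a schema the model can see, while keeping the callable around for dispatch. The registry and decorator names here are illustrative, not Agent Framework's actual API:

```python
import inspect

# Hypothetical tool registry: a @tool decorator records each function's
# description and parameter names so an agent runtime could surface them
# to the model and dispatch the model's tool calls.
TOOLS = {}

def tool(fn):
    TOOLS[fn.__name__] = {
        "description": fn.__doc__,
        "parameters": list(inspect.signature(fn).parameters),
        "callable": fn,
    }
    return fn

@tool
def fetch_gaming_trends(count=10):
    """Return the top gaming-related Google Trends queries."""
    return ["Valorant Champions", "Silksong"][:count]  # stubbed data

# The runtime would dispatch a model-issued tool call like this:
result = TOOLS["fetch_gaming_trends"]["callable"](count=1)
```

The shift from application-controlled to agent-controlled tool use is exactly this: the model, not your code, decides when fetch_gaming_trends gets called.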

What's Next: Evaluating the Evaluator

Here's the question this project intentionally leaves open: the Algorithm Simulator scored the post 75/100 — but how do we know the Simulator itself is any good?

We built an agent that evaluates content, but we never evaluated the evaluator. How consistent are its scores? If you run the same post through it twice, does it give the same result? Do its predictions correlate with real-world engagement metrics? Would a human social media strategist agree with its rubric weights?

This is the problem of agent evaluation — one of the most important and underexplored challenges in building production agentic systems. We all know how to evaluate a model on a benchmark. But how do you evaluate an agent that's making subjective, multi-dimensional judgments within a larger system?

In a follow-up article, we'll tackle exactly this: building evaluation frameworks for AI agents, testing for consistency and calibration, measuring inter-agent agreement, and determining whether your agents are actually doing what you think they're doing. The system we built here will serve as our running example — because when your system contains an agent whose entire job is evaluation, evaluating that agent becomes the most important question you can ask.

Get the Code

The full project is on GitHub: 

https://github.com/HamidOna/viral-or-fail

Clone it, run pip install -r requirements.txt, add your GitHub token to .env, and run python viral_or_fail.py. Everything runs on GitHub Models' free tier — no paid API keys required.

References and Further Reading

Frameworks and Tools

  • Microsoft Agent Framework Documentation — Microsoft's production framework for multi-agent orchestration (successor to AutoGen), used throughout this project
  • AI Agents for Beginners — Microsoft's 12-lesson course on building AI agents, which inspired this project. Particularly relevant: Lesson 4 (Tool Use), Lesson 8 (Multi-Agent Design Pattern), and Lesson 9 (Metacognition)
  • GitHub Models — Free-tier LLM access used in this project, no paid API keys required
  • trendspy — Lightweight Google Trends library replacing the archived pytrends


Optimising AI Costs with Microsoft Foundry Model Router


Microsoft Foundry Model Router analyses each prompt in real-time and forwards it to the most appropriate LLM from a pool of underlying models. Simple requests go to fast, cheap models; complex requests go to premium ones, all automatically.

I built an interactive demo app so you can see the routing decisions, measure latencies, and compare costs yourself. This post walks through how it works, what we measured, and when it makes sense to use.

The Problem: One Model for Everything Is Wasteful

Traditional deployments force a single choice:

| Strategy | Upside | Downside |
| --- | --- | --- |
| Use a small model | Fast, cheap | Struggles with complex tasks |
| Use a large model | Handles everything | Overpay for simple tasks |
| Build your own router | Full control | Maintenance burden; hard to optimise |

Most production workloads are mixed-complexity. Classification, FAQ look-ups, and data extraction sit alongside code analysis, multi-constraint planning, and long-document summarisation. Paying premium-model prices for the simple 40% is money left on the table.
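A quick back-of-the-envelope calculation shows the shape of the saving. The per-token prices and the traffic split below are illustrative placeholders, not real Azure rates:

```python
# Blended cost of routing vs. sending everything to a premium model.
# Prices are made up for illustration ($ per 1M tokens).
PRICE = {"small": 0.15, "large": 10.00}

tokens = 1_000_000      # monthly token volume, say
simple_share = 0.40     # fraction of traffic a small model handles fine

all_large = tokens / 1e6 * PRICE["large"]
routed = (tokens * simple_share / 1e6 * PRICE["small"]
          + tokens * (1 - simple_share) / 1e6 * PRICE["large"])

saving = 1 - routed / all_large   # fraction of spend saved by routing
```

With these assumed numbers the routed bill is $6.06 against $10.00, a saving of roughly 39% without touching the complex 60% of traffic.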

The Solution: Model Router

Model Router is a trained language model deployed as a single Azure endpoint. For each incoming request it:

  1. Analyses the prompt — complexity, task type, context length
  2. Selects an underlying model from the routing pool
  3. Forwards the request and returns the response
  4. Exposes the choice via the response.model field

You interact with one deployment. No if/else routing logic in your code.

Routing Modes

| Mode | Goal | Trade-off |
|---|---|---|
| Balanced (default) | Best cost-quality ratio | General-purpose |
| Cost | Minimise spend | May use smaller models more aggressively |
| Quality | Maximise accuracy | Higher cost for complex tasks |

Modes are configured in the Foundry Portal; no code change is needed to switch.

Building the Demo

To make routing decisions tangible, we built a React + TypeScript app that sends the same prompt through both Model Router and a fixed standard deployment (e.g. GPT-5-nano), then compares:

  • Which model the router selected
  • Latency (ms)
  • Token usage (prompt + completion)
  • Estimated cost (based on per-model pricing)

Select a prompt, choose a routing mode, and hit Run Both to compare side-by-side

What You Can Do

  • 10 pre-built prompts spanning simple classification to complex multi-constraint planning
  • Custom prompt input: enter any text and benchmarks run automatically
  • Three routing modes: switch and re-run to see how distribution changes
  • Batch mode: run all 10 prompts in one click to gather aggregate stats

API Integration

The integration is a standard Azure OpenAI chat completion call. The only difference is the deployment name (model-router instead of a specific model):

 
const response = await fetch(
  `${endpoint}/openai/deployments/model-router/chat/completions?api-version=2024-10-21`,
  {
    method: 'POST',
    headers: {
      'Content-Type': 'application/json',
      'api-key': apiKey,
    },
    body: JSON.stringify({
      messages: [{ role: 'user', content: prompt }],
      max_completion_tokens: 1024,
    }),
  }
);

const data = await response.json();

// The key insight: response.model reveals the underlying model
const selectedModel = data.model; // e.g. "gpt-5-nano-2025-08-07"
 

That data.model field is what makes cost tracking and distribution analysis possible.
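Since billing is per underlying model, that field also lets you attribute cost to each request. A minimal sketch, assuming a hypothetical price table — real Azure prices vary by region and model version, so substitute your own rates:

```typescript
// Hypothetical per-1M-token prices (USD) -- illustrative only, not Azure's
// actual pricing. Replace with the rates for your region and model versions.
const PRICES: Record<string, { input: number; output: number }> = {
  "gpt-5-nano": { input: 0.05, output: 0.4 },
  "gpt-5-mini": { input: 0.25, output: 2.0 },
  "gpt-4.1-mini": { input: 0.4, output: 1.6 },
  "gpt-oss-120b": { input: 0.15, output: 0.6 },
};

// The router reports versioned names like "gpt-5-nano-2025-08-07";
// strip the trailing date suffix to match the price table keys.
function baseModelName(reported: string): string {
  return reported.replace(/-\d{4}-\d{2}-\d{2}$/, "");
}

// Estimate the cost of one request from the usage the API already returns.
function estimateCostUSD(
  reportedModel: string,
  promptTokens: number,
  completionTokens: number
): number {
  const price = PRICES[baseModelName(reportedModel)];
  if (!price) return NaN; // unknown model: flag it rather than guess
  return (
    (promptTokens / 1_000_000) * price.input +
    (completionTokens / 1_000_000) * price.output
  );
}
```

Summing `estimateCostUSD` over a batch run is how a comparison like the one below can be produced.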

Results: What the Data Shows

We ran all 10 prompts through both Model Router (Balanced mode) and a fixed standard deployment.

Note: Results vary by run, region, model versions, and Azure load. These numbers are from a representative sample run.


Side-by-side comparison across all 10 prompts in Balanced mode

Summary

| Metric | Router (Balanced) | Standard (GPT-5-nano) |
|---|---|---|
| Avg Latency | ~7,800 ms | ~7,700 ms |
| Total Cost (10 prompts) | ~$0.029 | ~$0.030 |
| Cost Savings | ~4.5% | |
| Models Used | 4 | 1 |

Model Distribution

The router used 4 different models across 10 prompts:

| Model | Requests | Share | Typical Use |
|---|---|---|---|
| gpt-5-nano | 5 | 50% | Classification, summarisation, planning |
| gpt-5-mini | 2 | 20% | FAQ answers, data extraction |
| gpt-oss-120b | 2 | 20% | Long-context analysis, creative tasks |
| gpt-4.1-mini | 1 | 10% | Complex debugging & reasoning |

Routing distribution chart — the router favours efficient models for simple prompts
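A distribution like this can be tallied directly from the `response.model` values collected during a batch run. A small sketch (the model names in the test are only illustrative):

```typescript
// Tally which underlying models handled a batch of requests, using the
// response.model value each chat-completion call returns, and convert the
// raw counts into share-of-traffic percentages.
function modelDistribution(reportedModels: string[]): Record<string, number> {
  const counts: Record<string, number> = {};
  for (const m of reportedModels) counts[m] = (counts[m] ?? 0) + 1;

  const total = reportedModels.length;
  const shares: Record<string, number> = {};
  for (const [model, n] of Object.entries(counts)) {
    shares[model] = Math.round((n / total) * 100); // percentage share
  }
  return shares;
}
```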

Across All Three Modes

| Metric | Balanced | Cost-Optimised | Quality-Optimised |
|---|---|---|---|
| Cost Savings | ~4.5% | ~4.7% | ~14.2% |
| Avg Latency (Router) | ~7,800 ms | ~7,800 ms | ~6,800 ms |
| Avg Latency (Standard) | ~7,700 ms | ~7,300 ms | ~8,300 ms |
| Primary Goal | Balance cost + quality | Minimise spend | Maximise accuracy |
| Model Selection | Mixed (4 models) | Prefers cheaper | Prefers premium |
Cost-optimised mode — routes more aggressively to nano/mini models


Quality-optimised mode — routes to larger models for complex tasks

Analysis

What Worked Well

Intelligent distribution: The router didn't just default to one model. It used 4 different models and mapped prompt complexity to model capability: simple classification → nano, FAQ answers → mini, long-context documents → oss-120b, complex debugging → 4.1-mini.

Measurable cost savings across all modes: 4.5% in Balanced, 4.7% in Cost, and 14.2% in Quality mode. Quality mode was the surprise winner: by choosing faster, cheaper models for simple prompts, it saved the most while still routing complex requests to capable models.

Zero routing logic in application code: One endpoint, one deployment name. The complexity lives in Azure's infrastructure, not yours.

Operational flexibility: Switch between Balanced, Cost, and Quality modes in the Foundry Portal without redeploying your app. Need to cut costs for a high-traffic period? Switch to Cost mode. Need accuracy for a compliance run? Switch to Quality.

Future-proofing: As Azure adds new models to the routing pool, your deployment benefits automatically. No code changes needed.

Trade-offs to Consider

Latency is comparable, not always faster: In Balanced mode, the Router averaged ~7,800 ms vs the Standard deployment's ~7,700 ms, nearly identical. In Quality mode, the Router was actually faster (~6,800 ms vs ~8,300 ms) because it chose more efficient models for simple prompts. The delta depends on which models the router selects.

Savings scale with workload diversity: Our 10-prompt test set showed 4.5–14.2% savings. Production workloads with a wider spread of simple vs complex prompts should see larger savings, since the router has more opportunity to route simple requests to cheaper models.

Opaque routing decisions: You can see which model was picked via response.model, but you can't see why. For most applications this is fine; for debugging edge cases you may want to test specific prompts in the demo first.

Custom Prompt Testing

One of the most practical features of the demo is testing your own prompts before committing to Model Router in production.

Enter any prompt (the quantum computing example shown is a medium-complexity educational prompt), and benchmarks execute automatically, showing the selected model, latency, tokens, and cost.

Workflow:

  1. Click ✏️ Custom in the prompt selector
  2. Enter your production-representative prompt
  3. Click ✓ Use This Prompt — Router and Standard run automatically
  4. Compare results — repeat with different routing modes
  5. Use the data to inform your deployment strategy

This lets you predict costs and validate routing behaviour with your actual workload before going to production.

When to Use Model Router

Great Fit

  • Mixed-complexity workloads — chatbots, customer service, content pipelines
  • Cost-sensitive deployments — where even single-digit percentage savings matter at scale
  • Teams wanting simplicity — one endpoint beats managing multi-model routing logic
  • Rapid experimentation — try new models without changing application code

Consider Carefully

  • Ultra-low-latency requirements — if you need sub-second responses, the routing overhead matters
  • Single-task, single-model workloads — if one model is clearly optimal for 100% of your traffic, a router adds complexity without benefit
  • Full control over model selection — if you need deterministic model choice per request

Mode Selection Guide

Is accuracy critical (compliance, legal, medical)?
├─ YES → Quality-Optimised
└─ NO → Strict budget constraints?
    ├─ YES → Cost-Optimised
    └─ NO → Balanced (recommended)
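The decision tree can also be encoded as a small planning helper. A sketch only — remember the mode is configured once in the Foundry Portal, not chosen per request:

```typescript
type RoutingMode = "quality" | "cost" | "balanced";

// Encode the mode-selection decision tree: accuracy-critical workloads win,
// then strict budgets, then the balanced default.
function pickRoutingMode(opts: {
  accuracyCritical: boolean;
  strictBudget: boolean;
}): RoutingMode {
  if (opts.accuracyCritical) return "quality";
  if (opts.strictBudget) return "cost";
  return "balanced";
}
```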

 

Best Practices

  1. Start with Balanced mode — measure actual results, then optimise
  2. Test with your real prompts — use the Custom Prompt feature to validate routing before production
  3. Monitor model distribution — track which models handle your traffic over time
  4. Compare against a baseline — always keep a standard deployment to measure savings
  5. Review regularly — as new models enter the routing pool, distributions shift
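Practice 4, comparing against a baseline, reduces to one formula. A minimal sketch:

```typescript
// Percentage saved by the router relative to a fixed-deployment baseline,
// measured over the same prompt set.
function savingsPercent(routerCostUSD: number, standardCostUSD: number): number {
  if (standardCostUSD <= 0) throw new Error("baseline cost must be positive");
  return ((standardCostUSD - routerCostUSD) / standardCostUSD) * 100;
}
```

Note that with the rounded figures above (~$0.029 vs ~$0.030), the raw formula yields roughly 3.3%; the article's ~4.5% comes from the unrounded run data.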

Technical Stack

| Technology | Purpose |
|---|---|
| React 19 + TypeScript 5.9 | UI and type safety |
| Vite 7 | Dev server and build tool |
| Tailwind CSS 4 | Styling |
| Recharts 3 | Distribution and comparison charts |
| Azure OpenAI API (2024-10-21) | Model Router and standard completions |

Security measures include an ErrorBoundary for crash resilience, sanitised API error messages, AbortController request timeouts, input length validation, and restrictive security headers. API keys are loaded from environment variables and gitignored.
Source: leestott/router-demo-app: An interactive web application demonstrating the power of Microsoft Foundry Model Router - an intelligent routing system that automatically selects the optimal language model for each request based on complexity, reasoning requirements, and task type.

⚠️ This demo calls Azure OpenAI directly from the browser. This is fine for local development. For production, proxy through a backend and use Managed Identity.

Try It Yourself
Quick Start

git clone https://github.com/leestott/router-demo-app/

cd router-demo-app

# Option A: Use the setup script (recommended)
# Windows:
.\setup.ps1 -StartDev
# macOS/Linux:
chmod +x setup.sh && ./setup.sh --start-dev

# Option B: Manual
npm install
cp .env.example .env.local
# Edit .env.local with your Azure credentials
npm run dev

Open http://localhost:5173, select a prompt, and click ⚡ Run Both.

Get Your Credentials

  1. Go to ai.azure.com → open your project
  2. Copy the Project connection string (endpoint URL)
  3. Navigate to Deployments → confirm model-router is deployed
  4. Get your API key from Project Settings → Keys

Configuration

Edit .env.local:

VITE_ROUTER_ENDPOINT=https://your-resource.cognitiveservices.azure.com
VITE_ROUTER_API_KEY=your-api-key
VITE_ROUTER_DEPLOYMENT=model-router

VITE_STANDARD_ENDPOINT=https://your-resource.cognitiveservices.azure.com
VITE_STANDARD_API_KEY=your-api-key
VITE_STANDARD_DEPLOYMENT=gpt-5-nano

Ideas for Enhancement

  • Historical analysis — persist results to track routing trends over time
  • Cost projections — estimate monthly spend based on prompt patterns and volume
  • A/B testing framework — compare modes with statistical significance
  • Streaming support — show model selection for streaming responses
  • Export reports — download benchmark data as CSV/JSON for further analysis
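The cost-projection idea is straightforward to prototype from benchmark samples. A sketch that assumes the sampled prompts are representative of production traffic:

```typescript
// Rough monthly spend projection: average per-request cost from a benchmark
// sample, multiplied by expected monthly request volume. Only as good as
// how representative the sample is of real traffic.
function projectMonthlyCostUSD(
  sampleCostsUSD: number[],
  requestsPerMonth: number
): number {
  if (sampleCostsUSD.length === 0) throw new Error("need at least one sample");
  const avg = sampleCostsUSD.reduce((a, b) => a + b, 0) / sampleCostsUSD.length;
  return avg * requestsPerMonth;
}
```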

Conclusion

Model Router addresses a real problem: most AI workloads have mixed complexity, but most deployments use a single model. By routing each request to the right model automatically, you get:

  • Cost savings (~4.5–14.2% measured across modes, scaling with volume)
  • Intelligent distribution (4 models used, zero routing code)
  • Operational simplicity (one endpoint, mode changes via portal)
  • Future-proofing (new models added to the pool automatically)

The latency trade-off is minimal — in Quality mode, the Router was actually faster than the standard deployment. The real value is flexibility: tune for cost, quality, or balance without touching your code.

Ready to try it? Clone the demo repository, plug in your Azure credentials, and test with your own prompts.

Resources

 

Built to explore Model Router and share findings with the developer community.

Feedback and contributions are welcome; open an issue or PR on GitHub.


USB Driver Documentation Updates on UCSI, USB4 Testing

1 Share

Over the past few months, we've been running a freshness pass across the USB driver documentation on Microsoft Learn and the USB4 design guidelines. The goal is straightforward: make sure the docs match the current Windows 11 driver stack, read clearly, and get you to the answer faster. Here's what changed.

USB4 required testing: new Basic Validation Tests

As the USB4 ecosystem grows and the number of USB4 router implementations multiplies, so does the need for a clear, repeatable validation baseline. We added a new Basic Validation Tests section to the USB4 Required Testing page.

The existing page already listed the recommended end-to-end test scenarios: driver verifier, domain power-down, wake replay, interdomain connections, and so on. What was missing was a concise set of smoke tests that OEMs and IHVs could run quickly to catch regressions in incremental USB4 firmware and driver updates without needing a full test pass. The new section fills that gap with nine concrete test scenarios:

  1. USB4 HLK Tests: run all System.Fundamentals.SystemUSB.USB4.* and Device.BusController.USB4.* tests.
  2. Basic Enumeration: attach a USB4 dock and a Thunderbolt 3 dock, each with a display, USB3 storage, and USB2 input. Verify clean enumeration in Device Manager, functional input, file copy, and display extension.
  3. Display: verify two 4K displays at 60 Hz concurrently, both tunneled through a USB4 dock and directly via DisplayPort Alt Mode.
  4. Camera (Isochronous) Streaming: stream from a USB3 camera through a USB4 dock for at least one minute with no visible glitches.
  5. System Standby: attach a full dock topology, cycle standby five times with 30-second to 2-minute waits, and verify all devices survive each transition.
  6. System Reboot: same topology, same verification, but with a full reboot instead of standby.
  7. System Hibernate: same again, with hibernate.
  8. Minimal Compatibility and Interoperability: test at least 3 display models and at least 10 USB4 dock or device models spanning Intel Thunderbolt 4, Intel Thunderbolt 5, Via USB4, Asmedia USB4, Realtek USB4, Intel Thunderbolt 3 Titan Ridge, and Intel Thunderbolt 3 Alpine Ridge.
  9. Basic Plug/Unplug with USB4 Switch: configure the USB4 Switch with a USB4 dock on port 1 and a Thunderbolt 3 dock on port 2, run ConnExUtil.exe /cxstress for a minimum of 15 minutes (24+ hours for long-term stability), then verify the port still enumerates and charges after the test.

Each test includes explicit pass criteria: no yellow bangs, no visual glitches, expected resolution and refresh rates confirmed in the Advanced Display settings. The interoperability test (test 8) is particularly important as USB4 matures: it ensures your platform works across the full range of silicon vendors, not just the one on your bench.

If you're validating USB4 firmware or driver updates and need a quick confidence check before a broader test pass, this is the list to start with.

UCSI driver docs: cleaned up and refocused on Windows 11

The UCSI driver article got a thorough refresh: updated architecture diagram, clearer UCSI 2.0 _DSM backward-compatibility guidance, reformatted UCSIControl.exe test instructions with proper inline code for registry paths, and consistent code-style formatting across the DRP role detection and charger mismatch example flows. We also removed outdated Windows 10 Mobile references so the article now focuses exclusively on Windows 10 desktop and Windows 11.

USB generic parent driver (Usbccgp.sys): plain language rewrite

The Usbccgp.sys article, the starting point for anyone building composite USB devices, was rewritten for clarity. We simplified jargon-heavy sentences, expanded abbreviations on first use (e.g., "information (INF) file"), updated cross-references to sentence case per the Microsoft Learn style guide, and added customer-intent metadata for better search discoverability.

Community fix: interrupt endpoint direction

Here's a small one that matters more than it looks. In the How to send a USB interrupt transfer (UWP app) article, the Interrupt IN transfers section incorrectly stated that HID devices like keyboards "support interrupt OUT endpoints." Endpoint direction is fundamental (IN means device-to-host, OUT means host-to-device) and getting that wrong in official documentation can send you down entirely the wrong debugging path.

A community contributor spotted the error and submitted the fix. It now correctly reads "interrupt IN endpoints." If you've ever stared at a USB trace wondering why your interrupt transfer wasn't behaving, this might have been part of the confusion.

Thank you to everyone who submits pull requests. This is exactly the kind of contribution that makes the docs better for all of us.

What's next

These updates are part of a broader freshness initiative across the Windows Hardware driver documentation. 

If you spot something that looks outdated or confusing, our documentation is open source. Submit a PR on the windows-driver-docs GitHub repository. You can also drop a comment below.


Episode 561: Two Guys and Their Tokens


This week, we discuss AI-assisted COBOL migrations, the OpenClaw Foundation, and AI killing Office. Plus, is TSA PreCheck Touchless the peak of airport efficiency?

Watch the YouTube Live Recording of Episode 561

Runner-up Titles

  • New’s not good
  • He knows how to be retired
  • Let Matt Cook
  • We don’t have to worry about that Brandon
  • You’re that guy
  • The stock market feels reactionary
  • Siri-Claw
  • Foundation Washing
  • Give me life-changing money and I’ll have a better take
  • Why do I need to pay for power usage?

Rundown

Relevant to your Interests

Nonsense

Listener Feedback

Conferences

  • DevOpsDay LA at SCALE23x, March 6th, Pasadena, CA
    • Use code: DEVOP for 50% off.
  • Devnexus 2026, March 4th to 6th, Atlanta, GA.
    • Use this 30% off discount code from your pals at Tanzu: DN26VMWARE30.
    • Check out the Tanzu and Spring talks and trading cards on THE LANDING PAGE.
  • Austin Meetup, March 10th, Listener Steve Anness speaking on Grafana
  • KubeCon EU, March 23rd to 26th, 2026 - Coté will be there on a media pass.
  • DevOpsdays Atlanta 2026, April 21-22, 2026
  • DevOpsDays Austin, May 5 - 6, 2026
  • WeAreDevelopers, July 8th to 10th, Berlin, Coté speaking.
  • VMware User Groups (VMUGs):
    • Amsterdam (March 17-19, 2026) - Coté speaking.
    • Minneapolis (April 7-9, 2026)
    • Toronto (May 12-14, 2026)
    • Dallas (June 9-11, 2026)
    • Orlando (October 20-22, 2026)

SDT News & Community

Recommendations





Download audio: https://aphid.fireside.fm/d/1437767933/9b74150b-3553-49dc-8332-f89bbbba9f92/e03de1f4-586c-424e-ba92-a98d6bded82c.mp3

The OpenClaw-ification of AI

From: AIDailyBrief
Duration: 14:07
Views: 2,856

Exploration of the OpenClaw phenomenon: AI products adopting persistent agents, scheduled tasks, on-device remote control, and deep context integration. Highlights include Claude Code remote control and scheduled tasks, Perplexity Computer as a personal AI workstation, Notion custom agents, and Airtable SuperAgent. Argument for a paradigm shift toward always-on, multimodal orchestration and a recommendation to learn agent primitives through hands-on OpenClaw setups before relying solely on abstracted products.

The AI Daily Brief helps you understand the most important news and discussions in AI.
Subscribe to the podcast version of The AI Daily Brief wherever you listen: https://pod.link/1680633614
Get it ad free at http://patreon.com/aidailybrief
Learn more about the show https://aidailybrief.ai/
