Take a small design team running a global social campaign. They have the creative vision to produce localized imagery for every market, but not the resources to reshoot, reformat, or outsource that scale. Every asset needs to fit a different platform, a different dimension, a different cultural context, and they all need to ship at the same time. This is where flexible image generation comes in handy.
OpenAI's GPT-image-2 is now generally available and rolling out today to Microsoft Foundry, introducing a step change in image generation. Developers and designers now get more control over image output, so a small team can execute with the reach and flexibility of a much larger one.
What is new in GPT-image-2?
GPT-image-2 brings real-world intelligence, multilingual understanding, improved instruction following, increased resolution support, and an intelligent routing layer, giving developers the tools to scale image generation for production workflows.
Real-world intelligence
GPT-image-2 has a knowledge cutoff of December 2025, meaning it can give you more contextually relevant and accurate outputs. The model also comes with enhanced thinking capabilities that allow it to search the web, check its own outputs, and create multiple images from just one prompt. These enhancements shift image generation models away from being simple tools and turn them into creative sidekicks.
Multilingual understanding
GPT-image-2 includes increased language support across Japanese, Korean, Chinese, Hindi, and Bengali, as well as new thinking capabilities. This means the model can create images and render text that feels localized.
Increased resolution support
GPT-image-2 introduces 4K resolution support, giving developers the ability to generate rich, detailed, and photorealistic images at custom dimensions.
Resolution guidelines to keep in mind:

| Constraint | Detail |
| --- | --- |
| Total pixel budget | The final image must contain between 655,360 and 8,294,400 pixels. |
| Resolutions | 4K, 1024x1024, 1536x1024, and 1024x1536 |
| Dimension alignment | Each dimension must be a multiple of 16 |

Note: If your requested resolution exceeds the pixel budget, the service will automatically resize it down.
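For a quick orientation, here is a minimal sketch of requesting a custom resolution, assuming GPT-image-2 is exposed through the OpenAI-compatible Images API in Foundry; the client configuration and support for arbitrary size strings are assumptions rather than confirmed details.

```python
# Hedged sketch: assumes an OpenAI-compatible Images endpoint for Foundry.
from openai import OpenAI

client = OpenAI()  # endpoint and key are read from the environment

result = client.images.generate(
    model="gpt-image-2",  # model name as announced; deployment naming may differ
    prompt="Interior of an empty subway car, wide-angle view down the aisle",
    # 1920 and 1088 are both multiples of 16, and 1920 x 1088 = 2,088,960
    # pixels, comfortably inside the 655,360-8,294,400 pixel budget.
    size="1920x1088",
)
```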
Intelligent routing layer
GPT-image-2 also includes an expanded routing layer with two distinct modes, allowing the service to intelligently select the right generation configuration for a request without requiring an explicitly set size value.
Mode 1 — Legacy size selection
In Mode 1, the routing layer selects one of the three legacy size tiers to use for generation:
| Size tier | Description |
| --- | --- |
| smimage | Small image output |
| image | Standard image output |
| xlimage | Large image output |
This mode is useful for teams already familiar with the legacy size tiers who want to benefit from automatic selection without making any manual changes.
Mode 2 — Token size bucket selection
In Mode 2, the routing layer selects from six token size buckets — 16, 24, 36, 48, 64, 96 — which map roughly to the legacy size tiers:
| Token bucket | Approximate legacy size |
| --- | --- |
| 16, 24 | smimage |
| 36, 48 | image |
| 64, 96 | xlimage |
This approach allows more flexibility in the number of tokens generated, which in turn helps optimize output quality and efficiency for a given prompt.
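Continuing the sketch above, omitting an explicit size hands the decision to the routing layer. The "auto" value shown here is carried over from earlier GPT-image releases and is an assumption for GPT-image-2.

```python
# Hedged sketch: let the routing layer pick the generation configuration.
result = client.images.generate(
    model="gpt-image-2",
    prompt="Product shot of a ceramic vase on a linen backdrop",
    size="auto",  # no explicit size: the router selects a tier or token bucket
)
```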
See it in action
GPT-image-2 shows improved image fidelity across visual styles, generating more detailed and refined images. But don't just take our word for it; let's see the model in action with a few prompts and edits. Here is the example we used:
Prompt: Interior of an empty subway car (no people). Wide-angle view looking down the aisle. Clean, modern subway car with seats, poles, route map strip, and ad frames above the windows. Realistic lighting with a slight cool fluorescent tone, realistic materials (metal poles, vinyl seats, textured floor).
Figure 1. Created with GPT-image-1
Figure 2. Created with GPT-image-1.5
Figure 3. Created with GPT-image-2
As you can see, when using the same base prompt, the image quality and realism improved with each model. Now let’s take a look at adding incremental changes to the same image:
Prompt: Populate the ad frames with a cohesive ad campaign for “Zava Flower Delivery” and use an array of flower types.
Figure 4. Created with GPT-image-2
And our subway is now full of ads for the new Zava flower delivery service. Let's ask for another small change:
Prompt: In all Zava Flower Delivery advertisements, change the flowers shown to roses (red and pink roses).
Figure 5. Created with GPT-image-2
And in three simple prompts, we've created a mockup of a flower delivery ad. From marketing material to website creation to UX design, GPT-image-2 now allows developers to deliver production-grade assets for real business use cases.
Image generation across industries
These new capabilities open the door to richer, more production-ready image generation workflows across a range of enterprise scenarios:
Retail & e-commerce: Generate product imagery at exact platform-required dimensions, from square thumbnails to wide banners, without post-processing.
Marketing: Produce crisp, richly colored campaign visuals and social assets localized to different markets.
Media & entertainment: Generate storyboard panels and scenes at resolutions suited to production pipelines.
Education & training: Create visual learning aids and course materials formatted to exact display requirements across devices.
UI/UX design: Accelerate mockup and prototype workflows by generating interface assets at the precise dimensions your design system requires.
Trust and safety
At Microsoft, our mission to empower people and organizations remains constant. As part of this commitment, models made available through Foundry undergo internal reviews and are deployed with safeguards designed to support responsible use at scale. Learn more about responsible AI at Microsoft.
For GPT-image-2, Microsoft applied an in-depth safety approach that addresses disallowed content and misuse while maintaining human oversight. The deployment combines OpenAI’s image generation safety mitigations with Azure AI Content Safety, including filters and classifiers for sensitive content.
Pricing
| Model | Offer type | Pricing - Image | Pricing - Text |
| --- | --- | --- | --- |
| GPT-image-2 | Standard Global | Input tokens: $8 / Cached input tokens: $2 / Output tokens: $30 | Input tokens: $5 / Cached input tokens: $1.25 / Output tokens: $10 |
Getting started
Whether you're building a personalized retail experience, automating visual content pipelines, or accelerating design workflows, GPT-image-2 gives your team the resolution control and intelligent routing to generate images that fit your exact needs. Try GPT-image-2 in Microsoft Foundry today!
Incoming calls don’t wait for a break in your day. Whether you’re leading a meeting or juggling back-to-back commitments, every new call creates the same dilemma: answer and risk losing momentum, or ignore it and risk missing something important.
We're excited to announce that Microsoft 365 Copilot can now help answer your incoming Teams calls and schedule follow-up appointments on your behalf. This experience helps users focus on engaging with the calls that matter most and is available through the Frontier program.
What is call delegation?
Here's how the experience works:
After you turn on call delegation in the Teams Call settings, Copilot answers incoming calls on your behalf and starts a conversation with the caller, helping filter out spam and capturing context so you can understand why the caller is contacting you.
When a call is time-sensitive, call delegation identifies the urgency and attempts a live transfer so you don’t miss out on important conversations.
When a call isn't urgent, call delegation offers the caller the option to leave you a voicemail or book a follow-up appointment on your calendar through an integration with Microsoft Bookings.
After every screened call, Copilot automatically generates a summary with key topics, caller context, and suggested next steps, enabling you to act quickly instead of reading full call transcripts.
The result: reduced interruptions, fewer missed opportunities, and an easier way to connect with the most important callers.
Call delegation in action
Here are a few example use cases for how this experience can help workers stay more focused and responsive throughout the day.
Resolve what's urgent
A supply chain manager in a live meeting receives a call from a supplier about a delayed shipment to the warehouse. Call delegation screens the call, identifies it as urgent, and attempts a live transfer. The supply chain manager glances at the summary context in the transfer notification, steps out of the meeting to accept the call, and successfully resolves the issue.
Let AI handle the noise
A sales director is working on a time-sensitive deliverable and gets three non-urgent calls in an hour. Call delegation handles each one: it screens out a spam call, directs a colleague to leave a voicemail, and schedules a callback appointment with a customer prospect. All without a single interruption.
Catch up on high priority calls in seconds
Call delegation answers several incoming calls on behalf of a consultant while she’s in back-to-back meetings. The consultant later opens the Call app in Teams to review the Copilot recaps for each call handled through call delegation, which include notes about the reasons for the calls and suggested follow-ups. The consultant identifies an issue from an important client and prioritizes an immediate callback.
Get started today
Call delegation is available to users with a Microsoft 365 Copilot license via the Frontier program. Organizations can join Frontier to get early access to Microsoft’s latest AI innovations.
Service limits may apply at the call level and across monthly tenant usage. Licensing details and usage limits are subject to change and additional information will be communicated at general availability.
For the first time, we can read the source code of the layer Microsoft Cowork runs on.
Anthropic has unbundled the agentic AI stack into three licensable layers: the model, the harness, and the application. Microsoft has licensed the middle two. Until three weeks ago, the harness was a black box. Now it isn’t. Here is what 512,000 lines of TypeScript tell us about where Microsoft Cowork is going, and why the architectural pattern language matters more than the model choice.
Most coverage of modern agentic AI still treats the model as the product. That framing is a year out of date. Look at how the serious labs ship agentic AI in 2026 and you see three distinct layers, each licensable on its own terms.
The bottom layer is the model. Claude Opus 4.7, GPT-5.2, Gemini 3 Ultra. This is the part every analyst benchmarks and every procurement team interrogates. It is also the part where the differentiation gap is narrowing fastest.
The middle layer is the harness. This is the agentic runtime that wraps the model. It is the while-loop over tool calls, the context compaction pipeline, the permission gates, the memory system, the sub-agent orchestrator, the MCP integration layer. Anthropic’s version of this middle layer is exposed to customers as the Claude Agent SDK. Until March 31, most people outside of Anthropic had no real sense of how deeply engineered this layer actually is.
The top layer is the application. Claude Code for developers, Claude Cowork for knowledge workers, and now Microsoft’s own Copilot Cowork built on licensed Anthropic primitives. The application layer is where the brand and the workflow context live.
Microsoft has done something interesting with this stack. They are buying the bottom two layers from Anthropic (as a subprocessor, with all the compliance plumbing that implies) and building the top layer themselves inside the Microsoft 365 trust boundary. In Microsoft’s own words, they have integrated “the technology behind Claude Cowork” into Copilot Cowork. That is product marketing language for “we licensed Anthropic’s model and SDK and wrote our own orchestrator on top.”
The Claude Code source, now public whether Anthropic likes it or not, gives us our first real look at what that middle layer actually contains. Not the sanitized developer docs. The production code.
Here is the punchline from the analysis of the leaked codebase: the agentic loop itself is about twenty lines of code. It is a while-loop over tool calls, with message history as the core data structure. That is not where the engineering lives.
The engineering lives in everything wrapped around that loop. Context management. Permission systems. Memory compaction. Tool schemas. Error recovery. Sub-agent orchestration. All told, roughly 512,000 lines of TypeScript across 1,906 files, just to make a language model behave reliably inside a bounded environment for longer than five minutes.
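To make that concrete, here is a minimal sketch of the shape of such a loop. This is an illustration in Python, not the leaked TypeScript, and the llm and tool interfaces are invented for readability.

```python
# Illustrative sketch of the ~20-line agentic loop: a while-loop over tool
# calls, with message history as the core data structure.
def agentic_loop(llm, tools: dict, messages: list) -> str:
    while True:
        reply = llm(messages)                       # one model call
        messages.append({"role": "assistant", "content": reply.text})
        if not reply.tool_calls:                    # no tools requested:
            return reply.text                       # the turn is complete
        for call in reply.tool_calls:
            result = tools[call.name](**call.args)  # execute each tool
            messages.append({"role": "tool", "content": result})
```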
If you are a technical leader evaluating agentic AI for your enterprise, this is the insight that should change how you think about the procurement decision. Model choice is becoming commodity. Harness choice is not. The harness determines whether your agents can run for hours without context rot, whether they can safely execute privileged operations without a human in the loop, whether they can remember what they learned last week, and whether they leave an audit trail your compliance team will accept.
Here is what the source tells us the production-grade harness actually does.
Seven Patterns That Define the Middle Tier
1. Memory as hint, not truth
The source reveals a three-tier memory architecture that deliberately rejects the RAG-everything approach most enterprise agents ship with today. At the core is a file called MEMORY.md, a lightweight index of pointers, roughly 150 characters per line, perpetually loaded into every prompt. This index does not store data. It stores locations.
Actual project knowledge lives in separate topic files fetched on demand. Raw transcripts are never fully reloaded into context; they are grep’d for specific identifiers. Critically, the agent is instructed to treat its own memory as a hint, not as ground truth. It must re-verify any cached fact against the primary source before acting on it.
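A rough sketch of that lookup pattern, with the file layout (a MEMORY.md pointer index plus topic files) taken from the article and the code itself assumed:

```python
# Hedged sketch: MEMORY.md stores locations, not data; topic files are
# fetched on demand, and anything recalled is a hint to be re-verified.
from pathlib import Path

def recall(keyword: str, memory_dir: Path = Path("memory")) -> str | None:
    for line in (memory_dir / "MEMORY.md").read_text().splitlines():
        if keyword.lower() in line.lower():
            # Assumed pointer format: "<topic>: <relative path>"
            topic_path = memory_dir / line.split(":", 1)[1].strip()
            # The caller must treat this as provisional until reconfirmed
            # against the primary source.
            return topic_path.read_text()
    return None
```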
If you come from clinical informatics, this pattern will feel immediately familiar. It matches how experienced clinicians actually reason: cached knowledge is always provisional until reconfirmed against the patient in front of you. For any compliance-sensitive deployment, the memory-as-hint pattern is the correct starting point. The alternative, which most enterprise agents still ship, is a confident agent with stale assumptions. That is not a posture you want inside a regulated workflow.
2. autoDream, or what happens while the agent is idle
The source revealed a background subsystem called autoDream, modeled explicitly after REM sleep in biological systems. It runs every 24 hours or on demand via a /dream command, and it operates in four phases. Pruning removes outdated or contradictory entries. Merging combines duplicate fragments and unifies different phrasings of the same idea. Refreshing updates stale information and re-weights importance. Synthesis compiles recent learnings into structured memory files with new indexes for faster retrieval.
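A speculative sketch of those four phases; every name and threshold here is illustrative rather than taken from the leaked source.

```python
# Hedged sketch of an autoDream-style consolidation pass over memory entries.
def dream(entries: list[dict]) -> list[dict]:
    # Phase 1, pruning: drop entries flagged as outdated or contradictory.
    entries = [e for e in entries if not e.get("outdated")]
    # Phase 2, merging: keep one entry per topic, counting supporting evidence.
    merged: dict[str, dict] = {}
    for e in entries:
        kept = merged.setdefault(e["topic"], {**e, "evidence": 0})
        kept["evidence"] += 1
    # Phases 3-4, refreshing and synthesis: re-weight by evidence and promote
    # well-supported hints to assertions (the threshold of 3 is invented).
    for e in merged.values():
        if e["evidence"] >= 3:
            e["text"] = e["text"].replace("might ", "")  # hedging erased
    return list(merged.values())
```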
The subtle and somewhat unsettling detail: autoDream rewrites tentative observations as assertions once enough supporting evidence accumulates. “This function might handle authentication” becomes “this function handles authentication.” Hedging language gets erased from the agent’s own memory. There is no human approval step in this loop.
For regulated industries, this is simultaneously the most exciting and most governance-relevant feature in the entire harness. An agent that can consolidate institutional knowledge between sessions is a step-change in capability. An agent that can silently upgrade guesses to facts is a step-change in risk. Any enterprise deployment will need a policy posture on this one, and I suspect the first wave of enterprise-ready autoDream implementations will include a review queue the human actually has to sign off on before provisional facts get promoted.
3. KAIROS, the daemon that decides when to act
KAIROS is referenced more than 150 times in the source. It is not yet publicly enabled, but it is clearly finished code behind a feature flag. The Greek root is deliberate: kairos means the opportune moment, contrasted with chronos, sequential time. The agent does not run on a schedule. It decides when to engage based on context.
Architecturally, KAIROS is an always-on background daemon. It outlives individual conversations. It receives periodic tick prompts and autonomously decides whether to act. It has a 15-second blocking budget to prevent any single decision from monopolizing system resources. And here is the audit-friendly detail: all of its actions are written to an append-only log that the agent itself cannot erase.
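A rough sketch of that daemon shape; the 15-second budget and the append-only log come from the article, and everything else, including the tick interval, is assumed.

```python
# Hedged sketch of a KAIROS-style background daemon.
import time

def kairos_daemon(decide, act, audit_log: list, tick_seconds: int = 60):
    """Outlives individual conversations; decides on each tick whether to act."""
    while True:
        decision = decide("tick", budget_seconds=15)  # bounded decision budget
        if decision is not None:
            # Append-only audit trail: entries are added, never erased.
            audit_log.append({"at": time.time(), "action": str(decision)})
            act(decision)
        time.sleep(tick_seconds)
```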
This is the move from reactive chat to autonomous agent. The append-only audit trail is the compliance-safe version of that autonomy. Any CISO evaluating agentic AI should understand that this is the direction the frontier is heading, and that the audit-log primitive already exists in production code at Anthropic. Microsoft’s equivalent will live inside Copilot’s existing auditing and data loss prevention boundary. If you are building governance policy now, the pattern to encode is “autonomous action is fine, silent action is not.”
4. Tool-call orchestration and sub-agent forking
We already covered the headline: the loop is trivial, the harness is not. Where it gets interesting is sub-agent orchestration. Claude Code can spawn sub-agents, but it does not do so through a fancy orchestration framework. Sub-agents are just another tool call in the registry. The AgentTool is a tool like any other.
When the primary agent forks a sub-agent, it creates a byte-identical copy of the parent context so they share the KV cache. Sub-agents process only their unique instructions, not the entire shared context. Parallelism becomes nearly free in token cost. This is the mechanism that makes multi-agent workflows economically viable at scale, and it is the single most important economic insight in the entire leak. Most enterprise agent frameworks today do not share cache across sub-agents, which is why they break the budget the moment anyone tries to run them in parallel.
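A sketch of those forking mechanics under the same assumptions; the names are illustrative.

```python
# Hedged sketch: a sub-agent fork is a byte-identical copy of the parent
# context, so the inference server can reuse the parent's KV cache and the
# child only pays for its own unique instructions.
def fork_subagent(parent_messages: list[dict], instructions: str) -> list[dict]:
    child = list(parent_messages)  # shared prefix -> shared KV cache
    child.append({"role": "user", "content": instructions})
    return child
```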
The broader architectural lesson: keep the orchestration flat. Most agent frameworks in the wild introduce complex state machines, DAG-based planners, or custom runtimes. Claude Code does none of that. It proves that the right answer is a simple loop with sophisticated tooling around it. If your current agent framework requires a diagram to explain its control flow, you are probably over-engineering the wrong layer.
5. The two-mind permission model
This one deserves its own paragraph. Every tool in Claude Code is independently sandboxed. The agent does not have filesystem access. The agent can use the Read tool, and Read has its own permission gate that evaluates deny, ask, and allow rules before anything executes. Deny always wins.
The architectural principle is: the model decides what to attempt. The tool system decides what is permitted. These are two separate minds, and the tool system does not trust the model.
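A minimal sketch of a deny-wins cascade in that spirit; the rule shapes are assumptions, not the leaked schema.

```python
# Hedged sketch: the tool system evaluates deny/ask/allow rules itself;
# the reasoning model never gets to decide what is permitted.
def check_permission(tool: str, rules: list[dict]) -> str:
    decision = "ask"  # default when no rule matches
    for rule in rules:
        if rule["tool"] == tool:
            if rule["action"] == "deny":
                return "deny"          # deny always wins, short-circuit
            decision = rule["action"]  # "allow" or "ask"
    return decision
```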
Operationally brilliant detail: permission checks are run by Claude Haiku, the smallest and cheapest model in the Anthropic family, not by the main Opus model handling the reasoning. Permission evaluation is framed as a cheap cascading classifier, not as a reasoning task. This keeps the economics of safety sustainable, which matters enormously once you are running thousands of agent-hours per month.
For HIPAA-regulated deployments, the architectural separation between intent and authorization is not a nice-to-have. It is the pattern the regulators are going to expect. If you are building an agent for a covered entity, your permission system should not live inside the same reasoning context as the agent itself. Put a different mind in charge of the lock.
6. Lazy tool discovery for MCP

The clever detail in the source: when MCP servers are connected, Claude Code does not load all their tool schemas into context upfront. It loads only tool names at session start, then uses a search mechanism to discover relevant tools when a task actually needs them. This is the only way to scale tool counts into the hundreds without blowing out the context window.
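A sketch of that lazy-discovery pattern under an assumed server interface; list_tool_names and get_tool_schema are illustrative, not the real MCP client API.

```python
# Hedged sketch: load tool names at session start, fetch schemas on demand.
class LazyToolIndex:
    def __init__(self, servers):
        self.by_name = {name: srv for srv in servers
                        for name in srv.list_tool_names()}
        self.schemas: dict = {}

    def find(self, query: str) -> list:
        # Discover relevant tools only when a task needs them, so hundreds
        # of integrations never sit in the context window at once.
        hits = [n for n in self.by_name if query.lower() in n.lower()]
        for n in hits:
            self.schemas.setdefault(n, self.by_name[n].get_tool_schema(n))
        return [self.schemas[n] for n in hits]
```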
For enterprise deployments wiring an agent into dozens of line-of-business systems, which is exactly the Microsoft position with M365 and the position of every major healthcare system running Epic plus a dozen niche clinical tools, this lazy-discovery pattern is not optional. It is the primitive that makes the entire integration story work. If your current agentic platform eagerly loads every tool schema at startup, it does not scale to the enterprise integration surface you actually have.
7. The three-stage context compaction pipeline
Long sessions are the unsolved problem of agentic AI. Every engineer who has built an agent has hit the same wall: the longer the session runs, the more confused the model gets. Anthropic internally calls this context entropy.
The harness contains a three-stage compaction pipeline that is arguably the single most valuable pattern in the entire codebase. Stage one truncates cached tool outputs locally, preserving the decisions without the raw data. Stage two generates a structured 20,000-token summary when the conversation approaches the context limit. Stage three compresses the full conversation and adds recently accessed files (up to 5,000 tokens per file), active plans, and relevant skills back into the rebuilt context.
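A condensed sketch of those three stages; the 20,000- and 5,000-token figures come from the article, and the orchestration is invented for illustration.

```python
# Hedged sketch of a three-stage compaction pass over message history.
def compact(messages: list[dict], tokens, summarize, limit: int) -> list[dict]:
    # Stage 1: truncate cached tool outputs, keeping decisions, not raw data.
    for m in messages:
        if m["role"] == "tool" and len(m["content"]) > 2_000:
            m["content"] = m["content"][:2_000] + " [truncated]"
    if tokens(messages) < limit:
        return messages
    # Stage 2: structured ~20,000-token summary as the context limit nears.
    summary = summarize(messages, max_tokens=20_000)
    # Stage 3: rebuild context around the summary; the real pipeline also
    # re-adds recent files (up to 5,000 tokens each), plans, and skills.
    return [{"role": "system", "content": summary}]
```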
The operational insight for technical leaders: context management is the hardest problem in agentic systems, and it deserves the most engineering investment. Most teams spend their time tuning prompts. The teams that ship working agents spend their time engineering what goes into, and out of, the context window. If you are funding an agentic AI initiative right now, ask your team what their context compaction strategy is. If the answer is “we just use a longer context window,” the initiative will fail at scale.
What This Means for Microsoft Cowork
Walk the seven patterns against Microsoft’s own description of Copilot Cowork and the translation becomes obvious.
Microsoft says Cowork “runs within Microsoft 365’s security and governance boundaries. Identity, permissions, and compliance policies apply by default, and actions and outputs are auditable.” That is the permission and hook model, re-implemented on top of Microsoft Entra and Purview instead of Claude Code’s local sandbox.
Microsoft says Cowork “runs in a protected, sandboxed cloud environment, so tasks can keep progressing safely as you move across devices.” That is KAIROS, re-implemented on Azure instead of your laptop.
Microsoft says Cowork “turns your request into a plan. The plan continues in the background, with clear checkpoints so you can confirm progress, make changes, or pause execution at any time.” That is the coordinator-plus-sub-agent pattern with the append-only audit log, expressed in product language.
Microsoft says Cowork is “powered by Work IQ” and “draws on signals across Outlook, Teams, Excel, and the rest of Microsoft 365.” That is MEMORY.md plus the MCP integration layer, re-implemented on top of the Microsoft Graph.
None of this is coincidence. Microsoft is consuming the Anthropic pattern language. They are not copying the code. They are licensing the architectural primitives via the Claude Agent SDK and wrapping them in Microsoft’s identity, compliance, and data boundaries. The model is Claude Opus 4.7 (now in Copilot Cowork as of last week). The harness is Anthropic’s SDK. The application is Microsoft’s.
And that is precisely why the Anthropic codebase is the most useful document you can read right now if you want to understand where Copilot Cowork is going. The features sitting behind feature flags in the Anthropic source today are the features that will ship in Copilot Cowork in the next two to three quarters.
Why This Matters for Technical Leadership
If you are a healthcare CIO, a hospital informatics lead, or a CTO of any regulated enterprise evaluating where to place your agentic AI bets, here is the read.
Model choice is becoming less important than harness choice. Harness choice determines whether your agent can safely persist across sessions, whether it leaves an audit trail, whether it respects your data boundaries, whether it can scale to the tool counts your actual business requires, and whether it can handle long-running workflows without hallucinating its way into a compliance incident.
The Anthropic harness, now visible in unprecedented detail, represents the current state of the art. Microsoft is consuming it. Other platforms will follow. The pattern language itself is the differentiator for the next eighteen months, not the underlying model.
For healthcare specifically, three of the seven patterns are immediately relevant. Memory-as-hint matches how clinical reasoning works and should be the default for any clinical-adjacent agent. The two-mind permission model is the pattern your compliance team will accept, because it separates intent from authorization at an architectural layer regulators can audit. And the append-only audit log that KAIROS introduces is the pattern that makes autonomous agents defensible under HIPAA and the emerging state AI governance laws.
The leak was framed as a security story. It is actually an industry story. For the first time, we can see the shape of what the middle tier of agentic AI looks like in production, and we can read Microsoft’s product roadmap by looking at the features currently flagged off in the Anthropic codebase. The features that ship next in Claude Code will almost certainly appear in Copilot Cowork a few months later, with a Microsoft wrapper and a different billing mechanism.
Pay attention to the middle tier. It is where the real competition is happening, and it is where your architectural bets for the next three years will either pay off or strand.
Paul J. Swider is the CEO and Chief AI Officer of RealActivity, a Microsoft partner building healthcare AI solutions. He is an analyst-practitioner with Cloud Wars and the Acceleration Economy, a Microsoft MVP and MCT, and the founder of BOSHUG, the Boston Healthcare Cloud & AI Community.
Onboard/wizard: restyle the setup security disclaimer with a single yellow warning banner, section headings and bulleted checklists, and un-dim the note body so key guidance is easy to scan; add a loading spinner during the initial model catalog load so the wizard no longer goes blank while it runs; add an "API key" placeholder to provider API key prompts. (#69553) Thanks @Patrick-Erichsen.
Agents/prompts: strengthen the default system prompt and OpenAI GPT-5 overlay with clearer completion bias, live-state checks, weak-result recovery, and verification-before-final guidance.
Models/costs: support tiered model pricing from cached catalogs and configured models, and include bundled Moonshot Kimi K2.6/K2.5 cost estimates for token-usage reports. (#67605) Thanks @sliverp.
Sessions/maintenance: enforce the built-in entry cap and age prune by default, and prune oversized stores at load time so accumulated cron/executor session backlogs cannot OOM the gateway before the write path runs. (#69404) Thanks @bobrenze-bot.
Plugins/tests: reuse plugin loader alias and Jiti config resolution across repeated same-context loads, reducing import-heavy test overhead. (#69316) Thanks @amknight.
Cron: split runtime execution state into jobs-state.json so jobs.json stays stable for git-tracked job definitions. (#63105) Thanks @Feelw00.
Agents/compaction: send opt-in start and completion notices during context compaction. (#67830) Thanks @feniix.
Moonshot/Kimi: default bundled Moonshot setup, web search, and media-understanding surfaces to kimi-k2.6 while keeping kimi-k2.5 available for compatibility. (#69477) Thanks @scoootscooob.
Moonshot/Kimi: allow thinking.keep = "all" on moonshot/kimi-k2.6, and strip it for other Moonshot models or requests where pinned tool_choice disables thinking. (#68816) Thanks @aniaan.
BlueBubbles/groups: forward per-group systemPrompt config into inbound context GroupSystemPrompt so configured group-specific behavioral instructions (for example threaded-reply and tapback conventions) are injected on every turn. Supports "*" wildcard fallback matching the existing requireMention pattern. Closes #60665. (#69198) Thanks @omarshahine.
Plugins/tasks: add a detached runtime registration contract so plugin executors can own detached task lifecycle and cancellation without reaching into core task internals. (#68915) Thanks @mbelinky.
Terminal/logging: optimize sanitizeForLog() by replacing the iterative control-character stripping loop with a single regex pass while preserving the existing ANSI-first sanitization behavior. (#67205) Thanks @bulutmuf.
QA/CI: make openclaw qa suite and openclaw qa telegram fail by default when scenarios fail, add --allow-failures for artifact-only runs, and tighten live-lane defaults for CI automation. (#69122) Thanks @joshavant.
Mattermost: stream thinking, tool activity, and partial reply text into a single draft preview post that finalizes in place when safe. (#47838) Thanks @ninjaa.
Fixes
Exec/YOLO: stop rejecting gateway-host exec in security=full plus ask=off mode via the Python/Node script preflight hardening path, so promptless YOLO exec once again runs direct interpreter stdin and heredoc forms such as node <<'NODE' ... NODE.
OpenAI Codex: normalize legacy openai-completions transport overrides on default OpenAI/Codex and GitHub Copilot-compatible hosts back to the native Codex Responses transport while leaving custom proxies untouched. (#45304, #42194) Thanks @dyss1992 and @DeadlySilent.
Anthropic/plugins: scope Anthropic api: "anthropic-messages" defaulting to Anthropic-owned providers, so openai-codex and other providers without an explicit api no longer get rewritten to the wrong transport. Fixes #64534.
QQBot: add SSRF guard to direct-upload URL paths in uploadC2CMedia and uploadGroupMedia. [AI-assisted] (#69595) Thanks @pgondhi987.
Browser/Chrome MCP: surface DevToolsActivePort attach failures as browser-connectivity errors instead of a generic "waiting for tabs" timeout, and point signed-out fallbacks toward the managed openclaw profile.
Webchat/images: treat inline image attachments as media for empty-turn gating while still ignoring metadata-only blank turns. (#69474) Thanks @Jaswir.
Discord/think: only show adaptive in /think autocomplete for provider/model pairs that actually support provider-managed adaptive thinking, so GPT/OpenAI models no longer advertise an Anthropic-only option.
Thinking: only expose max for models that explicitly support provider max reasoning, and remap stored max settings to the largest supported thinking mode when users switch to another model.
Gateway/usage: bound the cost usage cache with FIFO eviction so date/range lookups cannot grow unbounded. (#68842) Thanks @Feelw00.
OpenAI/Responses: resolve /think levels against each GPT model's supported reasoning efforts so /think off no longer becomes high reasoning or sends unsupported reasoning.effort: "none" payloads.
Lobster/TaskFlow: allow managed approval resumes to use approvalId without a resume token, and persist that id in approval wait state. (#69559) Thanks @kirkluokun.
Plugins/startup: install bundled runtime dependencies into each plugin's own runtime directory, reuse source-checkout repair caches after rebuilds, and log only packages that were actually installed so repeated Gateway starts stay quiet once deps are present.
Plugins/startup: ignore pnpm's npm_execpath when repairing bundled plugin runtime dependencies and skip workspace-only package specs so npm-only install flags or local workspace links do not break packaged plugin startup.
MCP: block interpreter-startup env keys such as NODE_OPTIONS for stdio servers while preserving ordinary credential and proxy env vars. (#69540) Thanks @drobison00.
Agents/shell: ignore non-interactive placeholder shells like /usr/bin/false and /sbin/nologin, falling back to sh so service-user exec runs no longer exit immediately. (#69308) Thanks @sk7n4k3d.
Setup/TUI: relaunch the setup hatch TUI in a fresh process while preserving the configured gateway target and auth source, so onboarding recovers terminal state cleanly without exposing gateway secrets on command-line args. (#69524) Thanks @shakkernerd.
Codex: avoid re-exposing the image-generation tool on native vision turns with inbound images, and keep bare image-model overrides on the configured image provider. (#65061) Thanks @zhulijin1991.
Sessions/reset: clear auto-sourced model, provider, and auth-profile overrides on /new and /reset while preserving explicit user selections, so channel sessions stop staying pinned to runtime fallback choices. (#69419) Thanks @sk7n4k3d.
Sessions/costs: snapshot estimatedCostUsd like token counters so repeated persist paths no longer compound the same run cost by up to dozens of times. (#69403) Thanks @MrMiaigi.
OpenAI Codex: route ChatGPT/Codex OAuth Responses requests through the /backend-api/codex endpoint so openai-codex/gpt-5.4 no longer hits the removed /backend-api/responses alias. (#69336) Thanks @mzogithub.
OpenAI/Responses: omit disabled reasoning payloads when /think off is active, so GPT reasoning models no longer receive unsupported reasoning.effort: "none" requests. (#61982) Thanks @a-tokyo.
Gateway/pairing: treat loopback shared-secret node-host, TUI, and gateway clients as local for pairing decisions, so trusted local tools no longer reconnect as remote clients and fail with pairing required. (#69431) Thanks @SARAMALI15792.
Active Memory: degrade gracefully when memory recall fails during prompt building, logging a warning and letting the reply continue without memory context instead of failing the whole turn. (#69485) Thanks @Magicray1217.
Ollama: add provider-policy defaults for baseUrl and models so implicit local discovery can run before config validation rejects a minimal Ollama provider config. (#69370) Thanks @PratikRai0101.
Agents/model selection: clear transient auto-failover session overrides before each turn so recovered primary models are retried immediately without emitting user-override reset warnings. (#69365) Thanks @hitesh-github99.
Auto-reply: apply silent NO_REPLY policy per conversation type, so direct chats get a helpful rewritten reply while groups and internal deliveries can remain quiet. (#68644) Thanks @Takhoffman.
Telegram/status reactions: honor messages.removeAckAfterReply when lifecycle status reactions are enabled, clearing or restoring the reaction after success/error using the configured hold timings. (#68067) Thanks @poiskgit.
Web search/plugins: resolve plugin-scoped SecretRef API keys for bundled Exa, Firecrawl, Gemini, Kimi, Perplexity, Tavily, and Grok web-search providers when they are selected through the shared web-search config. (#68424) Thanks @afurm.
Telegram/polling: raise the default polling watchdog threshold from 90s to 120s and add configurable channels.telegram.pollingStallThresholdMs (also per-account) so long-running Telegram work gets more room before polling is treated as stalled. (#57737) Thanks @Vitalcheffe.
Telegram/polling: bound the persisted-offset confirmation getUpdates probe with a client-side timeout so a zombie socket cannot hang polling recovery before the runner watchdog starts. (#50368) Thanks @boticlaw.
Agents/Pi runner: retry silent stopReason=error turns with no output when no side effects ran, so non-frontier providers that briefly return empty error turns get another chance instead of ending the session early. (#68310) Thanks @Chased1k.
Plugins/memory: preserve the active memory capability when read-only snapshot plugin loads run, so status and provider discovery paths no longer wipe memory public artifacts. (#69219) Thanks @zeroaltitude.
Plugins: keep only the highest-precedence manifest when distinct discovered plugins share an id, so lower-precedence global or workspace duplicates no longer load beside bundled or config-selected plugins. (#41626) Thanks @Tortes.
Cron/delivery: treat explicit delivery.mode: "none" runs as not requested even if the runner reports delivered: false, so no-delivery cron jobs no longer persist false delivery failures or errors. (#69285) Thanks @matsuri1987.
Plugins/install: repair active and default-enabled bundled plugin runtime dependencies before import in packaged installs, so bundled Discord, WhatsApp, Slack, Telegram, and provider plugins work without putting their dependency trees in core.
BlueBubbles: raise the outbound /api/v1/message/text send timeout default from 10s to 30s, and add a configurable channels.bluebubbles.sendTimeoutMs (also per-account) so macOS 26 setups where Private API iMessage sends stall for 60+ seconds no longer silently lose messages at the 10s abort. Probes, chat lookups, and health checks keep the shorter 10s default. Fixes #67486. (#69193) Thanks @omarshahine.
Agents/bootstrap: budget truncation markers against per-file caps, preserve source content instead of silently wasting bootstrap bytes, and avoid marker-only output in tiny-budget truncation cases. (#69114) Thanks @BKF-Gitty.
Context engine/plugins: stop rejecting third-party context engines whose info.id differs from the registered plugin slot id. The strict-match contract added in 2026.4.14 broke lossless-claw and other plugins whose internal engine id does not equal the slot id they are registered under, producing repeated info.id must match registered id lane failures on every turn. Fixes #66601. (#66678) Thanks @GodsBoy.
Agents/compaction: rename embedded Pi compaction lifecycle events to compaction_start / compaction_end so OpenClaw stays aligned with pi-coding-agent 0.66.1 event naming. (#67713) Thanks @mpz4life.
Security/dotenv: block all OPENCLAW_* keys from untrusted workspace .env files so workspace-local env loading fails closed for new runtime-control variables instead of silently inheriting them. (#473)
Gateway/device pairing: restrict non-admin paired-device sessions (device-token auth) to their own pairing list, approve, and reject actions so a paired device cannot enumerate other devices or approve/reject pairing requests authored by another device. Admin and shared-secret operator sessions retain full visibility. (#69375) Thanks @eleqtrizit.
Agents/gateway tool: extend the agent-facing gateway tool's config mutation guard so model-driven config.patch and config.apply cannot rewrite operator-trusted paths (sandbox, plugin trust, gateway auth/TLS, hook routing and tokens, SSRF policy, MCP servers, workspace filesystem hardening) and cannot bypass the guard by editing per-agent sandbox, tools, or embedded-Pi overrides in place under agents.list[]. (#69377) Thanks @eleqtrizit.
Gateway/websocket broadcasts: require operator.read (or higher) for chat, agent, and tool-result event frames so pairing-scoped and node-role sessions no longer passively receive session chat content, and scope-gate unknown broadcast events by default. Plugin-defined plugin.* broadcasts are scoped to operator.write/admin, and status/transport events (heartbeat, presence, tick, etc.) remain unrestricted. Per-client sequence numbers preserve per-connection monotonicity. (#69373) Thanks @eleqtrizit.
Agents/compaction: always reload embedded Pi resources through an explicit loader and reapply reserve-token overrides so runs without extension factories no longer silently lose compaction settings before session start. (#67146) Thanks @ly85206559.
Memory-core/dreaming: normalize sweep timestamps and reuse hashed narrative session keys for fallback cleanup so Dreaming narrative sub-sessions stop leaking. (#67023) Thanks @chiyouYCH.
Gateway/startup: delay HTTP bind until websocket handlers are attached, so immediate post-startup websocket health/connect probes no longer hit the startup race window. (#43392) Thanks @dalefrieswthat.
Codex/app-server: release the session lane when a downstream consumer throws while draining the turn/completed notification, so follow-up messages after a Codex plugin reply stop queueing behind a stale lane lock. Fixes #67996. (#69072) Thanks @ayeshakhalid192007-dev.
Codex/app-server: default approval handling to on-request so Codex harness sessions do not start with overly permissive tool approvals. (#68721) Thanks @Lucenx9.
Cron/delivery: keep isolated cron chat delivery tools available, resolve channel: "last" targets from the gateway, show delivery previews in cron list/show, and avoid duplicate fallback sends after direct message-tool delivery. (#69587) Thanks @obviyus.
Cron/Telegram: key isolated direct-delivery dedupe to each cron execution instead of the reused session id, so recurring Telegram announce runs no longer report delivered while silently skipping later sends. (#69000) Thanks @obviyus.
Models/Kimi: default bundled Kimi thinking to off and normalize Anthropic-compatible thinking payloads so stale session /think state no longer silently re-enables reasoning on Kimi runs. (#68907) Thanks @frankekn.
Control UI/cron: keep the runtime-only last delivery sentinel from being materialized into persisted cron delivery and failure-alert channel configs when jobs are created or edited. (#68829) Thanks @tianhaocui.
OpenAI/Responses: strip orphaned reasoning blocks before outbound Responses API calls so compacted or restored histories no longer fail on standalone reasoning items. (#55787) Thanks @suboss87.
Cron/CLI: parse PowerShell-style --tools allow-lists the same way as comma-separated input, so cron add and cron edit no longer persist exec read write as one combined tool entry on Windows. (#68858) Thanks @chen-zhang-cs-code.
Browser/user-profile: let existing-session profile="user" tool calls auto-route to a connected browser node or use explicit target="node", while still honoring explicit target="host" pinning. (#48677)
Discord/slash commands: tolerate partial Discord channel metadata in slash-command and model-picker flows so partial channel objects no longer crash when channel names, topics, or thread parent metadata are unavailable. (#68953) Thanks @dutifulbob.
BlueBubbles: consolidate outbound HTTP through a typed BlueBubblesClient that resolves the SSRF policy once at construction so image attachments stop getting blocked on localhost and reactions stop getting blocked on private-IP BB deployments. Fixes #34749 and #59722. (#68234) Thanks @omarshahine.
Cron/gateway: reject ambiguous announce delivery config at add/update time so invalid multi-channel or target-id provider settings fail early instead of persisting broken cron jobs. (#69015) Thanks @obviyus.
Cron/main-session delivery: preserve heartbeat.target="last" through deferred wake queuing, gateway wake forwarding, and same-target wake coalescing so queued cron replies still return to the last active chat. (#69021) Thanks @obviyus.
Cron/gateway: ignore disabled channels when announce delivery ambiguity is checked, and validate main-session delivery patches against the live cron service default agent so hot-reloaded agent config does not falsely reject valid updates. (#69040) Thanks @obviyus.
Matrix/allowlists: hot-reload dm.allowFrom and groupAllowFrom entries on inbound messages while keeping config removals authoritative, so Matrix allowlist changes no longer require a channel restart to add or revoke a sender. (#68546) Thanks @johnlanni.
BlueBubbles: always set method explicitly on outbound text sends ("private-api" when available, "apple-script" otherwise), and prefer Private API on macOS 26 even for plain text. Fixes silent delivery failure on macOS setups without Private API where an omitted method let BB Server fall back to version-dependent default behavior that silently drops the message (#64480), and the AppleScript -1700 error on macOS 26 Tahoe plain text sends (#53159). (#69070) Thanks @xqing3.
Matrix/commands: recognize slash commands that are prefixed with the bot's Matrix mention, so room messages like @bot:server /new trigger the command path without requiring custom mention regexes. (#68570) Thanks @nightq and @johnlanni.
Gateway/pairing: return reason-specific PAIRING_REQUIRED details, remediation hints, and request ids so unapproved-device and scope-upgrade failures surface actionable recovery guidance in the CLI and Control UI. (#69227) Thanks @obviyus.
Agents/subagents: include requested role and runtime timing on subagent failure payloads so parent agents can correlate failed or timed-out child work. (#68726) Thanks @BKF-Gitty.
Gateway/sessions: reject stale agent-scoped sessions after an agent is removed from config while preserving legacy default-agent main-session aliases. (#65986) Thanks @bittoby.
Doctor/gateway: surface pending device pairing requests, scope-upgrade approval drift, and stale device-token mismatch repair steps so openclaw doctor --fix no longer leaves pairing/auth setup failures unexplained. (#69210) Thanks @obviyus.
Cron/isolated-agent: preserve explicit delivery.mode: "none" message targets for isolated runs without inheriting implicit last routing, so agent-initiated Telegram sends keep their authored destination while bare mode:none jobs stay targetless. (#69153) Thanks @obviyus.
Cron/isolated-agent: keep delivery.mode: "none" account-only or thread-only configs from inheriting a stale implicit recipient, so isolated runs only resolve message routing when the job authored an explicit to target. (#69163) Thanks @obviyus.
Gateway/TUI: retry session history while the local gateway is still finishing startup, so openclaw tui reconnects no longer fail on transient chat.history unavailable during gateway startup errors. (#69164) Thanks @shakkernerd.
BlueBubbles/reactions: fall back to love when an agent reacts with an emoji outside the iMessage tapback set (love/like/dislike/laugh/emphasize/question), so wider-vocabulary model reactions like 👀 still produce a visible tapback instead of failing the whole reaction request. Configured ack reactions still validate strictly via the new normalizeBlueBubblesReactionInputStrict path. (#64693) Thanks @zqchris.
BlueBubbles: prefer iMessage over SMS when both chats exist for the same handle, honor explicit sms: targets, and never silently downgrade iMessage-available recipients. (#61781) Thanks @rmartin.
Telegram/setup: require numeric allowFrom user IDs during setup instead of offering unsupported @username DM resolution, and point operators to from.id/getUpdates for discovery. (#69191) Thanks @obviyus.
GitHub Copilot/onboarding: default GitHub Copilot setup to claude-opus-4.6 and keep the bundled default model list aligned, so new Copilot setups no longer start on the older gpt-4o default. (#69207) Thanks @obviyus.
Gateway/status: separate reachability, capability, and read-probe reporting so connect-only or scope-limited sessions no longer look fully healthy, and normalize SSH targets entered as ssh user@host. (#69215) Thanks @obviyus.
Slack: fix outbound replies failing with "unresolved SecretRef" for accounts configured via file or exec secret sources; the send path now tolerates the runtime snapshot retaining an unresolved channel SecretRef when a boot-resolved token override is already available. (#68954) Thanks @openperf.
Control UI/device pairing: explain scope and role approval upgrades during reconnects, and show requested versus approved access in the Control UI and openclaw devices so broader reconnects no longer look like lost pairings. (#69221) Thanks @obviyus.
Gateway/Control UI: surface pending scope, role, and device-metadata pairing approvals in auth errors and Control UI hints so broader reconnects no longer look like random auth breakage. (#69226) Thanks @obviyus.
All of the code in this article is available in the Oracle AI Developer Hub. The repository is part of Oracle’s open-source AI collection and serves as the reference implementation for everything covered here.
You can install it with pip install agent-reasoning, browse the 16 agent classes, run the TUI, or integrate it directly into an existing Ollama pipeline as a zero-change replacement client. If you find it useful, a GitHub star goes a long way.
Key Takeaways
Small language models struggle with complex reasoning on their own, but agent-based architectures (like Tree of Thoughts or Self-Consistency) can significantly improve their performance.
The agent-reasoning framework adds 16 research-backed reasoning strategies to any Ollama model using a simple +strategy tag—no code changes required.
Different strategies suit different tasks: CoT works well overall, ReAct excels with external data, and branching methods improve accuracy at the cost of speed.
Much of modern AI progress comes from orchestration (prompting, search, control flow), not just larger models.
Generally, a 270M parameter LLM (as of today, April 2026) struggles with even basic multi-step reasoning. Ask a model like gemma3:270m to solve the classic water jug problem, and it will often return a confidently incorrect answer—much like other small language models (SLMs) of similar size and training.
However, take that same model and wrap it inside a Tree of Thoughts (ToT) agent, running a breadth-first search (BFS) with three levels and weighted branches, and it can reliably solve the puzzle. The improvement comes from the architecture: the agent distributes the reasoning process across structured exploration steps, compensating for the limitations of a single LLM call.
This is where things get interesting. Much of the progress in applied AI isn’t coming from bigger models alone, but from engineers rethinking how to orchestrate them — layering search, memory, and control flow on top of a standard LLM call to unlock new capabilities.
This is the fundamental idea behind agent-reasoning: sixteen cognitive architectures — each backed by peer-reviewed research — can be applied to any Ollama-served model via a simple +Strategy tag appended to the model name. Call gemma3:270m+tot instead of gemma3:270m, and the interceptor handles everything else.
We'll walk through the different ways to invoke these reasoning strategies in the project.
What You’ll Learn
How the ReasoningInterceptor intercepts model names, removes the +Strategy tag, and directs traffic to one of 16 agent classes
How 16 strategies divide into four families: sequential, branching, reflective, and meta — each representing a different reasoning approach and set of trade-offs
What each major strategy accomplishes in practice, focusing on implementation rather than theory
Which type of problem each strategy is best suited for, based on benchmark results from March 2026
The Interception Layer
Key insight: The ReasoningInterceptor is a drop-in replacement client for Ollama that inspects the model name for a +Strategy tag and routes traffic to one of 16 cognitive agent classes, with no modifications to your pre-existing code.
Everything relies on a single template: add +Strategy to any Ollama model name.
Using ReasoningInterceptor as a drop-in replacement client
The image below illustrates the entire routing process from start to finish. The interceptor acts as a middleman between your code and Ollama, removes the +Strategy tag, and sends traffic to the correct agent class.
Illustrating how the interceptor separates the base model from the Strategy tag
agent_map contains over fifty-five aliases mapped to sixteen agent classes. For example, cot, chain_of_thought, and CoT all map to CoTAgent. Similarly, mcts and monte_carlo both map to MCTSAgent. Since the interceptor is a drop-in client for Ollama (same .generate() and .chat() APIs), existing LangChain pipelines, web UIs, and scripts automatically receive reasoning functionality by replacing a single string in the model name.
Additionally, it can be used as a network proxy. Instead of pointing an Ollama-compatible application at http://localhost:11434, point it at http://localhost:8080. Use gemma3:270m+CoT as the model name and the gateway will apply reasoning transparently.
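A minimal sketch of the drop-in usage; the article confirms the .generate()/.chat() call shapes and the +Strategy tag, while the import path and constructor arguments shown here are assumptions.

```python
# Hedged sketch: same call shape as the stock Ollama client.
from agent_reasoning import ReasoningInterceptor  # assumed import path

client = ReasoningInterceptor(host="http://localhost:11434")  # assumed args

# The "+tot" suffix routes the request through the Tree of Thoughts agent;
# strip the tag and you get plain gemma3:270m behavior.
response = client.generate(
    model="gemma3:270m+tot",
    prompt="You have a 3L jug and a 5L jug. Measure exactly 4L of water.",
)
print(response["response"])
```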
Family 1: Sequential Strategies
Key insight: Sequential strategies process problems in a linear chain, where each step feeds into the next. In benchmarks, CoT achieved 88.7% average accuracy, compared to 81.3% for standard generation on the same model and weights.
Each of the sixteen strategies falls into one of four families. The diagram below illustrates how they are grouped.
Categorization of the four strategy families
Sequential strategies are designed for high-speed processing with minimal latency. They are ideal for problems with discrete, sequential steps.
Chain of Thought (CoT) is a prompting strategy in which the model generates intermediate reasoning steps before producing a final response. As noted in the original paper: prompting a model to produce these intermediate steps can significantly improve accuracy.
For example, standard prompting on GSM8K achieves 66.7% accuracy. With CoT prompting, this increases to 73.3% — a 10% relative improvement achieved through simple prompt design alone.
The following graphic illustrates how CoT chains appear in practice: a sequence of numbered steps, each building on the previous one.
CoT in operation
In terms of implementation within CoTAgent, the query is wrapped in a structured prompt:
Structured prompting enforces step-by-step reasoning in CoTAgent
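The exact template ships with the project; the sketch below only illustrates the general shape of such a wrapper and is not the actual CoTAgent prompt.

```python
# Hedged sketch of a structured CoT prompt wrapper.
COT_TEMPLATE = """Answer the following question by reasoning step by step.

Question: {query}

Let's think step by step:
1."""

def wrap_cot(query: str) -> str:
    return COT_TEMPLATE.format(query=query)
```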
Benchmark result for qwen3.5:9b (9.7B): CoT achieves 88.7% average accuracy across GSM8K (math), MMLU (logic), and ARC-Challenge (reasoning), compared to 81.3% for standard generation. This seven-point gain is attributable solely to the structured prompt; identical weights and temperatures were used for both runs.
Recommended usage: Math word problems; logic puzzles; any multi-step reasoning task where the individual steps are sequential and do not have branches.
Decomposed prompting is an architectural module that splits large problems into smaller sub-problems. Each sub-problem is handled independently while carrying forward accumulated context from earlier steps. Once all sub-problems are processed, their outputs are synthesized into a final result. DecomposedAgent follows a three-phase process (decomposition, execution, and synthesis), propagating context throughout so that each step can build on prior results.
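A sketch of those three phases with illustrative prompts; DecomposedAgent's actual templates are not shown here.

```python
# Hedged sketch: decomposition, execution with forwarded context, synthesis.
def decomposed(llm, query: str) -> str:
    subs = llm(f"Split into sub-problems, one per line:\n{query}").splitlines()
    context = ""
    for sub in subs:
        answer = llm(f"Context so far:\n{context}\nSolve: {sub}")
        context += f"\n{sub} -> {answer}"  # each step builds on prior results
    return llm(f"Question: {query}\nSub-answers:{context}\nFinal answer:")
```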
Recommended usage: Planning problems; trip itinerary generation; any problem where the ultimate answer consists of multiple distinguishable parts that may be individually addressed.
Note: Decomposed prompting achieved only 38.5% average accuracy in benchmark testing. This result requires context. GSM8K primarily evaluates arithmetic reasoning, where decomposing a problem like “what is 47 × 13 + 9?” introduces overhead without improving the model’s ability to compute the answer.
Decomposition is more effective for problems with genuinely separable components (trip planning, multi-section reports, etc.), where each part benefits from focused attention. These strengths are not captured by the benchmark, and the results reflect that mismatch.
Least-to-most prompting is a strategy that orders sub-questions from simplest to most complex, establishing prerequisite knowledge before tackling harder steps. Unlike decomposed prompting which generates arbitrary sub-problems, it enforces a deliberate progression where each step builds on the last. Knowledge is accumulated iteratively until the model reaches the final question.
Recommended usage: Questions with genuine prerequisites — e.g., “what is x?” before determining “how does x relate to y?”; educational style explanation sequences (“concept ladder”); tasks that require establishing foundational concepts before addressing more complex components.
Family 2: Branching Strategies
Key insight: Branching strategies explore multiple reasoning paths simultaneously and choose the best path. ToT scored 76.7% on GSM8K math, compared to 66.7% with standard generation.
More LLM calls mean higher latency, but often better answers on hard problems. Take this into consideration when running any of the branching strategies.
ToT is a search-based methodology that evaluates numerous possible reasoning paths concurrently, selecting the best-performing path as determined by evaluation metrics such as distance traveled or the quality of intermediate solutions.
Similar to chess engines, ToT applies BFS through an expanding tree of possible solutions. The core idea is straightforward: generate multiple partial solutions, evaluate them, prune weaker candidates, and continue exploring the most promising branches.
Below is an illustration of how ToT generates and eliminates branches: green nodes represent surviving branches, while red nodes indicate those that have been eliminated. The final answer is derived from the highest scoring leaf node.
A key design decision is how branches are evaluated. Should the same model handle both generation and scoring, or should a stronger model be introduced as a judge? In these benchmarks, the same model was used for both roles, but this is an area worth experimenting with, depending on your accuracy and latency constraints.
Generating candidate branches at each level
ToTAgent makes this configurable by depth (default 3) and width (default 2 branches). At every level, the agent generates a set of candidate next steps, evaluates them using a scoring function, prunes low-scoring options, and expands the remaining candidates into the next level.
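A stripped-down version of that loop might look like the following; the scoring prompt and number parsing are simplified assumptions, and, as noted above, the same model acts as both generator and judge:

```python
import ollama

def llm(p): return ollama.generate(model="qwen3.5:9b", prompt=p)["response"]

def tree_of_thoughts(query: str, depth: int = 3, width: int = 2, keep: int = 2) -> str:
    frontier = [""]  # each entry is an accumulated chain of reasoning steps
    for _ in range(depth):
        candidates = []
        for path in frontier:
            for _ in range(width):  # branch: propose `width` next steps per path
                step = llm("Problem: " + query + "\nReasoning so far:\n" + path
                           + "\nPropose the single next reasoning step.")
                candidates.append(path + step.strip() + "\n")
        scored = []
        for cand in candidates:  # evaluate: the same model scores each partial path
            raw = llm("Rate this partial solution from 0 to 10. "
                      "Reply with the number only.\n" + cand)
            digits = "".join(ch for ch in raw if ch.isdigit())
            scored.append((int(digits or "0"), cand))
        # Prune: keep only the highest-scoring branches for the next level.
        frontier = [c for _, c in sorted(scored, key=lambda t: t[0], reverse=True)[:keep]]
    return llm("Problem: " + query + "\nReasoning:\n" + frontier[0]
               + "\nGive the final answer.")
```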
ToT achieved 76.7% accuracy, a 10-point improvement over standard generation on GSM8K math problems. This performance comes at a cost: additional LLM calls are required at each step to evaluate candidate paths and their intermediate results, making it roughly 5–8x slower than equivalent CoT queries.
Recommended usage: Logic puzzles with multiple solution paths; strategic decision problems; tasks where multiple approaches can be explored and compared.
Self-Consistency is a sampling method that generates multiple independent reasoning traces and selects a final answer through majority voting. Unlike standard prompting, it relies on sampling k diverse traces at a higher temperature to encourage variation. Each trace produces a candidate answer, and the most frequently occurring answer is selected as the final output.
The image below illustrates how both Self-Consistency and Monte Carlo Tree Search (MCTS) sample multiple reasoning paths, but differ fundamentally in how those paths are evaluated — majority voting versus UCB1-based exploration-exploitation balancing.
Self-Consistency vs MCTS comparison
ConsistencyAgent uses k=5 samples at a temperature of 0.7 by default. It extracts final answers using regex-based pattern matching and selects the most frequent result via Counter.most_common().
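That logic fits in a few lines; the prompt suffix and extraction regex below are simplified stand-ins for the agent's actual patterns:

```python
import re
from collections import Counter
import ollama

def self_consistency(query: str, model: str = "qwen3.5:9b",
                     k: int = 5, temperature: float = 0.7):
    answers = []
    for _ in range(k):
        resp = ollama.generate(
            model=model,
            prompt=query + "\nThink step by step, then end with 'Answer: <value>'.",
            options={"temperature": temperature},  # higher temperature -> diverse traces
        )
        m = re.search(r"Answer:\s*(.+)", resp["response"])
        if m:
            answers.append(m.group(1).strip())
    # Majority vote across the k independent reasoning traces.
    return Counter(answers).most_common(1)[0][0] if answers else None
```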
Self-Consistency matches CoT on both MMLU (96.7%) and GSM8K (76.7%). Its advantage lies in reliability rather than raw accuracy: majority voting across independent reasoning traces reduces the risk of single-trace errors propagating to the final answer.
Recommended usage: Factual question answering; multiple-choice style questions; problems where arriving at the correct answer via diverse reasoning paths is more important than inspecting a single reasoning trace.
Family 3: Reflective Strategies
Self-Reflection
Paper: Shinn et al. (2023), “Reflexion: Language Agents with Verbal Reinforcement Learning” — arXiv:2303.11366
Self-Reflection is a draft-critique-refine loop in which the model generates an initial answer, critiques it for errors, and then revises it. The Reflexion paper showed that this iterative process can meaningfully improve output quality, even without any gradient updates.
The image below shows all three reflective strategies side by side: Self-Reflection, Debate, and Refinement Loop.
Reflective strategies comparison
SelfReflectionAgent runs a draft-critique-refine loop for up to 5 iterations, with early termination when the critique returns “CORRECT” in under 20 characters. If the critique is satisfied on an early pass, the remaining iterations are skipped, which keeps latency low for queries the model answers correctly on the first attempt.
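A minimal version of the loop; the “CORRECT” check mirrors the termination condition described above, while the prompt wording is illustrative:

```python
import ollama

def llm(p): return ollama.generate(model="qwen3.5:9b", prompt=p)["response"]

def self_reflect(query: str, max_iters: int = 5) -> str:
    draft = llm(query)
    for _ in range(max_iters):
        critique = llm("Critique the answer below for errors. If it is fully "
                       "correct, reply with just 'CORRECT'.\n"
                       "Question: " + query + "\nAnswer: " + draft)
        # Early termination: a short 'CORRECT' verdict ends the loop.
        if "CORRECT" in critique and len(critique.strip()) < 20:
            break
        draft = llm("Question: " + query + "\nPrevious answer: " + draft
                    + "\nCritique: " + critique + "\nWrite an improved answer.")
    return draft
```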
Recommended usage: Creative writing, high-stakes technical explanations, anything where “good enough on the first try” is insufficient.
Irving et al. (2018) proposed debate as a mechanism for improving AI safety. Two agents present opposing arguments, and a judge (either a human or another LLM) evaluates their merits. The underlying premise is that identifying flaws in weak arguments is often easier than constructing strong ones.
DebateAgent conducts multiple rounds of PRO and CON arguments, with a judge evaluating each exchange. Following all rounds, the strongest arguments from both sides are synthesized into a final answer that balances competing perspectives. Context is carried forward between rounds, enabling incremental refinement rather than redundant arguments.
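In outline, the exchange looks like this; the role prompts, round count, and judging prompt below are illustrative, and DebateAgent's actual prompts are richer:

```python
import ollama

def llm(p): return ollama.generate(model="qwen3.5:9b", prompt=p)["response"]

def debate(query: str, rounds: int = 2) -> str:
    transcript = ""
    for r in range(1, rounds + 1):
        # PRO and CON both see the running transcript, enabling rebuttals.
        pro = llm("Argue FOR the position below.\nTopic: " + query
                  + "\nDebate so far:\n" + transcript)
        con = llm("Argue AGAINST the position below.\nTopic: " + query
                  + "\nDebate so far:\n" + transcript)
        transcript += f"Round {r}\nPRO: {pro}\nCON: {con}\n"
    # The judge synthesizes the strongest points from both sides.
    return llm("Topic: " + query + "\nTranscript:\n" + transcript
               + "\nAs judge, weigh both sides and give a balanced final answer.")
```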
Recommended usage: Controversial or ambiguous subjects; policy analysis; ethics; any subject matter requiring a balanced perspective.
The Refinement Loop is similar to self-reflection, but instead of relying on a human-style critique to guide revisions, it uses a machine-based evaluation system with quantifiable quality metrics. These metrics determine whether further refinement is necessary: the loop terminates when a predefined quality threshold is reached (above 0.9 by default) or when the maximum number of iterations is exceeded.
The refinement pipeline consists of five sequential stages, each focused on a distinct type of critique: technical accuracy, structure, depth, examples, and polish. Targeting one dimension of quality per stage ensures the model concentrates on improving that dimension rather than attempting to optimize everything at once.
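A sketch of the stage-gated loop, assuming the 0.9 threshold mentioned above; the scoring prompt, stage names as literals, and score parsing are illustrative simplifications:

```python
import ollama

def llm(p): return ollama.generate(model="qwen3.5:9b", prompt=p)["response"]

STAGES = ["technical accuracy", "structure", "depth", "examples", "polish"]

def refine(query: str, threshold: float = 0.9, max_iters: int = 3) -> str:
    draft = llm(query)
    for stage in STAGES:
        for _ in range(max_iters):
            # Machine evaluation: ask for a numeric quality score for this stage.
            raw = llm("Score the text below for " + stage + " on a scale of "
                      "0.0 to 1.0. Reply with the number only.\n" + draft)
            try:
                score = float(raw.strip().split()[0])
            except (ValueError, IndexError):
                score = 0.0
            if score > threshold:
                break  # this dimension passes the quality gate; move on
            draft = llm("Improve the " + stage + " of this text:\n" + draft)
    return draft
```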
Recommended usage: Highly technical writing; documentation; blog posts; any scenario where production-quality output is required rather than simply a first draft.
Family 4: Cross-Domain and Meta Strategies
Key insight: Cross-domain strategies enable sharing knowledge among disciplines, while meta-strategies automatically route queries to the most appropriate reasoning technique without requiring manual selection.
Gentner’s structure-mapping theory proposes that analogical reasoning operates by identifying structural correspondences across domains, rather than relying on surface-level similarity. The AnalogicalAgent builds on this idea through three phases: (1) identify the underlying structure independent of domain specifics, (2) generate analogous solutions from different domains that share that structure, (3) select the most effective analogy and apply its solution approach.
This process reduces reliance on memorized patterns. By focusing on underlying structure, the model reasons about why a solution works, rather than simply recalling what worked before.
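The three phases translate directly into three prompts; the wording below is illustrative, not AnalogicalAgent's actual prompt text:

```python
import ollama

def llm(p): return ollama.generate(model="qwen3.5:9b", prompt=p)["response"]

def analogical(query: str) -> str:
    # Phase 1: extract the abstract structure, ignoring surface details.
    structure = llm("Describe the abstract structure of this problem, "
                    "ignoring domain specifics:\n" + query)
    # Phase 2: retrieve structurally similar solved problems from other domains.
    analogies = llm("List three solved problems from other domains that share "
                    "this structure:\n" + structure)
    # Phase 3: select the best analogy and map its solution back.
    return llm("Problem: " + query + "\nCandidate analogies:\n" + analogies
               + "\nChoose the most fitting analogy and apply its solution "
               "approach to the original problem.")
```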
Recommended usage: Solving problems that are structurally similar to prior ones, even if they differ superficially; transferring knowledge across domains; explaining complex concepts through analogy.
The Socratic Method: Do not answer the question directly. Instead, ask follow-up questions that reduce ambiguity in the solution space.
SocraticAgent repeatedly asks questions and receives model responses, continuing until it reaches a limit of five question-response exchanges. It then synthesizes the collected information into a final answer. A deduplication or normalization step helps prevent repeated queries that differ only in wording.
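A compact sketch of that exchange loop; the lowercase/strip normalization below is a rough stand-in for the agent's deduplication step:

```python
import ollama

def llm(p): return ollama.generate(model="qwen3.5:9b", prompt=p)["response"]

def socratic(query: str, max_exchanges: int = 5) -> str:
    seen, dialogue = set(), ""
    for _ in range(max_exchanges):
        question = llm("Topic: " + query + "\nDialogue so far:\n" + dialogue
                       + "\nAsk one clarifying follow-up question.").strip()
        key = question.lower().rstrip("?")  # crude normalization for dedup
        if key in seen:
            break
        seen.add(key)
        dialogue += "Q: " + question + "\nA: " + llm(question) + "\n"
    return llm("Topic: " + query + "\nDialogue:\n" + dialogue
               + "\nSynthesize a final answer from this dialogue.")
```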
Recommended usage: Philosophy; ethics; deep technical knowledge; any field requiring the model to “know” something as opposed to merely answering it.
ReAct is a conceptual framework that interweaves reasoning steps with tool invocations, allowing the model to ground its thinking in external information. In practice, the model decides what action to take, calls a tool such as a web search engine, examines the result, updates its reasoning, and repeats the cycle until it reaches a satisfactory answer. Current tools include web scraping, accessing Wikipedia via an API call, and a calculator interface, with mock-ups available for offline execution scenarios.
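A toy version with a single calculator tool shows the shape of the loop; the action syntax and parsing below are illustrative, and the real agent's tool registry is more robust:

```python
import re
import ollama

def llm(p): return ollama.generate(model="qwen3.5:9b", prompt=p)["response"]

def calculator(expr: str) -> str:
    # Restrict eval to arithmetic characters so the tool stays safe.
    if re.fullmatch(r"[0-9+\-*/(). ]+", expr):
        return str(eval(expr))
    return "invalid expression"

TOOLS = {"calculator": calculator}

def react(query: str, max_steps: int = 5) -> str:
    scratchpad = ""
    for _ in range(max_steps):
        step = llm("Question: " + query + "\n" + scratchpad
                   + "Respond with 'Action: calculator: <expression>' "
                   "to compute, or 'Final: <answer>' when done.")
        if step.strip().startswith("Final:"):
            return step.split("Final:", 1)[1].strip()
        m = re.match(r"Action:\s*(\w+):\s*(.+)", step.strip())
        if m:  # run the tool and feed the observation back into the loop
            tool, arg = m.groups()
            obs = TOOLS.get(tool, lambda _: "unknown tool")(arg)
            scratchpad += "Action: " + tool + ": " + arg + "\nObservation: " + obs + "\n"
    return llm("Question: " + query + "\n" + scratchpad + "Give your best final answer.")
```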
ReAct achieved 70.0% accuracy on ARC-Challenge (science reasoning). While not the highest score on this particular benchmark, it demonstrates tool use: the LLM was able to search the Internet for the information it needed.
Recommended usage: Fact-checking; current events queries; mathematical calculations; tasks where access to grounded, external information is important.
Auto Router: MetaReasoningAgent
Key insight: A single LLM invocation allows MetaReasoningAgent to classify each input into one of eleven categories and route it to the most appropriate strategy, without human intervention.
Getting good results from the sixteen strategies depends on picking the right one for the task at hand. MetaReasoningAgent removes that burden by selecting the strategy automatically.
The diagram below shows how each category maps to its corresponding strategy.
MetaReasoningAgent classification diagram
MetaReasoningAgent instantiates the selected strategy class and passes control to it, along with all event objects for visualization.
To use this capability, specify a model such as gemma3:270m+meta or gemma3:270m+auto.
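With the interceptor in place, no special API is needed; the strategy suffix rides along in the model name. A minimal sketch using the standard ollama Python client (the prompt is arbitrary):

```python
import ollama

# '+meta' asks the interceptor to classify the query and pick a strategy;
# a concrete suffix such as '+cot' or '+tot' forces that strategy instead.
resp = ollama.generate(
    model="gemma3:270m+meta",
    prompt="Plan a three-day food tour of Osaka on a modest budget.",
)
print(resp["response"])
```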
In practice, routing is generally intuitive: math problems are directed to CoT, logic puzzles to ToT, philosophical questions to Socratic Questioning, and controversial topics to Adversarial Debate.
The trade-off is reduced control over strategy-specific hyperparameters in exchange for automatic routing aligned with the problem type.
What Strategy Should You Pick? Benchmark Results (March 2026)
Key insight: CoT performs best on average (88.7%) across diverse tasks. ReAct excels when tool use is available (70.0% on ARC-Challenge). ToT and Self-Consistency tie on GSM8K math at 76.7%.
These results are based on 4,200 evaluations across 11 strategies using qwen3.5:9b, collected as of March 2026. All 16 strategies are implemented and production-ready. However, the benchmarks shown below focus on the 11 that produce a single extractable answer. The remaining five are generation-focused and not suited to multiple-choice evaluation.
The heat map and bar chart below provide a complete view of the results.
Benchmark results heatmap and bar chart
The short version: CoT wins on average across diverse tasks. Self-Consistency and ToT beat it on specific math benchmarks. ReAct dominates on factual/science tasks. Self-Reflection and Refinement Loop are not well captured by these benchmarks, as they primarily improve generation quality rather than multiple-choice accuracy.
For most queries, start with +cot. If you’re solving logic puzzles or planning problems, try +tot. If you need factually grounded responses, use +react. If you need polished, high-quality output rather than a quick answer, use +refinement. When in doubt, +meta will route the query automatically.
In my experience building agent-reasoning, the most surprising finding is how much prompt structure alone can improve performance. For example, qwen3.5:9b improves from 81.3% to 88.7% average accuracy simply by prompting it to produce numbered reasoning steps.
As of March 2026, all 16 strategies are production-ready, and the 11 answer-producing strategies have been evaluated across 4,200 benchmark runs.
You can find the repository here. Install with pip install agent-reasoning or uv add agent-reasoning. The commands to get started:
Getting started commands
The TUI provides a 16-agent sidebar, live streaming, and a step-through debugger. Arena mode runs all 16 agents simultaneously on the same query in a 4×4 grid.
If this is useful, a GitHub star is always appreciated.
Frequently Asked Questions
Do I need to modify my existing code to use agent-reasoning?
No. The interceptor is a drop-in replacement for the Ollama client. Just change the model name string by appending +strategy (e.g., gemma3:270m+cot) and the interceptor handles everything else. Existing LangChain pipelines, web UIs, and scripts work without any other changes.
Which strategy should I start with?
Start with +cot (Chain of Thought). It scored the highest average accuracy (88.7%) across our benchmarks and adds minimal latency. If you are unsure, use +meta and let the auto-router pick the best strategy for you.
Why were only 11 of the 16 strategies benchmarked?
The benchmarks (GSM8K, MMLU, ARC-Challenge) measure multiple-choice accuracy, which works well for strategies that produce a single extractable answer. The remaining five strategies are generation-focused (e.g., Refinement Loop, MCTS) and their strengths in output quality are not captured by multiple-choice evaluations. All 16 strategies are fully implemented and production-ready.
Can I use this with models other than Ollama-served models?
Currently the interceptor targets the Ollama API. Since it exposes the same .generate() and .chat() endpoints, any Ollama-compatible client works out of the box. Support for additional inference backends is on the roadmap.
How much slower are branching strategies compared to CoT?
Tree of Thoughts (ToT) is roughly five to eight times slower than CoT because it generates and evaluates multiple candidate branches at each level. Self-Consistency (k=5 samples) adds similar overhead. For latency-sensitive applications, stick with sequential strategies (CoT, Least-to-Most) and reserve branching strategies for problems where accuracy matters more than speed.
I’m Nacho Martinez, Data Scientist at Oracle. I build open-source AI projects and write about making language models reason better. Find me on GitHub and LinkedIn, or visit the Oracle AI Developer page for more resources.