Sr. Content Developer at Microsoft, working remotely in PA, TechBash conference organizer, former Microsoft MVP, Husband, Dad and Geek.

Cordova Plugin InAppBrowser 6.0.1 With Fix For CVE-2026-47430 Released!

1 Share

We are happy to announce that we have just released an update to cordova-plugin-inappbrowser!

Release Highlights

This is a small patch release that addresses the recently published vulnerability CVE-2026-47430: Cordova Plugin InAppBrowser: iOS: Arbitrary Cordova callback IDs can be dispatched without validation from InAppBrowser WebViews. Full details:

Severity: important

Affected versions:

  • Cordova Plugin InAppBrowser (cordova-plugin-inappbrowser) 3.1.0 through 6.0.0

Description:

Summary

The iOS implementation of cordova-plugin-inappbrowser passes the id field from a WKScriptMessage body to commandDelegate sendPluginResult:callbackId: with no format validation (CDVWKInAppBrowser.m:560–574). Any web content loaded inside the InAppBrowser can fire any pending Cordova callback in the host app by posting a message whose id field is a guessable or enumerated callback identifier. An attack abusing this weakness must be tailored to the specific plugins and callback IDs the host app uses, though an attacker with knowledge of common Cordova plugin configurations could craft reusable payloads targeting widely adopted plugins.

Impact

An unauthenticated remote attacker who controls content displayed in the InAppBrowser — via a URL the app opens (OAuth redirect, marketing link, deep-link target) or a network interception — can call window.webkit.messageHandlers.cordova_iab.postMessage({id: '<victim-callback-id>', d: '...'}) to fire callbacks belonging to any other installed Cordova plugin (Camera, Contacts, File, Geolocation). Cordova callback IDs follow the predictable format <PluginName><sequential-integer>, making enumeration feasible. Successful exploitation allows the attacker to spoof plugin results across trust boundaries — for example, injecting a forged camera approval, a fabricated contacts list, or a crafted file-read response.
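To make the idea of "check callbackId with regex" concrete, here is a minimal Python sketch of that kind of validation. The real fix is in the plugin's Objective-C code and its regex is not shown in this announcement, so the pattern, the function names, and the `message` / `send_plugin_result` stand-ins below are all assumptions for illustration.

```python
import re

# Assumed pattern: accept only IDs that belong to the InAppBrowser plugin
# itself ("InAppBrowser" followed by digits). The real patch may differ.
CALLBACK_ID_PATTERN = re.compile(r"InAppBrowser[0-9]{1,10}")


def is_valid_callback_id(callback_id):
    """Return True only for strings that fully match the assumed pattern."""
    return (
        isinstance(callback_id, str)
        and CALLBACK_ID_PATTERN.fullmatch(callback_id) is not None
    )


def dispatch(message, send_plugin_result):
    """Forward a script message to the host only if its id passes validation.

    `message` stands in for the WKScriptMessage body and `send_plugin_result`
    for the host-side callback dispatch; both are hypothetical stand-ins.
    """
    callback_id = message.get("id")
    if not is_valid_callback_id(callback_id):
        return False  # drop messages aimed at other plugins' callbacks
    send_plugin_result(callback_id, message.get("d"))
    return True
```

With a check like this in place, a message carrying an ID such as `Camera12345` is dropped before it reaches the callback dispatcher, while the plugin's own IDs still go through.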

This issue affects Cordova Plugin InAppBrowser: from 3.1.0 through 6.0.0.

Users are recommended to upgrade to version 6.0.1, which fixes the issue.

References:

https://www.cve.org/CVERecord?id=CVE-2026-47430

Please report any issues you find on GitHub!

Changes include:

  • GH-1152 fix(ios): check callbackId with regex
  • GH-1095 chore: gh-action workflow, license header formatting & cleanups
  • Fix npm audit issues
Read the whole story
alvinashcraft
just a second ago
reply
Pennsylvania, USA
Share this story
Delete

Customize Your Claude Code Status Line to Manage Token Burn


Customize the Claude Code status line to track context window usage, session limits, and weekly quotas so you stop burning tokens you don’t need.

Read the full article (24 minutes reading time): Customize Your Claude Code Status Line to Manage Token Burn.


The Cratis Philosophy: Simplicity Through Layers

Come for the simplicity, stay for the capabilities

That’s the tagline I’ve settled on for Cratis. But to understand why, we need to go back to where this all started.

The burden of helping other developers

Back in 1994, I started in games development, working mostly on the 2D/3D engine and tooling side of things. That’s where I found my love and lot in life: creating things that help other d...

Creating .NET apps takes forever - then we tried AI


We gave Claude a vague prompt to build a .NET app - it struggled. Then we added technical documents. The results were dramatically better. Here's what we learned.

The page Creating .NET apps takes forever - then we tried AI appeared on Round The Code.


Moving Indexes To A New Filegroup: Microsoft Still Hates You


Moving Indexes To A New Filegroup, Or: Microsoft Still Hates You


At some point you’re going to want to move some indexes to a new filegroup. Maybe you’re separating data across storage, maybe you’re cleaning up after someone who put everything on PRIMARY and walked away, maybe you’ve got your reasons and they’re none of my business.

Whatever the cause, you’d think this would be a solved problem in a database that’s been around since the Clinton administration.

It is not.

How bad it gets depends on what you’re moving. Let’s go from least painful to most painful, because the pain here is instructive.

Normal Indexes


We’ll define normal as an index that isn’t carrying gobs of LOB data around with it. If that’s what you’ve got, life is easier. Not simple, but easier.

The part that surprises people: you cannot just rebuild the index onto the new filegroup. There is no ALTER INDEX REBUILD WITH (MOVE_THIS_SOMEWHERE_USEFUL = ON). That would be too goddamned easy, and we don’t do easy here (unless it’s a RECOMPILE hint).

What you have to do is fully script the index out. Keys, includes, uniqueness, filters, and any particular settings it was created with.

Then you recreate it on the new filegroup with DROP_EXISTING turned on.

CREATE UNIQUE NONCLUSTERED INDEX 
    whatever
ON dbo.SomeTable
(
    column_one,
    column_two
)
INCLUDE
(
    column_three
)
WHERE column_one > 0
WITH
(
    DROP_EXISTING = ON,
    ONLINE = ON
    /* plus any other options you want, such as DATA_COMPRESSION = PAGE */
)
ON [NewFileGroup];

Yes, you can make the new FG the default so that you don’t have to worry much about including it on every script, but who knows?

Maybe you created more than one new FG. You’re weird out there. I know you.

Miss a column in the include list, fumble the filter predicate, forget it was unique, and you’ve now changed the index instead of just moving it.

The work isn’t hard, exactly. Plenty of stored procedures and code examples exist to script out all your indexes.

It’s just tedious and unforgiving, which is its own kind of hard. Like a Cormac McCarthy book.

Heaps


If you’ve got heaps, your life is about to get worse.

What sucks is that I typed that and then realized it sounds like something an LLM would say.

Ah, screw it.

You can’t rebuild a heap onto a new filegroup, because there’s no index to rebuild. The data is just sitting there in a pile.

To move it, you have to put a clustered index on the table, which physically relocates the rows to wherever that clustered index lives.

If you’ve been meaning to fix those heaps anyway, congratulations, you get a small hit of satisfaction here. Build the clustered index, leave it, move on with a slightly better schema than you started with.

But if the table is supposed to be a heap, you’ve now got to script out dropping the clustered index you just created. Which turns it back into a heap on the new filegroup. So the move costs you a create and a drop for something that was never supposed to have an index in the first place.

LOB Data


Now we get to the part where I want someone at Microsoft to do this process.

Once. Just once.

There are products that should be experienced by the people who make them, and this is one of them. I get the sense that it often isn’t.

This applies to clustered tables with LOB columns, and it applies to your heaps with LOB data too, because LOB makes everything worse uniformly. Oh, and if you’ve got nonclustered indexes with LOB data in them, well… you, too.

When you do the create-with-DROP_EXISTING dance to move a table, the in-row data moves. The LOB data does not. It just stays where it was, staring at you, refusing to relocate. You can verify this yourself by checking allocation units before and after and watching the LOB_DATA unit sit exactly where it started.

The fix comes from a Kimberly Tripp post that has saved a lot of people a lot of grief over the years (What about moving LOB data?). The trick relies on a quirk of how SQL Server handles partitioning: LOB data physically moves when the object transitions from non-partitioned to partitioned, or from one partition scheme to another. So you make the table partitioned, which forces the LOB data to move, even if you have no actual interest in partitioning anything.

The sequence goes like this:

  1. Create a partition function and a partition scheme.
  2. Apply the scheme to the table by creating the index on it with DROP_EXISTING. That moves the data onto the scheme.
  3. Then create the index AGAIN, this time onto the plain filegroup, with DROP_EXISTING once more, which makes the table non-partitioned again and moves everything, LOB included, onto your target filegroup.

You read that correctly. It takes two index creates with DROP_EXISTING to move LOB data. The table briefly becomes partitioned for no reason other than to trick the engine into picking up the LOB allocation unit and carrying it along.

CREATE PARTITION FUNCTION pf_temp_move (bigint)
    AS RANGE RIGHT
    FOR VALUES (9223372036854775807);

CREATE PARTITION SCHEME ps_temp_move
    AS PARTITION pf_temp_move
    ALL TO ([NewFileGroup]);

/* Move onto the scheme. LOB comes with it. */
CREATE UNIQUE CLUSTERED INDEX 
    whatever
ON dbo.SomeTable 
    (some_bigint_column)
WITH
(
    DROP_EXISTING = ON
)
ON ps_temp_move (some_bigint_column);

/* Move back onto a plain filegroup. Table is no longer partitioned. */
CREATE UNIQUE CLUSTERED INDEX 
    whatever
ON dbo.SomeTable 
    (some_bigint_column)
WITH
(
    DROP_EXISTING = ON
)
ON [NewFileGroup];

And here’s the kicker: If the table has nonclustered indexes on it, both of those moves rebuild every one of them. Onto the scheme, then off the scheme.

You are reading that correctly too. Every nonclustered index gets rebuilt twice.

Say, have you been meaning to clean up some indexes for a while?

Picking A Boundary Value


The partition function needs a boundary point. You want a single boundary that sits higher than any value that exists or will plausibly ever exist in your clustering key, so that everything lands in one partition and nothing actually gets split up. You’re not partitioning for real. You just need the engine to think you are.

If you’re clustered on something with a sane data type, this is easy. Use the maximum value for the type:

  • int: use the int max, 2147483647
  • bigint: use the bigint max, 9223372036854775807
  • uniqueidentifier: use FFFFFFFF-FFFF-FFFF-FFFF-FFFFFFFFFFFF, but read the note below before you trust that
  • date, datetime, datetime2: 99991231 is the standard pick

On the GUID one, be careful. SQL Server does not sort uniqueidentifier values by reading the bytes left to right the way you read them on screen. It sorts on the last group of six bytes first, then works backward through the groups. It’s a genuinely strange ordering and it trips people up constantly. The good news is that the all-F’s GUID still sorts highest no matter how you slice it, because every byte is already maxed out, so FFFFFFFF-FFFF-FFFF-FFFF-FFFFFFFFFFFF is a safe upper boundary. Just don’t go assuming the rest of GUID ordering matches what your eyes tell you.
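If you want to see that ordering rather than take it on faith, here is a small Python sketch that sorts GUIDs the way the paragraph above describes: last group first, then backward through the groups. The exact byte order inside each group is an assumption (it mirrors the comparison order commonly attributed to .NET's SqlGuid), and `sql_guid_sort_key` is a made-up helper name.

```python
import uuid

# Byte positions (in the little-endian layout returned by UUID.bytes_le) from
# most to least significant. The group-level order follows the text above; the
# order inside each group is an assumption borrowed from .NET's SqlGuid.
GUID_BYTE_ORDER = (10, 11, 12, 13, 14, 15, 8, 9, 6, 7, 4, 5, 0, 1, 2, 3)


def sql_guid_sort_key(value):
    """Build a sort key that approximates SQL Server's uniqueidentifier order."""
    raw = uuid.UUID(value).bytes_le
    return tuple(raw[i] for i in GUID_BYTE_ORDER)


guids = [
    "FFFFFFFF-0000-0000-0000-000000000001",  # looks big, but the last group is small
    "00000000-0000-0000-0000-000000000002",  # looks small, but the last group is bigger
    "FFFFFFFF-FFFF-FFFF-FFFF-FFFFFFFFFFFF",  # every byte maxed out
]
ordered = sorted(guids, key=sql_guid_sort_key)
```

In this sketch the GUID that starts with FFFFFFFF sorts first because its last group is the smallest, and the all-F's value lands at the end no matter which byte order you assume.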

If you came in from a heap, you’re free to pick whatever column has the most reasonable data type to build your temporary clustered index on. You’re dropping it afterward anyway, so choose the one that makes the boundary value easy.

Your Online Operation Isn’t As Online As You Think


You set ONLINE = ON, you tell yourself you’re being a responsible adult, and you expect the move to glide along without anybody noticing. Then you watch shoving all that LOB data around generate a shitload of tempdb contention, and suddenly your nice online operation is causing blocking anyway, just through a side door.

So your “online” rebuild is online in the narrow sense that it isn’t holding a long schema lock on the table itself, but it’s lighting up tempdb badly enough that everything else fighting for tempdb pages gets to wait in line behind you. The blocking didn’t go away. It just moved somewhere you weren’t looking. Watch your tempdb allocation page contention while this runs, because that’s where the pain shows up, not on the table you’re moving.

ONLINE Is A Suggestion, Not A Promise


It gets better, by which I mean worse. ONLINE = ON only loosely guarantees that your operation won’t block anything. It is not the iron contract people treat it as.

Kendra Little wrote up a great example of an online rebuild that ran offline and took exclusive locks the whole way through, with no warning and no error (Ugly Bug: SQL Server Online Index Rebuild Sometimes Happens Offline Without Warning).

Her repro used ALTER INDEX REBUILD WITH (ONLINE = ON) on a table that had previously had a LOB column dropped, which leaves the table in a state where the engine falls back to an offline operation and holds X locks the whole way through.

WAIT_AT_LOW_PRIORITY, the thing that’s supposed to be your lord and savior from the schema lock at the end, offered her no protection against those locks. This is the same LOB ghost haunting you from a different room. Maybe under your bed. Maybe in your closet. Maybe in your fridge.

The broader point is the one that matters even when you’re not stepping on that specific bug. Your online index build still has to take its locks, minimal as they’re supposed to be. And if something is already in the way when it goes to take them, your online operation gets blocked. Now it’s sitting there waiting, and everything that shows up behind it gets blocked too, because it’s holding its place in the lock queue while it waits for the lock it needs. One stuck online rebuild turns into a blocking chain, and that chain can sit there for a long, long time while you wonder why a “no downtime” operation took your application down. This also makes you look like an asshole for saying that “I can do this fully online and not cause any blocking, boss”.
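The blocking chain is easier to picture with a toy model. The Python sketch below is not how SQL Server's lock manager is implemented; it assumes a strict first-in, first-out queue and only two lock modes, and the session names are invented. It is just enough to show why one waiting Sch-M request stalls every reader that arrives after it.

```python
from collections import deque


def compatible(requested, held):
    """Simplified rule: Sch-S coexists with Sch-S; Sch-M coexists with nothing."""
    return requested == "Sch-S" and held == "Sch-S"


class LockQueue:
    def __init__(self):
        self.granted = []       # (session, mode) pairs currently holding the lock
        self.waiting = deque()  # FIFO queue of blocked (session, mode) pairs

    def request(self, session, mode):
        """Grant the lock if possible, otherwise queue it. Returns True if granted."""
        blocked_by_holder = any(not compatible(mode, held) for _, held in self.granted)
        # Strict FIFO: anyone already waiting goes first, even if we'd be compatible.
        if blocked_by_holder or self.waiting:
            self.waiting.append((session, mode))
            return False
        self.granted.append((session, mode))
        return True

    def release(self, session):
        """Drop a session's lock, then wake waiters in order until one can't go."""
        self.granted = [(s, m) for s, m in self.granted if s != session]
        while self.waiting:
            _, mode = self.waiting[0]
            if any(not compatible(mode, held) for _, held in self.granted):
                break
            self.granted.append(self.waiting.popleft())


locks = LockQueue()
locks.request("long_report", "Sch-S")     # granted: a slow query is already running
locks.request("online_rebuild", "Sch-M")  # waits behind the report
locks.request("app_query_1", "Sch-S")     # waits behind the rebuild
locks.request("app_query_2", "Sch-S")     # also waits
blocked = [session for session, _ in locks.waiting]
```

Here `blocked` holds the rebuild plus both application queries, even though those queries would have run happily alongside the report if the rebuild had not been queued ahead of them.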

There’s also some version voodoo waiting for you here. Because moving an index to a new filegroup forces you into CREATE INDEX with DROP_EXISTING rather than ALTER INDEX REBUILD, your access to WAIT_AT_LOW_PRIORITY depends on your version.

For ALTER INDEX, that option has been around since SQL Server 2014. For CREATE INDEX, the WAIT_AT_LOW_PRIORITY syntax only showed up in SQL Server 2022, along with Azure SQL Database and Managed Instance.

If you’re on 2019 or earlier and doing a filegroup move, the one saving grace you’d reach for to manage the Sch-M lock at the switch-in isn’t available to you, even though the people doing plain in-place rebuilds have had it for years.
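As a quick pre-flight check, the version rule above fits in a few lines of Python. This is only a sketch of the boxed-product cutoffs named in the text (Azure SQL Database and Managed Instance are not modeled), and `can_wait_at_low_priority` is a hypothetical helper, not anything that ships with SQL Server.

```python
# Major version numbers: SQL Server 2014 = 12, 2016 = 13, 2017 = 14,
# 2019 = 15, 2022 = 16.
MIN_MAJOR_VERSION = {
    "ALTER INDEX": 12,   # WAIT_AT_LOW_PRIORITY available since SQL Server 2014
    "CREATE INDEX": 16,  # WAIT_AT_LOW_PRIORITY available since SQL Server 2022
}


def can_wait_at_low_priority(statement, major_version):
    """Return True if the statement supports WAIT_AT_LOW_PRIORITY on that version."""
    return major_version >= MIN_MAJOR_VERSION[statement]


# A filegroup move on SQL Server 2019 has to use CREATE INDEX, so no luck.
move_on_2019 = can_wait_at_low_priority("CREATE INDEX", 15)
rebuild_on_2019 = can_wait_at_low_priority("ALTER INDEX", 15)
```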

This is exactly the situation I wrote sp_ProtectSession for. If you’re going to kick off one of these moves on a busy server, set yourself up to watch for and deal with the blocking it can cause, rather than finding out from the people whose queries are piling up behind it.

And Then There Are The String Clusterers


I see you out there. Clustered on a string because at the time it seemed like a simple thing to do and nobody was around to stop you.

Now you’ve got extra thinking ahead of you. You need a boundary value that sorts higher than every string already in the column, which means working out how many bytes the column holds and then building a value out of enough z’s, or whatever sorts highest under your collation, to clear the top of your data. REPLICATE is your friend here, padding a character out to the column’s length so your boundary outranks everything.

/* A boundary higher than any value in a varchar(50) clustering key */
DECLARE @boundary varchar(50) = REPLICATE('z', 50);

And even that depends on your collation deciding that ‘z’ sorts above whatever garbage is actually in there. Mixed case, accented characters, and case sensitivity all get a vote. So you don’t just get to pick a max value off a chart like the rest of us. You get to go think about collation sort order.

Go to hell, all of you.

The Short Version


Normal indexes cost you a careful script and one create. Heaps cost you a create and a drop. LOB data costs you a partition function, a partition scheme, two creates, and a rebuild of every nonclustered index twice. And if you clustered on a string, you also get to do collation homework before you can even write the boundary value down.

None of this needed to be this way. But here we are, and the data isn’t going to move itself.

An additional thing to consider: SQL Server has many different build strategies for indexes. It may choose to build indexes on giant tables single-threaded. It may even choose to build all 23 nonclustered indexes on a huge table single-threaded while you’re partitioning on/partitioning off to move LOB data.

The story gets even more tawdry and sordid if you’re using an Availability Group in Synchronous mode. You might see a lot of really nasty pile ups on HADR_SYNC_COMMIT. You do have the option of switching to manual failover and asynchronous commit for a bit, but that’s between you and your RPO goals. If you’re moving a significant amount of data, it may be a long wait.

Get into sports, dummy, as a wise man once wrote on a bathroom wall.


The post Moving Indexes To A New Filegroup: Microsoft Still Hates You appeared first on Darling Data.


The AI Agents Stack (2026 Edition)


The following article originally appeared on Paolo Perrone’s The AI Engineer Substack and is being reposted here with the author’s permission.

Your team picks LangGraph for a customer support chatbot. Three weeks in, you’ve got 14 nodes in a state graph, a custom checkpointer writing to Redis, and retry logic for tool calls that fail once a week. The agent answers refund questions. It calls one API. A 50-line script on the OpenAI SDK with two MCP servers would have done the same thing. But nobody mapped which layers the problem actually needed.

In November 2024, Letta published an AI agents stack diagram that became the default reference for half the engineering teams I talk to. If you’ve seen a “layers of an agent” visual on LinkedIn or pinned in a Slack channel, it probably traces back to that article.

That diagram is 14 months old now, and a lot has changed since. MCP didn’t exist yet. Memory was still treated as a subset of your vector database. Nobody was shipping provider-native agent SDKs. Eval wasn’t even on the map. The stack has six layers in 2026, and at least three of them didn’t exist as distinct categories when Letta drew the original.

So we drew it from scratch. This is the 2026 version.

The minimum viable agent stack in 2026

TL;DR

That’s the starting stack. Add complexity when something specific breaks, not before.

What are we even mapping?

Before the stack, there was a loop. In “What Is an AI Agent?,” we defined an agent as the think-act-observe cycle: The model reasons about a task, takes an action (calls a tool, writes to memory), observes the result, and loops until the task is done. That loop is the atomic unit. Everything in this issue is infrastructure that makes that loop work reliably, at scale, in production.
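The loop itself is small. Here is a minimal Python sketch of think-act-observe under some loud assumptions: `model` is any callable that returns a dict with an `action` key, `scripted_model` and `lookup_order` are invented stand-ins so the example runs without a real LLM or API, and the dict shape is not any particular SDK's format.

```python
def run_agent(task, model, tools, max_steps=10):
    """Run think-act-observe until the model says it is done or steps run out."""
    observations = []
    for _ in range(max_steps):
        # Think: the model picks the next action from the task and what it has seen.
        decision = model(task, observations)
        if decision["action"] == "finish":
            return decision["answer"]
        # Act: call the chosen tool with the model-supplied arguments.
        result = tools[decision["action"]](**decision["args"])
        # Observe: record the result so the next "think" step can use it.
        observations.append((decision["action"], result))
    raise RuntimeError("agent did not finish within max_steps")


# Hypothetical stand-in for a model: look the order up once, then answer.
def scripted_model(task, observations):
    if not observations:
        return {"action": "lookup_order", "args": {"order_id": "A1"}}
    return {"action": "finish", "answer": f"Order status: {observations[-1][1]}"}


tools = {"lookup_order": lambda order_id: "shipped"}
answer = run_agent("Where is order A1?", scripted_model, tools)
```

The `max_steps` cap is a design choice worth keeping even in a toy: a loop that only ends when the model decides it is done needs a hard stop for the day the model never does.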

The agent stack is not the LLM stack. A chatbot needs inference and maybe RAG. An agent needs state management across multistep execution, tool access governed by protocols, memory that persists across sessions, autonomous reasoning loops, and guardrails that constrain behavior in real time. That’s a fundamentally different set of infrastructure problems.

We’re mapping the six layers between your LLM and a production agent. We’re not covering training infrastructure, data pipelines, or model fine-tuning. Those are adjacent stacks. We covered RAG in depth in Issue #5. Today we’re zooming out to show where RAG fits in the bigger picture.

Three things redrew the map between 2024 and 2026. MCP standardized tool connectivity, and the entire tools layer is new because of it. Reasoning models changed what agents can do autonomously, with single-call agents replacing some multistep chains. And memory became a first-class architectural primitive, not an afterthought bolted onto a vector database.

How to evaluate each layer

When choosing tools at each layer, ask three questions. How much state do you need to manage? A stateless tool caller and a multi-session agent that learns over time are different engineering problems, and the layers where state management is hardest (memory, frameworks) are where most teams get stuck. How much vendor lock-in can you tolerate? MCP is an open standard, provider SDKs are not, and every tool choice either increases or decreases how painful your next migration will be. And how hard is it to go from demo to production? Some layers (model serving) have almost no gap, while others (eval, guardrails) have a massive one. The layer where you feel that gap most is the one to invest in first.

We take each layer from the bottom up, starting with the most stable and ending with the least mature.

Layer 1: Models and inference

How you run the model that powers your agent: call an API, use a managed open weight provider, or self-host.

Models & inference: key players

The inference layer changed more in tone than in substance. Reasoning models like o1, o3, DeepSeek R1, and Claude with extended thinking shifted what agents can plan and execute. Agents that previously needed multistep chains can now solve problems in a single reasoning call. Open weight models like Llama 3.3, DeepSeek V3, and Qwen 2.5 closed the quality gap dramatically, so “always use the biggest closed model” is no longer default advice. The emerging pattern is to prototype on closed source and deploy on open weight.

The honest take: This layer is commoditizing. Model differences matter less each quarter. The real decision is the cost and latency trade-off, not which model is “smartest.”

On the evaluation side, API calls are stateless. Send a request, get a response. Nothing to manage. Lock-in risk runs high for closed APIs because each model reasons differently, so switching providers means retuning prompts, adjusting for different failure modes, and retesting your eval suite. It’s low for open weight, where you can swap the model and keep the infra. The prototype-to-production gap is the smallest of any layer. Your demo API call is the same as your production API call.

Self-host when your agent call volume makes API pricing untenable or when you need sub-100ms latency that API round-trips can’t deliver.

Layer 2: Protocols and tools

How your agent calls external tools and APIs: through MCP servers, browser automation, or agent-to-agent protocols.

Protocols & tools: key players

This layer didn’t exist as a distinct category in 2024. Every framework had its own JSON schema for tool definitions. Now MCP is the standard, with 97M monthly SDK downloads, adoption by OpenAI, Google, and Microsoft, and a donation to the Linux Foundation.

Browser Use exploded in parallel, hitting 78K GitHub stars in under a year. Nobody was shipping browser agents in production in 2024. And agents can now talk to other agents. IBM launched ACP, and Google launched A2A. Neither is standard yet, but the problem they solve (agents coordinating with other agents) is real and growing.

Security is the open problem. Endor Labs analyzed 2,614 MCP servers and found 82% prone to path traversal and 67% to code injection.

The honest take: The protocol debate is over. MCP won. The only question left is how you lock down your MCP servers before someone exploits them.

State management is nonexistent here. Your agent calls a tool, gets a response, done. No session, no memory between calls. Lock-in risk is low because MCP is an open standard, so if you build MCP servers, any MCP-compatible agent can use them. The prototype-to-production gap is medium. Your demo MCP server works until someone sends a malicious tool description. Security and governance are the gap.

MCP standardized how agents use tools. It says nothing about how agents talk to each other. ACP and A2A are trying to solve that, but neither has reached critical mass. If you need multi-agent coordination today, you’re building it yourself at the framework layer. We covered MCP in depth in Issue #4.

Layer 3: Memory and knowledge

How your agent stores and retrieves what it knows: in-context state, vector search, or persistent memory across sessions.

Memory & knowledge: key players

All three tiers feed into the same place: The context window your agent sees on every call.

In 2024, memory meant “pick a vector database and do RAG.” In 2026, memory is a first-class architectural primitive with three distinct tiers. Context windows got massive. Gemini hit 1M+ tokens, Claude 200K. Bigger windows didn’t kill the need for memory. They changed the trade-off: What do you stuff in-context versus what do you retrieve on demand?

“Context engineering” replaced “prompt engineering” as the core discipline. Instead of writing a better prompt, you architect what information the agent sees on every call. Memory blocks appeared as named, structured fields in the context window that the agent can read and overwrite every turn. Instead of dumping everything into the system prompt, the agent manages its own state: what to keep, what to update, what to drop.
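A memory block can be sketched as a named field with a size limit that the agent overwrites instead of appending to. The Python below is a rough illustration, not any memory product's API; the class name, the block names, the character limits, and the tag-style rendering are all assumptions.

```python
class MemoryBlocks:
    """Named, size-limited fields that get rendered into the context on every call."""

    def __init__(self, limits):
        self.limits = dict(limits)  # block name -> max characters
        self.blocks = {name: "" for name in limits}

    def write(self, name, text):
        """Overwrite a block; reject unknown names and oversized values."""
        if name not in self.blocks:
            raise KeyError(f"unknown memory block: {name}")
        if len(text) > self.limits[name]:
            raise ValueError(f"{name} exceeds {self.limits[name]} characters")
        self.blocks[name] = text

    def render(self):
        """Build the text section that would be placed in the context window."""
        return "\n".join(
            f"<{name}>\n{text}\n</{name}>" for name, text in self.blocks.items()
        )


memory = MemoryBlocks({"user_profile": 200, "current_task": 200})
memory.write("user_profile", "Prefers refunds to store credit.")
memory.write("current_task", "Resolve ticket 1234.")
memory.write("current_task", "Ticket 1234 resolved; awaiting confirmation.")  # overwrite
context_section = memory.render()
```

The size limit is the point: a block that can only hold so much forces a decision about what to keep, update, or drop on each turn.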

On the infrastructure side, pgvector became the default for teams that don’t need a dedicated vector database. It’s just Postgres with an extension. GraphRAG emerged as a second retrieval option: follow relationships between entities instead of matching embeddings, with Neo4j leading this space. Sleep-time compute, where agents process information during idle time, is research stage but signals where tier 3 is heading.

The honest take: Most teams overcomplicate memory. Start with conversation history in Postgres and a structured system prompt. Add vector search when your history exceeds context limits. Add agentic memory management only when your agent needs to learn across sessions.

This IS the state layer. You’re deciding what your agent remembers, how it retrieves it, and when it forgets. Highest complexity in the stack. Lock-in risk is medium. pgvector is portable because it’s just Postgres, while specialized tools like Mem0 or Zep are harder to migrate away from. The prototype-to-production gap is large. Demo memory works because context windows are big enough. Production memory breaks when conversations get long and your agent starts forgetting the important parts.

In-context memory breaks down when agents need to share memory across instances or maintain state across model provider switches. That’s where dedicated memory infrastructure like Letta, Zep, and Mem0 earns its keep.

Layer 4: Frameworks and SDKs

How you wire together the model calls, tool use, and control flow that make your agent work: a provider’s built-in toolkit (SDK), a graph-based framework like LangGraph, or raw code.

Frameworks & SDKs: key players

Every major AI lab now ships its own agent SDK. OpenAI has the Agents SDK (evolved from Swarm). Google released ADK. Microsoft has Semantic Kernel and AutoGen. Hugging Face built smolagents. Two years ago, LangChain was the only game. Now you pick between three camps: provider SDKs that are fast to start but locked to one model, graph-based frameworks like LangGraph that are portable but require more setup, or no framework at all. That choice didn’t exist in 2024.

LangGraph solidified as the graph-based orchestration leader with v1.0 released October 2025 and production deployments at Uber, JPMorgan, LinkedIn, and Klarna. LangChain agents are now built on LangGraph under the hood. Meanwhile, the “build it yourself” camp grew. Teams that tried LangChain in 2024 and fought the abstraction are now writing thin wrappers over provider APIs + MCP. No framework means full control. This works until your agent needs state management or complex branching.

A quick note on naming: “LangChain” and “LangGraph” are not the same thing. LangChain is the integration layer handling model connectors, tool calling, and prompt templates. LangGraph is the orchestration engine managing state, control flow, and graphs. Most production teams use both together, but LangGraph is where the agent logic lives.

The honest take: Most teams pick too much framework. If your agent calls a model and a few tools, you don’t need LangGraph. A provider SDK and a couple of tool calls will get you to production faster than any graph.

Provider SDKs manage state for you. LangGraph makes you define every state transition explicitly. Build-it-yourself means you roll your own. Lock-in risk is the highest in the stack. Your orchestration code doesn’t port. A LangGraph agent rewritten for CrewAI is a new codebase. Provider SDKs are worse because you’re locked to one model too. The prototype-to-production gap is large. Demo works because nothing goes wrong. Production means handling tool failures, retries, timeouts, and humans who need to approve before the agent acts.

The framework you pick determines your migration cost. Provider SDKs are fastest to start but lock you to one model. LangGraph is portable but complex. Building your own gives you full control until your agent outgrows your wrapper. MCP is the one layer that transfers across all three camps.

Layer 5: Eval and observability

How you measure whether your agent is doing its job: tracing runs, scoring outputs, and catching regressions before users do.

Eval & observability: key players

This layer barely existed in 2024. Now it’s the gap. LangChain’s State of Agent Engineering survey found 89% of teams with production agents have implemented observability, but only 52% have evals. That 37-point gap is where production quality dies.

“Evaluation as infrastructure” is converging on three tiers: fast checks on every PR (Did the agent call the right tools?), nightly regression suites that use an LLM to judge output quality, and continuous production monitoring that alerts when agent performance drifts. New agent-specific benchmarks have emerged too, including Context-Bench for memory management, Recovery-Bench for error recovery, and Terminal-Bench for coding agents.

The honest take: Most teams skip eval until something breaks in production. By then they’re debugging blind. The teams that don’t have this problem built evals before they deployed.

State management matters here because your agent runs 12 steps, step 3 picked the wrong tool, and steps 4–12 were doomed from there. If your eval only checks the final output, you’ll never know why. Lock-in risk is moderate. Most tools export OpenTelemetry traces, so switching observability providers is doable, but switching eval frameworks means rebuilding your test suites. The prototype-to-production gap is the biggest of any layer. Most prototypes have zero eval. You don’t feel the pain until production users find the failures for you.
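A step-level check is the cheapest version of this. The Python sketch below compares a recorded trace against the tool sequence you expected and reports the first step that diverged. The trace format (a list of dicts with a `tool` key) and the tool names are assumptions for illustration, not the export format of any observability product.

```python
def first_wrong_step(trace, expected_tools):
    """Return the 1-based index of the first step whose tool differs, or None."""
    for index, (step, expected) in enumerate(zip(trace, expected_tools), start=1):
        if step["tool"] != expected:
            return index
    # Same prefix but different lengths: the first missing or extra step is wrong.
    if len(trace) != len(expected_tools):
        return min(len(trace), len(expected_tools)) + 1
    return None


# Hypothetical trace: the agent searched the web where it should have read the order.
trace = [
    {"tool": "classify_request"},
    {"tool": "lookup_customer"},
    {"tool": "web_search"},
    {"tool": "draft_reply"},
]
expected = ["classify_request", "lookup_customer", "lookup_order", "draft_reply"]
wrong_step = first_wrong_step(trace, expected)
```

A check like this is fast enough to run on every PR, and it points at step 3 directly instead of leaving you to work backward from a bad final answer.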

Current eval tools are strongest for single-turn and tool-calling evaluation. Multi-agent evaluation, long-horizon task assessment, and evaluating agents that learn over time are all unsolved problems. If your agent does any of those, you’ll need custom eval infrastructure beyond what the platforms offer today.

Layer 6: Guardrails and safety

How you stop your agent from doing things it shouldn’t: filtering inputs, authorizing tool calls, and validating outputs.

Guardrails & safety: key players

Agent guardrails became a separate discipline from LLM guardrails. In 2024, guardrails meant input/output filters on a model. In 2026, your agent calls tools, spends money, and takes actions. Guardrails now means authorizing tool calls, enforcing rate limits, and validating what the agent actually did.

The “guardrails before action” pattern emerged from teams that learned the hard way. They now enforce authorization at the tool execution layer, not the output layer. By the time you filter the response, the agent already sent the email. OWASP published the MCP Top 10 (beta), which is the first real security checklist for tool-connected agents. Deployment is still DIY. LangGraph Cloud and Bedrock Agents exist, but most production teams are still deploying with FastAPI and their own infra. This layer is where you’ll spend the most unplanned engineering time.
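In code, “guardrails before action” means the policy check wraps the tool itself. The sketch below is illustrative only: `guarded`, `ToolDenied`, `send_email`, and the `internal_only` policy are hypothetical names, and the `example.com` allowlist is a stand-in for whatever rule your app needs.

```python
class ToolDenied(Exception):
    """Raised when a policy refuses a tool call before it runs."""


def guarded(tool_fn, policy):
    """Wrap a tool so the policy runs before the side effect, not after."""
    def wrapper(**kwargs):
        allowed, reason = policy(tool_fn.__name__, kwargs)
        if not allowed:
            raise ToolDenied(reason)
        return tool_fn(**kwargs)
    return wrapper


sent = []  # stands in for a real mail system


def send_email(to, body):
    sent.append((to, body))
    return "sent"


def internal_only(tool_name, args):
    """Example policy: email may only go to the (assumed) internal domain."""
    if tool_name == "send_email" and not args["to"].endswith("@example.com"):
        return False, f"{tool_name}: external recipient {args['to']!r} not allowed"
    return True, "ok"


safe_send_email = guarded(send_email, internal_only)
```

Hand the agent `safe_send_email`, never `send_email`. A denied call raises before anything is appended to `sent`, which is the whole point: there is no response to filter because the action never happened.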

The honest take: This is the least mature layer in the stack. No dominant framework, no established patterns. You’re writing policy code from scratch.

Guardrails need to know what the agent is doing right now to decide what it shouldn’t do next. That means tracking agent state in real time. Lock-in risk is low because most guardrails are custom policy code you write yourself. NeMo Guardrails is the closest thing to a framework, but you’ll still write most rules from scratch. The prototype-to-production gap is effectively infinite. Your demo has no guardrails because nobody’s trying to break it. Production will.
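A stateful guardrail can be as small as a per-session counter. This is a rough sketch under assumed names (`SessionGuard`, `spend_limit`, `max_calls_per_tool`), not something a framework provides. The decision depends on what the agent has already done in this session, not just on the current call.

```python
from collections import Counter


class SessionGuard:
    """Track spend and call counts for one agent session."""

    def __init__(self, spend_limit, max_calls_per_tool):
        self.spend_limit = spend_limit
        self.max_calls_per_tool = max_calls_per_tool
        self.spent = 0.0
        self.calls = Counter()

    def authorize(self, tool, cost=0.0):
        """Return True and record the call if it fits the remaining budget."""
        if self.calls[tool] >= self.max_calls_per_tool:
            return False                  # rate limit for this tool reached
        if self.spent + cost > self.spend_limit:
            return False                  # would exceed the session budget
        self.calls[tool] += 1
        self.spent += cost
        return True
```

Denied calls are not counted, so a refused oversized refund does not burn one of the agent's remaining attempts. Whether that is the right choice is a policy decision you will have to make yourself.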

Current guardrails tools focus on single-agent systems. If you’re running multi-agent workflows where agents delegate to each other, guardrail propagation across agent boundaries is an unsolved problem. You’ll need custom authorization logic.

What are you building?

This is the decision that cuts through the framework confusion. The agent type determines which layers you invest in and which tools to pick at each one.

A stateless tool caller answers questions from a knowledge base, looks up an order, or checks inventory. You need a provider SDK, MCP, and Postgres. No framework, no vector database. This is a weekend project.

A multistep workflow processes a refund end to end, reviews a PR across five files, or triages and routes support tickets. Steps depend on each other, things fail in the middle, and humans need to approve before the agent acts. You need LangGraph, MCP, and eval. Build evals before you deploy because these agents break silently.

An agent that learns remembers your preferences across sessions, gets better at your codebase over time, or tracks project context across weeks. You need a memory-first architecture, a vector DB, and eval. Orchestration is the easy part. The hard part is deciding what to remember, what gets dropped, and how you stop old context from polluting new answers.

A multi-agent system has agents that delegate to other agents, split a research task across specialists, or run parallel workstreams. You need the full stack. Two agents passing context to each other is already hard to debug. Five is impossible without trace-level evals on every handoff. Build eval infrastructure before you build the second agent.
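The four agent types above reduce to a lookup. The dictionary keys and the `starting_stack` helper are made-up names for illustration. The values restate the recommendations in the text, with “the full stack” spelled out as all six layers.

```python
ALL_LAYERS = ["inference", "protocols", "memory", "framework", "eval", "guardrails"]

# Keys are hypothetical labels for the four agent types described above.
STARTING_STACK = {
    "stateless_tool_caller": ["provider SDK", "MCP", "Postgres"],
    "multistep_workflow": ["LangGraph", "MCP", "eval"],
    "learning_agent": ["memory-first architecture", "vector DB", "eval"],
    "multi_agent_system": ALL_LAYERS,
}


def starting_stack(agent_type):
    """Return the suggested starting stack for an agent type."""
    try:
        return STARTING_STACK[agent_type]
    except KeyError:
        raise ValueError(f"unknown agent type: {agent_type!r}") from None
```

If you can't say which key your agent is, that's the decision to make before picking any tool.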

Pick your stack

Coding agents: All 6 layers in action

Coding agents like Cursor, Claude Code, Codex, and Windsurf are the most proven application of the AI agents stack. All six layers, working together.

At the inference layer, these tools serve hundreds of millions of daily requests. Cursor routes between Claude, GPT-4, and its own fine-tuned models depending on the task. At the protocols layer, MCP servers connect to editors, terminals, filesystems, and Git, which is how the agent reads your code and runs commands. The memory layer uses codebase-aware retrieval with reranking. The agent doesn’t read your whole repo. It retrieves the files that matter for this specific edit.
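Retrieval with reranking is a two-stage pattern: a cheap pass that finds candidates, then a more careful pass that orders them. The toy sketch below shows only the shape of that pattern. It uses word overlap and Jaccard similarity as stand-in scores, which is an assumption for illustration and not how Cursor or any other product scores files.

```python
def tokenize(text):
    """Lowercased set of whitespace-separated words."""
    return set(text.lower().split())


def retrieve(query, files, k=3):
    """Stage 1: cheap recall. Keep up to k files sharing the most words with the query."""
    q = tokenize(query)
    scored = [(len(q & tokenize(body)), path) for path, body in files.items()]
    scored = [(score, path) for score, path in scored if score > 0]
    scored.sort(key=lambda sp: (-sp[0], sp[1]))
    return [path for _, path in scored[:k]]


def rerank(query, candidates, files):
    """Stage 2: reorder candidates by Jaccard similarity, which penalizes unrelated content."""
    q = tokenize(query)

    def score(path):
        body = tokenize(files[path])
        return len(q & body) / len(q | body)

    return sorted(candidates, key=lambda p: (-score(p), p))
```

In a real system the first stage would typically be an index lookup and the second a learned model. The split matters because the expensive scorer only ever sees a handful of candidates.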

At the framework layer, these are custom orchestration systems with RL loops. Not LangGraph, not a provider SDK. Purpose-built control flow for code generation, review, and iteration. At the eval layer, Cursor retrains its acceptance-rate model every 90 minutes based on whether users accept or reject suggestions. That’s eval running in production, continuously. And at the guardrails layer, sandboxed execution prevents runaway agents. The agent can write code and run it, but inside a container that limits what it can touch.

The AI agent stack cheat sheet

Every layer scored on the three questions from the evaluation framework: How much state do you need to manage? How much vendor lock-in can you tolerate? And how hard is it to go from demo to production?

The agent stack cheat sheet

The bigger picture

Most teams are building like it’s still 2024. They pick LangGraph before they know if they need state. They add a vector database before they’ve outgrown Postgres. They design multi-agent architectures before they’ve shipped one agent that works. The decision flowchart above exists because a tool-calling chatbot and a multi-agent research system share almost no infrastructure. Treat them the same and you’ll overbuild the first and underbuild the second.

The teams that got past this run evals on every deploy, not once a quarter. Their guardrails sit at the tool call layer, not the output layer. Their memory architecture was designed, not inherited from whatever the framework defaulted to. Most teams ship the opposite: no evals, output-only filtering, and a system prompt that grows until the context window chokes. The gap isn’t talent or budget. It’s knowing which layers matter for your specific agent instead of half-building all six.

The stack is going to collapse. Provider SDKs are already absorbing memory, tool calling, and basic eval into a single API. By early 2027, most teams won’t build each layer separately. They’ll get an increasingly opinionated stack from their model provider and that will be fine for 80% of use cases. The other 20%, agents at scale where the defaults break, will still build custom at every layer. But even then, when something fails in production, you need to know which layer failed. That’s what this article is for.

Sources

  1. “The AI Agents Stack,” Letta, November 2024.
  2. “Donating the Model Context Protocol and Establishing the Agentic AI Foundation,” Anthropic, December 2025.
  3. “120+ Agentic AI Tools Mapped Across 11 Categories [2026],” StackOne, February 2026.
  4. Henrik Plate and Darren Meyer, Dependency Management Report, Endor Labs, January 2026.
  5. Jason Liu, Context Engineering Series: Building Better Agentic RAG Systems, August 2025.
  6. “LangChain and LangGraph Agent Frameworks Reach v1.0 Milestones,” LangChain, October 2025.
  7. State of Agent Engineering, LangChain, December 2025.
  8. Yunfei Bai, Allie Colin, Kashif Imran, and Winnie Xiong, “Evaluating AI Agents: Real-World Lessons from Building Agentic Systems at Amazon,” Amazon, February 2026.
  9. OWASP MCP Top 10, OWASP.

