Papers on agentic and multi-agent systems (MAS) skyrocketed from 820 in 2024 to over 2,500 in 2025. This surge suggests that MAS are now a primary focus for the world’s top research labs and universities. Yet there is a disconnect: While research is booming, these systems still frequently fail when they hit production. Most teams instinctively try to fix these failures with better prompts. I use the term prompting fallacy to describe the belief that model and prompt tweaks alone can fix systemic coordination failures. You can’t prompt your way out of a system-level failure. If your agents are consistently underperforming, the issue likely isn’t the wording of the instruction; it’s the architecture of the collaboration.
Some coordination patterns stabilize systems. Others amplify failure. There is no universal best pattern, only patterns that fit the task and the way information needs to flow. The following provides a quick orientation to common collaboration patterns and when they tend to work well.
A linear, supervisor-based architecture is the most common starting point. One central agent plans, delegates work, and decides when the task is done. This setup can be effective for tightly scoped, sequential reasoning problems, such as financial analysis, compliance checks, or step-by-step decision pipelines. The strength of this pattern is control. The weakness is that every decision becomes a bottleneck. As soon as tasks become exploratory or creative, that same supervisor often becomes the point of failure. Latency increases. Context windows fill up. The system starts to overthink simple decisions because everything must pass through a single cognitive bottleneck.
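To make the bottleneck concrete, here is a minimal sketch of a supervisor loop in plain Python. The `call_model` helper and the worker names are hypothetical stand-ins rather than any specific framework's API; the point is simply that every plan, delegation, and stop decision passes through a single agent.

```python
# Minimal supervisor loop: one agent plans, delegates, and decides when to stop.
# `call_model` is a hypothetical placeholder for whatever model client you use.

def call_model(role: str, prompt: str) -> str:
    """Placeholder for an LLM call; returns the model's text response."""
    raise NotImplementedError("wire this to your model provider")

WORKERS = ["researcher", "analyst", "writer"]  # hypothetical specialists

def supervise(task: str, max_steps: int = 10) -> str:
    history: list[str] = []
    for _ in range(max_steps):
        # Every decision flows through the supervisor: the bottleneck by design.
        decision = call_model(
            "supervisor",
            f"Task: {task}\nProgress so far: {history}\n"
            f"Reply 'DONE: <answer>' or delegate as '<worker>: <subtask>' "
            f"using one of {WORKERS}.",
        )
        if decision.startswith("DONE:"):
            return decision.removeprefix("DONE:").strip()
        worker, _, subtask = decision.partition(":")
        history.append(f"{worker.strip()} -> {call_model(worker.strip(), subtask.strip())}")
    return "Stopped: step budget exhausted"
```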
In creative settings, a blackboard-style architecture with shared memory often works better. Instead of routing every thought through a manager, multiple specialists contribute partial solutions into a shared workspace. Other agents critique, refine, or build on those contributions. The system improves through accumulation rather than command. This mirrors how real creative teams work: Ideas are externalized, challenged, and iterated on collectively.
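Here is a comparable sketch of the blackboard style, again with a hypothetical `call_model` placeholder and illustrative role names: specialists read the shared workspace and append contributions instead of reporting to a manager.

```python
# Blackboard sketch: specialists contribute to a shared workspace; the solution
# accumulates on the board rather than flowing through a single controller.

def call_model(role: str, prompt: str) -> str:
    """Hypothetical placeholder for an LLM call."""
    raise NotImplementedError("wire this to your model provider")

def blackboard(task: str, rounds: int = 3) -> list[str]:
    board = [f"TASK: {task}"]
    specialists = ["ideator", "critic", "refiner"]  # illustrative roles
    for _ in range(rounds):
        for role in specialists:
            # Each specialist sees the whole board and adds one contribution.
            contribution = call_model(
                role,
                "Shared workspace so far:\n" + "\n".join(board)
                + f"\nAs the {role}, add one contribution that builds on or "
                "challenges what is already there.",
            )
            board.append(f"[{role}] {contribution}")
    return board
```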
In peer-to-peer collaboration, agents exchange information directly without a central controller. This can work well for dynamic tasks like web navigation, exploration, or multistep discovery, where the goal is to cover ground rather than converge quickly. The risk is drift. Without some form of aggregation or validation, the system can fragment or loop. In practice, this peer-to-peer style often shows up as swarms.
Swarms work well in tasks like web research because the goal is coverage, not immediate convergence. Multiple agents explore sources in parallel, follow different leads, and surface findings independently. Redundancy is not a bug here; it’s a feature. Overlap helps validate signals, while divergence helps avoid blind spots. In creative writing, swarms are also effective. One agent proposes narrative directions, another experiments with tone, a third rewrites structure, and a fourth critiques clarity. Ideas collide, merge, and evolve. The system behaves less like a pipeline and more like a writers’ room.
The key risk with swarms is that they generate volume faster than they generate decisions, which can also lead to token burn in production. Consider strict exit conditions to prevent exploding costs. Also, without a later aggregation step, swarms can drift, loop, or overwhelm downstream components. That’s why they work best when paired with a concrete consolidation phase, not as a standalone pattern.
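A minimal sketch of that pairing, assuming the same hypothetical `call_model` placeholder: explorers fan out in parallel under a hard budget, and a single consolidation step turns the accumulated volume into a decision.

```python
# Swarm with exit conditions: parallel explorers, a hard budget, then one
# consolidation step. The budget numbers are illustrative, not recommendations.
from concurrent.futures import ThreadPoolExecutor

def call_model(role: str, prompt: str) -> str:
    """Hypothetical placeholder for an LLM call."""
    raise NotImplementedError("wire this to your model provider")

def swarm_research(question: str, n_agents: int = 4, max_chars: int = 20_000) -> str:
    leads = [f"Explore angle {i + 1} of: {question}" for i in range(n_agents)]
    with ThreadPoolExecutor(max_workers=n_agents) as pool:
        findings = list(pool.map(lambda lead: call_model("explorer", lead), leads))

    # Exit condition: cap the raw material before it overwhelms downstream steps.
    raw = "\n\n".join(findings)[:max_chars]

    # Consolidation phase: overlap validates signals, divergence surfaces blind
    # spots, and a single aggregator turns volume into a decision.
    return call_model(
        "aggregator",
        f"Question: {question}\nFindings:\n{raw}\n"
        "Merge overlapping findings, flag contradictions, and answer.",
    )
```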
Considering all of this, many production systems benefit from hybrid patterns. A small number of fast specialists operate in parallel, while a slower, more deliberate agent periodically aggregates results, checks assumptions, and decides whether the system should continue or stop. This balances throughput with stability and keeps errors from compounding unchecked. This is why I teach this agents-as-teams mindset throughout AI Agents: The Definitive Guide, because most production failures are coordination problems long before they are model problems.
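Sketched in the same hypothetical style, the hybrid pattern looks like this: fast specialists run in parallel each round, while a slower reviewer aggregates, checks assumptions, and decides whether the loop should continue.

```python
# Hybrid sketch: parallel fast specialists plus a periodic deliberate reviewer
# that controls the loop, so errors are caught each round instead of compounding.
from concurrent.futures import ThreadPoolExecutor

def call_model(role: str, prompt: str) -> str:
    """Hypothetical placeholder for an LLM call."""
    raise NotImplementedError("wire this to your model provider")

def hybrid_loop(task: str, max_rounds: int = 5) -> str:
    notes: list[str] = []
    specialists = ["fast_searcher", "fast_drafter", "fast_checker"]  # illustrative
    for round_no in range(1, max_rounds + 1):
        with ThreadPoolExecutor(max_workers=len(specialists)) as pool:
            results = list(pool.map(
                lambda role: call_model(role, f"Task: {task}\nNotes so far: {notes}"),
                specialists,
            ))
        notes.extend(results)

        # The deliberate agent aggregates, checks assumptions, and decides
        # whether the system should continue or stop.
        verdict = call_model(
            "deliberate_reviewer",
            f"Task: {task}\nRound {round_no} notes:\n" + "\n".join(notes)
            + "\nReply 'CONTINUE' or 'FINAL: <answer>'.",
        )
        if verdict.startswith("FINAL:"):
            return verdict.removeprefix("FINAL:").strip()
    return "Stopped: round budget exhausted"
```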
If you think more deeply about this team analogy, you quickly realize that creative teams don’t run like research labs. They don’t route every thought through a single manager. They iterate, discuss, critique, and converge. Research labs, on the other hand, don’t operate like creative studios. They prioritize reproducibility, controlled assumptions, and tightly scoped analysis. They benefit from structure, not freeform brainstorming loops. This is why it’s no surprise when systems fail: Apply one default agent topology to every problem, and the system can’t perform at its full potential. Most failures attributed to “bad prompts” are actually mismatches between task, coordination pattern, information flow, and model architecture.
I design AI agents the same way I think about building a team. Each agent has a skill profile, strengths, blind spots, and an appropriate role. The system only works when these skills compound rather than interfere. A strong model placed in the wrong role behaves like a highly skilled hire assigned to the wrong job. It doesn’t merely underperform, it actively introduces friction. In my mental model, I categorize models by their architectural personality. The following is a high-level overview.
Decoder-only (the generators and planners): These are your standard LLMs like GPT or Claude. They are your talkers and coders, strong at drafting and step-by-step planning. Use them for execution: writing, coding, and producing candidate solutions.
Encoder-only (the analysts and investigators): Models like BERT and its modern successors, such as ModernBERT and NeoBERT, do not talk; they understand. They build contextual embeddings and are excellent at semantic search, filtering, and relevance scoring. Use them to rank, verify, and narrow the search space before your expensive generator even wakes up.
Mixture of experts (the specialists): MoE models behave like a set of internal specialist departments, where a router activates only a subset of experts per token. Use them when you need high capability but want to spend compute selectively.
Reasoning models (the thinkers): These are models optimized to spend more compute at test time. They pause, reflect, and check their own reasoning. They’re slower, but they often prevent expensive downstream mistakes.
So if you find yourself writing a 2,000-word prompt to make a fast generator act like a thinker, you’ve made a bad hire. You don’t need a better prompt; you need a different architecture and better system-level scaling.
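One way to make the “right model for the role” idea operational is a simple routing table. The sketch below is illustrative: the role names are hypothetical and the model identifiers are placeholders for whichever model of each family you actually deploy.

```python
# Role-to-model routing sketch: the role, not the prompt, picks the architecture.
# Model identifiers are placeholders, not product recommendations.
from dataclasses import dataclass

@dataclass(frozen=True)
class RoleSpec:
    model: str    # placeholder identifier for a model of the named family
    purpose: str

TEAM = {
    "retriever":  RoleSpec("encoder-embedding-model", "rank, filter, and narrow the search space cheaply"),
    "drafter":    RoleSpec("decoder-generator-model", "write, code, and produce candidate solutions"),
    "specialist": RoleSpec("mixture-of-experts-model", "high capability with selectively spent compute"),
    "verifier":   RoleSpec("reasoning-model", "slow, deliberate checks before results ship"),
}

def route(role: str) -> RoleSpec:
    # Failing loudly beats quietly handing a thinking task to a fast generator.
    if role not in TEAM:
        raise KeyError(f"No model assigned to role '{role}'")
    return TEAM[role]
```

The table is trivially simple on purpose: declaring roles up front means a capability gap shows up as a missing entry rather than as a 2,000-word prompt.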
Neural scaling[1] is continuous and works well for models. As shown by classic scaling laws, increasing parameter count, data, and compute tends to result in predictable improvements in capability. This logic holds for single models. Collaborative scaling[2], which is what agentic systems need, is different. It’s conditional. It grows, plateaus, and sometimes collapses depending on communication costs, memory constraints, and how much context each agent actually sees. Adding agents doesn’t behave like adding parameters.
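For readers who want the “classic scaling laws” spelled out, a common parametric form is shown below as an illustration of what predictable single-model improvement means (a Chinchilla-style fit, not a formula taken from the cited work):

$$ L(N, D) \approx E + \frac{A}{N^{\alpha}} + \frac{B}{D^{\beta}} $$

where $L$ is loss, $N$ is parameter count, $D$ is training tokens, $E$ is the irreducible loss, and $A$, $B$, $\alpha$, $\beta$ are fitted constants. No comparable closed form exists for collaborative scaling, which is exactly the point: adding agents is not like adding parameters.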
This is why topology matters. Chains, trees, and other coordination structures behave very differently under load. Some topologies stabilize reasoning as systems grow. Others amplify noise, latency, and error. These observations align with early work on collaborative scaling in multi-agent systems, which shows that performance does not increase monotonically with agent count.
Recent work from Google Research and Google DeepMind[3] makes this distinction explicit. The difference between a system that improves with every loop and one that falls apart is not the number of agents or the size of the model. It’s how the system is wired. As the number of agents increases, so does the coordination tax: Communication overhead grows, latency spikes, and context windows blow up. In addition, when too many entities attempt to solve the same problem without clear structure, the system begins to interfere with itself. The coordination structure, the flow of information, and the topology of decision-making determine whether a system amplifies capability or amplifies error.
If your multi-agent system is failing, thinking like a model practitioner is no longer enough. Stop reaching for the prompt. The surge in agentic research has made one truth undeniable: The field is moving from prompt engineering to organizational systems. The next time you design your agentic system, ask yourself: Does the coordination topology fit the task? Does information flow to the agents that need it? Does each model’s role match its architectural personality?
However you answer, the winners in the agentic era won’t be those with the smartest instructions but the ones who build the most resilient collaboration structures. Agentic performance is an architectural outcome, not a prompting problem.
There’s a growing body of research around AI coding assistants with a confusing range of conflicting results. This is to be expected when the landscape is constantly shifting from coding suggestions to agent-based workflows to Ralph Wiggum loops and beyond.
The Reichenbach Falls in Switzerland has a drop of 250 metres and a flow rate of 180-300 cubic metres per minute (enough to fill about 1,500 bathtubs). This is comparable to the rate of change in tools and techniques around coding assistants over the past year, so few of us are using them in the same way. You can’t establish best practices under these conditions, only practical point-in-time techniques.
As an industry, we, like Sherlock Holmes and James Moriarty, are battling on the precipice of this torrent, and the survival of high-quality software and sustainable delivery is at stake.
Given the rapid evolution of tools and techniques, I hesitate to cite studies from 2025, let alone 2023. Yet these are the most-cited studies on the effectiveness of coding assistants, and they present conflicting findings. One study reports developers completed tasks 56% faster, while another reports a 19% slowdown.
The studies provide a platform for thinking critically about AI in software development, enabling more constructive discussions, even as we fumble our collective way toward understanding how to use it meaningfully.
The often-cited 56% speedup stems from a 2023 collaboration among Microsoft Research, GitHub, and MIT. The number emerged from a lab test in which developers were given a set of instructions and a test suite to see how quickly and successfully they could create an HTTP server in JavaScript.
In this test, the AI-assisted group completed the task in 71 minutes, compared to 161 minutes for the control group. That makes it 55.8% faster. Much of the difference came from the speed at which novice developers completed the task. Task success was comparable between the two groups.
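As a quick check on the arithmetic, using the rounded minutes quoted above:

$$ \frac{161 - 71}{161} \approx 0.559 $$

i.e., about 56% less time to complete the task; the study’s published 55.8% figure is computed from its unrounded completion times, so the tenth-of-a-point gap is a rounding artifact rather than a discrepancy.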
There are weaknesses in this approach. The tool vendor was involved in defining the task against which the tool would be measured. If I were sitting an exam, it would be to my advantage to set the questions. Despite this, we can generously accept that it made the coding task faster, and that the automated tests sufficiently defined task success.
We might also be generous in stating that tools have improved over the past three years. Benchmarking reports like those from METR indicate that the length of tasks AI can complete has been doubling roughly every seven months; other improvements are likely.
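Stated as a trend (a paraphrase of METR’s headline finding rather than a formula it publishes), the claim is exponential: the task-length horizon $h$ roughly doubles every seven months,

$$ h(t) \approx h_0 \cdot 2^{\,t / 7\ \text{months}} $$

which is one reason a 2023 result and a 2025 result can describe tools of very different capability.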
We’ve also observed the emergence of techniques that introduce work plans and task chunking, thereby improving the agent’s ability to perform larger tasks that would otherwise incur context decay.
And METR is also the source of our cautionary counterfinding regarding task speed.
The METR study in 2025 examined the impact of contemporary tools on task completion times in real-world open-source projects. The research is based on 246 tasks performed by 16 developers who had experience using AI tools. Each task was randomly assigned either to allow or to prohibit AI assistance, creating an AI-assisted set and a control set. Screen recordings were captured to verify completion and categorize how the time was spent.
The research found that tasks were slowed by 19%, which appears to contradict the earlier report. In reality, AI tools did reduce active coding time, along with time spent searching for answers, testing, and debugging. The difference was that the METR study identified new task categories introduced by the tools, such as reviewing AI output, prompting, and waiting for responses. These new tasks, along with increased idle and overhead time, consumed the gains and pushed overall completion times into the red.
[Figure: task category comparison. Source: METR, Measuring the Impact of Early-2025 AI.]
One finding from the METR study worth noting is the perception problem. Developers predicted AI assistants would speed them up. After completing the task, they also estimated they had saved time, even though they were 19% slower. This highlights that our perceptions of productivity are unreliable, as they were when we believed that multitasking made us more productive.
A recently released study from Multitudes, based on data collected over 10 months in 2025, highlights the lack of consensus around the productivity benefits of AI coding tools. They found that the number of code changes increased, but this was countered by an increase in out-of-hours commits.
This appears to be a classic case of increasing throughput at the expense of stability, with out-of-hours commits representing failure demand rather than feature development. It also clouds the picture, as developers who work more hours tend to make more commits, even without an AI assistant.
Some of the blame was attributed to adoption patterns that left little time for learning and increased delivery pressure on teams, even though they now had tools that were supposed to help them.
One finding that repeatedly comes up in the research is that AI coding assistants benefit novice developers more than those with deep experience. This makes it likely that using these tools will exacerbate a wicked talent problem. Novice developers may never shed their reliance on tools, as they become accustomed to working at a higher level of abstraction.
This is excellent news for those selling AI coding tools, as an ever-expanding market of developers who can’t deliver without the tools will be a fruitful source of future income. When investors are ready to recoup, organizations will have little choice but to accept whatever pricing structure is required to make vendors profitable. Given the level of investment, this may be a difficult price to accept.
The problem may deepen as organizations have stopped hiring junior developers, believing that senior developers can delegate junior-level tasks to AI tools. This doesn’t align with the research, which shows junior developers speed up the most when using AI.
The AI Pulse Report compares this to the aftermath of the dot-com bubble, when junior hiring was frozen, resulting in a shortage of skilled developers. When hiring picked up again, increased competition for talent led to higher salaries.
[Figure: hiring plans for junior developers. Source: The AI Pulse Report.]
While many practitioners recognize the relevance of value stream management and the theory of constraints to AI adoption, a counter-movement is emerging that calls for the complete removal of downstream roadblocks.
“If you can’t complete code reviews at the speed at which they are created with AI, you should stop doing them. Every other quality of a system should be subordinated to straight-line speed. Why waste time in discovery when it would starve the code-generating machine? Instead, we should build as much as we can as fast as we can.”
As a continuous delivery practitioner and a long-time follower of the DORA research program, I find this no longer makes sense to me. One of the most powerful findings in the DORA research is that a user-centric approach beats flat-line speed in terms of product performance. You can slow development down to a trickle if you’ve worked out your discovery process, because you don’t need many rounds of chaotic or random experiments when you have a deep understanding of the user and the problem they want solved.
We have high confidence that continuous delivery practices improve the success of AI adoption. You shouldn’t rush to dial up coding speed until you’ve put those practices in place, and you shouldn’t remove practices in the name of speed. That means working in small batches, integrating changes into the main branch every few hours, keeping your code deployable at all times, and automating builds, code analysis, tests, and deployments to smooth the flow of change.
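As a minimal illustration of the “automate the flow” point, a pipeline is just an ordered set of gates where any failure stops the change from moving forward. The sketch below is not prescriptive: the commands are placeholders for whatever linter, test runner, build tool, and deploy step your project actually uses, and in practice this logic lives in your CI system rather than a hand-rolled script.

```python
# Minimal pipeline sketch: ordered quality gates; any failure stops the change.
# The commands are placeholders for your project's real tooling.
import subprocess
import sys

GATES = [
    ("static analysis", ["your-linter", "."]),          # placeholder command
    ("automated tests", ["your-test-runner"]),          # placeholder command
    ("build",           ["your-build-tool", "build"]),  # placeholder command
    ("deploy",          ["your-deploy-tool", "push"]),  # placeholder command
]

def run_pipeline() -> None:
    for name, command in GATES:
        print(f"Running gate: {name}")
        result = subprocess.run(command)
        if result.returncode != 0:
            # Keeping code deployable means failing fast, not skipping gates.
            sys.exit(f"Gate '{name}' failed; change does not proceed.")
    print("All gates passed; change flows to users.")

if __name__ == "__main__":
    run_pipeline()
```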
Continuous delivery is about getting all types of changes to users safely, quickly, and sustainably. The calls to remove stages from the deployment pipeline to expedite delivery compromise the safety and sustainability of software delivery, permanently degrading the software’s value for a temporary gain.
There’s so much to unpack in the research, and many studies focus on a single link in a much longer chain. Flowing value from end to end safely, quickly, and sustainably should be the goal, rather than merely maintaining flat-line speed or optimizing individual tasks, especially when those tasks are the constraining factor.
With the knowledge we’ve built over the last seven decades, we should be moving into a new era of professionalism in software engineering. Instead, we’re being distracted by speed above all other factors. When my local coffee shop did this, complete with a clipboard-wielding Taylorist assessor tasked with bringing order-to-delivery times down to 30 seconds, the delivery of fast, bad coffee convinced me to find a new place to get coffee. Is this what we want from our software?
The results across multiple studies show that claims of a revolution are premature, unless it’s an overlord revolution that will depress the salaries of those pesky software engineers and produce a group of builders who can’t deliver software without these new tools. Instead, we should examine the landscape and learn from research and from one another as we work out how to use LLM-based tools effectively in our complex socio-technical environments.
We are at a crossroads: either professionalize our work or adopt a prompt-and-fix model that resembles the earliest attempts to build software. There are infinite futures ahead of us. I don’t dread the AI-assisted future as a developer, but as a software user. I can’t tolerate the quality and usability chasm that will result from removing continuous delivery practices in the name of speed.
The post How AI coding makes developers 56% faster and 19% slower appeared first on The New Stack.
Read the full Show Notes and search through the world's largest audio library on Agile and Scrum directly on the Scrum Master Toolbox Podcast website: http://bit.ly/SMTP_ShowNotes.
"It's about coaching the team, not teaching them." - Prabhleen Kaur
Prabhleen shares a powerful lesson about the dangers of being too directive with a forming team. When she joined a new team, her enthusiasm and experience led her to immediately introduce best practices, believing she was setting the team up for success. Instead, the team felt burdened by rules they didn't understand the purpose of. The process became about following instructions rather than solving problems together.
It wasn't until her one-on-one conversations with team members that Prabhleen realized the disconnect. She discovered that the team viewed the practices as mandates rather than tools for their benefit. The turning point came when she brought this observation to the retrospective, and together they unlearned what had been imposed.
Now, when Prabhleen joins a new team, she takes a different approach. She first seeks to understand how the team has been functioning, then presents situations as problems to be solved collectively. By asking "How do you want to take this up?" instead of prescribing solutions, she invites team ownership. This shift from teaching to coaching means the team creates their own working agreements, their own definitions of ready and done, and their own communication norms. When people voice solutions themselves, they follow through because they own the outcome.
In this episode, we refer to working agreements and their importance in team formation.
Self-reflection Question: When you join a new team, do you first seek to understand their current ways of working, or do you immediately start suggesting improvements based on your past experience?
About Prabhleen Kaur
Prabhleen is a Certified Scrum Master with 7+ years of experience helping teams succeed with SAFe, Scrum and Kanban. Passionate about clean backlogs, powerful metrics, and dashboards that actually mean something. She is also known for making JIRA behave, driving Agile transformations, and helping teams ship value consistently and confidently.
You can link with Prabhleen Kaur on LinkedIn.