Sr. Content Developer at Microsoft, working remotely in PA, TechBash conference organizer, former Microsoft MVP, Husband, Dad and Geek.
154533 stories
·
33 followers

Get a Good Return on Your AI Investments

Last week, we had our first Infrastructure & Ops superstream of 2026, Platform Engineering in the Age of AI. Our speakers explored a range of topics focused on supporting new AI workloads, each with unique infrastructure needs, unpredictable costs, and novel security concerns. Google Cloud’s Abdel Sghiouar took the audience through what a good platform for AI looks like, Cockroach Labs’ Jordan Lewis shared lessons learned rolling out a corporate AI platform, Syntasso’s Daniel Bryant outlined a three-layer model for building a good platform, technology leader Sarah Wells discussed the importance of governance and how to make it more manageable, and Thoughtworks’ Ben O’Mahony explained why evals should be part of your observability story. You can watch the highlights here.

The event concluded with a fireside chat between Sam Newman and Nathen Harvey, who leads the DORA team at Google Cloud. DORA has been tracking software delivery performance for over a decade, which means they’ve watched a lot of technology trends come through. Their center of gravity has always been the same question: How quickly and safely can a team move change into a running production application?

AI hasn’t changed that question, although it has made answering it a bit harder. DORA recently released its ROI of AI-Assisted Software Development report to show how AI is working for teams right now, and how that may or may not be contributing to organizations’ bottom lines. Nathen used the findings as a jumping-off point to dig into how AI is changing platform engineering and software development as a whole.

The productivity gap

Sam started by pointing out one of the biggest headline findings from DORA’s 2025 data: Organizations saw about 10% improvement in terms of actual code shipped to production systems. Even though developers likely felt that they were more productive, that doesn’t automatically carry through to production. DORA’s data shows higher throughput alongside higher instability. In other words, teams are shipping more but they’re also more frequently rolling back changes or implementing fixes. The gains at the individual level are real (and 10% is a pretty good number), but those gains aren’t “the dramatic improvements that you find in the headlines.”

AI amplifies good processes (and bad ones)

Nathen explained that AI is an amplifier and mirror that equally reflects the good and bad. On teams where shipping change is already easy, AI tends to keep things running well. On teams where getting change into production is painful, AI generates more change and makes the existing friction more acute. That said, his read on this outcome is cautiously optimistic: “If the pain is more acute, we maybe will invest in addressing that pain.”

The rub is that the investment has to actually happen. Nathen noted that in lower-performing organizations, AI tools often arrive with a reset of expectations rather than an invitation to fix the process: Here’s your new tool. Now we expect more from you. Addressing this problem means reframing the question “Does AI make people more productive?” What we really should be asking is “Under what conditions will AI boost productivity, and who’s responsible for creating them?” And that falls on the organization, not the technology.

Verification isn’t a checkbox

Trust is a big challenge with generative AI. About 30% of DORA survey respondents trust AI output little or not at all. Around 46% trust it “somewhat” (and Nathen is one of them). Despite all the advances in generative AI, these tools still make mistakes, and if you’ve multiplied your ability to generate code without doing anything to scale your ability to verify it, you’ve made your situation worse, not better.

Nathen called this the verification tax, and it belongs in any honest accounting of AI’s productivity impact. Pipeline adaptation belongs there too: Is your delivery pipeline fit for purpose given the volume of change you’re now trying to push through? These costs don’t show up in the headlines about 10x developer productivity. They show up in your incident reports three months later.

DORA recently published an ROI framework and calculator for AI-assisted software development. Nathen was clear that there’s no universal number to offer, and the calculator doesn’t pretend otherwise. What it does is give teams a way to model the real costs, including the learning investment, the verification overhead, and the pipeline changes required.
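To make the shape of that accounting concrete, here is a minimal sketch of a cost model. It is not DORA's calculator: the class name, field names, and the figures in the example are all hypothetical, and a real model would need inputs measured in your own organization.

```python
from dataclasses import dataclass


@dataclass
class AiRoiInputs:
    """Hypothetical yearly inputs, all in the same currency unit."""
    value_of_extra_throughput: float  # value of the additional changes shipped
    license_cost: float               # tool subscriptions
    learning_cost: float              # training and ramp-up time
    verification_cost: float          # extra review and testing (the "verification tax")
    pipeline_cost: float              # delivery pipeline changes to absorb more change


def total_cost(inputs: AiRoiInputs) -> float:
    """Sum every cost category, not just the license fee."""
    return (inputs.license_cost + inputs.learning_cost
            + inputs.verification_cost + inputs.pipeline_cost)


def net_return(inputs: AiRoiInputs) -> float:
    """Value gained minus everything spent to gain it."""
    return inputs.value_of_extra_throughput - total_cost(inputs)


def roi(inputs: AiRoiInputs) -> float:
    """Net return as a fraction of total cost."""
    cost = total_cost(inputs)
    if cost == 0:
        raise ValueError("total cost must be non-zero to compute ROI")
    return net_return(inputs) / cost


# Illustrative numbers only; they are assumptions, not survey data.
example = AiRoiInputs(
    value_of_extra_throughput=500_000,
    license_cost=100_000,
    learning_cost=50_000,
    verification_cost=120_000,
    pipeline_cost=80_000,
)
print(f"total cost: {total_cost(example):,.0f}")
print(f"net return: {net_return(example):,.0f}")
print(f"ROI: {roi(example):.1%}")
```

Leaving out the verification and pipeline lines would make the same investment look far better than it is, which is the gap Nathen was pointing at.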

Context switching and burnout

With productivity on the upswing, AI-induced burnout is becoming a serious concern. (Steve Yegge calls this the “AI vampire.”) DORA’s data for 2025 showed that AI adoption wasn’t strongly connected with burnout, with the caveat that about 64% of DORA survey respondents said they’d never worked in an agentic workflow. Both of those findings are likely to change significantly in 2026.

Nathen highlighted one source of burnout he expects to escalate as agents become the norm: context switching. As he pointed out, software developers spent years arguing for protected focus time to do the deep work that requires them to maintain flow. Agentic workflows are now incentivizing those same developers to voluntarily run a dozen or more agents at once, forcing them to context-switch multiple times every hour. As he joked, “There’s plenty of research that supports the idea that all of us feel like we’re pretty good multitaskers and none of us are.” The consequences are coming, and we’re doing it to ourselves.

The cognitive debt question

Sam Newman brought up the related notion of “cognitive debt,” and in particular, Margaret-Anne Storey’s discussion of it. (See “How Generative and Agentic AI Shift Concern from Technical Debt to Cognitive Debt” and “From Technical Debt to Cognitive and Intent Debt: Rethinking Software Health in the Age of AI.”) Here’s how Storey explains the problem in her blog post:

Debt compounded from going fast lives in the brains of the developers and affects their lived experiences and abilities to “go fast” or to make changes. Even if AI agents produce code that could be easy to understand, the humans involved may have simply lost the plot and may not understand what the program is supposed to do, how their intentions were implemented, or how to possibly change it.

And as Sam noted, this compounds across teams and organizations. As developers increasingly work in parallel with AI rather than with each other, they lose the shared understanding that comes from people building software together. Kent Beck once said that “software design is an exercise in human relationships.” Agentic workflows are putting pressure on that in ways we’re only beginning to see.

Nathen agreed that cognitive debt is where he’s most concerned, and both your workers and your architecture will suffer for it. The ramifications of an architectural decision you made eight months ago take years of operation to surface, and AI doesn’t help with that at all.

Invest in your platform now

Considering what makes some AI-assisted teams high performers, Nathen explained, “It’s not that you’re using AI but how you’re using AI.” This observation led DORA to develop seven capabilities that, when combined with AI adoption, lead to better outcomes. Nathen briefly ran through the list, ending on quality internal platforms. And here he made a claim about software engineering investment that was, in his words, “a little bit wild”:

Every product engineer that you have in your organization, every engineer that’s focused on building features right now, should probably stop building features and focus on the platform.

His argument is that platforms matter more, not less, in an environment where AI makes it possible for almost anyone in an organization to build something. The people closest to customers and business problems can now generate working software. What they can’t do is ensure that software is durable, secure, and production-ready.

Nathen suggested that the best leverage for software engineering investment today might be building platforms that provide those guardrails, that shift the complexity of production-readiness down into the infrastructure so that anyone building on top of it gets the safety net for free. He acknowledged that moving every product engineer to platform work might be overkill. But the direction of travel is real. The platform is also, as Newman pointed out, where you bring determinism back into a process that AI has made more nondeterministic.

That’s something we’ve been hearing a lot here at O’Reilly. The expansion of who can build doesn’t reduce the need for deep engineering expertise. It changes where that expertise is most valuable, and platforms are a good answer to where.

What DORA’s research tells us

The teams that are doing well are running experiments, learning from them, and spreading those lessons. The measure Nathen suggested is not how many tokens you’ve consumed but how many experiments you’ve run and how well you’re distributing what you’ve learned.

The tools are moving fast enough that any organization locking in a fixed policy around specific tools will find itself stuck. What you want is the capacity to keep learning, which means building the culture and the processes that make learning visible and transferable.

All of DORA’s research is freely available at dora.dev, including the 2025 annual report and the ROI framework. The DORA Community provides a space for practitioners to work through these questions together. If you’re trying to navigate any of this with your team, you may want to spend some time there.

And if you want to dive deeper into Nathen and Sam’s chat or explore the other sessions, you can watch the entire Infrastructure & Ops Superstream on the O’Reilly learning platform. Our next event, on September 9, will cover agentic observability. Register for free here, and check out all the other free live events on O’Reilly.



Read the whole story
alvinashcraft
44 minutes ago
reply
Pennsylvania, USA
Share this story
Delete

Strengthening our approach to tackling non-consensual intimate imagery

When intimate images are shared without consent—whether real or AI-generated—the harm is immediate, deeply personal, and often long-lasting. It can affect someone’s sense of safety, dignity, and control, both online and offline. Protecting people from harms like non-consensual intimate imagery (NCII) has long been a priority for Microsoft. And as technology advances, our response continues to evolve to tackle very real challenges like the proliferation of highly realistic synthetic imagery. With the US Take It Down Act coming into force this month, establishing new federal protections against the spread of NCII, it’s important to share how we’re evolving our approach: making it easier to report harm, taking new steps to detect known NCII, and enabling more effective enforcement across our services.

Expanding protections across Microsoft services

Our goal is to make it simpler for individuals, or their representatives, to report violative content to Microsoft. We have strengthened our global reporting processes for NCII with a more intuitive form, with clear options to describe harm, including both real and AI‑generated images. These changes are designed to ease the burden for people in a distressing moment and enable faster, more effective action by our teams. Microsoft’s NCII policy is applied consistently across real and synthetic content, recognizing that the harm to individuals is the same, regardless of how an image was created. To report content on Microsoft services, use “Report A Concern” or report it in the product where you encounter the content.

We also want to proactively detect and prevent the spread of known NCII by working with StopNCII.org, a reporting platform that enables individuals to create a digital “fingerprint,” or hash, of their images. Two years ago, we provided StopNCII.org with a new version of PhotoDNA that enables victims to create a hash without an image ever leaving their device. This can then be used by StopNCII.org partners to detect and remove matching NCII content across platforms, allowing industry to work together to prevent re-sharing and protect individuals’ privacy. We have been piloting the use of these hashes in Bing since September 2024. 
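The general idea of matching by fingerprint rather than by image can be sketched in a few lines. This is not PhotoDNA, whose algorithm is not described here; it is a toy "average hash" over a tiny grid of grayscale values, and every function name and threshold below is a hypothetical chosen for illustration. The point it shows is that only the hash needs to be shared, and that a near-identical copy can still match.

```python
def average_hash(pixels: list[list[int]]) -> int:
    """Toy perceptual hash: one bit per pixel, set when the pixel is brighter than the mean."""
    flat = [value for row in pixels for value in row]
    mean = sum(flat) / len(flat)
    bits = 0
    for value in flat:
        bits = (bits << 1) | (1 if value > mean else 0)
    return bits


def hamming_distance(a: int, b: int) -> int:
    """Number of bit positions where two hashes differ."""
    return bin(a ^ b).count("1")


def matches_known_hash(candidate: int, known_hashes: set[int], max_distance: int = 1) -> bool:
    """True when the candidate is within max_distance bits of any known hash."""
    return any(hamming_distance(candidate, known) <= max_distance for known in known_hashes)


# The hash is computed locally; only this integer would be submitted.
original = [[10, 200], [220, 30]]
known_hashes = {average_hash(original)}

# A slightly altered copy (for example, recompressed) and an unrelated image.
altered_copy = [[12, 198], [225, 28]]
unrelated = [[200, 10], [30, 220]]

print(matches_known_hash(average_hash(altered_copy), known_hashes))
print(matches_known_hash(average_hash(unrelated), known_hashes))
```

A production system would use a far more robust hash and carefully validated thresholds, which is part of why the post emphasizes validated hashes and human review.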

We have now expanded our use of validated StopNCII.org hashes across Microsoft consumer services, including Teams Free, OneDrive, and Xbox. We will implement these changes carefully to advance effectiveness and accuracy—accelerating removals, automating where appropriate, maintaining human review for reported cases, and providing clear, accessible paths for users to appeal decisions.

Enhancing our collective response to this harm

No single company can address NCII alone. It requires coordination across industry, governments, and civil society. Microsoft will continue working with partners to improve shared tools and approaches that help prevent this content from spreading. We will also continue to advocate for clear, effective policies that protect victims, support innovation, and strengthen accountability across the ecosystem.

We will also continue to support efforts to advance laws that prevent and deter image-based abuse. Microsoft advocated in support of the US Take It Down Act and welcomes the European Union’s work to strengthen protections against “nudification” apps, alongside global efforts to criminalize this misuse of technology. We are closely tracking Ofcom’s recent announcement that new measures will be required under the UK Online Safety Act to address illegal NCII harms. We believe our proactive work in this area will help us maintain trust with survivors, users, and regulators, among others.

Speed, clarity, and trust matter for people affected by intimate image abuse. When someone reaches out for help, we will strive to respond quickly, respectfully, and effectively. Our goal, though, is to invest in technologies and partnerships that reduce the likelihood of harm. We have joined forces with Childnet, a UK NGO that aims to safeguard children online, and created educational materials to prevent the misuse of AI to create intimate imagery among teens. These materials have now been released in the UK, as well as localized with partners in Singapore, South Korea, and Japan.

I am proud to learn from our digital safety team, which is carefully charting our path, and from the many industry and community leaders contributing to this work. This is an evolving challenge. We are committed to the journey, grounded by the voices of experts and survivors.

The post Strengthening our approach to tackling non-consensual intimate imagery appeared first on Microsoft On the Issues.

Extending Human Intelligence Through AI

At a glance

  • Modern AI systems are powerful not because they replicate human intelligence, but because they presuppose it, by extending structures already present in human cognition and language.
  • This perspective helps explain both AI’s remarkable capabilities and its recurring boundaries, including hallucinations and breakdowns in reasoning.
  • This research argues that AI safety is a system-level challenge, shifting attention from “rogue AI” narratives toward harnessing engineering and governance.
  • Understanding AI as an extension of human intelligence—not a replacement for it—offers a more grounded path for building trustworthy AI systems.

AI systems today can write essays, generate code, summarize complex ideas, and carry on conversations with remarkable fluency. Yet those same systems still struggle with tasks humans find intuitive: reliably tracking objects through change, reasoning compositionally in unfamiliar situations, or distinguishing truth from plausible fiction. These contradictions have fueled polarized debates about AI. Some see current systems as early forms of human-like intelligence; others dismiss them as sophisticated autocomplete. 

In recent interdisciplinary work – including Adam Frank, Marcelo Gleiser, and Evan Thompson’s The Blind Spot and DeepMind researcher Alexander Lerchner’s The Abstraction Fallacy – a different picture is emerging. Rather than asking whether AI systems are becoming intelligent in the human sense, these approaches ask a more basic question: What if AI systems work because they rely on structures that are rooted in human cognition? This shift in perspective, which draws on the phenomenology of Edmund Husserl, helps make sense of both the capabilities and the limits of modern AI.

In our recent paper, The Origins of Artificial Intelligence in Natural Intelligence, we argue that modern AI systems are best understood neither as human minds nor as trivial statistical tricks. Instead, they extend structures that originate in human cognition itself. Further drawing on the phenomenology of Husserl, the paper proposes that language already contains sedimented structures of human understanding, structures that AI systems learn to model and extend. This perspective helps explain both the capabilities and the boundaries of contemporary AI.

Human perception is not simply passive reception of sensory data. We experience the world as stable things unfolding through change: a cup remains the same cup as we move around it; a melody remains recognizable even as individual notes pass away. Language emerges by expressing these stable structures in conceptual form. Words like “red,” “round,” or “larger than” articulate relationships that originate in lived experience. 

Large language models learn statistical relationships within this linguistic world. They capture how concepts tend to relate across enormous bodies of human writing. This explains why AI systems can produce coherent responses across many domains. But it also explains why they hallucinate. Humans remain answerable to the world: experience continually corrects our expectations and beliefs. AI systems, by contrast, extend patterns within text itself. They can continue a line of reasoning with remarkable fluency, but they lack the lived engagement with the world that anchors meaning and truth.

Figure: How AI extends human cognition

This framework helps explain several recurring challenges in AI research. One is the “compositionality gap”—the tendency for language models to perform well on familiar reasoning patterns while failing when asked to combine concepts in genuinely novel ways. Research increasingly shows that larger models improve fluency and factual recall much faster than they improve true compositional reasoning. From our perspective, this is not simply an engineering limitation but a structural boundary: AI systems can extend patterns already sedimented in language, but they do not possess the world-directed understanding that allows humans to generate genuinely new conceptual relations. 

A similar pattern appears in multimodal systems that combine language and vision. These systems can often label images correctly while still failing at robust reasoning about objects and their parts. They learn correlations between visual patterns and language rather than perceiving stable objects unfolding through time in the way humans do. The result is systems that can appear impressively fluent while remaining surprisingly brittle outside familiar patterns. 

This perspective also reframes debates about AI safety. Public discussion often swings between fears of “rogue superintelligence” and claims that AI poses little meaningful risk. Our research suggests that both extremes misunderstand the nature of current systems. The most immediate risks arise not because AI possesses human-like intentions, but because it can extend patterns of reasoning without reflective responsibility to the world. Systems can generate persuasive but ungrounded outputs, automate flawed decisions at scale, or execute harmful actions if embedded in poorly governed environments.

This helps explain why AI safety is increasingly shifting from model safety to system safety. In practice, organizations already rely on layered safeguards—what the industry increasingly calls “harnesses”—to constrain, validate, and monitor AI behavior. Rather than temporary patches, our paper argues that these mechanisms reflect something fundamental about AI architecture itself: trustworthy behavior emerges from the work of builders of AI systems responsible for their behavior, a responsibility that cannot be delegated to or shared with models.

This interpretation aligns closely with how enterprises increasingly approach trustworthy AI deployment. Organizations need systems that can extend human intelligence while remaining governable, auditable, and aligned with human oversight. Understanding AI as a derived form of intelligence clarifies why layered governance, evaluation, and operational controls matter so deeply.

Looking ahead, we believe phenomenology offers more than a critique of AI—it offers a framework for understanding its promise. AI systems reveal something profound about human cognition itself: that meaning can be formalized, extended, and scaled in powerful new ways.  The central societal risk of AI thus turns out to be kicking away the ladder of its origins in human experience and cognition – misinterpreting AI as a rival intelligence that diminishes our humanity and thus, in turn, diminishes the true promise of AI itself. 

The question, then, is not whether AI will replace human intelligence. It is how we can responsibly build systems that extend human understanding while remaining grounded in the world from which that understanding arises. If we mistake AI systems for autonomous minds, we risk over-trusting them. If we dismiss them as trivial tricks, we risk overlooking one of the most important technological developments of our time. A more grounded interpretation recognizes both truths at once: AI is a genuine extension of human intelligence—and precisely because of that, humans remain responsible for how it is understood, governed, and used.

The post Extending Human Intelligence Through AI appeared first on Microsoft Research.

Agents on a leash: Agentic AI remains mostly single-agent and monitored at work

AI’s impact on software engineering continues, and more and more of that AI is packaged as agents: results from our newest pulse survey show agentic usage has almost doubled (59%) since we last asked about it in our annual Developer Survey.

Trisha Gee: AI Won’t Fix Your Broken Pipeline – It Will Break It Faster

At Devoxx UK, I spoke with Trisha Gee – author and one of the most recognized voices in the Java space – about what really happens when teams lean heavily on AI. Her take was far darker than the conference hype.

Trisha Gee has spent over two decades in software development, from startups to global enterprises – equally at home discussing DORA metrics and SPACE frameworks as business outcomes and organizational design.

At Devoxx UK, she gave a talk about how software engineering principles stay the same regardless of what tooling era you are in.

I wanted to understand what that means right now when AI is writing a significant portion of the code.

AI exposes the weakest link, not just the fastest path

Trisha frames AI as an amplifier, not a solution. When I asked what that looks like beyond demos, she put it simply: it exposes the problems that were already there, the ones you didn’t know you had.

The most common thing I saw (I was working at Gradle, so we dealt with a lot of build tooling) was more code, more tests, and tests taking longer. The continuous delivery pipeline took a lot of pressure.

The broader pattern she describes is straightforward but easy to miss when you are excited about shipping faster. “Whichever part of your system is the weakest, it’s going to expose that part,” she said.

Put another way, while most conversations about AI adoption focus on what gets faster, Trisha highlights what deteriorates first.

When code gets cheap, everything else gets expensive

When I asked Trisha where teams should focus once code generation becomes cheap, her answer was “everywhere.”

 Photo: DevoxxUK / Flickr

What she means is that optimizing the writing of code without understanding the surrounding system does not move the needle.

It’s not about one thing which is going to fix one problem, it’s about really understanding the whole system, it’s about understanding even the whole organization, the whole enterprise. Where does IT and technology and software fit into that? What are you really trying to deliver? What is the business benefit?

She described this as working across two ends of the process. On the input side, teams need to get better at questioning requirements before writing anything. On the output side, they need to look at build pipelines, test parallelism, flaky tests, and DORA metrics.

“If you can measure those things (your DORA metrics, build times, whether delivered requirements actually give users value) you can start to see which parts of the process are working and which need attention,” Trisha explained.
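As a rough illustration of what measuring those things can look like, the sketch below computes two delivery measures from a list of deployment records. The `Deployment` record and both function names are assumptions made for this example, and the definitions are deliberately simplified compared with DORA's published guidance.

```python
from dataclasses import dataclass
from datetime import date


@dataclass
class Deployment:
    day: date
    failed: bool  # True if the change needed a rollback or a hotfix


def deployment_frequency(deployments: list[Deployment], days_in_period: int) -> float:
    """Average number of production deployments per day over the period."""
    if days_in_period <= 0:
        raise ValueError("days_in_period must be positive")
    return len(deployments) / days_in_period


def change_failure_rate(deployments: list[Deployment]) -> float:
    """Share of deployments that needed a rollback or a fix."""
    if not deployments:
        return 0.0
    return sum(d.failed for d in deployments) / len(deployments)


# Hypothetical four weeks of history.
history = [
    Deployment(date(2026, 3, 2), failed=False),
    Deployment(date(2026, 3, 9), failed=True),
    Deployment(date(2026, 3, 16), failed=False),
    Deployment(date(2026, 3, 23), failed=False),
]

print(f"deployments per day: {deployment_frequency(history, days_in_period=28):.3f}")
print(f"change failure rate: {change_failure_rate(history):.0%}")
```

Tracking both numbers together matters: more deployments with a rising failure rate is the "shipping more, fixing more" pattern rather than a real improvement.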

Measuring the wrong things optimizes the wrong things

She also makes a sharp point about measurement and optimization.

If you measure lines of code for productivity, you’ll get more lines of code. But really productivity is not just about what we call these activity metrics. It’s not just lines of code. It’s not just pull requests, merges, features delivered.

The thing teams consistently miss is the full arc of delivery.

Developer experience and productivity is the whole piece. Did it get out to the user? Did it meet the user’s needs? Is the user paying for more of our stuff? Is the business getting what they need from what the developers are doing? What you’re measuring there impacts what you’re going to optimize.

That last line is worth sitting with. If your productivity metrics stop at pull requests merged, you are optimizing for pull requests merged.


The SPACE framework and why three metrics beat one

When I asked Trisha what teams should measure, she pointed to the SPACE framework. SPACE stands for satisfaction, performance, activity, communication and collaboration, and efficiency and flow.

DORA metrics, which most teams are more familiar with, are a subset of it. Her recommendation is to pick metrics from three different dimensions rather than relying on a single category. The reasoning is that single-category metrics tend to be easy to game without improving anything real.
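The "three dimensions" recommendation can be expressed as a simple check on a team's metric set. This is only a sketch: the metric names and the dimension each one is assigned to are assumptions made for illustration, not an official SPACE mapping.

```python
SPACE_DIMENSIONS = {
    "satisfaction",
    "performance",
    "activity",
    "communication_and_collaboration",
    "efficiency_and_flow",
}


def dimensions_covered(metrics: dict[str, str]) -> set[str]:
    """Return the SPACE dimensions a metric set touches (metric name -> dimension)."""
    unknown = set(metrics.values()) - SPACE_DIMENSIONS
    if unknown:
        raise ValueError(f"unknown dimensions: {sorted(unknown)}")
    return set(metrics.values())


def is_balanced(metrics: dict[str, str], minimum: int = 3) -> bool:
    """True when the metrics span at least `minimum` different dimensions."""
    return len(dimensions_covered(metrics)) >= minimum


# Hypothetical metric sets; the dimension assignments are assumptions.
activity_only = {
    "lines_of_code": "activity",
    "pull_requests_merged": "activity",
}
mixed = {
    "pull_requests_merged": "activity",
    "developer_survey_score": "satisfaction",
    "change_failure_rate": "performance",
}

print(is_balanced(activity_only))
print(is_balanced(mixed))
```

The first set is the kind that is easy to game, since every metric in it rewards the same behavior.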

So yes, you can write more code, but no, you didn’t do what the business wanted.

Photo: Marin Pavelić

She also brought up Fred Brooks and communication overhead as something the industry consistently underweights. The harder metrics to capture, like satisfaction and flow, are often more revealing than the activity metrics that dashboards make easy to track.

The business outcomes she keeps returning to are specific: “You need to measure, did it do what you wanted it to do? Did it get out to the user in time? Did they start spending more money with us? Did it fix your retention problem?”

Those are the things which matter much more to the business.

What to fix before adopting AI

I wondered what teams need to get right before AI tooling can actually help them. Trisha’s first answer was essentially: stop adopting AI the way you have adopted everything else.

We generally get requirements, write the code, chuck it out there, and then you’re kind of done. That’s not how it should work.

What she advocates for instead is applying the scientific method to engineering decisions, which sounds obvious but rarely happens in real life.

Have a hypothesis, do your investigation, measure the results, have a conclusion. Generally speaking, we have not been great at that in our industry.

Applied to AI adoption specifically, that means being precise about what you are actually trying to achieve. “What are we trying to achieve with AI? Do we want to deliver more features more quickly to the customer or do we want to perhaps deliver higher quality features? Because those two things are not necessarily the same thing,” Trisha concluded.

The practical instruction she gives is to run short experiments, measure one change at a time, and iterate: “But have a hypothesis, figure out how to measure it, measure it, get feedback, and iterate over that.”
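One way to keep that loop honest is to write the hypothesis and its metric down before the change and compare against a baseline afterward. The sketch below is a minimal version of that idea; the `Experiment` class, its fields, the 5% threshold, and the sample numbers are all hypothetical, and it compares simple averages rather than testing statistical significance.

```python
from dataclasses import dataclass
from statistics import mean


@dataclass
class Experiment:
    hypothesis: str
    metric: str
    baseline: list[float]       # measurements before the change
    after_change: list[float]   # measurements after the single change
    lower_is_better: bool = True

    def relative_change(self) -> float:
        """Change in the metric's average, as a fraction of the baseline average."""
        before = mean(self.baseline)
        return (mean(self.after_change) - before) / before

    def supports_hypothesis(self, min_improvement: float = 0.05) -> bool:
        """True when the metric improved by at least min_improvement."""
        change = self.relative_change()
        improvement = -change if self.lower_is_better else change
        return improvement >= min_improvement


# Hypothetical experiment: one change, one metric.
build_time = Experiment(
    hypothesis="Running tests in parallel cuts build time by at least 5%",
    metric="build_minutes",
    baseline=[20.0, 22.0, 18.0],
    after_change=[15.0, 17.0, 16.0],
)

print(f"{build_time.metric} changed by {build_time.relative_change():.0%}")
print("hypothesis supported:", build_time.supports_hypothesis())
```

Changing one thing per experiment is what makes the comparison readable; with several changes at once, the same numbers would not say which one helped.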


The post Trisha Gee: AI Won’t Fix Your Broken Pipeline – It Will Break It Faster appeared first on ShiftMag.

Agent Skills

The following article originally appeared on Addy Osmani’s blog and is being reposted here with the author’s permission.

The default behavior of any AI coding agent is to take the shortest path to “done.” Ask for a feature and it writes the feature. It doesn’t ask whether you have a spec, write a test before the implementation, consider whether the change crosses a trust boundary, or check what the PR will look like to a reviewer. It produces code, declares victory, and moves on.

This is the same failure mode every senior engineer has spent their career learning to avoid. The senior version of any task includes work that doesn’t show up in the diff: surfacing assumptions, writing the spec, breaking the work into reviewable chunks, choosing the boring design, leaving evidence that the result is correct, sizing the change so a human can actually review it. Those steps are most of what separates engineers who ship reliable software at scale from people who push code that breaks.

Agents skip those steps for the same reason any junior would. They’re invisible. The reward signal points at “task complete” not “task complete and the design doc exists.” So we have to bolt the senior-engineer scaffolding back on.

Agent Skills is my attempt at that scaffolding. It just crossed 27K stars, so apparently I’m not alone in wanting it. This post is the part the README doesn’t quite cover: why each design choice exists, how it maps onto standard SDLC and Google’s published engineering practices, and what you should steal from the project even if you never install a single skill.

What a “skill” actually is

The word “skill” is doing a lot of work in the Claude Code/Anthropic vocabulary, and it helps to be precise. A skill is a Markdown file with front matter that gets injected into the agent’s context when the situation calls for it. Somewhere between a system-prompt fragment and a runbook.

A skill is not reference documentation. It is not “everything you should know about testing.” It is a workflow: a sequence of steps the agent follows, with checkpoints that produce evidence, ending in a defined exit criterion.

That distinction is the whole game. If you put a 2,000-word essay on testing best practices into the agent’s context, the agent reads it, generates plausible-looking text, and skips the actual testing. If you put a workflow there (write the failing test first, run it, watch it fail, write the minimum code to pass, watch it pass, refactor), the agent has something to do, and you have something to verify.
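The difference between a workflow and an essay is concrete enough to check mechanically. The sketch below is an assumption-heavy illustration, not code from the Agent Skills repo: the sample skill text, the front matter fields `name` and `description`, the "Exit criteria:" convention, and both function names are made up for this example. The steps themselves are the test-first sequence described above.

```python
import re

# A hypothetical skill file: front matter, numbered steps, an exit criterion.
SKILL_TEXT = """\
---
name: test-first-change
description: Use when implementing a behavior change that a test can cover.
---
1. Write the failing test first.
2. Run it and watch it fail.
3. Write the minimum code to pass.
4. Run it and watch it pass.
5. Refactor.

Exit criteria: the new test passes and the full suite is green.
"""

# A hypothetical "essay" skill: same front matter, no steps, no exit criterion.
ESSAY_TEXT = """\
---
name: testing-essay
description: Everything you should know about testing.
---
Testing matters because it builds confidence in every change.
"""


def split_front_matter(text: str) -> tuple[dict[str, str], str]:
    """Return (metadata, body) for a file that starts with a '---' delimited block."""
    match = re.match(r"^---\n(.*?)\n---\n(.*)$", text, re.DOTALL)
    if match is None:
        raise ValueError("missing front matter")
    metadata = {}
    for line in match.group(1).splitlines():
        key, _, value = line.partition(":")
        metadata[key.strip()] = value.strip()
    return metadata, match.group(2)


def workflow_problems(text: str) -> list[str]:
    """List the ways a skill file falls short of being a checkable workflow."""
    metadata, body = split_front_matter(text)
    problems = []
    for field in ("name", "description"):
        if not metadata.get(field):
            problems.append(f"missing front matter field: {field}")
    if len(re.findall(r"^\d+\.\s+\S", body, re.MULTILINE)) < 2:
        problems.append("fewer than two numbered steps")
    if not re.search(r"^Exit criteria:", body, re.MULTILINE):
        problems.append("no exit criteria")
    return problems


print(workflow_problems(SKILL_TEXT))
print(workflow_problems(ESSAY_TEXT))
```

A check like this says nothing about whether the steps are good ones, only whether there is something for the agent to do and something for you to verify.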

Process over prose. Workflows over reference. Steps with exit criteria over essays without them. That single distinction separates a useful skill from a pretty Markdown file. It also explains why so many “AI rules” repos end up doing nothing in practice. The rules are essays.
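
To make the distinction concrete, here is a minimal Python sketch of a skill modeled as a workflow instead of an essay. The `Step` and `Workflow` names and the idea of storing evidence as plain strings are assumptions for illustration; the repo's actual skills are Markdown files, not Python objects.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Step:
    """One workflow step; it only counts as done once evidence is attached."""
    name: str
    exit_criterion: str
    evidence: Optional[str] = None  # e.g. a test log excerpt or command output


@dataclass
class Workflow:
    """Hypothetical in-memory model of a skill: ordered steps with exit checks."""
    name: str
    steps: list = field(default_factory=list)

    def record(self, step_name: str, evidence: str) -> None:
        """Attach evidence to the named step, or fail loudly if it does not exist."""
        for step in self.steps:
            if step.name == step_name:
                step.evidence = evidence
                return
        raise KeyError(f"unknown step: {step_name}")

    def missing(self) -> list:
        """Names of steps that still lack evidence, in workflow order."""
        return [s.name for s in self.steps if not s.evidence]

    def is_done(self) -> bool:
        return not self.missing()


# The test-first loop described above, expressed as steps with exit criteria.
tdd = Workflow(
    name="test-driven-development",
    steps=[
        Step("write failing test", "a new test exists"),
        Step("watch it fail", "test run shows the new test failing"),
        Step("write minimum code", "implementation added"),
        Step("watch it pass", "test run shows the new test passing"),
    ],
)
tdd.record("write failing test", "illustrative note: new test added for empty input")
```

An essay has no equivalent of `missing()`. A workflow does, which is what gives you something to verify.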

The SDLC the skills encode

The 20 skills in the repo organize around six lifecycle phases, with seven slash commands sitting on top. Define (/spec) is where you decide what you’re actually building. Plan (/plan) breaks the work down. Build (/build) implements it in vertical slices. Verify (/test) proves it works. Review (/review) catches what slipped through. Ship (/ship) gets it to users safely. /code-simplify sits across the bottom of the whole thing.

This isn’t a coincidence. It’s the same SDLC every functioning engineering organization runs, just in different vocabulary. Google calls it design doc → review → implementation → readability review → launch checklist. Amazon calls it the working-backward memo and the bar raiser. Every healthy team has some version of this loop.

What’s new with AI coding agents is that most agents skip most of these phases by default. You ask for a feature, you get an implementation, and the spec, plan, tests, review, and launch checklist all just don’t happen. Skills push the agent through the same phases a senior engineer forces themselves through, because shipping the code without them is how you produce incidents.

A complex feature might activate eleven skills in sequence. A small bug fix might use three. The router (using-agent-skills) decides which apply. The point is that the workflow scales to the actual scope, not to the assumed scope.

Five principles that are doing the work

Five design decisions in the project are the loadbearing ones. The rest of the system follows from them.

1. Process over prose

Already covered. Workflows are agent-actionable; essays are not. The same is true for human teams. If your team handbook is 200 pages, no one reads it under time pressure. If it’s a small set of workflows with checkpoints, people actually run them.

2. Anti-rationalization tables

This is the most distinctive design decision in the project, and the one I most want other teams to steal.

Each skill includes a table of common excuses an agent (or a tired engineer) might use to skip the workflow, paired with a written rebuttal. A few examples close to the originals:

  • “This task is too simple to need a spec.” → Acceptance criteria still apply. Five lines is fine. Zero lines is not.
  • “I’ll write tests later.” → Later is the loadbearing word. There is no later. Write the failing test first.
  • “Tests pass, ship it.” → Passing tests are evidence, not proof. Did you check the runtime? Did you verify user-visible behavior? Did a human read the diff?

The reason this works is that LLMs are excellent at rationalization. They will produce a plausible-sounding paragraph explaining why this particular task doesn’t need a spec or why this particular change is fine to merge without review. Anti-rationalization tables are prewritten rebuttals to lies the agent hasn’t yet told.
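
As a rough illustration, the table can be thought of as a lookup from excuse to rebuttal. The sketch below is a simplification under stated assumptions: in the project the tables are text inside each skill that the agent reads, and the substring matching, the `ANTI_RATIONALIZATIONS` name, and the `rebuttals_for` helper are all hypothetical.

```python
# Hypothetical table: excuse pattern -> rebuttal, paraphrasing the examples above.
ANTI_RATIONALIZATIONS = [
    ("too simple",
     "Acceptance criteria still apply. Five lines is fine. Zero lines is not."),
    ("tests later",
     "There is no later. Write the failing test first."),
    ("tests pass",
     "Passing tests are evidence, not proof. Check the runtime and have a human read the diff."),
]


def rebuttals_for(agent_message: str) -> list:
    """Return every rebuttal whose excuse pattern appears in the message."""
    text = agent_message.lower()
    return [rebuttal for pattern, rebuttal in ANTI_RATIONALIZATIONS if pattern in text]
```

Calling `rebuttals_for("This task is too simple to need a spec.")` returns the first rebuttal. A message that matches no pattern returns an empty list.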

The pattern is just as good for human teams. Most engineering decay isn’t anyone choosing to do bad work. It’s people accepting plausible-sounding justifications for skipping the parts they don’t feel like doing. A team that writes down its anti-rationalizations is a team that has fewer of them.

3. Verification is nonnegotiable

Every skill terminates in concrete evidence. Tests pass. Build output is clean. The runtime trace shows the expected behavior. A reviewer signs off. “Seems right” is never sufficient.

This is the same principle that makes Anthropic’s harness recover from failures, that makes Cursor’s planner/worker/judge split actually catch bugs, and that makes any long-running agent recoverable. The agent is a generator. You need a separate signal that the work is done. Skills bake that signal into every workflow.
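
One way to picture that separate signal is a gate that closes a task only on recorded evidence. This is a minimal sketch, assuming evidence comes from a command's exit code; the `verify` and `close_task` helpers are hypothetical, and the stand-in commands exist only so the example runs anywhere.

```python
import subprocess
import sys


def verify(command: list) -> dict:
    """Run a check and return its result as evidence, not as an opinion."""
    result = subprocess.run(command, capture_output=True, text=True)
    return {
        "command": command,
        "passed": result.returncode == 0,
        "output": (result.stdout + result.stderr).strip(),
    }


def close_task(evidence: list) -> bool:
    """A task closes only if there is evidence and all of it passed."""
    return bool(evidence) and all(item["passed"] for item in evidence)


# Stand-in checks; a real setup would call the project's own test or build
# command here instead of these one-line Python programs.
ok = verify([sys.executable, "-c", "print('1 passed')"])
bad = verify([sys.executable, "-c", "import sys; sys.exit(1)"])
```

Note that `close_task([])` is false: no evidence is treated the same as failing evidence, which is the code equivalent of “seems right” never being sufficient.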

4. Progressive disclosure

Do not load all 20 skills into context at session start. Activate them based on the phase. A small meta-skill (using-agent-skills) acts as a router that decides which skill applies to the current task.

This is the harness engineering lesson applied at skill granularity. Every token loaded into context degrades performance somewhere, so you load what’s relevant and leave the rest on disk. Progressive disclosure is how you get a 20-skill library into a 5K-token slot without poisoning the well.
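
A minimal sketch of the routing idea, assuming a fixed phase-to-skill mapping and a token budget. The skill names appear elsewhere in this post, but the mapping, the token costs, and the `select_skills` helper are made up for illustration and are not how using-agent-skills is implemented.

```python
# Hypothetical phase-to-skill mapping; which skill belongs to which phase is
# an assumption for this sketch.
SKILLS_BY_PHASE = {
    "verify": ["test-driven-development"],
    "review": ["code-review-and-quality", "code-simplification"],
    "ship": ["ci-cd-and-automation", "git-workflow-and-versioning"],
}

# Invented token costs, only here to make the budget arithmetic visible.
TOKEN_COST = {
    "test-driven-development": 1800,
    "code-review-and-quality": 2200,
    "code-simplification": 1500,
    "ci-cd-and-automation": 2000,
    "git-workflow-and-versioning": 1200,
}


def select_skills(phase: str, budget: int) -> list:
    """Pick the skills for one phase, in order, without exceeding the budget."""
    selected, used = [], 0
    for skill in SKILLS_BY_PHASE.get(phase, []):
        cost = TOKEN_COST[skill]
        if used + cost <= budget:
            selected.append(skill)
            used += cost
    return selected
```

With these numbers, `select_skills("review", 5000)` loads both review skills, while a budget of 3000 loads only the first. Everything else stays on disk.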

5. Scope discipline

The meta-skill encodes a nonnegotiable I’d staple to every agent if I could: “touch only what you’re asked to touch.” Don’t refactor adjacent systems. Don’t remove code you don’t fully understand. Don’t brush against a TODO and decide to rewrite the file.

This sounds obvious until you watch an agent decide that fixing one bug requires modernizing three unrelated files. Scope discipline is the single biggest determinant of whether an agent’s PR is mergeable or has to be unwound. It’s also the principle that maps most cleanly onto Google’s code review norms, where reviewers will block a PR for doing more than one thing.
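
Scope discipline is also easy to check mechanically. The sketch below assumes the task declares which paths it may touch as glob patterns; the `out_of_scope` helper and the example paths are hypothetical and not part of the project.

```python
from fnmatch import fnmatchcase


def out_of_scope(changed_files: list, allowed_patterns: list) -> list:
    """Return changed files that match none of the allowed glob patterns."""
    return [
        path
        for path in changed_files
        if not any(fnmatchcase(path, pattern) for pattern in allowed_patterns)
    ]


# Illustrative task: a billing bug fix that should only touch billing code.
violations = out_of_scope(
    ["src/billing/invoice.py", "tests/billing/test_invoice.py", "src/auth/session.py"],
    ["src/billing/*", "tests/billing/*"],
)
```

Here `violations` contains only `src/auth/session.py`, the file the agent was never asked to touch.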

The Google DNA

The skills are saturated with practices from Software Engineering at Google and Google’s public engineering culture. This is intentional. Most of what makes Google-scale software work is documented and public, and it is exactly the part agents are most likely to skip.

A partial map of which skill encodes which practice:

  • Hyrum’s law in api-and-interface-design. Every observable behavior of your API will eventually be depended on by someone, so design with that in mind.
  • The test pyramid (~80/15/5) and the Beyoncé rule in test-driven-development. “If you liked it, you should have put a test on it.” Infrastructure changes don’t catch bugs; tests do.
  • DAMP over DRY in tests. Google’s testing philosophy is explicit that test code should read like a specification even at the cost of some duplication. Overabstracted tests are a known antipattern.
  • ~100-line PR sizing, with Critical/Nit/Optional/FYI severity labels in code-review-and-quality. Straight from Google’s code review norms. Big PRs don’t get reviewed; they get rubber-stamped.
  • Chesterton’s Fence in code-simplification. Don’t remove a thing until you understand why it was put there.
  • Trunk-based development and atomic commits in git-workflow-and-versioning.
  • Shift left and feature flags in ci-cd-and-automation. Catch problems as early as possible, decouple deploy from release.
  • Code-as-liability in deprecation-and-migration. Every line you keep is one you have to maintain forever, so prefer the smaller surface.

None of these are new ideas. The point is that none of them are in the agent by default. A frontier model has read the phrase “Hyrum’s law” in its training data, but it does not apply Hyrum’s law when it’s designing your API at 3am. Skills are how you make sure it does.
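
As one example of turning a listed practice into a check, here is a minimal sketch of the PR sizing rule. It assumes input in `git diff` format and uses the rough 100-line figure from the list above as a default; the `changed_lines` and `review_size` helpers are hypothetical and not taken from the code-review-and-quality skill.

```python
def changed_lines(git_diff: str) -> int:
    """Count added and removed lines inside hunks of a `git diff` output."""
    count, in_hunk = 0, False
    for line in git_diff.splitlines():
        if line.startswith("diff --git"):
            in_hunk = False  # next file: skip its ---/+++ header lines
        elif line.startswith("@@"):
            in_hunk = True  # hunk header: changes follow
        elif in_hunk and line.startswith(("+", "-")):
            count += 1
    return count


def review_size(git_diff: str, limit: int = 100) -> str:
    """Return "ok" if the change is within the limit, else "split"."""
    return "ok" if changed_lines(git_diff) <= limit else "split"
```

A check like this does not make a PR good. It only makes the oversized ones visible before a reviewer is tempted to rubber-stamp them.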

How to actually use it

Three modes, in roughly increasing commitment.

Mode 1: Install via marketplace. If you’re using Claude Code:

/plugin marketplace add addyosmani/agent-skills 
/plugin install agent-skills@addy-agent-skills

You get the slash commands (/spec, /plan, /build, /test, /review, /ship, /code-simplify) and the agent activates the relevant skills automatically based on context. This is the path I’d recommend most people start on.

Mode 2: Drop the Markdown into your tool of choice. The skills are plain Markdown with front matter. Cursor users put them in .cursor/rules/. Gemini CLI has its own install path. Codex, Aider, Windsurf, OpenCode, anything that accepts a system prompt can read them. The tooling matters less than the workflow underneath.

Mode 3: Read them as a spec. Even if you never install anything, the skills are a documented description of what good engineering with AI agents looks like. Read code-review-and-quality.md and apply the five-axis framework to your team’s review process. Read test-driven-development.md and use it to settle the next “do we need to write the test first” argument with a junior. Read the meta-skill and steal the five nonnegotiables for your own AGENTS.md.

This third mode is where I’d actually start. Pick the four or five skills closest to your current pain. Decide which workflows you want enforced. Then install the runtime, or roll your own, to do the enforcing.

What to steal even if you never install

A few patterns from the project I’d steal regardless of whether you use AI coding agents at all:

Anti-rationalization as a team practice. Write down the lies your team tells itself. “We’ll fix the tests after launch.” “This change is too small for a design doc.” “It’s fine, we have monitoring.” Pair each with the rebuttal. Put it in your AGENTS.md or your engineering wiki. It will save you arguments and it will catch the next tired Friday-afternoon shortcut.

Process over prose for anything you write internally. If you find yourself writing a 2,000-word doc titled “how we approach X,” you’ve written reference material. Convert it to a workflow with checkpoints. The doc shrinks to 400 words and people actually run it. This applies as much to onboarding guides and runbooks as it does to agent skills.

Verification as a hard exit criterion. Make “produce evidence” the exit step of every task. For agents, for engineers, for yourself. Evidence is whatever proves the work is done: a green test run, a screenshot, a log, a review approval. Without it, the task is not done. “Seems right” never closes the loop.

Progressive disclosure for any rulebook. Do not write a 50-page handbook. Write a small router that points to the right small chapter for the situation. This is true for AGENTS.md, for runbooks, for incident playbooks, for anything anyone will read under time pressure.

Five nonnegotiables, lifted from the meta-skill, that I’d put in any AGENTS.md tomorrow:

  1. Surface assumptions before building. Wrong assumptions held silently are the most common failure mode.
  2. Stop and ask when requirements conflict. Don’t guess.
  3. Push back when warranted. The agent (or engineer) is not a yes-machine.
  4. Prefer the boring, obvious solution. Cleverness is expensive.
  5. Touch only what you’re asked to touch.

That’s a worthwhile engineering culture in five lines, and you don’t need to install anything to adopt it.

Where this fits in the harness

In the broader picture, skills are one layer of agent harness engineering. The harness is the model plus everything you build around it; skills are the reusable workflow chunks that get progressively disclosed into the system prompt. They sit alongside AGENTS.md (the rolling rulebook), hooks (the deterministic enforcement layer), tools (the actions the agent can take), and the session log (the durable memory). Each layer has a specific job. Skills do the senior-engineer-process job.

Skills matter more for long-running agents than they do for chat-style ones, because long runs amplify every shortcut. An agent that skips the test in a 10-minute session produces one bug. An agent that skips the test in a 30-hour session produces a debugging archaeology project at the end of the run, when no one remembers what the original intent was. The longer the run, the more the senior-engineer scaffolding has to be enforced rather than suggested.

The portability of the skills format matters too. The same SKILL.md file works in Claude Code, Cursor (with rules), Gemini CLI, Codex, and any other harness that accepts system-prompt content. Write the workflow once, and the runtime enforces it. That’s the thing the Markdown-with-front-matter format buys you that bespoke prompt engineering does not.

Closing

The thing I most want people to take from this project, more than the skills themselves, is the framing.

AI coding agents are extremely capable junior engineers with no instinct for the parts of the job that don’t show up in the diff. The senior-engineering work (surfacing assumptions, sizing changes, writing the spec, leaving evidence, refusing to merge what can’t be reviewed) is exactly what an agent will skip unless you make it impossible to skip. The job, increasingly, is to encode that discipline as something the agent cannot talk itself out of.

Skills are one shape of that. Anti-rationalization tables. Progressive disclosure. Process over prose. Verification as the loadbearing exit criterion. The Google practices that already work, made portable.

You can install my version. You can roll your own. The lesson stands either way: The senior-engineer parts of the job are no longer optional, even when the engineer is a model.

The repo is at github.com/addyosmani/agent-skills (MIT). For the broader scaffolding picture, see “Agent Harness Engineering” and “Long-Running Agents.”


