Sr. Content Developer at Microsoft, working remotely in PA, TechBash conference organizer, former Microsoft MVP, Husband, Dad and Geek.

Specification-Driven Development: How to Stop Vibe Coding and Actually Ship Production-Ready AI-Generated Code


You've seen the demos. An engineer types "build me a SaaS dashboard" into an AI coding agent, and in three minutes, a fully styled React app appears with auth, a database, and a payment flow. The crowd applauds. The tweet goes viral.

Then you try the same thing on your actual project — the one with 200 files, a custom auth layer, three external API integrations, and a monorepo structure — and the agent confidently rewrites your database schema, deletes a critical middleware, and introduces four security vulnerabilities before you can hit Ctrl+C.

This is the vibe coding gap. The distance between "works in a demo" and "works in production" isn't a tooling problem. It's a methodology problem. And the methodology that closes this gap has a name: Specification-Driven Development (SDD).

SDD isn't about writing more documentation. It's about giving AI agents exactly the constraints they need to operate reliably — through spec files, test-first loops, and a four-phase workflow that replaces "generate and pray" with "specify, design, task, implement." Teams adopting SDD consistently report 60-80% fewer AI-generated regressions and dramatically less time spent debugging code they didn't write.

This guide covers the full methodology: why vibe coding fails at scale, the four-phase SDD loop, how to structure spec files (CLAUDE.md, cursor rules, AGENTS.md), integrating TDD with AI agents, and production-grade patterns for orchestrating agents on real-world codebases.

Why Vibe Coding Breaks Down

Let's be precise about what fails and why. Vibe coding — the practice of prompting an AI with natural language and immediately deploying the output — works surprisingly well for greenfield prototypes. But it systematically fails on production codebases for four specific reasons:

1. The Context Collapse Problem

AI coding agents are stateless. Every new session starts from zero. Your 200-file project has implicit knowledge embedded everywhere: naming conventions, error handling patterns, the fact that userId in your auth layer is a UUID but user_id in your legacy API is a sequential integer. None of this exists in the agent's context unless you explicitly provide it.

Vibe coding provides no mechanism for this. You type a prompt, the agent generates code based on its training data, and the output follows the agent's default patterns — not yours. The result is code that "works" in isolation but creates subtle inconsistencies that compound into architectural drift.

Context Collapse in Practice:

  Your project:          What the agent generates:
  ─────────────          ──────────────────────────
  camelCase everywhere   snake_case in new files
  Zod for validation     Manual if-checks
  Custom error classes   Generic throw new Error()
  Repository pattern     Direct database calls
  UUID primary keys      Auto-increment integers

2. The Happy Path Bias

LLMs are trained predominantly on tutorial code, documentation examples, and Stack Overflow answers. This training data overwhelmingly demonstrates the "happy path" — when everything goes right. The result is AI-generated code that handles the primary flow gracefully but crumbles at boundaries:

  • Network timeouts? Not handled.
  • Concurrent requests to the same resource? Race conditions.
  • Database connection pool exhaustion? Crashes.
  • Malformed user input? Unvalidated.
  • Rate limits from external APIs? Ignored.
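None of these boundary conditions is hard to handle; the gap is that nobody asked for them. As a hedged sketch of what non-happy-path handling looks like around an external API call (the helper name, retry policy, and backoff constants are illustrative, not from any real project), assuming Node 18+ with the global `fetch`:

```typescript
// Illustrative helper: timeout, retry, and 429 backoff around fetch.
// Assumes Node 18+ (global fetch and AbortController).
async function fetchWithRetry(
  url: string,
  opts: { timeoutMs?: number; retries?: number } = {},
): Promise<Response> {
  const { timeoutMs = 5_000, retries = 2 } = opts;
  let lastError: unknown = new Error('fetchWithRetry: no attempts made');

  for (let attempt = 0; attempt <= retries; attempt++) {
    // Network timeout: abort the request instead of hanging forever.
    const controller = new AbortController();
    const timer = setTimeout(() => controller.abort(), timeoutMs);
    try {
      const res = await fetch(url, { signal: controller.signal });
      if (res.status === 429) {
        // Rate limit from the external API: back off exponentially, then retry.
        lastError = new Error('rate limited');
        await new Promise((r) => setTimeout(r, 2 ** attempt * 250));
        continue;
      }
      return res;
    } catch (err) {
      lastError = err; // timeout or network failure; retry on next attempt
    } finally {
      clearTimeout(timer);
    }
  }
  throw lastError;
}
```

Every branch in this sketch corresponds to a bullet above; a vibe-coded version typically contains only the single `fetch` line.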

A 2025 study from Endor Labs found that 62% of AI-generated code contained security weaknesses or design flaws, with 44-49% of AI-suggested dependencies carrying known vulnerabilities. The Verizon DBIR 2025 report found third-party involvement in breaches doubled to 30%, underscoring the growing risk of unvetted AI-generated code entering software supply chains.

3. The Ownership Void

Vibe coding produces code that its supposed owner doesn't understand. The developer shipped it, but they can't explain the control flow, the error handling strategy, or why a particular library was chosen. When bugs appear in production at 2 AM, they face code written by an alien intelligence with different assumptions about how error states propagate.

This isn't a theoretical concern — it's the single most common complaint from engineering managers about AI-assisted development in 2026. The code works until it doesn't, and when it doesn't, nobody can fix it quickly because nobody understands it deeply.

4. The Compounding Technical Debt

Each vibe-coded session adds code that follows the agent's implicit patterns rather than the project's established ones. After ten sessions, you don't have one codebase — you have an accretion of ten slightly different coding styles, error handling approaches, and architectural assumptions. This technical debt compounds exponentially because each new AI session inherits the confusion of the previous ones.

The Four-Phase SDD Loop

Specification-Driven Development replaces the single "prompt → code" step with a four-phase loop that builds context before generating code. Each phase produces a document artifact that the AI agent can reference in subsequent phases.

The SDD Loop:

  Phase 1: REQUIREMENTS  ──→  requirements.md
       ↓                         "What are we building?"
  Phase 2: DESIGN        ──→  design.md
       ↓                         "How will we build it?"
  Phase 3: TASKS         ──→  tasks.md
       ↓                         "What exact steps in what order?"
  Phase 4: IMPLEMENTATION ──→  code + tests
       ↓                         "Build it, test it, verify it."
       └──── FEEDBACK ────→  Update specs, repeat.

Phase 1: Requirements

Start every feature by making the AI help you clarify what you're building. Don't ask for code — ask for a requirements analysis.

Prompt to your AI agent:

"I need to add a team invitation system to our SaaS app.
Before writing any code, create a requirements.md that covers:
- User stories for the invitation flow
- Edge cases (expired invites, duplicate emails, role conflicts)
- Security requirements (rate limiting, token validation)
- Integration points with our existing auth system
- What we are NOT building in this phase"

The "what we are NOT building" constraint is critical. Without explicit scope boundaries, AI agents tend to over-build — adding features that weren't requested, introducing premature abstractions, and expanding scope until the change touches half the codebase.

The output is a requirements.md file that both the human and the AI agent can reference. This document becomes the source of truth for the feature's scope.

Phase 2: Design

With requirements locked, translate them into technical decisions. Again, no code yet — just architecture.

Prompt to your AI agent:

"Read requirements.md. Now create a design.md that covers:
- Database schema changes (what tables/columns, with migration strategy)
- API endpoints (method, path, request/response shapes)
- Service layer architecture (what functions, what dependencies)
- Error handling strategy (what errors, what HTTP codes, what messages)
- State machine for invitation lifecycle (pending → accepted/expired/revoked)

Reference our existing patterns in src/services/ and src/api/.
Do NOT write implementation code."

This phase catches architectural mistakes before any code exists. The AI might propose a design that conflicts with your existing patterns. It's infinitely cheaper to fix a design document than to debug a half-implemented feature.
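The "state machine for invitation lifecycle" item above can be made concrete with a few lines of code. A minimal sketch in TypeScript (the names are illustrative; a real design.md would pin down the exact states and transitions for your domain):

```typescript
// Invitation lifecycle: pending → accepted | expired | revoked.
// Expressed as a typed transition table so invalid moves fail fast.
type InvitationStatus = 'pending' | 'accepted' | 'expired' | 'revoked';

// Terminal states have no outgoing transitions.
const transitions: Record<InvitationStatus, InvitationStatus[]> = {
  pending: ['accepted', 'expired', 'revoked'],
  accepted: [],
  expired: [],
  revoked: [],
};

function canTransition(from: InvitationStatus, to: InvitationStatus): boolean {
  return transitions[from].includes(to);
}
```

Writing this table down in the design phase forces the "can a revoked invitation be re-accepted?" conversation before any service code exists.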

Phase 3: Tasks

Break the design into discrete, dependency-ordered implementation steps. Each task should be small enough that the AI agent can complete it in a single focused session.

Prompt to your AI agent:

"Read requirements.md and design.md. Create a tasks.md with:
- Numbered, ordered implementation steps
- Dependencies between steps (what must be done before what)
- Expected test coverage for each step
- Estimated complexity (S/M/L) for each step

Group tasks by: database → service layer → API → integration tests"

Example tasks.md output:

## Team Invitation System — Implementation Tasks

### Database Layer
- [x] Task 1 (S): Create `team_invitations` migration
  - Columns: id, team_id, email, role, token, status, expires_at
  - Tests: migration up/down, constraint validation

### Service Layer
- [ ] Task 2 (M): Implement `InvitationService.create()`
  - Depends on: Task 1
  - Validates: email format, duplicate check, team member limit
  - Tests: happy path, duplicate rejection, limit enforcement

- [ ] Task 3 (M): Implement `InvitationService.accept()`
  - Depends on: Task 2
  - Validates: token exists, not expired, not already accepted
  - Tests: valid acceptance, expired token, already-used token

### API Layer
- [ ] Task 4 (M): POST /api/teams/:id/invitations
  - Depends on: Task 2
  - Auth: team admin role required
  - Tests: 201 success, 403 non-admin, 409 duplicate, 429 rate limit

- [ ] Task 5 (M): POST /api/invitations/:token/accept
  - Depends on: Task 3
  - Auth: must be logged in, email must match invitation
  - Tests: 200 success, 404 invalid token, 410 expired

Phase 4: Implementation

Now — and only now — the AI writes code. But instead of a single massive prompt, you feed it one task at a time, with full context:

Prompt to your AI agent:

"Read requirements.md, design.md, and tasks.md.
Implement Task 2: InvitationService.create()

Requirements:
- Follow our existing service pattern in src/services/TeamService.ts
- Use Zod for input validation
- Use our custom AppError class for error handling
- Write tests first (test file, then implementation)
- Run tests after implementation to verify

Do NOT modify any existing files except to add imports.
Do NOT implement tasks 3-5 yet."

The constraints in this prompt are doing the heavy lifting. "Follow our existing pattern" prevents context collapse. "Write tests first" enforces TDD. "Do NOT modify existing files except imports" prevents the agent from "helpfully" refactoring unrelated code. "Do NOT implement tasks 3-5" prevents scope creep.
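To make the "custom AppError class" and error-message conventions concrete, here is a dependency-free sketch. In the article's project the input check would be a Zod schema; a minimal hand-rolled check stands in here so the example runs standalone, and the `AppError` shape is an assumption, not a real codebase:

```typescript
// Hypothetical AppError, standing in for the project's src/lib/errors.ts.
class AppError extends Error {
  constructor(message: string, public readonly statusCode: number) {
    super(message);
    this.name = 'AppError';
  }
}

interface CreateInvitationInput {
  teamId: string;
  email: string;
  role: 'admin' | 'member';
}

function validateCreateInvitation(input: CreateInvitationInput): void {
  // Error messages follow the "[Entity].[action] failed: [reason]" pattern.
  if (!/^[^@\s]+@[^@\s]+\.[^@\s]+$/.test(input.email)) {
    throw new AppError('Invitation.create failed: invalid email', 422);
  }
}
```

The point is not the validation logic itself but that the error class and message format are spelled out somewhere the agent can copy them from, rather than left to its defaults.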

Structuring Your Spec Files

The SDD loop depends on persistent context — files that survive across sessions and tell every AI agent how to behave in your project. Here's how to structure them:

CLAUDE.md / Cursor Rules / AGENTS.md

These files serve the same purpose across different tools: they're the project's "briefing document" that the AI reads at the start of every session.

# CLAUDE.md (or .cursor/rules, or AGENTS.md)

## Project Overview
E-commerce SaaS platform built with Next.js 16, TypeScript, 
Drizzle ORM, PostgreSQL. Monorepo managed with Turborepo.

## Tech Stack
- Framework: Next.js 16 (App Router, Server Components)
- Language: TypeScript 6.0 (strict mode)
- Database: PostgreSQL 17 + Drizzle ORM
- Validation: Zod v3
- Styling: Tailwind CSS v4
- Testing: Vitest + Playwright
- Auth: Custom JWT + refresh token rotation

## Architecture Rules
1. All database access goes through repository classes in src/repositories/
2. Business logic lives in service classes in src/services/
3. API routes are thin controllers that call services
4. All inputs are validated with Zod schemas defined in src/schemas/
5. Errors use custom AppError class (see src/lib/errors.ts)
6. All IDs are UUIDs generated with crypto.randomUUID()

## Coding Conventions
- Use named exports, not default exports
- Use explicit return types on all public functions
- Error messages follow the pattern: "[Entity].[action] failed: [reason]"
- File naming: kebab-case for files, PascalCase for classes
- Imports: group by external → internal → types

## Common Commands
- `pnpm test` — Run all tests
- `pnpm test:watch` — Watch mode
- `pnpm db:migrate` — Run pending migrations
- `pnpm lint` — Biome lint + format check

## Critical Warnings
- NEVER use `any` type. Use `unknown` with type narrowing.
- NEVER use default exports. Named exports only.
- NEVER modify migration files. Create new ones for changes.
- NEVER commit .env files. Use .env.example for documentation.

Key Principle: Route, Don't Dump

Keep your root spec file under 200-300 lines. It should tell the agent what to do and where to find more information, not contain every rule in the project.

## Reference Documents
- Architecture decisions: /docs/architecture.md
- API design conventions: /docs/api-conventions.md
- Database schema: /docs/schema.md
- Feature specs: /docs/specs/[feature-name].md

The AI agent reads the root file, then reads referenced documents on-demand when working on relevant areas. This is progressive disclosure — the same UX principle that makes good software, applied to AI context management.

Per-Directory Rules

Many AI tools support directory-scoped rules. Use these for domain-specific constraints:

# src/services/.rules (or src/services/.cursorrules)

## Service Layer Rules
- Every service method must be async
- Services receive dependencies through constructor injection
- Services must not import from src/api/ (no circular deps)
- Every public method must have a corresponding test
- Use transactions for multi-table operations
- Log entry and exit of critical operations using logger.info()
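A sketch of what those rules produce in practice (the interfaces are hypothetical stand-ins for whatever the project actually defines):

```typescript
// Dependencies are narrow interfaces so tests can pass in-memory fakes.
interface InvitationRepository {
  insert(email: string): Promise<{ id: string }>;
}

interface Logger {
  info(msg: string): void;
}

class InvitationService {
  // Constructor injection per the rules above: no direct imports of
  // concrete repositories, no reaching into src/api/.
  constructor(
    private readonly repo: InvitationRepository,
    private readonly logger: Logger,
  ) {}

  async create(email: string): Promise<{ id: string }> {
    this.logger.info('InvitationService.create: start'); // entry log
    const row = await this.repo.insert(email);
    this.logger.info('InvitationService.create: done'); // exit log
    return row;
  }
}
```

Because the rules file is scoped to src/services/, the agent sees these constraints only when it is actually editing service code, which keeps the root spec file lean.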

TDD with AI Agents: The Red-Green-Refactor Loop

Test-Driven Development isn't just compatible with AI agents — it's the ideal workflow for them. Tests provide the one thing AI agents desperately need: an objective, verifiable definition of "correct."

Why TDD Works Better With AI Than Without

Without tests, you ask the AI to generate code and then manually review it for correctness. This is cognitively exhausting and error-prone — you're reading code written by an alien intelligence and trying to spot bugs in unfamiliar patterns.

With TDD, you define correctness first, then let the AI generate code until the tests pass. You're not reviewing implementation details — you're reviewing outcomes. The test suite is your automated verifier.

Traditional Flow (fragile):
  Prompt → Code → Manual Review → "Looks right?" → Ship → Bug

TDD Flow (reliable):
  Spec → Test (failing) → Prompt AI → Code → Run Tests → Pass? → Ship
                                          ↓
                                       Fail → AI iterates automatically

The Spec → Test → Implement Pattern

Here's the practical workflow for TDD with an AI agent:

Step 1: Write the test spec (human-driven)

// __tests__/services/invitation-service.test.ts
import { describe, it, expect, beforeEach } from 'vitest';
import { InvitationService } from '@/services/invitation-service';

describe('InvitationService.create', () => {
  let service: InvitationService;

  beforeEach(() => {
    // Instantiate with test doubles; the concrete dependencies are
    // project-specific and omitted here.
    service = new InvitationService(/* test repository, clock, etc. */);
  });
  it('should create an invitation with a valid email and role', async () => {
    const result = await service.create({
      teamId: 'team-uuid-1',
      email: 'new@example.com',
      role: 'member',
      invitedBy: 'admin-uuid-1',
    });

    expect(result.id).toBeDefined();
    expect(result.status).toBe('pending');
    expect(result.token).toHaveLength(64);
    expect(result.expiresAt).toBeInstanceOf(Date);
  });

  it('should reject duplicate invitations for the same email', async () => {
    await service.create({ teamId: 'team-uuid-1', email: 'dup@example.com', role: 'member', invitedBy: 'admin-uuid-1' });

    await expect(
      service.create({ teamId: 'team-uuid-1', email: 'dup@example.com', role: 'member', invitedBy: 'admin-uuid-1' })
    ).rejects.toThrow('Invitation.create failed: duplicate invitation');
  });

  it('should enforce team member limit', async () => {
    // Assume team already has max members
    await expect(
      service.create({ teamId: 'full-team', email: 'extra@example.com', role: 'member', invitedBy: 'admin-uuid-1' })
    ).rejects.toThrow('Invitation.create failed: team member limit reached');
  });

  it('should set expiration to 7 days from creation', async () => {
    const before = new Date();
    const result = await service.create({
      teamId: 'team-uuid-1',
      email: 'timed@example.com',
      role: 'member',
      invitedBy: 'admin-uuid-1',
    });
    const after = new Date();

    const sevenDaysMs = 7 * 24 * 60 * 60 * 1000;
    expect(result.expiresAt.getTime()).toBeGreaterThanOrEqual(before.getTime() + sevenDaysMs);
    expect(result.expiresAt.getTime()).toBeLessThanOrEqual(after.getTime() + sevenDaysMs);
  });
});
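Note that the tests pin the token to exactly 64 characters. The article never shows the generator, but one way to satisfy that assertion (an assumption, not the author's implementation) is 32 random bytes hex-encoded, which yields exactly 64 characters:

```typescript
import { randomBytes } from 'node:crypto';

// 32 cryptographically random bytes → 64 hex characters.
function generateInvitationToken(): string {
  return randomBytes(32).toString('hex');
}
```

This is exactly the kind of detail worth encoding as a test: the AI is free to choose any generator, as long as the output is 64 characters of unguessable token.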

Step 2: Let the AI implement (agent-driven)

Prompt:

"The test file at __tests__/services/invitation-service.test.ts
defines the expected behavior. Implement InvitationService.create()
in src/services/invitation-service.ts to make all tests pass.

Follow the service pattern established in src/services/team-service.ts.
Use Zod for input validation (schema in src/schemas/invitation.ts).
Run `pnpm test __tests__/services/invitation-service.test.ts` after
implementation. Iterate until all tests pass."

The agent generates code, runs the tests, sees failures, and iterates — automatically. This is the Red-Green loop, but the AI is the one cycling through it. You defined the "what" (tests), and the AI figures out the "how" (implementation).

Step 3: Human review (outcome-focused)

Once tests pass, review the implementation for:

  • Does it follow project patterns? (Check against CLAUDE.md)
  • Are there performance concerns? (N+1 queries, missing indexes)
  • Are there security concerns? (Input validation, auth checks)

You're reviewing with purpose, not scanning line-by-line through unfamiliar code hoping to spot a bug.

Failure Patterns and How SDD Prevents Them

Let's map the most common AI coding failures to specific SDD mechanisms that prevent them:

| Failure Pattern | Root Cause | SDD Prevention |
| --- | --- | --- |
| Context collapse | Agent doesn't know project conventions | CLAUDE.md with architecture rules and coding conventions |
| Happy path bias | Agent doesn't generate error handling | Tests explicitly define error scenarios (Step 1 of TDD) |
| Scope creep | Agent over-builds or touches unrelated code | Task breakdown with explicit "do NOT" constraints |
| Architectural drift | Agent uses its own patterns, not yours | Design document + per-directory rules |
| Regression | New code breaks existing functionality | Test-first workflow catches regressions immediately |
| Security holes | Agent doesn't consider threat model | Security requirements in Phase 1 + security-focused test cases |
| Ownership void | Developer ships code they don't understand | Design review in Phase 2 forces understanding before implementation |
| Technical debt | Inconsistent patterns across sessions | CLAUDE.md ensures consistency across every session |

Production Workflow: Putting It All Together

Here's a complete, real-world SDD workflow from feature request to merge:

1. Feature Brief

## Feature: Team Role Management

Users need the ability to change team member roles (admin → member,
member → admin) and remove members from teams. Only team admins
should be able to perform these actions.

2. AI-Assisted Requirements

Prompt: "Read the feature brief above and our existing auth system 
in src/services/auth-service.ts. Generate requirements.md covering 
user stories, edge cases, security requirements, and what we're NOT 
building. Consider: what happens when the last admin tries to 
change their own role?"

3. AI-Assisted Design

Prompt: "Read requirements.md. Create design.md with database schema 
changes, API endpoints, service methods, and a state diagram for role 
transitions. Follow patterns in our existing codebase. Flag any 
design decisions that need human review."

4. Human Review Checkpoint

This is the critical human-in-the-loop moment. Review the design document for:

  • Does the database schema make sense?
  • Are the API endpoints RESTful and consistent with our existing API?
  • Did the AI catch the "last admin" edge case?
  • Are there security implications the AI missed?

Make corrections to the design document before any code is written.

5. AI-Assisted Task Breakdown

Prompt: "Read requirements.md and design.md. Create tasks.md with 
ordered, dependency-aware implementation steps. Each task should 
include its test coverage requirements."

6. Iterative Implementation

Prompt: "Implement Task 1 from tasks.md. Write tests first, then 
implementation. Run tests to verify. Mark the task as complete in 
tasks.md when done. Do NOT proceed to Task 2."

Repeat for each task. After each task, the agent updates tasks.md to reflect progress. You can review each task independently, making code review manageable rather than facing a single massive PR.

7. Integration Verification

Prompt: "All tasks in tasks.md are complete. Run the full test suite 
with `pnpm test`. If there are failures, fix them. Then run 
`pnpm lint` and fix any issues."

Scaling SDD: Team Patterns

Shared Spec Repository

For teams, maintain spec files in version control alongside code:

project/
├── .claude/
│   └── rules/
│       ├── general.md         # Project-wide rules
│       ├── api-conventions.md # API-specific rules
│       └── testing.md         # Testing standards
├── docs/
│   └── specs/
│       ├── team-invitations/
│       │   ├── requirements.md
│       │   ├── design.md
│       │   └── tasks.md
│       └── role-management/
│           ├── requirements.md
│           ├── design.md
│           └── tasks.md
├── CLAUDE.md                  # Root briefing (routes to .claude/rules/)
├── AGENTS.md                  # Universal rules for all AI tools
└── src/

PR Template for AI-Assisted Work

Require PRs that include AI-generated code to reference their spec:

## PR Checklist (AI-Assisted)

- [ ] Feature spec exists in /docs/specs/[feature-name]/
- [ ] requirements.md reviewed and approved by tech lead
- [ ] design.md reviewed and approved by tech lead
- [ ] All tasks in tasks.md are completed and checked off
- [ ] Test coverage: all paths from requirements are tested
- [ ] No modifications to files outside the feature scope
- [ ] `pnpm test` passes
- [ ] `pnpm lint` passes

Metrics That Matter

Track these metrics to measure the effectiveness of SDD adoption:

| Metric | Before SDD | After SDD | Why It Matters |
| --- | --- | --- | --- |
| AI-generated regressions per sprint | 8-15 | 1-3 | Direct measure of code quality |
| Time to debug AI code | 2-4 hours per bug | 15-30 min per bug | Ownership void reduction |
| PR review time | 45-60 min (reading unfamiliar code) | 15-20 min (checking against spec) | Review efficiency |
| Feature delivery time | Same (faster coding, slower debugging) | 30-40% faster net | Real productivity gain |
| Security issues caught in review | 20-30% catch rate | 80-90% catch rate (tests + spec) | Security improvement |

Common Objections (and Rebuttals)

"SDD is just more documentation overhead"

No. SDD documents are generated by the AI, reviewed by you, and consumed by the AI. The total documentation effort is 10-15 minutes per feature for the human (review time). The alternative is 2-4 hours per bug debugging code you don't understand.

"My project is too small for this"

If your project is small enough that you can hold the entire codebase in your head, vibe coding might work fine. SDD becomes essential when the codebase exceeds your working memory — roughly 10-20 files with interconnected logic.

"Tests slow down development"

Tests slow down the first hour. They accelerate every subsequent hour. With AI agents, this trade-off is even more favorable: the agent writes the implementation to match your tests, often getting it right on the first or second attempt. The test-writing phase takes 5-10 minutes for humans; the implementation phase takes the AI 30-60 seconds.

"I just need to write better prompts"

Prompt quality matters, but prompts are ephemeral. They disappear when the session ends. SDD spec files persist, improve over time, and work across different AI tools. Your CLAUDE.md doesn't just help today's session — it helps every session, every team member, and every AI tool.

The Maturity Model

Teams adopting SDD typically progress through three stages:

Stage 1: Reactive (Most teams today)

  • Use AI for code generation with ad-hoc prompts
  • Debug frequently, review painfully
  • No persistent context files
  • Each session starts from zero

Stage 2: Structured (SDD adoption)

  • CLAUDE.md / AGENTS.md in place
  • Four-phase loop for complex features
  • TDD integrated with AI workflows
  • Per-directory rules for domain-specific constraints
  • Design review before implementation

Stage 3: Systematic (Full SDD maturity)

  • Spec repository maintained alongside code
  • AI agents update spec documents as they work
  • Metrics tracked and optimized
  • Onboarding new team members via spec docs
  • Cross-tool consistency (Cursor, Claude Code, Copilot all follow same specs)

Most teams can reach Stage 2 in a single sprint. Stage 3 develops naturally over 2-3 months as spec files accumulate and patterns stabilize.

The Engineering Reality

AI coding agents are the most powerful tools we've ever had for producing code fast. They're also the most powerful tools we've ever had for producing bugs fast. The difference between the two outcomes isn't the model, the prompt, or the tool — it's the methodology.

Vibe coding treats AI as a replacement for engineering discipline. SDD treats AI as an amplifier for engineering discipline. The four-phase loop — requirements, design, tasks, implementation — isn't bureaucracy. It's the same engineering process that senior developers have always followed, now made explicit so that AI agents can follow it too.

The spec files take 15 minutes to set up. The TDD loop takes zero additional time (the AI writes the implementation). The design review takes 10 minutes per feature. In exchange, you get code that follows your patterns, handles your edge cases, passes your tests, and can be maintained by your team.

Your AI agent is the most productive junior developer you've ever hired. SDD is how you onboard them properly — with clear requirements, documented patterns, well-defined tasks, and verifiable acceptance criteria. Skip the onboarding, and you get the chaos that everyone complains about. Invest in the onboarding, and you get the productivity multiplier that everyone dreams about.

The choice isn't whether to use AI coding agents. The choice is whether to use them with engineering discipline or without it. SDD makes that choice obvious.

💡 Note: This article was originally published on the Pockit Blog.

Check out Pockit.tools for 60+ free developer utilities. For faster access, add it to Chrome and use JSON Formatter & Diff Checker directly from your toolbar.

Shared by alvinashcraft (Pennsylvania, USA)

Proposed Advancement of IMSC Text Profile 1.3 to W3C Recommendation


Today, the W3C Team proposed advancing the IMSC Text Profile 1.3 to W3C Recommendation. Published by the Timed Text Working Group as a Candidate Recommendation Draft on 3 April 2026, this specification defines a text-only profile of [ttml2] intended for subtitle and caption delivery applications worldwide. It improves on the Text Profile specified in [ttml-imsc1.2], with the improvements summarized in L. Summary of substantive changes.


Enterprises struggle with the increased complexity of application change

A new report from cloud application lifecycle management (CALM) platform Opkey shows enterprise application leaders are coping with the increased complexity of implementing and managing the cloud technology needed to power today's business operations. It identifies six major pressure points impacting enterprise IT leaders and QA teams: rising integration management costs, the inability to keep pace with rapid cloud updates, persistent production issues, managing external consultant costs post-implementation, mounting expectations around agentic AI, and how teams hope to reinvest newfound efficiency gains into business innovation. A full 61 percent of respondents say integrations are their single largest cost driver, well… [Continue Reading]

Adobe launches Acrobat Spaces, a free AI-powered study tool for students

Adobe Acrobat Spaces is a free tool that lets students use AI to create different kinds of study material from documents.

The World Needs More Software Engineers


I sat down with Aaron Levie at the O’Reilly AI Codecon two weeks ago. Aaron cofounded Box in 2005, and 20 years later, his company manages content for about two-thirds of the Fortune 500. Aaron is one of the few CEOs of an incumbent enterprise software company thinking deeply in public about what AI means for the entire enterprise stack. There are a lot of people who are building companies from the ground up with AI, others who are dragging their feet adapting existing enterprises to it, and then there’s Aaron. He sits in a kind of Goldilocks zone, enthusiastic but not uncritical, engaging in the hard work of adapting AI to the enterprise and the enterprise to AI.

The engineering demand paradox

I started out by asking about something from Lenny’s Newsletter that Aaron had retweeted. Despite all the doom rhetoric, TrueUp data shows software engineering job postings are at a three-year high. Product manager jobs are way up. AI jobs as a whole are way up.

AI jobs are way up

The actual data may be more equivocal than the TrueUp report suggests. The honest read of the literature as of spring 2026 (Brynjolfsson et al., Humlum and Vestergaard, BLS Software Developers, BLS Computer Programmers) is that something real is happening to entry-level software work, that it is happening faster than most previous technology transitions, that it has different effects depending on which job code you look at, and that it is not yet clear whether the net effect on total software employment will be negative, neutral, or eventually positive. Nonetheless, the TrueUp report was a trigger for the discussion that followed.

Aaron noted that engineers have historically been concentrated at tech companies because the cost of a software project was too high to justify anywhere else. But if agents make an engineer two to ten times more productive, all the software projects that were never economically viable suddenly become viable. Demand doesn’t shrink. It diffuses across the entire economy. In his tweet, he called it “Jevons paradox happening in real time.” In our conversation, he said:

“What’s going to happen is the entire world is going to be looking at all the potential software that they build. And they’re going to start to say, Oh, I can finally justify going out and doing this type of project where I couldn’t before.”

Engineers empowered by AI agents won’t just build software for IT teams. The total addressable role of the engineer expands from the technology department to every function in the enterprise. They’ll be wiring up automation for marketing, legal, accounting, and every other corporate function.

He’s totally right. Look around at all the crappy workflows, the crappy processes, the incredible overhead of things that ought to be simple. You think companies should lay off their developers to reduce costs when there’s so much shitty software out there? Really? There’s so much that needs to be improved. He had a great line: “Silicon Valley is spooked by its own technology.”

Over to me: The rhetoric from the labs about job destruction is actively counterproductive. I was talking recently with someone in healthcare who described a hospital system trying to fill a giant hole from reduced Medicare funding. They see AI as a way to gain efficiency in their back office so they can free up more resources for patient care. And of course the union is fighting it because they’ve been told AI is a monster that’s going to take their jobs. If you tell a different story, one about making the system better and serving more people more affordably, that’s something people can get behind. We have to change the narrative.

Context, not connectivity, is the real problem

I also asked Aaron whether protocols like MCP are making context portable enough to erode competitive moats. He agreed that the industry has broadly converged on openness and interoperability (with some toll booths to work through). But getting your systems to talk to each other doesn’t solve the harder problem of getting your data structured so that agents can actually find the right information at the right moment.

“If it’s in 50 different systems and it’s not organized in a way that agents can readily take advantage of, what you’re going to be is at the mercy of how well that agent finds exactly the context that it needs to do its work. And you’re kind of just rolling the dice every time you do a workflow.”

He predicts a decade of infrastructure modernization ahead, which sounds about right. At O’Reilly, I keep running into this myself. I’ll see a task that’s perfect for an agent and soon discover that the data I need is scattered across four systems and I have to jump through hoops to figure out who knows where the data is and how to get access. A friend running a large (but relatively new) enterprise that is turbocharging productivity and service delivery with agents told me recently that a big part of his team’s success was possible because they had spent a lot of time getting their data infrastructure in order from the start.

IMO, a lot of the stories you hear about OpenClaw and other harbingers of the agent future can be misleading in an enterprise context. They are doing greenfield setups, largely running consumer apps with well-defined interfaces, and even then, it takes weeks to set up properly. Now imagine agentic frameworks for companies with thousands of employees, hundreds of legacy apps, and deep wells of proprietary data. A decade of infrastructure modernization is generous. Without help, many enterprises will have difficulty making the transition.

Engineering the trade-offs

I brought up Phillip Carter’s “two computers” framing: we’re now programming a deterministic computer and a probabilistic computer at the same time. Skills are a bridge, because they pair context for the LLM, which works probabilistically, with tools built from deterministic code. Both systems coexist and work in parallel.

Aaron called the boundary between the two computers “the trillion-dollar question.” When does a process cross the threshold where it should be locked into repeatable, deterministic code? When should it stay adaptive? Loan processing needs to work the same way every time. Employee HR queries can be probabilistic. And the irony, as Aaron pointed out, is that making these trade-offs correctly requires deep technical understanding. AI makes the field more technical, not less.

I added that sometimes this judgment is a user experience question, sometimes a cost question. You can do something with an LLM, but it might be a lot cheaper with canned code. At other times, even though the LLM costs more, the flexibility of a liquid user interface is far better.
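The trade-off can be made concrete with a small routing sketch. Everything here is illustrative: the task names and the `llm_answer` stub stand in for a real model call, and the loan formula is just standard amortization math used as an example of a process that must be locked into deterministic code.

```python
def llm_answer(query: str) -> str:
    # Stand-in for a call to the probabilistic computer (an LLM).
    # A real implementation would call a model API here.
    return f"[model drafts an answer to: {query}]"

def monthly_payment(principal: float, annual_rate: float, months: int) -> float:
    """Deterministic loan math: same inputs, same output, every time."""
    r = annual_rate / 12
    return principal * r / (1 - (1 + r) ** -months)

def route(task: str, payload):
    # Repeatable, auditable processes get locked into deterministic code...
    if task == "loan_payment":
        return round(monthly_payment(*payload), 2)
    # ...while open-ended, adaptive work stays on the probabilistic side.
    return llm_answer(payload)

# A loan calculation must be exact and repeatable:
print(route("loan_payment", (300000.0, 0.06, 360)))   # 1798.65
# An HR query can tolerate a flexible, model-generated answer:
print(route("hr_query", "How many vacation days do I have?"))
```

The interesting engineering is deciding where that `if` boundary sits for each workflow, which is exactly why these trade-offs demand deep technical judgment rather than less of it.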

This is also a locus of creativity. What you bring out of AI is what you bring to it. Steve Jobs wasn’t a coder, but he knew how to get the most out of coders. He would have gone nuts with AI agents, because he was the essence of taste and judgment and setting the bar.

https://www.youtube.com/watch?v=1dDKWEeY0aU

Where startups win

I asked Aaron about the risks to existing enterprises from greenfield AI startups that can just move faster, reinventing what the incumbents do with an AI native solution, without all the baggage. He replied:

“If there’s already a substantial amount of the data for that particular workflow in an existing system, and the incumbent is agile enough and responsive enough, then they are in a good position to build either the solutions or to monetize that set of work that’s going to be done….What agents are really good at is automating the unstructured areas of work, the messy, collaborative human-based parts of work, the tax process, the legal review process, the audit and risk analysis process of all of your contracts and unstructured data. And so in those areas, there’s no incumbent. The only incumbent is likely professional services firms. So that’s where I would favor startups.”

Software startups like Harvey are already taking services domains and building agents for them. But it’s not just software startups. Aaron also sees lots of opportunity for AI-native law firms, accounting firms, and ad agencies: those that can throw away legacy workflows, start from scratch, and deliver two to five times the output at lower cost will have a huge advantage.

I did push back with a point I think is underappreciated: Existing enterprises face a real risk that the organization will try to stuff AI into existing workflows rather than asking what the AI-native workflow would be. People are attached to their jobs, their roles, the org chart. We have to wrestle with that honestly if we’re going to truly reinvent what we do.

https://www.youtube.com/watch?v=wf0nINE8aog

Humans get context for free

One of Aaron’s points about agents is that humans carry an enormous amount of ambient context that agents lack. You know what building you’re in and who else works there and what they do. You know the meeting that just happened where a team changed course on a strategy that hasn’t been written down yet. You have 20 years of accumulated domain knowledge. All of that is free context that we’ve never had to formalize. As he put it, “We’ve never built our business processes in a model where we assume that there’s a new user in that workflow that appeared one second ago and in under five seconds, they need to get all of the information possible to do that task.”

He suggested that one way to think of agents is as new employees who are experts but arrive with zero context and need to be fully briefed. And the context has to be precise, not just comprehensive. Give an agent too much context and it gets confused. Give it too little and it rolls the dice. SKILLS.md and AGENTS.md files are attempts to provide exactly the surgical context an agent needs for a specific process.
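To make the idea concrete, here is a hypothetical, minimal AGENTS.md of the kind Aaron is describing. The repository layout, commands, and policies are invented for illustration; the point is the shape: a brief, surgical briefing that gives an agent exactly the context a new expert employee would need for one specific process.

```markdown
# AGENTS.md — invoice-processing service (illustrative example)

## What this repo does
Ingests vendor invoices, validates them against purchase orders,
and posts approved invoices to the ERP.

## Where things live
- `src/ingest/`   — parsers for PDF and EDI invoices
- `src/match/`    — PO-matching rules (deterministic; do not loosen tolerances)
- `docs/policy.md` — approval thresholds and escalation rules

## How to verify your work
- Run `make test` before proposing any change.
- Never modify `src/match/tolerances.py` without a human sign-off.

## Context you won't find in the code
- Finance closed Q3 on a new ERP; anything referencing the old
  `legacy-gl` endpoints is dead code pending removal.
```

The "context you won't find in the code" section is the part humans get for free and agents don't, which is why, as Aaron notes, most knowledge work has nothing like this file yet.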

But 99% of knowledge work doesn’t have an AGENTS.md file, he noted. The data is everywhere. The context is everywhere. So in an existing enterprise, you have to reengineer workflows from the ground up to deliver the right information to agents at the right moment.

Aaron summed up Box’s strategic pivot in one sentence: swap the word “content” for “context” and the rest of the strategy stays the same. Enterprise context lives in contracts, research materials, financial documents. That’s all enterprise content, but it isn’t always easily available as context. The evolution is making agents first-class citizens alongside people as users of that content. This very much maps to what we’re thinking about at O’Reilly too.

https://www.youtube.com/watch?v=RByHwoTIdXM


Read the whole story
alvinashcraft
12 minutes ago
reply
Pennsylvania, USA
Share this story
Delete

1.0.20


2026-04-07

  • Add copilot help monitoring topic with OpenTelemetry configuration details and examples
  • Spinner stays active until background agents and shell commands finish, and user input remains available throughout
  • Azure OpenAI BYOK defaults to the GA versionless v1 route when no API version is configured
  • Reduce UI sluggishness during live response streaming
  • /yolo and --yolo now behave identically and /yolo state persists across /restart