You've seen the demos. An engineer types "build me a SaaS dashboard" into an AI coding agent, and in three minutes, a fully styled React app appears with auth, a database, and a payment flow. The crowd applauds. The tweet goes viral.
Then you try the same thing on your actual project — the one with 200 files, a custom auth layer, three external API integrations, and a monorepo structure — and the agent confidently rewrites your database schema, deletes a critical middleware, and introduces four security vulnerabilities before you can hit Ctrl+C.
This is the vibe coding gap. The distance between "works in a demo" and "works in production" isn't a tooling problem. It's a methodology problem. And the methodology that closes this gap has a name: Specification-Driven Development (SDD).
SDD isn't about writing more documentation. It's about giving AI agents exactly the constraints they need to operate reliably — through spec files, test-first loops, and a four-phase workflow that replaces "generate and pray" with "specify, design, task, implement." Teams adopting SDD consistently report 60-80% fewer AI-generated regressions and dramatically less time spent debugging code they didn't write.
This guide covers the full methodology: why vibe coding fails at scale, the four-phase SDD loop, how to structure spec files (CLAUDE.md, cursor rules, AGENTS.md), integrating TDD with AI agents, and production-grade patterns for orchestrating agents on real-world codebases.
Why Vibe Coding Breaks Down
Let's be precise about what fails and why. Vibe coding — the practice of prompting an AI with natural language and immediately deploying the output — works surprisingly well for greenfield prototypes. But it systematically fails on production codebases for four specific reasons:
1. The Context Collapse Problem
AI coding agents are stateless. Every new session starts from zero. Your 200-file project has implicit knowledge embedded everywhere: naming conventions, error handling patterns, the fact that userId in your auth layer is a UUID but user_id in your legacy API is a sequential integer. None of this exists in the agent's context unless you explicitly provide it.
Vibe coding provides no mechanism for this. You type a prompt, the agent generates code based on its training data, and the output follows the agent's default patterns — not yours. The result is code that "works" in isolation but creates subtle inconsistencies that compound into architectural drift.
Context Collapse in Practice:
| Your project | What the agent generates |
|---|---|
| camelCase everywhere | snake_case in new files |
| Zod for validation | Manual if-checks |
| Custom error classes | Generic `throw new Error()` |
| Repository pattern | Direct database calls |
| UUID primary keys | Auto-increment integers |
2. The Happy Path Bias
LLMs are trained predominantly on tutorial code, documentation examples, and Stack Overflow answers. This training data overwhelmingly demonstrates the "happy path" — when everything goes right. The result is AI-generated code that handles the primary flow gracefully but crumbles at boundaries:
- Network timeouts? Not handled.
- Concurrent requests to the same resource? Race conditions.
- Database connection pool exhaustion? Crashes.
- Malformed user input? Unvalidated.
- Rate limits from external APIs? Ignored.
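None of these failure modes is exotic, and each has a small, mechanical fix. As one illustrative sketch — the `withRetry` helper below is hypothetical, not from any library or codebase discussed in this article — transient upstream failures can be retried with exponential backoff instead of being assumed away:

```typescript
// Hypothetical sketch: retry a flaky async operation with exponential
// backoff, instead of letting the first transient failure crash the flow.
async function withRetry<T>(
  op: () => Promise<T>,
  { attempts = 3, baseDelayMs = 100 }: { attempts?: number; baseDelayMs?: number } = {}
): Promise<T> {
  let lastError: unknown;
  for (let i = 0; i < attempts; i++) {
    try {
      return await op();
    } catch (err) {
      lastError = err;
      // Back off between attempts: 100ms, 200ms, 400ms, ...
      if (i < attempts - 1) {
        await new Promise<void>((resolve) => setTimeout(resolve, baseDelayMs * 2 ** i));
      }
    }
  }
  throw lastError;
}
```

This is exactly the kind of boundary handling that tutorial-trained output omits — and that a test suite (see the TDD section below) forces into existence.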
A 2025 study from Endor Labs found that 62% of AI-generated code contained security weaknesses or design flaws, with 44-49% of AI-suggested dependencies carrying known vulnerabilities. The Verizon DBIR 2025 report found third-party involvement in breaches doubled to 30%, underscoring the growing risk of unvetted AI-generated code entering software supply chains.
3. The Ownership Void
Vibe coding produces code that its supposed owner doesn't understand. The developer shipped it, but they can't explain the control flow, the error handling strategy, or why a particular library was chosen. When bugs appear in production at 2 AM, they face code written by an alien intelligence with different assumptions about how error states propagate.
This isn't a theoretical concern — it's the single most common complaint from engineering managers about AI-assisted development in 2026. The code works until it doesn't, and when it doesn't, nobody can fix it quickly because nobody understands it deeply.
4. The Compounding Technical Debt
Each vibe-coded session adds code that follows the agent's implicit patterns rather than the project's established ones. After ten sessions, you don't have one codebase — you have an accretion of ten slightly different coding styles, error handling approaches, and architectural assumptions. This technical debt compounds exponentially because each new AI session inherits the confusion of the previous ones.
The Four-Phase SDD Loop
Specification-Driven Development replaces the single "prompt → code" step with a four-phase loop that builds context before generating code. Each phase produces a document artifact that the AI agent can reference in subsequent phases.
The SDD Loop:
```
Phase 1: REQUIREMENTS ──→ requirements.md
    ↓       "What are we building?"
Phase 2: DESIGN ──→ design.md
    ↓       "How will we build it?"
Phase 3: TASKS ──→ tasks.md
    ↓       "What exact steps in what order?"
Phase 4: IMPLEMENTATION ──→ code + tests
    ↓       "Build it, test it, verify it."
    └──── FEEDBACK ────→ Update specs, repeat.
```
Phase 1: Requirements
Start every feature by making the AI help you clarify what you're building. Don't ask for code — ask for a requirements analysis.
Prompt to your AI agent:
```
"I need to add a team invitation system to our SaaS app.
Before writing any code, create a requirements.md that covers:
- User stories for the invitation flow
- Edge cases (expired invites, duplicate emails, role conflicts)
- Security requirements (rate limiting, token validation)
- Integration points with our existing auth system
- What we are NOT building in this phase"
```
The "what we are NOT building" constraint is critical. Without explicit scope boundaries, AI agents tend to over-build — adding features that weren't requested, introducing premature abstractions, and expanding scope until the change touches half the codebase.
The output is a requirements.md file that both the human and the AI agent can reference. This document becomes the source of truth for the feature's scope.
Phase 2: Design
With requirements locked, translate them into technical decisions. Again, no code yet — just architecture.
Prompt to your AI agent:
```
"Read requirements.md. Now create a design.md that covers:
- Database schema changes (what tables/columns, with migration strategy)
- API endpoints (method, path, request/response shapes)
- Service layer architecture (what functions, what dependencies)
- Error handling strategy (what errors, what HTTP codes, what messages)
- State machine for invitation lifecycle (pending → accepted/expired/revoked)
Reference our existing patterns in src/services/ and src/api/.
Do NOT write implementation code."
```
This phase catches architectural mistakes before any code exists. The AI might propose a design that conflicts with your existing patterns. It's infinitely cheaper to fix a design document than to debug a half-implemented feature.
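A concrete payoff of specifying the lifecycle up front is that it can be pinned down as data rather than prose. Here's a hedged sketch — the `canTransition` helper is illustrative, not part of the article's codebase — of the invitation state machine from the design prompt:

```typescript
// Sketch of the invitation lifecycle from the design prompt. The state
// names mirror the article; the helper function itself is an assumption.
type InvitationStatus = "pending" | "accepted" | "expired" | "revoked";

// Only "pending" has outgoing transitions; the other states are terminal.
const transitions: Record<InvitationStatus, InvitationStatus[]> = {
  pending: ["accepted", "expired", "revoked"],
  accepted: [],
  expired: [],
  revoked: [],
};

function canTransition(from: InvitationStatus, to: InvitationStatus): boolean {
  return transitions[from].includes(to);
}
```

A table like this in design.md leaves the AI no room to invent an "un-expire" path that the requirements never asked for.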
Phase 3: Tasks
Break the design into discrete, dependency-ordered implementation steps. Each task should be small enough that the AI agent can complete it in a single focused session.
Prompt to your AI agent:
```
"Read requirements.md and design.md. Create a tasks.md with:
- Numbered, ordered implementation steps
- Dependencies between steps (what must be done before what)
- Expected test coverage for each step
- Estimated complexity (S/M/L) for each step
Group tasks by: database → service layer → API → integration tests"
```
Example tasks.md output:
```markdown
## Team Invitation System — Implementation Tasks

### Database Layer
- [x] Task 1 (S): Create `team_invitations` migration
  - Columns: id, team_id, email, role, token, status, expires_at
  - Tests: migration up/down, constraint validation

### Service Layer
- [ ] Task 2 (M): Implement `InvitationService.create()`
  - Depends on: Task 1
  - Validates: email format, duplicate check, team member limit
  - Tests: happy path, duplicate rejection, limit enforcement
- [ ] Task 3 (M): Implement `InvitationService.accept()`
  - Depends on: Task 2
  - Validates: token exists, not expired, not already accepted
  - Tests: valid acceptance, expired token, already-used token

### API Layer
- [ ] Task 4 (M): POST /api/teams/:id/invitations
  - Depends on: Task 2
  - Auth: team admin role required
  - Tests: 201 success, 403 non-admin, 409 duplicate, 429 rate limit
- [ ] Task 5 (M): POST /api/invitations/:token/accept
  - Depends on: Task 3
  - Auth: must be logged in, email must match invitation
  - Tests: 200 success, 404 invalid token, 410 expired
```
Phase 4: Implementation
Now — and only now — the AI writes code. But instead of a single massive prompt, you feed it one task at a time, with full context:
Prompt to your AI agent:
```
"Read requirements.md, design.md, and tasks.md.
Implement Task 2: InvitationService.create()
Requirements:
- Follow our existing service pattern in src/services/TeamService.ts
- Use Zod for input validation
- Use our custom AppError class for error handling
- Write tests first (test file, then implementation)
- Run tests after implementation to verify
Do NOT modify any existing files except to add imports.
Do NOT implement tasks 3-5 yet."
```
The constraints in this prompt are doing the heavy lifting. "Follow our existing pattern" prevents context collapse. "Write tests first" enforces TDD. "Do NOT modify existing files except imports" prevents the agent from "helpfully" refactoring unrelated code. "Do NOT implement tasks 3-5" prevents scope creep.
Structuring Your Spec Files
The SDD loop depends on persistent context — files that survive across sessions and tell every AI agent how to behave in your project. Here's how to structure them:
CLAUDE.md / Cursor Rules / AGENTS.md
These files serve the same purpose across different tools: they're the project's "briefing document" that the AI reads at the start of every session.
```markdown
# CLAUDE.md (or .cursor/rules, or AGENTS.md)

## Project Overview
E-commerce SaaS platform built with Next.js 16, TypeScript,
Drizzle ORM, PostgreSQL. Monorepo managed with Turborepo.

## Tech Stack
- Framework: Next.js 16 (App Router, Server Components)
- Language: TypeScript 6.0 (strict mode)
- Database: PostgreSQL 17 + Drizzle ORM
- Validation: Zod v3
- Styling: Tailwind CSS v4
- Testing: Vitest + Playwright
- Auth: Custom JWT + refresh token rotation

## Architecture Rules
1. All database access goes through repository classes in src/repositories/
2. Business logic lives in service classes in src/services/
3. API routes are thin controllers that call services
4. All inputs are validated with Zod schemas defined in src/schemas/
5. Errors use custom AppError class (see src/lib/errors.ts)
6. All IDs are UUIDs generated with crypto.randomUUID()

## Coding Conventions
- Use named exports, not default exports
- Use explicit return types on all public functions
- Error messages follow the pattern: "[Entity].[action] failed: [reason]"
- File naming: kebab-case for files, PascalCase for classes
- Imports: group by external → internal → types

## Common Commands
- `pnpm test` — Run all tests
- `pnpm test:watch` — Watch mode
- `pnpm db:migrate` — Run pending migrations
- `pnpm lint` — Biome lint + format check

## Critical Warnings
- NEVER use `any` type. Use `unknown` with type narrowing.
- NEVER use default exports. Named exports only.
- NEVER modify migration files. Create new ones for changes.
- NEVER commit .env files. Use .env.example for documentation.
```
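Rules like "use our custom AppError class" only constrain the agent if the class actually exists and is discoverable. The briefing above references src/lib/errors.ts but the article never shows it, so here is a hypothetical sketch of what such a class might look like — the field names are assumptions, not a prescribed API:

```typescript
// Hypothetical sketch of the AppError class that Architecture Rule 5 points
// at. src/lib/errors.ts is referenced but never shown in the article, so the
// httpStatus/code fields here are assumptions.
class AppError extends Error {
  constructor(
    message: string, // expected to follow "[Entity].[action] failed: [reason]"
    public readonly httpStatus: number = 400,
    public readonly code?: string // optional machine-readable code
  ) {
    super(message);
    this.name = "AppError";
  }
}
```

A service can then `throw new AppError("Invitation.create failed: duplicate invitation", 409)`, and a thin API controller maps `httpStatus` onto the response without inspecting message strings.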
Key Principle: Route, Don't Dump
Keep your root spec file under 200-300 lines. It should tell the agent what to do and where to find more information, not contain every rule in the project.
```markdown
## Reference Documents
- Architecture decisions: /docs/architecture.md
- API design conventions: /docs/api-conventions.md
- Database schema: /docs/schema.md
- Feature specs: /docs/specs/[feature-name].md
```
The AI agent reads the root file, then reads referenced documents on-demand when working on relevant areas. This is progressive disclosure — the same UX principle that makes good software, applied to AI context management.
Per-Directory Rules
Many AI tools support directory-scoped rules. Use these for domain-specific constraints:
```markdown
# src/services/.rules (or src/services/.cursorrules)

## Service Layer Rules
- Every service method must be async
- Services receive dependencies through constructor injection
- Services must not import from src/api/ (no circular deps)
- Every public method must have a corresponding test
- Use transactions for multi-table operations
- Log entry and exit of critical operations using logger.info()
```
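The constructor-injection rule is worth seeing concretely, since it's what makes services testable without a real database. A minimal sketch — the `InvitationRepository` and `Logger` interfaces are assumed shapes, not defined anywhere in the article:

```typescript
// Illustrative sketch of the constructor-injection rule. Interface shapes
// are assumptions; the article never defines them.
interface InvitationRepository {
  insert(email: string): Promise<{ id: string }>;
}

interface Logger {
  info(msg: string): void;
}

class InvitationService {
  // Dependencies arrive through the constructor, never as imports of
  // concrete implementations — so tests can pass in fakes.
  constructor(
    private readonly repo: InvitationRepository,
    private readonly logger: Logger
  ) {}

  async create(email: string): Promise<{ id: string }> {
    this.logger.info("InvitationService.create: start"); // entry log per the rules
    const row = await this.repo.insert(email);
    this.logger.info("InvitationService.create: done"); // exit log per the rules
    return row;
  }
}
```

In a test, `new InvitationService(fakeRepo, fakeLogger)` gives you full control of every boundary the service touches.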
TDD with AI Agents: The Red-Green-Refactor Loop
Test-Driven Development isn't just compatible with AI agents — it's the ideal workflow for them. Tests provide the one thing AI agents desperately need: an objective, verifiable definition of "correct."
Why TDD Works Better With AI Than Without
Without tests, you ask the AI to generate code and then manually review it for correctness. This is cognitively exhausting and error-prone — you're reading code written by an alien intelligence and trying to spot bugs in unfamiliar patterns.
With TDD, you define correctness first, then let the AI generate code until the tests pass. You're not reviewing implementation details — you're reviewing outcomes. The test suite is your automated verifier.
```
Traditional Flow (fragile):
  Prompt → Code → Manual Review → "Looks right?" → Ship → Bug

TDD Flow (reliable):
  Spec → Test (failing) → Prompt AI → Code → Run Tests → Pass? → Ship
                                                  ↓
                                                Fail → AI iterates automatically
```
The Spec → Test → Implement Pattern
Here's the practical workflow for TDD with an AI agent:
Step 1: Write the test spec (human-driven)
```typescript
// __tests__/services/invitation-service.test.ts
import { describe, it, expect, beforeEach } from 'vitest';
import { InvitationService } from '@/services/invitation-service';

describe('InvitationService.create', () => {
  let service: InvitationService;

  beforeEach(() => {
    // Fresh service instance per test; dependency wiring (repositories,
    // fixtures, test doubles) omitted for brevity.
    service = new InvitationService(/* injected dependencies */);
  });

  it('should create an invitation with a valid email and role', async () => {
    const result = await service.create({
      teamId: 'team-uuid-1',
      email: 'new@example.com',
      role: 'member',
      invitedBy: 'admin-uuid-1',
    });

    expect(result.id).toBeDefined();
    expect(result.status).toBe('pending');
    expect(result.token).toHaveLength(64);
    expect(result.expiresAt).toBeInstanceOf(Date);
  });

  it('should reject duplicate invitations for the same email', async () => {
    await service.create({ teamId: 'team-uuid-1', email: 'dup@example.com', role: 'member', invitedBy: 'admin-uuid-1' });

    await expect(
      service.create({ teamId: 'team-uuid-1', email: 'dup@example.com', role: 'member', invitedBy: 'admin-uuid-1' })
    ).rejects.toThrow('Invitation.create failed: duplicate invitation');
  });

  it('should enforce team member limit', async () => {
    // Assume team 'full-team' already has the maximum number of members
    await expect(
      service.create({ teamId: 'full-team', email: 'extra@example.com', role: 'member', invitedBy: 'admin-uuid-1' })
    ).rejects.toThrow('Invitation.create failed: team member limit reached');
  });

  it('should set expiration to 7 days from creation', async () => {
    const before = new Date();
    const result = await service.create({
      teamId: 'team-uuid-1',
      email: 'timed@example.com',
      role: 'member',
      invitedBy: 'admin-uuid-1',
    });
    const after = new Date();

    const sevenDaysMs = 7 * 24 * 60 * 60 * 1000;
    expect(result.expiresAt.getTime()).toBeGreaterThanOrEqual(before.getTime() + sevenDaysMs);
    expect(result.expiresAt.getTime()).toBeLessThanOrEqual(after.getTime() + sevenDaysMs);
  });
});
```
Step 2: Let the AI implement (agent-driven)
Prompt:
```
"The test file at __tests__/services/invitation-service.test.ts
defines the expected behavior. Implement InvitationService.create()
in src/services/invitation-service.ts to make all tests pass.
Follow the service pattern established in src/services/team-service.ts.
Use Zod for input validation (schema in src/schemas/invitation.ts).
Run `pnpm test __tests__/services/invitation-service.test.ts` after
implementation. Iterate until all tests pass."
```
The agent generates code, runs the tests, sees failures, and iterates — automatically. This is the Red-Green loop, but the AI is the one cycling through it. You defined the "what" (tests), and the AI figures out the "how" (implementation).
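For concreteness, here is a minimal sketch — an assumption for illustration, not the article's actual solution — of the two pieces of `create()` that the tests pin down exactly: the 64-character token and the 7-day expiry.

```typescript
// Hypothetical sketch of the token/expiry logic the tests above constrain.
// newInvitationFields is an illustrative helper, not a prescribed API.
import { randomBytes } from "node:crypto";

const INVITATION_TTL_MS = 7 * 24 * 60 * 60 * 1000; // 7 days, per the tests

function newInvitationFields(): { token: string; expiresAt: Date; status: "pending" } {
  return {
    // 32 random bytes hex-encoded = 64 characters, matching toHaveLength(64)
    token: randomBytes(32).toString("hex"),
    expiresAt: new Date(Date.now() + INVITATION_TTL_MS),
    status: "pending",
  };
}
```

Note how the tests dictated both constants: an agent free-styling without them might plausibly pick a 32-character token or a 24-hour expiry, and nobody would notice until production.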
Step 3: Human review (outcome-focused)
Once tests pass, review the implementation for:
- Does it follow project patterns? (Check against CLAUDE.md)
- Are there performance concerns? (N+1 queries, missing indexes)
- Are there security concerns? (Input validation, auth checks)
You're reviewing with purpose, not scanning line-by-line through unfamiliar code hoping to spot a bug.
Failure Patterns and How SDD Prevents Them
Let's map the most common AI coding failures to specific SDD mechanisms that prevent them:
| Failure Pattern | Root Cause | SDD Prevention |
|---|---|---|
| Context collapse | Agent doesn't know project conventions | CLAUDE.md with architecture rules and coding conventions |
| Happy path bias | Agent doesn't generate error handling | Tests explicitly define error scenarios (Step 1 of TDD) |
| Scope creep | Agent over-builds or touches unrelated code | Task breakdown with explicit "do NOT" constraints |
| Architectural drift | Agent uses its own patterns, not yours | Design document + per-directory rules |
| Regression | New code breaks existing functionality | Test-first workflow catches regressions immediately |
| Security holes | Agent doesn't consider threat model | Security requirements in Phase 1 + security-focused test cases |
| Ownership void | Developer ships code they don't understand | Design review in Phase 2 forces understanding before implementation |
| Technical debt | Inconsistent patterns across sessions | CLAUDE.md ensures consistency across every session |
Production Workflow: Putting It All Together
Here's a complete, real-world SDD workflow from feature request to merge:
1. Feature Brief
```markdown
## Feature: Team Role Management

Users need the ability to change team member roles (admin → member,
member → admin) and remove members from teams. Only team admins
should be able to perform these actions.
```
2. AI-Assisted Requirements
Prompt:

```
"Read the feature brief above and our existing auth system
in src/services/auth-service.ts. Generate requirements.md covering
user stories, edge cases, security requirements, and what we're NOT
building. Consider: what happens when the last admin tries to
change their own role?"
```
3. AI-Assisted Design
Prompt:

```
"Read requirements.md. Create design.md with database schema
changes, API endpoints, service methods, and a state diagram for role
transitions. Follow patterns in our existing codebase. Flag any
design decisions that need human review."
```
4. Human Review Checkpoint
This is the critical human-in-the-loop moment. Review the design document for:
- Does the database schema make sense?
- Are the API endpoints RESTful and consistent with our existing API?
- Did the AI catch the "last admin" edge case?
- Are there security implications the AI missed?
Make corrections to the design document before any code is written.
5. AI-Assisted Task Breakdown
Prompt:

```
"Read requirements.md and design.md. Create tasks.md with
ordered, dependency-aware implementation steps. Each task should
include its test coverage requirements."
```
6. Iterative Implementation
Prompt:

```
"Implement Task 1 from tasks.md. Write tests first, then
implementation. Run tests to verify. Mark the task as complete in
tasks.md when done. Do NOT proceed to Task 2."
```
Repeat for each task. After each task, the agent updates tasks.md to reflect progress. You can review each task independently, making code review manageable rather than facing a single massive PR.
7. Integration Verification
Prompt:

```
"All tasks in tasks.md are complete. Run the full test suite
with `pnpm test`. If there are failures, fix them. Then run
`pnpm lint` and fix any issues."
```
Scaling SDD: Team Patterns
Shared Spec Repository
For teams, maintain spec files in version control alongside code:
```
project/
├── .claude/
│   └── rules/
│       ├── general.md            # Project-wide rules
│       ├── api-conventions.md    # API-specific rules
│       └── testing.md            # Testing standards
├── docs/
│   └── specs/
│       ├── team-invitations/
│       │   ├── requirements.md
│       │   ├── design.md
│       │   └── tasks.md
│       └── role-management/
│           ├── requirements.md
│           ├── design.md
│           └── tasks.md
├── CLAUDE.md      # Root briefing (routes to .claude/rules/)
├── AGENTS.md      # Universal rules for all AI tools
└── src/
```
PR Template for AI-Assisted Work
Require PRs that include AI-generated code to reference their spec:
```markdown
## PR Checklist (AI-Assisted)
- [ ] Feature spec exists in /docs/specs/[feature-name]/
- [ ] requirements.md reviewed and approved by tech lead
- [ ] design.md reviewed and approved by tech lead
- [ ] All tasks in tasks.md are completed and checked off
- [ ] Test coverage: all paths from requirements are tested
- [ ] No modifications to files outside the feature scope
- [ ] `pnpm test` passes
- [ ] `pnpm lint` passes
```
Metrics That Matter
Track these metrics to measure the effectiveness of SDD adoption:
| Metric | Before SDD | After SDD | Why It Matters |
|---|---|---|---|
| AI-generated regressions per sprint | 8-15 | 1-3 | Direct measure of code quality |
| Time to debug AI code | 2-4 hours per bug | 15-30 min per bug | Ownership void reduction |
| PR review time | 45-60 min (reading unfamiliar code) | 15-20 min (checking against spec) | Review efficiency |
| Feature delivery time | Same (faster coding, slower debugging) | 30-40% faster net | Real productivity gain |
| Security issues caught in review | 20-30% catch rate | 80-90% catch rate (tests + spec) | Security improvement |
Common Objections (and Rebuttals)
"SDD is just more documentation overhead"
No. SDD documents are generated by the AI, reviewed by you, and consumed by the AI. The total documentation effort is 10-15 minutes per feature for the human (review time). The alternative is 2-4 hours per bug debugging code you don't understand.
"My project is too small for this"
If your project is small enough that you can hold the entire codebase in your head, vibe coding might work fine. SDD becomes essential when the codebase exceeds your working memory — roughly 10-20 files with interconnected logic.
"Tests slow down development"
Tests slow down the first hour. They accelerate every subsequent hour. With AI agents, this trade-off is even more favorable: the agent writes the implementation to match your tests, often getting it right on the first or second attempt. The test-writing phase takes 5-10 minutes for humans; the implementation phase takes the AI 30-60 seconds.
"I just need to write better prompts"
Prompt quality matters, but prompts are ephemeral. They disappear when the session ends. SDD spec files persist, improve over time, and work across different AI tools. Your CLAUDE.md doesn't just help today's session — it helps every session, every team member, and every AI tool.
The Maturity Model
Teams adopting SDD typically progress through three stages:
Stage 1: Reactive (Most teams today)
- Use AI for code generation with ad-hoc prompts
- Debug frequently, review painfully
- No persistent context files
- Each session starts from zero
Stage 2: Structured (SDD adoption)
- CLAUDE.md / AGENTS.md in place
- Four-phase loop for complex features
- TDD integrated with AI workflows
- Per-directory rules for domain-specific constraints
- Design review before implementation
Stage 3: Systematic (Full SDD maturity)
- Spec repository maintained alongside code
- AI agents update spec documents as they work
- Metrics tracked and optimized
- Onboarding new team members via spec docs
- Cross-tool consistency (Cursor, Claude Code, Copilot all follow same specs)
Most teams can reach Stage 2 in a single sprint. Stage 3 develops naturally over 2-3 months as spec files accumulate and patterns stabilize.
The Engineering Reality
AI coding agents are the most powerful tools we've ever had for producing code fast. They're also the most powerful tools we've ever had for producing bugs fast. The difference between the two outcomes isn't the model, the prompt, or the tool — it's the methodology.
Vibe coding treats AI as a replacement for engineering discipline. SDD treats AI as an amplifier for engineering discipline. The four-phase loop — requirements, design, tasks, implementation — isn't bureaucracy. It's the same engineering process that senior developers have always followed, now made explicit so that AI agents can follow it too.
The spec files take 15 minutes to set up. The TDD loop takes zero additional time (the AI writes the implementation). The design review takes 10 minutes per feature. In exchange, you get code that follows your patterns, handles your edge cases, passes your tests, and can be maintained by your team.
Your AI agent is the most productive junior developer you've ever hired. SDD is how you onboard them properly — with clear requirements, documented patterns, well-defined tasks, and verifiable acceptance criteria. Skip the onboarding, and you get the chaos that everyone complains about. Invest in the onboarding, and you get the productivity multiplier that everyone dreams about.
The choice isn't whether to use AI coding agents. The choice is whether to use them with engineering discipline or without it. SDD makes that choice obvious.
💡 Note: This article was originally published on the Pockit Blog.
