I’m back at work today and was planning on making a quick trip up to Mountain View for a work meeting. But since I’m solo dad-ing this week and the kid just caught a cold, I’m staying home with him instead. On the plus side, my day is WIDE open tomorrow now!
[blog] Run multiple coding agents safely with git worktrees. Work on a few branches simultaneously. This matters even more now that one person might be coordinating a handful of agents working on the same codebase.
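A minimal sketch of the idea, using standard git commands (the branch and directory names are placeholders):

```sh
# One linked checkout per agent, each on its own branch,
# so two agents never edit the same working directory.
git worktree add -b agent-refactor ../myrepo-refactor
git worktree add -b agent-tests ../myrepo-tests

# See every checkout that shares this repository's history
git worktree list

# Clean up once an agent's branch has been merged
git worktree remove ../myrepo-refactor
```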
AI is accelerating software delivery. Shared understanding may be the next bottleneck.
I had a conversation last week that I haven’t been able to shake.
I was talking with a developer at a large software company that builds applications for major airlines. They told me their organization had recently committed to going “all in” on AI-driven development. One example stood out. They shared that, even when they understood an issue and believed they could fix it quickly, they were still being encouraged to route the work through AI first. The thinking was that it was better to let the AI attempt the fix and accept its suggestions wherever possible.
Pair that with what they described as an aggressive shipping cadence (every two weeks, no exceptions) and their day-to-day reality starts to sink in.
At one point, I asked them a question.
“How do you maintain a level of ownership or comprehension of what’s being built?”
They paused and said, “You’re not going to like the answer.”
The answer was not that they had stopped caring. The answer was that comprehension had become something they were forced to triage. They reviewed what they could, trusted the agent where they had to, and kept moving because the system around them rewarded movement.
I’ve been thinking about that conversation ever since.
What I’m not saying
I want to be careful here, because it would be easy to misread this post.
I’m not interested in demonizing this developer. I’m also not interested in blaming the leaders around them. Everyone in this story is trying to adapt to a moment where the expectations for software delivery are changing quickly. The developer was being honest about the trade-offs they’re making, and that kind of honesty is rare. I’m grateful for it.
I’m also not interested in demonizing AI-assisted development. I’ve personally experienced AI as a catalyst for getting back into building software. Agents have helped close the gap between the products I study and the act of making them. I can prototype an idea on a Saturday morning that would have taken me weeks of evenings a few years ago. That matters to me.
So, this is not an anti-velocity or anti-AI post. It’s a post about a tension I’ve been noticing, and one I think is worth paying attention to in our research and in our product teams.
The tension
Developers can now generate and modify far more code than they could before AI-assisted development became part of their workflows. Whether all of that translates into better productivity, better products, or better customer outcomes is still being debated. But the volume and speed of software production have changed.
At the same time, something is happening to comprehension.
By comprehension, I don’t only mean whether a developer can explain each line of code. I mean whether the developer can explain the intent of the work, the behavior of the system, the trade-offs being made, and the customer value they believe they are creating. Code comprehension is part of that, but it’s not the whole thing.
When the amount of agent-generated work crossing a developer’s desk outpaces their ability to read, question, and reason about it, review feels more like rationing attention.
This isn’t a new challenge in software development.
Curtis, Krasner, and Iscoe’s classic 1988 field study found that the most common cause of failure in large software projects wasn’t bad code. It was breakdowns in shared understanding, especially around requirements and architecture. We’ve known for almost forty years that software projects struggle when people lose a shared picture of what they are building and why.
I explored similar themes in The Customer-Driven Culture: how much of a team’s effectiveness comes down to a shared language and shared understanding.
What’s new is the acceleration of code generation. AI doesn’t create the need for shared understanding. It puts even more pressure on it.
Cognitive debt, intent debt, and shared understanding
Cognitive debt: the erosion of shared understanding across a team over time. The team’s mental models of how the system works become increasingly incomplete or fragmented, making the system harder to reason about and change safely.
Intent debt: the absence or erosion of externalized goals, constraints, specifications, and rationale that guide how the system evolves. The “why” gets lost while the “what” keeps shipping.
I don’t read this as a claim that every unreviewed AI suggestion is automatically a new kind of debt. Sometimes it may just be a bad practice, or a risky shortcut, or an understandable compromise made under pressure.
The debt accumulates when the team’s explanation of the work gets thinner over time. The code changes, but the rationale does not travel with it. The system evolves, but the team’s mental model lags behind. Eventually, people can point to what shipped, but they struggle to explain why it is shaped the way it is.
Code still matters, and it is still the thing that ships. What may be changing is where human attention has to concentrate. If agents are generating more of the implementation, then teams may need to spend more of their attention reviewing intent, not just reviewing diffs.
I think intent review is a useful name for that emerging practice. Before asking only, “does this code work?”, a team might also ask: What was the agent asked to do? What customer problem is this meant to address? What assumptions did we encode in the prompt, the plan, the tests, or the eval? What trade-offs did the agent appear to make? Does the result still line up with what we believe matters?
I’m not suggesting intent review replaces code review. I am suggesting that code review may no longer be enough by itself.
From individual comprehension to organizational coherence
If one person on a team loses comprehension of a piece of code, that’s recoverable. Pair with someone, write a doc, do a walkthrough. We’ve handled that for decades.
But if cognitive debt is accumulating across an entire team, and across the boundaries between teams, what we’re really watching is the gradual erosion of organizational coherence.
By organizational coherence, I mean something specific. It’s the property of a cross-functional team being aligned on the purpose, the goals, and the shape of what they’re building.
That’s a problem you won’t see in the usual delivery metrics dashboards. PR throughput metrics can tell you that work is happening: issues closed, pull requests merged, changes reaching production. They cannot tell you whether the team still understands the shape of the work or whether they know how to drive customer value.
The signs may show up later, and in less tidy ways: more rework, more duplicated effort, longer onboarding, brittle decisions, confusion about ownership, or a growing inability to explain trade-offs to customers and stakeholders.
AI-assisted development may be producing new coordination costs in places our dashboards don’t see.
Tests and evals matter, but they are not the whole answer
When I describe this tension, a reasonable response I sometimes hear is, “isn’t this exactly what automated tests and evals are for?”
I think there is something real in that. Tests and evals do meaningful work. In an AI-assisted workflow, strong evals may become part of the basic infrastructure of responsible development.
But tests and evals do not restore a mental model by themselves. A passing test tells me the system did what the test was designed to check. A passing eval tells me the system satisfied the criteria we gave it. Neither one tells me whether the team can still explain the intent of the work, the assumptions behind the criteria, or the connection to customer value.
This is where I think the customer-driven muscle still matters. If a team optimizes only for what the eval measures, it can make progress against the instrument while drifting away from customer value. Talking with customers helps teams ask the question the eval cannot answer on its own: are we still measuring what matters?
What I am excited to investigate
I’m interested in product experiences that acknowledge where attention may be shifting.
If developers are spending less time authoring every line and more time directing, reviewing, and making sense of agent-generated work, then the tools around them may need to support different moments. Writing code still matters, but so does following the agent’s reasoning, understanding the plan behind a diff, and seeing how the work connects back to intent, customer value, and system behavior.
The same is true for team practices. I don’t think the answer will only come from better software. Some of it may come from practices that help teams preserve shared understanding as the pace of generation increases:
Intent review: where teams examine the purpose, assumptions, and customer value behind agent-generated work.
Team walkthroughs of agent work: especially when the change is large enough that no one person can comfortably hold it all.
Customer-value checks: where the team asks whether the generated work still maps to the problem customers are trying to solve.
Eval review paired with qualitative customer data: where the team can see whether their measurement strategy is still pointed at customer value.
Periodic coherence reviews: where a team creates a lightweight moment to ask whether they can still explain how the pieces fit together.
I don’t see these as heavy process gates. I see them as possible ways to keep human judgment close to the work.
A reason for optimism, and an open question
I want to close on a hopeful note, because I am hopeful.
AI-assisted development is opening the act of software creation to more people. It has helped me get closer to the products I study. It has lowered the cost of trying ideas. It has made building feel more accessible again (and fun!).
That is worth protecting.
The question is how we make the speed of generation more durable. A team can move quickly and still create small moments to check whether the work remains legible to the people responsible for it. In that sense, one of the highest-value questions may be: do we still understand what we are building, why we are building it, and who it is for?
So, I’d love to hear from others. If you’re working on a team that’s leaning hard into AI-assisted development, what are you seeing? Where is shared understanding holding up, and where is it slipping? What practices are helping your team stay in the loop without rejecting the benefits of agents? And how are you balancing evals, tests, customer feedback, and human judgment as more of the work moves through AI?
I don’t have the full shape of this yet. I’m trying to pay attention to the tension while it is still forming, and I’d love to learn from what you’re experiencing.
I’m convinced that in hell, there is a special place dedicated to making engineers fix flaky tests.
Not broken tests. Not tests covering a real bug. Flaky tests. Tests that pass 999 times out of 1000 and fail on the 1,000th run for no reason you can explain with a clean conscience.
If you've ever shipped a reasonably complex distributed system, you know exactly what I'm talking about. RavenDB has, at last count, over 32,000 tests that are run continuously on our CI infrastructure. I just checked, and in the past month, we’ve had hundreds of full test runs.
That is actually a problem for our scenario, because with that many tests and that many runs, the law of large numbers starts to apply. Assuming each test is 99.999% reliable, 1 out of every 100,000 test executions may fail. And we run tens of millions of test executions in a month.
In a given week, something between ten and twenty of those tests will fail. As a percentage of all runs, that is an excellent rate. But each such failure means that we have to investigate it.
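A back-of-envelope sketch of that arithmetic, with the monthly run count assumed purely for illustration:

```python
# Back-of-envelope: expected spurious failures from "reliable" tests at scale.
# The run count is an assumed value for illustration; the reliability figure
# matches the 99.999% above.
tests_in_suite = 32_000
full_runs_per_month = 300           # "hundreds of full test runs" (assumed)
failure_probability = 1e-5          # 99.999% reliable => 1 in 100,000 executions fails

executions_per_month = tests_in_suite * full_runs_per_month
expected_failures_per_month = executions_per_month * failure_probability

print(f"{executions_per_month:,} test executions per month")          # 9,600,000
print(f"~{expected_failures_per_month:.0f} spurious failures/month")  # ~96
print(f"~{expected_failures_per_month / 4.3:.0f} per week")           # ~22
```

The exact numbers move with the run count, but the point stands: even extremely reliable tests, multiplied by millions of executions, produce a steady drip of failures that someone has to look at.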
Those test failures are expensive. Every ticket is a developer staring at logs, trying to figure out whether this is a genuine bug in the product, a bug in the test itself, or something broken in the environment. In almost all cases, the problem is with the test itself, but we have to investigate.
A test that consistently fails is easy to fix. A test that occasionally fails is the worst.
With a flaky test, you don't just fix something and move on. You spend two days isolating it. Reproducing it. Building a mental model of a race condition that only manifests under specific timing, load, and cosmic alignment.
The tests that do this are almost always the integration tests. The ones that test complex distributed behavior across many parts of the system simultaneously. By definition, they are also the hardest to reason about.
The fact that, in most cases, those test failures add nothing to the product (i.e., they didn’t actually discover a real bug) is just crushed glass on top of the sewer smoothie. You spend a lot of time trying to find and fix the issue, and there is no real value except that the test now consistently passes.
We have a script that runs weekly, collects all test failures, and dumps them into our issue tracker. This is routine maintenance hygiene, to make sure we stay in good shape.
I was looking at the issue tracker when the script ran, and the entire screen lit up with new issues.
Just looking at that list of new annoyances was enough to ruin my mood.
And then, without much deliberate planning, I did something dumb and impulsive: I copy-pasted all of those fresh issues into Claude and told it to fix them. Then I went and did other things. I had very low expectations about this, but there was not much to lose.
A few hours later, I got a notification about a pull request. To be honest, I expected Claude to mark the flaky tests as skipped, or remove the assertions to make them pass.
I got an actual pull request, with real fixes, to my shock. Some of them were fixes applied to test logic. Some were actually fixes in the underlying code.
And then there was this one that stopped me cold. Claude had identified that in one of our test cases, we were waiting on the wrong resource. Not wrong in an obvious way — wrong in the kind of way that works perfectly 99.9998% of the time and silently fails 0.0002% of the time.
The (test) code looked right. We were waiting for something to happen; we just happened to wait on the wrong thing, and usually the value we asserted on was already set by the time we were done waiting.
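To make the shape of that bug concrete, here is a small illustration in Python (not the actual RavenDB test, just the general pattern of waiting on the wrong signal):

```python
import threading

def test_waits_on_the_wrong_signal():
    """Illustration only: the test waits for "accepted" when the assertion
    actually depends on "applied". It passes almost every run because the
    worker usually finishes before the test thread reaches the assert."""
    accepted = threading.Event()   # fires when the request is accepted
    applied = threading.Event()    # fires when the change is actually applied
    state = {}

    def worker():
        accepted.set()             # acknowledge the request...
        state["value"] = 42        # ...then do the real work
        applied.set()

    threading.Thread(target=worker).start()

    accepted.wait(timeout=5)       # BUG: waiting on the wrong event
    # The worker almost always wins the race to write state["value"] before
    # this line runs, so the test looks rock solid. Under unlucky scheduling
    # or heavy CI load, it loses, and the assert fails.
    assert state.get("value") == 42

    # The fix is a one-liner: wait on the event that actually gates the assert.
    # applied.wait(timeout=5)

if __name__ == "__main__":
    test_waits_on_the_wrong_signal()
```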
Claude found it. In one pass. For the price of a subscription I was already paying. For reference, that single “let me throw Claude at it” decision probably saved enough engineering time to cover the cost of Claude for the entire team for that month.
Let me be precise about what happened and what didn't. Claude did not fix everything. Some of the "fixes" it produced were pretty bad, surface-level patches that didn't address the real cause, or things that were legitimately out of scope.
You still need an engineer reviewing the output. And you still need judgment.
But it got things fixed, quickly, without needing two days to context-switch into the problem space. And the things it did fix well, it fixed really well.
The work it compressed would have realistically taken one developer a week or two to grind through — and that's assuming you could get a developer to focus on it for that long in the first place. Flaky test investigation is the kind of work that quietly kills team morale.
Engineers start dreading CI. They start treating red builds as background noise. That's how quality degrades silently. Leaving aside new features or higher velocity, being able to offload the most annoying parts of the job to a machine is… wow.
Based on this experience, we're building it into our actual workflow as an integral part of how we handle test maintenance. Failures are collected, routed to Claude, and it takes a first pass at triage and repair. Then we create an issue in the bug tracker with either an actual fix or a summary of Claude’s findings.
By the time a human reviews this, significant progress has already been made.
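A rough sketch of what that routing might look like, assuming the Claude Code CLI's non-interactive -p mode; the failure collection and file layout here are hypothetical stand-ins, not our actual scripts:

```python
import json
import subprocess
from pathlib import Path

def collect_recent_failures() -> list[dict]:
    """Hypothetical stand-in: in practice this would query the CI system or
    issue tracker for the week's intermittent test failures."""
    return json.loads(Path("weekly-failures.json").read_text())

def main() -> None:
    for i, failure in enumerate(collect_recent_failures()):
        prompt = (
            f"The test {failure['test']} fails intermittently on CI.\n"
            f"Stack trace:\n{failure['stack_trace']}\n\n"
            "Investigate the root cause and fix the test (or the product code, "
            "if the failure points at a real bug). Do not skip the test or "
            "weaken its assertions."
        )
        # `claude -p` runs Claude Code non-interactively against the repository
        # checked out in the current directory and prints its result.
        result = subprocess.run(["claude", "-p", prompt],
                                capture_output=True, text=True)
        # A human still reviews whatever comes back, whether it is a proposed
        # fix or just a written summary of the findings.
        Path(f"triage-{i}.md").write_text(result.stdout)

if __name__ == "__main__":
    main()
```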
It doesn't replace the engineer. But it means the engineer is doing the interesting part of the work: judgment, review, architectural reasoning. Skipping the part that requires staring at race condition logs until your vision blurs.
This isn’t the most exciting aspect of using a coding agent, I’m aware. But it may be one of the best aspects in terms of quality of life.
The Trump administration is reportedly considering an executive order to create a working group that could review advanced AI models before public release. The shift follows concerns over Anthropic's powerful Mythos model and its cyber capabilities, with officials weighing whether the government should get early access to frontier models without necessarily blocking their release. The New York Times reports: In meetings last week, White House officials told executives from Anthropic, Google and OpenAI about some of those plans, people briefed on the conversations said. The working group is likely to consider a number of oversight approaches, officials said. But a review process could be similar to one being developed in Britain, which has assigned several government bodies to ensure that A.I. models meet certain safety standards, people in the tech industry and the administration said.
The discussions signal a stark reversal in the Trump administration's approach to A.I. Since returning to office last year, Mr. Trump has been a major booster of the technology, which he has said is vital to winning the geopolitical contest against China. Among other moves, he swiftly rolled back a Biden administration regulatory process that asked A.I. developers to perform safety evaluations and report on A.I. models with potential military applications. "We're going to make this industry absolutely the top, because right now it's a beautiful baby that's born," Mr. Trump said of A.I. at an event in July. "We have to grow that baby and let that baby thrive. We can't stop it. We can't stop it with politics. We can't stop it with foolish rules and even stupid rules." Mr. Trump left room for some rules, but he added that "they have to be more brilliant than even the technology itself."
The White House wants to avoid any political repercussions if a devastating A.I.-enabled cyberattack were to occur, people in the tech industry and the administration said. The administration is also evaluating whether new A.I. models could yield cyber-capabilities that could be useful to the Pentagon and U.S. intelligence agencies, they said. To get ahead of models like Mythos, some officials are pushing for a review system that would give the government first access to A.I. models, but that would not block their release, people briefed on the talks said.
Custom VCS is a new Pulumi Cloud integration that connects any Git or Mercurial version control system to Pulumi Deployments using webhooks and centrally managed credentials. Pulumi Cloud already has native integrations with GitHub, GitLab, and Azure DevOps, but if your team uses a self-hosted or third-party VCS, you’ve been limited to manually configuring credentials per stack with no webhook-driven automation. Custom VCS closes that gap.
The problem
Many teams run self-hosted or third-party Git servers that Pulumi Cloud doesn’t have a native integration for, and some teams still use Mercurial. Until now, their only option was the raw git source approach: embedding credentials directly in each stack’s deployment settings, with no way to trigger deployments automatically on push, and no support for Mercurial at all.
This meant:
No push-to-deploy: Every deployment had to be triggered manually or through a separate CI pipeline.
Scattered credentials: Each stack configured its own credentials independently, with no centralized management.
No org-level integration: There was no shared configuration that multiple stacks could reference.
How Custom VCS works
Custom VCS integrations introduce an org-level integration type that works with any Git or Mercurial server. The setup has three parts:
Credentials through ESC: Instead of OAuth flows, you store your VCS credentials (a personal access token, SSH key, or username/password) in a Pulumi ESC environment. The same credential structure works for both Git and Mercurial. The integration references this environment by name and resolves credentials at deployment time. Multiple stacks can share the same credentials without duplicating secrets.
Manual repository registration: You add repositories to the integration by name. Pulumi joins the repository name with the integration’s base URL to form clone URLs. There’s no auto-discovery, so you control exactly which repositories are available.
Webhook-driven deployments: Pulumi provides a webhook endpoint and an HMAC shared secret. You configure your VCS server to POST a JSON payload on push events, and Pulumi automatically triggers deployments for matching stacks. The webhook supports branch filtering and optional path filtering.
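For a sense of the moving parts, here is a sketch of the sending side in Python. The endpoint, header name, and payload fields are placeholders rather than Pulumi's documented webhook format; see the Custom VCS documentation for the real payload and signing details:

```python
import hashlib
import hmac
import json
from urllib import request

# Placeholder values: the URL, header name, and payload fields below are
# illustrative only, not the documented Custom VCS webhook schema.
WEBHOOK_URL = "https://api.pulumi.com/example/custom-vcs/webhook"
SHARED_SECRET = b"hmac-shared-secret-from-pulumi"

payload = json.dumps({
    "repository": "platform/networking",
    "branch": "refs/heads/main",
    "commit": "3f9c2ab",
}).encode()

# Sign the exact bytes of the request body with the shared secret so the
# receiver can verify the push event really came from your VCS server.
signature = hmac.new(SHARED_SECRET, payload, hashlib.sha256).hexdigest()

req = request.Request(
    WEBHOOK_URL,
    data=payload,
    headers={
        "Content-Type": "application/json",
        "X-Signature-SHA256": signature,  # placeholder header name
    },
)
request.urlopen(req)
```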
What’s supported
Custom VCS focuses on the deployment automation use case. Here’s how it compares to native integrations:
| Capability | Native integrations | Custom VCS |
| --- | --- | --- |
| Push-to-deploy | Yes | Yes |
| Path filtering | Yes | Yes |
| PR/MR previews | Yes | No |
| Commit status checks | Yes | No |
| PR comments | Yes | No |
| Review stacks | Yes | No |
Features like PR comments, commit statuses, and review stacks require deep API integration with each VCS platform, so they aren’t available with Custom VCS. If your VCS provider is GitHub, GitLab, or Azure DevOps, we recommend using the native integration for the full feature set.
Neo support
Neo, Pulumi’s AI assistant, works with Custom VCS integrations for repository operations that don’t depend on VCS-specific APIs. Neo can clone and push to Git and Mercurial repositories registered with your Custom VCS integration using the credentials from the integration’s ESC environment. Neo cannot open pull requests or create new repositories on Custom VCS servers at this time. Those operations require APIs unique to each VCS platform and are only available through native integrations.
Get started
To set up a Custom VCS integration:
Navigate to Management > Version control in Pulumi Cloud.
Select Add integration and choose Custom VCS.
Provide a name, base URL, and ESC environment containing your credentials.
Add your repositories.
Configure your VCS server to send webhooks to the provided URL.
For the full setup guide including webhook payload format, HMAC signing, and credential configuration, see the Custom VCS documentation.