
Putting Claude up against our test suite


I’m convinced that in hell, there is a special place dedicated to making engineers fix flaky tests.

Not broken tests. Not tests covering a real bug. Flaky tests. Tests that pass 999 times out of 1000 and fail on the 1,000th run for no reason you can explain with a clean conscience.

If you've ever shipped a reasonably complex distributed system, you know exactly what I'm talking about. RavenDB has, at last count, over 32,000 tests that are run continuously on our CI infrastructure. I just checked, and in the past month, we’ve had hundreds of full test runs.

That is actually a problem for our scenario, because with that many tests and that many runs, the law of large numbers starts to apply. Assuming we have tests that have 99.999% reliability, that means that 1 out of every 100,000 test runs may fail. We run tens of millions of those tests in a month.

In a given week, something between ten and twenty of those tests will fail. Given the number of test runs, that is a good number in percentage terms. But each such failure means that we have to investigate it.

Those test failures are expensive. Every ticket is a developer staring at logs, trying to figure out whether this is a genuine bug in the product, a bug in the test itself, or something broken in the environment. In almost all cases, the problem is with the test itself, but we have to investigate.

A test that consistently fails is easy to fix. A test that occasionally fails is the worst.

With a flaky test, you don't just fix something and move on. You spend two days isolating it. Reproducing it. Building a mental model of a race condition that only manifests under specific timing, load, and cosmic alignment.

The tests that do this are almost always the integration tests. The ones that test complex distributed behavior across many parts of the system simultaneously. By definition, they are also the hardest to reason about.

The fact that, in most cases, those test failures add nothing to the product (i.e., they didn’t actually discover a real bug) is just crushed glass on top of the sewer smoothie. You spend a lot of time trying to find and fix the issue, and there is no real value except that the test now consistently passes.

We have a script that runs weekly, collects all test failures, and dumps them into our issue tracker. This is routine maintenance hygiene, to make sure we stay in good shape.

I was looking at the issue tracker when the script ran, and the entire screen lit up with new issues.

Just looking at that list of new annoyances was enough to ruin my mood.

And then, without much deliberate planning, I did something dumb and impulsive: I copy-pasted all of those fresh issues into Claude and told it to fix them. Then I went and did other things. I had very low expectations about this, but there was not much to lose.

A few hours later, I got a notification about a pull request. To be honest, I expected Claude to mark the flaky tests as skipped, or remove the assertions to make them pass.

To my shock, I got an actual pull request with real fixes. Some of them were fixes applied to the test logic. Some were actual fixes in the underlying code.

And then there was this one that stopped me cold. Claude had identified that in one of our test cases, we were waiting on the wrong resource. Not wrong in an obvious way — wrong in the kind of way that works perfectly 99.9998% of the time and silently fails 0.0002% of the time.

The (test) code looked right. We were waiting for something to happen; we just happened to wait on the wrong thing, and usually the value we asserted on was already set by the time we were done waiting.
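The shape of that bug, in a minimal sketch (the names and the worker are hypothetical, not the actual RavenDB test):

```python
import threading

class Worker:
    """Hypothetical background worker: 'started' fires almost immediately,
    'value_written' fires only after the value the test asserts on exists."""
    def __init__(self):
        self.started = threading.Event()
        self.value_written = threading.Event()
        self.value = None

    def run(self):
        self.started.set()                  # fires before the real work begins
        self.value = sum(range(10_000))     # stand-in for the real work
        self.value_written.set()            # fires only once the value exists

def test_worker_produces_value():
    worker = Worker()
    threading.Thread(target=worker.run).start()

    # Flaky version: waits on the wrong event. The value is usually written
    # by the time the assertion runs, so this passes almost every time and
    # fails only under unlucky scheduling.
    worker.started.wait(timeout=30)

    # Correct version: wait on the event that actually guards the assertion.
    # worker.value_written.wait(timeout=30)

    assert worker.value is not None
```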

Claude found it. In one pass. For the price of a subscription I was already paying. For reference, that single “let me throw Claude at it” decision probably saved enough engineering time to cover the cost of Claude for the entire team for that month.

Let me be precise about what happened and what didn't. Claude did not fix everything. Some of the "fixes" it produced were pretty bad, surface-level patches that didn't address the real cause, or things that were legitimately out of scope.

You still need an engineer reviewing the output. And you still need judgment.

But it got things fixed, quickly, without needing two days to context-switch into the problem space. And the things it did fix well, it fixed really well.

The work it compressed would have realistically taken one developer a week or two to grind through — and that's assuming you could get a developer to focus on it for that long in the first place. Flaky test investigation is the kind of work that quietly kills team morale.

Engineers start dreading CI. They start treating red builds as background noise. That's how quality degrades silently. Leaving aside new features or higher velocity, being able to offload the most annoying parts of the job to a machine is… wow.

Based on this experience, we're building it into our actual workflow as an integral part of how we handle test maintenance. Failures are collected and routed to Claude, which takes a first pass at triage and repair. Then we create an issue in the bug tracker with either an actual fix or a summary of Claude's findings.
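A minimal sketch of the triage half of that step, using the Anthropic Python SDK; the model name, prompt, and failure-record format here are illustrative assumptions, not the exact pipeline:

```python
import anthropic  # official Anthropic SDK: pip install anthropic

def triage_failures(failures: list[dict]) -> str:
    """Ask Claude for a first-pass triage of collected test failures.

    `failures` is assumed to be a list of dicts with "test", "error", and
    "log_excerpt" keys produced by the weekly collection script.
    """
    client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment
    report = "\n\n".join(
        f"Test: {f['test']}\nError: {f['error']}\nLog:\n{f['log_excerpt']}"
        for f in failures
    )
    message = client.messages.create(
        model="claude-sonnet-4-5",  # illustrative model name
        max_tokens=4096,
        messages=[{
            "role": "user",
            "content": "These tests failed intermittently in CI. For each one, "
                       "suggest whether the bug is in the test, the product, or "
                       "the environment, and propose a fix:\n\n" + report,
        }],
    )
    # Summary text to attach to the tracker issue for human review.
    return message.content[0].text
```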

By the time a human reviews this, significant progress has already been made.

It doesn't replace the engineer. But it means the engineer is doing the interesting part of the work: judgment, review, architectural reasoning. Skipping the part that requires staring at race condition logs until your vision blurs.

This isn’t the most exciting aspect of using a coding agent, I’m aware. But it may be one of the best aspects in terms of quality of life.


v2026.5.4


OpenClaw 2026.5.4


White House Considers Vetting AI Models Before They Are Released

The Trump administration is reportedly considering an executive order to create a working group that could review advanced AI models before public release. The shift follows concerns over Anthropic's powerful Mythos model and its cyber capabilities, with officials weighing whether the government should get early access to frontier models without necessarily blocking their release.

The New York Times reports: In meetings last week, White House officials told executives from Anthropic, Google and OpenAI about some of those plans, people briefed on the conversations said. The working group is likely to consider a number of oversight approaches, officials said. But a review process could be similar to one being developed in Britain, which has assigned several government bodies to ensure that A.I. models meet certain safety standards, people in the tech industry and the administration said.

The discussions signal a stark reversal in the Trump administration's approach to A.I. Since returning to office last year, Mr. Trump has been a major booster of the technology, which he has said is vital to winning the geopolitical contest against China. Among other moves, he swiftly rolled back a Biden administration regulatory process that asked A.I. developers to perform safety evaluations and report on A.I. models with potential military applications. "We're going to make this industry absolutely the top, because right now it's a beautiful baby that's born," Mr. Trump said of A.I. at an event in July. "We have to grow that baby and let that baby thrive. We can't stop it. We can't stop it with politics. We can't stop it with foolish rules and even stupid rules." Mr. Trump left room for some rules, but he added that "they have to be more brilliant than even the technology itself."

The White House wants to avoid any political repercussions if a devastating A.I.-enabled cyberattack were to occur, people in the tech industry and the administration said. The administration is also evaluating whether new A.I. models could yield cyber-capabilities that could be useful to the Pentagon and U.S. intelligence agencies, they said. To get ahead of models like Mythos, some officials are pushing for a review system that would give the government first access to A.I. models, but that would not block their release, people briefed on the talks said.



Connect Any Git or Mercurial Repo to Pulumi with Custom VCS


Custom VCS is a new Pulumi Cloud integration that connects any Git or Mercurial version control system to Pulumi Deployments using webhooks and centrally managed credentials. Pulumi Cloud already has native integrations with GitHub, GitLab, and Azure DevOps, but if your team uses a self-hosted or third-party VCS, you’ve been limited to manually configuring credentials per stack with no webhook-driven automation. Custom VCS closes that gap.

The problem

Many teams run self-hosted or third-party Git servers that Pulumi Cloud doesn’t have a native integration for, and some teams still use Mercurial. Until now, their only option was the raw git source approach: embedding credentials directly in each stack’s deployment settings, with no way to trigger deployments automatically on push, and no support for Mercurial at all.

This meant:

  • No push-to-deploy: Every deployment had to be triggered manually or through a separate CI pipeline.
  • Scattered credentials: Each stack configured its own credentials independently, with no centralized management.
  • No org-level integration: There was no shared configuration that multiple stacks could reference.

How Custom VCS works

Custom VCS integrations introduce an org-level integration type that works with any Git or Mercurial server. The setup has three parts:

Credentials through ESC: Instead of OAuth flows, you store your VCS credentials (a personal access token, SSH key, or username/password) in a Pulumi ESC environment. The same credential structure works for both Git and Mercurial. The integration references this environment by name and resolves credentials at deployment time. Multiple stacks can share the same credentials without duplicating secrets.

Manual repository registration: You add repositories to the integration by name. Pulumi joins the repository name with the integration’s base URL to form clone URLs. There’s no auto-discovery, so you control exactly which repositories are available.

Webhook-driven deployments: Pulumi provides a webhook endpoint and an HMAC shared secret. You configure your VCS server to POST a JSON payload on push events, and Pulumi automatically triggers deployments for matching stacks. The webhook supports branch filtering and optional path filtering.
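If you are wiring up the sender side of that webhook yourself, the HMAC portion is standard: sign the raw JSON body with the shared secret and send the digest alongside the request. A minimal sketch follows; the URL, header name, and payload shape here are placeholders, and the exact format is defined in the Custom VCS documentation:

```python
import hashlib
import hmac
import json
import urllib.request

# Placeholders: the real webhook URL, secret, and payload schema come from
# your Custom VCS integration settings and the Pulumi documentation.
WEBHOOK_URL = "https://example.com/pulumi/custom-vcs/webhook"
SHARED_SECRET = b"hmac-shared-secret-from-pulumi"

def send_push_event(repository: str, branch: str, commit: str) -> None:
    """POST a push notification signed with HMAC-SHA256 over the raw body."""
    body = json.dumps({
        "repository": repository,
        "branch": branch,
        "commit": commit,
    }).encode("utf-8")

    # Sign the exact bytes that go on the wire, not a re-serialized copy.
    signature = hmac.new(SHARED_SECRET, body, hashlib.sha256).hexdigest()

    request = urllib.request.Request(
        WEBHOOK_URL,
        data=body,
        headers={
            "Content-Type": "application/json",
            # Placeholder header name; use the one the documentation specifies.
            "X-Signature-256": f"sha256={signature}",
        },
    )
    urllib.request.urlopen(request)
```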

What’s supported

Custom VCS focuses on the deployment automation use case. Here’s how it compares to native integrations:

Capability              Native integrations   Custom VCS
Push-to-deploy          Yes                   Yes
Path filtering          Yes                   Yes
PR/MR previews          Yes                   No
Commit status checks    Yes                   No
PR comments             Yes                   No
Review stacks           Yes                   No

Features like PR comments, commit statuses, and review stacks require deep API integration with each VCS platform, so they aren’t available with Custom VCS. If your VCS provider is GitHub, GitLab, or Azure DevOps, we recommend using the native integration for the full feature set.

Neo support

Neo, Pulumi’s AI assistant, works with Custom VCS integrations for repository operations that don’t depend on VCS-specific APIs. Neo can clone and push to Git and Mercurial repositories registered with your Custom VCS integration using the credentials from the integration’s ESC environment. Neo cannot open pull requests or create new repositories on Custom VCS servers at this time. Those operations require APIs unique to each VCS platform and are only available through native integrations.

Get started

To set up a Custom VCS integration:

  1. Navigate to Management > Version control in Pulumi Cloud.
  2. Select Add integration and choose Custom VCS.
  3. Provide a name, base URL, and ESC environment containing your credentials.
  4. Add your repositories.
  5. Configure your VCS server to send webhooks to the provided URL.

For the full setup guide including webhook payload format, HMAC signing, and credential configuration, see the Custom VCS documentation.



OpenAI, Google, and Microsoft Back Bill To Fund 'AI Literacy' In Schools

An anonymous reader quotes a report from 404 Media: A new, bipartisan bill introduced (PDF) by Democratic Senator of California Adam Schiff and endorsed by the biggest AI developers in the world -- including OpenAI, Google, and Microsoft -- would change the K-12 curriculum to shoehorn in "AI literacy," something that young people and teachers alike already hate in schools.

The Literacy in Future Technologies Artificial Intelligence, or LIFT AI Act, would empower the new director of the National Science Foundation (NSF) to make grant awards "on a merit-reviewed, competitive basis to institutions of higher education or nonprofit organizations (or a consortium thereof) to support research activities to develop educational curricula, instructional material, teacher professional development, and evaluation methods for AI literacy at the K-12 level," the bill says. It defines AI literacy as using AI; specifically, "having the age-appropriate knowledge and ability to use artificial intelligence effectively, to critically interpret outputs, to solve problems in an AI-enabled world, and to mitigate potential risks." The bill is endorsed by the American Federation of Teachers, Google, OpenAI, Information Technology Industry Council, Software & Information Industry Association, Microsoft, and HP Inc. [...]

The grant would support "AI literacy evaluation tools and resources for educators assessing proficiency in AI literacy," according to the bill. It would also fund "professional development courses and experiences in AI literacy," and the development of "hands-on learning tools to assist in developing and improving AI literacy." Most importantly for real-world implications, it would fund changing the existing curriculum "to incorporate AI literacy where appropriate, including responsible use of AI in learning."



Architecture to Resilience: A Decision Guide


Start with the framework, accelerate with the tool

Watch the video walkthrough

The Application Resilience Framework originated from a practical gap we saw in resilience reviews: teams had architecture diagrams, monitoring data, incident history, and runbooks, but no consistent way to connect them into a measurable resilience model.

The framework is intended to close that gap by turning architecture context into a structured lifecycle for risk identification, mitigation validation, health modeling, and governance. It aligns closely with the Reliability pillar of the Azure Well-Architected Framework, especially the guidance around identifying critical flows, performing Failure Mode Analysis, defining reliability targets, and building health models.

Application Resilience Framework flow from artifact import to measurable operational resilience.

The Application Resilience Framework Tool helps teams apply this framework faster by starting with artifacts they already have, such as data flow diagrams or sequence diagrams in Mermaid or image format. The tool extracts workflows, application components, platform components, dependencies, and initial failure modes, then guides the team through the decisions needed to make resilience measurable.

From those artifacts, the tool creates the first version of a resilience model. It then guides the team through one import step followed by four phases:

Import Artifacts -> Phase 1: Failure Mode Analysis -> Phase 2: Mitigation and Validation -> Phase 3: Health Model Mapping -> Phase 4: Operations and Governance

It is not a replacement for WAF guidance or Resilience Hub style assessments. It is a practical way to operationalize those concepts at the workload and workflow level, producing prioritized risks, mitigation plans, validation paths, health signals, dashboards, reports, and governance ownership.

How to use this guide

This guide follows the same flow as the tool. For each step, it covers:

  1. The decision: What needs to be decided?
  2. The options: What paths are available?
  3. The guidance: When each option fits

Use this with the video walkthrough. The video shows the tool in action. This guide explains the choices behind each step.

Question 1: What artifact should you import first?

The import step creates the starting point for the model. Regardless of the input path, the output is the same: workflows that move into Phase 1: Failure Mode Analysis.

Options

  • Data flow diagram: Best for system, module, data movement, and dependency views. If imported as an image, the tool breaks it into sequence-style flows; selected flows become workflows.
  • Sequence diagram: Best for transaction flow and service interaction views. Converted directly into workflows.
  • Mermaid input: Best for diagrams maintained as code in Mermaid format. Converted directly into workflows.
  • Image input: Best for JPG or PNG diagrams. Azure Foundry Vision models interpret the image and convert it into workflows.
  • Manual entry: Best for missing or incomplete diagrams. The user creates or corrects workflows manually.

When to pick which

Use data flow for system and dependency views. Use sequence diagrams for transaction or interaction views. Regardless of import path, the output is the same: workflows, components, dependencies, and initial failure modes ready for Phase 1.

Question 2: Which workflows should be analyzed first?

Phase 1 is Failure Mode Analysis. This is where the tool identifies what can fail and how important each failure is.

Options

  • Critical user flows: Login, checkout, payment, onboarding, request processing.
  • High-risk platform flows: Database writes, queue processing, storage access, identity, messaging, external APIs.
  • Known issue areas: Workflows with recent incidents, recurring alerts, or customer impact.

When to pick which

Start where failure creates the highest customer or business impact. The goal is not to model everything at once. The goal is to model the right thing first.

Deliverables

  • Failure Mode Analysis catalog
  • RPV risk scores
  • Criticality classification

Question 3: How should failure modes be prioritized?

After workflows and components are imported, the tool helps score each failure mode using Risk Priority Value or RPV, which uses the four factors of Impact, Likelihood, Detectability and Outage severity.
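The tool computes RPV for you. Purely as an illustration of how four factors can combine into a single sortable score, a multiplicative scheme in the spirit of FMEA risk priority numbers might look like the sketch below; the exact formula and scales the tool uses may differ, and the failure modes and values are made up.

```python
from dataclasses import dataclass

@dataclass
class FailureMode:
    name: str
    impact: int           # 1 (negligible) .. 5 (severe customer impact)
    likelihood: int       # 1 (rare) .. 5 (frequent)
    detectability: int    # 1 (caught immediately) .. 5 (easily missed)
    outage_severity: int  # 1 (degraded) .. 5 (full outage)

    @property
    def rpv(self) -> int:
        # Illustrative only: multiplying the factors means any single high
        # factor pushes the failure mode up the priority list.
        return self.impact * self.likelihood * self.detectability * self.outage_severity

failure_modes = [
    FailureMode("Queue consumer falls behind", 3, 4, 4, 2),
    FailureMode("Primary database failover", 5, 2, 2, 5),
    FailureMode("External payment API timeout", 4, 3, 3, 3),
]

# Highest-priority risks first.
for fm in sorted(failure_modes, key=lambda f: f.rpv, reverse=True):
    print(f"{fm.rpv:4d}  {fm.name}")
```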

Options

  • Use generated failure modes and scores: Best for a fast first pass.
  • Tune the RPV scores with engineering input: Best when workload context matters.
  • Add custom failure modes: Best when known risks come from incidents, reviews, or customer experience.

When to pick which

Use the generated model to accelerate the first pass, then adjust it with real system knowledge. The goal is not to create the longest list of risks. The goal is to identify the risks that deserve attention first.

Deliverables

  • Failure Mode Catalog
  • RPV Risk Scores
  • Prioritized criticality list

Question 4: Are mitigations defined or validated?

Phase 2 is Mitigation and Validation. This is where each failure mode gets a response plan.

Options

  • Detection only: The team can detect the failure, but the response is not defined.
  • Defined mitigation: The response is documented, such as retry, fallback, failover, scaling, restore, or rebalance.
  • Validated mitigation: The response has been tested through a controlled validation or chaos test.

When to pick which

For low-risk items, documented mitigation may be enough. For critical and high-risk items, validation is the key. A mitigation that has not been tested is still an assumption.

Deliverables

  • Mitigation playbooks
  • Chaos test plans
  • Support playbooks

Question 5: Which risks need health signals?

Phase 3 is Health Model Mapping. This is where the tool connects risks to observability.

A failure mode should not just sit in a document. It should map to a signal that can show whether the system is healthy, degraded, or unhealthy.

Options

  • Map all failure modes: Best for small systems or highly critical workloads.
  • Map critical and high-risk failure modes first: Best for large systems.
  • Track unmapped risks as gaps: Best when observability coverage is still improving.

When to pick which

Start with the highest RPV items. Every critical failure mode should have at least one signal, such as a metric, log, alert, availability check, or dependency signal.
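As a rough illustration of what "mapped" means in practice, each failure mode ends up paired with at least one signal plus the thresholds that mark it degraded or unhealthy. The structure, names, and thresholds below are hypothetical, not the tool's export format:

```python
from dataclasses import dataclass

@dataclass
class HealthSignal:
    source: str            # metric, log query, alert rule, availability check
    query: str              # identifier of the signal in your monitoring system
    degraded_above: float
    unhealthy_above: float

# Hypothetical mapping: failure mode name -> the signal that detects it.
# Any critical failure mode without an entry here is a coverage gap to track.
health_model = {
    "Queue consumer falls behind": HealthSignal(
        source="metric",
        query="queue_backlog_messages",
        degraded_above=1_000,
        unhealthy_above=10_000,
    ),
    "External payment API timeout": HealthSignal(
        source="metric",
        query="payment_api_p99_latency_ms",
        degraded_above=800,
        unhealthy_above=2_000,
    ),
}

def classify(signal: HealthSignal, value: float) -> str:
    """Translate a raw signal value into a healthy/degraded/unhealthy state."""
    if value > signal.unhealthy_above:
        return "unhealthy"
    if value > signal.degraded_above:
        return "degraded"
    return "healthy"
```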

Deliverables

  • Health model
  • Signal definitions
  • Coverage report
  • Bicep templates

Question 6: Should the health model be exported or deployed?

Once the health model is built, the next decision is how to use it.

Options

  • Export for review: Best when the team needs to validate the model first.
  • Generate monitoring templates: Best when the team wants repeatable implementation.
  • Deploy to Azure: Best when the model is ready to become part of operations.
  • Use outputs in downstream tools: Best when support, SRE, or incident response workflows need structured playbooks.

When to pick which

Export first if the model is still being reviewed. Deploy when component relationships, signals, and coverage are accurate enough for operational use.

Question 7: How will governance keep the model current?

Phase 4 is Operations and Governance. This is where the resilience model becomes an ongoing practice.

Options

  • One-time assessment: Useful for quick discovery but limited long term.
  • Recurring review: Best for production workloads that change regularly.
  • Closed-loop governance: Best when incidents, failed validations, and monitoring gaps feed back into the model.

When to pick which

For production systems, use a recurring governance cadence. Assign owners, track gaps, review dashboards, and update the model as the system changes.

Deliverables

  • Governance model
  • Dashboards
  • Reports and exports
  • Runbooks

Putting it together: three adoption patterns

Once governance is defined, the tool can be used in different ways depending on the team’s maturity and objective. The three common adoption patterns are:

Pattern A: Quick resilience review

  • Import one critical workflow
  • Generate failure modes
  • Review RPV scores
  • Identify top risks
  • Export findings

Best for fast architecture reviews or early customer conversations.

Pattern B: Full workload assessment

  • Import multiple workflows
  • Build a full Failure Mode Catalog
  • Define mitigations and recovery steps
  • Create chaos test plans
  • Map risks to signals
  • Produce coverage reports

Best for structured resilience assessments.

Pattern C: Operational health model

  • Build and tune the health model
  • Export or deploy monitoring artifacts
  • Track risk and signal coverage
  • Review mitigation effectiveness
  • Assign governance ownership
  • Feed findings back into the model

Best when the goal is continuous operational improvement.

A short checklist before using the tool

  1. Which workflow should we import first?
  2. Do we have a data flow diagram, sequence diagram, or Mermaid file?
  3. What components and dependencies should be included?
  4. Which failure modes matter most?
  5. How should RPV be adjusted for this workload?
  6. Do critical failure modes have mitigations?
  7. Have those mitigations been validated?
  8. Are failure modes mapped to health signals?
  9. What coverage gaps remain?
  10. Should the health model be exported or deployed?
  11. Who owns ongoing review?
  12. How often should the model be updated?

Closing thought

The Application Resilience Framework Tool provides a practical way to move from architecture artifacts to measurable, continuously improving resilience.

It starts with data flow or sequence diagrams, builds a structured view of the system, and guides teams through the decisions that matter: what can fail, how severe it is, how it is mitigated, how it is detected, and how it is governed.

Tool repo: Application Resilience Framework Tool 
