Sr. Content Developer at Microsoft, working remotely in PA, TechBash conference organizer, former Microsoft MVP, Husband, Dad and Geek.
155606 stories
·
33 followers

Making secret scanning more trustworthy: Reducing false positives at scale

1 Share

Secret scanning plays a critical role in protecting developers and organizations. It helps catch exposed credentials early and prevents small mistakes from turning into real incidents.

At GitHub’s scale, even small inefficiencies create real friction. Too many false positives make alerts harder to trust.

When alerts feel noisy, developers spend more time triaging and less time fixing real issues. Over time, this slows down remediation and reduces confidence in the system.

To address this challenge, GitHub collaborated with Microsoft Security & AI’s Agents Offense team to bring more contextual reasoning into GitHub’s secret scanning verification. The collaboration applied the verification approach from Agentic Secret Finder, a broader detection and verification system developed to understand potential secrets in context, not just whether they match a secret-like pattern. This helped GitHub explore ways to reduce low-value alerts while preserving the coverage you expect from secret scanning.

Secret scanning at GitHub today

GitHub secret scanning combines pattern-based detection with AI-based detection to identify potential secrets. Pattern-based detection catches known secret formats, such as partner patterns for tokens and API keys. AI-powered generic secret detection expands coverage to unstructured secrets like passwords that don’t match a known provider pattern.

GitHub already has industry-leading precision for provider-pattern secret detection at massive scale, processing billions of pushes and protecting tens of millions of developers across millions of repositories.

As GitHub expanded into AI-powered secret detection, the next challenge was bringing the precision of AI-detected secrets closer to the same high standard as provider-pattern detections. This collaboration focused on combining GitHub’s large-scale detection pipeline with LLM-based contextual verification to improve alert quality and developer trust.

Our approach: Make secret scanning alerts trustworthy

Secret scanning is most useful when you can quickly tell which alerts need action.

GitHub already has safeguards to reduce noise, but some secret-like values need more context to determine whether they represent a real exposure. To make those alerts easier to trust, we added more reasoning to the verification step.

By looking at how a detected value appears in code, the system can better separate real exposures from values that only look sensitive. This helps you spend less time investigating low-value alerts and more time fixing the issues that matter.

Flow chart showing GitHub's existing verification step is enhanced with context-aware reasoning to improve precision changing detection. The flow is AI based detection > Candidate Secrets > Verification LLM reasoning > High-confidence alerts.

Where this fits in the pipeline

This approach builds directly on the existing system. Detection continues to generate candidates, and the verification step evaluates them. More context-awareness makes this system better at distinguishing real secrets from noise.

The result is higher precision without changing upstream detection logic or reducing coverage.

How it works

A key challenge in verification is deciding what context to provide.

A small snippet of code is often not enough to determine whether something is a real secret. At the same time, passing entire files or repositories introduces too much noise and increases cost and latency.

Instead of giving more context, we’re giving better context.

Rather than send large amounts of code, we extract a small set of high-signal information that helps explain how the value is used. For example, we look for cases where a value is assigned to a variable and later passed into an API request, authentication header, database client, or cloud SDK call. Pattern matching can tell us that a value looks like a secret, but it can’t tell us whether the value is actually being used as one. The surrounding usage context helps the model distinguish real exposures from false alarms, such as random UUIDs or opaque strings, without reviewing the full file or repository.

A table showing 'More context' such as entire file/repository, high noise, is not preferred to 'Better context' of usage signals, execution paths. This provides a focused input.

Focused context, not more data

It’s natural to assume that improving accuracy requires analyzing more of the codebase. But the opposite is true.

Most false positives can be resolved with focused, file-level context. What matters is not how much code the model sees, but whether it has the right signals.

In many cases, you can determine whether a value is a real secret by looking at how it is used within a single file. Values that resemble placeholders, test data, or unused configuration can often be filtered out without deeper analysis.

This keeps the system both effective and practical: high accuracy, low latency, and the ability to scale across large codebases.

Results: reducing false positives in practice

We evaluated this approach on hundreds of customer-confirmed false positive alerts.

Our target was a 65% reduction. The result was 75.76%, exceeding that goal while maintaining strong detection performance.

In practice, this means significantly less noise and a higher proportion of alerts that require action.

False positive reduction based on 1,500 customer-confirmed false positive alerts reached 75.76%.
False positive reduction results based on hundreds of customer-confirmed false positive alerts.

This improvement shows up directly in the developer experience. With fewer irrelevant alerts, it becomes easier to trust what you see. Less time is spent triaging noise, and real issues can be prioritized and fixed faster.

What’s next

We’re continuing to evaluate this approach on larger datasets and live traffic, while improving how context is extracted and used for verification.

Reducing false positives has been a consistent need at scale. This work focuses on improving signal quality where it matters most, making alerts easier to trust and act on.

The goal is simple: fewer distractions, clearer signals, and faster action on real risks.

Get started by running the risk assessment for your organization today, or learn more about secret scanning.

The post Making secret scanning more trustworthy: Reducing false positives at scale appeared first on The GitHub Blog.

Read the whole story
alvinashcraft
4 minutes ago
reply
Pennsylvania, USA
Share this story
Delete

Browser testing in Postman Agent Mode

1 Share

UI tests and API tests usually live in separate worlds. The frontend team writes Playwright specs, the API team writes Postman Collections, and the two drift apart over time. Tests stay green, contracts silently break, and the first time anyone notices is when a user files a bug.

Postman just shipped a built-in web browser as a tool inside Postman Agent Mode. Combined with the Postman Playwright integration we announced earlier this year, you can now ask Agent Mode to drive your app in a real browser, record the network traffic, and generate UI tests and API tests against the same observed behavior — then run them together in a single command that fits into CI.

In this post, I’ll walk through that workflow end-to-end: kicking off a browser session in Agent Mode, getting Playwright and Postman Collection tests out the other side, and wiring the whole thing into Application Inventory so the results show up in your team’s dashboard.

Why one browser session beats two test suites

The standard pattern looks fine on paper. Playwright clicks around the frontend. The Postman Collection hits the API. Both pass. Ship it.

The problem shows up between the layers. Your frontend starts calling an undocumented endpoint after a refactor. Your API starts returning a new field that nobody adds to the schema. The UI tests don’t notice — they only assert on rendered text. The API tests don’t notice either — they’re hitting endpoints based on what the contract used to say, not what the frontend actually calls today. That’s the gap contract drift lives in, and it’s the gap Agent Mode’s browser tool is built to close.

When Agent Mode drives the browser, every request the frontend fires gets recorded once. Both test suites are generated from that single recording, which means they reference the same observed behavior instead of two independently-maintained mental models. If the API changes, both suites notice. If the UI starts calling something new, the API tests pick it up the next time you regenerate.

Prerequisites

You’ll need:

Heads up: the browser tool runs inside Agent Mode itself — there’s nothing extra to install. If your Postman app is on v12 and Agent Mode shows a Browser option in the tools list, you’re ready.

Step 1: Ask Agent Mode to drive your app

Open Agent Mode in your workspace and give it a task that involves a flow in your frontend. The trick is to be specific about what you want it to do in the browser and what you want it to produce.

Agent Mode opens the embedded browser, runs the flow, and streams the captured requests into your workspace as it goes. You’ll watch it click through your app in real time.

What you get back is a new Postman Collection containing every request the frontend made, whatever your app actually calls — along with a .spec.ts Playwright file that mirrors the same flow.

Step 2: Review the generated tests

I always review before I commit. Agent Mode is good, but it’s pattern-matching from observed traffic, and there are decisions only you can make — which fields are required, which are stable, which are environment-specific.

A typical generated Postman test script looks like this:

// Generate a unique email each run to avoid duplicate-user conflicts
const ts = Date.now();
const email = `testuser_api_${ts}@example.com`;
pm.collectionVariables.set('reg_email', email);

// Patch the request body with the generated email
const body = JSON.parse(pm.request.body.raw);
body.email = email;
pm.request.body.update({ mode: 'raw', raw: JSON.stringify(body) });

The Playwright side looks like this:

  test('registers a new user and shows the profile view', async ({ page }) => {
    const email = uniqueEmail();

    // Intercept the register API call so we can assert on it
    const [registerResponse] = await Promise.all([
      page.waitForResponse((res) =>
        res.url().includes('/api/auth/register') && res.request().method() === 'POST'
      ),
      (async () => {
        await page.getByRole('button', { name: 'Register' }).click();
        await page.locator('#register-form input[name="email"]').fill(email);
        await page.locator('#register-form input[name="password"]').fill(VALID_PASSWORD);
        await page.locator('#register-form button[type="submit"]').click();
      })(),
    ]);

    // API responded with 201
    expect(registerResponse.status()).toBe(201);
    const body = await registerResponse.json();
    expect(body.success).toBe(true);
    expect(body.data.user.email).toBe(email);
    expect(body.data.accessToken).toBeTruthy();

    // Success message appears
    await expect(page.locator('#message')).toContainText('Account created');

    // Profile view is shown (auth view hidden)
    await expect(page.locator('#profile-view')).toBeVisible();
    await expect(page.locator('#auth-view')).toBeHidden();

    // Profile fields are populated correctly
    await expect(page.locator('#profile-email')).toHaveText(email);
    await expect(page.locator('#profile-id')).not.toHaveText('-');
    await expect(page.locator('#profile-created')).not.toHaveText('-');

    // Token is persisted in localStorage
    const token = await page.evaluate(() => localStorage.getItem('auth.accessToken'));
    expect(token).toBeTruthy();
  });

Step 3: Run both suites with one command

Once the tests are checked in, you run them together with the Postman CLI:

postman app test

That single command runs your Playwright suite, captures the network traffic during the run, and validates the observed API calls against the requests in your Postman Collection. You get one set of results that covers both layers.

The output looks roughly like this:

$ postman app test

→ Running UI tests via Playwright...
  ✓ create a new project (2.3s)
  ✓ delete a project (1.8s)

→ Validating captured API traffic against collection "Demo App API"...
  ✓ POST /api/auth/login        matched ✓ schema ✓ status
  ✓ POST /api/projects          matched ✓ schema ✓ status
  ✓ DELETE /api/projects/:id    matched ✓ schema ✓ status

┌─────────────────────────┬────────────┬────────────┐
│                         │   executed │     failed │
├─────────────────────────┼────────────┼────────────┤
│              UI tests   │          2 │          0 │
├─────────────────────────┼────────────┼────────────┤
│          API requests   │          3 │          0 │
├─────────────────────────┼────────────┼────────────┤
│       contract checks   │          9 │          0 │
└─────────────────────────┴────────────┴────────────┘

If a frontend refactor starts calling an endpoint that isn’t in your Postman Collection, the contract check fails even though the Playwright assertion still passes. That’s the catch you were missing before.

Patterns I’ve found useful

Re-record after every meaningful UI change

The whole point of generating both suites from one recording is that they stay in sync. The moment you start hand-editing one without the other, drift creeps back in. When the UI flow changes, regenerate the recording. It takes a minute.

Keep credentials in environments, never in prompts

Agent Mode prompts are fine for orchestration, but they’re not the place for API keys or test passwords. Reference Postman Environments with secret-typed variables and let Agent Mode pull from there. Easier to rotate, harder to leak.

Start with read-only flows

The first time you turn Agent Mode loose on a real app, point it at a flow that only does reads — search, list views, profile pages. Once you’ve seen it work end-to-end, graduate to flows that create or delete state. The same caution you’d apply to any test automation framework applies here.

Let the contract check fail loud

The temptation is to mark the contract check as a warning instead of a failure, especially early on when there’s a lot of drift to clean up. Resist it. A contract check that doesn’t fail the build doesn’t get fixed.

Try it yourself

Pick a flow in your app that you’ve been meaning to write tests for. Open Agent Mode, point it at the URL, describe the flow in plain English, and ask for Playwright tests plus a Postman Collection. Review the output, run postman app test, and watch both layers validate against the same recording.

If you want a starting point, the Postman Playwright integration blog post walks through postman app init in detail, and the Postman samples on GitHub include working configurations you can clone and adapt.

The piece I keep coming back to is how much friction this removes. Writing API tests by hand from a Playwright recording was always a chore — copy the request, paste it into a new Postman Collection request, write the assertions, remember to update both the next time the UI changes. Agent Mode does that copy-paste-assert loop for you and ties the two ends together. The tests you ship are the tests for what your app actually does, not what someone thought it did six months ago.

Resources

The post Browser testing in Postman Agent Mode appeared first on Postman Blog.

Read the whole story
alvinashcraft
4 minutes ago
reply
Pennsylvania, USA
Share this story
Delete

Where’s the holistic AI productivity data?

1 Share

For most of my career I ran a very small company. When you run a tiny company your resources (both time and money) are limited, and you want to use them on the things that will have the most impact. You have to quickly stop doing things that aren’t cost-effective, to avoid “throwing good money after bad”. Ideally, you do a small trial of something new and measure the results before rolling it out more widely, to avoid going all in on something untested. As tech news starts to publish stories about how companies are realising that AI costs more than the humans it was supposed to replace, I’m wondering why it took them so long to figure this out. I’ve spent the last two years watching companies large and small diving headlong into AI. Rarely do I see an attempt to measure the actual costs, financial and otherwise, of that decision.

It’s certainly possible, for a skilled person, to speed up certain processes while also maintaining quality with the use of an agent. I’ve a number of examples that have been successful, and enabled improvements across content sets that would have been hard to justify the work on otherwise.

However, as I document this work I realise how the success of it relies on the things I know. I can spot when the AI tool goes off track, I review its work in the way I’d review the work of a very junior writer. I couldn’t just hand this stuff off to anyone, and them be able to replicate what I can do in terms of the quality of the end result. When you do that, what you get is something that looks on the surface like the same output, but is a pastiche of the result when someone with actual knowledge is behind it.

The same thing seems to be playing out with agentic coding. You can get yourself something that looks like a functioning application. However, without a great deal of knowledge about how to build a functioning application, what you have is often just a reasonably functional mockup. At best you’ve got a handy personal tool that should never escape into production.

Individual productivity enhancements have a ceiling, what you can do with the tools is limited by the need to review the output. As everyone talks about productivity, I’m just not seeing any real research that demonstrates AI is measurably increasing productivity when you take a holistic view.

Individual AI productivity gains

Individually it’s clearly possible to use an LLM to increase productivity. As I’ve already described, a skilled individual can selectively introduce an AI tool to perform specific (usually rote but not quite scriptable) tasks. There are improvements to be had there, but they are similar to the bump you get when you finally figure out how to use a spreadsheet properly, or learn how to automate tasks with some simple coding. If you can already do those things, then AI use can, in some circumstances, automate some additional tasks or make it quicker to create those automations.

This level of improvement is appearing in research data, for example the London School of Economics found in their report Bridging the Generational AI Gap: Unlocking Productivity for All Generations that professionals using AI save an average of 7.5 hours per week. I have a theory that in many cases for non-coders, AI has just solved coding’s image problem, and these gains could have been achieved without AI.

However, another way someone might report increased individual productivity is by shifting the work onto someone else. That might be another person or team—writers end up fixing slop drafts and having to correct obvious errors, code reviewers wade through Pull Requests, and QA teams spend more time dealing with bugs. It also might be your reader or user who now has to wade through paragraphs of slop, is misled by inaccurate documentation, runs into bugs in your app, or finds it inaccessible to them. In this case you might feel more productive, but all you’ve done is move the work around, make someone else’s job or experience measurably worse, and reduce quality.

It’s for this second reason that a holistic approach needs to be taken to truly assess productivity across an organisation. If we look at specific individuals or even teams, we’re likely to miss task reallocation based on AI use.

AI as a forcing function for accessible data

In addition to the issue of task reallocation, there’s another reason why it’s hard to quantify how useful AI actually is. AI tooling has forced a lot of data to become available and easily consumed. This makes it easier to perform non-AI automations.

People who refused to write documentation in the past are now churning out skills, which are documentation. We can use these to easily identify the process needed to achieve tasks. Identifying repetitive processes is the first step of any automation attempt.

Many of my processes are enabled through the easy access to the data required, such as MCP servers, or sites giving me a nice clean markdown export rather than me having to search through messy div soup HTML. This makes more of what I’m doing possible with regular scripting. I’ve found myself moving more things into Python over time, and using the AI tools for more discrete tasks on reliable data returned from a script.

We can’t justify costs we don’t understand

It’s hard to find anything other than anecdata from individuals telling us how AI has made them individually more productive. If AI really was creating measurable improvements in productivity across entire organisations, wouldn’t we be seeing that data? How can we justify the cost (financial, environmental, and human) of AI, if the reality is a relatively small bump in productivity that could have happened by teaching more people to automate tasks using existing tools or simple coding? Why aren’t businesses encouraging people to use non-AI methods where possible, saving the AI only for where it adds value? Given the societal costs, and the benefit to a business of bringing onboard and training people, perhaps on a balance of things even those tasks where AI is needed are better performed by people.

The lack of rigour disquiets me. I’ve been lucky enough to spend the majority of my life working with people who care. The sort of people who like things to make sense, who want to do the right thing, even if it takes longer. We thrived in an industry that prided itself on being data driven. Now so many of us are burning out. It’s exhausting trying to do the work you’ve spent a lifetime building expertise in when people around you are trying to figure out how to replace you with AI, based on vibes that it should be possible. I worry that by the time this all plays out, many of the experienced people the web needs will have left the industry. I see no evidence that AI can come close to replacing the expertise we’ll lose.

Read the whole story
alvinashcraft
4 minutes ago
reply
Pennsylvania, USA
Share this story
Delete

Agentic Testing: Where Agents Fit in the E2E Testing Stack

1 Share

Abstract

Agent-driven end-to-end (E2E) tests add a new exploratory layer to testing, but should they replace traditional deterministic tests? We ran more than 200 agentic E2E workflows using the Playwright MCP, Playwright CLI, and agent-generated Playwright tests in test workspaces using non-production data to find out how agentic testing could fit into both our and your testing stacks.

1. From Journeys to Goals

Traditional end-to-end tests validate a specific journey through the UI.

click → click → type → assert

Agent-driven tests instead validate whether a goal can be achieved, often expressed as an instruction (e.g. “send a thread message”):

goal → agent adapts → verify result

This difference can be summarized simply:

Tests enforce journeys. Agents verify goals.

Across our agentic test runs, the overall workflow remained consistent (e.g. login → search → result → clear), but the exact sequence of actions varied. In practice, agents took different paths to reach the same outcome:

  • Different input methods (clicking a search suggestion vs pressing Enter)
  • Different navigation patterns (reopening search vs reusing existing state)
  • Additional or skipped steps (extra clicks, snapshots, or intermediate actions)

Agents can still validate intermediate steps when needed, but this flexibility comes with tradeoffs in reliability, cost, and execution time, which we explore in the next sections.

The Problem

Agent-driven E2E testing looks promising, but it raises a real question: can something that costs $15–30 per run and takes over 10 minutes actually fit into modern testing workflows?

At first glance, the answer seems like no. But in 200+ runs, we found they are fundamentally different from traditional tests. They can be highly reliable and have a clear place in the testing stack.

This is largely due to recent advances in large language models, which enable agents to write code, debug failures, and interact directly with user interfaces. These capabilities introduce a new execution model for testing, but where they fit in existing E2E workflows is not always clear.

2. Our Experiment

To understand how agent-driven tests can fit into E2E workflows, we ran 200+ automated executions across multiple configurations to measure reliability, execution speed, and cost.

Execution models

We evaluated three different approaches:

  • Agent + Playwright MCP
    The agent interacts with the browser through the Playwright MCP, using predefined browser actions (clicking elements, typing input, reading DOM state, etc…) with persistent context (DOM snapshots and logs)
  • Agent + Playwright CLI
    The agent interacts with the browser by running Playwright CLI commands via the shell, executing one step at a time and deciding the next action based on the updated UI state
  • Generated Playwright Tests
    An AI agent generates deterministic Playwright test code from a natural language description, executes it as a standard E2E test, and iteratively refines it until it passes

Experiment Setup

  • Agent model (Playwright MCP / CLI): Claude Sonnet 4.5
  • Model used for generated Playwright tests: Claude Opus 4.6
  • Execution: non-interactive Claude Code (claude -p)
  • Browser tooling:
    • Playwright MCP
    • Playwright CLI
  • Environment setup: 
    • Slack Dev API MCP
    • All experiments were conducted in test workspaces using non-production data

Test flows

We used two flows to cover different levels of complexity. These flows were kept consistent across all experiments to allow for direct comparison.

  • Thread Reply (simple)
    A shorter workflow (~15–20 steps) involving creating a channel, sending a message, replying in a thread, and verifying thread state
  • Search Discovery (medium complexity)
    A longer workflow (~25–30 steps) involving entering search queries, navigating results, moving between views (search, channels, threads), and verifying expected outcomes

Input formats

For agent-driven approaches, we evaluated two input types:

  • Natural language (NL)
    Detailed, human-readable instructions describing the workflow and expected outcomes (e.g. “reply in a thread, and verify it appears in All Threads”), often written as step-by-step lists
  • Structured YAML
    The same workflow expressed in a structured format, with explicit steps, actions, targets, and expected outcomes

The difference is not the level of detail, but how that detail is represented: natural language requires the agent to interpret and map instructions to actions, while YAML defines that mapping more explicitly.

Each configuration was run 20 times. The experiment matrix below shows the full setup:

Experiment Matrix

Exp Execution Model Input Type Tools Thread Reply Search Discovery
1 Agent (Playwright MCP) NL MCP 20 20
2 Agent (Playwright MCP) YAML MCP 20 20
3 Agent (Playwright CLI) NL CLI 20 20
4 Agent (Playwright CLI) YAML CLI 20 20
5 Agent (Generated Tests) NL Code 20 20

3. What We Observed

Summary of Results

Before diving into individual metrics, here’s a quick look at how the different approaches performed overall across both natural language and YAML-based executions.

Approach

Failure rate 

(thread reply)

Failure rate 

(search discovery)

Avg runtime
Agent (Playwright MCP) 0% ~12% ~5–8 min
Agent (Playwright CLI) ~12% ~20% ~9–11 min
Generated Playwright Tests ~8% ~48% ~3 min

The following sections break down these results by individual metrics.

Reliability

One of the clearest patterns we saw was how reliability changed as flows became more complex. 

Across the agentic Playwright flows, the Playwright MCP was the more reliable configuration, consistently achieving near‑zero failure rates on simple scenarios and remaining within 0–12% on more complex flows. In contrast, the Playwright CLI showed higher failure rates (roughly 12–20%), with many failures caused by execution issues such as authentication handling, navigation timing, and session instability rather than model reasoning.

Generated Playwright tests performed reasonably well on simple flows (~8% failure rate), but degraded significantly on more complex workflows (~48%). These tests were not entirely wrong, as they typically progressed through 70-80% of the flow before breaking on a final interaction or assertion. Failures were primarily caused by variability in UI state and abstraction mismatches. These tests were generated from loosely specified natural language flows and reused existing page object abstractions, which sometimes interfered with precise element targeting in more complex scenarios.

Overall, the reliability gap widened with increasing complexity, suggesting that the agent-native execution models like MCP provide more stable behavior as flows get harder. One likely reason is how each model handles state. MCP keeps a live, stable view of the app, while CLI rebuilds state from snapshots at each step. As flows get longer, small inconsistencies in how the UI is interpreted or timed can accumulate and lead to failures. Another likely factor is in-session context. In MCP-based runs, the agent appears to reuse successful interactions from earlier steps in the same flow, while CLI can feel more like starting from scratch at each step. We didn’t explicitly measure this, but it may also contribute to the gap.

Speed

When it came to speed, generated tests were consistently the fastest.

Approach Average Duration
Generated Playwright Tests ~3 minutes
Agent (Playwright MCP) ~5–8 minutes
Agent (Playwright CLI) ~9–11 minutes

For generated tests, the runtime includes both test generation and execution. Each test was generated once and executed five times, and the numbers above reflect the average duration per run. In practice, the raw execution was much faster: ~32 seconds for thread reply and ~45 seconds for search discovery. In CI environments where tests run repeatedly, the one-time generation cost becomes negligible, allowing deterministic tests to scale more efficiently.

Agent-driven workflows pay this cost on every run. Each step typically involves:

  • Observing the UI state
  • Reasoning about the next action
  • Executing the action and validating the result

Adaptability

Another pattern we saw was how differently agents navigate the UI.

Only about 20% of runs followed the exact same sequence of actions. In most runs, the agent discovered different valid UI paths to reach the same goal.

For example, while still reaching the same final state, the agent might:

  • Open menus in a different order
  • Select slightly different UI elements
  • Use alternate navigation flows

To measure this, we compared action signatures across runs. An action signature is the ordered list of tool calls and UI actions performed by the agent (e.g. API calls, browser clicks, form interactions). Action signatures were normalized before comparison: parameters, wait/snapshot actions, and equivalent tool variants (e.g. fill vs type) were collapsed so that only meaningful differences in the action sequence were counted.

Across runs, most action sequences differed even when the final outcome was correct. This highlights a key difference between approaches: traditional E2E tests enforce a single deterministic journey through the UI, while agents explore the interface and verify whether the goal state can still be reached.

Cost and Where It Comes From

Cost stood out in our experiments. Agent-driven runs were typically $15–30 per execution, compared to much cheaper traditional test runs.

To understand where this cost came from, we analyzed token usage across different execution models by running the same search discovery flow.

Approach Tokens
MCP (Opus 4.6) ~3.8M
MCP (Sonnet 4.5) ~3.5M
MCP (Haiku 4.5) ~5.7M
CLI (Opus 4.6) ~6M
Code Gen (Opus 4.6) ~7M

The first thing that stood out was that how the agent was executed mattered more than which model powered it. Haiku did use more tokens than Sonnet or Opus in our runs, but all of the MCP-based approaches still used fewer tokens overall than the CLI and Code Gen approaches for the same flow.

To understand why, we looked at how Claude Code executes agent sessions. The underlying API is stateless and every turn re-sends the full system prompt plus the entire conversation history. This means cost is not driven by model output, which is negligible, but by how quickly context accumulates and how many turns the agent takes to complete the flow.

Approach Turns
MCP (Opus 4.6) ~40
MCP (Sonnet 4.5) ~40
MCP (Haiku 4.5) ~60
CLI (Opus 4.6) ~85
Code Gen (Opus 4.6) ~70

On average, CLI took 85 turns compared to MCP’s ~40-60 because each browser interaction was split across multiple commands, such as actions, waits, snapshots, reads, and element lookups. MCP combined interaction and state return into a single round trip. Each additional turn pays the full system prompt tax plus re-sends all prior conversation context.

What fills that context? For MCP and CLI approaches, browser snapshots are the primary payload. Playwright MCP returns accessibility tree snapshots as part of its browser interaction responses, and these accumulate in the conversation window across all subsequent turns. For Code Gen, the accumulated context comes from test runner output containing full error traces, assertion failures, and DOM state on each retry cycle.

In our analysis, the majority of the cost was retransmission of previously seen content. Only a small fraction of tokens represented new information per turn. The biggest factors affecting cost are turn count and context growth rate rather than model reasoning or output generation.

At this stage, we focused primarily on reliability and behavior, so token usage was not optimized. Opportunities to reduce cost include prompt caching, context compaction, and reducing snapshot frequency. 

Due to the cost, agent-driven testing may currently be better suited for targeted debugging or exploratory testing than for high-frequency CI execution, although cost may improve with future models and tooling.

Infrastructure Matters (MCP vs CLI)

Another important takeaway was how much the execution environment affected reliability, not just the model itself.

Approach Failure rate
Agent (Playwright MCP) 0–12%
Agent (Playwright CLI) 12–20%

Most failures in CLI-based runs came from authentication and navigation issues (sign-in errors, timeouts, and session instability), suggesting that many failures were caused by the execution layer rather than the agent’s reasoning.

The Playwright MCP provides structured browser primitives and tighter integration with the agent’s tool-calling workflow, while CLI-based execution introduces additional layers between the agent and the browser.

Parallelization also differed. MCP runs were easy to execute concurrently, while CLI-based runs were difficult to parallelize in our setup and were mostly executed sequentially.

These results suggest that reliability, speed, and cost depend not just on the model, but also on how stable and well-designed the execution environment is.

Execution Capability Boundaries

Our experiments focused on single-session UI workflows. More complex scenarios, such as cross-workspace flows or workflows that open multiple browser windows, introduce a different set of challenges where the choice of execution model may matter as much as the agent itself.

Both MCP and CLI-based approaches could support these workflows, but with different tradeoffs. MCP may run into cost issues as observation loops grow over longer flows, while CLI-based approaches may introduce additional coordination complexity when managing multiple browser sessions, on top of the higher token usage observed in our experiments. We did not explore these scenarios here, but they are an important consideration for teams evaluating agent-driven testing.

4. Where Agentic Testing Fits in the Testing Pyramid

So where does agent-driven testing actually fit?

Rather than replacing existing approaches, it adds a new capability on top of them.

Deterministic E2E Tests

Best suited for fast, repeatable regression checks in CI.

  • Human-written or AI-generated tests
  • Fast, repeatable, and CI-friendly
  • Low operational cost
  • Enforce a specific journey through the UI 

Agentic Testing

Agent-driven workflows operate differently from deterministic tests. Instead of executing a predefined script, agents operate from a goal: they observe the UI, reason about the current state, and determine how to reach the desired outcome.

  • Exploring complex UI behavior
  • Debugging flaky workflows
  • Reproducing production bugs

Testing Pyramid with Agentic Layer

Testing pyramid with four layers: Unit Tests, Integration Tests, E2E Testing, and Agentic Testing
Testing pyramid with four layers: Unit Tests, Integration Tests, E2E Testing, and Agentic Testing

From a system perspective, agentic testing still operates at the same level as E2E tests, validating real user workflows through the UI. The difference is in how those workflows are executed. 

For this reason, the most effective testing strategies of the future will combine both. Deterministic tests provide a stable foundation for CI, while agentic testing adds a distinct layer at the top of the testing pyramid for exploration, debugging, and validating complex behaviors.

5. Acknowledgements

Huge thanks to the DevXP AI team for building and supporting tools like Claude Code, as well as the metrics infrastructure that made these experiments possible. That foundation made it much easier to run, analyze, and iterate on hundreds of executions.

Special thanks to our managers, Dave Harrington and Vani Anantha, for supporting experiments at a scale that definitely kept the token counters busy, and briefly put us on our internal token usage leaderboard.

We also want to thank the Frontend Test Frameworks team for their help throughout the process, from early ideas to validation and feedback. Special thanks to Lucy Cheng, Natalie Stormann, Roopa Thanisraj, Ilaria Varriale, and Crescencio Zul for their thoughtful input and support along the way.

Interested in solving real problems, making developers’ lives easier, or just building some pretty cool tools? If this kind of work excites you, whether it’s pushing the boundaries of testing or building agent-driven systems and rethinking developer workflows, we’re hiring.

Apply now

 

Read the whole story
alvinashcraft
4 minutes ago
reply
Pennsylvania, USA
Share this story
Delete

WeAreDevelopers is coming to the US to give unsung developers a bigger voice

1 Share
WeAreDevelopers, the Berlin-based developer conference founded in 2015, has grown into a major global event, attracting 15,000 developers from over 70 countries each year. In 2026, it expands beyond Europe with new editions in San Jose, California, and Bengaluru, India. Co-founder and CEO Sead Ahmetovic says the conference was created to give developers a stronger voice in an industry where marketers, salespeople, and entrepreneurs often receive more recognition.

Unless you’ve been living under a mousepad, you know about WeAreDevelopers. The Berlin-based software developers conference and networking event, now in its 11th year, attracts 15,000 devs from more than 70 countries to the German capital each July.

You might be forgiven, though, if you weren’t aware that the conference is expanding in 2026: WeAreDevelopers is coming to San Jose, California, in September and Bengaluru, India, in November.

If you ask Sead Ahmetovic, co-founder and CEO of WeAreDevelopers, about the value of the conference — especially as it expands in two directions this year — he’ll tell you firstly about giving developers an overview for building new software. But the second reason he goes a little deeper: WeAreDevelopers is an all-too-rare opportunity for sometimes unsung developers to get together and actually celebrate themselves a little.

“Developers are introverts. I’m an introvert. We are not that loud when we achieve something or build something. But developers build the products that the entire world uses.”
—Sead Ahmetovic of WeAreDevelopers

“We wanted to give the developer community a bigger voice, because the marketing people, the business people, the sales people, the startup people — they all celebrated themselves,” Ahmetovic says. “And developers are introverts. I’m an introvert. We are not that loud when we achieve something or build something. But developers build the products that the entire world uses.”

Thomas Dohmke, the Seattle-based technology executive (he’s now co-founder and CEO of Entire and the former CEO of GitHub), tells The New Stack that one appeal of WeAreDevelopers is it gives developers an opportunity to tell the story behind their work.

“If you can [tell the story of your work to a crowd of developers], then you can not only convince them of what’s about to come and how their life is going to change, but you can also make them one of your biggest supporters.”

Listen to the full interview in the latest episode of The New Stack podcast. Below is an abbreviated Q&A with Ahmetovic and Dhomke, edited for clarity and brevity.

The New Stack: Sead, WeAreDevelopers began in 2015 as essentially a side project. What was the gap you and co-founder Benjamin Ruschin saw?

Ahmetovic: Ten years ago, you had this rise of cool startup events and great marketing conferences — people going on stage celebrating themselves. For me, as a developer, it was not that interesting, because you do not have the hands-on substance. Developer events existed, but they were mostly based on a niche topic — a programming language, a framework. If you want to build great software, you need to bring all the different stakeholders together. And we wanted to give the developer community a bigger voice. Developers are introverts — we are not that loud when we achieve something. But developers build the products that the entire world uses.

The short version is we did a meetup, 300 people showed up, we did it again, 600 showed up, and everything else just happened. There was no real strategy in the beginning.

How did you choose the name? It almost feels defiant — like, “we deserve our own conference.

Ahmetovic: Could be, but I really don’t know. We brainstormed what domains are available, and everything was taken. But at the end: OK, “we are developers.” That’s what we are. Someone told me, “Do you even know what responsibility you have when you have this name?” But honestly, I think we never really took that too seriously.

Why expand to the US now?

Ahmetovic: People asked for it, partners asked for it. Most of our partners in Europe actually are US-based tech companies. My first reaction was: But you must have these kinds of conferences in North America. We went overseas and looked at a lot of great events, but we saw there is a niche — the vibe, the deep technical content, the hands-on formats — that’s missing. It’s very complementary to the current tech event scene.

Thomas, you left GitHub in August 2025, after leading it past 150 million developers, to become a founder again. What was the itch you couldn’t ignore that pushed you to start Entire?

Dohmke: I was born in the late ’70s in East Berlin, so when the internet boom happened in the mid-’90s, I was too young to participate as a startup founder. But through the journey of Copilot and ChatGPT, I could see we’re in the early stages of another drastic change — like the internet fundamentally changed our lives, I believe AI is such a transformation. All the companies founded now are effectively post-AI — building with AI in mind — and all the companies that already exist have to think about a transformation. I thought that’s an amazing opportunity to go back to that founder lifestyle and build Entire as a new developer platform for these AI-native software developers.

You’ve said GitHub was built for humans collaborating with humans, not for developers running dozens of agents. What does a platform built for that world look like?

Dohmke: GitHub’s landing page is often a repository, which shows you files and folders. Yet most of these files are now written by agents, and almost nobody wants to browse through a file tree — that’s not the artifact you’re interested in. What’s much more interesting is what the developer put as a prompt into the agent. We have to move to the artifacts that are actually relevant: your session logs, your checkpoints — the things describing the idea. They’re the institutional knowledge of your organization, codified together with the code. Those artifacts create the brain of every software project that both humans and agents can leverage.

Five years out — let’s say 2030 — what does a developer’s day look like?

Dohmke: You can check in with your agent the same way you check email and Slack, and feed in three more tasks before you even head to breakfast. But what it won’t mean — we’re seeing this already — is less work. We will feel like magicians with orchestras of agents, but there’s only so much we can process. Those who figure out how to organize will build more and achieve more than ever.

“We will feel like magicians with orchestras of agents, but there’s only so much we can process.”
— Thomas Dohmke of Entire

Ahmetovic: The question I was waiting for is, “Will there be a need for developers?” A developer is someone who builds something. In five years they will still build things — less typing, more thinking, more orchestrating, more talking. The other stuff stays the same: just building stuff.

WeAreDevelopers World Congress North America runs September 23–25 in San Jose, California. Details at wearedevelopers.com.

The post WeAreDevelopers is coming to the US to give unsung developers a bigger voice appeared first on The New Stack.

Read the whole story
alvinashcraft
5 minutes ago
reply
Pennsylvania, USA
Share this story
Delete

Dapr v1.18 is now available

1 Share
We’re excited to announce the release of Dapr 1.18! This is primarily a Workflows release, focused on security, durability, and scale. Workflow history can now optionally be cryptographically signed and verified on every state load, so tampering is caught the moment state is read. The same protection extends across application boundaries: child workflow and activity completions are attested between apps. A new WorkflowAccessPolicy resource controls which application IDs may invoke which workflows and activities.
Read the whole story
alvinashcraft
5 minutes ago
reply
Pennsylvania, USA
Share this story
Delete
Next Page of Stories