Meta laying off 331 workers in Washington state as part of broader cuts to Reality Labs division

Meta’s Dexter Station office in Seattle. (Meta Photo)

New layoffs at Meta will impact 331 workers in the Seattle area and Washington state, according to a filing from the state Employment Security Department.

The company is cutting employees at four facilities located in Seattle and on the Eastside, as well as approximately 97 employees who work remotely in Washington. The layoffs are part of broader reductions in the company’s Reality Labs division, first announced last week, that impacted 1,500 jobs companywide.

The hardest-hit facility is the Reality Labs office in Redmond, followed by the Spring District office in Bellevue, according to the Worker Adjustment and Retraining Notification (WARN) filing.

Meta’s Horizon OS software engineering team, working out of a Meta office on Dexter Avenue North in Seattle, was the hardest hit single group with 20 jobs cut. Horizon OS is the extended reality operating system developed to power Meta Quest virtual reality and mixed reality headsets.

Layoffs are expected to take effect on March 20.

With about 15,000 employees, Reality Labs currently represents about 19% of Meta’s total global workforce of roughly 78,000.

The company employs thousands of people across multiple offices in the Seattle region, one of its largest engineering hubs outside Menlo Park, Calif. Last October, the Facebook parent laid off more than 100 employees in Washington state as part of a broader round of cuts within its artificial intelligence division.

The Reality Labs cuts come at a time when the company is reportedly shifting priorities away from the metaverse to build next-generation artificial intelligence.

The Microsoft-OpenAI Files: Internal documents reveal the realities of AI’s defining alliance

Sam Altman greets Microsoft CEO Satya Nadella at OpenAI DevDay in San Francisco in 2023. (GeekWire File Photo / Todd Bishop)

The launch of the AI lab that would redefine Microsoft caught the tech giant by surprise.

“Did we get called to participate?” Satya Nadella wrote to his team on Dec. 12, 2015, hours after OpenAI announced its founding. “AWS seems to have sneaked in there.”

Nadella had been Microsoft CEO for less than two years. Azure, the company’s cloud platform, was five years old and chasing Amazon Web Services for market share. And now AWS had been listed as a donor in the “Introducing OpenAI” post. Microsoft wasn’t in the mix. 

In the internal message, which hasn’t been previously reported, Nadella wondered how the new AI nonprofit could remain truly “open” if it was tied only to Amazon’s cloud.

Within months, Microsoft was courting OpenAI. Within four years, it would invest $1 billion, adding more than $12 billion in subsequent rounds. Within a decade, the relationship would culminate in a $250 billion spending commitment for Microsoft’s cloud and a 27% equity stake in one of the most valuable startups in history.

New court filings offer an inside look at one of the most consequential relationships in tech. Previously undisclosed emails, messages, slide decks, reports, and deposition transcripts reveal how Microsoft pursued, rebuffed and backed OpenAI at various moments over the past decade, ultimately shaping the course of the lab that launched the generative AI era.

More broadly, they show how Nadella and Microsoft’s senior leadership team rally in a crisis, maneuver against rivals such as Google and Amazon, and talk about deals in private.

For this story, GeekWire dug through more than 200 documents, many of them made public Friday in Elon Musk’s ongoing suit accusing OpenAI and its CEO Sam Altman of abandoning the nonprofit mission. Microsoft is also a defendant. Musk, who was an OpenAI co-founder, is seeking up to $134 billion in damages. A jury trial is scheduled for this spring.

OpenAI has disputed Musk’s account of the company’s origins. In a blog post last week, the company said Musk agreed in 2017 that a for-profit structure was necessary, and that negotiations ended only when OpenAI refused to give him full control. 

The recently disclosed records show that Microsoft’s own leadership anticipated the possibility of such a dispute. In March 2018, after learning of OpenAI’s plans to launch a commercial arm, Microsoft CTO Kevin Scott sent Nadella and others an email offering his thoughts.

“I wonder if the big OpenAI donors are aware of these plans?” Scott wrote. “Ideologically, I can’t imagine that they funded an open effort to concentrate ML [machine learning] talent so that they could then go build a closed, for profit thing on its back.”

The latest round of documents, filed as exhibits in Musk’s lawsuit, represents a partial record selected to support his claims in the case. Microsoft declined to comment. 

Elon helps Microsoft win OpenAI from Amazon

Microsoft’s relationship with OpenAI has been one of its key strategic advantages in the cloud. But the behind-the-scenes emails make it clear that Amazon was actually there first.

According to an internal Microsoft slide deck from August 2016, included in recent filings, OpenAI was running its research on AWS as part of a deal that gave it $50 million in computing for $10 million in committed funds. The contract was up for renewal in September 2016. 

Microsoft wanted in. Nadella reached out to Altman, looking for a way to work together. 

In late August, the filings show, Altman emailed Musk about a new deal with Microsoft: “I have negotiated a $50 million compute donation from them over the next 3 years!” he wrote. “Do you have any reason not to like them, or care about us switching over from Amazon?” 

Musk, co-chair of OpenAI at the time, gave his blessing to the Microsoft deal in his unique way, starting with a swipe at Amazon founder Jeff Bezos: “I think Jeff is a bit of a tool and Satya is not, so I slightly prefer Microsoft, but I hate their marketing dept,” Musk wrote. 

He asked Altman what happened to Amazon.

Altman responded, “Amazon started really dicking us around on the T+C [terms and conditions], especially on marketing commits. … And their offering wasn’t that good technically anyway.”

Microsoft and OpenAI announced their partnership in November 2016 with a blog post highlighting their plans to “democratize artificial intelligence,” and noting that OpenAI would use Azure as its primary cloud platform going forward.

Harry Shum, then the head of Microsoft’s AI initiatives, with Sam Altman of OpenAI in 2016. (Photo by Brian Smale for Microsoft)

Internally, Microsoft saw multiple benefits. The August 2016 slide deck, titled “OpenAI on Azure Big Compute,” described it as a prime opportunity to flip a high-profile customer to Azure. 

The presentation also emphasized bigger goals: “thought leadership” in AI, a “halo effect” for Azure’s GPU launch, and the chance to recruit a “net-new audience” of developers and startups. It noted that OpenAI was a nonprofit “unconstrained by a need to generate financial return” — an organization whose research could burnish Microsoft’s reputation in AI.

But as the ambition grew, so did the bill.

‘Most impressive thing yet in the history of AI’

In June 2017, Musk spoke with Nadella directly to pitch a major expansion. OpenAI wanted to train AI systems to beat the best human players at Valve’s competitive esports title Dota 2. The computing requirements were massive: 10,000 servers equipped with the latest Nvidia GPUs.

“This would obviously be a major opportunity for Microsoft to promote Azure relative to other cloud systems,” Musk wrote in an email to OpenAI colleagues after the call.

Nadella said he’d talk about it internally with his Microsoft cloud team, according to the email. “Sounds like there is a good chance they will do it,” Musk wrote.

Two months later, Altman followed up with a formal pitch. “I think it will be the most impressive thing yet in the history of AI,” he wrote to Nadella that August.

Microsoft’s cloud executives ran the numbers and balked. In an August 2017 email thread, Microsoft executive Jason Zander told Nadella the deal would cost so much it “frankly makes it a non-starter.” The numbers are redacted from the public version of the email. 

“I do believe the pop from someone like Sam and Elon will help build momentum for Azure,” Zander wrote. “The scale is also a good forcing function for the fleet and we can drive scale into the supply chain. But I won’t take a complete bath to do it.”

Ultimately, Microsoft passed. OpenAI contracted with Google for the Dota 2 project instead.

‘A bucket of undifferentiated GPUs’

Microsoft’s broader relationship with OpenAI was starting to fray, as well. By January 2018, according to internal emails, Microsoft executive Brett Tanzer had told Altman that he was having a hard time finding internal sponsors at Microsoft for an expanded OpenAI deal. 

Altman started shopping for alternatives. Around that time, Tanzer noted in an email to Nadella and other senior executives that OpenAI’s people “have been up in the area recently across the lake” — a reference to Amazon’s Seattle headquarters.

The internal debate at Microsoft was blunt. 

OpenAI CEO Sam Altman and Microsoft CTO Kevin Scott at Microsoft Build in 2024. (GeekWire File Photo / Todd Bishop)

Scott wrote that OpenAI was treating Microsoft “like a bucket of undifferentiated GPUs, which isn’t interesting for us at all.” Harry Shum, who led Microsoft’s AI research, said he’d visited OpenAI a year earlier and “was not able to see any immediate breakthrough in AGI.” 

Eric Horvitz, Microsoft’s chief scientist, chimed in to say he had tried a different approach. After a Skype call with OpenAI co-founder Greg Brockman, he pitched the idea of a collaboration focused on “extending human intellect with AI — versus beating humans.” 

The conversation was friendly, Horvitz wrote, but he didn’t sense much interest. He suspected OpenAI’s Dota work was “motivated by a need to show how AI can crush humans, as part of Elon Musk’s interest in demonstrating why we should all be concerned about the power of AI.”

Scott summed up the risk of walking away: OpenAI might “storm off to Amazon in a huff and shit-talk us and Azure on the way out.”

“They are building credibility in the AI community very fast,” the Microsoft CTO and Silicon Valley veteran wrote. “All things equal, I’d love to have them be a Microsoft and Azure net promoter. Not sure that alone is worth what they’re asking.”

But by the following year, Microsoft had found a reason to double down.

The first billion

In 2019, OpenAI restructured. The nonprofit would remain, but a new “capped profit” entity would sit beneath it — a hybrid that could raise capital from investors while limiting their returns. 

Microsoft agreed to invest $1 billion, with an option for a second billion, in exchange for exclusive cloud computing rights and a commercial license to OpenAI’s technology.

The companies announced the deal in July 2019 with a joint press release. “The creation of AGI will be the most important technological development in human history, with the potential to shape the trajectory of humanity,” Altman said. Nadella echoed that sentiment, emphasizing the companies’ ambition to “democratize AI” while keeping safety at the center.

So what changed for Microsoft between 2018 and 2019?

In a June 2019 email to Nadella and Bill Gates, previously disclosed in the Google antitrust case, Scott cited the search giant’s AI progress as one reason for Microsoft to invest in OpenAI. He “got very, very worried,” he explained, when he “dug in to try to understand where all of the capability gaps were between Google and us for model training.”

Microsoft CEO Satya Nadella and OpenAI CEO Sam Altman at the Microsoft campus in Redmond, Wash. on July 15, 2019. (Photography by Scott Eklund/Red Box Pictures)

Nadella forwarded Scott’s email to Amy Hood, Microsoft’s CFO. “Very good email that explains why I want us to do this,” Nadella wrote, referring to the larger OpenAI investment, “and also why we will then ensure our infra folks execute.”

Gates wasn’t so sure. According to Nadella’s deposition testimony, the Microsoft co-founder was clear in “wanting us to just do our own” — arguing that the company should focus on building AI capabilities in-house rather than placing such a large bet on OpenAI.

Nadella explained that the decision to invest was eventually driven by him and Scott, who concluded that OpenAI’s specific research direction into transformers and large language models (the GPT class) was more promising than other approaches at the time.

Hood, meanwhile, offered some blunt commentary on OpenAI’s cap on profits — the centerpiece of its new structure, meant to limit investor returns and preserve the nonprofit’s mission. The caps were so high, she wrote, that they were almost meaningless.

“Given the cap is actually larger than 90% of public companies, I am not sure it is terribly constraining nor terribly altruistic but that is Sam’s call on his cap,” Hood wrote in a July 14, 2019, email to Nadella, Scott, and other executives. 

If OpenAI succeeded, she noted, the real money for Microsoft would come from Azure revenue — far exceeding any capped return on the investment itself.

But the deal gave Microsoft more than cloud revenue.

According to an internal OpenAI memo dated June 2019, Microsoft’s investment came with approval rights over “Major Decisions” — including changes to the company’s structure, distributions to partners, and any merger or dissolution.

Microsoft’s $1 billion made it the dominant investor. Under the partnership agreement, major decisions required approval from a majority of limited partners based on how much they had contributed. At 85% of the total, Microsoft had an effective veto, a position of power that would give it a pivotal role in defining OpenAI’s future.

‘The opposite of open’

In September 2020, Musk responded to reports that Microsoft had exclusively licensed OpenAI’s GPT-3. “This does seem like the opposite of open,” he tweeted. “OpenAI is essentially captured by Microsoft.”

Nadella seemed to take the criticism seriously. 

In an October 2020 meeting, according to internal notes cited in a recent court order, Microsoft executives discussed the perception that the company was “effectively owning” OpenAI, with Nadella saying they needed to give thought to Musk’s perspective.

In February 2021, as Microsoft and OpenAI negotiated a new investment, Altman emailed Microsoft’s team: “We want to do everything we can to make you all commercially successful and are happy to move significantly from the term sheet.” 

His preference, Altman told the Microsoft execs, was “to make you all a bunch of money as quickly as we can and for you to be enthusiastic about making this additional investment soon.”

They closed the deal in March 2021, for up to $2 billion. This was not disclosed publicly until January 2023, when Microsoft revealed it as part of a larger investment announcement.

By 2022, the pressure to commercialize was explicit. 

Mira Murati, left, and Sam Altman at OpenAI DevDay 2023. (GeekWire File Photo / Todd Bishop)

According to a transcript of her deposition, Mira Murati, then OpenAI’s vice president of applied AI and partnerships, had written in contemporaneous notes that the most-cited goal inside the company that year was a $100 million revenue target. Altman had told employees that Nadella and Scott said this needed to be hit to justify the next investment, as much as $10 billion.

Murati testified that Altman told her “it was important to achieve this goal to receive Microsoft’s continued investments.” OpenAI responded by expanding its go-to-market team and building out its enterprise business.

Then everything changed.

The ChatGPT moment

On Nov. 30, 2022, OpenAI announced ChatGPT. The chatbot became the fastest-growing consumer application in history, reaching 100 million users within two months. It was the moment that turned OpenAI from an AI research lab into a household name.

Microsoft’s bet was suddenly looking very different.

OpenAI’s board learned about the launch on Twitter. According to deposition testimony, board members Helen Toner and Tasha McCauley received no advance notice and discovered ChatGPT by seeing screenshots on social media. 

McCauley described the fact that a “major release” could happen without the board knowing as “extremely concerning.” Toner testified that she wasn’t surprised — she was “used to the board not being very informed” — but believed it demonstrated that the company’s processes for decisions with “material impact on the mission were inadequate.”

Altman, according to one filing, characterized the release as a “research preview” using existing technology. He said the board “had been talking for months” about building a chat product, but acknowledged that he probably did not send the board an email about the specific release.

As its biggest investor, Microsoft pushed OpenAI to monetize the product’s success.

Microsoft CEO Satya Nadella speaks at OpenAI DevDay in 2023, as Sam Altman looks on. (GeekWire File Photo / Todd Bishop)

In mid-January 2023, Nadella texted Altman asking when they planned to activate a paid subscription.

Altman said they were “hoping to be ready by end of jan, but we can be flexible beyond that. the only real reason for rushing it is we are just so out of capacity and delivering a bad user experience.”

He asked Nadella for his input: “any preference on when we do it?”

“Overall getting this in place sooner is best,” the Microsoft CEO responded, in part.

Two weeks later, Nadella checked in again: “Btw …how many subs have you guys added to chatGPT?”

Altman’s answer revealed what they were dealing with. OpenAI had 6 million daily active users — their capacity limit — and had turned away 50 million people who tried to sign up. “Had to delay charging due to legal issues,” he wrote, “but it should go out this coming week.”

ChatGPT Plus launched on Feb. 1, 2023, at $20 a month.

Microsoft invested $10 billion in OpenAI. The companies had begun negotiating the previous summer, when OpenAI was still building ChatGPT. The product’s viral success validated Microsoft’s bet and foreshadowed a new era of demand for its cloud platform.

Ten months later, it nearly collapsed.

‘Run over by a truck’

On Friday afternoon, Nov. 17, 2023, OpenAI’s nonprofit board fired Altman as CEO, issuing a terse statement that he had not been “consistently candid in his communications with the board.” Greg Brockman, the company’s president and cofounder, was removed from the board the same day. He quit hours later.

Microsoft, OpenAI’s largest investor, was not consulted. Murati, then OpenAI’s chief technology officer and the board’s choice for interim CEO, called Nadella and Kevin Scott to warn them just 10 to 15 minutes before Altman himself was told.

“Mira sounded like she had been run over by a truck as she tells me,” Scott wrote in an email to colleagues that weekend.

The board — Ilya Sutskever, Tasha McCauley, Helen Toner, and Adam D’Angelo — had informed Murati the night before. They had given her less than 24 hours to prepare.

At noon Pacific time, the board delivered the news to Altman. The blog post went live immediately. An all-hands meeting followed at 2 p.m. By Friday night, Brockman had resigned. So had Jakub Pachocki, OpenAI’s head of research, along with a handful of other researchers. 

A “whole horde” of employees, Scott wrote, had reached out to Altman and Brockman “expressing loyalty to them, and saying they will resign.”

Microsoft didn’t have a seat on the board. But text messages between Nadella and Altman, revealed in the latest filings, show just how influential it was in the ultimate outcome.

At 7:42 a.m. Pacific on Saturday, Nov. 18, Nadella texted Altman asking if he was free to talk. Altman replied that he was on a board call.

“Good,” Nadella wrote. “Call when done. I have one idea.”

That evening, at 8:25 p.m., Nadella followed up with a detailed message from Brad Smith, Microsoft’s president and top lawyer. In a matter of hours, the trillion-dollar corporation had turned on a dime, establishing a new subsidiary from scratch — legal work done, papers ready to file as soon as the Washington Secretary of State opened Monday morning.

They called it Microsoft RAI Inc., using the acronym for Responsible Artificial Intelligence.

“We can then capitalize the subsidiary and take all the other steps needed to operationalize this and support Sam in whatever way is needed,” Smith wrote. Microsoft was “ready to go if that’s the direction we need to head.”

Altman’s reply: “kk.”

A screenshot of text messages between Microsoft CEO Satya Nadella and OpenAI CEO Sam Altman following Altman’s ouster in 2023.

The company calculated the cost of absorbing the OpenAI team at roughly $25 billion, Nadella later confirmed in a deposition — enough to match the compensation and unvested equity of employees who had been promised stakes in a company that now seemed on the verge of collapse.

By Sunday, Emmett Shear, the Twitch co-founder, had replaced Murati as interim CEO. That night, when the board still hadn’t reinstated Altman, Nadella announced publicly that Microsoft was prepared to hire the OpenAI CEO and key members of his team.

“In a world of bad choices,” Nadella said in his deposition, the move “was definitely not my preferred thing.” But it was preferable to the alternative, he added. “The worst outcome would have been all these people leave and they go to our competition.”

‘Strong strong no’

On Tuesday, Nov. 21, the outcome was still uncertain. Altman messaged Nadella and Scott that morning, “can we talk soon? have a positive update, ish.” Later, he said the situation looked “reasonably positive” for a five-member board. Shear was talking to the remaining directors.

Nadella asked about the composition, according to the newly public transcript of the message thread, which redacts the names of people who ultimately weren’t chosen.

“Is this Larry Summers and [redacted] and you three? Is that still the plan?”

Summers was confirmed, Altman replied. The other slots were “still up in air.”

Altman asked, “would [redacted] be ok with you?”

“No,” Nadella wrote.

Scott was more emphatic, giving one unnamed person a “strong no,” and following up for emphasis: “Strong strong no.”

The vetting continued, as Nadella and Scott offered suggestions, all of them redacted in the public version of the thread. 

A screenshot of text messages from Nov. 21, 2023, included as an exhibit in Elon Musk’s lawsuit, shows Microsoft President Brad Smith and CEO Satya Nadella discussing OpenAI board prospects with Sam Altman following his ouster.

Nadella added Smith to the thread. One candidate, the Microsoft president wrote, was “Solid, thoughtful, calm.” Another was “Incredibly smart, firm, practical, while also a good listener.”

At one point, Scott floated a joke: “I can quit for six months and do it.” He added a grinning emoji and commented, “Ready to be downvoted by Satya on this one, and not really serious.”

Nadella gave that a thumbs down.

The back-and-forth reflected a delicate position. Microsoft had no board seat at OpenAI. Nadella had said publicly that the company didn’t want one. But the texts showed something closer to a shadow veto — a real-time screening of the people who would oversee the nonprofit’s mission.

By evening, a framework emerged. Altman proposed Bret Taylor, Larry Summers, and Adam D’Angelo as the board, with himself restored as CEO. Taylor would handle the investigation into his firing.

Smith raised a concern. “Your future would be decided by Larry [Summers],” he wrote. “He’s smart but so mercurial.” He called it “too risky.” (Summers resigned from the OpenAI board in November 2025, following revelations about his correspondence with Jeffrey Epstein.)

Altman wrote, “id accept it given my conversations with him and where we are right now.” He added, “it’s bullshit but i want to save this … can you guys live with it?”

Nadella asked for Summers’ cell number.

At 2:38 p.m., Altman texted the group: “thank you guys for the partnership and trust. excited to get this all sorted to a long-term configuration you can really depend on.”

Nadella loved the message.

Two minutes later, Smith replied: “Thank you! A tough several days. Let’s build on this and regain momentum.”

Altman loved that one.

Nadella had the last word: “Really looking forward to getting back to building….”

Later that night, OpenAI announced Altman’s return with the newly constituted board.

“We are encouraged by the changes to the OpenAI board,” Nadella posted on X. “We believe this is a first essential step on a path to more stable, well-informed, and effective governance.”

The crisis was resolved, but the underlying tensions remained.

‘Project Watershed’

On December 27, 2024, OpenAI announced it would unwind its capped-profit structure. Internally, this initiative was called “Project Watershed,” the documents reveal.

The mechanics played out through 2025. On September 11, Microsoft and OpenAI executed a memorandum of understanding with a 45-day timeline to finalize terms.

Microsoft’s role was straightforward but powerful. Its approval rights over “Major Decisions” included changes to OpenAI’s structure. Asked in a deposition whether those rights covered a recapitalization of OpenAI’s for-profit entity into a public benefit corporation, Microsoft corporate development executive Michael Wetter testified that they did.

The company had no board seat. “Zero voting rights,” Wetter testified. “We have no role, to be super clear.” But under the 2019 agreement, the conversion couldn’t happen without them.

The timing mattered. A SoftBank-led financing — internally called Project Sakura — was contingent on the recapitalization closing by year-end. Without the conversion, the funding could not proceed. Without Microsoft’s approval, the conversion could not proceed.

Valuation became a key focus of negotiations. Morgan Stanley, working for Microsoft, estimated OpenAI’s value at $122 billion to $177 billion, according to court filings. Goldman Sachs, advising OpenAI, put it at $353 billion. The MOU set Microsoft’s stake at 32.5 percent. By the time the deal closed after the SoftBank round, dilution brought it to 27 percent. 

OpenAI’s implied valuation was $500 billion — a record at the time (until it was surpassed in December by Musk’s SpaceX). As Altman put it in his deposition, “That was the willing buyer-willing seller market price, so I won’t argue with it.”

For Microsoft, it was a give-and-take deal: the tech giant lost its right of first refusal on new cloud workloads, even as OpenAI committed to the $250 billion in future Azure purchases.

At the same time, the agreement defused the clause that had loomed over the partnership: under prior terms, a declaration of artificial general intelligence by OpenAI’s board would have cut Microsoft off from future models. Now any such declaration needs to be made by an independent panel, and Microsoft’s IP rights run through 2032 regardless. 

The transaction closed on Oct. 28, 2025. The nonprofit remained (renamed the OpenAI Foundation) but as a minority shareholder in the company it had once controlled.

Six days later, OpenAI signed a seven-year, $38 billion infrastructure deal with Amazon Web Services. The company that had “sneaked in there” at the founding, as Nadella put it in 2015, was back — this time as a major cloud provider for Microsoft’s flagship AI partner.

An OpenAI graphic shows its revenue tracking computing consumption.

In a post this weekend, OpenAI CFO Sarah Friar made the shift explicit: “Three years ago, we relied on a single compute provider,” she wrote. “Today, we are working with providers across a diversified ecosystem. That shift gives us resilience and, critically, compute certainty.”

Revenue is up from $2 billion in 2023 to more than $20 billion in 2025. OpenAI is no longer a research lab dependent on Microsoft’s cloud. It’s a platform company with leverage. 

In December 2015, Nadella had to ask whether Microsoft had been called to participate in the OpenAI launch. A decade later, nothing could happen without the Redmond tech giant. 

But OpenAI will no longer be theirs alone.

AI-supported vulnerability triage with the GitHub Security Lab Taskflow Agent


Triaging security alerts is often very repetitive because many false positives stem from patterns that are obvious to a human auditor but difficult to encode as a formal code pattern. Large language models (LLMs), however, excel at matching the fuzzy patterns that traditional tools struggle with, so we at the GitHub Security Lab have been experimenting with using them to triage alerts. We are using our recently announced GitHub Security Lab Taskflow Agent AI framework for this and have found it to be very effective.

💡 Learn more about it and see how to activate the agent in our previous blog post.

In this blog post, we’ll introduce these triage taskflows, showcase results, and share tips on how you can develop your own—for triage or other security research workflows.

By using the taskflows described in this post, we have quickly triaged a large number of code scanning alerts and discovered many (~30) real-world vulnerabilities since August, many of which have already been fixed and published. When triaging the alerts, the LLMs were only given tools to perform basic file fetching and searching. We did not use any static or dynamic code analysis tools beyond CodeQL, which generated the alerts.

While this blog post showcases how we used LLM taskflows to triage CodeQL alerts, the same general process applies to building other kinds of automation with LLMs and taskflows. Your process will be a good candidate for this if:

  1. You have a task that involves many repetitive steps, and each one has a clear and well-defined goal.
  2. Some of those steps involve looking for logic or semantics in code that are not easy for conventional programming to identify, but are fairly easy for a human auditor to spot. Trying to identify them with conventional code often results in piles of monkey-patched heuristics, badly written regexes, etc. (These are potential sweet spots for LLM automation!)

If your project meets those criteria, then you can create taskflows to automate these sweet spots using LLMs, and use MCP servers to perform tasks that are well suited for conventional programming.

Both the seclab-taskflow-agent and seclab-taskflows repos are open source, allowing anyone to develop LLM taskflows to perform similar tasks. At the end of this blog post, we’ll also give some development tips that we’ve found useful.

Introduction to taskflows

Taskflows are YAML files that describe a series of tasks that we want to do with an LLM. In this way, we can write prompts to complete different tasks and have tasks that depend on each other. The seclab-taskflow-agent framework takes care of running the tasks one after another and passing the results from one task to the next.

For example, when auditing CodeQL alert results, we first want to fetch the code scanning results. Then, for each result, we may have a list of checks that we need to perform. For example, we may want to check if an alert can be reached by an untrusted attacker and whether there are authentication checks in place. These become a list of tasks we specify in a taskflow file.

Simplified depiction of taskflow with three tasks in order: fetch code scanning results, audit each result, create issues containing verdict.
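The taskflow files themselves are YAML, and their exact schema lives in the seclab-taskflows repo. Purely as a conceptual sketch of the control flow behind the three tasks shown above (call_llm and the prompt texts are hypothetical stand-ins, not the framework’s actual API), the idea boils down to:

# Conceptual sketch only. The real taskflows are YAML files interpreted by
# seclab-taskflow-agent; call_llm() is a hypothetical stand-in for a model call.
def call_llm(prompt: str) -> str:
    raise NotImplementedError("stand-in for an actual LLM API call")

TASKS = [
    "Fetch the open code scanning alerts for {repo} and summarize them as JSON.",
    "Audit each alert below: note triggers, permissions, and any sanitizers.\n{previous}",
    "Using the audit notes below, write a verdict and a short report per alert.\n{previous}",
]

def run_taskflow(repo: str) -> str:
    previous = ""
    for template in TASKS:
        prompt = template.format(repo=repo, previous=previous)
        previous = call_llm(prompt)  # each task's output feeds the next task
    return previous                  # output of the final task: per-alert reports

In practice the seclab-taskflow-agent handles this sequencing and result passing for us, so the taskflow author mostly writes prompts.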

We use tasks instead of one big prompt because LLMs have limited context windows, and complex, multi-step tasks often are not completed properly. Some steps are frequently left out, so having a taskflow to organize the task avoids these problems. Even with LLMs that have larger context windows, we find that taskflows are useful to provide a way for us to control and debug the task, as well as to accomplish bigger and more complex tasks.

The seclab-taskflow-agent can also perform a batch “for loop”-style task asynchronously. When we audit alerts, we often want to apply the same prompts and tasks to every alert, but with different alert details. The seclab-taskflow-agent allows us to create templated prompts to iterate through the alerts and replace the details specific to each alert when running the task.
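As a rough illustration of that batch pattern (the names below are invented for this sketch and are not the agent’s real configuration), the same prompt template is filled in once per alert, and each alert is audited in its own fresh context:

# Hypothetical sketch of the "for loop"-style batch task: one templated prompt
# is instantiated per alert, and each alert is audited in a fresh context.
AUDIT_TEMPLATE = (
    "Audit code scanning alert {alert_id} in {repo}.\n"
    "Rule: {rule}\nLocation: {path}:{line}\n"
    "Decide whether the alert is reachable by an attacker and record your notes."
)

def audit_all(alerts, call_llm):
    notes = {}
    for alert in alerts:
        prompt = AUDIT_TEMPLATE.format(**alert)      # per-alert details filled in
        notes[alert["alert_id"]] = call_llm(prompt)  # fresh context per alert
    return notes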

Triaging taskflows from a code scanning alert to a report

The GitHub Security Lab periodically runs a set of CodeQL queries against a selected set of open source repositories. Triaging these alerts is usually fairly repetitive, and for some alerts the causes of false positives are similar and easy to spot.

For example, when triaging alerts for GitHub Actions, false positives often result from some checks that have been put in place to make sure that only repo maintainers can trigger a vulnerable workflow, or that the vulnerable workflow is disabled in the configuration. These access control checks come in many different forms without an easily identifiable code pattern to match and are thus very difficult for a static analyzer like CodeQL to detect. However, a human auditor with general knowledge of code semantics can often identify them easily, so we expect an LLM to be able to identify these access control checks and remove false positives.

Over the course of a couple of months, we’ve tested our taskflows with a few CodeQL rules, mostly using Claude Sonnet 3.5, and have identified a number of real, exploitable vulnerabilities. The taskflows do not perform an “end-to-end” analysis, but rather produce a bug report with all the details and conclusions so that we can quickly verify the results. We did not instruct the LLM to validate the results by creating an exploit, nor did we provide any runtime environment for it to test its conclusions. The results, however, remained fairly accurate even without an automated validation step, and we were able to remove false positives from the CodeQL results quickly.

The rules are chosen based on our own experience of triaging these types of alerts and whether the list of tasks can be formulated into clearly defined instructions for LLMs to consume. 

General taskflow design

Taskflows generally consist of tasks that are divided into a few different stages. In the first stage, the tasks collect various bits of information relevant to the alert. This information is then passed to an auditing stage, where the LLM looks for common causes of false positives from our own experience of triaging alerts. After the auditing stage, a bug report is generated using the information gathered. In the actual taskflows, the information gathering and audit stage are sometimes combined into a single task, or they may be separate tasks, depending on how complex the task is.

To ensure that the generated report has sufficient information for a human auditor to make a decision, an extra step checks that the report has the correct formatting and contains the correct information. After that, a GitHub Issue is created, ready to be reviewed. 

Creating a GitHub Issue not only makes it easy for us to review the results, but also provides a way to extend the analysis. After reviewing and checking the issues, we often find that there are causes for false positives that we missed during the auditing process. Also, if the agent determines that the alert is valid, but the human reviewer disagrees and finds that it’s a false positive for a reason that was unknown to the agent so far, the human reviewer can document this as an alert dismissal reason or issue comment. When the agent analyzes similar cases in the future, it will be aware of all the past analysis stored in those issues and alert dismissal reasons, incorporate this new intelligence in its knowledge base, and be more effective at detecting false positives.

Information collection

During this stage, we instruct the LLM (examples are provided in the Triage examples section below) to collect relevant information about the alert, taking into account the threat model and general human knowledge of this type of alert. For example, in the case of GitHub Actions alerts, it will look at what permissions are set in the GitHub workflow file, what events trigger the GitHub workflow, whether the workflow is disabled, etc. These are generally independent tasks that follow simple, well-defined instructions to ensure the information collected is consistent. For example, checking whether a GitHub workflow is disabled involves making a GitHub API call via an MCP server.

To ensure that the information collected is accurate and to reduce hallucination, we instruct the LLM to include precise references to the source code that includes both file and line number to back up the information it collected:

You should include the line number where the untrusted code is invoked, as well as the untrusted code or package manager that is invoked in the notes.

Each task then stores the information it collected in audit notes, which serve as a running commentary on the alert. Once a task is completed, its notes are serialized to a database, and the next task appends its own notes when it is done.

Two tasks in sequence, showing the notes each one adds to the general notes. The trigger analysis step adds notes such as triggers, permissions, and secrets; the “audit injection point” task adds notes such as sanitizers.

In general, each of the information gathering tasks is independent of each other and does not need to read each other’s notes. This helps each task to focus on its own scope without being distracted by previously collected information.

The end result is a “bag of information” in the form of notes associated with an alert that is then passed to the auditing tasks.
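As a purely hypothetical example (the field names are invented for illustration and are not the agent’s real note format), the accumulated notes for a single GitHub Actions alert might look something like this:

# Illustrative only: the kind of "bag of information" the gathering tasks
# might build up for one alert before the auditing tasks run.
alert_notes = {
    "alert_id": 1234,
    "repo": "example-org/example-repo",
    "triggers": ["pull_request_target", "workflow_dispatch"],
    "permissions": {"contents": "write"},
    "secrets_used": ["GITHUB_TOKEN"],
    "injection_point": {
        "file": ".github/workflows/build.yml",
        "line": 42,
        "input": "github.event.pull_request.title",
    },
    "sanitizers": [],
    "workflow_disabled": False,
}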

Audit issue

At this stage, the LLM goes through the information gathered and performs a list of specific checks to reject alert results that turned out to be false positives. For example, when triaging a GitHub Actions alert, we may have collected information about the events that trigger the vulnerable workflow. In the audit stage, we’ll check if these events can be triggered by an attacker or if they run in a privileged context. After this stage, a lot of the false positives that are obvious to a human auditor will be removed.

Decision-making and report generation

For alerts that have made it through the auditing stage, the next step is to create a bug report using the information gathered, as well as the reasoning for the decision at the audit stage. Again, in our prompt, we are being very precise about the format of the report and what information we need. In particular, we want it to be concise but also include information that makes it easy for us to verify the results, with precise code references and code blocks.

The report generated uses the information gathered from the notes in previous stages and only looks at the source code to fetch code snippets that are needed in the report. No further analysis is done at this stage. Again, the very strict and precise nature of the tasks reduces the amount of hallucination.

Report validation and issue creation

After the report is written, we instruct the LLM to check the report to ensure that all the relevant information is contained in the report, as well as the consistency of the information:

Check that the report contains all the necessary information:
- This criteria only applies if the workflow containing the alert is a reusable action AND has no high privileged trigger. 
You should check it with the relevant tools in the gh_actions toolbox.
If that's not the case, ignore this criteria.
In this case, check that the report contains a section that lists the vulnerable action users. 
If there isn't any vulnerable action users and there is no high privileged trigger, then mark the alert as invalid and using the alert_id and repo, then remove the memcache entry with the key {{ RESULT_key }}.

Missing or inconsistent information often indicates hallucinations or other causes of false positives (for example, not being able to track down an attacker-controlled input). In either case, we dismiss the report.

If the report contains all the information and is consistent, then we open a GitHub Issue to track the alert.

Issue review and repo-specific knowledge

The GitHub Issue created in the previous step contains all the information needed to verify the issue, with code snippets and references to lines and files. This provides a kind of “checkpoint” and a summary of the information that we have, so that we can easily extend the analysis.

In fact, after creating the issue, we often find that there are repo-specific permission checks or sanitizers that render the issue a false positive. We are able to account for these cases by creating taskflows that review the issues with repo-specific knowledge added in the prompts. One approach that we’ve experimented with is to collect the dismissal reasons for alerts in a repo and instruct the LLM to take these dismissal reasons into account when reviewing the GitHub Issue. This allows us to remove false positives caused by reasons specific to a repo.

Image showing LLM output that dismisses an alert.

In this case, the LLM is able to identify the alert as a false positive after taking into account a custom check-run permission check that was recorded in the alert dismissal reasons.
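As a rough sketch of how such dismissal reasons could be gathered programmatically, for instance inside an MCP server tool, the GitHub REST API exposes them on dismissed code scanning alerts. The snippet below uses the requests library and is only an illustration of the idea, not the Security Lab’s actual tooling:

import requests

# Illustrative sketch: list dismissal reasons for dismissed code scanning alerts.
# Assumes a token with permission to read security events; not the actual MCP tool.
def dismissal_reasons(owner, repo, token):
    url = f"https://api.github.com/repos/{owner}/{repo}/code-scanning/alerts"
    resp = requests.get(
        url,
        params={"state": "dismissed", "per_page": 100},
        headers={"Authorization": f"Bearer {token}",
                 "Accept": "application/vnd.github+json"},
    )
    resp.raise_for_status()
    return [
        {
            "alert": alert["number"],
            "rule": alert["rule"]["id"],
            "reason": alert.get("dismissed_reason"),
            "comment": alert.get("dismissed_comment"),
        }
        for alert in resp.json()
    ]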

Triage examples and results

In this section we’ll give some examples of what these taskflows look like in practice. In particular, we’ll show taskflows for triaging some GitHub Actions and JavaScript alerts.

GitHub Actions alerts

The specific Actions alerts that we triaged are checkout of untrusted code in a privileged context and code injection.

The triaging of these queries shares a lot of similarities. For example, both involve checking the workflow triggering events, the permissions of the vulnerable workflow, and tracking workflow callers. In fact, the main differences involve local analysis of specific details of the vulnerabilities. For code injection, this means checking whether the injected code has been sanitized, how the expression is evaluated, and whether the input is truly arbitrary (for example, a pull request ID is unlikely to cause a code injection issue). For untrusted checkout, this means checking whether there is a valid code execution point after the checkout.

Since many elements in these taskflows are the same, we’ll use the code injection triage taskflow as an example. Note that because these taskflows have a lot in common, we made heavy use of reusable features in the seclab-taskflow-agent, such as prompts and reusable tasks.

When manually triaging GitHub Actions alerts for these rules, we commonly run into false positives because of:

  1. Vulnerable workflow doesn’t run in a privileged context. This is determined by the events that trigger the vulnerable workflow. For example, a workflow triggered by the pull_request_target event runs in a privileged context, while a workflow triggered by the pull_request event does not. This can usually be determined by simply looking at the workflow file.
  2. Vulnerable workflow disabled explicitly in the repo. This can be checked easily by checking the workflow settings in the repo.
  3. Vulnerable workflow explicitly restricts permissions and does not use any secrets, in which case there is little privilege to gain.
  4. Vulnerability specific issues, such as invalid user input or sanitizer in the case of code injection and the absence of a valid code execution point in the case of untrusted checkout.
  5. Vulnerable workflow is a reusable workflow but not reachable from any workflow that runs in privileged context.

Very often, triaging these alerts involves many simple but tedious checks like the ones listed above, and an alert can be determined to be a false positive very quickly by one of the above criteria. We therefore model our triage taskflows based on these criteria. 

So, our action-triage taskflows consist of the following tasks during information gathering and the auditing stage:

  • Workflow trigger analysis: This stage performs both information gathering and auditing. It first collects events that trigger the vulnerable workflow, as well as permission and secrets that are used in the vulnerable workflow. It also checks whether the vulnerable workflow is disabled in the repo. All information is local to the vulnerable workflow itself. This information is stored in running notes which are then serialized to a database entry. As the task is simple and involves only looking at the vulnerable workflow, preliminary auditing based on the workflow trigger is also performed to remove some obvious false positives. 
  • Code injection point analysis: This is another task that only analyzes the vulnerable workflow and combines information gathering and audit in a single task. It collects information about the location of the code injection point and the user input that is injected. It also performs local auditing to check whether the user input is a valid injection risk and whether it has a sanitizer.
  • Workflow user analysis: This performs a simple caller analysis that looks for the caller of the vulnerable workflow. As it can potentially retrieve and analyze a large number of files, this step is divided into two main tasks that perform information gathering and auditing separately. In the information gathering task, callers of the vulnerable workflow are retrieved and their trigger events, permissions, use of secrets are recorded in the notes. This information is then used in the auditing task to determine whether the vulnerable workflow is reachable by an attacker.

Each of these tasks is applied to the alert and at each step, false positives are filtered out according to the criteria in the task.

After the information gathering and audit stage, our notes will generally include information such as the events that trigger the vulnerable workflow, permissions and secrets involved, and (in case of a reusable workflow) other workflows that use the vulnerable workflow as well as their trigger events, permissions, and secrets. This information will form the basis for the bug report. As a sanity check to ensure that the information collected so far is complete and consistent, the review_report task is used to check for missing or inconsistent information before a report is created. 

After that, the create_report task is used to create a bug report which will form the basis of a GitHub Issue. Before creating an issue, we double check that the report contains the necessary information and conforms to the format that we required. Missing information or inconsistencies are likely the results of some failed steps or hallucinations and we reject those cases.

The following diagram illustrates the main components of the triage_actions_code_injection taskflow:

Seven tasks of a taskflow connected in order with arrows: fetch alerts, trigger analysis, injection point analysis, workflow user analysis, review notes, create bug report, and review bug report. All tasks except fetch alerts iterate over alerts or alert notes.

We then create GitHub Issues using the create_issue_actions taskflow. As mentioned before, the GitHub Issues created contain sufficient information and code references to verify the vulnerability quickly, as well as serving as a summary for the analysis so far, allowing us to continue further analysis using the issue. The following shows an example of an issue that is created:

Image showing an issue created by the LLM.

In particular, we can use GitHub Issues and alert dismissal reasons as a means to incorporate repo-specific security measures and to further the analysis. To do so, we use the review_actions_injection_issues taskflow to first collect alert dismissal reasons from the repo. These dismissal reasons are then checked against the alert stated in the GitHub Issue. In this case, we simply use the issue as the starting point and instruct the LLM to audit the issue and check whether any of the alert dismissal reasons applies to the current issue. Since the issue contains all the relevant information and code references for the alert, the LLM is able to use the issue and the alert dismissal reasons to further the analysis and discover more false positives. The following shows an alert that is rejected based on the dismissal reasons:

Image showing LLM output of reasons to reject an alert after taking into account of the dismissal reasons.

The following diagram illustrates the main components of the issue creation and review taskflows:

Five tasks separated into two swim lanes. The first swim lane, “create action issues,” depicts the tasks used for the issue creation taskflow, starting with dismissing false positives and continuing with issue creation for true and false positives. The second swim lane, “review action issues,” contains the tasks “collect alert dismissal reasons” and “review issues based on dismissal reasons.”

JavaScript alerts

Similarly to triaging Actions alerts, we also triaged code scanning alerts for the JavaScript/TypeScript languages, though to a lesser extent. In the JavaScript world, we triaged code scanning alerts for the client-side cross-site scripting CodeQL rule (js/xss).

The client-side cross-site scripting alerts have more variety with regards to their sources, sinks, and data flows when compared to the GitHub Actions alerts.

The prompts for analyzing those XSS vulnerabilities are focused on helping the person responsible for triage make an educated decision, not making the decision for them. This is done by highlighting the aspects that seem to make a given alert exploitable by an attacker and, more importantly, what likely prevents the exploitation of a given potential issue. Other than that, the taskflows follow a similar scheme as described in the GitHub Actions alerts section.

While triaging XSS alerts manually, we’ve often identified false positives due to these reasons:

  • Custom or unrecognized sanitization functions (e.g. using regex) that the SAST-tool cannot verify.
  • Reported sources that are likely unreachable in practice (e.g., would require an attacker to send a message directly from the webserver).
  • Untrusted data flowing into potentially dangerous sinks, whose output then is only used in a non-exploitable way.
  • The SAST-tool not knowing the full context where the given untrusted data ends up.

Based on these false positives, the prompts in the relevant taskflow, or even in the active personality, were extended and adjusted. If you encounter recurring false positives while auditing a project, it makes sense to extend the prompt so that those false positives are correctly marked (and likewise if alerts for certain sources/sinks are not considered a vulnerability).

In the end, after executing the taskflows triage_js_ts_client_side_xss and create_issues_js_ts, the alert would result in GitHub issues such as:

A screenshot of a GitHub Issue titled 'Code scanning alert #72 triage report for js/xss,' showing two lists of reasons why the alert is or is not an exploitable vulnerability.

While this is a sample of an alert worth following up on (which turned out to be a true positive, exploitable by using a javascript: URL), alerts that the taskflow agent decided were false positives get their issues labelled with “FP” (for false positive):

A screenshot of a GitHub Issue titled 'Code scanning alert #1694 triage report for js/xss.' The section for factors that make the alert exploitable is empty because the taskflow identified none; instead, the issue shows a list of 7 items describing why the vulnerability is not exploitable.

Taskflows development tips

In this section we share some of our experiences from working on these taskflows and what we have found useful when developing them. We hope these tips will help others create their own taskflows.

Using a database to store intermediate state

While developing a taskflow with multiple tasks, we sometimes encounter problems in tasks that run at a later stage. These can be simple software problems, such as API call failures, MCP server bugs, prompt-related issues, or token and quota limits.

By keeping tasks small and storing the results of each task in a database, we avoided rerunning lengthy tasks when a failure happens. When a task in a taskflow fails, we simply rerun the taskflow from the failed task and reuse the results from earlier tasks that are stored in the database. Apart from saving us time when a task fails, this also helps us isolate the effects of each task and tweak each task using the database created from the previous task as a starting point.
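A minimal sketch of that pattern, using Python’s built-in sqlite3 module purely for illustration (the agent’s own storage layer may look different):

import json
import sqlite3

# Minimal sketch of caching per-task results so a failed taskflow can be
# resumed from the failed task instead of rerunning everything from scratch.
db = sqlite3.connect("taskflow_state.db")
db.execute(
    "CREATE TABLE IF NOT EXISTS results "
    "(task TEXT, key TEXT, value TEXT, PRIMARY KEY (task, key))"
)

def run_cached(task_name, key, fn):
    row = db.execute(
        "SELECT value FROM results WHERE task = ? AND key = ?", (task_name, key)
    ).fetchone()
    if row:                          # an earlier run already completed this step
        return json.loads(row[0])
    value = fn()                     # e.g. an LLM call or an MCP tool call
    db.execute(
        "INSERT OR REPLACE INTO results VALUES (?, ?, ?)",
        (task_name, key, json.dumps(value)),
    )
    db.commit()
    return value

Wrapping each task’s work in a helper like run_cached means a rerun skips everything that already has a stored result and resumes at the failed task.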

Breaking down complex tasks into smaller tasks

When we were developing the triage taskflows, the models that we used did not handle large context and complex tasks very well. When trying to perform complex and multiple tasks within the same context, we often ran into problems such as tasks being skipped or instructions not being followed.

To counter that, we divided tasks into smaller, independent tasks. Each started with a fresh new context. This helped reduce the context window size and alleviated many of the problems that we had.

One particular example is the use of templated repeat_prompt tasks, which loop over a list of tasks and start a new context for each of them. By doing this, instead of going through a list in the same prompt, we ensured that every single task was performed, while the context of each task was kept to a minimum.

A task named “audit results” illustrating the repeat_prompt feature, shown as a series of equally sized boxes labeled 'audit result #1,' 'audit result #2,' …, 'audit result n.'

An added benefit is that we are able to tweak and debug the taskflows with more granularity. By having small tasks and storing the results of each task in a database, we can easily split off part of a taskflow and run it on its own.

Delegate to MCP server whenever possible

Initially, when checking and gathering information such as workflow triggers, we simply put instructions in the prompts, because we thought the LLM should be able to gather the information from the source code. While this worked most of the time, we also noticed some inconsistencies due to the non-deterministic nature of the LLM. For example, the LLM would sometimes record only a subset of the events that trigger the workflow, or reach inconsistent conclusions about whether a trigger runs the workflow in a privileged context.

Since this information gathering and these checks can easily be performed programmatically, we ended up creating tools in the MCP servers to gather the information and perform the checks. This led to a much more consistent outcome.

By moving most of the tasks that can easily be done programmatically to MCP server tools, while leaving the more complex logical reasoning tasks, such as finding permission checks, to the LLM, we were able to leverage the power of the LLM while keeping the results consistent.
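For example, a check such as “does this workflow run on a privileged trigger?” is easy to implement deterministically. The sketch below uses PyYAML and is only an illustration of the idea, not the Security Lab’s actual MCP tool; the set of triggers treated as privileged is a simplified assumption for the example:

import yaml  # PyYAML

# Simplified, illustrative set: triggers that run with access to secrets and a
# write token while potentially processing untrusted input.
PRIVILEGED_TRIGGERS = {"pull_request_target", "workflow_run", "issue_comment"}

def workflow_triggers(workflow_text):
    """Return the set of events that trigger a GitHub Actions workflow."""
    doc = yaml.safe_load(workflow_text) or {}
    # PyYAML parses the bare key `on:` as the boolean True (a YAML 1.1 quirk).
    on = doc.get("on", doc.get(True, {}))
    if isinstance(on, str):
        return {on}
    if isinstance(on, list):
        return set(on)
    return set(on or {})

def runs_in_privileged_context(workflow_text):
    return bool(workflow_triggers(workflow_text) & PRIVILEGED_TRIGGERS)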

Reusable taskflow to apply tweaks across taskflows

As we were developing the triage taskflows, we realized that many tasks can be shared between different triage taskflows. To make sure that tweaks in one taskflow can be applied to the rest, and to reduce the amount of copy and paste, we needed a way to refactor the taskflows and extract reusable components.

We added features like reusable tasks and prompts, which allowed us to reuse components and apply changes consistently across different taskflows.

Configuring models across taskflows

As LLMs are constantly evolving and new versions are released frequently, it soon became apparent that we needed a way to update model versions across taskflows. So we added a model configuration feature that lets us change models across taskflows, which is useful when the model version needs updating or when we just want to experiment and rerun the taskflows with a different model.
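The idea, sketched below with invented names (the agent’s actual configuration mechanism is documented in the seclab-taskflow-agent repo), is simply to resolve model identifiers from one shared place instead of hard-coding them in every taskflow:

import os

# Illustrative sketch only: one shared place to change the model used by all taskflows.
DEFAULT_MODELS = {
    "triage": "claude-3-5-sonnet",   # assumed identifier, for illustration only
    "report": "claude-3-5-sonnet",
}

def model_for(stage: str) -> str:
    # An environment variable override makes it easy to rerun with another model.
    return os.environ.get(f"TASKFLOW_MODEL_{stage.upper()}", DEFAULT_MODELS[stage])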

Closing

In this post we’ve shown how we created taskflows for the seclab-taskflow-agent to triage code scanning alerts. 

By breaking the triage down into precise, specific tasks, we were able to automate many of the more repetitive steps using an LLM. By setting out clear, precise criteria in the prompts and asking the LLM to include code references in its answers, it performed the tasks as instructed while keeping hallucinations to a minimum. This lets us leverage the power of the LLM to triage alerts and greatly reduces the number of false positives, without needing to validate each alert dynamically.

As a result, we were able to discover ~30 real-world vulnerabilities from CodeQL alerts after running the triage taskflows.

The taskflows discussed here are published in our repo, and we’re looking forward to seeing what you build with them! More recently, we’ve also done further experiments in AI-assisted code auditing and vulnerability hunting, so stay tuned for what’s to come!

Get the guide to setting up the GitHub Security Lab Taskflow Agent >


Disclaimers: 

  1. When we use these taskflows to report vulnerabilities, our researchers carefully review all generated output before sending the report. We strongly recommend you do the same.
  2. Running the taskflows can result in many tool calls, which can easily consume a large amount of quota.
  3. The taskflows may create GitHub Issues. Please be considerate and seek the repo owner’s consent before running them on somebody else’s repo.

The post AI-supported vulnerability triage with the GitHub Security Lab Taskflow Agent appeared first on The GitHub Blog.


Four priorities for AI-powered identity and network access security in 2026


No doubt, your organization has been hard at work over the past several years implementing industry best practices, including a Zero Trust architecture. But even so, the cybersecurity race only continues to intensify.

AI has quickly become a powerful tool misused by threat actors, who use it to slip into the tiniest crack in your defenses. They use AI to automate and launch password attacks and phishing attempts at scale, craft emails that seem to come from people you know, manufacture voicemails and videos that impersonate people, join calls, request IT support, and reset passwords. They even use AI to rewrite AI agents on the fly as they compromise and traverse your network.

To stay ahead in the coming year, we recommend four priorities for identity security leaders:

  1. Implement fast, adaptive, and relentless AI-powered protection.
  2. Manage, govern, and protect AI and agents.
  3. Extend Zero Trust principles everywhere with an integrated Access Fabric security solution.
  4. Strengthen your identity and access foundation to start secure and stay secure.

Secure Access Webinar

Enhance your security strategy: Deep dive into how to unify identity and network access through practical Zero Trust measures in our comprehensive four-part series.


1. Implement fast, adaptive, and relentless AI-powered protection

2026 is the year to integrate AI agents into your workflows to reduce risk, accelerate decisions, and strengthen your defenses.

While security systems generate plenty of signals, the work of turning that data into clear next steps is still too manual and error-prone. Investigations, policy tuning, and response actions require stitching together an overwhelming volume of context from multiple tools, often under pressure. When cyberattackers are operating at the speed and scale of AI, human-only workflows constrain defenders.

That’s where generative AI and agentic AI come in. Instead of reacting to incidents after the fact, AI agents help your identity teams proactively design, refine, and govern access. Which policies should you create? How do you keep them current? Agents work alongside you to identify policy gaps, recommend smarter and more consistent controls, and continuously improve coverage without adding friction for your users. You can interact with these agents the same way you’d talk to a colleague. They can help you analyze sign-in patterns, existing policies, and identity posture to understand what policies you need, why they matter, and how to improve them.

In a recent study, identity admins using the Conditional Access Optimization Agent in Microsoft Entra completed Conditional Access tasks 43% faster and 48% more accurately across tested scenarios. These gains directly translate into a stronger identity security posture with fewer gaps for cyberattackers to exploit. Microsoft Entra also includes built-in AI agents for reasoning over users, apps, sign-ins, risks, and configurations in context. They can help you investigate anomalies, summarize risky behavior, review sign-in changes, remediate and investigate risks, and refine access policies.

The real advantage of AI-powered protection is speed, scale, and adaptability. Static, human-only workflows just can’t keep up with constantly evolving cyberattacks. Working side-by-side with AI agents, your teams can continuously assess posture, strengthen access controls, and respond to emerging risks before they turn into compromise.

Where to learn more: Get started with Microsoft Security Copilot agents in Microsoft Entra to help your team with everyday tasks and the complex scenarios that matter most.

2. Manage, govern, and protect AI and agents 

Another critical shift is to make every AI agent a first-class identity and govern it with the same rigor as human identities. This means inventorying agents, assigning clear ownership, governing what they can access, and applying consistent security standards across all identities.

Just as unsanctioned software as a service (SaaS) apps once created shadow IT and data leakage risks, organizations now face agent sprawl—an exploding number of AI systems that can access data, call external services, and act autonomously. While you want your employees to get the most out of these powerful and convenient productivity tools, you also want to protect them from new risks.

Fortunately, the same Zero Trust principles that apply to human employees apply to AI agents, and now you can use the same tools to manage both. You can also add more advanced controls: monitoring agent interaction with external services, enforcing guardrails around internet access, and preventing sensitive data from flowing into unauthorized AI or SaaS applications.

With Microsoft Entra Agent ID, you can register and manage agents using familiar Entra experiences. Each agent receives its own identity, which improves visibility and auditability across your security stack. Requiring a human sponsor to govern an agent’s identity and lifecycle helps prevent orphaned agents and preserves accountability as agents and teams evolve. You can even automate lifecycle actions to onboard and retire agents. With Conditional Access policies, you can block risky agents and set guardrails for least privilege and just in time access to resources.

To govern how employees use agents and to prevent misuse, you can turn to Microsoft Entra Internet Access, included in Microsoft Entra Suite. It’s now a secure web and AI gateway that works with Microsoft Defender to help you discover use of unsanctioned private apps, shadow IT, generative AI, and SaaS apps. It also protects against prompt injection attacks and prevents data exfiltration by integrating network filtering with Microsoft Purview classification policies.

When you have observability into everything that traverses your network, you can embrace AI confidently while ensuring that agents operate safely, responsibly, and in line with organizational policy.

Where to learn more: Get started with Microsoft Entra Agent ID and Microsoft Entra Suite.

3. Extend Zero Trust principles everywhere with an integrated Access Fabric security solution

There’s often a gap between what your identity system can see and what’s happening on the network. That’s why our next recommendation is to unify the identity and network access layers of your Zero Trust architecture, so they can share signals and reinforce each other’s strengths through a unified policy engine. This gives you deeper visibility into and finer control over every user session.

Today, enterprise organizations juggle an average of five different identity solutions and four different network access solutions, usually from multiple vendors.1 Each solution enforces access differently with disconnected policies that limit visibility across identity and network layers. Cyberattackers are weaponizing AI to scale phishing campaigns and automate intrusions to exploit the seams between these siloed solutions, resulting in more breaches.2

An access security platform that integrates context from identity, network, and endpoints creates a dynamic safety net—an Access Fabric—that surrounds every digital interaction and helps keep organizational resources secure. An Access Fabric solution wraps every connection, session, and resource in consistent, intelligent access security, wherever work happens—in the cloud, on-premises, or at the edge. Because it reasons over context from identity, network, devices, agents, and other security tools, it determines access risk more accurately than an identity-only system. It continuously re‑evaluates trust across authentication and network layers, so it can enforce real‑time, risk‑based access decisions beyond first sign‑in.

Microsoft Entra delivers integrated access security across AI and SaaS apps, internet traffic, and private resources by bringing identity and network access controls together under a unified Zero Trust policy engine, Microsoft Entra Conditional Access. It continuously monitors user and network risk levels. If any of those risk levels change, it enforces policies that adapt in real time, so you can block access for users, apps, and even AI agents before they cause damage.

Your security teams can set policies in one central place and trust Entra to enforce them everywhere. The same adaptive controls protect human users, devices, and AI agents wherever they move, closing access security gaps while reducing the burden of managing multiple policies across multiple tools.

Where to learn more: Read our Access Fabric blog and learn more in our new four-part webinar series.

4. Strengthen your identity and access foundation to start secure and stay secure

To address modern cyberthreats, you need to start from a secure baseline—anchored in phishing‑resistant credentials and strong identity proofing—so only the right person can access your environment at every step of authentication and recovery.

A baseline security model sets minimum guardrails for identity, access, hardening, and monitoring. These guardrails include must-have controls, like those in security defaults, Microsoft-managed Conditional Access policies, or Baseline Security Mode in Microsoft 365. This approach includes moving away from easily compromised credentials like passwords and adopting passkeys to balance security with a fast, familiar sign-in experience. Equally important is high‑assurance account recovery and onboarding that combines a government‑issued ID with a biometric match to ensure that no bad actors or AI impersonators gain access.

Microsoft Entra makes it easy to implement these best practices. You can require phishing‑resistant credentials for any account accessing your environment and tailor passkey policies based on risk and regulatory needs. For example, admins or users in highly regulated industries can be required to use device‑bound passkeys such as physical security keys or Microsoft Authenticator, while other worker groups can use synced passkeys for a simpler experience and easier recovery. At a minimum, protect all admin accounts with phishing‑resistant credentials included in Microsoft Entra ID. You can even require new employees to set up a passkey before they can access anything. With Microsoft Entra Verified ID, you can add a live‑person check and validate government‑issued ID for both onboarding and account recovery.

Combining access control policies with device compliance, threat detection, and identity protection will further fortify your foundation. 

Where to learn more: Read our latest blog on passkeys and account recovery with Verified ID and learn how you can enable passkeys for your organization.

Support your identity and network access priorities with Microsoft

The plan for 2026 is straightforward: use AI to automate protection at speed and scale, protect the AI and agents your teams use to boost productivity, extend Zero Trust principles with an Access Fabric solution, and strengthen your identity security baseline. These measures will give your organization the resilience it needs to move fast without compromise. The threats will keep evolving—but you can tip the scales in your favor against increasingly sophisticated cyberattackers.

To learn more about Microsoft Security solutions, visit our website. Bookmark the Security blog to keep up with our expert coverage on security matters. Also, follow us on LinkedIn (Microsoft Security) and X (@MSFTSecurity) for the latest news and updates on cybersecurity.


1. Secure employee access in the age of AI report, Microsoft.

2. Microsoft Digital Defense Report 2025.

The post Four priorities for AI-powered identity and network access security in 2026 appeared first on Microsoft Security Blog.


Multimodal reinforcement learning with agentic verifier for AI agents

Diagram: visual, audio, and document inputs feed into a central agent network, which leads to a verified (checkmark) outcome.

At a glance

  • Today’s multimodal AI systems can give answers that sound right but may not be grounded in what they actually observe over time, leading to unpredictable errors and safety risks in real-world settings.
  • Argos is a verification framework for multimodal reinforcement learning that trains models by rewarding not just correct answers, but correct answers grounded in visual and temporal evidence, using automated verification rather than human labeling. It selects the appropriate specialized tools for each answer based on what needs to be verified. 
  • Models trained with Argos show stronger spatial reasoning, far fewer visual hallucinations, more stable learning dynamics, and better performance on robotics and real-world tasks while requiring fewer training samples.

Over the past few years, AI systems have become much better at interpreting images, generating language, and performing tasks within physical and virtual environments. Yet they still fail in ways that are hard to predict and even harder to fix. A robot might try to grasp a tool when the object is visibly blocked, or a visual assistant integrated into smart glasses might describe objects that aren’t actually present.

These errors often arise because today’s multimodal agents are trained to generate outputs that are plausible rather than grounded in the actual information they receive from their environment. As a result, a model’s output can seem correct while relying on incorrect information. As AI systems are increasingly used to navigate 3D spaces and make decisions in real-world settings, this gap can be a safety and reliability concern.

To tackle this challenge, we posed the question: How can we train AI agents to generate correct answers and take appropriate actions for the right reasons so that their behavior is reliable even as the environment or tasks change?

Argos represents a novel answer to this challenge. It’s an agentic verification framework designed to improve the reliability of reinforcement learning in multimodal models. Reinforcement learning is a training method where AI models learn by receiving rewards for desired behaviors and penalties for undesired ones, gradually improving their performance through trial and error.

Rather than rewarding only correct behaviors, Argos evaluates how those behaviors were produced. It draws on a pool of larger, more capable teacher models and rule-based checks to verify two things: first, that the objects and events a model references actually exist in its input, and second, that the model’s reasoning aligns with what it observes. Argos rewards the model when both conditions are met. In practice, these rewards help curate high-quality training data and guide the model’s further training.

How Argos works

Argos functions as a verification layer on top of an existing multimodal model. Given an image or video, a task or query, and information about the model’s reasoning and output, Argos identifies where the model indicates objects are located in the image, when it indicates events occur in a video, and what action or answer it produces.

Argos then applies specialized tools tailored to the specific content to evaluate and score three aspects of the model’s output. It checks whether the answer is correct, whether referenced objects and events appear at the indicated locations and times, and whether the reasoning is consistent with the visual evidence and the answer (Figure 1).

These scores are combined using a gated aggregation function, a method that dynamically adjusts the importance of different scores. It emphasizes reasoning checks only when the final output is correct. This design prevents unreliable feedback from dominating training and produces a stable reward signal for reinforcement learning.
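
In spirit, the gated aggregation works something like the following sketch (the weights, the 0.5 threshold, and the exact functional form are illustrative assumptions, not the published Argos formula):

```ts
// Hypothetical sketch of gated aggregation: grounding and reasoning scores only
// contribute when the final answer is judged correct, so a wrong answer cannot
// be rescued by plausible-looking reasoning.
function gatedReward(
  answerScore: number,    // 1 if the final answer matches ground truth, else 0 (or a graded score)
  groundingScore: number, // how well referenced objects and events match the visual evidence
  reasoningScore: number  // how consistent the reasoning is with the evidence and the answer
): number {
  const gate = answerScore > 0.5 ? 1 : 0; // gate opens only for correct answers
  return answerScore + gate * (0.5 * groundingScore + 0.5 * reasoningScore);
}
```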

Figure 1. Argos selects different specialized tools to verify and score the accuracy of referenced points and events in the agent’s reasoning.

Using Argos to curate data for supervised fine-tuning

Argos also helps curate high-quality training data to provide the model with a strong foundation in grounded reasoning. Before the reinforcement learning stage begins, Argos uses a multi-stage process to generate data that is explicitly tied to visual locations and time intervals.

In the first stage, Argos identifies the objects, actions, and events that are relevant to a task and links them to specific locations in images or specific moments in videos. These references are overlaid on images and selected video frames. Next, a reasoning model generates step-by-step explanations that refer to these visual locations and time spans.

Finally, Argos evaluates each generated example for accuracy and visual grounding, filtering out low-quality training data and retaining only data that is both correct and well-grounded in visual input. The resulting dataset is then used in an initial training phase, where the model learns to generate reasoning steps before producing its final output. This process is illustrated in Figure 2.

Figure 2. Argos generates step-by-step reasoning grounded in image locations and video timestamps, then filters out low-quality training data.

Evaluation

Building on this foundation in grounded reasoning, we further trained the model using reinforcement learning guided by Argos and evaluated its performance across a range of benchmarks. On spatial reasoning tasks, the Argos-trained model outperformed both the base model Qwen2.5-VL-7B and the stronger Video-R1 baseline across challenging 3D scenarios and multi-view tasks. Models trained with Argos also showed a substantial reduction of hallucinations compared with both standard chain-of-thought prompting and reinforcement learning baselines.

Finally, we evaluated the model in robotics and other real-world task settings, focusing on high-level planning and fine-grained control. Models trained with Argos performed better on complex, multi-step tasks. Notably, these improvements were achieved using fewer training samples than existing approaches, highlighting the importance of reward design in producing more capable and data-efficient agents. Figure 3 illustrates some of these findings.

Figure 3. Performance of Argos compared with baseline models on the task of visual hallucination detection (left) and embodied task planning and completion (right).

How Argos shapes reinforcement learning

To understand how Argos affects learning, we took the same vision-language model that had been trained on our curated dataset and fine-tuned it using reinforcement learning in two different ways. In one approach, Argos was an agentic verifier, checking the correctness of outputs and the quality of reasoning. In the other, the model received feedback only on whether its answers were correct.

We evaluated both versions on 1,500 samples from a new dataset and tracked their performance throughout the learning process (Figure 4). Although they started at similar levels, the model without Argos quickly got worse. Its accuracy steadily declined, and it increasingly gave answers that ignored what was in the videos. It learned to game the system by producing answers that seemed correct without grounding them in visual evidence.

The model trained with Argos showed the opposite pattern. Accuracy improved steadily, and the model became better at linking its reasoning to what appeared in the videos. This difference highlights the value of verification: when training rewards both correct outputs and sound reasoning based on visual and temporal evidence, models learn to be more reliable rather than simply finding shortcuts to high scores.

Figure 4. Comparison of response accuracy changes with and without Argos across two model versions (left) and differences in visual grounding accuracy over training for both versions (right).

Potential impact and looking forward

This research points toward a different way of building AI agents for real-world applications. Rather than fixing errors after they occur, it focuses on training agents to systematically anchor their reasoning in what they actually receive as input throughout the training process.

The potential applications span many domains. A visual assistant for a self-driving car that verifies what’s actually in an image is less likely to report phantom obstacles. A system that automates digital tasks and checks each action against what’s displayed on the screen is less likely to click the wrong button.

As AI systems move beyond research labs into homes, factories, and offices, reliable reasoning becomes essential for safety and trust. Argos represents an early example of verification systems that evolve alongside the AI models they supervise. Future verifiers could be tailored for specific fields like medical imaging, industrial simulations, and business analytics. As more advanced models and richer data sources become available, researchers can use them to improve these verification systems, providing even better guidance during training and further reducing hallucinations.

We hope that this research helps move the field toward AI systems that are both capable and interpretable: agents that can explain their decisions, point to the evidence behind them, and be trained to adhere to real-world requirements and values.


The post Multimodal reinforcement learning with agentic verifier for AI agents appeared first on Microsoft Research.


Context windows, Plan agent, and TDD: What I learned building a countdown app with GitHub Copilot

1 Share

In our last Rubber Duck Thursdays stream of 2025, I wanted to build something celebratory. Something that captures what Rubber Duck Thursdays is all about: building together, learning from mistakes, and celebrating everyone who tunes in from across the world. 

Along the way, I picked up practical patterns for working with AI that you can apply to your own projects, whether you’re building a countdown app or something entirely different. From managing context windows to avoid cluttered conversations, to using the Plan agent for requirement discovery, to catching edge cases through test-driven development with Copilot. And… why world maps are harder than they look. 👀

See the full stream below. 👇

Starting simple: The basic countdown

Countdown timers are a straightforward concept: days count down to hours, minutes count down to seconds. But sometimes it’s the simple ideas that let us be our most creative. I figured I’d use this as an opportunity to take a spec- or requirements-driven approach with Copilot and build a countdown app that built anticipation and displayed fireworks as the clock turned over to the new year.

Fortunately, software development is an iterative process and this livestream embraced that fully. While some requirements were well-defined, others evolved in real time, shaped by suggestions from our livestream audience. Custom agents like the Plan agent helped bridge the gap, turning ambiguous ideas into structured plans I could act on. So let’s start at the very beginning, setting up the project.

I generated a new workspace with GitHub Copilot, using a very specific prompt. The prompt explained that we’re building a countdown app and that I wanted to use Vite, TypeScript, and Tailwind CSS v4. It also spelled out some of the requirements, including the dark theme, centred layout, large bold digits with subtle animation, and a default target of midnight on January 1, 2026, with some room for customizations.

#new 

1. Create a new workspace for a New Year countdown app using Vite, TypeScript, and Tailwind CSS v4.

**Setup requirements:**
- Use the @tailwindcss/vite plugin (Tailwind v4 style)
- Dark theme by default (zinc-900 background)
- Centered layout with the countdown as the hero element

**Countdown functionality:**
Create a `countdown.ts` module with:
- A `CountdownTarget` type that has `{ name: string, date: Date }` so we can later customize what we're counting down to
- A `getTimeRemaining(target: Date)` function returning `{ days, hours, minutes, seconds, total }`
- A `formatTimeUnit(n: number)` helper that zero-pads to 2 digits
- Default target: midnight on January 1st of NEXT year (calculate dynamically from current date)

**Display:**
- Large, bold countdown digits (use tabular-nums for stable width)
- Labels under each unit (Days, Hours, Minutes, Seconds)
- Subtle animation when digits change (CSS transition)
- Below the countdown, show: "until [target.name]" (e.g., "until 2026")

**Architecture:**
- `src/countdown.ts` - pure logic, no DOM
- `src/main.ts` - sets up the interval and updates the DOM
- Use `requestAnimationFrame` or `setInterval` at 1 second intervals
- Export types so they're reusable

Keep it simple and clean—this is the foundation we'll build themes on top of.
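
For reference, the countdown logic that prompt asks for boils down to something like the sketch below. This is an illustrative, hand-written version, not the code Copilot actually generated on stream.

```ts
// countdown.ts: minimal sketch of the module described in the prompt.
export type CountdownTarget = { name: string; date: Date };

export function getTimeRemaining(target: Date) {
  const total = Math.max(0, target.getTime() - Date.now());
  const seconds = Math.floor(total / 1000) % 60;
  const minutes = Math.floor(total / (1000 * 60)) % 60;
  const hours = Math.floor(total / (1000 * 60 * 60)) % 24;
  const days = Math.floor(total / (1000 * 60 * 60 * 24));
  return { days, hours, minutes, seconds, total };
}

export function formatTimeUnit(n: number): string {
  return String(n).padStart(2, "0"); // zero-pad to 2 digits
}

// Default target: midnight on January 1st of next year, calculated dynamically.
export function defaultTarget(): CountdownTarget {
  const nextYear = new Date().getFullYear() + 1;
  return { name: String(nextYear), date: new Date(nextYear, 0, 1, 0, 0, 0) };
}
```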

What I love about the “generate new workspace” feature is that Copilot generated custom instruction files for me, automatically capturing my requirements, including the countdown app, Vite, TypeScript, and dark theme. It was all documented before writing a single line of code.

Screenshot of a conversation with Copilot Chat in Visual Studio Code. Copilot has identified that the user wants to create a new workspace for a New Year countdown app using Vite, TypeScript and Tailwind CSS v4. It's going to use the create_new_workspace tool to create the new workspace.

Within minutes, I had a working countdown. Days, hours, minutes, and seconds ticking down to 2026. While it worked, it wasn’t visually exciting. In fairness, I hadn’t specified any design or theme preferences in my initial prompt. So it was time to iterate and make it more interesting.

The community suggestion that steered our course

During the stream, viewers were joining from India, Nigeria, Italy, the United States (the list goes on!); developers from around the world, coming together to learn. One person in the chat made a suggestion that adjusted what we’d do next: What about time zones?

It wasn’t a requirement I’d expected to work on during the stream, so I didn’t have a clear plan for how it would work. Maybe there’d be a globe you could spin to select time zones. Maybe a world map with a time travel theme. That’s a lot of maybes. My requirements were vague, which is where I turned to the Plan agent.

Plan agent: The questions I hadn’t thought to ask

I’ve been using the Plan agent more deliberately lately, especially when I feel that my requirements aren’t fully defined. The Plan agent doesn’t just create a plan from my initial prompt; it asks clarifying questions that can reveal edge cases you may not have considered.

Screenshot of the Copilot Chat input box in Visual Studio Code. The prompt describes requirements for an interactive time zone selector, either a time-machine dial or a mini world map.

I gave it my rough idea: interactive time zone selector, time travel theme, animate between zones, maybe a world map. The Plan agent came back with questions that made me think:

| Question | Why it mattered |
| --- | --- |
| Should the circular dial be primary with the world map as secondary, or vice versa? | I hadn’t decided the visual hierarchy. |
| What happens on mobile: dropdown fallback or touch-friendly scroll? | I was only thinking of a desktop implementation for this initial version. Mobile could be a future requirement. |
| When a time zone passes midnight, show “already celebrating” with confetti, or a timer showing how long since midnight? | I wanted the celebration, not a reverse countdown. I wasn’t clear on my requirements. |
| Would there be subtle audio feedback when spinning the dial, or visual only? | Bringing audio into the app was scope creep, but it could be a future requirement. |

This is the beauty of working with AI in this way. The Plan agent makes you think, potentially asking a clarifying question and offering options A or B. But as you reflect, you realize the answer is somewhere in between. 

For example, in my second iteration of requirements, the plan asked whether fireworks should run continuously, burst once, or loop subtly. I replied that there’s probably a performance consideration, and we should opt for somewhere in the middle. We also asked the livestream viewers to vote on whether we should implement the component as a dial or map. Map won, so we pivoted to a world map as the primary selector with eight featured locations.

Context window management: Just keep what you need

Before implementing, I deliberately started a new chat session.

The context from our previous conversation (workspace creation, basic countdown logic) wasn’t needed anymore. And any context that might have been useful was now included in our custom instructions file. When working with AI tools, that context window is precious. Bringing in irrelevant history clutters the conversation and dilutes focus. So I cleared it, bringing only what mattered: the new requirements, the Plan agent output (which I’d asked Copilot to write to a separate Markdown file), and fresh focus on time zones.

I also reused some custom instruction files, custom agents, and prompt files from another personal project to help steer Copilot in the right direction, and incorporate specialized agents for relevant tasks. This included a UI Performance Specialist agent.

💡 Did you know? GitHub Copilot’s custom agents let you create specialised personas for different development tasks. The UI Performance Specialist agent that I built during the stream is just one example. You can create agents for security reviews, architecture planning, or any role-specific workflow. The awesome-copilot repository has a number of examples.

Implementation: Modular, test-driven, and that map

With the Plan agent’s work complete, I switched to my UI Performance Specialist agent and asked it to review the plan, suggesting deeper implementation details based on its expertise.

Screenshot of Copilot Chat in Visual Studio Code. The agent selector shows "UI Performance Specialist", with a prompt asking the agent to review the plan and to provide implementation details.

Context is important here, so I didn’t create a new conversation. Instead, I continued the existing one. The agent came back with a detailed set of considerations:

  • Frame time budgets for animations
  • Map SVG size optimisation strategies
  • Celebration particle limits (DOM element concerns) and cleanup considerations
  • Animation property recommendations (transform/opacity only)
  • Reduced motion support

It looked good, but I added a couple of additional requirements. I asked the custom agent to make the implementation modular, to write the tests first based on expected behaviour, and once it had failing tests, to write the implementation.

That’s right: test-driven development with Copilot.

The TDD cycle

Copilot created test files for time zone utilities, city state management, and the countdown logic. All failing tests in a red state. Good (one of the few times where we want to see failing tests)! 

Screenshot of Copilot Chat in Visual Studio Code showing GitHub Copilot following a TDD cycle; failing tests first, then implementation.

Then it implemented:

  • Time zone utilities using the Intl.DateTimeFormat API (see the sketch after this list)
  • City state with featured locations (New York, London, Tokyo, Sydney, etc.)
  • localStorage persistence for selected time zones
  • App state management
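
As an example of what those utilities look like in practice, here’s a small sketch using the real Intl.DateTimeFormat API (function and variable names are illustrative, not the exact code Copilot produced):

```ts
// Hypothetical sketch: read a city's local clock time for a given IANA time zone.
export function getLocalTimeParts(timeZone: string, at: Date = new Date()) {
  const formatter = new Intl.DateTimeFormat("en-GB", {
    timeZone,
    hour: "2-digit",
    minute: "2-digit",
    second: "2-digit",
    hour12: false,
  });
  const parts = Object.fromEntries(
    formatter.formatToParts(at).map((p) => [p.type, p.value])
  );
  return { hour: parts.hour, minute: parts.minute, second: parts.second };
}

// Example usage: check the current clock time in Tokyo.
// (A real "already celebrating" check would compare full dates, not just the clock.)
const tokyo = getLocalTimeParts("Asia/Tokyo");
console.log(`Tokyo local time: ${tokyo.hour}:${tokyo.minute}:${tokyo.second}`);
```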

With access to tools, the custom agent also executed the tests in the terminal. Two test cases failed, both in the logic that determined whether the celebration was triggered correctly across the year rollover. The tests expected celebrations to be handled at midnight, and expected the app to report how long it had been since the celebrations began.

Screenshot of Copilot Chat in Visual Studio Code showing GitHub Copilot making implementation changes, running tests and identifying a test failure. Copilot iterates based on the failure, and re-runs the tests to obtain a green set of tests.

Since Copilot had access to the output, the custom agent caught the test failures, adjusted the time zone implementation, and the tests went green.

💡 Thought: This is exactly why TDD and thinking about code quality matter. Just like human developers, AI-assisted development can get things wrong. Tests help us catch bugs before users do. The year rollover edge case would have been embarrassing to discover on December 31, given that it was the core capability of the app!

But some bugs turn into features. I found one bug too funny to fix immediately. Let’s talk about the world map.

The world map, maybe?

When I opened the app, the countdown worked. The time zone selector worked. The calculations were correct, and switching from New York to Tokyo showed the proper time difference.

But the world map? It didn’t quite render as expected. What appeared on screen was more abstract art than geography. But it really made me laugh on stream.

Screenshot from the countdown app. Dots mark locations across the world, but the map itself has rendered as a series of abstract island-like shapes rather than a recognizable world map.

💡 Thought: I was ambitious specifying a world map without providing enough context. No SVG asset, no reference to an existing mapping library. Just “add a mini world map.” A reminder that AI can get things wrong.

Could I have fixed it? Absolutely. But we were over an hour into the stream, and had more features to build. So I left it. The map was a perfect example of iterative development where things don’t always go right the first time. (Can you tell that we build things live yet?)

Fireworks: Building anticipation toward midnight

A countdown on its own is functional, but fireworks add celebration and give some visual flare (See what I did there?). 

I switched back to the Plan agent and created a new chat thread (again, context window management, prompting Copilot to build out a plan):

  • Use Fireworks.js for the effects
  • Set the fireworks behaviour based on time remaining
  • If the timer has more than 24 hours left, don’t display fireworks, just show ambient stars
  • If the timer has between 24 to 12 hours remaining, set off fireworks every 30 seconds
  • Between one hour and 10 minutes remaining, the intensity of the fireworks should build
  • And finally, in the last 10 seconds we should have continuous fireworks for maximum celebration

I also asked for a skyline silhouette at the bottom of the screen, a dark night sky gradient, and a theme controller. Plus, one critical testing requirement: “Add a query parameter so I can specify how many minutes away we are from midnight as an override for manual testing.” While I enjoy streaming with our community, I’m not sure that everyone would have enjoyed hanging around until the turn of 2026 to see the results!
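
Put together, the tiers and the testing override amount to something like the sketch below. The tier names and thresholds follow the list above; ranges the list doesn’t spell out fall back to the ambient tier here, and none of this is the exact generated code.

```ts
// Hypothetical sketch: map time remaining to a fireworks intensity tier,
// with a ?minutesToMidnight query parameter as a manual testing override.
type Intensity = "stars-only" | "every-30s" | "building" | "finale";

export function getIntensity(msRemaining: number): Intensity {
  const seconds = msRemaining / 1000;
  const hours = seconds / 3600;
  if (seconds <= 10) return "finale";                  // last 10 seconds: continuous fireworks
  if (hours <= 1 && seconds >= 600) return "building"; // 1 hour down to 10 minutes: build intensity
  if (hours <= 24 && hours >= 12) return "every-30s";  // 24 to 12 hours: a burst every 30 seconds
  return "stars-only";                                 // more than 24 hours (and unspecified gaps): ambient stars
}

export function getTargetTime(defaultTarget: Date): Date {
  const minutes = new URLSearchParams(window.location.search).get("minutesToMidnight");
  return minutes ? new Date(Date.now() + Number(minutes) * 60_000) : defaultTarget;
}
```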

The Plan agent asked for further clarification on how to display the stars (either rendering them with CSS or as low-intensity fireworks), as well as some considerations around performance. It also asked about toggle placement, which caught me out. I didn’t remember asking for a toggle button and may have missed that in an iteration of the plan.

After carefully reviewing the plan, it turned out that I had originally requested an animation toggle for accessibility. This is why I like the Plan agent. It’s rubber ducking with AI that has the context of your conversation, and can check whether those requirements still make sense.

Screenshot of Copilot Chat in Visual Studio Code showing an interaction where Copilot asks clarifying questions to confirm the requirements.

Once Copilot and I renegotiated the requirements, we used that familiar test-driven development approach. One test failed initially because the JSDOM environment setup was missing. Copilot spotted the failure, identified the misconfigured test setup, and made the fix. After that, all tests went green.

We now had an app with fireworks at different intensity levels, an animated starfield using CSS, a city skyline, reduced motion support, and a query parameter override.

Testing the intensity levels

I added ?minutesToMidnight=1 to the URL. Fireworks appeared with medium intensity, building excitement with increasing numbers of colors and particles across the sky. At “midnight,” Happy New Year appeared with even more celebration. The intensity curve felt right: the buildup created anticipation and the finale delivered.

Screenshot from the countdown app, showing the timer has reached 0 and fireworks are being shown on screen to celebrate.

Reveal: What I built that morning

But I didn’t stop there. Throughout the stream, I’d been teasing that I’d made another countdown app earlier that morning, something with a very relevant theme. Our viewers guessed another fireworks countdown, a confetti timer, and even an “elicitation-powered tic-tac-toe” (which, to be fair, we have built before). 

But since this was a GitHub stream, there was only one way we could finish it off. We had to have a contribution graph themed countdown!

The countdown sat in the centre in front of an animated contribution graph. Each square flickered with green contributions appearing and disappearing across the grid in waves. And just like the fireworks theme, as the countdown ticked closer to zero, more squares lit up and the intensity built.

Screenshot from a GitHub contribution graph themed countdown app. It shows 2026 in green blocks, layered on top of a contribution graph that has several squares active with different shades of green to represent contributions.

This stream was a celebration. A way to bring our community together across time zones, all of us building and counting down to the same moment in our own corners of the world.

During the stream, someone asked about the best programming languages for landing jobs. My answer was the same as my approach to this project: find the thing that brings you joy, and then the right tools and languages just fall into place. I built this GitHub countdown theme because it brought me joy. Because I wanted to make something “GitHubby,” and because I enjoy building visual experiences. 

Since that stream, I’ve worked on bringing these two projects into a unified open source countdown app, Timestamp. It has a centralized theme orchestrator, allowing developers to plug into a common architecture and extend it with new themes. Every countdown is a URL, so it can be easily shared, and there are several countdown modes to choose from (local time, absolute moments, and timers).

You can check out the live app and review the codebase. You’re welcome to take a look at the repository, star it, fork it, and even contribute a new theme. 

I hope this inspires you to build that one project that has been on the backlog, and spend some time on the thing that brings you a little bit of joy.

What have we learned?

  • Context window management is a skill. Start new chat sessions when old context isn’t needed. Keep conversations focused. It’s context engineering, not just prompt engineering.
  • The Plan agent asks questions you may have forgotten. Use it when requirements are vague. Let it reveal edge cases through clarifying questions. Sometimes the answer to A or B is “somewhere in the middle.”
  • Custom agents are specialised helpers. My UI Performance Specialist had expertise in frame budgets, animation properties, and accessibility. It gave implementation details while the plan agent helped ask clarifying questions to determine the scope. Specialisation matters.
  • TDD with Copilot works. Write tests first. Let them fail. Implement to pass. Just like us developers, AI-assisted tools produce bugs. We need to use those same practices that we’re used to for checking quality (builds, linters, and tests) to catch issues before users do.
  • Things won’t always work the first time. That’s okay. The world map didn’t render as expected, and I left it that way until my significant refactor and rebuild of the countdown app. Authentic development means showing the messy middle, not just polished outcomes. We learn from unexpected results as much as from successes.
  • Scope ambitiously, implement iteratively. We went from basic countdown, to time zones, to intense fireworks, to a separate contribution graph themed countdown. Rome wasn’t built in a day, and you don’t need everything on day one.

What will you build in 2026? Drop by the next Rubber Duck Thursdays stream at 10:30 a.m. UK time and 2:00 p.m. Eastern time, and let’s build something that brings us joy: that project that hasn’t quite reached the top of the “some day” list!

The post Context windows, Plan agent, and TDD: What I learned building a countdown app with GitHub Copilot appeared first on The GitHub Blog.
