Sr. Content Developer at Microsoft, working remotely in PA, TechBash conference organizer, former Microsoft MVP, Husband, Dad and Geek.
153909 stories
·
33 followers

How to Watch Google I/O

1 Share
Google I/O is back with updates to Search, Android, Gemini, and a fresh peek at upcoming Android XR smart glasses. Here's how to watch the announcements live and what to expect.
Read the whole story
alvinashcraft
3 hours ago
reply
Pennsylvania, USA
Share this story
Delete

Agent Skills Work but the Research Shows Most Teams Are Building Them Wrong


This post was originally published on The Nuanced Perspective and is being reposted here with the authors’ permission.

Agent skills are everywhere right now. Atlassian built them into Rovo so agents can automatically triage Jira tickets, draft Confluence pages, and route service requests without anyone typing a prompt. Canva and Figma use them so Claude can interact with design files directly. Stripe published skills for payment workflow automation. When Anthropic launched the Agent Skills open standard in December 2025, Microsoft adopted it in VS Code and GitHub within weeks.

The idea is elegantly simple. Instead of building a new specialized agent for every use case, you write a skill once, and any agent that understands the standard can use it. A code reviewer, a PR generator, a deployment checklist, a sprint planner. Each lives in a folder, triggers when relevant, and brings your team’s specific way of doing things into the agent’s context.

But the research on whether skills actually work, and what causes them to fail, is only catching up to adoption now. Four recent papers take the first systematic look at skills in practice: what the benchmarks show, how libraries break down as they grow, and what a more principled approach to orchestration looks like.

Three findings that will change how you think about skills:

  • Curated skills raised the rate at which agents successfully completed tasks by 16.2 percentage points on average across 84 tasks. Model-written skills showed no consistent benefit across any configuration tested.
  • As skill libraries grow, the agent’s ability to find the right skill on demand breaks down. When it scans every skill description in one pass, similar-sounding skills start colliding. Organizing skills into a hierarchy rather than a flat list is what the research shows actually fixes this.
  • A large-scale security study of ~31K community skills found that more than one in four contain exploitable vulnerabilities, spanning prompt injection, data exfiltration, and privilege escalation.

This is what those papers found, and what it means for anyone building with skills today.

What a skill is

Your team has a specific way of reviewing PRs. Particular checks, a specific order, standards that go beyond what any generic reviewer would know. You’ve explained it to every new engineer who joined. A skill is how you stop explaining it and let the agent carry it instead. In practice it’s a folder with a SKILL.md file at the center: a description that acts as the trigger condition, a body with step-by-step instructions, and optionally scripts and reference documents that load only when needed. A scoped set of tools and instructions the agent can invoke.

At session startup, the agent reads only the name and description from each installed skill, which is about 100 tokens per skill. The full instructions load only when the skill activates, and scripts run without being read into context at all. A large skill library costs almost nothing at initialization. The context budget only gets spent when a skill is actually running.
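The startup-versus-activation split can be sketched in a few lines. This is a minimal illustration, not how any particular agent implements it: it assumes each skill folder holds a SKILL.md whose front matter is delimited by `---` lines with plain `name:` and `description:` entries, and the helper names (`index_skills`, `load_instructions`) are hypothetical.

```python
from dataclasses import dataclass
from pathlib import Path


@dataclass
class SkillStub:
    """What stays in context at startup: name and description only."""
    name: str
    description: str
    path: Path


def parse_front_matter(text: str) -> tuple[dict[str, str], str]:
    """Split a SKILL.md into 'key: value' front matter and the body.

    Assumption: front matter sits between two '---' lines. This is a
    simplified parser, not a full YAML implementation.
    """
    lines = text.splitlines()
    if not lines or lines[0].strip() != "---":
        return {}, text
    meta: dict[str, str] = {}
    for i, line in enumerate(lines[1:], start=1):
        if line.strip() == "---":
            return meta, "\n".join(lines[i + 1:])
        key, sep, value = line.partition(":")
        if sep:
            meta[key.strip()] = value.strip()
    return {}, text  # no closing delimiter: treat everything as body


def index_skills(root: Path) -> list[SkillStub]:
    """Startup pass: keep only the name and description of each skill."""
    stubs = []
    for skill_file in sorted(root.glob("*/SKILL.md")):
        meta, _body = parse_front_matter(skill_file.read_text(encoding="utf-8"))
        stubs.append(SkillStub(
            name=meta.get("name", skill_file.parent.name),
            description=meta.get("description", ""),
            path=skill_file,
        ))
    return stubs


def load_instructions(stub: SkillStub) -> str:
    """Activation pass: read the full body only when the skill is chosen."""
    _meta, body = parse_front_matter(stub.path.read_text(encoding="utf-8"))
    return body
```

Only the stubs returned by `index_skills` would need to be in context at initialization; `load_instructions` is called for the one skill that activates.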

That’s progressive disclosure, and it’s what makes skills different from system prompts, which load everything globally every session, and from tools, which are API calls that give the agent direct capabilities. The same distinction holds for MCP: it gives the agent abilities, say, a shell, an API connection, or access to a database, whereas skills encode the knowledge of how to use those abilities well for a specific workflow. Block’s engineering team put it well: skills are like GitHub Actions YAML, and MCP is the runner. One describes the workflow and the other makes it possible.

Some concrete examples of what this looks like in practice, from teams that have shipped skills in production:

  • A PR review skill that loads your org’s specific style guide, flagging violations and blockers according to your team’s standards rather than generic best practices
  • A deployment checklist skill that runs your team’s exact predeploy sequence, covering environment checks, rollback verification, and the three Slack channels to notify in order
  • A data reporting skill that knows your company’s metric definitions, so when someone asks for “revenue,” it pulls the right number rather than the closest approximation
  • A sprint planning skill that fetches the backlog, applies your team’s capacity rules, and proposes a plan structured the way your team runs standups

The value in each of these isn’t the task itself. Any agent can attempt a PR review or a sprint plan. The value is the organizational knowledge baked into how the skill executes it: your style rules, your deploy sequence, your metric definitions, your team’s way of running things. That specificity is also what makes skills hard to get right, as the benchmarks show.

What the benchmarks show

SkillsBench is the first benchmark built specifically to measure whether agent skills actually improve performance. It tested 84 tasks across 11 domains, running each task under three conditions: no skill, a curated skill, and a self-generated skill. The results are worth sitting with.

Curated skills raised average pass rates by 16.2 percentage points. However, the gains were uneven across domains. Software engineering tasks improved by 4.5 percentage points, while healthcare tasks improved by nearly 52. The domains where skills helped most were the ones with highly structured workflows and domain-specific conventions the base model doesn’t carry natively.

The less-cited result is that self-generated skills, where the model writes its own skill rather than a human curating one, provided no average benefit across configurations (“SkillsBench,” Table 3). Some model configurations saw small gains; others saw small losses. The paper’s conclusion was that models cannot reliably author the procedural knowledge they benefit from consuming. The trajectory analysis in the benchmark identified two failure modes:

  • Models generate imprecise procedures that lack specific API patterns, or
  • Models fail to recognize what domain knowledge the task actually requires.

The benchmark’s self-generation condition has also drawn pushback from practitioners. One engineer writing on HackerNoon argues the test doesn’t reflect how skilled teams actually build skills. The benchmark prompted a fresh agent to write a skill and immediately use it, which is closer to asking a model to think harder before attempting a task than to building a skill from real execution experience. His own replication, using skills built from actual debugging sessions, showed much stronger results. The distinction matters because a skill captures what a fresh model wouldn’t know. If the model could have reasoned its way there anyway, the skill wasn’t needed.

This matters in practice because self-generation is the obvious shortcut. You finish a workflow, ask the agent to extract it as a skill, and move on. The benchmark says that without a human review step, you’re not getting the gains you’d expect. The skills look complete. They often cover the main path. What they miss are the edge cases, the exceptions, the three things your team does differently that the model has no way of knowing, and those are exactly the things that make a skill valuable.

One finding worth noting for anyone building with skills: focused skills with two to three modules consistently outperformed comprehensive documentation (“SkillsBench,” Section 4.2). More coverage in a single skill didn’t help; more focused, well-scoped skills did. The benchmark also found that smaller models running with curated skills could match larger models running without them, which is a meaningful cost implication for anyone running skills at scale (“SkillsBench,” Section 4.2.3, Finding 7).

Questions that come up when building with skills

These questions show up every time a team starts building a skill library.

When does something become a skill versus staying in a workflow or system prompt?
A clean test is whether this is a recurring task that your team has a specific, repeatable way of doing. If yes, it’s a skill candidate. If it’s a one-time flow or something where general reasoning is sufficient, it probably doesn’t need one. The key difference between a skill and a workflow tool like n8n is flexibility. A workflow executes a fixed sequence and breaks when inputs change, while a skill gives the agent procedural guidance it can apply to variations of the same task. Similarly, agentic workflows can chain multiple agents and tasks together, but each agent still benefits from skills that encode the org-specific knowledge for its part of the chain. When you want the “what” to be consistent but want the agent to handle the “how” intelligently, that’s a skill.

How narrow or broad should a skill be?
The SkillsBench finding that focused skills with two to three modules outperform comprehensive ones is directly relevant here (“SkillsBench,” Section 4.2). A skill that tries to cover an entire domain tends to underperform one that handles a specific thing well. The more practical question is whether to put a full workflow (data fetch, format, generate PDF) into one skill or split it. Current research supports splitting, because each piece then becomes reusable, easier to update when something changes, and less likely to create unexpected behavior when one module’s scope drifts.

What about skills for noncoders or nonsoftware workflows?
Skills are format-agnostic. They’re structured instructions plus optional scripts, and the domain can be anything. A customer support team can encode their escalation criteria, tone guidelines, and the specific conditions where a human always takes over. A legal team can encode their document review checklist. A design team can encode component standards so reviews stay consistent across contributors. Atlassian’s Rovo agents are a useful reference outside the coding context. Their skills handle ticket triage, Confluence page creation, and service request routing, none of which is software engineering.

When should you deprecate a skill?
This is the question that gets skipped most often. The “SoK” paper argues for treating skills like any other maintained artifact through discovery, refinement, evaluation, update, and eventually deprecation (see Figure 2 in the paper). A skill that was compensating for a model capability gap six months ago may now be redundant, and worse than redundant if it’s overriding better native behavior. The practical test is to run the task with and without the skill and check if the skill still helps. If the gap has closed, retire it.
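The with-and-without test reduces to comparing two pass rates. The sketch below assumes you already have pass/fail outcomes from running the same task set under both conditions; the function names and the `min_gain` threshold are illustrative choices, not values taken from the “SoK” paper.

```python
def pass_rate(results: list[bool]) -> float:
    """Fraction of runs that passed; 0.0 for an empty list."""
    return sum(results) / len(results) if results else 0.0


def skill_still_helps(
    with_skill: list[bool],
    without_skill: list[bool],
    min_gain: float = 0.05,  # hypothetical threshold: 5 percentage points
) -> bool:
    """True if the skill still lifts the pass rate by at least min_gain."""
    return pass_rate(with_skill) - pass_rate(without_skill) >= min_gain


# Example: 8/10 passes with the skill versus 6/10 without it.
keep = skill_still_helps([True] * 8 + [False] * 2, [True] * 6 + [False] * 4)
```

With small run counts the difference is noisy, so a handful of runs per condition is a weak basis for retiring a skill.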

What breaks as the library grows

A single well-written skill works well. As libraries grow, flat retrieval breaks down, and the “AgentSkillOS” paper is the first to study this systematically across ecosystem scales from 200 to 200,000 skills.

Flat skill libraries don’t scale. When the agent scans a flat directory of, say, 80+ skills on every request, retrieval becomes unreliable. Two skills with similar descriptions start triggering interchangeably and behavior becomes nondeterministic for the same input. At the extreme, the orchestrator falls into routing collapse, where it consistently invokes the wrong skill because the semantic embeddings of two similar skills are indistinguishable. The output looks reasonable, but the wrong skill ran.

The fix the paper proposes is capability trees: organize skills into a hierarchy rather than a flat list. Top-level domains like code, data, docs, with more specific skills as branches and leaves. The agent navigates from domain to branch to leaf instead of scanning everything. They also introduce a usage frequency queue, where skills that aren’t being invoked or aren’t improving outcomes get moved to a dormant index so they don’t pollute retrieval for active skills.
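A toy version of the idea, to make the navigation concrete. This is not the “AgentSkillOS” implementation: the `Node` structure, the word-overlap scoring (a stand-in for whatever matching a real agent uses), and the `dormant` flag are all assumptions made for illustration.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Node:
    """A domain, branch, or leaf skill in a hypothetical capability tree."""
    name: str
    description: str
    children: list["Node"] = field(default_factory=list)
    dormant: bool = False  # dormant nodes are skipped during routing


def overlap(query: str, text: str) -> int:
    """Crude relevance score: number of shared lowercase words."""
    return len(set(query.lower().split()) & set(text.lower().split()))


def route(root: Node, query: str) -> Optional[Node]:
    """Walk domain -> branch -> leaf, comparing only siblings at each level."""
    node = root
    while node.children:
        active = [child for child in node.children if not child.dormant]
        if not active:
            return None
        best = max(active, key=lambda child: overlap(query, child.description))
        if overlap(query, best.description) == 0:
            return None  # nothing at this level matches the request
        node = best
    return node


tree = Node("root", "", [
    Node("code", "source code review testing and deployment", [
        Node("pr-review", "review a pull request against the team style guide"),
        Node("deploy-checklist", "run the predeploy checklist before a release"),
    ]),
    Node("docs", "documentation pages and release notes", [
        Node("changelog", "draft release notes from merged changes"),
    ]),
])
chosen = route(tree, "review this pull request for style")
```

At each level the agent compares a handful of siblings rather than every description in the library, which is the property that keeps similar-sounding skills in different branches from colliding.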

Testing this across ecosystems ranging from 200 to over 200,000 skills, the structured approach consistently outperformed flat invocation, and the gap widened as library size grew.

This pattern shows up in how production teams manage their libraries too. Atlassian recommends fewer than five skills per Rovo agent. OpenHands maintains a curated extensions repository with separate skill packages for discrete workflows rather than one monolithic skill set. Across all of them, scoped purposeful skill sets outperform comprehensive ones. More skills isn’t more capable. Past a point, it’s just more noise.

How orchestration can work differently

This section uses a different definition of skill than the rest of the article, so the distinction matters upfront.

In the “SkillOrchestra” paper, a skill isn’t a SKILL.md file. It’s a capability description used to match task requirements to individual agents in a multi-agent system (see Figure 3 in the paper). The concern isn’t procedural knowledge for one agent but figuring out which agent in a pool should handle a given task and why.

The problem it’s solving is that standard reinforcement learning approaches to multi-agent routing don’t hold up as systems grow. Adding a new agent or modifying a workflow means retraining from scratch. RL policies also tend to send everything to the highest-capability agent regardless of cost, which looks fine in evaluation but gets expensive when you’re running it in production.

SkillOrchestra’s alternative has each agent maintain a competence profile derived from its own execution history, specifically estimated success rates across different task types. The orchestrator routes incoming tasks to the agent whose profile best matches what the task actually demands, rather than the one with the highest raw capability. The routing logic stays current without retraining, and you can inspect why a task went where it went.
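A rough sketch of profile-based routing, under stated assumptions: the `CompetenceProfile` class, the smoothed success-rate estimate, and the linear cost penalty (`cost_weight`) are illustrative choices, not the formulation used in the “SkillOrchestra” paper.

```python
from collections import defaultdict


class CompetenceProfile:
    """Per-agent success history by task type (hypothetical structure)."""

    def __init__(self, cost_per_task: float) -> None:
        self.cost_per_task = cost_per_task
        self._attempts: dict[str, int] = defaultdict(int)
        self._successes: dict[str, int] = defaultdict(int)

    def record(self, task_type: str, success: bool) -> None:
        """Update the profile after each execution; no retraining step."""
        self._attempts[task_type] += 1
        if success:
            self._successes[task_type] += 1

    def success_rate(self, task_type: str) -> float:
        """Smoothed estimate (s + 1) / (n + 2); 0.5 when there is no history."""
        return (self._successes[task_type] + 1) / (self._attempts[task_type] + 2)


def route(task_type: str, agents: dict[str, CompetenceProfile],
          cost_weight: float = 0.1) -> str:
    """Pick the agent with the best estimated success minus a cost penalty."""
    def score(name: str) -> float:
        profile = agents[name]
        return profile.success_rate(task_type) - cost_weight * profile.cost_per_task
    return max(agents, key=score)


agents = {"large": CompetenceProfile(cost_per_task=5.0),
          "small": CompetenceProfile(cost_per_task=1.0)}
for name in agents:
    for i in range(10):
        agents[name].record("triage", success=i < 9)      # both do well
for i in range(10):
    agents["large"].record("refactor", success=i < 9)     # large does well
    agents["small"].record("refactor", success=i < 1)     # small struggles
```

Because the decision is a readable score per agent, you can inspect why a task went where it went, and adding an agent only means adding a profile.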

The same logic applies to SKILL.md-based systems. Tracking which skills actually improve outcomes for specific task types, and what they cost in tokens, gives you the foundation for better selection as your library grows. You don’t need SkillOrchestra’s full framework to benefit from the core idea.

The security problem

A large-scale security analysis of 31,132 community-sourced skills found that 26.1% contain at least one exploitable vulnerability, spanning prompt injection, data exfiltration, privilege escalation, and supply chain risks. More than one in four.

The attack patterns aren’t exotic. Prompt injection hidden in skill descriptions that manipulate agent behavior once the skill loads. Scripts that execute against filesystem permissions broader than the skill needs. Tool authorizations scoped to the entire workspace when the task only requires one directory.

The core issue is that an external skill isn’t a document you’re reading. It’s code running with your agent’s permissions. Importing a skill from a public repository without reviewing it is like doing an npm install from an unknown author. You wouldn’t do that without at least checking what the package does. That framing changes what due diligence looks like. It means checking the scripts folder before installing, verifying that the permissions the skill requests match what the task actually requires, and sandboxing execution where your environment allows.
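As an aid to that manual review, a small script can at least list what a skill ships and flag lines worth reading first. This is a sketch, not a security scanner: it assumes the skill keeps its scripts in a `scripts/` subfolder, and the `RISKY_PATTERNS` list is a hypothetical starting point that will miss real attacks and flag harmless code.

```python
from pathlib import Path

# Hypothetical substrings worth a closer look; extend for your environment.
RISKY_PATTERNS = ("curl ", "wget ", "rm -rf", "os.environ", "subprocess", "eval(")


def audit_skill(skill_dir: Path) -> dict[str, list[str]]:
    """Map each file under scripts/ to the risky patterns found in it.

    An empty list means no pattern matched, not that the file is safe.
    """
    findings: dict[str, list[str]] = {}
    scripts = skill_dir / "scripts"
    if not scripts.is_dir():
        return findings
    for path in sorted(p for p in scripts.rglob("*") if p.is_file()):
        text = path.read_text(encoding="utf-8", errors="replace")
        relative = path.relative_to(skill_dir).as_posix()
        findings[relative] = [pat for pat in RISKY_PATTERNS if pat in text]
    return findings
```

The output is a reading list for a human reviewer; it does not replace checking requested permissions or sandboxing execution.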

The tooling for auditing skills at install time doesn’t exist at the level it should yet. Until it does, the due diligence is manual. OpenHands’ extensions repository and Atlassian’s open source skill package are reasonable references for how production-grade community skills scope permissions. Claude Code’s built-in skill creator also helps here, since it structures permission scoping explicitly from the start.

3 things to do differently

Across all four papers, three recommendations are consistent.

Write skills from real execution. Do the workflow manually with an agent, correct it as you go, then extract it as a skill. The agent has full context of what worked. Skills built from real runbooks, incident reports, and accumulated corrections outperform skills written from scratch. The org-specific edge cases are exactly what the base model doesn’t already know. The general workflow it can handle; the three exceptions your team deals with differently are what the skill needs to capture.

Treat the description as routing logic. The description isn’t a label. It’s how the skill gets triggered at all. Specific phrases, explicit activation conditions, context that distinguishes this skill from adjacent ones. If a skill isn’t firing when you expect it to, or fires when it shouldn’t, rewrite the description first. That’s almost always where the problem is.

Plan for the full lifecycle. Creation is the easy part. Skills drift out of relevance as models improve. A skill that compensated for something Claude couldn’t do eight months ago may now be actively overriding better native behavior. They need to be evaluated against actual task outcomes, updated when workflows change, and retired when they stop earning their place. The teams that treat their skill libraries the way good engineering teams treat their codebase, with reviews, with metrics, with a process for deprecation, are the ones whose libraries stay useful as they grow.

Where this is heading

The shift from prompt engineering to tool use to skill engineering has followed a pattern. Each era produces artifacts that persist longer than the last. Prompts lived in conversations. Tools live in configurations. Skills live in libraries, versioned, shared, maintained, and eventually retired. They behave like code.

Most teams aren’t treating them that way yet. Skills get written quickly, without evaluation criteria, without any plan for what happens when they stop being useful. That’s worked so far because most skill libraries are still small enough to hold in your head. It won’t hold as they become infrastructure.

The teams building durable agent systems won’t be the ones with the most skills. They’ll be the ones who figured out earlier that a skill library needs to be maintained, not just populated, and who started building the discipline to do that before it became urgent.


This article grew out of a live “Chai & AI” session conducted by Prahitha Movva where practitioners debated whether agent skills actually deliver on the hype, or just add another layer of complexity.




Angular Is Exciting Again, and v21 Proves It

Angular once felt heavy and outdated. Signals, standalone components, zoneless APIs, and Angular 21 may have changed that for good.

When Applying Scrum By The Book Fails, Understanding Context Before Changing The System | Christian Thordal


Christian Thordal: When Applying Scrum By The Book Fails, Understanding Context Before Changing The System

Read the full Show Notes and search through the world's largest audio library on Agile and Scrum directly on the Scrum Master Toolbox Podcast website: http://bit.ly/SMTP_ShowNotes.

 

"I treated Scrum like a military SOP — follow the book, execute the steps. But I failed to see that the context was really the tipping point. What looked like a problem was actually their solution." - Christian Thordal

 

Christian shares a hard-won lesson from his time coaching three RPA teams at one of Denmark's largest banks during the pandemic. He inherited teams running six-week sprints with half-hour planning sessions that amounted to little more than putting items on a calendar. As a former Danish Army officer, Christian's instinct was to fix the obvious deviation from the Scrum Guide — the sprint length. He advocated for shorter feedback loops and eventually convinced the Product Owner, who also served as the director, to try two-week sprints. The first planning session was a disaster. There was yelling and scolding, and it became clear that the real problem had nothing to do with sprint length. The teams had no proper backlog. The six-week sprints actually worked because they gave teams enough time to go out to the business, discover work, and deliver it within a single cycle. Christian realized he had been applying Scrum mechanically without understanding how work entered the system. He started attending business analyst and PO meetings, uncovered the backlog gap, and helped the teams build a proper one. His key insight: what looks like a symptom can actually be a pragmatic solution to real constraints. Understand the system before you change it.

 

In this episode, we refer to the book Scrum: The Art of Doing Twice the Work in Half the Time, by Jeff Sutherland.

 

Self-reflection Question: When was the last time you assumed a team's practice was wrong, only to discover it was a reasonable adaptation to their context? How might you investigate the "why" behind existing processes before proposing changes?

 

[The Scrum Master Toolbox Podcast Recommends]

🔥In the ruthless world of fintech, success isn't just about innovation—it's about coaching!🔥

Angela thought she was just there to coach a team. But now, she's caught in the middle of a corporate espionage drama that could make or break the future of digital banking. Can she help the team regain their mojo and outwit their rivals, or will the competition crush their ambitions? As alliances shift and the pressure builds, one thing becomes clear: this isn't just about the product—it's about the people.

 

🚨 Will Angela's coaching be enough? Find out in Shift: From Product to People—the gripping story of high-stakes innovation and corporate intrigue.

 

Buy Now on Amazon

 


 

About Christian Thordal

 

Christian Thordal is a former Danish Army officer turned Agile Coach. He works with leaders and teams to create clarity, accountability, and momentum in complex organizations. His approach blends military leadership principles with modern product development, helping organizations move from discussion and strategy to real execution and measurable results.

 

You can link with Christian Thordal on LinkedIn.

 





Download audio: https://traffic.libsyn.com/secure/scrummastertoolbox/20260518_Christian_Thordal_M.mp3?dest-id=246429

Episode 105: The Future of Trader Joe's Depends on These Grads


Joining us for this episode of Inside Trader Joe's is a group of Trader Joe's shoppers from the high school Class of 2026. They represent, quite literally, the future, not just for Trader Joe's, but for all of us. We chatted with them about their favorite TJ's products and the way they cook, got their feedback about some new products that haven't quite made their way to the shelves at Trader Joe's just yet. Most importantly, they schooled us on some current slang – let's just say, this episode hits!

Transcript (PDF)





Download audio: https://traffic.libsyn.com/secure/insidetjs/The_Future_of_Trader_Joes_Depends_on_These_Grads.mp3?dest-id=704103

OVCS: Raspberry Pi–powered electric car


This Maker Monday, we’ve gone big with this Raspberry Pi–powered electric ‘Frankencar’ made up of parts from different vendors. The OVCS (Open Vehicle Control System) team converted an old VW Polo into an electric vehicle that can be driven remotely.

And kids, adults, all of the above, I cannot stress this enough — DO NOT TRY THIS AT HOME.

The electric vehicle (EV) revolution is just about here. With EV charging points found nearly everywhere, and more established car manufacturers introducing electric variants of existing models — or entirely new ones — a petrol-free future seems closer than ever. Of course, as a reader of this blog, you will be aware of how technology companies can be, and won’t be surprised to know that these manufacturers are using a lot of proprietary tech in their vehicles.

The vehicle is currently not road-legal

“Our project, OVCS [Open Vehicle Control System], aims at breaking the traditional vendor lock-in that you see in cars and other vehicles,” Marc Lainez tells us. “We want to make it possible to interface parts from different brands together as if they were always meant to be working that way. Most car parts have a universal functionality to perform (braking, steering, showing data…), but the language they speak is different. So we thought we could build such a platform that would allow tinkerers like us to extend or swap a vehicle’s functionalities with parts from any brands.”

Marc and his team have developed a prototype of this platform, which uses Raspberry Pi to translate between the different parts.

Cross-compatible

The team had been looking for a larger-scale hobby project they could sink their teeth into, one that would combine all of their various interests. EVs ended up sitting at the centre of the Venn diagram.

A custom steering column was used, so why not attach a very serious racing wheel?

“[We] were growing concerned with the security and reliability of vehicle software platforms,” Marc says. “We thought this was the perfect project to learn a ton about how car parts communicate, how they interact together, and how we could seamlessly integrate parts together using modern languages on off-the-shelf hardware components like Raspberry Pi.”

Raspberry Pi is used in multiple ways in the concept car: first, to power the vehicle management system, which they describe as the brains of the platform.

“It translates messages from the different communication buses (CAN) and routes them to the appropriate ones,” Marc explains. “In total, we have five CAN networks that are being accessed through SPI modules connected to a Raspberry Pi. Without this Raspberry Pi, the car wouldn’t be driveable.”

The prototype was built on wood before any modifications to a real car happened

It’s also used in the infotainment system — something we’ve seen several folks do with Raspberry Pi in cars before. Not only does it show all of the usual info about your vehicle, but it also includes a touchscreen automatic gear shift, as the car is an EV conversion.

Finally, there’s the radio bridge: “[It’s] a component connected to the CAN bus and sends instructions to the VMS to accelerate, brake, and steer,” Marc says. With it, they can control the car remotely.

“From a software perspective, we wanted to have a technology stack that was familiar and at the same time, something ‘batteries-included’ that would allow us to easily build firmware in a high-level language while at the same time making the firmware updates really easy,” Marc continues. “Since we had done quite a lot of Elixir development, we used Nerves. This is an IoT framework built in Elixir and Erlang that relies on Buildroot (Linux build system) and gives you the ability to write your firmware in plain Elixir, a high-level functional language. It made our development cycles much faster/shorter and easier and allowed us to use a language we were already familiar with.”

Put it in reverse

Getting the various parts to communicate — such as a Nissan Leaf electric motor and parts from a VW Polo — was one of the hardest elements, as manufacturers generally do not publish documentation on how their components communicate.

The infotainment system, also powered by a Raspberry Pi, has a touchscreen gear selector

“We had to reverse-engineer quite a lot of messages in order to make the car functional,” Marc reveals. “To give an example, if you want to know what message gives you the handbrake status (pulled or not), you look at all that is passing on the bus, you pull the handbrake a few times to see what frame is perfectly synchronised with your action to isolate its ID, then you check which bytes change when you pull it… For more complex components, this is a combination of multiple messages and, fortunately, there is a community of car tinkerers who publish their findings on forums online. Sometimes the work was done; sometimes partially and we had to complete it.”
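The handbrake example can be approximated offline with two captures, one taken with the handbrake released and one with it pulled. The sketch below is a simplified variant of what Marc describes (he watches the live bus while toggling the handbrake), not the OVCS team's tooling; the frame representation and the CAN IDs in the example are made up for illustration.

```python
Frame = tuple[int, bytes]  # (CAN ID, payload), a hypothetical capture format


def stable_bytes(frames: list[Frame]) -> dict[int, dict[int, int]]:
    """For each CAN ID, map byte index -> value for bytes that never changed."""
    seen: dict[int, list[set[int]]] = {}
    for can_id, data in frames:
        slots = seen.setdefault(can_id, [set() for _ in data])
        for index, value in enumerate(data[:len(slots)]):
            slots[index].add(value)
    return {
        can_id: {i: next(iter(values)) for i, values in enumerate(slots)
                 if len(values) == 1}
        for can_id, slots in seen.items()
    }


def candidate_signals(released: list[Frame], pulled: list[Frame]) -> list[tuple[int, int]]:
    """Return (CAN ID, byte index) pairs that are steady in each state
    but take a different value between the two states."""
    a, b = stable_bytes(released), stable_bytes(pulled)
    hits = []
    for can_id in sorted(a.keys() & b.keys()):
        for index in sorted(a[can_id].keys() & b[can_id].keys()):
            if a[can_id][index] != b[can_id][index]:
                hits.append((can_id, index))
    return hits


# Invented captures: byte 0 of ID 0x320 follows the handbrake,
# byte 1 is a changing counter, and ID 0x100 is unrelated.
released = [(0x320, bytes([0x00, 0x10])), (0x320, bytes([0x00, 0x11])),
            (0x100, bytes([0x05, 0x05]))]
pulled = [(0x320, bytes([0x01, 0x12])), (0x320, bytes([0x01, 0x13])),
          (0x100, bytes([0x05, 0x05]))]
hits = candidate_signals(released, pulled)
```

Bytes that change within a single state, such as counters or checksums, are filtered out, which narrows the search to payload bytes that track the action.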

Over the course of 18 months, the team did manage to make their ‘Frankencar’ driveable, which was their main goal — they then went beyond that by making it remote-controlled. They also wanted to document their build, which they’re in the process of completing. After that? “The next goal is for the car to be self-driving,” Marc says.

Issue 165 of Raspberry Pi Official Magazine is out now!

If you liked this article, there are many more like it in the latest issue of Raspberry Pi Official Magazine. You can purchase a copy from the Raspberry Pi Store in Cambridge. It’s also available from our online store, which ships around the world. And you can get a digital version via our app on Android or iOS.

You can also subscribe to the print version of our magazine. Not only do we deliver worldwide, but those who sign up for a six- or twelve-month print subscription will receive a FREE Raspberry Pi Pico 2 W!

The post OVCS: Raspberry Pi–powered electric car appeared first on Raspberry Pi.
