Sr. Content Developer at Microsoft, working remotely in PA, TechBash conference organizer, former Microsoft MVP, Husband, Dad and Geek.
154005 stories
·
33 followers

Xbox fans want exclusives, more backward compatibility, and free online multiplayer

1 Share
Xbox logo illustration

Microsoft launched a new Xbox Player Voice portal yesterday, aiming to collect feedback from fans and "make it more visible." It certainly hasn't taken long for Xbox fans to make their feedback very clear. The most upvoted feedback on Xbox Player Voice demands exclusive games for Xbox consoles, more backward compatible games, and free online multiplayer.

Xbox CEO Asha Sharma has already promised that she's "reevaluating" the approach to Xbox-exclusive games and windowed releases of titles, but there has been no firm commitment to reverse the decision to port games to PlayStation and Nintendo Switch. "Xbox was built off of great game exclusi …

Read the full story at The Verge.

Read the whole story
alvinashcraft
13 minutes ago
reply
Pennsylvania, USA
Share this story
Delete

Apple unveils new accessibility features, and updates with Apple Intelligence

1 Share
Apple announced major accessibility updates powered by Apple Intelligence, including new capabilities for VoiceOver, Magnifier, and Voice Control.

Read the whole story
alvinashcraft
13 minutes ago
reply
Pennsylvania, USA
Share this story
Delete

AI Artifact Catalogs: Durable Standards Worth Institutional Investment

1 Share

Companies everywhere are trying to leverage AI to boost internal productivity metrics. Some, like Ramp and Intercom, are succeeding. Many are failing.

To make matters more complicated, the narrative around what tooling enables these gains is constantly shifting. For software engineers, auto-complete via GitHub Copilot was the bleeding-edge tool of choice in 2024. Then it was Cursor for much of 2025. 2026 has been dominated by command-line-based coding agents like Claude Code and Codex.

While the tooling layer winds ebb and flow, many of them have come to share a number of common primitives: open standards that help configure and guide these tools’ capabilities.

Agent Skills. MCP. Plugins. These all present vendor-agnostic mechanisms by which we can configure the tools today. The catch: These mechanisms aren’t one-size-fits-all. How you can connect to an MCP server depends on your organization’s security posture. An Agent Skill crafted specifically for one team’s design system does not copy-paste well into that of another team.

As individuals within organizations begin to configure—and sometimes build from scratch—the skills and MCP servers that unlock real productivity gains, the next unlock is to translate those wins to shareable, reusable institutional knowledge. AI artifact catalogs are the output of this step. They represent the useful bits of internal knowledge and glue that connect much of what employees are doing manually today, over to empowering both:

  • Their peers. By sharing these artifacts within or across teams, productivity gains are shared across the organization, not in individual silos.
  • And their agents. Equipping agent runtimes like Claude Code or Codex with hard-won, domain-specific guidance means employees can spend more time building agentic systems and less time toiling on repeatable labor.

The durability of open standards

There is an ongoing industry-wide rush to buy AI-powered solutions in the hopes that a vendor can unlock these sought-after productivity gains. 95% of those pilot projects are failing.

Of course, there is a spectrum of risk when buying solutions like this from a vendor. If you go all-in on Anthropic’s tooling—like Intercom did with Claude Code—and Anthropic continues to be an industry leader, things will go well. Make the same decision with a startup’s offering that fails to get broad industry adoption, and you’re stuck with a proprietary data model that operates in a dead-end silo you have to rebuild from scratch in a year.

There’s another path: that of committing to open standards. If you invest in Agent Skills, in MCP, in plugins, not only will you be protected against a single vendor going belly-up, but you won’t even miss a beat when the leading coding agent that all your engineers demand next quarter changes, again. Switching costs drop to a fraction of what they’d be with a proprietary stack.

There’s no doubt that AI capabilities are evolving at a breakneck pace. It’s hard to predict what innovations the next cycle will bring. But what’s unique about these vendor-agnostic standardized primitives is that they are concepts upon which innovation can build, not replace. We’re all still building on top of HTTP that forms the fabric of the web. QWERTY keyboards are strictly inferior to Dvorak keyboards, and yet the standard prevails to this day. JavaScript is a much-maligned language, yet it underpins practically the entire frontend of the internet.

As AI rapidly reduces the cost of building, the cost of coordination among people and among entities remains high. Standards remain scarce and valuable.

AI artifacts and their relative maturity

The most important aspect of any standard is its level of adoption. It’s clear that the leading tooling empowering internal AI transformation is coalescing around coding agent tools like Claude Code and Codex, less-technical tooling like Claude Cowork, and rich agent SDKs like those from Anthropic or OpenAI.

Taking the compatibility of leading tools in those categories as indicators of standard adoption, here’s where I think the landscape of AI artifacts currently nets out:

StandardArtifactStatusAdoption
Agent SkillsSkillVendor-agnostic standardHighest
MCP serversmcp.json and Server CardVendor-agnostic standardHighest
PluginsPluginVendor-agnostic standardHigh
Command line interface (CLI) toolsCustomUnstandardizedHigh
HooksHookDerivative standard (Open Plugins)Medium
RootsGit repositoriesDerivative standard (AGENTS.md)Medium
RulesRuleDerivative standard (Open Plugins)Medium
Tool compatibility considered in “adoption” as of April 2026: Claude Code, Cowork, Codex, Cursor, GitHub Copilot, Gemini CLI, Pi, OpenCode, Amp, Claude Agents SDK, OpenAI Agents SDK

A minimalist catalog stored as a Git repository for a team might start off looking something like this:

A minimalist catalog stored as a Git repository

I work with software engineering teams early in their AI adoption journey, where they might have a few individual tinkerers leaning heavily into AI but haven’t yet figured out how to propagate adoption more widely. Out of the gate, my conversations with teams tend to run a gamut of disparate tool preferences, unique workflows, disjoint architectures, and other one-off quirks. A big unlock for moving these organizations forward is to introduce shared language. Shared language grounds conversations. It puts teams working on different AI-related initiatives on a path to smooth integration with each other. People get excited about how puzzle pieces might fit together.

Let’s review these artifacts in more detail.

Skills: The lifeblood of most institutional knowledge

As Tim O’Reilly wrote a few months ago, a skill can be “the integration of expert workflow logic that orchestrates when and how to use each tool, informed by domain knowledge that gives the LLM the judgment to make good decisions in context.”

This is not the only “type” of skill that currently exists out there. They can span a gamut of purposes; to name a few:

  • Encoding of internal, expert orchestration knowledge (as in the above)
  • Guidance on using otherwise deterministic tools (such as MCP servers or CLI tools)
  • Context management tricks that have broad appeal (to make up for LLM capability limitations)

But the first—the encoding of expert knowledge—is very much the most valuable and irreplaceable. Chances are, that variant of skill is not otherwise documented. It lives as tacit knowledge among your employees or is scattered across many systems so as to make any associated work a multistep journey.

The implication: Any skill you can download from the public internet is probably not nearly as valuable as an internal skill crafted by an employee. The latter skill is aware of your business context, the opinionated systems in play, and maybe encodes unique expertise hard-won over years of tenure. And most importantly: That level of insight is not making it into a model training run any time soon. Nor is it likely to be relevant to just about anyone outside of your own company. The same can’t be said for the latest skill repository on GitHub that acquires 10,000 stars. If that public skill is any good, the generic concepts will find their way into natural model and harness capabilities before long, eliminating the need for that class of skill.

Skills are extremely well-adopted; uncontroversially so by every major coding agent.

MCP and CLI tools: The connectivity layer to external systems

Most agents don’t operate in a vacuum: Interaction with external systems is how we compose AI. One agent can talk to another agent, or just some separate deterministic system, by way of MCP or a CLI tool.

The MCP versus CLI debate is well-documented, so we won’t rehash it here. Regardless of which of the two you implement (and perhaps you use both for different use cases), the point is that MCP/CLI is responsible for poking a hole into what is otherwise a local-only sandboxed environment for your agent.

This is the layer that juggles authentication—facilitating OAuth, injecting any relevant secrets—and exposes some well-defined surface area for what your agent could possibly do in communication with that external system (e.g., MCP tool definitions or CLI command options).

For MCP, you have well-established conventions and standards in the form of Server Cards and server.json files—to declare all the possible configurations of an MCP server—and also an upcoming standard called mcp.json to declare specific configurations of an MCP server (inspired by, among others, files like .mcp.json from Claude Code).

For CLI, cataloging it means rolling your own catalog format: probably covering metadata like “how to install this,” “what auth mechanisms does it support,” “where to store secrets,” and related concerns that are explicitly or implicitly captured in analogous mcp.json files.

MCP is very well-adopted and natively compatible with most agent frameworks. CLI works anywhere the agent comes with bash capabilities but can be fairly limited in a sandbox environment and doesn’t share the sort of configurability as MCP does otherwise.

Hooks: Inject capabilities at deterministic trigger points

Hooks are handy to inject sprinkles of determinism in an otherwise nondeterministic agentic session. Some effective uses I’ve seen: injecting a session transcript capture step for future review or capturing analytics on what skills are being invoked within a team.

Hooks don’t have their own standard but are baked into the upcoming Open Plugins standard. The concept is supported by most major coding agents, although implementations have some variance.

Rules: Context appended to rules

Originally popularized by Cursor, rules allow for injecting blurbs of context in largely deterministic, but sometimes nondeterministic, fashion.

Functionally, many rules could be modeled as skills and AGENTS.md files. Given the popularity of the latter, it’s unclear whether they will continue to remain relevant in the long run.

Roots: An agent’s starting point

Most agents “start” inside a particular location in a filesystem: a “root.” For coding agents, this means some folder within a Git repository. In some agents, such as Claude Cowork, this is equivalent to the notion of a “project.”

While not directly standardized, the notion of a root is implicit in the AGENTS.md standard, which assumes the presence of a filesystem that hosts static context for which the agent should operate upon.

Plugins: Bundles to bring it all together

Plugins are somewhat unique in the above list. Conceptually, they are a bundle of several of the other artifacts. A plugin can be thought of as a composition of skills, rules, hooks, MCP servers, and some other components. The up-and-coming Open Plugins initiative spearheaded by Vercel is working to finalize what this specification looks like.

They serve a natural purpose. Any team leaning into building skills and MCP servers will quickly get to a point where several skills and MCP servers will combine to form a practical grouping of guidance and capabilities. Claude Code’s implementation of plugin marketplaces is becoming a de facto distribution mechanism for plugins. It’s very much an option to catalog individual artifacts, and then use mechanisms like that to distribute them all as bundled within the plugin abstraction layer.

Some companies have fully leaned into this abstraction. For example, Intercom, rather than cataloging skills or hooks individually, just catalogs plugins—skills and hooks are fully inlined within them.

Most of the agentic tooling ecosystem is largely aligned on plugins, with Pi and OpenCode being notable holdouts.

Rich, practical catalogs are what can separate AI success stories from repeated false starts

Maybe you choose to go all-in on plugins and bundle your skills and MCP servers inline; maybe you build a granular catalog per artifact type. But whatever shape it takes, what matters is that your company is cataloging—and retaining ownership of—its way of working. And doing so in a way that maximizes potential compatibility with the frontier tooling that is yet to be invented.

It’s very immediately actionable for a company to start on this path. No new vendor relationship is needed, just an internal agreement to start storing artifacts in some company-wide Git repository. Encouraging sharing, moving past individual silos, celebrating wins—and eventually celebrating usage—of these artifacts. Every addition to that catalog is an opportunity for someone else to leverage an artifact someone else constructed, a chance to build on top of it, to collaborate or consolidate efforts.

If you’re part of a company building its first catalog, I’d like to hear from you. I work with a few companies in the early stages of this initiative, and I’ve been capturing early learnings around managing these catalogs in a very lightweight open source framework called AIR. If others are getting value out of leaning into these open standards as catalogs, we likely have an opportunity to collaborate across companies on some of the glue and minutiae that can operationalize the ideas here.

Ramp and Intercom aren’t winning because they picked the right tooling vendor. They’re winning because they’ve turned individual productivity into organizational capability. The tooling will keep rotating. Whether your company compounds alongside it is a choice worth making deliberately.



Read the whole story
alvinashcraft
14 minutes ago
reply
Pennsylvania, USA
Share this story
Delete

Introducing the JetBrains Course Creators Program

1 Share

Online programming education still has a major gap: students learn concepts through videos and browser-based exercises, but rarely get to code in the professional tools they’ll use in development jobs.

As AI changes how people learn programming and write code, practical developer skills are becoming even more important. Students need more than generated snippets – they need experience working in real development environments, understanding projects, debugging applications, and building software alongside AI tools. These are the skills they’ll be expected to use in internships and developer roles.

We’re launching the JetBrains Course Creators Program to help close that gap by bringing hands-on coding practice directly into JetBrains IDEs together with course creators and educators.

If you’re a programming course creator on platforms like Udemy, Coursera, LinkedIn Learning, Pluralsight, or your own website, you can now integrate hands-on practice directly into JetBrains IDEs. Your students won’t just watch and learn – they’ll actually code in professional tools used by developers worldwide.

What is the Course Creators Program?

A lot of online programming education still looks like this: 

watch videos → take quizzes → finish the course 

In most cases, hands-on practice happens in browsers or simplified environments. As a result, learners may understand concepts but feel lost when they open a real IDE for the first time. The Course Creators Program aims to close this gap and is designed for independent educators who want to take their courses beyond videos and quizzes. 

You don’t need to rebuild anything. We help you move the practical part of your course into JetBrains IDEs using the JetBrains Academy plugin, so learners can:

  • Write real code.
  • Run and debug programs.
  • Build skills in a professional development environment.

Already teaching on Coursera?

We already support direct integration between Coursera courses and JetBrains IDEs. Students can open projects in the IDE with a single click – no additional setup required.

To help creators get started, we’ve also prepared a step-by-step integration guide. This guide explains how to connect your Coursera course with JetBrains IDEs using the Apps (LTI) feature – from creating the App to publishing coding exercises that sync progress automatically.

Who can participate?

You can join the program if you:

  • Publish programming courses on Udemy, Coursera, LinkedIn Learning, Pluralsight, edX, or a similar platform.
  • Run your own educational platform.
  • Teach programming, software development, or related technical topics.

What are the benefits of joining?

This program is more than just a technical integration. Our goal is to help course creators build a stronger hands-on learning experience while growing their audience and professional presence alongside JetBrains.

As part of the program, creators receive product access, technical guidance, promotional support, and collaboration opportunities designed to help bring professional coding workflows into online education.

How to apply

Getting started is simple:

  1. Apply to the program
    Tell us about your course, audience, and the technologies you teach.
  2. Integrate your course into JetBrains IDEs
    Our team will help you bring the practical part of your course into the IDE.
    No need to rebuild your entire course – just enhance the hands-on experience.
  3. Start teaching with real coding exercises
    Your students continue learning on your platform while practicing in professional development tools.

Most creators complete integration within 2–4 weeks.

Not ready for full integration? Let’s talk anyway

We know that integrating a course into JetBrains IDEs is a meaningful commitment. If you’re not there yet but still want to bring JetBrains tools to your students, we’re open to other forms of collaboration, too.

For example:

  • Point your students to free JetBrains IDEs

Several JetBrains IDEs are available for free for non-commercial use – no student verification or application required. You can recommend them to your learners right away so they can follow your course in a professional IDE instead of a browser-based editor.

  • Get educational license coupons for your students

If your course uses IDEs not covered by the free tier, we can provide educational license coupons for IntelliJ IDEA, PyCharm, and other JetBrains tools.

  • Feature JetBrains tools in your content

If you already teach using JetBrains IDEs in your videos or course materials, we’d love to support that with tools, visibility, and resources.

For more information, reach out to us at education@jetbrains.com.

We believe programming education should feel closer to real development from day one.

Join the JetBrains Course Creators Program and help your students practice with the tools they’ll use in future jobs.

Your JetBrains Academy team

Read the whole story
alvinashcraft
14 minutes ago
reply
Pennsylvania, USA
Share this story
Delete

LLM Evaluation and AI Observability for Agent Monitoring

1 Share

This is a guest post from Naa Ashiorkor, a data scientist and tech community builder.

Artificial intelligence keeps evolving at a rapid pace. The latest major application of AI, specifically of LLMs, is AI agents. These are systems that use their perception of their environment, processes, and input to take action to achieve specific goals, and they are built on LLMs. 

Increasingly, complex AI agents are being used in real-world applications. While simpler agentic applications that use only one agent to achieve a goal still exist, organizations are now shifting towards multi-agent systems that use multiple subagents coordinated by a main agent. These are more adaptable and can mimic human teams when it comes to performing specialized tasks such as data analysis, compliance, customer support, and more. The reasoning and autonomy of AI agents have improved; consequently, they can gather data, conduct cross-references, and generate analysis.

As we move towards these complex, real-world applications of agents, an ever-stronger spotlight is being shone both on how we observe AI agents and how we evaluate the LLMs they’re built upon. The complexity, interactions, and autonomous processes under the surface of AI agents make rigorous monitoring and assessment an essential part of building and maintaining these applications. LLM evaluation determines if the AI agent can work, while AI agent observability determines if it is working. LLM evaluation tests an agent’s basic capabilities before and during deployment, while agent observability provides deep, real-time visibility into an agent’s internal reasoning and operational health once it is live. It is pretty obvious that having just one of these is a loss and a formula for failure. 

In this blog post, we’ll explore how to evaluate agents using advanced metrics and observability tools. It’s designed as a practical, end-to-end reference for teams that want to move beyond demos and actually run AI agents in live, real-world environments, avoiding the common pitfalls that cause failure in production.

Core LLM evaluation metrics for modern AI systems

As LLMs are now applied to a wide range of use cases, it is important that their evaluation covers both the tasks they may perform and their potential risks. Evaluation metrics give a better understanding of the strengths and weaknesses of LLMs, influence the guidance of human-LLM interactions, and highlight the importance of ensuring LLM safety and reliability. Hence, LLM evaluation metrics for assessing the performance of an LLM are indispensable in modern AI systems. Without well-defined evaluation metrics, assessing model quality becomes subjective. 

There are several key evaluation metrics, each with a different purpose, and the table below provides a summary of some of them.

Evaluation MetricWhat the metric evaluates
Hallucination rateFactual accuracy and truthfulness of generated content
Toxicity scoresHarmful, offensive, or inappropriate content
RAGAS (Retrieval Augmented Generation Assessment)Measures whether the RAG system retrieves the right documents and generates answers that are faithful to those sources
DeepEvalTests everything from basic accuracy and safety to complex agent behaviors and security vulnerabilities across the entire LLM application

Hallucination rate

Hallucinations in LLMs produce outputs that seem convincing yet are factually unsupported and can be categorized as either intrinsic, where the output contradicts the source content, or extrinsic, where it simply cannot be verified. They can stem from a range of factors across data, training, and inference, from quality issues in the large datasets used for initial training and the data used to fine-tune model behavior to post-training techniques that make models overly eager to provide responses to imperfect decoding strategies at inference. Because hallucination is an unsolved challenge cutting across every stage of model development, measuring and assessing it remains a vital part of LLM evaluation.

There is a wide variety of techniques for detecting hallucinations. These include: 

  • Fact-checking: Extracting independent factual statements from the model’s outputs (fact extraction) and then verifying these against trusted knowledge sources (fact verification).
  • Uncertainty estimation: Using the certainty provided in the model’s internal state to estimate how likely a piece of factual content is to be a hallucination.
  • Faithfulness hallucination detection: Ensures the faithfulness of LLMs to provide context or user instructions. 

There are several metrics for hallucination detection. Some of the most commonly used metrics include:

  • Fact-based metrics: Assessing faithfulness by measuring the overlap of facts between the generated content and the source content. 
  • Classifier-based metrics: Utilizing trained classifiers to distinguish between the level of entailment between the generated content and the source content. 
  • QA-based metrics: Using question-answering systems to validate the consistency of information between the source content and the generated content. 
  • Uncertainty-based metrics: Assessing faithfulness by measuring the model’s confidence in its generated outputs. 
  • LLM-based metrics: Using LLMs as evaluators to assess the faithfulness of generated content through specific prompting strategies. 

PyCharm’s Hugging Face integration lets you discover evaluation models and datasets without leaving the IDE. Use the Insert HF Model feature to search for hallucination or toxicity classifiers, and hover over any model or dataset name in your code to instantly preview its model card, including training data, intended use, and limitations. This means you can import a dataset, evaluate your LLM, and verify the tools you’re using, all from one place.

PyCharm's Hugging Face integration
Opening the Hugging Face model browser in PyCharm from the Code menu, then selecting Insert HF Model.
PyCharm's "Insert HF Model" feature
Searching for a specific hallucination model and selecting one. Use Model inserts a ready-to-use code snippet into the editor.
PyCharm's "Use Model" feature
A ready-to-use code snippet of the Vectara hallucination evaluation model is inserted into the editor.
Vectara hallucination evaluation model
Hovering over the Vectara hallucination evaluation model in the code to preview its model card within PyCharm.

Trust is imperative in the acceptance and adoption of technology. Trust in AI is especially important in areas such as healthcare, finance, personal assistance, autonomous vehicles, and others. Hallucinations have a huge impact on users’ trust in LLMs.

In 2023, a story went viral about a Manhattan lawyer who submitted a legal brief largely generated by ChatGPT. The judge quickly noticed how different it was from a human-written submission, revealing clear signs of hallucination. Incidents like this highlight the real-world risks of LLM errors and their impact on user trust. As people encounter more examples of hallucination, skepticism around LLM reliability continues to grow.

Toxicity scores

LLMs that have been pretrained on large datasets from the web have the tendency to generate harmful, offensive, and disrespectful content as well as toxic language, such as hate speech, harassment, threats, and biased language, which have a negative impact on their safe deployment. Toxicity detection is the process of identifying and flagging toxic content by integrating open-source tools or APIs into the LLM workflow to analyze both the user input and the LLM output. Some of the available toxicity tools include the OpenAI Moderation API, which is free, works with any text, and has a quick implementation. Perspective API by Google is also widely used with a transparent methodology, but will no longer be in service after 2026. Detoxify, which is open source, has no API costs, and is Python-friendly, and Azure AI Content Safety by Microsoft, which is customizable and best for enterprise deployments and existing Azure users. Hugging Face Toxicity Models have many model options and easy integration with Transformers.

Toxicity detection has become a guardrail; hence, it is important in public-facing applications. They prevent toxic content from reaching users, which protects both individuals and organizations. In public-facing applications, toxicity detection operates by input filtering, output monitoring, and real-time scoring. This prevents attacks where users intentionally train AI to produce toxic content through coordinated toxic inputs; toxic content will never reach the user, even if produced by the underlying AI, so systems can adjust their behavior dynamically based on conversation content and escalating risks. Unguarded AI can be exploited, which leads to reputational damage. 

For toxicity evaluation, PyCharm’s Hugging Face Insert HF Model feature helps you discover classifiers like s-nlp/roberta_toxicity_classifier directly in the IDE. Hovering over the model name reveals its model card, where you can see it was trained on the Jigsaw toxic comment datasets, helping you understand what the model can and can’t detect before you write a single line of evaluation code. 

PyCharm's Hugging Face Insert HF Model feature
Opening the Hugging Face model browser in PyCharm from the Code menu, then selecting the Insert HF Model.
PyCharm's "Use Hugging Face Model"
Searching for a specific toxicity model and selecting one. Use Model inserts a ready-to-use code snippet into the editor.
A ready-to-use code snippet of the roberta_toxicity_classifier is inserted into the editor.
Hovering over the roberta_toxicity_classifier in the code to preview its model card within PyCharm.

Frameworks for LLM evaluation

Frameworks for LLM evaluation have changed the game; teams don’t have to rely on manual reviews, gut instinct, and subjective judgment to assess model quality. These frameworks automate the measurement of model quality using standardized, quantifiable metrics. They assign numerical scores to outputs that measure faithfulness, relevancy, toxicity, and other important dimensions. This automation results in reproducibility, speed, and objectivity. 

Consequently, the same input always produces the same score; evaluation runs 10–100 times faster, so in minutes instead of days; and there are no more debates on the quality of the output. Some of these frameworks include DeepEval and Retrieval Augmented Generation Assessment (Ragas). DeepEval is an open-source evaluation framework built with seven principles in mind, such as the ability to easily “unit test” LLM outputs in a similar way to Pytest and plug in and use over 50 LLM-evaluated metrics, most of which are backed by research and all of which are multimodal. 

It is extremely easy to build and iterate on LLM applications with two modes of evaluation, namely, end-to-end LLM evals and component-level LLM evals. It is used for comprehensive testing across RAG, agents, and chatbots. Ragas is a framework for reference-free evaluation of RAG pipelines. There are several dimensions to consider, such as the ability of the retrieval system to identify relevant and focused context passages, as well as the capability of the LLM to exploit such passages in a faithful way; hence, it is challenging to evaluate RAG systems. Ragas provides a suite of metrics for evaluating these dimensions without relying on ground-truth human annotations. 

The limits of static prompt evaluation

Traditional LLM evaluation methods are useful for single prompt-response pairs, measuring output quality, RAG systems with straightforward retrieval, and static evaluation with fixed inputs. But they are limited for multi-step agents because LLM evaluation focuses on the final output quality, not the decision-making process that produced it. Multi-step agents exhibit a different kind of complexity, as they chain multiple decisions.

Why traditional LLM evaluation isn’t enough for agents 

Agents operate independently within complex workflows, and this independence can introduce challenges such as deviation from expected behavior, errors in production, and more failure points than in traditional software applications. Hence, an agent can perform well in testing but fail in production. Traditional LLM evaluations don’t have the capacity to test such use cases. Testing is usually done in a controlled environment with limited scenarios, but production involves real users, edge cases, unpredictable inputs, and scale. This means that agents can make decisions that are not seen in testing, and in production, tasks could be completed, though incorrectly, without generating an error signal. This is where advanced evaluation and monitoring practices come to the rescue! They provide the visibility and systematic measurement needed to deploy agents confidently, rather than relying on trial and error.

The complexity of agent behavior

Traditional LLM evaluation measures single prompt-response pairs: provide an input prompt, receive an output response, and measure quality through metrics such as accuracy, relevance, and faithfulness. Due to the complexity and non-deterministic, multi-step reasoning of AI agents, they cannot be reliably evaluated using traditional evaluation metrics.

Agent behavior is complex, and this complexity introduces challenges. Agents operate in dynamic environments where APIs might be down, databases change between queries, and the “right” answer depends on current conditions. They can use external tools and APIs to complete tasks, and may either use the wrong tool or use the right tool with the wrong parameters or input type. Their internal reasoning traces remain hidden unless they are logged explicitly, so it might be challenging to determine whether an agent was successful through logic or chance. An agent’s output could be perfectly correct despite poor internal decisions, or the entire task could fail despite correct step execution.

This is where observability tooling becomes essential. PyCharm’s AI Agents Debugger breaks open the black box of agentic systems, letting you trace LangGraph workflows and inspect each agent node’s inputs, outputs, and reasoning directly in the IDE, with zero extra code. Just install the plugin, run your agent, and the debugger automatically captures execution traces. Click the Graph button to visualize the full workflow, making it easy to spot where an agent chose the wrong tool, passed bad parameters, or succeeded by luck rather than logic.

To see this in action, I built a simple travel-planning agent using LangGraph in two steps: a research node that suggests summer destinations based on my preferences, and a plan node that picks the best option and builds a three-day itinerary. With the AI Agents Debugger, you can trace exactly what information flowed between these two steps – what the research node suggested and how the planner used those suggestions to build the final itinerary.

The AI Agents Debugger shows how the agent moves from initialization to the research stage, displaying the data passed in and out, and the LLM call used to generate the research results.
The AI Agents Debugger shows how the planning step processes inputs and produces outputs, using an LLM call to construct the final travel itinerary.
The Graph viewprovides a high-level overview of the agent’s workflow, mapping how it progresses from the initial step through research and planning to the final result.

Advanced agent evaluation metrics

The complexity of AI agents demands evaluation that goes beyond considering the final output quality, that is, measuring whether it is accurate, relevant, and grounded. Specialized agent evaluation assesses the complete decision-making process, including the planning logic, tool selection, parameter construction, reasoning coherence, and resource efficiency that led to the final output. Hence, the advanced agent evaluation metrics are designed to make such a process visible and measurable. Some of them are task completion rate, tool usage, reasoning quality, efficiency, and error handling.

Task completion rate

Task completion rate measures the percentage of tasks where an agent successfully achieves the end goal. This is calculated as the number of completed tasks divided by the total number of tasks attempted. The context of “completed” differs by use case. There are real-world use cases for task completion rate. Let’s start with a basic use case. Consider a customer service agent handling a specific food delivery order: “Where is my order #0001? It has not been delivered to me.” Completion rate means successfully looking up the order ID, retrieving the tracking information, and providing an accurate delivery estimate, so all three steps must succeed. If the agent retrieves the wrong order or fails to assess the tracking system, that is a failed task, even if it produces the same output. 

Next, let us look at a medium-complexity use case, sequential API calls. Consider an agent tasked with creating a Jira support ticket and notifying the relevant team in Slack. The agent calls the Jira API to create a ticket, parses the response to get the ticket ID, calls the Slack API with the ticket link, and finally verifies the success of both. If the agent successfully creates the Jira ticket, but the Slack notification fails, that is considered a failed task even if the ticket exists in Jira, since the team wasn’t notified. 

Finally, let’s examine a high-complexity use case: An agent is given the task of completing an online purchase, which means it must handle everything from checkout to order confirmation. Six steps are involved: Verify the item is still in stock, process the payment with a credit or debit card, reserve or decrement inventory, create an order record, generate an order confirmation number, and send a confirmation email to the customer. If the agent successfully charges the customer’s card but the confirmation email fails to send, that’s a failed task, even if the payment was processed and the order was created. In such a situation, the customer has no proof of purchase, so they will likely contact support or attempt to purchase again.

Tool usage correctness

Tool usage correctness assesses whether an agent correctly identifies and invokes the relevant tools and APIs. It is a deterministic measure that is assessed using techniques such as LLM as a judge, like most LLM evaluation metrics. It has three dimensions: 

  • Did the agent choose the right tool for the task (tool selection)? 
  • Were the parameters constructed correctly (input parameters)? 
  • Did the agent properly use the tool results (output handling)? 

Hence, it is important for reliability and functional correctness. 

Step-by-step reasoning accuracy

In real-world use cases, an LLM agent’s reasoning is shaped by much more than just the model itself. Modern frameworks such as LangChain expose the agent’s internal “thoughts” through structured logging of intermediate reasoning steps. This is done using the ReAct (Reasoning and Acting) pattern, which involves the agent thinking about what to do, using a tool, observing the tool result, and then repeating until the task is complete. Each “thought” is logged as text, which creates a complete trace of the reasoning process from initial query to final answer. These traces can be extracted programmatically and evaluated to assess whether the agent’s logic is sound even when the final output appears correct. Evaluating planning steps involves assessing aspects such as the overall approach’s logic, the ordering of steps, and whether any steps are unnecessary or redundant. Evaluating execution assesses whether the implementation worked, such as whether tools were called with correct parameters, whether each step was completed successfully, whether errors were handled appropriately, and whether the output was interpreted correctly. This can be done seamlessly in PyCharm using the AI Agents Debugger.

Groundedness (faithfulness)

Groundedness, also known as faithfulness, is the most critical metric for retrieval-augmented generation (RAG), which is a common component of agentic applications. It assesses whether the agent’s response is actually supported by the retrieved source documents or whether, instead, the model hallucinated information. Different evaluation techniques include:

  • Atomic claim verification: Breaks up the response into atomic claims and checks each claim against the retrieved context. It is slow but best for production RAG and thorough evaluation. 
  • Semantic similarity: Compares the embeddings of the response and source documents. It is fast, so it is best for quick checks and first-pass filtering. 
  • LLM-as-Judge: works by prompting the LLM to score groundedness by extracting factual statements from the response and then checking each statement against the retrieved context. It offers medium speed and is best for flexible, custom criteria. 

AI observability and why it matters

AI observability is about visibility into what the agent is doing. This covers recording everything that happens when a task is executed, including the agent’s reasoning at each step, which tools were called with what parameters, what data was retrieved, and how decisions were made from start to finish. With such a transparent system where every decision can be logged and traced, teams are able to understand why an agent fails, behaves unexpectedly, or becomes expensive to run because issues can be debugged and behavior can be audited. Consequently, system design improves, and guesswork is eliminated.

Definition of AI observability

AI observability is the real-time monitoring of agent actions, thoughts, and environmental interactions: what went in, what came out, how the agent thought through the problem, and which tools, APIs, and data were used. AI observability builds on the three pillars of DevOps observability – that is, metrics, logs, and traces – but extends each one for AI’s unique needs. DevOps metrics track CPU and latency, while AI metrics track token usage and cost per interaction. DevOps logs capture system errors, while AI logs capture reasoning traces and decision points. DevOps traces follow requests through services, while AI traces follow reasoning through agent steps, tool calls, and observations.

Benefits for agent monitoring

Agent monitoring has immense benefits – here are some of the most important:

  • It debugs reasoning errors: When an agent fails or gives an unexpected output, monitoring provides a complete trace of its decision-making process, which shows exactly where the logic broke down. Hence, there is no need to spend hours guessing the causes.
  • It measures performance and latency over time: Since metrics such as average latency, token usage, cost per interaction, and completion rates across all queries are tracked, degradation patterns can be identified before they affect users. As a result, performance issues can be identified and resolved before users file any complaints. 
  • It identifies regressions after model or prompt updates: Baseline metrics such as completion rate, faithfulness scores, latency, and cost are established and then monitored for deviations after deployments. If a new prompt drops the compilation rate or a model update increases the hallucination rate, automated alerts catch it immediately. Hence, issues are caught before users are affected.

Popular tools for agent monitoring

Several frameworks and platforms have emerged to provide built-in observability for AI agents, with each having different strengths and integration approaches and matching different features and requirements. The choice of the right tool depends on the framework, deployment preferences, and primary needs. The table below shows some popular tools and whether they match different features and requirements.

ToolTraces agent steps?Tracks costs?Detects regressions? Self-hostable?Open source?Easy integration?
HeliconeYesYesYesYesYesYes
LangSmithYesYesYesLimitedNoYes
LangFuseYesYesYesYesYesModerate
OpenLLMetryYesLimitedLimitedYesYesModerate
PhoenixYesLimitedYesYesYesModerate
TruLensYesLimitedYesYesYesModerate
DataDogLimitedYesYesNoNoModerate

Best practices for evaluating agents in production

Evaluation does not end after deployment; rather, it is intensified. This continuous evaluation tracks how much the system costs to run, how quickly it responds under various loads, and how it handles errors or unusual inputs. Without such evaluation, problems can only be identified after the users are affected. An agent can pass all the quality checks with excellent faithfulness scores, high completion rates, and strong reasoning but fail in production if costs spiral, latency increases, or edge cases cause instability. Hence, there is a critical need for ongoing evaluation and monitoring, which will lead to systems that are reliable, scalable, and financially sustainable.

Monitor cost and latency

Monitoring cost and latency is critical for production sustainability. Token usage and response time must be tracked continuously because small inefficiencies compound dramatically over time, and the cost per token of the powerful reasoning models used for agents can be high. Production workloads require cost and latency monitoring to identify problems before user experience and budget are impacted. Cost monitoring tracks token usage at different levels, such as per request, per query type, and over time. Without visibility into patterns generated by these, teams end up discovering cost problems through surprise bills. With monitoring, they can proactively cache common queries and optimize prompts to reduce token use. Latency monitoring reveals track response time and component breakdowns to identify bottlenecks.

Cost control in production workloads is important because production costs can spiral quickly, unmonitored systems can exceed budgets, and latency impacts user experience and retention.

Combine offline and online evaluation

Effective agent evaluation requires combining offline and online evaluation, where each addresses gaps the other leaves. Offline evaluation uses fixed test databases for reproducible benchmarking, which enables fast iteration on prompts and models in controlled environments without production risk. Online evaluation monitors real user interactions in production, which reveals edge cases in testing that were never expected, so it is useful for real-time feedback, user data, and observability tools. A combination of both results in an optimal strategy where offline evaluation validates changes before deployment, then online evaluation monitors production reality. 

Use human-in-the-loop when necessary

LLM agents are appreciated for how they have played a positive role in the different ecosystems, but not every agent should run autonomously since they can misinterpret prompts, cross boundaries, or make dreadful errors that can’t be caught by automation alone. Hence, the need for human-in-the-loop failsafes. Human-in-the-loop is also essential during initial setup: Unless teams already have domain-specific evaluation datasets for monitoring the agent, these will need to be created manually by assessing the agent’s performance. A hybrid approach is required when critical decisions require human validation, such as approving transactions, modifying sensitive data, or triggering irreversible workflows. In this approach, it is important that decisions are routed through a human checkpoint before proceeding. The intention is not to slow automation but rather to ensure that the right decisions involve the right oversight. A well-designed human-in-the-loop system delivers compound returns over time. Every human correction becomes feedback, which improves the agent’s accuracy and gradually reduces the need for manual review. Human oversight isn’t treated as a failure but rather as a safety net that makes the system better with use.

Final thoughts

Fundamentally, AI agents are different from single-prompt LLMs. They navigate multi-step workflows, make autonomous decisions, and use external tools, which introduces complexities that demand continuous evaluation, not just static testing. Evaluation must evolve from pre-deployment checkpoints to ongoing monitoring. Production-ready agents aren’t just well-tested; they’re continuously observed and improved based on real behavior. LLM evaluation and AI observability enable faster, safer iteration by catching issues early and feeding production insights back into development.

PyCharm streamlines agent development with integrated debugging, profiling, and testing. Step through reasoning with breakpoints, find cost bottlenecks, and iterate on evaluation tests rapidly. These workflows transform hours of debugging into minutes of systematic investigation. Explore PyCharm for AI development to see how integrated tools can help you build, evaluate, and deploy reliable AI agents.

About the author

Naa Ashiorkor

Naa Ashiorkor is a data scientist and tech community builder. She is deeply involved in the Python community and serves as an organizer for various conferences, including EuroPython. She is currently building PyLadies Tampere.

Read the whole story
alvinashcraft
14 minutes ago
reply
Pennsylvania, USA
Share this story
Delete

Rust Tooling Made My JavaScript Site 53x Faster

1 Share
A real-world Next.js 16 migration shows how Bun, Oxlint, Biome, tsgo, and Turbopack changed build speed, linting, installs, and disk usage.
Read the whole story
alvinashcraft
14 minutes ago
reply
Pennsylvania, USA
Share this story
Delete
Next Page of Stories