Sr. Content Developer at Microsoft, working remotely in PA, TechBash conference organizer, former Microsoft MVP, Husband, Dad and Geek.

When efficiency isn’t wasteful


Efficiency means doing the same thing with fewer resources. In software, efficient code is faster: doing the same work takes less compute. If some other software is waiting on yours to return, then your speed is saving upstream compute. Efficient data uses less space, which means less network bandwidth to move it around.

All this is very good, but getting that speed or compactness took effort. Someone got that process to work, and then someone wrote telemetry or tests to measure it, and then someone optimized code to get it there. Efficiency is expensive.

Therefore, efficient software is not the same as an efficient company, unless software efficiency is important to the company’s business model.

Cat Swetel clarified this in her talk at Agile 2025: understand why your company cares about efficiency.

  • Is it part of the business model?
  • Trying to ride out hard times?
  • Or something else?

“Efficiency can be about hoarding or efficiency can be about access.” – Cat Swetel

Cat works at Nubank. Nubank serves millions of people who aren’t profitable for older banks, so older banks won’t (can’t) serve them. These are people with high transaction volumes and low balances. Nubank can only do this with super efficient transaction processing. It’s part of their business model.

Nubank is entirely on AWS, so they’re really good at optimizing cloud costs. For instance, they use a lot of Spot instances: cheaper than on-demand instances, but the machine can be reclaimed with 2 minutes’ notice.

With Spot, you pay less, but your infrastructure is less stable. The software has to compensate: services are quick to start up and shut down, and every part of the system handles ephemerality. Optimize for fault tolerance, not the happy path.
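To make “handles ephemerality” concrete, here is a minimal sketch (not Nubank’s actual code) of a worker that stops pulling new work and drains what is in flight when the host signals shutdown. It only assumes a SIGTERM arrives when the instance is reclaimed; on AWS you would typically also watch the Spot interruption notice in instance metadata.

```python
import signal
import time

# Hypothetical sketch: a worker loop that tolerates its host being reclaimed.
shutting_down = False

def request_shutdown(signum, frame):
    """Flip a flag so the loop stops taking new work but finishes in-flight work."""
    global shutting_down
    shutting_down = True

signal.signal(signal.SIGTERM, request_shutdown)

def fetch_next_job():
    return None   # placeholder: pull a unit of work from a queue

def process(job):
    pass          # placeholder: do the actual work

def worker_loop():
    while not shutting_down:
        job = fetch_next_job()
        if job is None:
            time.sleep(1)      # idle politely
            continue
        process(job)           # each job is small enough to finish inside the notice window
    # On exit, unfinished queue items get picked up by workers on other instances.

if __name__ == "__main__":
    worker_loop()
```

The point is the design pressure, not the specifics: keep startup cheap, keep units of work small, and assume any instance can vanish.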

For Spot instances, Nubank does the upfront and ongoing work to cope with the instability and saves money on infrastructure. In another case, paying more for infrastructure saved money.

Their database, Datomic, has a local cache and an external cache. When transaction volumes shot up, the local cache ran out of space. Fetches from the external cache increased latency, slowing down all the services that depend on it. Some of their Spot instances happened to have an SSD, and they noticed that using that SSD to expand the local cache saved time (and therefore money) across the entire transaction flow. The numbers were striking: spending $1 on SSD saved $3,500 across the whole flow!
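As an illustration of why the SSD tier pays off (a generic tiered-read sketch, not how Datomic’s caches actually work; the store names are hypothetical), the read path only pays the network round trip when both local tiers miss:

```python
# Hypothetical tiered read path: memory -> local SSD cache -> external cache.
memory_cache: dict = {}     # fastest, smallest
ssd_cache: dict = {}        # stand-in for a local SSD-backed store: bigger, still local
external_cache: dict = {}   # stand-in for the remote cache: a network hop away

def read(key):
    if key in memory_cache:
        return memory_cache[key]
    if key in ssd_cache:                 # local disk beats a network round trip
        memory_cache[key] = ssd_cache[key]
        return memory_cache[key]
    value = external_cache.get(key)      # slowest path; this is the latency that hurt
    if value is not None:
        ssd_cache[key] = value           # a bigger local tier means fewer remote trips
        memory_cache[key] = value
    return value
```

Expanding the middle tier shifts traffic off the slowest path, which is why the saving shows up across the whole flow rather than on the database nodes’ own bill.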

If each team minimized cost, then the database nodes wouldn’t add that SSD. Efficiency in the system is bigger than efficiency in the components. Nubank measures the cost of a flow, not a service. (They use honeycomb.io for this!)

Nubank learned: the Cost <-> Stability tradeoff isn’t. Instability is expensive! When fluctuations lead to scale-up, scale-down thrashing, efficiency is defeated. Building stable software on unstable infrastructure is expensive in development work, but the benefits of stability scale with their expanding demand.

Investing in software efficiency lets Nubank profit from customers that cost other banks money. They continually drive down cost per customer served, so they can serve more people. This is efficiency as business model.

If compute efficiency is not part of your business model, then the investment in stability it takes to run on Spot instances likely won’t pay off. Instead, it could be a distraction from your core business. Does your product excel in user experience? In the complex legal knowledge built in? In traceability? In charm? Focus on these, until cost becomes a limiting factor. When you focus on efficiency, know why.


Hundreds of Google AI Workers Were Fired Amid Fight Over Working Conditions

Last week the Guardian reported on "thousands of AI workers contracted for Google through Japanese conglomerate Hitachi's GlobalLogic to rate and moderate the output of Google's AI products, including its flagship chatbot Gemini... and its summaries of search results, AI Overviews."

"AI isn't magic; it's a pyramid scheme of human labor," said Adio Dinika, a researcher at the Distributed AI Research Institute based in Bremen, Germany. "These raters are the middle rung: invisible, essential and expendable...."

Ten of Google's AI trainers the Guardian spoke to said they have grown disillusioned with their jobs because they work in siloes, face tighter and tighter deadlines, and feel they are putting out a product that's not safe for users... In May 2023, a contract worker for Appen submitted a letter to the US Congress warning that the pace imposed on him and others would make Google Bard, Gemini's predecessor, a "faulty" and "dangerous" product.

This week Google laid off 200 of those moderating contractors, reports Wired. "These workers, who often are hired because of their specialist knowledge, had to have either a master's or a PhD to join the super rater program, and typically include writers, teachers, and people from creative fields."

Workers still at the company claim they are increasingly concerned that they are being set up to replace themselves. According to internal documents viewed by WIRED, GlobalLogic seems to be using these human raters to train the Google AI system that could automatically rate the responses, with the aim of replacing them with AI. At the same time, the company is also finding ways to get rid of current employees as it continues to hire new workers. In July, GlobalLogic made it mandatory for its workers in Austin, Texas, to return to office, according to a notice seen by WIRED...

Some contractors attempted to unionize earlier this year but claim those efforts were quashed. Now they allege that the company has retaliated against them. Two workers have filed a complaint with the National Labor Relations Board, alleging they were unfairly fired: one for bringing up wage transparency issues, the other for advocating for himself and his coworkers.

"These individuals are employees of GlobalLogic or their subcontractors, not Alphabet," Courtenay Mencini, a Google spokesperson, said in a statement...

"Globally, other AI contract workers are fighting back and organizing for better treatment and pay," the article points out, noting that content moderators from around the world facing similar issues formed the Global Trade Union Alliance of Content Moderators, which includes workers from Kenya, Turkey, and Colombia.

Thanks to long-time Slashdot reader mspohr for sharing the news.

Read more of this story at Slashdot.


Building Resilient Email Delivery Systems: SendGrid vs Azure Communication Services with Polly in .NET

1 The High Stakes of Email Delivery

Email remains the backbone of digital communication for critical workflows: password resets, payment confirmations, fraud alerts, onboarding sequences, and service notifications. When you click "Forgot Password," you don’t think about queues, retries, or SMTP relays - you expect an email in seconds. But for the engineers behind the curtain, ensuring that message reliably reaches the inbox is anything but trivial.

In this section, we’ll zoom out from code and APIs to understand why building resilient email delivery systems is not just a technical exercise but a business-critical mandate. Then, we’ll frame the architectural blueprint that guides every design choice in the sections that follow.

1.1 Introduction: Beyond "Fire and Forget"

Too many teams still treat email as a “fire and forget” action: make an API call to SendGrid or Azure Communication Services (ACS), and assume the job is done. This mindset works for hobby projects and proof-of...
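The excerpt stops at the motivation, but the pattern it argues for is easy to sketch. The article builds this with Polly in .NET; below is a hypothetical, language-agnostic illustration in Python of the same retry-with-exponential-backoff idea (the provider call, exception type, and parameters are all made up for illustration):

```python
import random
import time

class TransientSendError(Exception):
    """Hypothetical stand-in for a provider error worth retrying."""

def send_via_provider(message: dict) -> str:
    # Placeholder for a SendGrid or Azure Communication Services API call.
    raise TransientSendError("simulated transient failure")

def send_with_retries(message: dict, max_attempts: int = 4, base_delay: float = 0.5) -> str:
    """Retry transient failures with exponential backoff plus a little jitter."""
    for attempt in range(1, max_attempts + 1):
        try:
            return send_via_provider(message)
        except TransientSendError:
            if attempt == max_attempts:
                raise              # out of attempts: surface the failure, e.g. to a dead-letter queue
            delay = base_delay * 2 ** (attempt - 1) + random.uniform(0, 0.1)
            time.sleep(delay)      # back off before trying again
```

A real policy would typically also compose timeouts, circuit breakers, and fallbacks, which is what a library like Polly provides out of the box.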

Trump’s H-1B visa fee isn’t just about immigration, it’s about fealty


Donald Trump has never made his distaste for immigrants a secret. It's been a cornerstone of his political movement since he descended that escalator on June 16th, 2015 and started hurling racist vitriol in the general direction of Mexico and Mexican Americans. On the surface, his assault on the H-1B visa program seems like part of the White House's ongoing campaign to reduce the number of immigrants in the country. It might have that effect, but the biggest goal for Trump may not be forcing companies to hire more Americans or cutting down on the number of workers from India moving to the US. It's giving the government more leverage over his …

Read the full story at The Verge.


Random.Code() - Resurrecting EmitDebugging Using Interceptors, Part 3

From: Jason Bock
Duration: 1:24:25
Views: 43

I keep working on an interceptor for System.Reflection.Emit. Let's see if I can make any new progress.

https://github.com/JasonBock/EmitDebugging/issues/6


Agentic AI research methodology - part 2


This post continues our series (previous post) on agentic AI research methodology. Building on our previous discussion on AI system design, we now shift focus to an evaluation-first perspective.

Tamara Gaidar, Data Scientist, Defender for Cloud Research
Fady Copty, Principal Researcher, Defender for Cloud Research

TL;DR:

Evaluation-First Approach to Agentic AI Systems

This blog post advocates for evaluation as the core value of any AI product. As generic models grow more capable, their alignment with specific business goals remains a challenge - making robust evaluation essential for trust, safety, and impact.

This post presents a comprehensive framework for evaluating agentic AI systems, starting from business goals and responsible AI principles to detailed performance assessments. It emphasizes using synthetic and real-world data, diverse evaluation methods, and coverage metrics to build a repeatable, risk-aware process that highlights system-specific value.

 

Why evaluate at all?

While issues like hallucinations in AI systems are widely recognized, we propose a broader and more strategic perspective:

Evaluation is not just a safeguard - it is the core value proposition of any AI product.

As foundation models grow increasingly capable, their ability to self-assess against domain-specific business objectives remains limited. This gap places the burden of evaluation on system designers. Robust evaluation ensures alignment with customer needs, mitigates operational and reputational risks, and supports informed decision-making. In high-stakes domains, the absence of rigorous output validation has already led to notable failures - underscoring the importance of treating evaluation as a first-class concern in agentic AI development.

Evaluating an AI system involves two foundational steps:

  1. Developing an evaluation plan that translates business objectives into measurable criteria for decision-making.
  2. Executing the plan using appropriate evaluation methods and metrics tailored to the system’s architecture and use cases.

The following sections detail each step, offering practical guidance for building robust, risk-aware evaluation frameworks in agentic AI systems.

Evaluation plan development

The purpose of an evaluation plan is to systematically translate business objectives into measurable criteria that guide decision-making throughout the AI system’s lifecycle. Begin by clearly defining the system’s intended business value, identifying its core functionalities, and specifying evaluation targets aligned with those goals. A well-constructed plan should enable stakeholders to make informed decisions based on empirical evidence. It must encompass end-to-end system evaluation, expected and unexpected usage patterns, quality benchmarks, and considerations for security, privacy, and responsible AI. Additionally, the plan should extend to individual sub-components, incorporating evaluation of their performance and the dependencies between them to ensure coherent and reliable system behavior.

 

Example - Evaluation of a Financial Report Summarization Agent

To illustrate the evaluation-first approach, consider the example from the previous post of an AI system designed to generate a two-page executive summary from a financial annual report. The system was composed of three agents: split report into chapters, extract information from chapters and tables, and summarize the findings. The evaluation plan for this system should operate at two levels: end-to-end system evaluation and agent-level evaluation.

End-to-End Evaluation

At the system level, the goal is to assess the agent’s ability to accurately and efficiently transform a full financial report into a concise, readable summary. The business purpose is to accelerate financial analysis and decision-making by enabling stakeholders - such as executives, analysts, and investors - to extract key insights without reading the entire document. Key objectives include improving analyst productivity, enhancing accessibility for non-experts, and reducing time-to-decision.

To fulfill this purpose, the system must support several core functionalities:

  • Natural Language Understanding: Extracting financial metrics, trends, and qualitative insights.
  • Summarization Engine: Producing a structured summary that includes an executive overview, key financial metrics (e.g., revenue, EBITDA), notable risks, and forward-looking statements.

The evaluation targets should include:

  • Accuracy: Fidelity of financial figures and strategic insights.
  • Readability: Clarity and structure of the summary for the intended audience.
  • Coverage: Inclusion of all critical report elements.
  • Efficiency: Time saved compared to manual summarization.
  • User Satisfaction: Perceived usefulness and quality by end users.
  • Robustness: Performance across diverse report formats and styles.

These targets inform a set of evaluation items that directly support business decision-making. For example, high accuracy and readability in risk sections are essential for reducing time-to-decision and must meet stringent thresholds to be considered acceptable. The plan should also account for edge cases, unexpected inputs, and responsible AI concerns such as fairness, privacy, and security.

Agent-Level Evaluation

Suppose the system is composed of three specialized agents:

  • Chapter Analysis
  • Tables Analysis
  • Summarization

Each agent requires its own evaluation plan. For instance, the chapter analysis agent should be tested across various chapter types, unexpected input formats, and content quality scenarios. Similarly, the tables analysis agent must be evaluated for its ability to extract structured data accurately, and the summarization agent for coherence and factual consistency.

Evaluating Agent Dependencies

Finally, the evaluation must address inter-agent dependencies. In this example, the summarization agent relies on outputs from the chapter and tables analysis agents. The plan should include dependency checks such as local fact verification - ensuring that the summarization agent correctly cites and integrates information from upstream agents. This ensures that the system functions cohesively and maintains integrity across its modular components.

Executing the Evaluation Plan

Once the evaluation plan is defined, the next step is to execute it using appropriate methods and metrics. While traditional techniques such as code reviews and manual testing remain valuable, we focus here on simulation-based evaluation - a practical and scalable approach that compares system outputs against expected results. For each item in the evaluation plan, this process involves:

  1. Defining representative input samples and corresponding expected outputs
  2. Selecting simulation methods tailored to each agent under evaluation
  3. Measuring and analyzing results using quantitative and qualitative metrics

This structured approach enables consistent, repeatable evaluation across both individual agents and the full system workflow.
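A minimal harness for this loop might look like the sketch below. The names and the trivial comparison function are hypothetical placeholders; in practice the comparison step would use the LLM-judge, expert-review, or similarity methods described later in this post.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List

@dataclass
class EvalSample:
    plan_item: str      # which evaluation-plan item this sample exercises
    input_text: str     # e.g. a financial report chapter
    expected: str       # validated expected output

def compare(actual: str, expected: str) -> float:
    """Placeholder scorer: swap in an LLM judge, expert rating, or similarity metric."""
    return 1.0 if actual.strip() == expected.strip() else 0.0

def run_evaluation(agent: Callable[[str], str], samples: List[EvalSample]) -> Dict[str, float]:
    """Run the agent under test over every sample and average scores per plan item."""
    per_item: Dict[str, List[float]] = {}
    for sample in samples:
        actual = agent(sample.input_text)
        per_item.setdefault(sample.plan_item, []).append(compare(actual, sample.expected))
    return {item: sum(scores) / len(scores) for item, scores in per_item.items()}

# Example: evaluate a stub summarizer against two plan items.
if __name__ == "__main__":
    samples = [
        EvalSample("accuracy", "Revenue grew 12% to $4.0B.", "Revenue grew 12% to $4.0B."),
        EvalSample("coverage", "Risks: FX exposure, churn.", "Risks: FX exposure and churn."),
    ]
    print(run_evaluation(lambda text: text, samples))
```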

Defining Samples and Expected Outputs

A robust evaluation begins with a representative set of input samples and corresponding expected outputs. Ideally, these should reflect real-world business scenarios to ensure relevance and reliability. While a comprehensive evaluation may require hundreds or even thousands of real-life examples, early-stage testing can begin with a smaller, curated set - such as 30 synthetic input-output pairs generated via GenAI and validated by domain experts.

Simulation-Based Evaluation Methods

Early-stage evaluations can leverage lightweight tools such as Python scripts, LLM frameworks (e.g., LangChain), or platform-specific playgrounds (e.g., Azure OpenAI). As the system matures, more robust infrastructure is required to support production-grade testing. It is essential to design tests with reusability in mind - avoiding hardcoded samples and outputs - to ensure continuity across development stages and deployment environments.

Measuring Evaluation Outcomes

Evaluation results should be assessed in two primary dimensions:

  1. Output Quality: Comparing actual system outputs against expected results.
  2. Coverage: Ensuring all items in the evaluation plan are adequately tested.

Comparing Outputs

Agentic AI systems often generate unstructured text, making direct comparisons challenging. To address this, we recommend a combination of:

  • LLM-as-a-Judge: Using large language models to evaluate outputs based on predefined criteria.
  • Domain Expert Review: Leveraging human expertise for nuanced assessments.
  • Similarity Metrics: Applying lexical and semantic similarity techniques to quantify alignment with reference outputs.

Using LLMs as Evaluation Judges

Large Language Models (LLMs) are emerging as a powerful tool for evaluating AI system outputs, offering a scalable alternative to manual review. Their ability to emulate domain-expert reasoning enables fast, cost-effective assessments across a wide range of criteria - including correctness, coherence, groundedness, relevance, fluency, hallucination detection, sensitivity, and even code readability. When properly prompted and anchored to reliable ground truth, LLMs can deliver high-quality classification and scoring performance.

For example, consider the following prompt used to evaluate the alignment between a security recommendation and its remediation steps:

“Below you will find a description of a security recommendation and relevant remediation steps. Evaluate whether the remediation steps adequately address the recommendation. Use a score from 1 to 5:

  • 1: Does not solve at all
  • 2: Poor solution
  • 3: Fair solution
  • 4: Good solution
  • 5: Exact solution

Security recommendation: {recommendation}
Remediation steps: {steps}”
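In practice the prompt above gets templated, and the judge’s reply has to be parsed into a score before it can be aggregated. A hedged sketch of that wiring follows; `call_llm` is a hypothetical stand-in for whichever model client you use, not a specific API:

```python
import re

JUDGE_PROMPT = (
    "Below you will find a description of a security recommendation and relevant "
    "remediation steps. Evaluate whether the remediation steps adequately address "
    "the recommendation. Use a score from 1 to 5:\n"
    "1: Does not solve at all\n2: Poor solution\n3: Fair solution\n"
    "4: Good solution\n5: Exact solution\n"
    "Security recommendation: {recommendation}\n"
    "Remediation steps: {steps}"
)

def call_llm(prompt: str) -> str:
    """Hypothetical stand-in for a model call (e.g. via your Azure OpenAI client)."""
    raise NotImplementedError

def judge_remediation(recommendation: str, steps: str) -> int:
    """Format the judge prompt, call the model, and parse the first 1-5 digit it returns."""
    reply = call_llm(JUDGE_PROMPT.format(recommendation=recommendation, steps=steps))
    match = re.search(r"[1-5]", reply)
    if match is None:
        raise ValueError(f"Could not parse a score from judge reply: {reply!r}")
    return int(match.group())
```

Running the same judgment in comparison mode, as recommended below, simply means the prompt also includes the expected output alongside the actual one.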

While LLM-based evaluation offers significant advantages, it is not without limitations. Performance is highly sensitive to prompt design and the specificity of evaluation criteria. Subjective metrics - such as “usefulness” or “helpfulness” - can lead to inconsistent judgments depending on domain context or user expertise. Additionally, LLMs may exhibit biases, such as favoring their own generated responses, preferring longer answers, or being influenced by the order of presented options.

Although LLMs can be used independently to assess outputs, we strongly recommend using them in comparison mode - evaluating actual outputs against expected ones - to improve reliability and reduce ambiguity. Regardless of method, all LLM-based evaluations should be validated against real-world data and expert feedback to ensure robustness and trustworthiness.

Domain expert evaluation

Engaging domain experts to assess AI output remains one of the most reliable methods for evaluating quality, especially in high-stakes or specialized contexts. Experts can provide nuanced judgments on correctness, relevance, and usefulness that automated methods may miss. However, this approach is inherently limited in scalability, repeatability, and cost-efficiency. It is also susceptible to human biases - such as cultural context, subjective interpretation, and inconsistency across reviewers - which must be accounted for when interpreting results.

Similarity techniques

Similarity techniques offer a scalable alternative by comparing AI-generated outputs against reference data using quantitative metrics. These methods assess how closely the system’s output aligns with expected results, using measures such as exact match, lexical overlap, and semantic similarity. While less nuanced than expert review, similarity metrics are useful for benchmarking performance across large datasets and can be automated for continuous evaluation. They are particularly effective when ground truth data is well-defined and structured.
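For the lexical end of that spectrum, here is a small sketch using only the Python standard library; semantic similarity would typically add sentence embeddings, which are omitted here:

```python
from difflib import SequenceMatcher

def exact_match(actual: str, expected: str) -> bool:
    """Strictest check: identical text after trimming whitespace."""
    return actual.strip() == expected.strip()

def token_overlap(actual: str, expected: str) -> float:
    """Jaccard overlap of word tokens: a crude lexical-overlap score in [0, 1]."""
    a, b = set(actual.lower().split()), set(expected.lower().split())
    return len(a & b) / len(a | b) if (a | b) else 1.0

def char_similarity(actual: str, expected: str) -> float:
    """Character-level similarity ratio in [0, 1]."""
    return SequenceMatcher(None, actual, expected).ratio()

# Semantic similarity (e.g. cosine similarity of sentence embeddings) follows the
# same comparison pattern but needs an embedding model, so it is not shown here.
```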

Coverage evaluation in Agentic AI Systems

A foundational metric in any evaluation framework is coverage - ensuring that all items defined in the evaluation plan are adequately tested. However, simple checklist-style coverage is insufficient, as each item may require nuanced assessment across different dimensions of functionality, safety, and robustness.

To formalize this, we introduce two metrics inspired by software engineering practices:

  • Prompt-Coverage: Assesses how well a single prompt invocation addresses both its functional objectives (e.g., “summarize financial risk”) and non-functional constraints (e.g., “avoid speculative language” or “ensure privacy compliance”). This metric should reflect the complexity embedded in the prompt and its alignment with business-critical expectations.
  • Agentic-Workflow Coverage: Measures the completeness of evaluation across the logical and operational dependencies within an agentic workflow. This includes interactions between agents, tools, and tasks - analogous to branch coverage in software testing. It ensures that all integration points and edge cases are exercised.

We recommend aiming for the highest possible coverage across evaluation dimensions. Coverage gaps should be prioritized based on their potential risk and business impact, and revisited regularly as prompts and workflows evolve to ensure continued alignment and robustness.
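One simple way to make the workflow half of this measurable is to enumerate the dependency edges in the agentic workflow and report the fraction exercised by at least one test scenario. The sketch below is hypothetical (the post does not prescribe a formula), with edge names taken from the financial-report example:

```python
from typing import Set, Tuple

Edge = Tuple[str, str]   # (upstream agent/tool, downstream agent/tool)

def workflow_coverage(all_edges: Set[Edge], exercised: Set[Edge]) -> float:
    """Analogous to branch coverage: exercised dependency edges / total edges."""
    return len(exercised & all_edges) / len(all_edges) if all_edges else 1.0

# Dependency edges for the summarization example (hypothetical names).
all_edges = {
    ("chapter_analysis", "summarization"),
    ("tables_analysis", "summarization"),
    ("summarization", "executive_summary"),
}
tested = {
    ("chapter_analysis", "summarization"),
    ("tables_analysis", "summarization"),
}

print(f"Agentic-workflow coverage: {workflow_coverage(all_edges, tested):.0%}")  # 67%
```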

Closing Thoughts

As agentic AI systems become increasingly central to business-critical workflows, evaluation must evolve from a post-hoc activity to a foundational design principle. By adopting an evaluation-first mindset - grounded in structured planning, simulation-based testing, and comprehensive coverage - you not only mitigate risk but also unlock the full strategic value of your AI solution. Whether you're building internal tools or customer-facing products, a robust evaluation framework ensures your system is trustworthy, aligned with business goals, and differentiated from generic models.
