Sr. Content Developer at Microsoft, working remotely in PA, TechBash conference organizer, former Microsoft MVP, Husband, Dad and Geek.

Why observable AI is the missing SRE layer enterprises need for reliable LLMs


As AI systems enter production, reliability and governance can’t depend on wishful thinking. Here’s how observability turns large language models (LLMs) into auditable, trustworthy enterprise systems.

Why observability secures the future of enterprise AI

The enterprise race to deploy LLM systems mirrors the early days of cloud adoption. Executives love the promise; compliance demands accountability; engineers just want a paved road.

Yet, beneath the excitement, most leaders admit they can’t trace how AI decisions are made, whether those decisions helped the business, or whether they broke any rules.

Take one Fortune 100 bank that deployed an LLM to classify loan applications. Benchmark accuracy looked stellar. Yet six months later, auditors found that 18% of critical cases had been misrouted, without a single alert or trace. The root cause wasn’t bias or bad data; it was invisibility. No observability, no accountability.

If you can’t observe it, you can’t trust it. And unobserved AI will fail in silence.

Visibility isn’t a luxury; it’s the foundation of trust. Without it, AI becomes ungovernable.

Start with outcomes, not models

Most corporate AI projects begin with tech leaders choosing a model and, later, defining success metrics. That’s backward.

Flip the order:

  • Define the outcome first. What’s the measurable business goal?

    • Deflect 15% of billing calls

    • Reduce document review time by 60%

    • Cut case-handling time by two minutes

  • Design telemetry around that outcome, not around “accuracy” or “BLEU score.”

  • Select prompts, retrieval methods and models that demonstrably move those KPIs.

At one global insurer, for instance, reframing success as “minutes saved per claim” instead of “model precision” turned an isolated pilot into a company-wide roadmap.

A 3-layer telemetry model for LLM observability

Just like microservices rely on logs, metrics and traces, AI systems need a structured observability stack:

a) Prompts and context: What went in

  • Log every prompt template, variable and retrieved document.

  • Record model ID, version, latency and token counts (your leading cost indicators).

  • Maintain an auditable redaction log showing what data was masked, when and by which rule.

b) Policies and controls: The guardrails

  • Capture safety-filter outcomes (toxicity, PII), citation presence and rule triggers.

  • Store policy reasons and risk tier for each deployment.

  • Link outputs back to the governing model card for transparency.

c) Outcomes and feedback: Did it work?

  • Gather human ratings and edit distances from accepted answers.

  • Track downstream business events: case closed, document approved, issue resolved.

  • Measure the KPI deltas: call time, backlog, reopen rate.

All three layers connect through a common trace ID, enabling any decision to be replayed, audited or improved.

Diagram © SaiKrishna Koorapati (2025). Created specifically for this article; licensed to VentureBeat for publication.
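The three layers above can be captured in a single record keyed by that common trace ID. A minimal sketch in Python; the field names here are illustrative, not a standard schema:

```python
import uuid
from dataclasses import dataclass, field, asdict

@dataclass
class LLMTraceRecord:
    """One record per LLM call, replayable by trace_id. Field names are illustrative."""
    trace_id: str = field(default_factory=lambda: str(uuid.uuid4()))
    # a) Prompts and context: what went in
    prompt_template: str = ""
    model_id: str = ""
    latency_ms: float = 0.0
    input_tokens: int = 0
    output_tokens: int = 0
    redactions: list = field(default_factory=list)      # what was masked, by which rule
    # b) Policies and controls: the guardrails
    safety_filters: dict = field(default_factory=dict)  # e.g. {"toxicity": "pass", "pii": "pass"}
    risk_tier: str = "low"
    model_card_url: str = ""
    # c) Outcomes and feedback: did it work?
    human_rating: int = 0
    business_event: str = ""                            # e.g. "case_closed"

record = LLMTraceRecord(model_id="router-v2", latency_ms=412.0,
                        input_tokens=950, output_tokens=120)
print(asdict(record)["trace_id"])  # one ID links all three layers for replay and audit
```

Emitting one such record per call (to a log pipeline or trace store) is what makes any decision replayable later.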

Apply SRE discipline: SLOs and error budgets for AI

Site reliability engineering (SRE) transformed software operations; now it’s AI’s turn.

Define three “golden signals” for every critical workflow:

| Signal | Target SLO | When breached |
| --- | --- | --- |
| Factuality | ≥ 95% verified against source of record | Fall back to verified template |
| Safety | ≥ 99.9% pass toxicity/PII filters | Quarantine and human review |
| Usefulness | ≥ 80% accepted on first pass | Retrain or roll back prompt/model |

If hallucinations or refusals exceed the budget, the system auto-routes to safer prompts or human review, just like rerouting traffic during a service outage.

This isn’t bureaucracy; it’s reliability applied to reasoning.
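A sketch of that breach routing; the signals and thresholds mirror the table above, while the function and action names are my own illustration, not a named framework:

```python
# Target SLOs from the golden-signals table above.
SLOS = {"factuality": 0.95, "safety": 0.999, "usefulness": 0.80}

# Breach actions, one per signal (names are illustrative).
FALLBACKS = {
    "factuality": "fallback_to_verified_template",
    "safety": "quarantine_and_human_review",
    "usefulness": "retrain_or_rollback",
}

def route(signal: str, observed_rate: float) -> str:
    """Serve normally while within SLO; otherwise return the breach action."""
    if observed_rate >= SLOS[signal]:
        return "serve"
    return FALLBACKS[signal]

print(route("safety", 0.9995))    # within budget -> serve
print(route("factuality", 0.91))  # breached -> fallback_to_verified_template
```

In practice `observed_rate` would come from a rolling window of evaluation results, so a burst of hallucinations burns the error budget and flips the route automatically.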

Build the thin observability layer in two agile sprints

You don’t need a six-month roadmap; you need focus and two short sprints.

Sprint 1 (weeks 1-3): Foundations

  • Version-controlled prompt registry

  • Redaction middleware tied to policy

  • Request/response logging with trace IDs

  • Basic evaluations (PII checks, citation presence)

  • Simple human-in-the-loop (HITL) UI
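The redaction middleware item above can be sketched in a few lines. The patterns here are illustrative, not production-grade PII detection, and the rule names are assumptions:

```python
import re

# Illustrative redaction rules; a real deployment would use a vetted PII library.
RULES = {
    "email": re.compile(r"[\w.+-]+@[\w-]+\.[\w.]+"),
    "ssn": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
}

def redact(text: str) -> tuple[str, list[dict]]:
    """Mask matches and return an auditable log of what was masked and by which rule."""
    log = []
    for rule, pattern in RULES.items():
        def _mask(m, rule=rule):
            log.append({"rule": rule, "span": m.span()})
            return f"[{rule.upper()}]"
        text = pattern.sub(_mask, text)
    return text, log

clean, audit = redact("Contact jane@example.com about SSN 123-45-6789.")
print(clean)  # Contact [EMAIL] about SSN [SSN].
```

The returned `audit` list is exactly the redaction log layer (a) calls for: which rule fired, where, and when.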

Sprint 2 (weeks 4-6): Guardrails and KPIs

  • Offline test sets (100–300 real examples)

  • Policy gates for factuality and safety

  • Lightweight dashboard tracking SLOs and cost

  • Automated token and latency tracker

In 6 weeks, you’ll have the thin layer that answers 90% of governance and product questions.

Make evaluations continuous (and boring)

Evaluations shouldn’t be heroic one-offs; they should be routine.

  • Curate test sets from real cases; refresh 10–20 % monthly.

  • Define clear acceptance criteria shared by product and risk teams.

  • Run the suite on every prompt/model/policy change and weekly for drift checks.

  • Publish one unified scorecard each week covering factuality, safety, usefulness and cost.

When evals are part of CI/CD, they stop being compliance theater and become operational pulse checks.
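A minimal eval runner of this shape can gate a CI pipeline. The check (citation presence) and threshold are illustrative assumptions, standing in for whatever acceptance criteria product and risk teams agree on:

```python
# Hypothetical eval gate: one check and one threshold, standing in for a full suite.
def citation_present(answer: str) -> bool:
    return "[source:" in answer

def run_suite(cases: list[dict], min_pass_rate: float = 0.9) -> dict:
    """Score every case and fail the gate if the pass rate drops below threshold."""
    passed = sum(1 for c in cases if citation_present(c["answer"]))
    rate = passed / len(cases)
    return {"pass_rate": rate, "gate": "pass" if rate >= min_pass_rate else "fail"}

cases = [
    {"answer": "Refunds take 5 days. [source: policy-12]"},
    {"answer": "Refunds take 5 days."},  # drifted: citation dropped
]
print(run_suite(cases))  # {'pass_rate': 0.5, 'gate': 'fail'}
```

Running this on every prompt/model/policy change, and weekly on a refreshed test set, is what turns evals from heroics into a pulse check.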

Apply human oversight where it matters

Full automation is neither realistic nor responsible. High-risk or ambiguous cases should escalate to human review.

  • Route low-confidence or policy-flagged responses to experts.

  • Capture every edit and reason as training data and audit evidence.

  • Feed reviewer feedback back into prompts and policies for continuous improvement.
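The routing rule in the first bullet is small enough to sketch directly; the confidence threshold and queue names are illustrative assumptions:

```python
# Hypothetical HITL router: policy flags or low confidence escalate to a human.
def route_response(confidence: float, policy_flags: list[str],
                   threshold: float = 0.7) -> str:
    """Low-confidence or policy-flagged answers go to expert review; the rest auto-serve."""
    if policy_flags or confidence < threshold:
        return "expert_review_queue"
    return "auto_serve"

print(route_response(0.95, []))       # auto_serve
print(route_response(0.95, ["pii"]))  # expert_review_queue
print(route_response(0.40, []))       # expert_review_queue
```

Each escalated case, with the reviewer's edit and reason attached, becomes both audit evidence and training data.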

At one health-tech firm, this approach cut false positives by 22 % and produced a retrainable, compliance-ready dataset in weeks.

Cost control through design, not hope

LLM costs grow non-linearly. Budgets won’t save you; architecture will.

  • Structure prompts so deterministic sections run before generative ones.

  • Compress and rerank context instead of dumping entire documents.

  • Cache frequent queries and memoize tool outputs with TTL.

  • Track latency, throughput and token use per feature.

When observability covers tokens and latency, cost becomes a controlled variable, not a surprise.
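The caching-with-TTL bullet above is the cheapest of these wins; a minimal sketch, with the TTL value and cache shape as assumptions:

```python
import time

# Hypothetical TTL memoization for tool/LLM outputs; policy values are assumptions.
_cache: dict[str, tuple[float, str]] = {}
TTL_SECONDS = 300.0

def cached_call(query: str, tool) -> str:
    """Serve repeated queries from cache within the TTL instead of paying tokens again."""
    now = time.monotonic()
    hit = _cache.get(query)
    if hit and now - hit[0] < TTL_SECONDS:
        return hit[1]
    result = tool(query)           # the expensive LLM/tool call
    _cache[query] = (now, result)
    return result

calls = []
def fake_tool(q):
    calls.append(q)
    return f"answer:{q}"

cached_call("billing policy", fake_tool)
cached_call("billing policy", fake_tool)
print(len(calls))  # 1 -- the second call was a cache hit, zero tokens spent
```

Logging cache hit rates alongside token counts per feature is what makes the cost curve visible before it bends.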

The 90-day playbook

Within 3 months of adopting observable AI principles, enterprises should see:

  • 1–2 production AI assists with HITL for edge cases

  • Automated evaluation suite for pre-deploy and nightly runs

  • Weekly scorecard shared across SRE, product and risk

  • Audit-ready traces linking prompts, policies and outcomes

At a Fortune 100 client, this structure reduced incident time by 40 % and aligned product and compliance roadmaps.

Scaling trust through observability

Observable AI is how you turn AI from experiment to infrastructure.

With clear telemetry, SLOs and human feedback loops:

  • Executives gain evidence-backed confidence.

  • Compliance teams get replayable audit chains.

  • Engineers iterate faster and ship safely.

  • Customers experience reliable, explainable AI.

Observability isn’t an add-on layer; it’s the foundation for trust at scale.

SaiKrishna Koorapati is a software engineering leader.




Read the whole story
alvinashcraft
40 minutes ago
reply
Pennsylvania, USA
Share this story
Delete

Project Spotlight: Steeltoe


Steeltoe provides a collection of libraries that helps users build production-grade cloud-native applications using externalized configuration, service discovery, distributed tracing, application management, and more. It is proven—trusted by developers all over the world, delivering delightful experiences to millions of end-users every day, with contributions from VMware, Microsoft, and more.

Steeltoe is flexible, offering comprehensive extensions and third-party libraries that let developers build almost any web application imaginable, whether cloud-scale microservices or heavyweight enterprise applications. It is productive, building on .NET runtime libraries, providing the necessary glue code, and supporting many of Spring Cloud’s libraries, patterns, and templates.

Steeltoe is fast, with developer productivity as one of its superpowers—developers can start a new project in seconds with the Steeltoe Initializr at start.steeltoe.io. It is secure, remediating security issues quickly and responsibly, monitoring dependencies, and providing industry-standard security integrations.

Finally, Steeltoe is supportive, backed by a global, diverse community that offers guides, tutorials, videos, support, and access to the development team on Slack.

What Steeltoe Can Do

  • Microservices: Production-grade features with independently evolvable services.

  • Cloud: Your code, any cloud—connect and scale your services.

  • Web Apps: Fast, secure, responsive applications connected to any data store.

Link: https://steeltoe.io/


Member Spotlight: Tomas Herceg


Tomas Herceg lives in Prague, Czech Republic, and runs a software consulting company called RIGANTI.

He has been doing .NET development since the time of .NET Framework 1.0. He got his first Microsoft MVP award in 2009, and for a couple of years, he was also a Microsoft Regional Director. He also runs Update Conference, which organizes developer events focused on .NET, cloud, and security. Many people might also know him because of his recent book about modernizing .NET web applications.

Tomas’s open-source journey started in 2014 when he had the idea of creating a framework that lets you build web apps with just C# and HTML. He made a simple prototype, published it on GitHub, and demoed it in one of his conference sessions. Surprisingly, the next day someone submitted a pull request. He contacted that person, and they decided to continue working on the idea and see what happens.

That is how the DotVVM project started. Tomas and his team have been contributing to it for more than 10 years, adding hundreds of features, tests, and documentation pages. Over the years, more people have been helping with the development. They use the framework intensively at RIGANTI and are committed to its long-term sustainability. Therefore, they built a bunch of commercial extensions and components for DotVVM, which helps them secure funding for future improvements to the open-source framework.

DotVVM is an opinionated framework that enables building web apps using the Model-View-ViewModel (MVVM) approach with just C# and HTML. It requires only about 56kB of JavaScript on the client, and it can be used to build feature-rich user interfaces. The framework supports both ASP.NET Core and classic ASP.NET, providing an easy way to incrementally modernize ASP.NET Web Forms applications. DotVVM comes with 30+ built-in components, and there is also an extension for Visual Studio and Visual Studio Code.

Links:
https://github.com/riganti/dotvvm
https://tomasherceg.com
https://modernizationbook.com


Python Is Quickly Evolving To Meet Modern Enterprise AI Needs


Python is ubiquitous. Millions of professionals, from scientists to software developers, rely on it. Organizations like Google and Meta have built critical infrastructure using it. Python even helped NASA explore Mars, thanks to its image processing abilities.

And its growth isn’t slowing anytime soon.

In 2024, Python surpassed JavaScript as the most popular language on GitHub, and today, it has become the backbone of modern AI systems. Python’s versatility and passionate community have made it what it is today. However, as more enterprises rely on Python for everything from web services to AI models, there are unique needs that enterprises must address around visibility, performance, governance and security to ensure business continuity, fast time to market and true differentiation.

How Python Became the Universal AI Language

Most popular languages have benefited from corporate sponsorship. Oracle supports Java. Microsoft backs C#. And Apple champions Swift. But Python has almost always been a community project, supported by several companies, and has been developed and improved over decades by a committed group of mainly volunteers, directed by Guido van Rossum as Benevolent Dictator for Life until 2018.

In the 1980s, van Rossum sought to create a language that was both simple and beautiful. Since the early ’90s, Python has been available as an open source project for anyone to inspect, modify or improve.

The Zen of Python, by Tim Peters, image originally posted by Pycon India on X.

Python quickly differentiated itself from its peers. It was easy to learn, write and understand. Developers could easily tell what was happening in their and others’ code just by looking at it, an anomaly in the days of Perl, C++ and complex shell scripts. This low barrier to entry made it highly approachable to new users.

Then there was Python’s extensibility, meaning it could easily integrate with other languages and systems. With the rise of the internet in the early 2000s, this extensibility took Python from a scripting solution to a production language for web servers, services and applications.

In the 2010s, Python became the de facto language for numerical computing and data science. Today, the world’s leading AI and machine learning (ML) packages, such as PyTorch, TensorFlow, scikit-learn, SciPy, Pandas and more, are Python-based. Still, the high-performance data and AI algorithms they use rely on highly optimized code written in compiled languages like C or C++. It is Python’s ability to easily integrate with these and other languages that has been critical in its ability to provide the best of both worlds: an easy interface to these packages for the millions of users who want to use them, but flexible interfaces for the experts that can optimize them in the language of their choice. These factors have made Python indispensable for both data science and AI workflows.

Today, if you’re working with any kind of AI or ML application, you’re likely using Python. However, as Python has become both the glue and the engine powering modern AI systems, enterprises need to be aware of critical needs specific to corporations around compliance, security and performance, and the community must strive to address them.

Helping Python Meet Enterprise Needs

Longtime Python core contributor Brett Cannon famously said, “I came for the language, but I stayed for the community.”

The community has made Python the incredible language it is today, serving users above all else. However, the community’s mission has always been to build a language that works for everyone, from programmers to scientists to data engineers. This has proven to be the right approach. This also means Python wasn’t engineered for the specific needs of enterprises running their business with Python.

And that’s OK, as long as those needs are addressed.

Anaconda’s “2025 State of Data Science and AI Report” found that enterprises face many of the same recurring challenges as they move data and AI applications to production. Over 57% reported that it takes more than a month to move AI projects from development to production. To demonstrate ROI, respondents were mostly interested in business concerns, such as:

  • Productivity Improvements (58%)
  • Cost Savings (48%)
  • Revenue Impact (46%)
  • Customer Experience / Loyalty (45%)

Think about it like cloud computing fifteen years ago. Organizations could immediately see the massive cost and operational advantages of moving workloads to the cloud. However, they realized that the security, compliance and cost model had changed entirely. They needed to continuously monitor, govern and optimize this new tool in altogether new ways. Python has reached that same point for enterprises.

I’ve spoken with dozens of leaders at organizations using Python, and here are the common challenges and themes I see.

Security

While 82% of organizations validate open source Python packages for security, nearly 40% of respondents still frequently encounter security vulnerabilities in their projects. These security issues create deployment delays for over two-thirds of organizations.

One of the strengths of Python, and all open source software, is that they’re free to download and use. You get the latest and greatest technology, and you can experiment, develop and push applications to production without paying a dime on the software.

However, history has shown that this openness and collaborative community can be abused by bad actors or even allow simple mistakes to proliferate, leading to the spread of vulnerable and malicious software. A piece of software or a package that looks fine could actually be dangerous. That problem is now compounding, with AI systems now generating and executing Python code without a human in the loop. Enterprises must protect their people, systems and data, and in turn, ensure safe AI deployment without missing deadlines.

Performance Optimization

Though Python is straightforward to use, it can also be slow, which is fine for many use cases. But as we saw in the “State of Data Science and AI Report,” the modern enterprise’s primary concern is doing more with less: continually improving efficiency and productivity, cutting costs and increasing revenue. The economics of producing AI applications is only exacerbating performance and efficiency concerns.

With limited time, expertise or tools, most enterprises struggle to fine-tune the Python runtime, leading to far more compute than needed and higher costs, or to running AI systems that aren’t performant enough to provide a usable experience.

Auditability

Every CIO and CISO I know is staring down a wave of regulations, from the EU AI Act to internal SOC 2 and ISO 27001 compliance audits. Enterprises must be able to prove what code is running, where it’s running and how it’s interacting with sensitive data and systems.

Free and open source software makes that challenging because when anyone can download and run software freely, everyone will. New Python applications are popping up outside of IT control, packages are constantly updating, unknown or new dependencies are pulled in and there’s limited runtime visibility. Especially for organizations in highly regulated industries, this lack of runtime visibility creates present and future risk.

Managing Deployments

According to a recent survey of Anaconda’s users, over 80% of practitioners spend more than 10% of their AI development time troubleshooting dependency conflicts or security issues. Over 40% spend more than a quarter of their time on these tasks, and time is money.

Once applications are in production, continuous maintenance, upgrades and security hardening can compound those issues. For an individual running and maintaining a small number of scripts and applications, this is not so hard. Still, for a large enterprise managing thousands of production applications, this becomes a considerable challenge.

Enterprises need a way to easily adopt new versions of Python and new technologies, while also minimizing version sprawl, security exposure and management overhead.

How To Help Enterprise AI Meet the Needs of Modern Enterprises

The good news is you can start addressing many of these challenges today. It all comes down to being intentional about your governance strategy.

More than half of organizations today have no or very limited open source and AI governance policies or frameworks in place. Creating an official policy around governance and investing in visibility and auditability already puts you ahead of most enterprises.

When building your governance strategy, start by building internal processes that track Python usage across teams and systems. Ensure you know what packages are running, where, and under what configurations.
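As a concrete starting point for that package tracking, the standard library can already inventory what is installed in a given environment. This is a minimal sketch, not a governance platform; it gives no cross-team or runtime visibility on its own:

```python
from importlib import metadata

# Inventory the packages and versions installed in the current environment,
# so you at least know what is running where this script runs.
def package_inventory() -> dict[str, str]:
    return {
        dist.metadata["Name"]: dist.version
        for dist in metadata.distributions()
        if dist.metadata["Name"]  # skip broken or partial installs
    }

inventory = package_inventory()
print(f"{len(inventory)} packages installed in this environment")
```

Shipping the resulting inventory from each environment to a central store is one simple way to start answering "what packages are running, where, and under what configurations."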

Next, you’ll want to ensure you’re managing Shadow IT/AI and reviewing any and all AI-generated code. Agentic tools can’t replace a solid software development life cycle (SDLC) process. Ensure you have the right visibility, standards and processes in place to prevent unverified scripts from entering production.

It’s also critical to invest in workforce upskilling, increasing AI literacy among your employees so they better understand the risks of open source and AI solutions and why governance is so important. Some of the best education is in using these tools directly and gaining experience.

Finally, give your teams safe, reliable solutions across AI and data science workflows so that doing the right thing becomes the path of least resistance.

Make Python Your Competitive Edge

Python’s openness is its greatest strength and its most significant challenge. While it’s democratized AI development, it’s also created new risk vectors and blind spots that enterprises must address. IT teams need the same visibility and governance for open source solutions as they would for any other part of their tech stack. Time has shown that this is a primary source of innovation in the enterprise, so the investment in securing that innovation is worth it. And while specific upgrades to the language itself can help, intentional governance can make a difference today.

At Anaconda, we’ve seen enterprises tackle these challenges by building strong SDLC, governance, and observability layers around their Python environments. It adds a little more work upfront, but it’s a critical shift that will protect your organization in the long run and ensure the success and longevity of your AI initiatives.

The post Python Is Quickly Evolving To Meet Modern Enterprise AI Needs appeared first on The New Stack.


Most Developers Call AI Data With APIs and A2A


A recent Theory Ventures survey found that 91% of the 413 senior technical builders surveyed are either directly building AI or managing teams that build AI. That said, only 17% are building MCP servers. This contrasts with the 46% that are using direct tool calling, including via APIs and A2A, to access data.

The survey found the most common use case for agent tools is accessing databases, with 72% of respondents using AI agents to do this.

Nearly everyone (98%) is evaluating the quality of their AI, usually using a combination of automated and manual methods, the survey found. Overall, 27% are using LLMs to judge the quality of their AI’s output and 63% are using synthetic data to evaluate their AI’s output.


Respondents were almost three times as likely to use AI evaluations to assess the quality of their AI products as compared to those using telemetry.

When it comes to LLM observability, 57% are storing interactions with their AI systems as traces in a product designed for LLM observability. However, 57% say they also use spreadsheets to review the data they are collecting.

Perhaps that is why 21% said data review of interactions is the area of tooling they believe will add the most value for their projects.

When it comes to context engineering, 47% are using a prompt optimization method like GEPA and Prompt Evolution. Fifty-two percent are managing prompts as code that can be checked into their code repository and handled like other code reviews. That’s in contrast to the 41% who are updating prompts with text files that people from different teams are collaborating on.

Another 41% said they are iterating on prompts outside of the code, according to the survey.

Shipaton Mobile Hack-A-Thon Won With Vibe-Coded App

Shipaton is a mobile app hackathon hosted annually by RevenueCat, a platform that powers in-app purchases for apps. This year it showed that vibe coding is truly catching on.

Only 1,700 individuals participated last year, but this year the event attracted 54,000 participants. RevenueCat said the increase reflects how vibe coding is redefining — and expanding — who is an app developer.

“Vibecoders played a prominent role in the competition, with the Grand Prize going to Payout, an app completely produced through AI-assisted development using Claude Code and Cursor,” the company said in a prepared statement. Payout enables people to find class action lawsuits for which they qualify.

“AI tools are lowering the barrier to build and ship apps, but the fundamentals haven’t changed,” said Jacob Eiting, CEO of RevenueCat. “You still need to build something people love, use, and pay for — and that’s exactly what this year’s Shipaton winners did.”

Shipaton awards were granted by category and included:

  • Best Vibes: Vibe coding platforms and tools are increasingly being used in app development. The winner, OtterDay, used Perplexity Pro for dialogue and visuals, KlingAI for otter animations, and ElevenLabs for voiceovers.
  • Design: Visuals are key for engagement and the winners of this category specialize in aesthetic quality. DayLoop, an app that turns everyday moments into cinematic time-lapse videos, won for its precision, privacy and delight.
  • Buzziest Launch: Whether or not an app makes a splash on day one can make or break its long-term success. ReadHim, geared toward decoding men’s texts, won first place in this category for an Instagram meme account that amassed over 5.2 million views, partnering with a TikTok influencer with over 2.3M followers, and executing a creative stunt complete with supercars and a robot dog.
  • Apps That Make Money: Vibe coding makes launching apps easier than ever, but sustainable monetization still sets the best apart. VectorGuard was awarded for their top-tier app monetization strategy. With thoughtful design and fair monetization, they turn public data into public good.

Shipaton celebrates the freshly built projects, new launches and prototypes that are bringing unique products to users. RevenueCat also awards six Shippies to apps that have gone to the next level, not just launching but nailing every part of the subscription journey, from onboarding and monetization to retention and creativity. The 2025 Shippies awards went to Hank Green’s Focus Friend, Ladder, Resubs, Recime, Wink and WeWard.

Warp CLI Tool Expands AI Agent Capabilities

Warp is a modern, high-performance command-line terminal application designed for macOS, Linux, and Windows. This month, Warp expanded its AI agent capabilities with the launch of Agents 3.0.

The company’s goal with this release is to offer reliable, collaborative and fully autonomous development workflows within the terminal environment.

Among the new features is the Full Terminal Use, which allows the agent to interact with live processes and full-screen terminal apps like debuggers, which solves a major bottleneck for real-world development tasks.

The update also introduces structured, versioned development blueprints and Interactive Code Review for human oversight directly in the terminal.

The post Most Developers Call AI Data With APIs and A2A appeared first on The New Stack.


Prompt Droid version 2


Latest build and GitHub!

After I pushed out my first version of Prompt Droid (née Prompt Drawer) a couple of weeks ago, I had a few realizations:

  1. Someone already published an extension named “Prompt Drawer” that did something similar to mine (though I like the way mine works better), so I needed a new name.
    • EVERYTHING related to the word “prompt” is used already: the ones you’d expect, like Prompty or PromptBox and even more esoteric ones like El Prompto. Prompt Droid was not, however, so one find-and-replace in my code later and here we are.
  2. It was really bothering me that the extension wouldn’t work on Copilot. I did a lot of digging and learned that I’d built the extension in the entirely wrong way, that I needed to implement it as a popup instead of having it embedded in the page itself (which some sites, including Copilot, forcibly block).
  3. It took a village of AI tools to help get it where I wanted it. I used Claude as my main toolset, Gemini 3 to help with some code iteration, and Copilot was actually able to tell me why it wasn’t working on… Copilot. Physician, know thyself.

So here we are, Prompt Droid has been rewritten and fixed up so that it now works anywhere. As always, please let me know if you find any bugs.


