
Microsoft Build 2026: Building agentic apps with Microsoft Fabric and Microsoft Databases


AI is driving a fundamental shift in how work gets done and how applications are built. As the 2026 Microsoft Work Trend Index Report highlights, a growing share of workers are moving beyond asking questions to handing off entire tasks and orchestrating multi-agent systems.

This shift introduces a new constraint. The challenge is no longer model capability, but consistent, shared data context across the business.

Developers and business users know what they want to automate, and today’s models can deliver. The bottleneck is context. Every new agent starts from zero, relearning how the business works, where data lives, and what rules to follow. Without a consistent foundation, agents can’t coordinate or scale.

That’s the challenge we are solving with Microsoft Fabric. It provides a unified data and AI platform that empowers you to bring together data and move from isolated AI experiments to production-ready agent systems, in which each new agent builds on shared organizational context. This vision is already driving strong momentum among the millions of developers building on Fabric and Microsoft Databases.

At Microsoft Build, we are extending this foundation with new capabilities that help developers move from prototype to production faster. These include Rayfin, a new software development kit (SDK) and command-line interface (CLI) designed to make Fabric a production-ready application backend, and Azure HorizonDB, a new PostgreSQL database designed for AI‑powered applications, now in public preview.

Introducing Rayfin: From prompt to production backend

Coding agents are accelerating app development. Moving those applications from prototype to production, however, remains a challenge. Agent-created or not, every production-ready application still relies on a backend to manage data, enforce identity and permissions, coordinate state, and operate reliably over time. Existing software-service platforms were either not designed for agents or do not fully meet enterprise requirements for deployment, security, and governance.

Rayfin, a new open-source SDK and CLI, is designed to close that gap. It lets developers and coding agents describe what to build and get an enterprise-grade application backend, including a database, authentication, and more, added directly to the application code. Rayfin then deploys directly to Microsoft Fabric, giving every application enterprise-grade security and scale from day one. Developers and AI agents can now move from prompt to production without managing infrastructure.

With Rayfin, developers work through familiar GitHub‑based workflows to define data models, backend logic, and access policies entirely in code, giving teams and agents a consistent, programmable interface for building and managing applications.

Because Rayfin can be deployed directly on Fabric, application data lands directly in OneLake, where it is immediately available to the full Fabric data stack, unified with analytics, operational and real-time data, and AI engines by default. This enables developers to build their enterprise apps on trusted business logic, integrate with semantic models, and embed rich data visuals.

We’re excited to partner with Replit, a leading AI coding platform, to help customers build enterprise-grade apps in the interface they know and love while keeping app, data, and services managed in their own Fabric tenant.

Rayfin unlocks a new development model for our users. Agents write the code. Fabric ships it quickly and safely. Together, we’re giving developers something they’ve never had before: a path from idea to enterprise-grade production that’s measured in hours, not months.

—Amjad Masad, Chief Executive Officer (CEO) of Replit

Learn more about Rayfin by watching the Microsoft Build session “BRK225 – Data, apps, and agents: the future of app dev with Microsoft Fabric” on Wednesday, June 3, 2026, at 1:30 PM PT.

Microsoft Databases, designed for AI applications

For decades, databases have been the backbone of enterprise applications. As applications become more intelligent and agent‑powered, we are evolving Microsoft Databases into the foundation optimized for real‑time, AI‑ready, and operationally rich experiences.

Azure HorizonDB: Enterprise-ready PostgreSQL built for the demands of AI applications

As a leading PostgreSQL committer, Microsoft has long invested in the PostgreSQL community. But as AI‑powered applications place new strains on scale, latency, and resilience, the demand for a new class of PostgreSQL database is clear.

Azure HorizonDB is that next step. Now available in public preview, HorizonDB is a fully managed, PostgreSQL‑compatible database that combines PostgreSQL familiarity with cloud‑scale architecture. It’s zone resilient by default, delivers elastic storage that scales to 128 TB and massive scale‑out compute up to 3,072 vCores, and can sustain sub‑millisecond, multi‑zone commit latency for demanding transactional scenarios.

As our data demands have expanded exponentially because of our use of Azure AI to chat with our data, HorizonDB has come at the perfect time to meet the performance, scale, and security we need to shift into this new world of AI-enabled data.

—Rand Morimoto, President of Convergent Computing (CCO)

Beyond scalability, we’ve also infused Azure HorizonDB with experiences designed specifically for AI applications, such as vector search, integrated AI model management, and direct connectivity to Microsoft Foundry and Fabric. These features provide a modern foundation for building, modernizing, and scaling AI‑powered applications with confidence. Mohsin Shafqat, Director of Software Engineering at NASDAQ, said, “What stood out with HorizonDB is that it aligns closely with how we already think about the problem. Instead of stitching together multiple components, it brings transactional data, vector search, and AI capabilities into a single platform, which simplifies the architecture without forcing a complete rethink.”
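To make the vector search idea concrete, here is a minimal, database-agnostic sketch in Python of what a vector similarity query computes: rank stored rows by cosine similarity to a query embedding. It does not use HorizonDB or any Microsoft API. The row IDs, the two-dimensional embeddings, and the `top_k` helper are all hypothetical and chosen only for illustration.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def top_k(rows, query, k=2):
    """Return the IDs of the k rows whose embeddings are most similar to the query."""
    scored = [(cosine_similarity(emb, query), row_id) for row_id, emb in rows.items()]
    scored.sort(reverse=True)
    return [row_id for _, row_id in scored[:k]]


# Hypothetical rows: an ID mapped to a tiny embedding vector.
rows = {
    "order-1": [1.0, 0.0],
    "order-2": [0.7, 0.7],
    "order-3": [0.0, 1.0],
}
print(top_k(rows, [1.0, 0.1]))
```

A real vector index avoids scanning every row, but the ranking it approximates is the same one this brute-force loop computes.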

Learn more by watching the Microsoft Build session “BRK223 – From rows to reasoning: Designing databases for AI apps and agents” on Tuesday, June 2, 2026, at 2:30 PM PT.

New security and migration tooling for Azure Database for PostgreSQL

As Azure HorizonDB powers a new class of high-scale, mission‑critical applications built for AI, Azure Database for PostgreSQL remains a trusted, open‑source foundation for modernizing and operating existing PostgreSQL workloads.

Today’s updates introduce two meaningful enhancements. First is the Microsoft Defender for Cloud integration, now in preview, which delivers continuous security and compliance assessments to help teams identify misconfigurations and reduce risk. Second, we’ve released new discovery and assessment tooling to help you more confidently plan migrations to Azure Database for PostgreSQL. These tools evaluate Oracle and PostgreSQL environments and provide readiness insights, sizing guidance, and cost estimates. Learn more about this migration tooling in the Microsoft Build on-demand session “OD822 – Smarter PostgreSQL migrations to power modern, intelligent apps.”

Powering intelligent, multi‑agent systems at global scale with Azure Cosmos DB

Azure Cosmos DB is Microsoft’s AI-ready NoSQL and vector database for building responsive applications and intelligent AI experiences at any scale. OpenAI, for example, chose Azure Cosmos DB as its primary operational database “because of its automatic scaling and schema-less flexibility, allowing us to iterate quickly,” said Nick Cooper, senior technical staff member at OpenAI.

At Microsoft Build, we are focused on improving developer productivity and AI quality. The Azure Cosmos DB Linux Emulator is now generally available, enabling developers to build, test, and validate applications locally across Linux, macOS, and Windows without a cloud dependency. New AI capabilities are also now in preview, including semantic reranking, which improves search relevance using built‑in contextual understanding. In addition, a new agent memory toolkit helps developers standardize persistent memory for AI agents using Azure Cosmos DB, Azure Durable Functions, and Microsoft Foundry models. Learn more in the Microsoft Build on‑demand session “OD820 – Designing reliable multi-agent apps with Azure Cosmos DB.”

Unifying databases and Fabric on a single platform

Microsoft Databases can be centrally managed through the new Database Hub in Fabric, currently in private preview, and mirrored into OneLake, bringing operational and analytical data onto a single foundation. From there, you can use Fabric to make it trusted, contextual, and ready for AI.

Building an AI‑ready data foundation with Microsoft Fabric

In the era of AI, data is the fuel, but data alone is not enough. Equally important is how that data is understood: the definitions of customers, orders, products, revenue, and the relationships between them. Today, that understanding is fragmented across customer relationship management (CRM) and enterprise resource planning (ERP) systems, productivity tools, and spreadsheets, and too often it does not travel with the data. Organizations have long relied on people to recreate this context. But as agents take on more responsibility, this gap becomes critical. Without a shared understanding of the business, agents cannot reliably reason, coordinate, or act.

Microsoft IQ addresses this missing layer by unifying enterprise intelligence into a shared foundation built to activate AI agents. It enables consistent reasoning and enterprise-scale impact, rather than isolated interactions, and brings together four interconnected capabilities: Work IQ captures how work happens, Fabric IQ models how the business operates, Foundry IQ enables agents to discover and reuse knowledge, and the new Web IQ, announced today at Microsoft Build, adds real-time global context from the web.

Fabric IQ is central to this system. It powers a continuous operational loop where people and agents observe live signals, reason over shared context, and take governed action in the moment across analytics, operations, and the productivity tools where work happens. It provides three integrated layers of business context:

  1. Unified data: OneLake unifies the organization’s data estate, spanning analytical and operational data into a single, accessible layer.
  2. Business intelligence: Semantic models provide structured, governed representations of that data, which organizations already rely on for trusted business metrics and to analyze their business.
  3. Operational intelligence: Ontologies capture operational context by defining business entities and their relationships so agents can reason in the language of the business. This context can include live signals from Fabric Real-Time Intelligence, enabling organizations and their agents to understand what is happening right now and act in time to change outcomes.
Fabric IQ layered architecture showing Unified Data as the foundation, Business Intelligence (semantic models) above it, and Operational Intelligence (ontologies) on top.
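As a rough illustration of the ontology layer, the sketch below models business entities and the relationships between them as plain Python data, then looks up everything known about one entity. This is not the Fabric IQ API. The `Ontology` class, its methods, and the Customer/Order/Product example are hypothetical names used only to show the shape of the idea.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Relationship:
    source: str
    kind: str
    target: str


@dataclass
class Ontology:
    entities: set = field(default_factory=set)
    relationships: list = field(default_factory=list)

    def relate(self, source, kind, target):
        """Register both entities and the typed relationship between them."""
        self.entities.update({source, target})
        self.relationships.append(Relationship(source, kind, target))

    def facts_about(self, entity):
        """Return every relationship that mentions the entity, as readable statements."""
        return [
            f"{r.source} {r.kind} {r.target}"
            for r in self.relationships
            if entity in (r.source, r.target)
        ]


ontology = Ontology()
ontology.relate("Customer", "places", "Order")
ontology.relate("Order", "contains", "Product")
print(ontology.facts_about("Order"))
```

An agent that receives these statements as context starts with the same vocabulary as every other agent, which is the point of the shared layer.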

This three-tiered foundation helps ensure that every agent starts with the same understanding of the business and can apply it correctly across workflows. But Frontier organizations cannot start at the IQ layer. Building this capability requires a unified data foundation. Microsoft Fabric delivers this through four core capabilities:

  1. Unifying your data estate
  2. Processing and harmonizing data
  3. Curating semantic meaning  
  4. Empowering AI agents to act

Learn more about how Fabric can help you create an AI-ready data foundation in the Microsoft Build on-demand session “OD811 – Powering the next AI frontier with a unified data platform.”

1. Unifying your data estate with Microsoft OneLake

Most organizations struggle to see their entire data estate. It is spread across systems, duplicated in multiple places, and owned by different teams, making it difficult to know what exists, where it lives, and how it connects. Microsoft OneLake brings it together into a single, AI‑ready data lake that unifies your multi‑cloud estate and enables organization‑wide access for analytics and AI.

We are making it easier to connect existing data to OneLake without moving or duplicating it. Shortcuts to SharePoint and OneDrive are now generally available, and the ability to create shortcuts directly from Fabric Data Warehouses is now in preview. We are also adding the preview of workspace-level Azure Private Link support for mirrored data sources.

The general availability of the OneLake catalog in Microsoft Foundry

At the same time, we are making it easier to connect that data to AI. With the recent general availability of the OneLake catalog in Microsoft Foundry, you can discover trusted data, explore rich metadata, and connect it directly to AI solutions. The catalog is embedded within Foundry’s Knowledge experience, making it simple to move from data discovery to AI development in a single workflow.

Learn more about all of these OneLake announcements by watching the Microsoft Build on-demand session “OD815 – Unify your entire data estate on a single, AI-ready data lake.”

2. Process data faster with a new class of GPU-accelerated analytics

Once data is unified, the next challenge is turning it into insights quickly, reliably, and at scale. We are introducing GPU acceleration built directly into Fabric Data Warehouse to unlock a new level of performance without adding complexity.

The research behind this innovation was recently recognized by ACM SIGMOD as the “Best Industry Paper of 2026.” This breakthrough establishes Fabric Data Warehouse as the first fully managed data warehouse to offer GPU acceleration.

By integrating NVIDIA accelerated computing, query acceleration in Fabric Data Warehouse fundamentally changes how fast queries can run. In internal benchmarking conducted in May 2026, the GPU-accelerated Fabric Data Warehouse delivered up to 7x faster performance relative to three comparable external vendors for reporting and application workloads at 64-user concurrency.

This shift reflects a broader change in how modern data systems need to operate. As Ian Buck, Vice President of Hyperscale and HPC at NVIDIA, explains, “AI applications are redefining how a data warehouse needs to perform. As AI agents reason over enterprise data, analytics systems need low-latency performance for many simultaneous users. With NVIDIA accelerated computing and custom CUDA kernels built directly into Microsoft Fabric Data Warehouse, Microsoft is bringing the SQL workflows customers already use into the production AI era.”

Customers are already seeing that impact. At UNC Health, “We’re seeing up to 5x improvement in our query speeds, which allows our teams to spend less time managing performance and more time delivering meaningful insights,” said Shaun McDonald, IT Manager.

Query acceleration works automatically within Fabric, speeding up queries without requiring any query rewrites. The result is consistently low‑latency analytics that power responsive applications, interactive reporting, and agent‑driven analysis, delivering fresher insights and greater confidence in data.

Query acceleration will be available for an early access preview in the next few weeks. You can learn more by watching the Microsoft Build on-demand session “OD813 – Powering modern data analytics in Fabric Data Warehouse.”

3. Curating semantic meaning with Fabric IQ, now generally available

Once data is prepared, the next challenge is not access but understanding. Most organizations lack a shared layer of business context, forcing every agent to relearn how the business works from fragmented data. Fabric IQ, now generally available, addresses this gap.

With data unified in OneLake, Power BI’s industry-leading semantic models then provide structured representations of your data for trusted business intelligence, serving as an ideal foundation for training agents.

Ontologies in Fabric IQ, expected to be generally available in the coming months, extend semantic models by adding operational context. They define business entities, relationships, properties, rules, and actions, and connect to live signals from Fabric Real-Time Intelligence. Operations agents, now generally available, then reason over shared live context, make decisions based on policy, and take action in the moment. Running on the governed foundation of Fabric and integrated with Microsoft Foundry, operations agents move beyond answering questions to driving action and outcomes.

GIF of entities, relationships, properties, actions and business objectives all being created based on products, inventory, stores, shipments, buyers, and suppliers

Announcing the general availability of graph and planning in Fabric

We’re announcing the general availability of graph in Fabric, with general availability of planning in Fabric coming later this month. Available today, graph introduces a highly scalable, relationship‑first model that connects business entities, systems, and signals so teams and agents can understand how changes propagate across the enterprise and act with full context. Planning extends this foundation by enabling organizations to create plans, budgets, forecasts, and scenario models on top of Fabric’s semantic models. Notably, plans in Fabric are not just static outputs. They can be written back into Fabric to drive execution, enabling closed-loop alignment with the same system of data and context.
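One way to picture "how changes propagate" in a relationship-first model is a breadth-first traversal over a dependency graph. The sketch below is a generic Python illustration, not graph in Fabric or its query language. The `impacted` function and the supplier-to-store graph are hypothetical.

```python
from collections import deque


def impacted(graph, start):
    """Return every node reachable from start, in breadth-first order."""
    seen = {start}
    order = []
    queue = deque([start])
    while queue:
        node = queue.popleft()
        for neighbor in graph.get(node, []):
            if neighbor not in seen:
                seen.add(neighbor)
                order.append(neighbor)
                queue.append(neighbor)
    return order


# Hypothetical edges meaning "a change here affects these entities".
graph = {
    "Supplier": ["Shipment"],
    "Shipment": ["Inventory"],
    "Inventory": ["Store"],
    "Store": [],
}
print(impacted(graph, "Supplier"))
```

Here a delay at the supplier reaches shipments first, then inventory, then stores, which is the order in which a team or agent would want to check for downstream effects.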

Extending Fabric IQ to Microsoft Foundry and Agent 365

Today, we are extending Fabric IQ across the agent ecosystem so this shared understanding can be used consistently across every agent and application.

Microsoft Foundry and Agent 365 

Now in preview, ontologies are accessible directly from Microsoft Foundry as knowledge sources, bringing trusted business context into both custom and built-in agent experiences. Also in preview, Fabric IQ is now integrated with Microsoft Agent 365 as a first-party Model Context Protocol (MCP) tool, enabling organizations to ground agents in shared meaning and ensure consistent behavior across their agent estate.

Microsoft 365 Copilot: Cowork and Copilot Chat 

Fabric IQ is also extending into Microsoft 365 Copilot, including Cowork and Copilot Chat. This enables agents to access governed Fabric data, starting with Power BI reports and semantic models, to turn insights into action. Instead of static dashboards, agents can detect changes in key metrics, generate updates, trigger follow-ups, and schedule next steps, all grounded in trusted, governed data. The result is faster, more consistent execution across the organization. These experiences are currently available to customers in Frontier with a Microsoft 365 Copilot license. Join the program today.

GitHub Copilot CLI 

Using Agent Skills for Fabric, Fabric IQ tools and skills for data insights are accessible through GitHub Copilot CLI, bringing governed semantic context directly into the terminal. Now you can also query Power BI reports and semantic models from the command line, grounded in governed semantic context in Fabric IQ. Teams can ask natural language questions about usage, metrics, or customer behavior and get answers grounded in Fabric data directly within their workflow, reducing back-and-forth and accelerating data-driven decisions. 

Learn more about these enhancements in the Build on‑demand session “OD812 – Bringing Enterprise Ontology Directly into the Developer Workflow.”

4. Empowering agents to act with operations agents and Copilot in Fabric

Enterprises are increasingly moving to multi-agent systems made up of specialized agents grounded in specific data or domain expertise. These agents can be reused across multiple systems, making it easier to scale and deliver more consistent outcomes.

Announcing the general availability of operations agents

Microsoft Fabric supports this shift with native agent capabilities, including Fabric data agents and operations agents. I’m excited to share that operations agents are now generally available. These agents are designed to continuously monitor real-time data, detect patterns or anomalies, and act based on predefined business logic.
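The monitor, detect, act pattern described above can be sketched in a few lines of Python. This is a generic illustration under simple assumptions, not how operations agents are implemented: anomalies are flagged with a plain z-score test, and the "business logic" is a callback. The function names, the threshold, and the sample readings are hypothetical.

```python
from statistics import mean, pstdev


def detect_anomalies(readings, threshold=2.0):
    """Return the indexes of readings more than `threshold` standard deviations from the mean."""
    mu = mean(readings)
    sigma = pstdev(readings)
    if sigma == 0:
        return []
    return [i for i, value in enumerate(readings) if abs(value - mu) / sigma > threshold]


def act(readings, on_anomaly):
    """Apply the predefined action to each anomalous reading and collect the results."""
    return [on_anomaly(i, readings[i]) for i in detect_anomalies(readings)]


# Hypothetical sensor readings with one obvious outlier.
readings = [20.1, 19.8, 20.3, 20.0, 35.0, 20.2]
alerts = act(readings, lambda i, value: f"reading {i} out of range: {value}")
print(alerts)
```

In a real deployment the readings would arrive as a live stream and the action would be governed by policy, but the separation between detection and action stays the same.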

Apply agentic analytics in Power BI

In addition, we recently released open-source Agent Skills for Fabric, designed to make it easier for AI tools and agents to interact directly with Fabric. These capabilities now extend to Power BI, enabling developers and analysts to create reports using natural language or even a screenshot of the desired outcome, significantly accelerating how insights are built and shared.

Copilot in Power BI can now also modify semantic models, with built-in recommendations that improve performance and make models more AI-ready. This helps teams iterate faster and deliver insights more quickly without leaving the tools they already use.

Learn more by reading the Power BI blog or by watching the Microsoft Build on-demand session “OD817 – Agentic analytics with Power BI and Microsoft Fabric.”

Watch these announcements from Microsoft Build

If you’re interested in learning more and seeing live demos of these announcements, I encourage you to watch the Microsoft Databases and Microsoft Fabric sessions at Microsoft Build, available in the Microsoft Build catalog.


The post Microsoft Build 2026: Building agentic apps with Microsoft Fabric and Microsoft Databases appeared first on Microsoft Azure Blog.








Introducing Reader Chat, Powered by Jetpack Search


Site search is great when readers know what to type. A lot of the time, they don’t.

They land on a post with a question. They skim an archive looking for a thread. They want to know whether you’ve written about a topic before, but they don’t know the exact keyword that will get them there.

Reader Chat gives them a more natural way in.

It adds a small chat experience to your public site, so readers can ask questions and get answers based on your site content. On a post, it can use that post as context. On the home page or an archive, it can help readers explore the site overall.

See Reader Chat in action

Reader Chat suggests questions based on the page, answers from the site’s content, and offers follow-up questions so readers can keep exploring.

A better way into your archive

The more useful your archive gets, the harder it can be for readers to find the right entry point.

Reader Chat turns that archive into something readers can ask. Instead of guessing the perfect search term, a visitor can ask things like:

  • What’s the main takeaway from this post?
  • What else has this site published about this topic?
  • Can you recommend a good post to read next?
  • How does this idea connect to earlier posts?

Search is still the right tool when someone wants a results page, filters, sorting, and a list of matching posts. Reader Chat is for the next kind of question: the one a reader has while they’re already reading.

Suggestions that fit the page

Reader Chat starts with suggested questions, so visitors don’t have to begin from a blank box.

On a post, suggestions can focus on the article in front of them: the main point, key details, or related context. On the home page or an archive, suggestions can help them explore recent posts, learn what the site covers, or find something worth reading next.

After a reader gets an answer, follow-up suggestions help keep the conversation moving without sending them away from your site.

Give Reader Chat a voice

Reader Chat can use your site’s Guidelines to shape how it responds.

That means site owners can set the tone, boundaries, and style they want Reader Chat to follow. A travel site might want warm, practical answers. A technical site might want short, precise ones. A personal site might want responses that feel closer to the author’s own voice.

Reader Chat still answers from your content, but Guidelines help it respond in a way that feels like it belongs on your site.

How to enable Reader Chat

Reader Chat is opt-in. To turn it on:

  1. Go to your site’s WP Admin.
  2. Open Jetpack → Search.
  3. Go to the Settings tab.
  4. Turn on Enable Reader Chat.
  5. Save your changes, then visit a public post on your site to try it.

Reader Chat appears only on eligible public sites with a Jetpack Search plan. It won’t load on Coming Soon or unlaunched sites.

How to set Guidelines

Guidelines are optional, but they’re useful if you want Reader Chat to match your site’s voice and expectations.

  1. Go to your site’s WP Admin.
  2. Open Gutenberg → Experiments.
  3. Enable Guidelines.
  4. Open Settings → Guidelines.
  5. Use Site to describe what your site is about, who it’s for, and what readers usually come there to find.
  6. Use Additional for Reader Chat-specific guidance, like tone, answer length, topics to avoid, or when to point readers back to posts.
  7. Click Save guidelines.

Example:

Keep answers warm, practical, and concise. Use the site’s existing posts as the source of truth. When possible, point readers to related posts they can keep reading. Avoid sounding formal or promotional.

Availability

Reader Chat is available in Preview on eligible public sites with Jetpack Search. It’s live now on WordPress.com sites with a Jetpack Search plan, and on self-hosted WordPress sites running Jetpack 15.9 or later.

Search that talks back

Search helps readers find the right post. Reader Chat helps them ask the next question once they’re there.

If your site has a deep archive, detailed guides, or posts that connect across topics, Reader Chat gives visitors a faster way to follow those threads and keep reading.






A Developer’s Guide to Managing Models, Cost and Quality in Microsoft Foundry


The hardest part of building AI systems today is no longer getting access to a capable model. It is knowing how to choose, validate, optimize, and operate the right model across the full lifecycle of a real application.

Take a retrieval-augmented generation (RAG)-based customer support copilot or a tool-calling agent that helps employees complete business workflows. In a prototype, it may be enough to pick a strong model, connect a few data sources, and get a useful response. In production, the system needs to retrieve the right context, call the right tools, meet quality and safety thresholds, stay within latency targets, and run at a cost the business can sustain.

Models evolve, costs shift, and production requirements often arrive after the first version is already working. Success depends less on choosing the most powerful model and more on building a disciplined operating approach around the application.

That is where Microsoft Foundry comes in: a unified platform to select, evaluate, optimize, operate, and continuously improve AI applications at production scale.

What’s new

Microsoft Foundry continues to expand the model ecosystem and operating surface for developers building production AI systems.

Fireworks AI on Microsoft Foundry is now generally available, giving developers access to production-grade open model inference through a single Azure endpoint, with enterprise service-level agreements (SLAs) and zero-setup onboarding.

Foundry is also adding new model families and capabilities across modalities, including Microsoft AI models, partner models, open-source models, custom models, and post-trained variants. Together, these updates give developers more choice while keeping selection, evaluation, deployment, and operations in one consistent workflow.

The challenge is no longer access. It is operations.

In a prototype, the questions are simple: Can the model answer the prompt? Can it connect to my data? Can it complete the happy path?

In production, the questions change. Which model fits each task? How do I validate it on my own data? What latency budget does this experience require? How much throughput do I need at peak? What happens when quota is constrained, costs spike, or a newer model becomes available? How do I monitor quality, detect eval drift, roll back safely, and prove the system is governed?

Agentic systems often fail when the model is mismatched, evaluation is incomplete, costs run unchecked, or governance arrives too late. Teams that rely on a single provider face another risk: lock-in, with no escape hatch when a model degrades, pricing changes, or capacity becomes constrained.

Foundry is built on the opposite philosophy. It is a model-agnostic platform spanning Microsoft, open-source, and independent software vendor (ISV) partner models, all on the same operating surface.

The answer is to treat model selection and optimization as a continuous operating discipline: 

Model optimization loop showing how teams select, evaluate, optimize, operate, and improve models over time

1. Select the right model for the task

Model selection is about workload fit, not leaderboard rank. Before choosing a model, define the task contract: what the model needs to do, what good looks like, what constraints it must operate within, and which failure modes are unacceptable.

A routing step may need low latency. A policy question may need grounded reasoning with citations. A coding agent may need deeper reasoning and tool use. A customer-facing copilot may need strong safety boundaries, predictable latency, and cost efficiency at scale.
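A task contract can be written down as data before any model is chosen. The sketch below is one possible shape, assuming a contract made of a latency ceiling, a cost ceiling, and a minimum quality score from your own evaluation. The `TaskContract` and `Candidate` classes, the field names, and the numbers are hypothetical and are not part of Foundry.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TaskContract:
    task: str
    max_latency_ms: float
    max_cost_per_1k_requests: float
    min_quality: float  # score in [0, 1] from your own evaluation


@dataclass(frozen=True)
class Candidate:
    name: str
    latency_ms: float
    cost_per_1k_requests: float
    quality: float


def violations(contract, candidate):
    """List which constraints of the contract the candidate breaks."""
    problems = []
    if candidate.latency_ms > contract.max_latency_ms:
        problems.append("latency")
    if candidate.cost_per_1k_requests > contract.max_cost_per_1k_requests:
        problems.append("cost")
    if candidate.quality < contract.min_quality:
        problems.append("quality")
    return problems


# Hypothetical contract for a routing step, and two made-up candidates.
routing = TaskContract("ticket routing", max_latency_ms=300,
                       max_cost_per_1k_requests=2.0, min_quality=0.9)
fast = Candidate("hypothetical-small", latency_ms=180,
                 cost_per_1k_requests=0.5, quality=0.93)
slow = Candidate("hypothetical-large", latency_ms=1200,
                 cost_per_1k_requests=12.0, quality=0.97)
print(violations(routing, fast))
print(violations(routing, slow))
```

The higher-quality candidate fails this contract on latency and cost, which is the workload-fit point: the stronger model is not automatically the right one for a routing step.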

A simple model selection framework:

Workload need | Favor this approach | Why
--- | --- | ---
Classification, routing, extraction, or high-volume chat | Smaller, lower-latency model | Keeps cost and latency low
Complex reasoning, coding, or planning | Stronger reasoning model | Improves quality for harder tasks
Image, speech, voice, or physical AI | Modality-specific model | Matches the model to the input and output type
Mixed workloads with different complexity | Model Router | Routes each request based on quality, cost, and latency
Domain-specific behavior, tone, or format | Fine-tuned or custom model | Improves consistency for your scenario

Effective model choice depends on four dimensions: capability, safety, latency, and cost.

Foundry helps developers make these tradeoffs through a broad model ecosystem and a consistent operating surface. Developers can access Microsoft models, leading base models, partner models like Fireworks AI, open-source models, custom models, and post-trained variants through one selection, evaluation, and deployment workflow.

Developer tip: For developers who want to bypass manual selection, Foundry provides Model Router in Foundry Models. Model Router automatically routes each request to the most appropriate model based on workload characteristics, cost targets, and latency requirements.
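To show the general idea behind per-request routing, here is a small Python sketch that picks the cheapest model meeting a request's capability and latency needs. It is an assumption-laden illustration, not Model Router's actual logic or API. The model names, capability tiers, costs, and latencies are invented.

```python
# Hypothetical model catalog: capability tier, relative cost, and typical latency.
MODELS = [
    {"name": "small-model", "capability": 1, "cost": 1.0, "latency_ms": 200},
    {"name": "medium-model", "capability": 2, "cost": 4.0, "latency_ms": 600},
    {"name": "large-model", "capability": 3, "cost": 15.0, "latency_ms": 1500},
]


def route(required_capability, latency_budget_ms, models=MODELS):
    """Return the cheapest model that is capable enough and fast enough, or None."""
    eligible = [
        m for m in models
        if m["capability"] >= required_capability
        and m["latency_ms"] <= latency_budget_ms
    ]
    if not eligible:
        return None
    return min(eligible, key=lambda m: m["cost"])["name"]


print(route(required_capability=1, latency_budget_ms=1000))
print(route(required_capability=2, latency_budget_ms=1000))
print(route(required_capability=3, latency_budget_ms=1000))
```

The last call returns `None` because no sufficiently capable model fits the latency budget. A production router needs an explicit policy for that case, such as relaxing the budget or falling back to a default model.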

2. Validate with your own evals and data

Benchmarks are not enough. A model that leads a public leaderboard may still underperform on your prompts, your data, your users, and your business rules. Production confidence comes from evaluating against the workloads your application will actually run.

With Foundry, developers can bring their own evaluation inputs, including CSV or JSONL datasets with prompts, expected outputs, labels, or ground-truth answers. They can run side-by-side comparisons across models and prompts, evaluate agents and multi-step workflows, and inspect results across datasets, traces, and production-like scenarios.

Built-in quality and safety evaluators help measure signals such as relevance, groundedness, coherence, fluency, safety, and policy adherence. Custom evaluators can capture application-specific rules, formats, and business logic.

A strong evaluation covers:

  • Quality: Did the model complete the task correctly?
  • Accuracy and groundedness: Did it produce reliable answers based on the right context?
  • Safety: Did it follow policies and avoid unacceptable responses?
  • Performance: Did it meet latency, throughput, and reliability requirements?
  • Cost: Did it deliver the right outcome at the right price?

Evaluation should run continuously as new model versions, fine-tuned variants, agent changes, or new model families become available.

Developer tip: Define success criteria before opening the model catalog. Criteria-first evaluation prevents anchoring on model reputation instead of workload fit.
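As a rough illustration of criteria-first evaluation on your own data, the sketch below scores two candidate models against a small JSONL dataset. It is not the Foundry evaluation API: `load_jsonl`, `evaluate`, the dataset fields (`prompt`, `expected`), and the two stand-in "models" are all hypothetical, and exact-match accuracy is only one possible success criterion.

```python
import json
import time
from typing import Callable


def load_jsonl(text: str) -> list[dict]:
    """Parse JSONL: one JSON object per non-empty line."""
    return [json.loads(line) for line in text.splitlines() if line.strip()]


def evaluate(model: Callable[[str], str], dataset: list[dict]) -> dict:
    """Run every prompt through `model`; report exact-match accuracy and mean latency."""
    correct = 0
    latencies = []
    for row in dataset:
        start = time.perf_counter()
        output = model(row["prompt"])
        latencies.append(time.perf_counter() - start)
        if output.strip().lower() == row["expected"].strip().lower():
            correct += 1
    return {
        "accuracy": correct / len(dataset),
        "mean_latency_s": sum(latencies) / len(latencies),
    }


# Hypothetical dataset; in practice this would be read from your own JSONL file.
DATASET = load_jsonl(
    '{"prompt": "Classify: refund request", "expected": "billing"}\n'
    '{"prompt": "Classify: password reset", "expected": "account"}\n'
)


# Stand-in "models": plain functions, not real endpoints.
def model_a(prompt: str) -> str:
    return "billing" if "refund" in prompt else "account"


def model_b(prompt: str) -> str:
    return "billing"


report = {
    name: evaluate(fn, DATASET)
    for name, fn in [("model_a", model_a), ("model_b", model_b)]
}
```

Comparing `report["model_a"]` with `report["model_b"]` gives a side-by-side view on the same inputs, which is the point of the exercise regardless of which tooling runs it.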

3. Optimize cost and performance

Cost is a first-class architectural concern, not an afterthought. In prototypes, it may be acceptable to send every task to the most capable model. In production, that approach breaks down quickly.

A simple classification task, a RAG response, a long-context reasoning workflow, and a multi-step agentic process should not always use the same model or deployment strategy.

Foundry gives developers levers to optimize across quality, cost, and latency at the system level:

  • Intelligent routing: Send each task to the right model based on complexity and budget.
  • Batching: Use asynchronous processing for workloads that do not require real-time responses.
  • Caching: Avoid paying repeatedly for identical or near-identical requests.
  • Provisioned throughput: Use dedicated capacity for predictable performance at scale.
  • Quota management: Scale more predictably with quota tiering, global customer quota, and data zone customer quota.
  • Model optimization: Use model compression, fine-tuning, or distillation where appropriate.

Fireworks AI on Foundry is now generally available, giving developers access to a high-performance open model catalog through a single Azure endpoint, with enterprise SLAs, no separate infrastructure, and no separate contracts.

Developer tip: Profile cost by task type before optimizing globally. Routing decisions are workload-specific, not one-size-fits-all.
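One way to profile cost by task type is to aggregate token usage from request logs. The sketch below assumes a hypothetical log format (`task`, `model`, `input_tokens`, `output_tokens`) and made-up prices, and it applies a single rate to input and output tokens for simplicity; real pricing usually distinguishes the two.

```python
from collections import defaultdict

# Hypothetical prices in USD per 1,000 tokens; substitute your real rates.
PRICE_PER_1K = {"small": 0.0005, "large": 0.01}


def cost_by_task(requests: list[dict]) -> dict[str, float]:
    """Sum estimated spend per task type from a list of request records."""
    totals: dict[str, float] = defaultdict(float)
    for r in requests:
        tokens = r["input_tokens"] + r["output_tokens"]
        totals[r["task"]] += tokens / 1000 * PRICE_PER_1K[r["model"]]
    return dict(totals)


# Hypothetical log: every task currently goes to the "large" model.
LOG = [
    {"task": "classification", "model": "large", "input_tokens": 400, "output_tokens": 100},
    {"task": "classification", "model": "large", "input_tokens": 600, "output_tokens": 900},
    {"task": "planning", "model": "large", "input_tokens": 3000, "output_tokens": 1000},
]

profile = cost_by_task(LOG)
```

A profile like this shows which task types dominate spend, and therefore where routing to a smaller model would matter most.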

4. Operate at scale with enterprise confidence

Deploying an endpoint is not the same as operating a production AI system. Teams need to understand how the system behaves, enforce policies, monitor usage and cost, test model changes safely, and roll back when quality or performance regresses.

Foundry brings these operating capabilities into one surface: versioning, SLA-backed reliability, security, governance, access controls, audit logging, usage monitoring, and controlled upgrades.

Teams can monitor token usage and throughput, inspect logs and traces, evaluate model and agent behavior, enforce policies, and compare changes before rolling them out broadly. As new model versions become available, they can test against evaluation datasets and traces, validate quality, latency, and cost impact, and reduce risk with versioning and rollback strategies.

The Fireworks AI on Foundry generally available (GA) release is a concrete example of this operating model, with enterprise SLAs, provisioned throughput unit (PTU) Data Zone support, SOC2 readiness, and the same access controls and audit logging that govern Foundry.

Production adopters span AI-native and traditional enterprise workloads, including Perplexity, Motif, UiPath, and StackBlitz. During preview, the platform processed more than 176 billion tokens across 17 S&P 500 enterprises.

Developer tip: Treat model upgrades like dependency upgrades: test against baselines, stage rollouts, monitor regressions, and keep a rollback plan.

5. Continuously improve as models and workloads evolve

AI systems are dynamic. Models improve, workloads shift, user behavior changes, pricing evolves, and new model families arrive. The best system today may not be the best system six months from now.

That is why the lifecycle loop matters:

  • Select the right model for the task.
  • Evaluate it against your own data and production baselines.
  • Optimize for quality, cost, latency, and throughput.
  • Operate with governance, observability, and reliability.
  • Improve as new models, tools, and customization options emerge.

For engineering teams, every model, prompt, tool, agent, or workflow change should be treated like a production change. New model versions should be tested automatically against regression datasets, production traces, and known edge cases before rollout.

A model may improve quality but increase latency, reduce cost but weaken groundedness, or perform better on common cases while regressing on high-risk scenarios. Automated evaluations help teams detect those tradeoffs early.

Developer tip: Automate your evaluation pipeline so every new model version is compared against production baselines for quality, safety, latency, throughput, and cost before deployment.
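A minimal sketch of such a gate, assuming you already have aggregate metrics for the production baseline and the candidate. The metric names, the example numbers, and the 2% relative tolerance are illustrative assumptions, not Foundry defaults.

```python
# Metrics where a larger value is better; everything else is treated as
# "smaller is better" (latency, cost).
HIGHER_IS_BETTER = {"quality", "safety", "throughput"}


def regressions(baseline: dict, candidate: dict, tolerance: float = 0.02) -> list[str]:
    """Return metrics where `candidate` is worse than `baseline` by more than `tolerance` (relative)."""
    failed = []
    for metric, base in baseline.items():
        new = candidate[metric]
        if metric in HIGHER_IS_BETTER:
            worse = new < base * (1 - tolerance)
        else:
            worse = new > base * (1 + tolerance)
        if worse:
            failed.append(metric)
    return failed


# Hypothetical numbers: the candidate improves quality and cost but is slower.
baseline = {"quality": 0.90, "safety": 0.99, "latency_ms": 800, "throughput": 50, "cost_per_1k": 0.40}
candidate = {"quality": 0.93, "safety": 0.99, "latency_ms": 950, "throughput": 51, "cost_per_1k": 0.35}

failed = regressions(baseline, candidate)
deploy = not failed
```

Here the gate blocks deployment on latency alone, which is exactly the kind of tradeoff that is easy to miss when only quality is compared.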

What this means for developers

The next phase of AI development will not be won by teams that simply have access to the biggest models. It will be won by teams that know how to operate models well.

That means choosing by workload fit, validating with real data, optimizing cost and performance, deploying with governance, and improving as the landscape shifts.

Microsoft Foundry is designed for exactly this reality: a model-agnostic platform spanning Microsoft, open-source, and ISV models, all on one operating surface. No lock-in. No re-architecture. No guesswork.

The future of AI development is not about guessing which model might work. It is about building an operating discipline that lets you know.

Get started

The post A Developer’s Guide to Managing Models, Cost and Quality in Microsoft Foundry appeared first on Microsoft Azure Blog.


Foundry IQ: Build smarter agents faster with unified knowledge and serverless retrieval


Developers building agent fleets keep hitting the same pattern: the agent logic is ready, but the knowledge infrastructure underneath is complex to do well. Getting to production means solving for stability, scale, data access, answer quality, security, and content ingestion all at once. Today, we are enabling developers to have faster impact by simplifying the enterprise knowledge platform.

Your company’s IQ, powered by Microsoft IQ, is the collective intelligence locked in documents, emails, meetings, operational data, and the live web. This is where your true competitive edge lives. Foundry IQ grounds agents with the knowledge from these sources and continuously improves based on your business goals.

The announcements today are designed to help customers provision knowledge bases faster, unify enterprise and external sources, and expose that knowledge through the Foundry IQ Model Context Protocol (MCP) server for any agent framework or MCP-compatible hosts.

What’s new

  • Foundry IQ Serverless in preview: Provision instant, no-friction context retrieval with scale to zero pricing. Developer tier now available in public preview with more coming soon. Docs | Create a Foundry IQ resource
  • New knowledge sources in preview: Ground agents across Work IQ, Fabric IQ (including Data agents and Ontology), File Search, Azure SQL, and MCP through a multi-source knowledge base, with no custom integrations required. Docs | Cookbook
  • Web IQ in Foundry IQ is now available: Extend agent context to the web and to marketplace data while honoring publisher preferences, with sub-165 ms latency and zero data retention. Blog | Website
  • Foundry IQ knowledge bases are generally available: Ship production agents on a fully SLA-backed knowledge layer with stable APIs, compliance certifications, and the Foundry IQ MCP server for any MCP-compatible host. Docs | Quickstart
  • Agentic retrieval quality improvements: The latest updates to the agentic retrieval engine improve answer performance across datasets, effort tiers, and model sizes while spending fewer tokens. Blog | Quickstart
  • Data pipeline updates in preview: Automatic layout-aware ingestion of documents, image enrichment, and broader SharePoint indexing ground agents in complete documents, not just raw text. Blog | Quickstart
  • Security updates in preview: New controls for encryption, permissions sync, and sensitivity-label governance keep enterprise policy intact as content flows into agents. Blog | Quickstart

Foundry IQ Serverless in public preview

We know agent workloads are bursty and event-driven: an agent might execute hundreds of steps in seconds, then go idle for hours. Serverless eliminates infrastructure friction: no clusters to manage, no capacity to reserve, no idle costs. Go from zero to production fast, with instant retrieval-augmented generation (RAG) and state-of-the-art retrieval quality built in.

Foundry IQ Serverless (Developer tier) is available in public preview. You are billed for compute resources and storage used, and the service scales to zero when idle.

Serverless tiers use Compute Units (CU) to measure resource consumption, including CPU utilization, memory and storage I/O. Usage is calculated each minute in increments of 0.25 CUs.

For large-scale serverless deployments, contact us for additional options.

Capability | Developer tier
Compute usage | $0.24 CU / hour
Indexed storage | Up to $0.29 GB / month; GB cost is region dependent
Indexed storage per index | 1 GB / index
Indexes per service | 30 indexes / service
Services per subscription per region | 5 services / subscription / region

Billing is expected to begin in late 2026 with details provided at least 30 days in advance. Customers using Serverless Developer won’t be charged before billing is enabled. Current Compute Unit measurements are estimates only and subject to change before billing is enabled.
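For a rough sense of how per-minute metering could translate into cost, here is a back-of-the-envelope sketch. It assumes that each minute's reading is rounded up to the next 0.25 CU and that the rate is $0.24 per CU per hour; the rounding rule is an assumption, and the text above notes that current measurements are estimates subject to change.

```python
import math

CU_INCREMENT = 0.25        # metering increment stated above
PRICE_PER_CU_HOUR = 0.24   # Developer tier compute rate from the table


def estimate_compute_cost(cu_per_minute: list[float]) -> float:
    """Estimate compute cost in USD from one CU reading per minute."""
    # ASSUMPTION: each minute's usage is rounded UP to the next 0.25 CU.
    billed = [math.ceil(cu / CU_INCREMENT) * CU_INCREMENT for cu in cu_per_minute]
    cu_hours = sum(billed) / 60
    return cu_hours * PRICE_PER_CU_HOUR


# Hypothetical hour: 30 busy minutes at 0.6 CU, then 30 idle minutes (scaled to zero).
usage = [0.6] * 30 + [0.0] * 30
cost = estimate_compute_cost(usage)
```

Under these assumptions the busy minutes are metered at 0.75 CU each, and the idle half hour contributes nothing.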

Next steps: create a Foundry IQ Serverless resource in the Foundry portal.

New knowledge sources in preview

How do I give an agent access to organizational knowledge and structured business data without building a custom connector for every system?

Bringing enterprise knowledge into agent workflows often means stitching together custom integrations across each data source. Developers must account for different data formats, permission models, retrieval patterns, and source-specific logic before an agent can reliably use that knowledge.

Foundry IQ simplifies this by bringing enterprise content and structured systems into a single knowledge base for multi-source, agentic retrieval. Developers can give agents access to that knowledge without building and maintaining separate connectors or source-specific retrieval strategies.

New knowledge sources in preview:

  • Work IQ brings organizational signals like emails, meetings, files, and Teams messages into one enriched, AI-ready source, all while respecting user permissions. Agents can answer questions about how the organization operates, what decisions were made, and what is top of mind for teams.
  • Fabric IQ lets agents query data agents and company ontologies: formal models of business entities, relationships, and rules linked to live data in OneLake and a specialized semantic layer. This returns structured answers alongside unstructured document context for a query.
  • File Search allows you to directly upload files to a knowledge base.
  • Azure SQL brings structured relational data into a knowledge base.
  • MCP Server connects knowledge served over the Model Context Protocol.
A Microsoft Foundry interface showing knowledge source selection for Foundry IQ, with options for connecting enterprise data sources into a knowledge base.

Next steps: use the Foundry IQ Forgebook to try out additional knowledge sources.

Microsoft Web IQ in Foundry IQ now available

When an answer needs fresh, real-world context, how do I reach the open web without paying a latency or compliance penalty?

Microsoft Web IQ is available in limited access through the Foundry IQ MCP knowledge source. It gives agents access to external retrieval across web, news, images, video, and shopping sources while honoring publisher preferences. It is designed for large language model (LLM) workflows rather than traditional search pages, with industry-leading low-latency ranking.

Combined with Foundry IQ, agents can plan, search, reason, and synthesize answers that draw on both internal knowledge and real-world external context in one retrieval engine.

Next steps: read the blog announcement for Microsoft Web IQ.

Knowledge bases in Foundry IQ are generally available

What does it actually take to move a prototype into production?

Production means guarantees: stable contracts, predictable performance, and security that holds under audit. Foundry IQ knowledge bases, select knowledge sources, and security capabilities are generally available, with full SLA coverage, compliance certifications, stable APIs, and enterprise-grade network isolation with identity and policy enforced by default.

What is included in GA:

  • Knowledge bases: agentic retrieval references, output and activity logs, the Foundry IQ MCP server, and minimal retrieval reasoning effort.
  • Foundry IQ MCP server: exposes Foundry IQ knowledge bases as a remote MCP server, making them accessible from any MCP-compatible host or client, including Claude, ChatGPT, LangChain, and the Microsoft Agent Framework. Network isolation, document-level security, cross-source ranking, and agentic retrieval all work over the open MCP standard, making it available for the broader agent ecosystem.
  • Knowledge sources: Azure Blob Storage (with a status API to check indexing progress), search indexes, Web, and OneLake.
  • Security: network isolation and managed identity support.

“We’ve been using Foundry IQ in our research and prototyping work, and the reusable knowledge base approach has cut a lot of the setup overhead we’d normally expect. Being able to ground agents in trusted enterprise content from day one, without rebuilding retrieval logic each time, has made early-stage experimentation noticeably faster and higher quality.” — Jane Chen, Lead AI Developer, Baringa Partners

Next steps: use the Mastering Foundry IQ cookbook to get started building with the Foundry IQ MCP server.

Agentic retrieval quality improvements

The latest retrieval enhancements improved our answer quality benchmarks by up to 20%, across our evaluated datasets, effort tiers, and model sizes. Compared to single-shot RAG, knowledge bases improved recall by up to 54%.

Foundry IQ improved its iterative agentic retrieval loop to batch queries more effectively, surface more relevant passages via semantic ranker, and apply server-side token caching to reduce redundant consumption across multi-turn conversations. The result is meaningfully fewer tokens spent while beating previous benchmarks on answer quality.

Next steps: read our blog for more on the latest evaluations and Foundry IQ benchmarks.

Security updates in preview

How do I keep enterprise data permissions intact as content flows into agents?

Security belongs at the data layer, not approximated in application code. Several security capabilities are now in preview, including cross-tenant customer-managed keys (CMK) using federated identity credentials (eliminating shared secrets), Purview sensitivity-label auditing, incremental SharePoint permissions sync, APIM support for Foundry model integrations, and surfacing Purview sensitivity labels inside knowledge sources so label-based access controls are honored end to end.

Private connectivity between Foundry IQ and Foundry products, via Shared Private Link and Network Security Perimeter, is generally available.

“By integrating Foundry IQ, we provide a managed, permission-aware business context layer that connects marketing and brand knowledge into every agent so they can access the right information, at the right time, with the right governance.” — Andrei Pop, Director of PM, Innovation, Sitecore

Next steps: read more about the latest Foundry IQ security announcements.

Data pipeline updates in preview

How do I make sure agents are grounded in the whole document (tables, diagrams, and images), not just the raw text?

Ingestion quality sets the ceiling on retrieval quality. New data pipeline capabilities in preview include first-class SharePoint indexing for ASPX pages and Lists alongside document libraries, and document enrichment that processes images and serves them at query time in knowledge bases, so agents and users can reference original visuals and ask follow-up questions about them. We are also introducing Azure Content Understanding chunking with image verbalization, a layout-aware ingestion pipeline that converts diagrams, charts, and scanned images into meaningful text so agents are grounded in complete, semantically accurate representations of source documents.

Next steps: read Foundry IQ’s data pipeline deep dive blog post.

Get started today

Build once, reuse everywhere: Foundry IQ enables you to ground multiple agents with the same knowledge base, connecting and unifying data from anywhere. Foundry IQ is designed for agent workloads to deliver better results from your company’s IQ. With Foundry IQ, accelerate agent delivery, deliver context without blind spots, and ensure every answer respects your organization’s security by default.

The easiest way to explore Foundry IQ is through the Microsoft Foundry portal. From there you can create a knowledge base, access the documentation, and follow the Microsoft Foundry Learn courses, all in a few clicks.

Be sure to check out the latest news from Foundry IQ at Microsoft Build 2026:

  • BRK246: Foundry IQ: Fuel agents with enterprise knowledge and agentic retrieval
  • LAB532: From data to context: agent-ready knowledge with Foundry IQ
  • BRK240: Build context-aware agents: From data to decisions with Microsoft IQ
  • DEM331: Turn APIs, tools, and data into real agent velocity

The post Foundry IQ: Build smarter agents faster with unified knowledge and serverless retrieval appeared first on Microsoft Azure Blog.


How to stop AI hallucinations in enterprise RAG systems (a complete guide)


Retrieval-Augmented Generation (RAG) does not solve AI hallucinations. Instead, it just moves the failure point from the language model to the retrieval pipeline — where poor chunking, weak embeddings, outdated documents, and low-confidence search results quietly produce confident but incorrect answers.

From the Air Canada chatbot lawsuit to the infamous Chevy dealership pricing fiasco, this article breaks down the six real reasons RAG systems fail in production – and the five architectural patterns high-performing AI teams use to make them trustworthy, grounded, and production-ready.

The journey of enterprise AI often begins with a celebration – a Retrieval-Augmented Generation (RAG) chatbot passes internal testing and ships to production. That’s all good, but then the messy reality of user interaction hits. A customer asks about a discount, the bot confidently promises a refund that doesn’t exist, and you have yourself a problem.

Case in point: Air Canada, which was embroiled in the industry’s first known legal reckoning. A customer asked Air Canada’s chatbot about bereavement fares and was told he could apply for a refund, retroactively, within 90 days. In reality, though, Air Canada’s policy required the discount to be applied at the time of booking, so his refund was denied.

He sued and, in the resulting lawsuit, Air Canada argued an extraordinary defense: that the chatbot was a “separate legal entity” responsible for its own actions. The court disagreed, finding that a chatbot is merely an extension of a company’s digital presence. It ruled that this was “negligent misrepresentation”: the company is liable, not the bot.

If RAG doesn’t solve hallucination – just what exactly does it do?

So, if RAG doesn’t solve hallucination, what does it do? It changes the source of the hallucination. In a naive system, the model hallucinates directly, whereas in a RAG system, hallucination usually originates in the retrieval layer. Either the retriever provided the wrong context or the model failed to reconcile contradictory snippets. In both cases it is primarily a retrieval and grounding problem, not an inherent flaw of the generator.

Figure: The two parallel pipeline paths: the ingestion pipeline (top) and the query pipeline (bottom)

Six reasons why RAG systems still hallucinate

When a RAG system fails, it usually fails silently. There’s no 404 error – the failure typically manifests as a perfectly formatted, confident lie. These “confidently wrong” answers have plagued early-stage deployments of RAG systems, but the root cause of failure can almost always be traced to one of six failure points in the pipeline architecture. Let’s take a look at what they are.

Poor chunking and semantic fragmentation

Most teams start out by chunking text into fixed-size windows, e.g. every 500 characters (or tokens). That approach is semantically ‘naïve’: it can split a rule from its key exception. If a document states “Refunds are allowed unless the ticket was purchased during a promotional sale,” a fixed-size split might put the permission (“Refunds are allowed”) in one chunk and the exception in another.

The retriever pulls the first chunk, and the large language model (LLM) confidently hallucinates a universal refund policy, because the constraint is physically missing from its context.
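The failure is easy to reproduce. The sketch below applies a naive fixed-size splitter to the refund sentence; the 20-character window is chosen only to make the split visible on such a short string, and `fixed_size_chunks` is a hypothetical helper rather than any particular library's splitter.

```python
def fixed_size_chunks(text: str, size: int) -> list[str]:
    """Split text every `size` characters, ignoring sentence and clause boundaries."""
    return [text[i:i + size] for i in range(0, len(text), size)]


policy = "Refunds are allowed unless the ticket was purchased during a promotional sale."
chunks = fixed_size_chunks(policy, size=20)

# The first chunk holds the permission; the exception starts in the second.
first, second = chunks[0], chunks[1]
```

A retriever that returns only `first` hands the model "Refunds are allowed" with no trace of the exception.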


Weak (or mismatched) embedding models

Here’s a very common silent failure. You might be ingesting highly specialized legal or technical documentation while using a generic embedding model, one that lacks the nuance to differentiate between similar-looking but functionally distinct terms.

If you then update your embedding model version halfway through an ingestion cycle without re-indexing legacy data, you’ve created an ‘embedding space mismatch’. Simply put: the query and stored documents are speaking two different mathematical dialects.

The limits of cosine similarity

Cosine similarity measures the geometric angle between two vectors – a proxy for topical similarity, not factual relevance.

A query about “Product Version 3.2” might return a high-scoring chunk for “Version 3.1” because the two texts are 95% identical. The search engine sees the high score and places the wrong version at the top. This “specifically wrong” retrieval is a primary driver of hallucinations in technical support bots.
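A toy illustration of the effect, using word-count vectors as a stand-in for real embeddings (an assumption made to keep the example self-contained; a learned embedding model behaves differently in detail but can show the same bias toward near-duplicate text). The query and document strings are invented.

```python
import math
from collections import Counter


def bow(text: str) -> Counter:
    """Stand-in 'embedding': a bag-of-words count vector."""
    return Counter(text.lower().split())


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


query = bow("how to install product version 3.2")
right_doc = bow("install guide for product version 3.2 on windows and linux")
wrong_doc = bow("how to install product version 3.1")

score_right = cosine(query, right_doc)
score_wrong = cosine(query, wrong_doc)
```

The wrong-version document wins because it shares almost every token with the query; the single token that matters (the version number) barely moves the angle.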

Context overload – and the ‘lost in the middle’ effect

It’s tempting to increase the number of retrieved chunks (the ‘k’ value) to ensure the answer is “somewhere” in the prompt. But models prioritize information at the very beginning and end of a prompt while tending to ignore the middle. This is known as the ‘lost-in-the-middle’ effect.

If the critical detail is in the fifth of ten chunks, for example, the model may conclude that the information is missing. It’ll then default to its training data – a hallucination born from good intentions.

Contradictory documents and version drift

Enterprise knowledge bases are rarely ‘truth-shaped’; they are ‘document-shaped.’ For example, a knowledge base might contain the 2022, 2023, and 2024 versions of a travel policy. Without deduplication or metadata filtering, a RAG system will retrieve chunks from all three.

Example: if the LLM receives “Refunds take 5 days” and “Refunds take 14 days” simultaneously, it has no mechanism to determine which is current. It either synthesizes a hallucinated average or picks one at random.
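One possible mitigation is to filter retrieved chunks by version metadata before they reach the prompt. The sketch below keeps only the newest version of each document; the `Chunk` fields and the assumption that the most recent year is the authoritative one are illustrative.

```python
from dataclasses import dataclass


@dataclass
class Chunk:
    text: str
    doc_id: str  # logical document, e.g. "travel-policy"
    year: int    # version tag stored at ingestion time


def latest_only(chunks: list[Chunk]) -> list[Chunk]:
    """Keep only chunks that come from the newest version of each document."""
    newest: dict[str, int] = {}
    for c in chunks:
        if c.doc_id not in newest or c.year > newest[c.doc_id]:
            newest[c.doc_id] = c.year
    return [c for c in chunks if c.year == newest[c.doc_id]]


# Hypothetical retrieval result mixing two policy versions.
retrieved = [
    Chunk("Refunds take 14 days", "travel-policy", 2022),
    Chunk("Refunds take 5 days", "travel-policy", 2024),
]
current = latest_only(retrieved)
```

With the stale chunk removed, the model never sees the contradiction in the first place.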

Low-confidence generation without fallback (the ‘Chevy dealership’ trap)

In late 2023, a user manipulated a dealership’s chatbot into “legally” selling a Chevy Tahoe for $1. Naive RAG assumes that if a document is retrieved, it must be relevant. If the vector search returns garbage because no good match exists, the LLM is still forced to generate an answer from that garbage.

Without a confidence threshold – a mathematical gate that checks whether the retrieval score warrants an answer – the model improvises. This is not a model failure; it is an architecture failure.

What actually works? 5 lessons from real deployments

To deliver a system that is both useful and reliable, we must move from a one-stage to a multi-stage architecture and treat retrieval as a first-class engineering problem. The following five lessons represent hard-won wisdom from teams who have navigated the transition from hallucination-prone to production-grade.

Lesson 1: Why you should implement semantic chunking

Instead of splitting text into fixed-size chunks, consider semantic chunking. With semantic chunking, every chunk of text represents one complete thought.

The system slices a document into individual sentences and computes embeddings for each of those sentences. It then computes the cosine distance between each pair of consecutive sentences, and places a chunk boundary whenever the distance exceeds a given percentile (e.g., 95th percentile) of all of the distances for that document.

This pattern is what is mathematically called a “percentile-based split.” Despite the technical term, the gist of what it does is simple: it identifies ‘topic shifts’.

In a real-world deployment for a medical equipment manufacturer, this strategy improved retrieval recall by 9%, because it kept complex multi-step instructions for individual parts within a single, unbroken context.
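The percentile-based split described above can be sketched in a few lines. This is a minimal illustration, not the deployment's code: `embed` is a word-count stand-in for a real sentence-embedding model, the sentence splitter is a simple regex, and the sample text is invented.

```python
import math
import re
from collections import Counter


def embed(sentence: str) -> Counter:
    # Stand-in embedding (word counts); swap in a real sentence-embedding model.
    return Counter(re.findall(r"[a-z0-9]+", sentence.lower()))


def cosine_distance(a: Counter, b: Counter) -> float:
    dot = sum(a[t] * b[t] for t in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return 1.0 - (dot / norm if norm else 0.0)


def percentile(values: list[float], pct: float) -> float:
    """Percentile with linear interpolation between the closest ranks."""
    ordered = sorted(values)
    pos = (len(ordered) - 1) * pct / 100
    lo, hi = math.floor(pos), math.ceil(pos)
    return ordered[lo] + (ordered[hi] - ordered[lo]) * (pos - lo)


def semantic_chunks(text: str, pct: float = 95) -> list[str]:
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]
    if len(sentences) < 2:
        return sentences
    vectors = [embed(s) for s in sentences]
    # Distance between each pair of consecutive sentences.
    distances = [cosine_distance(vectors[i], vectors[i + 1]) for i in range(len(vectors) - 1)]
    threshold = percentile(distances, pct)
    chunks, current = [], [sentences[0]]
    for sentence, dist in zip(sentences[1:], distances):
        if dist > threshold:  # topic shift: close the current chunk
            chunks.append(" ".join(current))
            current = []
        current.append(sentence)
    chunks.append(" ".join(current))
    return chunks


doc = (
    "Unplug the pump before servicing. Remove the pump cover before servicing the pump. "
    "Refunds are processed within five days. Refunds are processed by the billing team."
)
chunks = semantic_chunks(doc)
```

On this sample the only distance above the 95th percentile is the jump from the pump instructions to the refund sentences, so the boundary lands at the topic shift rather than at an arbitrary character count.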

Figure: Fixed-size chunking vs semantic chunking

Lesson 2: Why you should deploy hybrid retrieval

Relying exclusively on vector embeddings can lead to precision errors, particularly when it comes to technical data. In our experience, dense embeddings often fail to capture alphanumeric strings such as a model number (‘Model X-451’) or an error code (‘0x8004’). This is why a production system should combine BM25 (keyword-based) search with dense vector search.

In one deployment, the documentation was organized by hardware codes, but users described their problems in natural language (such as “my screen is flickering”), and the hybrid retriever bridged the gap between user intent and document terminology. The industry standard for merging these results is Reciprocal Rank Fusion (RRF), which calculates a new score for each document based on its rank in both the keyword and semantic result sets.

Put simply: RRF says “give me the documents that rank well in both of my systems”.
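A minimal RRF sketch: each document's fused score is the sum of 1 / (k + rank) over every result list it appears in. The constant k = 60 is a commonly used default rather than something the text above specifies, and the document IDs are made up.

```python
from collections import defaultdict


def reciprocal_rank_fusion(rankings: list[list[str]], k: int = 60) -> list[tuple[str, float]]:
    """Fuse several ranked lists of document IDs into one list sorted by RRF score."""
    scores: dict[str, float] = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


# Hypothetical results for "my screen is flickering on Model X-451".
bm25_results = ["doc_x451", "doc_flicker", "doc_warranty"]
vector_results = ["doc_flicker", "doc_display", "doc_x451"]

fused = reciprocal_rank_fusion([bm25_results, vector_results])
fused_ids = [doc_id for doc_id, _ in fused]
```

Documents that appear in both lists rise to the top, while documents found by only one retriever still survive with a lower score. Because RRF uses ranks rather than raw scores, BM25 and cosine scores never need to be put on a common scale.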

Lesson 3: Why reranking is your lifeline

Vector databases are good at fast but shallow recall – with the top results often in the wrong order. This is why most RAG pipelines perform a second filter on the top 20 or 50 retrieved chunks, known as a ‘reranker.’

Rerankers (typically cross-encoder models) take over from the bi-encoders after the initial retrieval pass, scoring the query against each chunk one pair at a time. Because the cross-encoder ingests the query and document tokens together, it’s able to identify when a chunk is topically similar but factually irrelevant.

Tools like Cohere Rerank and BGE-Reranker have become production standards, reducing hallucination rates by up to 20% simply by ensuring the best chunk appears first in the prompt. Skipping reranking is the single most common cause of “good retrieval, bad answer” failures.

Lesson 4: The confidence gate pattern, explained

An honest RAG system must be empowered to say “I don’t know.” Implementing a confidence gate – a numerical threshold for the reranker or similarity score below which the fallback is triggered instead of an answer – is a critical safety feature.

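# The Confidence Gate Pattern
# Assumes `vector_db` (an async vector store client) and `generate_answer`
# (the LLM call) are defined elsewhere in the application.
FALLBACK = "I'm sorry, I don't have enough verified information to answer that."

async def search_with_threshold(query: str, threshold: float = 0.6) -> str:
    results = await vector_db.similarity_search(query)
    # Keep only chunks whose similarity score meets the threshold
    confident_context = [res for res in results if res.score >= threshold]

    # No chunk cleared the gate: fall back instead of letting the model improvise
    if not confident_context:
        return FALLBACK

    return generate_answer(query, confident_context)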

In a financial services deployment, we found that the threshold didn’t need to be a single value. Exploratory “tell me about” queries could use lower thresholds of 0.5, while specific technical questions required high thresholds of 0.8.
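A sketch of how query-dependent thresholds might be wired in. The 0.5 and 0.8 values come from the text above; the keyword-based `classify_query` heuristic is purely hypothetical (a real system might use a small classifier model instead).

```python
# Thresholds per query type, using the values described above.
THRESHOLDS = {"exploratory": 0.5, "technical": 0.8}
DEFAULT_THRESHOLD = 0.6


def classify_query(query: str) -> str:
    """Hypothetical heuristic: open-ended phrasing counts as exploratory."""
    q = query.lower().strip()
    if q.startswith(("tell me about", "overview of")):
        return "exploratory"
    return "technical"


def threshold_for(query: str) -> float:
    """Pick the confidence gate for this query, falling back to the default."""
    return THRESHOLDS.get(classify_query(query), DEFAULT_THRESHOLD)


loose = threshold_for("Tell me about retirement accounts")
strict = threshold_for("How do I set the wire transfer limit to 10000?")
```

The chosen value can then be passed as the `threshold` argument to a gate like the one shown earlier.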

Lesson 5: The importance of mandatory citation-grounded outputs

The single most powerful hallucination reducer we’ve found in production is forcing the model to cite its sources. And I don’t mean just tacking on some links at the end. I mean intrinsic source citation, in which the model anchors every factual claim it makes to a specific chunk ID.

Prompting this way effectively creates a self-correcting feedback loop. If the model can’t find a source for a claim it was about to make, it must either leave the claim out or acknowledge the gap. In a pilot we did for a legal research company, we saw that this sharply reduced hallucination rates, since the “hallucination cost” (making up a reasonable-sounding source ID) was now higher than the cost of following the context.
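Citations only help if something checks them. The sketch below assumes the model is instructed to tag each sentence with markers like `[c12]` (a hypothetical format) and flags sentences that either carry no citation or cite a chunk ID that was never retrieved.

```python
import re

# Hypothetical citation format: chunk IDs in square brackets, e.g. [c12].
CITATION = re.compile(r"\[([A-Za-z0-9_-]+)\]")


def uncited_or_invalid(answer: str, retrieved_ids: set[str]) -> list[str]:
    """Return sentences with no citation or with a citation to an unknown chunk ID."""
    problems = []
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", answer) if s.strip()]
    for sentence in sentences:
        cited = CITATION.findall(sentence)
        if not cited or any(cid not in retrieved_ids for cid in cited):
            problems.append(sentence)
    return problems


answer = (
    "Refunds are issued within 5 days [c12]. "
    "Promotional fares are excluded [c99]. "
    "Contact support for details."
)
problems = uncited_or_invalid(answer, retrieved_ids={"c12", "c17"})
```

A non-empty result can trigger a regeneration or a fallback. Note that this only verifies that cited IDs exist, not that the cited chunk actually supports the claim.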

Production-grade example: customer support assistant

Take the example of a high-reliability support assistant for an enterprise software company. It must be able to field complex questions about installation, API configurations, and troubleshooting, using documentation that may span thousands of pages across multiple versions of the software. Here’s how all five lessons come together.

The ingestion lifecycle

The first step involves Recursive Character Text Splitting, with the option to keep Markdown headers enabled so that closely related sections stay associated. The “Prerequisites” section and the “Step-by-Step” guide are an example of two such closely related chunks. Each chunk is stored with a robust metadata schema:

source_url — link to the live documentation page

version — software version tag, essential for filtering

chunk_id — a unique, stable identifier for citation mapping
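As a sketch, the schema above could be expressed as a pair of dataclasses. The three metadata fields come from the list above; the class names, the `text` field, and the example values (including the URL) are illustrative.

```python
from dataclasses import asdict, dataclass


@dataclass(frozen=True)
class ChunkMetadata:
    source_url: str  # link to the live documentation page
    version: str     # software version tag, used for filtering
    chunk_id: str    # unique, stable identifier for citation mapping


@dataclass(frozen=True)
class Chunk:
    text: str
    metadata: ChunkMetadata


# Hypothetical example values.
chunk = Chunk(
    text="Before you begin, register an OAuth client in the admin console.",
    metadata=ChunkMetadata(
        source_url="https://docs.example.com/oauth/setup",
        version="3.2",
        chunk_id="oauth-setup-0001",
    ),
)

# Plain-dict form, e.g. for storing alongside the vector in an index.
record = asdict(chunk)
```

Freezing the dataclasses keeps `chunk_id` stable after ingestion, which matters because citations are resolved against it later.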

The retrieval workflow

When a user asks “How do I configure OAuth for version 3.2?”, the system executes four steps:

  • Metadata pre-filtering: The search space is immediately narrowed to chunks tagged version 3.2 or “global,” preventing retrieval of obsolete v2.0 instructions.

  • Hybrid retrieval: A BM25 search runs for “OAuth” while a vector search targets “Single Sign-On authentication configuration.”

  • Reranking: The top 40 hybrid candidates are passed to a reranker, which identifies the top 5 chunks specifically addressing “configuration” rather than just “OAuth.”

  • Confidence gate: If the reranker scores the top chunks below 0.7, the system escalates to a human agent with the query pre-filled rather than guessing.
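The four steps can be sketched as one function with the retriever and reranker injected as callables. Everything here is a simplified assumption: `retrieve` and `rerank` are stand-ins (a pass-through and a word-overlap score) for real hybrid search and a real cross-encoder, and the gate is interpreted as "keep only chunks scoring at least 0.7, escalate if none remain".

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class Candidate:
    chunk_id: str
    version: str
    text: str


def answer_or_escalate(
    query: str,
    version: str,
    chunks: list[Candidate],
    retrieve: Callable[[str, list[Candidate]], list[Candidate]],
    rerank: Callable[[str, Candidate], float],
    gate: float = 0.7,
    top_n: int = 5,
) -> dict:
    # 1. Metadata pre-filter: only the requested version or "global" chunks.
    pool = [c for c in chunks if c.version in (version, "global")]
    # 2. Hybrid retrieval (injected): keep the top 40 candidates.
    candidates = retrieve(query, pool)[:40]
    # 3. Rerank and keep the best `top_n`.
    scored = sorted(((rerank(query, c), c) for c in candidates), key=lambda p: p[0], reverse=True)[:top_n]
    # 4. Confidence gate: escalate instead of guessing.
    kept = [c for score, c in scored if score >= gate]
    if not kept:
        return {"action": "escalate", "query": query}
    return {"action": "answer", "context": kept}


def overlap_score(query: str, candidate: Candidate) -> float:
    """Stand-in reranker: fraction of query words found in the chunk text."""
    words = query.lower().split()
    return sum(w in candidate.text.lower() for w in words) / len(words)


CHUNKS = [
    Candidate("a", "3.2", "configure oauth client settings"),
    Candidate("b", "2.0", "configure oauth legacy flow"),
    Candidate("c", "global", "contact support"),
]

hit = answer_or_escalate("configure oauth", "3.2", CHUNKS, lambda q, pool: pool, overlap_score)
miss = answer_or_escalate("reset password", "3.2", CHUNKS, lambda q, pool: pool, overlap_score)
```

The obsolete v2.0 chunk never reaches the reranker, and a query with no good match is escalated with the original query attached.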

The generation and verification gate

The system then assembles the top five chunks into a prompt for the generation engine, along with a strict-mode instruction:
“Answer based only on the provided context. If the information is missing, respond with NO_CONTEXT_FOUND.”

If the LLM returns NO_CONTEXT_FOUND, the system does not surface an error to the user. Instead, it silently escalates to a human agent. If an answer is generated, the user interface (UI) renders citations as clickable deep-links that highlight the exact source paragraph, giving the user immediate verification of the system’s honesty.
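A minimal sketch of this gate follows. The NO_CONTEXT_FOUND sentinel and the instruction text come from the description above; the function names, the bracketed [chunk_id] citation format, and the shape of the returned dict are assumptions made for illustration.

```python
NO_CONTEXT = "NO_CONTEXT_FOUND"

STRICT_INSTRUCTION = (
    "Answer based only on the provided context. "
    f"If the information is missing, respond with {NO_CONTEXT}."
)


def build_prompt(question: str, chunks: list[dict]) -> str:
    """Assemble the strict-mode prompt; each chunk is labeled with its ID."""
    context = "\n\n".join(f"[{c['chunk_id']}] {c['text']}" for c in chunks)
    return f"{STRICT_INSTRUCTION}\n\nContext:\n{context}\n\nQuestion: {question}"


def handle_response(llm_output: str, chunks: list[dict]) -> dict:
    """Route the model output: escalate silently, or answer with citations."""
    if NO_CONTEXT in llm_output:
        # No error is shown to the user; the query goes to a human agent.
        return {"action": "escalate"}
    # Assumption: the model cites chunks inline as [chunk_id].
    cited = [
        {"chunk_id": c["chunk_id"], "source_url": c["source_url"]}
        for c in chunks
        if f"[{c['chunk_id']}]" in llm_output
    ]
    return {"action": "answer", "text": llm_output.strip(), "citations": cited}
```

The UI layer would then turn each entry in citations into a deep-link to the source paragraph.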

How to measure whether your RAG pipeline is honest

You can’t fix a system if you don’t know where it broke. Think of it as the Shopper vs. Chef problem: if the Shopper (Retriever) brings home rotten eggs, the Chef (LLM) produces a bad meal regardless of their skill. Evaluation must be diagnostic – pinpointing which layer failed, not just whether the final answer was wrong.

A widely used evaluation framework is RAGAS, which provides an automated, quantitative heartbeat for the system’s accuracy.

After every code change, run these four metrics to confirm you haven’t regressed:

Context precision

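Context precision measures whether the relevant chunks are ranked at the top of the retrieved list, calculated as a mean Precision@K over the retrieved chunks. It is also a useful signal when choosing an embedding model. Poor precision means that the retriever didn’t rank the relevant chunks at the top of the list. It can also mean that a reranker is missing.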
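To make the metric concrete, here is a small sketch of one common rank-weighted formulation of mean Precision@K. It assumes relevance judgments are already available as booleans (in practice an LLM judge or human labels would supply them), and it is not RAGAS’s own implementation.

```python
def context_precision(relevance: list[bool]) -> float:
    """relevance[i] is True when the chunk at rank i + 1 is relevant."""
    hits = 0
    total = 0.0
    for rank, is_relevant in enumerate(relevance, start=1):
        if is_relevant:
            hits += 1
            # Precision@rank, counted only at ranks holding a relevant chunk.
            total += hits / rank
    return total / hits if hits else 0.0


# Relevant chunks at ranks 1 and 3 score higher than a single one at rank 3.
print(round(context_precision([True, False, True]), 3))   # 0.833
print(round(context_precision([False, False, True]), 3))  # 0.333
```

The same relevant chunk scores lower the further down the list it appears, which is why a missing reranker shows up in this metric.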

Context recall

Recall measures whether the retriever found all the information necessary to answer the question (as compared to the reference answer). Low recall generally indicates that chunk sizes are too small. However, it can also indicate that the initial K value is too conservative.

Faithfulness (‘groundedness’ score)

Every claim in the generated answer should be traceable to the retrieved chunks. Simply put: we should be able to infer the generated answer from the retrieved chunks alone. To measure faithfulness, the answer is broken down into individual claims, and an LLM judge compares each of the claims to the context. The score is the share of claims the context supports, which makes high faithfulness the strongest signal of an honest system.
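The scoring step can be sketched as follows. The judge callable is a hypothetical stand-in for the LLM judge, and claim extraction is assumed to have happened already; the substring-based toy_judge exists only so the example runs without a model.

```python
from typing import Callable


def faithfulness(claims: list[str], context: str, judge: Callable[[str, str], bool]) -> float:
    """Fraction of answer claims that the judge finds supported by the context."""
    if not claims:
        return 0.0
    supported = sum(1 for claim in claims if judge(claim, context))
    return supported / len(claims)


def toy_judge(claim: str, context: str) -> bool:
    # Toy stand-in for an LLM judge: literal, case-insensitive containment.
    return claim.lower() in context.lower()


context = "Access tokens expire after 60 minutes."
claims = ["tokens expire after 60 minutes", "tokens are stored in Redis"]
print(faithfulness(claims, context, toy_judge))  # 0.5
```

Here the second claim has no support in the context, so half of the answer counts as ungrounded.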

Answer relevance

To measure answer relevance, an LLM reverse-engineers a series of possible questions that could have resulted in the generated answer, then measures the similarity between those questions and the input query. In other words, it checks whether the response actually addresses the user’s question, independent of whether it is factually true. This process also flags “evasive” bots, which are technically telling the truth but are otherwise thoroughly useless.
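The similarity step can be sketched with plain cosine similarity. This assumes the query and the reverse-engineered questions have already been embedded by some embedding model; the vectors below are made-up toy values, and averaging the similarities is one reasonable aggregation rather than a confirmed detail of any specific tool.

```python
import math


def cosine(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def answer_relevance(query_vec: list[float], question_vecs: list[list[float]]) -> float:
    """Mean similarity between the user query and the reverse-engineered questions."""
    if not question_vecs:
        return 0.0
    return sum(cosine(query_vec, v) for v in question_vecs) / len(question_vecs)


# One generated question matches the query exactly, the other is unrelated.
print(answer_relevance([1.0, 0.0], [[1.0, 0.0], [0.0, 1.0]]))  # 0.5
```

An evasive answer tends to produce generated questions that drift away from the original query, which pulls the mean down.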

When RAG is the wrong tool

Part of building solid AI infrastructure involves knowing when RAG is not the solution. In the rush to adopt generative AI, many teams add RAG complexity to problems that older, more deterministic tools are better at solving.

The long-context advantage

If your data fits into a single long-context window (200k to 1M tokens), then RAG is likely unnecessary complexity: models such as Gemini 1.5 Pro or Claude 3.5 can read an entire technical manual in one go. This eliminates retrieval failure and chunk fragmentation entirely.

While RAG is dramatically cheaper for massive datasets, long-context is often more accurate for complex reasoning across a small, static document set.

Structured data and SQL

If the user’s question is about structured information — for example, “Which customers spent more than $5,000 in December?” — then RAG will fail.

Vector search cannot perform mathematical aggregations or precise joins. A SQL database is the only correct tool for structured, numerical, or relationship-heavy queries.
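For comparison, the example question above is a few lines of SQL. The sketch below uses Python’s built-in sqlite3 module with a made-up orders table; the schema, customer names, and amounts are invented purely for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (customer TEXT, amount REAL, order_date TEXT)")
conn.executemany(
    "INSERT INTO orders VALUES (?, ?, ?)",
    [
        ("Acme", 3200.0, "2024-12-03"),
        ("Acme", 2100.0, "2024-12-19"),
        ("Globex", 4800.0, "2024-12-11"),
        ("Initech", 9000.0, "2024-11-30"),
    ],
)

# Sum each customer's December spend and keep totals above $5,000.
rows = conn.execute(
    """
    SELECT customer, SUM(amount) AS total
    FROM orders
    WHERE order_date >= '2024-12-01' AND order_date < '2025-01-01'
    GROUP BY customer
    HAVING SUM(amount) > 5000
    ORDER BY customer
    """
).fetchall()
print(rows)  # [('Acme', 5300.0)]
```

No single order from Acme exceeds $5,000; the answer only emerges from the aggregation, which is exactly what similarity search over chunks cannot do.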

Rules engines and determinism

Sometimes you need an answer that is 100 percent deterministic and auditable, such as a medical dosage calculation or tax compliance logic. In these instances, a generative model is a liability. A rules engine, on the other hand, carries less risk because it can’t hallucinate.

Decision heuristic:

  • Use RAG when your knowledge base is too large for a context window, too dynamic to constantly fine-tune on, or too unstructured for a SQL database.

  • Use Long Context for deep reasoning over a small, static set of documents (e.g., “Analyze these three research papers for contradictions”).

  • Use SQL for structured, numerical, or relationship-heavy queries.

  • Use Rules Engines for safety-critical, deterministic logic.
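The heuristic above can be written down as a small routing function. This is only a sketch: the flag names are hypothetical, and the order of precedence (safety-critical logic first, then structured data, then long context, with RAG as the fallback) is an assumption rather than something the heuristic prescribes.

```python
def choose_tool(
    *,
    deterministic: bool,
    structured: bool,
    fits_in_context: bool,
    static: bool,
) -> str:
    """Map properties of a workload to one of the four approaches."""
    if deterministic:
        return "rules_engine"   # safety-critical, auditable logic
    if structured:
        return "sql"            # numerical or relationship-heavy queries
    if fits_in_context and static:
        return "long_context"   # deep reasoning over a small, fixed document set
    return "rag"                # large, dynamic, or unstructured knowledge base


print(choose_tool(deterministic=False, structured=False, fits_in_context=True, static=True))
# long_context
```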

Conclusion: Hallucination is a retrieval problem

The road from naive RAG to production is humbling. It starts with an understanding that the generator is only as truthful as the input you feed it, and that most RAG hallucinations originate before the LLM ever sees the query, in the ingestion, chunking, and retrieval layers of the pipeline.

Developing a system that doesn’t hallucinate is more about building rigorous architecture than choosing the ‘smartest’ model. The three pillars of a trustworthy system are semantic chunking, hybrid retrieval, and mandatory citations. The goal is to move your pipeline from a probabilistic “guesser” to a deterministic “researcher” that checks its facts before it speaks.

That work is already happening behind the scenes, both in narrowing the input to what’s relevant and in agentic retrieval, where AI agents dynamically plan and iterate on their own searches. For this technology to work, though, the quality of the grounding is still paramount.

FAQs: RAG and AI hallucinations

1. Does RAG eliminate AI hallucinations?

No. RAG reduces hallucinations by grounding responses in external data, but most failures simply shift to the retrieval layer — including poor chunking, weak embeddings, outdated documents, or irrelevant search results.

2. Why do RAG systems still give incorrect answers?

RAG systems fail when they retrieve incomplete, contradictory, or low-quality context. The LLM then generates an answer from flawed retrieval data, often sounding confident even when incorrect.

3. What causes hallucinations in enterprise AI chatbots?

The most common causes include semantic fragmentation from bad chunking, embedding mismatches, weak reranking, context overload, version drift, and missing confidence thresholds.

4. What is the best way to reduce hallucinations in RAG pipelines?

Production-grade systems typically combine semantic chunking, hybrid retrieval (BM25 + vector search), reranking models, confidence gates, and citation-grounded outputs.

5. What is hybrid retrieval in RAG?

Hybrid retrieval combines keyword search with vector similarity search, improving accuracy for technical terms, product codes, and structured documentation that embeddings alone often miss.

6. When should you avoid using RAG?

RAG is usually the wrong tool for structured SQL-style queries, deterministic business rules, or small document sets that fit entirely within a long-context model window.

7. Why are citations important in AI systems?

Mandatory citations force the model to ground claims in retrieved evidence, making answers more verifiable and significantly reducing hallucination rates.

The post How to stop AI hallucinations in enterprise RAG systems (a complete guide) appeared first on Simple Talk.

Making the OWASP top ten in the vibe code era

Ryan welcomes back Tanya Janca, now part of the OWASP Top 10 team, to discuss what changed in the latest OWASP Top 10 release, how the list shifted from “outdated components” to a broader software supply chain focus, and why they added memory safety and vibe-coding as awareness items.