Sr. Content Developer at Microsoft, working remotely in PA, TechBash conference organizer, former Microsoft MVP, Husband, Dad and Geek.

MCP on Code Mode (Interview)


This week I’m talking with Matt Carey about Code Mode and how most of us have been thinking about MCP all wrong. Matt works on the Agents SDK and MCP at Cloudflare — we discuss how server-side Code Mode lets one MCP server expose all ~2,500 Cloudflare API endpoints in about 1,000 tokens of context, the dynamic Worker loader that runs model-written code safely in a V8 isolate, Matt’s own workflow with Claude, where memory fits into the future of agents, and his Zaggy git wrapper that keeps agents from force-pushing his repos.

Sponsors:

  • Coder.com – Secure environments where devs and agents work in parallel. Open by design. Secure by default.
  • Tailscale – Adam loves Tailscale! Easy, secure, identity-based access to anything. Tailscale deploys quickly and enables Zero Trust access to any resource on your network. From CI/CD runners across multi-cloud environments, to SaaS tools and infrastructure, Tailscale connects it all, seamlessly.
  • RWX – CI/CD platform for high velocity teams. When agents help developers write code in minutes, validation becomes your bottleneck. RWX gives agents programmatic control, sub-second cached builds, and semantic outputs they can act on. No commit required. Just iterate until CI passes, then push.
  • Fly.io – The home of Changelog.com. Deploy your apps close to your users: global Anycast load-balancing, zero-configuration private networking, hardware isolation, and instant WireGuard VPN connections. Push-button deployments that scale to thousands of instances. Check out the speedrun to get started in minutes.

Download audio: https://op3.dev/e/https://pscrb.fm/rss/p/https://cdn.changelog.com/uploads/podcast/681/the-changelog-681.mp3
Read the whole story
alvinashcraft
5 minutes ago
reply
Pennsylvania, USA
Share this story
Delete

Three Tiers, One Platform: Building Agents Together with the Build-Along Series


The Agentic Opportunity

 

Agentic Platforms in Microsoft 365, Power Platform and Azure

 

Every partner has agent ambition, but a single workshop format cannot serve every builder. Business makers want a fast win without writing code. Citizen developers want to orchestrate real processes across enterprise systems. Pro developers want a code-first surface with evaluations, tracing, and managed identity. Run the wrong session for the wrong audience, and engagement collapses - too abstract for one group, too constrained for another. The Agent Build-Along Series solves that by meeting builders exactly where they are, across three tiers of the Microsoft platform stack. 

Behind the series sits a self-serve GitHub repository - a curated library of build-along sessions with easy-to-follow, step-by-step instructions, scoped to specific industry scenarios and business functions. Partners pick the scenario that fits the room, clone the assets, and run the session. No bespoke content build, no facilitator guesswork. 

A Self-Serve Repository for Build-Along Sessions 

 

Diagram illustrating three tiers of agent building

 

The Agent Build-Along GitHub repo is a self-serve catalog of ready-to-run workshops. Every session is curated for a specific industry and business function, with easy-to-follow instructions a facilitator can pick up and deliver. Pick the combination that matches the room and run the session - the content, scenario, and assets are already there.

Sessions are organised across three axes so partners can find the right fit fast: 

  • Industry - Financial Services, Healthcare, Retail, Manufacturing, Public Sector, Energy, Professional Services 
  • Business function - Sales, Marketing, Finance, HR, Operations, IT, Customer Service 
  • Workload - Agent Builder, Copilot Studio and Azure AI Foundry 

Each session in the repo includes a real-world scenario, step-by-step build instructions, sample data, prompts and configuration, and a facilitator narrative. Clone, brief the room, build the agent - that is the loop. 

 

Three Tiers - Meet Builders Where They Are 

The pyramid is intentional. Builders start where their skills are today and move up as confidence grows. The tier defines platform, audience, duration, and approach - content stays consistent. 

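| Tier           | Platform                            | Audience                                       | Duration | Approach   |
|----------------|-------------------------------------|------------------------------------------------|----------|------------|
| 1 - Foundation | Microsoft 365 Copilot Agent Builder | Business makers, end users, departmental teams | 60 min   | No-code    |
| 2 - Extend     | Copilot Studio                      | Citizen developers, power users, ops teams     | 90 min   | Low-code   |
| 3 - Pro-Code   | Azure AI Foundry                    | Pro developers, architects, AI engineers       | 180+ min | Code-first |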

 

Tier 1 - M365 Copilot Agent Builder 

Author your first agent - no code required. 

 

Microsoft 365 Copilot Agent Builder virtual workshop overview

 

 

What your customers will build 

  • A custom agent scoped to a real workplace scenario 
  • Grounded with knowledge sources (SharePoint sites, files, web URLs) 
  • Configured with custom instructions and conversation starters 
  • Tested in the preview pane and shared with a team 

Session flow 

Define → Ground → Refine → Test & Share 

Prerequisites 

  • Microsoft 365 Copilot licence (or Copilot Chat free) 
  • Modern browser (Edge / Chrome) 
  • A workplace scenario in mind, or use our template 

Outcome 

Attendees leave with a working agent they can use immediately, plus a repeatable pattern for future use cases. This is the right entry point for departmental teams who want to operationalise Copilot without engineering involvement. 

Tier 2 - Copilot Studio 

Orchestrate multi-step agents across enterprise systems. 

 

Microsoft Copilot Studio virtual workshop overview

 

What your customers will build 

  • A copilot with generative answers and authored topics 
  • Custom actions that call APIs and Power Platform connectors 
  • Knowledge sources spanning SharePoint, Dataverse, public web, and enterprise data 
  • Channel deployment to Microsoft Teams and a public web embed 

Session flow 

Design → Connect → Author → Publish 

Prerequisites 

  • M365 Copilot license with access to Copilot Studio 
  • Familiarity with Power Platform helpful, not required 
  • A multi-step workflow or process in mind 

Outcome 

Attendees leave with a published copilot running in Teams or on the web, plus a pattern for connecting copilots to enterprise systems. This is where citizen developers and ops teams turn manual processes into orchestrated, agent-driven workflows. 

Tier 3 - Azure AI Foundry 

Ship a code-first agent with evaluations and observability. 

 

Microsoft Azure Foundry virtual workshop overview

 

 

What your customers will build 

  • A code-first agent with custom tools and function calling 
  • Grounding via Azure AI Search and the customer's own data 
  • Model selection from the Foundry catalog, with optional fine-tuning 
  • Evaluations, tracing, and managed-identity deployment 

Session flow 

Provision → Build → Evaluate → Deploy 

Prerequisites 

  • Azure subscription with Azure AI Foundry access 
  • VS Code, plus Python or .NET familiarity 
  • Sample data or a use case ready to ground on 

Outcome 

Attendees leave with a deployed agent - code, evaluations, and tracing in place - plus reusable SDK templates. This is the right tier for architects and AI engineers shipping agents into production with the governance bar enterprise customers expect. 

How the Repository Works 

The repo removes the heaviest lift in running these workshops: content creation. Every session is curated against the same three axes - industry, business function, tier - and shipped with step-by-step instructions a facilitator can follow with minimum prep. 

Industry/Business Function  ×  Tier  →  Ready-to-run Build-Along 

Each session in the repo ships with: 

  • A real-world scenario grounded in the chosen industry and business function 
  • Step-by-step build instructions tuned to the selected tier 
  • Sample data the agent can ground against 
  • Prompts, conversation starters, and configuration snippets 
  • Facilitator narrative so anyone can run the session 

Need an enterprise agent that can integrate with CRM data? There is a curated Copilot Studio session with a deal-desk scenario and sample CRM data. Need a pro-code healthcare agent? Choose the Foundry session with a patient-flow scenario and a grounded dataset. Same repo, different curated paths - at the right depth for the room.

What Attendees Walk Away With 

  • Agent Builder - A working agent in Microsoft 365 Copilot, ready to share with colleagues.
  • Copilot Studio - A published copilot running in Teams or on the web, connected to enterprise systems. 
  • Foundry - A deployed Foundry agent with evaluations, tracing, and SDK templates for the next build. 

In all cases attendees leave with assets - not slides - and the confidence to build the next one without facilitation. 

 

Next Steps - Host a Build-Along 

 

Host a virtual workshop

 

 

  • Pick your tier - Start where your skills are today. Move up the pyramid as confidence grows. 
  • Drive Registration – Socialise your Build-Along session to maximise attendance. 
  • Show up and build - Bring a real scenario. Customers leave with a working agent to share.

Where to Next 

Browse the self-serve repository of Build-Along sessions - curated by industry and business function, with step-by-step instructions ready to run: Agent Build Along. 

 


Introducing Code Interpreters for Logic Apps


We have recently introduced support for code interpreters inside of Azure Logic Apps. When partnered with an LLM, code interpreters allow builders to express their goals or intents in natural language and obtain executable results. These capabilities become powerful in the areas of data analysis, visualizations, validations and transformations. The first language supported by the code interpreter is JavaScript, with other languages following later.

Historically, customers have had concerns about an LLM performing data analysis, calculations and transformations due to context window exhaustion, which can lead to hallucinations. Code interpreters help in this regard, as they can perform this analysis without filling up context windows and provide more reliable results.

You can see the code interpreter with JavaScript in action in this video from Kent Weare. After watching the video, you can dive deeper into the details.

Use case: Expense Validations

To help illustrate this capability, let’s take an accounts payable example built in Logic Apps Standard. Zava uses a 3rd party expense application to capture employee expenses. The 3rd party expense application exports transactions in CSV format.

Zava has some very specific business validations that need to execute before the expenses can be processed by the ERP. To solve this problem, we will build an agentic business process in Logic Apps that includes our new JavaScript code interpreter.

Our code interpreter will be able to ingest and parse our CSV file and then apply our business validations for us. The outcome is a report that identifies both valid and invalid transactions.

Prior to uploading to the ERP (Dataverse), we will route our request to a human in the loop process for their oversight. This allows for additional control as unwinding in an ERP is always a tedious task.

Below is a picture of our solution. Within it we can see deterministic steps both before and after our Agent action. Within our Agent action, we have tools that help our agent address our company objectives: a tool that calls a batch API to upload valid expense records to Dataverse, a tool that uploads invalid records to a different table, our human-in-the-loop action that seeks approval from our human stakeholder, and a tool that helps us obtain business knowledge from SharePoint.

You might be asking: where does the code interpreter come in? Within our Agent action, there is a toggle that allows you to enable it. The code interpreter is invoked based on the instructions given to the model. Here is a subset of the prompt from this workflow that describes how to invoke the code interpreter.

For example:

### Step 2 -- Parse and Validate
The expense CSV data is available from the Get_file_content action. Use code interpreter to parse ALL rows from the CSV. For each row, normalize:

- Category: title case
- Amount: decimal number
- SubmittedDate: ISO 8601 format (e.g. "2026-01-05T00:00:00Z")
- ReceiptAttached: convert "Yes"/"No" to true/false

Then apply the business rules from Step 1 to classify every record as VALID or INVALID.
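
To make the four normalization rules in the prompt concrete, here is a minimal Python sketch of what such a step could look like. It is not the code the interpreter generates (that is JavaScript and is produced at runtime). The column names come from the prompt; the function names, the sample row, and the assumption that the export writes dates as YYYY-MM-DD are illustrative only.

```python
import csv
import io
from datetime import datetime, timezone
from decimal import Decimal


def normalize_row(row: dict) -> dict:
    """Apply the four normalization rules from the prompt to one CSV row."""
    # Assumption: the expense export writes dates as YYYY-MM-DD.
    submitted = datetime.strptime(row["SubmittedDate"].strip(), "%Y-%m-%d")
    submitted = submitted.replace(tzinfo=timezone.utc)
    return {
        **row,
        "Category": row["Category"].strip().title(),
        "Amount": Decimal(row["Amount"].strip()),
        "SubmittedDate": submitted.strftime("%Y-%m-%dT%H:%M:%SZ"),
        "ReceiptAttached": row["ReceiptAttached"].strip().lower() == "yes",
    }


def parse_expenses(csv_text: str) -> list[dict]:
    """Parse every row of the CSV text and normalize it."""
    return [normalize_row(r) for r in csv.DictReader(io.StringIO(csv_text))]


# Hypothetical one-row export used only for illustration
sample = (
    "Category,Amount,SubmittedDate,ReceiptAttached\n"
    "air travel,412.50,2026-01-05,Yes\n"
)
rows = parse_expenses(sample)
```

Using Decimal rather than float keeps currency amounts exact, which matters when the validated rows are later compared against business thresholds.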

 

You won’t see the code interpreter modelled as a tool within our agent action, but we can see the execution outcome within our run history, as the following screenshot illustrates. Within our agent action, we can see that we are on our 4th turn and we have executed the code interpreter action. In the code window, we can see the code that was generated for us. This is the result of the LLM working together with the code interpreter to generate and execute this code.

 

Note: In this scenario, we are dynamically generating this code at runtime. This allows for ultimate flexibility if we have different source inputs and we are relying upon the LLM and code interpreter to adapt to these fluid inputs. If we were interested in a more deterministic approach we can also pass pre-written code into this action where it can also execute. This will result in less flexibility, but more deterministic behavior.

Using Code Interpreters in Logic Apps Consumption

Logic Apps Consumption has a slightly different architecture to Logic Apps Standard. In Logic Apps Standard, we offer dedicated compute and storage for customers which provides workload isolation across customers.

When it comes to Logic Apps Consumption, we provide a multi-tenant offering allowing customers to take advantage of a lower price point due to shared resource utilization. In order to allow customer isolation, customers need to have an integration account attached to their consumption workflow. This will allow the code interpreter to run in isolated compute thus avoiding any potential disruptions to other customers.

You can provision an Integration Account by searching for Integration Accounts at the top of the Azure portal. You can select any of the SKUs available, including the Free SKU for non-production/non-SLA scenarios.

With an Integration Account created, we can associate this Integration Account with our consumption logic app by clicking on Settings – Integration Account.


AWS Weekly Roundup: AWS Transform at 1 year, Claude Platform on AWS, EC2 M3 Ultra Mac instances, and more (May 18, 2026)


Just a year ago, we launched AWS Transform for .NET, Mainframe and VMware workloads, the first agentic AI service purpose-built for modernizing enterprise applications at scale. At re:Invent 2025, we introduced AWS Transform custom, which enables organizations to modernize and transform code at scale using AWS-managed and custom transformations. You can upgrade language versions, migrate frameworks, optimize performance, and analyze code bases using transformations that are ready to use or can be customized to meet your organization’s specific requirements. We also introduced full-stack Windows modernization capabilities, along with Reimagine capabilities and automated testing functionality for mainframe.

In 12 months, thousands of customers migrated hundreds of thousands of servers, saved 1.6+ million hours, and processed 4.5+ billion lines of code with AWS Transform. To celebrate its 1-year anniversary, AWS Transform agents are now available in Kiro, Claude, Cursor, and Codex, including the agent builder toolkit Kiro power for building customized transformation agents.

To learn what happened in 12 months, the four things we learned, and how that evolved our roadmap, visit the one-year anniversary blog post.

Last week’s launches
Here are last week’s launches that caught my attention:

  • The general availability of Claude Platform on AWS – You can get direct access to Anthropic’s native Claude Platform experience, including APIs, console, and early-access beta features, directly through your existing AWS account, without managing separate accounts, billing, or tracking. Claude Platform on AWS is operated by Anthropic, and customer data is processed outside the AWS security boundary. To learn more, visit the deep dive blog post.
  • Amazon EC2 M3 Ultra Mac instances – These instances are built on Apple M3 Ultra Mac Studio computers featuring a 28-core CPU, 60-core GPU, 32-core Neural Engine, and 256GB of unified memory. Compared to EC2 M4 Max Mac instances, M3 Ultra Mac instances provide 2x the unified memory, 1.75x the CPU cores, 1.5x the GPU cores, and 2x the Neural Engine cores, giving Apple developers the headroom to run significantly more Xcode simulators in parallel and accelerate on-device ML workflows to improve product time to market.
  • Amazon Redshift RG instances powered by AWS Graviton – These instances deliver better performance, running data warehouse and data lake workloads up to 2.4x as fast as previous generation RA3 instances, at 30% lower price per vCPU. RG instances include Redshift’s custom-built vectorized data lake query engine that processes Apache Iceberg and Parquet data on your cluster nodes.
  • Amazon Bedrock Advanced Prompt Optimization – You can optimize your prompts for any model on Bedrock, while comparing your original prompts to your optimized prompts across up to 5 models simultaneously. You can also use this if you are migrating to a new model or just want to get better performance on your current model.
  • AWS Security Agent full repository code scanning (preview) – You can use a new capability in AWS Security Agent that performs deep, context-aware security analysis of your entire codebase. When vulnerabilities are found, the scanner generates code remediation—specific fixes tied to the exact file and line—enabling teams to remediate security vulnerabilities faster than ever before. This capability is available at no additional charge for existing AWS Security Agent customers during the preview.
  • AWS Interconnect – multicloud connectivity with Oracle Cloud Infrastructure (preview) – You can quickly provision resilient, scalable private connections to other cloud providers using AWS Interconnect – multicloud connectivity. OCI is the latest CSP to adopt the open specification that powers AWS Interconnect. This allows AWS to provide a consistent, simple experience to our customers on OCI (preview), Google Cloud (generally available), and Microsoft Azure (coming later in 2026).

Additional updates
Here are some additional news items that you might find interesting:

  • Accelerate AI research and education with Build on Trainium program – Read how the next generation of AI researchers is using Amazon chips to accelerate discovery. AWS invested $110 million to give university researchers access to purpose-built AI chips. AWS Trainium is speeding up AI research at UC Berkeley, MIT, Carnegie Mellon, and more. All research is open source, meaning improvements flow back to the broader developer community.
  • A full list of AWS Community Days 2026 – There’s something different about an event where the speakers are your peers, the organizers are volunteers who do this out of passion, and the agenda was shaped by the community itself. That’s exactly what AWS Community Days are, and they’re happening in cities across every continent, every year.
  • The Kiro Startups Credit program is back – Thousands of founders applied in the first round, and now applications are open again. Apply to receive up to one year of Kiro Pro+ credits automatically applied to your organization’s AWS account.

For a full list of AWS blog posts, be sure to keep an eye on the AWS Blogs page.

Learn more about AWS, browse and join upcoming AWS-led in-person and virtual events, startup events, and developer-focused events including AWS Summits. Join the AWS Builder Center to connect with builders, share solutions, and access content that supports your development.

That’s all for this week. Check back next Monday for another Weekly Roundup!

Channy


Can Text-to-SQL Benchmarks Work on Document Databases? A Couchbase Architecture Case Study


Executive Summary

Industry-standard text-to-SQL benchmarks are designed for relational, structured databases. However, not all real-world data workloads are confined to relational systems. Modern AI-driven query platforms increasingly operate on document-oriented data stores such as Couchbase, where schemas are flexible and data is represented as nested JSON rather than normalized tables. This divergence introduces a fundamental evaluation challenge: how can we rigorously measure the accuracy of an AI query system on a non-relational platform without rewriting the benchmark itself? Doing so requires non-trivial architectural adaptation. This document presents the approach taken to re-architect the Spider2-Lite benchmark pipeline to run on Couchbase, a document-oriented, non-relational database, while fully preserving the integrity of the evaluation methodology.

Why Spider2 and Spider2-Lite?

Evaluating an AI-powered natural language query system requires a benchmark that is both realistic in query complexity and widely accepted by the research community. For this work, the Spider benchmark family was selected because it is one of the most rigorous and commonly used datasets for evaluating text-to-SQL systems.

Spider introduced a challenging evaluation paradigm in which models must generalize to previously unseen database schemas. Instead of memorizing query templates, systems must interpret a natural language question, understand the provided schema, and generate a correct query dynamically. This property makes Spider particularly well-suited for evaluating production systems such as Couchbase Capella iQ, where queries must operate over arbitrary customer schemas rather than a fixed training dataset.

More recently, Spider2 was introduced to reflect modern data environments and higher query complexity. Spider2 expands beyond simple relational tasks and introduces queries that more closely resemble real analytical workloads. However, the full Spider2 benchmark spans multiple database backends – including BigQuery, Snowflake, and Google Analytics – which require external infrastructure and large-scale data environments.

For this architectural study, the focus was placed on Spider2-Lite, a curated subset of Spider2 designed to preserve the benchmark’s complexity while remaining runnable in a controlled local environment. Spider2-Lite includes a set of SQLite-backed instances that can be executed locally without cloud dependencies, making it feasible to migrate the underlying data and reproduce the evaluation pipeline.

This made Spider2-Lite an ideal candidate for this case study: it maintains the rigor and schema generalization challenges of modern text-to-SQL benchmarks, while allowing the underlying relational datasets to be systematically migrated to Couchbase for evaluation.

The Architectural Problem

Spider2-Lite Assumptions

Spider2-Lite, like virtually all text-to-SQL benchmarks, is built on a set of foundational assumptions rooted in the relational model:

  • Data is stored in structured tables with fixed schemas
  • Queries are expressed in standard SQL (ANSI-compatible)
  • Relationships between entities are expressed through foreign keys and joins
  • Results are deterministic, row-ordered tabular outputs

These assumptions are well-suited for databases like SQLite, PostgreSQL, and MySQL. They do not hold, without adaptation, for document-oriented databases such as Couchbase.

Couchbase

Couchbase is a multi-model, non-relational database that organizes data as JSON documents within a hierarchy of buckets → scopes → collections, rather than databases and tables. Its query language, SQL++, is a superset of SQL capable of querying JSON structures – but it operates over keyspaces, not tables, and must contend with schema flexibility, mixed types, and nested document structures not present in the relational world.

This mismatch between the benchmark’s assumptions and Couchbase’s actual data model represents the central architectural challenge this work addresses.

Scope and Benchmark Filtering

Before any architectural work could begin, the benchmark was scoped appropriately. The full Spider2-Lite dataset contains 548 instances spanning BigQuery, Snowflake, Google Analytics, and SQLite backends. Only the 135 SQLite-based (“local”) instances were retained – these are the cases for which source data can be migrated to Couchbase, enabling faithful evaluation.


import json

# Filtering to local instances only ("lines" holds the benchmark's JSONL records, one per line)
filtered = [l for l in lines if json.loads(l).get('instance_id', '').startswith('local')]
# Result: 135 instances retained

Architectural Adaptations

Three categories of architectural change were required to make the benchmark viable on Couchbase.

Relational-to-Document Data Model Transformation

The first challenge was translating the relational schema into Couchbase’s document model without losing the structural information that SQL++ queries depend on.

Design decision: Preserve the relational hierarchy using Couchbase’s native organizational primitives (bucket, scope, and collection), so that each source table remains individually addressable as its own collection.

We deliberately avoided restructuring the data beyond this mapping, because doing so would invalidate the original benchmark queries. Preserving the hierarchy ensures that SQL++ queries generated by Capella iQ can reference the same logical entities as the original SQL queries – just via a different keyspace syntax (E_commerce.spider2.orders instead of orders).

The migration pipeline is intentionally two-staged:


SQLite (.sqlite)
    ↓  export_sqlite_to_json.py
JSON Intermediate (inspectable, auditable)
    ↓  batch_import_to_couchbase.py
Couchbase (bucket.scope.collection)

The intermediate JSON layer is not merely a technical artifact –  it is a critical quality gate. It allows human inspection and programmatic cleaning of the data before it enters Couchbase, which is not possible if migrating directly.
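
The first stage of the pipeline can be sketched with nothing but the Python standard library. The article does not show the contents of export_sqlite_to_json.py, so the function below (export_tables) and the in-memory orders table are hypothetical stand-ins that only illustrate the idea of turning each table into an inspectable list of JSON objects.

```python
import json
import sqlite3


def export_tables(conn: sqlite3.Connection) -> dict[str, list[dict]]:
    """Return {table_name: [row_dict, ...]} for every user table."""
    conn.row_factory = sqlite3.Row  # rows become addressable by column name
    tables = [
        r["name"]
        for r in conn.execute(
            "SELECT name FROM sqlite_master "
            "WHERE type = 'table' AND name NOT LIKE 'sqlite_%'"
        )
    ]
    # Table names come from sqlite_master itself, so quoting them is enough here.
    return {
        t: [dict(row) for row in conn.execute(f'SELECT * FROM "{t}"')]
        for t in tables
    }


# Tiny in-memory database standing in for a benchmark .sqlite file
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (order_id INTEGER, status TEXT)")
conn.execute("INSERT INTO orders VALUES (1, 'shipped')")

exported = export_tables(conn)
as_json = json.dumps(exported["orders"])  # the auditable intermediate artifact
```

Writing each table to its own JSON file at this point is what makes the later cleaning step possible before anything reaches Couchbase.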

Type System Reconciliation

Relational databases typically enforce column-level type constraints. However, SQLite is a notable exception: its type affinity model allows columns declared with types such as NUMERIC to store values of different kinds, including strings and integers, within the same column. While this permissive behavior is valid within SQLite, it can introduce ambiguity and inconsistencies when queries are executed in systems that assume stronger typing. As a result, queries derived from SQLite-based benchmarks may surface type-related failures when evaluated on the Couchbase SQL++ engine, which expects clearer type semantics at runtime.

When exported naively, these type inconsistencies carry forward into JSON. For example, an era column in a sports statistics table might contain [2.84, "", 3.12, ""] – a mix of floats and empty strings.

Solution: The find_mixed_type_columns.py utility was built to detect and remediate this class of issue at the JSON layer, before import:

  • Scans all JSON exports and classifies each column’s value types
  • Identifies columns with incompatible mixing (e.g., empty_string + int)
  • Replaces empty strings with null in numeric columns, making the data SQL++-compatible
  • Generates automatic backups and supports a --dry-run mode for safe preview

Analysis across 30 JSON exports identified 77 columns requiring remediation across seven files. After cleaning, all files imported into Couchbase without type-related errors.
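
The detection and remediation logic described above can be approximated in a few lines. This is a sketch of the technique, not the source of find_mixed_type_columns.py, which the article does not reproduce; the function names, the type labels other than empty_string and int, and the dry_run parameter are assumptions made for illustration.

```python
def value_kind(value) -> str:
    """Classify one JSON value into a coarse type label."""
    if value is None:
        return "null"
    if isinstance(value, bool):  # bool is a subclass of int, so check it first
        return "bool"
    if isinstance(value, int):
        return "int"
    if isinstance(value, float):
        return "float"
    if isinstance(value, str):
        return "empty_string" if value == "" else "string"
    return type(value).__name__


def find_mixed_columns(rows: list[dict]) -> dict[str, set[str]]:
    """Return columns that mix empty strings with numeric values only."""
    kinds: dict[str, set[str]] = {}
    for row in rows:
        for col, val in row.items():
            kinds.setdefault(col, set()).add(value_kind(val))
    numeric = {"int", "float"}
    return {
        col: seen
        for col, seen in kinds.items()
        if "empty_string" in seen and seen & numeric and "string" not in seen
    }


def clean_rows(rows: list[dict], dry_run: bool = False) -> list[dict]:
    """Replace "" with None in mixed numeric columns; dry_run changes nothing."""
    if dry_run:
        return rows
    mixed = find_mixed_columns(rows)
    return [
        {c: (None if c in mixed and v == "" else v) for c, v in row.items()}
        for row in rows
    ]


rows = [{"era": 2.84, "team": "A"}, {"era": "", "team": ""}, {"era": 3.12, "team": "B"}]
cleaned = clean_rows(rows)
```

In this example only era is rewritten: the empty team value stays a string because that column never holds numbers, so replacing it with null would change its meaning.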

Query Language Adaptation via Capella iQ

The final and most significant adaptation is at the query generation layer. Standard text-to-SQL systems produce ANSI SQL. Capella iQ produces SQL++, which differs in keyspace syntax, some function names, and its ability to navigate nested JSON.

Rather than attempting to transpile existing SQL reference queries into SQL++, the evaluation framework uses Capella iQ itself as the query generation layer – feeding each natural language question directly to Capella iQ and evaluating the result of executing the generated SQL++ against Couchbase. The evaluation thus measures real-world system performance, not the quality of a transpilation layer.

For each benchmark instance, the pipeline:

  1. Enumerates all available keyspaces (bucket.scope.collection) from Couchbase.
  2. Provides this keyspace context to Capella iQ.
  3. Submits the natural language question.
  4. Receives and executes the generated SQL++.
  5. Persists results for evaluation.
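
The five steps can be expressed as a small driver function. Because the article does not document the Capella iQ or Couchbase client calls it uses, the sketch below takes them as injected callables; list_keyspaces, generate_sqlpp, execute, and the instance ID are all hypothetical placeholders, stubbed here with lambdas.

```python
from typing import Callable


def run_instance(
    instance_id: str,
    question: str,
    list_keyspaces: Callable[[], list[str]],
    generate_sqlpp: Callable[[str, list[str]], str],
    execute: Callable[[str], list[dict]],
    results: dict,
) -> None:
    """Run one benchmark instance through the five pipeline steps."""
    keyspaces = list_keyspaces()                      # step 1: enumerate keyspaces
    statement = generate_sqlpp(question, keyspaces)   # steps 2-3: context + question
    rows = execute(statement)                         # step 4: run the generated SQL++
    results[instance_id] = {"sqlpp": statement, "rows": rows}  # step 5: persist


# Stubs standing in for the real Couchbase and Capella iQ clients
store: dict = {}
run_instance(
    "local001",
    "How many orders were shipped?",
    list_keyspaces=lambda: ["E_commerce.spider2.orders"],
    generate_sqlpp=lambda q, ks: f"SELECT COUNT(*) AS n FROM {ks[0]}",
    execute=lambda stmt: [{"n": 1}],
    results=store,
)
```

Keeping the generated statement next to its rows makes failed instances easier to diagnose later, since the evaluation itself only looks at the rows.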

Preserving Evaluation Integrity

Adapting the data and query layers would be insufficient if the evaluation methodology itself were compromised. Several measures were taken to ensure the evaluation remains faithful to the Spider2-Lite standard.

Result-Level Comparison

Rather than comparing the generated SQL++ text directly against the reference SQL (which would be invalid because the languages themselves differ), evaluation is performed at the result level. The output produced by executing the generated SQL++ against Couchbase is compared directly with the pre-computed output obtained by executing the reference SQL against the original SQLite database.

This design choice is critical for several reasons.

First, SQL and SQL++ are not syntactically or semantically identical languages. SQL++ is designed for semi-structured, JSON-based data and introduces constructs for navigating nested objects and arrays that do not exist in traditional SQL. Conversely, SQL assumes a flat relational schema. Because of these structural differences, a valid SQL query cannot simply be translated into an identical SQL++ string representation. Any text-level comparison would therefore penalize correct queries purely because they are written in a different language.

Second, query equivalence in databases is fundamentally semantic, not textual. Two queries can differ substantially in syntax yet produce identical results. For example, the same answer can be derived using joins versus subqueries, different aggregation strategies, or alternative filtering structures. Evaluating queries based on their textual similarity would incorrectly mark many correct solutions as wrong.

By evaluating the result sets produced by execution, the benchmark measures what actually matters: whether the system returns the correct answer. This makes the evaluation both language-agnostic and semantically faithful – a query is judged by what it produces, not by how closely its text resembles a reference query.

Multi-Variant Gold Standards

Some benchmark questions admit multiple equally valid result sets (e.g., different but correct orderings or aggregation groupings). The evaluation framework handles this by comparing the generated result against all available gold variants and taking the maximum score – a query passes if it matches any acceptable answer.

Per-Instance Evaluation Metadata

Each benchmark instance carries evaluation metadata specifying:

  • condition_cols – columns to use for sorting and matching
  • ignore_order – whether row ordering should be ignored when comparing results
  • toks – token complexity reference

This per-instance configuration ensures that numeric tolerance, column alignment, and ordering are applied consistently and correctly across all 135 test cases.
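
One way such metadata could drive a comparison is sketched below. It assumes condition_cols holds column indexes into the gold result, that each of those gold columns must appear somewhere in the predicted result, and that numbers are compared with an absolute tolerance of 0.01. All three are assumptions made for illustration, and toks is not used here.

```python
import math


def cells_equal(a, b, tol=1e-2):
    """Compare numbers within a tolerance; compare everything else exactly."""
    numeric = (int, float)
    if (isinstance(a, numeric) and isinstance(b, numeric)
            and not isinstance(a, bool) and not isinstance(b, bool)):
        return math.isclose(a, b, abs_tol=tol)
    return a == b


def sort_key(value):
    """Deterministic ordering for mixed columns: None, then numbers, then text."""
    if value is None:
        return (0, 0.0, "")
    if isinstance(value, (int, float)):
        return (1, float(value), "")
    return (2, 0.0, str(value))


def column(rows, idx, ignore_order):
    """Extract one column, sorted when row order should not matter."""
    values = [row[idx] for row in rows]
    return sorted(values, key=sort_key) if ignore_order else values


def matches(pred_rows, gold_rows, condition_cols, ignore_order, tol=1e-2):
    """Check that every required gold column appears among the predicted columns."""
    if len(pred_rows) != len(gold_rows):
        return False
    if not gold_rows:
        return True
    pred_columns = [column(pred_rows, i, ignore_order) for i in range(len(pred_rows[0]))]
    for gold_idx in condition_cols:
        gold_col = column(gold_rows, gold_idx, ignore_order)
        found = any(
            all(cells_equal(p, g, tol) for p, g in zip(pred_col, gold_col))
            for pred_col in pred_columns
        )
        if not found:
            return False
    return True


# Hypothetical instance metadata and results.
instance = {"condition_cols": [0, 1], "ignore_order": True}
gold = [("Ada", 290.0), ("Linus", 99.5)]
pred = [(99.5, "Linus", "x"), (290.004, "Ada", "y")]
print(matches(pred, gold, instance["condition_cols"], instance["ignore_order"]))
```

Sorting each column independently is a simplification: it tolerates reordered rows and extra columns, but it does not verify that values from different columns belong to the same row.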

Architectural Summary

ORIGINAL SPIDER2-LITE ARCHITECTURE          ADAPTED ARCHITECTURE (COUCHBASE)
─────────────────────────────────           ─────────────────────────────────────
SQLite .sqlite files                        SQLite → JSON → Couchbase Buckets
Standard SQL reference queries              SQL++ generated by Capella iQ
Table/column schema                         Keyspace (bucket.scope.collection)
Type-enforced columns                       Type-reconciled JSON documents
Result CSVs from SQLite                     Result CSVs from Couchbase N1QL
Evaluation: SQL vs reference SQL            Evaluation: Results vs gold results

The adapted architecture preserves every layer of the evaluation pipeline except the query language itself – which is exactly what is being evaluated.

Conclusion

This work demonstrates that industry-standard text-to-SQL benchmarks can be successfully and rigorously adapted to evaluate AI query systems operating on document-oriented databases. Although benchmarks such as Spider2-Lite were originally designed for relational systems, their underlying goal – measuring the ability of an AI system to translate natural language into correct database queries – remains equally relevant in modern semi-structured data environments.

Through a principled architectural adaptation – including relational-to-document data modeling, type-system reconciliation, and result-level evaluation across different query languages – the benchmark was executed on Couchbase while preserving the methodological integrity of the original evaluation framework. Rather than forcing document databases into a relational mold, this approach respects the native architecture of Couchbase and leverages SQL++ to operate directly on JSON documents and flexible schemas.

The results highlight a broader insight: modern AI query systems benefit significantly from operating on platforms designed for semi-structured data. Couchbase’s document model allows data to be stored in a form that more naturally reflects real-world application structures, while SQL++ provides the expressive power needed to query nested and heterogeneous data without complex relational transformations. When paired with Couchbase Capella iQ, this architecture enables natural language queries to be translated directly into executable SQL++ over native JSON datasets, reducing the impedance mismatch between how data is stored and how it is queried.

Taken together, this case study shows that Couchbase – combined with Capella iQ – provides a powerful foundation for AI-driven data access. By supporting flexible schemas, JSON-native querying, and intelligent query generation, the platform enables natural language interfaces to operate effectively over modern application data. The ability to run established benchmarks like Spider2-Lite on Couchbase further demonstrates that document databases can participate in rigorous evaluation frameworks while preserving the advantages of their native architecture.

References:

  1. Text-to-SQL Research Overview. Surveys and benchmarks evaluating natural language interfaces for databases.
  2. T. Yu et al., “Spider: A Large-Scale Human-Labeled Dataset for Complex and Cross-Domain Semantic Parsing and Text-to-SQL Task,” in Proc. EMNLP, 2018.
  3. “Spider2: A Benchmark for Complex and Realistic Text-to-SQL Tasks,” 2024.
  4. “Spider2-Lite: Lightweight Subset for Local Evaluation,” 2024. Curated subset designed for local execution and controlled evaluation environments.
  5. Couchbase Documentation. Couchbase Server Architecture and Data Model. Covers Buckets, Scopes, Collections, and JSON document storage.
  6. SQL++ (formerly N1QL): Couchbase Query Language Reference. Describes extensions over SQL for querying semi-structured JSON data.
  7. SQLite Documentation. Datatypes In SQLite Version 3. Explains type affinity and flexible typing behavior.
  8. Google BigQuery Documentation. Referenced as part of Spider2 backend environments.
  9. “Semantic Parsing: Concepts and Applications,” 2023.
  10. “Schema Generalization in Text-to-SQL Systems,” 2023. 

The post Can Text-to-SQL Benchmarks Work on Document Databases? A Couchbase Architecture Case Study appeared first on The Couchbase Blog.

Bravo! Philly’s Must-See Arts & Culture Events for 2026

The Semiquincentennial celebration offers an opportunity for Philly’s vibrant arts and culture community to celebrate the milestone through thought-provoking events, festivals and performing arts programming.

Learn some hidden history at storytelling benches across the historic district through Once Upon a Nation or take a walking tour, er jawnt, through every pocket of the city on the Neighborhood Jawnts Tour Series.

And you know Philly does Independence Day better than anywhere else in the country. Wawa Welcome America kicks off on Juneteenth and keeps going non-stop until the fireworks pop over the Benjamin Franklin Parkway.

Philly’s festival scene is on fire in 2026 with highlights including: five-week arts festival ArtPhilly: What Now, the Philadelphia Chinese Lantern Festival in Franklin Square, the ongoing Ring it On! series bringing a pop-up party to your neighborhood and Rockyfest, a celebration of everyone’s favorite fictional Philadelphian.

Check out this guide to the biggest cultural events, festivals and performing arts across the region in 2026.
