Sr. Content Developer at Microsoft, working remotely in PA, TechBash conference organizer, former Microsoft MVP, Husband, Dad and Geek.

Evaluating AI Agents: More than just LLMs


Artificial intelligence agents are undeniably one of the hottest topics at the forefront of today's tech landscape. As individuals and organizations increasingly rely on AI agents to simplify their daily lives, whether through automating routine tasks, assisting with decision-making, or enhancing productivity, it's clear that intelligent agents are not just a passing trend. But with great power comes greater scrutiny; or, from our perspective, it at least deserves greater scrutiny.

Despite their growing popularity, one concern we often hear is this: Is my agent doing the right things in the right way? An agent's behavior can be measured along many dimensions, and that is where agent evaluators come into play.

Why Agent Evaluation Matters

Unlike traditional LLMs, which primarily generate responses to user prompts, AI agents take action. They can search the web, schedule your meetings, generate reports, send emails, or even interact with your internal systems.

A great example of this evolution is GitHub Copilot’s Agent Mode in Visual Studio Code. While the standard “Ask” or “Edit” modes are powerful in their own right, Agent Mode takes things further. It can draft and refine code, iterate on its own suggestions, detect bugs, and fix them—all from a single user request. It’s not just answering questions; it’s solving problems end-to-end.

This makes them inherently more powerful—and more complex to evaluate. Here’s why agent evaluation is fundamentally different from LLM evaluation:

Dimension | LLM Evaluation | Agent Evaluation
Core Function | Content (text, image/video, audio, etc.) generation | Action + reasoning + execution
Common Metrics | Accuracy, Precision, Recall, F1 Score | Tool usage accuracy, Task success rate, Intent resolution, Latency
Risk | Misinformation or hallucination | Security breaches, wrong actions, data leakage
Human-likeness | Optional | Often required (tone, memory, continuity)
Ethical Concerns | Content safety | Moral alignment, fairness, privacy, security, execution transparency, preventing harmful actions
Shared Evaluation Concerns | Latency, Cost, Privacy, Security, Fairness, Moral alignment, etc. (common to both)

Take something as seemingly straightforward as latency. It’s a common metric across both LLMs and agents, often used as a key performance indicator. But once we enter the world of agentic systems, things get complicated—fast.

For LLMs, latency is usually simple: measure the time from input to response. But for agents? A single task might involve multiple turns, delayed responses, or even real-world actions that are outside the model’s control. An agent might run a SQL query on a poorly performing cluster, triggering latency that’s caused by external systems—not the agent itself.

And that’s not all. What does “done” even mean in an agentic context? If the agent is waiting on user input, has it finished? Or is it still "thinking"? These nuances make it tricky to draw clear latency boundaries.

In short, agentic evaluation, even for common metrics like latency, is not just harder than LLM evaluation. It's an entirely different game.
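To make the latency point concrete, here is a minimal sketch in plain Python (not tied to any particular agent framework; the step names and record shapes are hypothetical) of how per-step timing can separate time spent in external systems from time spent in the agent itself:

import time
from dataclasses import dataclass, field

@dataclass
class StepTiming:
    name: str          # e.g. "llm_call", "sql_query", "bing_search"
    is_external: bool  # True if the time was spent in an external system
    seconds: float

@dataclass
class RunTimings:
    steps: list = field(default_factory=list)

    def record(self, name, is_external, seconds):
        self.steps.append(StepTiming(name, is_external, seconds))

    def total(self):
        return sum(s.seconds for s in self.steps)

    def agent_only(self):
        # Latency attributable to the agent itself, excluding external systems.
        return sum(s.seconds for s in self.steps if not s.is_external)

# Usage sketch: wrap each step of a (hypothetical) agent run with a timer.
timings = RunTimings()

start = time.perf_counter()
# ... the agent reasons about the request (LLM call) ...
timings.record("llm_call", is_external=False, seconds=time.perf_counter() - start)

start = time.perf_counter()
# ... the agent runs a SQL query on a possibly slow cluster ...
timings.record("sql_query", is_external=True, seconds=time.perf_counter() - start)

print(f"end-to-end: {timings.total():.2f}s, agent-only: {timings.agent_only():.2f}s")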

What to Measure in Agent Evaluation

To assess an AI agent effectively, we must consider the following dimensions:

  1. Task Success Rate – Can the agent complete what it was asked to do?
  2. Tool Use Accuracy – Does the agent call the right tool with the correct parameters?
  3. Intent Resolution – Does it understand the user’s request correctly?
  4. Prompt Efficiency – Is the agent generating efficient and concise prompts for downstream models or tools?
  5. Safety and Alignment – Is the agent filtering harmful content, respecting privacy, and avoiding unsafe actions?
  6. Trust and Security – Do users feel confident relying on the agent? Does my agent have the right level of access to sensitive information and available actions?
  7. Response Latency and Reliability – How fast and consistent are the agent’s responses across contexts?
  8. Red-Teaming Evaluations – These evaluation metrics focus on the potential misuse of agents and test for different types of attacks, such as personally identifiable information (PII) leakage attacks and tool poisoning attacks.

This is especially critical for non-chat completion agents — those that don’t merely chat but execute workflows, navigate APIs, or trigger automations. Their evaluation requires scenario simulation, observability instrumentation, and fine-grained analytics.
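As a rough illustration of what that instrumentation might feed, here is a small scaffold (plain Python; the record fields are illustrative, not any specific product's schema) that aggregates a few of the dimensions above, such as task success rate, tool-use accuracy, and latency, over a set of recorded runs:

from dataclasses import dataclass

@dataclass
class AgentRun:
    task_succeeded: bool      # did the agent complete the request?
    correct_tool_calls: int   # tool calls with the right tool and parameters
    total_tool_calls: int
    latency_seconds: float

def summarize(runs):
    n = len(runs)
    tool_calls = sum(r.total_tool_calls for r in runs)
    return {
        "task_success_rate": sum(r.task_succeeded for r in runs) / n,
        "tool_use_accuracy": (
            sum(r.correct_tool_calls for r in runs) / tool_calls if tool_calls else None
        ),
        "avg_latency_seconds": sum(r.latency_seconds for r in runs) / n,
    }

# Example: three hypothetical runs from a scenario suite.
runs = [
    AgentRun(True, 2, 2, 1.4),
    AgentRun(True, 1, 2, 2.1),
    AgentRun(False, 0, 1, 0.9),
]
print(summarize(runs))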

Case Study: Evaluating a Simple AI Agent for Holiday Countdown

To illustrate how to evaluate an AI agent, let's walk through a simple use case: an AI agent that answers holiday countdown questions and sends holiday cards.

Agent Setup

  • Platform: Microsoft Foundry
  • Instruction:

" You are an assistant agent designed to answer simple questions about the number of days remaining until holidays, and to generate and send holiday cards. Always use the tool to retrieve the current date and the official date of the holiday in question. Based on this information, calculate and provide the number of days left until the holiday.

If the holiday is less than 90 days away, generate a personalized holiday card for the occasion.

Then, send the holiday card as a gift to <Email address>."

  • Tools: Bing Search, Azure Logic Apps (sending emails through Outlook)
  • LLM Model: GPT-4.1

Example Request

Query | Response
"How many days left until New Year?" | "Today is July 23, 2025. New Year's Day is January 1, 2026. There are 162 days left until New Year."
"How many days left to Labor Day?" | "Today is July 25, 2025. Labor Day in 2025 falls on September 1, 2025. There are 38 days left until Labor Day! Since it's less than 90 days away, here's a personalized holiday card for the occasion:"

Evaluation Dimensions

  1. Task Success Rate
    • Goal: The agent should correctly identify the holiday and current date, then return the accurate number of days left.
    • Evaluation: I tested 10 different holidays, and all were returned successfully. Task success rate = 10/10 = 100%. What's even better? Microsoft Foundry provides a built-in LLM-based evaluator for task adherence that we can leverage directly.
  2. Tool Use Accuracy
    • Goal: The agent should always use the tool to search for holidays and the current date, even if the LLM already knows the answer. It must call the correct tool (Bing Search) with appropriate parameters.
    • Evaluation: Initially, the agent failed to call Bing Search when it already "knew" the date. After updating the instruction to explicitly say "use Bing Search" instead of "use tool", tool usage became consistent. Clear instructions can improve tool-calling accuracy (a small check along these lines is sketched after the case study).
  3. Intent Resolution
    • Goal: The agent must understand that the user wants a countdown to the next holiday mentioned, not a list of all holidays or historical data, and should understand when to send a holiday card.
    • Evaluation: The agent correctly interpreted the intent, returned countdowns, and sent holiday cards when conditions were met. Microsoft Foundry's built-in evaluator confirmed this behavior.
  4. Prompt Efficiency
    • Goal: The agent should generate minimal, effective prompts for downstream tools or models.
    • Evaluation: Prompts were concise and effective, with no redundant or verbose phrasing.
  5. Safety and Alignment
    • Goal: Ensure the agent does not expose sensitive calendar data or make assumptions about user preferences.
    • Evaluation: For example, when asked "How many days are left until my next birthday?", the agent doesn't know who I am and doesn't have access to my personal calendar, where I marked my birthday with a 🎂 emoji. So the agent should not be able to answer this question accurately; if it does, you should be concerned.
  6. Trust and Security
    • Goal: The agent should only access public holiday data and not require sensitive permissions.
    • Evaluation: The agent did not request or require any sensitive permissions, which is a positive indicator of secure design.
  7. Response Latency and Reliability
    • Goal: The agent should respond quickly and consistently across different times and locations.
    • Evaluation: Average response time was 1.8 seconds, which is acceptable. The agent returned consistent results across 10 repeated queries.
  8. Red-Teaming Evaluations
    • Goal: Test the agent for vulnerabilities such as:
      * PII Leakage: Does it accidentally reveal user-specific calendar data?
      * Tool Poisoning: Can it be tricked into calling a malicious or irrelevant tool?
    • Evaluation: These risks are not relevant for this simple agent, as it only accesses public data and uses a single trusted tool.

Even for a simple assistant agent that answers holiday countdown questions and sends holiday cards, its performance can and should be measured across multiple dimensions, especially since it can call tools on behalf of the user. These metrics can then be used to guide future improvements to the agent – at least for our simple holiday countdown agent, we should replace the ambiguous term “tool” with the specific term “Bing Search” to improve the accuracy and reliability of tool invocation.
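To illustrate what the tool-use check from the case study might look like in practice, here is a small, hypothetical assertion over an agent's recorded tool calls; the trace format is invented for the example and is not Microsoft Foundry's API:

def used_expected_tool(tool_calls, expected_tool, required_terms):
    """Return True if any recorded call used the expected tool and its
    query mentions every required term (case-insensitive)."""
    for call in tool_calls:
        query = call.get("arguments", {}).get("query", "").lower()
        if call.get("tool") == expected_tool and all(
            term.lower() in query for term in required_terms
        ):
            return True
    return False

# Hypothetical trace from the holiday-countdown agent answering the Labor Day query.
trace = [
    {"tool": "bing_search", "arguments": {"query": "current date today"}},
    {"tool": "bing_search", "arguments": {"query": "Labor Day 2025 date"}},
]

assert used_expected_tool(trace, "bing_search", ["labor day"])
print("Tool-use check passed: Bing Search was called for the holiday date.")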

Key Learnings from Agent Evaluation

As I continue to run evaluations on the AI agents we build, several valuable insights have emerged from real-world usage. Here are some lessons I learned:

  • Tool Overuse: Some agents tend to over-invoke tools, which increases latency and can confuse users. Through prompt optimization, we reduced unnecessary tool calls significantly, improving responsiveness and clarity.
  • Ambiguous User Intents: What often appears as a “bad” response is frequently caused by vague or overloaded user instructions. Incorporating intent clarification steps significantly improved user satisfaction and agent performance.
  • Trust and Transparency: Even highly accurate agents can lose user trust if their reasoning isn’t transparent. Simple changes—like verbalizing decision logic or asking for confirmation—led to noticeable improvements in user retention.
  • Balancing Safety and Utility: Overly strict content filters can suppress helpful outputs. We found that carefully tuning safety mechanisms is essential to maintain both protection and functionality.

How Microsoft Foundry Helps

Microsoft Foundry provides a robust suite of tools to support both LLM and agent evaluation:

General purpose evaluators for generative AI - Microsoft Foundry | Microsoft Learn

By embedding evaluation into the agent development lifecycle, we move from reactive debugging to proactive quality control.


How to Auto-Generate WordPress Meta Descriptions With AI


Meta descriptions are one of the first impressions potential visitors get of your website in search results. They are a short summary of your page and can influence your click-through rate.

Because of this, they need to be concise, keyword-rich, and unique for every page, meaning writing meta descriptions by hand can quickly feel overwhelming.

That’s where AI can help. With tools like Jetpack AI Assistant, you can generate compelling meta descriptions in seconds while still controlling what search engines show. In this guide, you’ll learn how to auto-generate meta descriptions for your WordPress site using AI to save time without sacrificing quality.

What is a meta description?

A meta description is a summary of a webpage or post. Search engines like Google, Bing, and DuckDuckGo display it beneath your page title in search engine result pages (SERPs).

A strong meta description tells people what to expect before they click your link. It should be clear, concise, and aligned with the keywords your audience is searching for.

For example, here’s a meta description for a blog post about baking bread:

“Learn how to bake homemade sourdough bread with this step-by-step recipe. Perfect for beginners looking to master the art of baking.”

This attracts people searching for beginner-friendly, step-by-step recipes.

While search engines sometimes rewrite meta descriptions, providing your own helps guide what appears there and can increase the chances of someone choosing your link.

Foundations of a strong meta description for SEO

Meta descriptions don't directly boost search rankings, but they can make a big difference in clicks. More clicks show search engines that your page is useful and engaging, which can positively impact your SEO.

A strong meta description:

  • Clearly explains your page contents, so visitors know what to expect.
  • Includes the keywords and phrases people are actually searching for, matching search intent.
  • Entices readers to click with a compelling reason to learn more.

Because every page or blog post needs a meta description for better SEO, writing them all can quickly become tedious or overwhelming. Leveraging AI tools can help simplify and accelerate this process.

Why use AI to write meta descriptions?

AI tools can make creating meta descriptions — and improving other parts of your website — faster and easier.

Adding an AI plugin to your WordPress site helps analyze your content and generate strong meta descriptions tailored to your pages. You still have complete control and can edit them whenever you like. This approach saves time while keeping the quality of your website high.

How to auto-generate meta descriptions in WordPress with Jetpack AI Assistant

The easiest way to use AI for meta descriptions is with Jetpack AI Assistant. 

Jetpack provides an all-in-one suite of tools to help your site stay secure, perform at its best, and rank higher in search results. It includes features like automated backups, malware scanning, site performance optimization, SEO enhancements, and more — all integrated so you don’t need multiple plugins.

To get started, install the core Jetpack plugin:

  1. From your WordPress dashboard, go to Plugins → Add New
  2. Search for “Jetpack”
  3. Install and activate the plugin

Once active, follow the plugin instructions to connect it to your WordPress.com account. 

Generate AI meta descriptions for all pages

Jetpack’s SEO tools include AI-generated metadata. To enable automatic meta descriptions:

  1. Go to Jetpack → Settings → Traffic
  2. Scroll to the Search Engine Optimization section
  3. Toggle on “Automatically generate SEO title, SEO description, and image alt text for new posts”

That’s it! From now on, when you publish new posts, Jetpack will automatically create your meta descriptions.

Generate AI meta descriptions manually for each page (Optional)

If you want more control, you can generate meta descriptions for individual posts or pages:

  1. Open the post or page you wish to edit
  2. Click the Jetpack icon in the top right to open the sidebar
  3. Find the SEO panel
  4. Choose what you want to generate: title, description, or alt text
  5. Click “Generate Metadata”

Jetpack’s AI assistant reads your content and creates a concise description that you can use as-is or tweak to fit your needs.

Tips to write better meta descriptions with AI

While AI is convenient, reviewing the final result is important to ensure it follows best practices and accurately reflects your page content.

Here are a few tips to make your meta descriptions more effective:

  • Check the length: Keep it within 150-160 characters to avoid cutting off words in search results.
  • Include keywords: Add terms people search for to improve relevance and visibility.
  • Write for people: Make sure the description reads naturally and matches the page content.
  • Make it engaging: Use active words that encourage clicks, such as “Learn,” “Discover,” or “Get tips.”
  • Add value beyond the title: Include details or context that the title doesn’t already cover.
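If you want a quick sanity check on the length and keyword tips above before publishing, a short script along these lines (plain Python, no particular SEO plugin assumed) can flag problems and show the HTML tag a description would render as:

import html

def check_meta_description(description, keywords):
    """Return a list of warnings for a proposed meta description."""
    warnings = []
    if not 150 <= len(description) <= 160:
        warnings.append(f"Length is {len(description)} characters; aim for 150-160.")
    for kw in keywords:
        if kw.lower() not in description.lower():
            warnings.append(f"Missing keyword: {kw!r}")
    return warnings

def meta_tag(description):
    """Render the meta description as an HTML tag."""
    return f'<meta name="description" content="{html.escape(description, quote=True)}">'

desc = (
    "Learn how to bake homemade sourdough bread with this step-by-step recipe, "
    "including starter care and proofing tips. Perfect for beginners learning to bake."
)
print(check_meta_description(desc, ["sourdough", "recipe"]))
print(meta_tag(desc))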

Follow these recommendations, and you’ll get better results no matter which AI tool you use. Next, let’s look at other plugins and tools that can also help with meta descriptions.

Other plugins and tools that help

With so many SEO plugins now offering AI features, it’s easier than ever to find tools that help generate meta descriptions. A few examples include:

  • Yoast SEO: Offers AI helpers, though the advanced tools require a paid plan.
  • All in One SEO (AIOSEO): Includes a ChatGPT-powered AI Content Generator in its paid version.
  • Rank Math: Provides Content AI access, including meta description generation.

For a simpler, all-in-one solution, Jetpack lets you manage SEO, performance, security, backups, and more from a single dashboard.

Save time with additional Jetpack AI Assistant features

Jetpack AI Assistant makes creating meta descriptions and managing your site’s SEO, performance, and content easier than ever — all from a single, streamlined dashboard. Everything is built into the WordPress editor, so you can manage your site more efficiently while keeping your content fresh and clear.

You can use it to write full blog posts, build detailed pages, create structured lists, generate images, design forms and tables, translate content, and get feedback to improve your writing. This gives you more time to focus on your ideas while the AI handles the tasks that usually take up your day.






Cancellation Tokens with Stephen Toub

From: dotnet
Duration: 55:22
Views: 184

Scott and Stephen are back with another episode of Deep .NET, this time tackling Cancellation Tokens. From the early days of .NET’s “violent” thread aborts to today’s cooperative cancellation model, they explore how tokens provide a safe, composable way to stop work you no longer need. Stephen walks through the evolution from APM and EPM patterns, explains why explicit token passing beats ambient scopes, and shows how cancellation propagates through async calls. Along the way, they dig into cancellation token sources, registration callbacks, and even the role of volatile in multi-threaded code. If you’ve ever wondered how to gracefully cancel tasks without wasting resources, this episode is for you.

⌚ Chapters:
00:00:00 Introduction, banter, and setup for Deep .NET episode
00:03:15 Why cancellation matters: performance and avoiding wasted work
00:07:42 Early .NET approaches: thread aborts and their problems
00:12:10 Evolution of async patterns: APM and EPM without cancellation
00:17:25 BackgroundWorker and early cancellation mechanisms
00:21:40 Composition challenges and the need for a shared token model
00:25:30 Introduction of CancellationToken and cooperative cancellation
00:30:55 Passing tokens explicitly vs. ambient scopes
00:36:20 How cancellation propagates through async methods
00:41:05 Leaf operations, registration callbacks, and prompt cancellation
00:46:50 CancellationTokenSource: producing and linking tokens
00:52:15 Separation of observing vs. requesting cancellation
00:57:30 Implementation details: polling, register, and throw helpers
01:02:40 Why volatile matters in multi-threaded cancellation checks
01:08:10 Lock-free programming, visibility, and compiler optimizations
01:12:45 Wrapping up insights on cooperative cancellation in .NET

🔗 Docs: https://learn.microsoft.com/dotnet/standard/threading/cancellation-in-managed-threads

🎙️ Featuring: Scott Hanselman and Stephen Toub

📲 Connect with .NET:
Blog: https://aka.ms/dotnet/blog
Twitter: https://aka.ms/dotnet/twitter
TikTok: https://aka.ms/dotnet/tiktok
Mastodon: https://aka.ms/dotnet/mastodon
LinkedIn: https://aka.ms/dotnet/linkedin
Facebook: https://aka.ms/dotnet/facebook
Docs: https://learn.microsoft.com/dotnet
Forums: https://aka.ms/dotnet/forums
🙋‍♀️Q&A: https://aka.ms/dotnet-qa
👨‍🎓Microsoft Learn: https://aka.ms/learndotnet

#dotnet


What's new in SSMS 22 GA | Data Exposed

From: Microsoft Developer
Duration: 11:06
Views: 108

In November 2025, just one year after SSMS 21 preview was released, the SSMS team released the GA version of SSMS 22. In this episode, we'll learn more about what's new in SSMS 22, why SSMS had such a busy year, and what's next for SSMS.

✅ Chapters:
0:00 Introduction
1:00 SQL Server Management Studio - 1 Year Recap
2:10 Installing SSMS versions side-by-side
2:30 Adding workloads to SSMS
3:07 What's new in SSMS 22
5:33 Demo
8:40 What's new

✅ Resources:
Download and install SSMS 22: https://learn.microsoft.com/ssms/install/install
SSMS 22 release notes: https://learn.microsoft.com/ssms/release-notes-22
SSMS roadmap: https://learn.microsoft.com/ssms/roadmap
System requirements: https://learn.microsoft.com/ssms/system-requirements
Feedback site: https://developercommunity.microsoft.com/SSMS
SSMS 22 GA blog post: https://techcommunity.microsoft.com/blog/sqlserver/sql-server-management-studio-ssms-22-is-now-generally-available-ga/4469003

📌 Let's connect:
Twitter - Anna Hoffman, https://twitter.com/AnalyticAnna
Twitter - AzureSQL, https://aka.ms/azuresqltw

🔴 Watch even more Data Exposed episodes: https://aka.ms/dataexposedyt

🔔 Subscribe to our channels for even more SQL tips:
Microsoft Azure SQL: https://aka.ms/msazuresqlyt
Microsoft SQL Server: https://aka.ms/mssqlserveryt
Microsoft Developer: https://aka.ms/microsoftdeveloperyt

#AzureSQL #SQL #LearnSQL


::target-text: An easy way to style text fragments


You’re reading a great blog post. You want to share it with your friend but instead of getting them to read the whole thing, you really just want to highlight a few key sentences and have them go directly to that section of the page. That’s what text fragments are for.

As a user, you can highlight any section of text on a page and right click to make it a text fragment. In Safari, that means right clicking and selecting “Copy Link with Highlight” from the menu and getting a url that will highlight the text fragment when the page loads.
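Under the hood, the copied link is simply the page URL with a text fragment directive appended. A hypothetical example (the real link encodes the exact text you highlighted) looks like this:

https://example.com/great-post#:~:text=a%20few%20key%20sentences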

The default highlighting gives you a pale yellow highlight under the fragment text, like this:

[Screenshot: lorem ipsum blog post with the default pale yellow highlight on the text fragment.]

You can click on this link to see for yourself how it works.

That’s the user experience. But what about the developer experience? Is there something we developers can do to customize that experience for our users a bit more? Actually, there is! We’ll use the ::target-text pseudo-element to help us style our text fragment.

In your CSS file, use the ::target-text pseudo-element and style the text with whatever properties you wish, like this:

::target-text {
  background-color: blue;
  color: white;
}

That’ll get you this result:

[Screenshot: lorem ipsum blog post with the text fragment styled as white text on a blue highlight.]

So if you want to decide how a text fragment looks to your users, take advantage of ::target-text and own the user’s text fragment experience. It’s fully supported in all browsers.

If you enjoyed this kind of bite-size content, let me know. You can reach me, Saron Yitbarek, on BlueSky, or reach out to our other evangelists — Jon Davis, on Bluesky / Mastodon, and Jen Simmons, on Bluesky / Mastodon. You can also follow WebKit on LinkedIn. If you find a bug or problem, please file a WebKit bug report.


Building dbRosetta Part 4: Automating a CI Database Build


Since I’m starting development with the dbRosetta database, and since I’m way more comfortable with databases than with code, I’m going to continue within the database sphere for a bit as we build out dbRosetta. My next step is to work with the AI to get a pipeline in place to take our database code and deploy it to Azure Flex Server. Also, because I’ve got a lot more comfort with it, we’ll be using Flyway as our database automation tool (also, because it’s the best DB deployment tool out there, fight me). We’re off to get this step completed.

This is, as the title says, Part 4 of an ongoing series.

Setting up Flyway

I went straight to CoPilot and asked it to generate the prompt I would need for VSCode. However, I did a bad job on the prompt. See if you can spot the problem:

Prompt:

Can you give me a prompt for vsCode, using the agreed on prompt structure that you helped me build, to create a Flyway Enterprise baseline and initial set of migrations based on reverse engineering using Flyway Compare technology, to capture the existing database we created earlier through the Flyway CLI?

Response:

# Goal

Create a Flyway Enterprise baseline and initial set of migrations by reverse engineering the existing database (previously created via Flyway CLI), using Flyway Compare technology. The output should capture the current schema state as a reproducible baseline plus forward-compatible migration scripts.

 

# Context

– Database: PostgreSQL (existing instance created earlier with dialects, terms, translations seeded).

– Tooling: Flyway Enterprise with Flyway Compare enabled.

– Environment: CI/CD pipeline target, baseline must be versioned and reusable.

– Workflow: Reverse engineer schema → baseline → generate initial migrations → validate.

 

# Constraints

– Baseline must be non-destructive and idempotent.

– Migration scripts should follow Flyway’s versioned naming convention (e.g., V1__baseline.sql, V2__init_migrations.sql).

– Ensure compatibility with CI/CD automation (no environment-specific hardcoding).

– Scripts must be checked into Git for reproducibility.

 

# Deliverable

– A Flyway Enterprise baseline script capturing the current schema.

– An initial set of migration scripts generated via Flyway Compare.

– Output should be ready for inclusion in the dbRosetta repo as reference artifacts.

 

# Steps

1. Connect Flyway Enterprise to the target PostgreSQL database.

2. Run flyway compare against an empty reference schema to detect differences.

3. Generate baseline script (V1__baseline.sql) from the existing schema.

4. Generate initial migration scripts (V2__init_migrations.sql) for any detected changes.

5. Validate scripts by running flyway info and flyway validate.

6. Commit artifacts to Git under /migrations with clear version tags.

 

# Output Format

– SQL migration files with Flyway naming convention.

– Console log of Flyway Compare results (for validation).

– Documentation snippet summarizing baseline + migrations for contributor onboarding.

It might not be readily apparent. I didn't specify what I wanted for development versus production environments or anything like that, so it's treating them all the same. This is something we'll probably need to fix later. Isn't it great that, even with AI, we can still introduce code debt?

I fed this prompt into vsCode. It then went to work. It knows how to connect up to my Azure Flex Server for PostgreSQL, so it ran pg_dump against Azure to get the SQL file for the baseline (I assumed that, after being told to, it would use Flyway, but it surprised me). Then it built out the *.toml file for controlling Flyway and ran a baseline command against my database using the migrations it had created. It didn't work flawlessly. It made a mistake in the configuration. Then, since it had provided multiple test steps, it fixed it. I'm not kidding. It took a couple of rounds, but GitHub Copilot got it all right.

So, yeah, I’ve got a Flyway migration harness ready to go. Well, at least, ready to set up. There may be more that’s needed as we go down the road to automate this, but, I’m not going to lie, I’m surprised at what I’ve got so far. Pleasantly so.

Creating a Pipeline

I'm not sure if this is the right order of operations, and I didn't ask the AI this time. However, when I've set these up myself, I get Flyway configured and functional first, and only then do I figure out how to get it running within a pipeline and create that pipeline. Talking with CoPilot, it wanted me to prompt it (so weird talking to one to talk to the other, but it's working gangbusters so far) to build the GitHub Actions now. I had to remind it that earlier, it had told me we'd need to set up a VM inside my VNet to be a self-hosted GitHub runner. I've noticed that these things don't do a flawless job of maintaining context. It falls to us, at least in my experience so far, to be extremely thorough in our prompts in order to get the best results. However, yeah, it remembers again that we're behind a firewall.

I’m not going to post the whole conversation we had on this. It took a little back and forth to arrive at a plan that was actually secure. I’m getting regular reminders through this process that, just blindly taking an LLM and putting it to work on sensitive code is inherently dangerous. These things are frighteningly good at what they do. However, the fact that they seem to forget the context, a lot, is disconcerting. And yes, before anyone asks, I’m using the paid tier for this adventure. I have to imagine the free tier may be even more forgetful.

Regardless, I got a prompt out of CoPilot:

# Goal
Provision a least‑cost Azure VM inside the target VNet to act as a self‑hosted GitHub Actions runner for dbRosetta. Configure it to securely retrieve secrets from Azure Key Vault at runtime.

# Context
– Cloud: Azure
– Database: PostgreSQL Flexible Server (private VNet access only)
– Runner: GitHub Actions self‑hosted
– Tooling: Flyway Enterprise CLI
– Secrets: Stored in Azure Key Vault, accessed via managed identity/OIDC
– Cost sensitivity: minimize VM size while maintaining functionality

# Constraints
– VM must be inside the same VNet/subnet as PostgreSQL Flexible Server
– Use Ubuntu LTS image for lowest cost and compatibility
– VM size: Standard_B1s (1 vCPU, 1 GB RAM)
– No public IP; access only via VNet
– GitHub Actions runner service installed
– Flyway Enterprise CLI installed
– Secrets retrieved from Key Vault at runtime (no secrets in repo)

# Deliverable
– Azure CLI commands to provision VM and enable system-managed identity
– Key Vault setup: create vault, add DB connection string + Flyway license
– GitHub Actions workflow snippet to fetch secrets via OIDC
– Connectivity validation log (flyway info)

# Steps
1. Create resource group if not exists.
2. Provision VM:
az vm create with Ubuntu LTS image
– Size: Standard_B1s
– VNet/subnet: same as PostgreSQL Flexible Server
– Disable public IP
– Enable system-managed identity
3. Create Azure Key Vault; add secrets (DB connection string, Flyway license).
4. Grant VM identity Key Vault Secrets User role.
5. SSH into VM; install GitHub Actions runner service.
6. Install Flyway Enterprise CLI.
7. Validate connectivity: runner retrieves secrets from Key Vault → run flyway info.
8. Document setup in /docs/runner-setup.md.

# Output Format
– Azure CLI commands
– Bash install script for runner + Flyway
– GitHub Actions workflow snippet for OIDC + Key Vault
– Connectivity validation log

Oh boy. Here we go. Of course, that’s when I realized I already had one prompt created for this. <sigh> That’s what I get for taking weeks between posts. LLMs are not the only things that lose context. I’m leaving that prompt in the project, but I’ll mark it at the top of the file as a mistake. After running this prompt, this time, it chose to create everything as bash scripts. Here’s the result of the first one:

Sure looks like it’s working to me. I’ve got the VM, security group, network interface, VNet. On to the next script. This one involves a lot of setup within the VM. vsCode can’t help me within that as an agent, so I’m doing a lot of typing from vsCode prompts. Now, I’m the AI. Ha! The interesting thing was troubleshooting aspects of the setup. Yes, the AI didn’t get everything right the first time (I also may have introduced a typo or three). Troubleshooting was actually pretty outstanding. I’d feed it errors, it would feed me solutions, or additional checks followed by solutions.

Most of the rest of this was very interactive. Some of the work was readily done by GitHub Copilot acting as an agent within vsCode. Some of the work was me, getting certificates and stuff like that which couldn't readily be retrieved through CoPilot. It was interesting how much I had to remind it of things. For example, it wanted to test what we'd done, but the pipeline only supported a 'Development' or 'Production' branch. Its first inclination was to add the branch we were working on, but that branch wouldn't be permanent, so I had to remind it that we could do it for a test, but it had to be easily undone. Then, when we hit a security problem on our PostgreSQL cluster, it wanted to use the public IP address we had temporarily added to our VM. It was like, once a thing was done and available, that thing was permanently part of the project, despite its stated purpose being a temporary thing for setup. For example:

Instead of the public IP address, shouldn’t we do something within Azure to let the resource access it since the IP address is temporary for the setup?

Excellent point! You’re absolutely right. Since both the VM and PostgreSQL are in Azure, we should use VNet integration or Azure service endpoints instead of relying on the public IP. Let me check the PostgreSQL networking configuration:

az postgres flexible-server show --resource-group dbRosetta --name dbrosetta --query "{name:name,network:{delegatedSubnetResourceId:network.delegatedSubnetResourceId,privateDnsZoneArmResourceId:network.privateDnsZoneArmResourceId,publicNetworkAccess:network.publicNetworkAccess}}" --output json

az postgres flexible-server show --resource-group dbRosetta --name dbrosetta --query "network" --output json
I see – your PostgreSQL server is configured with public network access (not private VNet integration). For Azure PostgreSQL Flexible Server with public access, the proper Azure-native solution is to use Azure Private Endpoint or enable Allow Azure services.

Let me check if we can enable the “Allow access to Azure services” option:

az postgres flexible-server firewall-rule create --resource-group dbRosetta --name dbrosetta --rule-name "AllowAzureServices" --start-ip-address 0.0.0.0 --end-ip-address 0.0.0.0
Perfect! The firewall rule 0.0.0.0 is a special Azure rule that allows all Azure services (including VMs in the same subscription) to connect to the PostgreSQL server. This is the Azure-native way to allow internal Azure resources to communicate.

Now let’s test Flyway connectivity again:

ssh -i C:\Users\grant\.ssh\dbrosetta-runner azureuser@23.96.182.139 "source ~/load-db-env.sh && flyway info -schemas=dbrosetta"
Good progress! We’re past the network timeout – now it’s a password authentication issue. Let me verify the credentials in Key Vault:

See. It figured out stuff with only the occasional nudge.

Conclusion

Overall, I'm blown away. Most snags were not because the AI hallucinated or was horribly wrong. Most snags were from licensing, permissions, and yeah, bits and bobs I didn't define while I was providing all the prompts to both AIs. The silly thing was pretty darned helpful. I let it do troubleshooting that I was perfectly capable of doing, because I wanted to see it work things out. It did. Quite well. AI is coming for your job… IF you're not real good at your job. Fact is, this was an incredible force multiplier. I could have done every bit of work we did here. GitHub Copilot made it all a lot faster and easier. However, I had to be on top of it. I had to track everything because it didn't always. It was more than ready to compromise security, multiple times. I couldn't have just let it run. This means, at least for the moment, we're not just employed, we're necessary.

Next, I think we’re building an application.

Here is the conclusion of the project according to Github Copilot in vsCode:

We provisioned a cost-effective Azure VM ($7.50/month) as a self-hosted GitHub Actions runner, configured with managed identity for passwordless authentication to Azure services. Azure Key Vault securely stores PostgreSQL credentials, which the runner retrieves at runtime to execute Flyway migrations. PostgreSQL firewall rules allow Azure-to-Azure communication, eliminating the need for public IP dependencies. The complete CI/CD pipeline was successfully validated—from GitHub push to automated schema validation and migration—all running securely within Azure’s network perimeter.

And here’s the summary from CoPilot:

In Part 4 of the dbRosetta series, the focus shifts from local experimentation to automated, reproducible deployments. Building on the earlier baseline and migration scripts, this stage introduces a CI/CD pipeline powered by Flyway Enterprise and GitHub Actions, with a self‑hosted runner inside Azure’s VNet to securely connect to PostgreSQL Flexible Server. The article walks through provisioning the runner, integrating Azure Key Vault for secret management, and validating migrations end‑to‑end. By the close, dbRosetta evolves from a working prototype into a resilient, teachable reference implementation — demonstrating how database DevOps can be both secure and community‑friendly.

Me, Grant: If you enjoyed this article and you don’t know about the series:
Part 1: Introducing the Concept of dbRosetta
Part 2: Defining the Project & Prompt Templates
Part 3: Creating a Database

And if you’d like to look at the code, including the prompts used:
dbRosetta

 

The post Building dbRosetta Part 4: Automating a CI Database Build appeared first on Redgate.
