
e238 – Presentation Pitfalls with John Polk

Show Notes – Episode #238

The latest episode of The Presentation Podcast brings together hosts Troy, Sandy, and Nolan with special guest John Polk—author, workshop leader, and consultant—to discuss his new book, Presentation Pitfalls: Ten Traps Business Professionals Fall Into and How to Avoid Them (co-authored with Justin Hunsaker). This episode is [...]



Download audio: https://traffic.libsyn.com/secure/thepresentationpodcast/TPP_e238.mp3

Reclaiming underutilized GPUs in Kubernetes using scheduler plugins


The problem nobody talks about

GPUs are expensive, and yours are probably sitting idle right now. High-end GPUs (for example, NVIDIA A100-class devices) can cost $10,000+, and in a Kubernetes cluster running AI workloads, you might have dozens of them. Here’s the uncomfortable truth: most of the time, they’re doing nothing. If you’re struggling with GPU scheduling in Kubernetes or looking for ways to reclaim idle GPUs, you’re not alone.

A data scientist spins up a training job, requests 4 GPUs, runs for two hours, then leaves for lunch. The GPUs sit allocated but unused. Meanwhile, another team’s job is queued, waiting for resources that technically exist but aren’t available.

Standard Kubernetes scheduling doesn’t help here. It sees allocated resources as unavailable — period. The scheduler does not currently take real-time GPU utilization into account. 

Kubernetes scheduling trade-offs for GPUs

Kubernetes was built for CPUs. Its scheduling model assumes resources are either allocated or free, with nothing in between. For CPUs, this mostly works — a pod using 10% of its requested CPU isn’t blocking others in the same way.

GPUs are different. They’re discrete, expensive, and often requested in large quantities. A pod requesting 4 GPUs gets exactly 4 GPUs, even if it’s only actively using them 20% of the time. This is the core challenge of GPU resource management in Kubernetes — the scheduler has no concept of actual utilization.

The default Kubernetes preemption mechanism (DefaultPreemption) can evict lower-priority pods to make room for higher-priority ones. But it only considers priority — not actual utilization. Pods are treated equivalently from a preemption perspective when they share the same priority, regardless of their current utilization.

We evaluated several existing approaches. For example, device plugins focus on allocation, while autoscaling addresses capacity rather than reclaiming idle resources. Cluster autoscaler can add nodes but won’t reclaim idle resources on existing ones. Various GPU sharing approaches exist, but they don’t address the fundamental scheduling problem.

The core idea: Utilization-aware preemption

We needed utilization-aware preemption that considers what GPUs are actually doing, not just what they’ve been allocated. The solution: a custom Kubernetes scheduler plugin for idle GPU reclaim that replaces the default preemption logic with one that incorporates utilization signals.

The plugin, which we called ReclaimIdleResource, operates in the PostFilter phase of the scheduling cycle. This is where Kubernetes looks for preemption candidates when a pod can’t be scheduled normally.

Here’s the key insight: instead of just comparing priorities, we query Prometheus for GPU utilization metrics (in our case, sourced from DCGM).  A pod is only eligible for preemption if:

1. The preemptor’s priority meets the minimum-preemptable-priority threshold configured on the victim’s PriorityClass

2. It’s been running long enough to establish a usage pattern

3. Its actual GPU utilization is below a configured threshold

This means an idle pod with priority 1000 can be preempted by a pod with priority 500, if the idle pod isn’t actually using its GPUs.
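
The post doesn’t include the plugin’s source, but the three checks translate naturally into a small predicate. Here is a minimal Go sketch, assuming a hypothetical ReclaimPolicy struct that mirrors the per-PriorityClass annotations shown later (the type and function names are ours, not the plugin’s):

package reclaim

import (
	"time"

	v1 "k8s.io/api/core/v1"
)

// ReclaimPolicy mirrors the per-PriorityClass annotations described below.
// This struct is illustrative; the plugin's real types are not shown in the post.
type ReclaimPolicy struct {
	MinimumPreemptablePriority int32         // preemptor must meet or exceed this
	TolerationPeriod           time.Duration // grace period after scheduling
	IdleThresholdPercent       float64       // e.g. 10.0 means "below 10% is idle"
}

// eligibleForPreemption applies the three checks from the list above.
func eligibleForPreemption(victim *v1.Pod, preemptorPriority int32,
	policy ReclaimPolicy, avgGPUUtilization float64) bool {
	// 1. The preemptor must clear the victim's configured priority threshold.
	if preemptorPriority < policy.MinimumPreemptablePriority {
		return false
	}
	// 2. The victim must have run long enough to establish a usage pattern.
	if victim.Status.StartTime == nil ||
		time.Since(victim.Status.StartTime.Time) < policy.TolerationPeriod {
		return false
	}
	// 3. Its average GPU utilization must sit below the idle threshold.
	return avgGPUUtilization < policy.IdleThresholdPercent
}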

Where ReclaimIdleResource fits in the scheduling cycle

The plugin replaces DefaultPreemption in the PostFilter phase, activating only when normal scheduling fails. 

[Figure: the scheduling cycle as a flow chart. The flow runs from PreFilter to Filter to Score; if feasible nodes are found, the pod proceeds straight to Bind. If not, the flow moves to PostFilter, where ReclaimIdleResource replaces DefaultPreemption.]

How it works

The plugin hooks into the scheduler as a PostFilter extension:

profiles:
- schedulerName: default-scheduler
  plugins:
    postFilter:
      enabled:
      - name: ReclaimIdleResource
      disabled:
      - name: DefaultPreemption

When a GPU-requesting pod can’t be scheduled, the plugin:

1. Checks cooldown — Has this pod recently triggered preemption? If so, wait. This prevents thrashing.

2. Scans potential victims — Finds all lower-priority pods on candidate nodes that have GPUs.

3. Evaluates each victim — Parses its PriorityClass for reclaim policy annotations, checks if it’s still in its “toleration period” (grace period after scheduling), queries Prometheus for average GPU utilization over the monitoring window, and compares utilization against the idle threshold.

4. Selects minimal victims — Sorts eligible victims by GPU count (descending) and priority (ascending), then selects the minimum set needed to free enough GPUs, as sketched after this list.

5. Validates the decision — Runs filter plugins to confirm the preemptor will actually fit after preemption.
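
Step 4 is the interesting one. A minimal Go sketch of that selection logic, assuming a simplified victim record (the real plugin works against node and pod objects; these names are illustrative):

package reclaim

import "sort"

type victim struct {
	name     string
	priority int32
	gpus     int64 // GPUs the pod currently holds
}

// selectMinimalVictims sorts eligible victims by GPU count (descending) and
// priority (ascending), then takes the smallest prefix that frees enough GPUs.
func selectMinimalVictims(eligible []victim, gpusNeeded int64) []victim {
	sort.Slice(eligible, func(i, j int) bool {
		if eligible[i].gpus != eligible[j].gpus {
			return eligible[i].gpus > eligible[j].gpus // most GPUs first
		}
		return eligible[i].priority < eligible[j].priority // lowest priority first
	})
	var chosen []victim
	var freed int64
	for _, v := range eligible {
		if freed >= gpusNeeded {
			break
		}
		chosen = append(chosen, v)
		freed += v.gpus
	}
	if freed < gpusNeeded {
		return nil // even preempting everything wouldn't free enough GPUs
	}
	return chosen
}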

The policy is defined per-PriorityClass through annotations:

apiVersion: scheduling.k8s.io/v1
kind: PriorityClass
metadata:
  name: batch-workload
  annotations:
    reclaim-idle-resource.scheduling.x-k8s.io/minimum-preemptable-priority: "10000"
    reclaim-idle-resource.scheduling.x-k8s.io/toleration-seconds: "3600"
    reclaim-idle-resource.scheduling.x-k8s.io/resource-idle-seconds: "3600"
    reclaim-idle-resource.scheduling.x-k8s.io/resource-idle-usage-threshold: "10.0"
value: 8000

This says: pods in this priority class are protected for one hour after they are scheduled (the toleration period); after that, they can be preempted if their average GPU usage stays below 10% for an hour, but only by pods with priority 10000 or higher.
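
A sketch of how those annotations might be read into a policy value, assuming the annotation keys from the example above (the struct and helper are ours for illustration):

package reclaim

import (
	"strconv"
	"time"

	schedulingv1 "k8s.io/api/scheduling/v1"
)

const prefix = "reclaim-idle-resource.scheduling.x-k8s.io/"

type ReclaimPolicy struct {
	MinimumPreemptablePriority int32
	TolerationPeriod           time.Duration
	IdleWindow                 time.Duration
	IdleThresholdPercent       float64
}

// policyFromPriorityClass parses the reclaim annotations off a PriorityClass.
func policyFromPriorityClass(pc *schedulingv1.PriorityClass) (ReclaimPolicy, error) {
	a := pc.Annotations
	minPrio, err := strconv.ParseInt(a[prefix+"minimum-preemptable-priority"], 10, 32)
	if err != nil {
		return ReclaimPolicy{}, err
	}
	tolerate, err := strconv.ParseInt(a[prefix+"toleration-seconds"], 10, 64)
	if err != nil {
		return ReclaimPolicy{}, err
	}
	idleSecs, err := strconv.ParseInt(a[prefix+"resource-idle-seconds"], 10, 64)
	if err != nil {
		return ReclaimPolicy{}, err
	}
	threshold, err := strconv.ParseFloat(a[prefix+"resource-idle-usage-threshold"], 64)
	if err != nil {
		return ReclaimPolicy{}, err
	}
	return ReclaimPolicy{
		MinimumPreemptablePriority: int32(minPrio),
		TolerationPeriod:           time.Duration(tolerate) * time.Second,
		IdleWindow:                 time.Duration(idleSecs) * time.Second,
		IdleThresholdPercent:       threshold,
	}, nil
}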

Key design decisions

Why PriorityClass annotations?

We considered a custom CRD, but PriorityClass already exists in the scheduling mental model. Teams already think about priority when designing workloads. Adding reclaim policy as annotations keeps the configuration close to where people expect it.

Why a monitoring window instead of instant utilization?

GPU workloads are bursty. A training job might spike to 100% utilization during forward/backward passes, then drop to near-zero during data loading. Instant measurements would give false positives. We use a configurable window (typically 30–60 minutes) to capture the true usage pattern.

Why query Prometheus instead of using in-memory metrics?

The scheduler runs as a single replica. We needed utilization data that survives scheduler restarts and can be queried historically. DCGM exports to Prometheus naturally, and most GPU clusters already have this pipeline.
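
Concretely, the windowed check reduces to a single avg_over_time query against the DCGM metric. Here is a hedged Go sketch using the Prometheus client library; DCGM_FI_DEV_GPU_UTIL is the utilization gauge exported by dcgm-exporter, while the label names, the outer avg aggregation, and the Prometheus address are assumptions for illustration:

package reclaim

import (
	"context"
	"fmt"
	"time"

	"github.com/prometheus/client_golang/api"
	promv1 "github.com/prometheus/client_golang/api/prometheus/v1"
	"github.com/prometheus/common/model"
)

// avgGPUUtilization returns a pod's mean GPU utilization (percent) averaged
// over the monitoring window, across all GPUs the pod holds.
func avgGPUUtilization(ctx context.Context, promAddr, namespace, pod string,
	window time.Duration) (float64, error) {
	client, err := api.NewClient(api.Config{Address: promAddr})
	if err != nil {
		return 0, err
	}
	// avg_over_time smooths out the bursty forward/backward-pass spikes
	// that would make an instant measurement misleading.
	query := fmt.Sprintf(
		`avg(avg_over_time(DCGM_FI_DEV_GPU_UTIL{namespace=%q, pod=%q}[%s]))`,
		namespace, pod, model.Duration(window))
	result, _, err := promv1.NewAPI(client).Query(ctx, query, time.Now())
	if err != nil {
		return 0, err
	}
	vec, ok := result.(model.Vector)
	if !ok || len(vec) == 0 {
		return 0, fmt.Errorf("no utilization samples for %s/%s", namespace, pod)
	}
	return float64(vec[0].Value), nil
}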

Why a cooldown period?

Without it, a preemptor pod could trigger preemption, fail to schedule for unrelated reasons, and immediately trigger another preemption attempt. The 30-second cooldown prevents rapid-fire preemption storms.
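
The post doesn’t detail the bookkeeping, but a minimal sketch of such a cooldown gate, assuming an in-memory tracker keyed by the preemptor pod, could look like this:

package reclaim

import (
	"sync"
	"time"
)

const cooldown = 30 * time.Second

type cooldownTracker struct {
	mu   sync.Mutex
	last map[string]time.Time // preemptor pod key -> last preemption attempt
}

// allow reports whether the preemptor may trigger another preemption,
// recording the attempt when it is allowed.
func (t *cooldownTracker) allow(podKey string, now time.Time) bool {
	t.mu.Lock()
	defer t.mu.Unlock()
	if last, ok := t.last[podKey]; ok && now.Sub(last) < cooldown {
		return false // still cooling down; prevents rapid-fire preemption storms
	}
	if t.last == nil {
		t.last = make(map[string]time.Time)
	}
	t.last[podKey] = now
	return true
}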

What we learned

Tuning matters more than we expected. The idle threshold and monitoring window need to match your workload patterns. Too aggressive and you’ll preempt jobs mid-training. Too conservative and you won’t reclaim much.

Observability is essential. We added extensive logging and Kubernetes events so operators can understand why preemption decisions were made. When someone’s job gets preempted, they want to know why.

Multi-Instance GPU (MIG) introduces additional scheduling considerations. NVIDIA’s MIG feature means a single physical GPU can be partitioned. We had to add partition-size compatibility checks to avoid preempting pods on nodes where the preemptor couldn’t run.

Related Links

• Kubernetes Scheduler Plugins: https://github.com/kubernetes-sigs/scheduler-plugins

• NVIDIA DCGM Exporter: https://github.com/NVIDIA/dcgm-exporter

Originally published on Medium. Permission granted for republishing on CNCF blog.


What I’ve learned over 3 years of writing code by Doro Hinrichs


Three years of writing code

At the beginning of 2023, I kick-started my career in tech with Scott Logic’s graduate program. It’s been an incredible journey, and three years later I wanted to reflect and share some of the lessons I’ve learned.

Names

Solid naming conventions make a huge difference. A branch named “ticket-42-add-intergalactic-microservice” is far better than simply “42”. A commit message “ticket 42: added controller” is easier to work with than “made changes”.

And when you work with users, product owners, testers, and other teams, making sure that we all know we are talking about the same thing becomes absolutely necessary for making progress. Consistency is vital, and I wouldn’t shy away from adding definitions to the README or adding an intranet page explaining the terminology different teams use. At the end of the day it doesn’t matter if we call it a spaceship_id, ship-key, or transportIdentifier, as long as we use whichever name and convention we pick consistently in our code and conversations.

Speaking of conversations – let’s talk about people next.

People

When I first started to learn coding someone told me: “In software development, all problems that aren’t maths problems are people problems”. After three years I think I’m starting to get an idea of what they meant.

The best developers I know are not just technically excellent, but also know how to talk to colleagues and stakeholders alike with kindness and clarity. They know how to explain complicated processes in simple terms, they don’t overuse jargon just for the sake of it, and most importantly: they take ego out of the equation. If they mess up, they don’t try to brush it under the carpet hoping that no one will notice; they own their mistakes and fix them as soon as possible. For more junior staff it makes a big difference to see your seniors and leads model this kind of growth mindset; it creates a safe culture for learning and making progress fast.

The myth of “fixing it later”

On the topic of making progress fast: It’s usually better to take a few more hours to do something right the first time than to rush a PR through with hard-coded mock data, no tests, and a promise that ‘we’ll fix it later’. That builds up tech debt that just gets worse with every subsequent commit. Later we’ll be busy building the next feature. Later we’ll be dealing with a bug. Later doesn’t exist. Let’s fix it now.

That being said: Sometimes you do need to pivot and take a ticket that has scope-crept into an entire epic and break it up into smaller tasks. Just make sure you do those small tasks right the first time.

Coding with style

As I’m writing this blog post, my text editor autocorrects small mistakes and shows me a red squiggly line to tell me that “devs” is not a real word. Did you know that code editors can do the same thing for code quality, formatting, indentation, naming conventions and more? Incredible! Which is why I am flabbergasted on the rare occasion that I come across a codebase that clearly doesn’t use them.

I could tell you about cognitive load and how a visually clear structure reduces it, or about the effort of reviewing a PR with multiple typos in it, but I think you already get the point.

It takes a small effort to add the correct settings and configurations to your IDE/pipeline checks, but it will save you a huge amount of energy and time in the long run.

It really doesn’t matter how many indentations you like, or where you put your curly brackets, but make it consistent. And: Tell your AI companion about your preferences. On that note:

I, for one, welcome our new AI overlords

Yes, they might take my job. But until they do I’ll enjoy the benefits of coding with AI tools. My enjoyment of writing unit tests has increased by at least 50% since I can tell AI “write the same test but now check what happens if the warp core fails” and within a few seconds I have another edge case covered. And, to highlight the importance of naming things well one more time: AI can also give me great alternatives for testCaseNamesThatGetABitTooLongToReadFluently_Returns413.

Reviewing code is also easier with AI: more than once a quick AI check has spotted a logic error, or offered a different implementation with better performance.

The necessary caveat: AI does make mistakes. Even with plenty of context, it often defaults to ‘standard’ answers that won’t work for your project. When you use AI tools in your development, you are still responsible for every line of code you commit, so make sure to double-check everything.

TLDR

  • Names (and naming conventions) hold great power
  • Use your words with kindness and clarity
  • Own your mistakes
  • Do it right the first time
  • Use the tools available to you (including AI, with caution) to make your life easier and your code better.

That’s a wrap for 3 years of coding.

I can’t wait to see what the next few years will teach me.


Microsoft’s free Xbox Cloud Gaming is coming soon with ads


Microsoft has been testing its ad-supported Xbox Cloud Gaming service over the past few months, as I exclusively revealed in October. Now, the software giant is getting close to testing the free streaming option with Xbox Insiders.

Over the weekend Microsoft updated its Xbox app with loading screens for Xbox Cloud Gaming that mention "1 hour of ad-supported playtime per session." That lines up with Microsoft's internal testing, which has been limited to one-hour sessions, with up to five hours free per month.

Sources told me in October that internal testing includes around two minutes of preroll ads before a game is available to stream for …

Read the full story at The Verge.


Using ClientConnectionId to Correlate .NET Connection Attempts in Azure SQL

1 Share

Getting Better Diagnostics with ClientConnectionId in .NET

A few days ago, I was working on a customer case involving intermittent connectivity failures to Azure SQL Database from a .NET application. On the surface, nothing looked unusual. Retries were happening.

In this post, I want to share a simple yet effective pattern for producing JDBC-style trace logs in .NET — specifically focusing on the ClientConnectionId property exposed by SqlConnection. This gives you a powerful correlation key that aligns with backend diagnostics and significantly speeds up root cause analysis for connection problems.

Why ClientConnectionId Matters

Azure SQL Database assigns a unique identifier to every connection attempt from the client. In .NET, this identifier is available through the ClientConnectionId property of SqlConnection. According to the official documentation:

The ClientConnectionId property gets the connection ID of the most recent connection attempt, regardless of whether the attempt succeeded or failed. Source: https://learn.microsoft.com/en-us/dotnet/api/system.data.sqlclient.sqlconnection.clientconnectionid?view=netframework-4.8.1

This GUID is the single most useful piece of telemetry for correlating client connection attempts with server logs and support traces.

What .NET Logging Doesn’t Give You by Default

Unlike the JDBC driver, the .NET SQL Client does not produce rich internal logs of every connection handshake or retry. There’s no built-in switch to emit gateway and redirect details, attempt counts, or port information.

What you do have is:

  • Timestamps
  • Connection attempt boundaries
  • ClientConnectionId values
  • Outcome (success or failure)

If you capture and format these consistently, you end up with logs that are as actionable as the JDBC trace output — and importantly, easy to correlate with backend diagnostics and Azure support tooling.

Below is a small console application in C# that produces structured logs in the same timestamped, [FINE] format you might see from a JDBC trace — but for .NET applications:

using System;
using Microsoft.Data.SqlClient;

class Program
{
    static int Main()
    {
        // SAMPLE connection string (SQL Authentication)
        // Replace this with your own connection string.
        // This is provided only for demonstration purposes.
        string connectionString =
            "Server=tcp:<servername>.database.windows.net,1433;" +
            "Database=<database_name>;" +
            "User ID=<sql_username>;" +
            "Password=<sql_password>;" +
            "Encrypt=True;" +
            "TrustServerCertificate=False;" +
            "Connection Timeout=30;";

        int connectionId = 1;

        // Log connection creation
        Log($"ConnectionID:{connectionId} created by (SqlConnection)");

        using SqlConnection connection = new SqlConnection(connectionString);

        try
        {
            // Log connection attempt
            Log($"ConnectionID:{connectionId} This attempt No: 0");

            // Open the connection
            connection.Open();

            // Log ClientConnectionId after the connection attempt
            Log($"ConnectionID:{connectionId} ClientConnectionId: {connection.ClientConnectionId}");

            // Execute a simple test query
            using SqlCommand cmd = new SqlCommand("SELECT 1", connection);
            Log($"SqlCommand:1 created by (ConnectionID:{connectionId})");
            Log("SqlCommand:1 Executing (not server cursor) SELECT 1");
            cmd.ExecuteScalar();
            Log("SqlDataReader:1 created by (SqlCommand:1)");
        }
        catch (SqlException ex)
        {
            // ClientConnectionId is available even on failure
            Log($"ConnectionID:{connectionId} ClientConnectionId: {connection.ClientConnectionId} (failure)");
            Log($"SqlException Number: {ex.Number}");
            Log($"Message: {ex.Message}");
            return 1;
        }

        return 0;
    }

    // Simple logger to match JDBC-style output format
    static void Log(string message)
    {
        Console.WriteLine($"[{DateTime.Now:yyyy-MM-dd HH:mm:ss}] [FINE] {message}");
    }
}

Run the above application and you’ll get output like:

[2025-12-31 03:38:10] [FINE] ConnectionID:1 This attempt server name: aabeaXXX.trXXXX.northeurope1-a.worker.database.windows.net port: 11002 InstanceName: null useParallel: false
[2025-12-31 03:38:10] [FINE] ConnectionID:1 This attempt endtime: 1767152309272
[2025-12-31 03:38:10] [FINE] ConnectionID:1 This attempt No: 1
[2025-12-31 03:38:10] [FINE] ConnectionID:1 Connecting with server: aabeaXXX.trXXXX.northeurope1-a.worker.database.windows.net port: 11002 Timeout Full: 20
[2025-12-31 03:38:10] [FINE] ConnectionID:1 ClientConnectionID: 6387718b-150d-482a-9731-02d06383d38f Server returned major version: 12
[2025-12-31 03:38:10] [FINE] SqlCommand:1 created by (ConnectionID:1 ClientConnectionID: 6387718b-150d-482a-9731-02d06383d38f)
[2025-12-31 03:38:10] [FINE] SqlCommand:1 Executing (not server cursor) select 1
[2025-12-31 03:38:10] [FINE] SqlDataReader:1 created by (SqlCommand:1)
[2025-12-31 03:38:10] [FINE] ConnectionID:2 created by (SqlConnection)
[2025-12-31 03:38:11] [FINE] ConnectionID:2 ClientConnectionID: 5fdd311e-a219-45bc-a4f6-7ee1cc2f96bf Server returned major version: 12
[2025-12-31 03:38:11] [FINE] sp_executesql SQL: SELECT 1 AS ID, calling sp_executesql
[2025-12-31 03:38:12] [FINE] SqlDataReader:3 created by (sp_executesql SQL: SELECT 1 AS ID)

Notice how each line is tagged with:

  • A consistent local timestamp (yyyy-MM-dd HH:mm:ss)
  • A [FINE] log level
  • A structured identifier that mirrors what you’d see in JDBC logging

If a connection fails, you’ll still get the ClientConnectionId logged, which is exactly what Azure SQL support teams will ask for when troubleshooting connectivity issues.

Microsoft Foundry for VS Code: January 2026 Update


Enhanced Workflow and Agent Experience

The January 2026 update for the Microsoft Foundry extension in VS Code brings a follow-up to the capabilities we introduced at Ignite last year. We’re excited to announce a set of powerful updates that make building and managing AI workflows in Azure AI Foundry even more seamless. These enhancements are designed to give developers greater flexibility, visibility, and control when working with multi-agent systems and workflows.

Support for Multiple Workflows in the Visualizer

Managing complex AI solutions often involves multiple workflows. With this update, the Workflow Visualizer now supports viewing and navigating multiple workflows in a single project. This makes it easier to design, debug, and optimize interconnected workflows without switching contexts. 

View and Test Prompt Agents in the Playground

Prompt agents are a critical part of orchestrating intelligent behaviors. You can now view all prompt agents directly in the Playground and test them interactively. This feature helps you validate agent logic and iterate quickly, ensuring your prompts deliver the desired outcomes. 

Open Code Files

Transparency and customization are key for developers. We’ve introduced the ability to open sample code files for all agents, including:

  • Prompt agents
  • YAML-based workflows
  • Hosted agents
  • Foundry classic agents

This lets you run agents programmatically, making it easy to add them to your existing projects.

Separated Resource View for v1 and v2 Agents

To reduce confusion and improve clarity, we’ve introduced a separated resource view for Foundry Classic resources and agents. This makes it simple to distinguish between legacy and new-generation agents, ensuring you always know which resources you’re working with.

How to Get Started

Feedback & Support

These improvements are part of our ongoing commitment to deliver a developer-first experience in Microsoft Foundry. Whether you’re orchestrating multi-agent workflows or fine-tuning prompt logic, these features help you build smarter, faster, and with greater confidence.

Try out the extensions and let us know what you think! File issues or feedback on our GitHub repo for Foundry extension. Your input helps us make continuous improvements.
