In Module 08 of the EdgeAI for Beginners course, Microsoft introduces Foundry Local, a toolkit that helps you deploy and test Small Language Models (SLMs) completely offline. In this blog, I’ll share how I installed Foundry Local, ran the Phi-3.5-mini model on my Windows laptop, and what I learned through the process.
What Is Foundry Local?
Foundry Local allows developers to run AI models locally on their own hardware. It supports text generation, summarization, and code completion — all without sending data to the cloud. Unlike cloud-based systems, everything happens on your computer, so your data never leaves your device.
Prerequisites
Before starting, make sure you have:
Windows 10 or 11
Python 3.10 or newer
Git
Internet connection (for the first-time model download)
Step 1 — Install and Verify Foundry Local
After installing Foundry Local, open Command Prompt and type:
foundry --version
If you see a version number, Foundry Local is installed correctly.
Step 2 — Start the Service
Start the Foundry Local service using:
foundry service start
You should see a confirmation message that the service is running.
Step 3 — List Available Models
To view the models supported by your system, run:
foundry model list
You’ll get a list of locally available SLMs; on my machine, this included a CPU build of phi-3.5-mini.
Note: Model availability depends on your device’s hardware. For most laptops, phi-3.5-mini works smoothly on CPU.
Step 4 — Run the Phi-3.5 Model
Now let’s start chatting with the model:
foundry model run phi-3.5-mini-instruct-generic-cpu:1
Once it loads, you’ll enter an interactive chat mode.
Try a simple prompt: Hello! What can you do?
The model replies instantly — right from your laptop, no cloud needed.
To exit, type:
/exit
How It Works
Foundry Local loads the model weights from your device and performs inference locally. This means text generation happens using your CPU (or GPU, if available). The result: complete privacy, no internet dependency, and instant responses.
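Beyond the interactive chat, Foundry Local also serves an OpenAI-compatible endpoint on your machine, so you can call the model from Python. The snippet below is a minimal, hedged sketch: the base URL, port, and model alias are assumptions, so substitute the endpoint and model name your installation reports.

# Minimal sketch: chatting with a Foundry Local model through its
# OpenAI-compatible local endpoint. The base_url, port, and model name
# are assumptions; use the values shown by your own installation.
from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:5273/v1",  # assumed local endpoint; your port may differ
    api_key="not-needed-locally",         # placeholder; no cloud key is required
)

response = client.chat.completions.create(
    model="phi-3.5-mini-instruct-generic-cpu",  # assumed alias; match the name from foundry model list
    messages=[{"role": "user", "content": "Hello! What can you do?"}],
)
print(response.choices[0].message.content)

This is handy once you move from poking at the model in the terminal to wiring it into a small script or class project.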
Benefits for Students
For students beginning their journey in AI, Foundry Local offers several key advantages:
No need for high-end GPUs or expensive cloud subscriptions.
Easy setup for experimenting with multiple models.
Perfect for class assignments, AI workshops, and offline learning sessions.
Promotes a deeper understanding of model behavior by allowing step-by-step local interaction.
These factors make Foundry Local a practical choice for learning environments, especially in universities and research institutions where accessibility and affordability are important.
Why Use Foundry Local
Running models locally offers several practical benefits compared to using AI Foundry in the cloud.
With Foundry Local, you do not need an internet connection, and all computations happen on your personal machine. This makes it faster for small models and more private since your data never leaves your device. In contrast, AI Foundry runs entirely on the cloud, requiring internet access and charging based on usage.
For students and developers, Foundry Local is ideal for quick experiments, offline testing, and understanding how models behave in real-time. On the other hand, AI Foundry is better suited for large-scale or production-level scenarios where models need to be deployed at scale.
In summary, Foundry Local provides a flexible and affordable environment for hands-on learning, especially when working with smaller models such as Phi-3, Qwen2.5, or TinyLlama. It allows you to experiment freely, learn efficiently, and better understand the fundamentals of Edge AI development.
Optional: Restart Later
Next time you open your laptop, you don’t have to reinstall anything. Just run these two commands again:
foundry service start
foundry model run phi-3.5-mini-instruct-generic-cpu:1
What I Learned
How small models like Phi-3.5 can run on a local machine
How to test prompts and build chat apps with zero cloud usage
Conclusion
Running the Phi-3.5-mini model locally with Foundry Local gave me hands-on insight into edge AI. It’s an easy, private, and cost-free way to explore generative AI development. If you’re new to Edge AI, start with the EdgeAI for Beginners course and follow its Study Guide to get comfortable with local inference and small language models.
The AI Toolkit VS Code extension is a great place for developers to experiment, prototype, and learn about AI workflows. Developers exploring AI often start in "Model Playground" — experimenting with models, testing prompts, and iterating on ideas. But turning those experiments into real applications can take time.
That’s why we partnered with the AI Toolkit team to introduce a new project scaffold experience. With just a few clicks, you can now generate a complete AI-powered chat app.
From playground to real app — in one click
AI Toolkit helps developers learn, prototype, and experiment with AI workflows. Until now, it provided code snippets — great for exploration, but still a few steps away from a runnable project.
To bridge that gap, the latest release introduces project scaffolding. Once you’ve explored model responses in Model Playground, click View Code. You’ll now see a new option:
View Code -> OpenAI SDK (support for other SDKs is coming). Select "Scaffold a chat application".
After choosing a local folder, AI Toolkit will scaffold a full project — including both backend and frontend code — and you’re ready to run.
What’s inside the chat app
The scaffolded project, named AI Chat, provides a complete end-to-end example of an AI + real-time chat application.
VS Code on the left and the UI of the chat app on the right
Key features:
Multi-room, real-time chat,
AI bot replies powered by GitHub LLM models via your token,
Same React frontend and Python backend code in both local and cloud modes,
(Optional) Azure Web PubSub integration for scale and message broadcasting.
You can start locally in minutes — no Azure setup required. When you’re ready to scale, deploy to Azure with no code changes.
Run locally in minutes
Prerequisites:
Python 3.12+
Node.js 18+
GitHub Personal Access Token with Models – Read permission
You can open a second browser window to see real-time message streaming between rooms.
I used the Edge browser’s "Split screen" feature here.
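The bot replies in the scaffolded app come from GitHub Models, authenticated with the personal access token listed in the prerequisites. As a rough idea of what that call looks like, here is a hedged standalone sketch (not the scaffold's actual code); the endpoint URL and model id are assumptions.

# Standalone sketch of calling GitHub Models with a personal access token.
# Endpoint URL and model id are assumptions, not taken from the scaffolded app.
import os
from openai import OpenAI

client = OpenAI(
    base_url="https://models.inference.ai.azure.com",  # assumed GitHub Models endpoint
    api_key=os.environ["GITHUB_TOKEN"],                # PAT with Models - Read permission
)

reply = client.chat.completions.create(
    model="gpt-4o-mini",  # assumed model id; pick one available to your token
    messages=[
        {"role": "system", "content": "You are the chat room's AI bot."},
        {"role": "user", "content": "Summarize what this room is discussing."},
    ],
)
print(reply.choices[0].message.content)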
From local to cloud: scale without rewrites
One of the best parts about this sample is its flexibility.
You can run it entirely locally, with no Azure setup or dependencies. Everything — frontend, backend, and real-time messaging — works out of the box. This makes it perfect for experimentation, quick demos, or internal prototypes.
But when you’re ready to go beyond local, Azure steps in to take care of scalability, reliability, and lifecycle management — with no code changes.
Why run it on Azure?
Deploying to Azure offers several advantages:
Built-in scalability — Move from a handful of users to thousands of concurrent connections without changing your architecture,
Managed infrastructure — Azure App Service, Azure Web PubSub, and Azure Storage are fully managed services; you don’t manage servers or maintain WebSocket connections,
Security and access control — Use Azure identity integration for production-grade protection,
Dev-friendly automation — Provision everything with a single command using Azure Developer CLI (azd).
To deploy the sample app to Azure, you only need one command.
azd up
Everything — including Azure App Service, Azure Web PubSub and Azure Storage — is provisioned automatically.
Real-time, managed: Azure Web PubSub
At the heart of the cloud setup is Azure Web PubSub, a fully managed service for building real-time, bi-directional messaging applications using WebSockets. Developers can focus on application logic and leave infra-related concerns to the service.
In the AI Chat Demo, Azure Web PubSub powers the real-time messaging and multi-room architecture, while LLMs via GitHub Models handle the intelligence layer. Specifically, Azure Web PubSub handles:
Message broadcasting across chat rooms,
Group management (join, leave, and isolate rooms),
Event handling through CloudEvents for flexible server integration,
Client negotiation via tokens for secure, scoped access.
This means your chat app can support large numbers of simultaneous users and global traffic — without you managing connection state or scaling infrastructure.
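If you are curious what that server-side glue looks like, here is a minimal hedged sketch using the azure-messaging-webpubsubservice Python SDK; the hub name, group name, roles, and connection string are placeholders and are not taken from the sample app.

# Rough sketch of the server-side role Azure Web PubSub plays:
# hand out a scoped client token, then broadcast a message to a chat room (group).
# Hub name, group name, and connection string are placeholders.
import os
from azure.messaging.webpubsubservice import WebPubSubServiceClient

service = WebPubSubServiceClient.from_connection_string(
    os.environ["WEBPUBSUB_CONNECTION_STRING"], hub="chat"
)

# Client negotiation: issue a token limited to joining and sending in one room.
token = service.get_client_access_token(
    roles=["webpubsub.joinLeaveGroup.room1", "webpubsub.sendToGroup.room1"]
)
print("Client connects with:", token["url"])

# Message broadcasting: push a message to everyone in the room.
service.send_to_group(
    "room1",
    {"from": "ai-bot", "text": "Hello, room!"},
    content_type="application/json",
)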
Next Steps
Try the new project scaffold in the AI Toolkit VS Code extension.
The new AI Toolkit + Azure Web PubSub experience helps developers go from model exploration to real-time AI application in minutes — no boilerplate, no setup friction.
Start experimenting today and bring your AI chat ideas to life.
If you’ve been waiting to run Python 3.14 on Azure App Service - it’s here. Azure App Service for Linux now offers Python 3.14 as a first-class runtime. You can create a new 3.14 app through the Azure portal, automate it with the Azure CLI, or roll it out using your favorite ARM/Bicep templates — and App Service continues to handle the OS, runtime updates, and patching for you so you can stay focused on code.
Why Python 3.14 matters
Python 3.14 (released October 7, 2025) lands with real performance and runtime improvements.
Faster under load. Internal interpreter work reduces overhead in common call paths and improves memory behavior, which can translate to lower latency and less CPU churn in web apps and APIs.
Smarter concurrency. Python 3.14 continues the rollout of subinterpreters and a free-threaded build (no GIL), making it easier to take advantage of multi-core parallelism for CPU-bound or high-throughput workloads. In 3.14, that free-threaded mode is more mature and shows significantly better multi-threaded performance than earlier releases.
Developer quality-of-life. You get a more helpful interactive REPL (better highlighting and error hints), cleaner typing through deferred annotations, and new template string syntax (“t-strings”) for safe, structured interpolation.
All of that is now available to you on App Service for Linux.
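If you want a quick sense of whether the free-threaded build helps a CPU-bound code path, a tiny thread-pool benchmark is enough to compare runs. The sketch below uses only the standard library and nothing App Service specific; on a standard (GIL) build the threads mostly serialize, while on a free-threaded 3.14 build they can run on separate cores.

# Quick, self-contained check of multi-threaded CPU-bound scaling.
import sys
import time
from concurrent.futures import ThreadPoolExecutor

def burn(n: int) -> int:
    # Pure-Python CPU work, no I/O, so threading gains depend on the GIL.
    total = 0
    for i in range(n):
        total += i * i
    return total

if __name__ == "__main__":
    start = time.perf_counter()
    with ThreadPoolExecutor(max_workers=4) as pool:
        list(pool.map(burn, [2_000_000] * 4))
    elapsed = time.perf_counter() - start
    # sys._is_gil_enabled may not exist on older builds, so fall back gracefully.
    gil = getattr(sys, "_is_gil_enabled", lambda: True)()
    print(f"GIL enabled: {gil}, elapsed: {elapsed:.2f}s")

Run it once on your current runtime and once on a 3.14 free-threaded test app and compare the timings.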
What you should do next
If you’re currently running an older Python version on App Service for Linux, this is a good moment to validate on 3.14:
Stand up a staging app or deployment slot on Python 3.14.
Run your normal tests and watch request latency, CPU, and memory.
Confirm that any native wheels or pinned dependencies you rely on install and import cleanly.
Most apps will only need minor adjustments — and you’ll walk away with a faster, more capable runtime on a platform that keeps the underlying infrastructure patched and production-ready for you.
1129. This week, we talk with Ben Zimmer about the linguistic detective work of antedating words — finding earlier usages than those published in dictionaries. We look at the surprising origins of "Ms.," "scallywag," and the baseball history of "jazz."
How do you build quality software with LLMs? Carl and Richard talk to Den Delimarsky about the GitHub Spec Kit, which uses specifications to help LLMs generate code for you. Den discusses the iterative process of refining specifications to produce better code, and then being able to add your own code without disrupting the process. The conversation delves into this new style of software development, utilizing specifications to break down tasks sufficiently for LLMs to be successful, and explores the limitations that exist today.
Welcome to episode 327 of The Cloud Pod, where the forecast is always cloudy! Justin, Matt, and Ryan are here to bring you all the latest news (and a few rants) in the worlds of Cloud and AI. I’m sure all our readers are aware of the AWS outage last week, as it was in all the news everywhere. But we’ve also got some new AI models (including Sora in case you’re low on really crappy videos the youths might like), plus EKS, Kubernetes, Vertex AI, and more. Let’s get started!
Titles we almost went with this week
Oracle and Azure Walk Into a Cloud Bar: Nobody Gets ETL’d
When DNS Goes Down, So Does Your Monday: AWS Takes Half the Internet on a Coffee Break
404 Cloud Not Found: AWS Proves Even the Internet’s Phone Book Can Get Lost
DNS: Definitely Not Staffed – How AWS Lost Its Way When It Lost Its People
When Larry Met Satya: A Cloud Love Story
Azure Finally Answers ‘Dude, Where’s My Data?’ with Storage Discovery
Breaking: Microsoft Discovers AI Training Uses More Power Than a Small Country
404 Engineers Not Found – AWS Learns the Hard Way That People Are Its Most Critical Infrastructure
Azure Storage Discovery: Finding Your Data Needles in the Cloud Haystack
EKS Auto Mode: Because Even Your Clusters Deserve Cruise Control
Azure Gets Reel: Microsoft Adds Video Generation to AI Foundry
The Great Token Heist: Vertex AI Steals 90% Off Your Gemini Bills
Cache Me If You Can: Vertex AI’s Token-Saving Feature
IaC Just Got a Manager – And It’s Not Your Boss
From Musk to Microsoft: Grok 4 Makes the Great Cloud Migration
No Harness.. You are not going to make IACM happen
Microsoft Drafts a Solution to Container Creation Chaos
PowerShell to the People: Azure Simplifies the Great Gateway Migration
IP There Yet? Azure’s Scripts Keep Your Address While You Upgrade
Follow Up
00:53 Glacier Deprecation Email
Standalone Amazon Glacier service (vault-based with separate APIs) will stop accepting new customers as of December 15, 2025.
S3 Glacier storage classes (Instant Retrieval, Flexible Retrieval, Deep Archive) are completely unaffected and continue normally.
Existing Glacier customers can keep using it forever – no forced migration required.
AWS is essentially consolidating around S3 as the unified storage platform, rather than maintaining two separate archival services.
The standalone service will enter maintenance mode, meaning there will be no new features, but the service will remain operational.
Migration to S3 Glacier is optional but recommended for better integration, lower costs, and more features. (Justin assures us it is actually slightly cheaper, so there’s that.)
F5 disclosed that attackers breached its internal product development and engineering systems, gaining access to source code and vulnerability information. The breach did not affect customer CRM data, financial systems, or F5’s software supply chain, according to independent security audits.
F5 has released security patches for BIG-IP, F5OS, and BIG-IP Next products and is providing threat-hunting guides to help customers monitor for suspicious activity.
This represents the first publicly disclosed breach of F5’s internal systems, notable given that F5 handles traffic for 80% of Fortune Global 500 companies through its load-balancing and security services.
The incident highlights supply chain security concerns, as attackers targeted source code and vulnerability information, rather than customer data, potentially seeking ways to exploit F5 products deployed across enterprise networks.
03:12 Justin – “A little concerning on this one, mostly because F5 is EVERYWHERE.”
Anthropic launched web and mobile interfaces for Claude Code, their CLI-based AI coding assistant, with the web version supporting direct access to GitHub repositories and the ability to process general instructions, such as “add real-time inventory tracking to the dashboard.”
The web interface introduces multi-session support, allowing developers to run and switch between multiple coding sessions simultaneously through a left-side panel, plus the ability to provide mid-task corrections without canceling and restarting
A new sandboxing runtime has been implemented to improve security and reduce friction, moving away from the previous approach where Claude Code required permission for most changes and steps during execution
The mobile version is currently limited to iOS and is in an earlier development stage compared to the web interface, indicating a phased rollout approach
This positions Claude Code as a more accessible alternative to traditional CLI-only AI coding tools, potentially expanding its reach to developers who prefer web-based interfaces over command-line environments
05:51 Ryan – “I haven’t had a chance to play with the web version, but I am interested in it just because I found the terminal interface limiting, but I also feel like a lot of the value is in that local sort of execution and not in the sandbox. A lot of the tasks I do are internal and require access to either company resources or private networks, or the kind of thing where you’re not going to get that from a publicly hosted sandbox environment.”
Containerization Assist automates the tedious process of creating Dockerfiles and Kubernetes manifests, eliminating manual errors that plague developers during the containerization process
Built on AKS Draft’s proven foundation, this open-source tool goes beyond basic AI coding assistants by providing a complete containerization platform rather than just code suggestions.
The tool addresses a critical pain point where developers waste hours writing boilerplate container configurations and debugging deployment issues caused by manual mistakes. (Listener beware, Justin mini rant here.)
As an open-source MCP (Model Context Protocol) server, it integrates seamlessly with existing development workflows while leveraging Microsoft’s containerization expertise from Azure Kubernetes Service. (Expertise is a stretch.)
This launch signals Microsoft’s commitment to simplifying Kubernetes adoption by removing the steep learning curve associated with container orchestration and manifest creation – or you could just use a PaaS.
09:47 Matt – “The piece I did like about this is that it integrated in as an optional feature, kind of the trivia and the security thing. So it’s not just setting it up, but they integrated the next steps of security code scanning. It’s not Microsoft saying, you know, hey, it’s standard … they are building security in, hopefully.”
IaCM (Infrastructure as Code Management) extends traditional IaC by adding lifecycle management capabilities, including state management, policy enforcement, and drift detection to handle the complexity of infrastructure at scale.
Key features include centralized state file management with version control, module and provider registries for reusable components, and automated policy enforcement to ensure compliance without slowing down teams.
The platform integrates directly into CI/CD workflows with visual PR insights showing cost estimates and infrastructure changes before deployment, solving the problem of unexpected costs and configuration conflicts.
IaCM addresses critical pain points like configuration drift, secret exposure in state files, and resource conflicts when multiple teams work on the same infrastructure simultaneously.
Harness IaCM specifically supports OpenTofu and Terraform with features like Variable Sets, Workspace Templates, and Default Pipelines to standardize infrastructure delivery across organizations.
13:04 Justin – “So let me boil this down for you. We created our own Terraform Enterprise or Terraform Cloud, but we can’t use that name because it’s copyrighted. So we’re going to try to create a new thing and pretend we invented this – and then try to sell it to you as our new Terraform or OpenTofu replacement for your management tier.”
AWS US-EAST-1 experienced a major outage starting after midnight Pacific on Monday, caused by DNS resolution issues with DynamoDB that prevented proper address lookup for database services, impacting thousands of applications, including Facebook, Snapchat, Coinbase, ChatGPT, and Amazon’s own services.
The outage highlighted ongoing redundancy concerns as many organizations failed to implement proper failover to other regions or cloud providers, despite similar incidents in US-EAST-1 in 2017, 2021, and 2023, raising questions about single-region dependency for critical infrastructure.
AWS identified the root cause as an internal subsystem responsible for monitoring network load balancer health, with core DNS issues resolved by 3:35 AM Pacific, though Lambda backlog processing and EC2 instance launch errors persisted through the morning recovery period.
Real-world impacts included LaGuardia Airport check-in kiosk failures, causing passenger lines, widespread disruption to financial services (Venmo, Robinhood), gaming platforms (Roblox, Fortnite), and productivity tools (Slack, Canva), demonstrating the cascading effects of cloud provider outages.
The incident underscores the importance of multi-region deployment strategies and proper disaster recovery planning for AWS customers, particularly those using US-EAST-1 as their primary region due to its status as AWS’s oldest and largest data center location.
We have a couple of observations: this one took a LONG time to resolve, including hours before the DNS was restored. Maybe they’re out of practice? Maybe it’s a people problem? Hopefully, this isn’t the new norm as some of the talent have been let go/moved on.
17:53 Ryan – “If it’s a DNS resolution issue that’s causing a global outage, that’s not exactly straightforward. It’s not just a bug, you know, or a function returning the wrong value; you’re looking at global propagation, you’re looking at clients in different places, resolving different things, at the base parts of the internet for functionality. And so it does take a pretty experienced engineer to sort of have that in their heads conceptually in order to troubleshoot. I wonder if that’s really the cause, where they’re not able to recover as fast. But I also feel like cloud computing has come a long way, and the impact was very widely felt because a lot more people are using AWS as their hosting provider than I think have been in the past. A little bit of everything, I think.”
AWS’s US-EAST-1 region experienced an outage due to an internal monitoring subsystem failure affecting network load balancers, impacting major services including Facebook, Coinbase, and LaGuardia Airport check-in systems.
The issue was related to DNS resolution problems with DynamoDB, not a cyberattack.
The incident highlights ongoing single-region dependency issues, as US-EAST-1 remains AWS’s largest region and has caused similar widespread disruptions in 2017, 2021, and 2023. Many organizations still lack proper multi-region failover despite repeated outages from this location.
Industry experts warn that the outage demonstrates vulnerability to potential targeted attacks on cloud infrastructure monoculture. The concentration of services on single providers creates systemic risk similar to agricultural monoculture, where one failure can cascade widely.
The failure occurred at the control-plane level, suggesting AWS should implement more aggressive isolation of critical networking components. This may accelerate enterprise adoption of multi-cloud and multi-region architectures as baseline resilience requirements.
AWS resolved the issue within hours but the incident reinforces that even major cloud providers remain vulnerable to cascading failures when core monitoring and health check systems malfunction, affecting downstream services across their infrastructure.
AWS experienced a major outage on October 20, 2025, in the US-EAST-1 region, caused by DNS resolution failures for DynamoDB endpoints, taking 75 minutes just to identify the root cause and impacting banking, gaming, social media, and government services across much of the internet.
The incident highlights concerns about AWS’s talent retention, with 27,000+ Amazon layoffs between 2022-2025 and internal documents showing 69-81% regretted attrition, suggesting loss of senior engineers who understood complex failure modes and had institutional knowledge of AWS systems.
DynamoDB’s role as a foundational service meant the DNS failure created cascading impacts across multiple AWS services, demonstrating the risk of centralized dependencies in cloud architectures and the importance of regional redundancy for critical workloads.
AWS’s status page showed “all is well” for the first 75 minutes of the outage, continuing a pattern of slow incident communication that AWS has acknowledged as needing improvement in multiple previous post-mortems from 2011, 2012, and 2015.
The article suggests this may be a tipping point where the loss of experienced staff who built these systems is beginning to impact AWS’s legendary operational excellence, with predictions that similar incidents may become more frequent as institutional knowledge continues to leave.
-And that’s an end to Hugops. Moving on to the rest of AWS-
EC2 Capacity Manager provides a single dashboard to monitor and manage EC2 capacity across all accounts and regions, eliminating the need to collect data from multiple AWS services like Cost and Usage Reports, CloudWatch, and EC2 APIs.
Available at no additional cost in all commercial AWS regions.
The service aggregates capacity data with hourly refresh rates for On-Demand Instances, Spot Instances, and Capacity Reservations, displaying utilization metrics by vCPUs, instance counts, or estimated costs based on published On-Demand rates.
Key features include automated identification of underutilized Capacity Reservations with specific utilization percentages by instance type and AZ, plus direct modification capabilities for ODCRs within the same account.
Data exports to S3 extend analytics beyond the 90-day console retention period, enabling long-term capacity trend analysis and integration with existing BI tools or custom reporting systems.
Organizations can enable cross-account visibility through AWS Organizations integration, helping identify optimization opportunities like redistributing reservations between development accounts showing 30% utilization and production accounts exceeding 95%.
25:45 Ryan – “This is kind of nice to have it built in and just have it be plug and play – especially when it’s at no cost.”
EKS Auto Mode now supports EC2 On-Demand Capacity Reservations and Capacity Blocks for ML, allowing customers to target pre-purchased capacity for AI/ML workloads requiring guaranteed access to specialized instances like P5s. This addresses the challenge of GPU availability for training jobs without over-provisioning.
New networking capabilities include separate pod subnets for isolating infrastructure and application traffic, explicit public IP control for enterprise security compliance, and forward proxy support with custom certificate bundles. These features enable integration with existing enterprise network architectures without complex CNI customizations.
Complete AWS KMS encryption now covers both ephemeral storage and root volumes using customer-managed keys, addressing security audit findings that previously flagged unencrypted storage.
This eliminates the need for custom AMIs or manual certificate distribution.
Performance improvements include multi-threaded node filtering and intelligent capacity management that can automatically relax instance diversity constraints during capacity shortages.
These optimizations particularly benefit time-sensitive applications and AI/ML workloads requiring rapid scaling.
EKS Auto Mode is available for new clusters or can be enabled on existing EKS clusters running Kubernetes 1.29+, with migration guides available for teams moving from Managed node groups, Karpenter, or Fargate.
Pricing follows standard EKS pricing at $0.10 per cluster per hour plus EC2 instance costs.
27:33 Ryan – “This just highlights how terrible it was before.”
EC2 now lets customers reduce vCPU counts and disable hyperthreading on Windows Server and SQL Server license-included instances, enabling up to 50% savings on vCPU-based licensing costs while maintaining full memory and IOPS performance.
This feature targets database workloads that need high memory and IOPS but fewer vCPUs – for example, an r7i.8xlarge instance can be reduced from 32 to 16 vCPUs while keeping its 256 GiB memory and 40,000 IOPS.
The CPU optimization extends EC2’s existing Optimize CPUs feature to license-included instances, addressing a common pain point where customers overpay for Microsoft licensing due to fixed vCPU counts.
Available now in all commercial AWS regions and GovCloud regions, with no additional charges beyond the adjusted licensing costs based on the modified vCPU count.
This positions AWS competitively against Azure for SQL Server workloads by offering more granular control over licensing costs, particularly important as organizations migrate legacy database workloads to the cloud.
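As a rough illustration (a hedged sketch, not AWS's documented walkthrough), trimming cores and disabling hyperthreading uses the existing CpuOptions parameter at launch; the AMI ID below is a placeholder for a license-included image.

# Hedged sketch: launch an r7i.8xlarge trimmed to 16 vCPUs (16 cores x 1 thread)
# to cut vCPU-based licensing costs while keeping the instance's full memory.
# The AMI ID is a placeholder for a Windows/SQL Server license-included image.
import boto3

ec2 = boto3.client("ec2", region_name="us-east-1")

response = ec2.run_instances(
    ImageId="ami-0123456789abcdef0",  # placeholder
    InstanceType="r7i.8xlarge",
    MinCount=1,
    MaxCount=1,
    CpuOptions={
        "CoreCount": 16,      # down from the default 16 cores x 2 threads = 32 vCPUs
        "ThreadsPerCore": 1,  # disable hyperthreading
    },
)
print(response["Instances"][0]["InstanceId"])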
AWS Systems Manager Patch Manager now includes an “AvailableSecurityUpdate” state that identifies Windows security patches available but not yet approved by patch baseline rules, helping prevent accidental exposure from delayed patch approvals.
The feature addresses a specific operational risk where administrators using ApprovalDelay with extended timeframes could unknowingly leave systems vulnerable, with instances marked as Non-Compliant by default when security updates are pending.
Organizations can customize compliance reporting behavior to maintain existing workflows while gaining visibility into security patch availability across their Windows fleet, particularly useful for enterprises with complex patch approval processes.
The update provides a practical solution for balancing security requirements with operational stability, allowing teams to maintain patch deployment schedules while staying informed about critical security updates awaiting approval.
30:20 Ryan – “It sounds like just a quality of life improvement, but it’s something that should be so basic, but isn’t there, right? Which is like Windows patch management is cobbled together and not really managed well, and so you could have a patch available, but the only way to find out that it was available previously to this was to actually go ahead and patch it and then see if it did something. And so now, at least you have a signal on that; you can apply your patches in a way that’s not going to take down your entire service if a patch goes wrong. So this is very nice. I think for people using the Systems Manager patch management, they’re going to be very happy with this.”
AWS introduces CLI Agent Orchestrator (CAO), an open source framework that enables multiple AI-powered CLI tools like Amazon Q CLI and Claude Code to work together as specialized agents under a supervisor agent, addressing limitations of single-agent approaches for complex enterprise development projects.
CAO uses hierarchical orchestration with tmux session isolation and Model Context Protocol servers to coordinate specialized agents – for example, orchestrating Architecture, Security, Performance, and Test agents simultaneously during mainframe modernization projects.
The framework supports three orchestration patterns (Handoff for synchronous transfers, Assign for parallel execution, Send Message for direct communication) plus scheduled runs using cron-like automation, with all processing occurring locally for security and privacy.
Key use cases include multi-service architecture development, enterprise migrations requiring parallel implementation, comprehensive research workflows, and multi-stage quality assurance processes that benefit from coordinated specialist agents.
We definitely appreciate another tool in the Agent Orchestration world.
Amazon ECS now publishes CloudTrail data events for ECS Agent API activities, enabling detailed monitoring of container instance operations, including polling (ecs:Poll), telemetry sessions (ecs:StartTelemetrySession), and managed instance logging (ecs:PutSystemLogEvents).
Security and operations teams gain comprehensive audit trails to detect unusual access patterns, troubleshoot agent communication issues, and understand how container instance roles are utilized for compliance requirements.
The feature uses the new data event resource type AWS::ECS::ContainerInstance and is available for ECS on EC2 in all AWS regions, with ECS Managed Instances supported in select regions.
Standard CloudTrail data event charges apply – typically $0.10 per 100,000 events recorded, making this a cost-effective solution for organizations needing detailed container instance monitoring.
This addresses a previous visibility gap in ECS operations, as teams can now track agent-level activities that were previously opaque, improving debugging capabilities and security posture for containerized workloads.
39:33 Ryan – “This is definitely something I would use sparingly, because the ECS agent API is chatty. So this seems like it would be very expensive, very fast.”
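For teams that do want these events, enabling them comes down to adding an advanced event selector for the new resource type to a trail. The boto3 sketch below is a hedged example with a placeholder trail name; in practice you would scope the selectors tightly to keep event volume, and the roughly $0.10 per 100,000 events, in check.

# Hedged sketch: enable CloudTrail data events for ECS container instance
# agent activity by adding an advanced event selector for the new resource type.
import boto3

cloudtrail = boto3.client("cloudtrail")

cloudtrail.put_event_selectors(
    TrailName="my-org-trail",  # placeholder trail name
    AdvancedEventSelectors=[
        {
            "Name": "ECS container instance agent data events",
            "FieldSelectors": [
                {"Field": "eventCategory", "Equals": ["Data"]},
                {"Field": "resources.type", "Equals": ["AWS::ECS::ContainerInstance"]},
            ],
        }
    ],
)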
Google Cloud launches G4 VMs with NVIDIA RTX 6000 Blackwell GPUs, offering up to 9x throughput improvement over G2 instances and supporting workloads from AI inference to digital twin simulations with configurations of 1, 2, 4, or 8 GPUs.
The G4 VMs feature enhanced PCIe-based peer-to-peer data paths that deliver up to 168% throughput gains and 41% lower latency for multi-GPU workloads, addressing the bottleneck issues common in serving large generative AI models that exceed single GPU memory limits.
Each GPU provides 96GB of GDDR7 memory (up to 768GB total), native FP4 precision support, and Multi-Instance GPU capability that allows partitioning into 4 isolated instances, enabling efficient serving of models from under 30B to over 100B parameters.
NVIDIA Omniverse and Isaac Sim are now available on Google Cloud Marketplace as turnkey solutions for G4 VMs, enabling immediate deployment of industrial digital twin and robotics simulation applications with full integration across GKE, Vertex AI, Dataproc, and Cloud Run.
G4 VMs are available immediately with broader regional availability than previous GPU offerings, though specific pricing details were not provided in the announcement – customers should contact Google Cloud sales for cost information. (AKA $$$$.)
Dataproc 2.3 introduces a lightweight, FedRAMP High-compliant image that contains only essential Spark and Hadoop components, reducing CVE exposure and meeting strict security requirements for organizations handling sensitive data.
Optional components like Flink, Hive WebHCat, and Ranger are now deployed on-demand during cluster creation rather than pre-packaged, keeping clusters lean by default while maintaining full functionality when needed.
Custom images allow pre-installation of required components to reduce cluster provisioning time while maintaining the security benefits of the lightweight base image.
The image supports multiple operating systems, including Debian 12, Ubuntu 22, and Rocky 9, with deployment as simple as specifying version 2.3 when creating clusters via gcloud CLI.
Google employs automated CVE scanning and patching combined with manual intervention for complex vulnerabilities to maintain compliance standards and security posture.
44:14 Ryan – “But on the contrary, like FedRAMP has such tight SLAs for vulnerability management that you don’t have to carry this risk or request an exception because of Google not patching Flink as fast as you would like them to. At least this puts the control at the end user, where they can say, well, I’m not going to use it.”
BigQuery Studio’s new interface introduces an expanded Explorer view that allows users to filter resources by project and type, with a dedicated search function that spans across all BigQuery resources within an organization – addressing the common pain point of navigating through large-scale data projects.
The Reference panel provides context-aware information about tables and schemas directly within the code editor, eliminating the need to switch between tabs or run exploratory queries just to check column names or data types – particularly useful for data analysts writing complex SQL queries.
Google has streamlined the workspace by moving job history to a dedicated tab accessible from the Explorer pane and removing the bottom panel clutter, while also allowing users to control tab behavior with double-click functionality to prevent unwanted tab replacements.
The update includes code generation capabilities where clicking on table elements in the Reference panel automatically inserts query snippets or field names into the editor, reducing manual typing errors and speeding up query development workflows.
This interface refresh targets data analysts, data engineers, and data scientists who need efficient navigation across multiple BigQuery projects and datasets – no pricing changes mentioned as this appears to be a UI update to the existing BigQuery Studio service.
46:00 Ryan – “Although I’m a little nervous about having all the BigQuery resources across an organization available on a single console, just because it sounds like a permissions nightmare.”
Google launches GA of Prompt Management in Vertex AI SDK, enabling developers to create, version, and manage prompts programmatically through Python code rather than tracking them in spreadsheets or text files.
The feature provides seamless integration between Vertex AI Studio’s visual interface for prompt design and the SDK for programmatic management, with prompts stored as centralized resources within Google Cloud projects for team collaboration.
Enterprise security features include Customer-Managed Encryption Keys (CMEK) and VPC Service Controls (VPCSC) support, addressing compliance requirements for organizations handling sensitive data in their AI applications.
Key use cases include teams building production generative AI applications that need version control, consistent prompt deployment across environments, and the ability to programmatically update prompts without manual code changes.
Google launches Gemini Code Assist for GitHub Enterprise, bringing AI-powered code reviews to enterprise customers using GitHub Enterprise Cloud and on-premises GitHub Enterprise Server.
This addresses the bottleneck where 60.2% of organizations take over a day for code changes to reach production due to manual review processes.
The service provides organization-level controls, including centralized custom style guides and org-wide configuration settings, allowing platform teams to enforce coding standards automatically across all repositories.
Individual teams can still customize repo-level settings while maintaining organizational baselines.
Built under Google Cloud Terms of Service, the enterprise version ensures code prompts and model responses are stateless and not stored, with Google committing not to use customer data for model training without permission. This addresses enterprise security and compliance requirements for AI-assisted development.
Currently in public preview with access through the Google Cloud Console, the service includes a higher pull request quota than the individual developer tier. Google is developing additional features, including agentic loop capabilities for automated issue resolution and bug fixing.
This release complements the recently launched Code Review Gemini CLI Extension for terminal-based AI assistance and represents part of Google’s broader strategy to provide AI assistance across the entire software development lifecycle.
Pricing details are not specified in the announcement.
51:08 Ryan – “It’s just sort of the ability to sort of do organizational-wide things is super powerful for these tools, and I’m just sort of surprised that GitHub allows that. It seems like they would have to develop API hooks and externalize that.”
Vertex AI context caching reduces costs by 90% for repeated content in Gemini models by storing precomputed tokens – implicit caching happens automatically, while explicit caching gives developers control over what content to cache for predictable savings
The feature supports caching from 2,048 tokens up to Gemini 2.5 Pro’s 1 million token context window across all modalities (text, PDF, image, audio, video) with both global and regional endpoint support
Key use cases include document processing for financial analysis, customer support chatbots with detailed system instructions, codebase Q&A for development teams, and enterprise knowledge base queries
Implicit caching is enabled by default with no code changes required and clears within 24 hours, while explicit caching charges standard input token rates for initial caching, then a 90% discount on reuse, plus hourly storage fees based on TTL.
Integration with Provisioned Throughput ensures production workloads benefit from caching, and explicit caches support Customer Managed Encryption Keys (CMEK) for additional security compliance
54:18 Ryan – “This is awesome. If you have a workload where you’re gonna have very similar queries or prompts and have it return similar data, this is definitely nicer than having to regenerate that every time. They’ve been moving more and more towards this. And I like to see it sort of more at a platform level now, whereas you could sort of implement this – in a weird way – directly in a model, like in a notebook or something. This is more of a ‘turn it on and it works’.”
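For a sense of what explicit caching looks like in code, here is a hedged sketch using the Google Gen AI SDK against Vertex AI; the project, model name, TTL, and exact config field names are assumptions and may differ by SDK version.

# Hedged sketch of explicit context caching with the google-genai SDK on Vertex AI.
# Project, location, model name, and TTL are assumptions; cached content must meet
# the minimum token threshold (2,048 tokens) to be accepted.
from google import genai
from google.genai import types

client = genai.Client(vertexai=True, project="my-project", location="us-central1")

# Cache the large, reusable context once (system instructions, reference docs, etc.).
cache = client.caches.create(
    model="gemini-2.5-pro",
    config=types.CreateCachedContentConfig(
        system_instruction="You answer questions about our policy manual.",
        contents=["<large reference document text goes here>"],
        ttl="3600s",  # keep the cached tokens for an hour
    ),
)

# Later queries reference the cache and pay the discounted rate for those tokens.
response = client.models.generate_content(
    model="gemini-2.5-pro",
    contents="What does the manual say about data retention?",
    config=types.GenerateContentConfig(cached_content=cache.name),
)
print(response.text)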
Cloud Armor introduces hierarchical security policies (GA) that enable WAF and DDoS protection at the organization, folder, and project levels, allowing centralized security management across large GCP deployments with consistent policy enforcement.
Enhanced WAF inspection capability (preview) expands request body inspection from 8KB to 64KB for all preconfigured rules, improving detection of malicious content hidden in larger payloads while maintaining performance.
JA4 network fingerprinting support (GA) provides advanced SSL/TLS client identification beyond JA3, offering deeper behavioral insights for threat hunting and distinguishing legitimate traffic from malicious actors.
Azure Firewall’s new observed capacity metric provides real-time visibility into capacity unit utilization, helping administrators track actual scaling behavior versus provisioned capacity for better resource optimization and cost management.
This observability enhancement addresses a common blind spot where teams over-provision firewall capacity due to uncertainty about actual usage patterns, potentially reducing unnecessary Azure spending on unused capacity units.
The metric integrates with Azure Monitor and existing alerting systems, enabling proactive capacity planning and automated scaling decisions based on historical utilization trends rather than guesswork.
Target customers include enterprises with variable traffic patterns and managed service providers who need granular visibility into firewall performance across multiple client deployments to optimize resource allocation.
While pricing remains unchanged for Azure Firewall itself (starting at $1.25/hour plus $0.016/GB processed), the metric helps justify right-sizing decisions that could significantly impact monthly costs for organizations running multiple firewall instances.
Azure Firewall prescaling allows administrators to reserve capacity units in advance for predictable traffic spikes like holiday shopping seasons or product launches, eliminating the lag time typically associated with auto-scaling firewall resources.
This feature addresses a common pain point where Azure Firewall’s auto-scaling couldn’t respond quickly enough to sudden traffic surges, potentially causing performance degradation during critical business events.
Prescaling integrates with Azure’s existing capacity planning tools and can be configured through Azure Portal, PowerShell, or ARM templates, making it accessible for both manual and automated deployment scenarios.
Target customers include e-commerce platforms, streaming services, and any organization with predictable traffic patterns that require guaranteed firewall throughput during peak periods.
While specific pricing wasn’t detailed in the announcement, prescaling will likely follow Azure Firewall’s existing pricing model where customers pay for provisioned capacity units, with costs varying by region and SKU tier.
When you combine these two announcements, they’re pretty good!
Azure API Management introduces carbon-aware capabilities that allow organizations to route API traffic and adjust policy behavior based on carbon intensity data, helping reduce the environmental impact of API infrastructure operations.
The feature enables developers to implement sustainability-focused policies such as throttling non-critical API calls during high carbon intensity periods or routing traffic to regions with cleaner energy grids.
This aligns with Microsoft’s broader carbon negative commitment by 2030 and provides enterprises with tools to measure and reduce the carbon footprint of their digital services at the API layer.
Target customers include organizations with ESG commitments and sustainability reporting requirements who need granular control over their cloud infrastructure’s environmental impact.
Pricing details are not yet available for the preview, but the feature integrates with existing API Management tiers and will likely follow consumption-based pricing models when generally available.
1:02:44 Matt – “So APIMs are, for one, stupidly expensive. If you have to be on the premium tier, it’s like $2,700 a month. And then if you want HA, you have to have two of them. So whatever they’re doing under the hood is stupidly expensive. If you ever had to deal with the SharePoint, they definitely use them because I’ve hit the same error codes as we provide to customers. On the second side, when you do scale them, you can scale them to be multi-region APIMs in the paired region concept, so in theory, what you can do based on this is route to a cheaper or more environmentally efficient region: you could route to your paired region and then have the traffic coming that way.”
Azure Storage Discovery is now generally available as a fully managed service that provides enterprise-wide visibility into data estates across Azure Blob Storage and Data Lake Storage, helping organizations optimize costs, ensure security compliance, and improve operational efficiency across multiple subscriptions and regions.
The service integrates Microsoft Copilot in Azure to enable natural language queries for storage insights, allowing non-technical users to ask questions like “Show me storage accounts with default access tier as Hot above 1TiB with least transactions” and receive actionable visualizations without coding skills. Because a non-technical person is asking this question. In the ever-wise words of Marcia Brady, “Sure, Jan.”
Key capabilities include 18-month data retention for trend analysis, insights across capacity, activity, security configurations, and errors, with deployment taking less than 24 hours to generate initial insights from 15 days of historical data.
Pricing includes a free tier with basic capacity and configuration insights retained for 15 days, while the standard plan adds advanced activity, error, and security insights with 18-month retention – specific pricing varies by region at azure.microsoft.com/pricing/details/azure-storage-discovery.
Target use cases include identifying cost optimization opportunities through access tier analysis, ensuring security best practices by highlighting accounts still using shared access keys, and managing data redundancy requirements across global storage estates.
1:08:35 Ryan – “Well, I’ll tell you when I was looking for this report, I had a lot of natural language – and I was shouting it at my computer.”
Microsoft has brought OpenAI’s Sora 2 video generation model to Azure AI Foundry. The platform provides a unified environment combining Sora 2 with other generative models like GPT-image-1 and Black Forest Labs’ Flux 1.1, all backed by Azure’s enterprise security and content filtering for both inputs and outputs.
Key capabilities include realistic physics simulation, detailed camera control, and creative features for marketers, retailers, educators, and creative directors to rapidly prototype and produce video content within existing business workflows.
Sora 2 is currently available via API through Standard Global deployment in Azure AI Foundry, with pricing details available on the Azure AI Foundry Models page.
Microsoft positions this as part of their responsible AI approach, embedding safety controls and compliance frameworks to help organizations innovate while maintaining governance over generated content.
Microsoft brings xAI’s Grok 4 model to Azure AI Foundry, featuring a 128K-token context window, native tool use, and integrated web search capabilities. The model emphasizes first-principles reasoning with a “think mode” that breaks down complex problems step-by-step, particularly excelling at math, science, and logic puzzles.
Grok 4’s extended context window allows processing of entire code repositories, lengthy research papers, or hundreds of pages of documents in a single query. This eliminates the need to manually chunk large inputs and enables comprehensive analysis across massive datasets without losing context.
Azure AI Content Safety is enabled by default for Grok 4, addressing enterprise concerns about responsible AI deployment. Microsoft and xAI conducted extensive safety testing and compliance checks over the past month to ensure business-ready protection layers.
Pricing starts at $2 per million input tokens and $10 per million output tokens for Grok 4, with faster variants available at lower costs.
The model’s real-time data integration allows it to retrieve and incorporate external information beyond its training data, functioning as an autonomous research assistant. This capability is particularly valuable for tasks requiring current information like market analysis or regulatory updates.
Azure releases PowerShell scripts to help customers migrate from Application Gateway V1 to V2 before the April 2026 retirement deadline, addressing a critical infrastructure transition need.
The enhanced cloning script preserves configurations during migration while the Public IP retention script ensures customers can maintain their existing IP addresses, minimizing disruption to production workloads.
This migration tooling targets enterprises running legacy Application Gateway Standard or WAF SKUs who need to upgrade to Standard_V2 or WAF_V2 for continued support and access to newer features.
The scripts automate what would otherwise be a complex manual migration process, reducing the risk of configuration errors and downtime during the transition.
Customers should begin planning migrations now as the 2026 deadline approaches, with these scripts providing a standardized path forward for maintaining application delivery infrastructure.
You know what would be even easier than PowerShell? How about just doing it for them? Too easy?
Oracle is rolling out AI agents across Oracle Fusion Cloud Applications. The platform enables the creation of AI agents that can automate tasks across ERP, HCM, and CX, with pre-built templates and low-code development tools for business users.
Oracle is partnering with major consulting firms like Accenture, Deloitte, and Infosys to help customers implement AI agents, though this likely means significant professional services costs for most deployments.
The AI agents can handle tasks like expense report processing, supplier onboarding, and customer service inquiries, with Oracle claiming reduced manual work by up to 50% in some use cases.
Pricing details remain unclear, but the service requires Oracle Fusion Applications subscriptions and likely additional fees for LLM usage and agent deployment based on Oracle’s typical pricing model.
1:15:45 Ryan – “They’re partnering with these giant firms that will come in with armies of engineers who will build you a thing – and hopefully document it before running away.”
Closing
And that is the week in the cloud! Visit our website, theCloudPod.net, the home of the Cloud Pod, where you can join our newsletter or Slack team, send feedback, or ask questions. You can also tweet at us with the hashtag #theCloudPod.