Sr. Content Developer at Microsoft, working remotely in PA, TechBash conference organizer, former Microsoft MVP, Husband, Dad and Geek.

Temporary rollback: build identities can access Advanced Security: read alerts again


If you use build service identities like Project Collection Build Service to call Advanced Security APIs, the Advanced Security permission changes in Sprint 269 broke that automation. We restricted API access for build identities as a security improvement, but failed to give early notice to customers who relied on this access for their automations.

We’re rolling it back temporarily. The restriction will be enforced again on April 15, 2026.

What you should do

Action is required. The recommended path is a service principal with Advanced Security: Read alerts permissions for your Advanced Security-enabled repositories. Scope it narrowly, and if the service principal isn’t committing code, it won’t consume an Advanced Security committer license.
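As a rough sketch of what such automation looks like, the snippet below builds a request for the Advanced Security alerts REST endpoint using a Bearer token acquired for a service principal. This is a minimal illustration, not the official client: the organization, project, and repository names are placeholders, and the `api-version` should be verified against the current REST documentation.

```python
def alerts_request(org: str, project: str, repo: str, token: str) -> tuple[str, dict]:
    """Build the URL and headers for listing Advanced Security alerts
    with a service-principal Bearer token (org/project/repo are placeholders)."""
    url = (
        f"https://advsec.dev.azure.com/{org}/{project}"
        f"/_apis/alerts/repositories/{repo}/alerts"
        "?api-version=7.2-preview.1"
    )
    headers = {"Authorization": f"Bearer {token}", "Accept": "application/json"}
    return url, headers
```

Pass the returned URL and headers to whatever HTTP client your automation already uses; since the service principal only reads alerts and never commits code, it should not consume a committer license.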

Status checks in Sprint 272

We’re also shipping status checks soon, which give teams a native way to gate on security posture without API-driven alert mutations from pipeline identities.

[Image: Azure DevOps status checks]

This won’t replace every automation scenario, but it does enable pull request-time blocking on the presence of high and critical alerts.
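The blocking decision a status check of this kind makes can be approximated in a few lines. This is an illustrative sketch, not the actual implementation, and the alert field names (`state`, `severity`) are assumptions:

```python
# Severities that should fail the pull request status check.
BLOCKING = {"high", "critical"}

def should_block_pr(alerts: list[dict]) -> bool:
    """Return True if any active alert has a blocking severity.
    Alert dicts use hypothetical 'state' and 'severity' fields."""
    return any(
        a.get("state") == "active" and a.get("severity", "").lower() in BLOCKING
        for a in alerts
    )
```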

Have feedback or hitting gaps moving to a service principal? Let us know.


Action required by April 15: move API automation to a service principal with Advanced Security: Read alerts or watch for status checks in Sprint 272.

The post Temporary rollback: build identities can access Advanced Security: read alerts again appeared first on Azure DevOps Blog.

Read the whole story
alvinashcraft
31 seconds ago
reply
Pennsylvania, USA
Share this story
Delete

Daily Reading List – March 11, 2026 (#739)


Sometimes a different model really does help. I was building an agent last night for fun, and couldn’t get it to do what I wanted. I upgraded to the latest Gemini model and now it works like a charm. That was the only change!

[blog] Welcoming Wiz to Google Cloud: Redefining security for the AI era. I’m excited that we closed this acquisition and now have this talented team with their differentiated security platform. I’m not sure anyone has a security stack quite like ours.

[article] Andrej Karpathy’s new open source ‘autoresearch’ lets you run hundreds of AI experiments a night — with revolutionary implications. Besides coding (which I don’t get to do every day), my main use of AI is for research. Autoresearch could absolutely be a transformative thing.

[blog] The 8 Levels of Agentic Engineering. Fantastic post that articulates each progressive stage of using AI in engineering, and what you gain from each.

[blog] Plan mode is now available in Gemini CLI. Great functionality added to the Gemini CLI. Do safe, read-only planning work before jumping into action.

[blog] Cost-Effective AI with Ollama, GKE GPU Sharing, and vCluster. As I see people struggle with token costs or model availability, I’m definitely of the mind that there are clear cases where you want to run models yourself.

[blog] When Developer Workflow Discipline Isn’t Enough. It’s about platforms, and how you’re serving AI functionality at scale in large companies.

[article] 10 Hacks Every NotebookLM User Should Know. Great list of ways to personalize the experience and learn on your terms.

[blog] From games to biology and beyond: 10 years of AlphaGo’s impact. This was a turning point for AI, and we may look back at this as a pivotal moment.

[article] How to Quash Your Fear of Messing Up. What causes you to hesitate? How can we think about risk differently? Read this for advice.

[blog] Best Practices for Secure Error Handling in Go. Even if you don’t write code in Go, you’ll learn a few useful things about avoiding security issues with your error handling.

[article] Layoffs, cost-cutting shatters IT worker confidence. Understandable! Tech workers face a lot of simultaneous stresses.

[blog] Bring Your Database Tools to the Agent Skill Ecosystem. Very cool new capability to turn MCP toolsets into Agent Skills. Worth trying this out.

Want to get this update sent to you every day? Subscribe to my RSS feed or subscribe via email below:




dotnet-1.0.0-rc4


What's Changed

  • .NET: bug fix for duplicate output on GitHubCopilotAgent by @normalian in #3981
  • .NET: Increase credential timeout for Integration Tests by @westey-m in #4472
  • .NET: Add foundry extension samples for dotnet by @yaoleo34 in #4359
  • .NET: CI Build time end to end improvement by @westey-m in #4208
  • .NET: Switch auth sample to use Singletons by @westey-m in #4454
  • .NET: Add ServiceLifetime support for Hosting DI registration by @westey-m in #4476
  • .NET: Fix filter combine logic for ChatHistoryMemoryProvider by @westey-m in #4501
  • .NET: Update HostedAgents samples to Azure.AI.AgentServer.AgentFramework 1.0.0-beta.9 and MEAI 10.3.0 by @Copilot in #4477
  • .NET: Improve skill name validation: reject consecutive hyphens and enforce directory name match by @SergeyMenshykh in #4526
  • .NET: Create a sample to show bounded chat history with overflow into chat history memory by @westey-m in #4136
  • .NET: Update Anthropic to 12.8.0 and Anthropic.Foundry to 0.4.2 by @Copilot in #4475
  • .NET: Add security warnings to xml comments for core components by @westey-m in #4527
  • Auto-finalize ResponseStream on iteration completion by @giles17 in #4478
  • .NET: Skip Azure Persistent (V1) flaky CodeInterpreter integration tests by @rogerbarreto in #4583
  • .NET: Enable Microsoft.Agents.AI.FoundryMemory for NuGet release by @rogerbarreto in #4559
  • Fix Strands Agents documentation links in ADR by @Copilot in #4584
  • .NET: Cleanup unnecessary usages of AsIChatClient by @westey-m in #4561
  • .NET: Added support for polymorphic type as workflow output by @peibekwe in #4485
  • .NET Compaction - Introducing compaction strategies and pipeline by @crickman in #4533
  • .NET: SDK Patch Bump (10.0.200) - Address false positive trigger of IL2026/IL3050 diagnostics in hosting projects by @rogerbarreto in #4586
  • .NET: Add FinishReason to AgentResponses by @westey-m in #4617
  • .NET: Updated package versions by @dmytrostruk in #4632
  • .NET: Fixed CA1873 warning by @dmytrostruk in #4634

New Contributors

Full Changelog: dotnet-1.0.0-rc3...dotnet-1.0.0-rc4


python-1.0.0rc4


[1.0.0rc4] - 2026-03-11

Added

  • agent-framework-core: Add propagate_session to as_tool() for session sharing in agent-as-tool scenarios (#4439)
  • agent-framework-core: Forward runtime kwargs to skill resource functions (#4417)
  • samples: Add A2A server sample (#4528)

Changed

  • agent-framework-github-copilot: [BREAKING] Update integration to use ToolInvocation and ToolResult types (#4551)
  • agent-framework-azure-ai: [BREAKING] Upgrade to azure-ai-projects 2.0+ (#4536)

Fixed

  • agent-framework-core: Propagate MCP isError flag through the function middleware pipeline (#4511)
  • agent-framework-core: Fix as_agent() not defaulting name/description from client properties (#4484)
  • agent-framework-core: Exclude conversation_id from chat completions API options (#4517)
  • agent-framework-core: Fix conversation ID propagation when chat_options is a dict (#4340)
  • agent-framework-core: Auto-finalize ResponseStream on iteration completion (#4478)
  • agent-framework-core: Prevent pickle deserialization of untrusted HITL HTTP input (#4566)
  • agent-framework-core: Fix executor_completed event handling for non-copyable raw_representation in mixed workflows (#4493)
  • agent-framework-core: Fix store=False not overriding client default (#4569)
  • agent-framework-redis: Fix RedisContextProvider compatibility with redisvl 0.14.0 by using AggregateHybridQuery (#3954)
  • samples: Fix chat_response_cancellation sample to use Message objects (#4532)
  • agent-framework-purview: Fix broken link in Purview README (Microsoft 365 Dev Program URL) (#4610)

Addressing GitHub’s recent availability issues


Over the past several weeks, GitHub has experienced significant availability and performance issues affecting multiple services. Three of the most significant incidents happened on February 2, February 9, and March 5.

First and foremost, we take responsibility. We have not met our own availability standards, and we know that reliability is foundational to the work you do every day. We understand the impact these outages have had on your teams, your workflows, and your confidence in our platform.

Here, we’ll unpack what’s been causing these incidents and what we’re doing to make our systems more resilient moving forward.

What happened

These incidents have occurred during a period of extremely rapid usage growth across our platform, exposing scaling limitations in parts of our current architecture. Specifically, we’ve found that recent platform instability was primarily driven by rapid load growth, architectural coupling that allowed localized issues to cascade across critical services, and the system’s inability to adequately shed load from misbehaving clients.

Before we cover what we are doing to prevent these issues going forward, it is worth diving into the details of the most impactful incidents.

February 9 incident

On Monday, February 9, we experienced a high‑impact incident due to a core database cluster that supports authentication and user management becoming overloaded. The mistakes that led to the problem were made days and weeks earlier.

In early February, two very popular client-side applications that make a significant number of API calls against our servers were released with unintentional changes that drove a more-than-tenfold increase in the read traffic they generated. Because users update these applications gradually, the increase in usage doesn’t become evident right away; it builds as more users upgrade.

On Saturday, February 7, we deployed a new model. While trying to get it to customers as quickly as possible, we changed the refresh TTL on a cache storing user settings from 12 hours to 2. Limited capacity meant the model was released to a narrower set of customers, which is what made the change necessary. At this point, everything was operating normally because weekend load is significantly lower, and we didn’t have sufficiently granular alarms to detect the looming issue.

Three things then compounded on February 9: our regular peak load, many customers updating to the new version of the client apps as they started their week, and another new model release. At this point, the write volume from the shortened TTL and the read volume from the client apps combined to overwhelm the database cluster. While the TTL change was quickly identified as a culprit, it took much longer to understand why the read load kept increasing, which prolonged the incident. Further, because of how services interacted once the database cluster became overwhelmed, we needed to block the extra load further up the stack, and we didn’t have sufficiently granular switches to identify which traffic to block at that level.

The investigation for the February 9 incident raised a lot of important questions about why the user settings were stored in this particular database cluster and in this particular way. The architecture was originally selected for simplicity at a time when there were very few models and very few governance controls and policies related to those models. But over time, something that was a few bytes per user grew into kilobytes. We didn’t catch how dangerous that was because the load was visible only during new model or policy rollouts and was masked by the TTL. Since this database cluster houses data for authentication and user management, any services that depend on these were impacted.
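The write amplification from the TTL change is easy to quantify: if each active user's settings entry is refreshed roughly once per TTL window, cutting the refresh TTL from 12 hours to 2 multiplies steady-state refresh writes by six. A toy calculation (the user count is illustrative, not GitHub's):

```python
def refresh_writes_per_hour(active_users: int, ttl_hours: float) -> float:
    """Steady-state cache refresh writes per hour, assuming each
    active user's settings entry is refreshed once per TTL window."""
    return active_users / ttl_hours

# Illustrative numbers only -- not GitHub's actual user counts.
before = refresh_writes_per_hour(1_000_000, 12)  # ~83,000 writes/hour
after = refresh_writes_per_hour(1_000_000, 2)    # 500,000 writes/hour, a 6x jump
```

The jump is invisible on a low-traffic weekend and only shows up when peak read load arrives on top of it, which matches the Monday timing described above.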

GitHub Actions incidents on February 2 and March 5

We also had two significant instances where our failover solution was either insufficient or didn’t function correctly:

  • Actions hosted runners had a significant outage on February 2. Most cloud infrastructure issues in this area typically do not cause impact, as they occur in a limited number of regions and we automatically shift traffic to healthy regions. However, in this case, there was a cascading set of events triggered by a telemetry gap that caused existing security policies to be applied to key internal storage accounts affecting all regions. This blocked access to VM metadata during VM creation and halted hosted runner lifecycle operations.
  • Another impactful incident for Actions occurred on March 5. Automated failover has been progressively rolling out across our Redis infrastructure, and on this day, a failover occurred for a Redis cluster used by Actions job orchestration. The failover performed as expected, but a latent configuration issue meant the failover left the cluster in a state with no writable primary. With writes failing and failover not available as a mitigation, we had to correct the state manually to mitigate. This was not an aggressive rollout or missing resiliency mechanism, but rather latent configuration that was only exposed by an event in production infrastructure.
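The March 5 failure mode, where failover completed but left no writable primary, is exactly the kind of invariant a post-failover verification step can catch. A minimal sketch, assuming a hypothetical node representation rather than any real Redis client API:

```python
def check_single_writable_primary(nodes: list[dict]) -> None:
    """Raise if the cluster does not have exactly one writable primary.

    Each node dict uses a hypothetical shape:
    {"role": "primary" | "replica", "read_only": bool}.
    """
    primaries = [n for n in nodes if n["role"] == "primary" and not n["read_only"]]
    if len(primaries) != 1:
        raise RuntimeError(
            f"expected exactly 1 writable primary, found {len(primaries)}"
        )
```

Running a check like this automatically after every failover (and in dry runs) turns a latent configuration issue into an immediate, actionable alert instead of a production outage.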

For both of these incidents, the investigations surfaced unexpected single points of failure that we needed to protect, and showed that we need to dry-run failover procedures in production more rigorously.

Across these incidents, contributing factors expanded the scope of impact to be much broader or longer than necessary, including:

  • Insufficient isolation between critical path components in our architecture
  • Inadequate safeguards for load shedding and throttling
  • Gaps in end-to-end validation, in monitoring that surfaces early warning signals, and in partner coordination during incident response
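One common safeguard of the kind the second bullet describes is a token-bucket rate limiter: each client draws from a per-client budget of tokens, and a misbehaving client is shed once its bucket runs dry. A minimal, illustrative sketch (not GitHub's implementation):

```python
import time

class TokenBucket:
    """Per-client token bucket: admit a request only when a token is
    available; excess traffic from a misbehaving client is shed."""

    def __init__(self, rate: float, burst: float):
        self.rate = rate          # tokens refilled per second
        self.capacity = burst     # maximum bucket size
        self.tokens = burst       # start full
        self.last = time.monotonic()

    def allow(self) -> bool:
        now = time.monotonic()
        # Refill proportionally to elapsed time, capped at capacity.
        self.tokens = min(self.capacity, self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False
```

The bucket's burst size absorbs normal spikes, while the refill rate caps sustained throughput, so a client whose update suddenly generates tenfold read traffic is throttled instead of overwhelming a shared database.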

What we are doing now

Our engineering teams are fully engaged in both near-term mitigations and durable longer-term architecture and process investments. We are addressing two common themes: managing rapidly increasing load by focusing on resilience and isolation of critical paths and preventing localized failures from ever causing broad service degradation.

In the near term, we are prioritizing stabilization work to reduce the likelihood and impact of incidents. This includes:

  1. Redesigning our user cache system, which hosts model policies and more, to accommodate significantly higher volume in a segmented database cluster.
  2. Expediting capacity planning and completing a full audit of fundamental health for critical data and compute infrastructure to address urgent growth.
  3. Further isolating key dependencies so that critical systems like GitHub Actions and Git will not be impacted by shared infrastructure issues, reducing cascade risk. This is being done through a combination of removing dependencies, handling dependency failures where possible, and isolating the rest.
  4. Protecting downstream components during spikes to prevent cascading failures while prioritizing critical traffic loads.
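The fourth item is commonly implemented with a circuit breaker: after repeated downstream failures, callers fail fast instead of piling more load onto the struggling component. A minimal, illustrative sketch (not GitHub's actual mechanism):

```python
class CircuitBreaker:
    """Minimal circuit breaker: after `max_failures` consecutive
    failures the circuit opens and calls fail fast, shielding the
    struggling downstream component from further load."""

    def __init__(self, max_failures: int = 3):
        self.max_failures = max_failures
        self.failures = 0

    def call(self, fn):
        if self.failures >= self.max_failures:
            raise RuntimeError("circuit open: failing fast")
        try:
            result = fn()
        except Exception:
            self.failures += 1
            raise
        self.failures = 0  # any success resets the breaker
        return result
```

A production version would also re-close the circuit after a cool-down and prioritize which traffic classes are allowed through first, but the core idea is the same: stop cascading load at the caller.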

In parallel, we are accelerating deeper platform investments to deliver on GitHub’s commitment to supporting sustained, high-rate growth with high availability. These include:

  1. Migrating our infrastructure to Azure to accommodate rapid growth, enabling both vertical scaling within regions and horizontal scaling across regions. In the short term, this provides a hybrid approach for infrastructure resiliency. As of today, 12.5% of all GitHub traffic is served from our Azure Central US region, and we are on track to serve 50% of all GitHub traffic by July. Longer term, this enables simplification of our infrastructure architecture and more global resiliency by adopting managed services.
  2. Breaking apart the monolith into more isolated services and data domains as appropriate, so we can scale independently, enable more isolated change management, and implement localized decisions about shedding traffic when needed.

We are also continuing tactical repair work from every incident.

Our commitment to transparency

We recognize that it’s important to provide you with clear communication and transparency when something goes wrong. We publish summaries of all incidents that result in degraded performance of GitHub services on our status page and in our monthly availability reports. The February report will publish later today with a detailed explanation of incidents that occurred last month, and our March report will publish in April.

Given the scope of recent incidents, we felt it was important to address them with the community today. We know GitHub is critical digital infrastructure, and we are taking urgent action to ensure our platform is available when and where you need it. Thank you for your patience as we strengthen the stability and resilience of the GitHub platform.

The post Addressing GitHub’s recent availability issues appeared first on The GitHub Blog.


Learn how to fine-tune LLMs in 12 hours


The goal isn't just to train a model; it's to build a system that understands your specific data as well as you do.

We just posted a massive, 12-hour course on the freeCodeCamp.org YouTube channel designed to turn you from an AI consumer into an LLM architect.

While massive models like Llama 3, Gemini, and GPT-4 are impressive out of the box, their true power is unlocked when they are tailored to specific domains. This course is a deep dive into the modern LLM ecosystem, teaching you how to take these giants and make them work for your specific needs.

The course is structured into four major sections:

  • The Foundations of PEFT: Learn why full fine-tuning is often overkill. You will learn about Parameter-Efficient Fine-Tuning (PEFT), and master techniques like LoRA and QLoRA to train models on consumer hardware.

  • Advanced Alignment: You’ll learn about Reinforcement Learning from Human Feedback (RLHF) and the increasingly popular Direct Preference Optimization (DPO) to align models with human intent.

  • High-Performance Tooling: Get hands-on with the fastest tools in the industry. The course covers Unsloth (for 2x faster training), Axolotl, and the Llama Factory project for streamlined workflows.

  • Enterprise & Multimodal AI: Beyond text, the course explores Vision Transformers (ViT), multimodal architectures (Image, Video, Audio), and how to leverage enterprise APIs from OpenAI and Google Cloud Vertex AI.
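To make the LoRA idea from the first section concrete: the pretrained weight matrix W stays frozen, and only a low-rank update B·A (scaled by alpha/r) is trained, which is why it fits on consumer hardware. A toy forward pass using plain Python lists; dimensions and values are illustrative, not from any real model:

```python
def matmul(X, Y):
    """Multiply two matrices given as lists of lists."""
    return [
        [sum(X[i][k] * Y[k][j] for k in range(len(Y))) for j in range(len(Y[0]))]
        for i in range(len(X))
    ]

def lora_forward(x, W, A, B, alpha, r):
    """y = x @ (W + (alpha/r) * B @ A).

    W is the frozen pretrained weight (d_in x d_out); B (d_in x r) and
    A (r x d_out) are the only trained parameters, so for small r the
    trainable parameter count is a tiny fraction of W's.
    """
    scale = alpha / r
    delta = matmul(B, A)  # low-rank update, expanded to W's full shape
    W_eff = [
        [W[i][j] + scale * delta[i][j] for j in range(len(W[0]))]
        for i in range(len(W))
    ]
    return matmul(x, W_eff)
```

QLoRA applies the same trick on top of a 4-bit quantized W, shrinking memory further while still training only A and B in higher precision.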

Watch the full course on the freeCodeCamp.org YouTube channel (12-hour watch).


