Sr. Content Developer at Microsoft, working remotely in PA, TechBash conference organizer, former Microsoft MVP, Husband, Dad and Geek.
155315 stories
·
33 followers

WWDC 2026: All the news from Apple’s developers conference

1 Share

Apple’s annual WWDC event is kicking off on June 8th with a keynote presentation starting at 1PM ET / 10AM PT, where Apple will announce major updates to iOS, macOS, and its other operating systems. 

Among those updates could be Apple’s delayed Siri overhaul, which has faced setbacks since it was initially announced at WWDC 2024. Apple is revamping Siri with some help from Google Gemini and is also rumored to have a dedicated Siri app in the works. 

This is also the last keynote we’re expecting to see from Tim Cook before he steps down as CEO on September 1st and is replaced by John Ternus, who currently leads hardware engineering.

Follow along here for all the latest news and updates. 

Read the whole story
alvinashcraft
15 minutes ago
reply
Pennsylvania, USA
Share this story
Delete

GitHub for Beginners: Answers to some common questions

1 Share

Welcome back to GitHub for Beginners. This is the final episode of the season, and we’ve covered a lot so far. Make sure to check out our other episodes to see all the various topics we’ve discussed.

Today, we’re going to spend some time answering some questions that people often have, especially when they’re first getting started. So without further ado, let’s jump right in.

As always, if you prefer to watch the video or want to reference it, we have all of our GitHub for Beginners episodes available on YouTube.

What is SSH and how do I add my SSH key to GitHub?

An SSH key is a secure shell key. It’s a pair of files on your computer that has two parts: a private key and a public key.

The private key stays on your computer and should never be shared. The public key is what you share with platforms like GitHub. When you store your public key on GitHub, git uses your private key to confirm your identity when you push and pull code. In order for you to be authenticated, your public key on GitHub needs to match the private key on your computer.

So how do you do this? Let’s create a key pair and add your public key to GitHub now.

(And remember, if you prefer a video walkthrough, that is available.)

  1. Open up a terminal and enter the following command. Remember to replace the email placeholder with your email address you use to log into GitHub.
ssh-keygen –t ed25519 – C YOUREMAIL@DOMAIN.COM
  1. When it prompts you to enter a file to save the key, press Enter to accept the default file and location.
  2. Enter a passphrase that you’ll remember. Note that the terminal will not display what you type, so be careful not to have any typos!
  3. Reenter your passphrase.

This will create your new SSH key. Now you want to add it to your ssh-agent. An ssh-agent is a program that securely stores your keys so that you don’t need to keep entering your passphrase.

🔍 To learn more, check out our docs about adding your SSH key to ssh-agent

To add this new SSH key to the ssh-agent, run the following command. Note that you will need to add your passphrase when prompted.

ssh-add ~/.ssh/id_ed25519

Now that you have created the SSH key and configured your ssh-agent, the next step is adding the public key to GitHub.

  1. In your terminal, run the following command.
cat ~/.ssh/id_ed25519.pub
  1. Copy the entire line that appears in the terminal as a response to that command.
  2. Open a browser and navigate to github.com.
  3. Click your profile picture in the top-right corner and select Settings.
  4. In the menu on the left-hand side, select SSH and GPG keys.
  5. On the right-hand side, click the green New SSH key button.
  6. Give the key that you’re about to add a name in the “Title” box that describes this key in a way you’ll remember. For example, if this is your work laptop you might enter a title of “work-laptop”.
  7. Paste the key you copied from the terminal into the “Key” box.
  8. Click the green Add SSH key button at the bottom of the window.

Congratulations! Your computer is now configured to connect to GitHub over SSH.

How do I add a PAT to GitHub? What is a PAT?

PAT stands for Personal Access Token. A PAT is a special credential that you create on GitHub for tools that need authentication. You control its permissions and can revoke it any time. On GitHub, you’ll commonly use a PAT to authenticate via command line or the GitHub API.

There are two types of PATs available: fine-grained tokens and classic tokens. First we’ll walk through creating a fine-grained PAT.

  1. Open a browser and navigate to github.com.
  2. Click your profile picture in the top-right corner and select Settings.
  3. Scroll to the bottom of the list of options in the left-hand column and select Developer settings.
  4. In the left-hand column, expand the option for Personal access tokens. Select Fine-grained tokens from the options displayed.
  5. Click the green Generate new token button in the center of the window.
  6. Enter a name and description for the token. This should make it clear what the token is going to be used for (e.g., a name of “cli-access” with a description of “access the Copilot CLI”).
  7. Under “Expiration,” select a date that matches how long you need the token to be valid. Once the token expires, it will not work anymore.
  8. Under “Repository access,” select which repositories you want the PAT to be able to access. You can limit the selection to only specific repositories if you know which repositories it will need to access.
  9. Under “Permissions”, click Add permissions to select which permissions you’re granting to this PAT. This lets you define the scope of what the PAT can do.
  10. For each permission, you can specify whether it has read-only access or read and write access.
  11. When you’re satisfied with the permissions, scroll to the bottom and click the green Generate token button.
  12. A window pops up, providing a review of all of the information associated with this token. Verify that the information is correct, and then click Generate token.
  13. GitHub will now show you the token. Make sure that you copy it and store it in a safe location (e.g., a password manager), because GitHub only shows you this token once.
🔍 For more information, check out our documentation about Personal Access Tokens.

Now let’s go through creating a classic token. As you’ll see, it’s very similar in several ways.

  1. Open a browser and navigate to github.com.
  2. Click your profile picture in the top-right corner and select Settings.
  3. Scroll to the bottom of the list of options in the left-hand column and select Developer settings.
  4. In the left-hand column, expand the option for Personal access tokens. Select Tokens (classic) from the options displayed.
  5. In the main window, click Generate new token and select Generate new token (classic).
  6. Give the token a clear name that explains what it will be used for (e.g., “terminal-access”).
  7. Under “Expiration,” select a date that matches how long you need the token to be valid. Once the token expires, it will not work anymore.
  8. Select the scopes for your token. The scopes indicate what access permissions the token grants.
  9. When you’re satisfied with the scopes, scroll to the bottom and click the green Generate token button.
  10. GitHub will now show you the token. Make sure that you copy it and store it in a safe location (e.g., a password manager), because GitHub only shows you this token once.

This creates the classic token. The next time that GitHub asks for your password in a terminal, instead of supplying your password, you could paste this token.

What’s the difference between merging and rebasing, and how do I fix a merge conflict?

A merge conflict is what happens when two changes touch the same part of a file, and git needs your help to decide what the final version should be. There are a few different ways you can resolve this, but we’re just going to walk through it using the GitHub UI.

  1. Open a pull request that has a merge conflict. GitHub will provide a message indicating that there’s a conflict and you won’t be able to automatically merge.
  2. Scroll to the bottom and click the Resolve conflicts button inside the warning about conflicts.
  3. GitHub opens the files that have conflicts. Use the editor to resolve the conflicts by choosing which version of the file to use in each case where there’s a conflict.
  4. Once the file has no more conflict markers (i.e., you’ve addressed every conflict), select Mark as resolved in the top-right.
  5. Repeat this process for every file that has merge conflicts.
  6. After you’ve addressed all the files that have conflicts, click the green Commit merge button at the top of the window.

This updates your pull request with the merged conflict and now you’ll be able to merge that change into your repository. Well done!

Now let’s talk about the difference between merging and rebasing, and when you might want to use one over another.

Merging combines changes from one branch into another by creating a new commit that ties both histories together. It preserves the history of both branches. You should merge when you want to preserve the full history of how work happened. This is commonly used for feature branches that are going to be merged into main, such as when you’re adding new functionality.

On the other hand, rebasing moves or replaces your branch’s commits on top of another branch. It rewrites the history to create a linear and cleaner commit timeline. You should rebase when you want a clean linear history, like when you are updating your feature branch to pull in the latest changes from main. For example, if you are working on a feature, but you want to pull in the latest changes from main before merging.

How do I undo my last commit?

Let’s say that you’re in a situation where you’ve already pushed your commit to your branch, and you want to undo it. You can undo your commit through the GitHub UI.

  1. Open the commit that you want to undo on github.com.
  2. Scroll to the bottom of the commit and select the Revert button.
  3. GitHub creates a new commit that undoes the changes from your previous commit. It’s important to realize that this doesn’t erase the commit history, but rather puts a new change in place that undoes your previous changes. You can now either merge this commit directly or open a pull request with it. Opening a pull request is the safest option when others might be using the branch.

If your changes are local, and you haven’t yet pushed to a branch, you can locally revert your commit by running the following command.

git reset --soft HEAD~1

This removes the commit from your local repository, but keeps your work staged so that you don’t lose any changes. If you would rather undo your changes even locally to reset your workspace, you can use the following command. Just realize that by doing this, you might lose your work!

git reset --hard HEAD~1 

How do I update or sync a forked repository on GitHub?

Forking a repository creates your own copy of a project so that you can explore or make changes to it without affecting the original repository. This is especially important when you want to contribute to an open source project. Here’s how you can fork a repository.

  1. Open the repository that you want to fork on github.com.
  2. Select the Fork button at the top of the repository.
  3. Choose the Owner of this forked version, which in most cases will be your GitHub account.
  4. You may optionally rename the repository by providing a new Repository name. By default, forked repositories keep the names of the upstream repository.
  5. At the bottom of the window, select Create fork.
  6. This creates a full copy of the project under your account. To work on it locally, select the Code button and clone it to your machine.
🔍 You can learn more about forking by checking our documentation.

Now that you’ve created the fork, you want to make sure that you still pull in the latest changes from the upstream repository. Otherwise, your forked copy can quickly become out of date.

  1. Navigate to the main page for your forked repository on github.com.
  2. At the top of the repository, select the Sync fork button.
  3. Select Update branch in the pop-up menu.

When you do this, GitHub automatically pulls in the latest changes from the upstream repository to keep your fork up to date. You can also do this from the command line.

  1. Open a terminal and navigate to your repository.
  2. Set the upstream repository. Make sure to update the URL in the following command with your original repository URL.
git remote add upstream YOUR_ORIGINAL_REPOSITORY_URL
  1. Pull in the latest changes.
git fetch upstream
  1. Merge the latest changes into your project. Note that the following command assumes that the upstream project uses main as the default branch. If it uses something else, you will need to use that branch in the following command.
git merge upstream/main
  1. This updates your local copy with all of the changes to the upstream branch. So now, you need to push them to GitHub to make sure your repository is synchronized.
git push origin main

Now you know how to work in your own copy of a project and keep your work synchronized!

How do I review a pull request on GitHub?

A pull request (often abbreviated PR) is a place to share code and talk about changes. Here are three helpful practices to keep in mind when you’re reviewing a pull request.

  1. Start by understanding the goal of the pull request. Open the pull request and read the description. See if it has an associated issue, any screenshots, or notes from the author. Knowing the purpose helps you know why the changes exist and what you’re looking for.
  2. Review the code changes in small sections. Open the Files changed tab and move through the changes one group at a time. If something isn’t clear, leave a comment on that line. Ask questions, offer suggestions, or let the author know if you see a better approach. Keep your comments specific so they know exactly what you’re referencing. It might help to open the code on your machine either via the command line or in a codespace to run the code yourself to ensure you understand. Use terms like “nit” if your comment is not a necessary suggestion for merging the pull request.
  3. Highlight what’s going well. When you see code that’s well organized, thoughtful, or teaches you something, mention it! Positive feedback reinforces good patterns and helps teammates feel supported.
🔍 Learn more about reviewing pull requests by taking a look at our documentation on the topic.

When everything looks ready, use the Submit review button to approve the changes or request updates.

Copilot code review can also help you understand pull requests and suggest improvements. Note that in order to use Copilot, your organization admin needs to enable Copilot for either your repository or your user account. Once Copilot is enabled, you don’t need to install anything special—Copilot code review will automatically appear as an option in pull requests.

  1. Open a pull request on github.com where you want to use Copilot code review.
  2. Select Reviewers in the top-right.
  3. Select Copilot from the list of suggested reviewers.

In a short amount of time, Copilot will complete its review. You can scroll down and see the comments left by Copilot. It always leaves a “Comment” type of review, not an “Approve” or “Request changes” type of review. This means that Copilot reviews do not count toward required approvals nor will they block merging changes.

🔍 Learn more by checking out our Copilot code review documentation.

Next steps

And that’s a wrap! With this episode, we’ve finished another season of GitHub for Beginners, ending with some of the most common questions we’ve seen or heard. We hope that you found this information helpful, and don’t forget to check out our full library of GitHub for Beginners topics.

Happy coding!

The post GitHub for Beginners: Answers to some common questions appeared first on The GitHub Blog.

Read the whole story
alvinashcraft
15 minutes ago
reply
Pennsylvania, USA
Share this story
Delete

Long-Running Agents

1 Share

The following article originally appeared on Addy Osmani’s blog and is being reposted here with the author’s permission.

A long-running AI agent can keep making progress over hours, days, or weeks. It can do this across many context windows and sandboxes, recover from failure, leave structured artifacts behind, and resume where it left off.

For two years the dominant image of an “AI agent” has been a chat window with a clever loop in it. You type a goal; the agent calls some tools; you watch tokens stream by; you stop watching when the work runs out of patience or the context window fills up. That paradigm got us a long way, but it has a ceiling. The model forgets. It declares “task complete” when it isn’t. It reintroduces a bug it fixed nine turns ago. The whole thing is structured around a single sitting.

Long-running AI agents

Long-running agents are what comes next. The idea is easy to state: an agent that keeps making forward progress on a goal across many sessions and many sandboxes, possibly many days or weeks, while leaving the workspace clean enough that the next session can pick up where the last one left off. The engineering is harder. You have to solve for persistence, recovery, and verification in a way that doesn’t just paper over the cracks. You have to build a state layer that lives outside the model’s context window, and you have to design the handoff between sessions so the agent doesn’t lose its mind when it wakes up and finds itself in a different sandbox with a different context window.

This post is my attempt to lay out what’s changed, who’s pushing on it, and how an engineer can use long-running agents today without writing the whole thing from scratch.

What “long-running” actually means

“Long-running” used to mean at least three different things in practice, and it helps to keep them separate.

Long-horizon reasoning. The agent has to plan and execute over many dependent steps. This is mostly a model-quality story: coherence, planning, the ability to recover from a wrong turn 10 steps ago. METR has been tracking this with their time horizon metric, which estimates how long a task a frontier model can complete with 50% reliability. The headline finding is that the metric has been doubling roughly every seven months since 2019, and their TH1.1 update earlier this year doubled the count of eight-hour-plus tasks in the eval set. If that curve holds, frontier agents complete tasks at the day scale by 2028 and the year scale by 2034.

Long-running execution. The agent’s process runs for hours or days. Maybe it’s a coding job, maybe it’s a research sweep, maybe it’s a 24-7 monitoring service. The model might be invoked thousands of times across the run. This is mostly a harness story, and it’s the one this post is mostly about.

Persistent agency. The agent has an identity that outlives any single task. It accumulates memory, learns user preferences, and is always available. This is the Memory Bank flavor of long-running.

In practice the three blur together. A real production agent does long-horizon reasoning inside a long-running execution backed by persistent agency. But the engineering problems are different in each, and so are the products that solve them.

Why this matters

There are two reasons I believe this work matters a lot right now.

The first is a phase change in what’s economically feasible to delegate. An agent that runs for 10 minutes can answer a question, summarize a doc, fix a small bug. An agent that runs for 10 hours can own an entire feature, finish a migration that was on the backlog for six quarters, or do the kind of overnight research sweep that used to require a junior analyst. One of Anthropic’s Claude Sonnet announcements put concrete numbers on this last fall: 30+ hours of autonomous coding in internal tests, including one run that produced an 11,000-line Slack-style app. That’s already past the threshold where the answer to “Should I delegate this?” is no longer obvious.

The second is that persistence changes what the agent is. A stateless agent answers your question and disappears. A long-running one accumulates context: which competitor moved which way last week, which test flaked twice on Tuesday, what you usually mean by “the dashboard.” Anthropic’s Project Vend was the most public early demonstration of this. They had a Claude instance run an actual office vending business for a month, managing inventory, setting prices, talking to suppliers. It failed in informative ways, and the second phase ran much better, but the point wasn’t profitability. The point was watching what kinds of weird coherence problems show up when an agent has to maintain identity across weeks instead of turns.

Those are the same problems every team building production agents now hits.

The three walls every long-running agent hits

Three walls show up in basically every write-up I’ve read this year.

Finite context. Even a 1M-token window fills. And context rot, the steady degradation of model performance as the window gets full, kicks in well before the hard limit. A 24-hour run is not going to fit in any context window the field has on its roadmap. Something has to give.

No persistent state. A new session starts blank. Anthropic’s framing in their scientific computing post is the cleanest version I’ve seen: “Imagine a software project staffed by engineers working in shifts, where each new engineer arrives with no memory of what happened on the previous shift.” Without an explicit persistence story, every shift change is a productivity disaster.

No self-verification. Models reliably skew positive when they grade their own work. Asked “Are you done?” they answer “yes” more often than they should. Without a separate signal that the work meets a bar, you get the agent that ships at 30% complete with full confidence.

Long-running agent designs are mostly answers to these three problems. The major labs have converged on similar shapes of answer, but with very different surface area.

The Ralph loop: One of the simpler practitioner versions of long-running agents

The Ralph loop (sometimes called the Ralph Wiggum technique) is one of “simpler” practitioner version of long-running agents, popularized by Geoffrey Huntley and Ryan Carson. The reference implementation is literally a bash script that loops:

  1. Pick the next unfinished task from a list (prd.json or equivalent).
  2. Build a prompt with the task, the relevant context, and any persistent notes.
  3. Call the agent.
  4. Run tests or other checks.
  5. Append what happened to progress.txt.
  6. Update the task list (done, failed, blocked).
  7. Go back to step 1.

The reason it works is the same reason any of the harnesses below work: State lives outside the agent’s context. prd.json is the plan, progress.txt is the lab notes, and AGENTS.md is the rolling rulebook. The agent itself is amnesiac, but the filesystem isn’t. Each iteration starts fresh and reads enough state from disk to keep going. Carson’s Compound Product extends the idea by chaining multiple loops (an analysis loop that reads daily reports, a planning loop that emits a PRD, an execution loop that writes the code), which is roughly the open source version of the planner-generator-evaluator triad Anthropic landed on independently.

I went deeper on all of this in “Self-Improving Coding Agents”: task list structure, progress files, QA gates, monitoring, the failure modes you’ll actually hit. The short version is that you can build a working long-running agent in an evening with a bash script and a JSON file. Most of what Google and Anthropic have productized is the work of making this pattern recoverable, secure, and observable at scale.

The big-lab stories below are different ways of paying for that production-readiness.

Anthropic: Harnesses, then the brain/hands/session split

Anthropic has been the most public about the engineering. Two posts are worth reading end to end.

The first is “Effective Harnesses for Long-Running Agents,” which lays out a two-agent harness for autonomous full stack development. An initializer agent runs once at the start of a project to set up the environment, expand the prompt into a structured feature-list.json, and write an init.sh that future sessions will run on boot. A coding agent is then woken up over and over, each session asked to make incremental progress on one feature, run tests, leave a claude-progress.txt note, and commit. A test ratchet (“it is unacceptable to remove or edit tests because this could lead to missing or buggy functionality”) sits in the prompt to stop the very common failure of an agent deleting failing tests to “make them pass.” InfoQ’s writeup extends this into a planner, generator, and evaluator triad, on the same logic that separating generation from evaluation matters because models grade their own work too generously.

The second is “Scaling Managed Agents: Decoupling the Brain from the Hands,” the architectural post behind Claude Managed Agents (Anthropic’s hosted runtime, launched in early April). The argument is that an agent has three components that should be independently replaceable. The Brain is the model and the harness loop that calls it. The Hands are sandboxed, ephemeral execution environments where tools actually run. The Session is an append-only event log of every thought, tool call, and observation.

This sounds abstract, but it isn’t. Here’s Anthropic’s framing: “Every component in a harness encodes an assumption about what the model can’t do on its own.” When you couple them, an assumption that goes stale (e.g., the model used to need an explicit planner and now plans natively) means the whole system has to change at once. When you decouple them, the harness becomes stateless, sandboxes become cattle, not pets, and a brain crash doesn’t lose the run. A fresh container calls wake(sessionId) and reconstitutes the state from the log. They reported time-to-first-token dropped ~60% at p50 and over 90% at p95 just from being able to start inference before the sandbox is ready.

The session-as-event-log idea is the part most teams underappreciate. It is what makes a long-running agent recoverable. Without it, a container failure is a session failure and you’re debugging into a stale snapshot. With it, the agent’s memory is a queryable artifact that lives outside whatever process happens to be running at the moment.

For the scientific computing crowd, Anthropic’s “long-running Claude” post reduces all of this to a simpler stack: CLAUDE.md as a living plan the agent edits as it learns, CHANGELOG.md as portable lab notes, tmux plus SLURM plus git as the execution and coordination layer, and the Ralph loop, a for loop that kicks the agent back into context whenever it claims completion and asks if it’s really done. Their flagship case study is a Boltzmann solver Claude Opus 4.6 built over a few days that reached subpercent agreement with a reference CLASS implementation. Months to years of researcher time, compressed.

Same patterns across all three posts: an explicit plan file, an explicit progress file, structured handoffs between sessions, separate generation from evaluation, and a loop that refuses to let the agent stop early.

Cursor: Planners, workers, judges

Cursor’s “Scaling Long-Running Autonomous Coding” is the other essential read this year. They walked into walls that Anthropic mostly papered over.

Their first attempt was a flat coordination model: equal-status agents writing to shared files with locks. It became a bottleneck and made the agents risk averse, churning rather than committing. Their second attempt swapped locks for optimistic concurrency control, which removed the bottleneck but didn’t fix the coordination problem. The third design is what’s running in production now and what they describe as solving most of the problem:

  • Planners continuously explore the codebase and emit tasks. They can recursively spawn subplanners.
  • Workers are focused executors. They don’t coordinate with each other and they don’t worry about the big picture.
  • Judges decide when an iteration is finished and when to restart.

Two things stand out from the post. One: “A surprising amount of the system’s behavior comes down to how we prompt the agents” more than the harness or the model. Two: Different models slot into different roles. Their reported finding is that a GPT model was better than Opus for extended autonomous work specifically because Opus tended to stop early and take shortcuts. Same task, different role, different model. The matching is becoming part of the design surface.

This pairs with Composer 2 (their proprietary frontier coding model that ships in Cursor 3) and their background cloud agents: long-running tasks that run on Anysphere’s cloud infrastructure rather than your laptop. Eight-hour refactors and codebase-wide migrations survive a closed lid. You can start a task locally, hit run in cloud when you realize it’ll take 30 minutes, and reattach later from your phone. Each agent runs in an isolated Git worktree and merges back via PR. The handoff between local and remote is the part most teams haven’t figured out yet, and Cursor’s bet is that it has to be its own product surface.

The shape ends up close to Anthropic’s: Roles are split, sessions are durable, judges sit beside the worker, and a long task runs in a cloud sandbox with Git as the coordination substrate.

Google: Long-running agents on the Agent Platform

Google’s announcement at Cloud Next ’26 folded Vertex AI into the Gemini Enterprise Agent Platform and turned long-running agents into a named product, with named SLAs.

The pieces that matter for this post:

  • Agent Runtime supports agents that “run autonomously for days at a time” with sub-second cold starts and on-demand sandbox provisioning. The launch post’s example use case is a sales prospecting sequence that takes a week to play out, which is roughly the right shape for it.
  • Agent Sessions persist conversation and event history. You can pin them to a custom session ID that maps to your own CRM or DB record, so the agent’s state lives next to the business state instead of in a separate AI silo.
  • Agent Memory Bank is the persistent long-term memory layer, generally available as of Next ’26. It curates memories from sessions, scopes them to a user identity, and exposes a search API so the next agent invocation can pull what’s relevant. Payhawk reported that auto-submitting expenses through a Memory Bank-backed agent cut submission time by over 50%.
  • Agent Sandbox handles hardened code execution.
  • Agent-to-Agent Orchestration, Agent Registry, Agent Identity, Agent Gateway, Agent Observability, and Agent Simulation cover basically every operational concern you’d otherwise build by hand for a production fleet, including the cryptographic-identity-and-audit-log story enterprises actually need to ship.

Architecturally this is the same brain/hands/session split Anthropic described, just productized at platform scale and bundled with ADK (the code-first dev kit) and Agent Studio (the visual one). If you’re building inside Google Cloud, you don’t have to design a session log or a memory store from scratch anymore. You wire an ADK agent into Memory Bank and Sessions, deploy onto Agent Runtime, and the persistence question is answered.

Notice how much this looks like the pattern Anthropic and Cursor describe, just unbundled into named services with SLAs. Three years ago you’d have built all of this yourself. Now you pick which version of “decoupled brain, hands, and session” you want to rent.

Five patterns for long-running agents in production

Shubham Saboo and I wrote up five design patterns we’ve seen separate working long-running agents from demos. They aren’t Google-specific, but they map cleanly onto the primitives Agent Runtime now exposes, so it’s worth walking through them here in shortened form.

Checkpoint-and-resume. The most common multiday failure is context loss. An agent processes 200 documents over four hours, hits an error on document 201, and without a checkpoint you start from scratch. Treat the agent like a long-running server process: write intermediate state to disk, checkpoint every N units of work, recover from failures. The Agent Runtime sandbox gives you a persistent filesystem, but choosing the right checkpoint granularity (not every step, not only the end) is on you.

Delegated approval (human-in-the-loop). Most “human-in-the-loop” implementations are: serialize state to JSON, fire a webhook, hope someone responds. The state goes stale, the notification gets buried, the agent re-deserializes into a slightly different world. Long-running runtimes let the agent pause in place with full execution state intact: reasoning chain, working memory, tool history, pending action. Hours of human time pass, the agent consumes zero compute, and it resumes with subsecond latency. Mission Control is Google’s inbox for this. The pattern works regardless of vendor.

Memory-layered context. A seven-day agent needs more than session state. Memory Bank handles long-term curated memory, Memory Profiles add low-latency lookups, and the failure mode you’ll hit in production is memory drift: The agent learns a procedural shortcut from a few atypical interactions and starts applying it broadly. Govern memory like you govern microservices. Agent Identity controls who can read and write which banks. Agent Registry tracks which version of which agent is running. Agent Gateway enforces policy on the wire. The auditing question stops being “What are my agents doing?” and becomes “What are my agents remembering, and how is that changing their behavior?”

Ambient processing. Not every long-running agent talks to a human. Some sit on a Pub/Sub stream or a BigQuery table and act on events as they arrive: content moderation, anomaly detection, inbox triage. The architectural decision worth making early is to not hardcode policy into the agent. Define it in the Gateway and the fleet picks up policy changes without redeploys. Ambient agents run unsupervised for long stretches, and the only sane way to update a hundred of them is to update the policy layer once.

Fleet orchestration. In real systems, you rarely have one agent. A coordinator delegates subtasks to specialists (a Lead Researcher Agent, a Scoring Agent, an Outreach Agent), each running independently for different durations. Each specialist gets its own Identity (so the Outreach Agent can’t read financial data meant for Scoring), its own policy enforcement, its own Registry entry. This is the same coordinator/worker shape distributed systems have used for decades. What’s new is that ADK handles it declaratively with graph-based workflows, and a bad deployment in one specialist doesn’t cascade to the others.

The patterns compose. A compliance system might use checkpointing for document processing, delegated approval for review gates, memory layering for cross-session knowledge, and fleet orchestration to coordinate the specialists. The opening question is always the same: What’s the longest uninterrupted unit of work your agent needs to perform? Minutes, and you don’t need long-running agents. Hours or days, and these patterns are where to start. The full write-up with code samples covers each pattern in depth.

So how do you actually build one today?

This is the practical question, and it has a different answer depending on what you’re building.

You’re a developer who wants long-running coding work on your own repo. Just use Claude Code (or Antigravity, Cursor, or Codex). The harness is already there. Treat your AGENTS.md like a pilot’s checklist: short, every line earned by a real failure. Add hooks for typecheck and lint that surface failures back to the agent. Write a plan file before the agent starts. Use the Ralph loop when the agent claims it’s done and you don’t believe it. For multihour or overnight jobs, run in a worktree so a closed laptop doesn’t kill the run, and have it commit progress every meaningful unit of work. This is the path most people should take, and it’s where the most leverage is right now.

You’re building a hosted agent product. Don’t build the runtime. Pick a managed one. The three real options today: Google’s Agent Platform (Agent Engine + Memory Bank + Sessions), Claude Managed Agents, or roll something on top of ADK, the Claude Agent SDK, or Codex SDK and host it yourself. The trade-off is the usual one. Managed gets you the brain/hands/session split, observability, identity, and an audit trail out of the box. Self-hosted gets you control and the ability to use weird models for weird roles (Cursor’s pattern). For most teams, the right starting point is a managed runtime plus your own ADK or SDK code for the actual loop.

You’re doing something autonomous and operational (monitoring, research, ops). Memory Bank-style persistence is what you want, and it’s the part that doesn’t exist in Claude Code. ADK + Memory Bank + Cloud Run + Cloud Scheduler is the cleanest stack I’ve seen for “agent runs every N hours, accumulates state, alerts on a threshold.” This is also where Cursor’s planner/worker/judge split starts to matter more than it does for IDE coding, because the work is genuinely parallel and the failure modes are different.

A few things matter regardless of which path you take.

Write down the done condition before the agent starts. This is the single highest-leverage move for long runs. The Anthropic harness post calls it the feature list; Cursor calls it the planner’s task spec. Either way, it’s an external file with explicit, testable completion criteria, and it exists so the agent can’t quietly redefine done midrun.

Separate the evaluator from the generator. Self-grading is the failure mode. A planner/worker/judge pipeline, or a generator/evaluator pair, is a real architectural pattern, not a stylistic preference. Even if it’s the same model in different roles with different prompts.

Invest in the session log, not just the prompt. The append-only event log is what makes the agent recoverable, debuggable, and auditable. If you can’t reconstruct what the agent did in the last 24 hours from durable storage, what you have is a long-running shell script that happens to call an LLM, not a long-running agent.

Treat compaction and context resets as first class. Anthropic is explicit that summarization-as-compaction wasn’t enough for very long jobs; they had to do full context resets where the harness tears the session down and rebuilds it from a structured handoff file. It is essentially how humans onboard a new engineer.

There are some real limitations right now

A few things are still genuinely unsolved.

Cost. A 24-hour run with a frontier model and a few tools is not cheap. Without budgets, circuit breakers, and a hard cap on tool spend, an agent can quietly burn through a week’s API budget in an afternoon. This is solvable, but it’s an explicit step you have to take.

Security. A long-running agent with API keys, cloud access, and the ability to run shell commands has a much larger attack surface than a chat session. The brain/hands separation pattern matters here too: Credentials should be unreachable from the sandbox where model-generated code runs, which is one of the benefits Anthropic calls out for Managed Agents.

Alignment drift. Over many context windows, agents drift. The original goal gets summarized, then resummarized, then loses fidelity. This is the part hooks and judges exist to defend against. It is also the most common reason “the agent went off and did something I didn’t ask for.”

Verification. Auditing 24 hours of autonomous activity is a real human-time problem. Observability and structured artifacts (PRs, commits, briefings, test runs) are how you make this tractable. Without them, you’re scrolling logs and you’ll miss what matters.

The human role. This is the one I keep coming back to. Defining work crisply enough that an agent can run for a day on it is harder than doing the work yourself. The skill that’s appreciating in value isn’t writing code. It’s writing specs that survive contact with an autonomous executor.

Where this is going

Google, Anthropic, and Cursor have converged on roughly the same shape. Separate the model loop from the execution sandbox from the durable session log. Split planning from generation from evaluation. Bake in compaction, hooks, and context resets. Expose memory as a managed service that any agent invocation can query.

Surface area is what differs. Google’s Agent Platform is the enterprise-stack version, with the identity and audit trail story baked in. The patterns underneath are the same. Claude Managed Agents is “Anthropic’s harness, hosted.” Cursor’s background agents are “long-running coding, pulled out of the IDE and into the cloud.”

The harder problems for the next year aren’t in any of those layers individually. They’re in the coordination above them. Many long-running agents on a shared codebase. Agents that read their own traces and patch their own harnesses. Harnesses that assemble tools and context just in time for a task instead of being preconfigured at startup. That’s where the agent stops looking like a smarter chat window and starts looking like a colleague who’s been on the project longer than you have.

The model is still load-bearing. But the gap between a chat window and an agent you can leave running overnight is mostly in the state, sessions, and structured handoffs wrapped around it. That’s where I’d spend my learning time right now.



Read the whole story
alvinashcraft
15 minutes ago
reply
Pennsylvania, USA
Share this story
Delete

Rider 2026.2 EAP 5: Code Quality Checks for Your AI Agents, and More.

1 Share

Rider 2026.2 EAP 5 is now available, bringing a faster startup flow with the new non-modal Welcome screen and quality-check hooks for AI agents.

If you’re catching up on the 2026.2 EAP cycle, be sure to check out the blog posts we’ve already published about other updates unveiled so far, including WPF Hot Reload, the finding-tests skill for AI-assisted test generation, and the earlier EAP builds.

Quality-check hooks for Claude Code and Codex

Rider 2026.2 EAP 5 introduces bundled quality-check hooks for external AI agents, starting with Claude Code and Codex. In agent workflows, a hook is an automated step that runs at a specific point in the agent’s process. Here, Rider uses a PostToolUse hook: after an agent edits a file, Rider automatically runs IDE-level validation before the agent continues.

This means agent-generated code is no longer just accepted as-is. These checks can detect code issues identified by Rider’s built-in analysis and inspections, as well as formatting inconsistencies.

Agent-generated code with and without Rider’s quality-check feedback.
Watch Rider hooks catch potential errors and redirect the agent.

Errors can block the agent from treating the task as complete, while warnings and other findings are returned as feedback the agent can use to fix its own output. The result is a tighter AI-assisted development loop where the IDE, not the agent, sets the quality bar.

Easier access to Explain with AI

The Explain with AI action is now easier to discover when you need it most: while dealing with build errors and runtime exceptions. Instead of copying diagnostics into chat or manually describing what went wrong, you can trigger an AI explanation directly from the place where Rider surfaces the problem.

For .NET developers, this is especially useful because build output often combines Roslyn diagnostics, analyzer warnings, MSBuild issues, NuGet restore problems, and multi-targeting failures. Explain with AI helps turn noisy or context-dependent errors into a clearer explanation with likely causes and next steps, so you can move from failure to fix faster.

Share your thoughts

That’s it for Rider 2026.2 EAP 5. Download the latest EAP build, try the new features for AI-assisted development, and let us know how they work in your projects.

Read the whole story
alvinashcraft
16 minutes ago
reply
Pennsylvania, USA
Share this story
Delete

Agent 365 | Identity & Access Controls in Entra

1 Share
From: Microsoft Mechanics
Duration: 8:31
Views: 93

Take control of every AI agent, managed or not, running in your environment using Agent 365 and Microsoft Entra. Surface agents across AWS Bedrock, Google Vertex, Databricks, and Salesforce in one registry, assign Entra Agent IDs via CLI or SDK, and enforce least-privilege access through Conditional Access policies and Agent Blueprints, all without rebuilding your existing identity infrastructure.

Lock down agent activity with sign-in logs that capture every authentication attempt, policy hit, and failure. Govern agents as first-class identities alongside your users, apps, and devices, and draw a hard line between managed and unmanaged AI in your organization.

Vince Smith, Microsoft Entra Principal Product Manager, shares how to establish full visibility, access control, and lifecycle governance for AI agents using Microsoft Entra and Agent 365.

► QUICK LINKS:
00:00 - Visibility and control with Agent 365
01:39 - Multi-platform registry sync
02:29 - Assign Agent ID
04:14 - Agent Blueprints
05:24 - Conditional Access for agents
06:24 - Sign-in logs audit trail
07:03 - Unblock the agent
07:54 - Wrap up

► Link References

Check out https://aka.ms/EntraforAgents

► Unfamiliar with Microsoft Mechanics?
As Microsoft's official video series for IT, you can watch and share valuable content and demos of current and upcoming tech from the people who build it at Microsoft.

• Subscribe to our YouTube: https://www.youtube.com/c/MicrosoftMechanicsSeries
• Talk with other IT Pros, join us on the Microsoft Tech Community: https://techcommunity.microsoft.com/t5/microsoft-mechanics-blog/bg-p/MicrosoftMechanicsBlog
• Watch or listen from anywhere, subscribe to our podcast: https://microsoftmechanics.libsyn.com/podcast

► Keep getting this insider knowledge, join us on social:
• Follow us on Twitter: https://twitter.com/MSFTMechanics
• Share knowledge on LinkedIn: https://www.linkedin.com/company/microsoft-mechanics/
• Enjoy us on Instagram: https://www.instagram.com/msftmechanics/
• Loosen up with us on TikTok: https://www.tiktok.com/@msftmechanics

#MicrosoftEntra #AIGovernance #ZeroTrust #AIAgents

Read the whole story
alvinashcraft
17 minutes ago
reply
Pennsylvania, USA
Share this story
Delete

Episode 188 - Virtualization pitfalls and best practices with SQL Server

1 Share

Guy talks about an interesting incident involving performance problems in a virtual environment.

And also, we talk about the SSMS StatisticsParser extension and the question of SSMS extensions in general.

Relevant links:





Download audio: https://traffic.libsyn.com/secure/madeirasqlserverradio/SQLServerRadio_Show188.mp3?dest-id=213904
Read the whole story
alvinashcraft
17 minutes ago
reply
Pennsylvania, USA
Share this story
Delete
Next Page of Stories