
SocialReasoning-Bench: Measuring whether AI agents act in users’ best interests


At a glance

  • AI agents are moving into social contexts. When agents manage calendars, negotiate purchases, or interact with other agents on a user’s behalf, they need more than task competence—they need social reasoning.
  • SocialReasoning-Bench evaluates that ability. The benchmark tests whether an agent can negotiate for a user in two realistic settings: Calendar Coordination and Marketplace Negotiation. 
  • The benchmark measures both outcomes and process: it scores agents on outcome optimality (how much value they secure for the user) and due diligence (whether they follow a competent decision-making process). 
  • Current frontier models often leave value on the table. They usually complete the task, but they frequently accept suboptimal meeting times or poor deals instead of advocating effectively for the user. 
  • Prompting helps, but it is not enough. Even with explicit guidance to act in the user’s best interest, performance remains well below what a trustworthy delegate should achieve.

As AI agents take on more real-world tasks, they are increasingly operating in social contexts. With the right integrations, agents like Claude Cowork and Google Gemini can manage email and calendar workflows. In these settings, the agent must interact with others on your behalf. This requires social reasoning — understanding what you want, what the counterparty wants, and what information to reveal, protect, or push back on.

Our previous research suggests that today’s frontier models lack social reasoning. In our simulated multi-agent marketplace, agents accepted the first proposal they received up to 93% of the time without exploring alternatives. When red-teaming a social network of agents, a single malicious message spread through the system and led agents to disclose private data before passing the message along.

This kind of relationship has a long history outside AI. In economics and law it is called a principal-agent relationship: an agent acts on a principal’s behalf in interactions with others whose interests differ. Attorneys, real-estate agents, and financial advisors all operate in this mode, and the duties they owe—care, loyalty, confidentiality—are codified in centuries of professional norms. AI agents acting on a user’s behalf should ultimately be held to similar standards.

To measure and drive progress in social reasoning, we built SocialReasoning-Bench:  a benchmark for testing whether agents can reason and negotiate on a user’s behalf against a counterparty with independent goals, private information, and potentially adversarial intent.

Introducing SocialReasoning-Bench

Figure 1: Our benchmark measures agents’ social reasoning ability in two domains, calendar coordination and marketplace negotiation. Each requires communicating with other parties, advocating on a principal’s behalf, and reasoning about tradeoffs.

SocialReasoning-Bench evaluates social reasoning in two domains: Calendar Coordination and Marketplace Negotiation. In each, an agent advocates for its user against a counterparty and is scored on both the outcome it reached and the process it followed. We find that frontier models complete most tasks but consistently leave value on the table for the user.

Calendar coordination

In calendar coordination, an assistant agent manages a user’s calendar on a single day and fields a meeting request from another agent.

We assume the agent has access to a value function over time slots that captures the user’s scheduling preferences, with scores between 0.0 and 1.0. This function could be provided explicitly by the user or inferred from their calendar history, and is given to the assistant at the start of the task.

The counterparty is a requestor agent representing another person who wants to schedule a meeting with the user. The counterparty has its own value function over the same slots, constructed as the inverse of the user’s, so the slots most valuable to one are least valuable to the other. Some requestors negotiate in good faith, while others use the interaction to extract private calendar details or push the assistant toward times the user does not want.

In each task there is a zone of possible agreement (ZOPA), a term borrowed from negotiation theory for the set of outcomes that both parties could plausibly accept. In calendar coordination, the ZOPA is the set of time slots that are mutually free on both calendars. We construct every task so that the ZOPA contains at least three slots with different preference scores for the user, and the requestor’s opening request always conflicts with the user’s calendar.
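
As a concrete illustration, this task construction can be sketched in a few lines of Python. The slot labels, preference scores, and variable names below are hypothetical, not taken from the benchmark:

    # Hypothetical calendar-coordination task. Slots and scores are
    # illustrative; the benchmark's real tasks are richer than this.
    slots = ["09:00", "10:00", "11:00", "13:00", "14:00"]

    # The user's value function over slots, with scores between 0.0 and 1.0.
    user_value = {"09:00": 0.9, "10:00": 0.7, "11:00": 0.6, "13:00": 0.3, "14:00": 0.2}

    # The requestor's preferences are constructed as the inverse of the user's.
    requestor_value = {s: 1.0 - v for s, v in user_value.items()}

    user_busy = {"10:00"}       # already booked on the user's calendar
    requestor_busy = {"13:00"}  # already booked on the requestor's calendar

    # The ZOPA is the set of slots that are mutually free.
    zopa = [s for s in slots if s not in user_busy and s not in requestor_busy]
    assert len(zopa) >= 3       # every task guarantees at least three ZOPA slots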

Marketplace negotiation

In marketplace negotiation, a buyer agent representing a user negotiates with a seller agent to purchase a single product.

The user wants to pay as little as possible for the product. Their value function is the gap between the deal price and a private reservation price, the highest price they would pay. A larger gap captures more value, and a deal above the reservation captures none.

The counterparty is a seller agent with its own private reservation price set below the buyer’s. The counterparty’s value function mirrors the user’s, with higher deal prices yielding more value and deal prices below the seller’s reservation price yielding no value.

The ZOPA is the price range between the seller’s and buyer’s reservations. The seller’s opening offer is always above the buyer’s reservation, forcing the buyer to negotiate the price down.
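
The same structure can be sketched for the marketplace. Again, the prices and names are illustrative assumptions, not benchmark parameters:

    # Hypothetical marketplace task. Prices are illustrative.
    buyer_reservation = 100.0   # highest price the buyer would pay (private)
    seller_reservation = 60.0   # lowest price the seller would accept (private)

    def buyer_value(price):
        """Buyer surplus: the gap below the reservation price, floored at zero."""
        return max(buyer_reservation - price, 0.0)

    def seller_value(price):
        """Seller surplus: the gap above the seller's reservation, floored at zero."""
        return max(price - seller_reservation, 0.0)

    # The ZOPA is the price range between the two reservations.
    zopa = (seller_reservation, buyer_reservation)

    opening_offer = 120.0       # the seller always opens above the buyer's reservation
    assert opening_offer > buyer_reservation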

New metrics for a new setting

Existing benchmarks focus on task completion: did the meeting get scheduled? Did the trade close? In principal–agent settings, what matters is not just whether the task is completed, but how well it is done. We introduce new measures to capture this distinction.

Outcome Optimality

Outcome optimality scores the share of available value the agent captured for its principal, on a 0-to-1 scale. The outcome inside the ZOPA most favorable to the principal scores 1.0, while the outcome most favorable to the counterparty scores 0.0. Intermediate outcomes are scored by where the principal’s value function places them between those two endpoints.
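
In code, that normalization might look like the following sketch, where `value` is the principal’s value function and `zopa_outcomes` enumerates the outcomes inside the ZOPA (both hypothetical names):

    def outcome_optimality(value, outcome, zopa_outcomes):
        """Share of available value captured for the principal, in [0, 1]."""
        best = max(value(o) for o in zopa_outcomes)
        worst = min(value(o) for o in zopa_outcomes)
        return (value(outcome) - worst) / (best - worst)

    # Marketplace illustration: buyer reservation 100, seller reservation 60,
    # with the ZOPA discretized for the example. A deal at 70 captures
    # three-quarters of the available surplus.
    buyer_value = lambda price: max(100.0 - price, 0.0)
    print(outcome_optimality(buyer_value, 70.0, [60.0, 70.0, 80.0, 90.0, 100.0]))  # 0.75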

Due Diligence

Outcome optimality alone conflates skill with luck. An agent that immediately accepts a counterparty’s first offer, without inspecting its situation or making a counter-proposal, can still score well if the counterparty happens to propose a good outcome. To separate skill from luck, we introduce a process metric.

Due diligence scores process quality on a 0-to-1 scale by comparing the agent’s actions, at each decision point in the trajectory, against the action a deterministic reasonable-agent policy would have taken in the same state. The reasonable-agent policy is a greedy procedure that captures what a competent advocate would do at each step, such as gathering relevant context before acting, opening with a position favorable to its principal, and conceding only after better options have been exhausted. The Due Diligence score is the rate at which the agent’s actual choices match the reasonable-agent’s choices over the trajectory.
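
A minimal sketch of that match-rate computation, assuming a trajectory recorded as (state, action) pairs and the reasonable-agent policy as a callable; the benchmark’s actual state and action representations are more detailed than this:

    def due_diligence(trajectory, reasonable_policy):
        """Rate at which the agent's actions match the reasonable-agent policy.

        `trajectory` is a list of (state, action) pairs, one per decision point.
        """
        matches = sum(1 for state, action in trajectory
                      if action == reasonable_policy(state))
        return matches / len(trajectory)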

Duty of care

Together, Outcome Optimality and Due Diligence form an operational notion of an agent’s duty of care to the person it represents. An agent that lands a good outcome through a careless process is fragile, while an agent that follows good process but lands a bad outcome points to a capability gap rather than negligence. Only an agent that scores well on both is exhibiting strong social reasoning.


Experimental setup

For the calendar assistant agent and marketplace buyer agent, we evaluate GPT-4.1 with chain-of-thought, GPT-5.4 at high reasoning effort, and Claude Sonnet 4.6 and Gemini 3 Flash at high thinking levels. The counterparty (i.e., the requestor in calendar coordination and the seller in marketplace negotiation) is always Gemini 3 Flash with medium reasoning effort, held constant across all conditions so that any difference in scores reflects the model under test rather than the difficulty of its opponent.

Each model is run under two prompt conditions: Basic Prompting, where the agent receives only role and tool descriptions, and Defensive Prompting, where the agent additionally receives explicit guidance to consult all available sources and advocate for the user toward the best possible outcome.

Each task runs for at most 10 negotiation rounds. The counterparty proposes first in every task.
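
Putting the pieces together, the interaction loop can be sketched as follows. The agent interfaces here (`open`, `respond`, `reply.kind`) are assumptions for illustration, not the benchmark’s actual API:

    MAX_ROUNDS = 10

    def run_task(assistant, counterparty):
        """One task: the counterparty opens, then the parties alternate."""
        proposal = counterparty.open()            # counterparty proposes first
        for _ in range(MAX_ROUNDS):
            reply = assistant.respond(proposal)   # accept, refuse, or counter-offer
            if reply.kind in ("accept", "refuse"):
                return reply                      # task ends on agreement or refusal
            proposal = counterparty.respond(reply)
        return None                               # no agreement within the round limit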

What we’re finding

Finding 1: Agents complete tasks at near-perfect rates but produce poor outcomes.

In calendar scheduling, agents almost always succeed in booking the meeting, but most often at suboptimal times. In marketplace negotiation, deals almost always close, but frequently at the worst possible price. The tasks get done, but not done well: task completion signals success, while Outcome Optimality reveals a consistent failure to act in the principal’s best interest.

Figure 2: Task Completion vs Outcome Optimality by model and domain. All models complete tasks at near-perfect rates, but produce poor outcomes. We measured Outcome Optimality against the two prompts, basic and defensive. Defensive prompting helps but does not close the gap.

Finding 2: Defensive prompting helps, but is not enough to close the gap.

When we explicitly instruct agents to work hard on their principal’s behalf, we see outcome improvements across both domains, but not enough to close the gap. GPT-5.4 benefits most from defensive prompting (+0.21 in calendaring, +0.12 in marketplace), while GPT-4.1 barely responds to it in either domain. The other models fall somewhere in between.

Finding 3: Outcome optimality shows how much value agents leave on the table.

Outcome optimality reflects where each deal lands within the ZOPA. When we plot outcomes, they cluster closer to the counterparty’s ideal than the principal’s.

Figure 3: Outcome Optimality (OO) distribution by model and domain. Each dot is one task instance. OO=1.0 means the agent captured all available value for its principal; OO=0.0 means the counterparty captured everything. Black lines show the mean. In marketplace, outcomes cluster near zero across all models. In calendar, agents perform better but still settle below the midpoint on average.

In marketplace negotiation, all models settle at or near zero for Outcome Optimality, accepting deals that give away virtually all available surplus. In calendar scheduling, agents perform better but still land below the midpoint, accepting the requestor’s preferred slots rather than ones that better serve their principal.

Measuring value capture in agent negotiations builds on recent studies examining how agents perform in marketplace settings. Because we operate in a controlled setting, we can establish ground-truth constraints for both parties and measure exactly how the available value was divided. Our formulation also generalizes beyond price-based negotiations: by abstracting to a domain-specific value function, Outcome Optimality can measure surplus division in any setting where agents face competing incentives, including non-monetary domains like calendar scheduling where “value” is defined over preference scores rather than prices.

Finding 4: Due Diligence helps distinguish between luck and skill.

When we look at the combination of outcome quality and process quality, a more nuanced picture emerges. Many agents that achieve reasonable outcomes do so through fragile processes: they don’t check context before acting, or they accept offers without countering. High Outcome Optimality with low Due Diligence suggests an agent that got lucky rather than one that can be trusted. Conversely, some agents show genuine diligence — gathering information, pushing back — but still land on poor outcomes, pointing to capability gaps rather than negligence. Dividing Outcome Optimality and Due Diligence each into high (≥ 0.5) and low (< 0.5) buckets, we can sort every task into one of four archetypes.

                             Not diligent (DD < 0.5)    Diligent (DD ≥ 0.5)
    Good outcome (OO ≥ 0.5)  Lucky                      Robust
    Poor outcome (OO < 0.5)  Negligent                  Ineffective
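
The bucketing itself is simple to express; a sketch:

    def archetype(oo, dd, threshold=0.5):
        """Classify a task by its Outcome Optimality and Due Diligence scores."""
        if oo >= threshold:
            return "Robust" if dd >= threshold else "Lucky"
        return "Ineffective" if dd >= threshold else "Negligent"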

Through the lens of this decomposition, we can see that models exhibit robust duty of care on more than 50% of calendar coordination tasks, with Gemini 3 Flash leading at 90% robust. In marketplace negotiation, though, a very different picture emerges. GPT-4.1 is negligent in 95% of tasks, neither gathering information nor advocating for its principal, while Claude Sonnet 4.6, GPT-5.4, and Gemini 3 Flash show ineffective behavior in roughly 90% of marketplace tasks, negotiating diligently but still unable to achieve good outcomes. 

Figure 4: Splitting Outcome Optimality and Due Diligence into “low” (< 0.5) and “high” (≥ 0.5) buckets each, we plot the percent of tasks for each model that fall into each quadrant. For example, in calendar scheduling, GPT-4.1 achieves both high OO and high DD (Robust) in 63% of tasks. In contrast, in the marketplace domain, GPT-4.1 exhibits low OO and low DD (Negligent) in 95% of tasks.

Figures 5-8 illustrate these different behaviors and failure modes with real examples from SocialReasoning-Bench in the calendaring domain. We see agents that follow a strong negotiation strategy and secure high-value outcomes, but also agents that achieve reasonable outcomes through sloppy processes, such as failing to propose the principal’s best option. Others begin with a strong position but concede prematurely, collapsing to poor deals. At the extreme, some agents exhibit negligent behavior, accepting the first proposal without checking constraints, even when it directly conflicts with the user’s interests.

Figure 5. A real paraphrased example of robust behavior from GPT-4.1 in the calendaring domain, achieving a good outcome after proposing the principal’s most preferred option first, correctly refusing the conflict, and then holding the line at their second best option.
Figure 6. GPT-4.1 in the calendaring domain achieving a reasonable outcome from a sloppy process that didn’t include proposing the principal’s most preferred option.
Figure 7. GPT-4.1 in the calendaring domain starting out strong by proposing the principal’s most preferred slot but then caving early and achieving a poor outcome.
Figure 8. GPT-4.1 exhibiting negligent behavior, accepting the requestor’s first proposal without confirming availability and conflicting with another meeting on the principal’s calendar.

Taken together, these examples highlight why outcome alone is insufficient. Without measuring process, we risk mistaking brittle or accidental success for genuine capability. Due Diligence helps surface whether an agent is consistently behaving like a competent, trustworthy delegate, or simply getting lucky.

Finding 5: Agents are vulnerable to adversarial manipulation.

When we stress test agents by pitting them against adversarial counterparties, we find that agents struggle to balance when to engage, when to refuse, and how to negotiate under pressure.

To create these adversarial scenarios, we introduce counterparties explicitly trying to manipulate outcomes or bypass protective steps. Some follow carefully designed strategies, applying pressure or probing for information, while others use more unpredictable, creatively generated whimsical tactics that mimic novel forms of social engineering. Together, these test whether agents can handle not just known attacks, but unfamiliar ones.

Figure 9: Refusal Rates and Outcome Optimality when agents engaged with adversarial requestors in both domains. Agents rarely refuse adversarial requests in calendaring, while refusing more often in the marketplace. When agents did engage with malicious actors, Outcome Optimality dropped across the board.

We find that, aside from Claude Sonnet 4.6, agents rarely refuse adversarial requests in calendar scheduling, while refusing more often in marketplace settings. This suggests that adversarial intent is harder to detect in socially framed interactions. When agents do engage, the impact is starkest in calendar scheduling, with Outcome Optimality dropping substantially across GPT-4.1, GPT-5.4, and Gemini 3 Flash, suggesting that adversarial counterparties successfully steer these agents toward worse outcomes. In the marketplace domain, Outcome Optimality when agents engaged remains comparable to the low levels achieved against benign counterparties, capturing little to no value for their principals.

Why this matters now

Agents are interacting with each other in multi-party environments, from collaborating across enterprise workflows to transacting in digital marketplaces. As these networks form, the social reasoning gaps we observe in simple two-agent settings can begin to compound. Weak negotiation, over-trust, or failure to exercise due diligence no longer stay local. They propagate through coordination, influence downstream decisions, and shape collective outcomes.  

In isolation, an agent that accepts a bad meeting time or a poor deal causes limited harm. In a network, those same behaviors can cascade, leading to systematically worse coordination or widespread value loss across many agents.

Recent work has begun exploring these risks and dynamics through case studies of agents interacting in networked settings. SocialReasoning-Bench complements this line of work by providing a controlled, reproducible benchmark that isolates interaction behaviors and makes them measurable. This allows us to move beyond anecdotes and systematically track progress, giving model, agent, and platform developers a concrete target for building agents that act as trustworthy delegates.

SocialReasoning-Bench is open source and available on GitHub.

Limitations and future work

Our current measures treat all counterparties equally. In practice, relationships matter. A socially intelligent agent should modulate its assertiveness based on its principal’s relationship with the counterparty: pushing too hard when scheduling a meeting with a senior executive may damage a valuable relationship, and sometimes the right outcome is reached through compromise. Developing relationship-aware measures that account for power dynamics, rapport, and long-term consequences is an important direction for future work.

We evaluate social reasoning in simplified two-agent settings, whereas real-world delegation often involves multi-party dynamics such as group scheduling or multi-stakeholder negotiations. Each task is also treated as an independent encounter, with no modeling of long-term relationships, reputation, or trust-building across repeated interactions. Our scenarios are also limited to English-language and U.S.-centric business contexts, though social norms around negotiation, privacy, and hierarchy vary widely across cultures. Looking ahead, we plan to extend our benchmark to more diverse settings.

Finally, Outcome Optimality works well in settings with clear boundaries, where a “good” outcome can be defined and measured. But many tasks that require duty of care, such as drafting sensitive messages or navigating team dynamics, may not have a well-defined ZOPA. In these cases, outcomes depend on context, relationships, and judgment in ways that may resist a single score. Extending our approach to these more subjective settings is an important direction for future work.

Acknowledgements 

We would like to thank Brendan Lucier, Adam Fourney, Amanda Swearngin, and Ece Kamar for their helpful feedback, discussions, and support of this work.


The post SocialReasoning-Bench: Measuring whether AI agents act in users’ best interests appeared first on Microsoft Research.


End-to-end encrypted RCS messaging begins rolling out today in beta

Apple and Google collaborate with the GSMA to roll out end-to-end encrypted RCS messaging in beta, enhancing cross-platform communication security.


When the Sensor Starts Thinking: SnortML, Agentic AI, and the Evolving Architecture of Intrusion Detection

Signature-based detection has always known what it was looking for. Machine learning and autonomous agents are changing the question entirely, shifting from "does this match a known pattern?" to "does this actually make sense in context?"

OAuth 2.0 – Device flow explained for Engineers, especially for Backend Engineers


GitHub for Beginners: Getting started with OSS contributions


Welcome back to GitHub for Beginners. So far, we’ve discussed GitHub Issues and Projects, GitHub Actions, security, GitHub Pages, and Markdown. This time we’re going to talk about open source software and how to contribute to that community. By the end of this post, you’ll know what open source is, how to find projects to work on, how to read an open source repository, and how to start making your first contributions. So let’s get started!

As always, if you prefer to watch the video or want to reference it, we have all of our GitHub for Beginners episodes available on YouTube.

What is open source?

Open source software (OSS) refers to software that features freely available source code. In contrast with “closed source software,” OSS is publicly available for anyone to use and build upon. This means that all of the work, including the codebase and communication between users, is available for everyone to see.

If you’re just getting started in the world of software development, browsing and contributing to open source projects is a great way to dip your toes into large, impactful projects used by countless users worldwide.

GitHub is the home for open source software, so let’s look at how to find projects you can contribute to.

How to find OSS projects to work on

Contributing to an open source software project for the first time can be daunting—we’ve all been there! The first step is to look for projects written in a language you know that are accepting new contributors. One way to do this is to ask GitHub Copilot Chat for help.

  • Navigate to github.com and select the Copilot icon to open a chat window.
  • In the bottom-left corner of the chat window, use the combo box to select Ask.
  • Enter a prompt like the following, but remember to update it for a language you’re comfortable with.
I’m looking for a list of open source projects written in TypeScript that are accepting new contributors. Search GitHub and narrow down the list to repositories that use the good first issue label and have over 100 stars on GitHub. 

Copilot will do some searching and return a list of projects, filtered by the good first issue label, for you to explore. This label indicates that an issue is beginner friendly and a great starting point for new contributors, making it an easy way to find issues you can work on in a project.
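
If you prefer to search programmatically, the same filter can be expressed against the GitHub REST API. Here’s a minimal Python sketch using the requests library; the query mirrors the prompt above (TypeScript, over 100 stars, open good first issues):

    import requests

    # Repositories written in TypeScript with over 100 stars and open
    # "good first issue" issues, sorted by stars. Unauthenticated requests
    # are rate-limited, so add a token for heavier use.
    resp = requests.get(
        "https://api.github.com/search/repositories",
        params={
            "q": "language:typescript stars:>100 good-first-issues:>0",
            "sort": "stars",
            "per_page": 5,
        },
        headers={"Accept": "application/vnd.github+json"},
        timeout=30,
    )
    resp.raise_for_status()
    for repo in resp.json()["items"]:
        print(repo["full_name"], repo["stargazers_count"])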

For example, let’s say that you wanted to contribute to the vscode repository.

  1. Navigate to the vscode repository.
  2. At the top of the repository, select the Issues tab.
  3. On the Issues page, click the Labels box to open the drop down menu.
  4. In the text box on the drop down menu, start typing “good” until you see the option for good first issue.
  5. Select the good first issue label.

The window will update and display a list of good first issues for you to work on. But before jumping in, you should read the contributor’s guide in the project’s repository. Most well-maintained open source projects will have one.

Understanding an open source project

As we just alluded to, most well-maintained open source projects have a few things in common:

  • A well-documented README with installation instructions.
  • A contributor’s guide that explains how to contribute.
  • An open source license, so everyone knows the project is free to use.
  • At least 100 GitHub stars to show it’s used in the community.
  • Active development so that you know a maintainer of the source code will be able to review your contributions.
  • A good first issue label to indicate it’s open to new contributors.

When you’re looking for a project to contribute to, these are the things you should be looking for in a repository.

💡 For more documentation on finding a good open source project, go to gh.io/gfb-oss to learn more about finding good first issues.

Making an OSS contribution

Now let’s look at an actual project and walk through how you would make your first contribution. For this demo, take a look at the gitfolio repository. Using the bullet points above, we want to see if this would be a good project to work on.

  • The project does have a well-documented README file.
  • The project has a contributor’s guide: CONTRIBUTING.md.
  • You can see the open source license: LICENSE.
  • It has several thousand stars, well over our 100-star benchmark.
  • At the top of the file list, you can see the most recent commit, which should be fairly recent. While writing this, the last commit was made yesterday, indicating the project is being actively maintained.

Based on these points, as long as you are familiar with TypeScript, this is a good repository to contribute to. However, you don’t need to be familiar with TypeScript to continue following along in the demo.

Now you want to create a fork of the repository. A fork is a copy of the repository that we can freely experiment on and make changes in without affecting the original project. We usually use forks for open source contributions. If you need a refresher on forking a repository, check out this previous GitHub for Beginners blog.

  1. Navigate to the home page of the project if you are not already there.
  2. At the top of the project, click the Fork button.
  3. In the new window, leave yourself as the owner and make sure the “Repository name” is the same as the original repository (i.e., “gitfolio”).
  4. At the bottom of the window, select Create fork.
  5. In your forked copy of the repository, click README.md in the list of files.
  6. Change the file by adding some text.
  7. In the top-right, select Commit changes…
  8. Make sure to select the option at the bottom for Create a new branch from this commit and start a pull request.
  9. Select Propose changes.
  10. On the following window, click the Create pull request button. This will let you create a pull request to the main repository from your branch with the changes.
  11. At the top of the “Open a pull request” window, select compare across forks. This will show your fork’s changes compared to the original repository.
  12. If you were submitting an actual change to the repository—not just walking through a demo—this is where you would give your pull request a title and a description. You’d also want to provide a link to the issue that you were solving in the description of the pull request.

At this point, you’d be ready to submit your pull request by clicking the button at the bottom of the window. Once you do that, however, it is no longer just a change in your fork; it becomes a requested update to the original repository. That’s why it isn’t included in the steps above. When you do submit your pull request, it will be available and ready for a maintainer to review and, hopefully, approve!

Once your pull request is approved and merged, the changes from your fork become part of the main branch of the original repository, the official source of truth for the codebase.
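
If you ever want to script this flow, the GitHub REST API exposes the same fork and pull request operations. Here’s a sketch, assuming a personal access token in the GITHUB_TOKEN environment variable and a branch you’ve already pushed to your fork; the owner, branch names, and issue number are placeholders:

    import os
    import requests

    token = os.environ["GITHUB_TOKEN"]  # personal access token with repo scope
    headers = {
        "Authorization": f"Bearer {token}",
        "Accept": "application/vnd.github+json",
    }
    api = "https://api.github.com"
    owner, repo = "upstream-owner", "gitfolio"  # placeholder upstream repository

    # Step 1: fork the upstream repository into your account.
    requests.post(f"{api}/repos/{owner}/{repo}/forks", headers=headers, timeout=30)

    # Step 2: after committing and pushing a branch to your fork, open a
    # pull request against the upstream default branch.
    pr = requests.post(
        f"{api}/repos/{owner}/{repo}/pulls",
        headers=headers,
        timeout=30,
        json={
            "title": "Fix typo in README",
            "head": "your-username:my-fix-branch",  # source branch on your fork
            "base": "main",                         # upstream default branch
            "body": "Closes #123",                  # link the issue you solved
        },
    )
    pr.raise_for_status()
    print(pr.json()["html_url"])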

What’s next?

Congratulations! You’ve learned how to make your own contributions to open source software. I hope it inspires you to contribute to your favorite projects.

And if you’re looking for more information, we have lots of documentation that can help. Here are a few links to get you started:

Happy coding!

The post GitHub for Beginners: Getting started with OSS contributions appeared first on The GitHub Blog.


Students Boo Commencement Speaker After She Calls AI the 'Next Industrial Revolution'

An anonymous reader quotes a report from 404 Media: Speaking to graduates of University of Central Florida’s College of Arts and Humanities and Nicholson School of Communication and Media on May 8, commencement speaker Gloria Caulfield, vice president of strategic alliances at Tavistock Group, told graduating humanities students that AI is the “next industrial revolution,” and was met with thousands of booing graduates.

“And let’s face it, change can be daunting. The rise of artificial intelligence is the next industrial revolution,” Caulfield said. At that point, murmurs rippled through the crowd. Caulfield paused, and the crowd erupted into boos. “Oh, what happened?” Caulfield said, turning around with her hands out. “Okay, I struck a chord. May I finish?” Someone in the crowd yelled, “AI SUCKS!” Her speech begins around the hour and 15 minute mark in the UCF livestream. [...]

Before the industrial revolution comment, Caulfield praised Jeff Bezos for his passion and use of Amazon as a “stepping stone” to his real dream: spaceflight. Rattled after the crowd’s reaction, she continued her speech: “Only a few years ago, AI was not a factor in our lives.” The crowd cheered. “Okay. We’ve got a bipolar topic here I see,” Caulfield said. “And now AI capabilities are in the palm of our hands.” The crowd booed again. “I love it, passion, let’s go,” she said.

“AI is beginning to challenge all major sectors to find their highest and best use,” she continued. “Okay, I don’t want any giggles when I say this. We have been through this before, these industrial revolutions. In my graduation era, we were faced with the launch of the internet.” She goes on to talk about how cellphones used to be the size of briefcases. “At that time we had no idea how any of these technologies would impact the world and our lives. [...] These were some of the same trepidations and concerns we are now facing. But ultimately it was a game changer for global economic development and the proliferation of new businesses that never existed like Apple and Google and Meta and so many others, and not to mention countless job opportunities. So being an optimist here, AI alongside human intelligence has the potential to help us solve some of humanity’s greatest problems. Many of you in this graduating class will play a role in making this happen.”

Read more of this story at Slashdot.
