
Help Shape the Future of Microsoft 365: Join Research Sessions at Microsoft 365 Community Conference


One of the best ways to get value from the Microsoft 365 Community Conference isn’t just attending sessions; it’s helping shape what gets built next.

The Research sessions at Microsoft 365 Community Conference give attendees a rare, behind-the-scenes chance to collaborate directly with the Microsoft teams designing the products millions of people use every day. These sessions are designed for customers, practitioners, and IT professionals who want their real-world experiences to influence what’s coming next across Microsoft 365.

What to Expect

These aren’t traditional lectures or breakouts. Research sessions are small-group, interactive conversations designed to bring your voice into the product design process.

In a typical session, you can expect to:

  • Spend dedicated time with Microsoft researchers (and sometimes partner team members)
  • Discuss scenarios drawn from real customer workflows
  • Review early ideas or prototypes and share what would (or wouldn’t) work in your environment
  • Offer candid feedback to help prioritize and refine future experiences

Topics That Matter to You

The Research sessions focus on some of the most widely used and fast-evolving Microsoft 365 experiences, including:

  • SharePoint
  • OneDrive
  • Microsoft Teams
  • Microsoft Planner
  • Microsoft Lists
  • Viva Engage

Whether you’re managing these tools at scale, supporting adoption, or building solutions on top of them, your perspective helps teams understand what’s working today and what needs to change next.

How to Participate

Research sessions are invite-only to keep the groups small and ensure everyone has time to contribute.

Here’s how it works:

  • Registered conference attendees will receive an email invitation to sign up for Research sessions.
  • The invitation includes short descriptions of each session and a brief signup form so you can indicate which topics you’re interested in.
  • Participants are required to complete a consent form and a non-disclosure agreement. Participation in Research sessions is optional, and completing the signup form does not affect your ability to attend the conference if you choose not to participate.
  • Spots are filled first-come, first-served, so early registration is encouraged.

Be Part of What’s Next

The Microsoft 365 Community Conference is all about learning, connection, and community—and the Research sessions take that mission one step further. This is your opportunity to go beyond listening and actively contribute to the evolution of Microsoft 365.

If you’ve ever wanted to influence the tools you rely on every day, the Research sessions are one of the most meaningful ways to do it. Keep an eye out for the invitation, choose a session that matches your interests, and bring your real-world experience to the table.

--

Attendees of the research sessions must be registered attendees of the Microsoft 365 Community Conference.


Introducing the 2026 Imagine Cup Top Launch Startup


Early momentum. Clear direction. 

The Launch path highlights student founders who are at an earlier stage but already showing strong signals in how they are approaching what they are building. 

L-Guard Ltd. stood out for how clearly the problem was defined, how intentionally the solution is taking shape, and the direction it is heading next. 

As the Top Launch Startup, L-Guard Ltd. receives $50,000 USD along with continued visibility and support from Microsoft as they move their solution forward. 

Meet the startup 

L-Guard Ltd.: AI-powered road safety, built for real-time response 

Rwanda 

 

L-Guard Ltd. is addressing a critical gap in road safety across Africa, where many accident victims lose their lives not from the crash itself, but from delayed emergency response. 

The startup has built an AI- and IoT-powered system that monitors vehicle activity, detects crashes in real time, and automatically alerts nearby hospitals and emergency responders. By combining sensor data with machine learning models on Azure, L-Guard transforms real-time vehicle signals into actionable emergency intelligence. 

This shifts road safety from reactive response to proactive intervention, issuing risky driving warnings, detecting incidents as they happen, and ensuring that help is activated as quickly as possible, even in low-connectivity environments. 

As the startup continues to move from pilot validation toward broader deployment, the focus is on strengthening reliability, expanding partnerships, and scaling across high-risk transport markets. By making timely rescue the standard, L-Guard is working to reduce preventable fatalities and bring more accountability to emergency response systems. 

 

Helen Ugoeze Okereke – Growing up in Ebonyi State, Nigeria, Helen set out to become what she called a “computer wizard,” focused on building real solutions with technology. Today, she leads L-Guard’s vision and strategy, driven by a mission to use technology to save lives. 

Ramadhani Wanjenja – With a background in embedded systems and intelligent hardware, Ramadhani leads the technical architecture of L-Guard. His personal experience surviving a motorcycle accident shaped the direction of the solution and its focus on immediate response. 

Terry Manzi – Raised in Kigali, Terry brings a systems and operations mindset, leading software-hardware integration, deployment, and partnerships to ensure L-Guard works effectively in real environments. 

Erioluwa Olowoyo – With a focus on product design and user experience, Erioluwa ensures L-Guard remains intuitive and accessible. His path into technology was self-driven, shaped by a commitment to building solutions that work for real users in real contexts. 

What this represents 

The Top Launch startup reflects what it means to build with intention from the start. 

This is not about having everything finished. It is about identifying a real problem, building toward a solution, and continuing to move forward with clarity and purpose. 

As L-Guard Ltd. continues to develop, their work highlights the impact student founders can have when they combine technical skill with lived experience and a clear mission. 

 

Partner tools behind the build 

Alongside mentorship and community, Imagine Cup startups gain access to tools that support how their solutions continue to take shape. 

Through GitHub Education, teams use the Student Developer Pack, collaborate with AI-assisted coding through Copilot, and build on a platform used by developers around the world. 

With Replit, teams build, test, and deploy using natural language in an AI-powered environment designed for rapid iteration. 

Together, these tools give startups the flexibility and support to keep moving forward as they scale their solutions. 


Evaluating Netflix Show Synopses with LLM-as-a-Judge


by Gabriela Alessio, Cameron Taylor, and Cameron R. Wolfe

Introduction

When members log into Netflix, one of the hardest choices is what to watch. The challenge isn’t a lack of options — there are thousands of titles — but finding the most intriguing one is complex and deeply personal. To help, we surface personalized promotional assets, especially the show synopsis — a brief description highlighting key plot elements, with cues like genre or talent.

Strong synopses help members scan, understand, and choose. Poor synopses frustrate, mislead, and drive abandonment. Ensuring high-quality synopses is essential, but scaling quality validation is hard. We host hundreds of thousands of synopses, usually with multiple variants per show, and we need to ensure quality at scale so every member gets a consistently great experience every time they read a synopsis. Automating quality validation lets us maintain high-quality synopsis coverage for our rapidly expanding catalog, enabling greater speed and coverage without sacrificing quality.

This report outlines our LLM-based approach for evaluating synopsis quality. Using recent advances in agents, reasoning, and LLM-as-a-Judge, we score four key synopsis quality dimensions, achieving 85%+ agreement with creative writers. Additionally, we show that higher LLM judge quality is correlated with key streaming metrics, allowing us to proactively identify and fix impactful issues weeks or months before a show debuts on Netflix.

The Making of a “Good” Synopsis

Writing high-quality synopses requires creative expertise. Our expert creative leads are best positioned to craft the creative approaches and define quality standards. However, AI can help us consistently evaluate these expert-driven quality criteria at scale. Synopsis quality at Netflix, which our system aims to predict, is viewed along two dimensions:

  1. Creative Quality: members of our creative writing team assess synopsis quality according to our internal writing guidelines and rubrics.
  2. Member Implicit Feedback: we measure the relative impact of a particular show synopsis on core streaming metrics.

These two definitions capture distinct and important aspects of quality, one focused on creative excellence and the other on utility to members.

Creative Quality

For this project, we evaluate synopses against a subset of our creative writing quality rubric — the same criteria to which human writers would adhere. These quality rubrics change over time, and more details on the current quality standards can be found in our Editorial Style Guide and Technical Style Guide. Given Netflix’s distinctive voice and elevated editorial standards, the quality bar is high. Each criterion has extensive guidelines with examples across regions, genres, and synopsis types.

Human evaluation. We began by partnering with a group of creative writing experts to iteratively refine our definition of creative quality. We initially labeled ~1,000 diverse synopses, where three expert writers scored each against the criteria and explained their ratings. Due to the subjectivity of the task, early instance-level agreement was low. To reach a better consensus, we conducted calibration rounds (~50 synopses per round), surfaced disagreements, and evolved our quality scoring guidelines. Key interventions that were found to improve agreement include:

  • Using binary scores (instead of 1–4 Likert scores).
  • Allowing writers to reference past examples.
  • Maintaining a searchable taxonomy of common errors.

Golden evaluation data. After eight calibration rounds, writer agreement reached ~80%. To further stabilize labels, we used a model-in-the-loop consensus where:

  • Multiple writers score each synopsis.
  • An LLM, guided by the rubric, aggregates to a final label.
  • Writers review cases with substantial disagreement.

The result is a golden set of ~600 synopses with binary, criteria-level scores and explanations — our North Star for aligning an LLM judge with expert opinion.

Member Implicit Feedback

Netflix gauges implicit member feedback on a synopsis with two metrics:

  1. Take Fraction: how often members who see a title’s synopsis choose to start watching it.
  2. Abandonment Rate: how often members start a title but stop watching soon after.

A higher take fraction indicates that a synopsis is more effective at prompting members to choose a title, while a lower abandonment rate suggests an authentic, non-misleading presentation. Both of these metrics have been validated via A/B testing to serve as short-term behavioral proxies for long-term member retention. As part of evaluating our system, we also study the ability of LLM-derived quality scores to predict short-term engagement metrics. This step confirms that our scores capture behaviorally meaningful signals and assesses our ability to forecast member response to a given synopsis.

Scaling Quality Scoring with LLM-as-a-Judge

We begin our experiments by creating simple, per-criteria prompts that:

  1. Supply criterion-specific show metadata.
  2. Summarize the relevant quality guidelines.
  3. Use zero-shot chain-of-thought prompting to elicit an explanation.
  4. Request a binary decision for the synopsis.
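
As a rough illustration of this recipe, a single-criterion judge call might look like the sketch below. The prompt wording, the llm callable, and the PASS/FAIL convention are assumptions for illustration, not Netflix's actual prompts or infrastructure.

from dataclasses import dataclass

@dataclass
class JudgeResult:
    rationale: str   # explanation the judge produces before its score
    passes: bool     # binary decision for this criterion

def build_prompt(criterion: str, guidelines: str, metadata: str, synopsis: str) -> str:
    # Criterion-specific metadata and a summary of the quality guidelines go into the prompt.
    return (
        f"You are evaluating a show synopsis for the '{criterion}' criterion.\n"
        f"Quality guidelines (summary):\n{guidelines}\n\n"
        f"Show metadata relevant to this criterion:\n{metadata}\n\n"
        f"Synopsis:\n{synopsis}\n\n"
        "Think step by step and explain your reasoning, then answer on a final line "
        "with exactly PASS or FAIL."
    )

def judge(llm, criterion: str, guidelines: str, metadata: str, synopsis: str) -> JudgeResult:
    # llm is any callable that maps a prompt string to a completion string.
    text = llm(build_prompt(criterion, guidelines, metadata, synopsis))
    rationale, _, verdict = text.rpartition("\n")
    return JudgeResult(rationale=rationale.strip(), passes=verdict.strip() == "PASS")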

Using a single prompt to evaluate all quality criteria is found to overload the LLM and yields poor performance — dedicated judges for each criterion perform better. Because each criterion is unique, each task has its own setup, but there are some shared components:

  • We use the same LLM for all criteria.
  • The judge always outputs an explanation before its final score.
  • Final scores are binary.

Due to our use of binary scoring, judges can be evaluated with simple accuracy metrics over the golden dataset. Next, we summarize the experiments that led to our final system.

Prompt optimization. Because LLMs are sensitive to prompt phrasing, we apply Automatic Prompt Optimization (APO) over a ~300-sample dev set. Scoring guidelines are provided as additional context to the prompt optimizer. After APO, we manually refine candidate prompts with the help of an LLM, yielding initial prompts with accuracies shown below. These prompts work well for some criteria (e.g., precision) but poorly for others (e.g., clarity), highlighting criterion-specific nuances.

Improved reasoning. Many failures of our initial system arise due to a lack of accurate reasoning through highly-subjective evaluation examples. To improve reasoning accuracy, we leverage two forms of inference-time scaling:

  • Longer rationales: increase the length of the rationale or explanation generated by the LLM prior to producing a final score.
  • Consensus scoring: sample several outputs from the LLM and aggregate their scores to produce the final result.

Tiered rationales. Using tone as an example, we tested whether longer rationales are helpful by defining three rationale length tiers (shown above) and comparing their accuracies. Accuracy rises with longer rationales but returns are diminishing. Medium rationales noticeably outperform short ones, while long rationales offer only a slight additional gain; see below.

Longer rationales improve performance but degrade human-readability, which is problematic given that explanations are key pieces of evidence for creative experts. As a solution, we adopt tiered rationales: the judge reasons at any length but concisely summarizes its reasoning process prior to the final score. Tiered rationales preserve the benefits of extended reasoning, make outputs easier to inspect, and even benefit scoring accuracy. For example, our tone evaluator improves from 86.55% to 87.85% binary accuracy when using tiered rationales.

Consensus scoring. We can also allocate more inference-time compute by sampling multiple outputs per synopsis and aggregating their scores. We aggregate via a rounded average to ensure that the final score remains binary. For tone and clarity criteria with tiered rationales, 5× consensus scoring yields a clear accuracy boost as shown below.
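
A minimal sketch of that aggregation, assuming a judge_once helper that parses one sampled PASS/FAIL output into 0 or 1 (hypothetical names, not the production code):

def judge_once(llm, prompt: str) -> int:
    # Sample one output and parse its final verdict into a binary score.
    return 1 if llm(prompt, temperature=1.0).strip().endswith("PASS") else 0

def consensus_score(llm, prompt: str, n: int = 5) -> int:
    # Sample n outputs and aggregate with a rounded average so the result stays binary.
    scores = [judge_once(llm, prompt) for _ in range(n)]
    return round(sum(scores) / n)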

Consensus scoring on the precision evaluator, which uses a vanilla (short) chain-of-thought, yields no benefit. As an explanation, we notice that longer rationales increase variance in scores across multiple outputs, while short rationales yield consistent scores. Consensus may be most useful for evaluators with longer rationales, where it helps to stabilize score variance. When shorter rationales are used, all scores tend to be the same, making consensus less meaningful.

What about reasoning models? While our setup elicits reasoning from a standard LLM, we also explored quality scoring with true reasoning models (i.e., models that generate long reasoning trajectories prior to final output). For tone, using a reasoning model with 5× consensus yields improving accuracy with increasing reasoning effort, even outperforming tiered rationales at the highest reasoning effort; see below. However, we skip reasoning models in our final system, as they significantly increase inference costs for only a marginal performance gain.

Agents-as-a-Judge for factuality. Synopses have four common types of factuality errors:

  1. Incorrect plot information.
  2. Incorrect metadata (e.g., genre, location, release date).
  3. Incorrect on- or off-screen talent.
  4. Incorrect award information.

Detecting these factuality errors requires comparing the synopsis to ground-truth context, where the necessary context varies per criterion. For example, plot information requires a plot summary or script, while award information needs a list of awards. As we have learned, simplicity drives reliability: supplying too much context or too many criteria at once harms accuracy. Motivated by this idea, we adopt factuality agents, where each agent evaluates one narrow aspect of factuality.

An agent receives context tailored to one facet of factuality and produces both a rationale and a binary factuality score. The final score of the Agents-as-a-Judge system is the minimum factuality score across agents — any failed aspect yields an overall fail. All rationales are fed to an LLM aggregator to produce a combined rationale to accompany the final score. As shown below, leveraging factuality agents significantly benefits scoring accuracy. Further benefits are achieved by using tiered rationales and consensus scoring within each agent.
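
A minimal sketch of that aggregation logic, with the agents and the rationale aggregator passed in as callables (hypothetical structure; the agents' internals and the aggregator prompt are assumptions):

def factuality_score(agents, contexts, aggregator, synopsis):
    # Each agent checks one narrow aspect of factuality (plot, metadata, talent, awards)
    # against its own ground-truth context and returns (rationale, score) with score in {0, 1}.
    results = [agent(synopsis, context) for agent, context in zip(agents, contexts)]
    rationales = [rationale for rationale, _ in results]
    scores = [score for _, score in results]
    final_score = min(scores)                       # any failed aspect fails the synopsis overall
    combined_rationale = aggregator(rationales)     # e.g. an LLM call that merges the rationales
    return combined_rationale, final_score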

Final system. In summary, our automatic evaluation system uses a combination of standard LLM-as-a-Judge, tiered rationales, consensus scoring, and Agents-as-a-Judge to maximize binary scoring accuracy for each criterion. A summary of the techniques used for each criterion and the associated binary scoring accuracy is provided below.

Member Validation of LLM-as-a-Judge

Beyond expert agreement, we also study how LLM-as-a-Judge scores relate to member behavior. This analysis serves two goals:

  • Further validating LLM-judge accuracy.
  • Linking creative quality to member-perceived quality.

Framed as predictors of member outcomes, LLM judges help us assess how promotional assets affect viewing and determine which creative attributes matter most in helping members discover content they enjoy. To perform this analysis, we take advantage of the fact that most shows have multiple, personalized synopses (i.e., a synopsis “suite”). Using this suite, we can measure the causal effect of synopsis selection on metrics like take fraction and abandonment rate.

Our methodology. We correlate synopsis performance (take fraction or abandonment) with LLM quality scores. Specifically, within each show s, we relate changes in a synopsis’s LLM score to changes in its performance, normalizing by the show-level standard deviation and clustering standard errors by show; see below.
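
The post does not reproduce the exact specification, but a minimal sketch of this kind of within-show regression is:

y_{s,i} = \alpha_s + \beta \, \frac{q_{s,i} - \bar{q}_s}{\sigma_s} + \varepsilon_{s,i}

where y_{s,i} is the take fraction (or abandonment rate) of synopsis i for show s, q_{s,i} is its LLM quality score, \bar{q}_s and \sigma_s are the show-level mean and standard deviation of that score, \alpha_s is a show fixed effect, and standard errors are clustered by show.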

β captures the average association between within-show changes in LLM score and changes in performance. While we don’t have clean, experimental variation in LLM scores, this analysis still validates predictive value and practical utility.

Member-focused results. We report correlations for individual LLM criteria and a “Weighted Score” that combines all criteria to reduce noise and maximize signal from behavioral data. As shown below, results show promising prediction of take fraction and abandonment. Precision and clarity are especially predictive, and the weighted score provides a statistically useful signal of higher take and lower abandonment. In short, LLM evaluators capture factors that matter to members, making them a valuable tool for monitoring synopsis quality and engagement.

Closing Remarks

The LLM-as-a-Judge system used to evaluate show synopses at Netflix is the result of extensive experimentation grounded in both creative expertise and member outcomes. Building an automatic evaluation system that works reliably in practice is hard, and the approach we have described reflects countless lessons learned through iteration to improve accuracy and scalability. We have validated the system extensively with human evaluation at both the system and component levels, and we have shown that its outputs correlate with key streaming metrics. As a result, we are confident that it captures the dimensions of synopsis quality that matter most — both creatively and from the member perspective — which has driven its widespread adoption in the Netflix synopsis authoring workflow.


Evaluating Netflix Show Synopses with LLM-as-a-Judge was originally published in Netflix TechBlog on Medium, where people are continuing the conversation by highlighting and responding to this story.


mist is now open source and looking for interop


A brief update on mist, my ephemeral Markdown editor with Google Docs-style comments and suggested edits:

mist is now open source with an MIT license, and the mist repo is here on GitHub.

(Try mist now and here’s my write-up from February.)

What I love about Markdown is that it’s document-first. The formatting travels with the doc. I can’t tell you how many note-taking apps I’ve jumped between with my exact same folder of Markdown notes.

The same should be true for collaboration features like suggested edits. If somebody makes an edit to your doc, you should be able to download it and upload to a wholly different app before you accept the edit; you shouldn’t be tied to a single service just because you want comments.

(And of course the doc should still be human-readable/writeable, and it’s cheating to just stuff a massive data-structure in a document header.)

So mist mixes Markdown and CriticMarkup – and I would love it if others picked up the same format. If apps are cheap and abundant in the era of vibing, then let’s focus on interop!
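
For a sense of what that looks like in a raw .md file, CriticMarkup keeps edits and comments inline in the text itself. The snippet below is an illustrative example using the standard CriticMarkup syntax, not something taken from mist's docs:

We launch on {~~Tuesday~>Thursday~~}, pending the {--final--} review.{>>Is Thursday locked in?<<}
{++Add a note about the beta signup here.++}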

With mist itself:

Several people have asked for the ability to self-host it. The README says how (it’s all on Cloudflare naturally). You can add new features to your own fork, though please do share upstream if you think others could benefit.

And yes, contributions welcome! We’ve already received and merged our first pull request – thank you James Adam!


No, a document editor is not what we’re building at Inanimate. But it’s neat to release small useful projects that get made along the way. btw subscribe to our newsletter.


More posts tagged: inanimate (4).


MySQL 9.7.0 vs sysbench on a small server


This has results from sysbench on a small server with MySQL 9.7.0 and 8.4.8. Sysbench is run with low concurrency (1 thread) and a cached database. The purpose is to search for changes in performance, often from new CPU overheads.

I tested MySQL 9.7.0 with and without the hypergraph optimizer enabled. I don't expect it to help much because the queries run here are simple. I hope to learn it doesn't hurt performance in that case.

tl;dr

  • Throughput improves on two tests with the Hypergraph optimizer in 9.7.0 because they get better query plans.
  • One read-only test and several write-heavy tests have small regressions from 8.4.8 to 9.7.0. This might be from new CPU overheads but I don't see obvious problems in the flamegraphs. 

Builds, configuration and hardware

I compiled MySQL from source for versions 8.4.8 and 9.7.0.

The server is an ASUS ExpertCenter PN53 with AMD Ryzen 7 7735HS, 32G RAM and an m.2 device for the database. More details on it are here. The OS is Ubuntu 24.04 and the database filesystem is ext4 with discard enabled.

The my.cnf file for 8.4 is here. I call this the z12a config, and variants of it are used for MySQL 5.6 through 8.4.

For 9.7 I use two configs: z13a with the hypergraph optimizer disabled and z13b with the hypergraph optimizer enabled.

All DBMS versions use the latin1 character set as explained here.

Benchmark

I used sysbench and my usage is explained here. To save time I only run 32 of the 42 microbenchmarks and most test only 1 type of SQL statement. Benchmarks are run with the database cached by InnoDB.

The tests are run using 1 table with 50M rows. The read-heavy microbenchmarks run for 600 seconds and the write-heavy for 1800 seconds.

Results

The microbenchmarks are split into 4 groups -- 1 for point queries, 2 for range queries, 1 for writes. For the range query microbenchmarks, part 1 has queries that don't do aggregation while part 2 has queries that do aggregation. 

I provide tables below with relative QPS. When the relative QPS is > 1 then some version is faster than the base version. When it is < 1 then there might be a regression.  The relative QPS (rQPS) is:
(QPS for some version) / (QPS for MySQL 8.4.8) 
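
A tiny worked example of that arithmetic (the numbers are invented for illustration, not measured values):

base_qps = 4200.0                 # QPS for MySQL 8.4.8 on some microbenchmark
test_qps = 3990.0                 # QPS for MySQL 9.7.0 on the same microbenchmark
rqps = test_qps / base_qps        # 0.95 -> below 1, so a possible regression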

Results: point queries

I describe performance changes (changes to relative QPS, rQPS) in terms of basis points. Performance changes by one basis point when the difference in rQPS is 0.01. When rQPS decreases from 0.95 to 0.85 then it changed by 10 basis points.

This shows the rQPS for MySQL 9.7.0 using both the z13a and z13b configs. It is relative to the throughput from MySQL 8.4.8.
  • Throughput with MySQL 9.7.0 is similar to 8.4.8 except for point-query where there are regressions as rQPS drops by 5 and 7 basis points. The point-query test uses simple queries that fetch one column from one row by PK. From vmstat metrics the CPU overhead per query for 9.7.0 is ~8% larger than for 8.4.8, with and without the hypergraph optimizer. I don't see anything obvious in the flamegraphs.
z13a    z13b
0.99    1.01    hot-points
0.95    0.93    point-query
0.99    1.01    points-covered-pk
1.00    1.01    points-covered-si
0.98    1.00    points-notcovered-pk
0.99    1.01    points-notcovered-si
1.00    1.02    random-points_range=1000
0.99    1.01    random-points_range=100
0.96    1.00    random-points_range=10

Results: range queries without aggregation

I describe performance changes (changes to relative QPS, rQPS) in terms of basis points. When rQPS decreases from 0.95 to 0.85 then it changed by 10 basis points.

This shows the rQPS for MySQL 9.7.0 using both the z13a and z13b configs. It is relative to the throughput from MySQL 8.4.8.
  • Throughput with MySQL 9.7.0 is similar to 8.4.8. I am skeptical there is a regression for the scan test with the z13b config. I suspect that is noise.
z13a    z13b
0.99    0.99    range-covered-pk
0.99    0.99    range-covered-si
0.99    0.99    range-notcovered-pk
0.98    0.98    range-notcovered-si
1.00    0.96    scan

Results: range queries with aggregation

I describe performance changes (changes to relative QPS, rQPS) in terms of basis points. When rQPS decreases from 0.95 to 0.85 then it changed by 10 basis points.

This shows the rQPS for MySQL 9.7.0 using both the z13a and z13b configs. It is relative to the throughput from MySQL 8.4.8.
  • There might be small regressions in several tests with rQPS dropping by a few points but I will ignore that for now.
  • There is a large improvement for the read-only-distinct test with the z13b config. The query for this test is select distinct c from sbtest where id between ? and ? order by c. The reason for the performance improvement is that the hypergraph optimizer chooses a better plan, see here.
  • There is a large improvement for the read-only test with range=10000. This test uses the read-only version of the classic sysbench transaction (see here). One of the queries it runs is the query used by read-only-distinct. So it benefits from the better plan for that query. 
z13a    z13b
0.97    0.97    read-only-count
0.98    1.26    read-only-distinct
0.96    0.95    read-only-order
0.99    1.15    read-only_range=10000
0.97    1.00    read-only_range=100
0.96    0.97    read-only_range=10
0.99    0.99    read-only-simple
0.97    0.96    read-only-sum

Results: writes

I describe performance changes (changes to relative QPS, rQPS) in terms of basis points. When rQPS decreases from 0.95 to 0.85 then it changed by 10 basis points.

This shows the rQPS for MySQL 9.7.0 using both the z13a and z13b configs. It is relative to the throughput from MySQL 8.4.8.
  • There might be several small regressions here. I don't see obvious problems in the flamegraphs.
z13a    z13b
0.95    0.92    delete
1.00    1.01    insert
0.97    0.98    read-write_range=100
0.96    0.95    read-write_range=10
0.97    0.96    update-index
0.97    0.92    update-inlist
0.95    0.93    update-nonindex
0.95    0.92    update-one
0.95    0.93    update-zipf
0.97    0.95    write-only

Anthropic’s Claude Mythos Problem, Dark DNA Unveiled, Pitfalls for Assistive Models, Simulating Fluid Dynamics

The Batch AI News and Insights: As AI agents accelerate coding, what is the future of software engineering?