Sr. Content Developer at Microsoft, working remotely in PA, TechBash conference organizer, former Microsoft MVP, Husband, Dad and Geek.

0.0.415-1

1 Share

Added

  • Add show_file tool for presenting code and diffs to the user
  • Add env loading indicator showing skills, MCPs, plugins, ... being loaded

Improved

  • /mcp show groups servers into User, Workspace, Plugins, and Built-in sections and makes all servers navigable
  • Agent now knows which model is powering it when asked
  • Ctrl+A/E cycle through visual lines in wrapped input; Home/End navigate within a visual line; Ctrl+Home/End jump to text boundaries

Fixed

  • MCP tool results with giant single lines are truncated correctly
  • /plugin marketplace add and /plugin install support local paths containing spaces
Read the whole story
alvinashcraft
10 minutes ago
reply
Pennsylvania, USA
Share this story
Delete

MediaFM: The Multimodal AI Foundation for Media Understanding at Netflix


Avneesh Saluja, Santiago Castro, Bowei Yan, Ashish Rastogi

Introduction

Netflix’s core mission is to connect millions of members around the world with stories they’ll love. This requires not just an incredible catalog, but also a deep, machine-level understanding of every piece of content in that catalog, from the biggest blockbusters to the most niche documentaries. As we onboard new types of content such as live events and podcasts, the need to scalably understand these nuances becomes even more critical to our productions and member-facing experiences.

Many of these media-related tasks require sophisticated long-form video understanding, e.g., identifying subtle narrative dependencies and emotional arcs that span entire episodes or films. Previous work has found that to truly grasp the content’s essence, our models must leverage the full multimodal signal. For example, the audio soundtrack is a crucial, non-visual modality that can help more precisely identify clip-level tones or when a new scene starts. Can we use our collection of shows and movies to learn how to a) fuse modalities like audio, video, and subtitle text together and b) develop robust representations that leverage the narrative structure present in long-form entertainment? Consisting of tens of millions of individual shots across multiple titles, our diverse yet entertainment-specific dataset provides the perfect foundation to train multimodal media understanding models that enable many capabilities across the company such as ads relevancy, clip popularity prediction, and clip tagging.

For these reasons, we developed the Netflix Media Foundational Model (MediaFM), our new, in-house, multimodal content embedding model. MediaFM is the first tri-modal (audio, video, text) model pretrained on portions of the Netflix catalog. Its core is a multimodal, Transformer-based encoder designed to generate rich, contextual embeddings¹ for shots from our catalog by learning the temporal relationships between them through integrating visual, audio, and textual information. The resulting shot-level embeddings are powerful representations designed to create a deeper, more nuanced, and machine-readable understanding of our content, providing the critical backbone for effective cold start of newly launching titles in recommendations, optimized promotional assets (like art and trailers), and internal content analysis tools.

Figure 1: MediaFM Architecture

Input Representation & Preprocessing

The model’s fundamental unit of input is a shot, derived by segmenting a movie or episode (collectively referred to as “title”) using a shot boundary detection algorithm. For each shot, we generate three distinct embeddings from its core modalities:

  • Video: an internal model called SeqCLIP (a CLIP-style model fine-tuned on video retrieval datasets) is used to embed frames sampled at uniform intervals from segmented shots
  • Audio: the audio samples from the same shots are embedded using Meta FAIR’s wav2vec2
  • Timed Text: OpenAI’s text-embedding-3-large model is used to encode the corresponding timed text (e.g., closed captions, audio descriptions, or subtitles) for each shot

For each shot, the three embeddings² are concatenated and unit-normed to form a single 2304-dimensional fused embedding vector. The transformer encoder is trained on sequences of shots, so each example in our dataset is a temporally-ordered sequence of these fused embeddings from the same movie or episode (up to 512 shots per sequence). We also have access to title-level metadata which is used to provide global context for each sequence (via the [GLOBAL] token). The title-level embedding is computed by passing title-level metadata (such as synopses and tags) through the text-embedding-3-large model.
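As a concrete sketch of the fusion step, the concatenate-and-normalize preprocessing can be written as follows. Note that the even 768/768/768 split across modalities is an illustrative assumption — only the 2304-dimensional total is stated above:

```python
import numpy as np

# Per-modality dimensions are assumed for illustration; the post only
# specifies that the fused vector is 2304-dimensional.
D_VIDEO, D_AUDIO, D_TEXT = 768, 768, 768

def fuse_shot(video_emb, audio_emb, text_emb):
    """Concatenate the three modality embeddings and unit-normalize."""
    fused = np.concatenate([video_emb, audio_emb, text_emb])
    return fused / np.linalg.norm(fused)

rng = np.random.default_rng(0)
shot = fuse_shot(rng.normal(size=D_VIDEO),
                 rng.normal(size=D_AUDIO),
                 rng.normal(size=D_TEXT))
print(shot.shape)            # (2304,)
print(np.linalg.norm(shot))  # ~1.0
```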

Model Architecture and Training Objective

The core of our model is a transformer encoder, architecturally similar to BERT. A sequence of preprocessed shot embeddings is passed through the following stages:

  1. Input Projection: The fused shot embeddings are first projected down to the model’s hidden dimension via a linear layer.
  2. Sequence Construction & Special Tokens: Before entering the Transformer, two special embeddings are prepended to the sequence:
    • a learnable [CLS] embedding is added at the very beginning.
    • the title-level embedding is projected to the model’s hidden dimension and inserted after the [CLS] token as the [GLOBAL] token, providing title-level context to every shot in the sequence and participating in the self-attention process.
  3. Contextualization: The sequence is enhanced with positional embeddings and fed through the Transformer stack to provide shot representations based on their surrounding context.
  4. Output Projection: The contextualized hidden states from the Transformer are passed through a final linear layer, projecting them from the hidden layers back up to the 2304-dimensional fused embedding space for prediction.
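At toy scale, the four stages above can be sketched in NumPy, with random stand-ins for the learned parameters and a single self-attention block standing in for the full Transformer stack. All dimensions except the 2304-d fused size are illustrative:

```python
import numpy as np

rng = np.random.default_rng(0)
D_FUSED, D_HIDDEN, N_SHOTS = 2304, 64, 8   # hidden size shrunk for the toy

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

# Random stand-ins for learned parameters
W_in  = rng.normal(size=(D_FUSED, D_HIDDEN)) * 0.02   # input projection
W_glb = rng.normal(size=(D_FUSED, D_HIDDEN)) * 0.02   # title-embedding projection
W_out = rng.normal(size=(D_HIDDEN, D_FUSED)) * 0.02   # output projection
cls   = rng.normal(size=(1, D_HIDDEN))                # learnable [CLS]
pos   = rng.normal(size=(N_SHOTS + 2, D_HIDDEN)) * 0.02
Wq, Wk, Wv = (rng.normal(size=(D_HIDDEN, D_HIDDEN)) * 0.02 for _ in range(3))

shots = rng.normal(size=(N_SHOTS, D_FUSED))   # fused shot embeddings
title = rng.normal(size=(D_FUSED,))           # title-level metadata embedding

h = shots @ W_in                                     # 1. input projection
h = np.vstack([cls, (title @ W_glb)[None, :], h])    # 2. prepend [CLS], [GLOBAL]
h = h + pos                                          # 3. positions...
attn = softmax((h @ Wq) @ (h @ Wk).T / np.sqrt(D_HIDDEN))
h = h + attn @ (h @ Wv)                              #    ...and self-attention
out = h @ W_out                                      # 4. back to fused space
print(out.shape)   # (10, 2304): [CLS], [GLOBAL], plus 8 shot predictions
```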

We train the model using a Masked Shot Modeling (MSM) objective. In this self-supervised task, we randomly mask 20% of the input shot embeddings in each sequence by replacing them with a learnable [MASK] embedding. The model’s objective is to predict the original, unmasked fused embedding for these masked positions. The model is optimized by minimizing the cosine distance between its predicted embedding and the ground-truth embedding for each masked shot.
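A minimal NumPy sketch of the masking and loss, where random vectors stand in for real shot embeddings and model outputs:

```python
import numpy as np

rng = np.random.default_rng(1)
N, D, MASK_RATE = 512, 2304, 0.20

seq = rng.normal(size=(N, D))
seq /= np.linalg.norm(seq, axis=1, keepdims=True)   # unit-normed fused shots

mask_emb = rng.normal(size=(D,))            # stand-in for learnable [MASK]
masked = rng.random(N) < MASK_RATE          # ~20% of positions
inputs = np.where(masked[:, None], mask_emb, seq)   # what the model sees

def cosine_distance(a, b):
    sims = (a * b).sum(-1) / (np.linalg.norm(a, axis=-1) * np.linalg.norm(b, axis=-1))
    return 1.0 - sims

# `pred` stands in for the model's output at every position; the loss only
# looks at the masked positions, comparing against the original embeddings.
pred = rng.normal(size=(N, D))
loss = cosine_distance(pred[masked], seq[masked]).mean()
print(inputs.shape, round(float(loss), 3))
```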

We optimized the hidden-layer weight matrices with Muon and the remaining parameters with AdamW. It’s worth noting that the switch to Muon resulted in noticeable improvements.

Evaluation

To evaluate the learned embeddings, we learn task-specific linear layers on top of frozen representations (i.e., linear probes). Most of the tasks are clip-level: each example is a short clip, ranging from a few seconds to a minute, of the kind often presented to our members when recommending a title. When embedding these clips, we find that “embedding in context”, namely extracting the embeddings from within a larger sequence (e.g., the episode containing the clip), naturally does much better than embedding only the shots from a clip.
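Since the backbone stays frozen, a linear probe reduces to fitting a single linear map on top of the embeddings. A self-contained sketch with synthetic data and a closed-form ridge solution (the actual probes may be trained differently):

```python
import numpy as np

rng = np.random.default_rng(2)
N, D, C = 200, 32, 3          # examples, (frozen) embedding dim, classes

X = rng.normal(size=(N, D))   # frozen clip embeddings (synthetic here)
y = rng.integers(0, C, size=N)
Y = np.eye(C)[y]              # one-hot targets

# Only the probe weights W are fit; ridge regression has a closed form.
lam = 1.0
W = np.linalg.solve(X.T @ X + lam * np.eye(D), X.T @ Y)

pred = (X @ W).argmax(axis=1)
acc = (pred == y).mean()
print(acc)   # training accuracy of the probe
```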

Tasks

Our embeddings are foundational and we find that they bring value to applications across Netflix. Here are a few:

  • Ad Relevancy: A multilabel classification task to categorize Netflix clips for relevant ad placement, measured by Average Precision. In this task, these representations operate at the retrieval stage, where they help in identifying the candidate set and in turn are fed into the ad serving system for relevance optimization.
  • Clip Popularity Ranking: A ranking task to predict the relative performance (in click-through rate, CTR) of a media clip relative to other clips from that show or movie, measured with Kendall’s tau correlation coefficient over ten folds.
  • Clip Tone: A multi-label classification of hook clips into 100 tone categories (e.g., creepy, scary, humorous) from our internal Metadata & Ratings team, measured by micro Average Precision (averaged across tone categories).
  • Clip Genre: A multi-label classification of clips into eleven core genres (Action, Anime, Comedy, Documentary, Drama, Fantasy, Horror, Kids, Romance, Sci-fi, Thriller) derived from the genre of the parent title, measured by macro Average Precision (averaged across genres).
  • Clip Retrieval: a binary classification of clips from movies or episodes into “clip-worthy” (i.e., a good clip to showcase the title) or not, as determined by human annotators, and as measured by Average Precision. The positive to negative clip ratio is 1:3, and for each title we select 6–10 positive clips and the corresponding number of negatives.
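The Kendall’s tau used to score clip popularity ranking compares every pair of clips and asks whether the predicted ordering agrees with the observed one. A minimal implementation (the CTR values and scores below are made up for illustration):

```python
import numpy as np
from itertools import combinations

def kendall_tau(a, b):
    """Kendall's tau-a: (concordant - discordant) / total pairs."""
    pairs = list(combinations(range(len(a)), 2))
    s = sum(np.sign(a[i] - a[j]) * np.sign(b[i] - b[j]) for i, j in pairs)
    return s / len(pairs)

ctr_true = [0.12, 0.08, 0.30, 0.05]     # observed CTRs of a title's clips
ctr_pred = [0.5, 0.2, 0.9, 0.1]         # model scores for the same clips
print(kendall_tau(ctr_true, ctr_pred))  # 1.0: the predicted ranking fully agrees
```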

It’s worth noting that for the tasks above (as well as other tasks that use our model), the model outputs are utilized as information that the relevant teams use when driving to a decision rather than being used in a completely end-to-end fashion. Many of the improvements are also in various stages of deployment.

Results

Figure 2³ compares MediaFM to several strong baselines:

Figure 2: Performance of MediaFM vs. external and internal models.

On all tasks, MediaFM is better than the baselines. Improvements seem to be larger for tasks that require more detailed narrative understanding, e.g., predicting the most relevant ads for an ad break given the surrounding context. We look further into this next.

Ablations

MediaFM’s primary improvements over previous Netflix work stem from two key areas: combining multiple modalities and learning to contextualize shot representations. To determine the contribution of each factor across different tasks, we compared MediaFM to a baseline. This baseline concatenates the three input embeddings, essentially providing the same complete, shot-level input as MediaFM but without the contextualization step. This comparison allows us to isolate which tasks benefit most from the contextualization aspect.

Additional modalities help somewhat for tone but the main improvement comes from contextualization.

Oddly, multiple uncontextualized modalities hurt the clip popularity ranking model, but adding contextualization significantly improves performance.

For clip retrieval we see a natural progression of around 15% for each improvement.

Next Steps

MediaFM presents a way to learn how to fuse and/or contextualize shot-level information by leveraging Netflix’s catalog in a self-supervised manner. With this perspective, we are actively investigating how pretrained multimodal (audio, video/image, text) LLMs like Qwen3-Omni, where the modality fusion has already been learned, can provide an even stronger starting point for subsequent model generations.

Next in this series of blog posts, we will present our method to embed title-level metadata and adapt it to our needs. Stay tuned!

Footnotes

  1. We chose embeddings over generative text outputs to prioritize modular design. This provides a tighter, cleaner abstraction layer: we generate the representation once, and it is consumed across our entire suite of services. This avoids the architectural fragility of fine-tuning, allowing us to enhance our existing embedding-based workflows with new modalities more flexibly.
  2. All of our data has audio and video; we zero-pad for missing timed text data, which is relatively likely to occur (e.g., in shots without dialogue).
  3. The title-level tasks couldn’t be evaluated with the VertexAI MM and Marengo embedding models as the videos exceed the length limit set by the APIs.

MediaFM: The Multimodal AI Foundation for Media Understanding at Netflix was originally published in Netflix TechBlog on Medium, where people are continuing the conversation by highlighting and responding to this story.


Building an AI Pull Request Agent for Azure DevOps using GitHub Copilot SDK

Many of my customers are still using Azure DevOps. We’ve talked about moving their code to GitHub to take advantage of newer agentic capabilities, but for a lot of teams that move just isn’t possible right now. What does this mean for them? They’re missing out on the excellent GitHub Copilot code review experience. That didn’t sit right with me. Teams on Azure DevOps deserve the same level of innovation and care. So I built an Azure DevOps Pull Request Agent: an AI-powered agent that automatically reviews pull requests in Azure DevOps. The goal is simple: bring high-quality, AI-driven PR reviews to all customers, wherever they are in their DevOps journey. I’ve been spending some time experimenting with the GitHub...

How to Use Cursor with Modern Angular


This practical guide teaches you how to set up Cursor for Angular 21.

In this article, you will learn how to set up Cursor for a modern Angular project (for example, version 21.0) and how to use it in a practical way. We will cover basic setup, useful features and simple examples that you can try right away.

If you are starting a new project with Angular 20 or later, make sure to configure Cursor rules upfront to work with Cursor more effectively.

Cursor – http://docs.cursor.com/en/context/rules

After scaffolding a new project, you’ll notice that along with the generated code, the Angular CLI also adds a cursor.mdc rule file to the project.

cursor.mdc in product-app - .cursor/rules

You can modify the generated rule file to suit your project requirements. Feel free to customize the Cursor rules as needed. For example, as shown below, I have specified the Angular version to be used by the agents.

Cursor rules told to use Angular version 21.0

If you are working with an existing project, you can manually add the Cursor rules file to the project. Make sure to add the rule in the root of the project inside .cursor/rules. You can also go to Cursor -> Settings -> Cursor Settings -> Rules, Skills, Subagents to add a rule using the IDE.

Cursor automatically detects rules from .cursor/rules/cursor.mdc or .cursor/rules.md. When using Angular CLI, a cursor.mdc file is already added, so do not add rules.md. You should have only one of these files in the project.

Also, verify that alwaysApply: true is set in the rule file. Alternatively, you can manually add the following line at the top of the cursor.mdc file.

alwaysApply: true

You may also enable this setting by selecting the option from the dropdown at the top of the file in the Cursor IDE, as shown in the image below.

cursor.mdc rules – apply manually dropdown set to always apply

Always make sure to configure project rules to:

  • Apply domain-specific knowledge to the codebase
  • Set the version to be used
  • Select architecture decisions regarding whether to use signals or RxJS
  • Use a specific coding style

To validate whether the rule is applied to your project, open the chat in ask mode and ask which version you are using, as shown in the image below:

cursor.mdc chat ask – Which version of Angular are you going to work with? From rules, Angular 21

Next, make sure to connect to the Angular MCP server. To do that on the terminal run the command:

ng mcp

To start using the Angular CLI MCP Server, add this configuration to your host…

Next, in the Cursor IDE, go to Cursor -> Settings -> Cursor Settings -> Tools and MCP to add a custom MCP.

Cursor IDE - Settings - Tools and MCP: add a custom MCP.

Paste the configuration created by the ng mcp command.

{
  "mcpServers": {
    "angular-cli": {
      "command": "npx",
      "args": ["-y", "@angular/cli", "mcp"]
    }
  }
}

After adding the Angular MCP server, you can validate the setup by asking Cursor to list all the MCP servers it is using, and it should list the Angular MCP server as shown in the image below.

Can you tell me which MCP you are going to use? 1. Web fetch – mcp_web_fetch; 2. MCP resources – list_mcp_resources, fetch_mcp_resource; 3. Angular CLI MCP – mcp_angular-cli_ai_tutor

After setting up these elements, open the Agent pane and we’ll create a plan to implement the login screen in Angular.

Angular Cursor chat prompt: Give me a plan to Create a login component. Component should use latest form signal instead of reactive or template forms. Login will have email and password field. Try to keep implementation as simple as possible.

As shown in the image above, I have selected Plan Mode and am asking very specific questions. I am clear on what I want to implement, which features I want to use and which I do not.

Cursor created a plan that closely matches what I wanted, and I am satisfied with it. As shown in the image below, it uses the latest form functions to create a signal-based form and also indicates that it used the Angular MCP server to do this.

Angular login plan in Cursor with heading Component Design (simple) includes list of model, form, imports, decorator, template

Next, click the Build button. Cursor will switch to Agent mode, and the Login feature will be built according to the plan. Once Cursor finishes creating the code, click Review to begin reviewing the generated code.

After review, click either Keep All or Commit to commit the code to the project.

Add login route and link to app, with Commit button, 6 files selected

If you are on the main branch, it may ask you whether you want to create a new branch or commit to the main branch.

Commit on Main Branch – buttons to Commit on Main or Create New Branch

After completing this setup, Cursor generated the LoginComponent, which uses the latest Signal Forms, and I am very happy with the result.

// Imports, the LoginData interface, and the @Component decorator were not
// shown in the generated snippet and are filled in here for completeness;
// form/required/email/submit come from Angular's experimental Signal Forms.
import { Component, signal } from '@angular/core';
import { form, required, email, submit } from '@angular/forms/signals';

interface LoginData {
  email: string;
  password: string;
}

@Component({
  selector: 'app-login',
  templateUrl: './login.component.html',
})
export class LoginComponent {
  protected readonly loginModel = signal<LoginData>({
    email: '',
    password: '',
  });

  protected readonly loginForm = form(this.loginModel, (path) => {
    required(path.email, { message: 'Email is required' });
    email(path.email, { message: 'Enter a valid email address' });
    required(path.password, { message: 'Password is required' });
  });

  protected onSubmit(event: Event): void {
    event.preventDefault();
    submit(this.loginForm, async () => {
      const credentials = this.loginModel();
      console.log('Logging in with:', credentials);
    });
  }
}

The generated code is based on Angular 21.0 features and adheres to the project-specific Cursor rules, informed by the guidance served through the Angular MCP Server. As your project grows, you can add more rules at different levels, but this guide should be enough to help you get started.

I hope you find this article useful. Thanks for reading.


Bun Adds Parallel Script Support


Customizing the ways the dialog manager dismisses itself: Detecting the ESC key, second (failed) attempt


Last time, we saw that GetAsyncKeyState is not the way to detect whether the ESC key was down at the time the current input message was generated. But what if we switched to GetKeyState? Would that allow us to distinguish between an IDCANCEL caused by the ESC key and an IDCANCEL that comes from the Close button?

It helps, in that it tells you whether the ESC key was down when the event occurred, but just because the ESC key is down doesn’t mean that the ESC key is why you got the message.

For example, suppose your policy is to simply ignore the ESC key, but to close the dialog if the user clicks the Close button. If the user holds the ESC key and clicks the Close button, the initial press of the ESC key will generate an IDCANCEL, and your call to GetKeyState will report that the ESC key is down, so you will ignore the message.

And then the next IDCANCEL comes in due to the Close button, and your call to GetKeyState will correctly report “The ESC key is still down.” So your function says, “Oh, this came from the ESC key, so ignore it.”

Except that it didn’t come from the ESC key. It came from the Close button. It just so happens that the ESC is down, but that’s not the reason why you got the second IDCANCEL.

Suppose you have a kiosk in a room with two entrances, a back entrance and a front entrance. If someone enters from the front door, you want to call the receptionist, but you don’t want to do it if they enter from the back door. What we’re doing by checking the ESC key is saying, “If the back door is open, then don’t call the receptionist.” But it’s possible that somebody is just standing in the back doorway, holding the door open, and during that time, somebody comes in the front door. Your logic sees that the back door is open and suppresses the call to the receptionist because you had assumed that only one door can be open at a time.

Next time, we’ll look at distinguishing ESC from Close.

The post Customizing the ways the dialog manager dismisses itself: Detecting the ESC key, second (failed) attempt appeared first on The Old New Thing.
