Netflix’s core mission is to connect millions of members around the world with stories they’ll love. This requires not just an incredible catalog, but also a deep, machine-level understanding of every piece of content in that catalog, from the biggest blockbusters to the most niche documentaries. As we onboard new types of content such as live events and podcasts, the need to scalably understand these nuances becomes even more critical to our productions and member-facing experiences.
Many of these media-related tasks require sophisticated long-form video understanding, e.g., identifying subtle narrative dependencies and emotional arcs that span entire episodes or films. Previous work has found that to truly grasp the content’s essence, our models must leverage the full multimodal signal. For example, the audio soundtrack is a crucial, non-visual modality that can help more precisely identify clip-level tones or when a new scene starts. Can we use our collection of shows and movies to learn how to a) fuse modalities like audio, video, and subtitle text together and b) develop robust representations that leverage the narrative structure present in long-form entertainment? Consisting of tens of millions of individual shots across many titles, our diverse yet entertainment-specific dataset provides the perfect foundation for training multimodal media understanding models that enable many capabilities across the company, such as ads relevancy, clip popularity prediction, and clip tagging.
For these reasons, we developed the Netflix Media Foundational Model (MediaFM), our new, in-house, multimodal content embedding model. MediaFM is the first tri-modal (audio, video, text) model pretrained on portions of the Netflix catalog. Its core is a multimodal, Transformer-based encoder designed to generate rich, contextual embeddings¹ for shots from our catalog by learning the temporal relationships between them through integrating visual, audio, and textual information. The resulting shot-level embeddings are powerful representations designed to create a deeper, more nuanced, and machine-readable understanding of our content, providing the critical backbone for effective cold start of newly launching titles in recommendations, optimized promotional assets (like art and trailers), and internal content analysis tools.
Figure 1: MediaFM Architecture
Input Representation & Preprocessing
The model’s fundamental unit of input is a shot, derived by segmenting a movie or episode (collectively referred to as “title”) using a shot boundary detection algorithm. For each shot, we generate three distinct embeddings from its core modalities:
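The production shot boundary detection algorithm is not described here, but as a rough sketch, a naive detector can flag hard cuts by comparing the color histograms of consecutive frames (the `shot_boundaries` function and its threshold are illustrative assumptions, not the actual algorithm):

```python
import numpy as np

def frame_hist(frame):
    """32-bin grayscale histogram, normalized to sum to 1."""
    h, _ = np.histogram(frame, bins=32, range=(0, 256))
    return h / h.sum()

def shot_boundaries(frames, threshold=0.5):
    """Flag frames whose histogram differs sharply from the previous one.
    Returns the indices where new shots begin (index 0 always starts a shot)."""
    hists = [frame_hist(f) for f in frames]
    cuts = [0]
    for t in range(1, len(hists)):
        if np.abs(hists[t] - hists[t - 1]).sum() > threshold:  # L1 distance
            cuts.append(t)
    return cuts

# Toy "video": 10 dark frames followed by 10 bright frames -> one hard cut.
frames = [np.full((8, 8), 50)] * 10 + [np.full((8, 8), 200)] * 10
print(shot_boundaries(frames))  # [0, 10]
```

Real detectors are considerably more robust (handling fades, motion, and gradual transitions), but the output contract is the same: a list of shot start indices that segment a title.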
Video: an internal model called SeqCLIP (a CLIP-style model fine-tuned on video retrieval datasets) is used to embed frames sampled at uniform intervals from segmented shots
Audio: the audio samples from the same shots are embedded using Meta FAIR’s wav2vec2
Timed Text: OpenAI’s text-embedding-3-large model is used to encode the corresponding timed text (e.g., closed captions, audio descriptions, or subtitles) for each shot
For each shot, the three embeddings² are concatenated and unit-normed to form a single 2304-dimensional fused embedding vector. The transformer encoder is trained on sequences of shots, so each example in our dataset is a temporally-ordered sequence of these fused embeddings from the same movie or episode (up to 512 shots per sequence). We also have access to title-level metadata which is used to provide global context for each sequence (via the [GLOBAL] token). The title-level embedding is computed by passing title-level metadata (such as synopses and tags) through the text-embedding-3-large model.
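The fusion step itself is simple enough to sketch directly. Note that the 768-dim size per modality is an assumption made only so the concatenation matches the stated 2304-dim fused vector; the true per-modality dimensions are not given above:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy per-shot modality embeddings (dimensions are illustrative assumptions).
video_emb = rng.normal(size=768)   # SeqCLIP shot embedding
audio_emb = rng.normal(size=768)   # wav2vec2 shot embedding
text_emb  = rng.normal(size=768)   # timed-text embedding (zeros when no dialogue)

def fuse(video, audio, text):
    """Concatenate the three modality embeddings and unit-normalize the result."""
    fused = np.concatenate([video, audio, text])
    return fused / np.linalg.norm(fused)

fused = fuse(video_emb, audio_emb, text_emb)
print(fused.shape)  # (2304,)
```

Zero-padding the timed-text slot (as the footnotes describe for shots without dialogue) drops out of this formulation naturally: the remaining modalities still produce a valid unit-normed vector.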
Model Architecture and Training Objective
The core of our model is a transformer encoder, architecturally similar to BERT. A sequence of preprocessed shot embeddings is passed through the following stages:
Input Projection: The fused shot embeddings are first projected down to the model’s hidden dimension via a linear layer.
Sequence Construction & Special Tokens: Before entering the Transformer, two special embeddings are prepended to the sequence: • a learnable [CLS] embedding is added at the very beginning. • the title-level embedding is projected to the model’s hidden dimension and inserted after the [CLS] token as the [GLOBAL] token, providing title-level context to every shot in the sequence and participating in the self-attention process.
Contextualization: The sequence is enhanced with positional embeddings and fed through the Transformer stack to provide shot representations based on their surrounding context.
Output Projection: The contextualized hidden states from the Transformer are passed through a final linear layer, projecting them from the hidden layers back up to the 2304-dimensional fused embedding space for prediction.
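The four stages above can be sketched end-to-end. This is a minimal single-head NumPy illustration with toy random weights; the real model is a multi-layer BERT-style Transformer, and every dimension except the 2304-dim fused space and the 3072-dim text-embedding-3-large title embedding is an assumption:

```python
import numpy as np

rng = np.random.default_rng(0)
FUSED_DIM, HIDDEN_DIM, NUM_SHOTS = 2304, 256, 12

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def positional(n):
    """Standard sinusoidal positional embeddings of shape (n, HIDDEN_DIM)."""
    pos = np.arange(n)[:, None]
    i = np.arange(HIDDEN_DIM)[None, :]
    angle = pos / np.power(10000, (2 * (i // 2)) / HIDDEN_DIM)
    return np.where(i % 2 == 0, np.sin(angle), np.cos(angle))

# Toy parameters (randomly initialized here; a real model learns these).
W_in = rng.normal(0, 0.02, (FUSED_DIM, HIDDEN_DIM))    # input projection
W_out = rng.normal(0, 0.02, (HIDDEN_DIM, FUSED_DIM))   # output projection
W_q, W_k, W_v = (rng.normal(0, 0.02, (HIDDEN_DIM, HIDDEN_DIM)) for _ in range(3))
cls_tok = rng.normal(0, 0.02, HIDDEN_DIM)              # learnable [CLS]
W_global = rng.normal(0, 0.02, (3072, HIDDEN_DIM))     # projects title embedding

def encode(shot_embs, title_emb):
    """One encoder pass: project, prepend [CLS]/[GLOBAL], attend, project back."""
    h = shot_embs @ W_in                                 # (T, hidden)
    seq = np.vstack([cls_tok, title_emb @ W_global, h])  # prepend special tokens
    seq = seq + positional(seq.shape[0])                 # add positional embeddings
    q, k, v = seq @ W_q, seq @ W_k, seq @ W_v            # single-head self-attention
    attn = softmax(q @ k.T / np.sqrt(HIDDEN_DIM))
    ctx = seq + attn @ v                                 # residual connection
    return ctx @ W_out                                   # back to fused space

shots = rng.normal(size=(NUM_SHOTS, FUSED_DIM))
title = rng.normal(size=3072)
out = encode(shots, title)
print(out.shape)  # (NUM_SHOTS + 2, FUSED_DIM): [CLS], [GLOBAL], then the shots
```

Because the [GLOBAL] token sits in the same sequence, every shot's output row is a function of the title metadata as well as its neighboring shots.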
We train the model using a Masked Shot Modeling (MSM) objective. In this self-supervised task, we randomly mask 20% of the input shot embeddings in each sequence by replacing them with a learnable [MASK] embedding. The model’s objective is to predict the original, unmasked fused embedding for these masked positions. The model is optimized by minimizing the cosine distance between its predicted embedding and the ground-truth embedding for each masked shot.
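The masking and loss can be sketched as follows. The [MASK] embedding's initialization and the exact masking schedule are illustrative; only the 20% rate and the cosine-distance objective come from the description above:

```python
import numpy as np

rng = np.random.default_rng(1)
MASK_RATE = 0.20
mask_emb = rng.normal(0, 0.02, 2304)  # learnable [MASK] embedding (toy init)

def mask_shots(seq, rate=MASK_RATE):
    """Replace a random ~20% subset of shot embeddings with the [MASK] embedding."""
    masked = seq.copy()
    idx = rng.random(len(seq)) < rate
    masked[idx] = mask_emb
    return masked, idx

def cosine_distance_loss(pred, target):
    """Mean cosine distance between predicted and ground-truth embeddings."""
    pred = pred / np.linalg.norm(pred, axis=-1, keepdims=True)
    target = target / np.linalg.norm(target, axis=-1, keepdims=True)
    return float(np.mean(1.0 - np.sum(pred * target, axis=-1)))

seq = rng.normal(size=(512, 2304))          # one training sequence of fused embeddings
masked_seq, idx = mask_shots(seq)

# The loss is evaluated only at masked positions; a perfect predictor scores ~0.
loss = cosine_distance_loss(seq[idx], seq[idx])
```

The model never sees the original embedding at a masked position, so minimizing this loss forces it to reconstruct a shot from its surrounding narrative context.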
We optimize the hidden-layer (matrix) parameters with Muon and the remaining parameters with AdamW. Notably, the switch to Muon resulted in noticeable improvements.
Evaluation
To evaluate the learned embeddings, we train task-specific linear layers on top of frozen representations (i.e., linear probes). Most of the tasks are clip-level: each example is a short clip, ranging from a few seconds to a minute, of the kind often presented to our members when recommending a title. When embedding these clips, we find that “embedding in context”, namely extracting the embeddings from within a larger sequence (e.g., the episode containing the clip), performs much better than embedding only the shots from the clip in isolation.
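A linear probe is a deliberately simple evaluation: the backbone stays frozen and only one linear layer is fit per task. As a sketch (the probe's actual training recipe is not specified above, so a ridge-regularized least-squares fit stands in here, with toy random data in place of real embeddings and labels):

```python
import numpy as np

rng = np.random.default_rng(2)

# Toy stand-ins for frozen 2304-dim clip embeddings and binary task labels.
X_train = rng.normal(size=(200, 2304))
y_train = rng.integers(0, 2, size=200)
X_test = rng.normal(size=(50, 2304))

# Fit a single linear layer in closed form (ridge regression on +/-1 targets).
lam = 10.0
A = X_train.T @ X_train + lam * np.eye(X_train.shape[1])
w = np.linalg.solve(A, X_train.T @ (2.0 * y_train - 1.0))  # labels mapped to +/-1

scores = X_test @ w  # higher score -> more likely positive class
print(scores.shape)  # (50,)
```

Because the probe has so little capacity, any gap between two backbones on the same probe reflects the quality of the frozen embeddings rather than the evaluation head.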
Our embeddings are foundational and we find that they bring value to applications across Netflix. Here are a few:
Ad Relevancy: A multilabel classification task to categorize Netflix clips for relevant ad placement, measured by Average Precision. In this task, these representations operate at the retrieval stage, where they help in identifying the candidate set and in turn are fed into the ad serving system for relevance optimization.
Clip Popularity Ranking: A ranking task to predict the relative performance (in click-through rate, CTR) of a media clip relative to other clips from the same show or movie, measured by Kendall’s tau rank correlation coefficient averaged over ten folds.
Clip Tone: A multi-label classification of hook clips into 100 tone categories (e.g., creepy, scary, humorous) from our internal Metadata & Ratings team, measured by micro Average Precision (averaged across tone categories).
Clip Genre: A multi-label classification of clips into eleven core genres (Action, Anime, Comedy, Documentary, Drama, Fantasy, Horror, Kids, Romance, Sci-fi, Thriller) derived from the genre of the parent title, measured by macro Average Precision (averaged across genres).
Clip Retrieval: a binary classification of clips from movies or episodes into “clip-worthy” (i.e., a good clip to showcase the title) or not, as determined by human annotators, and as measured by Average Precision. The positive to negative clip ratio is 1:3, and for each title we select 6–10 positive clips and the corresponding number of negatives.
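To make the clip popularity ranking metric above concrete: Kendall’s tau compares every pair of clips and counts agreements versus disagreements in ordering. A minimal implementation (the O(n²) tau-a variant, which ignores ties but is fine for the handful of clips per title; the CTR numbers are invented):

```python
import numpy as np
from itertools import combinations

def kendall_tau(a, b):
    """Kendall's tau-a: (concordant - discordant pairs) / total pairs."""
    n = len(a)
    c = sum(np.sign(a[i] - a[j]) * np.sign(b[i] - b[j])
            for i, j in combinations(range(n), 2))
    return c / (n * (n - 1) / 2)

# Predicted scores vs. observed CTRs for four clips of one title (toy numbers).
pred = np.array([0.9, 0.4, 0.7, 0.1])
ctr  = np.array([0.08, 0.02, 0.05, 0.01])
print(kendall_tau(pred, ctr))  # 1.0 -- the two orderings agree on every pair
```

A tau of 1.0 means the model ranks every pair of clips the same way the observed CTRs do; 0 means no better than chance, and -1.0 means a fully reversed ordering.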
It’s worth noting that for the tasks above (as well as other tasks that use our model), the model outputs are utilized as information that the relevant teams use when driving to a decision rather than being used in a completely end-to-end fashion. Many of the improvements are also in various stages of deployment.
Results
Figure 2³ compares MediaFM to several strong baselines:
The previously mentioned SeqCLIP, which also provides the video embedding input for MediaFM
Figure 2: Performance of MediaFM vs. external and internal models.
On all tasks, MediaFM outperforms the baselines. Improvements tend to be larger for tasks that require more detailed narrative understanding, e.g., predicting the most relevant ads for an ad break given the surrounding context. We look further into this next.
Ablations
MediaFM’s primary improvements over previous Netflix work stem from two key areas: combining multiple modalities and learning to contextualize shot representations. To determine the contribution of each factor across different tasks, we compared MediaFM to a baseline. This baseline concatenates the three input embeddings, essentially providing the same complete, shot-level input as MediaFM but without the contextualization step. This comparison allows us to isolate which tasks benefit most from the contextualization aspect.
Additional modalities help somewhat for tone but the main improvement comes from contextualization.
Surprisingly, multiple uncontextualized modalities hurt the clip popularity ranking model, but adding contextualization significantly improves performance.
For clip retrieval we see a steady progression of around 15% from each improvement.
Next Steps
MediaFM presents a way to learn how to fuse and/or contextualize shot-level information by leveraging Netflix’s catalog in a self-supervised manner. With this perspective, we are actively investigating how pretrained multimodal (audio, video/image, text) LLMs like Qwen3-Omni, where the modality fusion has already been learned, can provide an even stronger starting point for subsequent model generations.
Next in this series of blog posts, we will present our method to embed title-level metadata and adapt it to our needs. Stay tuned!
Footnotes
We chose embeddings over generative text outputs to prioritize modular design. This provides a tighter, cleaner abstraction layer: we generate the representation once, and it is consumed across our entire suite of services. This avoids the architectural fragility of fine-tuning, allowing us to enhance our existing embedding-based workflows with new modalities more flexibly.
All of our data has audio and video; we zero-pad for missing timed text data, which is relatively likely to occur (e.g., in shots without dialogue).
The title-level tasks couldn’t be evaluated with the VertexAI MM and Marengo embedding models as the videos exceed the length limit set by the APIs.
Many of my customers are still using Azure DevOps. We’ve talked about moving their code to GitHub to take advantage of newer agentic capabilities, but for a lot of teams that move just isn’t possible right now. What does this mean for them? They’re missing out on the excellent GitHub Copilot code review experience. That didn’t sit right with me. Teams on Azure DevOps deserve the same level of innovation and care. So I built an Azure DevOps Pull Request Agent: an AI-powered agent that automatically reviews pull requests in Azure DevOps. The goal is simple: bring high-quality, AI-driven PR reviews to all customers, wherever they are in their DevOps journey. I’ve been spending some time experimenting with the GitHub...
This practical guide teaches you how to set up Cursor for Angular 21.
In this article, you will learn how to set up Cursor for a modern Angular project (for example, version 21.0) and how to use it in a practical way. We will cover basic setup, useful features and simple examples that you can try right away.
If you are starting a new project with Angular 20 or later, make sure to configure Cursor rules upfront to work with Cursor more effectively.
After successfully executing the command, you’ll notice that along with the scaffolded Angular project, the Angular CLI also adds a cursor.mdc rule file to the project.
You can modify the generated rule file to suit your project requirements. Feel free to customize the Cursor rules as needed. For example, as shown below, I have specified the Angular version to be used by the agents.
If you are working with an existing project, you can manually add the Cursor rules file to the project. Make sure to add the rule in the root of the project inside .cursor/rules. You can also go to Cursor → Settings → Cursor Settings → Rules, Skills, Subagents to add a rule using the IDE.
Cursor automatically detects rules from .cursor/rules/cursor.mdc or .cursor/rules.md. When using Angular CLI, a cursor.mdc file is already added, so do not add rules.md. You should have only one of these files in the project.
Also, verify that alwaysApply: true is set in the rule file. Alternatively, you can manually add the following line at the top of the cursor.mdc file.
alwaysApply: true
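For reference, a minimal cursor.mdc with this front matter might look like the following; the rule text itself is illustrative and should be adapted to your project:

```
---
alwaysApply: true
---

This project uses Angular 21. Prefer standalone components and
signals over RxJS for component state.
```

With alwaysApply set, the rule is attached to every agent and chat request rather than only when Cursor decides it is relevant.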
You may also enable this setting by selecting the option from the dropdown at the top of the file in the Cursor IDE, as shown in the image below.
Always make sure to configure project rules to:
Apply domain-specific knowledge to the codebase
Set the version to be used
Select architecture decisions regarding whether to use signals or RxJS
Use a specific coding style
To validate whether the rule is applied to your project, open the chat in Ask mode and ask which Angular version you are using, as shown in the image below:
Next, make sure to connect to the Angular MCP server. To do that on the terminal run the command:
ng mcp
Next, in the Cursor IDE, go to Cursor → Settings → Cursor Settings → Tools and MCP to add a custom MCP server.
Paste the configuration created by the ng mcp command.
After adding the Angular MCP server, you can validate the setup by asking Cursor to list all the MCP servers it is using, and it should list the Angular MCP server as shown in the image below.
After setting up these elements, open the Agent pane and we’ll create a plan to implement the login screen in Angular.
As shown in the image above, I have selected Plan Mode and am asking very specific questions. I am clear on what I want to implement, which features I want to use and which I do not.
Cursor created a plan that closely matches what I wanted, and I am satisfied with it. As shown in the image below, it uses the latest form functions to create a signal-based form and also indicates that it used the Angular MCP server to do this.
Next, click the Build button. Cursor will switch to Agent mode, and the Login feature will be built according to the plan. Once Cursor finishes creating the code, click Review to begin reviewing the generated code.
After review, click either Keep All or Commit to commit the code to the project.
If you are on the main branch, it may ask you whether you want to create a new branch or commit to the main branch.
After completing this setup, Cursor generated the LoginComponent, which uses the latest Signal Forms, and I am very happy with the result.
import { signal } from '@angular/core';
import { form, required, email, submit } from '@angular/forms/signals';

interface LoginData {
  email: string;
  password: string;
}

export class LoginComponent {
  protected readonly loginModel = signal<LoginData>({
    email: '',
    password: '',
  });

  protected readonly loginForm = form(this.loginModel, (path) => {
    required(path.email, { message: 'Email is required' });
    email(path.email, { message: 'Enter a valid email address' });
    required(path.password, { message: 'Password is required' });
  });

  protected onSubmit(event: Event): void {
    event.preventDefault();
    submit(this.loginForm, async () => {
      const credentials = this.loginModel();
      console.log('Logging in with:', credentials);
    });
  }
}
The generated code is based on Angular 21.0 features and adheres to project-specific rules defined in the Angular MCP Server. As your project grows, you can add more rules at different levels, but this guide should be enough to help you get started.
I hope you find this article useful. Thanks for reading.
It helps, in that it tells you whether the ESC key was down when the event occurred, but just because the ESC key is down doesn’t mean that the ESC key is why you got the message.
For example, suppose your policy is to simply ignore the ESC key, but to close the dialog if the user clicks the Close button. If the user holds the ESC key and clicks the Close button, the initial press of the ESC will generate an IDCANCEL, and your call to GetKeyState will report that the ESC is down, so you will ignore the message.
And then the next IDCANCEL comes in due to the Close button, and your call to GetKeyState will correctly report “The ESC key is still down.” So your function says, “Oh, this came from the ESC key, so ignore it.”
Except that it didn’t come from the ESC key. It came from the Close button. It just so happens that the ESC is down, but that’s not the reason why you got the second IDCANCEL.
Suppose you have a kiosk in a room with two entrances, a back entrance and a front entrance. If someone enters from the front door, you want to call the receptionist, but you don’t want to do it if they enter from the back door. What we’re doing by checking the ESC key is saying, “If the back door is open, then don’t call the receptionist.” But it’s possible that somebody is just standing in the back doorway, holding the door open, and during that time, somebody comes in the front door. Your logic sees that the back door is open and suppresses the call to the receptionist because you had assumed that only one door can be open at a time.
Next time, we’ll look at distinguishing ESC from Close.