Sr. Content Developer at Microsoft, working remotely in PA, TechBash conference organizer, former Microsoft MVP, Husband, Dad and Geek.
156115 stories
·
33 followers

Are AI Glasses Over?, Big Technology Audience Questions, Alex Stamos on AI Cybersecurity

1 Share

Ranjan Roy from Margins is back for our weekly discussion of the latest tech news LIVE from Big Technology AI Summit. We cover: 1) Do Snapchat Specs signal the end of AR glasses 2) What should an AI device do? 3) Audience questions from the Big Technology AI Summit! 4) How should companies plan for such fast moving technology? 5) What's the ideal AI device form factor? 6) Can AI models be more useful for biology? 7) Can the U.S. and China get along on AI? 8) What responsibility do AI companies have to society? 9) Ex-Meta CSO Alex Stamos joins us to talk Fable's cyber-risks 10) Is it marketing or is it material?

---

Enjoying Big Technology Podcast? Please rate us five stars ⭐⭐⭐⭐⭐ in your podcast app of choice.

Want a discount for Big Technology on Substack + Discord? Here’s 25% off for the first year: https://www.bigtechnology.com/subscribe?coupon=0843016b






Download audio: https://pdst.fm/e/tracking.swap.fm/track/t7yC0rGPUqahTF4et8YD/pscrb.fm/rss/p/traffic.megaphone.fm/AMPP6600094449.mp3
Read the whole story
alvinashcraft
8 hours ago
reply
Pennsylvania, USA
Share this story
Delete

IoT Coffee Talk: Episode 317 - "Populating Dead Planets" (The lost episode of human purpose)

From: IoT Coffee Talk
Duration: 5:04
Views: 1

Welcome to IoT Coffee Talk, where hype comes to die a terrible death. We have a fireside chat about all things #IoT over a cup of coffee or two with some of the industry's leading business minds, thought leaders and technologists in a totally unscripted, organic format.

This week David, Rob, and Leonard jump on Web3 for a discussion about:

🎶 🎙️ BAD KARAOKE! 🎸 🥁 "Starman", David Bowie
🐣 Our second lost episode out of 317 episodes!!
🐣 How we got undermined by Polymarket!
🐣 Edge AI 2026 event in London highlights! Inference was the star!
🐣 Each tech event should have a job board to help people take over AI jobs.
🐣 RS232 and null modems at the very foundation of IoT.
🐣 Rob's book at the foundation of saving the planet with tech!
🐣 The real reason why SpaceX is fixated on populating dead planets with robots and AI.
🐣 Why we want to ditch Earth for Mars. Is it worth it? Is it smart?

It's a great episode. Grab an extraordinarily expensive latte at your local coffee shop and check out the whole thing. You will get all you need to survive another week in the world of IoT and greater tech!

Tune in! Like! Share! Comment and share your thoughts on IoT Coffee Talk, the greatest weekly assembly of Thinkers 360 and CBT tech and IoT influencers on the planet!!

If you are interested in sponsoring an episode, please contact Stephanie Atkinson at Elevate Communities. Just make a minimally required donation to www.elevatecommunities.org and you can jump on and hang with the gang and amplify your brand on one of the top IoT/Tech podcasts in the known metaverse!!!

Take IoT Coffee Talk on the road with you on your favorite podcast platform. Go to IoT Coffee Talk on Buzzsprout, like, subscribe, and share: https://lnkd.in/gyuhNZ62


The Models Trying to Replace Fable

From: AIDailyBrief
Duration: 25:38
Views: 1,658

G7 talks exposed geopolitical tension over access to US frontier models after the Anthropic Fable shutdown. Open-source and smaller, more efficient models (GLM 5.2, Kimi 2.7, Vibe Thinker, and Cursor Composer 2.5) are driving moves toward local hosting and lower-cost inference. Model panels, smart routing, and advisor-worker hybrids highlight inference optimization and capability orchestration as new enterprise priorities.

The AI Daily Brief helps you understand the most important news and discussions in AI.
Subscribe to the podcast version of The AI Daily Brief wherever you listen: https://pod.link/1680633614
Get it ad free at http://patreon.com/aidailybrief
Learn more about the show https://aidailybrief.ai/


C# Evolved and Bits for Maps

From: Fritz's Tech Tips and Chatter
Duration: 2:04:55
Views: 55

Fritz is building a new website for work and is inviting you to join in. We're going to build a site about modern C#.


Data Projects: Managing Data Assets at Netflix Scale


By Amer Hesson, Marcelo Mayworm, James Mulcahy, and Brittany Truong

The Problem: Managing Assets at Netflix Scale

Netflix’s Data Platform is vast. We have millions of tables in our data warehouse and tens of thousands of scheduled workloads running across our orchestration systems. Behind each of these assets sits an engineer, a team, or an initiative — and behind each of those sits a set of decisions about who can access what, and how those workloads execute day after day.

For years, the tools we used to manage access and identity for these assets operated at the granularity of the individual asset. Every table had its own Access Control List (ACL). Every workflow ran under the identity of the engineer who authored it. In a workforce that is fluid, where people change teams, change roles, and occasionally leave the company, this fine-grained model broke down in two persistent, painful ways.

Problem 1: Permissions that can’t keep up with organizational changes

Imagine you’re on a team that owns a few hundred tables. Your org restructures, a neighboring team merges into yours, and you inherit another few hundred. Now you have to find every ACL on every table, figure out who should still have access, and update them one by one. Multiply that by every reorg across every team across the company. The result? Two failure modes:

  1. The support team gets flooded. A significant and outsized share of support threads were requests to update table permissions en masse in response to org changes. While self-service tooling and best practices are in place to manage this, adherence is inconsistent. Data Projects addresses this by promoting the solution from optional tooling to a foundational part of the data platform.
  2. Access gets granted far too broadly. Rather than maintain fine-grained ACLs, teams would often open up table access to the whole company. This defeated the purpose of having ACLs in the first place.

Problem 2: Workloads tied to human identities

Scheduled and asynchronous workloads — Maestro workflows, data movement jobs, Spark pipelines — need an identity to run as. Historically, that was a human: whoever authored the workflow.

Human identities are not durable. People change teams, get new responsibilities, and leave the company. When they do, their permissions change, and the workflows running under their identity start to fail. The only fix was to swap in a colleague’s identity, which inevitably had different permissions, kicking off a “permissions whack-a-mole” as each fix surfaced the next missing grant. And then, eventually, that colleague would also move on, and the cycle would repeat.

Enter Data Projects

We introduced Data Projects to tackle both problems head-on. At its core, a Data Project is two things:

  1. A container to manage and view a set of related assets in aggregate: tables, workflows, and other data assets grouped under a single logical umbrella.
  2. A synthetic, durable, and assumable identity: one that asynchronous and scheduled workloads can execute under, independent of any human’s lifecycle.

You can think of it as hoisting the granularity of management up from the individual asset to a meaningful container: the project. Instead of managing permissions on 500 tables, you manage them on one project that contains those 500 tables.

While the initial focus has been access and identity, the abstraction has applications well beyond those concerns. That broader potential is part of what makes it worth investing in.

Figure 1a. Individual assets, each managed in isolation, with per-asset access controls and per-person ownership.
Figure 1b. These assets are logically grouped into projects for easier management.

Grants and Roles

Each Data Project has a set of grants managed by the owning team. Different identity types can be added as grants: users, groups, applications, and continuous integration (CI) jobs. Each grant has a role that determines what the grantee can do within the project. For example, a Contributor has read/write access to the project’s assets, while a Viewer has read-only access. These roles roll up neatly — instead of rewriting hundreds of ACLs when someone joins or leaves a team, you update a single project grant.
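To make the roll-up concrete, here is a minimal sketch of project-level grants. It is not Netflix's implementation: the `Project` class, the `ALLOWED_ACTIONS` mapping, the principal strings, and the table names are all hypothetical, and only the Contributor and Viewer roles come from the text above.

```python
from dataclasses import dataclass, field
from enum import Enum


class Role(Enum):
    VIEWER = "viewer"            # read-only access to project assets
    CONTRIBUTOR = "contributor"  # read/write access to project assets


# Hypothetical mapping from role to permitted actions.
ALLOWED_ACTIONS = {
    Role.VIEWER: {"read"},
    Role.CONTRIBUTOR: {"read", "write"},
}


@dataclass
class Project:
    name: str
    assets: set = field(default_factory=set)
    grants: dict = field(default_factory=dict)  # principal -> Role

    def grant(self, principal: str, role: Role) -> None:
        self.grants[principal] = role

    def revoke(self, principal: str) -> None:
        self.grants.pop(principal, None)

    def can(self, principal: str, action: str, asset: str) -> bool:
        # Access is decided by the project grant, not by a per-asset ACL.
        role = self.grants.get(principal)
        if role is None or asset not in self.assets:
            return False
        return action in ALLOWED_ACTIONS[role]


# One project containing 500 tables, managed through two grants.
project = Project("member-analytics", assets={f"table_{i}" for i in range(500)})
project.grant("user:alice", Role.CONTRIBUTOR)
project.grant("group:finance", Role.VIEWER)
```

In this sketch, a single `project.revoke("user:alice")` call removes her access to all 500 tables at once, which is the per-project update the paragraph describes.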

The Identity Umbrella: Netflix and IAM

Every Data Project is provisioned with a Netflix application identity, and optionally an AWS IAM role. This is the “identity umbrella” that makes workloads durable:

  • The project’s Netflix identity is what executes the project’s async workloads (e.g. Maestro workflows). It belongs to the project, not to any person.
  • The project’s IAM role supports specialized use cases in AWS like Spark jobs on Amazon EMR. Crucially, the IAM role can be exchanged for the project’s Netflix identity in a cryptographically secure way.

Members with privileged roles can also assume the project’s Netflix identity. This is enormously useful for testing and troubleshooting from a development context like a laptop or a notebook — you get to run commands as the project, exactly as the scheduled workload would.

Gravity

One of the more elegant properties of Data Projects is what we call gravity. When a workload running under a project’s identity creates a new asset — say a Maestro workflow creates three tables — those assets are automatically added to the project as contained assets. The project becomes the center of mass for everything produced under its identity. You get organization for free as a side effect of how the platform already works, eliminating future challenges of discovering relevant assets and gaining access to them.
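As an illustration of gravity, the sketch below attaches each newly created asset to the project whose identity created it. The `AssetRegistry` class, the `app:<name>` identity format, and the table names are assumptions made for this example; the article does not describe how the platform implements the association.

```python
class Project:
    def __init__(self, name: str) -> None:
        self.name = name
        self.identity = f"app:{name}"  # hypothetical identity format
        self.assets: set[str] = set()


class AssetRegistry:
    """Associates newly created assets with the project whose identity created them."""

    def __init__(self) -> None:
        self._projects_by_identity: dict[str, Project] = {}

    def register(self, project: Project) -> None:
        self._projects_by_identity[project.identity] = project

    def record_creation(self, asset: str, created_by: str):
        # If the creator is a project identity, the asset joins that project.
        project = self._projects_by_identity.get(created_by)
        if project is not None:
            project.assets.add(asset)
        return project


registry = AssetRegistry()
qoe = Project("streaming-qoe")
registry.register(qoe)

# A workflow running as the project creates three tables.
for table in ("qoe_daily", "qoe_hourly", "qoe_errors"):
    registry.record_creation(table, created_by=qoe.identity)

# An asset created under a human identity is not pulled into any project.
orphan_owner = registry.record_creation("scratch_table", created_by="user:alice")
```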

Securing Data Workflows with Data Projects

Maestro is Netflix’s primary workflow orchestrator for batch analytics, covering scheduled ETL pipelines, data movement jobs, ML training, and much more. Because workflows can run on schedules without the original user present, Maestro is designated a Trusted Workload Manager (TWM), formally authorized to mint fresh identity tokens on behalf of the workloads it manages.

That identity matters everywhere. A single workflow execution may be checked against table ACLs in the Secure Data Warehouse, authorization policies for Netflix resources, and IAM policies for AWS — all in a single run. If the identity is fragile, the whole workflow is fragile.

The Problem with User-Tied Identity

The standard pattern was to run workflows under an On-Behalf-Of (OBO) credential — for example, maestro OBO alice@netflix.com. This gave the workflow the union of Maestro’s and the human’s permissions, but in doing so it also bound the workflow’s permissions to that person’s. When they changed teams or left Netflix, the workflow broke. A colleague might take over ownership, but they rarely had the same access as the previous owner, so the workflow would stay broken for days while permissions were sorted out. At Netflix’s scale, with tens of thousands of scheduled workloads, many of them business-critical, this was unsustainable.
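The fragility of the OBO model can be shown with simple set arithmetic. This is a sketch under stated assumptions: the permission strings and the `effective_permissions` and `missing_grants` helpers are hypothetical, and the only detail taken from the text is that the workflow runs with the union of Maestro's and the human's permissions.

```python
def effective_permissions(service: frozenset, user: frozenset) -> frozenset:
    # An OBO credential carries the union of both parties' permissions.
    return service | user


def missing_grants(required: frozenset, service: frozenset, user: frozenset) -> list:
    # Permissions the workflow needs but the OBO credential does not carry.
    return sorted(required - effective_permissions(service, user))


MAESTRO = frozenset({"schedule:run"})
REQUIRED = frozenset({"schedule:run", "read:events", "write:daily_summary"})

alice = frozenset({"read:events", "write:daily_summary"})  # original author
bob = frozenset({"read:events"})                           # new owner with different access
departed = frozenset()                                     # author has left the company

alice_gaps = missing_grants(REQUIRED, MAESTRO, alice)
bob_gaps = missing_grants(REQUIRED, MAESTRO, bob)
departed_gaps = missing_grants(REQUIRED, MAESTRO, departed)
```

Here the workflow runs cleanly as Alice, fails on one grant when Bob takes over, and fails on every data permission once the author's access is gone, even though nothing in the workflow itself changed.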

Data Projects: Durable Identity

Data Projects solves this by replacing user-tied identity with a durable, team-owned Netflix application identity: one that doesn’t change teams, go on vacation, or leave the company. Each project groups related workflows, tables, secrets, and other assets under a single consistent identity, and Maestro validates the caller’s access to the project before executing any workflow under it.

The downstream improvements are as follows:

  • Tables created during execution are automatically associated with the project’s identity through gravity, inheriting its access controls without additional configuration.
  • Secrets are scoped to project policies, so ownership transfers no longer strand credentials.
  • Access is managed once at the project level, replacing fragmented per-user grants across every asset the workflow touches.

The result is a workflow identity model that is stable, auditable, and built to survive the organizational changes inevitable at any company operating at this scale.

Success Stories

Many Data Projects have already grown to contain tens of thousands of assets in production. A couple of examples are highlighted below:

  • Streaming Quality of Experience: A core observability pipeline tracking quality of experience (QoE) metrics whose continuity used to depend on whichever engineer happened to own the underlying workflows. Now it runs under the project’s identity, stable regardless of team membership changes.
  • Member Analytics: Analytical models and ETL workflows for member data products. A concentrated set of business-critical analytics whose access is managed at the project level rather than across hundreds of individual tables and workflows.

More broadly, we’ve seen Data Projects adopted as the organizing principle for entire analytics domains. Where teams previously maintained their own access policies, ad-hoc grant lists, and tribal knowledge about “who should have access to what,” the project is now the single answer.

Using Data Projects

Onboarding workflows onto Data Projects is a matter of:

  1. Creating a project for the logical grouping of assets (or using an existing suitable one).
  2. Granting the right people and groups the appropriate roles.
  3. Configuring the workflow to run with the project’s identity.

Thanks to gravity, new assets produced by project workflows land in the project automatically. Migrating existing workflows can be a challenge because the Data Project must be set up with the appropriate permissions before the workflow's execution identity is changed. We are actively working on infrastructure to track the access patterns of existing workflows so that we can recommend precise permission updates for the destination project. Our goal is to make the Data Project the de facto option for executing any kind of asynchronous workload.
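One way to picture the migration recommendation is to compare what a workflow has actually accessed against what the destination project already holds. The sketch below is purely illustrative: the access-log format, the `recommend_grants` function, and the table names are hypothetical and do not describe the infrastructure Netflix is building.

```python
from collections import Counter


def recommend_grants(access_log, project_permissions):
    """Return (resource, action) pairs the workflow used that the project lacks,
    most frequently used first."""
    usage = Counter(access_log)
    missing = [pair for pair in usage if pair not in project_permissions]
    return sorted(missing, key=lambda pair: (-usage[pair], pair))


# Hypothetical observed accesses for one existing workflow.
access_log = [
    ("events", "read"), ("events", "read"), ("events", "read"),
    ("daily_summary", "write"), ("daily_summary", "write"),
    ("dim_titles", "read"),
]

# Permissions the destination project already has.
project_permissions = {("events", "read")}

recommendations = recommend_grants(access_log, project_permissions)
```

Applying the recommended grants before switching the execution identity avoids discovering the gaps one failed run at a time.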

What’s Next

Data Projects started as an Analytics Platform initiative, a response to specific pains in the data warehouse, but the underlying ideas are not unique to data. We see a potential future where Projects (not just Data Projects) are a first-class platform concept spanning data assets, software assets (GitHub repositories, Spinnaker applications, Docker images), and even studio assets (production content, pipelines, and transformations).

We’re also investing in:

  • Rightsizing: we’re integrating a layer on top of our authorization policies that automatically rightsizes permissions based on actual usage patterns, proactively eliminating unnecessary access and preventing “permission creep”.
  • Hoisting beyond access and identity: the project is a natural unit for surfacing other concerns at the aggregate level — cost attribution, health indicators, and more.
  • Ad-hoc use case integrations: extending project identities beyond scheduled workloads to cover interactive, on-demand actions like running a query through the Data Portal.
  • Activity logs and audits: a unified timeline of grant changes, asset changes, and workflow versions at the project level.

Conclusion

Data Projects is an answer to a simple observation: at Netflix’s scale, the unit of identity and access management can’t be the individual asset or the individual human. It has to be something larger, something durable, something that matches the way teams actually think about the work they own.

A project is that unit. And as we continue to generalize the concept beyond the data warehouse, we expect it to become one of the foundational primitives of how engineering at Netflix is organized, not just how data is organized.

Acknowledgments

We would like to express our gratitude to the following individuals for their contributions to this effort: Ryan Bordo, Doug Clark, Luke Fernandez, Sarrah Figueroa, Ankit Gupta, Brian Hoying, Ye Ji, Abhishek Kapatkar, Anmol Khurana, Matheus Leão, Hechao Li, Raymond Liu, Alice Naghshineh, David Noor, Anjali Norwood, Javier Garcia Palacios, Kunaal Parekh, Brandon Quan, Andrew Seier, Jason Seo, and Ethan Zhang.

If you are interested in helping us solve these types of problems and helping entertain the world, please take a look at some of our open positions on the Netflix jobs page.


Data Projects: Managing Data Assets at Netflix Scale was originally published in Netflix TechBlog on Medium, where people are continuing the conversation by highlighting and responding to this story.


The Data Canary: How Netflix Validates Catalog Metadata


By Celina Amados

At Netflix, our catalog metadata is crucial to our member experience, and a single corrupted data state can impact millions of viewers immediately. To protect streaming reliability, we built an automated data canary system that validates data transformations using production traffic. This canary detects issues in under 10 minutes, and blocks bad data from reaching our members.

Intro

Catalog metadata is what makes Netflix functional. It defines what titles exist, where they’re available, whether they can be played, and more. This data gets transformed and distributed across our vast infrastructure near-continuously, powering everything that helps members find what they want to watch. Accurate catalog data delivers moments of joy. Corrupted catalog data breaks streaming.

What Went Wrong

A production incident revealed a critical gap in our resilience strategy. No code had been deployed. No configuration had changed. But a manual mitigation action taken during a previous incident had inadvertently corrupted a data feed, rendering it empty for a subset of titles.

The impact was immediate: missing metadata prevented manifest generation, causing failures in our catalog service and playback issues.

Engineers were alerted immediately, but identifying the root cause took time. After intense triaging, responders pinpointed the corrupted data feed and pinned services back to a known-good state, restoring playback.

The problem? Our sophisticated code canary deployments had caught nothing. No code had changed — the data had.

This incident exposed a fundamental gap in our resiliency capabilities: we can validate code deployments, but we had no equivalent for our high-velocity data pipelines. Our catalog metadata, consisting of titles, artwork, availability, and more, was continuously transformed from multiple upstream sources and published at a regular cadence. Each upstream source had its own validation, but these checks didn’t catch corruption in the final transformed output.

We needed to treat data deployments with the same rigor as code deployments.

The Challenge: Validating Data at Short Intervals

Our catalog metadata service operates as a high-velocity data pipeline: it processes multiple input feeds, transforms them, and publishes the final catalog state that gets distributed across our infrastructure.

This creates unique validation challenges that our traditional canary analysis tools aren’t designed to handle:

Time Constraints: Our existing canary analysis tools require 30–60 minutes to reach statistical confidence. We had a much shorter window between data cycles; we needed to detect issues, make a decision, and block publishing all within a single cycle.

Emergent Issues: While each upstream data source has independent validation, problems often only manifest in the final transformed state. We needed to validate the actual output that clients would consume, not just the inputs, as close to the clients as possible.

Production Traffic is Essential: We initially considered shadow traffic, but quickly realized it was insufficient. Shadow traffic can only replay requests to our catalog metadata service; it can’t simulate the entire playback lifecycle across multiple services and domains. To detect real customer impact, we needed real production traffic.

Limit Blast Radius: Despite using production traffic for validation, we couldn’t allow customers to experience widespread issues during the validation process. Any regression needed to be detected and contained immediately.

Our Solution: The Data Canary Orchestrator Pattern

After evaluating several architectural approaches, we developed a solution built around three key innovations:

1. Dedicated Orchestrator Pattern

We created a dedicated cluster for the purposes of canarying new catalog metadata that separates concerns, avoids self-testing, and provides a pattern for extensibility. Here’s how it works:

Orchestrator Instance: A dedicated orchestrator instance of our catalog metadata service coordinates the data canary flow. When a new catalog version is published to the canary environment, the orchestrator validates that both baseline and canary clusters are healthy and version-synchronized, then triggers a chaos experiment.

Permanent Baseline & Canary Clusters: Two dedicated service clusters run continuously in our canary region. The baseline cluster always serves the latest production catalog version, while the canary cluster receives new versions for validation.

Generic Integration Point: Upon chaos experiment completion, the orchestrator reports results back to the transformer service via a REST endpoint. This generic interface means new data sources can implement their own orchestrator patterns without requiring transformer code changes.

This pattern can now be adopted by other teams at Netflix for validating different data sources, which is exactly the kind of extensibility we designed for.
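The orchestrator flow described above can be sketched as a gate followed by a callback. Everything named here is an assumption for illustration: `ClusterState`, `ready_for_experiment`, and `orchestrate` are hypothetical, the `report` callable stands in for the REST endpoint on the transformer service, and reading "version-synchronized" as "baseline on the production version, canary on the candidate version" is one plausible interpretation.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class ClusterState:
    healthy: bool
    catalog_version: str


def ready_for_experiment(baseline: ClusterState, canary: ClusterState,
                         production_version: str, candidate_version: str) -> bool:
    # Both clusters must be healthy and serving the versions being compared.
    return (baseline.healthy and canary.healthy
            and baseline.catalog_version == production_version
            and canary.catalog_version == candidate_version)


def orchestrate(baseline: ClusterState, canary: ClusterState,
                production_version: str, candidate_version: str,
                run_experiment: Callable[[], bool],
                report: Callable[[str, bool], None]) -> str:
    if not ready_for_experiment(baseline, canary, production_version, candidate_version):
        return "skipped"  # never start an experiment on misaligned clusters
    passed = run_experiment()
    report(candidate_version, passed)  # stand-in for the REST callback
    return "passed" if passed else "blocked"


reports = []
outcome = orchestrate(
    ClusterState(True, "v41"), ClusterState(True, "v42"), "v41", "v42",
    run_experiment=lambda: False,  # simulate a detected regression
    report=lambda version, ok: reports.append((version, ok)),
)
not_ready = orchestrate(
    ClusterState(True, "v40"), ClusterState(True, "v42"), "v41", "v42",
    run_experiment=lambda: True,
    report=lambda version, ok: reports.append((version, ok)),
)
```

Keeping the result reporting behind a generic callable mirrors the point made above: a new data source can plug in its own orchestrator without the transformer needing to know how the experiment was run.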

Figure: Data Canary workflow.

2. Utilizing and Extending our Chaos Platform

Meeting the 10-minute constraint required not only leaning on our chaos platform, but also extending it to meet our needs:

Custom Threshold Tuning: We worked with our Resilience team to customize experiment thresholds for our use case. Standard chaos experiment thresholds were too conservative for our time constraints.

Multi-Tenant Testing: Our catalog service supports multiple client types with different traffic patterns and downstream dependencies. We ran separate experiments for major client types and discovered that running traffic through the tenant that handles playback requests consistently identified failures fastest.

Sticky Canaries: To isolate experiment traffic, sticky canaries use session affinity to guarantee that once a user’s traffic is routed to the baseline or canary clusters, it stays there for the duration of the experiment window. This prevents cross-contamination from concurrent chaos experiments, ensuring a clean apples-to-apples comparison between data versions.

Behavioral Metrics Over Technical Metrics: We focused on Starts Per Second (SPS), or actual customer playback attempts, as our primary signal. SPS proved more reliable than latency or error rates for detecting catalog corruption because it directly measures customer impact, and data errors may not always manifest as application errors to our catalog metadata service.

Immediate Abort on Regression: Instead of collecting data for post-hoc analysis, we stream metrics in real-time and abort experiments the moment we detect regression. This trades some statistical confidence for speed, but our tight thresholds and clear signal make this not only acceptable, but necessary.
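A minimal sketch of the abort-on-regression idea follows. It consumes paired baseline and canary SPS samples as they arrive and stops at the first sustained drop. The `watch_experiment` function, the 0.9 ratio, and the three-sample streak are hypothetical choices for illustration, not the thresholds or the statistics Netflix's chaos platform uses.

```python
from typing import Iterable, Optional, Tuple


def watch_experiment(samples: Iterable[Tuple[float, float]],
                     min_ratio: float = 0.9,
                     consecutive: int = 3) -> Tuple[str, Optional[int]]:
    """Consume (baseline_sps, canary_sps) samples as they stream in and abort
    as soon as the canary stays below min_ratio of baseline for
    `consecutive` samples in a row."""
    breaches = 0
    for index, (baseline_sps, canary_sps) in enumerate(samples):
        regressed = baseline_sps > 0 and canary_sps / baseline_sps < min_ratio
        breaches = breaches + 1 if regressed else 0
        if breaches >= consecutive:
            return ("abort", index)  # stop immediately; later samples are never read
    return ("pass", None)


healthy = [(100.0, 99.0), (102.0, 101.0), (98.0, 97.0), (101.0, 100.0)]
corrupted = [(100.0, 98.0), (100.0, 60.0), (100.0, 55.0), (100.0, 50.0), (100.0, 45.0)]

healthy_result = watch_experiment(healthy)
corrupted_result = watch_experiment(corrupted)
```

Requiring a short streak rather than a single bad sample is one simple way to trade a little detection time for fewer false aborts; the right balance depends on how noisy the signal is.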

3. Production-Hardened Edge Case Handling

Building a system that runs in production every 10 minutes taught us that the devil is in the details:

In-Flight Experiments During Redeployment: When the orchestrator restarts, it must detect and continue polling any ongoing experiments, as we can’t abandon a validation cycle mid-flight.

Leader Election: During orchestrator deployments, multiple instances might be running simultaneously. We implemented safeguards to ensure only one experiment is triggered per version announcement.

Version Synchronization: In a multi-tenant service where different clients consume data at different cadences, we track version state to ensure baseline and canary clusters are properly aligned before triggering experiments.
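Two of these edge cases, triggering exactly one experiment per version and resuming in-flight experiments after a restart, can be sketched with a claim-by-version record. This is an assumption-heavy illustration: `ExperimentStore` is a hypothetical in-memory stand-in, and a real deployment would need a shared datastore with an atomic put-if-absent operation rather than a Python dict.

```python
class ExperimentStore:
    """In-memory stand-in for a shared store with atomic put-if-absent."""

    def __init__(self) -> None:
        self._experiments = {}  # catalog version -> {"owner": ..., "status": ...}

    def claim(self, version: str, instance_id: str) -> bool:
        # Only the first instance to claim a version may trigger its experiment.
        if version in self._experiments:
            return False
        self._experiments[version] = {"owner": instance_id, "status": "running"}
        return True

    def finish(self, version: str, status: str) -> None:
        self._experiments[version]["status"] = status

    def in_flight(self) -> list:
        # Experiments a restarted orchestrator should keep polling.
        return sorted(v for v, e in self._experiments.items()
                      if e["status"] == "running")


store = ExperimentStore()

# During a deployment, old and new instances both see the v42 announcement.
old_claimed = store.claim("v42", "orchestrator-old")
new_claimed = store.claim("v42", "orchestrator-new")

# After the old instance shuts down, the new one picks up the running experiment.
resumable = store.in_flight()
store.finish("v42", "passed")
```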

Validating the Validator: Controlled Failure Injection

To prove the system worked, we needed to break things on purpose. We ran a series of controlled experiments where we deliberately corrupted catalog data — denylisting high-profile titles and simulating real data corruption scenarios — to validate that the canary could detect issues and block publication.

These experiments were coordinated as proactive incidents during business hours, with product operations teams on standby. We routed approximately 0.2% of global traffic through the validation flow, minimizing blast radius while still generating meaningful signal.
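The combination of a small traffic slice and sticky assignment can be illustrated with deterministic hashing. This is a sketch, not the chaos platform's routing logic: `assign_cluster`, the bucket counts, and the even split of the 0.2% between baseline and canary are assumptions made for the example.

```python
import hashlib

BUCKETS = 10_000
CANARY_BUCKETS = 10    # 0.1% of sessions
BASELINE_BUCKETS = 10  # 0.1% of sessions, 0.2% in total


def assign_cluster(session_id: str, experiment_id: str) -> str:
    """Deterministically map a session to a cluster so that repeated requests
    from the same session land in the same place for the whole experiment."""
    digest = hashlib.sha256(f"{experiment_id}:{session_id}".encode()).digest()
    bucket = int.from_bytes(digest[:8], "big") % BUCKETS
    if bucket < CANARY_BUCKETS:
        return "canary"
    if bucket < CANARY_BUCKETS + BASELINE_BUCKETS:
        return "baseline"
    return "production"


assignments = [assign_cluster(f"session-{i}", "exp-1") for i in range(100_000)]
in_experiment = sum(1 for cluster in assignments if cluster != "production")
```

Because the assignment depends only on the session and experiment identifiers, no routing state needs to be stored to keep a session pinned, and including the experiment identifier in the hash reshuffles which sessions are sampled from one experiment to the next.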

Key Results:

  • Detection Speed: Issues identified in 2.5–4 minutes depending on client type
  • Clear Signal: 10x error differential between canary and baseline
  • Automatic Blocking: Publishing workflow blocked as designed when regressions detected

The experiments validated our end-to-end workflow and revealed important operational insights: different client traffic patterns detect failures at different speeds, and threshold tuning requires careful refinement based on the magnitude of impact we want this system to detect. Most importantly, they proved that even with a 10-minute validation window, far shorter than traditional 30–60 minute canary analysis, we had sufficient signal to catch high-impact catalog corruption.

Bringing Code Validation Principles to Data

This effort wasn’t just about building a validation system; it was about recognizing that data deployments deserve the same rigor as code deployments. Just because something isn’t a binary doesn’t mean it can’t break production. The patterns we landed on aren’t specific to catalog metadata and can be applied to systems with high-velocity data pipelines more broadly.

If you’re working with data that changes frequently and impacts customers directly, ask yourself:

  • What’s your MTTD for data corruption?
  • Can you validate with production traffic safely?
  • How would you detect emergent issues in transformed data?
  • What behavioral metric most closely indicates customer impact in your domain?

Today, the failure mode that caused the aforementioned incident would be caught and mitigated in under 10 minutes. We all know outages aren’t a question of if, but when. The next time you find yourself faced with bad data, how fast will you be able to respond?

Acknowledgments

This work was a collaborative effort across multiple teams at Netflix. Special thanks to Jongyoon Lee, David Su, and Zubeen Lalani of the Catalog Foundations & Distribution team for their contributions to the design, and to Ales Plsek of the Resilience team for their support in customizing our chaos platform for our unique use case.


The Data Canary: How Netflix Validates Catalog Metadata was originally published in Netflix TechBlog on Medium, where people are continuing the conversation by highlighting and responding to this story.
