Sr. Content Developer at Microsoft, working remotely in PA, TechBash conference organizer, former Microsoft MVP, Husband, Dad and Geek.

Evolving Code Methodology: IF/ELSE Should Fit On One Screen

Ben Nadel continues to evolve his coding heuristics, preferring if/else blocks only when the control flow fits on a single screen....

Learning web development: Version control via Git and GitHub


In this chapter, we learn how to use the version control system Git and a useful companion website, GitHub. Both are important tools when programming in teams, but they also help programmers who work on their own.


Implementing DPoP with Auth0

Learn to implement DPoP with Auth0 to secure your SPA and API. This guide shows how to protect your tokens and prevent token replay attacks with Auth0's SDKs.


Generative AI in the Real World: Faye Zhang on Using AI to Improve Discovery


In this episode, Ben Lorica and AI Engineer Faye Zhang talk about discoverability: how to use AI to build search and recommendation engines that actually find what you want. Listen in to learn how AI goes way beyond simple collaborative filtering—pulling in many different kinds of data and metadata, including images and voice, to get a much better picture of what any object is and whether or not it’s something the user would want.

About the Generative AI in the Real World podcast: In 2023, ChatGPT put AI on everyone’s agenda. In 2025, the challenge will be turning those agendas into reality. In Generative AI in the Real World, Ben Lorica interviews leaders who are building with AI. Learn from their experience to help put AI to work in your enterprise.

Check out other episodes of this podcast on the O’Reilly learning platform.

Transcript

This transcript was created with the help of AI and has been lightly edited for clarity.

0:00: Today we have Faye Zhang of Pinterest, where she’s a staff AI engineer. And so with that, very welcome to the podcast.

0:14: Thanks, Ben. Huge fan of the work. I’ve been fortunate to attend both the Ray and NLP Summits. I know where you serve as chairs. I also love the O’Reilly AI podcast. The recent episode on A2A and the one with Raiza Martin on NotebookLM have been really inspirational. So, great to be here. 

0:33: All right, so let’s jump right in. So one of the first things I really wanted to talk to you about is this work around PinLanding. And you’ve published papers, but I guess at a high level, Faye, maybe describe for our listeners: What problem is PinLanding trying to address?

0:53: Yeah, that’s a great question. I think, in short, trying to solve this trillion-dollar discovery crisis. We’re living through the greatest paradox of the digital economy. Essentially, there’s infinite inventory but very little discoverability. Picture one example: A bride-to-be asks ChatGPT, “Now, find me a wedding dress for an Italian summer vineyard ceremony,” and she gets great general advice. But meanwhile, somewhere in Nordstrom’s hundreds of catalogs, there sits the perfect terracotta Soul Committee dress, never to be found. And that’s a $1,000 sale that will never happen. And if you multiply this by a billion searches across Google, SearchGPT, and Perplexity, we’re talking about a $6.5 trillion market, according to Shopify’s projections, where every failed product discovery is money left on the table. So that’s what we’re trying to solve—essentially solve the semantic organization of all platforms versus user context or search. 

2:05: So, before PinLanding was developed, and if you look across the industry and other companies, what would be the default—what would be the incumbent system? And what would be insufficient about this incumbent system?

2:22: There have been researchers across the past decade working on this problem; we're definitely not the first. I think number one is understanding catalog attribution. So, back in the day, there was multitask R-CNN generation, as we remember, [that could] identify fashion shopping attributes. You would pass an image into the system, and it would identify: okay, this shirt is red, and that material may be silk. And then, in recent years, because of the leverage of large-scale VLMs (vision language models), this problem has become much easier.

3:03: And then I think the second route people come in through is the content organization itself. Back in the day, [there was] research on joint graph modeling over shared similarity of attributes. And a lot of ecommerce stores also do, "Hey, if people like this, you might also like that," and that relationship graph gets captured in their organization tree as well. We utilize a vision large language model and the foundation model CLIP by OpenAI to easily recognize what a piece of content or clothing could be for. And then we connect that through LLMs to discover all the possibilities (scenarios, use cases, price points) to connect the two worlds together.
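
To make that concrete for listeners who haven't used CLIP, here is a minimal zero-shot tagging sketch using the Hugging Face transformers CLIP wrappers. It is an illustration only, not PinLanding's pipeline: the checkpoint, the candidate "use case" labels, and the image path are all assumptions.

```python
# Zero-shot attribute/use-case tagging with CLIP (illustrative sketch only).
# The model checkpoint, labels, and image path below are placeholder assumptions.
from PIL import Image
import torch
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

# Hypothetical candidate labels an LLM might generate for a catalog item.
labels = [
    "a dress for a summer vineyard wedding",
    "a cocktail dress for a rooftop party",
    "a casual linen dress for the beach",
]

image = Image.open("catalog_item.jpg")  # placeholder path
inputs = processor(text=labels, images=image, return_tensors="pt", padding=True)

with torch.no_grad():
    outputs = model(**inputs)

# logits_per_image holds image-to-text similarities; softmax turns them into scores.
scores = outputs.logits_per_image.softmax(dim=-1).squeeze()
for label, score in zip(labels, scores.tolist()):
    print(f"{score:.3f}  {label}")
```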

3:55: To me that implies you have some rigorous eval process or even a separate team doing eval. Can you describe to us at a high level what is eval like for a system like this? 

4:11: Definitely. I think there are internal and external benchmarks. For the external ones, it's Fashion200K, a public benchmark anyone can download from Hugging Face, which is a standard for how accurately your model predicts fashion items. We measure performance using recall at top-k, which checks whether the true label appears among the top-k predicted attributes, and as a result, we were able to see 99.7% recall at top ten.
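
For readers who want the metric spelled out: recall at top-k is simply the fraction of examples whose true label appears anywhere in the k highest-scoring predictions. A small sketch with made-up predictions (not Fashion200K data):

```python
# Recall@k: fraction of examples whose true label appears in the top-k predictions.
# The predictions and labels below are invented for illustration.
def recall_at_k(predictions, true_labels, k=10):
    hits = sum(
        1 for preds, truth in zip(predictions, true_labels)
        if truth in preds[:k]  # preds are assumed sorted by descending score
    )
    return hits / len(true_labels)

predictions = [
    ["red silk blouse", "red cotton shirt", "maroon top"],
    ["denim jacket", "leather jacket", "blazer"],
]
true_labels = ["red cotton shirt", "trench coat"]

print(recall_at_k(predictions, true_labels, k=3))  # 0.5: one of two labels is in the top 3
```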

4:47: The other topic I wanted to talk to you about is recommendation systems. So obviously there’s now talk about, “Hey, maybe we can go beyond correlation and go towards reasoning.” Can you [tell] our audience, who may not be steeped in state-of-the-art recommendation systems, how you would describe the state of recommenders these days?

5:23: For the past decade, [we’ve been] seeing tremendous movement from foundational shifts on how RecSys essentially operates. Just to call out a few big themes I’m seeing across the board: Number one, it’s kind of moving from correlation to causation. Back then it was, hey, a user who likes X might also like Y. But now we actually understand why contents are connected semantically. And our LLM AI models are able to reason about the user preferences and what they actually are. 

5:58: The second big theme is probably the cold start problem, where companies leverage semantic IDs to solve the new item by encoding content, understanding the content directly. For example, if this is a dress, then you understand its color, style, theme, etc. 

6:17: And I think there are other bigger themes we're seeing; for example, Netflix is moving from [an] isolated system into a unified intelligence. Just this past year, Netflix [consolidated] their multitask architectures with shared representations into one system they call UniCoRn, to enable company-wide improvement [and] optimizations.

6:44: And very lastly, I think on the frontier side—this is actually what I learned at the AI Engineer Summit from YouTube. It’s a DeepMind collaboration, where YouTube is now using a large recommendation model, essentially teaching Gemini to speak the language of YouTube: of, hey, a user watched this video, then what might [they] watch next? So a lot of very exciting capabilities happening across the board for sure. 

7:15: Generally it sounds like the themes from years past still map over in the following sense, right? So there’s content—the difference being now you have these foundation models that can understand the content that you have more granularly. It can go deep into the videos and understand, hey, this video is similar to this video. And then the other source of signal is behavior. So those are still the two main buckets?

7:53: Correct. Yes, I would say so. 

7:55: And so the foundation models help you on the content side but not necessarily on the behavior side?

8:03: I think it depends on how you want to see it. For example, on the embedding side, which is a kind of representation of a user entity, there have been transformations [since] back in the day with the BERT Transformer. Now it's got long-context encapsulation. And those are all with the help of LLMs. And so we can better understand users, not just the next or the last click, but "hey, [in the] next 30 days, what might a user like?"

8:31: I’m not sure this is happening, so correct me if I’m wrong. The other thing that I would imagine that the foundation models can help with is, I think for some of these systems—like YouTube, for example, or maybe Netflix is a better example—thumbnails are important, right? The fact now that you have these models that can generate multiple variants of a thumbnail on the fly means you can run more experiments to figure out user preferences and user tastes, correct? 

9:05: Yes. I would say so. I was lucky enough to be invited to one of the engineer network dinners, [and was] speaking with the engineer who actually works on the thumbnails. Apparently it was all personalized, and the approach you mentioned enabled their rapid iteration of experiments, and had definitely yielded very positive results for them. 

9:29: For the listeners who don’t work on recommendation systems, what are some general lessons from recommendation systems that generally map to other forms of ML and AI applications? 

9:44: Yeah, that’s a great question. A lot of the concepts still apply. For example, the knowledge distillation. I know Indeed was trying to tackle this. 

9:56: Maybe Faye, first define what you mean by that, in case listeners don’t know what that is. 

10:02: Yes. So knowledge distillation is essentially, in a modeling sense, learning from a larger parent model with more parameters and better world knowledge (and the same applies to ML systems), and distilling it into smaller models that can operate much faster but still, hopefully, encapsulate the learning from the parent model.

10:24: So I think what Indeed faced back then was the classic precision-versus-recall trade-off in production ML. Their binary classifier needs to filter out the bad job matches that would otherwise be recommended to candidates. But this process is obviously very noisy, and sparse training data and latency are also constraints. In the work they published, they couldn't really get effective separation on résumé content from Mistral and maybe Llama 2. And then they were happy to learn [that] out-of-the-box GPT-4 achieved something like 90% precision and recall. But obviously GPT-4 is more expensive and has close to 30 seconds of inference time, which is much slower.

11:21: So I think what they did was use the distillation concept to fine-tune GPT-3.5 on labeled data and then distill it into a lightweight BERT-based model using a temperature-scaled softmax, and they were able to achieve millisecond latency with a comparable precision-recall trade-off. So I think that's one of the lessons we see across the industry: The traditional ML techniques still work in the age of AI. And I think we're going to see a lot more of this in production work as well.
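
As a rough sketch of the temperature-scaled softmax distillation described here, a student model is trained to match the teacher's softened output distribution while still fitting the hard labels. This is a generic PyTorch illustration, not Indeed's training code; the temperature and mixing weight are assumed values.

```python
# Generic knowledge-distillation loss sketch (PyTorch). Not Indeed's implementation;
# temperature T and mixing weight alpha are assumed hyperparameters.
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, labels, T=2.0, alpha=0.5):
    # Soft targets: the teacher's distribution softened by temperature T.
    soft_targets = F.softmax(teacher_logits / T, dim=-1)
    soft_student = F.log_softmax(student_logits / T, dim=-1)
    # The KL term is scaled by T^2 so its gradients keep a comparable magnitude.
    kd = F.kl_div(soft_student, soft_targets, reduction="batchmean") * (T * T)
    # Hard-label term: ordinary cross-entropy against the ground truth.
    ce = F.cross_entropy(student_logits, labels)
    return alpha * kd + (1 - alpha) * ce

# Toy usage: a batch of 4 examples, 2 classes (e.g., good match / bad match).
teacher_logits = torch.randn(4, 2)  # would come from the fine-tuned teacher model
student_logits = torch.randn(4, 2, requires_grad=True)
labels = torch.tensor([0, 1, 1, 0])
loss = distillation_loss(student_logits, teacher_logits, labels)
loss.backward()
```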

11:57: By the way, one of the underappreciated things in the recommendation system space is actually UX in some ways, right? Because basically good UX for delivering the recommendations actually can move the needle. How you actually present your recommendations might make a material difference.  

12:24: I think that’s very much true. Although I can’t claim to be an expert on it because I know most recommendation systems deal with monetization, so it’s tricky to put, “Hey, what my user clicks on, like engage, send via social, versus what percentage of that…

12:42: And it’s also very platform specific. So you can imagine TikTok as one single feed—the recommendation is just on the feed. But YouTube is, you know, the stuff on the side or whatever. And then Amazon is something else. Spotify and Apple [too]. Apple Podcast is something else. But in each case, I think those of us on the outside underappreciate how much these companies invest in the actual interface.

13:18: Yes. And I think there are multiple iterations happening on any day, [so] you might see a different interface than your friends or family because you’re actually being grouped into A/B tests. I think this is very much true of [how] the engagement and performance of the UX have an impact on a lot of the search/rec system as well, beyond the data we just talked about. 

13:41: Which brings to mind another topic that is also something I’ve been interested in, over many, many years, which is this notion of experimentation. Many of the most successful companies in the space actually have invested in experimentation tools and experimentation platforms, where people can run experiments at scale. And those experiments can be done much more easily and can be monitored in a much more principled way so that any kind of things they do are backed by data. So I think that companies underappreciate the importance of investing in such a platform. 

14:28: I think that’s very much true. A lot of larger companies actually build their own in-house A/B testing experiment or testing frameworks. Meta does; Google has their own and even within different cohorts of products, if you’re monetization, social. . . They have their own niche experimentation platform. So I think that thesis is very much true. 

14:51: The last topic I wanted to talk to you about is context engineering. I’ve talked to numerous people about this. So every six months, the context window for these large language models expands. But obviously you can’t just stuff the context window full, because one, it’s inefficient. And two, actually, the LLM can still make mistakes because it’s not going to efficiently process that entire context window anyway. So talk to our listeners about this emerging area called context engineering. And how is that playing out in your own work? 

15:38: I think this is a fascinating topic, where you will hear people passionately say, “RAG is dead.” And it’s really, as you mentioned, [that] our context window gets much, much bigger. Like, for example, back in April, Llama 4 had this staggering 10 million token context window. So the logic behind this argument is quite simple. Like if the model can indeed handle millions of tokens, why not just dump everything instead of doing a retrieval?

16:08: I think there are quite a few fundamental limitations to this. I know folks from Contextual AI are passionate about this. I think number one is scalability. A lot of times in production, at least, your knowledge base is measured in terabytes or petabytes, not tokens; something far larger. And number two, I think, would be accuracy.

16:33: The effective context window is very different, honestly, between what we see and what is advertised in product launches. We see performance degrade long before the model reaches its "official limits." And then I think number three is probably efficiency, and that aligns with our human behavior as well: Do you read an entire book every time you need to answer one simple question? So I think context engineering [has] slowly evolved from a buzzword a few years ago to an engineering discipline now.

17:15: I’m appreciative that the context windows are increasing. But at some level, I also acknowledge that, to some extent, it’s a feel-good move on the part of the model builders. It makes us feel good that we can put more things in there, but it may not actually help us answer the question precisely. Actually, a few years ago, I wrote a kind of tongue-in-cheek post called “Structure Is All You Need.” Basically, whatever structure you have, you should use it to help the model, right? If the data is in a SQL database, then maybe you can expose the structure of the data. If it’s a knowledge graph, you leverage whatever structure you have to give the model better context. So the objections you gave to this whole notion of just stuffing the model with as much information as possible are valid. But also, philosophically, it doesn’t make any sense to do that anyway.

18:30: What are the things that you are looking forward to, Faye, in terms of foundation models? What kinds of developments in the foundation model space are you hoping for? And are there any developments that you think are below the radar? 

18:52: I think, to better utilize the concept of “context engineering,” there are essentially two loops. Number one is the inner loop: what happens within the LLMs. And then there’s the outer loop: what you can do as an engineer to optimize a given context window, etc., to get the best results out of the product. Within the context loop, there are multiple tricks we can use: For example, there’s vector plus lexical retrieval, or regex extraction. There are metadata filters. And then for the outer loop—this is a very common practice—people are using LLMs as a reranker, sometimes a cross-encoder. So the thesis is, hey, why would you overburden an LLM with ranking 20,000 candidates when there are things you can do to reduce it to the top hundred or so? So all of this—context assembly, deduplication, and diversification—helps take production [work] from a prototype to something [that’s] more real time, reliable, and able to scale much further.
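
To illustrate the "narrow the candidates before the expensive model sees them" idea, here is a minimal retrieve-then-rerank sketch. Everything in it is a placeholder assumption (random stand-in embeddings, a toy overlap-based reranker, a synthetic corpus) rather than any production system: cheap vector retrieval cuts 20,000 candidates down to a shortlist, and only the shortlist goes to the costly reranking stage.

```python
# Retrieve-then-rerank sketch: cheap vector retrieval narrows a large corpus to a
# shortlist, and only the shortlist is passed to an expensive reranker.
# All models, data, and scoring here are placeholder illustrations.
import numpy as np

def embed(texts):
    # Stand-in for a real embedding model; returns random unit vectors.
    rng = np.random.default_rng(0)
    vecs = rng.normal(size=(len(texts), 384))
    return vecs / np.linalg.norm(vecs, axis=1, keepdims=True)

def rerank(query, candidates):
    # Stand-in for a cross-encoder or LLM reranker scoring (query, candidate) pairs;
    # here it just sorts by word overlap with the query.
    q_words = set(query.split())
    return sorted(candidates, key=lambda c: len(q_words & set(c.split())), reverse=True)

corpus = [f"document {i} about topic {i % 7}" for i in range(20_000)]
query = "document about topic 3"

# Stage 1: vector retrieval over the full corpus, keep the top 100.
corpus_vecs = embed(corpus)
query_vec = embed([query])[0]
top100_idx = np.argsort(corpus_vecs @ query_vec)[-100:]
shortlist = [corpus[i] for i in top100_idx]

# Stage 2: expensive reranking only over the shortlist.
final = rerank(query, shortlist)[:10]
print(final[:3])
```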

20:07: One of the things I wish—and I don’t know, this is wishful thinking—is maybe if the models could be a little more predictable, that would be nice. By that I mean, if I ask a question in two different ways, it’ll basically give me the same answer. The foundation model builders could somehow increase predictability and maybe provide us with a little more explanation for how they arrive at the answer. I understand they’re giving us the tokens, and maybe some of the reasoning models are a little more transparent, but give us an idea of how these things work, because it’ll impact what kinds of applications we’d be comfortable deploying these things in. For example, agents: If I’m using an agent to use a bunch of tools but I can’t really predict its behavior, that impacts the types of applications I’d be comfortable using a model for.

21:18: Yeah, definitely. I very much resonate with this, especially now that most engineers have, you know, AI-empowered coding tools like Cursor and Windsurf—and as an individual, I very much appreciate the train of thought you mentioned: why an agent does certain things. Why is it navigating between repositories? What is it looking at while it’s making this call? I think these are very much appreciated. I know there are other approaches—look at Devin, the fully autonomous engineering peer. It just takes things and runs, and you don’t know where it goes. But I think in the near future there will be a nice marriage between the two, especially now that Windsurf is part of Devin’s parent company.

22:05: And with that, thank you, Faye.

22:08: Awesome. Thank you, Ben.




Why experts writing AI evals is creating the fastest-growing companies in history | Brendan Foody (CEO of Mercor)


Brendan Foody is the CEO and co-founder of Mercor, the fastest-growing company in history to go from $1M to $500M in revenue (in just 17 months!). At 22, he is also the youngest American unicorn founder ever. Mercor works with 6 of the Magnificent 7 and all top 5 AI labs to help them hire experts to create evaluations and training data that improve their models. In this conversation, Brendan explains why evals have become the critical bottleneck for AI progress, how he discovered this massive opportunity, and what the future of work might look like in an AI-driven economy.

What you’ll learn:

1. Why evals are becoming the primary bottleneck for AI progress and what this means for AI startups

2. How Mercor grew to $500M revenue in 17 months (fastest in history)

3. Brendan’s meeting with xAI that changed his company’s trajectory

4. Which skills and jobs will remain most valuable as AI continues to advance (hint: jobs with “elastic” demand)

5. Why Brendan believes AGI and superintelligence are not happening anytime soon

6. The three unique core values that drove Mercor’s success

7. How Harvard Lampoon writers are making Claude funnier

Brought to you by:

WorkOS—Modern identity platform for B2B SaaS, free up to 1 million MAUs

Jira Product Discovery—Atlassian’s new prioritization and roadmapping tool built for product teams

Enterpret—Transform customer feedback into product growth

Transcript: https://www.lennysnewsletter.com/p/experts-writing-ai-evals-brendan-foody

My biggest takeaways (for paid newsletter subscribers): https://www.lennysnewsletter.com/i/173303790/my-biggest-takeaways-from-this-conversation

Where to find Brendan Foody:

• X: https://x.com/BrendanFoody

• LinkedIn: https://www.linkedin.com/in/brendan-foody-2995ab10b/

Where to find Lenny:

• Newsletter: https://www.lennysnewsletter.com

• X: https://twitter.com/lennysan

• LinkedIn: https://www.linkedin.com/in/lennyrachitsky/

In this episode, we cover:

(00:00) Introduction to Brendan Foody and Mercor

(05:38) The “era of evals”

(09:26) Understanding the AI training landscape

(17:10) The future of work and AI

(25:54) The evolution of labor markets

(29:55) Understanding how AI models are trained

(38:58) Building Mercor

(53:27) Lessons from past ventures

(56:55) The future of AI and model improvement

(01:00:41) His personal use of AI and final thoughts

References: https://www.lennysnewsletter.com/p/experts-writing-ai-evals-brendan-foody

Production and marketing by https://penname.co/. For inquiries about sponsoring the podcast, email podcast@lennyrachitsky.com.

Lenny may be an investor in the companies discussed.



To hear more, visit www.lennysnewsletter.com



Download audio: https://api.substack.com/feed/podcast/173303790/ef273ebb0bff431d6b63d171656b90ab.mp3

The Marathon Mindset—Building Agile Teams That Last Beyond Sprint Deadlines | Shawn Dsouza


Shawn Dsouza: The Marathon Mindset—Building Agile Teams That Last Beyond Sprint Deadlines

Read the full Show Notes and search through the world's largest audio library on Agile and Scrum directly on the Scrum Master Toolbox Podcast website: http://bit.ly/SMTP_ShowNotes.

Shawn defines himself as a "people-first Scrum Master" who measures success not through metrics but through daily interactions and team growth. He contrasts two teams: one that hit deadlines but lacked collaboration (unsustainable success) versus another that struggled with deadlines but excelled in conversations and continuous improvement (sustainable growth). For Shawn, protecting deep work and fostering genuine team collaboration indicates true success. He emphasizes that product development is a marathon, not a sprint, and warns that lack of meaningful conversations will inevitably lead to team problems.

In this segment, we refer to the book Clean Language by Sullivan and Rees

Featured Retrospective Format for the Week: Sprint Awards

Shawn champions the Sprint Awards retrospective format, moving beyond viewing retrospectives as just another Scrum event to recognizing them as critical team development opportunities. In this format, team members give awards to colleagues for various contributions during the sprint, with each award recipient explaining why they were chosen. Shawn prefers face-to-face, offline retrospectives and always starts with ice breakers to gauge how the team feels—whether they feel heard and connected. He believes in experimenting with different retrospective formats since no single approach works for every situation.

Self-reflection Question: How do you balance achieving deliverable outcomes with building sustainable team relationships and collaboration patterns?

[The Scrum Master Toolbox Podcast Recommends]

🔥In the ruthless world of fintech, success isn’t just about innovation—it’s about coaching!🔥

Angela thought she was just there to coach a team. But now, she’s caught in the middle of a corporate espionage drama that could make or break the future of digital banking. Can she help the team regain their mojo and outwit their rivals, or will the competition crush their ambitions? As alliances shift and the pressure builds, one thing becomes clear: this isn’t just about the product—it’s about the people.

🚨 Will Angela’s coaching be enough? Find out in Shift: From Product to People—the gripping story of high-stakes innovation and corporate intrigue.

Buy Now on Amazon

[The Scrum Master Toolbox Podcast Recommends]

About Shawn Dsouza
Shawn, a Mangalore native and Software Technology postgraduate from AIMIT, brings 8+ years of IT expertise, excelling as a Scrum Master fostering innovation and teamwork. Beyond technology, he leads SPARK, a social service initiative, and pursues his passion as an aquarist, nurturing vibrant aquatic ecosystems with dedication.

You can link with Shawn Dsouza on LinkedIn





Download audio: https://traffic.libsyn.com/secure/scrummastertoolbox/20250918_Shawn_Dsouza_Thu.mp3?dest-id=246429