In this episode, Ben Lorica and Drew Breunig, a strategist at the Overture Maps Foundation, talk all things context engineering: what’s working, where things are breaking down, and what comes next. Listen in to hear why huge context windows aren’t solving the problems we hoped they might, why companies shouldn’t discount evals and testing, and why we’re doing the field a disservice by leaning into marketing and buzzwords rather than trying to leverage what current crop of LLMs are actually capable of.
About the Generative AI in the Real World podcast: In 2023, ChatGPT put AI on everyone’s agenda. In 2025, the challenge will be turning those agendas into reality. In Generative AI in the Real World, Ben Lorica interviews leaders who are building with AI. Learn from their experience to help put AI to work in your enterprise.
Check out other episodes of this podcast on the O’Reilly learning platform.
Transcript
This transcript was created with the help of AI and has been lightly edited for clarity.
00.00: All right. So today we have Drew Breunig. He is a strategist at the Overture Maps Foundation. And he’s also in the process of writing a book for O’Reilly called the Context Engineering Handbook. And with that, Drew, welcome to the podcast.
00.23: Thanks, Ben. Thanks for having me on here.
00.26: So context engineering. . . I remember before ChatGPT was even released, someone was talking to me about prompt engineering. I said, “What’s that?” And then of course, fast-forward to today, now people are talking about context engineering. And I guess the short definition is it’s the delicate art and science of filling the context window with just the right information. What’s broken with how teams think about context today?
00.56: I think it’s important to talk about why we need a new word or why a new word makes sense. I was just talking with Mike Taylor, who wrote the prompt engineering book for O’Reilly, exactly about this and why we need a new word. Why is prompt engineering not good enough? And I think it has to do with the way the models and the way they’re being built is evolving. I think it also has to deal with the way that we’re learning how to use these models.
And so prompt engineering was a natural word to think about when your interaction and how you program the model was maybe one turn of conversation, maybe two, and you might pull in some context to give it examples. You might do some RAG and context augmentation, but you’re working with this one-shot service. And that was really similar to the way people were working in chatbots. And so prompt engineering started to evolve as this thing.
02.00: But as we started to build agents and as companies started to develop models that were capable of multiturn tool-augmented reasoning usage, suddenly you’re not using that one prompt. You have a context that is sometimes being prompted by you, sometimes being modified by your software harness around the model, sometimes being modified by the model itself. And increasingly the model is starting to manage that context. And that prompt is very user-centric. It is a user giving that prompt.
But when we start to have these multiturn systematic editing and preparation of contexts, a new word was needed, which is this idea of context engineering. This is not to belittle prompt engineering. I think it’s an evolution. And it shows how we’re evolving and finding this space in real time. I think context engineering is more suited to agents and applied AI programing, whereas prompt engineering lives in how people use chatbots, which is a different field. It’s not better and not worse.
And so context engineering is more specific to understanding the failure modes that occur, diagnosing those failure modes and establishing good practices for both preparing your context but also setting up systems that fix and edit your context, if that makes sense.
03.33: Yeah, and also, it seems like the words themselves are indicative of the scope, right? So “prompt” engineering means it’s the prompt. So you’re fiddling with the prompt. And [with] context engineering, “context” can be a lot of things. It could be the information you retrieve. It might involve RAG, so you retrieve information. You put that in the context window.
04.02: Yeah. And people were doing that with prompts too. But I think in the beginning we just didn’t have the words. And that word became a big empty bucket that we filled up. You know, the quote I always quote too often, but I find it fitting, is one of my favorite quotes from Stuart Brand, which is, “If you want to know where the future is being made, follow where the lawyers are congregating and the language is being invented,” and the arrival of context engineering as a word came after the field was invented. It just kind of crystallized and demarcated what people were already doing.
04.36: So the word “context” means you’re providing context. So context could be a tool, right? It could be memory. Whereas the word “prompt” is much more specific.
04.55: And I think it also is like, it has to be edited by a person. I’m a big advocate for not using anthropomorphizing words around large language models. “Prompt” to me involves agency. And so I think it’s nice—it’s a good delineation.
05.14: And then I think one of the very immediate lessons that people realize is, just because. . .
So one of the things that these model providers do when they have a model release, one of the things they note is, What’s the size of the context window? So people started associating context window [with] “I stuff as much as I can in there.” But the reality is actually that, one, it’s not efficient. And two, it also is not useful to the model. Just because you have a massive context window doesn’t mean that the model treats the entire context window evenly.
05.57: Yeah, it doesn’t treat it evenly. And it’s not a one-size-fits-all solution. So I don’t know if you remember last year, but that was the big dream, which was, “Hey, we’re doing all this work with RAG and augmenting our context. But wait a second, if we can make the context 1 million tokens, 2 million tokens, I don’t have to run RAG on all of my corporate documents. I can just fit it all in there, and I can constantly be asking this. And if we can do this, we essentially have solved all of the hard problems that we were worrying about last year.” And so that was the big hope.
And you started to see an arms race of everybody trying to make bigger and bigger context windows to the point where, you know, Llama 4 had its spectacular flameout. It was rushed out the door. But the headline feature by far was “We will be releasing a 10 million token context window.” And the thing that everybody realized is. . . Like, all right, we were really hopeful for that. And then as we started building with these context windows, we started to realize there were some big limitations around them.
07.01: Perhaps the thing that clicked for me was in Google’s Gemini 2.5 paper. Fantastic paper. And one of the reasons I love it is because they dedicate about four pages in the appendix to talking about the kind of methodology and harnesses they built so that they could teach Gemini to play Pokémon: how to connect it to the game, how to actually read out the state of the game, how to make choices about it, what tools they gave it, all of these other things.
And buried in there was a real “warts and all” case study, which are my favorite when you talk about the hard things and especially when you cite the things you can’t overcome. And Gemini 2.5 was a million-token context window with, eventually, 2 million tokens coming. But in this Pokémon thing, they said, “Hey, we actually noticed something, which is once you get to about 200,000 tokens, things start to fall apart, and they fall apart for a host of reasons. They start to hallucinate. One of the things that is really demonstrable is they start to rely more on the context knowledge than the weights knowledge.
08.22: So inside every model there’s a knowledge base. There’s, you know, all of these other things that get kind of buried into the parameters. But when you reach a certain level of context, it starts to overload the model, and it starts to rely more on the examples in the context. And so this means that you are not taking advantage of the full strength or knowledge of the model.
08.43: So that’s one way it can fail. We call this “context distraction,” though Kelly Hong at Chroma has written an incredible paper documenting this, which she calls “context rot,” which is a similar way [of] charting when these benchmarks start to fall apart.
Now the cool thing about this is that you can actually use this to your advantage. There’s another paper out of, I believe, the Harvard Interaction Lab, where they look at these inflection points for. . .
09.13: Are you familiar with the term “in-context learning”? In-context learning is when you teach the model to do something that doesn’t know how to do by providing examples in your context. And those examples illustrate how it should perform. It’s not something that it’s seen before. It’s not in the weights. It’s a completely unique problem.
Well, sometimes those in-context learning[s] are counter to what the model has learned in the weights. So they end up fighting each other, the weights and the context. And this paper documented that when you get over a certain context length, you can overwhelm the weights and you can force it to listen to your in-context examples.
09.57: And so all of this is just to try to illustrate the complexity of what’s going on here and how I think one of the traps that leads us to this place is that the gift and the curse of LLMs is that we prompt and build contexts that are in the English language or whatever language you speak. And so that leads us to believe that they’re going to react like other people or entities that read the English language.
And the fact of the matter is, they don’t—they’re reading it in a very specific way. And that specific way can vary from model to model. And so you have to systematically approach this to understand these nuances, which is where the context management field comes in.
10.35: This is interesting because even before those papers came out, there were studies which showed the exact opposite problem, which is the following: You may have a RAG system that actually retrieves the right information, but then somehow the LLMs can still fail because, as you alluded to, they have weights so they have prior beliefs. You saw something [on] the internet, and they will opine against the precise information you retrieve from the context.
11.08: This is a really big problem.
11.09: So this is true even if the context window’s small actually.
11.13: Yeah, and Ben, you touched on something that’s really important. So in my original blog post, I document four ways that context fails. I talk about “context poisoning.” That’s when you hallucinate something in a long-running task and it stays in there, and so it’s continually confusing it. “Context distraction,” which is when you overwhelm that soft limit to the context window and then you start to perform poorly. “Context confusion”: This is when you put things that aren’t relevant to the task inside your context, and suddenly they think the model thinks that it has to pay attention to this stuff and it leads them astray. And then the last thing is “context clash,” which is when there’s information in the context that’s at odds with the task that you are trying to perform.
A good example of this is, say you’re asking the model to only reply in JSON, but you’re using MCP tools that are defined with XML. And so you’re creating this backwards thing. But I think there’s a fifth piece that I need to write about because it keeps coming up. And it’s exactly what you described.
12.23: Douwe [Kiela] over at Contextual AI refers to this as “context” or “prompt adherence.” But the term that keeps sticking in my mind is this idea of fighting the weights. There’s three situations you get yourself into when you’re interacting with an LLM. The first is when you’re working with the weights. You’re asking it a question that it knows how to answer. It’s seen many examples of that answer. It has it in its knowledge base. It comes back with the weights, and it can give you a phenomenal, detailed answer to that question. That’s what I call “working with the weights.”
The second is what we referred to earlier, which is that in-context learning, which is you’re doing something that it doesn’t know about and you’re showing an example, and then it does it. And this is great. It’s wonderful. We do it all the time.
But then there’s a third example which is, you’re providing it examples. But those examples are at odds with some things that it had learned usually during posttraining, during the fine-tuning or RL stage. A really good example is format outputs.
13.34: Recently a friend of mine was updating his pipeline to try out a new model, Moonshots. A really great model and really great model for tool use. And so he just changed his model and hit run to see what happened. And he kept failing—his thing couldn’t even work. He’s like, “I don’t understand. This is supposed to be the best tool use model there is.” And he asked me to look at his code.
I looked at his code and he was extracting data using Markdown, essentially: “Put the final answer in an ASCII box and I’ll extract it that way.” And I said, “If you change this to XML, see what happens. Ask it to respond in XML, use XML as your formatting, and see what happens.” He did that. That one change passed every test. Like basically crushed it because it was working with the weights. He wasn’t fighting the weights. Everyone’s experienced this if you build with AI: the stubborn things it refuses to do, no matter how many times you ask it, including formatting.
14.35: [Here’s] my favorite example of this though, Ben: So in ChatGPT’s web interface or their application interface, if you go there and you try to prompt an image, a lot of the images that people prompt—and I’ve talked to user research about this—are really boring prompts. They have a text box that can be anything, and they’ll say something like “a black cat” or “a statue of a man thinking.”
OpenAI realized this was leading to a lot of bad images because the prompt wasn’t detailed; it wasn’t a good prompt. So they built a system that recognizes if your prompt is too short, low detail, bad, and it hands it to another model and says, “Improve this prompt,” and it improves the prompt for you. And if you inspect in Chrome or Safari or Firefox, whatever, you inspect the developer settings, you can see the JSON being passed back and forth, and you can see your original prompt going in. Then you can see the improved prompt.
15.36: My favorite example of this [is] I asked it to make a statue of a man thinking, and it came back and said something like “A detailed statue of a human figure in a thinking pose similar to Rodin’s ‘The Thinker.’ The statue is made of weathered stone sitting on a pedestal. . .” Blah blah blah blah blah blah. A paragraph. . . But below that prompt there were instructions to the chatbot or to the LLM that said, “Generate this image and after you generate the image, do not reply. Do not ask follow up questions. Do not ask. Do not make any comments describing what you’ve done. Just generate the image.” And in this prompt, then nine times, some of them in all caps, they say, “Please do not reply.” And the reason is because a big chunk of OpenAI’s posttraining is teaching these models how to converse back and forth. They want you to always be asking a follow-up question and they train it. And so now they have to fight the prompts. They have to add in all these statements. And that’s another way that fails.
16.42: So why I bring this up—and this is why I need to write about it—is as an applied AI developer, you need to recognize when you’re fighting the prompt, understand enough about the posttraining of that model, or make some assumptions about it, so that you can stop doing that and try something different, because you’re just banging your head against a wall and you’re going to get inconsistent, bad applications and the same statement 20 times over.
17.07: By the way, the other thing that’s interesting about this whole topic is, people actually somehow have underappreciated or forgotten all of the progress we’ve made in information retrieval. There’s a whole. . . I mean, these people have their own conferences, right? Everything from reranking to the actual indexing, even with vector search—the information retrieval community still has a lot to offer, and it’s the kind of thing that people underappreciated. And so by simply loading your context window with massive amounts of garbage, you’re actually, leaving on the field so much progress in information retrieval.
18.04: I do think it’s hard. And that’s one of the risks: We’re building all this stuff so fast from the ground up, and there’s a tendency to just throw everything into the biggest model possible and then hope it sorts it out.
I really do think there’s two pools of developers. There’s the “throw everything in the model” pool, and then there’s the “I’m going to take incremental steps and find the most optimal model.” And I often find that latter group, which I called a compound AI group after a paper that was published out of Berkeley, those tend to be people who have run data pipelines, because it’s not just a simple back and forth interaction. It’s gigabytes or even more of data you’re processing with the LLM. The costs are high. Latency is important. So designing efficient systems is actually incredibly key, if not a total requirement. So there’s a lot of innovation that comes out of that space because of that kind of boundary.
19.08: If you were to talk to one of these applied AI teams and you were to give them one or two things that they can do right away to improve, or fix context in general, what are some of the best practices?
19.29: Well you’re going to laugh, Ben, because the answer is dependent on the context, and I mean the context in the team and what have you.
19.38: But if you were to just go give a keynote to a general audience, if you were to list down one, two, or three things that are the lowest hanging fruit, so to speak. . .
19.50: The first thing I’m gonna do is I’m going to look in the room and I’m going to look at the titles of all the people in there, and I’m going to see if they have any subject-matter experts or if it’s just a bunch of engineers trying to build something for subject-matter experts. And my first bit of advice is you need to get yourself a subject-matter expert who is looking at the data, helping you with the eval data, and telling you what “good” looks like.
I see a lot of teams that don’t have this, and they end up building fairly brittle prompt systems. And then they can’t iterate well, and so that enterprise AI project fails. I also see them not wanting to open themselves up to subject-matter experts, because they want to hold on to the power themselves. It’s not how they’re used to building.
20.38: I really do think building in applied AI has changed the power dynamic between builders and subject-matter experts. You know, we were talking earlier about some of like the old Web 2.0 days and I’m sure you remember. . . Remember back at the beginning of the iOS app craze, we’d be at a dinner party and someone would find out that you’re capable of building an app, and you would get cornered by some guy who’s like “I’ve got a great idea for an app,” and he would just talk at you—usually a he.
21.15: This is back in the Objective-C days. . .
21.17: Yes, way back when. And this is someone who loves Objective-C. So you’d get cornered and you’d try to find a way out of that awkward conversation. Nowadays, that dynamic has shifted. The subject-matter expertise is so important for codifying and designing the spec, which usually gets specced out by the evals that it leads itself to more. And you can even see this. OpenAI is arguably creating and at the forefront of this stuff. And what are they doing? They’re standing up programs to get lawyers to come in, to get doctors to come in, to get these specialists to come in and help them create benchmarks because they can’t do it themselves. And so that’s the first thing. Got to work with the subject-matter expert.
22.04: The second thing is if they’re just starting out—and this is going to sound backwards, given our topic today—I would encourage them to use a system like DSPy or GEPA, which are essentially frameworks for building with AI. And one of the components of that framework is that they optimize the prompt for you with the help of an LLM and your eval data.
22.37: Throw in BAML?
22.39: BAML is similar [but it’s] more like the spec for how to describe the entire spec. So it’s similar.
22.52: BAML and TextGrad?
22.55: TextGrad is more like the prompt optimization I’m talking about.
22:57: TextGrad plus GEPA plus Regolo?
23.02: Yeah, those things are really important. And the reason I say they’re important is. . .
23.08: I mean, Drew, those are kind of advanced topics.
23.12: I don’t think they’re that advanced. I think they can appear really intimidating because everybody comes in and says, “Well, it’s so easy. I could just write what I want.” And this is the gift and curse of prompts, in my opinion. There’s a lot of things to like about.
23.33: DSPy is fine, but I think TextGrad, GEPA, and Regolo. . .
23.41: Well. . . I wouldn’t encourage you to use GEPA directly. I would encourage you to use it through the framework of DSPy.
23.48: The point here is if it’s a team building, you can go down essentially two paths. You can handwrite your prompt, and I think this creates some issues. One is as you build, you tend to have a lot of hotfix statements like, “Oh, there’s a bug over here. We’ll say it over here. Oh, that didn’t fix it. So let’s say it again.” It will encourage you to have one person who really understands this prompt. And so you end up being reliant on this prompt magician. Even though they’re written in English, there’s kind of no syntax highlighting. They get messier and messier as you build the application because they start to grow and become these growing collections of edge cases.
24.27: And the other thing too, and this is really important, is when you build and you spend so much time honing a prompt, you’re doing it against one model, and then at some point there’s going to be a better, cheaper, more effective model. And you’re going to have to go through the process of tweaking it and fixing all the bugs again, because this model functions differently.
And I used to have to try to convince people that this was a problem, but they all kind of found out when OpenAI deprecated all of their models and tried to move everyone over to GPT-5. And now I hear about it all the time.
25.03: Although I think right now “agents” is our hot topic, right? So we talk to people about agents and you start really getting into the weeds, you realize, “Oh, okay. So their agents are really just prompts.”
25.16: In the loop. . .
25.19: So agent optimization in many ways means injecting a bit more software engineering rigor in how you maintain and version. . .
25.30: Because that context is growing. As that loop goes, you’re deciding what gets added to it. And so you have to put guardrails in—ways to rescue from failure and figure out all these things. It’s very difficult. And you have to go at it systematically.
25.46: And then the problem is that, in many situations, the models are not even models that you control, actually. You’re using them through an API like OpenAI or Claude so you don’t actually have access to the weights. So even if you’re one of the super, super advanced teams that can do gradient descent and backprop, you can’t do that. Right? So then, what are your options for being more rigorous in doing optimization?
Well, it’s precisely these tools that Drew alluded to, which is the TextGrads of the world, the GEPA. You have these compound systems that are nondifferentiable. So then how do you actually do optimization in a world where you have things that are not differentiable? Right. So these are precisely the tools that will allow you to turn it from somewhat of a, I guess, black art to something with a little more discipline.
26.53: And I think a good example is, even if you aren’t going to use prompt optimization-type tools. . . The prompt optimization is a great solution for what you just described, which is when you can’t control the weights of the models you’re using. But the other thing too, is, even if you aren’t going to adopt that, you need to get evals because that’s going to be step one for anything, which is you need to start working with subject-matter experts to create evals.
27.22: Because what I see. . . And there was just a really dumb argument online of “Are evals worth it or not?” And it was really silly to me because it was positioned as an either-or argument. And there were people arguing against evals, which is just insane to me. And the reason they were arguing against evals is they’re basically arguing in favor of what they called, to your point about dark arts, vibe shipping—which is they’d make changes, push those changes, and then the person who was also making the changes would go in and type in 12 different things and say, “Yep, feels right to me.” And that’s insane to me.
27.57: And even if you’re doing that—which I think is a good thing and you may not go create coverage and eval, you have some taste. . . And I do think when you’re building more qualitative tools. . . So a good example is like if you’re Character.AI or you’re Portola Labs, who’s building essentially personalized emotional chatbots, it’s going to be harder to create evals and it’s going to require taste as you build them. But having evals is going to ensure that your whole thing didn’t fall apart because you changed one sentence, which sadly is a risk because these are probabilistic software.
28.33: Honestly, evals are super important. Number one, because, basically, leaderboards like LMArena are great for narrowing your options. But at the end of the day, you still need to benchmark all of these against your own application use case and domain. And then secondly, obviously, it’s an ongoing thing. So it ties in with reliability. The more reliable your application is, that means most likely you’re doing evals properly in an ongoing fashion. And I really believe that eval and reliability are a moat, because basically what else is your moat? Prompt? That’s not a moat.
29.21: So first off, violent agreement there. The only asset teams truly have—unless they’re a model builder, which is only a handful—is their eval data. And I would say the counterpart to that is their spec, whatever defines their program, but mostly the eval data. But to the other point about it, like why are people vibe shipping? I think you can get pretty far with vibe shipping and it fools you into thinking that that’s right.
We saw this pattern in the Web 2.0 and social era, which was, you would have the product genius—everybody wanted to be the Steve Jobs, who didn’t hold focus groups, didn’t ask their customers what they wanted. The Henry Ford quote about “They all say faster horses,” and I’m the genius who comes in and tweaks these things and ships them. And that often takes you very far.
30.13: I also think it’s a bias of success. We only know about the ones that succeed, but the best ones, when they grow up and they start to serve an audience that’s way bigger than what they could hold in their head, they start to grow up with AB testing and ABX testing throughout their organization. And a good example of that is Facebook.
Facebook stopped being just some choices and started having to do testing and ABX testing in every aspect of their business. Compare that to Snap, which again, was kind of the last of the great product geniuses to come out. Evan [Spiegel] was heralded as “He’s the product genius,” but I think they ran that too long, and they kept shipping on vibes rather than shipping on ABX testing and growing and, you know, being more boring.
31.04: But again, that’s how you get the global reach. I think there’s a lot of people who probably are really great vibe shippers. And they’re probably having great success doing that. The question is, as their company grows and starts to hit harder times or the growth starts to slow, can that vibe shipping take them over the hump? And I would argue, no, I think you have to grow up and start to have more accountable metrics that, you know, scale to the size of your audience.
31.34: So in closing. . . We talked about prompt engineering. And then we talked about context engineering. So putting you on the spot. What’s a buzzword out there that either irks you or you think is undertalked about at this point? So what’s a buzzword out there, Drew?
31.57: [laughs] I mean, I wish you had given me some time to think about it.
31.58: We are in a hype cycle here. . .
32.02: We’re always in a hype cycle. I don’t like anthropomorphosizing LLMs or AI for a whole host of reasons. One, I think it leads to bad understanding and bad mental models, that means that we don’t have substantive conversations about these things, and we don’t learn how to build really well with them because we think they’re intelligent. We think they’re a PhD in your pocket. We think they’re all of these things and they’re not—they’re fundamentally different.
I’m not against using the way we think the brain works for inspiration. That’s fine with me. But when you start oversimplifying these and not taking the time to explain to your audience how they actually work—you just say it’s a PhD in your pocket, and here’s the benchmark to prove it—you’re misleading and setting unrealistic expectations. And unfortunately, the market rewards them for that. So they keep going.
But I also think it just doesn’t help you build sustainable programs because you aren’t actually understanding how it works. You’re just kind of reducing it down to it. AGI is one of those things. And superintelligence, but AGI especially.
33.21: I went to school at UC Santa Cruz, and one of my favorite classes I ever took was a seminar with Donna Haraway. Donna Haraway wrote “A Cyborg Manifesto” in the ’80s. She’s kind of a tech science history feminist lens. You would just sit in that class and your mind would explode, and then at the end, you just have to sit there for like five minutes afterwards, just picking up the pieces.
She had a great term called “power objects.” A power object is something that we as a society recognize to be incredibly important, believe to be incredibly important, but we don’t know how it works. That lack of understanding allows us to fill this bucket with whatever we want it to be: our hopes, our fears, our dreams. This happened with DNA; this happened with PET scans and brain scans. This happens all throughout science history, down to phrenology and blood types and things that we understand to be, or we believed to be, important, but they’re not. And big data, another one that is very, very relevant.
34.34: That’s my handle on Twitter.
34.55: Yeah, there you go. So like it’s, you know, I fill it with Ben Lorica. That’s how I fill that power object. But AI is definitely that. AI is definitely that. And my favorite example of this is when the DeepSeek moment happened, we understood this to be really important, but we didn’t understand why it works and how well it worked.
And so what happened is, if you looked at the news and you looked at people’s reactions to what DeepSeek meant, you could basically find all the hopes and dreams about whatever was important to that person. So to AI boosters, DeepSeek proved that LLM progress is not slowing down. To AI skeptics, DeepSeek proved that AI companies have no moat. To open source advocates, it proved open is superior. To AI doomers, it proved that we aren’t being careful enough. Security researchers worried about the risk of backdoors in the models because it was in China. Privacy advocates worried about DeepSeek’s web services collecting sensitive data. China hawks said, “We need more sanctions.” Doves said, “Sanctions don’t work.” NVIDIA bears said, “We’re not going to need any more data centers if it’s going to be this efficient.” And bulls said, “No, we’re going to need tons of them because it’s going to use everything.”
35.44: And AGI is another term like that, which means everything and nothing. And when the point we’ve reached it comes, isn’t. And compounding that is that it’s in the contract between OpenAI and Microsoft—I forget the exact term, but it’s the statement that Microsoft gets access to OpenAI’s technologies until AGI is achieved.
And so it’s a very loaded definition right now that’s being debated back and forth and trying to figure out how to take [Open]AI into being a for-profit corporation. And Microsoft has a lot of leverage because how do you define AGI? Are we going to go to court to define what AGI is? I almost look forward to that.
36.28: So because it’s going to be that thing, and you’ve seen Sam Altman come out and some days he talks about how LLMs are just software. Some days he talks about how it’s a PhD in your pocket, some days he talks about how we’ve already passed AGI, it’s already over.
I think Nathan Lambert has some great writing about how AGI is a mistake. We shouldn’t talk about trying to turn LLMs into humans. We should try to leverage what they do now, which is something fundamentally different, and we should keep building and leaning into that rather than trying to make them like us. So AGI is my word for you.
37.03: The way I think of it is, AGI is great for fundraising, let’s put it that way.
37.08: That’s basically it. Well, until you need it to have already been achieved, or until you need it to not be achieved because you don’t want any regulation or if you want regulation—it’s kind of a fuzzy word. And that has some really good properties.
37.23: So I’ll close by throwing in my own term. So prompt engineering, context engineering. . . I will close by saying pay attention to this boring term, which my friend Ion Stoica is now talking more about “systems engineering.” If you look at particularly the agentic applications, you’re talking about systems.
37.55: Can I add one thing to this? Violent agreement. I think that is an underrated. . .
38.00: Although I think it’s too boring a term, Drew, to take off.
38.03: That’s fine! The reason I like it is because—and you were talking about this when you talk about fine-tuning—is, looking at the way people build and looking at the way I see teams with success build, there’s pretraining, where you’re basically training on unstructured data and you’re just building your base knowledge, your base English capabilities and all that. And then you have posttraining. And in general, posttraining is where you build. I do think of it as a form of interface design, even though you are adding new skills, but you’re teaching reasoning, you’re teaching it validated functions like code and math. You’re teaching it how to chat with you. This is where it learns to converse. You’re teaching it how to use tools and specific sets of tools. And then you’re teaching it alignment, what’s safe, what’s not safe, all these other things.
But then after it ships, you can still RL that model, you can still fine-tune that model, and you can still prompt engineer that model, and you can still context engineer that model. And back to the systems engineering thing is, I think we’re going to see that posttraining all the way through to a final applied AI product. That’s going to be a real shades-of-gray gradient. It’s going to be. And this is one of the reasons why I think open models have a pretty big advantage in the future is that you’re going to dip down the way throughout that, leverage that. . .
39.32: The only thing that’s keeping us from doing that now is we don’t have the tools and the operating system to align throughout that posttraining to shipping. Once we do, that operating system is going to change how we build, because the distance between posttraining and building is going to look really, really, really blurry. I really like the systems engineering type of approach, but I also think you can also start to see this yesterday [when] Thinking Machines released their first product.
40.04: And so Thinking Machines is Mira [Murati]. Her very hype thing. They launched their first thing, and it’s called Tinker. And it’s essentially, “Hey, you can write a very simple Python code, and then we will do the RL for you or the fine-tuning for you using our cluster of GPU so you don’t have to manage that.” And that is the type of thing that we want to see in a maturing kind of development framework. And you start to see this operating system emerging.
And it reminds me of the early days of O’Reilly, where it’s like I had to stand up a web server, I had to maintain a web server, I had to do all of these things, and now I don’t have to. I can spin up a Docker image, I can ship to render, I can ship to Vercel. All of these shared complicated things now have frameworks and tooling, and I think we’re going to see a similar evolution from that. And I’m really excited. And I think you have picked a great underrated term.
40.56: Now with that. Thank you, Drew.
40.58: Awesome. Thank you for having me, Ben.