In this episode, Laurence Moroney, director of AI at Arm, joins Ben Lorica to chat about the state of deep learning frameworks—and why you may be better off thinking a step higher, on the solution level. Listen in for Laurence’s thoughts about posttraining; the evolution of on-device AI (and how tools like ExecuTorch and LiteRT are helping make it possible); why culturally specific models will only grow in importance; what Hollywood can teach us about LLM privacy; and more.
About the Generative AI in the Real World podcast: In 2023, ChatGPT put AI on everyone’s agenda. In 2025, the challenge will be turning those agendas into reality. In Generative AI in the Real World, Ben Lorica interviews leaders who are building with AI. Learn from their experience to help put AI to work in your enterprise.
Check out other episodes of this podcast on the O’Reilly learning platform.
Transcript
This transcript was created with the help of AI and has been lightly edited for clarity.
00.00: All right. So today we have Laurence Moroney, director of AI at Arm and author of the book AI and ML for Coders in PyTorch. Laurence is someone I’ve known for a while. He was at Google serving as one of the main evangelists for TensorFlow. So welcome to the podcast, Laurence.
00.23: Thank you Ben. It’s great to be here.
00.26: I guess, before we go on to the present, let's talk a little bit about the past of deep learning frameworks. In fact, this week is interesting because Soumith Chintala just announced he was leaving Meta, and Soumith was one of the leaders of the PyTorch project. I interviewed Soumith in an O'Reilly podcast after PyTorch was released, and coincidentally, that was almost exactly a year after I interviewed Rajat Monga, right around the time that TensorFlow was released. So I was actually talking to these project leaders very early on.
So, Laurence, you moved your book to PyTorch, and I'm sure TensorFlow still holds a special place in your heart, right? So where does TensorFlow sit right now in your mind? Because right now it's all about PyTorch, right?
01.25: Yeah, that’s a great question. TensorFlow definitely has a very special place in my heart. I built a lot of my recent career on TensorFlow. I’ll be frank. It feels like there’s not that much investment in TensorFlow anymore.
If you take a look at even releases, it went 2.8, 2.9, 2.10, 2.11. . .and you know, there’s no 3.0 on the horizon. I can’t really share any insider stuff from Google, although I left there over a year ago, but it does feel that unfortunately [TensorFlow has] kind of withered on the vine a little bit internally at Google compared to JAX.
02.04: But then the problem, at least for me from an external perspective, is, first of all, JAX isn't really a machine learning framework. There are machine learning frameworks that are built on top of it. And second of all, it's not a 1.0 product. It's hard for me to encourage anybody to bet their business or their career on something that isn't at least a 1.0 product.
02.29: That really just leaves (by default) PyTorch. Obviously there’s been all of the momentum around PyTorch. There’s been all of the excitement around it. It’s interesting, though, that if you look at things like GitHub star history, it still lags behind both TensorFlow and JAX. But in perception it is the most popular. And unfortunately, if you do want to build a career now on creating machine learning models, not just using machine learning models, it’s really the—oh well, I shouldn’t say unfortunately. . . The truth is that it’s really the only option. So that’s the negative side.
The positive side of it is of course, it's really, really good. I've been using it extensively for some time. Even during my TensorFlow and JAX days, I did use PyTorch a lot. I wanted to keep an eye on how it was used, how it's shaped, what worked, what didn't, the best way for somebody to learn PyTorch—and to make sure that the TensorFlow community, as I was working on it, was able to keep up with the simplicity of PyTorch, particularly the brilliant work that was done by the Keras team to really make Keras part of TensorFlow. It's now been kind of pulled out of TensorFlow somewhat, but that was something that leaned into the same simplicity as PyTorch.
03.52: And like I said, now going forward, PyTorch is. . . I rewrote my book to be PyTorch specific. Andrew and I are teaching a PyTorch specialization with DeepLearning.AI on Coursera. And you know, my emphasis is less on frameworks and framework wars and loyalties and stuff like that, and more on really wanting to help people succeed, to build careers or to build startups, that kind of thing. This was the direction I felt it should go in.
04.19: Now, maybe I’m wrong, but I think even about two years ago, maybe a little more than that, I was still hearing and seeing job posts around TensorFlow, primarily around people working in computer vision on edge devices. So is that still a place where you would run into TensorFlow users?
04.41: Absolutely, yes. Because of what was previously called TensorFlow Lite and is now called LiteRT as a runtime for models to be able to run on edge devices. I mean, that really was the only option until recently—just last week at the PyTorch Summit, ExecuTorch went 1.0. And if I go back to my old mantra of "I really don't want anybody to invest their business or their career on something that's prerelease," before that, ExecuTorch was good to learn with and good to prepare with, but not to bet on.
05.10: [Back] then, the only option for you to be able to train models and deploy them, particularly to mobile devices, was effectively either LiteRT or TensorFlow Lite or whatever it’s called now, or Core ML for Apple devices. But now with ExecuTorch going 1.0, the whole market is out there for PyTorch developers to be able to deploy to mobile and edge devices.
05.34: So those job listings, I think as they evolve and as they go forward, the skills may veer more toward PyTorch. But I'd also encourage everybody to move up a level above the framework and start thinking on the solution level. There've been a lot of framework wars in so many things, you know: Mac versus PC, .NET versus Java. And in some ways, that's not the most productive way of thinking about things.
I think the best thing to do is [to] think about what’s out there to allow you to build a solution that you can deploy, that you can trust, and that will be there for some time. And let the framework be secondary to that.
06.14: All right. So one last framework question. And this is also an observation that might be slightly dated—I think this might be from around two years ago. I was actually surprised that the Chinese government, for some reason, is encouraging Chinese companies to use local deep learning frameworks. So it's not just PaddlePaddle. There's another one that I came across, and I don't know what the status of that is now, as far as you know. . .
06.43: So I’m not familiar with any others other than PaddlePaddle. But I do generally agree with [the idea that] cultures should be thinking about using tools and frameworks and models that are appropriate for their culture. I’m going to pivot away from frameworks towards large language models as an example.
Large language models are primarily built on English. And when you start peeling apart large language models and look at what’s underneath the hood and particularly how they tokenize words, it’s very, very English oriented. So if you start wanting to build solutions, for example, for things like education—you know, important things!—and you’re not primarily an English language-speaking country, you’re already a little bit behind the curve.
07.35: Actually, I just came from a meeting with some folks from Ireland. And for the Gaelic language, the whole idea of posttraining models that were trained primarily with English tokens is already putting you at a disadvantage if you're trying to build stuff that you can use within your culture.
At the very least, you're missing tokens, right? There are subwords in Gaelic that don't exist in English, or subwords in Japanese or Chinese or Korean or whatever that don't exist in English. So if you even start trying to do posttraining, you realize that you need to use tokens that the model wasn't trained with, and stuff like that.
So I know I’m not really answering the framework part of it, but I do think it’s an important thing, like you mentioned, that China wants to invest in their own frameworks. But I think every culture should also be looking at. . . Cultural preservation is very, very important in the age of AI, as we build more dependence on AI.
08.37: When it comes to a framework, PyTorch is open source. TensorFlow is open source. I’m pretty sure PaddlePaddle is open source. I don’t know. I’m not really that familiar with it. So you don’t have the traps of being locked into somebody else’s cultural perspective or language or anything like that, that you would have with an obscure large language model if you’re using an open source framework. So that part isn’t as difficult when it comes to, like, a country wanting to adopt a framework. But certainly when it comes to building on top of pretrained models, that’s where you need to be careful.
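To make the tokenization point concrete, here's a minimal sketch using the Hugging Face transformers library. GPT-2's English-centric tokenizer and the Irish word below are illustrative choices, not examples from the conversation:

```python
# A quick way to see the tokenizer gap for yourself, using the Hugging Face
# transformers library. GPT-2's English-centric BPE tokenizer and the Irish
# word below are illustrative choices.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")

# "comhghairdeas" is Irish for "congratulations."
for word in ["congratulations", "comhghairdeas"]:
    tokens = tokenizer.tokenize(word)
    print(f"{word!r} -> {len(tokens)} tokens: {tokens}")

# Typically the English word maps to a token or two, while the Irish word
# shatters into many rare subword fragments, so posttraining starts from a
# representation the model barely saw during pretraining.
```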
09.11: So [for] most developers and most enterprise AI teams, the reality is they’re not going to be pretraining. So it’s mostly about posttraining, which is a big topic. It can run the gamut of RAG, fine-tuning, reinforcement learning, distillation, quantization. . . So from that perspective, Laurence, how much should someone who’s in an enterprise AI team really know about these deep learning frameworks?
09.42: So there are two different things there, right? One is posttraining and one is deep learning frameworks. I'm going to lean into the posttraining side to argue that that's the single most important skill for developers going forward: posttraining, in all of its forms.
10.00: And all of the types of posttraining.
10.01: Yeah, totally. There’s always trade-offs, right? There’s the very simple posttraining stuff like RAG, which is relatively low value, and then there’s the more complex stuff like a full retrain or a LoRA-type training, which is more expensive or more difficult but has higher value.
But I think there’s a whole spectrum of ways of doing things with posttraining. And my argument that I’m making very passionately is that if you’re a developer, that is the number one skill to learn going forward. “Agents” was kind of the buzzword of 2025; I think “small AI” will be the buzzword of 2026.
10.40: We often talk about open source AI with open source models and stuff like that. It's not really open source. It's a bit of a misnomer. The weights have been released for you to be able to use and self-host—if you want a self-hosted chatbot, or to self-host something else that you want to run on them.
But more importantly, the weights are there for you to change, through retraining, through fine-tuning and stuff like that. I’m particularly passionate about that because when you start thinking in terms of two things—latency and privacy—it becomes really, really important.
11.15: I spent a lot of time working with folks who are passionate about IP. I’ll share one of them: Hollywood movie studios. And we’ve probably all seen those semi-frivolous lawsuits of, person A makes a movie, and then person B sues person A because person B had the idea first. And movie studios are generally terrified of that kind of thing.
I actually have a movie in preproduction with a studio at the moment, so I've learned a lot through that. And one of the things [I learned] was, even when I speak with producers or the financiers, a lot of the time we talk on the phone. We don't email or anything like that, because the whole fear of IP leaks is out there. And that extends to a fear of all the things that an LLM could be used to do. The shallow stuff would be to help you write scenes and all that kind of thing. But most of them don't really care about that.
The more important things where an LLM could be used [are to] evaluate a script and count the number of locations that would be needed to film it. Like the Mission: Impossible script, where one scene's in Paris and another scene's in Moscow, and another scene is in Hong Kong. To be able to have a machine that can evaluate that and help you start budgeting. Or if somebody sends in a speculative script with all of that kind of stuff in it, you realize you don't have half a billion to make this movie from an unknown, because they have all these locations.
12.41: So all of this kind of analysis that can be done—story analysis, costing analysis, and all of that type of stuff—is really important to them. And it’s great low-hanging fruit for something like an LLM to do. But there’s no way they’re going to upload their speculative scripts to Gemini or OpenAI or Claude or anything like that.
So local AI is really important to them—and the whole privacy part of it. You run the model on the machine; you do the analysis on the machine; the data never leaves your laptop. And then extend that. I mean, not everybody's going to be working with Hollywood studios, but extend that to just general small offices—your law office, your medical office, your physiotherapist, or whatever—[where] everybody is using large language models for very creative things, but if you can make those models far more effective at your specific domain. . .
13.37: I’ll use a small office, for example, in a particular state in a particular jurisdiction, to be able to retrain a model, to be an expert in the law for that jurisdiction based on prior, what is it they call it? Jury priors? I can’t remember the Latin phrase for it, but, you know, based on precedents. To be able to fine-tune a model for that and then have everything locally within your office so you’re not sharing out to Claude or Gemini or OpenAI or whatever. Developers are going to be building that stuff.
14.11: And with a lot of fear, uncertainty, and doubt out there for developers around code generation, the optimist in me is seeing that [for] developers, your value bar is actually rising. If your value is just your ability to churn out code, now models can compete with you. But if you raise your value to being able to do things that are much higher value than just churning out code—and I think fine-tuning is a part of that—then that actually leads to a very bright future for developers.
14.43: So here's my impression of the state of tooling for posttraining. [With] RAG and different variants of RAG, it seems like people have enough tools, or at least some notion of how to get started. [For] fine-tuning, there's a lot of services that you can use now, and it seems like it mainly comes down to collecting a fine-tuning dataset.
[For] reinforcement learning, we still need tools that are accessible. The workflow needs to be at a point where a domain expert can actually do it—and that’s in some ways kind of where we are in fine-tuning, so the domain expert can focus on the dataset. Reinforcement learning, not so much the case.
I don’t know, Laurence, if you would consider quantization and distillation part of posttraining, but it seems like that might also be something where people would also need more tools. More options. So what’s your sense of tooling for the different types of posttraining?
15.56: Good question. I’ll start with RAG because it’s the easiest. There’s obviously lots of tooling out there for it.
16.04: And startups, right? So a lot of startups.
16.07: Yep. I think the thing with RAG that interests me and fascinates me the most is that in some ways it shares [similarities] with the early days of actually doing machine learning with the likes of Keras or PyTorch or TensorFlow, where there's a lot of trial and error. And, you know, the tools. . .
16.25: Yeah, there are a lot of knobs that you can optimize. People underestimate how important that is, right?
16.35: Oh, absolutely. Even the most basic knob, like: How big a slice do you take of your text, and how big of an overlap do you use between those slices? Because you can have vastly different results by changing that.
16.51: So just as a quick recap, if anybody's not familiar with RAG, I'd like to give one little example of it. I actually wrote a novel about 12, 13 years ago, and six months after the novel was published, the publisher went bust. And this novel is not in the training set of any LLM.
So if I go to an LLM like Claude or GPT or anything like that and I ask about the novel, it will usually either say it doesn’t know or it will hallucinate and it’ll make stuff up and say it knows it. So to me, this was the perfect thing for me to try RAG.
17.25: The idea with RAG is that I will take the text of the novel and I'll chop it up into maybe 20-word increments, with five-word overlaps—so the first 20 words of the book, then words 16 through 35, then words 31 through 50, so you get those overlaps—and then store those in a vector database. And then when somebody wants to ask about something, like maybe a character in the novel, the prompt will be vectorized, and the embeddings for that prompt can be compared with the embeddings of all of these chunks.
And then when similar chunks are found, like the name of the character and stuff like that, or if the prompt asks, “Tell me about her hometown,” then there may be a chunk in the book that says, “Her hometown is blah,” you know?
So they will then be retrieved from the database and added to the prompt, and then sent to something like GPT. So now GPT has much more context: not just the prompt but also all these extra bits that it retrieves from the book that says, “Hey, she’s from this town and she likes this food.” And while ChatGPT doesn’t know about the book, it does know about the town, and it does know about that food, and it can give a more intelligent answer.
18.34: So it’s not really a tuning of the model in any way or posttuning of the model, but it’s an interesting and really nice hack to allow you to get the model to be able to do more than you thought it could do.
But going back to the question about tooling, there’s a lot of trial and error there like “How do I tokenize the words? What kind of chunk size do I use?” And all of that kind of stuff. So anybody that can provide any kind of tooling in that space so that you can try multiple databases and compare them against each other, I think is really valuable and really, really important.
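As a rough illustration of the flow Laurence describes (chunk with overlap, embed, retrieve by similarity, augment the prompt), here's a minimal sketch in Python. The sentence-transformers model name and the novel.txt file are illustrative stand-ins, not details from the conversation:

```python
# A rough sketch of the RAG flow described above: chunk the novel with
# overlaps, embed the chunks, retrieve the ones most similar to the
# question, and prepend them to the prompt. The embedding model name and
# "novel.txt" are illustrative stand-ins.
import numpy as np
from sentence_transformers import SentenceTransformer

embedder = SentenceTransformer("all-MiniLM-L6-v2")

def chunk_words(text, size=20, overlap=5):
    """Split text into size-word chunks, each overlapping the previous by `overlap` words."""
    words = text.split()
    step = size - overlap
    return [" ".join(words[i:i + size]) for i in range(0, max(len(words) - overlap, 1), step)]

# Index once: embed every chunk of the book up front.
novel_text = open("novel.txt").read()
chunks = chunk_words(novel_text)
chunk_vecs = embedder.encode(chunks, normalize_embeddings=True)

def retrieve(query, k=5):
    """Return the k chunks whose embeddings best match the query's embedding."""
    q = embedder.encode([query], normalize_embeddings=True)[0]
    scores = chunk_vecs @ q                     # cosine similarity (vectors are unit length)
    return [chunks[i] for i in np.argsort(scores)[::-1][:k]]

# At question time: augment the prompt with the retrieved context.
question = "Tell me about her hometown."
context = "\n".join(retrieve(question))
prompt = f"Context from the novel:\n{context}\n\nQuestion: {question}"
# `prompt` can now be sent to GPT, Claude, or any other LLM, which answers
# using the retrieved passages rather than hallucinating.
```

Chunk size, overlap, and the number of retrieved chunks are exactly the kinds of knobs the conversation mentions; small changes to any of them can noticeably change the answers.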
19.05: If I go to the other end of the spectrum, then for actual real tuning of a model, I think LoRA tuning is a good example there. And tooling for that is hard to find; it's few and far between.
19.20: I think actually there's a lot of providers now where you can focus on your dataset and then. . . It's a bit of a black box, obviously, because you're relying on an API. I guess my point is that even if you're [on] a team where you don't have that expertise, you can get going. Whereas in reinforcement learning, there's really not much tooling out there.
19.50: Certainly with reinforcement learning, you've got to kind of just crack open the APIs and start coding. It's not as difficult as it sounds, once you start doing it.
20.00: There are people who are trying to build tools, but I haven't seen one where you can just point the domain expert at it.
20.09: Totally. And I would also encourage [listeners]: If you're doing any of this stuff, like LoRA tuning, it's really not that difficult once you start looking into it. And PyTorch is great for this, and Python is great for this. Shameless self-plug here, but [in] the final chapter of my PyTorch book, I actually give an example of LoRA tuning, where I created a dataset for a digital influencer and I show you how to retune and how to LoRA-tune the Stable Diffusion model to be a specialist in creating images of this one particular individual—just to show how to do all of that in code.
Because I’m always a believer that before I start using third-party tools to do a thing, I kind of want to look at the code and the frameworks and how to do that thing for myself. So then I can really understand the value that the tools are going to be giving me. So I tend to veer towards “Let me code it first before I care about the tools.”
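In the spirit of "code it first," here's a from-scratch sketch of the core LoRA idea; it's a simplified illustration, not the code from the book:

```python
# A from-scratch sketch of the core LoRA idea: freeze a pretrained weight W
# and learn only a low-rank update B @ A, so the layer effectively computes
# x @ (W + scale * B @ A).T instead of x @ W.T.
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    def __init__(self, base: nn.Linear, r: int = 8, alpha: int = 16):
        super().__init__()
        self.base = base
        self.base.weight.requires_grad_(False)   # freeze the pretrained weights
        if self.base.bias is not None:
            self.base.bias.requires_grad_(False)
        # Low-rank factors: A is (r x in), B is (out x r). B starts at zero,
        # so at step 0 the layer behaves exactly like the pretrained one.
        self.A = nn.Parameter(torch.randn(r, base.in_features) * 0.01)
        self.B = nn.Parameter(torch.zeros(base.out_features, r))
        self.scale = alpha / r

    def forward(self, x):
        return self.base(x) + self.scale * (x @ self.A.T @ self.B.T)

# Wrap one layer of a "pretrained" model and count what's actually trainable.
layer = LoRALinear(nn.Linear(768, 768), r=8)
trainable = sum(p.numel() for p in layer.parameters() if p.requires_grad)
total = sum(p.numel() for p in layer.parameters())
print(f"trainable: {trainable} / {total}")   # ~12K of ~600K weights
```

Apply this wrapping to a model's attention layers and train as usual, and only the tiny A and B matrices get updated, which is why LoRA fine-tuning is so much cheaper than a full retrain.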
21.09: Spoken like a true Googler.
21.15: [laughs] I have to call out one tool that, while it's not specifically for fine-tuning large language models, I hope they convert for that. This one changed the game for me: Apple has a tool called Create ML, which was really used for transfer learning off of existing models—which is still posttraining, just not posttraining of LLMs.
And that tool's ability to take a dataset and then fine-tune a model like a MobileNet, or an object detection model, on it, codelessly and efficiently, blew my mind with how good it was. The world needs more tooling like that. And if there are any Apple people listening, I'd encourage them to extend Create ML to large language models or any other generative models.
22.00: By the way, I want to make sure, as we wind down, I ask you about edge—that’s what’s occupying you at the moment. You talk about this notion of “build once, deploy everywhere.” So what’s actually feasible today?
22.19: So what's feasible today? I think the best multideployment surface today, the one I would invest in going forward, is creating for ExecuTorch, because the ExecuTorch runtime is going to be living in so many places.
At Arm, obviously we've been working very closely with ExecuTorch, and we are part of the ExecuTorch 1.0 release. But if you're building for edge, making sure that your models work on ExecuTorch is, I think, the number one piece of low-hanging fruit that I would say people should invest in. So that's the PyTorch side of it.
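For a sense of what that looks like in practice, here's a minimal sketch of the documented ExecuTorch export path. The model here is a stand-in, and lowering to a specific backend (XNNPACK, an Arm delegate, and so on) would be an additional step not shown:

```python
# A minimal sketch of the "build once, deploy everywhere" flow for
# ExecuTorch: export a PyTorch model to a .pte file that the ExecuTorch
# runtime can load on Android, iOS, or embedded targets.
import torch
from torch.export import export
from executorch.exir import to_edge

class TinyClassifier(torch.nn.Module):      # stand-in model for the sketch
    def __init__(self):
        super().__init__()
        self.net = torch.nn.Sequential(
            torch.nn.Linear(64, 32), torch.nn.ReLU(), torch.nn.Linear(32, 10)
        )
    def forward(self, x):
        return self.net(x)

model = TinyClassifier().eval()
example_inputs = (torch.randn(1, 64),)

aten_program = export(model, example_inputs)   # capture the graph
edge_program = to_edge(aten_program)           # convert to the Edge dialect
et_program = edge_program.to_executorch()      # produce the runtime format

with open("tiny_classifier.pte", "wb") as f:   # ship this file with your app
    f.write(et_program.buffer)
```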
22.54: Does it really live up to the “run everywhere”?
23.01: Define “everywhere.”
23.02: [laughs] I guess, at the minimum, Android and iOS.
23.12: So yes, at a minimum, for those—the same as LiteRT or TensorFlow Lite from Google does. What I'm excited about with ExecuTorch is that it also runs in other physical AI areas. We are going to be seeing it in cars and robots and other things as well. And I anticipate that that ecosystem will spread a lot faster than the LiteRT one. So if you're starting with Android and iOS, then you're in good shape.
23.42: What about the kinds of devices that our mutual friend Pete Warden, for example, targets? The really compute-hungry [ones]? Well, not so much compute hungry, but basically not much compute.
24.05: They sip power rather than gulping it. I think that would be a better question for Pete than for me. If you see him, tell him I said hi.
24.13: I mean, is that something that the ExecuTorch community also kind of thinks about?
24.22: In short, yes. The longer answer is that it's a bit more of a challenge to go onto microcontrollers and the like. One of the things I'm really excited about when you start getting down onto the really small devices is a technology called SME, which stands for Scalable Matrix Extension. It's something that Arm have been working on with various chip makers and handset makers, with the idea being that SME is all about being able to run AI workloads on the CPU, without needing a separate external accelerator. And then as a result, the CPU's going to be drawing less battery, those kinds of things.
That's one of the growth areas that I'm excited about, where you're going to see more and more AI workloads being able to run on handsets, particularly the diverse Android handsets, because the CPU is capable of running models instead of you needing to offload to a separate accelerator, be it an NPU or a TPU or a GPU.
And the problem with the Android ecosystem is the sheer diversity makes it difficult for a developer to target any specific one. But if more and more workloads can actually move on to the CPU, and every device has a CPU, then the idea of being able to do more and more AI workloads through SME is going to be particularly exciting.
25.46: So actually, Laurence, for people who don’t work on edge deployments, give us a sense of how capable some of these small models are.
First I’ll throw out an unreasonable example: coding. So obviously, me and many people love all these coding tools like Claude Code, but sometimes it really consumes a lot of compute, gets expensive. And not only that, you end up getting somewhat dependent so that you have to always be connected to the cloud. So if you are on a plane, suddenly you’re not as productive anymore.
So I’m sure in coding it might not be feasible, but what are these language models or these foundation models capable of doing locally [on smartphones, for example] that people may not be aware of?
26.47: Okay, so let me kind of answer that in two different ways: [what] on-device foundation models are capable of that people may not be aware of, [and] the overall on-device ecosystem and the kinds of things you can do that people may not be aware of. And I'm going to start with the second one.
You mentioned China earlier on. Alipay is a company from China, and they've been working with the SME technology that I spoke about. They had an app (I'm sure we've all seen these kinds of apps) where you can take your vacation photographs and then search them for things, like "Show me all the pictures I took with a panda."
And then you can create a slideshow or a subset of your folder with that. But when you build something like that, the AI required to be able to search images for a particular thing needs to live in the cloud because on-device just wasn’t capable of doing that type of image-based searching previously.
27.47: So then as a company, they had to stand up a cloud service to be able to do this. And as a user, I have privacy and latency issues when I'm using it: I have to share all of my photos with a third party, and whatever I'm looking for in those photos I have to share with the third party.
And then of course, there’s the latency: I have to send the query. I have to have the query execute in the cloud. I have to have the results come back to my device and then be assembled on my device.
28.16: Now with on-device AI, thinking about it from both the user perspective and from the app vendor perspective, it's a better experience. I'll start from the app vendor perspective: They don't need to stand up this cloud service anymore, so they're saving a lot of time and effort and money. Everything moves on-device, with a model that's capable of understanding images and their contents so that you can search them, executing completely on-device.
The user experience is also better. "Show me all the pictures of pandas that I have": It's able to look through all the pictures on the device, get an embedding that represents the contents of each picture, match those embeddings to the query that the user is making, and then assemble those pictures. So you don't have the latency, and you don't have the privacy issues, and the vendor doesn't have to stand anything up.
29.11: So that’s the kind of area where I’m seeing great improvements, not just in user experience but also making it much cheaper and easier for somebody to build these applications—and all of that then stems from the capabilities of foundation models that are executing on the device, right? In this case, it’s a model that’s able to turn an image into a set of embeddings so that you can search those embeddings for matching things.
As a result, we're seeing more and more on-device models, like Gemini Nano, like Apple Intelligence, becoming a foundational part of the operating system. And as that happens, we'll see more and more applications like these being made possible.
Think of a small startup: I can't afford to stand up a cloud service. You know, it could cost millions of dollars to be able to build an application that way, so I can't do that. And how many small startups can't do that? But then as it moves on-device, and you don't need all of that, and it's just going to be purely an on-device thing, then suddenly it becomes much more interesting. And I think there'll be a lot more innovation happening in that space.
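Here's a minimal sketch of that kind of on-device embedding search, using a CLIP-style model from Hugging Face transformers as an illustrative stand-in for whatever model a vendor would actually ship:

```python
# A minimal sketch of embedding-based photo search running entirely locally,
# using a CLIP-style model from Hugging Face transformers as a stand-in for
# whatever on-device model a vendor would actually ship.
from pathlib import Path

import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32").eval()
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

# Index once: one normalized embedding per photo in a local folder.
photos = sorted(Path("photos").glob("*.jpg"))
images = [Image.open(p) for p in photos]
with torch.no_grad():
    img_vecs = model.get_image_features(**processor(images=images, return_tensors="pt"))
img_vecs = img_vecs / img_vecs.norm(dim=-1, keepdim=True)

# Query locally: embed the text and rank photos by cosine similarity.
with torch.no_grad():
    q = model.get_text_features(**processor(text=["a photo of a panda"], return_tensors="pt"))
q = q / q.norm(dim=-1, keepdim=True)

scores = (img_vecs @ q.T).squeeze(1)
for idx in scores.argsort(descending=True)[:5].tolist():
    print(photos[idx], round(float(scores[idx]), 3))
# The photos, the query, and the results never leave the machine.
```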
30.16: You mentioned Gemini Nano. What are the key families of local foundation models?
30.27: Sure. So there's local foundation models, and then also embedded on-device models. There's Gemini Nano on Android and the Apple Intelligence models on Apple devices, as well as this ecosystem of smaller models that could work either on-device or on your desktop, like the Gemma family from Google. There's OpenAI's gpt-oss, there's the Qwen stuff from China, there's Llama. . . There's a whole bunch of them out there.
I’ve recently been using the gpt-oss, which I find really good. And obviously I’m also a big fan of Gemma, but there’s lots of families out there—there’s so many new ones coming online every day, it seems. So there’s a lot of choice for those, but many of them are still too big to work on a mobile device.
31.15: You brought up quantization earlier on. And that's where quantization will have to come into play, at least in some cases. But I think for the most part, if you look at where the vectors are trending, the smaller models are getting smarter. What a 7 billion-parameter model can do today, you needed 100 billion parameters to do two years ago.
And if you keep projecting that forward, the 1 billion-parameter model is kind of [going to] be able to do the same thing in a year or two's time. And then it becomes relatively trivial to put them onto a mobile device, if they're not part of the core operating system, but for them to be something that you ship along with your application.
I can see more and more of that happening where third-party models being small enough to work on mobile devices will become the next wave of what I’ve been calling small AI, not just on mobile but also on desktop and elsewhere.
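As a concrete example of the quantization step Ben raised, here's a minimal sketch of post-training dynamic quantization in PyTorch; the toy model and the roughly 4x size reduction are illustrative:

```python
# A minimal sketch of post-training dynamic quantization in PyTorch: store
# the Linear weights as int8 and dequantize on the fly at inference time.
# The toy model is a stand-in; real on-device stacks often go to 4 bits.
import io

import torch
from torch import nn

model = nn.Sequential(
    nn.Linear(4096, 4096), nn.ReLU(), nn.Linear(4096, 4096)
).eval()

quantized = torch.ao.quantization.quantize_dynamic(
    model, {nn.Linear}, dtype=torch.qint8   # quantize only the Linear layers
)

def serialized_mb(m):
    """Approximate on-disk size of a model's weights, in megabytes."""
    buf = io.BytesIO()
    torch.save(m.state_dict(), buf)
    return buf.getbuffer().nbytes / 1e6

print(f"fp32: {serialized_mb(model):.1f} MB")       # roughly 134 MB
print(f"int8: {serialized_mb(quantized):.1f} MB")   # roughly a quarter of that
```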
32.13: So in closing, Laurence, for our listeners who are already familiar and may already be building AI applications for cloud or enterprise, this conversation may prompt them to start checking out edge and local applications.
Besides your book and your blog, what are some of the key resources? Are there specific conferences where a lot of these local AI edge AI people gather, for example?
32.48: So for local AI, not yet. I think that wave is only just beginning. Obviously, things like the Meta conferences will talk a lot about Llama, and Google conferences will talk a lot about Gemma, but an independent conference for just general local AI as a whole is still to come.
Mobile is very vendor specific or [focused on] the ecosystem of a vendor. Apple obviously have their WWDC, Google have their conferences, but there’s also the independent conference called droidcon, which I find really, really good for understanding mobile and understanding AI on mobile, particularly for the Android ecosystem.
But as for an overall conference for small AI and for the ideas of fine-tuning, all of the types of posttraining that can be done on small AI, that's a growth area. For posttraining, I would say there's a really excellent Coursera course that a friend of mine, Sharon Zhou, just released. It came out last week or the week before. That's an excellent course in all of the ins and outs of posttraining and fine-tuning. But yeah, I think it's a great growth area.
34.08: And for those of us who are iPhone users. . . I keep waiting for Apple Intelligence to really up its game. It seems like it’s getting close. They have multiple initiatives in the works. They have alliances with OpenAI and now with Google. But then apparently they’re also working on their own model. So any inside scoop? [laughs]
34.33: Well, no inside scoop because I don’t work at Apple or anything like that, but I’ve been using Apple Intelligence quite a lot, and I’m a big fan. The ability to have the on-device large language model is really powerful. There’s a lot of scenarios I’ve been kind of poking around with and helping some startups with in that space.
The one thing that I would say that’s a big gotcha for developers to look out for is the very small context window. It’s only 8K, so if you try to do any kind of long-running stuff or anything interesting like that, you’ve got to go off-device. Apple have obviously been investing in this private cloud so that your sessions, when they go off-device into the cloud. . . At least they try to solve the privacy part of it. They’re getting ahead of the privacy [issue] better than anybody else, I think.
But latency is still there. And I think that deal with Google to provide Gemini services that was announced a couple of days ago is more on that cloud side of things and less on the on-device.
35.42: But going back to what I was saying earlier on, the 7 billion-parameter model of today is as good as the 120 billion of yesterday. The 1 billion-parameter [model] of next year is probably as good as that, if not better. So as smaller-parameter, and therefore smaller-memory, models become much more effective, I can see more of them being delivered on-device as part of the operating system, in the same way as Apple Intelligence is doing it. But hopefully with a bigger context window, because they can afford it with the smaller model.
36.14: And to clarify, Laurence, that trend that you just pointed out, the increasing capability of the smaller models, that holds not just for LLMs but also for multimodal?
36.25: Yes.
36.26: And with that, thank you, Laurence.
36.29: Thank you, Ben. Always a pleasure.