
Generative AI in the Real World: Laurence Moroney on AI at the Edge


In this episode, Laurence Moroney, director of AI at Arm, joins Ben Lorica to chat about the state of deep learning frameworks—and why you may be better off thinking a step higher, on the solution level. Listen in for Laurence’s thoughts about posttraining; the evolution of on-device AI (and how tools like ExecuTorch and LiteRT are helping make it possible); why culturally specific models will only grow in importance; what Hollywood can teach us about LLM privacy; and more.

About the Generative AI in the Real World podcast: In 2023, ChatGPT put AI on everyone’s agenda. In 2025, the challenge will be turning those agendas into reality. In Generative AI in the Real World, Ben Lorica interviews leaders who are building with AI. Learn from their experience to help put AI to work in your enterprise.

Check out other episodes of this podcast on the O’Reilly learning platform.

Transcript

This transcript was created with the help of AI and has been lightly edited for clarity.

00.00: All right. So today we have Laurence Moroney, director of AI at Arm and author of the book AI and ML for Coders in PyTorch. Laurence is someone I’ve known for a while. He was at Google serving as one of the main evangelists for TensorFlow. So welcome to the podcast, Laurence. 

00.23: Thank you Ben. It’s great to be here.

00.26: I guess, before we go on to the present, let’s talk about a little bit of the past of deep learning frameworks. In fact, this week is interesting because Soumith Chintala just announced he was leaving Meta, and Soumith was one of the leaders of the PyTorch project. I interviewed Soumith in an O’Reilly podcast after PyTorch was released, and coincidentally, almost exactly a year before that, I interviewed Rajat Monga right around the time that TensorFlow was released. I was actually talking to these project leaders very early on. 

So, Laurence, you moved your book to PyTorch, and I’m sure TensorFlow still holds a special place in your heart, right? So where does TensorFlow sit right now in your mind? Because right now it’s all about PyTorch, right? 

01.25: Yeah, that’s a great question. TensorFlow definitely has a very special place in my heart. I built a lot of my recent career on TensorFlow. I’ll be frank. It feels like there’s not that much investment in TensorFlow anymore.

If you take a look at even releases, it went 2.8, 2.9, 2.10, 2.11. . .and you know, there’s no 3.0 on the horizon. I can’t really share any insider stuff from Google, although I left there over a year ago, but it does feel that unfortunately [TensorFlow has] kind of withered on the vine a little bit internally at Google compared to JAX.

02.04: But then the problem, at least for me from an external perspective, is, first of all, JAX isn’t really a machine learning framework. There are machine learning frameworks that are built on top of it. And second of all, it’s not a 1.0 product. It’s hard for me to encourage anybody to bet their business or their career on something that isn’t at least a 1.0 product.

02.29: That really just leaves (by default) PyTorch. Obviously there’s been all of the momentum around PyTorch. There’s been all of the excitement around it. It’s interesting, though, that if you look at things like GitHub star history, it still lags behind both TensorFlow and JAX. But in perception it is the most popular. And unfortunately, if you do want to build a career now on creating machine learning models, not just using machine learning models, it’s really the—oh well, I shouldn’t say unfortunately. . . The truth is that it’s really the only option. So that’s the negative side. 

The positive side of it is of course, it’s really, really good. I’ve been using it extensively for some time. Even during my TensorFlow and JAX days, I did use PyTorch a lot. I wanted to keep an eye on how it was used, how it’s shaped, what worked, what didn’t, the best way for somebody to learn how to learn using PyTorch—and to make sure that the TensorFlow community, as I was working on it, were able to keep up with the simplicity of PyTorch, particularly the brilliant work that was done by the Keras team to really make Keras part of TensorFlow. It’s now been kind of pulled aside, pulled out of TensorFlow somewhat, but that was something that leaned into the same simplicity as PyTorch.

03.52: And like I said, now going forward, PyTorch is. . . I rewrote my book to be PyTorch specific. Andrew and I are teaching a PyTorch specialization with DeepLearning.AI on Coursera. And you know, my emphasis is less on frameworks and framework wars and loyalties and stuff like that, and more on really wanting to help people to succeed, to build careers or to build startups, that kind of thing, and this was the direction that I think it should go in. 

04.19: Now, maybe I’m wrong, but I think even about two years ago, maybe a little more than that, I was still hearing and seeing job posts around TensorFlow, primarily around people working in computer vision on edge devices. So is that still a place where you would run into TensorFlow users?

04.41: Absolutely, yes. Because of what was previously called TensorFlow Lite and is now called LiteRT as a runtime for models to be able to run on edge devices. I mean, that really was the only option until recently— just last week at the PyTorch Summit, ExecuTorch went 1.0. And if I go back to my old mantra of “I really don’t want anybody to invest their business or their career on something that’s prerelease,” it’s good to learn and it’s good to prepare.

05.10: [Back] then, the only option for you to be able to train models and deploy them, particularly to mobile devices, was effectively either LiteRT or TensorFlow Lite or whatever it’s called now, or Core ML for Apple devices. But now with ExecuTorch going 1.0, the whole market is out there for PyTorch developers to be able to deploy to mobile and edge devices.

05.34: So those job listings, I think as they evolve and as they go forward, the skills may kind of veer more towards PyTorch, but I’d also encourage everybody to kind of double-click above the framework level and start thinking on the solution level. There’ve been a lot of framework wars in so many things, you know, Mac versus PC, .NET versus Java. And in some ways, that’s not the most productive way of thinking about things.

I think the best thing to do is [to] think about what’s out there to allow you to build a solution that you can deploy, that you can trust, and that will be there for some time. And let the framework be secondary to that. 

06.14: All right. So one last framework question. And this is also an observation that might be slightly dated—I think this might be from around two years ago. I was actually surprised that, for some reason, I think the Chinese government is also encouraging Chinese companies to use local deep learning frameworks. So it’s not just PaddlePaddle. There’s another one that I came across and I don’t know what’s the status of that now, as far as you know. . .

06.43: So I’m not familiar with any others other than PaddlePaddle. But I do generally agree with [the idea that] cultures should be thinking about using tools and frameworks and models that are appropriate for their culture. I’m going to pivot away from frameworks towards large language models as an example. 

Large language models are primarily built on English. And when you start peeling apart large language models and look at what’s underneath the hood and particularly how they tokenize words, it’s very, very English oriented. So if you start wanting to build solutions, for example, for things like education—you know, important things!—and you’re not primarily an English language-speaking country, you’re already a little bit behind the curve.

07.35: Actually, I just came from a meeting with some folks from Ireland. And for the Gaelic language, the whole idea of posttraining models that were trained primarily with English tokens is already putting you at a disadvantage if you’re trying to build stuff that you can use within your culture.

At the very least, there are missing tokens, right? There are subwords in Gaelic that don’t exist in English, or subwords in Japanese or Chinese or Korean or whatever that don’t exist in English. So if you even start trying to do posttraining, you realize you need to use tokens that the model wasn’t trained with, and stuff like that.

So I know I’m not really answering the framework part of it, but I do think it’s an important thing, like you mentioned, that China wants to invest in their own frameworks. But I think every culture should also be looking at. . . Cultural preservation is very, very important in the age of AI, as we build more dependence on AI. 

08.37: When it comes to a framework, PyTorch is open source. TensorFlow is open source. I’m pretty sure PaddlePaddle is open source. I don’t know. I’m not really that familiar with it. So you don’t have the traps of being locked into somebody else’s cultural perspective or language or anything like that, that you would have with an obscure large language model if you’re using an open source framework. So that part isn’t as difficult when it comes to, like, a country wanting to adopt a framework. But certainly when it comes to building on top of pretrained models, that’s where you need to be careful.

09.11: So [for] most developers and most enterprise AI teams, the reality is they’re not going to be pretraining. So it’s mostly about posttraining, which is a big topic. It can run the gamut of RAG, fine-tuning, reinforcement learning, distillation, quantization. . . So from that perspective, Laurence, how much should someone who’s in an enterprise AI team really know about these deep learning frameworks?

09.42: So I think two different things there, right? One is posttraining and one is deep learning frameworks. I’m going to lean into the posttraining side to argue that that’s the single number one important skill for developers going forward: posttraining and all of its types. . .

10.00: And all of the types of posttraining.

10.01: Yeah, totally. There’s always trade-offs, right? There’s the very simple posttraining stuff like RAG, which is relatively low value, and then there’s the more complex stuff like a full retrain or a LoRA-type training, which is more expensive or more difficult but has higher value. 

But I think there’s a whole spectrum of ways of doing things with posttraining. And my argument that I’m making very passionately is that if you’re a developer, that is the number one skill to learn going forward. “Agents” was kind of the buzzword of 2025; I think “small AI” will be the buzzword of 2026. 

10.40: We often talk about open source AI with open source models and stuff like that. It’s not really open source. It’s a bit of a misnomer. The weights have been released for you to be able to use and self-host them—if you want a self-hosted chatbot or self-host something that you want to run on them. 

But more importantly, the weights are there for you to change, through retraining, through fine-tuning and stuff like that. I’m particularly passionate about that because when you start thinking in terms of two things—latency and privacy—it becomes really, really important. 

11.15: I spent a lot of time working with folks who are passionate about IP. I’ll share one of them: Hollywood movie studios. And we’ve probably all seen those semi-frivolous lawsuits of, person A makes a movie, and then person B sues person A because person B had the idea first. And movie studios are generally terrified of that kind of thing. 

I actually have a movie in preproduction with a studio at the moment. So I’ve learned a lot through that. And one of the things [I learned] was, even when I speak with producers or the financiers, a lot of the time we talk on the phone. We don’t email or anything like that, because the whole fear of IP leaks is out there, and that fear extends to all the things that an LLM could be used to [do]. The shallow stuff would be to help you write scenes and all that kind of stuff. But most of them don’t really care about that. 

The more important things where an LLM could be used [are it could] evaluate a script and count the number of locations that would be needed to film this script. Like the Mission: Impossible script, where one scene’s in Paris and another scene’s in Moscow, and another scene is in Hong Kong. To be able to have a machine that can evaluate that and help you start budgeting. Or if somebody sends in a speculative script with all of that kind of stuff in it, and you realize you don’t have half a billion to make this movie from an unknown, because they have all these locations.

12.41: So all of this kind of analysis that can be done—story analysis, costing analysis, and all of that type of stuff—is really important to them. And it’s great low-hanging fruit for something like an LLM to do. But there’s no way they’re going to upload their speculative scripts to Gemini or OpenAI or Claude or anything like that.

So local AI is really important to them—and the whole privacy part of it. You run the model on the machine; you do the analysis on the machine; the data never leaves your laptop. And then extend that. I mean, not everybody’s going to be working with Hollywood studios, but extend that to just general small offices—your law office, your medical office, your physiotherapists, or whatever [where] everybody is using large language models for very creative things, but if you can make those models far more effective at your specific domain. . .

13.37: I’ll use a small office, for example, in a particular state in a particular jurisdiction, to be able to retrain a model, to be an expert in the law for that jurisdiction based on prior, what is it they call it? Jury priors? I can’t remember the Latin phrase for it, but, you know, based on precedents. To be able to fine-tune a model for that and then have everything locally within your office so you’re not sharing out to Claude or Gemini or OpenAI or whatever. Developers are going to be building that stuff. 

14.11: And with a lot of fear, uncertainty and doubt out there for developers with code generation, the optimist in me is seeing that [for] developers, your value bar is actually raising up. If your value is just your ability to churn out code, now models can compete with you. But if you’re raising the value of yourself to being able to do things that are much higher value than just churning out code—and I think fine-tuning is a part of that—then that actually leads to a very bright future for developers.

14.43: So here’s my impression of the state of tooling for posttraining. So [with] RAG and different variants of RAG, it seems like people have enough tools or have tools or have some notion of how to get started. [For] fine-tuning, there’s a lot of services that you can use now, and it mainly comes down to collecting a fine-tuning dataset it seems like.

[For] reinforcement learning, we still need tools that are accessible. The workflow needs to be at a point where a domain expert can actually do it—and that’s in some ways kind of where we are in fine-tuning, so the domain expert can focus on the dataset. Reinforcement learning, not so much the case. 

I don’t know, Laurence, if you would consider quantization and distillation part of posttraining, but it seems like that might also be something where people would also need more tools. More options. So what’s your sense of tooling for the different types of posttraining?

15.56: Good question. I’ll start with RAG because it’s the easiest. There’s obviously lots of tooling out there for it. 

16.04: And startups, right? So a lot of startups. 

16.07: Yep. I think the thing with RAG that interests me and fascinates me the most is in some ways it shares [similarities] with the early days of actually doing machine learning with the likes of Keras or PyTorch or TensorFlow, where there’s a lot of trial and error. And, you know, the tools.

16.25: Yeah, there’s a lot of knobs that you can optimize. People underestimate how important that is, right? 

16.35: Oh, absolutely. Even the most basic knob, like, How big a slice do you take of your text, and how big of an overlap do you do between those slices? Because you can have vastly different results by doing that. 

16.51: So just as a quick recap, if anybody’s not familiar with RAG, I’d like to give one little example of it. I actually wrote a novel about 12, 13 years ago, and six months after the novel was published, the publisher went bust. And this novel is not in the training set of any LLM.

So if I go to an LLM like Claude or GPT or anything like that and I ask about the novel, it will usually either say it doesn’t know or it will hallucinate and it’ll make stuff up and say it knows it. So to me, this was the perfect thing for me to try RAG. 

17.25: The idea with RAG is that I will take the text of the novel and I’ll chop it up into maybe 20-word increments, with five-word overlap—so the first 20 words of the book and then word 15 through 35 and then word 30 through 50 so you get those overlaps—and then store those into a vector database. And then when somebody wants to ask about something like maybe ask about a character in the novel, then the prompts will be vectorized, and the embeddings for that prompt can be compared with the embeddings of all of these chunks. 

And then when similar chunks are found, like the name of the character and stuff like that, or if the prompt asks, “Tell me about her hometown,” then there may be a chunk in the book that says, “Her hometown is blah,” you know?

So they will then be retrieved from the database and added to the prompt, and then sent to something like GPT. So now GPT has much more context: not just the prompt but also all these extra bits that it retrieves from the book that says, “Hey, she’s from this town and she likes this food.” And while ChatGPT doesn’t know about the book, it does know about the town, and it does know about that food, and it can give a more intelligent answer. 
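
[To make the chunking-and-overlap scheme above concrete, here is a minimal Python sketch; it is an editorial illustration, not from the episode. Everything else in the pipeline (the embedding model, the vector database, the retrieval call) is assumed to sit around it.]

    # Split text into fixed-size word windows with a small overlap, as described above.
    # Each chunk would then be embedded and stored in a vector database; at query time
    # the prompt is embedded, the most similar chunks are retrieved, and they are added
    # to the prompt that is sent to the LLM.
    def chunk_words(text, size=20, overlap=5):
        words = text.split()
        step = size - overlap
        return [" ".join(words[i:i + size]) for i in range(0, len(words), step)]

    sample = " ".join(f"word{i}" for i in range(60))  # stand-in for the novel's text
    print(chunk_words(sample)[:2])  # first chunk: words 0-19; second chunk: words 15-34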

18.34: So it’s not really a tuning of the model in any way or posttuning of the model, but it’s an interesting and really nice hack to allow you to get the model to be able to do more than you thought it could do. 

But going back to the question about tooling, there’s a lot of trial and error there like “How do I tokenize the words? What kind of chunk size do I use?” And all of that kind of stuff. So anybody that can provide any kind of tooling in that space so that you can try multiple databases and compare them against each other, I think is really valuable and really, really important.

19.05: If I go to the other end of the spectrum, then for actual real tuning of a model, I think LoRA tuning is a good example there. And tooling for that is hard to find. It’s few and far between. 

19:20: I think actually there’s a lot of providers now where you can focus on your dataset and then. . . It’s a bit of a black box, obviously, because you’re relying on an API. I guess my point is that even if you’re [on] a team where you don’t have that expertise, you can get going. Whereas in reinforcement learning, there’s really not much tooling out there. 

19:50: Certainly with reinforcement learning, you got to kind of just crack open the APIs and start coding. It’s not as difficult as it sounds, once you start doing it.

20:00: There are people who are trying to build tools, but I haven’t seen one where you can just point the domain expert. 

20.09: Totally. And I would also encourage [listeners that] if you’re doing any other stuff like LoRA tuning, it’s really not that difficult once you start looking. And PyTorch is great for this, and Python is great for this, once you start looking at how to do it. Shameless self-plug here, but [in] the final chapter of my PyTorch book, I actually give an example of LoRA tuning, where I created a dataset for a digital influencer and I show you how to retune and how to LoRA-tune the Stable Diffusion model to be a specialist in creating for this one particular individual—just to show how to do all of that in code.

Because I’m always a believer that before I start using third-party tools to do a thing, I kind of want to look at the code and the frameworks and how to do that thing for myself. So then I can really understand the value that the tools are going to be giving me. So I tend to veer towards “Let me code it first before I care about the tools.”

21.09: Spoken like a true Googler. 

21.15: [laughs] I have to call out one tool that, while it’s not specifically for fine-tuning large language models, I hope they convert it for that. But this one changed the game for me: Apple has a tool called Create ML, which was really used for transfer learning off of existing models—which is still posttraining, just not posttraining of LLMs.

And that tool’s ability to be able to take a dataset and then to fine-tune a model like a MobileNet or something, or an object detection model on that codelessly and efficiently blew my mind with how good it was. The world needs more tooling like that. And if there’s any Apple people listening, I’d encourage them to extend Create ML for large language models or for any other generative models.

22.00: By the way, I want to make sure, as we wind down, I ask you about edge—that’s what’s occupying you at the moment. You talk about this notion of “build once, deploy everywhere.” So what’s actually feasible today? 

22.19: So what’s feasible today? I think the best multideployment surface today that I would invest in going forward is creating for ExecuTorch, because ExecuTorch runtime is going to be living in so many places. 

At Arm, obviously we’ve been working very closely with ExecuTorch and we are part of the ExecuTorch 1.0 release. But if you’re building for edge, you know, making sure that your models work on ExecuTorch would, I think, be the number one, low-hanging fruit that I would say people should invest in. So that’s PyTorch’s model.

22.54: Does it really live up to the “run everywhere”?

23.01: Define “everywhere.”

23.02: [laughs] I guess, at the minimum, Android and iOS. 

23.12: So yes, at a minimum, for those—the same as LiteRT or TensorFlow Lite from Google does. What I’m excited about with ExecuTorch is that it also runs in other physical AI areas. We are going to be seeing it in cars and robots and other things as well. And I anticipate that that ecosystem will spread a lot faster than the LiteRT one. So if you’re starting with Android and iOS, then you’re in good shape. 

23.42: What about the kinds of devices that our mutual friend Pete Warden, for example, targets? The really compute-hungry [ones]? Well, not so much compute hungry, but basically not much compute.

24.05: They sip power rather than gulping it. I think that would be a better question for Pete than for me. If you see him, tell him I said hi. 

24.13: I mean, is that something that the ExecuTorch community also kind of thinks about?

24.22: In short: yes. At more length: that’s a bit more of a challenge to go on microcontrollers and the like. One of the things that I’m really excited about when you start getting down onto the small end is a technology called SME, which is Scalable Matrix Extensions. It’s something that Arm have been working on with various chip makers and handset makers, with the idea being that SME is all about being able to run AI workloads on the CPU, without needing a separate external accelerator. And then as a result, the CPU’s going to be drawing less battery, those kinds of things, etc. 

That’s one of the growth areas that I’m excited about, where you’re going to see more and more AI workloads being able to run on handsets, particularly the diverse Android handsets, because the CPU is capable of running models instead of you needing to offload to a separate accelerator, be it an NPU or a TPU or GPU.

And the problem with the Android ecosystem is the sheer diversity makes it difficult for a developer to target any specific one. But if more and more workloads can actually move on to the CPU, and every device has a CPU, then the idea of being able to do more and more AI workloads through SME is going to be particularly exciting.

25.46: So actually, Laurence, for people who don’t work on edge deployments, give us a sense of how capable some of these small models are. 

First I’ll throw out an unreasonable example: coding. So obviously, me and many people love all these coding tools like Claude Code, but sometimes it really consumes a lot of compute, gets expensive. And not only that, you end up getting somewhat dependent so that you have to always be connected to the cloud. So if you are on a plane, suddenly you’re not as productive anymore. 

So I’m sure in coding it might not be feasible, but what are these language models or these foundation models capable of doing locally [on smartphones, for example] that people may not be aware of?

26.47: Okay, so let me kind of answer that in two different ways: [what] on-device foundation models are capable of that people may not be aware of [and] the overall on-device ecosystem and the kind of things you can do that people may not be aware of. And I’m going to start with the second one.

You mentioned China earlier on. Alipay is a company from China, and they’ve been working on the SME technology that I spoke about, where they had an app, which I’m sure we’ve all seen these kind of apps where you can get your vacation photographs and then you can search your vacation photographs for things, like “Show me all the pictures I took with a panda.”

And then you can create a slideshow or a subset of your folder with that. But when you build something like that, the AI required to be able to search images for a particular thing needs to live in the cloud because on-device just wasn’t capable of doing that type of image-based searching previously.

27.47: So then as a company, they had to stand up a cloud service to be able to do this. As a user, I had privacy and latency issues if I was using this: I have to share all of my photos with a third party and whatever I’m looking for in those photos I have to share with the third party.

And then of course, there’s the latency: I have to send the query. I have to have the query execute in the cloud. I have to have the results come back to my device and then be assembled on my device. 

28.16: Now with an on-device AI, thinking about it from both the user perspective and from the app vendor perspective, it’s a better experience. I’ll start from the app vendor perspective: They don’t need to stand up this cloud service anymore, so they’re saving a lot of time and effort and money because everything is moving on-device. And with a model that’s capable of understanding images, and understanding the contents of images so that you can search for those, executing completely on-device.

The user experience is also better. “Show me all the pictures of pandas that I have”: it’s able to look through all the pictures on the device, get an embedding that represents the contents of each picture, match that embedding to the query that the user is doing, and then assemble those pictures. So you don’t have the latency, and you don’t have the privacy issues, and the vendor doesn’t have to stand up stuff.

29.11: So that’s the kind of area where I’m seeing great improvements, not just in user experience but also making it much cheaper and easier for somebody to build these applications—and all of that then stems from the capabilities of foundation models that are executing on the device, right? In this case, it’s a model that’s able to turn an image into a set of embeddings so that you can search those embeddings for matching things.
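
[As an editorial illustration of the on-device matching step described above (not from the episode), here is a toy Python sketch. The embedding vectors are made up; in a real app an on-device vision model would produce the photo embeddings and a text encoder would embed the query.]

    import math

    # Pretend embeddings for photos already indexed on the device (values are made up).
    photo_embeddings = {
        "IMG_001.jpg": [0.90, 0.10, 0.05],  # contains a panda
        "IMG_002.jpg": [0.05, 0.85, 0.20],  # beach sunset
        "IMG_003.jpg": [0.80, 0.15, 0.10],  # another panda
    }
    query_embedding = [0.88, 0.12, 0.07]    # "show me the pictures I took with a panda"

    def cosine(a, b):
        dot = sum(x * y for x, y in zip(a, b))
        return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(x * x for x in b)))

    # Rank photos by similarity to the query, entirely on-device: no cloud round trip.
    ranked = sorted(photo_embeddings, key=lambda p: cosine(photo_embeddings[p], query_embedding), reverse=True)
    print(ranked)  # most panda-like photos first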

As a result, we’re seeing more and more on-device models, like Gemini Nano, like Apple Intelligence, becoming a foundational part of the operating system. Then we’ll be able to see more and more applications like these being made possible. 

[Imagine a small developer:] “I can’t afford to stand up a cloud service. You know, it’s costing millions of dollars to be able to build an application for somebody, so I can’t do that.” And how many small startups can’t do that? But then as it moves on-device, and you don’t need all of that, and it’s just going to be purely an on-device thing, then suddenly it becomes much more interesting. And I think there’ll be a lot more innovation happening in that space. 

30.16: You mentioned Gemma. What are the key families of local foundation models?

30.27: Sure. So, there’s local foundation models, and then also embedded on-device models. So Gemini Nano on Android and the Apple Intelligence models on Apple [devices], as well as this ecosystem of smaller models that could work either on-device or on your desktop, like the Gemma family from Google. There’s the OpenAI gpt-oss, there’s the Qwen stuff from China, there’s Llama. . . You know, there’s a whole bunch of them out there.

I’ve recently been using the gpt-oss, which I find really good. And obviously I’m also a big fan of Gemma, but there’s lots of families out there—there’s so many new ones coming online every day, it seems. So there’s a lot of choice for those, but many of them are still too big to work on a mobile device.

31.15: You brought up quantization earlier on. And that’s where quantization will have to come into play, at least in some cases. But I think for the most part, if you look at where the vectors are trending, the smaller models are getting smarter. So what the 7 billion-parameter model can do today you needed 100 billion parameters to do two years ago.

And you keep projecting that forward, like the 1 billion-parameter model’s kind of [going to] be able to do the same thing in a year or two’s time, and then it becomes relatively trivial to put them onto a mobile device, if not as part of the core operating system then as something that you ship along with your application.

I can see more and more of that happening where third-party models being small enough to work on mobile devices will become the next wave of what I’ve been calling small AI, not just on mobile but also on desktop and elsewhere. 

32.13: So in closing, Laurence, for our listeners who are already familiar and may already be building AI applications for cloud or enterprise, this conversation may prompt them to start checking out edge and local applications.

Besides your book and your blog, what are some of the key resources? Are there specific conferences where a lot of these local AI and edge AI people gather, for example? 

32.48: So local AI, not yet. I think that that wave is only just beginning. Obviously things like the Meta conferences will talk a lot about Llama; Google conferences will talk a lot about Gemma; but an independent conference for just general local AI as a whole, I think that wave is only just beginning.

Mobile is very vendor specific or [focused on] the ecosystem of a vendor. Apple obviously have their WWDC, Google have their conferences, but there’s also the independent conference called droidcon, which I find really, really good for understanding mobile and understanding AI on mobile, particularly for the Android ecosystem.

But as for an overall conference for small AI and for the ideas of fine-tuning, all of the types of posttuning of small AI that can be done, that’s a growth area. I would say for posttraining, there’s a really excellent Coursera course that a friend of mine, Sharon Zhou, just released. It just came out last week or the week before. That’s an excellent course in all of the ins and outs of posttraining fine-tuning. But, yeah, I think it’s a great growth area.

34.08: And for those of us who are iPhone users. . . I keep waiting for Apple Intelligence to really up its game. It seems like it’s getting close. They have multiple initiatives in the works. They have alliances with OpenAI and now with Google. But then apparently they’re also working on their own model. So any inside scoop? [laughs]

34.33: Well, no inside scoop because I don’t work at Apple or anything like that, but I’ve been using Apple Intelligence quite a lot, and I’m a big fan. The ability to have the on-device large language model is really powerful. There’s a lot of scenarios I’ve been kind of poking around with and helping some startups with in that space. 

The one thing that I would say that’s a big gotcha for developers to look out for is the very small context window. It’s only 8K, so if you try to do any kind of long-running stuff or anything interesting like that, you’ve got to go off-device. Apple have obviously been investing in this private cloud so that your sessions, when they go off-device into the cloud. . . At least they try to solve the privacy part of it. They’re getting ahead of the privacy [issue] better than anybody else, I think. 

But latency is still there. And I think that deal with Google to provide Gemini services that was announced a couple of days ago is more on that cloud side of things and less on the on-device. 

35.42: But going back to what I was saying earlier on, the 7 billion-parameter model of today is as good as the 120 billion of yesterday. The 1 billion-parameter [model] of next year is probably as good as that, if not better. So, as models with smaller parameter counts, and therefore smaller memory footprints, are becoming much more effective, I can see more of them being delivered on-device as part of the operating system, in the same way as Apple Intelligence is doing it. But hopefully with a bigger context window, because they can afford it with the smaller model. 

36.14: And to clarify, Laurence, that trend that you just pointed out, the increasing capability of the smaller models, that holds not just for LLMs but also for multimodal? 

36.25: Yes. 

36.26: And with that, thank you, Laurence. 

36.29: Thank you, Ben. Always a pleasure.




Announcing v2.32.0 of the Vonage Video API

November Updates of the Video API featuring v2.32.0 across our Web, Android, iOS, Linux, and Windows SDKs.

Hands-on with MCP Resources in VS Code


In the first post of this series, we explored what MCP resources are and why they're the overlooked piece of the MCP puzzle. Now it's time to get practical—let's dive into using resources in Visual Studio Code.

By the end of this post, you'll know how to discover, browse, attach, and leverage MCP resources to supercharge your AI-powered development workflow.

Understanding resources in VS Code

When an MCP server exposes resources, VS Code makes them accessible in several ways:

  1. Browse all resources across all installed servers
  2. Browse resources per server to see what each provides
  3. Attach resources to chat as context for your conversations with Copilot
  4. View resources directly in the editor
  5. Save resources from tool call results to your workspace

Think of resources as a context menu for your AI—a way to give Copilot exactly the information it needs without copy-pasting or explaining everything manually.

Setting up your first MCP server with resources

Let's start by installing an MCP server that provides resources. We'll use the GitHub MCP Server as our example because it's widely used and demonstrates several resource patterns.

Method 1: Install from the Extensions View

The easiest way to install an MCP server is through VS Code's built-in gallery:

  1. Open the Extensions view (Ctrl+Shift+X or Cmd+Shift+X on Mac)
  2. Type @mcp in the search field
  3. Find "GitHub MCP Server" and click Install
  4. VS Code will prompt you to trust the server—review the configuration and confirm

Method 2: Use the MCP: Add Server Command

Alternatively, you can use the command palette:

  1. Press Ctrl+Shift+P (or Cmd+Shift+P on Mac)
  2. Type "MCP: Add Server"
  3. Select your package manager (NPM, PyPI, or Docker)
  4. Enter the server details
  5. VS Code handles the rest!

Method 3: Manual Configuration

For team projects, you can share MCP server configurations using a workspace file:

  1. Create .vscode/mcp.json in your workspace root
  2. Add your server configuration (a minimal example is sketched just after this list)
  3. Save the file—VS Code will detect it and offer to start the server
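
For the GitHub MCP Server used in this walkthrough, the file might look something like the sketch below. Treat it as illustrative rather than canonical: the exact fields (server name, transport type, URL or command) depend on the server and how you choose to run it, and the endpoint shown is the hosted GitHub MCP URL from GitHub's documentation, so double-check it against the server's README.

  {
    "servers": {
      "github": {
        "type": "http",
        "url": "https://api.githubcopilot.com/mcp/"
      }
    }
  }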

Browsing resources: Your first look

Once you have an MCP server installed and running, let's explore its resources.

Browse resources

Open the command palette and run MCP: Browse Resources. This shows resources from all your installed MCP servers in one unified view.

You'll see:

  • Resource names (human-readable descriptions)
  • Resource URIs (unique identifiers like github://repo/owner/name/readme)
  • Server attribution (which MCP server provides each resource)
 

Understanding resource templates

Some resources use templates—URIs with placeholders that let you provide parameters. For example:

  • github://repo/{owner}/{name}/file/{path}
  • database://query/{table_name}

When you select a templated resource, VS Code prompts you for each parameter. This makes resources dynamic and flexible—you're not limited to pre-defined values.

Attaching resources to chat

Here's where resources become truly powerful. Instead of explaining context to Copilot, you attach it directly.

Using the Add Context Button

  1. Open GitHub Copilot Chat
  2. Click the Add Context button (the paperclip icon)
  3. Select MCP Resources…
  4. Choose the resource you want to attach

The resource content is now part of your conversation context. Copilot can reference it when answering questions or generating code.

What’s next?

In the next post, we'll level up by building our own MCP resource server from scratch. You'll learn:

  • The anatomy of a resource server
  • Implementing resources in C#
  • Best practices for resource URIs and metadata
  • Testing and debugging your server

Stay tuned!


    Rust in Android: More Memory Safety, Fewer Revisions, Fewer Rollbacks, Shorter Reviews

    Android's security team published a blog post this week about their experience using Rust. Its title? "Move fast and fix things."

    Last year, we wrote about why a memory safety strategy that focuses on vulnerability prevention in new code quickly yields durable and compounding gains. This year we look at how this approach isn't just fixing things, but helping us move faster. The 2025 data continues to validate the approach, with memory safety vulnerabilities falling below 20% of total vulnerabilities for the first time. We adopted Rust for its security and are seeing a 1000x reduction in memory safety vulnerability density compared to Android's C and C++ code. But the biggest surprise was Rust's impact on software delivery. With Rust changes having a 4x lower rollback rate and spending 25% less time in code review, the safer path is now also the faster one... Data shows that Rust code requires fewer revisions. This trend has been consistent since 2023. Rust changes of a similar size need about 20% fewer revisions than their C++ counterparts... In a self-reported survey from 2022, Google software engineers reported that Rust is both easier to review and more likely to be correct. The hard data on rollback rates and review times validates those impressions.

    Historically, security improvements often came at a cost. More security meant more process, slower performance, or delayed features, forcing trade-offs between security and other product goals. The shift to Rust is different: we are significantly improving security and key development efficiency and product stability metrics. With Rust support now mature for building Android system services and libraries, we are focused on bringing its security and productivity advantages elsewhere. Android's 6.12 Linux kernel is our first kernel with Rust support enabled and our first production Rust driver. More exciting projects are underway, such as our ongoing collaboration with Arm and Collabora on a Rust-based kernel-mode GPU driver.

    [They've also been deploying Rust in firmware for years, and Rust "is ensuring memory safety from the ground up in several security-critical Google applications," including Chromium's parsers for PNG, JSON, and web fonts.] 2025 was the first year more lines of Rust code were added to Android than lines of C++ code...



    Why Diffusion Models Could Change Developer Workflows in 2026


    Developers spend much of their time editing, refactoring, and debugging rather than producing entirely new code. Code creation tends to involve non-sequential back-and-forth refinement rather than typing out a complete function in one uninterrupted sequence. You might sketch a part, adjust parameters, skip ahead, then revisit earlier sections to refine them. 

    Diffusion models, and in particular diffusion large language models (d-LLMs), operate differently from current coding assistants. Unlike autoregressive models, which generate token by token in a strict left-to-right sequence, d-LLMs condition on both past and future context. This enables them to model edits and refinements more directly, reflecting how developers iteratively construct and adjust code. As shown by Gong et al. (2025): “the [d-LLM] model often plans token generation more globally, much like a programmer jumping back and forth through code to refine a code implementation.” This matches the reality of code authorship, which is non-linear: you sketch a bit, revise earlier parts, jump ahead, and keep iterating.

    Out-of-order generation feels more human

    One of the most striking demos of diffusion-based models like DiffuCoder showed exactly this: the model skipped a parameter mid-function, continued writing later parts, then circled back to fill in what was missing.

    (The prompt used here is: “Write a Python function to implement binary search together with docstrings and type hints.” The example is generated using the apple/DiffuCoder-7B-Instruct model, configured to produce one token per forward pass with a limit of 256 new tokens. The blue slots illustrate positions that the model later revisits and refines during the diffusion process.)

    This structure may mirror how many developers think. You may not know every detail upfront, but you can scaffold a function and refine as you go. A model that can generate in a non-sequential way is better suited to support this process.

    Bi-directional context improves reasoning

    Autoregressive models can be prompted to consider bidirectional context by providing both prefix and suffix in the prompt, but this remains a workaround rather than a native capability. Diffusion models, particularly diffusion large language models (d-LLMs), are designed from the ground up to condition on both past and future context during generation.

    This design proves valuable for tasks requiring reversal reasoning, where coherence must hold in both directions, and for code generation, where a variable’s usage downstream should inform its earlier definition. As shown by Nie et al. (2025), d-LLMs exhibit “consistent zero-shot performance across both forward and reversal tasks.”

    For developers, this translates into improved handling of structured logic, long-range dependencies, and code constraints that depend on order-sensitive relationships.

    Flexibility in editing and refactoring

    Because diffusion models mask and unmask tokens gradually at any random position, they are naturally suited to infilling. If you ask a diffusion model to rewrite a block of code with a different parameter or to refactor a loop into a comprehension, it can directly operate on masked sections. 

    The distinction with autoregressive LLMs is subtle here, since both require the relevant code region to appear in the prompt. Where diffusion models add value is in integration with deterministic tooling such as IDEs. An IDE could highlight several problematic or incomplete regions, mask them, and allow the diffusion model to unmask and regenerate all affected parts in a single coherent pass. This distinguishes diffusion models from FIM-enabled autoregressive models, which can handle isolated infilling but struggle to maintain global consistency across multiple edits.

    Example: coordinated multi-region updates

    Consider adding a field to a class that must be initialised in the constructor, used in a method, and serialised elsewhere. Rather than orchestrating multiple FIM calls or regenerating entire methods, a diffusion model can mask the relevant locations and generate all necessary updates at once. 

    This makes diffusion models well-suited to refactoring tasks where changes must satisfy global constraints, such as ensuring a new parameter appears consistently in a function signature, its documentation, call sites, and test cases.

    For example, an IDE might flag a type mismatch in a function signature. Instead of regenerating the entire function, a diffusion model could unmask just the problematic parameter declaration and rewrite it to match the expected type, leaving the rest of the code untouched. This localised editing process mirrors how developers typically fix errors and refactor code incrementally.
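
    To make that workflow concrete, here is a rough sketch of the idea rather than a real d-LLM API: an editor marks the related regions with a mask token and asks the model to fill them all jointly. The diffusion_infill helper below is hypothetical and simply substitutes hardcoded fills so the example runs end to end.

    MASK = "<MASK>"

    masked_source = f"""class Order:
        def __init__(self, items, customer{MASK}):
            self.items = items
            self.customer = customer
            {MASK}

        def discounted_total(self):
            return sum(item.price for item in self.items) * (1 - {MASK})

        def to_dict(self):
            return {{"items": self.items, "customer": self.customer, {MASK}}}
    """

    def diffusion_infill(source, fills):
        # Stand-in for a d-LLM call: a real model would unmask every region jointly,
        # conditioning each fill on the others so the edits stay globally consistent.
        for fill in fills:
            source = source.replace(MASK, fill, 1)
        return source

    print(diffusion_infill(masked_source, [
        ", discount=0.0",             # new constructor parameter
        "self.discount = discount",   # initialised in the constructor
        "self.discount",              # used in a method
        '"discount": self.discount',  # and serialised elsewhere
    ]))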

    Potential speed improvements

    Autoregressive models operate sequentially, generating one token per forward pass. Diffusion models, by contrast, can produce multiple tokens in a single forward pass. Benchmarks reveal a current practical shortcoming: increasing the number of tokens per step often reduces quality. The underlying mechanism, however, allows faster inference and is likely to improve in future.

    Researchers have proposed semi-autoregressive approaches to bridge the gap between autoregressive and diffusion-based generation – most notably Block Diffusion – Arriola et al. (2025). This method generates blocks of tokens from left to right while allowing diffusion models to unmask flexibly within each block. In principle, this allows reuse of the KV cache, which plays a key role in the efficiency of autoregressive models. In practice, however, unmasking too many tokens in parallel creates a trade-off. Throughput increases, but quality often drops sharply, especially if the KV cache is not reused carefully.

    Semi-autoregressive generation represents an intermediate step between autoregressive and truly out-of-order inference. Diffusion-based language models work fundamentally out of sequence, yet current methods still borrow ideas from autoregressive design, such as KV cache reuse, because the optimisation tools for autoregressive generation remain highly developed and effective. Ironically, these mature autoregressive mechanisms improve generation speed even as research moves towards models that can generate fully out of order.
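
    As a toy illustration of that generation order (no model involved, just the scheduling idea), the following Python snippet produces positions block by block from left to right while shuffling the order inside each block:

    import random

    seq_len, block_size = 12, 4
    order = []
    for block_start in range(0, seq_len, block_size):
        positions = list(range(block_start, block_start + block_size))
        random.shuffle(positions)      # within a block, positions are unmasked out of order
        order.extend(positions)
    print(order)  # e.g. [2, 0, 3, 1, 6, 5, 4, 7, 9, 11, 8, 10]: blocks left to right, free order inside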

    Current limitations

    For now, developers should temper expectations. Our internal experimentation with the latest open-source models shows that:

    • The best quality comes when unmasking one token per step, which slows things down and means these models don’t differ much from AR models in practice.
    • Diffusion models can repeat prefixes or suffixes, or even output incoherent text when pushed too far.
      • Repetition: model re-outputs entire prefix blocks multiple times.
    def factorial(n):
        if n == 0:
            return 1
        else:
            return n * factorial(n - 1)
    
    def factorial(n):   # repeated
        if n == 0:
            return 1
        else:
            return n * factorial(n - 1)
    
    def factorial(n):   # repeated again
        if n == 0:
            return 1
        else:
            return n * factorial(n - 1)
    • Early termination: incomplete function bodies or truncated expressions.
    def factorial(n):
        if n == 0:
            return 1
        else:
            return n * factorial(   # truncated, no argument
    • Malformed syntax: unmatched brackets, dangling commas, or gibberish tokens.
    def factorial(n):
        if (n == 0:
            return 1,
        else:
            return n ** factorial[n - 1))

    Benchmarking current state-of-the-art d-LLMs – open source (DiffuCoder, Seed-Diffusion) and closed-source (Mercury, Gemini-Diffusion) – shows mixed performance when compared against strong autoregressive baselines such as Qwen2.5-Coder. See Gong et al. (2025) and Song, Yuxuan et al. (2025).

    Despite these issues, diffusion models still introduce valuable new possibilities for code generation and editing. At the same time, their ecosystem is very immature compared to autoregressive LLMs. 

    Training and inference techniques that help mitigate sequential bottlenecks in LLMs, such as chunked prefill, speculative decoding, or prefix caching, have no direct equivalents yet for diffusion models. 

    Diffusion also requires defining output length in advance, which often leads to inefficiency compared to the <eos> termination signal in LLMs. 

    Finally, the scarcity of open-source diffusion models for code makes it harder for developers to experiment with and refine these methods.

    Where they can be useful today

    • Code completion with context editing – filling in missing parts of a function rather than only extending text.
    • Refactoring support – restructuring code blocks where order is less rigid.
    • Structured text tasks – mathematical reasoning or reversal problems where bi-directional context matters.

    These niches give developers a reason to experiment, even if production-ready tools remain a little way off. Beyond these early applications, the major promise of d-LLMs lies in their potential for much faster generation, since they can produce N tokens per forward pass rather than one. 

    This capability could eventually reshape performance expectations for coding assistants once the quality–efficiency trade-offs are better understood.

    Looking ahead

    Diffusion models won’t replace autoregressive models overnight. But they represent a new paradigm that better reflects how developers think and work. Their ability to edit flexibly, consider context in both directions, and potentially accelerate inference sets them apart.

    For developers, the practical benefit is clear: snappier generation, and more support for the unstructured, iterative way you actually write code.

    As research continues, diffusion models could become the backbone of coding assistants that feel less like next-token generators and more like principled, code-structure-aware programming collaborators.


    489: .NET 10 and Visual Studio 2026


    Machine transcription available on http://mergeconflict.fm





    Download audio: https://aphid.fireside.fm/d/1437767933/02d84890-e58d-43eb-ab4c-26bcc8524289/649e8871-ed0d-4f39-8cf6-65526059773d.mp3