
Copilot Cowork — A New Way of Getting Work Done in Microsoft 365


The pace of AI innovation continues to accelerate, and Microsoft keeps moving fast. This time the leap is significant: Copilot Cowork — the new execution layer for Microsoft 365 — is now available in the Frontier program, and it takes Copilot from a helpful assistant to an AI coworker that actually does the work with you. After using it heavily for the past few weeks, I can say this is another meaningful step in how AI is reshaping the way we work.

  1. What is Copilot Cowork?
  2. Why Copilot Cowork matters for the Future of Work
  3. Copilot Cowork vs. Claude Cowork — what’s the difference?
  4. What Copilot Cowork can do for you
  5. Stay in control — approval-gated actions
  6. My experience: from everyday tasks to training content
  7. A custom skill for training and session content
  8. How to enable Copilot Cowork in your tenant
  9. Cowork is changing how we work

What is Copilot Cowork?

Copilot Cowork is an agentic coworking experience inside Microsoft 365 Copilot. Rather than answering single questions, Cowork carries out multi-step tasks on your behalf — it drafts and sends emails, schedules meetings, creates documents, posts in Teams, browses SharePoint and OneDrive, and searches across your organization. You describe the outcome in natural language, and Cowork generates a plan grounded in your Microsoft 365 context and works through it step by step — visibly, inside the conversation, so you can follow every move it makes.

Cowork is available in the browser at m365.cloud.microsoft and in the Microsoft 365 Copilot desktop app for Windows and Mac, installable from the Agent Store and pinnable to the left rail.

The key shift is not the conversation. It is the execution — from talking about work to doing work.

Why Copilot Cowork matters for the Future of Work

For years we have been promised AI that truly collaborates with us. With Copilot Cowork, that promise becomes something you can use today. It is the kind of change that moves organizations closer to being Frontier Firms — workplaces where people and intelligent agents co-create together.

A few reasons this is such a big deal:

  • It runs inside your Microsoft 365 tenant, with your work data, under your identity and permissions. Note: it uses Anthropic Claude models, with Anthropic acting as a subprocessor (operating under Microsoft oversight, with contractual safeguards and appropriate technical and organizational measures).
  • It plans multi-step tasks instead of answering single prompts.
  • It produces finished artifacts — real files, real messages, real calendar events.
  • It carries work forward over time, with visible checkpoints and progress tracking.
  • It keeps you in control — every sensitive action requires your explicit approval.

This is exactly the kind of co-creation between humans and AI I have been talking about for a long time — and now it is a reality inside Microsoft 365.

Copilot Cowork vs. Claude Cowork — what’s the difference?

It is a fair question, because both products share a lot of DNA. Both can research, reason, plan, and produce documents. Both can build PowerPoint decks, Word documents, Excel spreadsheets, and PDFs. Both can orchestrate multi-step work and be extended with custom skills.

The critical difference — and this is the whole point — is where the work happens and what data it can reach.

Capability | Claude Cowork | Copilot Cowork
Multi-step task execution | Yes | Yes
Creates Word / Excel / PowerPoint / PDF | Yes | Yes
Web / deep research | Yes | Yes
Custom skills | Yes | Yes (up to 20, stored in your OneDrive)
Runs inside Microsoft 365 | No | Yes (with Anthropic as a subprocessor)
Accesses your mailbox, Teams chats, calendar, SharePoint, OneDrive | Not directly, but possible | Yes, via Work IQ
Sends emails, posts in Teams, creates meetings on your behalf | Not directly, but possible | Yes, with approval gating

Claude Cowork is an excellent general-purpose AI coworker. Copilot Cowork, on the other hand, knows your work — your inbox, your meetings, your team, your files, your org — and can act inside it. For knowledge workers living in Microsoft 365, that is a completely different level.

What Copilot Cowork can do for you

Out of the box, Cowork ships with several built-in skills: Word, Excel, PowerPoint, PDF, Email, Scheduling, Calendar Management, Meetings, Daily Briefing, Enterprise Search, Communications, Deep Research, and Adaptive Cards. Together they cover the most common daily workflows of a knowledge worker:

  • Communication — draft and send emails, post in Teams channels or chats, create HTML newsletters, sort your inbox, prepare stakeholder updates.
  • Documents and files — create and edit Word, Excel, PowerPoint, and PDF; browse your entire Work IQ to pull in the right content; create and reorganize SharePoint and OneDrive folders.
  • Calendar and meetings — schedule meetings in natural language, move things around, decline conflicts (with a reason message to the organizer), get meeting intelligence, and start your day with a daily briefing.
  • Research and search — enterprise search across your org, plus deep research that synthesizes multiple sources into a comprehensive report.
  • Automation — run prompts on a schedule so recurring tasks happen automatically.

And you can manage your work with built-in task views — a sortable list, a kanban board, or a Scheduled tab — with each task showing a clear status: In progress, Needs user input, Done, or Failed.

Stay in control — approval-gated actions

One of the things I appreciate the most is how Cowork handles trust. Before it does anything sensitive — sending an email, posting a Teams message, scheduling or declining a meeting, editing or moving files — it pauses and asks for your go-ahead. Approval buttons match the action (Send, Post, Schedule) and medium and high-risk actions get a visible risk indicator. You can skip future prompts for similar actions when you want more speed, and you can pause, resume, or cancel any running task at any time.

This is exactly the right balance: agentic autonomy with human oversight.

My experience: from everyday tasks to training content

I have been using Copilot Cowork a lot lately, for both routine work and the more creative parts of my job. On the everyday side it is the quiet helper you do not notice any more — drafting replies, summarizing threads, preparing briefs, turning a messy meeting into clean action items, building a small spreadsheet from a message.

Where it has really impressed me is content creation for training sessions. This is work I love doing, but there is a lot of repetitive structural effort before the fun part begins. With Cowork, the flow looks like this:

  1. I give Cowork the topic, the audience, and a few guidelines and any other information I consider relevant.
  2. Cowork researches the topic — from the input I provide, from Microsoft Learn and Microsoft Support, from Tech Community, and from the latest news.
  3. It builds a first draft of the presentation — structure, slides, talking points.
  4. It prepares the demo by creating the demo content scripts I need: the files, the emails, the Teams messages — so the demo feels realistic.
  5. It writes a demo script so I know exactly what to click, what to say, and what to show.

The quality of the first draft is clearly good. Not final — never final — but good enough that I can iterate fast instead of staring at an empty slide. That alone is a huge change in how I prepare sessions.

A custom skill for training and session content

One of the things I love the most about Copilot Cowork is that you can extend it with your own skills. Custom skills live in your OneDrive under /Documents/Cowork/Skills/, each in its own subfolder with a SKILL.md file, and Cowork discovers them automatically at the start of each conversation. You can have up to 20 of them.
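
For illustration, a skills folder might look roughly like this (the skill names below are hypothetical examples of my own; only the /Documents/Cowork/Skills/ location and the SKILL.md file name come from the product):

/Documents/Cowork/Skills/
    training-session-content/
        SKILL.md    <- the skill definition Cowork discovers and reads
    weekly-status-report/
        SKILL.md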

So I built one for this purpose, and a few others for other needs.

My skill helps me create both training and session content drafts in a consistent, repeatable way. When I invoke it, Cowork:

  • First researches from the sources I trust — my own input, Microsoft Learn and Support, Tech Community, and current news.
  • Follows my own guidelines on structure, tone, and depth.
  • Produces a first draft of the presentation, the demo content, and the demo script together as a package.

The result is that I get to the “iterate and polish” part of the work much faster than before. And because the skill encodes my way of working, every draft starts from a good baseline instead of a blank page.

How to enable Copilot Cowork in your tenant

Copilot Cowork is a Frontier preview feature, so a few things need to be in place before you and your users can start working with it.

Prerequisites

  • A Microsoft 365 Copilot license for the users who will use Cowork.
  • The tenant enrolled in the Frontier program.
  • Microsoft-built agents enabled in the Microsoft 365 admin center.
  • Anthropic as a subprocessor enabled for the tenant. This is on by default globally, but off by default for tenants in the EU Data Boundary, where it must be explicitly enabled. If Anthropic is off, users may see Cowork but will not be able to use it.
  • Currently English language only.

A small tip for admins

If Cowork does not show up in Agent management in the Microsoft 365 admin center, make sure your admin account is also enrolled in Frontier (Copilot → Settings → Frontier). Without that, admins will not see Cowork in the Agent Inventory.

For users

Once the tenant is ready, users install Cowork themselves from the Agent Store in the Microsoft 365 Copilot app and pin it to the left rail. They can start using it immediately in the browser at m365.cloud.microsoft or in the Microsoft 365 Copilot desktop app. No extra admin action is required per user — permissions and policies already in place in Microsoft 365 are fully respected.

For details, see the official Microsoft Learn pages:

Cowork is changing how we work

Copilot Cowork is not a small update. It is a step-change in how we work with AI inside Microsoft 365 — from a chat companion to an actual coworker that plans, researches, and delivers. It is saving me time every single day, and it is making the more creative parts of my job (like preparing training and sessions) genuinely more fun. And it is very versatile – you can use it for everything from sending messages to organizing your OneDrive to larger workflows.

This is the kind of move that brings the AI-Native workplace closer, and I am excited to see how organizations adopt it and co-create with it.

And one more thing — in the spirit of transparency: I used Copilot Cowork to create the first draft of this very blog post, and then edited it further myself. Which feels like exactly the right way to work with an AI coworker.

Try Cowork. Teach it your way of working. You will be surprised how quickly it becomes part of your team.

Stay tuned — I will keep exploring Copilot and Future Work on the blog!




Theseus, a static Windows emulator


This post is likely the end of my series on retrowin32.

I bring you: Theseus, a new Windows/x86 emulator that translates programs statically, solving a bunch of emulation problems while surely introducing new ones.

What happened to retrowin32?

I haven't been working on retrowin32, my win32 emulator, in part due to life stuff and in part because I haven't been sure where I wanted to go with it. And then someone who had contributed to it in the past posted retrotick, their own web-based Windows emulator that looks better than my years of work, and commented on HN that it took them an hour with Claude.

This is not a post about AI, both because there are too many of those already and because I'm not yet sure of my own feelings on it. But one small thing I have been thinking about is that (1) AI has been slowly but surely climbing the junior to senior engineer ladder; and (2) one of the main pieces of being a senior engineer is better understanding what you ought to be building, as distinct from how to build it.

(Is that just the Innovator's Dilemma's concept of "retreating upmarket", applied to my own utility as a human? Not even sure. I am grateful I do this work for the journey, to satisfy my own curiosity, because that means I am not existentially threatened like a business would be in this situation. As Benny Feldman says: "I cheat at the casino by secretly not having an attachment to material wealth!")

So, Mr. Senior Engineer, what ought we build? What problem are we even solving with emulators, and how do our approaches meet that? I came to a kind of unorthodox solution that I'd like to tell you about!

Emulators and JITs

The simplest CPU emulator is very similar to an interpreter. An input program, after parsing, becomes x86 instructions like:

mov eax, 3
add eax, 4
call ...  ; some Windows system API

An interpreting emulator is a big loop that steps through the instructions. It looks like:

loop {
   let instr = next_instruction();
   match instr {
      // e.g. `mov eax, 3`
      Mov => { set(argument_1(), argument_2()); }
      // e.g. `add eax, 4`
      Add => { set(argument_1(), argument_1() + argument_2()); }
      ...
  }
}

Like an interpreter, this approach is slow.

At a high level interpreters are slow because they are doing a bunch of dynamic work for each instruction. Imagine emulating a program that runs the same add instruction in a loop; the above emulator loop has all these function calls to repeatedly ask "what instruction am I running now?" and inspect the arguments, only to eventually do the same add on each iteration. x86 memory references are extra painful because they are very flexible.

Further, on x86 the add instruction not only adds the numbers but also computes six derived values, including things like the parity flag: whether the result contains an even number of 1 bits(!). A correct emulator needs to either compute all of these as well, or perform some sort of side analysis of the code to decide how to run it efficiently.
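
To make that concrete, here is a minimal sketch (my own illustration, not retrowin32 or Theseus code) of what a faithful add helper has to produce besides the sum:

struct Flags { cf: bool, zf: bool, sf: bool, of: bool, af: bool, pf: bool }

fn add_with_flags(x: u32, y: u32, flags: &mut Flags) -> u32 {
    let (result, carry) = x.overflowing_add(y);
    flags.cf = carry;                                            // carry out of bit 31
    flags.zf = result == 0;                                      // result is zero
    flags.sf = (result as i32) < 0;                              // high bit set
    flags.of = ((x ^ result) & (y ^ result)) & 0x8000_0000 != 0; // signed overflow
    flags.af = ((x ^ y ^ result) & 0x10) != 0;                   // carry out of bit 3
    flags.pf = (result as u8).count_ones() % 2 == 0;             // even number of 1 bits in the low byte
    result
}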

There are various fun techniques to improve emulators. But if you want to go fast what you really need is some combination of analyzing the code and generating native machine code from it — a JIT. JITs are famously hard to write! They are effectively optimizing compilers, which means all the complexity of optimization and generating machine code, but also where the runtime of the compilation itself is in the critical performance path. I liked this post's discussion of why JITs are hard which mentions there have been more than 15 attempts at a Python JIT.

Static binary translation

So suppose you want to generate efficient machine code, but you don't want to write a JIT. You know what's really good at analyzing code and generating efficient machine code from it? A compiler!

So here's the main idea. Given code like the above input x86 snippet, we can process it into source code that looks like:

regs.eax = 3;
regs.eax = add(regs.eax, 4);
windows_api();  // some native implementation of the API that was called

We then feed this code back in to an optimizing compiler to get a program native to your current architecture, x86 no longer needed.

In other words, instead of handing an .exe file directly to an emulator that might JIT code at runtime, we have a sort of compiler that statically translates the .exe (via a second compiler in the middle) directly into a "native" executable.

(I write native in scare quotes because while the resulting executable is a native binary, it is a binary that is carrying around a sort of inner virtual machine representing the x86 state, like the regs struct in the above code. More on this in a bit.)
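
As a rough sketch (a hypothetical layout, not the actual Theseus types), that inner machine is little more than an ordinary struct that the translated code reads and writes:

struct X86 {
    eax: u32, ebx: u32, ecx: u32, edx: u32,
    esi: u32, edi: u32, ebp: u32, esp: u32,
    flags: u32,   // the CPU flags (carry, zero, sign, parity, ...)
    mem: Vec<u8>, // the emulated program's view of memory
}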

I think I came up with this basic idea on my own just by thinking hard about what I was trying to achieve, but it turns out this approach is known as static binary translation and is well studied. It has some nice properties, and also some big problems.

Decompilation

I'll go into those, but first, a minor detour about how I ended up here.

Have you heard of decompilation? These madmen (madpeople?) are manually recreating the source code to old video games, one function at a time. They take the game binary, extract the machine code of one function, then use a fancy UI (click one of the entries under "Recent activity") to iteratively tinker on reproducing the higher-level code that generates the exact same machine code. It's kind of amazing.

(To do this, they need to even run the same original compiler that was used to compile the target game. Those compilers are often Windows programs, which means implementing the above fancy UI involves running old Windows binaries on their Linux servers. This is how I first learned about them — they need a Windows emulator!)

Decompilation is not just a weird and fascinating (and likely tedious?) human endeavor. It also highlighted something important for me: I don't so much care about having an emulator that can run any random program, I care about running a few very specific programs and I'm willing to go to even some manual lengths to help out.

In practice, if you look at a person building a Windows emulator, they end up as surgeons needing to kind of manually reach in and pump the heart of the target program themselves anyway, including debugging the target program and working around its individual bugs. It's common for emulators to even manually curate a list of programs that are known to work or fail.

An old idea

Statically translating machine code is not a new idea. Why isn't it more popular? My impression from trying to read about it is that it is often dismissed as something that can't work, but at least so far it has worked well. Maybe there is some impossible problem that I've so far overlooked?

(When trying to look up related work for this blog post, I saw this attempt at statically translating NES that concluded it can't be done, but then also these people seem to be succeeding at it so it's hard to say.)

I think there are two main problems, a technical one and a more cultural one.

The technical part is that the simple idea has complex details. To start with, any program that generates code at runtime (e.g. itself containing a JIT) won't work, but it's easy for me to just dismiss those programs as out of scope. There are also challenges around things like how control flow works, but those are small and interesting and I might go into them in future posts.

A common topic of research is that it's in the limit impossible to statically find all of the code that might be executed even in a program that doesn't generate code at runtime, because of dynamic control flow from vtables or jump tables. In particular, while there are techniques to find most of the code, no approach is guaranteed to work perfectly. This is where decompilation changed my view: if I'm willing to manually help out a bit on a specific program, then this problem might be fine?
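
For a concrete picture of why, consider how a compiled switch statement can end up in the binary (the table address here is made up):

mov eax, [esp+4]                  ; case index, only known at runtime
jmp dword ptr [4060F0h + eax*4]   ; indirect jump through a jump table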

The main cultural reason I think binary translation isn't more common is that it's not as convenient as a generic emulator that handles most programs already. Users aren't likely to want to run a compiler toolchain, though I have seen projects embed the compiler (e.g. LLVM) directly to avoid this.

The other cultural problem is there are legal ramifications if you intend to distribute translated programs. Every video game emulator relies on the legal fiction of "first, copy the game data from the physical copy you already own and pass that in as an input", so they get to plausibly remain non-derivative works.

But I'm not solving for users, I'm solving for my own interest. These cultural problems don't matter to me.

Benefits

Again consider the snippet above, which is adding 3 and 4. In a static translator world we parse the instruction stream ahead of time, so the compiler gets to see that we want to put a 3 in eax and not (as an interpreter would) spend runtime considering what values we are reading and writing where.

A compiler will not only generate the correct machine code for the target architecture, it will even optimize code like the above to just store the resulting value 7. And a compiler is capable of eliminating unneeded code like parity computations if you frame things right. Because the Theseus code generation happens "offline", separately from the execution of the program, I can worry less than a JIT author would about how much time is spent analyzing the code.

When I started this I had thought that performance would be the whole benefit of this approach, but it turns out to be easier to develop as well because it brings in all of the other developer tools:

  • The translated instructions appear as regular code in the output program, which means the native debugger can step translated instructions, which appear as regular source code.
  • If the program crashes, the native stack trace traces back in to the (translated assembly of the) original program.
  • I haven't tried it yet, but CPU profiling ought to have the same benefit.

In retrowin32 I ended up building a whole debugger UI to help track down problems, but in Theseus I've just used my system debugger so far and it's been fine.

In retrowin32 I also spent a lot of time fiddling with the bridge between the emulator and native code. This boundary still exists in Theseus but it is so much smaller, because the translated code can directly call my native win32 system API implementation (with a bit of glue code to move data in and out of the inner machine's representation).

On macOS, retrowin32 could run under Rosetta, but that meant the entire executable needed to be an x86-64 binary, which in turn required a cross-compiled SDL. A Theseus binary is native code that just calls the native SDL.

All told it is just much simpler. From the start of this idea, it took me only a couple of weeks to get the test program I've been tinkering with all along to run its first scene, including DirectX, FPU, and MMX.

Partial evaluation

You can think of the different approaches, from interpreter to JIT to static binary translation, as a spectrum of how much work you do ahead of time versus at runtime. Theseus takes the dynamic question of "what kind of mov is this" and moves it to the ahead-of-time compilation step, partially evaluating the generic instruction handler into a specific instruction with nailed-down arguments. (I'll link again to the excellent blog about meta-tracing C code. Read about Futamura projections for this idea taken to its extreme conclusion!)

For another example, a typical Windows emulator must parse and load the PE executable on startup, but Theseus does that at compile time and writes out just the data structures needed to execute it. The PE-parsing code isn't needed in the output.

Similarly, executable startup involves linking and loading any referenced DLLs including those from the system, but Theseus must see all the code it will run, so it does this linking ahead of time. Here's some output near a call to a Windows API, where at compile time it resolved an IAT reference (the ds:[...] address) directly to the Rust implementation I wrote:

// 004012a0 push 4070A4h
push(ctx, 0x4070a4u32);
// 004012a5 push 8
push(ctx, 0x8u32);
// 004012a7 call dword ptr ds:[4060E8h]
call(ctx, 0x4012ad, Cont(user32::CreateWindowExA_stdcall))

In some sense it's as if Theseus at compile time is partially running the system binary loader and the output source code is a snapshot of the ready state. It reminds me a bit of the problem of unpacking executables.

WebAssembly

Theseus should easily extend to running on the web under WebAssembly; most of it is just compiling the generated program with wasm as the target architecture. (I initially had this working then decided I don't need the additional complexity for now, so it isn't implemented.)

Separately, the output program from Theseus is inspired by how WebAssembly is executed. In both there is an outer host program that carries within it a "machine" with its own idea of code and memory. The code within that machine can only read/write to its own memory and must call provided hooks to bridge out to the host. Like WebAssembly, the Theseus output executable code is isolated from the data, with the nice property that no amount of unintentional/malicious memory writes can create new code.

A wasm Theseus would be a turducken of machines:

  1. the native host machine's WebAssembly implementation (e.g. the Chrome runtime), with its notion of memory, runs a
  2. WebAssembly virtual machine with the Theseus wasm blob, with its own idea about memory (e.g. where my Rust implementation of the Windows API puts allocations), and within that there is
  3. the x86 virtual machine and Windows program's notion of memory (which e.g. might say "read from the static data table at memory offset $x").

In thinking about it, it's tempting to try to blend some layers of machines here, and make the WebAssembly program's memory 1:1 with the input Windows program's idea of memory. That is, if the input program writes to some address $x, you could translate that to exactly writing to WebAssembly memory address $x. (You'd need to adjust the middle layer to hide its data structures in places the x86 program doesn't use.) I had to do something like this to make retrowin32 work under an x86 emulator. WebAssembly would even let me lay out the memory directly from the binary. I don't think this really buys you much, though; it would just be kind of cute.

On the topic of WebAssembly and static binary translation, check out wastrel which is static binary translation applied to the problem of executing WebAssembly. Reading about it surely gave me the seeds of this idea.

Theseus

I named this project Theseus, as in the ship.

Consider again the x86 assembly at the top of the post. What does it do? Depending on how you look at it, one correct answer is "adds three and four" or even just "computes 7". Or you could say it puts 3 in the eax register, adds 4 to the eax register, consumes some CPU clocks, and sets various CPU flags.

If I or my compiler replaces one of these interpretations with another, is it still the same program? Depending on which context you care about — my impression is that emulating systems like the NES requires getting the clocks exactly right — these details either matter or don't. In the case of Theseus I am explicitly throwing away the input program because I have replaced all its parts, one by one.

I have one farther off idea, again along the lines of the ship of Theseus. Implementing the Windows API is an endless stream of working around four decades of Hyrum's Law. Consider that random bug workaround again: if you were documenting the API of DirectPlayEnumerateA would you write that it calls the callback, or would it be more correct to say that it calls the callback and also restores a preserved stack pointer? If you look at the code of a Windows emulator like Wine today it is full of things like this.

One idea I've been thinking about is that for problems like these, rather than making the emulator more complicated, you could take a page from the decompilation playbook and provide an easy way to manage replacing parts of the program itself.

Once you're willing to replace pieces of a program there are more interesting possibilities. If a program has some bit of code that doesn't perform well, instead of making a JIT fancier, you could just manually replace the code with your own implementation. (It's plausible you wouldn't even need to change algorithms, it might be enough to just write the same algorithm in native code and let your modern compiler apply its autovectorization logic to it.) With enough machinery, you could even replace parts to add features, as one contributor to retrowin32 investigated here and even implemented for some GameBoy games.


RAG – A Quick Example


In the previous blog post, we imported a few Python modules and configured our AI key, using Colab.

In this blog post we’ll use Retrieval-Augmented Generation (RAG) to extend an LLM that we’ll get from OpenAI. I’ll use a number of features from the libraries we imported with only a cursory explanation and will come back to them in upcoming blog posts to examine them in more depth. But I want to get to RAG right away because it is rapidly becoming central to AI and because it is cool.

LLMs are incredibly expensive to create and train, and it isn’t feasible to train them on everything. Besides that, much data is proprietary. It may be that you want an LLM that handles (to use the canonical case) your HR policies. Clearly no commercial LLM knows about those policies, nor should they. And equally clearly, you’re not going to train an LLM from scratch. What you want to do is to combine your own corpus of data (HR policy papers, etc.) with an existing LLM, and that is exactly what RAG is for.

In this simple example, we’re going to take a scene or two from Romeo and Juliet* and feed it to gpt-4o-mini, one of many LLMs available for use at minimal cost (we’ll get into how cost is computed in an upcoming post).

The first thing we’ll do after configuring the OPENAI_API_KEY will be to get a TextLoader to import the text file with the scenes from Romeo and Juliet.

To do that, we’ll use the TextLoader from langchain_community.document_loaders (again, we’ll examine this and the other referenced modules in upcoming blog posts). We do this in three steps:

  1. Add the import statement
  2. Point the TextLoader to our file
  3. Load the file
from langchain_community.document_loaders import TextLoader
loader = TextLoader("RomeoAndJuliet.txt", encoding="utf-8")
docs = loader.load()

Next, we need to divide the text into chunks that the LLM can work with. We do that with a RecursiveCharacterTextSplitter from langchain. We’ll use the cl100k_base encoder, and we’ll set the chunk_size to 1000 (that is, 1,000 of those mysterious tokens that, e.g., words are divided into). To ensure that nothing is dropped at chunk boundaries, we set the chunk_overlap property to 200.

from langchain.text_splitter import RecursiveCharacterTextSplitter 
text_splitter = RecursiveCharacterTextSplitter.from_tiktoken_encoder(
    encoding_name='cl100k_base',
    chunk_size=1000,
    chunk_overlap=200
)
chunks = loader.load_and_split(text_splitter)

In this particular example, we get six chunks:

len(chunks)
6

By now you are getting annoyed that so much is going by that I’m not explaining. As promised, however, all will be clarified in coming blog posts. In fact, we’ll go back through this line by line and explain what each step is doing.

We need an embedding model, which we’ll use to create our vector store (the place where we hold onto our chunks). As an aside, the other things held in the vector store are the metadata about each chunk and the vectors, which are numerical embeddings of the chunks. The vectors are actually the key part of this: they are a long list of numbers that represent the semantic meaning of each chunk, which can be used, and will be used below, in a similarity search.

from langchain_openai import OpenAIEmbeddings
from langchain_community.vectorstores import Chroma
embedding_model = OpenAIEmbeddings(model="text-embedding-ada-002")
vectorstore = Chroma.from_documents(
    chunks,
    embedding_model,
    collection_name="RomeoAndJuliet"
)

Now that we have the vector store, we need a way to conduct the search, for which we need a retriever. When we instantiate it, we’ll tell the retriever to use a similarity search.

retriever = vectorstore.as_retriever(
    search_type="similarity",
    search_kwargs={"k": 10}
)

The kwargs are additional keyword-arguments you can pass in. In this case, we’re telling the retriever to get 10 results.

Now! We are ready to create our user message and our retrieval query. The retrieval query is the search query you send to your vector store to fetch relevant chunks. The user message is the natural-language question itself: the text we will ultimately hand to the LLM along with the retrieved context.

userMessage = "Give me every line having the word swear in it" 
retrievalQuery = "the play 'Romeo and Juliet'"

We can now extract the relevant chunks, iterate through them and create a long string of the resulting context chunks.

relevantChunks = retriever.invoke(retrievalQuery)
contextChunks = [d.page_content for d in relevantChunks]
contextString = ". ".join(contextChunks)

We need to give the LLM context to work from. One great way to do that is to assign a role to the LLM (e.g., “you are a human resource assistant”). In this case, we’ll use a reviewer who knows about plays. This is also a good place to provide explicit directions on how you want the LLM to respond.

qna_system_message = """
You are a play reviewer using the RAG to combine the text of the play with your knowledge of plays in general.
You will review RomeoAndJuliet.txt and provide appropriate answers from the context.
The user input will have the context required by you and will begin with the token: ###Context.
The user questions will begin with the token: ###Question.
Please answer only using the context provided and do not mention anything about the context in your answer.
If the answer is not found in the context, respond "I don't know."
"""

We just need a way to tell the LLM how the context and question will appear, for which we create a template.

qna_user_message_template = """
###Context
{context}

###Question
{question}
"""

Let’s create the final userQuery by combining the context with the user message we created above.

userQuery = qna_user_message_template.format(
    context=contextString,
    question=userMessage
)

Finally, we’re ready to create the prompt that we’ll feed to the LLM

prompt = f"""
[INST]{qna_system_message}

{userQuery}
[/INST]
"""

Next, we instantiate our LLM, filling in some parameters that, again, we’ll review in an upcoming blog post.

from langchain_openai import ChatOpenAI
llm = ChatOpenAI(
    model="gpt-4o-mini",                      
    temperature=0,                
    max_tokens=10000,                 
    top_p=0.95,
    frequency_penalty=1.2,
    stop_sequences=['INST']
)

And we are now, at last, ready to feed our prompt to the LLM, which will incorporate the RAG context we built from the Romeo and Juliet text. Remember that we asked it to give us the lines with the word swear in them.

response = llm.invoke(prompt)
response.content

ROMEO.  
O, then, dear saint, let lips do what hands do:  
They pray, grant thou, lest faith turn to despair.  

JULIET.  
Saints do not move, though grant for prayers’ sake.  

ROMEO.  
Then move not while my prayer’s effect I take.  
Thus from my lips, by thine my sin is purg’d.

JULIET.  
O swear not by the moon, th’inconstant moon,  
That monthly changes in her circled orb,  
Lest that thy love prove likewise variable.

ROMEO.  
What shall I swear by?

JULIET.   
Do not swear at all.   
Or if thou wilt, swear by thy gracious self,
Which is the god of my idolatry,
And I’ll believe thee.

ROMEO.
If my heart’s dear love,—

One thing to note is that the LLM interpreted “swear” liberally. For example, in the first verse Romeo says “They pray,” which is pretty close to “swear.”

We need a method to ask more questions. Let’s create a method that takes a user message, retrieves the relevant chunks, creates the prompt, and invokes the LLM with that prompt.

def UseRag(userMessage):
    """
    Args:
    userMessage: Takes a user input for which the response should be retrieved from the vectorDB.
    Returns:
    The LLM's answer, generated from the retrieved context.
    """
    chunks = retriever.invoke(userMessage)
    contextContent = [d.page_content for d in chunks]
    contextString = ". ".join(contextContent)

    prompt = f"""[INST]{qna_system_message}\n
                {'user'}: {qna_user_message_template.format(context=contextString, question=userMessage)}
                [/INST]"""

    # Querying the LLM
    try:
        response = llm.invoke(prompt)

    except Exception as e:
        # If the call fails, return the error message rather than a response object
        return f'Sorry, I encountered the following error: \n {e}'

    return response.content 

To prove that we’re getting our answers from the RAG, let’s ask a question about text that is not in our excerpt but that would be known by anyone (anything?) that is familiar with the play.

print(UseRag("What town does Romeo live in?"))

I don't know.

Finally, let’s have a bit of fun,

UseRag("Write a 10 line poem in the style of 'Romeo and Juliet'")

In shadows deep where whispered secrets lie,  
Two hearts entwined beneath the moonlit sky.  
A glance exchanged, a spark ignites the night,  
Forbidden love that dances out of sight.  

O sweet Juliet, with beauty rare and bright,  
Your name a curse yet brings my soul delight.  
Though feuding kin may seek to tear apart,  
Our love shall bloom within each beating heart.  

For in this world of strife and bitter woe,  
Together we shall rise; our passion's glow.  

I think that is actually pretty good.

OK, that was a lot, and it went by fast. I look forward to going back through it, line by line, and exploring what each line is doing.


The fastest way to match characters on ARM processors?


Consider the following problem. Given a string, you must match all of the ASCII white-space characters (\t, \n, \r, and the space) and some characters important in JSON (:, ,, [, ], {, }). JSON is a text-based data format used for web services. A toy JSON document looks as follows.

{
  "name": "Alice",
  "age": 30,
  "email": "alice@example.com",
  "tags": ["developer", "python", "open-source"],
  "active": true
}

We want to solve this problem using SIMD (single-instruction-multiple-data) instructions. With these instructions, you can compare a block of 16 bytes with another block of 16 bytes in one instruction.

It is a subproblem in the fast simdjson JSON library when we index a JSON document. We call this task vectorized classification. We also use the same technique when parsing DNS records, and so forth. In the actual simdjson library, we must also handle strings and quotes, and it gets more complicated.

I need to define what I mean by ‘matching’ the characters. In my case, it is enough to get, for each block of 64 bytes, two 64-bit masks: one for spaces and one for important characters. To illustrate, let me consider a 16-byte variant:

{"name": "Ali" }
1000000100000001 // important characters
0000000010000010 // spaces

Thus, I want to get back the numbers 0b1000000100000001 and 0b0000000010000010 in binary format (they are 33025 and 130 in decimal).

I refer you to Langdale and Lemire (2019) for how to do it using the conventional SIMD instructions available on ARM processors (NEON). Their key idea is a table-driven, branch-free classifier: for each byte, split into low and high nibbles, use SIMD table lookups to map each nibble to a bitmask, and combine the two masks (with bitwise AND) to decide whether the byte belongs to a target set (whitespace or structural JSON characters). This avoids doing many separate equality comparisons per character.
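
In rough NEON terms, the classification step looks something like the sketch below (my own illustration, not the exact simdjson code); the clever part, which I am not reproducing here, is constructing the two 16-entry tables so that the bitwise AND of the two lookups is nonzero exactly for the characters you care about.

#include <arm_neon.h>

// Nibble-based classification in the spirit of Langdale & Lemire (2019):
// split each byte into its low and high nibble, look both up in small
// tables, and AND the two lookup results.
static inline uint8x16_t classify(uint8x16_t input,
                                  uint8x16_t low_nibble_table,
                                  uint8x16_t high_nibble_table) {
    uint8x16_t lo = vandq_u8(input, vdupq_n_u8(0x0F)); // low 4 bits of each byte
    uint8x16_t hi = vshrq_n_u8(input, 4);              // high 4 bits of each byte
    return vandq_u8(vqtbl1q_u8(low_nibble_table, lo),
                    vqtbl1q_u8(high_nibble_table, hi));
}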

There is now a better way on recent ARM processors.

The 128-bit version of NEON was introduced in 2011 with the ARMv8-A architecture (AArch64). Apple played an important role and it was first used by the Apple A7 chip in the iPhone 5S. You can count on all 64-bit ARM processors to support NEON, which is convenient. (There are 32-bit ARM processors but they are mostly used for embedded systems, not mainstream computing.)

ARM NEON is good but getting old. It is no match for the AVX-512 instruction set available on x64 (AMD and Intel) processors. Not only do the AVX-512 instructions support wider registers (64 bytes as opposed to ARM NEON’s 16 bytes), but they also have more powerful instructions.

But ARM has something else to offer: Scalable Vector Extension (SVE) and its successor, SVE2. Though SVE was first introduced in 2016, it took until 2022 before we had actual access. The Neoverse V1 architecture used by the Amazon Graviton 3 is the first one I had access to. Soon after, we got SVE2 with the Neoverse V2 and N2 architectures. Today it is readily available: the Graviton4 on AWS, the Microsoft Cobalt 100 on Azure, the Google Axion on Google Cloud (and newer Google Cloud ARM CPUs), the NVIDIA Grace CPU, as well as several chips from Qualcomm, MediaTek, and Samsung. Notice who I am not including? Apple. For unclear reasons, Apple has not yet adopted SVE2.

I have mixed feelings about SVE/SVE2. Like RISC-V, it breaks with the approach from ARM NEON and x64 SIMD that uses fixed-length register sizes (16 bytes, 32 bytes, 64 bytes). This means that you are expected to code without knowing how wide the registers are.

This is convenient for chip makers because it gives them the option of adjusting the register size to better suit their market. Yet it seems to have failed. While the Graviton 3 processor from Amazon had 256-bit registers… all commodity chips have had 128-bit registers after that.

On the plus side, SVE/SVE2 has masks a bit like AVX-512, so you can load and process data only in a subset of the registers. It solves a long-standing problem with earlier SIMD instruction sets where the input is not a multiple of the register size. Both SVE/SVE2 and AVX-512 might make tail handling nicer. Being able to operate on only part of the register allows clever optimizations. Sadly, SVE/SVE2 does not allow you to move masks to and from a general-purpose register efficiently, unlike AVX-512. And that’s a direct consequence of their design with variable-length registers. Thus, even though your registers might always be 128-bit and contain 16 bytes, the instruction set is not allowed to assume that a mask fits in a 16-bit word.

I was pessimistic regarding SVE/SVE2 until I learned that it is designed to be interoperable with ARM NEON. Thus you can use the SVE/SVE2 instructions with your ARM NEON code. This works especially well if you know that the SVE/SVE2 registers match the ARM NEON registers (16 bytes).

For the work I do, there are two SVE2 instructions that are important: match and nmatch. In their 8-bit versions, what they do is the following: given two vectors a and b, each containing up to 16 bytes, match sets a predicate bit to true for each position i where a[i] equals any of the bytes in b. In other words, b acts as a small lookup set, and match tests set membership for every byte of a simultaneously. The nmatch instruction is the logical complement: it sets a predicate bit to true wherever a[i] does not match any byte in b. A single instruction thus replaces a series of equality comparisons and OR-reductions that would otherwise be needed. In the code below, op_chars holds the 6 structural JSON characters and ws_chars holds the 4 whitespace characters; calling svmatch_u8 once on a 16-byte chunk d produces a predicate that has a true bit exactly where that input byte is a structural character. The code uses SVE2 intrinsics: compiler-provided C/C++ functions that map almost one-to-one to CPU SIMD instructions, so you get near-assembly control without writing assembly.

// : , [ ] { }
uint8_t op_chars_data[16] = {
    0x3a, 0x2c, 0x5b, 0x5d, 0x7b, 0x7d, 0, 0, 
    0, 0, 0, 0, 0, 0, 0, 0
};
// \t \n \r ' '
uint8_t ws_chars_data[16] = {
    0x09, 0x0a, 0x0d, 0x20, 0, 0, 0, 0, 
    0, 0, 0, 0, 0, 0, 0, 0
};

// load the characters in SIMD registers
svuint8_t op_chars = svld1_u8(svptrue_b8(), op_chars_data);
svuint8_t ws_chars = svld1_u8(svptrue_b8(), ws_chars_data);

// load data
// const char * input = ...
svbool_t pg = svptrue_pat_b8(SV_VL16);
svuint8_t d = svld1_u8(pg, input);

// matching
svbool_t op = svmatch_u8(pg, d, op_chars);
svbool_t ws = svmatch_u8(pg, d, ws_chars);

In this code snippet, svuint8_t is an SVE vector type containing unsigned 8-bit lanes (bytes). svbool_t is an SVE predicate (mask) type. svptrue_b8() builds a predicate where all 8-bit lanes are active, and svld1_u8(pg, ptr) loads bytes from memory into an SVE vector, using predicate pg to decide which lanes are actually read.

If you paid attention thus far, you might have noticed that my code is slightly wrong since I am including 0 in the character sets. But it is fine as long as I assume that the zero byte is not present in the input. In practice, I could just repeat one of the characters, or use a bogus character that I do not expect to see in my inputs (such as the byte value 0xFF, which cannot appear in a valid UTF-8 string).

In standard SVE/SVE2, op and ws are predicates, not integer masks. A practical trick is to materialize each predicate as bytes (0xFF for true, 0x00 for false), for example with svdup_n_u8_z.

svuint8_t opm = svdup_n_u8_z(op, 0xFF);
svuint8_t wsm = svdup_n_u8_z(ws, 0xFF);

When SVE vectors are 128 bits, this byte vector maps naturally to a NEON uint8x16_t via svget_neonq_u8, and from there we can build scalar bitmasks efficiently with NEON operations (masking plus pairwise additions). Repeating this over four 16-byte chunks gives the two 64-bit masks needed for a 64-byte block.
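
Here is a sketch of that reduction (my own illustration of the idea, assuming 128-bit vectors, not necessarily the exact code used in the benchmark): weight each 0xFF lane by the bit it should contribute within its byte, then fold the vector with pairwise additions until the sixteen bits land in the bottom two bytes.

#include <arm_neon.h>
#include <stdint.h>

static inline uint16_t neon_movemask(uint8x16_t mask) {
    const uint8_t weights[16] = {1, 2, 4, 8, 16, 32, 64, 128,
                                 1, 2, 4, 8, 16, 32, 64, 128};
    uint8x16_t w = vandq_u8(mask, vld1q_u8(weights)); // keep the weight where the lane is 0xFF
    uint8x16_t s = vpaddq_u8(w, w);                   // sum adjacent pairs: 16 lanes -> 8 useful sums
    s = vpaddq_u8(s, s);                              // -> 4
    s = vpaddq_u8(s, s);                              // -> 2 (low byte, high byte of the mask)
    return vgetq_lane_u16(vreinterpretq_u16_u8(s), 0);
}

Calling this on svget_neonq_u8(opm) and svget_neonq_u8(wsm) for each of the four 16-byte chunks, and shifting each 16-bit result into place, yields the two 64-bit masks.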

How does it compare with pure NEON code? I compiled the routine processing blocks of 64 bytes using different compilers.

method          | GCC 16 | LLVM clang 20
simdjson (NEON) | 69     | 66
SVE/SVE2 (new!) | 42     | 52

Interestingly, GCC 16 even adopts SVE instructions in the pure NEON code, which suggests that recompiling old NEON code while targeting SVE/SVE2 could be beneficial.

I was hoping to test both compilers in a benchmark, but I wanted to quickly run my benchmarks on an AWS Graviton 4. I also did not want to compile GCC 16 from source. So I just kept LLVM clang 20 which was readily available in the images that AWS makes available (I picked RedHat 10).

The AWS Graviton 4 processor is a Neoverse V2 processor. Google has its own Neoverse V2 processors in its cloud. In my tests, it ran at 2.8 GHz.

My benchmark generates a random string of 1 MiB and computes the bitmaps indicating the positions of the characters. It is available on GitHub. My results are as follows.

method          | GB/s | instructions/byte | instructions/cycle
simdjson (NEON) | 11.4 | 0.94              | 3.5
SVE/SVE2 (new!) | 14.4 | 0.67              | 3.8

So the SVE/SVE2 approach is about 25% faster than the NEON equivalent and uses 30% fewer instructions, and that’s without any kind of fancy optimization. Importantly, the code is relatively simple thanks to the match instruction.

It might be that the SVE2 function match is the fastest way to match characters on ARM processors.

Credit: This post was motivated by a sketch by user liuyang-664 on GitHub.

References

Langdale, G., & Lemire, D. (2019). Parsing gigabytes of JSON per second. The VLDB Journal, 28(6), 941-960.

Koekkoek, J., & Lemire, D. (2025). Parsing millions of DNS records per second. Software: Practice and Experience, 55(4), 778-788.

Lemire, D. (2025). Scanning HTML at Tens of Gigabytes Per Second on ARM Processors. Software: Practice and Experience, 55(7), 1256-1265.


How to spend $38k on a Visual Novel, and Ludum Dare dying in 2028?


Hello and Welcome, I’m your Code Monkey!

I hope you're having a great weekend! It's nice and sunny over here, today I'm off to take my nephews to the zoo, should be a fun day! I hope you also have something fun planned!

  • Game Dev: Spend $38k on a VN ; Ludum Dare dying 2028

  • Fun: AI speak like cavemen


Game Dev

How to spend $38k on a Visual Novel

If you're a fan of numbers and costs, then this post is excellent. It's a post-mortem from a developer detailing how they spent $38,000 making a Visual Novel (which then netted $3,279) and, importantly, assessing which costs were worth it and which were not.

This is also a curious example because the developer/post writer seems to have been mostly the “ideas guy” with money, and spent that money hiring people to make a game rather than doing it all themselves. This is very different from the usual “I made the game all by myself” and also very different from “I have lots of game ideas, I just need someone to make them but I have no money to pay you”, so let’s see where those $38k went.

Some spending was clearly worth it, like the writer and artist, but the dev found out that a lot of it was not.

In terms of hiring developers, this was the biggest cost at $9,800, where the dev thinks the big mistake was using Agile instead of Waterfall. Agile works great when working by yourself or with a fully owned team, but with freelancers you're paying by the hour, going for Agile means lots of revisions, which means lots of extra cost.

The composer was also described as basically unnecessary since original music doesn't necessarily add much to this type of game as opposed to stock music which would be a lot cheaper (or free).

Translation is an interesting one: the developer said it was mostly wasted since it was expensive and even ended up with some errors. They think the proper approach for the future is AI translation + native speaker proofreading. I guess this one depends on whether the game has a lot of context or nuance. If not, if the text is mostly normal talking, then I can see AI + proofreading working.

In terms of the hired marketer, the analysis is that it was expensive for the results: $2,300 for 700 wishlists (roughly $3.29 per wishlist). Reddit ads were also slightly expensive, costing about $1.48 per wishlist.

The things that did work were the free things: Steam festivals and manually reaching out to streamers ended up being way more valuable than a lot of the paid promotion, especially because the streams also functioned as real playtesting. Watching players get bored, confused, or engaged gave the developer a ton of insight into how to improve the game.

There's a ton of detailed information in the post so I recommend you read through the whole thing, and the conclusion on the post is filled with great takeaways:

What I actually got for $38,000:

  • A team I genuinely enjoy working with

  • The feeling of watching strangers play something you made — great feeling.

  • A clear picture of what the next game needs to look like

  • The unshakeable desire to do it again, better

  • Tons of experience

  • My colleagues threw me a surprise Zoom party on launch day. My friends got me a cake. :)

And I love the very last line: "One publisher later told me about a dev who spent 4 years on a 10k-wishlist game, then made a massive hit in 6 months using everything he'd learned. You need the first game to make the second one.", so make your first game!

I love cost breakdowns like this. Sometimes when you hear Game X cost $Y it's hard to imagine where that money goes, so it's nice to see a very detailed report on exactly that. If you want more numbers, check out my own video on my last game, where I covered the costs and the results, or the one from the game before that.


Affiliate

FREE VFX, and 99% OFF Bundle!

The Unity Asset Store is currently running their Spring Sale!

As always you can get most of the top assets at 50% OFF, along with Flash Deals of up to 95% OFF! 

My own Code Monkey Toolkit is on sale! And I just recently updated it adding a bunch more tools, now it’s over 50!

Oh wow this might be the biggest HumbleBundle in ages!

It includes Tools, VFX, Meshes, Textures, UI, 2D, 3D, a bit of everything! Thousands and thousands of objects. And it's all at an insane discount: worth €2,332, and you can get it for just €15!

Get it HERE!


Game Dev

Ludum Dare dying in 2028?

One of the long-standing institutions in indie game dev has been Ludum Dare. The twice-yearly game jam that has been running for almost 25 years!

But there's some sad news. In the post talking about Ludum Dare 59, the caretaker behind it, Mike Kasprzak, announced that Ludum Dare 64 in October 2028 will be the final one. That means there are six more events left, so this is not an immediate shutdown, but it is very much the beginning of the end.

I think this is an interesting reminder of something important: communities and events like this do not just exist automatically forever. They survive because someone is carrying a massive amount of work behind the scenes, and eventually that catches up with people. So if you have ever benefited from a game jam, or a community, or any kind of shared dev space, remember there is usually someone holding that thing together with a ton of invisible effort.

It will be interesting to see what fills that gap. Maybe nothing replaces Ludum Dare exactly and maybe the community simply splinters into smaller jams. Based on the Itch page there is no lack of scheduled game jams so they will continue to happen as a concept, but it really does feel like the end of an era to see the main one go.

I love game jams as a way to gain a ton of knowledge very quickly. Nowadays I don't tend to do them because of reasons but if you're a beginner I highly recommend you join one. Just check out the Itch page to find an upcoming one, you WILL learn a lot!


Built for builders. Not buzzwords. San José 2026

500+ speakers. 18 content tracks. Workshops, masterclasses, and the people actually shipping the tools you use every day. WeAreDevelopers World Congress — September 23–25. Use code GITPUSH26 for 10% off.

Secure Your Pass


Fun

Why waste AI token when few word do trick?

Here is an ingenious project!

When it comes to AI, pricing is usually based on tokens/words, both input and output. If you manage to ask the AI to do the same thing using fewer tokens, then it's cheaper; the same goes for output.

This awesome project takes a cue from the famous The Office Kevin scene "Why waste time say lot word when few word do trick?" and teaches AI to speak like a caveman, drastically cutting down output token usage by 75%!

A simple output like:

"Sure! I'd be happy to help you with that. The issue you're experiencing is most likely caused by your authentication middleware not properly validating the token expiry. Let me take a look and suggest a fix."

Becomes

"Bug in auth middleware. Token expiry check use < not <=. Fix:"

This isn't just fun, it's genuine cost savings!

I love this project! I wonder how the dev came up with it: were they watching that The Office clip and suddenly thought "why not have an AI speak like this?" and then realized that it's not just fun but genuinely useful? Crazy story!




Get Rewards by Sending the Game Dev Report to a friend!

(please don’t try to cheat the system with temp emails, it won’t work, just makes it annoying for me to validate)

Thanks for reading!

Code Monkey


Claude Design Brings AI to Visual Work


Like Claude Code and Cowork, Claude Design aims to replace traditional tools with conversational AI, this time focusing on visual design.

The post Claude Design Brings AI to Visual Work appeared first on Thurrott.com.
