Sr. Content Developer at Microsoft, working remotely in PA, TechBash conference organizer, former Microsoft MVP, Husband, Dad and Geek.

Codex is now the recommended agent in JetBrains IDEs


JetBrains AI supports multiple coding agents, including Junie, Codex, Claude Agent, and any ACP-compatible agent you bring yourself. Previously, AI users in JetBrains IDEs started in Chat mode and had to choose an agent themselves.

As models have become more advanced, agents have become more capable and their adoption has grown. We recognize that agents help users achieve more, so we recommend using an agent from the get-go.

To make that experience simpler, we’ve selected a specific agent to be the default. This post explains how we made the choice.

You can still switch to any other agent at any time.

“JetBrains evaluated coding agents on the things that matter in practice: can they solve real software engineering tasks, quickly and at a cost that makes sense. We’re proud that Codex is the recommended starting point in JetBrains AI. It’s a meaningful step in the shift from AI chat to agents that meet developers where they are, work in the tools they already use, and take on complex, multi-step work.”

Stuart McMeechan, EMEA Deployment Engineering Lead, OpenAI

Evaluation using real-world development tasks

We evaluated candidate agents using a benchmark dataset built from real software engineering tasks across three ecosystems: Java (225 tasks), C# (38 tasks), and Python (90 tasks).

Each task is grounded in a real codebase – with a prompt describing what needs to be done and automated tests that verify the result. Together, these tasks cover bug fixes, feature development, enhancements, and other common development tasks across real applications, libraries, frameworks, and developer tools.

Data points used for choosing the recommended agent are accessible in the Developer Productivity AI Arena (DPAIA) repository – JetBrains’ open benchmark for evaluating AI coding tools, making the evaluation reproducible. The C# dataset is internal and not publicly available.

The Java dataset was our primary evaluation set. It’s the largest of the three, spanning 17 repositories across five organizations and covering a broad mix of task types. 

The C# and Python datasets produced a similar overall ranking of candidate agents, giving us additional confidence that the results were not specific to a single ecosystem.

Our methodology

We compared candidates within the same model tier. Our goal was not to find the most powerful model available, but the best agent behavior at comparable model capability and cost. We projected what agent usage would cost, taking into account JetBrains AI token usage. Setups that would push more than 2% of users over $20/month were ruled out before we ranked candidates on quality and latency.
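The cost gate described above can be sketched as a simple filter. This is a minimal illustration, not JetBrains' actual tooling: the function name, the parameter names, and the idea of passing one projected monthly cost per user are all assumptions made for the example.

```rust
/// Returns true when the share of users whose projected monthly cost exceeds
/// `budget_usd` stays at or below `max_share` (for example 0.02 for 2%).
/// Hypothetical helper: the input is one projected monthly cost per user.
fn passes_cost_gate(projected_monthly_costs_usd: &[f64], budget_usd: f64, max_share: f64) -> bool {
    if projected_monthly_costs_usd.is_empty() {
        // No projected users means nobody can exceed the budget.
        return true;
    }
    let over_budget = projected_monthly_costs_usd
        .iter()
        .filter(|&&cost| cost > budget_usd)
        .count();
    let share = over_budget as f64 / projected_monthly_costs_usd.len() as f64;
    share <= max_share
}
```

With the thresholds from the text, a setup would be checked as `passes_cost_gate(&costs, 20.0, 0.02)` and only ranked on quality and latency if the call returns `true`.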

In choosing which agent to recommend, we focused on three questions:

  1. Can it handle the task? → Here, we measured by solve rate: the percentage of benchmark tasks where all tests passed.
  2. Is the cost reasonable? → We looked at the median cost per task.
  3. Is it fast enough? → We looked at median end-to-end latency.

These three metrics (solve rate, cost, and latency) formed the basis of our ranking. We also tracked additional signals, including compilation success and average tool calls, but they did not materially affect the results.
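As a rough sketch of how the three metrics could be computed from per-task results, the example below uses a hypothetical `TaskRun` record. The field names and the `summarize` helper are assumptions for illustration; the post does not describe the benchmark's real data layout.

```rust
/// One benchmark run for a single task (hypothetical record layout).
struct TaskRun {
    all_tests_passed: bool,
    cost_usd: f64,
    latency_secs: f64,
}

/// Solve rate: percentage of tasks where every test passed.
fn solve_rate(runs: &[TaskRun]) -> f64 {
    if runs.is_empty() {
        return 0.0;
    }
    let solved = runs.iter().filter(|run| run.all_tests_passed).count();
    100.0 * solved as f64 / runs.len() as f64
}

/// Median of a list of values; `None` for an empty list.
fn median(values: &[f64]) -> Option<f64> {
    if values.is_empty() {
        return None;
    }
    let mut sorted = values.to_vec();
    sorted.sort_by(|a, b| a.total_cmp(b));
    let mid = sorted.len() / 2;
    Some(if sorted.len() % 2 == 0 {
        // Even count: average the two middle values.
        (sorted[mid - 1] + sorted[mid]) / 2.0
    } else {
        sorted[mid]
    })
}

/// Returns (solve rate in %, median cost in USD, median latency in seconds).
fn summarize(runs: &[TaskRun]) -> (f64, Option<f64>, Option<f64>) {
    let costs: Vec<f64> = runs.iter().map(|run| run.cost_usd).collect();
    let latencies: Vec<f64> = runs.iter().map(|run| run.latency_secs).collect();
    (solve_rate(runs), median(&costs), median(&latencies))
}
```

Medians are used here because the text reports median cost and median latency; a median is less sensitive than a mean to a handful of very long or very expensive runs.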

Alongside the offline benchmark, we ran an online A/B test with real users. This experiment served as a validation layer, helping us understand whether the offline results translated into real-world usage. Because it’s difficult to measure task success reliably at scale, we focused on behavioral signals such as engagement and how often users switched to another agent or returned to the chat. The online results were consistent with the offline benchmark, giving us additional confidence in our choice.

Candidate configurations

We tested the agents available with JetBrains AI (Codex, Junie, and Claude Agent) across multiple model configurations. Candidates were selected based on prior benchmarking and internal assessment; we focused on the most promising options within each agent’s model family rather than testing every possible setup. Eventually, Codex and Junie were shortlisted.

Codex – we started with an initial sweep across GPT-5.2 and GPT-5.3. When GPT-5.4 mini became available, it outperformed the previous top performer in both solve rate and cost, making the model choice straightforward. The remaining question was reasoning level: medium vs. low. GPT-5.4 mini with default medium reasoning had the best solve rate within a reasonable cost range across all three ecosystems and was selected for the final evaluation.

Codex shortlist

GPT-5.4-mini comparison

Medium Reasoning solved more tasks in Java, C#, and Python. Low Reasoning was cheaper and often faster, but the cost and latency gains were not large enough to make up for the more noticeable drop in solve rate. That is why we picked Medium Reasoning.

All (weighted average across ecosystems)

| Metric | GPT-5.4-mini medium | GPT-5.4-mini low |
| --- | --- | --- |
| Solve rate | 39.9% | 35.1% |
| Median latency | 170.40s | 137.82s |
| Median cost | USD 0.1387 | USD 0.0650 |
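The "All" figures are consistent with weighting each ecosystem by its task count (225 Java, 38 C#, and 90 Python tasks). The post does not spell out the weighting, so treat the sketch below as an assumption that happens to reproduce the published numbers; `weighted_average` is a hypothetical helper name.

```rust
/// Task-count-weighted mean over `(value, task_count)` pairs.
/// Returns `None` when the total task count is zero.
fn weighted_average(per_ecosystem: &[(f64, u32)]) -> Option<f64> {
    let total_tasks: u32 = per_ecosystem.iter().map(|&(_, tasks)| tasks).sum();
    if total_tasks == 0 {
        return None;
    }
    let weighted_sum: f64 = per_ecosystem
        .iter()
        .map(|&(value, tasks)| value * f64::from(tasks))
        .sum();
    Some(weighted_sum / f64::from(total_tasks))
}
```

For GPT-5.4-mini medium, `weighted_average(&[(43.9, 225), (62.6, 38), (20.2, 90)])` gives roughly 39.87, which rounds to the reported 39.9% solve rate. The same call on the per-ecosystem median latencies (124.11, 142.95, 297.72) gives roughly 170.40 seconds.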

Java

| Metric | GPT-5.4-mini medium | GPT-5.4-mini low |
| --- | --- | --- |
| Solve rate | 43.9% | 40.4% |
| Median latency | 124.11s | 78.02s |
| Median cost | USD 0.1292 | USD 0.0615 |

C#

| Metric | GPT-5.4-mini medium | GPT-5.4-mini low |
| --- | --- | --- |
| Solve rate | 62.6% | 51.6% |
| Median latency | 142.95s | 87.86s |
| Median cost | USD 0.1152 | USD 0.0580 |

Python

| Metric | GPT-5.4-mini medium | GPT-5.4-mini low |
| --- | --- | --- |
| Solve rate | 20.2% | 14.8% |
| Median latency | 297.72s | 308.43s |
| Median cost | USD 0.1724 | USD 0.0766 |

Junie – Junie can work with different model providers. We evaluated the Gemini model family, which was pre-selected as the most promising option based on the Junie team's own benchmarks. Gemini 3 Flash was selected as the winning model.

Junie shortlist

Gemini model comparison

Gemini 3 Flash had the stronger solve rate; Gemini 3.1 Flash Lite was consistently cheaper and faster.

All (weighted average across ecosystems)

| Metric | Gemini 3 Flash | Gemini 3.1 Flash Lite |
| --- | --- | --- |
| Solve rate | 39.1% | 29.9% |
| Median latency | 147.57s | 110.85s |
| Median cost | USD 0.1132 | USD 0.0564 |

Java

| Metric | Gemini 3 Flash | Gemini 3.1 Flash Lite |
| --- | --- | --- |
| Solve rate | 45.2% | 36.3% |
| Median latency | 142.80s | 100.54s |
| Median cost | USD 0.1053 | USD 0.0551 |

C#

| Metric | Gemini 3 Flash | Gemini 3.1 Flash Lite |
| --- | --- | --- |
| Solve rate | 58.7% | 41.5% |
| Median latency | 215.87s | 173.97s |
| Median cost | USD 0.1189 | USD 0.0661 |

Python

| Metric | Gemini 3 Flash | Gemini 3.1 Flash Lite |
| --- | --- | --- |
| Solve rate | 15.6% | 9.1% |
| Median latency | 130.64s | 109.97s |
| Median cost | USD 0.1304 | USD 0.0554 |

Final showdown: Junie vs Codex

The offline results were too close to call on their own. Neither agent dominated across all metrics and ecosystems.

Finalist comparison

Codex vs Junie across ecosystems

The final shortlist compared Codex with GPT-5.4-mini medium against Junie with Gemini 3 Flash.

All (weighted average across ecosystems)

| Metric | GPT-5.4-mini medium | Gemini 3 Flash |
| --- | --- | --- |
| Solve rate | 39.9% | 39.1% |
| Median latency | 170.40s | 147.57s |
| Median cost | USD 0.1387 | USD 0.1132 |
| Cost per successful solve | USD 0.4941 | USD 0.4337 |

Java

| Metric | GPT-5.4-mini medium | Gemini 3 Flash |
| --- | --- | --- |
| Solve rate | 43.9% | 45.2% |
| Median latency | 124.11s | 142.80s |
| Median cost | USD 0.1292 | USD 0.1053 |
| Cost per successful solve | USD 0.3716 | USD 0.2864 |

C#

| Metric | GPT-5.4-mini medium | Gemini 3 Flash |
| --- | --- | --- |
| Solve rate | 62.6% | 58.7% |
| Median latency | 142.95s | 215.87s |
| Median cost | USD 0.1152 | USD 0.1189 |
| Cost per successful solve | USD 0.2307 | USD 0.2298 |

Python

| Metric | GPT-5.4-mini medium | Gemini 3 Flash |
| --- | --- | --- |
| Solve rate | 20.2% | 15.6% |
| Median latency | 297.72s | 130.64s |
| Median cost | USD 0.1724 | USD 0.1304 |
| Cost per successful solve | USD 0.9115 | USD 0.8882 |

We included both in an online A/B test to see which held up better in real-world usage. We tracked activation, churn, and failure rate. Codex came out ahead. That tipped the decision.

What is next for the recommended agent

Codex is now the recommended agent, having delivered the strongest combination of solve rate and cost across the tasks we tested. This isn't a permanent decision, however. As models evolve, new agents join, and our benchmark coverage grows, we'll re-evaluate the decision and update our recommendation based on what the data tells us.

And if a different agent works better for your workflow, you can switch at any time. Our recommendation is a starting point, not a constraint.


A Future Where Nobody Writes Code Manually Might Be Closer Than It Seems


Once again, we brought together some of the finest minds Infobip has to answer tricky questions about the future of software.

This time around, we spoke to four Infobip engineers about how they use AI in their daily work and how they view the AI revolution happening now.

Research, plan, execute

With AI infrastructure changing rapidly, practices that used to be standard in software development are shifting, but some things stay the same.

Petar Dučić, Engineering Director, said that the company’s mantra “you build it, you own it” has remained the same in the AI era. This simply means that engineers are responsible for whatever they build.

Senior IT Research Scientist Ante Kapetanović added that engineers need to separate their work phases efficiently:

You have to separate your research phase, your planning phase, and your coding implementation, whatever phase. This ultimately means that you own each step of the way. And basically, it is not AI-assisted coding, it is more human-assisting AI.

Engineering is now becoming even more necessary…

It’s true that using AI tools is, in many cases, a cheaper alternative to real people, but Petar pointed out that engineering is now becoming even more necessary, because there are so many things that can go wrong, and we need real people to check them and understand what’s going on.

Senior Software Engineer Rino Čala pointed out that there are three types of mistakes agentic tools make: logical mistakes, code-based mistakes, and security mistakes. The solution is, as Rino puts it, just more tests:

So it is definitely important to run tests, to run some local tests, CI tests, and do some static checks as well.

Zvonimir Petković, Staff Engineer, then explained that security issues are the number one flaw with AI software tools:

Security is the main risk with deploying Gen-AI generated code. With the whole Vibe coding setup, nobody looks at the code, and oftentimes we have also non-engineers deploying code. The hiding sensitive data within the source code itself, this is the number one problem.

The second problem for Zvonimir is scalability. Something that is built in a couple days might work fine for a small team, but cannot be scaled to 5,000 people easily.

… and engineers are now more orchestrators than code writers

A stark contrast to the narrative of AI taking away jobs for engineers is that, with more people actively using AI, there’s a bigger need for someone with a technical background to help with not just support, but education.

“We’re slowly becoming context engineers,” added Ante, saying that engineers are now spending a lot of time managing their context in different AI tools. He is personally a big advocate for writing your own code and feels that this is a major part of being an engineer. Still, Ante admits that might not be the case in a couple of years.

Zvonimir, interestingly, had a take about exactly that:

The total trend is that in a few years’ time, we’ll have the situation where nobody writes the code manually. Software engineers will be like persons who are the experts in that field, so they will be able to review what gen AI has generated.

In conclusion, as Rino puts it, engineers are now more in the role of orchestrators and organizers than code writers, since they spend a lot of time managing AI models to do things properly.

Want to hear more? Check out the video.

Special thanks to our fellow colleagues at Infobip, the publisher of ShiftMag!

The post A Future Where Nobody Writes Code Manually Might Be Closer Than It Seems appeared first on ShiftMag.


The Unglamorous Side of Rust Web Development


This is a guest post by Mateusz Maćkowski and Marek Grzelak, co-maintainers of cot.rs and speakers at Rustikon 2026. You can watch their full talk here.

In the very beginning, all we wanted to do was build a JSON API. After doing that a few times in Rust, we noticed a recurring pattern. Every new project meant choosing libraries, wiring everything together, writing the same glue code, and solving the same setup problems again.

That pattern is one of the reasons we started working on cot.rs. We wanted Rust web development to feel less like assembling a custom toolbox every time, and more like starting with the pieces you already know you’ll need.

The advantages of Rust are well known: its safety, performance, strong types, and the specific confidence you get when your code finally compiles. This post is about everything that happens before compilation. 

TL;DR

  • Async Rust is powerful, but the debugging experience is still rough. One panic! can give you a 100-frame backtrace, with the actual issue in your code appearing between frames 9 and 10.
  • Rust ORMs require you to maintain the same schema in multiple places. Declarative migrations are a better direction, and several projects are exploring this.
  • Error handling in Rust web frameworks is inconsistent. Getting errors to behave predictably across an entire application is harder than it sounds.
  • Macros are everywhere in the Rust web stack. When they work, great. When they don’t, you’re reading thousands of lines of generated code.
  • Compile times are a real cost for web development. A single web framework dependency can grow your dependency tree tenfold.
  • The ecosystem is fragmented. You choose almost every part of the stack yourself, which is powerful for experienced developers and overwhelming for everyone else.
  • Batteries-included frameworks like Loco.rs and cot.rs are closing the gap, but we’re not at Django or Rails level yet.

Async: Fast, clever, and not always friendly

The first problem, and arguably the most prominent one, is async.

Rust’s async model is genuinely ingenious. It was implemented in a very clever way, and it really shows that smart people worked on it. But while it’s technically impressive, the developer experience is not quite there yet.

To use it comfortably, you need to understand both async fn and how to implement Future yourself, and they’re different enough that it can feel like learning two separate things. Then there’s the question of tasks versus threads, and if you mix them up and use the wrong data structures, you can end up with deadlocks or subtler problems. There’s still no async drop in stable Rust. And if you’ve ever tried to truly understand pinning, it’s genuinely difficult to wrap your head around why it’s even needed.
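To make the gap between `async fn` and a hand-written `Future` concrete, here is a minimal sketch that uses only the standard library. `CountDown`, `NoopWake`, and `poll_to_completion` are hypothetical names for illustration. The busy-polling driver is a stand-in for a real runtime and is not how Tokio or any production executor schedules tasks.

```rust
use std::future::Future;
use std::pin::Pin;
use std::sync::Arc;
use std::task::{Context, Poll, Wake, Waker};

/// A hand-written future that reports `Pending` `remaining` times before finishing.
struct CountDown {
    remaining: u32,
}

impl Future for CountDown {
    type Output = &'static str;

    fn poll(mut self: Pin<&mut Self>, cx: &mut Context<'_>) -> Poll<Self::Output> {
        if self.remaining == 0 {
            Poll::Ready("done")
        } else {
            self.remaining -= 1;
            // Ask to be polled again; a real future would wake on an I/O or timer event.
            cx.waker().wake_by_ref();
            Poll::Pending
        }
    }
}

/// The same behavior through `async fn`: the compiler generates the state machine.
async fn count_down_async(n: u32) -> &'static str {
    CountDown { remaining: n }.await
}

/// A waker that does nothing, which is enough for a busy-polling loop.
struct NoopWake;

impl Wake for NoopWake {
    fn wake(self: Arc<Self>) {}
}

/// Toy driver: polls in a loop and returns the output plus the number of polls.
fn poll_to_completion<F: Future>(future: F) -> (F::Output, u32) {
    let waker = Waker::from(Arc::new(NoopWake));
    let mut cx = Context::from_waker(&waker);
    let mut future = Box::pin(future);
    let mut polls = 0;
    loop {
        polls += 1;
        if let Poll::Ready(value) = future.as_mut().poll(&mut cx) {
            return (value, polls);
        }
    }
}
```

Each `Poll::Pending` returns control to whoever called `poll`, which here is the loop in `poll_to_completion` and in a real application is the runtime. That hand-off on every yield is why the runtime's frames show up throughout an async backtrace.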

Debugging is where all of this becomes very visible. When you yield from an async function, execution returns to the runtime, and when it resumes in your function, the backtrace is essentially reset. That produces backtraces that are both huge and hard to use, and it creates problems not just for debugging, but for logging, tracing, and learning.

Here is the code that produced one of our own backtraces:

One line. The backtrace ran to 100 frames, most of them async runtime machinery. The actual problem, an issue in our code, was visible between frames 9 and 10. Technically correct. Practically, not very useful.

And that’s before you get into questions like why your future isn’t Send, or why a particular future is consuming an unexpected amount of memory. Async Rust gives you performance and flexibility, but for web development specifically, it makes simple things feel more complicated than they need to be. There are active initiatives in the Rust community to improve this, but the current state still has a way to go. If you want to go deeper on where the async ecosystem is heading, this conversation with Carl Lerche, the creator of Tokio, is worth reading.

Database access: Write your schema once, then write it again

The next rough edge is database access.

Rust has solid libraries for working with databases, but the workflow can feel more manual than it should be. In Diesel, you define your model in Rust, write SQL migrations, and also maintain an automatically generated schema.rs file. That’s three representations of the same data. And yes, Diesel still uses raw SQL migrations, which is why tools like diesel-guard exist, to warn you when a migration is written in a way that could cause problems on large tables.

SeaORM improves parts of the experience, but migrations can still feel like SQL written with Rust syntax on top. It’s a common pattern across the ecosystem: specialized query DSLs that each create their own little language. We want to write modern web applications, but we end up feeling like it’s a 2005 PHP project.

The problem isn’t that SQL exists. The problem is how much repeated work builds up around it. If we’re already creating another layer for queries, it should at least be readable. Django figured this out over 20 years ago. You shouldn’t need to write migrations by hand just to change a schema.

Compare these two approaches to the same query:

[Image: Rust web development code example]

vs.

[Image: Rust web development code example]

You read code much more often than you write it. If we’re already inventing another language, we can at least make it a readable one.

The same principle applies to migrations. In cot.rs, we’re exploring declarative migrations, where a migration is represented as a structured operation at the framework level rather than raw SQL you write and maintain yourself:

Instead of writing SQL directly, the framework represents the change as an operation, validates it, and generates the SQL itself. Problems like a migration that would be dangerously slow on a large table can be caught and handled at the framework level rather than by a separate linting tool.
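The idea of a migration as a validated operation can be sketched with plain Rust types. Everything below is hypothetical and is not the cot.rs API: the `Operation` and `Column` types, the generated SQL dialect, and the validation rule are assumptions chosen for illustration. The check shown here (rejecting a `NOT NULL` column without a default) stands in for the kind of framework-level validation the text describes; it is not the large-table slowness check itself.

```rust
/// A column definition (hypothetical type, not the cot.rs API).
struct Column {
    name: String,
    sql_type: String,
    nullable: bool,
    default: Option<String>,
}

impl Column {
    fn to_sql(&self) -> String {
        let mut sql = format!("{} {}", self.name, self.sql_type);
        if !self.nullable {
            sql.push_str(" NOT NULL");
        }
        if let Some(default) = &self.default {
            sql.push_str(&format!(" DEFAULT {default}"));
        }
        sql
    }
}

/// A schema change described as data rather than hand-written SQL.
enum Operation {
    CreateTable { table: String, columns: Vec<Column> },
    AddColumn { table: String, column: Column },
}

impl Operation {
    /// Reject changes known to be problematic before any SQL is generated.
    fn validate(&self) -> Result<(), String> {
        match self {
            Operation::AddColumn { table, column }
                if !column.nullable && column.default.is_none() =>
            {
                Err(format!(
                    "adding NOT NULL column `{}` to `{}` needs a default for existing rows",
                    column.name, table
                ))
            }
            _ => Ok(()),
        }
    }

    /// Validate first, then generate the SQL for this operation.
    fn to_sql(&self) -> Result<String, String> {
        self.validate()?;
        Ok(match self {
            Operation::CreateTable { table, columns } => {
                let cols: Vec<String> = columns.iter().map(Column::to_sql).collect();
                format!("CREATE TABLE {} ({});", table, cols.join(", "))
            }
            Operation::AddColumn { table, column } => {
                format!("ALTER TABLE {} ADD COLUMN {};", table, column.to_sql())
            }
        })
    }
}
```

Because the operation is data, a framework can inspect it before anything reaches the database, which is harder to do reliably with a raw SQL string.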

We’re not the only ones thinking about this. Projects like toasty and rorm are experimenting with similar approaches. Database access in Rust doesn’t need to become magical. It just shouldn’t feel like writing the same thing three times in three slightly different languages.

Error handling: Returning an error is easy; returning the right one is harder

Error handling in a web framework sounds simple until you start caring about the details.

A good error handling approach needs to do several things at once: be easy to write and register, have access to as much context as possible, be consistent across the whole application, and pass through the same middleware as any other response. That’s a surprisingly hard combination to get right.

In Axum, there are two different mechanisms for converting errors into responses, IntoResponse and HandleErrorLayer, which means when you’re starting out, it’s not obvious which one you should actually use. IntoResponse requires you to implement it for every individual error type, so different error types from different libraries can end up returning different response shapes. Consistency is hard to guarantee.
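One framework-agnostic way to keep response shapes consistent is to funnel every library error into a single application error type and build the response in one place. The sketch below uses only the standard library. `AppError`, `ErrorResponse`, and `parse_page` are hypothetical names, and the JSON body is assembled by hand without escaping, which a real application would leave to a serializer.

```rust
use std::num::ParseIntError;

/// The single error type every handler returns (hypothetical).
#[derive(Debug)]
enum AppError {
    BadRequest(String),
    NotFound(String),
    Internal(String),
}

/// Library errors are converted once, so `?` works in handlers.
impl From<ParseIntError> for AppError {
    fn from(err: ParseIntError) -> Self {
        AppError::BadRequest(format!("invalid number: {err}"))
    }
}

impl From<std::io::Error> for AppError {
    fn from(err: std::io::Error) -> Self {
        AppError::Internal(err.to_string())
    }
}

/// A framework-independent stand-in for an HTTP response.
struct ErrorResponse {
    status: u16,
    body: String,
}

impl AppError {
    /// The only place where an error becomes a response, so the shape stays uniform.
    fn to_response(&self) -> ErrorResponse {
        let (status, code, message) = match self {
            AppError::BadRequest(message) => (400, "bad_request", message),
            AppError::NotFound(message) => (404, "not_found", message),
            AppError::Internal(message) => (500, "internal", message),
        };
        ErrorResponse {
            status,
            // Hand-built JSON for illustration only; no escaping is performed.
            body: format!("{{\"error\":\"{code}\",\"message\":\"{message}\"}}"),
        }
    }
}

/// Example handler logic: the `?` converts `ParseIntError` into `AppError`.
fn parse_page(raw: &str) -> Result<u32, AppError> {
    Ok(raw.parse::<u32>()?)
}
```

In a framework such as Axum, the equivalent of `to_response` would live in the one `IntoResponse` implementation for the application error type, so individual library errors never need their own.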

Actix-web takes a different approach with the ResponseError trait, which is cleaner in some ways. But it can only access the error object itself, not the broader request context you might need to build a useful response.

Middleware adds another layer of complexity. If your application compresses responses, adds headers, or does any other processing, you want error responses going through the same path. The tower-http ecosystem has a lot to offer here, but it was designed in a way that consumes the request, which means when you’re handling an error, you may no longer have access to the context you need.

Rust is very good at making errors explicit. The harder part for web development is making them consistent, predictable, and well-integrated with the rest of the application stack.

Metaprogramming: Useful magic is still magic

Macros are one of the reasons Rust web frameworks can feel nice to use. They handle boilerplate, provide powerful functionality, and make APIs feel cleaner than they’d otherwise be. Routing, serialization, extractors, database queries, framework setup. Macros are everywhere in the Rust web stack.

When they work, they’re great. When they don’t, they become black boxes. Unless you dig into the generated code, it can be very hard to understand what went wrong. Error messages from failed macro expansions often point somewhere inside the generated code rather than at what you actually did wrong. And if the macro generates a lot of code, figuring out the problem becomes a real task.

IDE support for procedural macros is also uneven. Generating that much code is not a simple challenge, and not all IDEs handle it equally well.

And then there are generics. They help us build composable, reusable software. But in web frameworks, you have many layers stacked on top of each other: middleware, extractors, serializers, the request itself. We found that generics alone could push a binary from around 2MB to 37MB in release mode once everything was compiled. Monomorphization makes your program fast at runtime, but it gives the compiler considerably more work to do.

The issue isn’t that macros or generics are bad choices. It’s that Rust web development tends to layer many powerful abstractions on top of each other, and at some point understanding that stack becomes part of the job.

Iteration speed: “I changed a string, see you in a minute”

All of those abstractions have a cost that becomes very visible when you’re trying to move fast.

Web development is a tight loop: You make a change, run it, see what happens, and then make another change. That loop needs to be fast. In Rust, it often isn’t, and in web applications specifically, the problem compounds because you’re combining several things that each make the compiler work harder.

Monomorphization from generics takes time. Dependencies add up fast; adding a single web framework crate can increase your dependency tree tenfold. Macros have their own cost, too. 

If you want a concrete example of how bad this can get, one documented analysis found expand_crate taking 67.5% of total compile time in a macro-heavy SQLx project. The compile-time SQL verification SQLx does is genuinely useful, but it comes with a real cost.

Can you speed it up? Yes, but only a little. You can disable optimizations in development builds, use alternative compiler backends, and adopt faster linkers. You can also prefer dynamic dispatch over generics in places where the performance trade-off is acceptable. Your code compiles faster, though you give up some compile-time safety guarantees. Hot reloading is also being explored; subsecond from the Dioxus team is one project attempting to tackle this, and we’ve been experimenting with it. 
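The generics-versus-dynamic-dispatch trade-off mentioned above looks roughly like this. The `Handler` trait and both implementations are hypothetical and only illustrate the two calling styles.

```rust
/// A minimal request handler abstraction (hypothetical).
trait Handler {
    fn handle(&self, request: &str) -> String;
}

struct Hello;
struct Echo;

impl Handler for Hello {
    fn handle(&self, _request: &str) -> String {
        "hello".to_string()
    }
}

impl Handler for Echo {
    fn handle(&self, request: &str) -> String {
        format!("echo: {request}")
    }
}

/// Static dispatch: the compiler emits a separate copy for each concrete `H`
/// (monomorphization), which is fast at runtime but adds compile work.
fn run_generic<H: Handler>(handler: &H, request: &str) -> String {
    handler.handle(request)
}

/// Dynamic dispatch: one compiled copy; the call goes through a vtable.
fn run_dyn(handler: &dyn Handler, request: &str) -> String {
    handler.handle(request)
}

/// Trait objects also allow mixed handler types in one collection.
fn route_all(handlers: &[Box<dyn Handler>], request: &str) -> Vec<String> {
    handlers.iter().map(|handler| handler.handle(request)).collect()
}
```

Both functions return the same result for the same handler. The difference is where the cost lands: `run_generic` is compiled once per handler type, while `run_dyn` is compiled once and pays for an indirect call at runtime.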

None of these make the problem disappear. For now, Rust web development asks you to accept a slower feedback loop than many web developers are used to. The friction rarely lives in just one place. It’s the way async, macros, generics, and dependencies all accumulate when you’re trying to build something complete.

Ecosystem fragmentation: Choosing everything yourself

This brings us to the part that makes everything else harder to navigate. When you start a Rust web project, the stack is yours to assemble. You pick the web framework, the database layer, the migration approach, the templating engine, the frontend integration, and the authentication method. Each piece has several options, and they don’t always compose cleanly.

The arewewebyet.org project summarizes it well. The page hasn’t been updated in about five years, but the point mostly still stands: Rust doesn’t have a dominant batteries-included framework at the level of Django or Rails. Most Rust web frameworks are smaller and modular, closer in spirit to Flask or Sinatra. The ecosystem is diverse, but you generally have to wire everything together yourself.

For experienced Rust developers who already have a preferred stack, that flexibility is often welcome. For developers newer to the ecosystem, or teams that just want to start building, it can be overwhelming before you’ve written a single line of business logic.

This is part of what motivates batteries-included frameworks. Loco.rs is the most established of them, already used in production by many teams. We’re working on cot.rs with similar goals, trying to close the gap and give developers a more complete starting point. Roadster is another project in this space, though it’s less widely known and we’re honestly not sure how much production use it sees yet.

None of these projects solve every problem we’ve described. Async is still async. Compile times still matter. Macros still need to be understood. But a more coherent starting point removes at least some of the repeated work that comes before the actual application.

So, is it worth it?

Rust web development in 2026 is better than it’s ever been, but it’s still not the fastest ecosystem to work in. The compiler catches most bugs before you ship, and in production that matters. Whether that tradeoff is worth it comes down to one honest question: How expensive are your bugs really?

If reliability and performance are core requirements, Rust’s drawbacks are probably worth it. If you need to move fast on something simple, Python will still get you there faster. Just go in knowing the difference.

We’re not there yet. But we’re getting there.

Frequently asked questions

Is Rust good for web development in 2026?

Yes, with realistic expectations. The ecosystem is more mature than ever, but it asks more of you upfront than Python or JavaScript. If your project needs performance, memory safety, or long-term reliability, it’s worth the investment. If you need to move fast on a simple CRUD app, you might be happier somewhere else for now.

What are the main problems with async Rust?

Backtraces become hard to read because the runtime resets them when execution yields. There’s still no async drop in stable Rust. Understanding tasks versus threads and how to handle pinning adds real complexity on top of an already demanding model. The performance is excellent. The ergonomics are catching up.

Why are Rust compile times so slow in web projects?

Several things compound each other: monomorphization from generics, large dependency trees, macro expansion at compile time, and the complexity of async code. Tools like faster linkers and alternative compiler backends help, but none of them solve it completely.

What is cot.rs?

A batteries-included Rust web framework we’re building to make starting a web project feel less like assembling a toolbox. It includes declarative migrations, readable query macros, auto-generated OpenAPI documentation, admin panel, and a more integrated approach to the full stack. You can find it at cot.rs.

Which Rust web framework should I use?

Axum and Actix-web are the most widely used for APIs. For a more complete starting point with less configuration, Loco.rs and cot.rs are worth exploring. There’s no single dominant choice yet, and that’s part of what makes starting a new Rust web project harder than it should be.


So Long and Thanks for All the Context


I got a really interesting question last week from Mike Loukides, my editor at Radar, after he read the third part of this trilogy on context management. “Another issue I’ve read about,” Mike asked, “is the tendency for a model to ignore the middle of the context. I’ve seen that particularly for the models with very large context windows. Is there anything to be said about that?”

Excellent question, Mike, and yes, there is. In that same email he pointed out that clearing the context and reloading it with just what’s important does a pretty good job dealing with this “ignore the middle” problem when it happens, but that’s clearly a stopgap.

It’s worth a deeper dive into what’s actually happening when an AI starts forgetting what’s in the middle of its context, because the problem is deeper (and more interesting!) than it might seem at first. It turns out that there’s a basic problem that’s fundamental to how LLMs manage context, and we’re still learning about it as an industry. That problem is called a U-shape. There’s been a lot of really interesting research into the U-shape problem recently, and several useful techniques have emerged that can help you manage it. And it’s probably not a coincidence that I’ve had to use all of them in my ongoing experiments with AI-driven development and agentic engineering (even if I didn’t always realize that’s what I was doing at the time).

A few weeks ago, in fact, I ran into the exact failure mode that Mike described. I was running the Quality Playbook, my open source code quality engineering skill, and ran into trouble with one of its phases—the one that writes up the bugs the earlier phases find. There’s a part of the bug writeup process where it had just created a file called BUGS.md that had an overview of each of the bugs, and had to create individual writeups for each bug it found. But instead of filling in the details correctly, it produced skeletal-looking stub files, with a generic template that had blank values instead of populated ones.

The thing is, the instructions for how to write a populated writeup were in the prompt. The actual bug data was in BUGS.md. I was absolutely certain that everything the agent needed was sitting in its context window, because I could see that it hadn’t compacted yet, and the skill’s intermediate artifacts let me see that earlier phases had read and reasoned about both files (which I talked about in my last article in this series). But the agent was producing stubs anyway. It really looked like the agent had everything it needed sitting in plain sight, and just wasn’t using the information it had. Frustrating!

I thought at the time that the model was just an idiot (which, arguably, was true but beside the point). It turns out that I had run directly into the U-shaped context problem.

In the previous three articles I covered what context is and why it disappears, how to keep important information in files instead of leaving it in the agent’s context window, and how to detect and recover when context has been compacted out from under you. All three were about losing context, through fragmentation, through compaction, through long sessions that overrun the window. This article is about this entirely different U-shaped failure mode, where the context is still sitting in the window and the model just isn’t using it.

The U-shape failure, and why bigger windows don’t fix it

The U-shape is an active area of academic investigation, so I’m going to start by going into a little bit of that research, because I think it will actually help us pin down what’s going on. I’ll start with an experiment run by Nelson Liu, an AI researcher at Stanford, who tested how language models actually use the contents of long inputs by giving them documents with the relevant answer placed at different positions and measuring whether the model could still find it. An interesting thing his findings show is that the U-shape didn’t appear to be a quirk of a single model. The U-shape showed up across model families, and even models with larger context windows still exhibited it.

If you have time, it’s actually worth taking a look at the paper that Liu and his team wrote, called “Lost in the Middle: How Language Models Use Long Contexts.” (It’s surprisingly readable for an academic paper.) The result they reported was a robust U-shape: The model performed best when the relevant information was at the beginning of its context window or at the recent end and worst when it was in the middle. Performance on questions where the answer was buried mid-context fell off sharply, even when the answer was sitting right there in plain sight. The field now uses the terms primacy bias and recency bias for those two preferences, and the U-shape is what you get when you plot them together against position.

I’m going to lean a little into academia here, because a lot of researchers are still learning about how LLM context actually works and what behavior has emerged in it.

One reason the U-shape matters more than “just another LLM quirk” is that recent research has started showing it’s a structural property of how transformers work, not a learned artifact. A 2025 ICML paper called “On the Emergence of Position Bias in Transformers” explained it as the equilibrium between two opposing forces inside the model: The causal mask amplifies the influence of the first few tokens (the primacy bias), while position encodings like RoPE heavily weight the tokens closest to where the model is generating (the recency bias). The middle is where those two forces cancel out. A 2026 paper by Borun Chowdhury, a researcher at Meta, called “Lost in the Middle at Birth: An Exact Theory of Transformer Position Bias,” took the argument even further by proving mathematically that the U-shape exists at the moment of initialization, before any training has happened, with random weights.

That matters because the natural assumption about large context windows is that more room means fewer problems. Most of today’s frontier models give you a million tokens or more, with some pushing well past two million, and some have made real progress on the simplest version of the lost-in-the-middle test, the needle-in-a-haystack benchmark, where the model has to retrieve a single sentence buried in a long document. Google’s Gemini 1.5 Pro reported near-perfect single-needle recall at 1M tokens, and current Gemini 3 models are similar.

So the accurate version of “bigger windows don’t fix it” is this: Bigger windows have made simple single-fact retrieval much better. They have not made long-context agent work reliable by default. A two-million-token window means a bigger middle to fall into.

The important idea that’s emerging here is that it’s increasingly looking like the U-shape isn’t just a bug in today’s models that will eventually be worked out or trained away by more data or better fine-tuning. Instead, it seems like the U-shape may actually be a geometric property of the LLM architecture itself.

In other words, we’re all going to have to deal with the U-shape. And that means we need techniques for managing it, and any effective technique we use isn’t likely to become obsolete any time soon. And that’s my goal in this article: to show you the techniques that have emerged for managing U-shaped context memory loss that you can use today in your own work.

Five techniques to help with U-shaped context problems

The previous article in this series laid out a pattern for detecting and recovering from context loss, which I called externalize-recognize-rehydrate. The techniques below extend the same discipline to the lost-in-the-middle problem. The principle I keep coming back to is that working memory is untrustworthy, and the discipline that follows from it is to externalize what matters, curate what stays in context, and verify what the agent claims to know against what’s on disk. The five techniques are how I do that in practice, and each one is drawn from a real moment in the Quality Playbook’s development.

Curate, don’t accumulate

This is the technique which, in its most brute-force form, is exactly what Mike talked about in his email to me: just clear the context and reload it with just what matters, periodically and deliberately. In other words, don’t trust an accumulated session to stay coherent; build the artifact, then start fresh against it. And if you have the AI write down the important parts of the context (like we’ve talked about throughout this series), then you can start a new session with refreshed AI that has a more targeted, curated context as a starting point.

I ran into this during the v1.5.2 release prep for the Quality Playbook. I was using a long Claude Code session that had been working through a series of fixes. But I noticed that it was just starting to show its age: It had forgotten a couple of things it should know, and its thinking times were starting to grow.

When it came time to land the final four fixes for the release, I worked with the AI to write a context brief: a separate document with everything the implementing session needed. The question was whether to keep using the existing session, which already “knew” the codebase from the earlier work, or open a fresh CLI session and point it at the brief. I asked another session what to do:

Should we run that in a new cli session rather than continue my current
claude code session that has the existing context?

The AI gave me a good answer—start a fresh session, using a starting prompt to read the brief—and it gave three reasons that have stuck with me. First, the brief was self-contained, including file paths, line numbers, exact diffs, regression test bodies, and preflight greps. Anything the new session needed to know was already there, and continuing context bought nothing. Second, fresh context is stricter about adherence. A session that already “knows” the codebase tends to skim the new instructions and improvise from prior assumptions. Surgical fixes are exactly the case where you want the agent to read the brief carefully rather than rely on memory of what felt right last round. And third, the audit trail: The brief is the artifact, and the implementing session is reproducible from just the brief. If the same work has to be redone in six months by a different model, you point at the brief and say, “This is the input.”

The approach worked really well. I was able to pick up development seamlessly, and the model’s memory problems disappeared.

Position critical information at the edges

The U-shape says the model attends best to the beginning and end of its context. The natural move is to put your most load-bearing information in those positions and keep the middle for things you don’t need the model to focus on. Anything important that lives only in the middle of an accumulated context tends to slide out of attention.

The other side of this technique is what not to put in the middle. If something matters, don’t bury it in a long preamble of context you’ve been accumulating; move it to the edges, restate it where the model will act on it, and let the middle absorb the less important material. Luckily, there’s a useful technique that can help with this problem.

In Claude Code, for example, one really clean way to put information at the beginning of context is to use the system prompt. The CLI gives you --append-system-prompt for exactly this. (Most of the other providers’ CLI tools have similar options.) If you put your brief (or selected parts of it) there, the agent will attend to it strongly throughout the session, and that in turn will help keep the per-turn user prompt focused on the action you want the agent to take right now.
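If you assemble prompts yourself rather than through a CLI flag, the same positioning can be done in a few lines. This is a minimal sketch, not anything Claude Code or another provider ships: the `build_prompt` helper, its parameters, and the section labels are all hypothetical names I made up for illustration. It puts the brief first, parks lower-priority material in the middle, and restates the action and its constraints at the very end.

```python
def build_prompt(brief, background, action, constraints):
    """Assemble a prompt with the load-bearing text at the start and end.

    The section labels are illustrative only; nothing here is a Claude Code
    or provider API.
    """
    parts = ["# Brief (read first)", brief.strip()]
    if background:
        # Lower-priority material goes in the middle, where attention is weakest.
        parts.append("# Background (reference only)")
        parts.extend(chunk.strip() for chunk in background)
    # The action and its constraints sit at the recent end of the context.
    parts.append("# Action (do this now)")
    parts.append(action.strip())
    if constraints:
        parts.append("# Constraints (restated)")
        parts.append("\n".join(f"- {c}" for c in constraints))
    return "\n\n".join(parts)


prompt = build_prompt(
    brief="Land the four remaining fixes described in the context brief.",
    background=["Notes from earlier sessions.", "Full test log."],
    action="Apply fix 1 only, then stop.",
    constraints=["Do not paraphrase from memory."],
)
print(prompt)
```

The point of the design is that the middle is optional: if `background` is empty, the prompt is all edges.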

Short sessions over long ones

Don’t run one long session. Run many short ones, each reading fresh from disk. This will help you iterate on your brief and your external development context, so instead of relying on an opaque context window, you have a visible and constantly changing set of documents that give you a lot more visibility into—and control over—your AI’s context.

Something useful I started doing was taking all my chat history from Gemini, ChatGPT, Claude, and Cowork and putting it into a single folder I could keep updated and indexed for fast search. I built out an entire system to manage this, which turns out to be a great tool when I’m writing articles like this, because I can search through my development history for specific examples and techniques that I’ve used. The system uses Haiku 4.5 to read through chat history, summarize what happened, and create an index. Haiku turned out to be a smart enough model to read each individual interaction in a chat and write a useful index entry for it. But the model being smart enough to do one summary didn’t mean its context management could keep up across all 18,000 records. I ran smack into the U-shape problem.

The first attempt tried to keep dedupe state and progress counts in the model’s head, and it failed spectacularly. The model really didn’t want to keep track of specific deterministic things like accurate numbers or the current state. Haiku 4.5, in particular, seems especially bad at this. What worked was reframing the architecture entirely. Here’s the actual prompt that I gave it to fix the problem:

ok, so we need context management. it doesn't need to remember things,
it just needs to write them down as they go. we had this same context
management problem with Quality Playbook, when it was running out of
context. Just write down after each message.

The protocol I greenlit for the full run made the short-session discipline explicit:

  1. Resume processing from the cursor recorded in progress.json, working through each input file in order.
  2. Update progress.json after every line.
  3. Expect to run out of context well before finishing—that’s fine. Just stop cleanly after each step (or a group of steps), then spin up a fresh session that reads progress.json and continues.
  4. When all files are complete, set status: “complete” in progress.json and report back.

Item 3 is the technique in one line: expect context loss, so make sure you’ve written your state down, and build fresh restarts into the process. The technical details, like spinning up subagents or orchestrating with scripts, will change, but the core idea stays the same. In a lot of ways, you can think of it as treating the agent like a pipe, not a database. The state lives on disk, and the session is something you throw away and replace.
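The protocol above can be sketched as a small resumable loop. This is a simplified illustration under stated assumptions, not the actual summarizer: it tracks a single `cursor` (the real progress.json tracked one per input file), and `run_session`, `budget`, and the `summarize` callback are hypothetical names standing in for "one throwaway session," "how much work fits before context runs out," and "the model call."

```python
import json
from pathlib import Path


def load_progress(path):
    """Read the cursor from disk, or start fresh if no file exists yet."""
    if path.exists():
        return json.loads(path.read_text(encoding="utf-8"))
    return {"cursor": 0, "status": "running"}


def save_progress(path, progress):
    """Write to a temp file and rename, so a crash never leaves half a file."""
    tmp = path.with_suffix(".tmp")
    tmp.write_text(json.dumps(progress), encoding="utf-8")
    tmp.replace(path)


def run_session(records, progress_path, out_path, summarize, budget):
    """Process at most `budget` records, recording state after every line."""
    progress = load_progress(progress_path)
    start = progress["cursor"]
    stop = min(start + budget, len(records))
    with out_path.open("a", encoding="utf-8") as out:
        for idx in range(start, stop):
            out.write(f"[{idx}] {summarize(records[idx])}\n")
            out.flush()
            # State goes to disk immediately; nothing is held "in the head".
            progress["cursor"] = idx + 1
            save_progress(progress_path, progress)
    if progress["cursor"] >= len(records):
        progress["status"] = "complete"
        save_progress(progress_path, progress)
    return progress
```

Calling `run_session` repeatedly with the same two paths picks up where the last call stopped, which is the whole idea: each call knows nothing except what progress.json tells it.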

Restate key info close to the point of use

When the model needs a constraint to apply right now, repeat it right now. Don’t trust an instruction from earlier in the session to carry forward through the middle of the context.

This is the technique that fixed the problem I opened the article with, where the Quality Playbook seemed to forget everything it had just written into a file called BUGS.md. When it needed to write the same information into more detailed files, it produced stubs instead: generic templates with the bug-specific fields left blank.

The fix was to restate the read-the-source rule right before the action that needed it, using this prompt:

Before writing BUG-NNN.md, re-read the BUG-NNN entry in BUGS.md.
Copy the Spec basis, Minimal reproduction, Location, Expected behavior,
Actual behavior, Regression test name, and Patches fields
from that entry into the writeup. Do not paraphrase from memory.

“Do not paraphrase from memory” is the line that did the actual work. The instruction couldn’t trust the agent’s memory of what BUGS.md said, even though BUGS.md was sitting right there in the context window. So the instruction forced a fresh read of the file at the moment of writing. The restatement and the fresh-read together fixed the bug.

The same pattern applies any time a rule was stated earlier in the session and the model needs to act on it now. Restate the rule next to the action, and force the model back to the source rather than letting it work from memory.
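If a script drives the agent, the restatement and the fresh read can both be generated at the moment of use. Here is a minimal sketch of that idea. It assumes each entry in BUGS.md starts with a `## BUG-NNN` heading, which is my guess for illustration and not necessarily the Quality Playbook's real layout; `extract_entry` and `build_writeup_prompt` are hypothetical helpers.

```python
import re
from pathlib import Path

FIELDS = [
    "Spec basis", "Minimal reproduction", "Location", "Expected behavior",
    "Actual behavior", "Regression test name", "Patches",
]


def extract_entry(bugs_text, bug_id):
    """Return one bug entry, from its heading up to the next '## ' heading."""
    pattern = rf"^## {re.escape(bug_id)}\b.*?(?=^## |\Z)"
    match = re.search(pattern, bugs_text, flags=re.MULTILINE | re.DOTALL)
    if match is None:
        raise KeyError(f"{bug_id} not found")
    return match.group(0).strip()


def build_writeup_prompt(bugs_path, bug_id):
    """Re-read the file from disk and place the entry next to the rule."""
    bugs_path = Path(bugs_path)
    entry = extract_entry(bugs_path.read_text(encoding="utf-8"), bug_id)
    return (
        f"Before writing {bug_id}.md, use the {bug_id} entry below, "
        f"read from {bugs_path.name} just now.\n"
        f"Copy these fields into the writeup: {', '.join(FIELDS)}.\n"
        "Do not paraphrase from memory.\n\n"
        f"{entry}"
    )
```

Because the file is read inside `build_writeup_prompt`, every writeup gets the current text of its entry at the recent end of the context, right beside the instruction that uses it.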

Test the middle

The previous four techniques are about avoiding lost-in-the-middle failures. This one is about catching them. If you don’t know whether the agent is actually using the information you think it’s using, find out, with a deterministic check rather than a judgment call.

The pattern is the one I used in the Haiku summarizer that I described earlier: compare what the agent claims to know against what’s on disk. You have something the agent claims to know (its progress, its current state, the latest version of a rule), and you have something on disk that’s the ground truth (a file, a log, a database record). At the moment the agent’s claim has to be trusted, you check it.

In the summarizer’s resume protocol, every new session started by cross-checking progress.json against the actual last line written to the summary file, and the agent printed a checkpoint report when it did—at session start, and periodically through the run. A representative one looked like this:

Checkpoint Report:
✓ progress.json confirmed: cursor for cowork_04_06 is at 238, status is
  "running"
✓ Disk state verified: Last line in summaries/cowork_04_06.md is [237]
  assistant: Tool invocation repeating chat file read.
⚠ Discrepancy noted: The prior session left a bulk note claiming records
  238–296 are duplicates but didn't write individual lines for them. Per
  your instructions, I must write one line per record, even for duplicates,
  in the format [idx] <sender>: Duplicate of record [X] (<note>).
Status: Cursor matches disk state. Ready to resume from record 238.

The agent doesn’t need to introspect whether it lost context, only to compare two files. When they agree, the agent proceeds; when they disagree, the agent flags the discrepancy and stops before adding any new work on top of a broken state. Disagreement is the signal.

You can build this kind of check into any agent that does multistep work. Pick something the agent has to track, pick the file that’s the source of truth for it, and have the agent compare the two at every session start. When the agent’s view of the world drifts from the file, you find out before the drift becomes a buried bug.
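A check like this is small enough to sketch. The version below assumes a single `cursor` field in progress.json and summary lines that begin with `[idx]`, matching the format in the report above; the function names and the `first_index` parameter are hypothetical. It follows the convention visible in that report, where a cursor of 238 agrees with a last written line of [237].

```python
import json
import re
from pathlib import Path

LINE_INDEX = re.compile(r"^\[(\d+)\]")


def last_written_index(summary_path):
    """Return the index of the last '[idx] ...' line on disk, or None."""
    if not summary_path.exists():
        return None
    last = None
    for line in summary_path.read_text(encoding="utf-8").splitlines():
        match = LINE_INDEX.match(line)
        if match:
            last = int(match.group(1))
    return last


def verify_checkpoint(progress_path, summary_path, first_index=0):
    """Compare the claimed cursor with what the summary file actually holds."""
    cursor = json.loads(progress_path.read_text(encoding="utf-8"))["cursor"]
    last = last_written_index(summary_path)
    expected = first_index if last is None else last + 1
    if cursor == expected:
        return True, f"Cursor matches disk state. Ready to resume from record {cursor}."
    # Disagreement is the signal: stop before building on a broken state.
    return False, f"Discrepancy: cursor is {cursor}, but disk state implies {expected}."
```

Running `verify_checkpoint` at every session start turns "did the agent lose track?" into a comparison of two files, with no judgment call involved.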

The discipline behind these techniques

When I built the Quality Playbook’s multi-phase architecture, I was solving the compaction problem. Long pipeline runs were filling the context window and triggering silent compaction in the middle of work. Breaking the pipeline into separate phases that read fresh from disk and stopped after each phase fixed it.

What I didn’t realize until later was that the same architecture also helps with the lost-in-the-middle problem. Each phase has its own short, focused context, with the phase brief at the beginning and the latest progress update at the end, so there’s almost no middle for information to fall into. The architectural move that helped with working memory disappearing turns out to also help with working memory being there and unused.

That’s the lesson I want to land. Both failure modes, context loss and lost-in-the-middle, are problems of working-memory unreliability, and the discipline that addresses them is the same: keep the working set small, put the load-bearing information at the edges of the window, and check the agent’s claims against ground truth on disk when it matters.

Context windows will keep getting bigger, and compaction will get smarter. Some of the techniques in these four articles may eventually be unnecessary. But the underlying constraint won’t disappear. After all, we’ve added a lot more RAM to our computers since the 1MB 286 I wrote about in the last article, and memory management has gotten much more complex since then. And many of these problems are structural; for example, it’s increasingly looking like the U-shape itself is a geometric property of the transformer architecture, not a training artifact that more compute will smooth out.

The bottom line is that if your agent’s ability to do its job depends on information, that information needs to live somewhere more durable than working memory. That was true for my dad’s 32 kilobytes of core memory at Princeton in the 1970s, it was true for my 640 kilobytes of conventional RAM on my 286 in the 1980s, it was true for the 200K-token windows in last year’s models, and it will be true for whatever comes next.




Explicit Lazy Imports Are Coming to Python 3.15


A while ago at PyCon US 2026, I had the pleasure of listening to the Python Steering Council give updates about new features that are being added in Python 3.15. One that stood out was explicit lazy imports (via PEP 810), which defer module loading until first use. I am curious to see how this new feature works, and I want to benchmark its performance with PyCharm. Let’s take a look together.

Overview of explicit lazy imports

PEP 810 introduces an explicit syntax for lazy imports, allowing you to defer the loading and execution of modules until their attributes are actually accessed, unlike standard eager imports that execute immediately. This feature aims to significantly reduce startup latency and memory consumption. Explicitly marking modules as `lazy` can deliver substantial improvements in initial responsiveness and baseline resource usage in large-scale applications and command-line tools.

Because the implementation approach uses proxy objects within the module’s namespace instead of modifying Python’s fundamental dictionary structures, it preserves critical interpreter optimizations. 

This mechanism defers both the finding and the loading of the module to maximize efficiency, especially in environments with high-latency filesystems. To manage potential side effects and ensure backward compatibility, the proposal includes global control flags and a transitional variable for progressive adoption across different Python versions.

In short, Python 3.15 will let you optimize application performance by significantly reducing startup latency and memory consumption, as the loading and execution of modules are deferred until their attributes are actually accessed.

Trying them out in Python 3.15.0b1

At the time this is being written, Python 3.15.0b1 is already out, so we can give this new feature a try. You can build it from source at the CPython GitHub repo, but since getting Python 3.15.0b1 is easy when using `uv` or `pyenv`, we will do that instead.

Make sure you have the latest version of `uv` or `pyenv`, and then download Python 3.15.0b1 via either of the following commands:

  • `uv python install 3.15.0b1`
  • `pyenv install 3.15.0b1`

After that, select the new interpreter in your project in PyCharm.

Now you will need to reinstall the dependencies for your project. You may have to build some of the libraries from source, as most of the libraries will not have a Python 3.15 wheel for download.

Profiling against normal imports

It is a common joke that the first thing data scientists will do is type `import pandas as pd` and `import numpy as np`, even if they are not actually going to use them. Let’s assume this is the case, and you received a script like this from your colleague:

import pandas as pd
import matplotlib.pyplot as plt
import numpy as np

def main():
    print("Initializing example data science project...")
    
    # Generate some dummy data
    data = {
        'x': np.linspace(0, 10, 100),
        'y': np.sin(np.linspace(0, 10, 100)) + np.random.normal(0, 0.1, 100)
    }
    
    # Plotting
    plt.figure(figsize=(10, 6))
    plt.plot(data['x'], data['y'], label='Sine Wave with Noise')
    plt.title('Sample Visualization')
    plt.xlabel('X-axis')
    plt.ylabel('Y-axis')
    plt.legend()
    
    # Save the plot instead of showing it (since this is non-interactive)
    plt.savefig('sine_wave.png')
    print("Project executed successfully. Plot saved as sine_wave.png.")

if __name__ == "__main__":
    main()

As you see, PyCharm highlights the unused pandas import for you, so removing it would be straightforward. However, for our experiment here, we’ll keep it.

To get a better visualization of import profiles, install a tool from PyPI called tuna.

You can profile your script by setting a custom run script with this script text:

python -X importtime main.py 2> import_log.txt; tuna import_log.txt

When you use it, a new browser window will pop up with the import graph. 

As you see, importing pandas accounts for half of the time it takes to load all the modules, and we never use it!
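If you would rather read the log in a terminal than in a browser, the same numbers can be pulled out with a short script. This is a minimal sketch that assumes the log lines follow CPython's `import time: self [us] | cumulative | imported package` layout; `parse_importtime` and `slowest` are hypothetical helper names, not part of tuna or the standard library.

```python
import re

# One data line looks like: "import time:       300 |        400 | numpy"
LINE = re.compile(r"^import time:\s+(\d+)\s+\|\s+(\d+)\s+\|\s+(\S+)\s*$")


def parse_importtime(log_text):
    """Return (module, self_us, cumulative_us) tuples from an importtime log."""
    rows = []
    for line in log_text.splitlines():
        match = LINE.match(line)
        if match:  # the header line has no digits in the first column, so it is skipped
            self_us, cumulative_us, name = match.groups()
            rows.append((name, int(self_us), int(cumulative_us)))
    return rows


def slowest(log_text, top=5):
    """Return the `top` imports ranked by cumulative time, slowest first."""
    rows = parse_importtime(log_text)
    return sorted(rows, key=lambda row: row[2], reverse=True)[:top]
```

To use it on the log from the command above, pass the file contents in, for example `slowest(open("import_log.txt").read())`. Cumulative time includes nested imports, so top-level packages such as pandas should surface at the head of the list.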

Now let’s add `lazy` to all the imports.

Don’t worry about the syntax highlighting. PyCharm just doesn’t recognize it yet since `lazy` is a new keyword that has not been officially released.

Let’s profile the script again.

Now we see the pandas import is gone, and loading everything takes way less time.

So, if you have a script that imports a lot of large libraries, and some of them are only used in certain conditions (e.g. in if-else clauses), lazy import can save time by loading modules only when they are first used.

Checking the inner workings with lazy imports

Let’s see how lazy imports are handled internally.

When a module is imported “lazily”, meaning `__lazy_import__` is called instead of `__import__`, a `types.LazyImportType` proxy object will be created. The module name will then be listed in `sys.lazy_modules` instead of `sys.modules`. (See the Lazy import mechanism section in PEP 810.)

When a lazy object is used, it needs to be reified. CPython will try resolving the import at that point and replacing the proxy object with the actual module itself. In this process, `__import__` is called to resolve the import. At the same time, the module is removed from `sys.lazy_modules`.

If there’s an error during reification, AKA importing the module, the lazy object is not reified or replaced. The next time the lazy module is used, the import will try again. The exception raised during reification will also show both where the lazy import was defined and where it was accessed. (See the Reification section in PEP 810.)

To experiment with it ourselves, let’s add some breakpoints with `pdb` and check what’s happening in the code:

import pdb
pdb.set_trace()

lazy import pandas as pd
lazy import matplotlib.pyplot as plt
lazy import numpy as np

pdb.set_trace()
…

And, further down inside `main()`:

    # Generate some dummy data
    data = {
        'x': np.linspace(0, 10, 100),
        'y': np.sin(np.linspace(0, 10, 100)) + np.random.normal(0, 0.1, 100)
    }

    pdb.set_trace()
    
    # Plotting
    plt.figure(figsize=(10, 6))
    pdb.set_trace()
…

Now run the script in the console:

python main.py

Note that PyCharm 2026.1 does not yet support Python 3.15, so using the Run or Debug button to run a script using lazy import may result in unexpected behavior.

When it hits the first `pdb.set_trace()` at the top, no modules should be loaded yet. Let’s check:

(Pdb) import sys
(Pdb) sys.lazy_modules

As expected, none of our libraries – pandas, numpy, and matplotlib – are listed.

Now, let’s continue running the program and let it stop at the next breakpoint. In the console, type `continue` and once it stops, we can check by typing `sys.lazy_modules` again:

Here, we see that all of our modules are in `lazy_modules`. Let’s check whether pandas is in `sys.modules`:

(Pdb) 'pandas' in sys.modules

Nope, it’s not. You can try with numpy and matplotlib, and you will see that neither of those is in `sys.modules`.

Now let’s type `continue` again and reach the next breakpoint, which occurs after numpy is used. Check `sys.lazy_modules` again, and you’ll see that numpy is no longer on the list. When we check whether it is in `sys.modules`, we get `True` this time.

However, pandas and matplotlib are still not in `sys.modules`.

When you check the next breakpoint, you’ll see that matplotlib is similarly removed from `sys.lazy_modules` and added to `sys.modules` after it is used.
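If you don’t have a 3.15 build at hand, you can watch a similar deferral on released Python versions using the standard library’s `importlib.util.LazyLoader`. To be clear, this is a different and older mechanism, not the PEP 810 implementation: the module object is placed in `sys.modules` right away, and there is no `sys.lazy_modules`. The `lazy_import` helper below follows the recipe in the `importlib` documentation, and `colorsys` is just a small stdlib module picked for the demonstration.

```python
import importlib.util
import sys
import types


def lazy_import(name):
    """Return a module whose code runs only on first attribute access."""
    spec = importlib.util.find_spec(name)
    if spec is None or spec.loader is None:
        raise ImportError(f"cannot find module {name!r}")
    loader = importlib.util.LazyLoader(spec.loader)
    spec.loader = loader
    module = importlib.util.module_from_spec(spec)
    sys.modules[name] = module
    loader.exec_module(module)  # arranges the deferral; the module body has not run yet
    return module


colorsys = lazy_import("colorsys")
deferred = type(colorsys) is not types.ModuleType  # still a lazy placeholder type
hsv = colorsys.rgb_to_hsv(1.0, 0.0, 0.0)           # first use triggers the real load
loaded = type(colorsys) is types.ModuleType        # now an ordinary module
print(deferred, hsv, loaded)
```

The shape is the same as what the breakpoints showed: a placeholder stands in for the module until something touches it, and only then does the import actually happen.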

Trying it yourself with PyCharm

Download the latest version of PyCharm to experiment with Python 3.15.0b1 and experience firsthand how explicit lazy imports can optimize your application’s performance by significantly reducing startup latency and memory consumption.


Blog: A Minute from the Moderators


Welcome back to Moderator Minutes. This is the third post in our series on Community Care in Online Spaces.

A note before we begin, and as with the rest of this series: these posts run a little denser than our usual Moderator Minutes, and we don’t expect anyone to absorb all of it at once. We’ll keep the recap short, because each post is meant to stand on its own, but if you’d like the fuller picture: Part 1 looked at how shifting norms create friction, and asked you to sit with a question: is it working? Part 2 looked at what drives us toward confrontation in the first place, and at the idea of care webs: the networks of mutual support that let us do collective work without each of us carrying it alone.

Both of those posts asked you to lead with connection. To avoid assuming bad faith. To choose communication over correction, and to extend considered grace to people who are, like all of us, navigating norms that keep shifting under their feet.

You may have noticed that we kept setting something aside. We said, more than once, that choosing community care does not guarantee care in return. We said there are times when connection is not the answer, and that we would come back to them. And we mentioned, briefly, that when connection fails you have other tools available and that we would talk more about those later.

This is later. This post is about what to do when connection isn’t enough.

Connection First Is Not Connection Only

We want to be clear about something up front, because it would be easy to read this post as a reversal of the previous two. It isn’t.

Everything we said about leading with grace still holds. For the vast majority of the friction we see, the person on the other side is not acting in bad faith, and connection really is the better choice. None of that changes.

It’s worth saying why we keep recommending it, because the reason is not sentimentality. Connection-first is, in part, a guard against two equally costly mistakes: the naïveté that insists no one would ever act in bad faith, and the suspicion that treats everyone as an enemy until proven otherwise. When you enter an unfamiliar interaction, you usually don’t yet know which kind you’re in. You’re building that picture in real time, from the data the interaction gives you. Leading with connection means keeping the other person’s humanity in view while that picture forms, rather than deciding it in advance. And there’s another reason, which we offer gently: how you treat someone you’ve judged to be acting badly is, in the end, as much a statement about you as about them. That matters especially because the judgment that feels most certain in the moment is exactly the kind we sometimes get wrong.

But “lead with connection” was never the same as “extend infinite grace to everyone, forever, regardless of how they respond.” Those are different claims, and the difference is the whole subject of this post. You can afford to lead with generosity precisely because you are not obligated to keep extending it into a void. Boundaries are not the opposite of community care. They are part of what makes it sustainable. They are how you keep the door open without leaving yourself standing in the doorway indefinitely, absorbing whatever comes through it.

So this is not the post where we tell you to stop being generous. It’s the post about what holds that generosity up.

Boundaries: The Part You Control

A useful place to start is a distinction that quietly resolves a lot of online conflict: the difference between what you will do and what you want or need someone else to do.

The first is a boundary. The second is a request.

“I’m not going to keep discussing this tonight” is a boundary. You control it completely. “You need to stop replying to me” is a request, and whether it’s honored is no longer in your hands. Both can be reasonable. But only one of them is something you can actually enforce, and a great deal of exhaustion comes from trying to defend a line that depends on someone else’s cooperation to hold.

Boundaries phrased as things you control are ones you can keep. Muting a thread you keep getting pulled back into. Saying once, clearly, that you’re stepping away, and then actually stepping away rather than staying to monitor how it lands. Deciding in advance what you’ll do if a particular kind of message arrives, so the decision is already made when it does. None of these require the other person to agree, which is exactly why they work.

Holding a boundary is harder than setting one, especially in public, and extra especially when someone is testing it. The pull to re-engage is not a personal failing; it’s a normal response to the discomfort of leaving something unresolved and the worry that silence will be read as concession. It helps to remember two things. The first: you can’t control how your boundary is interpreted. Trying to, by staying just long enough to correct the record, is usually how it collapses. The second: you can hold the boundary whether or not it is understood. You do not have to announce it twice, defend it, or win the argument about whether you were allowed to draw it.

This is, itself, a form of the community care we’ve been describing. In Part 1 we wrote that the practice of choosing connection over punishment is part of building the community we all want to live in. Setting a clear boundary is part of that same building. It models that limits are normal, that they can be stated without hostility, and that staying in community does not require dissolving yourself into it.

When the Other Person Isn’t Trying to Hear You

Most of what we’ve written about de-escalation, here and in earlier posts, is for situations where the other person wants to understand you. In those situations, it’s often true that if you communicate clearly, you’ll be heard, because you both share the goal of understanding.

Sometimes it isn’t true, though. Sometimes you communicate clearly, you make a reasonable request, and the other person still doesn’t hear you. As we noted in Part 1, when that happens, that is not on you. And sometimes the person on the other side is not trying to reach an understanding at all. They are reaching for something else. Maybe they want to change your mind as much as you want to change theirs. Maybe they are starting from a different premise, and the two of you are not actually arguing about the same thing. Maybe they just want a reaction. Whatever the reason, the part you can act on is the same: they aren’t hearing you. Sorting out exactly why rarely changes what you do next.

De-escalation looks different in that situation. It is no longer about reaching the other person, because the other person is not reachable in the way de-escalation depends on. It becomes, instead, about not handing them fuel. You lower your own stakes rather than matching theirs. You decline the thread instead of feeding it. You resist the pull to perform for whoever might be watching. The strongest move available is often simply to stop adding to it, not because you’ve conceded, but because continuing only gives the dynamic more to run on.

Part of what makes these situations so sticky is the urge to make the other person admit the thing. As a slightly more precise example: let’s say you witness someone being ableist, to you or another. You are unlikely to convince them of it in the moment, for any number of reasons. They may tell you they are not being ableist: they are only being direct, or honest, or precise. Here it helps to notice that both things can be true at once. Someone can be direct (or precise, etc.) and ableist. Someone can be giving genuine advice and be ableist in how they give it. “I’m actually doing this other thing” rarely makes the first thing stop being true, and arguing the point usually just feeds the dynamic we just described. So it’s worth asking what you actually want. Maybe part of what you want is for other people to come away with better understanding, so that they have an opportunity to grow. A fresh post in a separate thread often serves that far better than the argument in front of you: the people who want to learn can choose to enter it, and it doesn’t drag in the person who has made clear that is not how they are engaging at this time.

This is where disengagement stops being avoidance and becomes a genuine tool. Walking away from an interaction that has nowhere good to go is not losing. It is refusing to spend your energy where it cannot accomplish anything.

The Tools You Already Have

It’s worth pausing on why a post about bad faith has spent so long on boundaries and self-regulation. The reason is a framing choice: a great deal of bad-faith and inauthentic behavior can be modeled as a boundary violation. Content you don’t want shown to you. A phishing attempt you don’t want to engage with. An effort to extract protected data (e.g. personal information, payment details) that you don’t want to disclose. Attempts to slip past the filters so you see content you opted out of. In person-to-person interactions, sometimes the other person genuinely doesn’t understand your boundary, since you are two different people. And it is also true that sometimes others put in real effort to get around boundaries. By choosing to frame these as two sides of the same coin, boundary violations, we have a dual-purpose model that serves both when people don’t understand and when they definitely do. And the tools for one are the tools for the other.

While not a software feature, disengagement is a tool at your disposal. It is not the only one, and for the harder situations it is not enough on its own. This is the part Part 1 seeded that we wanted to give real space to here.

Beyond simply stepping away, most platforms give you tools to enforce boundaries directly, and Mastodon is no exception: you can mute or block an individual, or block an entire server. You can filter keywords in or out. And you can do this independent of the “faith” of the engagement you are in or witnessing. You can do it for any reason at all. Maybe you dislike the way you see someone engaging with others. Maybe there are some forms of content that aren’t harmful, but that you simply don’t want in your day. These tools are yours, and using them does not require you to build a case.

When an interaction has gone past the point where connection or boundaries can resolve it (someone is harassing you, evading your boundaries, escalating instead of de-escalating, etc.), you have a path that does not require you to either keep absorbing it or handle it alone: you can report it to your moderation team.

Bringing in a moderation team is neither an escalation nor a failing. It is the opposite of what was flagged in Part 1, where people are fighting it out instead of reaching for other options. Reaching out for help is a de-escalatory move. It takes a conflict that is not resolving on its own and redirects it toward another approach or a set of neutral eyes.

You might hold back from reporting because you don’t want to add to a team that is already carrying a lot. That hesitation is its own kind of care, and it is worth setting down gently. A report is not a weight dropped on someone else. It is the web doing what it is built for, and a pattern no one names is harder to hold than one that is reported.

This is, in the most concrete sense, the care web from Part 2 doing what it’s for. You were never meant to carry the whole interaction by yourself. A moderation team is part of the infrastructure that holds it with you. When you report, you are not offloading a problem; you are using a structure that exists so that no single person has to be the entire response to harm directed at them.

And reporting is care directed outward, too. A report is information. It helps the people whose job is to keep a space safe see patterns they would otherwise miss, and protect others who may be experiencing the same thing and saying nothing. It provides visibility when someone is testing more boundaries than only yours. In this way, choosing to report contributes to community care.

Practical Notes for Hachydermians

Everything so far applies on any instance or platform. These last notes are specific to Hachyderm, so reporting here never becomes a barrier.

Report when behavior crosses into harassment or boundary violations, on or off Hachyderm. For anything that has a lot of complexity or deeper back-and-forth, email us. That is also how we receive security reports. For one-to-one or point-in-time things, Mastodon’s Report feature works very well. Both reach us, so use whichever suits the moment. Please don’t worry about whether you have all the details; we’ll ask follow-up questions if we have them. What matters is that your report comes through one of these, where we can act on it. When another instance is involved, please feel free to reach out to us if you would like us to engage with their moderation team.

Also, and this is important: none of these are last resorts. You do not have to wait until something is unbearable to use them. They are part of how a healthy space takes care of the people in it.

What Holds It Up

Across these three posts, we’ve described a range of practices, some of which might feel contradictory. They ask you to lead with connection and extend grace, while encouraging you at the same time to set and maintain boundaries, disengage, and ask for help / report to your moderation team as needed.

The tension is easier to feel once you notice that the tools themselves can be turned to more than one purpose. The same block can enforce a boundary or deliver a punishment. The same report can protect a community or settle a score. This is the line Part 1 drew between connecting and punishing, and it runs straight through the tools. Before you act, the question worth asking is which one you are reaching for, because that is the part still in your hands.

That goal changes how you act, even when the visible outcome looks similar. Someone blocked to keep a boundary and someone blocked to be punished may both simply experience being blocked, and you cannot fully control which way it lands for them. How an action is received is its own matter, with its own weight, and worth a fuller conversation another time. What you can govern in the moment is the aim, and the aim shapes the conduct around it. Protecting your own space tends to look quiet: you do it and then stop. Trying to make someone pay tends to look loud: you reach for an audience, you escalate, you keep going. The boundary closes a door. The punishment tries to follow them through it.

This approach to community care is multifaceted, rather than contradictory. The generosity of the first two posts and the limits of this one are both expressions of one purpose: building and protecting the community you actually want to live in. You can keep choosing connection, again and again, because you are not defenseless when it isn’t returned. That community is shaped as much by where you hold a line as by where you extend a hand. Both are how it gets made.

Thus far, we have described moderation from the outside, as something to reach toward when connection runs out. In future posts we’ll discuss the work of moderation itself: not the reporting that is covered here, but what happens on the other side of it.

How you can help others

We know a lot of people are running low right now, so we’ll keep the prompt light this time: what is a content filter or mute you would recommend to someone new to the Fediverse? Answer wherever feels easy, in Zulip or out on the Fediverse with #CommunityCare. As a reminder: we’re not automating the Discord to Zulip migration, so please check in Discord for information on how to be added to Zulip.

Further reading

(Many of these can also be found in the Further Reading section of Part 2.)

On calling in and calling out. The phrase “calling in” comes from Ngọc Loan Trần’s 2013 essay “Calling IN: A Less Disposable Way of Holding Each Other Accountable,” published on Black Girl Dangerous and archived on TransformHarm.org. Loretta Ross has spent the decade since turning the idea into something teachable, drawing on five decades of organizing. Her TED talk runs about fifteen minutes and is a good place to start. Her book Calling In: How to Start Making Change with Those You’d Rather Cancel (Simon & Schuster, 2025) develops a five-part continuum of responses (canceling, calling out, calling off, calling on, and calling in) with practical guidance on when each one fits.

On online harassment and digital safety. Each is most useful read before you need it; the differences are in emphasis.

  • PEN America’s Online Harassment Field Manual: organized by your role in the situation, whether you are being targeted, witnessing it, or running an organization where staff are. Written especially for writers, journalists, artists, and activists. (Appears to have geo restrictions.)
  • Games and Online Harassment Hotline Digital Safety Guide: the most granular on specific tactics like doxxing prevention and hate raids, and direct about how well-meaning allies can amplify harassment by stepping in. The hotline closed in October 2023, but the guide is still online.
  • EFF’s Surveillance Self-Defense: the deepest on technical infrastructure, with a “Security Scenarios” section that tailors a learning path to your situation.

On community safety and transformative justice. Get in Formation: A Community Safety Toolkit, from Vision Change Win, collects security and safety practices developed over years within Black, Indigenous, and People of Color movements in the U.S. It covers verbal and physical de-escalation, bystander intervention, and organizational safety planning, with handouts and worksheets you can adapt to your own conditions. It lives on The Commons Social Change Library, a broader catalogue of openly accessible movement resources worth browsing when your question is harder to name in advance. For the deeper organizing frame, Beyond Survival: Strategies and Stories from the Transformative Justice Movement (Leah Lakshmi Piepzna-Samarasinha and Ejeris Dixon, eds., AK Press, 2020; review on Autostraddle) gathers approaches to addressing harm without relying on punishment.
