
Auto-Reviewing Claude’s Code

This post first appeared on Nick Tune’s Weird Ideas and is being republished here with the author’s permission.

A well-crafted system prompt will increase the quality of code produced by your coding assistant. It does make a difference. If you provide guidelines in your system prompt for writing code and tests, coding assistants will follow the guidelines.

Although that depends on your definition of “will follow.” If your definition is “will follow often” then it’s accurate. If your definition is “will follow always” or even “will follow most of the time,” then it’s inaccurate (unless you’ve found a way to make them reliable that I haven’t—please let me know).

Coding agents will ignore instructions in the system prompt on a regular basis. As the context window fills up and starts to intoxicate them, all bets are off.

Even with the latest Opus 4.5 model, I haven’t noticed a major improvement. So if we can’t rely on models to follow system prompts, we need to invest in feedback cycles.

I’ll show you how I’m using Claude Code hooks to implement automatic code review on all AI-generated code so that code quality is higher before it reaches the human in the loop.

You can find a code example that demonstrates the concepts discussed in this post on my GitHub.

Auto Code Review for Fast, Semantic Feedback

When I talk about auto code review in this post, I am describing a fast feedback mechanism intended to review common code quality issues. This will be run whenever Claude has finished making edits, so it needs to be fast and efficient.

I also use coding assistants for detailed code reviews when reviewing a PR, for example. That will spin up multiple subagents and take a bit longer. That’s not what I’m talking about here.

Coding Assistant

The purpose of the auto code review is to reinforce what’s in your system prompt, project documentation, and on-demand skills. Things that Claude may have ignored. Part of a multipronged approach.

Wherever possible, I recommend using your lint and test rules to bake in quality, and leave auto code review for more semantic issues that tools can’t check.

If you want to set a maximum length for your files or maximum level of indentation, then use your lint tool. If you want to enforce a minimum test coverage, use your test framework.

Semantic Code Review

A semantic code review looks at how well the code is designed. For example, naming: Does the code accurately describe the business concepts it represents?

AI will often default to names like “helper” and “utils.” But AI is also good at understanding the nuance and finding better names if you challenge it, and it can do this quickly. So this is a good example of a semantic rule.

You can ban certain words like “helper” and “utils” with lint tools. (I recommend doing that.) But that won’t catch everything.
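For reference, pushing the structural rules mentioned above (max file length, nesting depth) and the banned-word check into lint might look like the sketch below. This assumes an ESLint setup, which the post doesn't prescribe, so the rule choices, thresholds, and banned words are illustrative.

// eslint.config.js - structural limits and banned identifiers (thresholds are examples)
export default [
  {
    rules: {
      // Keep file length and nesting limits in lint, not in the auto code review.
      "max-lines": ["error", { "max": 300 }],
      "max-depth": ["error", 3],
      // Ban lazy identifier names; the semantic review catches what this misses.
      "id-denylist": ["error", "helper", "helpers", "util", "utils"],
    },
  },
];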

Another example is logic leaking out of the domain model. When a use case/application service queries an entity and then makes a decision, it’s highly likely your domain logic is leaking into the application layer. Not so easy to catch with lint tools, but worth addressing.

Domain logic leak

Another example is default fallback values. When Claude has an undefined value where a value is expected, it will set a default value. It seems to hate throwing exceptions or challenging the type signature and asking, “Should we allow undefined here?” It wants to make the code run no matter what and no matter how much the system prompt tells it not to.

Default fallback values

You can catch some of this with lint rules but it’s very nuanced and depends on the context. Sometimes falling back to a default value is correct.

Building an Auto Code Review with Claude Hooks

If you’re using Claude Code and want to build an auto code review for checks that you can’t easily define with lint or testing tools, then a solution is to configure a script that runs on the Stop hook.

The Stop hook is when Claude has finished working and passes control back to the user to make a decision. So here, you can trigger a subagent to perform the review on the modified files.

To trigger the subagent, you need to return an error status code, which blocks the main agent and forces it to read the output.

Trigger the subagent
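As a concrete starting point, the registration might look like the sketch below in .claude/settings.json. The schema and exit-code behaviour are based on the Claude Code hooks documentation at the time of writing, so verify them against your version; the script path is a placeholder.

{
  "hooks": {
    "Stop": [
      {
        "hooks": [
          { "type": "command", "command": "python .claude/hooks/auto_review.py" }
        ]
      }
    ]
  }
}

A command hook that exits with code 2 is treated as a blocking error: Claude is not allowed to stop, and whatever the script wrote to stderr is fed back to the model, which is how the review subagent gets its instructions.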

I think it’s generally considered best practice to use a subagent focused on the review with a very critical mindset. Asking the main agent to mark its own homework is obviously not a good approach, and it would also eat into your context window.

The solution I use is available on GitHub. You can install it as a plug-in in your repo and customize the code review instructions, or just use it as inspiration for your own solution. Any feedback is welcome.

In the example above you can see it took 52 seconds. Probably quicker than me reviewing and providing the feedback myself. But that’s not always the case. Sometimes it can take a few minutes.

If you’re sitting there blocked waiting for review, this might be slower than doing it yourself. But if you’re not blocked and are working on something else (or watching TV), this saves you time because the end result will be higher quality and require less of your time to review and fix.

Scanning for Updated Files

I want my auto code review to only review files that have been modified since the last pull request. But Claude doesn’t provide this information in the context to the Stop hook.

I can find all files modified or unstaged using Git, but that’s not good enough.

What I do instead is hook into PostToolUse and keep a log of each modified file.

PostToolUse

When the Stop hook is triggered, the review will find the files modified since the last review and ask the subagent to review only those. If there are no modified files, the code review is not activated.
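Here is a minimal sketch of how those two hooks can share a single script. The stdin field names ("hook_event_name", "tool_input", "file_path") are assumptions taken from the Claude Code hook documentation, and the PostToolUse entry would be registered with a matcher such as "Edit|Write|MultiEdit". Treat it as a starting point rather than the plugin’s actual implementation.

#!/usr/bin/env python3
"""Auto-review hook sketch. Field names and exit-code behaviour are assumed
from the Claude Code hook docs; verify against your version."""
import json
import sys
from pathlib import Path

LOG = Path(".claude/modified-files.log")

def main() -> int:
    event = json.load(sys.stdin)  # Claude Code passes hook input as JSON on stdin
    name = event.get("hook_event_name", "")

    if name == "PostToolUse":
        # Record every file Claude edits so the Stop hook knows what to review.
        file_path = (event.get("tool_input") or {}).get("file_path")
        if file_path:
            with LOG.open("a") as f:
                f.write(file_path + "\n")
        return 0

    if name == "Stop":
        if not LOG.exists():
            return 0  # nothing modified since the last review, skip
        files = sorted(set(LOG.read_text().splitlines()))
        LOG.unlink()  # reset for the next round of edits
        if not files:
            return 0
        # Exit code 2 blocks the Stop and feeds stderr back to Claude,
        # which is what triggers the review subagent.
        print(
            "Use a code-review subagent to critically review these files "
            "against the project guidelines: " + ", ".join(files),
            file=sys.stderr,
        )
        return 2

    return 0

if __name__ == "__main__":
    sys.exit(main())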

Challenges with the Stop Hook

Unfortunately the Stop hook is not 100% reliable for this use case for a few reasons. Firstly, Claude might stop to ask a question, e.g. for you to clarify some requirements. You might not want the auto review to trigger here until you’ve answered Claude and it has finished.

The second reason is that Claude can commit changes before the Stop hook. So by the time the subagent performs the review, the changes are already committed to Git.

That might not be a problem, and there are simple ways to solve it if it is. It’s just an extra thing to keep in mind and set up.

The ideal solution would be for Anthropic (or other tool vendors) to provide us hooks that are higher level in abstraction—more aligned with the software development workflow and not just low-level file modification operations.

What I would really love is a CodeReadyForReview hook which provides all the files that Claude has modified. Then we can throw away our custom solutions.

Let Me Know If You Have a Better Approach

I don’t know if I’m not looking in the right places or if the information isn’t out there, but I feel like this solution is solving a problem that should already be solved.

I’d be really grateful if you can share any advice that helps to bake in code quality before the human in the loop has to review it.

Until then I’ll continue to use this auto code review solution. When you’re giving AI some autonomy to implement tasks and reviewing what it produces, this is a useful pattern that can save you time and reduce frustration from having to repeat the same feedback to AI.




Roll up your chair: How one small change sparked a DevOps revolution


My first encounter with DevOps was so simple that I didn’t even realize its power. Let me share the story so you can see how it went from accidental discovery to deliberate practice, and why it was such a dramatic pivot.

The backdrop to this pivotal moment was a software delivery setup you might find anywhere. The development team built software in a reasonably iterative and incremental fashion. About once a month, the developers created a gold copy and passed it to the ops team.

The ops team installed the software on our office instance (we drank our own champagne). After two weeks of smooth running, they promoted the version to customer instances.

It wasn’t a perfect process, but it benefited from muscle memory, so there wasn’t an urgent imperative to change it. The realization that a change was needed came from the first DevOps moment.

The unplanned first moment

When the ops team deployed the new version, they would review the logs to see if anything interesting or unexpected popped up as a result of the deployment. If they found something, they couldn’t get a quick answer, and it sometimes meant they opted to roll back rather than wait.

This was a comic-strip situation because the development team was a few meters away in their team room. It’s incredible how something as simple as a door transforms co-located teams into remote workers.

The ops team raised their request through official channels, and the developers didn’t even know they were causing more work and stress because the ticket hadn’t reached them yet.

Thankfully, one of the ops team members highlighted this. The next time they started a deployment, a developer was paired with them to watch the logs. A low-fi solution and not one you’d think much about. That developer was me. For this post, we’ll call my ops team partner “Tony”.

Shared surprises lead to learning

The day-one experience of this new collaborative process didn’t seem groundbreaking. When a log message popped up that surprised Tony, it surprised me too. The messages weren’t any more helpful to a developer than they were to the ops team.

I could think through what might be happening, talk it through, and then Tony and I would come up with a theory. We’d test the theory by trying to make another similar log message appear. Then we’d scratch our heads and try to decide whether this could wait for a fix or warranted a rollback.

The plan to bring people from the two teams together was intended to remove the massive communication lag, and it did. But further improvements were to come as a side effect, yielding more significant gains.

Resolve pain pathways by completing the loop

As a developer, when you generate log messages and then have to interpret them, you’ve completed a pain loop. Pain loops are potent drivers of improvement.

Most organizations have unresolved pain pathways. That means someone creates pain, like a developer throwing thousands of vague exceptions every minute, and then someone else feels it, like Tony when he’s trying to work out what the log means.

There are two ways to resolve the pain pathway.

  • Process: You create procedures to bring pain below the threshold and to limit the rate at which it is generated.
  • Loops: You connect the pain into a loop, so the person causing the pain feels its signal.

If I’m the one who gets the electric shock when I press the button, I stop pushing it, even if someone in a white coat instructs me to continue the experiment.

With the pain loop connected, I realized we should log fewer messages to reduce the scroll and review burden. Instead of needing institutional knowledge of which messages were perpetually present and could therefore be ignored, we could stop logging them.

The (perhaps asymptotic) goal was to log only the events that required human review, with a toggle that let more verbose logging be generated on demand. Instead of scrolling through a near-infinite list of logs, you’d have a nearly empty view. If a log appeared, it was important enough to warrant your attention.

The next idea was to improve the information in the log messages. We could identify which customer or user experienced the error and provide context for it. By improving these error messages, we could often identify the bug before we even opened the code, dramatically reducing our investigation time.

This process evolved into the three Fs of event logging.

Create positive spirals with delightful deployments

Another thread that emerged from the simple act of sitting together during deployments was the realization that the deployment process was nasty. We created an installer file, and the ops team would move it to the target server, double-click it, then follow the prompts to configure the instance.

Having to paste configuration values into the installer was slow and error-prone. We spent a disproportionate amount of time improving this process.

Admittedly, we were solving this one “inside the box” by improving an individual installation with DIY scripts, a can of lubricating spray, and sticky tape. This didn’t improve the experience of repeating the install across several environments and multiple production instances.
However, I did get to experience the stress of deployments when their probability of success was anything less than “very high”. When deployments weren’t a solved problem, they could damage team reputation, erode trust, and reduce autonomy.

Failed deployments are the leading cause of organizations working in larger batches. Large batches are a leading cause of failed deployments. This is politely called a negative spiral, and you have to reverse it urgently if you want to survive.

At last, a panacea

The act of sitting a developer with an ops team member during deployments isn’t going to solve all your problems. As we scaled from 6 to 30 developers, pursued innovative new directions for our product, and repositioned our offering and pricing, new pain kept emerging. Continuous improvement really is a game of whack-a-mole, and there’s no final state.

Despite this, the simple act of sitting together, otherwise known as collaboration, caused a chain reaction of beneficial changes.

Sharing goals and pain

When you’re sitting with someone working on the same problem, all the departmental otherness evaporates. You’re just two humans trying to make things work.

Instead of holding developers accountable for feature throughput and the ops team for stability, we shared a combined goal of high throughput and high stability in software delivery.

That removed the goal conflict and encouraged us to share and solve common problems together. This also works when you repeat the alignment exercise with other areas, like compliance and finance.

Completing the pain loop

The problem with our logging strategy was immediately apparent when one of the people generating the logs had to wade through them. This is a powerful motivator for change.

Identifying unresolved pain paths and closing the pain loop isn’t a form of punishment; it’s a moment of realization. It’s the reason we should all use the software we build: it highlights the unresolved pain paths we’re burdening our users with.

Pain loops are crucial to meaningful improvements in software delivery.

Reducing the toil

Great developers are experts at automating things. When you expose this skill set to repetitive work, a developer’s instinct is to eliminate the toil.

For the ops team, the step-by-step deployment checklist was just part of doing business. They were so familiar with the process that it became invisible.

When we reduced the toil, the ops team was definitely happier, even though we hadn’t solved all the rough edges yet.

Refining the early ideas

The fully-formed ideas didn’t arrive immediately. The rough shapes were polished over time into a set of repeatable and connected DevOps habits.

The three Fs, incident causation principles, alerting strategy, and monitor selection guidelines graduated into deliberate approaches long after this story.

I developed an approach to software delivery improvement that used these ideas to address trust issues between developers and the business. By reducing negative signals caused by failed deployments and escaped bugs, we increased trust in the development team, enhanced their reputation, and increased their autonomy.

We combined these practices with Octopus Deploy for deployment and runbook automation and an observability platform, which meant the team was the first to spot problems rather than users. When there was a problem, it was trivial to fix, and the new version could be rolled out in no time.

Unlike the original organization, where we increased collaboration between teams, we created fully cross-functional teams that worked together all the time. Every skill required to deliver and operate the software was embedded, minimizing dependencies and the risk of silos, tickets, and bureaucracy.

These cross-functional teams also proved to be the best way to level up team members.

Unicorn portals

You can’t work with a database whizz for long before you start thinking about query performance, maintenance plans, and normalization. You build better software when you develop these skills. You can’t work with an infrastructure expert without learning about failovers, networking, and zero-downtime deployments. You build better software when you develop these skills, too.

When people say they can’t hire these highly skilled developers, they miss the crucial point. A team designed in this cross-functional style takes new team members and upgrades them into these impossible-to-find unicorns. You may start as a backend developer, a database administrator, or a test analyst, but you grow into a generalizing specialist with many new skills.

Creating these unicorn portals is the most valuable skill development managers can bring to an organization. You need to hire to fill gaps and foster an environment where skills transfer fluidly throughout the team.

Roll up your chair

What became a sophisticated and repeatable process for team transformation could be traced back to that simple act of sitting together. It was a small, easy change that led to increased empathy and understanding, and then a whole set of improvements.

Staring at that rapid stream of logs was the pivot point that led to the most healthy and human approach to DevOps.

We didn’t have the research to confirm it back then, but deployment automation, shared goals, observability, small batches, and Continuous Delivery are all linked to better outcomes for the people, teams, and organization. Everybody wins when you do DevOps right.

Happy deployments!


JavaScript Obfuscation: Block Code Theft and Reverse Engineering


TL;DR: JavaScript applications distribute as readable source code, exposing intellectual property, API keys, and business logic to anyone with browser dev tools. Professional obfuscation transforms code into unreadable output while preserving functionality. Combined with runtime checks, obfuscation blocks both static and dynamic analysis, protecting the assets that actually matter to your business.

JavaScript’s security problem starts with distribution

JavaScript applications ship as source code. Unlike compiled languages that distribute binaries, every JavaScript file you deploy is readable text. Browser dev tools, debuggers, and basic text editors expose your complete codebase to anyone accessing your application.

This visibility creates direct paths to theft. Attackers extract proprietary algorithms, copy business logic, harvest API endpoints, and steal authentication mechanisms without specialized tools. The barrier to JavaScript reverse engineering is essentially zero.

Consider the scope: GitHub reports JavaScript leads all programming languages in both contributors and repositories. This popularity means millions of applications distribute readable source code daily. The scale magnifies the problem.

Professional obfuscation transforms code structure to block comprehension while preserving functionality. Control flow alterations, string encryption, identifier renaming, and dead code injection make reverse engineering impractical. Combined with runtime checks that detect debuggers and tampering, obfuscation increases attacker effort beyond the value of success.

This guide covers how obfuscation blocks static analysis, why runtime checks stop debugger attacks, what separates real protection from minification, and how to integrate obfuscation without disrupting your build process.

Recent attacks demonstrate the risk

Real-world breaches prove JavaScript vulnerability isn’t theoretical. In 2018, British Airways suffered a breach exposing 380,000 customer records. The attack vector was 22 lines of malicious JavaScript injected into the payment page. The code captured credit card data, names, and addresses by intercepting form submissions before legitimate processing.

MyDashWallet, a cryptocurrency service, ran compromised for over two months in 2019 due to vulnerabilities in an external JavaScript library. Attackers modified the library to redirect funds during transactions. Users lost cryptocurrency because the application trusted unprotected third-party code.

The Magecart group built an entire criminal operation around JavaScript skimming. Their technique inserts malicious code into eCommerce checkout processes, capturing payment information as customers enter it. The attacks work because JavaScript executes in the browser where applications can’t reliably verify code integrity without protection mechanisms.

The PCI Security Standards Council and Retail Hospitality ISAC issued joint warnings about JavaScript-based skimming attacks. The frequency and success rate of these attacks prompted regulatory bodies to address JavaScript security as a compliance concern, not just a best practice recommendation.

Code obfuscation blocks static analysis

JavaScript obfuscation transforms source code structure and content to prevent comprehension. Advanced techniques alter control flow, rename identifiers, encrypt strings, and inject misleading code paths. The result renders decompilation useless because there’s nothing meaningful to decompile.

Control flow obfuscation rewrites logical structures. Sequential operations become non-linear execution paths. Simple if-then statements transform into nested conditionals with opaque logic. This restructuring defeats automated analysis tools that depend on recognizable patterns. An attacker viewing obfuscated code sees execution jumping between seemingly unrelated code blocks with no apparent connection to the original algorithm.

Identifier renaming replaces function names, variable labels, and class identifiers with non-descriptive sequences. A function named validatePayment() becomes _0x2a4b(). Multiply this across thousands of identifiers, and the codebase loses all semantic meaning. Without meaningful names, reverse engineers can’t distinguish authentication logic from error handling or data processing from logging.

String encryption protects sensitive values. API endpoints, authentication tokens, configuration parameters, and error messages get encrypted at rest and decrypted only at runtime. Static analysis tools scanning for “api.example.com” or “Bearer token” find nothing. The encryption keys and decryption routines themselves get obfuscated, creating multiple layers that an attacker must penetrate.

Dead code injection adds misleading logic. Fake functions, unreachable branches, and decoy operations increase the difficulty of manual inspection. Attackers waste time analyzing code paths that never execute. Some injected code appears functional but serves no purpose except consuming analysis resources.

Property access transformation modifies how code references object properties. Direct access like object.property becomes bracket notation with encrypted strings: object[_0x1f9a('0x3b')]. This technique obscures which properties code actually uses and makes automated extraction of object structures nearly impossible.
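To make the combined effect concrete, here is an illustrative before/after on a trivial function. It is hand-written to demonstrate the techniques described above, not actual JSDefender output.

// Before: readable names, plain strings, direct property access
function validatePayment(order) {
  return fetch("https://api.example.com/payments", {
    method: "POST",
    body: JSON.stringify({ amount: order.total }),
  });
}

// After identifier renaming, string encryption, and property access transformation
function _0x2a4b(_0x11f3) {
  return fetch(_0x5c1e(0x0), {
    [_0x5c1e(0x1)]: _0x5c1e(0x2),
    [_0x5c1e(0x3)]: JSON[_0x5c1e(0x4)]({ [_0x5c1e(0x5)]: _0x11f3[_0x5c1e(0x6)] }),
  });
}
// _0x5c1e decrypts entries from an encrypted string table at runtime, so none of
// the literals ("https://api.example.com/payments", "method", "POST", ...) appear in the source.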

Runtime checks defend against dynamic analysis

Static protection isn’t sufficient. Attackers use debuggers, memory inspection, and runtime manipulation to bypass obfuscation. Runtime application self-protection (RASP) detects these tools and responds to active threats.

Debugger detection identifies when dev tools attach to your application. The protection code monitors for breakpoint insertion, step execution, and console access. Multiple detection techniques work in parallel: checking for specific debugger API calls, measuring execution timing to detect step-through debugging, and monitoring for changes to Function.prototype.toString that attackers use to inspect protected code. Detection triggers predetermined responses: session termination, data wiping, or silent failure that denies attackers feedback.
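As a simplified illustration of one such signal, a timing check around a debugger statement looks roughly like the sketch below. Commercial protection layers combine many signals and obfuscate the detection code itself; the threshold and response here are placeholders.

// Illustrative timing-based debugger detection (not a production-grade check)
function looksLikeDebugger() {
  const start = performance.now();
  // With dev tools open, this statement pauses execution until the user resumes,
  // stretching the measured interval far beyond normal execution time.
  debugger;
  return performance.now() - start > 100; // threshold is an assumption, tune per app
}

setInterval(() => {
  if (looksLikeDebugger()) {
    // Predetermined response: silent failure, session termination, data wipe, etc.
    sessionStorage.clear();
  }
}, 5000);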

Tamper detection verifies code integrity during execution. Hash checks confirm that functions haven’t been modified. Signature verification ensures the application matches its original state. The checks run continuously during execution, not just at startup. Detecting tampering allows controlled failure rather than operating with compromised code. Some implementations clear sensitive data from memory and log security events before terminating.

Environment checks identify suspicious execution contexts. The application detects emulators, rooted devices, virtualized environments, and automated testing frameworks. These contexts signal attack reconnaissance. Blocking execution in hostile environments prevents attackers from experimenting freely. The checks examine device characteristics, system APIs, and execution environment properties that differ between legitimate user devices and analysis tools.

Domain locking restricts where code executes. Production JavaScript runs only on authorized domains. Attempts to copy protected code to attacker-controlled servers fail because the code verifies its execution context before running operations. This technique prevents attackers from hosting stolen code in controlled environments where they can analyze it safely.

Function wrapping creates guards around operations. Instead of calling sensitive functions directly, code routes through checking logic that validates execution context, verifies caller identity, and confirms the application state before allowing protected operations to proceed. This approach protects individual functions even if attackers bypass application-level checks.

Minification and uglification don’t provide real protection

Developers often confuse compression techniques with security measures. Minification removes whitespace and shortens variable names to reduce file size. Uglification adds meaningless characters and formatting. Neither technique prevents reverse engineering.

Prettify and similar tools reverse both processes in seconds. Minified code becomes readable again with automatic formatting. Uglified code’s nonsense characters get stripped away, exposing the original logic underneath.

The difference matters. Minification optimizes performance. Obfuscation protects intellectual property. Relying on compression for security is like using a screen door for bank vault protection.

Free tools carry hidden costs

Free online obfuscators present serious risks. Security researcher Peter Gramantik documented a “free” tool that injected malware while obfuscating code. The malicious code collected data and established backdoors. Users assumed their applications gained protection while actually introducing new vulnerabilities.

The economics explains the problem. Professional-grade obfuscation requires substantial engineering investment. Maintaining transformation algorithms, updating bypass defenses, and supporting integration workflows costs money. Truly free tools either lack these capabilities or monetize through methods users don’t see.

Commercial obfuscators from established security vendors provide verifiable protection without hidden risks. The tools undergo security audits, maintain documented transformation techniques, and offer support channels for implementation questions. Paying for protection means knowing exactly what you’re getting.

Protection requirements scale with application value

Not every JavaScript application needs identical protection. A marketing website’s client-side form validation differs from a fintech application’s transaction processing logic. Application hardening scales to match threat profiles.

High-value applications demand layered protection across all code sections. These include applications handling financial transactions, processing healthcare data, implementing proprietary algorithms, or managing authentication systems. The combination of aggressive obfuscation and active runtime checks creates defense in depth.

Lower-risk applications benefit from focused protection. Target code sections rather than the entire codebase. Protect API keys, authentication logic, and business rule implementations while leaving UI code readable. This selective approach balances protection with performance.

JSDefender supports both strategies. Configure full transformation for maximum protection or apply targeted techniques to specific code sections. The tool adapts to your risk profile rather than forcing one-size-fits-all security.

Integration fits existing workflows

Protection tools that disrupt development workflows don’t get used. Effective obfuscation integrates into build pipelines without manual intervention.

JSDefender operates from the command line or configuration files. Specify protection levels, identify priority code sections, and configure runtime checks once. The tool executes automatically during builds, transforming code without developer involvement.

The approach supports continuous integration and continuous deployment. Protected builds deploy automatically after transformation. Developers work with readable source code. Production receives obfuscated output. The separation maintains development velocity while ensuring deployed applications get protected.

Framework support covers modern JavaScript ecosystems. The tool handles TypeScript, webpack bundles, React applications, Node.js servers, and mobile frameworks like React Native. Protection works regardless of development stack.

Your JavaScript code runs in hostile territory

Browser-based applications execute in environments you don’t control. Users can inspect code, modify execution, and extract data without restriction. Mobile applications face similar challenges. Even server-side Node.js applications running in cloud environments need protection against container compromise.

This reality demands proactive security. 

Hoping attackers won’t bother with your application isn’t a strategy. Professional attackers target applications methodically, looking for easy wins. Making your code difficult to analyze reduces attack profitability.

Protection starts with obfuscation.

Transform code structure to block static analysis. Add runtime checks to detect dynamic attacks. Use established tools from security vendors rather than untested free alternatives. The approach works because it increases attacker effort beyond the value of success.

JavaScript’s popularity and distribution model create inherent security challenges. Professional obfuscation provides practical defense without sacrificing functionality or development velocity.

Try JSDefender free to see how code transformation protects intellectual property while maintaining application performance. The evaluation includes full obfuscation capabilities and runtime protection features.


Building Software Organisations Where People Can Thrive


Continuous learning, adaptability, and strong support networks are the foundations for thriving teams, Matthew Card mentioned. Trust is built through consistent, fair leadership and addressing toxic behaviour, bias, and microaggressions early. By fostering growth, psychological safety, and accountability, people-first leadership drives resilience, collaboration, and performance.

By Ben Linders

Security Risks of Vibe Coding


Distributed System Pattern: Leader and Followers in .NET – One Decision Maker, Many Replicas, Fewer Outages


Distributed systems rarely fail because you picked the wrong cloud service. They fail because two nodes believe they are in charge, both act, and both are “correct” from their own perspective. If your domain has any single authority assumption, and most systems do, you need a way to make that authority real.

Leader and Followers is the pattern that turns vague ownership into a concrete contract.

One node is the decision maker for a shard, group, or partition. The rest replicate the leader’s decisions and keep the system available when nodes fail. That is the bargain. The cost is that you must implement leadership as a first-class concept, not an incidental side effect of “whichever instance got there first.”

This post shows a pragmatic .NET implementation using a renewable lease plus a leadership term. You will also see how to fence writes so an old leader cannot corrupt your data after it loses leadership.

The Pattern in One Sentence

Choose one leader per group to serialize decisions, replicate the resulting state to followers, and make leadership transferable without letting two nodes write as leader at the same time.

The Failure You Are Trying to Stop

If you have ever seen any of these, you have already paid for this pattern, just in a worse way:

  • Two schedulers run the same job and you double charge a customer.
  • Two instances process the same command and your inventory goes negative.
  • A node stalls for 30 seconds, then resumes and continues writing as if nothing happened.
  • You “fixed” it with retries and now the bug happens faster.

Leader and Followers prevents these by making leadership explicit and enforceable.

Implementation Goals

A solid implementation needs four properties:

  1. Exclusive leadership per group with a bounded time window.
  2. Fast detection of leadership loss.
  3. A monotonic term that changes on each leadership transition.
  4. Fencing so stale leaders are rejected at the storage boundary.

The lease gives you the bounded time window. The term and fencing prevent the stale leader problem.

Core Interface and the Leader Only Boundary

Start by keeping your public surface area honest. Code that must run only on the leader should not be callable without an explicit leadership check.

public interface ILeadership
{
    ValueTask<bool> IsLeaderAsync(CancellationToken ct);
    ValueTask<long> TermAsync(CancellationToken ct);
}

Now wrap leader-only work behind a small, boring gate:

public sealed class LeaderOnlyService(ILeadership leadership)
{
    private readonly ILeadership _leadership = leadership;

    public async Task RunLeaderTaskAsync(Func<CancellationToken, Task> work, CancellationToken ct)
    {
        if (!await _leadership.IsLeaderAsync(ct)) return;
        await work(ct);
    }
}

This does not solve leadership by itself. It prevents “accidental leadership” from leaking everywhere.

Step 1: Lease Backed Leadership in SQL Server

You can implement leases using Redis, etcd, ZooKeeper, or a database. For many .NET teams, SQL Server is already present and operationally understood. A SQL lease is not glamorous, but it is effective.

Create a table that stores the current lease owner, expiry, and term.

CREATE TABLE dbo.LeadershipLeases
(
    LeaseKey       nvarchar(200) NOT NULL PRIMARY KEY,
    OwnerId        nvarchar(200) NOT NULL,
    Term           bigint        NOT NULL,
    ExpiresAtUtc   datetime2(3)  NOT NULL
);

CREATE INDEX IX_LeadershipLeases_ExpiresAtUtc ON dbo.LeadershipLeases (ExpiresAtUtc);

Define the lease rules:

  • A node can acquire leadership if the lease is expired or already owned by itself.
  • Each successful acquisition increments the term.
  • Renew extends ExpiresAtUtc only if the owner matches and the lease has not expired.

Lease Store Contract

public interface ILeaseStore
{
    Task<(bool Acquired, long Term, DateTimeOffset ExpiresAt)> TryAcquireAsync(
        string leaseKey,
        string ownerId,
        TimeSpan ttl,
        CancellationToken ct);

    Task<(bool Renewed, DateTimeOffset ExpiresAt)> TryRenewAsync(
        string leaseKey,
        string ownerId,
        TimeSpan ttl,
        CancellationToken ct);

    Task<(string OwnerId, long Term, DateTimeOffset ExpiresAt)?> ReadAsync(
        string leaseKey,
        CancellationToken ct);
}

SQL Implementation

This implementation uses a transaction and locking hints to keep acquisition atomic.

using System.Data;
using Microsoft.Data.SqlClient;

public sealed class SqlLeaseStore(string connectionString) : ILeaseStore
{
    public async Task<(bool Acquired, long Term, DateTimeOffset ExpiresAt)> TryAcquireAsync(
        string leaseKey,
        string ownerId,
        TimeSpan ttl,
        CancellationToken ct)
    {
        await using var conn = new SqlConnection(connectionString);
        await conn.OpenAsync(ct);

        await using var tx = (SqlTransaction)await conn.BeginTransactionAsync(IsolationLevel.Serializable, ct);

        var now = DateTimeOffset.UtcNow;
        var expires = now.Add(ttl);

        // Lock the row for the lease key.
        await using var select = conn.CreateCommand();
        select.Transaction = tx;
        select.CommandText = @"
SELECT LeaseKey, OwnerId, Term, ExpiresAtUtc
FROM dbo.LeadershipLeases WITH (UPDLOCK, HOLDLOCK)
WHERE LeaseKey = @k;
";
        select.Parameters.AddWithValue("@k", leaseKey);

        string? currentOwner = null;
        long currentTerm = 0;
        DateTimeOffset currentExpires = DateTimeOffset.MinValue;

        await using (var reader = await select.ExecuteReaderAsync(ct))
        {
            if (await reader.ReadAsync(ct))
            {
                currentOwner = reader.GetString(1);
                currentTerm = reader.GetInt64(2);
                currentExpires = reader.GetDateTimeOffset(3);
            }
        }

        var isExpired = currentOwner is null || currentExpires <= now;
        var isReentrant = currentOwner == ownerId;

        if (currentOwner is null)
        {
            var insertTerm = 1L;
            await using var insert = conn.CreateCommand();
            insert.Transaction = tx;
            insert.CommandText = @"
INSERT INTO dbo.LeadershipLeases(LeaseKey, OwnerId, Term, ExpiresAtUtc)
VALUES (@k, @o, @t, @e);
";
            insert.Parameters.AddWithValue("@k", leaseKey);
            insert.Parameters.AddWithValue("@o", ownerId);
            insert.Parameters.AddWithValue("@t", insertTerm);
            insert.Parameters.AddWithValue("@e", expires);

            await insert.ExecuteNonQueryAsync(ct);
            await tx.CommitAsync(ct);
            return (true, insertTerm, expires);
        }

        if (!isExpired && !isReentrant)
        {
            await tx.RollbackAsync(ct);
            return (false, currentTerm, currentExpires);
        }

        // Acquire by updating owner, bumping term, setting new expiry.
        var newTerm = isReentrant ? currentTerm : currentTerm + 1;

        await using var update = conn.CreateCommand();
        update.Transaction = tx;
        update.CommandText = @"
UPDATE dbo.LeadershipLeases
SET OwnerId = @o,
    Term = @t,
    ExpiresAtUtc = @e
WHERE LeaseKey = @k;
";
        update.Parameters.AddWithValue("@k", leaseKey);
        update.Parameters.AddWithValue("@o", ownerId);
        update.Parameters.AddWithValue("@t", newTerm);
        update.Parameters.AddWithValue("@e", expires);

        await update.ExecuteNonQueryAsync(ct);
        await tx.CommitAsync(ct);

        return (true, newTerm, expires);
    }

    public async Task<(bool Renewed, DateTimeOffset ExpiresAt)> TryRenewAsync(
        string leaseKey,
        string ownerId,
        TimeSpan ttl,
        CancellationToken ct)
    {
        await using var conn = new SqlConnection(connectionString);
        await conn.OpenAsync(ct);

        var now = DateTimeOffset.UtcNow;
        var expires = now.Add(ttl);

        await using var cmd = conn.CreateCommand();
        cmd.CommandText = @"
UPDATE dbo.LeadershipLeases
SET ExpiresAtUtc = @e
WHERE LeaseKey = @k
  AND OwnerId = @o
  AND ExpiresAtUtc > @now;
";
        cmd.Parameters.AddWithValue("@k", leaseKey);
        cmd.Parameters.AddWithValue("@o", ownerId);
        cmd.Parameters.AddWithValue("@e", expires);
        cmd.Parameters.AddWithValue("@now", now);

        var rows = await cmd.ExecuteNonQueryAsync(ct);
        return (rows == 1, expires);
    }

    public async Task<(string OwnerId, long Term, DateTimeOffset ExpiresAt)?> ReadAsync(string leaseKey, CancellationToken ct)
    {
        await using var conn = new SqlConnection(connectionString);
        await conn.OpenAsync(ct);

        await using var cmd = conn.CreateCommand();
        cmd.CommandText = @"
SELECT OwnerId, Term, ExpiresAtUtc
FROM dbo.LeadershipLeases
WHERE LeaseKey = @k;
";
        cmd.Parameters.AddWithValue("@k", leaseKey);

        await using var reader = await cmd.ExecuteReaderAsync(ct);
        if (!await reader.ReadAsync(ct)) return null;

        return (reader.GetString(0), reader.GetInt64(1), reader.GetDateTimeOffset(2));
    }
}

Step 2: A Leadership Coordinator That Renews the Lease

Leadership needs a background loop that acquires and renews the lease and exposes the current status to the rest of the process.

using Microsoft.Extensions.Hosting;
using Microsoft.Extensions.Logging;

public sealed class LeaseLeadership : BackgroundService, ILeadership
{
    private readonly ILeaseStore _store;
    private readonly ILogger<LeaseLeadership> _log;
    private readonly string _leaseKey;
    private readonly string _ownerId;

    private readonly TimeSpan _ttl;
    private readonly TimeSpan _renewEvery;

    private volatile bool _isLeader;
    private long _term; // 'volatile' is not valid on long in C#; use Interlocked if torn reads are a concern

    public LeaseLeadership(
        ILeaseStore store,
        ILogger<LeaseLeadership> log,
        string leaseKey,
        string ownerId,
        TimeSpan? ttl = null)
    {
        _store = store;
        _log = log;
        _leaseKey = leaseKey;
        _ownerId = ownerId;

        _ttl = ttl ?? TimeSpan.FromSeconds(10);
        _renewEvery = TimeSpan.FromMilliseconds(_ttl.TotalMilliseconds / 3);
    }

    public ValueTask<bool> IsLeaderAsync(CancellationToken ct) => ValueTask.FromResult(_isLeader);
    public ValueTask<long> TermAsync(CancellationToken ct) => ValueTask.FromResult(_term);

    protected override async Task ExecuteAsync(CancellationToken stoppingToken)
    {
        while (!stoppingToken.IsCancellationRequested)
        {
            try
            {
                if (!_isLeader)
                {
                    var acquired = await _store.TryAcquireAsync(_leaseKey, _ownerId, _ttl, stoppingToken);
                    _isLeader = acquired.Acquired;
                    _term = acquired.Term;

                    if (_isLeader)
                        _log.LogInformation("Became leader for {LeaseKey} with term {Term}", _leaseKey, _term);
                }
                else
                {
                    var renewed = await _store.TryRenewAsync(_leaseKey, _ownerId, _ttl, stoppingToken);

                    if (!renewed.Renewed)
                    {
                        _isLeader = false;
                        _log.LogWarning("Lost leadership for {LeaseKey}", _leaseKey);
                    }
                }
            }
            catch (Exception ex)
            {
                _isLeader = false;
                _log.LogError(ex, "Leadership loop error for {LeaseKey}", _leaseKey);
            }

            await Task.Delay(_renewEvery, stoppingToken);
        }
    }
}

Important behavior: if renewal fails, the node stops acting as a leader immediately. That is non-negotiable.
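For completeness, wiring the coordinator into a generic host might look like the sketch below. The lease key, owner id format, and connection string name are assumptions; adapt them to your composition root.

using Microsoft.Extensions.Configuration;
using Microsoft.Extensions.DependencyInjection;
using Microsoft.Extensions.Hosting;
using Microsoft.Extensions.Logging;

var builder = Host.CreateApplicationBuilder(args);

// Single lease store shared by the process.
builder.Services.AddSingleton<ILeaseStore>(_ =>
    new SqlLeaseStore(builder.Configuration.GetConnectionString("Leases")!));

// One LeaseLeadership instance, exposed both as ILeadership and as the background loop.
builder.Services.AddSingleton(sp => new LeaseLeadership(
    sp.GetRequiredService<ILeaseStore>(),
    sp.GetRequiredService<ILogger<LeaseLeadership>>(),
    leaseKey: "billing-reconcile-leader",
    ownerId: $"{Environment.MachineName}-{Guid.NewGuid():N}"));

builder.Services.AddSingleton<ILeadership>(sp => sp.GetRequiredService<LeaseLeadership>());
builder.Services.AddHostedService(sp => sp.GetRequiredService<LeaseLeadership>());

await builder.Build().RunAsync();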

Step 3: Fencing Writes With the Term

A renewable lease is necessary but not sufficient. A node can pause, GC can stall, networking can get weird, and you can end up with a stale leader still connected to your database. The lease can expire and a new leader can be elected while the old leader remains alive and unaware.

The fix is fencing: every leader term is a token. Every write from the leader includes it. Storage rejects writes with a stale term.

Example: Fenced Stream Writes in SQL Server

Create a table for writes that includes the term, and enforce monotonic terms per stream or per aggregate.

CREATE TABLE dbo.StreamWrites
(
    StreamId     nvarchar(200) NOT NULL,
    SequenceNo   bigint        NOT NULL,
    Term         bigint        NOT NULL,
    Payload      varbinary(max) NOT NULL,
    CreatedAtUtc datetime2(3)  NOT NULL,
    CONSTRAINT PK_StreamWrites PRIMARY KEY(StreamId, SequenceNo)
);

CREATE INDEX IX_StreamWrites_StreamId_Term ON dbo.StreamWrites(StreamId, Term);

Now a fenced writer that rejects stale terms:

using Microsoft.Data.SqlClient;

public sealed class FencedStreamWriter(string connectionString)
{
    public async Task AppendAsync(string streamId, long sequenceNo, long term, byte[] payload, CancellationToken ct)
    {
        await using var conn = new SqlConnection(connectionString);
        await conn.OpenAsync(ct);

        await using var tx = (SqlTransaction)await conn.BeginTransactionAsync(ct);

        // Reject if the stream has any write with a greater term.
        await using var check = conn.CreateCommand();
        check.Transaction = tx;
        check.CommandText = @"
DECLARE @maxTerm BIGINT =
(
    SELECT ISNULL(MAX(Term), 0)
    FROM dbo.StreamWrites WITH (UPDLOCK, HOLDLOCK)
    WHERE StreamId = @s
);

IF (@maxTerm > @t)
    THROW 50001, 'Stale leader term', 1;
";
        check.Parameters.AddWithValue("@s", streamId);
        check.Parameters.AddWithValue("@t", term);
        await check.ExecuteNonQueryAsync(ct);

        await using var insert = conn.CreateCommand();
        insert.Transaction = tx;
        insert.CommandText = @"
INSERT INTO dbo.StreamWrites(StreamId, SequenceNo, Term, Payload, CreatedAtUtc)
VALUES (@s, @n, @t, @p, SYSUTCDATETIME());
";
        insert.Parameters.AddWithValue("@s", streamId);
        insert.Parameters.AddWithValue("@n", sequenceNo);
        insert.Parameters.AddWithValue("@t", term);
        insert.Parameters.AddWithValue("@p", payload);

        await insert.ExecuteNonQueryAsync(ct);
        await tx.CommitAsync(ct);
    }
}

This is the storage level refusal that prevents corruption. If you do not fence, you are trusting timing and luck.

Putting It Together: A Leader Only Job Runner

Here is a practical use case: one scheduled job should run per cluster, and it must not run twice.

public sealed class BillingReconciler(LeaderOnlyService leaderOnly, ILeadership leadership, FencedStreamWriter writer)
{
    private readonly LeaderOnlyService _leaderOnly = leaderOnly;
    private readonly ILeadership _leadership = leadership;
    private readonly FencedStreamWriter _writer = writer;

    public Task RunAsync(CancellationToken ct) =>
        _leaderOnly.RunLeaderTaskAsync(async innerCt =>
        {
            var term = await _leadership.TermAsync(innerCt);

            // Example decision: emit a reconciliation command into a stream
            var streamId = "billing-reconcile";
            var nextSeq = DateTimeOffset.UtcNow.ToUnixTimeSeconds(); // placeholder, use real sequencing in production
            var payload = System.Text.Encoding.UTF8.GetBytes($"reconcile:{DateTimeOffset.UtcNow:O}");

            await _writer.AppendAsync(streamId, nextSeq, term, payload, innerCt);
        }, ct);
}

If a stale leader tries to write, the fenced writer throws, and the damage stops there.

Replicating State to Followers

Leader and Followers does not require a specific replication mechanism. It requires that the leader’s decisions can be observed and applied by followers.

Here are three pragmatic replication models in .NET, in increasing sophistication:

Model A: Shared durable state, followers read it

Leader writes to a database. Followers serve reads from the same database or compute projections from it. Replication is “free” because the database is the replication point.

This is often the right starting point. It is also a reminder: your database is part of the distributed system, whether you want it to be or not.

Model B: Leader appends to a log, followers tail the log

Leader writes decisions into an append only log table. Followers poll for new rows and apply them to a local projection.

public sealed class FollowerApplier(string connectionString) : BackgroundService
{
    private long _lastSeq;

    protected override async Task ExecuteAsync(CancellationToken stoppingToken)
    {
        while (!stoppingToken.IsCancellationRequested)
        {
            var batch = await ReadNextBatchAsync(_lastSeq, stoppingToken);

            foreach (var row in batch)
            {
                Apply(row);
                _lastSeq = row.SequenceNo;
            }

            await Task.Delay(TimeSpan.FromMilliseconds(200), stoppingToken);
        }
    }

    private async Task<IReadOnlyList<(string StreamId, long SequenceNo, long Term, byte[] Payload)>> ReadNextBatchAsync(
        long afterSeq,
        CancellationToken ct)
    {
        var results = new List<(string, long, long, byte[])>();

        await using var conn = new Microsoft.Data.SqlClient.SqlConnection(connectionString);
        await conn.OpenAsync(ct);

        await using var cmd = conn.CreateCommand();
        cmd.CommandText = @"
SELECT TOP (100) StreamId, SequenceNo, Term, Payload
FROM dbo.StreamWrites
WHERE SequenceNo > @n
ORDER BY SequenceNo;
";
        cmd.Parameters.AddWithValue("@n", afterSeq);

        await using var reader = await cmd.ExecuteReaderAsync(ct);
        while (await reader.ReadAsync(ct))
        {
            results.Add((
                reader.GetString(0),
                reader.GetInt64(1),
                reader.GetInt64(2),
                (byte[])reader["Payload"]
            ));
        }

        return results;
    }

    private static void Apply((string StreamId, long SequenceNo, long Term, byte[] Payload) row)
    {
        // Apply to a local projection, cache, or in-memory state machine.
        // Keep it deterministic and idempotent.
    }
}

Model C: Leader publishes events, followers subscribe

This is the same idea as Model B but moved to a message broker. It works when ordering per partition is guaranteed and consumers track offsets.

If you already use Azure Service Bus, Kafka, or similar, this model can scale well.

Reads and Correctness

Followers can be used for reads, but only if you are honest about staleness.

Common approach:

  • Writes go to the leader.
  • Reads go to followers if their replication lag is within a threshold.
  • Otherwise, route reads to the leader for correctness.

This is the point where teams get sloppy. “Eventually consistent” is not a license to guess.

Testing: Prove You Do Not Get Two Leaders

You can integration test this without heroic infrastructure.

  1. Start two instances with the same lease key and different owner ids.
  2. Verify only one becomes leader.
  3. Pause the leader process long enough to miss renewals.
  4. Verify the follower becomes leader with a higher term.
  5. Resume the old leader and attempt a fenced write.
  6. Verify the write is rejected as stale.

If your tests do not include stale leader rejection, you do not have a safety story.
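A minimal sketch of steps 1 and 2 is shown below, using an in-memory lease store that mirrors the acquire/renew rules of SqlLeaseStore. xUnit is assumed, and the timings and names are illustrative.

using Microsoft.Extensions.Logging.Abstractions;
using Xunit;

// In-memory ILeaseStore for tests only; it applies the same acquire/renew rules as SqlLeaseStore.
public sealed class InMemoryLeaseStore : ILeaseStore
{
    private readonly object _gate = new();
    private (string OwnerId, long Term, DateTimeOffset ExpiresAt)? _lease;

    public Task<(bool Acquired, long Term, DateTimeOffset ExpiresAt)> TryAcquireAsync(
        string leaseKey, string ownerId, TimeSpan ttl, CancellationToken ct)
    {
        lock (_gate)
        {
            var now = DateTimeOffset.UtcNow;
            var expires = now.Add(ttl);

            if (_lease is not { } current)
            {
                _lease = (ownerId, 1, expires);
                return Task.FromResult((true, 1L, expires));
            }

            var isExpired = current.ExpiresAt <= now;
            var isReentrant = current.OwnerId == ownerId;

            if (!isExpired && !isReentrant)
                return Task.FromResult((false, current.Term, current.ExpiresAt));

            var newTerm = isReentrant ? current.Term : current.Term + 1;
            _lease = (ownerId, newTerm, expires);
            return Task.FromResult((true, newTerm, expires));
        }
    }

    public Task<(bool Renewed, DateTimeOffset ExpiresAt)> TryRenewAsync(
        string leaseKey, string ownerId, TimeSpan ttl, CancellationToken ct)
    {
        lock (_gate)
        {
            var now = DateTimeOffset.UtcNow;
            if (_lease is { } current && current.OwnerId == ownerId && current.ExpiresAt > now)
            {
                var expires = now.Add(ttl);
                _lease = (ownerId, current.Term, expires);
                return Task.FromResult((true, expires));
            }
            return Task.FromResult((false, DateTimeOffset.MinValue));
        }
    }

    public Task<(string OwnerId, long Term, DateTimeOffset ExpiresAt)?> ReadAsync(string leaseKey, CancellationToken ct)
    {
        lock (_gate) return Task.FromResult(_lease);
    }
}

public sealed class LeadershipTests
{
    [Fact]
    public async Task Only_one_instance_becomes_leader_for_the_same_lease()
    {
        var store = new InMemoryLeaseStore();
        var ttl = TimeSpan.FromMilliseconds(500);

        var nodeA = new LeaseLeadership(store, NullLogger<LeaseLeadership>.Instance, "orders-leader", "node-a", ttl);
        var nodeB = new LeaseLeadership(store, NullLogger<LeaseLeadership>.Instance, "orders-leader", "node-b", ttl);

        await nodeA.StartAsync(CancellationToken.None);
        await nodeB.StartAsync(CancellationToken.None);

        // Give both loops time to attempt acquisition and a few renewals.
        await Task.Delay(TimeSpan.FromSeconds(2));

        var leaders = 0;
        if (await nodeA.IsLeaderAsync(CancellationToken.None)) leaders++;
        if (await nodeB.IsLeaderAsync(CancellationToken.None)) leaders++;

        Assert.Equal(1, leaders);

        await nodeA.StopAsync(CancellationToken.None);
        await nodeB.StopAsync(CancellationToken.None);
    }
}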

Operational Checklist

Metrics you want on a dashboard:

  • IsLeader boolean per instance
  • Current term per instance
  • Lease renew latency and failures
  • Leadership flaps per hour
  • Fenced write rejections per hour
  • Follower lag if you implement tailing replication

Alerts worth paging on:

  • Leadership flapping
  • Lease renew failures across many nodes
  • Any sustained rate of stale leader write rejections

Those rejections are not “noise.” They are proof that you avoided corruption today.

Closing

Leader and Followers is not a fancy pattern. It is a statement of responsibility. One node decides. Everyone else follows. Leadership changes are explicit, bounded, and fenced. If you build a system that assumes a single decision maker but never implements one, you are not simplifying the architecture. You are outsourcing correctness to chance.

If you want to extend this post next, the natural follow on is Fencing Tokens and Generation Clock as a deeper treatment of terms, plus a replication chapter that builds a small replicated log in .NET with a committed index and follower acknowledgements.

The post Distributed System Pattern: Leader and Followers in .NET – One Decision Maker, Many Replicas, Fewer Outages first appeared on Chris Woody Woodruff | Fractional Architect.
