3 Tricks to Help You Stop Procrastinating

You have a lot on your plate. But rather than get any of it done, you seek out distractions. If you find yourself procrastinating at work, this post has three tips to help you break this pattern.

Procrastination isn’t usually an issue of laziness or a lack of time-management skills. If we’re talking about chronic procrastination, psychologists suggest it’s an issue having more to do with self-regulation. It goes like this:

  • You have an important task to do, be it big or small.
  • You know that if you don’t do the task or get it done on time, it’s going to create a problem for you (and probably others as well).
  • Yet, you willingly delay the task, knowing full well the consequences.
  • The distractions you seek out feel good, but only temporarily.

In this post, we’re going to look at some of the reasons why people procrastinate and a few tricks you can use to push past it.

Why Do We Procrastinate?

According to Psychology Today:

“Everyone puts things off sometimes, but procrastinators chronically avoid difficult tasks and may deliberately look for distractions. Procrastination tends to reflect a person’s struggles with self-control. For habitual procrastinators, who represent approximately 20 percent of the population, ‘I don’t feel like it’ comes to take precedence over their goals or responsibilities, setting them on a downward spiral of negative emotions that further deters future effort.”

But why exactly do procrastinators not feel like it? There are a number of reasons.

For some, it’s the pressure to be perfect and the fear of failing to live up to that standard that keeps them from getting started.

For others, it’s because they perceive the task as being unenjoyable. So, they seek out something that will bring them joy, even temporarily.

Here are some other reasons why people may procrastinate:

  • The task seems too big to handle.
  • They see little or no reward in doing it.
  • They’re confused about how to do the task.
  • They feel overstimulated.
  • They feel fatigued.

There are some mental health practitioners who suggest that there’s sometimes something else at play.

Dr. Alice Boyes, for instance, says that “batchers” are often mistaken for procrastinators. Batchers are people who prefer to complete a set of tasks in a way that maximizes their productivity.

There are seven types. These are the ones most relevant to designers and developers:

  1. The time-based batcher waits to do certain tasks at specific times of the day instead of the second they hit their plate.
  2. The volume-based batcher waits until they’ve accumulated enough tasks and then cranks them out all at once.
  3. The pressure-based batcher waits until they’re closer to the delivery date (just not too close to miss it).
  4. The context-based batcher waits until their physical environment is ideal (like the kids going to bed at night).
  5. The identity-based batcher waits until a pre-determined time when they work in that capacity (like doing onboarding only on Mondays, wireframes on Tuesday, etc.)

For some procrastinators, it’s not about being irresponsible and delaying a task that needs to get done. It’s that they have a preferred work method that only resembles procrastination.

3 Tricks to Help Prevent Procrastination

Procrastination can feel good in the moment, though many people realize deep down inside the consequences won’t feel very good. Here are some of the consequences that can result from procrastination:

  • You wait until the last minute, forcing other critical tasks to go on the backburner.
  • You rush through the task, delivering it with errors, bugs, inconsistencies or other quality issues.
  • You feel stressed and overwhelmed, which leaves you in a heightened state of aggravation the rest of the day.
  • You miss your deadline. Your boss or client is displeased with you, which may keep you from better opportunities or advancements down the road.
  • You regularly procrastinate, which inevitably leads to sleepless nights, health issues and burnout.

If you’re worried you’re headed down this path, here are some tricks you can use to stop procrastinating:

1. Use a Task Management Tool

There are a couple of issues that can be resolved by using a project-management tool to schedule your tasks.

Let’s say your boss calls you up and tells you they need a landing page built by Friday for a new Facebook ad campaign. You’ve got it on your mind all week, but you keep dragging your feet. You hate building landing pages and would rather focus on maintaining and updating their website.

There’s a big difference between knowing you have a task to do versus seeing it on a timeline or task list in front of you. That said, you might still feel a sense of pressure whenever you see this looming task.

What may help is having a tool that allows you to create an actionable and fully editable plan for the day, week and month ahead.

My suggestion is to find a scheduler that:

Allows You to Plan Your Days Down to the Hour

Instead of just adding a three-hour task to build the landing page, you can set aside specific hours when you know you will be ready and able to get it done.

Everyone’s most productive hours are different. If you haven’t found yours yet, spend some time looking into it so you can schedule different kinds of tasks when you’re mentally and energetically up to the challenge.

Comes with Drag-and-Drop Capabilities

If you’re not feeling up to a certain task but it’s up next on your schedule, simply drag it to a new time slot where you can reasonably tackle it.

This is why I love calendar-based time management tools. When you can see the whole week or even month ahead, and your deadlines are clearly marked, you can shift things around to suit how you’re feeling in that moment.

Enables You to Build in Free Time or Buffers

If you’re filling your schedule to the brim every day with no wiggle room, it’s going to make any level of procrastination worse. So, give yourself some breathing room.

For instance, I give myself a two-hour break in the middle of every work day. I don’t have to use it all. But just having it on the calendar gives me the grace to work when I’m up to the task instead of wasting my time on social media, Reddit, etc.

Allows You to Check Off Tasks as You Finish Them

The physical (or digital) act of checking an item off a task list releases a hit of dopamine.

One of the reasons why procrastinators seek out distractions is to activate their pleasure center. By setting up your task manager to create a similar sensation (and one that comes with rewards in the end instead of consequences), it may become addictive in a positive way.

2. Make the Task Smaller

A lot of times, it’s the size of the task that intimidates people and leads them to procrastinate. For example, let’s say you’re building a design system for a new app you’re working on. You’re dreading the task because of how long or complex it’s been in the past. You have a six-hour block on your calendar to get it done and you keep pushing it back.

10:00 a.m. - 4:00 p.m.: CREATE DESIGN SYSTEM FOR CLIENT A

So, how about this?

Look at your deadline. Do you have some time before it needs to be done? Great. Then rather than set aside six hours (or however long you think it’ll take), create a 15-minute task for your next free moment:

10:00 a.m. - 10:15 a.m.: Duplicate design system for Client X and save to Client A folder

Create a copy of the design system from the previous job, and save it in the project folder you’re currently working on. While you’re in there, update the basic client details so you don’t have to worry about it later.

Not ready to do more right now? That’s fine. Add a new 30-minute task to your schedule when you have the time, energy or focus:

3:30 p.m. - 4:00 p.m.: Swap out colors in design system for Client A’s

You can do this with the remainder of the steps required to finish the overarching task.

For a lot of procrastinators, this approach can make difficult or time-consuming tasks feel more manageable. So long as you keep an eye on that deadline, you can make these small, incremental steps toward completing the whole task over time instead of all at once.

3. Cut Down on Your Decision-making

There’s a UX Law called Choice Overload. It states that:

“Overchoice or choice overload is the paradoxical phenomenon that choosing between a large variety of options can be detrimental to decision making processes.”

We see this in UX design all the time. When you give users far too many choices to make or too many options to choose from, some of them just decide it’s best to make no choice at all.

How does this play into procrastination?

Let’s say you have four web development projects you’re working on this month. They’re all at varying stages. You look at the calendar for today and see the following tasks:

  • 1-hour kickoff call with Client B
  • 30-minute weekly check-in with team
  • 3 separate 30-minute user testing sessions to moderate for Client A
  • 2 hours of market research for Client C
  • 3 hours of user persona development for Client D
  • 32 unread emails
  • 11 unread Slack messages

The first three you have to do. The problem is, they’re scattered haphazardly throughout the day. So, trying to get the market research and user persona work done in one single stretch is going to be hard. You tell yourself you’d much rather do that work than check your messages, but you just can’t get started.

Those unread messages are weighing on you. You know that checking them would be the quickest thing to do, and it wouldn’t be a big deal if the calls or user testing sessions interrupted you. However, you know they might add more work (and possibly stress) to your plate.

So, what do you do?

The more brainpower you expend on “What should I do next?” or “How do I avoid this task I’m dreading,” the more energy you’re sapping away from work you need to do. The best thing is to reduce the number of decisions you have to make.

When it comes to managing tasks, you can do this by having dedicated hours for when you do certain things, like the time-based batcher method mentioned above.

For example, you might hold space on your calendar every day from 8:00 to 8:30 a.m. and again from 4:30 to 5:00 p.m. to check messages. By doing this, the 32 emails and 11 Slack messages are no longer something you have to contend with when figuring out what to do next.

Another thing you could do is set rules for when you can be scheduled and for what kinds of tasks. For instance, you could have dedicated days for meetings and calls. What’s more, you could restrict those calls to a set timeframe, like between 9:00 a.m. and 12:00 p.m. This way, your calls wouldn’t be spread out all over the place, making it challenging to get larger tasks done.

Wrapping Up

We procrastinate because we anticipate some sort of discomfort or displeasure at performing a task. It could be that we believe the task will be too hard, that we won’t be able to do a good job or that it’ll bore our brains out.

Some people turn toward distractions that temporarily pause those feelings that have arisen. The only problem is that the joy and relief that come from those distractions are not long-lasting. What’s more, procrastination can exacerbate the consequences of not doing the task when you had initially planned to.

Rather than get stuck with this kind of habit whenever you feel the urge to not do something, train yourself to develop new habits. Schedule all your tasks, but allow yourself the flexibility to move things around as needed. Break up bigger tasks into smaller steps to reduce overwhelm. And come up with rules so you’re not having to expend so much mental energy on what to work on and when.

How to match names in C# without exact string comparisons

Duplicate records are a costly problem in any system that stores people’s names. “Jon Smith” vs. “John Smith.” “Liz” vs. “Elizabeth.” “Renée” vs. “Renee.” To a human, each pair is obviously the same person. To an exact string comparison, each pair is two different records. The fix is to stop asking “do these names match?” and start asking “how confident am I that they refer to the same person?”

This article walks through a practical C# pipeline to do exactly that: normalize input, resolve nicknames, narrow candidates with blocking and phonetics, score similarity with Jaro-Winkler, and apply thresholds to decide what to accept, review, or reject.

A brief introduction

If you’ve worked with real-world data for any length of time, you’ve probably run into this problem: the same person shows up multiple times in your system… but with slightly different names.

Maybe it’s “Jon Smith” vs. “John Smith.” Maybe it’s “Liz” vs. “Elizabeth.” Or maybe it’s something more subtle, like punctuation, casing, or a missing accent mark.

These differences, albeit minor, do add up over time – leading to duplicate records, missed matches, and a growing amount of cleanup work. It’s all normal data, but it’s entered inconsistently.

Names naturally vary depending on how they’re entered, stored, or translated across systems, and typos and inconsistent formatting add further variation. Unfortunately, exact string comparisons just aren’t flexible enough to deal with these issues.

The trick is to change your thinking from being all about exact matches, to how confident you are that, for example, two records do indeed refer to the same person. In this article, I’ll explore that shift from exact-matching to confidence-based matching in more detail.

To do so, I’ll demonstrate a practical, multi-stage approach you can implement in your own systems. The idea is to combine a few simple techniques into a pipeline that’s both effective and scalable:

  • Normalize and standardize the input

  • Resolve known variations (like nicknames)

  • Narrow down candidates efficiently using blocking and phonetics

  • Score similarities using a weighted model

  • Apply thresholds to decide what to match, review, or reject
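The last step in the list, applying thresholds, can be sketched as a small classifier. This is only an illustration: the MatchDecision enum, the MatchClassifier class, and the 0.92 and 0.80 cut-offs are hypothetical placeholders rather than values from this article, and you would tune them against your own data.

```csharp
using System;

public enum MatchDecision { Reject, Review, Accept }

public static class MatchClassifier
{
    // Hypothetical cut-offs; tune them against labeled samples of your own data.
    public const double AcceptThreshold = 0.92;
    public const double ReviewThreshold = 0.80;

    // Maps a confidence score in the range [0, 1] to one of three outcomes.
    public static MatchDecision Classify(double score)
    {
        if (double.IsNaN(score) || score < 0.0 || score > 1.0)
            throw new ArgumentOutOfRangeException(nameof(score));

        if (score >= AcceptThreshold) return MatchDecision.Accept;
        if (score >= ReviewThreshold) return MatchDecision.Review;
        return MatchDecision.Reject;
    }
}
```

With these placeholder values, a score of 0.95 is accepted automatically, 0.85 is queued for manual review, and 0.50 is rejected.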

By the end, you’ll have a clear framework you can adapt for your own use case. Hopefully, you’ll also learn a thing or two along the way. Let’s start with some basic cleanup, and the Normalize function.

The Normalize function

To get started, remove noise such as punctuation, honorifics, and generational markers, and convert everything to a common case.

How you do this is up to you (consistency is the key), but in practice you should use culture-invariant case-folding (for example, ToUpperInvariant()), so that comparisons behave the same regardless of server locale.

You may also have diacritic marks to consider. Chloë / Zoë, Renée / René, André, José, etc. Any of these names could easily have their diacritic marks removed in one source but not another and still represent the same name. To remove these marks, we can use a function like this:

// Requires: using System.Globalization; using System.Text;
public static string RemoveDiacritics(string text)
    {
        if (string.IsNullOrWhiteSpace(text))
            return text;

        var normalized = text.Normalize(NormalizationForm.FormD);
        var builder = new StringBuilder(normalized.Length);

        foreach (var c in normalized)
        {
            if (CharUnicodeInfo.GetUnicodeCategory(c) 
               !=UnicodeCategory.NonSpacingMark)
            {
                builder.Append(c);
            }
        }

        return builder.ToString().Normalize(NormalizationForm.FormC);
    }

The magic happens when we initialize the normalized variable. FormD splits a composed Unicode character into two parts: the base letter and the accent mark. The loop then rebuilds the string, explicitly skipping the accent marks. The last thing we do is convert back to FormC.

While builder.ToString() and the return value may look alike, they are not the same at the binary level. Converting back to FormC is critical.
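To see the decomposition for yourself, the following sketch lists the code units of a string after FormD normalization. The NormalizationDemo class and DescribeDecomposed method are hypothetical helper names used only for this illustration.

```csharp
using System;
using System.Linq;
using System.Text;

public static class NormalizationDemo
{
    // Lists each UTF-16 code unit of the FormD (decomposed) string as U+XXXX.
    public static string DescribeDecomposed(string text)
    {
        var decomposed = text.Normalize(NormalizationForm.FormD);
        return string.Join(" ", decomposed.Select(c => $"U+{(int)c:X4}"));
    }
}
```

Calling NormalizationDemo.DescribeDecomposed("\u00E9") (a single composed é) returns "U+0065 U+0301": the base letter e followed by the combining acute accent, which is the non-spacing mark that RemoveDiacritics skips.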

We can add this to a complete Normalize function:

private static readonly HashSet<string> PersonNoiseWords = 
         new HashSet<string>(StringComparer.OrdinalIgnoreCase)
    {
        // 1. Prefixes (Titles)
        "MR", "MRS", "MS", "MISS", "DR", "PROF", "REV", "FR", "SR", "SRA",
        "SIR", "MADAM", "HON", "CAPT", "MAJ", "COL", "GEN", "LT", 
        
        // 2. Suffixes (Generational & Professional)
        "JR", "SR", "II", "III", "IV", "ESQ", "PHD", "MD", "DDS", "CPA"
    };

    public static string NormalizePersonName(string rawName)
    {
        if (string.IsNullOrWhiteSpace(rawName))
            return string.Empty;

        // Step 1: Remove Diacritics
        var text = RemoveDiacritics(rawName);

        // Step 2: Uppercase for comparison safety
        text = text.ToUpperInvariant();

        // Step 3: Punctuation Cleanup
        // We want to transform "O'Connor" -> "OCONNOR" 
        // but "Mary-Jane" -> "MARY JANE"
        var sb = new StringBuilder(text.Length);
        char lastChar = ' ';

        foreach (char c in text)
        {
            if (char.IsLetterOrDigit(c))
            {
                sb.Append(c);
                lastChar = c;
            }
            else if (c == '\'' || c == '\u2019')
            {
                // Eat straight and typographic apostrophes (O'Neil -> ONEIL)
                continue;
            }
            else
            {
                // Turn all other punctuation (hyphens, periods, commas)  
                // into a single space
                if (lastChar != ' ')
                {
                    sb.Append(' ');
                    lastChar = ' ';
                }
            }
        }

        // Step 4: Tokenize & Filter Noise
        string cleanedString = sb.ToString();
        if (string.IsNullOrWhiteSpace(cleanedString)) return string.Empty;

        var tokens = cleanedString.Split(' ', 
            StringSplitOptions.RemoveEmptyEntries);
        var finalTokens = new List<string>(tokens.Length);

        foreach (var token in tokens)
        {
            // If token is a known noise word -> Skip it
            if (PersonNoiseWords.Contains(token))
                continue;

            // OTHERWISE -> Keep it
            finalTokens.Add(token);
        }

        return string.Join(" ", finalTokens);
    }

If you feel like you might see some churn in the noise words, feel free to store the list in the database so that you can update that logic without having to recompile and deploy code.

Database considerations

Normalization solves a lot of problems for our name comparisons, but we still have some problems to work through. Normalization will not help us with tracking nicknames, so we’ll need a lookup table here. Nickname relationships look many-to-many at first, but they are often transitive as well: if CHUCK and CHARLIE both map to CHARLES, all three belong together.

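For this example, we’ll use two tables to follow a Name Group Model: NameGroup holds one row per group of related names, and NameGroupMapping holds each name variant along with a flag marking the canonical one. The tables can be rather simple.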
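As a rough sketch of the shape of these two tables, here are matching entity classes. The column names are inferred from the queries and seed data in this section; the NameGroupMappingId surrogate key, the Mappings navigation property, and the property types are assumptions, not the article's actual schema.

```csharp
using System.Collections.Generic;

// Hypothetical entity for dbo.NameGroup: one row per group of related names.
public class NameGroup
{
    public int NameGroupId { get; set; }           // identity key
    public string Description { get; set; } = "";  // e.g. "CHARLES"

    // Assumed navigation property: every variant that belongs to this group.
    public List<NameGroupMapping> Mappings { get; set; } = new List<NameGroupMapping>();
}

// Hypothetical entity for dbo.NameGroupMapping: one row per name variant.
public class NameGroupMapping
{
    public int NameGroupMappingId { get; set; }    // assumed surrogate key
    public int NameGroupId { get; set; }           // points at NameGroup
    public string NameText { get; set; } = "";     // stored UPPERCASE, e.g. "CHUCK"
    public bool IsCanonical { get; set; }          // true for exactly one row per group
}
```

Grouping variants under a single NameGroupId is what handles the transitive case: every member of the group resolves to the one row flagged IsCanonical.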

We can get a catalog of configured names with queries like this:

SELECT description,
       nametext,
       iscanonical
FROM   namegroup
       INNER JOIN namegroupmapping
               ON namegroup.namegroupid = namegroupmapping.namegroupid
WHERE  namegroup.namegroupid = 1

And you can search for variations on a target name and easily find the canonical name with a query like this:

SELECT NickName.NameText
FROM   namegroupmapping InputName
       JOIN namegroupmapping NickName
         ON InputName.namegroupid = NickName.namegroupid
WHERE  InputName.nametext = 'CHUCK' -- Your input
       AND NickName.IsCanonical = 1; -- Only the canonical row of the group

The NameGroup and NameGroupMapping tables only need to be populated with known nicknames that you want to track. As you discover new nicknames (or similar variations), you can easily update these tables. Maintenance becomes a simple matter of inserting new records with no code changes needed.

If the NameGroupMapping table is properly configured, the canonical-name query above should return CHARLES for the input CHUCK. We can populate the tables with seed data such as this:

-- Seed data: NameGroup / NameGroupMapping for CHARLES
-- Assumptions:
-- 1) NameText is stored in UPPERCASE to match normalization
-- 2) NameGroupId is an IDENTITY in NameGroup
 
DECLARE @NameGroupId INT;
 
INSERT INTO dbo.NameGroup (Description)
VALUES ('CHARLES');
 
SET @NameGroupId = SCOPE_IDENTITY();
 
INSERT INTO dbo.NameGroupMapping (NameGroupId, NameText, IsCanonical)
VALUES
    (@NameGroupId, 'CHARLES', 1),
    (@NameGroupId, 'CHUCK', 0),
    (@NameGroupId, 'CHARLIE', 0),
    (@NameGroupId, 'CHAS', 0),
    (@NameGroupId, 'CHAZ', 0),
    (@NameGroupId, 'CHIP', 0),
    (@NameGroupId, 'CHUCKY', 0),
    (@NameGroupId, 'CARL', 0);

Then, we can tie it all together with EF access like this:

/// <summary>
    /// Resolves a nickname to its canonical form.
    /// </summary>
    /// <param name="normalizedName">Expects UPPERCASE, 
    /// TRIMMED input (e.g. "CHUCK").</param>
    /// <returns>The canonical name in UPPERCASE for matching 
    /// (e.g. "CHARLES"). Format for display (Title Case) 
    /// at the UI boundary.</returns>
    public async Task<string> GetCanonicalNameAsync(string normalizedName)
    {
        if (string.IsNullOrWhiteSpace(normalizedName)) return normalizedName;

        var canonicalName = await _context.NameGroupMappings
            // 1. Filter: Match the already normalized input
            .Where(input => input.NameText == normalizedName)

            // 2. Join: Connect to siblings in the same group
            .Join(_context.NameGroupMappings,
                  input => input.NameGroupId,
                  nickName => nickName.NameGroupId,
                  (input, nickName) => nickName)

            // 3. Filter: Find the Canonical version
            .Where(nickName => nickName.IsCanonical)

            // 4. Select: Grab the name string
            .Select(nickName => nickName.NameText)

            // 5. Execute
            .FirstOrDefaultAsync();

        if (string.IsNullOrEmpty(canonicalName))
        {
            // Fallback: If no match found, treat the input as the name.
            return normalizedName;
        }

        // Success: Return the resolved canonical name.
        return canonicalName;
    }
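For unit tests, or to see the fallback behavior without a database, the same lookup can be sketched in memory. InMemoryNicknameResolver is a hypothetical helper, and its dictionary holds only a hand-copied subset of the CHARLES seed data shown earlier.

```csharp
using System;
using System.Collections.Generic;

public static class InMemoryNicknameResolver
{
    // Hypothetical in-memory copy of part of the seed data: variant -> canonical.
    private static readonly Dictionary<string, string> CanonicalByVariant =
        new Dictionary<string, string>(StringComparer.Ordinal)
        {
            ["CHARLES"] = "CHARLES",
            ["CHUCK"] = "CHARLES",
            ["CHARLIE"] = "CHARLES",
            ["CHAS"] = "CHARLES"
        };

    // Expects already-normalized (UPPERCASE, trimmed) input, like the EF version.
    public static string Resolve(string normalizedName)
    {
        if (string.IsNullOrWhiteSpace(normalizedName)) return normalizedName;

        // Fall back to the input itself when the name is not in any group.
        return CanonicalByVariant.TryGetValue(normalizedName, out var canonical)
            ? canonical
            : normalizedName;
    }
}
```

Resolve("CHUCK") returns "CHARLES", while an unknown name such as "ALVIN" is returned unchanged, mirroring the fallback branch in GetCanonicalNameAsync.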

Lookup data

Next, we’ll need lookup data to compare incoming names against. This table has a few subtle points that make it more complicated than it first appears.

Here’s the DDL (Data Definition Language) for defining this table:

CREATE TABLE [dbo].[Person]
(
    -- 1. Identity & Core Data
    [Id] INT IDENTITY(1,1) NOT NULL,
    [UserGuid] UNIQUEIDENTIFIER DEFAULT NEWID() NOT NULL,
    
    -- Raw Data (What the user actually typed)
    [RawFirstName] NVARCHAR(100) NOT NULL,
    [RawLastName] NVARCHAR(100) NOT NULL,
    [RawMiddleName] NVARCHAR(100) NULL,
    [DateOfBirth] DATE NULL,
    [ZipCode] VARCHAR(10) NULL, -- Useful for blocking
    [Gender]  VARCHAR(1) NULL,
    -- 2. Normalization & Canonicalization Columns
    -- Populated by your C# 'NameNormalizer' before insert
    -- Uppercase, trimmed, stripped of accents/punctuation
    [NormalizedFirstName] VARCHAR(100) NOT NULL, 
    [NormalizedLastName] VARCHAR(100) NOT NULL,
    
    -- Stores 'CHARLES' even if Raw is 'Chuck'
    [CanonicalFirstName] VARCHAR(100) NOT NULL, 

    -- 3. Phonetic Keys (Double Metaphone)
    -- Populated by your C# 'PhoneticIdentityGenerator'
    [PhoneticPrimary] VARCHAR(4) NOT NULL,   -- e.g. 'SM0' (Smith)
    [PhoneticSecondary] VARCHAR(4) NULL, -- e.g. 'XMT' (Schmidt alternative)

    -- 4. Blocking Key (The "Net")
    -- Application Logic: ZipCode + first 3 chars of NormalizedLastName 
    -- (with sensible fallbacks for null/short values)
    -- Example: '29601SMY'
    [BlockingKey] VARCHAR(20) NOT NULL,

    -- Metadata
    [CreatedDate] DATETIME2 DEFAULT GETUTCDATE() NOT NULL,
    [LastUpdated] DATETIME2 DEFAULT GETUTCDATE() NOT NULL,

    CONSTRAINT [PK_Person] PRIMARY KEY CLUSTERED ([Id] ASC)
);

We’ll talk more about the Phonetic columns and the BlockingKey shortly.
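As a preview, the BlockingKey rule described in the DDL comments (ZipCode plus the first three characters of NormalizedLastName) might be sketched like this. BlockingKeyBuilder is a hypothetical name, and the fallbacks are assumptions: "00000" for a missing or short ZIP and '_' padding for short last names. A placeholder ZIP places every ZIP-less record with the same last-name prefix in one block, so you may prefer a different fallback.

```csharp
using System;

public static class BlockingKeyBuilder
{
    // Builds ZipCode (first 5 characters) + first 3 characters of the last name.
    public static string Build(string zipCode, string normalizedLastName)
    {
        // Assumed fallback: a placeholder ZIP when the value is missing or too short.
        string zip = (zipCode ?? string.Empty).Trim();
        zip = zip.Length >= 5 ? zip.Substring(0, 5) : "00000";

        // Ignore spaces left by normalization (e.g. "MARY JANE") before taking the prefix.
        string last = (normalizedLastName ?? string.Empty).Replace(" ", string.Empty);

        // Assumed fallback: pad short names with '_' so the key length stays fixed.
        string prefix = last.Length >= 3 ? last.Substring(0, 3) : last.PadRight(3, '_');

        return zip + prefix;
    }
}
```

Build("29601", "SMYTH") returns "29601SMY", matching the example in the DDL comment, and Build(null, "NG") returns "00000NG_" under the assumed fallbacks.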

We want to include the extra demographic data of ZipCode, DateOfBirth, and Gender, to add more context and weight for our confidence score. In your specific scenario, though, you might want different demographic data.

UserGuid can be useful as a stable external identifier for APIs (so you don’t expose sequential IDs). You may also want to include other foreign keys to identify and track the original Person, or maybe even add a Source column to track where the original Person comes from.

In addition to this DDL, we will have key indexes that are critical to the performance we need. Without these, you would still see full table scans or excessive disk reads:

-- 1. The Blocking Index (The "Fast Lane")
-- This allows your app to grab a small pool of "candidate" records quickly.
-- We INCLUDE the columns needed for the scoring algorithm, so we avoid
-- key lookups back to the clustered index.
CREATE NONCLUSTERED INDEX [IX_Person_BlockingKey] 
ON [dbo].[Person] ([BlockingKey])
INCLUDE ([NormalizedFirstName], [NormalizedLastName], [CanonicalFirstName], [DateOfBirth], [ZipCode]);
GO

-- 2. The Phonetic Index (The "Sound-Alike" Lane)
-- Used when a user searches by name without a Zip Code or when we need to 
-- broaden the search
-- Allows queries like: WHERE PhoneticPrimary = 'SM0'
CREATE NONCLUSTERED INDEX [IX_Person_Phonetic] 
ON [dbo].[Person] ([PhoneticPrimary], [PhoneticSecondary])
INCLUDE ([NormalizedFirstName], [NormalizedLastName], [CanonicalFirstName]);
GO

-- 3. The Canonical Index (The "Nickname" Lane)
-- Used for finding "Elizabeth" when someone searches "Beth"
CREATE NONCLUSTERED INDEX [IX_Person_Canonical] 
ON [dbo].[Person] ([CanonicalFirstName], [NormalizedLastName])
INCLUDE ([RawFirstName]);
GO

Before we can use this as a lookup table, though, we need to make sure that the initial records are populated correctly. We will use a PersonDto like this:

public class PersonDto
{
    public int Id { get; set; }
    public Guid UserGuid { get; set; }
    public string RawFirstName { get; set; }
    public string RawLastName { get; set; }
    public string RawMiddleName { get; set; }
    public DateTime? DateOfBirth { get; set; }
    public string ZipCode { get; set; }
    public string Gender { get; set; }
    
    // Computed Columns
    public string NormalizedFirstName { get; set; }
    public string NormalizedLastName { get; set; }
    public string CanonicalFirstName { get; set; }
    public string PhoneticPrimary { get; set; }
    public string PhoneticSecondary { get; set; }
    public string BlockingKey { get; set; }
    
    public DateTime CreatedDate { get; set; }
    public DateTime LastUpdated { get; set; }
}

Assumption: This retrieval step assumes you have already normalized/canonicalized the incoming name (UPPERCASE) and precomputed its BlockingKey, PhoneticPrimary, and PhoneticSecondary before calling FindCandidatesAsync.

We can query the Person table, and load candidate rows into a PersonDto, using the indexes and filters we built above. Tune minCandidates and maxCandidates to match your data volume and performance needs.

Note: the implementation below replaces the candidate list as it broadens (BlockingKey → PhoneticPrimary → PhoneticSecondary). An alternative approach is to union candidates across tiers (and de-duplicate), so you keep earlier matches while expanding the pool.
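A minimal sketch of that union-and-de-duplicate alternative, done in memory after each tier's query has run. CandidateMerger is a hypothetical helper; it is generic so the example stays self-contained, and it assumes each row exposes an integer Id.

```csharp
using System;
using System.Collections.Generic;

public static class CandidateMerger
{
    // Keeps rows from the earlier tier, appends unseen rows from the wider tier,
    // and stops once maxCandidates rows have been collected.
    public static List<T> Merge<T>(
        IEnumerable<T> earlier,
        IEnumerable<T> wider,
        Func<T, int> idSelector,
        int maxCandidates)
    {
        var merged = new List<T>();
        var seenIds = new HashSet<int>();

        foreach (var tier in new[] { earlier, wider })
        {
            foreach (var item in tier)
            {
                if (merged.Count >= maxCandidates) return merged;

                // HashSet.Add returns false for an Id that was already merged.
                if (seenIds.Add(idSelector(item))) merged.Add(item);
            }
        }

        return merged;
    }
}
```

In FindCandidatesAsync, each widening stage would then become something like candidates = CandidateMerger.Merge(candidates, phoneticRows, p => p.Id, maxCandidates), where phoneticRows is the result of that stage's query.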

// Candidate retrieval strategy:
// 1) Start with BlockingKey (fast + precise)
// 2) If too few candidates, widen using PhoneticPrimary
// 3) If still too few, widen again using PhoneticSecondary
// 4) Cap results so scoring stays cheap
public async Task<List<PersonDto>> FindCandidatesAsync(
    string blockingKey,
    string phoneticPrimary,
    string phoneticSecondary,
    int minCandidates = 10,
    int maxCandidates = 100,
    CancellationToken ct = default)
{
    // Pull only the columns needed for downstream scoring 
    // (plus the keys used for filtering)
    IQueryable<PersonDto> baseQuery = _context.Persons.AsNoTracking()
        .Select(p => new PersonDto
        {
            Id = p.Id,
            UserGuid = p.UserGuid,
            RawFirstName = p.RawFirstName,
            RawLastName = p.RawLastName,
            RawMiddleName = p.RawMiddleName,
            DateOfBirth = p.DateOfBirth,
            ZipCode = p.ZipCode,
            Gender = p.Gender,
            NormalizedFirstName = p.NormalizedFirstName,
            NormalizedLastName = p.NormalizedLastName,
            CanonicalFirstName = p.CanonicalFirstName,
            BlockingKey = p.BlockingKey,
            PhoneticPrimary = p.PhoneticPrimary,
            PhoneticSecondary = p.PhoneticSecondary
        });
 
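    // Stage 1: BlockingKey
    // Skip the query when there is no key: an unfiltered Take() would
    // return arbitrary rows and stop the phonetic stages from running.
    var candidates = new List<PersonDto>();
    if (!string.IsNullOrWhiteSpace(blockingKey))
    {
        // Materialize early. We may not need to go any further
        candidates = await baseQuery
            .Where(p => p.BlockingKey == blockingKey)
            .Take(maxCandidates)
            .ToListAsync(ct);
    }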
 
    // Stage 2: widen via Primary phonetic key
    if (candidates.Count < minCandidates && 
         !string.IsNullOrWhiteSpace(phoneticPrimary))
    {
        candidates = await baseQuery
            .Where(p => p.PhoneticPrimary == phoneticPrimary)
            .Take(maxCandidates)
            .ToListAsync(ct);
    }
 
    // Stage 3: widen again using Secondary (alternate) phonetic key
    if (candidates.Count < minCandidates && 
          !string.IsNullOrWhiteSpace(phoneticSecondary))
    {
        // Also compare against PhoneticPrimary: some encoders only emit a
        // primary code, so an alternate key can match another row's primary.
        candidates = await baseQuery
            .Where(p => p.PhoneticSecondary == phoneticSecondary
                || p.PhoneticPrimary == phoneticSecondary)
            .Take(maxCandidates)
            .ToListAsync(ct);
    }
 
    return candidates;
}

Phonetics and double metaphone, explained

Phonetics is the study of sound (specifically, speech sounds). Phonetic algorithms convert a string into a standardized phonetic code. Soundex is an older phonetic algorithm that is widely implemented, even in SQL Server, but it has some problems and is crude compared to modern algorithms.

What is Soundex?

Soundex was designed primarily for English-language use and never considered the wider context of global data. As a result, it often performs poorly for many non-English surnames and naming conventions.

For starters, it keeps the first letter as-is and, after that, considers only one character at a time, ignoring how letter combinations change sounds in context. The GH combination is a common example: it is silent in “night” but sounds like F in “tough”, yet Soundex codes the G identically in both words.

Soundex also suffers from truncation issues. No matter how long the string is, the Soundex value is always one letter followed by three digits. If the original string is long, the end of the string will generally be ignored.
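To make both limitations concrete, here is a simplified American Soundex sketch. SimpleSoundex is a hypothetical class written for this illustration; it is not SQL Server's SOUNDEX() function, and real implementations can differ in edge cases such as how H and W are handled.

```csharp
using System;
using System.Text;

public static class SimpleSoundex
{
    // Digit for each letter A..Z; '0' marks letters that are not coded.
    private const string Codes = "01230120022455012623010202";

    public static string Encode(string name)
    {
        if (string.IsNullOrWhiteSpace(name)) return string.Empty;

        var sb = new StringBuilder(4);
        char prevCode = '0';

        foreach (char c in name.ToUpperInvariant())
        {
            if (c < 'A' || c > 'Z') continue;       // ignore non-letters
            char code = Codes[c - 'A'];

            if (sb.Length == 0)
            {
                sb.Append(c);                        // first letter is kept as-is
                prevCode = code;
                continue;
            }

            if (c == 'H' || c == 'W') continue;      // H and W do not separate equal codes

            if (code != '0' && code != prevCode) sb.Append(code);
            prevCode = code;                         // a vowel resets the run of equal codes

            if (sb.Length == 4) break;               // everything after this is dropped
        }

        // Pad short codes with zeros; return empty if the input had no letters.
        return sb.Length == 0 ? string.Empty : sb.ToString().PadRight(4, '0');
    }
}
```

With this sketch, "Richards" and "Richardson" both encode to R263 because the trailing letters are truncated, while "Night" gives N230 and "Tough" gives T200: the G is coded as 2 in both words even though it is silent in one and part of an F sound in the other.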

What is Double Metaphone?

Double Metaphone is different in several key ways. While Soundex returns exactly one four-character code, Double Metaphone returns a Primary and Secondary code. 

The Primary will most likely reflect an English pronunciation, while the Secondary reflects an alternate, often foreign, pronunciation. Both algorithms drop vowels after the first letter, but Double Metaphone looks at the surrounding letters, vowels included, when deciding how each consonant should be coded.

Double Metaphone generally gives more accurate matches and fewer false positives than Soundex. This is why our table and DTO have two columns/properties: PhoneticPrimary and PhoneticSecondary.

We can use a library such as Lucene.Net to populate the phonetic columns. If you want Double Metaphone specifically, you’ll typically need the analysis package as well.

dotnet add package Lucene.Net.Analysis.Phonetic

using Lucene.Net.Analysis.Phonetic.Language;

    /// <summary>
    /// Provides phonetic encoding utilities for fuzzy string 
    /// matching and search.
    /// </summary>
    public static class PhoneticHelper
    {
        /// <summary>
        /// Generates the primary and alternate Double Metaphone phonetic 
        /// encodings for the given input string.
        /// </summary>
        /// <param name="input">The string to encode. Returns a result 
        /// with <see langword="null"/> values if null or whitespace.</param>
        /// <param name="maxLen">The maximum length of each phonetic code. 
        /// Defaults to <c>4</c>.</param>
        /// <returns>
        /// A <see cref="PhoneticResult"/> containing the <c>Primary</c> 
        /// and <c>Alternate</c> Double Metaphone codes,
        /// or <see langword="null"/> for both if <paramref name="input"/> 
        /// is null or whitespace.
        /// </returns>
        public static PhoneticResult GetEncodings(string input, int maxLen = 4)
        {
            if (string.IsNullOrWhiteSpace(input))
                return new PhoneticResult(null, null);

            // Create a local instance to avoid locking bottlenecks
            var encoder = new DoubleMetaphone
            {
                MaxCodeLen = maxLen
            };

            return new PhoneticResult(
                Primary: encoder.GetDoubleMetaphone(input, false),
                Alternate: encoder.GetDoubleMetaphone(input, true)
            );
        }
    }

    /// <summary>
    /// Represents the result of a Double Metaphone phonetic 
    /// encoding operation.
    /// </summary>
    /// <param name="Primary">The primary phonetic encoding 
    /// of the input string.</param>
    /// <param name="Alternate">The alternate phonetic encoding 
    /// of the input string, which may differ for 
    /// ambiguous pronunciations.</param>
    public record PhoneticResult(string? Primary, string? Alternate);

What are distance algorithms?

Distance algorithms are used to measure how similar or different two pieces of data are. We intuitively understand physical distance, and time differences make sense, but generalized distance algorithms allow us to compare anything that can be represented as data.

This includes anything from simple strings (like we’re interested in here), to user preferences, and even DNA. Let’s take a look at some of the specific algorithms in detail.

The Levenshtein Distance algorithm

Levenshtein Distance measures the minimum number of single-character edits (insertions, deletions, or substitutions) needed to turn one string into another, regardless of length. The possible values range from 0, all the way to the length of the longest string being compared. 

Additionally, it weights all edits the same. This is potentially problematic for our purposes, because it treats a typo at the beginning of the string the same as a typo at the end.
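As a rough illustration of that behavior, the following is a small Levenshtein sketch using the standard two-row dynamic-programming approach. It is written in Python for brevity and is not the implementation from any package mentioned in this article; the name `levenshtein` is just for the example.

```python
def levenshtein(a, b):
    """Minimum number of single-character insertions, deletions, or substitutions."""
    if len(a) < len(b):
        a, b = b, a                       # keep the shorter string on the inner loop
    prev = list(range(len(b) + 1))        # distance from "" to each prefix of b
    for i, ca in enumerate(a, 1):
        cur = [i]                         # distance from a[:i] to ""
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                 # deletion
                           cur[j - 1] + 1,              # insertion
                           prev[j - 1] + (ca != cb)))   # substitution (free if equal)
        prev = cur
    return prev[-1]

print(levenshtein("jonathan", "konathan"))  # typo in the first character
print(levenshtein("jonathan", "jonathon"))  # typo near the end
```

Both calls return 1: the position of the typo makes no difference to the distance.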

The Jaro-Winkler algorithm

Jaro-Winkler is a similar algorithm but, instead of distance, it measures similarity. It gives values in the range 0 to 1.  In this case, 1 would mean that the two strings are identical, while 0 means that they have nothing in common.  Simply put, bigger numbers are better.

Jaro-Winkler is optimized for comparing short strings. It has optimizations to penalize differences at the beginning of the strings more than differences at the end. This is ideal for comparing names since we’re more likely to get the start of a person’s name right; spelling differences/errors are more likely to show up towards the end.

When you read about string comparisons, you’ll hear about both of these algorithms, and others besides. It can be difficult to understand the differences. However, since the optimizations found in Jaro-Winkler are especially well suited to comparing names, we’ll use it for our example. It’ll give us a similarity metric to factor into our confidence score that two names refer to the same individual.

In our example, we’ll use the F23.StringSimilarity NuGet package to provide an implementation of the Jaro-Winkler algorithm. In addition to Jaro-Winkler, it gives us access to Levenshtein and many other useful algorithms.

Finally, it’s worth noting that Jaro-Winkler is surprisingly simple to implement and well understood. If you want, you can easily write your own implementation in about 40 lines of code or so.
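For the curious, a hand-rolled version might look something like the sketch below. It is written in Python for brevity and is not the code from any package. The 0.1 prefix scale, the four-character prefix cap, and the 0.7 boost threshold are commonly used defaults that are assumed here, and the function names are illustrative.

```python
def jaro(s1, s2):
    """Jaro similarity in the range 0.0 to 1.0."""
    if s1 == s2:
        return 1.0
    len1, len2 = len(s1), len(s2)
    if len1 == 0 or len2 == 0:
        return 0.0
    window = max(max(len1, len2) // 2 - 1, 0)    # how far apart matching characters may be
    matched1, matched2 = [False] * len1, [False] * len2
    matches = 0
    for i, ch in enumerate(s1):
        for j in range(max(0, i - window), min(len2, i + window + 1)):
            if not matched2[j] and s2[j] == ch:
                matched1[i] = matched2[j] = True
                matches += 1
                break
    if matches == 0:
        return 0.0
    # Count matched characters that appear in a different order
    k = out_of_order = 0
    for i in range(len1):
        if matched1[i]:
            while not matched2[k]:
                k += 1
            if s1[i] != s2[k]:
                out_of_order += 1
            k += 1
    transpositions = out_of_order // 2
    return (matches / len1 + matches / len2
            + (matches - transpositions) / matches) / 3

def jaro_winkler(s1, s2, prefix_scale=0.1, boost_threshold=0.7):
    """Jaro similarity boosted for a shared prefix of up to four characters."""
    score = jaro(s1, s2)
    if score <= boost_threshold:
        return score
    prefix = 0
    for a, b in zip(s1[:4], s2[:4]):
        if a != b:
            break
        prefix += 1
    return score + prefix * prefix_scale * (1 - score)

print(round(jaro_winkler("jonathan", "jonathon"), 4))  # typo near the end
print(round(jaro_winkler("jonathan", "konathan"), 4))  # typo in the first character
```

The typo at the end scores higher than the typo at the start, which is the behavior that makes the algorithm a good fit for names.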

dotnet add package F23.StringSimilarity

using F23.StringSimilarity;

public static class StringMetrics
{
    // One shared instance, reused across calls
    private static readonly JaroWinkler _jw = new JaroWinkler();

    /// <summary>
    /// Calculates the Jaro-Winkler similarity score between two strings.
    /// </summary>
    /// <param name="source">The first string to compare.</param>
    /// <param name="target">The second string to compare.</param>
    /// <returns>A double between 0 (no match) and 1 (perfect match).</returns>
    public static double CalculateJaroWinkler(string source, string target)
    {
        // Treat a missing name as "no match" rather than passing null to the library
        if (source == null || target == null) return 0.0;

        // Similarity returns 1.0 for identical strings
        return _jw.Similarity(source, target);
    }
}

How to build a confidence score

In our example, it’s now time to pull all the pieces together and decide how confident we are that the candidates returned by our database filters match the targeted individual.

We’ll use a weighted scoring strategy where matching different parts of an individual’s name might carry different weights, and not matching demographic details might carry differing penalties.

For example, we might pay more attention to matches on the last name than matches on the first name. Or, we may switch that for women who may use their maiden name or a hyphenated name. We may want to penalize (lower the confidence) if the names match but the birth year and/or gender is wrong.

  • Weighted Scoring Strategy: assign different weights to different components (e.g., Last Name similarity might be weighted 55%, First Name 35%, Middle Initial 10%). Use similarities in the 0 to 1 range (for example, Jaro-Winkler).

  • The Formula: Score = (WeightFirst × SimilarityFirst) + (WeightLast × SimilarityLast) + (WeightMiddle × SimilarityMiddle)

  • Penalty Logic: reduce the score (or force a non-match) when key demographic fields disagree (for example, an exact DOB mismatch) and apply smaller penalties for weaker signals (for example, missing ZipCode).
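The arithmetic behind the formula can be checked with a few lines of code. The sketch below is in Python for brevity. It uses the example weights from the list above, while the similarity values and the 0.10 penalty are made-up inputs chosen only to show the calculation.

```python
# Example weights from the list above (they sum to 1.0)
WEIGHTS = {"last": 0.55, "first": 0.35, "middle": 0.10}

def confidence(similarities, penalty=0.0):
    """Weighted sum of per-part similarities (each 0..1), minus a penalty, floored at 0."""
    base = sum(WEIGHTS[part] * similarities[part] for part in WEIGHTS)
    return max(0.0, base - penalty)

# Hypothetical inputs: a near-match on first name, exact last and middle,
# and a 0.10 penalty for a demographic disagreement.
score = confidence({"first": 0.90, "last": 1.00, "middle": 1.00}, penalty=0.10)
print(round(score, 3))  # (0.35 * 0.90) + (0.55 * 1.00) + (0.10 * 1.00) - 0.10
```

Because the weights sum to 1.0 and each similarity is at most 1.0, the base score stays in the 0 to 1 range, so penalties can be expressed on the same scale.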

Now we can build out the logic for a MatchEvaluator class. We’ll define some constants for configuration and define a coordinating method (EvaluateMatch), and some helper functions (CalculateBaseNameScore and CalculateDemographicPenalties).

We’ll call the FindCandidatesAsync method and loop through the results to repeatedly call EvaluateMatch:

    public class MatchEvaluator
    {
        // Thresholds for the gatekeeper routing
        private const double AutoAcceptThreshold = 0.85;
        private const double ManualReviewThreshold = 0.65;
        private const double EarlyRejectNameThreshold = 0.40;

        // Component Weights
        private const double LastNameWeight = 0.55;
        private const double FirstNameWeight = 0.35;
        private const double MiddleNameWeight = 0.10;

        /// <summary>
        /// The coordinating function that orchestrates the scoring 
        /// and routing.
        /// </summary>
        public MatchResult EvaluateMatch(Person incoming, Person existing)
        {
            var result = new MatchResult();

            // 1. Calculate Base Name Score
            result.BaseNameScore = CalculateBaseNameScore(incoming, existing);

            // Optimization: If the name score is completely unviable, 
            // reject early to save CPU cycles on penalty calculations.
            if (result.BaseNameScore < EarlyRejectNameThreshold)
            {
                result.FinalScore = result.BaseNameScore;
                result.Resolution = MatchResolution.Reject;
                result.SystemNotes = 
                        "Rejected early due to severe name mismatch.";
                return result;
            }

            // 2. Calculate Tiered Penalties
            result.TotalPenalty = CalculateDemographicPenalties
                  (incoming, existing);

            // 3. Apply Penalties to get Final Score
            // Ensure score doesn't drop below 0
            result.FinalScore = Math.Max(0.0, 
                  result.BaseNameScore - result.TotalPenalty);

            // 4. Determine Resolution Category
            if (result.FinalScore >= AutoAcceptThreshold)
            {
                result.Resolution = MatchResolution.AutoAccept;
                result.SystemNotes = 
                   "High confidence match. Safe for automated merge/update.";
            }
            else if (result.FinalScore >= ManualReviewThreshold)
            {
                result.Resolution = MatchResolution.ManualReview;
                result.SystemNotes = 
                  "Weak match. Routed to exception queue for manual review.";
            }
            else
            {
                result.Resolution = MatchResolution.Reject;
                result.SystemNotes = 
                    "Confidence is too low after demographic penalties.";
            }

            return result;
        }

        /// <summary>
        /// Calculates the weighted score of the name components.
        /// </summary>
        private double CalculateBaseNameScore(Person incoming, Person existing)
        {
            // Note: CalculateJaroWinkler is the Jaro-Winkler helper shown previously
            double firstScore = 
                StringMetrics.CalculateJaroWinkler(
                    incoming.NormalizedFirstName, 
                    existing.NormalizedFirstName);
            double lastScore = 
                 StringMetrics.CalculateJaroWinkler(
                    incoming.NormalizedLastName, 
                    existing.NormalizedLastName);
            // Handle missing middle initials gracefully
            double middleScore = 1.0;
            if (!string.IsNullOrWhiteSpace(incoming.RawMiddleName) 
                && !string.IsNullOrWhiteSpace(existing.RawMiddleName))
            {
                middleScore = 
                   incoming.RawMiddleName.Equals(existing.RawMiddleName,   
                   StringComparison.OrdinalIgnoreCase) ? 1.0 : 0.0;
            }

            return (firstScore * FirstNameWeight) +
                   (lastScore * LastNameWeight) +
                   (middleScore * MiddleNameWeight);
        }


        /// <summary>
        /// Evaluates demographic discrepancies and returns a cumulative 
        /// penalty percentage.
        /// </summary>
        private double CalculateDemographicPenalties(Person incoming, 
              Person existing)
        {
            double penalty = 0.0;

            // --- A. Date of Birth Penalties ---
            if (incoming.DateOfBirth.HasValue && existing.DateOfBirth.HasValue)
            {
                var inDob = incoming.DateOfBirth.Value;
                var exDob = existing.DateOfBirth.Value;

                if (inDob != exDob)
                {
                    if (inDob.Year == exDob.Year 
                     && inDob.Month == exDob.Day 
                     && inDob.Day == exDob.Month)
                    {
                        // Transposed Day/Month (Common in international files)
                        penalty += 0.05;
                    }
                    else if (inDob.Month == exDob.Month
                     && inDob.Day == exDob.Day
                     && Math.Abs(inDob.Year - exDob.Year) == 1)
                    {
                        // Off by exactly one year (common data entry typo)
                        penalty += 0.10;
                    }
                    else
                    {
                        // Any other DOB mismatch
                        penalty += 0.35;
                    }
                }
            }

            // --- B. Gender Penalties ---
            if (!string.IsNullOrWhiteSpace(incoming.Gender) 
                 && !string.IsNullOrWhiteSpace(existing.Gender))
            {
                if (!incoming.Gender.Equals(existing.Gender, 
                    StringComparison.OrdinalIgnoreCase))
                {
                    // Conflicting gender provided
                    penalty += 0.15;
                }
            }

            // --- C. Location/Zip Penalties ---
            if (!string.IsNullOrWhiteSpace(incoming.ZipCode) 
                && !string.IsNullOrWhiteSpace(existing.ZipCode))
            {
                var inZip = incoming.ZipCode.Length >= 5 
                    ? incoming.ZipCode.Substring(0, 5) 
                    : incoming.ZipCode;
                var exZip = existing.ZipCode.Length >= 5 
                    ? existing.ZipCode.Substring(0, 5) 
                    : existing.ZipCode;

                if (inZip != exZip)
                {
                    // Different Zip Code
                    penalty += 0.05;
                }
            }

            return penalty;
        }
   }

Summary

Exact string-matching breaks down quickly in real-world data. Names are messy: spelling variations, punctuation, diacritics, nicknames, and cultural conventions all create legitimate ways to refer to the same person. That’s where a confidence-based approach comes in. It replaces the brittle “match / no match” binary with a repeatable process that can be tuned to your risk tolerance and data quality.

The key is to combine multiple weak signals into one strong decision. We start by normalizing inputs (case-folding, punctuation, diacritics, and noise words) and canonicalizing known nicknames. We then use a blocking key (and, when needed, phonetic keys) to pull a small candidate set efficiently and finally compute a weighted similarity score with targeted demographic penalties.

The last step applies thresholds to route outcomes: automatic acceptance for high-confidence matches, a review queue for ambiguous cases, and rejection when confidence is too low.

How the solution outlined in this article helps

This layered design improves data integrity while keeping performance predictable at scale. Most records are eliminated by cheap, indexed filters before any expensive scoring occurs. It also makes the system auditable, since your weights, penalties, and thresholds explain why a match was accepted or rejected. Furthermore, it reduces manual work by focusing human review on the narrow band of uncertain cases.

In practice, the biggest wins come from treating configuration as a ‘living’ asset. Keep the nickname tables current, tune thresholds based on observed false positives/negatives, and periodically re-score samples as your data and business rules evolve.

Note: AI was used to generate the feature image for this article.


FAQs: How to match names in C# without exact string comparisons

1. What is fuzzy name matching?

A technique that scores how similar two names are and uses a confidence threshold to decide whether they refer to the same person, rather than requiring an exact match.

2. Why isn't exact string matching enough?

Real-world name data contains typos, casing differences, diacritics, punctuation variations, and nicknames. Exact matching treats every variation as a different person, creating duplicates and missed matches.

3. What is Double Metaphone and why use it over Soundex?

Double Metaphone is a phonetic algorithm that converts a name into codes representing how it sounds. Unlike Soundex, it returns both primary and secondary codes, handles non-English names well, and produces fewer false positives.

4. Why use Jaro-Winkler instead of Levenshtein distance?

Jaro-Winkler is optimized for short strings and weights matching characters at the start of a name more heavily, which fits how people typically mistype names. Levenshtein treats every edit equally.

5. How does a confidence score work?

Weighted similarity scores for first, last, and middle names combine into a base score. Demographic mismatches (date of birth, gender, ZIP) then apply penalties, and the final number is compared to thresholds for auto-accept, manual review, or reject.

The post How to match names in C# without exact string comparisons appeared first on Simple Talk.


Terraform MCP Server Explained: Setup and Use Cases

Learn what the Terraform MCP server is, how to install it locally, common use cases, and best practices for secure setup.

How AgentDBA Identifies Backup Failures

Every DBA has a box like this. Sitting untouched for months. Nobody’s proud of it, nobody’s fixed it, it’s just there — a handful of small compliance gaps that never made it to the top of anyone’s list. I pointed … Continue reading




Learn T-SQL With Erik: Aligning Queries and Indexes Part 2


Full Transcript

Ahem, Erik Darling here, Darling Data. In today’s video, we’re going to continue talking about query and index alignment. This one’s kind of in the same vein as yesterday’s video, but with a little bit of a twist to it. You know how I love keeping you emotionally hostage? Just kidding. So, again, more sort of aligning queries and indexes. It’ll be fun. Trust me. This is, of course, all material from the Learn T-SQL with Erik course. This is just little dribs and drabs of it to get you excited and entice you and force you to buy things from me, because in this consumerist society, that’s the way the world spins, I guess.

But down in the video description, you’ll find all sorts of helpful information. You’ll find all sorts of helpful links, including a link to purchase the full course material. It’s a great course. You should check out the full thing sometime with money. That’d be nice.

You can also find other ways to spend money on me. You can hire me for consulting. Maybe you think, oh, wow, he sure seems to know what he’s talking about. I wonder if he could come know what he’s talking about on our SQL Server. The answer is yes. Yes, I can.

You can become a supporting member of the channel, too. If you think that the things that I do and say and talk about here are helpful to you, or are just so outstanding and wonderful that you want to give me four bucks a month, and you want to take part in the glory and the greatness of this YouTube channel, maybe finally outdo that Amiga repair channel, you can do that.

Other stuff that there are links for: asking me office hours questions. I do those every Tuesday, answering five of them. And of course, if you feel strongly about the content here, but perhaps you’re irresponsible with money in many other ways, and you can’t afford any of the paid stuff, you could always like, subscribe, and tell a friend, which has value of its own.

If part of your financial and your fiscal irresponsibility is spending too much on SQL Server monitoring tools, well, golly, I can help you out there. Boy, can I save you a pretty penny.

I’ve got a free monitoring tool. It’s up on GitHub. Again, the link for all this stuff is down in the video description. Totally free, totally open source performance monitoring. It does all the stuff you would expect a monitoring tool to do, except it’s written by someone who actually looks at SQL Server performance for a living, not someone who has never done that in their life, which is a problem a lot of other monitoring tool companies have.

So, you ought to check that out, shouldn’t you? Getting real close to 10,000 downloads, so I’m feeling like credibility is in the… I’m out in the world a little bit.

June 12th and 13th, I will be at Data Saturday Croatia. And November 9th through 11th, I will be at PASS Data Summit in Seattle, Washington. At Data Saturday Croatia, I have a pre-con on Advanced T-SQL.

You might even find some of this material that we’re learning about today is in that course. And you might even find that if you show up to Data Saturday Croatia, you will get free access to the full course material if you come into my pre-con.

PASS Data Summit, a lot of unknowns there so far. Who knows what’s going to happen? It’s going to be crazy. But anyway, we will continue making our way through May somehow, some way.

So, I’ve got these indexes, right? I’ve got one on the badges table on user ID and date. And these will all make a little bit more sense when you see the query.

And I’ve got this one down here on the comments table on user ID and post ID. And then I’ve got this index in here. And if you remember yesterday’s video, we almost had the same index except post ID and owner user ID were kind of swapped around there.

Or rather, owner user ID was at the beginning of the index, post type ID was second. We’re going to deal with a very similar query, but now we’re going to have to figure out a way to take better, to rewrite our query to take better advantage of this index.

I think Joe Sack had a great blog post some years ago. It was called like the gatekeeper problem. And we have a gatekeeper in here, because when you create rowstore indexes, the ordering of the columns in those indexes of course defines the way that queries can access data in those indexes. This is of course defined by the order of the key.

So like if we want to do a search on post type ID something, we have immediate access to that data ordered in this index here. And if we wanted to search on post type ID and score, well, we would have post type ID in order, and then we would have score in order for any duplicates in post type ID. So this would line up those two things pretty well. But as soon as you get to like wanting to do things like just search on score or search on score and owner user ID, the ordering of the index no longer benefits those searches as well, because we’re not first accessing the data by post type ID in order to sort of maintain the B-tree traversal that you get when you use those types of indexes.
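One way to picture why key order matters is to model the index as a sorted list of tuples. This is a toy sketch in Python, not how SQL Server stores or costs anything, and the rows and the `seek`/`scan` helpers are invented for illustration.

```python
from bisect import bisect_left, bisect_right

INF = float("inf")

# Hypothetical index rows, sorted by key order: (post_type_id, score, owner_user_id)
index = sorted([
    (1, 5, 22), (1, 9, 11), (1, 12, 22),
    (2, 3, 11), (2, 7, 22), (2, 15, 11), (2, 20, 33),
])

def seek(prefix):
    """Rows whose leading key columns equal `prefix`, located by binary search."""
    pad = (INF,) * (3 - len(prefix))
    return index[bisect_left(index, prefix):bisect_right(index, prefix + pad)]

def scan(predicate):
    """No usable leading column, so every row has to be checked."""
    return [row for row in index if predicate(row)]

print(seek((1,)))                     # post type only: one contiguous range
print(seek((2, 7)))                   # post type and score: a narrower range
print(scan(lambda row: row[1] > 10))  # score alone: matches are scattered
```

Searching on the leading column, or the leading two columns, lands on one contiguous slice. Searching on score alone finds matches spread across both post types, so the sort order offers no shortcut.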

So this query used to be a lot worse when I first wrote it. It was like 2000, well, it was on SQL Server 2017, and it was on a much worse laptop. So I need to play a few tricks here to maintain the nostalgic feelings that I have about this demo, because I truly love this demo.

So we’re going to hint things back in time a little bit. We’re going to tell SQL Server to use compat level 140. That was the 2017 compat level. And we are going to tell SQL Server, you can only run at MAXDOP 4, because that was what my old laptop sort of permitted, right? So with all that out of the way, let’s look at the query plan for this.

And it runs for about 4.2 seconds total. And the majority of that time is spent in one branch over here, right? So 4.2 seconds total, and 4.1 of those 4.2 seconds is spent in this branch, right? We can see 4.1 seconds ending up there.

The branch that this is hitting is, of course, the one where we are trying to find, let me go back up to the query to make a little bit more sense of that. This thing right here.

Now, there are going to be lots of times in your query tuning life where using a temp table is going to be beneficial, right? And so when I was writing this demo, one of the things I experimented with was using a temp table. So what I did was I took all the parts of the query that were already fast, and I was like, well, I’m going to put all of you into a temp table, right? And that happens pretty quickly, right? That’s 47 milliseconds, right? So no complaints there. But then I was like, now, of course, with those 740 rows materialized, stabilized into a temp table, this has to be faster. SQL Server will have to do something better or smarter here, make better choices, do something.

Do something helpful, but no, it actually slows down a little bit, 4.7 seconds. I guess if I ran this a few times, it might alleviate, but we’re not going to mess with all that. But again, that whole, like the whole problematic branch is over here, right? So all this stuff going on in this chunk. And again, it’s the same thing over here. So the problem that we really have is that our query, or rather the index that we have is on post type ID, score, and then owner user ID.

Right? So we can’t sort direct, rather, we can’t seek directly to post type ID and owner user ID because that score column is in the way. So in yesterday’s video, I showed you a rewrite using top one and max and stuff. None of those rewrites here are terribly effective. In the full video, or rather the full course material, which is available for purchase, I go into all the ins and outs of why. But here, if we get this estimated plan, the reason why kind of becomes a little bit more obvious, right? And let’s move this over here, so my giant head isn’t in the way.

And if we look at the tool tip for this, you’ll see that in one of these branches we’re seeking to post type ID one, but then we have this residual predicate over here on owner user ID, right? So because that score column is in the way, we can’t get directly to owner user ID.

We can get to the post type ID we care about, but then the score column is like, well, no, I don’t think so. It’s like a sassy little thing saying, no, you can’t seek directly to post type ID and owner user ID here.

You are forced to go through me, right? And pry that owner user ID out of my cold, dead data. So one thing that you can do in order to take better advantage of this index is just consider what we want from this query, right?

So it’s a little bit more clear if we go up here a little bit: we want the top score from the population of post type IDs one and two, right?

So it doesn’t matter if it’s post type ID one or two, we want you to, we just want their highest score, right? Question or answer. It can be the highest score.

What we can do is, instead of writing the query with a single outer apply where we union all both of these things together, we can change this query a little bit so that we use two outer applies. We find the top score first for post type ID one, and then we use that score to act as an additional filter for post type ID two, right? So what we’re going to do is say, hey, you know what?

We just found this top question score. Let’s go find that top answer score, but we’re going to use the score that we found for questions to filter, and say, you know what?

Maybe we don’t need all of them. We can use this to sort of seek to some scores that we care about, because we know what the top question score is for this user. We can pass score down a little bit and allow it to act as an additional filter. So this is how we’re going to rewrite this query.

We have the first outer apply right here. Uh, that’s going to find the top one, uh, score for questions, right? Ordered by score descending. Right.

Correlated to owner user ID, just like before, but then down here, we’re going to add in this new thing and we’re going to say only go find me, uh, answer scores for that person for when the answer score is higher than the question score, right? So we’re using this, we’re giving it, we’re adding this additional predicate down here so that SQL Server can better traverse that B tree index from post type ID to score to owner user ID. And if we do this.

This finishes just about instantly. Now in this first branch, and again, I go into this in far more detail in the full material, but we have this first branch up here, right? This takes 217 milliseconds.

This one still has the same problem, but this one isn’t really where we get the big performance hit. The big performance hit that we get is on the one for post type ID two, right? Because for post type ID one, there’s about 6 million rows. For post type ID two, there’s like almost 12 million rows. So this index seek up here, you can see this is where we’re finding post type ID one, right? And then for this second branch, now this looks a lot different.

We have two seek predicates here, right? And this is not the same as having multiple seek keys. I know in another video I talked about like multi seeks and dynamic seeks.

This is not the same thing. We only have seek keys one here, but now you can see that we are correlating on post type ID two here, and we have this additional filter to say where score is greater than Expr1003. Expr1003 is of course the score column that we found from this first top one query out here.

So by finding the top one question score, and then using that as an additional filter in this outer apply to say only give me answer scores that are higher than the top question score, we give SQL Server a better way to use the index that we already had.
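The effect of passing the question score down can be mimicked with a sorted list standing in for the index. This is a toy model in Python with invented rows and an invented `top_score` helper. It only counts how many index rows each approach has to read and says nothing about real SQL Server costs.

```python
from bisect import bisect_left, bisect_right

INF = float("inf")

# Hypothetical index rows, sorted by key order: (post_type_id, score, owner_user_id)
index = sorted([
    (1, 5, 22), (1, 9, 11), (1, 12, 22),
    (2, 3, 11), (2, 7, 22), (2, 15, 11), (2, 20, 33), (2, 30, 22),
])

def top_score(post_type, owner, above=None):
    """Return (top score for owner or None, number of index rows read)."""
    if above is None:
        lo = bisect_left(index, (post_type,))               # seek on post type only
    else:
        lo = bisect_right(index, (post_type, above, INF))   # seek on post type and score > above
    hi = bisect_right(index, (post_type, INF, INF))
    rows = index[lo:hi]
    scores = [score for _, score, o in rows if o == owner]  # residual check on owner
    return (max(scores) if scores else None), len(rows)

question, _ = top_score(1, owner=22)
print(top_score(2, owner=22))                  # answers branch without the score filter
print(top_score(2, owner=22, above=question))  # answers branch with the score filter
```

Both calls find the same top answer score, but the second one starts its range after the question score and reads fewer rows. In the toy data that is 3 rows instead of 5; the point of the rewrite is that the same narrowing applies to a range of millions of rows.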

All right, cool. Thank you for watching. I hope you enjoyed yourselves. I hope you learned something and I will see you next Tuesday for office hours. Bye.

Going Further


If this is the kind of SQL Server stuff you love learning about, you’ll love my training. Blog readers get 25% off the Everything Bundle — over 100 hours of performance tuning content. Need hands-on help? I offer consulting engagements from targeted investigations to ongoing retainers. Want a quick sanity check before committing to a full engagement? Schedule a call — no commitment required.

The post Learn T-SQL With Erik: Aligning Queries and Indexes Part 2 appeared first on Darling Data.


Fundamentals of Azure DevOps with SQL projects


Building automated pipelines with your SQL database projects enables you to build a rich CI/CD ecosystem to ensure that your application is being deployed with good quality code and with high confidence of success. SQL Database Projects are compatible with just about every automation environment because fundamentally they’re built on top of the .NET SDK, a free and cross-platform development platform. You can develop the Microsoft.Build.Sql projects in VS Code and SQL Server Management Studio (SSMS). In this post, we’ll take a look at the things you need to know to get started with building and deploying SQL projects in Azure DevOps pipelines, a crucial part of integrating database development with the rest of your application development lifecycle. The concepts from this post will apply to any automation environment you choose.

Prerequisites

  • An Azure DevOps project with a repository containing your SQL project
  • Permissions to create a service connection in Azure DevOps project settings
  • The .NET SDK available in your pipeline environment (pre-installed on Microsoft-hosted agents)
  • An Azure SQL Database on a logical server with:
    • Microsoft Entra authentication enabled (Entra-only recommended)
    • Public network access set to “Selected networks”
    • An Entra admin configured on the server
  • An Azure subscription where you can assign RBAC roles on the SQL Server resource

To start our example, we set up a repository that contains our SQL project in the folder labeled AdventureWorks and an empty folder for our pipeline definitions. The AdventureWorks folder sits at the root of the repository, so our file tree looks similar to:

.
├── 📁 AdventureWorks/
│   ├── AdventureWorks.sqlproj
│   ├── 📁 dbo/
│   │   ├── 📁 Functions/
│   │   │   ├── ufnGetAllCategories.sql
│   │   │   ├── ufnGetCustomerInformation.sql
│   │   │   └── ufnGetSalesOrderStatusText.sql
│   │   ├── 📁 StoredProcedures/
│   │   │   ├── uspLogError.sql
│   │   │   └── uspPrintError.sql
│   │   ├── 📁 Tables/
│   │   │   ├── BuildVersion.sql
│   │   │   └── ErrorLog.sql
│   │   └── 📁 UserDefinedTypes/
│   │       ├── AccountNumber.sql
│   │       ├── Flag.sql
│   │       ├── Name.sql
...
│   └── 📁 Security/
│       └── SalesLT.sql
├── 📁 Pipelines/
└── .gitignore

If you’re unsure of how to set up a SQL project from your current database, here’s a tutorial article on getting started from an existing database: https://learn.microsoft.com/sql/tools/sql-database-projects/tutorials/start-from-existing-database

Build before deploy

When we work with a SQL project in an IDE like VS Code, Visual Studio, or SSMS, we often leverage the build and publish actions as the primary development checkpoints. SQL project build validates that the syntax is correct and matches the target platform that we have selected. This is one way to ensure that the database code we’re working with is compatible with the anticipated target (like Azure SQL Database), especially if we have an application that we’ve been developing while uncertain which platform it might be deployed to. While running build from a GUI involves using the menu option for build, in automation environments we need a command-line interaction to run project build.

In this case, it’s as straightforward as running “dotnet build” in an environment that has the .NET SDK installed. Microsoft-provided automation environments have a variety of software pre-installed, including the .NET SDK. If you leverage a self-hosted runner in the future, you would be responsible for installing the .NET SDK in that environment. Validating the project syntax represents the minimal continuous integration (CI) pipeline for our SQL project example.

We will use the starter template for pipelines in Azure DevOps to establish the continuous integration (CI) pipeline that validates the project builds successfully. From that starter pipeline template, we:

  1. Modify the trigger to focus only on the main branch and only on changes within the AdventureWorks folder.
  2. Remove the provided script steps and use the task panel to add the “.NET (Core)” task template.
  3. Set up the .NET task with the path to our SQL project file (AdventureWorks/AdventureWorks.sqlproj). Optionally, add the RunSqlCodeAnalysis property for additional insights on our database’s code quality.

When we save our pipeline definition, it looks similar to the below and can now be run ad-hoc as well as automatically when changes are made to our database project on the main branch.

trigger:
  branches:
    include:
    - main
  paths:
    include:
    - AdventureWorks

pool:
  vmImage: ubuntu-latest

steps:
- task: DotNetCoreCLI@2
  inputs:
    command: 'build'
    projects: 'AdventureWorks/AdventureWorks.sqlproj'
    arguments: '/p:RunSqlCodeAnalysis=true'

The steps of an Azure DevOps pipeline can be modified easily through the graphical task selection and settings pane. However, the YAML code itself is also directly editable. Continued customization, source control of the pipeline definition itself, and repeated use of specific components all leverage the code-first nature of the automation definition. The comprehensive documentation for the Azure DevOps Pipeline YAML schema is available at https://aka.ms/yaml.

When the build pipeline runs, it prepares a summary of the errors and warnings on the pipeline run page. SQL project build produces a .dacpac build artifact, but if we do not explicitly preserve that file, it will not be captured by the build pipeline for future use. The .dacpac is temporarily available for use in the pipeline at its output location, which we can spot from the detailed build output. For the default configuration, the .dacpac is located in the bin/Debug folder of the SQL project.
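
To keep the .dacpac beyond the lifetime of the run, one option is to publish it as a pipeline artifact right after the build step. The following is a minimal sketch rather than part of the original pipeline: it assumes the default Debug configuration and output path described above, and the artifact name 'dacpac' is a placeholder chosen for illustration.

```yaml
steps:
- task: DotNetCoreCLI@2
  inputs:
    command: 'build'
    projects: 'AdventureWorks/AdventureWorks.sqlproj'
    arguments: '/p:RunSqlCodeAnalysis=true'

# Publish the build output so it remains available after the run ends.
# The path assumes the default Debug configuration; 'dacpac' is a placeholder name.
- task: PublishPipelineArtifact@1
  displayName: 'Publish .dacpac as a pipeline artifact'
  inputs:
    targetPath: '$(Build.SourcesDirectory)/AdventureWorks/bin/Debug/AdventureWorks.dacpac'
    artifact: 'dacpac'
```

A later pipeline or stage could then download that artifact instead of rebuilding the project, which keeps the deployed .dacpac identical to the one that was validated.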

Now that we have our SQL project successfully building in an automated environment, we’re ready to explore the steps needed to apply the deployment to different SQL environments. Deployment pipelines can be used for a variety of situations, including ephemeral (temporary) validation environments, shared staging instances, as well as production databases through gated pipelines.

Publish the SQL project

To build our understanding of a SQL project deployment, we’re going to create a separate Azure DevOps pipeline and use it for deploying the SQL project to a development Azure SQL Database. Even though this is a development instance where we are free to apply database changes without impacting a production workload, we are not going to compromise on the security standards that we would expect to protect our data. Create an Azure SQL Database on a server with only Microsoft Entra authentication enabled and public network access enabled only for selected networks. In Azure DevOps, we’re able to adhere to these standards more easily through the use of service connections, which bind an Azure DevOps pipeline to a Microsoft Entra app registration. A service connection is set up through the project settings in the Azure DevOps project.

The name of the Azure DevOps service connection is an important value to know, as it will be used throughout the pipeline. The default value can be quite lengthy and difficult to distinguish, but can be modified from the Azure DevOps interface. In this example, our service connection name is ContosoAzure.

We’ll start a new pipeline definition from the Starter template and apply what we learned previously about building a SQL project to make that the first step in our new pipeline. Instead of this pipeline running automatically on code changes, we’re going to remove all pipeline triggers such that it only runs ad-hoc.

# AdHoc deployment pipeline
trigger: none

pool:
  vmImage: ubuntu-latest

steps:

- task: DotNetCoreCLI@2
  inputs:
    command: 'build'
    projects: 'AdventureWorks/AdventureWorks.sqlproj'
    arguments: '--configuration Release'
  displayName: 'Build SQL Project to ./AdventureWorks/bin/Release/AdventureWorks.dacpac'

The SqlPackage CLI provides a flexible way to extract and apply the SQL project database definition. It is an automatable tool that fits well into CI/CD environments like Azure DevOps pipelines. While the SqlPackage CLI is not likely to be installed in our automation environments, we can install it by adding a short script step to the pipeline. This is the same command we would use to install SqlPackage across Windows, Linux, and macOS environments.

- script: dotnet tool install -g microsoft.sqlpackage
  displayName: 'Install SqlPackage'

Set up passwordless authentication in Azure

The service connection from Azure DevOps to Microsoft Azure is listed as an “Enterprise Application” (app registration) in Entra. By granting the proper permissions to this identity, we’re able to interact with the database from the Azure DevOps pipeline without storing passwords or access tokens. Use the Entra blades in the Azure Portal to locate the app registration and note its display name, which may look something like “yourorg-yourproject-12345678-1234-5678-9012-123456789012”. We need to provide the service connection with two layers of access:

  1. Database-level permissions as a contained user to deploy schema changes
  2. Azure RBAC permissions on the SQL Server resource to manage firewall rules

Connect to the database with an Entra admin user such that you can add the service connection user in the database and provide it with elevated permissions. We start with “db_ddladmin”, “db_datareader”, and “db_datawriter”. These three roles together allow SqlPackage to create and modify schema, read existing data for comparison, and write static/reference data without granting full database ownership associated with “db_owner”. However, it is possible that in the future your database project will include database settings and changes that require elevated permissions.

CREATE USER [yourorg-yourproject-12345678-1234-5678-9012-123456789012] FROM EXTERNAL PROVIDER;

ALTER ROLE db_ddladmin ADD MEMBER [yourorg-yourproject-12345678-1234-5678-9012-123456789012];
ALTER ROLE db_datareader ADD MEMBER [yourorg-yourproject-12345678-1234-5678-9012-123456789012];
ALTER ROLE db_datawriter ADD MEMBER [yourorg-yourproject-12345678-1234-5678-9012-123456789012];

In addition to the database permissions, we also want to grant the service connection access to manage the firewall rules on the SQL Server. You may create a custom role with access to “Microsoft.Sql/servers/firewallRules/read”, “Microsoft.Sql/servers/firewallRules/write”, and “Microsoft.Sql/servers/firewallRules/delete” if your Entra license permits it. Alternatively, assigning the “SQL Server Contributor” role grants the identity access to manage the firewall and other aspects of the SQL Server in Azure.

Learn more about the “SQL Server Contributor” role to see if it’s right for your organization: https://learn.microsoft.com/azure/role-based-access-control/built-in-roles/databases#sql-server-contributor

Run SqlPackage in the pipeline

The database change deployment (migration) is applied through the SqlPackage CLI, which dynamically calculates the difference between the SQL project build artifact (.dacpac) and the database we connect to. We provide a connection string to the database in a pipeline variable, “SqlDbConnectionString”, which we may also set as a secret. The connection string is retrieved for the specific database we’re deploying to, and we want the “Active Directory Default” format for ADO.NET:

Server=tcp:yourserver.database.windows.net,1433;Initial Catalog=yourdatabase;Encrypt=True;TrustServerCertificate=False;Connection Timeout=30;Authentication="Active Directory Default";

In addition to the connection string variable, we’ll also need a variable for the server name (prefix to “.database.windows.net”) and the resource group for a total of 3 preset pipeline variables:

  1. SqlDbConnectionString
  2. SqlServerName
  3. ResourceGroup

The “Azure PowerShell” task in Azure DevOps ensures that the script we provide is run with the pipeline authenticated to Azure with the specified service connection, simplifying the pipeline’s definition a bit. We’ll use this task several times to complete the pipeline. Adding the Azure PowerShell task and providing an inline script to execute sqlpackage publish enables the pipeline to run SqlPackage as the passwordless service connection to Azure.

- task: AzurePowerShell@5
  displayName: 'Run SqlPackage'
  inputs:
    azureSubscription: 'ContosoAzure'
    ScriptType: 'InlineScript'
    Inline: |
      sqlpackage /Action:Publish /SourceFile:"./AdventureWorks/bin/Release/AdventureWorks.dacpac" /TargetConnectionString:"${env:SQLDBCONNECTIONSTRING}"
    azurePowerShellVersion: 'LatestVersion'
  env:
    SQLDBCONNECTIONSTRING: $(SqlDbConnectionString)

Navigate the Azure SQL Database firewall

If we were to run the pipeline now, it would fail to connect to the database, and the error message would provide an IP address that needs to be added to the server firewall. However, we want the firewall to be open only to the pipeline environment’s address and only for the duration of the run, and we want the pipeline to pick up the correct IP address automatically (since it will change over time in a shared environment). To meet these requirements, we’ll add 2 steps before SqlPackage runs and a final step at the end of the pipeline.

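Before the SqlPackage publish step, add a PowerShell step and an Azure PowerShell step. These 2 steps combine to:

  1. determine the IP address assigned to our pipeline environment
  2. add a firewall rule for the pipeline to access the server

The firewall rule name comes from a FirewallRuleName variable, which the complete pipeline definition later in this post generates from the build ID in its own PowerShell step.
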
- task: PowerShell@2
  displayName: 'Get Public IP and put in variable runnerIP'
  inputs:
    targetType: 'inline'
    script: |
      $runnerIP = (New-Object net.webclient).downloadstring("https://api.ipify.org")
      Write-Host "##vso[task.setvariable variable=runnerIP]$runnerIP"

- task: AzurePowerShell@5
  displayName: 'Create SQL Server Firewall Rule'
  inputs:
    azureSubscription: 'ContosoAzure'
    ScriptType: 'InlineScript'
    Inline: |
      New-AzSqlServerFirewallRule -ResourceGroupName "${env:RESOURCEGROUP}" -ServerName "${env:SQLSERVERNAME}" -FirewallRuleName ${env:FIREWALLRULENAME} -StartIpAddress ${env:RUNNERIP} -EndIpAddress ${env:RUNNERIP}
    azurePowerShellVersion: 'LatestVersion'
  env:
    SQLSERVERNAME: $(SqlServerName)
    RESOURCEGROUP: $(ResourceGroup)
    FIREWALLRULENAME: $(FirewallRuleName)
    RUNNERIP: $(runnerIP)

After the SqlPackage step in the pipeline, add an Azure PowerShell step that will be used to remove the firewall rule. This step includes a “condition” parameter that ensures it will run even if a prior step fails, like the SqlPackage publish, such that a firewall rule doesn’t remain in place.

- task: AzurePowerShell@5
  displayName: 'Remove SQL Server Firewall Rule'
  condition: always()
  inputs:
    azureSubscription: 'ContosoAzure'
    ScriptType: 'InlineScript'
    Inline: |
      Remove-AzSqlServerFirewallRule -ResourceGroupName "${env:RESOURCEGROUP}" -ServerName "${env:SQLSERVERNAME}" -FirewallRuleName ${env:FIREWALLRULENAME}
    azurePowerShellVersion: 'LatestVersion'
  env:
    SQLSERVERNAME: $(SqlServerName)
    RESOURCEGROUP: $(ResourceGroup)
    FIREWALLRULENAME: $(FirewallRuleName)

The deployment pipeline we created is set up to run only on demand, although some development environments may be automatically updated from a shared branch in source control. When we invoke the pipeline from Azure DevOps, we can follow the steps in the logs, including the modification of the firewall rules and the steps taken by SqlPackage to update the database.
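
For a development environment that should update automatically from a shared branch, one possibility is to replace the disabled trigger in the deployment pipeline with a branch trigger, mirroring the build pipeline shown earlier. This is only a sketch: the branch name develop is a hypothetical placeholder and is not part of this example's repository.

```yaml
# Hypothetical trigger for an automatically updated development environment.
# 'develop' is a placeholder for whichever shared branch feeds that environment.
trigger:
  branches:
    include:
    - develop
  paths:
    include:
    - AdventureWorks
```

The path filter keeps the deployment from running when a change touches only files outside the SQL project folder.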

The entire pipeline definition for ad-hoc deployments would be:

# AdHoc deployment pipeline
# Variables: SqlServerName, ResourceGroup, SqlDbConnectionString

trigger: none

pool:
  vmImage: ubuntu-latest

steps:

- task: DotNetCoreCLI@2
  inputs:
    command: 'build'
    projects: 'AdventureWorks/AdventureWorks.sqlproj'
    arguments: '--configuration Release'
  displayName: 'Build SQL Project to ./AdventureWorks/bin/Release/AdventureWorks.dacpac'

- script: dotnet tool install -g microsoft.sqlpackage
  displayName: 'Install SqlPackage'

- task: PowerShell@2
  displayName: 'Generate Firewall Rule Name with BuildId suffix'
  inputs:
    targetType: 'inline'
    script: |
      $firewallRuleName = "FirewallRule-$(Build.BuildId)"
      Write-Host "Generated Firewall Rule Name: $firewallRuleName"
      Write-Host "##vso[task.setvariable variable=FirewallRuleName]$firewallRuleName"

- task: PowerShell@2
  displayName: 'Get Public IP and put in variable runnerIP'
  inputs:
    targetType: 'inline'
    script: |
      $runnerIP = (New-Object net.webclient).downloadstring("https://api.ipify.org")
      Write-Host "##vso[task.setvariable variable=runnerIP]$runnerIP"

- task: AzurePowerShell@5
  displayName: 'Create SQL Server Firewall Rule'
  inputs:
    azureSubscription: 'ContosoAzure'
    ScriptType: 'InlineScript'
    Inline: |
      New-AzSqlServerFirewallRule -ResourceGroupName "${env:RESOURCEGROUP}" -ServerName "${env:SQLSERVERNAME}" -FirewallRuleName ${env:FIREWALLRULENAME} -StartIpAddress ${env:RUNNERIP} -EndIpAddress ${env:RUNNERIP}
    azurePowerShellVersion: 'LatestVersion'
  env:
    SQLSERVERNAME: $(SqlServerName)
    RESOURCEGROUP: $(ResourceGroup)
    FIREWALLRULENAME: $(FirewallRuleName)
    RUNNERIP: $(runnerIP)

- task: AzurePowerShell@5
  displayName: 'Run SqlPackage'
  inputs:
    azureSubscription: 'ContosoAzure'
    ScriptType: 'InlineScript'
    Inline: |
      sqlpackage /Action:Publish /SourceFile:"./AdventureWorks/bin/Release/AdventureWorks.dacpac" /TargetConnectionString:"${env:SQLDBCONNECTIONSTRING}"
    azurePowerShellVersion: 'LatestVersion'
  env:
    SQLDBCONNECTIONSTRING: $(SqlDbConnectionString)

- task: AzurePowerShell@5
  displayName: 'Remove SQL Server Firewall Rule'
  condition: always()
  inputs:
    azureSubscription: 'ContosoAzure'
    ScriptType: 'InlineScript'
    Inline: |
      Remove-AzSqlServerFirewallRule -ResourceGroupName "${env:RESOURCEGROUP}" -ServerName "${env:SQLSERVERNAME}" -FirewallRuleName ${env:FIREWALLRULENAME}
    azurePowerShellVersion: 'LatestVersion'
  env:
    SQLSERVERNAME: $(SqlServerName)
    RESOURCEGROUP: $(ResourceGroup)
    FIREWALLRULENAME: $(FirewallRuleName)

Wrap up

In this article we set up two Azure DevOps pipelines that enable minimal continuous integration (CI) and continuous delivery (CD) of our database, all based on a database source code format that applies to the entire Microsoft SQL family and is available in SSMS and VS Code. The CI pipeline we created focused on the SQL project build only, and your future exploration could include setting it as a required pull request check for changes to the application code that impact the database. Depending on the different testing and staging environments used by your database, you may establish multiple deployment pipelines or emit deployment scripts from SqlPackage for further review before updating an environment. Most importantly, you learned the basic components of an Azure DevOps pipeline required to keep your database secure and let the pipeline navigate both the network and authorization security steps through the Azure DevOps service connection.
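
As a rough sketch of the deployment script option, the SqlPackage Script action can stand in for Publish so that the pipeline writes a T-SQL script instead of changing the database. The output file name and the artifact name below are placeholders chosen for illustration. The step still connects to the database to compare it with the .dacpac, so it would need to sit between the firewall rule creation and removal steps, just like the publish step.

```yaml
# Generate the deployment script without applying it to the database.
# 'AdventureWorks-deploy.sql' is a placeholder file name.
- task: AzurePowerShell@5
  displayName: 'Generate deployment script for review'
  inputs:
    azureSubscription: 'ContosoAzure'
    ScriptType: 'InlineScript'
    Inline: |
      sqlpackage /Action:Script /SourceFile:"./AdventureWorks/bin/Release/AdventureWorks.dacpac" /TargetConnectionString:"${env:SQLDBCONNECTIONSTRING}" /OutputPath:"$(Build.ArtifactStagingDirectory)/AdventureWorks-deploy.sql"
    azurePowerShellVersion: 'LatestVersion'
  env:
    SQLDBCONNECTIONSTRING: $(SqlDbConnectionString)

# Keep the generated script with the run so a reviewer can download it.
# 'deployment-script' is a placeholder artifact name.
- task: PublishPipelineArtifact@1
  displayName: 'Publish deployment script'
  inputs:
    targetPath: '$(Build.ArtifactStagingDirectory)/AdventureWorks-deploy.sql'
    artifact: 'deployment-script'
```

Because the script reflects the state of the target database at generation time, it is best reviewed and applied before other changes reach that database.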

The post Fundamentals of Azure DevOps with SQL projects appeared first on Azure SQL Dev Corner.
