
Parse, Don't Guess


Last week, I shared with you how sneaky I was with transaction handling. Today, the opposite: I’ll tell you how I fixed the issue I created by trying to be too sneaky. I already told you that Sneaky Code Bites Back. The moral? Do as I say, not as I do.

In some environments, we’re spoiled. We get a lot from a Base Class Library or standard frameworks, so we forget that those issues can even exist. For instance, serialisation. Do you know how many data types JSON has? 6. Six. Sześć.

Exactly those:

  • string,
  • number,
  • boolean,
  • object,
  • array,
  • and (TADA!) null.

What about number precision and size? Undefined. That’s all the spec will tell you, and it’s not enough, e.g., to safely keep a big int/long. What about Dates? There are none, either. I wrote at greater length here about how much fun that brings.
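A quick illustration of both gaps (the values here are made up for demonstration):

```typescript
// Anything above Number.MAX_SAFE_INTEGER silently loses precision:
const { position } = JSON.parse('{ "position": 9007199254740993 }');
// position is now 9007199254740992, not the number that was written

// And Dates don't survive the round trip, they come back as plain strings:
const roundTripped = JSON.parse(JSON.stringify({ issuedAt: new Date() }));
// typeof roundTripped.issuedAt === 'string'
```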

If you have strongly typed languages and runtimes like C#, Java, etc., your serialiser can, besides parsing, also perform additional mapping and, sometimes, validation. And it can also be tricky, as Alexis King nicely put it in her “Parse, don’t validate”.

If you’re in a dynamic environment, like JavaScript, then you’re left with parsing and explicit mapping afterwards. What about TypeScript? Same case: types are only used during compilation, then erased and invisible at runtime. So runtime parsing is the place where the mapping has to happen.
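A two-line demonstration of that erasure (the Invoice type here is made up for the example):

```typescript
type Invoice = { issuedAt: Date };

// The cast satisfies the compiler, but nothing checks it at runtime:
const invoice = JSON.parse('{ "issuedAt": "2026-02-23T14:07:20Z" }') as Invoice;

// issuedAt is still a plain string, despite what the type claims:
const isRealDate = (invoice.issuedAt as unknown) instanceof Date; // false
```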

Because JSON was defined a long time ago and JavaScript has since moved on to support bigints (Big Integers) and Dates natively (what an achievement!), there’s a gap I wanted to fill.

As you know from my previous articles (e.g. this one), big integers are quite important in distributed processing. You can represent a position in a log with them. Since your log may get quite long, regular numbers aren’t enough. Or they’re enough until they overflow, and then they’re not anymore.

I’m using those bigint types extensively in the internals of Emmett and Pongo. And I store them in JSON, as numeric strings, because strings don’t have a maximum length (or at least I don’t know of one).
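For context, JSON.stringify refuses bigints outright, so the string encoding isn’t optional (a minimal repro):

```typescript
// JSON.stringify throws on bigints ("Do not know how to serialize a BigInt"):
let serializationFailed = false;
try {
  JSON.stringify({ globalPosition: 928391n });
} catch {
  serializationFailed = true;
}

// So the bigint is converted to a numeric string before serialising:
const payload = JSON.stringify({ globalPosition: 928391n.toString() });
// payload === '{"globalPosition":"928391"}'
```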

So, for instance, an event payload can look like this:

{
  "type": "InvoiceIssued",
  "data": {
    "invoiceNumber": "123",
    "version": 1,
    "issuer": "John Doe",
    "issuedAt": "2026-02-23T14:07:20Z"
  },
  "metadata": {
    "streamPosition": "3",
    "globalPosition": "928391"
  }
}

As you can see in the metadata’s stream and global positions, the values are bigints (even if they’re smaller than the regular number’s maximum), and data can also use bigints if the user decides to (e.g., for the invoice number).

And encoding data is simple: you convert the value to a string and call it a day. But how do you get it back?

And here’s where my struggles started. How do you know whether someone intentionally used a bigint, or just wanted to store a number as a string?

There are several options. The first one is to encode the value.

We could store it, for instance, as:

  • prefixed value: "_bigint:928391". But then you need to find a prefix that will be unique enough not to cause conflicts,
  • nested object, e.g. { "_kind": "bigint", "value": "928391" }.
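The prefixed-value variant could be decoded roughly like this (a sketch of the idea, not Pongo’s actual code; `_bigint:` is just the example prefix from above):

```typescript
const BIGINT_PREFIX = '_bigint:';

// Encode: prepend the marker so the decoder can recognise bigints later
const encodeBigInt = (value: bigint): string => `${BIGINT_PREFIX}${value}`;

// Decode: only strings carrying the marker become bigints again
const decodeValue = (value: unknown): unknown =>
  typeof value === 'string' && value.startsWith(BIGINT_PREFIX)
    ? BigInt(value.slice(BIGINT_PREFIX.length))
    : value;

// Round trip works, and regular numeric strings pass through untouched:
const decoded = decodeValue(encodeBigInt(928391n)); // 928391n
const untouched = decodeValue('928391'); // stays a string
```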

Then, either based on the prefix or the object structure, we could automatically decode the value. Still, this creates other issues, as the stored structure is no longer the same as the original value. If we’re just storing and retrieving, that shouldn’t be so bad, but… But remember that in Pongo I allow using PostgreSQL and SQLite as document databases, supporting queries like:

const invoices = pongoDb.collection<Invoice>("invoices");

const invoiceNumber = 123n;
const invoice = await invoices.findOne({ invoiceNumber });

That gets translated into a fancy JSONB SQL query.

Of course, I could work around it by encoding the value, but… But I was lazy!

I decided to use a Get Out of Jail Free Card and just treat all strings with numbers as bigints. Sneaky. And it will get even sneakier.

In JavaScript, JSON.parse accepts a parameter (a reviver) that allows you to provide custom mapping logic. I decided to use it to check whether a string holds a number and, gulp, I used a Regular Expression to do it:

const bigIntReviver: JSONReviver = (_key, value) => {
  if (typeof value === 'string' && /^[+-]?\d+n?$/.test(value)) {
    // BigInt() rejects the optional "n" suffix, so strip it first
    return BigInt(value.endsWith('n') ? value.slice(0, -1) : value);
  }

  return value;
};

Yes, it’s either DNS or Regex. Or both.

I explained in another article that the JavaScript runtime doesn’t like it when you do CPU-heavy computations.

A small Regex isn’t CPU-heavy, but consider that this check runs for each string in each document or event you try to deserialise, then multiply that by the number of concurrent requests. That can cause the JavaScript event loop to freeze.

What’s more, I plugged it automatically into the node-postgres driver’s custom type handling, so each JSONB deserialisation went through it.

Again, no shit, Sherlock. I should have known it wasn’t the best choice, but I was busy trying to be sneaky at that moment.

Luckily, one of the users, Dawid, did benchmarks and noticed those CPU freezes. And that was a bit shameful. It wasn’t catastrophic, but it definitely needed a fix.

The Shift

And here I had several options. I could keep hacking on the same idea — maybe replace the Regex with a simpler string check, still globally. Or I could just ignore bigints during deserialisation entirely — let them stay strings, call it a day, move on. Or I could apply the encoding I mentioned earlier — prefixed values, nested objects. All of those would fix the performance issue. And all of those would be the same kind of wrong choice I already made: trying to solve a schema problem without the schema.

Because that’s the actual mistake here, not the Regex. The pg driver has no idea what your schema looks like. It doesn’t know that "928391" is a bigint and "John Doe" is a name. It doesn’t know that "123" is an invoice number (bigint!) and "90210" is a zip code (string!). I asked it to guess, and it guessed wrong — because there is no right guess at that level.

I said to myself: enough is enough. I had been planning to do this properly for a while, and the performance issue gave me the push. Kent Beck’s old rule: make it work, make it right, make it fast. I had “make it work” covered for a long time. Now it was time for “make it right.”

And honestly? It wasn’t that hard. Maybe because “make it work” came first, I already understood the problem well enough to see the shape of the solution.

In Pongo, I dropped the automatic bigint parsing from the driver entirely. If you want bigint or date parsing, you say so at the client level:

const client = pongoClient({
  driver: databaseDriver,
  connectionString: postgresConnectionString,
  serialization: {
    options: {
      parseBigInts: true,
      parseDates: true,
    }
  },
});

By default, strings stay strings. You opt in. I didn’t want to break things for users who don’t care about bigint precision or don’t have performance-sensitive workloads. The serializer became an explicit parameter passed down to each query, each collection, each operation — instead of a global that silently changed everything.

That was the “make it right” part for Pongo. But disabling alone isn’t a solution, it’s a band-aid. Users who need bigints and dates still need a way to get them back after deserialisation. The question is: where does that conversion happen?

And that’s where upcasting comes in. Let me start with a simple example in Pongo, then build up.

Say you have a user document. In the database, dates are stored as ISO strings (because JSON). But in your application, you want proper Date objects:

type UserDocStored = {
  name: string;
  createdAt: string;
  lastLogin: string;
};

type UserDoc = {
  name: string;
  createdAt: Date;
  lastLogin: Date;
};

The upcast function does the conversion:

const upcast = (doc: UserDocStored): UserDoc => ({
  name: doc.name,
  createdAt: new Date(doc.createdAt),
  lastLogin: new Date(doc.lastLogin),
});

You wire it into the collection, and every read goes through it:

const collection = pongoDb.collection<UserDoc, UserDocStored>(
  'users',
  {
    schema: { versioning: { upcast } },
  },
);

// What's in the database: { name: 'Alice', createdAt: '2024-01-15T10:30:00.000Z', ... }
// What you get back:      { name: 'Alice', createdAt: Date, ... }

That’s all. new Date(str) is cheap. Running a Regex against every string in the document is not. The CPU freezes Dawid spotted came from that check running millions of times, at the driver level, for every field, on every concurrent request. With upcasting, the conversion runs only for the fields you declared, in a plain function, no Regex.

But this is just type mapping — the simplest case. As I wrote about in my serialisation article, the explicit mapping pattern — defining the stored shape and the application shape separately, with a function between them — is useful for much more than fixing types. It’s the same pattern you need for schema versioning.

Let’s say business requirements changed. You now need to group user data differently — a profile object for identity, a timestamps object for temporal data. The V1 documents are flat. The new V2 shape is nested:

type UserDocV1 = {
  name: string;
  createdAt: string;
  lastLogin: string;
};

type UserDocV2 = {
  profile: {
    name: string;
  };
  timestamps: {
    createdAt: Date;
    lastLogin: Date;
  };
};

This isn’t just a type change like string-to-Date anymore. The structure itself is different — flat fields became nested objects, field names moved into sub-objects. And you have thousands of V1 documents already stored. You can’t migrate them all at once (or don’t want to — it’s risky, and some consumers might still expect V1). But your application now expects V2.

Compatibility FTW

This is where backward and forward compatibility come in.

Backward compatibility means: old data still works. V1 documents that were stored months ago need to be readable by V2 code. The upcast handles this — it reads whatever shape the document has and transforms it into V2.

Forward compatibility means: new data doesn’t break old consumers. If you have another service, or an older deployment, that still reads V1 format, it needs to keep working. The downcast handles this — when storing V2 documents, it writes the V1 fields alongside the V2 fields, so older readers can still find what they expect.

Together:

type StoredPayload = UserDocV1 & Partial<UserDocV2>;

const upcast = (doc: StoredPayload): UserDocV2 => ({
  profile: doc.profile ?? { name: doc.name },
  timestamps: {
    createdAt: new Date(doc.timestamps?.createdAt ?? doc.createdAt),
    lastLogin: new Date(doc.timestamps?.lastLogin ?? doc.lastLogin),
  },
});

const downcast = (doc: UserDocV2): StoredPayload => ({
  name: doc.profile.name,
  createdAt: doc.timestamps.createdAt.toISOString(),
  lastLogin: doc.timestamps.lastLogin.toISOString(),
  profile: doc.profile,
  timestamps: doc.timestamps,
});

Look at the upcast: if the nested profile or timestamps fields exist (document was written by V2 code), it uses them. If they don’t exist (old V1 document), it falls back to the flat fields. One function handles both old and new documents — that’s backward compatibility.

And look at the downcast: it writes name, createdAt, lastLogin as flat string fields (V1 shape) alongside profile and timestamps (V2 shape). A service still reading V1 sees the flat fields and works fine. A service reading V2 sees the nested ones. That’s forward compatibility.

You wire both into the collection:

const collection = pongoDb.collection<UserDocV2, StoredPayload>(
  'users',
  {
    schema: { versioning: { upcast, downcast } },
  },
);

From here, your application code only deals with V2. The collection handles the translation in both directions:

const v2Doc: UserDocV2 = {
  profile: { name: 'Alice' },
  timestamps: {
    createdAt: new Date('2024-01-15T10:30:00.000Z'),
    lastLogin: new Date('2024-06-20T14:45:00.000Z'),
  },
};

await collection.insertOne(v2Doc);

// What's stored (downcasted — both shapes for compatibility):
// { name: 'Alice', createdAt: '2024-01-15T10:30:00.000Z',
//   lastLogin: '2024-06-20T14:45:00.000Z',
//   profile: { name: 'Alice' },
//   timestamps: { createdAt: '2024-01-15T10:30:00.000Z',
//                 lastLogin: '2024-06-20T14:45:00.000Z' } }

// What you read back (upcasted to V2):
const doc = await collection.findOne({ ... });
// { profile: { name: 'Alice' },
//   timestamps: { createdAt: Date, lastLogin: Date } }

Same collection, V1 and V2 documents coexisting. insertMany, replaceOne, findOne — all go through the upcast/downcast. No batch migration needed. You roll out the new code, and old documents are handled transparently.

There’s another thing the downcast gives you: querying stays backward compatible. Because the downcast writes the flat V1 fields alongside the nested V2 ones, a query like collection.findOne({ name: 'Alice' }) still works — even though V2 code doesn’t use name directly anymore. The V1 field is there in the stored document. That matters if you have queries or indexes built against the old shape. They don’t break.

Now, for events this matters even more. In event sourcing, stored events are immutable — the log is append-only, and you don’t modify what was already written. I wrote about versioning patterns in more detail, but the core idea is: your business evolves, your code evolves, your event schemas evolve, but the events in the store stay as they were. You can’t go back and rewrite them (well, you can, but you really shouldn’t). Upcasting is how you bridge the gap.

For Emmett, the same pattern works at the event store level. You define the stored shape (what JSON gives you from the database) and the application shape (what your code works with):

type ShoppingCartOpenedFromDB = Event<
  'ShoppingCartOpened',
  { openedAt: string; loyaltyPoints: string }
>;

type ShoppingCartOpened = Event<
  'ShoppingCartOpened',
  { openedAt: Date; loyaltyPoints: bigint }
>;

And an upcast that handles each event type:

const upcast = (event: Event): ShoppingCartEventWithDatesAndBigInt => {
  switch (event.type) {
    case 'ShoppingCartOpened': {
      const e = event as ShoppingCartOpenedFromDB;
      return {
        ...e,
        data: {
          openedAt: new Date(e.data.openedAt),
          loyaltyPoints: BigInt(e.data.loyaltyPoints),
        },
      };
    }
    case 'ShoppingCartConfirmed': {
      const e = event as ShoppingCartConfirmedFromDB;
      return {
        ...e,
        data: {
          confirmedAt: new Date(e.data.confirmedAt),
          totalCents: BigInt(e.data.totalCents),
        },
      };
    }
    default:
      return event as ShoppingCartEventWithDatesAndBigInt;
  }
};

You pass it when reading a stream:

const { state } = await eventStore.aggregateStream<
  ShoppingCartState,
  ShoppingCartEventWithDatesAndBigInt
>(shoppingCartId, {
  evolve: evolveState,
  initialState,
  read: { schema: { versioning: { upcast } } },
});

Or in a command handler:

const handle = CommandHandler<ShoppingCart, ShoppingCartEvent>({
  evolve,
  initialState: () => ({ ... }),
  schema: { versioning: { upcast: upcastDatesAndBigInt } },
});

The difference with events is that you can’t update them in place. For documents, you have both directions — upcast on read, downcast on write. For events, upcasting is the main tool, because the event store is append-only. Old events stay as they were written. But downcasting has its place too.

Consider this: you have a projection or a subscriber that was built months ago against the old event schema. Maybe it’s a read model that listens to ShoppingCartOpened and expects clientId as a flat string. But your current code evolved — now ShoppingCartOpened carries a client object with id and name:

// What old subscribers expect
type ShoppingCartOpenedV1 = Event<
  'ShoppingCartOpened',
  { clientId: string; openedAt: string }
>;

// What current code produces
type ShoppingCartOpenedV2 = Event<
  'ShoppingCartOpened',
  { client: { id: string; name: string }; openedAt: Date }
>;

Upcasting helps the current code read old events — the ones stored with just clientId. Downcasting helps old subscribers consume new events — it transforms the new client object back into the flat clientId they expect. Same principle as with documents, but especially important here because event subscribers often live in separate services or deployments that you can’t update all at once.
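A sketch of what that downcast could look like (the data shapes match the V1/V2 types above; the function name is mine):

```typescript
type ShoppingCartOpenedV1Data = { clientId: string; openedAt: string };
type ShoppingCartOpenedV2Data = {
  client: { id: string; name: string };
  openedAt: Date;
};

// Flattens the V2 client object back into the flat clientId
// that V1 subscribers still expect:
const downcastShoppingCartOpened = (
  data: ShoppingCartOpenedV2Data,
): ShoppingCartOpenedV1Data => ({
  clientId: data.client.id,
  openedAt: data.openedAt.toISOString(),
});
```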

And the same upcast function that started as simple type mapping — string → Date, string → bigint — handles this structural change too. You just add another case to the switch:

case 'ShoppingCartOpened': {
  const e = event as ShoppingCartOpenedV1;
  return {
    ...e,
    data: {
      client: { id: e.data.clientId, name: 'Unknown' },
      openedAt: new Date(e.data.openedAt),
    },
  };
}

Old events get the client object synthesised from the flat clientId. New events already have it. The evolve function only deals with V2 shape.

And here’s where things started to click for me. I added upcasting to fix the bigint problem — explicit type mapping instead of a Regex. But the same mechanism, without any changes, also handles structural versioning. The simple string → Date mapping from the first example is the same code path as the clientId → client migration above. It’s one function, one place, one pattern for all of it — type coercion, field restructuring, schema migration.

Right decisions stack

Right decisions stack. The Regex hack was blocking the slot where upcasting should have been all along. Once I removed it, the performance problem was fixed, and I got schema versioning on top. One fix created room for the next one, which created room for the next. That doesn’t happen when you keep patching around the same bad decision.

Looking back, maybe the Regex wasn’t the wrong first move. Kent Beck’s rule is “make it work, make it right, make it fast” — in that order. The Regex made it work. It had performance problems, but it let me ship, and it let me learn where the real problem was. If I had tried to design the upcast/downcast system from scratch, without having lived with the Regex for a while, I might have over-engineered it or missed the connection to schema versioning entirely. The understanding came from living with the shortcut.

Dawid raised the performance issue on Pongo projections, but the same Regex was running in Emmett too. I could have fixed it in one place and called it a day. Instead, I used it as a push to do the thing I’d been planning anyway — and applied it to both Pongo and Emmett to keep things consistent. Because I already understood the problem well enough, “make it right” turned out easier than I expected.

You can recover from shortcuts. You should. But you also shouldn’t be afraid to take them in the first place — as long as you come back and do it properly.

Full changes:

Cheers!

Oskar

p.s. Ukraine is still under brutal Russian invasion. A lot of Ukrainian people are hurt, without shelter and need help. You can help in various ways, for instance, directly helping refugees, spreading awareness, putting pressure on your local government or companies. You can also support Ukraine by donating e.g. to Red Cross, Ukraine humanitarian organisation or donate Ambulances for Ukraine.


The secret life of database keys


There’s not much to database keys, right? When you build a new table, you add a column to act as the primary key, then set it to auto-generate. Done. That’s all there is to know…right?

The goal of this article is to demonstrate that there’s much more to the process than this. There are options to selecting, defining, and using keys. For optimal design and performance, consider your choices carefully.

What are the different types of database keys?

Keys come in two basic flavors: natural and surrogate. Natural keys, as their name suggests, are values that occur naturally in the data. They have real-world or business meaning: an email address, the street address of a building, the serial number of an inventory item, etc. In contrast, a surrogate key (sometimes called a synthetic key) is a value added to each row upon creation. A surrogate exists only within the confines of the database; it has no meaning outside of it.

A natural key often contains two or more columns. For instance, the key for a table of vehicle types might include the make, model, year, and trim level, all four columns of which must be combined to create the value that uniquely identifies each row. Surrogate keys are always a single column, though the value of each key may be anything you choose – as long as each value is distinct from all others.

Why use a surrogate database key?

Some tables lack natural keys and so a surrogate must be used. But even when a natural key exists, developers still usually add a surrogate. In fact, in many development shops, it’s best practice to add a surrogate to all tables. But if a natural key is already in the data, why go to the extra trouble to add a surrogate? There are a few reasons why natural keys may be unsuitable:

  1. They can be unstable. Since natural keys have real-world meaning, their values can change over time. Even values we think of as constant – like a social security number or building’s street address – can change on rare occasions. When that happens, it can mean updating not just the parent table, but thousands, millions, or even billions of foreign keys in child tables. Surrogates, though, exist only in our database: we have total control over them, and can ensure they never change.

  2. They can be incomplete. If you track customers by email address, what do you do for the rare individual who either lacks an email or refuses to share it? If a table of cars for sale tracks vehicles by VIN, how do you handle older vintage vehicles that lack this ID?

  3. They can be unwieldy or overly long. When a key spans several columns, referencing them all for every lookup makes for cumbersome queries. And even a single-column natural key may run to dozens or even hundreds of bytes. Replacing a long text value with a single 4-byte or 8-byte integer means less storage and smaller, faster indexes, especially when the value is used in many foreign keys.

In the early days of databases, adding a surrogate key and automatically generating its value was difficult and fraught with pitfalls. That’s no longer true: all modern DBMSs natively support automatically generated surrogates with just a single statement, though the syntax varies slightly between platforms.

Should you always use a surrogate database key?

Now we ask the opposite question: with all the advantages surrogates have, why would we not use one? The simple answer is performance. Even with today’s hardware, wider rows mean worse performance. On most tables, the overhead of a single additional column is negligible, but for large, narrow tables (such as certain time-series data) the performance hit can be significant.

Also, we often must index the natural key regardless of whether it’s used as the primary key, so a surrogate means an extra index, further impacting performance. If the natural key is stable and succinct, you may wish to consider using it alone.

Another common area where surrogates generally aren’t used is in a so-called “junction table” (associative entity), made to implement a many-to-many relationship between two other tables. The primary key for a junction table is (usually) the two foreign keys that link to the base tables.

An image showing the junction model pattern.

How to populate a surrogate database key

The vast majority of surrogate keys are auto-generated from a numeric sequence: each key is one larger than the one before. But there is nothing inherent in surrogates that requires them to work like this, and sometimes that behavior isn’t ideal; because numeric sequences are predictable, they expose a certain amount of information you may wish to keep private.

For example, if a customer is sent invoice #10500 one month and invoice #10600 the next, they know the system is generating 100 invoices a month. If “transaction ID #96591” appears on a receipt, then a potential hacker knows what the next transaction ID will be, and may be able to exploit that.

If users never see the primary key, these issues don’t arise – but if they do, you may wish to consider keys based on something other than a numeric sequence: a non-sequential surrogate.


Everything you need to know about non-sequential surrogates

For auto-generated keys, the most common non-sequential type is the Universal Unique Identifier (UUID). Sometimes called a Globally Unique Identifier (GUID), it’s 128 bits wide – 32 hexadecimal characters separated by four hyphens. There are several different variants (eight, to be precise) but they all share one common trait: they’re wide enough that collisions (duplicate keys) are nearly impossible, even when creating quadrillions of random values. Some UUIDs are time-based, so as long as you don’t create more than one every 100 nanoseconds or so, a duplicate key is actually impossible.

Because UUID collisions are so unlikely, you don’t need a central database to assign key values: they can be created at the application level, even by remote computers or mobile devices when they lack a database connection. This ability to work in a disconnected, decentralized manner is the primary benefit of UUIDs.
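In Node.js, for instance, that generation is one built-in call away, with no database involved (a minimal sketch):

```typescript
import { randomUUID } from 'node:crypto';

// Generated entirely in-process: no sequence, no round trip to a database.
// Returns a random version-4 UUID, e.g. '3b241101-e2bb-4255-8caf-...'
const id = randomUUID();
```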

This feature is very useful but comes with a significant drawback: performance. 128-bit keys are much larger than a typical four-byte or eight-byte numeric key – they take more storage and more time to sort and compare. Indexes on UUIDs take up more space in memory, which means less of the index can be loaded at one time. And using a non-sequential primary key means many more page splits during index updates. For insert-heavy tables like time-series data, this further erodes performance.

An image showing the anatomy of a UUID.

Note: there are multiple UUID variants. If you decide to use UUIDs in your application, investigate which one best fits your needs. The Version 1 form pictured above exposes your server’s MAC address, which may be a problem for some applications.

What are some alternatives to UUID?

There are several successors to UUIDs, such as the Universally Unique Lexicographically Sortable Identifier (ULID), designed to improve performance or address security concerns. If you just want to hide the sequential nature of key values from prying eyes, however, the best way to do this is to generate the values sequentially as normal, but encode or obfuscate them in some manner.

Libraries exist specifically for this, such as Sqids, but replacing a key value with its MMI (modular multiplicative inverse) is fast, secure, and can be calculated in a single line of code.
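That “single line” deserves a sketch. The idea: multiply the key by a constant modulo some modulus on the way out, and multiply by the constant’s modular inverse on the way back in. The constants below are arbitrary examples, not recommendations:

```typescript
// Extended Euclidean algorithm: finds x such that (a * x) % m === 1n,
// assuming a and m are coprime
function modInverse(a: bigint, m: bigint): bigint {
  let [oldR, r] = [a, m];
  let [oldS, s] = [1n, 0n];
  while (r !== 0n) {
    const q = oldR / r;
    [oldR, r] = [r, oldR - q * r];
    [oldS, s] = [s, oldS - q * s];
  }
  return ((oldS % m) + m) % m;
}

const MODULUS = 1n << 31n;   // fits a 4-byte unsigned key
const KEY = 1580030173n;     // any odd constant works with a power-of-two modulus
const KEY_INVERSE = modInverse(KEY, MODULUS);

// The promised one-liners: obfuscate on the way out, reveal on the way in
const obfuscate = (id: bigint): bigint => (id * KEY) % MODULUS;
const reveal = (value: bigint): bigint => (value * KEY_INVERSE) % MODULUS;

// Sequential ids map to scattered, unrelated-looking public values:
const publicId = obfuscate(10500n);
const original = reveal(publicId); // 10500n again
```

Because the obfuscated values stay inside the same integer range, they can be stored and indexed exactly like the originals.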

Sometimes obfuscation is done at the application level, encoding and decoding the value each time it’s accessed, but if you store the obfuscated value directly in the database, there’s no need to ever recover the original value.

If you just wish to make the sequential character of the key less glaringly apparent, you can set the sequence to increment by a value other than one. For instance, in SQL Server, this statement defines an auto-generated primary key that starts at value 100 and increments by 3 each time:

CREATE TABLE employees (
  employee_id  INT IDENTITY(100,3) PRIMARY KEY,
  ...
  )

This obviously isn’t very secure, and it also consumes three times as many key values per row, so if you use this trick, remember to size your key appropriately.

Why you should keep your database keys to yourself

The value of a surrogate key relies upon it being isolated from the outside world. Every time you expose a key to users, you erode that isolation ever so slightly. This is an entirely separate problem from the security issues of sequential keys.

The more users see a key value, the more you risk the value gaining meaning to them, and then you no longer have complete control over it. This isn’t simply hypothetical: in one case I experienced, a major telecom company implemented a new system with fresh surrogate keys for its network elements, but the keys were included in pipelines to other systems.

They appeared in reports, URLs, dashboards… and users began to refer to elements by key value. Eventually, they demanded (and received) the ability to search on the key like any other value. The surrogate had become a natural key – owned by the users, not the database developers.


What are smart database keys – and why are they the ‘worst of both worlds’?

A smart key is a sort of hybrid between a surrogate and natural key. The thought process that leads to a smart key generally runs something like this: you have several columns in a table that, when combined, almost – but not quite – uniquely identify each row. You decide to add a short sequential or random value to this, creating an almost-surrogate that still preserves the business meaning.

An image showing an example of a smart key in the form of a vehicle's VIN.

Anyone who’s ever designed such a key (myself included) is usually captivated by its surface elegance but, like a used car purchased from a shady dealer, problems soon begin to arise.

A smart key has all the drawbacks of both surrogate and natural keys, and no real advantage over either. Sometimes these keys are demanded by business requirements, and sometimes we receive them from other sources, in which case they’re technically a natural key (we have no control over them). If you’re thinking about creating your own, it’s generally best to avoid them.

How big a database key do I need?

One of the most painful acts a DBA must perform is expanding a key column to a larger type, which often means altering not just the primary table itself, but foreign keys in dozens of other tables, and rebuilding all associated indexes. Give yourself room for growth – when a table outgrows its key, you have no choice but to expand it. A four-byte integer is limited to 2.1 billion rows (double that if your particular DBMS supports autogenerating unsigned integers) – estimate the maximum growth over the next several years, then double it. If you’re anywhere near the limit, use an eight-byte key.

Planning for growth, though, can be taken to extremes. Don’t automatically use an overly-large key for tables that will always remain small. Eight-byte BIGINTs take more space, which means much bigger indexes, fewer index pages in memory, and slower operation overall.

Regardless of what size you use, ensure all your foreign key references match the primary’s type. While some DBMSs allow mismatched types, this will cause problems down the road.

What are primary database keys?

This is a more subtle question than it appears. Since SQL-92, the language has allowed you to declare primary keys within the table definition. But this is simply a declaration; what defines the primary key is your use of it. The value you use primarily in queries to uniquely identify rows and in foreign-key references is the primary key. The declaration in the table definition documents your choice – it doesn’t enforce it.

Why is this distinction important? Because declaring the primary key does more than just document it – it creates an index on the key as well. There are times you may want to override how this index is created, or defer creation until later (such as bulk loading a table). That’s OK. The primary key is still there, whether or not you declare it as such.
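To see the declaration’s side effect concretely, here’s a minimal sketch using Python’s built-in sqlite3 module (SQLite stands in for any DBMS; the automatic index name shown is SQLite-specific):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Declaring the primary key documents the choice AND creates a unique index.
conn.execute("CREATE TABLE users (email TEXT PRIMARY KEY, name TEXT)")

indexes = [row[0] for row in conn.execute(
    "SELECT name FROM sqlite_master WHERE type = 'index' AND tbl_name = 'users'"
)]
print(indexes)  # SQLite names the automatic index sqlite_autoindex_users_1
```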

Do I always need a primary database key?

For most people, this question is a no-brainer: every table needs a key, period. After all, without a key, there’s no way to uniquely identify a single row for updating or deleting. With no key, there’s no way to link child rows to the table either. Tables need keys like fish need water…right?

There is a rare but important case in which you may wish to consider leaving a table keyless. If the table contains immutable data – data which business rules require to never be altered once inserted – then the lack of a primary key is a powerful tool to enforce this rule. This isn’t iron-clad protection – most DBMSs have language extensions that can reference a keyless row – but it is a clear, self-documenting reminder of the immutable nature of the data. The data must be in a leaf table (one without child tables), as the lack of a key obviously prevents any foreign key references.

What are foreign database keys?

Foreign key declarations are a contentious topic: developers love them for their ability to securely enforce referential integrity, but DBAs often resist them for complicating some database maintenance operations. And all those integrity checks do eventually impact performance. Some shops demand all FKs be declared, others don’t allow any whatsoever, and some leave it up to the individual developer.

Regardless of what standard your shop agrees upon, it’s important to remember these declarations are no different than for a primary key: the declaration documents and enforces the foreign key, but defining the foreign key is done by your usage of it. If you don’t declare a FK at the database level, be prepared to enforce referential integrity at the application level.
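As a concrete sketch of database-level enforcement, again using Python’s built-in sqlite3 (note that SQLite keeps FK checks off unless you opt in with a PRAGMA; table names are illustrative):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite disables FK checks by default
conn.execute("CREATE TABLE customers (id INTEGER PRIMARY KEY)")
conn.execute(
    "CREATE TABLE orders ("
    "  id INTEGER PRIMARY KEY,"
    "  customer_id INTEGER NOT NULL REFERENCES customers (id))"
)
conn.execute("INSERT INTO customers (id) VALUES (1)")
conn.execute("INSERT INTO orders (id, customer_id) VALUES (10, 1)")  # valid parent

orphan_rejected = False
try:
    # No customer 999 exists; the declared FK blocks the orphan row.
    conn.execute("INSERT INTO orders (id, customer_id) VALUES (11, 999)")
except sqlite3.IntegrityError:
    orphan_rejected = True
print("orphan rejected:", orphan_rejected)
```

Without the declaration, that second insert would silently succeed – and the orphan check would be your application’s job.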

Final thought: keep exploring database keys

The topic of database keys is a pool both deep and wide. This article only dived into the shallow end of that pool, but hopefully it’s convinced you that the rest of the water is worth exploring.

FAQs: Database keys

1. What is a database key?

A database key is a column or set of columns that uniquely identifies a row in a table. Keys are used to link tables, enforce data integrity, and improve query performance through indexing.

2. What are the different types of database keys?

The two main database keys are:

  • Natural keys – real-world values like email address, VIN, or serial number.

  • Surrogate keys – artificial identifiers (often auto-generated integers or UUIDs) with no business meaning.

Natural keys may span multiple columns, while surrogate keys are always a single column.

3. Why use a surrogate key instead of a natural key?

Surrogate keys are preferred because natural keys can be:

  • Unstable: real-world values change.

  • Incomplete: some rows may not have a usable natural key.

  • Large or composite: slower joins and bigger indexes.

Surrogate keys are small, stable, and efficient for joins and indexing.

4. Should every table use a surrogate key?

Not every table should use a surrogate key. Avoid surrogates when:

  • The natural key is short, stable, and meaningful.

  • The table is large and narrow (extra columns hurt performance).

  • The table is a junction table (many-to-many relationships), where a composite key of two foreign keys is usually best.
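The junction-table case can be sketched in a few lines of sqlite3 (table and column names are illustrative): the composite of the two foreign keys is the primary key, and a surrogate id column would add nothing.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE students (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE courses  (id INTEGER PRIMARY KEY, title TEXT);
    -- Junction table: the pair of FKs is itself the primary key.
    CREATE TABLE enrollments (
        student_id INTEGER NOT NULL REFERENCES students (id),
        course_id  INTEGER NOT NULL REFERENCES courses (id),
        PRIMARY KEY (student_id, course_id)
    );
""")
conn.execute("INSERT INTO students VALUES (1, 'Ada')")
conn.execute("INSERT INTO courses VALUES (7, 'Databases')")
conn.execute("INSERT INTO enrollments VALUES (1, 7)")

duplicate_blocked = False
try:
    conn.execute("INSERT INTO enrollments VALUES (1, 7)")  # same pair twice
except sqlite3.IntegrityError:
    duplicate_blocked = True
print("duplicate enrollment blocked:", duplicate_blocked)
```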

5. How should surrogate keys be generated?

Most systems use auto-incrementing numeric sequences – these are fast and compact, but predictable. If key values are exposed to users, consider non-sequential alternatives or obfuscation.

6. Are UUIDs good primary keys?

Use UUIDs for distributed or offline systems. They ensure global uniqueness but use more space and perform worse than numeric keys.
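For scale, Python’s standard uuid module illustrates the trade-off: any node can mint a key without coordinating with a central sequence, but each value is 16 bytes (versus 4 for INT and 8 for BIGINT):

```python
import uuid

# Two clients generating keys independently still won't collide in practice.
key_a = uuid.uuid4()
key_b = uuid.uuid4()
print(key_a)
print(len(key_a.bytes), "bytes per key")  # 16 -- twice the size of a BIGINT
print("distinct:", key_a != key_b)
```

Random (version 4) UUIDs also arrive in no particular order, which scatters writes across index pages – one reason sortable alternatives like ULID exist.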

7. What are alternatives to UUIDs?

  • ULID (sortable, better index performance)

  • Obfuscated sequential IDs using tools like Sqids

8. Should database keys be exposed to users?

No, database keys should not be exposed to users. Exposed keys gain meaning, create coupling, and limit future schema changes. Use public IDs instead.

9. What are smart keys?

Smart keys are hybrid keys with business meaning + sequence. They’re fragile and usually the worst option, so avoid when possible.

10. How big should a primary key be?

Use the smallest type of primary key that supports future growth:

  • INT for smaller tables

  • BIGINT for large or fast-growing tables

Always match foreign key types to the primary key.

11. Do all tables need a primary key?

Nearly all tables need a primary key – the rare exception is immutable, append-only leaf tables.

The post The secret life of database keys appeared first on Simple Talk.

Read the whole story
alvinashcraft
5 minutes ago
reply
Pennsylvania, USA
Share this story
Delete

Simplifying Customer Service with BoldDesk: A Seamless Engagement Experience



In today’s business environment, exceptional customer service isn’t just nice to have; it’s a necessity. Customers expect quick responses, personalized interactions, and effortless problem resolution. However, many organizations struggle with fragmented systems and inefficient workflows that slow down support teams and frustrate customers.

Enter BoldDesk®. This help desk platform is designed to transform the way businesses handle customer service by providing a unified, intuitive platform that streamlines engagement from start to finish. Whether you’re managing inquiries, tracking issues, or collaborating across teams, BoldDesk ensures every interaction is smooth, efficient, and customer-focused.

Tax outreach team

I am a lead for the tax outreach team. BoldDesk assists a great deal, specifically with our customer exemption processes. Tax-exemption engagement for us is how our business interacts with customers when requesting and managing tax-related documents. That includes everything from the tone of our emails and the clarity of our instructions to the ease of submitting forms and the follow-up experience.

A well-designed engagement strategy turns a typically tedious process into a smooth and even positive interaction.

To elevate our customer engagement strategy, integrating BoldDesk has been a game-changer. BoldDesk offers a robust platform for managing our customer interactions, streamlining document collection, and automating some of our communication. All this is essential for an efficient tax outreach process.

How BoldDesk supports our tax outreach workflows

Centralized communication hub

BoldDesk enables my team to manage all customer inquiries related to tax exemption certificates or direct pay permits in one place. This ensures no request goes unnoticed and allows for faster resolution times, thus preventing any part of the process from slipping through the cracks.

Customizable forms and workflows

BoldDesk creates tailored submission forms for exemption certificates. These forms guide customers through the process, reducing errors and improving submission rates.

Knowledge base integration

Our customers are provided with easy access to state-specific exemption forms, FAQs, and instructional content through our self-service portal. This empowers them to find answers and complete tasks independently.

Real-time support channels

BoldDesk supports our chat and email integration. This allows our team to offer responsive assistance when customers have questions or need help navigating the tax outreach process. We can discuss customer requests in the private notes and assist customers in solving specific tasks.

Analytics and reporting

I can track key metrics like response times, ticket resolution rates, and certificate collection efficiency. These insights help us continuously refine our engagement strategy, developing new workflows and processes. It also helps our team ensure SLA compliance.

By integrating BoldDesk into our tax outreach process, we not only improve operational efficiency but also foster a more transparent and customer-friendly experience. This positions our business as proactive, professional, and committed to compliance.

Why it matters

The BoldDesk platform assists our team by facilitating the following.

Ensures tax compliance

BoldDesk ensures that we have the most accurate exemption documentation, which is vital for audit defense and compliance. Engaging customers proactively helps us collect what’s needed without delay or confusion.

Reduces processing delays

In our experience, when customers understand what’s specifically required and how to submit it, our team spends less time chasing documents or correcting errors.

Improves customer trust and satisfaction

A professional, helpful process shows customers that our business is organized and respects their time. This fosters trust between my team and customers, thus making our tax outreach process more effective.

Protects revenue and margins

Improper tax handling can lead to penalties or missed revenue. BoldDesk facilitates accurate engagement, thereby helping minimize financial risks. Accuracy is built into the process: documents can be exported to Excel, where we can run reports and reduce arithmetic errors.

Final thoughts

BoldDesk has truly been a game-changer for my team. Its comprehensive suite of features, ranging from centralized communication and automated ticketing to customizable workflows, has streamlined our tax outreach process and enhanced our operational efficiency. BoldDesk increases the success rate of the tax outreach team.


Podcast: [Video Podcast] Frictionless DevEx with Nicole Forsgren


In this episode, Thomas Betts talks with Dr. Nicole Forsgren, the author of Accelerate and one of the most prominent and important minds in DevOps and developer productivity. The conversation is about identifying and removing developer friction, the subject of her new book, Frictionless.

By Nicole Forsgren

Run Pulumi Insights on Your Own Infrastructure

1 Share

Pulumi Insights gives you visibility and governance across your entire cloud footprint: discovery scans catalog every resource in your cloud accounts, and policy evaluations continuously enforce compliance against those resources. Until now, Insights workflows ran exclusively on Pulumi-hosted infrastructure. That works well for many teams, but enterprises with strict data residency requirements, private network constraints, or regulatory obligations need to run this work in their own environments. Today, Pulumi Insights supports customer-managed workflow runners for both SaaS Pulumi Cloud and self-hosted Pulumi Cloud installations.

Insights at a glance

Insights provides two complementary capabilities that together form a governance lifecycle for your cloud infrastructure.

Discovery scans cloud accounts across AWS, Azure, GCP, and more to catalog every resource regardless of how it was provisioned: Pulumi, Terraform, CloudFormation, or manual creation. Once cataloged, you can search, filter, group, and export your resource data. You can also import unmanaged resources into Pulumi to bring them under IaC management.

Policy enforces compliance with policy-as-code written in TypeScript or Python. Pulumi ships pre-built compliance packs for CIS, NIST, PCI DSS, HITRUST, and other frameworks so you can start evaluating without writing any code. Audit policy groups continuously evaluate all discovered resources and IaC stacks, while preventative policies block non-compliant deployments before they reach production.

This enables you to map out your cloud estate, evaluate compliance, and then remediate any issues uncovered by policy.

Why self-hosted?

Running Insights on your own infrastructure with customer-managed workflow runners gives you:

  • Data residency: Scan execution and policy evaluation run entirely within your private network.
  • Private infrastructure access: Scan resources in VPCs and environments that are not accessible from the public internet.
  • Compliance: Cloud provider credentials can stay internal to your network, meeting regulatory requirements for credential handling.
  • Flexible hosting: Run workflow runners on any environment that meets your needs, including Linux, macOS, Docker, and Kubernetes.

How it works

Customer-managed workflow runners are lightweight agents that poll Pulumi Cloud for pending work, execute it locally, and report results back. You can configure runners to handle specific workflow types: discovery scans, policy evaluations, deployments, or all three.

This works identically whether you use SaaS Pulumi Cloud or a self-hosted installation. The runner communicates with the Pulumi Cloud API over HTTPS, so no inbound connectivity is required, making it well suited to run in restricted network environments.
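The poll-execute-report loop described above can be sketched as follows. This is a hypothetical illustration – the function names and job shapes are invented for this sketch, not the actual Pulumi Cloud API the runner speaks over HTTPS:

```python
# Hypothetical sketch of a customer-managed runner's main loop.
def run_agent(poll, execute, report):
    # Outbound polling only: no inbound connectivity is required.
    while (job := poll()) is not None:  # a real runner would sleep and retry
        result = execute(job)           # scan, policy eval, or deployment
        report(job["id"], result)

# Tiny in-memory stand-in for the cloud-side work queue:
queue = [{"id": 1, "type": "discovery-scan"}, {"id": 2, "type": "policy-eval"}]
done = {}
run_agent(
    poll=lambda: queue.pop(0) if queue else None,
    execute=lambda job: f"{job['type']} complete",
    report=done.__setitem__,
)
print(done)
```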

Under the hood, this is powered by a distributed work scheduling system that routes activities to the right runner pool, handles lease-based execution, and recovers automatically from failures. For a deep dive on the architecture, see How We Built a Distributed Work Scheduling System for Pulumi Cloud.

If your team already uses customer-managed workflow runners for Pulumi Deployments, your existing runner pools can handle Insights workflows with no additional infrastructure.

Get started

Self-hosted Insights is available on the Business Critical edition of Pulumi Cloud. To learn more or get set up:


TempDB Internals – What’s New (Sql Server 2016 to 2025)


I wrote about TempDB Internals before, and tempdb plays a very important role in SQL Server performance: everything temporary, including memory spills, ends up in tempdb, so it is important that tempdb works at its best. Microsoft has made a lot of … Continue reading →

The post TempDB Internals – What’s New (Sql Server 2016 to 2025) appeared first on SQLServerCentral.
