Full Transcript
Erik Darling here with Darling Data. Excited to finally get back to some more SQL Server specific related content, not about monitoring tools or vectors or any of the other stuff that we’ve been talking about here on the Chan recently. Today we’re going to get back to some of the Learn T-SQL with Erik content, recent updates, and we’ll see some more updates as soon as the material is fully tech-reviewed and revised. But today we’re going to talk about heaps. And I know what you’re going to say, Erik, heaps are not T-SQL. And I agree. But part of being an advanced T-SQL practitioner is understanding different storage formats that SQL Server uses and some of the upsides, downsides, and sidesides of them. So in today’s video, we’re going to talk about heaps generally, one condition that they avoid and two conditions that they are prone to. This is also, we’ll talk briefly about sort of the proper place in the world for heaps as well. Before we do that, if you look down in the video description, you will find all sorts of helpful links. You can hire me for consulting. You can even buy my training, including the full version of the small snippet of training that we’re going to be talking about in today’s video, which is the Learn T-SQL with Erik course. There is a discount code down there if you want to check that out. You can become a supporting member of the channel. You can ask me office hours questions. And one thing, I mean, it’s not just exactly a link that you can click, but there are buttons that allow you to like, subscribe, and share these videos with people who you feel they might demonstrate some value to. So, you’ve got all those things available for free and for pay. Another thing for free is my SQL Server performance monitoring tool. That’s at my GitHub repo. The link is there, but the link is also down in the video description as well. It’s totally free, totally open source, doesn’t require you to sign up, give me your email or anything like that. It’s just a bunch of T-SQL collectors that go and monitor all the stuff that I would monitor if I really wanted to monitor a SQL Server.
And so that’s what I do. It captures all the stuff that you would care about. It’s got a Nox-style dashboard if you want to just get a quick glance at server health. And if you are a fan of our new robot overlords, it has optional, built-in, you have to opt in to use them, MCP servers. So you can have your favorite robot talk to your performance data and just your performance data uses a bunch of read-only tools to do that. And it can give you all sorts of potentially helpful, dangerous advice that I am not liable for. Anyway, I’m going to be getting out and about in the world, all sorts of places that I am excited to be.
SQL day Poland will be May 11th and 13th, right after Pass on Tour Chicago, which is May 7th and 8th. So I have like a day to get there. I think I land in Poland on like the 10th in the morning. So I’m sure by the 11th, I’ll be in great shape, great spirits. Take some Benadryl with me on that one.
Then I will be at Data Saturday, Croatia, June 12th and 13th. And I will be at a Pass Summit in Seattle, Washington, November 9th through 11th. Excited to see everyone all over the place.
And of course, you know, Seattle and Chicago I’ve been to, but Poland and Croatia, it’s some first-timers for me. So the old man’s going to get some new stamps in the passport. Excited about that. So hopefully I meet nice people along the way.
But it is, well, actually, when does this get published? This, this, I might need to change this to the new image. Let’s see. Because this might be the April 1st video.
And this is not a, an April Fool’s joke video. So we are, we are firmly into baseball season and we are excited about that. And I am excited by this, this fellow right here with his floating, floating hot dog.
I don’t know. I don’t know how that’s staying weird. Maybe he’s got a little hot dog holster there. I don’t know. Everyone else holding their beer and hot dogs firmly.
I don’t know where this guy in red showed up from. All of everyone else is wearing blue, but I don’t know. Anyway, good job. Data baseball. We’re, we’re in there.
Anyway, let’s switch over to SQL Server Management Studio for the first time in a long time. And let’s talk a little bit about heaps. Now, heaps have two things about them that make them, sort of pigeonhole them into, like, like an append-only territory.
And that is that updates and deletes do not get along terribly well with heaps. Heaps are, of course, tables that do not have a clustered index on them. They may, they may very well have a primary key of the non-clustered variety.
They may have any variety of nonclustered indexes or other constraints on them. But without a clustered index, they are still a heap, meaning they are a collection of unordered pages. So you have the base heap table and then other stuff.
We’re going to talk about clustered indexes, which I prefer to call clustered tables these days. Next. Tomorrow’s video. Don’t get ahead of yourself.
Anyway. Whew. Feeling good. Feeling a little jazzy. So if we look at, let’s actually run this procedure called load elhepo. If you’re interested in getting the code or seeing what all this stuff does behind the scenes, you can buy the course and see all that stuff.
But I’m just giving you a small taste of what’s in there. So if we look at forwarded records for the heap at current, we will see that we do not have any. This is the column that we care about.
So we, of course, inserted a little over a million rows into the heap itself. And then we built a nonclustered index on the heap. So we’ve got that. That’s about all we care about here.
Okay. Because you’ve got nothing to show. So this query will help us find sort of the middle value in the table. And the reason I want to find the middle value is because I want to update about half of it. Because updating half of it in one go sort of simulates, you know, sort of data updates over time that you might see if you are misusing heaps and putting them in a position where they are constantly updated but never, like, truncated or cleared out and then reloaded with other data.
Or just the data is not just append only because that’s where things get into trouble. So if we look at these two queries as they are today, we will see that they have absolutely no forwarded records. Well, I mean, okay.
So if we run these two queries and we look at the before and after of the forwarded record counter, we will see that it is the same after both of them, right? 9582.
So we did not encounter any forwarded records. I just showed you the table with none in there. And as far as page splits a second go, we have this number. This number might change a little bit just because there might be other stuff going on in the background on my server. But this number won’t change because of the heaps.
But what I’m going to do is I’m going to update about half of the table and I’m going to inflate one of the columns a bit just to create some forwarded records. We’re going to fill up some data pages so that SQL Server is going to say, well, this row no longer fits there. We need to move you somewhere else.
I talk more about the mechanics behind that in the training course. But for now, just understand that when you update a heap and rows no longer fit on a page, SQL Server moves those rows silently. Well, sometimes not so silently to other pages where they fit.
So, again, this might change a little bit because of other stuff going on, but I don’t think that number changed at all. That looks the exact same to me. The number that will change is if we go back to that original query and we look at this, we will now see that we have a whole bunch of forwarded fetches against the heap.
The nonclustered index doesn’t have any forwarded fetches in it because nonclustered indexes do not behave the same way heaps do. But that heap now has 475,762 forwarded fetches, right? So, that is the number of rows that have moved about in the heap.
So, if we run these queries again and we just look at the forwarded fetches here, or in this case, they are called forwarded records a second, we’ll get these two queries will eventually run and finish, and this counter value will go from 485344 to 961106, right? So, that just about adds the number of forwarded records there are in the table to the counter there.
It’s pretty amusing, right? Now, of course, you can rebuild the table to get rid of the forwarded records, but that will also rebuild the nonclustered index. At the same time, if you have a heap with lots of nonclustered indexes on them, fixing the forwarded record problem might be a bit painful for you.
It’s pretty common if it’s a large heap to drop or disable the nonclustered indexes, rebuild the table, and then recreate the indexes sort of one at a time, so you don’t have to do that all in one big thing. So, if I turn on query plans, and I look at the alter table rebuild for this, the query plan, when this eventually finishes, shows two things, right?
We see the first section, if my little grabby friend will help here. So, we have the insert into the heap here, right? So, this is rebuilding the table, and then this is rebuilding the nonclustered index, right?
So, we do this work twice. Notice that we have to sort the data here before we do the index insert. That is a nonclustered index key order being sorted.
So, forwarded records, if you’re updating heaps a lot, they can really add up. They can really slow queries down eventually. It sort of introduces a strange sort of jagged I.O. pattern because you have to start scanning through the heap, and then if you hit a forwarded record, you have to jump to where that row lives, and then restart, then come back and do this, and jump to the next forwarded record, and you just kind of end up doing a lot of jumpy, jumping-around work that you don’t want to do when you’re trying to scan through something nice and easily.
The next thing that heaps can have trouble with is deletes. So, unless a delete acquires a table-level lock, either naturally through lock escalation or by specifying a tab lock, potentially x hint, then pages will not automatically be deallocated from a heap, which means that if a page completely empties out of rows, SQL Server is not in a rush to be like, hey, we should get rid of this page because it doesn’t have anything on it.
SQL Server uses that as sort of an optimization, kind of figuring, well, if we’re going to load more data in here, we might need these pages around, and we could just reuse these empty pages, and then, you know, we wouldn’t have to create a new page.
We could save ourselves some time, right? Optimize the inserts. Heaps are meant for, right? Fast insert. No sorting data. No indexes to keep track of. Just whoosh, data in.
All right. So, one thing that is interesting here is that if you have a row versioning isolation level, even a tab lock hint won’t help here.
It’s a whole thing. But you may see similar behavior if your heap contains lob columns, varchar, nvarchar, var binary max, XML, vector, JSON should probably be on that list, or essentially any column that could cause row overflow data, even if you have just very wide rows.
Page deallocation is completely disabled, even with a tab lock hint. The data does not actually have to use lob or overflow to separate pages. The mere possibility, that possibility existing, is sufficient for that to happen.
This is, of course, by design. This is not a bug. Don’t report it to Microsoft. Don’t say, Eric said this is bad and you should fix it. This is the way it works.
I’m just trying to teach you how it works. That’s my job here. So, if we look at indexes on the heap table, we will see that we have two of them still. We have our nonclustered index, which is about 17 megs, and we have our heap, which is about 260 megs.
We can see the in-row used page count over here and the row count that we have in the table of 104, 8576, which is exactly how many rows I loaded into it.
Now, we’re going to use everyone’s favorite. We’re going to use a cursor to delete through the table. And we are going to do that in such a way that we do not acquire a table-level lock, right?
So, we’re going to run all this stuff. Do-do-do-do-do. And I turned off query plan, so this shouldn’t take too long. It should be about 10 seconds or so.
But if it takes longer than that, well, I just might start drinking on camera. Cool. Nine seconds on the nose. I guess I’ve run this demo before, haven’t I?
So, now when we look at this, we’ll see a couple interesting things. The nonclustered index, that’s this one right here, is down to 0.16 pages, right? Or megs of pages assigned to it.
But the heat table still has all 260 megs assigned to it, right? And still has all of this, even though it has no rows in it, right? There is not a single row in there, but it is still the same size.
Every page remains after that delete, right? So, if I run this, if I checkpoint, right? And I run these two queries, let’s turn query plans on, right? Like, none of these, neither of these return any rows, right?
But notice that we still do this table scan here, right? And if we look at the statistics I.O. output for this, we still do a whole bunch of logical reads against the heap, right?
But we do almost none against the nonclustered index, right? Coming down here, that’s what was happening. This is our index scan, which was, you know, the index that now has no pages assigned to it.
But this thing up here still scans every single page, right? Fun times are afoot. So, if you’ve got heaps, if you’ve got tables that do not have clustered indexes on them, remember that they are flat structures.
They are not tree-shaped. They use something called a row identifier or a RID, which is the file, page, and slot number combination for the row to, instead of a clustering key, to identify rows.
They’re tracked by something called the index allocation map. That is the IAM. And it tracks about four gigs of heap data apiece, or they track about four gigs of heap data apiece.
The good stuff that you can get out of heaps are fully parallel data loads because you don’t have any indexes to maintain. You don’t have to deal with page splits on updates, just forwarded records. There’s less logging overhead for bulk operations.
And empty pages will stick around. They’re not deallocated in case you want to do some more inserts. The not-so-good stuff is forwarded records when rows grow and move pages. We saw that with the updates.
We saw that pages where all the rows got fully deleted off of them don’t get deallocated automatically, and the scans still read a lot of empty pages.
That can be really surprising and really impact performance if you’re not paying attention to it. You need table locks or lock escalation for page deallocation to occur. If you’re trying to get the fully parallel insert into the heap, and if you have any nonclustered index on there, it messes them up.
Role versioning isolation levels will prevent page deallocation, even with the tab lock hint. Rebuilding the heap also rebuilds all the nonclustered indexes at the same time. Forwarded records make nonclustered index lookups a little bit more expensive because you not only have the lookup, but now you also have the following page forwarded records around in there.
If you really just need to dump a bunch of data into a table quickly, heaps are probably the way to go. And simple recovery model will really help with minimal logging.
But, you know, a lot of people look upon heaps as if they are evil structures, but really, you know, they’re just things that are… They’re very specialized structures with very specific purposes that unfortunately have been sort of abused and misused by people over the years and have gotten a bad name.
Sort of like cursors. Ha! See how I did that. Anyway, thank you for watching. I hope you enjoyed yourselves. I hope you learned something. I hope you’ll check out the full training course. Again, that’s available.
There’s a link down below where you can get the Learn T-SQL with Erik course. And, yeah, that’s that. All right, cool. See you in tomorrow’s video where we will talk about clustered indexes. Won’t we then?
All right. See you then.
If this is the kind of SQL Server stuff you love learning about, you’ll love my training. Blog readers get 25% off the Everything Bundle — over 100 hours of performance tuning content. Need hands-on help? I offer consulting engagements from targeted investigations to ongoing retainers. Want a quick sanity check before committing to a full engagement? Schedule a call — no commitment required.