INSERT and UPDATE loading practice – The impact of forwarding pointers.

Ever since I’ve been working with databases I have come across the following pattern of programming: an INSERT followed by LOTS of UPDATEs. The reason for this pattern is understandable, and is generally that the desired data set is too complex to generate from the source database in one go.

For this reason a base set of data is generated and then amended by a number of updates.

I’ve seen this in two main scenarios: integration and warehousing. The former is due to a target system wanting something that the source system held in a different structure or at a different level, and the latter generally wants calculated data like number of children, or name of last dentist visited.

You might think these should easily be solved by one query, and if there was only one of these bits of information required that would be easy. However more often than not there are many, and even worse they are dependent, i.e. given the last dentist someone visited, give me the number of people that have sued said dentist.

So logically people break this down into manageable chunks and do something like:

 

INSERT INTO CustomerDetails
SELECT somecolumns
FROM Customers

ALTER TABLE sometable
  ADD NumberOfOrders int,
      NumberOfChildren int,
      NumberOfPeopleSued int,
      LastDentistVisited varchar(100)

UPDATE CustomerDetails
   SET NumberOfChildren = (SELECT COUNT(1) FROM ….)

UPDATE CustomerDetails
   SET LastDentistVisited = (SELECT TOP 1 Surname FROM ….)

UPDATE CustomerDetails
   SET NumberOfPeopleSued = (SELECT COUNT(1) FROM ….)

 

The reason for this post is that this approach is right for one reason and wrong for many more.

It is right because it is good practice to break large problems down into manageable, understandable chunks, as this aids understanding and means the code is easier to support in the future.

It is wrong for many performance reasons and one functional reason. I will cover the latter in a later post but will mention the performance reasons now.

1. The addition of the columns means each row has to be adjusted to fit the new columns. With fixed width columns this can be very expensive as data has to be moved about. If you need the extra columns they should be added in the INSERT statement. To get the correct type, use CAST, i.e. CAST(NULL AS int) or CAST(NULL AS varchar(24)).
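Applied to the CustomerDetails example above, that looks something like this (a sketch; SELECT … INTO creates the table with the extra columns already in place, so no ALTER TABLE is needed afterwards):

```sql
-- create the table with correctly typed placeholder columns in one pass
SELECT somecolumns
     , CAST(NULL AS int)          AS NumberOfChildren
     , CAST(NULL AS int)          AS NumberOfPeopleSued
     , CAST(NULL AS varchar(100)) AS LastDentistVisited
INTO   CustomerDetails
FROM   Customers
```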

2. The second reason is that for each UPDATE every row has to be updated. If you have N updates that means each page is touched at least N times, which is not good for performance. If you have to do this, the solution is to try and batch the updates up so they are done in one pass, by using derived tables and CASE statements, i.e.

UPDATE <table>
   SET Col1 = CASE WHEN <some criteria>
                   THEN <a new value>
                   ELSE Col1 END
     , Col2 = CASE WHEN <some other criteria>
                   THEN <some other new value>
                   ELSE Col2 END

3. The final reason is one of forward pointers. Even if you add your additional columns in the INSERT statement, if any of them are variable length then they will result in FORWARDED RECORDS when you start doing an update. What’s a forwarded record? Well, it occurs when a row has to move because it no longer fits on the page it was put on to start with. It’s just like when you move house and redirect your mail. SQL Server maintains that your record is still located in its original location, but in that location is a pointer to another location, on another page in your database. Why is this bad? Because it means SQL Server has to perform two IOs to be able to read the data, and it also has to do more work to update the data. More work equals worse performance.

So let’s have a look at forward pointers.

In this example we are going to create a table and put one row in it. We will then put another row in the table that fills the page.

use tempdb
go
drop table heap
go
create table heap (id int not null, v1 varchar(5000))
go
insert into heap (id, v1)
select 1, ''
go
select object_id, partition_number, alloc_unit_type_desc, record_count,
       avg_record_size_in_bytes, forwarded_record_count
from sys.dm_db_index_physical_stats(db_id('tempdb'), object_id('heap'), 0, null, 'detailed')
go
insert into heap (id, v1)
select 2, REPLICATE('#', 4000)
go
select object_id, partition_number, alloc_unit_type_desc, record_count,
       avg_record_size_in_bytes, forwarded_record_count
from sys.dm_db_index_physical_stats(db_id('tempdb'), object_id('heap'), 0, null, 'detailed')
go
select *
from heap
cross apply sys.fn_PhysLocCracker(%%physloc%%)

 

What you should see is that after the first insert the DMV dm_db_index_physical_stats is reporting 1 row and no forwarded records. After the second insert we get 2 rows and no forwarded records. Note we are also using a SQL 2008 undocumented feature, %%physloc%% and fn_PhysLocCracker. Both of these are undocumented, but are very handy as they show the physical location of a row: the File, Page and Slot. There is also a scalar function as well. Thanks to Paul Randal for the pointer http://www.sqlskills.com/blogs/paul/post/SQL-Server-2008-New-(undocumented)-physical-row-locator-function.aspx. What we will see is that this returns the original location and not any forwarded location.

Now if we update the first row so it’s too big to fit on the page, what do you think happens?

update heap
set v1 = REPLICATE('@', 5000)
where id = 1
go
select *
from heap
cross apply sys.fn_PhysLocCracker(%%physloc%%)
go
select object_id, partition_number, alloc_unit_type_desc, record_count,
       avg_record_size_in_bytes, forwarded_record_count
from sys.dm_db_index_physical_stats(db_id('tempdb'), object_id('heap'), 0, null, 'detailed')

 

Well, you will see that we now have a forwarded record, but %%physloc%% is still saying it’s in the same place. If we use DBCC IND we can see that the table now has 3 pages: an allocation page and two data pages.

dbcc ind(tempdb, heap, 1)
go
dbcc traceon(3604)
go
dbcc page(tempdb, 1, ???, 3)
dbcc page(tempdb, 1, ???, 3)

Using DBCC PAGE we can look at the original page, and you will see something like this. (Note: change the ??? to the page numbers returned from DBCC IND where PageType = 1.)

Slot 0 Offset 0x101a Length 9

Record Type = FORWARDING_STUB
Record Attributes =
Record Size = 9

Memory Dump @0x0000000011A1B01A

0000000000000000:   04000900 00010000 00

Forwarding to = file 1 page 2304 slot 0

This highlights that the row has
moved to page 2304 (your page will probably be
different)

If we look at that page (which will
be the second DBCC PAGE above) you will find your
row.

So what happens if we move house again? What does SQL Server do? To check it out we need to add two rows: one fills up the first page and one fills up the second page. If we then update the first row so that it’s too big for its new page, it should move.

 

insert into heap (id, v1) select 3, REPLICATE('#', 4000)
insert into heap (id, v1) select 4, REPLICATE('#', 1000)
go
select *
from heap
cross apply sys.fn_PhysLocCracker(%%physloc%%)
go
select object_id, partition_number, alloc_unit_type_desc, record_count,
       avg_record_size_in_bytes, forwarded_record_count
from sys.dm_db_index_physical_stats(db_id('tempdb'), object_id('heap'), 0, null, 'detailed')
go
update heap
set v1 = REPLICATE('@', 8000)
where id = 1
go
select *
from heap
cross apply sys.fn_PhysLocCracker(%%physloc%%)
go
select object_id, partition_number, alloc_unit_type_desc, record_count,
       avg_record_size_in_bytes, forwarded_record_count
from sys.dm_db_index_physical_stats(db_id('tempdb'), object_id('heap'), 0, null, 'detailed')
go
dbcc ind(tempdb, heap, 1)

Doing the insert we will see that the 4th row now appears on the second page, but the other three appear to be on the first page. After the update nothing changes and the forwarded record count is still 1. So what’s happened?

Well, if we look at the first page again using DBCC PAGE, you will find that the page number that the forward pointer points to has changed to a new page. If you run DBCC IND you will see you now have a 3rd data page. Interestingly, if you have a look at the second page, you will see that slot 0, where the old forwarded record resided, is no longer shown; that’s because there is nothing in it. Slot 0 is now free.

So the bottom line is that if you are resorting to such an approach, doing an INSERT and then an UPDATE, forwarded records can cause you a problem, not just during reading but also when updating, because data has to be moved. If you have to use this approach then you are best to preallocate space in the table; this ensures that the row doesn’t move, and performance will improve. I’ve seen 50% improvements in systems by allocating defaults to varchar columns, using REPLICATE for instance.
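As a sketch of that preallocation trick (the width of 100 here is illustrative), pad the variable length column when the row is first inserted, so the later UPDATE overwrites a value no bigger than the space already allocated:

```sql
INSERT INTO CustomerDetails (CustomerId, LastDentistVisited)
SELECT CustomerId
     , REPLICATE(' ', 100)   -- reserve the space up front so the row never grows
FROM   Customers
```

Because the row never has to grow, it stays on its original page and no forwarding stub is created.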

So what should you do?

Well, whilst you might be able to write one query that achieves your goal, it is probably only supportable by the person that wrote it or someone from MENSA. So breaking it up is a good thing, but the use of updates is a bad thing. You will probably find that it is quicker to perform 5 INSERT statements and then join them all together, rather than 1 INSERT and 4 UPDATEs. If you use SELECT INTO it will be a bulk logged operation; you can then add a clustered index on your key, which should allow the final query to use an efficient MERGE JOIN.
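A sketch of that shape, with illustrative table and column names:

```sql
-- build each calculated set separately; SELECT INTO can be bulk logged
SELECT CustomerId, COUNT(1) AS NumberOfChildren
INTO   #Children
FROM   Children
GROUP BY CustomerId

CREATE CLUSTERED INDEX cix_Children ON #Children (CustomerId)

-- repeat for the other calculated sets, then join them once at the end;
-- with both sides clustered on CustomerId a MERGE JOIN is possible
SELECT c.CustomerId, ch.NumberOfChildren
INTO   CustomerDetails
FROM   Customers c
LEFT JOIN #Children ch ON ch.CustomerId = c.CustomerId
```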

If you can, you could use INSERT INTO with the clustered index already in place and use trace flag 610 http://blogs.msdn.com/sqlserverstorageengine/archive/2008/03/23/minimal-logging-changes-in-sql-server-2008-part-2.aspx. This would normally be the quickest option.

ps. One nice feature in SQL 2008 is that ALTER TABLE <heap> REBUILD removes forward pointers.
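For the demo heap above that is simply:

```sql
ALTER TABLE heap REBUILD
go
select record_count, forwarded_record_count
from sys.dm_db_index_physical_stats(db_id('tempdb'), object_id('heap'), 0, null, 'detailed')
```

The forwarded_record_count should drop back to 0 after the rebuild, as the rows are rewritten in place.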

