
Dangers of BEGIN and END

I’ve presented this material at three conferences recently, so it’s about time I wrote a blog post on it…

As programmers, we love modularisation – even in the SQL space. We make stored procedures, views, and functions to encapsulate our code. This improves maintainability, simplifies the development experience, and is generally useful.

But there’s a time when it’s a bad thing for SQL Server.

There’s an amazing component of SQL Server called the Query Optimizer (I always want to write Optimiser, but I’m assuming it’s a proper noun and putting up with the US spelling). When we write queries in T-SQL, it’s the Query Optimizer that works out how to actually run the query. It works out what indexes can be used to improve performance, what order tables (well, indexes and heaps) should be accessed, how to perform the joins, and so on. I find that a rough appreciation of the power of the Query Optimizer can really help query writers.

For example, the Query Optimizer will translate a correlated sub-query in the SELECT clause into a LEFT OUTER JOIN, so that you don’t have to. It will also work out when joins can be rendered pointless and thereby removed from the plan altogether. If you let these principles help you in your query design, you can see significant benefits. It also helps you write queries that are easier to maintain, as there’s little point in trying to be clever by writing a query in a different way if the Query Optimizer will handle it in the same way as before.
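To give a rough idea of what I mean (this example is just a sketch against the AdventureWorks sample database, which features heavily in the posts below), these two queries typically end up with essentially the same plan, because the Query Optimizer unnests the correlated sub-query into an outer join with an aggregate:

-- Correlated sub-query in the SELECT clause...
SELECT s.Name,
       (SELECT COUNT(*)
        FROM Production.Product p
        WHERE p.ProductSubcategoryID = s.ProductSubcategoryID) AS NumProducts
FROM Production.ProductSubcategory s;

-- ...is handled much like an explicit LEFT OUTER JOIN with a GROUP BY
SELECT s.Name, COUNT(p.ProductID) AS NumProducts
FROM Production.ProductSubcategory s
  LEFT OUTER JOIN
  Production.Product p
  ON p.ProductSubcategoryID = s.ProductSubcategoryID
GROUP BY s.Name, s.ProductSubcategoryID;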

If you use a view in another query, the definition of the view is used in the query as if you had written it with a sub-query. A view is simply that – a stored sub-query. They are sometimes referred to as ‘virtual tables’, but I disagree. They are stored sub-queries. Sure, the analogy falls down when you start considering indexed views, but on the whole, a view should be seen as a stored sub-query. The Query Optimizer takes the view definition, applies it in the second query, simplifies it where possible, and works out the best way of executing it. If you’re only interested in a couple of columns out of the view, the Query Optimizer has an opportunity to take that into consideration.

Stored procedures are different. You can’t use a stored procedure in an outer query. The closest you can get to this is to use OPENROWSET to consume the results of a stored procedure in an outer query, but still the whole procedure runs. After all, it’s a procedure. A set of T-SQL commands, not a set of queries. I see the clue to this as the BEGIN and END that stored procedures generally use. I like stored procedures, but I do get frustrated if they’re returning more information than I need, since I have no way of letting the system know that maybe it doesn’t need to do as much work.

Functions are in between, and come in two varieties. A function can be inline, or it can be procedural. I don’t think you find this differentiation in many places – and normally people talk about this particular drawback as being associated with Scalar Functions as compared to Table-Valued Functions, but the problem is actually one of simplification.

An inline function must be a table-valued function at this point in time. It takes the form:

CREATE FUNCTION dbo.fnFunctionName(<paramlist>) RETURNS TABLE AS
RETURN
( SELECT …. );

It is always this form, with a sub-query enclosed in a RETURN statement. It can return many columns and many rows, but the definition of the table is implied by the SELECT clause. This is essentially a view that can take parameters.
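To make that concrete, here’s a hypothetical example (the function name and logic are only for illustration, using the AdventureWorks tables):

CREATE FUNCTION dbo.fnProductsBySubcategory(@SubcategoryID int) RETURNS TABLE AS
RETURN
( SELECT p.ProductID, p.Name, p.Color, p.Size
  FROM Production.Product p
  WHERE p.ProductSubcategoryID = @SubcategoryID );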

The other form is one that involves BEGIN and END. Scalar functions (unfortunately) require this (but hopefully one day will not).

CREATE FUNCTION dbo.fnFunctionName(<paramlist>) RETURNS int AS
BEGIN
RETURN ( … )
END;

As the RETURN statement is enclosed between a BEGIN and END, it can be preceded by other statements, used in working out what value should be returned.
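For example, a hypothetical scalar function might look like this (again, the name and logic are only illustrative):

CREATE FUNCTION dbo.fnProductCountForSubcategory(@SubcategoryID int) RETURNS int AS
BEGIN
    -- Other statements can appear here before the RETURN
    DECLARE @cnt int;

    SELECT @cnt = COUNT(*)
    FROM Production.Product p
    WHERE p.ProductSubcategoryID = @SubcategoryID;

    RETURN @cnt;
END;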

Table-valued functions can use BEGIN and END, when multiple lines are required to calculate the rows in the table being returned.

CREATE FUNCTION dbo.fnFunctionName(<paramlist>) RETURNS @table TABLE (<fields>) AS
BEGIN

RETURN
END;

In this kind of function, the table variable is populated with data, and returned to the outer query when the RETURN command is reached.
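A hypothetical example of this style (illustrative names only) might be:

CREATE FUNCTION dbo.fnProductList(@SubcategoryID int)
RETURNS @table TABLE (ProductID int, ProductName nvarchar(50)) AS
BEGIN
    -- Populate the table variable...
    INSERT @table (ProductID, ProductName)
    SELECT p.ProductID, p.Name
    FROM Production.Product p
    WHERE p.ProductSubcategoryID = @SubcategoryID;

    -- ...and hand it back to the caller
    RETURN;
END;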

But when the Query Optimizer comes across a procedural function, it cannot simplify it away – it has to execute the function in a separate context.

The execution plan will report that the cost of running the function is zero. But it’s lying. The way to see the impact of the function is to look in SQL Profiler, where you’ll see potentially many calls to the function, as it needs to work out the result for each different set of parameters it’s passed. The pain can be quite great, and you’d never notice it if you only looked at the Execution Plans.

The moral of the story is to make sure that your functions are able to be simplified out by the Query Optimizer. Use inline table-valued functions even in place of scalar functions. You can always hook into them using CROSS/OUTER APPLY in your FROM clause, or even use them in your SELECT clause (not “SELECT Claws” – that would make it related to my company LobsterPot Solutions, and “SELECT Claus” is just a bit Christmassy) using a construct like SELECT (SELECT field FROM dbo.fnMyTVF(someParam)) …
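Using the hypothetical inline function sketched earlier, that looks something like this:

-- In the FROM clause, via APPLY
SELECT s.Name AS SubcatName, f.ProductID, f.Name AS ProductName
FROM Production.ProductSubcategory s
  CROSS APPLY dbo.fnProductsBySubcategory(s.ProductSubcategoryID) f;

-- Or in the SELECT clause, wrapped in a sub-query that returns a single value
SELECT s.Name AS SubcatName,
       (SELECT COUNT(*)
        FROM dbo.fnProductsBySubcategory(s.ProductSubcategoryID)) AS NumProducts
FROM Production.ProductSubcategory s;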

Consider the Query Optimizer your friend. Study Execution Plans well to look at how the Query Optimizer is simplifying your query. And stay away from BEGIN and END if possible.

A busy month – a new book, a new car, a new phone

I don’t know why Septembers are always busy. This one feels like it’s been interesting, and I’m not sure life will be the same again.

But first, some of the biggest news – the book that I wrote a couple of chapters for is now available for purchase!

A challenge was put out a while back for SQL MVPs to write a book for charity. Paul Nielsen spearheaded it, and I’m pleased to say that there was a massive response. I wrote two chapters, and this week we have had the notice that the book can now be purchased from Manning Press. If you go to http://www.SQLServerMVPDeepDives.com you will be able to buy the Early Access Edition, which will get you updated electronic copies as the chapters become available (final layouts, images, etc still appearing). All the royalties for this book go to charity rather than the authors, so buy up! I promise to sign any copy put in front of me, but if you go to the PASS conference in November, you can probably get at least 40 or so of the other authors to sign it instead.

I’m planning to get a signed copy brought back from the US, and will auction it off to members of the Adelaide SQL Server User Group, giving the money to charity too.

It feels good to have the book finished!

September has also been busy for a number of reasons. The company is growing nicely, celebrating a year this week, and keeping my time somewhat occupied. We achieved Gold Partner status with Microsoft at the end of August, and are ticking along well. On a more negative note, the winter has taken its toll with flu in the family, which is lousy – but we also bought a new car (finally got the people mover we’ve been promising ourselves). I replaced my old phone with an iPhone (part of me thinks that I’ll go back to Windows Mobile next time), and bought my wife one too (plus a DS for her birthday). I feel like we’re more gadgety than ever!

Not to mention TechEd Australia, which was a fun time. Grant Paisley surfing at Dreamworld was a sight to behold, and I hope there are photos somewhere! I gave a talk on SQL Azure, which gave me a number of headaches leading up to the conference, finding new things I wanted to mention on a daily basis! I also gave a talk on the danger of scalar functions in SQL Server, which I will be repeating in just over a week at Wagga, a couple of days after presenting at the Albury/Wodonga .Net User Group. An email arrived about half an hour after my scalar functions talk, saying that someone in the audience had just applied the principles I showed and made some vital queries run thirty times faster! Terrific news I think.

Bringing another laptop (an HP Mini) home from TechEd has also added to the number of gadgets in the house…

Tomorrow I hope to be able to write that I have been awarded MVP status for another year, which will be a tremendous honour. I keep wondering how much longer I’ll be able to remain in the company of such a fantastically skilled and helpful crowd. Every time I receive the award I’m both humbled and proud, and feel amazingly blessed.

Four speaking engagements coming up

I’m just going to list them…

This Saturday (July 18th), at Code Camp SA. I’m going to be talking about functions in SQL, particularly those that involve BEGIN and END.

At the end of the month, at the ACS Branch Conference. I’m going to be part of a panel discussing Open Source v Closed Source.

In August, I’m going to be speaking at SharePoint Saturday (Adelaide), about the integration of Reporting Services and SharePoint.

In September, I’m going to be presenting at TechEd Australia, about SQL Azure.

Be nice if there was more of an overlap in topics…

Big events every month this quarter

A new Financial Year in Australia, and a bunch of technical events coming up.

Of course there’s the usual monthly user groups, but there’s more – particularly if you’re in Adelaide.

July sees CodeCampSA in Adelaide on the weekend of July 18/19. I’ve put my name into the hat for speakers, and will try to be there for a chunk of Saturday (Sundays are too busy for me). I’m sure at least one of my sons will want to come along as well, which will be fun. Big thanks to David Gardiner for putting the website together.


August sees SharePoint Saturday come to Sydney (8th) and Adelaide (15th). You may not agree with Aaron about what SharePoint is, but if you’re into SharePoint, I’m sure you’ll get a lot out of these events.

September brings Australian geeks to the Gold Coast again for Microsoft’s annual TechEd Australia, this year with the added incentive of an HP Mini for attendees (conditions apply of course). It’s the 2140, which is a discontinued line, but that doesn’t make it any less attractive a machine. I’m sure this will help persuade people to get themselves over to Queensland.



And in case you hadn’t realised, I’ve recently discovered how easy it is to put maps into blogs using Windows Live Writer… just so that you can all see the beach, and understand how poorly attended the sessions would be if they ran TechEd Australia in the summer.

GO – repeating batches

GO is very cool. More so than you might think, and in a couple of interesting ways. This is the 6th tip from my TechEd Australia talk, from which you can see my slides and scripts in other posts.

In SQL Server, GO is the traditional batch separator. But what many people don’t know (though it has been mentioned around the blogosphere several times over the past year or so) is that you can put a number after GO to make the batch run repeatedly. I often use this method to populate lots of sample data into a table.

INSERT dbo.someTable (col1, col2)
SELECT col1, col2
FROM dbo.someTable;
GO 5

If my table starts with 10 rows, it will insert a copy of those ten rows, then twenty, then forty, eighty, and finally an extra 160, leaving me with 320 rows in my table. I guess I could have hit F5 5 times, but you get the point.
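If you want to try it from scratch, something like this shows the effect (a throwaway table, and the ten-row VALUES list assumes SQL Server 2008 or later):

CREATE TABLE dbo.someTable (col1 int, col2 int);
GO
-- Start with ten rows
INSERT dbo.someTable (col1, col2)
VALUES (1,1),(2,2),(3,3),(4,4),(5,5),(6,6),(7,7),(8,8),(9,9),(10,10);
GO
-- Double the table five times: 10 -> 20 -> 40 -> 80 -> 160 -> 320
INSERT dbo.someTable (col1, col2)
SELECT col1, col2
FROM dbo.someTable;
GO 5
SELECT COUNT(*) AS NumRows FROM dbo.someTable;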

The fun side of GO is that it’s a configurable option in Management Studio. Head into Tools, Options, and you’ll see it under Query Execution.


Now, I would never condone changing this on a colleague’s machine. If you do, then GO will start causing errors. Scripts that once worked will probably start failing (though not necessarily all of them – a simple query might just treat GO as a table alias).

But if you want to be really mean – and again, I would never suggest you do this to that annoying colleague who always leaves their laptop unlocked when they head off to lunch, even if they just live to frustrate every waking minute of your life – if you want to be really mean, you set the Batch separator to SELECT.

I’m sure your eyes have just dilated at the pure evil. And so they should. Now, even the most basic of queries will give the nastiest of errors. Restarting SSMS won’t fix it. Rebooting won’t fix it. You had better make sure you don’t consider doing this to a colleague – if they realise (having searched online for “A fatal scripting error occurred. Incorrect syntax was encountered while parsing SELECT.” and found this post), they may figure it out and it won’t be good for you. At least make sure you’re around to help them when they first panic.


Ordered data requires ORDER BY

When you select data out of a database table, it might seem as if the system will give you data in the order of the clustered index that is on that table (assuming it has one), but this isn’t quite right. I showed this in the fifth of my T-SQL tips at TechEd Australia this year.

If you don’t explicitly order data using the ORDER BY clause, the system makes no attempt to provide you with the data in any particular order. It will just give you the data in the fastest way possible. You might be lucky and get the data in a predictable order most of the time. In fact, on your system, the data might be returned in a predictable order over 99% of the time. Unfortunately, you can’t rely on that.

When the database engine is handling your query, it needs to get your data into RAM first. If it’s not already there, this means a trip to disk, which will generally be the slowest aspect of your query. With multiple processors and multiple disks being the norm these days, there is a good chance that one processor will head off to get the data from one disk, while another processor gets data from another. If your query isn’t explicitly stating the order for your data, whichever processor gets its data first is likely to determine which data appears first in your results. There are other reasons why your data might come back in an unexpected order, but this reason alone should convince you.

I’m sure, reading this, you’re very much used to putting ORDER BY in your queries. But what happens if you’re using SQL Server Integration Services (SSIS)? When you set up a data flow and want to pull data out of a particular table or view, do you use the “Table or View” option in the data source? You might – it’s a lot easier than typing a query out under the SQL Command option. Just don’t. Use the SQL Command option instead.


The problem is that SSIS can really take advantage of ordered data. If SSIS can’t guarantee that the data is ordered, it will assume it’s not, and this might hurt the performance of your package significantly. Some data flow transformations can’t even run on unordered data. You can tell a data flow source that the data is ordered. You can even tell it that it is, even if it’s not. Don’t do this. Only tell a data flow source that it’s ordered if you’re explicitly ordering it. If you tell SSIS that a data flow is ordered, and the data comes through in a different order (because you’re unlucky one time), you will get unexpected results. What I mean by ‘unexpected’ here is ‘wrong’.

It’s not hard, just write out your query, and use the ORDER BY clause. You know you should…
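The SQL Command text doesn’t need to be anything clever – just the explicit query with its ORDER BY (the table here is only an example):

SELECT ProductID, Name, Color, Size
FROM Production.Product
ORDER BY ProductID;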

Unique Indexes with GROUP BY

Indexes are great, but if you don’t understand the significance of unique indexes, then you’re potentially missing out on some decent performance gains.

I’ve been meaning to write this blog post for a long time. This is material that I’ve been teaching and presenting about all year, but somehow it’s never turned into a blog post. It was the third tip in my presentation at TechEd Australia (and at user group events in Adelaide and Melbourne). Today I’m flying back home from the US, so hopefully I’ll be able to spend some battery time getting it done. [Note: It’s ended up out of order with another post that I wrote on the plane. This one is Tip #3.]

This query runs in the AdventureWorks database, in almost any version of SQL Server you have. When I present on this at the moment, I use SQL Server 2005, so that people don’t think I’m just showing new SQL Server 2008 features. This concept will help you regardless of version.

SELECT s.Name, COUNT(*)
FROM Production.ProductSubcategory s
  JOIN
  Production.Product p
  ON p.ProductSubcategoryID = s.ProductSubcategoryID
GROUP BY s.Name;

It’s a simple query that counts the number of products in each non-empty subcategory. Having seen the logical action of the query, let’s look at the execution plan. This will show us what the query is actually doing. We’ll look at it both graphically and in text. The plan itself is stored as XML, and additional information can be seen in the graphical view using the tool tips and the Properties window.

 

|–Merge Join(Inner Join, MERGE:([p].[ProductSubcategoryID])=([s].[ProductSubcategoryID]), RESIDUAL:([AdventureWorks].[Production].[ProductSubcategory].[ProductSubcategoryID] as [s].[ProductSubcategoryID]=[AdventureWorks].[Production].[Product].[ProductSubcategoryID] as [p].[ProductSubcategoryID]))
     |–Compute Scalar(DEFINE:([Expr1004]=CONVERT_IMPLICIT(int,[Expr1016],0)))
     |    |–Stream Aggregate(GROUP BY:([p].[ProductSubcategoryID]) DEFINE:([Expr1016]=Count(*)))
     |         |–Sort(ORDER BY:([p].[ProductSubcategoryID] ASC))
     |              |–Clustered Index Scan(OBJECT:([AdventureWorks].[Production].[Product].[PK_Product_ProductID] AS [p]))
     |–Clustered Index Scan(OBJECT:([AdventureWorks].[Production].[ProductSubcategory].[PK_ProductSubcategory_ProductSubcategoryID] AS [s]), ORDERED FORWARD)

 

The Stream Aggregate operator is where the GROUP BY work is done. Having sorted the data, the system can look through it for changes, calculating whatever aggregates are needed at the time. The properties window or the text view show clearly that the count is being calculated here.

And they show that the field being grouped by is p.ProductSubcategoryID. But our query said we should group by s.Name. So what’s going on? It’s not picking the right field. It’s not even picking the right table.

But our system has a unique index on s.Name. GROUP BY looks for the distinct values from the fields in the GROUP BY clause, but if a unique index is on one of those fields, every row in that table is already known to be distinct. Our unique index is letting the system have some more freedom.

Because s.ProductSubcategoryID is also known to be unique (it happens to be the Primary Key (PK) of the table), it is logically equivalent to consider this field in place of s.Name. And as we are joining to the Product table on this field, it is even okay to group by the field from Product. The s.Name field is fetched later and brought into our result set using the Merge Join operator. It’s not hard to recognise that our query is asking for the number of products per subcategory, and that we’re only really grouping by s.Name because that’s the field that we want to display.

Now let’s consider what happens if we remove the unique index on s.Name. Interestingly, if you have been evaluating your indexes using sys.dm_db_index_usage_stats, you may consider that the index isn’t being used. Unfortunately, the fact that the Query Optimizer does actually use the index isn’t reported very obviously. Once the index is removed, the query plan changes. This demonstrates that the index is definitely being used, even if it’s not used in a way that gets reflected in sys.dm_db_index_usage_stats.

|–Compute Scalar(DEFINE:([Expr1004]=CONVERT_IMPLICIT(int,[Expr1007],0)))
     |–Stream Aggregate(GROUP BY:([s].[Name]) DEFINE:([Expr1007]=Count(*)))
          |–Sort(ORDER BY:([s].[Name] ASC))
               |–Hash Match(Inner Join, HASH:([s].[ProductSubcategoryID])=([p].[ProductSubcategoryID]), RESIDUAL:([AdventureWorks].[Production].[ProductSubcategory].[ProductSubcategoryID] as [s].[ProductSubcategoryID]=[AdventureWorks].[Production].[Product].[ProductSubcategoryID] as [p].[ProductSubcategoryID]))
                    |–Clustered Index Scan(OBJECT:([AdventureWorks].[Production].[ProductSubcategory].[PK_ProductSubcategory_ProductSubcategoryID] AS [s]))
                    |–Clustered Index Scan(OBJECT:([AdventureWorks].[Production].[Product].[PK_Product_ProductID] AS [p]))

It has clearly changed a lot. Now, the Stream Aggregate operator reports that it is grouping by s.Name – just like we asked. But you may also notice that the Estimated Subtree Cost is about 30% worse than the previous query. The system is treating this query in exactly the way that we asked – reporting the number of products per subcategory name, rather than per subcategory.

We can address this without the unique index, if we change the query. The answer is to simply include the Primary Key of the subcategory in the GROUP BY clause. Even though we don’t want to include that field in the result set, listing it in the GROUP BY clause will let the system optimise the query better even without the unique index. Bear in mind that if there is a duplicate subcategory name, there will be an extra row that didn’t appear before – but this is correct, as we’re grouping by subcategory rather than subcategory name. Having s.Name in the GROUP BY clause is effectively ignored (as in our original plan) – it’s only listed there because we want to list it in the SELECT clause.

SELECT s.Name, COUNT(*)
FROM Production.ProductSubcategory s
  JOIN
  Production.Product p
  ON p.ProductSubcategoryID = s.ProductSubcategoryID
GROUP BY s.Name, s.ProductSubcategoryID;

The unique index simply allows the PK to be implicitly inserted into the GROUP BY clause.

As a query developer, you should ask yourself whether you want to group by the entity described by the table, or just by a particular column. If you really mean the entity, then include the PK in the GROUP BY clause, or else understand the impact of the unique index. And definitely appreciate that there may be queries which are helped by the existence of unique indexes.

I do think that it’s worth evaluating index usage. Having an unused index in your system is like having the Yellow Pages in your house if you never use it – just taking up cupboard space. Dropping an unused index may not work for you though, particularly if it’s unique.
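For what it’s worth, a query along these lines (just a sketch – tweak it to suit your environment) shows the usage counters for each index in the current database. Just remember the caveat above about what doesn’t get counted.

SELECT OBJECT_NAME(i.object_id) AS TableName,
       i.name AS IndexName,
       i.is_unique,
       us.user_seeks, us.user_scans, us.user_lookups, us.user_updates
FROM sys.indexes i
  LEFT JOIN
  sys.dm_db_index_usage_stats us
  ON us.object_id = i.object_id
  AND us.index_id = i.index_id
  AND us.database_id = DB_ID()
WHERE OBJECTPROPERTY(i.object_id, 'IsUserTable') = 1
ORDER BY TableName, IndexName;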

JOIN simplification in SQL Server

This is another of my tips/techniques that I demonstrated recently. It is not a new SQL Server 2008 feature – it works just fine in older versions – I demonstrated it running on SQL Server 2005 but using SQL Server 2008 Management Studio. [Note: It’s ended up appearing earlier than another post that I wrote on the plane. This one is Tip #4.]

When you have a single-table query and then introduce a JOIN, there are a number of ways in which that JOIN impacts your query. I want to talk about four such JOIN effects that I’ve identified. Understanding these four effects will help you take advantage of the simplification I am about to show you. Don’t focus on the simplification, focus on the effects. I will show you later how the concept can be applied to help database developers.

1. Extra columns

This seems fairly straightforward – we have access to additional columns (the ones in the new table). It’s the main reason we tend to join to other tables – we simply don’t have all the data in our original table.

2. Duplicated rows

Not necessarily duplicated completely, but certainly a row in our original table may appear multiple times in our result set, if it matches with multiple rows in the new table. Notice that this doesn’t occur if the join-fields (those used in the ON clause) in the new table are known to be unique (as is the case with a Foreign-Key lookup).

3. Eliminated rows

In a similar manner, if a row in our original table doesn’t match with any rows in the new table, it might get eliminated from the results. Notice that this doesn’t occur if a LEFT JOIN or FULL JOIN is being used.

4. Added NULLs

This effect is somewhat rarer, but still worth mentioning. If a RIGHT JOIN or FULL JOIN is being used, then there may be NULLs added to our result set to reflect rows in our new table that don’t have matches in our original one. This doesn’t happen unless a RIGHT JOIN or FULL JOIN is being used.

Let’s examine some queries. They all use the AdventureWorks sample database.

SELECT p.ProductID, p.Name, p.Color, p.Size
FROM Production.Product p;

This is our starting place. Our control query if you like. Its execution plan is very simple – a clustered index scan on the Product table. There are no other tables that need to be involved. It returns 504 rows.

Now let’s involve another table – Production.ProductSubcategory.

SELECT p.ProductID, p.Name, p.Color, p.Size, s.Name as SubcatName
FROM Production.Product p
  JOIN
  Production.ProductSubcategory s
  ON p.ProductSubcategoryID = s.ProductSubcategoryID;

This is a fairly standard lookup query. It’s essentially still about the Product table, but we’re hooking into the Subcategory table to be able to fetch some more information. Clearly we want effect #1 to take place.

But what about the other effects?

We don’t have duplicated rows, as s.ProductSubcategoryID is known to be unique (it’s the Primary Key).

We’re not introducing NULLs on the Product table side of the query, as we’re not using RIGHT JOIN or FULL JOIN.

And because this is a foreign key relationship, our constraint should stop us from eliminating rows. Only valid values can be used in p.ProductSubcategoryID – but this field can take NULLs, so in fact, the Eliminated Rows effect is taking place. Looking at the results of the query shows that only 295 rows are returned. This can be avoided by using a LEFT JOIN instead of the INNER JOIN.

SELECT p.ProductID, p.Name, p.Color, p.Size, s.Name as SubcatName
FROM Production.Product p
  LEFT JOIN
  Production.ProductSubcategory s
  ON p.ProductSubcategoryID = s.ProductSubcategoryID;

Now we are only seeing the first JOIN effect. As expected, the execution plan involves indexes on both tables.

But look what happens if you remove s.Name from the query.

SELECT p.ProductID, p.Name, p.Color, p.Size
FROM Production.Product p
  LEFT JOIN
  Production.ProductSubcategory s
  ON p.ProductSubcategoryID = s.ProductSubcategoryID;

Now, none of our four effects are coming into play. And when we look at the execution plan, we see something surprising.

The query has simplified back down to a single-table query. The query optimizer has deemed that the JOIN wasn’t having any effect on the query and therefore removed it.

At the moment, this may seem quite academic, but this can be leveraged to make life a lot easier for the average database developer.

It turns out that about half the time I need to refer to Products, I need to perform a bunch of lookups. I need to look up ProductType information, ProductSubcategory information, ProductCategory information, and so on. But each of these lookups makes my query-writing a little slower. I might be quite quick at writing these queries, but my queries typically have ten lines or more that are just providing the lookup logic.

It would be so much nicer to have a view called ProductsPlus, which was a slightly denormalised version of the Product table – including all my lookup information.

The problem is that without an understanding of the four join-effects, the system is probably going to be performing all the lookups every time you use the ProductsPlus view, even if you’re not interested in all the lookup columns. Examining the join-effects, we can eliminate them, just like in the previous example.

Let’s consider the contents of our ProductsPlus view. We have foreign key relationships in place to assist. All the FK columns can take NULL, so we are using LEFT JOIN. If they didn’t allow NULL, INNER JOIN would be okay.

CREATE VIEW dbo.ProductsPlus
AS
SELECT p.ProductID, p.Name, p.Color, p.Size, s.Name AS SubcatName, c.Name AS CatName, m.Name AS ModelName, su.Name AS SizeUnit
FROM Production.Product p
  LEFT JOIN
  Production.ProductSubcategory s
  ON p.ProductSubcategoryID = s.ProductSubcategoryID
  LEFT JOIN
  Production.ProductCategory c
  ON s.ProductCategoryID = c.ProductCategoryID
  LEFT JOIN
  Production.ProductModel m
  ON m.ProductModelID = p.ProductModelID
  LEFT JOIN
  Production.UnitMeasure su
  ON su.UnitMeasureCode = p.SizeUnitMeasureCode
;

When I query this view using SELECT * FROM dbo.ProductsPlus, I get all the tables involved, of course.


But because we took care when creating the view to avoid effects #2, #3 and #4, when we only need certain columns, the other tables are completely ignored.
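For example, a query like this should produce a plan that only touches the Product table, because every one of the lookup joins can be eliminated:

SELECT ProductID, Name, Color, Size
FROM dbo.ProductsPlus;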

Now your SELECT queries can be much shorter and easier to read, without involving tables needlessly.

Doing the maths to understand SQL optimiser choices

The second of my tips from my talk (slides and scripts available) was about doing the maths to understand why the SQL Server optimiser behaves in the way that it does – particularly in relation to choosing whether to scan an index/heap, or whether to seek a less-suitable index for some of the data and then do a lookup to fetch the rest. This applies in SQL Server 2005 just fine, by the way, as do all the scripts from my talk.

The basic gist of it is that the system will decide to use a plan which minimises the number of reads required to get all the data. So my example scripts compare a scenario of looking up employees in AdventureWorks.HumanResources.Employee by ManagerID. There’s a Non-Clustered Index on ManagerID (which naturally then includes EmployeeID, which is the Clustered Index). But if we also want the LoginID field, then our NCIX isn’t quite so handy.

The choice of how to implement this comes down to basic mathematics. First, a couple of useful facts:

  • The whole Employee table (CIX in the physical representation) is on 9 pages.
  • Seeks on small indexes typically take 2 reads – one to find out which page the row is on, and the other to get that page.
  • Lookups are essentially the same kind of thing as seeks.

So let’s consider the two options here – scanning the table v seeking the index and looking up the rest.

Scanning a table is always going to be 9 reads. Easy.

Seeking the non-clustered index is going to take a couple of reads (or maybe even three) to find the EmployeeIDs for the people that report to the manager of interest. But then it’s going to require another two reads for every employee in this list, to be able to get the LoginID field out of the CIX. So it’s going to be 2+2n reads (where n is the number of employees).

So if our manager has 2 reports, we get 2+2×2=6 reads using the seek, and the system will use this method.

But if our manager has 4 reports, we get 2+2×4=10 reads using the seek, which would make it better to just scan the CIX instead (only 9 reads).

Naturally, if we change our index to include the LoginID, the query then takes 2 reads, whether the manager has 2 or 4 direct reports. It could take more if those records are stored over multiple pages.
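If you want to see the read counts for yourself, something along these lines does the trick (the index name is only a suggestion, and substitute a ManagerID from your own copy of AdventureWorks):

SET STATISTICS IO ON;

-- Either a seek on the ManagerID index plus lookups into the CIX (2 + 2n reads),
-- or a scan of the whole table (9 reads) - whichever the optimizer thinks is cheaper
SELECT EmployeeID, LoginID
FROM HumanResources.Employee
WHERE ManagerID = 16;

-- A covering version of the index removes the lookups entirely
CREATE NONCLUSTERED INDEX IX_Employee_ManagerID_LoginID
ON HumanResources.Employee (ManagerID)
INCLUDE (LoginID);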

Understanding this simple bit of maths can really impact your indexing strategy, and your appreciation of the way that T-SQL works.

OUTPUT clause – knowing what goes in, and what you’ve accidentally taken out

The OUTPUT clause has to be one of the best T-SQL features out there. It was new in SQL Server 2005, but it’s still remarkably little known. I guess like many of the features that were introduced in SQL 2005, many people just make do with the way they did things before.

The basic point of the OUTPUT clause is to turn a data modification statement into one that returns data (although this data can be redirected to an existing table by simply adding “INTO tablename” after it), by providing a way of accessing the inserted and deleted tables that we have been using in triggers for years. If a DELETE statement is being issued, only the deleted table is available, and similarly inserted for an INSERT statement. UPDATE provides both – and the MERGE statement (new in SQL Server 2008) populates both (although it feels as if a FULL JOIN has been applied to them).

I’m sure it’s easy to imagine the tremendous advantage to using it with UPDATE statements – auditing. It’s very easy to push the changes to an audit table, without the need to create a trigger to do it for us. But I find that the biggest advantage is for those times when carelessness has got the better of us. Hopefully these times are rare (and the more experienced we get with databases the more we seem to respect the sanctity of the data), but if you always use “OUTPUT deleted.*” when deleting, or “OUTPUT deleted.*, inserted.*” when updating, then the odd time when you see more rows come back than you expected, you can have a plan associated with your “Oops” moment.
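A rough sketch of both habits (the table and column names here are entirely hypothetical):

-- The safety net: see exactly what a DELETE removed
DELETE FROM dbo.SomeTable
OUTPUT deleted.*
WHERE SomeDate < '20090101';

-- Auditing an UPDATE: push the old and new values straight into an audit table
UPDATE dbo.SomeTable
SET Amount = Amount * 1.1
OUTPUT deleted.SomeID, deleted.Amount, inserted.Amount, GETDATE()
  INTO dbo.SomeTable_Audit (SomeID, OldAmount, NewAmount, AuditTime)
WHERE Region = 'SA';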

If you don’t use the OUTPUT clause, you get a message that tells you how many rows were affected by your query. If you expect that number to be small, and it’s actually quite large, you’ve probably done something wrong. “Oops” is probably an understatement. You may have started a transaction and be able to roll it back, but until such time as you get to that, you have locks which aren’t being released. Regardless of whether or not you can roll it back, having something which shows you what you’ve just done can really help you out. You can copy the data presented into Excel, or Notepad, or whatever, and work out your problem. If you can’t roll it back, then this may involve some sort of import process being quickly thrown together.

The benefit is far less obvious when inserting data – but in some ways, it’s actually even more useful.

We’ve had the @@IDENTITY and SCOPE_IDENTITY() functions available for some time, and they’re widely used. But if multiple records are inserted, or if the targeted table doesn’t contain an identity field, then they’re actually not quite so great. Knowing which record is which is a question of re-querying the data and hoping you can tell. But if you “OUTPUT inserted.*”, your application can immediately tell which row was inserted with which surrogate key (it’s less of an issue if you use a natural primary key of course). With a large number of systems opting to use guids for PKs, defaulting to a value of newid(), it really helps to have an OUTPUT clause so that the guid doesn’t have to be generated prior to performing the insert.
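For example (hypothetical tables again, with OrderID being a guid that defaults to newid()):

-- Insert a batch of rows and immediately see which key went with which row
INSERT dbo.Orders (CustomerCode, Amount)
OUTPUT inserted.OrderID, inserted.CustomerCode, inserted.Amount
SELECT CustomerCode, Amount
FROM dbo.StagedOrders;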

The biggest caveat with the OUTPUT clause is that it can be ruined by triggers. A trigger being called can prevent the inserted and deleted tables from being available at the end of the statement. I’d like to have a way of stating that the OUTPUT clause should still work, returning the state of the inserted and deleted tables prior to any triggers being called, but I suppose I can understand the logic behind the decision to disallow it.

This was one of the tips in my TechEd Australia presentation, which was re-delivered to user groups in Adelaide and Melbourne. The scripts and slides are available for download. This was one of the more popular tips, based on feedback.