
The Query Optimizer’s handling of Relationships for T-SQL Tuesday #003

I’m feeling the pressure for this month’s T-SQL Tuesday, probably because I’m also the host. I’ll be posting a roll-up for it soon too, which I’m sure will be great fun researching.

Given that the topic is relationships, and SQL Server is at its heart a relational engine, relationships are incredibly relevant to databases. The idea behind RDBMSs is that keys are used to refer to entities. This leads to foreign keys, whereby a column (or set of columns) is constrained to values which appear in another table, so that referential integrity is enforced. And yet there are so many database designs out there that do not have relationships defined between tables. I shudder when I find them, but that doesn’t make them any less commonplace.

In data warehouses, designed primarily for reporting systems rather than transactional systems, tables are often designed in a more denormalized manner, to avoid needing to perform so many joins to access the data required for reports. This involves having tables with many extra columns, containing data that would otherwise be stored in other tables. Fewer tables are used, and therefore the system has fewer relationships.

I sometimes wonder how this should affect relational database design. I have written before about the fact that the Query Optimizer can leverage foreign key relationships to be able to simplify queries by noticing that joins can be redundant and simplified out of plans, but to summarise:

A foreign key relationship is between a column (or set of columns) in a table to a unique key (typically the primary key) in another table (which could even be the same one). Because of this uniqueness, the relationship can map to at most one record on the other side. And because the foreign key relationship enforces referential integrity, it must map to exactly one record on the other side (a caveat being that the foreign key column(s) could be configured to allow NULLs, which won’t map). Therefore, a join that doesn’t actually select data from any of the columns in the second table might be able to be simplified out of the query completely, as if the query didn’t have the join at all. But read my other post for more on that.
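As a sketch of that simplification (using a hypothetical pair of tables, not a real schema): suppose dbo.Orders.ProductID has a trusted foreign key to dbo.Products.ProductID and doesn’t allow NULLs. Then consider:

select o.OrderID, o.OrderQty
from dbo.Orders o
join dbo.Products p on p.ProductID = o.ProductID;

Nothing is selected from Products, and every Orders row must match exactly one Products row, so the Query Optimizer can produce the same plan as if the join weren’t in the query at all.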

Thinking about many of the queries that I’ve written over the years, I know that I often only want one field from the table I’m joining. For example, I might want to get a ProductName when I have a ProductID, or I might want the Login from a UserID. A standard Lookup situation, which in a data warehouse would often be handled by storing the Name in a dimension table rather than the ID.

So my surmising leads me to this question:

If there is a unique index on the field that I typically want to look up from a table, does this make that field a better candidate for foreign key relationships than the primary key, so that the system can avoid needing as many joins?

This screams against everything I ever learned about relational databases. It would obviously make for a larger row, but if this were offset by performance gains in querying, could it be worthwhile? Maintaining referential integrity based on a string field may be more costly than on an integer field, but I’m wondering if the impact on SELECT queries through the reduction of the number of joins required might not make it a worthy consideration.

Please note: I’m not saying it’s necessarily a good idea – I’m surmising here, and would love to hear comments about whether or not other people have tried it, what circumstances they were trying to handle, and whether or not the idea worked.
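To make the idea concrete, here’s a sketch of the kind of design I’m surmising about (hypothetical tables – the unique constraint creates the unique index that the foreign key references):

create table dbo.Products
(
    ProductID int not null primary key,
    ProductName nvarchar(50) not null constraint uq_Products_Name unique
);

create table dbo.OrderLines
(
    OrderLineID int not null primary key,
    ProductName nvarchar(50) not null
        constraint fk_OrderLines_Products foreign key references dbo.Products (ProductName)
);

With this in place, a report that wants the product name for each order line never needs to join to dbo.Products at all.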

But back to the Query Optimizer…

The QO needs information to be able to work out how to run your query. It needs statistical information about the columns that are filtered; it needs to be able to figure out which parts of the query can utilise indexes effectively (see my recent posts about SARGability); but also very significantly, it needs to have information about the relationships between tables.

Any time you write a query that involves a join (which I imagine most of your queries do), the system can use information about the relationship between the two tables. If it is defined as a foreign key relationship that doesn’t allow NULL values, then the system knows that each record in the ‘child’ table must match exactly one record in the other table. The Query Optimizer can use this information, and it should be available. Any time you’re thinking of not putting enforced relationships in your system, consider the fact that you’re blinding the Query Optimizer and ultimately making it do more work.

Put your relationships into your database design. Put constraints where they can apply, mark items as unique. The Query Optimizer will thank you for it, expressing its thanks as better performance.

A CASE study in SARGability

A recent discussion on Twitter about a query that Denny Cherry was looking at led to this post by Josef Richberg: http://josef-richberg.squarespace.com/journal/2010/1/28/is-a-case-statement-considered-sargable.html, and I thought it might be worth going through a few points on the topic of SARGability. Particularly given that I wrote a related post recently myself.

If something is SARGable, it means it can be used as a Search Argument – that is, it can be used effectively in conjunction with an index.

To repeat a common analogy, trying to find “Farley” in the phonebook is a Seek (I can pretty much go straight there), whereas trying to find names which end in “arley” is not. Those names could be just about anywhere.

The way that an Index Seek works is that the system can go to a record and tell whether the row(s) it’s looking for is to the left of the current value or to the right. Looking for “Farley” is easy. I open the book at “M” and know that I must go left. I open the book at “C” and know that I must go right. It’s SARGable.

As Josef writes, applying a function to a field stops it being SARGable. If we wanted to say WHERE RIGHT(Surname,5) = ‘arley’ we would quickly see a performance problem, as finding these records would mean starting at the Aardvarks (in case there was an ‘Aardvarkarley’) and ending with the ZZzzzs (in case there was a ‘ZZzzarley’). On the other hand, looking for WHERE Surname LIKE ‘Farl%’ is quick, because a wildcard that doesn’t appear at the start of the pattern doesn’t affect the SARGability.

At first glance, an ON clause such as:

ON g.GroupId = CASE
             WHEN c.GroupId <> l.GroupID
               AND l.GroupID <> @AccGroupID
             THEN l.GroupID
             ELSE c.GroupID
             END

…would seem to be nasty for SARGability. We’re not just looking for a row where GroupId = @AParticularGroupID, we’re looking for something that is the result of a CASE expression, which is going to change for every combination of rows in the tables ‘c’ and ‘l’.

But actually, it’s a lot more complicated than that – and yet in some ways, a lot simpler.

Let’s think about the term SARGable again. It means that we’re able to use it as a Search Argument – but what is the ‘it’ we’re looking at? What is the thing that is SARGable? (This next bit is the key – so I’ll write it again in a moment, in bold) The thing that is SARGable is the term that is indexed – in other words, g.GroupId. It doesn’t matter how much work is required to work out that value – it’s whether or not you can find it in the index.

SARGability is about the thing that is (hopefully) indexed in the table (or set) that you’re introducing into the query.

In the phonebook, RIGHT(Surname,5) isn’t in the index. No index on Surname is going to help match RIGHT(Surname,5) (unless you want to index the result of that function). But if you wanted to say WHERE Surname = REVERSE(‘yelraF’) then there’s no problem. Despite how nasty REVERSE is, it doesn’t affect what we’re looking up in the book. We can evaluate REVERSE, and then (using the index) find the Surname that matches that. Surname = something is SARGable.
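In query form (dbo.PhoneBook is a hypothetical table with an index on Surname):

select Surname, Phone
from dbo.PhoneBook
where Surname = reverse('yelraF');

The plan evaluates REVERSE once, up front, and then performs an Index Seek on Surname using the result – the function never touches the indexed column itself.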

So the CASE clause above is largely irrelevant to finding something by GroupId. The result of the CASE can be easily worked out, and then a seek done on an index on the ‘g’ table.

That was the simple aspect of it.

Consider that you have a diary, and you want to mark a week before some birthdays, so that you can go shopping. Suppose you have Jul 11, Feb 6, Dec 10. You take one of the birthdays, work out the date that is a week before it, then do an Index Seek to find that date in the diary. Then you repeat it twice more. This query might be along the lines of:

FROM people p JOIN diary d ON d.diarydate = dateadd(week, -1, p.birthday)

You’ll notice that we happily did an Index Seek on the diary. The dateadd function on p.birthday had no effect on us. The SARGability applies on whichever table/set we’re thinking about at the time, which in this scenario is the diary.

But consider that we wrote the query like this, which essentially means the same, looking for birthdays and diary dates that are a week apart, but applying the function to d.diarydate instead:

FROM people p JOIN diary d ON p.birthday = dateadd(week, 1, d.diarydate)

Logically, this means the same. But this form is much more suited to finding the birthday once we know the diarydate, rather than the other way around. This might be fine if we had thousands of people to consider, and wanted to look up a mere 365 diary dates in our list of people (indexed by birthday)… but we’d still be looking up 365 dates. The function still kills the SARGability of the column it wraps – it’s a question of which column we want to be considered indexable.

Repeating the matching process over and over like this is known as a Nested Loop. As quick as it might be to find each record because of SARGability, we’d still be doing it 365 times.

Things would be much better if we could index both sides in this scenario. Then, we might be able to utilise a different kind of join, such as a Merge Join, which involves running through two ordered sets, comparing the values to find matches in a single pass. But for this to apply, both sides must be SARGable.

Consider two tables with numbers in them… one with 1, 1, 5, 10, 10, 200, and one with 3, 4, 5, 6, 7, 8, 8, 10. To do a Merge Join between them, the system will see that the first table starts with 1, while the second one starts with 3. It will skip through the 1s in the first table to the 5, and then start skipping through the second table to find matches. It’s very fast, but it needs to be able to know whether the value it’s looking for is to the left or right of where it’s up to. This would involve an Index Scan on both sides, but it would be quicker than doing lots of Index Seeks. A single seek is faster than a single scan, but something quick done over and over can take longer than using a slow method one time.

Unfortunately, the Query Optimizer currently isn’t smart enough to know that dateadd(week, 1, d.diarydate) doesn’t change the order of dates in the diary, and that it can easily tell whether it should go left or right. That’s the topic of my earlier blog post, and a Connect item to ask that Microsoft fix this.

So should you care about SARGability? After all, if the system is doing an Index Seek anyway, you’ve got SARGability in play, and it shouldn’t matter.

You should try to arm the Query Optimizer with as many options as possible, so that it can use the best plan available, based on the statistics. We might be happy with looking up three diary dates in a Nested Loop, but we wouldn’t be happy doing that thousands of times. Think about how YOU would solve the problem without a computer (like my diary analogy, or the phonebook). If your execution plan is similar to your paper-based solution, then you’ve done well. If it’s not, then maybe you should look into it some more and work out if the query is okay or not.

In Josef’s situation, the Index Seek was being done 1155 times. That might be okay, but it also might not be. But should the query be rewritten? That’s a different question, one that I couldn’t really answer without knowing more. The most important thing about a query is not its performance, but its correctness. If you need to kill the SARGability to maintain correctness, then so be it. You can always handle it with an indexed computed column if you need to, or even an indexed view.
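For example (a sketch against the same hypothetical dbo.PhoneBook table): if you genuinely need to search on the end of the surname, you can index a computed column, so that the expression itself becomes the thing that’s indexed:

alter table dbo.PhoneBook
    add SurnameEnd as right(Surname, 5);

create index ix_PhoneBook_SurnameEnd on dbo.PhoneBook (SurnameEnd);

select Surname, Phone
from dbo.PhoneBook
where SurnameEnd = 'arley';

The query keeps its meaning, but the predicate is now SARGable against the new index.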

SARGable functions in SQL Server

SARGable is an adjective in SQL that means that an item can be found using an index (assuming one exists). Understanding SARGability can really impact your ability to have well-performing queries. Incidentally – SARGable is short for Search ARGument Able.

If you have an index on phone numbers using LastName, followed by FirstName, including the suburb and address fields, you have something akin to the phone book. Obviously it becomes very easy to find people with the surname “Farley” and the first name “Rob”, but often you want to search for people with the surname “Farley” whose first name begins with ‘R’. I might be listed as “R Farley”, “R J Farley”, “Rob Farley”, “Robert Farley”, “Robert J. Farley”, or a few other variations. It complicates things even more if you need to find someone with a name that shortens a different way, like John/Jack, or Elizabeth/Betty. This is where SARGability comes into play.

Let’s just think about the First names for a minute.

If you want to find all the names that start with R, that’s easy. They’re all together and you can get to them very quickly. This is comparable to a query in SQL Server like this (taking advantage of the index on the Name column in Production.Product):

select Name, ProductID
from Production.Product
where Name like ‘R%’ ;

Looking in the Execution Plan, we see an Index Seek to find the 52 rows, and the seek has a Seek Predicate like this (by looking in either the ToolTip of the operator, the Properties window, or the XML itself):

Seek Keys[1]: Start: [AdventureWorks].[Production].[Product].Name >= Scalar Operator(N’R’), End: [AdventureWorks].[Production].[Product].Name < Scalar Operator(N’S’)

This shows that the system looks at the LIKE call, and translates it into a greater-than and less-than query. (Interestingly, have a look at the End Seek Key if you tell it to find entries that start with Z.)

So the LIKE operator seems to maintain SARGability.

If we want to consider Names that have R for the first letter, this is essentially the same question. Query-wise, it’s:

select Name, ProductID
from Production.Product
where LEFT(Name,1) = ‘R’ ;

Unfortunately the LEFT function kills the SARGability. The Execution Plan for this query shows an Index Scan (starting on page one and going to the end), with the Predicate (note: not a Seek Predicate, just a Predicate) “substring([AdventureWorks].[Production].[Product].[Name],(1),(1))=N’R’”. This is bad.

You see, a Predicate is checked for every row, whereas a Seek Predicate is used to seek through the index to find the rows of interest. If an Index Seek operator has both a Predicate and a Seek Predicate, then the Predicate acts as an additional filter on the rows that the Seek (using the Seek Predicate) has returned. You can see this by using LIKE ‘R%r’.

Considering the first part of a string doesn’t change the order. SQL knows this because of the way it handles LIKE (if the left of the string is known), but it doesn’t seem to get this if LEFT is used. It also doesn’t get it if you manipulate a field in other ways that we understand don’t affect the order.

select ProductID
from Production.Product
where ProductID + 1 = 901;

This is doing a scan, checking every row, even though we can easily understand what we mean. The same would apply for this query (assuming there’s an index on OrderDate):

select OrderDate
from Sales.SalesOrderHeader
where dateadd(day,1,OrderDate) = ‘20040101’
;

And perhaps most significantly:

select OrderDate
from Sales.SalesOrderHeader
where dateadd(day,datediff(day,0,OrderDate),0) = ‘20040101’
;

…which is widely recognised as an effective method for date truncation (and is exactly why you should always compare dates using >= and < instead).
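The SARGable alternative to truncating OrderDate is to leave the column alone and turn the single-day match into a range:

select OrderDate
from Sales.SalesOrderHeader
where OrderDate >= '20040101'
and OrderDate < '20040102'
;

Same rows returned, but now the predicate can be used as Seek Keys against an index on OrderDate.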

But more interestingly…

…this query is just fine. Perfectly SARGable.

select OrderDate
from Sales.SalesOrderHeader
where cast(OrderDate as date) = ‘20040101’
;

This query does a little work to figure out a couple of constants (presumably one of them being the date 20040101, and another being 20040102), and then does an Index Seek to get the data.

You see, the date and datetime fields are known to have a special relationship. The date type is essentially the left-most three bytes of a datetime type, and therefore the ordering is identical.

It doesn’t work if you want to do something like:

select OrderDate
from Sales.SalesOrderHeader
where convert(char(8), OrderDate, 112) = ‘20040101’
;

…but did you really think it would? There’s no relationship between strings and dates.

I wish it did though. I wish the SQL team would go through every function and think about how they work. I understand that CONVERT will often change the order, but convert using style 112 won’t.

Also, putting a constant string on the end of a constant-length string shouldn’t change the order. So really, this should be able to work:

select OrderDate
from Sales.SalesOrderHeader
where convert(char(6), OrderDate, 112) + ’01’ = ‘20040101’
;

But it doesn’t.

Interestingly (and a prompt for this post), the hierarchyid type isn’t too bad. It understands that some functions, such as getting the Ancestor won’t change the order, and it keeps it SARGable. Here the asker had noticed that GetAncestor and IsDescendantOf are functions that don’t kill the SARGability – basically because the left-most bits of a hierarchyid are the parent nodes.

http://stackoverflow.com/questions/2042826/how-does-an-index-work-on-a-sql-user-defined-type-udt
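A sketch of the behaviour (dbo.OrgChart is a hypothetical table with a hierarchyid column, indexed on OrgNode):

declare @Manager hierarchyid =
    (select OrgNode from dbo.OrgChart where EmployeeID = 42);

select EmployeeID, OrgNode.ToString() as OrgPath
from dbo.OrgChart
where OrgNode.IsDescendantOf(@Manager) = 1;

Despite the method call on the column, the optimizer can translate this into a range seek on the OrgNode index, because all the descendants of a node sit together in depth-first order.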

Spatial types can show similar behaviour.

So I get the feeling that one day we might see the SQL Server team implement some changes with the optimizer, so that it can handle a lot more functions in a SARGable way. Imagine how much code would run so much better if order-preserving functions were more widely recognised. Suddenly, large amounts of code that wasn’t written with SARGability in mind would start running quicker, and we’d all be hailing the new version of SQL Server.

I’ve raised a Connect item about this, at https://connect.microsoft.com/SQLServer/feedback/ViewFeedback.aspx?FeedbackID=526431

You may have code that would run thousands of times faster with this change. That code may live in third party applications over which you have no control at all. If you think there’s a chance you fall into that bracket, why not go and vote this up?

Foreign Keys against Views

Foreign Keys are great. For those of you who read some of my other posts, or heard me present on this material before, a foreign key can really help the Query Optimizer simplify out your query, because it knows that it must be able to find a match in an equality join (unless the field allows NULLs, which is a whole nother story).

I also blogged recently about the fact that a foreign key doesn’t have to reference the primary key of a table. It turns out it can reference any candidate key – that is, one that has a unique index on it. This presents all kinds of interesting design ideas, such as avoiding joins by storing a different field (such as the username, rather than the userid, if the username is the one thing you keep looking up to display at the top of the webpage). Referential integrity can still be maintained happily using an index seek (on a slightly larger index, but quick nonetheless), and the benefit could be huge if many queries no longer need to do a join to get that info.
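In DDL terms, the sketch looks like this (hypothetical tables – the foreign key simply names the unique column, and SQL Server uses the unique index behind it):

create table dbo.Users
(
    UserID int not null primary key,
    Login nvarchar(50) not null constraint uq_Users_Login unique
);

create table dbo.WebSessions
(
    SessionID int not null primary key,
    Login nvarchar(50) not null
        constraint fk_WebSessions_Users foreign key references dbo.Users (Login)
);

select SessionID, Login
from dbo.WebSessions;

That last query can show the username without ever joining to dbo.Users.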

But I found an interesting question on Stack Overflow, asking whether a foreign key could reference a view. The asker was trying to do it for the purpose of data integrity – which would otherwise probably require an expensive trigger. But let’s talk about the Query Optimizer, which I think gives another useful reason to want a foreign key to hook into a view.

As a view is only a stored sub-query, it’s a strange request, but one that is definitely interesting. A view can be instantiated as an indexed view, so I thought there was definitely potential in his question.

Unfortunately I had no luck trying to help him.

My first thought was “Well, you’re going to need a unique index on the view – that’s doable – a Clustered Index on an Indexed View”, but you get an error saying that the foreign key needs to reference a user table. A user table? So it can’t even reference a system table. But I guess that’s fair enough these days, since we don’t really have system tables any more (they’re all system views, referencing underlying functions and the like).

My next thought was whether or not you could recreate something similar using a filtered index. After all, a filtered index satisfies many of the scenarios that indexed views often address. They can be unique for example, so you might use an indexed view to make sure that once a particular value is reached in a Status field, some other field must be unique. In SQL Server 2008 it’s no problem to create a unique index and provide a WHERE clause for it. But unfortunately when you create a foreign key, you indicate the table you’re referencing, not the index. I tried using a check constraint, to make sure that the foreign column could only contain values that were in the unique filtered index, but SQL just doesn’t consider that a filtered index is available for referencing by a foreign key.
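For reference, the filtered unique index itself is easy enough to create (hypothetical table), which is what makes the limitation frustrating:

create unique index ux_Accounts_Code_Active
on dbo.Accounts (AccountCode)
where Status = 'Active';

AccountCode is now guaranteed unique among Active rows, but a foreign key declared as REFERENCES dbo.Accounts (AccountCode) still fails, because the column isn’t backed by an unfiltered unique index.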

If you know a way to help this guy, follow the link and provide an answer. In a couple of days’ time, the answer with the most votes will be automatically marked as the answer by the Stack Overflow system, which won’t help him much. He understands that using views doesn’t negate the FK constraints on the underlying tables, and I hope he can still work out a way of doing it without triggers, but at the end of the day, this other answer of mine, where I eventually say “Use triggers on the tables which contribute to your view” is probably the unfortunate truth.