A moving time…

For those of you who read my blog using RSS, you probably won’t even see this post, but either way.

I’ve decided to move my blog to sqlblog.com. I have reposted some of my recent posts, but will not be posting any further content here at msmvps.com.

Both the feed from msmvps.com and sqlblog.com use feedburner.com, so the old feed will continue working, but I recommend changing to the new one in case my blog disappears from msmvps one day.

So farewell, msmvps.com (and thanks Susan!), it’s been fun. I’m sure I will still have many posts which link back here…

T-SQL Tuesday #003 (Relationships): The round-up

Lots of blog posts for this month, for the first T-SQL Tuesday to leave the safe-haven of Adam Machanic’s blog. Some people obviously missed out, probably because they don’t read this blog, but I guess that’s the nature of the meme. I don’t know who is hosting next month yet, but I’ll be looking out for Adam to post something about it in early March.


All the posts had to appear as trackbacks or comments on the invitation to participate, but this post is a short summary for posterity.


As the second week of February involves Valentine’s Day (and a few days earlier, my wedding anniversary), I thought the topic of Relationships would be a nice one for this event. There was a good range of topics too, which I have ordered by the type of relationships chosen.


Special mention:


It wasn’t quite on the right day, but a big congratulations to Michelle Ufford who almost managed to time the birth of her new baby Chloe for T-SQL Tuesday.


We have “purely technical” posts:


Brad Schulz is keeping the bar way too high, and if you haven’t read his pieces from previous Tuesdays, then I recommend you go through the history of posts on his blog. This month he has written a letter of disappointment to the FROM clause. At some point Brad will likely be asked to compile these posts into a book, but until that happens, you’ll have to follow the link to his blog. It’s entertaining, but still fits in the “purely technical” category.


John Dunleavy demonstrated (complete with screenshots – something I should put more of in my posts) how foreign keys can be made so easily using the Diagram part of Management Studio. It’s not something I do much of, but I have to admit that reading posts like John’s can often inspire me to changing my ways.


Bryan Smith also talked about database diagrams, and how they can be used to discover relationships in a system.


Michael Coles officially missed the deadline, but I’m going to link to him anyway, demonstrating a nice trick for creating a Product aggregate based on the relationship between a number and its logarithm.


Allen White wrote about the fact that any RDBMS should have relationships to really be considered relational. Great reminder of some of the basics.


Marco Russo timed a piece on relating tables in DAX amazingly well, and only realised that it qualified for T-SQL Tuesday after he had initially posted it. Useful piece, which will appear in search engine results for years to come I’m sure.


Rob Farley wrote a pile of rubbish… hey, that’s me! I wrote about the importance of relationships in a database system to help the Query Optimizer do its job. I also surmised that Foreign Keys using candidate keys (rather than the primary key) might help many queries do away with needing joins. If you have thoughts on that I’d still like to hear them.


Some posts that were about less technical relationships:


Jen McCown wrote about how she left her husband Sean, in a piece called ‘I love you, I quit’. Actually, she was just leaving a job, but it’s a nice piece about the degradation of a employer-employee relationship. It rings too true for all of us I think, and I hope that as an employer I manage to ‘keep the mystery’ for my employees to stop them going through that same experience.


Jason Brimhall wrote about some of the different relationships in his life, particularly how he wants to make sure that the Parent-Child relationship in his life doesn’t become a Foreign relationship. Nicely done Jason.


Allen Kinsel recommends that relationships with professional organisations can be deepened, and that this can be very beneficial. I’m sure he’s right. I have a tendency to get extra-involved in groups, and I hope Allen’s sentiments are heard by many.


The link to Steve Jones’ article must’ve changed. The link that I followed today didn’t work, even though I’d read it successfully a couple of days ago. Still, I managed to find it, and I can thoroughly recommend reading about the relationships between Steve and his colleagues. He’s most definitely correct in pointing out that any effort in developing personal relationships with your colleagues will help you get things done!


And some “combination” posts:


Mike Walsh provided the first piece of the day, with an excellent run down of various types of relationships that are important, including a recommendation to read up on database design. Great advice, Mike. Thanks.


With Kalen Delaney’s post we almost got two for the price of one. A brilliant prelude talking about some of the personal relationships that have enhanced her SQL career, followed up by covering how foreign key relationships have developed through the versions of SQL Server. Kalen’s blog posts are always worth reading, as I’m sure everyone who is reading this post appreciates.


Stuart Ainsworth’s piece was on Maslow, drawing parallels between the hierarchy of needs of a person with the hierarchy of needs of a database system. It’s thought-provoking, and something that I feel could be made into a poster for database developers’ walls.


In summary


I’d like to thank everyone who has taken part, and for Adam for having introduced the T-SQL Tuesday concept to us. Keep your eye on his blog to find out what’s going on next month. If your name isn’t listed here, then I encourage you to write something for March.


Also remember that lots of these people are on Twitter and are very much followable. Look at the hashtag #tsql2sday for related posts, and make sure you follow the people who post blogs for these events.

The Query Optimizer’s handling of Relationships for T-SQL Tuesday #003

I’m feeling the pressure for this month’s T-SQL Tuesday, probably because I’m also the host. I’ll be posting a roll-up for it soon too, which I’m sure will be great fun researching.


Given that the topic is on relationships, and the main SQL Server database is a relational engine, relationships are incredibly relevant to databases. The idea behind RDBMSs is that keys are used to refer to entities. This leads to foreign keys, such that a particular column(s) is constrained to values which appear in another table, thus referential integrity is enforced. And yet there are so many database designs out there that do not have relationships defined between tables. I shudder when I find them, but that doesn’t make them any less commonplace.


In data warehouses, designed primarily for reporting systems rather than transactional systems, tables are often designed in a more denormalized manner, to avoid needing to perform so many joins to access the data required for reports. This involves having tables with many extra columns, containing data that would otherwise be stored in other tables. Fewer tables are used, and therefore the system has fewer relationships.


I sometimes wonder how this should affect relational database design. I have written before about the fact that the Query Optimizer can leverage foreign key relationships to be able to simplify queries by noticing that joins can be redundant and simplified out of plans, but to summarise:


A foreign key relationship is between a column (or set of columns) in a table to a unique key (typically the primary key) in another table (which could even be the same one). Because of this uniqueness, the relationship can map to at most one record on the other side. And because the foreign key relationship enforces referential integrity, it must map to exactly one record on the other side (a caveat being that the foreign key column(s) could be configured to allow NULLs, which won’t map). Therefore, a join that doesn’t actually select data from any of the columns in the second table might be able to be simplified out of the query completely, as if the query didn’t have the join at all. But read my other post for more on that.


Thinking about many of the queries that I’ve written over the years, I know that I often only want one field from the table I’m joining. For example, I might want to get a ProductName when I have a ProductID, or I might want the Login from a UserID. A standard Lookup situation, which in a data warehouse would often be handled by storing the Name in a dimension table rather than the ID.


So my surmising leads me to this question:


If there is a unique index on the field that I typically want to lookup from a table, does this make it a better candidate for foreign key relationships than the primary key, so that the system can avoid needing as many joins?


This screams against everything I ever learned about relational databases. It would obviously make for a larger row, but if this were offset by performance gains in querying, could it be worthwhile? Maintaining referential integrity based on a string field may be more costly than on an integer field, but I’m wondering if the impact on SELECT queries through the reduction of the number of joins required might not make it a worthy consideration.


Please note: I’m not saying it’s necessarily a good idea – I’m surmising here, and would love to hear comments about whether or not other people have tried it, what circumstances they were trying to handle, and whether or not the idea worked.


But back to the Query Optimizer…


The QO needs information to be able to work out how to run your query. It needs statistical information about the columns that are filtered; it needs to be able to figure out which parts of the query can utilise indexes effectively (see my recent posts about SARGability); but also very significantly, it needs to have information about the relationships between tables.


Any time you write a query that involves a join (which I imagine most of your queries do), the system can use information about the relationship between the two tables. If it is defined as a foreign key relationship that doesn’t allow NULL values, then the system knows that each record in the ‘child’ table must match exactly one record in the other table. The Query Optimizer can use this information, and it should be available. Any time you’re thinking of not putting enforced relationships in your system, consider the fact that you’re blinding the Query Optimizer and ultimately making it do more work.


Put your relationships into your database design. Put constraints where they can apply, mark items as unique. The Query Optimizer will thank you for it, expressing its thanks as better performance.

Invitation for T-SQL Tuesday #003: Relationships

It’s time for the third of Adam Machanic’s T-SQL Tuesdays, and this time, I’m the host. The first one, last December was on the topic of date/time, and the second was on Puzzling Situations. Check them both out, along with the round-ups that Adam wrote about them. Lots of great topics, which is starting to make me anticipate the content that comes out on the second Tuesday of each month.


As an early volunteer to host, I have been given the honour of being the first person chosen to host one. I’d like to claim that this implies some sort of special relationship between myself and the SQL community as a whole, but it’s actually just a “first-in, best dressed” policy – although the ‘best dressed’ analogy is lost on me.


Theme


Valentine’s Day is coming up. Hopefully I don’t need to tell you that it’s on February 14th, but if you’ve read this far into the post then perhaps you’re involved with databases for some reason and may need reminding. Shopping centres around the world have signs up reminding us to buy flowers for our loved ones, but I know many people in IT circles who don’t tend to go to such places, lurking in dark corners of houses until all hours of the night, surviving on pizza. Hopefully this theme will not only prompt some interesting posts, but also prompt people to go out and invest in the meaningful relationships in their own lives. Actually, if you don’t know that Valentine’s Day is February 14th, I’m guessing you don’t have anyone in your life worth buying for. ;)


For me, Valentine’s Day is only three days after my wedding anniversary, so I can’t forget either – as if I would.


So the theme for this month’s T-SQL Tuesday is Relationships.


There are a massive number of options you could go with for this theme. You could talk about Foreign Keys in the relational world. You could wax lyrical about the benefits of attribute relationships in cube design. You could write a poem for your loved one, apologising for all those hours spent in front of a Management Studio window, trying to tune a query, rather than tuning your guitar to serenade her.


Other ideas include: Relationships between Devs & DBAs, Clients & Vendors, Entities, data types, concepts (eg: Report Model & Cube), and more… if you’re struggling to think of something, drop me a line (twitter, Msgr, email, whatever – a list of contact options is over on the left) and I can help.


But so long as you can loosely tie your post to both the theme and some aspect of SQL Server, that’s fine. Be creative, informative, reflective, and hopefully relevant.


Rules


Please note that the time zone for this Tuesday is UTC. For me, that means between 10:30am Tuesday and 10:30am Wednesday. For you, it might mean some time on Monday afternoon to some time on Tuesday afternoon. If you’re lucky (read ‘English’), then you can publish your post any time on Tuesday. It’s about when it’s published though, not when you write it. I encourage you to write your post in advance, in case you’re busy on the day.


So the rules are:


  • Your post must go live after 00:00:00 UTC on Tuesday, February 9, 2010, but before 00:00:00 UTC on Wednesday, February 10, 2010
  • Your post must link back to this one, and it’s recommended that you clearly identify the post as a T-SQL Tuesday post
  • You are responsible for ensuring that a trackback or comment appears here so that I can find the posts

Follow the rules, and your post will be included in the roundup to be posted on a day or two later. Don’t follow the rules, and it won’t show up there. Simple as that!


Twitter


Follow the event on Twitter by watching for the #TSQL2sDay hash tag. (The ‘2’ refers to the fact that it’s the second Tuesday of the month, and nothing to do with the pronunciation of the word, which is more like “Choose-day” where I come from.)


Additional Notes


Please make sure you put a link in your post to this one, and post a comment here if the trackback doesn’t appear. I normally moderate comments to my blog (to avoid spam), but may consider changing that policy for 24 hours next week. I have noticed people trying to take part, but failing to make sure that a trackback/comment has appeared. So please check this.


Please check the time zone. It’s something that we’re used to here in Australia, but I know a few people missed out last month not realising that the event had shifted to UTC.


Let Adam Machanic know if you want to host. You can contact him in a variety of ways, as he mentioned last time. To host, you must have participated in two previous T-SQL Tuesday events, and your blog must have had at least one post a month for the prior six months. Let him know you’re keen even if you don’t meet these criteria, as I believe there’s a list, and you might be able to get those blog posts sorted before your turn is up. When I volunteered to host, I hadn’t participated in any yet…


If you want to host but don’t have the faintest clue on what topic to use, ask Adam for some ideas. I’ve suggested a few to him, and I think he’s putting a list together to maintain the longevity of all this. Similarly, let him know if you have some ideas – you might suggest something that catches people’s imagination like never before.


Feel free to contact Adam or myself if you have any comments or ideas for all this. I’ll probably refer you to Adam if it’s a general thing, or happily do my best to answer you if it’s about this month’s event in particular.


Have lots of fun! I look forward to reading your posts!


Edited: The round-up is at http://msmvps.com/blogs/robfarley/archive/2010/02/13/t-sql-tuesday-003-relationships-the-round-up.aspx

A CASE study in SARGability

A recent discussion on Twitter about a query that Denny Cherry was looking at led to this post by Josef Richberg: http://josef-richberg.squarespace.com/journal/2010/1/28/is-a-case-statement-considered-sargable.html, and I thought it might be worth going through a few points on the topic of SARGability. Particularly given that I wrote a related post recently myself.


If something is SARGable, it means it can be used as a Search Argument – that is, it can be used effectively in conjunction with an index.


To repeat a common analogy, trying to find “Farley” in the phonebook is a Seek (I can pretty much go straight there), whereas trying to find names which end in “arley”, is not. Those names could be just about anywhere.


The way that an Index Seek works is that the system can go to a record and tell whether the row(s) it’s looking for is to the left of the current value or to the right. Looking for “F” is easy. I open the book at “M” and know that I must go left. I open the book at C and know that I must go right. It’s SARGable.


As Josef writes, applying a function to a field stops it being SARGable. If we wanted to say WHERE RIGHT(Surname,5) = ‘arley’  we would quickly see a performance problem, as finding these records would mean starting the Aardvarks (in case there was an ‘Aardvarkarley’), and end with the ZZzzzs (in case there was a ‘ZZzzarley’). On the other hand, looking for WHERE Surname LIKE ‘Farl%’ is quick, because we see that evaluating surnames using a wildcard that doesn’t appear at the start doesn’t affect the SARGability.


At first glance, an ON clause such as:


ON g.GroupId = CASE
             WHEN c.GroupId <> l.GroupID
               AND l.GroupID <> @AccGroupID
             THEN l.GroupID
             ELSE c.GroupID
             END


…would seem to be nasty for SARGability. We’re not just looking for a row where GroupId = AParticularyGroupID, we’re looking for something that is the result of a CASE function, which is going to change for every combination of rows in the tables ‘c’ and ‘l’.


But actually, it’s a lot more complicated than that, and yet in some ways, a lot more simple.


Let’s think about the term SARGable again. It means that we’re able to use it as a Search Argument – but what is the ‘it’ we’re looking at? What is the thing that is SARGable? (This next bit is the key – so I’ll write it again in a moment, in bold) The thing that is SARGable is the term that is indexed – in other words, g.GroupId. It doesn’t matter how much work is required to work out that value – it’s whether or not you can find it in the index.


SARGability is about the thing that is (hopefully) indexed in the table (or set) that you’re introducing into the query.


In the phonebook, RIGHT(Surname,5) isn’t in the index. No index on Surname is going to help match RIGHT(Surname,5) (unless you want to index the result of that function). But if you wanted to say WHERE Surname = REVERSE(‘yelraF’) then there’s no problem. Despite how nasty REVERSE is, it doesn’t affect what we’re looking up in the book. We can evaluate REVERSE, and then (using the index) find the Surname that matches that. Surname = something is SARGable.


So the CASE clause above is largely irrelevant to finding something by GroupId. The result of the CASE can be easily worked out, and then a seek done on an index on the ‘g’ table.


That was the simple aspect of it.


Consider that you have a diary, and you want to mark a week before some birthdays, so that you can go shopping. Suppose you have Jul 11, Feb 6, Dec 10. You take one of the birthdays, work out the date that is a week before it, then do an Index Seek to find that date in the diary. Then you repeat it twice more. This query might be along the lines of:


FROM people p JOIN diary d ON d.diarydate = dateadd(week, -1, p.birthday)


You’ll notice that we happily did an Index Seek on the diary. The dateadd function on p.birthday had no effect on us. The SARGability applies on whichever table/set we’re thinking about at the time, which in this scenario is the diary.


But consider that we wrote the query like this, which essentially means the same, looking for birthdays and diary dates that are a week apart, but applying the function to d.diarydate instead:


FROM people p JOIN diary d ON p.birthday = dateadd(week, 1, d.diarydate)


Logically, this means the same. But this is much more suited to finding the birthday once we know the diarydate, rather than the other way around. This might be fine if we had a thousands of people to consider, and we wanted to look up a mere 365 diary dates in our list of people (indexed by birthday)… but we’d still be looking 365 dates. The function still kills the SARGability of the column, it’s a question of which column we want to be considered indexable.


Repeating the matching process over and over like this is known as a Nested Loop. As quick as it might be to find the record because of SARGability, we’d still be doing it 365 times. Things would be much better if we could index both sides in this scenario. Then, we might be able to utilise a different kind of join, such as a Merge Join, which involves running through two ordered sets, comparing the values to find matches in a single pass. But for this to apply, both sides must be SARGable. Consider two tables with numbers in them… one with 1, 1, 5, 10, 10, 200, and one with 3, 4, 5, 6, 7, 8, 8, 10. To do a Merge Join between them, the system will see that the first table starts with 1, while the second one starts with 3. It will skip through the 1s in the first table to the 5, and then start skipping through the second table to find matches. It’s very fast, but it needs to be able to know whether the value it’s looking for is to the left or right of where it’s up to. This would involve an Index Scan on both sides, but it would be quicker than doing lots of Index Seeks. A single seek is faster than a single scan, but something quick done over and over can take longer than using a slow method one time.


Unfortunately, the Query Optimizer currently isn’t smart enough to know that dateadd(week, 1, d.diarydate) doesn’t change the order of dates in the diary, and that it can easily tell whether it should go left or right. That’s the topic of my earlier blog post, and a Connect item to ask that Microsoft fix this.


So should you care about SARGability? After all, if the system is doing an Index Seek anyway, you’ve got SARGability in play, and it shouldn’t matter.


You should try to arm the Query Optimizer with as many options as possible, so that it can use the best plan available, based on the statistics. We might be happy with looking up three diary dates in a Nested Loop, but we wouldn’t be happy doing that thousands of times. Think about how YOU would solve the problem without a computer (like my diary analogy, or the phonebook). If your execution plan is similar to your paper-based solution, then you’ve done well. If it’s not, then maybe you should look into it some more and work out if the query is okay or not.


In Josef’s situation, the Index Seek was being done 1155 times. That might be okay, but it also might not be. But should the query be rewritten? That’s a different question, that I couldn’t really answer without knowing more. The most important thing about a query is not its performance, but its correctness. If you need to kill the SARGability to maintain correctness, then so be it. You can always handle it with an indexed computed column if you need to, or even an indexed view.