Category Archives: 7913

Ordered data requires ORDER BY

When you select data out of a database table, it might seem as if the system will give you data in the order of the clustered index that is on that table (assuming it has one), but this isn’t quite right. I showed this in the fifth of my T-SQL tips at TechEd Australia this year.

If you don’t explicitly order data using the ORDER BY clause, the system makes no attempt to provide you with the data in any particular order. It will just give you the data in the fastest way possible. You might be lucky and get the data in a predictable order most of the time. In fact, on your system, the data might be returned in a predictable order over 99% of the time. Unfortunately, you can’t rely on that.

When the database engine is handling your query, it needs to get your data into RAM first. If it’s not already there, this means a trip to disk, which will generally be the slowest aspect of your query. With multiple processors and multiple disks being the norm these days, there is a good chance that one processor will head off to get the data from one disk, while another processor gets data from another. If your query isn’t explicitly stating the order for your data, whichever processor gets its data first is likely to determine which data appears first in your results. There are other reasons why your data might come back in an unexpected order, but this reason alone should convince you.

I’m sure reading this, you’re very much used to putting ORDER BY in your queries. But what happens if you’re using SQL Server Integration Services (SSIS)? When you set up a data flow and want to pull data out of a particular table or view, do you use the “Table or View” option in the data source? You might – it’s a lot easier than typing your query out if you choose the SQL Command option. Just don’t. Use the SQL Command option instead.

image

The problem is that SSIS can really take advantage of ordered data. If SSIS can’t guarantee that the data is ordered, it will assume it’s not, and this might hurt the performance of your package significantly. Some data flow transformations can’t even run on unordered data. You can tell a data flow source that the data is ordered. You can even tell it that it is, even if it’s not. Don’t do this. Only tell a data flow source that it’s ordered if you’re explicitly ordering it. If you tell SSIS that a data flow is ordered, and the data comes through in a different order (because you’re unlucky one time), you will get unexpected results. What I mean by ‘unexpected’ here is ‘wrong’.

It’s not hard, just write out your query, and use the ORDER BY clause. You know you should…

Design Query in Editor bug

Ok, so real database developers don’t use the graphical “Design Query in Editor”… yeah, I know. Sure, there’s the odd time when you’re typing a query and you don’t have an Object Explorer (eg, in SSIS) and a moment of weakness sees you hit “Build Query” to save some typing, but in general I encourage people to write their queries in Management Studio SSMS and then copy them into the SSIS dialog. I was showing someone some of the frustrations I have with the graphical editor, and came across a real beauty – repeated predicates.

I logged a bug at the Connect site, where I wrote the rest of the details. You should vote for this – there may be a time when you click “Build Query” to avoid some typing, and you don’t want to look like an idiot for repeating the same line multiple times.

https://connect.microsoft.com/SQLServer/feedback/ViewFeedback.aspx?FeedbackID=352874

"No Match Output" in SSIS 2008

The hairy Irishman (his description – I just call him Sacha) is presenting to the Adelaide SQL Server User Group today, talking about improvements in SQL Server 2008 Integration Services. Right now he’s talking about the No Match output of the Lookup Component, and I’m wondering why I had always just accepted that in SQL 2005, rows that didn’t match a lookup would get pumped out to the error output.

For those of you who are saying “Sorry, what?”, let me explain…

If you have a data flow in Integration Services which is missing a particular piece of information, and you need to get that information from another source, you are likely to do a Lookup transformation, which can perform a query to find the values to be inserted into each row in the flow. Of course if the data is just in two tables, you’d just do a join in your original source query, but as soon as you’re talking about files, you don’t have that luxury in quite the same way. Sure you could populate a table and then read it back, but a Lookup may end up being much quicker.

Now, in SQL Server 2005, if a row couldn’t find the looked up value, it would be considered an error, along with truncation errors and errors in connection to the lookup source. I just accepted this as okay before, but the more I come across the No Match output (which can be used as well as the error output), the more I like it. So I can much more easily separate the rows that have caused errors from the rows that just couldn’t be found in the lookup source.