GROUP BY v DISTINCT (group by wins!)

Jamie Thomson (SQL Server MVP from the UK) threw out a challenge recently, and it was interesting to see the responses. His question was about string concatenation, and of course, FOR XML PATH(”) made a strong case for itself. It was put into SQL2005 for exactly this purpose. But I noticed that the responses using this pattern also used the DISTINCT keyword, to stop there being duplicate entries in the results.

And this makes for a good opportunity for me to tell you about the difference between DISTINCT and GROUP BY.

They might seem to be quite different, but if you’re not using aggregate functions, the difference stops being so significant. Certainly if you were to run ‘select distinct col1 from table1′ and ‘select col1 from table1 group by col1′, you will notice that actually they are exactly the same. Both will scan through the data, doing a distinct sort on the results.

So now consider the situation in Jamie’s question. To set up the scenario, run the following:

USE tempdb
GO
CREATE TABLE t1 (id INT, NAME VARCHAR(MAX))
INSERT t1 values (1,’Jamie’)
INSERT t1 values (1,’Joe’)
INSERT t1 values (1,’John’)
INSERT t1 values (2,’Sai’)
INSERT t1 values (2,’Sam’) 

The two queries to compare are as follows:

select
    id,
    stuff((
        select ‘,’ + t.[name]
        from t1 t
        where t.id = t1.id
        order by t.[name]
        for xml path(”)
    ),1,1,”) as name_csv
from t1
group by id
;

select distinct
    id,
    stuff((
        select ‘,’ + t.[name]
        from t1 t
        where t.id = t1.id
        order by t.[name]
        for xml path(”)
    ),1,1,”) as name_csv
from t1
;

Both return the same data. But if we look at the execution plans, we will see they are executed in different ways. The layman’s way of explaining this is to point out that the ‘distinct’ needs to check that the name_csv field is unique each time as well, whereas the ‘group by’ just needs to check that the subquery is valid – ie, doesn’t refer too any non-grouped fields.

groupbydistinct

So you can see that the execution plan (and this is the actual, not the estimated) says that the GROUP BY is quicker than the DISTINCT, but if we look at what’s actually going on, we can clearly see that the first query does the Distinct Sort step before the Nested Loop, whereas the second does the Distinct Sort as the very last step.

I have to point out that you can’t always use GROUP BY instead of DISTINCT in an effective way, and nor should you. If returning unique records is actually what you want, then DISTINCT may be better. It’s certainly clearer code in many situations. But if you’re using it just because your code is inadvertently returning more than you want, then you should try to have the DISTINCT apply earlier. This might be with “WHERE EXISTS” instead of a join, or maybe using a “SELECT DISTINCT …” derived table. In a previous post (at my old blog), I showed that ROW_NUMBER() can be used to remove duplicates from a table. So there are certainly different ways to produce a unique set of rows, and it’s worth considering which one is right for you.

Getting married today

Roslyn & I are getting married this afternoon. Again. But this time, for pretend. Her mum is getting her celebrant’s licence, and needs to conduct a pretend-wedding so that they can be sure she knows how to do it. So Roslyn and I are getting all done up, and trotting off down the beach to say some vows.

I think we have to take our wedding rings off for it, which will be strange, as I don’t tend to take my wedding ring off. Hope it goes okay. Have to take it seriously for Mum, otherwise we’ll have to do it again. Can’t cough at the “impediment” bit or anything…

GotDotNot

Wot with GotDotNet disappearing soon, Paul Stovell asks the question what will happen to all the links to the site. Spot on, Paul. Rotten situation, so let’s hope Microsoft have a solution for this. Not least, it could remain in a read-only mode to make sure links don’t fail. GotDotNet content could also be moved across to CodePlex with a redirection service in place.

Lots of people have used GotDotNet over the years. Hot site, we all agree. Jot down your support for Paul’s suggestion if you can.

BI Certified

Like Darren Gosbell, I took the beta exams for the two new SQL2005 Business Intelligence exams. Like Darren, following the beta period I was invited to help with the exam development process, and now that the results have come out, like Darren, I have two new certifications. MCTS: Microsoft SQL Server 2005 Business Intelligence and MCITP: Business Intelligence Developer.

I received invites to do beta exams in TFS and VSTO, but the beta period timing is bad for me, and I don’t think I’m going to get to do them.

Hiding the Office 2007 ribbon

Not that you’d want to of course… context-sensitive help by presenting a toolbar which is relevant to what you’re doing is great.

But if it gets in your way, because you want the screen real estate to read a document or something, then there are a few ways you can make it disappear temporarily.

The most complicated way is by using the Customize Quick Access Toolbar menu. hideribbon “Of course!” I hear you say. Oh, that’s not you? Must be those voices again. Well, it’s the little drop-down menu that you can see next to the Undo & Redo buttons. And look, you can set a whole bunch of other options along there, like giving yourself a permanent “Print Preview” button. Check out “More Commands” and you’ll see that it gives you a really nice old-school toolbar up there.

And you’ll notice that at the bottom of that list is a “Minimize the Ribbon” option.

The second method is to right-click on the ribbon. You’ll see a context-menu appear, and one of the options there is to “Minimize the Ribbon”. That’s pretty easy.

But the easiest way is to simply double-click on the menu. Like, where it says “Home” in Word, for example. And this is the easiest way to get the ribbon back again, which for me makes it the best way of hiding the Office 2007 ribbon.

MVP Summit

Obviously there’s not much I can say here about the MVP Summit last week, unless you’re an MVP yourself, in which case you have probably heard it all anyway. :)

But as for what I can talk about – the summit was great. I met lots of people who I have known via the virtual community (and by reputation) for some time, but have never met in person. People like Simon Sabin, who lives in the town I caught the bus to school from when I was growing up, and who is a fellow SQL MVP. I caught up with Simon for breakfast on Monday morning (after arriving about six hours late on Sunday), and met a bunch of other English SQL MVPs then too.

I spent time with people that I hold in very high regard (like almost everyone who Greg Low has ever interviewed on his SQL Down Under podcasts), didn’t sleep much, learned a lot about the next versions of SQL Server (if you’re going to ask, “The short answer is No”) and about the MVP program in general, and have come away from it really pleased that I went.

sys.database_files and sys.master_files (rather than sys.sysfiles)

Whenever there are changes, people can be slow to embrace them. One I’ve come across recently is that looking in sysfiles is no longer the best way of getting information about your database files. sys.database_files will tell you a lot more about them, and give you much nicer ways of filtering them. So for example, if you want a list of the log files for your database, try:

select * from sys.database_files where type = 1

And sys.master_files will list them for the whole system. So it becomes really easy to look at the state of things from within T-SQL.

For more information on these, check out http://msdn2.microsoft.com/en-us/library/ms174397.aspx and http://msdn2.microsoft.com/en-us/library/ms186782.aspx.

Now go and make yourself a Vista Sidebar gadget (or use PowerGadgets to do it really easily) to show you the size of your log files (or whatever). Have it update every hour, sitting alongside those gadgets which tell you how nice the Adelaide weather is and show you pictures of the blue skies in Australia.

Great bible software from Laridian for free

As a leaving gift from my previous employer, I got a voucher, which works just as well for a supermarket as it does for anywhere else. This is cool, because I can then spend the money on groceries and buy something I want with the money I save from the shopping. It means that I can buy what I want, effectively for free, because it’s a gift.

So I bought some bible software for my PocketPC. I got it from Laridian – because after trying a couple of different Bible software packages, I’ve really found it’s the best. If you’re looking for a Bible for your PocketPC, I can really recommend it. They even had a deal where if you buy two things (and spend more than $40), you can get a decent discount on the overall price, so I ended up getting more than I expected, for less money than I expected. Can’t lose! Follow the link above…

For me, having a bible on my PocketPC means that I don’t need to take an extra book with me when I travel, because it’s all there. I use my PocketPC for reading RSS feeds, listening to podcasts, reading PDFs, and even studying the bible. I just need to get around to buying that larger SD card for it – with Christmas money from my mum that I haven’t spent yet. ;)

End of an era

I finish today in a job that I’ve had for over 4.5 years. It’s been an interesting time. During this time I’ve had time in hospital, seen both my kids have stints in hospital, seen my youngest start school, become an uncle four times over, reached my 30s, run a user-group, received the MVP award, had three trips to Redmond (with a fourth coming up soon), and of course, made many friends.

But the new role is very exciting. It’s going to be great training for a living – I love training, and it’s going to be a lot of fun. And for a company like Solid Quality Learning – that’s even better. Can’t beat those guys for SQL knowledge.