Unique Indexes with GROUP BY

Indexes are great, but if you don’t understand the significance of unique indexes, then you’re potentially missing out on some decent performance gains.

I’ve been meaning to write this blog post for a long time. This is material that I’ve been teaching and presenting about all year, but somehow it’s never turned into a blog post. It was the third tip in my presentation at TechEd Australia (and at user group events in Adelaide and Melbourne). Today I’m flying back home from the US, so hopefully I’ll be able to spend some battery time getting it done. [Note: It's ended up out of order with another post that I wrote on the plane. This one is Tip #3.]

This query runs in the AdventureWorks database, in almost any version of SQL Server you have. When I present on this at the moment, I use SQL Server 2005, so that people don’t think I’m just showing new SQL Server 2008 features. This concept will help you regardless of version.

SELECT s.Name, COUNT(*)
FROM Production.ProductSubcategory s
  JOIN
  Production.Product p
  ON p.ProductSubcategoryID = s.ProductSubcategoryID
GROUP BY s.Name;

It’s a simple query that counts the number of products in each non-empty subcategory. Having seen the logical action of the query, let’s look at the execution plan. This will show us what the query is actually doing. We’ll look at it both graphically and in text. The graphical representation is stored as XML, and additional information can be seen using the tool tips and the Properties window.

 

|–Merge Join(Inner Join, MERGE:([p].[ProductSubcategoryID])=([s].[ProductSubcategoryID]), RESIDUAL:([AdventureWorks].[Production].[ProductSubcategory].[ProductSubcategoryID] as [s].[ProductSubcategoryID]=[AdventureWorks].[Production].[Product].[ProductSubcategoryID] as [p].[ProductSubcategoryID]))
     |–Compute Scalar(DEFINE:([Expr1004]=CONVERT_IMPLICIT(int,[Expr1016],0)))
     |    |–Stream Aggregate(GROUP BY:([p].[ProductSubcategoryID]) DEFINE:([Expr1016]=Count(*)))
     |         |–Sort(ORDER BY:([p].[ProductSubcategoryID] ASC))
     |              |–Clustered Index Scan(OBJECT:([AdventureWorks].[Production].[Product].[PK_Product_ProductID] AS [p]))
     |–Clustered Index Scan(OBJECT:([AdventureWorks].[Production].[ProductSubcategory].[PK_ProductSubcategory_ProductSubcategoryID] AS [s]), ORDERED FORWARD)

 

The Stream Aggregate operator is where the GROUP BY work is done. Having sorted the data, the system can look through it for changes, calculating whatever aggregates are needed at the time. The properties window or the text view show clearly that the count is being calculated here.

And they show that the field being grouped by is p.ProductSubcategoryID. But our query said we should group by s.Name. So what’s going on? It’s not picking the right field. It’s not even picking the right table.

But our system has a unique index on s.Name. GROUP BY looks for the distinct values from the fields in the GROUP BY clause, but if a unique index is on one of those fields, every row in that table is already known to be distinct. Our unique index is letting the system have some more freedom.

Because s.ProductSubcategoryID is also known to be unique (it happens to be the Primary Key (PK) of the table), it is logically equivalent to consider this field in place of s.Name. And as we are joining to the Product table on this field, it is even okay to group by the field from Product. The s.Name field is fetched later and brought into our result set using the Merge Join operator. It’s not hard to recognise that our query is asking for the number of products per subcategory, and that we’re only really grouping by s.Name because that’s the field that we want to display.

Now let’s consider what happens if we remove the unique index on s.Name. Interestingly, if you have been evaluating your indexes using sys.dm_db_index_usage_stats, you may consider that the index isn’t being used. Unfortunately, the fact that the query optimizer does actually use the query isn’t reported very obviously. Once the index is removed, the query plan changes. This demonstrates that the index is definitely being used, even if it’s not used in a way that gets reflected in sys.dm_db_index_usage_stats.

|–Compute Scalar(DEFINE:([Expr1004]=CONVERT_IMPLICIT(int,[Expr1007],0)))
     |–Stream Aggregate(GROUP BY:([s].[Name]) DEFINE:([Expr1007]=Count(*)))
          |–Sort(ORDER BY:([s].[Name] ASC))
               |–Hash Match(Inner Join, HASH:([s].[ProductSubcategoryID])=([p].[ProductSubcategoryID]), RESIDUAL:([AdventureWorks].[Production].[ProductSubcategory].[ProductSubcategoryID] as [s].[ProductSubcategoryID]=[AdventureWorks].[Production].[Product].[ProductSubcategoryID] as [p].[ProductSubcategoryID]))
                    |–Clustered Index Scan(OBJECT:([AdventureWorks].[Production].[ProductSubcategory].[PK_ProductSubcategory_ProductSubcategoryID] AS [s]))
                    |–Clustered Index Scan(OBJECT:([AdventureWorks].[Production].[Product].[PK_Product_ProductID] AS [p]))

It has clearly changed a lot. Now, the Stream Aggregate operator reports that it is grouping by s.Name – just like we asked. But you may also notice that the Estimated Subtree Cost is about 30% worse than the previous query. The system is treating this query in exactly the way that we asked – reporting the number of products per subcategory name, rather than per subcategory.

We can address this without the unique index, if we change the query. The answer is to simply include the Primary Key of the subcategory in the GROUP BY clause. Even though we don’t want to include that field in the result set, listing it in the GROUP BY clause will let the system optimise the query better even without the unique index. Bear in mind that if there is a duplicate subcategory name, there will be an extra row that didn’t appear before – but this is correct we’re grouping by subcategory rather than subcategory name. Having s.Name in the GROUP BY clause is ignored (as in our original plan) – it’s only listed there because we want to list it in the SELECT clause.

SELECT s.Name, COUNT(*)
FROM Production.ProductSubcategory s
  JOIN
  Production.Product p
  ON p.ProductSubcategoryID = s.ProductSubcategoryID
GROUP BY s.Name, s.ProductSubcategoryID;

The unique index simply allows the PK to be implicitly inserted into the GROUP BY clause.

As a query developer, you should ask yourself if you’re wanting to group by the entity described by the table or just a particular column. If you’re really wanting it to be entity, then include the PK in the GROUP BY clause, or else understand the impact of the unique index. And definitely appreciate that there may be queries which are helped by the existence of unique indexes.

I do think that it’s worth evaluating index usage. Having an unused index in your system is like having the Yellow Pages in your house if you never use it – just taking up cupboard space. Dropping an unused index may not work for you though, particularly if it’s unique.

One thought on “Unique Indexes with GROUP BY”

Leave a Reply

Your email address will not be published. Required fields are marked *

You may use these HTML tags and attributes: <a href="" title=""> <abbr title=""> <acronym title=""> <b> <blockquote cite=""> <cite> <code> <del datetime=""> <em> <i> <q cite=""> <strike> <strong>