(Edited on February 11th, 2010 to take account of a few mistakes and changes in the .NET 4.0 release candidate.)
I’ve just been fiddling with the first appendix of C# in Depth, which covers the standard query operators in LINQ, and describes a few details of the LINQ to Objects implementations. As well as specifying which operators stream and which buffer their results, and immediate vs deferred execution, I’ve where LINQ optimises for different collection types – or where it doesn’t, but could. I’m not talking about optimisations which require knowledge of the projection being applied or an overall picture of the query (e.g. seq.Reverse().Any() being optimised to seq.Any()) – this is just about optimisations which can be done on a pretty simple basis.
There are two main operations which can be easily optimised in LINQ: random access to an element by index, and the count of a collection. The tricky thing about optimisations like this is that we do have to make assumptions about implementation: I’m going to assume that any implementation of an interface with a Count property will be able to return the count very efficiently (almost certainly straight from a field) and likewise that random access via an indexer will be speedy. Given that both these operations already have their own LINQ operators (when I say "operator" in this blog post, I mean "LINQ query method" rather than an operator at the level of C# as a language) let’s look at those first.
Count() should be pretty straightforward. Just to be clear, I’m only talking about the overload which doesn’t take a predicate: it’s pretty hard to see how you can do better than iterating through and counting matches when there is a predicate involved.
There are actually two common interfaces which declare a Count property: ICollection<T> and ICollection. While many implemenations of ICollection<T> will also implement ICollection (including List<T> and T), it’s not guaranteed: they’re independent interfaces, unrelated other than by the fact that they both extend IEnumerable.
The MSDN documentation for Enumerable.Count() states:
If the type of source implements ICollection<T>, that implementation is used to obtain the count of elements. Otherwise, this method determines the count.
This is accurate for .NET 3.5, but in .NET 4.0 it does optimise for ICollection as well. (In beta 2 it only optimised for ICollection, skipping the generic interface.)
The equivalent of an indexer in LINQ is the ElementAt operator. Note that it can’t really be an indexer as there’s no such thing as an "extension indexer" which is arguably a bit of a pity, but off-topic. Anyway, the obvious interface to look for here is IList<T>… and that’s exactly what ElementAt does. It ignores the possibility that you’ve only implemented the nongeneric IList – but I think that’s fairly reasonable. After all, the extension method extends IEnumerable<T>, so your collection has to be aware of generics – why would you implement IList but not IList<T>? Also, using the implementation of IList would involve a conversion from object to T, which would at the very least be ugly.
So ElementAt doesn’t actually do too badly. Now that we’ve got the core operations, what else could be optimised?
If you were going to write a method to compare two lists, you might end up with something like this (ignoring nullity concerns for the sake of brevity):
// Optimise for reflexive comparison
if (first == second)
// If the counts are different we know the lists are different
if (first.Count != second.Count)
// Compare each pair of elements in turn
for (int i = 0; i < first.Count; i++)
if (!comparer.Equals(first[i], second[i]))
Note the two separate optimisations. The first is always applicable, unless you really want to deal with sequences which will yield different results if you call GetEnumerator() on them twice. You could certainly argue that that would be a legitimate implementation, but I’d be interested to see a situation in which it made sense to try to compare such a sequence with itself and return false. SequenceEqual perform this optimisation.
The second optimisation – checking for different counts – is only really applicable in the case where we know that Count is optimised for both lists. In particular, I always make a point of only iterating through each source sequence once when I write a custom LINQ operator – you never know when you’ll be given a sequence which reads a huge log file from disk, yielding one line at a time. (Yes, that is my pet example, but it’s a good one and I’m sticking to it.) But we can certainly tell if both sequences implement ICollection or ICollection<T>, so it would make sense to have an "early negative" in that situation.
(All of this applies to LastOrDefault as well, by the way.) The implementation of Last which doesn’t take a predicate is already optimised for the IList<T> case: in that situation the method finds out the count, and returns list[count – 1] as you might expect. We certainly can’t do that when we’ve been given a predicate, as the last value might not match that predicate. However, we could walk backwards from the end of the list… if you have a list which contains a million items, and the last-but-one matches the predicate, you don’t really want to test the first 999998 items, do you? Again, this assumes that we can keep using random access on the list, but I think that’s reasonable for IList<T>.
Reverse is an interesting case, because it uses deferred execution and streams data. In reality, it always takes a complete copy of the sequence (which in itself does optimise for the case where it implements ICollection<T>; in that situation you know the count to start with and can use source.CopyTo(destinationArray) to speed things up). You might consider an optimisation which uses random access if the source is an implementation of IList<T> – you could just lazily yield the elements in reverse order using random access. However, that would change behaviour. Admittedly the behaviour of Reverse may not be what people expect in the first place. What would you predict that this code does?
var query = array.Reverse();
array = "a1";
var iterator = query.GetEnumerator();
array = "a2";
// We’ll assume we know when this will stop
array = "a3";
array = "a4";
array = "a5";
After careful thought, I accurately predicted the result (d, c, b, a2) – but you do need to take deferred execution *and* eager buffering into account. If nothing else, this should be a lesson in not changing the contents of query sources while you’re iterating over the query unless you’re really sure of what you’re doing.
With the candidate "optimisation" in place, we’d see (d, c, b, a5), but only when working on array directly. Working on array.Select(x => x) would have to give the original results, as it would have to iterate through all the initial values before finding the last one.
This is an interesting one… LongCount really doesn’t make much sense unless you expect your sequence to have more than 2^31 elements, but there’s no optimisation present. The contract for IList<T> doesn’t state what Count should do if the list has more than Int32.MaxValue elements, so that can’t really be used – but potentially Array.LongValue could be used for large arrays.
A bigger question is when this would actually be useful. I haven’t tried timing Enumerable.Range(0, int.MaxValue) to see how long it would take to become relevant, but I suspect it would be a while. I can see how LongCount could be useful in LINQ to SQL – but does it even make sense in LINQ to Objects? Maybe it will be optimised in a future version with ILongList<T> for large lists…
EDIT: In fact, given comments, it sounds like the time taken to iterate over int.MaxValue items isn’t that high after all. That’ll teach me to make assumptions about running times without benchmarking… I still can’t say I’ve seen LongCount used in anger in LINQ to Objects, but it’s not quite as silly as I thought
The optimisations I’ve described here all have the potential to take a long-running operation down to almost instantaneous execution, in the "right" situation. There may well be other opportunities lurking – things I haven’t thought of. The good news is that missing optimisations could be applied in future releases without breaking any sane code. I do wonder whether supporting methods (e.g. TryFastCount and TryFastElementAt) would be useful to encourage other people writing their own LINQ operators to optimise appropriately – but that’s possibly a little too much of a niche case.
Blog post frequency may well change in either direction for the near future – I’m going to be very busy with last-minute changes, fixes, indexing etc for the book, which will give me even less time for blogging. On the other hand, it can be mind-numbingly tedious, so I may resort to blogging as a form of relief…