Reimplementing LINQ to Objects: Part 11 – First/Single/Last and the …OrDefault versions

Today I’ve implemented six operators, each with two overloads. At first I expected the implementations to be very similar, but they’ve all turned out slightly differently…

What are they?

We’ve got three axes of permutation here: {First, Last, Single}, {with/without OrDefault }, {with/without a predicate}. That gives us these twelve signatures:

public static TSource First<TSource>(
    this IEnumerable<TSource> source)

public static TSource First<TSource>(
    this IEnumerable<TSource> source,
    Func<TSource, bool> predicate)

public static TSource FirstOrDefault<TSource>(
    this IEnumerable<TSource> source)

public static TSource FirstOrDefault<TSource>(
    this IEnumerable<TSource> source,
    Func<TSource, bool> predicate)

public static TSource Last<TSource>(
    this IEnumerable<TSource> source)

public static TSource Last<TSource>(
    this IEnumerable<TSource> source,
    Func<TSource, bool> predicate)

public static TSource LastOrDefault<TSource>(
    this IEnumerable<TSource> source)

public static TSource LastOrDefault<TSource>(
    this IEnumerable<TSource> source,
    Func<TSource, bool> predicate)

public static TSource Single<TSource>(
    this IEnumerable<TSource> source)

public static TSource Single<TSource>(
    this IEnumerable<TSource> source,
    Func<TSource, bool> predicate)

public static TSource SingleOrDefault<TSource>(
    this IEnumerable<TSource> source)

public static TSource SingleOrDefault<TSource>(
    this IEnumerable<TSource> source,
    Func<TSource, bool> predicate)

The shared behaviour is as follows:

  • They’re all extension methods with a single generic type argument
  • They’re all implemented with immediate execution
  • They all validate that their parameters are non-null
  • The overloads with a predicate are equivalent to calling source.Where(predicate).SameOperator() – in other words, they just add a filter before applying the operator.

With those rules applied, we simply need to consider three possibilities for each operator: what happens if the source sequence is empty, contains a single element, or contains multiple elements. (This is after applying the filter if a predicate is specified, of course.) We can draw the results up in a simple table:

Operator Empty sequence Single element Multiple elements
First Throws exception Returns element Returns first element
FirstOrDefault Returns default(TSource) Returns element Returns first element
Last Throws exception Returns element Returns last element
LastOrDefault Returns default(TSource) Returns element Returns last element
Single Throws exception Returns element Throws exception
SingleOrDefault Returns default(TSource) Returns element Throws exception

As you can see, for an input sequence with a single element, the results are remarkably uniform :) Likewise, for an empty input sequence, any operator without "OrDefault" throws an exception (InvalidOperationException, in fact) and any operator with "OrDefault" returns the default value for the element type (null for reference types, 0 for int etc). The operators really differ if the (potentially filtered) input sequence contains multiple elements – First and Last do the obvious thing, and Single throws an exception. It’s worth noting that SingleOrDefault also throws an exception – it’s not like it’s saying, "If the sequence is a single element, return it – otherwise return the default value." If you want an operator which handles multiple elements, you should be using First or Last, with the "OrDefault" version if the sequence can legitimately have no elements. Note that if you do use an "OrDefault" operator, the result is exactly the same for an empty input sequence as for an input sequence containing exactly one element which is the default value. (I’ll be looking at the DefaultIfEmpty operator next.)

Now we know what the operators do, let’s test them.

What are we going to test?

This morning I tweeted that I had written 72 tests before writing any implementation. In fact I ended up with 80, for reasons we’ll come to in a minute. For each operator, I tested the following 12 cases:

  • Null source (predicate-less overload)
  • Null source (predicated overload)
  • Null predicate
  • Empty sequence, no predicate
  • Empty sequence, with a predicate
  • Single element sequence, no predicate
  • Single element sequence where the element matched the predicate
  • Single element sequence where the element didn’t match the predicate
  • Multiple element sequence, no predicate
  • Multiple element
  • Multiple element sequence where one element matched the predicate
  • Multiple element sequence where multiple elements matched the predicate

These were pretty much cut and paste jobs – I used the same data for each test against each operator, and just changed the expected results.

There are two extra tests for each of First and FirstOrDefault, and two for each of Last and LastOrDefault:

  • First/FirstOrDefault should return as soon as they’ve seen the first element, when there’s no predicate; they shouldn’t iterate over the rest of the sequence
  • First/FirstOrDefault should return as soon as they’ve seen the first matching element, when there is a predicate
  • Last/LastOrDefault are optimized for the case where the source implements IList<T> and there’s no predicate: it uses Count and the indexer to access the final element
  • Last/LastOrDefault is not optimized for the case where the source implements IList<T> but there is a predicate: it iterates through the entire sequence

The last two tests involved writing a new collection called NonEnumerableList which implements IList<T> by delegating everything to a backing List<T>, except for GetEnumerator() (both the generic and nongeneric forms) which simply throws a NotSupportedException. That should be a handy one for testing optimizations in the future. I’ll discuss the optimization for Last when we get there.

Let’s implement them!

These operators were more interesting to implement than I’d expect, so I’m actually going to show all twelve methods. It was rarely just a matter of cutting and pasting, other than for the argument validation.

Of course, if we chose to implement the predicated versions using "Where" and the non-predicated form, and the "OrDefault" versions by using "DefaultIfEmpty" followed by the non-defaulting version, we would only have had the three non-predicated, non-defaulting versions to deal with… but as I’ve said before, there are some virtues to implementing each operator separately.

For the sake of avoiding fluff, I’ve removed the argument validation from each method – but obviously it’s there in the real code. Let’s start with First:

public static TSource First<TSource>(
    this IEnumerable<TSource> source)
{
    // Argument validation elided
    using (IEnumerator<TSource> iterator = source.GetEnumerator())
    {
        if (iterator.MoveNext())
        {
            return iterator.Current;
        }
        throw new InvalidOperationException("Sequence was empty");
    }
}

public static TSource First<TSource>(
    this IEnumerable<TSource> source,
    Func<TSource, bool> predicate)
{
    // Argument validation elided
    foreach (TSource item in source)
    {
        if (predicate(item))
        {
            return item;
        }
    }
    throw new InvalidOperationException("No items matched the predicate");
}

These look surprisingly different – and that was actually a deliberate decision. I could easily have implemented the predicate-less version with a foreach loop as well, just returning unconditionally from its body. However, I chose to emphasize the fact that we’re not looping in First: we simply move to the first element if we can, and return it or throw an exception. There’s no hint that we might ever call MoveNext again. In the predicated form, of course, we have to keep looping until we find a matching value – only throwing the exception when we’ve exhausted all possibilities.

Now let’s see how it looks when we return a default for empty sequences:

public static TSource FirstOrDefault<TSource>(
    this IEnumerable<TSource> source)
{
    // Argument validation elided
    using (IEnumerator<TSource> iterator = source.GetEnumerator())
    {
        return iterator.MoveNext() ? iterator.Current : default(TSource);
    }
}

public static TSource FirstOrDefault<TSource>(
    this IEnumerable<TSource> source,
    Func<TSource, bool> predicate)
{
    // Argument validation elided
    foreach (TSource item in source)
    {
        if (predicate(item))
        {
            return item;
        }
    }
    return default(TSource);
}

Here the predicated form looks very similar to First, but the predicate-less one is slightly different: instead of using an if block (which would be a perfectly valid approach, of course) I’ve used the conditional operator. We’re going to return something, whether we manage to move to the first element or not. Arguably it would be nice if the conditional operator allowed the second or third operands to be "throw" expressions, taking the overall type of the expression from the other result operand… but it’s no great hardship.

Next up we’ll implement Single, which is actually closer to First than Last is, in some ways:

public static TSource Single<TSource>(
    this IEnumerable<TSource> source)
{
    // Argument validation elided
    using (IEnumerator<TSource> iterator = source.GetEnumerator())
    {
        if (!iterator.MoveNext())
        {
            throw new InvalidOperationException("Sequence was empty");
        }
        TSource ret = iterator.Current;
        if (iterator.MoveNext())
        {
            throw new InvalidOperationException("Sequence contained multiple elements");
        }
        return ret;
    }
}

public static TSource Single<TSource>(
    this IEnumerable<TSource> source,
    Func<TSource, bool> predicate)
{
    // Argument validation elided
    TSource ret = default(TSource);
    bool foundAny = false;
    foreach (TSource item in source)
    {
        if (predicate(item))
        {
            if (foundAny)
            {
                throw new InvalidOperationException("Sequence contained multiple matching elements");
            }
            foundAny = true;
            ret = item;
        }
    }
    if (!foundAny)
    {
        throw new InvalidOperationException("No items matched the predicate");
    }
    return ret;
}

This is already significantly more complex than First. The predicate-less version starts off the same way, but if we manage to move to the first element, we have to remember that value (as we’re hoping to return it) and then try to move to the second element. This time, if the move succeeds, we have to throw an exception – otherwise we can return our saved value.

The predicated version is even hairier. We still need to remember the first matching value we find, but this time we’re looping – so we need to keep track of whether we’ve already seen a matching value or not. If we see a second match, we have to throw an exception… and we also have to throw an exception if we reach the end without finding any matches at all. Note that although we assign an initial value of default(TSource) to ret, we’ll never reach a return statement without assigning a value to it. However, the rules around definite assignment aren’t smart enough to cope with this, so we need to provide a "dummy" value to start with… and default(TSource) is really the only value available. There is an alternative approach without using a foreach statement, where you loop until you find the first match, assign it to a local variable which is only declared at that point, followed by a second loop ensuring that there aren’t any other matches. I personally think that’s a bit more complex, which is why I’ve just used the foreach here.

The difference when we implement SingleOrDefault isn’t quite as pronounced this time though:

public static TSource SingleOrDefault<TSource>(
    this IEnumerable<TSource> source)
{
    // Argument validation elided
    using (IEnumerator<TSource> iterator = source.GetEnumerator())
    {
        if (!iterator.MoveNext())
        {
            return default(TSource);
        }
        TSource ret = iterator.Current;
        if (iterator.MoveNext())
        {
            throw new InvalidOperationException("Sequence contained multiple elements");
        }
        return ret;
    }
}

public static TSource SingleOrDefault<TSource>(
    this IEnumerable<TSource> source,
    Func<TSource, bool> predicate)
{
    // Argument validation elided
    TSource ret = default(TSource);
    bool foundAny = false;
    foreach (TSource item in source)
    {
        if (predicate(item))
        {
            if (foundAny)
            {
                throw new InvalidOperationException("Sequence contained multiple matching elements");
            }
            foundAny = true;
            ret = item;
        }
    }
    return ret;
}

This time we’ve just replaced a "throw" statement in the predicate-less method with a "return" statement, and removed the test for no matches being found in the predicated method. Here our assignment of default(TSource) to ret really works in our favour – if we don’t end up assigning anything else to it, we’ve already got the right return value!

Next up is Last:

public static TSource Last<TSource>(
    this IEnumerable<TSource> source)
{
    // Argument validation elided
    IList<TSource> list = source as IList<TSource>;
    if (list != null)
    {
        if (list.Count == 0)
        {
            throw new InvalidOperationException("Sequence was empty");
        }
        return list[list.Count - 1];
    }

    using (IEnumerator<TSource> iterator = source.GetEnumerator())
    {
        if (!iterator.MoveNext())
        {
            throw new InvalidOperationException("Sequence was empty");
        }
        TSource last = iterator.Current;
        while (iterator.MoveNext())
        {
            last = iterator.Current;
        }
        return last;
    }
}

public static TSource Last<TSource>(
    this IEnumerable<TSource> source,
    Func<TSource, bool> predicate)
{
    // Argument validation elided
    bool foundAny = false;
    TSource last = default(TSource);
    foreach (TSource item in source)
    {
        if (predicate(item))
        {
            foundAny = true;
            last = item;
        }
    }
    if (!foundAny)
    {
        throw new InvalidOperationException("No items matched the predicate");
    }
    return last;
}

Let’s start off with the optimization at the beginning of the predicate-less method. If we find out we’re dealing with a list, we can simply fetch the count, and then either throw an exception or return the element at the final index. As an extra bit of optimization I could store the count in a local variable – but I’m assuming that the count of an IList<T> is cheap to compute. If there are significant objections to that assumption, I’m happy to change it :) Note that this is another situation where I’m assuming that anything implementing IList<T> will only hold at most Int32.MaxValue items – otherwise the optimization will fail.

If we don’t follow the optimized path, we simply iterate over the sequence, updating a local variable with the last-known element on every single iteration. This time I’ve avoided the foreach loop for no particularly good reason – we could easily have had a foundAny variable which was just set to "true" on every iteration, and then tested at the end. In fact, that’s exactly the pattern the predicated method takes. Admittedly that decision is forced upon us to some extent – we can’t just move once and then take the first value as the "first last-known element", because it might not match the predicate.

There’s no optimization for the predicated form of Last. This follows LINQ to Objects, but I don’t honestly know the reason for it there. We could easily iterate backwards from the end of the sequence using the indexer on each iteration. One possible reason which makes a certain amount of sense is that when there’s a predicate, that predicate could throw an exception for some values – and if we just skipped to the end if the collection implements IList<T>, that would be an observable difference. I’d be interested to know whether or not that is the reason – if anyone has any inside information which they can share, I’ll update this post.

From here, we only have one more operator to implement – LastOrDefault:

public static TSource LastOrDefault<TSource>(
    this IEnumerable<TSource> source)
{
    // Argument validation elided
    IList<TSource> list = source as IList<TSource>;
    if (list != null)
    {
        return list.Count == 0 ? default(TSource) : list[list.Count - 1];
    }

    TSource last = default(TSource);
    foreach (TSource item in source)
    {
        last = item;
    }
    return last;
}

public static TSource LastOrDefault<TSource>(
    this IEnumerable<TSource> source,
    Func<TSource, bool> predicate)
{
    // Argument validation elided
    TSource last = default(TSource);
    foreach (TSource item in source)
    {
        if (predicate(item))
        {
            last = item;
        }
    }
    return last;
}

This time, aside from the optimization, the predicated and non-predicated forms look very similar… more so than for any of the other operators. In each case, we start with a return value of default(TSource), and iterate over the whole sequence, updating it – only doing so when it matches the predicate, if we’ve got one.

Conclusion

This was a longer post than I anticipated when I got up this morning, but I hope the slight differences between implementations – and the mystery of the unoptimized predicated "Last/LastOrDefault" operators have made it worth slogging through.

As a contrast – and because I’ve already mentioned it in this post – I’ll implement DefaultIfEmpty next. I reckon I can still do that this evening, if I hurry…

Addendum

It turns out I was missing some tests for Single and SingleOrDefault: what should they do if evaluating the sequence fully throws an exception? It turns out that in LINQ to Objects, the overloads without a predicate throw InvalidOperationException as soon as they see a second element, but the overloads with a predicate keep iterating even when they’ve seen a second element matching a predicate. This seems ludicrously inconsistent to me – I’ve opened a Connect issue about it; we’ll see what happens.

9 thoughts on “Reimplementing LINQ to Objects: Part 11 – First/Single/Last and the …OrDefault versions”

  1. Jon,

    I’m really learning from your mind blowingly well done teaching LINQ to Objects concept, and equally well done implementation. Kudos.

    I tend to favor null over dummy variables because dummys make me think too much. Doing so in this case also removes the need for the foundAny flag.

    public static TSource Single(
    this IEnumerable
    source,
    Func
    predicate)
    {
    // Argument validation elided
    TSource ret = null;
    foreach (TSource item in source)
    {
    if (predicate(item))
    {
    if (null != ret)
    {
    throw new InvalidOperationException(“Sequence contained multiple matching elements”);
    }
    ret = item;
    }
    }
    if (null == ret)
    {
    throw new InvalidOperationException(“No items matched the predicate”);
    }

    return ret;
    }

  2. @Ed S: There are two problems with that approach:

    - What if TSource is a value type? Then you can’t use null. You can use default(TSource), but then you can’t easily compare with it unless you use EqualityComparer.Default which starts becoming pretty messy.

    - What if null (or the default value of TSource) is a valid value, and it matches the predicate? Your code would throw an exception, when it should return the value.

    Fundamentally, we’re representing two different concepts here: whether or not we’ve found the value, and what that value was. I think it’s better two represent the two concepts with two separate variables.

    (As a side-note, why do you use “if (null == ret)” rather than “if (ret == null)”? I believe most people find the latter more readable, and C# doesn’t have the “accidental assignment” issue from C/C++ which normally leads people to use this style.)

  3. With the predicate-less Last function, why do you keep the local variable “last”? Instead of the “return last” you could have easily written “return iterator.Current”?

    Nice serie by the way. The content is pretty nice, but the form is also appreciated. If I couldn’t have “planned” the consumption of the serie without my RSS reader, I probably wouldn’t have read it.

  4. @Doeke: Nope, I couldn’t have returned iterator.Current because the iterator would have been invalid by that point. As it happens, the iterator block implementations generated by the C# compiler will always make Current return the last valid element… but it’s undefined. I wouldn’t be surprised to find that List and arrays threw an exception if you tried to use iterator.Current after iterator.MoveNext() has returned false.

    Glad you’re enjoying the series :)

  5. @skeet: Value Types DOH! That’ll teach me to post air code!

    To answer your question about the “null == ret” vs. “ret == null”, there is a practical reason, and an even larger psychological reason for leading with the constant instead of the variable.

    The practical reason is that the C# compiler will not always protect you.

    Consider this compiles – and inadvertantly mutates all the SomeString property values to null at runtime.

    var oops = list.Select(foo => foo.SomeString = null);

    This will not compile:

    var savedByOldDogCodingConvention = list.Select(foo => null = foo.SomeString);

    Granted my example isn’t an “if” construct, but anybody that habitually types “if(x == null)” will most certainly type “x => x == null”.

    Early versions of C# did not have this issue, but I still used the old form because the problem was that after decades of the visual construct “if (xyz == null)” setting off my mental alarm bells it is near impossible to shed the uncomfortable feeling that occurs when reading it.

    As both constructs were perfectly valid and equally terse, I chose the path of least resistance and less stomach acid churn.

    Also, forgetting all the reasons above, I’m pretty sure my fingers just type these kinds of trivial things without any conscious thought on my part while I’m thinking ahead to the real issues I need to solve.

    Given that, I’m hoping in a couple more years I will be typing “var foo = new Foo();” instead of “Foo foo = new Foo();”. It certainly wouldn’t hurt the readability any. -:)

    Although now that I think about it, this won’t compile and would save as well:

    IEnumerable oops = list.Select(foo => foo.SomeString = null);

    Perhaps this type of psychological discomfort when reading code is why C# unnecessarily requires an explicit “break” of some sort in all switch sections, even though it disallows execution to continue? A C# purist case could certainly be made to simply allow the below syntax, because given x == 1 SomeOtherMethod() is never going to execute as it would in C++:

    switch(x) {
    case 1:
    SomeMethod();
    case 2:
    SomeOtherMethod();
    }

    Even more interesting is the fact that while re-reading the above switch example I found myself reaching for the antacid tablets. LOL

  6. @Ed S: That’s a good point about the lambda expression and type inference – I hadn’t considered that one.

    Of course, it changes the return type, so unless you’re not using the results in a strongly typed way, you’ll still spot it at compile-time, as you mention later. To put it another way: I don’t think I’ve *ever* heard of anyone being caught out by that.

    You didn’t really go into the psychological aspect, but if it was going to be along the lines of “make one language feel like another” then I vehemently disagree. You should always be very aware of what language you’re coding in, and not reach for a lowest common denominator. For example, to compare strings in C# I’d use:

    if (foo == “something”)

    rather than

    if (foo.Equals(“something”))

    or (if I *really* didn’t mind if foo is null)

    if (“something”.Equals(foo))

    The latter two approaches are the correct ones in Java – the other language I use day-to-day. The former is what I would pick in C# as it’s simply more readable, IMO. You need to know which language you’re in – at which point you can write the most readable code.

    Now if your point is that as a seasoned C++ programmer you *personally* find it odd to read comparisons the other way round, that’s fair enough – but I’d encourage you to challenge yourself to use the form that *more* people find readable. (I don’t have study data to back that up or anything, but it certainly seems to be the natural way that people express conditions unless they’ve been *told* to put the constant first due to a C/C++ background.)

    Regarding the switch statement: switch in C# has various unfortunate properties, IMO. I would personally have liked to see each case statement come with braces, like a catch. No break required, and a new scope within the braces.

  7. @skeet: It’s not “make one language like another”, it’s “learn from past mistakes”.

    I realize the form of a test placing the constant first makes some people uncomfortable when they read it, but that is because they never spent all 72 hours of a 3 day weekend chasing that bug before.

    An integration programmer in my company was bitten by an inadvertant assignment in a lambda statement. I’m sure the more creative types out there can come up with several scenarios where this can or has occurred.

    I was bitten by using foo == “something” because foo was of type object. Easy enough to avoid in my coding, but much harder to spot when trouble shooting somebody else’s code, especially if they have heavy use of “var”.

    var s = Foo.x;
    //many lines of complex code and busines rules later…
    if (s == “T”)
    // at this point the last thing on your mind is probably “is s of type object?”.

  8. @Ed S: Well I’m certainly surprised that it actually *has* bitten anyone.

    I suspect that there are lots of obscure situations which can only be avoided be avoided by making *most* code less readable. I would say that in this case the risk/reward balance is in favour of the “normal” approach.

    Imagine if 1 million developers took just *2 seconds* longer to read code every day with the “backward” form… that would have to be balanced by about 7 or 8 developers being hit by a 72-hour blockage *every day* for it to be worth changing. (This is incredibly simplistic arithmetic, of course, but I hope you take my point.)

    Obviously this is all just personal opinion – but I’m certainly not going to change my coding style based on this discussion. (I don’t expect you to either, of course :) Thanks for the debate though… it’s interesting.

  9. @skeet: The humorous part of this discussion is we both posit that it takes longer to read it when written with the other guy’s preference. :)

    Most likely we are both wrong and it doesn’t take any longer either way.

    In which case I’m right because mine is the safer methodology…

    hahaha just kidding
    -Ed

    P.S. Did I forget to mention I’m 100% with you on the uglyness of string.Equals? Especially when compared to if(“s” == stringVar). haha again!

Comments are closed.