Reimplementing LINQ to Objects: Part 34 – SequenceEqual

Nearly there now…

What is it?

SequenceEqual has two overloads – the obvious two given that we’re dealing with equality:

public static bool SequenceEqual<TSource>(
    this IEnumerable<TSource> first,
    IEnumerable<TSource> second)

public static bool SequenceEqual<TSource>(
    this IEnumerable<TSource> first,
    IEnumerable<TSource> second,
    IEqualityComparer<TSource> comparer)

The purpose of the operator is to determine if two sequences are equal; that is, if they consist of the same elements, in the same order. A custom equality comparer can be used to compare each individual pair of elements. Characteristics:

  • The first and second parameters mustn’t be null, and are validated immediately.
  • The comparer parameter can be null, in which case the default equality comparer for TSource is used.
  • The overload without a comparer uses the default equality comparer for TSource (no funny discrepancies if ICollection<T> instances are involved, this time).
  • It uses immediate execution.
  • It returns as soon as it notices a difference, without evaluating the rest of either sequence.

So far, so good. Note that no optimizations are mentioned above. There are questions you might consider:

  • Should foo.SequenceEqual(foo) always return true?
  • If either or both of the sequences implements another collection interface, does that help us?

The first question sounds like it should be a no-brainer, but it’s not as simple as it sounds. Suppose we have a sequence which always generates 10 random numbers. Is it equal to itself? If you iterate over it twice, you’ll usually get different results. What about a sequence which explicitly changes each time you iterate over it, based on some side-effect? Both the .NET and Edulinq implementations say that these are non-equal. (The random case is interesting, of course – it could happen to yield the same elements as we iterate over the two sequences.)

The second question feels a little simpler to me. We can’t take a shortcut to returning true, but it seems reasonably obvious to me that if you have two collections which allow for a fast "count" operation, and the two counts are different, then the sequences are unequal. Unfortunately, LINQ to Objects appears not to optimize for this case: if you create two huge arrays of differing sizes but equal elements as far as possible, it will take a long time for SequenceEqual to return false. Edulinq does perform this optimization. Note that just having one count isn’t useful: you might expect it to, but it turns out that by the time we could realize that the lengths were different in that case, we’re about to find that out in the "normal" fashion anyway, so there’s no point in complicating the code to make use of the information.

What are we going to test?

As well as the obvious argument validation, I have tests for:

  • Collections of different lengths
  • Ranges of different lengths, both with first shorter than second and vice versa
  • Using a null comparer
  • Using a custom comparer
  • Using no comparer
  • Equal arrays
  • Equal ranges
  • The non-optimization of foo.SequenceEquals(foo) (using side-effects)
  • The optimization using Count (fails on LINQ to Objects)
  • Ordering: { 1, 2 } should not be equal to { 2, 1 }
  • The use of a HashSet<string> with a case-insensitive comparer: the default (case-sensitive) comparer is still used when no comparer is provided
  • Infinite first sequence with finite second sequence, and vice versa
  • Sequences which differ just before they would go bang

None of the test code is particularly interesting, to be honest.

Let’s implement it!

I’m not going to show the comparer-less overoad, as it just delegates to the one with a comparer.

Before we get into the guts of SequenceEqual, it’s time for a bit of refactoring. If we’re going to optimize for count, we’ll need to perform the same type tests as Count() twice. That would be horrifically ugly inline, so let’s extract the functionality out into a private method (which Count() can then call as well):

private static bool TryFastCount<TSource>(
    IEnumerable<TSource> source,
    out int count)
{
    // Optimization for ICollection<T>
    ICollection<TSource> genericCollection = source as ICollection<TSource>;
    if (genericCollection != null)
    {
        count = genericCollection.Count;
        return true;
    }

    // Optimization for ICollection
    ICollection nonGenericCollection = source as ICollection;
    if (nonGenericCollection != null)
    {
        count = nonGenericCollection.Count;
        return true;
    }
    // Can’t retrieve the count quickly. Oh well.
    count = 0;
    return false;
}

Pretty simple. Note that we always have to set the out parameter to some value. We use 0 on failure – which happens to work out nicely in the Count, as we can just start incrementing if TryFastCount has returned false.

Now we can make a start on SequenceEqual. Here’s the skeleton before we start doing the real work:

public static bool SequenceEqual<TSource>(
    this IEnumerable<TSource> first,
    IEnumerable<TSource> second,
    IEqualityComparer<TSource> comparer)
{
    if (first == null)
    {
        throw new ArgumentNullException("first");
    }
    if (second == null)
    {
        throw new ArgumentNullException("second");
    }

    int count1;
    int count2;
    if (TryFastCount(first, out count1) && TryFastCount(second, out count2))
    {
        if (count1 != count2)
        {
            return false;
        }
    }

    comparer = comparer ?? EqualityComparer<TSource>.Default;

    // Main part of implementation goes here
}

I could have included the comparison between count1 and count2 within the single "if" condition, like this:

if (TryFastCount(first, out count1) && 
    TryFastCount(second, out count2) &&
    count1 != count2)
{
    return false;
}

… but I don’t usually like using the values of out parameters like this. The behaviour is well-defined and correct, but it just feels a little ugly to me.

Okay, now let’s implement the "Main part" which at the bottom of the skeleton. The idea is simple:

  • Get the iterators for both sequences
  • Use the iterators "in parallel" (not in the multithreading sense, but in the movement of the logical cursor down the sequence) to compare pairs of elements; we can return false if we ever see an unequal pair
  • If we ever see that one sequence has finished and the other hasn’t, we can return false
  • If we get to the end of both sequences in the same iteration, we can return true

I’ve got three different ways of representing the basic algorithm in code though. Fundamentally, the problem is that we don’t have a way of iterating over pairs of elements in two sequences with foreach – we can’t use one foreach loop inside another, for hopefully obvious reasons. So we’ll have to call GetEnumerator() explicitly on at least one of the sequences… and we could do it for both if we want.

The first implementation (and my least favourite) does use a foreach loop:

using (IEnumerator<TSource> iterator2 = second.GetEnumerator())
{
    foreach (TSource item1 in first)
    {
        // second is shorter than first
        if (!iterator2.MoveNext())
        {
            return false;
        }
        if (!comparer.Equals(item1, iterator2.Current))
        {
            return false;
        }
    }
    // If we can get to the next element, first was shorter than second.
    // Otherwise, the sequences are equal.
    return !iterator2.MoveNext();
}

I don’t have a desperately good reason for picking them this way round (i.e. foreach over first, and GetEnumerator() on second) other than that it seems to still give primacy to first somehow… only first gets the "special treatment" of a foreach loop. (I can almost hear the chants now, "Equal rights for second sequences! Don’t leave us out of the loop! Stop just ‘using’ us!") Although I’m being frivolous, I dislike the asymmetry of this.

The second attempt is a half-way house: it’s still asymmetric, but slightly less so as we’re explicitly fetching both iterators:

using (IEnumerator<TSource> iterator1 = first.GetEnumerator(),
                            iterator2 = second.GetEnumerator())
{
    while (iterator1.MoveNext())
    {
        // second is shorter than first
        if (!iterator2.MoveNext())
        {
            return false;
        }
        if (!comparer.Equals(iterator1.Current, iterator2.Current))
        {
            return false;
        }
    }
    // If we can get to the next element, first was shorter than second.
    // Otherwise, the sequences are equal.
    return !iterator2.MoveNext();
}

Note the use of the multi-variable "using" statement; this is equivalent nesting one statement inside another, of course.

The similarities between these two implementations are obvious – but the differences are worth pointing out. In the latter approach, we call MoveNext() on both sequences, and then we access the Current property on both sequences. In each case we use iterator1 before iterator2, but it still feels like they’re being treated more equally somehow. There’s still the fact that iterator1 is being used in the while loop condition, whereas iterator2 has to be used both inside and outside the while loop. Hmm.

The third implementation takes this even further, changing the condition of the while loop:

using (IEnumerator<TSource> iterator1 = first.GetEnumerator(),
       iterator2 = second.GetEnumerator())
{
    while (true)
    {
        bool next1 = iterator1.MoveNext();
        bool next2 = iterator2.MoveNext();
        // Sequences aren’t of same length. We don’t
        // care which way round.
        if (next1 != next2)
        {
            return false;
        }
        // Both sequences have finished – done
        if (!next1)
        {
            return true;
        }
        if (!comparer.Equals(iterator1.Current, iterator2.Current))
        {
            return false;
        }
    }

This feels about as symmetric as we can get. The use of next1 in the middle "if" condition is incidental – it could just as easily be next2, as we know the values are equal. We could switch round the order of the calls to MoveNext(), the order of arguments to comparer.Equals – the structure is symmetric.

I’m not generally a fan of while(true) loops, but in this case I think I rather like it. It makes it obvious that we’re going to keep going until we’ve got a good reason to stop: one of the three return statements. (I suppose I should apologise to fans of the dogma around single exit points for methods, if any are reading. This must be hell for you…)

Arguably this is all a big fuss about nothing – but writing Edulinq has given me a new appreciation for diving into this level of detail to find the most readable code. As ever, I’d be interested to hear your views. (All three versions are in source control. Which one is active is defined with a #define.)

Conclusion

I really don’t know why Microsoft didn’t implement the optimization around different lengths for SequenceEqual. Arguably in the context of LINQ you’re unlikely to be dealing with two materialized collections at a time – it’s much more common to have one collection and a lazily-evaluated query, or possibly just two queries… but it’s a cheap optimization and the benefits can be significant. Maybe it was just an oversight.

Our next operator also deals with pairs of elements, so we may be facing similar readability questions around it. It’s Zip – the only new LINQ query operator in .NET 4.

3 thoughts on “Reimplementing LINQ to Objects: Part 34 – SequenceEqual”

  1. Getting a little obsessive with the symmetry :) (Though, I do agree with the foreach feeling strange.) Not that I can talk with my obsessive “mimimize basic blocks and variables” approach. I really need to learn to trust variable lifetime analysis and branch prediction :/

    Also, even with “cold” sequences, seq.SequenceEqual(seq) can return false if the comparer’s Equals(x, x) can return false for some x (NaN being the obvious case).

  2. Typo: “the differences are work pointing out” should be “worth pointing out”.

    Why did you use a bool-and-out approach to TryCountFast and not a nullable? I can never decide which is better but I generally use nullables in these situations.

    This is the first time I’ve seen that using syntax, and I must say I _really_ don’t like it. What’s wrong with having two usings? With just one brace it looks right, and it’s a bit clearer.

    I prefer the symmetric approach in this case for pretty much the same reasons as you – it’s a symmetric function, the usage of both iterators should be done the same.

  3. @configurator: Thanks for the typo correction – fixed now.

    The reason for TryFastCount not using a nullable is simply to follow the pattern used by the framework. I prefer using a nullable or a tuple, but as I’ll be posting later, I’d like to see this as a standard operator – which means following conventions.

    I’m personally fine with either style of using statement – it does emphasize that the types are the same…

Comments are closed.