Reimplementing LINQ to Objects: Part 35 – Zip

Zip will be a familiar operator to any readers who use Python. It was introduced in .NET 4 – it’s not entirely clear why it wasn’t part of the first release of LINQ, to be honest. Perhaps no-one thought of it as a useful operator until it was too late in the release cycle, or perhaps implementing it in the other providers (e.g. LINQ to SQL) took too long. Eric Lippert blogged about it in 2009, and I find it interesting to note that aside from braces, layout and names we’ve got exactly the same code. (I read the post at the time of course, but implemented it tonight without looking back at what Eric had done.) It’s not exactly surprising, given how trivial the implementation is. Anyway, enough chit-chat…

What is it?

Zip has a single signature, which isn’t terribly complicated:

public static IEnumerable<TResult> Zip<TFirst, TSecond, TResult>(
    this IEnumerable<TFirst> first,
    IEnumerable<TSecond> second,
    Func<TFirst, TSecond, TResult> resultSelector)

Just from the signature, the name, and experience from the rest of this blog series it should be easy enough to guess what Zip does:

  • It uses deferred execution, not reading from either sequence until the result sequence is read
  • All three parameters must be non-null; this is validated eagerly
  • Both sequences are iterated over "at the same time": it calls GetEnumerator() on each sequence, then moves each iterator forward, then reads from it, and repeats.
  • The result selector is applied to each pair of items obtained in this way, and the result yielded
  • It stops when either sequence terminates
  • As a natural consequence of how the sequences are read, we don’t need to perform any buffering: we only care about one element from each sequence at a time.

There are really only two things that I could see might have been designed differently:

  • It could have just returned IEnumerable<Tuple<TFirst, TSecond>> but that would have been less efficient in many cases (in terms of the GC) and inconsistent with the rest of LINQ
  • It could have provided different options for what to do with sequences of differents lengths. For example:
    • Throw an exception
    • Use the default value of the shorter sequence type against the remaining items of the longer sequence
    • Use a specified default value of the shorter sequence in the same way

I don’t have any problem with the design that’s been chosen here though.

What are we going to test?

There are no really interesting test cases here. We test argument validation, deferred execution, and the obvious "normal" cases. I do have tests where "first" is longer than "second" and vice versa.

The one test case which is noteworthy isn’t really present for the sake of testing at all – it’s to demonstrate a technique which can occasionally be handy. Sometimes we really want to perform a projection on adjacent pairs of elements. Unfortunately there’s no LINQ operator to do this naturally (although it’s easy to write one) but Zip can provide a workaround, so long as we don’t mind evaluating the sequence twice. (That could be a problem in some cases, but is fine in others.)

Obviously if you just zip a sequence with itself directly you get each element paired with the same one. We effectively need to "shift" or "delay" one sequence somehow. We can do this using Skip, as shown in this test:

[Test]
public void AdjacentElements()
{
    string[] elements = { "a", "b", "c", "d", "e" };
    var query = elements.Zip(elements.Skip(1), (x, y) => x + y);
    query.AssertSequenceEqual("ab", "bc", "cd", "de");
}

It always takes me a little while to work out whether I want to make first skip or second – but if we want the second element as the first element of second (try getting that right ten times in a row – it makes sense, honest!) means that we want to call Skip on the sequence used as the argument for second. Obviously it would work the other way round too – we’d just get the pairs presented with the values switched, so the results of the query above would be "ba", "cb" etc.

Let’s implement it!

Guess what? It’s yet another operator with a split implementation between the argument validation and the "real work". I’ll skip argument validation, and get into the tricky stuff. Are you ready? Sure you don’t want another coffee?

private static IEnumerable<TResult> ZipImpl<TFirst, TSecond, TResult>(
    IEnumerable<TFirst> first,
    IEnumerable<TSecond> second,
    Func<TFirst, TSecond, TResult> resultSelector)
{
    using (IEnumerator<TFirst> iterator1 = first.GetEnumerator())
    using (IEnumerator<TSecond> iterator2 = second.GetEnumerator())
    {
        while (iterator1.MoveNext() && iterator2.MoveNext())
        {
            yield return resultSelector(iterator1.Current, iterator2.Current);
        }
    }
}

Okay, so possibly "tricky stuff" was a bit of an overstatement. Just about the only things to note are:

  • I’ve "stacked" the using statements instead of putting the inner one in braces and indenting it. For using statements with different variable types, this is one way to keep things readable, although it can be a pain when tools try to reformat the code. (Also, I don’t usually omit optional braces like this. It does make me feel a bit dirty.)
  • I’ve used the "symmetric" approach again instead of a using statement with a foreach loop inside it. That wouldn’t be hard to do, but it wouldn’t be as simple.

That’s just about it. The code does exactly what it looks like, which doesn’t make for a very interesting blog post, but does make for good readability.

Conclusion

Two operators to go, one of which I might not even tackle fully (AsQueryable – it is part of Queryable rather than Enumerable, after all).

AsEnumerable should be pretty easy…

18 thoughts on “Reimplementing LINQ to Objects: Part 35 – Zip”

  1. That’s another method I implemented in one of our libs which will have to go when we finally move to .NET 4.0. (the other is String.Join(String, IEnumerable) implemented as an extension method). It’s reassuring to see I got the same code as you and Eric.

    I don’t think the using stacking is too dirty, although I understand that you do need to be aware that it only works by virtue of the fact the second using block is regarded as a single statement. It’s a shame there’s no syntax for declaring multiple variables of differing types in a single using block. It’s either using stacking or code creeping across your screen.

    Another way might be a type like CompositeDisposable, although that’s probably overkill.

  2. C# doesn’t allow you to do “using x=E1, y=E2″? VB does.

    Zip is the main operator I wish worked in query expressions. In particular, the naming is a lot worse with lambdas than it would be with some nice behind-the-scenes anonymous types.

    IEnumerable Step(this IEnumerable sequence, int step)
    {
    return from e in sequence
    zip i in naturals
    where i % step == 0
    select e
    }

  3. @Strilanc: C# allows multiple variables to be declared in one using statement, but only if they’re of the same type.

  4. From memory, Visual Studio knows what’s going on and will *unindent* to that formatting for consecutive usings.

    Also, I’m not sure how I feel about the different length behavior. This way does make stuff easier, but the BCL has tended towards ensuring you know what’s going on.

  5. @Simon: I’m not quite sure what you mean about the “different length behaviour” – could you clarify?

  6. @skeet: Sorry, I was a little unclear, the documented behavior of Zip when the sequences have different lengths. Not referencing your implementation.

  7. I was going to write a comment but Simon already wrote it.

    VS treats “stacked” usings very nicely. And the different length behaviour is surprising and would cause loss of information, which isn’t a good combination. What’s wrong with getting default(T) for the rest?

  8. Zip done in LinqBridge compatible terms for VB as

    Sub Main
    dim elements() as string={“a”,”b”,”c”,”d”,”e”}
    dim query = zipbyindex(elements,elements.Skip(1), Function (x,y) x+y)
    ‘query.AssertSequenceEqual(“ab”, “bc”, “cd”, “de”)
    query.Dump()

    End Sub

    ‘ Define other methods and classes here

    Public Function ZipByIndex(Of TFirst, TSecond, TResult)(ByVal first As IEnumerable(Of TFirst), _
    ByVal second As IEnumerable(Of TSecond), _
    ByVal resultSelector As Func(Of TFirst, TSecond, TResult) _
    ) As IEnumerable(Of TResult)

    Dim q = From f In first.Select(Function(item1, index) New With {index, item1}) _
    Join s In second.Select(Function(item2, index) New With {index, item2}) _
    On f.index Equals s.index _
    Select resultSelector(f.item1, s.item2)
    Return q
    End Function

    Where there other tests to pass or run? or any obvious issues you see with this?

  9. @Brandon: Using a join means the second sequence is completely consumed. Zip itself can be used with two infinite sequences – the result will be infinite too, but with no significant memory usage.

  10. @Brandon: Changing the model of how it’s read fundamentally changes the operator IMO. Given that Zip is relatively easy to implement “normally”, why not just do that?

  11. Is this a typo: “I do have tests where first is longer than short”?

    Should it be “first is longer than second”?

  12. Thank you for this article! I could easily expand on it to create a zip for three sequences. :) Just wanted to point out that you forgot the ‘this’ of the first parameter.

  13. @Steven: I didn’t – the code I’ve shown is the “impl” method which would be *called* by the actual extension method, after performing argument validation.

    I deliberately left out the extension method itself as it’s tedious and exactly what you’d expect.

  14. @Jon: Oh, I missed that. :) I did argument checking using code contracts, so I placed everything directly in the extension method. Thanks for pointing it out.

Comments are closed.