Reimplementing LINQ to Objects: Part 9 – SelectMany

The next operator we’ll implement is actually the most important in the whole of LINQ. Most (all?) of the other operators returning sequences can be implemented via SelectMany. We’ll have a look at that at the end of this post, but let’s implement it first.

What is it?

SelectMany has 4 overloads, which look gradually more and more scary:

public static IEnumerable<TResult> SelectMany<TSource, TResult>(
    this IEnumerable<TSource> source,
    Func<TSource, IEnumerable<TResult>> selector)

public static IEnumerable<TResult> SelectMany<TSource, TResult>(
    this IEnumerable<TSource> source,
    Func<TSource, int, IEnumerable<TResult>> selector)

public static IEnumerable<TResult> SelectMany<TSource, TCollection, TResult>(
    this IEnumerable<TSource> source,
    Func<TSource, IEnumerable<TCollection>> collectionSelector,
    Func<TSource, TCollection, TResult> resultSelector)

public static IEnumerable<TResult> SelectMany<TSource, TCollection, TResult>(
    this IEnumerable<TSource> source,
    Func<TSource, int, IEnumerable<TCollection>> collectionSelector,
    Func<TSource, TCollection, TResult> resultSelector)

These aren’t too bad though. Really these are just variations of the same operation, with two "optional" bits.

In every case, we start with an input sequence. We generate a subsequence from each element in the input sequence using a delegate which can optionally take a parameter with the index of the element within the original collection.

Now, we either return each element from each subsequence directly, or we apply another delegate which takes the original element in the input sequence and the element within the subsequence.

In my experience, uses of the overloads where the original selector delegate uses the index are pretty rare – but the others (the first and the third in the list above) are fairly common. In particular, the C# compiler uses the third overload whenever it comes across a "from" clause in a query expression, other than the very first "from" clause.

It helps to put this into a bit more context. Suppose we have a query expression like this:

var query = from file in Directory.GetFiles("logs")
            from line in File.ReadLines(file)
            select Path.GetFileName(file) + ": " + line;

That would be converted into a "normal" call like this:

var query = Directory.GetFiles("logs")
                     .SelectMany(file => File.ReadLines(file),
                                 (file, line) => Path.GetFileName(file) + ": " + line);

In this case the compiler has used our final "select" clause as the projection; if the query expression had continued with "where" clauses etc, it would have created a projection to just pass along "file" and "line" in an anonymous type. This is probably the most confusing bit of the query translation process, involving transparent identifiers. For the moment we’ll stick with the simple version above.

So, the SelectMany call above has three arguments really:

  • The source, which is a list of strings (the filenames returned from Directory.GetFiles)
  • An initial projection which converts from a single filename to a list of the lines of text within that file
  • A final projection which converts a (file, line) pair into a single string, just by separating them with ": ".

The result is a single sequence of strings – every line of every log file, prefixed with the filename in which it appeared. So writing out the results of the query might give output like this:

test1.log: foo
test1.log: bar
test1.log: baz
test2.log: Second log file
test2.log: Another line from the second log file

It can take a little while to get your head round SelectMany – at least it did for me – but it’s a really important one to understand.

A few more details of the behaviour before we go into testing:

  • The arguments are validated eagerly – everything has to be non-null.
  • Everything is streamed. So only one element of the input is read at a time, and then a subsequence is produced from that. Only one element is then read from the subsequence at a time, yielding the results as we go, before we move onto the next input element and thus the next subsequence etc.
  • Every iterator is closed when it’s finished with, just as you’d expect by now.

What are we going to test?

I’m afraid I’ve become lazy by this point. I can’t face writing yet more tests for null arguments. I’ve written a single test for each of the overloads. I found it hard to come up with a clear way of writing the tests, but here’s one example, for the most complicated overload:

[Test]
public void FlattenWithProjectionAndIndex()
{
    int[] numbers = { 3, 5, 20, 15 };
    var query = numbers.SelectMany((x, index) => (x + index).ToString().ToCharArray(),
                                   (x, c) => x + ": " + c);
    // 3 => "3: 3"
    // 5 => "5: 6"
    // 20 => "20: 2", "20: 2"
    // 15 => "15: 1", "15: 8"
    query.AssertSequenceEqual("3: 3", "5: 6", "20: 2", "20: 2", "15: 1", "15: 8");
}

So, to give a bit more explanation to this:

  • Each number is summed with its index (3+0, 5+1, 20+2, 15+3)
  • Each sum is turned into a string, and then converted into a char array. (We don’t really need to ToCharArray call as string implements IEnumerable<char> already, but I thought it made it clearer.)
  • We combine each subsequence character with the original element it came from, in the form: "original value: subsequence character"

The comment shows the eventual results from each input, and the final test shows the complete result sequence.

Clear as mud? Hopefully it’s not to bad when you look at each step in turn. Okay, now let’s make it pass…

Let’s implement it!

We could implement the first three overloads in terms of calls to the final one – or more likely, a single "Impl" method without argument validation, called by all four public methods. For example, the simplest method could be implemented like this:

public static IEnumerable<TResult> SelectMany<TSource, TResult>(
    this IEnumerable<TSource> source,
    Func<TSource, IEnumerable<TResult>> selector)
{
    if (source == null)
    {
        throw new ArgumentNullException("source");
    }
    if (selector == null)
    {
        throw new ArgumentNullException("selector");
    }
    return SelectManyImpl(source,
                          (value, index) => selector(value),
                          (originalElement, subsequenceElement) => subsequenceElement);
}

However, I’ve decided to implement each of the methods separately – splitting them into the public extension method and a "SelectManyImpl" method with the same signature each time. I think that would make it simpler to step through the code if there were ever any problems… and it allows us to see the differences between the simplest and most complicated versions, too:

// Simplest overload
private static IEnumerable<TResult> SelectManyImpl<TSource, TResult>(
    IEnumerable<TSource> source,
    Func<TSource, IEnumerable<TResult>> selector)
{
    foreach (TSource item in source)
    {
        foreach (TResult result in selector(item))
        {
            yield return result;
        }
    }
}

// Most complicated overload:
// – Original projection takes index as well as value
// – There’s a second projection for each original/subsequence element pair
private static IEnumerable<TResult> SelectManyImpl<TSource, TCollection, TResult>(
    IEnumerable<TSource> source,
    Func<TSource, int, IEnumerable<TCollection>> collectionSelector,
    Func<TSource, TCollection, TResult> resultSelector)
{
    int index = 0;
    foreach (TSource item in source)
    {
        foreach (TCollection collectionItem in collectionSelector(item, index++))
        {
            yield return resultSelector(item, collectionItem);
        }
    }
}

The correspondence between the two methods is pretty clear… but I find it helpful to have the first form, so that if I ever get confused about the fundamental point of SelectMany, it’s really easy to understand it based on the simple overload. It’s then not too big a jump to apply the two extra "optional" complications, and end up with the final method. The simple overload acts as a conceptual stepping stone, in a way.

Two minor points to note:

  • The first method could have been implemented with a "yield foreach selector(item)" if such an expression existed in C#. Using a similar construct in the more complicated form would be harder, and involve another call to Select, I suspect… probably more hassle than it would be worth.
  • I’m not explicitly using a "checked" block in the second form, even though "index" could overflow. I haven’t looked to see what the BCL does in this situation, to be honest – I think it unlikely that it will come up. For consistency I should probably use a checked block on every method which uses an index like this… or just turn arithmetic checking on for the whole assembly.

Reimplementing operators using SelectMany

I mentioned early on in this post that many of the LINQ operators can be implemented via SelectMany. Just as a quick example of this, here are alternative implementations of Select, Where and Concat:

public static IEnumerable<TResult> Select<TSource, TResult>(
    this IEnumerable<TSource> source,
    Func<TSource, TResult> selector)
{
    if (source == null)
    {
        throw new ArgumentNullException("source");
    }
    if (selector == null)
    {
        throw new ArgumentNullException("selector");
    }
    return source.SelectMany(x => Enumerable.Repeat(selector(x), 1));
}

public static IEnumerable<TSource> Where<TSource>(
    this IEnumerable<TSource> source,
    Func<TSource, bool> predicate)
{
    if (source == null)
    {
        throw new ArgumentNullException("source");
    }
    if (predicate == null)
    {
        throw new ArgumentNullException("predicate");
    }
    return source.SelectMany(x => Enumerable.Repeat(x, predicate(x) ? 1 : 0));
}

public static IEnumerable<TSource> Concat<TSource>(
    this IEnumerable<TSource> first,
    IEnumerable<TSource> second)
{
    if (first == null)
    {
        throw new ArgumentNullException("first");
    }
    if (second == null)
    {
        throw new ArgumentNullException("second");
    }
    return new[] { first, second }.SelectMany(x => x);
}

Select and Where use Enumerable.Repeat as a convenient way of creating a sequence with either a single element or none. You could alternatively create a new array instead. Concat just uses an array directly: if you think of SelectMany in terms of its flattening operation, Concat is a really natural fit. I suspect that Empty and Repeat are probably feasible with recursion, although the performance would become absolutely horrible.

Currently the above implementations are in the code using conditional compilation. If this becomes a popular thing for me to implement, I might consider breaking it into a separate project. Let me know what you think – my gut feeling is that we won’t actually gain much more insight than the above methods give us… just showing how flexible SelectMany is.

SelectMany is also important in a theoretical way, in that it’s what provides the monadic nature of LINQ. It’s the "Bind" operation in the LINQ monad. I don’t intend to say any more than that on the topic – read Wes Dyer’s blog post for more details, or just search for "bind monad SelectMany" for plenty of posts from people smarter than myself.

Conclusion

SelectMany is one of LINQ’s fundamental operations, and at first sight it’s a fearsome beast. As soon as you understand that the basic operation is a flattening projection just with a couple of optional twiddles, it’s easily tamed.

Next up I’ll implement All and Any – which are nice and easy to describe by comparison.

4 thoughts on “Reimplementing LINQ to Objects: Part 9 – SelectMany”

  1. The mutable i++ in an otherwise functional implementation of the library really threw me for a mental loop. I’ve been racking my brain to see if there was a simple way of eliminating it, perhaps by wrapping the call to “collectionSelector(item, index++)” in some kind of lambda that would use Enumerable.Range, but I can’t work out exactly how. This tends to happen in my code on occasion. I end up wanting ‘interleave’ two enumerables, but still use yield.
    I think that ‘selectMany’ itself can do this, but we’re implementing selectMany. Any Ideas?

  2. @Neal: With tail recursion you could probably do it by starting off with the list and 0, and then recursively calling a method which takes “(list, index)”, does something with the head and the index, then calls the same method with “tail of list, index + 1″.

  3. I’m not sure If you will ever see this comment, but for what it is worth, the BCL *does* check for overflow.

    I strongly suspect that the BCL is compiled with the “/checked” option, so only methods where overflow is desired (and so have an explicit “unchecked” block) will not result in the OverflowException.

Comments are closed.