Data pipelines as a conceptual stepping stone to higher order functions

I was explaining data pipelines in LINQ to Objects to a colleague yesterday, partly as the next step after explaining iterator blocks, and partly because, well, I love talking about C# 3 and LINQ. (Really, it’s becoming a serious problem. Now that I’ve finished writing the main text of my book, my brain is properly processing the stuff I’ve written about. I hadn’t planned to be blogging as actively as I have been recently – it’s just a natural result of thinking about cool things to do with LINQ. It’s great fun, but in some ways I hope it stops soon, because I’m not getting as much sleep as I should.)

When I was describing how methods like Select take an iterator and return another iterator which, when called, will perform appropriate transformations, something clicked. There’s a conceptual similarity between data pipelines using deferred execution, and higher order functions. There’s a big difference though: I can get my head round data pipelines quite easily, whereas I can only understand higher order functions for minutes at a time, and only if I’ve had a fair amount of coffee. I know I’m not the only one who finds this stuff tricky.

So yes, there’s a disconnect in difficulty level – but I believe that the more comfortable one is with data pipelines, the more feasible it is to think of higher order functions. Writing half the implementation of Push LINQ (thanks go to Marc Gravell for the other half) helped my “gut feel” understanding of LINQ itself significantly, and I believe they helped me cope with currying and the like as well.

I have a sneaking suspicion that if everyone who wanted to use LINQ had to write their own implementation of LINQ to Objects (or at least a single overload of each significant method) there’d be a much greater depth of understanding. This isn’t a matter of knowing the fiddly bits – it’s a case of grokking just why OrderBy needs to buffer data, and what deferred execution really means.

(This is moving off the topic slightly, but I really don’t believe it’s inconceivable to suggest that “average” developers implement most of LINQ to Objects themselves. It sounds barmy, but the tricky bit with LINQ to Objects is the design, not the implementation. Suggesting that people implement LINQ to SQL would be a different matter, admittedly. One idea I’ve had for a talk, whether at an MVP open day or a Developer Developer Developer day is to see how much of LINQ to Objects I could implement in a single talk/lecture slot, explaining the code as I went along. I’d take unit tests but no pre-written implementation. I really think that with an experienced audience who didn’t need too much hand-holding around iterator blocks, extension methods etc to start with, we could do most of it. I’m not saying it would be efficient (thinking particularly of sorting in a stable fashion) but it would be feasible, and I think it would encourage people to think about it more themselves. Any thoughts from any of you? Oh, and yes, I did see the excellent video of Luke Hoban doing Select and Where. I’d try to cover grouping, joins etc as well.)

I’m hoping that the more experience I have with writing funky little sequence generators/processors, the more natural functional programming with higher order functions will feel. For those of you who are already comfortable with higher order functions, does this sound like a reasonable hope/expectation? Oh, and is the similarity between the two situations anything to do with monads? I suspect so, but I don’t have a firm enough grasp on monads to say for sure. It’s obviously to do with composition, either way.

4 thoughts on “Data pipelines as a conceptual stepping stone to higher order functions”

  1. Re: implementing linq to objects.

    Some interesting statistics for you.

    The entire C# source code for the linq to objects extension method library is ~2400 lines of code, including blank lines and comments. There are ~240 methods in that file. That means that on average, every method in linq to objects has fewer than ten lines of code. (Since at least two of those lines are blank or braces, most are under eight lines of real code.) A significant number of them are one or two lines. I believe the longest single function is the constructor for the object which does the buffering for the Reverse code, at 28 lines.

    This speaks to the power of both functional composition and iterator “yield” syntax. In general, each one of these methods does something very specific and very clear. That makes it easier to know that you’ve gotten it right.

  2. Re: higher order functions vs sequence generators, vs monads.

    Yes, all of these are at some conceptual level the same thing.

    When programming with “low order” functions we’re used to the idea of “I’ll give you some input data, and you give me back some ouput data”.

    But all of the above are a way to capture the notion of “I’ll give you some input data, and you’ll give me back something which represents an operation on that input data. I’ll then invoke that operation when I want the output data”.

    With a higher order function like x=>y=>x+y, you say “here’s some data: 10. Give me back something which represents the operation “add ten”. When I want to add ten to something, I’ll call that.”

    With a sequence generator, it gets a bit more complicated because the function you get back “remmebers” its state. You say “here’s some data: “list of numbers from 10 to 20 by twos”, give me back something which represents the operation ‘what’s the next number?’. When I want the next number, I’ll call that.”

    Monads are a bit more complicated still. With a monad, you say, “here’s some data: [10, 20, 30]. Give me back an object which represents applying the identity function to that data, ie, no transformation whatsoever. OK, now that I have that monad, give me another monad which represents the action of applying the “count” function to the output of the previous monad. Maybe I’ll use that monad to build up yet more monads, or maybe I’ll just ask it to run itself and get the answer”.

    Does that help?

  3. I think it will help in the long run – I just need time, and reading lots more, before it all sinks in. I’m not expecting to fully “get” it any time soon, but I’m enjoying the journey.

    Oh, and for a point of comparison, the extension methods on DataProducer in Push LINQ are currently 1856 lines – but they don’t include some of the operators as they don’t make all make sense in a push environment. I suspect the implementation of Push LINQ is necessarily a little more complex than LINQ to Objects due to the requirement for futures.

    I do find it amazing how much can be done with how little code. (Cue Winston Churchill.)

  4. The way I usually think of monads is as a pattern, rather than as a type constructor, because outside of languages like Haskell, they aren’t available directly via a type constructor.

    By type constructor, I mean something like the ‘[]’ operator or the ‘class’ keyword in C# – syntaxes which construct types, rather than values.

    A monad is like a generic class which can logically wrap values (let’s say of type A), and supports several methods.

    One of the methods is essentially the constructor – create a monad with this ordinary value as its innards. This is the “monad which represents applying the identity function” to the normal data that Eric was talking about. It gets data into the monad.

    Another of the methods is an apply method, which takes a function of the monad’s inner type to some other type B (i.e. type A -> B), and this method returns a new monad (specialized on type B) which represents the application of that function on the inner state of the monad. The monad doesn’t *have* to blindly apply the function to the inner state – hopefully you can see that for some List monad, such as IEnumerable, instead it might apply the function to each element in the list, or only to some elements filtered by another argument, etc. Similarly, for a Maybe monad, it would only actually apply the function if the monad actually contains a value. Similarly again, it doesn’t have to act right now, instead it can return a monad which, when eventually evaluated, will apply that function – delayed evaluation.

    Logically, this is what the IO monad in Haskell does – rather than perform IO, it returns a monad which contains instructions on what IO to perform at some point in the future, which gets it off the hook in terms of immutability in purely functional languages. It’s not until the final value is returned by the outermost evaluation loop of the Haskell environment that the IO operations need to be logical executed – you could instead think of all IO operations in a Haskell program as building an *imperative* program *functionally*, which then gets executed once the functional bit has been evaluated.

    There’s a few more bits to monads, aka the rules of monads, but hopefully the above gives the essential gist which enlightens sufficiently to see the rest.

Comments are closed.