Lessons learned from Protocol Buffers, part 3: generic type relationships

In part 2 of this series we saw how the message and builder interfaces were self-referential in order to allow the implementation types to be part of the API. That’s one sort of relationship, but in this post we’ll see how the two interfaces relate to each other. If you remember from part 1 every generated message type has a corresponding builder type. As it happens, this is implemented with a nested type, so if you had a Person message, the generated types would be Person and Person.Builder (in a specified namespace, of course).

Without any interfaces involved, this would be very simple. The types would just look like this (with more members, of course):

public class Person
{
    public static Builder CreateBuilder() { … }

    public Builder CreateBuilderForType() { … }

    public class Builder
    {
        public Builder() { … }

        public Person Build() { … }
    }
}

You may well be wondering why there are two methods for creating a builder. The static method is convenient for code which knows it’s dealing with the Person message. The instance method ends up being part of the message interface, which makes it useful for code which can work with any message. In addition, the constructor for Person.Builder is accessible in the C# version. In the original Java code the only way of creating a builder is via the methods in the message class; I decided to remove this restriction for the sake of making the oh-so-readable object initializer syntax available in C# 3.

Redesigning the interfaces to refer to each other

In part 2 we created self-referential interfaces for the message and builder interfaces which looked like this:

public interface IMessage<TMessage> where TMessage : IMessage<TMessage>
{
    …
}

public interface IBuilder<TBuilder> where TBuilder : IBuilder<TBuilder>
{
    …
}

The constraints on the type parameters allow us to make the API very specific, and we can use the same trick again when we relate the builder and message types together. The step where we introduce a new type parameter to each of them is straightforward:

public interface IMessage<TMessage, TBuilder> where TMessage : IMessage<TMessage, TBuilder>
{
    …
}

public interface IBuilder<TMessage, TBuilder> where TBuilder : IBuilder<TMessage, TBuilder>
{
    …
}

Unfortunately without any restrictions on the “foreign” type parameter in each interface, we don’t get enough information to make everything work. We need to tie the two types together more tightly, like this:

public interface IMessage<TMessage, TBuilder>
    where TMessage : IMessage<TMessage, TBuilder>
    where TBuilder : IBuilder<TMessage, TBuilder>
{
    …
}

public interface IBuilder<TMessage, TBuilder>
    where TMessage : IMessage<TMessage, TBuilder>
    where TBuilder : IBuilder<TMessage, TBuilder>
{
    …

To make this concrete for Person and Person.Builder we end up with implementations like this:

public class Person : IMessage<Person, Builder>
{
    public static Builder CreateBuilder() { … }

    public Builder CreateBuilderForType() { … }

    public class Builder : IBuilder<Person, Builder>
    {
        public Builder() { … }

        public Person Build() { … }
    }
}

This works, but it’s really ugly. Any generic methods wanting to take a TMessage type parameter implementing IMessage<TMessage, TBuilder> have to also have a TBuilder type parameter, and the two constraints need to be expressed each time. It’s a real pain. In fact, I’ve got an IMessage<TMessage> interface which contains almost nothing in it (and which the more generic interface extends). This allows me to get hold of the message type (and use it in the API), inferring the builder type by reflection. That’s a pain too, frankly. It’s a particular nuisance because when I do infer the builder type, I haven’t actually got any compile-time constraint the lets any other code know that it’s the right builder type for the message type. In one specific case it’s led to this horrific method (in a type generic in TMessage:

private static TMessage BuildImpl<TMessage2, TBuilder> (Func<TBuilder> builderBuilder,
                                                        CodedInputStream input,
                                                        ExtensionRegistry registry)
    where TBuilder : IBuilder<TMessage2, TBuilder>
    where TMessage2 : TMessage, IMessage<TMessage2, TBuilder>
{
    TBuilder builder = builderBuilder();
    input.ReadMessage(builder, registry);
    return builder.Build();
}

Fortunately this is hidden from public view – and the only reason to do it at all is to enable a pleasant API of MessageStreamIterator<TMessage> : IEnumerable<TMessage> where TMessage : IMessage<TMessage>. The result of the evil method above is exactly what the caller is likely to want, otherwise I wouldn’t put up with it. However, that sort of excuse has been coming up far too much in the PB implementation, so I’ve had a quick think about what could be done about it.

Contemplating a more expressive language

I should really prefix this section by saying that I’m not actually suggesting this as a way forward for C# or .NET. (I suspect it would take more work in the CLR as well as just in the language; I don’t know enough about CLR generics to say for sure, but I’d be surprised if this were feasible.) I haven’t encountered many situations where I’ve wanted anything like this, and the extra complexity in the language would be quite high, I suspect. Suppose an interface could contain extra type parameters, including constraints, in the body of the interface:

// Purely imaginary syntax!
public interface IMessage<TMessage> where TMessage : IMessage<TMessage>
{
    <TBuilder> where TBuilder : IBuilder<TBuilder>, TBuilder.TMessage : TMessage

    // Normal methods, which could use TBuilder
}

public interface IBuilder<TBuilder> where TBuilder : IBuilder<TBuilder>
{
    <TMessage> where TMessage : IMessage<TMessage>, TMessage.TBuilder : TBuilder

    // Normal methods, which could use TMessage
}

There are various ways in which the interface implementation could indicate the type of TBuilder. The syntax itself isn’t particularly interesting – it’s the extra information which is conveyed which is the important bit. I’ve dithered between this being a step forward and it not. At first glance it looks no better than having both type parameters in the interface declaration, but I believe it would genuinely make a difference. For instance, the above evil method could be written as:

private static TMessage BuildImpl(Func<TMessage.TBuilder> builderBuilder,
                                  CodedInputStream input,
                                  ExtensionRegistry registry)
{
    TMessage.TBuilder builder = builderBuilder();
    input.ReadMessage(builder, registry);
    return builder.Build();

This time there’s no need for the method to be generic, because the type is already generic in the message type. Furthermore, we can call this method with no reflection. All other APIs which have previously had to be specify two type parameters can now just specify the one. Apart from anything else, this leaves more scope for type inference in generic methods – passing either a message or a builder to a generic method happens occasionally, but it’s very rare to pass in both.

We’ve essentially expressed the relationship between the message type and the builder type a little more explicitly, so that we can guarantee it exists (and use it) at compile time. That’s at the heart of the problem to start with – without a second type parameter in the initial interface declaration, in the current language there’s no way of expressing a close relationship with another type.

Conclusion

I don’t think it would be fair to say that C# really lets us down here – it happens not to support a pretty rare scenario, and that’s fair enough. I’d be interested to know whether any other languages allow the same sort of concepts to be expressed more pleasantly. The ugly solution I’ve presented here does at least work, and it’s nearly invisible to most users, who are likely to just reference the concrete generated types. I’m not happy with the verbosity which has become necessary in many places, but it’s in a good cause. It’s interesting to note that the Java API doesn’t use this sort of doubly-generic relationship: again, covariant return types allow the concrete message and builder types to express their APIs directly and still implement a more general interface at the same time.

In the next part I’ll look at another possibility which would make interfaces and generics a more powerful combination: static interface methods.