Lessons learned from Protocol Buffers, part 1: messages, builders and immutability

My C# port of the Protocol Buffers project has proved pretty interesting. I thought I’d share some of the lessons I’ve learned along the way, as well as some frustrations with concepts I still can’t express in C#.

This was originally all going to be in one post, but I’m becoming acutely aware of how long some posts can grow. I don’t know about you, but I find very long blog posts quite intimidating, so I’ve decided to split this one up into individual topics. You’ll probably still need to read the posts in sequence to understand them, though – and this introductory post is the most important one in that respect.

Messages and Builders

The Protocol Buffers project (or PB for short) is basically another serialization technology, putting emphasis on efficiency, platform neutrality, and backward/forward compatibility. The normal set of steps in using PB is something like this:

  1. Write a .proto file describing your data in terms of messages.
  2. Run protoc to generate C# (and Java/C++ if you so wish).
  3. In your application, use the builder associated with the message type to create an instance of a message.
  4. Serialize the data to a stream.
  5. At some other point in the application (or a different app) deserialize the data.

The idea is that builders are mutable, while the messages they build are immutable. You can use builders either with Set* methods which return the same builder again, or properties which can be used within object initializers. For example:

// Syntax available in C# 2
Person john = new Person.Builder()
    .SetFirstName("John")
    .SetLastName("Doe")
    .Build();

// Using an object initializer
Person jane = new Person.Builder
    { FirstName = "Jane", LastName = "Doe" }
    .Build();

Of course, you don’t have to do all the building in one expression; it’s just a handy option in many cases.

As you can see, the builder is generated as a nested type of the message. That’s handy, as it means the builder has access to the private members of the message. To avoid lots of data copying we employ popsicle immutability – the builder directly manipulates the message until it’s built, at which point it makes sure that nothing will change it afterwards. If that makes you uncomfortable in terms of it not being “true” immutability, I sympathise – but I also give String as a counterexample; StringBuilder works in exactly this way, modifying a string directly until it exposes it to the outside world.
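To make the idea concrete, here is a minimal hand-written sketch of popsicle immutability. It is not the code that protoc generates: the field names, the `built` flag and the `EnsureNotBuilt` helper are assumptions made for illustration, and this builder is deliberately single-use to keep the example short.

```csharp
using System;

// Hypothetical stand-in for a generated message type.
public sealed class Person
{
    // Private state: only Person and its nested Builder can write to it.
    private string firstName = "";
    private string lastName = "";

    // Callers cannot create a message directly; they must go through Builder.
    private Person() { }

    public string FirstName { get { return firstName; } }
    public string LastName { get { return lastName; } }

    public sealed class Builder
    {
        // The message being built. The builder mutates it in place, so no copy is needed.
        private readonly Person message = new Person();
        private bool built;

        public string FirstName
        {
            get { return message.firstName; }
            set { EnsureNotBuilt(); message.firstName = value; }
        }

        public string LastName
        {
            get { return message.lastName; }
            set { EnsureNotBuilt(); message.lastName = value; }
        }

        // Set* methods return the same builder so that calls can be chained.
        public Builder SetFirstName(string value) { FirstName = value; return this; }
        public Builder SetLastName(string value) { LastName = value; return this; }

        // Hands the message out and "freezes" it by refusing any further writes.
        public Person Build()
        {
            EnsureNotBuilt();
            built = true;
            return message;
        }

        private void EnsureNotBuilt()
        {
            if (built)
            {
                throw new InvalidOperationException("Build() has already been called.");
            }
        }
    }
}
```

Because the builder is nested inside `Person`, it can write to the private fields directly. Once `Build()` has returned, the only remaining path to those fields is through the builder, and the builder now refuses to use it.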

Other than the copying – and the fact that all the code exists explicitly, and the caller has to know about the builder – this is quite similar to the suggestion I made about C# immutability a while ago. One point which makes it all simpler is that every data type in Protocol Buffers is itself immutable – so we don’t need to worry about deep copies and the like.

Unfortunately the current implementation doesn’t support collection initializers – if you have a repeated field in your message, you have to call Add* to populate it. The Add* methods return the builder just like the Set* methods, so you can still do it all in one expression, but it’s not terribly neat. Using a collection initializer compiles, but fails at execution time because the properties for repeated fields always return immutable lists. This is by design, to stop callers from creating a builder, fetching the list property, calling Build and then adding to the list. A better solution (and one which I plan to implement soon) is to have a PopsicleList<T> which is initially mutable but which will become immutable at the appropriate time (i.e. when Build() is called). At that point we’ll be able to write:

Person jane = new Person.Builder
    { FirstName = "Jane", LastName = "Doe",
      Friends = { "Tom", "Dick", "Harry" } }
    .Build();
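A list of that kind could look roughly like the sketch below. This is an assumption about the shape of `PopsicleList<T>`, not the planned implementation: the `Freeze` method, the choice of `NotSupportedException` and the tiny `FriendListBuilder` type are all hypothetical, and exist only to show why a collection initializer would start working.

```csharp
using System;
using System.Collections;
using System.Collections.Generic;

// Hypothetical list that starts out mutable and can be frozen exactly once.
public sealed class PopsicleList<T> : IEnumerable<T>
{
    private readonly List<T> items = new List<T>();
    private bool frozen;

    public int Count { get { return items.Count; } }
    public T this[int index] { get { return items[index]; } }

    // Collection initializers compile down to calls to an accessible Add method.
    public void Add(T item)
    {
        if (frozen)
        {
            throw new NotSupportedException("The list is read-only once it has been frozen.");
        }
        items.Add(item);
    }

    // Called by the owning builder at Build() time; there is no way to unfreeze.
    public void Freeze() { frozen = true; }

    public IEnumerator<T> GetEnumerator() { return items.GetEnumerator(); }
    IEnumerator IEnumerable.GetEnumerator() { return GetEnumerator(); }
}

// Hypothetical, cut-down builder exposing a single repeated field.
public sealed class FriendListBuilder
{
    private readonly PopsicleList<string> friends = new PopsicleList<string>();

    // A get-only property is enough: the initializer calls Friends.Add(...) for each item.
    public PopsicleList<string> Friends { get { return friends; } }

    public PopsicleList<string> Build()
    {
        friends.Freeze();
        return friends;
    }
}
```

With this in place, `new FriendListBuilder { Friends = { "Tom", "Dick", "Harry" } }.Build()` populates the list through `Add` while it is still mutable. A caller who keeps a reference to `Friends` and calls `Add` after `Build()` gets an exception rather than silently changing the built result.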

There’s quite a lot more to messages and builders than this – things like the reflection-like API to query properties of the message based on fields in the message descriptor – but what I’ve described so far ought to be enough for most of what I want to talk about, most of which relates to generics. In the next part, I’ll talk about self-referential generic types.

5 thoughts on “Lessons learned from Protocol Buffers, part 1: messages, builders and immutability”

  1. Hi man!

    I’m new to PB (I only learned about it yesterday) and I finally found a GREAT project porting this protocol to C#. But I don’t know how to run protoc to generate a .cs file.

    What are the parameters here? Could you please give me some advice :)

    Thank u a lot :X

    P/s : You can send your answer via my email : cnttvn.com@gmail.com. It’ll be more better.

  2. Thank you so much!

    This morning, I received your email and tried to build my own class. It works perfectly :X

    And, I have another question, could you please answer it to me :X ( I think you know it )

    I have a very long list of objects of my class (about 2 million or more). My problem is that when I load this list into memory (using PB of course) and then want to add a new object, that object must not already appear in my list. That means each object in my list is unique!

    Do you have any solution for this case ?

    Thanks again :)

  3. (Also cc’d)

    Once you’ve loaded the list into memory, you should use some sort of Set type (e.g. HashSet in .NET 3.5). You’ll want to specify some way of identifying that an object is “the same as” another, unless you want to use the default implementation of Equals/GetHashCode in Protocol Buffers, which compares every field.

    At this point it’s not really a Protocol Buffers problem so much as normal .NET collections classes.
