BSON vs YAML vs Protobuf

Posted on October 29, 2012

So, with the hope of harnessing the power of modern C++ and throwing out Boost dependencies, I went ahead and tried to move to C++11.

It was rather difficult at first, because when you use clang++ in C++11 mode you also get a different standard library: libc++ instead of the GNU libstdc++.

That is, the library that the core data types — std::string, std::vector, and friends — depend on and link against.

The libc++ authors deliberately broke binary compatibility so that they could focus on clean, elegant, and optimal implementations of the standard library.

This break, however, makes it hard to link against already-compiled libraries, and to build libraries that link other libraries. Simply put, you can't have two standard libraries side by side: the data types won't talk, because they aren't binary compatible. The compiler makes this visible by isolating libc++ in its own inline namespace, so if you see __1:: in your error messages, you're trying to cross this barrier futilely.
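
For concreteness, here is a minimal sketch of the symptom; the library name and function are hypothetical, but the failure mode is the real one:

```cpp
// built with: clang++ -std=c++11 -stdlib=libc++ main.cpp liblegacy.a
//
// If liblegacy.a was compiled against GNU libstdc++, the std::string in
// its interface is a different type from libc++'s std::__1::string, so
// the link fails with undefined-symbol errors mentioning __1::.
#include <string>

std::string greet();  // defined in the legacy, libstdc++-built library

int main() {
    return greet().empty() ? 1 : 0;
}
```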

I’ve converted several libraries to C++11. The only one that required no effort on my part was yaml-cpp.

A lot of the hard work was converting autotools/autoheader-generated files to be configured by CMake, my build system of choice.

It seems that the amount of configuration in PocoFoundation is too much for me to take on right now; there are a lot of platform-specific things.

Conversion

My set of libraries can be found on my GitHub.

BSON

First, I worked on converting BSON to C++11, and I took out the Boost dependencies and the viral AGPL-licensed code. It was rather easy once I figured out that the JSON-to-BSON function from their JSON code was used only for initializing a static null object. Once I figured out how to replicate that, I was able to cut the JSON code out without repercussion. Now the BSON library uses only Apache 2.0 licensed code, which is acceptable.
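
For reference, an empty BSON document is just five bytes — a little-endian int32 length followed by a terminating zero byte — so the static null object can be produced without any JSON machinery at all. A minimal sketch (the constant name is my own):

```cpp
#include <cstdint>

// The five bytes of an empty BSON document: the int32 total length
// (5, little-endian) followed by the terminating 0x00. Enough to stand
// in for a null/empty object without calling into the JSON parser.
static const std::uint8_t kEmptyBsonDoc[] = { 0x05, 0x00, 0x00, 0x00, 0x00 };
```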

Next, Boost System and/or Boost Thread were being used for the date functions. Luckily, std::chrono to the rescue! I reimplemented those functions on top of it, and it seems to work, though I haven't verified that the outputs match exactly.
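
A minimal sketch of that kind of replacement, assuming the target is BSON's Date type — a 64-bit count of milliseconds since the Unix epoch (the helper name here is illustrative):

```cpp
#include <chrono>
#include <cstdint>

// Milliseconds since the Unix epoch, as BSON's Date type expects;
// replaces the Boost-based time helpers with std::chrono.
// (system_clock uses the Unix epoch on the platforms involved.)
std::int64_t curTimeMillis64() {
    using namespace std::chrono;
    return duration_cast<milliseconds>(
        system_clock::now().time_since_epoch()).count();
}
```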

Moving to C++11 does, however, leave the test program unlinkable, since it still uses Boost and other libraries built against the old standard library.

Protobuf

Google Protobuf used an autoconf-based setup, which was interesting to deal with. The best solution seemed to be to pull the actual list of compiled files out of the Visual Studio project files, and then to regenerate by hand the config.h that autoheader would normally produce.
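
The regenerated header then just hard-codes the results of the configure checks. A minimal sketch, with macro names that are representative of what protobuf 2.x expects rather than the exact upstream list:

```cpp
/* Hand-maintained stand-in for the autoheader-generated config.h.
   Values here assume a C++11 toolchain with std::unordered_map;
   confirm each macro against the upstream config.h.in. */
#define HAVE_HASH_MAP 1
#define HASH_MAP_H <unordered_map>
#define HASH_MAP_CLASS unordered_map
#define HASH_NAMESPACE std
#define HAVE_PTHREAD 1
```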

Beyond that, there were only a few things I had to change. I don't recall all of them, but a few were standard-library related, like removing explicitly specified template parameters and letting the compiler deduce them.
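
One concrete instance of that class of change — not necessarily the exact lines in protobuf: std::make_pair grew forwarding-reference parameters in C++11, so code that spelled out its template arguments stopped compiling.

```cpp
#include <string>
#include <utility>

std::pair<std::string, int> example() {
    // Pre-C++11 style; fails to compile under C++11 because make_pair's
    // parameters are now forwarding references, and the explicit template
    // arguments no longer match the deduced ones:
    // return std::make_pair<std::string, int>("msgid", 1);

    // Dropping the explicit arguments lets the compiler deduce them:
    return std::make_pair(std::string("msgid"), 1);
}
```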

YAML

There were really no changes here, and it was already in CMake form!

ZeroMQ

The ØMQ library was autoconfigured as well, though it was easier to convert than Protobuf. Between the two of them, they cover a broad spectrum of the platform checks that potential future libraries may depend on, such as "does alloca.h exist?"
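
On the consuming side those probes become ordinary preprocessor guards; a minimal sketch, using the conventional autoheader macro name, with CMake expected to define it after probing:

```cpp
// HAVE_ALLOCA_H is defined by the CMake configure step when the header
// is found, mirroring what autoheader's config.h would have provided.
#ifdef HAVE_ALLOCA_H
#include <alloca.h>
#else
#include <cstdlib>  // some BSD-derived platforms declare alloca() here
#endif
```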

I’ve got it working and tested enough to seem fully functional.

One thing I found while using it: with the PUB-SUB (Publish-Subscribe) pattern, if the subscriber side wants to accept all messages, it still has to set a subscription prefix, even if that prefix is empty.
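
A minimal sketch of the subscriber side, using the C API of the era; the endpoint is illustrative, and the essential line is the empty-prefix subscribe:

```cpp
#include <zmq.h>

int main() {
    void* ctx = zmq_init(1);                  // one I/O thread
    void* sub = zmq_socket(ctx, ZMQ_SUB);
    zmq_connect(sub, "tcp://localhost:5555");

    // Without this call a SUB socket silently drops every message;
    // a zero-length prefix subscribes to everything.
    zmq_setsockopt(sub, ZMQ_SUBSCRIBE, "", 0);

    // ... receive loop goes here ...

    zmq_close(sub);
    zmq_term(ctx);
    return 0;
}
```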

Testing

So, now with all my formats set up in a C++11 environment on OS X, I proceeded to set up a receiving server and several clients that send messages to it. The relationship might seem backwards, but that's by design: it's a test model for a database server that could potentially support multiple data formats.

Note: Protobuf is normally schema'd. I made a value type and a value wrapper to let it act schema-less, so it is less efficient here than it would normally be. Even so, the size of the difference makes me skeptical of its usefulness for my purposes.
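
A hedged sketch of the kind of wrapper I mean, in proto2 syntax; the message and field names are illustrative, not the ones from my test:

```proto
// A generic, self-describing value: the field name travels with the
// data, which is exactly what a schema'd format otherwise avoids.
message Value {
  required string name         = 1;
  optional int64  int_value    = 2;
  optional double double_value = 3;
  optional string string_value = 4;
}

// A document is then just a list of such values.
message Document {
  repeated Value fields = 1;
}
```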

One important consideration: eventually I might want clients to be able to query the data, and a server that performs queries must understand the data format in order to search it, index it, and so on.

Test Conditions

Hackintosh laptop: Core i3 at 2.2 GHz, 4 GB of RAM, over TCP on localhost.

Message count: 100,000 messages

Test results for one field

"msgid" as a name, with the message number as the value

BSON

  • Average per message encoding: 1140 ns
  • Average per message decoding: 245 ns
  • Average Bytes per message: 16 bytes

Protobuf

  • Average per message encoding: 3981 ns
  • Average per message decoding: 4036 ns
  • Average Bytes per message: 15 bytes

YAML

  • Average per message encoding: 244490 ns
  • Average per message decoding: 412915 ns
  • Average Bytes per message: Not recorded

Test results for two fields

Two numeric fields, inserted in random order and sent to the server.

BSON

  • Average per message encoding: Not recorded
  • Average per message decoding: 299 ns
  • Average Bytes per message: 32 bytes

Protobuf

  • Average per message encoding: Not recorded
  • Average per message decoding: 7422 ns
  • Average Bytes per message: 57 bytes

YAML

  • Average per message encoding: Not recorded
  • Average per message decoding: 316531 ns
  • Average Bytes per message: 32 bytes

Test results for 50 fields

Fifty fields, randomly ordered, each holding the current message id multiplied by that field's index in the shuffled field-name array. Decoding time includes an access to "field_1", which lands at a random location within the data set.
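
As a sketch of how each such message can be assembled with the converted BSON library — the helper and its arguments are illustrative, not my actual test harness:

```cpp
#include <cstddef>
#include <string>
#include <vector>
// #include "bsonobjbuilder.h"  // from the converted BSON library

// Builds one 50-field test message: each field holds the message id
// multiplied by that field's index in the pre-shuffled name array.
mongo::BSONObj makeMessage(long long msgid,
                           const std::vector<std::string>& shuffledNames) {
    mongo::BSONObjBuilder b;
    for (std::size_t i = 0; i < shuffledNames.size(); ++i)
        b.append(shuffledNames[i], msgid * static_cast<long long>(i));
    return b.obj();
}
```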

BSON

  • Average per message encoding: 14196 ns
  • Average per message decoding: 302 ns
  • Average Bytes per message: 695 bytes

Protobuf

  • Average per message encoding: 110480 ns
  • Average per message decoding: 106609 ns
  • Average Bytes per message: 695 bytes

YAML

  • Average per message encoding: 7135667 ns
  • Average per message decoding: 4893947 ns
  • Average Bytes per message: 853 bytes

Conclusion

This makes me seriously question protobuf for this role: its schemas are only known at compile time, and recompiling the server just to add a new schema is not an option.

YAML, being a human-readable format, was far more costly than I had supposed. I guess that is the price to pay for a text-based format.

I really must give kudos to the makers of BSON, the people behind MongoDB, a scalable, high-performance, open-source NoSQL database.

I'm trying to make something where a schema is determined by the union of the property contracts for the properties an entity has. There will be validation, but BSON seems to be by far the best storage and network transfer format: it's small, its structure and types are known at runtime, and it supports several more data types than JSON, such as dates, binary data, and 64-bit integers.