23 Nov 2015

Bencoding

Recently, while writing an internal protocol, I found myself prospecting for data formats. I'd usually opt for JSON as a great all rounder, being reasonably concise, human-readable, and widely supported, but this time I could afford to dig around for alternatives. After all, JSON makes passable config files, but I'd rather use TOML (which I also wrote a Scheme module for) given the right environment. The same can be true with data, and one of the alternative formats I uncovered, Bencode (pronounced B-Encode), turns out to be pretty neat.

Bencoding is used to format structured data in torrent files. It uses ASCII characters as delimiters and digits, making it easy to debug, and being unaffected by endianness it's easy to support cross-platform. Unlike the more space efficient Protocol Buffers, Bencode does not require a schema definition to parse, a restriction that makes incremental updates painful, it includes all the relevant keys in the message itself. Bencode can safely include binary data too, and as you'll see, is incredibly easy to serialize and parse.

Encoding rules

In Bencode, strings are almost identical to Netstrings, a delightfully simple format for bytestrings, where the value is prefixed by it's length in bytes:

12:hello world!  =>  "hello world!"

This means no character escaping is necessary (eg, NULL characters), and when parsing, the memory required is known in advance. In addition to bytestrings, Bencode provides integers, lists and dictionaries.

Integers use ASCII digits delimited by i and e:

i123e  =>  123

Lists are just multiple bencoded values surrounded by l and e:

l5:jelly4:cake7:custarde  =>  ["jelly", "cake", "custard"]

Similarly, dictionaries use d and e to wrap a sequence of keys and values:

d4:name5:cream5:pricei100ee  =>  {"name": "cream", "price": 100}

Dictionary keys must be sorted in lexicographical order, meaning there is only one valid bencoding for a given (possibly complex) value. So bencoded values can be compared or hashed directly without the need to decode.

Bencode module

The simplicity of this system means it's easy to produce correct bencoded messages with minimal effort, even using printf() and friends, a very handy property in embedded environments. It also means a new parser can be added quickly when required, which is exactly what I did for CHICKEN Scheme, implementing a bencode module in very little time:

(use bencode)

(bencode->string '(123 456))
;; => "li123ei456ee"

(string->bencode "12:hello world!")
;; => "hello world!"

Parsing speed

One advantage to Bencode's simple rules, apart from being easy to implement, is that it's easy to process. Parsing and serializing bencoded data is generally faster than the alternatives I've tried, not because of my great parser code, just as a result of a well considered encoding.

I made a quick test, decoding and then encoding some example data 100,000 times.

Example data

This is not chosen to reflect any common usage or benefit any particular parser, I'm just bashing on the keyboard here…

{
    "foo": 123,
    "bar": {
        "baz": [1, 2, 3, 4, "hello"]
    }
    "qux": "This is a much longer string containing\nnew\nlines\netc.",
    "wibble": [123123, "foo", "bar", "baz", 123124, [1 "a"], ["b" 4], ["cd"]]
}

Encoded data length

bencode           171 bytes
json              176 bytes
tagged netstring  205 bytes
msgpack           129 bytes

Parsing time

bencode             2.274s 
json               16.886s
tagged netstring    4.986s
msgpack             7.98s

Serialization time

bencode             2.513s
json                5.444s
tagged netstring    4.903s
msgpack             4.254s

Alternatives

So, that's Bencode, a neat little format I hadn't noticed before. Other JSON alternatives you might consider are: MessagePack, a more complex binary format with better size efficiency though not human-readable, and Tagged Netstrings a backwards compatible extension to Netstrings developed by Zed Shaw as part of the Mongrel2 webserver project. Tagged Netstrings support more data types than Bencode but may not work well for streams since type information is located at the end of each value, not the start. Of course, most of the time I'll still be using JSON, and there's nothing wrong with that!