Saturday, April 24, 2021

Message Framing

In this paragraph, we’ll assume that we’re dealing with the special case of an array of X, where X is a serializable type (either with MP or Protobuf). Furthermore, we’re interested in the case where the length of X[] isn’t known beforehand, which usually happens when we’re generating logs. The only way we can know that we’ve deserialized all elements is when we reach the end of the stream.


Character Count Message Framing:

In order to introduce concurrency, we’ll have to frame our data, i.e., introduce a delimiter between each X’s binary representation, specifying the length each instance of X will take on the stream. In this manner we can quickly read each X’s buffer and then queue it for further parallel processing. The approach followed will differ between Messagepack and Protobuf.


MesssagePack:

MessagePack has some similarities with the logic of a JSON message, so I’ll exemplify in JSON how the original array of X will look like before and after framing.

Before:


        "src": "Images/Sun.png",
        "hOffset": 250,
        "vOffset": 200,
},

        "src": "Images/Earth.png",
        "hOffset": 100,
        "vOffset": 100,
}


Note that this is actually not a single JSON msg, but a concatenation of individual JSON blocks (it’s missing the outer brackets). It’s not obvious by looking at the JSON representation, but it’s necessary to express X[] in this manner since we don’t know in advance how many elements the array has, and MessagePack needs to prefix each array or map with its number of elements. 

Let’s assume the number of bytes taken by the serialized representation of X[0] is 11 and X[1] is 9 (I’m getting these numbers out of thin air). Then we’ll introduce framing by wrapping each X instance thus:

{
        “length”: 11,
        “body”: 
        { 
                "src": "Images/Sun.png",
                "hOffset": 250,
                "vOffset": 200,
        }
},
{
        “length”: 9,
        “body”: 
        { 
                "src": "Images/Earth.png",
                "hOffset": 100,
                "vOffset": 100,
        }
}

And its binary representation would look like:

Array tag|2|int tag|11|Array tag|3|string tag|byte encoding of "Images/Sun.png"|int tag|250|int tag|200|Array tag|2|int tag|9|Array tag|3|string tag|byte encoding of "Images/Earth.png"|int tag|100|int tag|100

which is still a valid MessagePack encoding, insomuch as the original representation was, but now we can extract from the stream enough information to split it into blocks for parallel processing.



Protobuf:

Protobuf is a different protocol, offering us the option of writing something akin to “here’s another instance of an item belonging to the variable-length array property tag#z of our object; it will occupy the next X bytes of the message”, letting us properly frame our object. If the proto spec of X was, for example:

message X 
{
  string src = 1;
  int32 hOffset = 2;
  int32 vOffset = 3;
}

then an array of X’s could be represented by 

message Y 
{
  repeated X myArray;
}

And the actual byte representation for the example X[] = [X1, X2, X3] with its respective serialized sizes [N1, N2, N3] would look like:

PR|Base128 encoding of N1|proto encoding of X1|PR|Base128 encoding of N2|proto encoding of X2|PR|Base128 encoding of N3|proto encoding of X3|

where PR = 2 + (1 << 3), with 2 being the code for a repeated element, and 1 the proto tag of that element.


No comments:

Post a Comment