Understanding Protocol Buffers (PB): A Data Analyst’s Approach

MrDataPsycho
Mar 8, 2020

First of all, Protocol Buffers (PB) have nothing to do with data analytics. But as I am very new to PB, I used my existing experience to explore it and was able to figure it out. I already like the concept of serializing structured data. To follow this post you do not need to be a data scientist or data analyst; you just need to know an RDBMS and the JSON way of structuring data, as I am going to explain PB through a SQL schema and a Python class.

To give a bit of context, I am currently working on a personal project to build an API, and I was choosing between REST, GraphQL and gRPC. I decided to go with the latest technology available, and GraphQL and gRPC are two such options, which brought me to PB, as it is the building block of gRPC-style APIs. Let’s jump directly into understanding PB. I am going to use some relational database concepts along with Python to explain it.

Let’s create a schema/data model in Python that we can use to store information about a particular individual. The information includes first name, last name, age, etc.

A Person Data model in Python

The class defines the attributes a person model should have, together with a type-check method that is called during initialization to verify that the input data is correct. Now I can initialize the class with some data and use it for further work. Let’s store a person using that model:

Initialization of an instance of the Person class

First I create the variables, then initialize the Person class to create a new instance that stores a person called Data Psycho. If we provide a wrong type, the class will not initialize, because the type-check function runs during initialization, as shown below.

Error in data type check for wrong data type
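As a minimal sketch of the Python model described above, assuming only three fields (first_name, last_name and age are hypothetical names, and the age value is made up for illustration), a class with a type check in its initializer could look roughly like this:

```python
class Person:
    """A simple person data model with runtime type checks."""

    # Expected type for each attribute (hypothetical field set).
    _FIELD_TYPES = {"first_name": str, "last_name": str, "age": int}

    @classmethod
    def _check_types(cls, **values):
        """Raise TypeError if any value does not match its declared type."""
        for name, value in values.items():
            expected = cls._FIELD_TYPES[name]
            # bool is a subclass of int, so reject booleans explicitly.
            if isinstance(value, bool) or not isinstance(value, expected):
                raise TypeError(
                    f"{name} must be {expected.__name__}, "
                    f"got {type(value).__name__}"
                )

    def __init__(self, first_name, last_name, age):
        # Validate before storing anything on the instance.
        self._check_types(first_name=first_name, last_name=last_name, age=age)
        self.first_name = first_name
        self.last_name = last_name
        self.age = age

    def __repr__(self):
        return f"Person({self.first_name!r}, {self.last_name!r}, {self.age!r})"


# A valid person is stored without complaint.
person = Person("Data", "Psycho", 30)
print(person)

# A wrong type for age is rejected during initialization.
try:
    Person("Data", "Psycho", "thirty")
except TypeError as err:
    print(f"Rejected: {err}")
```

The check lives in `__init__`, so an invalid object never exists, which is the same guarantee the SQL and PB versions give in their own way.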

Now let’s say we want to implement the same schema/data model in a relational database. To do that, we need to create a table with the correct data types and then insert data into the table. Because a typical SQL implementation enforces data type constraints by default, we won’t be able to store data of the wrong type.

A SQL data model
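To try the SQL version without leaving Python, here is a rough sketch using the built-in sqlite3 module. The table and column names are assumptions. One caveat: SQLite is used only because it ships with Python, and unlike most server databases it is loosely typed by default, so the sketch adds a `typeof` CHECK on age to emulate the strict typing described above.

```python
import sqlite3

# In-memory database, so nothing is written to disk.
conn = sqlite3.connect(":memory:")

conn.execute(
    """
    CREATE TABLE person (
        first_name TEXT    NOT NULL,
        last_name  TEXT    NOT NULL,
        -- Without this CHECK, SQLite would store 'thirty' in an INTEGER column.
        age        INTEGER NOT NULL CHECK (typeof(age) = 'integer')
    )
    """
)

# A row with the correct types is accepted.
conn.execute("INSERT INTO person VALUES (?, ?, ?)", ("Data", "Psycho", 30))

# A row with a non-numeric age violates the CHECK constraint.
try:
    conn.execute(
        "INSERT INTO person VALUES (?, ?, ?)", ("Data", "Psycho", "thirty")
    )
except sqlite3.IntegrityError as err:
    print(f"Rejected: {err}")

rows = conn.execute("SELECT first_name, last_name, age FROM person").fetchall()
print(rows)
```

On a database such as PostgreSQL or MySQL in strict mode, the column type alone would be enough to reject the second insert.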

Now it’s time to implement the same data model in Protocol Buffers, as follows:

A PB data model

A PB data model follows a few conventions: the file starts with the syntax keyword, and the data models are defined as messages. As we can see, each attribute must have a type and a tag, which is a unique number assigned to the field. Tags are a big topic on their own; the Google guide explains what a tag is, how to use it and why. This is the simplest form of a data model, and the file is called person.proto. A word of warning: I keep using the terms schema and data model to explain PB, and they might sound inappropriate to experienced API developers, but I do so to keep things simple.
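To make these conventions concrete, here is a small sketch that writes a person.proto file from Python. The proto3 syntax version and the three fields (first_name, last_name, age) with tags 1 to 3 are my assumptions for illustration; `write_person_proto` is a hypothetical helper, not part of any library.

```python
from pathlib import Path

# Assumed content of person.proto: syntax line first, then one message.
PERSON_PROTO = '''\
syntax = "proto3";

// Each field has a type, a name and a unique numeric tag.
message Person {
  string first_name = 1;
  string last_name = 2;
  int32 age = 3;
}
'''


def write_person_proto(src_dir="proto_src"):
    """Write person.proto into src_dir and return the file path."""
    target_dir = Path(src_dir)
    target_dir.mkdir(parents=True, exist_ok=True)
    target = target_dir / "person.proto"
    target.write_text(PERSON_PROTO, encoding="utf-8")
    return target
```

Calling `write_person_proto()` creates proto_src/person.proto, ready to be passed to the PB compiler.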

After installing the PB compiler (version 3.11.4 at the time of writing), we can convert the .proto file into a Python-readable format with the following shell command.

protoc -I=proto_src --python_out=data_models proto_src/person.proto

Here the -I option tells the compiler which directory to search for .proto files (proto_src in this project), --python_out points to the directory for the converted file, which is data_models, and the actual person.proto file is of course inside the proto_src directory. Running the above command in a terminal will generate a new file called person_pb2.py in the data_models folder. The generated .py file is very long; you can have a look at it in the GitLab repository.
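If you would rather trigger the compiler from Python than from the shell, the same command can be sketched with subprocess. Both function names below are hypothetical helpers; the flags are exactly the ones from the command above, and the sketch assumes protoc is installed and on the PATH.

```python
import shutil
import subprocess
from pathlib import Path


def build_protoc_command(src_dir, out_dir, proto_file):
    """Return the protoc argument list for generating Python code."""
    return [
        "protoc",
        f"-I={src_dir}",             # where to search for .proto files
        f"--python_out={out_dir}",   # where to write the *_pb2.py file
        f"{src_dir}/{proto_file}",   # the file to compile
    ]


def compile_proto(src_dir="proto_src", out_dir="data_models",
                  proto_file="person.proto"):
    """Run protoc if it is installed; return True when the compiler ran."""
    if shutil.which("protoc") is None:
        print("protoc not found on PATH; install the PB compiler first")
        return False
    # protoc expects the output directory to exist already.
    Path(out_dir).mkdir(parents=True, exist_ok=True)
    # check=True raises CalledProcessError if protoc reports an error.
    subprocess.run(build_protoc_command(src_dir, out_dir, proto_file), check=True)
    return True
```

Keeping the argument list in its own function makes it easy to print or log the exact command before running it.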

After generating the Python-readable PB file, we can import the Person class from it and use it as follows:

Creating an instance of the Person class using PB serialization

As usual, if we provide a wrong data type, it will not be accepted:

Raise exception for wrong data type
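The two steps above can be sketched as follows. This assumes the protobuf Python package is installed, that the generated module is importable as data_models.person_pb2, and that the message has the hypothetical fields first_name, last_name and age. The three function names are my own; `SerializeToString` and `ParseFromString` are the standard methods on generated message classes.

```python
import importlib


def load_person_class(module_name="data_models.person_pb2"):
    """Import the generated module and return its Person message class."""
    module = importlib.import_module(module_name)
    return module.Person


def round_trip(person_cls, first_name, last_name, age):
    """Build a message, serialize it to bytes and parse it back."""
    person = person_cls(first_name=first_name, last_name=last_name, age=age)
    payload = person.SerializeToString()  # compact binary representation
    restored = person_cls()
    restored.ParseFromString(payload)
    return restored


def rejects_wrong_type(person_cls):
    """Return True if assigning a string to the int32 age field fails."""
    person = person_cls()
    try:
        person.age = "thirty"
    except TypeError:
        return True
    return False


# Usage, once person_pb2.py has been generated:
#   Person = load_person_class()
#   print(round_trip(Person, "Data", "Psycho", 30))
#   print(rejects_wrong_type(Person))
```

Unlike the hand-written Python class, the type check here comes for free from the .proto definition, and the serialized bytes can be read by any other language with PB support.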

So the journey ends here for today. That was my very first experiment with PB and a brief introduction to it. You can find the code for the whole experiment in my GitLab. There are a bunch of other files too, as I am constantly learning and adding more there. The generated data models are in the data_models folder, proto_src has the PB file, helper.sh has the shell command to generate the PB file for Python, and finally instances.py has all the test experiments.

References:

  1. https://developers.google.com/protocol-buffers/docs/pythontutorial
  2. Complete Intro. to Protocol Buffer by Stéphane Maarek
