Spark, Kafka and Schema Registry - Part 1

shreeraman karikalan
4 min read · Dec 2, 2019


Introduction

One of the most tedious parts of processing streaming data with Kafka and Spark is deserializing the byte stream and turning each record into a structured DataFrame.

The Confluent platform includes a Schema Registry which, used together with the Avro data format, lets us bypass that tedious process. With this technique the data is transferred in Avro format while the schema is stored in the Schema Registry server. Both the Kafka producer and the consumer have access to the registry, so the consumer can retrieve the corresponding schema and deserialize the data it receives.

This post assumes that you have a fair knowledge of Docker and of writing Spark code in Scala. It covers the detailed steps to set up and use Spark, Kafka and Schema Registry to send and receive data.

The entire setup and workflow will be covered in three different posts:

  1. An introduction to the key concepts to know before we start the work.
  2. Setting up the environment.
  3. Writing the code to send and receive structured streaming data through Kafka and Schema Registry.

In this post we will cover the key concepts and terms required to understand the entire pipeline.

Key concepts

Data serialization and deserialization

Serialization is the process of converting data objects held in complex data structures into byte streams for storage, transfer and distribution.

Deserialization is the reverse process: converting the byte streams back into data objects so that the reconstructed objects are clones of the originals.
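As a minimal sketch of the round trip (using plain Java serialization from Scala purely for illustration; Avro, covered next, is the format we will actually use in this series):

```scala
import java.io._

// Any case class is java.io.Serializable by default
case class Student(id: Int, name: String)

// Serialization: object -> byte stream
val bytesOut = new ByteArrayOutputStream()
val oos = new ObjectOutputStream(bytesOut)
oos.writeObject(Student(1, "alice"))
oos.close()
val bytes: Array[Byte] = bytesOut.toByteArray

// Deserialization: byte stream -> a clone of the original object
val ois = new ObjectInputStream(new ByteArrayInputStream(bytes))
val clone = ois.readObject().asInstanceOf[Student]
ois.close()

assert(clone == Student(1, "alice"))
```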

AVRO

Avro is a row-based storage format that is widely used as a data serialization system.

Some highlights of the format (a short serialization sketch in Scala follows this list):

  • Row-based storage format.
  • Stores the data definition and the actual data separately: the data definition (schema) is stored as JSON, while the data is stored as a binary stream.
  • Uses sync markers while encoding the data, which provides two important advantages: detection of corrupt blocks of data and efficient splitting of files.
  • Supports schema evolution.
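As a rough sketch, the snippet below defines the student schema used later in this series (name student, namespace io.schemaregistry; the two fields are made up for illustration) and serializes and deserializes one record with the Apache Avro API:

```scala
import java.io.ByteArrayOutputStream

import org.apache.avro.Schema
import org.apache.avro.generic.{GenericData, GenericDatumReader, GenericDatumWriter, GenericRecord}
import org.apache.avro.io.{DecoderFactory, EncoderFactory}

// The schema (data definition) is plain JSON, kept separate from the data
val schemaJson =
  """{
    |  "type": "record",
    |  "name": "student",
    |  "namespace": "io.schemaregistry",
    |  "fields": [
    |    {"name": "id",   "type": "int"},
    |    {"name": "name", "type": "string"}
    |  ]
    |}""".stripMargin
val schema: Schema = new Schema.Parser().parse(schemaJson)

// Build a record and serialize it to a binary stream
val record = new GenericData.Record(schema)
record.put("id", 1)
record.put("name", "alice")

val out = new ByteArrayOutputStream()
val encoder = EncoderFactory.get().binaryEncoder(out, null)
new GenericDatumWriter[GenericRecord](schema).write(record, encoder)
encoder.flush()
val avroBytes: Array[Byte] = out.toByteArray

// Deserialize: the reader needs the schema to make sense of the bytes
val decoder = DecoderFactory.get().binaryDecoder(avroBytes, null)
val decoded = new GenericDatumReader[GenericRecord](schema).read(null, decoder)
println(decoded.get("name")) // alice
```

Notice that the reader must know the schema to decode the bytes; this is exactly the gap the Schema Registry fills when the data flows through Kafka.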

Schema registry

Schema Registry is a server that stores all your Avro schemas and provides a RESTful interface to register and retrieve them.

The registry runs outside the Kafka brokers, and producers and consumers talk to it alongside their communication with the brokers.
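As a small illustration of that RESTful interface, the sketch below lists the registered subjects, assuming the registry is running at Confluent's default address http://localhost:8081 (we will set this up in part 2):

```scala
import java.net.URI
import java.net.http.{HttpClient, HttpRequest, HttpResponse}

// GET /subjects returns a JSON array of all registered subject names
val client = HttpClient.newHttpClient()
val request = HttpRequest.newBuilder(URI.create("http://localhost:8081/subjects")).GET().build()
val response = client.send(request, HttpResponse.BodyHandlers.ofString())

println(response.body()) // e.g. ["SchemaRegistryTest-value"]
```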

Topic: the Kafka topic that contains the messages; each message is a key-value pair.

Schema: defines the structure of the Avro data.

Subject:

The scope within which schemas can evolve in the Schema Registry.

The subject name is determined by one of three naming strategies (a configuration sketch follows the examples below):

  • TopicNameStrategy
  • RecordNameStrategy
  • TopicRecordNameStrategy

Let's assume:

Topic name: SchemaRegistryTest

Schema name: student

Schema namespace: io.schemaregistry

TopicNameStrategy can be used when all the data in a particular topic conforms to the same schema. This strategy derives the subject name from the topic name.

The subject name here will be “SchemaRegistryTest-value”

RecordNameStrategy can be used when grouping by topic is not optimal. This strategy names the subject after the record (schema) name, so it is useful when a particular record type occurs across multiple topics.

The subject name here will be “io.schemaregistry.student”

TopicRecordNameStrategy is optimal when a particular Kafka topic contains multiple different record types. This strategy combines the topic name and the record name, allowing several record types per topic.

The subject name here will be “SchemaRegistryTest-io.schemaregistry.student”
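As a hedged configuration sketch, the strategy is chosen on the producer side through the Confluent serializer settings; the property names below come from Confluent's kafka-avro-serializer, and the broker and registry addresses are placeholders:

```scala
import java.util.Properties

val props = new Properties()
props.put("bootstrap.servers", "localhost:9092")          // placeholder broker address
props.put("schema.registry.url", "http://localhost:8081") // placeholder registry address
props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer")
props.put("value.serializer", "io.confluent.kafka.serializers.KafkaAvroSerializer")

// The default is TopicNameStrategy (subject "SchemaRegistryTest-value");
// switch to RecordNameStrategy for subject "io.schemaregistry.student",
// or TopicRecordNameStrategy for "SchemaRegistryTest-io.schemaregistry.student".
props.put("value.subject.name.strategy",
  "io.confluent.kafka.serializers.subject.RecordNameStrategy")
```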

For a more in-depth explanation, please use this link.

Compatibility

Schema Registry supports schema evolution through a set of compatibility checks.

Let's assume there are three schemas, in order of evolution: X_OLD, X, X_NEW.

Backward:

  • Consumers using the new schema can read data produced with the latest previous schema only.
  • i.e. consumers using X_NEW can read data produced with X_NEW and X, but not X_OLD.

Backward transitive:

  • Consumers using the new schema can read data produced with all previous schemas.
  • i.e. consumers using X_NEW can read data produced with X_NEW, X and X_OLD.

Forward:

  • Data produced with the new schema can be read by consumers using the latest previous schema only.
  • In our case, consumers using schema X can read data produced with X and X_NEW.

Forward transitive:

  • Data produced with the new schema can be read by consumers using any of the previous schemas.
  • i.e. consumers using X_OLD can read data produced with X_OLD, X and X_NEW.

Full:

  • Schemas are both backward and forward compatible with one previous or one future schema.
  • i.e. consumers using X can read data that was produced with X_OLD as well as data that will be produced with X_NEW.

Full transitive:

  • Schemas are compatible with all previous and future versions, in both directions (backward and forward).

None:

  • No compatibility checks are performed on the schemas.
  • Useful in certain cases where we are forced to make incompatible changes to the schema.

By default, Schema Registry applies Backward compatibility globally.
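As a minimal sketch of what a backward-compatibility check looks like, the Avro library can compare a reader and a writer schema directly; this mirrors in spirit the check the registry performs when a new version is registered. The two student schemas below are illustrative, with the new version adding a field that has a default value:

```scala
import org.apache.avro.{Schema, SchemaCompatibility}

val x = new Schema.Parser().parse(
  """{"type":"record","name":"student","namespace":"io.schemaregistry",
    | "fields":[{"name":"id","type":"int"}]}""".stripMargin)

// X_NEW adds a field with a default value, which keeps it backward compatible
val xNew = new Schema.Parser().parse(
  """{"type":"record","name":"student","namespace":"io.schemaregistry",
    | "fields":[{"name":"id","type":"int"},
    |           {"name":"name","type":"string","default":""}]}""".stripMargin)

// Backward: can a consumer using X_NEW (reader) read data written with X (writer)?
val result = SchemaCompatibility.checkReaderWriterCompatibility(xNew, x)
println(result.getType) // COMPATIBLE
```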

In the next post we will cover how to set up the environment needed to implement the pipeline.

The entire code and resources can be found in the GitHub repository here.

Please feel free to reach out with any questions, clarifications or suggestions regarding the post. You can reach me through my LinkedIn.
