I (virtually) attended DTW this week and Michael Dell and others in their keynote segments mentioned that the new world involves both data at rest and data in motion. I was curious as to about this new concept of data in motion, so I spent some time looking into it.
With AWS Lambda, clients deposit data in object buckets and AWS automatically invokes some program, container, service, etc. to process that data and then the service goes away until the next data is deposited. Kafka is AWS Lambda on steroids.
Kafka is a completely open source (GitHub) system that’s run using a cluster of servers and provides a “message processing” system. A minimum Kafka cluster is 3 servers (containers, VMs or bare metal).
How does Kafka work
In Apache Kafka, you have producers, server/brokers and consumers. With Kafka, data comes in as events, with a key, values (essentially a bit stream, could be anything) and time-stamps which are created by producer clients and are automatically stored by Kafka servers or brokers and appended to topics (a sort of folder) in an ordered sequence. Topic events are then processed by consumer clients.
Topics are partitioned (sharded) using keys, and can be optionally replicated across a defined number of Kafka brokers within a cluster. Kafka clusters can span data centers , regions, clouds etc. Replication is done for fault tolerance. Topic partitioning provides scale out, distributed performance for Kafka.
Events can be simple messages for real time analysis or larger files for offline analysis. But they are all essentially produced, stored and consumed in an ordered, log like fashion.
Topic partitions can be multi-producer and multi-consumer. That is there can be 0, 1 or many producers of events in a topic (partition) and topic partitions can have 0, 1 or more consumers.
In Kafka, events are saved for a specified period and are not automatically jetisoned/deleted. As such, events can be read multiple times by consumers.
Kafka can also offer a guarantee that events are only processed once. Kafka can also guarantee that consumers of topic partitioned events always read events in arrival order.
Consumers register to see events they are interested in. As mentioned earlier, there can be multiple consumers of the same events. Consumers can take the form of micro services/containers, programs, systems, etc. When an event is stored, consumer clients registered for that event, get notified to process the event.
In Kafka producers and consumers are fully decoupled. They have no need to know about one another and indeed, can exist in different servers, clusters, data centers, etc. Event producers don’t wait on consumers. Event consumers are notified when an event is available and can do whatever processing is required for that event.
Kafka has APIs for:
- Admin services API to provide monitoring and management of the Kafka cluster and services
- Producers API to publish and create events
- Consumers API to subscribe to read events
- Kafka Streams API to supply higher level stream processing for events, such as micro services, with stateless processing, stateful processing, and within stateful processing, providing events within a (time) window. Events can be processed from one or more topics and used to transform (process) these into other events to be written to one or more topics. It supports per event processing with (2) millisecond latency for highly tuned systems. Streams use a Java API that can be deployed in containers, VMs, bare metal, in the cloud etc. Stream processing is not performed on the Kafka cluster but must be performed elsewhere. Kafka streams can be used to create advanced and complex data pipelines.
- Kafka Connect API to supply the connections needed to get events from other outside, perhaps more traditional applications, environments, services into Kafka topics for processing and vice versa, output topic events to more traditional services. Connect services are available for many different applications, databases, systems, etc.. Connect can be used, for example, to provide a connection between an relational database and topics as well as connect topics to relational databases. You don’t code in Connect but rather provide declarative statements that define what data goes where.
Kafka is used in very many organizations (NY Times, LinkedIn, LINE, etc) to provide an almost, enterprise wide, all encompassing, processing bus where data comes in, is partitioned out to topics and then processed in real time or not. Kafka can be addictive. You start with a relatively small application and find uses for it throughout the company. Pretty soon, you are running your whole organization through Kafka.
Data in motion
So that’s an example of data in motion. Another way to think about Kafka and its data in motion is it’s represents the final step in the evolution of batch processing from mainframes of last century.
Batch processing of old, took a bunch of transactions, batched them together, and processed them one by one until the batch was done. With Kafka and similar systems, you essentially have a batch of one transaction and they provide all the framework and facilities needed to create, store and consume that single transaction (batch).
But in addition to this simplistic one transaction in, one process and one output. Kafka and other systems, provide a more general purpose system, with multiple transaction types (events) being created by multiple producers and being consumed by a multitude of processes, that can each produce one or more outputs which could be other events to start the process over again. This create event, process, create event, process, could go on ad infinitum.
And that’s what a data pipeline looks like. Event data comes in, it’s processed (filtered, aggregated, merged, etc.) and generates a different event which causes more processing, which creates other events, which causes other processing…..
And that’s data in motion.
All graphics and photos are from Apache Kafka website