Setting up a single broker Kafka cluster in 15 minutes
Apache Kafka is a high-throughput distributed messaging system, and by high-throughput we mean the capability to power more than 200 billion messages per day. In fact, LinkedIn's deployment of Apache Kafka surpassed 1.1 trillion messages per day last year. It provides an elegant and scalable solution to the age-old problem of data movement, and hundreds of companies are now adopting Kafka to manage their real-time data. The buzz around this firecracker Apache project is here to stay. In this blog, I will help you set up your own little single-node, single-broker Kafka cluster that you can play with.
On your mark…
First things first! Here are the system requirements for creating a Kafka cluster:
- Java Runtime Environment
- Apache Zookeeper (tracks the status of Kafka brokers, topics, and consumer offsets)
- Apache Kafka (of course!)
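Before going further, a quick sanity check that a JRE is available on your PATH (the exact version string will vary from machine to machine):
java -version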
Move on to the Linux terminal and execute the following to set up Zookeeper on your machine:
wget http://apache.claz.org/zookeeper/zookeeper-3.4.6/zookeeper-3.4.6.tar.gz
tar -zxvf zookeeper-3.4.6.tar.gz
mv zookeeper-3.4.6/conf/zoo_sample.cfg zookeeper-3.4.6/conf/zoo.cfg
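The sample config ships with sensible defaults for a single-node setup; the entries that matter here look roughly like this (taken from the stock zoo_sample.cfg, so your copy may differ):
tickTime=2000
dataDir=/tmp/zookeeper
clientPort=2181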
Once the renamed config file is in place, we can go ahead and download Kafka on your machine:
wget http://redrockdigimark.com/apachemirror/kafka/0.10.0.0/kafka_2.11-0.10.0.0.tgz
tar -xzf kafka_2.11-0.10.0.0.tgz
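All the bin/ commands in the rest of this post assume you are inside the extracted Kafka directory:
cd kafka_2.11-0.10.0.0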
Get set…
Now that you have installed Kafka successfully, it is time to start the services. Kafka depends on Zookeeper, so you always need to start Zookeeper first. Once both services are running, Zookeeper listens on port 2181 and Kafka listens on port 9092.
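Here is one way to start both services, assuming the Zookeeper and Kafka directories were extracted side by side and you are inside the Kafka directory (run each server in its own terminal, or background it):
../zookeeper-3.4.6/bin/zkServer.sh start
bin/kafka-server-start.sh config/server.properties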
A Kafka server also goes by a fancier name: broker. The file config/server.properties contains the bare-minimum default configuration needed to start a Kafka broker. Thus, to create a multi-broker Kafka cluster, you define multiple server.properties files, and each broker must have a unique broker id and port. The broker is responsible for storing all the messages, their status, and commit offsets.
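For instance, a second broker could use a copy of server.properties with overrides along these lines (the file name and paths below are just illustrative placeholders):
# config/server-1.properties, a hypothetical copy of config/server.properties
broker.id=1
listeners=PLAINTEXT://:9093
log.dirs=/tmp/kafka-logs-1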
Next, let's create a topic in Kafka which will serve as our feed name. Each data stream will be injected into and read from a Kafka topic.
bin/kafka-topics.sh --create --zookeeper localhost:2181 --replication-factor 1 --partitions 1 --topic test
As you can see, we define the replication factor and the number of partitions for this topic.
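If you want to confirm that the topic was created with those settings, you can describe it (an optional check):
bin/kafka-topics.sh --describe --zookeeper localhost:2181 --topic test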
See the diagram below to understand how topic partitions work:
Now that we have fired up a single broker Kafka cluster, it is time to see the data flow in action.
Go…
The data is injected into Kafka using Producers, and read from Kafka using Consumers.
Execute the following command in a new terminal and start writing messages.
bin/kafka-console-producer.sh --broker-list localhost:9092 --topic test
Hello Kafka!
In a different terminal, execute the command below to start consuming the messages that are being produced above.
bin/kafka-console-consumer.sh --zookeeper localhost:2181 --topic test --from-beginning
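If everything is wired up correctly, the consumer terminal should echo whatever you type into the producer, in this case:
Hello Kafka!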
The Kafka broker stores the messages that have been injected into the topic. Hence, even if the consumer is down or stopped for a while, you will see all the previously injected messages once you restart the consumer.
The big picture…
Now that you are aware of how data flows from a producer to a topic and then from the topic to a consumer, you can implement your own data ingestion models using Kafka. Companies like LinkedIn move most of their feeds through a complex structure of topics and consumer groups. The source and destination of the data might differ, but there are plenty of plugins available to move data through Kafka.
In my next blog, we will build a pipeline using Kafka Connect for moving data through Kafka.
Stay tuned!