Install and Configure Kafka & HBase


As part of this section, we will see how to set up Kafka and HBase and explore some key concepts of both services.

  • Kafka Overview
  • Setup Parcels and Add Kafka Service
  • Validate Kafka
  • HBase Overview
  • Configure HBase
  • Validate HBase

Cluster Topology

We are setting up the cluster on 7+1 nodes: we start with 7 nodes and will add one more node later.

  • Gateway(s) and Management Services
    • bigdataserver-1 – Hue Server
  • Masters
    • bigdataserver-2
      • Zookeeper
      • Active/Standby Namenode
      • HBase Master
    • bigdataserver-3
      • Zookeeper
      • Active/Standby Namenode
      • Active/Standby Resource Manager
      • Impala State Store
      • Oozie Server
      • HBase Master
    • bigdataserver-4
      • Zookeeper
      • Active/Standby Resource Manager
      • Job History Server
      • Spark History Server
      • Hive Server and Hive Metastore
      • Impala Catalog
      • HBase Master
  • Slaves or Worker Nodes
    • bigdataserver-5 – Datanode, Node Manager, Impala Daemon, Region Server, Kafka Broker
    • bigdataserver-6 – Datanode, Node Manager, Impala Daemon, Region Server, Kafka Broker
    • bigdataserver-7 – Datanode, Node Manager, Impala Daemon, Region Server, Kafka Broker

Kafka is typically set up as a cluster external to the big data cluster. In our environment, however, we will set up a 3-broker Kafka cluster on the worker nodes.

Kafka Overview

Let us look into the overview of Apache Kafka before getting into details about configuring and validating it.

  • Apache Kafka is an open-source stream-processing software platform written in Scala and Java. It was initially developed as an internal product at LinkedIn, then open-sourced and adopted by the Apache Software Foundation. It is named after the author Franz Kafka.
  • Salient Features:
    • Highly Scalable (partitioning)
    • Fault Tolerant (replication factor)
    • Low Latency
    • High Throughput

Kafka ecosystem

The heart of Kafka is the topic: a distributed, fault-tolerant log file. Over time, however, Kafka has evolved into an ecosystem of tools.

  • Kafka Connect
  • Kafka Streams and Kafka SQL
  • Producer and Consumer APIs
  • 3rd party plugins to integrate with Flume, logstash, Spark Streaming, Storm, Flink etc.

Kafka Use cases

As microservices have evolved, Kafka has become a popular way to integrate data between different microservices – asynchronously, in real time, as well as in batch.

  • Activity Tracking: Kafka was originally developed for tracking user activity on LinkedIn
  • Messaging: Kafka is also used for messaging, where applications need to send notifications (such as emails) to users.
  • Metrics and logging: Applications publish metrics on a regular basis to a Kafka topic, and those metrics can be consumed by systems for monitoring and alerting.
  • Commit log: database changes can be published to Kafka and applications can easily monitor this stream to receive live updates as they happen. This changelog stream can also be used for replicating database updates to a remote system.
  • Stream processing: Kafka can be integrated with stream frameworks such as Spark Streaming, Flink, Storm, etc. Users can write applications that operate on Kafka messages, performing tasks such as counting metrics and transforming data.


Topic: A topic represents a group of files and directories. When we create a topic, Kafka creates directories named after the topic and the partition index. These directories contain the files that actually store the messages being produced.

Publisher or Producer: Publishers or producers are processes that publish data (push messages) to the log files associated with a Kafka topic.

Subscriber or Consumer: Subscribers or consumers are processes that read messages from the log files associated with a Kafka topic.

Kafka Pub-Sub Model

Partition: Kafka topics are divided into a number of partitions, each of which contains messages in an immutable sequence. This allows multiple consumers to read from a topic in parallel.

Leader: When we create a Kafka topic with partitions and a replication factor, each partition gets a leader. Messages are first written to the partition on the broker designated as the leader and then copied to the rest of the followers.

Replication Factor: Each partition can be replicated into multiple copies using the replication factor, which provides fault tolerance. With a replication factor of n on an m-node cluster (where n <= m), the topic can survive the failure of up to n-1 brokers at any point in time.
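That survivability claim can be checked with a small Python sketch. The round-robin placement below is a simplification of Kafka's real replica-assignment algorithm, and the host names are the ones from our topology:

```python
brokers = ["bigdataserver-5", "bigdataserver-6", "bigdataserver-7"]

def assign_replicas(partition: int, replication_factor: int) -> list:
    """Place replicas on consecutive brokers, round-robin style
    (a simplification of Kafka's actual placement algorithm)."""
    start = partition % len(brokers)
    return [brokers[(start + i) % len(brokers)]
            for i in range(replication_factor)]

# replication factor 2 on a 3-broker cluster
replicas = {p: assign_replicas(p, 2) for p in range(3)}

# any single broker can fail and every partition keeps a live replica
for failed in brokers:
    for p, reps in replicas.items():
        assert any(b != failed for b in reps), f"partition {p} lost"
print("every partition survives any single broker failure")
```

With replication factor 3 on the same cluster, any two brokers could fail; with factor 1, losing the broker that holds a partition loses that partition's data.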

Broker: A Kafka cluster consists of one or more servers (Kafka brokers) running Kafka. Producers query the metadata of each topic and connect to the leader of each partition to produce messages into the Kafka topic. Consumers do the same while consuming messages from the topic.

Offset: The records in the partitions are each assigned a sequential id number called the offset that uniquely identifies each record within the partition.
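The partition and offset concepts above can be sketched in a few lines of Python. This is a simulation of the semantics, not real Kafka code; the CRC32 hash below stands in for the murmur2 hash that Kafka's default partitioner actually uses:

```python
import zlib

NUM_PARTITIONS = 3
# each partition is an append-only log; a record's index is its offset
topic = [[] for _ in range(NUM_PARTITIONS)]

def partition_for(key: bytes) -> int:
    # stand-in for Kafka's default partitioner (which uses murmur2)
    return zlib.crc32(key) % NUM_PARTITIONS

def produce(key: bytes, value: str) -> tuple:
    p = partition_for(key)
    topic[p].append(value)
    return p, len(topic[p]) - 1   # (partition, offset)

placements = [produce(f"user-{i % 2}".encode(), f"event-{i}")
              for i in range(6)]
print(placements)
```

Every message with the same key lands in the same partition, and within each partition the offsets increase sequentially from 0, which is what lets a consumer resume exactly where it left off.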

Learning Process

We will follow the same standard process to learn while adding any software-based service.

  • Downloading and Installing – The required binaries are downloaded as part of the initial setup; we will just add the Kafka service to Cloudera Manager.
  • Configuration – we need to understand the architecture and plan the configuration.
    • Architecture – Producer and Consumer Architecture
      • Producers connect to one or more brokers and push messages to topics via leader.
      • Consumers pull messages from a topic by polling it at regular intervals. Each time a consumer reads messages, it needs to keep track of the offset (this can be done in multiple ways).
    • Components
      • Zookeeper Ensemble – Already set up as part of the Zookeeper configuration.
      • Kafka brokers
    • Configuration Files
      • /etc/kafka/conf/
  • Service Logs
    • /var/log/kafka
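The consumer side of this architecture (poll a batch, process it, track the offset) can be sketched against an in-memory log. This is a simulation; a real consumer would use a Kafka client library:

```python
log = [f"message-{i}" for i in range(10)]   # stands in for one partition
committed_offset = 0                        # the consumer's saved position

def poll(offset: int, max_records: int = 3) -> list:
    """Return the next batch after the given offset, like a client's poll()."""
    return log[offset:offset + max_records]

while True:
    batch = poll(committed_offset)
    if not batch:
        break                               # caught up with the log
    for record in batch:
        pass                                # process each record here
    committed_offset += len(batch)          # track (commit) the offset

print("consumed up to offset", committed_offset)   # 10
```

Committing the offset only after processing gives at-least-once behavior: if the consumer crashes mid-batch, it re-reads that batch on restart.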

Setup Parcels and Add Kafka Service

Let us see how we can add Kafka to the cluster using Cloudera Manager. We will also see how to review components, properties as well as log files with respect to Kafka.

  • Go to Hosts -> Parcels
  • If you want to add any other version of Kafka, configure the corresponding parcel repository URL first.
  • Download the Kafka Parcel, Distribute and Activate.
  • Add Service “Kafka” and select the hosts for brokers
    • Hosts – bigdataserver-5,6 and 7
    • Gateway – bigdataserver-1
  • Review the properties and Click on Finish
  • The Kafka service should start, but it may fail due to insufficient broker heap.
  • Go to Kafka -> Configuration and search for the Broker Heap Size property
    • Change the configuration property broker_max_heap_size from its default of 50 MB to 1 GB.
  • Go to Kafka -> Instances -> Restart Kafka Brokers
  • Open the Kafka service in Cloudera Manager and review the summary.
  • We can validate Kafka components by running ps -fu kafka
  • We can review properties files either by using Cloudera Manager or CLI
  • We can troubleshoot any issues using logs in Cloudera Manager or log files under /var/log/kafka.

Validate Kafka

Let us go ahead and validate Kafka by creating a topic and then publishing messages to a topic as well as consuming messages from a topic.

  • Create Kafka Topic

kafka-topics --create --zookeeper bigdataserver-2:2181,bigdataserver-3:2181,bigdataserver-4:2181 --replication-factor 2 --partitions 3 --topic testing

  • Produce the message to the topic

kafka-console-producer --broker-list bigdataserver-5:9092,bigdataserver-6:9092 --topic testing

  • Consume the message from the topic

kafka-console-consumer --bootstrap-server bigdataserver-5:9092,bigdataserver-6:9092 --topic testing --from-beginning

Setting up HBase

HBase is a distributed data store that comes as part of the Hadoop ecosystem with most distributions, such as Cloudera, Hortonworks, etc. HBase follows a master-slave architecture, with the HBase Master process as master and Region Servers as slaves.

Key Components

Here are key components of HBase.

  • HBase Master
  • HBase Region Servers
    • WAL (Write-Ahead Log) – stores new data that hasn't yet been persisted to permanent storage
    • BlockCache – stores frequently read data in memory
    • MemStore – stores new data that has not yet been written to disk
    • HFiles – store the rows as sorted KeyValues on disk
  • HBase uses HDFS as its file system to persist the data. We can perform real-time operations on HBase tables, such as insert, update, delete, etc.
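The write path implied by these components can be modeled in a few lines of Python: a put is first appended to the WAL, then buffered in the MemStore, which is flushed to an immutable, sorted HFile once it crosses a threshold. This is a toy model (it omits the BlockCache, regions, and compactions), not HBase internals:

```python
wal = []            # write-ahead log: appended to before anything else
memstore = {}       # in-memory buffer of recent writes
hfiles = []         # immutable, sorted files on disk (here: sorted lists)
FLUSH_THRESHOLD = 3

def put(row, value):
    wal.append((row, value))    # 1. log to WAL for crash recovery
    memstore[row] = value       # 2. buffer in the MemStore
    if len(memstore) >= FLUSH_THRESHOLD:
        flush()

def flush():
    # 3. persist the MemStore as a sorted, immutable HFile
    hfiles.append(sorted(memstore.items()))
    memstore.clear()

def get(row):
    # read path: MemStore first, then HFiles from newest to oldest
    if row in memstore:
        return memstore[row]
    for hfile in reversed(hfiles):
        for r, v in hfile:
            if r == row:
                return v
    return None

for i in range(5):
    put(f"row-{i}", f"value-{i}")

print(len(hfiles), len(memstore))   # 1 flushed HFile, 2 rows in MemStore
assert get("row-0") == "value-0"    # found in the flushed HFile
assert get("row-4") == "value-4"    # still in the MemStore
```

Because every write hits the WAL before the MemStore, a crashed Region Server can replay its WAL to rebuild the MemStore contents that were never flushed.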

Add HBase Service

  • Go to the Cloudera Manager Dashboard
  • Make sure the ZooKeeper service is already installed
  • Click on Add Service in the cluster drop-down
  • Choose HBase from the list of services
  • We will be using bigdataserver-2, 3 and 4 as HBase Masters and bigdataserver-5, 6 and 7 as Region Servers.
  • Review properties and complete the setup process.
  • We can review important properties as well as service log files to troubleshoot any issues with respect to HBase service.

Validate HBase

Let us validate the HBase installation by running a few commands using the HBase shell.

  • On the gateway node of the HBase cluster, run hbase shell
  • The help command provides a list of commands in different categories
  • Namespace – group of tables (similar to schema or database)
    • create – create_namespace 'training'
    • list – list_namespace 'training'
    • list tables – list_namespace_tables 'training'
  • Table – a group of rows which have keys and values
    • While creating the table we need to specify table name and at least one column family
    • A column family contains cells; a cell is simply a name-value pair
    • e.g.: create 'training:hbasedemo', 'cf1'
    • list – list 'training:.*'
    • describe – describe 'training:hbasedemo'
    • truncate – truncate 'training:hbasedemo'
    • Dropping a table is a 2-step process – disable, then drop
    • disable – disable 'training:hbasedemo'
    • drop – drop 'training:hbasedemo'
  • Inserting/updating data – data is written one cell at a time using the put command
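In the shell, a cell is written with put 'training:hbasedemo', 'row1', 'cf1:col1', 'value1' and read back with get or scan. The sparse, versioned cell model behind those commands can be sketched in Python (a simulation of the semantics, not an HBase client):

```python
import time
from collections import defaultdict

# row key -> "family:qualifier" -> list of (timestamp, value), newest first;
# HBase keeps multiple timestamped versions of every cell
table = defaultdict(lambda: defaultdict(list))

def put(row, column, value):
    table[row][column].insert(0, (time.time(), value))

def get(row):
    """Latest version of every cell in one row, like the shell's get."""
    return {col: versions[0][1] for col, versions in table[row].items()}

def scan():
    """All rows in sorted row-key order, like the shell's scan."""
    return [(row, get(row)) for row in sorted(table)]

put("row1", "cf1:name", "alice")
put("row1", "cf1:name", "alicia")   # a newer version of the same cell
put("row2", "cf1:name", "bob")

print(get("row1"))                  # {'cf1:name': 'alicia'}
print([row for row, _ in scan()])   # ['row1', 'row2']
```

Note that an update is just another put: the new version shadows the old one, and reads return the latest version by default.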

