
Apache Kafka Interview Questions and Answers

July 12, 2020
Posted by: Laraonline2020
Category: Technology

1. What is Kafka?

Wikipedia defines Kafka as “an open-source message broker project developed by the Apache Software Foundation written in Scala and is a distributed publish-subscribe messaging system.”

2. List the various components in Kafka.

The four major components of Kafka are:

Topic – a stream of messages belonging to the same type
Producer – that can publish messages to a topic
Brokers – a set of servers where the published messages are stored
Consumer – that subscribes to various topics and pulls data from the brokers.

3. Explain the role of the offset.

Messages contained in the partitions are assigned a unique ID number that is called the offset. The role of the offset is to uniquely identify every message within the partition.
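As a rough illustration of the idea, the sketch below models a partition as an append-only list in which the offset is simply the position of a record. The `Partition` class and its method names are hypothetical and are not part of any Kafka client.

```python
class Partition:
    """A toy append-only log for a single partition (illustrative only)."""

    def __init__(self):
        self._records = []

    def append(self, message):
        # The offset is the record's position in the log, so it is
        # unique within this partition and grows with every append.
        offset = len(self._records)
        self._records.append(message)
        return offset

    def read(self, offset):
        # Look a record up by its offset.
        return self._records[offset]


partition = Partition()
first = partition.append("order-created")
second = partition.append("order-paid")
print(first, second)            # 0 1
print(partition.read(second))   # order-paid
```

Offsets identify a message only within one partition; two different partitions can both have a record at offset 0.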

4. What is a Consumer Group?

Consumer groups are a core Kafka concept. Every Kafka consumer group consists of one or more consumers that jointly consume a set of subscribed topics. Within a group, each partition is read by only one consumer, so the members of the group share the work.

5. What is the role of the ZooKeeper?

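Kafka uses ZooKeeper to coordinate the cluster: it tracks which brokers are alive, elects the controller, and stores topic metadata. Older consumer clients also used ZooKeeper to store the offsets of messages consumed for a specific topic and partition by a specific Consumer Group; newer clients store those offsets in an internal Kafka topic instead.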

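6. Is it possible to use Kafka without ZooKeeper?

In a ZooKeeper-based deployment, no. The brokers depend on ZooKeeper for cluster coordination, and if ZooKeeper is down for some reason, the cluster cannot reliably service client requests. Newer Kafka releases can instead run in KRaft mode, which removes the ZooKeeper dependency.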

7. Explain the concept of Leader and Follower?

Every partition in Kafka has one server that plays the role of the Leader, and zero or more servers that act as Followers. The Leader handles all read and write requests for the partition, while the Followers passively replicate the Leader. If the Leader fails, one of the Followers takes over as the new Leader. Because the Leaders of different partitions are spread across the brokers, this arrangement also balances the load across the servers.

8. What roles do Replicas and the ISR play?

Replicas are the nodes that replicate the log for a particular partition, irrespective of whether they play the role of the Leader. ISR stands for In-Sync Replicas: the subset of replicas that are currently caught up with the Leader.
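To make the relationship between replicas, the ISR, and leader failover concrete, here is a minimal sketch. The function `elect_leader` and the broker IDs are hypothetical, and the logic is a simplification of what a real cluster does, not Kafka's actual election code.

```python
def elect_leader(replicas, in_sync, failed):
    """Return the first replica that is in sync and still alive, or None."""
    for broker in replicas:
        if broker in in_sync and broker not in failed:
            return broker
    return None


replicas = [1, 2, 3]   # every broker that holds a copy of the partition
in_sync = {1, 3}       # broker 2 has fallen behind, so it is not in the ISR

print(elect_leader(replicas, in_sync, failed=set()))   # 1
print(elect_leader(replicas, in_sync, failed={1}))     # 3
```

Broker 2 is still a replica, but because it is not in the ISR it is skipped when a new leader is needed.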

9. Why are Replications critical in Kafka?

Replication ensures that published messages are not lost and can be consumed in the event of any machine error, program error, or frequent software upgrades.

10. If a Replica stays out of the ISR for a long time, what does it signify?

It means that the Follower is unable to fetch data as fast as the Leader accumulates it.

11. What is the process for starting a Kafka server?

Since Kafka uses ZooKeeper, it is essential to initialize the ZooKeeper server, and then fire up the Kafka server.

To start the ZooKeeper server: > bin/zookeeper-server-start.sh config/zookeeper.properties
Next, to start the Kafka server: > bin/kafka-server-start.sh config/server.properties

12. How do you define a Partitioning Key?

Within the Producer, the role of a Partitioning Key is to indicate the destination partition of the message. By default, a hashing-based Partitioner is used to determine the partition ID given the key. Alternatively, users can plug in a customized Partitioner.
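The hashing idea can be sketched in a few lines. This is only an illustration: `choose_partition` is a hypothetical helper, and `zlib.crc32` is used here as a stand-in hash, whereas Kafka's Java client uses a different hash function (murmur2), so the partition numbers will not match a real cluster.

```python
import zlib


def choose_partition(key: bytes, num_partitions: int) -> int:
    # Stand-in hash: the same key always hashes to the same value,
    # so it always lands on the same partition.
    return zlib.crc32(key) % num_partitions


p1 = choose_partition(b"user-42", 6)
p2 = choose_partition(b"user-42", 6)
print(p1 == p2)        # True: same key, same partition
print(0 <= p1 < 6)     # True: always a valid partition ID
```

The practical consequence is that all messages with the same key are stored in one partition, which preserves their relative order.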

13. In the Producer, when does QueueFullException occur?

QueueFullException typically occurs when the Producer attempts to send messages at a pace that the Broker cannot handle. Since the Producer doesn’t block, users will need to add enough brokers to collaboratively handle the increased load.

14. Explain the role of the Kafka Producer API.

The role of Kafka’s Producer API is to wrap the two producers: kafka.producer.SyncProducer and kafka.producer.async.AsyncProducer. The goal is to expose all the producer functionality through a single API to the client.

15. What is the difference between Kafka and Flume?

Even though both are used for real-time processing, Kafka is scalable and ensures message durability.

16. What do you know about Partition in Kafka?

Every Kafka broker hosts a number of partitions. Each partition on a broker is either the leader or a follower replica for a partition of some topic.

17. Why is Kafka technology significant to use?

There are some advantages of Kafka, which makes it significant to use:

High-throughput

Kafka does not need large hardware to handle high-velocity, high-volume data. It can support a throughput of thousands of messages per second.

Low Latency

Kafka can handle these messages with very low latency, in the range of milliseconds, as demanded by most new use cases.

Fault-Tolerant

Kafka is resistant to node/machine failure within a cluster.

Durability

Kafka persists messages to disk and supports message replication, so messages survive the failure of individual brokers. This is one of the reasons behind its durability.

Scalability

Kafka can be scaled out on the fly by adding nodes, without incurring any downtime.

18. What are the main APIs of Kafka?

Apache Kafka has 4 main APIs:

Producer API
Consumer API
Streams API
Connector API

19. What are consumers or users?

A Kafka Consumer subscribes to one or more topics and reads and processes messages from them. Consumers label themselves with a consumer group name, and each record published to a topic is delivered to one consumer instance within each subscribing consumer group. Consumer instances can run in separate processes or on separate machines.
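One way to picture how a group shares the work is to spread a topic's partitions over the group's members. The sketch below uses a simple round-robin rule; `assign_partitions` and the consumer names are hypothetical, and real clients support several assignment strategies that are not reproduced here.

```python
def assign_partitions(partitions, consumers):
    """Spread partitions over the consumers of one group, round-robin."""
    assignment = {consumer: [] for consumer in consumers}
    for index, partition in enumerate(partitions):
        owner = consumers[index % len(consumers)]
        assignment[owner].append(partition)
    return assignment


print(assign_partitions([0, 1, 2, 3], ["consumer-a", "consumer-b"]))
# {'consumer-a': [0, 2], 'consumer-b': [1, 3]}
```

Each partition has exactly one owner inside the group, which is why a record is delivered to only one consumer instance per group.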

20. Explain the concept of Leader and Follower.

In every partition of Kafka, one server acts as the Leader, and zero or more servers play the role of Followers.

21. What ensures load balancing of the server in Kafka?

The Leader handles all read and write requests for the partition, while the Followers passively replicate the Leader. If the Leader fails, one of the Followers takes over the role of the Leader. Since different partitions can have their Leaders on different brokers, this process balances the load across the servers.

22. What roles do Replicas and the ISR play?

Replicas are the list of nodes that replicate the log for a particular partition, irrespective of whether they play the role of the Leader.
ISR refers to In-Sync Replicas: the set of replicas that are synced with the Leader.

23. Why are Replications critical in Kafka?

Because of Replication, we can be sure that published messages are not lost and can be consumed in the event of any machine error, program error, or frequent software upgrades.

24. If a Replica stays out of the ISR for a long time, what does it signify?

Simply put, it implies that the Follower cannot fetch data as fast as the Leader accumulates it.

25. What is the process for starting a Kafka server?

Because Kafka uses ZooKeeper, the ZooKeeper server must be initialized first. So, the process for starting a Kafka server is:
In order to start the ZooKeeper server: > bin/zookeeper-server-start.sh config/zookeeper.properties
Next, to start the Kafka server: > bin/kafka-server-start.sh config/server.properties

26. In the Producer, when does QueueFullException occur?

QueueFullException typically occurs whenever the Kafka Producer attempts to send messages at a pace that the Broker cannot handle at that time. Since the Producer doesn’t block, users will need to add enough brokers to collaboratively handle the increased load.

27. Explain the role of the Kafka Producer API.

The Producer API permits an application to publish a stream of records to one or more Kafka topics.

28. What is the main difference between Kafka and Flume?

The main differences between Kafka and Flume are:

Types of tool

Apache Kafka – Kafka is a general-purpose tool for multiple producers and consumers.
Apache Flume – Flume is considered a special-purpose tool for specific applications.

Replication feature

Apache Kafka – Kafka can replicate the events.
Apache Flume – Flume does not replicate the events.

29. Is Apache Kafka a distributed streaming platform? If yes, what can you do with it?

Undoubtedly, Kafka is a streaming platform. It can help to:

Push records easily
Store a lot of records without storage problems
Process the records as they come in

30. What can you do with Kafka?

It can be used in several ways, such as:
Building real-time streaming data pipelines that transmit data between two systems.
Building a real-time streaming platform that reacts to the data.

31. What is the purpose of retention period in Kafka cluster?

The Kafka cluster retains all published records for the configured retention period, whether or not they have been consumed. Records older than the retention period are discarded, which frees up space.
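The retention rule can be sketched as a simple filter on record age. This is a simplified model: `apply_retention` and the `(timestamp, value)` record shape are assumptions made for illustration, and a real broker removes whole log segments rather than individual records.

```python
def apply_retention(records, now, retention_seconds):
    """Keep only records younger than the retention period.

    Each record is a (timestamp, value) tuple. Whether a record has
    been consumed plays no part in the decision.
    """
    return [(ts, value) for ts, value in records
            if now - ts <= retention_seconds]


records = [(0, "old"), (50, "recent"), (90, "new")]
print(apply_retention(records, now=100, retention_seconds=60))
# [(50, 'recent'), (90, 'new')]
```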

32. What is the maximum size of a message that can be received by Kafka?

By default, the maximum size of a message that Kafka can receive is approximately 1,000,000 bytes. The limit is configurable through the broker setting message.max.bytes.

33. What are the traditional methods of message transfer?

There are two traditional methods of message transfer:

Queuing: A pool of consumers reads from the server, and each message goes to one of them.

Publish-Subscribe: Messages are broadcast to all consumers.
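The difference between the two models is easiest to see side by side. The sketch below is a toy model with hypothetical function names; it only shows who receives which message, not how any real broker implements delivery.

```python
from itertools import cycle


def deliver_queue(messages, consumers):
    """Queuing: each message goes to exactly one consumer."""
    received = {c: [] for c in consumers}
    for message, consumer in zip(messages, cycle(consumers)):
        received[consumer].append(message)
    return received


def deliver_pubsub(messages, consumers):
    """Publish-subscribe: every consumer receives every message."""
    return {c: list(messages) for c in consumers}


print(deliver_queue(["m1", "m2", "m3"], ["a", "b"]))
# {'a': ['m1', 'm3'], 'b': ['m2']}
print(deliver_pubsub(["m1", "m2", "m3"], ["a", "b"]))
# {'a': ['m1', 'm2', 'm3'], 'b': ['m1', 'm2', 'm3']}
```

Kafka's consumer groups combine both: consumers inside one group behave like a queue, while separate groups each receive every message, as in publish-subscribe.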

34. What does ISR stand for in the Kafka environment?

ISR stands for In-Sync Replicas. These are the set of replicas that are synced with the leader.

35. What is Geo-Replication in Kafka?

Kafka MirrorMaker offers geo-replication for our cluster. With MirrorMaker, messages are replicated across multiple data centers or cloud regions. It can be used in active/passive scenarios for backup and recovery, or in active/active scenarios to place data closer to our users or support data locality requirements.

36. Explain Multi-tenancy?

We can easily deploy Kafka as a multi-tenant solution. Multi-tenancy is enabled by configuring which topics can produce or consume data. There is also operations support for quotas.

37. What is the role of Consumer API?

The Consumer API permits an application to subscribe to one or more topics and to process the stream of records produced to them.

38. Explain the role of Streams API?

The Streams API permits an application to act as a stream processor: it consumes an input stream from one or more topics, produces an output stream to one or more output topics, and so effectively transforms the input streams into output streams.

39. What is the role of Connector API?

The Connector API permits building and running reusable producers or consumers that connect Kafka topics to existing applications or data systems.

40. Explain Producer?

The main role of Producers is to publish data to the topics of their choice. The producer is also responsible for choosing which record to assign to which partition within the topic.

41. Compare: RabbitMQ vs Apache Kafka

One of the alternatives to Apache Kafka is RabbitMQ. So, let’s compare both:
i. Features
Apache Kafka – Kafka is distributed, durable, and highly available; the data is shared as well as replicated.
RabbitMQ – RabbitMQ is a traditional message broker in which these features are not central to the design.
ii. Performance rate
Apache Kafka – To the tune of 100,000 messages/second.
RabbitMQ – Around 20,000 messages/second. Both figures are rough and depend heavily on hardware and configuration.

42. Compare: Traditional queuing systems vs Apache Kafka

Let’s compare Traditional queuing systems vs Apache Kafka feature-wise:

Messages Retaining

Traditional queuing systems – Messages are typically deleted from the queue as soon as processing completes.
Apache Kafka – Messages persist even after being processed; they are not removed as consumers receive them.

Logic-based processing

Traditional queuing systems – They don’t permit processing logic based on similar messages or events.
Apache Kafka – Kafka permits processing logic based on similar messages or events.

43. Why should we use Apache Kafka Cluster?

To overcome the challenges of collecting a large volume of data and analyzing the collected data, we need a messaging system. This is where Apache Kafka comes in. Its benefits are:

It is possible to track web activities just by storing/sending events for real-time processes.
We can alert on and report operational metrics.
We can transform data into a standard format.
It allows continuous processing of streaming data to the topics.
Because of this wide use, it competes with popular alternatives such as ActiveMQ, RabbitMQ, AWS, etc.

44. Explain the term “Log Anatomy”.

We can view each partition as a log. A data source writes messages to the log, and at any time one or more consumers can read from the log at the offsets they select. In other words, while the data source is writing the log, consumers can be reading it at different offsets.
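The following toy model may help picture consumers reading one log at different offsets. The `log` list, the `positions` dictionary, and the `poll` helper are all hypothetical names for illustration, not the API of a real Kafka consumer.

```python
log = ["m0", "m1", "m2", "m3", "m4"]   # one partition's log

# Each consumer tracks its own position; reading removes nothing.
positions = {"fast-consumer": 4, "slow-consumer": 1}


def poll(consumer, max_records=2):
    """Return the next records for this consumer and advance its position."""
    start = positions[consumer]
    batch = log[start:start + max_records]
    positions[consumer] = start + len(batch)
    return batch


print(poll("slow-consumer"))   # ['m1', 'm2']
print(poll("fast-consumer"))   # ['m4']
print(len(log))                # 5
```

Both consumers read the same log independently, and the log keeps all five records after they have been read.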

45. What is Data Log in Kafka?

As we know, messages are retained for a considerable amount of time in Kafka, and consumers have the flexibility to read them at their convenience. However, if Kafka is configured to keep messages for 24 hours and a consumer is down for longer than 24 hours, the consumer may lose those messages. If the downtime is shorter, say 60 minutes, the consumer can still read the messages from its last known offset. Kafka does not keep state on what consumers are reading from a topic.

46. Explain how to Tune Kafka for Optimal Performance.

Tuning Apache Kafka means tuning its several components:

Tuning Kafka Producers
Tuning Kafka Brokers
Tuning Kafka Consumers

47. State Disadvantages of Apache Kafka.

Limitations of Kafka are:

No Complete Set of Monitoring Tools
Issues with Message Tweaking
No support for wildcard topic selection
Lack of Pace

48. List all Apache Kafka Operations.

Apache Kafka Operations are:

Addition and Deletion of Kafka Topics
Modification of Kafka Topics
Graceful Shutdown
Mirroring Data between Kafka Clusters
Finding the position of the Consumer
Expanding Your Kafka Cluster
Automatic Migration of Data
Retiring Servers
Data centers

49. Explain Apache Kafka Use Cases.

Apache Kafka has many use cases, such as:

Kafka Metrics

Kafka can be used for operational monitoring data. This involves aggregating statistics from distributed applications to produce centralized feeds of operational data.

Kafka Log Aggregation

Kafka can gather logs from multiple services across an organization.

Stream Processing

Kafka’s strong durability is very useful for stream processing.

50. Name some of the most notable users of Kafka.

Some organizations that use Kafka in real-time applications are:

Netflix
Mozilla
Oracle

51. Features of Kafka Streams.

Some of the best features of Kafka Streams are:

Kafka Streams is highly scalable and fault-tolerant.
It deploys to containers, VMs, bare metal, and cloud.
It is equally viable for small, medium, and large use cases.
It is fully integrated with Kafka security.
Applications are written as standard Java applications.
It offers exactly-once processing semantics.
There is no need for a separate processing cluster.

52. What do you mean by Stream Processing in Kafka?

Kafka stream processing means processing data continuously, in real time, concurrently, and in a record-by-record fashion.
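Record-by-record processing can be illustrated with a plain Python generator. This is only an analogy: Kafka Streams itself is a Java library, and `uppercase_stream` and the sample records below are hypothetical.

```python
def uppercase_stream(records):
    """Transform each record as it arrives instead of waiting for a batch."""
    for record in records:
        yield record.upper()


incoming = iter(["click", "view", "purchase"])   # stands in for an input topic
for result in uppercase_stream(incoming):
    print(result)
# CLICK
# VIEW
# PURCHASE
```

Each output is produced as soon as its input record is available, which is the essential contrast with batch processing.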

53. What are the types of System tools?

There are three types of System tools:

Kafka Migration Tool

It helps to migrate a broker from one version to another.

Mirror Maker

The Mirror Maker tool helps to mirror one Kafka cluster to another.

Consumer Offset Checker

For the specified set of Topics as well as Consumer Group, it shows Topic, Partitions, Owner.

54. What are the Replication Tools and their types?

Replication tools are available for stronger durability and higher availability. Their types are:

Create Topic Tool
List Topic Tool
Add Partition Tool

55. What is the importance of Java in Apache Kafka?

Java meets the need for the high processing rates that come standard with Kafka. Moreover, Java offers good community support for Kafka consumer clients. So, it is a sound choice to implement Kafka in Java.

56. State one best feature of Kafka.

The best feature of Kafka is its “Variety of Use Cases”.
It means Kafka can manage the variety of use cases that are common for a Data Lake, for example log aggregation, web activity tracking, and so on.

57. Explain the term “Topic Replication Factor”.

The topic replication factor is the number of copies of each partition that Kafka keeps on different brokers. It is very important to factor in topic replication while designing a Kafka system: if a broker goes down, the replicas of its topics on other brokers can take over.
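A small sketch can show what a replication factor means for placement. The `place_replicas` function and the broker names are hypothetical, and the round-robin rule is an assumption chosen for simplicity rather than Kafka's actual placement algorithm.

```python
def place_replicas(num_partitions, brokers, replication_factor):
    """Assign each partition to `replication_factor` distinct brokers."""
    if replication_factor > len(brokers):
        raise ValueError("replication factor cannot exceed the number of brokers")
    placement = {}
    for partition in range(num_partitions):
        # Start at a different broker for each partition and wrap around.
        placement[partition] = [
            brokers[(partition + i) % len(brokers)]
            for i in range(replication_factor)
        ]
    return placement


print(place_replicas(3, ["b1", "b2", "b3"], 2))
# {0: ['b1', 'b2'], 1: ['b2', 'b3'], 2: ['b3', 'b1']}
```

With a replication factor of 2, losing any single broker still leaves one copy of every partition on another broker.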

58. Explain some Kafka Streams real-time Use Cases.

So, the use cases are:

The New York Times

This company uses it to store and distribute published content, in real time, to the various applications and systems that make it available to readers. It uses both Apache Kafka and Kafka Streams.

Zalando

Zalando, the leading online fashion retailer in Europe, uses Kafka as an ESB (Enterprise Service Bus).

LINE

The LINE application uses Apache Kafka as a central data hub for its services to communicate with one another.

59. What are Guarantees provided by Kafka?

They are:

Messages sent by a producer to a particular topic partition are appended in the order in which they are sent.
A consumer instance sees records in the order in which they are stored in the log.
For a topic with replication factor N, we can tolerate up to N-1 server failures without losing any records committed to the log.

60. How to start a Kafka server?

Given that Kafka uses ZooKeeper, we must start the ZooKeeper server first. One can use the convenience script packaged with Kafka to get a crude but effective single-node ZooKeeper instance:

bin/zookeeper-server-start.sh config/zookeeper.properties

Now the Kafka server can start:

bin/kafka-server-start.sh config/server.properties

61. Explain the Kafka architecture?

Kafka is a distributed system: a cluster that holds multiple brokers.
The topics within the system hold multiple partitions.

Every broker within the system holds multiple partitions. Based on this, producers and consumers can exchange messages at the same time, and the overall execution happens seamlessly.

62. Why is replication required in Kafka?

Replication of messages in Kafka ensures that any published message is not lost and can be consumed in case of machine error, program error, or, more commonly, software upgrades.

63. Explain how you can get exactly-once messaging from Kafka during data production?

To get exactly-once messaging from Kafka, you have to do two things: avoid duplicates during data consumption and avoid duplicates during data production. Here are two ways to get exactly-once semantics during data production:

Use a single writer per partition, and every time you get a network error, check the last message in that partition to see if your last write succeeded.
Include a primary key (UUID or something) in the message and de-duplicate on the consumer.
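The second approach, de-duplicating on the consumer, can be sketched as follows. The message shape (a dictionary with `id` and `payload` fields) and both helper functions are assumptions made for illustration; they are not part of Kafka.

```python
import uuid


def make_message(payload):
    # Hypothetical message shape: a unique ID plus the payload.
    return {"id": str(uuid.uuid4()), "payload": payload}


def consume_deduplicated(messages):
    """Process each message ID once, skipping redelivered copies."""
    seen = set()
    results = []
    for message in messages:
        if message["id"] in seen:
            continue  # duplicate caused by a producer retry, skip it
        seen.add(message["id"])
        results.append(message["payload"])
    return results


original = make_message("charge $10")
# The first message was retried by the producer, so it arrives twice.
delivered = [original, original, make_message("charge $5")]
print(consume_deduplicated(delivered))
# ['charge $10', 'charge $5']
```

In a long-running consumer the set of seen IDs would need to be bounded or persisted, which this sketch leaves out.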

64. What is the maximum size of a message that the Kafka server can receive?

By default, the maximum size of a message that the Kafka server can receive is approximately 1,000,000 bytes.

65. What is Geo-Replication in Kafka?

Kafka MirrorMaker provides geo-replication support for your clusters. With MirrorMaker, messages are replicated across multiple data centers or cloud regions. You can use this in active/passive scenarios for backup and recovery, or in active/active scenarios to place data closer to your users or support data locality requirements.
