PySpark x Kafka

I'm trying to read a Kafka topic in PySpark as a streaming source. This is the code I wrote:

df = (
    spark.readStream
    .format("kafka")
    .option("kafka.bootstrap.servers", "kafka:9092")
    .option("subscribe", "topic1")
    .option("startingOffsets", "latest")
    .load()
)
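
For context, once load() works my plan is to decode the binary key/value columns and dump them to the console for debugging. A rough sketch of what I have in mind (the CAST to STRING assumes the topic carries UTF-8 text, which is a guess on my part):

# Kafka rows expose key/value as binary; decode them to strings
decoded = df.selectExpr("CAST(key AS STRING)", "CAST(value AS STRING)")

# Print each micro-batch to the console while debugging
query = (
    decoded.writeStream
    .format("console")
    .outputMode("append")
    .start()
)
query.awaitTermination()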

However, when I run it I get the following error:

pyspark.sql.utils.AnalysisException: Failed to find data source: kafka. Please deploy the application as per the deployment section of "Structured Streaming + Kafka Integration Guide".
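
From what I understand, the guide the error points to says the Kafka source ships as a separate artifact that has to be on the classpath, either via spark-submit --packages or via spark.jars.packages when building the session. A minimal sketch of the latter, assuming Spark 3.3.0 built against Scala 2.12 (both version numbers are guesses for my setup and would need to match the actual installation):

from pyspark.sql import SparkSession

# Assumption: Spark 3.3.0 / Scala 2.12 -- adjust both version
# numbers to whatever the cluster is actually running.
spark = (
    SparkSession.builder
    .appName("kafka-stream")
    .config(
        "spark.jars.packages",
        "org.apache.spark:spark-sql-kafka-0-10_2.12:3.3.0",
    )
    .getOrCreate()
)

With spark-submit, I believe the equivalent is passing the same Maven coordinate to --packages.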

Does anyone have any tips?

