Issue while loading Avro

Hi,

I am using the Avro package --packages org.apache.spark:spark-avro_2.11:2.3.0, but it is showing me an error.
I checked the Scala version and the Spark version; both look okay to me.

Can you please help me find out the issue?

Hi @Nisha_B_Kamath,

Use this version of spark-avro, spark-avro_2.12:2.4.4, while launching Spark 2:

--packages org.apache.spark:spark-avro_2.12:2.4.4
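One thing worth double-checking is that the _2.11/_2.12 suffix matches the Scala build of your Spark. A quick way to check from a running PySpark shell (this pokes the internal _jvm gateway, so treat it as a rough sketch) is:

# ask the JVM which Scala version this Spark build uses; the --packages
# suffix (_2.11 vs _2.12) must match it, or the jar will fail to load
print(spark.sparkContext._jvm.scala.util.Properties.versionString())
# prints e.g. "version 2.11.12" -> use spark-avro_2.11

You can also read the Scala version from the banner printed by spark-submit --version.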

Hi @Shubham_Maurya1,
This is not working.

I gave the command

export SPARK_MAJOR_VERSION=2
pyspark --master yarn --conf spark.ui.port=22189 --packages org.apache.spark:spark-avro_2.12:2.4.4

and it launched pyspark, but then when I ran
import org.apache.spark.sql.avro
it said: ImportError: No module named org.apache.spark.sql.avro

Can anyone please help me with this issue?

@Nisha_B_Kamath,

The import statement works well with Scala. In PySpark I don't know how to validate spark-avro; there is also no documentation on this for PySpark, so it might not be available with PySpark.

If you want to perform read and write operations on Avro files, the command below is working fine:

pyspark2 --master yarn --conf spark.ui.port=0 --packages org.apache.spark:spark-avro_2.12:2.4.4
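As a minimal smoke test after launching with that package, you could try a tiny round trip (a sketch only: the path is just a placeholder, and the short format name 'avro' assumes Spark 2.4 with a Scala build matching the package suffix):

# tiny round trip to confirm the avro data source is on the classpath
df = spark.createDataFrame([(1, 'CLOSED'), (2, 'COMPLETE')], ['id', 'status'])
df.write.format('avro').mode('overwrite').save('/tmp/avro_smoke_test')
spark.read.format('avro').load('/tmp/avro_smoke_test').show()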

For more clarification, join our Slack workspace -

@Shubham_Maurya1

I tried using pyspark. I am able to launch it with the command you gave, but then when I type:

Using Python version 2.7.5 (default, Aug 7 2019 00:51:29)
SparkSession available as 'spark'.

import org.apache.spark.sql.avro
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
ImportError: No module named org.apache.spark.sql.avro

Please help

This import statement does not work with PySpark; it works only with Spark on Scala. See my attachment.

After launching pyspark with the Avro dependencies, what do you want to do?

@Shubham_Maurya1
I want to save my result set in Avro format.

Okay,

Use this command to launch pyspark:

pyspark2 --master yarn --conf spark.ui.port=0 --packages com.databricks:spark-avro_2.11:4.0.0

I am going to save the orders data into an Avro file; here is the code for reference:

# read the raw orders CSV and assign column names
ordersCSV = spark.read.csv('/public/retail_db/orders'). \
  toDF('order_id', 'order_date', 'order_customer_id', 'order_status')

# cast the id columns from string to integer
from pyspark.sql.types import IntegerType
orders = ordersCSV. \
  withColumn('order_id', ordersCSV.order_id.cast(IntegerType())). \
  withColumn('order_customer_id', ordersCSV.order_customer_id.cast(IntegerType()))

# save the result in Avro format (this format name comes from the
# com.databricks:spark-avro package used in the launch command above)
orders.write. \
  format('com.databricks.spark.avro'). \
  save('/user/shubham/orders_avro')

Make sure the location where you want to save the data is a location in your HDFS that you can write to.
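To verify the write, you can read the files back with the same format string, for example:

# read the avro files back to confirm the write
ordersAvro = spark.read. \
  format('com.databricks.spark.avro'). \
  load('/user/shubham/orders_avro')
ordersAvro.printSchema()
ordersAvro.show(5)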

Please try to understand for yourself why the previous dependency was not working (this is the case only with pyspark):

https://spark.apache.org/docs/latest/sql-data-sources-avro.html#compatibility-with-databricks-spark-avro
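In short, per that page, the two packages register different format names, so the write call depends on which package you launched pyspark with:

# with com.databricks:spark-avro (the package used above), use the fully
# qualified name:
orders.write.format('com.databricks.spark.avro').save('/user/shubham/orders_avro')

# with org.apache.spark:spark-avro on Spark 2.4+, the built-in short name
# works instead:
orders.write.format('avro').save('/user/shubham/orders_avro')

Spark 2.4 can also map the Databricks format name onto the built-in module via spark.sql.legacy.replaceDatabricksSparkAvro.enabled, which is what the compatibility page describes.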

@Shubham_Maurya1 Thank you. This one works.