Converting Avro file format to CSV format in PySpark


#1

Hi all,
I am trying to convert Avro files to CSV and save the result in HDFS from the PySpark shell, but I ran into some issues while doing so.
Could anyone help me with this?

Thanks in advance,
Aparna


#2

@AparnaSen Please share your code and the error message so we can understand the problem clearly.


#3

Thanks Balu.
Kindly find my code below.

pyspark --master yarn --conf spark.ui.port=12569 --num-executors 4 --packages com.databricks:spark-avro_2.10:2.0.1
ordersDF = sqlContext.read.format("com.databricks.spark.avro").load("/user/aparna149/orders")
order_itemsDF = sqlContext.read.format("com.databricks.spark.avro").load("/user/aparna149/order_items")
ordersDF.registerTempTable("orders_table")
order_itemsDF.registerTempTable("order_items_table")

sortedData = sqlContext.sql("select to_date(from_unixtime(o.order_date/1000)) order_date, o.order_status, count(distinct(o.order_id)) Total_Orders, round(sum(oi.order_item_subtotal),2) Total_Revenue from orders_table o join order_items_table oi on o.order_id = oi.order_item_order_id group by o.order_date, o.order_status order by order_date desc, o.order_status asc, Total_Orders asc, Total_Revenue desc")

I want to save the output (sortedData) in CSV format in HDFS.
How can I do that? Please help me with this.

Thanks in advance,
Aparna


#4

Hi Aparna,

Try this: add the spark-csv package to the package list when launching the shell:
pyspark --master yarn --conf spark.ui.port=34243 --packages com.databricks:spark-avro_2.10:2.0.1,com.databricks:spark-csv_2.10:1.1.0

Then, when saving:
sortedData.write.format("com.databricks.spark.csv").option("header", "true").save("sorteddata.csv")

Should work.
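
The save step above can be sketched as a small helper for the pyspark shell. This is only a minimal sketch: `write_csv` and its parameter names are hypothetical, and it assumes the spark-csv package was loaded with --packages as shown above. On Spark 2.0 and later, CSV is a built-in source, so passing fmt="csv" should work without the extra package.

```python
def write_csv(df, out_dir, header=True, sep=",", fmt="com.databricks.spark.csv"):
    """Write DataFrame `df` as CSV part files under the directory `out_dir`.

    Hypothetical helper: `fmt` names the data source to use and defaults to
    the spark-csv package; "header" and "delimiter" are options of that source.
    """
    writer = (df.write
                .format(fmt)
                .option("header", "true" if header else "false")
                .option("delimiter", sep))
    writer.save(out_dir)
    return out_dir


# In the pyspark shell, with sortedData defined as in the earlier post:
# write_csv(sortedData, "/user/aparna149/sortedData")
# write_csv(sortedData, "/user/aparna149/sortedData_pipe", sep="|")
```

Whatever path is passed to the helper is created as a directory holding part files, so it does not need a file extension.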


#5

Hi Sarang,
Thanks a lot. It worked. I wrote it as below:
sortedData.write.format("com.databricks.spark.csv").save("/user/aparna149/sortedData")

Do we need to write .csv in the save() method? When saving in other formats, we never give an extension explicitly.

One more thing: how do I save an RDD in sequence file format in PySpark?
I tried RDD.saveAsSequenceFile(), but got an error.

Can you help me with this?

Thanks in advance,
Aparna


#6

Yes Aparna, .csv is not required. The path should be a directory name, not a file name, since Spark writes its output as part files inside that directory. My bad.

Regarding saveAsSequenceFile(): in the actions table of the official Spark documentation, it is listed for Java and Scala only, not Python. You can refer to the link below:
https://spark.apache.org/docs/1.6.2/programming-guide.html#actions
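
For what it's worth, although that actions table marks saveAsSequenceFile as Java and Scala, PySpark's RDD class does expose a saveAsSequenceFile(path) method, and it expects an RDD of key-value pairs. Calling it on an RDD of plain strings is one possible cause of the error mentioned earlier, but that is only a guess, since the error message was not posted. A minimal sketch, in which the helper names, the input layout (a comma-separated line whose first field is a numeric id) and the paths are all assumptions:

```python
def to_key_value(line, sep=","):
    """Turn one delimited text record into a (key, value) pair.

    The first field becomes an int key; the whole line is kept as the value.
    """
    key = line.split(sep, 1)[0]
    return (int(key), line)


def save_as_sequence_file(sc, in_path, out_dir):
    """Read text records, key them by their first field, and write a SequenceFile."""
    pairs = sc.textFile(in_path).map(to_key_value)
    pairs.saveAsSequenceFile(out_dir)


# In the pyspark shell, where `sc` already exists (hypothetical paths):
# save_as_sequence_file(sc, "/user/aparna149/orders_text", "/user/aparna149/orders_seq")
# sc.sequenceFile("/user/aparna149/orders_seq").take(2)
```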


#7

Hi Sarang,
Thanks again. Yes, I checked the guide, and it is not listed for Python. How should this be handled in the exam if such questions are asked?

Thanks
Aparna


#8

@AparnaSen I don't think they will ask a question about something that is not supported. They don't even know which language you are going to use, Scala or Python.
I cleared HDPCD Spark (Scala), and the questions were more about built-in data sources like JSON and ORC, and about saving data with different delimiters and in a particular column order.
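
Saving with a chosen delimiter and column order can be sketched in PySpark by rendering each row as a text line and writing it with saveAsTextFile. This is an illustrative sketch, not the exam's expected answer: the helper names are hypothetical, the column names are taken from the query earlier in this thread, and it assumes a Spark version whose Row objects support lookup by column name.

```python
def to_delimited(row, columns, sep="\t"):
    """Render the chosen columns of one row, in the given order, as one delimited line."""
    return sep.join(str(row[c]) for c in columns)


def save_delimited(df, columns, out_dir, sep="\t"):
    """Write the selected columns of a DataFrame as delimited text files under `out_dir`."""
    # df.rdd yields Row objects; each one is turned into a single output line.
    lines = df.rdd.map(lambda row: to_delimited(row, columns, sep))
    lines.saveAsTextFile(out_dir)


# In the pyspark shell (hypothetical output path):
# save_delimited(sortedData,
#                ["order_date", "order_status", "Total_Revenue", "Total_Orders"],
#                "/user/aparna149/sortedData_tsv")
```

Note that str() turns a null value into the text None, so a dataset with nulls would need explicit handling before this step.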


#9

Hi Sarang,
Thanks again for your valuable information.
It’s so nice of you.
Congratulations on clearing your certification. I am preparing for the CCA exam in Python. Do you know of any resources for practice, apart from Arun's blog?

Thanks
Aparna


#10

Thanks. I used only Durga sir's playlist, plus at least one month of hands-on practice.


#11

Thanks a lot Sarang. :blush: