Read and Write to Avro files

pyspark
avro

#1

I constantly refer to the page https://arun-teaches-u-tech.blogspot.sg/p/file-formats.html for various file formats and it’s quite helpful.

However, while practicing with pyspark, the import command "import com.databricks.spark.avro._" never works. It was mentioned in one of the older posts that pyspark should be invoked as below to deal with Avro files:

pyspark --packages com.databricks:spark-avro_2.10:2.0.1

Is this the only way to deal with Avro files? Even in the CCA175 test, if there is a question that requires reading from or writing to Avro files, should the pyspark session be initiated like the above?


#2

Hi Rajesh,
I would also like to know the answer to your last question.
However, answering your first one: you can try the commands below after launching with
pyspark --packages com.databricks:spark-avro_2.10:2.0.1

import avro.schema
To read an Avro file:
sqlContext.read.format("com.databricks.spark.avro").load("input path dir").show()
or
sqlContext.load("input path dir", "com.databricks.spark.avro")

To save in Avro file format:
dataframe.write.format("com.databricks.spark.avro").save("output path dir")
or
dataframe.save("output path dir", "com.databricks.spark.avro")
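Putting those together, here is a minimal end-to-end sketch, assuming a Spark 1.x pyspark shell launched with the spark-avro package as above. The HDFS paths are hypothetical placeholders, and note that import avro.schema (from the separate Python avro library) is not actually required for the spark-avro calls below.

# Assumes the shell was started with:
#   pyspark --packages com.databricks:spark-avro_2.10:2.0.1

# Read an Avro dataset into a DataFrame (hypothetical input path)
df = sqlContext.read.format("com.databricks.spark.avro").load("/user/cloudera/orders_avro")
df.printSchema()
df.show(5)

# Write the DataFrame back out in Avro format (hypothetical output path)
df.write.format("com.databricks.spark.avro").save("/user/cloudera/orders_avro_out")

In recent Spark versions the package can also be configured once through the spark.jars.packages property in spark-defaults.conf, so it does not have to be passed on every launch.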

Thanks,
Aparna


#3

Yes, after launching pyspark with the databricks package as above, it works fine with the commands you have given. But is that the way to do it in CCA175 also?

When I took the test, sqlContext.read.format("com.databricks.spark.avro") didn't work, probably because pyspark was not launched with that package.

I want to hear from someone who cleared CCA175 about how they dealt with Avro files.


#4

Can someone confirm how they dealt with Avro files while taking the CCA175 test? I am looking for someone who used Python.


#5

Can someone (especially anyone who has cleared CCA175) answer my query on dealing with Avro files in pyspark on a CDH cluster?


#6

Since our lab uses Hortonworks, Avro support is not available by default. Hence we import the Avro packages manually, whereas in Cloudera you don't need to import any packages.
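If you are not sure whether a given cluster already has the spark-avro classes on its classpath, one quick check from pyspark (a sketch; the input path is a hypothetical placeholder) is to attempt a read and inspect the error:

# If the data source is missing, the error names com.databricks.spark.avro;
# otherwise the read succeeds and the first row is shown.
try:
    sqlContext.read.format("com.databricks.spark.avro").load("/tmp/sample_avro").show(1)
except Exception as e:
    print(e)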


#7

Hi @raghu

Thanks for your comment. When I recently took the test and tried to read from / write to Avro files, it didn't work directly from pyspark. Sample commands I used:

Ex: df1 = sqlContext.read.format("com.databricks.spark.avro").load()
df2.write.format("com.databricks.spark.avro").save()

I got an error message saying that com.databricks.spark.avro is not available.

I struggled a bit to launch pyspark with the databricks package, but it was stuck forever trying to resolve dependencies, and I had to kill the terminal multiple times.

So, is there any better way to deal with Avro files in the actual CDH cluster while taking the test?
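One workaround that might avoid the dependency-resolution hang (a sketch, assuming a spark-avro jar already exists somewhere on the gateway machine, which you would have to verify): pass the jar directly with --jars instead of --packages, so no Maven/Ivy resolution over the network is attempted.

# The jar path below is a hypothetical placeholder; locate the real one first,
# e.g. with: find / -name "spark-avro*.jar" 2>/dev/null
pyspark --jars /path/to/spark-avro_2.10-2.0.1.jar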


#8