Issue in writing JSON data to Avro file in PySpark code


#1

I'm having an issue writing to an Avro file in PySpark after reading data from a JSON file.
The import itself is failing for avro on the first line:

import avro.schema

Environment: Spark 2.3, PyCharm 2017.3, Python 3.5
Please help with some code here. Thanks a lot.

Srinivas




#2

Tagging @kmln

Is it multi-line JSON? (If so, see the sketch at the end of this post.)

Please share as attachments:

  • Test data
  • Code you have written

We can fix the issues in the code, but we will not be able to come up with the complete solution to your problem. Also, you need to check whether Python 3.5 is compatible with Spark 2.3.
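For reference, if the data does turn out to be multi-line JSON, Spark 2.2+ can read it with the multiLine reader option. A minimal sketch, with a hypothetical file path:

df = sqlContext.read.option("multiLine", "true").json("C:/inputs/employees_multiline.json")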


#3

Sorry for my coding skills, I have just started learning PySpark. The actual code is pasted below.
There is no issue with the JSON; it's a single-line JSON file. I'm able to generate a Parquet file, but I'm getting an issue with the Avro file.

##### Source code
import avro.schema
from avro.datafile import DataFileReader, DataFileWriter
from avro.io import DatumReader, DatumWriter
from pyspark import SparkContext
from pyspark.sql import SQLContext

sc = SparkContext(master='local', appName='Json to Avro')
sqlContext = SQLContext(sc)

# Read the single-line JSON input (raw strings avoid backslash-escape surprises on Windows paths)
emp = sqlContext.read.json(r"C:\inputs\employees.json")
emp.printSchema()

#emp.write.parquet(r"C:\outputs\empl1parquet")
emp.write.format('com.databricks.spark.avro').save(r'C:\outputs\empavro11')

###### Error from the PyCharm console
C:\Users\sv2001\PycharmProjects\Demo\venv\Scripts\python.exe C:/Users/sv2001/PycharmProjects/Demo/Json2Avro.py
Traceback (most recent call last):
File "C:/Users/sv2001/PycharmProjects/Demo/Json2Avro.py", line 1, in <module>
import avro.schema
File "C:\Users\sv2001\PycharmProjects\Demo\venv\lib\site-packages\avro\schema.py", line 340
except Exception, e:
^
SyntaxError: invalid syntax

Process finished with exit code 1


#4

@kmln will provide you the solution.

My 2 cents:
You need to use the Databricks Avro package. Also, the traceback is complaining about a syntax error triggered by the import on the first line of your script: the failure is inside the installed avro package itself, which uses Python 2 syntax that Python 3.5 rejects. See the illustration below.
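For context, "except Exception, e:" (the line flagged in avro/schema.py) is Python 2 only syntax. Python 3 (and Python 2.6+) uses "as" instead; a minimal runnable illustration:

# Python 2 wrote:   except Exception, e:      <- SyntaxError under Python 3
# Python 3 writes:  except Exception as e:
try:
    raise ValueError("demo")
except Exception as e:
    print(e)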


#5

Hello Srinivas,

I think there is some issue with the avro package you installed. It's throwing an error in the process.

I have tried out the code using the test JSON files available on the itversity labs, along with the Databricks package.

df = sqlContext.read.json("/public/retail_db_json/orders")
df.write.format("com.databricks.spark.avro").save("retail_db_orders.avro")

This works as expected. The only thing we need to do is pass the Databricks package when starting pyspark or when submitting your code through spark-submit.

pyspark --master yarn --packages com.databricks:spark-avro_2.10:2.0.1
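The spark-submit form is analogous (using your script name from the traceback as an example):

spark-submit --master yarn --packages com.databricks:spark-avro_2.10:2.0.1 Json2Avro.py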




#6

It's working fine after I updated the code with the two lines below:

import os
os.environ['PYSPARK_SUBMIT_ARGS'] = '--packages com.databricks:spark-avro_2.11:4.0.0 pyspark-shell'

In the PyCharm IDE, I had not set this previously. Now it's working.
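Putting it together, a minimal end-to-end sketch of the working script (paths reused from my earlier post; note that PYSPARK_SUBMIT_ARGS must be set before the SparkContext is created, so the Databricks package is on the classpath when the JVM starts):

import os
# Must be set before pyspark starts the JVM, or the package will not be loaded
os.environ['PYSPARK_SUBMIT_ARGS'] = '--packages com.databricks:spark-avro_2.11:4.0.0 pyspark-shell'

from pyspark import SparkContext
from pyspark.sql import SQLContext

sc = SparkContext(master='local', appName='Json to Avro')
sqlContext = SQLContext(sc)

# Read the single-line JSON input and write it back out as Avro
emp = sqlContext.read.json(r"C:\inputs\employees.json")
emp.write.format('com.databricks.spark.avro').save(r'C:\outputs\empavro11')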

Thanks a lot for your help!!!