Export Oracle table data to JSON file format using PySpark

pyspark

#1

import cx_Oracle
import pandas as pd

conn = cx_Oracle.connect(user='usr', password='pwd', dsn='host:1521/servicename')
sql = "SELECT * FROM emp"
curs = conn.cursor()
curs.execute(sql)

for row in curs:
    print(row)

output:
id name sal
1 abc 1000
2 def 2000
3 ghi 3000

I would like to convert the above output into JSON file format, including the table column names.
How can I do that? Please help with some sample code.

Thanks,
Sri



#2

Here are the high-level steps (a minimal sketch follows below):

  • Use the Spark JDBC API (analogous to cx_Oracle) to load the data into a DataFrame (say, empDF)
  • empDF.write.json("PATH_IN_HDFS")

This saves the data in JSON Lines format at the specified path, with the column names as the JSON keys.
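For reference, here is a minimal sketch of those two steps. The JDBC URL, credentials, and driver class below are illustrative assumptions; adjust them for your environment, and make sure the Oracle JDBC driver jar (e.g. ojdbc8.jar) is on the Spark classpath (for example via spark-submit --jars).

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("OracleToJson").getOrCreate()

# Read the Oracle table into a DataFrame via Spark's JDBC source.
# URL, user, password, and table name are placeholders.
empDF = spark.read \
    .format("jdbc") \
    .option("url", "jdbc:oracle:thin:@//host:1521/servicename") \
    .option("dbtable", "emp") \
    .option("user", "usr") \
    .option("password", "pwd") \
    .option("driver", "oracle.jdbc.OracleDriver") \
    .load()

# Writes one JSON object per row, e.g. {"ID":1,"NAME":"abc","SAL":1000}
empDF.write.json("PATH_IN_HDFS")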


#3

That's great, thank you so much.
Small query: in order to handle millions of Oracle records, which library is better in PySpark?


#4

It is a bit tricky. This is the typical life cycle (see the sketch after the reference link below):

  • Use partitionColumn along with numPartitions (and lowerBound/upperBound) for the initial load, so Spark reads the table in parallel
  • Pass a query in parentheses as the dbtable option to the Spark JDBC API for delta loads

Here is the reference material - https://spark.apache.org/docs/latest/sql-programming-guide.html#jdbc-to-other-databases
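As a rough sketch of both options, reusing the spark session from the earlier example. The partition column (emp_id), bounds, partition count, and the delta predicate column (updated_at) are all assumptions for illustration; pick a numeric, evenly distributed column from your own table.

# Initial load: parallel reads split on a numeric column.
# emp_id, the bounds, and the partition count are illustrative assumptions.
empDF = spark.read \
    .format("jdbc") \
    .option("url", "jdbc:oracle:thin:@//host:1521/servicename") \
    .option("dbtable", "emp") \
    .option("user", "usr") \
    .option("password", "pwd") \
    .option("partitionColumn", "emp_id") \
    .option("lowerBound", "1") \
    .option("upperBound", "1000000") \
    .option("numPartitions", "10") \
    .load()

# Delta load: push a filtering query down to Oracle by passing it
# in parentheses, with an alias, as the dbtable option.
delta_sql = "(SELECT * FROM emp WHERE updated_at > DATE '2020-01-01') emp_delta"
deltaDF = spark.read \
    .format("jdbc") \
    .option("url", "jdbc:oracle:thin:@//host:1521/servicename") \
    .option("dbtable", delta_sql) \
    .option("user", "usr") \
    .option("password", "pwd") \
    .load()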