Apache Spark 2.x - Data Frames and Pre-Defined Functions - Create Data Frames using JDBC

Spark also allows us to read data from relational databases over JDBC.

  • We need to make sure the JDBC jar file is registered using --packages or --jars along with --driver-class-path while launching pyspark
  • In PyCharm, we need to copy the relevant JDBC jar file to SPARK_HOME/jars
  • We can either use spark.read.format('jdbc') with options or spark.read.jdbc with the JDBC URL, table name and other properties as a dict to read data from remote relational databases (see the first sketch after this list)
  • We can pass either a table name or a query to read data over JDBC into a Data Frame
  • While reading data, we can define the number of partitions (using numPartitions) and the criteria to divide the data into partitions (partitionColumn, lowerBound, upperBound)
  • Partitioning can be done only on numeric fields
  • If lowerBound and upperBound are specified, Spark will generate strides depending upon the number of partitions and then process the entire data (see the second sketch after this list). Here is an example
    • We are trying to read order_items data with 4 as numPartitions
    • partitionColumn – order_item_order_id
    • lowerBound – 10000
    • upperBound – 20000
    • order_item_order_id is in the range of 1 to 68883
    • But as we define lowerBound as 10000 and upperBound as 20000, these will be the strides – 1 to 12499, 12500 to 14999, 15000 to 17499, and 17500 to the maximum of order_item_order_id
    • You can check the data in the output path mentioned
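Below is a minimal sketch of both read styles. The host, database name, user, password and driver class are placeholders assuming a MySQL source; substitute the details of your own database.

```python
from pyspark.sql import SparkSession

# Assumes the JDBC driver jar was registered at launch, e.g.
#   pyspark --packages mysql:mysql-connector-java:5.1.48 \
#           --driver-class-path /path/to/mysql-connector-java-5.1.48.jar
spark = SparkSession.builder.appName("JDBC Read Demo").getOrCreate()

jdbc_url = "jdbc:mysql://dbhost:3306/retail_db"  # placeholder URL

# Option 1: spark.read.format('jdbc') with options
order_items = spark.read \
    .format("jdbc") \
    .option("url", jdbc_url) \
    .option("dbtable", "order_items") \
    .option("user", "retail_user") \
    .option("password", "********") \
    .load()

# Option 2: spark.read.jdbc with the JDBC URL, table name and properties as a dict
conn_props = {"user": "retail_user",
              "password": "********",
              "driver": "com.mysql.jdbc.Driver"}
order_items = spark.read.jdbc(jdbc_url, "order_items", properties=conn_props)

# Instead of a table name we can pass a query, aliased as a table
query = "(SELECT order_item_order_id, order_item_subtotal FROM order_items) AS oi"
order_item_subtotals = spark.read.jdbc(jdbc_url, query, properties=conn_props)

order_items.printSchema()
order_items.show(10)
```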
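And here is a sketch of the partitioned read walked through above. The URL, credentials and output path are again placeholders; the partitioning arguments follow the example (numPartitions of 4, partitionColumn of order_item_order_id, lowerBound of 10000, upperBound of 20000).

```python
# Placeholder connection details, as in the previous sketch
jdbc_url = "jdbc:mysql://dbhost:3306/retail_db"
conn_props = {"user": "retail_user",
              "password": "********",
              "driver": "com.mysql.jdbc.Driver"}

order_items_part = spark.read.jdbc(
    jdbc_url,
    "order_items",
    column="order_item_order_id",   # partitionColumn - must be numeric
    lowerBound=10000,
    upperBound=20000,
    numPartitions=4,
    properties=conn_props
)

# Stride = (20000 - 10000) / 4 = 2500, so the 4 partitions fetch:
#   order_item_order_id <  12500   (also picks up everything below lowerBound)
#   12500 <= order_item_order_id < 15000
#   15000 <= order_item_order_id < 17500
#   order_item_order_id >= 17500   (also picks up everything above upperBound)
print(order_items_part.rdd.getNumPartitions())  # 4

# Write out so the per-partition files can be inspected (hypothetical path)
order_items_part.write.mode("overwrite").json("/user/training/order_items_jdbc")
```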


Hi @Ramesh1,

Are the --packages or --jars and --driver-class-path details provided during the exam?

@Pooja_Nayak

If it is a custom jar file, they will provide it to you. But you have to go through the instructions carefully.