Apache Spark 2.x – Data processing - Getting Started - Spark Data Structures - RDDs and Data Frames

Spark Data Structures

We need to deal with two types of data structures in Spark - RDDs and Data Frames. We will look at both in detail as we proceed further.

  • RDD has been around for quite some time. It is the low-level data structure that Spark uses to distribute data between tasks while the data is being processed.
  • An RDD can be created using SparkContext APIs such as textFile.
  • An RDD is divided into partitions while the data is being processed. Each partition is processed by one task.
  • The number of RDD partitions is typically based on the HDFS block size, which is 128 MB by default. We can control the minimum number of partitions by passing additional arguments while invoking APIs such as textFile (see the first sketch after this list).
  • A Data Frame is nothing but an RDD with a structure on top of it. We can access the attributes of a Data Frame by name.
  • Typically we read data from file systems such as HDFS, S3, Azure Blob Storage, local file systems, etc.
  • Based on the file format, we need to use the appropriate APIs available in Spark to read data into an RDD or a Data Frame (as shown in the second sketch below).
  • Spark uses HDFS APIs to read data from, and/or write data to, the underlying file system.
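Below is a minimal PySpark sketch of creating an RDD with textFile and controlling the minimum number of partitions. The local master setting and the path /public/retail_db/orders are placeholders for illustration, not values given in this article.

from pyspark.sql import SparkSession

# Build a SparkSession; its SparkContext exposes the RDD APIs such as textFile
spark = SparkSession.builder \
    .appName("RDD partitions demo") \
    .master("local[*]") \
    .getOrCreate()
sc = spark.sparkContext

# By default the number of partitions follows the HDFS block size (128 MB)
orders = sc.textFile("/public/retail_db/orders")
print(orders.getNumPartitions())

# minPartitions asks Spark for at least this many partitions
orders = sc.textFile("/public/retail_db/orders", minPartitions=8)
print(orders.getNumPartitions())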
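The second sketch reads data into Data Frames using format-specific reader APIs and accesses attributes by name. The paths and the column names order_id and order_status are assumed for illustration only.

from pyspark.sql import SparkSession

spark = SparkSession.builder \
    .appName("Data Frame demo") \
    .master("local[*]") \
    .getOrCreate()

# Different file formats use different reader APIs under spark.read
orders_json = spark.read.json("/public/retail_db_json/orders")
orders_csv = spark.read.csv("/public/retail_db/orders", inferSchema=True)

# Attributes of a Data Frame can be accessed by name
orders_json.printSchema()
orders_json.select("order_id", "order_status").show(5)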
