When to use RDD and DataFrame with Python


#1

Hi all, many problems can be solved with both RDDs and DataFrames. Can you please suggest when I should use an RDD versus a DataFrame, and why?


#2

Suppose we have a dataset of Employee objects:

case class Employees(name: String, rolno: Int, grade: Float)

If we have an RDD of type Employees, Spark only sees each record as an opaque blob and cannot look inside these objects. So if the Employees object had hundreds of fields and a certain computation needed only a few of them, Spark could not optimise for that: it would serialize all the fields and send them over the network.
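For example, here is a minimal sketch (the sample records, app name and master setting are made up for illustration) of an RDD of Employees; even though the computation only uses the grade field, Spark treats each element as a whole object:

import org.apache.spark.{SparkConf, SparkContext}

case class Employees(name: String, rolno: Int, grade: Float)

val sc = new SparkContext(new SparkConf().setAppName("rdd-example").setMaster("local[*]"))

val employeesRDD = sc.parallelize(Seq(
  Employees("Alice", 1, 3.9f),
  Employees("Bob", 2, 3.4f),
  Employees("Carol", 3, 3.7f)
))

// To Spark this is just an RDD[Employees]: whenever the data is cached,
// shuffled or written out, the whole object gets serialized, even if the
// computation only ever reads the grade field.
val avgGrade = employeesRDD.map(_.grade.toDouble).mean()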
But with DataFrames you can actually give the data a structure, as rows and columns of a table. Spark then knows everything about the structure of the dataset, and with optimised data sources like Parquet, if Spark sees that only a few columns are needed to compute the result, it will read and fetch only those columns from Parquet, saving both disk I/O and memory.
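As a rough illustration (the file path is a placeholder), with a DataFrame backed by Parquet, Spark can prune columns at the source:

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.col

val spark = SparkSession.builder().appName("parquet-pruning").getOrCreate()

// The schema comes from the Parquet metadata, so Spark knows every column up front.
val employeesDF = spark.read.parquet("/data/employees.parquet")

// Only name and grade are referenced, so the optimiser pushes column pruning
// down to the Parquet reader and skips the other columns on disk.
val topStudents = employeesDF.select("name", "grade").filter(col("grade") > 3.5)

topStudents.explain()  // ReadSchema in the physical plan lists just name and grade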

Now suppose you have another dataset of Departments, with an RDD of type Departments, and some computation requires joining and filtering Departments with the Employees dataset. With RDDs, as the programmer you have to optimise the operations yourself, deciding whether to filter first and join later, or join first and filter later. DataFrames are converted to RDDs under the hood, but they create an optimised plan for you.
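For instance, in this sketch (the Parquet paths, the deptId join key and the deptName column are invented for illustration) the query is written as join-then-filter, but the optimiser is free to push the filter below the join:

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.col

val spark = SparkSession.builder().appName("join-example").getOrCreate()

val employeesDF   = spark.read.parquet("/data/employees.parquet")   // assumed columns: name, rolno, grade, deptId
val departmentsDF = spark.read.parquet("/data/departments.parquet") // assumed columns: deptId, deptName

// Written as "join first, filter later"; the optimised plan will typically
// apply the deptName filter to departmentsDF before the join, so only the
// matching department rows take part in the shuffle.
val result = employeesDF
  .join(departmentsDF, "deptId")
  .filter(col("deptName") === "Engineering")

result.explain()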
So we can say that DataFrames are a relational API over RDDs, and for analysis jobs in particular people tend to prefer the declarative relational API over the functional one. DataFrames are also the abstraction used by Spark SQL.
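To make that concrete, a small example (the view name and query are just placeholders): the same DataFrame can be used through the relational API or through plain SQL, and both go through the same optimiser:

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("sparksql-example").getOrCreate()

val employeesDF = spark.read.parquet("/data/employees.parquet")

// Register the DataFrame as a temporary view so it can be queried with SQL.
employeesDF.createOrReplaceTempView("employees")

// Both of these produce DataFrames and are optimised the same way.
val viaApi = employeesDF.groupBy("grade").count()
val viaSql = spark.sql("SELECT grade, COUNT(*) AS cnt FROM employees GROUP BY grade")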