Apache Spark 1.6 - Transform, Stage and Store - Create RDD from Collection Using parallelize

Create RDD using parallelize

Quite often we need to read data from the local file system and create an RDD out of it so that it can be processed in conjunction with other RDDs. Here are the steps to create an RDD from data in files on the local file system.

  • Read data from the files using Python File I/O APIs
  • Create a collection (such as a list) out of the data
  • Convert the collection into an RDD using sc.parallelize, passing the collection as an argument
  • The data is now in the form of an RDD and can be processed using Spark APIs

As part of this topic, we will see how we can convert a collection into an RDD; a sketch of the steps is shown below.
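
Here is a minimal sketch of the steps above. It assumes the pyspark shell, where the SparkContext is already available as sc, and a hypothetical local file /tmp/orders.csv; adjust the path to point at your own data.

```python
# A minimal sketch, assuming the pyspark shell where the SparkContext
# is already available as sc. The file path below is hypothetical.

# Step 1: read data from the local file system using Python File I/O APIs
with open("/tmp/orders.csv") as fh:
    # Step 2: create a collection - here, a Python list of lines
    orders = fh.read().splitlines()

# Step 3: convert the collection into an RDD using sc.parallelize
orders_rdd = sc.parallelize(orders)

# Step 4: the data is now an RDD and can be processed using Spark APIs
print(orders_rdd.count())   # number of records
print(orders_rdd.first())   # first record
```

Note that sc.parallelize also accepts an optional second argument, numSlices, to control how many partitions the resulting RDD has, e.g. sc.parallelize(orders, 4). If it is omitted, Spark picks a default based on the cluster configuration.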
