Apache Spark 1.6 - Transform, Stage and Store - Create RDD from HDFS files

As part of this topic, we will see how we can read data from HDFS and create an RDD out of it.

Reading data using SparkContext

Data can be read from files using the textFile method of the SparkContext object sc. The following actions can be performed on top of the resulting RDD to preview the data (see the sketch after this list):

    • take
    • first
    • count
    • and more
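Here is a minimal sketch of reading a file from HDFS and previewing it with these actions. It assumes you are in the pyspark shell, where sc is already created for you; the HDFS path /public/retail_db/orders is a placeholder, so substitute a file that exists on your cluster.

    # In the pyspark shell, the SparkContext is pre-created as sc
    # The path below is a placeholder; point it at a file in your HDFS
    orders = sc.textFile("/public/retail_db/orders")

    # Preview the data using actions
    orders.take(5)    # first 5 records, returned as a Python list
    orders.first()    # the very first record
    orders.count()    # total number of records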

Create RDD using data from HDFS

  • RDD – Resilient Distributed Dataset, Spark's core data structure; it can be thought of as a distributed collection, similar to a Python list but partitioned across the cluster
  • In-memory – data is processed in memory, spilling to disk only when necessary
  • Distributed – data is partitioned across the nodes of the cluster and processed in parallel
  • Resilient – lost partitions can be recomputed from the lineage of transformations
  • Reading files from HDFS
  • A quick overview of Transformations and Actions
  • DAG and lazy evaluation (illustrated in the sketch after this list)
  • Previewing the data using Actions
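The sketch below illustrates Transformations, Actions, DAG and lazy evaluation together. Transformations such as filter only record lineage in the DAG; nothing is read from HDFS until an action such as count triggers execution. The path and the record layout (comma-delimited orders data with the status in the fourth field) are assumptions for illustration only.

    # Transformations are lazy: these lines only build up the DAG;
    # no data is read from HDFS yet
    orders = sc.textFile("/public/retail_db/orders")  # placeholder path
    completed = orders.filter(lambda rec: rec.split(",")[3] == "COMPLETE")

    # Actions trigger evaluation of the whole DAG
    completed.count()   # Spark now reads the file and applies the filter
    completed.take(3)   # preview a few of the filtered records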

Learn Spark 1.6.x or Spark 2.x on our state-of-the-art big data labs

  • Click here for access to our state-of-the-art 13-node Hadoop and Spark cluster