Resilient Distributed Datasets (RDD)

Originally published at: http://www.itversity.com/topic/resilient-distributed-datasets-rdd/

Introduction to RDD

Spark revolves around the concept of a resilient distributed dataset (RDD), which is a fault-tolerant collection of elements that can be operated on in parallel. There are two ways to create RDDs: parallelizing an existing collection in your driver program, or referencing a dataset in an external storage system, such as a…