HDPCD - Resilient Distributed Datasets


Originally published at: http://www.itversity.com/topic/hdpcd-resilient-distributed-datasets-scala/

Introduction to RDD

Spark revolves around the concept of a resilient distributed dataset (RDD), which is a fault-tolerant collection of elements that can be operated on in parallel. There are two ways to create RDDs: parallelizing an existing collection in your driver program, or referencing a dataset in an external storage system, such as a…
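
The two creation paths described above can be sketched in Scala roughly as follows. This is a minimal illustration, not the course's own code. It assumes Spark is on the classpath and runs in local mode. The object name `RddCreationSketch`, the app name, and the temporary text file are hypothetical stand-ins, and the local file is used only so the example is self-contained (the original excerpt does not say which storage system it has in mind).

```scala
import java.nio.file.Files

import org.apache.spark.{SparkConf, SparkContext}

object RddCreationSketch {

  // Way 1: parallelize an existing collection held in the driver program.
  def doubledSum(sc: SparkContext): Int = {
    val numbers = sc.parallelize(Seq(1, 2, 3, 4, 5))
    numbers.map(_ * 2).reduce(_ + _)
  }

  // Way 2: reference a dataset in external storage. A temporary local
  // file stands in for the storage system here (an assumption).
  def countLines(sc: SparkContext): Long = {
    val path = Files.createTempFile("rdd-sketch", ".txt")
    Files.write(path, "first line\nsecond line\n".getBytes("UTF-8"))
    val lines = sc.textFile(path.toString)
    lines.count()
  }

  def main(args: Array[String]): Unit = {
    // local[*] runs Spark inside this JVM using all available cores.
    val conf = new SparkConf().setAppName("rdd-creation-sketch").setMaster("local[*]")
    val sc = new SparkContext(conf)
    try {
      println(doubledSum(sc))
      println(countLines(sc))
    } finally {
      sc.stop()
    }
  }
}
```

In both cases the result is an RDD whose elements are split into partitions, so transformations such as `map` and actions such as `reduce` or `count` run in parallel across them.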