Reading text data in Spark


textFile:

This method creates an RDD from a text file. It can read a single file or multiple files in a directory stored on a local file system, HDFS, Amazon S3, or any other Hadoop-supported storage system. It returns an RDD of Strings, where each element represents a line in the input file.

The number of partitions in the resulting RDD is equal to the number of blocks in the input file(s).
example:

scala> val rddtext = sc.textFile("/Users/Balus/Documents/retail_db/categories")

scala> rddtext.partitions.length
res23: Int = 2
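textFile also accepts an optional minPartitions argument, so you can request more partitions than the block-based default. A minimal spark-shell sketch, reusing the path from the example above (the value 4 is an arbitrary choice for illustration):

```scala
// sc is the SparkContext that spark-shell provides automatically.
// Asking for at least 4 partitions instead of the default of 2:
val rddMorePartitions = sc.textFile("/Users/Balus/Documents/retail_db/categories", 4)

// The actual partition count is at least the requested minimum.
rddMorePartitions.partitions.length
```

Note that minPartitions is a lower bound, not an exact count: Spark may create more partitions than requested, but not fewer.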

wholeTextFiles:

This method reads all text files in a directory and returns an RDD of key-value pairs. Each key-value pair in the returned RDD corresponds to a single file: the key stores the path of the file and the value stores its content.

example:

scala> val rddwholetext = sc.wholeTextFiles("/Users/Balus/Documents/retail_db/categories")
rddwholetext: org.apache.spark.rdd.RDD[(String, String)] = MapPartitionsRDD[15] at wholeTextFiles at <console>:27

scala> rddwholetext.partitions.length
res27: Int = 1
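Because each element of the RDD is a (path, content) pair, you can process files as whole units rather than line by line. A minimal sketch continuing the spark-shell session above (the lines-per-file computation is an illustrative example, not from the original post):

```scala
// For each (path, content) pair, count the number of lines in that file.
val linesPerFile = rddwholetext.map { case (path, content) =>
  (path, content.split("\n").length)
}

// Print one (path, lineCount) pair per input file.
linesPerFile.collect().foreach(println)
```

This pattern is useful when a record spans multiple lines (e.g. whole JSON or XML documents per file), where textFile's line-oriented splitting would break records apart.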
