PySpark - Reading and Saving Sequence Files


When I directly ran dataRDD.map(lambda x: tuple(x.split(",", 1))).saveAsSequenceFile("/user/cloudera/pysprk/departmentsSeq"), it gave me an error: "dataRDD not defined".

Hence I first defined dataRDD as dataRDD = sc.textFile("/user/cloudera/departments") and then applied the map function.
This time the output was saved successfully.

My doubt is: should we first define the variable by reading a text file before applying the map() function to it?
If yes, should the data always be saved in text format in HDFS?

Could you please clarify the doubt?


@SreeswethaGolla You are trying to save some output data as a sequence file, so you first need to provide the input data. The error appeared because dataRDD had not been created yet when map() was called on it.

The input data can be either:
(a) file content (local or HDFS), in which case we need an API that can read a file, such as sc.textFile()

(b) data created in the program itself and turned into an RDD with the sc.parallelize() API
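
As a rough sketch of both options, assuming you are in the pyspark shell where sc is already defined (the function names here are made up for illustration and are not part of the Spark API):

```python
def to_key_value(line):
    """Split 'id,rest' on the first comma only, giving a (key, value) tuple."""
    # Assumes every line contains at least one comma; otherwise the
    # tuple has a single element and cannot be saved as a key-value pair.
    return tuple(line.split(",", 1))


def save_from_file(sc, input_path, output_path):
    # Option (a): create the RDD from a file (local or HDFS) with sc.textFile().
    data_rdd = sc.textFile(input_path)
    data_rdd.map(to_key_value).saveAsSequenceFile(output_path)


def save_from_collection(sc, rows, output_path):
    # Option (b): create the RDD from an in-memory Python list with sc.parallelize().
    data_rdd = sc.parallelize(rows)
    data_rdd.map(to_key_value).saveAsSequenceFile(output_path)
```

For example, save_from_file(sc, "/user/cloudera/departments", "/user/cloudera/pysprk/departmentsSeq") repeats the steps from the question. In both cases the RDD has to exist before map() is called on it, and the output directory must not already exist.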

The input does not have to be in text format, since HDFS can also hold Avro, Parquet, JSON, sequence files, etc.
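
For instance, a sequence file saved earlier can itself be used as the input. This is only a sketch: it assumes the pyspark shell (sc already defined) and Text keys and values, and the helper names are hypothetical.

```python
def read_sequence_file(sc, path):
    # sc.sequenceFile() returns an RDD of (key, value) pairs.
    return sc.sequenceFile(path)


def to_csv_line(pair):
    """Join a (key, value) pair back into a 'key,value' text line."""
    key, value = pair
    return "{},{}".format(key, value)


def sequence_to_text(sc, input_path, output_path):
    # Read the key-value pairs and write them back out as plain text lines.
    read_sequence_file(sc, input_path).map(to_csv_line).saveAsTextFile(output_path)
```

Calling read_sequence_file(sc, "/user/cloudera/pysprk/departmentsSeq").collect() in the shell should show the pairs that were written in the question.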

Great info. Thanks a lot sir.