How to save data retrieved using HiveContext

I have used HiveContext and got an output using SQL.
Now I want to save the result to a file, but I get the error below. Do I need to import some package?

val hc = sqlContext.sql("select * from sakthi123.orders limit 10");
hc.saveAsTextFile("/user/sakthimuruganv2214/sakthi/spark/hiveresult")
:22: error: value saveAsTextFile is not a member of org.apache.spark.sql.DataFrame
hc.saveAsTextFile("/user/sakthimuruganv2214/sakthi/spark/hiveresult")

@sakthi.murugan.v can you please paste the complete code so we can investigate further…

@sakthi.murugan.v

You have to change it to RDD before saving.
hc.rdd.saveAsTextFile("/user/sakthimuruganv2214/sakthi/spark/hiveresult")

Thanks, it worked. So when does a variable declared as val become an RDD? I thought that by default in the Spark API a val variable is an RDD, isn't it?

Below is the hierarchy of data structures in Spark.

Data File --> RDD --> Data Frames

Data File : TextFiles/SequenceFiles/NewAPIHadoopFile

RDD : What you get when you load the above files into memory.

Data Frames : An abstraction over RDDs. Basically columnar data (sort of a table).

Related transformations
RDD to DF --> someRDD.map(...).toDF() (define a case class in the map to give the data a structure)
DF to RDD --> someDF.rdd
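To make the two conversions above concrete, here is a minimal sketch. It assumes a Spark 1.x shell where sc and sqlContext already exist; the path, the Order case class, and its fields are hypothetical examples, not anything from the original data.

```scala
// Define a case class to give the raw text a structure
case class Order(orderId: Int, status: String)

import sqlContext.implicits._  // brings toDF() into scope

val ordersRDD = sc.textFile("/somePath/orders.txt")  // RDD[String]
  .map(_.split(","))
  .map(f => Order(f(0).toInt, f(1)))                 // RDD[Order]

// RDD to DF: schema comes from the case class fields
val ordersDF = ordersRDD.toDF()   // DataFrame with columns orderId, status

// DF to RDD: back to an RDD (of Row objects this time)
val backToRDD = ordersDF.rdd
```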

DataFiles like AVRO/PARQUET/JSON etc. already have a structure defined for them, so you can load them directly into a DataFrame:
e.g.:
avroDataFrame = sqlContext.read.avro("/somePath")
parquetDataFrame = sqlContext.read.parquet("/somePath")
jsonDataFrame = sqlContext.read.json("/somePath")

You can also save these file formats directly from a DataFrame:
someDataFrame.write.avro("/somePath") and similarly for parquet and json.

To write data as TextFiles/SequenceFiles/NewAPIHadoopFile,
you have to first convert the Dataframe to RDD and then save it.

someDF.rdd.saveAsTextFile("/somePath").
or
someDF.map(r => (r(0), r(1), …r(n))).saveAsTextFile("/somePath").
When you use map here you have more control on the formatting of the data, especially on the delimiter.
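As an example of that delimiter control, here is a small sketch using Row.mkString (someDF and the output path are placeholders, as above; assumes a Spark shell):

```scala
// mkString lets you pick the field separator for each output line
someDF.rdd
  .map(row => row.mkString("\t"))        // tab-separated; use "," for CSV-style output
  .saveAsTextFile("/somePath/tabDelimited")
```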

Hope it clears the doubt. Feel free to reply back if you need more information.


Thanks Laxman, I understand the concept. So basically, when a result set from a query in HiveContext is stored, it is in the form of a DataFrame, hence we have to convert it into an RDD while saving. Whereas when I directly read a text file and then do an action on it like take(5), which means loading it into memory, we don't need to convert with .rdd, even while saving it in a different format. Why is that?
Hope you are getting my question…