Remove the header in Spark

Hi,
I have a file with a header and I want to remove it. I wrote the code below.
val file = sc.textFile("s3://…");
val header = file.first();
val data = file.filter(_ != header);
The above code works fine.

But my doubt is whether Spark always reads the data in the same order as the file. If yes, my solution is correct; if not, what is the solution? I don't want a solution that hard-codes the header, like val data = file.filter(_ != "col1,col2…");

Thanks
Suresh Selvaraj

Hi @suresh_selvaraj, there is an option for this, shown below.

spark.read.option("header", "true").csv("filePath")

For more details you can refer to the following documentation.
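
For anyone who wants to try the header option locally, here is a minimal sketch. The temp file, its contents and the local[1] master are assumptions made up for illustration; on a cluster or in spark-shell the session normally already exists as spark.

```scala
import java.nio.file.Files
import org.apache.spark.sql.SparkSession

// Local session for illustration only (assumption: no session exists yet).
val spark = SparkSession.builder()
  .master("local[1]")
  .appName("header-option-sketch")
  .getOrCreate()

// Hypothetical sample file: one header line followed by two data rows.
val path = Files.createTempFile("people", ".csv")
Files.write(path, "name,age\nalice,30\nbob,25\n".getBytes("UTF-8"))

// header=true tells the CSV reader to use the first line as column names
// and to leave it out of the data rows.
val df = spark.read.option("header", "true").csv(path.toString)

df.show()
```

With this reader there is no need to compare rows against the header yourself; the column names come from the first line of the file.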


Thanks Vinod. This option is available through SQLContext in Spark 1.x and SparkSession in 2.x.


This is available in 1.x too, but you need to include the Databricks spark-csv library in your environment before you can use it. If you are preparing specifically for the exam, though, this is the solution:

val file = sc.textFile("file_path")
val header = file.first

Just check whether the header really is on the first line, or on the second or third; sometimes you can expect some # comments at the beginning of the file.

val data = file.filter(row => row != header)

The data RDD will contain the rows without the header.
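
If the file can start with # comment lines, as mentioned above, one way to handle it is to drop those lines first and only then take the header. The snippet below is a rough sketch of that idea: the sample rows are made up, sc.parallelize stands in for sc.textFile, and treating every line that starts with # as a comment is an assumption about the file format.

```scala
import org.apache.spark.{SparkConf, SparkContext}

// Local context for illustration only; reuses an existing one if present.
val sc = SparkContext.getOrCreate(
  new SparkConf().setMaster("local[2]").setAppName("rdd-header-sketch"))

// Hypothetical file contents; in practice this would be sc.textFile("file_path").
val lines = sc.parallelize(Seq("# exported file", "col1,col2", "a,1", "b,2"), 2)

// Assumption: every line starting with '#' is a comment and can be dropped.
val noComments = lines.filter(row => !row.startsWith("#"))

// The first remaining line is taken to be the header.
val header = noComments.first()

// Keep every row that is not equal to the header.
val data = noComments.filter(row => row != header)

data.collect().foreach(println)
```

Note that comparing rows with the header also drops any data row that happens to be identical to the header line.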

This works fine in all exam-based scenarios. In a real project, where you have no restrictions on library usage, you can freely use spark-csv from Databricks.
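
On the ordering doubt from the original question: a commonly used alternative that does not compare rows at all is to drop the first line of partition 0 with mapPartitionsWithIndex. This is only a sketch under assumptions: it relies on the header being the first line of the first partition, which should hold for a single file read with sc.textFile, and with several input files it would remove only the first file's header. The sample rows are invented and sc.parallelize stands in for sc.textFile.

```scala
import org.apache.spark.{SparkConf, SparkContext}

// Local context for illustration only; reuses an existing one if present.
val sc = SparkContext.getOrCreate(
  new SparkConf().setMaster("local[2]").setAppName("drop-first-line-sketch"))

// Hypothetical contents; the third row repeats the header text on purpose.
val file = sc.parallelize(Seq("col1,col2", "a,1", "col1,col2", "b,2"), 2)

// Drop exactly one line, and only from partition 0 (assumed to hold the header).
val data = file.mapPartitionsWithIndex { (index, rows) =>
  if (index == 0) rows.drop(1) else rows
}

data.collect().foreach(println)
```

Unlike the comparison-based filter, this keeps a data row that happens to look like the header, because it removes by position rather than by content.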

Regards
Venkat

Hey, the solution you gave will work for CSV but not for a plain text file, right?

This works for CSV as well.