How does the Spark distinct function work?

Hi,
I have a text file (4 GB) and want to get its distinct rows. How does this work in terms of the map and shuffle operations?

Thanks
Suresh Selvaraj

Hi Suresh,

Please follow the steps below if all the data in the text file is in a structured format.

  1. Create a Hive table that matches the structure of the file, with the proper delimiter.
  2. Load the text file into the Hive table.
  3. Access the data from Spark using HiveContext.
  4. Use the distinct function to get the de-duplicated output.
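
The steps above could look roughly like the sketch below. It assumes Spark 1.x built with Hive support and launched through spark-submit; the table name weather_raw, its two columns, the comma delimiter and the path /tmp/weatherdata.txt are all made up for illustration, so adjust them to the real file. From Spark 2.0 onward HiveContext is deprecated in favour of SparkSession with enableHiveSupport().

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.hive.HiveContext

object HiveDistinctSketch {
  def main(args: Array[String]): Unit = {
    // Master URL is expected to come from spark-submit.
    val sc = new SparkContext(new SparkConf().setAppName("hive-distinct-sketch"))
    val hiveContext = new HiveContext(sc)

    // Step 1: hypothetical table, columns and delimiter.
    hiveContext.sql(
      """CREATE TABLE IF NOT EXISTS weather_raw (station STRING, reading STRING)
        |ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
        |STORED AS TEXTFILE""".stripMargin)

    // Step 2: hypothetical local path to the text file.
    hiveContext.sql(
      "LOAD DATA LOCAL INPATH '/tmp/weatherdata.txt' INTO TABLE weather_raw")

    // Steps 3 and 4: read the table as a DataFrame and drop duplicate rows.
    val distinctRows = hiveContext.table("weather_raw").distinct()
    distinctRows.show()

    sc.stop()
  }
}
```

Note that distinct on the DataFrame still triggers a shuffle; going through Hive changes how the data is read, not how duplicates are removed.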

Hope this information helps.

Regards,
Amit

Thanks for your reply, Amit.
Spark itself can do this as shown below, but I want to know how it works internally. What happens on the shuffle side? I would like a step-by-step account of the map and shuffle operations.
val distinctRdd = sc.textFile("c://test/whetherdata.txt").distinct()

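In Spark this will work fine. Internally, sc.textFile creates an RDD of lines split into partitions, and distinct returns a new RDD. Because the same row can appear in more than one partition, removing duplicates requires a shuffle.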
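
To make the map and shuffle steps concrete: as far as I know, RDD.distinct is essentially map(x => (x, null)).reduceByKey((x, _) => x).map(_._1), though the details may differ between Spark versions. The sketch below imitates those stages with plain Scala collections, so it runs without Spark. The object name DistinctSketch, its helper functions and the use of Unit as the dummy value (Spark uses null) are my own choices for illustration, not Spark code.

```scala
// Plain-Scala imitation of the stages behind distinct; no Spark required.
object DistinctSketch {
  // Map side: drop duplicates inside one input partition (what a map-side
  // combine achieves) and pair each row with a dummy value so the row
  // itself becomes the key.
  def mapSide(partition: Seq[String]): Seq[(String, Unit)] =
    partition.distinct.map(row => (row, ()))

  // Shuffle: send every key to a reduce partition chosen by its hash, so
  // equal rows from different input partitions end up in the same place.
  def shuffle(mapOutputs: Seq[Seq[(String, Unit)]],
              numReducers: Int): Map[Int, Seq[(String, Unit)]] =
    mapOutputs.flatten.groupBy { case (row, _) =>
      java.lang.Math.floorMod(row.hashCode, numReducers)
    }

  // Reduce side: keep one record per key, then drop the dummy value.
  def reduceSide(records: Seq[(String, Unit)]): Seq[String] =
    records.map(_._1).distinct

  // Whole pipeline over a list of input partitions.
  def distinctRows(partitions: Seq[Seq[String]], numReducers: Int): Seq[String] =
    shuffle(partitions.map(mapSide), numReducers)
      .values
      .flatMap(reduceSide)
      .toSeq
}
```

For example, DistinctSketch.distinctRows(Seq(Seq("a", "b", "a"), Seq("b", "c")), 2) returns the rows a, b and c once each, in no guaranteed order. The point of the map-side step is that only the rows unique within each partition cross the network, so for a 4 GB file with many repeated rows the shuffle can be much smaller than the input.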

check here: