Spark out of memory

apache-spark

#1

Hi,

I am getting the error message below when I use rdd.collect.foreach(println) in spark-shell (Spark 2).

java.lang.OutOfMemoryError: GC overhead limit exceeded

Please help me.


#2

Where are you running into this issue?


#3

Right at the beginning, i.e. just after reading the data from a CSV file using textFile.
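
Something like this (the file path is just an example):

  val rdd = sc.textFile("/data/sample.csv")   // read the CSV file as lines
  rdd.collect.foreach(println)                // this is where the OutOfMemoryError appears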


#4

You can try increasing the number of slave nodes; this has worked for me previously. However, it might not help if the data is badly partitioned (for instance, if you partition on a column that contains many duplicate values, most of your data ends up on a few slaves). A sketch of repartitioning is below.
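
For illustration only, a minimal sketch of evening out the partitioning (the RDD names and the partition count of 200 are assumptions, not values from this thread):

  import org.apache.spark.HashPartitioner

  // Spread records across more partitions regardless of key
  val evened = rdd.repartition(200)

  // For a key-value RDD, partition on a key with few duplicates instead of a skewed column
  val byBetterKey = pairRdd.partitionBy(new HashPartitioner(200))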


#5

Can you please guide me on how to increase the number of slave nodes?


#6

collect pulls all the data in the RDD into the driver program on the gateway node, which can easily run out of memory. You need to be careful when using collect. To preview your data, use take(n) instead of collect.

collect is typically used on the final output, once it is small, to convert the RDD into an array so it can be processed further with Scala collection APIs that are not available on RDDs.
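
As a rough sketch of the difference (the RDD names are placeholders):

  rdd.take(10).foreach(println)          // brings only 10 records to the driver -- safe for a preview

  val result = smallFinalRdd.collect()   // only collect output known to fit in driver memory
  result.sorted.foreach(println)         // result is a plain Scala Array, so Scala collection APIs apply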


#7

Ok thank you sir.
I won’t use collect in the first step again.