How to Reduce the Size of Scheduler Tasks

apache-spark

#1

val orders = scala.io.Source.fromFile("/home/cloudera/Public/retail_db/orders/part-00000").getLines.toList

I have created an RDD with 6 partitions:

scala> val orderRdd = sc.parallelize(orders, 6)
orderRdd: org.apache.spark.rdd.RDD[String] = ParallelCollectionRDD[1] at parallelize at <console>:29

scala> orderRdd.first
18/03/27 16:56:09 WARN scheduler.TaskSetManager: Stage 1 contains a task of very large size (503 KB). The maximum recommended task size is 100 KB.
res9: String = 1,2013-07-25 00:00:00.0,11599,CLOSED

It’s complaining that stage 1 contains a task of very large size. How can I get the task size down to the recommended 100 KB? I have already partitioned the RDD into 6 partitions while creating orderRdd.
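A rough sketch of one workaround, assuming the data must stay in a driver-side list: raise the partition count so each task serializes a smaller slice of the collection (60 here is only an illustrative number, not a tuned value).

scala> // ~3 MB of orders split 6 ways gives ~500 KB per task; split 60 ways it is ~50 KB
scala> val orderRddSmall = sc.parallelize(orders, 60)
scala> orderRddSmall.first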


#2

@tariq,

It’s just a ‘WARN’, not an ‘ERROR’, so you can ignore it. But if you still want to optimize, try the document below:
https://jaceklaskowski.gitbooks.io/mastering-apache-spark/content/spark-TaskScheduler.html
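Another option, sketched here on the assumption that the file is readable from the executors (e.g. a single-node setup such as the Cloudera VM, or the file copied to HDFS), is to let Spark read the file itself instead of parallelizing a driver-side list; tasks then carry only split metadata rather than the data, so the large-task warning goes away. The file:// prefix is needed only if the default filesystem is HDFS.

scala> // Executors read the splits themselves; nothing large is shipped in the task closure
scala> val orderRddFromFile = sc.textFile("file:///home/cloudera/Public/retail_db/orders/part-00000", 6)
scala> orderRddFromFile.first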


#3

Thank you so much!