WARN scheduler.TaskSetManager: Stage 3 contains a task of very large size (914 KB). The maximum recommended task size is 100 KB
@venkatwilliams Can you please share the code with us? I have generally seen this issue when the data is not distributed evenly across partitions after you load it.
Number of files in old folder: 19382 (250 MB); in new folder: 19382 (250 MB)
val oldpath = "file:///home/cloudera/venkat/dataset/old/"
val newpath = "file:///home/cloudera/venkat/dataset/new/"
val olddata = sqlContext.read.format("com.databricks.spark.xml").option("rowTag", "ESI").load(oldpath)
val newdata = sqlContext.read.format("com.databricks.spark.xml").option("rowTag", "ESI").load(newpath)
val diffdata = olddata.except(newdata).select("ESI")
Having that many files for just 250 MB of data is not good to begin with. But if your upstream sends the data like that, then you need to use the "repartition" or "coalesce" transformations on the RDD.
First, see how many partitions your RDD holds. Can you please post the results of the below commands?
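A minimal sketch of how the partition count can be checked and changed, assuming it is run inside spark-shell where `sqlContext` is already defined. The `sample` DataFrame and the target counts of 8 and 2 are hypothetical stand-ins for illustration, not values taken from this thread; the same calls should apply to `olddata` and `newdata`.

```scala
// Run inside spark-shell, where `sqlContext` is predefined.
// `sample` is a hypothetical stand-in for olddata / newdata.
val sample = sqlContext.range(0, 1000)

// A DataFrame exposes its partitions through the underlying RDD.
println(s"partitions before: ${sample.rdd.partitions.size}")

// repartition(n) does a full shuffle; it can raise or lower the partition count.
// 8 is an arbitrary value chosen for illustration.
val spread = sample.repartition(8)
println(s"after repartition: ${spread.rdd.partitions.size}")

// coalesce(n) avoids a full shuffle, but can only lower the partition count.
val merged = spread.coalesce(2)
println(s"after coalesce: ${merged.rdd.partitions.size}")
```

As a rough rule, "repartition" is the one to reach for when the data needs to be spread over more partitions, while "coalesce" is cheaper when the goal is only to merge many small partitions into fewer.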
Thanks for taking the time to look into it.
Please find the results of the requested commands below. Let me know what to change…
olddata.partitions.size = 2
newdata.partitions.size = 2