XML data comparison in two different folders with identical file names and schemas

@itversity, or anyone working on real-time projects: I need your help with the following.

Number of files: folder1 has 19382 (250 MB) and folder2 has 19382 (250 MB).

Solution Approach:
val oldpath = "file:///home/cloudera/venkat/dataset/old/"
val newpath = "file:///home/cloudera/venkat/dataset/new/"

val olddata = sqlContext.read.format("com.databricks.spark.xml").option("rowTag", "ESI").load(oldpath)
val newdata = sqlContext.read.format("com.databricks.spark.xml").option("rowTag", "ESI").load(newpath)

val diffdata = olddata.except(newdata).select("ESI")

This solution works, but it is not optimized: it takes close to 2 hours to find the files that differ in the new folder.

Please share your input on optimizing this or changing the solution…


Hi Venkat,
Your data size looks small, but the number of files is huge. So try:
1. Merge the files using coalesce or repartition.
2. Some people use .option("rootTag", "ESI") as well; I don't know how much it matters if it is not specified.
3. Check how many tasks are running and whether any task is taking a long time.
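For the first suggestion, a minimal sketch of what this could look like, assuming spark-shell (so `sqlContext` already exists), the spark-xml package on the classpath, and a Spark version where DataFrame has `coalesce` (1.4 or later). The value of `numPartitions` is a hypothetical starting point, not a measured optimum. Note that this only merges partitions after the read is planned; any schema-inference pass over the files may still run one task per file.

```scala
// Run inside spark-shell; `sqlContext` is provided by the shell.
val oldpath = "file:///home/cloudera/venkat/dataset/old/"
val newpath = "file:///home/cloudera/venkat/dataset/new/"

// Hypothetical target; tune to the number of cores available.
val numPartitions = 8

// coalesce merges the many per-file partitions into at most numPartitions
// partitions without a shuffle, so the read stage runs far fewer tasks.
val olddata = sqlContext.read
  .format("com.databricks.spark.xml")
  .option("rowTag", "ESI")
  .load(oldpath)
  .coalesce(numPartitions)

val newdata = sqlContext.read
  .format("com.databricks.spark.xml")
  .option("rowTag", "ESI")
  .load(newpath)
  .coalesce(numPartitions)

// Rows present in the old data but not in the new data.
val diffdata = olddata.except(newdata)
```

Checking the number of tasks in the read stage in the Spark UI before and after this change should show whether it helps.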
Suresh Selvaraj

Hi @suresh_selvaraj

You are right that the number of files is huge here, but it is still pretty small compared to the actual live data…

  1. I don't know how to make use of coalesce or repartition in this case, as every file is a single XML record.
  2. I tried both rootTag and rowTag. Only rowTag works better, as I have only one record in each XML file.
  3. For some reason, the number of tasks is directly proportional to the number of files.
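
One possible direction for points 1 and 3, as a minimal sketch rather than a tested solution: since every file is a single record, read each folder with `sc.wholeTextFiles`, which packs many small files into a few partitions, hash each file's raw text, and join the two folders on file name. This swaps the XML-aware `except` for a plain text comparison, so whitespace-only changes would also be reported. It assumes spark-shell (so `sc` exists); the helper names `md5Hex`, `fileName`, `hashesByName` and the partition count 8 are hypothetical.

```scala
import java.security.MessageDigest
import org.apache.spark.SparkContext
import org.apache.spark.rdd.RDD

val oldpath = "file:///home/cloudera/venkat/dataset/old/"
val newpath = "file:///home/cloudera/venkat/dataset/new/"

// Hex-encoded MD5 of a string (hypothetical helper).
def md5Hex(s: String): String =
  MessageDigest.getInstance("MD5")
    .digest(s.getBytes("UTF-8"))
    .map("%02x".format(_))
    .mkString

// Last path segment, e.g. "file:/a/b/c.xml" becomes "c.xml".
def fileName(path: String): String =
  path.substring(path.lastIndexOf('/') + 1)

// (file name, content hash) for every file in a folder.
// minPartitions is only a hint for how many partitions to create.
def hashesByName(sc: SparkContext, dir: String, minPartitions: Int): RDD[(String, String)] =
  sc.wholeTextFiles(dir, minPartitions)
    .map { case (path, content) => (fileName(path), md5Hex(content)) }

val oldHashes = hashesByName(sc, oldpath, 8)
val newHashes = hashesByName(sc, newpath, 8)

// Names of files that exist in both folders but whose content differs.
val changedFiles = oldHashes.join(newHashes)
  .filter { case (_, (oldHash, newHash)) => oldHash != newHash }
  .keys

changedFiles.collect().foreach(println)
```

Only the files reported here would then need to be parsed with spark-xml if a field-level comparison is required.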

I am also getting this warning, and I am not sure how to resolve it in the Cloudera VM…

WARN scheduler.TaskSetManager: Stage 34 contains a task of very large size (667 KB). The maximum recommended task size is 100 KB.