Questions related to Spark

  1. How do you determine the number of partitions, cores, and memory for the driver and executors so that your Spark job runs optimally? (A configuration sketch follows this list.)

  2. If critical data required for processing is missing from a file, how do you handle that case? (A validation sketch follows this list.)

  3. What metrics do you observe to tune Spark job execution?

  4. If the columns used for joining the tables hold about 1 GB of data (the key on which the join is performed), how do you execute the Spark job optimally? (A salting sketch follows this list.)

  5. Which debugging tools do you use for Spark job execution?

  6. Which testing tools do you use for Spark?

  7. If the data is too large to fit in memory (RAM/disk), how do you process it optimally?

  8. How do you control garbage collection in Spark? (A GC configuration sketch follows this list.)

  9. What parameters do you pass to spark-submit to run the job efficiently?

  10. What are the criteria for selecting the number of partitions in an RDD?

  11. How do you automate spark-submit?
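
For questions 1, 9, and 10, the knobs involved are roughly the ones below. This is a minimal sketch only: every number is a placeholder assumption, and the right values depend on cluster size, data volume, and how many tasks per core you want (a commonly cited rule of thumb is 2–4 tasks per core and shuffle partitions of roughly 100–200 MB each).

```scala
import org.apache.spark.sql.SparkSession

// Minimal sketch only: all numbers below are assumed placeholders, not recommendations.
val spark = SparkSession.builder()
  .appName("tuning-sketch")
  .config("spark.executor.instances", "10")       // number of executors
  .config("spark.executor.cores", "4")            // cores per executor
  .config("spark.executor.memory", "8g")          // heap per executor
  .config("spark.driver.memory", "4g")            // driver heap
  .config("spark.sql.shuffle.partitions", "400")  // partitions produced by shuffles
  .getOrCreate()

// The same settings are usually passed on the command line instead, e.g.
// spark-submit --num-executors 10 --executor-cores 4 --executor-memory 8g \
//   --driver-memory 4g --conf spark.sql.shuffle.partitions=400 ...
```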
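
For question 2, one common pattern is to validate the input up front and either fail fast or quarantine the bad rows. This is a sketch under assumptions: the CSV format, the required column names, and the rejected-records path are all hypothetical.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.col

// Sketch only: the path, required columns, and quarantine location are hypothetical.
def loadWithValidation(spark: SparkSession, path: String): DataFrame = {
  val required = Seq("id", "event_time", "amount")      // assumed mandatory columns
  val df = spark.read.option("header", "true").csv(path)

  // Fail fast if a mandatory column is absent from the file.
  val missing = required.filterNot(c => df.columns.contains(c))
  require(missing.isEmpty, s"Missing required columns: ${missing.mkString(", ")}")

  // Quarantine rows with null mandatory values instead of silently processing them.
  val isBad = required.map(col(_).isNull).reduce(_ || _)
  df.filter(isBad).write.mode("overwrite").csv(path + "_rejected")

  df.filter(!isBad)
}
```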
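
For question 4, when one join key holds a disproportionate amount of data, a frequently used technique is key salting, so the hot key is spread over many partitions. In the sketch below the join column name ("key") and the salt factor are assumptions; on Spark 3.x, adaptive query execution (spark.sql.adaptive.skewJoin.enabled=true) can also split skewed partitions automatically.

```scala
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.{array, explode, lit, rand}

// Sketch of a salted join; "key" and the salt range are assumed placeholders.
def saltedJoin(skewed: DataFrame, other: DataFrame, saltBuckets: Int = 16): DataFrame = {
  // Spread the skewed side: each row gets a random salt in [0, saltBuckets).
  val saltedSkewed = skewed.withColumn("salt", (rand() * saltBuckets).cast("int"))

  // Replicate the other side once per salt value so every (key, salt) pair can match.
  val saltedOther = other.withColumn("salt", explode(array((0 until saltBuckets).map(i => lit(i)): _*)))

  saltedSkewed.join(saltedOther, Seq("key", "salt")).drop("salt")
}
```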
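
For question 8, garbage collection is influenced mainly through the JVM options passed to the executors and driver; the specific flags below (G1 plus verbose GC logging) are illustrative assumptions rather than a recommendation.

```scala
import org.apache.spark.sql.SparkSession

// Sketch only: executor JVM options can be set here because executors launch afterwards;
// driver-side GC flags must instead be passed on the spark-submit command line
// (e.g. --driver-java-options), since the driver JVM is already running.
val spark = SparkSession.builder()
  .appName("gc-tuning-sketch")
  .config("spark.executor.extraJavaOptions", "-XX:+UseG1GC -verbose:gc")
  .getOrCreate()
```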

Is there any reference I can go through to get answers for these kinds of questions?

@mahendra971 These questions are very generic, so they are difficult to answer in the forum. If you can give a use case and present your specific problem, it will be easier to answer them.