#What is split size and what will be the impact of it?
An input split is a logical division of the data stored in HDFS, used during data processing by a Map/Reduce program or by other processing engines (Spark/Tez). The split size is user-configurable and can be chosen based on the data size.
The split size controls the number of mappers in a Map/Reduce program, because one mapper is launched per input split. If no input split size is defined in the Map/Reduce program, the HDFS block size is used as the input split size by default.
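The relationship between block size, split size and mapper count can be sketched in plain Java. The formula max(minSize, min(maxSize, blockSize)) is the one Hadoop's FileInputFormat uses to pick the split size; the class and method names below are hypothetical, and the estimate ignores the small slack Hadoop allows on the last split.

```java
public class SplitMath {
    // Split size formula used by FileInputFormat: max(minSize, min(maxSize, blockSize)).
    public static long computeSplitSize(long blockSize, long minSize, long maxSize) {
        return Math.max(minSize, Math.min(maxSize, blockSize));
    }

    // Rough number of splits (and therefore mappers) for one splittable file:
    // ceiling of fileSize / splitSize. Simplified; ignores the last-split slack.
    public static long estimateSplits(long fileSize, long splitSize) {
        return (fileSize + splitSize - 1) / splitSize;
    }

    public static void main(String[] args) {
        long mb = 1024L * 1024L;
        long blockSize = 128 * mb;   // assumed HDFS block size
        long fileSize = 1024 * mb;   // a 1 GB input file

        // No split size configured: the block size becomes the split size.
        long defaultSplit = computeSplitSize(blockSize, 1L, Long.MAX_VALUE);
        System.out.println(estimateSplits(fileSize, defaultSplit)); // prints 8

        // Raising the minimum split size to 256 MB halves the mapper count.
        long largerSplit = computeSplitSize(blockSize, 256 * mb, Long.MAX_VALUE);
        System.out.println(estimateSplits(fileSize, largerSplit)); // prints 4
    }
}
```

A larger split size means fewer, longer-running mappers; a smaller one means more mappers and more scheduling overhead.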
#How to determine number of reducers?
By default, the number of reducers is 1.
In a MapReduce program, the number of reducers can be set on the Job object using
job.setNumReduceTasks(numberOfReducers)
It can also be configured when the job is submitted (provided the driver parses generic options, for example through ToolRunner) using
-D mapreduce.job.reduces=<number of reducers>
In Hive, the number of reducers can be set explicitly using the property
SET mapreduce.job.reduces=<number of reducers>;
#If there are 4 unique keys, at max how many reducers can be configured?
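Ideally, 4 reducers should be set explicitly. More can be configured, but with only 4 unique keys at most 4 reducers receive data; any additional reducers produce empty output.
Custom partitioner logic can be defined to route each key to its own reducer, since the default hash partitioner may send several keys to the same reducer.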
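A minimal sketch of the difference, in plain Java rather than as an actual Hadoop Partitioner subclass. The hash formula matches Hadoop's default HashPartitioner, but String.hashCode() is used here for illustration (Hadoop key types such as Text have their own hashCode()); the class name, the key list and customPartition are hypothetical.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

public class PartitionSketch {
    // Same formula as the default HashPartitioner: keys may collide on a reducer.
    public static int hashPartition(String key, int numReducers) {
        return (key.hashCode() & Integer.MAX_VALUE) % numReducers;
    }

    // Hypothetical custom logic: a key's position in a known key list decides
    // its reducer, so 4 known keys map onto 4 distinct reducers.
    public static int customPartition(String key, List<String> knownKeys, int numReducers) {
        int index = knownKeys.indexOf(key);
        if (index < 0) {
            throw new IllegalArgumentException("Unknown key: " + key);
        }
        return index % numReducers;
    }

    public static void main(String[] args) {
        List<String> keys = Arrays.asList("north", "south", "east", "west");
        int numReducers = 4;
        Set<Integer> hashed = new HashSet<>();
        Set<Integer> custom = new HashSet<>();
        for (String key : keys) {
            hashed.add(hashPartition(key, numReducers));
            custom.add(customPartition(key, keys, numReducers));
        }
        // The hash-based count can be lower than 4; the custom one is always 4 here.
        System.out.println("Reducers used with hash partitioning: " + hashed.size());
        System.out.println("Reducers used with custom partitioning: " + custom.size());
    }
}
```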
#What is the relevance of combiner? Give an example
A Combiner, also known as a semi-reducer, is an optional class that operates by accepting the inputs from the Map class and thereafter passing the output key-value pairs to the Reducer class. The main function of a Combiner is to summarize the map output records with the same key.
A combiner does not have a predefined interface and it must implement the Reducer interface’s reduce() method.
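A combiner operates on each map output key. Its input and output key-value types must both match the map output types (that is, the Reducer's input types), because its output is passed to the Reducer in place of the raw map output.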
A combiner can produce summary information from a large dataset because it replaces the original Map output.
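As an example, consider word count. The sketch below simulates one mapper's output and a combiner in plain Java (no Hadoop classes; the class and method names are hypothetical). It relies on the operation being a sum, which is commutative and associative; this matters because Hadoop may run the combiner zero, one or several times.

```java
import java.util.AbstractMap;
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class CombinerSketch {
    // Map step: emit (word, 1) for every word in the input line.
    public static List<Map.Entry<String, Integer>> map(String line) {
        List<Map.Entry<String, Integer>> output = new ArrayList<>();
        for (String word : line.trim().split("\\s+")) {
            if (!word.isEmpty()) {
                output.add(new AbstractMap.SimpleEntry<>(word, 1));
            }
        }
        return output;
    }

    // Combine step: sum the counts per key on the mapper side, so fewer
    // records have to be shuffled to the reducers.
    public static Map<String, Integer> combine(List<Map.Entry<String, Integer>> mapOutput) {
        Map<String, Integer> combined = new LinkedHashMap<>();
        for (Map.Entry<String, Integer> record : mapOutput) {
            combined.merge(record.getKey(), record.getValue(), Integer::sum);
        }
        return combined;
    }

    public static void main(String[] args) {
        List<Map.Entry<String, Integer>> mapOutput = map("to be or not to be");
        Map<String, Integer> combined = combine(mapOutput);
        System.out.println(mapOutput.size()); // prints 6 (records before combining)
        System.out.println(combined);         // prints {to=2, be=2, or=1, not=1}
    }
}
```

Here six map output records shrink to four before the shuffle; on real data with many repeated keys the reduction is much larger.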
#Which compression algorithms typically perform better? One with Java library or Native library?
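Codecs backed by a native library generally perform better than their pure-Java implementations, which is why Hadoop ships native implementations of several codecs. gzip is supported by Hadoop out of the box. It is based on the DEFLATE algorithm, which is a combination of LZ77 and Huffman coding.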
bzip2 is a freely available, patent-free, high-quality data compressor. It typically compresses files to within 10% to 15% of the best available techniques (the PPM family of statistical compressors), whilst being around twice as fast at compression and six times faster at decompression.
The LZO compression format is composed of many smaller (~256K) blocks of compressed data, allowing jobs to be split along block boundaries. Moreover, it was designed with speed in mind: it decompresses about twice as fast as gzip, meaning it’s fast enough to keep up with hard drive read speeds. It doesn’t compress quite as well as gzip — expect files that are on the order of 50% larger than their gzipped version. But that is still 20-50% of the size of the files without any compression at all, which means that IO-bound jobs complete the map phase about four times faster.
Snappy is a compression/decompression library. It does not aim for maximum compression, or compatibility with any other compression library; instead, it aims for very high speeds and reasonable compression. For instance, compared to the fastest mode of zlib, Snappy is an order of magnitude faster for most inputs, but the resulting compressed files are anywhere from 20% to 100% bigger. On a single core of a Core i7 processor in 64-bit mode, Snappy compresses at about 250 MB/sec or more and decompresses at about 500 MB/sec or more. Snappy is widely used inside Google, in everything from BigTable and MapReduce to Google's internal RPC systems.
#If the reports are generated primarily based on date, what will be your partitioning strategy?
The partitioning hierarchy will be Year->Month->Day; if data arrives at a high frequency, it can be further partitioned by hour.
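The Year->Month->Day hierarchy typically shows up as nested key=value directories in a Hive-style layout. A small sketch of how a record's timestamp maps to such a partition path; the column names year, month, day and hour and the class name are assumptions for illustration.

```java
import java.time.LocalDate;
import java.time.LocalDateTime;
import java.util.Locale;

public class DatePartitionPath {
    // Builds the Year->Month->Day partition path for a date.
    public static String dailyPartition(LocalDate date) {
        return String.format(Locale.ROOT, "year=%04d/month=%02d/day=%02d",
                date.getYear(), date.getMonthValue(), date.getDayOfMonth());
    }

    // Adds an hour level for high-frequency data.
    public static String hourlyPartition(LocalDateTime timestamp) {
        return dailyPartition(timestamp.toLocalDate())
                + String.format(Locale.ROOT, "/hour=%02d", timestamp.getHour());
    }

    public static void main(String[] args) {
        System.out.println(dailyPartition(LocalDate.of(2024, 1, 15)));
        // prints year=2024/month=01/day=15
        System.out.println(hourlyPartition(LocalDateTime.of(2024, 1, 15, 9, 30)));
        // prints year=2024/month=01/day=15/hour=09
    }
}
```

A report filtered on a single day then only has to read the files under one day directory.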
#If some reports are generated by date and other reports are generated by category, what will be your data modeling strategy?
Partitioning hierarchy will be and if the frequency is on the higher side data can be further partitioned as per hour basis.
If the number of unique categories is small, the data can be partitioned in the hierarchy as follows
#For a compressed file of 1 GB size - by default how many mappers will be created if the compression algorithm is non-splittable and how many mappers will be created if the compression algorithm is splittable?
For a non-splittable algorithm, a single mapper will process all 8 HDFS blocks (assuming the default block size of 128 MB); most of these blocks will not be local to the mapper, so it may take longer to run. For a splittable algorithm, 8 mappers will be created by default, one per block.
#When we should use map side join and what is the relevance of hive.mapjoin.smalltable.filesize?
A map-side join can be used when one of the tables is small enough to fit in memory; it avoids the overhead of sorting and shuffling the data.
Reduce-side joins are simpler than map-side joins, since the input datasets do not need to be structured in any particular way.
However, they are less efficient, because both datasets have to go through the MapReduce shuffle phase, where the records with the same key are brought together in the reducer.
hive.mapjoin.smalltable.filesize default value: 25000000 (about 25 MB; the value is given in bytes)
This is the threshold for the input file size of the small table in a join. If the file size is smaller than this threshold and automatic conversion is enabled (hive.auto.convert.join), the Hive execution engine (MapReduce) will try to convert the common join into a map join, which is more efficient than a reduce-side join.
The value can be increased to allow a map-side join with a larger file; however, the process will then be more resource-intensive (or time-consuming), since the small table has to be held in memory.
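The idea behind a map-side join can be sketched in plain Java: the small table is loaded into an in-memory hash map, and the large table is streamed past it, so no shuffle is needed. This is a simplified illustration, not Hive's implementation; the class and method names and the sample rows are hypothetical.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class MapJoinSketch {
    // Default of hive.mapjoin.smalltable.filesize, in bytes.
    public static final long DEFAULT_THRESHOLD_BYTES = 25_000_000L;

    // The size check described above: is the small table below the threshold?
    public static boolean qualifiesForMapJoin(long smallTableBytes, long thresholdBytes) {
        return smallTableBytes < thresholdBytes;
    }

    // Inner join: look up each large-table row's key (row[0]) in the
    // in-memory small table and emit "key,largeValue,smallValue" on a match.
    public static List<String> mapSideJoin(Map<String, String> smallTable,
                                           List<String[]> largeTable) {
        List<String> joined = new ArrayList<>();
        for (String[] row : largeTable) {
            String match = smallTable.get(row[0]);
            if (match != null) {
                joined.add(row[0] + "," + row[1] + "," + match);
            }
        }
        return joined;
    }

    public static void main(String[] args) {
        Map<String, String> categories = new HashMap<>(); // small table
        categories.put("c1", "Books");
        categories.put("c2", "Games");

        List<String[]> orders = new ArrayList<>();         // large table
        orders.add(new String[] {"c1", "order-100"});
        orders.add(new String[] {"c3", "order-101"});      // no matching category
        orders.add(new String[] {"c2", "order-102"});

        System.out.println(qualifiesForMapJoin(10_000_000L, DEFAULT_THRESHOLD_BYTES)); // prints true
        System.out.println(mapSideJoin(categories, orders));
        // prints [c1,order-100,Books, c2,order-102,Games]
    }
}
```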
#How to configure executor memory, number of executors in spark?
Executors are worker nodes’ processes in charge of running individual tasks in a given Spark job. They are launched at the beginning of a Spark application and typically run for the entire lifetime of an application.
val conf = new SparkConf().setAppName("Aggregate Example").setMaster("local").set("spark.executor.memory", "1g")
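When the SparkConf object is created in a Spark application, setMaster() sets the master URL: "local" runs Spark locally with a single worker thread and "local[4]" with four threads; it does not set the number of executors. On a cluster manager such as YARN, the number of executors can be set with set("spark.executor.instances", "2"), and the executor memory with set("spark.executor.memory", "1g"), where 1g = 1 GB.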
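Alternatively, the arguments can be passed to the spark-submit command when the application is submitted:
--num-executors <n> to set the number of executors (on YARN)
--master local[<n>] to run locally with n worker threads (local mode runs in a single JVM, so this is a thread count rather than a number of executors)
--conf spark.executor.memory=4g (or --executor-memory 4g) to set the executor memory to 4 GB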