Configure YARN + MRv2 and Understand Concepts


So far, we have seen how to set up Zookeeper and HDFS; now we will see how to set up YARN, which stands for Yet Another Resource Negotiator. YARN (MRv2) introduces new daemons that are responsible for job scheduling/monitoring and resource management.

  • Setup YARN + MR2
  • Run Simple Map Reduce Job
  • Components of YARN and MR2
  • Configuration Files and Important Properties
  • Review Web UIs and log files
  • YARN and MR2 CLI
  • YARN Application Life Cycle
  • Map Reduce Job Execution Life Cycle

Cluster Topology

We are setting up the cluster on 7+1 nodes. We start with 7 nodes and then we will add one more node later.

  • Gateway(s) and Management Service
    • bigdataserver-1
  • Masters
    • bigdataserver-2 – Zookeeper, Namenode
    • bigdataserver-3 – Zookeeper, Secondary Namenode
    • bigdataserver-4 – Zookeeper, Resource Manager, Job History Server
  • Slaves or Worker Nodes
    • bigdataserver-5 – Datanode, Node Manager
    • bigdataserver-6 – Datanode, Node Manager
    • bigdataserver-7 – Datanode, Node Manager
  • We will create a host group yarn to run commands using Ansible on all nodes where YARN is running (see the example below).
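
A minimal sketch of how such a host group could be used with Ansible ad hoc commands; the inventory file name hosts and the exact group members are assumptions based on the topology above.

    # hosts (inventory file) - a group covering the nodes where YARN roles run
    # [yarn]
    # bigdataserver-1
    # bigdataserver-4
    # bigdataserver-5
    # bigdataserver-6
    # bigdataserver-7

    # run the same check on every node in the yarn group
    ansible yarn -i hosts -m ping
    ansible yarn -i hosts -m shell -a "uptime"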

Learning Process

We will follow the same standard learning process while adding any software-based service.

  • Downloading and Installing – already taken care of as part of adding hosts to the cluster.
  • Configuration – we need to understand architecture and plan for the configuration.
    • Architecture – Master and Slaves
    • Components – Resource Manager, Map Reduce Job History Server and Node Managers
    • Configuration Files – /etc/hadoop/conf
    • With Cloudera, the location is a bit different; we will see it after setting up the service.
  • Service Logs – /var/log/hadoop-yarn and /var/log/hadoop-mapreduce
  • Service Data – YARN is a distributed resource management framework; it does not store any data permanently. HDFS is our primary file system for storing data. However, there are some properties to persist log files as well as intermediate data, but they are not as important as the data in HDFS.
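
A few commands worth keeping handy while reviewing the service; the paths are the ones mentioned above, and the listings assume YARN has already been added on the node.

    # client configuration on gateway nodes
    ls -l /etc/hadoop/conf
    # service logs on the respective master and worker nodes
    ls -l /var/log/hadoop-yarn         # Resource Manager / Node Manager logs
    ls -l /var/log/hadoop-mapreduce    # Map Reduce Job History Server logs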

Setup YARN + MR2

Here are the steps involved in setting up YARN + MR2 using Cloudera Manager. There are two different processing engines that can be configured using Cloudera Manager: Map Reduce, which is a legacy framework, and YARN + MR2.

We don’t need to worry much about the legacy framework.

  • Choose the drop-down for the cluster Cluster 1
  • Click on Add Service
  • Choose YARN
  • Configure Resource Manager – bigdataserver-4
  • Configure Node Managers – bigdataserver-5 to bigdataserver-7 (same as Datanodes)
  • Once YARN is started
    • Check whether bigdataserver-1 is configured as a Gateway
    • If not, click on Instances and choose Add Role Instances
    • Configure bigdataserver-1 and all other nodes as gateway nodes for YARN.
  • Gateways allow us to connect to the YARN cluster and issue commands to process data using frameworks like Map Reduce, Spark, etc.
  • Let us also review a few things as part of the setup process.
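
For example, once the service is started we can run a quick sanity check from the gateway node to confirm that the Node Managers have registered with the Resource Manager; this assumes the YARN gateway (client configuration) has been deployed to bigdataserver-1.

    # Node Managers registered with the Resource Manager (we expect 3 RUNNING nodes)
    yarn node -list
    # applications known to the Resource Manager (empty right after setup)
    yarn application -list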

Run Simple Map Reduce Job

Let us run a simple Map Reduce job and see what happens. We will use the Hadoop examples that come as part of the setup itself.

  • We can use hadoop jar or yarn jar to submit a Map Reduce job as a YARN application. Let us run an application called randomtextwriter, which by default generates 10 GB of data per node (see the command after this list).
  • This job will take some time to run.
  • Typically data will be processed using map tasks and reduce tasks.
    • Map Tasks read the data and perform row-level transformations.
    • Reduce Tasks read the output of Map Tasks and perform transformations such as joins, aggregations etc.
    • The shuffle process between Map Tasks and Reduce Tasks takes care of grouping and partitioning data based on keys.
    • We do not have to get into too many details at this time as administrators.
    • This particular application, randomtextwriter, is a map-only job that tries to create 10 GB of data per data node. In our case, we will see 30 GB of data.
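
A sketch of the command, assuming the examples jar is at the usual CDH parcel location; the jar path and the HDFS output directory are assumptions and may differ in your environment.

    # generate random text data; the last argument is the HDFS output directory
    yarn jar /opt/cloudera/parcels/CDH/lib/hadoop-mapreduce/hadoop-mapreduce-examples.jar \
      randomtextwriter /user/$USER/randomtextwriter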

Exercise: Run relevant Hadoop fs commands to get the size of data that is created by randomtextwriter.

  • Map Tasks and Reduce Tasks run as YARN containers on the Node Managers.
  • The life cycle of the job is managed by a per-job Application Master.
  • Typically, Map Reduce jobs read data from HDFS, process it and save it back to HDFS.
  • This example job does not take any input from HDFS; it just randomly generates text and writes it to HDFS.
  • We can keep track of running jobs as well as troubleshoot completed jobs using the Resource Manager UI.

Components of YARN and MR2

Now let us explore the different components related to YARN and Map Reduce 2, and how they are used in resource management as well as data processing.

  • YARN stands for Yet Another Resource Negotiator. It provides capabilities related to resource management, while actual data processing is done by frameworks such as Map Reduce, Spark, Tez, etc.

  • We need to configure the Resource Manager, Node Managers, Application Timeline (history) Server, Map Reduce Job History Server, etc. as part of YARN. Here the Resource Manager acts as the master, whereas Node Managers act as slaves.
  • The Application Timeline Server and the Map Reduce Job History Server are primarily used to get details about completed applications or Map Reduce jobs.
  • Actual data is processed on the Node Managers in the form of containers, while the Resource Manager manages resources at the cluster level. For Map Reduce jobs these containers are termed Map Tasks and Reduce Tasks, and in the case of Spark they are termed Spark Executors.
  • Node Managers send heartbeats to the Resource Manager, and the Resource Manager keeps track of resources at the cluster level.
  • As part of the heartbeat, Node Managers also send their resource utilization, so that the Resource Manager keeps track of usage on each Node Manager. This helps the Resource Manager use the cluster effectively when scheduling new jobs.
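
We can see the cluster-level view that the Resource Manager builds from these heartbeats through its REST API; a minimal sketch, assuming the Resource Manager runs on bigdataserver-4 with the default web port 8088.

    # overall cluster capacity, memory/vcores in use and number of active Node Managers
    curl -s http://bigdataserver-4:8088/ws/v1/cluster/metrics
    # per Node Manager details as reported through heartbeats
    curl -s http://bigdataserver-4:8088/ws/v1/cluster/nodes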

Configuration Files and Important Properties

Unlike the plain vanilla distribution and other vendor distributions, Cloudera manages configuration files a bit differently. Typically, configuration files are in /etc/hadoop/conf, but with Cloudera /etc/hadoop/conf only has templates; the actual properties files are managed under /var/run/cloudera-scm-agent/process on each node.

  • hadoop-env.sh – for memory settings of Resource Manager, Node Manager etc.
  • core-site.xml – Namenode URI as well as Compression Algorithms.
  • yarn-site.xml – Parameters related to the Resource Manager and Node Managers. Using appropriate values for YARN is very important; we will review those as part of cluster planning at a later point in time.
  • mapred-site.xml – Parameters related to Map Reduce framework.
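
To see this on the cluster, we can compare the client configuration with the per-process configuration generated by the Cloudera Manager agent; a rough sketch (the process directory names include a numeric id and differ on each node).

    # client configuration on a gateway node
    ls -l /etc/hadoop/conf
    # configuration actually used by the running roles (run on the respective node)
    sudo ls /var/run/cloudera-scm-agent/process/ | grep -i -E "RESOURCEMANAGER|NODEMANAGER|JOBHISTORY"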

Important YARN Properties
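
The properties below are the standard YARN and MR2 settings that control Node Manager capacity and container sizes; treat this as a reference sketch rather than recommended values, since the right numbers depend on the hardware of the worker nodes.

    # yarn.nodemanager.resource.memory-mb   - memory each Node Manager offers to containers
    # yarn.nodemanager.resource.cpu-vcores  - vcores each Node Manager offers to containers
    # yarn.scheduler.minimum-allocation-mb  - smallest container the Resource Manager will allocate
    # yarn.scheduler.maximum-allocation-mb  - largest container the Resource Manager will allocate
    # mapreduce.map.memory.mb / mapreduce.reduce.memory.mb - container size per map/reduce task
    # check what a running Node Manager is actually using (directory name differs per process)
    sudo grep -A 1 "yarn.nodemanager.resource" \
      /var/run/cloudera-scm-agent/process/*NODEMANAGER*/yarn-site.xml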

Run Jobs

As we have changed the properties with respect to Node Manager capacity, let us run randomtextwriter again and see how long it takes.

  • We can override individual properties at run time using -D and multiple properties using -conf with an XML file similar to yarn-site.xml or mapred-site.xml (see the example below).
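
A sketch of both styles, reusing the examples jar; the jar path, the overrides.xml file and the output directories are placeholders.

    # override a single property with -D
    yarn jar /opt/cloudera/parcels/CDH/lib/hadoop-mapreduce/hadoop-mapreduce-examples.jar \
      randomtextwriter -Dmapreduce.map.memory.mb=1024 /user/$USER/randomtextwriter_rerun
    # override several properties kept in an xml file (same format as yarn-site.xml or mapred-site.xml)
    yarn jar /opt/cloudera/parcels/CDH/lib/hadoop-mapreduce/hadoop-mapreduce-examples.jar \
      randomtextwriter -conf overrides.xml /user/$USER/randomtextwriter_rerun2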

Now let us run the word count program from the Hadoop examples and observe the change in the number of map tasks as well as reduce tasks.

  • Let us run the word count program with a different number of mappers by overriding mapreduce.input.fileinputformat.split.minsize as well as mapreduce.job.reduces (see the command after this list). We are trying to perform word count on 30 GB of data (30 files of 1 GB each).
  • Without overriding the properties, it uses a 128 MB split size (inherited from dfs.blocksize) and creates 270 map tasks to read the data and then 12 reducers to aggregate and write the data.
  • After overriding, the split size is 256 MB, the number of mappers is 150 (5 per file) and the number of reducers is 8, as hard coded.
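
A sketch of the second run, with a 256 MB minimum split size and 8 reducers set at run time; the input and output paths are assumptions.

    yarn jar /opt/cloudera/parcels/CDH/lib/hadoop-mapreduce/hadoop-mapreduce-examples.jar \
      wordcount \
      -Dmapreduce.input.fileinputformat.split.minsize=268435456 \
      -Dmapreduce.job.reduces=8 \
      /user/$USER/randomtextwriter /user/$USER/wordcount_output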

Here the idea is only to show how to override the properties, not how to determine the split size and the number of reducers. That will be covered as part of the Performance Tuning course.

Review Web UIs and log files

Now let us review the web interfaces and also log files to troubleshoot any YARN or Map Reduce specific issues.

Web UIs

  • Resource Manager Web UI
  • Job History Server Web UI
  • Using Cloudera Manager to troubleshoot issues with the Resource Manager components.
  • Using Resource Manager Web UI to troubleshoot the jobs
  • Resource Manager Web UI acts as a proxy to get details about running or completed jobs.
    • Details of running jobs are served by the per-application Application Master.
    • Details of completed jobs are served by the History Servers.
    • Typically, logs related to completed jobs are stored in HDFS, and the History Servers provide a UI for them.

Log files

There are service level logs as well as application or job level logs.

  • Service level logs can be accessed using Cloudera Manager Web UI or command prompt on each of the servers.
  • We need to log in to the respective server to access logs via the command prompt. For example, if we want to look into Resource Manager logs, we need to log in to the node where the Resource Manager is running.
  • Job level logs are primarily available through the Resource Manager Web UI via a proxy.
    • While a job is running, logs and progress are provided by a web service within the Application Master.
    • Once a job is completed, logs are copied to HDFS and served by the History Server of the underlying framework (Map Reduce in this case).
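
Aggregated container logs of a completed application can also be pulled from the command line; a minimal sketch, assuming log aggregation is enabled and using a placeholder application id.

    yarn logs -applicationId application_1234567890123_0001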

YARN and MR2 CLI

Both YARN and MR2 have command line interfaces.

  • YARN CLI can be used to submit YARN applications and manage them
  • Map Reduce CLI can be used to submit jobs using job files and also to manage them
  • There are a handful of commands which a Hadoop Administrator should be aware of.
  • Type yarn or mapred to see the list of commands, and use them to manage jobs, check job status, kill jobs and more (a few examples follow).
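
A few of the commonly used commands; the application and job ids are placeholders.

    yarn application -list                                     # running YARN applications
    yarn application -status application_1234567890123_0001    # status of one application
    yarn application -kill   application_1234567890123_0001    # kill an application
    mapred job -list                                           # running Map Reduce jobs
    mapred job -status job_1234567890123_0001                  # status, counters and progress
    mapred job -kill   job_1234567890123_0001                  # kill a job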

YARN Application Life Cycle

Now let us talk about the YARN application life cycle. YARN is a resource management framework.

  • We can use distributed data processing frameworks such as Map Reduce, Spark, etc. by plugging them into YARN.
  • A YARN application can be a Map Reduce job or a Spark application.
  • From YARN's perspective, data is processed by containers.
  • Let us understand the life cycle of YARN Application.
    • We use the client to submit a YARN application (e.g., a Map Reduce job).
    • The request goes to the Resource Manager. The Resource Manager has up-to-date information about the usage of all the servers on which registered Node Managers are running.
    • The Resource Manager decides on which node a container should run to manage the job or application, using criteria such as server usage.
    • This container is called the Application Master. It will be up and running until the application is either completed or killed.
    • The Application Master then negotiates with the Resource Manager for containers to process the data and launches them by talking to the Node Managers directly. Data locality and server usage are used as criteria when allocating the containers.
    • These containers process the data and might get garbage collected, depending upon the underlying data processing framework.
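
These stages are visible as application states reported by the Resource Manager (NEW, SUBMITTED, ACCEPTED, RUNNING and finally FINISHED, FAILED or KILLED); a quick way to observe them, with a placeholder application id.

    # state and progress of a single application
    yarn application -status application_1234567890123_0001
    # list applications in all states, including completed ones
    yarn application -list -appStates ALL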

Map Reduce Job Execution Life Cycle

Now let us talk about the Map Reduce job execution life cycle. While YARN is a resource management framework, Map Reduce is a distributed data processing framework.

  • On a Gateway node, we can submit Map Reduce jobs using the hadoop jar command.
  • A JVM will be launched on the gateway node.
  • It will talk to the Resource Manager and get a YARN application id as well as a Map Reduce job id.
  • Also, the client will copy the job resources into HDFS using a higher replication factor. Job resources are nothing but the jar file used to build the application, along with dependent jars, additional data files passed at run time, etc.
  • The Resource Manager will choose one of the servers on which a Node Manager is running and create a container. This container is called the YARN Application Master.
  • In Map Reduce, the YARN Application Master takes care of managing the life cycle of the job.
    • It talks to the Node Managers and creates containers to run Map Tasks as well as Reduce Tasks to process the data.
    • Fault tolerance of tasks – if any task fails, it will be retried on some other node and will reprocess the data (up to four attempts by default). If a task fails all of its attempts, the entire job is marked as failed.
    • Also, if a Node Manager goes down, all the running tasks on that node will be retried on other nodes.
    • Fault tolerance of the Application Master – if the Application Master fails, it will be recreated on some other node and the entire job will be rerun.
    • Speculative execution – if any task is running slow compared to others, another attempt of the same task is launched on some other node to process the same data (same split). Whichever attempt finishes first is kept and the other attempts are killed.
    • All Map Tasks and Reduce Tasks collect metrics as they process data; the Application Master consolidates that information and sends it periodically to the client.
  • You can review the source code of wordcount program here.
  • First, Map Tasks are created as YARN containers, job resources are copied to the tasks and the map logic is executed.
  • As Map Tasks are completed (80% of them), Reduce Tasks are also started as YARN containers and the reduce logic is executed.
  • This process of taking the logic to the data is called data locality. In conventional applications, code is deployed on an application server and data has to be copied to the application server for processing.
  • Data locality is a significant performance booster for heavyweight batch processing.
  • As Map Tasks and Reduce Tasks complete processing the data assigned to them, they will be garbage collected.
  • Once the job is completed, the Application Master will be garbage collected after copying the log files to HDFS.
  • We can get the logs of completed Map Reduce jobs using the Job History Server (see the example below).
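
A sketch of how to reach the Job History Server, assuming it runs on bigdataserver-4 with the default web port 19888.

    # Job History Server web UI: http://bigdataserver-4:19888
    # completed Map Reduce jobs through the history server REST API
    curl -s http://bigdataserver-4:19888/ws/v1/history/mapreduce/jobs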

By this time, you should have set up Cloudera Manager, installed the Cloudera Distribution of Hadoop, and configured services such as Zookeeper, HDFS, and YARN. Also, you should be comfortable with HDFS commands as well as submitting jobs.

Make sure to stop services in Cloudera Manager and also shut down servers provisioned from GCP or AWS to leverage credits or control costs.
