Test and Troubleshoot

Benchmark the cluster operational metrics, test system configuration for operation and efficiency. Demonstrate the ability to find the root cause of a problem, optimize inefficient execution, and resolve resource contention scenarios.

  • Execute file system commands via HTTPFS
  • Efficiently copy data within a cluster/between clusters
  • Create/Restore a snapshot of an HDFS directory
  • Get/Set ACLs for a file or directory structure
  • Benchmark the cluster (I/O, CPU, network)
  • Resolve errors/warnings in Cloudera Manager
  • Resolve performance problems/errors in cluster operation
  • Determine reason for application failure
  • Configure the Fair Scheduler to resolve application delays

Execute file system commands via HTTPFS

We need to add the HTTPFS role instance to the cluster in order to run HDFS commands over HTTP. Let us go ahead and set it up on our gateway (bigdataserver-1).

To add the HTTPFS role

  • Click HDFS -> Add Role Instance -> Select “HTTPFS”
  • Select the host to install HTTPFS gateway daemon
  • Click on Install.
  • Once services are restarted, we need to open the HTTPFS port (14000 by default) if the server is behind a firewall.

Here are some examples of using HTTPFS.

https://gist.githubusercontent.com/dgadiraju/d6bfce4f7d0668c48ff493486cafe60e/raw/f7eb47a5819e3a4dd5d10fae2bc8dc4e8bea277f/cdh-admin-httpfs-examples.sh
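
To give an idea of what such calls look like, here is a minimal sketch using curl against the HttpFS REST endpoint. The host name, user name and paths are assumptions for this cluster; HttpFS listens on port 14000 by default.

  # List a directory via HttpFS (host, user and paths are assumptions)
  curl "http://bigdataserver-1:14000/webhdfs/v1/user/cloudera?op=LISTSTATUS&user.name=cloudera"
  # Get the status of a single file
  curl "http://bigdataserver-1:14000/webhdfs/v1/user/cloudera/data.txt?op=GETFILESTATUS&user.name=cloudera"
  # Create a directory
  curl -X PUT "http://bigdataserver-1:14000/webhdfs/v1/user/cloudera/httpfs_demo?op=MKDIRS&user.name=cloudera"
  # Read the contents of a file
  curl "http://bigdataserver-1:14000/webhdfs/v1/user/cloudera/data.txt?op=OPEN&user.name=cloudera"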

For more details on accessing HDFS via the HTTP REST API, visit the link.

Efficiently copy data within a cluster/between clusters

Let us see how we copy data within a cluster or between clusters.

  • We can use hadoop fs -cp to copy and hadoop fs -mv to move data within a cluster. mv can also be used to rename files. We have seen these examples as part of Copying or Moving files within HDFS in the important HDFS commands section.
  • We can use hadoop distcp to copy data between clusters.
  • We can get the list of control arguments by running hadoop distcp without any arguments. Here are some of the important control arguments.
    • -filters – local path to a file containing a list of paths to be excluded from the copy.
    • -append – if a file with the same name already exists on the target, its existing data is reused and only the new data is appended (used together with -update).
    • -overwrite – if the target files exist, they will be overwritten.
    • -delete – delete files from the target that are missing in the source.
    • -bandwidth – limit the bandwidth used by each map task, in MB/second.
    • -p – preserve file attributes such as replication, block size, permissions and timestamps.
  • We have to use fully qualified HDFS URIs (for example, hdfs://namenode-host:8020/path) for both the source and the target while running the hadoop distcp command.

https://gist.githubusercontent.com/dgadiraju/085ee66867a891108e91cd092dc0f90b/raw/ad1acd0a69ee92b7ac6a0522265fcf06d1719595/cdh-admin-distcp.sh
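
As a reference, typical invocations might look like the sketch below. The NameNode host names, port and paths are assumptions.

  # Copy a directory from this cluster to a remote cluster (host names, port and paths are assumptions)
  hadoop distcp hdfs://bigdataserver-1:8020/user/cloudera/data hdfs://remote-nn:8020/user/cloudera/data
  # Re-run copying only changed files, throttling each map task to 10 MB/second
  hadoop distcp -update -bandwidth 10 hdfs://bigdataserver-1:8020/user/cloudera/data hdfs://remote-nn:8020/user/cloudera/data
  # Overwrite existing target files while preserving block size, replication and permissions
  hadoop distcp -overwrite -pbrp hdfs://bigdataserver-1:8020/user/cloudera/data hdfs://remote-nn:8020/user/cloudera/data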

Create/Restore a snapshot of an HDFS directory

Snapshots are primarily used to create point-in-time backups of data in HDFS.
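
At a high level, creating and restoring a snapshot might look like the sketch below. The directory path, snapshot name and file name are assumptions, and the directory must first be made snapshottable.

  # Allow snapshots on the directory (run as the hdfs superuser, or enable it via Cloudera Manager)
  hdfs dfsadmin -allowSnapshot /user/cloudera/important-data
  # Create a snapshot of the directory
  hdfs dfs -createSnapshot /user/cloudera/important-data snap-before-cleanup
  # List the snapshot contents
  hdfs dfs -ls /user/cloudera/important-data/.snapshot/snap-before-cleanup
  # Restore an accidentally deleted file by copying it back from the snapshot
  hdfs dfs -cp /user/cloudera/important-data/.snapshot/snap-before-cleanup/part-00000 /user/cloudera/important-data/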

Click here to revise details about Snapshots in HDFS.

Get/Set ACLs for a file or directory structure

ACL stands for Access Control List. ACLs are primarily used to provide finer-grained access control in HDFS. The concept is inherited from Linux.

Here we need to understand how to Get or Set ACLs for a file or directory.
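
As a quick sketch, getting and setting ACLs looks like the commands below. The directory path and user name are assumptions, and ACLs must be enabled on the NameNode (dfs.namenode.acls.enabled set to true) for the setfacl commands to work.

  # View the current ACLs on a directory
  hdfs dfs -getfacl /user/cloudera/shared
  # Grant read/execute access on the directory to an additional user (user name is an assumption)
  hdfs dfs -setfacl -m user:analyst1:r-x /user/cloudera/shared
  # Add a default ACL entry so new files and sub-directories inherit it
  hdfs dfs -setfacl -m default:user:analyst1:r-x /user/cloudera/shared
  # Remove the ACL entries for that user
  hdfs dfs -setfacl -x user:analyst1 /user/cloudera/shared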

Click here to revise details about ACLs in HDFS.

Benchmark the cluster (I/O, CPU, network)

Benchmarking is the process of stress testing the resources of the cluster to understand its performance. Let us see more details about it.

  • Hadoop installation provides a jar file called hadoop-mapreduce-examples.jar.
  • With a package-based installation, it will be under a path starting with /var/lib, and with parcels it will be under a path starting with /opt/cloudera.
  • As seen earlier, it has several applications such as randomtextwriter, wordcount, etc.
  • We also get some applications related to benchmarking, known as the TeraSort Benchmark Suite.
  • Also, we have another jar file, hadoop-mapreduce-client-jobclient-*-tests.jar, which contains TestDFSIO to benchmark HDFS.
  • We will be using the TeraSort benchmark suite, a well-known Hadoop benchmark. It consists of the following three steps:
    • Generate a file – teragen
    • Sort the data – terasort
    • Validate the results – teravalidate
  • Once terasort is run, we should go through the job counters to understand how it performed.

Click here for the article on benchmarking.

First Run

https://gist.githubusercontent.com/dgadiraju/524a6597f0df3a647616651e398b751d/raw/4613710cc3a611138a5d282fff55a5f98562bd1f/cdh-admin-benchmark-commands-01-teragen.sh

https://gist.githubusercontent.com/dgadiraju/524a6597f0df3a647616651e398b751d/raw/4613710cc3a611138a5d282fff55a5f98562bd1f/cdh-admin-benchmark-commands-02-terasort.sh

https://gist.githubusercontent.com/dgadiraju/524a6597f0df3a647616651e398b751d/raw/4613710cc3a611138a5d282fff55a5f98562bd1f/cdh-admin-benchmark-commands-03-teravalidate.sh

https://gist.githubusercontent.com/dgadiraju/524a6597f0df3a647616651e398b751d/raw/4613710cc3a611138a5d282fff55a5f98562bd1f/cdh-admin-benchmark-commands-04-testhdfsio.sh
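
Putting the pieces together, a first run on the cluster might look like the sketch below. The parcel paths, output directories and sizes are assumptions (the exact tests jar file name includes the CDH version).

  # Locations of the jars with a parcel-based installation (paths are assumptions)
  EXAMPLES_JAR=/opt/cloudera/parcels/CDH/lib/hadoop-mapreduce/hadoop-mapreduce-examples.jar
  TESTS_JAR=$(ls /opt/cloudera/parcels/CDH/lib/hadoop-mapreduce/hadoop-mapreduce-client-jobclient-*-tests.jar | head -1)
  # Generate 10 million 100-byte records (roughly 1 GB)
  hadoop jar $EXAMPLES_JAR teragen 10000000 /user/cloudera/teragen-out
  # Sort the generated data
  hadoop jar $EXAMPLES_JAR terasort /user/cloudera/teragen-out /user/cloudera/terasort-out
  # Validate that the sorted output is in order
  hadoop jar $EXAMPLES_JAR teravalidate /user/cloudera/terasort-out /user/cloudera/teravalidate-out
  # Benchmark HDFS I/O: write and then read 10 files of 128 MB each, then clean up
  hadoop jar $TESTS_JAR TestDFSIO -write -nrFiles 10 -fileSize 128
  hadoop jar $TESTS_JAR TestDFSIO -read -nrFiles 10 -fileSize 128
  hadoop jar $TESTS_JAR TestDFSIO -clean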

Resolve errors/warnings in Cloudera Manager

Let us see how we can resolve errors and warnings we see in Cloudera Manager.

  • Errors will be in red and warnings will be in orange.
  • We need to make sure there are no errors or warnings in Cloudera Manager.
  • Using Cloudera Manager, we will only be able to work on errors related to hosts and services. Application issues are not fixed through Cloudera Manager.
  • Let us see some of the common issues.
    • Service or a component of a service is down.
    • The hard disk might be full on one or more nodes in the cluster.
    • Applications are not starting or are running very slowly. As part of troubleshooting we might determine that services such as YARN or Impala are not configured properly.
  • Once corrective action is taken, we might have to restart the service. Make sure that there are no restart or re-deploy icons. Also, ensure that the underlying service is working without any issues.

Resolve performance problems/errors in cluster operation

Let us discuss some of the common performance problems or errors in cluster operation.

  • We might see performance problems/errors in almost all the services, but the most common ones are related to applications.
  • We typically run applications using one of these frameworks – Map Reduce, Spark, Impala, HBase, etc.
  • Map Reduce and Spark are typically run using YARN.
  • We need to ensure that clusters are configured with enough resources based on the capacity available on each of the nodes in the cluster; a quick way to check this is sketched below.
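
Here is a minimal sketch for checking the resources YARN sees on each node. The node ID is an assumption, and the property values are illustrative only; in CDH they are normally set through Cloudera Manager -> YARN -> Configuration.

  # List the NodeManagers known to the ResourceManager
  yarn node -list -all
  # Show resource capacity and usage for a specific NodeManager (node ID is an assumption)
  yarn node -status bigdataserver-3:8041
  # Key properties governing per-node YARN resources (values below are illustrative only):
  #   yarn.nodemanager.resource.memory-mb   e.g. 24576   (memory per NodeManager)
  #   yarn.nodemanager.resource.cpu-vcores  e.g. 8       (vcores per NodeManager)
  #   yarn.scheduler.maximum-allocation-mb  e.g. 8192    (largest single container)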

Determine the reason for application failure

Let us understand how to troubleshoot and determine the reason for application failure.

  • Typically developers deliver applications in the form of jar files along with run guides, and support staff deploy or schedule them on a gateway node in the cluster.
  • Applications that are deployed or scheduled on a gateway node in the cluster might fail for several reasons.
  • Developers can use any data processing framework as part of the application development (e.g., Map Reduce, Spark, Hive, Sqoop, Impala, etc.)
  • Based on the framework, we have to go to the job logs and should be able to troubleshoot the issues.
  • Applications might have some logic which is not related to the services we have in the cluster, typically at the beginning or the end of the application. If jobs are not submitted to the cluster, then we have to go through the application logs.
  • Information to troubleshoot those kinds of issues should be provided by developers as part of the run guides.
  • Here are the general guidelines, assuming that jobs are failing after being submitted to the cluster (a command-line alternative is sketched after this list).
    • Go to job tracking URL or job history URL
    • Make sure you are in the job UI (by clicking on History or Application Master for Map Reduce and Spark Jobs).
    • Go to the failed tasks
    • Click on failed attempts
    • Go to standard error and go through errors and exceptions.
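
When the web UIs are not convenient, the same information can be pulled from the command line. This is a minimal sketch; the application ID is an assumption, and yarn logs requires log aggregation to be enabled.

  # List recently failed applications known to the ResourceManager
  yarn application -list -appStates FAILED
  # Fetch the aggregated container logs of a failed application (application ID is an assumption)
  yarn logs -applicationId application_1600000000000_0042 | less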

Configure the Fair Scheduler to resolve application delays

Let us see how we can configure the Fair Scheduler to resolve application delays in our cluster.
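
Dynamic Resource Pools in Cloudera Manager generate the Fair Scheduler allocation file for us, but it helps to know what the underlying configuration looks like. Below is a minimal, illustrative sketch of a fair-scheduler.xml with two pools; the queue names, weights and resource values are assumptions. Giving the critical pool a higher weight and guaranteed minimum resources (optionally together with preemption, yarn.scheduler.fair.preemption=true) is what typically resolves delays for its applications.

  <?xml version="1.0"?>
  <allocations>
    <!-- Production pool gets 3x the fair share and guaranteed minimum resources (illustrative values) -->
    <queue name="production">
      <weight>3.0</weight>
      <minResources>8192 mb, 4 vcores</minResources>
    </queue>
    <!-- Ad-hoc pool with a cap on concurrently running applications -->
    <queue name="adhoc">
      <weight>1.0</weight>
      <maxRunningApps>10</maxRunningApps>
    </queue>
    <queueMaxAppsDefault>20</queueMaxAppsDefault>
  </allocations>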
